Importing pandas library

You need to import (load) the pandas library before you can use it. Importing a library means loading it into memory so that its functions become available. Run the following code to import the pandas library:


Here "pd" is an alias that serves as a shortcut for calling pandas functions. To access a function from the pandas library, you type pd.function instead of pandas.function each time you need it.

In [None]:
import pandas as pd

Importing Dataset

To read or import data from a CSV file, you can use the read_csv() function. In the function, you need to specify the location of your CSV file.

In [None]:
income = pd.read_csv("income.csv")
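
Since income.csv may not be at hand, the following is a minimal sketch of read_csv( ) on an in-memory buffer. The column names mirror the ones used in this tutorial, but the values and the name 'sample' are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; the values are invented for illustration only.
csv_text = "Index,State,Y2002\nA,Alabama,100\nA,Alaska,200\nC,California,300\n"

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```

The same call works with a path such as "income.csv" once the file is available in the working directory.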

Get Variable Names

By using the income.columns command, you can fetch the names of the variables of a data frame.

In [None]:
income.columns

income.columns[0:2] returns the first two column names, 'Index' and 'State'. In Python, indexing starts from 0.

In [None]:
income.columns[0:2]

Knowing the Variable types

You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.

In [None]:
income.dtypes 

Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

To see the variable type of one variable (let's say "State") instead of all the variables, you can use income['State'].dtypes.

Creating a frequency distribution

income.Index selects the 'Index' column of 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False i.e. it will show the 'Index' having the maximum frequency on the top.

In [None]:
income.Index.value_counts(ascending = True)


To draw the samples
income.sample( ) is used to draw random samples from the dataset containing all the columns. Here n = 5 means we need 5 rows and frac = 0.1 means we need 10 percent of the rows as our sample.

In [None]:
income.sample(n = 5)


In [None]:
income.sample(frac = 0.1)

Selecting only a few of the columns
To select only specific columns we use either the loc[ ] or iloc[ ] commands. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.


In [None]:
income.loc[:,["Index","State","Y2008"]]


In [None]:
income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns


In [None]:
#In the above command both Index and Y2008 are included.
income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included

The difference between loc and iloc is that loc requires the column(rows) names to be selected while iloc requires the column(rows) indices (position).
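
As a minimal sketch of the difference described above, the small dataframe below (hypothetical values, named 'df' for illustration) is sliced once by label and once by position. Both give the same three columns, because a label slice includes its end point while a position slice does not.

```python
import pandas as pd

# Hypothetical data with the same column naming as the income dataset.
df = pd.DataFrame({"Index": ["A", "C"],
                   "State": ["Alabama", "California"],
                   "Y2002": [100, 300],
                   "Y2003": [110, 310]})

by_label = df.loc[:, "Index":"Y2002"]   # label slice: both ends included
by_position = df.iloc[:, 0:3]           # position slice: end (3) excluded
print(by_label.equals(by_position))
```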

You can also use the following syntax to select specific variables.

In [None]:
income[["Index","State","Y2008"]]


Renaming the variables
We create a dataframe 'data' for information of people and their respective zodiac signs.

In [None]:
data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
data 

If all the columns are to be renamed then we can use data.columns and assign the list of new column names.

In [None]:
#Renaming all the variables.
data.columns = ['Names','Zodiac Signs']

In [None]:
data

If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.


In [None]:
#Renaming only some of the variables.
data.rename(columns = {"Names":"Cust_Name"}, inplace=True)

By default in pandas inplace = False which means that no changes are made in the original dataset. Thus if we wish to alter the original dataset we need to define inplace = True.
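
A minimal sketch of the default behaviour, using a hypothetical dataframe 'people': without inplace = True, rename( ) returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data for illustration.
people = pd.DataFrame({"Names": ["John", "Mary"],
                       "Zodiac Signs": ["Libra", "Capricorn"]})

# inplace defaults to False, so the result must be stored to be kept.
renamed = people.rename(columns={"Names": "Cust_Name"})
print(list(people.columns))
print(list(renamed.columns))
```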

Suppose we want to replace only a particular character in the list of the column names then we can use str.replace( ) function. For example, renaming the variables which contain "Y" as "Year"

In [None]:
income.columns = income.columns.str.replace('Y' , 'Year ')
income.columns

Setting one column in the data frame as the index
Using set_index("column name") we can make that column the index of the data frame; the column is then no longer among the regular columns.

In [None]:
income.sort_values(["Index","Year 2002"]) 


Create new variables
Using eval( ), arithmetic operations on various columns can be carried out.

In [None]:
#income["difference"] = income.Y2008-income.Y2009

#Alternatively
income["difference"] = income['Year 2008']-income['Year 2009']
income.head()
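
The eval( ) route mentioned above can be sketched as follows. The dataframe 'df' and its values are hypothetical; the column names are kept free of spaces so they can be written directly inside the expression.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Y2008": [10, 20, 30], "Y2009": [4, 5, 6]})

# eval evaluates the expression on the columns and, since inplace is
# False by default, returns a copy with the new column added.
with_diff = df.eval("difference = Y2008 - Y2009")
print(with_diff)
```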

In [None]:
income.ratio = income['Year 2008']/income['Year 2009']

The above command does not create a column: attribute-style assignment only sets an attribute on the DataFrame object. Thus to create new columns we need to use square brackets.
We can also use assign( ) function but this command does not make changes in the original data as there is no inplace parameter. Hence we need to save it in a new dataset.


In [None]:
income['ratio'] = income['Year 2008']/income['Year 2009']
income.head()
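
The assign( ) alternative mentioned above can be sketched like this, on a hypothetical dataframe 'df'. Because assign( ) has no inplace parameter, the result is saved under a new name.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Y2008": [10, 20, 30], "Y2009": [5, 4, 6]})

# assign returns a new dataframe; df itself is not modified.
with_ratio = df.assign(ratio=df["Y2008"] / df["Y2009"])
print(with_ratio)
```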

Finding Descriptive Statistics
describe( ) is used to find some statistics like mean, minimum, quartiles etc. for numeric variables.

In [None]:
income.describe() #for numeric variables


To find the total count, the most frequently occurring string and its frequency we write include = ['object']

In [None]:
income.describe(include = ['object'])  #Only for strings / objects


In [None]:
income.columns = income.columns.str.replace('Year ' , 'Y')
income.columns

Mean, median, maximum and minimum can be obtained for a particular column(s) as:


In [None]:
income.set_index("Index",inplace = True)


In [None]:
income.head()


In [None]:
#Note that the indices have changed and Index is no longer a column
income.columns


In [None]:
income.reset_index(inplace = True)


In [None]:
income.head()

reset_index( ) restores the default integer indices and turns the current index back into a column.
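
A compact sketch of the round trip, on a hypothetical two-row dataframe 'df': set_index( ) moves a column into the index and reset_index( ) brings it back.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Index": ["A", "C"], "State": ["Alabama", "California"]})

indexed = df.set_index("Index")     # 'Index' becomes the row labels
restored = indexed.reset_index()    # 'Index' is a regular column again
print(indexed.columns.tolist())
print(restored.columns.tolist())
```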


Removing the columns and rows
To drop a column we use drop( ) where the first argument is a list of columns to be removed. 

By default axis = 0 which means the operation should take place horizontally, row wise. To remove a column we need to set axis = 1

_get_numeric_data( ) can also be used to select the numeric columns only. Its leading underscore marks it as a private method, so select_dtypes( ) is the safer choice.


In [None]:
data3 = iris._get_numeric_data()
data3.head(3)

For selecting categorical variables
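
One possible way to do this, sketched on a hypothetical dataframe 'flowers', is to exclude the numeric columns with select_dtypes( ) so that only the text (categorical) columns remain.

```python
import pandas as pd

# Hypothetical data shaped like two columns of the iris dataset.
flowers = pd.DataFrame({"Sepal.Length": [5.1, 7.0],
                        "Species": ["setosa", "versicolor"]})

# Dropping every numeric dtype leaves the non-numeric columns.
categorical = flowers.select_dtypes(exclude=["number"])
print(categorical.columns.tolist())
```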



Concatenating
We create 2 dataframes containing the details of the students:

In [None]:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                          'Marks' : [50,81,98,25,35]})

Using pd.concat( ) function we can join the 2 dataframes:


In [None]:
data = pd.concat([students,students2])  #by default axis = 0
data

By default axis = 0, thus the new dataframe is added row-wise. If a column is not present in one of the dataframes, NaNs are created for it. To join column-wise we set axis = 1

In [None]:
data = pd.concat([students,students2],axis = 1)
data

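The append( ) function used to join dataframes row-wise, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is used instead:


In [None]:
pd.concat([students,students2])  #for rows; replaces students.append(students2)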


Alternatively we can create a dictionary of the two data frames and can use pd.concat to join the dataframes row wise

In [None]:
classes = {'x': students, 'y': students2}
result = pd.concat(classes)
result 

Merging or joining on the basis of common variable.
We take 2 dataframes with different number of observations:

In [None]:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                          'Marks' : [50,81,98,25,35]})

Using pd.merge we can join the two dataframes. on = 'Names' denotes the common variable on the basis of which the dataframes are to be combined is 'Names'


In [None]:
result = pd.merge(students, students2, on='Names')  #it only takes intersections
result

By default how = "inner" thus it takes only the common elements in both the dataframes. If you want all the elements in both the dataframes set how = "outer"


In [None]:
result = pd.merge(students, students2, on='Names',how = "outer")  #it takes the union
result

To take only intersections and all the values in left df set how = 'left'


Calculating the percentiles.
Various quantiles can be obtained by using quantile( )

In [None]:
iris.quantile(0.5, numeric_only = True)  #numeric_only skips the text column Species


In [None]:
iris.quantile([0.1,0.2,0.5], numeric_only = True)


In [None]:
iris.quantile(0.55, numeric_only = True)

if else in Python
We create a new dataframe of students' name and their respective zodiac signs.


In [None]:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})

In [None]:
def name(row):
    if row["Names"] in ["John","Henry"]:
        return "yes"
    else:
        return "no"

students['flag'] = students.apply(name, axis=1)
students

Functions in Python are defined using the keyword def, followed by the function's name. The apply( ) function applies a function along the rows or columns of a dataframe; with axis = 1 the function receives one row at a time.

Note: When using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if else.


Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the 2nd argument is the value if the condition is True and the last argument is the value if the condition is False.

In [None]:
import numpy as np
students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
students

In [None]:
def mname(row):
    if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius" :
        return "yellow"
    elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra" :
        return "blue"
    elif row["Zodiac Signs"] == "Pisces" :
        return "blue"
    else:
        return "black"

students['color'] = students.apply(mname, axis=1)
students

We create a list of conditions and their respective values if evaluated True and use np.select, where default is the value used if all the conditions are False

crops.cost.isnull() firstly subsets the 'cost' from the dataframe and returns a logical vector with isnull()


In [None]:
crops[crops.cost.isnull()] #shows the rows with NAs.


In [None]:
crops[crops.cost.isnull()].Crop #shows the Crop values for rows where cost is missing


In [None]:
crops[crops.cost.notnull()].Crop #shows the Crop values for rows where cost is not missing

To drop all the rows which have a missing value in any column we use dropna(how = "any"). By default inplace = False. how = "all" means drop a row only if all the elements in that row are missing

In [None]:
crops.dropna(how = "any")


In [None]:
crops.dropna(how = "all") 

To remove rows in which any of 'Yield' or 'cost' is missing we use the subset parameter and pass a list:

In [None]:
crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
crops.dropna(subset = ['Yield',"cost"],how = 'all').shape

Missing values can be replaced with fillna( ), for example by "UNKNOWN" in a text column or, as in the command below, by the mean of the 'cost' column.


In [None]:
crops['cost'] = crops['cost'].fillna(crops['cost'].mean())  #assign back instead of inplace on a selected column
crops
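
As a minimal sketch of both replacement styles, the hypothetical dataframe 'crops_demo' below has one missing text value and one missing number. The text gap is filled with "UNKNOWN" and the numeric gap with the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
crops_demo = pd.DataFrame({"Crop": ["Rice", np.nan, "Barley"],
                           "cost": [102, np.nan, 20]})

# Fill the text column with a placeholder label.
crops_demo["Crop"] = crops_demo["Crop"].fillna("UNKNOWN")
# Fill the numeric column with its mean; mean() skips NaN by default.
crops_demo["cost"] = crops_demo["cost"].fillna(crops_demo["cost"].mean())
print(crops_demo)
```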

Dealing with duplicates


In [None]:
data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
data

duplicated( ) returns a logical vector that is True wherever a row repeats an earlier one.


In [None]:
data.loc[data.duplicated(),:]


In [None]:
data.loc[data.duplicated(keep = "first"),:]

By default keep = 'first' i.e. the first occurrence is considered a unique value and its repetitions are considered as duplicates.
If keep = "last" the last occurrence is considered a unique value and all its repetitions are considered as duplicates.

In [None]:
income.drop('Index',axis = 1)


In [None]:

#Alternatively
income.drop("Index",axis = "columns")


In [None]:
income.drop(['Index','State'],axis = 1)


In [None]:
income.drop(0,axis = 0)


In [None]:
income.drop(0,axis = "index")


In [None]:
income.drop([0,1,2,3],axis = 0)

Also inplace = False by default thus no alterations are made in the original dataset.  axis = "columns"  and axis = "index" means the column and row(index) should be removed respectively.

Sorting the data
To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.


In [None]:
income.sort_values("State",ascending = False)
income.sort_values("State",ascending = False,inplace = True)
income['Y2006'].sort_values()  #columns carry the Y prefix again after the earlier rename

The Index column contains duplicates, so we sort the dataframe first by Index and then, within each Index value, by Y2002
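
The two-level sort described above can be sketched on a hypothetical three-row dataframe 'df': rows are ordered by Index first and ties are broken by Y2002.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Index": ["B", "A", "A"], "Y2002": [3, 2, 1]})

# Passing a list sorts by the first key, then by the second within ties.
ordered = df.sort_values(["Index", "Y2002"])
print(ordered)
```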

In [None]:
income.Y2008.mean()
income.Y2008.median()
income.Y2008.min()
income.loc[:,["Y2002","Y2008"]].max()


Groupby function
To group the data by a categorical variable we use groupby( ) function and hence we can do the operations on each category.

In [None]:
income.groupby("Index").Y2008.min()


In [None]:
income.groupby("Index")[["Y2008","Y2010"]].max()  #several columns are selected with a list

agg( ) function is used to apply several aggregate functions at once to a given variable.


In [None]:
income.groupby("Index").Y2002.agg(["count","min","max","mean"])


In [None]:
income.groupby("Index")[["Y2002","Y2003"]].agg(["count","min","max","mean"])

The following command finds minimum and maximum values for Y2002 and only mean for Y2003

In [None]:
income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})


Filtering
To filter only those rows which have Index as "A" we write:


In [None]:
income[income.Index == "A"]

#Alternatively
income.loc[income.Index == "A",:]

To select the States having Index as "A":


In [None]:
income.loc[income.Index == "A",:].State

To filter the rows with Index as "A" and income for 2002 > 1500000:


In [None]:
income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]


To filter the rows with index either "A" or "W", we can use isin( ) function:


In [None]:
income.loc[(income.Index == "A") | (income.Index == "W"),:]

#Alternatively.
income.loc[income.Index.isin(["A","W"]),:]

Alternatively we can use query( ) function and write our filtering criteria:


In [None]:
income.query('Y2002>1700000 & Y2003 > 1500000')


Dealing with missing values
We create a new dataframe named 'crops' and to create a NaN value we use np.nan by importing numpy.

In [None]:
import numpy as np
mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
        'Yield': [1010, 1025.2, 1404.2, 1251.7],
        'cost' : [102, np.nan, 20, 68]}
crops = pd.DataFrame(mydata)
crops

isnull( ) returns True and notnull( ) returns False if the value is NaN.


In [None]:
#crops.isnull()  #same as is.na in R
#crops.notnull()  #opposite of previous command.
crops.isnull().sum()  #No. of missing values.

In [None]:
data.loc[data.duplicated(keep = "last"),:] #last entries are not there,indices have changed.

If keep = False then all the occurrences of the repeated observations are considered duplicates

In [None]:
data.loc[data.duplicated(keep = False),:]  #all the repeated rows are shown, including their first occurrences.

To drop the duplicates drop_duplicates( ) is used, with inplace = False by default; keep = 'first', 'last' or False has the same meaning as in duplicated( )

In [None]:
data.drop_duplicates(keep = "first")


In [None]:
data.drop_duplicates(keep = "last")


In [None]:
data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False


Creating dummies
Now we will consider the iris dataset. 

In [None]:
iris = pd.read_csv("https://sites.google.com/site/pocketecoworld/iris.csv")
iris.head()

map( ) function is used to match the values and replace them in the new series automatically created.

In [None]:
pd.get_dummies(iris.Species,prefix = "Species")


In [None]:
pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included


In [None]:
species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]

with concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.

In [None]:
iris = pd.concat([iris,species_dummies],axis = 1)
iris.head()

It is usual that for a variable with 'n' categories we create 'n-1' dummies, thus to drop the first 'dummy' column we write drop_first = True


In [None]:
pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()


Ranking
 To create a dataframe of all the ranks we use rank( )
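
A minimal sketch of rank( ) on a whole dataframe, using a hypothetical numeric dataframe 'scores': every column is ranked independently, in ascending order by default.

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame({"a": [30, 10, 20], "b": [1.5, 3.5, 2.5]})

# Each value is replaced by its rank within its own column.
ranks = scores.rank()
print(ranks)
```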
 
 Ranking by a specific variable
Suppose we want to rank the Sepal.Length for different species in ascending order:

In [None]:

iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
iris.head( )

#Alternatively
iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=True)
iris.head()

Calculating the Cumulative sum
Using cumsum( ) function we can obtain the cumulative sum.

In [None]:
iris['cum_sum'] = iris["Sepal.Length"].cumsum()
iris.head()

Cumulative sum by a variable
To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )

In [None]:
iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
iris.head()

In [None]:
conditions = [
    (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
    (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
    (students['Zodiac Signs'] == 'Pisces')]
choices = ['yellow', 'blue', 'purple']
students['color'] = np.select(conditions, choices, default='black')
students

Select numeric or categorical columns only
To include numeric columns we use select_dtypes( ) 


In [None]:
income.loc[income.Index == "A","State"]


In [None]:
print(income.shape[0])
print(income.shape[1])

To view only some of the rows

By default head( ) shows first 5 rows. If we want to see a specific number of rows we can mention it in the parenthesis. Similarly tail( ) function shows last 5 rows by default.

In [None]:
income.head(2) #shows first 2 rows.

In [None]:
income.tail() 


In [None]:
income.tail(2) #shows last 2 rows

Alternatively, any of the following commands can be used to fetch first five rows.


In [None]:
income[0:5] 


In [None]:
income.iloc[0:5]  #rows at positions 0 to 4, i.e. the first five rows

Extract Unique Values

The unique() function shows the unique levels or categories in the dataset.


In [None]:
income.Index.unique()

The nunique( ) shows the number of unique values.


In [None]:
income.Index.nunique()

It returns 19 as the Index column contains 19 distinct values.

Generate Cross Tab

pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.

In [None]:
pd.crosstab(income.Index,income.State)


In [None]:
iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
iris.head()

To create dummies get_dummies( ) is used. prefix = "Species" adds the prefix 'Species' to the names of the new columns created.

In [None]:
income['State'].dtypes

It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.


Changing the data types

Y2008 is an integer. Suppose we want to convert it to float (numeric variable with decimals) we can write:

In [None]:
income.Y2008 = income.Y2008.astype(float)
income.dtypes

To view the dimensions or shape of the data


In [None]:
income.shape


51 is the number of rows and 16 is the number of columns.

You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 

In [None]:
data1 = iris.select_dtypes(include=[np.number])
data1.head()

In [None]:
result = pd.merge(students, students2, on='Names',how = "left")
result

Similarly how = 'right' takes only intersections and all the values in right df.
