# Introduction to Pandas

We will learn how to use pandas for data analysis. Pandas is an extremely powerful Python library for manipulating data, with a lot of features. To study pandas, we will go through:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

# Series

The first main data type we will learn about in pandas is the Series. To use a Series, we must first import pandas.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let us look at some examples:

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, a NumPy array, or a dictionary to a Series:

In [8]:
labels = ['A','I','L','Y']
my_list = [10,20,30,40]

In [3]:
pd.Series?

In [4]:
## to convert the list to a pandas series:
pd.Series(data=my_list, name = 'numbers')

0    10
1    20
2    30
3    40
Name: numbers, dtype: int64

In [9]:
## you can specify index of a pandas series
pd.Series(data=my_list ,index=labels)
## note that the same thing can be done with: pd.Series(my_list,labels)

A    10
I    20
L    30
Y    40
dtype: int64

In [11]:
array = np.array([10,20,30,40])
## to convert the array to a pandas series:
pd.Series(array)

0    10
1    20
2    30
3    40
dtype: int32

In [12]:
## we can specify the array index:
pd.Series(array,labels)

A    10
I    20
L    30
Y    40
dtype: int32

In [14]:
## we can convert a dictionary to a pandas series:
my_dict = {'A':10,'I':20,'L':30, 'Y':40}
pd.Series(my_dict)

A    10
I    20
L    30
Y    40
dtype: int64

In [15]:
## pandas series can hold different data types:
pd.Series(data=labels)

0    A
1    I
2    L
3    Y
dtype: object
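
A Series can also mix element types. The following is a minimal sketch with made-up values (the name `mixed` and its contents are assumptions for illustration, not part of the notebook above); it shows that such a Series falls back to the generic `object` dtype and hands back the original Python objects:

```python
import pandas as pd

# Hypothetical values, chosen only to mix several Python types in one Series.
mixed = pd.Series([10, "ten", 10.5, True], index=["A", "I", "L", "Y"])

print(mixed.dtype)         # object
print(type(mixed["A"]))    # <class 'int'>
print(type(mixed["I"]))    # <class 'str'>
```

Operations on an `object` Series are generally slower than on a numeric one, so mixed types are best kept for cases that really need them.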

## Indexing

Understanding indexing is key to using a pandas Series. Pandas uses the index labels (names or numbers) for fast lookups of information, much like the keys of a Python dictionary.

Let's see some examples of how to grab information from a Series. Let us create two series, series1 and series2:

In [16]:
series1 = pd.Series([1,2,3,4],index = ['ADE', 'OLA','EBI', 'UTI'])
series1

ADE    1
OLA    2
EBI    3
UTI    4
dtype: int64

In [17]:
series2 = pd.Series([1,2,3,4],index=['ADE','OLA','UGO','UTI'])
series2

ADE    1
OLA    2
UGO    3
UTI    4
dtype: int64

In [18]:
series1['OLA']

2

In [19]:
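series1.iloc[0]  # positional lookup; plain series1[0] is deprecated for label-indexed Series in recent pandas versions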

1
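
The difference between label-based and position-based lookups can be made explicit with `.loc` and `.iloc`. This is a small sketch that re-creates `series1` so it runs on its own; the variable names on the left are only for illustration:

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], index=["ADE", "OLA", "EBI", "UTI"])

by_label = series1.loc["OLA"]        # label-based lookup
by_position = series1.iloc[0]        # position-based lookup
has_label = "EBI" in series1         # membership tests the index labels
has_value = 3 in series1             # False: 3 is a value, not a label
fallback = series1.get("XYZ", -1)    # like dict.get: default when the label is missing

print(by_label, by_position, has_label, has_value, fallback)
```

Note that `in` checks the index, not the values, which is the same behaviour as a Python dictionary.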

In [20]:
series1 + series2

ADE    2.0
EBI    NaN
OLA    4.0
UGO    NaN
UTI    8.0
dtype: float64
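
The NaN values appear because pandas aligns the two Series on their labels, and 'EBI' and 'UGO' each exist in only one of them. If a missing label should count as zero instead, one option is the `add` method with `fill_value`. A minimal sketch, re-creating both Series so it runs on its own:

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], index=["ADE", "OLA", "EBI", "UTI"])
series2 = pd.Series([1, 2, 3, 4], index=["ADE", "OLA", "UGO", "UTI"])

# Labels present in only one Series are treated as 0 before adding.
total = series1.add(series2, fill_value=0)
print(total)
```

The result is still float64, because the alignment step introduces missing values before they are filled.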

# DataFrames

DataFrames are the main data structure of pandas and present data in a tabular format very similar to an Excel spreadsheet. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let us consider some examples:

In [21]:
from numpy.random import randn
np.random.seed(101)

In [22]:
randn?

In [23]:
randn(5,1)

array([[2.70684984],
       [0.62813271],
       [0.90796945],
       [0.50382575],
       [0.65111795]])

In [24]:
pd.DataFrame?

In [25]:
pd.DataFrame(data = randn(5,4))

          0         1         2         3
0 -0.319318 -0.848077  0.605965 -2.018168
1  0.740122  0.528813 -0.589001  0.188695
2 -0.758872 -0.933237  0.955057  0.190794
3  1.978757  2.605967  0.683509  0.302665
4  1.693723 -1.706086 -1.159119 -0.134841


In [29]:
'A B C D E'.split()
#'A-B-C-D-E'.split('-')

['A', 'B', 'C', 'D', 'E']

In [30]:
# randn returns random numbers drawn from the standard normal distribution, in the shape specified
df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns='W X Y Z'.split())

In [31]:
df

          W         X         Y         Z
A  0.390528  0.166905  0.184502  0.807706
B  0.072960  0.638787  0.329646 -0.497104
C -0.754070 -0.943406  0.484752 -0.116773
D  1.901755  0.238127  1.996652 -0.993263
E  0.196800 -1.136645  0.000366  1.025984
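
The idea that a DataFrame is a set of Series sharing one index can be sketched directly by building one from a dictionary of Series. The names `heights`, `weights` and `people`, and their values, are hypothetical and only serve the illustration:

```python
import pandas as pd

# Two Series with partly different labels (made-up data).
heights = pd.Series([1.70, 1.82, 1.65], index=["ADE", "OLA", "EBI"])
weights = pd.Series([68, 80], index=["ADE", "OLA"])

# Each Series becomes a column; rows are aligned on the union of the labels.
people = pd.DataFrame({"height": heights, "weight": weights})
print(people)
```

'EBI' has no entry in `weights`, so that cell is filled with NaN, the same alignment behaviour we saw when adding two Series.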


## Selection and Indexing

Let us consider the various methods to grab data from a DataFrame:

In [32]:
df['W']

A    0.390528
B    0.072960
C   -0.754070
D    1.901755
E    0.196800
Name: W, dtype: float64

In [33]:
df[['W']]

          W
A  0.390528
B  0.072960
C -0.754070
D  1.901755
E  0.196800


In [35]:
## to return a dataframe, you have to pass in a list of column names:

df[['W','Z']]

          W         Z
A  0.390528  0.807706
B  0.072960 -0.497104
C -0.754070 -0.116773
D  1.901755 -0.993263
E  0.196800  1.025984


In [37]:
## Dataframe columns are themselves Pandas series, you can confirm this by doing:

type(df['Y'])

pandas.core.series.Series
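
The difference between single and double brackets can be checked in the same way. This is a small sketch on a stand-in DataFrame filled with `np.arange` values (an assumption, used so the example runs on its own):

```python
import numpy as np
import pandas as pd

# Stand-in data with the same row and column labels as df above.
df = pd.DataFrame(np.arange(20).reshape(5, 4), index=list("ABCDE"), columns=list("WXYZ"))

col_series = df["W"]     # single brackets: one column as a Series
col_frame = df[["W"]]    # list of names: a DataFrame, even with one column

print(type(col_series).__name__, col_series.shape)   # Series (5,)
print(type(col_frame).__name__, col_frame.shape)     # DataFrame (5, 1)
```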

In [38]:
df

          W         X         Y         Z
A  0.390528  0.166905  0.184502  0.807706
B  0.072960  0.638787  0.329646 -0.497104
C -0.754070 -0.943406  0.484752 -0.116773
D  1.901755  0.238127  1.996652 -0.993263
E  0.196800 -1.136645  0.000366  1.025984


In [39]:
# you can create new columns in your dataframe:

df['NEW'] = df['X'] + df['Y']
df

          W         X         Y         Z       NEW
A  0.390528  0.166905  0.184502  0.807706  0.351406
B  0.072960  0.638787  0.329646 -0.497104  0.968433
C -0.754070 -0.943406  0.484752 -0.116773 -0.458655
D  1.901755  0.238127  1.996652 -0.993263  2.234779
E  0.196800 -1.136645  0.000366  1.025984 -1.136278


In [40]:
df.drop?

In [41]:
# you can also remove existing columns from the dataframe:
df.drop('NEW', axis = 1)

          W         X         Y         Z
A  0.390528  0.166905  0.184502  0.807706
B  0.072960  0.638787  0.329646 -0.497104
C -0.754070 -0.943406  0.484752 -0.116773
D  1.901755  0.238127  1.996652 -0.993263
E  0.196800 -1.136645  0.000366  1.025984


In [42]:
## drop returns a new DataFrame; df itself is unchanged unless you pass inplace=True or reassign the result
df

          W         X         Y         Z       NEW
A  0.390528  0.166905  0.184502  0.807706  0.351406
B  0.072960  0.638787  0.329646 -0.497104  0.968433
C -0.754070 -0.943406  0.484752 -0.116773 -0.458655
D  1.901755  0.238127  1.996652 -0.993263  2.234779
E  0.196800 -1.136645  0.000366  1.025984 -1.136278


In [43]:
df2 =df.drop('NEW',axis=1)
df2

          W         X         Y         Z
A  0.390528  0.166905  0.184502  0.807706
B  0.072960  0.638787  0.329646 -0.497104
C -0.754070 -0.943406  0.484752 -0.116773
D  1.901755  0.238127  1.996652 -0.993263
E  0.196800 -1.136645  0.000366  1.025984


In [45]:
df.drop('NEW',axis=1,inplace=True)

In [46]:
df

          W         X         Y         Z
A  0.390528  0.166905  0.184502  0.807706
B  0.072960  0.638787  0.329646 -0.497104
C -0.754070 -0.943406  0.484752 -0.116773
D  1.901755  0.238127  1.996652 -0.993263
E  0.196800 -1.136645  0.000366  1.025984
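
An alternative to `inplace=True` is to reassign the result of `drop`. The following is a minimal sketch on a small made-up DataFrame (the data and the helper names `trimmed` and `still_has_new` are assumptions); it also uses the `columns=` keyword, which avoids having to remember `axis=1`:

```python
import pandas as pd

# Small made-up frame with a derived column.
df = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})
df["NEW"] = df["X"] + df["Y"]

trimmed = df.drop(columns="NEW")        # returns a new DataFrame
still_has_new = "NEW" in df.columns     # True: df itself was not modified

df = df.drop(columns="NEW")             # reassigning makes the change stick
print(list(df.columns))                 # ['X', 'Y']
```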


In [47]:
df['X'] = df['X'] **2
df

          W         X         Y         Z
A  0.390528  0.027857  0.184502  0.807706
B  0.072960  0.408049  0.329646 -0.497104
C -0.754070  0.890016  0.484752 -0.116773
D  1.901755  0.056704  1.996652 -0.993263
E  0.196800  1.291961  0.000366  1.025984


In [48]:
new_values = np.random.randint(10, size = 5)
new_values 

array([4, 5, 9, 5, 8])

In [49]:
df['W'] = new_values
df

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
B  5  0.408049  0.329646 -0.497104
C  9  0.890016  0.484752 -0.116773
D  5  0.056704  1.996652 -0.993263
E  8  1.291961  0.000366  1.025984


In [50]:
## you can also select rows
df.loc['D']

W    5.000000
X    0.056704
Y    1.996652
Z   -0.993263
Name: D, dtype: float64

In [51]:
## you can select rows based on index number
df.iloc[3]

W    5.000000
X    0.056704
Y    1.996652
Z   -0.993263
Name: D, dtype: float64

In [52]:
df.iloc[-2]

W    5.000000
X    0.056704
Y    1.996652
Z   -0.993263
Name: D, dtype: float64

In [56]:
## you can select a subset of rows and columns; the result holds the entries where they intersect
## df.loc['E','Z'] does this in one step and is preferred over the chained form df.loc['E']['Z']
df.loc['E','Z']

1.025984152081572

In [57]:
df.loc[['A','E'],['W','Z']]

   W         Z
A  4  0.807706
E  8  1.025984


In [58]:
df.loc['B':'D','X':'Z']

          X         Y         Z
B  0.408049  0.329646 -0.497104
C  0.890016  0.484752 -0.116773
D  0.056704  1.996652 -0.993263


In [59]:
df.iloc[[1,3], [0,2]]

   W         Y
B  5  0.329646
D  5  1.996652


In [60]:
df.iloc[0:3, :]

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
B  5  0.408049  0.329646 -0.497104
C  9  0.890016  0.484752 -0.116773
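
One detail worth noting in the slices above is that `.loc` includes the end label while `.iloc` excludes the end position, like ordinary Python slicing. A small sketch on stand-in data (the `np.arange` values are an assumption, used so the result is easy to check):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), index=list("ABCDE"), columns=list("WXYZ"))

by_label = df.loc["B":"D", "X":"Z"]    # end labels included: rows B, C, D and columns X, Y, Z
by_position = df.iloc[1:3, 1:3]        # end positions excluded: rows B, C and columns X, Y

print(by_label.shape)       # (3, 3)
print(by_position.shape)    # (2, 2)
```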


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [61]:
df

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
B  5  0.408049  0.329646 -0.497104
C  9  0.890016  0.484752 -0.116773
D  5  0.056704  1.996652 -0.993263
E  8  1.291961  0.000366  1.025984


In [62]:
df>0

      W     X     Y      Z
A  True  True  True   True
B  True  True  True  False
C  True  True  True  False
D  True  True  True  False
E  True  True  True   True


In [63]:
# indexing df with the boolean DataFrame returns the actual values where the condition is True and NaN elsewhere
df[df>0]

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
B  5  0.408049  0.329646       NaN
C  9  0.890016  0.484752       NaN
D  5  0.056704  1.996652       NaN
E  8  1.291961  0.000366  1.025984


In [64]:
df[df['Z']>0]

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
E  8  1.291961  0.000366  1.025984


In [68]:
# here we select column X, keeping only the rows where column W meets the condition
df[df['W']>0][['X']]

          X
A  0.027857
B  0.408049
C  0.890016
D  0.056704
E  1.291961


In [69]:
d = df[df['W']>0]
d['X']

A    0.027857
B    0.408049
C    0.890016
D    0.056704
E    1.291961
Name: X, dtype: float64

In [70]:
#This is similar to above. Here we are selecting columns Y and X
df[df['W']>0][['Y','X']]

          Y         X
A  0.184502  0.027857
B  0.329646  0.408049
C  0.484752  0.890016
D  1.996652  0.056704
E  0.000366  1.291961
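
The chained form above first filters the rows and then picks the columns. The same selection can be written in one step by passing the boolean mask and the column list to `.loc`. A minimal sketch, re-creating the relevant columns of `df` from the values shown above and using a stricter condition (`W > 4`, an assumption) so that the filter actually removes a row:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "W": [4, 5, 9, 5, 8],
        "X": [0.027857, 0.408049, 0.890016, 0.056704, 1.291961],
        "Y": [0.184502, 0.329646, 0.484752, 1.996652, 0.000366],
    },
    index=list("ABCDE"),
)

chained = df[df["W"] > 4][["Y", "X"]]        # two steps: filter rows, then pick columns
one_step = df.loc[df["W"] > 4, ["Y", "X"]]   # one step: row mask and column list together

print(one_step)
print(chained.equals(one_step))              # True
```

The one-step form is also the safer choice when assigning to the selection, since chained indexing may operate on a temporary copy.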


In [72]:
## to combine two conditions, use & (and) or | (or), and wrap each condition in parentheses
df[(df['W']>0) & (df['Y'] > 1)]

   W         X         Y         Z
D  5  0.056704  1.996652 -0.993263


In [73]:
df[(df['W']>0) | (df['Y'] > 1)]

   W         X         Y         Z
A  4  0.027857  0.184502  0.807706
B  5  0.408049  0.329646 -0.497104
C  9  0.890016  0.484752 -0.116773
D  5  0.056704  1.996652 -0.993263
E  8  1.291961  0.000366  1.025984
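
The operators `&` and `|` work element by element, which is why Python's keywords `and` and `or` cannot be used here. The sketch below re-creates two columns of `df` from the values shown above; the threshold `W > 4` and the variable names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [4, 5, 9, 5, 8], "Y": [0.184502, 0.329646, 0.484752, 1.996652, 0.000366]},
    index=list("ABCDE"),
)

both = df[(df["W"] > 4) & (df["Y"] > 1)]   # rows where both conditions hold
not_big = df[~(df["W"] > 4)]               # ~ negates a boolean mask

# The keyword `and` asks for a single True/False for a whole Series, which is ambiguous.
try:
    df[(df["W"] > 4) and (df["Y"] > 1)]
    keyword_and_failed = False
except ValueError:
    keyword_and_failed = True

print(list(both.index), list(not_big.index), keyword_and_failed)
```

The parentheses matter as well: `&` and `|` bind more tightly than `>`, so leaving them out changes how the expression is evaluated.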


In [74]:
## you can also reset index to the default of 0,1,2...n
df3 = df.reset_index()
df3

   index  W         X         Y         Z
0      A  4  0.027857  0.184502  0.807706
1      B  5  0.408049  0.329646 -0.497104
2      C  9  0.890016  0.484752 -0.116773
3      D  5  0.056704  1.996652 -0.993263
4      E  8  1.291961  0.000366  1.025984


In [75]:
df3 = df.reset_index(drop = True)
df3

   W         X         Y         Z
0  4  0.027857  0.184502  0.807706
1  5  0.408049  0.329646 -0.497104
2  9  0.890016  0.484752 -0.116773
3  5  0.056704  1.996652 -0.993263
4  8  1.291961  0.000366  1.025984


In [76]:
countries = 'USA GER RUS UK FRA'.split()
df['COUNTRY']= countries
df

   W         X         Y         Z  COUNTRY
A  4  0.027857  0.184502  0.807706      USA
B  5  0.408049  0.329646 -0.497104      GER
C  9  0.890016  0.484752 -0.116773      RUS
D  5  0.056704  1.996652 -0.993263       UK
E  8  1.291961  0.000366  1.025984      FRA


In [77]:
df4 = df.set_index('COUNTRY')
# df.set_index('COUNTRY', inplace=True) would modify df itself instead of returning a new DataFrame

In [78]:
df4

         W         X         Y         Z
COUNTRY
USA      4  0.027857  0.184502  0.807706
GER      5  0.408049  0.329646 -0.497104
RUS      9  0.890016  0.484752 -0.116773
UK       5  0.056704  1.996652 -0.993263
FRA      8  1.291961  0.000366  1.025984


In [80]:
df4.loc['GER',['X', 'Y']]

X    0.408049
Y    0.329646
Name: GER, dtype: float64
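
`set_index` and `reset_index` are roughly inverse operations, and both return a new DataFrame by default. The following is a minimal sketch on a cut-down, made-up version of `df` (three rows, one numeric column); the names `by_country` and `restored` are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"W": [4, 5, 9], "COUNTRY": ["USA", "GER", "RUS"]}, index=list("ABC"))

by_country = df.set_index("COUNTRY")    # COUNTRY becomes the index of the new frame
restored = by_country.reset_index()     # COUNTRY moves back to an ordinary column

print("COUNTRY" in df.columns)          # True: df itself is unchanged
print(by_country.loc["GER", "W"])       # 5
print(list(restored.columns))           # ['COUNTRY', 'W']
```

Note that the original labels 'A', 'B', 'C' are not recovered by `reset_index`: `set_index` replaced them, and the restored frame gets the default 0, 1, 2 index.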

In [None]:
## set_index returns a new DataFrame; pass inplace=True (or reassign the result) to change df itself

In [None]:
df.set_index('COUNTRY',inplace=True)

In [None]:
df

## Multi-Level Index and Index Hierarchy

Pandas also supports multi-level indexing, also known as an index hierarchy:

In [81]:
# Index Levels
level_1 = ['J1','J1','J1','J2','J2','J2']
level_2 = [1,2,3,1,2,3]

In [83]:
list(zip(level_1, level_2))

[('J1', 1), ('J1', 2), ('J1', 3), ('J2', 1), ('J2', 2), ('J2', 3)]

In [84]:
hier_index = list(zip(level_1,level_2))
hier_index

[('J1', 1), ('J1', 2), ('J1', 3), ('J2', 1), ('J2', 2), ('J2', 3)]

In [85]:
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

MultiIndex([('J1', 1),
            ('J1', 2),
            ('J1', 3),
            ('J2', 1),
            ('J2', 2),
            ('J2', 3)],
           )

In [86]:
# here we set the index to the hierarchical index we have created 
new_df = pd.DataFrame(np.random.randn(6,3),index=hier_index,columns=['A','B','C'])
new_df

             A         B         C
J1 1 -0.156598  0.050221  1.556790
   2  1.345434  1.501737 -0.633348
   3 -0.487281 -1.647469  0.543758
J2 1 -1.210949 -0.365949  0.632181
   2 -0.393214 -1.826066  1.257824
   3  1.291497 -0.200139  1.255905
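
When the index is the full combination of two sets of labels, as it is here, `pd.MultiIndex.from_product` is a shorter way to build it than zipping two lists by hand. A minimal sketch, assuming the same labels as above:

```python
import pandas as pd

level_1 = ["J1", "J1", "J1", "J2", "J2", "J2"]
level_2 = [1, 2, 3, 1, 2, 3]

from_tuples = pd.MultiIndex.from_tuples(list(zip(level_1, level_2)))
from_product = pd.MultiIndex.from_product([["J1", "J2"], [1, 2, 3]])

print(list(from_product))
print(from_tuples.equals(from_product))   # True
```

`from_tuples` remains the more general choice when only some of the combinations are present.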


Now let's look at how to index this. To select along an index hierarchy on the rows, we use df.loc[]; if the hierarchy were on the columns axis, you would use plain bracket notation df[]. Selecting one label from the outer level returns the corresponding sub-DataFrame:

In [87]:
new_df.loc['J1']

          A         B         C
1 -0.156598  0.050221  1.556790
2  1.345434  1.501737 -0.633348
3 -0.487281 -1.647469  0.543758


In [88]:
new_df.loc['J1'].loc[1]

A   -0.156598
B    0.050221
C    1.556790
Name: 1, dtype: float64
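
Instead of chaining two `.loc` calls, both levels can be given at once as a tuple. This sketch uses a stand-in `new_df` filled with `np.arange` values (an assumption, so that the results are easy to verify):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["J1", "J2"], [1, 2, 3]])
new_df = pd.DataFrame(np.arange(18).reshape(6, 3), index=index, columns=["A", "B", "C"])

chained = new_df.loc["J1"].loc[1]     # two steps, as above
row = new_df.loc[("J1", 1), :]        # one step: a tuple with one label per level
value = new_df.loc[("J2", 3), "B"]    # a single cell

print(row.tolist())    # [0, 1, 2]
print(value)           # 16
```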

In [89]:
# we are creating index names for the two index levels
new_df.index.names= ['Group','Number']

In [90]:
new_df

                     A         B         C
Group Number
J1    1      -0.156598  0.050221  1.556790
      2       1.345434  1.501737 -0.633348
      3      -0.487281 -1.647469  0.543758
J2    1      -1.210949 -0.365949  0.632181
      2      -0.393214 -1.826066  1.257824
      3       1.291497 -0.200139  1.255905


In [91]:
# the xs (cross-section) method selects data at a given label from one or more index levels at once. We can pass the level argument with the name of the index level
new_df.xs('J1')

               A         B         C
Number
1      -0.156598  0.050221  1.556790
2       1.345434  1.501737 -0.633348
3      -0.487281 -1.647469  0.543758


In [92]:
new_df.xs(1,level='Number')

              A         B         C
Group
J1    -0.156598  0.050221  1.556790
J2    -1.210949 -0.365949  0.632181


In [93]:
new_df.xs(3,level='Number')

              A         B         C
Group
J1    -0.487281 -1.647469  0.543758
J2     1.291497 -0.200139  1.255905
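
By default `xs` drops the level it selected on, which is why only 'Group' remains in the outputs above. To keep both levels, `xs` accepts `drop_level=False`; the same rows can also be reached through `.loc` with `slice(None)` standing for "every label of this level". A minimal sketch on a stand-in `new_df` with `np.arange` values (an assumption, for easy checking):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["J1", "J2"], [1, 2, 3]], names=["Group", "Number"])
new_df = pd.DataFrame(np.arange(18).reshape(6, 3), index=index, columns=["A", "B", "C"])

dropped = new_df.xs(1, level="Number")                    # 'Number' level removed
kept = new_df.xs(1, level="Number", drop_level=False)     # both levels kept
via_loc = new_df.loc[(slice(None), 1), :]                 # same rows through .loc

print(list(dropped.index))   # ['J1', 'J2']
print(list(kept.index))      # [('J1', 1), ('J2', 1)]
```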
