# Introduction to Pandas

We will learn how to use pandas for data analysis. Pandas is an extremely powerful Python library for manipulating data, with a lot of features. To study pandas, we will go through:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

# Series

The first main data type we will learn about in pandas is the Series. To use a Series, we must first import pandas.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let us look at some examples:

In [1]:
import numpy as np
import pandas as pd

## Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [4]:
labels = ['A','I','L','Y']
my_list = [10,20,30,40]
array = np.array([10,20,30,40])
my_dict = {'A':10,'I':20,'L':30, 'Y':40}

In [5]:
## convert the list to a pandas Series:
pd.Series(data=my_list)

0    10
1    20
2    30
3    40
dtype: int64

In [6]:
## you can specify the index of a pandas Series

pd.Series(data=my_list,index=labels)

## note that the same thing can be done positionally: pd.Series(my_list,labels)

A    10
I    20
L    30
Y    40
dtype: int64

In [7]:
## to convert the array to a pandas series:
pd.Series(array)

0    10
1    20
2    30
3    40
dtype: int32

In [8]:
## we can specify the array index:
pd.Series(array,labels)

A    10
I    20
L    30
Y    40
dtype: int32

In [9]:
## we can convert a dictionary to a pandas Series; the keys become the index:
pd.Series(my_dict)

A    10
I    20
L    30
Y    40
dtype: int64

In [10]:
## a pandas Series can also hold non-numeric data, such as strings:
pd.Series(data=labels)

0    A
1    I
2    L
3    Y
dtype: object
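
The `int64` and `int32` dtypes in the outputs above differ because a Series built from a NumPy array inherits the array's dtype, and NumPy's default integer size can depend on the platform and NumPy version. As a minimal sketch, assuming you want the same dtype everywhere, you can set it explicitly (the variable names below are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Fix the integer width on the array itself...
values = np.array([10, 20, 30, 40], dtype=np.int64)
from_array = pd.Series(values, index=['A', 'I', 'L', 'Y'])

# ...or ask pandas for a specific dtype when building from a list.
from_list = pd.Series([10, 20, 30, 40], dtype='int64')

print(from_array.dtype, from_list.dtype)
```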

## Indexing

Understanding indexing is key to using a pandas Series. Pandas uses index labels or integer positions to allow fast lookups of information, much like a Python dictionary.

Let's see some examples of how to grab information from a Series. Let us create two series, series1 and series2:

In [11]:
series1 = pd.Series([1,2,3,4],index = ['ADE', 'OLA','EBI', 'UTI'])
series1

ADE    1
OLA    2
EBI    3
UTI    4
dtype: int64

In [12]:
series2 = pd.Series([1,2,3,4],index=['ADE','OLA','UGO','UTI'])
series2

ADE    1
OLA    2
UGO    3
UTI    4
dtype: int64

In [13]:
series1['ADE']

1
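
The dictionary analogy can be taken a little further. The sketch below rebuilds a Series like `series1` under the illustrative name `s` and shows a few dictionary-style lookups alongside position-based access:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['ADE', 'OLA', 'EBI', 'UTI'])

# Membership tests look at the index labels, like dictionary keys.
has_ade = 'ADE' in s

# .get returns a default instead of raising KeyError for a missing label.
missing = s.get('UGO', 0)

# A list of labels returns a smaller Series.
subset = s[['ADE', 'UTI']]

# .iloc selects by integer position rather than by label.
first = s.iloc[0]

print(has_ade, missing, first)
print(subset)
```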

In [14]:
series1 + series2

ADE    2.0
EBI    NaN
OLA    4.0
UGO    NaN
UTI    8.0
dtype: float64
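
Addition aligns the two Series on their index labels, so a label present in only one of them (`EBI`, `UGO`) produces `NaN`, and the result is converted to float to hold it. If you would rather treat a missing label as zero, one option is `Series.add` with `fill_value`. A minimal sketch, using the illustrative names `s1` and `s2`:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['ADE', 'OLA', 'EBI', 'UTI'])
s2 = pd.Series([1, 2, 3, 4], index=['ADE', 'OLA', 'UGO', 'UTI'])

# The + operator leaves NaN wherever a label is missing from one side.
plain = s1 + s2

# add(..., fill_value=0) substitutes 0 for the missing side before adding.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```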

# DataFrames

DataFrames are the central feature of pandas and present data in a tabular format very similar to an Excel spreadsheet. We can think of a DataFrame as a collection of Series objects that share the same index. Let us consider some examples:

In [16]:
from numpy.random import randn
np.random.seed(101)

In [17]:
# randn gives us random numbers from the standard normal distribution in the shape specified
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [18]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
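
The idea that a DataFrame is a collection of Series sharing one index can be sketched by building one from a dictionary of Series. The names `w`, `x`, and `frame` are illustrative; note how a label missing from one Series is filled with `NaN` in that column:

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0], index=['A', 'B'])  # no value for 'C'

# Each dictionary key becomes a column; rows are aligned on the index labels.
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
```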


## Selection and Indexing

Let us consider the various methods to grab data from a DataFrame:

In [19]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [20]:
## to return a dataframe, you have to pass in a list of column names:

df[['W','Z']]

|  | W | Z |
|---|---|---|
| A | 2.70685 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |


In [21]:
## DataFrame columns are themselves pandas Series; you can confirm this by doing:
type(df['W'])

pandas.core.series.Series

In [22]:
# you can create new columns in your dataframe:

df['NEW'] = df['X'] + df['Y']
df

|  | W | X | Y | Z | NEW |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 1.536102 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -1.167395 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | 1.268936 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -1.692109 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 4.584725 |


In [23]:
# you can also remove existing columns from the dataframe:
df.drop('NEW',axis=1)

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [24]:
## drop returns a new DataFrame by default, so df itself still contains the NEW column
df

|  | W | X | Y | Z | NEW |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 1.536102 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -1.167395 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | 1.268936 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -1.692109 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 4.584725 |


In [25]:
df.drop('NEW',axis=1,inplace=True)

In [26]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
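
Because `drop` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed, a common alternative is to assign the result to a name. The sketch below uses a small made-up DataFrame (the values and the name `trimmed` are illustrative) and the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]}, index=['A', 'B'])
df['NEW'] = df['X'] + df['Y']

# drop returns a copy without the column; df is not modified.
trimmed = df.drop(columns='NEW')  # same as df.drop('NEW', axis=1)

# Rows are dropped by index label with axis=0 (the default).
without_b = df.drop('B', axis=0)

print(list(df.columns))       # NEW is still here
print(list(trimmed.columns))  # NEW is gone
print(list(without_b.index))
```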


In [27]:
## you can also select rows

df.loc['D']

W    0.188695
X   -0.758872
Y   -0.933237
Z    0.955057
Name: D, dtype: float64

In [28]:
## you can select rows based on index number
df.iloc[3]

W    0.188695
X   -0.758872
Y   -0.933237
Z    0.955057
Name: D, dtype: float64

In [29]:
## you can select a subset of rows and columns. It will give you entries where rows and columns intersect
df.loc['E','Z']

0.6835088855389145

In [29]:
df.loc[['A','B'],['Y','Z']]

|  | Y | Z |
|---|---|---|
| A | 0.907969 | 0.503826 |
| B | -0.848077 | 0.605965 |
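
The difference between `loc` (labels) and `iloc` (integer positions) also shows up when slicing: a label slice includes its end label, while a position slice excludes its end position. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Y': [1, 2, 3], 'Z': [4, 5, 6]}, index=['A', 'B', 'C'])

# Label-based: rows 'A' through 'B' inclusive, columns by name.
by_label = df.loc['A':'B', ['Y', 'Z']]

# Position-based: rows 0 and 1 (2 is excluded), columns 0 and 1.
by_position = df.iloc[0:2, 0:2]

print(by_label.equals(by_position))
```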


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [30]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [31]:
df>0

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| D | True | False | False | True |
| E | True | True | True | True |


In [32]:
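# Passing the boolean DataFrame inside brackets returns the actual values where the condition is True and NaN elsewhere
df[df>0]

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| D | 0.188695 | NaN | NaN | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |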


In [33]:
df[df['Z']>0]

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [34]:
# Here we select the values in column X for the rows where column W meets a condition
df[df['W']>0]['X']

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [35]:
#This is similar to above. Here we are selecting columns Y and X
df[df['W']>0][['Y','X']]

|  | Y | X |
|---|---|---|
| A | 0.907969 | 0.628133 |
| B | -0.848077 | -0.319318 |
| D | -0.933237 | -0.758872 |
| E | 2.605967 | 1.978757 |
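
The chained form above performs two selections in a row, which is fine for reading data. The same result can be obtained in one step by passing the row condition and the column list to `loc` together, and that form is generally the safer one when you also want to assign values. A minimal sketch with made-up numbers (the names `chained`, `single`, and `updated` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [2.7, -2.0, 0.2], 'X': [0.6, 0.7, -0.8], 'Y': [0.9, 0.5, -0.9]},
    index=['A', 'C', 'D'],
)

# Two-step selection: filter rows, then pick columns.
chained = df[df['W'] > 0][['Y', 'X']]

# One-step selection: row mask and column list in a single loc call.
single = df.loc[df['W'] > 0, ['Y', 'X']]

# Assignment through loc on a copy, so the original df is left unchanged.
updated = df.copy()
updated.loc[updated['W'] > 0, 'X'] = 0.0

print(chained.equals(single))
print(updated)
```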


In [36]:
## to combine two conditions, wrap each in parentheses and use & (and) or | (or); Python's `and`/`or` keywords do not work on Series
df[(df['W']>0) & (df['Y'] > 1)]

|  | W | X | Y | Z |
|---|---|---|---|---|
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
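
To illustrate why `&` and `|` are needed, the sketch below uses a small made-up DataFrame and shows both operators, followed by the `ValueError` that pandas raises when the plain Python `and` keyword is applied to two boolean Series:

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [2.7, 0.7, -2.0], 'Y': [0.9, -0.8, 0.5]},
    index=['A', 'B', 'C'],
)

# Element-wise AND: both conditions must hold for a row to be kept.
both = df[(df['W'] > 0) & (df['Y'] > 0)]

# Element-wise OR: either condition is enough.
either = df[(df['W'] > 0) | (df['Y'] > 0)]

# `and` asks for a single True/False from a whole Series, which is ambiguous.
raised = False
try:
    df[(df['W'] > 0) and (df['Y'] > 0)]
except ValueError as err:
    raised = True
    print('ValueError:', err)

print(list(both.index), list(either.index))
```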


In [37]:
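## you can also reset the index to the default of 0, 1, 2, ..., n-1; the old index becomes a column named index
df.reset_index()

|  | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |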
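
Like `drop`, `reset_index` returns a new DataFrame rather than changing `df`. If the old index is not worth keeping as a column, `drop=True` discards it. A minimal sketch on a made-up DataFrame (the names `kept` and `dropped` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2]}, index=['A', 'B'])

# Default: the old index is moved into a column called 'index'.
kept = df.reset_index()

# drop=True throws the old index away instead.
dropped = df.reset_index(drop=True)

print(list(kept.columns), list(dropped.columns))
```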


In [44]:
countries = 'USA GER RUS UK FRA'.split()
df['COUNTRY']= countries

In [45]:
df.set_index('COUNTRY')

| COUNTRY | W | X | Y | Z |
|---|---|---|---|---|
| USA | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| GER | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| RUS | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| UK | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| FRA | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [70]:
## set_index returns a new DataFrame by default; pass inplace=True to keep the change permanently

In [41]:
df.set_index('COUNTRY',inplace=True)

In [42]:
df

| COUNTRY | W | X | Y | Z |
|---|---|---|---|---|
| USA | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| GER | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| RUS | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| UK | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| FRA | 0.190794 | 1.978757 | 2.605967 | 0.683509 |

In [46]:
## once COUNTRY is the index, rows can be selected by country label
df.loc['GER','X']

-0.31931804459303326


## Multi-Level Index and Index Hierarchy

Pandas also supports multi-level indexing, also called an index hierarchy:

In [26]:
# Index Levels
level_1 = ['J1','J1','J1','J2','J2','J2']
level_2 = [1,2,3,1,2,3]
hier_index = list(zip(level_1,level_2))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [27]:
print(hier_index)

MultiIndex(levels=[['J1', 'J2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
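
When every combination of the two levels is present, as it is here, the same index can also be built with `pd.MultiIndex.from_product`. The sketch below compares the two constructors; the variable names are illustrative. Note that the printed form of a `MultiIndex` depends on the pandas version: the `labels=` attribute shown above was renamed to `codes` in pandas 0.24, so newer versions will print something different.

```python
import pandas as pd

level_1 = ['J1', 'J1', 'J1', 'J2', 'J2', 'J2']
level_2 = [1, 2, 3, 1, 2, 3]

# Explicit pairs, as in the notebook cell above.
from_tuples = pd.MultiIndex.from_tuples(list(zip(level_1, level_2)))

# Cartesian product of the unique values of each level.
from_product = pd.MultiIndex.from_product([['J1', 'J2'], [1, 2, 3]])

print(from_tuples.equals(from_product))
print(from_product.nlevels)
```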


In [28]:
# here we set the index to the hierarchical index we have created 
new_df = pd.DataFrame(np.random.randn(6,3),index=hier_index,columns=['A','B','C'])
new_df

|  |  | A | B | C |
|---|---|---|---|---|
| J1 | 1 | 1.704595 | 0.784145 | 0.86434 |
| J1 | 2 | -0.257714 | -0.645444 | 1.408747 |
| J1 | 3 | 1.913947 | 0.030306 | 0.17 |
| J2 | 1 | 1.498862 | -0.626152 | -1.195288 |
| J2 | 2 | -0.031243 | 2.195136 | 0.928152 |
| J2 | 3 | -0.373486 | -0.134088 | 0.4547 |


Now let's look at how to index this. For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Selecting one label from the outer level returns the sub-DataFrame:

In [29]:
new_df.loc['J1']

|  | A | B | C |
|---|---|---|---|
| 1 | 1.704595 | 0.784145 | 0.86434 |
| 2 | -0.257714 | -0.645444 | 1.408747 |
| 3 | 1.913947 | 0.030306 | 0.17 |


In [30]:
new_df.loc['J1'].iloc[0]

A    1.704595
B    0.784145
C    0.864340
Name: 1, dtype: float64
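
Instead of chaining two lookups, both levels can be given to `loc` at once as a tuple. The sketch below uses a made-up DataFrame with the same index shape as `new_df`, filled with `np.arange` values so the results are easy to check by eye; the names `row` and `cell` are illustrative:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['J1', 'J2'], [1, 2, 3]])
df = pd.DataFrame(np.arange(18).reshape(6, 3), index=index, columns=['A', 'B', 'C'])

# A tuple (outer label, inner label) selects one row as a Series.
row = df.loc[('J1', 1)]

# Adding a column label returns a single value.
cell = df.loc[('J2', 3), 'B']

print(row)
print(cell)
```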

In [31]:
# we are creating index names for the two index levels
new_df.index.names= ['Group','Number']

In [32]:
new_df

| Group | Number | A | B | C |
|---|---|---|---|---|
| J1 | 1 | 1.704595 | 0.784145 | 0.86434 |
| J1 | 2 | -0.257714 | -0.645444 | 1.408747 |
| J1 | 3 | 1.913947 | 0.030306 | 0.17 |
| J2 | 1 | 1.498862 | -0.626152 | -1.195288 |
| J2 | 2 | -0.031243 | 2.195136 | 0.928152 |
| J2 | 3 | -0.373486 | -0.134088 | 0.4547 |


In [33]:
# The xs method returns a cross-section of the DataFrame. By default it selects on the outermost index level; pass the level argument with an index level name to select on another level
new_df.xs('J2')

| Number | A | B | C |
|---|---|---|---|
| 1 | 1.498862 | -0.626152 | -1.195288 |
| 2 | -0.031243 | 2.195136 | 0.928152 |
| 3 | -0.373486 | -0.134088 | 0.4547 |


In [36]:
new_df.xs(1,level='Number')

| Group | A | B | C |
|---|---|---|---|
| J1 | 1.704595 | 0.784145 | 0.86434 |
| J2 | 1.498862 | -0.626152 | -1.195288 |
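
By default `xs` drops the level it selected on, which is why only `Group` remains in the result above. Passing `drop_level=False` keeps both levels. The sketch below uses a made-up DataFrame with the same index names and compares `xs` with an equivalent boolean filter on the level values; the names `dropped`, `kept`, and `filtered` are illustrative:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['J1', 'J2'], [1, 2, 3]], names=['Group', 'Number']
)
df = pd.DataFrame(np.arange(18).reshape(6, 3), index=index, columns=['A', 'B', 'C'])

# Default behaviour: the 'Number' level is removed from the result.
dropped = df.xs(1, level='Number')

# drop_level=False keeps the full two-level index.
kept = df.xs(1, level='Number', drop_level=False)

# The same rows selected with a boolean mask on the level values.
filtered = df[df.index.get_level_values('Number') == 1]

print(dropped)
print(kept)
```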
