## Pandas

<i>This notebook gives a quick overview of pandas for data analysis.</i>

### Imports

In [1]:
import pandas as pd
import numpy as np

#### Difference between a Python list and a pandas Series

A list is a one-dimensional, ordered collection whose elements are accessed by position.

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 

##### Series Index VS List Index

In [2]:
list_var = [0.25, 0.5, 0.75, 1.0]

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2,3,1,0])
data

2    0.25
3    0.50
1    0.75
0    1.00
dtype: float64

In [4]:
print("Element of 0th index in list: ", list_var[0])
print("Element of 0th index in series: ", data[0])

Element of 0th index in list:  0.25
Element of 0th index in series:  1.0


So square-bracket indexing on this Series treats the argument as a key (an index label), whereas a list treats the same argument as a position.
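The contrast can be made explicit with the `loc` and `iloc` accessors, which this notebook covers in more detail later. The following is a minimal sketch that reuses the values from the cells above; the variable names are chosen only for illustration.

```python
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]
labeled = pd.Series(values, index=[2, 3, 1, 0])

by_position_in_list = values[0]          # a list always indexes by position
by_label_in_series = labeled.loc[0]      # label 0 sits at the last position
by_position_in_series = labeled.iloc[0]  # position 0 holds the label 2

print(by_position_in_list, by_label_in_series, by_position_in_series)
```

Using `loc` or `iloc` states the intent directly, so the reader does not have to know what kind of index the Series carries.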

### Series

##### Series like a `dictionary`

In [5]:
print("Value of key(index) 0 :", data[0])
data[0] = 100  # assign a number so the Series keeps its float64 dtype
print("Value of key(index) 0 after above operation: ", data[0])

Value of key(index) 0 : 1.0
Value of key(index) 0 after above operation:  100.0
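A few more dictionary-style operations also work on a Series. This is a small sketch on a made-up Series (`scores` is a hypothetical name, not part of the notebook's data).

```python
import pandas as pd

scores = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])

has_b = 'b' in scores          # membership checks the index labels, like dict keys
labels = list(scores.keys())   # keys() returns the index
scores['d'] = 1.0              # assigning to a new label adds an entry
pairs = list(scores.items())   # (label, value) pairs, like dict.items()

print(has_b, labels, pairs)
```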


##### Series like `vector`

In [6]:
print("Vector operation on list: ", list_var*2)

Vector operation on list:  [0.25, 0.5, 0.75, 1.0, 0.25, 0.5, 0.75, 1.0]


In [7]:
print("Vector operation on series:\n ")
print(data*2)

Vector operation on series:
 
2      0.5
3      1.0
1      1.5
0    200.0
dtype: float64
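To make the difference explicit: multiplying a list repeats it, so element-wise doubling needs a loop or comprehension, while a Series applies the operation to every value. A minimal sketch with illustrative variable names:

```python
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]

repeated = values * 2                    # list repetition: 8 elements
doubled_list = [v * 2 for v in values]   # element-wise doubling needs a comprehension
doubled_series = pd.Series(values) * 2   # a Series doubles every value directly

print(len(repeated), doubled_list, doubled_series.tolist())
```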


In [8]:
l = [1,2,3,4]

In [9]:
l[0:3]

[1, 2, 3]

In [10]:
data[1:3]

3    0.50
1    0.75
dtype: float64

In [11]:
data[0:3]

2    0.25
3    0.50
1    0.75
dtype: float64

##### Some convenient methods for Pandas Series 

In [12]:
## Return numpy array of values of series 
print("Values in the Series: ", data.values)
print("Type: ", type(data.values))

Values in the Series:  [  0.25   0.5    0.75 100.  ]
Type:  <class 'numpy.ndarray'>


In [13]:
## return Index object containing keys/index of the Series
print("Index: ", data.index)
print("Type: ", type(data.index))

Index:  Int64Index([2, 3, 1, 0], dtype='int64')
Type:  <class 'pandas.core.indexes.numeric.Int64Index'>


In [14]:
## This way of indexing takes the argument in the square bracket as key
data[2]

0.25

In [15]:
## This way of indexing (slicing, as with a Python list) treats the range as positions of the values in the Series.
data[0:3]

2    0.25
3    0.50
1    0.75
dtype: float64

In [16]:
## Index of series can be int, float and string. 
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [17]:
data['b']

0.5

##### Series from `dictionary`

In [18]:
## The key of the dict becomes index of the series
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [19]:
## Accessing dict
population_dict['California']

38332521

In [20]:
## Accessing Series
population['California']

38332521

In [21]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

### DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects.
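Label alignment can be seen with two small Series whose labels are in different orders. This sketch uses two of the states from above; the `density` name is introduced only for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967})  # different order on purpose

# Division matches values by label, not by position
density = population / area

print(density)
```

Each state's population is divided by that same state's area, even though the two Series list the states in a different order.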

##### DataFrame from `dictionary`

In [22]:
## Here the keys of the dictionary become column names for the DataFrame
states = pd.DataFrame({'population': population, 'area': area})
states

| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


In [23]:
print("Type: ", type(states))

Type:  <class 'pandas.core.frame.DataFrame'>


##### Some convenient methods for Pandas DataFrame

In [24]:
## return Index object containing index of the DataFrame
print("Index: ", states.index)
print("Type: ", type(states.index))

Index:  Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Type:  <class 'pandas.core.indexes.base.Index'>


In [25]:
## return Index object containing column names
print("Column names: ", states.columns)
print("Type: ", type(states.columns))

Column names:  Index(['population', 'area'], dtype='object')
Type:  <class 'pandas.core.indexes.base.Index'>


In [26]:
## Return numpy array of values of DataFrame
print("Values in the DataFrame: \n", states.values)
print("Type: ", type(states.values))

Values in the DataFrame: 
 [[38332521   423967]
 [26448193   695662]
 [19651127   141297]
 [19552860   170312]
 [12882135   149995]]
Type:  <class 'numpy.ndarray'>


In [27]:
## Accessing column of DataFrame as Series. Notice single square bracket
print(states['population'])
print("\nType: ", type(states['population']))

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

Type:  <class 'pandas.core.series.Series'>
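For contrast, passing a list of column names (double square brackets) returns a DataFrame rather than a Series. A minimal sketch on a two-row version of the table above:

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

single = states['population']     # single brackets: a Series
double = states[['population']]   # double brackets: a one-column DataFrame

print(type(single).__name__, type(double).__name__, double.shape)
```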


##### Some statistical tools

These come in handy when performing exploratory data analysis (EDA).

In [28]:
## Return the column-wise mean
states.mean()

population    23373367.2
area            316246.6
dtype: float64

In [29]:
# Return maximum of the column
states.max()

population    38332521
area            695662
dtype: int64

In [30]:
## Return minimum of the column 
states.min()

population    12882135
area            141297
dtype: int64

In [31]:
## Return standard deviation of the column
states.std()

population    9.640386e+06
area          2.424374e+05
dtype: float64

In [32]:
## Gives the pairwise correlation between columns (called features in data science literature).
## This is a very handy method to help understand relationships between features.
states.corr()

| | population | area |
|---|---|---|
| population | 1.0 | 0.61302 |
| area | 0.61302 | 1.0 |
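As a sanity check, the off-diagonal value should agree with the Pearson coefficient computed by NumPy (Pearson is the default method of `corr`). A minimal sketch using the same five states:

```python
import numpy as np
import pandas as pd

states = pd.DataFrame({
    'population': [38332521, 26448193, 19651127, 19552860, 12882135],
    'area': [423967, 695662, 141297, 170312, 149995],
})

# Correlation between two columns, computed two ways
r_pandas = states['population'].corr(states['area'])
r_numpy = np.corrcoef(states['population'], states['area'])[0, 1]

print(r_pandas, r_numpy)
```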


In [33]:
## This is yet another very handy feature. It gives basic statistical summary
states.describe()

| | population | area |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 23373370.0 | 316246.6 |
| std | 9640386.0 | 242437.411951 |
| min | 12882140.0 | 141297.0 |
| 25% | 19552860.0 | 149995.0 |
| 50% | 19651130.0 | 170312.0 |
| 75% | 26448190.0 | 423967.0 |
| max | 38332520.0 | 695662.0 |


##### DataFrame from `list` or `numpy array`

In [34]:
names = ['Cersie','Jon','Sansa','Danny','Arya', 'Jamie', 'Bran', 'NightKing']
power = [968, 155, 77, 578, 973, 349, 220, 532]

In [35]:
list(zip(names, power))

[('Cersie', 968),
 ('Jon', 155),
 ('Sansa', 77),
 ('Danny', 578),
 ('Arya', 973),
 ('Jamie', 349),
 ('Bran', 220),
 ('NightKing', 532)]

In [36]:
dataset = list(zip(names, power))
print(dataset)

[('Cersie', 968), ('Jon', 155), ('Sansa', 77), ('Danny', 578), ('Arya', 973), ('Jamie', 349), ('Bran', 220), ('NightKing', 532)]


In [37]:
df = pd.DataFrame(dataset)

In [38]:
df

| | 0 | 1 |
|---|---|---|
| 0 | Cersie | 968 |
| 1 | Jon | 155 |
| 2 | Sansa | 77 |
| 3 | Danny | 578 |
| 4 | Arya | 973 |
| 5 | Jamie | 349 |
| 6 | Bran | 220 |
| 7 | NightKing | 532 |


##### More convenient methods

In [39]:
## For a large dataset, displaying the whole DataFrame is impractical; it is better to look at a few rows at a time.
## head() takes an argument n, the number of rows to display from the top.
## By default n = 5
df.head()

| | 0 | 1 |
|---|---|---|
| 0 | Cersie | 968 |
| 1 | Jon | 155 |
| 2 | Sansa | 77 |
| 3 | Danny | 578 |
| 4 | Arya | 973 |


In [40]:
## tail() displays the last n rows of the dataset (n = 5 by default)
df.tail()

| | 0 | 1 |
|---|---|---|
| 3 | Danny | 578 |
| 4 | Arya | 973 |
| 5 | Jamie | 349 |
| 6 | Bran | 220 |
| 7 | NightKing | 532 |
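Both methods accept the number of rows explicitly. A small sketch on a made-up ten-row frame (`numbers` is a hypothetical name):

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})

first_two = numbers.head(2)    # rows at positions 0 and 1
last_three = numbers.tail(3)   # rows at positions 7, 8 and 9, original index kept

print(first_two)
print(last_three)
```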


##### Add `columns` and `index` to the DataFrame manually

In [41]:
columns = ['names', 'power']
df.columns = columns

In [42]:
df.columns

Index(['names', 'power'], dtype='object')

In [43]:
index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8']
df.index = index

In [44]:
df

| | names | power |
|---|---|---|
| r1 | Cersie | 968 |
| r2 | Jon | 155 |
| r3 | Sansa | 77 |
| r4 | Danny | 578 |
| r5 | Arya | 973 |
| r6 | Jamie | 349 |
| r7 | Bran | 220 |
| r8 | NightKing | 532 |
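The column names and row labels can also be supplied when the DataFrame is created, through the `columns` and `index` arguments of the constructor. A minimal sketch on the first three rows of the data above:

```python
import pandas as pd

names = ['Cersie', 'Jon', 'Sansa']
power = [968, 155, 77]

# Set column names and row labels at construction time
df_labeled = pd.DataFrame(
    list(zip(names, power)),
    columns=['names', 'power'],
    index=['r1', 'r2', 'r3'],
)

print(df_labeled)
```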


In [45]:
## Same as df.columns 
## Return Index object of column names
df.keys()

Index(['names', 'power'], dtype='object')

##### Use of `loc` and `iloc`

In [46]:
## Pandas Series 
data = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[1, 3, 5, 7, 8])
data

1    a
3    b
5    c
7    d
8    e
dtype: object

In [47]:
## Pandas DataFrame
df

| | names | power |
|---|---|---|
| r1 | Cersie | 968 |
| r2 | Jon | 155 |
| r3 | Sansa | 77 |
| r4 | Danny | 578 |
| r5 | Arya | 973 |
| r6 | Jamie | 349 |
| r7 | Bran | 220 |
| r8 | NightKing | 532 |


In [48]:
## Explicit index when indexing. Argument to the square bracket is used as key        
data[3]

'b'

In [49]:
## Implicit (positional) index when slicing
## A slice inside the square brackets is treated as a range of positions, as with a list or NumPy array
data[1:6]

3    b
5    c
7    d
8    e
dtype: object

###### This mix of label-based indexing and position-based slicing causes a lot of confusion.

In [50]:
## loc attribute allows indexing and slicing that always references the explicit index
## Argument to the loc is considered as key 
data.loc[5]

'c'

In [51]:
data.loc[3:7]

3    b
5    c
7    d
dtype: object

In [52]:
## The iloc attribute allows indexing and slicing that always references the implicit Python-style index
## Argument to the iloc is considered as position 
data.iloc[2]

'c'

In [53]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [54]:
df.loc['r3': 'r7']

| | names | power |
|---|---|---|
| r3 | Sansa | 77 |
| r4 | Danny | 578 |
| r5 | Arya | 973 |
| r6 | Jamie | 349 |
| r7 | Bran | 220 |


In [55]:
df.iloc[0:5]

| | names | power |
|---|---|---|
| r1 | Cersie | 968 |
| r2 | Jon | 155 |
| r3 | Sansa | 77 |
| r4 | Danny | 578 |
| r5 | Arya | 973 |
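One detail worth noting in the outputs above: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as in ordinary Python slicing. Both accessors also take a row and a column together. A minimal sketch with a reduced copy of the data (`small` is a hypothetical name):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[1, 3, 5, 7, 8])

by_label = data.loc[3:7]       # labels 3, 5 and 7: the end label is included
by_position = data.iloc[1:3]   # positions 1 and 2: the end position is excluded

small = pd.DataFrame(
    {'names': ['Cersie', 'Jon', 'Sansa'], 'power': [968, 155, 77]},
    index=['r1', 'r2', 'r3'],
)

cell_by_label = small.loc['r2', 'power']   # row label, column label
cell_by_position = small.iloc[1, 1]        # row position, column position

print(by_label.tolist(), by_position.tolist(), cell_by_label, cell_by_position)
```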


##### DataFrame from `csv` 

You can also create a DataFrame from other sources such as `excel`, `html`, `json`, and `sql`.
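The reader functions follow a similar calling pattern. As a rough sketch, the example below parses the same made-up records (not taken from any dataset in this notebook) from in-memory CSV and JSON text, using `io.StringIO` so that no files are needed.

```python
import io

import pandas as pd

# Hypothetical sample data, for illustration only
csv_text = "name,power\nJon,155\nArya,973\n"
json_text = '[{"name": "Jon", "power": 155}, {"name": "Arya", "power": 973}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```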

In [56]:
df = pd.read_csv('titanic_dataset.csv')

In [57]:
df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [58]:
## Gives information about total entries, column names along with the dtype of the same and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [59]:
## Gives statistical summary. This is of significance while performing EDA.
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


##### Missing values

In [60]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [61]:
df.notna().sum()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
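Beyond raw counts, the share of missing values per column and a filtered copy without the missing rows are often useful. The sketch below uses a tiny made-up frame (`df_small`, not the Titanic data) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample, for illustration only
df_small = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [None, 'C85', None],
})

missing_counts = df_small.isna().sum()         # number of missing values per column
missing_share = df_small.isna().mean()         # fraction of missing values per column
age_known = df_small.dropna(subset=['Age'])    # keep only rows where Age is present

print(missing_counts)
print(missing_share)
print(age_known)
```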

In [62]:
## Returns an array of the unique values in the selected column, in order of first appearance. Similar to building a set, but the result is a NumPy array
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [63]:
set(df['Embarked'])

{'C', 'Q', 'S', nan}
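Two related methods are `nunique`, which counts the distinct values, and `value_counts`, which counts how often each value occurs. A minimal sketch on a made-up Series (`embarked` is a hypothetical stand-in for the column above):

```python
import numpy as np
import pandas as pd

embarked = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'S'])

unique_values = embarked.unique()               # distinct values, NaN included
n_unique = embarked.nunique()                   # NaN is excluded by default
counts = embarked.value_counts(dropna=False)    # occurrences of each value, NaN included

print(unique_values, n_unique)
print(counts)
```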

In [64]:
## groupby is a handy method which, combined with other methods, can provide better insight into the features.
df.groupby(['Sex', 'Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
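The same grouping idea extends to other aggregations. Because `Survived` is coded as 0 or 1, its mean within each group is the survival rate. The sketch below uses a five-row made-up frame (`passengers`), not the Titanic data itself.

```python
import pandas as pd

# Hypothetical sample, for illustration only
passengers = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 0, 0, 0, 1],
})

# Count of rows for each (Sex, Survived) combination
counts = passengers.groupby(['Sex', 'Survived'])['Survived'].count()

# Mean of a 0/1 column gives the share of ones in each group
survival_rate = passengers.groupby('Sex')['Survived'].mean()

print(counts)
print(survival_rate)
```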

##### DataFrame using `url`