Topics to be covered in this Notebook:

1. Numpy
2. Pandas
3. Exploratory Data Analysis

- Author : Aritra Sen (www.aritrasen.com)
- LinkedIn : https://www.linkedin.com/in/aritrasen/

#### Numpy:

Numpy is short for ‘Numerical Python’. It is a library you should learn if you are interested in Data Science with Python. The core of Numpy is its N-dimensional array, which is similar to a List but has several advantages over it:

- More compact in memory.
- Faster reading and writing of items.
- Element-wise operations are easy to express.
- Built-in functions for mathematical operations on arrays.

As we are using Jupyter Notebook from Anaconda, the Numpy library is already installed; we only need to import it. Now let's look at the Numpy array, its advantages over a list, and the functionality Numpy provides.
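To make the "more compact" claim concrete, the sketch below compares the approximate memory used by a Python list of integers with a Numpy array holding the same values. The size of 1,000 elements and the names `py_list` and `np_arr` are arbitrary choices for illustration, and the byte count for the list depends on the Python build.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# A Numpy array stores raw 8-byte integers in one contiguous buffer
array_bytes = np_arr.nbytes

print("List (approx.) bytes:", list_bytes)
print("Array data bytes    :", array_bytes)
```

The list figure is only an estimate, but it should come out several times larger than the array's data buffer.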

In [1]:
#Import the Numpy Library
import numpy as np 

In [2]:
a = np.array([1,2,3]) # Creating a 1-D Numpy array
print(a)

[1 2 3]


In [3]:
# Argument list to create a Numpy array
# np.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)

# Creating a 2-D array

b = np.array([[1,2,3],[4,5,6]])
print(b)

# Print out the shape of array
print('Shape:' + str(b.shape))

# Shape is a tuple: the first element is the number of rows and the second is the number of columns

# Print out the data type of array
print(b.dtype)

[[1 2 3]
 [4 5 6]]
Shape:(2, 3)
int64
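Two of the arguments listed above, `dtype` and `ndmin`, can be illustrated with a short sketch. The variable name `f` is just a placeholder, not part of the original notebook.

```python
import numpy as np

# dtype forces the element type; ndmin pads the shape to at least 2 dimensions
f = np.array([1, 2, 3], dtype=np.float64, ndmin=2)

print(f)        # values stored as floats inside a 2-D array
print(f.shape)  # one row, three columns
print(f.dtype)
```

Setting `dtype` explicitly is useful when the default integer type would otherwise be inferred from the input values.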


##### Comparison between Numpy and List

In [4]:
my_list = [1,2,3]

# To multiply each element of the list by 2 we need to write a loop
emp_list = []
for element in my_list:
    emp_list.append(element*2)
print(emp_list)  

# A list does not support element-wise multiplication directly
my_list*2 # This repeats (concatenates) the list instead

[2, 4, 6]


[1, 2, 3, 1, 2, 3]

In [5]:
# However in Numpy Array we can easily do elementwise operation
# Avoiding the slow for loop

a = np.array([1,2,3])
print(a*2) # Vectorised multiplication, typically much faster than a Python loop

[2 4 6]


##### Mathematical Operations using numpy

In [7]:
a = np.array([1,2,3])
b = np.array([4,5,6])
print('Add :', a+b) #same as np.add(a,b)
print('Subtract: ',a-b) #same as np.subtract(a,b)
print ('Multiply: ',np.multiply(a,b))  
print ('Divide: ',np.divide(a,b)) 
print('Square root: ', np.sqrt(a)) #Elementwise square root

Add : [5 7 9]
Subtract:  [-3 -3 -3]
Multiply:  [ 4 10 18]
Divide:  [0.25 0.4  0.5 ]
Square root:  [1.         1.41421356 1.73205081]


In [8]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
print(np.dot(x,y)) #Matrix Product

[[19 22]
 [43 50]]
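The matrix product is easy to confuse with element-wise multiplication. The sketch below contrasts the two on the same `x` and `y`; the `@` operator is used here as an alternative spelling of `np.dot` for 2-D arrays, and the names `matrix_product` and `elementwise` are illustrative.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

matrix_product = x @ y   # same result as np.dot(x, y) for 2-D arrays
elementwise = x * y      # multiplies matching positions only

print(matrix_product)
print(elementwise)
```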


In [9]:
x = np.array([[1,2],[3,4]])
print(np.transpose(x)) # Matrix Transpose

[[1 3]
 [2 4]]


##### Functions to create different types of arrays

In [10]:
a = np.zeros((2,2)) # 2x2 array of zeros; set the shape as needed
print(a)

[[0. 0.]
 [0. 0.]]


In [11]:
b = np.ones((1,2)) # array of all ones
print(b)

[[1. 1.]]


In [12]:
c = np.eye(2) # Creates a 2x2 identity matrix
print(c)

[[1. 0.]
 [0. 1.]]


In [13]:
d = np.random.random((2,2)) # Creates a 2x2 array of random numbers in [0, 1)
print(d)

[[0.5275685  0.30631075]
 [0.73400404 0.57945371]]
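Random arrays differ on every run, which makes results hard to reproduce. One option, sketched below, is a seeded generator from `np.random.default_rng`; the seed value 42 and the names `rng_one`, `rng_two`, `r1` and `r2` are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_one = np.random.default_rng(seed=42)
rng_two = np.random.default_rng(seed=42)

r1 = rng_one.random((2, 2))  # uniform samples in [0, 1)
r2 = rng_two.random((2, 2))

print(r1)
print(np.array_equal(r1, r2))
```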


##### Array Indexing and Slicing

- To slice an N-D array, the positions follow the rule array[row_start : row_end , column_start : column_end].
- The end positions are excluded.
- Indexing starts from 0.

In [14]:
# Create a 2-D array of 3 rows and 4 columns
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) 
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [15]:
print (a[ : , : ]) #Prints all rows and Columns

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [16]:
print(a[1 :  ,  : ]) # From the 2nd row to the last row, all columns

[[ 5  6  7  8]
 [ 9 10 11 12]]


- Print the 4th element of 1st row
- Print the first element of 3rd row

In [17]:
print(a[0,3]) # Print the 4th element of 1st row
print(a[2,0]) # Print the first element of 3rd row

4
9


In [18]:
print(a[ :  , 1 : 3]) # Fetch all rows along with 2nd & 3rd columns

[[ 2  3]
 [ 6  7]
 [10 11]]


In [19]:
# Boolean Indexing 
# Suppose we want all the elements of the array greater than 2
print(a>2)
print(a[a>2])

[[False False  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]
[ 3  4  5  6  7  8  9 10 11 12]
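Boolean masks can also be combined. The following sketch, using a hypothetical array name `m` with the same values as above, selects elements with `&` (and) and `|` (or).

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Parentheses are needed because & and | bind tighter than comparisons
between = m[(m > 2) & (m < 10)]
outside = m[(m <= 2) | (m >= 10)]

print(between)
print(outside)
```

As with a single condition, the result is a flattened 1-D array of the matching elements.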


In [20]:
# Create a 2-D array of 3 rows and 4 columns
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) 
print(a)
print(a.shape)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
(3, 4)


In [21]:
# Suppose we want to reshape the array with 2 rows and 6 columns
a = a.reshape(2,6)
print(a)

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]]


In [22]:
# Create an array of evenly spaced numbers
b = np.arange(24)
print(b)
print(b.ndim) # This is a 1-D array

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
1


In [30]:
b = b.reshape(2,12) # Converts it to 2-D Array 
print(b)
print("No of dimensions: ",b.ndim)

[[ 0  1  2  3  4  5  6  7  8  9 10 11]
 [12 13 14 15 16 17 18 19 20 21 22 23]]
No of dimensions:  2


In [29]:
c = b.reshape(2,2,6) # Converts it to 3-D Array 
print(c)
print("No of dimensions: ",c.ndim)

[[[ 0  1  2  3  4  5]
  [ 6  7  8  9 10 11]]

 [[12 13 14 15 16 17]
  [18 19 20 21 22 23]]]
No of dimensions:  3
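When reshaping, one dimension can be left for Numpy to work out by passing -1, as long as the total number of elements stays the same. A minimal sketch, with illustrative names `v`, `two_d` and `three_d`:

```python
import numpy as np

v = np.arange(24)

two_d = v.reshape(2, -1)       # Numpy infers 12 columns
three_d = v.reshape(2, 2, -1)  # Numpy infers a last axis of length 6

print(two_d.shape)
print(three_d.shape)
```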


##### Numpy Statistical Functions

In [31]:
a = np.array([[1,2,3],[4,5,6],[10,15,100]])
print(np.median(a)) # Median 
print(np.mean(a)) #Mean
print(np.var(a)) # Variance - average of squared deviations from mean 
print(np.std(a)) # Standard Deviation - square root of variance

5.0
16.22222222222222
894.1728395061727
29.90272294467801
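The functions above summarise the whole array. The sketch below shows the `axis` argument for per-column and per-row statistics, and writes out the variance definition by hand to confirm it matches `np.var`, which divides by N by default. The names `s`, `col_means`, `row_means` and `manual_var` are illustrative.

```python
import numpy as np

s = np.array([[1, 2, 3], [4, 5, 6], [10, 15, 100]])

col_means = np.mean(s, axis=0)  # one mean per column
row_means = np.mean(s, axis=1)  # one mean per row

# Population variance written out: mean of squared deviations from the mean
manual_var = np.mean((s - np.mean(s)) ** 2)

print(col_means)
print(row_means)
print(np.isclose(manual_var, np.var(s)))
```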


#### Pandas

Pandas is built on top of Numpy. It provides two fast, easy-to-use data structures that make data analysis easier.

Data Structures:

- Series : One-dimensional labelled array.
- DataFrame : Two-dimensional data structure with rows and columns, where columns can have different types.

In [32]:
import pandas as pd # import pandas Library
import numpy as np

##### Series

In [33]:
se = pd.Series([1,2,3] , index = ['first', 'second','third']) #Index acts as reference 
print(se)

first     1
second    2
third     3
dtype: int64


In [35]:
se['first'] #Using index to access Data elements

1

In [36]:
se.iloc[0] # Using position to access data elements

1
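The explicit accessors `.loc` (labels) and `.iloc` (positions) make the intent unambiguous. A small sketch, using a hypothetical Series named `ser` with the same contents as above:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['first', 'second', 'third'])

by_label = ser.loc['second']             # label-based lookup
by_position = ser.iloc[1]                # position-based lookup (0-indexed)
label_slice = ser.loc['first':'second']  # label slices include the end label

print(by_label, by_position)
print(label_slice)
```

Note that a label slice includes its end label, unlike a positional slice.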

##### DataFrame

In [37]:
# Python Dictionary
p_dict = {'name' : ['Sachin','Sourav','Rahul'], 
         'country' : ['Ind','Ind','Ind'],
         'Runs': ['10000','10500','10100']}

In [38]:
#Dataframe - This will convert the dictionary to DataFrame
df = pd.DataFrame(p_dict) 
df.head()  #by default shows the first five rows

     name country   Runs
0  Sachin     Ind  10000
1  Sourav     Ind  10500
2   Rahul     Ind  10100


In [39]:
df.tail(2) # last two rows (tail() shows five by default)

     name country   Runs
1  Sourav     Ind  10500
2   Rahul     Ind  10100


In [40]:
# Selecting a specific column
# Below are two ways that produce the same result
print(df.Runs)
print(df['Runs'])
# The first form won't work if the column name contains spaces

0    10000
1    10500
2    10100
Name: Runs, dtype: object
0    10000
1    10500
2    10100
Name: Runs, dtype: object
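The `dtype: object` in the output shows that Runs holds strings, because the dictionary values were quoted. If numeric operations are needed, the column can be converted first. The sketch below uses a separate hypothetical frame `runs_df` so the notebook's `df` is left untouched.

```python
import pandas as pd

runs_df = pd.DataFrame({'name': ['Sachin', 'Sourav', 'Rahul'],
                        'Runs': ['10000', '10500', '10100']})

# Strings cannot be summed as numbers, so convert to integers first
runs_df['Runs'] = runs_df['Runs'].astype(int)

print(runs_df['Runs'].dtype)
print(runs_df['Runs'].sum())
```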


iloc is used for integer-location based indexing, i.e. selection by position. Here we select rows and columns by number: df.iloc[row_selection , column_selection]

In [41]:
print(df.iloc[0]) #First row of the dataframe
print(df.iloc[-1]) #Last row
print(df.iloc[ :,1]) # All rows of the second column (position 1)

name       Sachin
country       Ind
Runs        10000
Name: 0, dtype: object
name       Rahul
country      Ind
Runs       10100
Name: 2, dtype: object
0    Ind
1    Ind
2    Ind
Name: country, dtype: object


Please note that a slice such as [1:6] runs from the first number up to, but not including, the second number, so [1:6] selects positions 1, 2, 3, 4 and 5.

In [42]:
print(df.iloc[ :,0:2]) # All rows, 1st and 2nd columns; a bare : selects all rows/columns

     name country
0  Sachin     Ind
1  Sourav     Ind
2   Rahul     Ind


In [43]:
print(df.iloc[ :,1:]) # All rows, from the 2nd column to the last

  country   Runs
0     Ind  10000
1     Ind  10500
2     Ind  10100
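The label-based counterpart of iloc is loc. The sketch below rebuilds the small frame under the hypothetical name `players` and compares the two; the key difference is that a loc slice includes its end label.

```python
import pandas as pd

players = pd.DataFrame({'name': ['Sachin', 'Sourav', 'Rahul'],
                        'country': ['Ind', 'Ind', 'Ind'],
                        'Runs': ['10000', '10500', '10100']})

first_two_iloc = players.iloc[0:2]   # positions 0 and 1 (end excluded)
first_two_loc = players.loc[0:1]     # labels 0 and 1 (end included)
name_and_runs = players.loc[:, ['name', 'Runs']]  # columns chosen by name

print(first_two_iloc.equals(first_two_loc))
print(name_and_runs)
```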


### Common Dataframe Operations and looking deep into the data

In [45]:
from IPython.display import Image
Image(url= "https://static1.squarespace.com/static/5006453fe4b09ef2252ba068/5095eabce4b06cb305058603/5095eabce4b02d37bef4c24c/1352002236895/100_anniversary_titanic_sinking_by_esai8mellows-d4xbme8.jpg")

In [46]:
df=pd.read_csv("http://bit.ly/kaggletrain")

In [47]:
df.shape # To check the count of rows and columns

(891, 12)

In [48]:
df.columns #to check the column names

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [49]:
df.dtypes # to see datatypes of the columns

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

##### Finding & Replacing NULL Values

In [50]:
df.isnull().sum() # check null values present in the dataframe columns

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [51]:
# From the output above we can see that the Age column has null values
# The line below fills them with the mean Age of the current df
df['Age'] = df['Age'].fillna(df['Age'].mean())

df.isnull().sum() # Age no longer shows any null values

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
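Mean imputation can be checked by hand on a tiny example. The frame `ages` below is hypothetical data, not taken from the Titanic set; the result of `fillna` is assigned back to the column rather than relying on in-place modification.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'Age': [22.0, np.nan, 30.0, np.nan, 38.0]})

mean_age = ages['Age'].mean()               # NaN values are skipped: (22+30+38)/3
ages['Age'] = ages['Age'].fillna(mean_age)  # assign the filled column back

print(mean_age)
print(ages['Age'].isnull().sum())
```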

In [52]:
df["Embarked"].value_counts() # count of each unique value in a column

S    644
C    168
Q     77
Name: Embarked, dtype: int64

##### Boolean Indexing

or -> condition 1 | condition 2

and -> condition 1 & condition 2

Not -> ~ (not condition)

equal -> == (equal criteria)
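The operators listed above can be tried on a small hypothetical frame before applying them to the full dataset. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                    'Pclass': [1, 3, 1, 3],
                    'Survived': [0, 1, 1, 0]})

either = toy[(toy.Sex == 'female') | (toy.Pclass == 1)]   # or
both = toy[(toy.Sex == 'female') & (toy.Pclass == 1)]     # and
not_female = toy[~(toy.Sex == 'female')]                  # not

print(len(either), len(both), len(not_female))
```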

In [53]:
# Female passengers who survived
df[(df.Sex =='female') & (df.Survived ==1)].head() # parentheses around each condition are mandatory

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [54]:
# Selecting rows with female passengers or passengers from Pclass 1
df[(df.Sex =='female') | (df.Pclass ==1)].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S


In [55]:
# Male survivors as a % of all passengers
(len(df[(df.Sex =='male') & (df.Survived ==1)])/len(df))*100

12.2334455667789

In [56]:
# Female survivors as a % of all passengers
(len(df[(df.Sex =='female') & (df.Survived ==1)])/len(df))*100

26.15039281705948
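The two percentages above are shares of all passengers, which is different from the survival rate within each sex. The sketch below shows both calculations on a small hypothetical frame named `sample`; the mean of a 0/1 column gives the fraction of ones.

```python
import pandas as pd

sample = pd.DataFrame({'Sex': ['male', 'male', 'male', 'female', 'female'],
                       'Survived': [0, 0, 1, 1, 1]})

# Share of ALL passengers who are male survivors
male_survivors = sample[(sample.Sex == 'male') & (sample.Survived == 1)]
share_of_all = len(male_survivors) / len(sample) * 100

# Survival rate WITHIN each sex
rate_by_sex = sample.groupby('Sex')['Survived'].mean() * 100

print(share_of_all)
print(rate_by_sex)
```

Here one of five passengers is a male survivor, while one of three males survived, so the two figures answer different questions.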

In [57]:
# To check if there is any duplicate PassengerID
sum(df.PassengerId.duplicated())

0

In [58]:
# Average Age of passengers grouped by Gender and Survival Status
df.groupby(by = ['Sex','Survived'])['Age'].mean()


Sex     Survived
female  0           26.023272
        1           28.979263
male    0           31.175224
        1           27.631705
Name: Age, dtype: float64

##### Missing value imputation

In [59]:
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing Embarked values with the most frequent port
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [60]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [61]:
# About 77% of the Cabin values are missing, so you may want to drop this column
df.Cabin.isnull().sum()/df.shape[0]

0.7710437710437711
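One way to act on a high missing-value ratio is to drop the column once it passes a chosen cut-off. The sketch below uses a hypothetical frame `cabins` and an arbitrary `threshold` of 0.5; the right cut-off depends on the analysis.

```python
import numpy as np
import pandas as pd

cabins = pd.DataFrame({'PassengerId': [1, 2, 3, 4],
                       'Cabin': ['C85', np.nan, np.nan, np.nan]})

missing_share = cabins['Cabin'].isnull().mean()  # fraction of missing values
threshold = 0.5                                  # hypothetical cut-off

if missing_share > threshold:
    cabins = cabins.drop(columns=['Cabin'])      # drop() returns a new frame

print(missing_share)
print(cabins.columns.tolist())
```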