# Introduction to Pandas

In [1]:
import pandas as pd

We will open a data set listing the passengers on the ill-fated Titanic voyage.

In [3]:
# Use CSV reader  
df = pd.read_csv("D:\\PythonFiles\\Codes (1)\\2_Codes\\Titanic_Survival_train.csv")

See the pandas documentation for more about [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv).
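The path used above is specific to one machine. As a minimal sketch, *read_csv* also accepts a file-like object, so a small in-memory sample can stand in for the file. The three rows are copied from the output shown in this notebook; the names `csv_text` and `sample` and the reduced column set are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file: three passengers, reduced set of columns.
csv_text = '''PassengerId,Survived,Name,Age,Cabin
1,0,"Braund, Mr. Owen Harris",22.0,
3,1,"Heikkinen, Miss. Laina",26.0,
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,C123
'''

# index_col makes PassengerId the row index instead of a regular column.
sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(sample)
print(sample["Cabin"].isna().sum())  # empty fields are read as NaN
```

Quoted fields may contain commas, which is why the passenger names survive as a single column.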

Display the first few rows of the DataFrame:

In [4]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# Display data type of the variable df
type(df)

pandas.core.frame.DataFrame

We can display the data type of each column of the DataFrame using *dtypes*:

In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Use *shape* to get the dimensions of the DataFrame as (rows, columns):

In [7]:
df.shape

(891, 12)

Use *info()* to get a summary of the DataFrame, including the non-null count and data type of each column:

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can get a statistical description of the numeric columns using the *describe()* method of the DataFrame:

In [9]:
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


*describe()* is one of the descriptive statistics methods available on pandas objects. See the [documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats) for the other available functions.

## Referencing

* Each column of the DataFrame is referenced by its label.
* As with a NumPy array, index-based slicing can be used to reference elements within a column.

In [10]:
df['Age'][0:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

df['Age'].mean()

Each column of the DataFrame is a pandas *Series* object, so all descriptive statistics methods are available on each column.

In [11]:
# Compute median age
df['Age'].median()

28.0

Note that the `Age` column contains *NaN* values (see row 5 above). By default, descriptive statistics such as *median()* skip them.
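A minimal sketch for checking how *median()* treats missing values, using the first six `Age` values shown above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# First six Age values from the output above; the last one is missing.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan])

median_default = ages.median()             # skipna=True by default, NaN is ignored
median_strict = ages.median(skipna=False)  # NaN propagates into the result
median_manual = ages.dropna().median()     # explicit version of the default

print(median_default, median_strict, median_manual)
```

The default and the explicit *dropna()* version agree, while `skipna=False` returns *NaN* as soon as one value is missing.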

Multiple columns can be selected by passing a list of column labels to the DataFrame, as shown below.

In [12]:
MyColumns = ['Sex', 'Pclass','Age']
df[MyColumns].head()

,Sex,Pclass,Age
0,male,3,22.0
1,female,1,38.0
2,female,3,26.0
3,female,1,35.0
4,male,3,35.0
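Besides chained brackets such as `df['Age'][0:10]`, pandas offers *loc* (label-based) and *iloc* (position-based) indexers that select rows and columns in one step. The sketch below rebuilds the five rows shown above in a small frame named `people`, which is an assumption made so the example runs without the CSV file.

```python
import pandas as pd

# The five rows displayed above, rebuilt by hand.
people = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

by_label = people.loc[0:2, ["Sex", "Age"]]  # label slice: rows 0, 1 and 2 (end included)
by_position = people.iloc[0:2, [0, 2]]      # positional slice: rows 0 and 1 (end excluded)
single_value = people.loc[1, "Age"]         # one cell

print(by_label)
print(by_position)
print(single_value)
```

Note the difference in slicing: *loc* includes the end label, whereas *iloc* follows the usual Python convention and excludes the end position.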


## Filtering

A DataFrame can be indexed with a logical condition. The condition is evaluated row by row into a boolean Series, and indexing with that Series returns a new DataFrame containing only the rows where it is True.
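As a rough sketch of what happens under the hood, the condition and the selection can be written as two separate steps. The frame `ages` and its one-letter names are made up for illustration.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the float column.
ages = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                     "Age": [66.0, 22.0, None, 71.0]})

mask = ages["Age"] > 60  # boolean Series; the comparison with NaN is False
over_60 = ages[mask]     # keeps rows where mask is True, original index preserved

print(mask)
print(over_60)
```

Rows with a missing age never satisfy the comparison, so they are silently left out of the result.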

In [13]:
# Select all passengers with age greater than 60

df_AgeMoreThan60 = df[df['Age']>60]
df_AgeMoreThan60.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
170,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S


In [14]:
# Select all passengers with age less than or equal to 15
df_AgeLessThan15=df[df['Age']<=15]

# Number of passengers with Age less than or equal to 15
df_AgeLessThan15['Age'].count()

83

Let's now select only the passengers who are male and more than 60 years old.

In [15]:
# Method-1: apply two filters separately
df_AgeMoreThan60 = df[df['Age'] > 60]
# Build the second mask from the already-filtered frame so that its index matches
temp1 = df_AgeMoreThan60[df_AgeMoreThan60['Sex'] == 'male']
temp1['Sex'].head()


33     male
54     male
96     male
116    male
170    male
Name: Sex, dtype: object

In [16]:
# Method-2: apply both filters together (each condition needs its own parentheses)
MaleMoreThan60 = df[(df['Age'] > 60) & (df['Sex'] == 'male')]
MaleMoreThan60['Sex'].head()

33     male
54     male
96     male
116    male
170    male
Name: Sex, dtype: object

In [17]:
# Method-2: Applying two or more filters together
SurvivedMaleMoreThan60 = df[(df['Age']>60) & (df['Sex']=='male') & (df['Survived']==1)]
SurvivedMaleMoreThan60['Sex'].head()

570    male
630    male
Name: Sex, dtype: object
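Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not) rather than the Python keywords, and each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below uses a small made-up frame `p`; *query()* is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# Hypothetical passengers, not taken from the Titanic file.
p = pd.DataFrame({
    "Age": [66.0, 65.0, 22.0, 63.0],
    "Sex": ["male", "male", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

old = p["Age"] > 60
male = p["Sex"] == "male"

both = p[old & male]    # AND: rows 0 and 1
either = p[old | male]  # OR: every row here
not_male = p[~male]     # NOT: row 3

# The same AND filter written as a query string.
same_as_both = p.query("Age > 60 and Sex == 'male'")

print(both)
print(same_as_both)
```

Naming the masks first keeps long filters readable and lets the same mask be reused in several selections.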

## Tabulation

In [18]:
mySeries = df['Pclass']

The *[value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)* method returns the number of occurrences of each unique value, sorted from most to least frequent.

In [54]:
mySeries.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64
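Two common variations, sketched on the `Pclass` values of the first five passengers shown earlier: `normalize=True` returns fractions instead of counts, and *sort_index()* orders the result by value rather than by frequency. The variable names are illustrative.

```python
import pandas as pd

# Pclass of the first five passengers from df.head().
pclass = pd.Series([3, 1, 3, 1, 3], name="Pclass")

counts = pclass.value_counts()                 # most frequent value first
shares = pclass.value_counts(normalize=True)   # fractions that sum to 1
by_class = pclass.value_counts().sort_index()  # ordered by class number

print(counts)
print(shares)
print(by_class)
```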

In [19]:
#Cross tabulation
pd.crosstab(df['Sex'],df['Pclass'])

Sex / Pclass,1,2,3
female,94,76,144
male,122,108,347
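*crosstab* can also add totals and convert counts to proportions. The following is a small sketch on the five rows shown in earlier outputs, rebuilt in a frame named `small` so that it runs without the CSV file.

```python
import pandas as pd

# Sex and Pclass of the first five passengers.
small = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

# margins=True appends an "All" row and column holding the totals.
table = pd.crosstab(small["Sex"], small["Pclass"], margins=True)

# normalize="index" turns each row into fractions that sum to 1.
row_share = pd.crosstab(small["Sex"], small["Pclass"], normalize="index")

print(table)
print(row_share)
```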


## Dropping rows and columns

In [20]:
# Drop a column. axis=1 means the label "Age" is looked up among the columns
# (axis=0 refers to rows, axis=1 to columns).
df.drop('Age', axis=1).head()

,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S
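*drop()* returns a new DataFrame and leaves the original untouched, which is why `df` still has its `Age` column in later cells. The sketch below, on a small made-up frame `t`, also shows the `columns=` and `index=` keywords as an alternative to `axis`.

```python
import pandas as pd

# Hypothetical three-row frame used only for this illustration.
t = pd.DataFrame({"Name": ["A", "B", "C"],
                  "Age": [22.0, 38.0, 26.0],
                  "Fare": [7.25, 71.28, 7.93]})

no_age = t.drop(columns="Age")           # same as t.drop("Age", axis=1)
no_first = t.drop(index=0)               # same as t.drop(0), rows are the default axis
fewer = t.drop(columns=["Age", "Fare"])  # several labels at once

print(t.columns.tolist())  # the original frame is unchanged
print(no_age.columns.tolist())
```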


## Data frame with row labels and column labels

In [21]:
#Generate some data
import numpy as np
data = np.random.random(16)
data =  data.reshape(4,4)
data

array([[0.96659739, 0.57472695, 0.05777594, 0.06466455],
       [0.22400136, 0.77720148, 0.80550528, 0.41278442],
       [0.15016113, 0.78340363, 0.29991403, 0.0310128 ],
       [0.20600396, 0.62194191, 0.94248881, 0.82970929]])

In [22]:
# Generate column and row labels
ColumnLables=['One','Two','Three','Four']
RowLables =['Ohio','Colarado','Utah','New York']

In [23]:
# Use the DataFrame constructor to create a DataFrame with row and column labels
df2 = pd.DataFrame(data, index=RowLables, columns=ColumnLables)

In [24]:
# Drop a row by its label (rows, axis=0, are the default axis)
df2.drop('Utah')

,One,Two,Three,Four
Ohio,0.966597,0.574727,0.057776,0.064665
Colarado,0.224001,0.777201,0.805505,0.412784
New York,0.206004,0.621942,0.942489,0.829709


In [25]:
# Drop every row that has at least one missing value (Titanic data again)
df3 = df.dropna()
df3.shape

(183, 12)
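Dropping every incomplete row is drastic here: only 183 of 891 rows remain, largely because `Cabin` is missing for most passengers. A hedged sketch of a gentler alternative is to require only selected columns via `subset`. The small frame `t` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns.
t = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

complete = t.dropna()               # rows with no missing value at all
has_age = t.dropna(subset=["Age"])  # only Age must be present
missing_per_column = t.isna().sum() # how many gaps each column has

print(complete)
print(has_age)
print(missing_per_column)
```

Checking `isna().sum()` first shows which columns are responsible for most of the lost rows.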

## Combining, merging and concatenating two data frames

We will create two dataframe objects

In [26]:
df2 = pd.DataFrame(data, index=RowLables, columns=ColumnLables)
df3 = pd.DataFrame(data * 4, index=RowLables, columns=ColumnLables)

### Merge pandas objects
*merge* combines two pandas objects with a database-style join on columns or indexes. See the [merge documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge) for details.

In [27]:
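# Inner join (the default). With no `on` argument, merge joins on all common
# columns (One, Two, Three, Four). No row of df2 equals a row of df3, so the result is empty.
df4 = pd.merge(df2, df3)
df4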

,One,Two,Three,Four


In [28]:
# Outer join: keeps the unmatched rows from both frames (the row labels are discarded)
df5 = pd.merge(df2, df3, how='outer')
df5

,One,Two,Three,Four
0,0.966597,0.574727,0.057776,0.064665
1,0.224001,0.777201,0.805505,0.412784
2,0.150161,0.783404,0.299914,0.031013
3,0.206004,0.621942,0.942489,0.829709
4,3.86639,2.298908,0.231104,0.258658
5,0.896005,3.108806,3.222021,1.651138
6,0.600645,3.133615,1.199656,0.124051
7,0.824016,2.487768,3.769955,3.318837
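In the two cells above, every column is a join key, so nothing matches. Merges are more typically done on a shared key column or on the row index. The sketch below uses two small made-up frames, `left` and `right`, with a hypothetical `State` key.

```python
import pandas as pd

# Hypothetical frames sharing a "State" key; values are made up.
left = pd.DataFrame({"State": ["Ohio", "Utah", "New York"], "One": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"State": ["Ohio", "Utah", "Texas"], "Two": [10.0, 20.0, 30.0]})

inner = pd.merge(left, right, on="State")               # keys present in both frames
outer = pd.merge(left, right, on="State", how="outer")  # keys from either frame, gaps become NaN

# Join on the row labels instead of a column.
by_index = pd.merge(left.set_index("State"), right.set_index("State"),
                    left_index=True, right_index=True)

print(inner)
print(outer)
print(by_index)
```

With an explicit key, the inner join keeps Ohio and Utah, while the outer join also keeps New York and Texas with *NaN* in the column the other frame could not supply.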


### Concatenate pandas objects
*concat* stacks pandas objects along a particular axis, with optional set logic (union or intersection) along the other axes. See the [concat documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html#pandas.concat) for details.

In [29]:
pd.concat([df2,df2])

,One,Two,Three,Four
Ohio,0.966597,0.574727,0.057776,0.064665
Colarado,0.224001,0.777201,0.805505,0.412784
Utah,0.150161,0.783404,0.299914,0.031013
New York,0.206004,0.621942,0.942489,0.829709
Ohio,0.966597,0.574727,0.057776,0.064665
Colarado,0.224001,0.777201,0.805505,0.412784
Utah,0.150161,0.783404,0.299914,0.031013
New York,0.206004,0.621942,0.942489,0.829709
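As the output above shows, plain concatenation repeats the row labels. A minimal sketch of three common options follows, using two small made-up frames `a` and `b`: renumbering the rows, tagging each block with its source, and placing frames side by side.

```python
import pandas as pd

# Hypothetical frames with identical row labels.
a = pd.DataFrame({"One": [1, 2]}, index=["Ohio", "Utah"])
b = pd.DataFrame({"One": [3, 4]}, index=["Ohio", "Utah"])

stacked = pd.concat([a, b])                        # row labels repeat
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 index
tagged = pd.concat([a, b], keys=["a", "b"])        # outer index level records the source

# axis=1 places the frames side by side, aligned on the row labels.
side_by_side = pd.concat([a, b.rename(columns={"One": "Two"})], axis=1)

print(stacked)
print(renumbered)
print(tagged)
print(side_by_side)
```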


## Removing duplicates

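*drop_duplicates* removes duplicate rows, keeping the first occurrence of each. Rows are compared by their column values, not by their index labels.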

In [30]:
df6=pd.concat([df2,df2])

df6.drop_duplicates()

,One,Two,Three,Four
Ohio,0.966597,0.574727,0.057776,0.064665
Colarado,0.224001,0.777201,0.805505,0.412784
Utah,0.150161,0.783404,0.299914,0.031013
New York,0.206004,0.621942,0.942489,0.829709
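Duplicates can also be judged on a subset of columns, and *duplicated()* shows which rows would be removed. The sketch below uses a small made-up frame `d`.

```python
import pandas as pd

# Hypothetical frame: row 1 repeats row 0 exactly.
d = pd.DataFrame({"State": ["Ohio", "Ohio", "Utah", "Utah"],
                  "Value": [1, 1, 2, 3]})

unique_rows = d.drop_duplicates()  # compares all columns, keeps the first occurrence
one_per_state = d.drop_duplicates(subset="State", keep="last")  # last row for each State
flags = d.duplicated()             # True for rows that repeat an earlier row

print(unique_rows)
print(one_per_state)
print(flags)
```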


## Discretization and Binning

### Cut method
The *[cut](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html#pandas.cut)* method returns, for each value of x, the half-open bin (interval) to which it belongs. By default the bins are closed on the right, so (15, 30] excludes 15 and includes 30. See the documentation for the different options.

In [31]:
PassengerAge = df['Age']
PassengerAge = PassengerAge.dropna()
Bins = [0, 10,15,30, 40, 60, 80]

pd.cut(PassengerAge,Bins).head()


0    (15, 30]
1    (30, 40]
2    (15, 30]
3    (30, 40]
4    (30, 40]
Name: Age, dtype: category
Categories (6, interval[int64]): [(0, 10] < (10, 15] < (15, 30] < (30, 40] < (40, 60] < (60, 80]]
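Which side of each bin is closed matters for values that fall exactly on an edge. A small sketch with illustrative ages and a shortened list of bin edges, comparing the default with `right=False`:

```python
import pandas as pd

# Ages chosen to sit on bin edges (illustrative values).
ages = pd.Series([10.0, 15.0, 22.0, 30.0])
bins = [0, 10, 15, 30, 40]

right_closed = pd.cut(ages, bins)              # bins of the form (a, b]
left_closed = pd.cut(ages, bins, right=False)  # bins of the form [a, b)

# cat.codes gives the position of each value's bin in the list of categories.
print(right_closed.cat.codes.tolist())
print(left_closed.cat.codes.tolist())
```

With right-closed bins an age of 30 lands in the third bin, whereas with left-closed bins it moves up to the fourth.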

### Cut with labels for generated bins

We can also assign a label to each of the generated bins. There must be exactly one label per bin.

In [32]:
PassengerAge = df['Age']

PassengerAge = PassengerAge.dropna()

Bins = [0, 10,15,30, 40, 60, 80]

# One label per bin, in bin order: (0, 10], (10, 15], (15, 30], (30, 40], (40, 60], (60, 80]
BinLabels = ['Child', 'Teen', 'Young adult', 'Thirties', 'Middle-aged', 'Senior']

pd.cut(PassengerAge,Bins,labels=BinLabels).head()

0    Young adult
1       Thirties
2    Young adult
3       Thirties
4       Thirties
Name: Age, dtype: category
Categories (6, object): [Child < Teen < Young adult < Thirties < Middle-aged < Senior]

### Use of precision to cut numbers

In [33]:
import numpy as np
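# Passing an integer instead of a list of edges splits the range of the data
# into that many equal-width bins. precision sets the precision at which the
# bin labels are stored and displayed.
data = np.random.rand(20)
pd.cut(data, 4, precision=2)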

[(0.74, 0.95], (0.52, 0.74], (0.086, 0.3], (0.086, 0.3], (0.74, 0.95], ..., (0.74, 0.95], (0.74, 0.95], (0.74, 0.95], (0.74, 0.95], (0.74, 0.95]]
Length: 20
Categories (4, interval[float64]): [(0.086, 0.3] < (0.3, 0.52] < (0.52, 0.74] < (0.74, 0.95]]
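Because the data above is random, the bins change on every run. A deterministic sketch with fixed values makes the equal-width behaviour easier to inspect. Here `retbins=True` additionally returns the computed bin edges, and the names `values`, `binned` and `edges` are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values spanning the range 0 to 1.
values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# Four equal-width bins; retbins=True also returns the array of bin edges.
binned, edges = pd.cut(values, 4, precision=2, retbins=True)

print(binned.codes.tolist())  # bin position of each value
print(edges)                  # five edges delimit four bins
```

The lowest edge is placed slightly below the minimum of the data so that the minimum itself falls inside the first right-closed bin.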