### Pandas
Pandas is an open-source data analysis and data manipulation library for Python. It provides data structures and functions for working with structured data. The name Pandas derives from “panel data”, a term for structured, multidimensional datasets. Its two main classes are DataFrame and Series.

#### Installation
```pip install pandas```

```conda install pandas```

In [1]:
# To install pandas via jupyter notebook
!pip install pandas



If pandas is already installed, pip reports that the requirement is already satisfied.

In [35]:
#import pandas and numpy
import numpy as np
import pandas as pd

In [36]:
#check the version of pandas
pd.__version__

'1.4.1'

### Pandas Data Types

* Series
* DataFrame
* Index

#### Pandas Series

In [37]:
data = pd.Series([2.0, 3.0, 5.0, np.nan, 2.7])
data

0    2.0
1    3.0
2    5.0
3    NaN
4    2.7
dtype: float64

In [38]:
# Customizing the index of a Series 
data = pd.Series([4.0, 6.0, 3.0, 8.0, 2.0, 5.0],
                index=['a', 'b', 'c', 'd', 'e', 'f'])
data

a    4.0
b    6.0
c    3.0
d    8.0
e    2.0
f    5.0
dtype: float64

In [7]:
#access the element
data['b']

6.0

In [39]:
data['f']

5.0

In [8]:
# Series as a specialized dictionary
population_dict = {'Abuja': 34567,
                  'California': 67893,
                  'Lagos': 78900,
                  'Florida': 87803,
                  'Alaska': 89237}
pop_Series = pd.Series(population_dict, name='population')    #specifying the name
pop_Series

Abuja         34567
California    67893
Lagos         78900
Florida       87803
Alaska        89237
Name: population, dtype: int64
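
The "specialized dictionary" idea can be pushed a little further. The sketch below uses a shortened copy of the population figures above (illustrative numbers only) to show which dictionary habits carry over to a Series and what a Series adds on top. The variable names are made up for this example.

```python
import pandas as pd

# Shortened copy of the illustrative population figures
population = pd.Series({'Abuja': 34567, 'California': 67893, 'Lagos': 78900})

# Dictionary-style operations work as expected
has_lagos = 'Lagos' in population      # membership is tested against the index
lagos_value = population['Lagos']      # lookup by key (label)

# Unlike a plain dict, a Series supports slicing by label (end label included)
subset = population['Abuja':'California']

# ...and arithmetic applied to every value at once
in_thousands = population / 1000
```

Label slicing includes the end label, which differs from ordinary Python slicing.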

#### Pandas DataFrame

In [41]:
#create a DataFrame from a dictionary of lists
data = {'Name': ['Jake Benson', 'Mark Robinson', 'Daniel Scout', 'Jennifer Michael', 'Adewale Yusuf', 'Bunmi Akinremi'],
       'Age' : [21, 32, 42, 33, 24, 22],
       'Sex': ['M', 'M', 'M', 'F', 'M', np.nan],
       'Country': ['USA', 'England', 'Scoutland', 'Wales', 'Nigeria', 'Nigeria'],
       'Position': ['CEO', 'CTO', 'Software Engineer', 'Sales manager', 'Data Scientist', 'Data Scientist']}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Sex,Country,Position
0,Jake Benson,21,M,USA,CEO
1,Mark Robinson,32,M,England,CTO
2,Daniel Scout,42,M,Scoutland,Software Engineer
3,Jennifer Michael,33,F,Wales,Sales manager
4,Adewale Yusuf,24,M,Nigeria,Data Scientist
5,Bunmi Akinremi,22,,Nigeria,Data Scientist


In [42]:
#view the first two rows (head() without an argument returns the first five)
df.head(2)

Unnamed: 0,Name,Age,Sex,Country,Position
0,Jake Benson,21,M,USA,CEO
1,Mark Robinson,32,M,England,CTO


In [10]:
#view the last five
df.tail()

Unnamed: 0,Name,Age,Sex,Country,Position
1,Mark Robinson,32,M,England,CTO
2,Daniel Scout,42,M,Scoutland,Software Engineer
3,Jennifer Michael,33,F,Wales,Sales manager
4,Adewale Yusuf,24,M,Nigeria,Data Scientist
5,Bunmi Akinremi,22,,Nigeria,Data Scientist


### Accessing columns in a pandas DataFrame - Filtering

In [45]:
#Select employee names
df['Name']

0         Jake Benson
1       Mark Robinson
2        Daniel Scout
3    Jennifer Michael
4       Adewale Yusuf
5      Bunmi Akinremi
Name: Name, dtype: object

In [12]:
#Select employee from Nigeria
df[df['Country'] == 'Nigeria']

Unnamed: 0,Name,Age,Sex,Country,Position
4,Adewale Yusuf,24,M,Nigeria,Data Scientist
5,Bunmi Akinremi,22,,Nigeria,Data Scientist


In [13]:
df[df['Country'] == 'USA']

Unnamed: 0,Name,Age,Sex,Country,Position
0,Jake Benson,21,M,USA,CEO


In [46]:
df[df['Age'] > 40]

Unnamed: 0,Name,Age,Sex,Country,Position
2,Daniel Scout,42,M,Scoutland,Software Engineer


In [33]:
#If then logic
df["teen/adult"] = np.where(df["Age"] < 18, "teen", "adult")
df.head()

Unnamed: 0,Name,Age,Sex,Country,Position,teen/adult
0,Jake Benson,21,M,USA,CEO,adult
1,Mark Robinson,32,M,England,CTO,adult
2,Daniel Scout,42,M,Scoutland,Software Engineer,adult
3,Jennifer Michael,33,F,Wales,Sales manager,adult
4,Adewale Yusuf,24,M,Nigeria,Data Scientist,adult
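
`np.where` covers a two-way choice. When there are more than two outcomes, one option is `np.select`, which takes a list of conditions and a matching list of labels. The sketch below uses made-up ages and age bands chosen only for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'Age': [12, 17, 24, 42, 70]})  # illustrative values

# Conditions are checked in order; the first one that is True wins
conditions = [ages['Age'] < 13, ages['Age'] < 18, ages['Age'] < 65]
labels = ['child', 'teen', 'adult']

# Rows that match no condition receive the default label
ages['age_group'] = np.select(conditions, labels, default='senior')
```

Because the conditions are evaluated in order, `Age < 18` only labels rows that did not already match `Age < 13`.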


In [16]:
#we can export data as a CSV file
df.to_csv('employee.csv', index=False)
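
To see what `index=False` changes, here is a small sketch that writes a toy DataFrame to CSV text in memory instead of a file. The frame and its values are hypothetical.

```python
import io
import pandas as pd

staff = pd.DataFrame({'Name': ['Ada', 'Bola'], 'Age': [30, 25]})  # toy data

# With no path, to_csv returns the CSV text as a string
with_index = staff.to_csv()
without_index = staff.to_csv(index=False)

# Reading the index-free text back reproduces the original frame
restored = pd.read_csv(io.StringIO(without_index))

# Reading the text that kept the index yields an extra, unnamed first column
reread_with_index = pd.read_csv(io.StringIO(with_index))
first_column = reread_with_index.columns[0]
```

When the index is written out, its header cell is empty, so `read_csv` names that column `Unnamed: 0`. Passing `index=False` on export avoids this.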

Let's get started with **Data Manipulation with Pandas**. For this purpose, we will use the Titanic dataset, which is available on Kaggle at https://www.kaggle.com/c/titanic/data

In [47]:
# first we import our data to notebook
data = pd.read_csv('titanic.csv')

#view the first five rows
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [20]:
#view the last 10 rows
data.tail(10)   #note that both head and tail accept an integer for the number of rows to show

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [21]:
# view the shape of the data; the shape tells us the number of rows and columns in our data
data.shape

(891, 12)

Our data has `891` rows and `12` columns.

In [54]:
# It is good practice to work on a copy of the DataFrame so the original stays intact; the copy function does this
data_temp = data.copy()
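
Why copy instead of simply assigning to a new name? A minimal sketch on a toy frame (hypothetical names) shows the difference: plain assignment creates a second name for the same object, while `copy()` creates an independent DataFrame.

```python
import pandas as pd

original = pd.DataFrame({'x': [1, 2, 3]})  # toy data

alias = original          # a second name for the same object
backup = original.copy()  # an independent copy

# An in-place change to the original is visible through the alias only
original.drop(0, inplace=True)

rows_in_alias = len(alias)    # follows the original
rows_in_backup = len(backup)  # unaffected by the drop
```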

In [23]:
#check the data information using the function info
data_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The `info` output summarizes the dataset: each column's name, non-null count, and data type, which is very useful for our analysis.

In [24]:
#check the statistics summary of our data using the describe function
data_temp.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [55]:
#drop a column from the dataset
data_temp.drop('PassengerId', axis=1, inplace=True)
data_temp.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
#check for missing values in the dataset
data_temp.isna().sum() #shows the number of missing values in each column

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [27]:
#Let's drop multiple columns in our dataset
data_temp.drop(['Name', 'Cabin'], axis=1, inplace=True)

#axis=1 refers to columns, whereas axis=0 refers to rows. Setting inplace=True modifies data_temp directly instead of returning a new DataFrame.
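
The `axis` and `inplace` arguments of `drop` can be sketched on a toy frame. The column names below are placeholders, not part of the Titanic data.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})  # toy data

# Without inplace=True, drop returns a new DataFrame and leaves frame untouched
no_b = frame.drop('b', axis=1)    # axis=1: drop a column by name
no_row0 = frame.drop(0, axis=0)   # axis=0: drop a row by index label

# The columns= keyword expresses the same column drop without an axis number
same_as_no_b = frame.drop(columns=['b'])
```

Assigning the result to a new name, as done here, keeps the original available in case a later step needs the dropped data.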

In [28]:
#view the data
data_temp.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,S
3,1,1,female,35.0,1,0,113803,53.1,S
4,0,3,male,35.0,0,0,373450,8.05,S


In [56]:
#dropping rows in the dataframe
data_temp.drop(2, axis = 0, inplace=True)

In [57]:
#view the data
data_temp.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
3,1,1,female,35.0,1,0,113803,53.1,S
4,0,3,male,35.0,0,0,373450,8.05,S
5,0,3,male,,0,0,330877,8.4583,Q


You can see that the row with index label `2` is no longer in our data.

In [31]:
#check the columns in the data
data_temp.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')

In [32]:
#check the index
data_temp.index

Int64Index([  0,   1,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            881, 882, 883, 884, 885, 886, 887, 888, 889, 890],
           dtype='int64', length=890)

In [58]:
#rename a column in the dataset; here we rename the Sex column to gender

data_temp_renamed = data_temp.rename(columns={'Sex': 'gender'})
data_temp_renamed.head()

Unnamed: 0,Survived,Pclass,gender,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
3,1,1,female,35.0,1,0,113803,53.1,S
4,0,3,male,35.0,0,0,373450,8.05,S
5,0,3,male,,0,0,330877,8.4583,Q


In [31]:
#Select all columns of a specific data type
object_dtype = data_temp.select_dtypes('object')
object_dtype.head()

Unnamed: 0,Sex,Ticket,Embarked
0,male,A/5 21171,S
1,female,PC 17599,C
3,female,113803,S
4,male,373450,S
5,male,330877,Q


In [32]:
float_dtype = data_temp.select_dtypes('float')
float_dtype.head()

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833
3,35.0,53.1
4,35.0,8.05
5,,8.4583


In [33]:
# Slicing the dataset
data_temp.iloc[:5, 0] #selecting by integer position: the first five rows of the first column

0    0
1    1
3    1
4    0
5    0
Name: Survived, dtype: int64

In [34]:
#selecting by label: rows up to and including index label 5 of the column named Sex
data_temp.loc[:5, 'Sex']

0      male
1    female
3    female
4      male
5      male
Name: Sex, dtype: object
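
The two slices above happen to return the same rows, which can hide the difference between `iloc` and `loc`. A minimal sketch on a toy Series (hypothetical values) makes it visible: `iloc` counts positions and excludes the end, while `loc` matches index labels and includes the end.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd', 'e'])  # default index labels 0..4

by_position = letters.iloc[:2]   # positions 0 and 1; the end position is excluded
by_label = letters.loc[:2]       # labels 0, 1 and 2; the end label is included

# After dropping a row, labels and positions no longer line up
trimmed = letters.drop(2)
third_row = trimmed.iloc[2]           # position 2 now holds the row labelled 3
label_2_present = 2 in trimmed.index  # False, so trimmed.loc[2] would raise KeyError
```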

In [35]:
df_dup = data.copy()
#duplicate the first row
row = df_dup.iloc[:1]
#add the duplicated row to the dataset (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_dup = pd.concat([df_dup, row], ignore_index=True)
df_dup.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [36]:
#check the duplicated data
df_dup[df_dup.duplicated()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
891,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [37]:
#drop duplicates (this returns a new DataFrame; df_dup itself is unchanged)
df_dup.drop_duplicates()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
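
The duplicate-handling calls can be sketched on a toy frame where the result is easy to check by eye. The `orders` data is invented for this example.

```python
import pandas as pd

orders = pd.DataFrame({'id': [1, 2, 1], 'item': ['pen', 'ink', 'pen']})  # toy data

# duplicated() marks the repeats; by default the first occurrence is not flagged
dup_mask = orders.duplicated()

# drop_duplicates() returns a new frame, so assign the result to keep it
deduped = orders.drop_duplicates()

# keep=False flags every row that has a twin, including the first occurrence
all_copies = orders[orders.duplicated(keep=False)]
```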


In [38]:
data_temp.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')

In [39]:
#selecting the rows where a column has a specific value
data_temp[data_temp['Pclass'] == 1]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
3,1,1,female,35.0,1,0,113803,53.1000,S
6,0,1,male,54.0,0,0,17463,51.8625,S
11,1,1,female,58.0,0,0,113783,26.5500,S
23,1,1,male,28.0,0,0,113788,35.5000,S
...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,11751,52.5542,S
872,0,1,male,33.0,0,0,695,5.0000,S
879,1,1,female,56.0,0,1,11767,83.1583,C
887,1,1,female,19.0,0,0,112053,30.0000,S


In [40]:
data_temp[data_temp['Parch'] == 1]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
7,0,3,male,2.0,3,1,349909,21.0750,S
10,1,3,female,4.0,1,1,PP 9549,16.7000,S
16,0,3,male,2.0,4,1,382652,29.1250,Q
24,0,3,female,8.0,3,1,349909,21.0750,S
50,0,3,male,7.0,4,1,3101295,39.6875,S
...,...,...,...,...,...,...,...,...,...
856,1,1,female,45.0,1,1,36928,164.8667,S
869,1,3,male,4.0,1,1,347742,11.1333,S
871,1,1,female,47.0,1,1,11751,52.5542,S
879,1,1,female,56.0,0,1,11767,83.1583,C


In [41]:
#Select rows matching any of several values in a column (Pclass has no 0, so only class 1 matches here)
data_temp[data_temp['Pclass'].isin([1,0])]

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
3,1,1,female,35.0,1,0,113803,53.1000,S
6,0,1,male,54.0,0,0,17463,51.8625,S
11,1,1,female,58.0,0,0,113783,26.5500,S
23,1,1,male,28.0,0,0,113788,35.5000,S
...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,11751,52.5542,S
872,0,1,male,33.0,0,0,695,5.0000,S
879,1,1,female,56.0,0,1,11767,83.1583,C
887,1,1,female,19.0,0,0,112053,30.0000,S
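
Filters like the ones above can also be combined. The sketch below uses a toy table whose column names mirror the Titanic data; the values are made up.

```python
import pandas as pd

passengers = pd.DataFrame({
    'Pclass': [1, 2, 3, 1],
    'Sex': ['female', 'male', 'male', 'male'],
    'Fare': [71.3, 13.0, 7.9, 51.9],
})  # toy values

# isin keeps rows whose value appears anywhere in the list
first_or_second = passengers[passengers['Pclass'].isin([1, 2])]

# Combine masks with & (and) or | (or); wrap each condition in parentheses
first_class_men = passengers[(passengers['Pclass'] == 1) & (passengers['Sex'] == 'male')]

# ~ negates a mask
not_third = passengers[~(passengers['Pclass'] == 3)]
```

The parentheses matter because `&` and `|` bind more tightly than `==` in Python.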


### Groupby in DataFrame

In [42]:
data.groupby('Sex').agg({'PassengerId': 'count'})

Sex,PassengerId
female,314
male,577


In [43]:
data.groupby('Sex').agg({'Age':'mean'})

Sex,Age
female,27.915709
male,30.726645
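
Several statistics can be computed in one `groupby` call using named aggregation, where each keyword names an output column and gives a `(column, function)` pair. This sketch runs on a toy table (invented values), and the output column names are arbitrary.

```python
import pandas as pd

passengers = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female'],
    'Age': [38.0, 22.0, None, 26.0],
    'Survived': [1, 0, 0, 1],
})  # toy values

summary = passengers.groupby('Sex').agg(
    passengers=('Survived', 'count'),    # rows per group (Survived has no gaps)
    known_ages=('Age', 'count'),         # count skips missing values
    mean_age=('Age', 'mean'),            # mean also skips missing values
    survival_rate=('Survived', 'mean'),  # mean of a 0/1 column is a proportion
)
```

Note that `count` and `mean` ignore missing values, so `known_ages` can be smaller than `passengers`.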


#### Performing Merge on Pandas

The tips dataset can be downloaded from the link https://www.kaggle.com/code/sanjanabasu/tips-dataset/data

In [45]:
#import a second dataset, tips, to merge with
data_2 = pd.read_csv('data/tips.csv')
data_2.head()

Unnamed: 0,total_bill,tip,gender,smoker,day,time,size
0,2125.5,360.79,Male,No,Thur,Lunch,1
1,2727.18,259.42,Female,No,Sun,Dinner,5
2,1066.02,274.68,Female,Yes,Thur,Dinner,4
3,3493.45,337.9,Female,No,Sun,Dinner,1
4,3470.56,567.89,Male,Yes,Sun,Lunch,6


In [46]:
#export the renamed dataset
data_temp_renamed.to_csv('data/titanic_renamed.csv', index=False)

In [47]:
data_3 = pd.read_csv('data/titanic_renamed.csv')
data_3.head()

Unnamed: 0,Survived,Pclass,gender,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C
2,1,1,female,35.0,1,0,113803,53.1,S
3,0,3,male,35.0,0,0,373450,8.05,S
4,0,3,male,,0,0,330877,8.4583,Q


In [48]:
#Let's merge data_2 into data_3 on the gender column
merged_data = data_3.merge(data_2, on='gender', how='left')
merged_data

Unnamed: 0,Survived,Pclass,gender,Age,SibSp,Parch,Ticket,Fare,Embarked,total_bill,tip,smoker,day,time,size
0,0,3,male,22.0,1,0,A/5 21171,7.2500,S,,,,,,
1,1,1,female,38.0,1,0,PC 17599,71.2833,C,,,,,,
2,1,1,female,35.0,1,0,113803,53.1000,S,,,,,,
3,0,3,male,35.0,0,0,373450,8.0500,S,,,,,,
4,0,3,male,,0,0,330877,8.4583,Q,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,2,male,27.0,0,0,211536,13.0000,S,,,,,,
886,1,1,female,19.0,0,0,112053,30.0000,S,,,,,,
887,0,3,female,,1,2,W./C. 6607,23.4500,S,,,,,,
888,1,1,male,26.0,0,0,111369,30.0000,C,,,,,,


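The `NaN` values appear because no key values match: as the previews above show, `gender` is lowercase (`male`, `female`) in the Titanic data but capitalized (`Male`, `Female`) in the tips data, so the left join keeps every Titanic row and fills the tips columns with `NaN`. Beyond `gender`, the two datasets share no columns.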
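
When a left merge returns only `NaN` on the right-hand side, it is worth checking whether the key values actually match. The sketch below uses two toy frames whose `gender` keys differ only in letter case, as the previews of `data_3` and `data_2` suggest. The `avg_tip` column and its values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'gender': ['male', 'female'], 'Fare': [7.25, 71.28]})
right = pd.DataFrame({'gender': ['Male', 'Female'], 'avg_tip': [3.1, 2.8]})  # hypothetical

# indicator=True adds a _merge column showing where each row found a match
unmatched = left.merge(right, on='gender', how='left', indicator=True)

# Normalizing the key to lowercase lets the rows line up
right_norm = right.assign(gender=right['gender'].str.lower())
matched = left.merge(right_norm, on='gender', how='left', indicator=True)
```

In this toy case each key appears once on the right. If several right-hand rows share a key, each left-hand row is repeated once per match, so the merged frame can grow much larger than either input.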

In [49]:
#We can drop every column that contains a NaN value with the code below
merged_data_dropna = merged_data.dropna(axis=1)

In [50]:
merged_data_dropna.isna().sum()

Survived    0
Pclass      0
gender      0
SibSp       0
Parch       0
Ticket      0
Fare        0
dtype: int64

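Notice that no missing values remain. However, `dropna(axis=1)` removes every column that contains at least one missing value, so columns such as `Age` and `Embarked` are lost along with the empty tips columns. This is rarely a good way to handle missing data, because it discards useful information and can bias the analysis.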
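
How much data `dropna` discards depends on its arguments. A small sketch on a toy frame (invented values, column names borrowed from the Titanic data) compares three variants.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Fare': [7.25, 8.46, 53.1],
    'Embarked': ['S', 'Q', None],
})  # toy values

drop_cols = sample.dropna(axis=1)              # keep only columns without gaps
drop_rows = sample.dropna(axis=0)              # keep only rows without gaps
drop_if_no_age = sample.dropna(subset=['Age']) # only consider gaps in Age
```

Here `axis=1` leaves a single column and `axis=0` leaves a single row, while `subset` removes only the one row that lacks an age.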

Let's view the initial data and deal with missing values appropriately.

In [51]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [52]:
data.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We can drop the `Cabin` column, since about 77% of its values (687 of 891) are missing.

In [53]:
data.drop('Cabin', axis=1, inplace=True)

In [54]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [55]:
#check if the cabin column is dropped
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')

In [56]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


For numerical columns, we can fill missing values with the mean; for categorical columns, we can use the mode. Other options include the `bfill` and `ffill` techniques.

In [57]:
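#fill each column separately; calling fillna on the whole DataFrame would put the mean age into every column
data['Age'] = data['Age'].fillna(data['Age'].mean())
#mode() returns a Series, so take its first value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])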

In [58]:
data.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Now all missing values are filled, and no column contains missing values anymore.
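
The `ffill` and `bfill` techniques mentioned earlier were not used on the Titanic data, so here is a minimal sketch of them on a toy Series with made-up values.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])  # toy values

# Forward fill: carry the last seen value forward into the gaps
forward = readings.ffill()

# Backward fill: pull the next seen value back into the gaps
backward = readings.bfill()

# A trailing gap has no later value to borrow, so it stays missing
still_missing = backward.isna().sum()
```

These fills depend on row order, so they tend to suit ordered data such as time series better than an unordered passenger list.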