## Data Wrangling 02

This part of the data wrangling series covers data cleaning. It replaces NaN values, applies binning to the age column so that ages are grouped into categories, and converts categorical values into dummy variables (e.g., encoding male as 1 and female as 0 for the `sex` column).

The steps of this tutorial are as follows:

**Import Libraries**

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

**Load Titanic Dataset**

In [3]:
kashti_1 = sns.load_dataset('titanic')
kashti_1.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
kashti_1.shape

(891, 15)

In [5]:
kashti_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
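
Raw null counts are easier to compare as fractions of the row count. In the output above, `deck` is missing in 688 of 891 rows (about 77%) and `age` in 177 of 891 (about 20%). The following is a minimal sketch of that calculation on a small made-up frame; the `toy` data and the name `missing_share` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (no download needed).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; the column-wise mean of booleans is the
# fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `kashti_1.isnull().mean()`. A column that is mostly empty, such as `deck`, may be a candidate for dropping rather than imputing; the tutorial fills it anyway.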


**Replacing NaN Values of Age with Mean Age**

In [6]:
mean = kashti_1['age'].mean()
mean

29.69911764705882

In [7]:
# Fill missing ages with the mean computed above
kashti_1['age'] = kashti_1['age'].fillna(mean)
kashti_1.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
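
To make the effect of mean imputation visible, here is a small sketch on a made-up series (the values and the names `ages`, `mean_age`, and `filled` are assumptions for illustration). Filling with the mean leaves the mean unchanged but reduces the spread, because every filled value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="age")

# Series.mean() skips NaN values by default.
mean_age = ages.mean()

# Replace every NaN with the mean of the observed values.
filled = ages.fillna(mean_age)

print(filled)
print("std before:", ages.std(), "std after:", filled.std())
```

If the age distribution were strongly skewed, the median (`ages.median()`) might be a reasonable alternative fill value; that choice is outside what the tutorial does.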

**Replacing NaN Values of Categorical Columns (deck, embarked, and embark_town) with the Mode of Each Column**

In [8]:
columns = kashti_1.filter(['deck', 'embarked', 'embark_town'])
columns

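|     | deck | embarked | embark_town |
|-----|------|----------|-------------|
| 0   | NaN  | S        | Southampton |
| 1   | C    | C        | Cherbourg   |
| 2   | NaN  | S        | Southampton |
| 3   | C    | S        | Southampton |
| 4   | NaN  | S        | Southampton |
| ... | ...  | ...      | ...         |
| 886 | NaN  | S        | Southampton |
| 887 | B    | S        | Southampton |
| 888 | NaN  | S        | Southampton |
| 889 | C    | C        | Cherbourg   |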


In [9]:
modes = columns.mode()
modes

|   | deck | embarked | embark_town |
|---|------|----------|-------------|
| 0 | C    | S        | Southampton |


In [10]:
# Fill each categorical column with its own mode (first row of `modes`)
kashti_2 = kashti_1.fillna(modes.iloc[0])
kashti_2

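| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.000000 | 1 | 0 | 7.2500 | S | Third | man | True | C | Southampton | no | False |
| 1 | 1 | 1 | female | 38.000000 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.000000 | 0 | 0 | 7.9250 | S | Third | woman | False | C | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.000000 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.000000 | 0 | 0 | 8.0500 | S | Third | man | True | C | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.000000 | 0 | 0 | 13.0000 | S | Second | man | True | C | Southampton | no | True |
| 887 | 1 | 1 | female | 19.000000 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | 29.699118 | 1 | 2 | 23.4500 | S | Third | woman | False | C | Southampton | no | False |
| 889 | 1 | 1 | male | 26.000000 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |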


In [11]:
kashti_2.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
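
The mode-filling step relies on two pandas behaviours: `DataFrame.mode()` can return more than one row when values tie, so `.iloc[0]` picks the first row, and `fillna` accepts a Series whose index is matched against column names. A minimal sketch on made-up data (the frame `toy` and the names `cat_cols` and `col_modes` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical frame with gaps in two categorical columns.
toy = pd.DataFrame({
    "deck": ["C", None, "B", "C", None],
    "embarked": ["S", "S", None, "C", "S"],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

cat_cols = ["deck", "embarked"]

# mode() returns a DataFrame; its first row holds one mode per column.
col_modes = toy[cat_cols].mode().iloc[0]

# The Series index (column names) decides which value fills which column,
# so columns not listed in col_modes, such as fare, are left untouched.
filled = toy.fillna(col_modes)
print(filled)
```

Restricting the modes to the categorical columns keeps numeric columns from being filled with their most frequent value by accident.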

In [12]:
kashti_2.shape

(891, 15)

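**Binning:**
Group the values of the `age` column into categories with `pd.cut`. Each label applies to the interval between two consecutive bin edges, with the right edge included. With the edges and labels used here, an age of 22 falls in (19, 39] and is therefore labelled `Teens`, and ages of 1 or below fall outside the first bin and become NaN.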

In [13]:
bins = [1, 4, 12, 19, 39, 60, 100]
labels = ['Infant', 'Toddler', 'Child', 'Teens', 'Adults', 'Middle Age Adult'] 
kashti_2['age'] = pd.cut(kashti_2['age'], bins=bins, labels=labels)
kashti_2

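| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | Teens | 1 | 0 | 7.2500 | S | Third | man | True | C | Southampton | no | False |
| 1 | 1 | 1 | female | Teens | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | Teens | 0 | 0 | 7.9250 | S | Third | woman | False | C | Southampton | yes | True |
| 3 | 1 | 1 | female | Teens | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | Teens | 0 | 0 | 8.0500 | S | Third | man | True | C | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | Teens | 0 | 0 | 13.0000 | S | Second | man | True | C | Southampton | no | True |
| 887 | 1 | 1 | female | Child | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | Teens | 1 | 2 | 23.4500 | S | Third | woman | False | C | Southampton | no | False |
| 889 | 1 | 1 | male | Teens | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |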
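
In the table above, passengers in their twenties and thirties carry the label `Teens` because `pd.cut` assigns the k-th label to the k-th interval, and the first edge of 1 shifts every label one age band upward. The sketch below compares the original edges with one possible realignment. The extra edge at 0 and the `Senior` label are assumptions added for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample ages covering every band.
ages = pd.Series([0.5, 1.0, 3.0, 10.0, 19.0, 22.0, 45.0, 70.0])

# Edges and labels as used in the tutorial: intervals are (1, 4], (4, 12], ...
orig_bins = [1, 4, 12, 19, 39, 60, 100]
orig_labels = ['Infant', 'Toddler', 'Child', 'Teens', 'Adults', 'Middle Age Adult']
original = pd.cut(ages, bins=orig_bins, labels=orig_labels)

# Assumed realignment: start at 0 so ages of 1 and below get a bin,
# and add a hypothetical 'Senior' label for the (60, 100] interval.
new_bins = [0, 1, 4, 12, 19, 39, 60, 100]
new_labels = orig_labels + ['Senior']
realigned = pd.cut(ages, bins=new_bins, labels=new_labels)

print(pd.DataFrame({"age": ages, "original": original, "realigned": realigned}))
```

The number of labels must always be one less than the number of edges, which is why adding an edge requires adding a label.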


**Categorical Values into Dummies**

In [14]:
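# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to True/False)
df = pd.get_dummies(kashti_2, columns=['sex'], dtype=int)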
df.head()

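| | survived | pclass | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | sex_female | sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Teens | 1 | 0 | 7.25 | S | Third | man | True | C | Southampton | no | False | 0 | 1 |
| 1 | 1 | 1 | Teens | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False | 1 | 0 |
| 2 | 1 | 3 | Teens | 0 | 0 | 7.925 | S | Third | woman | False | C | Southampton | yes | True | 1 | 0 |
| 3 | 1 | 1 | Teens | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False | 1 | 0 |
| 4 | 0 | 3 | Teens | 0 | 0 | 8.05 | S | Third | man | True | C | Southampton | no | True | 0 | 1 |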
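
`get_dummies` produces one indicator column per category, so `sex_female` and `sex_male` carry the same information twice. If a single column with male as 1 and female as 0 is wanted, two options are sketched below on a made-up frame (the `toy` data and the `sex_code` column name are illustrative assumptions).

```python
import pandas as pd

# Hypothetical frame with only the column of interest.
toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Option 1: drop_first=True removes the first category in sorted order
# ('female'), leaving a single sex_male indicator column.
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True, dtype=int)

# Option 2: an explicit mapping, which makes the 1/0 assignment visible.
toy["sex_code"] = toy["sex"].map({"male": 1, "female": 0})

print(encoded)
print(toy)
```

Both options give the same values here. The explicit mapping returns NaN for any value missing from the dictionary, which can help surface unexpected categories.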
