# Data Wrangling

Data wrangling is the process of transforming and mapping data from one format into another so that it is easier to use for tasks such as business analytics or machine learning. It can involve data collection, exploratory analysis, data cleansing, creating data structures, and storage.

**Contents**
1. Handling Missing Values
2. Data Formatting
3. Data Normalization
    - Scaling
    - Centralizing
4. Data Binning
    - For groups of data
5. Making Dummies of Categorical Data
    - categorical ---> Numerical

Import Libraries

In [68]:
import pandas as pd
import numpy as np
import seaborn as sns

Loading the Kashti (Titanic) Dataset

In [69]:
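kashti = sns.load_dataset('titanic')
# ks1 and ks2 are additional names for the same DataFrame, not independent copies;
# use kashti.copy() when an independent copy is needed
ks1 = kashti
ks2 = kashti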
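
Plain assignment binds a second name to the same DataFrame, so an in-place change made through one name is visible through the other. A minimal sketch on a toy frame (the names `df`, `alias`, and `independent` are illustrative and not part of the notebook):

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, None, 26.0]})

alias = df               # second name for the same object
independent = df.copy()  # separate object with its own data

df.dropna(inplace=True)  # in-place change through the original name

print(alias is df)       # both names point to one DataFrame
print(len(alias))        # the alias sees the dropped row
print(len(independent))  # the copy still has all three rows
```

This is why a later `dropna(..., inplace=True)` on `kashti` also changes what `ks1` shows.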

In [70]:
kashti.head()

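| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |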


Making a change to the whole column

In [71]:
# simple math operation: add 5 to every value in 'age' (NaN stays NaN)
(kashti['age']+5).head(10)

0    27.0
1    43.0
2    31.0
3    40.0
4    40.0
5     NaN
6    59.0
7     7.0
8    32.0
9    19.0
Name: age, dtype: float64

**Dealing with Missing Values**

In [72]:
# Count the missing values in each column
kashti.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [73]:
kashti.shape

(891, 15)

In [74]:
# dropna removes rows (axis=0) or columns (axis=1) that contain missing values
kashti.dropna(subset=['deck'], axis=0, inplace=True)  # drops every row where 'deck' is NaN; the column itself stays

# inplace=True modifies the DataFrame directly instead of returning a new one

In [75]:
kashti.shape

(203, 15)
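
To make the difference between dropping rows and dropping a column concrete, here is a small sketch on a toy frame (the data and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan],
    "deck": [np.nan, "C", "E"],
})

rows_dropped = df.dropna(subset=["deck"])  # removes rows where 'deck' is NaN
col_dropped = df.drop(columns=["deck"])    # removes the 'deck' column itself

print(rows_dropped.shape)  # one row fewer, same columns
print(col_dropped.shape)   # same rows, one column fewer
```

The shape change from (891, 15) to (203, 15) above matches the first case: rows were removed and all 15 columns remain.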

In [76]:
kashti.isnull().sum()

survived        0
pclass          0
sex             0
age            19
sibsp           0
parch           0
fare            0
embarked        2
class           0
who             0
adult_male      0
deck            0
embark_town     2
alive           0
alone           0
dtype: int64

In [77]:
# Drop every row that still has a missing value in any column
kashti = kashti.dropna()
kashti.isnull().sum()


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [78]:
kashti.shape

(182, 15)

Replacing Missing Values with the Average of that Column

In [79]:
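# ks1 is another name for the DataFrame that was filtered in place above (203 rows),
# so this is the mean age of those rows; mean() skips NaN by default
mean = ks1['age'].mean()
mean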

35.77945652173913

In [80]:
mean_2 = kashti['age'].mean()
mean_2

35.62318681318681

In [81]:
# Replace NaN in 'age' with the column mean and assign the result back
ks1['age'] = ks1['age'].replace(np.nan, mean)

In [82]:
ks1.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
deck           0
embark_town    2
alive          0
alone          0
dtype: int64
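
The same imputation can also be written with `fillna`, which some readers find more direct than `replace(np.nan, ...)`. A minimal sketch on a toy Series (the name `ages` is illustrative):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0])

mean_age = ages.mean()          # NaN is skipped when computing the mean
filled = ages.fillna(mean_age)  # returns a new Series; ages itself is unchanged

print(mean_age)
print(filled.tolist())
```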

Replacing NaN Values of Categorical Columns (deck, embarked, and embark_town) with the Mode of Each Series

Because 'deck', 'embark_town', and 'embarked' are *categorical*, a mean cannot be computed for them. 
To clean these columns we have the following options:
- It is better to *drop the 'deck' column*: more than 75% of its values are missing in the original data (688 of 891), so it is unlikely to help the analysis.
- If you still don't want to drop it, you can replace the missing values with the most frequent value of that feature, i.e. the mode of the column.
- The other two features, 'embark_town' and 'embarked', have only 2 missing values each, so we can simply remove those rows, or replace NaN with the mode here too.
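
The options above can be sketched on a toy frame. The 0.5 cut-off for dropping a column is an assumption chosen for illustration, not a rule from the notebook, and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "deck": ["C", np.nan, np.nan, np.nan],
    "embarked": ["S", "S", "C", np.nan],
})

# Share of missing values per column
missing_share = df.isnull().mean()

# Drop columns that are mostly empty (assumed threshold: more than half missing)
to_drop = missing_share[missing_share > 0.5].index
reduced = df.drop(columns=to_drop)

# Fill the remaining gaps with each column's most frequent value
filled = reduced.fillna(reduced.mode().iloc[0])

print(missing_share)
print(filled)
```

`mode()` returns a DataFrame because a column can have several modes; `.iloc[0]` picks the first one for each column.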

In [83]:
columns = ks1.filter(['deck', 'embarked', 'embark_town'])
columns

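| | deck | embarked | embark_town |
|---|---|---|---|
| 1 | C | C | Cherbourg |
| 3 | C | S | Southampton |
| 6 | E | S | Southampton |
| 10 | G | S | Southampton |
| 11 | C | S | Southampton |
| ... | ... | ... | ... |
| 871 | D | S | Southampton |
| 872 | B | S | Southampton |
| 879 | C | C | Cherbourg |
| 887 | B | S | Southampton |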


In [84]:
modes = columns.mode()
modes

| | deck | embarked | embark_town |
|---|---|---|---|
| 0 | C | S | Southampton |


In [85]:
# fill the NaN values of every column with that column's most frequent value (first row of mode())
kashti_clean = ks1.fillna(ks1.mode().iloc[0])
kashti_clean

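| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | G | Southampton | yes | False |
| 11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | C | Southampton | yes | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 871 | 1 | 1 | female | 47.0 | 1 | 1 | 52.5542 | S | First | woman | False | D | Southampton | yes | False |
| 872 | 0 | 1 | male | 33.0 | 0 | 0 | 5.0000 | S | First | man | True | B | Southampton | no | True |
| 879 | 1 | 1 | female | 56.0 | 0 | 1 | 83.1583 | C | First | woman | False | C | Cherbourg | yes | False |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |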


In [86]:
kashti_clean.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [87]:
kashti_clean.shape

(203, 15)

**Data Formatting**

In [88]:
# Check the data type of each column before converting any of them
kashti_clean.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [89]:
# Convert the datatype from one to another type
kashti_clean['survived'] = kashti_clean['survived'].astype("float64")
kashti_clean.dtypes

survived        float64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [90]:
# convert the age from years into days
kashti_clean['age'] = kashti_clean['age']*365
kashti_clean.head(5) 

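| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 1 | female | 13870.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1.0 | 1 | female | 12775.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0.0 | 1 | male | 19710.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 10 | 1.0 | 3 | female | 1460.0 | 1 | 1 | 16.7 | S | Third | child | False | G | Southampton | yes | False |
| 11 | 1.0 | 1 | female | 21170.0 | 0 | 0 | 26.55 | S | First | woman | False | C | Southampton | yes | True |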


In [103]:
# Converting float into int in the series of age
kashti_clean['age'] = kashti_clean['age'].astype("int64")
kashti_clean.head(5)

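| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 1 | female | 13870 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1.0 | 1 | female | 12775 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0.0 | 1 | male | 19710 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 10 | 1.0 | 3 | female | 1460 | 1 | 1 | 16.7 | S | Third | child | False | G | Southampton | yes | False |
| 11 | 1.0 | 1 | female | 21170 | 0 | 0 | 26.55 | S | First | woman | False | C | Southampton | yes | True |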
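
One detail of the float-to-int conversion is worth knowing: `astype("int64")` drops the fractional part instead of rounding. A minimal sketch with two illustrative ages in years (the variable names are hypothetical):

```python
import pandas as pd

age_days = pd.Series([0.92, 38.0]) * 365    # roughly 335.8 and exactly 13870.0

truncated = age_days.astype("int64")        # fractional part is discarded
rounded = age_days.round().astype("int64")  # round to the nearest day first

print(truncated.tolist())
print(rounded.tolist())
```

If the nearest whole day matters, calling `round()` before the cast is one option.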


In [105]:
# Rename a column
kashti_clean.rename(columns={"age": "Age in Days"}, inplace=True)
kashti_clean.head()

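| | survived | pclass | sex | Age in Days | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 1 | female | 13870 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1.0 | 1 | female | 12775 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0.0 | 1 | male | 19710 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 10 | 1.0 | 3 | female | 1460 | 1 | 1 | 16.7 | S | Third | child | False | G | Southampton | yes | False |
| 11 | 1.0 | 1 | female | 21170 | 0 | 0 | 26.55 | S | First | woman | False | C | Southampton | yes | True |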


**Data Normalization**

- Brings the data onto a uniform scale
- Makes sure every feature has a comparable impact
- One fish in the sea versus one fish in a jar
- Also helps for computational reasons
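
To see why scale affects a feature's impact, the sketch below measures how much each column contributes to the squared Euclidean distance between two passengers, before and after dividing by the column maximum. The three rows reuse the age (in days) and fare values of the first rows shown above; the helper `squared_gap_share` is a hypothetical name introduced only for this illustration:

```python
import numpy as np

# rows: passengers, columns: [age in days, fare]
data = np.array([
    [13870.0, 71.2833],
    [12775.0, 53.1],
    [19710.0, 51.8625],
])

def squared_gap_share(a, b):
    """Share of the squared Euclidean distance contributed by each column."""
    sq = (a - b) ** 2
    return sq / sq.sum()

raw_share = squared_gap_share(data[0], data[1])

scaled = data / data.max(axis=0)  # simple feature scaling, column by column
scaled_share = squared_gap_share(scaled[0], scaled[1])

print(raw_share.round(4))     # age in days dominates almost completely
print(scaled_share.round(4))  # fare now carries most of the difference
```

On the raw values the fare difference is practically invisible next to the age difference, simply because ages in days are much larger numbers.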

In [106]:
ks4 = kashti_clean[["Age in Days", "fare"]]
ks4.head()

| | Age in Days | fare |
|---|---|---|
| 1 | 13870 | 71.2833 |
| 3 | 12775 | 53.1 |
| 6 | 19710 | 51.8625 |
| 10 | 1460 | 16.7 |
| 11 | 21170 | 26.55 |


- The two columns above span very different ranges, so they need to be normalized before they can be compared
- Normalization rescales the values, for example to the range 0 to 1

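**Methods of Normalization**
1. Simple feature scaling
    - x(new) = x(old) / x(max)
2. Min-max method
    - x(new) = (x(old) - x(min)) / (x(max) - x(min))
3. Z-score (standard score), typically between -3 and +3
    - x(new) = (x(old) - mean) / std
4. Log transformation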
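
The four methods can be sketched as small helper functions. The function names and the toy Series are illustrative and not part of the notebook; `np.log1p`, i.e. log(1 + x), is used for the log transformation on the assumption that the data may contain zeros:

```python
import numpy as np
import pandas as pd

def simple_scale(x):
    """Divide by the maximum; for non-negative data the result lies in [0, 1]."""
    return x / x.max()

def min_max(x):
    """Shift by the minimum, then divide by the range; the result lies in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def log_transform(x):
    """log(1 + x); defined for every x >= 0, including zero."""
    return np.log1p(x)

x = pd.Series([5.0, 10.0, 20.0, 45.0])

print(simple_scale(x).tolist())
print(min_max(x).tolist())
print(z_score(x).round(4).tolist())
print(log_transform(x).round(4).tolist())
```

Note the parentheses in `min_max` and `z_score`: without them, Python's operator precedence performs the division before the subtraction and the result is no longer a normalization.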

In [107]:
# Simple feature scaling: divide each column by its maximum
ks4 = ks4.copy()  # work on an independent copy so the assignments below do not raise SettingWithCopyWarning
ks4['fare'] = ks4['fare']/ks4['fare'].max()
ks4['Age in Days'] = ks4['Age in Days']/ks4['Age in Days'].max()
ks4.head()


| | Age in Days | fare |
|---|---|---|
| 1 | 0.475 | 0.139136 |
| 3 | 0.4375 | 0.103644 |
| 6 | 0.675 | 0.101229 |
| 10 | 0.05 | 0.032596 |
| 11 | 0.725 | 0.051822 |


In [108]:
# Min-max method: (x - min) / (max - min); the parentheses are required
ks4['fare'] = (ks4['fare'] - ks4['fare'].min()) / (ks4['fare'].max() - ks4['fare'].min())
ks4['Age in Days'] = (ks4['Age in Days'] - ks4['Age in Days'].min()) / (ks4['Age in Days'].max() - ks4['Age in Days'].min())
ks4.head()

| | Age in Days | fare |
|---|---|---|
| 1 | 0.468907 | 0.139136 |
| 3 | 0.430972 | 0.103644 |
| 6 | 0.671228 | 0.101229 |
| 10 | 0.038975 | 0.032596 |
| 11 | 0.721808 | 0.051822 |


In [109]:
# Z-score (standard score): (x - mean) / std; the parentheses are required
ks4['fare'] = (ks4['fare'] - ks4['fare'].mean()) / ks4['fare'].std()
ks4


In [114]:
# Log transformation
# log is undefined for zero or negative values, so start again from the original fares
# and use log1p, i.e. log(1 + x), which also handles a fare of 0
ks4['fare'] = np.log1p(kashti_clean['fare'])
ks4.head()
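
The logarithm is only defined for positive inputs, so a column that has already been centered, and therefore contains negative values, turns into NaN under `np.log`. A minimal sketch on toy values (the Series names are illustrative; using `np.log1p` on the raw values is one possible remedy, not the only one):

```python
import numpy as np
import pandas as pd

centered = pd.Series([-0.5, 0.0, 0.5])  # e.g. a column after z-scoring
raw = pd.Series([0.0, 7.25, 71.2833])   # non-negative values such as fares

# Silence the expected floating-point warnings for this demonstration
with np.errstate(invalid="ignore", divide="ignore"):
    log_centered = np.log(centered)     # NaN for negatives, -inf for zero

log_raw = np.log1p(raw)                 # log(1 + x): finite for every x >= 0

print(log_centered.tolist())
print(log_raw.round(4).tolist())
```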
