### Data Wrangling 

- The process of cleaning disorganized or incomplete raw data and unifying (standardizing) messy, complex datasets for easy access and analysis.

- It involves data mapping, which is the most important step in any data integration process. Without a proper data mapping strategy, data transformation, and filtering, the data will remain prone to errors.

- Operations performed on data during wrangling include joining, parsing, cleaning, unifying, and filtering rows or columns of the dataset to produce the desired output.
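
The operations listed above can be sketched on a tiny example. This is only an illustration under stated assumptions: the two small tables, the missing age, and the 7.5 fare threshold are made up for the sketch and are not part of the dataset loaded below.

```python
import pandas as pd

# Hypothetical toy tables (illustrative values only)
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Age": [22.0, None, 35.0],
})
fares = pd.DataFrame({"PassengerId": [1, 2, 3], "Fare": [7.25, 7.925, 8.05]})

# Joining: combine the two tables on their shared key
merged = passengers.merge(fares, on="PassengerId", how="left")

# Parsing: extract the surname (the text before the first comma)
merged["Surname"] = merged["Name"].str.split(",").str[0]

# Cleaning: fill the missing age with the median of the known ages
merged["Age"] = merged["Age"].fillna(merged["Age"].median())

# Filtering: keep only rows with a fare above the (arbitrary) threshold
filtered = merged[merged["Fare"] > 7.5]
print(filtered)
```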

In [2]:
#import libraries

import pandas as pd
import numpy as np


In [3]:
#load data as a dataframe

dataframe = pd.read_csv("downloads/train_eda.csv")
dataframe.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Features (columns): Name, Pclass, Age, Sex, etc.

Observations (rows): each row holds one passenger's information.
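
To check how many observations and features a dataframe has, `shape`, `columns`, and `dtypes` are usually enough. The sketch below rebuilds three rows of the preview by hand (only a few columns, and `sample` is a name chosen for this illustration) so that it runs without the CSV file.

```python
import pandas as pd

# Three rows copied from the preview above; only a few columns are kept
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# shape is (number of rows, number of columns)
n_observations, n_features = sample.shape
print(n_observations, n_features)

# Column names and the data type of each feature
print(list(sample.columns))
print(sample.dtypes)
```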

#### Replacing Values

In [4]:
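# Replace values in the 'Sex' column
# (assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas)

dataframe['Sex'] = dataframe['Sex'].replace(["female","male"],["woman","man"])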
dataframe.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",man,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",woman,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",woman,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",man,35.0,0,0,373450,8.05,,S
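
`replace` also accepts a dictionary, which keeps each old value next to its new one. A minimal sketch on a hand-built frame (`sample` and its values are assumptions for illustration), with `Series.map` shown for contrast because it behaves differently for unmapped values:

```python
import pandas as pd

sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Nested dict: {column: {old: new}}; returns a new DataFrame, sample is unchanged
by_column = sample.replace({"Sex": {"female": "woman", "male": "man"}})
print(by_column)

# replace leaves values that are not in the mapping untouched ...
partial_replace = sample["Sex"].replace({"female": "woman"})

# ... while map turns values that are not in the mapping into NaN
partial_map = sample["Sex"].map({"female": "woman"})
print(partial_replace.tolist(), partial_map.tolist())
```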


#### Deleting a Column 

In [9]:
# Delete a column 
dataframe.drop('Age', axis=1).head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",man,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,1,0,PC 17599,71.2833,C85,C


In [10]:
# Drop several columns at once by specifying their names

dataframe.drop(['Age','Sex'], axis='columns').head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0,PC 17599,71.2833,C85,C


In [15]:
# Drop a column by its position (position 1 is 'Survived' here)

dataframe.drop(dataframe.columns[1], axis = 1).head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",man,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",woman,26.0,0,0,STON/O2. 3101282,7.925,,S
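
Note that `drop` returns a new dataframe and leaves the original untouched, which is why `Age` is still present in the later cells. A small sketch, assuming a hand-built `sample` frame and a deliberately non-existent column name `NotAColumn`:

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# drop returns a copy without the column; keep the result to make it permanent
trimmed = sample.drop(columns=["Age"])
print(list(sample.columns))   # original still has 'Age'
print(list(trimmed.columns))

# errors="ignore" skips labels that do not exist instead of raising KeyError
safe = sample.drop(columns=["Age", "NotAColumn"], errors="ignore")
print(list(safe.columns))
```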


#### Deleting a row

In [16]:
# Drop the rows where 'Sex' is 'man' and show the first five remaining rows

dataframe[dataframe['Sex']!='man'].head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",woman,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",woman,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",woman,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",woman,14.0,1,0,237736,30.0708,,C
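
Row filters like the one above can be combined. The sketch below uses five rows rebuilt from the first preview (the age threshold of 30 and the ID list are arbitrary choices for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Sex": ["man", "woman", "woman", "woman", "man"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
kept = sample[(sample["Sex"] == "woman") & (sample["Age"] > 30)]
print(kept)

# ~ negates a mask: drop every row whose PassengerId is in the list
remaining = sample[~sample["PassengerId"].isin([1, 5])]
print(remaining)
```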


#### Deleting a single row by row index

In [17]:
# Drop the row with index 0 and show the first two remaining rows

dataframe[dataframe.index !=0].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",woman,26.0,0,0,STON/O2. 3101282,7.925,,S
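
An alternative to the boolean mask is `drop(index=...)`, which removes rows by index label. A minimal sketch on a hand-built `sample` frame (an assumption for illustration); the index keeps its gaps unless it is reset:

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["man", "woman", "woman"],
})

# Remove the row whose index label is 0
without_first = sample.drop(index=0)
print(without_first.index.tolist())

# Renumber the index from 0 and discard the old labels
renumbered = without_first.reset_index(drop=True)
print(renumbered.index.tolist())
```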


#### Dropping duplicate rows

In [18]:
# Drop duplicate rows and show the first four rows of the output
dataframe.drop_duplicates().head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",man,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",woman,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",woman,35.0,1,0,113803,53.1,C123,S


In [19]:
# Show the number of rows before and after dropping duplicates

print("length of an original dataframe:", len(dataframe))
print("Number of rows after deduping:", len(dataframe.drop_duplicates()))

length of an original dataframe: 891
Number of rows after deduping: 891


In [20]:
# Drop duplicates based on the 'Sex' column only (keeps the first row for each value)
dataframe.drop_duplicates(subset =['Sex'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",man,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",woman,38.0,1,0,PC 17599,71.2833,C85,C


#### Grouping rows by value

In [21]:
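# Calculate the mean of each numeric column, grouped by 'Sex'
# (numeric_only=True skips text columns such as 'Name', which recent pandas versions refuse to average)
dataframe.groupby('Sex').mean(numeric_only=True)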

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
man,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
woman,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818


In [22]:
# Group rows by 'Survived' and count the rows in each group

dataframe.groupby('Survived')['Name'].count()

Survived
0    549
1    342
Name: Name, dtype: int64
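
One detail worth knowing: `count` only counts non-missing values of the selected column, while `size` counts all rows in the group. The two agree above because `Name` has no missing values. A sketch with made-up data that contains gaps:

```python
import pandas as pd

# Hypothetical frame with two missing ages
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, None],
})

# count ignores missing values in 'Age'
non_missing = sample.groupby("Survived")["Age"].count()

# size counts every row, missing or not
all_rows = sample.groupby("Survived").size()
print(non_missing, all_rows, sep="\n\n")
```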

In [23]:
# Group rows by 'Sex' and 'Survived', then calculate the mean age of each group

dataframe.groupby(['Sex','Survived'])['Age'].mean()

Sex    Survived
man    0           31.618056
       1           27.276022
woman  0           25.046875
       1           28.847716
Name: Age, dtype: float64
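
Several statistics can be computed per group in one call with named aggregation. This is a sketch on five rows rebuilt from the first preview; the output column names `mean_age`, `max_fare`, and `n` are chosen here for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Sex": ["man", "woman", "woman", "woman", "man"],
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = sample.groupby("Sex").agg(
    mean_age=("Age", "mean"),
    max_fare=("Fare", "max"),
    n=("Survived", "size"),
)
print(summary)
```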

In [24]:
# Show the number of rows for each value of 'Sex'
dataframe['Sex'].value_counts()

man      577
woman    314
Name: Sex, dtype: int64
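
If proportions are more useful than raw counts, `value_counts(normalize=True)` returns each value's share of the total. A minimal sketch on a hand-built series (the five values are illustrative, not the full dataset):

```python
import pandas as pd

sex = pd.Series(["man", "woman", "woman", "woman", "man"], name="Sex")

# Raw counts, sorted from most to least frequent
counts = sex.value_counts()

# Shares of the total; the values sum to 1
shares = sex.value_counts(normalize=True)
print(counts, shares, sep="\n\n")
```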