<a href="https://colab.research.google.com/github/bipinthecoder/machine-learning-basics/blob/main/ml_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data wrangling is the process of cleaning and reshaping raw data so that it is properly formatted for the later steps of a machine learning workflow.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

dataframe = pd.read_csv(url)

dataframe.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


**Creating a Data Frame**

Creating an empty DataFrame and then adding columns one at a time

In [None]:
import pandas as pd

#Creating a data frame
dataframe = pd.DataFrame()

#Add columns
dataframe['Name'] = ['Michael Jackson', 'Steven Steveson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

dataframe


Unnamed: 0,Name,Age,Driver
0,Michael Jackson,38,True
1,Steven Steveson,25,False
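
As an alternative to adding columns one at a time, the constructor also accepts a dictionary that maps column names to lists of values. The following is a minimal sketch using the same two people; the variable name `people` is only illustrative.

```python
import pandas as pd

# Build the two-row DataFrame in one step from a dict of columns
people = pd.DataFrame({
    'Name': ['Michael Jackson', 'Steven Steveson'],
    'Age': [38, 25],
    'Driver': [True, False],
})

print(people.shape)    # (2, 3)
print(people.dtypes)   # one dtype per column
```

Every list must have the same length, otherwise pandas raises a ValueError.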


Creating a new row and appending to the bottom of the dataframe

In [None]:
new_person_row = pd.Series(data = ['Molly Mooney', 40, True], index = ['Name', 'Age', 'Driver'], name = 'new_person_row')

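#DataFrame.append() was removed in pandas 2.0, so the row is added with pd.concat()
#ignore_index=False (the default) keeps the Series name as the new row's index label
pd.concat([dataframe, new_person_row.to_frame().T], ignore_index=False)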

Unnamed: 0,Name,Age,Driver
0,Michael Jackson,38,True
1,Steven Steveson,25,False
new_person_row,Molly Mooney,40,True


In [None]:
new_person_row = pd.Series(data = ['Molly Mooney', 40, True], index = ['Name', 'Age', 'Driver'])

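#Adding the row with pd.concat() (DataFrame.append() was removed in pandas 2.0)
#ignore_index=True renumbers the rows 0, 1, 2
pd.concat([dataframe, new_person_row.to_frame().T], ignore_index=True)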

Unnamed: 0,Name,Age,Driver
0,Michael Jackson,38,True
1,Steven Steveson,25,False
2,Molly Mooney,40,True


ignore_index=True tells pandas to discard the existing index labels: the rows are stacked in the order they are passed and the result gets a fresh integer index, range(len(result)). With ignore_index=False (the default), the original labels are kept, so the index of the result is simply the input indexes joined together. That is why the first output above has a row labelled new_person_row and the second has a row labelled 2.
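
The effect of ignore_index is easy to see when stacking two small DataFrames with pd.concat. This is a sketch with made-up frames named `top` and `bottom`, not part of the Titanic data.

```python
import pandas as pd

top = pd.DataFrame({'Name': ['Michael Jackson', 'Steven Steveson'], 'Age': [38, 25]})
bottom = pd.DataFrame({'Name': ['Molly Mooney'], 'Age': [40]})

# Default: original labels are kept, so the label 0 appears twice
kept = pd.concat([top, bottom], ignore_index=False)

# ignore_index=True: labels are thrown away and replaced by 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

Duplicate labels such as the repeated 0 are legal in pandas, but they make label-based lookups return several rows, so renumbering is usually the safer choice when the old labels carry no meaning.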

**Describing the data**

View the characteristics of a DataFrame

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

dataframe = pd.read_csv(url)

dataframe.head(2)


Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1


Checking number of rows and columns in df

In [None]:
dataframe.shape

(1313, 6)

Getting descriptive statistics

In [None]:
dataframe.describe()

Unnamed: 0,Age,Survived,SexCode
count,756.0,1313.0,1313.0
mean,30.397989,0.342727,0.351866
std,14.259049,0.474802,0.477734
min,0.17,0.0,0.0
25%,21.0,0.0,0.0
50%,28.0,0.0,0.0
75%,39.0,1.0,1.0
max,71.0,1.0,1.0
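
Note that the count for Age (756) is lower than the number of rows (1313), which means Age has missing values; describe() also skips non-numeric columns by default. The sketch below shows two related calls on a small made-up frame called `sample`: isna().sum() to count missing values per column, and describe(include='all') to summarise text columns as well.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with two missing ages
sample = pd.DataFrame({
    'Age': [29.0, np.nan, 30.0, np.nan],
    'Sex': ['female', 'female', 'male', 'female'],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# include='all' adds unique/top/freq rows for non-numeric columns
summary = sample.describe(include='all')
print(summary)
```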


**Navigating DataFrame**

Selecting individual rows or slices of a DataFrame

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

In [4]:
import pandas as pd

url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

dataframe = pd.read_csv(url, delimiter=',')

#Select first row
dataframe.iloc[0]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                 29.0
Sex                               female
Survived                               1
SexCode                                1
Name: 0, dtype: object

Using : operator for slicing

In [5]:
#Selecting 2nd, 3rd and 4th rows

dataframe.iloc[1:4]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1


In [8]:
#Getting the first four rows (positions 0 to 3; the end position 4 is excluded)
dataframe.iloc[:4]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1


Using set_index() to make a column that has a unique value for each row the index of the DataFrame

In [15]:
#Passing the Series itself (rather than the label 'Name') keeps Name as a regular column too
dataframe.set_index(keys = dataframe['Name'], drop = True, inplace=True)

#Show the row after change in index

# dataframe.head(5)
dataframe.loc['Allen, Miss Elisabeth Walton']



Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                 29.0
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

loc - selects by index label, whatever the label type (a string such as a passenger name, or an integer)

iloc - selects by integer position (0 for the first row), regardless of what the index labels are
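
The difference between the two is clearest when the index labels are integers that do not match the row positions. The frame `ages` below is a made-up example for illustration.

```python
import pandas as pd

# Index labels 10, 20, 30 deliberately differ from positions 0, 1, 2
ages = pd.DataFrame({'Age': [29.0, 2.0, 30.0]}, index=[10, 20, 30])

by_label = ages.loc[10, 'Age']     # row whose label is 10
by_position = ages.iloc[0, 0]      # first row, first column

label_slice = ages.loc[10:20]      # label slices include the end label
position_slice = ages.iloc[0:1]    # position slices exclude the end position

print(by_label, by_position)                   # 29.0 29.0
print(len(label_slice), len(position_slice))   # 2 1
```

The slicing rule is a common source of off-by-one errors: loc[10:20] returns two rows, while iloc[0:1] returns one.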

**Selecting rows based on conditionals**

In [16]:
import pandas as pd

url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

dataframe = pd.read_csv(url, delimiter=',')

#Selecting all Female passengers

dataframe[dataframe['Sex'] == 'female'].head(5)


Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1,1
8,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.0,female,1,1


Adding multiple conditions

In [20]:
#Selecting all females with age > 65
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] > 65)]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
73,"Crosby, Mrs Edward Gifford (Catherine Elizabet...",1st,69.0,female,1,1
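
Conditions can be combined with & (and), | (or) and ~ (not). Each comparison needs its own parentheses, or can be stored in a named mask first, because these operators bind more tightly than == and >. A small sketch on a made-up frame called `passengers`:

```python
import pandas as pd

passengers = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [69.0, 30.0, 25.0, 71.0],
})

# Named boolean masks keep long filters readable
is_female = passengers['Sex'] == 'female'
over_65 = passengers['Age'] > 65

both = passengers[is_female & over_65]     # female AND older than 65
either = passengers[is_female | over_65]   # female OR older than 65
not_female = passengers[~is_female]        # everyone who is not female

print(list(both.index))        # [0]
print(list(either.index))      # [0, 2, 3]
print(list(not_female.index))  # [1, 3]
```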


**Replace values in a DataFrame**

Pandas provides replace() to replace any instances of values in a DataFrame

Replacing all "female" values in the Sex column with "woman"

In [5]:
import pandas as pd
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

dataFrame = pd.read_csv(url , delimiter = ',')

modified_df = dataFrame['Sex'].replace("female","woman")

modified_df.head(5)


0    woman
1    woman
2     male
3    woman
4     male
Name: Sex, dtype: object

Replacing multiple values at the same time

In [6]:
#Replacing "male" and "female" with "Man" and "Woman" at the same time
modified_df = dataFrame['Sex'].replace(["male", "female"], ["Man", "Woman"])
modified_df.head(5)

0    Woman
1    Woman
2      Man
3    Woman
4      Man
Name: Sex, dtype: object

Replacing values across the dataFrame

In [7]:
dataFrame.replace(1, "One").head(2)

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,One,One
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,One
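
A replacement applied to the whole DataFrame touches every column that contains the value, as the output above shows for both Survived and SexCode. To limit it to one column, replace() also accepts a nested dictionary of the form {column: {old: new}}. A sketch on a two-row made-up frame called `sample`:

```python
import pandas as pd

sample = pd.DataFrame({'Survived': [1, 0], 'SexCode': [1, 1]})

# Only the Survived column is searched; SexCode is left untouched
only_survived = sample.replace({'Survived': {1: 'One'}})

print(only_survived['Survived'].tolist())  # ['One', 0]
print(only_survived['SexCode'].tolist())   # [1, 1]
```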


Using regular expression in replace()

In [8]:
dataFrame.replace(r"1st", "First", regex = True).head(2)

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",First,29.0,female,1,1
1,"Allison, Miss Helen Loraine",First,2.0,female,0,1
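
With regex=True the pattern can also use capture groups, and the replacement string can refer back to them. The sketch below works on a made-up Series named `classes` and anchors the pattern with ^ and $ so that only whole values are rewritten, not matching fragments inside longer strings.

```python
import pandas as pd

classes = pd.Series(['1st', '2nd', '3rd', '1st'], name='PClass')

# (\d) captures the digit; \1 reuses it in the replacement
renamed = classes.replace(r'^(\d)(st|nd|rd)$', r'Class \1', regex=True)

print(renamed.tolist())  # ['Class 1', 'Class 2', 'Class 3', 'Class 1']
```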
