<a href="https://colab.research.google.com/github/ghatanisuresh/DataScience_tutorial/blob/main/Automobile_Data_Wrangling_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling

Before diving into the practical aspects of data wrangling, let's first understand the real meaning of the term "wrangle." In everyday language, it means to take care of or deal with someone or something, usually when it is challenging. Now, let's apply this concept to the world of data.

In the data world, "wrangling" is the process of dealing with difficult or messy data. This typically involves cleaning up raw, erroneous, or incomplete data, which is often said to consume as much as 90% of a data scientist's or data analyst's time. The goal is to transform that messy data into a clean, structured, and easy-to-use format. This is the essence of data wrangling.

At this point, you might assume that data wrangling is simply another term for data cleaning, data remediation, or data munging. While there is overlap, data wrangling encompasses a broader set of processes.

These processes are designed to transform raw data into a format that is more readily usable for analysis. The specific steps involved can differ from project to project, depending on the nature of the data and the goals we are trying to achieve.


Typical tasks include:

* Merging multiple data sources into a single dataset for analysis
* Identifying gaps in the data and either filling or deleting them
* Removing data that is irrelevant to the project
* Understanding outliers and assessing how much they matter
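
The four tasks above can be sketched on toy data. This is only an illustration: the `specs` and `prices` tables, their values, and the column names are invented for the sketch, and the 1.5 * IQR rule is just one common heuristic for flagging outliers.

```python
import numpy as np
import pandas as pd

# Two hypothetical sources describing the same cars (values are made up)
specs = pd.DataFrame({"car_id": [1, 2, 3, 4],
                      "horsepower": [111.0, np.nan, 102.0, 115.0],
                      "notes": ["a", "b", "c", "d"]})
prices = pd.DataFrame({"car_id": [1, 2, 3, 4],
                       "price": [13495.0, 16500.0, np.nan, 17450.0]})

# 1. Merge multiple sources into a single dataset
cars = specs.merge(prices, on="car_id", how="inner")

# 2. Fill one gap with the column mean, delete the rows with another gap
cars["horsepower"] = cars["horsepower"].fillna(cars["horsepower"].mean())
cars = cars.dropna(subset=["price"])

# 3. Remove a column that is irrelevant to the analysis
cars = cars.drop(columns=["notes"])

# 4. Flag potential outliers with the 1.5 * IQR rule
q1, q3 = cars["price"].quantile([0.25, 0.75])
iqr = q3 - q1
cars["price_outlier"] = (cars["price"] < q1 - 1.5 * iqr) | (cars["price"] > q3 + 1.5 * iqr)
```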

Note: Although "data cleaning" and "data wrangling" are often used interchangeably, there is a distinction.

* Data wrangling is the overall process of transforming raw data into a more usable form.

* Data cleaning is a critical step in the data wrangling process to remove inaccurate and inconsistent data.

[Source](https://online.hbs.edu/blog/post/data-wrangling)

# Example: Used Cars Pricing

For this example I use the "Automobile Dataset" from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data


# Objectives

* Handling missing values
* Correct data formatting
* Standardize and normalize data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Reading dataset from the URL

In [None]:
url =  'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
df = pd.read_csv(url)
df

Unnamed: 0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,...,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
1,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
2,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
3,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
4,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.40,8.5,110,5500,19,25,15250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
200,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
201,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
202,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


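The file has no header row, so `read_csv` has used the first record as the column names (that record is therefore no longer among the data rows). Let's assign proper column names.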

In [None]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [None]:
df.columns = headers
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
1,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
2,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
3,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
4,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
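
An alternative to renaming the columns afterwards is to tell `read_csv` up front that the file has no header. The sketch below uses a three-record inline sample and a shortened, hypothetical `names` list rather than the real file, so it runs without network access; `header=None` and `names=` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Three illustrative records in the same header-less style as imports-85.data
raw = "3,?,alfa-romero,13495\n3,?,alfa-romero,16500\n1,?,alfa-romero,16500\n"
names = ["symboling", "normalized-losses", "make", "price"]  # shortened for the sketch

# Default behaviour: the first record is consumed as the header
default_read = pd.read_csv(io.StringIO(raw))

# Explicit behaviour: no header in the file, column names supplied by us
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(len(default_read), len(explicit_read))  # prints: 2 3
```

With the real file, the same idea would be `pd.read_csv(url, header=None, names=headers)`, which keeps every record as a data row.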


As the output above shows, some cells contain '?'. These placeholders stand in for missing values and would hinder further analysis if left as they are.

__How to work with missing data?__

a. Identify missing data

b. Deal with missing data

c. Correct data format

# Identify and handle missing values

The '?' placeholders are easy to spot in the output above, so let's replace them with null values (NaN).

In [None]:
# library
import numpy as np

# replace "?" with NaN

df.replace('?', np.nan, inplace = True)
df.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
1,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
2,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
3,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
4,2,,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
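
The placeholder can also be converted while the file is being read, using the `na_values` parameter of `read_csv`. The snippet below is a minimal sketch on a two-record inline sample with a shortened, hypothetical column list; a side effect worth noting is that a column whose only non-numeric entries were '?' is then parsed directly as a numeric type.

```python
import io
import pandas as pd

# Two illustrative records; "?" marks a missing normalized-losses value
raw = "3,?,alfa-romero,13495\n2,164,audi,13950\n"
names = ["symboling", "normalized-losses", "make", "price"]  # shortened for the sketch

# Treat "?" as missing at read time instead of replacing it afterwards
sample = pd.read_csv(io.StringIO(raw), header=None, names=names, na_values="?")

print(sample["normalized-losses"].dtype)
```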


### Evaluating for missing Data

pandas offers two complementary methods for detecting missing data:

a. isnull()

b. notnull()

In [None]:
df.isnull().head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
df.notnull().head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


Each method returns a DataFrame of boolean values, one per cell, indicating whether that cell:

* is missing, in the case of __isnull()__

* is not missing, in the case of __notnull()__
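
To make the relationship between the two methods concrete, here is a small sketch on a made-up Series. It shows that the two masks are exact complements and that either one can be used to filter values.

```python
import numpy as np
import pandas as pd

s = pd.Series([164.0, np.nan, 95.0])  # toy values, one of them missing

mask_missing = s.isnull()    # True where the value is missing
mask_present = s.notnull()   # True where the value is present

# The two masks are element-wise complements of each other
are_complements = bool((mask_missing == ~mask_present).all())

# A boolean mask can be used directly to keep only the present values
present_values = s[mask_present]
```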


Let's summarize the boolean output above with the __sum()__ method, which returns the number of missing values in each column.

In [None]:
df.isnull().sum()

Unnamed: 0,0
symboling,0
normalized-losses,40
make,0
fuel-type,0
aspiration,0
num-of-doors,2
body-style,0
drive-wheels,0
engine-location,0
wheel-base,0
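
Counting works because `True` is treated as 1 when summed. The same trick gives the share of missing values per column via `mean()`. The sketch below uses a small invented DataFrame (`toy`) rather than the automobile data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up)
toy = pd.DataFrame({"normalized-losses": [np.nan, 164.0, np.nan, 95.0],
                    "num-of-doors": ["two", "four", None, "four"],
                    "make": ["audi", "audi", "bmw", "volvo"]})

missing_counts = toy.isnull().sum()          # number of missing cells per column
missing_share = toy.isnull().mean() * 100    # percentage of missing cells per column

# Keep only the columns that actually have gaps
only_incomplete = missing_counts[missing_counts > 0]
```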


We can also view the same information another way, using the __value_counts()__ method.

In [None]:
# store the boolean missing-value mask in a new variable, 'missing_data'

missing_data = df.isnull()

In [None]:
# count missing (True) and present (False) values, column by column
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")

### Deal with missing data:

1. Drop data:

  - Drop the whole row
  - Drop the whole column
2. Replace data:

  - Replace with the mean value
  - Replace with the most frequent value
  - Replace based on other functions

Notes:

We should drop a whole column only if most of its entries are empty. In this dataset, no column is empty enough to drop entirely.

__Replace by mean:__

* "normalized-losses": 40 missing values in the DataFrame as loaded above (41 in the full file, whose first record was consumed as the header), replace them with the mean
* "stroke": 4 missing values, replace them with the mean
* "bore": 4 missing values, replace them with the mean
* "horsepower": 2 missing values, replace them with the mean
* "peak-rpm": 2 missing values, replace them with the mean


__Replace by frequency:__

* "num-of-doors": 2 missing values, replace them with the mode (the most frequent value)

__Drop the whole row:__

* "price": 4 missing values, simply delete the whole row

  - Reason: price is what we want to predict, so an entry without a price cannot be used for prediction; it is better to remove those rows for now.
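
The frequency-based replacement and the row-dropping strategy listed above can be sketched as follows. The `toy` DataFrame and its values are invented for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing door count and one missing price (values are made up)
toy = pd.DataFrame({"num-of-doors": ["four", "two", np.nan, "four"],
                    "price": [13495.0, np.nan, 16500.0, 17450.0]})

# Replace by frequency: fill the gap with the most common category
most_common = toy["num-of-doors"].mode()[0]
toy["num-of-doors"] = toy["num-of-doors"].fillna(most_common)

# Drop the whole row: remove records that have no price, then renumber the index
toy = toy.dropna(subset=["price"]).reset_index(drop=True)
```

Resetting the index after dropping rows keeps the row labels contiguous, which avoids surprises when rows are later accessed by position.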




### Calculate the mean value for the 'normalized-losses' column

Before doing this, we first check the data types with the __info()__ method.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          204 non-null    int64  
 1   normalized-losses  164 non-null    object 
 2   make               204 non-null    object 
 3   fuel-type          204 non-null    object 
 4   aspiration         204 non-null    object 
 5   num-of-doors       202 non-null    object 
 6   body-style         204 non-null    object 
 7   drive-wheels       204 non-null    object 
 8   engine-location    204 non-null    object 
 9   wheel-base         204 non-null    float64
 10  length             204 non-null    float64
 11  width              204 non-null    float64
 12  height             204 non-null    float64
 13  curb-weight        204 non-null    int64  
 14  engine-type        204 non-null    object 
 15  num-of-cylinders   204 non-null    object 
 16  engine-size        204 non

To calculate the mean, a string (object) column must first be converted to a numeric type, either int or float depending on the attribute.

Here we convert to float.

In [None]:
avg_norm_loss = df['normalized-losses'].astype('float').mean()
avg_norm_loss

122.0
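
`astype('float')` works here because the '?' placeholders were already replaced with NaN. If stray non-numeric strings could still be present, `pd.to_numeric` with `errors="coerce"` is a more forgiving option, since it turns anything unparseable into NaN instead of raising an error. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy object column: numeric strings, a leftover placeholder, and a missing value
losses = pd.Series(["164", "?", "158", None], dtype="object")

# Unparseable entries become NaN rather than raising a ValueError
as_float = pd.to_numeric(losses, errors="coerce")

# mean() skips NaN by default
mean_loss = as_float.mean()
```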

### Replace NaN with the mean value

In [None]:
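# replace NaN in normalized-losses with the column mean
# (assigning the result back avoids relying on an in-place change to a column selection)
df['normalized-losses'] = df['normalized-losses'].replace(np.nan, avg_norm_loss)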

In [None]:
# replace NaN in the bore column with the column mean

avg_bore = df['bore'].astype('float').mean()
df['bore'] = df['bore'].replace(np.nan, avg_bore)

In [None]:
# replace NaN in the stroke column with the column mean

avg_stroke = df['stroke'].astype('float').mean()
df['stroke'] = df['stroke'].replace(np.nan, avg_stroke)

In [None]:
# replace NaN in the horsepower column with the column mean
avg_horsepower = df['horsepower'].astype('float').mean()
df['horsepower'] = df['horsepower'].replace(np.nan, avg_horsepower)

In [None]:
# replace NaN in the peak-rpm column with the column mean

avg_peak_rpm = df['peak-rpm'].astype('float').mean()
df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_peak_rpm)
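
Since the same three steps repeat for every column, they can be folded into a loop. The sketch below is one possible variant on an invented two-column DataFrame; unlike the cells above, it also stores the converted float values back in the column, which is an assumption about the desired end state rather than something the steps above do.

```python
import numpy as np
import pandas as pd

# Toy frame with numeric strings and gaps (values are made up)
toy = pd.DataFrame({"bore": ["3.47", np.nan, "3.19"],
                    "stroke": ["2.68", "3.40", np.nan]})

for col in ["bore", "stroke"]:
    toy[col] = toy[col].astype("float")            # convert strings to float
    toy[col] = toy[col].fillna(toy[col].mean())    # fill gaps with the column mean
```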