# DATA PREPARATION/DATA PREPROCESSING

### 3 Step Approach:

**Step 1: Data Cleaning - drop columns or rows (`drop`, `dropna`), or impute NaN values with the mean, median, mode, or any other agreed value.**

**Step 2: Data Type Conversion (Optional)**

**Step 3: Data Transformation**
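
The three steps can be sketched end to end on a tiny frame. This is only an illustration: the `toy` DataFrame and its values are made up for the example and are not the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption, not the notebook's file).
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, 18.0],
    "Month": ["May", "5", "5", "6"],
    "Weather": ["S", "C", "PS", "S"],
})

# Step 1: data cleaning (mean imputation for the missing Ozone value).
toy["Ozone"] = toy["Ozone"].fillna(toy["Ozone"].mean())

# Step 2: data type conversion (text month -> integer).
toy["Month"] = toy["Month"].replace("May", "5").astype("int64")

# Step 3: data transformation (one-hot encode the categorical column).
toy = pd.get_dummies(toy, columns=["Weather"])
print(toy)
```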


## 1. Import Important Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

## 2. Import Data

In [2]:
weather_data_2010 = pd.read_csv('data_clean.csv')
weather_data_2010

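| | Unnamed: 0 | Ozone | Solar.R | Wind | Temp C | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 190.0 | 7.4 | 67 | 5 | 1 | 2010 | 67 | S |
| 1 | 2 | 36.0 | 118.0 | 8.0 | 72 | 5 | 2 | 2010 | 72 | C |
| 2 | 3 | 12.0 | 149.0 | 12.6 | 74 | 5 | 3 | 2010 | 74 | PS |
| 3 | 4 | 18.0 | 313.0 | 11.5 | 62 | 5 | 4 | 2010 | 62 | S |
| 4 | 5 | NaN | NaN | 14.3 | 56 | 5 | 5 | 2010 | 56 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 154 | 41.0 | 190.0 | 7.4 | 67 | 5 | 1 | 2010 | 67 | C |
| 154 | 155 | 30.0 | 193.0 | 6.9 | 70 | 9 | 26 | 2010 | 70 | PS |
| 155 | 156 | NaN | 145.0 | 13.2 | 77 | 9 | 27 | 2010 | 77 | S |
| 156 | 157 | 14.0 | 191.0 | 14.3 | 75 | 9 | 28 | 2010 | 75 | S |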


## 3. Data Understanding

## 3.1 Perform Initial Analysis

In [3]:
weather_data_2010.shape

(158, 10)

In [4]:
weather_data_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  158 non-null    int64  
 1   Ozone       120 non-null    float64
 2   Solar.R     151 non-null    float64
 3   Wind        158 non-null    float64
 4   Temp C      158 non-null    object 
 5   Month       158 non-null    object 
 6   Day         158 non-null    int64  
 7   Year        158 non-null    int64  
 8   Temp        158 non-null    int64  
 9   Weather     155 non-null    object 
dtypes: float64(3), int64(4), object(3)
memory usage: 12.5+ KB


In [5]:
weather_data_2010.isna().sum()

Unnamed: 0     0
Ozone         38
Solar.R        7
Wind           0
Temp C         0
Month          0
Day            0
Year           0
Temp           0
Weather        3
dtype: int64

In [6]:
weather_data_2010.dtypes

Unnamed: 0      int64
Ozone         float64
Solar.R       float64
Wind          float64
Temp C         object
Month          object
Day             int64
Year            int64
Temp            int64
Weather        object
dtype: object

In [7]:
weather_data_2010.describe(include = 'all')

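| | Unnamed: 0 | Ozone | Solar.R | Wind | Temp C | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 158.0 | 120.0 | 151.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 155 |
| unique | NaN | NaN | NaN | NaN | 41.0 | 6.0 | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | 81.0 | 9.0 | NaN | NaN | NaN | S |
| freq | NaN | NaN | NaN | NaN | 11.0 | 34.0 | NaN | NaN | NaN | 59 |
| mean | 79.5 | 41.583333 | 185.403974 | 9.957595 | NaN | NaN | 16.006329 | 2010.0 | 77.727848 | NaN |
| std | 45.754781 | 32.620709 | 88.723103 | 3.511261 | NaN | NaN | 8.997166 | 0.0 | 9.377877 | NaN |
| min | 1.0 | 1.0 | 7.0 | 1.7 | NaN | NaN | 1.0 | 2010.0 | 56.0 | NaN |
| 25% | 40.25 | 18.0 | 119.0 | 7.4 | NaN | NaN | 8.0 | 2010.0 | 72.0 | NaN |
| 50% | 79.5 | 30.5 | 197.0 | 9.7 | NaN | NaN | 16.0 | 2010.0 | 78.5 | NaN |
| 75% | 118.75 | 61.5 | 257.0 | 11.875 | NaN | NaN | 24.0 | 2010.0 | 84.0 | NaN |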


## 4. Data Preparation

### STAGE 1 - Data Cleaning

In [8]:
weather_data_2010.drop(labels = ['Unnamed: 0','Temp C'],axis = 1,inplace = True)
weather_data_2010

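| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 1 | 36.0 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 2 | 12.0 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 3 | 18.0 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 4 | NaN | NaN | 14.3 | 5 | 5 | 2010 | 56 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | C |
| 154 | 30.0 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | PS |
| 155 | NaN | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | S |
| 156 | 14.0 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | S |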


### The client has agreed to mean imputation for the Ozone feature

In [9]:
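# 41.58 is the Ozone mean reported by describe(), rounded to 2 decimals.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
weather_data_2010['Ozone'] = weather_data_2010['Ozone'].fillna(41.58)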

In [10]:
weather_data_2010.head(30)

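| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 1 | 36.0 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 2 | 12.0 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 3 | 18.0 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 4 | 41.58 | NaN | 14.3 | 5 | 5 | 2010 | 56 | S |
| 5 | 28.0 | NaN | 14.9 | 5 | 6 | 2010 | 66 | C |
| 6 | 23.0 | 299.0 | 8.6 | 5 | 7 | 2010 | 65 | PS |
| 7 | 19.0 | 99.0 | 13.8 | 5 | 8 | 2010 | 59 | C |
| 8 | 8.0 | 19.0 | 20.1 | 5 | 9 | 2010 | 61 | PS |
| 9 | 41.58 | 194.0 | 8.6 | 5 | 10 | 2010 | 69 | S |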
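
Instead of typing the mean by hand, it can be computed from the column itself. A minimal sketch, assuming a small made-up frame (`ozone_demo` is a hypothetical name, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample values (assumption, for illustration only).
ozone_demo = pd.DataFrame({"Ozone": [41.0, 36.0, np.nan, 18.0, np.nan]})

# Series.mean() skips NaN values by default.
ozone_mean = ozone_demo["Ozone"].mean()

# Fill the gaps and assign the result back to the column.
ozone_demo["Ozone"] = ozone_demo["Ozone"].fillna(ozone_mean)
print(ozone_demo)
```

Computing the value keeps the imputation correct if the data changes, at the cost of more decimal places than the rounded 41.58 used here.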


In [11]:
weather_data_2010.isna().sum()

Ozone      0
Solar.R    7
Wind       0
Month      0
Day        0
Year       0
Temp       0
Weather    3
dtype: int64

In [12]:
7/158*100  # percentage of rows with a missing Solar.R value

4.430379746835443
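
The same percentage can be obtained for every column at once, since the mean of a boolean mask is the fraction of `True` values. A small sketch on made-up data (`missing_demo` is a hypothetical frame):

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame (assumption, for illustration only).
missing_demo = pd.DataFrame({
    "Solar.R": [190.0, np.nan, 149.0, 313.0],
    "Wind": [7.4, 8.0, 12.6, 11.5],
    "Weather": ["S", None, "PS", None],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = missing_demo.isna().mean() * 100
print(missing_pct.round(2))
```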

### The client has agreed to drop the records with NaN values in the Solar.R and Weather features

In [13]:
weather_data_2010.dropna(inplace = True)

In [14]:
weather_data_2010.isna().sum()

Ozone      0
Solar.R    0
Wind       0
Month      0
Day        0
Year       0
Temp       0
Weather    0
dtype: int64

### STAGE 2 - DATA CONVERSION

In [15]:
weather_data_2010.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [16]:
weather_data_2010['Month'] = weather_data_2010['Month'].replace('May',5).astype('int')

In [17]:
weather_data_2010.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object
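
Before replacing values, it can help to find which entries actually block the conversion. A possible approach, sketched on a made-up series (`months` is hypothetical), uses `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Hypothetical month values stored as text (assumption, for illustration).
months = pd.Series(["5", "May", "6", "9"])

# Values that cannot be parsed as numbers become NaN under errors="coerce".
not_numeric = months[pd.to_numeric(months, errors="coerce").isna()].unique()
print(not_numeric)

# Replace the offending label, then convert to an explicit integer dtype.
months_clean = months.replace({"May": "5"}).astype("int64")
print(months_clean.dtype)
```

Requesting `"int64"` explicitly gives the same dtype on every platform, whereas plain `'int'` can resolve to `int32` on some setups, as in the output above.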

### STAGE 3 - DATA TRANSFORMATION

### 2 Transformation Techniques: Discrete vs. Continuous

**1. Discrete - Label Encoding/One Hot Encoding**

**2. Continuous - MinMax Scaler/Standard Scaler/Robust Scaler**
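
The notebook only demonstrates the discrete techniques, so here is a brief sketch of the three scalers named above for continuous features. The `wind` frame is made-up sample data, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical continuous feature (assumption, for illustration only).
wind = pd.DataFrame({"Wind": [7.4, 8.0, 12.6, 11.5, 14.3]})

# MinMaxScaler: rescales values to the [0, 1] range.
minmax_scaled = MinMaxScaler().fit_transform(wind)

# StandardScaler: subtracts the mean and divides by the standard deviation.
standard_scaled = StandardScaler().fit_transform(wind)

# RobustScaler: subtracts the median and divides by the interquartile range.
robust_scaled = RobustScaler().fit_transform(wind)

print(np.hstack([minmax_scaled, standard_scaled, robust_scaled]).round(3))
```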

## 1. Label Encoding

In [18]:
weather_data_2010_copy_1 = weather_data_2010.copy()

In [19]:
le = LabelEncoder()

In [20]:
weather_data_2010_copy_1['Weather'] = le.fit_transform(weather_data_2010_copy_1['Weather'])
weather_data_2010_copy_1

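| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | 2 |
| 1 | 36.00 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | 0 |
| 2 | 12.00 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | 1 |
| 3 | 18.00 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | 2 |
| 6 | 23.00 | 299.0 | 8.6 | 5 | 7 | 2010 | 65 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | 0 |
| 154 | 30.00 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | 1 |
| 155 | 41.58 | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | 2 |
| 156 | 14.00 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | 2 |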


In [21]:
weather_data_2010_copy_1.dtypes

Ozone      float64
Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather      int32
dtype: object
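
To see which integer stands for which label, the fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to labels. A short sketch with made-up values (`le_demo` and `label_map` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (assumption, for illustration only).
weather_labels = pd.Series(["S", "C", "PS", "S"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(weather_labels)

# classes_ is sorted, and each label's position in it is its integer code.
label_map = {str(label): int(code)
             for code, label in enumerate(le_demo.classes_)}
print(label_map)

# Recover the original labels from the codes.
restored = le_demo.inverse_transform(codes)
print(list(restored))
```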

## 2. One Hot Encoding

One-hot encoding can be done with either of two libraries:

1. Pandas  - pd.get_dummies()
2. sklearn - OneHotEncoder()

### Using Pandas :

In [22]:
weather_data_2010_copy_2 = weather_data_2010.copy()  # .copy() gives an independent frame, not a second name for the same one

In [23]:
weather_data_2010_copy_2 = pd.get_dummies(data = weather_data_2010_copy_2, columns= ['Weather'])
weather_data_2010_copy_2

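| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather_C | Weather_PS | Weather_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | 0 | 0 | 1 |
| 1 | 36.00 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | 1 | 0 | 0 |
| 2 | 12.00 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | 0 | 1 | 0 |
| 3 | 18.00 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | 0 | 0 | 1 |
| 6 | 23.00 | 299.0 | 8.6 | 5 | 7 | 2010 | 65 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | 1 | 0 | 0 |
| 154 | 30.00 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | 0 | 1 | 0 |
| 155 | 41.58 | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | 0 | 0 | 1 |
| 156 | 14.00 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | 0 | 0 | 1 |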


In [24]:
weather_data_2010_copy_2.dtypes

Ozone         float64
Solar.R       float64
Wind          float64
Month           int32
Day             int64
Year            int64
Temp            int64
Weather_C       uint8
Weather_PS      uint8
Weather_S       uint8
dtype: object
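
Two optional arguments of `pd.get_dummies` are worth knowing: `drop_first=True` drops one dummy column per feature (the remaining columns still identify every category), and `dtype` fixes the output type, which is useful because pandas 2.0 and later return `bool` columns by default rather than the `uint8` shown above. A sketch on hypothetical data (`dummy_demo` is an assumed name):

```python
import pandas as pd

# Hypothetical sample frame (assumption, for illustration only).
dummy_demo = pd.DataFrame({
    "Temp": [67, 72, 74, 62],
    "Weather": ["S", "C", "PS", "S"],
})

# Drop the first category ("C") and request integer 0/1 columns.
dummies = pd.get_dummies(dummy_demo, columns=["Weather"],
                         drop_first=True, dtype=int)
print(dummies)
```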

### Using sklearn :

In [25]:
weather_data_2010_copy_3 = weather_data_2010.copy()

In [26]:
weather_data_2010_copy_3

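| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 1 | 36.00 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 2 | 12.00 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 3 | 18.00 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 6 | 23.00 | 299.0 | 8.6 | 5 | 7 | 2010 | 65 | PS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | C |
| 154 | 30.00 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | PS |
| 155 | 41.58 | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | S |
| 156 | 14.00 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | S |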


In [27]:
weather_data_2010_copy_3["Weather"].unique()

array(['S', 'C', 'PS'], dtype=object)

In [28]:
ohe = OneHotEncoder()

In [29]:
feature_array = ohe.fit_transform(weather_data_2010_copy_3[['Weather']]).toarray()

In [30]:
feature_array

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...

In [31]:
ohe.categories_

[array(['C', 'PS', 'S'], dtype=object)]

In [32]:
feature_labels = ohe.categories_  # a list holding one array of categories per encoded column

In [33]:
encoded_columns = pd.DataFrame(feature_array,columns = feature_labels)
encoded_columns

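| | C | PS | S |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... |
| 144 | 1.0 | 0.0 | 0.0 |
| 145 | 0.0 | 1.0 | 0.0 |
| 146 | 0.0 | 0.0 | 1.0 |
| 147 | 0.0 | 0.0 | 1.0 |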


In [38]:
weather_data_2010_copy_3

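| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | S |
| 1 | 36.00 | 118.0 | 8.0 | 5 | 2 | 2010 | 72 | C |
| 2 | 12.00 | 149.0 | 12.6 | 5 | 3 | 2010 | 74 | PS |
| 3 | 18.00 | 313.0 | 11.5 | 5 | 4 | 2010 | 62 | S |
| 6 | 23.00 | 299.0 | 8.6 | 5 | 7 | 2010 | 65 | PS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 153 | 41.00 | 190.0 | 7.4 | 5 | 1 | 2010 | 67 | C |
| 154 | 30.00 | 193.0 | 6.9 | 9 | 26 | 2010 | 70 | PS |
| 155 | 41.58 | 145.0 | 13.2 | 9 | 27 | 2010 | 77 | S |
| 156 | 14.00 | 191.0 | 14.3 | 9 | 28 | 2010 | 75 | S |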


In [47]:
pd.concat([weather_data_2010_copy_3,encoded_columns], axis =1)

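| | Ozone | Solar.R | Wind | Month | Day | Year | Temp | Weather | (C,) | (PS,) | (S,) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 190.0 | 7.4 | 5.0 | 1.0 | 2010.0 | 67.0 | S | 0.0 | 0.0 | 1.0 |
| 1 | 36.0 | 118.0 | 8.0 | 5.0 | 2.0 | 2010.0 | 72.0 | C | 1.0 | 0.0 | 0.0 |
| 2 | 12.0 | 149.0 | 12.6 | 5.0 | 3.0 | 2010.0 | 74.0 | PS | 0.0 | 1.0 | 0.0 |
| 3 | 18.0 | 313.0 | 11.5 | 5.0 | 4.0 | 2010.0 | 62.0 | S | 0.0 | 0.0 | 1.0 |
| 6 | 23.0 | 299.0 | 8.6 | 5.0 | 7.0 | 2010.0 | 65.0 | PS | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 |
| 93 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 |
| 95 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 |
| 96 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 |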
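
The NaN rows and the tuple-style column names in this output have two causes. `pd.concat(..., axis=1)` aligns rows on the index, and the encoded frame has a fresh 0..n-1 index while the cleaned data keeps gaps left by `dropna`. Passing the whole `categories_` list as `columns` also produces tuple labels such as `(C,)`. One possible fix, sketched here on a small hypothetical frame (`concat_demo` and the other names are assumptions), is to reuse the source index and take the first array of `categories_`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with index gaps, as left behind by dropna().
concat_demo = pd.DataFrame(
    {"Temp": [67, 72, 65, 70], "Weather": ["S", "C", "PS", "S"]},
    index=[0, 1, 6, 7],
)

ohe_demo = OneHotEncoder()
encoded = ohe_demo.fit_transform(concat_demo[["Weather"]]).toarray()

# Use the category array itself for column names and the source index for rows.
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe_demo.categories_[0],
    index=concat_demo.index,
)

combined = pd.concat([concat_demo, encoded_df], axis=1)
print(combined)
```

Calling `reset_index(drop=True)` on the cleaned data before concatenating is an alternative when the original index labels are not needed.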


## How do we choose between Label Encoding and OHE?

### Input feature:
As a rule of thumb, the choice between Label Encoding (LE) and One Hot Encoding (OHE) depends on whether the model is **parametric or non-parametric**.
* For **parametric models** - better to go with **OHE**, because the integer codes produced by LE would be read as an ordering of the categories.
* For **non-parametric models** - go with **Label Encoding.**

### Output feature:
The output feature should always be encoded with **Label Encoding.**
