## __Data Binning: Formatting and Normalization__

- Formatting is the process of bringing data into a common format so that it can be used consistently for feature preparation.

- Normalization is the process of adjusting values measured on different scales to a common scale, which makes it easier for the model to assign weights.

- There are several ways to normalize data, for example:

    1. Single-feature scaling
    2. Min-max scaling
    3. Z-score
    4. Log scaling
    5. Clipping
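
As a compact reference, the five techniques are commonly written as shown below. This is a sketch using standard textbook definitions; the symbols $x_{\min}$, $x_{\max}$, $\mu$ (mean), $\sigma$ (standard deviation), and the clipping bounds $a \le b$ are notation introduced here for illustration.

$$
\begin{aligned}
\text{Single-feature scaling:}\quad & x' = \frac{x}{x_{\max}} \\
\text{Min-max scaling:}\quad & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{Z-score:}\quad & x' = \frac{x - \mu}{\sigma} \\
\text{Log scaling:}\quad & x' = \log(x), \quad x > 0 \\
\text{Clipping:}\quad & x' = \min(\max(x, a),\, b)
\end{aligned}
$$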


First, let's look at the data, format it, and then apply normalization.

## Step 1: Import the Required Libraries and Read the Dataset

- Import the pandas and NumPy libraries
- Read the **CarPrices.csv** file into a DataFrame


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../../Datasets/CarPrices.csv',index_col=0)

In [3]:
df.head(2)

car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0


__Observation__

- These are the first two rows of the dataset.

## Step 2: Preprocess the Dataset

Let's first drop all rows that contain missing values and then look at the data types.


In [4]:
df.dropna(inplace=True)
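
Before dropping rows, it can help to check how many values are actually missing in each column. The following is a minimal sketch on a made-up two-column frame (the name `toy` and its values are assumptions for illustration, not taken from **CarPrices.csv**):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data; the values are made up.
toy = pd.DataFrame({
    "horsepower": [111, np.nan, 154],
    "price": [13495.0, 16500.0, np.nan],
})

# Count the missing values per column before deciding to drop anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new frame without the rows that contain any NaN.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

Comparing the row counts before and after shows how much data the drop costs.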

In [5]:
df.dtypes

symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
dtype: object

__Observation__

- Many of the columns have the **object** data type. In pandas, this usually means that they hold strings.

Since we want the data to be uniform and formatted, let's convert all **object** columns to the **string** type.

To do so:
1. Collect the names of the **object** columns in a list
2. Convert those columns to strings
3. Look at the data types again

In [6]:
cols = list(df.select_dtypes(np.object_).columns)

In [7]:
df[cols] = df[cols].astype('string')

In [8]:
df.dtypes

symboling                    int64
CarName             string[python]
fueltype            string[python]
aspiration          string[python]
doornumber          string[python]
carbody             string[python]
drivewheel          string[python]
enginelocation      string[python]
wheelbase                  float64
carlength                  float64
carwidth                   float64
carheight                  float64
curbweight                   int64
enginetype          string[python]
cylindernumber      string[python]
enginesize                   int64
fuelsystem          string[python]
boreratio                  float64
stroke                     float64
compressionratio           float64
horsepower                   int64
peakrpm                      int64
citympg                      int64
highwaympg                   int64
price                      float64
dtype: object

__Observation__

- All the **object** columns now have the **string** type, so they are formatted consistently.

Let's now look at the columns, especially the column **CarName**.

In [9]:
df.columns

Index(['symboling', 'CarName', 'fueltype', 'aspiration', 'doornumber',
       'carbody', 'drivewheel', 'enginelocation', 'wheelbase', 'carlength',
       'carwidth', 'carheight', 'curbweight', 'enginetype', 'cylindernumber',
       'enginesize', 'fuelsystem', 'boreratio', 'stroke', 'compressionratio',
       'horsepower', 'peakrpm', 'citympg', 'highwaympg', 'price'],
      dtype='object')

In [10]:
df['CarName'] 

car_ID
1            alfa-romero giulia
2           alfa-romero stelvio
3      alfa-romero Quadrifoglio
4                   audi 100 ls
5                    audi 100ls
                 ...           
201             volvo 145e (sw)
202                 volvo 144ea
203                 volvo 244dl
204                   volvo 246
205                 volvo 264gl
Name: CarName, Length: 205, dtype: string

__Observation__

- Here, we can see that each value starts with the company name, followed by the model.


## Step 3: Extract the Company Name from the CarName Column

- Let's use a lambda function that extracts the company name (the first word) from each value in **CarName**.

In [11]:
df['Company Name'] = df['CarName'].apply(lambda x: x.split()[0])
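
The same extraction can also be written with pandas' vectorized string methods instead of `apply`. This is a sketch on three names copied from the output above; the variable names `names` and `company` are illustrative assumptions:

```python
import pandas as pd

names = pd.Series(
    ["alfa-romero giulia", "audi 100 ls", "volvo 145e (sw)"],
    dtype="string",
)

# Split each value on whitespace and keep the first token (the company name).
company = names.str.split().str[0]
print(company.tolist())
```

One difference worth knowing: the `.str` methods pass missing values through, whereas calling `x.split()` inside a lambda would fail on a missing value. Here the missing rows were already dropped, so both approaches behave the same.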

In [12]:
df['Company Name'].unique()

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury',
       'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche',
       'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta',
       'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)

__Observation__

- Here, we can see that the company names contain inconsistencies.
- For example, Mazda, Porsche, Toyota, and Volkswagen are misspelled in some rows (**maxda**, **porcshce**, **toyouta**, **vokswagen**), Volkswagen also appears as the abbreviation **vw**, and Nissan appears with two different capitalizations.
- When we format the data, we need to check for all of these.


Let's replace the incorrect values in this column and then look at the unique values again to verify the result.

In [13]:
cols = {'maxda':'mazda','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen','Nissan':'nissan'}

In [14]:
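# Assign the result back; an in-place replace on a selected column may not update df in recent pandas versions
df['Company Name'] = df['Company Name'].replace(cols)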

In [15]:
df['Company Name'].unique()

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'buick', 'mercury', 'mitsubishi',
       'nissan', 'peugeot', 'plymouth', 'porsche', 'renault', 'saab',
       'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)

__Observation__

- Hence, the misspellings and inconsistent capitalization in this column are corrected.
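
Some of these fixes can be handled generically before a mapping is applied. The sketch below, on made-up sample values, trims whitespace and lowercases the names first, so that a variant such as **Nissan** needs no mapping entry. The names `raw`, `normalized`, and `fixes` are assumptions for illustration:

```python
import pandas as pd

raw = pd.Series(["Nissan", " nissan", "vw", "maxda", "toyota"])

# Normalize whitespace and case first, so 'Nissan' and ' nissan' agree.
normalized = raw.str.strip().str.lower()

# Genuine misspellings and abbreviations still need an explicit mapping.
fixes = {"maxda": "mazda", "vw": "volkswagen"}
cleaned = normalized.replace(fixes)

print(sorted(cleaned.unique()))
```

Lowercasing only removes differences in capitalization; spelling errors still have to be found by inspecting the unique values.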

## Step 4: Normalization Techniques

Let's look at normalizing the data, using the **peakrpm** column as an example.

In [16]:
df['peakrpm'].head()

car_ID
1    5000
2    5000
3    5000
4    5500
5    5500
Name: peakrpm, dtype: int64

### Single-Feature Scaling

Let's scale this column using single-feature scaling, which divides each value by the column's maximum.

In [17]:
df['SF_peakrpm'] = df['peakrpm']/df['peakrpm'].max()

In [18]:
df['SF_peakrpm']

car_ID
1      0.757576
2      0.757576
3      0.757576
4      0.833333
5      0.833333
         ...   
201    0.818182
202    0.803030
203    0.833333
204    0.727273
205    0.818182
Name: SF_peakrpm, Length: 205, dtype: float64

__Observation__

- Here, we can observe from the output that each value is now expressed as a fraction of the column's maximum, so no scaled value exceeds 1.
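
The numbers in the output can be checked by hand. In the sketch below, 5000 and 5500 are values from the column, while 6600 is the column maximum implied by the output (5000 / 6600 ≈ 0.757576); treat it as an inferred assumption rather than a value read from the file:

```python
import pandas as pd

# 5000 and 5500 appear in the column; 6600 is the maximum implied by the output.
rpm = pd.Series([5000, 5500, 6600])

# Single-feature scaling: divide every value by the maximum.
scaled = rpm / rpm.max()
print(scaled.round(6).tolist())
```

The maximum itself always maps to exactly 1, while the smallest scaled value depends on the data and is generally not 0.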



### Min-Max Scaling

Let's now apply min-max scaling to the same column.
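
Min-max scaling maps the smallest value to 0 and the largest to 1. The sketch below wraps the formula in a hypothetical helper, `min_max_scale`, and tries it on four values. Here 5000 and 5500 come from the column, while 4150 and 6600 are the minimum and maximum implied by the outputs in this step (an inference, not values read from the file):

```python
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Map the minimum of `values` to 0 and the maximum to 1."""
    span = values.max() - values.min()
    if span == 0:
        # A constant column has no spread; return zeros instead of dividing by 0.
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / span

rpm = pd.Series([4150, 5000, 5500, 6600])
print(min_max_scale(rpm).round(6).tolist())
```

The guard for a constant column is a design choice made for this sketch; returning zeros is one reasonable convention, not the only one.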

In [19]:
df['MM_peakrpm'] = (df['peakrpm'] - df['peakrpm'].min() ) / (df['peakrpm'].max() - df['peakrpm'].min())

In [20]:
df['MM_peakrpm']

car_ID
1      0.346939
2      0.346939
3      0.346939
4      0.551020
5      0.551020
         ...   
201    0.510204
202    0.469388
203    0.551020
204    0.265306
205    0.510204
Name: MM_peakrpm, Length: 205, dtype: float64

__Observation__

- Hence, we can see that the **MM_peakrpm** values are scaled to the range from 0 to 1 using min-max scaling.

- Similarly, we can scale other columns using single-feature scaling, min-max scaling, Z-score, log scaling, or clipping.
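
The remaining three techniques can be sketched in the same style. The example below uses four sample RPM values rather than the full column, and the clipping bounds 4500 and 6000 are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd

rpm = pd.Series([4150, 5000, 5500, 6600], dtype="float64")

# Z-score: subtract the mean and divide by the standard deviation.
# Note: pandas' std() computes the sample standard deviation (ddof=1) by default.
z_score = (rpm - rpm.mean()) / rpm.std()

# Log scaling: natural logarithm; only valid for strictly positive values.
log_scaled = np.log(rpm)

# Clipping: cap values that fall outside the chosen bounds.
clipped = rpm.clip(lower=4500, upper=6000)

print(z_score.round(3).tolist())
print(log_scaled.round(3).tolist())
print(clipped.tolist())
```

After Z-score scaling, the values have a mean of 0 and a standard deviation of 1, so unlike min-max scaling the result is not confined to the range from 0 to 1.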