**1. Use the ToyotaCorolla dataset. Prepare the dataset by doing the following:**

    a. Summarize the dataset. (Do not just paste the R/Python output; make observations based
    on the numbers.)
    b. Normalize the variable kilometers (KM).
    c. Create dummies for the variable Fuel Type (see the script file in class for an example).
    d. Partition the data into three sets (see the script file in class for an example).

## Data Dictionary

| Column              | Description                                                   |
|---------------------|---------------------------------------------------------------|
| **Id**              | Record_ID                                                     |
| **Model**           | Model Description                                              |
| **Price**           | Offer Price in EUROs                                           |
| **Age_08_04**       | Age in months as in August 2004                                |
| **Mfg_Month**       | Manufacturing month (1-12)                                     |
| **Mfg_Year**        | Manufacturing Year                                             |
| **KM**              | Accumulated Kilometers on odometer                             |
| **Fuel_Type**       | Fuel Type (Petrol, Diesel, CNG)                                |
| **HP**              | Horse Power                                                    |
| **Met_Color**       | Metallic Color? (Yes=1, No=0)                                  |
| **Color**           | Color (Blue, Red, Grey, Silver, Black, etc.)                   |
| **Automatic**       | Automatic (Yes=1, No=0)                                        |
| **CC**              | Cylinder Volume in cubic centimeters                           |
| **Doors**           | Number of doors                                                |
| **Cylinders**       | Number of cylinders                                            |
| **Gears**           | Number of gear positions                                       |
| **Quarterly_Tax**   | Quarterly road tax in EUROs                                    |
| **Weight**          | Weight in Kilograms                                            |
| **Mfr_Guarantee**   | Within Manufacturer's Guarantee period (Yes=1, No=0)           |
| **BOVAG_Guarantee** | BOVAG (Dutch dealer network) Guarantee (Yes=1, No=0)           |
| **Guarantee_Period**| Guarantee period in months                                     |
| **ABS**             | Anti-Lock Brake System (Yes=1, No=0)                           |
| **Airbag_1**        | Driver Airbag (Yes=1, No=0)                                    |
| **Airbag_2**        | Passenger Airbag (Yes=1, No=0)                                 |
| **Airco**           | Airconditioning (Yes=1, No=0)                                  |
| **Automatic_airco** | Automatic Airconditioning (Yes=1, No=0)                        |
| **Boardcomputer**   | Boardcomputer (Yes=1, No=0)                                    |
| **CD_Player**       | CD Player (Yes=1, No=0)                                        |
| **Central_Lock**    | Central Lock (Yes=1, No=0)                                     |
| **Powered_Windows** | Powered Windows (Yes=1, No=0)                                  |
| **Power_Steering**  | Power Steering (Yes=1, No=0)                                   |
| **Radio**           | Radio (Yes=1, No=0)                                            |
| **Mistlamps**       | Mistlamps (Yes=1, No=0)                                        |
| **Sport_Model**     | Sport Model (Yes=1, No=0)                                      |
| **Backseat_Divider**| Backseat Divider (Yes=1, No=0)                                 |
| **Metallic_Rim**    | Metallic Rim (Yes=1, No=0)                                     |
| **Radio_cassette**  | Radio Cassette (Yes=1, No=0)                                   |
| **Parking_Assistant**| Parking assistance system (Yes=1, No=0)                       |
| **Tow_Bar**         | Tow Bar (Yes=1, No=0)                                          |




In [35]:
# importing the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 

In [37]:
# Importing the dataset
ToyotaCorolla_df = pd.read_csv('ToyotaCorolla.csv')

### Summarizing the dataset

In [40]:
ToyotaCorolla_df.head()

|   | Id | Model | Price | Age_08_04 | Mfg_Month | Mfg_Year | KM | Fuel_Type | HP | Met_Color | ... | Powered_Windows | Power_Steering | Radio | Mistlamps | Sport_Model | Backseat_Divider | Metallic_Rim | Radio_cassette | Parking_Assistant | Tow_Bar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | 13500 | 23 | 10 | 2002 | 46986 | Diesel | 90 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | 13750 | 23 | 10 | 2002 | 72937 | Diesel | 90 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | 13950 | 24 | 9 | 2002 | 41711 | Diesel | 90 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 4 | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | 14950 | 26 | 7 | 2002 | 48000 | Diesel | 90 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 5 | TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors | 13750 | 30 | 3 | 2002 | 38500 | Diesel | 90 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |


In [42]:
ToyotaCorolla_df.columns

Index(['Id', 'Model', 'Price', 'Age_08_04', 'Mfg_Month', 'Mfg_Year', 'KM',
       'Fuel_Type', 'HP', 'Met_Color', 'Color', 'Automatic', 'CC', 'Doors',
       'Cylinders', 'Gears', 'Quarterly_Tax', 'Weight', 'Mfr_Guarantee',
       'BOVAG_Guarantee', 'Guarantee_Period', 'ABS', 'Airbag_1', 'Airbag_2',
       'Airco', 'Automatic_airco', 'Boardcomputer', 'CD_Player',
       'Central_Lock', 'Powered_Windows', 'Power_Steering', 'Radio',
       'Mistlamps', 'Sport_Model', 'Backseat_Divider', 'Metallic_Rim',
       'Radio_cassette', 'Parking_Assistant', 'Tow_Bar'],
      dtype='object')

In [44]:
print(len(ToyotaCorolla_df.columns))
print(len(ToyotaCorolla_df))

39
1436


**The dataset has 1436 rows and 39 columns.**

In [47]:
ToyotaCorolla_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1436 entries, 0 to 1435
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Id                 1436 non-null   int64 
 1   Model              1436 non-null   object
 2   Price              1436 non-null   int64 
 3   Age_08_04          1436 non-null   int64 
 4   Mfg_Month          1436 non-null   int64 
 5   Mfg_Year           1436 non-null   int64 
 6   KM                 1436 non-null   int64 
 7   Fuel_Type          1436 non-null   object
 8   HP                 1436 non-null   int64 
 9   Met_Color          1436 non-null   int64 
 10  Color              1436 non-null   object
 11  Automatic          1436 non-null   int64 
 12  CC                 1436 non-null   int64 
 13  Doors              1436 non-null   int64 
 14  Cylinders          1436 non-null   int64 
 15  Gears              1436 non-null   int64 
 16  Quarterly_Tax      1436 non-null   int64 


**Based on the `info()` output and the data dictionary, the features can be grouped as follows (Id is a record identifier and is left out of both lists).**

**Categorical features:**
1. Model
2. Fuel_Type
3. Met_Color
4. Color
5. Automatic
6. Doors
7. Gears
8. Mfr_Guarantee
9. BOVAG_Guarantee
10. ABS
11. Airbag_1
12. Airbag_2
13. Airco
14. Automatic_airco
15. Boardcomputer
16. CD_Player
17. Central_Lock
18. Powered_Windows
19. Power_Steering
20. Radio
21. Mistlamps
22. Sport_Model
23. Backseat_Divider
24. Metallic_Rim
25. Radio_cassette
26. Parking_Assistant
27. Tow_Bar
28. Cylinders

**Continuous features:**
1. Price
2. Age_08_04
3. KM
4. CC
5. Quarterly_Tax
6. Guarantee_Period
7. Mfg_Year
8. Mfg_Month
9. HP
10. Weight
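
The grouping above was made by reading the data dictionary. A rough programmatic cross-check is sketched below: it treats non-numeric columns and numeric columns with only a few distinct values as categorical. The frame `sample_df`, the helper `split_feature_types`, and the threshold `max_levels=4` are all illustrative assumptions, not part of the original notebook, and the threshold would need tuning on the real data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny illustrative frame (made-up rows, NOT the real ToyotaCorolla data)
sample_df = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950, 9900, 8450],
    "KM": [46986, 72937, 41711, 48000, 63389, 87020],
    "Fuel_Type": ["Diesel", "Diesel", "Petrol", "Petrol", "CNG", "Petrol"],
    "Automatic": [0, 0, 1, 0, 0, 1],
    "Doors": [3, 3, 5, 5, 4, 5],
})

def split_feature_types(df, max_levels=4):
    """Hypothetical helper: non-numeric or low-cardinality columns count as categorical."""
    categorical, continuous = [], []
    for col in df.columns:
        is_text = not is_numeric_dtype(df[col])
        few_levels = df[col].nunique() <= max_levels
        if is_text or few_levels:
            categorical.append(col)
        else:
            continuous.append(col)
    return categorical, continuous

categorical_cols, continuous_cols = split_feature_types(sample_df)
print("Categorical:", categorical_cols)
print("Continuous :", continuous_cols)
```

A heuristic like this is only a starting point; columns such as Mfg_Year or Guarantee_Period take few distinct values but may still be better treated as numeric, so the data dictionary remains the final reference.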

In [50]:
ToyotaCorolla_df.isna().sum()

Id                   0
Model                0
Price                0
Age_08_04            0
Mfg_Month            0
Mfg_Year             0
KM                   0
Fuel_Type            0
HP                   0
Met_Color            0
Color                0
Automatic            0
CC                   0
Doors                0
Cylinders            0
Gears                0
Quarterly_Tax        0
Weight               0
Mfr_Guarantee        0
BOVAG_Guarantee      0
Guarantee_Period     0
ABS                  0
Airbag_1             0
Airbag_2             0
Airco                0
Automatic_airco      0
Boardcomputer        0
CD_Player            0
Central_Lock         0
Powered_Windows      0
Power_Steering       0
Radio                0
Mistlamps            0
Sport_Model          0
Backseat_Divider     0
Metallic_Rim         0
Radio_cassette       0
Parking_Assistant    0
Tow_Bar              0
dtype: int64

In [52]:
ToyotaCorolla_df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 1436.0 | 721.555014 | 416.47689 | 1.0 | 361.75 | 721.5 | 1081.25 | 1442.0 |
| Price | 1436.0 | 10730.824513 | 3626.964585 | 4350.0 | 8450.0 | 9900.0 | 11950.0 | 32500.0 |
| Age_08_04 | 1436.0 | 55.947075 | 18.599988 | 1.0 | 44.0 | 61.0 | 70.0 | 80.0 |
| Mfg_Month | 1436.0 | 5.548747 | 3.354085 | 1.0 | 3.0 | 5.0 | 8.0 | 12.0 |
| Mfg_Year | 1436.0 | 1999.625348 | 1.540722 | 1998.0 | 1998.0 | 1999.0 | 2001.0 | 2004.0 |
| KM | 1436.0 | 68533.259749 | 37506.448872 | 1.0 | 43000.0 | 63389.5 | 87020.75 | 243000.0 |
| HP | 1436.0 | 101.502089 | 14.98108 | 69.0 | 90.0 | 110.0 | 110.0 | 192.0 |
| Met_Color | 1436.0 | 0.674791 | 0.468616 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Automatic | 1436.0 | 0.05571 | 0.229441 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CC | 1436.0 | 1576.85585 | 424.38677 | 1300.0 | 1400.0 | 1600.0 | 1600.0 | 16000.0 |


In [54]:
# Summarizing the continuous variables
continuous_cols = ['Price', 'Age_08_04', 'KM', 'CC', 'Quarterly_Tax', 'Guarantee_Period',
                   'Mfg_Year', 'Mfg_Month', 'HP', 'Weight']
ToyotaCorolla_df[continuous_cols].describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Price | 1436.0 | 10730.824513 | 3626.964585 | 4350.0 | 8450.0 | 9900.0 | 11950.0 | 32500.0 |
| Age_08_04 | 1436.0 | 55.947075 | 18.599988 | 1.0 | 44.0 | 61.0 | 70.0 | 80.0 |
| KM | 1436.0 | 68533.259749 | 37506.448872 | 1.0 | 43000.0 | 63389.5 | 87020.75 | 243000.0 |
| CC | 1436.0 | 1576.85585 | 424.38677 | 1300.0 | 1400.0 | 1600.0 | 1600.0 | 16000.0 |
| Quarterly_Tax | 1436.0 | 87.122563 | 41.128611 | 19.0 | 69.0 | 85.0 | 85.0 | 283.0 |
| Guarantee_Period | 1436.0 | 3.81546 | 3.011025 | 3.0 | 3.0 | 3.0 | 3.0 | 36.0 |
| Mfg_Year | 1436.0 | 1999.625348 | 1.540722 | 1998.0 | 1998.0 | 1999.0 | 2001.0 | 2004.0 |
| Mfg_Month | 1436.0 | 5.548747 | 3.354085 | 1.0 | 3.0 | 5.0 | 8.0 | 12.0 |
| HP | 1436.0 | 101.502089 | 14.98108 | 69.0 | 90.0 | 110.0 | 110.0 | 192.0 |
| Weight | 1436.0 | 1072.45961 | 52.64112 | 1000.0 | 1040.0 | 1070.0 | 1085.0 | 1615.0 |
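
To turn the summary into observations, one option is to compare each maximum against the upper fence of the common 1.5 × IQR rule and against the median. The sketch below does this for four variables, using only the quartiles and maxima copied from the `describe()` output; the choice of the 1.5 × IQR rule and the derived column names are assumptions made for illustration.

```python
import pandas as pd

# Summary statistics copied from the describe() output in this notebook
summary = pd.DataFrame(
    {
        "25%": [8450.0, 43000.0, 1400.0, 1040.0],
        "50%": [9900.0, 63389.5, 1600.0, 1070.0],
        "75%": [11950.0, 87020.75, 1600.0, 1085.0],
        "max": [32500.0, 243000.0, 16000.0, 1615.0],
    },
    index=["Price", "KM", "CC", "Weight"],
)

# Upper fence of the 1.5 * IQR rule (a convention, not a hard definition of an outlier)
iqr = summary["75%"] - summary["25%"]
summary["upper_fence"] = summary["75%"] + 1.5 * iqr
summary["max_above_fence"] = summary["max"] > summary["upper_fence"]

# How many times larger the maximum is than the median
summary["max_over_median"] = summary["max"] / summary["50%"]

print(summary[["max", "upper_fence", "max_above_fence", "max_over_median"]])
```

All four maxima lie above their fences, so each variable has a long right tail. The CC maximum of 16000 is ten times the median of 1600, which may point to a data-entry error and would be worth checking before modeling.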


### Normalizing the variable kilometers

In [58]:
from sklearn.preprocessing import StandardScaler

# Using sklearn: StandardScaler standardizes KM to z-scores (mean 0, standard deviation 1)
scaler = StandardScaler()
ToyotaCorolla_df[['KM']] = pd.DataFrame(scaler.fit_transform(ToyotaCorolla_df[['KM']]))

In [60]:
ToyotaCorolla_df.head().T

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Id | 1 | 2 | 3 | 4 | 5 |
| Model | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors |
| Price | 13500 | 13750 | 13950 | 14950 | 13750 |
| Age_08_04 | 23 | 23 | 24 | 26 | 30 |
| Mfg_Month | 10 | 10 | 9 | 7 | 3 |
| Mfg_Year | 2002 | 2002 | 2002 | 2002 | 2002 |
| KM | -0.574695 | 0.117454 | -0.715386 | -0.54765 | -0.801028 |
| Fuel_Type | Diesel | Diesel | Diesel | Diesel | Diesel |
| HP | 90 | 90 | 90 | 90 | 90 |
| Met_Color | 1 | 1 | 1 | 0 | 0 |


In [62]:
ToyotaCorolla_df['KM'].head()

0   -0.574695
1    0.117454
2   -0.715386
3   -0.547650
4   -0.801028
Name: KM, dtype: float64

### Create dummies for the variable Fuel Type


In [65]:
data_with_dummies = pd.get_dummies(ToyotaCorolla_df, columns=['Fuel_Type'])
data_with_dummies.head().T

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Id | 1 | 2 | 3 | 4 | 5 |
| Model | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors | TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors |
| Price | 13500 | 13750 | 13950 | 14950 | 13750 |
| Age_08_04 | 23 | 23 | 24 | 26 | 30 |
| Mfg_Month | 10 | 10 | 9 | 7 | 3 |
| Mfg_Year | 2002 | 2002 | 2002 | 2002 | 2002 |
| KM | -0.574695 | 0.117454 | -0.715386 | -0.54765 | -0.801028 |
| HP | 90 | 90 | 90 | 90 | 90 |
| Met_Color | 1 | 1 | 1 | 0 | 0 |
| Color | Blue | Silver | Blue | Black | Black |
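
The truncated output does not reach the new dummy columns, so the toy example below shows what `pd.get_dummies` produces for the three fuel types named in the data dictionary. It also shows `drop_first=True`, which drops one level to avoid perfectly collinear dummies in a regression. The four-row frame `fuel` is made up for illustration.

```python
import pandas as pd

# Made-up rows covering the three fuel types from the data dictionary
fuel = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"]})

# One dummy column per level; levels are ordered alphabetically
full = pd.get_dummies(fuel, columns=["Fuel_Type"], dtype=int)

# drop_first=True removes the first level (CNG), which becomes the baseline
reduced = pd.get_dummies(fuel, columns=["Fuel_Type"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

With all three dummies kept, each row sums to exactly 1 across them, which is why a linear model with an intercept usually needs one level dropped.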


### Partition the data into three sets

In [78]:
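# Using sklearn: hold out 30% of the rows, then split that 30% into validation (80%) and test (20%).
# Overall this is roughly a 70% / 24% / 6% split.
# Note: ToyotaCorolla_df is partitioned here, so the Fuel_Type dummies in data_with_dummies are not
# carried into the three sets; pass data_with_dummies instead to keep them.
trainData, temp = train_test_split(ToyotaCorolla_df, test_size=0.3, random_state=1)
validData, testData = train_test_split(temp, test_size=0.2, random_state=1)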
print('Training   : ', trainData.shape)
print('Validation : ', validData.shape)
print('Test       : ', testData.shape)

Training   :  (1005, 39)
Validation :  (344, 39)
Test       :  (87, 39)
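
A variant worth considering is sketched below. It partitions a dummy-coded frame, uses an explicit 50% / 30% / 20% split, and fits the scaler on the training rows only so that validation and test statistics do not leak into the scaling. The 50/30/20 proportions, the randomly generated stand-in frame `demo_df`, and the `KM_scaled` column are assumptions for illustration, not the notebook's own choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in frame with 1436 random rows so the sketch runs without the CSV
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    "KM": rng.integers(1, 243000, size=1436),
    "Fuel_Type": rng.choice(["Petrol", "Diesel", "CNG"], size=1436),
})
demo_df = pd.get_dummies(demo_df, columns=["Fuel_Type"], dtype=int)

# Step 1: 50% training. Step 2: split the remaining 50% into
# 30% validation and 20% test of the full data (0.2 / 0.5 = 0.4).
train_df, temp_df = train_test_split(demo_df, train_size=0.5, random_state=1)
valid_df, test_df = train_test_split(temp_df, test_size=0.4, random_state=1)

# Fit the scaler on the training partition only, then apply it to all three
scaler = StandardScaler().fit(train_df[["KM"]])
train_df = train_df.assign(KM_scaled=scaler.transform(train_df[["KM"]]).ravel())
valid_df = valid_df.assign(KM_scaled=scaler.transform(valid_df[["KM"]]).ravel())
test_df = test_df.assign(KM_scaled=scaler.transform(test_df[["KM"]]).ravel())

print("Training   : ", train_df.shape)
print("Validation : ", valid_df.shape)
print("Test       : ", test_df.shape)
```

Applied to the real data, the same pattern would start from `data_with_dummies` with the raw (unscaled) KM column.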
