<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Data Encoding </p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [DATA ENCODING](#2)
* [THE END OF DATA ENCODING](#3)

<a id="0"></a>

## Introduction


Welcome to "***Auto Scout Car Price Prediction Project***". 

**Auto Scout** data used for this project, were scraped from the on-line car trading company, Auto Scout, in 2019, contains many features of 9 different car models. In this project, I will go through all the steps of a data project: data cleaning, modeling, features selection, and model selection. 

In the first part of this project I will apply many commonly used algorithms for data cleaning and exploratory data analysis by using many Python libraries such as Numpy, Pandas, Matplotlib, Seaborn, Scipy.

These are the steps for the first part. 
* **[data cleaning](00_data_cleaning.ipynb)** -  dealing with incorrect headers (column names), incorrect format, anomalies, and dropping obviously  useless columns.
* **[data imputation](01_data_imputation.ipynb)** - handling missing values, reducing classes in features to be encoded.
* **[handling outliers](02_data_viz_&_outliers.ipynb)** -  via visualisation libraries. Some insights are extracted.

In the second part of the project I explore many types of models for predicting prices. I explore OLS, Ridge, Lasso, SGD, Random Forest, XGB, light GBM, and catBoost.

* **[data encoding](03_data_encoding.ipynb)** in preparation for modeling: converting multiclass features into dummy columns, making dummy columns from nested features.
* **[modeling](04_modeling.ipynb)** trying out different models, model selection, feature selection, and cross-validation.

<a id="1"></a>

## Importing Libraries Needed For Encoding

In [1]:
import numpy as np
import pandas as pd
import regex as re

<a id="2"></a>
## Data Encoding


In [2]:
df = pd.read_json('data_post02.json', lines=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Data columns (total 33 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   make_model           15884 non-null  object 
 1   body_type            15884 non-null  object 
 2   price                15884 non-null  int64  
 3   km                   15884 non-null  float64
 4   prev_owner           15884 non-null  int64  
 5   hp                   15884 non-null  int64  
 6   type                 15884 non-null  object 
 7   first_registration   15884 non-null  int64  
 8   body_color           15884 non-null  object 
 9   paint_type           15884 non-null  object 
 10  nr_doors             15884 non-null  object 
 11  nr_seats             15884 non-null  object 
 12  gearing_type         15884 non-null  object 
 13  displacement         15884 non-null  int64  
 14  cylinders            15884 non-null  object 
 15  weight               15884 non-null 

In [4]:
df.columns

Index(['make_model', 'body_type', 'price', 'km', 'prev_owner', 'hp', 'type',
       'first_registration', 'body_color', 'paint_type', 'nr_doors',
       'nr_seats', 'gearing_type', 'displacement', 'cylinders', 'weight',
       'drive_chain', 'fuel', 'co2_emission', 'comfort_convenience',
       'entertainment_media', 'extras', 'safety_security', 'gears',
       'country_version', 'warranty_mo', 'vat_deductible',
       'upholstery_material', 'upholstery_color', 'emission_class',
       'consumption_comb', 'consumption_city', 'consumption_country'],
      dtype='object')

### Encoding multi-class features

In [5]:
cat_cols = ['make_model', 'body_type','prev_owner','type', 'first_registration', 'body_color',
           'paint_type', 'nr_doors', 'nr_seats', 'gearing_type', 'drive_chain', 'fuel', 
            'country_version', 'upholstery_material', 'upholstery_color', 'emission_class', 'gears', 'cylinders']

In [6]:
for f in cat_cols:
    df[f] = df[f].astype(str)

In [7]:
df2 = pd.get_dummies(df[cat_cols], prefix=cat_cols, drop_first=False)
df = pd.concat([df,df2])

In [8]:
df.columns

Index(['make_model', 'body_type', 'price', 'km', 'prev_owner', 'hp', 'type',
       'first_registration', 'body_color', 'paint_type',
       ...
       'emission_class_euro 6', 'emission_class_euro 6c',
       'emission_class_euro 6d', 'emission_class_euro 6d-temp', 'gears_5',
       'gears_6', 'gears_7', 'gears_8', 'cylinders_<=3', 'cylinders_>=4'],
      dtype='object', length=119)

In [9]:
df = df.drop(cat_cols, axis=1)

In [10]:
df.columns

Index(['price', 'km', 'hp', 'displacement', 'weight', 'co2_emission',
       'comfort_convenience', 'entertainment_media', 'extras',
       'safety_security',
       ...
       'emission_class_euro 6', 'emission_class_euro 6c',
       'emission_class_euro 6d', 'emission_class_euro 6d-temp', 'gears_5',
       'gears_6', 'gears_7', 'gears_8', 'cylinders_<=3', 'cylinders_>=4'],
      dtype='object', length=101)

### Encoding nested features
* find unique features from these lists
* make each into a feature with an identifiable prefix.

In [11]:
pre = {'comfort_convenience':'com',
     'entertainment_media':'ent',
     'extras':'ext', 
     'safety_security':'saf'}
for col in ['comfort_convenience','entertainment_media', 
            'extras', 'safety_security']:
    df[col] = [set(x.split(',')) if x.__class__ == str else np.nan 
                       for x in df[col]]
    print(df[col])
    unique = set()
    for v in df[col]:
        if (v.__class__ == float) or (v == None):
            continue
        for f in v:
            unique.add(f)
    for f in unique:
        name = pre[col]+'_'+re.sub('\s','_',f.lower())
        df[name] = [0 if ((x.__class__ == float)|(x == None)) else 1*(f in x) for x in df[col]]
        if df[name].count() != 15884:
            print(name)
df.columns 

0        {Cruise control, Multi-function steering wheel...
1        {Air conditioning, Tinted windows, Parking ass...
2        {Cruise control, Multi-function steering wheel...
3        {Heads-up display, Multi-function steering whe...
4        {Multi-function steering wheel, Rain sensor, A...
                               ...                        
15879                                                  NaN
15880                                                  NaN
15881                                                  NaN
15882                                                  NaN
15883                                                  NaN
Name: comfort_convenience, Length: 31768, dtype: object
com_multi-function_steering_wheel
com_electric_tailgate
com_park_distance_control
com_parking_assist_system_sensors_rear
com_hill_holder
com_rain_sensor
com_panorama_roof
com_navigation_system
com_air_suspension
com_parking_assist_system_sensors_front
com_keyless_central_door_lock
com_split_rea

Index(['price', 'km', 'hp', 'displacement', 'weight', 'co2_emission',
       'comfort_convenience', 'entertainment_media', 'extras',
       'safety_security',
       ...
       'saf_side_airbag', 'saf_passenger-side_airbag',
       'saf_adaptive_headlights', 'saf_power_steering', 'saf_rear_airbag',
       'saf_immobilizer', 'saf_blind_spot_monitor',
       'saf_led_daytime_running_lights', 'saf_abs', 'saf_driver-side_airbag'],
      dtype='object', length=195)

In [12]:
df['saf_head_airbag'].value_counts()

0    29582
1     2186
Name: saf_head_airbag, dtype: int64

In [13]:
df = df.drop(['comfort_convenience','entertainment_media', 
            'extras', 'safety_security'], axis=1)

In [14]:
for f in df.columns:
    df[f] = df[f].fillna(0)

In [15]:
for f in df.columns:
    if df[f].isna().sum() > 0:
        print('nan in', f)
    print(df[f].dtype)

float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
float64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
in

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31768 entries, 0 to 15883
Columns: 191 entries, price to saf_driver-side_airbag
dtypes: float64(97), int64(94)
memory usage: 46.5 MB


In [17]:
df.to_json('data_post03.json', orient='records', lines=True)

## Summary

* I convered many of the category features into dummy features and then dropped these. 
* I also converted many nested features into dummy features and then dropped these.
* In the process increaased the number of features from 33 to 191.
* also incidentally sorted the features, fist continuous, then dummies from categorical, then dummies from nested.

<a id="3"></a>

## End of Encoding

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Next: [Data Visualizations & Handling Outliers](03_data_viz_&_handling_outliers.ipynb)