<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Data Encoding </p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION NOTEBOOK](00_introduction.ipynb)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [DATA ENCODING](#2)
* [THE END OF DATA ENCODING](#3)

<a id="1"></a>

## Importing Libraries Needed For Encoding

In [1]:
import numpy as np
import pandas as pd
import re

<a id="2"></a>
## Data Encoding


In [2]:
df = pd.read_json('data_post02.json', lines=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Data columns (total 33 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   make_model           15884 non-null  object 
 1   body_type            15884 non-null  object 
 2   price                15884 non-null  int64  
 3   km                   15884 non-null  float64
 4   prev_owner           15884 non-null  int64  
 5   hp                   15884 non-null  int64  
 6   type                 15884 non-null  object 
 7   first_registration   15884 non-null  int64  
 8   body_color           15884 non-null  object 
 9   paint_type           15884 non-null  object 
 10  nr_doors             15884 non-null  object 
 11  nr_seats             15884 non-null  object 
 12  gearing_type         15884 non-null  object 
 13  displacement         15884 non-null  int64  
 14  cylinders            15884 non-null  object 
 15  weight               15884 non-null 

In [4]:
df.columns

Index(['make_model', 'body_type', 'price', 'km', 'prev_owner', 'hp', 'type',
       'first_registration', 'body_color', 'paint_type', 'nr_doors',
       'nr_seats', 'gearing_type', 'displacement', 'cylinders', 'weight',
       'drive_chain', 'fuel', 'co2_emission', 'comfort_convenience',
       'entertainment_media', 'extras', 'safety_security', 'gears',
       'country_version', 'warranty_mo', 'vat_deductible',
       'upholstery_material', 'upholstery_color', 'emission_class',
       'consumption_comb', 'consumption_city', 'consumption_country'],
      dtype='object')

In [5]:
# Offset from the earliest registration year, so the newest cars get the largest values
df['age'] = df['first_registration'] - df['first_registration'].min()

In [6]:
df = df.drop('first_registration', axis=1)
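
A small sketch of what the `age` column holds, using made-up registration years rather than the real data. The `years_old` column is a hypothetical alternative (not used in this notebook) shown only to contrast the two conventions.

```python
import pandas as pd

# Toy registration years, invented for illustration only.
toy = pd.DataFrame({"first_registration": [2016, 2017, 2019, 2018]})

# Same computation as the cell above: years since the earliest registration year.
toy["age"] = toy["first_registration"] - toy["first_registration"].min()

# Hypothetical alternative: years before the newest registration year,
# so that larger values mean older cars.
toy["years_old"] = toy["first_registration"].max() - toy["first_registration"]

print(toy)
```

Both columns carry the same information for a linear model (one is a constant minus the other); only the sign of the fitted coefficient would differ.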

### Encoding multi-class features

In [7]:
cat_cols = ['make_model', 'body_type','prev_owner','type', 'body_color',
           'paint_type', 'nr_doors', 'nr_seats', 'gearing_type', 'drive_chain', 'fuel', 
            'country_version', 'upholstery_material', 'upholstery_color', 'emission_class', 'gears', 'cylinders']

In [8]:
for f in cat_cols:
    df[f] = df[f].astype(str)

In [9]:
# One-hot encode; drop_first=True drops the first level of each feature as the baseline
df2 = pd.get_dummies(df[cat_cols], prefix=cat_cols, drop_first=True)
df = df.drop(cat_cols, axis=1)
df = df.join(df2)
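
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values and the `toy_cat_cols` name are invented for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy data, not taken from the Auto Scout dataset.
toy = pd.DataFrame({
    "price": [10000, 12000, 9000, 15000],
    "fuel": ["benzine", "diesel", "benzine", "lpg"],
    "gears": [5, 6, 6, 7],
})
toy_cat_cols = ["fuel", "gears"]

# Cast to str so numeric categories such as gears are treated as labels.
toy[toy_cat_cols] = toy[toy_cat_cols].astype(str)

# drop_first=True omits the first (sorted) level of each feature,
# which then acts as the implicit baseline category.
dummies = pd.get_dummies(toy[toy_cat_cols], prefix=toy_cat_cols, drop_first=True)
encoded = toy.drop(toy_cat_cols, axis=1).join(dummies)

print(encoded.columns.tolist())
```

In this toy frame a benzine car with 5 gears is represented by zeros in every dummy column, because `fuel_benzine` and `gears_5` are the dropped baseline levels.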

In [10]:
df.count()

price                          15884
km                             15884
hp                             15884
displacement                   15884
weight                         15884
                               ...  
emission_class_euro 6d-temp    15884
gears_6                        15884
gears_7                        15884
gears_8                        15884
cylinders_>=4                  15884
Length: 81, dtype: int64

In [11]:
df.columns

Index(['price', 'km', 'hp', 'displacement', 'weight', 'co2_emission',
       'comfort_convenience', 'entertainment_media', 'extras',
       'safety_security', 'warranty_mo', 'vat_deductible', 'consumption_comb',
       'consumption_city', 'consumption_country', 'age', 'make_model_audi_a3',
       'make_model_opel_astra', 'make_model_opel_corsa',
       'make_model_opel_insignia', 'make_model_renault_clio',
       'make_model_renault_espace', 'body_type_other', 'body_type_sedans',
       'body_type_station wagon', 'body_type_van', 'prev_owner_1',
       'prev_owner_2', 'prev_owner_3', 'prev_owner_4', 'type_employees_car',
       'type_new', 'type_pre_registered', 'type_used', 'body_color_black',
       'body_color_blue', 'body_color_bronze', 'body_color_brown',
       'body_color_gold', 'body_color_green', 'body_color_grey',
       'body_color_orange', 'body_color_red', 'body_color_silver',
       'body_color_violet', 'body_color_white', 'body_color_yellow',
       'paint_type_Perl effe

### Encoding nested features
* Find the unique items in each comma-separated list column.
* Turn each item into a 0/1 feature whose name carries an identifying prefix (`com`, `ent`, `ext`, `saf`).

In [12]:
# Short prefix for each nested (comma-separated) column
pre = {'comfort_convenience': 'com',
       'entertainment_media': 'ent',
       'extras': 'ext',
       'safety_security': 'saf'}
for col in pre:
    # Collect every distinct item that appears in this column
    unique = set()
    for v in df[col]:
        if not isinstance(v, str):
            continue
        unique.update(v.split(','))
    for f in unique:
        name = pre[col] + '_' + re.sub(r'\s', '_', f.lower())
        if name not in df.columns:
            # re.escape keeps characters such as '(' or '+' in an item name literal
            df[name] = [1 if isinstance(x, str) and re.search(re.escape(f), x) else 0
                        for x in df[col]]
        else:
            print('name repeated')

  


In [13]:
df['saf_head_airbag'].value_counts()

0    13698
1     2186
Name: saf_head_airbag, dtype: int64
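
As a point of comparison, pandas has a built-in for this kind of delimiter-separated column, `Series.str.get_dummies`. The sketch below uses an invented toy column and is not the method used in this notebook. It matches whole items between separators, whereas the loop above does a substring search, so the two could differ if one item name is contained in another.

```python
import pandas as pd

# Toy nested column, invented for illustration only.
toy = pd.DataFrame({
    "safety_security": ["ABS,Head airbag", "ABS", None, "Head airbag,Isofix"],
})

# Split each string on ',' and build one 0/1 column per distinct item.
# Missing values produce a row of zeros.
saf = toy["safety_security"].str.get_dummies(sep=",")

# Apply the same naming scheme as above: prefix plus lower-cased, underscored name.
saf.columns = ["saf_" + c.lower().replace(" ", "_") for c in saf.columns]

print(saf)
```

Note that `str.get_dummies` does not strip whitespace around items, so a column written as `"ABS, Isofix"` would need cleaning first.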

In [14]:
df = df.drop(['comfort_convenience','entertainment_media', 
            'extras', 'safety_security'], axis=1)

In [15]:
for f in df.columns:
    df[f] = df[f].fillna(0)

In [16]:
for f in df.columns:
    if df[f].isna().sum() > 0:
        print('nan in', f)
    print(df[f].dtype)

int64
float64
int64
int64
float64
int64
int64
int64
float64
float64
float64
int64
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
uint8
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
int64
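
The same two checks (remaining NaNs and column dtypes) can be written without a per-column loop. This is a minimal sketch on a toy frame with invented values, assuming only standard pandas calls; it prints one summary instead of one line per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, invented for illustration only.
toy = pd.DataFrame({
    "km": [1000.0, np.nan, 30000.0],
    "hp": [85, 110, 66],
    "saf_abs": [1, 0, 1],
})

# Columns that still contain NaN, with their counts.
nan_counts = toy.isna().sum()
print(nan_counts[nan_counts > 0])

# Fill every column at once.
toy = toy.fillna(0)

# Number of columns per dtype.
print(toy.dtypes.value_counts())
```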


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Columns: 171 entries, price to saf_tire_pressure_monitoring_system
dtypes: float64(5), int64(101), uint8(65)
memory usage: 13.8 MB


In [18]:
df.to_json('data_post03.json', orient='records', lines=True)
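
For reference, `orient='records'` with `lines=True` writes one JSON object per row, which is the format the next notebook can read back with `pd.read_json(..., lines=True)`. The sketch below round-trips a toy frame through an in-memory string instead of a file; the values are invented.

```python
import io
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "price": [10000, 12000],
    "km": [1000.5, 30000.0],
    "gears_6": [0, 1],
})

# One JSON object per line, one line per row.
text = toy.to_json(orient="records", lines=True)
print(text)

# Reading back with lines=True restores the columns; dtypes are re-inferred.
restored = pd.read_json(io.StringIO(text), lines=True)
print(restored.dtypes)
```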

## Summary

* I converted many of the categorical features into dummy features and then dropped the original columns.
* I also converted the four nested features into dummy features and then dropped the original columns.
* In the process, the number of columns increased from 33 to 171 (see the `df.info()` output above).
* Incidentally, this also ordered the features: first the continuous ones, then the dummies from categorical features, then the dummies from nested features.

<a id="3"></a>

## End of Encoding

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Next: [Modeling](05_modeling.ipynb)