# Cleaning Car Data

## Columns:

**General Columns** 
- url: url of autos 
- short_description, description: Description of autos (in English and German) written by users 

**Categorical Columns**  
- make_model, make, model: Model of autos. Ex:Audi A1 
- body_type, body: Body type of autos Example: van, sedans
- vat: VAT deductible, price negotiable 
- registration, first_registration: First registration date and year of autos. 
- prev_owner, previous_owners: Number of previous owners
- type: new or used 
- next_inspection, inspection_new: information about inspection (inspection date,..) 
- body_color, body_color_original: Color of auto Ex: Black, red
- paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- upholstery: Upholstery information (texture, color) 
- gearing_type: Type of gear Ex: automatic, manual 
- fuel: fuel type Ex: diesel, benzine 
- co2_emission, emission_class, emission_label: emission information 
- drive_chain: drive chain Ex: front,rear, 4WD 
- consumption: consumption of auto in city, country and combination (lt/100 km) 
- country_version 
- entertainment_media 
- safety_security 
- comfort_convenience 
- extras 

**Quantitative Columns**
- price: Price of cars 
- km: km of autos 
- hp: engine power of autos (kW) 
- displacement: displacement of autos (cc) 
- warranty: warranty period (month) 
- weight: weight of auto (kg) 
- nr_of_doors: number of doors 
- nr_of_seats : number of seats 
- cylinders: number of cylinders 
- gears: number of gears

---

In [38]:
import numpy as np
import pandas as pd


car_data = pd.read_json('scout_car.json', lines=True)
car_data.columns = car_data.columns.str.lower().str.replace(' ', '_').str.replace('.', '', regex=False).str.replace('_&_', '_').str.strip()
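
The effect of the renaming chain is easier to see on a handful of names. The sketch below uses made-up column names (an assumption, the raw names in `scout_car.json` are not shown in this notebook) and passes `regex=False` so that `.` is treated as a literal dot rather than a regex wildcard.

```python
import pandas as pd

# Hypothetical raw column names, only for illustration
raw = pd.Index(["Make & Model", "Nr. of Doors", "Comfort & Convenience", " Body Color "])

normalized = (
    raw.str.strip()                          # drop surrounding whitespace first
       .str.lower()
       .str.replace(" ", "_", regex=False)
       .str.replace(".", "", regex=False)    # literal dot, not the regex wildcard
       .str.replace("_&_", "_", regex=False)
)

print(list(normalized))
```

Stripping before replacing spaces is a small design choice: it keeps leading or trailing blanks from turning into underscores.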

In [39]:
car_data.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| url | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... |
| make_model | Audi A1 | Audi A1 | Audi A1 |
| short_description | Sportback 1.4 TDI S-tronic Xenon Navi Klima | 1.8 TFSI sport | Sportback 1.6 TDI S tronic Einparkhilfe plus+m... |
| body_type | Sedans | Sedans | Sedans |
| price | 15770 | 14500 | 14640 |
| vat | VAT deductible | Price negotiable | VAT deductible |
| km | 56,013 km | 80,000 km | 83,450 km |
| registration | 01/2016 | 03/2017 | 02/2016 |
| prev_owner | 2 previous owners | | 1 previous owner |
| kw | | | |


In [40]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kw                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  type                           15917 non-null  object 
 12  previous_owners                9279 non-null  

---

## Dropping columns with at least 90% missing values

In [41]:
def df_nans(df, limit):
    missing = df.isnull().sum()*100 / df.shape[0]
    return missing.loc[lambda x : x >= limit]

def column_nans(serial):
    # display percentage of nans in a Series
    return serial.isnull().sum()*100 / serial.shape[0]

In [42]:
df_nans(car_data, 90)

kw                               100.000000
electricity_consumption           99.139393
last_service_date                 96.444500
other_fuel_types                  94.472015
availability                      96.011056
last_timing_belt_service_date     99.899491
available_from                    98.291350
dtype: float64

In [43]:
drop_columns = df_nans(car_data, 90).index
drop_columns

Index(['kw', 'electricity_consumption', 'last_service_date',
       'other_fuel_types', 'availability', 'last_timing_belt_service_date',
       'available_from'],
      dtype='object')

In [44]:
car_data.drop(drop_columns, axis=1, inplace=True)

In [45]:
car_data.columns

Index(['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'hp', 'type', 'previous_owners',
       'next_inspection', 'inspection_new', 'warranty', 'full_service',
       'non-smoking_vehicle', 'null', 'make', 'model', 'offer_number',
       'first_registration', 'body_color', 'paint_type', 'body_color_original',
       'upholstery', 'body', 'nr_of_doors', 'nr_of_seats', 'model_code',
       'gearing_type', 'displacement', 'cylinders', 'weight', 'drive_chain',
       'fuel', 'consumption', 'co2_emission', 'emission_class',
       'comfort_convenience', 'entertainment_media', 'extras',
       'safety_security', 'description', 'emission_label', 'gears',
       'country_version'],
      dtype='object')
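
The threshold-and-drop step above can be tried on a small frame. This is a minimal sketch on made-up data (the `toy` frame is hypothetical); it computes the same percentage with `isna().mean()`, which is equivalent to `isnull().sum() / df.shape[0]`.

```python
import numpy as np
import pandas as pd

def df_nans(df, limit):
    # Percentage of missing values per column, keeping columns at or above the limit
    missing = df.isna().mean() * 100
    return missing[missing >= limit]

# Hypothetical toy frame, not the real data set
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [np.nan] * 4,                    # 100% missing
    "c": [1, 2, 3, 4],                    # 0% missing
})

mostly_missing = df_nans(toy, 90)
cleaned = toy.drop(columns=mostly_missing.index)
print(mostly_missing)
print(cleaned.columns.tolist())
```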

---

## General Columns
- [x] url: url of autos
- [x] short_description, description: Description of autos (in English and German) written by users

### url

In [46]:
car_data = car_data.drop(['url'], axis=1)

Dropped because the URL is not needed for the analysis.

### short_description

In [47]:
car_data = car_data.drop(['short_description'], axis=1)

Dropped because the same information is available in other columns of the DataFrame.

### description column

In [48]:
car_data['description']

0        [\n, Sicherheit:,  , Deaktivierung für Beifahr...
1        [\nLangstreckenfahrzeug daher die hohe Kilomet...
2        [\n, Fahrzeug-Nummer: AM-95365,  , Ehem. UPE 2...
3        [\nAudi A1: , - 1e eigenaar , - Perfecte staat...
4        [\n, Technik & Sicherheit:, Xenon plus, Klimaa...
                               ...                        
15914    [\nVettura visionabile nella sede in Via Roma ...
15915    [\nDach: Panorama-Glas-Schiebedach, Lackierung...
15916    [\n, Getriebe:,  Automatik, Technik:,  Bordcom...
15917    [\nDEK:[2691331], Renault Espace Blue dCi 200C...
15918    [\n, Sicherheit Airbags:,  , Seitenairbag,  , ...
Name: description, Length: 15919, dtype: object

In [49]:
car_data.drop('description',axis=1,inplace=True)

### This column was dropped because it holds free-text descriptions written by users (in German, among other languages)

## Categorical Columns
- [x] make_model, make, model: Model of autos. Ex:Audi A1 
- [x] body_type, body: Body type of autos Example: van, sedans
- [x] vat: VAT deductible, price negotiable 
- [x] registration, first_registration: First registration date and year of autos. 
- [x] prev_owner, previous_owners: Number of previous owners
- [x] type: new or used 
- [x] next_inspection, inspection_new: information about inspection (inspection date,..) 
- [x] body_color, body_color_original: Color of auto Ex: Black, red
- [x] paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- [x] upholstery: Upholstery information (texture, color) 
- [x] gearing_type: Type of gear Ex: automatic, manual 
- [x] fuel: fuel type Ex: diesel, benzine 
- [x] co2_emission, emission_class, emission_label: emission information 
- [x] drive_chain: drive chain Ex: front,rear, 4WD 
- [x] consumption: consumption of auto in city, country and combination (lt/100 km) 
- [x] country_version 
- [x] entertainment_media 
- [x] safety_security
- [x] comfort_convenience 
- [x] extras 

### make_model, make and model

Drop make and model because they are redundant with make_model.

In [50]:
car_data = car_data.drop(['make', 'model'], axis=1)

In [51]:
car_data['make_model'].value_counts(dropna=False)

make_model
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: count, dtype: int64

### body_type

In [52]:
car_data['body_type'].value_counts(dropna=False)

body_type
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
None               60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64

### vat

In [53]:
car_data['vat'].value_counts(dropna=False)

vat
VAT deductible      10980
None                 4513
Price negotiable      426
Name: count, dtype: int64

### registration_year (merge registration and first_registration)

In [54]:
car_data['registration'] = car_data['registration'].replace('-/-', np.nan)
car_data['registration'] = pd.DatetimeIndex(car_data['registration'])
car_data['registration_year'] = pd.DatetimeIndex(car_data['registration']).year

In [55]:
car_data['registration_year'].value_counts(dropna=False)

registration_year
2018.0    4522
2016.0    3674
2017.0    3273
2019.0    2853
NaN       1597
Name: count, dtype: int64
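
An alternative way to parse the `MM/YYYY` strings is `pd.to_datetime` with an explicit format. The sketch below runs on a made-up sample (the `registration` Series here is hypothetical) and assumes every non-missing value follows the `MM/YYYY` pattern seen in the preview above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same 'MM/YYYY' shape as the registration column
registration = pd.Series(["01/2016", "03/2017", "-/-", "02/2016"])

# Treat the '-/-' placeholder as missing, then parse with an explicit format
parsed = pd.to_datetime(registration.replace("-/-", np.nan), format="%m/%Y")
registration_year = parsed.dt.year

print(registration_year)
```

With an explicit format, a value that does not match raises an error instead of being guessed, which makes unexpected date layouts visible early.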

### prev_owner and previous owners

In [56]:
car_data['prev_owner'] = car_data['prev_owner'].str.findall(r'\d+').str[0].astype("float")

In [57]:
car_data['prev_owner'].value_counts(dropna=False)

prev_owner
1.0    8294
NaN    6828
2.0     778
3.0      17
4.0       2
Name: count, dtype: int64

In [58]:
car_data['previous_owners'] = car_data['previous_owners'].str.findall(r'\d+').str[0].astype("float")

In [59]:
car_data['previous_owners'].value_counts(dropna=False)

previous_owners
1.0    8101
NaN    6870
2.0     766
0.0     163
3.0      17
4.0       2
Name: count, dtype: int64

In [60]:
def prev_owner_combine(p1, p2):
    # Merge two owner counts: keep a value both columns agree on,
    # fall back to whichever one is present, and flag real disagreements.
    if p1 == p2:
        return p1
    if np.isnan(p1):
        return p2  # may itself be NaN
    if np.isnan(p2):
        return p1
    return 'conflict'

In [61]:
car_data['previous_owners'] = car_data.apply(lambda x: prev_owner_combine(x['prev_owner'],x['previous_owners']), axis=1)

In [62]:
car_data = car_data.drop(['prev_owner'], axis=1)

### prev_owner is dropped because it is now redundant with previous_owners

In [63]:
car_data['previous_owners'].value_counts(dropna=False)

previous_owners
1.0    8294
NaN    6665
2.0     778
0.0     163
3.0      17
4.0       2
Name: count, dtype: int64
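
The row-wise merge can also be expressed without `apply`. The following is a sketch on made-up values, not the notebook's method: `combine_first` fills gaps in one column from the other, and a separate boolean mask marks rows where both values exist but disagree (instead of writing the string `'conflict'` into a numeric column).

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, already converted to floats
prev_owner = pd.Series([1.0, np.nan, 2.0, np.nan, 1.0])
previous_owners = pd.Series([1.0, 0.0, np.nan, np.nan, 3.0])

# Take prev_owner where present, otherwise fall back to previous_owners
merged = prev_owner.combine_first(previous_owners)

# Rows where both values exist but disagree would need manual review
conflict = prev_owner.notna() & previous_owners.notna() & (prev_owner != previous_owners)

print(merged.tolist())
print(conflict.tolist())
```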

### type

In [64]:
car_data['type'] = car_data['type'].str[1]

In [65]:
car_data['type'].value_counts(dropna=False)

type
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: count, dtype: int64

### next_inspection and inspection_new

In [66]:
car_data['next_inspection'] = car_data['next_inspection'].str[0].str.strip().astype("string")
car_data['next_inspection'] = pd.DatetimeIndex(car_data['next_inspection'])
car_data['next_inspection'] = pd.DatetimeIndex(car_data['next_inspection']).year

In [67]:
car_data['next_inspection'].value_counts(dropna=False)

next_inspection
NaN       13094
2021.0     1401
2020.0      557
2022.0      483
2019.0      336
2018.0       26
2017.0        7
2023.0        5
2001.0        5
2016.0        3
2014.0        1
1921.0        1
Name: count, dtype: int64

In [68]:
car_data['inspection_new'] = car_data['inspection_new'].str[0].str.strip()

In [69]:
car_data['inspection_new'].value_counts(dropna=False)

inspection_new
NaN    11987
Yes     3570
         362
Name: count, dtype: int64

### body_color

In [70]:
car_data['body_color'] = car_data['body_color'].str[1]

In [71]:
car_data['body_color'].value_counts(dropna=False)

body_color
Black     3745
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
NaN        597
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: count, dtype: int64

### upholstery

In [72]:
car_data['upholstery'] = car_data['upholstery'].str[0].str.strip()

In [73]:
car_data['upholstery'].value_counts(dropna=False)

upholstery
Cloth, Black           5821
NaN                    3720
Part leather, Black    1121
Cloth                  1005
Cloth, Grey             891
Cloth, Other            639
Full leather, Black     575
Black                   491
Grey                    273
Other, Other            182
Part leather            140
Full leather            139
Full leather, Brown     116
Part leather, Grey      116
Other, Black            110
Full leather, Other      72
Full leather, Grey       67
Part leather, Other      65
Other                    56
Part leather, Brown      50
alcantara, Black         47
Velour, Black            36
Full leather, Beige      36
Cloth, Brown             28
Velour                   16
Other, Grey              15
Cloth, Beige             13
Brown                    12
Cloth, Blue              12
Velour, Grey              8
Cloth, White              8
alcantara, Grey           6
Cloth, Red                5
Other, Yellow             4
Part leather, Red         3
Beige    

### gearing_type

In [74]:
car_data['gearing_type'] = car_data['gearing_type'].str[1]

In [75]:
car_data['gearing_type'].value_counts(dropna=False)

gearing_type
Manual            8153
Automatic         7297
Semi-automatic     469
Name: count, dtype: int64

### fuel

In [76]:
car_data['fuel'] = car_data['fuel'].str[1]

In [77]:
car_data['fuel'].value_counts(dropna=False)

fuel
Diesel (Particulate Filter)                                                                                  4315
Super 95                                                                                                     3338
Gasoline                                                                                                     3175
Diesel                                                                                                       2984
Super 95 / Regular/Benzine 91                                                                                 424
                                                                                                             ... 
Regular/Benzine 91 / Super 95 / Regular/Benzine E10 91                                                          1
Super Plus 98 / Super E10 95                                                                                    1
Regular/Benzine 91 / Super 95 / Regular/Benzine E10 91 / Super E10 95 / Super Plus 

### co2_emission

In [78]:
car_data['co2_emission'] = car_data['co2_emission'].str[0].str.strip().str.findall(r'\d+').str[0].astype("float")

In [79]:
car_data['co2_emission'].value_counts(dropna=False)

co2_emission
NaN      2436
120.0     740
99.0      545
97.0      537
104.0     501
         ... 
51.0        1
165.0       1
331.0       1
80.0        1
193.0       1
Name: count, Length: 120, dtype: int64

### emission_class

In [80]:
car_data['emission_class'] = car_data['emission_class'].str[0].str.replace('\n','')

In [81]:
car_data['emission_class'].value_counts(dropna=False)

emission_class
Euro 6          10139
NaN              3628
Euro 6d-TEMP     1845
Euro 6c           127
Euro 5             78
Euro 6d            62
Euro 4             40
Name: count, dtype: int64

In [82]:
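# Assign the result back instead of using inplace=True on a column selection,
# which may silently operate on a temporary copy in recent pandas versions.
car_data['emission_class'] = car_data['emission_class'].replace(['Euro 6d-TEMP', 'Euro 6d', 'Euro 6c'], 'Euro 6')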

### drive_chain

In [83]:
car_data['drive_chain'] = car_data['drive_chain'].str[0].str.strip()

In [84]:
car_data['drive_chain'].value_counts(dropna=False)

drive_chain
front    8886
NaN      6858
4WD       171
rear        4
Name: count, dtype: int64

### consumption l/100 km 

In [85]:
# comb
car_data['combined_emissions'] = (
    car_data['consumption'].str[0].str[0]
    .str.replace('l/100 km (comb)', '', regex=False)
    .str.replace('kg/100 km (comb)', '', regex=False)
    .replace('\n', 'NaN')
    .astype("float")
)

In [86]:
car_data['combined_emissions'].value_counts(dropna=False).index

Index([ nan,  3.9,  4.0,  5.4,  5.1,  4.4,  3.8,  5.6,  4.7,  4.8,  5.0,  4.5,
        5.2,  4.6,  4.2,  5.3,  3.7,  4.9,  5.5,  4.1,  5.9,  3.3,  5.7,  4.3,
        3.5,  6.0,  3.6,  6.2,  5.8,  6.3,  6.1,  6.8,  6.6,  3.4,  3.0,  6.4,
        7.4,  7.1,  6.5, 10.0,  6.7,  3.2,  6.9,  8.3,  7.6,  7.0,  3.1,  7.2,
        7.8,  8.0, 51.0,  8.7,  8.6,  7.3,  7.9,  8.1, 40.0, 38.0,  0.0, 11.0,
       43.0,  7.5, 13.8, 55.0, 54.0,  1.2, 32.0, 33.0, 50.0,  1.0, 46.0,  9.1],
      dtype='float64', name='combined_emissions')

In [87]:
# city
car_data['city_emissions'] = (
    car_data['consumption'].str[1].str[0]
    .str.replace('l/100 km (city)', '', regex=False)
    .str.replace('kg/100 km (city)', '', regex=False)
    .replace('\n', 'NaN')
    .astype("float")
)

In [88]:
car_data['city_emissions'].value_counts(dropna=False).index

Index([ nan,  5.0,  5.8,  4.5,  4.3,  4.0,  5.1,  6.0,  6.8,  4.6,  7.2,  5.7,
        7.3,  4.2,  5.9,  7.8,  6.6,  5.2,  4.1,  6.3,  5.4,  4.7,  6.7,  3.9,
        3.5,  7.6,  7.1,  7.5,  6.9,  5.5,  7.0,  6.2,  7.4,  7.7,  6.5,  8.7,
        6.1,  4.4,  8.2,  8.0,  5.3,  6.4,  5.6,  7.9,  4.8,  4.9,  3.7,  3.4,
        9.6,  9.2,  3.3,  8.5,  8.6,  8.3,  3.8, 10.2,  8.1, 11.3, 10.0,  9.9,
        9.4,  9.1,  3.0,  0.0,  8.4,  9.8,  1.0, 62.0, 11.2,  8.9, 11.0, 10.8,
       11.5,  8.8, 10.1, 45.0,  9.5, 43.0,  3.6, 16.1, 66.0, 10.4, 10.5,  9.0,
       64.0, 19.9,  9.7],
      dtype='float64', name='city_emissions')

In [89]:
# country
car_data['country_emissions'] = (
    car_data['consumption'].str[2].str[0]
    .str.replace('l/100 km (country)', '', regex=False)
    .str.replace('kg/100 km (country)', '', regex=False)
    .replace('\n', 'NaN')
    .astype("float")
)

In [90]:
car_data['country_emissions'].value_counts(dropna=False).index

Index([ nan,  4.2,  3.7,  4.4,  4.5,  3.8,  3.9,  4.1,  4.7,  4.0,  3.5,  4.3,
        3.6,  3.1,  3.3,  4.6,  4.9,  3.4,  4.8,  5.3,  5.1,  5.7,  5.4,  3.2,
        3.0,  5.6,  5.0,  5.2,  6.3,  6.0, 10.0,  5.8,  5.5,  7.7,  6.6,  2.9,
        6.4,  2.8,  0.0,  7.3, 44.0,  6.5,  7.1,  6.7,  7.0, 35.0,  5.9,  6.9,
        7.8, 37.0, 10.3,  7.6, 42.0,  8.6,  6.1,  8.0,  2.0,  1.0],
      dtype='float64', name='country_emissions')
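
The three near-identical cells above could be folded into one helper. This is a sketch under an assumption inferred from the `.str[i].str[0]` indexing: each `consumption` entry is a list of one-element lists such as `[['3.8 l/100 km (comb)'], ['4.3 l/100 km (city)'], ['3.5 l/100 km (country)']]`. The helper name `parse_consumption` and the sample entries are hypothetical.

```python
import re
import numpy as np

def parse_consumption(entry, position):
    """Return the leading number of entry[position][0] as a float, or NaN."""
    try:
        text = entry[position][0]
    except (IndexError, TypeError):
        # Missing entry, too-short list, or empty string
        return np.nan
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else np.nan

# Hypothetical entries in the assumed layout
full = [["3.8 l/100 km (comb)"], ["4.3 l/100 km (city)"], ["3.5 l/100 km (country)"]]
broken = ["\n", "4.3 l/100 km (city)"]

print(parse_consumption(full, 0), parse_consumption(full, 1), parse_consumption(full, 2))
print(parse_consumption(broken, 0), parse_consumption(np.nan, 0))
```

On the real column it could be applied as `car_data['consumption'].apply(parse_consumption, position=0)`, since `Series.apply` forwards extra keyword arguments to the function.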

### country_version

In [91]:
car_data['country_version'] = car_data['country_version'].str[0].str.strip()

In [92]:
car_data['country_version'].value_counts(dropna=False)

country_version
NaN               8333
Germany           4502
Italy             1038
European Union     507
Netherlands        464
Spain              325
Belgium            314
Austria            208
Czech Republic      52
Poland              49
France              38
Denmark             33
Hungary             28
Japan                8
Slovakia             4
Croatia              4
Sweden               3
Romania              2
Bulgaria             2
Luxembourg           1
Switzerland          1
Slovenia             1
Egypt                1
Serbia               1
Name: count, dtype: int64

### entertainment_media

In [93]:
car_data['entertainment_media']

0        [Bluetooth, Hands-free equipment, On-board com...
1        [Bluetooth, Hands-free equipment, On-board com...
2                                 [MP3, On-board computer]
3        [Bluetooth, CD player, Hands-free equipment, M...
4        [Bluetooth, CD player, Hands-free equipment, M...
                               ...                        
15914    [Bluetooth, Digital radio, Hands-free equipmen...
15915    [Bluetooth, Digital radio, Hands-free equipmen...
15916    [Bluetooth, Hands-free equipment, On-board com...
15917               [Bluetooth, Digital radio, Radio, USB]
15918                                                [USB]
Name: entertainment_media, Length: 15919, dtype: object

In [94]:
car_data['entertainment_media'] = car_data['entertainment_media'].astype('str').str.replace('[','').str.replace("]",'')

#### Only the list brackets were stripped here; the column will be encoded with `get_dummies` later

### safety_security

In [95]:
car_data['safety_security']

0        [ABS, Central door lock, Daytime running light...
1        [ABS, Central door lock, Central door lock wit...
2        [ABS, Central door lock, Daytime running light...
3        [ABS, Alarm system, Central door lock with rem...
4        [ABS, Central door lock, Driver-side airbag, E...
                               ...                        
15914    [ABS, Central door lock, Central door lock wit...
15915    [ABS, Adaptive Cruise Control, Blind spot moni...
15916    [ABS, Adaptive Cruise Control, Blind spot moni...
15917    [ABS, Blind spot monitor, Driver-side airbag, ...
15918    [ABS, Blind spot monitor, Daytime running ligh...
Name: safety_security, Length: 15919, dtype: object

#### This column was left unchanged; it will be encoded with `get_dummies` later

### comfort_convenience

In [96]:
car_data['comfort_convenience'] = car_data['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'')

In [97]:
car_data['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=",")

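| | 'Air suspension' | 'Armrest' | 'Automatic climate control' | 'Auxiliary heating' | 'Cruise control' | 'Electric Starter' | 'Electric tailgate' | 'Electrical side mirrors' | 'Electrically adjustable seats' | 'Electrically heated windshield' | ... | 'Leather steering wheel' | 'Light sensor' | 'Multi-function steering wheel' | 'Navigation system' | 'Panorama roof' | 'Park Distance Control' | 'Power windows' | 'Rain sensor' | 'Sunroof' | nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15914 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15915 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15916 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15917 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |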


In [101]:
car_data['comfort_convenience']

0        'Air conditioning', 'Armrest', 'Automatic clim...
1        'Air conditioning', 'Automatic climate control...
2        'Air conditioning', 'Cruise control', 'Electri...
3        'Air suspension', 'Armrest', 'Auxiliary heatin...
4        'Air conditioning', 'Armrest', 'Automatic clim...
                               ...                        
15914    'Air conditioning', 'Automatic climate control...
15915    'Air conditioning', 'Automatic climate control...
15916    'Air conditioning', 'Armrest', 'Automatic clim...
15917    'Air conditioning', 'Automatic climate control...
15918    'Air conditioning', 'Automatic climate control...
Name: comfort_convenience, Length: 15919, dtype: object

#### Only the list brackets were stripped here; the column will be encoded with `get_dummies` later

### extras

In [917]:
car_data['extras'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=', ').sum()

'Alloy wheels'           11294
'Cab or rented Car'        310
'Catalytic Converter'     2258
'Handicapped enabled'       52
'Right hand drive'           3
'Roof rack'               2647
'Shift paddles'            508
'Ski bag'                  247
'Sliding door'               3
'Sport package'           1198
'Sport seats'             3098
'Sport suspension'        1619
'Touch screen'            4043
'Trailer hitch'            654
'Tuned car'                 13
'Voice Control'           4326
'Winter tyres'             246
nan                       2962
dtype: int64

In [103]:
car_data['extras']

0        [Alloy wheels, Catalytic Converter, Voice Cont...
1        [Alloy wheels, Sport seats, Sport suspension, ...
2                            [Alloy wheels, Voice Control]
3               [Alloy wheels, Sport seats, Voice Control]
4        [Alloy wheels, Sport package, Sport suspension...
                               ...                        
15914                         [Alloy wheels, Touch screen]
15915          [Alloy wheels, Touch screen, Voice Control]
15916                                       [Alloy wheels]
15917                         [Alloy wheels, Touch screen]
15918                         [Alloy wheels, Touch screen]
Name: extras, Length: 15919, dtype: object

#### This column was left unchanged; it will be encoded with `get_dummies` later
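
The `get_dummies` encoding planned for the list-valued columns could look roughly like the sketch below. It runs on a made-up sample (the `extras` Series here is hypothetical) and joins each list with `|` instead of casting it with `astype('str')`, so the resulting column names carry no quotes and missing rows do not produce a `nan` column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the extras column (lists of strings, NaN when absent)
extras = pd.Series([
    ["Alloy wheels", "Touch screen"],
    ["Alloy wheels"],
    np.nan,
])

# Join each list with '|' and let get_dummies split on the same separator
dummies = extras.str.join("|").str.get_dummies(sep="|")

print(dummies)
```

The separator only needs to be a character that never occurs inside a feature name; `|` is an assumption that holds for the values shown in this notebook.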


## Quantitative Columns
- [x] price: Price of cars 
- [x] km: km of autos 
- [x] hp: engine power of autos (kW) 
- [x] displacement: displacement of autos (cc) 
- [x] warranty: warranty period (month) (drop?)
- [x] weight: weight of auto (kg) 
- [x] nr_of_doors: number of doors 
- [x] nr_of_seats : number of seats 
- [x] cylinders: number of cylinders 
- [x] gears: number of gears

### price

In [918]:
# price is already int64 (see the first car_data.info() output), so no conversion is needed
car_data['price'].value_counts(dropna=False)

price
14990    154
15990    151
10990    139
15900    106
17990    102
        ... 
17559      1
17560      1
17570      1
17575      1
39875      1
Name: count, Length: 2956, dtype: int64

### km

In [104]:
car_data['km'].value_counts(dropna=False).index

Index(['10 km', '- km', '1 km', '5 km', '50 km', '100 km', '15 km', '5,000 km',
       '20 km', '3,000 km',
       ...
       '57,840 km', '43,400 km', '31,265 km', '36,020 km', '53,433 km',
       '67,469 km', '43,197 km', '10,027 km', '35,882 km', '57 km'],
      dtype='object', name='km', length=6690)

In [919]:
car_data['km'] = car_data['km'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data['km'].value_counts(dropna=False).index

Index([   10.0,     nan,     1.0,     5.0,    50.0,   100.0,    15.0,  5000.0,
          20.0,  3000.0,
       ...
       57840.0, 43400.0, 31265.0, 36020.0, 53433.0, 67469.0, 43197.0, 10027.0,
       35882.0,    57.0],
      dtype='float64', name='km', length=6690)

### hp

In [110]:
car_data['hp'].value_counts().index

Index(['85 kW', '66 kW', '81 kW', '100 kW', '110 kW', '70 kW', '125 kW',
       '51 kW', '55 kW', '118 kW', '92 kW', '121 kW', '147 kW', '77 kW',
       '56 kW', '54 kW', '103 kW', '87 kW', '165 kW', '88 kW', '60 kW',
       '162 kW', '- kW', '74 kW', '96 kW', '71 kW', '101 kW', '67 kW',
       '154 kW', '122 kW', '119 kW', '164 kW', '135 kW', '82 kW', '52 kW',
       '78 kW', '1 kW', '294 kW', '146 kW', '141 kW', '57 kW', '104 kW',
       '120 kW', '191 kW', '112 kW', '155 kW', '117 kW', '184 kW', '90 kW',
       '76 kW', '65 kW', '149 kW', '80 kW', '168 kW', '98 kW', '93 kW',
       '228 kW', '270 kW', '53 kW', '140 kW', '86 kW', '167 kW', '127 kW',
       '89 kW', '143 kW', '63 kW', '40 kW', '150 kW', '163 kW', '115 kW',
       '132 kW', '75 kW', '4 kW', '137 kW', '123 kW', '133 kW', '84 kW',
       '195 kW', '44 kW', '239 kW', '9 kW'],
      dtype='object', name='hp')

In [921]:
car_data['hp'] = car_data['hp'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data['hp'].value_counts(dropna=False)

hp
85.0     2542
66.0     2122
81.0     1402
100.0    1308
110.0    1112
         ... 
84.0        1
195.0       1
44.0        1
239.0       1
9.0         1
Name: count, Length: 81, dtype: int64

### displacement

In [922]:
car_data['displacement'] = car_data['displacement'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [923]:
car_data['displacement'].value_counts().index

Index([ 1598.0,   999.0,  1398.0,  1399.0,  1229.0,  1956.0,  1461.0,  1490.0,
        1422.0,  1197.0,   898.0,  1395.0,  1968.0,  1149.0,  1618.0,  1798.0,
        1498.0,  1600.0,  1248.0,  1997.0,  1364.0,  1400.0,   998.0,  1500.0,
        2000.0,  1000.0,     1.0,  1998.0,  2480.0,  1200.0,  1984.0,  1397.0,
         899.0,   160.0,   929.0,  1499.0,   997.0,  1596.0,   139.0,   900.0,
        1599.0,  1199.0,  1396.0,  1495.0,  1589.0,  1300.0,     2.0,   995.0,
        1496.0,   890.0,  1580.0,  1995.0,  1333.0,    54.0,  1533.0,  1100.0,
        1350.0, 16000.0,  1856.0,  1568.0,  1896.0,  1584.0,   996.0,  1696.0,
        1686.0, 15898.0,  1368.0,   140.0,   973.0,  1239.0,  1369.0,  1390.0,
         122.0,  1198.0,  1195.0,  2967.0,  1800.0],
      dtype='float64', name='displacement')

### warranty

In [924]:
car_data['warranty']

0                 [\n, \n, \n4 (Green)\n]
1                                     NaN
2        [\n, \n, \n99 g CO2/km (comb)\n]
3                                     NaN
4                    [\n, \n, \nEuro 6\n]
                       ...               
15914                       \n24 months\n
15915                [\n, \n, \nEuro 6\n]
15916             [\n, \n, \n4 (Green)\n]
15917                                  \n
15918                                 NaN
Name: warranty, Length: 15919, dtype: object

In [114]:
car_data['warranty'].value_counts(dropna=False)

warranty
NaN                                                                                                5420
[\n, \n, \nEuro 6\n]                                                                               1868
\n12 months\n                                                                                      1177
\n                                                                                                  979
\n24 months\n                                                                                       566
                                                                                                   ... 
[\n72 months\n, \n125 g CO2/km (comb)\n]                                                              1
[\n60 months\n, \n14 g CO2/km (comb)\n]                                                               1
[\n24 months\n, \n121 g CO2/km (comb)\n]                                                              1
[\n12 months\n, \nEuro 6d\n]                           

In [925]:
import re

def clean_warranty(a):
    # Lists carry the warranty text (if any) in their first element;
    # plain strings carry it directly. Anything else (NaN) is passed through.
    if isinstance(a, list):
        a = a[0]
    if isinstance(a, str):
        b = re.findall(r'\d+', a)
        return b[0] if b else np.nan
    return a

In [926]:
car_data['warranty'] = car_data['warranty'].apply(clean_warranty)

In [927]:
car_data['warranty'] = car_data['warranty'].astype('float')

### weight

In [928]:
car_data['weight'] = car_data['weight'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [929]:
car_data['weight'].value_counts().index

Index([1163.0, 1360.0, 1165.0, 1335.0, 1135.0, 1199.0, 1734.0, 1180.0, 1503.0,
       1350.0,
       ...
       1137.0, 1213.0, 1960.0, 1258.0, 1167.0, 1331.0, 1132.0, 1252.0, 1792.0,
       2037.0],
      dtype='float64', name='weight', length=434)

### nr_of_doors

In [930]:
car_data['nr_of_doors'] = car_data['nr_of_doors'].str[0].str.findall(r'\d+').str[0]

In [931]:
car_data['nr_of_doors'].value_counts()

nr_of_doors
5    11575
4     3079
3      832
2      219
1        1
7        1
Name: count, dtype: int64

### nr_of_seats

In [932]:
car_data['nr_of_seats'] = car_data['nr_of_seats'].str[0].str.findall(r'\d+').str[0]

In [933]:
car_data['nr_of_seats'].value_counts()

nr_of_seats
5    13336
4     1125
7      362
2      116
6        2
3        1
Name: count, dtype: int64

### cylinders

In [934]:
car_data['cylinders'] = car_data['cylinders'].str[0].str.findall(r'\d+').str[0]


In [935]:
car_data['cylinders'].value_counts(dropna=False)

cylinders
4      8105
NaN    5680
3      2104
5        22
6         3
8         2
2         2
1         1
Name: count, dtype: int64

### gears

In [936]:
car_data['gears'] = car_data['gears'].str[0].str.findall(r'\d+').str[0]


In [937]:
car_data['gears'].value_counts(dropna=False)

gears
6      5822
NaN    4712
5      3239
7      1908
8       224
9         6
1         2
3         2
4         2
2         1
50        1
Name: count, dtype: int64
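
Most quantitative columns above repeat the same "take the first run of digits" pattern. A small helper could capture it once. This is a sketch on made-up values; `first_number` is a hypothetical name, and it uses `str.extract` with `pd.to_numeric` rather than the `findall(...).str[0]` chain used in the cells above.

```python
import pandas as pd

def first_number(series):
    """Extract the first run of digits from each string and return floats (NaN if none)."""
    digits = (
        series.str.replace(",", "", regex=False)   # drop thousands separators
              .str.extract(r"(\d+)", expand=False)
    )
    return pd.to_numeric(digits, errors="coerce")

# Hypothetical samples in the shape of the raw km and hp columns
km = pd.Series(["56,013 km", "- km", "10 km"])
hp = pd.Series(["85 kW", "- kW"])

print(first_number(km).tolist())
print(first_number(hp).tolist())
```

For columns that hold lists (for example `displacement` or `weight`), the first element would still need to be selected with `.str[0]` before calling the helper.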

In [938]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   make_model           15919 non-null  object        
 1   body_type            15859 non-null  object        
 2   price                15919 non-null  int64         
 3   vat                  11406 non-null  object        
 4   km                   14895 non-null  float64       
 5   registration         14322 non-null  datetime64[ns]
 6   hp                   15831 non-null  float64       
 7   type                 15917 non-null  object        
 8   previous_owners      9254 non-null   float64       
 9   next_inspection      2825 non-null   float64       
 10  inspection_new       3932 non-null   object        
 11  warranty             4853 non-null   float64       
 12  full_service         8215 non-null   object        
 13  non-smoking_vehicle  7177 non-n

Save the cleaned data to a new CSV file before moving on to handling missing values.
Each step uses its own notebook.

In [940]:
car_data.to_csv("cleaned_car_data", index=False)