# Cleaning Car Data

## Columns:

**General Columns** 
- url: url of autos 
- short_description, description: Description of autos (in English and German) written by users 

**Categorical Columns**  
- make_model, make, model: Model of autos. Ex:Audi A1 
- body_type, body: Body type of autos Example: van, sedans
- vat: VAT deductible, price negotiable 
- registration, first_registration: First registration date and year of autos. 
- prev_owner, previous_owners: Number of previous owners
- type: new or used 
- next_inspection, inspection_new: information about inspection (inspection date,..) 
- body_color, body_color_original: Color of auto Ex: Black, red
- paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- upholstery: Upholstery information (texture, color) 
- gearing_type: Type of gear Ex: automatic, manual 
- fuel: fuel type Ex: diesel, benzine 
- co2_emission, emission_class, emission_label: emission information 
- drive_chain: drive chain Ex: front,rear, 4WD 
- consumption: consumption of auto in city, country and combination (lt/100 km) 
- country_version 
- entertainment_media 
- safety_security 
- comfort_convenience 
- extras 

**Quantitative Columns**
- price: Price of cars 
- km: km of autos 
- hp: horsepower of autos (kW) 
- displacement: displacement of autos (cc) 
- warranty: warranty period (month) 
- weight: weight of auto (kg) 
- nr_of_doors: number of doors 
- nr_of_seats : number of seats 
- cylinders: number of cylinders 
- gears: number of gears

---

In [2]:
#print(car_data.isnull().sum()*100/ car_data.shape[0])

In [3]:
import numpy as np
import pandas as pd


car_data = pd.read_json('scout_car.json', lines=True)
car_data.columns = (
    car_data.columns.str.lower()
    .str.replace(' ', '_')
    .str.replace('.', '', regex=False)  # literal dot, not the regex wildcard
    .str.replace('_&_', '_')
    .str.strip()
)

In [7]:
car_data.head(3).T

Unnamed: 0,0,1,2
url,https://www.autoscout24.com//offers/audi-a1-sp...,https://www.autoscout24.com//offers/audi-a1-1-...,https://www.autoscout24.com//offers/audi-a1-sp...
make_model,Audi A1,Audi A1,Audi A1
short_description,Sportback 1.4 TDI S-tronic Xenon Navi Klima,1.8 TFSI sport,Sportback 1.6 TDI S tronic Einparkhilfe plus+m...
body_type,Sedans,Sedans,Sedans
price,15770,14500,14640
vat,VAT deductible,Price negotiable,VAT deductible
km,"56,013 km","80,000 km","83,450 km"
registration,01/2016,03/2017,02/2016
prev_owner,2 previous owners,,1 previous owner
kw,,,


In [5]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kw                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  type                           15917 non-null  object 
 12  previous_owners                9279 non-null  

### Dropping columns with more than 90% missing values
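
The notebook does not show the cell for this step. A minimal sketch of one way to do it: compute the share of missing values per column and keep only the columns at or below the threshold. The helper name `drop_sparse_columns` and the toy frame are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= threshold].index
    return df.loc[:, keep].copy()


# Toy frame standing in for car_data (hypothetical values).
demo = pd.DataFrame({
    "price": [15770, 14500, 14640, 13000],
    "kw": [np.nan, np.nan, np.nan, np.nan],   # 100% missing, so it is dropped
    "prev_owner": ["2", None, "1", None],     # 50% missing, so it is kept
})
cleaned = drop_sparse_columns(demo)
print(list(cleaned.columns))
```

Note that `isna()` only counts `None`/`NaN`; cells holding empty lists or `'\n'` placeholders would need to be converted to `NaN` first.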

---

## General Columns
- [x] url: url of autos
- [x] short_description, description: Description of autos (in English and German) written by users

## Categorical Columns
- [x] make_model, make, model: Model of autos. Ex:Audi A1 
- [x] body_type, body: Body type of autos Example: van, sedans
- [x] vat: VAT deductible, price negotiable 
- [x] registration, first_registration: First registration date and year of autos. 
- [ ] prev_owner, previous_owners: Number of previous owners
- [x] type: new or used 
- [ ] next_inspection, inspection_new: information about inspection (inspection date,..) 
- [x] body_color, body_color_original: Color of auto Ex: Black, red
- [x] paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- [x] upholstery: Upholstery information (texture, color) 
- [x] gearing_type: Type of gear Ex: automatic, manual 
- [x] fuel: fuel type Ex: diesel, benzine 
- [x] co2_emission, emission_class, emission_label: emission information 
- [x] drive_chain: drive chain Ex: front,rear, 4WD 
- [x] consumption: consumption of auto in city, country and combination (lt/100 km) 
- [x] country_version 
- [x] entertainment_media 
- [x] safety_security
- [x] comfort_convenience 
- [x] extras 

In [36]:
car_data = car_data.drop(['url'], axis=1)

#### (drop) - identical for every observation

## short_description (drop)

In [37]:
car_data = car_data.drop(['short_description'], axis=1)

## data elsewhere in dataframe

### make_model, make and model

#### Drop make and model because they are redundant with make_model

In [38]:
car_data = car_data.drop(['make', 'model'], axis=1)
car_data['make_model'].value_counts(dropna=False)

make_model
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: count, dtype: int64

### body_type

In [39]:
car_data['body_type'].value_counts(dropna=False)

body_type
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
None               60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64

### vat

In [40]:
car_data['vat'].value_counts(dropna=False)

vat
VAT deductible      10980
None                 4513
Price negotiable      426
Name: count, dtype: int64

### registration and first_registration

In [23]:
# Values look like MM/YYYY (e.g. 01/2016); '-/-' marks a missing date.

car_data['registration'] = car_data['registration'].replace('-/-', np.nan)
car_data['registration'] = pd.to_datetime(car_data['registration'], format='%m/%Y', errors='coerce')
car_data['registration_year'] = car_data['registration'].dt.year

car_data['registration_year'].value_counts(dropna=False)

registration_year
2018.0    4522
2016.0    3674
2017.0    3273
2019.0    2853
NaN       1597
Name: count, dtype: int64

### prev_owner and previous owners

In [86]:
#car_data['prev_owner'] = car_data['prev_owner'].str.findall('\d+').str[0]
#car_data[['previous_owners']].value_counts(dropna=False)
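
The commented-out attempt above could be finished along these lines. This is only a sketch, assuming `prev_owner` holds plain strings such as `2 previous owners` (as in the `head()` output) or missing values; the sample Series is hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the values shown by car_data.head(): strings or missing.
prev_owner = pd.Series(["2 previous owners", None, "1 previous owner"])

# str.extract returns the first capture group; rows without digits become NaN.
owners = prev_owner.str.extract(r"(\d+)", expand=False).astype("float")
print(owners.tolist())
```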

### type

In [87]:
car_data['type'] = car_data['type'].str[1]
car_data['type'].value_counts(dropna=False)

type          
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: count, dtype: int64

### next_inspection

In [14]:
# Is this month and year?

car_data['next_inspection']

0          0
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
15914    NaN
15915      0
15916    NaN
15917    NaN
15918    NaN
Name: next_inspection, Length: 15919, dtype: object

### inspection_new

In [89]:
# not sure what this is

car_data['inspection_new'].value_counts(dropna=False)

inspection_new
NaN                                                                                            11987
[\nYes\n, \nEuro 6\n]                                                                            523
\nYes\n                                                                                          362
[\nYes\n, \n102 g CO2/km (comb)\n]                                                               174
[\nYes\n, \n4 (Green)\n]                                                                         166
                                                                                               ...  
[\nYes\n, \n, 6 l/100 km (comb), \n, 8 l/100 km (city), \n, 4.9 l/100 km (country), \n]            1
[\nYes\n, \n, 6.8 l/100 km (comb), \n, 8.5 l/100 km (city), \n, 6.1 l/100 km (country), \n]        1
[\nYes\n, \n, 4.1 l/100 km (comb), \n, 7.5 l/100 km (city), \n, 5.2 l/100 km (country), \n]        1
[\nYes\n, \n, 5.2 l/100 km (comb), \n, 6.8 l/100 km (city), \n, 4.3 l/100 km

### body_color

In [90]:
car_data['body_color'] = car_data['body_color'].str[1]
car_data['body_color'].value_counts(dropna=False)

Unnamed: 0,body_color
9435,Black
11691,Grey
2283,White
14095,Red
7322,White


### upholstery

In [91]:
car_data['upholstery'] = car_data['upholstery'].str[0].str.strip()
car_data['upholstery'].value_counts(dropna=False)

Unnamed: 0,upholstery
4180,"Cloth, Black"
11236,"Full leather, Black"
7666,"Full leather, Black"
1306,"Cloth, Black"
13277,Black


### gearing_type

In [92]:
car_data['gearing_type'] = car_data['gearing_type'].str[1]
car_data['gearing_type'].value_counts(dropna=False)

Unnamed: 0,gearing_type
11866,Automatic
5472,Manual
9654,Automatic
12654,Automatic
11245,Automatic


### fuel

In [93]:
car_data['fuel'] = car_data['fuel'].str[1]
car_data['fuel'].value_counts(dropna=False)

Unnamed: 0,fuel
10079,Gasoline
4124,Diesel (Particulate Filter)
8155,Super 95
9616,Gasoline
14204,Gasoline


### co2_emission (isn't this quantitative?)

In [94]:
car_data['co2_emission'] = car_data['co2_emission'].str[0].str.strip().str.findall(r'\d+').str[0].astype("float")
car_data['co2_emission'].value_counts(dropna=False)

Unnamed: 0,co2_emission
2328,111.0
15415,120.0
9899,143.0
13825,95.0
10914,119.0


### emission_class

In [95]:
car_data['emission_class'] = car_data['emission_class'].str[0].str.strip()
car_data['emission_class'].value_counts(dropna=False)

emission_class
Euro 6            10139
NaN                3628
Euro 6d-TEMP       1845
Euro 6c             127
Euro 5               78
Euro 6d              62
Euro 4               40
Name: count, dtype: int64

### emission_label (drop) too many nulls

In [96]:
car_data['emission_label'] = car_data['emission_label'].str[0].str.strip().str.findall(r'\d+').str[0]
car_data['emission_label'].value_counts(dropna=False)

emission_label
NaN               11974
4                  3553
1                   381
5                     8
3                     2
2                     1
Name: count, dtype: int64

### drive_chain

In [97]:
car_data['drive_chain'] = car_data['drive_chain'].str[0].str.strip()
car_data['drive_chain'].value_counts(dropna=False)

Unnamed: 0,drive_chain
4203,
6118,
6213,front
13567,
13646,


### consumption l/100 km 
(these should be 3 different quantitative columns, correct?)

In [98]:
# comb
car_data['consumption'].str[0].str[0].str.replace('l/100 km (comb)', '').str.replace('kg/100 km (comb)', '').replace('\n', 'NaN').astype("float")

0        3.8
1        5.6
2        3.8
3        3.8
4        4.1
        ... 
15914    5.3
15915    NaN
15916    5.3
15917    5.3
15918    6.8
Name: consumption, Length: 15919, dtype: float64

In [99]:
# city
car_data['consumption'].str[1].str[0].str.replace('l/100 km (city)', '').str.replace('kg/100 km (city)', '').replace('\n', 'NaN').astype("float")

0        4.3
1        7.1
2        4.4
3        4.3
4        4.6
        ... 
15914    6.2
15915    7.0
15916    6.2
15917    6.2
15918    8.7
Name: consumption, Length: 15919, dtype: float64

In [100]:
# country
car_data['consumption'].str[2].str[0].str.replace('l/100 km (country)', '').str.replace('kg/100 km (country)', '').replace('\n', 'NaN').astype("float")

0        3.5
1        4.7
2        3.4
3        3.5
4        3.8
        ... 
15914    4.7
15915    NaN
15916    4.7
15917    4.7
15918    5.7
Name: consumption, Length: 15919, dtype: float64
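
The three cells above compute the combined, city and country values but do not store them. A sketch of how they might be kept as three numeric columns, assuming each `consumption` cell is a nested list in the order comb, city, country with `'\n'` where a value is missing. The helper `consumption_part`, the new column names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like the nested lists the cells above index into:
# [[comb], [city], [country]], with "\n" where a value is missing.
consumption = pd.Series([
    [["3.8 l/100 km (comb)"], ["4.3 l/100 km (city)"], ["3.5 l/100 km (country)"]],
    [["\n"], ["7 l/100 km (city)"], ["\n"]],
    None,
])


def consumption_part(col: pd.Series, position: int) -> pd.Series:
    """Pull the leading number out of the entry at `position`; NaN if absent."""
    text = col.str[position].str[0]
    return text.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype("float")


parts = pd.DataFrame({
    "consumption_comb": consumption_part(consumption, 0),
    "consumption_city": consumption_part(consumption, 1),
    "consumption_country": consumption_part(consumption, 2),
})
print(parts)
```

Extracting the leading number avoids having to strip the `l/100 km` and `kg/100 km` suffixes separately, at the cost of discarding the unit.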

### country_version

In [101]:
car_data['country_version'] = car_data['country_version'].str[0].str.strip()
car_data['country_version'].value_counts(dropna=False)

Unnamed: 0,country_version
12038,
5413,
7852,Czech Republic
2494,
2578,Germany


### entertainment_media

In [102]:
car_data['entertainment_media'].value_counts(dropna=False)

Unnamed: 0,entertainment_media
10018,"[Bluetooth, Hands-free equipment, MP3, On-boar..."
11110,"[On-board computer, Radio, USB]"
2252,
15304,[Radio]
1120,"[Bluetooth, Hands-free equipment, MP3, Radio]"


### safety_security

In [103]:
car_data['safety_security'].value_counts(dropna=False)

Unnamed: 0,safety_security
1503,"[ABS, Central door lock, Daytime running light..."
3252,"[ABS, Central door lock, Driver-side airbag, E..."
11309,"[ABS, Adaptive headlights, Central door lock, ..."
11740,"[ABS, Adaptive headlights, Central door lock, ..."
14938,"[ABS, Adaptive headlights, Blind spot monitor,..."


### comfort_convenience

In [104]:
car_data['comfort_convenience'].value_counts(dropna=False)

Unnamed: 0,comfort_convenience
3453,"[Air conditioning, Automatic climate control, ..."
5795,"[Air conditioning, Armrest, Automatic climate ..."
13204,"[Air conditioning, Cruise control, Electrical ..."
10618,"[Air conditioning, Automatic climate control, ..."
7473,"[Air conditioning, Armrest, Automatic climate ..."


### extras

In [105]:
car_data['extras'].value_counts(dropna=False)

Unnamed: 0,extras
14003,
4666,[Roof rack]
256,"[Alloy wheels, Sport suspension]"
4690,"[Alloy wheels, Catalytic Converter, Shift padd..."
2264,


## Quantitative Columns
- [x] price: Price of cars 
- [x] km: km of autos 
- [x] hp: horsepower of autos (kW) 
- [x] displacement: displacement of autos (cc) 
- [ ] warranty: warranty period (month) (drop?)
- [x] weight: weight of auto (kg) 
- [x] nr_of_doors: number of doors 
- [x] nr_of_seats : number of seats 
- [x] cylinders: number of cylinders 
- [x] gears: number of gears

### price

In [106]:
# price is already int64; astype returns a new Series, which is not assigned here
car_data['price'].astype("float")
car_data['price'].value_counts(dropna=False)

Unnamed: 0,price
8472,11200
15612,28940
14532,10790
3096,14990
13214,13500


### km

In [107]:
car_data['km'] = car_data['km'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data['km'].value_counts(dropna=False)

km     
10.0       1045
NaN        1024
1.0         367
5.0         170
50.0        148
           ... 
22883.0       1
22880.0       1
22875.0       1
22869.0       1
28450.0       1
Name: count, Length: 6690, dtype: int64

In [108]:
#car_data[['km']].value_counts(dropna=False).index()

### hp

In [109]:
car_data['hp'] = car_data['hp'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data['hp'].value_counts(dropna=False)

Unnamed: 0,hp
2741,92.0
11875,125.0
3028,81.0
2900,81.0
6819,77.0


### displacement

In [112]:
car_data['displacement'] = car_data['displacement'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")


AttributeError: Can only use .str accessor with string values!
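
The `AttributeError` above is what pandas raises when `.str` is used on a non-string column. That would happen if the cell is re-run after the column has already been converted to float, which is an assumption here, though the float values in the next cell are consistent with it. A sketch of a conversion that is safe to re-run follows; the helper name `first_number` and the sample values are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def first_number(col: pd.Series) -> pd.Series:
    """Extract the first run of digits from list-wrapped strings; no-op if already numeric."""
    if is_numeric_dtype(col):
        return col  # already converted on an earlier run
    text = col.str[0].str.replace(",", "", regex=False)
    return text.str.extract(r"(\d+)", expand=False).astype("float")


# Hypothetical raw values; the real cells may be shaped differently.
raw = pd.Series([["\n1,422 cc\n"], ["\n999 cc\n"], None])
once = first_number(raw)
twice = first_number(once)  # second call does not raise
print(once.tolist())
```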

In [119]:
car_data['displacement'].value_counts().index

Index([ 1598.0,   999.0,  1398.0,  1399.0,  1229.0,  1956.0,  1461.0,  1490.0,
        1422.0,  1197.0,   898.0,  1395.0,  1968.0,  1149.0,  1618.0,  1798.0,
        1498.0,  1600.0,  1248.0,  1997.0,  1364.0,  1400.0,   998.0,  1500.0,
        2000.0,  1000.0,     1.0,  1998.0,  2480.0,  1200.0,  1984.0,  1397.0,
         899.0,   160.0,   929.0,  1499.0,   997.0,  1596.0,   139.0,   900.0,
        1599.0,  1199.0,  1396.0,  1495.0,  1589.0,  1300.0,     2.0,   995.0,
        1496.0,   890.0,  1580.0,  1995.0,  1333.0,    54.0,  1533.0,  1100.0,
        1350.0, 16000.0,  1856.0,  1568.0,  1896.0,  1584.0,   996.0,  1696.0,
        1686.0, 15898.0,  1368.0,   140.0,   973.0,  1239.0,  1369.0,  1390.0,
         122.0,  1198.0,  1195.0,  2967.0,  1800.0],
      dtype='float64', name='displacement')

### warranty - wip - (probably not the best approach)

In [111]:
import re

for car in car_data['warranty']:
    if isinstance(car, list):
        # list cell: print every entry that mentions "N months"
        for item in car:
            if re.search(r"\b\d+\s+months\b", item):
                print(float(re.findall(r'\d+', item)[0]))
    elif isinstance(car, str):
        # string cell: search the string itself
        if re.search(r"\b\d+\s+months\b", car):
            print(float(re.findall(r'\d+', car)[0]))
        else:
            print(np.nan)
    else:
        # missing value or unexpected type
        print(np.nan)

nan
nan
nan
nan
12.0
nan
nan
nan
nan
12.0
nan
nan
6.0
nan
12.0
12.0
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
12.0
nan
nan
12.0
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
6.0
nan
nan
nan
12.0
nan
12.0
12.0
nan
50.0
12.0
nan
nan
nan
50.0
6.0
nan
nan
nan
12.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
6.0
nan
nan
nan
12.0
36.0
nan
nan
12.0
nan
20.0
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
36.0
nan
nan
36.0
12.0
12.0
nan
nan
12.0
nan
nan
nan
12.0
nan
12.0
nan
nan
12.0
nan
12.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
12.0
nan
12.0
12.0
24.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
12.0
nan
nan
3.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
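
Printing inside a loop does not give a column that can be assigned back. A sketch of the same idea as a function applied per cell, returning exactly one value per row. It assumes cells are a string, a list of strings, or missing; `warranty_months` and the sample values are hypothetical.

```python
import re

import numpy as np
import pandas as pd

MONTHS = re.compile(r"\b(\d+)\s+months\b")


def warranty_months(cell) -> float:
    """Return the first 'N months' value in a str or list-of-str cell, else NaN."""
    items = cell if isinstance(cell, list) else [cell]
    for item in items:
        if isinstance(item, str):
            match = MONTHS.search(item)
            if match:
                return float(match.group(1))
    return np.nan


# Hypothetical cells covering the three shapes the loop above handles.
warranty = pd.Series([["\n", "\n12 months\n"], "\n6 months\n", None, ["\n", "\nEuro 6\n"]])
months = warranty.apply(warranty_months)
print(months.tolist())
```

Unlike the loop, a list cell with no match yields `NaN` rather than nothing, so the result stays aligned with the original index.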

### weight

In [121]:
car_data['weight'] = car_data['weight'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data['weight'].value_counts()

AttributeError: Can only use .str accessor with string values!

In [128]:
car_data['weight'].value_counts().index

Index([1163.0, 1360.0, 1165.0, 1335.0, 1135.0, 1199.0, 1734.0, 1180.0, 1503.0,
       1350.0,
       ...
       1137.0, 1213.0, 1960.0, 1258.0, 1167.0, 1331.0, 1132.0, 1252.0, 1792.0,
       2037.0],
      dtype='float64', name='weight', length=434)

### nr_of_doors

In [129]:
car_data['nr_of_doors'] = car_data['nr_of_doors'].str[0].str.findall(r'\d+').str[0]

In [132]:
car_data['nr_of_doors'].value_counts().index

Index(['5', '4', '3', '2', '1', '7'], dtype='object', name='nr_of_doors')

### nr_of_seats

In [134]:
car_data['nr_of_seats'] = car_data['nr_of_seats'].str[0].str.findall(r'\d+').str[0]


In [135]:
car_data['nr_of_seats'].value_counts()

nr_of_seats
5    13336
4     1125
7      362
2      116
6        2
3        1
Name: count, dtype: int64

### cylinders

In [136]:
car_data['cylinders'] = car_data['cylinders'].str[0].str.findall(r'\d+').str[0]


In [137]:
car_data['cylinders'].value_counts(dropna=False)

cylinders
4            8105
NaN          5680
3            2104
5              22
6               3
2               2
8               2
1               1
Name: count, dtype: int64

### gears

In [138]:
car_data['gears'] = car_data['gears'].str[0].str.findall(r'\d+').str[0]


In [139]:
car_data['gears'].value_counts(dropna=False)

gears
6        5822
NaN      4712
5        3239
7        1908
8         224
9           6
1           2
3           2
4           2
2           1
50          1
Name: count, dtype: int64

In [None]:
# Save the cleaned data to a new CSV before handling null values; use a new notebook for each step.
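
A minimal sketch of that save step, using a small stand-in frame. The file name `scout_car_clean.csv` and the temporary directory are assumptions; the notebook does not name an output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the cleaned car_data frame (hypothetical values).
cleaned = pd.DataFrame({"make_model": ["Audi A1", "Opel Corsa"], "km": [56013.0, None]})

# Hypothetical output location; the notebook does not name the file.
out_path = Path(tempfile.mkdtemp()) / "scout_car_clean.csv"
cleaned.to_csv(out_path, index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

List-valued columns such as `extras` would be written as their string representation, so they may need to be flattened or encoded before saving.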