# Car Data

## Columns:

**General Columns** 
- url: URL of autos
- short_description, description: Description of autos (in English and German) written by users 

**Categorical Columns**  
- make_model, make, model: Make and model of autos. Ex: Audi A1
- body_type, body: Body type of autos. Ex: van, sedans
- vat: VAT deductible, price negotiable
- registration, first_registration: First registration date and year of autos
- prev_owner, previous_owners: Number of previous owners
- type: new or used
- next_inspection, inspection_new: Information about inspection (inspection date, ...)
- body_color, body_color_original: Color of auto. Ex: Black, red
- paint_type: Paint type of auto. Ex: Metallic, Uni/basic
- upholstery: Upholstery information (texture, color)
- gearing_type: Type of gearing. Ex: automatic, manual
- fuel: Fuel type. Ex: diesel, gasoline
- co2_emission, emission_class, emission_label: Emission information
- drive_chain: Drive chain. Ex: front, rear, 4WD
- consumption: Consumption of auto in city, country and combined driving (l/100 km)
- country_version 
- entertainment_media 
- safety_security 
- comfort_convenience 
- extras 

**Quantitative Columns**
- price: Price of cars 
- km: km of autos 
- hp: engine power of autos (kW)
- displacement: displacement of autos (cc)
- warranty: warranty period (months)
- weight: weight of auto (kg)
- nr_of_doors: number of doors
- nr_of_seats: number of seats
- cylinders: number of cylinders 
- gears: number of gears

---

In [316]:
import numpy as np
import pandas as pd


car_data = pd.read_json('scout_car.json', lines=True)
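# regex=False keeps '.' literal; as a regex it would match every character
car_data.columns = car_data.columns.str.lower().str.replace(' ', '_', regex=False).str.replace('.', '', regex=False).str.replace('_&_', '_', regex=False).str.strip()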

car_data.info()
#print(car_data.isnull().sum()*100/ car_data.shape[0])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kw                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  type                           15917 non-null  object 
 12  previous_owners                9279 non-null  

---

## General Columns
- [x] url: URL of autos
- [x] short_description, description: Description of autos (in English and German) written by users 

### url

In [317]:
car_data[['url']].sample(5)

Unnamed: 0,url
7695,https://www.autoscout24.com//offers/opel-astra...
8917,https://www.autoscout24.com//offers/opel-corsa...
4430,https://www.autoscout24.com//offers/audi-a3-li...
1686,https://www.autoscout24.com//offers/audi-a1-sp...
2970,https://www.autoscout24.com//offers/audi-a3-2-...


### short_description

In [318]:
car_data[['short_description']].sample(5)

Unnamed: 0,short_description
11502,OPEL INSIGNIA SPORTS TOURER INNOVATION 1.6 136CV
13815,1.5 dCi 8V 75CV Start&Stop 5 porte Van 4 posti N1
7121,ST 1.4 Turbo ON KAMERA NAVI W-LAN EU6
7766,K 120 Jahre 1.4 Turbo Automatik/AGR/Navi
7813,5trg 120 Jahre 1.4 Autom./Klima/Intelli


## Categorical Columns
- [x] make_model, make, model: Make and model of autos. Ex: Audi A1
- [x] body_type, body: Body type of autos. Ex: van, sedans
- [x] vat: VAT deductible, price negotiable
- [ ] registration, first_registration: First registration date and year of autos
- [x] prev_owner, previous_owners: Number of previous owners
- [x] type: new or used
- [ ] next_inspection, inspection_new: Information about inspection (inspection date, ...)
- [x] body_color, body_color_original: Color of auto. Ex: Black, red
- [x] paint_type: Paint type of auto. Ex: Metallic, Uni/basic
- [x] upholstery: Upholstery information (texture, color)
- [x] gearing_type: Type of gearing. Ex: automatic, manual
- [x] fuel: Fuel type. Ex: diesel, gasoline
- [x] co2_emission, emission_class, emission_label: Emission information
- [x] drive_chain: Drive chain. Ex: front, rear, 4WD
- [x] consumption: Consumption of auto in city, country and combined driving (l/100 km)
- [x] country_version 
- [x] entertainment_media 
- [x] safety_security
- [x] comfort_convenience 
- [x] extras 

### make_model, make (drop), model (drop)

In [319]:
car_data = car_data.drop(['make', 'model'], axis=1)
car_data[['make_model']].sample(5)

Unnamed: 0,make_model
8991,Opel Corsa
9910,Opel Corsa
7509,Opel Astra
5735,Opel Astra
7887,Opel Astra


### body_type

In [320]:
car_data[['body_type']].sample(5)

Unnamed: 0,body_type
15370,Van
8218,Station wagon
9023,Sedans
13488,Sedans
10450,Compact


### vat

In [321]:
car_data[['vat']].sample(5)

Unnamed: 0,vat
2625,
2457,VAT deductible
6271,VAT deductible
8582,
5623,VAT deductible


### registration/first_registration

In [322]:
# Is this month and year?

car_data['registration'].str.replace('/', '-')                                    
#pd.to_datetime(car_data['registration'])

0        01-2016
1        03-2017
2        02-2016
3        08-2016
4        05-2016
          ...   
15914        ---
15915    01-2019
15916    03-2019
15917    06-2019
15918    01-2019
Name: registration, Length: 15919, dtype: object
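
The values above look like month/year pairs. A minimal sketch of parsing them into dates, assuming the raw strings use the `MM/YYYY` layout and that `-/-` is the placeholder for a missing date (the sample Series and the names `first_registration`, `registration_year` and `registration_month` are hypothetical, not part of the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'registration' values;
# "-/-" is assumed to be the placeholder for a missing date.
registration = pd.Series(["01/2016", "03/2017", "-/-", "06/2019"])

# errors="coerce" turns anything that does not match MM/YYYY into NaT.
first_registration = pd.to_datetime(registration, format="%m/%Y", errors="coerce")

# Year and month can then be split out as separate numeric columns.
registration_year = first_registration.dt.year
registration_month = first_registration.dt.month
```

If the real column contains other layouts, they would also become `NaT` here, so it may be worth checking how many values are lost after coercion.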

### prev_owner

In [323]:
car_data['prev_owner'] = car_data['prev_owner'].str.findall(r'\d+').str[0]
car_data[['prev_owner']].sample(5)

Unnamed: 0,prev_owner
69,1.0
5689,
2350,1.0
10693,1.0
7239,1.0


### type

In [324]:
car_data['type'] = car_data['type'].str[1]
car_data[['type']].sample(5)

Unnamed: 0,type
156,Used
1118,Used
9094,Used
6407,Used
6667,Employee's car


### next_inspection

In [325]:
# Is this month and year?

car_data['next_inspection'] = car_data['next_inspection'].str[0].str.strip()
car_data[['next_inspection']].sample(5)

Unnamed: 0,next_inspection
8700,
671,
2590,
12672,
10547,


### inspection_new

In [326]:
# not sure what this is

car_data['inspection_new']

0                     [\nYes\n, \nEuro 6\n]
1                                       NaN
2                                       NaN
3                                       NaN
4        [\nYes\n, \n109 g CO2/km (comb)\n]
                        ...                
15914                                   NaN
15915                                   NaN
15916           [\nYes\n, \nEuro 6d-TEMP\n]
15917                                   NaN
15918    [\nYes\n, \n153 g CO2/km (comb)\n]
Name: inspection_new, Length: 15919, dtype: object

### body_color

In [327]:
car_data['body_color'] = car_data['body_color'].str[1]
car_data[['body_color']].sample(5)

Unnamed: 0,body_color
5674,Black
8563,White
3900,Grey
2907,Blue
4508,White


### upholstery

In [328]:
car_data['upholstery'] = car_data['upholstery'].str[0].str.strip()
car_data[['upholstery']].sample(5)

Unnamed: 0,upholstery
5462,
1383,
2759,Grey
10010,"Cloth, Black"
13508,


### gearing_type

In [329]:
car_data['gearing_type'] = car_data['gearing_type'].str[1]
car_data[['gearing_type']].sample(5)

Unnamed: 0,gearing_type
695,Manual
2766,Automatic
7119,Automatic
5985,Automatic
7163,Automatic


### fuel

In [330]:
car_data['fuel'] = car_data['fuel'].str[1]
car_data[['fuel']].sample(5)

Unnamed: 0,fuel
13521,Super 95
5601,Super 95
3270,Gasoline
14550,Super 95
5477,Super 95


### co2_emission (isn't this quantitative?)

In [331]:
car_data['co2_emission'] = car_data['co2_emission'].str[0].str.strip().str.findall(r'\d+').str[0].astype("float")
car_data[['co2_emission']].sample(5)

Unnamed: 0,co2_emission
3254,116.0
522,97.0
7732,154.0
15783,168.0
6011,104.0


### emission_class

In [332]:
car_data['emission_class'] = car_data['emission_class'].str[0].str.strip()
car_data[['emission_class']].sample(5)

Unnamed: 0,emission_class
13244,
6482,Euro 6
9870,Euro 6
9641,Euro 6d-TEMP
6521,Euro 6


### emission_label

In [333]:
car_data['emission_label'] = car_data['emission_label'].str[0].str.strip().str.findall(r'\d+').str[0]
car_data[['emission_label']].sample(5)

Unnamed: 0,emission_label
6592,
8107,
4523,
5723,
5486,


### drive_chain

In [334]:
car_data['drive_chain'] = car_data['drive_chain'].str[0].str.strip()
car_data[['drive_chain']].sample(5)

Unnamed: 0,drive_chain
3797,front
14507,
7133,
10023,front
9001,


### consumption l/100 km 
(these should be 3 different quantitative columns, correct?)

In [335]:
# comb
car_data['consumption'].str[0].str[0].str.replace('l/100 km (comb)', '', regex=False).str.replace('kg/100 km (comb)', '', regex=False).replace('\n', 'NaN').astype("float")

0        3.8
1        5.6
2        3.8
3        3.8
4        4.1
        ... 
15914    5.3
15915    NaN
15916    5.3
15917    5.3
15918    6.8
Name: consumption, Length: 15919, dtype: float64

In [336]:
# city
car_data['consumption'].str[1].str[0].str.replace('l/100 km (city)', '', regex=False).str.replace('kg/100 km (city)', '', regex=False).replace('\n', 'NaN').astype("float")

0        4.3
1        7.1
2        4.4
3        4.3
4        4.6
        ... 
15914    6.2
15915    7.0
15916    6.2
15917    6.2
15918    8.7
Name: consumption, Length: 15919, dtype: float64

In [337]:
# country
car_data['consumption'].str[2].str[0].str.replace('l/100 km (country)', '', regex=False).str.replace('kg/100 km (country)', '', regex=False).replace('\n', 'NaN').astype("float")

0        3.5
1        4.7
2        3.4
3        3.5
4        3.8
        ... 
15914    4.7
15915    NaN
15916    4.7
15917    4.7
15918    5.7
Name: consumption, Length: 15919, dtype: float64
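
The three cells above rely on the combined, city and country figures always sitting at positions 0, 1 and 2. A possible alternative is to read the label inside each string instead. This is only a sketch: the nested-list layout of the raw values is inferred from the `.str[0].str[0]` indexing, and `parse_consumption`, `CONSUMPTION_RE`, the sample data and the `consumption_*` column names are hypothetical.

```python
import re

import numpy as np
import pandas as pd

# Matches e.g. "3.8 l/100 km (comb)" or "4.1 kg/100 km (city)".
CONSUMPTION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:l|kg)/100 km \((comb|city|country)\)")


def parse_consumption(value):
    """Return comb/city/country consumption as floats (NaN where absent)."""
    result = {"comb": np.nan, "city": np.nan, "country": np.nan}
    if not isinstance(value, list):
        return result
    for entry in value:
        # Assumption: each entry is either a list of strings or a bare string.
        parts = entry if isinstance(entry, list) else [entry]
        for part in parts:
            match = CONSUMPTION_RE.search(str(part))
            if match:
                result[match.group(2)] = float(match.group(1))
    return result


# Hypothetical rows imitating the assumed raw structure.
sample = pd.Series([
    [["3.8 l/100 km (comb)"], ["4.3 l/100 km (city)"], ["3.5 l/100 km (country)"]],
    ["\n", ["7 l/100 km (city)"], "\n"],
    np.nan,
])

consumption = pd.DataFrame(sample.apply(parse_consumption).tolist(), index=sample.index)
consumption = consumption.add_prefix("consumption_")
```

Applied to the real column, the resulting frame could be joined back with `car_data.join(...)`, giving three separate quantitative columns.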

### country_version

In [338]:
car_data['country_version'] = car_data['country_version'].str[0].str.strip()
car_data[['country_version']].sample(5)

Unnamed: 0,country_version
5221,
2407,Spain
9791,Germany
12653,Italy
5649,European Union


### entertainment_media

In [339]:
car_data[['entertainment_media']].sample(5)

Unnamed: 0,entertainment_media
15023,"[Bluetooth, Digital radio, Hands-free equipmen..."
15220,"[Bluetooth, On-board computer, Radio]"
12540,"[Bluetooth, Digital radio, Hands-free equipmen..."
2404,"[Bluetooth, Hands-free equipment, MP3, Radio, ..."
8097,"[Bluetooth, Hands-free equipment, MP3, On-boar..."


### safety_security

In [340]:
car_data[['safety_security']].sample(5)

Unnamed: 0,safety_security
9292,"[Alarm system, Central door lock, Driver-side ..."
1478,"[ABS, Central door lock, Daytime running light..."
13541,"[ABS, Central door lock with remote control, D..."
3514,"[ABS, Central door lock, Central door lock wit..."
13044,"[ABS, Central door lock, Driver-side airbag, E..."


### comfort_convenience

In [341]:
car_data[['comfort_convenience']].sample(5)

Unnamed: 0,comfort_convenience
11478,"[Air conditioning, Armrest, Automatic climate ..."
12112,"[Air conditioning, Armrest, Automatic climate ..."
11758,"[Air conditioning, Automatic climate control, ..."
712,"[Air conditioning, Automatic climate control, ..."
15416,"[Air conditioning, Armrest, Automatic climate ..."


### extras

In [342]:
car_data[['extras']].sample(5)

Unnamed: 0,extras
10042,"[Alloy wheels, Catalytic Converter, Touch screen]"
11173,
4473,[Alloy wheels]
5869,"[Alloy wheels, Touch screen]"
12760,"[Roof rack, Trailer hitch]"


## Quantitative Columns
- [x] price: Price of cars 
- [x] km: km of autos 
- [x] hp: engine power of autos (kW)
- [x] displacement: displacement of autos (cc)
- [ ] warranty: warranty period (months) (drop?)
- [x] weight: weight of auto (kg)
- [x] nr_of_doors: number of doors
- [x] nr_of_seats: number of seats
- [x] cylinders: number of cylinders 
- [x] gears: number of gears

### price

In [343]:
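# price is already int64; astype returns a new Series, so this has no lasting effect unless assigned back
car_data['price'].astype("float")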
car_data[['price']].sample(5)

Unnamed: 0,price
11667,17950
1750,16950
15378,26450
13355,11500
11037,12840


### km

In [344]:
car_data['km'] = car_data['km'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data[['km']].sample(5)

Unnamed: 0,km
11211,102000.0
12420,1500.0
5890,80000.0
7666,15.0
14704,


### hp

In [345]:
car_data['hp'] = car_data['hp'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data[['hp']].sample(5)

Unnamed: 0,hp
9404,51.0
2779,81.0
13930,54.0
5301,85.0
3440,85.0


### displacement

In [346]:
car_data['displacement'] = car_data['displacement'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data[['displacement']].sample(5)

Unnamed: 0,displacement
10159,1398.0
5333,1598.0
3723,1395.0
988,999.0
6856,1598.0


### warranty - WIP (probably not the best approach)

In [360]:
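import re

for car in car_data['warranty']:
    # Scalars that are missing (NaN/None) have no warranty information
    if not isinstance(car, list) and pd.isna(car):
        print(np.nan)
    elif isinstance(car, list):
        # Print the number from every list item that mentions "<n> months"
        for item in car:
            if re.search(r"\b\d+\s+months\b", item):
                print(float(re.findall(r'\d+', item)[0]))
    elif isinstance(car, str):
        # Test the string itself, not the stale `item` left over from an earlier list
        if re.search(r"\b\d+\s+months\b", car):
            print(float(re.findall(r'\d+', car)[0]))
        else:
            print(np.nan)
    else:
        print(np.nan)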

nan
nan
nan
nan
12.0
nan
nan
nan
nan
12.0
nan
nan
6.0
nan
12.0
12.0
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
12.0
nan
nan
12.0
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
6.0
nan
nan
nan
12.0
nan
12.0
12.0
nan
50.0
12.0
nan
nan
nan
50.0
6.0
nan
nan
nan
12.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
6.0
nan
nan
nan
12.0
36.0
nan
nan
12.0
nan
20.0
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
36.0
nan
nan
36.0
12.0
12.0
nan
nan
12.0
nan
nan
nan
12.0
nan
12.0
nan
nan
12.0
nan
12.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
12.0
nan
12.0
12.0
24.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
nan
nan
nan
nan
12.0
nan
nan
3.0
nan
nan
nan
12.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
12.0
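
One way to turn the loop above into a column is to wrap the same checks in a function and use `Series.apply`. This is a sketch under assumptions: the sample values only imitate the mix of lists, strings and NaN that the loop handles, and `warranty_months` and `MONTHS_RE` are hypothetical names. Unlike the loop, it returns exactly one value per row (the first match, or NaN).

```python
import re

import numpy as np
import pandas as pd

MONTHS_RE = re.compile(r"\b(\d+)\s+months\b")


def warranty_months(value):
    """Return the first 'N months' figure found in value, else NaN."""
    if isinstance(value, str):
        candidates = [value]
    elif isinstance(value, list):
        candidates = [item for item in value if isinstance(item, str)]
    else:
        return np.nan
    for text in candidates:
        match = MONTHS_RE.search(text)
        if match:
            return float(match.group(1))
    return np.nan


# Hypothetical rows; the real 'warranty' values may look different.
sample = pd.Series([
    ["\n", "\n12 months\n"],  # list-valued entry with a period
    "\n6 months\n",           # string-valued entry
    ["\n", "\nEuro 6\n"],     # list without a warranty period
    np.nan,                   # missing entry
])

warranty = sample.apply(warranty_months)
```

On the real data this would be `car_data['warranty'].apply(warranty_months)`, which keeps the result aligned with the original index.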

### weight

In [32]:
car_data['weight'] = car_data['weight'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")
car_data[['weight']].sample(5)

Unnamed: 0,weight
13315,1686.0
2457,1165.0
9467,1163.0
4179,
10601,1733.0


### nr_of_doors

In [33]:
car_data['nr_of_doors'] = car_data['nr_of_doors'].str[0].str.findall(r'\d+').str[0]
car_data[['nr_of_doors']].sample(5)

Unnamed: 0,nr_of_doors
1945,5
2729,5
2172,4
4888,5
5336,5


### nr_of_seats

In [34]:
car_data['nr_of_seats'] = car_data['nr_of_seats'].str[0].str.findall(r'\d+').str[0]
car_data[['nr_of_seats']].sample(5)

Unnamed: 0,nr_of_seats
309,4
9751,5
10968,5
4436,5
9629,5


### cylinders

In [35]:
car_data['cylinders'] = car_data['cylinders'].str[0].str.findall(r'\d+').str[0]
car_data[['cylinders']].sample(5)

Unnamed: 0,cylinders
7415,3.0
5670,
5200,4.0
7054,4.0
5803,4.0


### gears

In [36]:
car_data['gears'] = car_data['gears'].str[0].str.findall(r'\d+').str[0]
car_data[['gears']].sample(5)

Unnamed: 0,gears
8312,6
463,5
4436,6
13584,5
7716,6
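
The last four cells extract digits but apply no `astype`, so these count columns appear to stay string-valued. A minimal sketch of converting them to numbers, assuming whole-number counts with some missing entries (the `counts` frame below is a hypothetical stand-in for `car_data`):

```python
import pandas as pd

# Hypothetical stand-in for the extracted string columns.
counts = pd.DataFrame({
    "nr_of_doors": ["5", "4", None],
    "nr_of_seats": ["4", "5", "5"],
    "cylinders": ["3", None, "4"],
    "gears": ["6", "5", None],
})

count_columns = ["nr_of_doors", "nr_of_seats", "cylinders", "gears"]
for column in count_columns:
    # errors="coerce" maps unparseable values to NaN; "Int64" is pandas'
    # nullable integer dtype, so missing counts stay missing.
    counts[column] = pd.to_numeric(counts[column], errors="coerce").astype("Int64")
```

Plain `float` would also work and matches the other quantitative columns; the nullable integer type is just one option for values that are counts.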
