# Cleaning Car Data

## Columns:

**General Columns** 
- url: url of autos 
- short_description, description: Description of autos (in English and German) written by users 

**Categorical Columns**  
- make_model, make, model: Model of autos. Ex:Audi A1 
- body_type, body: Body type of autos Example: van, sedans
- vat: VAT deductible, price negotiable 
- registration, first_registration: First registration date and year of autos. 
- prev_owner, previous_owners: Number of previous owners
- type: new or used 
- next_inspection, inspection_new: information about inspection (inspection date,..) 
- body_color, body_color_original: Color of auto Ex: Black, red
- paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- upholstery: Upholstery information (texture, color) 
- gearing_type: Type of gear Ex: automatic, manual 
- fuel Ex: diesel, benzine 
- co2_emission, emission_class, emission_label: emission information 
- drive_chain: drive chain Ex: front,rear, 4WD 
- consumption: consumption of auto in city, country and combination (lt/100 km) 
- country_version 
- entertainment_media 
- safety_security 
- comfort_convenience 
- extras 

**Quantitative Columns**
- price: Price of cars 
- km: km of autos 
- hp: horsepower of autos (kW) 
- displacement: displacement of autos (cc) 
- warranty: warranty period (month) 
- weight: weight of auto (kg) 
- nr_of_doors: number of doors 
- nr_of_seats : number of seats 
- cylinders: number of cylinders 
- gears: number of gears

---

In [415]:
import numpy as np
import pandas as pd


df = pd.read_json('scout_car.json', lines=True)
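# regex=False keeps '.' literal; treated as a regular expression it would match every character
df.columns = df.columns.str.lower().str.replace(' ', '_', regex=False).str.replace('.', '', regex=False).str.replace('_&_', '_', regex=False).str.strip()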
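
To see what the renaming chain does, here is a minimal sketch on three made-up raw column names. They are assumed for illustration only; the actual raw names in `scout_car.json` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw column names, chosen to exercise each replacement step.
raw_cols = pd.Index(['Body Type', 'Comfort & Convenience', 'Nr. of Doors'])

clean_cols = (
    raw_cols.str.lower()
    .str.replace(' ', '_', regex=False)    # spaces -> underscores
    .str.replace('.', '', regex=False)     # drop literal dots
    .str.replace('_&_', '_', regex=False)  # collapse '_&_' to a single underscore
    .str.strip()
)
print(list(clean_cols))
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions whose default for `regex` differs.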

In [416]:
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| url | https://www.autoscout24.com//offers/audi-a1-sp... | https://www.autoscout24.com//offers/audi-a1-1-... | https://www.autoscout24.com//offers/audi-a1-sp... |
| make_model | Audi A1 | Audi A1 | Audi A1 |
| short_description | Sportback 1.4 TDI S-tronic Xenon Navi Klima | 1.8 TFSI sport | Sportback 1.6 TDI S tronic Einparkhilfe plus+m... |
| body_type | Sedans | Sedans | Sedans |
| price | 15770 | 14500 | 14640 |
| vat | VAT deductible | Price negotiable | VAT deductible |
| km | 56,013 km | 80,000 km | 83,450 km |
| registration | 01/2016 | 03/2017 | 02/2016 |
| prev_owner | 2 previous owners | | 1 previous owner |
| kw | | | |


In [417]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            15919 non-null  object 
 1   make_model                     15919 non-null  object 
 2   short_description              15873 non-null  object 
 3   body_type                      15859 non-null  object 
 4   price                          15919 non-null  int64  
 5   vat                            11406 non-null  object 
 6   km                             15919 non-null  object 
 7   registration                   15919 non-null  object 
 8   prev_owner                     9091 non-null   object 
 9   kw                             0 non-null      float64
 10  hp                             15919 non-null  object 
 11  type                           15917 non-null  object 
 12  previous_owners                9279 non-null  

---

## Dropping columns with more than 90% missing values

In [418]:
def df_nans(df, limit):
    missing = df.isnull().sum()*100 / df.shape[0]
    return missing.loc[lambda x : x >= limit]

def column_nans(serial):
    # display percentage of nans in a Series
    return serial.isnull().sum()*100 / serial.shape[0]

In [419]:
df_nans(df, 90)

kw                               100.000000
electricity_consumption           99.139393
last_service_date                 96.444500
other_fuel_types                  94.472015
availability                      96.011056
last_timing_belt_service_date     99.899491
available_from                    98.291350
dtype: float64

In [420]:
drop_columns = df_nans(df, 90).index
drop_columns

Index(['kw', 'electricity_consumption', 'last_service_date',
       'other_fuel_types', 'availability', 'last_timing_belt_service_date',
       'available_from'],
      dtype='object')

In [421]:
df.drop(drop_columns, axis=1, inplace=True)

In [422]:
df.columns

Index(['url', 'make_model', 'short_description', 'body_type', 'price', 'vat',
       'km', 'registration', 'prev_owner', 'hp', 'type', 'previous_owners',
       'next_inspection', 'inspection_new', 'warranty', 'full_service',
       'non-smoking_vehicle', 'null', 'make', 'model', 'offer_number',
       'first_registration', 'body_color', 'paint_type', 'body_color_original',
       'upholstery', 'body', 'nr_of_doors', 'nr_of_seats', 'model_code',
       'gearing_type', 'displacement', 'cylinders', 'weight', 'drive_chain',
       'fuel', 'consumption', 'co2_emission', 'emission_class',
       'comfort_convenience', 'entertainment_media', 'extras',
       'safety_security', 'description', 'emission_label', 'gears',
       'country_version'],
      dtype='object')
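
The same percentage filter can be written with `isna().mean()`. This is a minimal sketch on a toy frame; the column names and values are illustrative and not taken from the car data.

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is entirely missing, 'c' is half missing.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan] * 4,
    'c': [1.0, np.nan, 3.0, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Keep only the columns whose missing share is below the threshold.
threshold = 90
kept = toy.loc[:, missing_pct < threshold]
print(list(kept.columns))
```

Selecting the columns to keep, rather than dropping in place, leaves the original frame untouched, which can make re-running a cell safer.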

---

## Helper Functions

In [423]:
def safe_strip(x):
    if isinstance(x, list) and len(x) > 1:
        return x[1].strip()
    return x  # Return the original if it's not a list or too short

df['body'] = df['body'].apply(safe_strip)

In [424]:
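def combine_columns(p1, p2):
    # Merge two values that should describe the same quantity.
    # pd.isna also accepts None and strings, where np.isnan would raise a TypeError.
    if pd.isna(p1):
        return p2  # p2 may itself be NaN
    if pd.isna(p2) or p1 == p2:
        return p1
    # Both values are present but disagree.
    return 'conflict'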

#df.apply(lambda x: combine_columns(x[''],x['']), axis=1)
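
A vectorized way to get the same merge is `Series.combine_first` plus a mask for disagreements. This is a sketch on toy values, not the notebook's method; the series `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 2.0, np.nan])
b = pd.Series([1.0, 3.0, 5.0, np.nan])

# Rows where both values are present but disagree.
conflict = a.notna() & b.notna() & (a != b)

# Take a, fall back to b where a is missing, then flag the disagreements.
merged = a.combine_first(b).astype(object)
merged[conflict] = 'conflict'
print(merged.tolist())
```

Counting `conflict.sum()` before merging is a quick way to check whether two supposedly redundant columns really agree.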

---

## General Columns
- [x] url: url of autos
- [x] short_description, description: Description of autos (in English and German) written by users

### url

#### Dropped because the data is not needed

In [425]:
df = df.drop(['url'], axis=1)

### short_description

In [426]:
df = df.drop(['short_description'], axis=1)

#### dropped because data elsewhere in dataframe

### description column

#### Dropped because it includes German description of car written by users

In [427]:
df['description']

0        [\n, Sicherheit:,  , Deaktivierung für Beifahr...
1        [\nLangstreckenfahrzeug daher die hohe Kilomet...
2        [\n, Fahrzeug-Nummer: AM-95365,  , Ehem. UPE 2...
3        [\nAudi A1: , - 1e eigenaar , - Perfecte staat...
4        [\n, Technik & Sicherheit:, Xenon plus, Klimaa...
                               ...                        
15914    [\nVettura visionabile nella sede in Via Roma ...
15915    [\nDach: Panorama-Glas-Schiebedach, Lackierung...
15916    [\n, Getriebe:,  Automatik, Technik:,  Bordcom...
15917    [\nDEK:[2691331], Renault Espace Blue dCi 200C...
15918    [\n, Sicherheit Airbags:,  , Seitenairbag,  , ...
Name: description, Length: 15919, dtype: object

In [428]:
df.drop('description',axis=1,inplace=True)

## Categorical Columns
- [x] make_model, make, model: Model of autos. Ex:Audi A1 
- [x] body_type: Body type of autos Example: van, sedans
- [x] body: Body type of autos Example: van, sedans
- [x] vat: VAT deductible
- [x] registration: First registration date and year of autos. 
- [x] first_registration: First registration date and year of autos. 
- [x] prev_owner: Number of previous owners
- [x] previous_owners: Number of previous owners
- [x] type: new or used 
- [x] next_inspection: information about inspection (inspection date,..) 
- [x] inspection_new: information about inspection (inspection date,..) 
- [x] body_color: Color of auto Ex: Black, red
- [x] body_color_original: Color of auto Ex: Black, red
- [x] paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- [ ] upholstery: Upholstery information (texture, color) 
- [x] gearing_type: Type of gear Ex: automatic, manual 
- [ ] fuel Ex: diesel, benzine 
- [x] co2_emission
- [ ] emission_class
- [x] emission_label
- [x] drive_chain: drive chain Ex: front,rear, 4WD 
- [x] consumption: consumption of auto in city, country and combination (lt/100 km) 
- [x] country_version 
- [x] entertainment_media 
- [x] safety_security
- [x] comfort_convenience 
- [x] extras 

### make_model, make and model (drop make and model)

#### Drop make and model because they are redundant with make_model

In [429]:
df = df.drop(['make', 'model'], axis=1)

In [430]:
df['make_model'].value_counts(dropna=False)

make_model
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: count, dtype: int64

### body_type and body (drop body)

In [431]:
df['body_type'].value_counts(dropna=False)

body_type
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
None               60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64

In [432]:
df['body'].value_counts(dropna=False)

body
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64

#### body_type and body are the same, can drop body

In [433]:
df.drop('body', axis=1, inplace=True)

### vat

In [434]:
df['vat'].value_counts(dropna=False)

vat
VAT deductible      10980
None                 4513
Price negotiable      426
Name: count, dtype: int64

### registration and registration_year (merge and drop registration)

In [435]:
df['registration'] = df['registration'].replace('-/-', np.nan)
df['registration'] = pd.DatetimeIndex(df['registration'])
df['registration_year'] = pd.DatetimeIndex(df['registration']).year

In [436]:
df['registration_year'].value_counts(dropna=False)

registration_year
2018.0    4522
2016.0    3674
2017.0    3273
2019.0    2853
NaN       1597
Name: count, dtype: int64
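
An alternative for parsing the `MM/YYYY` strings is `pd.to_datetime` with an explicit format, which turns unparseable entries such as `-/-` into `NaT` without a separate replace step. A minimal sketch on toy values (the sample strings are assumed to resemble the real column):

```python
import pandas as pd

raw = pd.Series(['01/2016', '-/-', '03/2017'])

# errors='coerce' maps anything that does not match the format to NaT.
parsed = pd.to_datetime(raw, format='%m/%Y', errors='coerce')

# The year comes back as float because the missing entry becomes NaN.
years = parsed.dt.year
print(years.tolist())
```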

#### drop registration b/c no longer needed

In [437]:
df.drop('registration', axis=1, inplace=True)

### prev_owner and previous_owners (drop prev_owner)

In [438]:
df['prev_owner'] = df['prev_owner'].str.findall(r'\d+').str[0].astype("float")

In [439]:
df['prev_owner'].value_counts(dropna=False)

prev_owner
1.0    8294
NaN    6828
2.0     778
3.0      17
4.0       2
Name: count, dtype: int64
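
The `findall(...).str[0]` pattern can also be written with `str.extract`, which returns the first match directly. A small sketch on made-up strings shaped like the `prev_owner` values:

```python
import pandas as pd

owners = pd.Series(['2 previous owners', None, '1 previous owner'])

# expand=False returns a Series; rows without a match (or missing rows) become NaN.
counts = owners.str.extract(r'(\d+)', expand=False).astype(float)
print(counts.tolist())
```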

In [440]:
df['previous_owners'] = df['previous_owners'].str.findall(r'\d+').str[0].astype("float")

In [441]:
df['previous_owners'].value_counts(dropna=False)

previous_owners
1.0    8101
NaN    6870
2.0     766
0.0     163
3.0      17
4.0       2
Name: count, dtype: int64

In [442]:
df['previous_owners'] = df.apply(lambda x: combine_columns(x['prev_owner'],x['previous_owners']), axis=1)

In [443]:
df = df.drop(['prev_owner'], axis=1)

#### Drop prev_owner because it is redundant with previous_owners

In [444]:
df['previous_owners'].value_counts(dropna=False)

previous_owners
1.0    8294
NaN    6665
2.0     778
0.0     163
3.0      17
4.0       2
Name: count, dtype: int64

### type

In [445]:
df['type'] = df['type'].str[1]

In [446]:
df['type'].value_counts(dropna=False)

type
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: count, dtype: int64

### next_inspection

In [447]:
df['next_inspection'] = df['next_inspection'].str[0].str.strip().astype("string")
df['next_inspection'] = pd.DatetimeIndex(df['next_inspection'])
df['next_inspection'] = pd.DatetimeIndex(df['next_inspection']).year

In [448]:
df['next_inspection'].value_counts(dropna=False)

next_inspection
NaN       13094
2021.0     1401
2020.0      557
2022.0      483
2019.0      336
2018.0       26
2017.0        7
2023.0        5
2001.0        5
2016.0        3
2014.0        1
1921.0        1
Name: count, dtype: int64

### inspection_new

In [449]:
df['inspection_new']

0                     [\nYes\n, \nEuro 6\n]
1                                       NaN
2                                       NaN
3                                       NaN
4        [\nYes\n, \n109 g CO2/km (comb)\n]
                        ...                
15914                                   NaN
15915                                   NaN
15916           [\nYes\n, \nEuro 6d-TEMP\n]
15917                                   NaN
15918    [\nYes\n, \n153 g CO2/km (comb)\n]
Name: inspection_new, Length: 15919, dtype: object

In [450]:
df['inspection_new'] = df['inspection_new'].str[0].str.strip()

In [451]:
df['inspection_new'].value_counts(dropna=False)

inspection_new
NaN    11987
Yes     3570
         362
Name: count, dtype: int64
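
Many columns in this dataset hold Python lists, and `.str[0]` indexes into each list element-wise. The sketch below shows that on toy cells and also converts the leftover empty strings to NaN; the third toy row is an assumption about what produces the blank category above.

```python
import numpy as np
import pandas as pd

cells = pd.Series([
    ['\nYes\n', '\nEuro 6\n'],
    np.nan,
    ['\n', '\n109 g CO2/km (comb)\n'],  # assumed shape of a row that ends up blank
])

# .str[0] takes the first list item of each row; .str.strip() removes the newlines.
first = cells.str[0].str.strip()

# Treat empty strings as missing so they are counted together with NaN.
first = first.replace('', np.nan)
print(first.tolist())
```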

### body_color

In [452]:
df['body_color'] = df['body_color'].str[1]

In [453]:
df['body_color'].value_counts(dropna=False)

body_color
Black     3745
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
NaN        597
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: count, dtype: int64

### body_color_original (dropped)

In [454]:
df['body_color_original'] = df['body_color_original'].str[0].str[1:-1]

In [455]:
df['body_color_original'].value_counts(dropna=False)

body_color_original
NaN                              3759
Onyx Schwarz                      338
Bianco                            282
Mythosschwarz Metallic            238
Brillantschwarz                   216
                                 ... 
Rouge-Braun (G0Y)                   1
VARI COLRI DISPONIBILI              1
Kokosnussbraun Metallic             1
Farbe frei wählbar                  1
Perlmutt-Weiß Metallic (Weiß)       1
Name: count, Length: 1928, dtype: int64

In [456]:
df['body_color_original'].isnull().sum()

3759

In [457]:
#import statsmodels.api as sm
#from statsmodels.formula.api import ols
#model = ols('price ~ C(body_color_original)', data=df).fit()
#anova_table = sm.stats.anova_lm(model, typ=2)
#anova_table

#### This column also can be dropped

In [458]:
df.drop('body_color_original',axis=1,inplace=True)

### paint_type

In [459]:
df['paint_type'] = df['paint_type'].str[0].str.strip()

In [460]:
df['paint_type']

0        Metallic
1             NaN
2        Metallic
3        Metallic
4        Metallic
           ...   
15914    Metallic
15915    Metallic
15916         NaN
15917         NaN
15918    Metallic
Name: paint_type, Length: 15919, dtype: object

### upholstery (dropped and upholstery_color and upholstery_material columns created)

In [461]:
df['upholstery_material'] = df['upholstery'].str[0].str.replace('\n','').str.split(', ').str[0]

In [462]:
list_color = ['Black','Grey','Brown','Beige', 'Blue', 'White']
for i in list_color:
    df['upholstery_material'] = df['upholstery_material'].replace(i,np.nan)

In [463]:
df['upholstery_material'].value_counts(dropna=False)

upholstery_material
Cloth           8423
NaN             4503
Part leather    1499
Full leather    1009
Other            368
Velour            60
alcantara         57
Name: count, dtype: int64

In [465]:
df['upholstery_color'] = df['upholstery'].str[0].str.replace('\n','').str.replace(', ','')

In [466]:
list_uph_mat = ['Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara']
for i in list_uph_mat:
    df['upholstery_color'] = df['upholstery_color'].str.replace(i,'')

In [467]:
df['upholstery_color'] = df['upholstery_color'].replace('',np.nan)

In [468]:
df['upholstery_color'].value_counts(dropna=False)

upholstery_color
Black     8201
NaN       6038
Grey      1376
Brown      207
Beige       54
Blue        16
White       13
Red          9
Yellow       4
Orange       1
Name: count, dtype: int64

#### The upholstery column was split into two new columns, upholstery_color and upholstery_material, so it can now be dropped.

In [469]:
df = df.drop('upholstery', axis=1)
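
The material/color split can also be done in one pass by checking each comma-separated part against the known material names. This is a sketch with a hypothetical helper `split_upholstery`; the material set is copied from `list_uph_mat`, and the sample strings are made up.

```python
import numpy as np
import pandas as pd

MATERIALS = {'Cloth', 'Part leather', 'Full leather', 'Other', 'Velour', 'alcantara'}

def split_upholstery(text):
    """Return (material, color) from a string such as 'Cloth, Black'."""
    if not isinstance(text, str):
        return (np.nan, np.nan)
    parts = [p.strip() for p in text.split(',') if p.strip()]
    # First part that is a known material, and first part that is not.
    material = next((p for p in parts if p in MATERIALS), np.nan)
    color = next((p for p in parts if p not in MATERIALS), np.nan)
    return (material, color)

raw = pd.Series(['Cloth, Black', 'Grey', 'Full leather', np.nan])
pairs = raw.apply(split_upholstery)
toy = pd.DataFrame(pairs.tolist(), columns=['upholstery_material', 'upholstery_color'])
print(toy)
```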

### gearing_type

In [470]:
df['gearing_type'] = df['gearing_type'].str[1]

In [471]:
df['gearing_type'].value_counts(dropna=False)

gearing_type
Manual            8153
Automatic         7297
Semi-automatic     469
Name: count, dtype: int64

### fuel

In [472]:
df['fuel']

0                    [\n, Diesel (Particulate Filter), \n]
1                                       [\n, Gasoline, \n]
2                    [\n, Diesel (Particulate Filter), \n]
3                    [\n, Diesel (Particulate Filter), \n]
4                    [\n, Diesel (Particulate Filter), \n]
                               ...                        
15914                [\n, Diesel (Particulate Filter), \n]
15915    [\n, Super 95 / Super Plus 98 (Particulate Fil...
15916                                     [\n, Diesel, \n]
15917                                     [\n, Diesel, \n]
15918                                   [\n, Super 95, \n]
Name: fuel, Length: 15919, dtype: object

In [None]:
# The fuel label is the second element of each list (index 1);
# the first and last elements are bare newlines.
df['fuel'] = df['fuel'].str[1]

In [None]:
benzine = df['fuel'].str.contains('Benzine', na=False, regex=False)

In [None]:
particulate = df['fuel'].str.contains('Particulate', na=False, regex=False)

In [None]:
df['particulate'] = 'unparticulate'
df.loc[particulate, 'particulate'] = 'particulate'

In [None]:
df['particulate'].value_counts()

In [None]:
# Named super_fuel to avoid shadowing the built-in super()
super_fuel = df['fuel'].str.contains('Super', na=False, regex=False)
gasoline = df['fuel'].str.contains('Gasoline', na=False, regex=False)

In [None]:
# Benzine, Super and Gasoline labels are all grouped as benzine
df.loc[benzine | super_fuel | gasoline, 'fuel'] = 'benzine'

In [None]:
gas = df['fuel'].isin(['LPG', 'Liquid petroleum gas (LPG)',
                       'CNG', 'CNG (Particulate Filter)',
                       'Biogas', 'Domestic gas H'])
df.loc[gas, 'fuel'] = 'gas'

In [None]:
diesel = df['fuel'].isin(['Diesel (Particulate Filter)', 'Diesel'])
df.loc[diesel, 'fuel'] = 'diesel'

In [None]:
others = df['fuel'].isin(['Others', 'Others (Particulate Filter)', 'Electric'])
df.loc[others, 'fuel'] = 'others'

In [None]:
df['fuel'].value_counts(dropna=False)

### co2_emission

In [None]:
df['co2_emission'] = df['co2_emission'].str[0].str.strip().str.findall(r'\d+').str[0].astype("float")

In [None]:
df['co2_emission'].value_counts(dropna=False)

### emission_class

In [None]:
df['emission_class'] = df['emission_class'].str[0].str.replace('\n','')

In [None]:
df['emission_class'].value_counts(dropna=False)

In [None]:
# Assign the result back; inplace=True on a selected column may not update df
df['emission_class'] = df['emission_class'].replace(['Euro 6', 'Euro 6d-TEMP', 'Euro 6d', 'Euro 6c'], 'Euro 6')

### emission_label (dropped)

In [None]:
#df['emission_label'] = df['emission_label'].str[0].str.findall('\((.*?)\)').str[0]

In [None]:
# Inspect the raw column; value_counts would fail here if the cells are still lists (unhashable)
df['emission_label']


In [None]:
df['emission_label'] = df['emission_label'].str[0].str.strip().str.replace(')', '', regex=False).str.replace('(', '', regex=False)

In [None]:
df['emission_label'].value_counts(dropna=False)

In [None]:
df.drop('emission_label',axis=1,inplace=True)

### drive_chain

In [None]:
df['drive_chain'] = df['drive_chain'].str[0].str.strip()

In [None]:
df['drive_chain'].value_counts(dropna=False)

### consumption l/100 km (turned into city, country and combined consumption columns)

In [None]:
# comb
df['combined_consumption'] = df['consumption'].str[0].str[0].str.replace('l/100 km (comb)', '').str.replace('kg/100 km (comb)', '').replace('\n', 'NaN').astype("float")

In [None]:
df['combined_consumption'].value_counts(dropna=False)

In [None]:
# city
df['city_consumption'] = df['consumption'].str[1].str[0].str.replace('l/100 km (city)', '').str.replace('kg/100 km (city)', '').replace('\n', 'NaN').astype("float")

In [None]:
df['city_consumption'].value_counts(dropna=False)

In [None]:
# country
df['country_consumption'] = df['consumption'].str[2].str[0].str.replace('l/100 km (country)', '').str.replace('kg/100 km (country)', '').replace('\n', 'NaN').astype("float")

In [None]:
df['country_consumption'].value_counts(dropna=False)
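
Instead of stripping each unit suffix by name, the leading number can be pulled out with one regular expression. A sketch on toy strings; the flat string shape is an assumption about what the nested `consumption` lists contain after indexing.

```python
import pandas as pd

raw = pd.Series(['3.8 l/100 km (comb)', '4.3 kg/100 km (comb)', '\n', None])

# Capture the first integer or decimal number; rows without one become NaN.
value = pd.to_numeric(raw.str.extract(r'(\d+(?:\.\d+)?)', expand=False))
print(value.tolist())
```

This handles the `l/100 km` and `kg/100 km` variants with the same pattern, so the city, country and combined columns could share one helper.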

### country_version

In [None]:
df['country_version'] = df['country_version'].str[0].str.strip()

In [None]:
df['country_version'].value_counts(dropna=False)

#### This column can be dropped.

In [None]:
df.drop('country_version',axis=1,inplace=True)

### entertainment_media

In [None]:
df['entertainment_media']

In [None]:
df['entertainment_media'] = df['entertainment_media'].astype('str').str.replace('[','').str.replace("]",'')

#### Only the list brackets were stripped; this column will be expanded with get_dummies later

### safety_security

In [None]:
df['safety_security']

#### This column is left unchanged; it will be expanded with get_dummies later

### comfort_convenience

In [None]:
df['comfort_convenience'] = df['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'')

In [None]:
df['comfort_convenience'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=",")

In [None]:
df['comfort_convenience']

#### Only the list brackets were stripped; this column will be expanded with get_dummies later

### extras

In [None]:
df['extras'].astype('str').str.replace('[','').str.replace("]",'').str.get_dummies(sep=', ').sum()

In [None]:
df['extras']

#### This column is left unchanged; it will be expanded with get_dummies later
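
For the later `get_dummies` step, list-valued cells can be joined with a separator and expanded directly, without converting the list to its string representation first. A minimal sketch on made-up equipment names:

```python
import numpy as np
import pandas as pd

extras = pd.Series([['Alloy wheels', 'Roof rack'], ['Alloy wheels'], np.nan])

# Join each list with '|' and expand to one indicator column per item.
# Missing rows produce a row of zeros.
dummies = extras.str.join('|').str.get_dummies(sep='|')
print(dummies)
```

Using `|` as the separator avoids splitting item names that themselves contain commas.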

---

## Quantitative Columns
- [x] price: Price of cars 
- [x] km: km of autos 
- [x] hp: horsepower of autos (kW) 
- [x] displacement: displacement of autos (cc) 
- [ ] warranty: warranty period (month)
- [x] weight: weight of auto (kg) 
- [x] nr_of_doors: number of doors 
- [x] nr_of_seats : number of seats 
- [x] cylinders: number of cylinders 
- [x] gears: number of gears

### price

In [None]:
# Assign the result back; astype returns a new Series and does not modify df
df['price'] = df['price'].astype("float")

In [None]:
df['price'].value_counts(dropna=False)

### km

In [None]:
df['km'] = df['km'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [None]:
df['km'].value_counts(dropna=False).index

### hp

In [None]:
df['hp'] = df['hp'].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [None]:
df['hp'].value_counts().index

### displacement

In [None]:
df['displacement'] = df['displacement'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [None]:
df['displacement'].value_counts().index

### warranty

In [None]:
df['warranty']

In [None]:
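import re

def clean_warranty(a):
    # Lists carry the value in their first element; unwrap it first.
    if isinstance(a, list):
        a = a[0]
    # Pull the first run of digits out of the string, or NaN if there is none.
    if isinstance(a, str):
        digits = re.findall(r'\d+', a)
        return digits[0] if digits else np.nan
    # Anything else (e.g. NaN) is passed through unchanged.
    return a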

In [None]:
df['warranty'] = df['warranty'].apply(clean_warranty)

In [None]:
df['warranty'] = df['warranty'].astype('float')

In [None]:
df['warranty'].value_counts(dropna=False)

### weight

In [None]:
df['weight'] = df['weight'].str[0].str.replace(',', '').str.findall(r'\d+').str[0].astype("float")

In [None]:
df['weight'].value_counts().index

### nr_of_doors

In [None]:
df['nr_of_doors'] = df['nr_of_doors'].str[0].str.findall(r'\d+').str[0]

In [None]:
df['nr_of_doors'].value_counts()

### nr_of_seats

In [None]:
df['nr_of_seats'] = df['nr_of_seats'].str[0].str.findall(r'\d+').str[0]

In [None]:
df['nr_of_seats'].value_counts()

### cylinders

In [None]:
df['cylinders'] = df['cylinders'].str[0].str.findall(r'\d+').str[0]


In [None]:
df['cylinders'].value_counts(dropna=False)

### gears

In [None]:
df['gears'] = df['gears'].str[0].str.findall(r'\d+').str[0]


In [None]:
df['gears'].value_counts(dropna=False)

## Create CSV

In [None]:
df.info()

### Save the cleaned data to a new CSV before filling null values
- Each step uses its own notebook.

In [None]:
df.to_csv("cleaned_car_data.csv", index=False)
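
Before moving to the next notebook it can help to confirm that the file reads back as expected. The sketch below does a round trip through an in-memory buffer on a toy frame; the values are made up, and a real check would read `cleaned_car_data.csv` instead.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'price': [15770.0, 14500.0],
    'km': [56013.0, np.nan],
    'body_type': ['Sedans', None],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Missing values are written as empty fields and come back as NaN.
print(restored.dtypes)
print(restored.isna().sum())
```

Note that CSV does not store dtypes, so columns are re-inferred on read; list-valued columns in particular come back as plain strings.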