# Cleaning Car Data

## Columns:

**Categorical Columns**  
- make_model, make, model: Model of autos. Ex:Audi A1 
- body_type, body: Body type of autos Example: van, sedans
- vat: VAT deductible, price negotiable 
- registration, first_registration: First registration date and year of autos. 
- prev_owner, previous_owners: Number of previous owners
- type: new or used 
- next_inspection, inspection_new: information about inspection (inspection date,..) 
- body_color, body_color_original: Color of auto Ex: Black, red
- paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- upholstery: Upholstery information (texture, color) 
- gearing_type: Type of gear Ex: automatic, manual 
- fuel: fuel type Ex: diesel, benzine 
- co2_emission, emission_class, emission_label: emission information 
- drive_chain: drive chain Ex: front,rear, 4WD 
- consumption: consumption of auto in city, country and combination (lt/100 km) 
- country_version 
- entertainment_media 
- safety_security 
- comfort_convenience 
- extras 

**Quantitative Columns**
- price: Price of cars 
- km: km of autos 
- hp: horsepower of autos (kW) 
- displacement: displacement of autos (cc) 
- warranty: warranty period (month) 
- weight: weight of auto (kg) 
- nr_of_doors: number of doors 
- nr_of_seats : number of seats 
- cylinders: number of cylinders 
- gears: number of gears

---

In [2]:
import pandas as pd
import numpy as np

car_data = pd.read_csv('cleaned_car_data.csv')

In [3]:
car_data.head(3).T

Unnamed: 0,0,1,2
make_model,Audi A1,Audi A1,Audi A1
body_type,Sedans,Sedans,Sedans
price,15770,14500,14640
vat,VAT deductible,Price negotiable,VAT deductible
km,56013.0,80000.0,83450.0
registration,2016-01-01,2017-03-01,2016-02-01
hp,66.0,141.0,85.0
type,Used,Used,Used
previous_owners,2.0,,1.0
next_inspection,2021.0,,


In [4]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   make_model           15919 non-null  object 
 1   body_type            15859 non-null  object 
 2   price                15919 non-null  int64  
 3   vat                  11406 non-null  object 
 4   km                   14895 non-null  float64
 5   registration         14322 non-null  object 
 6   hp                   15831 non-null  float64
 7   type                 15917 non-null  object 
 8   previous_owners      9254 non-null   float64
 9   next_inspection      2825 non-null   float64
 10  inspection_new       3570 non-null   object 
 11  warranty             4853 non-null   float64
 12  full_service         8215 non-null   object 
 13  non-smoking_vehicle  7177 non-null   object 
 14  null                 15919 non-null  object 
 15  offer_number         12744 non-null 

---

## Replace Missing Values Function


In [5]:
def fill(method, df, column, group_cols=None):
    """
    Fills NaN values in `df[column]` either using the overall mean, median or mode (no grouping)
    or group-specific mean, median or mode (group_col provided).
    Prints stats about how many NaNs were filled and the final distribution.
    """
    # Debug prints: which column is being filled, and grouping info.
    print('Filling column:', column)
    print('Grouping by:', group_cols)

    # 1. Count NaNs before filling
    nan_before = df[column].isnull().sum()

    # 2. Fill logic
    if method == 'mean':
            if group_cols is None:
                # Fill all NaN in 'column' with the overall mean
                df[column].fillna(df[column].mean().iloc[0], inplace=True)
            else:
                # Calculate groupwise mean for each row
                group = df.groupby(group_cols)[column].transform(lambda x: x.mean())
                # Fill missing values in df[column] with corresponding group mean
                df[column].fillna(group, inplace=True)
    
    elif method == 'median':
        if group_cols is None:
            # Fill all NaN in 'column' with the overall median
            df[column].fillna(df[column].median().iloc[0], inplace=True)
        else:
            # Calculate groupwise median for each row
            group = df.groupby(group_cols)[column].transform(lambda x: x.median())
            # Fill xmissing values in df[column] with corresponding group median
            df[column].fillna(group, inplace=True)
    
    elif method == 'mode':
        if group_cols is None:
            # Fill all NaN in 'column' with the overall mode
            df[column].fillna(df[column].mode().iloc[0], inplace=True)            
        else:
            # Calculate groupwise mode for each row
            group = df.groupby(group_cols)[column].transform(lambda x: x.mode())
            # Fill missing values in df[column] with corresponding group mode
            df[column].fillna(group, inplace=True)

    # 3. Count NaNs after filling
    nan_after = df[column].isnull().sum()
    nan_filled = nan_before - nan_after

    # 4. Print final stats
    print("Number of NaN before filling:", nan_before)
    print("Number of NaN filled:", nan_filled)
    print("Number of NaN after filling:", nan_after)
    print("------------------")
    print(df[column].value_counts(dropna=False))

---

## Provided Functions

In [6]:
def fill_most_freq(df, group_col, col_name):
    
    '''Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping'''
    
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        mode = list(df[cond][col_name].mode())
        if mode != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print("Number of NaN : ", df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))

In [7]:
def fill_prop(df, group_col, col_name):
    
    '''Fills the missing values with "ffill and bfill method" according to single-stage grouping'''
    
    for group in list(df[group_col].unique()):
        cond = df[group_col]==group
        df.loc[cond, col_name] = df.loc[cond, col_name].fillna(method="ffill").fillna(method="bfill")
    df[col_name] = df[col_name].fillna(method="ffill").fillna(method="bfill")
    print("Number of NaN : ", df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))

In [8]:
def double_stage(df, group_col1, group_col2, col_name, method): # method can be either "mode" or "mean" or "median" or "ffill"
    
    '''Fills the missing values with "mode/mean/median/ffill/bfill method" according to double-stage grouping'''
    
    if method == "mode":
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond1 = df[group_col1]==group1
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                mode1 = list(df[cond1][col_name].mode())
                mode2 = list(df[cond2][col_name].mode())
                if mode2 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond2][col_name].mode()[0])
                elif mode1 != []:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[cond1][col_name].mode()[0])
                else:
                    df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(df[col_name].mode()[0])

    elif method == "mean":
        df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("mean"), inplace = True)
        df[col_name].fillna(df.groupby(group_col1)[col_name].transform("mean"), inplace = True)
        df[col_name].fillna(df[col_name].mean(), inplace = True)
        
    elif method == "median":
        df[col_name].fillna(df.groupby([group_col1, group_col2])[col_name].transform("median"), inplace = True)
        df[col_name].fillna(df.groupby(group_col1)[col_name].transform("median"), inplace = True)
        df[col_name].fillna(df[col_name].median(), inplace = True)
        
    elif method == "ffill":           
        for group1 in list(df[group_col1].unique()):
            for group2 in list(df[group_col2].unique()):
                cond2 = (df[group_col1]==group1) & (df[group_col2]==group2)
                df.loc[cond2, col_name] = df.loc[cond2, col_name].fillna(method="ffill").fillna(method="bfill")
                
        for group1 in list(df[group_col1].unique()):
            cond1 = df[group_col1]==group1
            df.loc[cond1, col_name] = df.loc[cond1, col_name].fillna(method="ffill").fillna(method="bfill")            
           
        df[col_name] = df[col_name].fillna(method="ffill").fillna(method="bfill")
    
    print("Number of NaN : ",df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))

---

## Categorical Columns
- [x] make_model Ex: Audi A1 
- [x] body_type: Example: van, sedans
- [] body: Body type of autos Example: van, sedans
- [] vat: VAT deductible, price negotiable 
- [] registration_year: First registration year of autos.
- [] previous_owners: Number of previous owners
- [x] type: new or used 
- [] next_inspection: information about inspection (inspection date,..) 
- [x] inspection_new: information about inspection (inspection date,..) 
- [x] body_color: Color of auto Ex: Black, red
- [] body_color_original: Color of auto Ex: Black, red
- [] paint_type: Paint type of auto Ex: Metallic, Uni/basic 
- [] upholstery: Upholstery information (texture, color) 
- [x] gearing_type: Type of gear Ex: automatic, manual 
- [] fuel: fuel type Ex: diesel, benzine 
- [] combined_emissions, city_emissions, country_emissions
- [] co2_emission, emission_class, emission_label: emission information 
- [] drive_chain: drive chain Ex: front,rear, 4WD 
- [] consumption: consumption of auto in city, country and combination (lt/100 km) 
- [] country_version 
- [] entertainment_media 
- [] safety_security
- [] comfort_convenience 
- [] extras

### make_model

In [9]:
car_data['make_model'].value_counts(dropna=False)

make_model
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: count, dtype: int64

### body_type

In [10]:
car_data['body_type'].value_counts(dropna=False)

body_type
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64

In [11]:
car_data['make_model'].value_counts(dropna=False)

make_model
Audi A3           3097
Audi A1           2614
Opel Insignia     2598
Opel Astra        2526
Opel Corsa        2219
Renault Clio      1839
Renault Espace     991
Renault Duster      34
Audi A2              1
Name: count, dtype: int64

#### same make_model same body_type so we can use fill()

In [12]:
fill('mode', car_data, 'body_type', 'make_model')

Filling column: body_type
Grouping by: make_model
Number of NaN before filling: 60
Number of NaN filled: 0
Number of NaN after filling: 60
------------------
body_type
Sedans           7903
Station wagon    3553
Compact          3153
Van               783
Other             290
Transporter        88
NaN                60
Off-Road           56
Coupe              25
Convertible         8
Name: count, dtype: int64


### vat

In [13]:
car_data.vat

0          VAT deductible
1        Price negotiable
2          VAT deductible
3                     NaN
4                     NaN
               ...       
15914      VAT deductible
15915      VAT deductible
15916      VAT deductible
15917      VAT deductible
15918      VAT deductible
Name: vat, Length: 15919, dtype: object

### registration

In [14]:
# Look at distance traveld

In [15]:
car_data['registration']

0        2016-01-01
1        2017-03-01
2        2016-02-01
3        2016-08-01
4        2016-05-01
            ...    
15914           NaN
15915    2019-01-01
15916    2019-03-01
15917    2019-06-01
15918    2019-01-01
Name: registration, Length: 15919, dtype: object

### previous owners

In [16]:
car_data['previous_owners'].value_counts(dropna=False)

previous_owners
1.0    8294
NaN    6665
2.0     778
0.0     163
3.0      17
4.0       2
Name: count, dtype: int64

In [17]:
car_data['previous_owners'].fillna(0.0, inplace=True)

#### NaN means no previous owner so can fill with 0.0

### type

In [18]:
car_data['type'].value_counts(dropna=False)

type
Used              11096
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
NaN                   2
Name: count, dtype: int64

#### Only 2 NaN so replaced with mode

In [19]:
fill('mode', car_data, 'type')

Filling column: type
Grouping by: None
Number of NaN before filling: 2
Number of NaN filled: 2
Number of NaN after filling: 0
------------------
type
Used              11098
New                1650
Pre-registered     1364
Employee's car     1011
Demonstration       796
Name: count, dtype: int64


### next_inspection

In [20]:
car_data['next_inspection'].value_counts(dropna=False)

next_inspection
NaN       13094
2021.0     1401
2020.0      557
2022.0      483
2019.0      336
2018.0       26
2017.0        7
2023.0        5
2001.0        5
2016.0        3
2014.0        1
1921.0        1
Name: count, dtype: int64

### inspection_new

In [21]:
car_data['inspection_new'].fillna('No', inplace=True)

In [22]:
car_data['inspection_new'].value_counts(dropna=False)

inspection_new
No     12349
Yes     3570
Name: count, dtype: int64

### body_color

In [23]:
car_data['body_color'].value_counts(dropna=False)

body_color
Black     3745
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
NaN        597
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: count, dtype: int64

In [24]:
fill('mode', car_data, 'body_color')

Filling column: body_color
Grouping by: None
Number of NaN before filling: 597
Number of NaN filled: 597
Number of NaN after filling: 0
------------------
body_color
Black     4342
Grey      3505
White     3406
Silver    1647
Blue      1431
Red        957
Brown      289
Green      154
Beige      108
Yellow      51
Violet      18
Bronze       6
Orange       3
Gold         2
Name: count, dtype: int64


### upholstery

In [25]:
car_data['upholstery'].value_counts(dropna=False)

upholstery
Cloth, Black           5821
NaN                    3720
Part leather, Black    1121
Cloth                  1005
Cloth, Grey             891
Cloth, Other            639
Full leather, Black     575
Black                   491
Grey                    273
Other, Other            182
Part leather            140
Full leather            139
Full leather, Brown     116
Part leather, Grey      116
Other, Black            110
Full leather, Other      72
Full leather, Grey       67
Part leather, Other      65
Other                    56
Part leather, Brown      50
alcantara, Black         47
Velour, Black            36
Full leather, Beige      36
Cloth, Brown             28
Velour                   16
Other, Grey              15
Cloth, Beige             13
Brown                    12
Cloth, Blue              12
Velour, Grey              8
Cloth, White              8
alcantara, Grey           6
Cloth, Red                5
Other, Yellow             4
Part leather, Red         3
Beige    

### gearing_type

In [26]:
car_data['gearing_type'].value_counts(dropna=False)

gearing_type
Manual            8153
Automatic         7297
Semi-automatic     469
Name: count, dtype: int64

### fuel

In [27]:
car_data['fuel'].value_counts(dropna=False)

fuel
Diesel (Particulate Filter)                                                                                  4315
Super 95                                                                                                     3338
Gasoline                                                                                                     3175
Diesel                                                                                                       2984
Super 95 / Regular/Benzine 91                                                                                 424
                                                                                                             ... 
Regular/Benzine 91 / Super 95 / Regular/Benzine E10 91                                                          1
Super Plus 98 / Super E10 95                                                                                    1
Regular/Benzine 91 / Super 95 / Regular/Benzine E10 91 / Super E10 95 / Super Plus 

### co2_emission

In [28]:
car_data['co2_emission'].value_counts(dropna=False)

co2_emission
NaN      2436
120.0     740
99.0      545
97.0      537
104.0     501
         ... 
51.0        1
165.0       1
331.0       1
80.0        1
193.0       1
Name: count, Length: 120, dtype: int64

In [29]:
fill("median", car_data, 'co2_emission', ['combined_emissions'])

Filling column: co2_emission
Grouping by: ['combined_emissions']
Number of NaN before filling: 2436
Number of NaN filled: 523
Number of NaN after filling: 1913
------------------
co2_emission
NaN      1913
120.0     796
97.0      551
99.0      548
104.0     514
         ... 
253.0       1
331.0       1
51.0        1
165.0       1
193.0       1
Name: count, Length: 120, dtype: int64


  return np.nanmean(a, axis, out=out, keepdims=keepdims)
  return np.nanmean(a, axis, out=out, keepdims=keepdims)


### emission_class

In [30]:
car_data['emission_class'].value_counts(dropna=False)

emission_class
Euro 6    12173
NaN        3628
Euro 5       78
Euro 4       40
Name: count, dtype: int64

### drive_chain

In [31]:
car_data['drive_chain'].value_counts(dropna=False)

drive_chain
front    8886
NaN      6858
4WD       171
rear        4
Name: count, dtype: int64

### consumption l/100 km 

In [32]:
car_data['combined_emissions'].value_counts(dropna=False).index

Index([ nan,  3.9,  4.0,  5.4,  5.1,  4.4,  3.8,  5.6,  4.7,  4.8,  5.0,  4.5,
        5.2,  4.6,  4.2,  5.3,  3.7,  4.9,  5.5,  4.1,  5.9,  3.3,  5.7,  4.3,
        3.5,  6.0,  3.6,  6.2,  5.8,  6.3,  6.1,  6.8,  6.6,  3.4,  3.0,  6.4,
        7.4,  7.1,  6.5, 10.0,  6.7,  3.2,  6.9,  8.3,  7.6,  7.0,  3.1,  7.2,
        7.8,  8.0, 51.0,  8.7,  8.6,  7.3,  7.9,  8.1, 40.0, 38.0,  0.0, 11.0,
       43.0,  7.5, 13.8, 55.0, 54.0,  1.2, 32.0, 33.0, 50.0,  1.0, 46.0,  9.1],
      dtype='float64', name='combined_emissions')

In [33]:
car_data['city_emissions'].value_counts(dropna=False).index

Index([ nan,  5.0,  5.8,  4.5,  4.3,  4.0,  5.1,  6.0,  6.8,  4.6,  7.2,  5.7,
        7.3,  4.2,  5.9,  7.8,  6.6,  5.2,  4.1,  6.3,  5.4,  4.7,  6.7,  3.9,
        3.5,  7.6,  7.1,  7.5,  6.9,  5.5,  7.0,  6.2,  7.4,  7.7,  6.5,  8.7,
        6.1,  4.4,  8.2,  8.0,  5.3,  6.4,  5.6,  7.9,  4.8,  4.9,  3.7,  3.4,
        9.6,  9.2,  3.3,  8.5,  8.6,  8.3,  3.8, 10.2,  8.1, 11.3, 10.0,  9.9,
        9.4,  9.1,  3.0,  0.0,  8.4,  9.8,  1.0, 62.0, 11.2,  8.9, 11.0, 10.8,
       11.5,  8.8, 10.1, 45.0,  9.5, 43.0,  3.6, 16.1, 66.0, 10.4, 10.5,  9.0,
       64.0, 19.9,  9.7],
      dtype='float64', name='city_emissions')

In [34]:
car_data['country_emissions'].value_counts(dropna=False).index

Index([ nan,  4.2,  3.7,  4.4,  4.5,  3.8,  3.9,  4.1,  4.7,  4.0,  3.5,  4.3,
        3.6,  3.1,  3.3,  4.6,  4.9,  3.4,  4.8,  5.3,  5.1,  5.7,  5.4,  3.2,
        3.0,  5.6,  5.0,  5.2,  6.3,  6.0, 10.0,  5.8,  5.5,  7.7,  6.6,  2.9,
        6.4,  2.8,  0.0,  7.3, 44.0,  6.5,  7.1,  6.7,  7.0, 35.0,  5.9,  6.9,
        7.8, 37.0, 10.3,  7.6, 42.0,  8.6,  6.1,  8.0,  2.0,  1.0],
      dtype='float64', name='country_emissions')

### country_version

In [35]:
car_data['country_version'].value_counts(dropna=False)

country_version
NaN               8333
Germany           4502
Italy             1038
European Union     507
Netherlands        464
Spain              325
Belgium            314
Austria            208
Czech Republic      52
Poland              49
France              38
Denmark             33
Hungary             28
Japan                8
Slovakia             4
Croatia              4
Sweden               3
Romania              2
Bulgaria             2
Luxembourg           1
Switzerland          1
Slovenia             1
Egypt                1
Serbia               1
Name: count, dtype: int64

### entertainment_media

In [36]:
car_data['entertainment_media']

0        'Bluetooth', 'Hands-free equipment', 'On-board...
1        'Bluetooth', 'Hands-free equipment', 'On-board...
2                               'MP3', 'On-board computer'
3        'Bluetooth', 'CD player', 'Hands-free equipmen...
4        'Bluetooth', 'CD player', 'Hands-free equipmen...
                               ...                        
15914    'Bluetooth', 'Digital radio', 'Hands-free equi...
15915    'Bluetooth', 'Digital radio', 'Hands-free equi...
15916    'Bluetooth', 'Hands-free equipment', 'On-board...
15917         'Bluetooth', 'Digital radio', 'Radio', 'USB'
15918                                                'USB'
Name: entertainment_media, Length: 15919, dtype: object

In [37]:
car_data['entertainment_media'] = car_data['entertainment_media'].astype('str').str.replace('[','').str.replace("]",'')

#### This column was not changed as it will be transformed with getdummy function later

### safety_security

In [38]:
car_data['safety_security']

0        ['ABS', 'Central door lock', 'Daytime running ...
1        ['ABS', 'Central door lock', 'Central door loc...
2        ['ABS', 'Central door lock', 'Daytime running ...
3        ['ABS', 'Alarm system', 'Central door lock wit...
4        ['ABS', 'Central door lock', 'Driver-side airb...
                               ...                        
15914    ['ABS', 'Central door lock', 'Central door loc...
15915    ['ABS', 'Adaptive Cruise Control', 'Blind spot...
15916    ['ABS', 'Adaptive Cruise Control', 'Blind spot...
15917    ['ABS', 'Blind spot monitor', 'Driver-side air...
15918    ['ABS', 'Blind spot monitor', 'Daytime running...
Name: safety_security, Length: 15919, dtype: object

#### This column was not changed as it will be transformed with getdummy function later

### comfort_convenience

In [39]:
car_data['comfort_convenience']

0        'Air conditioning', 'Armrest', 'Automatic clim...
1        'Air conditioning', 'Automatic climate control...
2        'Air conditioning', 'Cruise control', 'Electri...
3        'Air suspension', 'Armrest', 'Auxiliary heatin...
4        'Air conditioning', 'Armrest', 'Automatic clim...
                               ...                        
15914    'Air conditioning', 'Automatic climate control...
15915    'Air conditioning', 'Automatic climate control...
15916    'Air conditioning', 'Armrest', 'Automatic clim...
15917    'Air conditioning', 'Automatic climate control...
15918    'Air conditioning', 'Automatic climate control...
Name: comfort_convenience, Length: 15919, dtype: object

### extras

In [40]:
car_data['extras']

0        ['Alloy wheels', 'Catalytic Converter', 'Voice...
1        ['Alloy wheels', 'Sport seats', 'Sport suspens...
2                        ['Alloy wheels', 'Voice Control']
3         ['Alloy wheels', 'Sport seats', 'Voice Control']
4        ['Alloy wheels', 'Sport package', 'Sport suspe...
                               ...                        
15914                     ['Alloy wheels', 'Touch screen']
15915    ['Alloy wheels', 'Touch screen', 'Voice Control']
15916                                     ['Alloy wheels']
15917                     ['Alloy wheels', 'Touch screen']
15918                     ['Alloy wheels', 'Touch screen']
Name: extras, Length: 15919, dtype: object

## Quantitative Columns
- [x] price: Price of cars 
- [] km: km of autos 
- [] hp: horsepower of autos (kW) 
- [] displacement: displacement of autos (cc) 
- [] warranty: warranty period (month) (drop?)
- [] weight: weight of auto (kg) 
- [] nr_of_doors: number of doors 
- [] nr_of_seats : number of seats 
- [] cylinders: number of cylinders 
- [] gears: number of gears

### price

In [41]:
car_data['price'].value_counts(dropna=False)

price
14990    154
15990    151
10990    139
15900    106
17990    102
        ... 
17559      1
17560      1
17570      1
17575      1
39875      1
Name: count, Length: 2956, dtype: int64

In [42]:
car_data['price'].isnull().values.any()

False

### km

In [43]:
car_data['km'].value_counts(dropna=False)

km
10.0       1045
NaN        1024
1.0         367
5.0         170
50.0        148
           ... 
67469.0       1
43197.0       1
10027.0       1
35882.0       1
57.0          1
Name: count, Length: 6690, dtype: int64

In [45]:
fill('median', car_data, 'km', ['make_model', 'body_type'])

Filling column: km
Grouping by: ['make_model', 'body_type']
Number of NaN before filling: 1024
Number of NaN filled: 1022
Number of NaN after filling: 2
------------------
km
10.0       1045
1.0         367
17768.0     205
5.0         170
50.0        148
           ... 
31265.0       1
36020.0       1
53433.0       1
67469.0       1
57.0          1
Name: count, Length: 6701, dtype: int64


### hp

In [30]:
car_data['hp'].value_counts().index

Index([ 85.0,  66.0,  81.0, 100.0, 110.0,  70.0, 125.0,  51.0,  55.0, 118.0,
        92.0, 121.0, 147.0,  77.0,  56.0,  54.0, 103.0,  87.0, 165.0,  88.0,
        60.0, 162.0,  74.0,  96.0,  71.0, 101.0,  67.0, 154.0, 122.0, 119.0,
       164.0, 135.0,  52.0,  82.0,   1.0,  78.0, 146.0, 294.0, 141.0,  57.0,
       120.0, 104.0, 112.0, 191.0, 117.0, 155.0, 184.0,  65.0,  90.0,  76.0,
       168.0,  98.0, 149.0,  80.0,  93.0,  53.0,  86.0, 140.0, 150.0, 228.0,
       270.0, 143.0, 167.0,  40.0,  89.0,  63.0, 127.0, 123.0,  75.0, 115.0,
       195.0, 132.0, 163.0,  84.0,   4.0, 137.0,   9.0,  44.0, 133.0, 239.0],
      dtype='float64', name='hp')

### displacement

In [31]:
car_data['displacement'].value_counts().index

Index([ 1598.0,   999.0,  1398.0,  1399.0,  1229.0,  1956.0,  1461.0,  1490.0,
        1422.0,  1197.0,   898.0,  1395.0,  1968.0,  1149.0,  1618.0,  1798.0,
        1498.0,  1600.0,  1248.0,  1997.0,  1364.0,  1400.0,   998.0,  1500.0,
        2000.0,  1000.0,     1.0,  1998.0,  2480.0,  1200.0,  1984.0,  1397.0,
         899.0,   160.0,   929.0,  1499.0,   997.0,  1596.0,   139.0,   900.0,
        1599.0,  1199.0,  1396.0,  1495.0,  1589.0,  1300.0,     2.0,   995.0,
        1496.0,   890.0,  1580.0,  1995.0,  1333.0,    54.0,  1533.0,  1100.0,
        1350.0, 16000.0,  1856.0,  1568.0,  1896.0,  1584.0,   996.0,  1696.0,
        1686.0, 15898.0,  1368.0,   140.0,   973.0,  1239.0,  1369.0,  1390.0,
         122.0,  1198.0,  1195.0,  2967.0,  1800.0],
      dtype='float64', name='displacement')

### warranty

In [32]:
car_data['warranty'].value_counts(dropna=False)

warranty
NaN     11066
12.0     2594
24.0     1118
60.0      401
36.0      279
48.0      149
6.0       125
72.0       59
3.0        33
23.0       11
18.0       10
20.0        7
25.0        6
2.0         5
50.0        4
26.0        4
16.0        4
4.0         3
1.0         3
19.0        3
34.0        3
13.0        3
28.0        2
22.0        2
14.0        2
11.0        2
46.0        2
21.0        2
9.0         2
17.0        2
45.0        2
33.0        1
40.0        1
65.0        1
10.0        1
15.0        1
7.0         1
8.0         1
56.0        1
49.0        1
47.0        1
30.0        1
Name: count, dtype: int64

### weight

In [33]:
car_data['weight'].value_counts().index

Index([1163.0, 1360.0, 1165.0, 1335.0, 1135.0, 1199.0, 1734.0, 1180.0, 1503.0,
       1350.0,
       ...
       1137.0, 1213.0, 1960.0, 1258.0, 1167.0, 1331.0, 1132.0, 1252.0, 1792.0,
       2037.0],
      dtype='float64', name='weight', length=434)

### nr_of_doors

In [34]:
car_data['nr_of_doors'].value_counts()

nr_of_doors
5.0    11575
4.0     3079
3.0      832
2.0      219
1.0        1
7.0        1
Name: count, dtype: int64

### nr_of_seats

In [35]:
car_data['nr_of_seats'].value_counts()

nr_of_seats
5.0    13336
4.0     1125
7.0      362
2.0      116
6.0        2
3.0        1
Name: count, dtype: int64

### cylinders

In [36]:
car_data['cylinders'].value_counts(dropna=False)

cylinders
4.0    8105
NaN    5680
3.0    2104
5.0      22
6.0       3
8.0       2
2.0       2
1.0       1
Name: count, dtype: int64

### gears

In [37]:
car_data['gears'].value_counts(dropna=False)

gears
6.0     5822
NaN     4712
5.0     3239
7.0     1908
8.0      224
9.0        6
1.0        2
3.0        2
4.0        2
2.0        1
50.0       1
Name: count, dtype: int64

---

In [38]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   make_model           15919 non-null  object 
 1   body_type            15859 non-null  object 
 2   price                15919 non-null  int64  
 3   vat                  11406 non-null  object 
 4   km                   14895 non-null  float64
 5   registration         14322 non-null  object 
 6   hp                   15831 non-null  float64
 7   type                 15917 non-null  object 
 8   previous_owners      9254 non-null   float64
 9   next_inspection      2825 non-null   float64
 10  inspection_new       3570 non-null   object 
 11  warranty             4853 non-null   float64
 12  full_service         8215 non-null   object 
 13  non-smoking_vehicle  7177 non-null   object 
 14  null                 15919 non-null  object 
 15  offer_number         12744 non-null 

In [39]:
# car_data.to_csv("", index=False)