### Introduction

Welcome to Priceless Wheels! In this project, our goal is to build a model that can accurately predict the price of a used vehicle based on various factors such as make, model, year, mileage, and condition. The automobile industry is one of the largest and most competitive industries in the world, with millions of vehicles being sold each year. The price of a vehicle can have a significant impact on a consumer's purchasing decision and it is important for both buyers and sellers to have an understanding of the market value of a vehicle. By using machine learning algorithms and data analysis, we aim to provide a reliable and robust model that can assist in determining the fair market value of a vehicle. Join us on this exciting journey as we delve into the world of vehicle price prediction.

### About the data

This data was scraped from https://www.cardekho.com/. It is meant for research and academic purposes only and is **not meant for commercial use**. The dataset contains about 38,000 used cars listed on CarDekho in India. Download the data from https://github.com/chats-bug/Priceless-Wheels/blob/data-cleaning/data/raw/cardekho_cars_2023_03_19_16_44_14.csv.

---

#### Importing the data and libraries

Let's start by importing the necessary libraries and the data.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import regex as re # for regex matching
import matplotlib.pyplot as plt # plotting
import seaborn as sns # seaborn for nice looking plots
sns.set() # setting seaborn default for plots

# removing scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Next, we will import the data and preview the first 5 rows.

- Note: The ```usedCarSkuId``` is a unique identifier for each car. We will use this column as the index of the dataframe.
- Note: Since we are using GitHub LFS for storing the CSV files, the link mentioned in the notebook will expire after some time. Please visit this [link](https://github.com/chats-bug/Priceless-Wheels/blob/data-cleaning/data/raw/cardekho_cars_2023_03_19_16_44_14.csv) to get the data.

In [2]:
# The data can be found on kaggle link: https://www.kaggle.com/datasets/sukritchatterjee/used-cars-dataset-cardekho

# file_path = "/kaggle/input/used-cars-dataset-cardekho/cars_details_merges.csv"
file_path = '../data/raw/cardekho_cars_2023_03_19_16_44_14.csv'
df = pd.read_csv(file_path, index_col="usedCarSkuId")

# sanity check
df.head()



usedCarSkuId,position,loc,myear,bt,tt,ft,km,ip,pi,images,...,owner_type,price_segment_new,template_name_new,page_template,template_Type_new,experiment,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen
7111bf25-97af-47f9-867b-40879190d800,1,Gomti Nagar,2016,Hatchback,Manual,CNG,69162,0,https://images10.gaadi.com/usedcar_image/origi...,[{'img': 'https://images10.gaadi.com/usedcar_i...,...,first,2lakh-5lakh,used cardetail v2/corporate/13,Used Car > Detail Page,used,control,,,,
c309efc1-efaf-4f82-81ad-dcb38eb36665,2,Borivali West,2015,Hatchback,Manual,CNG,45864,0,https://images10.gaadi.com/usedcar_image/origi...,[{'img': 'https://images10.gaadi.com/usedcar_i...,...,first,2lakh-5lakh,used cardetail v2/corporate/13,Used Car > Detail Page,used,control,Intelligent-Gas Port Injection,11.0:1,,
7609f710-0c97-4f00-9a47-9b9284b62d3a,3,JASOLA,2015,Sedan,Manual,CNG,81506,0,https://images10.gaadi.com/usedcar_image/origi...,[{'img': 'https://images10.gaadi.com/usedcar_i...,...,second,2lakh-5lakh,used cardetail v2/corporate/13,Used Car > Detail Page,used,control,,,,
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,4,jasola,2013,Hatchback,Manual,CNG,115893,0,,[{'img': ''}],...,second,2lakh-5lakh,used cardetail v2/corporate/13,Used Car > Detail Page,used,control,MPFI,,13.0,
b1eab99b-a606-48dd-a75b-57feb8a9ad92,5,mumbai g.p.o.,2022,MUV,Manual,CNG,18900,0,https://images10.gaadi.com/usedcar_image/origi...,[{'img': 'https://images10.gaadi.com/usedcar_i...,...,first,10+lakh,used cardetail v2,Used Car > Detail Page,used,control,,12.0+-.03,,


The data was imported with mixed data types in some columns. We will need to convert the data types as we explore the data.
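To make "mixed data types" concrete, here is a minimal sketch of how such columns could be listed. It runs on toy data; the helper name `mixed_type_columns` and the sample values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def mixed_type_columns(frame: pd.DataFrame) -> dict:
    """Return {column: set of Python type names} for columns holding more than one type."""
    mixed = {}
    for col in frame.columns:
        # Ignore missing values so that NaN alone does not count as a second type
        type_names = set(frame[col].dropna().map(lambda v: type(v).__name__))
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Alloy Wheel Size": [13.0, "r14", None],
    "myear": [2016, 2015, 2013],
})
mixed = mixed_type_columns(toy)
print(mixed)
```

On the real data the same helper could be called as `mixed_type_columns(df)`, although scanning every column of a wide dataframe this way may be slow.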

In [3]:
# checking the shape of the dataset
shape = df.shape

print(f"The shape of the dataframe is {shape[0]} rows and {shape[1]} columns")

The shape of the dataframe is 37814 rows and 139 columns


As we can see, there are 37,814 rows and 139 columns in the dataset.

Let's look at the data types of the columns.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37814 entries, 7111bf25-97af-47f9-867b-40879190d800 to a96fbcd7-c183-4829-ae97-b2581afe4bac
Columns: 139 entries, position to Ground Clearance Unladen
dtypes: bool(3), float64(10), int64(28), object(98)
memory usage: 39.6+ MB


Let's take a look at the feature dictionary provided with the dataset.

In [5]:
# feature_dictionary = pd.read_csv("/kaggle/input/used-cars-dataset-cardekho/feature_dictionary.csv")
feature_dictionary = pd.read_csv("../data/feature_dictionary_raw.csv")
feature_dictionary

Unnamed: 0,Feature,Type,Sample,Missing Values,Unique Values,Description
0,position,int64,1,0,20,Position of the car in the list
1,loc,object,Gomti Nagar,5851,511,Location of the car
2,myear,int64,2016,0,34,Manufacturing year of the car
3,bt,object,Hatchback,19,11,Body type of the car
4,tt,object,Manual,0,2,Transmission type of the car
...,...,...,...,...,...,...
135,experiment,object,control,0,1,control
136,Fuel Suppy System,object,,5502,99,"Type of fuel supply system (Carburetor, Fuel I..."
137,Compression Ratio,object,,27642,100,Compression ratio of the engine
138,Alloy Wheel Size,object,,13146,18,Size of the alloy wheels in inches


Now comes the difficult task: going through the feature dictionary and removing the unwanted columns, including features that are repeated in ways that are not immediately apparent. Note down the columns that make sense to keep and are not repeated, and delete the rest.

In [6]:
columns_to_keep = [
    "loc",
    "myear",
    "bt",
    "tt",
    "ft",
    "km",
    "ip",
    "images",
    "imgCount",
    "threesixty",
    "dvn",
    "oem",
    "model",
    "variantName",
    "city_x",
    "pu",
    "discountValue",
    "utype",
    "carType", 
    "top_features",
    "comfort_features",
    "interior_features",
    "exterior_features",
    "safety_features",
    "Color",
    "Engine Type",
    "Max Power",
    "Max Torque",
    "No of Cylinder",
    "Values per Cylinder",
    "Value Configuration",
    "BoreX Stroke",
    "Turbo Charger",
    "Super Charger",
    "Length",
    "Width",
    "Height",
    "Wheel Base",
    "Front Tread",
    "Rear Tread",
    "Kerb Weight",
    "Gross Weight",
    "Gear Box",
    "Drive Type",
    "Seating Capacity",
    "Steering Type",
    "Turning Radius",
    "Front Brake Type",
    "Rear Brake Type",
    "Top Speed",
    "Acceleration",
    "Tyre Type",
    "No Door Numbers",
    "Cargo Volumn",
    "model_type_new",
    "state",
    "owner_type",
    "exterior_color",
    "Fuel Suppy System",
    "Compression Ratio",
    "Alloy Wheel Size",
    "Ground Clearance Unladen",
]

df.drop([x for x in df.columns if x not in columns_to_keep], axis=1, inplace=True)
print(f"After dropping some unnecessary columns, the dataset now has {df.shape[1]} columns. These columns are hand-picked and will be further analyzed.")

After dropping some unnecessary columns, the dataset now has 62 columns. These columns are hand-picked and will be further analyzed.
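The manual review of repeated features can be supported by a quick automated check. The sketch below only catches columns whose values are exactly identical, so it complements rather than replaces reading the feature dictionary. The helper `find_identical_columns` and the `model_year` column are hypothetical names used for illustration.

```python
import pandas as pd

def find_identical_columns(frame: pd.DataFrame) -> list:
    """Return (first, second) pairs of columns whose values match exactly."""
    pairs = []
    cols = list(frame.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals treats NaN in the same position as equal
            if frame[first].equals(frame[second]):
                pairs.append((first, second))
    return pairs

toy = pd.DataFrame({
    "myear": [2016, 2015, 2013],
    "model_year": [2016, 2015, 2013],   # hypothetical repeat of myear
    "km": [69162, 45864, 115893],
})
pairs = find_identical_columns(toy)
print(pairs)
```

The pairwise comparison is quadratic in the number of columns, which is acceptable for a one-off check on a few hundred columns.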


---

### Data cleaning and wrangling

* Dropping duplicate rows
* Fixing the values and data types of the columns
* Checking for multicollinearity and determining how to handle it
* Dropping irrelevant columns for the model
* Saving the cleaned data for the next part of the project

---

### Duplicates

We know that there should be no duplicates in the data. They are checked for and removed at the time of data collection. *(Trust me, I collected it myself 😉 )*

In [7]:
# checking for duplicate rows
duplicate_rows = df.duplicated().sum()

print(f"The number of duplicate rows is {duplicate_rows}.")

The number of duplicate rows is 1.


Somehow, one duplicate row sneaked in (that's awkward 🫣)! Let's check the rows.

In [8]:
# setting the duplicated index
duplicate_index = df.duplicated(keep=False)

# calling the duplicated index in a dataframe
df.loc[duplicate_index, :].sort_index()

usedCarSkuId,loc,myear,bt,tt,ft,km,ip,images,imgCount,threesixty,...,No Door Numbers,Cargo Volumn,model_type_new,state,exterior_color,owner_type,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen
aa39e640-6183-4379-a517-9f5b2458b2a5,,2014,Hatchback,Manual,Diesel,70000,0,[{'img': ''}],0,False,...,5.0,204-liters,used,uttar pradesh,Silver,second,CRDI,17.6:1,,
ce8b30d8-c438-4a2f-bce1-5c1887a95495,,2014,Hatchback,Manual,Diesel,70000,0,[{'img': ''}],0,False,...,5.0,204-liters,used,uttar pradesh,Silver,second,CRDI,17.6:1,,


As we can see, the rows are exactly the same, so we can safely drop one of them. The two rows have different `usedCarSkuId` index values, but `drop_duplicates()` compares only the column values and keeps the first occurrence.

In [9]:
# making a copy with the duplicated rows dropped
df2 = df.drop_duplicates().copy()

# checking for duplicate rows in the new dataframe
dup = df2.duplicated().sum()

print(f"The number of duplicate rows is {dup}.")

The number of duplicate rows is 0.
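The following toy example is a small sketch of why two rows with different IDs can still count as duplicates: `duplicated()` and `drop_duplicates()` look at column values only. The index labels `id-a`, `id-b`, and `id-c` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"myear": [2014, 2014, 2016], "km": [70000, 70000, 69162]},
    index=["id-a", "id-b", "id-c"],  # hypothetical IDs
)

# duplicated() looks at column values only, so differing index labels do not matter
flags = toy.duplicated().tolist()
print(flags)

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()
print(deduped.index.tolist())

# To treat the index as part of the row identity, move it into a column first
flags_with_index = toy.reset_index().duplicated().tolist()
print(flags_with_index)
```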


In [10]:
# sanity check
df2.head()

usedCarSkuId,loc,myear,bt,tt,ft,km,ip,images,imgCount,threesixty,...,No Door Numbers,Cargo Volumn,model_type_new,state,exterior_color,owner_type,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen
7111bf25-97af-47f9-867b-40879190d800,Gomti Nagar,2016,Hatchback,Manual,CNG,69162,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,5.0,180-liters,used,uttar pradesh,Silver,first,,,,
c309efc1-efaf-4f82-81ad-dcb38eb36665,Borivali West,2015,Hatchback,Manual,CNG,45864,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,5.0,235-litres,used,maharashtra,Grey,first,Intelligent-Gas Port Injection,11.0:1,,
7609f710-0c97-4f00-9a47-9b9284b62d3a,JASOLA,2015,Sedan,Manual,CNG,81506,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,4.0,400-litres,used,delhi,Silver,second,,,,
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,jasola,2013,Hatchback,Manual,CNG,115893,0,[{'img': ''}],0,False,...,4.0,,used,delhi,Silver,second,MPFI,,13.0,
b1eab99b-a606-48dd-a75b-57feb8a9ad92,mumbai g.p.o.,2022,MUV,Manual,CNG,18900,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,6,False,...,5.0,,used,maharashtra,White,first,,12.0+-.03,,


Great! Now that we have removed the duplicate rows, let's move on to the data types.

---

### Fixing the data types

Let's look at the data types of the columns. In the process, we can check whether any of them need cleaning to make them more useful, such as converting them to a numeric type.

*When we do feature engineering we will further modify these columns and might also drop a few of them*

We already know about the current data types of the features from the ```feature_dictionary```.

In [11]:
# Filter the feature dictionary to only include the features in columns_to_keep
new_feature_dict = feature_dictionary[feature_dictionary['Feature'].isin(columns_to_keep)].copy()

new_feature_dict

Unnamed: 0,Feature,Type,Sample,Missing Values,Unique Values,Description
1,loc,object,Gomti Nagar,5851,511,Location of the car
2,myear,int64,2016,0,34,Manufacturing year of the car
3,bt,object,Hatchback,19,11,Body type of the car
4,tt,object,Manual,0,2,Transmission type of the car
5,ft,object,CNG,0,5,Fuel type of the car
...,...,...,...,...,...,...
130,owner_type,object,first,0,6,"Owner type of the car (first, second, etc.)"
136,Fuel Suppy System,object,,5502,99,"Type of fuel supply system (Carburetor, Fuel I..."
137,Compression Ratio,object,,27642,100,Compression ratio of the engine
138,Alloy Wheel Size,object,,13146,18,Size of the alloy wheels in inches


We know that if a column has the ```object``` type, it could hold mixed types as well as strings. Let's try to convert such columns to a numerical type wherever we encounter a measurement stored as text, for example a `Length` value like '12mm'.

In [12]:
# A helper function to print all the types in a column for a dataframe
def print_samples_from_every_type(df: pd.DataFrame, col: str):
    # Get all the types in the column.
    col_types = df[col].apply(type)
    for t in col_types.unique():
        # Print up to 10 samples from each type (fewer if the type is rare).
        subset = df.loc[col_types == t, col]
        print(f"Samples for type {t}: ")
        print(subset.sample(min(10, len(subset))))
        print()
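As an illustration of the kind of conversion planned for values such as '12mm', here is a minimal sketch that pulls the first number out of each string. The helper name `extract_number` and the toy values are assumptions; the real columns may need more specific patterns.

```python
import pandas as pd

def extract_number(series: pd.Series) -> pd.Series:
    """Pull the first integer or decimal number out of each string; non-matches become NaN."""
    extracted = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")

# Toy values in the style of this dataset (for example a length given as '3599mm')
lengths = pd.Series(["3599mm", "1495 mm", "4.6 metres", None])
numeric = extract_number(lengths)
print(numeric.tolist())
```

Note that this keeps only the first number, so a value with two numbers (such as a bore and stroke pair) would need a dedicated pattern.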

In [13]:
# Make a copy of the original dataframe so that we can roll back to it anytime.
cars_df = df2.copy()

---

Let's list down all the columns which need the standard string treatment: 
1. Convert column to be a string type
2. Strip the whitespace 
3. Make it lowercase

*Here we also give the standard treatment to some of the other columns that we wish to convert later (it will help us later).*
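The three steps can be sketched on a toy series as follows. In this notebook's outputs, missing values show up as the literal text `nan` after the string conversion, so the sketch also shows one possible way to turn such placeholders back into real missing values. The `placeholders` list is a hypothetical choice, not something defined in the notebook.

```python
import pandas as pd

# Toy values; the last entry mimics a missing value that has already become text
raw = pd.Series(["  Hatchback ", "SEDAN", "nan"], dtype=object)

# Steps 1-3: cast to string, strip whitespace, lowercase
treated = raw.astype(str).str.strip().str.lower()
print(treated.tolist())

# Optional follow-up: mask placeholder strings so they count as missing again
placeholders = ["nan", "undefined"]   # hypothetical list; extend as needed
restored = treated.mask(treated.isin(placeholders))
print(restored.isna().sum())
```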

In [14]:
columns_str_to_lower = [
    'loc',
    'bt',
    'ft',
    'tt',
    'images',
    'dvn',
    'oem',
    'model',
    'variantName',
    'city_x',
    # 'pu': 3,70,000 -> 370000
    'utype',
    'carType',
    'top_features',
    'comfort_features',
    'interior_features',
    'exterior_features',
    'safety_features',
    'Color',
    'Engine Type',
    'Max Power', # Max Power : 33.54bhp@4000 rpm -> 2 columns 33.54, 4000
    'Max Torque', # Max Torque : 40.2Nm@3500 rpm -> 2 columns 40.2, 3500
    # No of Cylinder : float -> int,
    # Values per Cylinder : float -> int, !! Fix the name -> Valve per Cylinder
    'Value Configuration', # !! Fix the name -> Valve Configuration
    'BoreX Stroke', # BoreX Stroke : 69 x 72 mm -> Bore: 69, Stroke: 72
    'Turbo Charger', # Convert to boolean
    'Super Charger', # Convert to boolean
    'Length', # Length : 3599mm -> 3599
    'Width', # Width : 1495mm -> 1495
    'Height', # Height : 1700mm -> 1700
    'Wheel Base', # Wheel Base : 2400mm -> 2400
    'Front Tread', # Front Tread : 1295mm -> 1295
    'Rear Tread', # Rear Tread : 1295mm -> 1295
    'Kerb Weight', # Kerb Weight : 960kg -> 960
    'Gross Weight', # Gross Weight : 1350kg -> 1350
    'Gear Box', # might need some additional cleaning
    'Drive Type', # might need some additional cleaning
    # Seating Capacity : float -> int
    'Steering Type', # might need some additional cleaning
    'Turning Radius', # Turning Radius : 4.6 metres -> 4.6
    'Front Brake Type', # might need some additional cleaning
    'Rear Brake Type', # might need some additional cleaning
    'Top Speed', # Top Speed : 137 kmph -> 137
    'Acceleration', # Acceleration : 13.5 seconds -> 13.5
    'Tyre Type', # might need some additional cleaning
    # No Door Numbers : float -> int !! fix name
    'Cargo Volumn', # 'Cargo Volumn' : 300 litres -> 300, !! fix name
    'model_type_new', # !! fix name
    'state',
    'exterior_color',
    'owner_type', # might need some additional cleaning
    'Fuel Suppy System', # might need some additional cleaning
    # Compression Ratio : 10.0:1 -> 10.0
    'Alloy Wheel Size', # Alloy Wheel Size : convert to float
    'Ground Clearance Unladen', # Ground Clearance Unladen : 170mm -> 170
]

In [15]:
# Apply the changes
temp_arr = []

for col in columns_str_to_lower:
    # After astype(str), missing values appear as the literal string 'nan'
    cars_df[col] = cars_df[col].astype(str).str.strip().str.lower()

    # Pick one non-missing value (if there is any) as a readable sample
    non_missing = cars_df.loc[cars_df[col] != 'nan', col]
    sample = non_missing.sample(1).iloc[0] if not non_missing.empty else 'nan'

    temp_arr.append({
        "Feature": col,
        "Type": f'{cars_df[col].dtype}',
        "Unique": cars_df[col].nunique(),
        "Sample": sample,
    })

pd.DataFrame(temp_arr)

Unnamed: 0,Feature,Type,Unique,Sample
0,loc,object,398,bangalore city
1,bt,object,12,sedan
2,ft,object,5,petrol
3,tt,object,2,manual
4,images,object,37135,[{'img': 'https://images10.gaadi.com/usedcar_i...
5,dvn,object,4128,honda amaze s i-vtech
6,oem,object,46,maruti
7,model,object,382,tata indica ev2
8,variantName,object,3430,sx ivt
9,city_x,object,617,rajkot


There is an additional issue here that needs to be addressed: 
Some of the columns have **more unique values** than expected. For example, `Turbo Charger` should be either a yes, a no, or a NaN, but there are `5` unique values. The following columns are suspected to have a higher than expected cardinality due to data quality issues: 
1. `Value Configuration`
2. `Turbo Charger`
3. `Gear Box`
4. `Drive Type`
5. `Steering Type`
6. `Front Brake Type`
7. `Rear Brake Type`
8. `Tyre Type`


In [16]:
# Make a list of such columns which seems to have a higher than expected cardinality
columns_unexpected_cardinality = [
    'Value Configuration',
    'Turbo Charger',
    'Gear Box',
    'Drive Type',
    'Steering Type',
    'Front Brake Type',
    'Rear Brake Type',
    'Tyre Type',
]
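One way to arrive at such a list is to compare the number of distinct values per column with a rough expectation. The sketch below uses toy data, and the `expected_max` limits are hypothetical numbers chosen for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Turbo Charger": ["no", "yes", "nan", "twin", "turbo", "no"],
    "tt": ["manual", "manual", "automatic", "manual", "manual", "automatic"],
})

# Hypothetical expectations: the maximum number of categories we would accept per column
expected_max = {"Turbo Charger": 3, "tt": 2}

cardinality = toy.nunique()
suspicious = [col for col, limit in expected_max.items() if cardinality[col] > limit]
print(suspicious)
```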

---

##### `Value Configuration`

In [17]:
cars_df[columns_unexpected_cardinality[0]].value_counts()

dohc                    23502
nan                      7831
sohc                     6200
undefined                 121
idsi                       66
dohc with vis              31
dohc with vgt              20
16 modules 48 cells        11
16-valve dohc layout        9
dohc with tis               9
vtec                        6
mpfi                        4
ohv / pushrod               3
Name: Value Configuration, dtype: int64

We have identified the following issues:
1. As we can see, the majority of the values are `dohc`. There are a few variants of `dohc`, such as `dohc with vis`, but they are so rare that it is practical to replace every value containing a `dohc` variant with simply `dohc`.
2. There are some `undefined` values which should be `NaN`.
3. `mpfi` is a fuel injection system, not a valve configuration, so replace it with `NaN`.
4. `vtec` engines can be either `dohc` or `sohc` (`vtec` is a variable valve timing system introduced by Honda in the late 1980s). It is best to replace this value with `NaN` and deal with it during null value handling (there are only `6` such rows, so this should be fine).

In [18]:
# Replace all the variants of `dohc` to simply `dohc`
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('dohc with vis', 'dohc')
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('dohc with vgt', 'dohc')
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('16-valve dohc layout', 'dohc')
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('dohc with tis', 'dohc')

# Replace `undefined`, `mpfi`, `vtec` with the string 'nan' (missing values are stored as text at this stage)
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('undefined', 'nan')
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('mpfi', 'nan')
cars_df['Value Configuration'] = cars_df['Value Configuration'].str.replace('vtec', 'nan')

cars_df['Value Configuration'].value_counts()

dohc                   23571
nan                     7962
sohc                    6200
idsi                      66
16 modules 48 cells       11
ohv / pushrod              3
Name: Value Configuration, dtype: int64
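An alternative way to express the same cleanup, shown here only as a sketch on toy values, is to collapse every value that mentions `dohc` with a boolean mask and to use exact-match replacement for the values treated as missing. This is not the code used in this notebook.

```python
import pandas as pd

values = pd.Series([
    "dohc", "dohc with vis", "16-valve dohc layout", "sohc", "undefined", "vtec",
])

# Collapse every value that mentions 'dohc' into the single category 'dohc'
is_dohc = values.str.contains("dohc", regex=False)
cleaned = values.mask(is_dohc, "dohc")

# Exact-match replacement for the values treated as missing
cleaned = cleaned.replace({"undefined": "nan", "mpfi": "nan", "vtec": "nan"})
print(cleaned.tolist())
```

The mask approach does not need one line per variant, at the cost of also catching any future value that merely contains the text `dohc`.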

---

##### `Turbo Charger`

In [19]:
cars_df[columns_unexpected_cardinality[1]].value_counts()

no       20247
yes      15262
nan       2176
twin       125
turbo        3
Name: Turbo Charger, dtype: int64

Well, cleaning this column should be quite straightforward.
1. `twin` and `turbo` should be replaced by `yes`

In [20]:
# Replace `twin` and `turbo` with `yes`
cars_df['Turbo Charger'] = cars_df['Turbo Charger'].str.replace('twin', 'yes')
cars_df['Turbo Charger'] = cars_df['Turbo Charger'].str.replace('turbo', 'yes')

cars_df['Turbo Charger'].value_counts()

no     20247
yes    15390
nan     2176
Name: Turbo Charger, dtype: int64
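Since this column is meant to become a boolean later, the following sketch shows one possible conversion on toy values. It assumes pandas' nullable `boolean` dtype is acceptable for the downstream steps, which the notebook does not state.

```python
import pandas as pd

turbo = pd.Series(["no", "yes", "nan", "twin", "turbo"])

# Same consolidation as above, then map to True/False; anything unmapped becomes missing
as_bool = (
    turbo.replace({"twin": "yes", "turbo": "yes"})
    .map({"yes": True, "no": False})
    .astype("boolean")   # pandas' nullable boolean dtype
)
print(as_bool.tolist())
```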

--- 

##### `Gear Box`

In [21]:
cars_df[columns_unexpected_cardinality[2]].value_counts()

5 speed                  22840
6 speed                   5876
5-speed                   1457
7 speed                   1256
6-speed                   1199
                         ...  
six speed manual             1
7g-dct                       1
amg speedshift dct 8g        1
amg 7-speed dct              1
ecvt                         1
Name: Gear Box, Length: 106, dtype: int64

Here the cardinality is high, and several categories can evidently be lumped together into a single one. Let's define a mapping to clearly indicate which values should be lumped together.

In [22]:
gear_box_mapping = {}
gear_box_mapping['1 speed'] = [
	'single speed', 
	'single speed automatic', 
	'single speed reduction gear', 
	'single-speed transmission', 
]
gear_box_mapping['4 speed'] = [
	'4 speed', 
	'4-speed', 
]
gear_box_mapping['5 speed'] = [
	'5', 
	'5 - speed', 
	'5 gears', 
	'5 manual', 
	'5 speed', 
	'5 speed at+ paddle shifters', 
	'5 speed cvt', 
	'5 speed forward, 1 reverse', 
	'5 speed manual', 
	'5 speed manual transmission', 
	'5 speed+1(r)', 
	'5 speed,5 forward, 1 reverse', 
	'5-speed', 
	'5-speed`', 
	'five speed', 
	'five speed manual', 
	'five speed manual transmission', 
	'five speed manual transmission gearbox', 
]
gear_box_mapping['6 speed'] = [
	'6', 
	'6 speed', 
	'6 speed at', 
	'6 speed automatic', 
	'6 speed geartronic', 
	'6 speed imt', 
	'6 speed ivt', 
	'6 speed mt', 
	'6 speed with sequential shift', 
	'6-speed', 
	'6-speed at', 
	'6-speed automatic', 
	'6-speed autoshift', 
	'6-speed cvt', 
	'6-speed dct', 
	'6-speed imt', 
	'6-speed ivt', 
	'6-speed`', 
	'six speed  gearbox', 
	'six speed automatic gearbox', 
	'six speed automatic transmission', 
	'six speed geartronic, six speed automati', 
	'six speed manual', 
	'six speed manual transmission', 
	'six speed manual with paddle shifter', 
]
gear_box_mapping['7 speed'] = [
	'7 speed', 
	'7 speed 7g-dct', 
	'7 speed 9g-tronic automatic', 
	'7 speed cvt', 
	'7 speed dct', 
	'7 speed dsg', 
	'7 speed dual clutch transmission', 
	'7 speed s tronic', 
	'7-speed', 
	'7-speed dct', 
	'7-speed dsg', 
	'7-speed pdk', 
	'7-speed s tronic', 
	'7-speed s-tronic', 
	'7-speed steptronic', 
	'7-speed stronic', 
	'7g dct 7-speed dual clutch transmission', 
	'7g-dct', 
	'7g-tronic automatic transmission',
	'amg 7-speed dct',	
	'mercedes benz 7 speed automatic',
]
gear_box_mapping['8 speed'] = [
	'8', 
	'8 speed', 
	'8 speed cvt', 
	'8 speed multitronic', 
	'8 speed sport', 
	'8 speed tip tronic s', 
	'8 speed tiptronic', 
	'8-speed', 
	'8-speed automatic', 
	'8-speed automatic transmission', 
	'8-speed dct', 
	'8-speed steptronic', 
	'8-speed steptronic sport automatic transmission', 
	'8-speed tiptronic', 
	'8speed',
	'amg speedshift dct 8g', 
]
gear_box_mapping['9 speed'] = [
	'9 -speed', 
	'9 speed', 
	'9 speed tronic', 
	'9-speed', 
	'9-speed automatic', 
	'9g tronic', 
	'9g-tronic', 
	'9g-tronic automatic', 
	'amg speedshift 9g tct automatic',
]
gear_box_mapping['10 speed'] = [
	'10 speed', 
]
gear_box_mapping['cvt'] = [
	'cvt', 
	'e-cvt', 
	'ecvt', 
]
gear_box_mapping['direct drive'] = [
	'direct drive', 
]
gear_box_mapping['fully automatic'] = [
	'automatic transmission', 
	'fully automatic',
]
gear_box_mapping['nan'] = [
	'nan',
	'ags', 
	'imt', 
	'ivt', 
]

mapping_dict = {v: k for k, lst in gear_box_mapping.items() for v in lst}
cars_df['Gear Box'] = cars_df['Gear Box'].replace(mapping_dict)

cars_df[columns_unexpected_cardinality[2]].value_counts()

5 speed            25203
6 speed             7459
7 speed             1842
8 speed             1150
4 speed              760
cvt                  474
nan                  471
9 speed              399
fully automatic       20
1 speed               15
direct drive          11
10 speed               9
Name: Gear Box, dtype: int64
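A hand-written mapping is explicit but long. As a complementary sketch, many of the labels above follow a "number plus speed" pattern that a regular expression can capture. The helper `gear_count` and the `WORD_TO_DIGIT` table are hypothetical, and labels the pattern cannot read (such as `cvt` or `7g-dct`) would still need manual mapping.

```python
import re

# Hypothetical lookup for spelled-out gear counts
WORD_TO_DIGIT = {"single": "1", "four": "4", "five": "5", "six": "6", "seven": "7"}

def gear_count(label: str):
    """Return e.g. '6 speed' for labels such as '6-speed at' or 'six speed manual', else None."""
    text = label.lower().strip()
    for word, digit in WORD_TO_DIGIT.items():
        text = re.sub(rf"\b{word}\b", digit, text)
    # A number, optional spaces or hyphen, then 'speed' or 'gear(s)'
    match = re.search(r"\b(\d{1,2})\s*-?\s*(?:speed|gears?)\b", text)
    if match is None:
        # Also accept a label that is just a bare number, such as '5'
        match = re.fullmatch(r"(\d{1,2})", text)
    return f"{match.group(1)} speed" if match else None

for label in ["6-speed at", "six speed manual", "8speed", "5", "cvt"]:
    print(label, "->", gear_count(label))
```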

--- 

##### `Drive Type`

In [23]:
cars_df[columns_unexpected_cardinality[3]].value_counts()

fwd                                         27456
nan                                          4496
rwd                                          2248
awd                                          1082
2wd                                           648
4wd                                           570
2 wd                                          369
4x2                                           297
4x4                                           229
front wheel drive                             176
two wheel drive                                98
all wheel drive                                32
rear wheel drive with esp                      29
two whhel drive                                26
permanent all-wheel drive quattro              21
rwd(with mtt)                                  14
rear-wheel drive with esp                       7
4 wd                                            7
all-wheel drive with electronic traction        5
four whell drive                                2


We will once again take the same approach and define a mapping since many categories are redundant here.

In [24]:
drive_type_mapping = {}
drive_type_mapping['fwd'] = ['fwd', 'front wheel drive']
drive_type_mapping['2wd'] = ['2wd', 'two wheel drive', '2 wd', 'two whhel drive']
drive_type_mapping['rwd'] = ['rwd', 'rear wheel drive with esp', 'rear-wheel drive with esp', 'rwd(with mtt)']
drive_type_mapping['awd'] = ['awd', 'all wheel drive', 'all-wheel drive with electronic traction', 'permanent all-wheel drive quattro']
drive_type_mapping['4wd'] = ['4wd', '4 wd', '4x4', 'four whell drive']
drive_type_mapping['nan'] = ['nan', '3']

mapping_dict = {v: k for k, lst in drive_type_mapping.items() for v in lst}
cars_df['Drive Type'] = cars_df['Drive Type'].replace(mapping_dict)

cars_df[columns_unexpected_cardinality[3]].value_counts()

fwd    27632
nan     4497
rwd     2298
2wd     1141
awd     1140
4wd      808
4x2      297
Name: Drive Type, dtype: int64
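When a mapping is written by hand, it is easy to miss a category; `4x2` in the output above is one value that no group covers. The sketch below shows one way to list such leftovers before or after applying a mapping. The helper names `invert_groups` and `unmapped_values` are hypothetical, and the mapping is a shortened copy for illustration.

```python
import pandas as pd

def invert_groups(groups: dict) -> dict:
    """Turn {target: [raw values]} into {raw value: target}."""
    return {raw: target for target, raws in groups.items() for raw in raws}

def unmapped_values(series: pd.Series, groups: dict) -> list:
    """List distinct values that are neither a raw value nor a target in the mapping."""
    known = set(invert_groups(groups)) | set(groups)
    return sorted(set(series.unique()) - known)

# Shortened copy of the drive type mapping, for illustration only
groups = {"fwd": ["fwd", "front wheel drive"], "4wd": ["4wd", "4x4"], "nan": ["nan"]}
drive = pd.Series(["fwd", "front wheel drive", "4x4", "4x2", "nan"])

leftovers = unmapped_values(drive, groups)
print(leftovers)

cleaned = drive.replace(invert_groups(groups))
print(cleaned.tolist())
```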

---

##### `Steering Type`

In [25]:
# Steering Type
cars_df[columns_unexpected_cardinality[4]].value_counts()

power         31920
electric       3895
nan             808
manual          652
electronic      344
electrical      138
epas             49
hydraulic         5
mt                1
motor             1
Name: Steering Type, dtype: int64

There are 2 kinds of steering types (broadly) - 
1. Power (electric)
2. Manual (hydraulic, etc.)

In [26]:
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('electrical', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('electric', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('electronic', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('epas', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('mt', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('motor', 'power')
cars_df['Steering Type'] = cars_df['Steering Type'].str.replace('hydraulic', 'manual')

cars_df[columns_unexpected_cardinality[4]].value_counts()

power     36348
nan         808
manual      657
Name: Steering Type, dtype: int64
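The order of the `str.replace` calls above matters because `str.replace` works on substrings. The toy sketch below illustrates the pitfall and shows exact-match replacement with `Series.replace` as one possible alternative; it is not the code used in this notebook.

```python
import pandas as pd

steering = pd.Series(["electric", "electrical", "mt", "hydraulic", "power"])

# str.replace works on substrings: replacing 'electric' first turns 'electrical' into 'poweral'
substring = steering.str.replace("electric", "power", regex=False)
print(substring.tolist())

# Series.replace with a dict matches whole values only, so the order no longer matters
exact = steering.replace({
    "electric": "power", "electrical": "power", "mt": "power", "hydraulic": "manual",
})
print(exact.tolist())
```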

---

##### `Front Brake Type` and `Rear Brake Type`

In [27]:
cars_df[columns_unexpected_cardinality[5]].value_counts()

disc                                        22642
ventilated disc                             13178
ventilated discs                              400
solid disc                                    398
nan                                           327
disc & caliper type                           296
disk                                          205
ventilated disk                                78
drum                                           67
multilateral disc                              42
vantilated disc                                39
disc, 236 mm                                   22
ventlated disc                                 17
vacuum assisted hydraulic dual circuit w       12
ventillated disc                               11
disc & drum                                    10
disc,internally ventilated                      9
electric parking brake                          8
264mm ventilated discs                          8
caliper ventilated disc                         6


In [28]:
cars_df[columns_unexpected_cardinality[6]].value_counts()

drum                                        29260
disc                                         5272
ventilated disc                               757
solid disc                                    612
nan                                           326
self-adjusting drum                           322
discs                                         301
disc & caliper type                           296
ventilated discs                              178
leading-trailing drum                         134
leading & trailing drum                        75
ventilated drum                                46
drums                                          40
self adjusting drum                            33
disc & drum                                    32
self adjusting drums                           31
drums 180 mm                                   22
drum in discs                                  17
vacuum assisted hydraulic dual circuit w       12
262mm disc & drum combination                   8


In [29]:
brake_type_mapping = {}
brake_type_mapping['disc'] = [
	'disc',
	'260mm discs',
	'disc brakes',
	'disc, 236 mm',
	'discs',
	'disk',
	'multilateral disc',
	'solid disc',
	'electric parking brake',
	'abs',
]
brake_type_mapping['ventilated disc'] = [
	'264mm ventilated discs',
	'booster assisted ventilated disc',
	'caliper ventilated disc',
	'disc brakes with inner cooling',
	'disc,internally ventilated',
	'vantilated disc',
	'ventilated & grooved steel discs',
	'ventilated disc',
	'ventilated disc with twin pot caliper',
	'ventilated discs',
	'ventilated disk',
	'ventillated disc',
	'ventillated discs',
	'ventlated disc',
	'ventilated drum in discs',
	'ventialte disc',
	'ventialted disc',
]
brake_type_mapping['carbon ceramic'] = [
	'carbon ceramic brakes',
	'carbon ceramic brakes.',
]
brake_type_mapping['disc & drum'] = [
	'disc & drum',
	'228.6 mm dia, drums on rear wheels',
	'262mm disc & drum combination',
	'drum in disc',
	'drum in discs',
]
brake_type_mapping['drum'] = [
	'drum',
	'203mm drums',
	'drum`',
	'drums',
	'drums 180 mm',
	'booster assisted drum',
	'drum brakes',
	'leading & trailing drum',
	'leading-trailing drum',
	'self adjusting drum',
	'self adjusting drums',
	'self-adjusting drum',
	'single piston sliding fist',
	'ventilated drum',
	'tandem master cylinder with servo assist',

]
brake_type_mapping['caliper'] = [
	'six piston claipers',
	'twin piston sliding fist caliper',
	'vacuum assisted hydraulic dual circuit w',
	'four piston calipers',
	'disc & caliper type',
]

In [30]:
mapping_dict = {v: k for k, lst in brake_type_mapping.items() for v in lst}
cars_df['Front Brake Type'] = cars_df['Front Brake Type'].replace(mapping_dict)
cars_df['Rear Brake Type'] = cars_df['Rear Brake Type'].replace(mapping_dict)

In [31]:
cars_df[columns_unexpected_cardinality[5]].value_counts()

disc               23334
ventilated disc    13759
nan                  327
caliper              312
drum                  70
disc & drum           10
carbon ceramic         1
Name: Front Brake Type, dtype: int64

In [32]:
cars_df[columns_unexpected_cardinality[6]].value_counts()

drum               29973
disc                6195
ventilated disc      944
nan                  326
caliper              310
disc & drum           64
carbon ceramic         1
Name: Rear Brake Type, dtype: int64
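
Hand-written mappings like `brake_type_mapping` are easy to get subtly wrong: a variant can end up under two canonical labels, or a raw value can be missed entirely. The sketch below shows one way to check for both. `invert_mapping` and `find_unmapped` are hypothetical helpers that are not part of the original notebook, and the toy mapping and values are made up for illustration.

```python
def invert_mapping(mapping):
    """Turn {canonical: [variants]} into {variant: canonical}, rejecting duplicates."""
    inverted = {}
    for canonical, variants in mapping.items():
        for variant in variants:
            if variant in inverted and inverted[variant] != canonical:
                raise ValueError(
                    f"{variant!r} is listed under both "
                    f"{inverted[variant]!r} and {canonical!r}"
                )
            inverted[variant] = canonical
    return inverted


def find_unmapped(values, mapping):
    """Return the raw values that no canonical label covers, sorted."""
    known = set(mapping) | {v for lst in mapping.values() for v in lst}
    return sorted(set(values) - known)


# Toy data for illustration only
toy_mapping = {
    'disc': ['disc', 'discs', 'disk'],
    'drum': ['drum', 'drums'],
}
toy_values = ['disc', 'drums', 'disk', 'carbon ceramic brakes']

lookup = invert_mapping(toy_mapping)
unmapped = find_unmapped(toy_values, toy_mapping)
print(lookup['disk'])   # disc
print(unmapped)         # ['carbon ceramic brakes']
```

In the notebook, the values to check would presumably come from something like `cars_df['Front Brake Type'].unique()`.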

---

##### `Tyre Type`

In [33]:
cars_df[columns_unexpected_cardinality[7]].value_counts()

tubeless,radial               21770
tubeless                       5723
tubeless, radial               4994
tubeless tyres                 2408
radial                          606
radial, tubeless                389
tubeless radial tyres           301
radial, tubless                 296
nan                             257
tubeless tyres, radial          224
tubeless,radials                212
radial,tubeless                 158
tubless, radial                  92
runflat                          69
tubeless tyre                    50
tubeless, runflat                48
run-flat                         47
runflat tyres                    41
radial tubeless                  33
tubeless,runflat                 31
tubeless radial                  17
runflat tyre                     16
tubeless, radials                 7
radial with tube                  4
tubeless. runflat                 3
tubeless radial tyrees            3
tubeless tyres mud terrain        3
radial tyres                

In [34]:
tyre_type_mapping = {}
tyre_type_mapping['tubeless'] = [
	'tubeless tyres',
	'tubeless',
	'tubeless tyres mud terrain',
	'tubeless tyre',
]
tyre_type_mapping['tubeless radial'] = [
	'tubeless, radial',
	'tubeless,radial',
	'tubeless tyres, radial',
	'tubeless radial tyres',
	'radial, tubeless',
	'radial',
	'tubless, radial',
	'radial tubeless',
	'tubeless radial',
	'tubeless,radials',
	'tubeless radials',
	'radial,tubeless',
	'tubeless radial tyre',
	'radial, tubless',
	'tubless radial tyrees',
	'tubeless , radial',
	'tubeless, radials',
	'radial tyres',
	'tubeless radial tyrees',
]
tyre_type_mapping['runflat'] = [
	'runflat tyres',
	'runflat',
	'tubeless,runflat',
	'run-flat',
	'runflat tyre',
	'tubeless, runflat',
	'tubeless. runflat',
	'tubeless.runflat',
]
tyre_type_mapping['tube'] = [
	'radial with tube',
]

In [35]:
mapping_dict = {v: k for k, lst in tyre_type_mapping.items() for v in lst}
cars_df['Tyre Type'] = cars_df['Tyre Type'].replace(mapping_dict)

cars_df[columns_unexpected_cardinality[7]].value_counts()

tubeless radial    29111
tubeless            8184
runflat              257
nan                  257
tube                   4
Name: Tyre Type, dtype: int64
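
Listing every spelling by hand works, but many of the variants differ only in punctuation or typos. As an alternative, here is a minimal keyword-based sketch. `normalize_tyre_type` is a hypothetical helper, and the rule that a plain `radial` counts as `tubeless radial` is an assumption carried over from the mapping above.

```python
def normalize_tyre_type(raw):
    """Map a raw tyre description to a coarse label using keyword checks."""
    text = str(raw).lower()
    if text == 'nan':
        return 'nan'               # keep the missing-value marker as-is
    if 'run' in text and 'flat' in text:
        return 'runflat'
    if 'with tube' in text:
        return 'tube'
    if 'radial' in text:
        return 'tubeless radial'   # assumption: plain 'radial' is treated as tubeless radial
    if 'tub' in text:              # matches 'tubeless' and the 'tubless' misspelling
        return 'tubeless'
    return text                    # leave anything unrecognised untouched


samples = ['tubeless,radial', 'run-flat', 'radial with tube',
           'tubless, radial', 'tubeless tyres mud terrain']
print([normalize_tyre_type(s) for s in samples])
# ['tubeless radial', 'runflat', 'tube', 'tubeless radial', 'tubeless']
```

The order of the checks matters: `radial with tube` has to be caught before the generic `radial` rule.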

---

##### `Fuel Injection System`

Although we expect high cardinality from this feature, it is still worth checking whether we can reduce the number of categories at the cost of some detail.

(This step is optional and can be skipped. The mapping is approximate and reflects my best understanding; yours could differ.)

In [36]:
cars_df['Fuel Suppy System'].value_counts()

mpfi                                      13443
crdi                                       9165
nan                                        5502
direct injection                           3831
pgm-fi                                      882
                                          ...  
intake port(multi-point)                      1
mfi                                           1
distributor-type diesel fuel injection        1
tfsi                                          1
pgm-fi (programmed fuel inject                1
Name: Fuel Suppy System, Length: 71, dtype: int64

In [37]:
fuel_injection_mapping = {
    "Gasoline Port Injection": [
    	"intelligent-gas port injection", 
    	"i-gpi",
    	"dohc",
    	"pfi"
	],
    "Multi-Point Fuel Injection": [
    	"mpfi", 
    	"multi-point injection", 
    	"mpfi+lpg", 
    	"mpfi+cng", 
    	"multipoint injection", 
    	"smpi", 
    	"mpi",
    	"multi point fuel injection",
    	"dpfi",
    	"mfi",
    	"multi point injection",
    	"msds",
    	"cng"
    ],
    "Electronic Fuel Injection": [
	    "efi(electronic fuel injection)", 
	    "efi", 
	    "efi (electronic fuel injection)", 
	    "efic", 
	    "electronic fuel injection", 
	    "electronically controlled injection", 
	    "electronic injection system", 
	    "sefi",
	    "egis",
	    "efi (electronic fuel injection",
        "efi -electronic fuel injection",
    ],
    "Direct Injection": [
    	"direct injection", 
    	"direct injectio", 
    	"direct fuel injection",
    	"direct engine",
    ],
    "Common Rail Injection": [
    	"crdi", 
    	"common rail", 
    	"common rail injection", 
    	"common rail direct injection", 
    	"common rail direct injection (dci)", 
    	"common-rail type", 
    	"advanced common rail", 
    	"common rail system", 
    	"common rail diesel", 
    	"pgm-fi (programmed fuel injection)", 
    	"pgm-fi (programmed fuel inje", 
    	"pgm - fi", 
    	"pgm-fi", 
    	"pgm-fi (programmed fuel inject",
    	"direct injection common rail",
    	"cdi"
    ],
    "Distributor-Type Fuel Injection": [
    	"dedst", 
    	"distribution type fuel injection", 
    	"distributor-type diesel fuel injection",
    ],
    "Indirect Injection": [
    	"indirect injection",
    	"idi"
    ],
    "Gasoline Direct Injection": [
    	"gdi", 
    	"gasoline direct injection",
    	"tfsi",
    	"tsi",
    	"tgdi"
    ],
    "Turbo Intercooled Diesel": [
    	"tcdi", 
    	"turbo intercooled diesel",
    	"tdci"
    ],
    "Intake Port Injection": [
    	"intake port(multi-point)"
    ],
    "Diesel Direct Injection": [
    	"ddi", 
    	"ddis"
    ],
    "Variable Valve Timing Injection": [
    	"dual vvt-i", 
    	"vvt-ie", 
    	"ti-vct"
    ],
    "Three-Phase AC Induction Motors": [
    	"3 phase ac induction motors"
    ],
    "Electric": [
    	"electric", 
    	"isg"
    ],
}

mapping_dict = {v: k for k, lst in fuel_injection_mapping.items() for v in lst}
cars_df['Fuel Suppy System'] = cars_df['Fuel Suppy System'].replace(mapping_dict)

cars_df['Fuel Suppy System'].value_counts()

Multi-Point Fuel Injection         14055
Common Rail Injection              11822
nan                                 5502
Direct Injection                    3837
Electronic Fuel Injection           1593
Gasoline Direct Injection            653
Gasoline Port Injection              100
Diesel Direct Injection               99
Turbo Intercooled Diesel              89
Indirect Injection                    21
Variable Valve Timing Injection       15
Three-Phase AC Induction Motors       11
Distributor-Type Fuel Injection        8
Electric                               7
Intake Port Injection                  1
Name: Fuel Suppy System, dtype: int64
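
Several of the resulting categories still have only a handful of rows. One further option, not used in the original notebook, is to lump infrequent labels into a single bucket. A minimal sketch on toy data follows; the threshold and the `'other'` label are arbitrary choices.

```python
import pandas as pd

# Toy data for illustration only
systems = pd.Series(['mpfi'] * 5 + ['crdi'] * 4 + ['tfsi', 'idi'])

counts = systems.value_counts()
rare = counts[counts < 2].index             # labels seen fewer than 2 times
lumped = systems.where(~systems.isin(rare), 'other')
print(lumped.value_counts().to_dict())      # {'mpfi': 5, 'crdi': 4, 'other': 2}
```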

With this column, all of the features flagged for unexpectedly high cardinality are handled. We can now move on to the other issues noted earlier.

---

#### Converting features to numerical (and/or boolean)

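The features we wish to convert are listed below. Columns with missing values end up as `float64` even where `int` is the natural type, because NaN cannot be stored in a NumPy integer column.
1. `pu` to float
2. `Max Power` and `Max Torque` to 2 separate columns each: `Max Power Delivered` / `Max Power At` and `Max Torque Delivered` / `Max Torque At`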
3. ~~`No of Cylinder` to int (from float)~~
4. ~~`Values per Cylinder` to int (from float)~~
5. `BoreX Stroke` to 2 separate columns: `Bore` and `Stroke`
6. `Turbo Charger` to boolean
7. `Super Charger` to boolean
8. `Length` to int 
9. `Width` to int
10. `Height` to int
11. `Wheel Base` to int
12. `Front Tread` to int
13. `Rear Tread` to int
14. `Kerb Weight` to int
15. `Gross Weight` to int
16. ~~`Seating Capacity` to int (from float)~~
17. `Turning Radius` to float
18. `Top Speed` to int
19. `Acceleration` to float
20. ~~`No Door Numbers` to int (from float)~~
21. `Cargo Volumn` to int
22. `Compression Ratio` to float
23. `Alloy Wheel Size` to float
24. `Ground Clearance Unladen` to int

In [38]:
# Make a copy of the data, cars_df will now serve as a checkpoint
cars_df2 = cars_df.copy()

# Sanity check
cars_df2.head()

usedCarSkuId,loc,myear,bt,tt,ft,km,ip,images,imgCount,threesixty,...,No Door Numbers,Cargo Volumn,model_type_new,state,exterior_color,owner_type,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen
7111bf25-97af-47f9-867b-40879190d800,gomti nagar,2016,hatchback,manual,cng,69162,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,5.0,180-liters,used,uttar pradesh,silver,first,,,,
c309efc1-efaf-4f82-81ad-dcb38eb36665,borivali west,2015,hatchback,manual,cng,45864,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,5.0,235-litres,used,maharashtra,grey,first,Gasoline Port Injection,11.0:1,,
7609f710-0c97-4f00-9a47-9b9284b62d3a,jasola,2015,sedan,manual,cng,81506,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,4.0,400-litres,used,delhi,silver,second,,,,
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,jasola,2013,hatchback,manual,cng,115893,0,[{'img': ''}],0,False,...,4.0,,used,delhi,silver,second,Multi-Point Fuel Injection,,13.0,
b1eab99b-a606-48dd-a75b-57feb8a9ad92,mumbai g.p.o.,2022,muv,manual,cng,18900,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,6,False,...,5.0,,used,maharashtra,white,first,,12.0+-.03,,


In [39]:
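# A utility function to parse the number a string starts with, e.g. '1,695mm' -> 1695
def convert_to_number(x, conv: str = 'float'):
    x = str(x)
    new_str = ''
    seen_decimal = False
    for ch in x:
        if '0' <= ch <= '9':
            new_str += ch
        elif ch == ',' or ch == '_':
            # skip thousands separators
            continue
        elif ch == '.' and not seen_decimal:
            new_str += ch
            seen_decimal = True
        else:
            # stop at the first character that is not part of the number
            break

    # nothing numeric found (e.g. 'nan' or a lone '.')
    if new_str == '' or new_str == '.':
        return None

    if conv == 'int':
        # go through float so that a value like '171.92' does not raise
        return int(float(new_str))

    return float(new_str)

def get_begin_number(x):
    return convert_to_number(x, 'int')

def get_begin_float(x):
    return convert_to_number(x, 'float')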

# def get_begin_number(x):
#     x = str(x)
#     reg_match = re.search(r'\d+', x)
#     if reg_match is None:
#         return None
#     return int(reg_match.group())

# def get_begin_float(x):
#     x = str(x)
#     reg_match = re.search(r'\d+(?:\.\d+)?', x)
#     if reg_match is None:
#         return None
#     return float(reg_match.group())
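
For comparison, the same idea can be written with a single anchored regular expression. This is only a sketch: `leading_number` is a hypothetical name, and unlike the loop-based helper it returns `None` for a string that starts with a bare `.`.

```python
import re

# A digit, then further digits or thousands separators (',' or '_'),
# then an optional fractional part.
_LEADING_NUMBER = re.compile(r'\d[\d,_]*(?:\.\d+)?')


def leading_number(text):
    """Return the number a string starts with as a float, or None."""
    match = _LEADING_NUMBER.match(str(text))
    if match is None:
        return None
    return float(match.group().replace(',', '').replace('_', ''))


print(leading_number('1,10,000'))    # 110000.0
print(leading_number('81.86bhp'))    # 81.86
print(leading_number('865-890kg'))   # 865.0
print(leading_number('12.0+-.03'))   # 12.0
print(leading_number('nan'))         # None
```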

---

#### `pu`

This is our target column (the price of our used cars). This should be a float or int.

In [40]:
cars_df2['pu'].sample(5)

usedCarSkuId
b50e8500-56d4-4de7-945b-644f7ea5d074    4,50,000
2b032154-7d27-4981-a0d9-93d17cf2d351    7,44,000
75457405-852c-4bc9-a085-84d02e97f933    2,75,000
9513e70e-e45c-44f8-ae22-6389c894f104    2,19,925
a86fb138-510f-4d2c-914d-b85dd4da3892    4,77,000
Name: pu, dtype: object

In [41]:
# We simply need to remove the ',' separators and convert to float
cars_df2['pu'] = cars_df2['pu'].str.replace(',', '', regex=False).astype(float)

cars_df2['pu'].sample(5)

usedCarSkuId
9a1111ab-f0e6-42e9-bee7-514eef20ebc3   794918.000
aeba3023-8877-436e-b013-74f781ee809d   310000.000
da5ff6db-c529-4ab3-9732-55bdd7e9e849   260000.000
d442028a-de1d-4d60-a7ed-6387f81332de   980000.000
3d39a8d6-bb38-41c2-9290-4bc95fb81100   414000.000
Name: pu, dtype: float64
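
`astype(float)` raises as soon as one value is not a clean number. A more forgiving variant is sketched below on made-up values (the `'price on request'` entry is hypothetical and not taken from the dataset): `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN.

```python
import pandas as pd

# Made-up sample in the same comma-grouped style as the `pu` column
prices = pd.Series(['4,50,000', '7,44,000', 'price on request', None])

# Remove grouping commas literally (regex=False), then coerce:
# anything that still is not a number becomes NaN instead of raising.
clean = pd.to_numeric(prices.str.replace(',', '', regex=False), errors='coerce')
print(clean.tolist())  # [450000.0, 744000.0, nan, nan]
```

Counting the NaNs before and after such a conversion is a quick way to see how many values were silently dropped.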

---

#### `Max Power` and `Max Torque`

We will separate this column into 2 columns: `Max Power Delivered` and `Max Power At` (the rpm at which the max power is reached).
The same is done for `Max Torque`, giving `Max Torque Delivered` and `Max Torque At`.

In [42]:
cars_df2[['Max Power', 'Max Torque']].sample(5)

usedCarSkuId,Max Power,Max Torque
e2538573-6be1-4b46-96a2-051f7a6e207d,81.86bhp@6000rpm,113.75nm@4000rpm
1777c2d3-ff3f-4519-a2a5-ef0c91282f26,88.73bhp@4000rpm,219.7nm@1500-2750rpm
36dc9f17-3eae-44a8-83f7-c627d3410c93,152.87bhp@3750rpm,360nm@1750-2800rpm
fb02f7a1-06ed-4d8d-89ed-4ef658251a90,53.3bhp@5678rpm,72nm@4386rpm
4b3c8894-6e6a-4872-8385-4d6a7c07bc72,67.04bhp@6000rpm,90nm@3500rpm


In [43]:
# Split '<value>bhp@<rpm>rpm' on '@' and parse the leading number of each part
cars_df2['Max Power Delivered'] = cars_df2['Max Power'].str.split('@').str[0].apply(get_begin_float).astype(float)
cars_df2['Max Power At'] = cars_df2['Max Power'].str.split('@').str[1].apply(get_begin_float).astype(float)


def get_rpm_average(x):
    x = str(x)
    if '-' in x:
        p1 = get_begin_float(x.split('-')[0])
        p2 = get_begin_float(x.split('-')[1])
        if p1 is None:
            return p2
        if p2 is None:
            return p1
        
        return (p1 + p2)/2
    else:
        return get_begin_float(x)

cars_df2['Max Torque Delivered'] = cars_df2['Max Torque'].str.split('@').str[0].apply(get_begin_float).astype(float)
cars_df2['Max Torque At'] = cars_df2['Max Torque'].str.split('@').str[1].apply(get_rpm_average).astype(float)

In [44]:
cars_df2[['Max Power Delivered', 'Max Power At', 'Max Torque Delivered', 'Max Torque At']].sample(5)

usedCarSkuId,Max Power Delivered,Max Power At,Max Torque Delivered,Max Torque At
b8a30c7c-814b-4ea3-a8cd-9aa738673fe5,68.05,5500.0,99.04,4500.0
4866b787-6699-4e7f-b53f-4f7b7e2a30c2,138.1,3750.0,350.0,2125.0
24354a62-02dd-4623-9812-1747893ece02,97.7,6000.0,134.0,4000.0
075009cb-914d-4841-9e0d-922031298a87,167.67,3750.0,350.0,2125.0
c49a90dc-7abf-46f5-aea5-edb46f587f38,126.2,4000.0,259.9,2325.0
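
The split-and-parse steps above can also be expressed as a single `str.extract` call. The sketch below assumes values follow the `<value><unit>@<rpm>rpm` shape seen in the sample, with an optional `low-high` rpm range. The group names `value`, `rpm_low` and `rpm_high` and the derived `rpm` column are made up for this example.

```python
import pandas as pd

torque = pd.Series(['113.75nm@4000rpm', '219.7nm@1500-2750rpm', '360nm'])

# value, then optionally '@' followed by one rpm figure or a 'low-high' range
pattern = (
    r'^(?P<value>\d+(?:\.\d+)?)[^@]*'
    r'(?:@(?P<rpm_low>\d+)(?:-(?P<rpm_high>\d+))?)?'
)
parts = torque.str.extract(pattern).astype(float)

# Use the midpoint when a range is given, otherwise the single figure
parts['rpm'] = parts[['rpm_low', 'rpm_high']].mean(axis=1)
print(parts[['value', 'rpm']])
```

Rows without an `@` part keep NaN in `rpm`, which mirrors what the helper-based version does when the second piece is missing.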


Now we can safely drop `Max Power` and `Max Torque` from our dataframe.

In [45]:
cars_df2.drop(columns=['Max Power', 'Max Torque'], inplace=True)
cars_df2.head()

usedCarSkuId,loc,myear,bt,tt,ft,km,ip,images,imgCount,threesixty,...,exterior_color,owner_type,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen,Max Power Delivered,Max Power At,Max Torque Delivered,Max Torque At
7111bf25-97af-47f9-867b-40879190d800,gomti nagar,2016,hatchback,manual,cng,69162,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,silver,first,,,,,58.16,6200.0,77.0,3500.0
c309efc1-efaf-4f82-81ad-dcb38eb36665,borivali west,2015,hatchback,manual,cng,45864,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,grey,first,Gasoline Port Injection,11.0:1,,,58.2,6000.0,78.0,3500.0
7609f710-0c97-4f00-9a47-9b9284b62d3a,jasola,2015,sedan,manual,cng,81506,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,silver,second,,,,,86.7,6000.0,109.0,4500.0
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,jasola,2013,hatchback,manual,cng,115893,0,[{'img': ''}],0,False,...,silver,second,Multi-Point Fuel Injection,,13.0,,58.2,6200.0,77.0,3500.0
b1eab99b-a606-48dd-a75b-57feb8a9ad92,mumbai g.p.o.,2022,muv,manual,cng,18900,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,6,False,...,white,first,,12.0+-.03,,,86.63,5500.0,121.5,4200.0


---

#### `BoreX Stroke`

We will make 2 new columns `Bore` and `Stroke` and remove the original column.


In [46]:
cars_df2['BoreX Stroke'].sample(10)

usedCarSkuId
21420de7-aa72-4d29-ad50-8e9c1b45f0ea    81.0 x 95.5 mm
48b8fb2d-5b12-423c-92ab-eb9e71a2ebb4               nan
0117c37c-4b96-4fa3-98d1-b46b95d5eeca               nan
ac56a9f6-6753-41b1-96f2-4a1c1c64f4f6               nan
797a131d-afb6-4b62-9f8d-00c545407b52        69 x 72 mm
c8e2a9fb-a94e-4c8f-b736-04fc6f36fb54               nan
3bcaa929-3183-4a55-9807-beb5729d5969               nan
ecf7289b-a043-4694-b0d7-d79306f4db2b               nan
7f75de4f-d09a-45ea-aecc-fe56754c2b58               nan
ced49412-dcea-492a-9529-656370db83a1               nan
Name: BoreX Stroke, dtype: object

In [47]:
cars_df2['Bore'] = cars_df2['BoreX Stroke'].str.split('x').str[0].apply(get_begin_float).astype(float)
cars_df2['Stroke'] = cars_df2['BoreX Stroke'].str.split('x').str[1].apply(get_begin_float).astype(float)

In [48]:
cars_df2[['Bore', 'Stroke']].sample(10)

usedCarSkuId,Bore,Stroke
f829f46f-eecc-4b24-ab5e-a3a90aec09ee,,
de2f6208-430e-4c75-92e8-7e07bf00ddf9,71.0,
0a070224-8a97-4ffa-9623-dadf8162810c,,
2f28a462-1982-4169-9b94-8e13c5d96a0b,,
ab752e9c-14e6-4274-8528-e88900138142,,
21fdcf5e-89e9-4818-8584-bd440fb8f8d9,,
46746cb1-089c-458b-b09f-ce79aa6fd0b5,69.6,
d4afa5a4-f424-4115-a120-cdf31a02f5c7,,
be54c57b-9687-4db9-87ff-652ca6d9c962,,
d51f3f5a-9e2f-4fd6-bd06-38e9437ef84f,76.0,82.5
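
In the sample above, `Stroke` is often missing even where `Bore` was parsed. A likely cause: for a value such as `69 x 72 mm`, splitting on `'x'` leaves `' 72 mm'`, and the helper stops at the first character that is not part of a number, which here is the leading space. Below is a whitespace-tolerant sketch on values in the same style. It is an alternative to the code above, not what the notebook ran, and the pattern is an assumption about the formats present.

```python
import pandas as pd

bore_stroke = pd.Series(['81.0 x 95.5 mm', '69 x 72 mm', '76.0x82.5', None])

# Allow optional whitespace on either side of the 'x'
pattern = r'(?P<Bore>\d+(?:\.\d+)?)\s*x\s*(?P<Stroke>\d+(?:\.\d+)?)'
dims = bore_stroke.str.extract(pattern).astype(float)
print(dims)
```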


In [49]:
cars_df2.drop(columns=['BoreX Stroke'], inplace=True)
cars_df2.head()

usedCarSkuId,loc,myear,bt,tt,ft,km,ip,images,imgCount,threesixty,...,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen,Max Power Delivered,Max Power At,Max Torque Delivered,Max Torque At,Bore,Stroke
7111bf25-97af-47f9-867b-40879190d800,gomti nagar,2016,hatchback,manual,cng,69162,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,,,,,58.16,6200.0,77.0,3500.0,69.0,
c309efc1-efaf-4f82-81ad-dcb38eb36665,borivali west,2015,hatchback,manual,cng,45864,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,Gasoline Port Injection,11.0:1,,,58.2,6000.0,78.0,3500.0,73.0,
7609f710-0c97-4f00-9a47-9b9284b62d3a,jasola,2015,sedan,manual,cng,81506,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,,,,,86.7,6000.0,109.0,4500.0,,
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,jasola,2013,hatchback,manual,cng,115893,0,[{'img': ''}],0,False,...,Multi-Point Fuel Injection,,13.0,,58.2,6200.0,77.0,3500.0,,
b1eab99b-a606-48dd-a75b-57feb8a9ad92,mumbai g.p.o.,2022,muv,manual,cng,18900,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,6,False,...,,12.0+-.03,,,86.63,5500.0,121.5,4200.0,,


---

#### `Turbo Charger` and `Super Charger`

Convert them to boolean

In [50]:
cars_df2[['Turbo Charger', 'Super Charger']].sample(5)

usedCarSkuId,Turbo Charger,Super Charger
caadf193-3213-4937-96f7-6f0c2c236af5,yes,no
fca1f01b-2b69-490d-8784-a169aed337eb,yes,no
04571ec1-032d-4710-ad57-9ea9688b4030,no,no
d42f2ee6-4c69-4a49-ae48-d434bdfb4b58,yes,no
d3334072-f3db-474c-8f7c-278e29a696ee,no,no


In [51]:
# Note: astype(bool) turns any other non-empty value (e.g. 'nan') into True
cars_df2['Turbo Charger'] = cars_df2['Turbo Charger'].replace('yes', True)
cars_df2['Turbo Charger'] = cars_df2['Turbo Charger'].replace('no', False).astype(bool)

cars_df2['Super Charger'] = cars_df2['Super Charger'].replace('yes', True)
cars_df2['Super Charger'] = cars_df2['Super Charger'].replace('no', False).astype(bool)

In [52]:
cars_df2[['Turbo Charger', 'Super Charger']].sample(5)

usedCarSkuId,Turbo Charger,Super Charger
8e7a0383-faad-44d4-a122-a6f1b5c69548,False,False
12d86f47-0316-4af2-81a0-e76f28493d01,False,False
7a20de83-bd01-4827-8103-80d4e72b13f6,False,False
516a27bd-1bc2-4d00-ace6-1a47f2288469,True,True
063ae741-f5e0-4fa5-b616-078ed86b0cbb,False,False
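
With `astype(bool)`, every non-empty value other than `'no'` ends up as `True`, including a missing-value marker such as `'nan'`. If missing values should stay missing, one option is an explicit mapping combined with pandas' nullable `boolean` dtype. The sketch below uses made-up values and is not what the notebook ran.

```python
import pandas as pd

raw = pd.Series(['yes', 'no', 'nan', 'yes'])

# Explicit mapping: anything other than 'yes'/'no' becomes missing (<NA>)
flags = raw.map({'yes': True, 'no': False}).astype('boolean')
print(flags.tolist())  # [True, False, <NA>, True]
```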


---

#### `Length`, `Width`, `Height`, `Wheel Base`

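Convert these columns to numbers. The values are whole millimetres, but the columns are stored as float because they contain missing values.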

In [53]:
cars_df2[['Length', 'Width', 'Height', 'Wheel Base']].sample(10)

usedCarSkuId,Length,Width,Height,Wheel Base
e24ec803-7f13-497a-9a4b-34036113ade6,4440mm,1695mm,1485mm,2550mm
ccb40843-c348-48da-bf0e-704da58db55d,3840mm,1735mm,1530mm,2450mm
cd4ce72b-7edf-4df9-98f6-9062a16397f9,3765mm,1660mm,1520mm,2425mm
933b335d-ffa5-49d5-a6a0-d83465de8d99,4507 mm,1731 mm,1455 mm,2512 mm
ce75d9d4-8965-4cc8-b9f3-abc7a5fd6f7c,4315mm,1800mm,1645mm,2610mm
b12fd039-1ee3-41ee-a9fc-c658e1e38dfb,4440mm,1695mm,1495mm,2600mm
09074205-31d4-41e2-87f2-30e588c05b17,3585mm,1595mm,1550mm,2380mm
27636220-1d32-4794-98e6-823d02895977,4388mm,1831mm,1608mm,2603mm
a454a927-fd41-4404-afcb-a98faf66a196,3695mm,1600mm,1560mm,2425mm
b06efdd0-9968-4be2-a301-db6295345d87,4585 mm,1866 mm,1774 mm,2760 mm


In [54]:
for col in ['Length', 'Width', 'Height', 'Wheel Base']:
    cars_df2[col] = cars_df2[col].apply(get_begin_number).astype(float)

In [55]:
cars_df2[['Length', 'Width', 'Height', 'Wheel Base']].sample(5)

usedCarSkuId,Length,Width,Height,Wheel Base
a982e5a6-548e-447f-9f87-9e4f4f8965c3,4440.0,1695.0,1485.0,2550.0
9e64f708-7845-4653-be89-e5ba5bf519bb,4591.0,1770.0,1447.0,2760.0
2c99b549-efe0-4c04-b4eb-fe917ef26928,3995.0,1706.0,1570.0,2470.0
2f9203f2-3d19-4e7c-8e68-3c3d950f61f2,4591.0,1770.0,1447.0,2760.0
eeb2f23e-851c-4e14-aee4-4ddcad9e8dae,3995.0,1695.0,1555.0,2430.0
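
The columns above are stored as `float64` because NaN cannot live in a NumPy integer column. If true integer columns are preferred, pandas' nullable `Int64` dtype is one option. A minimal sketch on made-up values:

```python
import pandas as pd

lengths = pd.Series([4440.0, 3840.0, None])   # float64 because of the missing value

# Nullable integer dtype: whole numbers stay integers, missing values become <NA>
lengths_int = lengths.astype('Int64')
print(lengths_int.tolist())  # [4440, 3840, <NA>]
```

Whether downstream modelling code accepts `Int64` columns depends on the library, so plain floats may still be the more practical choice.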


---

#### `Front Tread` and `Rear Tread`

In [56]:
cars_df2[['Front Tread', 'Rear Tread']].sample(10)

usedCarSkuId,Front Tread,Rear Tread
776f4552-3684-4bbe-9c8d-f87db63655b0,,
89dc16e4-8f7e-4c54-ac86-f2b796a87d04,1520,1520
af3c8161-91c7-4eae-a450-8a7cd95bc5d4,1479mm,1493mm
8c33fbec-494a-40ec-b075-c9c95c6e26e5,1479mm,1493mm
3a64f7b2-98bf-47eb-afa7-50430269ed14,1490mm,1475mm
4e7a9b91-4532-4cde-872b-1f38e4e5b92d,1530mm,1530mm
0ca27c98-9af1-4ab5-980f-efe8b66d2592,1479mm,1493mm
1b0da9f6-2a8c-4158-9977-b73caf5af778,1435mm,1425mm
2f0f7677-94e0-40e6-a0ed-b3ae14b94626,1480mm,1465mm
ce805aa6-a5e3-43d9-abf8-16e022036bbe,,


In [57]:
for col in ['Front Tread', 'Rear Tread']:
    cars_df2[col] = cars_df2[col].apply(get_begin_float).astype(float)

In [58]:
cars_df2[['Front Tread', 'Rear Tread']].sample(5)

usedCarSkuId,Front Tread,Rear Tread
0ae6e684-e8f4-4d1f-a3ab-20422eca48fa,1485.0,1495.0
d7dfd8e8-911e-4dad-b25c-0f35d3aeaedd,1505.0,1515.0
8e6a5b3c-3bfe-41b0-a0aa-0da6efb7f56f,1470.0,1480.0
6dd56365-d15d-4d8b-9499-d4ec9862e541,,
54f1b3f8-b2e1-4f36-a4fb-e997ea8cb6c7,1530.0,1530.0


---

#### `Kerb Weight` and `Gross Weight` 

Convert weights to numerical features. For a range such as `865-890`, the helper stops at the `-`, so only the lower bound is kept.

In [59]:
cars_df2[['Kerb Weight', 'Gross Weight']].sample(10)

usedCarSkuId,Kerb Weight,Gross Weight
47bef6c6-89fc-49ab-b951-95887845c517,1130kg,
0a3b8743-8cdb-434c-8672-511a8a1810e4,865-890,1340kg
ccfa7216-f300-47d0-8a32-6b71e685929b,1790kg,2340kg
4943e932-29ec-426d-8de1-5a41f349ff74,,
85b4a87e-0757-4f33-be86-aabe4524f75a,860-895kg,1315kg
a85dac1c-4a98-4ffa-92fb-aa02cc9ff703,,
d72c080e-3e54-4930-a407-570db2cbf453,970kg,1415kg
027d521e-61e4-43df-8e8c-d87e229c6666,1066kg,
aaec6cff-3f54-4ded-9fe2-919a9c3f294e,,2215kg
da4d4cf1-1a9a-4ccd-b087-d196b610ff10,,


In [60]:
for col in ['Kerb Weight', 'Gross Weight']:
    cars_df2[col] = cars_df2[col].apply(get_begin_number).astype(float)

In [61]:
cars_df2[['Kerb Weight', 'Gross Weight']].sample(10)

usedCarSkuId,Kerb Weight,Gross Weight
aedf5af6-3a1f-424b-bd93-d62b68472f51,,
f88f68c4-18bb-43af-9ea9-8b4e45553e4c,1660.0,2185.0
552aa414-d454-4ae3-a0fa-77e23c103cf4,995.0,
43ed7974-a713-48bc-a3f6-ffabff870844,,
8be615b8-3cf3-4834-a5ec-50fd34ab4b3d,1230.0,1670.0
69c68199-5de0-4b97-824a-177ad7dfb7a4,1240.0,
753a611a-d7d4-4380-9bdb-29f9f922c9ec,1200.0,1680.0
649a3f4b-769a-4b5e-a520-a01668530658,1600.0,2160.0
9f9362dd-bffa-4744-92c0-00d2e5eb3f3b,,
4d9d2d51-c343-42f9-a8f8-b7a8fbd77178,960.0,1405.0


---

#### `Turning Radius`

In [62]:
cars_df2['Turning Radius'].sample(10)

usedCarSkuId
32f6517d-535e-4651-b4d1-f52ff410ce5a     5.4 meters
ff6cfa75-2736-4414-9f01-26ea39a33d7b            nan
bf98806e-4756-4f72-8e76-f461d3480244    5.75 metres
321e31cb-3922-4eea-b3a0-3e2c5dc3570c     5.2 metres
1ed1d3b6-a2d8-4426-b3ed-f58f7d3b4a40    5.75 metres
4e291b79-3b37-494d-b151-c5e9df7da8e6     4.6 metres
1750c2ca-5db6-432a-b75e-16cd6b1bd399     4.6 metres
3b3f8e22-7e37-4b0e-89d1-4c0743494df0            4.9
0e8d3cdd-1357-4b6a-8cf3-821af53b20b6     4.6 metres
5f4273dd-0625-43b3-851f-6ea895cf6b7e     5.2 metres
Name: Turning Radius, dtype: object

In [63]:
cars_df2['Turning Radius'] = cars_df2['Turning Radius'].apply(get_begin_float).astype(float)

In [64]:
cars_df2['Turning Radius'].sample(5)

usedCarSkuId
6e6487c4-f778-4354-90d7-60ebd7f02ea1   5.300
18326e34-7d77-465c-95d1-dbd63ba8d4c7   5.300
47bdf5e6-08c9-4772-9bb2-4bf29902bb52   6.000
8ab578b5-2b5b-4751-a844-1e84a32b64c1   4.600
32fd645c-d49e-4c4d-ba5b-65e042555c4f   5.300
Name: Turning Radius, dtype: float64

---

#### `Top Speed` and `Acceleration`

In [65]:
cars_df2[['Top Speed', 'Acceleration']].sample(10)

usedCarSkuId,Top Speed,Acceleration
ab290c5b-0be3-4337-8e21-e0eac1557b45,,
1771a4a6-e29a-4a36-9ce2-07b164f3a84f,172 kmph,12.36 seconds
b985b788-ad53-4cc8-b3ee-02451c6f898e,165 kmph,13.6 seconds
3b3b9ba1-064a-4fae-8f3b-421c761dec89,156 kmph,15 seconds
283126b4-f9f8-4e0c-a509-fcbe07e79738,171.92 kmph,11.41 seconds
3c75f20c-4a01-4257-9d20-befd571ab9bd,227,8.2
9a67dde0-378d-412b-9f06-914a97e13048,190 kmph,10.62 seconds
a254360b-dade-466c-abb5-b2f9ea8f91d8,158 kmph,14.2 seconds
494d21c6-a2de-4c11-aa4a-57ca9428c4b0,,
ca99c5b1-dd0c-41d2-92fa-4168370c5140,165 kmph,14.3 seconds


In [66]:
cars_df2['Top Speed'] = cars_df2['Top Speed'].apply(get_begin_float).astype(float)
cars_df2['Acceleration'] = cars_df2['Acceleration'].apply(get_begin_float).astype(float)

In [67]:
cars_df2[['Top Speed', 'Acceleration']].sample(10)

usedCarSkuId,Top Speed,Acceleration
01ff7411-65cf-4ce4-b2d8-f3a38e0bcf74,,
15ac20cd-0f5e-4c4b-a028-86513746c60f,161.0,14.8
7096bbbd-e032-4cf2-acd0-1a52f14b79d8,126.0,20.0
c095cf39-336e-44e1-a895-6c6b64d5f3d3,,11.31
2cede171-018b-45ea-96a8-84938df505f3,,9.9
ba543743-7686-445c-a048-1c58ff3e1439,,
0ad967ff-59e7-4d22-9c29-cc8c0e3ec81c,165.0,13.2
9b9e574e-394c-4aea-b0ef-42f03fa58b11,,
875e3d05-1343-43e3-84d3-0d7a8a88ca06,190.0,10.0
5b87cf94-05c0-4357-94cd-4878a5a8da31,172.0,12.36


---

#### `Cargo Volumn` and `Ground Clearance Unladen`

Convert these columns to numbers (stored as float because of the missing values).

In [68]:
cars_df2[['Cargo Volumn', 'Ground Clearance Unladen']].sample(10)

usedCarSkuId,Cargo Volumn,Ground Clearance Unladen
3ef93e64-8c21-4c1e-84c4-dda19b8df1da,,
8ce1df8f-b194-482f-a7b9-bc92599c8ecd,,
a95c10f1-8eb6-4e3b-ae1f-9cf853ac0822,494,
e26f6ad0-d47a-4a16-ad00-f7bddc2962df,259 litres,
d5545e6b-5a6d-434e-a51c-3fa1ad6b168c,242-litres,
02c6b229-014f-4ecd-9bbf-e7027424271e,475-litres,
fc254729-4505-467e-aa17-ea8080fc911b,320 liters,
224ad1ed-141c-4d8f-aedd-b061b712293c,400-litres,
91fcdfb4-88d5-4893-b53f-e0ad5f4abeca,425,
879e371e-9fb8-4758-ae25-c161842eba54,177-litres,


In [69]:
for col in ['Cargo Volumn', 'Ground Clearance Unladen']:
    cars_df2[col] = cars_df2[col].apply(get_begin_number).astype(float)

In [70]:
cars_df2[['Cargo Volumn', 'Ground Clearance Unladen']].sample(10)

usedCarSkuId,Cargo Volumn,Ground Clearance Unladen
f34835f5-823f-44ad-b07c-2f976eea270f,494.0,
841154d9-b8d9-4469-a208-7062dfa56019,433.0,
4bf31c29-85b3-4e38-b291-460f441be50e,,
2c99b549-efe0-4c04-b4eb-fe917ef26928,390.0,
65833e9a-343b-4894-8df9-8f16b0dae811,,
b84a4099-6aec-4552-af52-9463a5c2b84d,510.0,
d6061642-01a3-4584-88c1-61216cd8d7b2,177.0,
731d6a3c-6f12-4eb9-8737-76e35d61f340,,
a8ed8410-c7c3-440a-b9ac-0236feb581c5,425.0,
6b238af8-2658-4007-b3d8-7d6692070230,,


---

#### `Compression Ratio` and `Alloy Wheel Size`

These columns should be converted to floats

In [71]:
cars_df2[['Alloy Wheel Size', 'Compression Ratio']].sample(10)

usedCarSkuId,Alloy Wheel Size,Compression Ratio
b27b297e-1a70-4192-b40b-5ca64109e589,15.0,
90a10dec-086d-4978-b29e-c1aa47db11e7,,
90ed6040-3d9a-48e0-b5f3-eae28e59d036,,
0308e426-d4c8-4f85-b13f-86dcc19a932b,,
ec4c508b-a5c2-4d48-8757-61100fc80ec5,16.0,
d62d4305-64e3-4756-b8b0-8f5f962a4cd6,,
526e1cbc-4234-444a-81ab-2db29244dc7c,,
9291bf7f-3d17-4709-abff-17e7a77e04fd,13.0,10:1
5cd92e26-cf83-49f9-9b6e-cd14179da92f,16.0,
ba61673d-6d30-4b8e-bd1b-ed04a5020ecb,,


In [72]:
for col in ['Alloy Wheel Size', 'Compression Ratio']:
    cars_df2[col] = cars_df2[col].apply(get_begin_float).astype(float)

In [73]:
cars_df2[['Alloy Wheel Size', 'Compression Ratio']].sample(10)

usedCarSkuId,Alloy Wheel Size,Compression Ratio
5caf1e7f-13b8-41c6-a787-389d693aeed0,14.0,10.0
eb1adedf-e547-46d6-b55c-9c6c76d18675,,
f44b8229-47d9-437a-bf3b-a2408e35a89e,,
a5e16a14-c608-46d9-ab6c-ad6a07216334,,
9f068552-41fd-43dc-97c8-f34b65b3b5a4,16.0,
8ec8fd68-1484-42a0-92ed-c4d3a7a789f2,,
bcc743bf-d6bb-46b0-a2fb-a72185f9680e,16.0,17.3
f7d80434-b7be-443a-aafd-90334f944e8f,16.0,10.0
72562572-154a-425a-a36d-473fd66474c8,20.0,16.1
475d1f73-6057-4c82-843a-63876a09e86d,17.0,


---
#### `km`

This column holds comma-separated strings and needs to be converted to a number.

In [74]:
cars_df2['km'].sample(10)

usedCarSkuId
7ae96078-f369-433b-824e-416667ce14ae      41,494
4b2f7567-d7cb-446f-b20b-4736c92758cb      24,173
31fa45fb-ba54-4dac-9b6a-8b66f16c1def      66,000
ed020e94-42d2-4ab1-a175-317e476cc424      40,000
32488a91-3ea5-4686-b4ea-54639a186829    1,10,000
6ef35063-db72-4843-b757-8329b0ba1cea      46,067
d87e86ee-f3c3-40d1-95ea-f38d4621bcfd      33,076
8b899050-f9b1-43c4-a51d-03009d4f9c5e    2,04,500
4beb3088-e6c7-4777-a275-360dcfebf249      11,834
83a06c34-cd00-4c65-96e1-066237a2d4f1    1,17,391
Name: km, dtype: object

In [75]:
cars_df2['km'] = cars_df2['km'].apply(get_begin_float).astype(float)

In [76]:
cars_df2['km'].sample(10)

usedCarSkuId
1330e5e7-8750-488c-b958-d2cd000f5698    35111.000
85857ba9-5a85-4fcf-ab56-758060b3b619    85000.000
1c773902-eb9d-4e23-a4ed-16d0bb932f20    51000.000
7a0da7fb-26ef-4620-b102-c7d28147777a    70000.000
7f82c058-367a-4579-83e0-69c3c727680e    92000.000
d5277af2-39c6-4854-9bd0-519f3844e454    67800.000
edbc4714-a470-4f80-a5f5-f39ca760f3fe   184290.000
5903f1d2-826c-4f6e-ac4f-0f4cdbeab513    80804.000
761a325c-c412-40f4-9a71-541a4cb29825    22509.000
730e67fe-688e-4b01-ad21-0ce680694726    13633.000
Name: km, dtype: float64

With this, all the columns noted above have been converted to numerical types.

In [77]:
cars_df2.dtypes.value_counts()

object     31
float64    27
int64       4
bool        3
dtype: int64
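The counts above only say how many columns of each dtype remain. A quick way to list which columns are still non-numeric is `select_dtypes`. The snippet below is a small sketch on a hypothetical four-column frame (`demo_df` is not part of the notebook), not on `cars_df2` itself.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame: one column per dtype family
demo_df = pd.DataFrame({
    'km': [41494.0, 110000.0],
    'fuel': ['cng', 'petrol'],
    'myear': [2016, 2015],
    'threesixty': [False, True],
})

# Keep everything that is neither numeric nor boolean
non_numeric_cols = demo_df.select_dtypes(exclude=['number', 'bool']).columns.tolist()

print(non_numeric_cols)
```

Running the same `select_dtypes` call on `cars_df2` would list the remaining text columns by name.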

---

#### Fixing some column names

In [78]:
cars_df2 = cars_df2.rename(columns={
    'tt': 'transmission',
    'bt': 'body',
    'ft': 'fuel',
    'variantName': 'variant',
    'pu': 'listed_price',
    'Values per Cylinder': 'Valves per Cylinder',
    'Value Configuration': 'Valve Configuration',
    'No Door Numbers': 'Doors',
    'Seating Capacity': 'Seats',
    'Cargo Volumn': 'Cargo Volume',
    'city_x': 'City',
})
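One thing to keep in mind is that `DataFrame.rename` silently skips keys that are not present in the frame, so a typo in an old column name goes unnoticed. The sketch below shows two ways to catch that, using a hypothetical two-column frame and a shortened mapping rather than the real data.

```python
import pandas as pd

# Hypothetical frame that is missing the 'ft' column on purpose
demo_df = pd.DataFrame({'tt': ['manual'], 'bt': ['sedan']})
mapping = {'tt': 'transmission', 'bt': 'body', 'ft': 'fuel'}

# Option 1: check the mapping keys against the existing columns
missing = [old for old in mapping if old not in demo_df.columns]

# Default behaviour (errors='ignore'): the unknown key is skipped without warning
renamed = demo_df.rename(columns=mapping)

# Option 2: ask pandas to raise a KeyError for unknown keys
try:
    demo_df.rename(columns=mapping, errors='raise')
    raised = False
except KeyError:
    raised = True

print(missing, list(renamed.columns), raised)
```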

In [79]:
cars_df2.head()

usedCarSkuId,loc,myear,body,transmission,fuel,km,ip,images,imgCount,threesixty,...,Fuel Suppy System,Compression Ratio,Alloy Wheel Size,Ground Clearance Unladen,Max Power Delivered,Max Power At,Max Torque Delivered,Max Torque At,Bore,Stroke
7111bf25-97af-47f9-867b-40879190d800,gomti nagar,2016,hatchback,manual,cng,69162.0,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,,,,,58.16,6200.0,77.0,3500.0,69.0,
c309efc1-efaf-4f82-81ad-dcb38eb36665,borivali west,2015,hatchback,manual,cng,45864.0,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,Gasoline Port Injection,11.0,,,58.2,6000.0,78.0,3500.0,73.0,
7609f710-0c97-4f00-9a47-9b9284b62d3a,jasola,2015,sedan,manual,cng,81506.0,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,15,False,...,,,,,86.7,6000.0,109.0,4500.0,,
278b76e3-5539-4a5e-ae3e-353a2e3b6d7d,jasola,2013,hatchback,manual,cng,115893.0,0,[{'img': ''}],0,False,...,Multi-Point Fuel Injection,,13.0,,58.2,6200.0,77.0,3500.0,,
b1eab99b-a606-48dd-a75b-57feb8a9ad92,mumbai g.p.o.,2022,muv,manual,cng,18900.0,0,[{'img': 'https://images10.gaadi.com/usedcar_i...,6,False,...,,12.0,,,86.63,5500.0,121.5,4200.0,,


In [80]:
cars_df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37813 entries, 7111bf25-97af-47f9-867b-40879190d800 to a96fbcd7-c183-4829-ae97-b2581afe4bac
Data columns (total 65 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   loc                       37813 non-null  object 
 1   myear                     37813 non-null  int64  
 2   body                      37813 non-null  object 
 3   transmission              37813 non-null  object 
 4   fuel                      37813 non-null  object 
 5   km                        37813 non-null  float64
 6   ip                        37813 non-null  int64  
 7   images                    37813 non-null  object 
 8   imgCount                  37813 non-null  int64  
 9   threesixty                37813 non-null  bool   
 10  dvn                       37813 non-null  object 
 11  oem                       37813 non-null  object 
 12  model                     37813 non-null  object 
 13  

#### With this, we are ready to move on to the next steps:
1. EDA
2. Feature Engineering and selection
3. Model Building and Optimization

In [81]:
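import datetime

# Timestamp the file name so repeated runs do not overwrite earlier exports
timestamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
fileName = f"../data/clean/cleaned_data_{timestamp}.csv"
cars_df2.to_csv(fileName)  # the usedCarSkuId index is written as the first column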
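When the cleaned file is loaded again in the next notebook, the index has to be restored explicitly, because `read_csv` treats the first column as ordinary data by default. The snippet below is a minimal round-trip sketch using an in-memory buffer and a hypothetical two-row frame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical frame indexed the same way as cars_df2
demo_df = pd.DataFrame(
    {'km': [69162.0, 45864.0]},
    index=pd.Index(['a1', 'b2'], name='usedCarSkuId'),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo_df.to_csv(buffer)
buffer.seek(0)

# index_col restores usedCarSkuId as the index on load
restored = pd.read_csv(buffer, index_col='usedCarSkuId')

print(restored)
```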
