# Preprocessing the Used Cars Dataset

In this notebook, I first tackle missing values, rows with placeholder '--' characters, and mixed-type columns containing both numeric and textual data. These steps are crucial for cleaning the data and ensuring consistency. Next, I apply one-hot encoding to transform categorical variables into a format suitable for modeling. Finally, I perform feature scaling to standardize the range of the data before conducting feature selection to identify the most informative variables for use in the predictive models.

Data Source: [US Used Cars dataset (3 million)](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset)


In [1]:
import pandas as pd
import numpy as np
from collections import Counter
import ast
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
import matplotlib.pyplot as plt
import seaborn as sns

### Data Download

The dataset is loaded from the downloaded CSV file.

In [2]:
data = pd.read_csv('used_cars_data.csv')



In [3]:
data.shape

(3000040, 66)

### Data Cleaning

In [3]:
data.head()

Unnamed: 0,vin,back_legroom,bed,bed_height,bed_length,body_type,cabin,city,city_fuel_economy,combine_fuel_economy,...,transmission,transmission_display,trimId,trim_name,vehicle_damage_category,wheel_system,wheel_system_display,wheelbase,width,year
0,ZACNJABB5KPJ92081,35.1 in,,,,SUV / Crossover,,Bayamon,,,...,A,9-Speed Automatic Overdrive,t83804,Latitude FWD,,FWD,Front-Wheel Drive,101.2 in,79.6 in,2019
1,SALCJ2FX1LH858117,38.1 in,,,,SUV / Crossover,,San Juan,,,...,A,9-Speed Automatic Overdrive,t86759,S AWD,,AWD,All-Wheel Drive,107.9 in,85.6 in,2020
2,JF1VA2M67G9829723,35.4 in,,,,Sedan,,Guaynabo,17.0,,...,M,6-Speed Manual,t58994,Base,,AWD,All-Wheel Drive,104.3 in,78.9 in,2016
3,SALRR2RV0L2433391,37.6 in,,,,SUV / Crossover,,San Juan,,,...,A,8-Speed Automatic Overdrive,t86074,V6 HSE AWD,,AWD,All-Wheel Drive,115 in,87.4 in,2020
4,SALCJ2FXXLH862327,38.1 in,,,,SUV / Crossover,,San Juan,,,...,A,9-Speed Automatic Overdrive,t86759,S AWD,,AWD,All-Wheel Drive,107.9 in,85.6 in,2020


Some features use '--' as a placeholder for missing data. These placeholders are replaced with NaN so that missing values are represented consistently.

In [4]:
df = data.replace('--', np.nan)
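
A small illustration of this replacement on a toy frame (the values are invented for the example). With a scalar argument, `DataFrame.replace` matches whole cell values only, so a string that merely contains '--' is left untouched.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    'bed': ['--', 'Short', '--'],
    'trim_name': ['Base', 'LX -- Sport', 'S AWD'],
})

cleaned = toy.replace('--', np.nan)

# Whole-cell '--' placeholders become NaN; 'LX -- Sport' is unchanged.
print(cleaned.isnull().sum())
```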

The percentage of missing/null values for each feature is calculated.

In [5]:
null_percentage = df.isnull().mean()*100
null_percentage = null_percentage.sort_values(ascending=False)
print(null_percentage)

bed_height                 100.000000
vehicle_damage_category    100.000000
combine_fuel_economy       100.000000
is_certified               100.000000
bed                         99.347742
                              ...    
franchise_dealer             0.000000
dealer_zip                   0.000000
daysonmarket                 0.000000
city                         0.000000
year                         0.000000
Length: 66, dtype: float64


The number of unique values for each feature is calculated.

In [6]:
unique_counts = {col: df[col].nunique() for col in df.columns}
unique_counts_df = pd.DataFrame(list(unique_counts.items()), columns=['Variable', 'Unique_Count'])

print(unique_counts_df.sort_values(by='Unique_Count', ascending=False))

                   Variable  Unique_Count
0                       vin       3000000
38               listing_id       3000000
12              description       2519325
40         main_picture_url       2415855
41            major_options        279972
..                      ...           ...
33                is_oemcpo             1
9      combine_fuel_economy             0
60  vehicle_damage_category             0
30             is_certified             0
3                bed_height             0

[66 rows x 2 columns]


Columns with more than 50% missing values are removed, together with the near-unique columns that have the highest numbers of unique values ('vin', 'listing_id', 'description', 'main_picture_url').

In [7]:
df = df.drop(['vehicle_damage_category', 'combine_fuel_economy', 'is_certified', 'bed', 'bed_height', 'cabin',
'bed_length', 'is_oemcpo', 'is_cpo', 'owner_count', 'description', 'vin', 'main_picture_url', 'listing_id'], axis=1)
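
The drop list above is written out by hand. As a minimal sketch, the part of it that depends on the missing-value percentage could also be derived from the percentages themselves. The toy frame and the name `missing_threshold` are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative).
toy = pd.DataFrame({
    'bed': [np.nan, np.nan, np.nan, 'Short'],
    'owner_count': [1.0, np.nan, np.nan, np.nan],
    'price': [10000, 15000, 20000, 25000],
})

missing_threshold = 50  # percent; assumed cut-off taken from the text above

# Percentage of missing values per column, then the columns above the cut-off.
null_percentage = toy.isnull().mean() * 100
cols_to_drop = null_percentage[null_percentage > missing_threshold].index.tolist()

toy = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```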

Since a car's accident history affects its price, rows with a missing 'has_accidents' value are removed.

In [8]:
df = df.dropna(subset=['has_accidents'])

After removing the features with the highest number of unique values, I proceed to eliminate duplicates from the dataset.

In [9]:
df = df.drop_duplicates()

The percentage of missing values for each remaining feature is recalculated.

In [10]:
new_percentage = df.isnull().mean()*100
new_percentage = new_percentage.sort_values(ascending=False)
print(new_percentage)

franchise_make          36.167126
city_fuel_economy       15.688048
highway_fuel_economy    15.688048
interior_color          15.494226
torque                  12.812209
power                   11.080536
back_legroom             7.007289
front_legroom            4.894270
major_options            4.503127
fuel_tank_volume         4.310196
maximum_seating          4.262806
width                    4.261788
height                   4.260962
length                   4.260262
wheelbase                4.258735
horsepower               4.056898
engine_displacement      4.056898
wheel_system             3.231359
wheel_system_display     3.231359
trim_name                2.824251
trimId                   2.795244
engine_cylinders         2.541183
engine_type              2.541183
exterior_color           2.241386
fuel_type                2.069001
seller_rating            1.749612
transmission_display     1.697133
transmission             1.697133
mileage                  1.206313
body_type     

The features with more than 15% missing values are removed.

In [11]:
df = df.drop(['franchise_make', 'interior_color', 'city_fuel_economy', 'highway_fuel_economy'], axis=1)

Some features have unit labels ('in' for inches, 'gal' for gallons, 'seats') at the end of their values. These labels are removed and the resulting strings are converted to floats for numerical analysis.

In [12]:
# Map each measurement column to the unit label that trails its values
unit_suffixes = {
    'back_legroom': 'in', 'front_legroom': 'in', 'width': 'in', 'wheelbase': 'in',
    'height': 'in', 'length': 'in', 'fuel_tank_volume': 'gal', 'maximum_seating': 'seats',
}
for col, unit in unit_suffixes.items():
    df[col] = df[col].str.replace(rf'\s*{unit}$', '', regex=True).astype(float)
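
To make the effect of this step concrete, here is a small sketch on a toy Series in the same format as the measurement columns (the name `wheelbase_num` is only used for this example). Missing values pass through both the string replacement and the cast unchanged.

```python
import numpy as np
import pandas as pd

# Toy values in the same format as the dataset's measurement columns.
wheelbase = pd.Series(['101.2 in', '115 in', np.nan])

# Remove the trailing unit label, then cast; NaN stays NaN through both steps.
wheelbase_num = wheelbase.str.replace(r'\s*in$', '', regex=True).astype(float)
print(wheelbase_num.tolist())
```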

The features 'transmission_display' and 'wheel_system_display' duplicate the information in 'transmission' and 'wheel_system'. Therefore, they are removed.

In [13]:
df = df.drop(['transmission_display', 'wheel_system_display'], axis=1)

The features 'engine_cylinders' and 'engine_type' contain the same information, so 'engine_cylinders' is removed.

In [14]:
df = df.drop('engine_cylinders', axis=1)

The feature 'major_options' contains key car features such as Bluetooth and backup cameras, making it a crucial variable. Therefore, rows with missing values in this feature are removed from the dataset.

In [15]:
df = df.dropna(subset=['major_options'])

Since the definition of 'sp_id' is not known, it is removed. Moreover, 'listing_color' already captures the dominant color given in 'exterior_color', so 'exterior_color' is removed as well.

In [16]:
df = df.drop(['sp_id', 'exterior_color'], axis=1)

The percentage of missing values for each remaining feature is recalculated.

In [17]:
new_percentage_df = df.isnull().mean()*100
print(new_percentage_df.sort_values(ascending=False))

torque                 12.122195
power                  10.365084
back_legroom            6.051339
front_legroom           4.087203
fuel_tank_volume        3.492641
maximum_seating         3.475588
height                  3.475055
width                   3.474722
length                  3.474523
wheelbase               3.473324
horsepower              3.321452
engine_displacement     3.321452
wheel_system            2.529057
engine_type             2.203666
trim_name               2.181818
trimId                  2.162168
fuel_type               1.733664
seller_rating           1.658328
transmission            1.485142
mileage                 1.099335
body_type               0.063613
model_name              0.000000
theft_title             0.000000
sp_name                 0.000000
savings_amount          0.000000
salvage                 0.000000
price                   0.000000
listing_color           0.000000
make_name               0.000000
major_options           0.000000
longitude 

Since more than 10% of the values of 'torque' and 'power' are missing, the rows with missing values in these features are removed rather than imputed.

In [18]:
df = df.dropna(subset=['torque', 'power'])

Next, the remaining missing values are filled. The fill strategy depends on the column type, so the columns are first grouped by dtype.

In [19]:
numeric_columns = df.select_dtypes(include=['number']).columns
print('Numeric columns:', numeric_columns)
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
print('Categorical columns:', categorical_columns)
bool_columns = df.select_dtypes(include=['bool']).columns
print('Boolean columns:', bool_columns)

Numeric columns: Index(['back_legroom', 'daysonmarket', 'engine_displacement', 'front_legroom',
       'fuel_tank_volume', 'height', 'horsepower', 'latitude', 'length',
       'longitude', 'maximum_seating', 'mileage', 'price', 'savings_amount',
       'seller_rating', 'wheelbase', 'width', 'year'],
      dtype='object')
Categorical columns: Index(['body_type', 'city', 'dealer_zip', 'engine_type', 'fleet',
       'frame_damaged', 'fuel_type', 'has_accidents', 'isCab', 'listed_date',
       'listing_color', 'major_options', 'make_name', 'model_name', 'power',
       'salvage', 'sp_name', 'theft_title', 'torque', 'transmission', 'trimId',
       'trim_name', 'wheel_system'],
      dtype='object')
Boolean columns: Index(['franchise_dealer', 'is_new'], dtype='object')


The missing values of numerical columns are filled by using the mean.

In [20]:
for col in ['back_legroom', 'seller_rating', 'mileage', 'fuel_tank_volume', 'front_legroom', 'maximum_seating', 'height', 'width', 'wheelbase', 'length']:
    df[col] = df[col].fillna(df[col].mean()) ## there is no seller id 

Before filling the missing values of the categorical columns, the number of unique values in each of them is calculated.

In [21]:
unique_values_cat = {col: df[col].nunique() for col in categorical_columns}

unique_values_cat = pd.DataFrame(list(unique_values_cat.items()), columns=['Variable', 'Unique_Count'])
print(unique_values_cat.sort_values(by = 'Unique_Count', ascending=False))

         Variable  Unique_Count
11  major_options        178380
20         trimId         33985
16        sp_name         25905
2      dealer_zip          9035
21      trim_name          7485
1            city          4659
18         torque          1905
14          power          1883
9     listed_date          1642
13     model_name           958
12      make_name            59
3     engine_type            37
10  listing_color            15
0       body_type             9
6       fuel_type             7
22   wheel_system             5
19   transmission             4
8           isCab             2
15        salvage             2
7   has_accidents             2
17    theft_title             2
5   frame_damaged             2
4           fleet             2


Since 'trim_name' has approximately 7500 unique values, it is removed.

In [22]:
df = df.drop('trim_name', axis=1)

The missing values of categorical columns are filled by using the mode (the most frequent value).

In [23]:
for col in ['body_type', 'fuel_type', 'engine_type', 'wheel_system', 'transmission']:
    most_common = df[col].mode()[0]  # mode: the most frequent value
    df[col] = df[col].fillna(most_common)
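
The two fill strategies used in this notebook, the mean for numeric columns and the most frequent value for categorical columns, can be seen side by side on a toy frame. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in a numeric and one in a categorical column.
toy = pd.DataFrame({
    'mileage': [10000.0, np.nan, 30000.0],
    'fuel_type': ['Gasoline', 'Gasoline', np.nan],
})

# Numeric column: fill with the mean of the observed values.
toy['mileage'] = toy['mileage'].fillna(toy['mileage'].mean())

# Categorical column: fill with the mode (most frequent value).
toy['fuel_type'] = toy['fuel_type'].fillna(toy['fuel_type'].mode()[0])

print(toy)
```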

In [24]:
new_percentage_df = df.isnull().mean()*100
print(new_percentage_df.sort_values(ascending=False))

back_legroom           0.0
savings_amount         0.0
major_options          0.0
make_name              0.0
maximum_seating        0.0
mileage                0.0
model_name             0.0
power                  0.0
price                  0.0
salvage                0.0
seller_rating          0.0
body_type              0.0
sp_name                0.0
theft_title            0.0
torque                 0.0
transmission           0.0
trimId                 0.0
wheel_system           0.0
wheelbase              0.0
width                  0.0
longitude              0.0
listing_color          0.0
listed_date            0.0
length                 0.0
city                   0.0
daysonmarket           0.0
dealer_zip             0.0
engine_displacement    0.0
engine_type            0.0
fleet                  0.0
frame_damaged          0.0
franchise_dealer       0.0
front_legroom          0.0
fuel_tank_volume       0.0
fuel_type              0.0
has_accidents          0.0
height                 0.0
h

Now, there is no missing data for these features in the dataset.

The boolean columns identified earlier ('franchise_dealer' and 'is_new') are converted into integers (0 or 1).

In [25]:
for col in bool_columns:
    df[col] = df[col].astype(int)

To manage the dimensionality of the dataset when applying one-hot encoding, I first identify the most frequent categories for each categorical variable and label all remaining categories as 'Other'. This reduces the number of dummy columns created, since only categories that are frequent enough get a column of their own.

In [26]:
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

top_categories_dfs = []

for col in categorical_columns:
    top_categories = df[col].value_counts(normalize=True).nlargest(10).reset_index() 
    top_categories.columns = ['Category', 'Percentage'] 
    top_categories['Percentage'] *= 100  
    top_categories['Variable'] = col 
    top_categories_dfs.append(top_categories) 

final_df = pd.concat(top_categories_dfs, ignore_index=True)

print(final_df)

            Category  Percentage      Variable
0    SUV / Crossover   45.437346     body_type
1              Sedan   28.496646     body_type
2       Pickup Truck   12.661415     body_type
3            Minivan    3.501958     body_type
4              Coupe    3.105212     body_type
..               ...         ...           ...
152              FWD   42.948502  wheel_system
153              AWD   25.246562  wheel_system
154              4WD   19.109447  wheel_system
155              RWD    8.690921  wheel_system
156              4X2    4.004568  wheel_system

[157 rows x 3 columns]


The threshold for selecting the most frequent categories is set at 20%. Any category that makes up at least 20% of a variable will be retained, and categories below this threshold will be labeled as 'Other'.

In [27]:
## body_type
print(final_df.loc[final_df['Variable'] == 'body_type'])
top_categories = df['body_type'].value_counts().nlargest(2).index ## the number of the most frequent categories is 2
df['body_type'] = df['body_type'].apply(lambda x: x if x in top_categories else 'Other')

          Category  Percentage   Variable
0  SUV / Crossover   45.437346  body_type
1            Sedan   28.496646  body_type
2     Pickup Truck   12.661415  body_type
3          Minivan    3.501958  body_type
4            Coupe    3.105212  body_type
5        Hatchback    2.329326  body_type
6            Wagon    1.958686  body_type
7      Convertible    1.358710  body_type
8              Van    1.150700  body_type
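
The cells in this section set the number of retained categories by hand for each column. As a rough sketch, the 20% threshold could also be applied directly. The helper `lump_rare_categories` and its parameters are hypothetical names introduced for this example.

```python
import pandas as pd

def lump_rare_categories(series, min_share=0.20, other_label='Other'):
    """Keep categories whose share is at least min_share; relabel the rest."""
    shares = series.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    return series.where(series.isin(keep), other_label)

# Toy data with shares of 50%, 30%, 10% and 10%.
body_type = pd.Series(['SUV'] * 5 + ['Sedan'] * 3 + ['Coupe'] + ['Van'])
lumped = lump_rare_categories(body_type)
print(lumped.value_counts())
```

Columns in which no category reaches the threshold would collapse to a single 'Other' value under this helper, which corresponds to the cases where the notebook drops the column instead.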


In [28]:
## city
print(final_df.loc[final_df['Variable'] == 'city'])
df = df.drop('city', axis=1) ## since there is no frequent category

        Category  Percentage Variable
9        Houston    1.378973     city
10   San Antonio    0.704550     city
11      Columbus    0.614850     city
12         Miami    0.613636     city
13       Phoenix    0.575995     city
14  Jacksonville    0.563929     city
15        Dallas    0.543666     city
16       Orlando    0.537899     city
17         Tampa    0.528109     city
18     Las Vegas    0.510048     city


In [29]:
## dealer_zip
print(final_df.loc[final_df['Variable'] == 'dealer_zip'])
df = df.drop('dealer_zip', axis=1) ## since there is no frequent category

   Category  Percentage    Variable
19    77477    0.413745  dealer_zip
20    33612    0.175151  dealer_zip
21    77034    0.171964  dealer_zip
22    30096    0.169308  dealer_zip
23    30060    0.164830  dealer_zip
24    77074    0.158380  dealer_zip
25    75006    0.153599  dealer_zip
26    32505    0.153067  dealer_zip
27    43228    0.149652  dealer_zip
28    77090    0.146465  dealer_zip


In [30]:
## engine_type
print(final_df.loc[final_df['Variable'] == 'engine_type'])
top_categories = df['engine_type'].value_counts().nlargest(2).index  ## the number of the most frequent categories is 2
df['engine_type'] = df['engine_type'].apply(lambda x: x if x in top_categories else 'Other')

                Category  Percentage     Variable
29                    I4   46.932201  engine_type
30                    V6   29.043575  engine_type
31                    V8   10.444480  engine_type
32  V8 Flex Fuel Vehicle    3.162280  engine_type
33  V6 Flex Fuel Vehicle    3.040251  engine_type
34                    H4    2.394970  engine_type
35                    I6    1.492806  engine_type
36  I4 Flex Fuel Vehicle    0.654767  engine_type
37             I4 Diesel    0.495932  engine_type
38                    I3    0.362065  engine_type


In [31]:
## fuel_type
print(final_df.loc[final_df['Variable'] == 'fuel_type'])
top_categories = df['fuel_type'].value_counts().nlargest(1).index  ## the number of the most frequent categories is 1
df['fuel_type'] = df['fuel_type'].apply(lambda x: x if x in top_categories else 'Other')

                  Category  Percentage   Variable
43                Gasoline   91.448259  fuel_type
44       Flex Fuel Vehicle    6.864433  fuel_type
45                  Diesel    1.221504  fuel_type
46               Biodiesel    0.366922  fuel_type
47                  Hybrid    0.091749  fuel_type
48  Compressed Natural Gas    0.006830  fuel_type
49                 Propane    0.000304  fuel_type


In [32]:
## listed_date
print(final_df.loc[final_df['Variable'] == 'listed_date'])
df = df.drop('listed_date', axis=1) ## since there is no frequent category

      Category  Percentage     Variable
54  2020-09-02    2.860168  listed_date
55  2020-09-03    2.609811  listed_date
56  2020-09-05    2.534681  listed_date
57  2020-09-04    2.324697  listed_date
58  2020-08-29    2.301551  listed_date
59  2020-08-28    2.210105  listed_date
60  2020-08-27    2.162144  listed_date
61  2020-08-26    2.092478  listed_date
62  2020-09-06    2.018790  listed_date
63  2020-08-30    2.002170  listed_date


In [33]:
## listing_color
print(final_df.loc[final_df['Variable'] == 'listing_color'])
top_categories = df['listing_color'].value_counts().nlargest(2).index  ## the number of the most frequent categories is 2
df['listing_color'] = df['listing_color'].apply(lambda x: x if x in top_categories else 'Other')

   Category  Percentage       Variable
64    WHITE   21.522858  listing_color
65    BLACK   20.556947  listing_color
66   SILVER   13.990453  listing_color
67     GRAY   12.841195  listing_color
68  UNKNOWN   10.692939  listing_color
69      RED    8.752770  listing_color
70     BLUE    8.406262  listing_color
71    BROWN    1.109947  listing_color
72    GREEN    0.856176  listing_color
73     GOLD    0.504887  listing_color


In [34]:
## make_name
print(final_df.loc[final_df['Variable'] == 'make_name'])
df = df.drop('make_name', axis=1) ## since there is no frequent category

     Category  Percentage   Variable
84       Ford   13.441930  make_name
85  Chevrolet   11.712503  make_name
86     Nissan    8.097851  make_name
87     Toyota    6.928179  make_name
88      Honda    6.597608  make_name
89       Jeep    5.898370  make_name
90    Hyundai    4.141548  make_name
91      Dodge    4.129102  make_name
92        GMC    3.849983  make_name
93        Kia    3.668078  make_name


In [35]:
## model_name
print(final_df.loc[final_df['Variable'] == 'model_name'])
df = df.drop('model_name', axis=1) ## since there is no frequent category

           Category  Percentage    Variable
94            F-150    3.103163  model_name
95   Silverado 1500    2.416750  model_name
96           Escape    1.955423  model_name
97             1500    1.903363  model_name
98            Rogue    1.901846  model_name
99   Grand Cherokee    1.628798  model_name
100          Accord    1.504948  model_name
101          Altima    1.485672  model_name
102          Malibu    1.411529  model_name
103        Explorer    1.394454  model_name


In [36]:
## sp_name
print(final_df.loc[final_df['Variable'] == 'sp_name'])
df = df.drop('sp_name', axis=1) ## since there is no frequent category

                         Category  Percentage Variable
116                       Carvana    0.464059  sp_name
117                         Vroom    0.386577  sp_name
118       Avis Car Sales - Denver    0.084236  sp_name
119           Pacific Auto Center    0.056006  sp_name
120           Honda of Fort Myers    0.053805  sp_name
121                Auto Spot LLC.    0.050314  sp_name
122                 ALM Kia South    0.049707  sp_name
123  EchoPark Automotive - Dallas    0.047355  sp_name
124                  Auto Express    0.045381  sp_name
125               Autos of Dallas    0.045230  sp_name


In [37]:
## transmission
print(final_df.loc[final_df['Variable'] == 'transmission'])
top_categories = df['transmission'].value_counts().nlargest(1).index ## the number of the most frequent categories is 1
df['transmission'] = df['transmission'].apply(lambda x: x if x in top_categories else 'Other')

        Category  Percentage      Variable
138            A   85.725192  transmission
139          CVT   11.538870  transmission
140            M    2.395274  transmission
141  Dual Clutch    0.340664  transmission


In [38]:
## wheel_system
print(final_df.loc[final_df['Variable'] == 'wheel_system'])
top_categories = df['wheel_system'].value_counts().nlargest(2).index ## the number of the most frequent categories is 2
df['wheel_system'] = df['wheel_system'].apply(lambda x: x if x in top_categories else 'Other')

    Category  Percentage      Variable
152      FWD   42.948502  wheel_system
153      AWD   25.246562  wheel_system
154      4WD   19.109447  wheel_system
155      RWD    8.690921  wheel_system
156      4X2    4.004568  wheel_system


In [39]:
## trimId
print(final_df.loc[final_df['Variable'] == 'trimId'])
df = df.drop('trimId', axis=1) ## since there is no frequent category

    Category  Percentage Variable
142   t82263    0.346811   trimId
143   t73359    0.314331   trimId
144   t82619    0.309853   trimId
145   t67808    0.294827   trimId
146   t66967    0.292551   trimId
147   t78889    0.290881   trimId
148   t83401    0.268418   trimId
149   t66217    0.260602   trimId
150   t87739    0.259084   trimId
151   t68162    0.255593   trimId


The feature 'power' is of mixed type and contains values formatted like '395 hp @ 5,600 RPM'. To simplify analysis, I separate this data into two columns: one for horsepower and another for RPM.

In [40]:
df[['hp', 'RPM']] = df['power'].str.split('@', expand=True)

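# Raw strings avoid invalid-escape warnings; expand=False makes extract return a Series
df['hp'] = df['hp'].str.extract(r'(\d+)', expand=False).astype(int)
df['RPM'] = df['RPM'].str.extract(r'(\d+,\d+|\d+)', expand=False).str.replace(',', '', regex=False)
df['RPM'] = pd.to_numeric(df['RPM'], errors='coerce').fillna(0).astype(int)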
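
As an alternative sketch, both numbers can be pulled out with a single regular expression that uses named groups. The third toy value, which lacks the RPM part, is invented here to show how a missing RPM ends up as 0; whether such values occur in the dataset is an assumption.

```python
import pandas as pd

# Toy values; the last one (no RPM part) is invented for illustration.
power = pd.Series(['395 hp @ 5,600 RPM', '170 hp @ 6,000 RPM', '355 hp'])

# Named groups become column names; the '@ ... RPM' part is optional.
pattern = r'(?P<hp>\d+)\s*hp(?:\s*@\s*(?P<rpm>[\d,]+)\s*RPM)?'
parsed = power.str.extract(pattern)

parsed['hp'] = parsed['hp'].astype(int)
# Strip the thousands separator; a missing RPM becomes 0, as in the cell above.
rpm_clean = parsed['rpm'].str.replace(',', '', regex=False)
parsed['rpm'] = pd.to_numeric(rpm_clean, errors='coerce').fillna(0).astype(int)
print(parsed)
```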

Since the feature 'power' has been separated and the dataset already contains a 'horsepower' feature, I remove 'power' and the redundant 'hp'.

In [41]:
df = df.drop(['power', 'hp'], axis=1)

'major_options' consists of a list of specific features, such as 'Steel Wheels', 'Bluetooth', 'Backup Camera'. Initially, the five most frequent features are identified, and new columns are created for each of these top features.

In [42]:
options_counter = Counter()

for options_string in df['major_options']:
   
    options_list = ast.literal_eval(options_string)
    
    options_counter.update(options_list)

options_count_df = pd.DataFrame(list(options_counter.items()), columns=['Option', 'Count'])

total_options_count = sum(options_counter.values())

options_count_df['Percentage'] = options_count_df['Count'].apply(lambda x: (x / total_options_count) * 100)

options_count_df.sort_values(by='Percentage', ascending=False) ## there are 144 major options

Unnamed: 0,Option,Count,Percentage
1,Bluetooth,1005469,12.534744
2,Backup Camera,988993,12.329345
0,Alloy Wheels,958358,11.947432
3,Heated Seats,591758,7.377189
6,Navigation System,526475,6.563335
...,...,...,...
140,Grand Tour Package,8,0.000100
143,Convenience Light Package,8,0.000100
138,Appearance and Protection Package,3,0.000037
141,Adaptive Ride Package,2,0.000025


In [43]:
df['Bluetooth'] = df['major_options'].apply(lambda options: 1 if 'Bluetooth' in options else 0)
df['Backup_Camera'] = df['major_options'].apply(lambda options: 1 if 'Backup Camera' in options else 0)
df['Alloy_Wheels'] = df['major_options'].apply(lambda options: 1 if 'Alloy Wheels' in options else 0)
df['Heated_Seats'] = df['major_options'].apply(lambda options: 1 if 'Heated Seats' in options else 0)
df['Navigation_System'] = df['major_options'].apply(lambda options: 1 if 'Navigation System' in options else 0)
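
The flags above test whether the option name occurs as a substring of the raw 'major_options' string. A sketch of a stricter variant parses each string into a list first, so that membership is exact. The toy values, including the option 'Bluetooth Audio Package', are invented to show the difference.

```python
import ast
import pandas as pd

major_options = pd.Series([
    "['Bluetooth', 'Backup Camera']",
    "['Alloy Wheels']",
    "['Bluetooth Audio Package']",   # invented option name, for illustration
])

# Turn each string representation into a real Python list.
parsed = major_options.apply(ast.literal_eval)

# Exact list membership: the third row does not count as 'Bluetooth'.
bluetooth_exact = parsed.apply(lambda opts: int('Bluetooth' in opts))
# Substring test on the raw string: the third row matches as well.
bluetooth_substring = major_options.apply(lambda s: int('Bluetooth' in s))

print(bluetooth_exact.tolist(), bluetooth_substring.tolist())
```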

In [44]:
df = df.drop('major_options', axis=1) ## removed

The feature 'torque' is of mixed type and contains values formatted like '258 lb-ft @ 1,500 RPM'. To simplify analysis, I separate this data into two columns: one for lb-ft and another for torque RPM.

In [45]:
df[['torque_value', 'torque_rpm']] = df['torque'].str.split(' @ ', expand=True)

df['torque_value'] = df['torque_value'].str.extract(r'(\d+)')[0].fillna(0).astype(int)
df['torque_rpm'] = df['torque_rpm'].str.replace(',', '', regex=False).str.extract(r'(\d+)')[0].fillna(0).astype(int)

In [46]:
df = df.drop('torque', axis=1) #removed

The features 'fuel_type' and 'transmission' now have only 2 distinct values each. They are mapped to binary indicators: 0 if the value is 'Other', 1 otherwise.

In [47]:
df['fuel_type'] = df['fuel_type'].map({'Other': 0, 'Gasoline': 1})
df['transmission'] = df['transmission'].map({'Other': 0, 'A': 1})

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1317720 entries, 2 to 3000039
Data columns (total 40 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   back_legroom         1317720 non-null  float64
 1   body_type            1317720 non-null  object 
 2   daysonmarket         1317720 non-null  int64  
 3   engine_displacement  1317720 non-null  float64
 4   engine_type          1317720 non-null  object 
 5   fleet                1317720 non-null  object 
 6   frame_damaged        1317720 non-null  object 
 7   franchise_dealer     1317720 non-null  int64  
 8   front_legroom        1317720 non-null  float64
 9   fuel_tank_volume     1317720 non-null  float64
 10  fuel_type            1317720 non-null  int64  
 11  has_accidents        1317720 non-null  object 
 12  height               1317720 non-null  float64
 13  horsepower           1317720 non-null  float64
 14  isCab                1317720 non-null  object 
 15  is_

The features 'isCab', 'salvage', 'theft_title', 'has_accidents', 'frame_damaged' and 'fleet' contain true/false values and these columns are converted into 1/0.

In [49]:
columns_to_convert = ['isCab', 'salvage', 'theft_title', 'has_accidents', 'frame_damaged', 'fleet']

df[columns_to_convert] = df[columns_to_convert].astype(int)

There are still some categorical columns.

In [50]:
categorical_columns_last = df.select_dtypes(include=['object', 'category']).columns
print('Categorical columns:', categorical_columns_last)

Categorical columns: Index(['body_type', 'engine_type', 'listing_color', 'wheel_system'], dtype='object')


### One-hot encoding

One-hot encoding is applied to the categorical columns in the dataset to convert categorical data into a numerical format.

In [51]:
df = pd.get_dummies(df, columns=categorical_columns_last)
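df = df.astype(int)  # dummy columns become 0/1; note this cast also truncates float columns (e.g. 'back_legroom') to integers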
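
A toy example of the encoding step. It passes `dtype=int` to `pd.get_dummies`, an alternative to casting the whole frame afterwards, which yields 0/1 dummy columns while keeping float columns such as 'back_legroom' as floats. The values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'back_legroom': [35.1, 38.1, 35.4],
    'body_type': ['Sedan', 'Other', 'Sedan'],
})

# dtype=int makes the dummy columns 0/1 directly, leaving float columns untouched.
encoded = pd.get_dummies(toy, columns=['body_type'], dtype=int)
print(encoded)
```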

### Feature Selection

First, the features and the target variable are defined.

In [82]:
X = df.drop('price', axis=1) # features
y = df['price'] # label or target variable


Before feature selection, the ranges of the target variable and the features are inspected to check whether the dataset needs to be standardized.

In [68]:
print(y.agg(['min', 'mean', 'max']))

min     2.000000e+02
mean    2.322665e+04
max     3.195000e+06
Name: price, dtype: float64


The target variable 'price' ranges from a minimum of 200 to a maximum of 3,195,000, with a mean of 23,226.65. Because of this wide range, the target is standardized to zero mean and unit variance so that it is on a scale comparable to the features. Standardization is a linear rescaling: it does not change the shape of the distribution, so the extreme prices remain outliers after scaling.

In [62]:
print(X.agg(['min', 'mean', 'max']))

      back_legroom  daysonmarket  engine_displacement     fleet  \
min        0.00000      0.000000          1000.000000  0.000000   
mean      37.07283     57.704326          3105.525757  0.217917   
max       59.00000   3573.000000          8400.000000  1.000000   

      frame_damaged  franchise_dealer  front_legroom  fuel_tank_volume  \
min        0.000000          0.000000       0.000000          7.000000   
mean       0.009328          0.647743      41.819724         18.637503   
max        1.000000          1.000000      67.000000         64.000000   

      fuel_type  has_accidents  ...  body_type_Sedan  engine_type_I4  \
min    0.000000       0.000000  ...         0.000000        0.000000   
mean   0.914483       0.155628  ...         0.284966        0.469322   
max    1.000000       1.000000  ...         1.000000        1.000000   

      engine_type_Other  engine_type_V6  listing_color_BLACK  \
min            0.000000        0.000000             0.000000   
mean           0.

The features also differ widely in scale: 'daysonmarket' ranges from 0 to 3573 and 'engine_displacement' from 1000 to 8400, while the binary indicators only take the values 0 and 1. Standardizing the features puts them on a common scale, so that algorithms sensitive to feature magnitude are not dominated by the larger-scale features.

In [83]:
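scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns) 
# Use a separate scaler for the target so the feature scaler is not refitted and overwritten
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.values.reshape(-1, 1)).ravel()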
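
For reference, the standardization that `StandardScaler` applies by default can be written out with NumPy: subtract the mean and divide by the population standard deviation. The sketch below uses invented prices and also shows why it is useful to keep the mean and standard deviation around, since they are needed to map scaled values back to the original price scale.

```python
import numpy as np

# Toy prices; illustrative values only.
price = np.array([200.0, 15000.0, 23000.0, 54700.0])

# Subtract the mean, divide by the population standard deviation (ddof=0).
mean, std = price.mean(), price.std()
price_scaled = (price - mean) / std

# Keeping mean and std allows scaled values to be mapped back to prices.
price_restored = price_scaled * std + mean
print(price_scaled.round(3))
```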

In [80]:
X_scaled_df.shape

(1317720, 47)

There are 47 features and I need to perform dimensionality reduction/feature selection to increase model training efficiency and potentially enhance model accuracy by eliminating irrelevant features that do not contribute significantly to the predictive power of the model. This process will also help in simplifying the model, making it faster to train and easier to interpret. For the feature selection process, the SelectKBest method is employed.

In [84]:
selector = SelectKBest(score_func=f_regression, k=10)

X_new = selector.fit_transform(X_scaled_df, y_scaled)

selected_features = X.columns[selector.get_support()]
selected_scores = selector.scores_[selector.get_support()]

# create dataframe for selected features
features_scores_df = pd.DataFrame({'Feature': selected_features, 'Score': selected_scores})

print(features_scores_df.sort_values(by="Score", ascending=False))

               Feature          Score
2           horsepower  587849.279022
8         torque_value  400011.260274
3              mileage  292430.969838
7                 year  238271.045671
9     wheel_system_FWD  179244.008135
6                width  163366.786099
0  engine_displacement  145273.841843
4       savings_amount  144520.637862
1     fuel_tank_volume  143329.937762
5            wheelbase  123929.852517
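
For intuition about these scores: as far as I understand the scikit-learn implementation, `f_regression` scores each feature separately, and with its default settings the score is a function of the Pearson correlation r between the feature and the target, F = r² / (1 − r²) · (n − 2). The sketch below computes this with NumPy on synthetic data; `univariate_f_score` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: one informative feature and one pure-noise feature.
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3.0 * informative + rng.normal(size=n)

def univariate_f_score(x, y):
    """F-statistic of a simple linear regression of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r**2 / (1 - r**2) * (len(y) - 2)

scores = {
    'informative': univariate_f_score(informative, y),
    'noise': univariate_f_score(noise, y),
}
print(scores)
```

Because the score depends only on the correlation and the sample size, very large values such as those in the table above are expected with more than a million rows, and the ranking is more informative than the absolute numbers.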


The scaled feature set and the scaled target variable are horizontally concatenated for modeling.

In [86]:
X_filtered = X_scaled_df[selected_features]
y_scaled_series = pd.Series(y_scaled, name='price')
df_filtered = pd.concat([X_filtered, y_scaled_series], axis=1)

Save the cleaned and preprocessed dataset to a CSV file.

In [88]:
df_filtered.to_csv('filtered_used_cars_data.csv', index=False)