In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv('Data/vehicles.csv')
df.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,
4,7210384030,https://greensboro.craigslist.org/cto/d/trinit...,greensboro,https://greensboro.craigslist.org,4900,,,,,,...,,,,,,,nc,,,


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

In [5]:
# percentage of missing values per column
(df.isnull().mean() * 100).round(2)

id                0.00
url               0.00
region            0.00
region_url        0.00
price             0.00
year              0.28
manufacturer      4.13
model             1.24
condition        40.79
cylinders        41.62
fuel              0.71
odometer          1.03
title_status      1.93
transmission      0.60
VIN              37.73
drive            30.59
size             71.77
type             21.75
paint_color      30.50
image_url         0.02
description       0.02
county          100.00
state             0.00
lat               1.53
long              1.53
posting_date      0.02
dtype: float64

Data Cleaning Notes:
- Remove `county`: it is 100% missing, so it carries no usable information.
- Keep the relevant columns and also remove the following: `id`, `url`, `region_url`, `VIN`, `image_url`, `lat`, `long`.
- Rename `year` to `year_manufactured` and `odometer` to `miles`.
- Will need to impute most of the missing values for:
    - KNN imputer: `condition`, `cylinders`
    - Inspect each unique value and check whether an appropriate fill value can be determined from the other car info: `drive`, `size`, `type`, and `paint_color`
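One possible way to fill a column "based on other car info" is to borrow the most common value seen for the same `model`. The sketch below is only an illustration under that assumption: the helper name `fill_from_group_mode`, the choice of `model` as the grouping key, and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd


def fill_from_group_mode(frame, target, by):
    """Fill missing `target` values with the most frequent value in the same `by` group.

    Hypothetical helper: rows whose group has no observed `target` value,
    or whose `by` key is itself missing, are left as missing.
    """
    # Most frequent observed target value per group (ties: first in sorted order).
    observed = frame.dropna(subset=[by, target])
    group_mode = observed.groupby(by)[target].agg(lambda values: values.mode().iloc[0])
    # Look up each row's group mode and use it only where the target is missing.
    return frame[target].fillna(frame[by].map(group_mode))


# Toy rows (made up for illustration, not taken from vehicles.csv).
toy = pd.DataFrame({
    "model": ["civic", "civic", "civic", "f-150", "f-150", None],
    "drive": ["fwd", None, "fwd", "4wd", None, None],
})
filled_drive = fill_from_group_mode(toy, "drive", "model")
print(filled_drive)
```

Rows with no usable group stay missing, so the remaining gaps would still need another strategy (or an explicit "unknown" category).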

Feature Engineering Notes:
- The year of the selling date can be extracted from `posting_date`.
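A minimal sketch of that extraction, assuming every non-missing `posting_date` is an ISO 8601 string that starts with a four-digit year (as in `2021-04-04T03:21:31-0600`). The helper name `extract_posting_year` is hypothetical. Slicing the string keeps the year as written in the listing's local time, which avoids any shift from timezone conversion.

```python
import pandas as pd


def extract_posting_year(posting_date):
    """Return the four-digit year at the start of each ISO 8601 timestamp string.

    Hypothetical helper: missing or malformed entries become <NA>.
    """
    year_text = posting_date.str.slice(0, 4)
    # errors="coerce" turns anything non-numeric into NaN; Int64 keeps missing values as <NA>.
    return pd.to_numeric(year_text, errors="coerce").astype("Int64")


# Sample values in the same format as the posting_date column.
sample_dates = pd.Series([
    "2021-04-04T03:21:31-0600",
    "2021-04-04T03:21:07-0600",
    None,
])
posting_years = extract_posting_year(sample_dates)
print(posting_years)
```

Applied to the full data, the result could be stored in a new column of the cleaned frame; the column name would be the author's choice.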

In [25]:
# columns to remove; everything else is kept as relevant
cols_to_remove = ['county', 'id', 'url', 'region_url', 'VIN', 'image_url', 'lat', 'long']

# drop the unneeded columns
vehicles_df = df.drop(columns=cols_to_remove)

# rename the year and odometer columns
vehicles_df = vehicles_df.rename(columns={'year': 'year_manufactured',
                                          'odometer': 'miles'})
vehicles_df.tail()

Unnamed: 0,region,price,year_manufactured,manufacturer,model,condition,cylinders,fuel,miles,title_status,transmission,drive,size,type,paint_color,description,state,posting_date
426875,wyoming,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,fwd,,sedan,,Carvana is the safer way to buy a car During t...,wy,2021-04-04T03:21:31-0600
426876,wyoming,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,fwd,,sedan,red,Carvana is the safer way to buy a car During t...,wy,2021-04-04T03:21:29-0600
426877,wyoming,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,,,hatchback,white,Carvana is the safer way to buy a car During t...,wy,2021-04-04T03:21:17-0600
426878,wyoming,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,fwd,,sedan,silver,Carvana is the safer way to buy a car During t...,wy,2021-04-04T03:21:11-0600
426879,wyoming,30590,2019.0,bmw,4 series 430i gran coupe,good,,gas,22716.0,clean,other,rwd,,coupe,,Carvana is the safer way to buy a car During t...,wy,2021-04-04T03:21:07-0600
