# Basic Exploratory Data Analysis

## Description of the data

The dataset contains the following fields:

- `price`
- `model_year`
- `model`
- `condition`
- `cylinders`
- `fuel`
- `odometer`
- `transmission`
- `type`
- `paint_color`
- `is_4wd`
- `date_posted`
- `days_listed`

## Imports

In [1]:
import pandas as pd
import numpy as np

print("pandas version: " + pd.__version__)
print("numpy version: " + np.__version__)

pandas version: 2.2.2
numpy version: 2.1.0


In [2]:
# visualization module
import plotly.express as px

## Input Data

In [3]:
vehicles = pd.read_csv("../vehicles_us.csv")
vehicles.head(10)

,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
5,14990,2014.0,chrysler 300,excellent,6.0,gas,57954.0,automatic,sedan,black,1.0,2018-06-20,15
6,12990,2015.0,toyota camry,excellent,4.0,gas,79212.0,automatic,sedan,white,,2018-12-27,73
7,15990,2013.0,honda pilot,excellent,6.0,gas,109473.0,automatic,SUV,black,1.0,2019-01-07,68
8,11500,2012.0,kia sorento,excellent,4.0,gas,104174.0,automatic,SUV,,1.0,2018-07-16,19
9,9200,2008.0,honda pilot,excellent,,gas,147191.0,automatic,SUV,blue,1.0,2019-02-15,17


In [4]:
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


We infer that `price` is the target variable and that the other 12 columns are the features. Going by the stored dtypes, five features are numeric (four `float64` and one `int64`) and the other seven are `object` columns, most of them categorical. None of the `float64` fields actually needs a fractional part, so they should be converted to `int64` or to another more appropriate data type. For example, `is_4wd` is stored as a float but is really a boolean flag and is better represented as a categorical variable. The `date_posted` column should be converted to a `datetime` type. There are missing values in `model_year`, `cylinders`, `odometer`, `paint_color`, and `is_4wd`.
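The `date_posted` conversion mentioned above could look roughly like this. This is a minimal sketch on a small stand-in frame named `sample` (a name used only here), with dates copied from the preview; it assumes every value follows the `YYYY-MM-DD` pattern seen in the first rows.

```python
import pandas as pd

# Stand-in frame; the dates are copied from the first rows of the preview
sample = pd.DataFrame({"date_posted": ["2018-06-23", "2018-10-19", "2019-02-07"]})

# Parse the strings into datetime values (assumes a YYYY-MM-DD layout throughout)
sample["date_posted"] = pd.to_datetime(sample["date_posted"], format="%Y-%m-%d")

# Once parsed, date parts are available through the .dt accessor
sample["year_posted"] = sample["date_posted"].dt.year

print(sample.dtypes)
```

On the full data the same call would be applied to `vehicles['date_posted']`.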

## Handling missing data

Before any data type conversion can be completed, the missing values must be handled. The following operations handle the missing data by either dropping or filling it. First we drop the rows that can't confidently be filled.
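The ordering matters because a NumPy-backed `int64` column cannot hold `NaN`. The sketch below illustrates this on a small stand-in series whose values are in the style of the `cylinders` column; it is an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in series with one missing value
cylinders = pd.Series([6.0, 4.0, 8.0, np.nan])

try:
    cylinders.astype("int64")
    cast_failed = False
except ValueError as err:
    # pandas refuses to cast NaN to a NumPy integer dtype
    cast_failed = True
    print("cast failed:", err)

# After the missing value is dropped, the same cast succeeds
clean = cylinders.dropna().astype("int64")
print(clean.tolist())
```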

In [5]:
# Drop rows with missing values in model_year or cylinders

vehicles = vehicles.dropna(subset=['model_year', 'cylinders'])

We drop the rows with missing `model_year` or `cylinders` because these fields are essential to describing a vehicle and there is no reliable way to fill the gaps with applicable data.

The remaining features with missing values are handled as follows:

- `odometer`: fill missing values with the median.
- `paint_color`: fill with `unknown`. Too many values are missing (more than 8,000 rows) for dropping them to be a good idea, and the color is unlikely to be an important feature anyway.
- `is_4wd`: fill with `0`. Many of the missing values here likely represent the boolean opposite of `1.0`.


In [6]:
# Fill NA paint_color
vehicles['paint_color'] = vehicles['paint_color'].fillna('unknown')

In [7]:
# Fill NA odometer
vehicles['odometer'] = vehicles['odometer'].fillna(vehicles['odometer'].median())

In [8]:
# Fill NA is_4wd
vehicles['is_4wd'] = vehicles['is_4wd'].fillna(0)

In [9]:
# Check for NA values
vehicles.isnull().sum()

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64

There are no longer any null values in the `vehicles` dataframe. Now we can appropriately update the data types of the feature columns in question.

In [10]:
vehicles = vehicles.astype({'is_4wd': 'int64', 'odometer': 'int64', 'cylinders': 'int64', 'model_year': 'int64'})
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43009 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   price         43009 non-null  int64 
 1   model_year    43009 non-null  int64 
 2   model         43009 non-null  object
 3   condition     43009 non-null  object
 4   cylinders     43009 non-null  int64 
 5   fuel          43009 non-null  object
 6   odometer      43009 non-null  int64 
 7   transmission  43009 non-null  object
 8   type          43009 non-null  object
 9   paint_color   43009 non-null  object
 10  is_4wd        43009 non-null  int64 
 11  date_posted   43009 non-null  object
 12  days_listed   43009 non-null  int64 
dtypes: int64(6), object(7)
memory usage: 4.6+ MB


Were all the missing values in `is_4wd` really just the boolean opposite of `1.0`? Many of them probably are, but some could be genuinely missing `1.0` values. Let's check for those with a simple value count analysis of `is_4wd` across `model`.
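One possible shape for that value count analysis is a cross-tabulation of `model` against `is_4wd`. The sketch below runs on a small stand-in frame: the model names come from the preview, but the `is_4wd` values are made up for illustration.

```python
import pandas as pd

# Stand-in frame; the is_4wd values are invented for this example
sample = pd.DataFrame({
    "model": ["ford f-150", "ford f-150", "ford f-150",
              "honda pilot", "honda pilot", "toyota camry"],
    "is_4wd": [1, 1, 0, 1, 1, 0],
})

# Number of listings per model for each is_4wd value
counts = pd.crosstab(sample["model"], sample["is_4wd"])

# Share of listings flagged as 4wd for each model
share_4wd = sample.groupby("model")["is_4wd"].mean()

print(counts)
print(share_4wd)
```

A model whose listings are almost all flagged as 4wd but which still has a few zeros would be a candidate for rows where the fill hid a true `1`.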

In [11]:
num_models = vehicles['model'].nunique()
print("Number of unique car models in the vehicle data set:", num_models)

Number of unique car models in the vehicle data set: 100


To go through all of these model names and confirm that there are no fuzzy duplicates, we can use the Python library `thefuzz`, which performs fuzzy string matching.

In [13]:
from thefuzz import fuzz

# Unique model names (the column is named 'model', not 'models')
model_names = sorted(vehicles['model'].unique())

# Pairwise similarity matrix; fuzz.ratio returns an integer score from 0 to 100
similarity = pd.DataFrame(
    [[fuzz.ratio(a, b) for b in model_names] for a in model_names],
    index=model_names,
    columns=model_names,
)

print(similarity.to_string())
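A full pairwise comparison of 100 model names produces a large matrix that is hard to read by eye. One way to narrow it down is to list only the pairs whose similarity exceeds a threshold. The sketch below uses the standard library's `difflib` as a stand-in for `thefuzz`, so the scores will not match `fuzz.ratio` exactly. The name `ford f150` and the `0.9` cut-off are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Model names from the preview, plus "ford f150", a made-up
# near-duplicate added purely for illustration
model_names = sorted(["ford f-150", "ford f150", "honda pilot", "toyota camry"])

THRESHOLD = 0.9  # assumed cut-off; tune it for the real data

close_pairs = []
for a, b in combinations(model_names, 2):
    score = SequenceMatcher(None, a, b).ratio()  # similarity in [0, 1]
    if score >= THRESHOLD:
        close_pairs.append((a, b, round(score, 3)))

print(close_pairs)
```

Any pair reported this way still needs a manual look before the names are merged, since two similar strings can be distinct models.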