# Exploratory Data Analysis

This notebook covers exploratory data analysis (EDA) of the dataset 'vehicles_us.csv'.

## Imports

In [59]:
import pandas as pd
import plotly.express as px
import notebook_auxiliaries as aux

## Load

In [60]:
data = pd.read_csv('../data/vehicles_us.csv')
data.info()
data.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7064 | 4200 | 2007.0 | honda odyssey | good | NaN | gas | 244300.0 | automatic | mini-van | black | NaN | 2018-05-08 | 128 |
| 16722 | 17900 | 2013.0 | gmc acadia | like new | 6.0 | gas | NaN | automatic | SUV | white | NaN | 2018-12-05 | 102 |
| 7257 | 4600 | 2007.0 | honda civic | excellent | 4.0 | hybrid | 136000.0 | automatic | sedan | grey | NaN | 2019-03-06 | 21 |
| 45063 | 8950 | 2011.0 | subaru forester | excellent | 4.0 | gas | NaN | automatic | SUV | blue | 1.0 | 2018-11-15 | 27 |
| 12472 | 29500 | 2005.0 | chevrolet corvette | like new | 8.0 | gas | 33000.0 | manual | convertible | yellow | NaN | 2018-12-08 | 6 |


The file was read successfully and the column names need no cleanup.

## Nulls & Data types

Making sure categorical variables are recognized as such by Pandas can improve the quality of visualizations.
Dealing with nulls makes for easier analysis.
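
As a minimal sketch of the first point, text columns can be converted to the pandas `category` dtype with `astype`. The toy frame and the choice of columns below are assumptions for illustration; they are not the conversion this notebook performs.

```python
import pandas as pd

# Toy frame standing in for a few columns of vehicles_us.csv (illustrative values).
toy = pd.DataFrame({
    "condition": ["good", "like new", "excellent", "good"],
    "fuel": ["gas", "gas", "hybrid", "gas"],
    "price": [4200, 17900, 4600, 8950],
})

# Convert only the text columns; numeric columns keep their dtype.
toy = toy.astype({"condition": "category", "fuel": "category"})

print(toy.dtypes)
```

Plotting libraries such as Plotly Express can then treat these columns as discrete groups rather than free text.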

In [61]:
aux.isna_report(data)

*** Missing/Null values report ***
----------------------------------

Examining column: 'price'
No missing values.

Examining column: 'model_year'
NaN sum: 3619
Percentage: 7.0238

Examining column: 'model'
No missing values.

Examining column: 'condition'
No missing values.

Examining column: 'cylinders'
NaN sum: 5260
Percentage: 10.2086

Examining column: 'fuel'
No missing values.

Examining column: 'odometer'
NaN sum: 7892
Percentage: 15.3168

Examining column: 'transmission'
No missing values.

Examining column: 'type'
No missing values.

Examining column: 'paint_color'
NaN sum: 9267
Percentage: 17.9854

Examining column: 'is_4wd'
NaN sum: 25953
Percentage: 50.3697

Examining column: 'date_posted'
No missing values.

Examining column: 'days_listed'
No missing values.

-----------------------------------
+++ Report END +++
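
`aux.isna_report` comes from this notebook's helper module, whose source is not shown here. A rough pandas-only equivalent might look like the sketch below; `missing_summary` is a hypothetical name and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return NaN counts and percentages for the columns that contain any NaN."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "nan_sum": counts,
        "percentage": (counts / len(frame) * 100).round(4),
    })
    return summary[summary["nan_sum"] > 0]

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "price": [4200, 17900, 4600, 8950],
    "odometer": [244300.0, np.nan, 136000.0, np.nan],
    "is_4wd": [np.nan, np.nan, np.nan, 1.0],
})

summary = missing_summary(toy)
print(summary)
```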


* About 50% of the values in the column 'is_4wd' are missing, so the column will be removed.
* The rows with missing values in 'model_year' amount to less than 10% of all rows, so they can be dropped.
* The rows with missing values in 'cylinders' amount to around 10% of all rows, so they can be dropped.
* The column 'paint_color' has over 17% missing values; they will be imputed with the most common color (the mode) for cars of the same model ('model' column).
* The column 'odometer' has around 15% missing values; they will be imputed with the mean odometer reading for cars of the same condition and model ('condition' & 'model' columns).
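
The two group-wise imputations planned above could be sketched as follows. This is one possible approach on a made-up toy frame, not the notebook's final code; `group_mode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "model": ["honda civic", "honda civic", "honda civic", "gmc acadia", "gmc acadia"],
    "condition": ["good", "good", "good", "excellent", "excellent"],
    "paint_color": ["grey", "grey", np.nan, "white", np.nan],
    "odometer": [100000.0, 120000.0, np.nan, 50000.0, np.nan],
})

def group_mode(values: pd.Series):
    """Most frequent non-null value in the group, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# paint_color: look up the mode per model, then fill only the missing entries.
mode_by_model = toy.groupby("model")["paint_color"].agg(group_mode)
toy["paint_color"] = toy["paint_color"].fillna(toy["model"].map(mode_by_model))

# odometer: broadcast the per-(condition, model) mean back to each row, then fill.
odometer_mean = toy.groupby(["condition", "model"])["odometer"].transform("mean")
toy["odometer"] = toy["odometer"].fillna(odometer_mean)

print(toy)
```

Groups in which every value is missing stay NaN under this approach, so a fallback (for example the overall mode or mean) may still be needed on the real data.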

In [None]:
# Drop the half-empty column, then the rows missing model_year or cylinders.
data = data.drop('is_4wd', axis=1)
data = data.dropna(subset=['model_year', 'cylinders'])
# TODO: impute paint_color with the most common color (mode) per model
# TODO: impute odometer with the mean odometer of vehicles of the same condition and model
# Interim step: fill paint_color with the overall mode across all vehicles.
car_color_mode = data['paint_color'].mode()[0]
data['paint_color'] = data['paint_color'].fillna(car_color_mode)
aux.isna_report(data)

100
*** Missing/Null values report ***
----------------------------------

Examining column: 'price'
No missing values.

Examining column: 'model_year'
No missing values.

Examining column: 'model'
No missing values.

Examining column: 'condition'
No missing values.

Examining column: 'cylinders'
No missing values.

Examining column: 'fuel'
No missing values.

Examining column: 'odometer'
NaN sum: 6590
Percentage: 15.3224

Examining column: 'transmission'
No missing values.

Examining column: 'type'
No missing values.

Examining column: 'paint_color'
No missing values.

Examining column: 'date_posted'
No missing values.

Examining column: 'days_listed'
No missing values.

-----------------------------------
+++ Report END +++


In [63]:
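# Cast model_year to the nullable integer dtype only when the helper reports 0
# (assumed here to mean that no value carries information after the decimal point).
if aux.check_info_in_decimals(data['model_year']) == 0:
    data['model_year'] = data['model_year'].astype('Int64')
data.info()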

<class 'pandas.core.frame.DataFrame'>
Index: 43009 entries, 0 to 51524
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         43009 non-null  int64  
 1   model_year    43009 non-null  Int64  
 2   model         43009 non-null  object 
 3   condition     43009 non-null  object 
 4   cylinders     43009 non-null  float64
 5   fuel          43009 non-null  object 
 6   odometer      36419 non-null  float64
 7   transmission  43009 non-null  object 
 8   type          43009 non-null  object 
 9   paint_color   43009 non-null  object 
 10  date_posted   43009 non-null  object 
 11  days_listed   43009 non-null  int64  
dtypes: Int64(1), float64(2), int64(2), object(7)
memory usage: 4.3+ MB
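
The source of `aux.check_info_in_decimals` is not shown in this notebook. A check in the same spirit, assuming the goal is to confirm that no value has a non-zero fractional part before casting to `Int64`, might look like this; `count_fractional_values` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def count_fractional_values(values: pd.Series) -> int:
    """Count non-null entries whose fractional part is not zero."""
    non_null = values.dropna()
    return int((non_null % 1 != 0).sum())

# Toy series (illustrative values only); NaN becomes <NA> after the cast.
model_year = pd.Series([2007.0, 2013.0, np.nan, 2011.0])

if count_fractional_values(model_year) == 0:
    model_year = model_year.astype("Int64")

print(model_year.dtype)
```

The nullable `Int64` dtype is used instead of plain `int64` so that the cast also works on a column that still holds missing values.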
