# Exploratory Data Analysis

This notebook covers exploratory data analysis (EDA) of the dataset 'vehicles_us.csv'.

## Imports

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import notebook_utils.notebook_auxiliaries as aux

## Load

In [2]:
data = pd.read_csv('../data/vehicles_us.csv')
data.info()
data.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


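|       | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|-------|-------|------------|-------|-----------|-----------|------|----------|--------------|------|-------------|--------|-------------|-------------|
| 32472 | 19995 | 2007.0 | chevrolet tahoe | excellent |  | gas |  | automatic | SUV | black |  | 2018-05-14 | 40 |
| 2504 | 8000 | 2010.0 | chevrolet suburban | excellent | 8.0 | gas |  | automatic | SUV | black | 1.0 | 2018-10-13 | 16 |
| 1690 | 30948 | 2018.0 | chevrolet colorado | excellent |  | gas |  | automatic | truck |  | 1.0 | 2018-09-09 | 21 |
| 18849 | 19895 | 2015.0 | jeep wrangler unlimited | good | 6.0 | gas | 105057.0 | automatic | SUV | green | 1.0 | 2019-02-20 | 46 |
| 19316 | 7500 | 2017.0 | nissan sentra | excellent | 4.0 | gas | 58.0 | automatic | sedan |  |  | 2019-03-26 | 37 |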


The file loaded without errors, and the column names are already consistent (lowercase, snake_case), so they need no renaming.

## Data Cleaning

In this section I'll fix missing values, deal with duplicates, and fit adequate data types.

### Missing values

In [3]:
aux.isna_report(data)

*** Missing/Null values report ***
----------------------------------
Single column analysis

3619 (7.0%) missing values in column: 'model_year'
5260 (10.2%) missing values in column: 'cylinders'
7892 (15.3%) missing values in column: 'odometer'
9267 (18.0%) missing values in column: 'paint_color'
25953 (50.4%) missing values in column: 'is_4wd'

Multiple column analysis (2)

363 (0.7%) concurrent missing values in: ('model_year', 'cylinders')
549 (1.1%) concurrent missing values in: ('model_year', 'odometer')
652 (1.3%) concurrent missing values in: ('model_year', 'paint_color')
1811 (3.5%) concurrent missing values in: ('model_year', 'is_4wd')
812 (1.6%) concurrent missing values in: ('cylinders', 'odometer')
950 (1.8%) concurrent missing values in: ('cylinders', 'paint_color')
2681 (5.2%) concurrent missing values in: ('cylinders', 'is_4wd')
1455 (2.8%) concurrent missing values in: ('odometer', 'paint_color')
4016 (7.8%) concurrent missing values in: ('odometer', 'is_4wd')
4637 (9.

* The column 'is_4wd' is missing in 50.4% of the rows, so the column will be removed.
* Rows with a missing 'model_year' make up 7.0% of the data, so they can be dropped.
* Rows with missing 'cylinders' make up 10.2% of the data, so they can be dropped as well.
* The column 'paint_color' has 18.0% missing values. They will be imputed with the most common color (the mode) for cars of the same model ('model' column).
* The column 'odometer' has 15.3% missing values. They will be imputed with the mean odometer reading for cars of the same condition and model ('condition' & 'model' columns).
* Rows with missing values in more than one column are rare in the data set.
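
The helper `aux.fix_nas` is not shown in this notebook. As a rough illustration of the plan in the list above, the sketch below uses a hypothetical function, `fix_nas_sketch`, that applies the same steps with plain pandas. It assumes the column names from `data.info()`; the actual helper may differ, for instance in how it handles groups that have no observed value at all (the sketch leaves those missing).

```python
import numpy as np
import pandas as pd


def fix_nas_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for aux.fix_nas; not the notebook's actual helper."""
    # Drop the column that is missing in about half of the rows.
    out = df.drop(columns=["is_4wd"])
    # Drop rows that lack a model year or a cylinder count.
    out = out.dropna(subset=["model_year", "cylinders"]).copy()

    # Most common colour per model; NaN when a model has no known colour.
    def group_mode(s: pd.Series):
        modes = s.mode()
        return modes.iloc[0] if not modes.empty else np.nan

    color_by_model = out.groupby("model")["paint_color"].agg(group_mode)
    out["paint_color"] = out["paint_color"].fillna(out["model"].map(color_by_model))

    # Mean odometer reading within each (condition, model) group.
    odo_mean = out.groupby(["condition", "model"])["odometer"].transform("mean")
    out["odometer"] = out["odometer"].fillna(odo_mean)
    return out
```

Grouping before imputing keeps the filled values plausible for each kind of car: a missing color is taken from the same model, and a missing odometer reading from cars of the same model in the same condition.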

In [4]:
data = aux.fix_nas(data)

*** Fixing missing values ***
-----------------------------
Dropping column: 'is_4wd'...
Dropping rows with missing values from 'model_year' and 'cylinders'...
Inputting values for 'paint_color'...
Inputting values for 'odometer'...
Starting rows: 51525 Rows after: 43009 (8516 rows lost, 16.5%)
Starting columns: 13 Columns after: 12 (1 columns lost, 7.7%)
-----------------------------
+++ Missing values FIXED +++


Re-running the missing-values report confirms the fix:

In [5]:
aux.isna_report(data)

*** Missing/Null values report ***
----------------------------------
No missing values found.
-----------------------------------
+++ Report END +++
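
The report helper `aux.isna_report` is also not shown here. A minimal sketch of how the single-column and two-column counts could be computed follows; `isna_summary` and `fmt` are hypothetical names, and the real helper may format or group its output differently.

```python
from itertools import combinations

import pandas as pd


def fmt(count: int, total: int) -> str:
    """Format a count with its share of the total, e.g. '3619 (7.0%)'."""
    return f"{count} ({count / total:.1%})"


def isna_summary(df: pd.DataFrame):
    """Count missing values per column and per pair of columns (hypothetical helper)."""
    missing = df.isna()
    # Only columns that have at least one missing value are reported.
    cols = [c for c in df.columns if missing[c].any()]
    single = {c: int(missing[c].sum()) for c in cols}

    pairs = {}
    for a, b in combinations(cols, 2):
        # Rows where both columns are missing at the same time.
        both = int((missing[a] & missing[b]).sum())
        if both:
            pairs[(a, b)] = both
    return single, pairs
```

Calling `single, pairs = isna_summary(data)` on a fully cleaned frame would return two empty dictionaries, which corresponds to the "No missing values found." message above.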


### Duplicates

In [6]:
data.duplicated().sum()

np.int64(0)

* There are no explicit, full-row duplicates.
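* No variable or combination of variables seems to work as a unique identifier.
* Without such an identifier, repeated values in a subset of columns do not indicate a duplicated listing, so there is no need to check for duplicates in individual columns or combinations of columns.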
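
One way to test whether any column, or small combination of columns, could serve as a unique identifier is to look for subsets with no repeated rows. The sketch below is an assumption about how such a check might be done, not the method used in this notebook; `candidate_keys` is a hypothetical name.

```python
from itertools import combinations

import pandas as pd


def candidate_keys(df: pd.DataFrame, max_size: int = 2):
    """Return column subsets (up to max_size columns) whose values never repeat."""
    keys = []
    for size in range(1, max_size + 1):
        for cols in combinations(df.columns, size):
            # A subset is a candidate key if no row repeats on those columns.
            if not df.duplicated(subset=list(cols)).any():
                keys.append(cols)
    return keys
```

An empty result for `candidate_keys(data)` would support the observation that no identifier exists among single columns and pairs. The number of subsets grows quickly with `max_size`, so small values are advisable.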

### Data types

In [7]:
data = aux.fix_data_types(data)

Fitting data type for column: 'model'
	cardinality: 100
	due to context as category, casting to 'category'
Fitting data type for column: 'condition'
	cardinality: 6
	due to context as category, casting to 'category'
Fitting data type for column: 'fuel'
	cardinality: 5
	due to context as category, casting to 'category'
Fitting data type for column: 'transmission'
	cardinality: 3
	due to context as category, casting to 'category'
Fitting data type for column: 'type'
	cardinality: 13
	due to context as category, casting to 'category'
Fitting data type for column: 'paint_color'
	cardinality: 12
	due to context as category, casting to 'category'
Fitting data type for column: 'price'
	min: 1	max: 375000
	requires negative values?: False
	requires decimal part?: False
	due to monetary context, casting to 'float32'
Fitting data type for column: 'model_year'
	min: 1908.0	max: 2019.0
	requires negative values?: False
	requires decimal part?: False
	due to context as year, casting to 'uint16'
Fit

Results of data type correction:

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43009 entries, 0 to 51524
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   price         43009 non-null  float32       
 1   model_year    43009 non-null  uint16        
 2   model         43009 non-null  category      
 3   condition     43009 non-null  category      
 4   cylinders     43009 non-null  uint8         
 5   fuel          43009 non-null  category      
 6   odometer      43009 non-null  float32       
 7   transmission  43009 non-null  category      
 8   type          43009 non-null  category      
 9   paint_color   43009 non-null  category      
 10  date_posted   43009 non-null  datetime64[ns]
 11  days_listed   43009 non-null  uint16        
dtypes: category(6), datetime64[ns](1), float32(2), uint16(2), uint8(1)
memory usage: 1.4 MB
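
The log above suggests a set of rules: text columns become categories, whole non-negative numbers get the smallest unsigned integer type that fits, monetary and fractional values become `float32`, and dates are parsed. The sketch below is one possible way to express those rules; `fit_dtypes_sketch` and `smallest_uint` are hypothetical names, the choice of which columns count as dates or money is passed in as an assumption, and `aux.fix_data_types` may decide differently.

```python
import numpy as np
import pandas as pd


def smallest_uint(max_value):
    """Return the smallest unsigned integer dtype that can hold max_value."""
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return dtype
    raise ValueError("value too large for an unsigned 64-bit integer")


def fit_dtypes_sketch(df, date_cols=("date_posted",), money_cols=("price",)):
    """Hypothetical stand-in for aux.fix_data_types; assumes no missing values."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if col in date_cols:
            out[col] = pd.to_datetime(s, format="%Y-%m-%d")
        elif col in money_cols:
            # Monetary values are kept as floats, as in the log above.
            out[col] = s.astype("float32")
        elif pd.api.types.is_numeric_dtype(s):
            whole = (s % 1 == 0).all()
            if whole and s.min() >= 0:
                out[col] = s.astype(smallest_uint(s.max()))
            else:
                out[col] = s.astype("float32")
        else:
            # Remaining text columns have low cardinality here.
            out[col] = s.astype("category")
    return out
```

Smaller numeric types and categories account for the drop in memory usage reported by `data.info()`, although part of that drop also comes from the removed rows and column.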


### Data cleaning: Results

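* I've removed an unrecoverable column ('is_4wd').
* I've dropped the rows with a missing 'model_year' or 'cylinders' (8516 rows, 16.5%).
* I've imputed sensible values for the missing entries in 'paint_color' and 'odometer'.
* No duplicates were detected.
* Data types have been fitted that better represent the data.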

## Exploration