# **Sample EDA**

In [31]:
import pandas as pd
from pandasql import sqldf
import plotly.express as px

### **Step 1: Preview Dataset**

In [18]:
training_data = pd.read_parquet('clean_car_listings.parquet')
training_data

| | price | model_year | make | model | trim | mileage | exterior_color | interior_color | num_accidents | num_owners | usage_type | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13895 | 2006 | BMW | Z4 | Roadster 3.0si | 114889 | White | Unknown | 0 | 5 | Personal | Tempe | AZ |
| 1 | 19888 | 2008 | BMW | M5 | Sedan | 129195 | Blue | Black | 0 | 3 | Personal | Tempe | AZ |
| 2 | 19999 | 2008 | BMW | M6 | Coupe | 93700 | Gray | Black | 0 | 2 | Fleet | West Park | FL |
| 3 | 18995 | 2009 | BMW | Z4 | Roadster sDrive30i | 95185 | Gray | Black | 1 | 5 | Fleet | Englewood | CO |
| 4 | 6500 | 2010 | BMW | X3 | xDrive30i AWD | 126832 | Red | Beige | 0 | 3 | Personal | Bountiful | UT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 74590 | 12687 | 2014 | Jeep | Patriot | Latitude 4WD | 101479 | White | Gray | 0 | 5 | Personal | Littleton | CO |
| 74591 | 23409 | 2016 | Mercedes-Benz | GLA | GLA 250 4MATIC | 35602 | White | Brown | 0 | 5 | Personal | Orlando | FL |
| 74592 | 8191 | 2012 | Chevrolet | Sonic | LT 2LT Sedan AT | 130163 | Red | Black | 0 | 5 | Personal | Ft Collins | CO |
| 74593 | 16998 | 2014 | Mazda | Mazda5 | Sport Automatic | 58600 | Silver | Black | 0 | 5 | Personal | Oxnard | CA |


### **Step 2: Get Schema Information**

In [19]:
with open("clean_car_listings_schema.txt", "r") as file:
    for line in file:
        # Each line already ends with a newline, so suppress print's own.
        print(line, end="")

TRAINING DATA SCHEMA

price - price of the vehicle listed.
model_year - model year of the vehicle listed.
make - make of the vehicle listed.
trim - trim level of the vehicle specified.
mileage - number of miles on the vehicle's odometer.
exterior_color - exterior color of the vehicle listed.
interior_color - interior color of the vehicle listed.
num_accidents - number of accidents the vehicle listed has been involved in.
num_owners - number of owners associated with the vehicle listed.
usage_type - identifies whether the vehicle listed was used as part of a fleet or for personal use.
city - city where the vehicle is listed.
state - state where the vehicle is listed.
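
The schema output can also be checked against the dataset's columns. The sketch below is a minimal illustration, not part of the original notebook: it assumes every schema entry follows the `name - description` pattern seen in the output, and it uses a hand-copied, abbreviated subset of the schema lines and column names rather than reading the real files.

```python
# Abbreviated schema lines copied from the printed schema (assumption: each
# entry is formatted as "name - description").
schema_lines = [
    "TRAINING DATA SCHEMA",
    "",
    "price - price of the vehicle listed.",
    "model_year - model year of the vehicle listed.",
    "make - make of the vehicle listed.",
    "trim - trim level of the vehicle specified.",
]

# Abbreviated column list copied from the dataset preview.
data_columns = ["price", "model_year", "make", "model", "trim"]

# Build a {column_name: description} mapping, skipping the title and blank lines.
documented = {}
for line in schema_lines:
    name, sep, description = line.partition(" - ")
    if sep:
        documented[name.strip()] = description.strip()

# Columns present in the data but missing from the schema file.
undocumented = [col for col in data_columns if col not in documented]
print(undocumented)
```

In the notebook itself, `schema_lines` could come from the open file and `data_columns` from `training_data.columns`. Comparing the preview with the schema listing by eye, `model` is the one column without a schema entry.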


### **Step 3: View Descriptive Statistics**

In [22]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
training_data.describe()

| | price | model_year | mileage | num_accidents | num_owners |
|---|---|---|---|---|---|
| count | 74595.0 | 74595.0 | 74595.0 | 74595.0 | 74595.0 |
| mean | 28190.25 | 2018.45 | 52617.49 | 0.32 | 1.6 |
| std | 11695.49 | 3.03 | 33815.88 | 0.61 | 0.87 |
| min | 2499.0 | 2005.0 | 5.0 | 0.0 | 0.0 |
| 25% | 19998.0 | 2017.0 | 26807.5 | 0.0 | 1.0 |
| 50% | 25998.0 | 2019.0 | 46246.0 | 0.0 | 1.0 |
| 75% | 33998.0 | 2021.0 | 73691.5 | 1.0 | 2.0 |
| max | 99998.0 | 2024.0 | 170000.0 | 7.0 | 9.0 |


##### **Key Insights:**
###### - There don't appear to be any null values in the numeric columns: every count equals 74,595. The non-numeric columns are not covered by describe() and need a separate check.
###### - The price, mileage, num_accidents, and num_owners columns show a lot of variability in their upper tails (see the difference between 75% and max).
###### - The mean of the "model_year" column is high (2018.45, in a range of 2005 to 2024), which could indicate a bias toward newer vehicles.
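
As a sketch of how these observations could be checked directly, the snippet below counts nulls in every column (not just the numeric ones) and measures the gap between the 75th percentile and the maximum. The `sample` frame is a made-up miniature stand-in for `training_data`, including one deliberately missing value; it is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-in for training_data (one value set to None on purpose).
sample = pd.DataFrame({
    "price": [13895, 19888, 19999, 18995, 6500],
    "make": ["BMW", "BMW", None, "BMW", "BMW"],
    "mileage": [114889, 129195, 93700, 95185, 126832],
})

# describe() summarizes only numeric columns by default,
# so count missing values across all columns explicitly.
null_counts = sample.isna().sum()
print(null_counts)

# Gap between the 75th percentile and the maximum for each numeric column.
numeric = sample.select_dtypes("number")
upper_tail = numeric.max() - numeric.quantile(0.75)
print(upper_tail)
```

Running the same two steps on `training_data` would confirm or refute the null-value observation for the text columns as well.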

### **Step 4: Univariate EDA**

#### **Model Year Bar Chart**

In [34]:
listings_by_year = sqldf("SELECT model_year, COUNT(*) AS num_listings FROM training_data GROUP BY model_year")
fig = px.bar(listings_by_year, x='model_year', y='num_listings', title='# of Listings by Model Year')
fig.update_yaxes(showgrid=False)

##### **Key Insights:**
###### - Values are skewed to the left: most listings are recent model years, with a thin tail of older ones.
###### - There is very little data to support predictions for model years 2005 through 2011.
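
The skew and the sparse early years can also be quantified rather than read off the chart. This is a minimal sketch using made-up model-year values, not the real dataset; in the notebook the series would be `training_data['model_year']`.

```python
import pandas as pd

# Hypothetical model-year values, made up for illustration only.
model_year = pd.Series([2006, 2012, 2017, 2018, 2019, 2019, 2020, 2021, 2021, 2022])

# A negative sample skewness indicates a longer tail toward older years (left skew).
skewness = model_year.skew()

# Share of listings that fall in the sparse 2005-2011 range (bounds inclusive).
old_share = model_year.between(2005, 2011).mean()

print(f"skewness={skewness:.2f}, share 2005-2011={old_share:.1%}")
```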

#### **Price Histogram**

In [39]:
fig = px.histogram(training_data, x='price', nbins=100, title='# of Listings by Price')
fig.update_yaxes(showgrid=False)

##### **Key Insights:**
###### - Values are skewed to the right.
###### - There is little data to support predictions above $50,000 or below $8,000.
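
One way to put a number on the thin tails is to compute the share of listings outside the $8,000 to $50,000 band and to compare the mean with the median. The prices below are made up for illustration; in the notebook the series would be `training_data['price']`.

```python
import pandas as pd

# Hypothetical prices, made up for illustration only.
prices = pd.Series([6500, 13895, 18995, 19888, 19999, 25998, 28500, 33998, 52000, 99998])

# Bounds taken from the insight above; between() is inclusive on both ends.
low, high = 8_000, 50_000
in_range = prices.between(low, high)
share_outside = 1 - in_range.mean()

# A mean above the median is consistent with a right-skewed distribution.
mean_price = prices.mean()
median_price = prices.median()

print(f"share outside ${low:,}-${high:,}: {share_outside:.1%}")
print(f"mean={mean_price:.2f}, median={median_price:.2f}")
```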

#### **Make Bar Chart**

In [41]:
listings_by_make = sqldf("SELECT make, COUNT(*) AS num_listings FROM training_data GROUP BY make ORDER BY num_listings DESC")
fig = px.bar(listings_by_make, x='make', y='num_listings', title='# of Listings by Make')
fig.update_yaxes(showgrid=False)

##### **Key Insights:**
###### - This data seems to reflect the real-world market: Toyota, Honda, Ford, and Chevrolet are all top sellers, whereas luxury brands such as Cadillac, Volvo, and INFINITI account for fewer listings.
###### - There needs to be a clear cutoff point for what counts as enough evidence, since a make's listings can be spread across many models. For instance, MINI sells only 6 distinct models, whereas Cadillac sells over 30.
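
A cutoff like this could be explored by counting listings and distinct models per make. The sketch below uses a made-up handful of rows, and `MIN_LISTINGS` is a hypothetical threshold chosen only to make the example work; neither comes from the original analysis.

```python
import pandas as pd

# Hypothetical listings, made up for illustration only.
listings = pd.DataFrame({
    "make":  ["BMW", "BMW", "BMW", "BMW", "MINI", "MINI", "Jeep"],
    "model": ["Z4", "M5", "M6", "Z4", "Cooper", "Cooper", "Patriot"],
})

# Listings and distinct models per make, using named aggregation.
per_make = listings.groupby("make").agg(
    num_listings=("model", "size"),
    num_models=("model", "nunique"),
)

# Average evidence available per model within each make.
per_make["listings_per_model"] = per_make["num_listings"] / per_make["num_models"]

# Hypothetical cutoff: keep only makes with at least this many listings.
MIN_LISTINGS = 2
sufficient = per_make[per_make["num_listings"] >= MIN_LISTINGS]

print(per_make)
print(sufficient.index.tolist())
```

Looking at `listings_per_model` alongside the raw count helps separate makes with few models but plenty of data per model from makes whose listings are spread thinly.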