<h1>Title: Sprint 4 Project - Exploratory Data Analysis</h1>

This project aims to analyze a dataset of car listings in the US. We'll explore various aspects of the data, such as price distribution, model year distribution, odometer readings, and more. Visualizations will be created using Plotly and integrated into a Streamlit app.

Dataset Description

The dataset contains various columns, including:
- `price`: Price of the car in USD
- `model_year`: Year the car model was made
- `odometer`: Odometer reading of the car
- `days_listed`: Number of days the car has been listed for sale
- `cylinders`: Number of cylinders in the car's engine
- `condition`, `fuel`, `transmission`: Various categorical attributes of the car
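
The column list above can double as a quick schema check once the file is loaded. The sketch below is only illustrative: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the three sample rows are copied from the first rows of the dataset so the example runs without the full CSV.

```python
import io

import pandas as pd

# Hypothetical list of the columns this analysis relies on (from the description above)
EXPECTED_COLUMNS = ["price", "model_year", "odometer", "days_listed",
                    "cylinders", "condition", "fuel", "transmission"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Three rows copied from the head of the dataset, used instead of the full file
sample_csv = io.StringIO(
    "price,model_year,model,condition,cylinders,fuel,odometer,transmission,"
    "type,paint_color,is_4wd,date_posted,days_listed\n"
    "9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19\n"
    "25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50\n"
    "5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79\n"
)
sample = pd.read_csv(sample_csv)

print(missing_columns(sample))                         # nothing missing in the sample
print(missing_columns(sample.drop(columns=["fuel"])))  # reports the dropped column
```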

Let's get started by loading and preprocessing the data.

In [2]:
import streamlit as st
import pandas as pd
import plotly.express as px

In [3]:
# EDA.ipynb

# Load the dataset (adjust the path to wherever vehicles_us.csv is stored)
df = pd.read_csv("C:/Users/CRcha/OneDrive/Desktop/coding/sprint4project/sprint4_project/vehicles_us.csv")

# Basic information about the dataset
df.info()
df.describe(include='object')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


             model  condition   fuel  transmission   type  paint_color  date_posted
count        51525      51525  51525         51525  51525        42258        51525
unique         100          6      5             3     13           12          354
top     ford f-150  excellent    gas     automatic    SUV        white   2019-03-17
freq          2796      24773  47288         46902  12405        10029          186
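
`date_posted` is stored as `object`, meaning plain strings. If date arithmetic is needed later, one option is to parse it with `pd.to_datetime`. Here is a minimal sketch on three dates copied from the first rows of the dataset; the `posted` variable name is only illustrative.

```python
import pandas as pd

# Three date strings copied from the date_posted column
posted = pd.Series(["2018-06-23", "2018-10-19", "2019-02-07"], name="date_posted")

# Parse the strings into datetime values; the format is stated explicitly
posted = pd.to_datetime(posted, format="%Y-%m-%d")

print(posted.dt.year.tolist())             # [2018, 2018, 2019]
print((posted.max() - posted.min()).days)  # 229 days between first and last posting
```

On the full DataFrame the same call would be `df['date_posted'] = pd.to_datetime(df['date_posted'], format='%Y-%m-%d')`.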


In [4]:
# Check for duplicate rows
duplicate_rows = df[df.duplicated()]
print("\nNumber of duplicate rows: ", duplicate_rows.shape[0])

# Display duplicate rows
print("Duplicate rows:\n", duplicate_rows)



Number of duplicate rows:  0
Duplicate rows:
 Empty DataFrame
Columns: [price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, days_listed]
Index: []


In [5]:
display(df.head(10))
display(df.sample(10))

   price  model_year           model  condition  cylinders  fuel  odometer  transmission    type  paint_color  is_4wd  date_posted  days_listed
0   9400      2011.0          bmw x5       good        6.0   gas  145000.0     automatic     SUV          NaN     1.0   2018-06-23           19
1  25500         NaN      ford f-150       good        6.0   gas   88705.0     automatic  pickup        white     1.0   2018-10-19           50
2   5500      2013.0  hyundai sonata   like new        4.0   gas  110000.0     automatic   sedan          red     NaN   2019-02-07           79
3   1500      2003.0      ford f-150       fair        8.0   gas       NaN     automatic  pickup          NaN     NaN   2019-03-22            9
4  14900      2017.0    chrysler 200  excellent        4.0   gas   80903.0     automatic   sedan        black     NaN   2019-04-02           28
5  14990      2014.0    chrysler 300  excellent        6.0   gas   57954.0     automatic   sedan        black     1.0   2018-06-20           15
6  12990      2015.0    toyota camry  excellent        4.0   gas   79212.0     automatic   sedan        white     NaN   2018-12-27           73
7  15990      2013.0     honda pilot  excellent        6.0   gas  109473.0     automatic     SUV        black     1.0   2019-01-07           68
8  11500      2012.0     kia sorento  excellent        4.0   gas  104174.0     automatic     SUV          NaN     1.0   2018-07-16           19
9   9200      2008.0     honda pilot  excellent        NaN   gas  147191.0     automatic     SUV         blue     1.0   2019-02-15           17


       price  model_year                       model  condition  cylinders  fuel  odometer  transmission       type  paint_color  is_4wd  date_posted  days_listed
45169   9000      2010.0             subaru forester  excellent        4.0   gas       NaN     automatic        SUV          NaN     NaN   2019-04-04           28
41919  34000      2018.0               honda odyssey   like new        6.0   gas   17635.0     automatic   mini-van        black     NaN   2019-02-10           76
26175   2495      2000.0               ford explorer  excellent        6.0   gas       NaN     automatic        SUV         blue     1.0   2018-12-26           59
27283   5995      2008.0           volkswagen passat  excellent        4.0   gas  112118.0     automatic  hatchback         blue     NaN   2019-02-11          100
43039   4200         NaN                 ford fusion  excellent        NaN   gas       NaN     automatic      sedan          NaN     NaN   2019-01-10           11
48467  14900      2014.0                  ford f-150       good        6.0   gas  161257.0     automatic      truck         grey     1.0   2018-05-19           17
32034  30000      2016.0                  ford f-150  excellent        6.0   gas   43446.0     automatic     pickup        black     1.0   2018-11-02           59
 1939  19200      2010.0              toyota 4runner       good        NaN   gas  111857.0     automatic        SUV        white     1.0   2019-03-16           26
18689  36995      2013.0                    ram 3500  excellent        8.0   gas   35922.0     automatic      truck        black     1.0   2018-12-07           38
27635   8500      2007.0  chevrolet silverado 2500hd       good        8.0   gas  270000.0     automatic      truck        white     1.0   2019-03-26           65


In [6]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Percentage of missing values
missing_percentage = (missing_values / len(df)) * 100
print("\nPercentage of missing values in each column:\n", missing_percentage)


Missing values in each column:
 price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

Percentage of missing values in each column:
 price            0.000000
model_year       7.023775
model            0.000000
condition        0.000000
cylinders       10.208637
fuel             0.000000
odometer        15.316836
transmission     0.000000
type             0.000000
paint_color     17.985444
is_4wd          50.369723
date_posted      0.000000
days_listed      0.000000
dtype: float64
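
`is_4wd` alone is missing in about 50.4% of rows, so removing every incomplete row would discard more than half of the listings. One hedged alternative is to fill the gaps instead. The sketch below rests on assumptions that the dataset itself does not confirm: a missing `is_4wd` is read as "not four-wheel drive", a missing `paint_color` is kept as its own `'unknown'` category, and numeric gaps are filled with the median for the same `model`. The six rows are taken from the sample output above (only the relevant columns), and `rows` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Six listings copied from the sample output (subset of columns)
rows = pd.DataFrame({
    "model": ["ford f-150", "ford f-150", "ford f-150", "ford f-150",
              "honda pilot", "honda pilot"],
    "model_year": [np.nan, 2003, 2014, 2016, 2013, 2008],
    "cylinders": [6, 8, 6, 6, 6, np.nan],
    "paint_color": ["white", None, "grey", "black", "black", "blue"],
    "is_4wd": [1, np.nan, 1, 1, 1, 1],
})

filled = rows.copy()

# Assumption: a missing is_4wd flag means the car is not four-wheel drive
filled["is_4wd"] = filled["is_4wd"].fillna(0).astype(int)

# Assumption: an unrecorded colour can be kept as a separate category
filled["paint_color"] = filled["paint_color"].fillna("unknown")

# Assumption: the median of the same model is a fair stand-in for numeric gaps
for col in ["model_year", "cylinders"]:
    group_median = filled.groupby("model")[col].transform("median")
    filled[col] = filled[col].fillna(group_median)

print(filled)
print("remaining gaps:", int(filled.isna().sum().sum()))
```

`odometer` could be handled the same way, for example with the median per `model_year`. Whether imputation or dropping rows is more appropriate depends on what the app is meant to show.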


In [7]:
# Data preprocessing
# Dropping rows with missing values for simplicity
df.dropna(inplace=True)

# Convert data types if necessary
df['model_year'] = df['model_year'].astype(int)
df['cylinders'] = df['cylinders'].astype(int)
df['odometer'] = df['odometer'].astype(int)

# Display the first few rows of the dataframe
st.write("## Data Overview")
st.write(df.head())

In [8]:
# Re-check for duplicate rows after dropping rows with missing values
duplicate_rows = df[df.duplicated()]
print("\nNumber of duplicate rows: ", duplicate_rows.shape[0])

# Display duplicate rows
print("Duplicate rows:\n", duplicate_rows)



Number of duplicate rows:  0
Duplicate rows:
 Empty DataFrame
Columns: [price, model_year, model, condition, cylinders, fuel, odometer, transmission, type, paint_color, is_4wd, date_posted, days_listed]
Index: []


In [9]:
fig = px.histogram(
    df,
    x='price',
    title='Distribution of Car Prices',
    labels={'price': 'Car Price (USD)'},
    color_discrete_sequence=['#636EFA'],
    template='presentation',
)

# Update the layout for a more professional appearance
fig.update_layout(
    title=dict(text='Distribution of Car Prices', x=0.5),
    xaxis_title='Price (USD)',
    yaxis_title='Number of Listings',
    bargap=0.2,
    plot_bgcolor='rgba(0, 0, 0, 0)',
    paper_bgcolor='rgba(0, 0, 0, 0)',
)

# Display the plot (rendering inside Jupyter requires nbformat>=4.2.0 to be installed)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
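
Price columns like this one can have a long right tail that squeezes most of a histogram into a few bars. Whether that applies here has not been checked in this notebook. If it does, one option is to plot only the central range of prices. The sketch below uses the ten prices from the first rows of the dataset, and the 5th/95th percentile cut-offs are an arbitrary assumption, not a recommendation.

```python
import pandas as pd

# Prices of the first ten listings shown earlier
prices = pd.Series(
    [9400, 25500, 5500, 1500, 14900, 14990, 12990, 15990, 11500, 9200],
    name="price",
)

# Hypothetical cut-offs: keep values between the 5th and 95th percentiles
low, high = prices.quantile([0.05, 0.95])
trimmed = prices[prices.between(low, high)]

print(f"kept {len(trimmed)} of {len(prices)} prices")
```

The trimmed series, or the matching rows of `df`, can then be passed to `px.histogram` in the same way as the full column.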

In [10]:
# Title of the Streamlit app
st.title("Car Listings Analysis")

# Streamlit header
st.header("Explore the Dataset")

# Checkbox to show/hide Price Distribution Histogram
if st.checkbox('Show Price Distribution Histogram'):
    st.subheader('Distribution of Car Prices')
    
    # Create the histogram with detailed and professional styling
    fig = px.histogram(
        df,
        x='price',
        title='Distribution of Car Prices',
        labels={'price': 'Car Price (USD)'},
        color_discrete_sequence=['#636EFA'],
        template='presentation',
    )
    
    # Update the layout for a more professional appearance
    fig.update_layout(
        title=dict(text='Distribution of Car Prices', x=0.5),
        xaxis_title='Price (USD)',
        yaxis_title='Number of Listings',
        bargap=0.2,
        plot_bgcolor='rgba(0, 0, 0, 0)',
        paper_bgcolor='rgba(0, 0, 0, 0)',
    )
    
    # Show the plot
    st.plotly_chart(fig)

# Checkbox to show/hide Model Year Distribution Histogram
if st.checkbox('Show Model Year Distribution Histogram'):
    st.subheader('Model Year Distribution')
    fig = px.histogram(df, x='model_year', title='Model Year Distribution')
    st.plotly_chart(fig)

# Checkbox to show/hide Odometer Readings Distribution Histogram
if st.checkbox('Show Odometer Readings Distribution Histogram'):
    st.subheader('Odometer Readings Distribution')
    fig = px.histogram(df, x='odometer', title='Odometer Readings Distribution')
    st.plotly_chart(fig)

# Checkbox to show/hide Days Listed Distribution Histogram
if st.checkbox('Show Days Listed Distribution Histogram'):
    st.subheader('Days Listed Distribution')
    fig = px.histogram(df, x='days_listed', title='Days Listed Distribution')
    st.plotly_chart(fig)

# Checkbox to show/hide Price vs. Odometer Scatter Plot
if st.checkbox('Show Price vs. Odometer Scatter Plot'):
    st.subheader('Price vs. Odometer (Mileage)')
    fig = px.scatter(df, x='odometer', y='price', title='Price vs. Odometer (Mileage)')
    st.plotly_chart(fig)

# Checkbox to show/hide Price vs. Model Year Scatter Plot
if st.checkbox('Show Price vs. Model Year Scatter Plot'):
    st.subheader('Price vs. Model Year')
    fig = px.scatter(df, x='model_year', y='price', title='Price vs. Model Year')
    st.plotly_chart(fig)

# Checkbox to show/hide Price vs. Cylinders Scatter Plot
if st.checkbox('Show Price vs. Cylinders Scatter Plot'):
    st.subheader('Price vs. Cylinders')
    fig = px.scatter(df, x='cylinders', y='price', title='Price vs. Cylinders')
    st.plotly_chart(fig)

# Checkbox to show/hide Number of Listings by Condition Bar Plot
# px.histogram counts the rows in each category, so every bar is one aggregated value
if st.checkbox('Show Number of Listings by Condition'):
    st.subheader('Number of Listings by Condition')
    fig = px.histogram(df, x='condition', title='Number of Listings by Condition')
    st.plotly_chart(fig)

# Checkbox to show/hide Number of Listings by Fuel Type Bar Plot
if st.checkbox('Show Number of Listings by Fuel Type'):
    st.subheader('Number of Listings by Fuel Type')
    fig = px.histogram(df, x='fuel', title='Number of Listings by Fuel Type')
    st.plotly_chart(fig)

# Checkbox to show/hide Number of Listings by Transmission Type Bar Plot
if st.checkbox('Show Number of Listings by Transmission Type'):
    st.subheader('Number of Listings by Transmission Type')
    fig = px.histogram(df, x='transmission', title='Number of Listings by Transmission Type')
    st.plotly_chart(fig)

# Checkbox to show/hide Price by Condition Box Plot
if st.checkbox('Show Price by Condition Box Plot'):
    st.subheader('Price by Condition')
    fig = px.box(df, x='condition', y='price', title='Price by Condition')
    st.plotly_chart(fig)

# Checkbox to show/hide Price by Fuel Type Box Plot
if st.checkbox('Show Price by Fuel Type Box Plot'):
    st.subheader('Price by Fuel Type')
    fig = px.box(df, x='fuel', y='price', title='Price by Fuel Type')
    st.plotly_chart(fig)

# Checkbox to show/hide Price by Transmission Type Box Plot
if st.checkbox('Show Price by Transmission Type Box Plot'):
    st.subheader('Price by Transmission Type')
    fig = px.box(df, x='transmission', y='price', title='Price by Transmission Type')
    st.plotly_chart(fig)

# Checkbox to show/hide Correlation Matrix Heatmap
if st.checkbox('Show Correlation Matrix Heatmap'):
    st.subheader('Correlation Matrix Heatmap')
    correlation_matrix = df[['price', 'model_year', 'odometer', 'days_listed', 'cylinders']].corr()
    fig = px.imshow(correlation_matrix, title='Correlation Matrix Heatmap')
    st.plotly_chart(fig)

# Checkbox to show/hide Scatter Matrix
if st.checkbox('Show Scatter Matrix'):
    st.subheader('Scatter Matrix')
    fig = px.scatter_matrix(df, dimensions=['price', 'model_year', 'odometer', 'days_listed', 'cylinders'], title='Scatter Matrix')
    st.plotly_chart(fig)
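
Most of the checkbox blocks above follow the same pattern: a checkbox, a subheader, a Plotly Express call, and `st.plotly_chart`. One possible way to cut the repetition is a small table of chart specifications rendered in a loop. This is only a sketch: `CHART_SPECS` and `render_charts` are hypothetical names, only three of the charts are listed, and the imports sit inside the function so that the spec table can be inspected without Streamlit installed.

```python
# Hypothetical spec table: (chart title, Plotly Express function name, keyword arguments)
CHART_SPECS = [
    ("Model Year Distribution", "histogram", {"x": "model_year"}),
    ("Price vs. Odometer (Mileage)", "scatter", {"x": "odometer", "y": "price"}),
    ("Price by Condition", "box", {"x": "condition", "y": "price"}),
]


def render_charts(df, specs=CHART_SPECS):
    """Render one checkbox-gated chart per spec (meant to run inside a Streamlit app)."""
    import plotly.express as px
    import streamlit as st

    for title, kind, kwargs in specs:
        # Checkbox labels must be unique, so each one is derived from the chart title
        if st.checkbox(f"Show {title}"):
            st.subheader(title)
            plot_fn = getattr(px, kind)  # e.g. px.histogram, px.scatter, px.box
            st.plotly_chart(plot_fn(df, title=title, **kwargs))
```

In the app, `render_charts(df)` would stand in for the matching `if st.checkbox(...)` blocks. Charts with custom layout, such as the price histogram, can stay as they are.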