# Programming and Database project

## Cars for sale in the US: an analysis





**Student**: Gianello Alessio <br>
**ID**: VR502062

## Importing and cleaning the dataset

This dataset provides comprehensive information about used cars available for sale in the United States. It includes detailed data on various aspects of each vehicle, making it a valuable resource for car buyers, sellers, and data enthusiasts. 
The dataset contains the following key attributes:
- Brand: The manufacturer of the car.
- Model: The specific model of the car.
- Mileage: The number of miles the car has been driven.
- Year: The manufacturing year of the car.
- Status: Indicates whether the car is new, used, or certified pre-owned.
- Dealer: Information about the dealer or seller offering the car.
- Price: The listed price of the car in USD.

Dataset source: https://www.kaggle.com/datasets/juanmerinobermejo/us-sales-cars-dataset



Useful imports:


In [48]:
import pandas as pd

In [49]:
# Reading data
cars_list_df = pd.read_csv('cars.csv', encoding='utf-16')

In [50]:
# Taking a look at what we have at hand
cars_list_df.head(10)

Unnamed: 0,Brand,Model,Year,Status,Mileage,Dealer,Price
0,Mazda,CX-5,2023,New,,,36703.0
1,Kia,Sportage,2023,New,,Classic Kia,28990.0
2,Chevrolet,Camaro,2024,New,,Classic Chevrolet Beaumont,41425.0
3,Ford,Bronco,2023,Used,1551.0,Mike Smith Chrysler Dodge Jeep RAM,58900.0
4,Acura,TLX,2021,Used,30384.0,Mike Smith Nissan,34499.0
5,Volkswagen,Golf,2022,Certified,13895.0,Volkswagen of Beaumont,34000.0
6,GMC,Yukon,2021,Used,68506.0,BMW of Beaumont,56954.0
7,BMW,M340,2023,New,,BMW of Beaumont,61715.0
8,Hyundai,Sonata,2023,New,,Hyundai of Silsbee,37945.0
9,Hyundai,Sonata,2023,New,,Hyundai of Silsbee,33495.0


In [51]:
# Taking a closer look
cars_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51793 entries, 0 to 51792
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Brand    51793 non-null  object 
 1   Model    51793 non-null  object 
 2   Year     51793 non-null  int64  
 3   Status   51793 non-null  object 
 4   Mileage  22981 non-null  float64
 5   Dealer   51689 non-null  object 
 6   Price    50644 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.8+ MB


There are clearly many null values, mainly in column 4 (`Mileage`). More precisely:

In [52]:
# Percentage of non-null values in each column; defined as a function because it is reused later.
# Note: despite its name, this function returns the share of non-null values.

def get_null_percentage(df):
    df_length = len(df)
    return (df.notnull().sum() / df_length) * 100

get_null_percentage(cars_list_df)


Brand      100.000000
Model      100.000000
Year       100.000000
Status     100.000000
Mileage     44.370861
Dealer      99.799201
Price       97.781553
dtype: float64
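
The helper above reports the share of values that are present. A complementary view, sketched here on a small made-up DataFrame, reports how many values are missing per column. The function name `summarize_missing` and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing_count = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values
    missing_pct = df.isna().mean() * 100
    return pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})


# Hypothetical toy data, not taken from cars.csv
toy_df = pd.DataFrame({
    "Mileage": [1551.0, None, None, 30384.0],
    "Price": [58900.0, 36703.0, None, 34499.0],
})

summary = summarize_missing(toy_df)
print(summary)
```

Applied to `cars_list_df`, the same function would show the missing counts next to the percentages, which makes it easier to judge how many rows a later `dropna` is likely to remove.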

## Managing null values

As we can see, only about 44% of the `Mileage` values are non-null. Since mileage is essential to our analysis, we drop the rows where it is missing (column 4), and do the same for rows with a missing `Price` (column 6).

In [53]:
cars_list_df = cars_list_df.dropna(subset=['Mileage','Price'])


The dataset is now considerably smaller, but we can expect a more reliable analysis.
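
To quantify how much data such a filter removes, one option is to compare the row counts before and after `dropna`. The following is a minimal sketch on made-up rows; the variable names and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
toy_df = pd.DataFrame({
    "Model": ["CX-5", "Bronco", "TLX", "Sonata"],
    "Mileage": [None, 1551.0, 30384.0, None],
    "Price": [36703.0, 58900.0, None, 37945.0],
})

rows_before = len(toy_df)
# Keep only the rows where both Mileage and Price are present
cleaned_df = toy_df.dropna(subset=["Mileage", "Price"])
rows_after = len(cleaned_df)

retained_pct = rows_after / rows_before * 100
print(f"Kept {rows_after} of {rows_before} rows ({retained_pct:.1f}%)")
```

Running the same comparison on `cars_list_df` before and after the drop would give the exact number of rows lost.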

We still need to handle the null values in the 'Dealer' field. Since this is a descriptive (and not quantitative) field, we will replace the null values with 'Unknown dealer'


In [54]:
cars_list_df = cars_list_df.fillna({'Dealer': 'Unknown dealer'}) # Fill only the Dealer column and reassign the result
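
Filling a single column can also be written by passing a dictionary to `DataFrame.fillna` and reassigning the result, which avoids relying on in-place modification of a column selection. This is a small sketch on made-up data; the names `toy_df` and `filled_df` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
toy_df = pd.DataFrame({
    "Dealer": ["Classic Kia", None, "BMW of Beaumont"],
    "Price": [28990.0, 36703.0, 61715.0],
})

# A dict maps column names to fill values, so only Dealer is touched
filled_df = toy_df.fillna({"Dealer": "Unknown dealer"})

print(filled_df["Dealer"].tolist())
```

Because `fillna` returns a new DataFrame here, `toy_df` itself is left unchanged, which makes the step easy to re-run in a notebook.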

We should now have no columns with null values. Let's check:

In [55]:
# Percentage of non-null values in each column

get_null_percentage(cars_list_df)

Brand      100.0
Model      100.0
Year       100.0
Status     100.0
Mileage    100.0
Dealer     100.0
Price      100.0
dtype: float64

## Analysis of attributes and links between them