# Programming and Database project

## Cars for sale in the US: an analysis





**Student**: Gianello Alessio <br>
**ID**: VR502062

## Importing and cleaning the dataset

This dataset provides information about cars available for sale in the United States, covering new, used, and certified pre-owned vehicles. It includes detailed data on various aspects of each vehicle, making it a valuable resource for car buyers, sellers, and data enthusiasts.
The dataset contains the following key attributes:
- Brand: The manufacturer of the car.
- Model: The specific model of the car.
- Mileage: The number of miles the car has been driven.
- Year: The manufacturing year of the car.
- Status: Indicates whether the car is new, used, or certified pre-owned.
- Dealer: Information about the dealer or seller offering the car.
- Price: The listed price of the car in USD.

Dataset source: https://www.kaggle.com/datasets/juanmerinobermejo/us-sales-cars-dataset



Useful imports:


In [158]:
import pandas as pd

In [159]:
# Reading data
cars_list_df = pd.read_csv('cars.csv', encoding='utf-16')

In [160]:
# Taking a look at what we have at hand
cars_list_df.head(10)

Unnamed: 0,Brand,Model,Year,Status,Mileage,Dealer,Price
0,Mazda,CX-5,2023,New,,,36703.0
1,Kia,Sportage,2023,New,,Classic Kia,28990.0
2,Chevrolet,Camaro,2024,New,,Classic Chevrolet Beaumont,41425.0
3,Ford,Bronco,2023,Used,1551.0,Mike Smith Chrysler Dodge Jeep RAM,58900.0
4,Acura,TLX,2021,Used,30384.0,Mike Smith Nissan,34499.0
5,Volkswagen,Golf,2022,Certified,13895.0,Volkswagen of Beaumont,34000.0
6,GMC,Yukon,2021,Used,68506.0,BMW of Beaumont,56954.0
7,BMW,M340,2023,New,,BMW of Beaumont,61715.0
8,Hyundai,Sonata,2023,New,,Hyundai of Silsbee,37945.0
9,Hyundai,Sonata,2023,New,,Hyundai of Silsbee,33495.0


In [161]:
# Taking a closer look
cars_list_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51793 entries, 0 to 51792
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Brand    51793 non-null  object 
 1   Model    51793 non-null  object 
 2   Year     51793 non-null  int64  
 3   Status   51793 non-null  object 
 4   Mileage  22981 non-null  float64
 5   Dealer   51689 non-null  object 
 6   Price    50644 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.8+ MB


There are clearly many null values, mainly in column 4 (Mileage). More precisely:

In [162]:
# Percentage of non-null values in each column; defined as a function because I'll need it again later.
# Note: despite its name, this function returns the share of NON-null values.

def get_null_percentage(df):
    df_length = len(df)
    return (df.notnull().sum() / df_length) * 100

get_null_percentage(cars_list_df)


Brand      100.000000
Model      100.000000
Year       100.000000
Status     100.000000
Mileage     44.370861
Dealer      99.799201
Price       97.781553
dtype: float64
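
As a complementary view, the share of missing values can be computed directly. The block below is a minimal sketch on a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset). It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Toy frame standing in for cars_list_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Brand": ["Ford", "Mazda", "Acura", "Kia"],
    "Mileage": [1551.0, None, 30384.0, None],
    "Price": [58900.0, 36703.0, None, 28990.0],
})

# isna() yields a boolean frame; the column-wise mean is the fraction of missing values
missing_pct = toy_df.isna().mean() * 100

# The non-null share is simply the complement
non_null_pct = 100 - missing_pct

print(missing_pct)
```

Reading the result as "percentage missing" makes the problematic columns stand out, since the complete columns show 0.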

Set the column names to lowercase:


In [163]:
cars_list_df.columns = cars_list_df.columns.str.lower()

## Managing null values

As we can see, mileage has a low percentage of non-null values. Since it is essential for our analysis, we drop the rows with a null mileage (column 4), and likewise the rows with a null price (column 6).

In [164]:
cars_list_df = cars_list_df.dropna(subset=['mileage','price'])


Our dataset is now considerably smaller, but we can expect a more reliable analysis.
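
To quantify how much data such a filter discards, the row count can be compared before and after the drop. This is a minimal sketch on a made-up frame (`toy_df`, `rows_before`, `rows_after`, and `retained_pct` are illustrative names, and the values are not taken from the dataset).

```python
import pandas as pd

# Toy frame with gaps in both columns (values are made up for illustration)
toy_df = pd.DataFrame({
    "mileage": [1551.0, None, 30384.0, None, 13895.0],
    "price": [58900.0, 36703.0, None, 28990.0, 34000.0],
})

rows_before = len(toy_df)

# Keep only the rows where both mileage and price are present
cleaned_df = toy_df.dropna(subset=["mileage", "price"])

rows_after = len(cleaned_df)
retained_pct = rows_after / rows_before * 100

print(f"kept {rows_after} of {rows_before} rows ({retained_pct:.1f}%)")
```

The same comparison on `cars_list_df` would show exactly how many listings survive the cleaning step.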

We still need to handle the null values in the 'dealer' field. Since this is a descriptive (not quantitative) field, we replace the null values with 'unknown_dealer'.


In [165]:
# Assign the result back: fillna(inplace=True) on a selected column is chained assignment
# and does not update the DataFrame under pandas Copy-on-Write
cars_list_df['dealer'] = cars_list_df['dealer'].fillna('unknown_dealer')

We now shouldn't have any columns with null values. Let's check:

In [166]:
# Percentage of non-null values in each column

get_null_percentage(cars_list_df)

brand      100.0
model      100.0
year       100.0
status     100.0
mileage    100.0
dealer     100.0
price      100.0
dtype: float64

## Analysis of attributes and links between them

In the following sections I'll run .value_counts() on every descriptive column of the dataframe to check for anomalies.

In [167]:
cars_list_df.brand.value_counts().sort_index()  # Sort by index to make anomalies easier to spot

Acura                   501
Alfa Romeo               67
Aston Martin             17
Audi                    932
BMW                    1393
Bentley                  44
Buick                   138
Cadillac                541
Chevrolet              1951
Chrysler                119
Dodge                   466
FIAT                     26
Ferrari                  47
Ford                   2614
GMC                     806
Genesis                 198
Geo                       1
Honda                  1097
Hummer                   19
Hyundai                 411
Infiniti                363
International Scout       1
Isuzu                     1
Jaguar                  134
Jeep                    876
Karma                     4
Kia                     502
Lamborghini              37
Land Rover              410
Lexus                  1218
Lincoln                 342
Lotus                     3
Lucid                     4
MINI                     79
Maserati                 70
Maybach             

In [168]:
brand = cars_list_df.brand.value_counts().index
len(brand)

59

In [169]:
formatted_brand = brand.map(lambda x: x.lower().replace(' ','_').replace('-',''))
len(formatted_brand)

59

In [170]:
cars_list_df.brand.value_counts().sort_index()  # Sort by index to make anomalies easier to spot

Acura                   501
Alfa Romeo               67
Aston Martin             17
Audi                    932
BMW                    1393
Bentley                  44
Buick                   138
Cadillac                541
Chevrolet              1951
Chrysler                119
Dodge                   466
FIAT                     26
Ferrari                  47
Ford                   2614
GMC                     806
Genesis                 198
Geo                       1
Honda                  1097
Hummer                   19
Hyundai                 411
Infiniti                363
International Scout       1
Isuzu                     1
Jaguar                  134
Jeep                    876
Karma                     4
Kia                     502
Lamborghini              37
Land Rover              410
Lexus                  1218
Lincoln                 342
Lotus                     3
Lucid                     4
MINI                     79
Maserati                 70
Maybach             

In [171]:
models = cars_list_df.model.value_counts().sort_index().index
len(models)

560

In [172]:
formatted_models = models.map(lambda x: x.lower().replace(' ','_').replace('-',''))
len(formatted_models)

560

So there are no duplicate car models or brands caused by formatting problems, and I can get back to my analysis.
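
One caveat: `map` returns one output per input, so `len` is the same before and after formatting whether or not two labels collapse into one. A stricter check compares the number of distinct values instead. The block below is a minimal sketch of that idea on made-up labels (`labels` and `normalize` are hypothetical names, and the normalization mirrors the lambda used in the cells above).

```python
import pandas as pd

# Hypothetical labels: two pairs differ only in letter case
labels = pd.Index(["Kia", "KIA", "Land Rover", "land rover"])

def normalize(label):
    # Same normalization as in the cells above: lowercase, spaces to underscores, hyphens removed
    return label.lower().replace(" ", "_").replace("-", "")

normalized = labels.map(normalize)

# The length never changes under map, so it cannot reveal collisions
print(len(labels), len(normalized))

# The number of distinct values drops when formatting variants collapse
print(labels.nunique(), normalized.nunique())

# Normalized labels that occur more than once
collisions = normalized[normalized.duplicated()].unique()
print(list(collisions))
```

If `normalized.nunique()` equals `labels.nunique()` on the real brand and model indexes, the conclusion above holds. A smaller number would point to formatting duplicates worth inspecting.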