# Capstone project for Victor Herrera

![Course Hero](images/hero.png)

## Introduction

This project is the capstone for the 'Plating with Jupyter notebooks and Machine learning' bootcamp. Its purpose is to improve my data cleaning skills and to create a model capable of estimating the value of a used car from several factors.

## Data Set Selection

The dataset was found on Kaggle. 

I found it interesting because every now and then I like to check the prices of different car models to get an idea of the current state of the car market. I know that these prices may not be the same in my country, but it's a good place to start.

## Data Examination

Let's start with the imports for the notebook.

Note: Remember to add every module you use to the `requirements.txt` file.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Get the selected data set into a pandas DataFrame.

Note: You need to add the right method to load the data.

In [None]:
cars_raw_df = pd.read_csv('data/cars_raw.csv')

Find relevant information about the selected dataset.

- How many rows and columns does it have?
- Which characteristics does each column have?
    - Data type
    - Minimum and maximum values
    - Values distribution
    - Missing data
- Which columns are related or are dependent on each other? 
    - Which ones can be derived?
    - Which are good candidates for a hypothesis?

Note: Use pandas methods such as `shape`, `head`, `sample`, `groupby`, `describe` and any other you can think of!

In [None]:
# How many rows and columns does it have?
# Note: DataFrame.shape is (number of rows, number of columns)

print("Data frame has:")
print(" -- " + str(cars_raw_df.shape[0]) + " rows")
print(" -- " + str(cars_raw_df.shape[1]) + " cols")

print("Columns in the data frame are:")
for col in cars_raw_df.columns:
    print(" -- " + col)

In [None]:
# Display some data to get an idea of what we are working with

print("5 entries sample:")
cars_raw_df.sample(n=5)

In [None]:
# Columns data type and missing data

print("Columns composition is:")
cars_raw_df.info(verbose=True)

In [None]:
# Number of unique values in each column

cars_raw_df.nunique()

In [None]:
# Columns data min and max values as well as distribution
cars_raw_df.describe(include="all")
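
One of the questions in the checklist at the start of this section is about missing data. As a minimal sketch, missing values per column can be counted with `isna().sum()`. The frame below is made up for illustration: the `Year` and `Price` column names and all values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy_df = pd.DataFrame({
    "Make": ["BMW", "Toyota", None, "Honda"],
    "Year": [2019, 2015, 2020, None],
    "Price": [42000.0, 15500.0, 31000.0, 18900.0],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()

# Share of missing values in each column, as a percentage
missing_pct = toy_df.isna().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

The same two calls should work unchanged on `cars_raw_df`, and they complement `info()`, which reports non-null counts instead.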

## Define the Hypothesis to test

It should be possible to estimate the price of a used car from several attributes that describe its state. Older cars should be cheaper than newer ones, and a car from a premium brand should be worth more than a car from a regular brand of the same year.

The hypothesis to test is that depending on the attributes of the car we should be able to estimate a "fair price" for selling.

### Drawing some charts
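
A possible starting point for this section is a scatter plot of price against year, with one series per maker, since the hypothesis expects both age and brand to matter. This is only a sketch on made-up data: the `Year` and `Price` column names and every value below are assumptions, not results from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; "Year", "Price" and all values are assumptions
toy_df = pd.DataFrame({
    "Make": ["BMW", "BMW", "BMW", "Toyota", "Toyota", "Toyota"],
    "Year": [2016, 2018, 2020, 2016, 2018, 2020],
    "Price": [24000, 33000, 45000, 13000, 17000, 22000],
})

fig, ax = plt.subplots()

# One scatter series per maker so brand and age effects show up together
for make, group in toy_df.groupby("Make"):
    ax.scatter(group["Year"], group["Price"], label=make)

ax.set_xlabel("Year")
ax.set_ylabel("Price")
ax.set_title("Price vs. year by maker (toy data)")
ax.legend()
```
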



## Clean the data

Create a new DataFrame with just the data you are going to use.

### Removing columns that I'm not interested in:

- **ConsumerRating** because I'm not sure what the user is rating
- **ConsumerReviews** because more reviews shouldn't imply a higher price
- **SellerName** because `cars_raw_df["SellerName"].nunique()` returns 3971, so it would be hard to turn those unique values into a numeric representation
- **SellerRating** because I'm analyzing the cars, not the seller
- **StreetName** because I'm not interested in the particular seller's address
- **Zipcode** because I will use `State` as my reference for location
- **DealType** because I don't know what the criteria for a 'Great' deal are
- **ValueForMoneyRating** because this may be the most subjective value of them all: how would a potential seller rate it?
- **MinMPG** because it is related to the engine type
- **MaxMPG** because it is related to the engine type
- **VIN** because it is unique to each vehicle
- **Stock#** because it is unique to each vehicle
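
The reasoning behind dropping `VIN` and `Stock#` (unique per vehicle) and `SellerName` (many distinct values) can be checked by comparing `nunique()` with the number of rows. Here is a minimal sketch on made-up data; the values and the `unique_ratio` and `id_like_cols` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy_df = pd.DataFrame({
    "VIN": ["A1", "B2", "C3", "D4"],
    "SellerName": ["Seller 1", "Seller 2", "Seller 2", "Seller 3"],
    "State": ["CA", "CA", "TX", "TX"],
})

# Fraction of distinct values per column: 1.0 means every row is unique
unique_ratio = toy_df.nunique() / len(toy_df)

# Columns where every value is distinct behave like identifiers
id_like_cols = unique_ratio[unique_ratio == 1.0].index.tolist()

print(unique_ratio)
print(id_like_cols)
```

A ratio close to 1.0 suggests a column that identifies rows rather than describing them, which is a reasonable (though not automatic) signal for dropping it.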


In [None]:
# SellerName overview
print("Registered unique seller names:")
cars_raw_df["SellerName"].nunique()

In [None]:
# DealType overview
print("Types of DealType:")
cars_raw_df["DealType"].unique()

In [None]:
# Column removal

cars_df = cars_raw_df.drop(columns=[
    "ConsumerRating",
    "ConsumerReviews",
    "SellerName",
    "SellerRating",
    "StreetName",
    "Zipcode",
    "DealType",
    "ValueForMoneyRating",
    "MinMPG",
    "MaxMPG",
    "VIN",
    "Stock#",
])

cars_df.describe(include="all")

In [None]:
print("Remaining columns in the data frame are:")
for col in cars_df.columns:
    print(" -- " + col)

### Narrow the data frame

For this exercise we will focus on a certain car model.

First I want to know which car manufacturer has the most data.

At this moment my guess would be Toyota because, fun fact, the Toyota Corolla is the best-selling car in history.

In [None]:
print("Top 10 Makers count:")
(cars_df.groupby("Make")["Make"]
    .count()
    .reset_index(name='count')
    .sort_values(['count'], ascending=False)
    .head(10))



I guess I was wrong, and the used car market is flooded with BMWs then. I would like to have a look at the top 5 makers.

In [None]:
top_3_makers = ["BMW", "Mercedes-Benz", "Toyota"]
top_5_makers = ["BMW", "Mercedes-Benz", "Toyota", "Honda", "Ford"]
top_makers = cars_df[cars_df["Make"].isin(top_5_makers)]
top_makers.describe(include="all")

In my first attempt I tried to use the top 3 makers, and for those (`BMW`, `Mercedes-Benz`, `Toyota`) the most popular car model was the **BMW X5 xDrive40i**, but I had my doubts. Maybe rich people change cars more often, so they resell their car when buying a new one...

![People's car](images/bmw_x5.jpg)

Including the top 5 makers resulted in the most popular car model being the **Honda CR-V EX-L**, which makes a little more sense to me.

![Honda CR-V](images/honda-cr-v.png)

But the issue persisted: the most common car model in the whole data frame is the **Jeep Grand Cherokee Limited**, so neither a Honda nor a BMW.

### Model cleanup

Oftentimes a car's model name combines the base model with the extras that the manufacturer adds on top of it.

In [None]:
# Define a function that lists the top n most frequent values of a given column
def list_top_n(df, col, n):
    return (df.groupby(col)[col]
        .count()
        .reset_index(name='count')
        .sort_values(['count'], ascending=False)
        .head(n))
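
For comparison, pandas' built-in `value_counts` produces a comparable ranking to `list_top_n` in a single call. It returns a Series indexed by value instead of a DataFrame, and ties may be ordered differently. This is a sketch on made-up data, with only the `Make` column name taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy_df = pd.DataFrame(
    {"Make": ["BMW", "Toyota", "BMW", "Honda", "BMW", "Toyota"]}
)

# value_counts sorts by frequency in descending order by default
top_2 = toy_df["Make"].value_counts().head(2)

print(top_2)
```
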

In [None]:
bmw_df = cars_df[cars_df["Make"] == "BMW"]

list_top_n(bmw_df, col="Model", n=10)

Returning to BMW, we can see that out of the top 10 most popular BMW models 2 are different configurations of the X5 ("X5 xDrive40i", "X5 xDrive35i") and 3 are X3 ("X3 xDrive30i", "X3 sDrive30i", "X3 xDrive28i").

I would like to do the same for the top 30 car models from the whole data frame:

In [None]:
list_top_n(cars_df, col="Model", n=30)

Reviewing the top 30 models, I would make the **bold** assumption that, more often than not, the last word of the model name is the one that describes the extras, so removing that last word may produce different results for the same analysis.

In [None]:
cars_df[["ModelBase", "ModelExtras"]] = cars_df["Model"].str.rsplit(' ', n=1, expand=True)
cars_df.head()
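
One thing worth checking with this split is how it treats model names that consist of a single word. The sketch below uses made-up model strings (the single-word `"Corolla"` entry is an assumption for illustration) to show that `rsplit` with `n=1` and `expand=True` leaves the extras column empty in that case.

```python
import pandas as pd

# Hypothetical model names; the single-word entry is included on purpose
models = pd.Series(["X5 xDrive40i", "Grand Cherokee Limited", "Corolla"])

# Split once from the right: everything before the last space is the base
parts = models.str.rsplit(" ", n=1, expand=True)
parts.columns = ["ModelBase", "ModelExtras"]
print(parts)

# Single-word models have nothing to split off, so the extras are missing
no_extras = parts["ModelExtras"].isna()
print(no_extras.tolist())
```

Rows with missing extras keep the full name in `ModelBase`, so they still group correctly by base model, but they need to be handled before `ModelExtras` is used as a feature.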

Now that we have our models separated into base and extras, let's see what the top models are.

In [None]:
list_top_n(cars_df, col="ModelBase", n=30)

In [None]:
## TODO: Select a car model and justify it

corollas = cars_df[cars_df["ModelBase"] == "Corolla"]
list_top_n(corollas, col="ModelExtras", n=20)

## Run your experiment(s)

Describe how your experiment is done, and execute it.

Note: Be generous with your plots!

## Reach a conclusion

What was the result of your experiment?

How can it be improved?

Elaborate on one thing you learned during the capstone project.

## Congratulations

You have finished the bootcamp!

![Congratulations](images/congratulations.jpg)