In [1]:
# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn imports
from sklearn.feature_selection import chi2

# tensorflow imports
import tensorflow as tf
from tensorflow.keras.models import load_model, save_model, Sequential
from tensorflow.keras.layers import Dense, Dropout

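# setting style ("seaborn-colorblind" was renamed "seaborn-v0_8-colorblind" in matplotlib 3.6)
style = "seaborn-v0_8-colorblind" if "seaborn-v0_8-colorblind" in plt.style.available else "seaborn-colorblind"
plt.style.use(style)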

ImportError: Could not find the DLL(s) 'msvcp140_1.dll'. TensorFlow requires that these DLLs be installed in a directory that is named in your %PATH% environment variable. You may install these DLLs by downloading "Microsoft C++ Redistributable for Visual Studio 2015, 2017 and 2019" for your platform from this URL: https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads

# INTRODUCTION

This project was born from a question I asked myself while preparing for my MBA in the U.S. As many know, cars are the most common method of transportation in the U.S., since the availability of trains, buses, and other kinds of public transportation varies greatly by state and city.

Considering that, I asked myself: "How much would it cost to buy a car in the U.S.?" I immediately discarded new vehicles because they're substantially more expensive, depreciate rapidly, and simply didn't fit my budget. The only option was **used vehicles**.

The question then changed to: "How much would it cost to buy a used car in the U.S.?" It's sometimes difficult to find information that is reliable, scalable, and relevant over time, especially when P2P (peer-to-peer) markets are so heterogeneous.

In light of this, and building on my curiosity about ML / DL, I decided to build a model to **estimate the value / price of a used vehicle in the U.S**.

**SCOPE**: 
* For time reasons I will **exclude** any kind of **image analysis**.
* In-depth NLP tasks are not in scope of this analysis, but some tasks may be used for additional information.
* Data points such as VIN, which are used in history analysis, will also be excluded.

In general, this exercise attempts to create a prediction model with minimal inputs and without complex encodings and transformations.

# DATA SELECTION, CLEANING & PREPARATION

## DATA CLEANING

The first thing I'll do after loading the data is look for columns that have only one value or where every value is unique. Apart from geographical coordinates, features such as "ID" and other unique identifiers do not help the analysis.

In [None]:
# reading file and info
df = pd.read_csv("data/vehicles.csv")
df.nunique()
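Reading the `df.nunique()` output by eye works for a handful of columns, but the same check can be sketched in code. The snippet below is a minimal illustration, not part of the original notebook: the helper `flag_uninformative_columns` and the toy DataFrame are hypothetical, and the real `vehicles.csv` is not loaded here.

```python
import pandas as pd


def flag_uninformative_columns(frame: pd.DataFrame) -> dict:
    """Return columns that hold a single value or a distinct value on every row.

    Hypothetical helper for illustration; NaN is counted as a value.
    """
    counts = frame.nunique(dropna=False)
    constant = counts[counts <= 1].index.tolist()
    all_unique = counts[counts == len(frame)].index.tolist()
    return {"constant": constant, "all_unique": all_unique}


# Toy data (made up) standing in for the real listings file
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],                                  # unique per row
    "state": ["ca", "ca", "ny", "tx"],                           # informative
    "source": ["listing", "listing", "listing", "listing"],     # constant
})

flags = flag_uninformative_columns(toy)
print(flags)  # {'constant': ['source'], 'all_unique': ['id']}
```

Columns flagged as `all_unique` still need a manual look before dropping, since coordinates or free-text descriptions can be unique on every row and still carry useful information.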

We can observe that **ID** and **URL** are unique for each vehicle posting, so we can go ahead and discard them since they do not provide any useful insights.

Additionally, image analysis is not in the scope of this review, so I will also ignore **image_url**.

Finally, as mentioned in the scope, I will delete **VIN**, since this project will not query the car's history.

In [None]:
# deleting useless columns.
df = df.drop(columns = ["id", "url", "image_url", "VIN"])

Next, I'll analyze and discover missing data points and find out what to do with these columns!

In [None]:
df.info()

df.info() gives good insight into the missing information, but let's try a different way to see how much data is really missing.

In [None]:
# calculating all missing data points and normalizing to df size
df.isna().sum() * 100 / df.shape[0]

As we can see, columns such as:
* county -> 100% missing data
* size -> 71.76% missing data

have so much missing data that we can consider them impossible to fix, so we will just delete them.

In [None]:
# I'll delete the columns with too much missing data to be recoverable
df = df.drop(columns = ["county", "size"])
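Instead of naming the columns by hand, the drop can also be driven by a cutoff on the missing-data percentage. The sketch below assumes a 50% threshold, which is an arbitrary choice for illustration and not a value used in this notebook; the helper `columns_above_missing_threshold` and the toy data are likewise hypothetical.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame: pd.DataFrame, threshold_pct: float = 50.0) -> list:
    """List columns whose share of missing values exceeds threshold_pct.

    Hypothetical helper; the 50% default is an assumption, not a rule.
    """
    missing_pct = frame.isna().mean() * 100  # same as isna().sum() * 100 / number of rows
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Toy data (made up) with different amounts of missing values
toy = pd.DataFrame({
    "price": [5000, 7200, 3100, 9900],               # 0% missing
    "county": [np.nan, np.nan, np.nan, np.nan],      # 100% missing
    "size": ["compact", np.nan, np.nan, np.nan],     # 75% missing
    "fuel": ["gas", "gas", np.nan, "diesel"],        # 25% missing
})

to_drop = columns_above_missing_threshold(toy, threshold_pct=50.0)
trimmed = toy.drop(columns=to_drop)
print(to_drop)  # ['county', 'size']
```

A threshold keeps the decision reproducible if the dataset is refreshed, at the cost of hiding the case-by-case judgment that a manual list makes explicit.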

In [None]:
# let's explore some of the remaining columns to understand what kind of information we have at hand
# in particular, let's start with REGION & REGION_URL
df.info()

In [None]:
# first 10 unique values in region
df["region"].unique()[:10]

In [None]:
# first 10 unique values in region_url
df["region_url"].unique()[:10]

In general, region seems to be related to the seller's location. 