# Predicting Car Prices 

In [1]:
import numpy as np
import pandas as pd

from PythonScripts import wrangle , explore, scale, evaluate

import warnings
warnings.filterwarnings('ignore')

Executive 

## Wrangle

In [2]:
# This database has 3 million observations;
# we'll use 200,000 due to RAM limitations.
cars_df = wrangle.get_car_data(200000)
cars_df.shape

(200000, 66)

In [3]:
cars_df = wrangle.clean_car_data(cars_df)
cars_df.shape

<class 'pandas.core.series.Series'>


(87517, 41)

### Encode / Split / Scale
Here I will encode all categorical columns as numbers for the regression model.

In [None]:
encoded_cars = wrangle.encode_cars(cars_df)
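The internals of `wrangle.encode_cars` are not shown in this notebook. As a minimal sketch of one way such an encoding could work, the block below assigns an integer code to each category and stores it in a `<column>_num` column, which matches names like `make_name_num` used later. The toy data, the source column names `make_name` and `model_name`, and the helper `encode_categoricals` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame; values are made up and column names are assumed.
toy = pd.DataFrame({
    "make_name": ["Ford", "Toyota", "Ford", "BMW"],
    "model_name": ["F-150", "Camry", "Focus", "X5"],
    "price": [30000, 24000, 15000, 52000],
})

def encode_categoricals(df, columns):
    """Return a copy with a `<column>_num` integer code for each listed column."""
    out = df.copy()
    for col in columns:
        # pd.factorize assigns 0, 1, 2, ... in order of first appearance
        codes, _ = pd.factorize(out[col])
        out[f"{col}_num"] = codes
    return out

encoded = encode_categoricals(toy, ["make_name", "model_name"])
print(encoded)
```

Integer codes are compact, but a linear model will read them as ordered quantities, so one-hot encoding is a common alternative when the number of categories is small.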

In [None]:
train, validate, test = wrangle.split_for_model(encoded_cars)
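The split proportions used by `wrangle.split_for_model` are not stated here. The sketch below shows one plausible three-way split, shuffling once and slicing; the 60/20/20 proportions, the seed, and the helper name `split_three_ways` are assumptions.

```python
import numpy as np
import pandas as pd

def split_three_ways(df, validate_frac=0.2, test_frac=0.2, seed=123):
    """Shuffle the rows once, then slice into train / validate / test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_test = int(len(df) * test_frac)
    n_validate = int(len(df) * validate_frac)
    test = shuffled.iloc[:n_test]
    validate = shuffled.iloc[n_test:n_test + n_validate]
    train = shuffled.iloc[n_test + n_validate:]
    return train, validate, test

toy = pd.DataFrame({"price": np.arange(100)})
toy_train, toy_validate, toy_test = split_three_ways(toy)
print(len(toy_train), len(toy_validate), len(toy_test))
```

Keeping a separate validate set lets the models be compared without touching the test set until the very end.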

In [None]:
train_scaled, validate_scaled, test_scaled = scale.scale_data(
    train, validate, test,
    scale_type='Robust',
    to_scale=[
        'back_legroom', 'city_fuel_economy', 'daysonmarket',
        'engine_displacement', 'front_legroom', 'fuel_tank_volume',
        'height', 'highway_fuel_economy', 'horsepower', 'length',
        'mileage', 'wheelbase', 'width',
    ],
)
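How `scale.scale_data` implements `scale_type='Robust'` is not shown. Assuming it follows the usual robust-scaling recipe (subtract the median, divide by the interquartile range, as scikit-learn's `RobustScaler` does by default), a pandas-only sketch looks like this. The helper `robust_scale` and the toy values are hypothetical.

```python
import pandas as pd

def robust_scale(train, others, columns):
    """Scale `columns` as (x - median) / IQR using statistics from `train` only."""
    median = train[columns].median()
    iqr = train[columns].quantile(0.75) - train[columns].quantile(0.25)
    scaled = []
    for df in [train, *others]:
        out = df.copy()
        # Reuse the train statistics so no information leaks from validate/test
        out[columns] = (df[columns] - median) / iqr
        scaled.append(out)
    return scaled

toy_train = pd.DataFrame({"mileage": [10.0, 20.0, 30.0, 40.0, 50.0], "price": [5, 4, 3, 2, 1]})
toy_validate = pd.DataFrame({"mileage": [30.0, 70.0], "price": [3, 0]})
toy_train_s, toy_validate_s = robust_scale(toy_train, [toy_validate], ["mileage"])
print(toy_validate_s)
```

Because the median and IQR are less sensitive to extreme values than the mean and standard deviation, this kind of scaling suits columns with long tails. A column whose IQR is zero would need special handling, which this sketch omits.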

### Exploration

In [None]:
explore.get_distribution(train_scaled.drop(columns=['vin','city','dealer_zip']))

### Target = Price

In [None]:
explore.graph_to_target(train_scaled.sample(2500).drop(columns=['vin','city','dealer_zip']),'price')

In [None]:
explore.get_heatmap(train_scaled, 'price')

### Takeaways
- Most distributions are right-tailed, since there are more inexpensive cars than luxury cars.
- Year is left-tailed: this data was scraped from CarGurus, and the majority of listings at the time of scraping were newer vehicles.
- Based on research, we want the user to supply the year, make, model, trim level, and mileage of the car, and use those to look up the other factors needed to predict the price. At minimum the user should know the year, make, and model; if the user also knows the trim level, that points us to other useful information for the model (such as engine type or horsepower).

### Statistical testing
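Let's make sure year, make, and model are correlated with price, then we'll use RFE to see what else plays a role in predicting car prices. Many of our variables are not normally distributed, and the significance test for Pearson's r assumes roughly normal data, so treat the Pearson results below as a rough guide; a rank-based measure such as Spearman's would be the more robust choice.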

In [None]:
cont_var = ['year', 'mileage']
cat_var = ['make_name_num', 'model_name_num']
evaluate.get_t_test(cat_var, train_scaled, 'price', 0.05)
evaluate.get_pearsons(cont_var,'price',0.05,train_scaled)
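The `evaluate.get_pearsons` helper is not shown, so the following is only a sketch of the kind of test it presumably wraps, using `scipy.stats.pearsonr` on synthetic data. It also runs `scipy.stats.spearmanr` as a rank-based cross-check, which does not assume normality. The synthetic mileage and price values are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mileage = rng.uniform(0, 150_000, size=500)
# Synthetic price that falls with mileage, plus noise
price = 40_000 - 0.15 * mileage + rng.normal(0, 3_000, size=500)

alpha = 0.05

# Pearson's r measures linear association
r, p = stats.pearsonr(mileage, price)
# Spearman's rho measures monotonic association on ranks
rho, p_rank = stats.spearmanr(mileage, price)

for name, stat, pval in [("pearson", r, p), ("spearman", rho, p_rank)]:
    verdict = "reject H0" if pval < alpha else "fail to reject H0"
    print(f"{name}: stat={stat:.3f}, p={pval:.3g} -> {verdict}")
```

When the two statistics agree in sign and rough size, the non-normality of the inputs is unlikely to be driving the conclusion.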

### Feature Elimination
- Which features do k-best and RFE select as the best drivers?

In [None]:
X_train = train_scaled.drop(columns=['vin','price']).select_dtypes(exclude='object')
y_train = train_scaled.price

X_validate = validate_scaled.drop(columns=['vin','price']).select_dtypes(exclude='object')
y_validate = validate_scaled.price

X_test = test_scaled.drop(columns=['vin','price']).select_dtypes(exclude='object')
y_test = test_scaled.price

In [None]:
kbest = evaluate.select_kbest(X_train, y_train, 10)

In [None]:
kbest

In [None]:
rfe = evaluate.select_rfe(X_train, y_train, 10)

In [None]:
rfe
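For reference, here is a sketch of what `evaluate.select_kbest` and `evaluate.select_rfe` might do under the hood, assuming they wrap scikit-learn's `SelectKBest` and `RFE`. The synthetic features and the choice of `LinearRegression` as the RFE estimator are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["year", "mileage", "noise_a", "noise_b"],
)
# Only the first two columns actually drive the synthetic target
y = 3.0 * X["year"] - 2.0 * X["mileage"] + rng.normal(scale=0.1, size=300)

# k-best scores each feature on its own with a univariate F-test
kbest_sel = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kbest_features = X.columns[kbest_sel.get_support()].tolist()

# RFE refits the model repeatedly, dropping the weakest feature each round
rfe_sel = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_features = X.columns[rfe_sel.support_].tolist()

print(kbest_features, rfe_features)
```

The two methods can disagree on real data: k-best ignores interactions between features, while RFE depends on the coefficients of the chosen estimator and so is sensitive to feature scale.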

##### We will take the year, make, model, mileage, and trim ID as input from the user, then use the dataset to look up the average horsepower, average city fuel economy, and most common wheel system for that car.


In [None]:
my_list = ['year', 'make_name_num', 'model_name_num', 'mileage', 'trimId', 'horsepower','engine_displacement']
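The lookup step described above (filling in typical specs from what the user tells us) could be sketched with a `groupby`. The column names `make_name`, `model_name`, and `wheel_system`, along with the toy listings, are assumptions for illustration.

```python
import pandas as pd

# Toy listings; values and column names are assumed.
listings = pd.DataFrame({
    "year": [2018, 2018, 2018, 2020],
    "make_name": ["Ford", "Ford", "Ford", "BMW"],
    "model_name": ["F-150", "F-150", "F-150", "X5"],
    "horsepower": [290.0, 325.0, 375.0, 335.0],
    "wheel_system": ["4WD", "4WD", "RWD", "AWD"],
})

keys = ["year", "make_name", "model_name"]
lookup = listings.groupby(keys).agg(
    horsepower=("horsepower", "mean"),  # average for numeric specs
    wheel_system=("wheel_system", lambda s: s.mode().iloc[0]),  # most common category
)

# What the user would supply: year, make, model
user_car = (2018, "Ford", "F-150")
filled = lookup.loc[user_car]
print(filled)
```

A car that never appears in the dataset would raise a `KeyError` here, so a real user-facing version would need a fallback such as make-level averages.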

# Modeling

### Baseline

In [None]:
target = 'price'

In [None]:
baseline = evaluate.baseline_errors(y_train)[2]

In [None]:
baseline
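What `evaluate.baseline_errors` returns at index 2 is not documented in this notebook. A common baseline for regression is the RMSE obtained by always predicting the mean of the target, and the sketch below assumes that is the intended quantity. The helper `baseline_rmse` and the toy prices are hypothetical.

```python
import numpy as np

def baseline_rmse(y):
    """RMSE of a model that always predicts the mean of y."""
    y = np.asarray(y, dtype=float)
    residuals = y - y.mean()
    return float(np.sqrt(np.mean(residuals ** 2)))

toy_prices = [10_000, 20_000, 30_000]
print(round(baseline_rmse(toy_prices), 2))
```

Any model worth keeping should beat this number on both train and validate.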

### OLS 

In [None]:
ols_train = evaluate.get_model_results(X_train[my_list], y_train, X_train[my_list], y_train, target, normalize=True)  

In [None]:
ols_validate = evaluate.get_model_results(X_train[my_list], y_train, X_validate[my_list], y_validate, target, normalize=True)   

### Lasso Lars 

In [None]:
lasso_train = evaluate.get_model_results(X_train[my_list], y_train, X_train[my_list], y_train, target,model='lasso', alpha= .01 )  

In [None]:
lasso_validate = evaluate.get_model_results(X_train[my_list], y_train, X_validate[my_list], y_validate, target,model='lasso', alpha= .01 )  

### Tweedie Regressor (GLM)

In [None]:
glm_train  = evaluate.get_model_results(X_train[my_list], y_train, X_train[my_list], y_train, target,model='glm', power = 1)  

In [None]:
glm_validate = evaluate.get_model_results(X_train[my_list], y_train, X_validate[my_list], y_validate, target,model='glm', power = 1)  

### Polynomial Regression

In [None]:
poly_train = evaluate.get_model_results(X_train[my_list], y_train, X_train[my_list], y_train, target,model='poly', degree = 3)

In [None]:
poly_validate = evaluate.get_model_results(X_train[my_list], y_train, X_validate[my_list], y_validate, target,model='poly', degree = 3)
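As a rough sketch of what `model='poly'` with `degree = 3` might correspond to, the block below assumes the common scikit-learn approach of expanding the features with `PolynomialFeatures` and fitting a `LinearRegression` on the result. The synthetic data, the `toy_` names, and the helper `fit_and_score` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)

def make_data(n):
    """Synthetic data with a cubic relationship plus noise."""
    X = rng.uniform(-2, 2, size=(n, 1))
    y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.1, size=n)
    return X, y

toy_X_train, toy_y_train = make_data(200)
toy_X_validate, toy_y_validate = make_data(50)

def fit_and_score(degree):
    """Fit on train, return (train RMSE, validate RMSE)."""
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(toy_X_train, toy_y_train)

    def rmse(X, y):
        return float(np.sqrt(mean_squared_error(y, model.predict(X))))

    return rmse(toy_X_train, toy_y_train), rmse(toy_X_validate, toy_y_validate)

linear_rmse = fit_and_score(1)
cubic_rmse = fit_and_score(3)
print("degree 1:", linear_rmse)
print("degree 3:", cubic_rmse)
```

Comparing train and validate RMSE side by side is the quickest overfitting check: a higher degree that lowers train error while raising validate error is fitting noise.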

#### Test on 3rd degree polynomial


In [None]:
# Use the same feature subset as the train and validate runs
poly_test = evaluate.get_model_results(X_train[my_list], y_train, X_test[my_list], y_test, target, model='poly', degree=3)

# Exploration of error
- Further exploration suggests the data may need to be split across more than one model: the current model struggles to predict super-luxury cars accurately because it is weighed down by the vast amount of data from the regular car market.

In [None]:
test_scaled['predictions'] = np.round(poly_test[0],2)

In [None]:
test_scaled['Diff'] = abs(test_scaled.price - test_scaled.predictions)

In [None]:
extreme_error_cases = test_scaled[(test_scaled['Diff'] >= 5000)]

In [None]:
extreme_error_cases= extreme_error_cases[['vin', 'horsepower', 'mileage','price', 'predictions', 'Diff']]

In [None]:
extreme_error_cases.sort_values(by="Diff")
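One way to check whether the errors really concentrate in the high-end market is to summarize the absolute error by price band. This is a sketch on made-up numbers; the band edges (50,000 and 100,000) and the band labels are assumptions, not thresholds taken from the dataset.

```python
import pandas as pd

# Made-up results; in the notebook these columns live on test_scaled.
results = pd.DataFrame({
    "price": [12_000, 18_000, 25_000, 40_000, 150_000, 220_000],
    "predictions": [13_000, 17_000, 27_000, 37_000, 90_000, 300_000],
})
results["Diff"] = (results["price"] - results["predictions"]).abs()

# Assumed price bands for illustration
bands = pd.cut(
    results["price"],
    bins=[0, 50_000, 100_000, float("inf")],
    labels=["regular", "premium", "super_luxury"],
)

# observed=True drops bands that contain no rows
summary = results.groupby(bands, observed=True)["Diff"].agg(["count", "mean"])
print(summary)
```

If the mean error in the top band is many times that of the regular band, that supports training a separate model for that segment.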

# Conclusion
- As shown above, supercars are valued either far too low or far too high, suggesting that these cars should be separated into their own market.
- The model also has trouble valuing classic (collectible) cars: older cars that still hold value, because cars don't always depreciate over time.
- The model might rely too much on horsepower as an indicator of price.
- We will revisit and split the data into premium and regular markets to improve accuracy.
- We will also revisit classic car handling.