
---
# Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

---

In [15]:
# !pip install pandas
# !pip install seaborn
# !pip install plotly
# !pip install matplotlib
# !pip install scipy
# !pip install scikit-learn
# !pip install numpy

***ALL IMPORTS FOR FUTURE PHASES***

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from scipy import stats
import numpy as np

***LOAD DATASET***

In [17]:
# Load the dataset
df = pd.read_csv('data/vehicles.csv')

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: float64(2), int64(2), object(14)

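***DROP UNNECESSARY COLUMNS (id, VIN, size)***
- 'id' is a listing identifier and is not relevant to price
- 'VIN' is a vehicle identifier and is not relevant to price
- 'size' is missing for roughly 3/4 of the rows (only 120,519 of 426,880 are non-null), so keeping it would lead to inaccuracy
- the 'model' column may also need to be removed because of parsing issues with the regression models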
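The share of missing values per column can be checked directly rather than read off `df.info()`. The sketch below uses a small made-up frame (`toy`) and a hypothetical cutoff (`MISSING_CUTOFF`); neither comes from the notebook, and the cutoff is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (all values are made up)
toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "size": [np.nan, np.nan, "full-size", np.nan],
    "VIN": ["A1", None, None, "B2"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: flag columns that are missing more than half of their values
MISSING_CUTOFF = 0.5
mostly_missing = missing_share[missing_share > MISSING_CUTOFF].index.tolist()
print(mostly_missing)
```

On the real data, the same `df.isna().mean()` call would give the missing share for 'size' and every other column in one step.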

In [19]:
# DROP UNNECESSARY COLUMNS (id, VIN, size)
# errors='ignore' skips columns that are already gone, so the cell can be re-run safely
df = df.drop(columns=['id', 'VIN', 'size'], errors='ignore')

***DROP ROWS WITH MISSING 'price'***

In [20]:
df = df.dropna(subset=['price']) 

***CONVERT COLUMN 'YEAR' INTO INTEGER***

In [21]:
# Convert 'year' to integer, handling errors by coercing
df['year'] = pd.to_numeric(df['year'], errors='coerce').astype('Int32')
df.sample()

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,type,paint_color,state
97265,gainesville,11799,2010,ford,f150 xl super cab,,8 cylinders,gas,112986.0,clean,automatic,,,,fl


***ADDING 2 NEW COLUMNS***
- adding the 'age' of the car to show how old it is in comparison to the current year
- adding the OPTION for conversion from 'miles' to 'kilometers' to allow universal interpretation for client.

In [22]:
# Create a new feature 'age' based on the 'year' of the car
df['age'] = 2023 - df['year']

# Convert odometer from miles to kilometers
# df['odometer_km'] = df['odometer'] * 1.60934

df.sample()

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,type,paint_color,state,age
334689,philadelphia,13500,2011,chevrolet,camaro,excellent,6 cylinders,gas,81000.0,clean,automatic,rwd,coupe,grey,pa,12


#### ***HANDLE MISSING VALUES***

2 METHODS:
1. remove every row with missing data, then run the models
2. fill the missing data using the mean, median, or mode of each column

**METHOD 1:**

In [23]:
# removing all rows with missing values
df_nan_removed = df.dropna()

# Verify that there are no more missing values
# print(df_nan_removed.isnull().any())
print(df_nan_removed.info())
df_nan_removed

<class 'pandas.core.frame.DataFrame'>
Index: 115988 entries, 31 to 426878
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        115988 non-null  object 
 1   price         115988 non-null  int64  
 2   year          115988 non-null  Int32  
 3   manufacturer  115988 non-null  object 
 4   model         115988 non-null  object 
 5   condition     115988 non-null  object 
 6   cylinders     115988 non-null  object 
 7   fuel          115988 non-null  object 
 8   odometer      115988 non-null  float64
 9   title_status  115988 non-null  object 
 10  transmission  115988 non-null  object 
 11  drive         115988 non-null  object 
 12  type          115988 non-null  object 
 13  paint_color   115988 non-null  object 
 14  state         115988 non-null  object 
 15  age           115988 non-null  Int32  
dtypes: Int32(2), float64(1), int64(1), object(12)
memory usage: 14.4+ MB
None


Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,type,paint_color,state,age
31,auburn,15000,2013,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,rwd,truck,black,al,10
32,auburn,27990,2012,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,clean,other,4wd,pickup,black,al,11
33,auburn,34590,2016,chevrolet,silverado 1500 double,good,6 cylinders,gas,29499.0,clean,other,4wd,pickup,silver,al,7
34,auburn,35000,2019,toyota,tacoma,excellent,6 cylinders,gas,43000.0,clean,automatic,4wd,truck,grey,al,4
35,auburn,29990,2016,chevrolet,colorado extended cab,good,6 cylinders,gas,17302.0,clean,other,4wd,pickup,red,al,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426859,wyoming,48590,2020,cadillac,xt6 premium luxury,good,6 cylinders,gas,7701.0,clean,other,fwd,other,black,wy,3
426860,wyoming,39990,2017,infiniti,qx80 sport utility 4d,good,8 cylinders,gas,41664.0,clean,automatic,4wd,other,black,wy,6
426866,wyoming,32990,2016,infiniti,qx80 sport utility 4d,good,8 cylinders,gas,55612.0,clean,automatic,rwd,other,black,wy,7
426874,wyoming,33590,2018,lexus,gs 350 sedan 4d,good,6 cylinders,gas,30814.0,clean,automatic,rwd,sedan,white,wy,5


**METHOD 2:**

In [24]:
# Handling missing values
# Check the data type of the column
# Replace NaN with 'UNKNOWN' for object columns
# Replace NaN with 0 for numerical columns
# for column in df.columns:
#     if df.dtypes[column] == object:
#         df[column].fillna(f'UNKNOWN_{column}', inplace=True) 
#     else: 
#         df[column].fillna(0, inplace=True)  


# Verify that there are no more missing values
# print(df.isnull().any())
# print(df.info())
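Method 2 as described in the list (median for numeric columns, mode for categorical ones) could look like the sketch below. The frame `toy` and its values are made up for illustration; this is one possible imputation strategy, not the one the notebook ends up using.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up)
toy = pd.DataFrame({
    "odometer": [120000.0, np.nan, 45000.0, 80000.0],
    "fuel": ["gas", "gas", None, "diesel"],
})

filled = toy.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Numeric column: fill with the median, which is robust to extreme values
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Categorical column: fill with the most frequent value
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled)
```

Assigning the result back to the column avoids the chained `inplace=True` pattern, which newer pandas versions warn about.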

***HANDLE OUTLIERS USING METHODS:***
- Z-Score
- Interquartile Range (IQR)

***Z-Score***

In [25]:
# Calculate Z-scores for 'price' and 'odometer'
# 'odometer' contains NaN; nan_policy='omit' ignores them when computing the mean and std
# (with the default policy every odometer Z-score would be NaN and nothing would be flagged)
z_scores_price = np.asarray(stats.zscore(df['price']))
z_scores_odometer = np.asarray(stats.zscore(df['odometer'], nan_policy='omit'))

# Define a threshold for outliers (e.g., 3 or -3)
threshold = 3

# Identify outliers as boolean masks; rows with a missing odometer compare False and are kept
outliers_price = np.abs(z_scores_price) > threshold
outliers_odometer = np.abs(z_scores_odometer) > threshold

# Remove outliers from the original DataFrame 'df'
df = df[~(outliers_price | outliers_odometer)]
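Because 'odometer' has missing values, it helps to know how `stats.zscore` treats NaN. The small sketch below uses a made-up array (`odometer`) to compare the default behaviour with `nan_policy="omit"`.

```python
import numpy as np
from scipy import stats

# Made-up readings with one missing value
odometer = np.array([10.0, 12.0, 11.0, np.nan, 13.0])

# Default policy ('propagate'): the NaN poisons the mean and std, so every Z-score is NaN
z_default = stats.zscore(odometer)

# 'omit': NaN entries are ignored when computing the mean and std
z_omit = stats.zscore(odometer, nan_policy="omit")

print(np.isnan(z_default).all())   # are all default Z-scores NaN?
print(np.isnan(z_omit).sum())      # how many Z-scores are still NaN with 'omit'?
```

With the default policy, a comparison such as `np.abs(z) > 3` is False for every row, so no outliers are flagged for that column.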


***Interquartile Range (IQR)***

In [26]:
# Calculate IQR for 'price' and 'odometer'
Q1_price = df['price'].quantile(0.25)
Q3_price = df['price'].quantile(0.75)
IQR_price = Q3_price - Q1_price

Q1_odometer = df['odometer'].quantile(0.25)
Q3_odometer = df['odometer'].quantile(0.75)
IQR_odometer = Q3_odometer - Q1_odometer

# Define the IQR thresholds for outliers
lower_bound_price = Q1_price - 1.5 * IQR_price
upper_bound_price = Q3_price + 1.5 * IQR_price

lower_bound_odometer = Q1_odometer - 1.5 * IQR_odometer
upper_bound_odometer = Q3_odometer + 1.5 * IQR_odometer

# Identify outliers for 'odometer'
outliers_odometer = df[(df['odometer'] < lower_bound_odometer) | (df['odometer'] > upper_bound_odometer)].index

# Remove outliers for 'odometer'
df_no_outliers = df.drop(outliers_odometer)

# Identify outliers for 'price' in the updated DataFrame
outliers_price = df_no_outliers[(df_no_outliers['price'] < lower_bound_price) | (df_no_outliers['price'] > upper_bound_price)].index

# Remove outliers for 'price'
df = df_no_outliers.drop(outliers_price)
df

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,type,paint_color,state,age
0,prescott,6000,,,,,,,,,,,,,az,
1,fayetteville,11900,,,,,,,,,,,,,ar,
2,florida keys,21000,,,,,,,,,,,,,fl,
3,worcester / central MA,1500,,,,,,,,,,,,,ma,
4,greensboro,4900,,,,,,,,,,,,,nc,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,wyoming,23590,2019,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,fwd,sedan,,wy,4
426876,wyoming,30590,2020,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,fwd,sedan,red,wy,3
426877,wyoming,34990,2020,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,,hatchback,white,wy,3
426878,wyoming,28990,2018,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,fwd,sedan,silver,wy,5
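The IQR rule above repeats the same steps for each column, so it can be wrapped in a small helper. This is a sketch under assumptions: the function name `iqr_bounds`, the parameter `k`, and the toy prices are illustrative and not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one obviously extreme listing
toy = pd.DataFrame({"price": [5000, 7000, 9000, 11000, 13000, 900000]})

low, high = iqr_bounds(toy["price"])

# Keep rows that are not outside the bounds (rows with NaN would be kept, as in the cell above)
is_outlier = (toy["price"] < low) | (toy["price"] > high)
kept = toy[~is_outlier]

print(low, high)
print(kept)
```

The same helper could be called once for 'price' and once for 'odometer', which keeps the two filters consistent.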


***SEPARATE THE NUMERIC AND CATEGORICAL FEATURES***

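- Both dataframes have the same columns, so I only need to inspect one of them
- In this case I use `df_nan_removed`, the dataframe used for modeling below. It was created in Method 1, before the outlier removal, so the Z-score and IQR filters above do not apply to it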

In [27]:
categorical_features = df_nan_removed.select_dtypes(include=['object']).columns.tolist()
numerical_features = df_nan_removed.columns.difference(categorical_features + ['price'])

print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)


Numerical Features: Index(['age', 'odometer', 'year'], dtype='object')
Categorical Features: ['region', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state']


---

#### ***PREPARING FOR THE MODELING PHASE***
- Encode categorical features
- Separate the target feature 'price'
- Split the data into training and testing sets

---

***APPLY PRINCIPAL COMPONENT ANALYSIS (PCA)***

***APPLY STANDARD SCALER TO L1 & L2***
- this prevents features with large scales from biasing the L1 (Lasso) and L2 (Ridge) penalties
- improves training speed

In [28]:

# Identify high cardinality categorical features (for example, features with more than 10 unique values)
high_cardinality_features = [col for col in categorical_features if df_nan_removed[col].nunique() > 10]

# Identify low cardinality categorical features (a list comprehension keeps the column order stable)
low_cardinality_features = [col for col in categorical_features if col not in high_cardinality_features]

# Define the preprocessing pipeline
# The encoders keep their default sparse output: a dense one-hot matrix for
# columns such as 'model' and 'region' would be very large
preprocessor = ColumnTransformer(
    transformers=[
        ('low_cardinality', OneHotEncoder(handle_unknown='ignore'), low_cardinality_features),
        ('high_cardinality', OneHotEncoder(handle_unknown='ignore'), high_cardinality_features),
        ('scaler', StandardScaler(), numerical_features)
    ],
    remainder='passthrough'
)

In [29]:
# Define the preprocessing pipeline
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features),
#         ('scaler', StandardScaler(), numerical_features)
#     ],
#     remainder='passthrough'
# )

# Create the final pipeline with PCA and the linear regression model
# n_components in PCA() as needed
# linear_pipeline = Pipeline([
#     ('preprocessor', preprocessor),
#     ('pca', PCA()),
#     ('model', LinearRegression())
# ])
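# The commented-out PCA pipeline above is replaced by the version below.
# TruncatedSVD is used instead of PCA because it accepts sparse input,
# which matters for the high-cardinality features 'model' and 'region'.
# TruncatedSVD defaults to n_components=2; set n_components explicitly as needed.
linear_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('pca', TruncatedSVD()),  # step name kept as 'pca' although the reducer is TruncatedSVD
    ('model', LinearRegression())
])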


# Create a new pipeline for Lasso regression
lasso_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False)),  # StandardScaler for L1 regularization
    ('model', Lasso())
])

# Create a new pipeline for Ridge regression
ridge_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False)),  # StandardScaler for L2 regularization
    ('model', Ridge())
])
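To see what the regression step receives after dimensionality reduction, the sketch below builds a small version of the same kind of pipeline on made-up data. The names `toy_X`, `toy_y`, `toy_preprocessor`, and `toy_pipeline` are illustrative, and `n_components=3` is an arbitrary choice for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up listings: two categorical columns and one numeric column
toy_X = pd.DataFrame({
    "manufacturer": ["ford", "gmc", "toyota", "ford", "lexus", "gmc"],
    "fuel": ["gas", "gas", "gas", "diesel", "gas", "diesel"],
    "odometer": [128000.0, 68696.0, 43000.0, 95000.0, 30814.0, 150000.0],
})
toy_y = pd.Series([15000, 27990, 35000, 18000, 33590, 12000])

toy_preprocessor = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["manufacturer", "fuel"]),
    ("scaler", StandardScaler(), ["odometer"]),
])

toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    # n_components is set explicitly; the default would keep only 2 components
    ("svd", TruncatedSVD(n_components=3, random_state=0)),
    ("model", LinearRegression()),
])
toy_pipeline.fit(toy_X, toy_y)

# The linear model has one coefficient per SVD component, not per original column
n_coefficients = toy_pipeline.named_steps["model"].coef_.shape[0]
explained = toy_pipeline.named_steps["svd"].explained_variance_ratio_.sum()
print(n_coefficients, explained)
```

The summed `explained_variance_ratio_` gives a rough sense of how much information the chosen number of components retains, which can guide the choice of `n_components`.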

In [30]:
# Split the data into training and testing sets
X = df_nan_removed.drop('price', axis=1)
y = df_nan_removed['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


---
# Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
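Exploring parameters with cross-validation can be sketched with `GridSearchCV`. The example below runs on synthetic data (`X_demo`, `y_demo`) and an arbitrary grid of `alpha` values for Ridge; all of these are assumptions for illustration and not results from the vehicles dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: the target depends on the first two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5000 * X_demo[:, 0] - 2000 * X_demo[:, 1] + rng.normal(scale=500, size=200)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"model__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
best_rmse = -search.best_score_
print(best_alpha, best_rmse)
```

The same pattern would apply to the Lasso and Ridge pipelines defined earlier by pointing `param_grid` at their `model__alpha` parameter.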

---

---

#### RUNNING 5 MODELS:

- ***MODEL 1: LINEAR REGRESSION***
- ***MODEL 2: LASSO REGRESSION***
- ***MODEL 3: RIDGE REGRESSION***
- ***MODEL 4: RANDOM FOREST REGRESSOR***
- ***MODEL 5: GRADIENT BOOSTING REGRESSOR***


In [None]:
# Model 1
# Cross-validate the linear regression pipeline (the reduction step is TruncatedSVD, not PCA)
linear_scores_pca = cross_val_score(linear_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
linear_rmse_pca = np.sqrt(-linear_scores_pca.mean())
print("Linear Regression with TruncatedSVD RMSE:", linear_rmse_pca)

---

In [None]:
# Model 2
# Fit and evaluate the Lasso regression model
lasso_scores = cross_val_score(lasso_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
lasso_rmse = np.sqrt(-lasso_scores.mean())
print("Lasso Regression RMSE:", lasso_rmse)

In [None]:
# Model 3
# Fit and evaluate the Ridge regression model
ridge_scores = cross_val_score(ridge_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
ridge_rmse = np.sqrt(-ridge_scores.mean())
print("Ridge Regression RMSE:", ridge_rmse)


---

In [None]:
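# Model 4
# The raw features contain strings, so the regressor needs the same preprocessor as the other models
# rf_pipeline = Pipeline([('preprocessor', preprocessor), ('model', RandomForestRegressor())])
# rf_scores = cross_val_score(rf_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
# rf_rmse = np.sqrt(-rf_scores.mean())
# print("Random Forest Regression RMSE:", rf_rmse)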


In [None]:
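# Model 5
# The raw features contain strings, so the regressor needs the same preprocessor as the other models
# gb_pipeline = Pipeline([('preprocessor', preprocessor), ('model', GradientBoostingRegressor())])
# gb_scores = cross_val_score(gb_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
# gb_rmse = np.sqrt(-gb_scores.mean())
# print("Gradient Boosting Regression RMSE:", gb_rmse)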



---
# Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
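One way to get at the drivers of price without reading coefficients directly is permutation importance, which measures how much a model's score drops when a single input column is shuffled. The sketch below runs on synthetic data in which price is built from 'age' and 'odometer' only; the data, the column effects, and the names `demo_X`, `demo_y`, and `demo_pipeline` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings: price depends on age and odometer, not on paint color
rng = np.random.default_rng(0)
n = 300
demo_X = pd.DataFrame({
    "age": rng.integers(1, 20, size=n),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "paint_color": rng.choice(["black", "white", "red"], size=n),
})
demo_y = (40_000 - 1_500 * demo_X["age"] - 0.05 * demo_X["odometer"]
          + rng.normal(scale=1_000, size=n))

X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["paint_color"]),
        ("scaler", StandardScaler(), ["age", "odometer"]),
    ])),
    ("model", Ridge()),
])
demo_pipeline.fit(X_tr, y_tr)

# Shuffle one original column at a time on held-out data and record the drop in R-squared
result = permutation_importance(demo_pipeline, X_te, y_te, n_repeats=10, random_state=0)
ranked = pd.Series(result.importances_mean, index=demo_X.columns).sort_values(ascending=False)
print(ranked)
```

Because the shuffling happens on the original columns before preprocessing, the ranking is expressed in terms a dealer would recognize (age, mileage, color) rather than encoded or reduced features.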

---

***IMPORTS***

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

***EVALUATING MODEL 1: LINEAR REGRESSION***

In [None]:
# Fit the linear regression pipeline (with TruncatedSVD) on the training set
linear_pipeline.fit(X_train, y_train)

# Out-of-fold predictions from cross-validation
y_pred = cross_val_predict(linear_pipeline, X_train, y_train, cv=5)

# Calculate residuals
residuals = y_train - y_pred

# Plot residuals against predicted values; the dashed line marks zero error
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, c='blue', alpha=0.3, label='Residuals')
plt.axhline(0, color='red', linestyle='--', label='Zero error')
plt.title('Residuals Plot for Linear Regression with TruncatedSVD')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.legend()
plt.show()

In [None]:
# Make predictions on the test set
# Calculate RMSE and R-squared on the test set
y_test_pred = linear_pipeline.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)

# Use cross_val_score for cross-validation
cross_val_scores = cross_val_score(linear_pipeline, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
cross_val_rmse = np.sqrt(-cross_val_scores.mean())

# The model is fit on the TruncatedSVD components, not on the original columns,
# so there is one coefficient per component (2 with the default n_components)
coefficients = linear_pipeline.named_steps['model'].coef_
feature_names = [f"svd_component_{i}" for i in range(len(coefficients))]

print("Test Set RMSE:", test_rmse)
print("Test Set R-squared:", test_r2)
print("Cross-Validation RMSE:", cross_val_rmse)
print("Coefficients:", coefficients)

# Map coefficients to components
# Print the mapping
coefficients_mapping = dict(zip(feature_names, coefficients))
print("Coefficients Mapping:")
for feature, coefficient in coefficients_mapping.items():
    print(f"{feature}: {coefficient}")


***EVALUATING MODEL 2: LASSO REGRESSION***

***EVALUATING MODEL 3: RIDGE REGRESSION***

***EVALUATING MODEL 4: RANDOM FOREST REGRESSOR***

***EVALUATING MODEL 5: GRADIENT BOOSTING REGRESSOR***


---
# Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

---