# **Supervised Model Example**
#### **Description**: This dataset is a compiled set of Used Car Listings extracted from www.truecar.com. We will be creating a Decision Tree to predict the price of a given listing.
##### **NOTE**: We are using a Decision Tree since the Categorical Features (e.g., make, model, trim) will greatly influence the price.

### **Step 1. Import Required Libraries**

In [None]:
# data wrangling libraries
import pandas as pd

# data pre-processing libraries
from sklearn.preprocessing import LabelEncoder

# train and fit model
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

### **Step 2. Preview Training Data**

In [None]:
training_data = pd.read_parquet('clean_car_listings.parquet')
training_data

### **Step 3. Perform Exploratory Data Analysis**
#### For the purposes of this exercise, we will skip this step to keep things simple.

### **Step 4. Pre-Process Training Data**
##### scikit-learn models expect numeric inputs, so every Categorical (text) variable must be converted into a numerical equivalent. We will use a technique known as Label Encoding, which assigns each distinct category an integer code.

In [None]:
# Categorical (text) columns to encode. 'mileage' is left out on the assumption
# that it is already numeric: label-encoding it would replace real distances with
# rank codes that no longer match the raw mileage passed to predict() in Step 8.
categorical_columns = ['make', 'model', 'trim', 'exterior_color', 'interior_color',
                       'usage_type', 'city', 'state']

# Keep one fitted encoder per column so codes can be mapped back to labels later
encoders = {}
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    training_data[column] = encoders[column].fit_transform(training_data[column])

training_data
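
##### To make the encoding concrete, here is a minimal sketch on a toy column. The values in `makes` are invented for illustration and are not taken from the listings dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration (not from the listings dataset)
makes = pd.Series(['Toyota', 'Ford', 'Honda', 'Ford'])

make_encoder = LabelEncoder()
codes = make_encoder.fit_transform(makes)

# classes_ holds the sorted unique labels; each label's code is its index there
print(list(make_encoder.classes_))  # ['Ford', 'Honda', 'Toyota']
print(codes.tolist())               # [2, 0, 1, 0]

# A fitted encoder can map the integer codes back to the original labels
decoded = make_encoder.inverse_transform(codes)
print(list(decoded))                # ['Toyota', 'Ford', 'Honda', 'Ford']
```

##### Note that the codes follow alphabetical order, not any real ranking of the categories. A Decision Tree can still split on them, but the numeric order itself carries no meaning.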

### **Step 5. Split Training Data into Training and Testing Subsets**
##### This step holds out 20% of the listings as a testing subset, so the model can be validated on data it has not seen, and leaves the remaining 80% for learning the general trends.

In [None]:
X = training_data.drop(columns=['price'])
y = training_data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
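
##### As a quick sanity check of what the split produces, the sketch below runs the same call on a synthetic 100-row frame. The `demo` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, invented for illustration
demo = pd.DataFrame({'feature': np.arange(100), 'price': np.arange(100) * 10})
X_demo = demo.drop(columns=['price'])
y_demo = demo['price']

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# With test_size=0.2, 100 rows split into 80 training rows and 20 testing rows
print(len(X_tr), len(X_te))

# No row appears in both subsets
overlap = set(X_tr.index) & set(X_te.index)
print(len(overlap))
```

##### Fixing `random_state` makes the split repeatable, so results can be compared from one run to the next.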

### **Step 6. Fit Training Data to Model**

In [None]:
# Fixing random_state makes the fitted tree reproducible between runs
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

### **Step 7. View Feature Importance Chart**
##### This step isn't necessary, though it provides insight for future reference.

In [None]:
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')

##### From this visualization, we can see that "model_year", "model", and "trim" are the 3 most important features used to predict a listing's price, whereas "num_accidents", "usage_type", and "num_owners" are the 3 least important features to consider.
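
##### To show how these importance scores behave, here is a small sketch on synthetic data where the target depends on only one of two features. The column names `signal` and `noise` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Two synthetic features; only 'signal' drives the target
X_demo = pd.DataFrame({
    'signal': rng.uniform(0, 10, 200),
    'noise': rng.uniform(0, 10, 200),
})
y_demo = 3 * X_demo['signal']

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)

# Importances are normalized to sum to 1; 'signal' should dominate
print(importances.sort_values(ascending=False))
print(importances.sum())
```

##### The scores describe how much each feature reduced prediction error inside this particular tree. They are relative to one another and do not say whether a feature raises or lowers the price.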

### **Step 8. Evaluate Model Performance**
##### To evaluate how well our model performed, we will analyze the following metrics:
###### - **R-Squared**: This measures how much of the variation in price the model explains. A value of 1 is a perfect fit; on held-out data it can fall below 0 when the model does worse than always predicting the mean price.
###### - **Max Depth**: This is the number of levels in the fitted Decision Tree.

In [None]:
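# Score on the held-out test set; scoring on rows the tree was trained on inflates R-squared
print(f'R-squared: {round(model.score(X_test, y_test), 4)}')
print(f'Max Depth: {model.get_depth()}')
# The values must follow the column order of X
print(f'Prediction Output: ${model.predict([[2016, 23, 124, 171, 96800, 1, 1, 1, 2, 2, 365, 37]])[0]}')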
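
##### An unrestricted Decision Tree can memorize its training rows, so it helps to compare scores on the training and testing subsets. The sketch below does this on synthetic noisy data (invented for illustration) and also reports the mean absolute error, which is an additional metric not used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus noise, standing in for real prices
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 1000 * X_demo[:, 0] + rng.normal(0, 1500, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# No depth limit: the tree keeps splitting until it fits the training rows
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)
test_r2 = tree.score(X_te, y_te)
test_mae = mean_absolute_error(y_te, tree.predict(X_te))

print(f'Train R-squared: {train_r2:.4f}')
print(f'Test R-squared:  {test_r2:.4f}')
print(f'Test MAE:        {test_mae:.2f}')
print(f'Max Depth:       {tree.get_depth()}')
```

##### A large gap between the training and testing scores suggests overfitting. One option worth trying in that case is capping the tree with the `max_depth` parameter of `DecisionTreeRegressor`.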