# Hands On - Predicting The Quality Of Wine - Regression RECAP SESSION (COMPLETED)

# Import & Prepare Data

In [None]:
import pandas as pd
wine = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/wine_regression.csv")



*   Investigate the structure of the data.
*   Delete rows with missing values or unnecessary columns if required.
*   Separate the features from the label (price) into X and y.
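The three steps listed above can be sketched on a tiny stand-in table. The values below are made up for illustration and are not taken from the wine dataset; only the column names `vintage`, `age` and `price` follow the notebook.

```python
import pandas as pd

# Toy data standing in for the wine table (all values are made up).
toy = pd.DataFrame({
    "vintage": [1952, 1960, 1975, 1990],
    "age": [69, 61, 46, 31],
    "price": [36.8, None, 12.5, 9.9],
})

# Step 1: count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Step 2: drop rows that contain at least one missing value.
toy_clean = toy.dropna()

# Step 3: separate the features (X) from the label (y).
X_toy = toy_clean.drop("price", axis=1)
y_toy = toy_clean["price"]
print(X_toy.shape, y_toy.shape)
```

Whether dropping rows is appropriate depends on how many values are missing; `wine.info()` shows the non-null count per column.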


In [None]:
wine.info()

In [None]:
wine.head()

Vintage and age carry the same information: age is only a mathematical transformation of the vintage year.

Taking 2021 as the reference year, a wine from the 1952 vintage has an age of 2021 - 1952 = 69.

Keeping both columns adds nothing for the model, so we delete the variable "vintage":

In [None]:
wine = wine.drop("vintage", axis = 1)
wine.head()
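A quick way to confirm that two columns are redundant is to check their correlation. The sketch below uses made-up vintages and assumes the reference year 2021 from the text; on the real data the same check would be `wine["vintage"].corr(wine["age"])`, run before the column is dropped.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # reference year used in the text above

# Made-up vintages; age is derived exactly as described in the text.
toy = pd.DataFrame({"vintage": [1952, 1961, 1978, 1990, 2005]})
toy["age"] = REFERENCE_YEAR - toy["vintage"]

# A linear transformation with negative slope gives a correlation of -1
# (up to floating-point rounding).
corr = toy["vintage"].corr(toy["age"])
print(round(corr, 6))

# Keep only one of the two columns.
toy = toy.drop("vintage", axis=1)
print(toy.columns.tolist())
```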

Separate Features and Target

In [None]:
X = wine.drop("price", axis = 1)
y = wine["price"]

# Train a Decision Tree


## 1) Import Model Function

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, mean_absolute_percentage_error

## 2) Instantiate Model

In [None]:
tree = DecisionTreeRegressor(random_state=1)

## 3) Create Test & Training Data


In [None]:
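# shuffle=False keeps the row order, so the last 12% of rows form the test set (random_state is then ignored)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, shuffle=False, random_state=1)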
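With `shuffle=False`, `train_test_split` does not sample at random: it cuts the table at a fixed position and uses the final block of rows as test data. The sketch below illustrates this on a made-up frame with a plain 0..n-1 index, and shows that changing `random_state` does not change the result in this setting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a plain 0..n-1 index so row positions are easy to read.
n = 50
X_toy = pd.DataFrame({"feature": np.arange(n)})
y_toy = pd.Series(np.arange(n) * 2.0)

split_a = train_test_split(X_toy, y_toy, test_size=0.12, shuffle=False, random_state=1)
split_b = train_test_split(X_toy, y_toy, test_size=0.12, shuffle=False, random_state=99)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b, X_test_b = split_b[0], split_b[1]

# Without shuffling, the test set is the final block of rows.
n_test = len(X_test_a)
is_last_block = list(X_test_a.index) == list(range(n - n_test, n))
print("test set is the last block:", is_last_block)

# The seed makes no difference when shuffle=False.
same_split = X_test_a.equals(X_test_b) and X_train_a.equals(X_train_b)
print("identical split for both seeds:", same_split)
```

If the rows of a dataset happen to be ordered (for example by year), a split without shuffling evaluates the model on one end of that ordering only, which is worth keeping in mind when interpreting the test error.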

## 4) Fit Model to Training Data


In [None]:
tree.fit(X_train, y_train)

## 5) Make Predictions on Testing Data


In [None]:
y_pred = tree.predict(X_test)

## 6) Evaluate Performance



*   Print RMSE and MAPE as error measures

In [None]:
root_mean_squared_error(y_test, y_pred)

In [None]:
mean_absolute_percentage_error(y_test, y_pred)
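To make the two numbers easier to interpret, here is a minimal sketch that computes both measures by hand with NumPy. The prices and predictions are made up. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so a value of 0.07 corresponds to 7 %.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])  # made-up prices
y_hat = np.array([110.0, 180.0, 400.0])   # made-up predictions

errors = y_true - y_hat

# RMSE: square root of the mean squared error, in the unit of the label.
rmse = np.sqrt(np.mean(errors ** 2))

# MAPE: mean of |error| / |true value|, a unit-free fraction.
mape = np.mean(np.abs(errors) / np.abs(y_true))

print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.4f} (= {mape * 100:.2f} %)")
```

RMSE penalises large individual errors more strongly, while MAPE relates each error to the size of the true value, so the two measures can rank models differently.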

Is it a good model? Why or why not?

**Answer:**____

## 7) Plot Tree

In [None]:
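def plot_tree_regression(treemodel, X_train):
    """Plot a fitted tree and label the splits with the training column names."""
    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree  # direct import avoids confusion with the model named `tree`
    fig = plt.figure(figsize=(60, 20))
    # Pass a plain list; some scikit-learn releases do not accept a pandas Index here.
    _ = plot_tree(treemodel, feature_names=list(X_train.columns), filled=True, precision=2)
    plt.show()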

In [None]:
plot_tree_regression(tree, X_train)
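A fully grown regression tree is often too large to read in a single figure. As a complement to the plot, the sketch below inspects the size of a tree and prints only its top levels as text with `export_text`. The data is synthetic and the column names `age` and `rating` are purely illustrative assumptions, not the columns of the wine dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data; column names and values are made up.
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    "age": rng.integers(1, 70, size=200),
    "rating": rng.uniform(80, 100, size=200),
})
y_toy = 0.5 * X_toy["age"] + 2.0 * X_toy["rating"] + rng.normal(0, 1, size=200)

toy_tree = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# Size of the fully grown tree.
print("depth:", toy_tree.get_depth(), "leaves:", toy_tree.get_n_leaves())

# Text rendering limited to the top levels of the tree.
rules = export_text(toy_tree, feature_names=list(X_toy.columns), max_depth=2)
print(rules)
```

`plot_tree` accepts a `max_depth` argument as well, which limits the drawing in the same way.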

# Build a better model - Apply a Random Forest

## 1) Import Model Function

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, mean_absolute_percentage_error

## 2) Instantiate Model

Build a random forest with 500 trees.

In [None]:
forest = RandomForestRegressor(n_estimators=500, random_state=1)
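For regression, the prediction of a random forest is the average of the predictions of its individual trees. The sketch below checks this on synthetic data from `make_regression`, using a smaller forest of 25 trees to keep it fast; the fitted trees are available in the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the wine dataset).
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

small_forest = RandomForestRegressor(n_estimators=25, random_state=1)
small_forest.fit(X_toy, y_toy)

# Predictions of every single tree for the first five rows: shape (25, 5).
per_tree = np.array([t.predict(X_toy[:5]) for t in small_forest.estimators_])

# Averaging over the trees reproduces the forest prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = small_forest.predict(X_toy[:5])
matches = np.allclose(manual_average, forest_prediction)
print("forest prediction equals tree average:", matches)
```

Averaging many trees, each grown on a different bootstrap sample, is what reduces the variance of a single deep tree; more trees mainly cost additional training time.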

## 3) Create Test & Training Data

To compare the performance of the two models, make sure both are evaluated on the identical split (same parameters such as `test_size`, `shuffle` and `random_state`). Instead of creating a new split, you can simply reuse the train/test data from above.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.12, shuffle=False, random_state=1)

## 4) Fit Model to Training Data (Using the Same Split as Above)

In [None]:
forest.fit(X_train, y_train)

## 5) Make Predictions on Testing Data

In [None]:
y_pred = forest.predict(X_test)

## 6) Evaluate Performance

*   Print RMSE and MAPE as error measures


In [None]:
root_mean_squared_error(y_test, y_pred)

In [None]:
mean_absolute_percentage_error(y_test, y_pred)
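Because the notebook reuses the name `y_pred`, the tree's predictions are overwritten by the forest's. One possible way to compare both models side by side is to loop over them and store the scores per model, as in this sketch on synthetic data. The dictionary names and the data are assumptions for illustration, and RMSE is computed as `np.sqrt(mean_squared_error(...))`, which gives the same value as `root_mean_squared_error` and also works with older scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the wine dataset).
X_toy, y_toy = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=1)
y_toy = y_toy + 1000.0  # shift labels away from zero so that MAPE stays well behaved

# One split, shared by both models.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.12, shuffle=False)

models = {
    "tree": DecisionTreeRegressor(random_state=1),
    "forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "mape": float(mean_absolute_percentage_error(y_te, pred)),
    }

for name, scores in results.items():
    print(f"{name:>6}: RMSE={scores['rmse']:.2f}  MAPE={scores['mape']:.4f}")
```

For both measures, lower values mean smaller errors on the test data.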

Compare the random forest to the decision tree. Which model is better?

**Answer:** ____