<h1 style="color:red;"><strong>1. Look at the Big Picture</strong></h1>

## **1.1 Frame the Problems**

* **What's the business objective?**
* **How does the company expect to use and benefit from the model?**
* Frame the problem --> choose an algorithm --> choose a performance measure --> decide how (and how long) to tune the model

## **1.2 Choose performance measure (Loss)**


### **1.2.1 Regression**
* RMSE
* MAE
* MSE
* RMSLE
* MAPE
* SMAPE
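
As a rough illustration of how these losses differ, the sketch below computes each of them with NumPy on a tiny made-up example. The helper name `regression_losses` and the sample values are assumptions for illustration only; they are not part of this notebook's dataset. SMAPE has several definitions in the literature, and the one used here is just one common variant.

```python
import numpy as np

def regression_losses(y_true, y_pred):
    """Return common regression losses as a dict (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        # RMSLE assumes non-negative targets and predictions
        "RMSLE": np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)),
        # MAPE is undefined when a true value is zero
        "MAPE": np.mean(np.abs(err / y_true)) * 100,
        # SMAPE, one common definition (in percent)
        "SMAPE": np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100,
    }

# Made-up values, purely for illustration
losses = regression_losses([100, 200, 300], [110, 190, 330])
for name, value in losses.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MSE penalize large errors more heavily than MAE, while RMSLE, MAPE, and SMAPE measure relative rather than absolute error.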

### **1.2.2 Classification**
* Binary Cross Entropy 
* Sparse Categorical Cross Entropy
* Categorical Cross Entropy
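
The three cross-entropy variants above differ mainly in how the labels are encoded. The following NumPy sketch is a minimal illustration under that assumption; the function names and the toy probabilities are hypothetical and are not taken from any particular deep learning library.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Labels are 0/1; p_pred is the predicted probability of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Labels are one-hot rows; probs holds one probability per class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -np.mean(np.sum(np.asarray(one_hot, dtype=float) * np.log(probs), axis=1))

def sparse_categorical_cross_entropy(labels, probs, eps=1e-12):
    """Labels are integer class indices instead of one-hot rows."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    labels = np.asarray(labels, dtype=int)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Toy predictions for three samples and three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
one_hot = np.eye(3)[labels]

cce = categorical_cross_entropy(one_hot, probs)
scce = sparse_categorical_cross_entropy(labels, probs)
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
```

The categorical and sparse categorical versions give the same value for the same data; the sparse form simply avoids building the one-hot matrix.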

## **1.3 Choose performance measure (Metrics)**

### **1.3.1 Regression**
* Adjusted R²
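
Adjusted R² corrects R² for the number of predictors: with `n` samples and `p` features, it is `1 - (1 - R²) * (n - 1) / (n - p - 1)`. The helper below is a minimal sketch of that formula; the name `adjusted_r2` and the sample values are assumptions for illustration.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 for n samples and n_features predictors (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    n = len(y_true)
    # Requires n > n_features + 1, otherwise the denominator is not positive
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Made-up values: five samples, one feature
score = adjusted_r2([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1)
```

Unlike plain R², this value can decrease when a feature that adds no explanatory power is included.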

### **1.3.2 Classification**
* AUC
* Accuracy
* False Positive
* False Negative
* F1 Score
* Confusion matrix
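
A short sketch of how these metrics relate to each other, using scikit-learn's metric functions on made-up labels and scores. The variable names and values are assumptions for illustration, and the 0.5 threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Made-up ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.7, 0.4, 0.2])
y_pred = (y_score >= 0.5).astype(int)  # threshold of 0.5 chosen for illustration

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses the scores, not the thresholded labels

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Accuracy, false positives, false negatives, and F1 all depend on the chosen threshold, whereas AUC summarizes the ranking quality across all thresholds.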

## **1.4 Check the Assumptions** 

<h1 style="color:red;"><strong>2. Get the Data</strong></h1>

In [1]:
# import libraries

import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
#-----------------------------
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv(r"file_name.csv")

In [None]:
# Take a look at the top 5 rows
df.head()

In [None]:
df.info()

In [None]:
df.describe()

<h1 style="color:red;"><strong>3. Data Cleaning (General)</strong></h1>

## **3.1 Formatting**

In [None]:
df.head(1)

In [None]:
df.columns

### **3.1.1 Format column names**

In [2]:
def rename_columns(df):
   # convert from camel case to snake case
   df.columns = (df.columns
                  .str.replace('(?<=[a-z])(?=[A-Z])', '_', regex=True)
                  .str.lower()
               )
   # convert " " to _
   df.columns = df.columns.str.replace(' ', '_')
   return df

In [None]:
df = rename_columns(df)

In [None]:
df.columns

In [None]:
# rename function:
# df.rename(columns={'unnamed:_0': 'id'}, inplace=True)

In [None]:
df.columns

### **3.1.2 Format data body**

Split the `model` column into make, version, engine, body type, trim, and doors.

In [None]:
def extract_version(df, column_name):
    # Split the 'model' column by spaces
    split_model = df[column_name].str.split(' ', expand=True)
    
    # Take the second token and keep it as 'version' only if it is a number such as 2 or 1.6
    version = split_model[1].where(split_model[1].str.match(r'^\d+(\.\d+)?$'))
    
    # Assign the 'version' to the DataFrame
    df['version'] = version
    
    return df

In [None]:
df = extract_version(df, 'model')

In [None]:
df.version.unique()

In [None]:
def extract_doors(df, column_name):
    # Extract the 'doors' component using regular expression
    df['doors'] = df[column_name].str.extract(r'(\d/\d-Doors)')

    return df

In [None]:
df = extract_doors(df, 'model')

In [None]:
df.doors.unique()

In [None]:
df.columns

There are many more ways to clean this DataFrame and split `model` into its components (['make', 'version', 'engine', 'body_type', 'trim', 'doors']), but for the sake of time let's move on to the other formatting steps.

### **3.1.3 Format Datatypes**
After we format all the data body and columns, let's format the data types

In [None]:
df.dtypes

The data types are already correct here, so no further formatting is needed. In other scenarios, for example when there are date/datetime columns, it is usually best to convert the data types explicitly.
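
For cases where the types do need fixing, a minimal sketch could look like the following. The column names `sale_date` and `price` and their values are hypothetical and do not come from this dataset; `errors="coerce"` turns unparseable entries into missing values instead of raising.

```python
import pandas as pd

# Hypothetical frame where dates and numbers were read in as strings
example = pd.DataFrame({
    "sale_date": ["2004-08-01", "2004-09-15", "not a date"],
    "price": ["13500", "13750", "13950"],
})

# Convert strings to proper dtypes; invalid entries become NaT / NaN
example["sale_date"] = pd.to_datetime(example["sale_date"], errors="coerce")
example["price"] = pd.to_numeric(example["price"], errors="coerce")

print(example.dtypes)
```

Coerced values show up as nulls, so they can be handled together with the other missing values later in the workflow.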

## **3.2 Drop duplicate rows**

In [None]:
df.drop_duplicates(inplace=True)

<h1 style="color:red;"><strong>4. Attribute Combination</strong></h1>

In [None]:
df.columns

In [None]:
df.head()

In [None]:
# Guard against division by zero: an age of 0 would otherwise produce inf
df['km_per_year'] = df['km'] / df['age_08_04'].replace(0, np.nan)

<h1 style="color:red;"><strong> 5. EDA</strong></h1>

In [None]:
df.apply(lambda col: col.unique())

In [None]:
cat_attribs = ['model', 'fuel_type', 'color', 'doors', 'version']
bin_attribs = [col for col in df.columns if len(df[col].unique()) == 2 or len(df[col].unique()) == 1]
num_attribs = [col for col in df.columns if col not in cat_attribs and col not in bin_attribs and col != 'id']

## **5.1 Univariate**

In [None]:
# Define the number of rows and columns for subplots
num_rows = 5
num_cols = (len(num_attribs) + num_rows - 1) // num_rows

# Create subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 10))

# Flatten the axes array to iterate over numerical attributes
axes = axes.flatten()

# Iterate over numerical attributes and plot histograms
for i, column in enumerate(num_attribs):
    sns.histplot(df[column], ax=axes[i], kde=True)
    axes[i].set_title(column)

# Hide empty subplots
for j in range(i+1, num_rows*num_cols):
    axes[j].axis('off')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Define the number of rows and columns for subplots
num_rows = 5
num_cols = (len(bin_attribs) + num_rows - 1) // num_rows

# Create subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 10))

# Flatten the axes array to iterate over binary attributes
axes = axes.flatten()

# Iterate over binary attributes and plot bar graphs
for i, column in enumerate(bin_attribs):
    sns.countplot(x=df[column], ax=axes[i])
    axes[i].set_title(column)

# Hide empty subplots
for j in range(i+1, num_rows*num_cols):
    axes[j].axis('off')

# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
# Count the occurrences of each model
model_counts = df['model'].value_counts().head(5)

# Plotting
plt.figure(figsize=(10, 6))
model_counts.plot(kind='bar', color='skyblue')
plt.title('Top 5 Models')
plt.xlabel('Model')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:

# Count the occurrences of each fuel type
fuel_counts = df['fuel_type'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
fuel_counts.plot(kind='bar', color='lightgreen')
plt.title('Fuel Type Distribution')
plt.xlabel('Fuel Type')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Count the occurrences of each color
color_counts = df['color'].value_counts()

# Plotting
plt.figure(figsize=(10, 6))
color_counts.plot(kind='bar', color='salmon')
plt.title('Color Distribution')
plt.xlabel('Color')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
num_rows = 3
num_cols = len(num_attribs) // num_rows + (len(num_attribs) % num_rows > 0)

fig, axes = plt.subplots(num_rows, num_cols, figsize=(8 * num_cols, 6 * num_rows))
axes = axes.ravel()  # Flatten the axes array

for i, col in enumerate(num_attribs):
    ax = axes[i]
    sns.boxplot(x=df[col], ax=ax, color='skyblue')  # Use seaborn's boxplot function
    ax.set_title(f'Box plot for {col}')
    ax.set_xlabel('Values')
    ax.set_ylabel('')

plt.tight_layout()
plt.show()

## **5.2 Bivariate**

In [None]:
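g = sns.pairplot(df[num_attribs])
# pairplot builds its own figure, so set a figure-level title instead of plt.title
g.fig.suptitle('Pairplot of Numerical Variables', y=1.02)
plt.show()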

In [None]:
def plot_distribution_curves(df, columns, hue_column):
    """
    Plot distribution curves for each column in the DataFrame
    with respect to the specified hue column using different colors.

    Parameters:
        df (pandas.DataFrame): DataFrame containing the data.
        columns (list): List of columns to plot.
        hue_column (str): Column to use for coloring the distribution curves.

    Returns:
        None
    """
    colors = sns.color_palette("husl", len(df[hue_column].unique()))
    
    num_plots = len(columns)
    # Round up so that an odd number of columns still gets enough axes
    fig, axes = plt.subplots(nrows=2, ncols=(num_plots + 1) // 2, figsize=(15, 10))
    axes = axes.flatten()  # Flatten the 2D array of axes to 1D for easy iteration

    for i, col in enumerate(columns):
        for j, value in enumerate(df[hue_column].unique()):
            sns.kdeplot(data=df[df[hue_column] == value], x=col, color=colors[j], ax=axes[i])
        axes[i].set_title(f"Distribution of {col} by {hue_column}")
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Density')
        axes[i].legend(df[hue_column].unique(), title=hue_column)

    plt.tight_layout()
    plt.show()

In [None]:
plot_distribution_curves(df, num_attribs, 'fuel_type')

## **5.3 Multivariate**

In [None]:
# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df[num_attribs].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

<h1 style="color:red;"><strong> 6. Label Cleaning (Optional)</strong></h1>

* Drop null labels
* Drop outlier labels
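
Both steps can be sketched on a toy frame as follows. The frame, its values, and the 1.5 × IQR rule used for the outlier cut are assumptions for illustration; the right outlier rule depends on the problem.

```python
import numpy as np
import pandas as pd

# Toy data: one missing label and one extreme label
toy = pd.DataFrame({
    "km": [10, 20, 30, 40, 50, 60],
    "price": [100.0, 110.0, np.nan, 105.0, 95.0, 10_000.0],
})

# 1. Drop rows whose label is null
toy = toy.dropna(subset=["price"])

# 2. Drop rows whose label falls outside 1.5 * IQR of the quartiles
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
toy_clean = toy[toy["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(toy_clean)
```

Rows removed here are gone for both training and evaluation, so it is worth recording how many were dropped and why.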

<h1 style="color:red;"><strong> 7. Split Train Test For Modelling </strong></h1>

In [None]:
target = "price"

In [None]:
num_attribs.remove(target)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_set, test_set = train_test_split(df, test_size=0.3, random_state=42)

train = train_set.drop(target, axis=1)
train_labels = train_set[[target]]

train_num = train[num_attribs+bin_attribs]
train_cat = train[cat_attribs]


test = test_set.drop(target, axis=1)
test_labels = test_set[[target]]

<h1 style="color:red;"><strong> 8. Data Cleaning for Modelling </strong></h1>

In [None]:
train_set.isnull().sum()

In [None]:
# Count null values in each column
null_counts = train_set.isnull().sum()

# Filter columns with more than zero null values
null_counts_gt_zero = null_counts[null_counts > 0]

# Plotting the null counts
plt.figure(figsize=(10, 6))
null_counts_gt_zero.plot(kind='bar', color='skyblue')
plt.title('Count of Null Values in DataFrame Columns (Columns with >0 Nulls)')
plt.xlabel('Columns')
plt.ylabel('Count of Null Values')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()

## **8.1 Handle Nulls**

### 8.1.1 Types of NULLS

**Missing Completely at Random (MCAR)**
- Missing data is random
- Data was lost in ETL, someone was interrupted when completing a survey
- Remove or impute

**Missing at Random (MAR)**
- Missingness depends on other variables that are observed
- Older respondents (age is available in the data) may have more privacy concerns and not report their income
- Remove or impute

**Missing Not at Random (MNAR)**
- Missingness depends on something that is not observed
- Self-selection bias: depressed people do not complete mental health surveys
- Requires advanced econometric methods

**By Design:**
- Remove credit card data if birthdate is below a certain value

### 8.1.2 Solutions to handle Nulls 
- Get rid of the corresponding rows.
- Get rid of the whole attribute.
- Set the values to some value (zero, the mean, the median, etc.).
- Create a new column that has True for the non-null values and False for the null values (or the opposite)
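
The last option is often combined with imputation, so the model can still see which values were originally missing. A minimal sketch on a toy frame (the `income` column and its values are hypothetical; here True marks the null rows, i.e. the "opposite" convention mentioned above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"income": [30.0, np.nan, 50.0, 70.0]})

# Record missingness first, because it is lost once the value is filled in
toy["income_was_missing"] = toy["income"].isna()

# Then impute with the median of the observed values
toy["income"] = toy["income"].fillna(toy["income"].median())

print(toy)
```

In a real workflow the median should be computed on the training set only and reused for the test set.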

In [None]:
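# df.dropna(subset=["minimum_payments"]) # option 1: drop the rows
# df.drop("minimum_payments", axis=1) # option 2: drop the attribute
# df["minimum_payments_missing"] = df["minimum_payments"].isna() # option 4: indicator column (create it before filling)
# median = df["minimum_payments"].median() # option 3: fill with a value
# df["minimum_payments"] = df["minimum_payments"].fillna(median)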

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
train_imputer = SimpleImputer(strategy="median")
train_labels_imputer = SimpleImputer(strategy="median")

Fit the imputer on the training features (not the labels)

In [None]:
train_imputer.fit(train_num)

In [None]:
train_num.isnull().sum().sum()

In [None]:
train_imputer.statistics_


In [None]:
train_num.median().values

In [None]:
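X = train_imputer.transform(train_num) # returns a NumPy array

In [None]:
# Keep the original row index so the rows stay aligned with the rest of the training set
train_num = pd.DataFrame(X, columns=train_num.columns, index=train_num.index)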


In [None]:
train_num.isnull().sum().sum()

Fit the imputer on the labels

In [None]:
train_labels_imputer.fit(train_labels)

In [None]:
train_labels_imputer.statistics_


In [None]:
train_labels.isnull().sum()

In [None]:
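Y = train_labels_imputer.transform(train_labels) # returns a NumPy array

In [None]:
# Keep the original row index so the labels stay aligned with the features
train_labels = pd.DataFrame(Y, columns=train_labels.columns, index=train_labels.index)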


## **8.2 Handle Outliers**

In [None]:
num_rows = 3
num_cols = len(num_attribs) // num_rows + (len(num_attribs) % num_rows > 0)

fig, axes = plt.subplots(num_rows, num_cols, figsize=(8 * num_cols, 6 * num_rows))
axes = axes.ravel()  # Flatten the axes array

for i, col in enumerate(num_attribs):
    ax = axes[i]
    sns.boxplot(x=df[col], ax=ax, color='skyblue')  # Use seaborn's boxplot function
    ax.set_title(f'Box plot for {col}')
    ax.set_xlabel('Values')
    ax.set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Remove Outlier
def remove_outliers(df, column_names=None):
    """
    Remove outliers from specific columns in the DataFrame based on the interquartile range (IQR) method,
    or remove outliers from all numerical columns if column_names is None.

    Parameters:
    - df: DataFrame
        The DataFrame containing the data.
    - column_names: list or None, default None
        The list of column names for which outliers are to be removed,
        or None to remove outliers from all numerical columns.

    Returns:
    - df_filtered: DataFrame
        The DataFrame with outliers removed.
    """
    if column_names is None:
        num_attribs = df.select_dtypes(include='number').columns
    else:
        num_attribs = column_names

    total_removed = 0
    total_rows = len(df)

    for col in num_attribs:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1

        # Define the lower and upper bounds for outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Remove outliers from the specified column
        removed_rows = len(df) - len(df[(df[col] >= lower_bound) & (df[col] <= upper_bound)])
        total_removed += removed_rows

        # Update DataFrame
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

        # Print the number and percentage of removed values if any rows have been removed
        percentage_removed = (removed_rows / total_rows) * 100
        print(f"Removed {removed_rows} rows ({percentage_removed:.2f}%) due to outliers in column '{col}'.")

    return df

In [None]:
# We do not handle any outliers in this case, to see how the model performs with them

<h1 style="color:red;"><strong> 9. Preprocess Data for Modelling </strong></h1>

## **9.1 Handle Categorical Data**

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

### 9.1.1 Ordinal Encoding

In [None]:
train_cat

In [None]:
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(train_cat[cat_attribs])
housing_cat_encoded[:10]


In [None]:
ordinal_encoder.categories_


### 9.1.2 One Hot Encoding

In [None]:
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(train_cat[cat_attribs])
housing_cat_1hot

In [None]:
housing_cat_1hot.toarray()

In [None]:
cat_encoder.categories_


## **9.2 Feature Scaling**
* As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
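
A minimal sketch of that rule, using made-up one-column arrays (the names `X_train_demo` and `X_test_demo` are assumptions for illustration): the scaler learns its minimum and maximum from the training data and then applies the same mapping to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
X_train_demo = np.array([[0.0], [5.0], [10.0]])
X_test_demo = np.array([[2.5], [12.0]])

# Fit on the training data only
scaler = MinMaxScaler().fit(X_train_demo)

# Transform both sets with the statistics learned from the training data
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(X_test_scaled.ravel())
```

Because the test set is scaled with the training minimum and maximum, test values can fall outside the 0 to 1 range, which is expected.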

### 9.2.1 Min-Max Scaling (Normalization)
* Values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
* Normalization is more affected by outliers than standardization.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [None]:
def minmax_scale_dataframe(df, columns_to_scale=None):
    """
    Scale the specified columns in the DataFrame using Min-Max scaling.
    
    Parameters:
        df (pandas.DataFrame): DataFrame to be scaled.
        columns_to_scale (list): List of columns to be scaled. If None, scale all numerical columns.
        
    Returns:
        pandas.DataFrame: Scaled DataFrame.
    """
    scaler = MinMaxScaler()
    if columns_to_scale is None:
        columns_to_scale = df.select_dtypes(include=['number']).columns.tolist()
    df_scaled = df.copy()
    df_scaled[columns_to_scale] = scaler.fit_transform(df_scaled[columns_to_scale])
    return df_scaled

### 9.2.2 Standardization
* First it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.
* Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1).
* Standardization is much less affected by outliers.
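
To make the two formulas concrete, here is a small NumPy sketch that applies both to the same made-up values, including one deliberate outlier. The array is an assumption for illustration; `np.std` uses the population standard deviation by default, which matches what `StandardScaler` computes.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is a deliberate outlier

# Min-max scaling: (x - min) / (max - min), bounded to [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std, zero mean and unit variance, unbounded
standardized = (values - values.mean()) / values.std()

print(min_max.round(3))
print(standardized.round(3))
```

With min-max scaling the outlier pins the top of the range, so the four ordinary values are squeezed into a narrow band near 0.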

In [None]:
def standard_scale_dataframe(df, columns_to_scale=None):
    """
    Scale the specified columns in the DataFrame using standard scaling.
    
    Parameters:
        df (pandas.DataFrame): DataFrame to be scaled.
        columns_to_scale (list): List of columns to be scaled. If None, scale all numerical columns.
        
    Returns:
        pandas.DataFrame: Scaled DataFrame.
    """
    scaler = StandardScaler()
    if columns_to_scale is None:
        columns_to_scale = df.select_dtypes(include=['number']).columns.tolist()
    df_scaled = df.copy()
    df_scaled[columns_to_scale] = scaler.fit_transform(df_scaled[columns_to_scale])
    return df_scaled

## **9.3 Transformation Pipelines**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
train_set.head()

In [None]:
num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('std_scaler', StandardScaler()),
 ])
# housing_num_tr = num_pipeline.fit_transform(train_num)

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs+bin_attribs),
 ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs),
 ])
train_prepared = full_pipeline.fit_transform(train)


In [None]:
train_prepared

In [None]:
train_labels_imputer.fit(train_labels.values.reshape(-1, 1))

In [None]:
Y = train_labels_imputer.transform(train_labels.values.reshape(-1, 1))

In [None]:
train_labels =  pd.DataFrame(Y, columns=train_labels.columns)

<h1 style="color:red;"><strong> 10. Select and Train a Model </strong></h1>

## **10.1 Training and Evaluating on the Training Set**

### 10.1.1 Model 1: Linear Regression Model

In [3]:
from sklearn.linear_model import LinearRegression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(train_prepared, train_labels)

In [None]:
train_labels.iloc[:5]

In [None]:
some_data = train.iloc[:5]
some_labels = train_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", np.array(some_labels))

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(train_prepared)
lin_mse = mean_squared_error(np.array(train_labels), housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
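import statsmodels.api as sm

# Compute residuals on flattened 1-D arrays so they form a plain vector
residuals = np.ravel(train_labels) - np.ravel(housing_predictions)

# Plot QQ plot; fit=True standardizes the residuals so the 45-degree line is a meaningful reference
sm.qqplot(residuals, fit=True, line='45')
plt.title('QQ Plot of Residuals')
plt.show()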

### 10.1.2 Model 2: Decision Tree Regressor Model

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train_prepared, train_labels)

In [None]:
housing_predictions = tree_reg.predict(train_prepared)
tree_mse = mean_squared_error(train_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse


In [None]:
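import statsmodels.api as sm

# Flatten both arrays first: subtracting (n,) predictions from (n, 1) labels would broadcast to an (n, n) matrix
residuals = np.ravel(train_labels) - np.ravel(housing_predictions)

# Plot QQ plot; fit=True standardizes the residuals so the 45-degree line is a meaningful reference
sm.qqplot(residuals, fit=True, line='45')
plt.title('QQ Plot of Residuals')
plt.show()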

In [None]:
from sklearn.ensemble import RandomForestRegressor

### 10.1.3 Model 3: Random Forest Regressor Model

In [None]:
forest_reg = RandomForestRegressor()
forest_reg.fit(train_prepared, train_labels)


In [None]:
housing_predictions = forest_reg.predict(train_prepared)
forest_mse = mean_squared_error(train_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

In [None]:
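import statsmodels.api as sm

# Flatten both arrays first: subtracting (n,) predictions from (n, 1) labels would broadcast to an (n, n) matrix
residuals = np.ravel(train_labels) - np.ravel(housing_predictions)

# Plot QQ plot; fit=True standardizes the residuals so the 45-degree line is a meaningful reference
sm.qqplot(residuals, fit=True, line='45')
plt.title('QQ Plot of Residuals')
plt.show()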

Optionally, save the model for future use

In [None]:
import joblib

In [None]:
# # Save the trained model to a file
# joblib.dump(forest_reg, "forest_reg.pkl")

# # Later, when you want to use the model again
# # Load the model from the file
# forest_reg = joblib.load("forest_reg.pkl")

Could this model really be absolutely perfect? Of course, it is much more likely that the model has badly overfit the data. How can you be sure? As we saw earlier, you don't want to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation.

## **10.2 Better Evaluation Using Cross-Validation**

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
scores = cross_val_score(tree_reg, train_prepared, train_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)


In [None]:
lin_scores = cross_val_score(lin_reg, train_prepared, train_labels,scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)


In [None]:
forest_scores = cross_val_score(forest_reg, train_prepared, train_labels,scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)


The error (RMSE) on the training set is still much lower than on the validation folds, meaning that the model is still overfitting the training set. Possible solutions include:
* Simplify the model
* Constrain it (i.e., regularize it)
* Get a lot more training data
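
As a hedged illustration of the "constrain it" option, the sketch below compares an unconstrained decision tree with one limited by `max_depth` on synthetic data. The data, the helper `train_and_cv_rmse`, and the depth of 3 are assumptions for illustration, not tuned choices for this notebook's dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

def train_and_cv_rmse(model):
    """Return (RMSE on the training data, mean RMSE over 5 CV folds)."""
    model.fit(X_demo, y_demo)
    train_rmse = np.sqrt(mean_squared_error(y_demo, model.predict(X_demo)))
    cv_mse = -cross_val_score(model, X_demo, y_demo,
                              scoring="neg_mean_squared_error", cv=5)
    return train_rmse, np.sqrt(cv_mse).mean()

deep_train, deep_cv = train_and_cv_rmse(DecisionTreeRegressor(random_state=42))
shallow_train, shallow_cv = train_and_cv_rmse(
    DecisionTreeRegressor(max_depth=3, random_state=42))

print(f"unconstrained: train RMSE={deep_train:.3f}, CV RMSE={deep_cv:.3f}")
print(f"max_depth=3:   train RMSE={shallow_train:.3f}, CV RMSE={shallow_cv:.3f}")
```

A large gap between the training RMSE and the cross-validated RMSE is the overfitting signal described above; constraining the model trades some training fit for a smaller gap.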

Try other ML models; the goal is to shortlist 2-5 promising models.

<h1 style="color:red;"><strong> 11. Fine Tune Your Model </strong></h1>

## **11.1 Grid Search**
One way to fine-tune a model would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.


Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = [
 {'n_estimators': [5, 25, 50], 'max_features': [5, 10, 15, 25]},
 {'bootstrap': [False], 'n_estimators': [5, 10], 'max_features': [5, 10, 15]},
 ]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
 scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(train_prepared, train_labels)



This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict. It will then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter)
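
The combination count can be checked directly with scikit-learn's `ParameterGrid`, which expands a grid specification the same way `GridSearchCV` does. This is a small sketch; the variable names `n_combinations` and `total_fits` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [5, 25, 50], 'max_features': [5, 10, 15, 25]},
    {'bootstrap': [False], 'n_estimators': [5, 10], 'max_features': [5, 10, 15]},
]

# 3 x 4 combinations from the first dict plus 2 x 3 from the second
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per fold (cv=5), not counting the final refit
total_fits = n_combinations * 5

print(n_combinations, total_fits)
```

Counting the fits up front gives a quick estimate of how long a grid search will take before running it.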

In [None]:
grid_search.best_params_


In [None]:
grid_search.best_estimator_

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)


None of the fine-tuned models is as good as the original one.

## **11.2 Randomized Search**
Randomized search is preferable when the hyperparameter space is large. It is used in the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:
* If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
* You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations
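
A minimal sketch of a randomized search with `RandomizedSearchCV`, run on synthetic data so that it is self-contained. The data, the sampling ranges, and `n_iter=5` are assumptions chosen to keep the example fast; in this notebook the search would be fitted on `train_prepared` and `train_labels` instead.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Distributions to sample from; randint's upper bound is exclusive
param_distributions = {
    "n_estimators": randint(5, 50),
    "max_features": randint(2, 9),
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                      # the computing budget: number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X_demo, y_demo)

best_rmse = np.sqrt(-rnd_search.best_score_)
print(rnd_search.best_params_, best_rmse)
```

Raising `n_iter` explores more of the space at a proportional cost in training time.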


## **11.3 Ensemble Methods**
Another way to fine-tune your system is to try to combine the models that perform best. The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
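
One simple way to combine regressors is to average their predictions, which scikit-learn provides as `VotingRegressor`. The sketch below uses synthetic data and two arbitrary base models as assumptions for illustration; it is not a recommendation of which models to combine for this dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Each named estimator is fitted on the same data
voting = VotingRegressor([
    ("lin", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=42)),
])
voting.fit(X_demo, y_demo)

# With default (equal) weights the ensemble prediction is the plain average
ensemble_pred = voting.predict(X_demo)
manual_avg = np.mean([est.predict(X_demo) for est in voting.estimators_], axis=0)
```

Whether the ensemble beats its best member should be checked with cross-validation, just like any single model.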

<h1 style="color:red;"><strong>   12. Analyze the Model and Explain Feature Importance </strong></h1>

## **12.1 Grid Search Feature Importance**

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

## **12.2 Lime**

## **12.3 Shapley**

<h1 style="color:red;"><strong>   13. Evaluate Your System on the Test Set </strong></h1>

Now is the time to evaluate the final model on the test set. There is nothing special about this process: get the predictors and the labels from your test set, run your `full_pipeline` to transform the data (call `transform()`, not `fit_transform()`, because you do not want to fit the test set!), and evaluate the final model on the test set:


In [None]:
final_model = forest_reg

In [None]:
test_set

In [None]:
test = test_set.drop("price", axis=1)
test_labels = test_set["price"].copy()

test_prepared = full_pipeline.transform(test)
test_labels = train_labels_imputer.transform(test_labels.values.reshape(-1, 1))

# Fit on the training set only (forest_reg was re-created, unfitted, in section 11.1);
# fitting on the test set would invalidate the evaluation
final_model.fit(train_prepared, np.ravel(train_labels))
final_predictions = final_model.predict(test_prepared)
final_mse = mean_squared_error(np.ravel(test_labels), final_predictions)
final_rmse = np.sqrt(final_mse)

In [None]:
print(final_rmse)