# <h3 align="center">__Module 9 Activity__</h3>
# <h3 align="center">__Assigned at the start of Module 9__</h3>
# <h3 align="center">__Due at the end of Module 9__</h3><br>

# Weekly Discussion Forum Participation

Each week, you are required to participate in the module’s discussion forum. The discussion forum consists of the week's Module Activity, which is released at the beginning of the module. You must complete or attempt the activity before you post about it or about anything related to the topic.

## Grading of the Discussion

### 1. Initial Post:
Create your thread by **Day 5 (Saturday night at midnight, PST).**

### 2. Responses:
Respond to at least two other posts by **Day 7 (Monday night at midnight, PST).**

---

## Grading Criteria:

Your participation will be graded as follows:

### Full Credit (100 points):
- Submit your initial post by **Day 5.**
- Respond to at least two other posts by **Day 7.**

### Half Credit (50 points):
- If your initial post is late but you respond to two other posts.
- If your initial post is on time but you fail to respond to at least two other posts.

### No Credit (0 points):
- If both your initial post and responses are late.
- If you fail to submit an initial post and do not respond to any others.

---

## Additional Notes:

- **Late Initial Posts:** Late posts will automatically receive half credit if two responses are completed on time.
- **Substance Matters:** Responses must be thoughtful and constructive. Comments like “Great post!” or “I agree!” without further explanation will not earn credit.
- **Balance Participation:** Aim to engage with threads that have fewer or no responses to ensure a balanced discussion.

---

## Avoid:
- Posting several times within a very short time frame, especially immediately before the posting deadline.
- Posts that merely compliment another post and then summarize it.


```python
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so display() also works outside a notebook cell
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
```



# Problem 1: Investigating Feature Importance in Decision Tree & Random Forest Regression

You are working with a synthetic car dataset where your goal is to predict car prices based on features like engine size, horsepower, weight, age, mileage, and luxury rating. You will train a Decision Tree Regressor and a Random Forest Regressor and compare how each model determines feature importance.

## Code

```python
# Generate a synthetic dataset with more features for a Car Price Prediction Model
np.random.seed(2)
n_samples = 200

engine_size = np.random.rand(n_samples) * 5  # Engine size in liters
horsepower = np.random.rand(n_samples) * 125  # Horsepower
weight = np.random.rand(n_samples) * 4000  # Car weight in pounds
age = np.random.randint(1, 20, n_samples)  # Age of car in years
mileage = np.random.rand(n_samples) * 200000  # Miles driven
luxury_rating = np.random.randint(1, 10, n_samples)  # Subjective luxury rating

# Target variable: Price of the car in USD
y = (
    2500 * engine_size +
    250 * horsepower -
    3 * weight -
    800 * age -
    0.05 * mileage +
    1000 * luxury_rating +
    np.random.normal(0, 5000, n_samples)  # Adding noise
)

# Create a dataframe
df_cars = pd.DataFrame({
    "Engine_Size": engine_size,
    "Horsepower": horsepower,
    "Weight": weight,
    "Age": age,
    "Mileage": mileage,
    "Luxury_Rating": luxury_rating,
    "Price": y
})

# Split dataset into features and target variable
X = df_cars.drop(columns=["Price"])
y = df_cars["Price"]

# Standardize features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Train Decision Tree and Random Forest models
dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

dt_model.fit(X_scaled, y)
rf_model.fit(X_scaled, y)

# Extract feature importance
dt_importance = dt_model.feature_importances_
rf_importance = rf_model.feature_importances_

# Convert to DataFrame for visualization
importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Decision Tree Importance": dt_importance,
    "Random Forest Importance": rf_importance
})

# Sort by importance values (Random Forest)
importance_df = importance_df.sort_values(by="Random Forest Importance", ascending=False)

# Plot Random Forest feature importances
plt.figure(figsize=(8, 5))
sns.barplot(x="Random Forest Importance", y="Feature", data=importance_df, palette="mako", hue="Feature", legend=False)
plt.xlabel("Feature Importance Score")
plt.ylabel("Feature")
plt.title("Feature Importance in Random Forest - Car Price Prediction")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()

# Plot Decision Tree feature importances
plt.figure(figsize=(8, 5))
sns.barplot(x="Decision Tree Importance", y="Feature", data=importance_df, palette="rocket", hue="Feature", legend=False)
plt.xlabel("Feature Importance Score")
plt.ylabel("Feature")
plt.title("Feature Importance in Decision Tree - Car Price Prediction")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()

```

## Questions
1. Compare the feature importance scores from the Decision Tree and Random Forest models. Are they similar or different? Why might this be the case?
2. Which feature has the highest importance in predicting car prices? Does this make sense logically? You have the "pricing function" that generates `y`, so try changing its coefficients and see how the importances respond.
3. Try removing the least important feature from the dataset and retrain the model. Does the model’s performance change significantly?
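
One possible way to approach Question 3 is sketched below. It is a minimal, self-contained illustration rather than the expected solution: the `demo` DataFrame and its `strong`, `weak`, and `noise` columns are hypothetical stand-ins for `df_cars`, and a held-out test set is assumed as the yardstick for "performance".

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: one strong feature, one weak feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "strong": rng.random(n) * 10,
    "weak": rng.random(n) * 10,
    "noise": rng.random(n) * 10,
})
target = 50 * demo["strong"] + 5 * demo["weak"] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=42)

# Fit on all features and identify the least important one.
rf_full = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
importances = pd.Series(rf_full.feature_importances_, index=X_tr.columns)
least_important = importances.idxmin()
mse_full = mean_squared_error(y_te, rf_full.predict(X_te))

# Refit without that feature and score on the same test rows.
rf_reduced = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reduced.fit(X_tr.drop(columns=[least_important]), y_tr)
mse_reduced = mean_squared_error(y_te, rf_reduced.predict(X_te.drop(columns=[least_important])))

print(importances.sort_values(ascending=False))
print(f"Dropped feature: {least_important}")
print(f"Test MSE with all features: {mse_full:.1f}")
print(f"Test MSE after dropping:    {mse_reduced:.1f}")
```

To apply the same idea to the activity, you would swap `demo` and `target` for `X_scaled` and `y` and compare the two test errors.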

# Problem 2: Investigating Multicollinearity and Regularization in Regression Models

Multicollinearity can make linear regression models unstable. In this activity, you'll calculate Variance Inflation Factor (VIF) to identify multicollinearity and apply Ridge & Lasso Regression to handle it.
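
To make the VIF idea concrete, the sketch below computes it from its definition, VIF_i = 1 / (1 - R²_i), where R²_i comes from regressing feature i on all the other features. The data and the helper name `vif_from_definition` are made up for illustration, and this version fits an intercept in each auxiliary regression, which is an assumption of the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: "a_copy" is nearly a duplicate of "a"; "b" is independent.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
a_copy = a + rng.normal(0, 0.1, size=n)
features = np.column_stack([a, b, a_copy])
names = ["a", "b", "a_copy"]

def vif_from_definition(features, i):
    """Regress column i on the remaining columns and return 1 / (1 - R^2)."""
    others = np.delete(features, i, axis=1)
    r2 = LinearRegression().fit(others, features[:, i]).score(others, features[:, i])
    return 1.0 / (1.0 - r2)

vifs = {name: vif_from_definition(features, i) for i, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A feature that the others can predict almost perfectly has R² close to 1, so its VIF becomes large. A feature unrelated to the rest has a VIF close to 1.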

## Code

```python

# Create multicollinear features
X6 = engine_size + np.random.normal(0, 0.1, n_samples)  # Correlated with engine_size
X7 = horsepower + np.random.normal(0, 5, n_samples)  # Correlated with horsepower

# Update dataframe with new correlated features
df_cars["Engine_Size_Correlated"] = X6
df_cars["Horsepower_Correlated"] = X7

# Update feature set
X_multi = df_cars.drop(columns=["Price"])
y_multi = df_cars["Price"]

# Standardize the new dataset
scaler = StandardScaler()
X_multi_scaled = pd.DataFrame(scaler.fit_transform(X_multi), columns=X_multi.columns)

# Compute VIF to check for multicollinearity
vif_data = pd.DataFrame()
vif_data["Feature"] = X_multi_scaled.columns
vif_data["VIF"] = [variance_inflation_factor(X_multi_scaled.values, i) for i in range(X_multi_scaled.shape[1])]

# Train models on the standardized features only (Price must not be among the inputs)
X_train, X_test, y_train, y_test = train_test_split(X_multi_scaled, y_multi, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Collect the coefficients, one row per feature
coef_df = pd.DataFrame({"Feature": X_multi_scaled.columns, "LinearRegression": lr.coef_,
                         "Ridge": ridge.coef_, "Lasso": lasso.coef_})

# Display tables
print("VIF Scores:")
display(vif_data)
print("\nModel Coefficients:")
display(coef_df)

```

## Questions
1. Look at the VIF scores for the features. Which feature has the highest multicollinearity?
2. Compare the coefficients of Linear Regression vs. Ridge vs. Lasso. How do Ridge and Lasso modify the coefficients to handle multicollinearity?
3. Try increasing the alpha value for Ridge and Lasso. How does this change the coefficients? What happens when you set alpha too high?
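
For Question 3, a small sweep over `alpha` can make the shrinkage visible. The following is only a sketch on hypothetical data (`x1`, `x2`, `x3`, and the chosen alpha grid are assumptions, not part of the activity), where `x2` is a near-duplicate of `x1`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical data: x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)
x3 = rng.normal(size=n)
X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))
y_demo = 3 * x1 + 2 * x3 + rng.normal(0, 0.5, size=n)

alphas = [0.01, 0.1, 1, 10]
ridge_coefs = {}
lasso_coefs = {}
for alpha in alphas:
    ridge_coefs[alpha] = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    lasso_coefs[alpha] = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo).coef_
    print(f"alpha={alpha:>5}: ridge={np.round(ridge_coefs[alpha], 2)}, "
          f"lasso={np.round(lasso_coefs[alpha], 2)}")
```

Ridge shrinks the coefficients toward zero without making them exactly zero, while Lasso sets coefficients to exactly zero once `alpha` is large enough. Setting `alpha` too high removes useful features as well, and the model underfits.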

# Problem 3: Comparing Decision Tree Regression and Support Vector Regression for Nonlinear Data

Support vector regression (SVR) can model both linear and nonlinear data by using different kernel functions. A linear kernel works best when the relationship between the features and the target variable is linear. A radial basis function (RBF) kernel can capture nonlinear patterns, as decision trees can. Averaging many trees in a random forest also smooths the step-like predictions of a single decision tree. In this activity, you'll compare how these models perform on a nonlinear dataset.
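
As a rough illustration of how kernel choice matters, the sketch below fits a linear-kernel and an RBF-kernel SVR to a noisy parabola. The data is a hypothetical example separate from the activity's dataset, and the error is measured on the training points only, to keep the sketch short.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Hypothetical nonlinear data: a noisy parabola on [-3, 3].
rng = np.random.default_rng(0)
x_curve = np.linspace(-3, 3, 200).reshape(-1, 1)
y_curve = x_curve.ravel() ** 2 + rng.normal(0, 0.2, size=200)

kernel_mse = {}
for kernel in ["linear", "rbf"]:
    model = SVR(kernel=kernel, C=1.0).fit(x_curve, y_curve)
    # In-sample error: enough to see which kernel can follow the curve at all.
    kernel_mse[kernel] = mean_squared_error(y_curve, model.predict(x_curve))
    print(f"{kernel:>6} kernel, training MSE: {kernel_mse[kernel]:.3f}")
```

A straight line cannot follow a symmetric curve, so the linear kernel leaves a large error, while the RBF kernel can bend to match the shape.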

## Code

```python
# Generate Nonlinear Data
np.random.seed(42)
X_nl = np.sort(5 * np.random.rand(n_samples, 1), axis=0)
y_nl = np.sin(X_nl).ravel() + np.random.normal(0, 0.1, X_nl.shape[0])

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_nl, y_nl, test_size=0.2, random_state=42)

# Train Support Vector Regression models
svr_linear = SVR(kernel='linear', C=1.0)
svr_linear.fit(X_train, y_train)


# Train Decision Tree Regression model and Random Forest
tree_reg = DecisionTreeRegressor(max_depth=5, random_state=42)  # fixed seed for reproducible results
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)

tree_reg.fit(X_train, y_train)
forest_reg.fit(X_train, y_train)

# Predict on test set
y_pred_svr_linear = svr_linear.predict(X_test)

y_pred_tree = tree_reg.predict(X_test)
y_pred_forest = forest_reg.predict(X_test)


```

## Tasks
1. Compute the mean squared error of the models.
2. Plot the decision tree against the random forest model.
3. Plot the linear kernel SVR against the decision tree or random forest.
4. Add the radial basis function kernel SVR to the code.
5. Compute the mean squared error of all the models again.
6. Plot the SVR with the RBF kernel against all other models.
7. Answer the questions below.


## Questions
1. Compare the MSE (Mean Squared Error) of all models. Which model performs best?
2. Look at the plot of the fitted curves. Compare and contrast the smoothness of each model's curve.
3. Try increasing the depth of the Decision Tree (max_depth=10) or the number of trees in the random forest (n_estimators). What happens to the predictions? Do you observe overfitting?
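
One way to look for overfitting in Question 3 is to track training and test error side by side as the tree gets deeper. The sketch below regenerates a similar noisy sine dataset under its own hypothetical names (`x_sin`, `y_sin`), and the depth grid is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy sine data, similar in spirit to the activity's dataset.
rng = np.random.default_rng(42)
x_sin = np.sort(5 * rng.random((200, 1)), axis=0)
y_sin = np.sin(x_sin).ravel() + rng.normal(0, 0.1, 200)
x_tr, x_te, y_tr, y_te = train_test_split(x_sin, y_sin, test_size=0.2, random_state=42)

depth_scores = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(x_tr))
    test_mse = mean_squared_error(y_te, tree.predict(x_te))
    depth_scores[depth] = (train_mse, test_mse)
    print(f"max_depth={str(depth):>4}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error keeps falling as depth grows, since a deeper tree can memorize the noise. If the test error stops improving or rises while the training error approaches zero, that gap is the sign of overfitting.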