# Assignment 2: Regression, Multi-class, and Hyper-parameter Tuning

### Task 1: Regression Metrics (30 points total)

The code below executes the following steps:
* Load the California Housing dataset from sklearn.
* Split the dataset into training and testing sets.
* Train a linear regression model on the training data.

It is your task to:
* Evaluate the model's performance using Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared metrics.
* Print the evaluation results.
* Interpret the results and discuss how each metric reflects the performance of a regression model.
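
As a minimal sketch of how the three metrics behave, the snippet below computes them on four hand-picked values rather than on the housing data. The arrays `y_true_demo` and `y_pred_demo` are made up for illustration and are not part of the assignment.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, chosen only so the arithmetic is easy to check
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of |error|
mse = mean_squared_error(y_true_demo, y_pred_demo)   # mean of error squared
r2 = r2_score(y_true_demo, y_pred_demo)              # 1 - SS_res / SS_tot

print(f"MAE: {mae:.3f}")  # MAE: 0.500
print(f"MSE: {mse:.3f}")  # MSE: 0.375
print(f"R^2: {r2:.3f}")   # R^2: 0.949
```

MAE is expressed in the units of the target, MSE weights large errors more heavily because the errors are squared, and R-squared is the share of the target's variance that the predictions account for. For the California Housing data the target is the median house value in units of $100,000, which may help when phrasing an MAE for stakeholders.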

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set

# Evaluate model performance


**Question: Interpret the results. How might we interpret the model performance and communicate it to stakeholders? (20 points)**



*Your Answer:*

### Task 2: Multiclass Classification Metrics (30 points total)

The code below executes the following steps:
* Load the Iris dataset from scikit-learn.
* Split the dataset into training and testing sets.
* Train a multiclass classification model, logistic regression, on the training data.

It is your task to:
* Evaluate the model's performance using precision, recall, and F1 score.
* Visualize a confusion matrix.
* Print the evaluation results.
* Interpret the results and discuss how each metric reflects the performance of a classification model.
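
The sketch below shows one way these metrics could be computed for a three-class problem, using six made-up labels instead of the Iris predictions. The `_demo` arrays and the choice of `average="macro"` are assumptions for illustration. With more than two classes, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument, because the default `"binary"` raises an error.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical labels for three classes; one class-1 sample is predicted as class 2
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

# Macro averaging: compute the metric per class, then take the unweighted mean
precision = precision_score(y_true_demo, y_pred_demo, average="macro")
recall = recall_score(y_true_demo, y_pred_demo, average="macro")
f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(f"Precision (macro): {precision:.3f}")  # 0.889
print(f"Recall (macro):    {recall:.3f}")     # 0.833
print(f"F1 (macro):        {f1:.3f}")         # 0.822
print(cm)
```

The off-diagonal entry in row 1, column 2 of the matrix is the single misclassified sample. To plot the matrix rather than print it, `ConfusionMatrixDisplay` from `sklearn.metrics` is one option if matplotlib is installed.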

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)  # above the default of 100 so the solver has room to converge
model.fit(X_train, y_train)

# Make predictions on the testing set

# Evaluate model performance


**Question: Interpret the results. How might we interpret the model performance and communicate it to stakeholders? (20 points)**


*Your Answer:*

### Task 3: Model Selection, Hyperparameter Tuning, and Cross-Validation (40 points total)
The code below executes the following steps:
* Load the Iris dataset.
* Split the dataset into training and testing sets.

It is your task to:
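* Implement a grid search with cross-validation to tune hyperparameters for a classification model (e.g. a random forest, or the support vector classifier `SVC` that the cell below imports).
* Explore different hyperparameters (e.g. the number of estimators for a random forest, or `C` and `kernel` for `SVC`).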
* Evaluate the model's performance using accuracy, precision, recall, and F1 score on the testing set.
* Print the **best hyperparameters** and evaluation results.
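
A minimal sketch of the grid-search workflow is shown below, assuming `SVC` as the model and a small, arbitrary grid over `C` and `kernel`. The grid values, the 5-fold setting, and the `_demo`-style variable names are illustrative assumptions, not prescribed by the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Assumed candidate values; 3 x 2 = 6 combinations are evaluated
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored with 5-fold cross-validation on the training set only
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_params = search.best_params_
# With the default refit=True, predict() uses the best model refitted on all training data
test_accuracy = accuracy_score(y_te, search.predict(X_te))

print("Best hyperparameters:", best_params)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the search means the final accuracy is an estimate on data that played no part in choosing the hyperparameters.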

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid

# Perform grid search with cross-validation

# Get the best hyperparameters

# Evaluate model performance


### OPTIONAL Task 4: Custom Scoring Metric (20 bonus points)

In scikit-learn, you are not limited to the built-in scoring functions: you can define your own.

You can create a custom scoring metric in scikit-learn by defining a scoring function and then using the `make_scorer` function to wrap it as a scorer. 
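
The mechanics can be sketched with a metric that differs from the one the bonus task asks for. Here the hypothetical function `weighted_class_accuracy` mixes the true-positive rate and the true-negative rate, and its `pos_weight` argument is forwarded through `make_scorer` as a keyword argument. The function name, the weight of 0.7, and the `_demo` data are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def weighted_class_accuracy(y_true, y_pred, pos_weight=0.5):
    """Hypothetical metric: weighted mix of true-positive and true-negative rates."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = ~pos
    # Guard against folds that happen to contain only one class
    tpr = np.mean(y_pred[pos] == 1) if pos.any() else 0.0
    tnr = np.mean(y_pred[neg] == 0) if neg.any() else 0.0
    return pos_weight * tpr + (1.0 - pos_weight) * tnr

# Extra keyword arguments given to make_scorer are passed on to the scoring function
demo_scorer = make_scorer(weighted_class_accuracy, pos_weight=0.7)

X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
demo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring=demo_scorer
)
print("Scores per fold:", demo_scores)
```

Because `make_scorer` defaults to `greater_is_better=True`, a higher value is treated as better; for a loss-style metric that argument would be set to `False`.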

**For bonus points:**

* Define a custom scoring function `custom_scoring` that calculates the weighted sum of precision and recall for a binary classification problem.
* Then wrap this function using `make_scorer` to create a custom scorer `custom_scorer`.
* Use this custom scorer in cross-validation to evaluate the performance of a logistic regression model.

In [5]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Define your custom scoring function
def custom_scoring(y_true, y_pred, precision_weight=0.6, recall_weight=0.4):
    # YOUR CODE HERE
    
    return score

# Wrap the custom scoring function as a scorer
# YOUR CODE HERE



In [None]:
# THIS CODE TESTS YOUR FUNCTION
# Generate sample data
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Create and train a model using cross-validation
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)

# Print the custom scores obtained from cross-validation
print("Custom Scores:", scores)
print("Mean Custom Score:", scores.mean())