## Course Information
INFO 521: Introduction to Machine Learning\
Instructor: Xuan Lu, College of Information Science

## Instructions
#### Objectives
This worksheet will assess your knowledge of basic commands in Python. Please review the lectures, suggested readings, and additional resources before starting the homework, as this document closely follows the provided materials.

#### Grading
Please note that grades are **NOT exclusively based on your final answers**. We will be grading the overall structure and logic of your code. Feel free to use as many lines as you need to answer each of the questions. I strongly encourage adding comments (`#`) to your code. Comments improve the reproducibility and readability of your submission, and commenting is good coding practice. **Specifically for the course, you’ll get better feedback if the TA is able to understand your code in detail.**

__Total score__: 120 points, with 20 points for Questions 1-3 and 100 points for the brief independent project.

#### Submission
This homework is due by the end of October 16th (**Wednesday, 11:59 pm AZ time**). Please contact the instructor if you are (i) having issues opening the assignment, (ii) not understanding the questions, or (iii) having issues submitting your assignment. Note that late submissions are subject to a penalty (see late work policies in the syllabus).
- Please submit a single Jupyter Notebook file (this file). Answers to each question should be included in the relevant block of code (see below). Rename your file to "**lastname_Hw7.ipynb**" before submitting. <font color='red'>A broken file won’t be graded, so please ensure that your file is accessible.</font> If a given block of code is causing issues and you could not fix it, please add comments describing the problem.

#### Time commitment
Please reach out if you’re taking more than ~18h to complete (1) this homework, (2) reading the book chapters, and (3) going over the lectures. I will be happy to provide accommodations if necessary. **Do not wait until the last minute to start working on this homework**. Working under pressure usually increases the time needed to answer each of these questions, and the instructor and the TA might not be fully available on Sundays to troubleshoot with you.

#### Looking for help?
First, please go over the relevant readings for this week. Second, if you’re still struggling with any of the questions, do some independent research (e.g., Stack Overflow is a wonderful resource). Don’t forget that your classmates will also be working on the same questions, so reach out for help (check the Discussion forum for folks looking to interact with other students in this class, or start your own thread). Finally, the TA is available to answer any questions during office hours and via email.

## Questions
#### Author:
Name: [Your name]\
Affiliation: [Your affiliation]

### Conceptual

#### Question 1

Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions $R_1$, $R_2$, ... , the cutpoints $t_1$ , $t_2$, ..., and so forth. Please insert your sketch below and make sure the sketch file is attached to your submission.

  > **_Answer:_**  [BEGIN SOLUTION].

#### Question 2 (421/521)

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of $X$, produce $10$ estimates of $P(\text{Class is Red} \mid X)$:

$0.1$, $0.15$, $0.2$, $0.2$, $0.55$, $0.6$, $0.6$, $0.65$, $0.7$, and $0.75$.

There are two common ways to combine these results into a single class prediction. One is the majority vote approach discussed in this week's lecture and readings. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
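
As a rough illustration of the two combination rules, the sketch below applies them to a small set of made-up probabilities (not the values from this question). The function names, the 0.5 threshold, and the tie-breaking rule are assumptions chosen for illustration only.

```python
def majority_vote(probs, threshold=0.5):
    """Classify each estimate separately, then take the most common class.

    Ties are (arbitrarily) resolved in favor of "Green" in this sketch.
    """
    red_votes = sum(p > threshold for p in probs)
    return "Red" if red_votes > len(probs) / 2 else "Green"


def average_probability(probs, threshold=0.5):
    """Average the estimates first, then classify the averaged probability."""
    mean_prob = sum(probs) / len(probs)
    return "Red" if mean_prob > threshold else "Green"


# Hypothetical estimates of P(Class is Red | X) from five bootstrapped trees
example_probs = [0.05, 0.10, 0.55, 0.60, 0.60]

print(majority_vote(example_probs))        # 3 of 5 trees vote Red
print(average_probability(example_probs))  # mean is 0.38, below the threshold
```

With these hypothetical values the two rules disagree, because a few very low estimates pull the average down without changing the vote count.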

  > **_Answer:_**  [BEGIN SOLUTION].

### Applied

#### Question 3

Apply boosting and random forest to a data set of your choice. Feel free to use any of the datasets in the `ISLP` Python package (e.g. `College`) to examine any of the questions that were discussed in any of the previous homework assignments. See https://intro-stat-learning.github.io/ISLP/data.html for `ISLP` datasets. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods (e.g. linear or logistic regression models)? Which of these approaches yields the best performance?
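
One possible shape for this comparison is sketched below. It uses synthetic data from scikit-learn's `make_regression` rather than an `ISLP` dataset, and the hyperparameters (`n_estimators`, `learning_rate`, the 25% test split) are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with a real dataset of your choice
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A simple baseline plus the two ensemble methods, all with untuned settings
models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=0
    ),
}

# Fit on the training set only, then score each model on the held-out test set
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in test_mse.items():
    print(f"{name}: test MSE = {mse:.1f}")
```

Which model comes out ahead depends on the data: on a truly linear signal the baseline can be hard to beat, so the ranking here should not be read as general.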

In [None]:
# BEGIN SOLUTION

# END SOLUTION

## Brief independent project

In this part of the assignment, you will be provided with two options of basic ML code used to examine a given question. __Assume that this code was sent to you by a collaborator who is asking for your help in improving their ML workflow.__ 

1. Critically examine the steps outlined by your collaborator (e.g. data pre-processing, model development) with the goal of improving the performance of the model given the specific research question. 
2. Reconsider and re-evaluate their data pre-processing steps.
3. Generate an alternative model. Your alternative model may, but need not, be of the same class as the baseline model.
4. Finally, compare your new model to the one suggested by your collaborator. If your model does not outperform your collaborator's model, briefly examine why this was the case.

__Please justify your decisions.__ For instance, explain why you decided to plot Y vs X and not Y vs Z; why 20% of observations in the F feature were imputed using mean values; why model N was selected and not M; or even why model performance is solely examined based on accuracy. 

__I’m looking for systematic and well-justified decisions.__ I am interested in reading thoughtful discussions on the potential practical consequences of your decisions. Hint: Assume that you’re explaining this to your collaborator. __Please be brief and focus on your major decisions__. Your grade will be based on your ability to explain your choices and not the length of your submission. Please do not forget to annotate your code.

Your submission (part of this file; see __Report__ below) should include only the following sections:

##### Synopsis
A brief (<300 words) description of the dataset, goals of the analysis, methods used, main results, and conclusions.

##### Data pre-processing
Steps related to missing data, dimensionality reduction, training and test data splits, centering, scaling, creating dummy variables, etc.

##### Data exploration
Correlation analyses, scatterplots, class imbalance, etc.

##### Model development
This component should include descriptions of why a given model was used, how each of the examined models was designed (e.g. are you using all the features in the dataset?), tuning (if needed), and model evaluation. 

##### Conclusions
Frame your conclusions in two different ways (<300 words). First, draw a conclusion about the question addressed in your selected project. Second, compare the performance of your approach with that of your collaborator's model. 


Choose one of the two options below. 

__Option 1__. Your collaborator is interested in implementing a new model to classify patients into sick or not sick categories. _They are particularly concerned with the number of times their model fails to correctly classify truly sick patients._ Feel free to use any of the variables in the dataset, reconsider the data pre-processing approach, explore other models, data partitions, etc.

In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix

Data1Raw = pd.read_csv('sick.data', sep = '\t')
Data1 = Data1Raw.drop(columns=['TBG', 'TBG_measured', 'hypopituitary', 'TSH', 'T3', 'TT4', 'T4U', 'FTI']).dropna()
Data1['Class'] = Data1['Class'].map({'sick': 1, 'negative': 0})

binary_cols = Data1.columns[Data1.isin(['f', 't']).all()]
Data1[binary_cols] = Data1[binary_cols].map(lambda x: 1 if x == 't' else 0)
Data1['sex'] = Data1['sex'].map({'F': 0, 'M': 1})
Data1 = pd.get_dummies(Data1, columns=['referral_source'], dtype='int')

X = Data1.drop(columns=['Class'])
y = Data1['Class']
X = sm.add_constant(X)
glm_model = sm.GLM(y, X, family=sm.families.Binomial())
glm_results = glm_model.fit()

y_pred = glm_results.predict(X)
y_pred_class = [1 if p > 0.5 else 0 for p in y_pred]
conf_matrix = confusion_matrix(y, y_pred_class)
print(conf_matrix)
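
The collaborator's concern about missed sick patients corresponds to false negatives, which can be read off a confusion matrix as sensitivity (recall). The sketch below uses made-up labels and predicted probabilities, not the `sick.data` file; the helper `sensitivity_at` and both thresholds are hypothetical and only illustrate how the cutoff changes the count of missed positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels (1 = sick) and predicted probabilities of being sick
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_hat = np.array([0.90, 0.40, 0.35, 0.70, 0.10, 0.20, 0.05, 0.30, 0.60, 0.15])


def sensitivity_at(threshold):
    """Return (sensitivity, false positives) when classifying p_hat > threshold."""
    y_pred = (p_hat > threshold).astype(int)
    # With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return float(tp / (tp + fn)), int(fp)


for threshold in (0.5, 0.25):
    sens, false_pos = sensitivity_at(threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, false positives={false_pos}")
```

In this toy example, lowering the cutoff recovers all sick cases at the cost of one extra false positive; whether that trade is acceptable is a judgment to justify in the report.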

__Option 2__. Your collaborator is generating a model to predict the duration of flight delays (`dep_delay`+`arr_delay` in the `flights` dataset). _They will only be convinced that your model is better if you are able to reduce uncertainty around predictions for a novel observation (`new_obs`)._ Feel free to use any of the variables in the dataset, reconsider the data pre-processing approach, explore other models, data partitions, etc.\
Hint: Consider the trade-offs associated with the overall model error and prediction error for a particular observation. Consider explaining this to your collaborator.

In [None]:
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

flights = pd.read_csv("flights.data", sep = '\t')
flights['delay'] = flights['dep_delay'] + flights['arr_delay']
flights = flights.loc[flights.delay.notna()]

new_obs = flights.sample(n=1,random_state=5)

flights_train, flights_test = train_test_split(flights, test_size=0.25, random_state=5)
X_train = sm.add_constant(flights_train['distance'])
y_train = flights_train['delay']
de_model = sm.OLS(y_train, X_train).fit()

X_new_obs = sm.add_constant(new_obs[['distance']], has_constant='add')
pred = de_model.get_prediction(X_new_obs)
pred_summary = pred.summary_frame(alpha=0.05)
print(pred_summary)

### Report

##### Synopsis

  > **_Answer:_**  [BEGIN SOLUTION].

##### Data pre-processing

In [None]:
# BEGIN SOLUTION

# END SOLUTION

##### Data exploration

In [None]:
# BEGIN SOLUTION

# END SOLUTION

##### Model development

In [None]:
# BEGIN SOLUTION

# END SOLUTION

##### Conclusions

  > **_Answer:_**  [BEGIN SOLUTION].