# Assignment 1

All required code is a single line. The length of your response for questions that require identification and/or interpretation will not be considered in evaluation. For example, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. 

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the `help()` function in Python to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

### Question 1: Simple Linear Regression 

Let's set up our workspace and use the `Boston` dataset in the `ISLP` library. Print `Boston` to learn more about the dataset.

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.pyplot import subplots
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Import specific objects
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

In [None]:
# Load the "Boston" dataset using the "load_data" function from the ISLP package
Boston = load_data('Boston')

Before we fit and review model outputs, we should visualize our data. Review the code and plot, shown below. Answer the following questions:

_(i)_ What are the `medv` and `dis` variables being plotted? (Hint: review this [link](https://islp.readthedocs.io/en/latest/datasets/Boston.html))

_(ii)_ What concept ‘defines’ the plotted line?

In [None]:
# (i) medv is the median value of owner-occupied homes in $1000s, and dis is the weighted mean of distances to five Boston employment centres.
# (ii) The plotted line is defined by simple linear regression, which estimates the relationship between the independent variable ('dis') and the dependent variable ('medv')
# by finding the best-fitting straight line: the one that minimizes the sum of squared differences between the observed values and the values predicted by the model (least squares).
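
The least-squares idea described in (ii) can be sketched with a few lines of NumPy. This is a minimal illustration on made-up data (not the `Boston` dataset); the arrays `x` and `y` and the helper `sse` are hypothetical names introduced only for this example.

```python
import numpy as np

# Synthetic, illustrative data (NOT the Boston dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Closed-form least-squares estimates for the line y = b0 + b1 * x
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared residuals for a candidate line."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


best = sse(b0, b1)
# Nudging the slope away from the estimate increases the sum of squares
nudged = sse(b0, b1 + 0.1)
print(f"b0={b0:.3f}, b1={b1:.3f}, SSE={best:.4f}, nudged SSE={nudged:.4f}")
```

Any line other than the fitted one yields a larger sum of squared residuals, which is what "best-fitting" means here.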

In [None]:
# Extract the variables
medv = Boston['medv'].values.reshape(-1, 1)
dis = Boston['dis'].values.reshape(-1, 1)

# Plot data
plt.scatter(dis, medv, label='Data')

# Fit a linear regression model
lm = LinearRegression()
lm.fit(dis, medv)

# Plot the regression line
plt.plot(dis, lm.predict(dis), color='red', label='Regression Line')

# Add labels and legend
plt.xlabel('dis')
plt.ylabel('medv')
plt.legend()

# Show the plot
plt.show()

Consider the variables plotted above. In the context of the `Boston` dataset:

_(iii)_ What is the (implied) null hypothesis? What is the (implied) alternative hypothesis?

_(iv)_ Now, let’s fit a simple regression model, using the general syntax `sm.OLS()` and `model.fit()`. As above, use `medv` as the response variable Y, and `dis` as the predictor variable X. (Hint: use the syntax `sm.add_constant` with the appropriate argument).

In [None]:
# (iii) The implied null hypothesis is that there is no linear relationship between dis and medv (the slope is zero). The implied alternative hypothesis is that there is a linear relationship between dis and medv (the slope is non-zero).

# (iv)
# Add a constant term to the predictor variable
dis_with_const = sm.add_constant(dis)

# Fit the regression model
model = sm.OLS(medv, dis_with_const)
results = model.fit()

# Print the summary
print(results.summary())

# Print confidence intervals for the coefficients
conf_intervals = results.conf_int()
print(conf_intervals)
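
To see why the hint asks for `sm.add_constant`, the sketch below fits the same kind of model with plain NumPy, once without and once with a column of ones. It uses made-up data (not `Boston`), and `np.linalg.lstsq` stands in for `sm.OLS` purely for illustration.

```python
import numpy as np

# Synthetic, illustrative data (NOT the Boston dataset): y = 3 + 2x exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x

# Without a column of ones the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# Prepending a column of ones lets the fit estimate an intercept,
# which is the role sm.add_constant plays in the cell above
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("no intercept:", slope_only)
print("with intercept (b0, b1):", coef)
```

Only the second fit recovers the intercept and slope used to generate the data; the first one compensates for the missing intercept by distorting the slope.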

Review your model output to answer the following questions (Hint: use the `summary` and `conf_int` functions):    
_(v)_ What are the _coefficient estimates_ for $B_0$ (intercept) and $B_1$ (slope)?  
_(vi)_ What are the _standard errors_ for $B_0$ and $B_1$?  
_(vii)_ What are the _confidence intervals_ for $B_0$ and $B_1$?  


In [None]:
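# (v) The intercept estimate is 18.3901 and the slope estimate is 1.0916.
# (vi) The standard error is 0.817 for the intercept and 0.188 for the slope.
# (vii) The 95% confidence interval is [16.78417945, 19.99599722] for the intercept and [0.72150933, 1.4617167] for the slope.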
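
Standard errors and confidence intervals are linked: a 95% interval is the estimate plus or minus a t critical value times the standard error. As a quick consistency check, the sketch below backs the standard errors out of the interval endpoints in (vii). It assumes the `Boston` data has 506 rows (so 504 residual degrees of freedom), and `se_from_ci` is a hypothetical helper written for this example.

```python
from scipy import stats

n_obs = 506      # assumption: number of rows in the Boston dataset
n_params = 2     # intercept and slope
t_crit = stats.t.ppf(0.975, df=n_obs - n_params)


def se_from_ci(lower, upper):
    """Back out the standard error from a two-sided 95% interval."""
    return (upper - lower) / (2 * t_crit)


# Interval endpoints copied from answer (vii)
se_intercept = se_from_ci(16.78417945, 19.99599722)
se_slope = se_from_ci(0.72150933, 1.4617167)
print(round(se_intercept, 3), round(se_slope, 3))
```

The printed values can be compared with the `std err` column of `results.summary()`.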



Now, let's interpret the model output.  
_(viii)_ Is the model a good fit? (Hint: review $R^2$)  
_(ix)_ Do we reject the (implied) null hypothesis? Why or why not? (Hint: review model $F$ statistic, $p$ value).  

In [None]:
# (viii) R^2 is 0.062, which is very close to 0. This suggests that dis explains only about 6% of the variance in medv, so the model is a poor fit.
# (ix) The F-statistic is 33.58 and its p-value is 1.21e-08, so we reject the null hypothesis: there is evidence of a statistically significant linear relationship between dis and medv.
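
A model can fit poorly and still be statistically significant because the F statistic depends on the sample size as well as on $R^2$. The sketch below computes the overall F statistic implied by the rounded $R^2$ from (viii), assuming 506 observations in `Boston`; `f_statistic` is a hypothetical helper, not a statsmodels function.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic implied by R^2 for an OLS model with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)


# R^2 as reported (rounded) in answer (viii); 506 rows is an assumption
f_simple = f_statistic(0.062, n_obs=506, n_predictors=1)
print(round(f_simple, 2))
```

The result lands close to the reported 33.58; the small gap comes from using the rounded $R^2$. With roughly 500 residual degrees of freedom, even a small $R^2$ produces a large F.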

### Question 2: Multiple Linear Regression 

We'll continue to use the `Boston` dataset for this question.

_(i)_ Fit a multiple linear regression, with two predictor variables: $X_1$ is `dis`, and $X_2$ is `rm`. As before, keep `medv` as the response variable Y. (Hint: use the syntax `sm.add_constant` with the appropriate arguments).

In [None]:
# Define the predictor variables (X1: dis, X2: rm) and the response variable (Y: medv)
X = Boston[['dis', 'rm']]
Y = Boston['medv']

# Add a constant term to the predictor variables
X_with_const = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(Y, X_with_const)
results = model.fit()

# Print the summary of the regression model
print(results.summary())

# Print confidence intervals for the coefficients
conf_intervals = results.conf_int()
print(conf_intervals)


_(ii)_ In the context of the `Boston` dataset, state the null and alternative hypotheses.

_(iii)_ Review the model output, using `summary()`. Does it appear that both `dis` and `rm` are predictive of `medv`? How did you determine this?

_(iv)_ We can use the inbuilt `sm.graphics.plot_regress_exog` function to generate helpful diagnostic plots (Hint: provide `plot_regress_exog` with the multiple regression model). Review the first generated plot, 'Residuals vs. Fitted'. Which observations are outliers? What impact might outliers have on our model?

In [None]:
# (ii) The null hypothesis is that there is no linear relationship between the predictor variables (dis and rm) and the response variable (medv), i.e. both slope coefficients are zero.
# The alternative hypothesis is that at least one of the predictor variables (dis, rm) has a linear relationship with medv, i.e. at least one slope coefficient is non-zero.
# (iii) Yes. The coefficient estimates for dis and rm (0.4888 and 8.8014) are both positive, so a one-unit increase in either predictor, holding the other fixed, is associated with a predicted increase in medv. Both coefficients are statistically significant, as indicated by their small p-values in the summary.
# The model overall is highly statistically significant, as evidenced by the F-statistic (247.0) and its very small p-value (1.84e-75).
# The R-squared value (0.496) suggests that nearly half of the variance in medv is explained by the combination of dis and rm, indicating that the model provides a reasonable fit to the data.
# (iv)
# Generate diagnostic plots (plot_regress_exog adds its own 2x2 grid of axes to the figure)
fig = plt.figure(figsize=(8, 6))
sm.graphics.plot_regress_exog(results, 'dis', fig=fig)
plt.show()

# Yes, there is an outlier with a large residual (approximately 40) at dis = 12.5, meaning its observed value differs substantially from the value predicted by the model.
# Outliers can distort the estimated parameters of the regression model, leading to biased coefficient estimates, and inflate the residual error, reducing the model's predictive accuracy.
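
The effect of a single outlier on the fitted coefficients can be seen on a toy example. The sketch below uses made-up data (not `Boston`) and `np.polyfit` in place of `sm.OLS`; the added point is deliberately extreme to make the effect obvious.

```python
import numpy as np

# Synthetic, illustrative data (NOT the Boston dataset): points on y = 1 + 2x
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x

slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Append one point far from the others in x, with an unusually low y
x_out = np.append(x, 20.0)
y_out = np.append(y, 0.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"without outlier: slope={slope_clean:.2f}")
print(f"with outlier:    slope={slope_out:.2f}")
```

One point that is extreme in both the predictor and the response is enough to pull the slope far from the value that describes the other ten points.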

_(v)_ Fit a second model, this time including an interaction between the two predictor variables. Is there an interaction? (Hint: add a variable `x1 * x2` where `x1` and `x2` are the predictor variables). State an interpretation of the interaction, in the context of the `Boston` dataset, in one or two sentences.

In [None]:
# Create the interaction variable (dis * rm)
Boston['dis_rm_interaction'] = Boston['dis'] * Boston['rm']

# Define the predictor variables (including the interaction) and the response variable
X_interaction = Boston[['dis', 'rm', 'dis_rm_interaction']]
Y = Boston['medv']

# Add a constant term to the predictor variables
X_interaction_with_const = sm.add_constant(X_interaction)

# Fit the multiple linear regression model with the interaction
model_interaction = sm.OLS(Y, X_interaction_with_const)
results_interaction = model_interaction.fit()

# Print the summary of the regression model
print(results_interaction.summary())

# (v) 
# In the context of the Boston dataset, the interaction between the weighted distances to employment centers (dis) and the average number of rooms per dwelling (rm) suggests that the relationship between housing prices (medv) 
# and distance to employment centers varies depending on the size of dwellings, indicating that proximity to employment centers may have a different impact on housing prices in areas with different dwelling sizes.
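
An interaction term means that the slope for one predictor depends on the value of the other: with a model $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$, the change in $y$ per unit of $x_1$ is $b_1 + b_3 x_2$. The sketch below illustrates this on made-up, noise-free data with hypothetical coefficients (not estimates from `Boston`), using `np.linalg.lstsq` in place of `sm.OLS`.

```python
import numpy as np

# Synthetic, illustrative data (NOT the Boston dataset)
rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, size=50)
x2 = rng.uniform(3, 9, size=50)

# Hypothetical true coefficients: intercept, x1, x2, x1 * x2
true_beta = np.array([1.0, 2.0, 3.0, 0.5])
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
y = X @ true_beta  # no noise, so the fit recovers true_beta

beta, *_ = np.linalg.lstsq(X, y, rcond=None)


def slope_of_x1(x2_value):
    """Change in y per unit of x1 at a given x2: b1 + b3 * x2."""
    return beta[1] + beta[3] * x2_value


print(slope_of_x1(4.0), slope_of_x1(8.0))
```

In the assignment's terms, `x1` plays the role of `dis` and `x2` the role of `rm`: a non-zero interaction coefficient means the effect of distance on `medv` is different for small and large dwellings.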


# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Simple Linear Regression - Standard errors for $B_0$ and $B_1$|The standard errors are correct.|The standard errors are not correct.|
|Simple Linear Regression - Confidence intervals for $B_0$ and $B_1$|The confidence intervals are correct.|The confidence intervals are not correct.|
|Multiple Linear Regression - Null and alternative hypotheses|The relationship for both hypotheses has been correctly identified.|The relationship for both hypotheses has been incorrectly identified.|
|Multiple Linear Regression - Interpretation of the interaction|The interaction has been correctly identified.|The interaction has been incorrectly identified.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Note:

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-1`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
