## EE 461P: Data Science Principles  
### Assignment 1  
### Total points: 70
### Due: Tuesday, February 8, 2022, submitted via Canvas by 11:59 pm  

Your homework should be written in a **Jupyter notebook**. You may work in groups of two if you wish. Only one student per team needs to submit the assignment on Canvas, but be sure to include the name and UT eID of both students. Homework groups will be created and managed through Canvas, so please do not arbitrarily change your homework group. If you do change groups, let the TAs know.

Also, please make sure your code runs and the graphics (and anything else) are displayed in your notebook before submitting (for example, by running `%matplotlib inline`).

### Name(s) and EID(s):
1. Harshika Jha
2. Jared McArthur


# Question 1  Machine Learning Environments (5 pts)

First, read the Vertex AI description at https://cloud.google.com/vertex-ai
and watch the associated video https://www.youtube.com/watch?v=gT4qqHMiEpA&t=216s.

Now state what you feel are the top three capabilities that Vertex AI provides to facilitate the design and deployment of ML solutions (the answer is subjective, of course; the intent is to make you think about the various steps involved in the design pipeline).

# Answer 1

# Question 2 (20 pts)

A biased coin, with an unknown but fixed probability $p$ of obtaining "tails" on any toss (independent of the outcomes of other tosses), is tossed  8 times, and the following  sequence of outcomes is observed:

1 1 1 0 1 0 0 1 

where 0 denotes heads and 1 denotes tails. Derive the  maximum likelihood estimate of $p$ given the observation.
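A numerical sanity check can accompany the derivation. The sketch below evaluates the Bernoulli log-likelihood of the observed sequence on a grid of candidate values of $p$ and picks the largest. The grid resolution and the names `log_likelihood`, `p_grid` and `p_hat` are illustrative choices, not part of the assignment, and a grid search does not replace the analytical derivation.

```python
import numpy as np

# Observed tosses from the question: 1 = tails, 0 = heads.
tosses = np.array([1, 1, 1, 0, 1, 0, 0, 1])
n_tails = int(tosses.sum())
n_heads = tosses.size - n_tails


def log_likelihood(p):
    """Bernoulli log-likelihood of the observed sequence for tail probability p."""
    return n_tails * np.log(p) + n_heads * np.log(1 - p)


# Candidate values strictly inside (0, 1), so the logarithms stay finite.
p_grid = np.linspace(0.001, 0.999, 999)
p_hat = p_grid[np.argmax(log_likelihood(p_grid))]
print(f"grid maximizer of the log-likelihood: {p_hat:.3f}")
```

The grid maximizer should agree with the closed-form estimate obtained by setting the derivative of the log-likelihood to zero.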


# Answer 2

# Question 3 (15 pts)

Let variables $x\in R^{\ p}$ and $y \in R$ be related by the following equation

$$
\begin{align*}
y=\phi(w,x) +\epsilon
\end{align*}
$$

where $\phi(w,x): R^{\ p}\rightarrow R$ , with learnable parameter $w \in R^{\ p}$, denotes a deterministic function of the predictor variable $x$, and the zero-mean scalar random variable $\epsilon$ has the distribution

$$
\begin{align*}
f_{Ε}(\epsilon)= \frac{1}{2b} \exp(-\frac{|\epsilon|}{b})
\end{align*}
$$

In other words, the distribution of the target variable $y$ conditioned on the model prediction $\phi(w,x)$ and noise distribution parameter $b$ is   

$$f_{Y|\phi(w,x),b}(y|\phi(w,x),b)=\frac{1}{2b} \exp(-\frac{|y-\phi(w,x)|}{b})$$


Now suppose we have a data set of size $N$, where the observations $y_1,y_2,...,y_N$ correspond to inputs $x_1,x_2,...,x_N$. Show that the maximum likelihood estimates of $w$ and $b$ are 

$$w_{ML} =  \arg \min_{w}  [\ \sum_{i=1}^N|y_i-\phi(w,x_i)|\ ]$$

and 
$$b_{ML}=\frac{1}{N} \sum_{i=1}^N|y_i-\phi(w_{ML},x_i)|$$
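The two estimates can be checked numerically on simulated data. The sketch below assumes a scalar linear model $\phi(w,x) = wx$, synthetic inputs, and the values `w_true = 2.0` and `b_true = 0.5`; all of these are illustrative assumptions, not part of the problem statement. It minimizes the sum of absolute residuals over a grid of $w$ values and then computes $b$ as the mean absolute residual.

```python
import numpy as np

# Synthetic data under an ASSUMED scalar model phi(w, x) = w * x with Laplace noise.
rng = np.random.default_rng(0)
N = 2000
w_true, b_true = 2.0, 0.5
x = rng.uniform(1.0, 3.0, size=N)
y = w_true * x + rng.laplace(loc=0.0, scale=b_true, size=N)


def neg_log_likelihood(w, b):
    """Negative log-likelihood of the Laplace noise model for the data above."""
    return N * np.log(2 * b) + np.abs(y - w * x).sum() / b


# Step 1: w_ML minimizes the sum of absolute residuals (searched on a grid here).
w_grid = np.linspace(1.5, 2.5, 1001)
sum_abs = np.array([np.abs(y - w * x).sum() for w in w_grid])
w_ml = w_grid[np.argmin(sum_abs)]

# Step 2: b_ML is the mean absolute residual evaluated at w_ML.
b_ml = np.mean(np.abs(y - w_ml * x))
print(f"w_ml = {w_ml:.3f}, b_ml = {b_ml:.3f}")
```

With a few thousand samples both estimates should land close to the values used to generate the data, and perturbing `b_ml` in either direction should increase `neg_log_likelihood(w_ml, b)`.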

# Answer 3


# Question 4 : Multiple Linear Regression (MLR) in Python (25 points) 

In this problem, you will be working on a dataset to predict housing prices in Ames, Iowa. The initial few cells will download and set up the data.

The dataset consists of 1460 datapoints. You are required to predict the `SalePrice` using the following 8 features for each datapoint - 

1. `OverallQual` - Rates the overall material and finish of the house

2. `GrLivArea` - Above grade (ground) living area square feet

3. `GarageCars` - Size of garage in car capacity

4. `GarageArea` - Size of garage in square feet

5. `TotalBsmtSF` - Total square feet of basement area

6. `1stFlrSF` - First Floor square feet

7. `FullBath` - Full bathrooms above grade

8. `TotRmsAbvGrd` - Total rooms above grade (does not include bathrooms)

You will start writing code from the `Performing Regression` section. 

In [None]:
#@title Common Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

## Load Data

In [None]:
#@title Download data
!wget https://raw.githubusercontent.com/Data-Science-FMI/ml-from-scratch-2019/master/data/house_prices_train.csv

--2022-01-20 04:17:21--  https://raw.githubusercontent.com/Data-Science-FMI/ml-from-scratch-2019/master/data/house_prices_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 460676 (450K) [text/plain]
Saving to: ‘house_prices_train.csv’


2022-01-20 04:17:21 (19.4 MB/s) - ‘house_prices_train.csv’ saved [460676/460676]



In [None]:
#@title Read data into a pandas dataframe
df = pd.read_csv('house_prices_train.csv')

## Preprocess Data

In [None]:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)

X = df[["OverallQual","GrLivArea","GarageCars","GarageArea","TotalBsmtSF","1stFlrSF","FullBath","TotRmsAbvGrd"]]
y = df["SalePrice"]

pd.set_option('display.max_columns', 8)

## Performing Regression

Before we get into the regression model, it would be good to gain an intuition for the data that we have. We present two plots here - a distribution of the variable to be predicted and a heatmap depicting the correlation between the different independent variables (features). Feel free to explore more and perform additional analysis!

a. (5 points) Plot a histogram of `SalePrice` to get an idea of the distribution of the variable to be predicted. Mention any interesting observations about the graph, along with any information about outliers. (Hint: Take a look at [`seaborn.displot`](https://seaborn.pydata.org/generated/seaborn.displot.html))

In [None]:
# Answer a

b. (5 points) Plot a heatmap to show the correlation matrix of the independent variables (features) that will be used to make predictions. This will help you understand how correlated/uncorrelated different features are and how important they might be for prediction. (Hint: Take a look at [`pandas.DataFrame.corr`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and [`seaborn.heatmap`](https://seaborn.pydata.org/generated/seaborn.heatmap.html))
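The two hinted calls fit together roughly as follows. The data here is synthetic (three made-up columns that borrow feature names from the list above, two of them constructed to be correlated), and the colour map and annotation format are arbitrary choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in features (NOT the Ames data); the first two are built to be correlated.
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=500)
features = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.8, size=500),
    "GarageArea": rng.normal(470, 200, size=500),
})

corr = features.corr()  # pairwise Pearson correlation by default

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlation matrix (synthetic data)")
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale symmetric, so a correlation of zero maps to the neutral colour regardless of the data.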

In [None]:
# Answer b

c. (2 pts) Print the shape (number of rows and columns) of the feature matrix X, and print the first 5 rows.

In [None]:
# Answer c

d. (5 pts) Using ordinary least squares, fit a multiple linear regression (MLR) on all the feature variables using the entire dataset. Report the regression coefficient of each input feature and evaluate the model using mean squared error (MSE). An example of ordinary least squares in Python is shown in Section 1.1.1 of http://scikit-learn.org/stable/modules/linear_model.html.
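A minimal sketch of the fit-report-evaluate pattern with scikit-learn is shown below. It runs on synthetic data with two made-up features (`feat_a`, `feat_b`) and known coefficients, so the recovered values can be compared against the ones used to generate the data; none of these names or numbers come from the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with KNOWN coefficients (3.0, -2.0) and intercept 5.0.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "feat_a": rng.normal(size=200),
    "feat_b": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["feat_a"] - 2.0 * X_demo["feat_b"] + 5.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression()  # ordinary least squares, intercept fitted by default
model.fit(X_demo, y_demo)

# Report one coefficient per input feature, then the intercept.
for name, coef in zip(X_demo.columns, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")

# MSE on the same data the model was fitted on.
mse = mean_squared_error(y_demo, model.predict(X_demo))
print(f"MSE on the fitting data: {mse:.4f}")
```

Pairing `model.coef_` with the column names avoids having to remember the feature order when reading the coefficients.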

In [None]:
# Answer d

e. (5 pts) Split the data into a training set and a test set, using `train_test_split` with test_size = 0.25 and random_state = 50. Fit an MLR using the training set. Evaluate the trained model using the training set and the test set, respectively. Compare the two MSE values thus obtained. Report the $R^2$ value (coefficient of determination). See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html and https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html.
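The split-fit-evaluate workflow can be sketched as follows, again on synthetic data (400 points, three made-up features, assumed coefficients) rather than on the housing dataset. Only the `test_size` and `random_state` values are taken from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (NOT the Ames data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=400)

# Hold out 25% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=50
)

# Fit on the training rows only, then score both subsets.
model = LinearRegression().fit(X_train, y_train)
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))
r2_test = r2_score(y_test, model.predict(X_test))

print(f"train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}, test R^2 = {r2_test:.3f}")
```

A test MSE noticeably larger than the training MSE would suggest overfitting; values of similar size suggest the model generalizes about as well as it fits.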

In [None]:
# Answer e

f. (3 pts) Recall the assumptions behind the MLR model - 

1. There is a linear relationship between the dependent and the predictor variables

2. The residuals (difference between predicted and original values) are normally distributed with a constant variance and independent of each other

The correlation matrix plotted earlier can be used to loosely argue independence. For this problem, build a scatter plot of predicted vs. real values for the test data, with both axes equal in length, and draw the line y = x through the middle. Also plot the distribution (histogram) of the residuals, and use the plots to justify whether or not you think the MLR model is reasonable for this problem.
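One possible layout for the two diagnostic plots is sketched below. The arrays `y_true` and `y_pred` are synthetic placeholders standing in for test targets and model predictions, and the figure size, marker size and bin count are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders for test targets and predictions (NOT model output).
rng = np.random.default_rng(0)
y_true = rng.uniform(50, 300, size=200)
y_pred = y_true + rng.normal(scale=15, size=200)
residuals = y_true - y_pred

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(11, 5))

# Use one common range for both axes so the y = x line runs corner to corner.
lo = min(y_true.min(), y_pred.min())
hi = max(y_true.max(), y_pred.max())
ax_scatter.scatter(y_true, y_pred, s=10, alpha=0.6)
ax_scatter.plot([lo, hi], [lo, hi], color="red", label="y = x")
ax_scatter.set_xlim(lo, hi)
ax_scatter.set_ylim(lo, hi)
ax_scatter.set_aspect("equal", adjustable="box")
ax_scatter.set_xlabel("Real value")
ax_scatter.set_ylabel("Predicted value")
ax_scatter.legend()

# Residual histogram: roughly symmetric and bell-shaped supports the normality assumption.
ax_hist.hist(residuals, bins=30)
ax_hist.set_xlabel("Residual (real - predicted)")
ax_hist.set_ylabel("Count")

fig.tight_layout()
```

Points scattered evenly around the red line, with no funnel shape, are consistent with a linear relationship and constant residual variance.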

In [None]:
# Answer f

# Question 5 : Swapping dependent and independent variables (5 points)

Consider a dataset consisting of two features and N (>1) data points -

1.   The number of ice-cream cones sold per day (Var 1)
2.   The average temperature of a day (Var 2)

We perform two ordinary least squares regression experiments  -

Experiment 1 : We treat Var 1 as the independent variable and Var 2 as the dependent variable.

Experiment 2 : We treat Var 2 as the independent variable and Var 1 as the dependent variable.

Let the slope of the regression line obtained in Experiment 1 be S1 and the slope of regression line obtained in Experiment 2 be S2.

From the choices below, select the most appropriate statement about S1 * S2 (the product of S1 and S2) that is true without making any assumptions about the data points, and explain why you think that option is true - 

a) S1 * S2 = -1 \\
b) S1 * S2 = 1 \\
c) The value of S1 * S2 cannot be ascertained \\
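The product of the two slopes can be explored numerically before reasoning about it. In simple least-squares regression each slope is the sample covariance divided by the variance of the independent variable, so the product equals the squared sample correlation. The sketch below checks this on synthetic data; the temperature range, the linear trend and the noise level are made-up assumptions.

```python
import numpy as np

# Synthetic data (NOT real sales figures): cones sold rise with temperature, plus noise.
rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=100)                      # Var 2
cones = 20 + 4 * temperature + rng.normal(scale=25, size=100)    # Var 1

# np.polyfit(x, y, 1) returns [slope, intercept] of the least-squares line of y on x.
s1 = np.polyfit(cones, temperature, 1)[0]   # Experiment 1: Var 1 independent, Var 2 dependent
s2 = np.polyfit(temperature, cones, 1)[0]   # Experiment 2: Var 2 independent, Var 1 dependent

r = np.corrcoef(cones, temperature)[0, 1]   # sample correlation coefficient
print(f"S1 * S2 = {s1 * s2:.4f}, r^2 = {r ** 2:.4f}")
```

Changing the noise scale changes the correlation, and the product S1 * S2 moves with it.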

# Answer 5