# CSCI 3022: Intro to Data Science - Fall 2021 Final Coding Exam
***

This practicum is due on Canvas by Thursday, December 9, at **6:00 PM**. No late work is accepted. Your solutions to theoretical questions should be done in Markdown directly below the associated question.  Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  

**Here are the rules:** 

1. All work, code and analysis, must be your own. 
1. You may use your course notes, posted lecture slides, textbooks, in-class notebooks, and homework solutions as resources.  You may also search online for answers to general knowledge questions like the form of a probability distribution function or how to perform a particular operation in Python/Pandas. 
1. This is meant to be like a coding portion of your midterm exam. So, the instructional team will be much less helpful than we typically are with homework. For example, we will not check answers, help debug your code, and so on.
1. If something is left open-ended, it is because we want to see how you approach the kinds of problems you will encounter in the wild, where it will not always be clear what sort of tests/methods should be applied. Feel free to ask clarifying questions though.
2. You may **NOT** post to message boards or other online resources asking for help.
3. You may **NOT** copy-paste solutions *from anywhere*.
4. You may **NOT** collaborate with classmates or anyone else.
5. In short, **your work must be your own**. It really is that simple.

Violation of the above rules will result in an immediate academic sanction (*at the very least*, you will receive a 0 on this practicum or an F in the course, depending on severity), and a trip to the Honor Code Council.

**By submitting this assignment, you agree to abide by the rules given above.**

***

**Name**: Felipe Lima

***


**NOTES**: 

- Late work is not accepted. Therefore, do not try to turn in work at the last minute as slow WiFi or computer crashing are not valid reasons for accepting late work. 
- If you have a question for us, post it as a **PRIVATE** message on Piazza.  If we decide that the question is appropriate for the entire class, then we will add it to a Practicum clarifications thread. 
- Do **NOT** load or use any Python packages that are not available in Anaconda 3.6. 
- Some problems with code may be autograded.  If we provide a function API **do not** change it.  If we do not provide a function API then you're free to structure your code however you like. 
- Submit only this Jupyter notebook to Canvas.  Do not compress it using tar, rar, zip, etc. If your work cannot be downloaded, opened, or interpreted by the grader then it will not be accepted at a later time.
- This should go without saying, but... For any question that asks you to calculate something, you **must show all work to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit.

---



In [1]:
from scipy import stats
from math import isnan
import numpy as np 
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


---

### Problem 1 - Multiple Linear Regression

We will further examine the `boxing$.csv` dataset in this problem.
Ten professional boxers received a payday (in thousands) for their latest fight. How do the amount of money spent on trainers/coaches, the amount of money spent on promotion, and the amount of money received from sponsors relate to the amount of money made for their payday fight?

---

**Response**:
- `Y` - amount of money made in boxing match (payday)

**Features**:
- `X1` - amount of money spent on trainers and coaches
- `X2` - amount of money spent on promotion
- `X3` - amount of money received from sponsors
 


**Part A**: Read the data from the csv into a Pandas DataFrame.  Print the full dataframe to the screen.

In [2]:
# Code Here
file_path = 'boxing$.csv'
dfBo = pd.read_csv(file_path)

dfBo

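| | Y | X1 | X2 | X3 |
| --- | --- | --- | --- | --- |
| 0 | 85.099998 | 8.5 | 5.1 | 4.7 |
| 1 | 106.300003 | 12.9 | 5.8 | 8.8 |
| 2 | 50.200001 | 5.2 | 2.1 | 15.1 |
| 3 | 130.600006 | 10.7 | 8.399999 | 12.2 |
| 4 | 54.799999 | 3.1 | 2.9 | 10.6 |
| 5 | 30.299999 | 3.5 | 1.2 | 3.5 |
| 6 | 79.400002 | 9.2 | 3.7 | 9.7 |
| 7 | 91.0 | 9.0 | 7.6 | 5.9 |
| 8 | 135.399994 | 15.1 | 7.7 | 20.799999 |
| 9 | 89.300003 | 10.2 | 4.5 | 7.9 |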


**Part B:** Now, let's fit a multiple linear regression model! We will use the statsmodels package for this task. Execute the following cell to import the required package. Then answer the following questions and perform the requested tasks. 

- Use `sm.OLS(...).fit()` to fit the model. 

- Then use `model.params` to print the regression coefficients to the screen.

- Use a Markdown cell to specify the MLR model in the form: $ \hat{y} = \beta_0+\beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p $

In [3]:
import statsmodels.api as sm

In [4]:
# Code Here
X = dfBo[["X1", "X2", "X3"]]
X = sm.add_constant(X)
y = dfBo["Y"]

model = sm.OLS(y, X).fit()

print(model.params)

const    7.676029
X1       3.661604
X2       7.621051
X3       0.828468
dtype: float64


**Solution:**

$$
\hat{y} = 7.676029 + 3.661604 \, X_1 + 7.621051 \, X_2 + 0.828468 \, X_3
$$
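
As a cross-check on these coefficients, the same least-squares problem can be solved directly with NumPy. This is a minimal sketch that assumes the ten rows printed in Part A; the array names (`Y`, `X1`, `A`, `beta`) are illustrative and not part of the original notebook.

```python
import numpy as np

# Values copied from the DataFrame printed in Part A
Y = np.array([85.099998, 106.300003, 50.200001, 130.600006, 54.799999,
              30.299999, 79.400002, 91.0, 135.399994, 89.300003])
X1 = np.array([8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2])
X2 = np.array([5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5])
X3 = np.array([4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.799999, 7.9])

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones_like(Y), X1, X2, X3])

# Least-squares solution of A @ beta ~ Y
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

for name, value in zip(["const", "X1", "X2", "X3"], beta):
    print(f"{name}: {value:.6f}")
```

The printed values should agree with `model.params` up to rounding, since both solve the same ordinary least-squares problem.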

**Part C**: Inspect the output from your MLR model from **Part B**. Perform the appropriate statistical hypothesis test at the $\alpha = 0.01$ significance level to determine if _at least one_ of the features is related to the response $y$. 

In [5]:
# Code Here
model.summary()



| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | Y | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.95 |
| Method: | Least Squares | F-statistic: | 58.22 |
| Date: | Thu, 09 Dec 2021 | Prob (F-statistic): | 7.91e-05 |
| Time: | 21:47:41 | Log-Likelihood: | -31.839 |
| No. Observations: | 10 | AIC: | 71.68 |
| Df Residuals: | 6 | BIC: | 72.89 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 7.6760 | 6.760 | 1.135 | 0.299 | -8.866 | 24.218 |
| X1 | 3.6616 | 1.118 | 3.276 | 0.017 | 0.927 | 6.397 |
| X2 | 7.6211 | 1.657 | 4.598 | 0.004 | 3.566 | 11.676 |
| X3 | 0.8285 | 0.539 | 1.536 | 0.175 | -0.491 | 2.148 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 1.444 | Durbin-Watson: | 1.971 |
| Prob(Omnibus): | 0.486 | Jarque-Bera (JB): | 0.427 |
| Skew: | -0.505 | Prob(JB): | 0.808 |
| Kurtosis: | 2.937 | Cond. No. | 43.0 |


**Solution:**

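The appropriate test is the overall $F$-test with $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ against $H_1$: at least one $\beta_j \neq 0$. The summary reports an $F$-statistic of 58.22 with a p-value of $7.91 \times 10^{-5}$, which is below $\alpha = 0.01$, so we reject $H_0$ and conclude that at least one of the features is related to $y$. Judging by the individual t-tests, the only feature that is significant on its own at this level is X2, the amount of money spent on promotion.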
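
The overall $F$-test for this part can also be reproduced by hand from the sums of squares. The sketch below assumes the data printed in Part A and uses `np.linalg.lstsq` for the fit; the variable names are illustrative, and `stats.f.sf` gives the upper-tail probability of the $F$ distribution.

```python
import numpy as np
from scipy import stats

# Values copied from the DataFrame printed in Part A
Y = np.array([85.099998, 106.300003, 50.200001, 130.600006, 54.799999,
              30.299999, 79.400002, 91.0, 135.399994, 89.300003])
X1 = np.array([8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2])
X2 = np.array([5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5])
X3 = np.array([4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.799999, 7.9])

A = np.column_stack([np.ones_like(Y), X1, X2, X3])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

n, p = len(Y), 3                      # observations, number of features
sse = np.sum((Y - A @ beta) ** 2)     # residual sum of squares
sst = np.sum((Y - Y.mean()) ** 2)     # total sum of squares

# F = (explained variation per feature) / (residual variation per df)
F = ((sst - sse) / p) / (sse / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)

alpha = 0.01
print(f"F = {F:.2f}, p-value = {p_value:.3g}, reject H0: {p_value < alpha}")
```

The statistic and p-value should match the `F-statistic` and `Prob (F-statistic)` entries of the summary up to rounding.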

**Part D:** Using the statsmodels output from **Part C**, indicate in the table below, which features seem to have a significant relationship with the response.

| Feature Name | Significant Relationship w/ Response (Yes or No) | Reason (one or two word explanation) |
| --- | --- | --- |
| X1 | No | (The p-value associated with $X_1$, 0.017, is above the $\alpha = 0.01$ significance level.) |
| X2 | Yes | (The p-value associated with $X_2$, 0.004, is below the $\alpha = 0.01$ significance level.) |
| X3 | No | (The p-value associated with $X_3$, 0.175, is above the $\alpha = 0.01$ significance level.) |

**Part E**: Forward selection is the process of fitting an MLR model by adding one feature to the model at a time. There are multiple ways to decide which features are added. We will use the Sum of Squared Errors (SSE) to select our features. The full model in this problem involves 3 features. We seek to build a "reduced model" with the two "best" features.

How is a forward selection carried out? Read on for a detailed description of the process.

- Write a function `forward_select(df, resp_str, maxk)` that takes in the DataFrame, the name of the column corresponding to the response, and the maximum number of desired features, and returns a list of feature names corresponding to the `maxk` most important features via forward selection.   

- At each stage in forward selection you should add the feature whose inclusion in the model would result in the lowest sum of squared errors $(SSE)$. Use the following method of computing SSE: `np.sum((y-model.predict(X))**2)`

- Use your function to determine the best $k=2$ features to include in the model. Clearly indicate which feature was added in each stage via a print statement. 

**Note**: The point of this exercise is to see if you can implement **forward_select** yourself.  You may of course use canned routines like statsmodels OLS, but you may not call any Python method that explicitly performs forward selection.

In [6]:
def forward_select(df, resp_str="Y", maxk=2):
    # Code Here
    initial_features = df.columns[1:4].tolist()
    desired_features = []
    min = float('inf')
    min2 = float('inf')
    for i in initial_features:
        print("Adding " , i , " feature.")
        X = df[[i,i,i]]
        X = sm.add_constant(X)
        SSE = np.sum((y-model.predict(X))**2)
        print (SSE)
        if (SSE < min2):
            desired_features.append(i)
        min2 = min
        min = SSE
    return desired_features
    
forward_select(dfBo)


Adding  X1  feature.
11948.951823657711
Adding  X2  feature.
4725.290770569185
Adding  X3  feature.
45646.598160522975


['X1', 'X2']
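
Forward selection needs a fresh fit for every candidate feature at every stage, rather than reusing the full model from Part B. A minimal sketch of that loop follows. It assumes the data printed in Part A, and it uses `np.linalg.lstsq` in place of `sm.OLS` to keep the example dependency-light (the SSE is the same quantity as `np.sum((y - model.predict(X))**2)` for a refitted model). The names `sse_for` and `forward_select_sketch` are hypothetical helpers, not part of the provided API.

```python
import numpy as np
import pandas as pd


def sse_for(df, resp_str, features):
    """SSE of an OLS fit of resp_str on the given features plus an intercept."""
    y = df[resp_str].to_numpy()
    A = np.column_stack([np.ones(len(df))] + [df[f].to_numpy() for f in features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)


def forward_select_sketch(df, resp_str="Y", maxk=2):
    """Greedily add the feature that gives the lowest SSE at each stage."""
    remaining = [c for c in df.columns if c != resp_str]
    selected = []
    while remaining and len(selected) < maxk:
        # Refit once per candidate, always keeping the already selected features
        scores = {f: sse_for(df, resp_str, selected + [f]) for f in remaining}
        best = min(scores, key=scores.get)
        print(f"Stage {len(selected) + 1}: adding {best} (SSE = {scores[best]:.3f})")
        selected.append(best)
        remaining.remove(best)
    return selected


# Values copied from the DataFrame printed in Part A
df = pd.DataFrame({
    "Y": [85.099998, 106.300003, 50.200001, 130.600006, 54.799999,
          30.299999, 79.400002, 91.0, 135.399994, 89.300003],
    "X1": [8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2],
    "X2": [5.1, 5.8, 2.1, 8.399999, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5],
    "X3": [4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.799999, 7.9],
})

chosen = forward_select_sketch(df, "Y", maxk=2)
print(chosen)
```

Because each stage keeps the features chosen so far and refits, the SSE printed at stage 2 can never exceed the SSE printed at stage 1.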

**Part F**: Write down the reduced multiple linear regression model, including estimated parameters, obtained by your forward selection process. 

In [7]:
# Code Here
X2 = dfBo[["X1", "X2"]]
X2 = sm.add_constant(X2)
y2 = dfBo["Y"]

model2 = sm.OLS(y2, X2).fit()

print(model2.params)
model2.summary()

const    11.848212
X1        4.228244
X2        7.436110
dtype: float64




| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | Y | R-squared: | 0.954 |
| Model: | OLS | Adj. R-squared: | 0.941 |
| Method: | Least Squares | F-statistic: | 72.14 |
| Date: | Thu, 09 Dec 2021 | Prob (F-statistic): | 2.13e-05 |
| Time: | 21:47:41 | Log-Likelihood: | -33.497 |
| No. Observations: | 10 | AIC: | 72.99 |
| Df Residuals: | 7 | BIC: | 73.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 11.8482 | 6.765 | 1.751 | 0.123 | -4.148 | 27.845 |
| X1 | 4.2282 | 1.153 | 3.667 | 0.008 | 1.502 | 6.955 |
| X2 | 7.4361 | 1.806 | 4.117 | 0.004 | 3.165 | 11.707 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 2.675 | Durbin-Watson: | 2.012 |
| Prob(Omnibus): | 0.263 | Jarque-Bera (JB): | 0.636 |
| Skew: | -0.583 | Prob(JB): | 0.728 |
| Kurtosis: | 3.407 | Cond. No. | 28.4 |


**Solution:**

$$
\hat{y} = 11.848212 + 4.228244 \, X_1 + 7.436110 \, X_2
$$

**Part G:** 

- Compare or contrast the findings of your Forward Select algorithm from **Part E** with the p-values for each feature that were given in the full model summary. 

- Are the seemingly two best features the same?

- How does the $R^2$ value of the full model compare to the $R^2$ value of the reduced model?

- How does the $F$-statistic value of the full model compare to the $F$-statistic value of the reduced model?

**Solution:**

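X1 and X2 had the smallest p-values in the full model summary (0.017 and 0.004, versus 0.175 for X3), and they are also the two features returned by forward selection, so the two approaches agree on the two best features.

The $R^2$ of the reduced model (0.954) is slightly smaller than the $R^2$ of the full model (0.967), which is expected because removing a feature cannot increase $R^2$.

The $F$-statistic of the reduced model (72.14) is larger than the $F$-statistic of the full model (58.22).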

---

### Problem 2 - ANOVA

This data set includes ratings from 8 different individuals for 10 different types of jelly beans. The ratings are values that range from 1 to 10 with a 10 indicating the best positive rating and a 1 indicating the worst negative rating.

As you examine the data, a booming voice from the sky instructs you to analyze the data to answer the following key question:
$\color{blue}{\text{"Is there a statistically significant difference between the mean ratings given to the ten different flavors of jelly beans?"}}$

**Part A:** Convert the `JellyBeanRatings` data set to a Pandas DataFrame. 

In [8]:
JellyBeanRatings = {"Buttered Popcorn": [5, 7, 8, 6, 6, 7, 9, 7], "Caramel Corn": [5, 4, 5, 3, 5, 4, 4, 3], "Cappucino": [4, 2, 4, 2, 4, 6, 7, 4], "Pomegranate": [3, 6, 6, 5, 4, 6, 5, 4],
                    "Lemon": [1, 7, 3, 4, 2, 6, 4, 4], "Pink Grapefruit": [3, 1, 6, 3, 2, 4, 5, 2], "Green Apple": [4, 4, 1, 3, 4, 2, 3, 2], "Orange Sherbet": [9, 6, 3, 3, 4, 3, 4, 3], "A&W Root Beer": [ 7,  8,  7,  7,  7, 10, 10,  9],
                    "Sizzling Cinnamon": [1, 6, 1, 4, 3, 4, 4, 3]}

In [9]:
# Code Here
dfJb = pd.DataFrame(JellyBeanRatings)

dfJb

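| | Buttered Popcorn | Caramel Corn | Cappucino | Pomegranate | Lemon | Pink Grapefruit | Green Apple | Orange Sherbet | A&W Root Beer | Sizzling Cinnamon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | 5 | 4 | 3 | 1 | 3 | 4 | 9 | 7 | 1 |
| 1 | 7 | 4 | 2 | 6 | 7 | 1 | 4 | 6 | 8 | 6 |
| 2 | 8 | 5 | 4 | 6 | 3 | 6 | 1 | 3 | 7 | 1 |
| 3 | 6 | 3 | 2 | 5 | 4 | 3 | 3 | 3 | 7 | 4 |
| 4 | 6 | 5 | 4 | 4 | 2 | 2 | 4 | 4 | 7 | 3 |
| 5 | 7 | 4 | 6 | 6 | 6 | 4 | 2 | 3 | 10 | 4 |
| 6 | 9 | 4 | 7 | 5 | 4 | 5 | 3 | 4 | 10 | 4 |
| 7 | 7 | 3 | 4 | 4 | 4 | 2 | 2 | 3 | 9 | 3 |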


**Part B:** In the remainder of this problem, you will perform some hypothesis tests to examine whether these data suggest there are significant differences in the mean ratings of the different jelly bean flavors. Pick a level of significance for these experiments, and explain what that significance level means. "Because we used it a lot in class" is ***not*** a good reason. Instead think of type 1 and type 2 errors.

**Solution:**


I decided to pick an $\alpha = 0.05$ significance level because it is the risk I am willing to take of committing a type 1 error, that is, rejecting $H_0$ in favor of $H_1$ when I shouldn't have. The significance level means I am willing to take a 5% risk of saying the mean ratings for the types of jelly bean are not equal when they really are.

**Part C:** Perform an **Analysis of Variance hypothesis test** in order to determine if there is evidence that there is _some_ difference among the mean ratings given to these 10 jelly bean flavors. Clearly state your null and alternative hypothesis, and use the significance level identified in **Part B**. You must show **all** calculations **by hand** (and may of course use Python as a calculator, and to compute values from a distribution using the appropriate percent-point-function (ppf) or cumulative distribution function (cdf)).

In addition to showing the code for your calculations, make comments **in Markdown** explaining what you are doing.

**Solution:**

$H_0$: There is no difference between the mean ratings given to the ten different flavors of jelly beans.

$H_1$: At least one flavor has a mean rating that differs from the others.
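
The by-hand calculation for a one-way ANOVA can be sketched as follows. This is a minimal example that assumes the `JellyBeanRatings` values from Part A and the $\alpha = 0.05$ level chosen in Part B; the variable names (`ssb`, `ssw`, `f_crit`) are illustrative. It splits the total variation into a between-group and a within-group sum of squares and compares their ratio to an $F$ distribution with $k-1$ and $N-k$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Ratings copied from the JellyBeanRatings dictionary in Part A
ratings = {
    "Buttered Popcorn": [5, 7, 8, 6, 6, 7, 9, 7],
    "Caramel Corn": [5, 4, 5, 3, 5, 4, 4, 3],
    "Cappucino": [4, 2, 4, 2, 4, 6, 7, 4],
    "Pomegranate": [3, 6, 6, 5, 4, 6, 5, 4],
    "Lemon": [1, 7, 3, 4, 2, 6, 4, 4],
    "Pink Grapefruit": [3, 1, 6, 3, 2, 4, 5, 2],
    "Green Apple": [4, 4, 1, 3, 4, 2, 3, 2],
    "Orange Sherbet": [9, 6, 3, 3, 4, 3, 4, 3],
    "A&W Root Beer": [7, 8, 7, 7, 7, 10, 10, 9],
    "Sizzling Cinnamon": [1, 6, 1, 4, 3, 4, 4, 3],
}

groups = [np.array(v, dtype=float) for v in ratings.values()]
k = len(groups)                          # number of flavors
N = sum(len(g) for g in groups)          # total number of ratings
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F statistic: ratio of the two mean squares
F = (ssb / (k - 1)) / (ssw / (N - k))
p_value = stats.f.sf(F, k - 1, N - k)

alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, k - 1, N - k)
print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, F = {F:.3f}")
print(f"critical value = {f_crit:.3f}, p-value = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Rejecting $H_0$ here only says that some flavor means differ; it does not identify which ones.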

In [10]:
# Code Here

**Part D:** Perform a **t hypothesis test** to determine if there is evidence supporting the claim that A&W Root Beer Jelly Bellies have a higher mean rating than Green Apple Jelly Bellies. Use the significance level you identified in **Part B**. Clearly state your null and alternative hypotheses, your conclusions, and show all work. Again, you may not use any canned t-test function.

**Solution:**

$H_0$: There is no difference in mean ratings between the two jelly bean types.

$H_1$: A&W Root Beer Jelly Bellies have a higher mean rating than Green Apple Jelly Bellies.

In [11]:
# Code Here
from scipy.stats import ttest_ind

jb1 = dfJb[["A&W Root Beer"]]
jb2 = dfJb[["Green Apple"]]

ttest_ind(jb1,jb2)



Ttest_indResult(statistic=array([8.42410335]), pvalue=array([7.46799747e-07]))
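
Since the question rules out canned t-test functions and the alternative is one-sided, the same test can be sketched by hand. The example below assumes equal variances in the two groups (a pooled two-sample t-test, which is also the default of `ttest_ind`) and uses the ratings from Part A; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Ratings copied from the JellyBeanRatings dictionary in Part A
root_beer = np.array([7, 8, 7, 7, 7, 10, 10, 9], dtype=float)
green_apple = np.array([4, 4, 1, 3, 4, 2, 3, 2], dtype=float)

n1, n2 = len(root_beer), len(green_apple)
dof = n1 + n2 - 2

# Pooled sample variance (assumes both flavors share a common variance)
sp2 = ((n1 - 1) * root_beer.var(ddof=1)
       + (n2 - 1) * green_apple.var(ddof=1)) / dof

# Standard error of the difference in sample means
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_stat = (root_beer.mean() - green_apple.mean()) / se

# One-sided p-value for H1: mean(root beer) > mean(green apple)
p_one_sided = stats.t.sf(t_stat, dof)

alpha = 0.05
print(f"t = {t_stat:.4f}, df = {dof}, one-sided p-value = {p_one_sided:.3g}")
print("Reject H0" if p_one_sided < alpha else "Fail to reject H0")
```

The statistic should agree with the 8.424 reported by the canned call above, while the one-sided p-value is half of its two-sided p-value.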

**Part E:** Do your results from Parts C and D agree with one another? If they agree, explain how they are in agreement in _words_. If they do not agree, explain why you think they do not agree.



**Solution:**