# **Mini Review Exercise - Homeruns**
## *DATA 3300*
## Name:

In this mini-exercise you've chosen to work on a sports dataset predicting the likelihood of a homerun. This dataset has been cleaned and contains data on 865 plays that were either homeruns or not. Using supervised data mining, answer the posed questions and construct a model to predict positive homerun status from play characteristics (the independent variables). The dataset contains the following variables:

* **Play_ID** = primary key
* **batter_team** = team of batter - BOS, NYC, OAK
* **bearing** = center, left, right
* **pitch_name** = type of pitch - 4-Seam Fastball, Changeup, Cutter, Curveball
* **inning** = game inning, ranging 1-13
* **balls** = ranges between 0-3
* **pitch_mph** = speed of pitch in mph
* **launch_speed** = speed of launch in mph
* **is_home_run** = whether or not hit was a homerun

In [None]:
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [None]:
df = pd.read_csv('/content/home_runs.csv')                      # reads in dataset
df.head()                                                       # previews the dataset

**Is the relationship between the IVs and the DV linear? Plot one numerical (quantitative) IV against the DV**

In [None]:
plt.scatter(df['Numerical IV'], df['is_home_run'])              # produces a scatterplot of a numerical IV against the DV
plt.xlabel("Numerical IV Label")                                # x-axis label
plt.ylabel("Homerun")                                           # y-axis label
plt.show()

**Next, let's set up our x and y objects. The x object should contain all IVs -- let's assume none of them are collinear. The y object should contain the DV.**

In [None]:
x = df.drop(['DV', 'primarykey'], axis = 1)                                         # assigns IVs to x object by dropping out non-IVs
x = pd.get_dummies(data = x, drop_first = True, dtype = int)                        # creates 0/1 dummy variables, dropping the first level as a referent group (dtype = int avoids boolean columns in newer pandas)
x.head()
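
The effect of `drop_first = True` can be easier to see on a tiny example. The sketch below uses a hypothetical toy DataFrame (the values are made up for illustration and are not taken from `home_runs.csv`); only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy data: illustrative values only, not rows from home_runs.csv
toy = pd.DataFrame({
    "batter_team": ["BOS", "NYC", "OAK", "BOS"],
    "pitch_mph": [92.1, 88.4, 95.0, 90.3],
})

# Categorical columns are expanded into 0/1 columns; drop_first=True removes
# the first level (here BOS, the first in sorted order), which becomes the referent group
toy_dummies = pd.get_dummies(data=toy, drop_first=True, dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Rows where every `batter_team_*` column equals 0 belong to the referent group, so the numeric column passes through unchanged and no separate column is needed for the dropped level.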

In [None]:
y = df['DV']                                                                        # assigns DV variable to y
y = pd.get_dummies(data = y, drop_first = True, dtype = int)                        # splits DV into 0/1 dummy variables, drops out one group

**First let's fit a linear regression and see what happens...**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=101)  # creates an 80-20 training and test split

In [None]:
model = LinearRegression()                                                                    # brings in the LinearRegression model

OLS = model.fit(x_train, y_train)                                                             # fits a linear regression to the training data

y_pred = model.predict(x_test)                                                                # makes predictions on the test set
y_pred[:20]                                                                                   # displays values of first 20 predictions in test set

**What values would you want the model to predict for a binary classification task? What range of values does it appear to be predicting?**

In [None]:
sns.regplot(x = y_pred, y = y_test)                                                           # produces a regression plot of the actual values of y against the predicted values
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

**What is the problem with these predictions made using a Linear Regression?**

**What model should we use instead if we want to classify a play as a homerun or not a homerun?**

**Now let's fit a logistic regression...**

In [None]:
x_train_Sm = sm.add_constant(x_train)                                                                     # adds an intercept to x_train
log_reg = sm.Logit(y_train, x_train_Sm).fit()                                                             # fits a logistic regression to training data
print(log_reg.summary())                                                                                  # produces a summary statistics table

**In what units are the regression coefficients above expressed?**

**Using your summary table find:**

1. **One quantitative (numerical) variable that is statistically significant ($p < 0.1$), and interpret its regression coefficient.**

2. **One qualitative (categorical) dummy variable that is statistically significant, and interpret its regression coefficient.**



1.   Quantitative: 

2.   Qualitative: 


**If you were to begin removing nonsignificant variables, how would you handle Batter Team? Would you leave in both dummy variables, drop both, or leave only one of them in?**

**Why?**

**What two calculations are necessary to convert the log-odds to probability? Provide the formulas below:**

* $odds = $
* $probability = $

**Why convert from log-odds to probability? What basic information can we get from log-odds?**

**Compare and contrast Linear and Logistic Regression. What do they have in common (e.g., common assumptions), and how are they different?**

* **Similarities:** 

* **Differences:** 