# Simple Regression Models with Target Variable of Strikeouts
Model 1 - Single Feature Regression Model  
Model 2 - Multi-Feature Regression Model

#### These models use the features selected during the Exploratory Data Analysis for their correlation with the strikeout target variable (`new_strikeout`).

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### Import Data

In [3]:
# Read in the CSV
df = pd.read_csv("Clean_2015_Pitching_Data.csv", sep=",")
# Keep only the target and the features selected in the EDA
strikeout = df[['new_strikeout', 'strikeout', 'k_percent', 'p_swinging_strike',
                'xba', 'z_swing_miss_percent', 'iz_contact_percent',
                'in_zone_swing_miss', 'whiff_percent']]
strikeout.head()



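|  | new_strikeout | strikeout | k_percent | p_swinging_strike | xba | z_swing_miss_percent | iz_contact_percent | in_zone_swing_miss | whiff_percent |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 151 | 207 | 23.4 | 388 | 0.215 | 18.3 | 80.8 | 218 | 25.4 |
| 1 | 119 | 138 | 19.5 | 288 | 0.225 | 17.5 | 82.1 | 161 | 24.2 |
| 2 | 186 | 167 | 22.4 | 283 | 0.227 | 19.6 | 80.1 | 195 | 23.6 |
| 3 | 184 | 149 | 19.8 | 231 | 0.275 | 12.8 | 86.8 | 114 | 21.2 |
| 4 | 107 | 198 | 25.0 | 351 | 0.245 | 15.8 | 83.9 | 154 | 29.7 |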


In [4]:
# Split the data into train and test sets (80/20)
train_set, test_set = train_test_split(strikeout,
                                       test_size=0.2, random_state=123)
print(len(train_set), len(test_set))

332 83


### Model 1 - Single Feature Regression Model

In [6]:
# Fit a linear model with a single feature
reg = LinearRegression()

X = train_set[['strikeout']]  # double brackets keep X two-dimensional
y = train_set['new_strikeout']
reg.fit(X, y)

# For LinearRegression, score() returns the coefficient of determination (R^2)
print("The bias is ", reg.intercept_)
print("The feature coefficients are ", reg.coef_)
print("The score for the training set is", reg.score(X, y))

# Check the performance on the test set
X_test = test_set[['strikeout']]
y_test = test_set['new_strikeout']
print("The score for the test set is", reg.score(X_test, y_test))

The bias is  54.09655152723151
The feature coefficients are  [0.66413516]
The score for the training set is 0.3833406201066476
The score for the test set is 0.25396812783562706
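
The fitted values printed above define a straight line: predicted `new_strikeout` = intercept + coefficient × `strikeout`. The sketch below works through that arithmetic by hand using the printed intercept and coefficient. The helper name `predict_new_strikeout` and the input of 200 strikeouts are illustrative assumptions, not part of the notebook.

```python
# Values copied from the Model 1 output above
INTERCEPT = 54.09655152723151
COEF_STRIKEOUT = 0.66413516


def predict_new_strikeout(strikeout):
    """Apply the fitted line by hand (hypothetical helper, not in the notebook)."""
    return INTERCEPT + COEF_STRIKEOUT * strikeout


# Hypothetical pitcher with 200 strikeouts
prediction = predict_new_strikeout(200)
print(round(prediction, 1))  # 186.9
```

Calling `reg.predict` on the same input should give the same value, up to the rounding of the printed coefficient.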


#### Single Feature Regression Model Performance
| Feature   | Training R² | Test R² |
| -----     | -----       | -----   |
| k_percent | 0.42 | 0.36 |
| strikeout | 0.38 | 0.25 |
| whiff_percent | 0.30 | 0.32 |
| p_swinging_strike | 0.31 | 0.25 |
| xba | 0.30 | 0.27 |
| z_swing_miss_percent | 0.27 | 0.29 |
| iz_contact_percent | 0.27 | 0.29 |
| in_zone_swing_miss | 0.30 | 0.23 |
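
The table above reports one fit per feature, while the notebook shows only the `strikeout` fit. The loop below is a sketch of how the remaining rows could be reproduced. The helper name `score_single_features` and its column labels are assumptions, and it expects the `train_set` and `test_set` frames from the split cell.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def score_single_features(train_set, test_set, features, target="new_strikeout"):
    """Fit one single-feature model per column and return rounded R^2 scores.

    Hypothetical helper: the notebook shows only one of these fits explicitly.
    """
    rows = []
    for feature in features:
        reg = LinearRegression()
        reg.fit(train_set[[feature]], train_set[target])
        rows.append({
            "Feature": feature,
            "Training R2": round(reg.score(train_set[[feature]], train_set[target]), 2),
            "Test R2": round(reg.score(test_set[[feature]], test_set[target]), 2),
        })
    return pd.DataFrame(rows)


# Example call, assuming the frames from the split cell are in scope:
# features = [c for c in train_set.columns if c != "new_strikeout"]
# score_single_features(train_set, test_set, features)
```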




### Model 2 - Multi-Feature Regression Model

In [5]:
# The order of this list determines the order of reg.coef_
features = ['k_percent', 'strikeout', 'whiff_percent', 'xba', 'p_swinging_strike']

# Fit on the five selected features with a fresh estimator so the cell runs on its own
reg = LinearRegression()
X = train_set[features]
y = train_set['new_strikeout']
reg.fit(X, y)

print("The bias is ", reg.intercept_)
print("The feature coefficients are ", reg.coef_)
print("The score for the training set is", reg.score(X, y))

# Check the performance on the test set
X_test = test_set[features]
y_test = test_set['new_strikeout']
print("The score for the test set is", reg.score(X_test, y_test))

The bias is  27.351810928793185
The feature coefficients are  [ 5.50801027  0.21505068 -1.71761361 10.48325417  0.04754484]
The score for the training set is 0.4496081950641221
The score for the test set is 0.34155856243446836
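
`reg.coef_` lists the coefficients in the same order as the feature columns passed to `fit`. The sketch below pairs the printed values with their feature names. The numbers are copied from the output above, and the 0.010 change in `xba` is an illustrative assumption. Because the features sit on different scales (in the sample rows `xba` is near 0.2 while `strikeout` is a count in the hundreds), the raw coefficient sizes are not directly comparable.

```python
# Feature order used when fitting Model 2
features = ['k_percent', 'strikeout', 'whiff_percent', 'xba', 'p_swinging_strike']
# Coefficients copied from the printed output
coefs = [5.50801027, 0.21505068, -1.71761361, 10.48325417, 0.04754484]

coef_by_feature = dict(zip(features, coefs))
for name, value in coef_by_feature.items():
    print(f"{name:>18}: {value: .4f}")

# Illustrative: effect of a 0.010 increase in xba, other features held fixed
xba_effect = coef_by_feature['xba'] * 0.010
print(round(xba_effect, 3))  # 0.105
```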


#### Multi-Feature Model Performance
| # Features | Features | Training R² | Test R² |
| -----     | -----      | -----   | ----- |
| 2 | strikeout, k_percent | 0.45 | 0.35 |
| 2 | k_percent, whiff_percent | 0.42 | 0.35 |
| 2 | k_percent, p_swinging_strike | 0.43 | 0.36 |
| 2 | k_percent, xba | 0.42 | 0.36 |
| 2 | strikeout, whiff_percent | 0.41 | 0.33 |
| 2 | whiff_percent, xba | 0.35 | 0.35 |
| 3 | k_percent, strikeout, xba | 0.45 | 0.35 |
| 4 | k_percent, strikeout, xba, whiff_percent | 0.45 | 0.34 |
| 5 | k_percent, strikeout, xba, whiff_percent, p_swinging_strike | 0.45 | 0.34 |
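
The table above covers a hand-picked set of feature combinations. One possible way to score every subset of a given size is sketched below with `itertools.combinations`. The helper name `score_feature_subsets` is an assumption, and it expects the `train_set` and `test_set` frames from the split cell.

```python
from itertools import combinations

from sklearn.linear_model import LinearRegression


def score_feature_subsets(train_set, test_set, features, size, target="new_strikeout"):
    """Fit one model per feature subset of the given size.

    Hypothetical helper, not part of the notebook.
    Returns a list of (subset, training R^2, test R^2), highest test R^2 first.
    """
    results = []
    for subset in combinations(features, size):
        cols = list(subset)
        reg = LinearRegression().fit(train_set[cols], train_set[target])
        results.append((
            subset,
            round(reg.score(train_set[cols], train_set[target]), 2),
            round(reg.score(test_set[cols], test_set[target]), 2),
        ))
    # Sort so the subsets with the highest test R^2 come first
    return sorted(results, key=lambda r: r[2], reverse=True)


# Example call, assuming the frames from the split cell are in scope:
# score_feature_subsets(train_set, test_set,
#                       ['k_percent', 'strikeout', 'whiff_percent', 'xba'], size=2)
```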