# Regression

In [139]:
# Read data from file
import pandas as pd
df=pd.read_csv('baseball.csv').dropna(subset=['OpponentOnBasePercentage', 'OpponentSluggingPercentage'])
df.tail()

| | Team | League | Year | RunsScored | RunsAllowed | Wins | OnBasePercentage | SluggingPercentage | BattingAverage | Playoffs | RankSeason | RankPlayoffs | GamesPlayed | OpponentOnBasePercentage | OpponentSluggingPercentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 415 | SFG | NL | 1999 | 872 | 831 | 86 | 0.356 | 0.434 | 0.271 | 0 | | | 162 | 0.345 | 0.423 |
| 416 | STL | NL | 1999 | 809 | 838 | 75 | 0.338 | 0.426 | 0.262 | 0 | | | 161 | 0.355 | 0.427 |
| 417 | TBD | AL | 1999 | 772 | 913 | 69 | 0.343 | 0.411 | 0.274 | 0 | | | 162 | 0.371 | 0.448 |
| 418 | TEX | AL | 1999 | 945 | 859 | 95 | 0.361 | 0.479 | 0.293 | 1 | 5.0 | 4.0 | 162 | 0.346 | 0.459 |
| 419 | TOR | AL | 1999 | 883 | 862 | 84 | 0.352 | 0.457 | 0.28 | 0 | | | 162 | 0.353 | 0.456 |


## Predict the win percentage of a team based on its statistics and its opponents' statistics

In [140]:
from sklearn.model_selection import train_test_split
X=df[['OnBasePercentage','SluggingPercentage','BattingAverage','OpponentOnBasePercentage', 'OpponentSluggingPercentage']]
y=df['Wins']/df['GamesPlayed']
X_train, X_test, y_train, y_test = train_test_split(X, y)
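
`train_test_split` shuffles the rows before splitting, so without a fixed seed every run of the cell above produces a different partition and slightly different scores. The sketch below shows how passing `random_state` makes the split repeatable. It uses synthetic stand-in arrays (`X_demo` and `y_demo` are hypothetical, not the baseball data), and the seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the baseball dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=100)

# The default test_size is 0.25, so 100 rows give 75 train rows and 25 test rows
split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)

# The same seed yields identical partitions on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(split_a[0].shape, split_a[1].shape, same_split)
```

With a fixed seed, score differences between cells reflect the models rather than the luck of the split.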

###### Using a simple Linear Regression

In [141]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

li = LinearRegression().fit(X_train, y_train)
print('Score on Train set:',li.score(X_train, y_train),'\nScore on Test set:',li.score(X_test,y_test))

Score on Train set: 0.8258602314346748 
Score on Test set: 0.8562981147113335


###### Using Linear Regression with Polynomial Features
Write a function that fits a polynomial LinearRegression model on the training data `X_train` for degrees 0 through 9. For each model, compute the $R^2$ (coefficient of determination) regression score on the training data as well as on the test data.
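
One possible way to package this task as a reusable function is sketched here. The function name `r2_by_degree`, its column names, and the synthetic demo data are all assumptions made for illustration; they are not part of the original notebook. A pipeline is used so that the polynomial expansion is fitted on the training data only and merely applied to the test data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def r2_by_degree(X_train, X_test, y_train, y_test, degrees=range(10)):
    """Fit one polynomial LinearRegression per degree and return train/test R^2."""
    rows = []
    for degree in degrees:
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        model.fit(X_train, y_train)  # the expansion is fitted on the training data only
        rows.append({'degree': degree,
                     'train_r2': model.score(X_train, y_train),
                     'test_r2': model.score(X_test, y_test)})
    return pd.DataFrame(rows).set_index('degree')

# Synthetic stand-in data (hypothetical, not the baseball dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(80, 1))
y_demo = 0.5 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.03, size=80)

scores = r2_by_degree(X_demo[:60], X_demo[60:], y_demo[:60], y_demo[60:])
print(scores)
```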

In [161]:
%precision 2
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
import numpy as np

r2_table=pd.DataFrame(index=range(10), columns=['R2 on Train Set','R2 on Test Set'])
r2_table.index.name='Degree of Polynomial Features'

for degree in range(10):
    poly = PolynomialFeatures(degree=degree)
    X_po_train = poly.fit_transform(X_train)
    li = LinearRegression().fit(X_po_train, y_train)
    # .loc avoids chained-indexing assignment; the test set is only transformed, never refitted
    r2_table.loc[degree,'R2 on Train Set']=li.score(X_po_train, y_train)
    r2_table.loc[degree,'R2 on Test Set']=li.score(poly.transform(X_test),y_test)
r2_table

# Label the lowest train R2 as underfitting, the highest train R2 as overfitting,
# and the highest test R2 as the best model
r2_table['Result']=''
under,over,best=r2_table['R2 on Train Set'].min(), r2_table['R2 on Train Set'].max(), r2_table['R2 on Test Set'].max()
for i in range(10):
    if r2_table.loc[i,'R2 on Train Set']==under: r2_table.loc[i,'Result']='Underfitting'
    if r2_table.loc[i,'R2 on Train Set']==over: r2_table.loc[i,'Result']='Overfitting'
    if r2_table.loc[i,'R2 on Test Set']==best: r2_table.loc[i,'Result']='BEST MODEL'
r2_table

| Degree of Polynomial Features | R2 on Train Set | R2 on Test Set | Result |
|---|---|---|---|
| 0 | 0.0 | -0.00403929 | Underfitting |
| 1 | 0.836949 | 0.830721 | BEST MODEL |
| 2 | 0.85066 | 0.820649 | |
| 3 | 0.861398 | 0.779821 | |
| 4 | 0.840274 | 0.384918 | |
| 5 | 0.952101 | -22.1989 | |
| 6 | 1.0 | -568.885 | |
| 7 | 1.0 | -619.712 | |
| 8 | 1.0 | -654.496 | Overfitting |
| 9 | 1.0 | -675.883 | |
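
The large negative test scores in the table follow from the definition $R^2 = 1 - SS_{res}/SS_{tot}$: a model that always predicts the mean of the evaluated targets scores 0, and any model whose squared error exceeds that baseline scores below 0, with no lower bound. This also suggests why the degree-0 test score is slightly negative: that model predicts the training mean, which is not exactly the test mean. The sketch below computes $R^2$ by hand on a few made-up values (the helper `r2_by_hand` and the numbers are hypothetical) and checks it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
y_true = np.array([0.40, 0.50, 0.60])
mean_pred = np.full(3, 0.50)             # always predict the mean of y_true
bad_pred = np.array([0.90, 0.10, 0.95])  # predictions far worse than the mean

print(r2_by_hand(y_true, mean_pred))  # the mean baseline sits at zero
print(r2_by_hand(y_true, bad_pred))   # worse than the baseline, hence negative
```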


###### Using Lasso Regression with regularized parameters

Training models on high-degree polynomial features can produce overly complex models that overfit, so we often use regularized versions of the model to constrain its complexity, as we saw with Ridge and Lasso linear regression.
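
As a rough illustration of that idea, the sketch below fits degree-12 polynomial features with and without an L1 penalty. Everything in it is an assumption made for the example: the data are synthetic, the helper `poly_model` is hypothetical, the values `alpha=0.001` and `max_iter=10000` are arbitrary, and the `StandardScaler` step is an addition that the notebook itself does not use (it puts the polynomial columns on a common scale so that the penalty treats them evenly).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical): a noisy linear trend in one feature
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = 0.5 + 0.1 * x[:, 0] + rng.normal(scale=0.03, size=60)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

def poly_model(estimator, degree=12):
    # include_bias=False because each estimator fits its own intercept
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        estimator,
    )

plain = poly_model(LinearRegression()).fit(x_train, y_train)
lasso = poly_model(Lasso(alpha=0.001, max_iter=10000)).fit(x_train, y_train)

# The L1 penalty can drive coefficients to exactly zero
n_nonzero = np.count_nonzero(lasso.named_steps['lasso'].coef_)
print('Test R2 (plain, lasso):', plain.score(x_test, y_test), lasso.score(x_test, y_test))
print('Non-zero Lasso coefficients:', n_nonzero, 'of 12')
```

The unregularized fit can never do worse than Lasso on the training data, since it minimizes training error over the same features; the comparison that matters is the one on the test set.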

For this part, train two models on polynomial features of degree 12: a non-regularized LinearRegression model (default parameters) and a regularized Lasso Regression model (with parameters `alpha=`, `max_iter=`). Return the test-set $R^2$ score of each model.

In [1581]:
X=df['OnBasePercentage']/df['OpponentOnBasePercentage']
y=df['Wins']/df['GamesPlayed']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [1586]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score

poly = PolynomialFeatures(degree=12)
X_po_train = poly.fit_transform(np.array(X_train).reshape(-1,1))
# Only transform the test set; the expansion is fitted on the training data
X_po_test = poly.transform(np.array(X_test).reshape(-1,1))
li = LinearRegression().fit(X_po_train, y_train)
lasso = Lasso(alpha=0.00001,max_iter=100).fit(X_po_train, y_train)

li.score(X_po_test,y_test), lasso.score(X_po_test,y_test)



(0.5304406012804604, 0.8108453250235986)