# 1. Tennis Surface Check

Use a linear regression and statsmodels to find which surface type predicts the most points for Federer in the `tennis.csv` dataset.

1. Give a one-paragraph interpretation of the coefficients, and the meaning of the p-value. 

2. Answer the following: should your regression include a constant term? Why or why not? How would it change the interpretation of your coefficient and p-value?

3. Do a t-test to check whether the largest coefficient is statistically significantly different from the second largest (hint: you can run a t-test using only the mean values, standard deviations, and sample sizes).

In [130]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

te = pd.read_csv('data/tennis.csv')
# Note: this replaces every missing value with 0, including in the target column
te = te.fillna(0.)

# One 0/1 indicator column per surface
surfaces = list(te['surface'].unique())
for surface in surfaces:
    te[surface] = (te['surface'] == surface).astype(int)

y = te['player1 total points won']

# Constant plus all six surface dummies, with HC2 robust standard errors
x = sm.add_constant(te[surfaces])
model = sm.OLS(y, x).fit(cov_type='HC2')
model.summary()



| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | player1 total points won | R-squared: | 0.062 |
| Model: | OLS | Adj. R-squared: | 0.058 |
| Method: | Least Squares | F-statistic: | 966.2 |
| Date: | Fri, 15 Jan 2021 | Prob (F-statistic): | 0.0 |
| Time: | 02:18:59 | Log-Likelihood: | -5955.6 |
| No. Observations: | 1179 | AIC: | 11920.0 |
| Df Residuals: | 1173 | BIC: | 11950.0 |
| Df Model: | 5 | | |
| Covariance Type: | HC2 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 53.6325 | 1.003 | 53.477 | 0.000 | 51.667 | 55.598 |
| Indoor: Carpet | 8.7886 | 4.246 | 2.070 | 0.038 | 0.467 | 17.111 |
| Outdoor: Hard | 24.2348 | 1.767 | 13.718 | 0.000 | 20.772 | 27.697 |
| Outdoor: Clay | 22.7812 | 2.425 | 9.396 | 0.000 | 18.029 | 27.533 |
| Indoor: Hard | 12.5357 | 2.027 | 6.184 | 0.000 | 8.563 | 16.509 |
| Indoor: Clay | -53.6325 | 1.003 | -53.477 | 0.000 | -55.598 | -51.667 |
| Outdoor: Grass | 38.9247 | 3.016 | 12.906 | 0.000 | 33.014 | 44.836 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 21.727 | Durbin-Watson: | 1.571 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 25.333 |
| Skew: | -0.261 | Prob(JB): | 3.15e-06 |
| Kurtosis: | 3.493 | Cond. No. | 1.71e+15 |


#### 1. Coeffs and p-values

It is worth noting that our $R^2$ is very low (0.062): surface alone explains only about 6% of the variance in points won, and the spread around each surface's average is large. However, this should not pose a problem, since tennis matches are never truly "predictable" per se.

##### Coeffs
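Two main things stand out: Indoor: Clay and Indoor: Carpet. Indoor: Clay is the only negative coefficient when predicting points, whereas Indoor: Carpet is the smallest positive one. Because each surface is a 0/1 dummy, a coefficient is not a trend over repeated matches. It is the offset of that surface's average points from the constant, so the predicted points on a surface are const + coefficient. A positive coefficient (such as Outdoor: Grass, 38.92) means Federer wins more points per match on that surface than the constant of 53.63, and a negative coefficient means fewer.

Two caveats apply. First, the Indoor: Clay coefficient is exactly the negative of the constant (-53.6325), so its predicted points are 0. With only 6 Indoor: Clay matches, this most likely reflects missing values replaced by `fillna(0.)` rather than real underperformance. Second, including the constant together with all six dummies makes the design matrix perfectly collinear (hence the condition number of about $1.7 \times 10^{15}$), so only const + coefficient and the differences between coefficients are well defined. By that measure, Outdoor: Grass predicts the most points, and Indoor: Carpet the fewest among the surfaces with usable data.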

##### P-value
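Each p-value is the probability of seeing a coefficient at least this far from zero if the true coefficient were zero. The Indoor: Carpet p-value (0.038) is the only one not reported as 0.000, and it is above our $\alpha = 0.01$. We therefore fail to reject the null hypothesis for Carpet courts, and at this level we should not treat them as a significant predictor.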
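The regression above includes a constant together with a dummy for every surface, so its columns are perfectly collinear. The following is a minimal sketch, on hypothetical toy data rather than `tennis.csv`, of what a pseudo-inverse solver returns in that situation (statsmodels' `OLS.fit` uses `method='pinv'` by default). The group labels and values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data (not from tennis.csv): three groups, two rows each
group = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 12.0, 20.0, 22.0, 0.0, 0.0])

dummies = np.eye(3)[group]                        # one 0/1 column per group
X = np.column_stack([np.ones(len(y)), dummies])   # constant + ALL dummies

# The four columns are collinear: the dummies sum to the constant column
rank = np.linalg.matrix_rank(X)

# Minimum-norm least-squares solution via the pseudo-inverse
beta = np.linalg.pinv(X, rcond=1e-10) @ y
const, coefs = beta[0], beta[1:]

group_means = np.array([y[group == g].mean() for g in range(3)])

print(rank)                           # 3, one less than the number of columns
print(const + coefs)                  # reproduces the group means
print(const, group_means.sum() / 4)   # constant = sum of means / (groups + 1)
```

Only `const + coefs` is pinned down by the data. The split between the constant and the offsets is a by-product of the solver, and a group whose mean is 0 ends up with a coefficient equal to the negative of the constant, the same pattern as the Indoor: Clay row in the table above.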

In [137]:
te.surface.value_counts()

Outdoor: Hard     482
Outdoor: Clay     249
Indoor: Hard      226
Outdoor: Grass    140
Indoor: Carpet     76
Indoor: Clay        6
Name: surface, dtype: int64

This is plausible given how few Indoor: Clay (6) and Indoor: Carpet (76) matches there are compared with the other surfaces. We'll exclude Indoor: Carpet, then keep dropping any surface whose p-value is above 0.01, to isolate the most significant surfaces.

In [138]:
def filter_down_surfaces(te, remove=None, add_constant=True):
    """Fit OLS on surface dummies, dropping surfaces until all p-values are <= 0.01."""
    remove = list(remove or [])  # avoid a shared mutable default argument
    tec = te[~te.surface.isin(remove)].copy()
    y = tec['player1 total points won']
    surfaces = list(tec.surface.unique())
    print(tec.surface.value_counts())
    # One 0/1 indicator column per remaining surface
    for surface in surfaces:
        tec[surface] = (tec['surface'] == surface).astype(int)
    x = tec[surfaces]
    if add_constant:
        x = sm.add_constant(x)
    model = sm.OLS(y, x).fit(cov_type='HC2')
    table = model.summary2().tables[1]
    # Surfaces (not the constant) whose p-value exceeds alpha = 0.01
    weak = [name for name in table[table['P>|z|'] > 0.01].index if name in surfaces]
    if weak:
        # Keep the earlier removals as well as the newly insignificant surfaces
        return filter_down_surfaces(te, remove + weak, add_constant=add_constant)
    return model

model2 = filter_down_surfaces(te, ['Indoor: Carpet'])
model2.summary()

Outdoor: Hard     482
Outdoor: Clay     249
Indoor: Hard      226
Outdoor: Grass    140
Indoor: Clay        6
Name: surface, dtype: int64




| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | player1 total points won | R-squared: | 0.058 |
| Model: | OLS | Adj. R-squared: | 0.055 |
| Method: | Least Squares | F-statistic: | 1167.0 |
| Date: | Fri, 15 Jan 2021 | Prob (F-statistic): | 0.0 |
| Time: | 02:22:47 | Log-Likelihood: | -5562.1 |
| No. Observations: | 1103 | AIC: | 11130.0 |
| Df Residuals: | 1098 | BIC: | 11160.0 |
| Df Model: | 4 | | |
| Covariance Type: | HC2 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 52.1677 | 0.841 | 62.040 | 0.000 | 50.520 | 53.816 |
| Outdoor: Hard | 25.6995 | 1.637 | 15.695 | 0.000 | 22.490 | 28.909 |
| Outdoor: Clay | 24.2460 | 2.292 | 10.577 | 0.000 | 19.753 | 28.739 |
| Indoor: Hard | 14.0004 | 1.898 | 7.376 | 0.000 | 10.280 | 17.721 |
| Indoor: Clay | -52.1677 | 0.841 | -62.040 | 0.000 | -53.816 | -50.520 |
| Outdoor: Grass | 40.3894 | 2.874 | 14.055 | 0.000 | 34.757 | 46.022 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 24.257 | Durbin-Watson: | 1.586 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 29.667 |
| Skew: | -0.277 | Prob(JB): | 3.61e-07 |
| Kurtosis: | 3.582 | Cond. No. | 1.07e+15 |


All the surfaces we still consider now have p-values reported as 0.000. However, after removing the Indoor: Carpet matches, our $R^2$ has gone down by 0.004 (from 0.062 to 0.058). The two models are fit on different samples (1179 vs. 1103 matches), so this small drop does not by itself show that the fit is less accurate.

Since the p-values are all reported as 0.000, below our $\alpha$ of 0.01, we can reject the null hypothesis for the Outdoor: Clay, Outdoor: Grass, Outdoor: Hard, Indoor: Hard and Indoor: Clay surfaces.

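We now see that the highest coefficient is for Outdoor: Grass, at 40.3894. This is about 2.9 times the Indoor: Hard coefficient (14.0004), but the coefficients are offsets from the constant, not point totals. In predicted points, Federer wins about 92.6 on Outdoor: Grass (52.1677 + 40.3894) versus about 66.2 on Indoor: Hard (52.1677 + 14.0004), i.e. roughly 1.4 times as many.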
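As a quick check on how to read these coefficients, the arithmetic can be sketched as follows. The numbers are copied from the coefficient table above; the variable names are only illustrative.

```python
# Values copied from the second regression table (constant included)
const = 52.1677
coef_grass = 40.3894
coef_indoor_hard = 14.0004

# Predicted points on a surface = constant + that surface's offset
pred_grass = const + coef_grass
pred_indoor_hard = const + coef_indoor_hard

# Ratio of offsets vs. ratio of predicted points: not the same thing
coef_ratio = coef_grass / coef_indoor_hard
points_ratio = pred_grass / pred_indoor_hard

print(round(pred_grass, 4), round(pred_indoor_hard, 4))
print(round(coef_ratio, 2), round(points_ratio, 2))
```

The ratio of coefficients compares offsets from the constant, while the ratio of predicted points compares what Federer is actually expected to win per match on each surface.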

#### 2. Constant Term
Let's proceed by refitting the model without the constant term.

In [142]:
model3 = filter_down_surfaces(te, [], add_constant=False)
model3.summary()

Outdoor: Hard     482
Outdoor: Clay     249
Indoor: Hard      226
Outdoor: Grass    140
Indoor: Carpet     76
Indoor: Clay        6
Name: surface, dtype: int64
Outdoor: Grass    92.557143
Outdoor: Hard     77.867220
dtype: float64


| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | player1 total points won | R-squared: | 0.062 |
| Model: | OLS | Adj. R-squared: | 0.058 |
| Method: | Least Squares | F-statistic: | |
| Date: | Fri, 15 Jan 2021 | Prob (F-statistic): | |
| Time: | 02:36:50 | Log-Likelihood: | -5955.6 |
| No. Observations: | 1179 | AIC: | 11920.0 |
| Df Residuals: | 1173 | BIC: | 11950.0 |
| Df Model: | 5 | | |
| Covariance Type: | HC2 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Indoor: Carpet | 62.4211 | 4.882 | 12.787 | 0.000 | 52.853 | 71.989 |
| Outdoor: Hard | 77.8672 | 1.721 | 45.249 | 0.000 | 74.494 | 81.240 |
| Outdoor: Clay | 76.4137 | 2.612 | 29.256 | 0.000 | 71.294 | 81.533 |
| Indoor: Hard | 66.1681 | 2.084 | 31.746 | 0.000 | 62.083 | 70.253 |
| Indoor: Clay | 0 | 0 | | | 0 | 0 |
| Outdoor: Grass | 92.5571 | 3.365 | 27.502 | 0.000 | 85.961 | 99.153 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 21.727 | Durbin-Watson: | 1.571 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 25.333 |
| Skew: | -0.261 | Prob(JB): | 3.15e-06 |
| Kurtosis: | 3.493 | Cond. No. | 8.96 |


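We can tell right away that omitting the constant term leaves the fit unchanged: $R^2$ (0.062) and the log-likelihood (-5955.6) are identical to the first model, which uses the same six surfaces. The 0.004 difference from the second model comes from adding the Indoor: Carpet matches back, not from the constant. This is expected since our predictor is the type of court surface, which is not a continuous variable but a qualitative one: the surface dummies sum to 1 for every match, so together they already play the role of a constant.

Additionally, dropping the constant does not change the differences between coefficients: Outdoor: Grass minus Outdoor: Hard is about 14.69 both in the first model (38.9247 - 24.2348) and here (92.5571 - 77.8672). What changes is the interpretation. Each coefficient is now the mean points won on that surface, rather than an offset from the constant, and each p-value now tests whether that mean is zero, which is trivially rejected, rather than whether the surface differs from a baseline.

Therefore, adding or omitting the constant term has no impact on prediction or fit. If we want coefficients that read as differences from a baseline surface, with p-values that test those differences, the regression should include a constant and leave out one surface dummy as the reference category.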
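The equivalence between the two codings can be sketched on a small example. The data below are hypothetical (not from `tennis.csv`), and plain least squares via NumPy stands in for the statsmodels fit.

```python
import numpy as np

# Hypothetical toy data: three surfaces, two matches each
surface = np.array([0, 0, 1, 1, 2, 2])
points = np.array([60.0, 64.0, 75.0, 81.0, 90.0, 96.0])

dummies = np.eye(3)[surface]  # one 0/1 column per surface

# (a) No constant, all dummies: each coefficient is that surface's mean
beta_no_const, *_ = np.linalg.lstsq(dummies, points, rcond=None)

# (b) Constant plus dummies for surfaces 1 and 2 (surface 0 is the reference)
X_ref = np.column_stack([np.ones(len(points)), dummies[:, 1:]])
beta_ref, *_ = np.linalg.lstsq(X_ref, points, rcond=None)

fitted_no_const = dummies @ beta_no_const
fitted_ref = X_ref @ beta_ref

print(beta_no_const)   # surface means
print(beta_ref)        # reference mean, then differences from the reference
print(np.allclose(fitted_no_const, fitted_ref))  # identical predictions
```

Both codings give the same fitted values. Only the meaning of each coefficient, and therefore of the hypothesis its p-value tests, differs.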

#### 3. T-Test

In [179]:
from scipy.stats import ttest_ind_from_stats

# The two surfaces with the largest coefficients in the no-constant model
top2 = model3.params.nlargest(2)
target = 'player1 total points won'

# Sample size, mean and variance of points won for each of the two surfaces
df = pd.DataFrame({'Court': list(top2.index)})
df['NSize'] = df.Court.apply(lambda x: (te.surface == x).sum())
df['CMean'] = df.Court.apply(lambda x: te.loc[te.surface == x, target].mean())
df['CVar'] = df.Court.apply(lambda x: te.loc[te.surface == x, target].var())

first = df.iloc[0]
second = df.iloc[1]
# Two-sample t-test from summary statistics (pooled variance by default)
_, pvalue = ttest_ind_from_stats(mean1=first.CMean, std1=np.sqrt(first.CVar), nobs1=first.NSize,
                     mean2=second.CMean, std2=np.sqrt(second.CVar), nobs2=second.NSize)
if pvalue <= 0.01:
    print("We reject the null hypothesis, the top two coeffs are statistically significantly different")
else:
    print("We fail to reject the null hypothesis, the top two coeffs are not statistically significantly different")
pvalue

We reject the null hypothesis, the top two coeffs are statistically significantly different


7.085417056631752e-05
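For reference, the statistic behind `ttest_ind_from_stats` can be sketched by hand from the summary statistics alone. The helper names `pooled_t` and `welch_t` and the numbers below are illustrative, not taken from the dataset. With its default `equal_var=True`, the SciPy function uses the pooled form; `equal_var=False` gives Welch's version, which may be the safer choice when group sizes and variances differ, as with 140 Grass and 482 Hard matches.

```python
import math

def pooled_t(mean1, std1, n1, mean2, std2, n2):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and approximate degrees of freedom (unequal variances)."""
    v1, v2 = std1**2 / n1, std2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (mean1 - mean2) / se, df

# Hypothetical summary statistics for two groups
t_pooled, df_pooled = pooled_t(10.0, 2.0, 5, 8.0, 2.0, 5)
t_welch, df_welch = welch_t(10.0, 2.0, 5, 8.0, 2.0, 5)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

With equal standard deviations and equal group sizes, as in this toy input, the two versions coincide. They diverge as the groups become more unbalanced.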

# 2. Titanic prediction contest

Use whatever tricks you can to best model whether a passenger would survive the Titanic disaster (using linear regression).

1. Use non-regularized regression to build the best model you can. Show 2 alternate model specifications and explain why you chose the one you did.

2. Interpret the coefficients in your model. Which attributes best relate to survival probability? How does this relate to socio-economic characteristics and "real-world" interpretation?

3. Use regularized regression to build a purely predictive model. Can you improve your accuracy? Plot the regularized model against the interpretable model predictions in a regression plot to make your case.

# 3. House Price prediction

Using the techniques you learned, use everything you can to build the best **interpretable** (e.g. non-regularized) regression model on the `house_price.csv` dataset. You also have `house_price_data_description.txt` to help -- full description of each column.

Here's a brief version of what you'll find in the data description file.

**SalePrice** - the property's sale price in dollars. **This is the target variable that you're trying to predict.**

Here are the features you can use (or engineer into new features!) for your `X` matrix:

    MSSubClass: The building class
    MSZoning: The general zoning classification
    LotFrontage: Linear feet of street connected to property
    LotArea: Lot size in square feet
    Street: Type of road access
    Alley: Type of alley access
    LotShape: General shape of property
    LandContour: Flatness of the property
    Utilities: Type of utilities available
    LotConfig: Lot configuration
    LandSlope: Slope of property
    Neighborhood: Physical locations within Ames city limits
    Condition1: Proximity to main road or railroad
    Condition2: Proximity to main road or railroad (if a second is present)
    BldgType: Type of dwelling
    HouseStyle: Style of dwelling
    OverallQual: Overall material and finish quality
    OverallCond: Overall condition rating
    YearBuilt: Original construction date
    YearRemodAdd: Remodel date
    RoofStyle: Type of roof
    RoofMatl: Roof material
    Exterior1st: Exterior covering on house
    Exterior2nd: Exterior covering on house (if more than one material)
    MasVnrType: Masonry veneer type
    MasVnrArea: Masonry veneer area in square feet
    ExterQual: Exterior material quality
    ExterCond: Present condition of the material on the exterior
    Foundation: Type of foundation
    BsmtQual: Height of the basement
    BsmtCond: General condition of the basement
    BsmtExposure: Walkout or garden level basement walls
    BsmtFinType1: Quality of basement finished area
    BsmtFinSF1: Type 1 finished square feet
    BsmtFinType2: Quality of second finished area (if present)
    BsmtFinSF2: Type 2 finished square feet
    BsmtUnfSF: Unfinished square feet of basement area
    TotalBsmtSF: Total square feet of basement area
    Heating: Type of heating
    HeatingQC: Heating quality and condition
    CentralAir: Central air conditioning
    Electrical: Electrical system
    1stFlrSF: First Floor square feet
    2ndFlrSF: Second floor square feet
    LowQualFinSF: Low quality finished square feet (all floors)
    GrLivArea: Above grade (ground) living area square feet
    BsmtFullBath: Basement full bathrooms
    BsmtHalfBath: Basement half bathrooms
    FullBath: Full bathrooms above grade
    HalfBath: Half baths above grade
    Bedroom: Number of bedrooms above basement level
    Kitchen: Number of kitchens
    KitchenQual: Kitchen quality
    TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
    Functional: Home functionality rating
    Fireplaces: Number of fireplaces
    FireplaceQu: Fireplace quality
    GarageType: Garage location
    GarageYrBlt: Year garage was built
    GarageFinish: Interior finish of the garage
    GarageCars: Size of garage in car capacity
    GarageArea: Size of garage in square feet
    GarageQual: Garage quality
    GarageCond: Garage condition
    PavedDrive: Paved driveway
    WoodDeckSF: Wood deck area in square feet
    OpenPorchSF: Open porch area in square feet
    EnclosedPorch: Enclosed porch area in square feet
    3SsnPorch: Three season porch area in square feet
    ScreenPorch: Screen porch area in square feet
    PoolArea: Pool area in square feet
    PoolQC: Pool quality
    Fence: Fence quality
    MiscFeature: Miscellaneous feature not covered in other categories
    MiscVal: $Value of miscellaneous feature
    MoSold: Month Sold
    YrSold: Year Sold
    SaleType: Type of sale
    SaleCondition: Condition of sale
