Topics covered:
- LinearRegression
- PolynomialFeatures
- Ridge
- MinMaxScaler
- Lasso
- SVR
- GridSearchCV
- Regularization
- DecisionTreeRegressor
- Tips for handling overfitting and underfitting

When the model overfits (high variance), it fits the training set well but is not a good predictor on new data:
- get more training examples
- reduce the number of features (but don't use PCA!)
- adding features is likely to make things worse
- increase the regularization parameter (called alpha in e.g. Lasso and Ridge)
- if using SVMs (SVR or SVC), decrease C (because C is the inverse of the regularization strength: C = 1/(regularization parameter))

When the model underfits (high bias), it is too simple to capture the pattern in the data:
- get more features
- add polynomial terms
- getting more training examples is not going to help
- decrease the regularization parameter (called alpha in e.g. Lasso and Ridge)
- if using SVMs (SVR or SVC), increase C (because C is the inverse of the regularization strength: C = 1/(regularization parameter))
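
The "add polynomial terms" tip can be illustrated with a minimal sketch. It uses synthetic data (a noisy parabola invented for this example, not the crime dataset), and the helper name train_test_r2 is hypothetical. A straight line underfits this data, so both scores are low; adding degree-2 terms should lift both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# synthetic data (assumption for illustration): y = x^2 + noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_test_r2(degree):
    """Fit a polynomial regression of the given degree, return (train R2, test R2)."""
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    model = LinearRegression().fit(X_train_poly, y_train)
    return model.score(X_train_poly, y_train), model.score(X_test_poly, y_test)

underfit_scores = train_test_r2(1)   # straight line: too simple for a parabola
better_scores = train_test_r2(2)     # polynomial terms added

print('degree 1 (train, test):', underfit_scores)
print('degree 2 (train, test):', better_scores)
```

Low scores on both the training and the test set point to underfitting; a high training score with a much lower test score points to overfitting.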

What is regularization used for?
- In lay terms, we use regularization when we have many features but want to reduce the magnitude of each feature's influence on the final model.
- The alternative is to drop some of those features altogether.
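
As a rough illustration of "reducing the magnitude of their influence", the sketch below fits Ridge with increasing alpha and prints the size of the coefficient vector. The data comes from make_regression (synthetic, chosen only for this example), and the alpha values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# synthetic regression problem (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

alphas = [0.1, 10.0, 1000.0]
coef_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Euclidean length of the coefficient vector: smaller means weaker feature influence
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print('alpha = {:>7.1f}  ||coef|| = {:.2f}'.format(alpha, coef_norms[-1]))
```

The coefficients shrink toward zero as alpha grows, but with Ridge they do not become exactly zero, so every feature stays in the model.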


The examples use the Communities and Crime (unnormalized) dataset, found here:
https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized


Below are the columns of the dataset:

predictive_columns = ['population' 'householdsize' 'racepctblack' 'racePctWhite' 'racePctAsian'
 'racePctHisp' 'agePct12t21' 'agePct12t29' 'agePct16t24' 'agePct65up'
 'numbUrban' 'pctUrban' 'medIncome' 'pctWWage' 'pctWFarmSelf' 'pctWInvInc'
 'pctWSocSec' 'pctWPubAsst' 'pctWRetire' 'medFamInc' 'perCapInc'
 'whitePerCap' 'blackPerCap' 'indianPerCap' 'AsianPerCap' 'OtherPerCap'
 'HispPerCap' 'NumUnderPov' 'PctPopUnderPov' 'PctLess9thGrade'
 'PctNotHSGrad' 'PctBSorMore' 'PctUnemployed' 'PctEmploy' 'PctEmplManu'
 'PctEmplProfServ' 'PctOccupManu' 'PctOccupMgmtProf' 'MalePctDivorce'
 'MalePctNevMarr' 'FemalePctDiv' 'TotalPctDiv' 'PersPerFam' 'PctFam2Par'
 'PctKids2Par' 'PctYoungKids2Par' 'PctTeen2Par' 'PctWorkMomYoungKids'
 'PctWorkMom' 'NumKidsBornNeverMar' 'PctKidsBornNeverMar' 'NumImmig'
 'PctImmigRecent' 'PctImmigRec5' 'PctImmigRec8' 'PctImmigRec10'
 'PctRecentImmig' 'PctRecImmig5' 'PctRecImmig8' 'PctRecImmig10'
 'PctSpeakEnglOnly' 'PctNotSpeakEnglWell' 'PctLargHouseFam'
 'PctLargHouseOccup' 'PersPerOccupHous' 'PersPerOwnOccHous'
 'PersPerRentOccHous' 'PctPersOwnOccup' 'PctPersDenseHous' 'PctHousLess3BR'
 'MedNumBR' 'HousVacant' 'PctHousOccup' 'PctHousOwnOcc' 'PctVacantBoarded'
 'PctVacMore6Mos' 'MedYrHousBuilt' 'PctHousNoPhone' 'PctWOFullPlumb'
 'OwnOccLowQuart' 'OwnOccMedVal' 'OwnOccHiQuart' 'OwnOccQrange' 'RentLowQ'
 'RentMedian' 'RentHighQ' 'RentQrange' 'MedRent' 'MedRentPctHousInc'
 'MedOwnCostPctInc' 'MedOwnCostPctIncNoMtg' 'NumInShelters' 'NumStreet'
 'PctForeignBorn' 'PctBornSameState' 'PctSameHouse85' 'PctSameCity85'
 'PctSameState85' 'LemasSwornFT' 'LemasSwFTPerPop' 'LemasSwFTFieldOps'
 'LemasSwFTFieldPerPop' 'LemasTotalReq' 'LemasTotReqPerPop'
 'PolicReqPerOffic' 'PolicPerPop' 'RacialMatchCommPol' 'PctPolicWhite'
 'PctPolicBlack' 'PctPolicHisp' 'PctPolicAsian' 'PctPolicMinor'
 'OfficAssgnDrugUnits' 'NumKindsDrugsSeiz' 'PolicAveOTWorked' 'LandArea'
 'PopDens' 'PctUsePubTrans' 'PolicCars' 'PolicOperBudg'
 'LemasPctPolicOnPatr' 'LemasGangUnitDeploy' 'LemasPctOfficDrugUn'
 'PolicBudgPerPop']
 
 
target_columns = ['murders' 'murdPerPop' 'rapes' 'rapesPerPop' 'robberies'
 'robbbPerPop' 'assaults' 'assaultPerPop' 'burglaries' 'burglPerPop'
 'larcenies' 'larcPerPop' 'autoTheft' 'autoTheftPerPop' 'arsons'
 'arsonsPerPop' 'ViolentCrimesPerPop' 'nonViolPerPop']

In [1]:
def get_crime_dataset():
    import pandas as pd
    # Communities and Crime dataset for regression
    # source:
    # https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

    df = pd.read_table('CommViolPredUnnormalizedData.txt', sep=',', na_values='?')
    
    # drop the first five columns (city, state, etc.), then drop every row that has a missing value
    crime = df.drop(df.columns[[0,1,2,3,4]],axis=1).dropna()
    
    # select the first n columns by position
#     X_crime = crime.iloc[:,range(0,10)]
    
    # all predictive columns
#     X_crime = crime.iloc[:,range(0,124)]

    # select columns from a list
    X_crime = crime[['PctPopUnderPov','racepctblack','racePctWhite','racePctAsian',
 'racePctHisp','population','medIncome','PctKidsBornNeverMar']]
    
    # select just one column, will need to do a reshape
#     X_crime = crime['perCapInc'].values.reshape(-1,1)
    
    #your exercise here
#     X_crime = crime[[]]

    # select any one column from the target columns
    y_crime = crime['burglPerPop']

    return (X_crime,y_crime)

In [2]:
def printDataSet():
    (X_crime,y_crime) = get_crime_dataset()
    print(X_crime.head())
    print(y_crime.head())
    
printDataSet()

    PctPopUnderPov  racepctblack  racePctWhite  racePctAsian  racePctHisp  \
9            28.68         23.14         67.60          0.92        16.35   
13           27.71         53.52         45.65          0.49         0.43   
17           14.37          1.30         74.02         14.14        20.96   
19            8.21          8.41         82.64          3.92         8.91   
21           19.29         28.71         52.26          7.00        24.36   

    population  medIncome  PctKidsBornNeverMar  
9       103590      17852                 4.71  
13       57140      19143                 8.51  
17      180038      34372                 2.62  
19      261721      35048                 1.89  
21     7322564      29823                10.50  
9     2221.81
13    2987.92
17     889.74
19    1360.48
21    1355.37
Name: burglPerPop, dtype: float64


In [3]:
def returnColumnNames():
    (X_crime,y_crime) = get_crime_dataset()
    print('X: ', X_crime.columns.values, '\ny: ',y_crime.name)
    
returnColumnNames()
    

X:  ['PctPopUnderPov' 'racepctblack' 'racePctWhite' 'racePctAsian'
 'racePctHisp' 'population' 'medIncome' 'PctKidsBornNeverMar'] 
y:  burglPerPop


In [4]:
#let's get a baseline with a dummy regressor

def ex0():
    import warnings
    warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.model_selection import train_test_split
    
    (X_crime,y_crime) = get_crime_dataset()
    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)
    dummy = DummyRegressor().fit(X_train, y_train)

    #dummy.score gives the R2 score
    return (dummy.score(X_train, y_train),dummy.score(X_test, y_test))

ex0()

(0.0, -0.00012124930948465007)
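
The baseline scores above follow from how DummyRegressor works: with its default strategy it always predicts the mean of the training targets. A minimal sketch on synthetic data (random numbers invented for this example) shows why the training R2 is 0 and the test R2 is at or slightly below 0.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# synthetic data (assumption for illustration); the features are ignored by the dummy model
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = rng.normal(loc=50.0, scale=10.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dummy = DummyRegressor().fit(X_train, y_train)   # default strategy is 'mean'

predictions = dummy.predict(X_test)
train_r2 = dummy.score(X_train, y_train)
test_r2 = dummy.score(X_test, y_test)

print('every prediction equals the training mean:',
      np.allclose(predictions, y_train.mean()))
print('train R2:', train_r2, ' test R2:', test_r2)
```

Any real model should beat these numbers; an R2 near 0 means the model does no better than predicting the average.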

In [5]:
#perform linear regression

def ex1():
    import warnings
    warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    
    (X_crime,y_crime) = get_crime_dataset()
    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)

    #linreg.score gives the R2 score
    return (linreg.score(X_train, y_train),linreg.score(X_test, y_test))

ex1()

(0.52839388693877787, 0.46117508270956642)

In [19]:
# perform polynomial regression of degree 2, which in theory should give us a better score;
# in practice the training score went up but the test score went down: we are now overfitting
def ex1a():
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split
    
    (X_crime,y_crime) = get_crime_dataset()
    
    X_poly = PolynomialFeatures(degree=2).fit_transform(X_crime)
    
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y_crime,
                                                   random_state = 0)
    linreg = LinearRegression().fit(X_train, y_train)

    return (linreg.score(X_train, y_train),linreg.score(X_test, y_test))

ex1a()

(0.64402021775553608, 0.35095535624578356)
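
One reason the polynomial model overfits is the number of columns it creates. The sketch below counts the columns PolynomialFeatures produces for 8 input features (the same number of features as in this notebook, but with random values, since only the shape matters here) and compares the count with the closed-form value C(n + d, d).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 8 random input features (assumption for illustration; only the shape matters)
X = np.random.RandomState(0).rand(5, 8)

n_columns = {}
for degree in (2, 3, 4):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    n_columns[degree] = X_poly.shape[1]
    # bias column + all monomials up to `degree` in 8 variables
    print('degree {}: {} columns (formula: {})'.format(
        degree, X_poly.shape[1], comb(8 + degree, degree)))
```

With 8 features, degree 2 gives 45 columns, degree 3 gives 165 and degree 4 gives 495, so the model gains many more coefficients to fit while the number of training rows stays the same.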

In [20]:
# to avoid overfitting with polynomial regression of higher degrees,
# we put a penalty on large coefficients.
# We use ridge regression, which is a type of regularized linear regression
# that uses an alpha parameter (sometimes called lambda)
# to penalize large coefficients theta:
def ex2():
    
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split
    
    (X_crime,y_crime) = get_crime_dataset()
    
    X_poly = PolynomialFeatures(degree=2).fit_transform(X_crime)
    
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y_crime,
                                                   random_state = 0)
    linridge = Ridge(alpha=20.0).fit(X_train, y_train)

    return (linridge.score(X_train, y_train),linridge.score(X_test, y_test))

ex2()

(0.62850359425733404, 0.45921137323878652)

In [22]:
# when feature ranges differ widely, e.g. when predicting the price of a house
# the square footage is in the thousands
# and the number of bedrooms is in the single digits, it's best to rescale the features
# to values between 0 and 1 (or -1 and 1). MinMaxScaler works on most
# occasions, but it is sensitive to outliers, so check for them first
def ex2a():
    
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split
    
    scaler = MinMaxScaler()
    (X_crime,y_crime) = get_crime_dataset()
    X_poly = PolynomialFeatures(degree=3).fit_transform(X_crime)
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y_crime,
                                                   random_state = 0)
    # both training set and testing set need to be scaled; the scaler is fit on the training set only
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

    return (linridge.score(X_train_scaled, y_train),linridge.score(X_test_scaled, y_test))

ex2a()

(0.53526522806995325, 0.48953144501529067)
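
The fit_transform / transform split used above can be seen on a tiny hand-made example. The numbers below are invented (square footage and bedrooms, following the house-price comment in the cell above): the scaler learns the minimum and maximum from the training rows only and reuses them for the test row.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical houses: [square footage, number of bedrooms]
X_train = np.array([[1000.0, 1.0],
                    [2000.0, 3.0],
                    [3000.0, 5.0]])
X_test = np.array([[4000.0, 2.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max

print(X_train_scaled)   # each column now spans 0 to 1
print(X_test_scaled)    # [[1.5  0.25]]
```

The test value 1.5 falls outside the 0 to 1 range because 4000 is larger than anything seen in training. This is expected, and it also shows why a single outlier in the training set can squeeze all other values into a narrow band.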

In [23]:
# use min max scaler and ridge regressor with alpha values in 
# [0.1, 1, 10, 20, 50, 100, 1000] with a polynomial of degree 3
# notice that increasing alpha reduces overfitting, but a very large alpha (e.g. 1000) underfits
def ex3():

    import warnings
    warnings.filterwarnings(action="ignore", module="scipy")
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Ridge
    import numpy as np
    
    scaler = MinMaxScaler()
    (X_crime,y_crime) = get_crime_dataset()
    X_poly = PolynomialFeatures(degree=3).fit_transform(X_crime)
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y_crime,
                                                       random_state = 0)

    # both training set and testing set need to be scaled
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for this_alpha in [0.1, 1, 10, 20, 50, 100, 1000]:
        linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
        r2_train = linridge.score(X_train_scaled, y_train)
        r2_test = linridge.score(X_test_scaled, y_test)
        print('Alpha = {:.2f}\nr-squared training: {:.2f}, r-squared test: {:.2f}'
             .format(this_alpha, r2_train, r2_test))

ex3()

Alpha = 0.10
r-squared training: 0.64, r-squared test: 0.45
Alpha = 1.00
r-squared training: 0.60, r-squared test: 0.53
Alpha = 10.00
r-squared training: 0.55, r-squared test: 0.50
Alpha = 20.00
r-squared training: 0.54, r-squared test: 0.49
Alpha = 50.00
r-squared training: 0.51, r-squared test: 0.48
Alpha = 100.00
r-squared training: 0.47, r-squared test: 0.45
Alpha = 1000.00
r-squared training: 0.18, r-squared test: 0.17


In [14]:
# Lasso Regression
# another way of doing regularization is Lasso regression, which also penalizes
# the coefficients when fitting the model
# when to use Lasso vs Ridge?
# find which features have the most effect (for example with DecisionTreeRegressor, explained below)
# When you have many small/medium sized effects: use Ridge
# When you have a few variables with medium/large effects: use Lasso
#
# do a Lasso regression for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50, 100]
# with max_iter = 10000 and a polynomial of degree 4
# notice that increasing alpha decreases overfitting
def ex4():

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Lasso
    import numpy as np
    
    scaler = MinMaxScaler()
    (X_crime,y_crime) = get_crime_dataset()
    X_poly = PolynomialFeatures(degree=4).fit_transform(X_crime)
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y_crime,
                                                       random_state = 0)

    # both training set and testing set need to be scaled
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for alpha in [0.5, 1, 2, 3, 5,10,20,50,100]:
        linlasso = Lasso(alpha = alpha, max_iter = 10000).fit(X_train_scaled, y_train)
        r2_train = linlasso.score(X_train_scaled, y_train)
        r2_test = linlasso.score(X_test_scaled, y_test)
        print('Alpha = {:.2f}\nr-squared training: {:.2f}, r-squared test: {:.2f}'
             .format(alpha, r2_train, r2_test))

ex4()

Alpha = 0.50
r-squared training: 0.65, r-squared test: 0.42
Alpha = 1.00
r-squared training: 0.62, r-squared test: 0.49
Alpha = 2.00
r-squared training: 0.60, r-squared test: 0.50
Alpha = 3.00
r-squared training: 0.58, r-squared test: 0.50
Alpha = 5.00
r-squared training: 0.54, r-squared test: 0.46
Alpha = 10.00
r-squared training: 0.52, r-squared test: 0.46
Alpha = 20.00
r-squared training: 0.50, r-squared test: 0.45
Alpha = 50.00
r-squared training: 0.40, r-squared test: 0.36
Alpha = 100.00
r-squared training: 0.11, r-squared test: 0.09
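
What sets Lasso apart from Ridge is that it can set coefficients to exactly zero, which is why it suits the "few variables with medium/large effects" case mentioned above. The sketch below counts the non-zero coefficients for a few alpha values on synthetic data (make_regression with 3 informative features out of 20; both the data and the alpha values are assumptions for this example).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data (assumption for illustration): only 3 of the 20 features affect y
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=10.0, random_state=0)

alphas = [0.1, 1.0, 10.0, 1000.0]
nonzero = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # number of features that Lasso keeps in the model
    nonzero.append(int(np.sum(lasso.coef_ != 0)))
    print('alpha = {:>7.1f}  features kept: {}'.format(alpha, nonzero[-1]))
```

As alpha grows, fewer features survive, and with a very large alpha the model keeps few or none, which matches the collapse in the scores above at alpha = 100.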


In [15]:
# Support Vector Machines: with a non-linear kernel, these algorithms implicitly map the data
# to a higher-dimensional space before fitting
# C is the penalty parameter
# decreasing C increases regularization
# let's start with a linear kernel and C=10
def ex5():

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    import numpy as np
    
    scaler = MinMaxScaler()
    (X_crime,y_crime) = get_crime_dataset()
    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                       random_state = 0)

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf = SVR(C=10, kernel='linear').fit(X_train_scaled, y_train)
    r2_train = clf.score(X_train_scaled, y_train)
    r2_test = clf.score(X_test_scaled, y_test)
    print(r2_train,r2_test)

ex5()

0.228880840814 0.223396935623


In [26]:
# The prior example was underfitting.
# To improve an underfitting SVR we increase C (decrease regularization).
# We are going to use GridSearchCV to run SVR with different parameters;
# notice that the best result has the highest value of C in the grid
def ex6():

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR
    import numpy as np
    
    scaler = MinMaxScaler()
    (X_crime,y_crime) = get_crime_dataset()
    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                       random_state = 0)

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # we use GridSearchCV on the train data to find the best parameters
    svr = GridSearchCV(SVR(gamma=0.1), cv=5,
                   param_grid={"C": [1e0, 1e1, 1e2, 1e3, 1e4], 'kernel':['rbf','linear','poly']})
    
    clf = svr.fit(X_train_scaled, y_train)
    print(clf.best_params_)
    r2_train = clf.score(X_train_scaled, y_train)
    r2_test = clf.score(X_test_scaled, y_test)
    print(r2_train,r2_test)

ex6()

{'C': 10000.0, 'kernel': 'rbf'}
0.53540402058 0.500005846702
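
In the cell above the scaler is fit on the whole training set before GridSearchCV splits it into folds, so each validation fold has already influenced the scaling. One common alternative, sketched here on synthetic data with a smaller grid (both are assumptions for this example, not the settings used above), is to put the scaler and the SVR in a Pipeline so the scaler is refit inside every fold.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# synthetic data (assumption for illustration)
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('svr', SVR(gamma=0.1))])

# parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'svr__C': [1, 10, 100],
              'svr__kernel': ['linear', 'rbf']}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)      # scaling happens inside each CV fold

print(search.best_params_)
print('mean CV R2 of best model:', search.best_score_)
print('test R2:', search.score(X_test, y_test))
```

The fitted search object scales the test data automatically when score or predict is called, so there is no separate X_test_scaled to keep track of.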


In [36]:
# we use DecisionTreeRegressor to identify the features that are most important
# change max_depth from 1 through 4 and see how the important features change
def ex7():
    
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split
    
    scaler = MinMaxScaler()
    depth = 2
    (X_crime,y_crime) = get_crime_dataset()
    X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                       random_state = 0)
    # both training set and testing set need to be scaled
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train_scaled, y_train)
    print(tree.feature_importances_)
    returnColumnNames()
    return (tree.score(X_train_scaled, y_train),tree.score(X_test_scaled, y_test))

ex7()

[ 0.23672563  0.          0.          0.          0.          0.
  0.05450671  0.70876767]
X:  ['PctPopUnderPov' 'racepctblack' 'racePctWhite' 'racePctAsian'
 'racePctHisp' 'population' 'medIncome' 'PctKidsBornNeverMar'] 
y:  burglPerPop


(0.51961245661659561, 0.38031607613654439)
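
The importances above are printed as a bare array, so they have to be matched to the column names by position. A small sketch of pairing them in a pandas Series and sorting follows; the DataFrame and its column names ('strong', 'weak', 'noise') are hypothetical and built only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# hypothetical data: y depends heavily on 'strong', barely on 'weak', not at all on 'noise'
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3), columns=['strong', 'weak', 'noise'])
y = 10 * X['strong'] + 0.1 * X['weak'] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# label each importance with its column name and sort from most to least important
ranked = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

The importances sum to 1, and a feature with importance 0 was never used for a split at this depth. That does not prove the feature is useless, only that the shallow tree did not need it.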

In [None]:
# The features used in this exercise to predict burglPerPop were selected to show some of the
# algorithms you can use to solve regression problems.
# They are not necessarily the most appropriate for an accurate prediction of this specific target.
# Now it's your turn:
# try to predict 'ViolentCrimesPerPop' or 'nonViolPerPop'
# with an R2 score of over 0.7 without overfitting
# what are the most important features? (use DecisionTreeRegressor)

def classExercise():
    
    return "solution"

classExercise()