Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

 - Vanilla logistic regression
 - Ridge logistic regression
 - Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that accepts either 'l1' or 'l2' as a value.
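
As a minimal sketch of that hint, the three models can differ only in their penalty settings. The example below uses synthetic data from `make_classification` rather than the assignment dataset, and the `C` values are arbitrary illustrations. It assumes a scikit-learn version in which `penalty` accepts 'l1' and 'l2'. In scikit-learn, `C` is the inverse of the regularization strength, the 'l1' penalty needs a solver that supports it (such as 'liblinear' or 'saga'), and a very large `C` stands in here for the unpenalized "vanilla" model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 15 features, only 5 of them informative.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Penalized models are sensitive to feature scale, so standardize first,
# fitting the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    # Very large C means almost no regularization ("vanilla").
    'vanilla': LogisticRegression(penalty='l2', C=1e9, solver='liblinear'),
    'ridge': LogisticRegression(penalty='l2', C=0.1, solver='liblinear'),
    'lasso': LogisticRegression(penalty='l1', C=0.1, solver='liblinear'),
}

summary = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    summary[name] = {
        'test_accuracy': model.score(X_test_s, y_test),
        # Lasso can drive coefficients exactly to zero; ridge only shrinks them.
        'nonzero_coefs': int(np.sum(model.coef_ != 0)),
    }
    print(name, summary[name])
```

Comparing `nonzero_coefs` across the three entries is one way to see the feature-selection effect of the lasso penalty alongside the test accuracy.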

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

Record your work and reflections in a notebook to discuss with your mentor.

In [3]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

#### Description of ‘A-level-geography’

The data contain the results of the A-level geography examination for 33,276 students from over 2,000 institutions in England in 1997. The data set is in ASCII format with 15 fields, each separated by a blank space. The fields are described in detail below.


| Variable | Coding | Description |
| --------- | ------ | ----------- |
| SCORE | 0, 2, 4, 6, 8, 10 | 0=fail, 2=grade E, 4=grade D, 6=grade C, 8=grade B, 10=grade A |
| BOARD | 1 – 7 | 1=Associate and WJB, 2=Cambridge, 3=London, 5=Oxford, 6=Joint Matriculation, 7=Oxford-Cambridge |
| GCSE-G-SCORE | 0, 2, 3, 4, 5, 6, 7, 8 | 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A* |
| GENDER | 0 or 1 | 0=Male, 1=Female |
| GTOT | 19 ~ 95 continuous | Total point score of all GCSE subjects |
| GNUM | 4 ~ 13 continuous | Total number of GCSEs taken |
| GCSE-MA-MAX | 0 – 8 | Maximum point score for GCSE math: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A* |
| GCSE-math-n | 1, 2, 3, 4 | Total number of GCSE math subjects taken |
| AGE | continuous | Age of student in months, centred at 222 months (18.5 years) |
| INST-GA-MN | continuous | Institution average of GCSE score, centred at its mean |
| INST-GA-SD | continuous | Institution standard deviation of GCSE score |
| INSTTYPE | 1 ~ 11 category | 1 = LEA Maintained Comprehensive, 2 = Maintained Selective, 3 = Maintained Modern, 4 = Grammar Comprehensive, 5 = Grammar Selective, 6 = Grammar Modern, 7 = Independent selective, 8 = Independent non-selective, 9 = Sixth Form College, 10 = Further Education College, 11 = Others |
| LEA | 1 ~ 131 | Local Education Authority identification |
| INSTITUTE | 1 ~ 98 | Institution identification within LEA |
| STUDENT | 25 ~ 196053 | Student identification |

In [15]:
df = pd.read_csv('data/geography.txt', sep=' ', header=None) 
df.columns = ['a_scre','boards', 'g_ge_s', 'gender', 'g_tl_s', 'g_tl_n','g_m_mx', 'g_m_tl','age_mh',
              'i_g_mn', 'i_g_sd','i_type', 'lea_id', 'ise_id', 'studnt']
df.tail()

Unnamed: 0,a_scre,boards,g_ge_s,gender,g_tl_s,g_tl_n,g_m_mx,g_m_tl,age_mh,i_g_mn,i_g_sd,i_type,lea_id,ise_id,studnt
33271,8.0,3,7,1,71,11.0,7,1,-3.0,-0.06,0.65,9.0,131.0,33.0,196035.0
33272,6.0,3,6,0,52,9.0,6,1,1.0,-0.06,0.65,9.0,131.0,33.0,196037.0
33273,4.0,3,5,1,46,9.0,5,1,-3.0,-0.06,0.65,9.0,131.0,33.0,196039.0
33274,8.0,3,5,1,52,9.0,5,1,5.0,-0.06,0.65,9.0,131.0,33.0,196047.0
33275,6.0,3,6,1,48,9.0,5,1,-3.0,-0.06,0.65,9.0,131.0,33.0,196053.0


In [16]:
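# Binary target: any A-level score above 0 (fail) counts as a pass.
df['passed'] = np.where(df.a_scre > 0, 1, 0)
# Average GCSE points per GCSE taken.
df['g_avg_'] = np.round(df.g_tl_s.div(df.g_tl_n), decimals=4)
# Make institution ids unique across LEAs: lea_id * 100 + id within LEA.
df.ise_id = df.ise_id.add(df.lea_id * 100).astype(int)
# Undo the centring at 222 months so age is in plain months.
df.age_mh = df.age_mh.add(222).astype(int)
df.a_scre = df.a_scre.astype(int)
df.i_type = df.i_type.astype(int)
# Drop ids and the raw columns already folded into derived features.
drops = ['studnt', 'lea_id', 'g_tl_s', 'g_tl_n', 'i_g_mn', 'i_g_sd']
df = df.drop(drops, axis=1)
df.tail()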

Unnamed: 0,a_scre,boards,g_ge_s,gender,g_m_mx,g_m_tl,age_mh,i_type,ise_id,passed,g_avg_
33271,8,3,7,1,7,1,219,9,13133,1,6.4545
33272,6,3,6,0,6,1,223,9,13133,1,5.7778
33273,4,3,5,1,5,1,219,9,13133,1,5.1111
33274,8,3,5,1,5,1,227,9,13133,1,5.7778
33275,6,3,6,1,5,1,219,9,13133,1,5.3333


#### Numeric Features:
 - age_mh : int   student's age in months
 - g_ge_s : int   student's score on the lower-level (GCSE) geography exam
 - g_m_mx : int   student's best score on any lower-level math exam
 - g_m_tl : int   number of lower-level math exams the student has taken
 - g_avg_ : float student's average score across all lower-level exams

#### Categorical Features:
 - i_type : ordinal integers {1:11}  type of institution attended
 - gender : integers {0, 1}  0 = male, 1 = female
 - boards : ordinal integers {1, 2, 3, 5, 6, 7, 8}  examination board

#### Targets
- passed : integers {0, 1}  1 = student passed, 0 = student failed
- a_scre : int {0:10}  score awarded for the A-level geography exam: 0, 2, 4, 6, 8, or 10

In [None]:

# Numeric Features:
# - age_mh: int   student's age in months
# - g_ge_s: int   student's score on the lower-level (GCSE) geography exam
# - g_m_mx: int   student's best score on any lower-level math exam
# - g_m_tl: int   number of lower-level math exams the student has taken
# - g_avg_: float student's average score across all lower-level exams

# Categorical Features:
# - i_type: ordinal integers {1:11}  type of institution attended
# - gender: integers {0, 1}  0 = male, 1 = female
# - boards: ordinal integers {1, 2, 3, 5, 6, 7, 8}  examination board

# Targets
# - passed: integers {0, 1}  1 = student passed, 0 = student failed
# - a_scre: int {0:10}  score awarded for the A-level geography exam: 0, 2, 4, 6, 8, or 10


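# Column names follow the feature lists in the comments above.
numeric_features = ['age_mh', 'g_ge_s', 'g_m_mx', 'g_m_tl', 'g_avg_']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['i_type', 'gender', 'boards']
# The categories are integer codes, so a string fill_value would be rejected;
# impute with the most frequent code instead.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])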

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [None]:
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
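
One way to choose between the ridge and lasso penalties, and to pick the regularization strength, is to hand a pipeline of this shape to `GridSearchCV`. The sketch below is illustrative only. The `toy` DataFrame is randomly generated and merely reuses a few of the column names above, its `passed` column is a made-up outcome, and the grid values for `C` are arbitrary. The 'liblinear' solver is chosen because it supports both penalties.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame reusing two numeric and two categorical column names.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    'g_avg_': rng.normal(5.5, 1.0, n),
    'g_m_mx': rng.integers(2, 9, n),
    'i_type': rng.integers(1, 12, n),
    'gender': rng.integers(0, 2, n),
})
# Made-up outcome loosely tied to g_avg_ so there is something to learn.
toy['passed'] = (toy['g_avg_'] + rng.normal(0, 1.0, n) > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns='passed'), toy['passed'],
    test_size=0.25, random_state=0, stratify=toy['passed'])

preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['g_avg_', 'g_m_mx']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['i_type', 'gender'])])

pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('classifier', LogisticRegression(solver='liblinear'))])

# The "classifier__" prefix routes each parameter to the classifier step.
param_grid = {
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print('best params:', search.best_params_)
print('cv accuracy: %.3f' % search.best_score_)
print('test accuracy: %.3f' % search.score(X_test, y_test))
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold, which keeps information from the validation fold out of the fit.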

In [7]:
# new columns 
df1['passed'] = np.where(df1.a_scre > 0, 1, 0)
df1['g_avg_'] = np.round(df1.g_tl_s.div(df1.g_tl_n), decimals = 4)
# transform columns
df1.ise_id =  df1.ise_id.add(df1.lea_id * 100)
df1.age_mh =  df1.age_mh.add(222)
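# dicts to put features in rank order by target ascending
# (`a` is the feature list from cell In [13]; `ordinal_by_target` is a helper defined outside this section)
obts = {x: ordinal_by_target(df1, 'a_scre', x) for x in a}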

In [8]:
# create dict of all cleaning / transforming parameters
# {'col1':{'lbl':'col1', 'fnn':'ABC', 'kwg': {'fte':'col1', tgt:'a_scre', 'arg':[sr1, sr2]}, 'col2':{}, }
cleans =      [{'fte': val,'fnn':str.upper(sub),'kws':{'arg':[obts[val]], 'srs':getattr(df1, val)}} if val in obts
          else {'fte': val,'fnn':str.upper(sub),'kws':{'arg':[],          'srs':getattr(df1, val)}}
          for sub in groups for val in eval(sub)]

In [9]:
pairs = [(dct['fte'],switch_clean(dct['fnn'], **dct['kws'])) for dct in cleans]
f,s = zip(*pairs)
F = list(f); S = list(s)

In [10]:
data = pd.DataFrame(S, index=F).T 
data['gender'] = df1.gender

In [11]:
data.describe()

Unnamed: 0,i_type,g_m_tl,boards,ise_id,g_m_mx,age_mh,g_ge_s,g_avg_,gender
count,33276.0,33276.0,33276.0,33276.0,33276.0,33276.0,33276.0,33276.0,33276.0
mean,0.162844,0.513603,0.259526,0.20166,0.785446,0.015266,1.184127,3.765035e-14,0.451136
std,1.144134,0.619077,1.088771,1.003181,0.770443,1.245786,0.491898,1.0,0.497614
min,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-3.449082,0.0
25%,-0.8,0.666667,-0.666667,-0.554404,0.5,-0.909091,1.0,-0.6284594,0.0
50%,0.0,0.666667,0.666667,0.207254,1.0,-0.181818,1.0,0.02637762,0.0
75%,1.2,0.666667,0.666667,1.043178,1.0,1.272727,1.5,0.7146999,1.0
max,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.595195,1.0


In [12]:
Y_scr = df1.a_scre
Y_psd = df1.passed
X = data

In [13]:
a = ['boards', 'g_m_tl', 'ise_id', 'i_type', 'g_ge_s', 'g_m_mx', 'age_mh','g_avg_', 'gender']
b = [ 'g_m_tl', 'g_ge_s', 'g_m_mx','g_avg_']

In [14]:
X = data[b]

In [None]:
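from sklearn.metrics import mean_absolute_error

# Signature, for reference (y_true and y_pred are not defined at this point):
# mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')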

Predict scr Accuracy: 0.43 (+/- 0.0001)

Predict psd Accuracy: 0.09 (+/- 0.0001)

In [15]:
#p1 = 'distance'
p1 = 'uniform'
p0 = 10

In [16]:
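from sklearn import neighbors

# One KNN regressor per target: the A-level score and the binary pass flag.
knn_scr = neighbors.KNeighborsRegressor(n_neighbors=p0, weights=p1)
knn_psd = neighbors.KNeighborsRegressor(n_neighbors=p0, weights=p1)
knn_scr.fit(X, Y_scr)
knn_psd.fit(X, Y_psd)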

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')

In [17]:
knn_scr.predict(X)

array([8.4, 6.2, 8.4, ..., 1.6, 5. , 4.8])

In [18]:
knn_psd.predict(X)

array([1. , 0.9, 1. , ..., 0.5, 1. , 1. ])

In [19]:
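from sklearn.model_selection import cross_val_score

# For regressors, cross_val_score returns R-squared per fold by default.
acc_scr = cross_val_score(knn_scr, X, Y_scr, cv=5)
acc_psd = cross_val_score(knn_psd, X, Y_psd, cv=5)
print(acc_scr)
print(acc_psd)
# The bracketed value is the variance of the fold scores (std squared).
print("R Squared for knn_scr: %0.2f (+/- %0.4f)" % (acc_scr.mean(), acc_scr.std() ** 2))
print("R Squared for knn_psd: %0.2f (+/- %0.4f)" % (acc_psd.mean(), acc_psd.std() ** 2))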


[0.42218633 0.39269003 0.41444808 0.41395987 0.41378414]
[0.10129468 0.08004322 0.11282106 0.05742085 0.0816556 ]
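
The scores for `passed` are low partly because R-squared is a regression metric, while `passed` is a binary label. As a hedged sketch on synthetic data, the comparison below fits a regressor and a classifier with the same neighbors setting on the same 0/1 target. Swapping in `KNeighborsClassifier` is an assumption made for illustration, not a step taken in this notebook. With no `scoring` argument, `cross_val_score` falls back to each estimator's own `score` method, which is R-squared for the regressor and accuracy for the classifier, so the two sets of numbers are on different scales.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic binary target with four features (stand-in data only).
X_demo, y_demo = make_classification(n_samples=500, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Regressor on a 0/1 target: default score is R-squared.
r2 = cross_val_score(KNeighborsRegressor(n_neighbors=10), X_demo, y_demo, cv=5)
# Classifier on the same target: default score is accuracy.
acc = cross_val_score(KNeighborsClassifier(n_neighbors=10), X_demo, y_demo, cv=5)

print("R-squared per fold:", r2.round(3))
print("Accuracy per fold: ", acc.round(3))
```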


In [None]:
# NOTE: `param` (pairs of n_neighbors and weights) and the `music` DataFrame
# are not defined anywhere in this notebook.
results = []
for n, w in param:
    knn1 = neighbors.KNeighborsRegressor(n_neighbors=n, weights=w)
    X = music[['loudness', 'duration']]
    Y = music.bpm
    knn1.fit(X, Y)
    score = cross_val_score(knn1, X, Y, cv=3)
    results.append({'n': n, 'weighted': w == 'distance',
                    'accuracy': np.round(score.mean(), decimals=2),
                    'std': np.round(score.std(), decimals=2)})
 

In [None]:
dfr = pd.DataFrame(results)

In [None]:
table = dfr.pivot(index='n', columns='weighted', values=['accuracy', 'std'])
table

In [None]:
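from sklearn import linear_model
from sklearn.cross_decomposition import PLSRegression

# NOTE: `y` is not defined in the cells above; it is assumed to hold the target (for example Y_scr).
regr = linear_model.LinearRegression()
regr.fit(X, y)

# Save predicted values.
Y_pred = regr.predict(X)
print('R-squared regression:', regr.score(X, y))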

# Fit a linear model using Partial Least Squares Regression.
# Reduce feature space to 3 dimensions.
pls1 = PLSRegression(n_components=3)

# Reduce X to R(X) and regress on y.
pls1.fit(X, y)

# Save predicted values.
Y_PLS_pred = pls1.predict(X)