Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be fit on a training set and evaluated on a test set (or on multiple test sets, if you take a folds approach). The models should be:

*Vanilla logistic regression*

*Ridge logistic regression*

*Lasso logistic regression*

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn `LogisticRegression` class has a `penalty` argument that accepts either 'l1' or 'l2' as a value.
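
One way to read the hint, as a minimal sketch on synthetic data rather than the challenge dataset: the three models differ only in the arguments passed to `LogisticRegression`. The data from `make_classification`, the value `C=0.05`, and the use of a very large `C` to stand in for an unpenalized fit are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 features, only 4 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    # A huge C makes the default l2 penalty negligible (approximately "vanilla").
    'vanilla': LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000),
    # Ridge: l2 penalty; a smaller C means stronger regularization.
    'ridge': LogisticRegression(penalty='l2', C=0.05, solver='lbfgs', max_iter=1000),
    # Lasso: the l1 penalty needs a solver that supports it, such as liblinear.
    'lasso': LogisticRegression(penalty='l1', C=0.05, solver='liblinear'),
}

demo_scores, demo_zeros = {}, {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = model.score(X_te, y_te)          # test-set accuracy
    demo_zeros[name] = int(np.sum(model.coef_ == 0))     # coefficients shrunk exactly to zero
    print(name, demo_scores[name], demo_zeros[name])
```

The l1 penalty tends to set some coefficients exactly to zero, which the count in the last column makes visible; the l2 penalty shrinks coefficients without usually zeroing them.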

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
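
Regularization parameter selection can be sketched with `GridSearchCV`, which searches over candidate values of `C` using cross-validation. The synthetic data, the grid of `C` values, and the 5-fold setting below are illustrative assumptions, not choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 15 features.
X_demo, y_demo = make_classification(n_samples=400, n_features=15, n_informative=5,
                                     random_state=1)

pipe = Pipeline(steps=[
    ('std', StandardScaler()),
    ('cfr', LogisticRegression(penalty='l1', solver='liblinear')),
])

# Hypothetical grid: C is the inverse of the regularization strength.
param_grid = {'cfr__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold never influences the scaling.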

Record your work and reflections in a notebook to discuss with your mentor.

In [1]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression

#### Description of ‘A-level-geography’

The data contain the results of the A-level geography examination for 33,276 students from over 2,000 institutions in England in 1997. The data set is in ASCII format with 15 fields, each separated by a blank space. The fields are described below.


| Variable | Coding | Description |
| --------- | ------ | ----------- |
| SCORE | 0, 2, 4, 6, 8, 10 | 0=fail, 2=grade E, 4=grade D, 6=grade C, 8=grade B, 10=grade A |
| BOARD | 1 – 7 | 1=Associate and WJB, 2=Cambridge, 3=London, 5=Oxford, 6=Joint Matriculation, 7=Oxford-Cambridge |
| GCSE-G-SCORE | 0, 2, 3, 4, 5, 6, 7, 8 | 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A* |
| GENDER | 0 or 1 | 0=Male, 1=Female |
| GTOT | 19 ~ 95, continuous | Total point score of all GCSE subjects |
| GNUM | 4 ~ 13, continuous | Total number of GCSEs taken |
| GCSE-MA-MAX | 0 – 8 | Maximum point score for GCSE math: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A* |
| GCSE-math-n | 1, 2, 3, 4 | Total number of GCSE math subjects taken |
| AGE | continuous | Age of student in months, centred at 222 months (18.5 years) |
| INST-GA-MN | continuous | Institution average of GCSE score, centred at its mean |
| INST-GA-SD | continuous | Institution standard deviation of GCSE score |
| INSTTYPE | 1 ~ 11, categorical | 1 = LEA Maintained Comprehensive, 2 = Maintained Selective, 3 = Maintained Modern, 4 = Grammar Comprehensive, 5 = Grammar Selective, 6 = Grammar Modern, 7 = Independent selective, 8 = Independent non-selective, 9 = Sixth Form College, 10 = Further Education College, 11 = Others |
| LEA | 1 ~ 131 | Local Education Authority identification |
| INSTITUTE | 1 ~ 98 | Institution identification within LEA |
| STUDENT | 25 ~ 196053 | Student identification |

In [2]:
# the data
df = pd.read_csv('data/geography.txt', sep=' ', header=None) 
df.columns = ['a_scre','boards', 'g_ge_s', 'gender', 'g_tl_s', 'g_tl_n','g_m_mx', 'g_m_tl','age_mh',
              'i_g_mn', 'i_g_sd','i_type', 'lea_id', 'ise_id', 'studnt']


In [3]:
# new columns and some transformations
df['passed'] = np.where(df.a_scre > 0, 1, 0)                    # binary target: any grade above fail
df['g_avg_'] = np.round(df.g_tl_s.div(df.g_tl_n), decimals=4)   # average GCSE points per subject
df.ise_id = df.ise_id.add(df.lea_id * 100).astype(int)          # institution id made unique across LEAs
df.age_mh = df.age_mh.add(222).astype(int)                      # undo the centring: age in months
df.a_scre = df.a_scre.astype(int)
df.i_type = df.i_type.astype(int)
drops = ['studnt', 'lea_id', 'g_tl_s', 'g_tl_n', 'i_g_mn', 'i_g_sd']
df = df.drop(drops, axis=1)
df.tail()

Unnamed: 0,a_scre,boards,g_ge_s,gender,g_m_mx,g_m_tl,age_mh,i_type,ise_id,passed,g_avg_
33271,8,3,7,1,7,1,219,9,13133,1,6.4545
33272,6,3,6,0,6,1,223,9,13133,1,5.7778
33273,4,3,5,1,5,1,219,9,13133,1,5.1111
33274,8,3,5,1,5,1,227,9,13133,1,5.7778
33275,6,3,6,1,5,1,219,9,13133,1,5.3333
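
The `ise_id` transformation builds one identifier per institution from two fields. Because INSTITUTE only runs from 1 to 98 within an LEA, multiplying the LEA code by 100 and adding the institution code cannot produce collisions. A small worked example, using made-up rows apart from the LEA 131 / institution 33 pair implied by the `13133` values in the output above (the column name `ise_uid` is hypothetical):

```python
import pandas as pd

# Toy rows (hypothetical except for the last pair): LEA code and institution code within the LEA.
toy = pd.DataFrame({'lea_id': [1, 1, 2, 131], 'ise_id': [5, 98, 5, 33]})

# Same arithmetic as the notebook cell: unique id = LEA * 100 + institution.
toy['ise_uid'] = toy.lea_id * 100 + toy.ise_id
print(toy.ise_uid.tolist())  # [105, 198, 205, 13133]
```

Institution 5 appears under two different LEAs in the toy rows and still receives two distinct identifiers.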


In [4]:
df.g_avg_.describe()

count    33276.000000
mean         5.853180
std          0.827229
min          3.000000
25%          5.333300
50%          5.875000
75%          6.444400
max          8.000000
Name: g_avg_, dtype: float64

#### Numeric Features:
 - age_mh : int   student's age in months
 - g_ge_s : int   student's GCSE geography score
 - g_m_mx : int   student's best score on any GCSE math exam
 - g_m_tl : int   number of GCSE math exams the student has taken
 - g_avg_ : float student's average score across all GCSE exams

#### Categorical Features:
 - i_type : nominal integers {1:11}  type of institution attended
 - ise_id : nominal integers (LEA code * 100 + institution code)  specific institution attended
 - gender : nominal integers {0,1}   0 = male, 1 = female
 - boards : nominal integers {1, 2, 3, 5, 6, 7, 8}  examination board

#### Targets
- passed : binary integers {0, 1}  student passed (1) or failed (0)
- a_scre : ordinal int {0:10}  grade points awarded for the A-level geography exam: 0, 2, 4, 6, 8, or 10

In [5]:
# build the pipeline: column transformer ==> feature selector ==> classifier
# column transformation
num_fte = [ 'g_ge_s', 'g_m_mx', 'g_m_tl', 'age_mh', 'g_avg_']
std_tfr = Pipeline(steps=[
    ('std', StandardScaler())])
cgl_fte = ['gender', 'i_type', 'ise_id', 'boards']
hot_tfr = Pipeline(steps=[
    ('hot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num_std', std_tfr, num_fte),
        ('cgl_hot', hot_tfr, cgl_fte)])
# feature selector
select = SelectKBest(f_regression, k=5)
# pipeline appending classifier
mdl = Pipeline(steps=[('ppr', preprocessor),
                      ('slt', select),
                      ('cfr', LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200))])
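
To see what the preprocessor hands to `SelectKBest`, here is a sketch on a toy frame (the column names and values are invented for illustration). Each distinct category becomes its own column, so a high-cardinality field such as `ise_id` can contribute far more columns than the `k=5` kept by the selector.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'score': [5.0, 6.5, 7.0, 4.5],     # numeric, will be standardized
    'school': [101, 102, 103, 101],    # nominal id with 3 distinct values
    'gender': [0, 1, 1, 0],            # nominal with 2 distinct values
})

toy_ppr = ColumnTransformer(transformers=[
    ('num_std', StandardScaler(), ['score']),
    ('cgl_hot', OneHotEncoder(handle_unknown='ignore'), ['school', 'gender']),
])

encoded = toy_ppr.fit_transform(toy)
# 1 scaled numeric column + 3 school columns + 2 gender columns = 6 columns
print(encoded.shape)  # (4, 6)
```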

In [6]:
X = df.drop(['passed', 'a_scre'], axis=1)
y = df.passed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
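
The split above is unseeded, so the exact scores reported further down will vary from run to run. A hedged sketch of a reproducible, stratified split on toy labels (the `random_state` value and the 90/10 class mix are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with an arbitrary 90/10 class imbalance.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 900 + [0] * 100)

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle.
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2,
                                      stratify=y_toy, random_state=42)
print(y_a.mean(), y_b.mean())  # 0.9 0.9
```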

In [7]:
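# NOTE: scikit-learn >= 1.2 expects penalty=None for an unpenalized fit;
# older releases used the string 'none' instead.
dcts = [{'cfr__penalty': None, 'cfr__solver': 'lbfgs'},
        {'cfr__penalty': 'l2', 'cfr__solver': 'lbfgs'},
        {'cfr__penalty': 'l1', 'cfr__solver': 'liblinear'}]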

In [8]:
results = []
for dct in dcts:
    mdl.set_params(**dct)
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    # for 0/1 labels the mean absolute error equals 1 - accuracy
    results.append({'score': mdl.score(X_test, y_test),
                    'mean_abs_error': mean_absolute_error(y_test, y_pred)})

In [9]:
dfr = pd.DataFrame(results, index=['Vanilla', 'Ridge', 'Lasso'])

In [10]:
dfr

Unnamed: 0,score,mean_abs_error
Vanilla,0.904748,0.095252
Ridge,0.904748,0.095252
Lasso,0.904597,0.095403
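
For 0/1 labels, mean absolute error is simply the misclassification rate, so the two columns above always sum to one and carry the same information. A sketch with made-up labels that checks this and adds a confusion matrix as a second, more informative criterion (the labels and the choice of metric are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error

# Made-up labels: 8 of 10 students pass; the "model" predicts pass for everyone.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_hat = np.ones(10, dtype=int)

acc = accuracy_score(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)
print(acc, mae)  # 0.8 0.2

# Rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_hat)
print(cm.tolist())  # [[0, 2], [0, 8]]
```

The always-pass predictions score 0.8 here without identifying a single failing student, which is why a majority-class baseline is worth reporting next to accuracies like those in the table.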


In [None]:
X_test.columns

In [None]:
y_pred = mdl.predict(X_test)

In [None]:
mdl.predict_log_proba(X_test).shape

In [None]:
mean_absolute_error(y_test, y_pred)

In [None]:
mdl.classes_

In [None]:
mdl.named_steps

In [None]:
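from sklearn import neighbors
from sklearn.model_selection import cross_val_score

# Scratch cell from a KNN exercise: `param` (pairs of n_neighbors and weights)
# and `music` (a DataFrame with loudness, duration, and bpm) are not defined in this notebook.
X = music[['loudness', 'duration']]
Y = music.bpm
results = []
for a, b in param:
    knn1 = neighbors.KNeighborsRegressor(n_neighbors=a, weights=b)
    # cross_val_score fits a clone of knn1 on each fold, so no separate fit is needed
    score = cross_val_score(knn1, X, Y, cv=3)
    # 'accuracy' here is the mean R-squared, the default score for a regressor
    results.append({'n': a, 'weighted': b == 'distance',
                    'accuracy': np.round(score.mean(), decimals=2),
                    'std': np.round(score.std(), decimals=2)})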
 

In [None]:
dfr = pd.DataFrame(results)

In [None]:
table = dfr.pivot(index='n', columns='weighted', values=['accuracy', 'std'])
table

In [None]:
from sklearn import linear_model
from sklearn.cross_decomposition import PLSRegression

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Save predicted values.
Y_pred = regr.predict(X)
print('R-squared regression:', regr.score(X, y))

# Fit a linear model using Partial Least Squares Regression.
# Reduce feature space to 3 dimensions.
pls1 = PLSRegression(n_components=3)

# Reduce X to R(X) and regress on y.
pls1.fit(X, y)

# Save predicted values.
Y_PLS_pred = pls1.predict(X)