# Logistic Regression with Python and Sequential Feature Selection


## Import Libraries
Let's import some libraries to get started!

In [0]:
#!pip install mlxtend

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

## Importing Data and Converting Categorical Features

We need to convert the categorical features to dummy variables with pandas. Otherwise the machine learning algorithm cannot take those features directly as inputs.
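As a minimal sketch of what dummy encoding produces, the toy frame below (hypothetical values, not rows from `data.csv`) is expanded with `pd.get_dummies`. Passing `dtype=int` is an assumption made here so that the columns hold 0/1 as in this notebook's output; recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data: the values are illustrative, not taken from data.csv
toy = pd.DataFrame({'HireType': ['Unknown', 'Experienced Hire', 'Unknown']})

# One 0/1 column per category, named "<prefix>_<category>"
dummies = pd.get_dummies(toy['HireType'], prefix='HireType', dtype=int)
print(dummies)
```

Each row has exactly one 1 across the new columns, which is why the original text column can be dropped before modelling.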

In [0]:
df = pd.read_csv('data.csv')

In [226]:
df.head()

Unnamed: 0,SerialNumber,Leave,ActionYear,WorkDurationYear,CountLoan,Avg_MonthPerLoan,HireType,HireSourceGroup,WorkDurationYear.1,Avg_TotalAbsensePerYear,Avg_NumDaysPerAbsense,TotalEduAllowance,NumYear_SinceLastEduAllowance,TotalEduAttend,EduBranch_CHEM,EduBranch_Finance,EduBranch_Languages,Max_EduInstituteGroup,NumYear_SinceLastEdu
0,4,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,41.0
1,5,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UNIV,40.0
2,6,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,47.0
3,7,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SCHL,39.0
4,10,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,38.0


In [0]:
dfEduInstituteGroup = pd.get_dummies(df['Max_EduInstituteGroup'], prefix='Max_EduInstituteGroup')
dfHireTypeGroup = pd.get_dummies(df['HireType'], prefix='HireType')
dfHireSourceGroup = pd.get_dummies(df['HireSourceGroup'], prefix='HireSourceGroup')

# The original categorical columns stay in df for now; they are dropped when the feature matrix is built.

df = pd.concat([df, dfEduInstituteGroup,dfHireTypeGroup,dfHireSourceGroup], axis=1)

In [228]:
print(df.shape)
df.head()

(4591, 40)


Unnamed: 0,SerialNumber,Leave,ActionYear,WorkDurationYear,CountLoan,Avg_MonthPerLoan,HireType,HireSourceGroup,WorkDurationYear.1,Avg_TotalAbsensePerYear,...,HireType_Experienced Hire,HireType_Inexperienced Hire,HireType_Unknown,HireSourceGroup_Agency,HireSourceGroup_Campus/Fair,HireSourceGroup_Contractor Conversion,HireSourceGroup_Other,HireSourceGroup_Referral,HireSourceGroup_Unknown,HireSourceGroup_Website/Ads
0,4,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,...,0,0,1,0,0,0,0,0,1,0
1,5,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,...,0,0,1,0,0,0,0,0,1,0
2,6,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0
3,7,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0
4,10,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0


## Sequential Feature Selection

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc
from mlxtend.feature_selection import SequentialFeatureSelector as sfs


In [0]:
# Split chronologically: train on years before 2017, test on 2017 onward
df_train = df[df['ActionYear'] < 2017]
df_test = df[df['ActionYear'] >= 2017]

# Identifiers, the split key, the label, and the raw categorical columns are not model inputs
drop_cols = ['SerialNumber', 'ActionYear', 'Leave', 'Max_EduInstituteGroup', 'HireType', 'HireSourceGroup']

df_train_variable = df_train.drop(drop_cols, axis=1)
df_train_label = df_train['Leave']

df_test_variable = df_test.drop(drop_cols, axis=1)
df_test_label = df_test['Leave']

# The split is by year rather than random, so train_test_split is not used here
X_train, X_test, y_train, y_test = df_train_variable, df_test_variable, df_train_label, df_test_label


In [231]:
X_train.shape

(3469, 34)

In [233]:
# Build the logistic regression classifier used in feature selection
clf = LogisticRegression()

# Build step forward feature selection
sfs1 = sfs(clf,
           k_features=34,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=5)

# Perform SFS (forward selection, floating=False)
sfs1 = sfs1.fit(X_train, y_train)

[2018-07-28 12:53:38] Features: 1/34 -- score: 0.7269837111335671
[2018-07-28 12:53:39] Features: 2/34 -- score: 0.7347734238224153
[2018-07-28 12:53:40] Features: 3/34 -- score: 0.7362139301620569
[2018-07-28 12:53:41] Features: 4/34 -- score: 0.7373670837647783
...
[2018-07-28 12:54:09] Features: 30/34 -- score: 0.7191923350424791
[2018-07-28 12:54:09] Features: 31/34 -- score: 0.7189045664549987
[2018-07-28 12:54:10] Features: 32/34 -- score: 0.6906612439753651
[2018-07-28 12:54:10] Features: 33/34 -- score: 0.6727929771157437
[2018-07-28 12:54:10] Features: 34/34 -- score: 0.67279297711

In [234]:
# Which features?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
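Because `k_features=34` equals the number of columns in `X_train`, the selector returns every feature, even though the log shows higher cross-validated accuracy for small subsets than for the full set. The greedy search behind that log can be sketched with scikit-learn alone. This is an illustrative sketch, not mlxtend's implementation: the synthetic data, the names `history` and `best_subset`, and `max_iter=1000` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 6 features, 3 informative
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

clf = LogisticRegression(max_iter=1000)
selected, remaining, history = [], list(range(X.shape[1])), []

while remaining:
    # Score every candidate that could be added to the current subset
    scores = {j: cross_val_score(clf, X[:, selected + [j]], y,
                                 cv=5, scoring='accuracy').mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    history.append((list(selected), scores[best]))

# Keep the subset with the highest cross-validated accuracy
best_subset, best_score = max(history, key=lambda item: item[1])
print(best_subset, round(best_score, 3))
```

Choosing the subset by best score rather than a fixed size is what makes the selection step actually discard features. Recent mlxtend versions document string values for `k_features` that serve a similar purpose; check the documentation for the installed version before relying on them.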


In [0]:
df_before_selected = df.drop(['SerialNumber','ActionYear','Leave','Max_EduInstituteGroup','HireType','HireSourceGroup'],axis=1)

In [0]:
df_after_selected = df_before_selected.iloc[:, feat_cols]

In [237]:
df_after_selected.columns

Index(['WorkDurationYear', 'CountLoan', 'Avg_MonthPerLoan',
       'WorkDurationYear.1', 'Avg_TotalAbsensePerYear',
       'Avg_NumDaysPerAbsense', 'TotalEduAllowance',
       'NumYear_SinceLastEduAllowance', 'TotalEduAttend', 'EduBranch_CHEM',
       'EduBranch_Finance', 'EduBranch_Languages', 'NumYear_SinceLastEdu',
       'Max_EduInstituteGroup_ACAD', 'Max_EduInstituteGroup_CAMP',
       'Max_EduInstituteGroup_COLL', 'Max_EduInstituteGroup_INST',
       'Max_EduInstituteGroup_NIDA', 'Max_EduInstituteGroup_OTHR',
       'Max_EduInstituteGroup_SASI', 'Max_EduInstituteGroup_SCHL',
       'Max_EduInstituteGroup_UNIV', 'Max_EduInstituteGroup_Unknown',
       'HireType_Contractor Conversion', 'HireType_Experienced Hire',
       'HireType_Inexperienced Hire', 'HireType_Unknown',
       'HireSourceGroup_Agency', 'HireSourceGroup_Campus/Fair',
       'HireSourceGroup_Contractor Conversion', 'HireSourceGroup_Other',
       'HireSourceGroup_Referral', 'HireSourceGroup_Unknown',
       'HireSou

In [0]:
df_after_selected = pd.concat([df['SerialNumber'], df['Leave'], df['ActionYear'], df_after_selected], axis=1)

In [239]:
print(df_after_selected.shape)
df_after_selected.head()

(4591, 37)


Unnamed: 0,SerialNumber,Leave,ActionYear,WorkDurationYear,CountLoan,Avg_MonthPerLoan,WorkDurationYear.1,Avg_TotalAbsensePerYear,Avg_NumDaysPerAbsense,TotalEduAllowance,...,HireType_Experienced Hire,HireType_Inexperienced Hire,HireType_Unknown,HireSourceGroup_Agency,HireSourceGroup_Campus/Fair,HireSourceGroup_Contractor Conversion,HireSourceGroup_Other,HireSourceGroup_Referral,HireSourceGroup_Unknown,HireSourceGroup_Website/Ads
0,4,1.0,2000,39.0,0.0,0.0,39.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,1,0
1,5,1.0,2000,39.0,0.0,0.0,39.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,1,0
2,6,1.0,2000,38.0,0.0,0.0,38.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,1,0
3,7,1.0,2000,38.0,0.0,0.0,38.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,1,0
4,10,1.0,2000,38.0,0.0,0.0,38.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,1,0


Great! Our data is ready for our model!

# Building a Logistic Regression model

Let's start by splitting our data into a training set and a test set. We reuse the split by year from the feature selection step: records with `ActionYear` before 2017 form the training set, and records from 2017 onward form the test set.

## Train Test Split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
columns = df_after_selected.drop(['SerialNumber', 'Leave', 'ActionYear'], axis=1).columns

X_train, X_test, y_train, y_test = df_train_variable[columns], df_test_variable[columns], df_train_label, df_test_label
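The split used here is chronological rather than random: the model is fit on earlier years and evaluated on later ones. A minimal sketch on a toy frame (hypothetical values; only the column names `ActionYear`, `WorkDurationYear`, and `Leave` come from the notebook) shows the pattern:

```python
import pandas as pd

# Toy frame (hypothetical values) with the same split key and label names
toy = pd.DataFrame({
    'ActionYear': [2015, 2016, 2017, 2018],
    'WorkDurationYear': [3.0, 10.0, 1.0, 7.0],
    'Leave': [0.0, 1.0, 0.0, 1.0],
})

cutoff = 2017  # same cutoff year as the notebook
train = toy[toy['ActionYear'] < cutoff]
test = toy[toy['ActionYear'] >= cutoff]

# Features exclude the split key and the label
X_train_toy = train.drop(['ActionYear', 'Leave'], axis=1)
y_train_toy = train['Leave']
X_test_toy = test.drop(['ActionYear', 'Leave'], axis=1)
y_test_toy = test['Leave']

print(X_train_toy.shape, X_test_toy.shape)
```

Every training row predates every test row, so the evaluation mimics predicting future years from past ones.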



## Training and Predicting

In [0]:
from sklearn.linear_model import LogisticRegression

In [243]:
logmodel = LogisticRegression(C=1.0)
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
predictions = logmodel.predict(X_test)

Let's move on to evaluate our model!

## Evaluation

We can check precision, recall, and F1-score using the classification report!

In [0]:
from sklearn.metrics import classification_report

In [246]:
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

        0.0       0.90      0.47      0.61       959
        1.0       0.18      0.69      0.28       163

avg / total       0.79      0.50      0.57      1122
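The report does not print overall accuracy, but it can be approximated from the per-class recall and support. The sketch below uses only the rounded figures shown above, so the reconstructed counts are approximate; the variable names are chosen for this example.

```python
# Values copied from the classification report (rounded to two decimals)
support = {0.0: 959, 1.0: 163}
recall = {0.0: 0.47, 1.0: 0.69}

# Approximate counts of correct predictions per class
correct_0 = round(recall[0.0] * support[0.0])  # true negatives
correct_1 = round(recall[1.0] * support[1.0])  # true positives

total = sum(support.values())
accuracy = (correct_0 + correct_1) / total

# Class-0 rows predicted as 1.0 are false positives for class 1.0
false_pos_1 = support[0.0] - correct_0
precision_1 = correct_1 / (correct_1 + false_pos_1)

print(f"accuracy ~ {accuracy:.2f}, precision(1.0) ~ {precision_1:.2f}")
```

The reconstructed precision for class 1.0 comes out at about 0.18, matching the report, and the accuracy at about 0.50, matching the weighted-average recall.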



The results are mixed: the model recovers 69% of the `Leave = 1.0` cases (recall), but only 18% of its positive predictions are correct (precision), and the weighted-average recall is 0.50. You might want to explore further feature engineering on the columns above to improve on this.