# Logistic Regression with Python

This lecture follows the structure of the well-known [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic) tutorial, which is very often a student's first step in machine learning. The notebook itself, however, loads an employee records file (`data.csv`) rather than the Titanic data.

We'll be trying to predict a binary class: the `Leave` column, which takes the values 1.0 and 0.0.
Let's begin our understanding of implementing Logistic Regression in Python for classification.

We'll use a "semi-cleaned" version of the data set. If you start from a raw export instead, you may need to do some additional cleaning not shown in this lecture notebook.

## Import Libraries
Let's import some libraries to get started!

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Importing Data and Converting Categorical Features

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
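As a minimal sketch of what dummy encoding does, the toy frame below (hypothetical values, not taken from `data.csv`) is passed through `pd.get_dummies` with a prefix. Each distinct category becomes its own indicator column:

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
toy = pd.DataFrame({"HireType": ["Unknown", "Experienced Hire", "Unknown"]})

# One indicator column per distinct category, named "<prefix>_<category>".
dummies = pd.get_dummies(toy["HireType"], prefix="HireType")

print(list(dummies.columns))
# Cast to int so the output reads as 0/1 regardless of the pandas version.
print(dummies.astype(int).to_numpy().tolist())
```

Because the indicator columns of one feature always sum to 1, one of them is redundant when the model has an intercept. Passing `drop_first=True` to `pd.get_dummies` is one common way to remove it; this notebook keeps all columns.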

In [0]:
df = pd.read_csv('data.csv')

In [57]:
df.head()

Unnamed: 0,SerialNumber,Leave,ActionYear,WorkDurationYear,CountLoan,Avg_MonthPerLoan,HireType,HireSourceGroup,WorkDurationYear.1,Avg_TotalAbsensePerYear,Avg_NumDaysPerAbsense,TotalEduAllowance,NumYear_SinceLastEduAllowance,TotalEduAttend,EduBranch_CHEM,EduBranch_Finance,EduBranch_Languages,Max_EduInstituteGroup,NumYear_SinceLastEdu
0,4,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,41.0
1,5,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UNIV,40.0
2,6,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,47.0
3,7,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SCHL,39.0
4,10,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,Unknown,38.0


In [0]:
# One-hot encode each categorical column; the prefix keeps the new column names traceable.
dfEduInstituteGroup = pd.get_dummies(df['Max_EduInstituteGroup'], prefix='Max_EduInstituteGroup')
dfHireTypeGroup = pd.get_dummies(df['HireType'], prefix='HireType')
dfHireSourceGroup = pd.get_dummies(df['HireSourceGroup'], prefix='HireSourceGroup')

# The original categorical columns stay in df for now; they are dropped when the feature matrices are built.
df = pd.concat([df, dfEduInstituteGroup, dfHireTypeGroup, dfHireSourceGroup], axis=1)


In [59]:
print(df.shape)
df.head()

(4591, 40)


Unnamed: 0,SerialNumber,Leave,ActionYear,WorkDurationYear,CountLoan,Avg_MonthPerLoan,HireType,HireSourceGroup,WorkDurationYear.1,Avg_TotalAbsensePerYear,...,HireType_Experienced Hire,HireType_Inexperienced Hire,HireType_Unknown,HireSourceGroup_Agency,HireSourceGroup_Campus/Fair,HireSourceGroup_Contractor Conversion,HireSourceGroup_Other,HireSourceGroup_Referral,HireSourceGroup_Unknown,HireSourceGroup_Website/Ads
0,4,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,...,0,0,1,0,0,0,0,0,1,0
1,5,1.0,2000,39.0,0.0,0.0,Unknown,Unknown,39.0,0.0,...,0,0,1,0,0,0,0,0,1,0
2,6,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0
3,7,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0
4,10,1.0,2000,38.0,0.0,0.0,Unknown,Unknown,38.0,0.0,...,0,0,1,0,0,0,0,0,1,0


Great! Our data is ready for our model!

# Building a Logistic Regression model

Let's start by splitting our data into a training set and a test set. Instead of a random split, this notebook holds out the rows with `ActionYear` equal to 2017 as the test set and trains on all other years.

## Train Test Split

In [60]:
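from sklearn.model_selection import train_test_split  # only needed for the random-split alternative

# Train on every year except 2017.
df_train = df[df['ActionYear'] != 2017]
df_train.shape  # not displayed: a notebook cell only shows its last expression

# Hold out 2017 as the test set.
df_test = df[df['ActionYear'] == 2017]
df_test.shape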

(1122, 40)

In [0]:
# Drop identifiers, the split column, the label, and the raw categorical columns (their dummies remain).
# Note: 'Unnamed: 0' (the saved row index) and the duplicate 'WorkDurationYear.1' are still used as features.
drop_cols = ['SerialNumber','ActionYear','Leave','Max_EduInstituteGroup','HireType','HireSourceGroup']

df_train_variable = df_train.drop(drop_cols, axis=1)
df_train_label = df_train['Leave']

df_test_variable = df_test.drop(drop_cols, axis=1)
df_test_label = df_test['Leave']


In [0]:
#X_train, X_test, y_train, y_test = train_test_split(df_variable, df_label, test_size=0.30, random_state=101)
X_train, X_test, y_train, y_test = df_train_variable, df_test_variable, df_train_label, df_test_label
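
The commented-out line above hints at the alternative to a year-based holdout: a random split. A minimal sketch on synthetic data could look like the following (the frame `demo` and its columns `x1` and `label` are hypothetical, not part of `data.csv`). The `stratify` argument keeps the class ratio the same in both parts, which can matter when one class is rare:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 20 positives and 80 negatives.
rng = np.random.default_rng(101)
demo = pd.DataFrame({
    "x1": rng.normal(size=100),
    "label": np.array([1] * 20 + [0] * 80),
})

# Random 70/30 split that preserves the 20% positive rate in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["x1"]], demo["label"],
    test_size=0.30, random_state=101, stratify=demo["label"],
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

A random split mixes all years together, whereas the year-based holdout used in this notebook evaluates the model on a period it has not seen during training.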

## Training and Predicting

In [0]:
from sklearn.linear_model import LogisticRegression

In [64]:
logmodel = LogisticRegression(C=1.0)
logmodel.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
predictions = logmodel.predict(X_test)
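
To make the prediction step more concrete: for a binary problem, the estimated probability of the positive class is the logistic (sigmoid) function applied to the linear score, and `predict` returns the class with the larger probability. The sketch below checks both statements on synthetic data generated with `make_classification`, not on the notebook's data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(C=1.0).fit(X, y)

# Linear score w . x + b for every row.
scores = model.decision_function(X)

# Sigmoid of the score should match the positive-class probability.
proba_manual = 1.0 / (1.0 + np.exp(-scores))
proba = model.predict_proba(X)[:, 1]

# Thresholding the probability at 0.5 should reproduce predict().
labels_from_proba = model.classes_[(proba > 0.5).astype(int)]

print(np.allclose(proba, proba_manual))
print(np.array_equal(labels_from_proba, model.predict(X)))
```

Working with `predict_proba` directly makes it possible to choose a threshold other than 0.5 when false positives and false negatives have different costs.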

Let's move on to evaluate our model!

## Evaluation

We can check precision, recall, and f1-score using `classification_report`!

In [0]:
from sklearn.metrics import classification_report

In [67]:
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

        0.0       0.90      0.47      0.61       959
        1.0       0.18      0.69      0.28       163

avg / total       0.79      0.50      0.57      1122
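
To see where the numbers in such a report come from, the sketch below computes precision, recall, and f1 for the positive class by hand from a confusion matrix and compares them with scikit-learn's helpers. The labels are a small made-up example, not the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration: 6 negatives and 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

A pattern like the one in the report above, with high recall but low precision on class 1.0, means that most true positives are found at the cost of many false positives.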

