# Logistic Regression with Python

For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). This is a very famous data set and is often a student's first step in machine learning!

We'll be trying to predict a binary class for each passenger: survived (1) or deceased (0).
Let's begin our understanding of implementing Logistic Regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

## Import Libraries
Let's import some libraries to get started!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## The Data

Let's read the train_clean file into a pandas DataFrame.

In [2]:
train = pd.read_csv('train_clean')

In [3]:
train.head()

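| Unnamed: 0 | Survived | Age | SibSp | Parch | Fare | male | Q | S | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 1 | 0 | 1 |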
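
The `Unnamed: 0` column in the output above is typically what pandas produces when a CSV was saved with its index and then read back without telling pandas about it. The sketch below shows one way to avoid that, using a small in-memory CSV as a hypothetical stand-in for the real file (the actual contents of train_clean are assumed, not known). Note that the results later in this notebook were produced with the column left in, so removing it may change the numbers slightly.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV on disk: the first, unnamed column is a saved index.
csv_text = """,Survived,Age,SibSp,Parch,Fare,male,Q,S,2,3
0,0,22.0,1,0,7.25,1,0,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,0,1,0,1
"""

# Default read: the saved index comes back as a regular column named 'Unnamed: 0'.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(as_index.columns))
```

An equivalent option on an already-loaded frame is `train.drop('Unnamed: 0', axis=1)`, which keeps a row counter from being passed to the model as a feature.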


In [4]:
# Will be used later after evaluating the model accuracy with the training data
# test = pd.read_csv('test_clean') 
# test.head()

Great! Our data is ready for our model!

# Building a Logistic Regression model

## Train Test Split

In [14]:
# X_train = train.drop('Survived',axis=1)
# y_train = train['Survived']

# X_test = test
# y_test = pd.DataFrame()

# X_train.shape
# y_train.shape
# X_test.shape
# y_test.shape

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)
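
To see what the split above does, here is a minimal sketch on a tiny made-up frame (the `toy` data is hypothetical and only there for illustration). It shows that `test_size=0.30` holds out 30% of the rows, that features and labels stay aligned, and that a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two features and a binary target.
toy = pd.DataFrame({
    'Age':      [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
    'Fare':     [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 11.13, 30.07, 16.70],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

X = toy.drop('Survived', axis=1)
y = toy['Survived']

# Same call pattern as in the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)

# Repeating the call with the same random_state gives the same rows again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.30, random_state=101)

print(X_tr.shape, X_te.shape)            # 7 training rows, 3 test rows
print(X_te.index.equals(X_te2.index))    # True: the split is reproducible
print(y_tr.index.equals(X_tr.index))     # True: labels stay aligned with features
```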

## Training and Predicting

In [6]:
from sklearn.linear_model import LogisticRegression

In [7]:
logmodel = LogisticRegression()

logmodel.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
predictions = logmodel.predict(X_test)
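
`predict` returns hard class labels, but a logistic regression model also estimates class probabilities, available through `predict_proba`. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the Titanic data) to show how the two relate for a binary problem: the predicted label is 1 when the estimated probability of class 1 exceeds 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=101)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

# One row per sample, one column per class (ordered as in demo_model.classes_).
proba = demo_model.predict_proba(X_demo)
labels = demo_model.predict(X_demo)

# For a binary model, thresholding P(class 1) at 0.5 reproduces predict().
labels_from_proba = (proba[:, 1] > 0.5).astype(int)

print(demo_model.classes_)
print(proba[:3])
print(np.array_equal(labels, labels_from_proba))
```

Working with probabilities directly lets you pick a threshold other than 0.5 if, for example, missing a survivor is more costly than a false alarm.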

Let's move on to evaluate our model!

## Evaluation

We can check precision, recall, and f1-score using a classification report!

In [9]:
from sklearn.metrics import classification_report

In [10]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.81      0.91      0.86       163
           1       0.83      0.65      0.73       104

    accuracy                           0.81       267
   macro avg       0.82      0.78      0.79       267
weighted avg       0.81      0.81      0.81       267



In [11]:
from sklearn.metrics import confusion_matrix

In [12]:
confusion_matrix(y_test,predictions)

array([[149,  14],
       [ 36,  68]])
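
The confusion matrix and the classification report describe the same predictions. In scikit-learn's convention, rows are the true classes and columns are the predicted classes, so the matrix above reads as 149 true negatives, 14 false positives, 36 false negatives, and 68 true positives. As a quick check, the sketch below recomputes the class 1 metrics and the accuracy by hand from those four numbers.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[149, 14],
               [36, 68]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # of the predicted survivors, how many survived
recall_1 = tp / (tp + fn)         # of the actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The recall of 0.65 for class 1 is the weakest number here: the model misses 36 of the 104 actual survivors in the test set.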

Not so bad! You might want to explore other feature engineering and the other titanic_test.csv file. Some suggestions for feature engineering:

* Try grabbing the Title (Dr., Mr., Mrs., etc.) from the name as a feature
* Maybe the Cabin letter could be a feature
* Is there any info you can get from the ticket?
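
The first two suggestions could be sketched as follows. This assumes the raw Kaggle columns `Name` and `Cabin`, which are not present in the semi-cleaned file used in this notebook. The `raw` frame, the new column names `Title` and `Deck`, and the regular expression are illustrative choices rather than the only way to do it.

```python
import pandas as pd

# Small hypothetical sample in the style of the raw Kaggle data.
raw = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina',
             'McCarthy, Mr. Timothy J'],
    'Cabin': [None, 'C85', None, 'E46'],
})

# Title: the text between the comma after the surname and the next period.
raw['Title'] = raw['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Deck: the first letter of the cabin code, with a placeholder for missing cabins.
raw['Deck'] = raw['Cabin'].str[0].fillna('Unknown')

print(raw[['Title', 'Deck']])

# One-hot encode the new categorical columns so a model can use them.
encoded = pd.get_dummies(raw[['Title', 'Deck']], drop_first=True)
print(encoded.columns.tolist())
```

Rare titles usually need to be grouped into a single category before encoding, otherwise they add many columns that are almost always zero.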