# Build a Logistic Regression model

Load all the packages you are going to use.

In [1]:
#Calculations and visualizations

In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Modelling

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

**Step 1**

Read the file train.csv into Python and print a few rows.

In [5]:
full_data = pd.read_csv("train.csv", index_col = 0)

In [6]:
full_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
full_data.shape

(891, 11)

Define the feature matrix X (here only the passenger class, `Pclass`) and the target vector y (`Survived`).

In [8]:
X = full_data[["Pclass"]]

In [9]:
X

PassengerId,Pclass
1,3
2,1
3,3
4,1
5,3
...,...
887,2
888,1
889,3
890,1


In [10]:
type(X)

pandas.core.frame.DataFrame

In [11]:
y = full_data["Survived"]

In [12]:
y

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    0
890    1
891    0
Name: Survived, Length: 891, dtype: int64

In [13]:
type(y)

pandas.core.series.Series

Split the data into a training set (80%) and a test set (20%).

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [15]:
y_train, y_test

(PassengerId
 332    0
 734    0
 383    0
 705    0
 814    0
       ..
 107    1
 271    0
 861    0
 436    1
 103    0
 Name: Survived, Length: 712, dtype: int64,
 PassengerId
 710    1
 440    0
 841    0
 721    1
 40     1
       ..
 434    0
 774    0
 26     1
 85     1
 11     1
 Name: Survived, Length: 179, dtype: int64)

In [16]:
X_train.shape, X_test.shape

((712, 1), (179, 1))
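
The shapes above follow from simple arithmetic. As a rough sketch, assuming scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remaining rows to the training set (an assumption that is consistent with the 712/179 split shown here):

```python
import math

n_samples = 891   # rows in full_data
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test set size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```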

Build a logistic regression model.

In [17]:
m_logreg = LogisticRegression(class_weight = "balanced")

In [18]:
m_logreg.fit(X_train, y_train)

LogisticRegression(class_weight='balanced')
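
With `class_weight = "balanced"`, scikit-learn weights each class inversely to its frequency, using `n_samples / (n_classes * count_of_class)`. A minimal sketch of that formula on made-up labels (the list `toy_labels` is hypothetical and is not the Titanic data):

```python
from collections import Counter

# Hypothetical toy labels: 6 negatives and 4 positives.
toy_labels = [0] * 6 + [1] * 4

counts = Counter(toy_labels)
n_samples = len(toy_labels)
n_classes = len(counts)

# Rarer classes receive larger weights.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(balanced_weights)
```

In this toy case the minority class gets the larger weight, so errors on it count more during fitting.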

Print the coefficients calculated by the model.

In [19]:
w = m_logreg.coef_

In [20]:
w

array([[-0.81546945]])

In [21]:
b = m_logreg.intercept_

In [22]:
b

array([1.85856922])

In [23]:
f' w: {w[0][0]} and b: {b[0]}'

' w: -0.815469446584666 and b: 1.8585692161600254'
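
For a single feature, logistic regression turns these two numbers into a probability via the sigmoid function, p = 1 / (1 + exp(-(w * x + b))). A small sketch in plain Python, with `w` and `b` copied from the output above and a helper name `prob_survived` that is only illustrative, evaluates it for each passenger class:

```python
import math

w = -0.815469446584666   # coefficient for Pclass, from the output above
b = 1.8585692161600254   # intercept, from the output above

def prob_survived(pclass):
    """Sigmoid of the linear score w * pclass + b."""
    z = w * pclass + b
    return 1.0 / (1.0 + math.exp(-z))

for pclass in (1, 2, 3):
    print(pclass, round(prob_survived(pclass), 4))
```

Because `w` is negative, the estimated survival probability falls as the class number rises from 1 to 3.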

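Calculate the probability that each data point belongs to the positive class (`Survived = 1`). In the output of `predict_proba`, the first column is the probability of class 0 and the second column is the probability of class 1.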

In [24]:
m_logreg.predict_proba(X)

array([[0.64286919, 0.35713081],
       [0.26055233, 0.73944767],
       [0.64286919, 0.35713081],
       ...,
       [0.64286919, 0.35713081],
       [0.26055233, 0.73944767],
       [0.64286919, 0.35713081]])

In [25]:
ypred_train = m_logreg.predict(X_train)

In [26]:
X_test_prob = m_logreg.predict_proba(X_test)[:,1]

In [27]:
X_test_prob

array([0.35713081, 0.55666312, 0.35713081, 0.55666312, 0.35713081,
       0.73944767, 0.35713081, 0.35713081, 0.35713081, 0.73944767,
       0.73944767, 0.35713081, 0.35713081, 0.35713081, 0.55666312,
       0.73944767, 0.73944767, 0.35713081, 0.55666312, 0.73944767,
       0.35713081, 0.73944767, 0.35713081, 0.35713081, 0.35713081,
       0.35713081, 0.73944767, 0.55666312, 0.35713081, 0.35713081,
       0.35713081, 0.35713081, 0.73944767, 0.35713081, 0.35713081,
       0.35713081, 0.73944767, 0.35713081, 0.73944767, 0.35713081,
       0.55666312, 0.35713081, 0.35713081, 0.35713081, 0.35713081,
       0.35713081, 0.35713081, 0.35713081, 0.35713081, 0.73944767,
       0.35713081, 0.73944767, 0.35713081, 0.73944767, 0.35713081,
       0.73944767, 0.55666312, 0.73944767, 0.55666312, 0.35713081,
       0.35713081, 0.55666312, 0.55666312, 0.73944767, 0.35713081,
       0.55666312, 0.55666312, 0.35713081, 0.35713081, 0.73944767,
       0.55666312, 0.73944767, 0.73944767, 0.73944767, 0.35713

Suppose you classify all points with a probability > 0.9 as positive.

How does the result of your prediction change?

How does it change if you change the threshold to > 0.1?


In [28]:
ypred_train.mean()

0.4410112359550562

In [29]:
(X_test_prob > 0.9).mean()

0.0

In [30]:
(X_test_prob > 0.1).mean()

1.0

In [32]:
X_test_prob.mean()

0.5071160611654015
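
The results above follow from the fact that a model with the single feature `Pclass` can only produce three distinct probabilities, all between roughly 0.36 and 0.74. A short sketch, using those three values from the `predict_proba` output and an illustrative helper `classify`, shows how the threshold changes the predicted labels:

```python
# The three distinct positive-class probabilities from the output above
# (Pclass 3, 2 and 1 respectively).
probs = [0.35713081, 0.55666312, 0.73944767]

def classify(probabilities, threshold):
    """Label a point 1 when its probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]

for threshold in (0.1, 0.5, 0.9):
    print(threshold, classify(probs, threshold))
```

At 0.9 no point is labelled positive, at 0.1 every point is, and only a threshold between the three values (such as the 0.5 used by `predict`) separates the passenger classes.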