## 1. Introduction to Logistic Regression
Logistic regression is a technique used for solving the __classification problem__.<br/> Classification is the problem of __identifying__ to which of a set of __categories__ a new observation belongs, on the basis of a _training dataset_ containing observations (or instances) whose category membership is known. <br/>For example, to predict:<br/> __Whether an email is spam (1) or not (0)__ or,<br/> __Whether a tumor is malignant (1) or not (0)__<br/>
Below is a pictorial representation of a basic logistic regression model that classifies a set of images as _happy or sad._
![image.png](https://miro.medium.com/max/800/1*UgYbimgPXf6XXxMy2yqRLw.png)



Both linear regression and logistic regression are __supervised learning techniques__. For a _regression_ problem the output is __continuous__, whereas for a _classification_ problem the output is __discrete__. <br/>
- Logistic regression is used when the __dependent variable (target) is categorical__.<br/>
- The __sigmoid function__ (or logistic function) is used as the _hypothesis function_ for logistic regression. Whereas linear regression fits a straight line, logistic regression produces a logistic curve, which is limited to values between 0 and 1. <br/>
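
To make the behaviour of the sigmoid concrete, here is a minimal NumPy sketch. The function name `sigmoid`, the sample inputs, and the 0.5 threshold are illustrative assumptions, not something taken from the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs (assumed values, chosen only to show the shape of the curve)
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
probabilities = sigmoid(z)

# Thresholding at 0.5 turns the probabilities into discrete class labels
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5, which is why 0.5 is a common default cut-off between the two classes.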


Note: Theory is taken from a course I took at www.insaid.co

## 2. Data Loading and Description

#### Importing Packages

In [22]:
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### Importing the Dataset

In [36]:
data = pd.read_csv('../../data/titanic_train.csv')
data.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


#### Find Missing values

In [29]:
def handle_missing_values(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percentage = round(total / data.shape[0] * 100)
    return pd.concat([total, percentage], axis = 1, keys = ['total', 'percentage'])
handle_missing_values(data)

Unnamed: 0,total,percentage
Cabin,687,77.0
Age,177,20.0
Embarked,2,0.0
Fare,0,0.0
Ticket,0,0.0
Parch,0,0.0
SibSp,0,0.0
Sex,0,0.0
Name,0,0.0
Pclass,0,0.0


## 3. Preprocessing the data

### 3.1 **Dealing with missing values**

* Dropping/Replacing missing entries of __Embarked.__
* Replacing missing values of __Age__ with median values.
* Dropping the column __'Cabin'__ as it has too many _null_ values.

**Discussion**
* Generally, it is better to keep data than to discard it. You can sometimes drop a variable if its data is missing for more than 60% of observations, but only if that variable is insignificant.
* If a variable is missing in fewer than about 2% of rows, it is reasonable to drop those rows. This threshold may vary with the dataset.
* Replacing the missing values in each column with the mean/median of that column's non-missing values, separately and independently of the other columns, is easy and fast and works well with small numerical datasets. It gives poor results on encoded categorical features (do NOT use it on categorical features).
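
As a rough illustration of the last point, the sketch below fills a numeric column with its median and a categorical column with its most frequent value. The toy DataFrame `df` and its values are hypothetical and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, used only for illustration)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Numeric column: fill with the median of the non-missing values
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent category (the mode)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Using the mode for the categorical column avoids averaging category codes, which would produce values that do not correspond to any real category.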

#### Splitting X and y into training and test datasets.

In [37]:
#Replace the missing values in Embarked with the most frequent value (mode)
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
#Replace all the missing values in Age with the median
median_age = data['Age'].median()
data['Age'] = data['Age'].fillna(median_age)
#Drop the column Cabin, as too many of its values are missing
data = data.drop('Cabin', axis=1)
titanic = data.drop(['Name','Ticket','Sex','SibSp','Parch','Embarked'], axis = 1)
titanic.head(2)
X = titanic.loc[:,titanic.columns != 'Survived']
y = titanic.Survived
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

In [38]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
#Encode the Categorical Values
dummy_data = pd.get_dummies(data, columns=['Embarked'],drop_first=True)
X = dummy_data.loc[:,dummy_data.columns != 'Survived']
y = dummy_data.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print('Train cases as below')
print('X_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)

Train cases as below
X_train shape:  (712, 7)
y_train shape:  (712,)

Test cases as below
X_test shape:  (179, 7)
y_test shape:  (179,)


## 4. Logistic regression in scikit-learn

To apply any machine learning algorithm to your dataset, there are basically 4 steps:
1. Load the algorithm
2. Instantiate and fit the model to the training dataset
3. Predict on the test set
4. Evaluate the predictions (here, with the accuracy score)

The code block given below shows how the first two steps are carried out:<br/>

```python
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```
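
For a version that runs on its own, the workflow can be sketched end to end on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, `clf`, and `y_hat` are assumptions made for this illustration; they stand in for the Titanic features used in this notebook.

```python
# Load the algorithm (plus helpers for building a toy dataset)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

# Instantiate the model and fit it to the training split
clf = LogisticRegression()
clf.fit(X_tr, y_tr)

# Predict on the held-out test split
y_hat = clf.predict(X_te)

# Score the predictions against the true test labels
print("Test accuracy:", accuracy_score(y_te, y_hat))
```

The model is fitted only on the training split, so the test score reflects performance on rows the model has not seen.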

In [41]:
model = LogisticRegression()
model.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [43]:
#Predict with the model fitted in the previous cell
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy score for test data is:', accuracy_score(y_test,y_pred_test))

Accuracy score for test data is: 0.6703910614525139
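
To show what this score measures, the sketch below computes accuracy by hand and compares it with `accuracy_score`. The label arrays are hypothetical and are not the predictions from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only to illustrate the calculation
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy is the share of predictions that match the true labels
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```

Six of the eight hypothetical predictions match, so both calculations give 0.75.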
