# <font color=blue>Assignments for "Logistic Regression"</font>

- In this assignment, you will build a model to predict survival in the Titanic disaster. Submit a link to a Jupyter notebook containing solutions to the following tasks.
    - Download the [Titanic](https://www.kaggle.com/c/titanic/data) data from Kaggle. The train.csv file is sufficient for this assignment.
    - Split your data into training and test sets.
    - Build your model and predict survival on the test set you split off.
    - Is your model's performance satisfactory? Explain.
    - Try to improve your model's performance by adding or removing some variables.

- Explore the advantages and disadvantages of Logistic Regression and discuss with your mentor.

In [58]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [112]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [113]:
#Filling age column by using median.
titanic_df.Age = titanic_df.Age.fillna(titanic_df.Age.median())
#Dropping column with many empty values
titanic_df = titanic_df.drop("Cabin",axis=1)
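
The median above is computed over all 891 rows before the train/test split, so the test rows influence the fill value. A stricter variant learns the median from the training rows only. The sketch below illustrates this on a small made-up frame (the `demo` data and variable names are assumptions, not part of the assignment data), using scikit-learn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up stand-in for the Titanic frame (assumption: illustrative values only).
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Age"]], demo["Survived"], test_size=0.25, random_state=0
)

# Learn the median on the training rows only, then reuse it for the test rows.
imputer = SimpleImputer(strategy="median")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print("Median learned from training rows:", imputer.statistics_[0])
```

On a dataset of this size the two approaches usually give very similar fill values; the split-first version mainly keeps the evaluation honest.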

In [114]:
titanic_df = pd.concat([titanic_df, pd.get_dummies(titanic_df.Sex, drop_first=True)], axis=1)
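
The same dummy-encoding step can be applied to other categorical columns such as `Embarked` when trying additional variables. The sketch below uses a small made-up frame (an assumption, not the real data). It also passes `dtype=int`, because recent pandas versions return boolean dummy columns by default rather than the 0/1 integers shown in this notebook.

```python
import pandas as pd

# Made-up rows mimicking two categorical Titanic columns (assumption).
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# drop_first=True drops one level per column to avoid redundant dummies;
# dtype=int keeps the result as 0/1 integers.
dummies = pd.get_dummies(demo[["Sex", "Embarked"]], drop_first=True, dtype=int)
encoded = pd.concat([demo, dummies], axis=1)

print(encoded)
```

When several columns are encoded at once, pandas prefixes each dummy with its source column name, so the sex indicator is called `Sex_male` here instead of `male`.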

In [115]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S | 1 |


In [116]:
titanic_df["male"]

0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: male, Length: 891, dtype: uint8

In [123]:
#Creating X and y variable to use in our model.
y = titanic_df.Survived
X = titanic_df[['Pclass', 'male', 'Age', 'SibSp','Parch', 'Fare']]

#Splitting data into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.20, random_state=42)

### One vs Rest Method

In [124]:
#Creating a logistic regression object
logreg = LogisticRegression(solver='lbfgs', multi_class="ovr")

logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(multi_class='ovr')
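
The warning above means `lbfgs` hit its default iteration limit before converging, and it suggests either raising `max_iter` or scaling the features. A minimal sketch of both remedies follows, assuming synthetic stand-in data (the `X_demo` and `y_demo` arrays are made up, not the Titanic set) and a `StandardScaler` placed in front of the classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with features on very different scales (assumption).
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(1, 80, n)
fare = rng.exponential(30, n)
male = rng.integers(0, 2, n)
X_demo = np.column_stack([age, fare, male])

# Labels loosely tied to the features so there is something to learn.
logits = 0.5 - 1.5 * male + 0.02 * fare - 0.01 * age
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize the features, then fit with a more generous iteration budget.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

n_iter_used = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
print("lbfgs iterations used:", n_iter_used)
```

With the real data, the same pipeline object can be fitted on `X_train` and scored on `X_test` in place of the bare `LogisticRegression`.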

In [125]:
train_accuracy = logreg.score(X_train, y_train)
test_accuracy = logreg.score(X_test, y_test)

print('One-vs.-Rest', '-'*30, 
      'Accuracy on Train Data : {:.2f}'.format(train_accuracy), 
      'Accuracy on Test Data  : {:.2f}'.format(test_accuracy), sep='\n')

One-vs.-Rest
------------------------------
Accuracy on Train Data : 0.80
Accuracy on Test Data  : 0.81


### Multinomial

In [126]:
#Creating a logistic regression object
logreg_mnm = LogisticRegression(solver='lbfgs', multi_class="multinomial")

logreg_mnm.fit(X_train, y_train)

LogisticRegression(multi_class='multinomial')

In [127]:
train_accuracy = logreg_mnm.score(X_train, y_train)
test_accuracy = logreg_mnm.score(X_test, y_test)

print('Multinomial', '-'*30, 
      'Accuracy on Train Data : {:.2f}'.format(train_accuracy), 
      'Accuracy on Test Data  : {:.2f}'.format(test_accuracy), sep='\n')

Multinomial
------------------------------
Accuracy on Train Data : 0.80
Accuracy on Test Data  : 0.81


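**Comment:** Both methods give the same accuracy (0.80 on the training set, 0.81 on the test set). This is unsurprising: `Survived` is binary, so the one-vs-rest and multinomial settings fit essentially the same model. The model is reasonable, but there is room for improvement.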
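
Whether an accuracy of about 0.81 is satisfactory depends on what it is compared with. In train.csv roughly 62% of passengers did not survive, so always predicting 0 already scores about 0.62. Accuracy also says nothing about which class the errors fall on. The sketch below shows one way to check both, using small made-up label arrays (an assumption; with the real model, `y_test` and `logreg.predict(X_test)` would take their place).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions (assumption, for illustration only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Accuracy of always predicting the majority class.
positive_rate = np.mean(y_true)
baseline_accuracy = max(positive_rate, 1 - positive_rate)

# Accuracy of the predictions, and where the errors fall.
accuracy = np.mean(y_true == y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("Baseline accuracy:", baseline_accuracy)
print("Model accuracy   :", accuracy)
print(cm)
print(classification_report(y_true, y_pred))
```

In the confusion matrix, the off-diagonal entries separate survivors predicted as non-survivors from the reverse, which is often more informative than the single accuracy figure.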

## Adding new feature

In [130]:
#Creating an interaction feature by multiplying age and sex, to check whether older male passengers were less likely to survive.
titanic_df["AgexSex"] = titanic_df.Age * titanic_df.male

### One vs Rest

In [131]:
#Creating X and y variable to use in our model.
y = titanic_df.Survived
X = titanic_df[['Pclass', 'male', 'Age', 'SibSp','Parch', 'Fare', 'AgexSex']]

#Splitting data into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.20, random_state=42)

#Creating a logistic regression object
logreg = LogisticRegression(solver='lbfgs', multi_class="ovr")

logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(multi_class='ovr')

In [132]:
train_accuracy = logreg.score(X_train, y_train)
test_accuracy = logreg.score(X_test, y_test)

print('One-vs.-Rest', '-'*30, 
      'Accuracy on Train Data : {:.2f}'.format(train_accuracy), 
      'Accuracy on Test Data  : {:.2f}'.format(test_accuracy), sep='\n')

One-vs.-Rest
------------------------------
Accuracy on Train Data : 0.80
Accuracy on Test Data  : 0.79


### Multinomial

In [133]:
#Creating a logistic regression object
logreg_mnm = LogisticRegression(solver='lbfgs', multi_class="multinomial")

logreg_mnm.fit(X_train, y_train)

LogisticRegression(multi_class='multinomial')

In [134]:
train_accuracy = logreg_mnm.score(X_train, y_train)
test_accuracy = logreg_mnm.score(X_test, y_test)

print('Multinomial', '-'*30, 
      'Accuracy on Train Data : {:.2f}'.format(train_accuracy), 
      'Accuracy on Test Data  : {:.2f}'.format(test_accuracy), sep='\n')

Multinomial
------------------------------
Accuracy on Train Data : 0.80
Accuracy on Test Data  : 0.79


**Comment:** Adding the new feature did not improve the model. Training accuracy stayed at 0.80 and test accuracy dropped from 0.81 to 0.79.
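
A gap of about 0.02 on a single test split of roughly 179 rows corresponds to only a handful of passengers, so it may reflect the particular split rather than the feature itself. Cross-validation gives a steadier comparison. The following is a minimal sketch on synthetic data (the arrays and the `cv_accuracy` helper are assumptions introduced for illustration), comparing a feature set with and without an age-by-sex interaction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic columns (assumption: not the real data).
rng = np.random.default_rng(42)
n = 600
age = rng.uniform(1, 80, n)
male = rng.integers(0, 2, n)
logits = 1.0 - 2.0 * male - 0.01 * age
survived = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

X_base = np.column_stack([age, male])
X_inter = np.column_stack([age, male, age * male])  # adds the interaction term


def cv_accuracy(X, y):
    """Return the 5-fold cross-validated accuracies of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy")


scores_base = cv_accuracy(X_base, survived)
scores_inter = cv_accuracy(X_inter, survived)

print("Without interaction: {:.3f} +/- {:.3f}".format(scores_base.mean(), scores_base.std()))
print("With interaction   : {:.3f} +/- {:.3f}".format(scores_inter.mean(), scores_inter.std()))
```

If the difference between the two means is smaller than the spread across folds, the feature is probably neither helping nor hurting in any meaningful way.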