### Logistic regression for data with multiple features
Logistic regression with sklearn on the Titanic dataset 

***
#### Environment
`conda activate sklearn-env`

***
#### Goals

- Train logistic regression with sklearn
- Predict values from test dataset and compare with test labels

***
#### References
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#### Basic Python imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

#### Load the dataset from https://www.openml.org using the sklearn API

https://www.openml.org/d/40945

If the URL does not work, the dataset can be loaded from these files in the data folder `./data/titanic/`:
- `train.csv`
- `test.csv`
- `gender_submission.csv`
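
A minimal sketch of the local fallback, assuming the files follow the Kaggle Titanic layout, where `train.csv` has capitalized columns such as `Survived`, `Age`, `SibSp`, `Parch` and `Fare`. The helper name `load_local_titanic` is hypothetical. Column names are lowercased so that the rest of the notebook, which uses the OpenML spelling, can stay unchanged.

```python
from pathlib import Path

import pandas as pd


def load_local_titanic(csv_path):
    """Read a Titanic CSV and lowercase its column names.

    Assumes Kaggle-style headers (e.g. 'Age', 'SibSp'); this is a guess
    about the local files, not something the notebook guarantees.
    """
    frame = pd.read_csv(csv_path)
    frame.columns = [name.lower() for name in frame.columns]
    return frame


# Only attempt the load when the file is actually present.
local_csv = Path("./data/titanic/train.csv")
if local_csv.exists():
    raw_dataset = load_local_titanic(local_csv)
```

Note that a Kaggle-style `survived` column would hold integers `0`/`1`, while the OpenML frame may store the label as a categorical column, so downstream label handling might need a small adjustment.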

In [None]:
from sklearn.datasets import fetch_openml

# Load data from https://www.openml.org/d/40945
raw_dataset = fetch_openml("titanic", version=1, as_frame=True).frame
dataset = raw_dataset.copy()
dataset.head(10)

#### Data preparation

Split the data into `training` and `test` datasets

In [None]:
# Keep four numeric features plus the label, and drop rows with missing values.
dataset = dataset[['age', 'sibsp', 'parch', 'fare', 'survived']].dropna().copy()

train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('survived')
test_labels = test_features.pop('survived')

train_dataset.head()
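
The 80/20 split above can also be written with sklearn's `train_test_split`. This is an alternative sketch, not the method used in this notebook. The small frame below is synthetic stand-in data, and `stratify` is an optional extra that keeps the class ratio similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Titanic frame (hypothetical values).
toy = pd.DataFrame({
    "age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "fare": [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1, 16.7, 26.6],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})

features = toy.drop(columns="survived")
labels = toy["survived"]

# 80% train / 20% test, reproducible, with class balance preserved.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)
```

Both approaches produce disjoint training and test rows; the `sample`/`drop` version keeps everything in pandas, while `train_test_split` returns features and labels already separated.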

#### Train sklearn logistic regression algorithm (based on training datasets)

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_regressor = LogisticRegression().fit(train_features, train_labels)
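
To illustrate what the fitted model computes, the sketch below rebuilds the predicted probability by hand: a linear score from `coef_` and `intercept_`, passed through the logistic (sigmoid) function. It uses synthetic data rather than the Titanic frame, and assumes a binary problem with default `LogisticRegression` settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with four features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Linear score z = X @ w + b, then the sigmoid 1 / (1 + exp(-z)).
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by sklearn.
sklearn_proba = model.predict_proba(X)[:, 1]
matches = np.allclose(manual_proba, sklearn_proba)
print(matches)
```

With multiple features, each coefficient in `coef_` weights one input column, which is why the features passed to `predict` must have the same columns, in the same order, as those used in `fit`.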

#### Predict values from test dataset and compare with test labels

In [None]:
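# Predicted class and per-class probabilities for every test row.
scored_test = logistic_regressor.predict(test_features)
scored_test_proba = logistic_regressor.predict_proba(test_features)
test_dataset['Predicted'] = scored_test
# predict_proba columns follow logistic_regressor.classes_ (not survived, survived).
test_dataset[['False Proba', 'True Proba']] = scored_test_proba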


In [None]:
test_dataset.sample(10)
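
Beyond inspecting sampled rows, the comparison with the test labels can be summarized numerically. The sketch below uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics` on a few hypothetical labels, assuming the labels are the strings `'0'` and `'1'`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, written as strings like a categorical `survived` column.
true_labels = ["0", "1", "1", "0", "1"]
predicted_labels = ["0", "1", "0", "0", "1"]

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(true_labels, predicted_labels)

# Rows are true classes, columns are predicted classes, in `labels` order.
matrix = confusion_matrix(true_labels, predicted_labels, labels=["0", "1"])

print(accuracy)
print(matrix)
```

In this notebook the same calls would take `test_labels` and `scored_test` in place of the hypothetical lists.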

In [None]:
# randrange excludes the upper bound, so the position is always valid for iloc.
test_index = random.randrange(len(scored_test))
test_dataset.iloc[[test_index]]