### Logistic Regression

Logistic regression passes a linear combination of the features through the sigmoid function, producing an S-shaped curve bounded between 0 and 1. It is used for binary classification problems.
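To make the "bounded between 0 and 1" idea concrete, here is a minimal sketch of the sigmoid function applied to a few linear scores. The `scores` values and the 0.5 decision threshold are illustrative assumptions, not taken from the notebook's data.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores (w . x + b) for five samples
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

# Predict class 1 when the probability is at least 0.5 (assumed threshold)
predictions = (probabilities >= 0.5).astype(int)

print(probabilities.round(3))
print(predictions)
```

Large negative scores land near 0, large positive scores land near 1, and a score of exactly 0 maps to 0.5.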

When to use logistic regression?
- The target variable is binary
- You want to gauge the importance of individual features
- You need a model that is very fast to train and relatively flexible
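One common way to gauge feature importance with logistic regression is to compare the magnitudes of the fitted coefficients after putting the features on a common scale. The sketch below uses synthetic data (an assumption, not the notebook's data) in which the hypothetical feature `f0` drives the target strongly, `f1` weakly, and `f2` not at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Synthetic binary target: strong effect from f0, weak from f1, none from f2
logits = 3.0 * X[:, 0] + 0.5 * X[:, 1]
y = (logits + rng.logistic(size=500) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

coefs = model.coef_[0]
for name, coef in zip(['f0', 'f1', 'f2'], coefs):
    print('{}: {:.3f}'.format(name, coef))
```

Coefficients are only comparable when the features share a scale, which is why the sketch standardizes first.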

When not to use logistic regression?
- The data is messy (complex relationships, many outliers, missing values)
- The target variable is continuous
- The dataset is massive (lots of rows and/or columns)

In [6]:
from sklearn.linear_model import LogisticRegression

# Explore hyperparameters by printing out the object
LogisticRegression().__dict__

{'penalty': 'l2',
 'dual': False,
 'tol': 0.0001,
 'C': 1.0,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'class_weight': None,
 'random_state': None,
 'solver': 'lbfgs',
 'max_iter': 100,
 'multi_class': 'auto',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None,
 'l1_ratio': None}
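The `__dict__` output above reads the instance's internal attribute dictionary. A small alternative sketch uses `get_params()`, the public scikit-learn estimator method for listing hyperparameters. The exact set of keys can differ between scikit-learn versions.

```python
from sklearn.linear_model import LogisticRegression

# get_params() returns the constructor hyperparameters as a plain dict
params = LogisticRegression().get_params()

for name in sorted(params):
    print('{}: {!r}'.format(name, params[name]))
```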

In [9]:
# Explore attributes
dir(LogisticRegression)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_feature_names',
 '_check_n_features',
 '_estimator_type',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_predict_proba_lr',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_validate_data',
 'decision_function',
 'densify',
 'fit',
 'get_params',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'score',
 'set_params',
 'sparsify']

### Hyperparameters

The C hyperparameter is a regularization parameter that controls how closely the model fits the training data.

Regularization is a technique that reduces overfitting by penalizing overly complex models (here, models with large coefficient values).

C = 1 / lambda, lambda > 0.

High values of lambda (low C): high regularization, low complexity, more likely to underfit.

Low values of lambda (high C): low regularization, high complexity, more likely to overfit.
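The effect of C can be seen by fitting the same model at several values and comparing the size of the learned coefficients. This is a minimal sketch on synthetic data from `make_classification` (an assumption, not the notebook's data), using the default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for illustration only
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

coef_norms = {}
for C in [0.001, 1, 1000]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # L2 norm of the coefficient vector: smaller means a simpler model
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C={}: coefficient norm = {:.4f}'.format(C, coef_norms[C]))
```

A small C (large lambda) shrinks the coefficients toward zero, while a large C lets them grow to fit the training data more closely.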

In [28]:
# For pickling the model
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import warnings

warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('train_features.csv')

# The labels file has a single 'Survived' column; its header row is read by default
tr_labels = pd.read_csv('train_labels.csv')

In [23]:
print(tr_features)

     Pclass  Sex        Age      Fare  Family_cnt  Cabin_ind
0         2    0  62.000000   10.5000           0          0
1         3    0   8.000000   29.1250           5          0
2         3    0  32.000000   56.4958           0          0
3         3    1  20.000000    9.8250           1          0
4         2    1  28.000000   13.0000           0          0
..      ...  ...        ...       ...         ...        ...
529       3    1  21.000000    7.6500           0          0
530       1    0  29.699118   31.0000           0          0
531       3    0  41.000000   14.1083           2          0
532       1    1  14.000000  120.0000           3          1
533       1    0  21.000000   77.2875           1          1

[534 rows x 6 columns]


In [24]:
print(tr_labels)

     Survived
0           1
1           0
2           1
3           0
4           1
..        ...
529         1
530         0
531         0
532         1
533         0

[534 rows x 1 columns]


In [31]:
tr_labels.values.ravel()

array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
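The `ravel()` call above matters because scikit-learn's `fit` expects the labels as a 1-D array, while a single-column DataFrame yields a 2-D column vector. A small sketch with a hypothetical five-row labels frame shows the shape change:

```python
import pandas as pd

# Hypothetical miniature version of the labels DataFrame
labels = pd.DataFrame({'Survived': [1, 0, 1, 0, 1]})

column_vector = labels.values   # 2-D array, one column
flat = labels.values.ravel()    # 1-D array, the shape fit() expects for y

print(column_vector.shape, flat.shape)
```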

### Hyperparameter Tuning

In [12]:
def print_results(results):
    """Print the best parameters and the mean CV score for every candidate."""
    print('Best parameters: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']

    # Report each mean score with a +/- 2 standard deviation spread across folds
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [33]:
lr = LogisticRegression(max_iter=1000)

parameters = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

Best parameters: {'C': 1}

0.67 (+/-0.077) for {'C': 0.001}
0.708 (+/-0.098) for {'C': 0.01}
0.777 (+/-0.134) for {'C': 0.1}
0.8 (+/-0.118) for {'C': 1}
0.794 (+/-0.116) for {'C': 10}
0.794 (+/-0.116) for {'C': 100}
0.794 (+/-0.116) for {'C': 1000}
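The same grid-search results can also be viewed as a table, which makes it easier to sort and compare candidates. This is a sketch on synthetic data (an assumption, not the Titanic features). The column names are keys that `GridSearchCV` stores in `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    cv=5,
)
search.fit(X, y)

# Keep only the columns of interest and order candidates by rank
columns = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
table = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(table.to_string(index=False))
```

When several values of C tie, as the three largest values do in the output above, the scores have plateaued and further weakening of regularization changes little.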


In [34]:
# Save the best model
joblib.dump(cv.best_estimator_, 'LR_model.pkl')

['LR_model.pkl']
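Loading the saved model back is the counterpart of `joblib.dump`. The sketch below round-trips a small stand-in model through a temporary directory (both are assumptions for illustration) and checks that the restored model predicts identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for cv.best_estimator_, trained on synthetic data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'LR_model.pkl')
    joblib.dump(model, path)        # save to disk
    restored = joblib.load(path)    # load it back

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Pickled models should be loaded with the same scikit-learn version that saved them, since the on-disk format is not guaranteed to be compatible across versions.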