# Logistic Regression

Logistic regression is a simple yet very useful linear model for predicting and analysing data in which the endogenous variable (i.e., the dependent or explained variable, or simply the variable we want to predict; in this case, whether an email is spam or not) is dichotomous, taking only the values 0 and 1.
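
As a minimal sketch of how such a model turns predictors into a 0/1 prediction: a linear score (the log-odds) is passed through the logistic (sigmoid) function to obtain a probability, which is then thresholded. The coefficients, intercept and feature values below are hypothetical, chosen only for illustration; they are not fitted to the spam dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of class 1 for one observation under a logistic model."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

# Hypothetical coefficients and feature values (illustration only).
weights = [1.5, -0.5]
intercept = -1.0
email = [2.0, 1.0]

p_spam = predict_proba(email, weights, intercept)
label = int(p_spam >= 0.5)  # predict spam when the probability reaches 0.5
print(p_spam, label)
```

The 0.5 threshold is the usual default; a different cut-off can be chosen when false positives and false negatives have different costs.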

Some advantages and disadvantages of using logistic regression are:

Pros:
- Simple and straightforward.
- Easy to interpret the effects of multiple independent variables on the dependent variable.
- As it provides a probability estimate for each prediction, it is very useful in decision-making.

Cons:
- Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which may not hold true in all situations.
- It is sensitive to outliers and influential data points, which can affect the model's accuracy.
- It can suffer from overfitting if there are too many independent variables relative to the number of observations in the dataset.
---

In [2]:
# import libraries and read dataset
import pandas as pd
import numpy as np
np.random.seed(1)

df = pd.read_csv('output/spam_email.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Basic Train-Test Split 

Split the data into a training sample with which the model will learn, and a validation (testing) sample to test its accuracy. By default, the train_test_split function sets the test sample size to 25% of the total number of observations, leaving the remaining 75% for training.
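
The default proportions can be checked on a small toy array. This is only a sketch: `X_demo` and `y_demo` are made-up stand-ins for the predictors and the target, and the `test_size=0.2` call simply shows how the default can be overridden.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 observations, 3 predictors.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100) % 2

# Default behaviour: 25% of the rows go to the test sample.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(len(X_tr), len(X_te))

# Explicit split: 20% of the rows go to the test sample.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(len(X_tr2), len(X_te2))
```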

In [3]:
# splitting test and train samples
from sklearn.model_selection import train_test_split

df_predictors = df.drop('spam', axis = 1)
df_predicted = df['spam']

X_train, X_test, y_train, y_test = train_test_split(df_predictors, df_predicted)

In [4]:
# training the model 
from sklearn.linear_model import LogisticRegression

LRmodel = LogisticRegression(solver='lbfgs', max_iter=2000)
LRmodel.fit(X_train, y_train)

In [10]:
# prediction score
score = LRmodel.score(X_test, y_test)

# error (for 0/1 labels, the MSE equals the misclassification rate, 1 - accuracy)
from sklearn.metrics import mean_squared_error
error = mean_squared_error(y_test, LRmodel.predict(X_test))

print('model accuracy:', score)
print('mean squared error:', error)

model accuracy: 0.9348392701998263
mean squared error: 0.06516072980017376


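This means the model correctly classifies about 93.48% of the emails in the test sample, spam and non-spam alike. Note that accuracy is not the share of spam emails detected (that would be the recall for the spam class). Because the target only takes the values 0 and 1, the mean squared error is simply the misclassification rate, i.e. 1 - accuracy.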
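
The relationship between these quantities can be illustrated with a small sketch. The labels below are hypothetical, not taken from the spam dataset; they only show that, for 0/1 labels, accuracy and mean squared error sum to 1, while the share of spam detected (recall) can be quite different from the accuracy.

```python
import numpy as np

# Toy labels (hypothetical, for illustration only): 1 = spam, 0 = not spam.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# Share of all emails classified correctly.
accuracy = np.mean(y_true == y_pred)

# With 0/1 labels every squared error is 0 or 1, so MSE = share of mistakes.
mse = np.mean((y_true - y_pred) ** 2)

# Share of the actual spam emails that were flagged as spam.
spam_recall = np.mean(y_pred[y_true == 1] == 1)

print(accuracy, mse, spam_recall)
```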

# Resampling methods

Sometimes the original train-test split may yield results that are not representative of the real accuracy of the model. To ensure that these results are not merely a coincidence that depends on the way the samples were split, we apply resampling methods. 
Resampling consists of splitting the data randomly into multiple training and testing samples, often shuffling the data, to obtain a more reliable validation. 



## K-Fold Cross Validation

Cross-validation consists of splitting the data into many different training and testing samples and checking the model's accuracy on each. In practice, this means fitting the model once per train-test split. A commonly used method is K-Fold, which splits the dataset into k (nearly) equally sized parts; each part serves once as the test sample while the remaining k - 1 parts are used for training.
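
The partitioning idea can be sketched with NumPy alone. The helper `kfold_indices` below is a hypothetical, simplified stand-in written for illustration; it is not scikit-learn's implementation. With 4601 observations and 20 folds it yields one test fold of 231 rows and nineteen of 230.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs that partition the row indices into n_splits folds."""
    indices = np.arange(n_samples)
    if seed is not None:
        # Optional shuffling, so that folds do not follow the file order.
        np.random.default_rng(seed).shuffle(indices)
    # array_split tolerates n_samples not being divisible by n_splits.
    folds = np.array_split(indices, n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

sizes = [(len(tr), len(te)) for tr, te in kfold_indices(4601, 20, seed=1)]
print(sizes[0], sizes[1])
```

Every observation ends up in exactly one test fold, so each row is used for testing once and for training k - 1 times.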


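We can easily retrieve the score of each fold, and hence the average score, with the cross_val_score function, which automates the process of using K-Fold to perform cross-validation (for a classifier and an integer cv, scikit-learn uses the stratified variant, StratifiedKFold). 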

In [11]:
from sklearn.model_selection import cross_val_score

# we specify 20 folds, i.e., 20 train-test splits and 20 fitted models
scores = cross_val_score(LRmodel, X_train, y_train, 
                         cv = 20)

In [12]:
def display_scores(scores):
    print("Scores:", scores)
    print("\nMean:", scores.mean())
    print("\nStd deviation:", scores.std())

display_scores(scores)

Scores: [0.9017341  0.94797688 0.89017341 0.90751445 0.93063584 0.90751445
 0.94219653 0.93063584 0.89017341 0.9017341  0.90697674 0.94186047
 0.93023256 0.93023256 0.93604651 0.93604651 0.94186047 0.95348837
 0.93604651 0.94767442]

Mean: 0.9255377066810058

Std deviation: 0.019546376020078658


---
Alternatively, we can do it manually with the KFold class. Note that this version splits the full dataset rather than only the training sample:

In [8]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=20, shuffle = True, random_state = 1)

r_train = []
r_test = []
accuracy = []
for train, test in kf.split(df):
    # print('train: ', train.shape[0])
    # print('test: ', test.shape[0])
    r_train.append(train.shape[0])
    r_test.append(test.shape[0])

    X_traink, X_testk = df_predictors.iloc[train], df_predictors.iloc[test]
    y_traink, y_testk = df_predicted.iloc[train], df_predicted.iloc[test]

    LRmodel = LogisticRegression(solver='lbfgs', max_iter=3000)
    LRmodel.fit(X_traink, y_traink)

    # print('accuracy', LRmodel.score(X_testk, y_testk))
    # print()

    acc = LRmodel.score(X_testk, y_testk)
    accuracy.append(acc)

results = pd.DataFrame()
results['train'] = r_train
results['test'] = r_test
results['accuracy'] = accuracy

print(results)
print('average accuracy:', results['accuracy'].mean())
print()

    train  test  accuracy
0    4370   231  0.922078
1    4371   230  0.934783
2    4371   230  0.917391
3    4371   230  0.956522
4    4371   230  0.939130
5    4371   230  0.926087
6    4371   230  0.913043
7    4371   230  0.926087
8    4371   230  0.921739
9    4371   230  0.921739
10   4371   230  0.943478
11   4371   230  0.873913
12   4371   230  0.913043
13   4371   230  0.926087
14   4371   230  0.930435
15   4371   230  0.934783
16   4371   230  0.939130
17   4371   230  0.939130
18   4371   230  0.934783
19   4371   230  0.956522
average accuracy: 0.9284952004517221



## Repeated K-Fold
RepeatedKFold is a variant of KFold that repeats the process n times, where n is specified by the user, using a different random split in each repetition. Averaging across repetitions gives a more robust estimate of model performance than a single K-Fold run. The cell below reproduces the idea by hand: it runs a shuffled 20-fold KFold four times, changing random_state on each pass, and then averages the mean accuracies.
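
For comparison, scikit-learn's own RepeatedKFold class can be passed directly to cross_val_score. The sketch below uses synthetic data from make_classification as a stand-in (an assumption; `df_predictors` and `df_predicted` could be used instead) and smaller settings than the manual loop so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the spam dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

# 5 folds repeated 4 times, each repetition with a different shuffle.
rkf = RepeatedKFold(n_splits=5, n_repeats=4, random_state=1)
model = LogisticRegression(solver='lbfgs', max_iter=2000)

# One accuracy score per fold and repetition: 5 * 4 = 20 scores.
rep_scores = cross_val_score(model, X_demo, y_demo, cv=rkf)

print(len(rep_scores))
print(rep_scores.mean(), rep_scores.std())
```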

In [9]:
from sklearn.model_selection import KFold

acc_mean = []
for rep in range(1, 5): 
    kf = KFold(n_splits=20, shuffle = True, random_state = rep)

    r_train = []
    r_test = []
    accuracy = []
    for train, test in kf.split(df):
        # print('train: ', train.shape[0])
        # print('test: ', test.shape[0])
        r_train.append(train.shape[0])
        r_test.append(test.shape[0])

        X_traink, X_testk = df_predictors.iloc[train], df_predictors.iloc[test]
        y_traink, y_testk = df_predicted.iloc[train], df_predicted.iloc[test]

        LRmodel = LogisticRegression(solver='lbfgs', max_iter=3000)
        LRmodel.fit(X_traink, y_traink)

        # print('accuracy', LRmodel.score(X_testk, y_testk))
        # print()

        acc = LRmodel.score(X_testk, y_testk)
        accuracy.append(acc)

    results = pd.DataFrame()
    results['train'] = r_train
    results['test'] = r_test
    results['accuracy'] = accuracy
    
    print('random_state', rep)
    print(results)
    print('average accuracy:', results['accuracy'].mean())
    print()
    ac = results['accuracy'].mean()
    acc_mean.append(ac)

print('final average accuracy across repetitions:', np.mean(acc_mean))

random_state 1
    train  test  accuracy
0    4370   231  0.922078
1    4371   230  0.934783
2    4371   230  0.917391
3    4371   230  0.956522
4    4371   230  0.939130
5    4371   230  0.926087
6    4371   230  0.913043
7    4371   230  0.926087
8    4371   230  0.921739
9    4371   230  0.921739
10   4371   230  0.943478
11   4371   230  0.873913
12   4371   230  0.913043
13   4371   230  0.926087
14   4371   230  0.930435
15   4371   230  0.934783
16   4371   230  0.939130
17   4371   230  0.939130
18   4371   230  0.934783
19   4371   230  0.956522
average accuracy: 0.9284952004517221

random_state 2
    train  test  accuracy
0    4370   231  0.948052
1    4371   230  0.960870
2    4371   230  0.886957
3    4371   230  0.917391
4    4371   230  0.913043
5    4371   230  0.913043
6    4371   230  0.956522
7    4371   230  0.930435
8    4371   230  0.960870
9    4371   230  0.926087
10   4371   230  0.956522
11   4371   230  0.891304
12   4371   230  0.917391
13   4371   230  0.917