You are working as an analyst at an internet service provider. You are given historical data about the company's customers and their churn behavior. Your task is to build a machine learning model that helps the company identify customers who are likely to churn, so that it can act before losing them.

Here is the list of steps to be followed (building a simple model without balancing the data):

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [3]:
churnData = pd.read_csv('Customer-Churn.txt')
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


# Data cleaning

## Data types

- Check the datatypes of all the columns in the data. You will see that the column `TotalCharges` has dtype object. Convert it to a numeric type using the `pd.to_numeric` function.

In [4]:
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   object 
 15  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory

In [5]:
churnData.sort_values('TotalCharges').head(20)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
936,Female,0,Yes,Yes,0,Yes,Yes,Yes,Yes,No,Yes,Yes,Two year,80.85,,No
3826,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,25.35,,No
4380,Female,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,20.0,,No
753,Male,0,No,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,20.25,,No
5218,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,19.7,,No
3331,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,19.85,,No
6754,Male,0,No,Yes,0,Yes,Yes,Yes,No,Yes,No,No,Two year,61.9,,No
6670,Female,0,Yes,Yes,0,Yes,No,Yes,Yes,Yes,Yes,No,Two year,73.35,,No
1340,Female,0,Yes,Yes,0,No,Yes,Yes,Yes,Yes,Yes,No,Two year,56.05,,No
488,Female,0,Yes,Yes,0,No,Yes,No,Yes,Yes,Yes,No,Two year,52.55,,No


In [6]:
churnData['TotalCharges'].min()

' '

In [7]:
# replace the blank strings (' ') in TotalCharges with NaN
churnData['TotalCharges'] = np.where(churnData['TotalCharges']==' ', np.nan, churnData['TotalCharges'])
churnData.sort_values('TotalCharges').head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
105,Male,0,No,No,5,No,No,No,No,No,No,No,Month-to-month,24.3,100.2,No
4459,Female,0,No,No,1,Yes,No,Yes,No,No,Yes,Yes,Month-to-month,100.25,100.25,Yes
1723,Female,0,Yes,Yes,6,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,20.1,100.35,No
2124,Male,0,Yes,No,5,No,No,No,No,No,No,No,Month-to-month,24.95,100.4,Yes
2208,Female,1,Yes,No,1,Yes,No,No,Yes,No,Yes,Yes,Month-to-month,100.8,100.8,Yes


In [8]:
# with the blank strings replaced by NaN, the column can now be converted to numeric
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'])
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7032 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory
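
The two-step conversion above (blank string to NaN, then `pd.to_numeric`) can probably be collapsed into a single call. A minimal sketch on a toy Series standing in for `TotalCharges`; the values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus one blank entry
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed (here the blank) into NaN
converted = pd.to_numeric(total_charges, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())  # number of values that could not be parsed
```

One caveat with this shortcut: `errors="coerce"` silently hides every unparseable value, so it is worth checking the NaN count afterwards, as done in the next section.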

## Null values

- Check for null values in the dataframe. Replace the null values.

In [9]:
churnData.isna().sum()

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [10]:
# fill the NaN values created during the type conversion of 'TotalCharges' with the column mean
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(churnData['TotalCharges'].mean())
churnData.isna().sum().sum()

0

# Data imbalance

**Note**: So far we have not balanced the data.

Managing imbalance in the dataset

- Check for the imbalance.

- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

- After each resampling, fit the model and check its accuracy.

In [11]:
churnData['Churn'].value_counts()

Churn
No     5174
Yes    1869
Name: count, dtype: int64

In [12]:
sum(churnData['Churn']=='Yes') / churnData['Churn'].value_counts().sum()

0.2653698707936959

Our target class is underrepresented (about 27% of customers churned). Let's first fit a model before balancing so we can compare the results later.
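
The same proportion can be read directly from `value_counts` with `normalize=True`. A small sketch that rebuilds the class counts shown above in a standalone Series (the Series is reconstructed from the printed counts, not loaded from the file):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# normalize=True returns class proportions instead of raw counts
proportions = churn.value_counts(normalize=True)
print(proportions.round(4))
```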

## Data Cleaning

In [13]:
churnData.info() # check the types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   float64
 15  Churn             7043 non-null   object 
dtypes: float64(2), int64(2), object(12)
memory

In [14]:
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [15]:
churnData['Churn'] = np.where(churnData['Churn']=='No', 0, 1)
churnData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  TotalCharges      7043 non-null   float64
 15  Churn             7043 non-null   int32  
dtypes: float64(2), int32(1), int64(2), object(

In [16]:
# split the features into categorical and numerical
numerical = churnData.select_dtypes(np.number)
numerical = numerical.drop(columns='Churn')
categorical = churnData.select_dtypes('object')
categorical

Unnamed: 0,gender,Partner,Dependents,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract
0,Female,Yes,No,No,No,Yes,No,No,No,No,Month-to-month
1,Male,No,No,Yes,Yes,No,Yes,No,No,No,One year
2,Male,No,No,Yes,Yes,Yes,No,No,No,No,Month-to-month
3,Male,No,No,No,Yes,No,Yes,Yes,No,No,One year
4,Female,No,No,Yes,No,No,No,No,No,No,Month-to-month
...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,One year
7039,Female,Yes,Yes,Yes,No,Yes,Yes,No,Yes,Yes,One year
7040,Female,Yes,Yes,No,Yes,No,No,No,No,No,Month-to-month
7041,Male,Yes,No,Yes,No,No,No,No,No,No,Month-to-month


In [17]:
categorical = pd.get_dummies(categorical, drop_first=True)
categorical = categorical.astype(int) # we want the features to be integers and not booleans
categorical

Unnamed: 0,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,OnlineBackup_Yes,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year
0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0
2,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0
4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,1,1,1,1,0,1,0,0,0,1,0,1,0,1,0,1,1,0
7039,0,1,1,1,0,0,0,1,0,1,0,0,0,1,0,1,1,0
7040,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7041,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Model without balancing the data

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# classification metrics only; regression metrics (MSE, R2) do not apply to this task
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix

Gather all the preprocessing steps in a **single cell**

In [19]:
# split the features into categorical and numerical
numerical = churnData.select_dtypes(np.number)
numerical = numerical.drop(columns='Churn')
categorical = churnData.select_dtypes('object')

# get dummies
categorical = pd.get_dummies(categorical, drop_first=True)
categorical = categorical.astype(int) # we want the features to be integers and not booleans

# concat all the features
X = pd.concat([categorical, numerical], axis=1)
y = churnData['Churn']
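
The preprocessing cell above could be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's own code: `build_features` is a hypothetical name, the toy frame is made up, and the categorical columns are selected with `exclude=np.number` instead of `'object'` (an assumption made so the sketch does not depend on how pandas types string columns).

```python
import numpy as np
import pandas as pd


def build_features(df, target="Churn"):
    """Return (X, y): one-hot encoded categoricals plus the numeric columns.

    Hypothetical helper; assumes the target column is already numeric (0/1).
    """
    numerical = df.select_dtypes(np.number).drop(columns=target)
    categorical = df.select_dtypes(exclude=np.number)
    # drop_first=True removes one redundant dummy column per feature
    categorical = pd.get_dummies(categorical, drop_first=True).astype(int)
    X = pd.concat([categorical, numerical], axis=1)
    y = df[target]
    return X, y


# Tiny made-up frame in the same column style as churnData
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": [0, 0, 1],
})
X_toy, y_toy = build_features(toy)
print(X_toy.columns.tolist())
```

With the real data the call would simply be `X, y = build_features(churnData)`.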

In [20]:
# split the data into training and test sets (this baseline split is not stratified, unlike the later ones)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [21]:
X_train.shape

(5634, 22)

In [22]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.8048261178140526


In [23]:
y_predicted = model.predict(X_test)
confusion_matrix(y_test, y_predicted)

array([[929, 107],
       [168, 205]], dtype=int64)

In [24]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87      1036
           1       0.66      0.55      0.60       373

    accuracy                           0.80      1409
   macro avg       0.75      0.72      0.73      1409
weighted avg       0.80      0.80      0.80      1409
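
The class 1 row of the report can be reproduced by hand from the confusion matrix, which helps when reading the later reports. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the unbalanced baseline above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = np.array([[929, 107],
               [168, 205]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

For a churn use case, recall on class 1 is often the number to watch, since a missed churner is a customer the company never gets the chance to retain.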



## Undersampling

In [25]:
from sklearn.utils import shuffle

In [26]:
# split the features to train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

# Perform undersampling ONLY on the training data
category_0 = X_train[y_train == 0]
category_1 = X_train[y_train == 1]
category_0 = category_0.sample(len(category_1), random_state=42)

y_category_0 = y_train[category_0.index]
X_train = pd.concat([category_0, category_1], axis=0)
y_train = pd.concat([y_category_0, y_train[category_1.index]])

# Shuffle training data
X_train, y_train = shuffle(X_train, y_train, random_state=42)

In [28]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7444996451383961


Accuracy has decreased compared with the unbalanced baseline (0.74 vs. 0.80). Let's look at the rest of the metrics.

In [29]:
y_predicted = model.predict(X_test)
confusion_matrix(y_test, y_predicted)

array([[751, 284],
       [ 76, 298]], dtype=int64)

In [30]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.91      0.73      0.81      1035
           1       0.51      0.80      0.62       374

    accuracy                           0.74      1409
   macro avg       0.71      0.76      0.72      1409
weighted avg       0.80      0.74      0.76      1409



## Oversampling

### Simple method

In [31]:
# split the features to train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

# Perform oversampling ONLY on the training data
category_0 = X_train[y_train == 0]
category_1 = X_train[y_train == 1]
category_1 = category_1.sample(len(category_0), random_state=42, replace=True)

y_category_0 = y_train[category_0.index]
X_train = pd.concat([category_0, category_1], axis=0)
y_train = pd.concat([y_category_0, y_train[category_1.index]])

# Shuffle training data
X_train, y_train = shuffle(X_train, y_train, random_state=42)
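
The undersampling and oversampling cells follow the same pattern, so they could share one helper. This is only a sketch: `random_rebalance` is a hypothetical name, the toy data is made up, and it assumes `X_train` and `y_train` share a unique index, as they do right after `train_test_split`.

```python
import pandas as pd


def random_rebalance(X_train, y_train, mode="under", random_state=42):
    """Randomly under- or oversample a binary training set to equal class sizes.

    Hypothetical helper; assumes X_train and y_train share a unique index.
    """
    counts = y_train.value_counts()
    X_min = X_train[y_train == counts.idxmin()]
    X_maj = X_train[y_train == counts.idxmax()]
    if mode == "under":
        # shrink the majority class down to the minority size
        X_maj = X_maj.sample(len(X_min), random_state=random_state)
    else:
        # grow the minority class by sampling with replacement
        X_min = X_min.sample(len(X_maj), random_state=random_state, replace=True)
    # concatenate and shuffle, then look the labels up by index
    X_bal = pd.concat([X_maj, X_min], axis=0).sample(frac=1, random_state=random_state)
    y_bal = y_train.loc[X_bal.index]
    return X_bal, y_bal


# Made-up training set: six rows of class 0, two rows of class 1
X_toy = pd.DataFrame({"tenure": [1, 5, 9, 12, 30, 48, 2, 3]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="Churn")

X_under, y_under = random_rebalance(X_toy, y_toy, mode="under")
X_over, y_over = random_rebalance(X_toy, y_toy, mode="over")
print(y_under.value_counts().to_dict())
print(y_over.value_counts().to_dict())
```

As in the cells above, the helper should only ever see the training split; the test set keeps its original class ratio.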

In [32]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7402413058907026


Accuracy has decreased here as well (0.74 vs. 0.80 for the unbalanced baseline). Let's look at the rest of the metrics.

In [33]:
y_predicted = model.predict(X_test)
confusion_matrix(y_test, y_predicted)

array([[745, 290],
       [ 76, 298]], dtype=int64)

In [34]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.91      0.72      0.80      1035
           1       0.51      0.80      0.62       374

    accuracy                           0.74      1409
   macro avg       0.71      0.76      0.71      1409
weighted avg       0.80      0.74      0.75      1409



### SMOTE

In [35]:
from imblearn.over_sampling import SMOTE

In [36]:
# split the features to train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

In [37]:
smote = SMOTE(random_state=42)
X_sm, y_sm = smote.fit_resample(X_train, y_train)

In [38]:
model = LogisticRegression(random_state=42)
model.fit(X_sm, y_sm)
acc = model.score(X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.759403832505323


Accuracy is a bit better than with random resampling (0.76 vs. 0.74). Let's look at the rest of the metrics.

In [39]:
y_predicted = model.predict(X_test)
confusion_matrix(y_test, y_predicted)

array([[819, 216],
       [123, 251]], dtype=int64)

In [40]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.87      0.79      0.83      1035
           1       0.54      0.67      0.60       374

    accuracy                           0.76      1409
   macro avg       0.70      0.73      0.71      1409
weighted avg       0.78      0.76      0.77      1409



### TomekLinks

In [41]:
from imblearn.under_sampling import TomekLinks

In [42]:
# split the features to train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)

In [43]:
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X_train, y_train)

In [44]:
y_tl.value_counts()

Churn
0    3677
1    1495
Name: count, dtype: int64

TomekLinks does not equalize the two classes. It only removes majority-class points that form a Tomek link, that is, a pair of points from opposite classes that are each other's nearest neighbor.
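
To make the idea concrete, here is a from-scratch illustration on made-up 1-D data. It is a sketch of the concept using scikit-learn's `NearestNeighbors`, not the imbalanced-learn implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 1-D data (made-up values): class 0 is the majority, class 1 the minority
X_toy = np.array([[0.0], [1.2], [2.0], [2.9], [3.1], [5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1])

# n_neighbors=2 because each point's closest neighbour is the point itself
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)
nearest = nn.kneighbors(X_toy, return_distance=False)[:, 1].tolist()

# i and j form a Tomek link if they are mutual nearest neighbours with different labels
links = [(i, j) for i, j in enumerate(nearest)
         if nearest[j] == i and y_toy[i] != y_toy[j] and i < j]
print(links)

# Tomek-link undersampling drops the majority-class member of each link
majority_in_links = [i if y_toy[i] == 0 else j for i, j in links]
keep = np.setdiff1d(np.arange(len(X_toy)), majority_in_links)
print(y_toy[keep])
```

Only the majority point sitting right next to a minority point is removed, which is why the class counts above move closer together without becoming equal.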

In [45]:
model = LogisticRegression(random_state=42)
model.fit(X_tl, y_tl)
acc = model.score(X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7913413768630234


Accuracy is a bit better again (0.79). Let's look at the rest of the metrics.

In [46]:
y_predicted = model.predict(X_test)
confusion_matrix(y_test, y_predicted)

array([[874, 161],
       [133, 241]], dtype=int64)

In [47]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.87      0.84      0.86      1035
           1       0.60      0.64      0.62       374

    accuracy                           0.79      1409
   macro avg       0.73      0.74      0.74      1409
weighted avg       0.80      0.79      0.79      1409



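So far **TomekLinks** has given the **best accuracy** (0.79) among the resampling methods, although random under- and oversampling reach a higher recall on churners (0.80 vs. 0.64).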

# Scaling

In [48]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

- Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:

  - Scale the features either by using normalizer or a standard scaler.

  - Split the data into a training set and a test set.

  - Fit a logistic regression model on the training data.

  - Check the accuracy on the test data.

In [49]:
churnData.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [50]:
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']

In [51]:
X_features = X[features]
X_features

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,TotalCharges
0,1,0,29.85,29.85
1,34,0,56.95,1889.50
2,2,0,53.85,108.15
3,45,0,42.30,1840.75
4,2,0,70.70,151.65
...,...,...,...,...
7038,24,0,84.80,1990.50
7039,72,0,103.20,7362.90
7040,11,0,29.60,346.45
7041,4,1,74.40,306.60


In [55]:
# split the features to train the model
X_train, X_test, y_train, y_test = train_test_split(X_features, y, train_size=0.8, stratify=y, random_state=42)

## Normalizer

In [56]:
normalizer = Normalizer()
normalized_X_train = normalizer.fit_transform(X_train)
normalized_X_test = normalizer.transform(X_test)

In [57]:
model = LogisticRegression(random_state=42)
model.fit(normalized_X_train, y_train)
acc = model.score(normalized_X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7686302342086586


In [58]:
y_predicted = model.predict(normalized_X_test)
confusion_matrix(y_test, y_predicted)

array([[977,  58],
       [268, 106]], dtype=int64)

In [59]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.78      0.94      0.86      1035
           1       0.65      0.28      0.39       374

    accuracy                           0.77      1409
   macro avg       0.72      0.61      0.63      1409
weighted avg       0.75      0.77      0.73      1409
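
One thing to keep in mind when reading these numbers: `Normalizer` rescales each row (customer) to unit norm, whereas `StandardScaler` rescales each column (feature). That difference may be part of the reason churn recall is so low here, though this was not verified. A small sketch on made-up values showing what each transformer does:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Two made-up customers; columns: tenure, SeniorCitizen, MonthlyCharges, TotalCharges
X_demo = np.array([[1.0, 0.0, 30.0, 30.0],
                   [72.0, 0.0, 100.0, 7200.0]])

row_scaled = Normalizer().fit_transform(X_demo)      # each ROW divided by its L2 norm
col_scaled = StandardScaler().fit_transform(X_demo)  # each COLUMN shifted to mean 0

print(np.linalg.norm(row_scaled, axis=1))  # length of every row after Normalizer
print(col_scaled.mean(axis=0))             # mean of every column after StandardScaler
```

After `Normalizer`, a large `TotalCharges` value dominates its row and pushes the other features in that row toward zero, so information about absolute scale is lost.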



## Standard Scaler

In [60]:
standardizer = StandardScaler()
standardized_X_train = standardizer.fit_transform(X_train)
standardized_X_test = standardizer.transform(X_test)

In [61]:
model = LogisticRegression(random_state=42)
model.fit(standardized_X_train, y_train)
acc = model.score(standardized_X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7842441447835344


In [62]:
y_predicted = model.predict(standardized_X_test)
confusion_matrix(y_test, y_predicted)

array([[935, 100],
       [204, 170]], dtype=int64)

In [63]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1035
           1       0.63      0.45      0.53       374

    accuracy                           0.78      1409
   macro avg       0.73      0.68      0.69      1409
weighted avg       0.77      0.78      0.77      1409
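
The fit-on-train, transform-on-test pattern used above can also be expressed as a scikit-learn pipeline, which makes it harder to fit the scaler on test data by accident. The sketch below runs on synthetic random data rather than the churn dataset, so its accuracy says nothing about the results in this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (random, NOT the churn dataset): 4 features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only and reuses it at predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)
print(f"Accuracy: {pipe.score(X_te, y_te):.3f}")
```

With the notebook's data, the equivalent call would be `pipe.fit(X_train, y_train)` followed by `pipe.score(X_test, y_test)`.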



Normalizing performs worse than standardizing (accuracy 0.77 vs. 0.78, churn recall 0.28 vs. 0.45), so let's try standardizing on data balanced with TomekLinks, the resampling method that gave the best accuracy.

## TomekLinks + Standard Scaler

In [64]:
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X_train, y_train)

In [65]:
standardizer = StandardScaler()
standardized_X_train = standardizer.fit_transform(X_tl)
standardized_X_test = standardizer.transform(X_test)

In [66]:
model = LogisticRegression(random_state=42)
model.fit(standardized_X_train, y_tl)
acc = model.score(standardized_X_test, y_test)
print(f'Accuracy: {acc}')

Accuracy: 0.7778566359119943


In [67]:
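# Only TomekLinks, without standardizing the data
# (this figure does not match the TomekLinks section above, which reports 0.7913;
#  it appears to come from an earlier run on a different split)
# Accuracy: 0.8112136266855926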

In [68]:
y_predicted = model.predict(standardized_X_test)
confusion_matrix(y_test, y_predicted)

array([[892, 143],
       [170, 204]], dtype=int64)

In [69]:
# Only TomekLinks, without standardizing the data
# array([[895, 141],
#        [125, 248]], dtype=int64)

In [70]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.84      0.86      0.85      1035
           1       0.59      0.55      0.57       374

    accuracy                           0.78      1409
   macro avg       0.71      0.70      0.71      1409
weighted avg       0.77      0.78      0.78      1409



In [71]:
# Only TomekLinks, without standardizing the data
#               precision    recall  f1-score   support

#            0       0.88      0.86      0.87      1036
#            1       0.64      0.66      0.65       373

#     accuracy                           0.81      1409
#    macro avg       0.76      0.76      0.76      1409
# weighted avg       0.81      0.81      0.81      1409