# Accuracy Score

<blockquote>Accuracy is the number of correct predictions divided by the total number of predictions.

<br>For binary classification, it is the number of true positives plus the number of true negatives, divided by the total number of predictions.</blockquote>
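
As a minimal sketch of the definition above, the snippet below computes accuracy from the four confusion-matrix counts. The labels are invented for illustration and are not taken from the dataset used later, and the helper name `accuracy_from_counts` is hypothetical.

```python
# Toy labels, invented for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]


def accuracy_from_counts(y_true, y_pred):
    """Return (TP + TN) / (TP + TN + FP + FN) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return (tp + tn) / (tp + tn + fp + fn)


toy_accuracy = accuracy_from_counts(y_true, y_pred)
print(toy_accuracy)  # 6 correct out of 8 -> 0.75
```

For 0/1 labels like these, `sklearn.metrics.accuracy_score(y_true, y_pred)` should return the same value.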

---

## Libraries

In [20]:
import numpy as np
import pandas as pd
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

warnings.filterwarnings("ignore")

---

## Import Data

In [3]:
data = pd.read_csv("../kdd2004.csv")

In [4]:
data["target"] = data["target"].map({-1:0, 1:1})

data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,target
0,52.0,32.69,0.3,2.5,20.0,1256.8,-0.89,0.33,11.0,-55.0,...,1595.1,-1.64,2.83,-2.0,-50.0,445.2,-0.35,0.26,0.76,0
1,58.0,33.33,0.0,16.5,9.5,608.1,0.5,0.07,20.5,-52.5,...,762.9,0.29,0.82,-3.0,-35.0,140.3,1.16,0.39,0.73,0
2,77.0,27.27,-0.91,6.0,58.5,1623.6,-1.4,0.02,-6.5,-48.0,...,1491.8,0.32,-1.29,0.0,-34.0,658.2,-0.76,0.26,0.24,0
3,41.0,27.91,-0.35,3.0,46.0,1921.6,-1.36,-0.47,-32.0,-51.5,...,2047.7,-0.98,1.53,0.0,-49.0,554.2,-0.83,0.39,0.73,0
4,50.0,28.0,-1.32,-9.0,12.0,464.8,0.88,0.19,8.0,-51.5,...,479.5,0.68,-0.59,2.0,-36.0,-6.9,2.02,0.14,-0.23,0


In [5]:
data.shape

(145751, 75)

In [6]:
data["target"].value_counts(normalize= True)

target
0    0.991108
1    0.008892
Name: proportion, dtype: float64

---

### Train Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels = ["target"], axis = 1),
    data["target"],
    test_size = 0.3,
    random_state = 0 
)

X_train.shape, X_test.shape

((102025, 74), (43726, 74))

### Baseline: Predict the majority class

In [8]:
y_train_base = pd.Series(np.zeros(len(y_train)))
y_test_base = pd.Series(np.zeros(len(y_test)))
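
The cell above builds the all-zeros baseline by hand. As an alternative sketch, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` should produce the same kind of predictions. The toy arrays `X_toy` and `y_toy` are invented here and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data, invented for illustration: 2 features, 9 negatives and 1 positive.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# strategy="most_frequent" ignores the features and always predicts
# the most common label seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

y_toy_base = dummy.predict(X_toy)
print(y_toy_base)
```

Going through an estimator keeps the baseline on the same `fit` / `predict` interface as the models trained below, which can make side-by-side comparisons a little tidier.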

## Train ML Models

### Random Forests

In [9]:
rf = RandomForestClassifier(
    n_estimators= 100,
    random_state= 20,
    max_depth= 2,
    n_jobs= 6
)

rf.fit(X_train, y_train)

y_train_rf = rf.predict_proba(X_train)[:, 1]
y_test_rf = rf.predict_proba(X_test)[:, 1]

### Logistic Regression

In [25]:
scaler = StandardScaler()
scaler.fit(X_train)

In [32]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [27]:
logit = LogisticRegression(random_state= 0, max_iter= 2000)

logit.fit(X_train_scaled, y_train)

y_train_logit = logit.predict_proba(X_train_scaled)[:, 1]
y_test_logit = logit.predict_proba(X_test_scaled)[:, 1]

---

## Accuracy

In [28]:
print("Accuracy Baseline test: ", accuracy_score(y_test, y_test_base))
print("Accuracy Random Forest test: ", accuracy_score(y_test, rf.predict(X_test)))
print("Accuracy Logistic Regression test: ", accuracy_score(y_test, logit.predict(X_test_scaled)))

Accuracy Baseline test:  0.9907377761514888
Accuracy Random Forest test:  0.9962493710835658
Accuracy Logistic Regression test:  0.996958331427526


<blockquote>Judging from the accuracy scores alone, the machine learning models add only a small amount of performance over the baseline model.</blockquote>

In [29]:
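def return_minority_perc(y_true, y_pred):
    # Percentage of minority-class (label 1) observations that were predicted as 1.
    # Convert to arrays so pandas does not align the inputs on mismatched indexes.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    minority_total = np.sum(y_true == 1)
    minority_correct = np.sum((y_true == 1) & (y_pred == 1))
    return minority_correct / minority_total * 100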

In [31]:
print(f"% of minority class correctly classified, Baseline test:\t\t{return_minority_perc(y_test, y_test_base)}")
print(f"% of minority class correctly classified, Random Forest test:\t\t{return_minority_perc(y_test, rf.predict(X_test))}")
print(f"% of minority class correctly classified, Logistic Regression test:\t{return_minority_perc(y_test, logit.predict(X_test_scaled))}")

% of minority class correctly classified, Baseline test:		0.0
% of minority class correctly classified, Random Forest test:		59.75308641975309
% of minority class correctly classified, Logistic Regression test:	72.5925925925926


<blockquote>However, the baseline model does not correctly classify a single observation from the minority class, while the Random Forest and Logistic Regression models identify roughly 60% and 73% of them.

<br>Accuracy is therefore not a good metric for evaluating a machine learning algorithm on an imbalanced dataset.

<br>Because about 99% of the observations belong to the majority class, accuracy stays high regardless of how well the algorithm handles the minority class.</blockquote>
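
To make the point concrete, here is a small sketch on invented labels (not the notebook's data) that compares accuracy with recall for a majority-class predictor. Recall is TP / (TP + FN), so multiplied by 100 it is the same quantity that `return_minority_perc` reports.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, invented for illustration: 1000 observations, 10 of them positive.
y_true = np.array([1] * 10 + [0] * 990)

# A majority-class "model" that predicts 0 for every observation.
y_pred = np.zeros_like(y_true)

toy_accuracy = accuracy_score(y_true, y_pred)  # 990 correct out of 1000
toy_recall = recall_score(y_true, y_pred)      # 0 of the 10 positives found

print(f"accuracy: {toy_accuracy:.3f}")
print(f"recall:   {toy_recall:.3f}")
```

Accuracy comes out at 0.99 while recall is 0, which mirrors the gap between the baseline's accuracy and its minority-class percentage above.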

---