<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Classification
© ExploreAI Academy

In this exercise, we will review and assess our understanding of the core concepts of classifier model selection.

## Learning objectives

By the end of this exercise, you should be able to:
* Build and evaluate multiple types of classification models.

## Introduction

Analysing hate speech and offensive language in tweets

Our dataset consists of roughly 5,600 tweets containing instances of hate speech and offensive language. These tweets have been curated to provide a focused dataset for building sentiment analysis and toxicity detection models. Each tweet reflects varying degrees of negativity, from casual derogatory remarks to explicit expressions of prejudice and intolerance.

By examining this dataset, we aim to understand the prevalence and patterns of hate speech and offensive language in online discourse. Through data analysis, we seek insights into the factors driving such language, as well as its impact on digital communities. Ultimately, our goal is to develop tools and strategies for mitigating the spread of harmful language online and fostering a more inclusive and respectful online environment.

In [1]:
import pandas as pd
tweets_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/toxicity_tweets_cleaned.csv', index_col=0)
tweets_df

| Unnamed: 0 | Tweet | Toxicity |
|---|---|---|
| 43039 | i will beat a bitch ass tf | 1 |
| 36956 | thomasnye1 my momma saw how the girls danced a... | 1 |
| 8373 | user user dont forget his other incantation i... | 0 |
| 27287 | isnt it sad how i keep thinking youll change ... | 0 |
| 56311 | please tell this bitch im subbin her ik one of... | 1 |
| ... | ... | ... |
| 6429 | animaladvocate melodylgattenby zoo says this ... | 0 |
| 12737 | alice doggy my petstagram instapets pet pets d... | 0 |
| 12503 | h a p p y w i n e p a r t y momentoafouna... | 0 |
| 53172 | stupid teabagger restaurant making customers p... | 1 |


## Exercises

We are tasked with building multiple classifier models to predict whether a given tweet contains hate speech or offensive language. Our dataset consists of roughly 5,600 tweets, each accompanied by a label indicating whether it expresses toxicity.

The objective is to develop robust machine learning models capable of accurately classifying tweets as toxic or non-toxic based on their content. 

### Exercise 1

Before we can build our models, we need to first preprocess the text data. Preprocessing involves converting the text into a format that can be easily understood by the algorithms. Use `CountVectorizer` to transform the text data into a matrix where each row represents a tweet and each column represents a unique word in the vocabulary. 

Split the dataset into training and testing sets using an 80-20 split.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer on the tweet text data and transform it
X = vectorizer.fit_transform(tweets_df['Tweet'])

# Convert the sparse matrix to an array
X_array = X.toarray()

# Split the dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X_array,
    tweets_df['Toxicity'],
    test_size=0.2,
    random_state=42
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

Training set size: 4539
Testing set size: 1135
Number of features: 14747


### Exercise 2

Now we can build classifier models using the training data and assess their performance on the testing data.

Implement the following models: `Logistic Regression`, `Decision Tree`, `Support Vector Classification`, and `Nearest Neighbors`. Evaluate each model's performance using the following evaluation metrics: `accuracy`, `precision`, `recall`, and `F1 score`. Note: Running these models might take a few minutes, depending on the complexity chosen. 

In addition to this, calculate the confusion matrix for each of our models. 

What do these results tell us about our models?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Initialize the classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Classification": SVC(random_state=42),
    "Nearest Neighbors": KNeighborsClassifier()
}

# Dictionary to store results and confusion matrices
results = {}
conf_matrices = {}

# Train and evaluate each classifier
print("Training and evaluating classifiers...\n")
for name, clf in classifiers.items():
    print(f"Training {name}...")

    # Train the model
    clf.fit(X_train, y_train)

    # Make predictions
    y_pred = clf.predict(X_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[name] = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1
    }

    # Calculate and store confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    conf_matrices[name] = conf_matrix

    print(f"{name} completed.\n")

# Display results
print("="*60)
print("EVALUATION METRICS")
print("="*60)
for name, metrics in results.items():
    print(f"\n{name}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")

# Display confusion matrices
print("\n" + "="*60)
print("CONFUSION MATRICES")
print("="*60)
for name, conf_matrix in conf_matrices.items():
    print(f"\n{name}:")
    print(f"  [[TN={conf_matrix[0,0]}, FP={conf_matrix[0,1]}]")
    print(f"   [FN={conf_matrix[1,0]}, TP={conf_matrix[1,1]}]]")
    print(f"\n  Matrix:\n{conf_matrix}")

Training and evaluating classifiers...

Training Logistic Regression...
Logistic Regression completed.

Training Decision Tree...
Decision Tree completed.

Training Support Vector Classification...
Support Vector Classification completed.

Training Nearest Neighbors...
Nearest Neighbors completed.

EVALUATION METRICS

Logistic Regression:
  Accuracy: 0.9101
  Precision: 0.9496
  Recall: 0.8302
  F1 Score: 0.8859

Decision Tree:
  Accuracy: 0.9154
  Precision: 0.9224
  Recall: 0.8721
  F1 Score: 0.8966

Support Vector Classification:
  Accuracy: 0.8969
  Precision: 0.9592
  Recall: 0.7883
  F1 Score: 0.8654

Nearest Neighbors:
  Accuracy: 0.7850
  Precision: 0.7373
  Recall: 0.7589
  F1 Score: 0.7479

CONFUSION MATRICES

Logistic Regression:
  [[TN=637, FP=21]
   [FN=81, TP=396]]

  Matrix:
[[637  21]
 [ 81 396]]

Decision Tree:
  [[TN=623, FP=35]
   [FN=61, TP=416]]

  Matrix:
[[623  35]
 [ 61 416]]

Support Vector Classification:
  [[TN=642, FP=16]
   [FN=101, TP=376]]

  Matrix:
[[642  16]
 [101 376]]

### Exercise 3
In addition to the performance evaluation based on metrics and confusion matrices, cross-validation scores provide further insights into the robustness and generalisation capabilities of classifier models. 

After evaluating the performance of our classifier models, we want to determine the best model based on their cross-validation scores. 

Perform 5-fold cross-validation for each classifier model using the training data and print the `mean cross-validation score`.

**Note**: This code should take a few minutes to run.
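
As a rough illustration of what 5-fold cross-validation does, the sketch below runs the folds by hand and compares the result with `cross_val_score`. It uses a small synthetic dataset from `make_classification` as a stand-in (an assumption, so that it runs quickly); the names `X_demo`, `y_demo`, and `manual_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the tweets dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# For classifiers, cv=5 uses stratified folds without shuffling
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Fit a fresh model on four folds and score it on the held-out fold
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop in a single call
library_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(f"Manual mean accuracy:  {manual_scores.mean():.4f}")
print(f"Library mean accuracy: {library_scores.mean():.4f}")
```

Each sample is used for validation exactly once, so the mean score is less dependent on one particular train-test split than a single hold-out evaluation.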

In [None]:
from sklearn.model_selection import cross_val_score

# Dictionary to store cross-validation scores
cv_scores = {}

print("Performing 5-fold cross-validation...\n")
print("Note: This may take a few minutes to complete.\n")

# Perform cross-validation for each classifier
for name, clf in classifiers.items():
    print(f"Running cross-validation for {name}...")

    # Perform 5-fold cross-validation and calculate mean score
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    mean_score = scores.mean()
    std_score = scores.std()

    # Store the mean score
    cv_scores[name] = mean_score

    print(f"  Fold scores: {scores}")
    print(f"  Mean CV Score: {mean_score:.4f} (+/- {std_score:.4f})")
    print()

# Display summary of cross-validation scores
print("="*60)
print("CROSS-VALIDATION SUMMARY")
print("="*60)
for name, score in sorted(cv_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{name}: {score:.4f}")

# Identify the best model
best_model = max(cv_scores, key=cv_scores.get)
print(f"\nBest model based on cross-validation: {best_model} ({cv_scores[best_model]:.4f})")

## Solutions

### Exercise 1

In [None]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit the vectorizer on the tweet text data and transform it into a word-count matrix
X = vectorizer.fit_transform(tweets_df['Tweet'])

# Convert the sparse matrix to a dense array
X_array = X.toarray()

# Split the features (X_array) and target labels (Toxicity) into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X_array, tweets_df['Toxicity'], test_size=0.2, random_state=42)

### Exercise 2

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


# Initialize the classifiers
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Support Vector Classification": SVC(),
    "Nearest Neighbors": KNeighborsClassifier()
}

# Train and evaluate each classifier
conf_matrices = {}
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results[name] = {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1 Score": f1}
    # Calculate confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    conf_matrices[name] = conf_matrix

# Display results
for name, metrics in results.items():
    print(f"Metrics for {name}:")
    print(metrics)
    print()

# Display confusion matrices
for name, conf_matrix in conf_matrices.items():
    print(f"Confusion Matrix for {name}:")
    print(conf_matrix)
    print()



Using Python 3.10.14, our results suggest that the Decision Tree model performs best in terms of F1 score, with high accuracy, precision, and recall that compare favourably to the other classifiers. The Logistic Regression model follows closely: it correctly classifies a large proportion of samples and pairs very high precision with good recall, which suggests it handles both toxic and non-toxic instances robustly. Support Vector Classification (SVC) has high precision but weaker recall, so it produces many more false negatives than false positives. Nearest Neighbors (KNN) has the lowest accuracy and F1 score, making it the weakest performer of the four.
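
To show how these metrics follow from a confusion matrix, the short sketch below recomputes them by hand from the Logistic Regression counts printed in the Exercise 2 output (TN=637, FP=21, FN=81, TP=396). It is plain arithmetic rather than a replacement for the scikit-learn metric functions.

```python
# Counts copied from the Logistic Regression confusion matrix in the Exercise 2 output
tn, fp, fn, tp = 637, 21, 81, 396

# Share of all tweets that were classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the tweets predicted toxic, the share that really are toxic
precision = tp / (tp + fp)

# Of the truly toxic tweets, the share the model found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Because false negatives (81) outnumber false positives (21), recall comes out lower than precision, and the F1 score sits between the two.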

It is, however, important to note a few things. Firstly, the model skeletons provided here are only the start of the process of finding a suitable model. In reality, we cannot say with certainty that the KNN model is less suitable than another if we have not attempted to find the optimal combination of hyperparameters (for instance, we did not specify the number of neighbours, so the default of 5 was used). Secondly, if two models perform similarly in terms of precision, accuracy, and recall, it is worth deciding whether False Positives are **more acceptable** than False Negatives for the task at hand. In the medical world, for example, a False Positive might be preferable, since it can be ruled out by a follow-up test, whereas a False Negative leaves a condition undetected. These findings underscore the importance of carefully evaluating several classifiers and choosing the most suitable model based on the specific task requirements and performance metrics.
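
One possible way to search for a better number of neighbours is a grid search with cross-validation, sketched below. This is an illustration under stated assumptions, not part of the exercise solution: it uses synthetic data from `make_classification` so that it runs quickly, and the candidate values in `param_grid` are arbitrary choices. The `scoring` argument is where a preference between False Positives and False Negatives could be expressed, for example `"recall"` when missing a toxic tweet is the costlier mistake.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the tweets dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate values for the number of neighbours (illustrative choices)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# Evaluate each candidate with 5-fold cross-validation on the training data,
# ranking candidates by F1 score
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(f"Best n_neighbors: {search.best_params_['n_neighbors']}")
print(f"Best mean CV F1:  {search.best_score_:.4f}")

# The search refits the best model on all training data; score it on the test set
print(f"Test-set F1:      {search.score(X_te, y_te):.4f}")
```

Only after a search like this would it be fair to compare KNN against the other models on equal footing.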

### Exercise 3

In [None]:
from sklearn.model_selection import cross_val_score
# Dictionary to store cross-validation scores
cv_scores = {}

# Perform cross-validation for each classifier
for name, clf in classifiers.items():
    # Perform 5-fold cross-validation and store the mean score
    cv_scores[name] = cross_val_score(clf, X_train, y_train, cv=5).mean()

# Display the mean cross-validation score for each classifier
for name, mean_score in cv_scores.items():
    print(f"Mean cross-validation score for {name}:")
    print(mean_score)
    print()

`Logistic Regression` achieves the highest mean cross-validation score of 0.904, which indicates consistent performance across multiple data splits. `Decision Tree` follows closely with a stable cross-validation score of 0.895, even though it had the best F1 score on the single test split in Exercise 2. `Support Vector Classification (SVC)` and `Nearest Neighbors` lag behind, with scores of 0.889 and 0.785, respectively. While `SVC` shows reasonable cross-validation performance, `Nearest Neighbors` struggles to generalise to unseen data, which may indicate overfitting or model complexity issues. These cross-validation results are broadly consistent with the earlier performance evaluation, in which `Logistic Regression` and `Decision Tree` were also the two strongest models, and they point to `Logistic Regression` as the preferred choice for predicting toxicity in this dataset.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>