<a href="https://colab.research.google.com/github/darthkolli145/Cyberbullyingdetector/blob/main/SCAI_Cyberbullyingdetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cyberbullying Detection Project

This project classifies text as cyberbullying or not using a Logistic Regression model trained on TF-IDF features. We will preprocess the data, train the model, evaluate its performance, and build a small widget interface for interactive classification.

## 1. Import Libraries

First, we need to import the necessary libraries for data manipulation, model building, evaluation, and GUI creation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle
import ipywidgets as widgets
from IPython.display import display, HTML

## 2. Load Dataset

We will load the dataset and set appropriate column names. This step involves reading the CSV file into a pandas DataFrame and checking the first few rows to ensure correctness.

In [None]:
# Load dataset and set column names
file_path = '/content/aggression_parsed_dataset.csv'
column_names = ['index', 'Text', 'ed_label_0', 'ed_label_1', 'oh_label']
df = pd.read_csv(file_path, names=column_names, header=0, engine='python')

# Display column names and first few rows
print(df.columns)
print(df.head())

Index(['index', 'Text', 'ed_label_0', 'ed_label_1', 'oh_label'], dtype='object')
   index                                               Text  ed_label_0  \
0      0  `- This is not ``creative``.  Those are the di...    0.900000   
1      1  `  :: the term ``standard model`` is itself le...    1.000000   
2      2    True or false, the situation as of March 200...    1.000000   
3      3   Next, maybe you could work on being less cond...    0.555556   
4      4               This page will need disambiguation.     1.000000   

   ed_label_1  oh_label  
0    0.100000         0  
1    0.000000         0  
2    0.000000         0  
3    0.444444         0  
4    0.000000         0  
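
As a small illustration of the loading step above, the sketch below shows how `names=` combined with `header=0` replaces the header row already present in a file. The in-memory CSV and its rows are made up for the example and only stand in for the real dataset.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical rows).
raw_csv = io.StringIO(
    "id,comment,a,b,label\n"
    "0,hello there,1.0,0.0,0\n"
    "1,you are awful,0.2,0.8,1\n"
)

column_names = ['index', 'Text', 'ed_label_0', 'ed_label_1', 'oh_label']

# header=0 marks the first line as a header row; names= then overrides it.
toy_df = pd.read_csv(raw_csv, names=column_names, header=0)

print(list(toy_df.columns))
print(len(toy_df))  # the original header line is not counted as data

# A quick look at the label balance is often useful before modelling.
print(toy_df['oh_label'].value_counts())
```

Without `header=0`, pandas would treat the file's own header line as a data row when `names=` is given.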


## 3. Preprocess Dataset

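We need to preprocess the text data. This involves extracting the text column and the `oh_label` target, then converting the text into TF-IDF features. TF-IDF (term frequency-inverse document frequency) turns each document into a numeric vector in which words that are frequent in that document but rare across the corpus receive higher weights. Setting `max_features=5000` caps the vocabulary at the 5,000 most frequent terms.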

In [None]:
# Extract text column and label
text_column = 'Text'
label_column = 'oh_label'
X = df[text_column]
y = df[label_column]

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(X)

## 4. Split Dataset

To evaluate our model, we will split the dataset into training (80%) and testing (20%) sets, so that performance is measured on data the model did not see during training. Fixing `random_state` makes the split reproducible.

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
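
In the cells above, the vectorizer is fitted on the full dataset before the split, so its vocabulary and IDF statistics also reflect the test documents. One possible alternative, shown here only as a sketch on made-up data and not as the notebook's method, is to split the raw text first and let a scikit-learn `Pipeline` fit TF-IDF on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = aggressive, 0 = not aggressive.
texts = [
    "you are an idiot", "nobody likes you", "shut up loser", "you are worthless",
    "thanks for the edit", "this page needs sources", "good point", "see the talk page",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings first, keeping the class ratio in both parts.
train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipe.fit(train_texts, y_tr)

# At prediction time the same fitted vectorizer is applied automatically.
preds = pipe.predict(test_texts)
print(preds)
```

A side benefit of this design is that a single pipeline object can be saved and loaded, instead of keeping the model and vectorizer in sync by hand.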

## 5. Train Classifier

We will train a Logistic Regression model using the training data. Logistic Regression is suitable for binary classification tasks like this one.

In [None]:
# Train Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
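
The output above is a convergence warning: the solver stopped at its iteration limit (scikit-learn's default is `max_iter=100`) before fully converging. A minimal sketch of the remedy the warning itself suggests, raising `max_iter`, is shown below on synthetic data. The value 1000 is an assumption chosen for illustration, not a setting taken from this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 5 features, label driven by two of them.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Allow more iterations than the default so the solver can finish.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations were actually used.
print(clf.n_iter_)
```

If `n_iter_` equals `max_iter`, the limit was hit again and a larger value or a different solver may be worth trying.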


## 6. Evaluate Classifier

After training the model, we will evaluate its performance using accuracy score and classification report metrics.

In [None]:
# Evaluate the classifier
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9385060199369957
              precision    recall  f1-score   support

           0       0.94      0.99      0.97     20228
           1       0.88      0.60      0.71      2945

    accuracy                           0.94     23173
   macro avg       0.91      0.79      0.84     23173
weighted avg       0.94      0.94      0.93     23173
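
The report above is easier to read with the class imbalance in mind. The short calculation below uses only the support counts and the rounded recall printed in the report to estimate what a classifier that always predicts class 0 would score, and roughly how many positive examples the model misses.

```python
# Support counts copied from the classification report above.
support_negative = 20228
support_positive = 2945
total = support_negative + support_positive

# Accuracy of a trivial baseline that always predicts class 0.
baseline_accuracy = support_negative / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")

# Recall for class 1 is reported as 0.60 (rounded), so about 40% are missed.
recall_positive = 0.60
missed_positive = round(support_positive * (1 - recall_positive))
print(f"Approximate positive examples missed: {missed_positive}")
```

Because the baseline already reaches about 0.87 accuracy, the per-class recall is the more informative number for the cyberbullying class.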



## 7. Save Model and Vectorizer

To use the model in the future, we will save the trained model and the TF-IDF vectorizer.

In [None]:
# Save the model and vectorizer
with open('model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
with open('vectorizer.pkl', 'wb') as vec_file:
    pickle.dump(vectorizer, vec_file)
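
As a sanity check on the saving step above, the sketch below pickles a toy model and vectorizer to a temporary file, loads them back, and confirms that the loaded objects give the same prediction. The toy data and the `toy_bundle.pkl` file name are hypothetical. Note that pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data.
toy_texts = ["you are kind", "you are horrible", "nice work", "horrible person"]
toy_labels = [0, 1, 0, 1]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vec.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_bundle.pkl")
    # Save both objects together so they cannot get out of sync.
    with open(path, "wb") as f:
        pickle.dump((toy_model, toy_vec), f)
    # Load them back from disk.
    with open(path, "rb") as f:
        loaded_model, loaded_vec = pickle.load(f)

# The original and the reloaded objects should agree.
sample = ["horrible work"]
original_pred = toy_model.predict(toy_vec.transform(sample))
loaded_pred = loaded_model.predict(loaded_vec.transform(sample))
same = bool((original_pred == loaded_pred).all())
print(same)
```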

## 8. Build GUI for Real-Time Classification

Finally, we will build a GUI to classify new text inputs interactively. The GUI takes text input from the user, classifies it as cyberbullying or not, and displays the model's confidence in that prediction.

In [None]:
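# Function to classify text and get confidence
def classify_text(text):
    # Load the saved model and vectorizer, closing each file after reading
    with open('model.pkl', 'rb') as model_file:
        model = pickle.load(model_file)
    with open('vectorizer.pkl', 'rb') as vec_file:
        vectorizer = pickle.load(vec_file)
    X = vectorizer.transform([text])
    prediction = model.predict(X)[0]
    # Confidence is the probability of the predicted (most likely) class
    confidence = np.max(model.predict_proba(X)) * 100
    return prediction, confidence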

# Button click event handler
def on_button_click(b):
    text = text_input.value
    prediction, confidence = classify_text(text)
    result_label.value = f"Prediction: {'Cyberbullying' if prediction == 1 else 'Not Cyberbullying'}"
    confidence_label.value = f"Confidence: {confidence:.2f}%"

# GUI Components
title = widgets.HTML(value="<h2>Cyberbullying Detection</h2>")
text_input = widgets.Textarea(
    value='',
    placeholder='Enter the text to classify',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(width='80%', height='100px')
)
button = widgets.Button(description="Classify")
button.on_click(on_button_click)
result_label = widgets.Label(value="Prediction: ")
confidence_label = widgets.Label(value="Confidence: ")

# Display GUI
display(widgets.VBox([title, text_input, button, result_label, confidence_label]))

VBox(children=(HTML(value='<h2>Cyberbullying Detection</h2>'), Textarea(value='', description='Text:', layout=…
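
A note on how to read the confidence value shown by the GUI: it is the larger of the two class probabilities, so for a binary model it never drops below 50%. The sketch below uses a made-up probability row in place of a real `predict_proba` result to show the difference between that value and the probability of the cyberbullying class itself.

```python
import numpy as np

# Hypothetical predict_proba output for one input: [P(class 0), P(class 1)].
proba = np.array([[0.55, 0.45]])

# What the GUI reports: the probability of whichever class is most likely.
confidence = np.max(proba) * 100
print(f"Confidence: {confidence:.2f}%")

# The probability of the positive (cyberbullying) class specifically.
positive_probability = proba[0, 1] * 100
print(f"P(cyberbullying): {positive_probability:.2f}%")
```

A reading of 55% therefore means the model is close to undecided, not that it is moderately sure.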