<a href="https://colab.research.google.com/github/anandaditya07/Hate-Speech/blob/main/Untitled2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1: Install & Import Libraries**

In [15]:
!pip install -q scikit-learn pandas numpy emoji


In [16]:
import pandas as pd
import numpy as np
import re
import emoji

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score


**2: Load Dataset**

In [34]:
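# NOTE: the mojibake in the preview below (sequences such as "à¤") indicates
# that the files hold UTF-8 Devanagari text, which latin-1 decodes incorrectly.
# encoding='utf-8' is the likely fix; latin-1 is kept to match the recorded outputs.
train_df = pd.read_csv("/content/drive/MyDrive/SubTask-B-train.csv", encoding='latin-1')
test_df = pd.read_csv("/content/drive/MyDrive/SubTask-B-test.csv", encoding='latin-1')

print("Raw Dataset Shape:", train_df.shape)
train_df[['tweet', 'label']].head()  # preview the first rows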

Raw Dataset Shape: (19019, 2)


Unnamed: 0,tweet,label
0,à¤à¤®à¤° à¤à¤à¤¾à¤²à¤¾ à¤¦à¥à¤¨à¤¿à¤ à¤¸à...,0
1,à¤¬à¥à¤à¥à¤ªà¥ @BJP4India à¤à¥à¤¨à¥à¤¦à...,0
2,#AssemblyElections2022 à¤ªà¥à¤°à¤­à¤¾à¤°à¥ à...,0
3,#à¤¬à¥à¤°à¥à¤à¤¿à¤à¤ - à¤®à¤§à¥à¤¯à¤ªà¥...,0
4,#AmitShah à¤¨à¥ à¤à¤¹à¤¾ à¤à¤¤à¥à¤¤à¤° à¤ª...,0
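
The `à¤` sequences in the preview are what UTF-8 encoded Devanagari looks like when it is decoded as latin-1. A minimal sketch of the effect follows; the sample string is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample text; not a tweet from the dataset.
sample = "नमस्ते भारत"

raw_bytes = sample.encode("utf-8")       # bytes as they would be stored in a UTF-8 file
garbled = raw_bytes.decode("latin-1")    # what reading with encoding='latin-1' produces
# latin-1 maps every byte to exactly one character, so the damage is reversible.
restored = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled[:6]))
print(restored == sample)
```

If the CSV files are indeed UTF-8 (an assumption based on the byte pattern, not something the notebook confirms), passing `encoding='utf-8'` to `pd.read_csv` should yield readable Hindi text directly.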


**3: Text Cleaning Function**

In [26]:
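def clean_text(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, emoji, and non-letters."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)    # URLs
    text = re.sub(r"@\w+|#\w+", "", text)         # @mentions and #hashtags
    text = emoji.replace_emoji(text, replace="")  # emoji
    # Caution: only ASCII letters survive this step, so all Devanagari (Hindi)
    # characters, digits, and punctuation are removed.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text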
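
Because the tweets are mostly Hindi, a cleaner that keeps the Devanagari Unicode block may retain far more signal than the ASCII-only filter. The following is a sketch of one possible variant, not the notebook's method: the function name `clean_text_keep_devanagari` is hypothetical, it drops whole `@`/`#` tokens with `[@#]\S+` (since `\w` does not match Devanagari vowel signs), and it relies on the character filter to remove emoji instead of calling the `emoji` package.

```python
import re

def clean_text_keep_devanagari(text):
    """Hypothetical variant of clean_text that keeps Devanagari characters."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"[@#]\S+", "", text)          # whole @mention / #hashtag tokens
    # Keep ASCII letters, the Devanagari block (U+0900 to U+097F), and whitespace.
    # Emoji, digits, and punctuation fall outside this set and are dropped.
    text = re.sub(r"[^a-z\u0900-\u097F\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

# Made-up example input, not a row from the dataset.
print(clean_text_keep_devanagari("LIVE | महंगाई पर https://t.co/x @user #ब्रेकिंग 😀"))
```

Note that the Devanagari block also contains the danda (`।`) and Devanagari digits, so a stricter range may be preferable depending on the task.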


**4: Apply Cleaning**

In [27]:
train_df['clean_text'] = train_df['tweet'].apply(clean_text)

# Drop rows whose text is empty after cleaning
train_df = train_df[train_df['clean_text'] != ""]
train_df.reset_index(drop=True, inplace=True)

train_df[['tweet', 'clean_text']].head()

Unnamed: 0,tweet,clean_text
0,LIVE | à¤®à¤¹à¤à¤à¤¾à¤ à¤ªà¤° à¤ªà¤¡à¤¼à¥ ...,live
1,#OpinionPoll : à¤¯à¥à¤à¥ à¤à¥ à¤à¥à¤¨à¤...,live at
2,#ElectionWithNN : 'à¤ªà¤¶à¥à¤à¤¿à¤®à¥ à¤à¤...,round voting update live more updates
3,Goa Election 2022 | à¤à¥à¤µà¤¾ à¤®à¥à¤ à¤...,goa election
4,UP Election 2022: à¤¨à¥à¤ªà¤¾à¤² à¤¬à¤¾à¤°à¥...,up election


In [29]:
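# Clean the test tweets first; on a fresh top-to-bottom run the filter below
# would otherwise raise a KeyError because 'clean_text' does not exist yet.
test_df['clean_text'] = test_df['text'].apply(clean_text)
test_df = test_df[test_df['clean_text'] != ""].reset_index(drop=True)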

print("\nAfter Removing Empty Rows from test_df:", test_df.shape)


After Removing Empty Rows from test_df: (674, 4)


**5: Train–Validation Split**

In [31]:
X_train, X_val, y_train, y_val = train_test_split(
    train_df['clean_text'],
    train_df['label'],
    test_size=0.2,
    random_state=42,
    stratify=train_df['label']
)

print("Train size:", len(X_train))
print("Validation size:", len(X_val))

Train size: 2522
Validation size: 631
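
The `stratify` argument keeps the class ratio the same in both splits, which matters here because the positive class is rare (61 of the 631 validation tweets in the report in step 8). A small sketch on synthetic labels, with all names and data invented for illustration:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic data: 90 negatives and 10 positives (a 9:1 imbalance).
texts = [f"tweet {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both splits keep the 9:1 ratio of the full label list.
print("train:", Counter(y_tr))
print("val:  ", Counter(y_va))
```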


**6: TF-IDF Vectorization**

In [32]:
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2)
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)

print("\nTF-IDF Shape (Train):", X_train_tfidf.shape)
print("Sample TF-IDF Vector:\n", X_train_tfidf[0].toarray())



TF-IDF Shape (Train): (2522, 3814)
Sample TF-IDF Vector:
 [[0. 0. 0. ... 0. 0. 0.]]
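
The train matrix has 3814 columns rather than 5000 because `max_features` is only an upper bound: the vocabulary holds every distinct unigram and bigram found, capped at that number. A toy sketch with three invented documents shows the counting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents, loosely in the style of the cleaned tweets.
docs = ["up election live", "goa election update", "live voting update"]

vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vec.fit_transform(docs)

# 6 distinct unigrams + 6 distinct bigrams = 12 features, far below the cap.
print(X.shape)
print(sorted(vec.vocabulary_))
```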


**7: Train Logistic Regression Model**

In [22]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
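
With roughly nine negatives per positive, an unweighted logistic regression can minimise its loss by predicting the majority class almost everywhere. One common mitigation is `class_weight="balanced"`, sketched below. This is a suggestion rather than part of the original notebook, and the label counts are borrowed from the validation support in step 8 (570 vs 61) purely for illustration, since the training counts are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only: validation support counts, not the training set.
y_demo = np.array([0] * 570 + [1] * 61)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights.round(3))))

# The same weighting applied inside the classifier (unfitted here).
balanced_model = LogisticRegression(max_iter=1000, class_weight="balanced")
```

Whether this improves macro F1 on this dataset would need to be checked by refitting and re-running the evaluation.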


**8: Evaluation**

In [23]:
y_pred = model.predict(X_val_tfidf)

print("Classification Report:\n")
print(classification_report(y_val, y_pred))

print("Macro F1 Score:", f1_score(y_val, y_pred, average="macro"))


Classification Report:

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       570
           1       0.00      0.00      0.00        61

    accuracy                           0.90       631
   macro avg       0.45      0.50      0.47       631
weighted avg       0.82      0.90      0.86       631

Macro F1 Score: 0.4746044962531224
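
The report is consistent with a model that predicts class 0 for every validation tweet: class 1 has zero precision and recall, and accuracy equals the share of class 0. The arithmetic below reproduces the macro F1 under that assumption, using only the support counts from the report.

```python
# Support counts from the report; assumes every tweet was predicted as class 0.
n0, n1 = 570, 61

precision_0 = n0 / (n0 + n1)   # all 631 predictions are 0, 570 of them correct
recall_0 = n0 / n0             # every true 0 is found
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0                     # class 1 is never predicted

accuracy = n0 / (n0 + n1)
macro_f1 = (f1_0 + f1_1) / 2   # unweighted mean over the two classes

print(round(accuracy, 2), round(macro_f1, 4))
```

The 90% accuracy therefore reflects the class imbalance rather than any ability to detect hate speech, which is why macro F1 is the more informative number here.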




**9: Predict on Test Data**

In [24]:
test_df['clean_text'] = test_df['text'].apply(clean_text)

test_tfidf = tfidf.transform(test_df['clean_text'])
test_df['prediction'] = model.predict(test_tfidf)

test_df.to_csv("logistic_predictions.csv", index=False)
print("Predictions saved")

Predictions saved
