# Comment Classification in Python

This notebook mirrors the `Classification.R` script and shows how to build a text classifier in Python.

In Python, the most common libraries for text classification are `scikit-learn` and `nltk`.

In [1]:
# Install necessary packages (uncomment if needed)
# !pip install pandas scikit-learn matplotlib seaborn nltk
# !pip install openpyxl

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
import string

nltk.download('stopwords') # download stopwords dictionary so that we can later remove them from the text


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alvinzhou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Load the data

In [22]:
labeled_df = pd.read_csv("comments_labeled.csv")
test_df = pd.read_csv("comments_test.csv")
print("Labeled data shape:", labeled_df.shape)
# The labeled_df contains 600 rows (comments) and 2 columns (text, label)
print("Test data shape:", test_df.shape)
# The test_df contains 1500 rows (comments) and 1 column (text), without labels
# Display the first few rows of the labeled DataFrame
labeled_df.head()
# You can see that the 1st, 4th, and 5th comments are uncivil (1)

Labeled data shape: (600, 2)
Test data shape: (1500, 1)


Unnamed: 0,text,label
0,You obviously need lessons in English. Go get ...,1
1,Deletion review for Bezgovo cvrtje\nAn editor ...,0
2,Agree. Thanks for the clarification. Have re...,0
3,go fuck yourself in a corner,1
4,Chavez sure isnt a hero. He's classic scum. An...,1


In [5]:
print(labeled_df['label'].value_counts())
# "value_counts" counts the number of occurrences of each label in the "label" column
# The output shows that there are 300 uncivil comments (1) and 300 civil comments (0),
# i.e., a balanced 50%/50% distribution between the two classes.
# This is uncommon: many real-world datasets are imbalanced.
# In those cases, you may need techniques such as oversampling, undersampling, or algorithms that account for class imbalance.
# You can read: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

label
1    300
0    300
Name: count, dtype: int64
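
The comments above mention techniques for imbalanced data. As one illustration (a minimal sketch, not part of the original workflow), class weighting tells a model to pay more attention to the rarer class. The labels below are hypothetical (90 civil, 10 uncivil) and only stand in for an imbalanced version of the `label` column.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 civil (0) and 10 uncivil (1) comments
y_imbalanced = np.array([0] * 90 + [1] * 10)

# "balanced" weights are computed as n_samples / (n_classes * count_of_class),
# so the rarer class receives the larger weight
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imbalanced)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

`LogisticRegression`, `SVC`, and `RandomForestClassifier` accept `class_weight="balanced"` directly; `KNeighborsClassifier` does not have this parameter.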


In [6]:
test_df.head()
# We will apply our selected/final model to this data in the end

Unnamed: 0,text
0,"""\n\n Best of Eurovision article \n\nThe event..."
1,Range Block \n\nYour mom has good ass meat!! 7...
2,==Time constraint== \n The article text says 4...
3,Hanibal911You're a bastard Pro-Assad. The rebe...
4,"""\n\nNoel Coward claimed that the melody can b..."


## Preprocessing and TF-IDF

In [14]:
# Stopwords are common words that contribute little to the meaning of a sentence, such as "the", "is", "in", etc.
# We will remove these, as well as standalone punctuation marks, from the text data to focus on the more meaningful words.
stop_words = set(stopwords.words('english') + list(string.punctuation))

# Define a preprocessing function to clean the text data, where we will:
# 1. Convert text to lowercase
# 2. Remove stopwords and tokens that consist of a single punctuation mark
#    (punctuation attached to a word, e.g. "meat!!", is kept here; TfidfVectorizer's default tokenizer ignores it later)
def preprocess(text):
    return ' '.join([word.lower() for word in text.split() if word.lower() not in stop_words])

# Apply the preprocessing function to the 'text' column of both labeled and test DataFrames
labeled_df['clean_text'] = labeled_df['text'].astype(str).apply(preprocess)
test_df['clean_text'] = test_df['text'].astype(str).apply(preprocess)

# Now we will convert the cleaned text data into numerical features using TF-IDF vectorization.
# See the R script for detailed explanation of TF-IDF (Term Frequency-Inverse Document Frequency)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(labeled_df['clean_text'])
y = labeled_df['label'].astype(int)

X.shape
# X is a TF-IDF matrix with 600 rows (comments) and 5350 columns (features), one per unique token in the vectorizer's vocabulary.

(600, 5350)

In [23]:
# Print X to see what the TF-IDF matrix looks like
print(X)
# The first row/comment is "You obviously need lessons in English. Go get some you ridiculous illiterate dunce."
# We can expect its TF-IDF vector to have non-zero values for words such as "obviously", "need", "lessons", "english", "ridiculous", etc.
# "(0, 1621) 0.415216366186908" means that row 0 (the first comment) has a TF-IDF value of 0.415216366186908 in the column with index 1621 (indices start at 0)
# X is stored as a sparse matrix, so only non-zero entries are printed; every column not listed for a row is zero, meaning that word does not appear in that comment

  (0, 1621)	0.415216366186908
  (0, 2462)	0.415216366186908
  (0, 4066)	0.337642473054448
  (0, 2117)	0.22242051418862424
  (0, 2142)	0.22669270755179363
  (0, 1717)	0.30965454032618644
  (0, 2838)	0.415216366186908
  (0, 3233)	0.26134537243753764
  (0, 3361)	0.3220805199217471
  (1, 3493)	0.1667559992668437
  (1, 5131)	0.10025708040637957
  (1, 3076)	0.11401780905924383
  (1, 2577)	0.1667559992668437
  (1, 3445)	0.14432863568065718
  (1, 2631)	0.08790599777855954
  (1, 1416)	0.12244704976322628
  (1, 4462)	0.1484088930555122
  (1, 3471)	0.15710989583257096
  (1, 1534)	0.12084235972526147
  (1, 1063)	0.1591412621858893
  (1, 4342)	0.11654060081391625
  (1, 528)	0.13236491527057534
  (1, 1653)	0.12794311407118808
  (1, 1329)	0.3549767367944417
  (1, 707)	0.3549767367944417
  :	:
  (597, 939)	0.07244238331674338
  (597, 3523)	0.04721447419994178
  (598, 1581)	0.3904380621932054
  (598, 5274)	0.3904380621932054
  (598, 5272)	0.3904380621932054
  (598, 4848)	0.32646917169891654
  (598, 428

## Train/Test Split

In [24]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2025)
# The above line splits the data into training and validation sets, with 80% of the data used for training and 20% for validation.
# The random_state parameter ensures that the split is reproducible.
print("Training size:", X_train.shape)
# We use 480 comments/rows for training
print("Validation size:", X_val.shape)
# We use 120 comments/rows for validation


Training size: (480, 5350)
Validation size: (120, 5350)
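
The split above does not pass `stratify`, so the class proportions in the two sets depend on the random draw. A possible variant, sketched here on hypothetical placeholder data (`X_demo`, `y_demo`), uses `stratify` to keep the same class ratio in both the training and validation sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels and placeholder features
y_demo = np.array([0] * 50 + [1] * 50)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 50/50 class ratio in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2025, stratify=y_demo
)
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print("Train class counts:", train_counts)
print("Validation class counts:", val_counts)
```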


## Train Models

In [25]:
# ================================
# Logistic Regression
# ================================
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)
logistic_preds = logistic_model.predict(X_val)
print("Logistic Regression Results:")
print(classification_report(y_val, logistic_preds))

Logistic Regression Results:
              precision    recall  f1-score   support

           0       0.86      0.92      0.89        60
           1       0.91      0.85      0.88        60

    accuracy                           0.88       120
   macro avg       0.89      0.88      0.88       120
weighted avg       0.89      0.88      0.88       120
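
The precision and recall columns in the report are derived from a confusion matrix. The following sketch uses hypothetical labels and predictions (not the notebook's validation data) to show how the numbers for class 1 are computed.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: of the comments predicted uncivil, how many really are uncivil
precision_1 = tp / (tp + fp)
# Recall: of the truly uncivil comments, how many the model found
recall_1 = tp / (tp + fn)

print(cm)
print("Precision (class 1):", precision_1)
print("Recall (class 1):", recall_1)
```

In the notebook, `confusion_matrix(y_val, logistic_preds)` would give the corresponding counts for the logistic regression model.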



In [10]:
# ================================
# K-Nearest Neighbors
# ================================
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_preds = knn_model.predict(X_val)
print("K-Nearest Neighbors Results:")
print(classification_report(y_val, knn_preds))


K-Nearest Neighbors Results:
              precision    recall  f1-score   support

           0       0.87      0.78      0.82        60
           1       0.80      0.88      0.84        60

    accuracy                           0.83       120
   macro avg       0.84      0.83      0.83       120
weighted avg       0.84      0.83      0.83       120



In [11]:
# ================================
# Support Vector Machine
# ================================
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_preds = svm_model.predict(X_val)
print("Support Vector Machine Results:")
print(classification_report(y_val, svm_preds))

Support Vector Machine Results:
              precision    recall  f1-score   support

           0       0.81      0.95      0.88        60
           1       0.94      0.78      0.85        60

    accuracy                           0.87       120
   macro avg       0.88      0.87      0.87       120
weighted avg       0.88      0.87      0.87       120



In [12]:
# ================================
# Random Forest
# ================================
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_val)
print("Random Forest Results:")
print(classification_report(y_val, rf_preds))

Random Forest Results:
              precision    recall  f1-score   support

           0       0.90      0.63      0.75        60
           1       0.72      0.93      0.81        60

    accuracy                           0.78       120
   macro avg       0.81      0.78      0.78       120
weighted avg       0.81      0.78      0.78       120



## Model Comparison

We have now trained four different classifiers:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Support Vector Machine (SVM)
4. Random Forest

Based on the results, which model do you think performs best? Select it as `best_model`.
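
One way to answer this question is to put a single summary metric for every model side by side. The sketch below does so with accuracy and macro-averaged F1. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF features, so the scores it prints say nothing about the comment data; the names `candidates` and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and labels
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its validation scores
rows = []
for name, model in candidates.items():
    preds = model.fit(X_tr, y_tr).predict(X_va)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_va, preds),
        "macro_f1": f1_score(y_va, preds, average="macro"),
    })

# Best model first
comparison = (
    pd.DataFrame(rows)
    .sort_values("macro_f1", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```

Replacing the synthetic arrays with `X_train`, `X_val`, `y_train`, and `y_val` would apply the same comparison to the comment data.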

In [13]:
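best_model = rf_model  # Random Forest is selected here, but in the reports above it has the lowest validation accuracy (0.78) and Logistic Regression the highest (0.88); swap in the model you judge best, e.g. best_model = logistic_model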

## Predict on Test Data with Best Model

In [15]:
X_test = vectorizer.transform(test_df['clean_text'])
test_preds = best_model.predict(X_test)

# Save predictions
test_df['predicted_label'] = test_preds
test_df[['predicted_label', 'text']].to_csv("comments_test_predictions_Python.csv", index=False)
test_df.head()

# You can go to your working directory and find the file comments_test_predictions_Python.csv with the predictions.
# Open it, and see if the labels make sense.
# You will find some are misclassified, but many are correct (hopefully).

Unnamed: 0,text,clean_text,predicted_label
0,"""\n\n Best of Eurovision article \n\nThe event...",best eurovision article event took place hambu...,0
1,Range Block \n\nYour mom has good ass meat!! 7...,range block mom good ass meat!! 70.251.71.245,1
2,==Time constraint== \n The article text says 4...,==time constraint== article text says 4.5 hour...,0
3,Hanibal911You're a bastard Pro-Assad. The rebe...,hanibal911you're bastard pro-assad. rebels ale...,0
4,"""\n\nNoel Coward claimed that the melody can b...",noel coward claimed melody traced back old eng...,1
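
The notebook fits `TfidfVectorizer` on all 600 labeled comments before splitting, so the validation comments also contribute to the vocabulary and IDF weights. A possible alternative, sketched here under that observation, is to bundle the vectorizer and the classifier in a scikit-learn pipeline so that both are fitted on the training texts only. The texts and labels below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels (1 = uncivil, 0 = civil)
train_texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "what a stupid idiot",
    "thanks, this clarification is helpful",
]
train_labels = [1, 0, 1, 0]

# The pipeline fits the vectorizer and then the classifier on the same data
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(train_texts, train_labels)

# New texts are vectorized with the vocabulary learned during fit
new_texts = ["such an idiot", "helpful edit, thanks"]
new_preds = text_clf.predict(new_texts)
new_probs = text_clf.predict_proba(new_texts)[:, 1]  # estimated probability of class 1
print(new_preds)
print(new_probs)
```

With a pipeline, `train_test_split` would be applied to the raw `clean_text` column rather than to `X`, and `text_clf.predict(test_df['clean_text'])` would replace the separate `vectorizer.transform` step.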
