## Topic X: Bring your own topic!
You are encouraged to propose your own topic! Please note the following criteria:

• the topic should include a text classification task at its core and there should be some
annotated training data available for this task, otherwise milestones 1 and 2 cannot be
completed. If you are unsure whether your topic is suitable, we are happy to advise you.

• you are still required to work in teams of 4, so you should assemble a team to work on the
project (if necessary you can also bring in external members who are not registered for the
course)

• you should contact the exercise coordinator (Gábor Recski) about your topic proposal, so that we
can discuss your ideas and recommend 1-2 instructors who can act as your mentors

# Import required packages

In [1]:
import os
import pandas as pd

# Import Data


In [2]:
# Data path
data_path = os.path.join('Data', 'C3_anonymized.csv')

# Import data
df = pd.read_csv(data_path)

# Observe Data

In [None]:
df.head(5)

What's the difference between the columns 'comment_text' and 'pp_comment_text'?

'pp_comment_text' seems to be a "pre-cleaned" version of the text:
- removed the ' character (single quote)
- added whitespace before and after periods, commas and apostrophes
- kept the hyphen in words such as 'left-wing'
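
The differences listed above were found by eye. As a rough way to check them programmatically, the sketch below uses `difflib.SequenceMatcher` from the standard library to list every span that differs between a raw string and its pre-processed version. The helper name `describe_changes` and the two example strings are made up for illustration; they are not rows from the dataset.

```python
import difflib


def describe_changes(raw, processed):
    """Return (operation, raw_fragment, processed_fragment) for each differing span."""
    matcher = difflib.SequenceMatcher(None, raw, processed, autojunk=False)
    return [
        (tag, raw[i1:i2], processed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up strings that mimic the observed pattern (not taken from the dataset)
raw_example = "Left-wing policies, really?"
processed_example = "Left-wing policies , really ?"

changes = describe_changes(raw_example, processed_example)
for change in changes:
    print(change)
```

On the real data, the same helper could be applied to `df["comment_text"][i]` and `df["pp_comment_text"][i]` to confirm or refine the list above.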

In [None]:
# Print the first two comments side by side with their pre-processed version
for i in range(2):
    print("Comment text:\n", df["comment_text"].iloc[i])
    print("\npp Comment text:\n", df["pp_comment_text"].iloc[i])
    print("-----------------------------")

## Creating a dataset with the raw comments and the binary constructiveness annotation

In [None]:
df_anno = df[['comment_text','constructive_binary']].copy()

#Change the constructive binary column to int (1 or 0)
df_anno['constructive_binary'] = df_anno['constructive_binary'].astype(int)

print(df_anno['constructive_binary'].value_counts())


Let's look at some of the constructive and non-constructive comments

In [None]:
pd.set_option('display.max_colwidth', None)

# Randomly sample 3 constructive comments
random_constructive = df_anno[df_anno['constructive_binary'] == 1].sample(n=3, random_state=42)

# Randomly sample 3 non-constructive comments
random_non_constructive = df_anno[df_anno['constructive_binary'] == 0].sample(n=3, random_state=42)

# Display the results
print("Random Constructive Comments:")
print(random_constructive['comment_text'])
print("\nRandom Non-Constructive Comments:")
print(random_non_constructive['comment_text'])

Some possible issues that we saw:
- use of slang like "gonna, gunna" for "going to"
- abbreviations
- spelling mistakes
- telling whether a comment is constructive or not can be highly subjective. That's most likely why a non-binary annotation column exists, which is better suited to a regression task

Overall, though, the texts seem clean.

Now let's build a function to see which words are the most used

Import dependencies and download models

In [None]:
import json
import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import stanza
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
stanza.download('en')

In [15]:
def summarize_most_used_words(text_list, top_n=10, language='english'):
    """Summarize the most used words in a list of texts, excluding stopwords."""

    # Load the stopwords for the given language
    stop_words = set(stopwords.words(language))
    
    # Combine all texts into one large string
    all_text = ' '.join(text_list)
    
    # Convert to lowercase and remove punctuation using regex
    all_text_cleaned = re.sub(r'[^\w\s]', '', all_text.lower())
    
    # Split into words (use the same language as for the stopwords)
    words = word_tokenize(all_text_cleaned, language=language)
    
    # Remove stopwords
    filtered_words = [word for word in words if word not in stop_words]
    
    # Count word frequencies and get the most common
    word_counts = Counter(filtered_words)
    most_common_words = word_counts.most_common(top_n)
    
    return most_common_words
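
One side effect of stripping punctuation before filtering stopwords is worth keeping in mind: the regex `[^\w\s]` also deletes apostrophes, so a contraction like "don't" becomes "dont" and no longer matches a stopword entry spelled with the apostrophe. The sketch below illustrates this with the standard library only. It is a simplification, not the function above: `TOY_STOPWORDS` is a hand-picked stand-in for `stopwords.words('english')`, `count_words` is a hypothetical helper, and whitespace splitting replaces `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stopword set standing in for the NLTK list
TOY_STOPWORDS = {"i", "the", "is", "don't", "this"}


def count_words(texts, stop_words, strip_punctuation_first=True):
    """Count non-stopword tokens, with two ways of handling punctuation."""
    text = " ".join(texts).lower()
    if strip_punctuation_first:
        # Same regex as in the notebook: removes apostrophes too
        text = re.sub(r"[^\w\s]", "", text)
        tokens = text.split()
    else:
        # Only trim punctuation at the edges of each token, keeping inner apostrophes
        tokens = [token.strip(".,!?;:\"()") for token in text.split()]
    return Counter(token for token in tokens if token and token not in stop_words)


texts = ["I don't like this.", "The plan is good, I don't mind."]
stripped_first = count_words(texts, TOY_STOPWORDS, strip_punctuation_first=True)
edges_only = count_words(texts, TOY_STOPWORDS, strip_punctuation_first=False)

print(stripped_first)  # "dont" survives the stopword filter
print(edges_only)      # "don't" is filtered out as a stopword
```

If contractions show up among the most frequent words, this ordering of the cleaning steps is a likely cause.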


Let's see the most used words!

In [None]:
#Global
print("top 10 most used words (without stopwords):\n")
print(summarize_most_used_words(df_anno['comment_text'], top_n=10, language='english'))

#Constructive comments
print("\ntop 10 most used words (without stopwords) in constructive comments:\n")
print(summarize_most_used_words(df_anno[df_anno['constructive_binary']==1]['comment_text'], top_n=10, language='english'))

#Non-constructive comments
print("\ntop 10 most used words (without stopwords) in non-constructive comments:\n")
print(summarize_most_used_words(df_anno[df_anno['constructive_binary']==0]['comment_text'], top_n=10, language='english'))

Now let's see whether the character length and the average word length differ between constructive and non-constructive comments

In [None]:
# Function to calculate the avg word length in a comment
def avg_word_length(text):
    words = text.split()
    if len(words) > 0:
        return sum(len(word) for word in words) / len(words)
    else:
        return 0

# Create a length column
df_anno['text_length'] = df_anno['comment_text'].apply(len)
df_anno['avg_word_length'] = df_anno['comment_text'].apply(avg_word_length)


# Create a figure with 2 subplots (1 row, 2 columns)
plt.figure(figsize=(14, 6))

# First plot: Jittered strip plot for text length
plt.subplot(1, 2, 1)
sns.stripplot(x='constructive_binary', y='text_length', data=df_anno, jitter=True, alpha=0.5)
plt.title('Jittered Strip Plot of Text Length by Constructiveness')
plt.xlabel('Constructive (0 = Not Constructive, 1 = Constructive)')
plt.ylabel('Text Length (in characters)')

# Second plot: Jittered strip plot for average word length
plt.subplot(1, 2, 2)
sns.stripplot(x='constructive_binary', y='avg_word_length', data=df_anno, jitter=True, alpha=0.5)
plt.title('Jittered Strip Plot of Average Word Length by Constructiveness')
plt.xlabel('Constructive (0 = Not Constructive, 1 = Constructive)')
plt.ylabel('Average Word Length')

# Adjust layout to prevent overlap
plt.tight_layout()

It's quite easy to see that constructive comments can be long: only a few non-constructive comments exceed 1000 characters, whereas this is quite common for constructive ones. However, the average word length plot is quite similar between the two classes, even though we see some much higher average word lengths in non-constructive comments, which are likely mistakes (it would be surprising to find a comment whose average word length is above 100!). Let's investigate that last hypothesis.
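
The visual impression about long comments can also be expressed as a number: the share of comments above 1000 characters in each class. The sketch below shows one way to compute it with pandas. The toy DataFrame and the helper name `share_above` are made up for illustration; only the column names and the 1000-character threshold come from the notebook.

```python
import pandas as pd

# Toy data standing in for df_anno (not real comments)
toy = pd.DataFrame({
    "comment_text": ["short", "x" * 1200, "y" * 1500, "tiny", "z" * 1100, "ok"],
    "constructive_binary": [0, 1, 1, 0, 0, 1],
})
toy["text_length"] = toy["comment_text"].str.len()


def share_above(df, threshold=1000):
    """Fraction of comments longer than `threshold` characters, per class."""
    is_long = df["text_length"] > threshold
    return is_long.groupby(df["constructive_binary"]).mean()


shares = share_above(toy)
print(shares)
```

Calling `share_above(df_anno)` after the `text_length` column has been created would give the corresponding figures for the real data.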

In [None]:
# Create a figure with 2 subplots (1 row, 2 columns)
plt.figure(figsize=(14, 6))

# First plot: Box plot for text length
plt.subplot(1, 2, 1)
sns.boxplot(x='constructive_binary', y='text_length', data=df_anno)
plt.title('Box Plot of Text Length by Constructiveness')
plt.xlabel('Constructive (0 = Not Constructive, 1 = Constructive)')
plt.ylabel('Text Length (in characters)')

# Second plot: Box plot for average word length
plt.subplot(1, 2, 2)
sns.boxplot(x='constructive_binary', y='avg_word_length', data=df_anno)
plt.title('Box Plot of Average Word Length by Constructiveness')
plt.xlabel('Constructive (0 = Not Constructive, 1 = Constructive)')
plt.ylabel('Average Word Length')

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plots
plt.show()

In [None]:
long_word_comments = df_anno[df_anno['avg_word_length'] > 35]
for comment in long_word_comments['comment_text']:
    print(comment)
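
Since `avg_word_length` splits on whitespace, a single very long "word" (for example a URL or text typed without spaces) is enough to inflate the average. Printing the longest token of each flagged comment can make the cause easier to spot than reading the full text. A minimal sketch, with a hypothetical helper `longest_token` and made-up example strings:

```python
def longest_token(text):
    """Return the longest whitespace-separated token, or '' for empty text."""
    tokens = text.split()
    return max(tokens, key=len) if tokens else ""


# Made-up examples, not taken from the dataset
examples = [
    "see http://example.com/a/very/long/path/that/keeps/going",
    "thisisacommentwrittenwithoutanyspacesatall",
    "",
]

for text in examples:
    token = longest_token(text)
    print(len(token), repr(token))
```

On the real data this could be applied as `long_word_comments['comment_text'].apply(longest_token)`.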

## Time for Pre-processing!

Tokenizing