## Topic X: Bring your own topic!
You are encouraged to propose your own topic! Please note the following criteria:

• the topic should include a text classification task at its core and there should be some
annotated training data available for this task, otherwise milestones 1 and 2 cannot be
completed. If you are unsure whether your topic is suitable, we are happy to advise you.

• you are still required to work in teams of 4, so you should assemble a team to work on the
project (if necessary you can also bring in external members who are not registered for the
course)

• you should contact the exercise coordinator (Gábor Recski) about your topic proposal; we
can discuss your ideas and recommend 1-2 instructors who can act as your mentors

# Import required packages

In [7]:
import pandas as pd

# Import Data


In [8]:
# Forward slash avoids the invalid '\C' escape sequence and works on every OS
df = pd.read_csv('Data/C3_anonymized.csv')

# Observe Data

In [9]:
df.head(5)

Unnamed: 0,article_id,comment_author,comment_counter,comment_text,njudgements_constructiveness_expt,njudgements_toxicity_expt,agree_constructiveness_expt,agree_toxicity_expt,constructive,crowd_toxicity_level,...,constructive_characteristics,non_constructive_characteristics,toxicity_characteristics,crowd_comments_constructiveness_expt,crowd_comments_toxicity_expt,other_con_chars,other_noncon_chars,other_toxic_chars,constructive_binary,pp_comment_text
0,26023945,0,source1_26023945_62,And this Conservative strategy has produced th...,3.0,3.0,0.17,0.5,1.0,4.0,...,specific_points:3\r\ndialogue:2,no_non_con:3\r\nprovocative:1,abusive:3\r\npersonal_attack:1\r\nteasing:1\r\...,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,1.0,And this Conservative strategy has produced th...
1,24565777,1,source1_24565777_106,I commend Harper for holding the debates outsi...,3.0,3.0,0.33,0.17,1.0,3.0,...,specific_points:3\r\ndialogue:2,no_non_con:2\r\nno_respect:1,abusive:1\r\npersonal_attack:1\r\nteasing:1\r\...,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,1.0,I commend Harper for holding the debates outsi...
2,28775443,2,source1_28775443_136,What a joke Rachel Notley is. This is what was...,3.0,3.0,0.83,0.0,1.0,3.0,...,specific_points:2\r\ndialogue:1,no_non_con:2\r\nprovocative:1,personal_attack:3\r\ninflammatory:3\r\nteasing...,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,1.0,What a joke Rachel Notley is . This is what wa...
3,8996700,3,source1_8996700_50,Do you need to write an essay to prove the poi...,3.0,3.0,1.0,0.83,1.0,3.0,...,dialogue:1\r\nevidence:1\r\nspecific_points:1,no_non_con:2\r\nnon_relevant:1,personal_attack:2\r\nteasing:2\r\nembarrassmen...,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,1.0,Do you need to write an essay to prove the poi...
4,29405071,4,source1_29405071_126,Rob Ford was no saint. He should never have be...,3.0,3.0,0.83,0.33,1.0,3.0,...,specific_points:3\r\nsolution:1,no_non_con:3,teasing:3\r\npersonal_attack:2\r\nabusive:1\r\...,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,\r\n\r\n,1.0,Rob Ford was no saint . He should never have b...


What is the difference between the columns 'comment_text' and 'pp_comment_text'?

'pp_comment_text' appears to be a pre-cleaned version of 'comment_text':
- single quotes (') are removed
- whitespace is added before and after periods, commas, and apostrophes
- hyphens are kept in words such as 'left-wing'

In [10]:
for i in range(0,2):
    print("Comment text:\n",df["comment_text"][i],"\n\npp Comment text:\n",df["pp_comment_text"][i],"\n-----------------------------")

Comment text:
 And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters 'lying pieces of Sh*t' this week. The fortunate thing is that reporters were able to report it and broadcast it - which may shake up a few folks who recognize a bit of themselves somewhere in there and do some reflecting. I live in hope. 

pp Comment text:
 And this Conservative strategy has produced the angry and desperate wing-nuts like the fellow who called reporters lying pieces of Sh*t this week . The fortunate thing is that reporters were able to report it and broadcast it - which may shake up a few folks who recognize a bit of themselves somewhere in there and do some reflecting . I live in hope . 
-----------------------------
Comment text:
 I commend Harper for holding the debates outside of a left-wing forum as this will help prevent the left from manipulating the debates to try to make Harper look bad. Indeed, we’ll finally have some fair debates.
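The differences between the two columns can also be listed programmatically instead of by eye. The following is a minimal sketch using the standard-library `difflib`; the two strings are shortened excerpts of the first comment shown above, and the helper `char_changes` is a hypothetical name introduced here for illustration only.

```python
import difflib

# Shortened excerpts of the first comment (raw vs. pre-processed column)
raw = "called reporters 'lying pieces of Sh*t' this week. I live in hope."
pp = "called reporters lying pieces of Sh*t this week . I live in hope ."

def char_changes(a, b):
    """Return (operation, text_in_a, text_in_b) for every span where a and b differ."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

changes = char_changes(raw, pp)
for tag, removed, added in changes:
    print(tag, repr(removed), "->", repr(added))
```

On the real data, the same helper could be applied to `df["comment_text"][i]` and `df["pp_comment_text"][i]` for a handful of rows to confirm or refine the list of cleaning steps.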

## Creating a dataset of raw comments with their binary constructiveness annotation

In [11]:
df_anno = df[['comment_text','constructive_binary']]
print(df_anno['constructive_binary'].value_counts())


constructive_binary
1.0    6516
0.0    5484
Name: count, dtype: int64
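To judge class balance it can help to look at proportions rather than raw counts. The sketch below rebuilds the label column from the counts printed above (a stand-in for `df_anno['constructive_binary']`, so that it runs without the CSV file) and uses `value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in for df_anno['constructive_binary'], built from the counts shown above
labels = pd.Series([1.0] * 6516 + [0.0] * 5484, name="constructive_binary")

# normalize=True returns relative frequencies instead of raw counts
class_share = labels.value_counts(normalize=True).round(3)
print(class_share)
```

With roughly 54% constructive and 46% non-constructive comments, the two classes are fairly balanced, so plain accuracy should be a usable first metric.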


Let's look at some of the constructive and non-constructive comments

In [12]:
pd.set_option('display.max_colwidth', None)

# Randomly sample 3 constructive comments
random_constructive = df_anno[df_anno['constructive_binary'] == 1].sample(n=3, random_state=42)

# Randomly sample 3 non-constructive comments
random_non_constructive = df_anno[df_anno['constructive_binary'] == 0].sample(n=3, random_state=42)

# Display the results
print("Random Constructive Comments:")
print(random_constructive['comment_text'])
print("\nRandom Non-Constructive Comments:")
print(random_non_constructive['comment_text'])

Random Constructive Comments:
10541                                                                                                                                                                                                                                                                                                                                                                                                                                  The world & the west will be better when the Globalists are all voted out of office. They show very little concern for their own citizens and instead try to one-up each other by impressing the dictators or climate change zealots at the UN.
3140     Actions speak louder than words of condolence. Withdrawing was the wrong thing to do a week ago and its the wrong thing to do now. Giving in to the ruthless intimidation tactics of this cancer on humanity is exactly what ISIS is trying to achieve. It's time to stand with the Free-world and in part

Some possible issues that we saw:
- use of slang such as "gonna" or "gunna" for "going to"
- abbreviations
- spelling mistakes
- judging whether a comment is constructive can be highly subjective. That is most likely why a non-binary annotation column exists, which would be better suited to a regression task

Overall, though, the texts seem clean.

Now let's build a function to see which words are the most used

Import dependencies and download models

In [None]:
import json
import re
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import stanza
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load 'punkt_tab' for word_tokenize
nltk.download('stopwords')
stanza.download('en')

In [31]:
def summarize_most_used_words(text_list, top_n=10, language='english'):
    """Return the top_n most frequent words in a list of texts, excluding stopwords."""

    # Load the stopwords for the given language
    stop_words = set(stopwords.words(language))
    
    # Combine all texts into one large string
    all_text = ' '.join(text_list)
    
    # Convert to lowercase and remove punctuation using regex
    all_text_cleaned = re.sub(r'[^\w\s]', '', all_text.lower())
    
    # Split into words, using the tokenizer for the requested language
    words = word_tokenize(all_text_cleaned, language=language)
    
    # Remove stopwords
    filtered_words = [word for word in words if word not in stop_words]
    
    # Count word frequencies and get the most common
    word_counts = Counter(filtered_words)
    most_common_words = word_counts.most_common(top_n)
    
    return most_common_words


In [32]:
summarize_most_used_words(df_anno['comment_text'], top_n=25, language='english')

[('people', 2817),
 ('would', 2666),
 ('canada', 2528),
 ('harper', 2444),
 ('one', 2190),
 ('like', 2128),
 ('us', 1833),
 ('dont', 1691),
 ('government', 1654),
 ('get', 1566),
 ('time', 1506),
 ('many', 1410),
 ('think', 1268),
 ('years', 1266),
 ('even', 1254),
 ('canadian', 1238),
 ('party', 1192),
 ('well', 1176),
 ('canadians', 1161),
 ('good', 1156),
 ('right', 1140),
 ('much', 1113),
 ('see', 1083),
 ('world', 1081),
 ('trudeau', 1060)]
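The entry `('dont', 1691)` suggests that contractions slip through the stopword filter: punctuation is stripped before filtering, so "don't" becomes "dont", which presumably no longer matches the stopword list. One possible fix is to clean the stopwords the same way as the text. The sketch below illustrates the idea under simplifying assumptions: `STOP_WORDS` is a tiny hand-written stand-in for `stopwords.words('english')`, `str.split` replaces `word_tokenize`, and `strip_punct` and `top_words` are hypothetical helper names.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'), kept tiny for illustration
STOP_WORDS = {"i", "the", "it", "is", "don't", "isn't"}

def strip_punct(text):
    """Lowercase and remove punctuation, as summarize_most_used_words does."""
    return re.sub(r"[^\w\s]", "", text.lower())

def top_words(texts, stop_words, top_n=10):
    # Clean the stopwords like the text, so "don't" also covers "dont"
    normalized_stops = {strip_punct(w) for w in stop_words} | set(stop_words)
    words = strip_punct(" ".join(texts)).split()
    kept = [w for w in words if w not in normalized_stops]
    return Counter(kept).most_common(top_n)

sample = [
    "I don't think the budget is fair",
    "The budget isn't fair, don't pretend it is",
]
result = top_words(sample, STOP_WORDS, top_n=2)
print(result)
```

In `summarize_most_used_words`, the equivalent change would be to build `stop_words` from the punctuation-stripped entries of `stopwords.words(language)`.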

## Time for Pre-processing!

Tokenizing