# Week 5 Practicum

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import json

In [None]:
import torch
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm
import json

### JSONL Cleaning and JSONL -> DataFrame -> CSV

In this code I created a function that reads the JSONL file Cade pulled from Pushshift, goes through each post, and pulls out the fields we'll use in our model along with some other information needed for data exploration.

In [2]:
def parse_jsonl_file(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            try:
                entry = json.loads(line.strip())
                title = entry.get('title')
                selftext = entry.get('selftext')
                link_flair_text = entry.get('link_flair_text')
                post_id = entry.get('id')
                url = entry.get('url')
                num_comments = entry.get('num_comments')
                score = entry.get('score')

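                # Note: this compares against the literal string 'None', so posts
                # whose flair is JSON null (Python None) are not skipped here;
                # they are filtered out later with notnull().
                if link_flair_text == 'None':
                    continue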
                
                data.append({
                    'title': title,
                    'selftext': selftext,
                    'link_flair_text': link_flair_text,
                    'id': post_id,
                    'url': url,
                    'num_comments': num_comments,
                    'score': score
                    
                })
            except json.JSONDecodeError:
                print("Error decoding JSON line:", line)
    return data

file_path = 'r_udub_posts.jsonl'
parsed_data = parse_jsonl_file(file_path)

df_jsonl = pd.DataFrame(parsed_data)

df_jsonl.head(5)
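A missing flair in the JSONL is stored as JSON null, which `json.loads` turns into Python's `None` rather than the string `'None'`. Here is a minimal sketch of the difference, using two made-up JSONL lines (the titles and flairs are illustrative, not taken from the dataset):

```python
import json

# Two hypothetical JSONL lines: one with a flair, one where the flair is null.
sample_lines = [
    '{"title": "Example post A", "link_flair_text": "Advice"}',
    '{"title": "Example post B", "link_flair_text": null}',
]

entries = [json.loads(line) for line in sample_lines]

# JSON null is parsed as Python None, so comparing with the string 'None' is False.
matches_string = [e.get('link_flair_text') == 'None' for e in entries]
matches_none = [e.get('link_flair_text') is None for e in entries]

print(matches_string)
print(matches_none)
```

If unflaired posts should be skipped while parsing, `is None` is the check that matches them. Otherwise they can be dropped later, which is what the filtering step further down does.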

This part of the code takes the newly created DataFrame and saves it as a CSV so it's easier to handle and lets all of us work with the same dataset.

In [3]:
csv_file_path = 'r_udub_posts.csv'
df_jsonl.to_csv(csv_file_path, index=False)
print("DataFrame saved as CSV:", csv_file_path) 

Unnamed: 0,title,selftext,link_flair_text,id,url,num_comments,score
0,Any UW redditors want to meet up Thursday 10/2...,"We failed on 10/22, but I think with a week of...",,9y4hg,https://www.reddit.com/r/udub/comments/9y4hg/a...,6,4
1,We need a UW-ified logo.,If someone here has arcane skill in the graphi...,,9ywtc,https://www.reddit.com/r/udub/comments/9ywtc/w...,2,3
2,Thursday bowling success!,[deleted],,9z66c,https://www.reddit.com/r/udub/comments/9z66c/t...,2,3
3,"Next UW meetup: Thursday, 11/5 at 11:00am in t...",This time we will be playing ping pong followe...,,a0ail,https://www.reddit.com/r/udub/comments/a0ail/n...,2,4
4,Next meetup: December 3rd. Need ideas,"Alright, so who is up for a December 3rd meetu...",,a9lq8,https://www.reddit.com/r/udub/comments/a9lq8/n...,7,2


## How accurate is a multinomial Naive Bayes model for predicting different flairs?

Here we wanted to try a fairly standard model to get a baseline, see how our data behaves, and have a starting point we can iterate on and look back to.

In [4]:
posts = pd.read_csv('r_udub_posts.csv')
posts.head(5)

Unnamed: 0,title,selftext,link_flair_text,id,url,num_comments,score
0,Any UW redditors want to meet up Thursday 10/2...,"We failed on 10/22, but I think with a week of...",,9y4hg,https://www.reddit.com/r/udub/comments/9y4hg/a...,6,4
1,We need a UW-ified logo.,If someone here has arcane skill in the graphi...,,9ywtc,https://www.reddit.com/r/udub/comments/9ywtc/w...,2,3
2,Thursday bowling success!,[deleted],,9z66c,https://www.reddit.com/r/udub/comments/9z66c/t...,2,3
3,"Next UW meetup: Thursday, 11/5 at 11:00am in t...",This time we will be playing ping pong followe...,,a0ail,https://www.reddit.com/r/udub/comments/a0ail/n...,2,4
4,Next meetup: December 3rd. Need ideas,"Alright, so who is up for a December 3rd meetu...",,a9lq8,https://www.reddit.com/r/udub/comments/a9lq8/n...,7,2


In [5]:
posts['link_flair_text'].unique()

array([nan, 'PSA', 'Rant', 'Random', 'Meme', 'Question', 'Discussion',
       'Academics', 'Student Life', 'Help', 'Event', 'Video',
       'Admissions', 'Advice', 'Poll', 'poll', 'No unrelated posts'],
      dtype=object)

This next cell is our main filter for turning the raw data into data that can be fed into a model. We combined the title and body text into one column so the tokenizer has more text to work with. We also dropped any posts that had no body or no flair, or whose body was removed or deleted. Finally, we lowercased the flairs to merge redundant ones (such as 'Poll' and 'poll') and kept only the flairs currently in use.

In [6]:
# Combine title and body so the vectorizer has more text per post.
posts['combined_text'] = posts['title'] + " " + posts['selftext'].fillna("")
flair_categories = ["admissions", "academics", "student life", "advice", "discussion", "meme", "rant", "psa", "event", "poll"]

# Keep posts that have a flair and a body that was not removed or deleted.
flairedNotSelf = posts[(posts['link_flair_text'].notnull()) & (posts['selftext'] != '[removed]') & (posts['selftext'] != '[deleted]') & posts['selftext'].notnull()]
# Lowercase every text column so flairs like 'Poll' and 'poll' are merged.
ModelDataLower = flairedNotSelf.apply(lambda col: col.str.lower() if col.dtype == 'object' else col)
# Keep only the flairs currently in use, and only the columns the model needs.
ModelDataFiltered = ModelDataLower[ModelDataLower['link_flair_text'].isin(flair_categories)][['combined_text', 'link_flair_text']]

ModelDataFiltered.head()

Unnamed: 0,combined_text,link_flair_text
28066,thoughts on madrona? i have an emotional suppo...,discussion
28077,soc 222 anyone has took or taking soc222(socio...,academics
28080,betsy evans - ling/anth 233 does anyone have a...,academics
28083,tell me what you want from remote teaching? th...,discussion
28088,efs experience/thoughts/opinions! i just regis...,discussion
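To see what the lowercasing and the category filter do to the raw flair list printed earlier, here is a small pure-Python sketch. The real cell does this with pandas; this version only mirrors the logic on the flair names themselves.

```python
# Raw flair values as printed by posts['link_flair_text'].unique() (NaN omitted).
raw_flairs = ['PSA', 'Rant', 'Random', 'Meme', 'Question', 'Discussion',
              'Academics', 'Student Life', 'Help', 'Event', 'Video',
              'Admissions', 'Advice', 'Poll', 'poll', 'No unrelated posts']

flair_categories = ["admissions", "academics", "student life", "advice",
                    "discussion", "meme", "rant", "psa", "event", "poll"]

# Lowercasing merges 'Poll' and 'poll' into a single label.
lowered = sorted({flair.lower() for flair in raw_flairs})

kept = [flair for flair in lowered if flair in flair_categories]
dropped = [flair for flair in lowered if flair not in flair_categories]

print(len(raw_flairs), "raw flairs ->", len(lowered), "after lowercasing")
print("kept:", kept)
print("dropped:", dropped)
```

The filter keeps the ten flairs listed in `flair_categories` and drops the remaining lowercased values.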


In [7]:
print(len(ModelDataFiltered))

9031


We then looked at the distribution of flairs to understand the data better.

In [8]:
ModelDataFiltered.groupby('link_flair_text').size().sort_values(ascending=False)

link_flair_text
advice          2360
academics       2100
student life    1551
admissions       977
discussion       865
poll             464
rant             388
psa              138
event            128
meme              60
dtype: int64
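A useful reference point for any accuracy number on this data is a classifier that always predicts the most common flair. Here is a quick sketch using the counts above (it uses the full filtered dataset, so the share in any particular test split will differ slightly):

```python
# Flair counts copied from the distribution above.
flair_counts = {
    'advice': 2360, 'academics': 2100, 'student life': 1551,
    'admissions': 977, 'discussion': 865, 'poll': 464,
    'rant': 388, 'psa': 138, 'event': 128, 'meme': 60,
}

total_posts = sum(flair_counts.values())
majority_flair = max(flair_counts, key=flair_counts.get)

# Accuracy of a classifier that always predicts the most common flair.
majority_baseline = flair_counts[majority_flair] / total_posts

print(total_posts)
print(majority_flair, round(majority_baseline, 3))
```

Any model we train should beat this share by a clear margin before we read much into its accuracy.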

In the code below we did a standard 80/20 train/test split, with the random state fixed at 52 so we get the same split on every run.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(ModelDataFiltered['combined_text'], ModelDataFiltered['link_flair_text'], test_size=0.2, random_state=52)

print(len(X_train))
print(len(X_test))

7224
1807
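The two sizes follow from the 9031 filtered posts. Here is a short sketch of the arithmetic, based on scikit-learn's behaviour of rounding a fractional test size up to a whole number of samples:

```python
import math

n_posts = 9031      # len(ModelDataFiltered)
test_size = 0.2

# A fractional test size is rounded up to a whole number of samples;
# the training set gets whatever is left.
n_test = math.ceil(test_size * n_posts)
n_train = n_posts - n_test

print(n_train, n_test)
```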


Next we vectorized the text using TF-IDF, which down-weights words that are common across posts rather than removing them.

In [12]:
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(X_train)
X_test = tfidf_vectorizer.transform(X_test)
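To make the TF-IDF weighting concrete, here is a from-scratch sketch on three made-up posts. It uses the smoothed IDF formula and L2 normalisation that `TfidfVectorizer` applies by default, but splits on whitespace instead of using the vectorizer's tokenizer, so treat it as an illustration rather than an exact reproduction.

```python
import math
from collections import Counter

# Three toy "posts" (illustrative, not from the dataset).
docs = [
    "the cse class is hard",
    "the dorm is loud",
    "the cse advising office",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of posts each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def smoothed_idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Term count times IDF, then L2-normalised so every post has unit length.
    counts = Counter(tokens)
    weights = {word: count * smoothed_idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

first_post = tfidf(tokenized[0])
for word, weight in sorted(first_post.items(), key=lambda item: item[1]):
    print(f"{word:>6} {weight:.3f}")
```

A word that appears in every post ("the") still gets a nonzero weight, but a smaller one than words that appear in only one post ("class", "hard").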

Then we trained scikit-learn's multinomial Naive Bayes model.

In [13]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Accuracy: 0.3973436635307139


In [14]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

   academics       0.69      0.47      0.56       423
  admissions       0.70      0.04      0.07       185
      advice       0.29      0.87      0.43       442
  discussion       0.00      0.00      0.00       192
       event       0.00      0.00      0.00        29
        meme       0.00      0.00      0.00         5
        poll       0.88      0.70      0.78        87
         psa       0.00      0.00      0.00        25
        rant       0.00      0.00      0.00        85
student life       0.62      0.19      0.30       334

    accuracy                           0.40      1807
   macro avg       0.32      0.23      0.21      1807
weighted avg       0.46      0.40      0.34      1807



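The gap between the macro and weighted averages in the report comes from how each flair is counted. Here is a short sketch that recomputes both F1 averages from the rounded per-flair values above, so it matches the report only up to rounding:

```python
# (f1-score, support) per flair, copied from the classification report above.
report = {
    'academics': (0.56, 423), 'admissions': (0.07, 185),
    'advice': (0.43, 442), 'discussion': (0.00, 192),
    'event': (0.00, 29), 'meme': (0.00, 5),
    'poll': (0.78, 87), 'psa': (0.00, 25),
    'rant': (0.00, 85), 'student life': (0.30, 334),
}

total_support = sum(support for _, support in report.values())

# Macro average: every flair counts equally, however few posts it has.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each flair counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support)
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is pulled down by the five flairs with an F1 of zero, while the weighted average is dominated by the larger flairs.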


- One limitation of this model comes from the use of TF-IDF vectorization: because it uses a bag-of-words approach, it ignores the context and order of words, which limits our model's ability to extract relationships between words.

- A second limitation of this approach is the data imbalance: some flairs are used much more often than others, so our model may be weighted towards predicting those flairs. The report above reflects this: advice has 0.87 recall but only 0.29 precision, while discussion, event, meme, psa, and rant all have 0.00 recall.

Going forward we plan to try Word2Vec, which can capture more of the relationships between words, and to discuss whether we want to set a minimum number of posts per flair.
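As a rough sketch of two options for the imbalance (assumptions for illustration, not something used in the cells above), the block below computes inverse-frequency weights with the same formula scikit-learn uses for "balanced" class weights, and lists the flairs that would fall under a hypothetical minimum of 200 posts.

```python
# Flair counts copied from the distribution shown earlier.
flair_counts = {
    'advice': 2360, 'academics': 2100, 'student life': 1551,
    'admissions': 977, 'discussion': 865, 'poll': 464,
    'rant': 388, 'psa': 138, 'event': 128, 'meme': 60,
}

n_posts = sum(flair_counts.values())
n_flairs = len(flair_counts)

# "Balanced" weighting: n_samples / (n_classes * class_count).
balanced_weights = {
    flair: n_posts / (n_flairs * count) for flair, count in flair_counts.items()
}

# Hypothetical threshold; the actual minimum is still to be decided.
min_posts = 200
rare_flairs = sorted(f for f, count in flair_counts.items() if count < min_posts)

for flair, weight in balanced_weights.items():
    print(f"{flair:>12} {weight:.2f}")
print("below minimum:", rare_flairs)
```

Weights like these could be turned into per-sample weights when fitting a model, so that a meme post counts for much more than an advice post.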

# How accurate will RoBERTa, a transformer-based model, be without much fine-tuning in comparison to the other methods?

Compared to the previous methods we used on our data, RoBERTa has a much more complex architecture. Because of this, our expectation is that in the long run it will perform better than methods like multinomial Naive Bayes and clustering. However, the results of RoBERTa depend on the fine-tuning of various knobs in the model, so it might take time to find the right training environment.

In [6]:
train_dataset = TensorDataset(X_train, torch.tensor(y_train))
test_dataset = TensorDataset(X_test, torch.tensor(y_test))
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=len(set(encoded_flairs)))

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
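The cell that builds `encoded_flairs` is not shown in this notebook. Since `LabelEncoder` is imported, one plausible reading is that the flair strings were mapped to integer ids. The sketch below mimics that mapping in plain Python (sorted unique labels get ids 0, 1, 2, ...); the example flairs and variable names are hypothetical.

```python
flair_categories = ["admissions", "academics", "student life", "advice",
                    "discussion", "meme", "rant", "psa", "event", "poll"]

# Ids are assigned in sorted order of the unique labels, as LabelEncoder does.
classes = sorted(set(flair_categories))
flair_to_id = {flair: idx for idx, flair in enumerate(classes)}

# A few illustrative labels, not actual rows from the dataset.
example_flairs = ['discussion', 'academics', 'academics', 'poll']
encoded_example = [flair_to_id[flair] for flair in example_flairs]

print(flair_to_id)
print(encoded_example)
print("num_labels:", len(classes))
```

With all ten flairs present, the number of distinct ids is 10, which is the value `num_labels` needs.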


In [7]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=20, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

In [8]:
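model.train()
for epoch in range(4):
    print(f'Epoch {epoch}')
    for i, batch in enumerate(tqdm(train_loader)):
        # Move the token ids and labels to the GPU (or the CPU fallback).
        batch = [item.to(device) for item in batch]
        inputs, labels = batch
        # The model expects integer token ids and integer class labels.
        inputs = inputs.long()
        labels = labels.long()
        optimizer.zero_grad()
        # Passing labels makes the model return the classification loss.
        outputs = model(inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"end of Epoch {epoch}")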


Epoch 0


  0%|          | 0/362 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 362/362 [11:13<00:00,  1.86s/it]


end of Epoch 0
Epoch 1


100%|██████████| 362/362 [11:12<00:00,  1.86s/it]


end of Epoch 1
Epoch 2


100%|██████████| 362/362 [11:11<00:00,  1.86s/it]


end of Epoch 2
Epoch 3


100%|██████████| 362/362 [11:11<00:00,  1.86s/it]

end of Epoch 3
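The warning in the output above appears because the model receives padded token ids without an attention mask. Here is a minimal sketch of what such a mask looks like, using plain lists and made-up token ids (the pad id of 1 is the `roberta-base` default; in practice the tokenizer returns the mask itself):

```python
PAD_ID = 1  # pad token id for roberta-base

# Hypothetical token ids for two posts of different lengths (0 = <s>, 2 = </s>).
token_ids = [
    [0, 713, 16, 10, 1296, 2],
    [0, 31414, 2],
]

# Pad every sequence to the length of the longest one.
max_len = max(len(ids) for ids in token_ids)
padded = [ids + [PAD_ID] * (max_len - len(ids)) for ids in token_ids]

# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = [[int(tok != PAD_ID) for tok in ids] for ids in padded]

print(padded)
print(attention_mask)
```

In the training loop, the mask would be passed alongside the ids, for example `model(inputs, attention_mask=mask, labels=labels)`, so the padding positions do not influence the prediction.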





In [10]:
model.eval()
predictions = []
true_labels = []
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

In [11]:
with torch.no_grad():
    for i, batch in enumerate(tqdm(test_loader)):
        batch = [item.to(device) for item in batch]
        inputs, labels = batch
        outputs = model(inputs)
        logits = outputs.logits
        predictions.extend(torch.argmax(logits, dim=1).cpu().tolist())
        true_labels.extend(labels.cpu().tolist())

100%|██████████| 226/226 [00:52<00:00,  4.27it/s]
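The progress bar totals (362 training batches, 226 test batches) follow from the split sizes and batch sizes. A quick sketch of the arithmetic:

```python
import math

def n_batches(n_samples, batch_size):
    # DataLoader keeps the final partial batch by default (drop_last=False).
    return math.ceil(n_samples / batch_size)

train_batches = n_batches(7224, 20)
test_batches = n_batches(1807, 8)

# Optimizer steps taken over the four training epochs.
total_steps = 4 * train_batches

print(train_batches, test_batches, total_steps)
```

At roughly 11 minutes per epoch, the four epochs above add up to about 45 minutes of training, which is what makes experimenting with batch size or learning rate expensive.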


In [12]:
accuracy = accuracy_score(true_labels, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.5633646928610957


Compared to the previous methods, RoBERTa with very little tuning reached an accuracy of about 0.56, which is about the same as the best result we had before and significantly better than the average result of all the methods (for reference, the Naive Bayes baseline above reached about 0.40). There are still a lot of changes we can make to the model that could affect the accuracy, such as batch size, learning rate, and number of epochs, as well as more detailed data exploration into the types of text under each flair.

Some limitations we encountered when running this model included the computational cost, which we partially solved by using GPU runtimes on Google Colab. However, we hit the usage limits of Google Colab's free environment, which restricts how much we can experiment with RoBERTa in a day.

## General Next Steps

- Fine-tune RoBERTa
- Try different transformer models
- Try training models with a balanced dataset
- Make a new CSV for the "model ready" data

### Alex: 
initial data cleaning, set up and ran RoBERTa model

### Cade

### Tyler:
This week I primarily worked on transforming the JSONL to CSV. I also cleaned and prepared the data for tokenization, and I trained and tested the Naive Bayes model with TF-IDF.