<font style='font-size:1.5em'>**Initial Data Analysis**</font>

<font style='font-size:1.2em'>Using Reddit CSV</font>

**Author: Sentipedia**  

**Course: DS105L Project** 

**DATE: 3/29/2023** 

---


## Imports 

### Importing Libraries 

In [3]:
# importing libraries for web scraping and API access
import requests
from bs4 import BeautifulSoup

# importing libraries for dataframes 
import pandas as pd
import numpy as np

# importing libraries for plots 
from plotnine import *

# importing libraries for plotting networks
import networkx as nx
import matplotlib.pyplot as plt

## Data

### Looking at the Data 

In [4]:
# importing NLTK and its VADER sentiment analyzer
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# downloading the NLTK resources: tokenizer, POS tagger, and VADER lexicon
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /Users/ap/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ap/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ap/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [5]:
df_reddit = pd.read_csv('reddit.csv', usecols=['comment'])
df_reddit.head()

Unnamed: 0,comment
0,"Agree. With the Tories pushing Brexit, they h..."
1,I just thought that with plenty of vaccination...
2,"If you don't listen to your core voter base, t..."
3,Much as I agree with your potential projection...
4,"Brexit, the word, is why it succeeded. Nobody ..."
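Reddit comment text can contain HTML entities; for example, `&gt;` marks a quoted line. A tokenizer splits such an entity into several tokens, so it may help to decode entities before any further processing. The following is a minimal sketch using the standard library's `html.unescape`; the helper name `clean_comment` and the sample string are illustrative assumptions, not part of the original analysis.

```python
import html

def clean_comment(text):
    """Decode HTML entities (e.g. '&gt;' becomes '>') and trim surrounding whitespace."""
    return html.unescape(text).strip()

# hypothetical sample in the style of a quoted Reddit comment
sample = "&gt; quoted line &amp; a reply "
print(clean_comment(sample))
```

With pandas, the helper could be applied to the whole column via `df_reddit['comment'].apply(clean_comment)` before tokenizing.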


### Using NLTK

In [7]:
df_reddit_nltk = df_reddit.head(1000).copy()

analyzer = SentimentIntensityAnalyzer()

# tokenize and POS-tag each comment
# (column-wise apply avoids chained assignment such as df['pos'][i] = ...,
# which pandas warns about and which may not write back to the dataframe)
df_reddit_nltk['pos'] = df_reddit_nltk['comment'].apply(
    lambda comment: nltk.pos_tag(nltk.word_tokenize(comment))
)

# score each comment with VADER (dict with neg, neu, pos, and compound)
df_reddit_nltk['score'] = df_reddit_nltk['comment'].apply(analyzer.polarity_scores)

df_reddit_nltk

Unnamed: 0,comment,pos,score
0,"Agree. With the Tories pushing Brexit, they h...","[(Agree, NNP), (., .), (With, IN), (the, DT), ...","{'neg': 0.22, 'neu': 0.606, 'pos': 0.174, 'com..."
1,I just thought that with plenty of vaccination...,"[(I, PRP), (just, RB), (thought, VBD), (that, ...","{'neg': 0.063, 'neu': 0.705, 'pos': 0.232, 'co..."
2,"If you don't listen to your core voter base, t...","[(If, IN), (you, PRP), (do, VBP), (n't, RB), (...","{'neg': 0.025, 'neu': 0.959, 'pos': 0.016, 'co..."
3,Much as I agree with your potential projection...,"[(Much, JJ), (as, IN), (I, PRP), (agree, VBP),...","{'neg': 0.053, 'neu': 0.871, 'pos': 0.076, 'co..."
4,"Brexit, the word, is why it succeeded. Nobody ...","[(Brexit, NN), (,, ,), (the, DT), (word, NN), ...","{'neg': 0.0, 'neu': 0.776, 'pos': 0.224, 'comp..."
...,...,...,...
995,"Exactly, many interpreted it as a vote for a ...","[(Exactly, RB), (,, ,), (many, JJ), (interpret...","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
996,Just because parties committed to implimenting...,"[(Just, RB), (because, IN), (parties, NNS), (c...","{'neg': 0.026, 'neu': 0.762, 'pos': 0.212, 'co..."
997,I understand that but to persuade the EU27 tha...,"[(I, PRP), (understand, VBP), (that, DT), (but...","{'neg': 0.0, 'neu': 0.906, 'pos': 0.094, 'comp..."
998,&gt; The whole partnership is damaged and the ...,"[(&, CC), (gt, NN), (;, :), (The, DT), (whole,...","{'neg': 0.067, 'neu': 0.74, 'pos': 0.193, 'com..."
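Each `score` entry is a dictionary with `neg`, `neu`, `pos`, and `compound` keys. One way to reduce it to a single label is to threshold the `compound` value. The sketch below uses cut-offs of ±0.05, which are the ones commonly suggested for VADER; the helper name `label_from_score` and the sample dictionaries are assumptions for illustration and are not taken from the dataset.

```python
def label_from_score(score, threshold=0.05):
    """Map a VADER score dict to 'positive', 'negative', or 'neutral'."""
    compound = score["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# hypothetical score dictionaries (not taken from the dataset)
examples = [
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
    {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.5},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_score(s) for s in examples]
print(labels)
```

On the dataframe above, this could be applied as `df_reddit_nltk['score'].apply(label_from_score)`.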


### Using DistilBERT Model

In [1]:
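import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# load the pretrained tokenizer and the base DistilBERT checkpoint
# note: the base checkpoint has no fine-tuned classification head, so the
# classifier weights are newly initialized and the predicted labels are
# not meaningful until the model is fine-tuned
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")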

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_clas

In [17]:
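# run a single sentence through the model without tracking gradients
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# pick the class with the highest logit and look up its generic label name
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]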

'LABEL_1'

In [20]:
df_reddit_DistilBERT = df_reddit.head(10).copy()

def predict_label(comment):
    # tokenize one comment; truncation keeps long comments within the model's input limit
    inputs = tokenizer(comment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # return the label name of the highest-scoring class
    return model.config.id2label[logits.argmax().item()]

# column-wise apply avoids chained assignment such as df['predicted'][i] = ...
df_reddit_DistilBERT['predicted'] = df_reddit_DistilBERT['comment'].apply(predict_label)

df_reddit_DistilBERT

Unnamed: 0,comment,predicted
0,"Agree. With the Tories pushing Brexit, they h...",LABEL_1
1,I just thought that with plenty of vaccination...,LABEL_1
2,"If you don't listen to your core voter base, t...",LABEL_1
3,Much as I agree with your potential projection...,LABEL_1
4,"Brexit, the word, is why it succeeded. Nobody ...",LABEL_1
5,"""Brexit means Brexit""... =""\n\nThe one stand o...",LABEL_1
6,The word actually wasnt really used much anyw...,LABEL_1
7,"It was used, but only as a kind of novelty ter...",LABEL_1
8,This is exact right - the portmanteau of Grex...,LABEL_1
9,Its probably somewhere between the two of us ...,LABEL_1


In [12]:
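text = "This is an example sentence to encode using DistilBERT."
# note: tokenizer.encode() already adds [CLS] and [SEP], so writing them into
# the string duplicates both tokens (ids 101 and 102 each appear twice)
encoded_text = tokenizer.encode("[CLS] " + text + " [SEP]")
input_ids = torch.tensor([encoded_text])
outputs = model(input_ids)
# for a sequence classification model, outputs[0] holds the logits,
# not the hidden states
logits = outputs[0]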

In [13]:
input_ids

tensor([[  101,   101,  2023,  2003,  2019,  2742,  6251,  2000,  4372, 16044,
          2478,  4487, 16643, 23373,  1012,   102,   102]])

In [14]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.1000, -0.0733]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
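The two logits are nearly equal, so the model is close to undecided between the two classes. Converting the logits to probabilities with a softmax makes this explicit. The following is a minimal pure-Python sketch using the logits printed above; the `softmax` function is defined here for illustration rather than taken from a library.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# logits from the SequenceClassifierOutput above
probs = softmax([-0.1000, -0.0733])
print([round(p, 3) for p in probs])
```

Both probabilities come out close to 0.5. In PyTorch, the same conversion is available as `torch.softmax(logits, dim=-1)`.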