torch is not installed locally on Colab, so we install it (along with transformers) using pip

In [73]:
pip install transformers torch



Import packages

In [74]:
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from torch.nn.functional import softmax
import pandas as pd

Load the pre-trained RoBERTa model and tokenizer

In [75]:
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name)

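# Set the model to evaluation mode so that layers such as dropout behave deterministically during inference.
# This does not train or freeze anything. The roberta-base encoder is pre-trained on a large corpus,
# but the classification head is newly initialized (see the warning printed below).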
model.eval()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

This creates a tokenised version of the dataset that can be fed to the RoBERTa model

In [76]:
class SentimentDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]["text"]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
            return_attention_mask=True,
        )
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
        }
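
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a minimal plain-Python sketch of the same idea. It is a stand-in, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also adds special tokens and keeps the closing one when it truncates. The pad id of 1 is taken from `padding_idx=1` in the model printout above.

```python
# Illustrative stand-in for padding="max_length" and truncation=True.
# The ids below are made up; only PAD_ID comes from the model printout.

PAD_ID = 1  # RoBERTa's padding index (padding_idx=1 in the embeddings)


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncate long inputs
    n_pad = max_length - len(kept)                   # padding still needed
    input_ids = kept + [pad_id] * n_pad              # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([100, 200, 300, 400], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item therefore has the same length, which is what allows the tensors to be stacked into a batch later on.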


Now that the configuration is done, we read in the named-entity dataset generated by spaCy.

In [77]:
from google.colab import drive
drive.mount('/content/gdrive')

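# Use the absolute mount path so the read does not depend on the current working directory
df = pd.read_csv("/content/gdrive/My Drive/Dissertation Complete/college_confidential_NE.csv")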
df.rename({'content':'text'}, axis=1, inplace=True)

df.shape

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


(444622, 2)

We have to process the dataset in sections because Google Colab can disconnect the runtime without warning. If running locally, remove this cell and expect the full run to take a long time.

In [78]:
df = df[0:100000]
df = df.reset_index(drop=True)
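
The slice above covers only the first section. As a rough sketch of how the remaining sections could be enumerated, the hypothetical helper `partition_bounds` below yields the start and stop positions for each slice, using the row count reported by `df.shape` (444622) and the 100000-row section size used in this cell.

```python
def partition_bounds(n_rows, chunk_size):
    """Yield (start, stop) pairs that cover range(n_rows) in chunk_size steps."""
    for start in range(0, n_rows, chunk_size):
        # The last section is shorter when n_rows is not a multiple of chunk_size
        yield start, min(start + chunk_size, n_rows)


bounds = list(partition_bounds(444622, 100000))

for start, stop in bounds:
    # Each pair corresponds to one run of the notebook with df = df[start:stop]
    print(f"df[{start}:{stop}]")
```

Each run of the notebook would then use one of these pairs in place of `df[0:100000]`.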

In [79]:
df.tail()

Unnamed: 0,text,university
995,I completely agree with your guidance counselo...,Harvard
996,Thanks for the advice. I think I just need t...,Yale
997,"i would add that Yale does not take ""interest""...",Yale
998,I think they will believe your transcript. ...,Yale
999,Why don't you sell us on why you think Harvard...,Harvard


Now we make an instance of the SentimentDataset class using the NER data

In [80]:
sentiment_dataset = SentimentDataset(df, tokenizer)

This loop performs the sentiment analysis. It runs the model on each row in turn until every row has a predicted label and a probability. These values are saved to a dataframe, "sentiment_df".

In [81]:
results = []

for idx in range(len(sentiment_dataset)):
    sample = sentiment_dataset[idx]
    input_ids = sample["input_ids"]
    attention_mask = sample["attention_mask"]
    inputs = {
        "input_ids": input_ids.unsqueeze(0),
        "attention_mask": attention_mask.unsqueeze(0),
    }

    with torch.no_grad():
        outputs = model(**inputs)

    probs = softmax(outputs.logits, dim=1)
    sentiment_label = torch.argmax(probs, dim=1).item()
    sentiment_prob = probs[0][sentiment_label].item()
    text = df.iloc[idx]["text"]

    results.append({"text": text, "sentiment_label": sentiment_label, "sentiment_prob": sentiment_prob})

sentiment_df = pd.DataFrame(results)
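
The softmax and argmax steps in the loop above can be written out in plain Python. This is a minimal sketch with hypothetical logits (not taken from the model) for a two-label head, and the function names are illustrative. It shows that when the two logits are nearly equal, as can happen with a classification head that has not been fine-tuned, the winning probability sits just above 0.5.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label_and_prob(logits):
    """Return the index of the largest probability and that probability."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# Hypothetical logits: nearly equal, then clearly separated
near_label, near_prob = top_label_and_prob([0.02, -0.03])
clear_label, clear_prob = top_label_and_prob([-3.0, 3.0])

print(near_label, round(near_prob, 4))
print(clear_label, round(clear_prob, 4))
```

Because `sentiment_prob` is the probability of whichever label won, it can never fall below 0.5 for two labels, and it needs to be read together with `sentiment_label`.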

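Now we take the probabilities from the model output and assign them to our NER dataframe, row by row. Note that this value is the probability of the predicted label, so it does not by itself say which label was predicted.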

In [82]:
df['RobertA_score'] = sentiment_df['sentiment_prob']

Write the results to a CSV for use in the ranking. The file is opened in append mode, so each section is added to the end of the same file.

In [83]:
from google.colab import drive

drive.mount('/content/gdrive')
path = '/content/gdrive/My Drive/Dissertation Complete/RobertA_base_results.csv'

with open(path, 'a') as f:
  df.to_csv(f, encoding='utf-8', index=False, header=False)
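
Because the cell above appends with `header=False`, it never writes a header row itself. One possible way to handle this, sketched here with the standard `csv` module rather than pandas, is to write the header only when the file is new or empty. `append_rows` is a hypothetical helper, and the path, rows and scores below are made up for the demonstration.

```python
import csv
import os
import tempfile


def append_rows(path, header, rows):
    """Append rows to a CSV, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)


# Demonstration in a temporary directory with made-up rows
demo_path = os.path.join(tempfile.mkdtemp(), "results_demo.csv")
header = ["text", "university", "RobertA_score"]

append_rows(demo_path, header, [["first section row", "Yale", 0.51]])
append_rows(demo_path, header, [["second section row", "Harvard", 0.52]])

with open(demo_path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))

print(written)
```

The same existence check could be used to choose the `header` argument of `df.to_csv` on each run.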

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


This is just to confirm that the partitioned sections have run correctly; ignore it if running locally

In [89]:
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')
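# Use the absolute mount path so the read does not depend on the current working directory
df = pd.read_csv("/content/gdrive/My Drive/Dissertation Complete/RobertA_base_results.csv")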

df.tail()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Unnamed: 0,text,university,RobertA_score
444617,and why did Cornell have to put up nets under ...,Cornell,0.513806
444618,Because Cornell has lots of bridges.,Cornell,0.512402
444619,When a lifelong friend of my brother was an ...,MIT,0.508957
444620,Heard from a Yalie who said Princeton's partie...,Princeton,0.514205
444621,Absolutely most top schools have their fair sh...,Columbia,0.513273
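
Looking at the tail confirms that the last row is present, but not that every section was written exactly once. A slightly stricter check, sketched here with a hypothetical helper `check_partitions`, compares the number of rows with the expected total (444622, from `df.shape` earlier) and looks for repeated row indices. The index lists below are made up for the demonstration.

```python
def check_partitions(indices, n_expected):
    """Return a list of problems found in the row indices; an empty list means OK."""
    problems = []
    if len(indices) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(indices)}")
    if len(set(indices)) != len(indices):
        problems.append("duplicate indices (a section may have been appended twice)")
    return problems


ok_report = check_partitions([0, 1, 2, 3], n_expected=4)
bad_report = check_partitions([0, 1, 1], n_expected=4)

print(ok_report)
print(bad_report)
```

In the notebook, `indices` could be the values of the `Unnamed: 0` column converted to a list.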
