# About this notebook
This notebook is meant to be called and ran by an eventbridge or lambda function to trigger the sentiment scoring for data from an s3 bucket. 

The sections first run through all the installations for sagemaker and other necessary libraries, then all the required imports.

The next section calls a BERT-based model from Huggingface,[cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) to create a function. This function, `rate_text(text)`, takes in argument `text<string>` and returns a `float` value where $-1<float<1$ to rate the sentiment of the text. 

The following section starts a connection with the s3 bucket and retrieves the data. It then converts it to a pandas df, then calls the function `rate_text()` on each row and stores the value in a new column called `df[mood_score]`

The next section calls another BERT-based model from Huggingface, [j-hartmann/emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base) to create another function. This function, `sentiments(text)` returns an array of positive scores, where $0<score<1$, for 7 emotions, namely 
1. Anger
2. Disgust
3. Fear
4. Joy
5. Neutral
6. Sadness
7. Surprise
meaning it returns an array of this shape:
Output:
`[[{'label': 'anger', 'score': 0.004419783595949411},
  {'label': 'disgust', 'score': 0.0016119900392368436},
  {'label': 'fear', 'score': 0.0004138521908316761},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764586851000786},
  {'label': 'sadness', 'score': 0.002092392183840275},
  {'label': 'surprise', 'score': 0.008528684265911579}]]`

The last section calls this second model and adds the scores to 7 different columns in the df. The final df is then combined and written into a destination bucket as a new csv file.

## Section 1: installs and imports

In [30]:
!pip install gensim

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [32]:
!pip install "sagemaker>=2.48.0" --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [33]:
!pip3 install transformers
!pip3 install torch torchvision

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


### Imports

In [34]:
import sagemaker
import os
import boto3
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import AutoConfig
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

   
import time
import pandas as pd
import io
import re 

import warnings
warnings.filterwarnings('ignore')

In [35]:
try:
    role = sagemaker.get_execution_role()
    session = sagemaker.Session()
    bucket = "raw-data-is459-chatgpt-sentiments"
    prefix = "write/"
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

role = get_execution_role()

print("Output will be stored in {}/{}".format(bucket, prefix))
print("\nIAM Role: {}".format(role))

Output will be stored in raw-data-is459-chatgpt-sentiments/write/

IAM Role: arn:aws:iam::183972219153:role/service-role/AmazonSageMaker-ExecutionRole-20230329T165220


## Section 2: text rating (-1 to 1 numeric score for overall sentiment)

In [36]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenizer.model_max_length = 512

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [37]:
###--- this function abstracts out the text processing. returns +ve, -ve or 0 for positive mood, negative, or neutral
def rate_text(text):
    """
    This function takes a text and puts it through the model we retrieved from huggingfaces. The model returns 
        positive: <float>
        neutral: <float>
        negative <float>
    We want to find the one with the highest magnitude and represent it as a float from -1.00 to 1.00
    """
    # sanitize the data
    text=str(text)
    if len(text)> 1200:
        text=text[:1200]
    text = preprocess(text)
    
    # call the huggingfaces model
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    
    #extract the response and turn it into a float
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    for i in range(scores.shape[0]):
        l = config.id2label[ranking[i]]
        s = scores[ranking[i]]
    
    winner=config.id2label[ranking[np.argmax(scores)]]
    multiplier = 1 if winner=="positive" else 0 if winner=="neutral" else -1
    print(multiplier * scores.max())
    return multiplier * scores.max()

def score_mood(df):
    """
    This is a helper function to help pandas apply the above rate_text function more easily to a dataframe
    """
    df["mood_score"] = df["text"].apply(rate_text)
    return df

## Section 3: Reading from s3 bucket
Currently this uses the 500 line test data under the intermediate folder on our project s3 bucket. Calling the model and modifying the test data takes 90s

In [38]:
conn = boto3.client('s3')
bucket = "raw-data-is459-chatgpt-sentiments"
subfolder = "read"
bucket_contents = conn.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']

## Section 4: Sentiment scores
This part modifies the df to append 7 more columns for 7 emotions: `['anger','disgust','fear','joy','neutral','sadness','surprise']`. Going through the test data takes 50s

In [39]:
from transformers import pipeline
classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True)
results=classifier("I love this!")

In [2]:
def update_moods(df):
    """
    This function calls the classifier on each row of a df
    """
    for index, row in df.iterrows():
        # Apply the classifier function to the "text" column
        text=str(row["text"])
        if len(text)>1200:
            text=text[:1200]
        results = classifier(text)
        results = results[0]
        # store results in the df
        for result in results:
            df.loc[index, result["label"]] = result["score"]
    return df

## Section 5: Writing to s3 and calling the pipeline functions
We first take the csv from s3, then batch them, call the above functions in sequence, then write the dataframe as a csv again into another folder

In [41]:
# These functions are for the main function to call to batch and write the results to s3

def get_batch_name(idx):
    return f'batch_{idx // 500}'

def write_to_s3(df):
    timestamp=str(time.time())
    fname = f"final_data_{timestamp}.csv"
    csv_buffer = df.to_csv(index=False)
    s3_object = os.path.join(prefix, fname)
    boto3.Session().resource("s3").Bucket(bucket).Object(s3_object).put(Body=csv_buffer)

    s3_train_data = "s3://{}/{}".format(bucket, s3_object)
    print("Uploaded data to S3: {}".format(s3_train_data))

In [43]:
def main(df):
    print("----- batching")
    batched_df = df.groupby(get_batch_name, sort=False)
    print("---- looping")            
    batch_no=0
    finished_batches=0
    for batch_name, batch_df in batched_df:
        batch_no+=1
        if batch_no<=finished_batches:
            batch_no+=1
        else:
            batch_df = score_mood(batch_df)
            print("===sentiment scored===")
            # Initialise the columns
            sentiments = ['anger','disgust','fear','joy','neutral','sadness','surprise']
            for sentiment in sentiments:
                batch_df[sentiment] = 0.0
            batch_df=update_moods(batch_df)
            print("===moods added===")
            print(f"This batch_df is {len(batch_df)} long")
            print("---writing to s3----")
            write_to_s3(batch_df)
            print("====Batch done=====")
            
        
def call_segment(df,file_path):
    print(file_path)
    s3 = boto3.resource('s3')
    main(df)
    timestamp=str(time.time())
    # delete the original file from the source path
    s3.Object(bucket_name=bucket, key=file_path).delete()
    print(f"----deleted {file_path}")

df = pd.DataFrame();
# print(contents)
for f in bucket_contents:
    file_path=f['Key']
    print(file_path)
#     if "cleaned_twitter.csv" in file_path: # i will use cleaned_twitter first, all dataset needs to be in the same format
    if not(file_path==subfolder+"/"):
        obj = conn.get_object(Bucket=bucket, Key=f"{file_path}")
        contents = obj['Body'].read().decode('utf-8')
        data = pd.read_csv(io.StringIO(contents))
        df = pd.concat([data], axis=0)
        call_segment(df,file_path)
print("----success,  all done----")
    

read/
read/reddit_ChatGPT_1680084943.csv
read/reddit_ChatGPT_1680084943.csv
----- batching
---- looping
0.8354777693748474
-0.5499265193939209
===sentiment scored===
===moods added===
This batch_df is 2 long
---writing to s3----
Uploaded data to S3: s3://raw-data-is459-chatgpt-sentiments/write/final_data_1680512659.4417644.csv
====Batch done=====
----deleted read/reddit_ChatGPT_1680084943.csv
read/reddit_ChatGPT_1680100961.csv
read/reddit_ChatGPT_1680100961.csv
----- batching
---- looping
-0.8242226243019104
-0.8727233409881592
-0.662940263748169
===sentiment scored===
===moods added===
This batch_df is 3 long
---writing to s3----
Uploaded data to S3: s3://raw-data-is459-chatgpt-sentiments/write/final_data_1680512661.786155.csv
====Batch done=====
----deleted read/reddit_ChatGPT_1680100961.csv
read/reddit_ChatGPT_1680103686.csv
read/reddit_ChatGPT_1680103686.csv
----- batching
---- looping
-0.690098226070404
0.5132157206535339
-0.8242226243019104
-0.8727233409881592
-0.662940263748169


# Citations

## Hartmann's emotion DistilRoBERTa model
```
@misc{hartmann2022emotionenglish,
  author={Hartmann, Jochen},
  title={Emotion English DistilRoBERTa-base},
  year={2022},
  howpublished = {\url{https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/}},
}
```

## Loureiro et. al.'s Twitter RoBERTa model
```
@inproceedings{loureiro-etal-2022-timelms,
    title = "{T}ime{LM}s: Diachronic Language Models from {T}witter",
    author = "Loureiro, Daniel  and
      Barbieri, Francesco  and
      Neves, Leonardo  and
      Espinosa Anke, Luis  and
      Camacho-collados, Jose",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-demo.25",
    doi = "10.18653/v1/2022.acl-demo.25",
    pages = "251--260"
}
```