## Using Notebook Environments 
1. To run a cell, press `Shift + Enter`. The notebook executes the code in the cell and moves to the next cell. If the cell is a markdown cell (text only), the notebook renders the markdown and moves to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If this happens, you can always restart the kernel by clicking `Runtime` in the menu bar (if you're using Colab) and selecting `Restart runtime`. This clears all variables, so you will need to re-run the cells you need.
3. The value of the last expression in a cell is displayed below the cell. To display more than one value, use the `print()` function as usual.
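
As a small illustration of the last point, the sketch below shows a cell that prints two intermediate values and ends with a bare expression. The variable names are made up for this example; the numbers are simply counts that appear later in this notebook.

```python
# Hypothetical cell contents; the variable names are invented for illustration
n_tweets = 1000
n_positive = 507

share_positive = n_positive / n_tweets

# Values in the middle of a cell are only shown if printed explicitly
print(n_tweets)
print(n_positive)

# In a notebook, the last expression of a cell is displayed automatically
share_positive
```

Run as a plain Python script, the final line would show nothing; the automatic display is a feature of the notebook environment.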

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install transformers

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to the day_1 workshop folder
    %cd /content/drive/MyDrive/LLM4SocBeSci/day_1

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). These are usually standardized nicknames. Here we make use of two packages:

1. `pandas`: A very popular package for reading and manipulating data in Python.
2. `transformers`: A Hugging Face package for loading and using transformer-based models.

In [2]:
import pandas as pd
from transformers import pipeline

## Sentiment Analysis 

The dataset contains 1000 randomly sampled tweets from [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140), collected by [Go et al., (2009)](https://www-cs-faculty.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf). It contains two columns:
1. `tweet`: The text of the tweet.
2. `sentiment`: The sentiment of the tweet (`'POSITIVE'` or `'NEGATIVE'`)

In [3]:
# Load the 1000 sampled tweets from the CSV file
twitter = pd.read_csv('twitter.csv')
# Recode the numeric Sentiment140 labels (0 and 4) as text labels
twitter['sentiment'] = twitter['sentiment'].replace({0: 'NEGATIVE', 4: 'POSITIVE'})
twitter

Unnamed: 0,tweet,sentiment
0,@JenniferHen ...alk or anything hope you had ...,POSITIVE
1,Super hott outsidee! On my way to matamoros 2 ...,POSITIVE
2,Is updating twitter from her NEW computer my ...,POSITIVE
3,hates that i have to work today! i want to be ...,NEGATIVE
4,@twinsquirrel me too... nice to have coffee w...,POSITIVE
...,...,...
995,It's #followfriday so I'm suggesting you follo...,POSITIVE
996,Jack daniels blew up in the cooler,NEGATIVE
997,Strawberry Mentos are the BEST!! thanks again ...,POSITIVE
998,@shaylay11 awww thats soo saad,NEGATIVE


We can see that the dataset contains relatively balanced classes of positive and negative tweets.

In [4]:
# Printing the balance of positive and negative tweets
twitter['sentiment'].value_counts()

sentiment
POSITIVE    507
NEGATIVE    493
Name: count, dtype: int64
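
As a minimal sketch of the two steps above, the label recoding and the balance check can be tried on a small made-up dataframe. The `toy` dataframe and its four tweets are invented for illustration; passing `normalize=True` to `value_counts()` returns proportions instead of raw counts.

```python
import pandas as pd

# Made-up data using the same numeric coding as Sentiment140 (0 and 4)
toy = pd.DataFrame({
    'tweet': ['so tired today', 'love this song', 'best day ever', 'missed my bus'],
    'sentiment': [0, 4, 4, 0],
})

# Recode the numeric labels as text labels
toy['sentiment'] = toy['sentiment'].replace({0: 'NEGATIVE', 4: 'POSITIVE'})

# Proportion of each class rather than the raw counts
shares = toy['sentiment'].value_counts(normalize=True)
print(shares)
```

Proportions make it easier to compare class balance across datasets of different sizes.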

We will now load the `transformers` sentiment analysis pipeline to predict the sentiment of the tweets. The pipeline is a high-level API that allows us to easily use pre-trained models for a variety of tasks. In this case, we will use a version of the DistilBERT model that has been fine-tuned for sentiment analysis on the SST-2 dataset.

In [None]:
# Load sentiment analysis pipeline
pipe = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

With the pipeline loaded, we can now use it to predict the sentiment of the tweets. We will use the `tweet` column of the `twitter` dataframe as input to the pipeline.

In [5]:
# Predict sentiment of tweets
predictions = pipe(twitter['tweet'].tolist())

# Display the first 10 predictions
predictions[:10]

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9996434450149536},
 {'label': 'POSITIVE', 'score': 0.989424467086792},
 {'label': 'POSITIVE', 'score': 0.9993859529495239},
 {'label': 'NEGATIVE', 'score': 0.9835692048072815},
 {'label': 'POSITIVE', 'score': 0.9971075654029846},
 {'label': 'POSITIVE', 'score': 0.9996768236160278},
 {'label': 'NEGATIVE', 'score': 0.7675129771232605},
 {'label': 'POSITIVE', 'score': 0.9445950984954834},
 {'label': 'NEGATIVE', 'score': 0.9981271624565125},
 {'label': 'POSITIVE', 'score': 0.9478254914283752}]

The pipeline returns a list of dictionaries, where each dictionary contains the predicted sentiment and the corresponding score. We will extract the predicted sentiment and add it to the `twitter` dataframe.

In [7]:
# Joining the predictions with the original data
twitter['sentiment_pred'] = [x['label'] for x in predictions]
twitter

Unnamed: 0,tweet,sentiment,sentiment_pred
0,@JenniferHen ...alk or anything hope you had ...,POSITIVE,NEGATIVE
1,Super hott outsidee! On my way to matamoros 2 ...,POSITIVE,POSITIVE
2,Is updating twitter from her NEW computer my ...,POSITIVE,POSITIVE
3,hates that i have to work today! i want to be ...,NEGATIVE,NEGATIVE
4,@twinsquirrel me too... nice to have coffee w...,POSITIVE,POSITIVE
...,...,...,...
995,It's #followfriday so I'm suggesting you follo...,POSITIVE,NEGATIVE
996,Jack daniels blew up in the cooler,NEGATIVE,NEGATIVE
997,Strawberry Mentos are the BEST!! thanks again ...,POSITIVE,POSITIVE
998,@shaylay11 awww thats soo saad,NEGATIVE,NEGATIVE
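
The score can be kept alongside the label as well. The sketch below assumes a short, hand-written list in the same format the pipeline returns (the name `example_predictions` and the rounded scores are illustrative, not real model output) and turns it into a dataframe. The 0.9 cut-off for flagging uncertain predictions is an arbitrary choice for this example.

```python
import pandas as pd

# Hand-written stand-in for pipeline output: one dict per input text
example_predictions = [
    {'label': 'NEGATIVE', 'score': 0.9996},
    {'label': 'POSITIVE', 'score': 0.9894},
    {'label': 'NEGATIVE', 'score': 0.7675},
]

# A list of dicts converts directly to a dataframe, one column per key
pred_df = pd.DataFrame(example_predictions)
pred_df = pred_df.rename(columns={'label': 'sentiment_pred', 'score': 'score_pred'})

# Flag predictions the model is less sure about (threshold is arbitrary)
pred_df['low_confidence'] = pred_df['score_pred'] < 0.9
pred_df
```

Keeping the score makes it possible to check later whether misclassified tweets tend to be the ones with lower scores.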


We can see that the pipeline has predicted the sentiment of the tweets. We will now check the accuracy of the model by comparing the predicted sentiment with the actual sentiment.

In [8]:
# Checking the accuracy of the model
true_or_false = (twitter['sentiment'] == twitter['sentiment_pred'])
accuracy = true_or_false.sum() / len(true_or_false)
accuracy

0.719
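
Accuracy alone does not show which kind of error the model makes. As a hedged sketch on made-up labels (the `toy` dataframe below is invented, not taken from the tweet data), `pd.crosstab` tabulates true against predicted labels, and `.mean()` on the boolean comparison gives the same accuracy as dividing the sum by the length.

```python
import pandas as pd

# Made-up true and predicted labels for five tweets
toy = pd.DataFrame({
    'sentiment':      ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE'],
    'sentiment_pred': ['NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE'],
})

# The mean of a boolean series is the share of True values, i.e. the accuracy
toy_accuracy = (toy['sentiment'] == toy['sentiment_pred']).mean()

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(toy['sentiment'], toy['sentiment_pred'])

print(toy_accuracy)
print(confusion)
```

Applied to the `twitter` dataframe, the same two lines would show whether the errors are mostly positive tweets predicted as negative or the reverse.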

## Text Generation

We will now use the `transformers` text generation pipeline. We begin by loading the pipeline with the `gpt2` model. The pipeline will generate text based on a given prompt.

In [9]:
pipe = pipeline('text-generation', model='gpt2')


TASK: Set the variable `prompt` below to any text you would like the model to continue. Feel free to play around with the `max_length` parameter, which caps the total length in tokens (prompt included), to see how it affects the generated text.

In [None]:
prompt = "[PLEASE ENTER YOUR PROMPT HERE]"

# Generate text based on the prompt
output = pipe(prompt, max_length=100)

# Display the generated text
output[0]['generated_text']

TASK: Try replacing `'gpt2'` above with `'gpt2-medium'` or another text generation model on the [Hugging Face model hub](https://huggingface.co/models) to see how the generated text changes.