## Using Notebook Environments 
1. To run a cell, press `Shift + Enter`. The notebook will execute the code in the cell and move to the next cell. If the cell is a markdown cell (text only), it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the kernel by clicking Runtime in the menu bar (if you're using Colab) and selecting `Restart runtime`. This will clear all variables and outputs.
3. The value of the final expression in a cell is displayed below the cell. If you want to display multiple values, use the `print()` function as usual.
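
As a small illustration of point 3, the cell below is a minimal sketch with made-up variable names (`a` and `b` are not used elsewhere in the workshop). The two `print()` calls show their values wherever they appear, while the bare expression on the last line is displayed only because it is the final line of the cell.

```python
a = 2
b = 3

# print() shows a value from anywhere in the cell
print(a)
print(b)

# In a notebook, the value of the cell's last expression (here 5) is displayed
a + b
```

If `a + b` were moved above the `print()` calls, the notebook would no longer display it, since only the final expression is shown automatically.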

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

## Environment Setup (run before presentation)

In [1]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install transformers sentence-transformers &> /dev/null

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to the workshop folder (day_1)
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_1

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). These nicknames are usually standardized by convention. Here we make use of three packages:

1. `pandas`: A very popular package for reading and manipulating data in Python.
2. `sentence_transformers`: A package for extracting features from text data using transformer-based models.
3. `transformers`: A Hugging Face package for loading and manipulating transformer-based models.

In [2]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from transformers import pipeline

In [None]:
# Download the relevant models by loading each one once (ignore the details here)
SentenceTransformer('all-MiniLM-L6-v2')
pipeline('zero-shot-classification', model='valhalla/distilbart-mnli-12-1')
pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
pipeline('text-generation', model='gpt2')

**[RUN TILL HERE BEFORE PRESENTATION TO GIVE TIME FOR INSTALLATION]**

## Feature Extraction

We will begin by extracting features (numerical representations) from the text data using the `sentence-transformers` package. We will use the following three sentences, stored as a list of strings, as input to the model:

In [None]:
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

We will use the `all-MiniLM-L6-v2` model to extract features from the sentences. The model encodes each sentence into a 384-dimensional vector, so three sentences yield three rows of 384 numbers. We will then display the features as a pandas dataframe for easy viewing.

In [14]:
# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)

# Print the features as a pandas dataframe
pd.DataFrame(features, index=sentences)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I feel great this morning | -0.026462 | -0.044373 | 0.072443 | 0.034526 | 0.089534 | -0.050451 | 0.018811 | 0.071296 | -0.020522 | -0.043637 | ... | -0.005689 | -0.000328 | -0.049055 | 0.016308 | -0.027642 | 0.017276 | 0.065253 | 0.017496 | -0.02281 | -0.036687 |
| I am feeling very good today | -0.043895 | -0.020341 | 0.066563 | -0.00631 | 0.02598 | -0.04042 | 0.079304 | -0.0097 | -0.04292 | -0.025988 | ... | -0.045309 | 0.049151 | -0.049057 | 0.017821 | -0.018061 | -0.010441 | 0.04307 | 0.01844 | -0.008274 | -0.006016 |
| I am feeling terrible | 0.017495 | -0.057904 | 0.033315 | 0.00171 | 0.051957 | -0.048159 | 0.007659 | 0.119096 | 0.029929 | -0.06896 | ... | 0.038813 | 0.003015 | -0.074585 | -0.018391 | -0.026449 | 0.005867 | 0.051495 | -0.009829 | 0.030009 | -0.064299 |


**TASK 1**: Have a scroll through the features. Can you see that the first two sentences are more similar to each other than they are to the third sentence? Why do you think this is the case?

**TASK 2**: Try copy-pasting one of the sentences, replacing a word or two with synonyms, and adding the result to the list of `sentences`. For instance, you could change "I feel *great* this morning" to "I feel *fantastic* this morning". What do you notice about the features of this new sentence compared to the original?
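
Comparing 384 numbers by eye is difficult, so it can help to summarise how similar two feature vectors are with a single number. One common choice is cosine similarity, which is close to 1 when two vectors point in the same direction and close to 0 when they are unrelated. The sketch below is not part of the original workshop code: the helper `cosine_similarity` and the three short toy vectors are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors (NOT real model output)
toy_a = [1.0, 2.0, 0.0]
toy_b = [2.0, 4.0, 0.0]  # points in the same direction as toy_a
toy_c = [0.0, 0.0, 3.0]  # orthogonal to toy_a

print(cosine_similarity(toy_a, toy_b))  # close to 1: same direction
print(cosine_similarity(toy_a, toy_c))  # 0: nothing in common
```

Assuming the `features` array from the cell above is available, the same helper could be applied to its rows, for example `cosine_similarity(features[0], features[1])` versus `cosine_similarity(features[0], features[2])`.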

## Text Generation
We will now use the `transformers` text generation pipeline. We begin by loading the pipeline with `gpt2`. 


In [6]:
pipe = pipeline('text-generation', model='gpt2')

We will now generate text based on a prompt. The prompt is the starting point from which the model generates text. Since GPT-2 has not been assistant-tuned, it does not respond to the prompt as a chatbot would; it simply "tries to" produce text that is likely to follow it. We set the `max_length` parameter to 100, which caps the total length of the output (prompt plus generated text) at 100 tokens.

In [8]:
prompt = """
    Once upon a time in a land far far away, there was a young prince named John. He was known for his bravery and courage. 
    One day, he decided to go on an adventure to explore the unknown lands.
"""

# Generate text based on the prompt
output = pipe(prompt, max_length=100)

# Print the generated text
output[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'\n    Once upon a time in a land far far away, there was a young prince named John. He was known for his bravery and courage. \n    One day, he decided to go on an adventure to explore the unknown lands.\n    To enter the depths of the earth to learn more about the people he found, would be to face the unknown, or he might face the darkness that lies there. \n    This prince named John grew'
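
To make the idea of "continuing a prompt" concrete, here is a deliberately tiny sketch of the same loop: look at the last token, append a likely next token, and stop once a length limit is reached. Everything in it is hypothetical. The lookup table `NEXT_WORD` and the function `toy_generate` stand in for the probabilities that a real model such as GPT-2 learns from data, and splitting on spaces stands in for real tokenization.

```python
# Toy illustration only: a hand-written table of the "most likely next word"
NEXT_WORD = {
    "once": "upon",
    "upon": "a",
    "a": "time",
    "time": "there",
    "there": "was",
    "was": "a",
}


def toy_generate(prompt, max_length):
    """Append the most likely next word until max_length tokens are reached.

    The limit counts the prompt's tokens as well as the newly generated ones.
    """
    tokens = prompt.lower().split()
    while tokens and len(tokens) < max_length and tokens[-1] in NEXT_WORD:
        tokens.append(NEXT_WORD[tokens[-1]])
    return " ".join(tokens)


print(toy_generate("Once upon", max_length=6))
```

Because the limit includes the two prompt tokens, only four new words are added here. A longer prompt would leave less room for generated text under the same limit.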

**TASK 1**: Enter a new prompt in the variable `prompt` for the model to continue. Feel free to play around with the `max_length` parameter to see how it affects the generated text.

**TASK 2**: Try replacing `'gpt2'` above with `'gpt2-medium'` or another text generation model on the [Hugging Face model hub](https://huggingface.co/models) to see how the generated text changes.

## Sentiment Analysis 

In addition to feature extraction, Hugging Face's `transformers` library provides a high-level API for a variety of other tasks. These tasks can be viewed in the left-hand panel under "Natural Language Processing" on the [Hugging Face model hub](https://huggingface.co/models). Many of these tasks include models that have been fine-tuned on specific datasets to perform well on the task at hand. As an example of such a task, we will now use the sentiment analysis pipeline to predict the sentiment of a few sentences.

We load the `transformers` sentiment analysis pipeline and apply it to the same three sentences as before.

In [5]:
# Define the sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load sentiment analysis pipeline
pipe = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Predict sentiment of the sentences
sentiments = pipe(sentences)

# Print the predicted sentiments as a pandas dataframe
pd.DataFrame(sentiments, index=sentences)

| | label | score |
| --- | --- | --- |
| I feel great this morning | POSITIVE | 0.999873 |
| I am feeling very good today | POSITIVE | 0.999872 |
| I am feeling terrible | NEGATIVE | 0.999489 |


As you can see, not only does the model predict the sentiment of the sentences (`'label'`), but it also provides a confidence score for each prediction (`'score'`).
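
The pipeline returns its predictions as a list of dictionaries, one per sentence, each with a `'label'` and a `'score'` entry. The sketch below shows one way such results might be used without a dataframe. To keep it runnable without loading the model, the values in `example_sentiments` are copied from the output above, and the names `example_sentences`, `example_sentiments` and `threshold`, as well as the cut-off of 0.9, are illustrative assumptions.

```python
example_sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible",
]

# Copied from the output above; same list-of-dictionaries shape as the pipeline result
example_sentiments = [
    {"label": "POSITIVE", "score": 0.999873},
    {"label": "POSITIVE", "score": 0.999872},
    {"label": "NEGATIVE", "score": 0.999489},
]

# Hypothetical cut-off for treating a prediction as confident
threshold = 0.9

# Pair each sentence with its prediction and print both
for sentence, result in zip(example_sentences, example_sentiments):
    print(f"{sentence!r}: {result['label']} ({result['score']:.3f})")

# Keep only the sentences confidently predicted to be negative
confident_negative = [
    sentence
    for sentence, result in zip(example_sentences, example_sentiments)
    if result["label"] == "NEGATIVE" and result["score"] >= threshold
]
print(confident_negative)
```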