# Day 1
In this exercise, we will explore the capabilities of transformer-based models for natural language processing (NLP) using the Hugging Face (HF) `transformers` library. We will use the `sentence-transformers` package to extract features from text data and the `transformers` library to perform sentiment analysis and text generation tasks.

By the end of this exercise, you will have learned how to:
- Extract features from text data using transformer-based models
- Perform sentiment analysis on text data
- Generate text using transformer-based models


## Using Notebook Environments 
1. To run a cell, press `Shift + Enter`. The notebook will execute the code in the cell and move to the next cell. If the cell is a markdown cell (text only), it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the kernel by clicking Runtime in the menu bar (if you're using Colab) and selecting `Restart runtime`. This will clear all variables and outputs.
3. The value of the last expression in a cell is displayed below the cell. If you want to display more than one value, use the `print()` function as usual.
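As a small illustration of the last point in the list above, the sketch below contrasts `print()` with the automatic display of a cell's final line. The variable names `total` and `doubled` are made up for this example.

```python
# Two ordinary assignments; nothing is displayed for these lines.
total = 2 + 3
doubled = total * 2

# print() shows a value no matter where it appears in the cell.
print(total)

# In a notebook, the value of the cell's last expression is displayed automatically.
# (Run as a plain script, this final line displays nothing.)
doubled
```

If the last line of a cell is an assignment (e.g. `doubled = total * 2`), nothing is displayed, because an assignment does not produce a value to show.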

Notebook environments support code cells and markdown (text) cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

**NOTE: Please only complete the BONUS TASKS at the end if you have finished everything else**.

## Environment Setup

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    
    # Installing requisite packages
    !pip install transformers sentence-transformers &> /dev/null

We begin by loading the requisite packages. For those coming from R, packages in Python are sometimes given shorter names for use in the code via the `import <name> as <nickname>` syntax (e.g. `import pandas as pd`). These are usually standardized nicknames. Here we make use of three packages:

1. `pandas`: A very popular package for reading and manipulating data in Python.
2. `sentence_transformers`: A package for extracting features from text data using transformer-based models.
3. `transformers`: A HF package for loading and manipulating transformer-based models.

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from transformers import pipeline

## Feature Extraction

We begin by extracting features (or embeddings), which are numerical representations of the meaning of text, using the `sentence-transformers` package. The code cell below places three sentences in a list of strings, and this list is provided as input to the model.

The code uses the `all-MiniLM-L6-v2` model, a small and efficient embedding model, to extract features from the sentences. The model encodes each sentence into a 384-dimensional vector. The cell then displays the features as a pandas dataframe for easy viewing.

Run the cell below. 

In [None]:
# Define sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)

# Print the features as a pandas dataframe
pd.DataFrame(features, index=sentences)

**TASK 1**: Have a scroll through the features printed by the cell. Can you see that the features of the first two sentences are more similar to each other (i.e., have similar numerical values) than they are to the third sentence? Why do you think this is the case?

**TASK 2**: Try to add another sentence to the `sentences` list defined above. Use one of the existing sentences but replace one or two words with a synonym. For instance, you could change "I feel *great* this morning" to "I feel *fantastic* this morning". Then rerun the cell. What do you notice about the features of this new sentence compared to the original?
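Comparing hundreds of numbers by eye is hard, so it can help to condense the comparison in Tasks 1 and 2 into a single number. One common choice is cosine similarity, which is close to 1 when two vectors point in the same direction and lower when they do not. The sketch below is a minimal plain-Python version; the function name `cosine_similarity` and the three short vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for real 384-dimensional embeddings.
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.2, 0.4]    # points in nearly the same direction as vec_a
vec_c = [-0.7, 0.6, -0.2]  # points in a very different direction

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

In the notebook, the same function could be applied to rows of `features`, for example `cosine_similarity(features[0], features[1])`. The `sentence-transformers` package also provides a `util.cos_sim` helper that serves the same purpose.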

**BONUS TASK**: Try replacing `'all-MiniLM-L6-v2'` with another `sentence-transformers`-compatible model. You can find other compatible models [here](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#) (under 'Original Models'), along with details about their size and performance.

## Text Generation
This section uses the `transformers` text generation pipeline. The cell below begins by loading the pipeline with the `gpt2` model, the smaller, open great-grandparent of ChatGPT. **We use this model for introductory purposes only: GPT-2 is not used for serious applications nowadays, and we would not recommend doing so (you will soon see why).** It is nevertheless fun to play around with, and useful for getting an impression of how far language models have come in the last several years.

The cell then defines a prompt. The prompt is a starting point that the model uses to generate text. GPT-2 will use it to generate text that is likely to follow the prompt. We set the `max_new_tokens` parameter to 100 to limit the length of the generated text to 100 tokens.
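To build intuition for what "continuing a prompt" and `max_new_tokens` mean, here is a deliberately simplified toy sketch. It is not how GPT-2 works internally: the lookup table `NEXT_WORD` and the function `generate` are hypothetical, whole words stand in for tokens (real tokens are often pieces of words), and the toy always picks one fixed next word, whereas a real model computes probabilities over its whole vocabulary.

```python
# Hypothetical lookup table: for each word, the single "most likely" next word.
# A real language model computes these probabilities with a neural network instead.
NEXT_WORD = {
    "once": "upon",
    "upon": "a",
    "a": "time",
    "time": "there",
    "there": "was",
    "was": "a",
}

def generate(prompt, max_new_tokens):
    """Extend the prompt one word at a time, adding at most max_new_tokens words."""
    tokens = prompt.lower().split()
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        next_token = NEXT_WORD.get(tokens[-1])
        if next_token is None:  # no known continuation, so stop early
            break
        tokens.append(next_token)
    return " ".join(tokens)

short = generate("Once upon", max_new_tokens=2)
longer = generate("Once upon", max_new_tokens=5)

print(short)
print(longer)
```

The only difference between the two calls is the limit on how many new words may be appended, which mirrors the role `max_new_tokens` plays in the pipeline call.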

Run the cell below.

In [None]:
pipe = pipeline('text-generation', model='gpt2')

prompt = """
Once upon a time in a land far far away, there was a young prince named John.
He was known for his bravery and courage. One day, he decided to go on an adventure to explore the unknown lands.
"""

# Generate text based on the prompt
output = pipe(prompt, max_new_tokens=100)

# Print the generated text
print(output[0]['generated_text'])

**TASK 3**: Please enter a new prompt in the variable `prompt` that you wish the model to continue generating text from. Feel free to play around with the `max_new_tokens` parameter to see how it affects the generated text.
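When experimenting with prompts, it can be handy to look only at the text the model added. The pipeline returns a list with one dictionary per generated sequence, and by default the `'generated_text'` entry begins with the prompt itself. The sketch below shows one way to strip the prompt; the `output` value is a hand-written stand-in for a real pipeline result, and the story text in it is invented.

```python
prompt = "Once upon a time"

# Hypothetical stand-in for `pipe(prompt, max_new_tokens=...)`; the text is invented.
output = [{"generated_text": "Once upon a time there lived a dragon."}]

full_text = output[0]["generated_text"]

# Keep only the newly generated part by dropping the prompt prefix, if present.
if full_text.startswith(prompt):
    continuation = full_text[len(prompt):]
else:
    continuation = full_text

print(repr(continuation))
```

The `startswith` check is a precaution: if the returned text does not begin with the prompt for some reason, the full text is kept rather than cut at the wrong position.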

**TASK 4**: Try replacing `'gpt2'` above with `'EleutherAI/gpt-neo-125m'` or another text generation model on the [HF model hub](https://huggingface.co/models) to see how the generated text changes. You will have to select a model with parameters in the hundreds of millions for it to fit on the CPU and run in a reasonable timeframe; we will show you how to use the GPU later in the week.

## Sentiment Analysis 
In addition to feature extraction, Hugging Face's `transformers` library provides a high-level API for a variety of other tasks. These tasks can be viewed in the left-hand panel under "Natural Language Processing" on the [HF model hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trendings). Many of these tasks include models that have been fine-tuned on specific datasets to perform well on the task at hand. As an example of such a task, we will now use the `'text-classification'` pipeline to do sentiment classification on a few sentences.

The cell below loads the `transformers` `'text-classification'` pipeline with [`'tabularisai/multilingual-sentiment-analysis'`](https://huggingface.co/tabularisai/multilingual-sentiment-analysis) to predict the sentiment of the same sentences as before.

Run the cell below.

In [None]:
# Define the sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load sentiment analysis pipeline
pipe = pipeline('text-classification', model='tabularisai/multilingual-sentiment-analysis')

# Predict sentiment of the sentences
sentiments = pipe(sentences)

# Print the predicted sentiments as a pandas dataframe
pd.DataFrame(sentiments, index=sentences)

As you can see, not only does the model predict the sentiment of the sentences (`'label'`), but it also provides a confidence score for each prediction (`'score'`).
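The pipeline output is a list with one dictionary per input sentence, so it can also be processed with ordinary Python. The sketch below flags predictions whose score falls below a cut-off. The `sentiments` list is a hand-written stand-in for a real pipeline result: its labels and scores are invented for illustration, and `THRESHOLD` is an arbitrary value chosen for this example.

```python
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible",
]

# Hypothetical stand-in for `pipe(sentences)`; labels and scores are invented.
sentiments = [
    {"label": "Very Positive", "score": 0.62},
    {"label": "Positive", "score": 0.48},
    {"label": "Very Negative", "score": 0.71},
]

THRESHOLD = 0.5  # arbitrary cut-off chosen for this sketch

# Pair each sentence with its prediction and keep the low-confidence ones.
uncertain = [
    (sentence, result["label"], result["score"])
    for sentence, result in zip(sentences, sentiments)
    if result["score"] < THRESHOLD
]

for sentence, label, score in uncertain:
    print(f"{sentence!r}: {label} ({score:.2f})")
```

Low-scoring predictions like these are natural candidates for a manual check, since the score reflects how much probability the model assigned to its chosen label.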

**TASK 5:** Try checking out the languages supported by the model on the [model card](https://huggingface.co/tabularisai/multilingual-sentiment-analysis) (under 'Model Details'). Use a translation tool of your choice to translate the above sentences into another supported language. Do the sentiment labels remain roughly the same?