# Getting Started with Large Language Models for the CS Curriculum

Eric Manley

Drake University



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/LLM4CSCurriculumWorkshop/blob/main/LLM_Workshop.ipynb)


## Purpose of the workshop

* Demonstrate how to use the **transformers** Python library
* Discuss where it can be used in college CS curricula
* Share resources for further learning and course development

## Transformers in a nutshell

A **transformer** is a neural network architecture based on the concept of **attention**
* Transformers are what make LLMs work; they are the architecture behind ChatGPT and similar systems
* You feed a lot of text data into the neural network, and it learns which words relate to other words


<div>
    <center>
        <table>
            <tr>
                <td><img src="images/simple_self_attention.png" width=400px></td>
                <td><img src="images/attention_vis1.png" width=300px></td>
            </tr>
        </table>
    </center>
</div>


*image source:* Speech and Language Processing Fig. 10.2, https://web.stanford.edu/~jurafsky/slp3/10.pdf

*image source:* the original paper on transformers, **Attention Is All You Need**, https://arxiv.org/pdf/1706.03762.pdf
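To make the idea of attention a little more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not how a real model is built: the two-dimensional "embeddings" are made up for illustration, and the learned query, key, and value projections that a real transformer applies are left out, so each word vector plays all three roles.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention where queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # how strongly does this word relate to every word (including itself)?
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        # the new representation is a weighted average of all the word vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# toy 2-dimensional embeddings, invented for this example
toy_embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "kitten": [0.1, 0.9]}
words = list(toy_embeddings)
outputs, weights = self_attention([toy_embeddings[w] for w in words])

for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sequence. With these toy vectors, "cat" puts more weight on "kitten" than on "the" because their vectors point in similar directions.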

## Why transformers?

Unlike earlier recurrent architectures, which process text one token at a time, transformers can process all the tokens in a sequence *in parallel* during training

LLMs use big models: they take lots of words as input, learn encodings for lots of word senses, stack lots of layers for extracting high-level features of text, and are trained on massive amounts of text

<div>
    <center>
        <img src="images/transformer_encoder_decoder.png" width=300px>
    </center>
</div>

*image source:* Hugging Face NLP Course - **How do transformers work?** https://huggingface.co/learn/nlp-course/chapter1/4

## Installing the Hugging Face `transformers` library

You can install it with pip. This code should work whether you run it locally or in Google Colab.

In [None]:
import sys
!{sys.executable} -m pip install transformers
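After installing, it can be handy to confirm that the packages are visible to the Python interpreter you are actually running. The sketch below uses only the standard library. Checking for `torch` is an assumption on my part: the pipelines in this notebook need a deep learning backend such as PyTorch, which Colab typically provides but a fresh local environment may not.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the package can be found by this Python interpreter."""
    return importlib.util.find_spec(package_name) is not None

# "torch" is an assumed backend; adjust the list for your own setup
for package in ["transformers", "torch"]:
    print(package, "is installed" if is_installed(package) else "is missing")
```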

### What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

It provides popular free, open-source libraries for natural language processing (and other) tasks

It hosts *hundreds of thousands of models* that you can use in your own programs

## A first transformers program: the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use


<div>
    <center>
        <img src="images/full_nlp_pipeline.svg" width=600px>
    </center>
</div>

*image source:* https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt

We *are* specifying the kind of task: `sentiment-analysis`

We *are not* asking for a specific model, so the pipeline falls back to its default model for this task

The first time you do this, it will have to download the model - this can take some time depending on your network connection

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

results = classifier("I love how easy it is to build sentiment-aware applications with the transformers library!")
print(results)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9984305500984192}]


**Test it out:** Try changing the input to get different labels/scores
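The three pipeline steps described earlier (tokenize, predict, post-process) can also be sketched by hand. This is an illustration of the idea rather than the pipeline's actual internals. It assumes PyTorch is installed, and it uses the default model name reported in the output above. The helper names `logits_to_prediction` and `classify_manually` are made up for this example.

```python
import math

# the default model named in the pipeline's output above
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

def logits_to_prediction(logits, id2label):
    """Post-processing: softmax over the raw scores, then pick the best label."""
    biggest = max(logits)
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

def classify_manually(text, model_name=MODEL_NAME):
    """Run tokenize -> model -> post-process without using pipeline()."""
    # imported here so the helper above can be tried without the heavy dependencies
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")      # step 1: tokenize
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()    # step 2: predict
    return logits_to_prediction(logits, model.config.id2label)  # step 3: post-process

# made-up raw scores, just to show what the post-processing step does
print(logits_to_prediction([-3.0, 3.5], {0: "NEGATIVE", 1: "POSITIVE"}))
```

Calling `classify_manually("I love this library!")` should give a label and score in the same shape as one entry of the pipeline's result, though it downloads the model the first time it runs.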

## Activity: Specifying a model

Now try asking for a specific model. 

Replace one line of code in your earlier example.

You can find out more about this model by checking out its model card: https://huggingface.co/SamLowe/roberta-base-go_emotions

What are some things you notice about this model that are different from the first one?

In [None]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")
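One thing worth exploring with this model is that a single text can express several emotions at once. The sketch below asks for the scores of all labels by passing `top_k=None` and keeps the ones above a cutoff. The 0.5 threshold and the helper names are assumptions chosen for illustration, and the sample scores are made up.

```python
def labels_above_threshold(scores, threshold=0.5):
    """Keep every label whose score clears the threshold, highest score first."""
    kept = [s for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda s: s["score"], reverse=True)

def classify_emotions(text, threshold=0.5):
    """Return all emotion labels the model scores at or above the threshold."""
    from transformers import pipeline  # imported here; downloads the model on first use

    classifier = pipeline("text-classification",
                          model="SamLowe/roberta-base-go_emotions",
                          top_k=None)  # top_k=None asks for every label's score
    all_scores = classifier([text])[0]  # one list of label/score dicts per input text
    return labels_above_threshold(all_scores, threshold)

# made-up scores, in the same shape the classifier returns
sample_scores = [
    {"label": "joy", "score": 0.71},
    {"label": "neutral", "score": 0.04},
    {"label": "admiration", "score": 0.88},
]
print(labels_above_threshold(sample_scores))
```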

## Activity: Explore additional models

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* find another model that looks interesting to you and try it out
* you might be able to find models for spam detection, fake news detection, topic classification, etc.

## What about sequence-to-sequence models?

The transformers library also has models that take a text sequence as input and generate a new text sequence as output
* summarization
* translation
* question answering

Example:

In [3]:
from transformers import pipeline

summarizer = pipeline("summarization")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [5]:
# article copied from https://www.npr.org/2024/04/02/1242197022/biden-xi-jinping-call-china
example_news_article = """
BEIJING and WASHINGTON, D.C. — President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday, marking the first conversation between the leaders since their face-to-face meeting in California in November.
The latest thorn in Taiwan-China tensions: pineapples
World
The latest thorn in Taiwan-China tensions: pineapples

The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia's war in Ukraine.

According to the Chinese readout, Xi told Biden strategic awareness "must always be the first 'button' to be fastened" in bilateral ties. The Chinese leader also elaborated his position on issues concerning Hong Kong, human rights and the South China Sea, the readout says.
Taiwan's election was a vote for continuity, but adds uncertainty in ties with China
World
Taiwan's election was a vote for continuity, but adds uncertainty in ties with China

The Chinese leader warned again that the "Taiwan issue" is an "insurmountable red line" in bilateral ties. Xi also urged Biden to "translate" his commitment of not supporting "Taiwan independence" into concrete actions, according to the readout.

Biden, in the call, emphasized the importance of maintaining peace and stability across the Taiwan Strait and the rule of law and freedom of navigation in the South China Sea, according to a White House readout.

The two leaders also discussed the global geopolitical situation. Biden, according to the White House, raised concerns over China's support for Russia's defense industrial base and its impact on European and transatlantic security. He also emphasized Washington's "enduring commitment" to the complete denuclearization of the Korean Peninsula.

Tuesday's call was the first time Biden and Xi have talked since they met in northern California in November. There, they agreed on a range of steps to try to prevent increasingly fraught U.S.-China ties from slipping into conflict, including more frequent contact at the leader level, between militaries and beyond.

Ahead of the call, a senior administration official told reporters the conversation would not represent a change in U.S. policy toward China, and competition remains a key feature.

"Intense competition requires intense diplomacy to manage tensions, address misperceptions and prevent unintended conflict. And this call is one way to do that," said the official, who spoke on condition of anonymity as he was not permitted to speak on the record.

Biden raised perennial U.S. concerns about China's "unfair trade policies and non-market economic practices," according to the White House readout — an issue that will be front and center when Treasury Secretary Janet Yellen visits China later this week.

The president also reiterated to his Chinese counterpart that Washington will continue to "take necessary actions to prevent advanced U.S. technologies from being used to undermine our national security, without unduly limiting trade and investment," the White House readout said.
"""

In [6]:
summary = summarizer(example_news_article)
print(summary)

[{'summary_text': ' President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday . The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia\'s war in Ukraine . Tuesday\'s call was the first time Biden and Xi have talked since they met in northern California in November .'}]
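Summarization models can only read a limited amount of text at once, so a longer article may need to be split up first. The sketch below groups paragraphs into chunks and summarizes each one separately. The 400-word budget is an assumption, and word counts are only a rough stand-in for the token counts the model really uses. `truncation=True` is passed as a safety net for chunks that are still too long. The helper names are hypothetical.

```python
def split_into_chunks(text, max_words=400):
    """Group non-empty lines (paragraphs) into chunks of roughly max_words words."""
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n"):
        words = len(paragraph.split())
        if words == 0:
            continue  # skip blank lines
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks

def summarize_long_text(summarizer, text, max_words=400):
    """Summarize each chunk on its own and join the partial summaries."""
    partial = [summarizer(chunk, truncation=True)[0]["summary_text"].strip()
               for chunk in split_into_chunks(text, max_words)]
    return " ".join(partial)

# small demonstration of the chunking step on its own
print(split_into_chunks("one two three\n\nfour five\nsix seven eight", max_words=5))
```

With the `summarizer` pipeline and `example_news_article` from the cells above, `summarize_long_text(summarizer, example_news_article)` would produce one combined summary.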


## What about chat bots?

Chat bots need models that have been trained on conversational text. 

To get the next response in a conversational thread, you need to pass in the entire conversation up to that point.

Models often use special tokens like `<s>` and `</s>` to indicate where a sequence begins and ends, but the exact tokens differ from model to model. See the Blenderbot documentation for this model's conventions: https://huggingface.co/docs/transformers/en/model_doc/blenderbot

In [10]:
text_gen = pipeline("text2text-generation", model="facebook/blenderbot-400M-distill")

In [39]:
conversation = "<s>What is computer science?</s>"
result = text_gen(conversation)
print(result)

[{'generated_text': ' Computer science is a branch of mathematics that deals with computing.'}]


In [40]:
conversation += "<s>"+result[0]["generated_text"]+"</s>"
conversation += "<s>Is it only related to math?</s>"
result = text_gen(conversation)
print(result)

[{'generated_text': ' Yes, it is the study of algorithms and the theory of computation.'}]


In [41]:
conversation += "<s>"+result[0]["generated_text"]+"</s>"
print(conversation)

<s>What is computer science?</s><s> Computer science is a branch of mathematics that deals with computing.</s><s>Is it only related to math?</s><s> Yes, it is the study of algorithms and the theory of computation.</s>
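The pattern in the cells above (wrap each turn in `<s>` and `</s>`, send the whole conversation, append the reply) can be collected into a small helper. This is a sketch that follows the same conventions as the cells above. The function names are made up, and `echo_bot` is a stand-in so the example runs without downloading a model.

```python
def add_turn(conversation, text):
    """Append one turn, wrapped in the start and end tokens used above."""
    return conversation + "<s>" + text + "</s>"

def chat(text_gen, conversation, user_message):
    """Send the full conversation plus a new message; return the updated history and the reply."""
    conversation = add_turn(conversation, user_message)
    reply = text_gen(conversation)[0]["generated_text"]
    return add_turn(conversation, reply), reply

def echo_bot(prompt):
    """Stand-in for a real pipeline: returns output in the same shape."""
    return [{"generated_text": " I received " + str(prompt.count("<s>")) + " turn(s)."}]

conversation = ""
for message in ["What is computer science?", "Is it only related to math?"]:
    conversation, reply = chat(echo_bot, conversation, message)
    print(reply)
print(conversation)
```

To talk to the real model, pass the `text_gen` pipeline from the cells above in place of `echo_bot`.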


## What about data?

Hugging Face also hosts lots of useful datasets, which you can load with its `datasets` library

In [42]:
import sys
!{sys.executable} -m pip install datasets


In [46]:
from datasets import load_dataset

dataset = load_dataset("go_emotions")
dataset

DatasetDict({
    test: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 5427
    })
    train: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 43410
    })
    validation: Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 5426
    })
})

In [52]:
first_text = dataset["test"]["text"][0]
first_labels = dataset["test"]["labels"][0]  # a list, because a text can carry several labels
print("The first text in the dataset:", first_text)
print("The labels for that text:", first_labels)
# look up the name of the first label instead of hard-coding its number
print("What does that label mean?", dataset["test"].features["labels"].feature.int2str(first_labels[0]))

The first text in the dataset: I’m really sorry about your situation :( Although I love the names Sapphira, Cirilla, and Scarlett!
The labels for that text: [25]
What does that label mean? sadness
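A natural next step with a labeled dataset is to see how often each label appears. The sketch below counts label names with `collections.Counter`. The helper name `count_labels` is made up, and the toy data is invented so the example runs without downloading anything. Only the mapping of 25 to "sadness" comes from the output above.

```python
from collections import Counter

def count_labels(label_lists, int2str):
    """Count how often each label name appears across all examples."""
    counts = Counter()
    for labels in label_lists:
        # each example carries a list of label numbers
        counts.update(int2str(label) for label in labels)
    return counts

# toy stand-ins: 25 -> "sadness" matches the output above, "label_1" is a placeholder
toy_names = {25: "sadness", 1: "label_1"}
toy_labels = [[25], [1, 25], [25]]

for name, count in count_labels(toy_labels, toy_names.get).most_common():
    print(name, count)
```

With the real data, the same function could be called as `count_labels(dataset["train"]["labels"], dataset["train"].features["labels"].feature.int2str)`.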


## What about large models?

Large models take a lot of resources to work with

Many large models have smaller cousins that can be used for testing and teaching

For example, the T5 model comes in different sizes, ranging from 60 million parameters to 11 billion: https://huggingface.co/google-t5
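A rough back-of-the-envelope calculation shows why size matters: just holding the weights in memory takes about (number of parameters) × (bytes per parameter). The sketch below assumes 32-bit floats (4 bytes each) by default and ignores everything else a running model needs, so treat the numbers as lower bounds. The function name is made up for this example.

```python
def estimated_weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# parameter counts for the smallest and largest T5 sizes mentioned above
for name, num_parameters in [("smallest T5", 60_000_000), ("largest T5", 11_000_000_000)]:
    full = estimated_weight_memory_gb(num_parameters)                            # 32-bit floats
    half = estimated_weight_memory_gb(num_parameters, bytes_per_parameter=2)     # 16-bit floats
    print(f"{name}: about {full:.2f} GB at 32-bit, {half:.2f} GB at 16-bit")
```

The smallest model fits comfortably on a laptop, while the largest needs tens of gigabytes before any computation happens.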

## Discussion

Where do you see language models fitting into the curriculum?

From what we've covered today, is there anything that is accessible to CS 1 or CS 2? Does it make sense to introduce it there?

## Resources

Free NLP Textbook: Speech and Language Processing by Dan Jurafsky and James H. Martin 
* https://web.stanford.edu/~jurafsky/slp3/
* great for theoretical and intuitive understanding of concepts

Hugging Face NLP Course: https://huggingface.co/learn/nlp-course/
* great for engineering/implementation

Course Materials: https://github.com/ericmanley/F23-CS195NLP
* Natural Language Processing course for undergrads that includes lots of implementation
* Includes Jupyter Notebooks like this one

Fine-Tuning Models for new data
* Hugging Face fine-tuning chapter: https://huggingface.co/learn/nlp-course/chapter3/1
* From my NLP course: https://github.com/ericmanley/F23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb
