# Hugging Face Language Model Lab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/S24-CS143AI/blob/main/hugging_face_language_model_lab.ipynb)

## Installing the Hugging Face `transformers` library

You can install it with pip. The code below should work whether you run it locally in a notebook or in Google Colab.

In [None]:
import sys
!{sys.executable} -m pip install transformers
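
To confirm the install worked, one option is to look up the installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, or None otherwise.
print(installed_version("transformers"))
```

Knowing the version is handy later on, because the available pipelines can differ between releases.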

### What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

It provides popular free, open-source libraries for natural language processing (and other) tasks

It hosts *hundreds of thousands of models* that you can use in your own programs

## A first transformers program: the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use


<div>
    <center>
        <img src="images/full_nlp_pipeline.svg" width=600px>
    </center>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
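
The three pipeline steps listed above can be sketched with a toy example. This is not how the Hugging Face model works internally: the word weights below are invented stand-ins for what a trained model learns, and real tokenizers split text into subword pieces rather than whole words. The sketch only shows the shape of the tokenize, predict, post-process flow.

```python
import math

# Hypothetical word scores standing in for a trained model's learned weights.
TOY_WEIGHTS = {
    "happy": 2.0, "great": 1.5, "love": 1.5,
    "sad": -2.0, "awful": -1.5, "hate": -1.5,
}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Step 1: lowercase the text and split it into simple word tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def predict(tokens):
    """Step 2: produce raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)
    return [-total, total]


def postprocess(logits):
    """Step 3: turn logits into a label and a probability using softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Chain the three steps, returning the same shape as a real pipeline."""
    return postprocess(predict(tokenize(text)))


print(toy_pipeline("I love this lab, it is great"))
```

The return value deliberately mimics the list-of-dictionaries format that the real `sentiment-analysis` pipeline produces.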

We *are* specifying the kind of task: `sentiment-analysis`

We *are not* asking for a specific model, so the library picks a default model for the task

The first time you do this, it has to download the model, which can take some time depending on your network connection

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

results = classifier("It would be really sad if I wasn't so happy")
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9973069429397583}]
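
The pipeline returns a list with one dictionary per input, so the label and score can be pulled out by key. A small sketch, with the result copied from the output above rather than recomputed:

```python
# Copied from the output shown above.
results = [{"label": "NEGATIVE", "score": 0.9973069429397583}]

for result in results:
    # Each entry pairs the predicted label with the model's confidence in it.
    print(f"{result['label']}: {result['score']:.1%}")
```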


**Test it out:** Try changing the input to get different labels/scores

### Exercise: Specifying a model

Now try asking for a specific model. 

Replace one line of code in your earlier example.

You can find out more about this model by checking out its model card: https://huggingface.co/SamLowe/roberta-base-go_emotions

What are some things you notice about this model that are different from the first one?

In [None]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

### Exercise: Explore additional models

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* find another model that looks interesting to you and try it out
* you might be able to find models for spam detection, fake news detection, topic classification, etc.

## What about sequence-to-sequence models?

The `transformers` library also has models that take a text sequence as input and generate a new text sequence as output. Tasks of this kind include:
* summarization
* translation
* question answering

Example:

In [2]:
from transformers import pipeline

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [3]:
# article copied from https://www.npr.org/2024/04/02/1242197022/biden-xi-jinping-call-china
example_news_article = """
BEIJING and WASHINGTON, D.C. — President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday, marking the first conversation between the leaders since their face-to-face meeting in California in November.

The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia's war in Ukraine.

According to the Chinese readout, Xi told Biden strategic awareness "must always be the first 'button' to be fastened" in bilateral ties. The Chinese leader also elaborated his position on issues concerning Hong Kong, human rights and the South China Sea, the readout says.

The Chinese leader warned again that the "Taiwan issue" is an "insurmountable red line" in bilateral ties. Xi also urged Biden to "translate" his commitment of not supporting "Taiwan independence" into concrete actions, according to the readout.

Biden, in the call, emphasized the importance of maintaining peace and stability across the Taiwan Strait and the rule of law and freedom of navigation in the South China Sea, according to a White House readout.

The two leaders also discussed the global geopolitical situation. Biden, according to the White House, raised concerns over China's support for Russia's defense industrial base and its impact on European and transatlantic security. He also emphasized Washington's "enduring commitment" to the complete denuclearization of the Korean Peninsula.

Tuesday's call was the first time Biden and Xi have talked since they met in northern California in November. There, they agreed on a range of steps to try to prevent increasingly fraught U.S.-China ties from slipping into conflict, including more frequent contact at the leader level, between militaries and beyond.

Ahead of the call, a senior administration official told reporters the conversation would not represent a change in U.S. policy toward China, and competition remains a key feature.

"Intense competition requires intense diplomacy to manage tensions, address misperceptions and prevent unintended conflict. And this call is one way to do that," said the official, who spoke on condition of anonymity as he was not permitted to speak on the record.

Biden raised perennial U.S. concerns about China's "unfair trade policies and non-market economic practices," according to the White House readout — an issue that will be front and center when Treasury Secretary Janet Yellen visits China later this week.

The president also reiterated to his Chinese counterpart that Washington will continue to "take necessary actions to prevent advanced U.S. technologies from being used to undermine our national security, without unduly limiting trade and investment," the White House readout said.
"""

In [4]:
summary = summarizer(example_news_article)
print(summary)

[{'summary_text': ' President Biden and Chinese leader Xi Jinping held what a senior Biden administration official dubbed a "check-in" call on Tuesday . The call touched on everything from Taiwan to the situation on the Korean Peninsula, artificial intelligence and Russia\'s war in Ukraine . Tuesday\'s call was the first time Biden and Xi have talked since they met in northern California in November .'}]
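
As with the classifier, the summarizer returns a list of dictionaries. The sketch below copies the output above and tidies it for display; the clean-up steps are just one reasonable choice, not something the library requires.

```python
# Copied from the output shown above.
summary = [
    {
        "summary_text": (
            " President Biden and Chinese leader Xi Jinping held what a senior Biden "
            'administration official dubbed a "check-in" call on Tuesday . The call '
            "touched on everything from Taiwan to the situation on the Korean Peninsula, "
            "artificial intelligence and Russia's war in Ukraine . Tuesday's call was the "
            "first time Biden and Xi have talked since they met in northern California in "
            "November ."
        )
    }
]

# Pull out the text and drop the leading space.
text = summary[0]["summary_text"].strip()
# This model leaves a space before each period; remove it for display.
text = text.replace(" .", ".")

print(text)
print(len(text.split()), "words")
```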


## What about chat bots?

Chat bots need models that have been trained on conversational text. 

To get the next response in a conversational thread, you need to pass in the entire conversation up to that point.

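The `Conversation` object lets you append messages to a thread, which you then pass to the `conversational` pipeline. Note that recent releases of `transformers` no longer include `Conversation` or the `conversational` pipeline, so the code in this section needs a release that still has them.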
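
To see why the whole thread is passed in each time, here is a minimal sketch that does not use a real model. The function `echo_bot` is a hypothetical stand-in for a chat model, and the role/content dictionaries mirror the format that `conversation.messages` prints later in this lab.

```python
def echo_bot(messages):
    """Hypothetical stand-in for a chat model: it receives the full message list."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"You have sent {len(user_turns)} message(s); the latest was: {user_turns[-1]}"


messages = []  # the whole thread, oldest message first

for user_text in ["What is computer science?", "Does it involve other things?"]:
    messages.append({"role": "user", "content": user_text})
    reply = echo_bot(messages)  # the bot sees the entire history on every turn
    messages.append({"role": "assistant", "content": reply})

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

Because the list keeps growing, a reply on a later turn can depend on anything said earlier in the thread.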

In [5]:
from transformers import pipeline, Conversation

chatbot = pipeline("conversational", model="facebook/blenderbot-400M-distill")

In [6]:
# Conversation objects initialized with a string will treat it as a user message
conversation = Conversation("What is computer science?")
conversation = chatbot(conversation)
print(conversation.messages[-1]["content"])


No chat template is defined for this tokenizer - using the default template for the BlenderbotTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



 Computer science is a branch of mathematics that deals with the theory of computation.


In [7]:
conversation.add_message({"role": "user", "content": "Are you sure it is only a branch of mathematics? Doesn't it involve other things?"})
conversation = chatbot(conversation)
print(conversation.messages[-1]["content"])

 Yes, it involves the study of algorithms and how they can be used to solve problems.


In [8]:
# Here's what the whole conversation looks like
print(conversation)

Conversation id: 2ef6049e-3a26-48fb-b264-35a1b8f5fd40
user: What is computer science?
assistant:  Computer science is a branch of mathematics that deals with the theory of computation.
user: Are you sure it is only a branch of mathematics? Doesn't it involve other things?
assistant:  Yes, it involves the study of algorithms and how they can be used to solve problems.



In [9]:
# Here's some insight into how those messages are stored
conversation.messages

[{'role': 'user', 'content': 'What is computer science?'},
 {'role': 'assistant',
  'content': ' Computer science is a branch of mathematics that deals with the theory of computation.'},
 {'role': 'user',
  'content': "Are you sure it is only a branch of mathematics? Doesn't it involve other things?"},
 {'role': 'assistant',
  'content': ' Yes, it involves the study of algorithms and how they can be used to solve problems.'}]

### Exercise: Put this in a loop

Write a loop that continually asks the user for input and displays a response from the chatbot.

### Exercise: Try a larger model

Try the `"facebook/blenderbot-3B"` model, which has 3 billion parameters instead of 400 million. It **might not work**: it could end up using too much memory on Colab.

### Exercise: Experiment with the temperature

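You can set the `do_sample` and `temperature` parameters to control how random the output is. Setting `do_sample=True` lets the model pick each next token at random according to its predicted probabilities, rather than always choosing the most likely one. The `temperature` controls how much randomness is allowed: higher values give more varied (and eventually less coherent) output, while lower values give more predictable output. Experiment with different temperature values and determine which value you're happiest with.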
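
The effect of temperature can be illustrated without a model. The sketch below assumes the common approach of dividing the raw scores (logits) by the temperature before applying softmax; the words and scores are invented for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-word candidates and scores, for illustration only.
words = ["algorithms", "computers", "bananas"]
logits = [3.0, 2.0, 0.0]

for temperature in (0.5, 1.0, 2.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# With sampling turned on, the next word is drawn at random from these probabilities.
rng = random.Random(0)
print(rng.choices(words, weights=softmax_with_temperature(logits, 2.5), k=5))
```

At a low temperature nearly all of the probability sits on the top-scoring word, while at a high temperature unlikely words such as "bananas" get a real chance of being picked.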

In [10]:
conversation = Conversation("What is computer science?")
conversation = chatbot(conversation, do_sample=True, temperature=2.5)
print(conversation.messages[-1]["content"])

 computer science is the study of the algorithms and computing systems and how they interact
