# Hugging Face and OpenAI Use

The first part of this notebook replicates code from the Hugging Face LLM course, specifically content found on:
* "Transformers, what can they do?" - https://huggingface.co/learn/llm-course/chapter1/3

You may find it very useful to read that page, as well as the entirety of their Chapter "1. Transformer models".

## Starter imports

The transformers library is already installed, but we still need to import it. Specifically, we'll recreate the "model" and "tokenizer" from Chapter 1 of the required book and import the function called "pipeline".

As noted on the above linked page:
> There are three main steps involved when you pass some text to a pipeline:
> * The text is preprocessed into a format the model can understand.
> * The preprocessed inputs are passed to the model.
> * The predictions of the model are post-processed, so you can make sense of them.
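
The three steps quoted above can be sketched with a toy example. This is only an illustration of the preprocess, model, post-process flow; the vocabulary, the weights, and the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) are made up for this sketch and are not part of the transformers library.

```python
# Toy stand-ins for the three pipeline stages; none of this is transformers code.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
WEIGHTS = {2: 1.0, 3: -1.0}  # hypothetical per-token sentiment weights


def preprocess(text):
    """Step 1: turn raw text into token IDs the 'model' understands."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: map token IDs to a raw score (a real model returns logits)."""
    return sum(WEIGHTS.get(token_id, 0.0) for token_id in token_ids)


def postprocess(score):
    """Step 3: convert the raw score into a human-readable prediction."""
    label = "POSITIVE" if score >= 0 else "NEGATIVE"
    return {"label": label, "raw_score": score}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does behind the scenes."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```

A real pipeline swaps in a trained tokenizer for step 1 and a neural network for step 2, but the overall shape of the call is the same.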


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline

In [None]:
# Load our default model and tokenizer from Chapter 1

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    dtype="auto",
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
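
The cell above hard-codes `device_map="cuda"`, which assumes a CUDA GPU is available. A minimal sketch of choosing a fallback is shown here; `pick_device_map` is a hypothetical helper written for this notebook, not a transformers function, and it relies only on PyTorch's `torch.cuda.is_available()`.

```python
def pick_device_map():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this cell also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_map = pick_device_map()
print(device_map)
```

The result could be passed as `device_map=device_map` in place of the hard-coded string. Expect a model of this size to run much more slowly on CPU.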

# Pipeline objectives

In the introductory slides, I mentioned NLP tasks like the following:
* Text Classification
* Sequence Labeling
* Information Extraction
* Question Answering
* Natural Language Generation
* Machine Translation
* Summarization
* Text Similarity & Retrieval
* Text Normalization & Understanding
* Dialogue & Conversational Understanding

Let's look at how several of these can be carried out with the 'pipeline' function.

## Text Classification
One example of classification is sentiment analysis (e.g., class labels of positive vs. negative).

In [None]:
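# Attempt to reuse the Phi-3 model loaded above. Phi-3 was loaded as a causal
# (text-generation) model and no tokenizer is passed here, so expect this cell
# to fail; the next cell uses the pipeline's default sentiment model instead.
classifier = pipeline("sentiment-analysis",
                      model=model)
classifier("I've been waiting for a HuggingFace course my whole life.")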

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", 
     "I hate this so much!"]
)
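
Each result from the classifier is a dictionary with a `label` and a `score`. As a rough sketch of the post-processing behind that, the snippet below turns raw model outputs (logits) into a label and a probability with a softmax. The logit values and the helper names (`softmax`, `to_prediction`) are invented for illustration, and the library's actual post-processing may differ depending on the model's configuration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels):
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


labels = ["NEGATIVE", "POSITIVE"]  # assumed label order for this sketch
print(to_prediction([-4.0, 4.0], labels))
print(to_prediction([3.0, -2.5], labels))
```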

## Sequence Labeling
One example is Named Entity Recognition (NER): finding the names of people, organizations, locations, etc.

In [None]:
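# aggregation_strategy="simple" groups tokens of the same entity into one span;
# it replaces the older grouped_entities=True argument, which is deprecated.
ner = pipeline("ner", aggregation_strategy="simple")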

In [None]:
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
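
To give a feel for what grouping entities means, here is a simplified sketch that merges adjacent tokens sharing the same entity type. The tags below are written by hand rather than produced by a model, and `group_entities` is a hypothetical helper: the library's own aggregation also handles subword pieces and scores, which this sketch ignores.

```python
def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share an entity type.

    tagged_tokens is a list of (word, tag) pairs; the tag "O" means "no entity".
    """
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None  # a non-entity token ends any open group
            continue
        if current is not None and current["entity_group"] == tag:
            current["word"] += " " + word  # extend the open group
        else:
            current = {"entity_group": tag, "word": word}
            groups.append(current)
    return groups


# Hand-labelled tokens for the example sentence (not real model output).
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "ORG"), ("Face", "ORG"), ("in", "O"), ("Brooklyn", "LOC"),
]

for group in group_entities(tagged):
    print(group)
```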

## Information Extraction
* to be continued

## Question Answering
- Answer questions based on given text.

In [None]:
question_answerer = pipeline("question-answering")

In [None]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
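
This kind of question answering is extractive: the answer is a span copied out of the context rather than newly generated text. A minimal sketch of the span-selection idea follows, assuming the model assigns each token a "start" score and an "end" score. The scores here are invented, `best_span` is a hypothetical helper, and summing the two scores is a simplification of what the library actually computes.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) pair with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best, best_score


tokens = ["My", "name", "is", "Sylvain", "and", "I", "work", "at",
          "Hugging", "Face", "in", "Brooklyn"]
# Made-up scores: high start score on "Hugging", high end score on "Face".
start_scores = [0.1, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.2, 3.0, 0.4, 0.0, 1.0]
end_scores = [0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.5, 3.2, 0.0, 1.1]

(start, end), score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```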

## Natural Language Generation
- Generate new text from a prompt.

In [None]:
generator = pipeline("text-generation", 
                     # model=model,   # the book's phi3 model example
                     # tokenizer=tokenizer,
                     model="HuggingFaceTB/SmolLM2-360M"   # Hugging Face course suggestion
                    )

In [None]:
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
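
The two arguments in the call above control how long each output may be and how many outputs are sampled. In transformers, `max_length` counts the prompt tokens plus the generated ones. The toy loop below sketches that behavior with a hand-written lookup table in place of a language model; `NEXT_WORDS`, `generate`, and `generate_many` are hypothetical names used only for this illustration.

```python
import random

# Hypothetical "language model": for each word, the words that may follow it.
NEXT_WORDS = {
    "to": ["build", "train"],
    "build": ["models"],
    "train": ["models"],
    "models": ["quickly", "<eos>"],
    "quickly": ["<eos>"],
}


def generate(prompt, max_length, rng):
    """Append one sampled word at a time until max_length or end-of-sequence."""
    tokens = prompt.split()
    while len(tokens) < max_length:  # the prompt counts toward max_length
        candidates = NEXT_WORDS.get(tokens[-1], ["<eos>"])
        next_token = rng.choice(candidates)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return " ".join(tokens)


def generate_many(prompt, max_length, num_return_sequences, seed=0):
    """Sample several independent continuations of the same prompt."""
    rng = random.Random(seed)
    return [generate(prompt, max_length, rng) for _ in range(num_return_sequences)]


for text in generate_many("how to", max_length=4, num_return_sequences=2):
    print(text)
```

Because each continuation is sampled, the returned sequences may or may not differ from one another, just as with the real generator.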

## Machine Translation
- Convert text from one language to another.

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

In [None]:
translator("Ce cours est produit par Hugging Face.")

## Summarization
- Produce a shorter version of text while keeping the main ideas.

In [None]:
summarizer = pipeline("summarization")

In [None]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

In [None]:
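# Same call as the previous cell, but keep the result in a variable so we can
# pull out just the summary string at the end.
summarizer_output = summarizer(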
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

summarizer_output[0]['summary_text']

## Text Similarity and Retrieval
* to be continued

## Text Normalization and Understanding
- Preprocess or transform language. Here we consider a basic task: tokenization.
- The 'pipeline' function does not offer tokenization as a task of its own, so we call the 'tokenizer' directly.

In [None]:
# tokenizer defined at top

In [None]:
text = "Hello world"
tokens = tokenizer(text)  # returns a dict-like object that includes "input_ids"
print("Token IDs:", tokens["input_ids"])
# Map each ID back to the token string it stands for
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))

The output above will differ from model to model, since models are typically paired with their own tokenizer and vocabulary. We will look more at this next week.
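
To illustrate why the IDs depend on the tokenizer, here is a toy sketch with two whitespace tokenizers whose vocabularies list the same words in a different order. `ToyTokenizer` is a made-up class for this example; real tokenizers split text into subword pieces and have far larger vocabularies.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocab_tokens):
        # Assign IDs in the order the tokens are listed.
        self.token_to_id = {tok: idx for idx, tok in enumerate(vocab_tokens)}
        self.id_to_token = {idx: tok for tok, idx in self.token_to_id.items()}

    def encode(self, text):
        """Map each whitespace-separated word to its ID (or the <unk> ID)."""
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(word, unk) for word in text.split()]

    def convert_ids_to_tokens(self, ids):
        """Map IDs back to their token strings."""
        return [self.id_to_token[i] for i in ids]


tok_a = ToyTokenizer(["<unk>", "Hello", "world"])
tok_b = ToyTokenizer(["<unk>", "world", "Hello"])

ids_a = tok_a.encode("Hello world")
ids_b = tok_b.encode("Hello world")
print("Tokenizer A IDs:", ids_a)
print("Tokenizer B IDs:", ids_b)
print("Round trip A:", tok_a.convert_ids_to_tokens(ids_a))
print("Round trip B:", tok_b.convert_ids_to_tokens(ids_b))
```

The same text yields different IDs under each vocabulary, yet each tokenizer maps its own IDs back to the original tokens. This is why IDs produced by one model's tokenizer should not be fed to a different model.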

## Dialogue and Conversational Understanding
* to be continued

## Mask Filling

Turn now to the API notebook for continuation.