## Introducing Transformers: Powerful NLP Tools for Data Analytics
Learning about the Hugging Face Transformers models and Python package is valuable for data analytics students because it gives them access to state-of-the-art natural language processing (NLP) capabilities that integrate easily into data analysis workflows.

The Transformers package (and the corresponding Model Hub) offers pre-trained models for a variety of NLP tasks. Students can quickly add powerful text processing to their projects and also learn how to fine-tune and customize these models for specific domains and problems. These language modeling techniques matter more and more as data sources become increasingly text-based.

The Hugging Face Model Hub is a platform where the community can share and collaborate on trained models. The models themselves are typically open-source and free to use, but you should always check the specific license attached to each model for any restrictions or requirements. The license information is usually included in the model's page on the Model Hub.

However, while the models are free to download and run yourself, calling them through Hugging Face's hosted API may not be free. Hugging Face offers a limited number of free API calls; beyond that limit, you'll need to pay.

Also, while the models are typically open-source, the datasets used to train these models may not be. Some models may have been trained on proprietary or confidential data. Always check the documentation associated with each model for details about the training data.

This workbook was adapted from the Hugging Face NLP course: https://huggingface.co/learn/nlp-course/chapter1/1

Transformer models are used to solve all kinds of NLP tasks. Here are some examples:
- Text Classification (sentiment analysis)
- Zero-Shot Classification
- Text Generation
- Question Answering
- Summarization

We are going to start with a high-level tool called a 'pipeline'. This is the easiest way to start using Transformers.

### Table of Contents <a name="top"></a>
1. [Use your first transformer pipeline: sentiment analysis using classification](#sentiment-analysis)
2. [Zero-shot Classification](#zero-shot)
3. [Text Generation](#text-generation)
4. [Question Answering](#question-answering)
5. [Summarization](#summarization)
6. [What's inside a pipeline?](#pipeline)
7. [Your assignment](#assign)



In [1]:
# The transformers package is already installed; let's check its version.
%pip show transformers

Name: transformers
Version: 4.31.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /opt/conda/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: autogluon.multimodal
Note: you may need to restart the kernel to use updated packages.


## The sentiment analysis pipeline <a name="sentiment-analysis"></a>
In Hugging Face's Transformers library, a pipeline is a high-level, easy-to-use abstraction for performing tasks with transformer models. It encapsulates the complex process of applying a transformer model into a simple function call.

A pipeline ties together several steps to make model predictions on inputs:
 - preprocessing
 - passing inputs through a model
 - postprocessing

For example, if you're using a pipeline for text classification, you would input raw text, and the pipeline would handle tokenization, input formatting, model inference, and output interpretation.


In [2]:
# Import the pipeline from the transformers package
# You can ignore any warnings. We can discuss them or even suppress them if needed.
from transformers import pipeline

2024-04-29 14:10:50.632146: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# Let's create a pipeline for the NLP task called "sentiment analysis". We will use a trusted model.
# Model card: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english
sa = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [4]:
# Here is an excerpt from the Q4 2023 earnings statement by Sundar Pichai, Google's Chief Executive Officer
data = '''
Hello, everyone. Our results reflect strong momentum and product innovation continuing into 2024. 
Today, I'm going to talk about four main topics. One are investments in the AI, including how it's 
helping Search; two, subscriptions, which reached $15 billion in annual revenue, up five times since 2019.
'''

In [5]:
# Send the data to the pipeline and see the analysis result
sa(data)

[{'label': 'POSITIVE', 'score': 0.9994671940803528}]
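
Where does the `score` come from? For a two-class model like this one, the pipeline's postprocessing step turns the model's raw outputs (logits) into probabilities with a softmax and reports the highest one. Here is a minimal sketch of that idea in plain Python. The logits below are made-up numbers for illustration, not values from the real model.

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, in the order [NEGATIVE, POSITIVE]
labels = ["NEGATIVE", "POSITIVE"]
logits = [-3.5, 4.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
result = {"label": labels[best], "score": probs[best]}
print(result)
```

Because the two probabilities sum to 1, a score close to 1.0 means the model assigns almost no probability to the other label.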

In [6]:
# Here is a fictitious negative earnings statement
data = '''
Unfortunately, we experienced several significant supply chain constraints during the quarter,
which limited our ability to meet customer demand for our flagship product. This, combined with
increased competition in our core market, resulted in lower-than-expected revenue and earnings 
for the period. We are working diligently to resolve these supply issues and strengthen our competitive
positioning, but expect these headwinds to persist through at least the first half of 2023.
'''

In [7]:
# Send the data to the pipeline and see the analysis result
sa(data)

[{'label': 'NEGATIVE', 'score': 0.9996461868286133}]

In [8]:
# We can also send a list of sentences and the pipeline will classify each item in the list
data =  [
"Unfortunately, we experienced several significant supply chain constraints during the quarter, which limited our ability to meet customer demand for our flagship product.",
"This, combined with increased competition in our core market, resulted in lower-than-expected revenue and earnings for the period.",
"We are working diligently to resolve these supply issues and strengthen our competitive positioning.",
"We see a bright future in 2024!"
]

In [9]:
# Run inference on all items in the list
response = sa(data)
# Print the first 30 characters of each sentence next to its inference result
for text, r in zip(data, response):
    print(text[:30] + '....', r)

Unfortunately, we experienced .... {'label': 'NEGATIVE', 'score': 0.9997628331184387}
This, combined with increased .... {'label': 'NEGATIVE', 'score': 0.9997612833976746}
We are working diligently to r.... {'label': 'POSITIVE', 'score': 0.9739233255386353}
We see a bright future in 2024.... {'label': 'POSITIVE', 'score': 0.9997740387916565}
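
In an analytics workflow you will usually want to do something with these results, not just print them. The sketch below counts the labels and flags lower-confidence predictions for a human to review. The result dictionaries are copied from the output above (scores rounded, sentences shortened), and `REVIEW_THRESHOLD` is a hypothetical cutoff chosen only for this example.

```python
from collections import Counter

# Shortened sentences and rounded results from the output above
texts = [
    "Unfortunately, we experienced several significant supply chain constraints...",
    "This, combined with increased competition in our core market...",
    "We are working diligently to resolve these supply issues...",
    "We see a bright future in 2024!",
]
results = [
    {"label": "NEGATIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9739},
    {"label": "POSITIVE", "score": 0.9998},
]

# How many sentences received each label?
label_counts = Counter(r["label"] for r in results)
print(label_counts)

# Flag predictions below an arbitrary (hypothetical) confidence cutoff
REVIEW_THRESHOLD = 0.99
needs_review = [(t, r) for t, r in zip(texts, results) if r["score"] < REVIEW_THRESHOLD]
for text, r in needs_review:
    print("Review:", text, r)
```

With real data you would replace `results` with the list returned by `sa(data)`.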


#### Your turn. Practice using the sentiment-analysis pipeline
[Top of Page](#top)

In [10]:
# Your code here:





## Zero-shot Classification <a name="zero-shot"></a>
Zero-shot classification is an NLP task where the model is asked to classify input into categories it was not explicitly trained on. It's called "zero-shot" because the model has seen zero labeled examples of the candidate classes before being asked to assign them.

In the context of Hugging Face's Transformers, the zero-shot-classification pipeline is a powerful tool that allows you to perform this task. It leverages models that have been trained on a large corpus of text, learning a rich representation of the semantics of the language. This allows them to generalize well to new, unseen classes.
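
How can a model score labels it was never trained on? The models used in this section were trained on natural language inference (the "mnli" in their names): given a premise and a hypothesis, they predict whether the premise entails the hypothesis. As I understand the pipeline, it turns each candidate label into a hypothesis with a template (the default is `"This example is {}."`), asks the model how strongly your text entails each one, and normalizes those scores across the labels. The sketch below shows only that last bookkeeping step in plain Python. The entailment logits are invented numbers, and `zero_shot_scores` is a hypothetical helper, not part of the library.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax the per-label entailment logits and sort labels from most to least likely."""
    labels = list(entailment_logits)
    largest = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - largest) for label in labels]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)), key=lambda pair: pair[1], reverse=True)
    return {"labels": [label for label, _ in ranked], "scores": [score for _, score in ranked]}

# Each candidate label becomes a hypothesis sentence
template = "This example is {}."
candidate_labels = ["Product Reviews", "Poetry", "Legal Contracts"]
hypotheses = [template.format(label) for label in candidate_labels]
print(hypotheses)

# Hypothetical entailment logits, one per hypothesis (a real NLI model would produce these)
fake_logits = {"Product Reviews": 2.1, "Poetry": 0.4, "Legal Contracts": -0.3}
toy_response = zero_shot_scores(fake_logits)
print(toy_response)
```

This is why the scores in a zero-shot response sum to 1 across the candidate labels: adding or removing a label changes every score.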

In [10]:
from transformers import pipeline 

# Model card: https://huggingface.co/facebook/bart-large-mnli
# Larger model
zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # 1.63 GB
# A similar, smaller model:
# zsc = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")  # 890 MB

In [11]:
statement = '''
This new oven has exceeded all my expectations. 
The intuitive control panel makes it a breeze to use, and the convection 
heating cooks my meals to perfection every time. 
I would highly recommend this product to anyone looking to upgrade their home cooking setup.
'''

classes=[
"Product Reviews",
"Political Speeches",
"Personal Diary Entries",
"Scientific Research Papers",
"Fictional Short Stories",
"News Articles",
"Cooking Recipes",
"Poetry",
"Legal Contracts",
"Software Documentation"
]
response = zsc(statement,candidate_labels=classes)
# Print the raw result
response

{'sequence': '\nThis new oven has exceeded all my expectations. \nThe intuitive control panel makes it a breeze to use, and the convection \nheating cooks my meals to perfection every time. \nI would highly recommend this product to anyone looking to upgrade their home cooking setup.\n',
 'labels': ['Product Reviews',
  'Personal Diary Entries',
  'Cooking Recipes',
  'News Articles',
  'Poetry',
  'Fictional Short Stories',
  'Legal Contracts',
  'Political Speeches',
  'Scientific Research Papers',
  'Software Documentation'],
 'scores': [0.305452436208725,
  0.1289210170507431,
  0.10184430330991745,
  0.09160089492797852,
  0.09086976200342178,
  0.08533546328544617,
  0.06718843430280685,
  0.04665406420826912,
  0.04468149319291115,
  0.03745215758681297]}

In [13]:
# Print the result formatted
print("Input Statement:\n", response['sequence'], "\n")

print('The most likely class is:', response['labels'][0], 'with a probability of:', response['scores'][0],'\n')
print('All classes with corresponding probabilities')
for label, score in zip(response['labels'], response['scores']):
    print(f"Label: {label}, Score: {score}")

Input Statement:
 
This new oven has exceeded all my expectations. 
The intuitive control panel makes it a breeze to use, and the convection 
heating cooks my meals to perfection every time. 
I would highly recommend this product to anyone looking to upgrade their home cooking setup.
 

The most likely class is: Product Reviews with a probability of: 0.305452436208725 

All classes with corresponding probabilities
Label: Product Reviews, Score: 0.305452436208725
Label: Personal Diary Entries, Score: 0.1289210170507431
Label: Cooking Recipes, Score: 0.10184430330991745
Label: News Articles, Score: 0.09160089492797852
Label: Poetry, Score: 0.09086976200342178
Label: Fictional Short Stories, Score: 0.08533546328544617
Label: Legal Contracts, Score: 0.06718843430280685
Label: Political Speeches, Score: 0.04665406420826912
Label: Scientific Research Papers, Score: 0.04468149319291115
Label: Software Documentation, Score: 0.03745215758681297


#### Your turn. Practice using the zero-shot classification pipeline
[Top of Page](#top)

In [14]:
# Your code here




## Text Generation <a name="text-generation"></a>
Text generation is a common task in NLP that involves generating human-readable text. It's often used to create responses for chatbots, generate narratives for games, write articles, and more.

In the context of Hugging Face's Transformers, the text-generation pipeline uses a language model to generate text. Given some input text, the model generates text that continues from the input. The generated text is intended to be a plausible continuation of the input text, as if the same author were writing it.

In [1]:
from transformers import pipeline

# Let's use OpenAI's GPT-2, an openly available model released in 2019.
# Model card: https://huggingface.co/openai-community/gpt2
tg = pipeline("text-generation", model = "openai-community/gpt2")

2024-04-29 14:12:32.283127: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [2]:
# Give the pipeline a prompt and it will continue the text
response = tg("In a world where AI is commonplace,",max_new_tokens = 100)
print('\n',response[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



 In a world where AI is commonplace, this is a massive victory. If you work for a high-tech startup like ours, you're not expected to spend long on human resources.

These days, a startup is usually a handful of people, each with three or four employees who make very few money on the $20 a year market. If things go south, you might even miss out on a significant investment.

"Companies don't have to rely on robots for hiring and promotion," says Peter Klemman, president


In [3]:
# The pipeline also has some parameters we can use to really customize the output.
# They are documented here: https://huggingface.co/docs/transformers/en/main_classes/text_generation

# My selection of parameters makes the output kind of wacky
response = tg("In a world where AI is commonplace,",
              max_new_tokens = 100,
              temperature=0.99, 
              top_k=50,
              repetition_penalty=1.2,
              num_return_sequences=3)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [4]:
# This will have 3 generated responses
for i,r in enumerate(response):
    print('Response:', i+1,'\n\n',r['generated_text'],'\n')

Response: 1 

 In a world where AI is commonplace, you might wonder why we even bother to look around.
'Technology isn't the only challenge.' So if I was able (to take my own personal computing) in such an emergency scenario — one of us could be seriously injured by something big happening next year or possibly die from drowning this summer … that's not going away anytime soon,' Mr Kessel says about technology change and how it'll impact our behaviour… 

Response: 2 

 In a world where AI is commonplace, this would not be surprising. Humans are already trying to make progress as human innovation reaches its apogee of self-driving cars that allow drivers and passengers alike to travel across high speed highways in the blink—and with them one step closer toward solving many key problems we face today:
 

Response: 3 

 In a world where AI is commonplace, it's not surprising to see that this will be the main mode of computing.
 Why? Because when humans make decisions they're guided and to
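
To build some intuition for `temperature` and `top_k`: at each step the model produces a score (logit) for every possible next token, and sampling picks one of them at random. Dividing the logits by the temperature flattens (higher values) or sharpens (lower values) the distribution, and `top_k` restricts the choice to the k highest-scoring tokens. The following is a minimal sketch of that idea in plain Python, not the library's implementation. The four-token vocabulary and its logits are invented, and `repetition_penalty` is not modeled here.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token from a {token: logit} dict using temperature scaling and optional top-k filtering."""
    # Sort candidates from highest to lowest logit and keep only the top k
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it
    scaled = [logit / temperature for _, logit in candidates]
    largest = max(scaled)
    weights = [math.exp(s - largest) for s in scaled]  # unnormalized softmax weights
    tokens = [token for token, _ in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token logits for a tiny vocabulary
toy_logits = {" the": 3.0, " a": 2.5, " robots": 1.0, " banana": -1.0}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# With top_k=1 only the single best token can ever be chosen
greedy_like = [sample_next_token(toy_logits, temperature=0.99, top_k=1, rng=rng) for _ in range(5)]
# With top_k=3 the lowest-scoring token (" banana") is never eligible
varied = [sample_next_token(toy_logits, temperature=0.99, top_k=3, rng=rng) for _ in range(20)]
print(greedy_like)
print(varied)
```

Try changing `temperature` to something like 0.2 or 2.0 and watch how often the less likely tokens show up in `varied`.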

In [5]:
# Just another example
data = '''To summarize the Star Wars Episode 3 movie is'''
response = tg(data,max_new_tokens = 200, temperature=.7, repetition_penalty=2.0)
print(response[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


To summarize the Star Wars Episode 3 movie is set in a galaxy far, distant future where all life on this planet has been wiped out. A group of smugglers who are seeking to escape from their ship have arrived and want us here too: they call themselves "The Galactic Empire," but it's only when you begin working aboard your starship that we learn more about them... In fact I can't even fathom how other people like me could make sense of what these two guys mean by saying such things as "...it looks just fine" or something else entirely? And if everyone knew there was nothing wrong with being human before any one person ever did anything stupid (like throwing away his lightsaber), for example would those very same pirates then say no thanks?!


Also - because The Force Awakens isn;t really an action-packed film -- not at least yet! So why do some fanboys keep going back into prequels without knowing everything so thoroughly??? Why does anyone still try again after reading up upon every det

#### Your turn. Practice using text generation.
[Top of Page](#top)

In [20]:
# Your code here



## Question Answering <a name="question-answering"></a>
In NLP, question answering is a task where the system is given a passage of text (the "context") and a question related to that text, and it tries to answer the question based on the information in the text.

In the context of Hugging Face's Transformers, the question-answering pipeline is used for this task. It uses models that have been trained on question-answering datasets; the model below was fine-tuned on SQuAD.

In [6]:
from transformers import pipeline

# https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad
qa = pipeline("question-answering",model="distilbert/distilbert-base-cased-distilled-squad")

In [7]:
# There are many parameters you can use to customize the inference call, but at a minimum you need a context and a question
qa(context="My name is Kurt and I work at Cal Poly in San Luis Obispo.",
    question="Where does Kurt work?")

{'score': 0.7949253916740417, 'start': 30, 'end': 38, 'answer': 'Cal Poly'}

In [8]:
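# But it has limitations: the model extracts a span from the context, so it returns
# an answer even when the context does not contain one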
qa(context="My name is Kurt and I work at Cal Poly in San Luis Obispo.",
    question="What is Kurt's favorite actor?")

{'score': 0.5232281684875488, 'start': 30, 'end': 38, 'answer': 'Cal Poly'}
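
The `start` and `end` values explain what is going on: the model does not write an answer, it points at a slice of the context. The sketch below reproduces the answer by slicing the context string, using the two result dictionaries copied from the outputs above. It also shows one rough way to guard against unanswerable questions. `answer_or_none` and its `min_score` cutoff are hypothetical, and because scores for right and wrong answers can overlap, treat this as a crude filter rather than a fix.

```python
context = "My name is Kurt and I work at Cal Poly in San Luis Obispo."

# Result dictionaries copied from the two outputs above
answerable = {"score": 0.7949253916740417, "start": 30, "end": 38, "answer": "Cal Poly"}
unanswerable = {"score": 0.5232281684875488, "start": 30, "end": 38, "answer": "Cal Poly"}

# The answer is simply the characters of the context between start and end
span = context[answerable["start"]:answerable["end"]]
print(span)

def answer_or_none(result, min_score=0.6):
    """Return the answer only when the score clears an arbitrary (hypothetical) cutoff."""
    return result["answer"] if result["score"] >= min_score else None

print(answer_or_none(answerable))
print(answer_or_none(unanswerable))
```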

In [9]:
# Lots of options here: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.QuestionAnsweringPipeline
# Another slightly more complex example
cx = '''Pickleball is a paddleball sport that combines elements of tennis, badminton, and table tennis. 
The game is played on a badminton-sized court with a perforated plastic ball and solid paddles. Pickleball was invented
in 1965 on Bainbridge Island, Washington by Joel Pritchard, Bill Bell, and Barney McCallum as a way to provide an 
entertaining game for the whole family. The sport quickly gained popularity, especially among retirees, due to its 
relatively simple rules and the ability to play competitively at different skill levels. Today, pickleball is one of 
the fastest growing sports in North America, with millions of players of all ages participating across thousands of 
indoor and outdoor courts.'''

questions = [
"What other sports does pickleball incorporate elements from?",
"Where was pickleball first invented?",
"Who are credited with inventing the sport of pickleball?",
"Why did pickleball initially become popular, especially among retirees?",
"What is one reason for the rapid growth of pickleball in recent years?"
]

for q in questions:
    response = qa(context = cx, question = q)
    print(q, 'Answer:', response['answer'], 'Score:', response['score'], '\n')

What other sports does pickleball incorporate elements from? Answer: tennis, badminton, and table tennis Score: 0.9531278610229492 

Where was pickleball first invented? Answer: Bainbridge Island, Washington Score: 0.6088538765907288 

Who are credited with inventing the sport of pickleball? Answer: Joel Pritchard, Bill Bell, and Barney McCallum Score: 0.986624002456665 

Why did pickleball initially become popular, especially among retirees? Answer: relatively simple rules and the ability to play competitively at different skill levels Score: 0.2571156620979309 

What is one reason for the rapid growth of pickleball in recent years? Answer: relatively simple rules and the ability to play competitively at different skill levels Score: 0.2947987914085388 



#### Your turn. Practice question answering here.
[Top of Page](#top)

In [25]:
# Your code here





## Summarization <a name="summarization"></a>
Summarization is an NLP task where the goal is to create a concise and meaningful summary of a longer text. The summary should retain the key points of the original text while significantly reducing its length.

In the context of Hugging Face's Transformers, the summarization pipeline is used for this task. It uses a model that has been trained on a summarization task to generate summaries of input text.

In [26]:
from transformers import pipeline
# Model card: https://huggingface.co/facebook/bart-large-cnn
# Parameters: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.SummarizationPipeline
summer = pipeline("summarization", model = "facebook/bart-large-cnn")

In [27]:
data = '''
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
'''
response = summer(data)
print(response[0]['summary_text'])

America has changed dramatically during recent years. The number of graduates in traditional engineering disciplines has declined. There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues. Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering.


## What's inside a pipeline? <a name="pipeline"></a>
In summary, a Hugging Face pipeline is a high-level, easy-to-use abstraction for performing tasks with Transformer models. It encapsulates several steps into a single callable object, including tokenization, running the model, and post-processing the model's outputs.

Here's a simplified view of what's inside a pipeline:

**Tokenizer**: This is responsible for converting the input text into a format that the model can understand. This includes splitting the text into tokens, mapping tokens to their IDs, and creating the necessary input tensors.

**Model**: This is the Transformer model that performs the actual task (e.g., text classification, named entity recognition, question answering, etc.). The model takes the tensors produced by the tokenizer and returns output tensors.

**Post-processor**: This takes the output tensors from the model and converts them into a more user-friendly format. The exact nature of the post-processing depends on the specific task.
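
To make the three stages concrete, here is a toy pipeline written in plain Python. Everything in it is a stand-in invented for illustration: `ToyTokenizer` uses a four-word vocabulary, `ToyModel` just adds up hand-picked word weights instead of running a Transformer, and `toy_pipeline` is a hypothetical function. Only the overall shape (text to token IDs, token IDs to logits, logits to a labeled score) mirrors a real pipeline.

```python
import math

class ToyTokenizer:
    """Stage 1: split text into tokens and map each token to an integer ID."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.unk_id = vocab["[UNK]"]  # ID used for words not in the vocabulary

    def __call__(self, text):
        tokens = [tok.strip(".,!?") for tok in text.lower().split()]
        return [self.vocab.get(tok, self.unk_id) for tok in tokens]

class ToyModel:
    """Stage 2: stand-in for a Transformer; sums per-token weights into [negative, positive] logits."""
    def __init__(self, token_weights):
        self.token_weights = token_weights  # token ID -> (negative weight, positive weight)

    def __call__(self, input_ids):
        negative = sum(self.token_weights.get(i, (0.0, 0.0))[0] for i in input_ids)
        positive = sum(self.token_weights.get(i, (0.0, 0.0))[1] for i in input_ids)
        return [negative, positive]

def postprocess(logits, id2label):
    """Stage 3: softmax the logits and report the most likely label with its score."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical vocabulary and weights
tokenizer = ToyTokenizer({"[UNK]": 0, "great": 1, "terrible": 2, "oven": 3})
model = ToyModel({1: (-2.0, 2.0), 2: (2.0, -2.0)})
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def toy_pipeline(text):
    """Chain the three stages into a single call, like a pipeline object does."""
    return postprocess(model(tokenizer(text)), id2label)

print(toy_pipeline("This oven is great!"))
print(toy_pipeline("Terrible oven."))
```

A real pipeline object exposes its own pieces in a similar way, for example through its `tokenizer` and `model` attributes.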

In [28]:
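# What's inside a pipeline?
# Uncomment one line at a time to inspect that component.
# summer.tokenizer     # the tokenizer that converts text to input IDs
# summer.model         # the underlying Transformer model
# summer.model.config  # the model's configuration
# summer.device        # the device (CPU or GPU) the pipeline runs on
# summer.framework     # the deep learning framework in use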

## Your assignment <a name="assign"></a>

Pick your favorite NLP pipeline from above and experiment with it here. Perhaps use a different model or give it different input.

OR

Find a Pipeline we have not implemented yet and see if you can read the docs and get it to work:

https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#natural-language-processing

[Top of Page](#top)

In [29]:
# Your code here




### Utility

In [30]:
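# Hugging Face models take up local disk space.
# This function deletes all cached models (in the default cache location) to free up disk space.
# We are not done with Hugging Face models, so don't run it yet.
import shutil
from pathlib import Path

def purgeHF():
    # Remove the default Hugging Face cache directory, if it exists
    shutil.rmtree(Path.home() / ".cache" / "huggingface", ignore_errors=True)

# purgeHF()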