# **HuggingFace Quick Tutorial**

Hugging Face (huggingface.co) is a platform that lets people share and use large language models (LLMs) and other AI models easily. It is similar to GitHub but is mainly used for models such as LLMs.

Pipeline: a pipeline on Hugging Face abstracts away most of the complex details and offers a simple API for a variety of tasks. With a pipeline, you can call a model in just a few lines of code.

You can use this link to learn more about pipelines on Hugging Face:

https://huggingface.co/docs/transformers/en/main_classes/pipelines


Use the link below to browse the tasks available on Hugging Face:

https://huggingface.co/tasks

# **Section 1: Simple HuggingFace Examples.**

We need to install the transformers library as the first step.

In [1]:
!pip install transformers



The first example is sentiment analysis.
If no model is supplied, the pipeline falls back to a default model for the task.



In [2]:
# sentiment-analysis
from transformers import pipeline

classifier=pipeline("sentiment-analysis")
result=classifier("Today isn't a bad day!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'label': 'POSITIVE', 'score': 0.9983673691749573}]
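The classifier returns a list with one dictionary per input, each holding a `label` and a `score`, as the printed result shows. Below is a minimal sketch of how that structure might be unpacked, using the printed result as sample data so that no model needs to be loaded. The helper name `describe` and the 0.9 threshold are assumptions made for illustration; they are not part of the transformers library.

```python
def describe(results, threshold=0.9):
    """Turn pipeline-style output into short, readable strings."""
    lines = []
    for item in results:
        # The threshold is an arbitrary cut-off chosen for this sketch
        confidence = "confident" if item["score"] >= threshold else "uncertain"
        lines.append(f"{item['label']} ({item['score']:.3f}, {confidence})")
    return lines

# Sample data copied from the output printed by the sentiment-analysis cell
sample = [{'label': 'POSITIVE', 'score': 0.9983673691749573}]
print(describe(sample))
```

The same loop works unchanged when the classifier is called with a list of sentences, since it then returns one dictionary per sentence.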


The second example is image to text.

This is the model we chose: https://huggingface.co/Salesforce/blip-image-captioning-base

In [3]:
from transformers import pipeline

#img to text
def img2text(url):
  res = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
  text=res(url)
  print(text)
  print("-----------")
  text2=text[0]["generated_text"] # we only need the text part
  return text2

#call the function
img2text("https://pages.mtu.edu/~cai/sat4520/dog.jpg")

[{'generated_text': 'a small dog with a collar on'}]
-----------


'a small dog with a collar on'
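Note that `img2text` builds the pipeline inside the function, so the model is loaded again on every call. One possible way to avoid that is to cache pipelines by task and model name. The sketch below is an illustration under assumptions: `get_pipeline`, `img2text_cached`, and the `factory` parameter are hypothetical names introduced here, and `factory` exists only so the caching logic can be tried with a stand-in instead of a real model.

```python
_PIPELINES = {}  # cache: (task, model) -> pipeline object

def get_pipeline(task, model, factory=None):
    """Return a cached pipeline, building it only on the first request."""
    key = (task, model)
    if key not in _PIPELINES:
        if factory is None:
            # Default to the real constructor; imported lazily so that
            # defining this helper does not itself load the library.
            from transformers import pipeline as factory
        _PIPELINES[key] = factory(task, model=model)
    return _PIPELINES[key]

def img2text_cached(url):
    """Same behaviour as img2text, but reuses the loaded model."""
    captioner = get_pipeline("image-to-text", "Salesforce/blip-image-captioning-base")
    return captioner(url)[0]["generated_text"]  # keep only the text part
```

With this in place, calling `img2text_cached` on several image URLs would load the captioning model once rather than once per image.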

The third example is text generation: expanding a short text into a longer one.

We use this model: https://huggingface.co/distilbert/distilgpt2

You can change the seed to get a different story.

In [4]:
# text-generation
from transformers import pipeline, set_seed

# text-generation
txt="a small dog with a collar on"
generator=pipeline("text-generation", model="distilbert/distilgpt2")
set_seed(6)
result= generator(txt, max_length=50, num_return_sequences=1)

print(result)
print(result[0]["generated_text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "a small dog with a collar on it's back, and is very much the same dog that you see it in. Once you catch the dog and move out of the crate, what happens next? How many times will it get out of the crate"}]
a small dog with a collar on it's back, and is very much the same dog that you see it in. Once you catch the dog and move out of the crate, what happens next? How many times will it get out of the crate
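Because generation involves random sampling, the seed set before the call determines which continuation comes out. The sketch below shows one way to collect a story per seed. `stories_for_seeds` is a hypothetical helper, not a library function; it takes the generator and the seeding function as arguments so it can be tried with stand-ins before a model is loaded.

```python
def stories_for_seeds(generator, seed_fn, prompt, seeds, **gen_kwargs):
    """Run the generator once per seed and collect the generated texts."""
    stories = {}
    for seed in seeds:
        seed_fn(seed)  # reset the random state before each run
        result = generator(prompt, **gen_kwargs)
        stories[seed] = result[0]["generated_text"]
    return stories
```

With the objects from the text-generation cell, a call might look like `stories_for_seeds(generator, set_seed, txt, [6, 7, 8], max_length=50, num_return_sequences=1)`, which returns a dictionary mapping each seed to its story.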


The fourth example is a bit more complicated: text to speech.

This is the model we chose: https://huggingface.co/microsoft/speecht5_tts

Read the instructions on the model page; you need to install some libraries first.

The output audio file is example.mp3.

In [5]:
!pip install --upgrade transformers sentencepiece datasets[audio]

Collecting transformers
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting datasets[audio]
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets[audio])
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets[audio])
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux20

In [6]:
# sample code for text to speech
from transformers import pipeline
from datasets import load_dataset
import soundfile as sf
import torch

#text 2 voice
def txt2voice(story):
  synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

  #define embedding
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

  #generate voice from text
  speech = synthesiser(story, forward_params={"speaker_embeddings": speaker_embedding})

  #output voice file
  sf.write("example.mp3", speech["audio"], samplerate=speech["sampling_rate"])

#call the function
txt2voice("This is a test. I love cooper.")


The fifth example is speech recognition, or voice to text.

We want to convert the voice file named example.mp3 back to text.

This is the model we use: https://huggingface.co/openai/whisper-large-v2

In [7]:
# speech recognition
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", "openai/whisper-large-v2")
result=pipe("example.mp3")

print(result)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


{'text': ' This is a test. I love Cooper.'}
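The transcription differs slightly from the text that was synthesised: it starts with a space and capitalises "Cooper". To check that the round trip from text to speech and back preserved the words, one option is to compare the two strings after normalising them. The helpers `normalize` and `same_words` below are hypothetical names for this sketch, and the normalisation rules (lowercase, no punctuation, single spaces) are an assumption about what counts as "the same".

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def same_words(original, transcript):
    """True if both strings contain the same words in the same order."""
    return normalize(original) == normalize(transcript)

original = "This is a test. I love cooper."      # text given to txt2voice
transcript = " This is a test. I love Cooper."   # copied from the output above
print(same_words(original, transcript))
```

For these two strings the comparison prints `True`, since they differ only in case and leading whitespace.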


# **Section 2: A More Complicated Example**

In this example, we want to accomplish the following task: given an image, write a story based on it, then read the story aloud using a computer-generated voice. The task can be divided into three subtasks: image to text, expanding the text into a story, and story to voice.
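The decomposition amounts to chaining three functions, where the output of each stage is the input of the next. The sketch below shows only that shape. `image_to_spoken_story` and its parameter names are hypothetical, and the three stage functions are passed in as arguments so the chain can be exercised with simple stand-ins.

```python
def image_to_spoken_story(image_url, caption_fn, story_fn, speech_fn):
    """Chain the three subtasks: image -> caption -> story -> audio."""
    caption = caption_fn(image_url)   # subtask 1: image to text
    story = story_fn(caption)         # subtask 2: expand text to a story
    audio = speech_fn(story)          # subtask 3: story to voice
    return caption, story, audio

# Stand-in stages, used here only to show how data flows through the chain
caption, story, audio = image_to_spoken_story(
    "dog.jpg",
    caption_fn=lambda url: "a dog",
    story_fn=lambda text: text + " went for a walk",
    speech_fn=lambda text: len(text),  # placeholder for real audio data
)
print(caption, "|", story, "|", audio)
```

Returning the intermediate caption and story alongside the audio makes it easier to see which stage is responsible when the final result is not what was expected.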

**Task 1: image to text with Hugging Face**

We will reuse example 2 from Section 1.

**Task 2: text to a story**

We will reuse example 3 from Section 1.

**Task 3: text to voice with Hugging Face**

We will reuse example 4 from Section 1.

**Now we can put all three parts together**

# **Assignment:**
Write code to put all three parts together. Use functions where possible to simplify the code.

The source image file is here: https://pages.mtu.edu/~cai/sat4520/dog.jpg

The generated voice file should be named speech.mp3.

In [9]:
##### Write your code here #####

# Import libraries
from transformers import pipeline, set_seed
from datasets import load_dataset
import soundfile as sf
import torch

# Image to text
def img2text(url):
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    result = captioner(url)
    return result[0]["generated_text"]  # keep only the text part

# Text generation
def txtGen(txt):
    generator = pipeline("text-generation", model="distilbert/distilgpt2")
    set_seed(6)  # fix the seed for reproducibility
    result = generator(txt, max_length=50, num_return_sequences=1)
    return result[0]["generated_text"]

# Text to voice
def txt2voice(story):
    synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

    # Load the speaker embedding that defines the voice
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    # Generate voice from text; the result holds both "audio" and "sampling_rate"
    return synthesiser(story, forward_params={"speaker_embeddings": speaker_embedding})

# Save audio to a file
def save_audio(speech, filename):
    # Let soundfile encode the waveform with its sampling rate, as in example 4;
    # writing the raw array bytes with open() would not produce a playable file
    sf.write(filename, speech["audio"], samplerate=speech["sampling_rate"])

# Run all three tasks
def complete_task(url, output_file):
    # Task 1: image to text
    text_from_image = img2text(url)

    # Task 2: text generation
    generated_story = txtGen(text_from_image)

    # Task 3: text to voice
    speech = txt2voice(generated_story)

    # Save the voice data to a file
    save_audio(speech, output_file)

# Call the function
complete_task("https://pages.mtu.edu/~cai/sat4520/dog.jpg", "speech.mp3")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
