<a href="https://colab.research.google.com/github/bejoyjon/Data-science/blob/master/HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WTF is HuggingFace?**

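Hugging Face is like a GitHub specifically for ML models.
The Hugging Face Hub hosts models, datasets, and demo apps (Spaces) as Git repositories, and libraries such as transformers download them by repository ID (for example, google/flan-t5-large).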

**Source - https://www.datacamp.com/datalab/w/ddd92206-5362-4269-86de-bdafdd5b97a0/edit**

# **Check HuggingFace capabilities for various AI-related tasks**

Import cell

In [None]:
from transformers import pipeline
import gradio as gr

**Pipeline Options**

Using Pipelines, the available tasks are:


---


 'audio-classification', 'automatic-speech-recognition', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-text-to-text', 'image-to-image', 'image-to-text', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text-to-audio', 'text-to-speech', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY'



---



Define all global variables here:


In [None]:
TABLE_MODEL_NAME = "google/tapas-base-finetuned-wtq"

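Use pipelines. A pipeline bundles the pre-processor (such as a tokenizer), the model, and the post-processing into a single call, whereas direct model usage leaves each of those steps to you in exchange for finer control.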

In [None]:
#document_pipeline = pipeline("document-question-answering", model="")
table_pipeline = pipeline("table-question-answering", model=TABLE_MODEL_NAME)
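
To query the pipeline, the table should be passed with every cell as a string, since TAPAS tokenizes the cell text. A minimal sketch, assuming a small table of approximate figures chosen only for illustration; `stringify_table` and `ask_table` are hypothetical helper names, and `ask_table` expects the `table_pipeline` object created above:

```python
def stringify_table(columns):
    """Convert every cell to str; TAPAS expects all cell values to be text."""
    return {name: [str(cell) for cell in cells] for name, cells in columns.items()}


def ask_table(table_pipeline, columns, query):
    """Run one question against a table with a table-question-answering pipeline."""
    return table_pipeline(table=stringify_table(columns), query=query)


# Example data for illustration only (approximate figures).
cities = {
    "City": ["Paris", "London", "Berlin"],
    "Population (millions)": [2.1, 8.9, 3.6],
}

table = stringify_table(cities)
print(table["Population (millions)"])
```

With the pipeline loaded, a call such as `ask_table(table_pipeline, cities, "Which city has the largest population?")` should return a dictionary that includes an `answer` field.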

Alternatively, you can load the model and its tokenizer directly:

In [None]:
from transformers import AutoModelForTableQuestionAnswering, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(TABLE_MODEL_NAME)
model = AutoModelForTableQuestionAnswering.from_pretrained(TABLE_MODEL_NAME)

Inference Providers are another option: the model runs on remote infrastructure instead of on your machine. Usage can be billed to your account, so it is less suited to learning and experimentation. For example:



```
import os
from huggingface_hub import InferenceClient
client = InferenceClient(
  provider = 'auto',
  api_key = os.environ['HF_TOKEN']
)
```



# **AutoModel and AutoTokenizer**
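HuggingFace stores models and their associated pre-processors on the HF Hub. AutoModel and AutoTokenizer read a checkpoint's configuration and instantiate the matching architecture-specific classes, so the same two calls work across most models.

The catch is that AutoModel loads only the base transformer, without any task-specific head. Even when the checkpoint was fine-tuned for a task, AutoModel returns hidden states rather than task predictions.

* AutoTokenizer applies to text models because tokenization is the pre-processing step for text. Other modalities have their own classes, such as AutoImageProcessor for images and AutoFeatureExtractor for audio.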
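
How the Auto classes pick a concrete class can be pictured as a lookup on the `model_type` field of the checkpoint's `config.json`. The sketch below is a simplified illustration of that idea in plain Python, not the library's actual code; `MODEL_CLASS_BY_TYPE` and `resolve_model_class` are hypothetical names, and the inline config is a trimmed stand-in for a real file:

```python
import json

# Hypothetical registry: maps a config's "model_type" to a concrete class name.
MODEL_CLASS_BY_TYPE = {
    "roberta": "RobertaModel",
    "t5": "T5Model",
    "tapas": "TapasModel",
}


def resolve_model_class(config_json):
    """Return the class name registered for the config's model_type."""
    model_type = json.loads(config_json)["model_type"]
    try:
        return MODEL_CLASS_BY_TYPE[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model_type: {model_type!r}") from None


# Trimmed stand-in for a checkpoint's config.json.
example_config = '{"model_type": "roberta", "num_hidden_layers": 12}'
print(resolve_model_class(example_config))
```

The task-specific Auto classes can be thought of the same way, with a separate mapping per task.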



In [None]:
try:
  import os, torch, huggingface_hub as hf_hub, datasets, transformers
except ImportError:
  # Install the missing libraries, then retry the same imports.
  print("Libs not installed yet, using !pip install now......")
  !pip install torch huggingface_hub datasets pyarrow transformers
  import os, torch, huggingface_hub as hf_hub, datasets, transformers

In [None]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")
model = AutoModel.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")

Printing the tokenizer and model objects shows their settings and, for the model, its layer structure.

In [None]:
print(tokenizer)
print(model)

What if we need the model for a sequence classification task? We use the next most generic class, AutoModelForSequenceClassification, which adds a classification head on top of the base model.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
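
The classification head outputs one raw score (logit) per label, which is usually turned into probabilities with a softmax. A minimal sketch in plain Python, assuming this checkpoint's label order is negative, neutral, positive (worth confirming on its model card); the logits below are placeholder numbers, not real model output:

```python
import math

# Assumed label order for cardiffnlp/twitter-roberta-base-sentiment.
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


placeholder_logits = [-1.2, 0.3, 2.5]  # illustrative values only
label, prob = top_label(placeholder_logits)
print(label, round(prob, 3))
```

With the real model, the logits would come from `model(**inputs).logits`, and `torch.softmax(logits, dim=-1)` performs the same computation on tensors.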

An even more specific approach is to use the actual class in the transformers library associated with a particular model:

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = RobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

Using the above knowledge, we can build an NLP pipeline with the FLAN-T5 large model (google/flan-t5-large).

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

In [None]:
input_language="English"
output_language="Dutch"
input_question = "How many kilometers from Washington DC to Miami beach?"
input_text = f"translate {input_language} to {output_language}: {input_question}" # you can even create a list of input texts for batch processing
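
Batch processing, as mentioned in the comment above, only needs a list of prompts in the same format. A minimal sketch; `build_translation_prompts` is a hypothetical helper and the second question is an added example:

```python
def build_translation_prompts(questions, source_language, target_language):
    """Prefix each question with the same translation instruction."""
    prefix = f"translate {source_language} to {target_language}: "
    return [prefix + question for question in questions]


questions = [
    "How many kilometers from Washington DC to Miami beach?",
    "Where is the nearest train station?",
]
prompts = build_translation_prompts(questions, "English", "Dutch")
for prompt in prompts:
    print(prompt)
```

Passing the list to the tokenizer with padding enabled, for example `tokenizer(prompts, return_tensors="pt", padding=True)`, pads the prompts to equal length so they form one batch, and `tokenizer.batch_decode(outputs, skip_special_tokens=True)` then decodes every generated sequence.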

In [None]:
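inputs = tokenizer(input_text, return_tensors="pt")  # dict-like BatchEncoding of PyTorch ("pt") tensors

We need torch because the model loaded above is a PyTorch module, so PyTorch is the framework that actually runs the computation at inference time. Wrapping the call in torch.no_grad() turns off gradient tracking, which is only needed for training and would otherwise cost extra memory.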

In [None]:
with torch.no_grad():  # gradients are not needed for inference
  outputs = model.generate(**inputs)  # inputs are PyTorch tensors because of return_tensors="pt"

# outputs is a tensor with one row of generated token IDs per input text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))