<a href="https://colab.research.google.com/github/dvp-git/NLP_related/blob/main/Hugging_Face_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: what can they do?


In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:

# pipeline() connects a model with its necessary preprocessing and postprocessing steps, so we can pass in raw text directly and get an intelligible answer
from transformers import pipeline


In [3]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for this my whole life")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.994350790977478}]
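The warning printed above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash from that warning. `build_classifier` is a hypothetical helper name, and the import is deferred so that defining the function does not load anything.

```python
# Model name and revision are copied from the warning printed above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_classifier():
    """Build a sentiment-analysis pipeline with the model pinned explicitly."""
    # Imported here so the function can be defined without loading the library.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Calling `build_classifier()` should return an object that is used exactly like `classifier` above.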

In [4]:
# Pass multiple sentences using a list
classifier(["I've been wanting to meet Mr.Jackson","I'm sad that I din't have tea today."])

[{'label': 'POSITIVE', 'score': 0.9992484450340271},
 {'label': 'NEGATIVE', 'score': 0.9842413067817688}]

In [5]:
classifier(["That's crazy"])

[{'label': 'NEGATIVE', 'score': 0.6786131858825684}]

In [6]:
classifier(["Yeah, right!"])

[{'label': 'POSITIVE', 'score': 0.9997729659080505}]


Passing text to a pipeline involves three main steps:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The model's predictions are post-processed so that you can make sense of them.
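The three steps above can be sketched with toy stand-ins. Everything in this block is an assumption made for illustration: the vocabulary, the per-token weights, and the function names `preprocess`, `model_forward`, and `postprocess` are hypothetical and are not the library's internals. Only the overall shape (text in, label and score out) mirrors what the pipeline returns.

```python
import math

# Toy vocabulary and per-token weights (hypothetical, for illustration only).
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
# One (negative_logit, positive_logit) contribution per token id.
WEIGHTS = {0: (0.0, 0.0), 1: (0.0, 0.0), 2: (-1.0, 2.0), 3: (2.0, -1.0), 4: (0.0, 0.0)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def model_forward(token_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    neg = sum(WEIGHTS[t][0] for t in token_ids)
    pos = sum(WEIGHTS[t][1] for t in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the best label with its score."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline does behind a single call."""
    return postprocess(model_forward(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```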

Some of the available pipelines:
```
feature-extraction (get the vector representation of a text)
fill-mask
ner (named entity recognition)
question-answering
sentiment-analysis
summarization
text-generation
translation
zero-shot-classification
```



ZERO-SHOT CLASSIFICATION: Using your own labels

In [7]:
classifier_zero_shot = pipeline('zero-shot-classification')
classifier_zero_shot("This course is about NLP and transformers usage in modern NLP applications", candidate_labels=["education","research","deep learning"])

# help(classifier_zero_shot)  # shows the pipeline's documentation


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This course is about NLP and transformers usage in modern NLP applications',
 'labels': ['education', 'research', 'deep learning'],
 'scores': [0.6395835876464844, 0.3378009498119354, 0.022615382447838783]}

```

"""
Args:
 |          sequences (`str` or `List[str]`):
 |              The sequence(s) to classify, will be truncated if the model input is too large.

 |          candidate_labels (`str` or `List[str]`):
 |              The set of possible class labels to classify each sequence into. Can be a single label, a string of
 |              comma-separated labels, or a list of labels.

 |          hypothesis_template (`str`, *optional*, defaults to `"This example is {}."`):
 |              The template used to turn each label into an NLI-style hypothesis. This template must include a {} or
 |              similar syntax for the candidate label to be inserted into the template. For example, the default
 |              template is `"This example is {}."` With the candidate label `"sports"`, this would be fed into the
 |              model like `"<cls> sequence to classify <sep> This example is sports . <sep>"`. The default template
 |              works well in many cases, but it may be worthwhile to experiment with different templates depending on
 |              the task setting.
 
 |          multi_label (`bool`, *optional*, defaults to `False`):
 |              Whether or not multiple candidate labels can be true. If `False`, the scores are normalized such that
 |              the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
 |              independent and probabilities are normalized for each candidate by doing a softmax of the entailment
 |              score vs. the contradiction score.
 """ 
 ```
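The `hypothesis_template` and `multi_label` descriptions above can be sketched in plain Python. The logits below are made-up numbers and the function names are hypothetical. The block only mirrors the normalization described in the docstring (it ignores any other NLI classes the model may output) and is not the library's actual code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI-style hypothesis."""
    return [template.format(label) for label in labels]


def score_labels(nli_logits, multi_label=False):
    """nli_logits: one (contradiction, entailment) logit pair per label."""
    if multi_label:
        # Each label is scored independently: entailment vs. contradiction.
        return [softmax([contra, entail])[1] for contra, entail in nli_logits]
    # Single-label: softmax over the entailment logits of all labels.
    return softmax([entail for _, entail in nli_logits])


labels = ["education", "research", "deep learning"]
# Made-up logits, for illustration only.
logits = [(-1.0, 2.0), (-0.5, 1.5), (1.0, -1.0)]

print(build_hypotheses(labels))
print(score_labels(logits))                    # scores sum to 1 across labels
print(score_labels(logits, multi_label=True))  # each score is independent
```

With `multi_label=False` the labels compete for a total probability of 1, whereas with `multi_label=True` several labels can score highly at once.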


TEXT GENERATION

In [8]:
generator = pipeline("text-generation")
generator("I am very ",num_return_sequences=5,max_length=25)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am very **********-e-Yee-jou-I.'},
 {'generated_text': 'I am very icky. A lot of times I go for two and three times a day.\n\nWhen I was'},
 {'generated_text': 'I am very icky."\n\nHe didn\'t see her face, and it wasn\'t until she started to laugh that'},
 {'generated_text': 'I am very icky when I take my cat and they are on the floor right now... I love you, and I'},
 {'generated_text': "I am very \xa0confident that I am about to find out if I am truly capable. We've been talking all"}]

```
 Methods defined here:
 |  
 |  __call__(self, text_inputs, **kwargs)
 |      Complete the prompt(s) given as inputs.
 |      
 |      Args:
 |          args (`str` or `List[str]`):
 |              One or several prompts (or one list of prompts) to complete.
 |          return_tensors (`bool`, *optional*, defaults to `False`):
 |              Whether or not to return the tensors of predictions (as token indices) in the outputs. If set to
 |              `True`, the decoded text is not returned.
 |          return_text (`bool`, *optional*, defaults to `True`):
 |              Whether or not to return the decoded texts in the outputs.
 |          return_full_text (`bool`, *optional*, defaults to `True`):
 |              If set to `False` only added text is returned, otherwise the full text is returned. Only meaningful if
 |              *return_text* is set to True.
 |          clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
 |              Whether or not to clean up the potential extra spaces in the text output.
 |          prefix (`str`, *optional*):
 |              Prefix added to prompt.
 |          handle_long_generation (`str`, *optional*):
 |              By default, this pipeline does not handle long generation (ones that exceed in one form or the other
 |              the model maximum length). There is no perfect way to address this (more info:
 |              https://github.com/huggingface/transformers/issues/14033#issuecomment-948385227). This provides common
 |              strategies to work around that problem depending on your use case.
 |      
 |              - `None` : default strategy where nothing in particular happens
 |              - `"hole"`: Truncates left of input, and leaves a gap wide enough to let generation happen (might
 |                truncate a lot of the prompt and not suitable when generation exceeds the model capacity)
 |      
 |          generate_kwargs:
 |              Additional keyword arguments to pass along to the generate method of the model (see the generate method
 |              corresponding to your framework [here](./model#generative-models)).
 |      
 |      Return:
 |          A list or a list of lists of `dict`: Returns one of the following dictionaries (cannot return a combination
 |          of both `generated_text` and `generated_token_ids`):
 |      
 |          - **generated_text** (`str`, present when `return_text=True`) -- The generated text.
 |          - **generated_token_ids** (`torch.Tensor` or `tf.Tensor`, present when `return_tensors=True`) -- The token
 |            ids of the generated text.
 ```
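The `"hole"` strategy described above can be illustrated on a plain list of token ids. This is a sketch of the idea only: `truncate_left_for_generation` and its parameters are hypothetical names, not the library's implementation.

```python
def truncate_left_for_generation(token_ids, model_max_length, max_new_tokens):
    """Drop tokens from the left so prompt plus generated tokens fit the model."""
    keep = model_max_length - max_new_tokens
    if keep <= 0:
        # The requested generation alone exceeds the model capacity.
        raise ValueError("max_new_tokens must be smaller than model_max_length")
    # Keep only the most recent `keep` tokens; shorter prompts are unchanged.
    return token_ids[-keep:]


prompt_ids = list(range(12))  # stand-in for a tokenized prompt
print(truncate_left_for_generation(prompt_ids, model_max_length=10, max_new_tokens=4))
```

The beginning of the prompt is what gets discarded, which is why the docstring warns that this strategy may truncate a lot of the prompt.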


In [9]:
generator_2 = pipeline("text-generation",model="distilgpt2")
generator_2("I'm facing difficulty learning", num_return_sequences=2, max_length=30)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I\'m facing difficulty learning to work in a different occupation, and sometimes I find that the experience becomes more challenging," she said.\n\n\n\n'},
 {'generated_text': "I'm facing difficulty learning. There's a need to continue learning. I need to be Racing, and I need to continue learning.\n\n\n"}]

MASK FILLING: Fill-in-the-blank tasks


In [10]:
unmasker = pipeline("fill-mask")
unmasker("The <mask> are distributing snacks at school", top_k=2)  # top_k: number of highest-scoring candidates to return

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.18719838559627533,
  'token': 2786,
  'token_str': ' boys',
  'sequence': 'The boys are distributing snacks at school'},
 {'score': 0.1015859916806221,
  'token': 1972,
  'token_str': ' girls',
  'sequence': 'The girls are distributing snacks at school'}]
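The result is a list of dictionaries, so it can be post-processed like any Python list. The sketch below copies the two results shown above into a literal, so it runs without loading the model. The variable names are illustrative.

```python
# Results copied from the output above.
results = [
    {"score": 0.18719838559627533, "token": 2786, "token_str": " boys",
     "sequence": "The boys are distributing snacks at school"},
    {"score": 0.1015859916806221, "token": 1972, "token_str": " girls",
     "sequence": "The girls are distributing snacks at school"},
]

# token_str keeps a leading space here, so strip it for display.
candidates = [(r["token_str"].strip(), round(r["score"], 3)) for r in results]
print(candidates)
```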

NER (NAMED ENTITY RECOGNITION): Identifying entities in text

In [11]:
ner = pipeline("ner", aggregation_strategy="simple")  # "simple" groups tokens that belong to the same entity
# help(ner)
ner("HI I'm Darryl and I am at Buffalo")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'LOC',
  'score': 0.5850019,
  'word': 'H',
  'start': 0,
  'end': 1},
 {'entity_group': 'PER',
  'score': 0.99896634,
  'word': 'Darryl',
  'start': 7,
  'end': 13},
 {'entity_group': 'LOC',
  'score': 0.8737017,
  'word': 'Buffalo',
  'start': 26,
  'end': 33}]
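The `start` and `end` fields in the output above are character offsets into the input string, and the low-scoring `'H'` entry looks spurious. A small sketch, using the results copied from that output, checks the offsets and drops low-confidence entities. The `MIN_SCORE` cutoff of 0.8 is an arbitrary assumption chosen for this example.

```python
text = "HI I'm Darryl and I am at Buffalo"

# Entities copied from the output above.
entities = [
    {"entity_group": "LOC", "score": 0.5850019, "word": "H", "start": 0, "end": 1},
    {"entity_group": "PER", "score": 0.99896634, "word": "Darryl", "start": 7, "end": 13},
    {"entity_group": "LOC", "score": 0.8737017, "word": "Buffalo", "start": 26, "end": 33},
]

# start/end index into the original string.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["word"]

# Hypothetical cutoff: drop low-confidence entities such as the spurious "H".
MIN_SCORE = 0.8
confident = [e for e in entities if e["score"] >= MIN_SCORE]
print([(e["word"], e["entity_group"]) for e in confident])
```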

QUESTION ANSWERING

In [12]:
question_answer = pipeline("question-answering")
question_answer(
    question="Where do I work?",
    context="Working at Browns is great for the heart and soul",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9930919408798218, 'start': 11, 'end': 17, 'answer': 'Browns'}

Note: this pipeline does not generate an answer. It extracts the answer span from the provided context.
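The `start` and `end` values in the output above are character offsets into the context, which can be checked directly. The sketch below copies the result into a literal, so it runs without loading the model.

```python
context = "Working at Browns is great for the heart and soul"

# Result copied from the output above.
result = {"score": 0.9930919408798218, "start": 11, "end": 17, "answer": "Browns"}

# Slicing the context with start/end recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```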

SUMMARIZATION

In [13]:
summarize = pipeline("summarization")
summarize("A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.[1][2][3] This makes them applicable to tasks such as unsegmented, connected handwriting recognition[4] or speech recognition.[5][6] Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs.[7]\The term \"recurrent neural network\" is used to refer to the class of networks with an infinite impulse response, whereas \"convolutional neural network\" refers to the class of finite impulse response. Both classes of networks exhibit temporal dynamic behavior.[8] A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that can not be unrolled.\
Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be under direct control by the neural network. The storage can also be replaced by another network or graph if that incorporates time delays or has feedback loops. Such controlled states are referred to as gated state or gated memory, and are part of long short-term memory networks (LSTMs) and gated recurrent units. This is also called Feedback Neural Network (FNN)."
)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' A recurrent neural network is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes . This allows it to exhibit temporal dynamic behavior . RNNs can use their internal state (memory) to process variable length sequences of inputs . This makes them applicable to tasks such as unsegmented, connected handwriting recognition[4] or speech recognition .'}]

TRANSLATION


In [14]:
!pip install sentencepiece
translator = pipeline("translation",model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




[{'translation_text': 'This course is produced by Hugging Face.'}]