## A little about Transformers and Hugging Face

### The Transformer is a neural network architecture designed for sequence-to-sequence tasks. Transformer models are typically pretrained on large amounts of unlabeled text (self-supervised learning) and then fine-tuned for a specific task, and they have largely replaced recurrent neural networks in natural language processing.

### The Hugging Face transformers package is a very popular Python library that provides pretrained models for a variety of natural language processing (NLP) tasks. It originally supported only PyTorch; TensorFlow 2 support was added in late 2019.

In [1]:
# install transformers
!pip install transformers



### Text Generation

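#### import the model (GPT2LMHeadModel) to generate text from a pretrained model
#### and the tokenizer (GPT2Tokenizer) to encode and decode text
#### using PyTorch (GPT2LMHeadModel is the PyTorch class; TFGPT2LMHeadModel is the TensorFlow one)

In [2]:
# GPT2LMHeadModel runs on PyTorch, so TensorFlow does not need to be imported here
from transformers import GPT2LMHeadModel, GPT2Tokenizer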

In [3]:
# the first call downloads the pretrained tokenizer files
# the small version can be used instead by replacing gpt2-large with gpt2

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

In [4]:
# the first call downloads the pretrained model weights (gpt2-large is a big download; gpt2 is the small version)
# GPT-2 has no padding token, so the end-of-sequence token is used for padding

model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

In [5]:
# the model will generate text for you based on this sentence 

sentence = 'Natural language processing'

In [6]:
# encode the sentence to tokens

input_ids = tokenizer.encode(sentence, return_tensors='pt')

In [7]:
# display the input_ids and decode them back to text

print('encoded sentence :', input_ids[0], '\n', 'decoded sentence :', tokenizer.decode(input_ids[0]))

encoded sentence : tensor([35364,  3303,  7587]) 
 decoded sentence : Natural language processing
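
To make the encode/decode round trip more concrete, here is a minimal toy sketch in plain Python. It is not GPT-2's real byte-level BPE tokenizer: the three ids are the ones printed above, but pairing them with the pieces `"Natural"`, `" language"` and `" processing"` is an assumption made for illustration, and `toy_encode` / `toy_decode` are hypothetical helpers.

```python
# Toy vocabulary: the ids come from the output above; the pieces they are
# paired with are an assumption for illustration only.
TOY_VOCAB = {"Natural": 35364, " language": 3303, " processing": 7587}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Split text into known pieces (longest match first) and return their ids."""
    ids = []
    pos = 0
    while pos < len(text):
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(piece, pos):
                ids.append(TOY_VOCAB[piece])
                pos += len(piece)
                break
        else:
            # no known piece starts at this position
            raise ValueError(f"no piece matches at position {pos}")
    return ids


def toy_decode(ids):
    """Map ids back to pieces and join them into a string."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


toy_ids = toy_encode("Natural language processing")
print(toy_ids)              # [35364, 3303, 7587]
print(toy_decode(toy_ids))  # Natural language processing
```

Note how the leading space belongs to the piece itself in this sketch, which is why decoding is a plain string join.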


In [8]:
# generate text with beam search until the sequence reaches at most 300 tokens (prompt included; tokens, not words)

output = model.generate(input_ids, max_length=300, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
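
The arguments `num_beams` and `no_repeat_ngram_size` can be hard to picture, so here is a heavily simplified sketch of the idea in plain Python. It is not the library's implementation: `TOY_MODEL` is an invented next-token table, `toy_beam_search` and `banned_tokens` are hypothetical helpers, and details such as length penalties and the exact `early_stopping` behaviour are left out.

```python
import math

# Hypothetical toy "language model": the next-token probabilities depend only
# on the last token. The numbers are invented for illustration.
TOY_MODEL = {
    "a": {"a": 0.6, "b": 0.3, "<eos>": 0.1},
    "b": {"a": 0.5, "b": 0.2, "<eos>": 0.3},
}


def banned_tokens(seq, n):
    """Tokens that would complete an n-gram already present in seq."""
    if n <= 0 or len(seq) < n - 1:
        return set()
    prefix = tuple(seq[len(seq) - (n - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n - 1]) == prefix:
            banned.add(seq[i + n - 1])
    return banned


def toy_beam_search(prompt, num_beams=2, max_length=5, no_repeat_ngram_size=0):
    """Keep the num_beams most probable sequences at every step."""
    beams = [(0.0, list(prompt))]  # (cumulative log-probability, tokens)
    finished = []
    while beams:
        candidates = []
        for score, seq in beams:
            banned = banned_tokens(seq, no_repeat_ngram_size)
            for token, prob in TOY_MODEL[seq[-1]].items():
                if token not in banned:
                    candidates.append((score + math.log(prob), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:num_beams]:
            if seq[-1] == "<eos>" or len(seq) >= max_length:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
    return max(finished, key=lambda c: c[0])[1]


plain = toy_beam_search(["a"], num_beams=2, max_length=5)
constrained = toy_beam_search(["a"], num_beams=2, max_length=5,
                              no_repeat_ngram_size=2)
print(plain)        # ['a', 'a', 'a', 'a', 'a']
print(constrained)  # a sequence in which no bigram occurs twice
```

Without the constraint the toy model loops on its most likely token; with `no_repeat_ngram_size=2` every two-token pair may appear only once, which is the same effect the real argument is meant to have on the generated text.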

In [9]:
# the output variable now holds the generated token ids

output

tensor([[35364,  3303,  7587,   357,    43, 22182,     8,   318,   257,  8173,
           326,  3578,  9061,   284,  1833,   262,  3616,   286,  3288,  3303,
            13,   406, 22182,   318,  1912,   319,   262,  2126,   326,   262,
          1692,  3632,   318,  6007,   286,  7587,   257,  1588,  2033,   286,
          1321,   287,   257,  1790,  2278,   286,   640,    11,   290,   326,
           428,  1321,   460,   788,   307,   973,   284,   787,  1167,  4972,
           546,   262,   995,  1088,   514,    13,   198,   198,  1890,  1672,
            11,   257,  3644,  1244,   307,  1498,   284, 13249,   326,   257,
          1048,   318,  7954,   416,  2045,   379,   511, 16324, 14700,    11,
           393,   326,   484,   389,   287,  2356,   416, 22712,   511,  1767,
          8650,    13,   317,  3644,   460,   635,   307,  8776,   284,  7564,
          1728,  2456,   290, 20144,    11,   884,   355,   366,    40,  1842,
           345,     1,   393,   366,  1026,   338,  

In [10]:
# decode the output

print(tokenizer.decode(output[0], skip_special_tokens=True))

Natural language processing (LNP) is a technique that allows computers to understand the meaning of natural language. LNP is based on the idea that the human brain is capable of processing a large amount of information in a short period of time, and that this information can then be used to make inferences about the world around us.

For example, a computer might be able to infer that a person is angry by looking at their facial expressions, or that they are in pain by analyzing their body movements. A computer can also be trained to recognize certain words and phrases, such as "I love you" or "It's a beautiful day in New York City." This type of machine learning can be applied to a wide variety of tasks, including image recognition, speech recognition and text processing.


In [11]:
# write the text to a .txt file in the current directory

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
with open('blog.txt', 'w') as f:
    f.write(generated_text)

In [12]:
# read the saved file to check its content

with open("blog.txt", "r") as f:
    print(f.read())

Natural language processing (LNP) is a technique that allows computers to understand the meaning of natural language. LNP is based on the idea that the human brain is capable of processing a large amount of information in a short period of time, and that this information can then be used to make inferences about the world around us.

For example, a computer might be able to infer that a person is angry by looking at their facial expressions, or that they are in pain by analyzing their body movements. A computer can also be trained to recognize certain words and phrases, such as "I love you" or "It's a beautiful day in New York City." This type of machine learning can be applied to a wide variety of tasks, including image recognition, speech recognition and text processing.


### Summarization, Sentiment Analysis, Question Answering
### Other pipeline tasks are available as well: audio-classification, automatic-speech-recognition, feature-extraction, text-classification, token-classification, question-answering, table-question-answering, fill-mask, summarization, translation, text2text-generation, text-generation, zero-shot-classification, conversational, image-classification, image-segmentation, object-detection, translation_XX_to_YY

### Summarization

In [13]:
# pipeline downloads a suitable model for a task (here summarization) and wraps it in a simple callable
from transformers import pipeline

In [14]:
# create the summarization pipeline (no model is named, so the default one is used)

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [15]:
# write the text that will be summarized

Blog = ''' 
NLP tasks :
Human language is filled with ambiguities that make it incredibly difficult to write software that accurately determines 
the intended meaning of text or voice data. 
Homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage exceptions, 
variations in sentence structure—these just a few of the irregularities of human language that take humans years to learn,
but that programmers must teach natural language-driven applications to recognize and understand accurately from the start,
if those applications are going to be useful.
Several NLP tasks break down human text and voice data in ways that help the computer make sense of what it's ingesting.
Some of these tasks include the following:

    Speech recognition, also called speech-to-text, is the task of reliably converting voice data into text data.
    Speech recognition is required for any application that follows voice commands or answers spoken questions.
    What makes speech recognition especially challenging is the way people talk—quickly, slurring words together, 
    with varying emphasis and intonation, in different accents, and often using incorrect grammar.
    Part of speech tagging, also called grammatical tagging, is the process of determining 
    the part of speech of a particular word or piece of text based on its use and context. 
    Part of speech identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in ‘What make of car do you own?’
    Word sense disambiguation is the selection of the meaning of a word with multiple meanings  through a process 
    of semantic analysis that determine the word that makes the most sense in the given context. For example, 
    word sense disambiguation helps distinguish the meaning of the verb 'make' in ‘make the grade’ (achieve) vs.
    ‘make a bet’ (place).
    Named entity recognition, or NEM, identifies words or phrases as useful entities.
    NEM identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.
    Co-reference resolution is the task of identifying if and when two words refer to the same entity.
    The most common example is determining the person or object to which a certain pronoun refers (e.g., ‘she’ = ‘Mary’), 
    but it can also involve identifying a metaphor or an idiom in the text 
    (e.g., an instance in which 'bear' isn't an animal but a large hairy person).
    Sentiment analysis attempts to extract subjective qualities—attitudes, emotions, sarcasm, confusion, suspicion—from text.
    Natural language generation is sometimes described as the opposite of speech recognition or speech-to-text; 
    it's the task of putting structured information into human language. 


'''

In [16]:
# summarize the text; the summary length is limited to between 30 and 90 tokens (not words)

summarizer(Blog, max_length=90, min_length=30, do_sample=False)

[{'summary_text': " Human language is filled with ambiguities that make it incredibly difficult to write software that accurately determines the intended meaning of text or voice data . NLP tasks break down human text and voice data in ways that help the computer make sense of what it's ingesting ."}]
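
The pipeline returns a list with one dictionary per input text. A minimal sketch of pulling the summary out and tidying it is shown below; the `result` value is copied from the output above so that the snippet runs without the model, and the whitespace clean-up is just one possible choice.

```python
# Result copied from the cell output above.
result = [{'summary_text': (
    " Human language is filled with ambiguities that make it incredibly "
    "difficult to write software that accurately determines the intended "
    "meaning of text or voice data . NLP tasks break down human text and "
    "voice data in ways that help the computer make sense of what it's "
    "ingesting ."
)}]

# one dictionary per input text; the summary is stored under 'summary_text'
summary = result[0]["summary_text"].strip()

# remove the space this model leaves before full stops
summary = summary.replace(" .", ".")

print(summary)
print(len(summary.split()), "words")
```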

### Sentiment Analysis

In [17]:
# create the sentiment-analysis classifier; the first call downloads the default model


classifier = pipeline("sentiment-analysis")


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [18]:
# the text to classify; a list of several sentences can be passed at once

sentences_to_class = classifier(['i am happy', 'i am not happy', 'i am sorry but i am happy.'])



In [19]:
# print the results (one label and score per input sentence)

print(sentences_to_class)

[{'label': 'POSITIVE', 'score': 0.9998801946640015}, {'label': 'NEGATIVE', 'score': 0.9997896552085876}, {'label': 'POSITIVE', 'score': 0.999850869178772}]
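
Each `score` is a probability between 0 and 1. For a classifier like this one it is presumably obtained by applying a softmax to the model's raw outputs (logits) and reporting the largest value. The sketch below shows that step in plain Python; the logits and the label order are invented for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["NEGATIVE", "POSITIVE"]  # assumed label order
toy_logits = [-4.0, 4.5]           # invented values, not real model output

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": LABELS[best], "score": probs[best]}
print(prediction)
```

A large gap between the two logits gives a score very close to 1, which is why the results above look so confident even for the mixed sentence.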


### Question Answering

In [20]:
# create the question-answering pipeline (pipeline was imported above)
# the pipeline loads its own tokenizer and model, so no separate imports are needed

model = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [21]:
# the model takes each question and extracts the answer from the text (context) provided below

text = r"""
 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]


In [22]:
for question in questions:
    # pass the question and the context as keyword arguments
    print(model(question=question, context=text))

{'score': 0.5269609689712524, 'start': 255, 'end': 263, 'answer': 'over 32+'}
{'score': 0.9512110948562622, 'start': 93, 'end': 122, 'answer': 'general-purpose\narchitectures'}
{'score': 0.8400999307632446, 'start': 334, 'end': 360, 'answer': 'TensorFlow 2.0 and PyTorch'}
