<a href="https://colab.research.google.com/github/elianedb/BERT_SE/blob/main/Experimenting_with_NLP_Medium.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practicing some Hugging Face Transformers Code
This notebook is a lightning-quick demonstration of some of the cutting-edge language models that are available, open source, to anyone with an internet connection. Hugging Face (https://huggingface.co/) has made them super easy to use, so feel free to play around with the inputs here to see how they work.

This code comes from https://huggingface.co/transformers/task_summary.html



## Running the code in this notebook

Running the code in this notebook is really easy. Hover your mouse over a block of code and a small play button appears on its left-hand side; click it to run the block. A whirling circle means the code is running. Let it finish before you run the next block. Usually you'll see some output below the block of code. Each block takes a few seconds to run because the code is running on Google's servers.

First, we need to install the library that runs these models, because it isn't pre-installed on Google Colab.

In [None]:
!pip install -q transformers
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf


# Paraphrase Detection
This nifty pre-trained model classifies whether two sentences are paraphrases of each other. We start off by loading it. The model has been **fine-tuned** on this task: after the initial model was trained on its original, more general task, it was trained further on a data set of paraphrased sentences. This is a great example of **transfer learning**.

Don't worry if you get what looks like a warning after running this code. You can ignore it; the model's good to go.

In [None]:
# Load the tokenizer (the thing that turns words into numbers for the model) and the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
# Set up the two classes the model predicts
classes = ["not a paraphrase", "a paraphrase"]
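
To make "turns words into numbers" a little more concrete, here is a rough sketch of how a pair of sentences becomes one BERT-style input: `[CLS] sentence A [SEP] sentence B [SEP]`, plus segment ids that mark which sentence each token belongs to. The vocabulary `TOY_VOCAB` and the helper `toy_encode_pair` are made up for illustration and are not part of the `transformers` library; the real tokenizer uses a WordPiece vocabulary with tens of thousands of entries rather than splitting on spaces.

```python
# Toy illustration only: this tiny vocabulary and helper are hypothetical.
# The real BERT tokenizer splits text into WordPiece sub-words instead.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "language": 6, "models": 7}


def toy_encode_pair(sentence_a, sentence_b, vocab=TOY_VOCAB):
    """Encode two sentences as one BERT-style input: [CLS] A [SEP] B [SEP]."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Words missing from the vocabulary fall back to the [UNK] id
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Segment ids: 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and the final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {"tokens": tokens, "input_ids": input_ids, "token_type_ids": token_type_ids}


encoded = toy_encode_pair("I love language", "I love models")
print(encoded["tokens"])
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```

The real tokenizer returns the same kind of fields (`input_ids`, `token_type_ids`, and an attention mask) as tensors when it is called with two sentences.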

## Giving the model some examples
Below are a few examples we'll feed to the model to see how good it is at detecting whether something is a paraphrase or not. Go ahead and change the bits of text between the quotation marks "" to see if you can fool the model.

In [None]:
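# sequence_1 is paired with sequence_3 as the "paraphrase" example,
# and sequence_2 is paired with sequence_3 as the "not a paraphrase" example
sequence_1 = "I love building machine learning models to understand language"
sequence_2 = "Rick and Morty is my favourite show on Netflix"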
sequence_3 = "Natural Language Processing is one of my favourite areas of machine learning"

In [None]:
# Tokenize the sentences, turning them into a numerical representation the model can read
paraphrase = tokenizer(sequence_1, sequence_3, return_tensors="tf")
not_paraphrase = tokenizer(sequence_2, sequence_3, return_tensors="tf")

# Let's get the model's predictions: softmax turns the raw scores (logits) into probabilities
paraphrase_predictions = tf.nn.softmax(model(paraphrase)[0], axis=1).numpy()[0]
not_paraphrase_predictions = tf.nn.softmax(model(not_paraphrase)[0], axis=1).numpy()[0]
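
If you're curious what the softmax step is doing, here is a minimal pure-Python sketch. The model outputs one raw score (logit) per class, and softmax rescales those scores into probabilities that add up to 1. The numbers in `toy_logits` are made up for illustration and are not real model output.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the largest score first keeps exp() from overflowing
    largest = max(logits)
    exps = [math.exp(score - largest) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


toy_classes = ["not a paraphrase", "a paraphrase"]
toy_logits = [-1.0, 2.0]  # made-up scores, one per class
toy_probabilities = softmax(toy_logits)

for name, probability in zip(toy_classes, toy_probabilities):
    print("Probability it's {}: {}%".format(name, round(probability * 100, 2)))
```

The class with the larger raw score always ends up with the larger probability; softmax only changes the scale, not the ranking.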

Now, let's take a look at how good it is at picking up whether one sentence is a paraphrase of another.

In [None]:
# Let's take a look at the results from the 1st and 3rd sentences
print(" === Example One ===")
print("Sentence 1: {} \nSentence 3: {}".format(sequence_1, sequence_3))
print("\nModel prediction:")
for i in range(len(classes)):
  print("Probability it's {}: {}%".format(classes[i], round(paraphrase_predictions[i]*100,2)))

# And now from the 2nd and 3rd sentences
print("\n === Example Two ===")
print("Sentence 2: {} \nSentence 3: {}".format(sequence_2, sequence_3))
print("\nModel prediction:")
for i in range(len(classes)):
  print("Probability it's {}: {}%".format(classes[i], round(not_paraphrase_predictions[i]*100,2)))

# Transformers Q&A 
The ease with which one can build question answering (Q&A) models blows my mind. Basically, the model scores every token (roughly, every word) in the text as a possible start and a possible end of the answer, and the answer is the span between the most likely start and the most likely end. The code below uses a pre-trained BERT model that has been fine-tuned for Q&A.
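
Here is a small sketch of that "most likely start and end" idea, using plain Python lists instead of the real model. The tokens and scores below are made up for illustration; the real model produces one start score and one end score for every token in the input.

```python
# Made-up tokens and scores for illustration; not real model output
toy_tokens = ["[CLS]", "who", "founded", "apple", "?", "[SEP]",
              "apple", "was", "founded", "by", "steve", "jobs", "[SEP]"]
toy_start_logits = [0.1, 0.0, 0.2, 0.1, 0.0, 0.0, 1.5, 0.3, 0.2, 0.4, 6.2, 1.1, 0.0]
toy_end_logits = [0.0, 0.1, 0.0, 0.2, 0.0, 0.1, 0.3, 0.2, 0.1, 0.2, 1.4, 5.8, 0.3]


def argmax(values):
    """Return the position of the largest value in a list."""
    return max(range(len(values)), key=lambda index: values[index])


def extract_span(tokens, start_logits, end_logits):
    """Pick the highest-scoring start and end positions and join the tokens in between."""
    start = argmax(start_logits)
    end = argmax(end_logits) + 1  # +1 so the slice includes the end token
    return " ".join(tokens[start:end])


print(extract_span(toy_tokens, toy_start_logits, toy_end_logits))
```

Picking the start and the end independently like this is the simplest approach. It does not guarantee that the start comes before the end, so an empty answer is possible when the model is unsure.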

Code taken from https://huggingface.co/transformers/task_summary.html

In [None]:
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

Below we load the fine-tuned model. Don't worry if you see some warning messages; the model is good to go.

In [None]:
# We load up the pretrained tokenizer and model, both trained/fine-tuned on a QA task
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
qa_model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

To make this task more realistic, I copied the opening paragraphs of the Wikipedia article on Apple and then thought of two questions to ask. After you see how this model works, there's a code block below where you can enter your own text and the questions you want the model to try to answer.

In [None]:
# Add some text and questions
text = r"""Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 to develop and sell Wozniak's Apple I personal computer, 
           though Wayne sold his share back within 12 days. It was incorporated as Apple Computer, Inc., in January 1977, and sales of its computers, 
           including the Apple II, grew quickly. Within a few years, Jobs and Wozniak had hired a staff of computer designers and had a production line. 
           Apple went public in 1980 to instant financial success. Over the next few years, Apple shipped new computers featuring innovative graphical 
           user interfaces, such as the original Macintosh in 1984, and Apple's marketing advertisements for its products received widespread critical 
           acclaim. However, the high price of its products and limited application library caused problems, as did power struggles between executives. 
           In 1985, Wozniak departed Apple amicably and remained an honorary employee, while Jobs and others resigned to found NeXT.
          As the market for personal computers expanded and evolved through the 1990s, Apple lost market share to the lower-priced duopoly of Microsoft 
          Windows on Intel PC clones. The board recruited CEO Gil Amelio to what would be a 500-day charge for him to rehabilitate the financially troubled 
          company—reshaping it with layoffs, executive restructuring, and product focus. In 1997, he led Apple to buy NeXT, solving the desperately 
          failed operating system strategy and bringing Jobs back. Jobs pensively regained leadership status, becoming CEO in 2000. Apple swiftly 
          returned to profitability under the revitalizing Think different campaign, as he rebuilt Apple's status by launching the iMac in 1998, 
          opening the retail chain of Apple Stores in 2001, and acquiring numerous companies to broaden the software portfolio. In January 2007, 
          Jobs renamed the company Apple Inc., reflecting its shifted focus toward consumer electronics, and launched the iPhone to great critical 
          acclaim and financial success. In August 2011, Jobs resigned as CEO due to health complications, and Tim Cook became the new CEO. 
          Two months later, Jobs died, marking the end of an era for the company. 
      """

questions = ["Who founded Apple?",
             "Where did Apple lose market share?"]

Let's see how the model does...

In [None]:
def answer_question(questions, text):
  """Print the model's answer to each question in `questions`, using `text` as the context."""
  for question in questions:
    # Encode the question and the text together as one input
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    # Get the input ids (numbers) for this question/text pair
    input_ids = inputs["input_ids"].numpy()[0]
    # Run the fine-tuned model to get the start and end logits (raw scores) for every token
    output = qa_model(inputs)

    # Get the most likely start and end positions (+1 so the slice includes the end token)
    answer_start = tf.argmax(output.start_logits, axis=1).numpy()[0]
    answer_end = (tf.argmax(output.end_logits, axis=1) + 1).numpy()[0]
    # Convert the ids between the start and end positions back into tokens, then into a string
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print("Question: {} \nAnswer: {}".format(question, answer))

answer_question(questions, text)

## Play around with the model yourself
You can edit the text and the questions below to see how the model answers your own questions about your own piece of text. Make sure to keep the format as it is below, though: three quotation marks """ on either side of your **text**, and each of your **questions** in double quotes inside the square brackets [ ].

In [None]:
text = r"""Enter in the text you want the AI to find answers in here """
questions = ["Type question 1 here",
             "Type question 2 here or leave blank"]

answer_question(questions, text)