# Natural Language Processing

## What is NLP?

NLP is a branch of AI that gives machines the ability to process, understand and derive meaning from human languages. It combines the fields of linguistics and computer science to decipher language from **text** and **speech**.

NLP techniques aim to build a bridge between humans and technology with the help of language modelling.

`Language modelling` is the core approach behind Natural Language Processing. This **statistical approach** analyzes patterns in human language with the aim of predicting the next word.

A `language model` outputs a **probability distribution over words or word sequences**. A language model gives the probability that a certain word (or sequence of words) is "valid".
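To make the idea of a probability distribution over the next word concrete, here is a minimal sketch of a bigram model built from counts. The three-sentence corpus and the helper `next_word_distribution` are made up for illustration; real language models are trained on far more text and use more elaborate methods.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus, already lower-cased and split on whitespace.
corpus = [
    "i love this pizza",
    "i love this rice",
    "i like this pizza",
]

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev_word, next_word in zip(tokens, tokens[1:]):
        following[prev_word][next_word] += 1


def next_word_distribution(prev_word):
    """Return P(next word | prev_word) estimated from the bigram counts."""
    counts = following[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(next_word_distribution("this"))
print(next_word_distribution("i"))
```

After "this", the model assigns a higher probability to "pizza" than to "rice", simply because that continuation was seen more often.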

In other words, and more future-oriented:

NLP is a technique that enables ML algorithms to interpret and to understand the way humans communicate.

## NLP tasks

With the help of NLP techniques we can perform various tasks:

- sentiment analysis
- text classification & clustering (e.g. spam detection, topic analysis)
- chatbots and virtual assistants
- translation
- text summarisation
- auto-correction of text

## NLP libraries

The two most popular NLP libraries in Python are:
- `nltk` - Natural Language Toolkit
- `spaCy` - NLP library written in Python & Cython (C-extension for Python)

Origins:

- `nltk` was built by scholars and researchers whose aim was to create complex NLP functions. It laid the foundation for many NLP algorithms.
- `spaCy` is more often used in production, for business purposes.

Pros/Cons:

- `spaCy` - more modern, cleaner API, support for parallel processing, good documentation, faster and more efficient.

- `nltk` - more tools, more packages - could be useful for a specific/niche task.

Other interesting NLP libraries:

- `TextBlob` - easier to use than `nltk`, good for sentiment analysis; older releases came with a wrapper around the Google Translate API (removed in newer versions)
- Stanford `CoreNLP` - rich in features, very efficient and popular in production
- `Gensim` - developed in the Czech Republic, powerful and scalable library
- `vaderSentiment` - lexicon and rule-based sentiment analysis tool
- `flair` - simple framework for NLP projects developed at Humboldt University of Berlin

### Fundamental processing steps/skills in NLP

In order to derive meaning from text and get a desired response, we need to follow specific steps. These are the core skills & steps involved in Natural Language Processing:

**1. Pre-processing steps** - the text that we get to work with is almost never 'clean' and ready to be processed. Typical pre-processing steps in NLP are:
    
- Cleaning data from irrelevant information
- Removing stop words
- Stemming
- Lemmatization

Note that, depending on your NLP approach, it might not be necessary to perform the above steps because:
- they might not fit your modelling objective (e.g. stopwords might be important for your analysis)
- or because the numerical word representation approach (see step No. 3) does it automatically for you.

**2. Tokenizing** - breaking down the text corpus into single words. Single words and their relationships are at the core of NLP.

**3. Assigning numerical representation to words (tokens)**

- Bag-of-words approach
- Word Embeddings approach

**4. Choosing & applying appropriate ML algorithm for a given NLP task**

**Other useful skills and techniques:**

- Part of Speech Tagging (POS Tagging)
- Named Entity Tagging
- REGEX

Performing NLP tasks is virtually impossible without knowledge of `regex` - a sequence of characters that specifies a search pattern in text. Let's dive into REGEX as a warm-up before doing NLP.

# NLP skills & techniques
## Regex

Knowledge of Regex is a prerequisite for working with textual data.

Regular expressions are essentially a small programming language embedded inside Python.

REGEX answers the questions:
- "Does this piece of text match the pattern?"
- "Is there a match for the pattern anywhere in the string?"

#### Simplest Regex case - match characters

`re` is a built-in Python module. It has four key methods used to look for a pattern in a string:
- `.finditer()`
- `.match()`
- `.findall()`
- `.search()`

and two methods to split or modify the string:

- `.split()`
- `.sub()`
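The two string-modifying methods can be sketched on a short string as follows. The variable names `parts` and `replaced` are only illustrative.

```python
import re

text = "123abc456789abc123ABC"

# .split() cuts the string wherever the pattern matches.
parts = re.split(r"abc", text)
print(parts)

# .sub() replaces every match of the pattern with the replacement string.
replaced = re.sub(r"abc", "-", text)
print(replaced)
```

Both calls leave the original string untouched and return a new object: a list of pieces for `.split()` and a new string for `.sub()`.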

In [13]:
import re

test = "123abc456789abc123ABC"

#Let's look for 'abc' in the above text

pattern = re.compile(r"abc") 
matches = pattern.finditer(test)

#matches = re.finditer(r"abc", test)

for match in matches:
    print(match)
    
#notice that regular expressions are case-sensitive by default: 'ABC' is not matched

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


> ### TASK

Apply the .match() and .search() methods on pattern. What is the output and what does it mean? *You do not need a for loop for those two functions.*

In [14]:
#.match() method determines whether the pattern matches at the beginning of the string
#.search() method looks for the first occurrence of the pattern in a string
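The difference between the lookup methods can be sketched on a different, made-up sentence (so the task above is still yours to solve):

```python
import re

sentence = "rice and pizza, pizza and rice"
food_pattern = re.compile(r"pizza")

# .match() only succeeds if the pattern matches at position 0.
at_start = food_pattern.match(sentence)
print(at_start)

# .search() scans the whole string and returns the first match object.
first = food_pattern.search(sentence)
print(first.span(), first.group())

# .findall() returns all matching substrings as a plain list.
all_hits = food_pattern.findall(sentence)
print(all_hits)
```

Because the sentence starts with "rice", `.match()` returns `None`, while `.search()` still finds the first "pizza" further into the string.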

#### `raw` vs. regular `string`

The pattern above, `r"abc"`, is a raw string. A `raw` string has no escape sequences: it treats every character, including the backslash, as a literal character.

Notice the difference between `raw` and just `string`.

**regular `string`**

In [15]:
a ='Hello\nMr\nProfessor'

In [16]:
print(a)

Hello
Mr
Professor


**`raw` string**

In [17]:
b = r"Hello\nMr\nProfessor"
print(b)

Hello\nMr\nProfessor


The `r` prefix changes how the string literal is interpreted. Without the `r`, backslashes are treated as **escape characters**. With `r`, backslashes are treated as literal characters. The type stays the same: both are `str` objects.

For regex patterns it is good practice to always use the raw version of a string, so that every backslash reaches the regex engine literally.
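Here is a small sketch of why this matters for regex patterns. It uses the word-boundary token `\b`, which is not covered elsewhere in this notebook and only serves as an example.

```python
import re

regular = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash followed by the letter n
print(len(regular), len(raw))

# In a regular string "\b" is turned into a backspace character before the
# regex engine sees it, so the word-boundary token is lost.
without_raw = re.findall("\bpizza\b", "pizza party")
# The raw string passes the backslashes through unchanged.
with_raw = re.findall(r"\bpizza\b", "pizza party")
print(without_raw, with_raw)
```

The first pattern finds nothing, because it is looking for literal backspace characters around "pizza".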

#### Escape characters
#### What is the point of escape characters?

Escape characters: a backslash (`\`) followed by the desired character is a way to put otherwise illegal characters into a string, or to indicate special characters like a new line or a tab. For example, `"` is an illegal character inside a string that is itself delimited by double quotes:

In [18]:
txt = "My friends used to call her "Lola" when she was young"
print(txt)

SyntaxError: invalid syntax (1686009549.py, line 1)

In order to use `"` inside a double-quoted string you need to escape it.

In [19]:
txt = "My friends used to call her \"Lola\" when she was young"
print(txt)

My friends used to call her "Lola" when she was young


Popular escaped characters:

- `\n` - new line
- `\t` - tab
- `\\` - single backslash
- `\"` - quotation mark

> ### TASK

1. Assign the string *"My friends used to call her "Lola" when she was young"* to a new variable called "new_string".
2. Search for the "Lola" pattern in this string - the quotation marks `"` need to be included in your output. 
Experiment with the finditer(), findall() and search() methods.

#### Meta characters in regular expressions

Metacharacters are the building blocks of RegEx. Characters in RegEx are understood to be either a metacharacter that has a **special meaning** or just a regular character with a literal meaning.

- `\` Marks the next character as either a special character or a literal. For example, `n` matches the character n, whereas `\n` matches a newline character. The sequence `\\` matches `\` and `\(` matches `(`.
- `^` Matches the beginning of input.
- `$` Matches the end of input.
- `*` Matches the preceding character zero or more times. For example, `zo*` matches z, zo or zoo.
- `+` Matches the preceding character one or more times. For example, `zo+` matches zo and zoo, but not z.
- `?` Matches the preceding character zero or one time. For example, `a?ve?` matches the ve in never.
- `.` Matches any character (except a newline character).

In [20]:
a ="hello world, hello world"
re.findall(".", a)

['h',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd',
 ',',
 ' ',
 'h',
 'e',
 'l',
 'l',
 'o',
 ' ',
 'w',
 'o',
 'r',
 'l',
 'd']

In [21]:
a ="hello world, hello world"
re.findall("hello", a)

['hello', 'hello']

In [22]:
a ="hello world, hello world"
re.findall("^hello", a)

['hello']
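The quantifiers and the end anchor from the list above can be tried out in the same way. This is a small sketch; `re.fullmatch()` is used here (it is not among the methods listed earlier) because it only succeeds when the whole string matches the pattern.

```python
import re

# For each candidate: does it fully match zo* and zo+ ?
quantifier_results = {}
for candidate in ["z", "zo", "zoo"]:
    star = bool(re.fullmatch(r"zo*", candidate))
    plus = bool(re.fullmatch(r"zo+", candidate))
    quantifier_results[candidate] = (star, plus)
print(quantifier_results)

# '?' makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color and colour")
print(optional_u)

# '$' anchors the match at the end of the string.
at_end = re.findall(r"world$", "hello world, hello world")
print(at_end)

# To match a metacharacter literally, escape it with a backslash.
literal_dots = re.findall(r"\.", "e.g. pizza")
print(literal_dots)
```

Only the last "world" is returned by the `$` pattern, mirroring what `^hello` did for the beginning of the string.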

#### What do we need `regex` for in data analysis?

Three key pandas string methods that leverage regex in data analysis and data cleaning:
- `str.contains()`
- `str.extract()`
- `str.replace()`

In [23]:
import pandas as pd

In [24]:
df = pd.DataFrame({'name' : ["Alicia Johnson", "Robert Patt\\Robinson", "Sean Pean", "Alex Patt"], 
                   'birthday' : ["28 May 2000", "13 June 1967", "09 March 2005", "14 May 1999"]})
df.head()

|   | name | birthday |
|---|------|----------|
| 0 | Alicia Johnson | 28 May 2000 |
| 1 | Robert Patt\Robinson | 13 June 1967 |
| 2 | Sean Pean | 09 March 2005 |
| 3 | Alex Patt | 14 May 1999 |


**`str.contains()`**

Return the people whose name contains "Patt":

In [25]:
df['name'].str.contains('Patt')

0    False
1     True
2    False
3     True
Name: name, dtype: bool

In [28]:
df[df['name'].str.contains('Patt')]

|   | name | birthday |
|---|------|----------|
| 1 | Robert Patt\Robinson | 13 June 1967 |
| 3 | Alex Patt | 14 May 1999 |


**`str.replace()`**

In [33]:
df['name'][1]

'Robert Patt\\Robinson'

In [31]:
df['new_name'] = df['name'].str.replace(r"\\", " ", regex = True)
df.head()

|   | name | birthday | new_name |
|---|------|----------|----------|
| 0 | Alicia Johnson | 28 May 2000 | Alicia Johnson |
| 1 | Robert Patt\Robinson | 13 June 1967 | Robert Patt Robinson |
| 2 | Sean Pean | 09 March 2005 | Sean Pean |
| 3 | Alex Patt | 14 May 1999 | Alex Patt |


**`str.extract()`**

Create a new column that contains only the year:

In [32]:
df['birth_year'] = df['birthday'].str.extract(r'(.{4}$)', expand=False).str.strip()
df.head()

|   | name | birthday | new_name | birth_year |
|---|------|----------|----------|------------|
| 0 | Alicia Johnson | 28 May 2000 | Alicia Johnson | 2000 |
| 1 | Robert Patt\Robinson | 13 June 1967 | Robert Patt Robinson | 1967 |
| 2 | Sean Pean | 09 March 2005 | Sean Pean | 2005 |
| 3 | Alex Patt | 14 May 1999 | Alex Patt | 1999 |
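The pattern `(.{4}$)` captures any four characters at the end of the string. The sketch below compares it with a stricter alternative using only the `re` module. The stricter pattern `(\d{4})$` and the value `"28 May unknown"` are illustrative additions, not part of the dataset above; the same pattern strings can also be passed to `str.extract()`.

```python
import re

birthdays = ["28 May 2000", "13 June 1967", "09 March 2005", "14 May 1999"]

# (.{4}$) captures ANY four characters at the end of the string.
loose = re.compile(r"(.{4}$)")
# (\d{4})$ captures four digits only, so malformed values are not extracted.
strict = re.compile(r"(\d{4})$")

years = [strict.search(birthday).group(1) for birthday in birthdays]
print(years)

# A value without a year shows the difference between the two patterns.
bad_value = "28 May unknown"
loose_result = loose.search(bad_value).group(1)
strict_result = strict.search(bad_value)
print(loose_result, strict_result)
```

On clean data both patterns give the same years; on the malformed value the loose pattern silently returns the last four letters, while the strict one returns no match.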


> ### TASK

Create a new column - "day" and extract only the day from the "birthday" column

This is just a basic regex tutorial. Check regex cheat sheets and experiment with regex using this tool: https://regex101.com/

# Text classification task

Let's get back to the core and get to know standard NLP techniques.

We will first explore a simple text classification task and get familiar with the overall handling of text in NLP and with a suitable algorithm.

The core unit in NLP is the "word" - that is why the first step in NLP is to `tokenize` our text, i.e. to extract single words. After tokenizing, we can choose an approach to represent the words numerically:

## Bag of words approach

Why do we need this approach? ML algorithms work with numerical vectors. Since text is not numerical, we need a way to convert it, and this is where the bag-of-words approach comes in handy: it helps us convert textual data into a numerical representation.

Imagine we would like to create an algorithm that classifies user reviews into two categories:

- reviews about pizza
- reviews about rice

1. "I love this pizza"
2. "The dough is delicious"
3. "I love this sticky rice"
4. "The rice is delicious"

Bag-of-words approach will extract each unique word from all reviews in the following way:

"I love this pizza The dough is delicious sticky rice."

Let's implement bag-of-words with scikit-learn

scikit-learn provides utilities for the most common ways to extract numerical features from text content: https://scikit-learn.org/stable/modules/feature_extraction.html
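Before handing the work to scikit-learn, the extraction step described above can be sketched in plain Python. This is a simplified illustration: it splits on whitespace and keeps the original casing, whereas a library vectorizer typically applies its own tokenization rules.

```python
reviews = [
    "I love this pizza",
    "The dough is delicious",
    "I love this sticky rice",
    "The rice is delicious",
]

# Step 1: collect every unique word, in order of first appearance.
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)
print(vocabulary)

# Step 2: represent each review as a vector of word counts over the vocabulary.
vectors = [[review.split().count(word) for word in vocabulary] for review in reviews]
for review, vector in zip(reviews, vectors):
    print(vector, review)
```

Each review becomes a vector with one position per vocabulary word, which is exactly the kind of numerical input an ML algorithm can work with.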

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

X_train =  ["I love this pizza", "The pizza is not good", "I love this delicious sticky rice", "The rice is very very good"] #in NLP terminology you could call this 'corpus'

#corpus =  ["I love this pizza", "The dough is delicious", "I love this sticky rice", "The rice is delicious"]

#Let's transform the list into a vector representation

vectorizer = CountVectorizer() #tokenizing happens in the background

X_train_vectors = vectorizer.fit_transform(X_train)

In [38]:
print(X_train_vectors[2])

  (0, 3)	1
  (0, 9)	1
  (0, 0)	1
  (0, 7)	1
  (0, 6)	1


In [39]:
print(vectorizer.get_feature_names_out())
print(X_train_vectors.toarray())

['delicious' 'good' 'is' 'love' 'not' 'pizza' 'rice' 'sticky' 'the' 'this'
 'very']
[[0 0 0 1 0 1 0 0 0 1 0]
 [0 1 1 0 1 1 0 0 1 0 0]
 [1 0 0 1 0 0 1 1 0 1 0]
 [0 1 1 0 0 0 1 0 1 0 2]]


In [39]:
#Let's make a dataframe - Just for visual representation

bow_df = pd.DataFrame(X_train_vectors.toarray(), columns = vectorizer.get_feature_names_out())
bow_df.insert(0, 'review_text', X_train)
bow_df

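|   | review_text | delicious | good | is | love | not | pizza | rice | sticky | the | this | very |
|---|-------------|-----------|------|----|------|-----|-------|------|--------|-----|------|------|
| 0 | I love this pizza | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | The pizza is not good | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | I love this delicious sticky rice | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | The rice is very very good | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |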


By default, **CountVectorizer** is non-binary: it counts occurrences, so 'very' is assigned the value 2 in the last review.

**Vectorization** is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (`tokenization`, `counting` and `normalization`) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely **ignoring the relative position of the words in the document**.
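The fact that word order is ignored can be seen with two made-up sentences that use the same words in a different order. This sketch uses `collections.Counter` as a stand-in for the count vector.

```python
from collections import Counter

first_sentence = "the dog bites the man"
second_sentence = "the man bites the dog"

# A bag of words only records how often each word occurs.
bag_first = Counter(first_sentence.split())
bag_second = Counter(second_sentence.split())

print(bag_first)
same_bag = bag_first == bag_second
print(same_bag)
```

Both sentences end up with an identical representation, even though their meanings differ.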

### Train your first NLP model

NLP is a technique that enables ML algorithms to interpret and to understand the way humans communicate.

Remember that in order to perform a given ML task (e.g. sentiment classification or text classification), `ML algorithms` process `data`, then create an `ML model`, which is applied to an `unseen dataset` to generate a `result`.

![algorithm-Copy1](algorithm-Copy1.png)

Assign labels to your sentences in order to train the model. Let's first visualise it in a dataframe:

In [41]:
y_train = ["pizza","pizza","rice","rice"]
bow_df["label"] = y_train
bow_df

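|   | review_text | delicious | good | is | love | not | pizza | rice | sticky | the | this | very | label |
|---|-------------|-----------|------|----|------|-----|-------|------|--------|-----|------|------|-------|
| 0 | I love this pizza | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | pizza |
| 1 | The pizza is not good | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | pizza |
| 2 | I love this delicious sticky rice | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | rice |
| 3 | The rice is very very good | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 | rice |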


We are about to train our model. Which of the columns are features and which are labels?

Let's build a simple classifier to decide which review belongs to which category - pizza or rice. We will use Support Vector Machines, namely Support Vector Classification:

### Support Vector Classification

How does this ML algorithm work? 

SVMs find a separating line (or hyperplane) between the data points of two classes. The `SVM algorithm` takes the data as input and finds the line that separates the two categories.

Let's start with a two-dimensional problem. Suppose you have two categories and you want to correctly classify a new observation that will appear on the chart. With SVM you are looking for a line - the Optimal Hyperplane - that has the largest margin (the Maximum Margin) between the two categories.

![svm](svm.png)

If your data does not allow you to draw a clear line because your data points overlap or are not linearly separable, you can still use the Kernel Trick. Details: https://towardsdatascience.com/svm-feature-selection-and-kernels-840781cc1a6c
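Once trained, a linear SVM in two dimensions boils down to a weight vector and a bias that define the separating line. The sketch below shows how such a line classifies points and how wide its margin is. The values of `w` and `b` are made up for illustration and were not learned from any data; the class names are placeholders.

```python
import math

# Hypothetical parameters of an already-trained linear SVM in two dimensions.
# The separating line is w[0]*x1 + w[1]*x2 + b = 0.
w = (1.0, 1.0)
b = -3.0


def decision_function(point):
    """Signed score: positive on one side of the line, negative on the other."""
    return w[0] * point[0] + w[1] * point[1] + b


def predict(point):
    """Assign a class depending on which side of the line the point lies."""
    return "class A" if decision_function(point) >= 0 else "class B"


# Distance between the two margin lines w.x + b = +1 and w.x + b = -1.
margin_width = 2 / math.hypot(*w)

for point in [(3.0, 2.0), (0.5, 1.0)]:
    print(point, decision_function(point), predict(point))
print(round(margin_width, 3))
```

Training an SVM means searching for the `w` and `b` that make this margin as wide as possible while keeping the two classes on opposite sides.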

In [42]:
from sklearn import svm

clf_svm = svm.SVC(kernel = 'linear', probability = True)
clf_svm.fit(X_train_vectors, y_train)

Let's predict a label for a new sentence:

In [43]:
X_test = vectorizer.transform(["This pizza is great"]) # Try with "this pizza is delicious"
clf_svm.predict(X_test)

array(['pizza'], dtype='<U5')

In [44]:
clf_svm.predict_proba(X_test)[0]

array([0.55565672, 0.44434328])

#### Bag of n-grams approach

In some cases you might want to capture and count more than just single tokens. We can use n-grams, meaning expressions that consist of n consecutive words. Let's do bag of words again, but this time with uni-grams and bi-grams (`ngram_range = (1,2)`).
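What counts as an n-gram can be sketched with a small helper. The function `ngrams` is a hypothetical illustration that splits on whitespace; it is not how a library vectorizer is implemented.

```python
def ngrams(text, n):
    """Return all runs of n consecutive lower-cased words in text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The rice is very very good"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
print(unigrams)
print(bigrams)
```

Bi-grams such as "very good" keep a little of the word order that single tokens lose.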

In [45]:
vectorizer_ngram = CountVectorizer(ngram_range = (1,2))

X_train_vectors_ngram = vectorizer_ngram.fit_transform(X_train)

#for comparison: the uni-gram vocabulary and counts from the first vectorizer
print(vectorizer.get_feature_names_out())
print(X_train_vectors.toarray())

bow_df = pd.DataFrame(X_train_vectors_ngram.toarray(), columns = vectorizer_ngram.get_feature_names_out())
bow_df.insert(0, 'review_text', X_train)
bow_df

['delicious' 'good' 'is' 'love' 'not' 'pizza' 'rice' 'sticky' 'the' 'this'
 'very']
[[0 0 0 1 0 1 0 0 0 1 0]
 [0 1 1 0 1 1 0 0 1 0 0]
 [1 0 0 1 0 0 1 1 0 1 0]
 [0 1 1 0 0 0 1 0 1 0 2]]


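|   | review_text | delicious | delicious sticky | good | is | is not | is very | love | love this | not | ... | sticky rice | the | the pizza | the rice | this | this delicious | this pizza | very | very good | very very |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I love this pizza | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | The pizza is not good | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | I love this delicious sticky rice | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | The rice is very very good | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 1 | 1 |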


#### Limitation of bag of words approach

If a model hasn't seen a word before, it won't be able to link it to the correct category. The only information we provide is the words and their occurrences in the text.

In [46]:
X_test = vectorizer.transform(["Those pizzas' dough is very good"]) # 'pizzas' and 'dough' are not in the training vocabulary
clf_svm.predict(X_test)

array(['rice'], dtype='<U5')

In [47]:
clf_svm.predict_proba(X_test)[0]

array([0.4791528, 0.5208472])

In order to make the Bag of Words approach more accurate, you can apply a TF-IDF transformation. Additional materials:

- SVM explained: https://www.youtube.com/watch?v=efR1C6CvhmE&ab_channel=StatQuestwithJoshStarmer
- TFIDF in sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- TFIDF explained: https://monkeylearn.com/blog/what-is-tf-idf/
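The idea behind TF-IDF can be sketched by hand. This uses the plain textbook formula (term frequency times the logarithm of the inverse document frequency) on a lower-cased copy of the training sentences. It is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

documents = [
    "i love this pizza",
    "the pizza is not good",
    "i love this delicious sticky rice",
    "the rice is very very good",
]
tokenized = [document.split() for document in documents]


def tf(term, tokens):
    """Term frequency: share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / containing)


def tfidf(term, tokens):
    """Weight of `term` in one document."""
    return tf(term, tokens) * idf(term)


last_document = tokenized[3]
for term in ["very", "rice", "good"]:
    print(term, round(tfidf(term, last_document), 3))
```

A word that is frequent in one document but rare across the corpus ("very") gets a higher weight than words that appear in several documents ("rice", "good").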

### Problem of Sparsity
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total, while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
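The idea of a sparse representation can be sketched in plain Python by storing only the non-zero cells of the count matrix from the CountVectorizer example. The dictionary layout below is a simplified illustration of the concept, not how `scipy.sparse` is implemented internally.

```python
# Dense bag-of-words matrix from the CountVectorizer example (4 reviews x 11 words).
dense = [
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 2],
]

# Sparse form: keep only the non-zero cells, keyed by (row, column).
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
zero_share = 1 - len(sparse) / total_cells
print(len(sparse), total_cells, round(zero_share, 2))
print(sparse[(3, 10)])
```

Even in this tiny example more than half of the cells are zeros; with a realistic vocabulary the share of zeros grows much larger, which is why only the non-zero entries are stored.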

## Word Embeddings

The Word Embeddings approach converts text into a numerical representation and aims to capture the **semantic meaning of a word**.

Word embeddings are **word vector representations** in which words with similar semantic meaning have similar representation. Word vectors are one of the most efficient ways to represent words.

In general, word vectors are much better at representing words than one-hot encoded vectors. In the bag of words approach, the vectors for “pizza” and “dough” are exactly as far apart as "pizza" and "forest" - the words in Bag of Words are completely isolated entities.

Check the following examples:

- " The **pizza** is made to perfection"
- " **Pizza**'s **dough** is very delicious"
- "The **dough** should have stayed longer in the **oven**"

In word embeddings, if pizza and dough are related and dough and oven are related, then pizza and oven will also be related (greatly simplified, this is what happens in the word embeddings approach).
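The contrast between isolated one-hot vectors and embeddings can be sketched with cosine similarity. The three-dimensional embedding values below are made up for illustration and do not come from a real model; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors: every word gets its own axis.
one_hot = {"pizza": [1, 0, 0], "dough": [0, 1, 0], "forest": [0, 0, 1]}

# Hypothetical embeddings: related words point in similar directions.
embedding = {
    "pizza": [0.9, 0.8, 0.1],
    "dough": [0.8, 0.9, 0.2],
    "forest": [0.1, 0.2, 0.9],
}

for name, vectors in [("one-hot", one_hot), ("embedding", embedding)]:
    pizza_dough = cosine_similarity(vectors["pizza"], vectors["dough"])
    pizza_forest = cosine_similarity(vectors["pizza"], vectors["forest"])
    print(name, round(pizza_dough, 2), round(pizza_forest, 2))
```

With one-hot vectors every pair of different words has similarity 0, while the embedding places "pizza" much closer to "dough" than to "forest".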

Let's use spaCy library to generate word vectors: https://spacy.io/usage/linguistic-features#vectors-similarity

We will not train word vectors from scratch; we can take advantage of existing word embeddings.

Two main training methods for word embeddings (both used by word2vec):

- continuous bag of words (CBOW)
- skip-gram

If you are interested in how Word Embeddings are computed, check this article out: https://medium.com/analytics-vidhya/word-embeddings-in-nlp-word2vec-glove-fasttext-24d4d4286a73
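As a rough intuition for the skip-gram method: the model learns to predict the surrounding (context) words from a centre word, while continuous bag of words goes the other way round. The sketch below only generates the training pairs for skip-gram; the helper `skipgram_pairs` and the window size are illustrative assumptions, and the actual training of the vectors is not shown.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the dough is delicious".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that keep appearing in each other's context end up with similar vectors, which is how relatedness is captured.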

In [48]:
#!pip3 install spacy

In [49]:
#!python3 -m spacy download en_core_web_md

In [48]:
import spacy

nlp = spacy.load("en_core_web_md") #Save Language Class in nlp variable (nlp is a convention)

In [49]:
doc = nlp("This pizza is delicious") #nlp class processes the text and saves it to doc object

Now we can access various elements of the doc object. The doc object owns the sequence of tokens and all their annotations. Let's check the vector for "pizza", the token at index 1:

In [50]:
doc[1].vector

array([-6.7467e-01, -1.8932e-01,  5.8039e-01, -2.2462e-01,  9.0182e-02,
        1.0529e+00, -6.3681e-01,  1.6173e-01,  9.5415e-01,  7.0889e-01,
       -5.7771e-01,  6.4541e-01, -5.5105e-01, -4.7180e-01,  2.5780e-01,
       -5.1330e-01,  1.1842e-01,  1.3074e+00,  5.7581e-02, -4.5192e-02,
        4.2439e-01, -3.8203e-01,  9.4305e-02, -6.0597e-02, -3.8719e-01,
       -6.3133e-01, -1.8201e-01,  2.3236e-01,  4.9453e-01, -1.1618e+00,
        2.7373e-01,  5.3102e-01,  5.3476e-01, -7.9732e-01,  4.2004e-02,
        2.2655e-01,  1.6060e-02,  2.8268e-02,  5.4367e-01,  8.4875e-01,
        1.2247e-01, -1.2477e-01,  3.2317e-02, -9.0382e-02,  1.6604e-01,
        7.4026e-01,  5.3353e-02,  5.4560e-01,  3.2068e-01,  1.8807e-01,
        1.8679e-01, -3.1118e-01,  1.4894e-01,  4.0011e-01,  2.3684e-01,
       -7.6084e-01,  3.9626e-01, -9.1346e-03,  1.3314e-01,  4.6786e-01,
        1.6860e-01, -3.7035e-01, -2.0765e-01,  1.5869e-01, -1.5605e-01,
       -4.1760e-01,  4.1834e-01,  2.8982e-01, -3.6611e-01,  5.89

Now, let's compute Word Embeddings for our small X_train dataset. For each sentence, `doc.vector` gives us the average of its token vectors.

In [51]:
print(X_train)

['I love this pizza', 'The pizza is not good', 'I love this delicious sticky rice', 'The rice is very very good']


In [52]:
docs = [ nlp(text) for text in X_train]

#print(docs[0].vector)
docs[0]

I love this pizza

In [53]:
X_train_embeddings = [ doc.vector for doc in docs]

clf_svm_embeddings = svm.SVC(kernel = 'linear', probability = True)
clf_svm_embeddings.fit(X_train_embeddings, y_train)

In [54]:
X_test = ["Those pizzas dough is very good"]
test_docs = [nlp(text) for text in X_test]
test_embeddings = [doc.vector for doc in test_docs]

clf_svm_embeddings.predict(test_embeddings)

array(['pizza'], dtype='<U5')

In [55]:
clf_svm_embeddings.predict_proba(test_embeddings)[0]

array([0.51250099, 0.48749901])

Word embeddings drawbacks:

- words with multiple meanings are conflated into a single representation (bank - a financial institution or a place to sit; light - bright in colour, lightweight, or easy)

- when averaging, outliers and extreme values can distort the sentence vector. Word vectors might therefore not always be the best choice for classification.

## Let's explore NLP pre-processing techniques

- Removing stop words
- Stemming
- Lemmatization
- POS tagging

Remember that oftentimes you do not need to perform these steps yourself, either because they are not needed or because they happen under the hood of the algorithm.

### Define stop words

In some cases, we might not be interested in including some words in our analysis. We define a list of unwanted words and call the list stopwords. Example stopwords are: its, an, the, for, and, that, you.

> Before defining & removing stopwords, you need to tokenize your sentence

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#the tokenizer and the stopword list need NLTK data that is downloaded separately with nltk.download()

#load the pre-existing list of English stopwords that ships with nltk
stop_words = stopwords.words('english')

text = "Let's check which stopwords will be removed from this sentence"

words = word_tokenize(text)

#keep only the tokens that are not in the stopword list
stripped_text = []
for word in words:
  if word not in stop_words:
    stripped_text.append(word)

" ".join(stripped_text)

### Stemming

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

text = "Working in difficult weather conditions can be very stressful"
words = word_tokenize(text)

stemmed_words = []
for word in words:
  stemmed_words.append(stemmer.stem(word))

" ".join(stemmed_words)

### Lemmatizing

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text = "Working in difficult weather conditions can be very stressful"
words = word_tokenize(text)

#pos='v' tells the lemmatizer to treat every token as a verb (the default is noun)
lemmatized_words = []
for word in words:
  lemmatized_words.append(lemmatizer.lemmatize(word, pos='v'))

" ".join(lemmatized_words)

### POS tagging

In [None]:
#!pip3 install textblob

In [None]:
#!python3 -m textblob.download_corpora

In [57]:
from textblob import TextBlob

text = "My pizza was overcooked"

tb_text = TextBlob(text)

tb_text.tags

[('My', 'PRP$'), ('pizza', 'NN'), ('was', 'VBD'), ('overcooked', 'VBN')]

## Recurrent Neural Network

We haven't discussed Neural Networks in this course, but for completeness we need to highlight that Recurrent Neural Networks (a type of Neural Network) were for a long time the state-of-the-art approach to solving NLP tasks.

A Neural Network, similarly to classic ML algorithms, is a collection of algorithms that aims at uncovering relationships in data. The algorithmic process is inspired by the functioning of neurons in the human brain.

Recurrent Neural Networks are well suited for:

- text classification
- text generation

RNNs are well suited for solving problems where the sequence is more important than the individual items themselves.

## Transformers

There are many different Neural Networks that are successful at solving NLP tasks. One particular architecture, the Transformer, has proven to be especially effective for many NLP tasks.

The Transformer is a sequence-to-sequence architecture.

You can check example Google Colab notebooks and play around yourself: https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-transformers.ipynb



