<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/document_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document representation

In this *notebook* we will review different techniques for transforming text into numerical vectors, such as the TF-IDF weighting scheme and word embeddings.

## Initial setup

In [None]:
# Install the Transformers library

!pip install transformers[sentencepiece]

In [None]:
# Import the required libraries

import gensim  # Word embedding models
import gensim.downloader  # Download pre-trained word embedding models
from gensim.models import KeyedVectors  # Load pre-trained word embedding models
import matplotlib.pyplot as plt  # Display word clouds
import nltk  # NLP library
from nltk.stem.porter import PorterStemmer  # Stemmer tool
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  # Term by document matrix with TF
from sklearn.feature_extraction.text import TfidfVectorizer  # Term by document matrix with TF-IDF
import spacy  # NLP library
from transformers import pipeline  # Transformer models

# Install the SpaCy model for English texts
spacy.cli.download('en_core_web_sm')

# Load the model
nlp = spacy.load('en_core_web_sm')

# Download example text files ('news.txt' and 'alices_adventures_in_wonderland.txt')
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/news.txt
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/alices_adventures_in_wonderland.txt

## N-gram extraction


In [None]:
# Extract bigrams and trigrams from text

with open('news.txt') as file:
    content = file.read()

list_bigrams = nltk.ngrams(content.split(), 2)  # split() the sentence into a list of words
list_trigrams = nltk.ngrams(content.split(), 3)

print('---------')
print('Bigrams:')
print('---------')
for bigram in list_bigrams:
    print(bigram)

print('----------')
print('Trigrams:')
print('----------')
for trigram in list_trigrams:
    print(trigram)

In [None]:
# The previous approach does not consider sentence boundaries
# We can read the file line by line and extract n-grams for each line separately

with open('news.txt') as file:
    content = file.readlines()  # Get a list of lines

# Remove empty lines, blanks and new line characters
content = [line.strip() for line in content if line.strip()]

for line in content:
    trigrams = nltk.ngrams(line.split(), 3)  # Extract 3-grams for each line
    for trigram in trigrams:
        print(trigram)
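
To make the sliding-window idea behind n-gram extraction explicit, here is a minimal sketch in plain Python. The helper `extract_ngrams` and the two example lines are hypothetical stand-ins (they are not part of NLTK or of 'news.txt'); the sketch only illustrates why processing each line separately keeps n-grams from crossing line boundaries.

```python
from typing import Iterator, Sequence, Tuple


def extract_ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens as a tuple."""
    # zip over n shifted views of the token list; zip stops at the shortest view,
    # so a sequence with fewer than n tokens yields nothing
    return zip(*(tokens[i:] for i in range(n)))


# Hypothetical stand-in for the lines of 'news.txt'
lines = ['the cat sat on the mat', 'dogs bark']

for line in lines:
    # Each line is processed on its own, so no trigram spans two lines
    for trigram in extract_ngrams(line.split(), 3):
        print(trigram)
```

The second line has only two tokens, so it contributes no trigrams at all.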

### Exercise

In [None]:
# Repeat the analysis on 'alices_adventures_in_wonderland.txt', obtaining also 4-grams and 5-grams in addition to bigrams and trigrams
# Use the first procedure (no need to consider sentence boundaries)


## Normalisation / pre-processing

In [None]:
# Remove punctuation, lowercase, remove stopwords and get the stem of the words

text = 'The Netherlands earned sweet revenge on Spain on Friday at the Fonte Nova in Salvador, hammering Spain 5-1 to put an emphatic coda on their loss in the 2010 World Cup finals.'

document = nlp(text)  # Process the text with SpaCy

document = [token for token in document if not token.is_punct]  # Remove punctuation
print('No punctuation: ' + str(document))

document = [token for token in document if not token.is_stop]  # Remove stopwords
print('No stopwords: ' + str(document))

document = [token.lower_ for token in document]  # Lowercase
print('Lowercased: ' + str(document))

stemmer = PorterStemmer()
document = [stemmer.stem(token) for token in document]  # Stem of the words
print('Stems: ' + str(document))
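
As a rough illustration of what the first normalisation steps do, the following sketch reproduces lowercasing, punctuation removal and stopword removal with the standard library only. The `STOPWORDS` set and the `normalise` helper are assumptions made for this example; SpaCy's stopword list is far larger and its tokeniser is more sophisticated. Stemming is left out on purpose.

```python
import re

# Tiny illustrative stopword list (an assumption; SpaCy's list is much larger)
STOPWORDS = {'the', 'on', 'at', 'in', 'to', 'an', 'their', 'a', 'of'}


def normalise(text: str) -> list:
    """Lowercase the text, drop punctuation and remove stopwords."""
    # Keep runs of word characters, allowing inner hyphens so that '5-1' stays together
    tokens = re.findall(r'\w+(?:-\w+)*', text.lower())
    return [token for token in tokens if token not in STOPWORDS]


print(normalise('The Netherlands earned sweet revenge on Spain on Friday, hammering Spain 5-1.'))
```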

### Exercise

In [None]:
# Repeat the previous analysis on the content of 'alices_adventures_in_wonderland.txt'


## Weighting schemes

In [None]:
# Build the term by document matrix using the TF weighting scheme

corpus = ['I do not like this restaurant', 'I like this restaurant very much', 'I think it is a very very bad place', 'I love this place']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
print(X.shape)

vectorizer2 = CountVectorizer(analyzer = 'word', ngram_range = (2, 2))  # Extract bigrams
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())
print(X2.shape)

In [None]:
# Build the term by document matrix using the TF-IDF weighting scheme

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
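
To see where the TF-IDF numbers come from, the weights can be recomputed by hand. The sketch below assumes scikit-learn's default settings (lowercasing, tokens of at least two characters, smoothed IDF and L2 normalisation of each row); the helper `tfidf_vector` is a name introduced here for illustration only.

```python
import math
import re
from collections import Counter

corpus = [
    'I do not like this restaurant',
    'I like this restaurant very much',
    'I think it is a very very bad place',
    'I love this place',
]

# Lowercase and keep tokens of two or more word characters ('I' and 'a' are dropped)
docs = [re.findall(r'\b\w\w+\b', text.lower()) for text in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(token for doc in docs for token in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_vector(doc):
    """Weight raw term counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]


matrix = [tfidf_vector(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Terms that occur in many documents, such as 'this', receive a lower IDF than terms that occur in a single document, such as 'love'.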

### Exercise

In [None]:
# Get the term by document matrix, using the TF weighting scheme and trigrams on 'news.txt'

## Word embeddings

In [None]:
# Download and load into memory a GloVe word embedding model (300 dimensions) pre-trained on Wikipedia and Gigaword
# This will take a while...

model = gensim.downloader.load('glove-wiki-gigaword-300')

In [None]:
# Show the vector representing a word

model['dog']

In [None]:
# Check the size of the returned vector

len(model['dog'])

In [None]:
# Get the 5 most similar words to a given one 

model.most_similar('desert', topn = 5)
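
Similarity queries of this kind rank words by the cosine similarity between their vectors. The sketch below shows the idea on hand-made 3-dimensional vectors; `toy_places`, `cosine` and `rank_similar` are hypothetical names and values invented for this example, not part of Gensim, and real GloVe vectors have 300 dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration
toy_places = {
    'desert': [0.9, 0.1, 0.0],
    'dunes': [0.8, 0.2, 0.1],
    'sahara': [0.85, 0.05, 0.1],
    'ocean': [0.1, 0.9, 0.2],
    'guitar': [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_similar(word, topn=2):
    """Return the topn other words with the highest cosine similarity to 'word'."""
    scores = [(other, cosine(toy_places[word], vector))
              for other, vector in toy_places.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(rank_similar('desert'))
```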

In [None]:
# Analogy: 'Paris' is to 'France' as 'Madrid' is to... (France - Paris + Madrid)
# The model is lowercased, so we cannot use capitalised tokens

model.most_similar(positive=['madrid', 'france'], negative=['paris'], topn=1)
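
The analogy query boils down to vector arithmetic followed by a nearest-neighbour search. Here is a simplified sketch on invented 3-dimensional vectors; `toy_capitals`, `cosine_sim` and `analogy` are hypothetical, and unlike Gensim this version does not normalise the input vectors before combining them.

```python
import math

# Hypothetical vectors; the axes loosely mean [is a country, French, Spanish]
toy_capitals = {
    'paris': [0.1, 1.0, 0.0],
    'france': [1.0, 1.0, 0.0],
    'madrid': [0.1, 0.0, 1.0],
    'spain': [1.0, 0.0, 1.0],
    'italy': [1.0, 0.1, 0.1],
}


def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, return the nearest words."""
    dims = len(next(iter(toy_capitals.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy_capitals[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_capitals[word])]
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scores = [(word, cosine_sim(target, vector))
              for word, vector in toy_capitals.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(analogy(positive=['madrid', 'france'], negative=['paris']))
```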

In [None]:
# Find the term that does not belong with the others

model.doesnt_match(['wine', 'beer', 'coke', 'whisky'])

In [None]:
# Similarity between words
# Beware of algorithmic bias!!

model.similarity('woman', 'housework')

## Transformers

The 🤗 [Transformers](https://huggingface.co/transformers/) library provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with deep interoperability between JAX, PyTorch and TensorFlow.

More than 30,000 pre-trained [models](https://huggingface.co/models) and 2,000 [datasets](https://huggingface.co/datasets) are available on the Hugging Face website, covering dozens of different tasks in more than 100 languages.

This demo exemplifies the use of [pipelines](https://huggingface.co/transformers/main_classes/pipelines.html). These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, and Question Answering.

The following examples are inspired by the 🤗 Transformers library [course](https://huggingface.co/course/chapter1/3?fw=pt).

### Sentiment analysis
Classify a sentence as expressing positive or negative sentiment.

In [None]:
# Load the sentiment analysis model ('distilbert-base-uncased-finetuned-sst-2-english' by default)

model = pipeline('sentiment-analysis')

In [None]:
# Try it!

model('This is the best course I have ever attended in my life. Praise to David!')
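
The pipeline returns a list with one dictionary per input, holding a `label` and a `score`. For a two-class classifier, the score is typically the softmax probability of the winning class. The sketch below mimics that last step with made-up logits; the `logits` values and the `softmax` helper are assumptions for illustration and do not come from the actual model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['NEGATIVE', 'POSITIVE']
logits = [-3.2, 3.5]  # Hypothetical raw model outputs for one sentence

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])

# Same shape as the pipeline output: a list with one dict per input sentence
result = [{'label': labels[best], 'score': probs[best]}]
print(result)
```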

### Zero-shot classification
Classify text according to a set of given labels.

In [None]:
# Load the zero-shot classification model ('facebook/bart-large-mnli' by default)

model = pipeline('zero-shot-classification')

In [None]:
# Try it!

model('This lecture is about Natural Language Processing', candidate_labels=['education', 'politics', 'business', 'sports'])
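
The zero-shot pipeline returns a dictionary with the keys `sequence`, `labels` and `scores`, where the labels are sorted from most to least likely. A minimal sketch of how such a result could be post-processed is shown below; the scores are invented for illustration and the 0.5 threshold is an arbitrary assumption.

```python
# Hand-written result in the shape returned by the pipeline (scores are hypothetical)
result = {
    'sequence': 'This lecture is about Natural Language Processing',
    'labels': ['education', 'business', 'politics', 'sports'],
    'scores': [0.85, 0.08, 0.04, 0.03],
}

# Pair each label with its score
label_scores = dict(zip(result['labels'], result['scores']))

# The first label is the most likely one; accept it only above a chosen threshold
top_label = result['labels'][0]
THRESHOLD = 0.5  # Arbitrary cut-off, chosen for this example
prediction = top_label if label_scores[top_label] >= THRESHOLD else None

print(prediction, label_scores)
```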

### Text generation
Predict the words that follow a given text prompt, producing a coherent continuation of the context.

In [None]:
# Load the text generation model ('gpt2' by default)

model = pipeline('text-generation')

In [None]:
# Try it! (you will get a different output each time)

model('I opened the door and found')

In [None]:
# Try it tuning some parameters (maximum length generated and number of returned sentences)!

model('The book was amazing', max_length=40, num_return_sequences=3)
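
The output changes from run to run because the next token is sampled from a probability distribution rather than always taking the most likely one. The toy sketch below illustrates sampling with a temperature parameter; `toy_logits` and `sample_next` are hypothetical and far simpler than what the library does internally.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Sample one token; a lower temperature makes the choice more deterministic."""
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # Subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for the token following 'I opened the door and found'
toy_logits = {'a': 2.0, 'nothing': 1.5, 'my': 1.0, 'the': 0.5}

rng = random.Random(0)  # Fixed seed so that the sketch is reproducible
print([sample_next(toy_logits, rng=rng) for _ in range(5)])
print([sample_next(toy_logits, temperature=0.01, rng=rng) for _ in range(5)])
```

With a very low temperature the highest-scoring token is chosen almost every time, while higher temperatures give more varied continuations.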

### Masked language modelling
Mask a token in a sequence with a masking token, and prompt the model to fill that mask with an appropriate token.

In [None]:
# Load the masked language modelling model ('distilroberta-base' by default)

model = pipeline('fill-mask')

In [None]:
# Try it (returning the 'top_k' words)!

model('I <mask> this lecture.', top_k=5)

### Named entity recognition
Classify tokens according to a class (e.g. person, organisation or location).

In [None]:
# Load the named entity recognition model ('dbmdz/bert-large-cased-finetuned-conll03-english' by default)

model = pipeline('ner', aggregation_strategy='simple')  # Group sub-word tokens into whole entities ('grouped_entities' is deprecated)

In [None]:
# Try it!

model('My name is David and I live in Spain.')

### Question answering
Extract an answer from a text given a question.

In [None]:
# Load the question answering model ('distilbert-base-cased-distilled-squad' by default)

model = pipeline('question-answering')

In [None]:
# Try it!

model(question='Where do I work?', context='My name is David and I work really hard at the University of Alicante')
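
Extractive question answering returns a span of the context: the result is a dictionary with the keys `score`, `start`, `end` and `answer`, where `start` and `end` are character offsets into the context. The sketch below builds such a result by hand (the answer and score are assumptions, not real model output) to show how the offsets relate to the text.

```python
context = 'My name is David and I work really hard at the University of Alicante'

# Hand-written result in the shape returned by the pipeline (values are hypothetical)
answer = 'University of Alicante'
start = context.find(answer)
result = {'score': 0.9, 'start': start, 'end': start + len(answer), 'answer': answer}

# The offsets let us recover the answer, or some surrounding text, from the context
span = context[result['start']:result['end']]
window = context[max(0, result['start'] - 7):result['end']]

print(span)
print(window)
```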

### Machine translation
Translate from one language to another.

In [None]:
# Load the machine translation model from ES to EN ('Helsinki-NLP/opus-mt-es-en')
# Try different models changing 'Helsinki-NLP/opus-mt-{src}-{tgt}' (src = source language, tgt = target)

model = pipeline('translation', model='Helsinki-NLP/opus-mt-es-en')

In [None]:
# Try it!

model('Ojalá el próximo año pueda ir a Alicante')

## References

* [Alice's adventures in Wonderland](https://www.gutenberg.org/ebooks/11)
* [Gensim](https://radimrehurek.com/gensim/index.html)
* [Hugging Face](https://huggingface.co/)