<a href="https://colab.research.google.com/github/dgromann/MCMLR_2023W/blob/main/Bonus_Exercise1_MCMLR_2023W.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 1: Multilingual and Crosslingual Methods and Language Resources**

This notebook contains two bonus exercises for the lecture Multilingual and Crosslingual Methods and Language Resources (2023W 340168-1). For each of these, you can obtain a maximum of 3 points that are added to the points of your final exam. The sections where your code should go are marked with 👋 ⚒.

Bonus Exercise 1: Information Extraction with TF-IDF

## **Information Extraction with TF-IDF**

For high-resource languages, supervised approaches represent a viable solution. However, for low-resource languages it is at times necessary to use unsupervised approaches to obtain information from texts. We will work on English here for the sake of understanding the results, but the methods can be applied to any language.

For this exercise, you will implement a mini-example of information extraction, or rather feature extraction, with Term Frequency-Inverse Document Frequency (TF-IDF).

### **Term Frequency-Inverse Document Frequency (TF-IDF)**

In this exercise, you will write a simple implementation of the TF-IDF algorithm and compare it with the one in sklearn. TF-IDF is an effective method for extracting features from text without any supervision. Documents usually contain some terms that occur frequently but are poor features for identifying categories or topics. TF-IDF instead assigns higher values to words that occur frequently in one document but not in all documents. Such words provide more and better information on the potential contents of a document than raw frequency counts. Originally, the technique was used for ranking documents in search engines. Today, it is still used for topic modeling (i.e., identifying the topics of documents automatically), term extraction, etc.

### Toy Example of Documents

We will use the following toy example, where each sentence in the list is considered a document on its own.

In [None]:
docs = ["this is the story behind the red house on the street with twenty houses",
        "a man decided to build his own house on our street, which should change the street",
        "all the houses were painted white and all the neighbors were happy with this",
        "the man thought to himself this is wrong and painted his house red",
        "he went from his house to his neighbor's house and explained",
        "his neighbor thought this is a good idea and painted his house orange",
        "now they thought the houses started to look right for a street named rainbow road"
        ]

Use [spaCy](https://spacy.io/) to preprocess these "documents" in the list `docs`. The following preprocessing steps need to be performed:


1.   Tokenization
2.   POS tagging
3.   Lemmatization

We first need to import spaCy and load the English model.


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

👋 ⚒  Perform the spaCy preprocessing steps described above and remove all tokens whose POS tag appears in the list below.
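
The filtering step can be illustrated without running spaCy. The sketch below uses hand-written (lemma, POS) pairs, which are an assumption for illustration and not actual spaCy output; `tagged_tokens` and `excluded_pos` are hypothetical names. In spaCy itself, each token exposes these two values as `token.lemma_` and `token.pos_`.

```python
# Hand-written (lemma, POS) pairs for illustration; not actual spaCy output.
tagged_tokens = [
    ("the", "DET"), ("red", "ADJ"), ("house", "NOUN"),
    ("on", "ADP"), ("the", "DET"), ("street", "NOUN"),
]

# Hypothetical subset of the tags that should be dropped.
excluded_pos = {"DET", "ADP"}

# Keep only the lemmas whose POS tag is not excluded.
kept_lemmas = [lemma for lemma, pos in tagged_tokens if pos not in excluded_pos]
print(kept_lemmas)  # ['red', 'house', 'street']
```

The result is a plain list of lemmas per document, which is the shape the later frequency counts operate on.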

In [None]:
pos_to_be_removed = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'SPACE']

def preprocess(sentence):
    # Your code should go here
    pass

preprocessed = []
for sentence in docs:
  preprocessed.append(preprocess(sentence))

print("Preprocessed sentences: ", preprocessed)

### Term Frequency (TF)

The term frequency is calculated as the relative frequency of the word in a specific document, that is, the absolute frequency divided by the number of words in the document. For the term $t$ in document $d$, this is the count of the term $n_{t,d}$ divided by the count of all words $\sum_{t'} n_{t',d}$ in the document $d$:

$TF_{t,d} = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}$
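
As a small worked example of this formula, the following sketch computes relative frequencies for one made-up, already preprocessed document. The function name `relative_frequencies` and the token list are assumptions for illustration and are not taken from `docs`.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical, already preprocessed document (not taken from `docs`).
example_doc = ["cat", "sit", "mat", "cat"]
example_tf = relative_frequencies(example_doc)
print(example_tf)  # {'cat': 0.5, 'sit': 0.25, 'mat': 0.25}
```

Because every count is divided by the same total, the values of one document always sum to 1.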


👋 ⚒  Write a function to calculate the Term Frequency (TF) for each word in each document. The result will be a list of term frequencies for each document.

In [None]:
def compute_term_frequency(bag_of_words):
    # Your code should go here
    pass


term_frequencies = []
for doc in preprocessed:
  term_frequencies.append(compute_term_frequency(doc))


### Inverse Document Frequency (IDF)

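The document frequency $df_t$ is the number of documents in which the term $t$ occurs. We again consider the relative document frequency, that is, $df_t$ divided by the total number of documents, written $d$ in the formulas below (here $d$ is a count, not a single document as in the TF formula).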

$DF = \frac{df_t}{d}$

In practice, the inverse of this ratio performs better, especially when it is scaled with a logarithm. This gives us the Inverse Document Frequency (IDF):

$IDF = \log \frac{d}{df_t}$
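
The sketch below applies this formula to a made-up corpus of three tokenized documents. The function name `inverse_document_frequencies` and the corpus are assumptions for illustration, and the natural logarithm (the default of `math.log`) is assumed as the base.

```python
import math


def inverse_document_frequencies(documents):
    """Return log(d / df_t) for every term, where d is the number of documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical corpus (not taken from `docs`).
example_corpus = [["cat", "sit", "mat"], ["cat", "chase", "dog"], ["dog", "bark"]]
example_idf = inverse_document_frequencies(example_corpus)
print(round(example_idf["cat"], 3))   # 0.405, i.e. log(3 / 2)
print(round(example_idf["bark"], 3))  # 1.099, i.e. log(3 / 1)
```

A term that occurs in every document gets $\log 1 = 0$, so it contributes nothing to the final score.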

👋 ⚒  Write a function to calculate the Inverse Document Frequency (IDF). The result will be a list of words with IDF values.

In [None]:
import math

# the logarithm can be calculated with math.log()

def compute_inverse_document_frequency(full_doc_list):
    # Your code should go here
    pass

idf_values = compute_inverse_document_frequency(preprocessed)
print(idf_values)

### TF-IDF

The final value is calculated by multiplying the term frequency $tf_t$ of a term in a document by the term's inverse document frequency:

$TF\text{-}IDF = tf_t \times \log \frac{d}{df_t}$
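
To see how the product behaves, the following sketch scores a made-up corpus and ranks the terms of its first document. It is one possible arrangement under stated assumptions, not a reference solution: the function name `tf_idf_scores`, the corpus, and the natural logarithm are all assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return one {term: TF * IDF} dictionary per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus (not taken from `docs`).
example_corpus = [["cat", "sit", "mat", "cat"], ["cat", "chase", "dog"], ["dog", "bark"]]
example_scores = tf_idf_scores(example_corpus)

# Rank the terms of the first document by score, highest first.
ranked = sorted(example_scores[0].items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In this example, "cat" is the most frequent term of the first document, yet "sit" and "mat" score higher because they occur in no other document.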

👋 ⚒  Write a function to calculate the final TF-IDF scores for each word.

In [None]:
#Your code should go here

### Compare your results to sklearn

sklearn provides an implementation for calculating the TF-IDF values. Compare your calculations to these values.
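
When comparing, do not expect identical numbers. According to the scikit-learn documentation, `TfidfVectorizer` with default settings uses raw counts as term frequency, a smoothed IDF of $\ln \frac{1 + d}{1 + df_t} + 1$, and L2 normalization of each document vector. It also tokenizes the raw strings itself, lowercasing them and dropping single-character tokens. The plain-Python sketch below mimics only the weighting scheme on a made-up corpus; `smoothed_tf_idf` is a hypothetical name and the code is not scikit-learn's implementation.

```python
import math
from collections import Counter


def smoothed_tf_idf(documents):
    """Raw count * (ln((1 + d) / (1 + df_t)) + 1), then L2-normalise per document."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    result = []
    for tokens in documents:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        result.append({term: w / norm for term, w in weights.items()})
    return result


# Hypothetical corpus (not taken from `docs`).
example_corpus = [["cat", "sit", "mat", "cat"], ["cat", "chase", "dog"], ["dog", "bark"]]
smoothed = smoothed_tf_idf(example_corpus)
print(smoothed[0])
```

With this variant, a term that occurs in every document still receives a nonzero weight, whereas $\log \frac{d}{df_t}$ gives it 0.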

👋 ⚒  Use the [TfidfVectorizer of sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to calculate the TF-IDF values for the same corpus as above.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Your code here

What can you learn about the documents based on these extracted features?

👋 ⚒  Write your textual answer right here.