
# **Bonus Exercises 1: Multilingual and Crosslingual Methods and Language Resources**

This notebook represents two bonus exercises for the lecture Multilingual and Crosslingual Methods and Language Resources (2023W 340168-1). For each of these you can obtain a maximum of 3 points that are added to the points of your final exam. The sections where your code should go are marked with 👋 ⚒.

Bonus Exercise 1: Information Extraction with TF-IDF

## **Information Extraction with TF-IDF**

For high-resource languages, supervised approaches represent a viable solution. However, for low-resource languages it is at times necessary to use unsupervised approaches to obtain information from texts. We will work on English here for the sake of understanding the results, but the methods can be applied to any language.

For this exercise, you will implement a mini-example of information extraction, or rather feature extraction, with Term Frequency-Inverse Document Frequency (TF-IDF).

### **Term Frequency-Inverse Document Frequency (TF-IDF)**

In this exercise, you will write a simple implementation of the TF-IDF algorithm and compare your implementation with the one in sklearn. TF-IDF is an effective method for extracting features from text without any supervision. Some terms occur frequently in almost every document and are therefore poor features for identifying the category or topic of a particular document. TF-IDF instead assigns higher values to words that occur frequently in one document but not in all documents. These values provide more and better information on the potential contents of a document than raw frequency counts. Originally, the technique was used for ranking documents in search engines. Today, it is still used for topic modeling (identifying the topics of documents automatically), term extraction, and similar tasks.

### Toy Example of Documents

We will use the following toy example, where each sentence in the list is considered a document on its own.

In [1]:
docs = ["this is the story behind the red house on the street with twenty houses",
       "a man decided to build his own house on our street, which should change the street",
       "all the houses were painted white and all the neighbors were happy with this",
       "the man thought to himself this is wrong and painted his house red",
       "he went from his house to his neighbor's house and explained",
       "his neighbor thought this is a good idea and painted his house orange",
       "now they thought the houses started to look right for a street named rainbow road"
       ]

Use [spaCy](https://spacy.io/) to preprocess these "documents" in the list `docs`. The following preprocessing steps need to be performed:


1.   Tokenization
2.   POS tagging
3.   Lemmatization

We first need to import spaCy and load the English model.


In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

👋 ⚒  Perform the spaCy preprocessing steps described above and remove all tokens whose POS tag appears in the list below.

In [5]:

pos_to_be_removed = ['ADV', 'PRON', 'CCONJ', 'PUNCT', 'PART', 'DET', 'ADP', 'SPACE']


def preprocess(sentence):
    #tokenization
    tokens = nlp(sentence)

    #filter out unwanted POS tags
    filtered_tokens = [token.lemma_ for token in tokens if token.pos_ not in pos_to_be_removed]

    #join the filtered tokens to form a preprocessed sentence
    preprocessed_sentence = ' '.join(filtered_tokens)

    return preprocessed_sentence

#list
docs = [
    "this is the story behind the red house on the street with twenty houses",
    "a man decided to build his own house on our street, which should change the street",
    "all the houses were painted white and all the neighbors were happy with this",
    "the man thought to himself this is wrong and painted his house red",
    "he went from his house to his neighbor's house and explained",
    "his neighbor thought this is a good idea and painted his house orange",
    "now they thought the houses started to look right for a street named rainbow road"
]

#preprocess
preprocessed = []
for sentence in docs:
    preprocessed.append(preprocess(sentence))


print("Preprocessed sentences: ", preprocessed)


Preprocessed sentences:  ['be story red house street twenty house', 'man decide build own house street should change street', 'house be paint white neighbor be happy', 'man think be wrong paint house red', 'go house neighbor house explain', 'neighbor think be good idea paint house orange', 'think house start look right street name rainbow road']


### Term Frequency (TF)

The term frequency is calculated as the relative frequency of the word in a specific document, that is, the absolute frequency divided by the number of words in the document. For the term $t$ in document $d$, this is the count of the term $n_{t,d}$ divided by the count of all words $\sum_{t'} n_{t',d}$ in the document $d$:

$TF_{t,d} = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}$
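
As a quick check of the formula, here is a minimal sketch that computes relative frequencies for one already-preprocessed document. The token string is the preprocessed form of the fifth toy sentence; the helper name `relative_frequencies` is only illustrative and not part of the exercise template.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Fifth toy document after preprocessing: five tokens, "house" occurs twice
tokens = "go house neighbor house explain".split()
tf = relative_frequencies(tokens)

print(tf["house"])  # 2 / 5 = 0.4
print(tf["go"])     # 1 / 5 = 0.2
```

Counting on the token list rather than on the raw string matters: a substring count would also match a short term such as "go" inside a longer token such as "good".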


👋 ⚒  Write a function to calculate the Term Frequency (TF) for each word in each document. The result will be a list of term frequencies for each document.

In [6]:
def compute_term_frequency(bag_of_words):
    #split once so that whole tokens are counted, not substrings of the string
    tokens = bag_of_words.split()
    term_frequency = {term: tokens.count(term) / len(tokens) for term in set(tokens)}

    return term_frequency

term_frequencies = []
for doc in preprocessed:
    term_frequencies.append(compute_term_frequency(doc))

# Print the computed term frequencies for each document
for i, doc_tf in enumerate(term_frequencies, 1):
    print(f"Document {i} Term Frequencies: {doc_tf}")


Document 1 Term Frequencies: {'red': 0.14285714285714285, 'be': 0.14285714285714285, 'house': 0.2857142857142857, 'twenty': 0.14285714285714285, 'street': 0.14285714285714285, 'story': 0.14285714285714285}
Document 2 Term Frequencies: {'own': 0.1111111111111111, 'house': 0.1111111111111111, 'decide': 0.1111111111111111, 'man': 0.1111111111111111, 'change': 0.1111111111111111, 'street': 0.2222222222222222, 'build': 0.1111111111111111, 'should': 0.1111111111111111}
Document 3 Term Frequencies: {'be': 0.2857142857142857, 'house': 0.14285714285714285, 'happy': 0.14285714285714285, 'paint': 0.14285714285714285, 'neighbor': 0.14285714285714285, 'white': 0.14285714285714285}
Document 4 Term Frequencies: {'red': 0.14285714285714285, 'wrong': 0.14285714285714285, 'be': 0.14285714285714285, 'think': 0.14285714285714285, 'house': 0.14285714285714285, 'man': 0.14285714285714285, 'paint': 0.14285714285714285}
Document 5 Term Frequencies: {'explain': 0.2, 'neighbor': 0.2, 'house': 0.4, 'go': 0.2}
Do

### Inverse Document Frequency (IDF)

The document frequency $df_t$ is the number of documents in which the term $t$ occurs. We again consider the relative document frequency, that is $df_t$ divided by the number of all documents $d$.

$DF = \frac{df_t}{d}$

It has turned out that the inverse of this formula performs better, especially when scaling it with a logarithm. This gives us the Inverse Document Frequency (IDF):

$IDF = \log \frac{d}{df_t}$
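
To get a feeling for the scale of these values, the following sketch evaluates the IDF formula for a corpus of seven documents, the size of the toy example. It assumes the natural logarithm (`math.log`); the function name `idf` is only illustrative.

```python
import math


def idf(num_docs, doc_freq):
    """IDF = log(d / df_t), using the natural logarithm."""
    return math.log(num_docs / doc_freq)


num_docs = 7  # the toy corpus has seven documents

print(idf(num_docs, 1))  # term in a single document: log(7), about 1.946
print(idf(num_docs, 3))  # term in three documents: log(7/3), about 0.847
print(idf(num_docs, 7))  # term in every document: log(1) = 0.0
```

A term that occurs in every document therefore gets an IDF of exactly zero, no matter how often it occurs.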

👋 ⚒  Write a function to calculate the Inverse Document Frequency (IDF). The result will be a list of words with IDF values.

In [7]:
import math

# the logarithm can be calculated with math.log()

def compute_inverse_document_frequency(full_doc_list):
    #frequency (df)
    document_frequencies = {term: sum(1 for doc in full_doc_list if term in doc.split()) for term in set(' '.join(full_doc_list).split())}

    #calculate the idf
    idf_values = {term: math.log(len(full_doc_list) / df) for term, df in document_frequencies.items()}

    return idf_values

#compute IDF
idf_values = compute_inverse_document_frequency(preprocessed)

print("IDF Values: ", idf_values)


IDF Values:  {'own': 1.9459101490553132, 'be': 0.5596157879354227, 'think': 0.8472978603872037, 'man': 1.252762968495368, 'happy': 1.9459101490553132, 'paint': 0.8472978603872037, 'go': 1.9459101490553132, 'road': 1.9459101490553132, 'red': 1.252762968495368, 'house': 0.0, 'look': 1.9459101490553132, 'explain': 1.9459101490553132, 'idea': 1.9459101490553132, 'neighbor': 0.8472978603872037, 'build': 1.9459101490553132, 'should': 1.9459101490553132, 'white': 1.9459101490553132, 'rainbow': 1.9459101490553132, 'street': 0.8472978603872037, 'right': 1.9459101490553132, 'orange': 1.9459101490553132, 'good': 1.9459101490553132, 'start': 1.9459101490553132, 'wrong': 1.9459101490553132, 'twenty': 1.9459101490553132, 'decide': 1.9459101490553132, 'name': 1.9459101490553132, 'change': 1.9459101490553132, 'story': 1.9459101490553132}


### TF-IDF

The final value is calculated by multiplying the values of the previous two calculations:

$\text{TF-IDF} = tf_t \times \log \frac{d}{df_t}$
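
A small worked example of this product, as a sketch: the counts below are read off the preprocessed fifth toy document (five tokens) and the seven-document corpus, and the function name `tf_idf` is only illustrative.

```python
import math


def tf_idf(term_count, doc_length, doc_freq, num_docs):
    """TF-IDF = (term_count / doc_length) * log(num_docs / doc_freq)."""
    tf = term_count / doc_length
    idf = math.log(num_docs / doc_freq)
    return tf * idf


# "explain": once among 5 tokens, found in 1 of 7 documents
explain_score = tf_idf(term_count=1, doc_length=5, doc_freq=1, num_docs=7)

# "house": twice among 5 tokens, but found in all 7 documents
house_score = tf_idf(term_count=2, doc_length=5, doc_freq=7, num_docs=7)

print(explain_score)  # 0.2 * log(7), about 0.389
print(house_score)    # 0.4 * log(1) = 0.0
```

Although "house" is the most frequent term in that document, its score is zero because it does not help to tell the documents apart.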

👋 ⚒  Write a function to calculate the final TF-IDF scores for each word.

In [10]:
#Your code should go here
term_frequencies = [
    {'red': 0.14285714285714285, 'be': 0.14285714285714285, 'house': 0.2857142857142857, 'twenty': 0.14285714285714285, 'street': 0.14285714285714285, 'story': 0.14285714285714285},
    {'own': 0.1111111111111111, 'house': 0.1111111111111111, 'decide': 0.1111111111111111, 'man': 0.1111111111111111, 'change': 0.1111111111111111, 'street': 0.2222222222222222, 'build': 0.1111111111111111, 'should': 0.1111111111111111},
    {'be': 0.2857142857142857, 'house': 0.14285714285714285, 'happy': 0.14285714285714285, 'paint': 0.14285714285714285, 'neighbor': 0.14285714285714285, 'white': 0.14285714285714285},
    {'red': 0.14285714285714285, 'wrong': 0.14285714285714285, 'be': 0.14285714285714285, 'think': 0.14285714285714285, 'house': 0.14285714285714285, 'man': 0.14285714285714285, 'paint': 0.14285714285714285},
    {'explain': 0.2, 'neighbor': 0.2, 'house': 0.4, 'go': 0.2},
    {'be': 0.125, 'think': 0.125, 'house': 0.125, 'idea': 0.125, 'paint': 0.125, 'neighbor': 0.125, 'orange': 0.125, 'good': 0.125},
    {'think': 0.1111111111111111, 'house': 0.1111111111111111, 'look': 0.1111111111111111, 'rainbow': 0.1111111111111111, 'name': 0.1111111111111111, 'street': 0.1111111111111111, 'road': 0.1111111111111111, 'right': 0.1111111111111111, 'start': 0.1111111111111111}
]

idf_values = {
    'own': 1.9459101490553132, 'be': 0.5596157879354227, 'think': 0.8472978603872037,
    'man': 1.252762968495368, 'happy': 1.9459101490553132, 'paint': 0.8472978603872037,
    'go': 1.9459101490553132, 'road': 1.9459101490553132, 'red': 1.252762968495368,
    'house': 0.0, 'look': 1.9459101490553132, 'explain': 1.9459101490553132,
    'idea': 1.9459101490553132, 'neighbor': 0.8472978603872037, 'build': 1.9459101490553132,
    'should': 1.9459101490553132, 'white': 1.9459101490553132, 'rainbow': 1.9459101490553132,
    'street': 0.8472978603872037, 'right': 1.9459101490553132, 'orange': 1.9459101490553132,
    'good': 1.9459101490553132, 'start': 1.9459101490553132, 'wrong': 1.9459101490553132,
    'twenty': 1.9459101490553132, 'decide': 1.9459101490553132, 'name': 1.9459101490553132,
    'change': 1.9459101490553132, 'story': 1.9459101490553132
}

#function to compute final TF-IDF
def compute_final_tf_idf(tf_values, idf_values):
    # Calculate the final TF-IDF for each term using the given formula
    final_tf_idf_scores = {term: tf_values[term] * idf_values[term] for term in tf_values}
    return final_tf_idf_scores

# compute final TF-IDF scores
example_final_tf_idf_scores = [compute_final_tf_idf(tf_values, idf_values) for tf_values in term_frequencies]

for i, doc_final_tf_idf in enumerate(example_final_tf_idf_scores, 1):
    print(f"Document {i} Final TF-IDF Scores: {doc_final_tf_idf}")



Document 1 Final TF-IDF Scores: {'red': 0.17896613835648115, 'be': 0.07994511256220323, 'house': 0.0, 'twenty': 0.277987164150759, 'street': 0.12104255148388623, 'story': 0.277987164150759}
Document 2 Final TF-IDF Scores: {'own': 0.21621223878392368, 'house': 0.0, 'decide': 0.21621223878392368, 'man': 0.1391958853883742, 'change': 0.21621223878392368, 'street': 0.18828841341937858, 'build': 0.21621223878392368, 'should': 0.21621223878392368}
Document 3 Final TF-IDF Scores: {'be': 0.15989022512440645, 'house': 0.0, 'happy': 0.277987164150759, 'paint': 0.12104255148388623, 'neighbor': 0.12104255148388623, 'white': 0.277987164150759}
Document 4 Final TF-IDF Scores: {'red': 0.17896613835648115, 'wrong': 0.277987164150759, 'be': 0.07994511256220323, 'think': 0.12104255148388623, 'house': 0.0, 'man': 0.17896613835648115, 'paint': 0.12104255148388623}
Document 5 Final TF-IDF Scores: {'explain': 0.38918202981106265, 'neighbor': 0.16945957207744075, 'house': 0.0, 'go': 0.38918202981106265}
Docu

### Compare your results to sklearn

sklearn provides an implementation for calculating the TF-IDF values. Compare your calculations to these values.
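
When comparing, keep in mind that the numbers are not expected to match exactly. With its default settings, `TfidfVectorizer` tokenizes the raw text without lemmatization or POS filtering, uses raw counts as term frequency, uses a smoothed IDF, $\ln \frac{1 + d}{1 + df_t} + 1$, and scales each document vector to unit L2 length. The following pure-Python sketch mimics that default weighting on a tiny hypothetical corpus; the function name `sklearn_style_tfidf` and the three-document input are illustrative assumptions, not part of sklearn.

```python
import math
from collections import Counter


def sklearn_style_tfidf(tokenized_docs):
    """Sketch of the default weighting: raw counts, smoothed IDF, L2 norm."""
    n = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

    result = []
    for doc in tokenized_docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        result.append({term: w / norm for term, w in weights.items()})
    return result


# Hypothetical mini-corpus: "house" occurs in every document
toy = [["red", "house"], ["white", "house"], ["house", "street"]]
scores = sklearn_style_tfidf(toy)
print(scores[0])
```

Because of the added 1 in the smoothed IDF, a term that occurs in every document keeps a non-zero weight here, whereas the unsmoothed formula above assigns it zero.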

👋 ⚒  Use the [TfidfVectorizer of sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to calculate the TF-IDF values for the same corpus as above.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Your code here

docs = [
    "this is the story behind the red house on the street with twenty houses",
    "a man decided to build his own house on our street, which should change the street",
    "all the houses were painted white and all the neighbors were happy with this",
    "the man thought to himself this is wrong and painted his house red",
    "he went from his house to his neighbor's house and explained",
    "his neighbor thought this is a good idea and painted his house orange",
    "now they thought the houses started to look right for a street named rainbow road"
]

vectorizer = TfidfVectorizer()

#fit and transform the corpus
tfidf_sklearn = vectorizer.fit_transform(docs)

#get terms
terms = vectorizer.get_feature_names_out()

#dictionary: term -> column of TF-IDF scores, one value per document
tfidf_sklearn_dict = {term: tfidf_sklearn[:, vectorizer.vocabulary_[term]].toarray().flatten() for term in terms}

#each value is a term's column, so label its entries by document number
doc_labels = [f"Document {i}" for i in range(1, len(docs) + 1)]
for term, term_scores in tfidf_sklearn_dict.items():
    print(f"Term '{term}' TF-IDF Scores (sklearn): {dict(zip(doc_labels, term_scores))}")

Term 'all' TF-IDF Scores (sklearn): {'Document 1': 0.0, 'Document 2': 0.0, 'Document 3': 0.5230727856650398, 'Document 4': 0.0, 'Document 5': 0.0, 'Document 6': 0.0, 'Document 7': 0.0}
Term 'and' TF-IDF Scores (sklearn): {'Document 1': 0.0, 'Document 2': 0.0, 'Document 3': 0.16111149274275427, 'Document 4': 0.2330220550530006, 'Document 5': 0.21603863218596028, 'Document 6': 0.2210326403843424, 'Document 7': 0.0}
Term 'behind' TF-IDF Scores (sklearn): {'Document 1': 0.31832354532454604, 'Document 2': 0.0, 'Document 3': 0.0, 'Document 4': 0.0, 'Document 5': 0.0, 'Document 6': 0.0, 'Document 7': 0.0}
Term 'build' TF-IDF Scores (sklearn): {'Document 1': 0.0, 'Document 2': 0.29193943488357654, 'Document 3': 0.0, 'Document 4': 0.0, 'Document 5': 0.0, 'Document 6': 0.0, 'Document 7': 0.0}
Term 'change' TF-IDF Scores (sklearn): {'Document 1': 0.0, 'Document 2': 0.29193943488357654, 'Document 3': 0.0, 'Document 4': 0.0, 'Document 5': 0.0, 'Document 6': 0.0, 'Document 7': 0.0}
Term 'decided' TF-IDF Scores (sklearn): {'Document 1': 0.0, 'Document 2': 0.29193943488357654, 'Document 3': 0.0, 'Document 4': 0.0, 'Document 5': 0.0, 'Document 6': 0.0, 'Document 7': 0.0}
Term 'explained' TF-IDF Scores

What can you learn about the documents based on these extracted features?

👋 ⚒  Write your textual answer right here.

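TF-IDF shines a light on the words that distinguish one document from the others. A term scores high when it is frequent within a document but rare across the corpus. In Document 5, for example, "explain" and "go" receive the highest score in our implementation (about 0.39) because each of them occurs in that document only, while "neighbor" scores lower (about 0.17) because it also appears in two other documents. Words such as "rainbow" and "orange" likewise occur in a single document each, so they mark what is unique about it. At the other extreme, "house" occurs in all seven preprocessed documents, so its IDF is $\log \frac{7}{7} = 0$ and its TF-IDF is 0.0 in every document, not just in Document 3: it tells us what the whole corpus is about, but it cannot tell the documents apart. The sklearn results are computed on the raw text, so function words are still present, and a common word like "and", which appears in four of the seven documents, is weighted down for the same reason. Overall, TF-IDF balances how important a word is within a document against how rare it is in the collection, which gives a practical first picture of what each document is about.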