# Resit Assignment 3a

## Due: Friday, November 25, 2022 at 5pm (submission via Canvas)


* Please submit your assignment as **a single .zip file** using Canvas (Assignments --> Assignment 3). Please put the notebooks as well as the Python modules (files ending with .py) in one folder, which you call ASSIGNMENT_3_FIRSTNAME_LASTNAME. Please zip this folder and upload it as your submission. DO NOT include the data files in your zip folder.

* Please name your zip file with the following naming convention: RESIT_3_FIRSTNAME_LASTNAME.zip


In this block, we covered:

* Chapter 12 - Importing external modules 
* Chapter 13 - Working with Python scripts
* Chapter 14 - Reading and writing text files
* Chapter 15 - Off to analyzing text 

**Can I use external modules other than the ones treated so far?**


For now, please try to avoid it. All the exercises can be solved with what we have covered in the course. 

To easily download the dataset for this assignment, you will need to install the `datasets` module:


In [None]:
# Install Related Modules
! pip install datasets

The following code downloads the "Simplified English Wikipedia" dataset from the HuggingFace dataset hub, consisting of more than 160k articles. Since this is a small assignment, we will be playing only with the first 1000 articles from the dataset. The code already converts the dataset into a JSON file, which you should know how to manipulate.

The JSON file has the following simple structure:
{
    "title": ["Title 1", "Title 2", ..., "Title 1000"],
    "text": ["Text 1", "Text 2", ..., "Text 1000"]
}

where the first element of the title list is the title of the first element of the text list, the second title corresponds to the second text, and so on.
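As a small illustration of this layout, the sketch below writes a toy file with the same two parallel lists and reads it back. The file name `toy_wiki.json` and the contents are made up for the example; only the `title` and `text` keys come from the description above.

```python
import json
import os
import tempfile

# Toy data with the same two keys as the assignment file (contents are made up)
toy_data = {
    "title": ["Title 1", "Title 2", "Title 3"],
    "text": ["Text 1", "Text 2", "Text 3"],
}

# Write the toy data to a temporary JSON file (hypothetical file name)
toy_path = os.path.join(tempfile.gettempdir(), "toy_wiki.json")
with open(toy_path, "w") as outfile:
    json.dump(toy_data, outfile)

# Read it back: json.load returns a regular Python dictionary
with open(toy_path) as infile:
    loaded = json.load(infile)

# Both lists have the same length, and index i links a title to its text
second_title = loaded["title"][1]
second_text = loaded["text"][1]
print(second_title, "->", second_text)
```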

In [None]:
from datasets import load_dataset
import json

# Please adjust this path if it is more convenient for you
save_filepath="WikipediaSimple_1000k.json"

wiki_dataset = load_dataset("wikipedia", "20200501.simple")['train']
print(wiki_dataset)

selected_data = wiki_dataset[:1000]
with open(save_filepath, "w") as f:
    json.dump(selected_data, f)

print(f"File was successfully downloaded and the first 1000 articles saved in '{save_filepath}'!")


## Functions & scope

### Exercise 1

Define a function called `match_titles_texts` which takes one positional parameter called **filepath** (a string) and one keyword parameter called **return_sorted** (a boolean), which is set to True by default.

The function:
* Opens the JSON file given by the filepath and loads its content into a Python dictionary (please use the JSON file created in the previous cell)
* Iterates through the dictionary to pair each title with its respective text
* Returns a list of tuples such that returned_list = [(Title_1, Text_1), (Title_2, Text_2), ..., (Title_1000, Text_1000)]
* If `return_sorted=True`, returns the list sorted alphabetically by title. Otherwise, returns the list in its original order.

Hints:
* Hint 1: Use the context manager and json module to read in the file.
* Hint 2: Recall that the zip() function can help you.
* Hint 3: There is a function which sorts items in an iterable called 'sorted'. Look at the documentation to see how it is used. 
* Hint 4: Don't forget to write a docstring. Please make sure that the docstring explains what the input is, what the function does, and what the function returns. The docstrings are mandatory!
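To recall how `zip()` and `sorted()` behave, here is a minimal sketch on made-up toy data (not the Wikipedia articles). The variable names are only illustrative.

```python
# Two parallel lists (made-up toy data, not the Wikipedia articles)
cities = ["Utrecht", "Amsterdam", "Leiden"]
provinces = ["Utrecht", "North Holland", "South Holland"]

# zip() pairs the items that share the same position
pairs = list(zip(cities, provinces))

# sorted() returns a new list; the key function picks what to sort on
pairs_sorted = sorted(pairs, key=lambda pair: pair[0])

print(pairs)
print(pairs_sorted)
```

Note that `sorted()` leaves the original list untouched, so both orderings remain available.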

In [None]:
# your code here
import json
def match_titles_texts(filepath, return_sorted=True):
    # your code goes here (you can erase the pass)
    pass

input_filepath = "WikipediaSimple_1000k.json"

# Return Articles without sorting
wiki_articles = match_titles_texts(input_filepath, return_sorted=False)
assert [art[0] for art in wiki_articles[:2]] == ['Rosh Hashanah', 'Longueville, Calvados']

# Return Articles sorted lexicographically
wiki_articles = match_titles_texts(input_filepath)
assert [art[0] for art in wiki_articles[:2]] == ['1 Giant Leap', '1100 New York Avenue']



## Working with python scripts

### Exercise 2
For the following exercises we will make use of the `NLTK library`.

* Write a function called `get_unique_tokens` which takes one positional parameter called **text** (a string) and one keyword parameter called **case_sensitive** (boolean). The function tokenizes the text and returns a set containing the unique words in the text. Remember that if **case_sensitive=True** it means that *word* and *Word* are considered two different tokens and if **case_sensitive=False** then you should only save the lowercased version of the tokens.


#### 2a)

In [None]:
# your code here
from nltk import word_tokenize

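def get_unique_tokens(text, case_sensitive=True):
    """
    Tokenize a text and return the set of its unique tokens.

    :param str text: the text to tokenize
    :param bool case_sensitive: if False, tokens are lowercased before being added

    :returns: a set of unique tokens
    """
    unique_words = set()
    tokens = word_tokenize(text)
    for tok in tokens:
        if case_sensitive:
            unique_words.add(tok)
        else:
            unique_words.add(tok.lower())
    return unique_words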

#### 2b)

You will now write a function called `get_text_diversity` with one positional parameter called **text**, which can be any string. This function will calculate the ratio of unique tokens inside a text compared to the total number of tokens. The idea behind this is to have a metric that informs us about the complexity of a text (e.g. a very complex text introduces several terms and synonyms whereas a simple text tends to use a narrow vocabulary). Formally we define it as:

$text\_diversity = \frac{unique\_tokens}{all\_tokens} \times 100$

If every token in the text appears exactly once, then text_diversity = 100; the more repetitive the article is, the closer text_diversity gets to 0.

Write such a function and apply it to the 1000 Wikipedia articles. Print the Top-5 articles (with the highest diversity) and the Bottom-5 articles (with the lowest diversity).
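As a worked example of the formula, the sketch below computes the value for one made-up sentence. It uses a plain `split()` instead of a real tokenizer to stay self-contained, so the numbers only illustrate the arithmetic.

```python
# Toy token list; in the exercise the tokens would come from a tokenizer instead
tokens = "the cat saw the dog and the bird".split()

unique_tokens = set(tokens)      # "the" occurs three times but is stored once
n_unique = len(unique_tokens)    # 6 distinct tokens
n_total = len(tokens)            # 8 tokens in total

# Ratio of unique tokens to all tokens, scaled by 100 as in the formula
diversity = n_unique / n_total * 100
print(diversity)
```

Here 6 of the 8 tokens are unique, so the result is 75.0. An empty text has zero tokens, so a full implementation would need to decide what to return in that case.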

In [None]:
# Your code here ...

### Exercise 3

#### a.) Define a function called `lemmatize_text`

* Write a helper function called `lemmatize_text`. It receives a **text** as input (first positional parameter) and returns it lowercased and lemmatized (i.e. as a list of lemmas). The second positional parameter is **lemmatizer**, the initialized lemmatizer from NLTK. Finally, the function also has a keyword parameter **only_content_words** with a default value of False.

* Remember that to lemmatize you need to know the POS tag of the token. We have already provided the lists of tags that distinguish nouns, verbs and adjectives. Your function should lemmatize words from those categories accordingly. For simplicity, if the POS tag does not belong to one of those three categories, you can assume the lemma is the same as the original token.

* Please make use of the nltk methods to achieve this (tokenize, obtain Part-of-speech tags and lemmatize). See Chapter 15.

* If **only_content_words = True** then return the list of lemmatized tokens that correspond only to Nouns, Verbs, and Adjectives (ignore any other token) 

* Test your function in the notebook using the first article from the provided `test_articles` list.
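One possible way to connect the provided tag lists to the lemmatizer is sketched below. The helper name `tag_to_wordnet_pos` is hypothetical and not part of the assignment; the sketch assumes the tag lists given in the code cell of this exercise and uses hand-written tagged tokens so that it runs without NLTK.

```python
# Tag lists as provided in the assignment cell
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
noun_tags = ["NN", "NNP", "NNS"]
adj_tags = ["JJ", "JJR", "JJS"]


def tag_to_wordnet_pos(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter.

    Returns "v", "n" or "a" for verbs, nouns and adjectives, and None for
    every other tag (the token is then kept as it is).
    """
    if tag in verb_tags:
        return "v"
    if tag in noun_tags:
        return "n"
    if tag in adj_tags:
        return "a"
    return None


# Tagged tokens written by hand here; nltk.pos_tag would produce such pairs
tagged = [("articles", "NNS"), ("was", "VBD"), ("shorter", "JJR"), ("the", "DT")]
for token, tag in tagged:
    print(token, tag, tag_to_wordnet_pos(tag))
```

With an initialized NLTK `WordNetLemmatizer`, the returned letter can be passed along as `lemmatizer.lemmatize(token, pos=...)`.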


#### b.) Define a function called `build_global_vocabulary`

* The main function has one positional parameter **wiki_articles** (a list of strings); each string in the list is a Wikipedia article. The function determines how often each lemma occurs across all articles by keeping a dictionary with the lemmas as keys and their frequencies as values. To achieve this, you will have to lemmatize and lowercase each token before adding it to the vocabulary counts. Make use of the `lemmatize_text` function you just wrote and call it with **only_content_words=True** so that only the interesting words end up in your vocabulary.

* Call the function `lemmatize_text` inside the `build_global_vocabulary` function in order to count the lowercased lemmas only.

* Remember how we used dictionaries to count words? If not, have a look at Chapter 10 - Dictionaries. 

* Test your function in the notebook with the `test_articles` list containing 2 fake short articles.
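As a reminder of the counting pattern with dictionaries, here is a minimal sketch. The nested lists are made up and merely stand in for the lemmas that a lemmatizer would return for each article.

```python
# Made-up lists of lemmas, standing in for the output of a lemmatizer
lemmatized_articles = [
    ["fake", "article", "be", "fake"],
    ["short", "article", "be", "short"],
]

counts = {}
for lemmas in lemmatized_articles:
    for lemma in lemmas:
        # Start at 0 for unseen lemmas, then add one
        counts[lemma] = counts.get(lemma, 0) + 1

print(counts)
```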

#### c.) Create a python script 

* Create a main script called `resit_main.py`. Place the function definition of the **build_global_vocabulary** function in `resit_main.py`.

* Place your helper function definition, i.e., **lemmatize_text**, in a separate auxiliary script called `resit_utils.py`. 

* Import your helper function **lemmatize_text** into `resit_main.py` and also use the **test_articles** list there to confirm that everything still works as expected by calling the script `resit_main.py` from the terminal (remember to add a print statement in the main script so the result appears in the terminal).

* Careful! The function must do two things: return the vocabulary counts (so they can be reused later) and also print them in the terminal.

* To confirm everything is working, basically the same result has to appear in the terminal as it appeared in the notebook when solving exercise 3a. and 3b.

**Please submit these scripts together with the other notebooks**.

Don't forget to add docstrings to your functions. 

In [None]:
# This is given as help already:
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
noun_tags = ["NN", "NNP", "NNS"]
adj_tags = ["JJ", "JJR", "JJS"]

test_articles = ["Fake Article: this is meant to imitate a real wikipedia article, nothing else. The fake article was written in 2022, which is actually not that long ago! End of Fake Article.", 
                    "Short Article: this short article is also fake, but it is shorter than the other one"
                    ]

import nltk
import string

# 3a
def lemmatize_text():
    # Modify the function definition so it receives the requested parameters
    # The rest of your code goes here (you can erase the pass)
    pass

# 3b
def build_global_vocabulary():
    # Modify the function definition so it receives the requested parameters
    # The rest of your code goes here (you can erase the pass)
    pass
    

# Test only lemmatize_text()
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
text = test_articles[0]
content_lemmas = lemmatize_text(text, lmtzr, only_content_words=True)
print(content_lemmas)
assert content_lemmas == ['fake', 'article', 'be', 'mean', 'real', 'wikipedia', 'article', 'nothing', 'fake', 'article', 'be', 'write', 'be', 'end', 'fake', 'article']

# Test build global_vocab()
vocab = build_global_vocabulary(test_articles)
assert vocab['fake'] == 4


# 3 c
## MOVE EVERYTHING TO AN EXTERNAL SCRIPT

### Exercise 4 

In this exercise, we want to see what content words are used most frequently in the wikipedia corpus.

Move the function **match_titles_texts** to the `resit_utils.py` script.

Now we come back to the notebook. We will import here the function **build_global_vocabulary** from the main script and **match_titles_texts** from the `resit_utils.py` script. In the notebook code cell, do the following: 
* Open and read the initial JSON file using the imported **match_titles_texts** function
* Obtain a list that contains only the article texts (without the titles)
* Call the build_global_vocabulary function with the 1000 articles as input
* Get the Top-100 lemmas found in all the wikipedia articles
* Save them into the file **top_100_lemmas.json**, where the structure of the JSON is:
{"top_100": [('lemma_1', freq_lemma_1), ('lemma_2', freq_lemma_2), ..., ('lemma_100', freq_lemma_100)]}
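The sketch below illustrates, on a made-up frequency dictionary, how (lemma, frequency) pairs can be ranked and what happens to tuples when they are written as JSON. The key `top_2` and the counts are only for illustration.

```python
import json

# Made-up frequency dictionary
counts = {"city": 12, "be": 40, "year": 25, "river": 7}

# Sort the (lemma, frequency) pairs by frequency, highest first, and keep the top 2
top_2 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:2]

# JSON has no tuple type, so each tuple is written as a list
as_json = json.dumps({"top_2": top_2})
print(as_json)

# Reading it back gives lists instead of tuples
restored = json.loads(as_json)
print(restored["top_2"])
```

In other words, the tuples in the structure above end up as two-element lists in the saved file, which is expected.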

In [None]:
# Your imports go here

# Your code goes here

### Exercise 5

**Make a rudimentary search engine**

Finally, write in this notebook a function called **search_titles_with_terms** which has one positional parameter: `wiki_articles` (a list containing all the tuples (title, article)) and a keyword parameter `search_terms` (a list containing strings that represent keywords to match), which has a default value of **[]** (empty list). 
* If the list is empty, avoid iterating the articles and return an empty list right away.
* To improve the matches of your search function, lemmatize the terms inside the article texts (you can import lemmatize_text() here again and use it; just beware that it requires a string as input, NOT a list of lemmas! TIP: use the join() string method for this)
* The function in summary:
    * Iterates over the 1000 articles every time it is called
    * Lemmatizes each one of the search terms
    * Gets the lemmas of each article text
    * Returns the **set of (unique) titles** (without the texts!) of those articles whose text contains at least one of the lemmatized search terms.
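Two building blocks mentioned above, `join()` and checking for overlap between lemmas, can be sketched as follows. The lemma lists are made up, and using `any()` with a set is just one possible way to test for a match.

```python
# Made-up lemmas for the search terms and for one article
term_lemmas = ["movie", "actor"]
article_lemmas = ["the", "movie", "be", "release", "in", "2008"]

# join() turns a list of strings into a single string
terms_as_text = " ".join(term_lemmas)

# Checking membership in a set is a quick way to test for any overlap
article_vocab = set(article_lemmas)
has_match = any(lemma in article_vocab for lemma in term_lemmas)

print(terms_as_text)
print(has_match)
```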




In [None]:
# your imports go here


def search_titles_with_terms():
    # Modify the function definition so it receives the requested parameters
    # The rest of your code goes here (you can erase the pass)
    pass



my_titles = search_titles_with_terms(wiki_articles, search_terms=["movies"])
print(my_titles)
assert len(my_titles) == 102
assert 'The Chronicles of Narnia: Prince Caspian' in my_titles
print('Success!')