# 2021-2022 Digital Text Analysis | Bootcamp
## August exam period

Welcome to the practical exam. Below are a number of questions that test the skills you've learnt during the bootcamp.

Guidelines:

- You have **2 hours** to complete the exam.
- This is an **open book exam**. You may use the web, course notes or any other resources to assist you. You may NOT get live assistance from third parties (incl. fellow students), e.g. by solving the exam together with others in a chat.


Marking:
- The exam consists of a total of 20 points.
- Each exercise states how many points can be earned.
- Each point counts as one grade point, e.g. if you achieve 18 points, your grade is 18.
- Points can be earned for **partial answers**, so fill in what you can. 

Tips:

- This exam does not ask you to show skills that you have not learned during the bootcamp. Keep it simple, keep it familiar.
- Use notes to plan your answers.
- Use `print` statements to explore the data and debug your code.
- Show us the steps of the process that you **do** understand, even if you can't complete the process fully.
- In Python there is no `NoCodeError`, so don't stare at empty cells.

## 1. Friends (5 points)

Below, we'll work with an open dataset [from Kaggle](https://www.kaggle.com/datasets/rezaghari/friends-series-dataset) on the popular sitcom ["Friends"](https://en.wikipedia.org/wiki/Friends), which aired from 1994 to 2004. Each row in this dataset describes a single episode (but note that some episodes, like season finales, are listed more than once because they consist of multiple parts that aired separately).

- Load the dataset (which you can find as `friends.csv` in the same folder as this notebook) using `pandas`.
- Display the first 10 entries in the dataset.
- Print how many episodes are included and find out how many seasons are included.
- Consider the ratings (cf. `Stars` column):
    + Rank the rows in the dataset according to their ratings in decreasing order: which is the most popular episode?
    + Create a new series containing the mean rating per season.
    + Rank (in increasing order) the resulting series of means and plot them using a bar chart.
- Count how often each Director appears in the dataset and plot their cumulative frequency as a *horizontal* bar chart.
- Iterate over the `Summary` column: write the summary of each Friends episode to a new plain text file (`summaries.txt`), with each summary on a separate line. Make sure to write to the file in a safe way. Make sure that the summaries no longer contain any line breaks before you write them to the file.
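
As a reminder of what "safe" file writing can look like, here is a minimal sketch on toy data. The `summaries` list, the `flatten` helper and the temporary output folder are assumptions made for illustration; they are not part of the dataset or of a model answer.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the values in the `Summary` column.
summaries = [
    "Monica and the gang introduce Rachel.\nRoss is depressed.",
    "Ross finds out his ex-wife is pregnant.\r\nRachel returns her ring.",
]


def flatten(text):
    """Collapse every run of whitespace (including line breaks) into one space."""
    return " ".join(text.split())


# Written to a temporary folder so that the sketch does not touch real files.
out_path = Path(tempfile.mkdtemp()) / "summaries.txt"

# The `with` statement closes the file even if an error occurs while writing.
with open(out_path, "w", encoding="utf-8") as f:
    for summary in summaries:
        f.write(flatten(summary) + "\n")

print(out_path.read_text(encoding="utf-8"))
```

Flattening each summary before writing it means that one line in the file always corresponds to exactly one summary.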

In [None]:
# code here
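
The ranking and aggregation steps in this exercise map onto a handful of `pandas` methods. The sketch below uses a toy DataFrame with made-up values. The `Stars` column is named in the exercise, but the `Season` and `Director` column names are assumptions here, so check them against the real file.

```python
import pandas as pd

# Toy data with hypothetical values; the real file has more rows and columns.
df = pd.DataFrame({
    "Season": [1, 1, 2, 2, 3],
    "Stars": [8.3, 8.1, 8.9, 8.5, 8.0],
    "Director": ["A", "B", "A", "A", "C"],
})

# Rows ranked by rating, highest first.
ranked = df.sort_values("Stars", ascending=False)

# Mean rating per season, then ranked in increasing order.
season_means = df.groupby("Season")["Stars"].mean().sort_values()

# How often each director occurs (most frequent first).
director_counts = df["Director"].value_counts()

print(ranked.head(10))
print(season_means)
print(director_counts)

# Plotting needs matplotlib, so the calls are only shown as comments:
# season_means.plot.bar()
# director_counts.plot.barh()
```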

## 2. Frequency distributions (5 points)

- Iterate over the summaries in `summaries.txt` in a line-by-line fashion and store them in a tuple. Make sure to read the file in a safe way.
- Use `collections.Counter` to iterate over the summaries and count which of the lead characters (see dictionary below, which includes gender information) is most frequently mentioned in the summaries.
- Convert the resulting frequency distribution to a `pandas` object (use as column names `character` and `count`) and plot the result as a bar chart.
- Consider character interaction: consider all pairs of characters and count (in a data structure of your own choice) in how many episodes each pair co-occurs. Which two characters are most frequently mentioned together in the summaries?
- **Intermediate:** Consider the statistics that you collected in the previous question: try to write (preferably elegant) code to explicitly count how often the men interact with one another and how often the women interact with one another. Also calculate the total number of mixed-gender interactions.

In [None]:
leads = {'Joey': 'M', 'Rachel': 'F', 'Monica': 'F', 'Phoebe': 'F', 'Chandler': 'M', 'Ross': 'M'}

In [None]:
# code here
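
One possible way to organise the counting is sketched below on three made-up summaries. The `summaries` tuple is hypothetical, and matching names with `in` is a deliberately naive assumption (it would also match a name inside a longer word). Note that this sketch counts in how many summaries a character appears, not how many times the name is repeated.

```python
from collections import Counter
from itertools import combinations

leads = {'Joey': 'M', 'Rachel': 'F', 'Monica': 'F',
         'Phoebe': 'F', 'Chandler': 'M', 'Ross': 'M'}

# Hypothetical summaries standing in for the lines of `summaries.txt`.
summaries = (
    "Ross and Rachel argue while Joey eats.",
    "Monica and Chandler cook. Ross helps Monica.",
    "Phoebe sings for Rachel and Ross.",
)

mentions = Counter()  # number of summaries mentioning each lead
pairs = Counter()     # number of summaries in which a pair co-occurs

for summary in summaries:
    # Sorting gives every pair a fixed order, e.g. always ('Rachel', 'Ross').
    present = sorted(name for name in leads if name in summary)
    mentions.update(present)
    pairs.update(combinations(present, 2))

# Classify each pair by the genders involved: 'M', 'F' or 'mixed'.
by_gender = Counter()
for (a, b), n in pairs.items():
    genders = {leads[a], leads[b]}
    key = "mixed" if len(genders) == 2 else genders.pop()
    by_gender[key] += n

print(mentions.most_common(1))
print(pairs.most_common(1))
print(by_gender)
```

A `Counter` keyed on sorted name tuples keeps the pair counts symmetric without any extra bookkeeping.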

## 3. Functions and regular expressions (5 points)

- Define a function that can sentence-tokenize an arbitrary string (an input parameter that you can call `text`) and returns the resulting sentences as a tuple. For this purpose, you should work with a regular expression that matches any series of the punctuation marks '.', '!', and '?' immediately followed by at least one instance of whitespace.
- Add a boolean parameter `lowercase` to the function's signature that defaults to `False`; when set to `True`, the function should lowercase the resulting sentences. Add this behaviour to the function.
- The function should raise a `ValueError` if one of the two parameters (`text` and `lowercase`) does not have the expected type. Add an informative error message.
- Use this naive sentence tokenizer to count how many sentences the summaries contain in total. (Assert that you have a higher sentence count than summary count, because summaries contain multiple sentences.)
- **Advanced:** The first word of a sentence is normally capitalized. In the summaries, we should be able to naively detect the names of other, non-lead characters, because their names will be capitalized even when they occur in the middle of a sentence. Exploit this regularity and attempt to automatically detect the names of other characters. Count their occurrences in the summaries to find out the most frequently occurring *side* characters (i.e. while ignoring the lead characters). *Hint: the most frequently mentioned side character appears to be "Emily"*.
    + Remove genitival endings (e.g. `Paolo's`)
    + Also, strings that are all caps (e.g. 'TV') should be disregarded.

In [None]:
# code here
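
For the advanced bullet, a naive detector might skip the first word of every sentence and keep the remaining capitalized words. The sketch below runs on two made-up summaries. The `summaries` tuple and the set of stripped punctuation marks are assumptions, and the heuristic will also pick up capitalized non-names such as place names.

```python
import re
from collections import Counter

leads = {'Joey': 'M', 'Rachel': 'F', 'Monica': 'F',
         'Phoebe': 'F', 'Chandler': 'M', 'Ross': 'M'}

# Hypothetical summaries; the real ones come from `summaries.txt`.
summaries = (
    "Ross meets Emily. Then Emily calls Paolo's friend about the TV.",
    "Rachel sees Emily in London. Later Monica helps Janice.",
)

boundary = re.compile(r"[.!?]+\s+")
side = Counter()

for summary in summaries:
    for sentence in boundary.split(summary):
        # Skip the sentence-initial word: it is capitalized regardless.
        for word in sentence.split()[1:]:
            word = word.strip(".,;:!?\"()")   # drop surrounding punctuation
            word = re.sub(r"'s$", "", word)   # drop a genitive ending
            if not word or not word[0].isupper():
                continue
            # Ignore all-caps strings (e.g. 'TV') and the lead characters.
            if word.isupper() or word in leads:
                continue
            side[word] += 1

print(side.most_common())
```

On this toy input "London" is counted alongside the real names, which shows why the result of such a heuristic still needs a manual check.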

## 4. A search engine (5 points)

Implement a (modest) search engine for our Friends dataset:
1. Invite the user of this notebook to insert a query string using `input()`. Use whitespace to split multiple query terms as a simple tokenization strategy.
2. Retrieve all episodes from the dataset in which at least one of the search terms is present in the summary column or the title column.
3. Improve your search engine: episodes in which we find more matches of the query terms should be ranked higher in the result.
4. A search command might not yield any results: make sure that your code keeps on asking the user for a new query string using a `while` loop.
5. **Advanced**: If the query did not yield any results, suggest to the user an alternative query that is somehow similar to the original one.

In [None]:
# code here
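
The five steps above can be sketched on a toy collection as follows. The `episodes` list, the function names and the use of `difflib.get_close_matches` for the suggestion are all assumptions made for illustration; the exercise leaves these choices open.

```python
import difflib
import re

# Hypothetical episodes as (title, summary) pairs.
episodes = [
    ("The One with the Wedding", "Ross says the wrong name at the wedding."),
    ("The One with the Cat", "Phoebe believes a cat is her mother."),
    ("The One with the Ring", "Chandler buys a ring for Monica."),
]


def tokens(text):
    """Lowercase `text` and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def search(query):
    """Return titles that match any query term, most matches first."""
    terms = query.lower().split()
    scored = []
    for title, summary in episodes:
        words = tokens(title + " " + summary)
        score = sum(words.count(term) for term in terms)
        if score:
            scored.append((score, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


def suggest(query):
    """Replace each query term with the closest word found in the episodes."""
    vocabulary = sorted({w for title, summary in episodes
                         for w in tokens(title + " " + summary)})
    suggestion = []
    for term in query.lower().split():
        close = difflib.get_close_matches(term, vocabulary, n=1)
        suggestion.append(close[0] if close else term)
    return " ".join(suggestion)


def ask(read=input):
    """Keep asking for a query until at least one episode is found."""
    while True:
        query = read("Search: ")
        results = search(query)
        if results:
            return results
        print(f"No results. Did you mean: {suggest(query)!r}?")
```

In a notebook, calling `ask()` would prompt the user through `input()`. Passing the reader in as a parameter is only a convenience that makes the loop easy to try out with scripted queries.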