# Exercises week 6

In these exercises, you will continue to work with the set of 19th-century novels from Project Gutenberg, specifically the 18 texts in the directory `data/gutenberg/training/`.

The goals for today are as follows:
- Create Pandas `Series` objects from lists and dictionaries
- Rename the labels in the index
- Sort the data
- Perform arithmetic on Series objects
- Make visualizations (bar plots)
- Advanced: smoothing

In [None]:
%matplotlib inline
import os
from glob import glob
from collections import Counter
import nltk
import pandas as pd

## 1. Plotting the readability of 19th-century fiction

### Creating Series objects

Take the readability results from week 5 (a dictionary mapping filenames to readability scores), and put them in a Series.
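As a small illustration of the two ways to build a Series, here is a sketch with made-up scores. The second filename and both numbers are hypothetical; your real values come from the week 5 code.

```python
import pandas as pd

# Hypothetical readability scores, for illustration only.
scores = {
    'data/gutenberg/training/austen-sense.txt': 12.5,
    'data/gutenberg/training/example-novel.txt': 11.0,
}

# From a dictionary: the keys become the index labels.
from_dict = pd.Series(scores)

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([12.5, 11.0])
```

A dictionary is the more convenient starting point here, because the filenames travel along with the scores as labels.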

In [None]:
# code from week 5 here

In [None]:
# create Series
readab = pd.Series(...)
# inspect what the Series looks like
readab

### Manipulating the index

Your filenames probably include the path, which makes them long and cumbersome: `data/gutenberg/training/austen-sense.txt` etc.

We can change those labels by assigning to the `.index` attribute of our Series object. First create a new list with exactly as many items as the original index, then assign that list to `.index`. For example:

```python
data = pd.Series([0, 1])
newindex = ['a', 'b']
data.index = newindex
```

To make our filenames shorter, we can re-use the function `remove_dir_ext` from Chapter 3 to remove the directories and the extension. Apply this function to all the items in the index. The goal is to have a `Series` object with clear and short names such as `austen-sense`.
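The pattern could look roughly like the sketch below. Note that `strip_dir_ext` is a hypothetical stand-in written for this example, not the `remove_dir_ext` function from Chapter 3, and the second filename and the values are made up.

```python
import os
import pandas as pd

def strip_dir_ext(filename):
    """Hypothetical helper: 'some/dir/name.txt' -> 'name'."""
    return os.path.splitext(os.path.basename(filename))[0]

data = pd.Series(
    [1.0, 2.0],  # made-up values
    index=['data/gutenberg/training/austen-sense.txt',
           'data/gutenberg/training/example-novel.txt'])

# Build a new list of labels of the same length, then assign it.
data.index = [strip_dir_ext(name) for name in data.index]
```

The list comprehension keeps the labels in their original order, so each value stays attached to the right novel.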

In [None]:
# your code here

### Sorting data

We can sort our data to easily see which book has the lowest and highest score. Pandas provides two methods for this:

```python
data.sort_index()
data.sort_values()
```

Try them both. Do you understand the difference? Note that these methods return a *new* sorted copy. If you want to keep the sorted version, you have to assign it:

```python
data = data.sort_values()
```
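A toy Series with made-up labels and values makes both the difference between the two methods and the copy behaviour visible:

```python
import pandas as pd

data = pd.Series({'b': 3, 'c': 1, 'a': 2})

by_label = data.sort_index()    # ordered by the labels: a, b, c
by_value = data.sort_values()   # ordered by the values: c, a, b

# The original is untouched, because both methods return a new Series.
original_order = list(data.index)
```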

Sort the Series with the readability scores by the scores.

In [None]:
# your code here

Now make a simple bar plot of the sorted scores. A horizontal bar plot fits best, because it makes the names of the novels easier to read.

In [None]:
# your code here

The bar plot is missing a label for the x-axis. Here's how to add it:

```python
ax = data.plot.barh()
ax.set_xlabel('My label')
```

Add the label to your plot. A good label would be "Readability (ARI)" which describes the quantity shown and the specific formula that was used.

In [None]:
# your code here

## 2. Relative sentiment

Use the sentiment scores you computed in the exercises of week 5. Put them in a Series, sort them, fix the index, and plot them.

In [None]:
# your code here

Now we will reconsider a question raised in the previous week's exercises:

- The books have different lengths. Is this a problem? If so, can you think of something to correct for this?

The answer is yes, we should fix this! We should first know the length of each book. Since the sentiment scores count tokens, the relevant length is the total number of tokens in a text.

Count the number of tokens in each file. Create a dictionary with a mapping of filenames to the number of tokens in that file. Put it into a Series object, just as you did for the sentiment scores.
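The shape of the result could look like the sketch below. It uses two made-up in-memory texts instead of files, and `str.split()` as a simplified stand-in for `nltk.word_tokenize`, so the counts will differ from what NLTK gives on real text.

```python
import pandas as pd

# Toy texts standing in for the contents of the novel files (made up).
texts = {
    'short': 'a happy day',
    'long': 'a sad and rainy day in town',
}

# Simplification: whitespace splitting instead of nltk.word_tokenize.
lengths = {name: len(text.split()) for name, text in texts.items()}

token_counts = pd.Series(lengths)
```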

In [None]:
# your code here

How can we now make the sentiment scores for different texts comparable? We divide the sentiment score of each text by its number of tokens:

$$\textrm{sentiment proportion}(\textrm{text}) = \frac{\textrm{sentiment score of text}}{\textrm{number of tokens in text}}$$

It turns out we can compute this very easily if we have two Series objects with the same index labels, because pandas matches the values by label:

```python
data1 = pd.Series(...)
data2 = pd.Series(...)
proportion = data1 / data2
```
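The following sketch, with made-up labels and numbers, shows why the index matters: pandas pairs values by label rather than by position, and a label that is missing from one of the two Series yields `NaN`.

```python
import pandas as pd

sentiment = pd.Series({'a': 10, 'b': 30})
tokens = pd.Series({'b': 300, 'a': 200, 'c': 50})  # different order, extra label

proportion = sentiment / tokens
# 'a' is 10 / 200, 'b' is 30 / 300, and 'c' is NaN because
# there is no sentiment score with that label.
```

If your result contains unexpected `NaN` values, check that you fixed the index of both Series in the same way.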

Apply this to the sentiment scores and the number of tokens to obtain a 'sentiment proportion' for each novel. Also sort and plot these scores. What differences do you see between the raw sentiment scores and the proportional sentiment scores?

In [None]:
# your code here

## 3. Sentiment arcs

Let's do something more advanced: we will try to plot the sentiment arc of a text. Instead of summing the sentiment scores into a single number, we will put the sentiment score of each token in a list. This will allow us to track the sentiment over time (at least over what we may call "text time").

Since this exercise is more advanced, the code is given. Read the code, run it, and try to analyze the results.

We pick Sense and Sensibility to analyze. We first load its tokens into a list:

In [None]:
with open('data/gutenberg/training/austen-sense.txt', encoding='utf8') as inp:
    austen = nltk.word_tokenize(inp.read().lower())

Now we will create a list that contains one number for each token:

- -1 if it is a negative word
- 1 if it is a positive word
- 0 if it is neither


In [None]:
with open('data/positive-words.txt', encoding='utf8') as inp:
    positive_words = set(inp.read().splitlines())
with open('data/negative-words.txt', encoding='utf8') as inp:
    negative_words = set(inp.read().splitlines())

def sentiment_arc(filename, positive_words, negative_words):
    with open(filename, encoding='utf8') as inp:
        tokens = nltk.word_tokenize(inp.read().lower())
    sentiment = []
    for token in tokens:
        if token in positive_words:
            sentiment.append(1)
        elif token in negative_words:
            sentiment.append(-1)
        else:
            sentiment.append(0)
    return sentiment

austen_sentim = pd.Series(sentiment_arc(
        'data/gutenberg/training/austen-sense.txt',
        positive_words, negative_words))
austen_sentim

This Series provides the raw data for a "sentiment arc", but we need to zoom out from individual words to the sentiment over longer stretches of text, say chunks of 5000 tokens.

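We can do this by applying some advanced Pandas magic, namely a "rolling sum". For each token, we sum the sentiment scores over a window of 5000 tokens ending at that token (the token itself and the 4999 before it). The first 4999 positions have no complete window and get the value `NaN`: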

In [None]:
austen_arc = austen_sentim.rolling(5000).sum()
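To see what a rolling sum does, it helps to try it on a tiny Series first. This sketch uses made-up scores and a window of 3 instead of 5000:

```python
import pandas as pd

toy = pd.Series([1, 0, -1, 1, 1])
toy_arc = toy.rolling(3).sum()

# Positions 0 and 1 are NaN: there are not yet 3 values to sum.
# Position 2 is 1 + 0 + (-1) = 0
# Position 3 is 0 + (-1) + 1 = 0
# Position 4 is (-1) + 1 + 1 = 1
```

Applied to `austen_sentim` with a window of 5000, the same operation produces `austen_arc`.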

Plotting `austen_arc` gives us a nice arc:

In [None]:
ax = austen_arc.plot();
ax.set_xlabel('text time')
ax.set_ylabel('sentiment score')
austen_arc

The question now is: is this plot meaningful? Try to look at one or more of the peaks, and see if you can trace it back to a particular part of the novel and perhaps particular events.

Note that the numbers on the x-axis are token numbers. If you want to look at the context around token 10000, for example, you can do so as follows:

In [None]:
print(' '.join(austen[10000:10100]))
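Instead of reading a peak off the plot by eye, you could also let Pandas find it. The sketch below is one possible approach on a toy arc with a window of 3; the variable names are made up for this example. Because each rolling sum covers the window ending at its position, the tokens responsible for a peak lie in the stretch before it.

```python
import pandas as pd

window = 3  # the exercise uses 5000
toy_arc = pd.Series([0, 1, 0, -1, 1, 1, 1, 0]).rolling(window).sum()

# Label (token number) of the highest rolling sum; NaN values are skipped.
peak = toy_arc.idxmax()

# First token of the window that produced the peak.
start = max(0, peak - window + 1)
```

For the novel, the corresponding slice would be `austen[start:peak + 1]`, with `window = 5000` and `peak = austen_arc.idxmax()`; use `idxmin()` in the same way to find the lowest point.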