# Introduction to Text Analysis

## Challenge 1: Regex parsing

Using the regex pattern `p` above, print the `set` of unique character names in *Monty Python*:

In [None]:
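# Each match is a tuple whose first element is the speaker's name
matches = re.findall(p, document)
# A set comprehension keeps one copy of each name
names = {x[0] for x in matches}
print(names, len(names))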

You should have 84 different characters.
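
The pattern `p` is defined earlier in the notebook and is not repeated here. As a rough sketch of why `x[0]` picks out the speaker, the example below assumes a hypothetical pattern `p_demo` with two capture groups and a toy excerpt in a `NAME: dialogue` layout. Both are illustrative stand-ins, not the notebook's actual pattern or data.

```python
import re

# Toy excerpt in a "NAME: dialogue" layout (an assumption about the script format).
sample = (
    "ARTHUR: Whoa there!\n"
    "SOLDIER #1: Halt! Who goes there?\n"
    "ARTHUR: It is I, Arthur.\n"
)

# Hypothetical stand-in for `p`: group 1 is the speaker, group 2 is the spoken line.
p_demo = r"(?m)^([A-Z][A-Z #0-9]*): (.+)$"

# With two capture groups, re.findall returns a list of (name, line) tuples.
demo_matches = re.findall(p_demo, sample)

# Taking the first element of each tuple and deduplicating gives the speakers.
demo_names = {name for name, _ in demo_matches}

print(demo_matches[0])
print(sorted(demo_names))
```

Because `re.findall` returns one tuple per match when the pattern has several groups, indexing with `[0]` and `[1]` is all the later solutions need.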

Now use the `set` you made above to gather all dialogue into a character `dictionary`, where each key is a character name and each value is a list of that character's lines:

In [None]:
# char_dict["ARTHUR"] should give you a list of strings with his dialogue

# Solution 1

char_dict = {}

for name in names:
    lines = []
    for line in matches:
        if name == line[0]:
            lines.append(line[1])
    char_dict[name] = lines

In [None]:
# Solution 2 (list comprehension)

char_dict = {}

for name in names:
    char_dict[name] = [line[1] for line in matches if name == line[0]]

In [None]:
# Solution 3 (dictionary comprehension)

# Reuse `matches` rather than re-running the regex over the document for every name
char_dict = {name: [line[1] for line in matches if line[0] == name] for name in names}

In [None]:
char_dict["ARTHUR"]
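
The solutions above loop over the matches once per character name. A possible alternative, sketched here with `collections.defaultdict` on toy tuples (the names `toy_matches` and `toy_char_dict` are illustrative, not part of the notebook), builds the same kind of dictionary in a single pass:

```python
from collections import defaultdict

# Toy (name, line) tuples standing in for the notebook's `matches`.
toy_matches = [
    ("ARTHUR", "Whoa there!"),
    ("SOLDIER #1", "Halt! Who goes there?"),
    ("ARTHUR", "It is I, Arthur."),
]

# One pass over the matches; a name seen for the first time starts with an empty list.
toy_char_dict = defaultdict(list)
for name, line in toy_matches:
    toy_char_dict[name].append(line)

print(toy_char_dict["ARTHUR"])
```

This version does not need the `names` set at all, since the keys are created as the matches are read.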

## Challenge 2: Removing noise

Write a function below that takes a string as an argument and returns its tokens with punctuation and stopwords removed. (Hint: a good starting list of stopwords is available via `from nltk.corpus import stopwords`.)

In [None]:
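def rem_punc_stop(text_string):

    from string import punctuation
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Strip every punctuation character before tokenizing
    for char in punctuation:
        text_string = text_string.replace(char, "")

    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))

    toks = word_tokenize(text_string)
    toks_reduced = [x for x in toks if x.lower() not in stop_words]

    # Returns a list of tokens, not a re-joined string
    return toks_reduced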
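
To see the same idea without downloading any NLTK data, here is a minimal standard-library sketch. It assumes a tiny hand-picked stopword set (`DEMO_STOPWORDS`, hypothetical and far shorter than NLTK's English list) and uses a plain whitespace split as a crude stand-in for `word_tokenize`.

```python
import string

# Tiny hypothetical stopword list for illustration; NLTK's English list is much longer.
DEMO_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "i", "it"}


def rem_punc_stop_demo(text_string):
    # Delete every ASCII punctuation character in a single pass.
    no_punc = text_string.translate(str.maketrans("", "", string.punctuation))
    # Whitespace splitting is a crude stand-in for word_tokenize.
    return [tok for tok in no_punc.split() if tok.lower() not in DEMO_STOPWORDS]


demo_tokens = rem_punc_stop_demo("It is I, Arthur, King of the Britons!")
print(demo_tokens)  # ['Arthur', 'King', 'Britons']
```

`str.translate` removes all punctuation in one call, which avoids the repeated `replace` loop when the input is large.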

## Challenge 3: POS Frequency

Create a frequency distribution for Arthur's parts of speech:

In [None]:
# Flatten the tagged sentences and count each tag
tag_fd = nltk.FreqDist(tag for sent in tagged_sents for (word, tag) in sent)
tag_fd.most_common()
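
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be illustrated with the standard library alone. The sketch below uses a few hand-tagged toy sentences (`toy_tagged_sents`, invented for illustration) in the same list-of-lists-of-`(word, tag)` shape that the cell above expects from `tagged_sents`.

```python
from collections import Counter

# Hand-tagged toy sentences: each sentence is a list of (word, tag) pairs.
toy_tagged_sents = [
    [("I", "PRP"), ("am", "VBP"), ("Arthur", "NNP")],
    [("We", "PRP"), ("seek", "VBP"), ("the", "DT"), ("Grail", "NNP")],
]

# Walk every sentence, then every (word, tag) pair, and count only the tags.
toy_tag_counts = Counter(tag for sent in toy_tagged_sents for _, tag in sent)

print(toy_tag_counts.most_common())
```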

## Challenge 4: Sentiment

How about we look at all characters? Create an empty list `collected_stats`, iterate through `char_dict`, calculate the net polarity of each character, and append a tuple such as `("ARTHUR", 11.45)` to `collected_stats`:

In [None]:
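collected_stats = []
for name, lines in char_dict.items():
    # Join a character's lines into one blob so TextBlob can split it into sentences
    blob = TextBlob(' '.join(lines))
    # Net polarity is the sum of the polarity of every sentence
    net_pol = sum(sentence.sentiment.polarity for sentence in blob.sentences)
    collected_stats.append((name, net_pol))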

Now `sort` this list of tuples by polarity, and print the list of characters in *Monty Python* according to their sentiment:

In [None]:
sorted_stats = sorted(collected_stats, key=lambda x: x[1])
for t in sorted_stats:
    print(t[0], t[1])
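
As a small illustration of the summing and sorting steps, the sketch below uses made-up per-sentence polarity scores (not TextBlob output) and passes `reverse=True` to list the most positive character first. The names `toy_polarities`, `toy_stats`, and `toy_sorted` are illustrative only.

```python
# Made-up per-sentence polarity scores, purely for illustration (not TextBlob output).
toy_polarities = {
    "ARTHUR": [0.5, -0.25, 0.5],
    "BLACK KNIGHT": [-0.5, -0.25],
    "PATSY": [0.0],
}

# Net polarity is the sum of a character's sentence scores.
toy_stats = [(name, sum(scores)) for name, scores in toy_polarities.items()]

# Sort on the second tuple element; reverse=True puts the highest score first.
toy_sorted = sorted(toy_stats, key=lambda x: x[1], reverse=True)

for name, net_pol in toy_sorted:
    print(name, net_pol)
```

Because net polarity is a sum, characters with many lines can accumulate larger magnitudes than characters with few, which is worth keeping in mind when reading the ranking.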

## Challenge 5: word2vec

Play around with the word2vec model above and try to put into words exactly what the model does, and how one should interpret the results. How would you contrast this with the "bag of words" model?
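
As one possible starting point for the comparison, the sketch below builds bag-of-words vectors with the standard library and compares toy sentences by cosine similarity. The helper names `bag_of_words` and `cosine` and the example sentences are invented for illustration. A bag of words only rewards exact token overlap, whereas word2vec learns dense vectors from the contexts words appear in, so related words can end up close together even when they never match as strings.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Each distinct token becomes its own dimension; word order is discarded.
    return Counter(text.lower().split())


def cosine(a, b):
    # Counter returns 0 for missing keys, so only shared tokens contribute.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc1 = bag_of_words("the king rode north")
doc2 = bag_of_words("the queen rode south")
doc3 = bag_of_words("a monarch travelled")

print(cosine(doc1, doc2))  # shares "the" and "rode"
print(cosine(doc1, doc3))  # no shared tokens, although the meaning is related
```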