# Introduction to Text Analysis

## Challenge 1: Regex parsing

Using the regex pattern `p` above, print the `set` of unique characters in *Monty Python*:

In [10]:
matches = re.findall(p, document)
chars = set([x[0] for x in matches])
print(chars, len(chars))

{'INSPECTOR', 'DENNIS', 'KNIGHTS OF NI', 'PIGLET', 'SECOND BROTHER', ' GREEN KNIGHT', 'BLACK KNIGHT', 'CAMERAMAN', 'CARTOON CHARACTERS', 'FRENCH GUARDS', ' CRAPPER', 'LAUNCELOT', 'PATSY', 'LEFT HEAD', 'SIR GALAHAD', 'MIDDLE HEAD', 'VOICE', 'FATHER', 'ROGER THE SHRUBBER', 'ARTHUR', 'SIR ROBIN', 'CARTOON CHARACTER', 'AMAZING', 'OTHER FRENCH GUARD', 'TIM THE ENCHANTER', 'GUESTS', 'CRONE', 'CARTOON MONKS', 'S WIFE', 'PRINCE HERBERT', 'SIR LAUNCELOT', 'DINGO', 'KING ARTHUR', 'HISTORIAN', ' BLACK KNIGHT', 'MONKS', 'KNIGHTS', 'SUN', 'RANDOM', 'GOD', 'OLD MAN', ' BEDEVERE', 'BRIDGEKEEPER', ' PARTY', 'GIRLS', 'HEAD KNIGHT OF NI', 'MINSTREL', 'STUNNER', 'GALAHAD', 'FRENCH GUARD', 'KNIGHT', 'S FATHER', 'GUEST', 'ANIMATOR', 'CROWD', 'NARRATOR', 'WITCH', 'MAN', 'WOMAN', 'TIM', 'MASTER', 'DEAD PERSON', 'BEDEVERE', 'DIRECTOR', 'CONCORDE', 'WINSTON', 'LOVELY', 'GREEN KNIGHT', 'HEAD KNIGHT', 'ROGER', 'PRISONER', 'BROTHER MAYNARD', ' GIRLS', 'BORS', 'ALL HEADS', 'ZOOT', 'MAYNARD', 'CUSTOMER', 'ROBIN', '

You should have 84 different characters.

Now use the `set` you made above to gather all dialogue into a character `dictionary`, with each key being a character name and each value a list of that character's lines:

In [11]:
# char_dict["ARTHUR"] should give you a list of strings with his dialogue

char_dict = {}
for n in chars:
    # Escape the name so any regex metacharacters in it are matched literally
    char_dict[n] = re.findall(r'(?:' + re.escape(n) + r': )(.+)', document)

In [12]:
char_dict["ARTHUR"]

['Whoa there!  [clop clop clop] ',
 'It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!',
 'I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.',
 'Yes!',
 'What?',
 'So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--',
 'We found them.',
 'What do you mean?',
 'The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?',
 'Not at all.  They could be carried.',
 'It could grip it by the husk!',
 "Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.",
 'Please!',
 "I'm not interested!",
 'Will you ask your master if he wants to join my court at Camelot?!',
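
The dictionary-building step can be sketched on a small made-up transcript. The `toy_document` string and `toy_names` set below are hypothetical stand-ins for `document` and `chars`. The sketch also assumes each line of dialogue starts with `NAME: ` and anchors the pattern to the start of a line with `re.MULTILINE`, so that `ARTHUR` does not also pick up lines spoken by `KING ARTHUR`. Whether that anchoring suits the real script depends on its layout.

```python
import re

# Hypothetical miniature transcript standing in for `document`
toy_document = (
    "ARTHUR: Whoa there!\n"
    "GUARD: Halt! Who goes there?\n"
    "KING ARTHUR: It is I.\n"
    "ARTHUR: What?\n"
)
toy_names = {"ARTHUR", "GUARD", "KING ARTHUR"}

toy_dict = {}
for name in toy_names:
    # ^ with re.MULTILINE matches at the start of every line;
    # re.escape treats the name as literal text
    pattern = re.compile(r"^" + re.escape(name) + r": (.+)", re.MULTILINE)
    toy_dict[name] = pattern.findall(toy_document)

print(toy_dict["ARTHUR"])
```

Without the `^` anchor, the `ARTHUR` pattern would also match inside `KING ARTHUR: It is I.`, which is one reason overlapping names deserve a second look.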
 

## Challenge 2: Removing noise

Write a function below that takes a string as an argument and returns that string without punctuation or stopwords. (HINT: a good starting list of stopwords is available via `from nltk.corpus import stopwords`.)

In [18]:
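def rem_punc_stop(text_string):

    from string import punctuation
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))

    # Strip every punctuation character from the text
    for char in punctuation:
        text_string = text_string.replace(char, "")

    # Tokenize, then drop stopwords (case-insensitive); returns a list of tokens
    toks = word_tokenize(text_string)
    toks_reduced = [x for x in toks if x.lower() not in stop_words]

    return toks_reduced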
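
For comparison, the same idea can be sketched with only the standard library. The tiny `TOY_STOPWORDS` set and the whitespace tokenization below are simplifying assumptions for illustration, not what NLTK provides, and `rem_punc_stop_sketch` is a hypothetical name.

```python
from string import punctuation

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "an", "is", "it", "of", "and", "i"}

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans("", "", punctuation)

def rem_punc_stop_sketch(text_string):
    """Return the tokens of text_string without punctuation or stopwords."""
    cleaned = text_string.translate(_STRIP_PUNCT)
    # str.split() is a crude stand-in for a real tokenizer
    return [tok for tok in cleaned.split() if tok.lower() not in TOY_STOPWORDS]

print(rem_punc_stop_sketch("It is I, Arthur, son of Uther Pendragon!"))
```

Using `str.translate` removes all punctuation in a single pass, and a `set` makes each stopword lookup constant time.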

## Challenge 3: Sentiment

How about we look at all characters? Create an empty list `collected_stats`, iterate through `char_dict`, calculate the net polarity of each character, and append a tuple such as `('ARTHUR', 11.45)` to `collected_stats`:

In [34]:
collected_stats = []
for k in char_dict.keys():
    blob = TextBlob(' '.join(char_dict[k]))
    net_pol = 0
    for sentence in blob.sentences:
        pol = sentence.sentiment.polarity
        net_pol += pol
    collected_stats.append((k, net_pol))

Now `sort` this list of tuples by polarity, and print the list of characters in *Monty Python* according to their sentiment:

In [35]:
sorted_stats = sorted(collected_stats, key=lambda x: x[1])
for t in sorted_stats:
    print(t[0], t[1])

MASTER -3.075
TIM -1.4726488095238095
FRENCH GUARD -1.2197916666666666
CRONE -1.0
KNIGHT -0.9155092592592596
BLACK KNIGHT -0.8583333333333333
SIR ROBIN -0.6
ALL HEADS -0.5928571428571429
GUEST -0.25
PIGLET -0.2
CUSTOMER -0.19857142857142884
GREEN KNIGHT -0.18333333333333335
DENNIS -0.16186507936507938
 BLACK KNIGHT -0.08333333333333333
GOD -0.07619047619047614
GUESTS -0.07500000000000001
BORS -0.0627976190476191
ROGER -0.04999999999999999
 GIRLS 0.0
ARMY OF KNIGHTS 0.0
KNIGHTS OF NI 0.0
CARTOON MONKS 0.0
 GREEN KNIGHT 0.0
HEAD KNIGHT OF NI 0.0
CAMERAMAN 0.0
CARTOON CHARACTERS 0.0
FRENCH GUARDS 0.0
 CRAPPER 0.0
PATSY 0.0
SIR GALAHAD 0.0
VOICE 0.0
AMAZING 0.0
OTHER FRENCH GUARD 0.0
TIM THE ENCHANTER 0.0
PRINCE HERBERT 0.0
SIR LAUNCELOT 0.0
KING ARTHUR 0.0
 BEDEVERE 0.0
 PARTY 0.0
STUNNER 0.0
S FATHER 0.0
BROTHER MAYNARD 0.0
ANIMATOR 0.0
OLD CRONE 0.0
WINSTON 0.0
LOVELY 0.0
PRISONER 0.0
MONKS 0.0
SIR BEDEVERE 0.0
SUN 0.0
INSPECTOR 0.0
ROGER THE SHRUBBER 0.1
NARRATOR 0.10800865800865794
DI
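
Because net polarity is a sum over sentences, a character with many lines can end up far from zero simply by speaking more. One way to check this, sketched below with made-up per-sentence scores rather than real TextBlob output, is to compare the net with the mean polarity per sentence. All names and numbers here are hypothetical.

```python
# Hypothetical per-sentence polarity scores (not real TextBlob output)
toy_polarities = {
    "TALKATIVE": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    "TERSE": [0.5],
    "SILENT": [],
}

toy_stats = []
for name, scores in toy_polarities.items():
    net = sum(scores)
    # Guard against characters with no scored sentences
    mean = net / len(scores) if scores else 0.0
    toy_stats.append((name, net, mean))

# Sort by net polarity, then by mean polarity, to see how the rankings differ
by_net = [t[0] for t in sorted(toy_stats, key=lambda t: t[1])]
by_mean = [t[0] for t in sorted(toy_stats, key=lambda t: t[2])]
print(by_net)
print(by_mean)
```

In this toy data the talkative character ranks highest by net polarity but falls below the terse one by mean polarity, so the two orderings answer slightly different questions.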

## Challenge 4: word2vec

Play around with the word2vec model above and try to put into words exactly what the model does, and how one should interpret the results. How would you contrast this with the "bag of words" model?
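
As one starting point for the contrast, the sketch below compares two words under each representation. In a bag-of-words view every distinct word is its own dimension, so two different words never overlap. In a word2vec-style view words are dense vectors and similarity is measured by cosine. The three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Bag of words: each document is a count over a shared vocabulary
doc_a = "the king rode north".split()
doc_b = "the monarch rode north".split()
vocab = sorted(set(doc_a) | set(doc_b))
bow_a = [Counter(doc_a)[w] for w in vocab]
bow_b = [Counter(doc_b)[w] for w in vocab]
# "king" and "monarch" occupy separate dimensions and contribute nothing shared
print(vocab, bow_a, bow_b)

# Dense vectors: hypothetical, hand-written values (not from a trained model)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "monarch": [0.85, 0.75, 0.2],
    "swallow": [0.1, 0.2, 0.9],
}
print(cosine(toy_vectors["king"], toy_vectors["monarch"]))
print(cosine(toy_vectors["king"], toy_vectors["swallow"]))
```

With these invented values, `king` and `monarch` score much closer to each other than either does to `swallow`, whereas the bag-of-words vectors treat `king` and `monarch` as unrelated columns.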