## Harry Potter and the Data Scientist's Stone

Do spells in Harry Potter fall into different categories? Can we figure out what clusters they belong to, and maybe create new spells along those axes? Was Lord Voldemort really asexual? These are some of the mysteries I'll aim to uncover here.

In [18]:
%matplotlib inline 

import matplotlib.pyplot as plt
import nltk, re, os, codecs, string
import numpy as np
import pandas as pd
from sklearn import feature_extraction
import mpld3

Rabid Harry Potter fans have compiled a list of spells [here](http://harrypotter.wikia.com/wiki/List_of_spells) and [here](http://www.pojo.com/harrypotter/spelist.shtml), so I've gone ahead and scraped the pages for spell names and their meanings.

In [27]:
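# read the scraped spells, one tab-separated record per line
with open("spells.txt") as s:
    spells = s.read().lower().splitlines()

# each line holds three fields: name, variety, meaning
lines = [i.split('\t') for i in spells]

name = [i[0] for i in lines]
variety = [i[1] for i in lines]
# strip punctuation from the meanings (Python 3 form of str.translate)
strip_punct = str.maketrans("", "", string.punctuation)
meaning = [i[2].translate(strip_punct) for i in lines]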
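
To make the expected file layout concrete, here is a small sketch of the same parsing on an in-memory sample. It assumes each line of `spells.txt` holds three tab-separated fields (name, variety, meaning), which is what the indexing above implies. The sample text and the `parse_spells` helper are made up for illustration, and the code is written for Python 3.

```python
import string

# Hypothetical sample in the assumed layout: name <TAB> variety <TAB> meaning
sample = "Accio\tCharm\tSummons an object.\nAnapneo\tSpell\tClears the target's airway.\n"

# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans("", "", string.punctuation)

def parse_spells(raw):
    """Split raw text into lowercased (name, variety, meaning) tuples."""
    rows = []
    for line in raw.lower().splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, variety, meaning = line.split("\t")
        rows.append((name, variety, meaning.translate(strip_punct)))
    return rows

parsed = parse_spells(sample)
print(parsed)
```

Note that stripping punctuation also removes apostrophes, which is why "target's" shows up as "targets" in the table further down.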


Tokenize and stem the meanings, then have a peek with pandas.

In [24]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # keep only tokens that contain at least one letter (drops bare numbers and punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

stemmed = [tokenize_and_stem(i) for i in meaning]
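
The letter filter inside `tokenize_and_stem` is easy to try on its own. The sketch below applies the same regular expression to a hand-written token list; the `keep_wordlike` name and the tokens are illustrative and were not produced by nltk.

```python
import re

def keep_wordlike(tokens):
    """Keep tokens that contain at least one ASCII letter."""
    return [t for t in tokens if re.search("[a-zA-Z]", t)]

# Hand-written tokens, standing in for tokenizer output
tokens = ["shoots", "water", "from", "wand", ",", "3", "'s", "..."]
filtered = keep_wordlike(tokens)
print(filtered)
```

Tokens such as `'s` survive because they contain a letter, so this filter removes numbers and punctuation but not every non-word fragment.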

In [38]:
df = pd.DataFrame({'name' : name, 'variety' : variety, 'meaning' : meaning, 
                   'stemmed' : stemmed})
df.head()

| | meaning | name | stemmed | variety |
|---|---|---|---|---|
| 0 | resulting effect | incantation | [result, effect] | type of spell / charm |
| 1 | summons an object | accio | [summon, an, object] | charm |
| 2 | shoots water from wand | aguamenti | [shoot, water, from, wand] | charm |
| 3 | opens locked objects | alohomora | [open, lock, object] | charm |
| 4 | clears the targets airway | anapneo | [clear, the, target, airway] | spell |


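Now for the analysis. TF-IDF weights each word by how often it appears in a given text, scaled down by how many texts contain it, so distinctive words count for more than common ones. We can compare these weighted vectors to find out how similar different spells are.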
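
As a rough illustration of the weighting, here is a hand-rolled TF-IDF using the textbook form `tf * ln(N / df)`. This is a sketch, not what sklearn computes exactly: `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default. The three toy documents are stemmed meanings from the table above with stop words removed, and the `tfidf` function name is my own.

```python
import math
from collections import Counter

# Toy corpus: stemmed spell meanings without stop words
docs = [
    ["summon", "object"],
    ["open", "lock", "object"],
    ["shoot", "water", "wand"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tfidf(docs)
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

"object" appears in two of the three documents, so it gets a lower weight than "summon", which appears in only one.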

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

stopwords = nltk.corpus.stopwords.words('english')

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=200, min_df=0.1, 
        stop_words='english', use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(meaning)

km = KMeans(n_clusters=3)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

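What have we just achieved? With a quick stroll through sklearn, we've turned each meaning into a TF-IDF vector and grouped the spells into three clusters with k-means. In principle, a new spell's description could be vectorized the same way and assigned to the nearest cluster.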
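
For intuition about what k-means does, here is a minimal pure-Python sketch of Lloyd's algorithm on made-up 2D points. It starts from fixed centroids so the result is deterministic, whereas sklearn's `KMeans` picks its starting centroids with k-means++ by default. The points and the function names are illustrative only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, centroids, n_iter=10):
    """Run a fixed number of Lloyd iterations from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        labels = [nearest(p, centroids) for p in points]
        # update step: move each centroid to the mean of its members
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up points forming two obvious groups
points = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)

# Assigning an unseen point means finding its nearest centroid
print(nearest((0.8, 0.8), centroids))
```

Assigning a new item this way is the same idea as calling `km.predict` on a vectorized description.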

In [40]:
df['cluster'] = clusters
df.head()

| | meaning | name | stemmed | variety | cluster |
|---|---|---|---|---|---|
| 0 | resulting effect | incantation | [result, effect] | type of spell / charm | 0 |
| 1 | summons an object | accio | [summon, an, object] | charm | 2 |
| 2 | shoots water from wand | aguamenti | [shoot, water, from, wand] | charm | 0 |
| 3 | opens locked objects | alohomora | [open, lock, object] | charm | 2 |
| 4 | clears the targets airway | anapneo | [clear, the, target, airway] | spell | 0 |


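So it turns out the clustering was pretty weak: even in the first five rows, charms land in both cluster 0 and cluster 2, so the clusters don't line up with the spell varieties. Let's try a cool technique called latent Dirichlet allocation (LDA). Once we convert our data to a bag-of-words representation, LDA uses a Bayesian approach to find "hidden topics" that underlie our sentences.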

In [45]:
from gensim import corpora, models, similarities 

# remove stop words
text = [[word for word in line if word not in stopwords] for line in stemmed]

# drop words that appear in more than 80% of the meanings;
# no_below=1 keeps every word, however rare
dictionary = corpora.Dictionary(text)
dictionary.filter_extremes(no_below=1, no_above=0.8)

# convert each meaning to a sparse bag-of-words: a list of (word_id, count) pairs
corpus = [dictionary.doc2bow(t) for t in text]
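
To show what the bag-of-words step produces, here is a pure-Python sketch that mimics the dictionary filtering and the `(word_id, count)` output on three toy documents. It is an approximation for illustration: the `build_vocab` and `to_bow` helpers are my own, and gensim assigns its ids differently.

```python
from collections import Counter

# Toy corpus: stemmed spell meanings without stop words
texts = [
    ["summon", "object"],
    ["open", "lock", "object"],
    ["shoot", "water", "wand"],
]

def build_vocab(texts, no_below=1, no_above=0.8):
    """Map token -> integer id, filtering tokens by document frequency."""
    n_docs = len(texts)
    # number of documents that contain each token
    df = Counter(tok for doc in texts for tok in set(doc))
    kept = sorted(t for t, n in df.items()
                  if n >= no_below and n / n_docs <= no_above)
    return {tok: i for i, tok in enumerate(kept)}

def to_bow(doc, vocab):
    """Sparse bag-of-words: sorted (token_id, count) pairs for known tokens."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

vocab = build_vocab(texts)
bows = [to_bow(doc, vocab) for doc in texts]
print(vocab)
print(bows)
```

Each document becomes a short list of pairs rather than a full-length vector, which is why the representation is called sparse.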

In [60]:
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, 
                      update_every=5, chunksize=100, passes=50)

It's still a bit hazy, but we can start to make out a pattern: one topic looks like curses (violent spells directed against people), and the other looks like charms (spells that transfigure objects and generally aren't permanent). LDA is randomly initialized, so which topic is printed first can change between runs. We'll take another look with more data and more advanced models soon.

In [66]:
# in current gensim releases each topic comes back as (topic_id, [(word, probability), ...])
topics = lda.show_topics(num_topics=2, num_words=10, formatted=False)
for topic_id, word_probs in topics:
    # print just the words, most probable first
    print([str(word) for word, prob in word_probs], "\n")

['object', 'oppon', 'wand', 'make', 'counter', 'item', 'conjur', 'creat', 'unforgiv', 'lock'] 

['spell', 'reveal', 'victim', 'target', 'magic', 'protect', 'caus', 'stop', 'allow', 'user'] 

