# Building a cooccurrence matrix
A cooccurrence matrix is a simple way to show how characters relate to each other: it counts how many times one character appears alongside another. Here, if two characters are both tagged in the same fan fiction, I count that as a cooccurrence. So if a fan fiction's 'characters' field reads '[Harry P., Draco M.]', then the cell where row 'Harry' intersects column 'Draco' gets a point. The same is true for row 'Draco' and column 'Harry'. So let's build it!
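
To make the counting rule concrete, here is a minimal pure-Python sketch on three made-up stories. The `stories` list and the `counts` name are hypothetical and not part of the notebook's data; the sketch only illustrates the rule, not how the real matrix gets built.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: each inner list is one story's character tags.
stories = [
    ["Harry P.", "Draco M."],
    ["Harry P.", "Hermione G.", "Ron W."],
    ["Harry P."],
]

counts = Counter()
for tags in stories:
    unique = sorted(set(tags))
    # Diagonal cell: the character appears in this story.
    for name in unique:
        counts[(name, name)] += 1
    # Off-diagonal cells: every ordered pair, so the table stays symmetric.
    for a, b in permutations(unique, 2):
        counts[(a, b)] += 1

print(counts[("Harry P.", "Draco M.")])  # 1
print(counts[("Draco M.", "Harry P.")])  # 1
print(counts[("Harry P.", "Harry P.")])  # 3
```

Counting ordered pairs is what puts the same point in both the (Harry, Draco) and (Draco, Harry) cells.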

## Read in the data

In [1]:
import pandas as pd

data_directory = '../data/'

# here is the corrected data
filename = "corrected_data.json"

# take the transpose so that the column names are the metadata keys, not the story ids
df = pd.read_json(data_directory + filename).transpose()

# let's see what it looks like!
df.head()

Unnamed: 0,author_id,characters,genres,language,num_chapters,num_favs,num_follows,num_reviews,num_words,published,rated,status,title,updated
10000036,2846408,"[[Ron W., Hermione G.]]","[Romance, Humor]",English,1,5.0,2.0,3,5436,1389050085,T,Complete,Kilts and other adversities,1389050085
10000109,5232542,"[James P., Lily Evans P.]","[Humor, Romance]",English,1,4.0,,2,747,1389051817,K+,Complete,Of Dead Puppies and Dropped Pianos,1389051817
10000111,5438139,[],[],Spanish,1,7.0,6.0,5,400,1389051884,K+,Incomplete,Querido primo Harry,1389051884
10000114,5437478,"[Hermione G., Draco M.]","[Romance, Humor]",Spanish,1,11.0,13.0,8,1757,1389051911,K+,Incomplete,Conciertos en Hogwarts,1389051911
10000137,4626918,"[Bellatrix L., Luna L.]","[Humor, Horror]",English,3,,,3,659,1389052244,T,Complete,Crazy songs from crazy people,1389414254


Use this cell during development to work on a 1,000-row sample; run the cell after it instead to use all of the data.

In [103]:
dev = df.head(n=1000).copy()
len(dev)

1000

In [133]:
dev = df.copy()

## Clean data
Weed out the fan fictions that don't list which characters feature.

In [134]:
characters = dev[dev['characters'].apply(lambda x: len(x)>0)]
len(characters)

507475

Two helper functions. The 'characters' field can be either a list or a list of lists (but never nested deeper than that, so no list of lists of lists). We'll just flatten them. I think something like [['Ron W.', 'Hermione G.']] means that Ron and Hermione feature as a romantic pairing, though I'm not sure that's true. Regardless, whether it's [['Ron W.', 'Hermione G.']] or ['Ron W.', 'Hermione G.'], it counts as a point, so I'm fine with flattening it.

In [4]:
def isListOfLists(obj):
    return any(isinstance(el, list) for el in obj)

In [5]:
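def flatten_list(l):
    # Flatten one level of nesting; a list that is already flat is returned unchanged.
    if isListOfLists(l):
        returnList = []
        for sublist in l:
            if isinstance(sublist, list):
                # nested list: add each of its items
                returnList.extend(sublist)
            else:
                # bare item sitting next to nested lists
                returnList.append(sublist)
        return returnList
    else:
        return l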

Some small tests to make sure we are flattening correctly

In [74]:
test1 = [['hi', 'there']]
test2 = ['hi', 'there']
test3 = [['hi', 'there'],'joe']
test4 = [['hi', 'there'],'joe',['seeya','later']]
print(flatten_list(test1))
print(flatten_list(test2))
print(flatten_list(test3))
print(flatten_list(test4))

['hi', 'there']
['hi', 'there']
['hi', 'there', 'joe']
['hi', 'there', 'joe', 'seeya', 'later']


Now flatten the data!

In [135]:
flattened = characters['characters'].map(flatten_list)

## Begin building the matrix
We'll use the notion of 'dummies', which simply indicate whether or not a character was tagged in a fan fiction. The line below creates a column for each character: a 1 means the character is in that fan fiction, and a 0 means they are not.

In [136]:
# from https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
dummies = pd.get_dummies(flattened.apply(pd.Series).stack()).groupby(level=0).sum()
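
For a feel of what the dummies table looks like, here is a small sketch on made-up data. The `toy` series and its story ids are hypothetical, and it uses `Series.explode` instead of the `apply(pd.Series).stack()` idiom from the notebook cell; on already-flattened lists the two should give the same indicator table, but treat that as an assumption rather than a tested equivalence on the real data.

```python
import pandas as pd

# Hypothetical toy data: story id -> already-flattened character list.
toy = pd.Series({
    "s1": ["Harry P.", "Draco M."],
    "s2": ["Harry P.", "Hermione G.", "Ron W."],
    "s3": ["Ron W."],
})

# explode() gives one row per (story, character); get_dummies() turns the
# names into indicator columns; groupby(level=0).sum() collapses the rows
# back to one per story.
toy_dummies = pd.get_dummies(toy.explode()).groupby(level=0).sum()
print(toy_dummies)
```

Each row is a story and each column a character, so row `s1` has a 1 under 'Harry P.' and 'Draco M.' and a 0 everywhere else.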

Now that we have a dataframe of 1s and 0s, we can construct the cooccurrence matrix with a single dot product: the transpose of the dummies table times the dummies table itself.

In [137]:
# from https://stackoverflow.com/questions/20574257/constructing-a-co-occurrence-matrix-in-python-pandas
dummies_asint = dummies.astype(int)
coocc = dummies_asint.T.dot(dummies_asint)
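
Why the dot product works is easier to see on a tiny example. The indicator matrix `X` here is made up for illustration; entry (i, j) of `X.T.dot(X)` sums the products of columns i and j over all stories, and each product is 1 only when both characters are tagged in that story.

```python
import pandas as pd

# Hypothetical indicator matrix: rows are stories, columns are characters.
X = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]],
    index=["s1", "s2", "s3"],
    columns=["Harry P.", "Draco M.", "Ron W."],
)

# (characters x stories) times (stories x characters) -> characters x characters
toy_coocc = X.T.dot(X)
print(toy_coocc)
```

Harry and Draco share stories `s1` and `s3`, so their cell holds 2, and the result is symmetric because swapping i and j does not change the sum.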

In [139]:
coocc.head()

Unnamed: 0,Unnamed: 1,A. Dippet,A. Kirke,A. Lynch,A. Pye,A. Sinistra,Aberforth D.,Abraxas M.,Adrian P.,Alastor M.,...,Whomping Willow,William S.,William the Pukwudgie,Winky,Xenophilius L.,Yaxley,Zacharias S.,Zacharias S.] Megan J.,Zhang Fei,Úrsula F.
,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Dippet,0,24,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
A. Kirke,0,0,11,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Lynch,0,0,0,13,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Pye,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Using the matrix
Cooccurrence matrices have the nice property that each value along the diagonal (in other words, where a character intersects with itself) is the total number of fan fictions tagged with that character. So if we just want the total number of fan fictions about Harry...

In [138]:
coocc['Harry P.']['Harry P.']

151951

It looks like my scraping still left a few entries that aren't quite character names (such as the blank first column), so we can go and clean those up.

In [83]:
# drop the first column, whose name is blank (newer pandas needs the axis given by keyword)
coocc.drop(columns=coocc.columns[0], inplace=True)
coocc.head()

Unnamed: 0,A. Dippet,A. Kirke,A. Lynch,A. Pye,A. Sinistra,Aberforth D.,Abraxas M.,Adrian P.,Alastor M.,Albert R.,...,Whomping Willow,William S.,William the Pukwudgie,Winky,Xenophilius L.,Yaxley,Zacharias S.,Zacharias S.] Megan J.,Zhang Fei,Úrsula F.
,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Dippet,24,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
A. Kirke,0,11,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Lynch,0,0,13,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Pye,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We can also use the cooccurrence matrix to see which characters are most associated with each other. For example, to see the 8 characters who most often appear in fan fictions with Ron (the first entry is Ron himself, since the diagonal holds his total)...

In [284]:
coocc['Ron W.'].nlargest(8)

Ron W.         38037
Hermione G.    26687
Harry P.        9173
Draco M.        2612
Ginny W.        1646
OC               747
Luna L.          614
Rose W.          468
Name: Ron W., dtype: int64
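
If the character's own total gets in the way, one option is to drop the diagonal entry before ranking. This is a sketch on a made-up matrix (the `toy` frame and its numbers are hypothetical), not a cell from the original notebook.

```python
import pandas as pd

# Hypothetical toy matrix; the numbers are invented for illustration.
names = ["Ron W.", "Hermione G.", "Harry P."]
toy = pd.DataFrame(
    [[10, 7, 4],
     [7, 12, 5],
     [4, 5, 20]],
    index=names,
    columns=names,
)

# Remove Ron's own entry (the diagonal) and rank what is left.
partners = toy["Ron W."].drop("Ron W.").nlargest(2)
print(partners)
```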

Neat! Let's save that off.

In [285]:
coocc.to_csv(data_directory+'cooccurrence.csv')

## Building a character frequency table
This is easy now that we have the cooccurrence table!

In [192]:
char_freqs = []
for name in coocc.keys():
    char_freqs.append([name,coocc[name][name]])
char_freq_df = pd.DataFrame(char_freqs, columns=['name','count'])
char_freq_df.head()

Unnamed: 0,name,count
0,,1
1,A. Dippet,24
2,A. Kirke,11
3,A. Lynch,13
4,A. Pye,3
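
The same table can be sketched without a loop by reading the diagonal directly. This assumes the matrix is square with identical row and column labels in the same order; if a column has been dropped without its matching row, the diagonal no longer lines up with the names. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical square cooccurrence matrix with matching row/column labels.
names = ["A", "B", "C"]
toy = pd.DataFrame(
    [[3, 1, 0],
     [1, 2, 1],
     [0, 1, 4]],
    index=names,
    columns=names,
)

# np.diag pulls out the diagonal, i.e. each character's own total.
freq = pd.DataFrame({"name": toy.columns, "count": np.diag(toy.to_numpy())})
print(freq)
```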


In [263]:
sorted_char_freq = char_freq_df.sort_values(['count','name'], ascending=False)

In [264]:
top_200 = sorted_char_freq.head(n=200)

In [266]:
names = top_200.pop('name')
top_200.index = names
top_200.head()

Unnamed: 0_level_0,count
name,Unnamed: 1_level_1
Harry P.,151951
Hermione G.,127244
Draco M.,110555
Severus S.,50641
Lily Evans P.,46677


In [267]:
top_200['count']['Remus L.']

37010

## A bit more cleaning
As any Harry Potter fan knows, the Marauders and the Founders are groups of people, and it seems fanfiction.net has a separate tag for each group. Though interesting, that isn't very useful for a character frequency table. So let's credit each of the Marauders (Remus, Sirius, James, Peter) with one additional fan fiction for every fan fic tagged 'Marauders' (since, after all, it's about them!)

In [268]:
marauders_bonus = top_200['count']['Marauders']
marauders = ['Remus L.', 'Sirius B.', 'James P.', 'Peter P.']
for marauder in marauders:
    orig_val = top_200['count'][marauder]
    # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
    top_200.at[marauder, 'count'] = orig_val + marauders_bonus

And also for the founders...

In [269]:
founders_bonus = top_200['count']['Founders']
print(founders_bonus)
founders = ['Godric G.', 'Salazar S.', 'Helga H.', 'Rowena R.']
for founder in founders:
    orig_val = top_200['count'][founder]
    print(orig_val)
    # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
    top_200.at[founder, 'count'] = orig_val + founders_bonus

94
843
1300
559
790
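
The two cells above follow the same pattern, so it could be wrapped in a small helper. `fold_group_tag` is a hypothetical name, not something from the notebook or from pandas, and the counts are made up; unlike the cells above, this sketch also drops the group row in the same step and returns a new Series instead of modifying in place.

```python
import pandas as pd

def fold_group_tag(counts, group, members):
    """Add the group tag's count to each member, then remove the group row."""
    out = counts.copy()
    out.loc[members] = out.loc[members] + out.loc[group]
    return out.drop(group)

# Hypothetical counts, for illustration only.
toy_counts = pd.Series(
    {"Remus L.": 100, "Sirius B.": 120, "Marauders": 10, "Harry P.": 500}
)
folded = fold_group_tag(toy_counts, "Marauders", ["Remus L.", "Sirius B."])
print(folded)
```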


And of course for good old Tom, also known as Voldemort!

In [270]:
tom_riddle_count = top_200['count']['Tom R. Jr.']
voldemort_count = top_200['count']['Voldemort']
# DataFrame.set_value was removed in pandas 1.0; .at is the replacement
top_200.at['Voldemort', 'count'] = tom_riddle_count + voldemort_count
top_200.head(10)

Unnamed: 0_level_0,count
name,Unnamed: 1_level_1
Harry P.,151951
Hermione G.,127244
Draco M.,110555
Severus S.,50641
Lily Evans P.,46677
James P.,47897
Sirius B.,47487
Ginny W.,44599
Ron W.,38037
Remus L.,39764


Now we can get rid of the rows we just merged into other characters.

In [272]:
top = top_200.drop(['Marauders', 'Founders', 'Tom R. Jr.'])

In [273]:
len(top)

197

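Lastly, since the raw counts vary so widely, we'll convert them to each character's share of the total. Keep in mind that the total here is the sum of counts over the top characters only (the top 200 minus the three rows we dropped), and that the values are fractions between 0 and 1 rather than percentages out of 100.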

In [274]:
top = top.sort_values('count', ascending=False)

In [275]:
percents = top/top.sum()

In [276]:
percents.columns=['percent']

In [277]:
percents

Unnamed: 0_level_0,percent
name,Unnamed: 1_level_1
Harry P.,0.152031
Hermione G.,0.127311
Draco M.,0.110613
Severus S.,0.050668
James P.,0.047922
Sirius B.,0.047512
Lily Evans P.,0.046701
Ginny W.,0.044622
Remus L.,0.039785
Ron W.,0.038057
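
The normalization is just each count divided by the column total. Here is a sketch with made-up counts; the `toy_counts` frame and the `share` column name are hypothetical. Multiplying by 100 afterwards gives literal percentages.

```python
import pandas as pd

# Hypothetical counts, for illustration only.
toy_counts = pd.DataFrame(
    {"count": [60, 30, 10]},
    index=pd.Index(["Harry P.", "Hermione G.", "Draco M."], name="name"),
)

# Dividing by the column sum makes the shares add up to 1.
shares = (toy_counts / toy_counts.sum()).rename(columns={"count": "share"})

# Scale to 0-100 for a literal percentage.
percent = (shares["share"] * 100).round(1)
print(shares)
print(percent)
```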


In [278]:
percents.to_csv(data_directory+'char_freq_ff.csv')