## Cleaning the co-occurrence matrix
This notebook reads in the co-occurrence matrix and keeps only the top 50 characters. The result is saved as a CSV that is intended to be fed into a JavaScript script using d3 to generate some nice visuals. We want to minimize the size of the file that goes to d3, since the user's browser has to download that data and organize it. In other words, do as much of the preprocessing as possible now so that it won't have to be done over and over again later!

### Read in the data

In [10]:
import pandas as pd

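# index_col=0 uses the first column (the character names) as row labels,
# which the name-based lookups further down rely on.
df = pd.read_csv('../data/cooccurrence.csv', index_col=0)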

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,A. Dippet,A. Kirke,A. Lynch,A. Pye,A. Sinistra,Aberforth D.,Abraxas M.,Adrian P.,Alastor M.,...,Whomping Willow,William S.,William the Pukwudgie,Winky,Xenophilius L.,Yaxley,Zacharias S.,Zacharias S.] Megan J.,Zhang Fei,Úrsula F.
,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Dippet,0,24,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
A. Kirke,0,0,11,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Lynch,0,0,0,13,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Pye,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Remove some dirty data
Something in the web-parsing script generated a row of NaNs and an `Unnamed` column, so let's get rid of them.

In [22]:
df.drop(df.index[0], inplace=True)

In [23]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [24]:
df.head()

Unnamed: 0,A. Dippet,A. Kirke,A. Lynch,A. Pye,A. Sinistra,Aberforth D.,Abraxas M.,Adrian P.,Alastor M.,Albert R.,...,Whomping Willow,William S.,William the Pukwudgie,Winky,Xenophilius L.,Yaxley,Zacharias S.,Zacharias S.] Megan J.,Zhang Fei,Úrsula F.
A. Dippet,24,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
A. Kirke,0,11,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Lynch,0,0,13,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Pye,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A. Sinistra,0,0,0,0,175,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
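The cleanup above drops the junk row by position and the junk column by its exact name. If those ever shift between scrapes, a label-based version may be more robust. Here is a minimal sketch on a toy frame, assuming the junk row has a NaN index label and the junk columns start with `Unnamed`; `clean_matrix` is a hypothetical helper, not something from the original pipeline, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def clean_matrix(df):
    """Drop rows whose index label is NaN and columns named 'Unnamed...'."""
    # Keep only rows that have a real (non-NaN) label.
    df = df.loc[df.index.notna()]
    # Keep only columns whose name does not start with 'Unnamed'.
    keep = [c for c in df.columns if not str(c).startswith('Unnamed')]
    return df[keep]


# Toy stand-in for the scraped matrix (values are made up).
toy = pd.DataFrame(
    [[1, 0, 0], [0, 24, 1], [0, 1, 11]],
    index=[np.nan, 'A. Dippet', 'A. Kirke'],
    columns=['Unnamed: 0', 'A. Dippet', 'A. Kirke'],
)
cleaned = clean_matrix(toy)
print(cleaned)
```

Unlike the `inplace=True` calls, this returns a new frame, so the raw data stays around if you want to inspect what was removed.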


### Getting the most popular characters
A nice feature of co-occurrence matrices is that each diagonal entry is the total number of fan fictions featuring that character. There's probably a clever way to do this, but I just adjusted the fan fiction count threshold until I got about 50 characters (1700 turned out to give exactly 50). The names of all the less popular characters are added to a list so they can be deleted from the dataframe in the next step.

In [33]:
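to_delete = []
for name in df.columns:
    # The diagonal entry is the number of fan fictions featuring this character.
    if df.loc[name, name] < 1700:
        to_delete.append(name)
    else:
        print(name)
# Number of characters that survive the cut.
print(len(df.columns) - len(to_delete))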

Albus D.
Albus S. P.
Andromeda T.
Angelina J.
Astoria G.
Bellatrix L.
Bill W.
Blaise Z.
Cedric D.
Charlie W.
Cho C.
Draco M.
Fleur D.
Fred W.
George W.
Ginny W.
Harry P.
Hermione G.
James P.
James S. P.
Katie B.
Lily Evans P.
Lily Luna P.
Lucius M.
Luna L.
Marauders
Minerva M.
Molly W.
N. Tonks
Narcissa M.
Neville L.
OC
Oliver W.
Pansy P.
Percy W.
Peter P.
Petunia D.
Regulus B.
Remus L.
Ron W.
Rose W.
Scorpius M.
Seamus F.
Severus S.
Sirius B.
Teddy L.
Theodore N.
Tom R. Jr.
Victoire W.
Voldemort
50
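One possible "clever way" to avoid tuning the threshold by hand is to rank characters by their diagonal entry and keep the top N directly. The sketch below is only an illustration on a toy matrix with made-up counts. It assumes the matrix is square with rows and columns in the same order, and `top_n_characters` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def top_n_characters(df, n):
    """Return the sub-matrix for the n characters with the largest diagonal."""
    # Diagonal entry = number of stories featuring that character.
    totals = pd.Series(np.diag(df.to_numpy()), index=df.index)
    # Pick the n biggest totals, then restore the original ordering.
    winners = set(totals.nlargest(n).index)
    keep = [name for name in df.index if name in winners]
    return df.loc[keep, keep]


# Toy symmetric matrix; the counts are made up.
names = ['Harry P.', 'Winky', 'Ron W.', 'A. Pye']
toy = pd.DataFrame(
    [[900, 2, 400, 0],
     [2, 30, 1, 0],
     [400, 1, 700, 0],
     [0, 0, 0, 3]],
    index=names, columns=names,
)
top2 = top_n_characters(toy, 2)
print(top2)
```

On the real data this would be called with `n=50`. Note that `nlargest` breaks ties by keeping the first occurrence, so the cut is always exactly `n` rows.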


In [34]:
# Drop each unpopular character from both the rows and the columns in one call.
df.drop(index=to_delete, columns=to_delete, inplace=True)

In [35]:
df

Unnamed: 0,Albus D.,Albus S. P.,Andromeda T.,Angelina J.,Astoria G.,Bellatrix L.,Bill W.,Blaise Z.,Cedric D.,Charlie W.,...,Rose W.,Scorpius M.,Seamus F.,Severus S.,Sirius B.,Teddy L.,Theodore N.,Tom R. Jr.,Victoire W.,Voldemort
Albus D.,10372,25,6,1,1,51,3,4,11,6,...,7,17,7,1948,265,9,9,321,0,445
Albus S. P.,25,6693,0,0,11,7,1,2,0,5,...,1580,3316,2,149,22,169,3,23,40,15
Andromeda T.,6,0,2658,1,6,669,4,0,1,2,...,1,7,0,19,194,179,0,4,14,16
Angelina J.,1,0,1,1735,2,0,2,3,8,11,...,1,2,2,4,3,3,0,0,0,0
Astoria G.,1,11,6,2,2357,4,1,69,3,1,...,27,159,4,9,2,10,43,5,5,3
Bellatrix L.,51,7,669,0,4,7242,5,1,2,17,...,5,7,3,304,1037,10,2,144,0,1255
Bill W.,3,1,4,2,1,5,1808,0,3,213,...,3,1,1,26,21,29,3,1,68,2
Blaise Z.,4,2,0,3,69,1,0,3975,8,16,...,5,13,38,41,10,1,305,12,3,10
Cedric D.,11,0,1,8,3,2,3,8,2426,11,...,0,0,4,18,15,1,4,13,0,14
Charlie W.,6,5,2,11,1,17,213,16,11,2655,...,3,11,0,39,22,13,13,1,5,5


And that's it! Now let's save it off and let d3 run with it.

In [36]:
df.to_csv('../data/cooccurrences-min.csv')
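Before handing the file to d3, it can be worth reading it back to confirm that the labels survived the round trip and the matrix is still square and symmetric. This is a rough sketch using a toy frame and a temporary directory rather than the real `../data` path; the counts are made up.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the trimmed matrix (counts are made up).
names = ['Albus D.', 'Voldemort']
toy = pd.DataFrame([[100, 7], [7, 80]], index=names, columns=names)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cooccurrences-min.csv')
    toy.to_csv(path)
    # index_col=0 restores the character names as row labels.
    reloaded = pd.read_csv(path, index_col=0)

# Same labels on both axes, in the same order.
is_square = list(reloaded.index) == list(reloaded.columns)
# Co-occurrence counts should not depend on the order of the pair.
is_symmetric = bool((reloaded.to_numpy() == reloaded.to_numpy().T).all())
print(is_square, is_symmetric)
```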