# Building a Master Dictionary of Valid Words

We'll build a dictionary of words and phrases that could be valid clues for Codenames.  There are two primary categories:
* words such as cat, boat and king
* proper nouns, which may be more than one word, such as names (George Washington) or titles (Animal Farm)

In [1]:
import json
import re
import pandas as pd

# Words

We'll take words from the [ConceptNet](http://conceptnet.io/) database.  ConceptNet is a network that connects words, phrases and concepts.  This will be very useful to us later on, but for now we'll just use it to build our dictionary.

First, load the data, which was found here: https://github.com/commonsense/conceptnet5/wiki/Downloads

In [2]:
df = pd.read_csv('dictionary/conceptnet-assertions-5.5.5.csv.gz', sep='\t', header=None, names=['uri', 'relation', 'from', 'to', 'json'])
df.head()

Unnamed: 0,uri,relation,from,to,json
0,"/a/[/r/Antonym/,/c/ab/агыруа/n/,/c/ab/аҧсуа/]",/r/Antonym,/c/ab/агыруа/n,/c/ab/аҧсуа,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
1,"/a/[/r/Antonym/,/c/adx/thəχ_kwo/a/,/c/adx/ʂap_...",/r/Antonym,/c/adx/thəχ_kwo/a,/c/adx/ʂap_wə,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."
2,"/a/[/r/Antonym/,/c/adx/tok_po/a/,/c/adx/ʂa_wə/]",/r/Antonym,/c/adx/tok_po/a,/c/adx/ʂa_wə,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."
3,"/a/[/r/Antonym/,/c/adx/ʂa_wə/a/,/c/adx/tok_po/]",/r/Antonym,/c/adx/ʂa_wə/a,/c/adx/tok_po,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."
4,"/a/[/r/Antonym/,/c/adx/ʂap_wə/a/,/c/adx/thəχ_k...",/r/Antonym,/c/adx/ʂap_wə/a,/c/adx/thəχ_kwo,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."


`from` and `to` are the words that are connected and `relation` is how they are connected.  For now we'll just consider the `from` column. We can use the metadata on the word to only keep English words (`en`), and then we'll strip off the metadata.

In [3]:
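df = df[['from', 'json']].drop_duplicates(subset='from')
# URIs look like /c/<language>/<term>/...; expand=False makes extract() return a
# Series, which can be compared to 'en' and used directly as a row mask
lang = df['from'].str.extract(r'/./([^/]*)/.*', expand=False)
df = df[lang == 'en'].copy()
df['from'] = df['from'].str.extract(r'/./[^/]*/([^/]*).*', expand=False)
df.head()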

  
Unnamed: 0,from,json
7469,0,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."
7470,12_hour_clock,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7471,24_hour_clock,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7472,5,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7473,a.c,"{""dataset"": ""/d/wiktionary/fr"", ""license"": ""cc..."
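
To make the two regular expressions in the cell above concrete, here is a minimal standalone sketch of the same parsing applied to single URIs. It assumes the `/c/<language>/<term>/...` layout visible in the sample rows; `parse_concept_uri` is a hypothetical helper name, not part of ConceptNet or pandas.

```python
import re

# Assumed layout: /c/<language>/<term>[/<extra>...], as in the rows shown above.
CONCEPT_URI = re.compile(r'^/./([^/]*)/([^/]*)')

def parse_concept_uri(uri):
    """Return (language, term) for a concept URI, or None if it does not match."""
    match = CONCEPT_URI.match(uri)
    return match.groups() if match else None

# One English URI and one non-English URI taken from the first output table.
print(parse_concept_uri('/c/en/12_hour_clock'))
print(parse_concept_uri('/c/adx/tok_po/a'))
```

Keeping only the rows whose first element is `en` is then the same filter the cell applies with pandas.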


There are several different [knowledge sources](https://github.com/commonsense/conceptnet5/wiki/Knowledge-sources) for ConceptNet.  The two we'll keep are the English Wiktionary and Verbosity.

In [4]:
# Keep: wiktionary and verbosity
df = df[
    df['json'].str.contains('verbosity') | df['json'].str.contains('/d/wiktionary/en')
]
df.head()

Unnamed: 0,from,json
7470,12_hour_clock,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7471,24_hour_clock,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7472,5,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7474,a.m,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
7477,ab_extra,"{""dataset"": ""/d/wiktionary/en"", ""license"": ""cc..."
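
The substring checks above do the job, but they match anywhere in the JSON text. A slightly stricter variant, sketched here, parses the metadata and compares the `dataset` field directly. The value `/d/wiktionary/en` appears in the rows above; `/d/verbosity` is my assumption for the Verbosity identifier, and `keep_edge` is a hypothetical helper name.

```python
import json

# '/d/wiktionary/en' is visible in the output above; '/d/verbosity' is assumed.
KEEP_DATASETS = {'/d/wiktionary/en', '/d/verbosity'}

def keep_edge(metadata_json, keep=KEEP_DATASETS):
    """Return True if the edge metadata names one of the datasets we keep."""
    return json.loads(metadata_json).get('dataset') in keep

print(keep_edge('{"dataset": "/d/wiktionary/en"}'))
print(keep_edge('{"dataset": "/d/wiktionary/fr"}'))
```

Something like `df[df['json'].apply(keep_edge)]` would apply it to the frame, though parsing every row is likely to be slower than the substring test.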


Finally, we'll remove any multi-word rows (ConceptNet joins the words of a phrase with underscores):

In [5]:
words = df['from']
words = words[~words.str.contains('_')].drop_duplicates()
print(words.head())
print()
print("Number of words:", len(words))

7472            5
7474          a.m
7479    abactinal
7480      abandon
7482     abapical
Name: from, dtype: object

Number of words: 492938


And we're left with 492938 "words".

# Proper Nouns

Next, we'll build a list of valid proper nouns.  ConceptNet does include some from [DBpedia](http://wiki.dbpedia.org/), but we'll go straight to the source to get a more complete list.  DBpedia is a project to extract structured data from Wikipedia.  They release many different datasets here: http://wiki.dbpedia.org/downloads-2016-10

The one that we'll use right now is the English Language instance type data.  For each article on Wikipedia, it gives a label saying what the article is about, such as a person, place or book.  We can keep only the types whose objects are likely to be proper nouns.

First, we read in the data.  It comes as a `ttl` file which I found easiest to just parse line-by-line.

In [6]:
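rows = []
with open('dictionary/instance_types_en.ttl', 'r', encoding="utf8") as f:
    f.readline()  # skip the first line, which is not a data row
    for line in f:
        try:
            # each data line has the form "<subject> <predicate> <object> ."
            parts = line.split(' ')
            # keep the last path segment of the subject and the full type URI
            rows.append((parts[0].split('/')[-1], parts[2]))
        except IndexError:
            print(line)  # show any line with fewer than three fields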
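
As an alternative to splitting on spaces, each line can be matched against the `<subject> <predicate> <object> .` shape, which also rejects comment lines. This is a sketch under the assumption that every data line consists of three angle-bracketed URIs; the example line is illustrative and `parse_instance_type` is a hypothetical helper name.

```python
import re

# Assumed line shape: three <...> URIs followed by a terminating dot.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_instance_type(line):
    """Return (article_name, type_uri) for a triple line, or None otherwise."""
    match = TRIPLE.match(line)
    if match is None:
        return None
    subject, _predicate, type_uri = match.groups()
    # The article name is the last path segment of the subject URI.
    return subject.rsplit('/', 1)[-1], type_uri

# Illustrative line, not copied from the dump.
example = ('<http://dbpedia.org/resource/Autism> '
           '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
           '<http://dbpedia.org/ontology/Disease> .')
print(parse_instance_type(example))
```

Because the brackets are consumed by the pattern, the names come out without the trailing `>` that the split-based version leaves behind.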

The two columns of interest are the object name and the type:

In [7]:
df = pd.DataFrame.from_records(rows[:-1], columns=['object', 'type'])
df.head(10)

Unnamed: 0,object,type
0,Anarchism>,<http://www.w3.org/2002/07/owl#Thing>
1,Achilles>,<http://www.w3.org/2002/07/owl#Thing>
2,Autism>,<http://dbpedia.org/ontology/Disease>
3,Alabama>,<http://dbpedia.org/ontology/AdministrativeReg...
4,Abraham_Lincoln>,<http://dbpedia.org/ontology/OfficeHolder>
5,Abraham_Lincoln__1>,<http://dbpedia.org/ontology/TimePeriod>
6,Abraham_Lincoln__2>,<http://dbpedia.org/ontology/TimePeriod>
7,Abraham_Lincoln__3>,<http://dbpedia.org/ontology/TimePeriod>
8,An_American_in_Paris>,<http://www.w3.org/2002/07/owl#Thing>
9,Animalia_(book)>,<http://dbpedia.org/ontology/Book>


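After keeping only the DBpedia ontology types and cleaning up the names (dropping the brackets, `__N` suffixes, underscores and parenthetical qualifiers), we're left with: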

In [8]:
df = df[df['type'].str.contains('dbpedia.org/ontology')].copy()
# str.strip() takes a set of characters, not a regex, so '<>' is enough.
# Patterns passed to str.replace() need regex=True in pandas >= 2.0.
df['object'] = (
    df['object'].str.strip('<>')
    .str.split('__', expand=True)[0]
    .str.replace('_', ' ')
    .str.replace(r'\(.*\)', '', regex=True)
    .str.strip()
)
df = df.drop_duplicates()
df['type'] = df['type'].str.strip('<>').str.split('/', expand=True)[4]
df['object'] = df['object'].str.replace(r'\s+', ' ', regex=True)
df['object'] = df['object'].str.replace('%22', '"')
df['object'] = df['object'].str.replace('%3F', '?')
df = df[~(df['object'] == '')].copy()

df.head(10)

Unnamed: 0,object,type
2,Autism,Disease
3,Alabama,AdministrativeRegion
4,Abraham Lincoln,OfficeHolder
5,Abraham Lincoln,TimePeriod
9,Animalia,Book
10,Academy Awards,Award
11,Actrius,Film
13,Allan Dwan,Person
14,Allan Dwan,PersonFunction
15,Alain Connes,Scientist


There are hundreds of nuanced instance types.  To figure out which might be "proper noun" types, I took a partially manual approach.  I generated stats on the number of words, number of capitalized words, etc. and then by hand sorted through them to decide which ones seemed to consist of proper nouns and which didn't.

In [9]:
def countUpper(x):
    # number of capitalized words, minus one for the first word of the title
    words = x.split(' ')
    return len([w for w in words if w[:1].isupper()]) - 1

def countWords(x):
    return len(x.split(' '))

def countNumbers(x):
    return len(re.findall(r'[0-9]', x))

df['num_upper'] = df['object'].apply(countUpper)
df['num_words'] = df['object'].apply(countWords)
df['num_numbers'] = df['object'].apply(countNumbers)
# share of the words after the first that are capitalized
df['percent_upper'] = df['num_upper']/(df['num_words'] - 1)

# numeric_only=True skips the string 'object' column (required in pandas >= 2.0)
type_stats = df[df['num_words'] > 1].groupby('type').mean(numeric_only=True).sort_values('percent_upper', ascending=False)
type_stats.to_csv('dictionary/type_stats.csv')

type_stats

type,num_upper,num_words,num_numbers,percent_upper
BoxingLeague,3.000000,4.000000,0.000000,1.000000
ChemicalElement,1.000000,2.000000,0.000000,1.000000
HorseTrainer,1.580247,2.580247,0.000000,1.000000
VideogamesLeague,2.000000,3.000000,0.000000,1.000000
LacrossePlayer,1.023148,2.023148,0.000000,1.000000
Guitarist,1.006849,2.006849,0.000000,1.000000
SpeedwayLeague,2.153846,3.153846,0.000000,1.000000
SnookerChamp,1.000000,2.000000,0.000000,1.000000
VoiceActor,1.001873,2.001873,0.000000,1.000000
BowlingLeague,2.000000,3.000000,0.000000,1.000000


After sorting through them manually I ended up with the 336 types in [`proper_noun_types.csv`](proper_noun_types.csv).

In [10]:
proper_noun_types = pd.read_csv('proper_noun_types.csv')
proper_noun_types

Unnamed: 0,proper_nouns
0,CricketLeague
1,SpaceShuttle
2,Guitarist
3,HorseTrainer
4,LacrossePlayer
5,MixedMartialArtsLeague
6,CurlingLeague
7,Stream
8,SumoWrestler
9,SpeedwayLeague


There's still a little more filtering and cleaning to go.  First, I filtered down to the proper noun types.  I also noticed that many `TimePeriod` entries were names, so I kept that type but removed the entries containing commas (more on that in a second).

In [11]:
proper_nouns = df[
    df['type'].isin(proper_noun_types['proper_nouns']) &
    ~((df['type'] == 'TimePeriod') & (df['object'].str.contains(',')))
].copy()  # copy so that later in-place edits don't raise SettingWithCopyWarning
proper_nouns

Unnamed: 0,object,type,num_upper,num_words,num_numbers,percent_upper
3,Alabama,AdministrativeRegion,0,1,0,
4,Abraham Lincoln,OfficeHolder,1,2,0,1.000000
5,Abraham Lincoln,TimePeriod,1,2,0,1.000000
9,Animalia,Book,0,1,0,
10,Academy Awards,Award,1,2,0,1.000000
11,Actrius,Film,0,1,0,
13,Allan Dwan,Person,1,2,0,1.000000
14,Allan Dwan,PersonFunction,1,2,0,1.000000
15,Alain Connes,Scientist,1,2,0,1.000000
16,Aristotle,Philosopher,0,1,0,


Then I wanted to clean up article titles that included commas, such as place names (Rome, Ohio) or people with titles.  So instead of an entry for "Rome, Ohio", the dictionary would just have "Rome".

Again, through a manual process I identified the following types to do the split on.

In [12]:
split = [
    'Settlement', 'Village', 'City', 'Town', 'AdministrativeRegion', 'HistoricBuilding', 
    'Noble', 'School', 'OfficeHolder', 'Building', 'Royalty', 'HistoricPlace', 'University', 
    'Person', 'PersonFunction', 'Place', 'ReligiousBuilding', 'MilitaryPerson', 'Baronet',
    'MilitaryUnit', 'Politician', 'Road', 'Station', 'Island', 'RailwayLine', 'Museum',
    'Park', 'Venue', 'MemberOfParliament', 'Hospital', 'Prison', 'Mountain', 'River',
    'MilitaryStructure', 'Bridge', 'Hotel', 'Lake', 'Airport', 'College', 'Stadium',
    'SiteOfSpecialScientificInterest', 'Saint', 'Monarch', 'WorldHeritageSite', 
    'Governor', 'ShoppingMall', 'ChristianBishop', 'Monument', 'CricketGround'
]

# rows whose title contains a comma and whose type is in the list above
mask = proper_nouns['object'].str.contains(',') & proper_nouns['type'].isin(split)
# keep only the part of the title before the first comma
proper_nouns.loc[mask, 'object'] = proper_nouns.loc[mask, 'object'].str.split(',', expand=True)[0]

proper_nouns

Unnamed: 0,object,type,num_upper,num_words,num_numbers,percent_upper
3,Alabama,AdministrativeRegion,0,1,0,
4,Abraham Lincoln,OfficeHolder,1,2,0,1.000000
5,Abraham Lincoln,TimePeriod,1,2,0,1.000000
9,Animalia,Book,0,1,0,
10,Academy Awards,Award,1,2,0,1.000000
11,Actrius,Film,0,1,0,
13,Allan Dwan,Person,1,2,0,1.000000
14,Allan Dwan,PersonFunction,1,2,0,1.000000
15,Alain Connes,Scientist,1,2,0,1.000000
16,Aristotle,Philosopher,0,1,0,
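
The reason for restricting the split to certain types is easier to see on single titles: a comma in a place name introduces a qualifier, while a comma in a film or book title is part of the name. A minimal sketch, using a small subset of the types listed above and a hypothetical helper name:

```python
# A few of the types from the split list above, for illustration only.
SPLIT_TYPES = {'Settlement', 'City', 'Town', 'Person'}

def strip_qualifier(title, type_name, split_types=SPLIT_TYPES):
    """Drop everything after the first comma, but only for the given types."""
    if ',' in title and type_name in split_types:
        return title.split(',')[0]
    return title

print(strip_qualifier('Rome, Ohio', 'City'))
print(strip_qualifier('Lock, Stock and Two Smoking Barrels', 'Film'))
```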


Then, to make the format match the words from ConceptNet, I made everything lowercase and replaced spaces with underscores.  I also dropped the titles that start with "list of".

In [13]:
proper_noun_words = proper_nouns['object'].str.lower().str.replace(' ', '_').drop_duplicates()
proper_noun_words = proper_noun_words[
    ~proper_noun_words.str.contains(r'^list_of')
]
proper_noun_words

3                                           alabama
4                                   abraham_lincoln
9                                          animalia
10                                   academy_awards
11                                          actrius
13                                       allan_dwan
15                                     alain_connes
16                                        aristotle
17         academy_award_for_best_production_design
19                                         ayn_rand
23                                          algeria
29                                     andre_agassi
30                                      animal_farm
32                                          andorra
33                                           alaska
41                                    aldous_huxley
43                            america_the_beautiful
48            american_national_standards_institute
50                                a_modest_proposal
...

After all that, we're left with a master dictionary of about 2.74 million words and proper nouns.

In [14]:
dic = dict([(w, 1) for w in (list(words) + list(proper_noun_words))])
len(dic)

2740624

In [15]:
# use a context manager so the file is flushed and closed
with open('dictionary/words.json', 'w') as f:
    json.dump(dic, f)
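
Once the file is written, checking a candidate clue comes down to normalizing it the same way (lowercase, underscores for spaces) and testing membership. The sketch below round-trips a tiny stand-in dictionary through a temporary file instead of the real `words.json`; `normalize` and `is_valid_clue` are hypothetical helper names.

```python
import json
import os
import tempfile

def normalize(clue):
    """Lowercase a clue and join its words with underscores."""
    return '_'.join(clue.lower().split())

def is_valid_clue(clue, dictionary):
    """Return True if the normalized clue is a key in the dictionary."""
    return normalize(clue) in dictionary

# Tiny stand-in for the real dictionary, using entries seen above.
sample = dict.fromkeys(['abandon', 'animal_farm', 'abraham_lincoln'], 1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.json')
    with open(path, 'w') as f:
        json.dump(sample, f)
    with open(path) as f:
        loaded = json.load(f)

print(is_valid_clue('Animal Farm', loaded))
print(is_valid_clue('George Washington', loaded))
```

Storing the words as dictionary keys keeps each lookup a constant-time hash check, which matters with millions of entries.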