This script generates bags of words that are both as comprehensive and as specific as possible.

To do so, we use two strategies:


> 1) expand the seed words, obtaining all synonyms, hyponyms and semantically related words;

> 2) filter out words with unintended meanings.


We will use two phases of expansion and selection: 

> A. First we will use WordNet to obtain synonyms and hyponyms of the seed word, and then prune them according to meaning.

> B. Second, we will use Word2vec to: 
>>  B1. Create a semantic vector map of your custom corpus;

>>  B2. Use the semantic map to evaluate whether the semantic cloud of each word is consistent with the intended meaning. E.g. is 'brawn' in our corpus closer to words related to grease or to strength?

>>  B3. Filter out clouds of words with irrelevant meanings, and add new words from the appropriate clouds if meaningful.

In [None]:
## first we import all relevant modules

import os
from os import path
import re

## from NLTK import wordnet, stopwords, Lemmatizer
import nltk
nltk.download('all')
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemma=WordNetLemmatizer()

## Import widgets to select words using checkboxes
import ipywidgets as widgets
from itertools import compress
layout = widgets.Layout(width='auto')

## Import word2vec
!pip install gensim
from gensim.models.word2vec import Word2Vec

## to expand contractions in English, e.g. don't --> do not
!pip install contractions
import contractions

## To turn numbers into words
!pip install num2words
from num2words import num2words



The main free parameters are the list of seed words and the language of the final word list. Note that you can enter the seed words in English and obtain the final word list in your language of choice.

In [None]:
## to see available wordnet languages and their codes type  wn.langs()
print("wordnet languages")
print(wn.langs())

## to see available stopword languages and their codes type stopwords.fileids()
print("stopword languages")
print(stopwords.fileids())


wordnet languages
dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'dan', 'ell', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])
stopword languages
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [None]:
## language in which we want the output of wordnet
language_wordnet = 'eng'

## language for stopwords
stopword_language = 'english'

## language for converting numbers to words
number_language = 'en'

## to see the available number-to-word languages and their codes, go to https://github.com/savoirfairelinux/num2words

##################################################
#Now the main part: your list of seed words 
##################################################

In [None]:
## add your seed words in the following list

seed_words = ['finance','money','invest']

### Part A. First we will use WordNet to obtain synonyms and hyponyms of the seed word, and then prune them according to meaning.

In [None]:
## declare the functions

## Function 1
## input is a seed word and a language, output is a list of words, length of the list, and a list of synsets
def generate_word_list(seed_word, language):
  
  ## we create an empty list to store the final word list
  list_of_lemmas = []
  list_of_meanings = []

  ## a function to add a word to a list
  add_to_list = lambda list1, item1: list1.append(item1)

  ## a function to return the hyponyms of a synset
  hypos = lambda s:s.hyponyms()

  ## wn.synsets returns all meanings (synsets) of the word; pos restricts them to nouns, verbs and adjectives
  meanings = wn.synsets(seed_word, pos=wn.NOUN+wn.VERB+wn.ADJ)

  ## loop over set of meanings in synset
  for meaning in meanings:

    ## add synset, definition, and a list of all associated lemmas into the list_of_meanings
    list_of_meanings += [[meaning, meaning.definition(), [lemma.name() for lemma in meaning.lemmas(language)]]]

    ## append all synonyms (lemmas()) of that meaning to the list_of_lemmas 
    [add_to_list(list_of_lemmas,lemma.name()) for lemma in meaning.lemmas(language)]

    ## loop over the list of all possible hyponyms
    for hyponym in meaning.closure(hypos):

      ## add synsets, definition, and a list of all associated lemmas into the list_of_meanings
      list_of_meanings += [[hyponym, hyponym.definition(), [lemma.name() for lemma in hyponym.lemmas(language)]]]

      ## append all synonyms (lemmas()) of that hyponym to the list_of_lemmas 
      [add_to_list(list_of_lemmas,lemma.name()) for lemma in hyponym.lemmas(language)]

  ##eliminate list duplications by applying the set transformation
  set_of_lemmas= [*set(list_of_lemmas)]

  ## sort alphabetically
  set_of_lemmas.sort()

  ##length
  length = len(set_of_lemmas)

  return(set_of_lemmas,length,list_of_meanings)
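
The hyponym expansion performed by `meaning.closure(hypos)` can be pictured on a toy graph. The sketch below is only an illustration: `toy_hyponyms` is a hand-written dictionary, not WordNet's actual hierarchy, and `hyponym_closure` is a hypothetical helper that walks the graph breadth-first, which may not match the order in which NLTK yields synsets.

```python
from collections import deque

def hyponym_closure(root, hyponyms_of):
    """Collect every node reachable from `root` by following hyponym links."""
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in hyponyms_of.get(node, []):
            if child not in seen:  # guard against revisiting shared hyponyms
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Hypothetical toy hierarchy (NOT real WordNet data)
toy_hyponyms = {
    "money": ["fund", "cash"],
    "fund": ["pension_fund", "trust_fund"],
    "cash": ["petty_cash"],
}

expanded = hyponym_closure("money", toy_hyponyms)
print(expanded)
```

The point of the transitive walk is that hyponyms of hyponyms (here `pension_fund` under `fund`) are also collected, which is why a single seed word can produce a long candidate list.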


Loop to run the function for every seed in the list of seed words

In [None]:
list_meanings = []

##create list of meanings
for seed_word in seed_words:
  meanings = generate_word_list(seed_word, language_wordnet)[2]
  list_meanings += meanings

## eliminate duplicate entries: sort first so that identical entries become neighbours
import itertools
list_meanings.sort()

## groupby collapses adjacent duplicates; the result must be assigned, otherwise it is discarded
list_meanings = [meaning for meaning, _ in itertools.groupby(list_meanings)]
print(list_meanings)

[[Synset('appropriation.n.01'), 'money set aside (as by a legislature) for a specific purpose', ['appropriation']], [Synset('arbitrage.n.01'), 'a kind of hedged investment meant to capture slight differences in price; when there is a difference in the price of something on two different markets the arbitrageur simultaneously buys at the lower price and sells at the higher price', ['arbitrage']], [Synset('back.v.05'), 'support financial backing for', ['back']], [Synset('banking.n.01'), 'engaging in the business of keeping money for savings and checking accounts or for exchange or for issuing loans and credit etc.', ['banking']], [Synset('banking.n.02'), 'transacting business with a bank; depositing or withdrawing funds or requesting a loan etc.', ['banking']], [Synset('boodle.n.01'), 'informal terms for money', ['boodle', 'bread', 'cabbage', 'clams', 'dinero', 'dough', 'gelt', 'kale', 'lettuce', 'lolly', 'lucre', 'loot', 'moolah', 'pelf', 'scratch', 'shekels', 'simoleons', 'sugar', 'wam
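
Because each entry in the list of meanings is itself a list, it cannot be put into a `set`, so duplicates are removed by sorting and then grouping equal neighbours. A minimal sketch of that idiom, using made-up records (plain strings stand in for the `Synset` objects) and a hypothetical helper name `dedupe_sorted`:

```python
import itertools

def dedupe_sorted(records):
    """Return a sorted copy of `records` with duplicate entries collapsed."""
    ordered = sorted(records)
    # groupby only merges *adjacent* equal items, hence the sort above
    return [key for key, _group in itertools.groupby(ordered)]

# Hypothetical records: [identifier, definition, lemmas]
toy_meanings = [
    ["fund_1", "a reserve of money", ["fund", "monetary_fund"]],
    ["bank_1", "a financial institution", ["bank"]],
    ["fund_1", "a reserve of money", ["fund", "monetary_fund"]],
]

unique_meanings = dedupe_sorted(toy_meanings)
print(len(toy_meanings), "->", len(unique_meanings))
```

Without the preceding sort, `groupby` would leave non-adjacent duplicates in place.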

## Now we select the appropriate meanings using a checkbox widget

In [None]:
## Note: in the output cell, make sure you have scrolled all the way up (and down)

selection_widget = widgets.VBox(
    [
        widgets.Checkbox(value=True, description=str(item), disabled=False, indent=False, layout=layout)
        for item in list_meanings
    ]
)

selection_widget


VBox(children=(Checkbox(value=True, description="[Synset('appropriation.n.01'), 'money set aside (as by a legi…

In [None]:
## Here is the filtered list of meanings

filtered_meanings = list(
  compress(list_meanings, [widget.value for widget in selection_widget.children]))

## now we extract just the lemmas 
filtered_list = [word for lemmas in filtered_meanings for word in lemmas[2]]

## eliminate duplications and sort alphabetically
filtered_list = sorted(set(filtered_list))
print(filtered_list)

['Civil_List', 'arbitrage', 'bank_deposit', 'banking', 'big_bucks', 'big_money', 'budget', 'bull', 'bundle', 'buy_into', 'commit', 'coronate', 'corporate_finance', 'cover', 'crown', 'demand_deposit', 'deposit', 'empower', 'endow', 'endue', 'finance', 'financing', 'floatation', 'flotation', 'fund', 'funding', 'gift', 'high_finance', 'home_banking', 'index_fund', 'indue', 'invest', 'investing', 'investment', 'job', 'leverage', 'leveraging', 'megabucks', 'money', 'nest_egg', 'pension_fund', 'petty_cash', 'pile', 'place', 'put', 'refinance', 'savings', 'seed', 'shelter', 'shinplaster', 'speculate', 'sterling', 'subsidisation', 'subsidization', 'superannuation_fund', 'tie_up', 'token_money', 'trust_fund', 'war_chest']
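
The selection step pairs each meaning with the state of its checkbox and keeps only the ticked ones. The sketch below reproduces that logic without any widget: `checkbox_values` is a hand-written list standing in for `[widget.value for widget in selection_widget.children]`, and the records are invented for illustration.

```python
from itertools import compress

# Hypothetical records: [identifier, definition, lemmas]
toy_meanings = [
    ["meaning_1", "commit money in order to earn a return", ["invest", "put", "commit", "place"]],
    ["meaning_2", "provide with power and authority", ["invest", "vest", "enthrone"]],
    ["meaning_3", "a reserve of money", ["fund", "monetary_fund"]],
]

# Stand-in for the checkbox states: the second meaning has been unticked
checkbox_values = [True, False, True]

# compress keeps the items whose matching selector is truthy
kept = list(compress(toy_meanings, checkbox_values))

# flatten the lemma lists, remove duplicates, sort alphabetically
kept_lemmas = sorted({word for _identifier, _definition, lemmas in kept for word in lemmas})
print(kept_lemmas)
```

Unticking one meaning removes the lemmas that belong only to it (`vest`, `enthrone`), while a lemma shared with a kept meaning (`invest`) survives.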


## Part B

### B1. Create a semantic vector map of your custom corpus

Word2Vec will generate a semantic vector map in which words lie closer together the more semantically similar they are. For an introduction to Word2Vec, see the tutorial:

https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1

Basically, word2vec will evaluate the context in which words are used (which other words co-occur with them) and map each word in the multi-dimensional space in relation to other words. 
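
To make the notion of "context" concrete, the sketch below simply counts which words appear within a small window of each word. This is not how word2vec is trained (it learns dense vectors with a neural model); it is a count-based illustration only, and `context_counts`, the window size and the toy sentences are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """For every word, count the words occurring within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical, already tokenized sentences
toy_sentences = [
    ["lend", "money", "purse"],
    ["lend", "gold", "purse"],
    ["draw", "sword", "fight"],
]

counts = context_counts(toy_sentences)
shared = set(counts["money"]) & set(counts["gold"])
print(sorted(shared))
```

Here `money` and `gold` share their context words while `money` and `sword` share none, which is the kind of regularity that leads word2vec to place the first pair close together.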

To create that vector space, we need to feed a list of sentences to word2vec. So first we need a dataset of texts from which to build the semantic vector map.

You can also download a pretrained model: see here https://radimrehurek.com/gensim/models/word2vec.html

**If you are using Colab, upload a zip file with all the texts to the session's file panel; if you are using Jupyter, place the zip file in the same directory as the script.**

In [None]:
## unzip the archive that was uploaded to the working directory
!unzip example_texts.zip


## set the zip_dir based on the name of the folder that you compressed
zip_dir = "/example_texts"

base_dir1 = os.getcwd()
root_folder = base_dir1 + zip_dir 


Archive:  example_texts.zip
replace __MACOSX/._example_texts? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: __MACOSX/._example_texts  
  inflating: example_texts/CumberlandRichard-1772-The fashionable love68.txt  
  inflating: __MACOSX/example_texts/._CumberlandRichard-1772-The fashionable love68.txt  
  inflating: example_texts/Lyly John-1591-Mother Bombi.txt  
  inflating: __MACOSX/example_texts/._Lyly John-1591-Mother Bombi.txt  
  inflating: example_texts/WilliamShakespeare-1602-All's Well That Ends.txt  
  inflating: __MACOSX/example_texts/._WilliamShakespeare-1602-All's Well That Ends.txt  
  inflating: example_texts/HolcroftThomas-1781-Duplicity a comedy159.txt  
  inflating: __MACOSX/example_texts/._HolcroftThomas-1781-Duplicity a comedy159.txt  
  inflating: example_texts/CoryeJohn-1672-The generous enemies49.txt  
  inflating: __MACOSX/example_texts/._CoryeJohn-1672-The generous enemies49.txt  
  inflating: example_texts/LilloGeorge-1738-Marina a play of th212.txt  
  

Here is a set of processing functions to clean the input texts

In [None]:


## WE DEFINE ALL THE FUNCTIONS WE WILL NEED

## This function removes any URL
## re.sub() replaces every substring of the input that starts with http with ''
def clean_url(input):
  output = re.sub(r'http\S+', '', input)
  return output

## This function expands contractions, e.g. turns "don't" into "do not"
def fix_contraction(input):
  output = contractions.fix(input)
  return output

## This function finds all non-alphanumeric characters [^a-zA-Z0-9] and replaces each with a space
def clean_non_alphanumeric(input):
  output = re.sub(r'[^a-zA-Z0-9]', ' ', input)
  return output

## This function takes a string with both lower and upper case elements and returns only lower case elements
def clean_lowercase(input):
  output = str(input).lower()
  return output

## The function tokenize takes a string as input and returns a list of tokens 
def clean_tokenization(input):
  output = nltk.word_tokenize(input)
  return output

## eliminate stop words
def clean_stopwords(input):
  stop_words = set(stopwords.words(stopword_language))
  output = [item for item in input if item not in stop_words]
  return output

## This function turns numeric tokens into words, e.g. "6" into "six"
def numbers_to_words(input):
  output = []
  for item in input:
    if item.isnumeric():
      output += [num2words(item, lang=number_language)]
    else:
      output += [item]
  return output

## Lemmatize tokens
def clean_lemmatization(input):
  output = [lemma.lemmatize(word=w, pos='v') for w in input]
  return output

## in this function we create a new list that includes only words with length > 2
def clean_length(input):
  output= [word for word in input if len(word) > 2]
  return output

## COUNT WORD FREQUENCIES
def count_word_frequencies(input):
  word_count = nltk.FreqDist(input)
  word_count_most_common= word_count.most_common()
  output = [(count[0],count[1]/len(input)) for count in word_count_most_common]
  return output

## finally, after all the filtering steps, we can put the string back together
def convert_to_string(input):
  output = ' '.join(input)
  return output

## here we can select the relevant functions to add to the preprocessing pipeline

In [None]:
## We define the pipeline as a function

def pre_processing_pipeline(input):
  w1 = clean_url(input)
  w2 = fix_contraction(w1)
  w3 = clean_non_alphanumeric(w2)
  w4 = clean_lowercase(w3)
  w5 = clean_tokenization(w4)
  w6 = numbers_to_words(w5)
  w7 = clean_stopwords(w6)
  w8 = clean_lemmatization(w7)
  clean_list = clean_length(w8)
  #word_frequencies = count_word_frequencies(clean_list)
  #filtered_string = convert_to_string(clean_list)

  #return(filtered_string,word_frequencies,clean_list)
  return(clean_list)
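
To see what a pipeline of this kind does to a single sentence, here is a reduced, standard-library-only sketch. It is a stand-in, not the pipeline above: `toy_pipeline` and `TOY_STOPWORDS` are hypothetical, the stopword set is far smaller than NLTK's, whitespace splitting replaces `nltk.word_tokenize`, and contraction expansion, number conversion and lemmatization are left out.

```python
import re

# Hypothetical mini stopword list (NLTK's English list is much longer)
TOY_STOPWORDS = {"the", "a", "is", "at", "of", "and", "to"}

def toy_pipeline(text):
    """URL removal -> non-alphanumeric removal -> lowercase -> split -> stopwords -> length filter."""
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)  # replace punctuation with spaces
    tokens = text.lower().split()              # lowercase and tokenize on whitespace
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return [t for t in tokens if len(t) > 2]   # keep words longer than 2 characters

cleaned = toy_pipeline("The purse is at https://example.com, and the Gold is GONE!")
print(cleaned)
```

The order of the steps matters: URLs are removed before punctuation is stripped, because otherwise the `http\S+` pattern would no longer match the whole address.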

We now create a list of sentences, each sentence a list of words. This is the necessary input for word2vec.

In [None]:
# FUNCTIONS

## Function to select .txt files and store them as a list of sentences, each a list of words, to use as input to Word2Vec

def literary_words_list(root_folder):
    """

    Parameters
    ----------
    root_folder : a file path where .txt files are located
    e.g. 'c:\\Users\\Maria\\Dropbox\\Maria Brackin\\Finance and Text Analysis\\texts'

    Returns
    -------
    A list of the paragraphs, each paragraph a list of words

    
    """

    # Create list for clean sentences
    words_list = []

    # Iterates over the path, folders and subfolders looking for txt files
    for path, subdirs, files in os.walk(root_folder):
        for file in files:
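            if file.endswith('.txt') and 'model' not in file:
                print(file)
                name = os.path.join(path, file)
                # open inside a with-block so the file handle is closed after reading
                with open(name, encoding='utf-8') as text_file:
                    file_text = text_file.read()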

                # Creates a list of paragraphs - lines
                text_list_paragraphs = file_text.split('\n')

                # Iterate over paragraphs
                for paragraph in text_list_paragraphs:

                    ## remove \r at the end of some paragraphs
                    paragraph = paragraph.replace('\r', '')

                    ## for each paragraph create a list of sentences
                    list_of_sentences = nltk.sent_tokenize(paragraph)

                    ## for each sentence apply the preprocessing pipeline, including word tokenization, and add the tokenized sentence to the final words_list
                    for sentence in list_of_sentences:
                      #print(sentence)
                      # Add the tokenized sentence to the word2vec input list
                      words_list += [pre_processing_pipeline(sentence)]
            
    return words_list

We run the function to create a list of sentences, each with a list of words (we also apply the preprocessing pipeline to clean the texts)

In [None]:
# Create list of sentences from .txt files of literary plays

word2vec_input = literary_words_list(root_folder)

GentlemanFrancis-1771-The tobacconist a c107.txt
ShadwellThomas-1688-The squire of Alsati281.txt
BehnAphra-1677-The townfopp or S18.txt
WilliamShakespeare-1607-Coriolanus.txt
Jonson Ben-1599-Every Man Out of His Humou.txt
TateNahum-1687-The islandprincess 310.txt
Chapman George Jonson Ben Marston John-1605-Eastward Ho.txt
GentlemanFrancis-1772-Cupid's revenge an 117.txt
Field Nathan-1612-A Woman Is a Weathercock.txt
Heywood Thomas-1604-The Wise Woman of Hogsdon.txt
RavenscroftEdward-1687-Titus Andronicus or234.txt
InchbaldMrs-1785-Appearance is agains16.txt
DrydenJohn-1692-All for love or Th90.txt
Webster John-1612-The White Devi.txt
KenrickW William-1766-Falstaff's wedding 201.txt
CowleyMrs Hannah-1786-A school for greybea54.txt
Massinger Philip-1632-The City Madam.txt
LeeNathaniel-1684-Constantine the grea171.txt
Brome Alexander-1638-The Cunning Lovers.txt
BrownJohn-1755-Barbarossa A traged14.txt
Chapman George-1602-Sir Giles Goosecap.txt
Shirley James-1631-The Humorous Courtier.txt


The necessary input for word2vec is a list of sentences, each a list of words


In [None]:
word2vec_input[:100]

[['nay',
  'nay',
  'though',
  'thy',
  'name',
  'face',
  'thou',
  'hadst',
  'face',
  'brass',
  'thou',
  'shalt',
  'face'],
 ['must',
  'unable',
  'handle',
  'excellent',
  'subject',
  'though',
  'shame',
  'thee',
  'long',
  'since',
  'part',
  'anatomize',
  'calf',
  'head',
  'thine'],
 ['calf', 'head'],
 ['blood',
  'life',
  'mind',
  'mark',
  'resentment',
  'legible',
  'character',
  'upon',
  'tyburn',
  'visage',
  'thine',
  'put',
  'thy',
  'feature',
  'mourn'],
 ['come',
  'see',
  'whose',
  'stomach',
  'bear',
  'bruise',
  'best',
  'tickle',
  'pamper',
  'side'],
 ['poor',
  'ignorant',
  'impertinent',
  'ungrateful',
  'wretch',
  'whose',
  'life',
  'disgrace',
  'speak',
  'save',
  'vile',
  'emblem',
  'empty',
  'cask',
  'much',
  'sound',
  'content',
  'canst',
  'thou',
  'forget',
  'mouldy',
  'crust',
  'suffolk',
  'cheese',
  'dead',
  'small',
  'beer',
  'thou',
  'wert',
  'starve',
  'common',
  'bare',
  'rib',
  'rat',
  'lim

Now we build the semantic vector space by passing the sentence corpus to word2vec

In [None]:
## Here we build the vector space with Word2Vec

SentenceCorpus = word2vec_input
word2vec_output = Word2Vec(SentenceCorpus, min_count=1)

In [None]:
## Save vector space

word2vec_output.save('w2v_model.txt')
model = Word2Vec.load('w2v_model.txt')


In [None]:
## Note, you can also import pretrained models, using the following syntax:

'''
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

# Download the "glove-twitter-25" embeddings - this will take a while, and longer for larger corpora - e.g. google-news
model = gensim.downloader.load('glove-twitter-25')
'''

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


### B2. Use the semantic map to evaluate whether the semantic cloud of each word is consistent with the intended meaning. E.g. is 'brawn' in our corpus closer to words related to grease or to strength?
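
The "closeness" used in this step is cosine similarity between word vectors, which is what gensim's `most_similar` ranks by. The sketch below shows the idea on invented two-dimensional vectors; real models use many more dimensions, and `cosine`, `nearest` and `toy_vectors` are hypothetical names chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors: in this toy corpus 'brawn' is used like 'strength', not like 'grease'
toy_vectors = {
    "brawn": [0.9, 0.1],
    "strength": [0.8, 0.2],
    "muscle": [0.7, 0.1],
    "grease": [0.1, 0.9],
    "lard": [0.2, 0.8],
}

neighbours = [word for word, _score in nearest("brawn", toy_vectors)]
print(neighbours)
```

If the vectors had instead been learned from a corpus of recipes, `brawn` might land next to `grease` and `lard`, which is exactly what inspecting the cloud is meant to reveal.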

In [None]:
print(filtered_list)

# example of filtered list
#filtered_list = ['Civil_List', 'arbitrage', 'bank_deposit', 'banking', 'big_bucks', 'big_money', 'budget', 'bull', 'bundle', 'buy_into', 'commit', 'coronate', 'corporate_finance', 'cover', 'crown', 'demand_deposit', 'deposit', 'empower', 'endow', 'endue', 'finance', 'financing', 'floatation', 'flotation', 'fund', 'funding', 'gift', 'high_finance', 'home_banking', 'index_fund', 'indue', 'invest', 'investing', 'investment', 'job', 'leverage', 'leveraging', 'megabucks', 'money', 'nest_egg', 'pension_fund', 'petty_cash', 'pile', 'place', 'put', 'refinance', 'savings', 'seed', 'shelter', 'shinplaster', 'speculate', 'sterling', 'subsidisation', 'subsidization', 'superannuation_fund', 'tie_up', 'token_money', 'trust_fund', 'war_chest']


In [None]:
# FUNCTIONS

#1 Function that uses word2vec to retrieve the 10 most semantically similar words to each word in word_list

def get_word2vec_list(word_list,model):

  list_of_word2vec_lists = []
  for word in word_list:
    try:

      ## here is the crucial line - we are using the model that we trained to get the most similar words within our corpus
      list_vects = model.wv.most_similar([word],topn=10)

      new_list = []
      new_list += [word]
      for item in list_vects:
        word1 = item[0]
        new_list += [word1]

      #print(new_list)
      #print('\n')
      list_of_word2vec_lists += [new_list]


    
    except KeyError:
      continue
  return(list_of_word2vec_lists)

In [None]:
# create lists of ecologically valid words 
check_vectorSpace = get_word2vec_list(filtered_list, model)

## we print the cloud of the 10 words most similar to each of the lemmas in our filtered list
for item in check_vectorSpace: print(item) 



['arbitrage', 'prospectus', 'mooc', 'classification', 'localization', 'unlocker', 'compo', 'traineeship', 'ticker', 'facilitation', 'coursera']
['banking', 'finance', 'dept', 'infrastructure', 'compliance', 'healthcare', 'transportation', 'corporate', 'housing', 'financial', 'insurance']
['budget', 'tax', 'funding', 'commission', 'policy', 'minimum', 'financial', 'investment', 'reserve', 'debt', 'medical']
['bull', 'trash', 'roll', 'garbage', 'ball', 'heavy', 'blow', 'beat', 'rolling', 'play', 'dirt']
['bundle', 'package', 'vinyl', 'custom', 'sample', 'disc', 'limited', 'includes', 'giveaway', 'selection', 'accessory']
['commit', 'punish', 'suffer', 'warn', 'blame', 'tougher', 'committing', 'allow', 'threaten', 'succeed', 'forced']
['cover', 'poster', 'version', 'guitar', 'album', 'style', 'playlist', 'background', 'theme', 'song', 'video']
['crown', 'gold', 'platinum', 'silver', 'golden', 'diamond', 'crystal', 'cross', 'pearl', 'ring', 'tiger']
['deposit', 'refund', 'payment', 'paymen

### B3. Filter out clouds of words with irrelevant meanings, and add new words from the appropriate clouds if meaningful.

We can now use the same widget to select the relevant semantic clouds, i.e. the ones that reflect the meanings we are interested in. This selection is somewhat subjective; I would say that a cloud is appropriate if either the first 5 words, or more than 10 words overall, have semantically appropriate meanings.
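
The rule of thumb above can be written down as a small check, which may help to apply it consistently. This is one possible reading, not part of the original workflow: `cloud_is_relevant` is a hypothetical helper, `approved_words` stands for the words the reader has judged to be on topic, and both thresholds are parameters. Note that with `topn=10` a cloud has only 10 neighbours, so the "more than 10" condition can only be met if more neighbours are requested.

```python
def cloud_is_relevant(cloud, approved_words, head=5, min_total=11):
    """cloud[0] is the query word; the rest are its neighbours, most similar first."""
    neighbours = cloud[1:]
    flags = [word in approved_words for word in neighbours]
    first_ok = len(flags) >= head and all(flags[:head])  # the first `head` neighbours all fit
    total_ok = sum(flags) >= min_total                   # "more than 10" fit overall
    return first_ok or total_ok

# Two clouds in the format printed above; the approved set is a made-up judgement
budget_cloud = ['budget', 'tax', 'funding', 'commission', 'policy', 'minimum',
                'financial', 'investment', 'reserve', 'debt', 'medical']
bull_cloud = ['bull', 'trash', 'roll', 'garbage', 'ball', 'heavy',
              'blow', 'beat', 'rolling', 'play', 'dirt']
approved_words = {'tax', 'funding', 'commission', 'policy', 'minimum',
                  'financial', 'investment', 'reserve', 'debt'}

budget_ok = cloud_is_relevant(budget_cloud, approved_words)
bull_ok = cloud_is_relevant(bull_cloud, approved_words)
print(budget_ok, bull_ok)
```

The judgement of which neighbours are "semantically appropriate" still has to come from a person; the function only makes the counting step explicit.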

In [None]:
import ipywidgets as widgets
from itertools import compress
layout = widgets.Layout(width='auto')

selection_widget2 = widgets.VBox(
    [
        widgets.Checkbox(value=True, description=str(item), disabled=False, indent=False, layout=layout)
        for item in check_vectorSpace
    ]
)

selection_widget2

VBox(children=(Checkbox(value=True, description="['budget', 'marrowbone', 'tayl', 'gullet', 'rubber', 'stich',…

In [None]:
## Here is the filtered list of semantic clouds

filtered_word2vec = list(
  compress(check_vectorSpace, [widget.value for widget in selection_widget2.children]))

## now we extract just the words 
flat_list = [word for lists in filtered_word2vec for word in lists]

## eliminate duplications, and sort alphabetically
flat_list= [*set(flat_list)]
flat_list.sort()  
print(flat_list)

['affluent', 'attribute', 'benefit', 'benevolence', 'bounty', 'cash', 'chatwell', 'commodity', 'dower', 'dowry', 'expenditure', 'farthing', 'favor', 'fund', 'gift', 'giver', 'goods', 'intill', 'money', 'moneys', 'mony', 'mortgage', 'peccas', 'penny', 'protaction', 'purse', 'reckon', 'recompense', 'subscription', 'summe', 'thankful', 'validity', 'ware']


## Now we do a final check to eliminate words that are way off

In [None]:
selection_widget3 = widgets.VBox(
    [
        widgets.Checkbox(value=True, description=str(item), disabled=False, indent=False, layout=layout)
        for item in flat_list
    ]
)

selection_widget3

VBox(children=(Checkbox(value=True, description='affluent', indent=False, layout=Layout(width='auto')), Checkb…

In [None]:
final_list = list(
  compress(flat_list, [widget.value for widget in selection_widget3.children]))

print(final_list)

['affluent', 'attribute', 'benefit', 'benevolence', 'bounty', 'cash', 'commodity', 'dower', 'dowry', 'expenditure', 'favor', 'fund', 'gift', 'giver', 'goods', 'money', 'moneys', 'mony', 'mortgage', 'penny', 'purse', 'recompense', 'subscription', 'summe', 'ware']
