# Generating the gender corpus

Note: if you have the pickle files, you can start from any point marked with `=====` instead of running the notebook from the top. This way you don't have to re-scrape Wikipedia, for example.
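
The checkpoint idea can be wrapped in a small helper. This is only a sketch: `load_or_build` and the demo names are hypothetical and are not used elsewhere in this notebook, which loads and dumps the pickles explicitly.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`; if the file is missing, build it, pickle it, return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Demo with a throwaway file; in the notebook `path` could be 'raw_texts.pkl'
# and `build` a function that runs the scraping step.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
calls = []


def expensive_step():
    calls.append(1)  # record that the slow step really ran
    return ['first text', 'second text']


first = load_or_build(demo_path, expensive_step)   # runs the step and saves the result
second = load_or_build(demo_path, expensive_step)  # loads the saved result instead
print(first == second, len(calls))
```

The second call never runs `expensive_step`, which is the behaviour the `=====` markers provide by hand.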

# 1. Intro

The objective of this dataset is to provide a ground truth for testing NLP models. The dataset is a set of manipulated phrases; every entry contains:
+ the original phrase
+ 2 phrases where only the subject is changed, one with a female subject and one with a male subject
+ 2 phrases where both the subject and the object are changed, one all female and one all male
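
One possible way to hold such an entry is sketched below. The class name, the field names and the example sentence are all invented for illustration; the notebook itself does not define this structure.

```python
from dataclasses import dataclass


@dataclass
class PhraseEntry:
    # Hypothetical field names, one per variant listed above.
    original: str
    subject_female: str
    subject_male: str
    all_female: str
    all_male: str


# Invented sentence, used only to show the five variants side by side.
entry = PhraseEntry(
    original='The king thanks the boy.',
    subject_female='The queen thanks the boy.',
    subject_male='The king thanks the boy.',
    all_female='The queen thanks the girl.',
    all_male='The king thanks the boy.',
)
print(entry.all_female)
```

When the original already has a male subject and object, as here, the male variants coincide with the original.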

The original phrases come from the Wikipedia entries of the books listed as most popular on Project Gutenberg on 17/3/2022. This list has been saved for later use, as has the list that includes each book's author.

# 2. Getting the urls

In [None]:
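from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pickle

driver = webdriver.Chrome()
url_book_names='https://www.gutenberg.org/browse/scores/top'
driver.get(url_book_names)

soup=BeautifulSoup(driver.page_source,'lxml')
tr_p1=soup.find_all('tr')
#list of the top 100 ebooks yesterday (17/3/2022)
list_xpath='/html/body/div[1]/div/ol[1]'
#find_element(By.XPATH, ...) replaces find_element_by_xpath, which recent Selenium 4 releases no longer provide
book_titles=driver.find_element(By.XPATH, list_xpath)
book_list=book_titles.text.split('\n')        #will hold the titles only
book_and_author=book_titles.text.split('\n')  #will hold "title by author"

for i in range(len(book_list)):
    #split() leaves the entry whole when the separator is missing,
    #whereas slicing with str.find() would cut the last character (find returns -1)
    book_list[i]=book_list[i].split(' by ')[0]
    book_and_author[i]=book_and_author[i].split(' (')[0]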

with open('book_list.pkl', 'wb') as f:
    pickle.dump(book_list, f)

with open('book_and_author.pkl', 'wb') as f:
    pickle.dump(book_and_author, f)



# ========================

# 3. Get Wikipedia entries

In [1]:
import pickle
with open('book_list.pkl', 'rb') as f:
    books = pickle.load(f)

with open('book_and_author.pkl', 'rb') as f:
    book_and_author = pickle.load(f)


In [2]:
import string
url_books=[]
for book in book_and_author:
    plot=[]
    url_google='https://www.google.com/search?q="en.wikipedia.org"+'
    for i in book:
        if i in string.punctuation:
            book=book.replace(i,'')
        #book=book.replace(i,' ')
    url_books.append(url_google+book.replace(' ','+'))


In [3]:
print(url_books[0])

https://www.google.com/search?q="en.wikipedia.org"+Frankenstein+Or+The+Modern+Prometheus+by+Mary+Wollstonecraft+Shelley
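
The same URL can be built in one step with `str.translate`. This is a sketch, not the cell used above: `search_url` is a hypothetical helper, and the input string is an assumption about what the first list entry looked like before its punctuation was stripped.

```python
import string

SEARCH_PREFIX = 'https://www.google.com/search?q="en.wikipedia.org"+'


def search_url(title_and_author):
    """Strip ASCII punctuation and join the remaining words with '+'."""
    no_punct = title_and_author.translate(str.maketrans('', '', string.punctuation))
    return SEARCH_PREFIX + '+'.join(no_punct.split())


# Assumed form of the first entry (the punctuation is a guess).
url = search_url('Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley')
print(url)
```

Note that `string.punctuation` also contains the apostrophe and the hyphen, so a title such as "Alice's Adventures" is searched as "Alices Adventures".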


To reach the Wikipedia page of each of these books, first open Google and accept the cookies.

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pickle

part_of_wiki=[]
driver = webdriver.Chrome()
driver.get('https://www.google.com/')

In [None]:
#this is in its own cell because sometimes the browser notices that we are a robot and we need to close it and re-run the previous cell
n=0
sentences=[] #list of all 3rd person sentences of all book wikipedia pages
texts=[] # list of the texts of the entire wikipedia pages 

In [None]:
#main loop:
#       opens the Google search for each of the 100 chosen books and follows the Wikipedia result
#       scrapes all text from the paragraphs in each page
#       (the 3rd person phrases are selected later, in section 5)
from selenium.webdriver.common.by import By

#the [88:] slice looks like a resume point after an interrupted run; use url_books to start from the first book
for url in url_books[88:]:
    #getting to the wikipedia page, after knowing the url
    n+=1
    
    driver.get(url)
    driver.find_element(By.PARTIAL_LINK_TEXT, "Wikipedia").click()
    
    #getting the text on the page
    wiki_paragraphs=driver.find_elements(By.TAG_NAME, 'p')
    for i in wiki_paragraphs:
        t=i.text
        texts.append(t)
    
    print(n)
    print(url)

The Wikipedia references in these paragraphs were deleted and the texts saved.

In [None]:
# removing the [1] wikipedia references
import re
for i in range(len(texts)):
    texts[i]=re.sub( r'\[.*?\]', '', texts[i])

with open('raw_texts.pkl', 'wb') as f:
    pickle.dump(texts, f)
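
The pattern `\[.*?\]` removes every bracketed span, not only numeric reference markers. The sketch below compares it with a narrower pattern on an invented sample sentence; the narrower pattern is an alternative to consider, not what this notebook uses.

```python
import re

# Invented sample text with two numeric markers and one non-numeric bracket.
sample = 'The novel was published in 1818.[1][2] Critics [citation needed] disagreed.'

# Pattern used in the notebook: removes any bracketed span, numeric or not.
broad = re.sub(r'\[.*?\]', '', sample)

# Narrower alternative: removes numeric markers such as [1] only.
narrow = re.sub(r'\[\d+\]', '', sample)

print(broad)
print(narrow)
```

With the broad pattern, editorial brackets inside quotations would also disappear, and a double space can be left behind where the bracket stood.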

# ========================

# 4. From Wikipedia paragraphs to SpaCy sentences

In [4]:
with open('raw_texts.pkl', 'rb') as f:
    texts = pickle.load(f)

In [5]:
print(len(texts))

5239


In [12]:
texts[1]

'Frankenstein; or, The Modern Prometheus is an 1818 novel written by English author Mary Shelley. Frankenstein tells the story of Victor Frankenstein, a young scientist who creates a sapient creature in an unorthodox scientific experiment. Shelley started writing the story when she was 18, and the first edition was published anonymously in London on 1 January 1818, when she was 20. Her name first appeared in the second edition, which was published in Paris in 1821.'

In [6]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [9]:
texts_span=[]
for paragraph in texts:
    phrase_gen=nlp(paragraph).sents #spacy generator
    for p in phrase_gen:
        texts_span.append(p)

In [11]:
print(len(texts_span))

19562


In [13]:
#in texts_docs each phrase is a spaCy Doc
texts_docs=[nlp(phrase.text) for phrase in texts_span]
#this cell takes quite a while to run, so texts_docs is pickled below

In [14]:
texts_docs[0:4]

[Frankenstein; or, The Modern Prometheus is an 1818 novel written by English author Mary Shelley.,
 Frankenstein tells the story of Victor Frankenstein, a young scientist who creates a sapient creature in an unorthodox scientific experiment.,
 Shelley started writing the story when she was 18, and the first edition was published anonymously in London on 1 January 1818, when she was 20.,
 Her name first appeared in the second edition, which was published in Paris in 1821.]

In [15]:
import pickle
with open('spacy_doc_from_raw_texts_wiki.pkl', 'wb') as f:
    pickle.dump(texts_docs,f)

# 5. Selecting phrases which have the ROOT verb in the 3rd person singular

In [16]:
with open('spacy_doc_from_raw_texts_wiki.pkl', 'rb') as f:
    texts_docs=pickle.load(f) 

In [17]:
def get_roots_third_person(text):
    #text is a list of spaCy Docs, one per sentence
    #returns the Docs whose ROOT token is marked as 3rd person singular

    third_person_phrases=[] #phrases in the 3rd person

    for phrase in text:
        for token in phrase:
            if token.dep_=='ROOT':
                if token.morph.get('Person')==['3'] and token.morph.get('Number')==['Sing']:
                    third_person_phrases.append(phrase)
    return third_person_phrases

In [18]:
third_person_phrases=get_roots_third_person(texts_docs)
print(len(third_person_phrases))

7855


# 6. Keeping only smaller sentences

Long sentences are harder for a model to learn, just as they are harder for humans to understand. Plain-English guidelines recommend an average sentence length of 15 to 20 words.

Source: https://techcomm.nz/Story?Action=View&Story_id=106 (citing Cutts, 2009; Plain English Campaign, 2015; Plain Language Association InterNational, 2015; Vincent, 2014).

Because of this, I excluded phrases with 30 or more tokens. Some tokens are punctuation, so this leaves a bit of a buffer and also allows for sentences that are a bit more complex.

This step is optional, but it removes almost 3000 of the third-person sentences (from 7855 to 5040).

In [20]:
small_phrases=[]
for i in texts_docs:
    if len(i)<30:
        small_phrases.append(i)

len(small_phrases)

12623

In [46]:
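third_person_phrases_small=get_roots_third_person(small_phrases)
print(len(third_person_phrases_small),'small 3rd person phrases')
#set() compares the Doc objects themselves, not their text, so this count cannot detect
#repeated sentences; duplicates are removed by text in section 10
print(len(list(set(third_person_phrases_small))),'small different 3rd person phrases')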


5040 small 3rd person phrases
5040 small different 3rd person phrases


# 7. Deleting phrases that didn't end in '.' 

I also deleted the phrases that didn't end with a '.', because they would not finish a thought. Some sentence boundaries are not detected correctly by this spaCy model; more complex models give more accurate phrase delimitation.

In [48]:
print('initial length',len(third_person_phrases_small))
#keep only the phrases whose last token is a full stop
kept=[phrase for phrase in third_person_phrases_small if phrase[-1].text=='.']
n=len(third_person_phrases_small)-len(kept) #number of phrases removed
third_person_phrases_small=kept
print(n)
print('final length',len(third_person_phrases_small))

initial length 5040
324
final length 4716


# 8. Remove sentences where the subject is 'it'

In [50]:
n_neutral_small_phrases=[] #phrases whose root subject is not marked as neuter


for idx in range(len(third_person_phrases_small)):
    for token in third_person_phrases_small[idx]:
        if token.dep_=='ROOT':
            root_child=[word for word in token.children]
            for t in root_child:
                #'nsubj' in t.dep_ also matches 'nsubjpass'
                if 'nsubj' in t.dep_ and t.morph.get('Gender')!=['Neut']:
                    n_neutral_small_phrases.append(third_person_phrases_small[idx])

In [51]:
print(len(n_neutral_small_phrases))

4354


# 9. Selecting "allowed" subjects (nouns, pronouns and proper nouns)

One way to reduce the list of sentences further is to restrict what can be a subject: proper nouns (even though these may be, for example, names of countries or cities), pronouns such as she/he/her, and common nouns that refer to people, such as "the mother".
+ proper nouns: `token.pos_=='PROPN'`
+ she/he: `token.morph.get('PronType')==['Prs'] and (token.morph.get('Gender')==['Masc'] or token.morph.get('Gender')==['Fem'])`
+ common nouns: `token.pos_=='NOUN'` (for example, in "her father" spaCy identifies "father" as the subject); these are later checked against a list of words

I also separated the phrases whose subject is a PROPN from the others, because we may want to replace these names with names from a predefined list.

In [4]:
def get_root_subjects(doc):
    for token in doc:
        if token.dep_=='ROOT':
            root_child=[word for word in token.children]
            for t in root_child:
                if 'nsubj' in t.dep_:
                    return t
                    #returns a token
    

In [60]:
PROPN=[]
pronouns=[]
nouns=[]

for idx in range(len(n_neutral_small_phrases)):
        root_subject=get_root_subjects(n_neutral_small_phrases[idx])
        if root_subject.pos_=='PROPN':
            PROPN.append(n_neutral_small_phrases[idx])
        elif root_subject.pos_=='NOUN':
            nouns.append(n_neutral_small_phrases[idx])
        elif root_subject.morph.get('PronType')==['Prs'] and (root_subject.morph.get('Gender')==['Masc'] or root_subject.morph.get('Gender')==['Fem']):
            pronouns.append(n_neutral_small_phrases[idx])

print('the list of phrases with proper nouns as subjects has {} elements'.format(len(PROPN)))
print('the list of phrases with common nouns as subjects has {} elements'.format(len(nouns)))
print('the list of phrases with he/she pronouns as subjects has {} elements'.format(len(pronouns)))

the list of phrases with proper nouns as subjects has 1956 elements
the list of phrases with common nouns as subjects has 1446 elements
the list of phrases with he/she pronouns as subjects has 759 elements


In [61]:
#saving the "raw" three lists

with open('PROPN_phrases.pkl', 'wb') as f:
    pickle.dump(PROPN,f)

with open('pronouns_phrases.pkl', 'wb') as f:
    pickle.dump(pronouns,f)

with open('nouns_phrases.pkl', 'wb') as f:
    pickle.dump(nouns,f)

## ======================= 

## 9.1 Choosing which of the subjects seem useful

In [2]:
import pickle

with open('PROPN_phrases.pkl', 'rb') as f:
    phrases_PROPN=pickle.load(f) 

with open('pronouns_phrases.pkl', 'rb') as f:
    phrases_pron=pickle.load(f) 

with open('nouns_phrases.pkl', 'rb') as f:
    phrases_nouns=pickle.load(f) 

With these lists I chose which nouns and proper nouns could be useful for the dataset, to narrow down the phrases that later have to be chosen by hand. To do that, I printed the subjects in each list and picked by hand the ones that seem to refer to persons/characters.

In [5]:
subj_nouns=[get_root_subjects(i) for i in phrases_nouns]
print(subj_nouns)

[crew, crew, story, Victor, Creature, Creature, Creature, Creature, Creature, Victor, father, ice, father, father, Part, monster, monster, essay, family, edition, crew, crew, story, Victor, Creature, Creature, Creature, Creature, Creature, Victor, father, ice, father, father, Part, monster, monster, essay, family, edition, Pride, wife, manner, housekeeper, novel, theme, Pride, Marriage, marriage, Inheritance, Pride, behaviour, sequel, girl, legacy, weather, entry, chapter, caterpillar, procession, biographer, critic, case, brand, Wonderland, scholar, binding, film, play, production, item, Fitzgerald, father, theme, neighbor, difference, difference, novel, work, contest, decision, repudiation, adaptation, episode, man, revelation, jury, thought, Resurrection, Resurrection, humour, opposite, Resurrection, sea, poisoning, Darkness, shadow, blood, dictum, play, nanny, Torvald, letter, Torvald, shift, story, sub, approaches, hunter, probity, protagonist, whale, Starbuck, leader, pursuit, ca

In [72]:
useful_nouns_subject=['Victor','father','wife','girl','Fitzgerald','man','Torvald','Tashtego','Starbuck','woman',
'mother','teacher','Farson', 'Browning','boy','sister','Utterson','Lanyon','boy','Jaggers','Estella','aunt','Jane','Rhys',
'Hester','Heart','charwoman','Marlow','Svidrigailov','Cassedy','Huck','Sanders','prince','Murry','Swinburne',
'Anatole','Deasy','Anatole','Chryses','Achilles','Diomedes','Hector','Thetis','Kleos','goddess','Agamemnon',
'Scrooge','Danglars','Douglass','king','Fantine','Thénardier','Marius','grandfather','Eurycleia',
'Penelope','Hobbes','Léonce','Reisz','Adèle','Chopin','Treatise','aunt','Vronsky','Levin','Eliza','Eva','husband',
'Crisóstomo','Elías','Fagin','Cephalus','Pygmalion','Gyges','brother','Eliza','hero','heroine','Watson',
'woman','gentleman','Sissy','uncle','overman','Wolper']
#housekeeper? (is it neutral?) hunter leader carpenter narrator beggar bartender president precursor
#assistant character friend minister accountant manager clerk author 
#child
#Dracula is both the character and the book's name (check each phrase) - this may also happen with Frankenstein
#is maid always feminine? is priest always male?
#chairman/woman - maybe we can say "nouns that contain man/woman" -> maybe this doesn't work because "woman" contains the word "man"
#Fate: check whether it is the common noun or a proper noun (it appears capitalised - maybe I'm confusing this with Faith); Hope
#soldier: at the time this was almost certainly a male but now it isn't, keep it?
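
The "man/woman" idea from the comments can be tried by checking the longer ending first. This is only a sketch of that idea: `gender_from_suffix` is a hypothetical helper and is not used in the selection above, which relies on the hand-written list.

```python
def gender_from_suffix(noun):
    """Guess 'Fem', 'Masc' or None from a -woman / -man ending (rough heuristic)."""
    word = noun.lower()
    if word.endswith('woman'):  # must be tested first: 'woman' also ends in 'man'
        return 'Fem'
    if word.endswith('man'):
        return 'Masc'
    return None


for noun in ['chairman', 'chairwoman', 'charwoman', 'gentleman', 'teacher', 'human']:
    print(noun, gender_from_suffix(noun))
```

Testing 'woman' before 'man' solves the overlap, but the heuristic still mislabels words such as "human", so a manual check remains necessary.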

In [73]:
nouns_after=[]
for phrase in phrases_nouns:
    if get_root_subjects(phrase).text in useful_nouns_subject:
        nouns_after.append(phrase)

print(len(nouns_after))


209


In [8]:
subj_PROPN=[get_root_subjects(i) for i in phrases_PROPN]
print(subj_PROPN)

[Frankenstein, Frankenstein, Walton, Victor, Victor, Victor, Victor, Victor, Victor, Clerval, Victor, Victor, Victor, Victor, Victor, Victor, Walton, Victor, Walton, Mary, Ovid, Blackwell, Prometheus, Toro, Frankenstein, Frankenstein, Walton, Victor, Victor, Victor, Victor, Victor, Victor, Clerval, Victor, Victor, Victor, Victor, Victor, Victor, Walton, Victor, Walton, Mary, Ovid, Blackwell, Prometheus, Toro, Bennet, Elizabeth, Elizabeth, Collins, Darcy, Bennet, Collins, Collins, Charlotte, Elizabeth, Jane, Elizabeth, Catherine, Fitzwilliam, Elizabeth, Darcy, Darcy, Elizabeth, Elizabeth, Elizabeth, Wickham, Lydia, Bingley, Catherine, Elizabeth, Darcy, Elizabeth, Bennet, Elizabeth, Catherine, Austen, Joyce, Impressions, Reynolds, Breen, Adventures, Alice, Alice, Mouse, Alice, Rabbit, Alice, Alice, Alice, Alice, Alice, Cat, Hatter, Alice, Alice, Queen, Turtle, Turtle, Alice, Alice, Alice, Gardner, Duck, Dormouse, Turtle, Alice, Alice, Alice, Alice, Alice, Wilde, Gatsby, Fitzgerald, Gatsb

In [53]:
useful_propn_subject=['Walton','Victor','Clerval','Mary','Blackwell','Toro','Bennet','Elizabeth','Collins','Darcy',
    'Charlotte','Jane','Catherine','Fitzwilliam','Wickham','Lydia','Bingley','Catherine','Austen','Joyce','Reynolds',
    'Alice','Tom','Jordan','Myrtle','Nick','Daisy','Fitzgerald','Gatsby','Wilde','George','Bechtel','Buchanan','Marx',
    'Levy','Lorry','Evrémonde','Marquis','Gaspard','Carton','Darnay','Solomon','Defarge','Jerry','Manette',
    'Lucie','Defarge','Darnay','Carlyle','Simon','Törnqvist','Nora','Kristine','Krogstad','Torvald','Rank',
    'Nora','Kristine','Mencken','Stoddart','Dorial', 'Basil','Alan','James','Ishmael','Ahab','Queequeg','Stubb','Pip',
    'Dick','Bryant','Bezanson','Wright','Arvin','Matthiessen','Melville','Bryant','Milder','Gilman','Lanser','Treichler',
    'Horowitz','Edelstein','Harker','Lucy','Mina','Helsing','Showalter','Redmond','Enfield','Jekyll','Lanyon','Poole',
    'Utterson','Hyde','Wright','Havisham','Joe','Biddy','Herbert','Molly','Drummle','Jaggers','Magwitch','Jane','Reed',
    'Brocklehurst','Temple','Helen','Rochester','Rivers','John','Hester','Chillingworth','Pearl','Dimmesdale','Conrad',
    'Hochschild','Marlow','Nylander','Samsa','Sudau','Rubio','Drüke','Nabokov','Frank','Snitkina','Raskolnikov',
    'Marmeladov','Sonya','Razumikhin','Porfiry','Dunya','Mikolka','Luzhin','Lebezyatnikov','Svidrigailov','Dostoevsky',
    'Huck','Jim','Loftus','Polly','Twain','Hearn','Alberti','Finn','Eltis','Ellmann','Aynesworth', 'Wilde','Jack',
    'Bracknell','Gwendolen','Algernon','Gwendolen','Bracknell','Earnest','Foster','Edwards','Carby','Anderson',
    'Sanders','Gilbert','Ulysses','Joyce','Stephen','Bloom','Mulligan','Gerty','Pierre',
    'Boris','Rostov','Drubetskoy','Andrei','Denisov','Nikolai','Hélène','Pierre','Natasha','Rostov','Bolkonsky','Annenkov',
    'Strakhov','Dunnigan','Bagnall','Moore','Walden','Agamemnon','Chryses','Odysseus','Thetis','Aphrodite',
    'Athena','Diomedes','Zeus','Nestor','Poseidon','Polydamas','Hera','Achilles','Patroclus','Hephaestus','Hector','Priam',
    'Homer','Arnold','Pan','Barrie','Peter','Wendy','Robertson','Maimie','Darling','Lily','Bell','Hook','Smee','Faria',
    'Dantès','Mondego','Bertuccio','Carderousse','Andrea','Dorothy','Glinda','Henry','Oz','Beth','Jo','Meg','Laurence',
    'Brooke','Laurie', 'Meg', 'Beth','Amy','Lizzie','Saxton','Alcott','Carol','Scrooge','Irving','Kelly','Jim',
    'Potter','Joe','Becky','Douglas','Hugo','Myriel','Valjean','Fantine','Javert','Thénardier','Marius','Cosette',
    'Gavroche','Enjolras','Perry','Karamazov','Pavlovich','Fyodorovich','Alyosha','Snegiryov','Smerdyakov','Grushenka',
    'Katerina','Zosima','Dmitri','Ilyusha','Alyosha','Snegiryov','Ivan','Kolya','Quixote','Fernando','Sancho','Anne',
    'Cervantes','Athena','Telemachus','Odysseus','Penelope','Hobbes','Doyle','Edna','Robert','Mary',
    'Tully','Zuckert','Kenny','Emma','Knightley','Byrne','Henry','Karenin','Levin','Vronsky','Stiva','Bartlett',
    'Heathcliff','Dean','Earnshaw','Hindley','Catherine','Edgar','Cathy','Lockwood','Wiltshire','Scott','Alejandro',
    'Stewart','Alejandro','Eliza','Clare','Eva','Legree','Shelby','Dámaso','Crisóstomo','Guevarra','Salví','Tiago','Elías',
    'Guevarra','María','Twist','Oliver','Brownlow','Fagin','Nancy','Sikes','Bumble','Brownlow','Nancy','Rose','Bumble',
    'Bates','Buck','Mercedes','Thornton','Pizer','Gianquitto','Higgins','Eliza','Doolittle','Higgins',
    'Polemarchus','Glaucon','Adeimantus','Gulliver','Mendez','Pedro','Jacobs','Brent','Martha','Jacobs',
    'Benjamin','William','Benny','Ellen','Flint','Bruce','John','Faustus','Lucifer','Marianne','Brandon','Steele',
    'Edward','Pollock','Favret','Holmes','Drebber','Crane','Rudkus','Jonas','Jurgis','Gradgrind','Bounderby','Stephen',
    'Sparsit','Tom','Blackpool','Louisa','Gradgrind','Harthouse','Cecilia','Bitzer','Rachael','Bazalgette','Mills','Pooh',
    'Hanyu','Charlie','Ashbee','Marcus','Mary','Weatherstaff','Fauntleroy','Gerzina','Masson','Burnett','Virgil','Dante',
    'Beatrice','David','Peggotty','Dora','Agnes','Hollington','Leavis','Needham','Bottiglia','Dexter',
    'Anne','Henrietta','Benwick','Clay','Russell','Croft','Isagani','Basilio','Simoun']

maybe_useful_propn_subject=['Frankenstein','Mouse','Rabbit','Cat','Hatter','Queen','Turtle','Gardner','Duck',
    'Dormouse','Turtle','Pequod','Dracula','Witch','Lion','Ghost']


#some names are preceded by Miss & co - change this if the gender changes
#in Alice in Wonderland there are several characters named after animals; maybe these will be useful
# because the phrases still work when these names are exchanged for human names - same for The Wizard of Oz

In [54]:
PROPN_after=[]
PROPN_after_maybe=[]

for phrase in phrases_PROPN:
    if get_root_subjects(phrase).text in useful_propn_subject:
        PROPN_after.append(phrase)
    elif get_root_subjects(phrase).text in maybe_useful_propn_subject:
        PROPN_after_maybe.append(phrase)

print(len(PROPN_after))
print(len(PROPN_after_maybe))

1230
35


## 9.2 Remove subjects that only appear once

I also concluded that a name that appears only once is probably not related to the plot of the story; it is more likely someone who studied the text commenting on it, which is not what we want. I therefore removed those names from the list.

In [55]:
def single_subj(list_of_subjects,list_docs):
    #list_of_subjects is a list of strings that correspond to the root subjects of the spaCy Docs (sentences)
    #list_docs is a list of spaCy Docs (sentences)
    single=[]
    list_subj=[get_root_subjects(i) for i in list_docs]
    for nn in list_of_subjects:
        n=0
        for i in range(len(list_subj)):
            if list_subj[i].text==nn:
                n+=1
        if n==1:
            single.append(nn)
    return single
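
The same count can be done in a single pass with `collections.Counter`. The sketch below works on plain strings; `subjects_seen_once` is a hypothetical name, and in this notebook the second argument would be something like `[get_root_subjects(d).text for d in PROPN_after]`.

```python
from collections import Counter


def subjects_seen_once(candidates, subject_texts):
    """Return the candidates that occur exactly once in subject_texts."""
    counts = Counter(subject_texts)
    # dict.fromkeys drops repeated candidates while keeping their order
    return [name for name in dict.fromkeys(candidates) if counts[name] == 1]


# Invented subject list for the demo.
subjects = ['Victor', 'Walton', 'Victor', 'Ovid', 'Walton', 'Victor']
print(subjects_seen_once(['Victor', 'Walton', 'Ovid', 'Mary', 'Ovid'], subjects))
```

Unlike `single_subj`, this version reports each name once even when the candidate list repeats it.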

In [56]:
propn_after_single=[]
single_subj_propn=single_subj(useful_propn_subject,PROPN_after)
for phrase in PROPN_after:
    if get_root_subjects(phrase).text not in single_subj_propn:
        propn_after_single.append(phrase)

print('before removing there were {} phrases with proper nouns as subjects and after removing there are {} '.format(len(PROPN_after), len(propn_after_single)))

before removing there were 1230 phrases with proper nouns as subjects and after removing there are 1065 


# 10. Deleting duplicated phrases

The following three lists are the ones that can be used to choose the phrases by hand:
+ nouns_after - has 209 phrases (some of which are proper nouns that spaCy wrongly tagged as common nouns)
+ propn_after_single - has 1065 phrases (some don't correspond to the plot but to researchers commenting on the meaning of the plots)
+ phrases_pron - has 759 phrases, all the phrases that have "he/she" as the subject

In [74]:
def equal_phrases(list_of_phrases):
    same_phrases=[] #every phrase whose text also appears at another index (all copies, not only the later ones)

    for phrase1_idx in range(len(list_of_phrases)):
        for phrase2_idx in range(len(list_of_phrases)):
            if list_of_phrases[phrase1_idx].text==list_of_phrases[phrase2_idx].text and phrase2_idx!=phrase1_idx:
                same_phrases.append(list_of_phrases[phrase1_idx])

    same_phrases=list(set(same_phrases))
    
    return same_phrases
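
As written, `equal_phrases` appears to collect every copy of a repeated sentence, so removing its result from a list drops the first occurrence as well. If one copy should stay in the dataset, a keep-first pass is an alternative. The sketch below is not the method used in this notebook; `keep_first_occurrence` is a hypothetical name, and `SimpleNamespace` objects stand in for spaCy Docs because only `.text` is read.

```python
from types import SimpleNamespace


def keep_first_occurrence(phrases):
    """Return the phrases in order, skipping any whose text was already seen."""
    seen = set()
    kept = []
    for phrase in phrases:
        if phrase.text not in seen:
            seen.add(phrase.text)
            kept.append(phrase)
    return kept


# Stand-ins for spaCy Docs with invented sentences.
docs = [SimpleNamespace(text=t) for t in ['A runs.', 'B sleeps.', 'A runs.', 'A runs.']]
kept = keep_first_occurrence(docs)
print([d.text for d in kept])
```

This also avoids the nested loop, so it runs in a single pass over the list.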

In [75]:
eq_propn=equal_phrases(propn_after_single)
eq_nouns=equal_phrases(nouns_after)
eq_pronouns=equal_phrases(phrases_pron)

#takes a bit of time

In [76]:
print(len(eq_propn))
print(len(eq_nouns))
print(len(eq_pronouns))


0
12
0


In [61]:
print(len(propn_after_single))

for i in eq_propn:
    propn_after_single.remove(i)

print(len(propn_after_single))

1022


ValueError: list.remove(x): x not in list

In [77]:
for i in eq_nouns:
    nouns_after.remove(i)


In [62]:

for i in eq_pronouns:
    phrases_pron.remove(i)

In [78]:
print(len(propn_after_single))
print(len(nouns_after))
print(len(phrases_pron))

1022
197
741


After the deletion of the duplicates:
+ nouns_after - has 197 phrases
+ propn_after_single - has 1022 phrases
+ phrases_pron - has 741 phrases

# 11. Highlighting of root verb and subject (function)

To make the identification of the subjects easier, the following function returns a version of a phrase with the root verb and its subject highlighted.

In [37]:

def highlight_word(phrase,colour_verb,colour_subj):
    # phrase is a spaCy Doc
    # colour_verb and colour_subj are strings with rgb values (ex. 'rgb(155,217,230)')

    sent=[]
    root_verb=[root for root in phrase if root.dep_=='ROOT'][0]

    for token in phrase:
        if token.dep_=='ROOT':
            sent.append(" <span style='background: {}'>{}</span> ".format(colour_verb,token.text))
        elif 'nsubj' in token.dep_ and token in root_verb.children:
            sent.append(" <span style='background: {}'>{}</span> ".format(colour_subj,token.text))
        else:
            sent.append(token.text)

    
    return ' '.join(sent) #returns a string
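
To see the markup this kind of function produces without loading spaCy, here is a sketch on plain token strings. `highlight_tokens` is a hypothetical helper, the verb and subject positions are passed in by hand, and the call to `html.escape` is an addition that the function above does not make.

```python
from html import escape


def highlight_tokens(tokens, verb_idx, subj_idx, colour_verb, colour_subj):
    """Wrap the verb and subject tokens in coloured <span> tags."""
    parts = []
    for i, text in enumerate(tokens):
        safe = escape(text)  # protects against '<' or '&' in the scraped text
        if i == verb_idx:
            parts.append("<span style='background: {}'>{}</span>".format(colour_verb, safe))
        elif i == subj_idx:
            parts.append("<span style='background: {}'>{}</span>".format(colour_subj, safe))
        else:
            parts.append(safe)
    return ' '.join(parts)


# Invented sentence; index 0 is the subject and index 1 the root verb.
html_line = highlight_tokens(['She', 'writes', 'letters', '.'], verb_idx=1, subj_idx=0,
                             colour_verb='rgb(25, 108, 56)', colour_subj='rgb(188, 108, 37)')
print(html_line)
```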

In [38]:
# nouns_after
# propn_after_single 
# phrases_pron


from IPython.display import HTML, display

colour_verb='rgb(25, 108, 56)'
colour_subj='rgb(188, 108, 37)'

p=phrases_pron[0]

display(HTML(highlight_word(p,colour_verb,colour_subj)))

# 12. Selection of useful phrases

The highlighting function can be used to iterate over a list of spaCy Docs and display each highlighted phrase next to its index, which makes it easier to evaluate which phrases are useful for the dataset.
This selection is done by hand. A phrase is kept only if it is part of the plot; phrases are removed when they are:
+ spoken by someone unrelated to the plot
    + considerations about the author/book
+ citations ('this character said "this"')
    + some contain " " 
    + some contain you/me 
+ phrases starting with "Chapter x - ..."
+ errors
    + in phrases (the Rachel, wrong words (81 from PROPN))
    + phrases starting with lower case (they don't contain the entire thought)
    + phrases starting with numbers ("(number)")
+ confusing phrases, which usually have a '-' 

I wrote down the indexes of the phrases to remove in a list (because it is easier to write) and then copied the phrases at these indexes to a separate list, so that the removed sentences are not lost if the dataset changes.
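
A hand-written index list can contain a repeated or mistyped index without any error being raised, because a membership test simply never matches an index that does not exist. The sketch below splits a list by index and also reports the indexes that fall outside the list; `split_by_index` is a hypothetical helper, not the code used in the cells that follow.

```python
def split_by_index(items, indexes_to_remove):
    """Return (removed, kept, invalid_indexes) for a hand-written index list."""
    wanted = set(indexes_to_remove)  # a set ignores repeated indexes and is fast to query
    invalid = sorted(i for i in wanted if not 0 <= i < len(items))
    removed = [item for i, item in enumerate(items) if i in wanted]
    kept = [item for i, item in enumerate(items) if i not in wanted]
    return removed, kept, invalid


# Invented data: index 3 is repeated and 34 is a typo for "3, 4".
phrases = ['s0', 's1', 's2', 's3', 's4']
removed, kept, invalid = split_by_index(phrases, [1, 3, 3, 34])
print(removed, kept, invalid)
```

Checking that `invalid` is empty before saving is a cheap way to catch a missing comma in the index list.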

In [79]:
# nouns_after
# propn_after_single 
# phrases_pron

for i in range(0,len(nouns_after)):
    print(i)
    display(HTML(highlight_word(nouns_after[i],colour_verb='rgb(25, 108, 56)',colour_subj='rgb(188, 108, 37)')))

[Output: the indexes 0 to 196, each followed by the highlighted phrase. The rendered HTML is not preserved in this export.]


In [68]:
pronouns_indexes_rem=[11,31,48,49,56,63,67,72,73,74,75,93,94,111,112,145,146,147,148,149,150,195,214,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,234,235,236,247,298,337,339,340,350,351,352,353,354,359,395,412,424,451,460,461,462,463,464,465,466,469,470,471,472,473,507,515,561,569,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,595,596,598,602,603,670,685,686,687,700,701,727,732,738,740]
pronouns_del=[phrases_pron[i] for i in range(len(phrases_pron)) if i in pronouns_indexes_rem]
pronouns_keep=[phrases_pron[i] for i in range(len(phrases_pron)) if i not in pronouns_indexes_rem]


In [80]:
nouns_indexes_rem=[17,18,35,36,47,50,51,68,69,70,75,87,88,91,97,102,111,112,132,142,143,175,190 ]
nouns_del=[nouns_after[i] for i in range(len(nouns_after)) if i in nouns_indexes_rem]
nouns_keep=[nouns_after[i] for i in range(len(nouns_after)) if i not in nouns_indexes_rem]


In [69]:
# note: 832833 in the list below is out of range, so it removes nothing (832,833 was probably intended); 42 is listed twice
PROPN_indexes_rem=[24,25,26,27,33,41,42,42,44,45,46,47,48,49,50,73,74,80,88,115,145,146,144,143,147,148,150,151,152,153,154,155,156,157,158,159,184,297,298,300,301,302,303,304,314,315,324,326,327,328,329,330,331,392,397,400,412,414,415,416,417,418,420,422,423,427,436,437,440,497,513,514,561,571,573,575,588,589,597,598,602,603,604,605,606,634,636,635,637,638,639,666,667,700,749,779,793,795,796,813,814,815,816,817,818,825,827,829,830,831,832833,839,840,850,851,864,920,921,922,923,927,928,933,934,935,936,937,946,947,948,949,950,965,966,967,968,969,970,971,972,973,974,993,994,995,996,1003,1019 ]
PROPN_del=[propn_after_single[i] for i in range(len(propn_after_single)) if i in PROPN_indexes_rem]
PROPN_keep=[propn_after_single[i] for i in range(len(propn_after_single)) if i not in PROPN_indexes_rem]

In [81]:
print('PROPN phrases to delete: {} \n PROPN phrases to keep: {}'.format(len(PROPN_del),len(PROPN_keep)))
print('nouns phrases to delete: {} \n nouns phrases to keep: {}'.format(len(nouns_del),len(nouns_keep)))
print('pronouns phrases to delete: {} \n pronouns phrases to keep: {}'.format(len(pronouns_del),len(pronouns_keep)))

PROPN phrases to delete: 146 
 PROPN phrases to keep: 876
nouns phrases to delete: 23 
 nouns phrases to keep: 174
pronouns phrases to delete: 104 
 pronouns phrases to keep: 637


After this manual selection:
+ nouns_keep - has 174 phrases
+ PROPN_keep - has 876 phrases
+ pronouns_keep - has 637 phrases

# 13. Saving the final phrases

In [84]:
import pickle

with open('nouns_keep.pkl', 'wb') as f:
    pickle.dump(nouns_keep, f)

with open('PROPN_keep.pkl', 'wb') as f:
    pickle.dump(PROPN_keep, f)

with open('pronouns_keep.pkl', 'wb') as f:
    pickle.dump(pronouns_keep, f)
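
Pickled spaCy Docs may need a compatible spaCy installation to be loaded again, so a plain-text copy of the sentences can be a useful complement. The sketch below is an optional extra, not part of the original pipeline: `export_texts` and the JSON file name are hypothetical, and `SimpleNamespace` objects stand in for spaCy Docs because only `.text` is read.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def export_texts(phrases, path):
    """Write the `.text` of each phrase to a JSON file and return the list written."""
    texts = [phrase.text for phrase in phrases]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(texts, f, ensure_ascii=False, indent=2)
    return texts


# Stand-ins for spaCy Docs with invented sentences.
demo = [SimpleNamespace(text='Thénardier leaves.'), SimpleNamespace(text='She returns.')]
out_path = os.path.join(tempfile.mkdtemp(), 'nouns_keep.json')
written = export_texts(demo, out_path)

with open(out_path, encoding='utf-8') as f:
    reloaded = json.load(f)
print(reloaded)
```

The JSON copy loses the spaCy annotations, so it complements the pickles rather than replacing them.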