# Corenplot Summaries treatment

The aim of this script is to extract the lexic (words) used in each summaries associated to film characters. It will further be used to analyze the lexicon associated to each gender.

The Corenplot dataset is composed of 42'306 summaries in XML format, each of them have already been processed with a NLP (Stanford NLP). 

The following informations are retrieved from the summaries:
- The sentence composing the summaries
- The words dependencies, id est, the grammatical architecture of a single sentence, thus the association, in form of governor/dependent, of the words.
- The coreference, id est, the links between words of different sentence

The following script can be categorized as "messy" due to poor knowledge of the author in modules offering capacity to read xml files. The script is composed of various function to increase readablity, the main part is a the bottom and consist of a simple loop on all the summaries, aka, all the xml files. The script lack of optimization due to time restriction and thus needs approximately 11 hours to run.

### Modules

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib as plt
import lxml 
from bs4 import BeautifulSoup

import xmltodict 
import pprint
import json
from tqdm import tqdm

### Coreference treatment

**Goal**: Create a pandas dataframe containing in numeric style each coreference, thus the sentence number and the word position number for each governor word associated to a dependent word. In other words, it is a relation table which contain linked words.

**How**: The function first loop on the coreference tag in the xml file, for each coreference tag, one representative word with multiple dependent words. The governor/representative word is a subject gramatically speaking, for example, it allows to find to who is attributed a certain pronouns: "she" in the 5th sentence is associated to "Marie" in 2nd sentence. The benefit from this analysis is to obtain character name (and furtherdown the gender) of sentences where no explicit names are used.

**Comment**: This function is associated to the link_words function

In [2]:
def coreference_treatment(Bs_data):
    
    all_link = pd.DataFrame(columns=["representative_sentence","representative_head","dep_sentence","dep_head"],dtype='object')
    
    #select the bottom part of the xml file => coreference which represent link between sentences (var = pointer)
    coreference = Bs_data.find_all('coreference')

    #loop on each coreference, which is made of multiple sentence, one representative and one or more dependant sentences
    for index in np.arange(1,len(coreference)):
    
        #select a specific coreference [index]
        linked_sentence = np.array(coreference[index].find_all("sentence"))
        #the representative sentence is alwais the first sentence. The representative normally contain the character name (or subject)
        representative = linked_sentence[0]
        #all the other sentences from 1 coreference are the dependant
        dep = linked_sentence[1:]
    
        #the heads represent the word of each sentence that have a relation: ex. "representative_head" = Arthur & dep_head = "him"
        linked_sentence_head = np.array(coreference[index].find_all("head"))
        rep_head=linked_sentence_head[0]
        dep_head=linked_sentence_head[1:]
    
        #concatenation of dependant sentence with dependant head
        temp = np.concatenate([dep,dep_head],axis=1)
    
        #concatenante each dependant sentence + head to a single representant sentence+head  
        temp2 = np.concatenate([np.full([dep.shape[0],1],representative),np.full([dep.shape[0],1],rep_head)],axis=1)
    
        #concatenate all together
        temp = np.concatenate([temp2,temp],axis=1)
    

        #transform in DataFrame for simpliciti
        temp = pd.DataFrame(data=temp,columns=["representative_sentence","representative_head","dep_sentence","dep_head"])
        #save result
        all_link = all_link.append(temp,ignore_index = True)
    
    return all_link

### Dependence treatment

**Goal**: As mentionned above, the dependence are the words connection from a single sentence. The aim of this function is to create a dataframe containing each words association in relation table. It is similar to coreference treatment, but here the words are directly implemented in the table (not numeric values)

**How**: The collapsed-ccprocessed dependencies tag are used. It consist of the third level of dependencies (the first and second one are steps, and each of them is processed by the NLP). All the governor and dependant words are associated, and the type of gramatical relationship is also saved. To achieve this, a loop is first done on each dependance that are further appended to a global dataframe for a single summary. 

In [3]:
def dependence_treatment(Bs_data):

#cc-processed reference
    dependence_by_summary = pd.DataFrame(columns=['type','gov','dep'],dtype='object')
    dependence = Bs_data.find_all('collapsed-ccprocessed-dependencies')
    
    #loop on each dependence (collapsed) tag
    for dep_index in np.arange(0,len(dependence)):
        
        #get the dependent words and the govenor words, the dependency type matrix is created but not filled
        linked_words_attr = np.zeros([len(dependence[dep_index].find_all('dep')),1],dtype="object")
        linked_words_gov = pd.DataFrame(data=np.array(dependence[dep_index].find_all("governor")),dtype="object")
        linked_words_dep = pd.DataFrame(data=np.array(dependence[dep_index].find_all("dependent")),dtype="object")
        
        #filling of the dependency type using a loop
        for attribute_index in np.arange(0,len(dependence[dep_index].find_all("dep"))):
            linked_words_attr[attribute_index] = dependence[dep_index].find_all("dep")[attribute_index].attrs['type']
            
        linked_words_attr = pd.DataFrame(data=linked_words_attr,dtype='object')
        
        #concatenation of each table into a single one
        dependence_table = pd.concat([linked_words_attr,linked_words_gov,linked_words_dep],axis=1)
        dependence_table=dependence_table.set_axis(['type','gov','dep'],axis=1)
        
        #save the result
        dependence_by_summary=dependence_by_summary.append(dependence_table,ignore_index=True)
        
    return dependence_by_summary

### Sentence treatment

**Goal**: The complex-to-understand function has multiple roles. The main goal is to analyze each sentence of the summary and to retrieve any character name defined as a "Person" object by the NLP. It create a dataframe where each row is a sentence, with all the Lemmas of the sentence and also grammatical characteristic lemmas regrouped (verb, nouns, etc). The second goal is to create a matrix (similar to a BOW) containing each sentence by rows and each words of the sentences by columns.

**How**: First a loop is computed on each sentence of a summary. Then, for each sentence, all the words, the grammatical structure type, the object type and the lemmas is retrieved. The "Person" object are found and further processed: If a person have a first and second name that are subsequent, the function record it as a single person (eg. Harry Potter), this is done by comparing the grammatical type of subsequent "Persons" placement in the sentence (if there is a "Person" at word_id=11 and word_id=12). Then the function also handle the fact that there can be multiple characters in one sentence and thus create a single string to save the characters name (Harry Potter & Hagrid). If no character is found, it simply add "NoCharacter" in the character column. Finally, the words of a certain gramatical type are grouped together to be saved in a single cell depending of the word type.


**Comment**: This the messy function as it was the first created and the architecture works, but has requiered adjustement all along the script creation. It can be easily seen that the python/xml file handling was not of high performance. 


In [4]:
#The complex-to-understand part, here the goal is to concatenate each sentence word and to extract a potential character in the sentence


def sentence_treatment(Bs_data):
    
    #compute the number of sentence in a summary
    number_of_sentence = len(Bs_data.find_all('sentence',id=True))

    #initialization of the variable which will store the results
    all_sentence = pd.DataFrame(columns=["sentence_id","Lemma","Characters","multiple_character","Full_sentence"],dtype="object")

    #the sentence matrix is used to have each sentence by row and each word by column
    sentence_matrix= np.zeros([500,500],dtype="object")
    sentence_matrix.fill("NaN")
    
    #dataframe to save for each sentences, words of a specific type: ex adjective
    words_type_saving = pd.DataFrame(columns=np.array(["sentence_id","JJ","NN","NNP","NNS","PRP","PRP$","VB","VBG","VBP","VBZ"]),dtype='object')


    #loop on each sentence of the summary
    for index in range(number_of_sentence):
    
         #select sentence number [index]
        sentence = Bs_data.find_all('sentence')[index]
    
        #create array of each useful variable : 
        # - The sentence's words
        # - The kind of grammatical structure (POS), ex: subject, verb, pronouns, etc
        # - The object type (NER), ex: DATE, PERSON, ORGANIZATION, etc
        # - The associated lemma (for each word): beeing => be

        words = np.array(sentence.find_all("word"))
        pos = np.array(sentence.find_all("POS"))
        ner = np.array(sentence.find_all("NER"))
        lemma = np.array(sentence.find_all("lemma"))
    
        #concatenate all those variable in an array
        con = np.concatenate([words,lemma,pos,ner],axis=1)
    
        #create a dataframe for comodity
        temp_sentence = pd.DataFrame(data=con,columns=["Word","Lemma","Grammar","Object"])
        # add an index for each sentence in one summary
        temp_sentence["sentence_id"] = index
        # as in the file the word's index start from 1 (for coreferences), the creation of this index is used after
        temp_sentence["Idx"] = temp_sentence.index +1

        #select the potential characters in a sentence, 
        characters = temp_sentence[temp_sentence.Object == "PERSON"]

        #tricks to put together characters name made of multiple word: ex Arthur Lambert
        # creation of a second index for words with a increased step of 1
        idx2 = characters.Idx.values +1 
        charac2 = characters.copy()
        charac2.Idx = idx2

        #merge the doubled character dataframe with a difference of one in word index: Thus, each characters name that are 
        # spread in two words are linked together and merged
        merged_characters_for_people_with_forname_and_last_name = pd.merge(characters,charac2,on="Idx")
    
        #------------------
        #Here begins a set of conditions to determine what sort of name the NLP has found:
        # - only characters with a forname
        # - characters with forname and lastname
        # - multiple characters
        # And also the string work to have readable characters name
        
        only_forname = False
        
        # As the merge occurs for successive words, if the characters only have a single-word name, the merge table length is 0
        # Also, it is ensured with the second condition that there is some characters found previously (in characters dataframe)
        if len(merged_characters_for_people_with_forname_and_last_name)==0 and len(characters) > 1:
            only_forname = True
    
        # multiple characters
        if (len(characters)>1 | only_forname):
            
            # multiple characters with mulitple words: ex Duke Henry
            if len(merged_characters_for_people_with_forname_and_last_name)!=0:
                
                # Merging "Duke" + " " + "Henry"
                full_name = merged_characters_for_people_with_forname_and_last_name.Word_y +\
                " " + merged_characters_for_people_with_forname_and_last_name.Word_x
                
                # Merging the grammatical object to ensure next that it correspond to names: ex not "Henry" + " " + "and"
                both_grammar = merged_characters_for_people_with_forname_and_last_name.Grammar_x + \
                merged_characters_for_people_with_forname_and_last_name.Grammar_y
                
                #condition computation to get full names :  "Henry" + " " + "and" would be "NNPAND"
                both_subjects_words = (both_grammar == "NNPNNP")
                
                #selection of full names (only Subject words) into the dataframe of potential full names
                full_name=full_name[both_subjects_words]
                
                #String manipulation to get each full names separeted by "&"
                sentence_characters = str(full_name.values).replace("[","").replace("]","").replace("'"," ").replace("  "," &")
            #multiple characters with  no composed/full names
            else:
                
                # if there is more than one characters, just ensure there is no duplicate
                if len(characters)>1:
                    characters = characters[characters.Word.duplicated(keep='first')==False]
                
                # find characters with only fornames
                characters_list = str(characters.Word.values)
  
                # String concatenation
                sentence_characters = characters_list.replace("[","").replace("]","").replace("'"," ").replace("  "," &")
            
            #condition about NLP not find a characters name even if there is one (Object != PERSON | ORGA)
            if len(sentence_characters) ==0:
                sentence_characters =  "NoCharacter"
        else:
            # normally only one character
            characters_list = str(characters.Word.values)
        
            sentence_characters = characters_list.replace("[","").replace("]","").replace("'"," ").replace("  "," &")
        
    
        # last condition to ensure that if there is no characters, the "NoCharacter" is assigned
        if len(characters) == 0:
            sentence_characters = "NoCharacter"


        # get sorted each type of words
        words_type = pd.DataFrame(data=np.concatenate([words,pos],axis=1),columns=["Words","Type"],dtype='object')
        #group the words of each type by type
        words_type_grouped = words_type.groupby('Type')['Words'].apply(', '.join).reset_index()
        #some manipulation to get a dataframe with type as columns
        words_type_grouped=words_type_grouped.set_index(words_type_grouped.Type).transpose().drop(labels="Type",axis=0)
        
        #save result in general table (thus for each sentences)
        words_type_saving = words_type_saving.append(words_type_grouped)
        words_type_saving["sentence_id"].iloc[index] = index
        
        #recreate the full sentence in single string (for conveniance while testing the script)
        full_sentence = temp_sentence.Word
        final_sentence = ""
        for word in full_sentence:
            final_sentence += " " +word
    
    
        # Join the lemma of each sentence in a single string
        temp_sentence = temp_sentence.groupby('sentence_id')['Lemma'].apply(', '.join).reset_index()
        
        #assigne characters to a sentence
        temp_sentence['Characters'] = sentence_characters
        test =temp_sentence     
        
        #assign the full sentence (only used for commodity while doing the algorithm)
        temp_sentence['Full_sentence'] = final_sentence
    
    
        #computing the sentence matrix which contain each sentence on a row and each word of this sentence sorted in columns
        sentence_matrix[index,:len(words)] = words.reshape([1,len(words)])
    
    

        #append the result of one sentence to the other ones of a same summary
        all_sentence = all_sentence.append(temp_sentence)
        
        #append the words of specific types
    all_sentence = pd.merge(all_sentence,words_type_saving,on="sentence_id")
        
        
    return all_sentence,sentence_matrix,test

### Link Words

**Goal**: From the coreference relation table (numeric) create a table of actual words (string). The coreference that do not contain any character name (based on upper case letter) are droppped.

**How**: Use the relation table with sentence number and word placement into the sentence number combined to the senence matrix (row = sentence, column = word placement) to create the completed links sentence

In [5]:
#recreate the links 

def link_words(all_link,sentence_matrix):

# initialization
    all_link_completed = all_link.copy()
    all_link_completed["rep_name"] = "NaN"
    all_link_completed["dep_name"] = "NaN"

    #attribue for each link the associated words
    for row in range(len(all_link)):
        all_link_completed["rep_name"].iloc[row] = sentence_matrix[int(all_link["representative_sentence"].iloc[row])-1,int(all_link["representative_head"].iloc[row])-1]
        all_link_completed["dep_name"].iloc[row] = sentence_matrix[int(all_link["dep_sentence"].iloc[row])-1,int(all_link["dep_head"].iloc[row])-1]

    
    #remove rows with same representative and dependant words
    all_link_completed = all_link_completed[all_link_completed.rep_name != all_link_completed.dep_name]

    #remove rows where the representative does not start with an uppercase letter (= "Nom Propre")
    to_drop = np.full([len(all_link_completed),1],-1)
    
    #loop to save links with reference to some names
    for index in range(len(all_link_completed)):
    
        #condition based on character name starting with upper case letter, thus dropping links where no names appears as representative
        row_not_abandonned=0 #variable just used to check performance
        if str(all_link_completed.rep_name.iloc[index])[0].isupper():
            row_not_abandonned +=1
        
        else:
            to_drop[index] = all_link_completed.index[index]
        
    #drop links without characters names
    all_link_completed= all_link_completed.drop(np.unique(to_drop)[1:])  
    
    return all_link_completed

### Character assignment to sentence without any explicit character name found using the coreference

**Goal**: When a sentence has no character ("NoCharacter"), the function looks into the coreference links completed (string, see above) to detect if the sentence is linked to another one where a character has been found

**How**: Loop on sentence without character, find the corresponding sentence in the coreference links, if there is one, assign the character name to the sentence, otherwise assign "NotFound"

In [6]:
def assign_characters_to_sentence_without_explicit_character_name(all_sentence,all_link_completed):
    # this part consist of attributing character names to sentence where no character was found by the NLP (thus, no PERSON object)
    all_sentence_completed = all_sentence.copy()
    
    #loop on sentences without character: denoted "NoCharacter" as character name
    for no_charac_sentence in all_sentence[all_sentence.Characters == "NoCharacter"].sentence_id:
        
        #get the character name from the completed links table
        rep=all_link_completed[all_link_completed.dep_sentence == str(no_charac_sentence+1)].rep_name
    
    
        #if no character name was found in links before, "NotFound" is set as "character name" to this sentence
        if rep.empty:
            all_sentence_completed.iloc[no_charac_sentence].Characters = "NotFound"
        
        #otherwise the character name is inserted in table
        else:
            all_sentence_completed.iloc[no_charac_sentence].Characters =rep.values[0]
            
    return all_sentence_completed

## Assign gender from characters_metadata


**Goal**: Search for correspondances in character name found in the summary and character name in the character metadata (which regroup also the actors characteristics). If the name match, the gender of the actor is associated to the character from the summary.

**How**: First lists of words linked to a gender, or recurrent undefined (in term of gender) words, are created. These came from the first 1500 lines of the character metadata and was manually done. Next, the involved character are obtained by a match on the film id. Some work is done on both the involved character (from metadata) and the character from a summary. It consists of lowercasing each letter and to remove the potential undefined words in each character name (There was a lot of lost due to "Insp. Ron" beeing just "Ron" or "Inspector Ron" respectively in metadata or summaries. 
In a second time, a loop on each character and nested loop if there is multiple character is performed. First a direct match between the reworked character name is done. If it don't match, None is applied to gender and more conditions are applied:
- if the first word of the character contains a typical male name (eg Mr.)
- if the first word of the character contains a typicale female name (eg Mrs.)
- if the first word of the character name match with any of the metadata words (thus from Harry, if in the metadata it is registered as Harry Potter, it will match). This condition can be problematic if there is some undefined words not registered and if some characters of a same film have the same surname)
The gender are obtained if a match occurs and assigned to characters in list form


**Commment**: This is the only function that is not directly in the main, in this part of the "group_by_character" function (see below)

In [209]:
def assign_gender_from_characters_metadata(all_sentence_completed,characters_data,film_id):
    condition_to_look_into_metadata = 1

    
    #-------------------------------------------------------------------------------------------------------------------------------
    #the lists are based on character_name data 1500 first lines

    male_words = np.array(["dad","monsieur","mr","mr.","boy","man","king","prince","emperor","father","brother",'sir'])
    female_words = np.array(["mom","madame","mrs","mrs.","ms","ms.","miss","woman","girl","queen","princess","empress","nurse"\
                    ,"mother","lady","sister","wife"])
    undefined_words = np.array(["lieutenant","sgt","sgt.","commander","fbi","profiler","boss","detective","dr","dr.","doc","duke"\
                   ,"young","doctor","capt.","captain","jr.","judge","reverend","major","principal","professeur"\
                   ,"professor","inspector","general","lt.","officer","student","president","insp."])
    #-------------------------------------------------------------------------------------------------------------------------------

    #retrieve the characters from the character metadata using the film id
    involved_characters = characters_data[characters_data["Wikipedia_movie_ID"]==int(film_id)]
    
    #if the metadata dataframe is empty just return to main 
    if involved_characters.empty:
        return all_sentence_completed

    #-------------------------------------------------------------------------------------------------------------------------------
    #involved character pre-treatment
    if (all(pd.isna(involved_characters.Character_name))):
        condition_to_look_into_metadata = 0
        
    #split the characters name (if multiple words are composing it)
    inv_charac_name = involved_characters.Character_name.str.lower().str.split(" ",expand=True)
    #insert NaN for words in undefined_words
    inv_charac_name[inv_charac_name.isin(undefined_words)] = None
    #harmonisation of the NaN type => None (for object)
    inv_charac_name[inv_charac_name.isna()]=None
    #recreate a single string from characters (potential) multiple names
    inv_charac_name=pd.DataFrame(inv_charac_name.stack().reset_index())
    inv_charac_name = inv_charac_name.groupby(inv_charac_name.level_0)[0].apply(" ".join)
    #join the result to the initial table
    involved_characters=involved_characters.join(inv_charac_name,on=involved_characters.index).rename(columns={0:"Character_name_reworked"})

    #-------------------------------------------------------------------------------------------------------------------------------
    #character found in summaries pretreatment for comparison with character_metadata
    
    #same as above function for involved character but with a loop to tackle multiple characters
    summary_charac=pd.DataFrame(all_sentence_completed.Characters.tolist(),dtype="object")
    column_to_drop =  np.arange(summary_charac.shape[1])
    for columns in summary_charac.columns:
        charac = summary_charac[columns].str.lower().str.split(" ",expand=True)
        charac[charac.isin(undefined_words)] = None
        charac[charac.isna()]=None
        charac=pd.DataFrame(charac.stack().reset_index())
        charac = charac.groupby(charac.level_0)[0].apply(" ".join)
        charac.name = "reworked_" + str(columns) 
        summary_charac=summary_charac.join(charac,on=summary_charac.index)

    summary_charac=summary_charac.drop(columns=column_to_drop)
    #-------------------------------------------------------------------------------------------------------------------------------

    all_sentence_completed["Gender"] = 'NaN'

    i=0
    #loop on each row of the summary characters
    for characters in all_sentence_completed.Characters:
        
        j=0
        number_of_character = len(characters)

        character_gender_list = []
        #loop on each characters of a row (if many essentially, else it is a single way loop)
        for single_character_number in range(number_of_character):
            single_character=characters[single_character_number]
         
            #-------------------------------------------------------------------------------------------------------------------------------
            #try a direct match of reworked character names
            if (involved_characters[involved_characters.Character_name_reworked == summary_charac.iloc[i,j]].empty == False):
                character_gender = list(involved_characters[involved_characters.Character_name_reworked == summary_charac.iloc[i,j]].Actor_gender)
            else:
                character_gender = None
            #-------------------------------------------------------------------------------------------------------------------------------                     
            
            # if no direct match, then alternatives are tried
            if (character_gender == None) & (isinstance(summary_charac.iloc[i,j],str)):
 
                # the first part of name is determinant, as it is either a words linked to a male or female (eg. Mr.)
                name_split = summary_charac.iloc[i,j].strip().split(" ")
                
                #find if male word implied
                if any(pd.Series(name_split[0]).isin(male_words)):
                    character_gender = list("M")
                    
                #find if female word implied
                elif any(pd.Series(name_split[0]).isin(female_words)):
                    character_gender = list("F")
                
                #look if the first word of the name (surname) is in the character list (e.g Harry in summaries is Harry Potter in metadata, this will match) 
                elif (condition_to_look_into_metadata == 1)  &  (isinstance(involved_characters.Character_name_reworked,str)):
                    if (any(name_split[0] ==  involved_characters.Character_name_reworked.str.lower().str.split(" ",expand=True).iloc[:,0])):
                        character_gender=list(involved_characters[involved_characters.Character_name_reworked.str.lower().str.split(" ",expand=True).iloc[:,0] == name_split[0]].Actor_gender)
                    
                
                #-------------------------------------------------------------------------------------------------------------------------------
            character_gender_list.append(character_gender)
            j+=1

        all_sentence_completed.Gender.iloc[i] = character_gender_list
        i+=1
        
        
        
        
    return all_sentence_completed

### Grouping on characters 

**Goal**: The idea is to regroup each sentences around a specific character or set of characters. Thus, if there is 3 sentences (3 rows at start) with the same character, it results in one row with all the characteristics (lemmas, specific lemmas (eg. nouns, verb) and governor/dependent words). It also launch the gender identification.

**How**: It is mainly an excessive use of a groupby(characters) completed with a .apply(','.join) function. it has to be done independently for each type of variable that needs to be grouped and all the subsets are then merge in one table. 

In [195]:
def group_by_characters(all_sentence_completed,all_summary_characters,dependence_by_summary,characters_data):
    
    #remove useless spaces: such as " Marie " == "Marie" will be true
    all_sentence_completed.Characters= all_sentence_completed.Characters.str.strip()

    #Used for computing and merging the grouped words of each type by characters to the following dataframe
    all_sentence_completed_copy = all_sentence_completed.copy()

    #Finally, the regroupement of all the sentences  to unique characters for each summary 
    
    all_sentence_completed = all_sentence_completed.groupby('Characters')['Lemma'].apply(','.join).reset_index()
    all_sentence_completed['Lemma'] = all_sentence_completed['Lemma'].str.split(",").apply(list)
    
    
    
    #type_final = all_sentence_completed_copy.groupby('Characters')['JJ'].apply(', '.join).reset_index()
    type_to_keep = np.array(["JJ","NN","NNP","NNS","PRP","PRP$","VB","VBG","VBP","VBZ"])
    
    #group words of each type in a single cell by characters
    all_sentence_completed_copy = all_sentence_completed_copy.fillna("")
    for type_name in type_to_keep:
        
        type_final = all_sentence_completed_copy.groupby('Characters')[type_name].apply(', '.join).reset_index()
        type_final[type_name] = type_final[type_name].str.split(",").apply(list)
        all_sentence_completed = pd.merge(all_sentence_completed,type_final, on="Characters")
        
        
        
    #working on the collapsed dependence
    #find dependence implying character names
    dep_words_by_character = dependence_by_summary[dependence_by_summary["gov"]\
                                                   .isin(all_sentence_completed.Characters.unique())]\
                                                   .groupby('gov')["dep"].apply(', '.join).reset_index()
    dep_words_by_character.columns=["Characters","dependent_words"]
    #create a list the words neglecting their type of relation with the character name
    if (dep_words_by_character["dependent_words"].isnull().all() == False):
        dep_words_by_character["dependent_words"] = dep_words_by_character["dependent_words"].str.split(",").apply(list)
    
    #same as above but for the governor words   
    gov_words_by_character = dependence_by_summary[dependence_by_summary["dep"]\
                                                   .isin(all_sentence_completed.Characters.unique())]\
                                                   .groupby('dep')["gov"].apply(', '.join).reset_index()
    gov_words_by_character.columns=["Characters","governor_words"]
    if (gov_words_by_character["governor_words"].isnull().all() == False):
        gov_words_by_character["governor_words"] = gov_words_by_character["governor_words"].str.split(",").apply(list)
    
    #join the lists of words to the characters in the global table
    all_sentence_completed = all_sentence_completed.join(gov_words_by_character.set_index('Characters'),on="Characters",how='left')
    all_sentence_completed = all_sentence_completed.join(dep_words_by_character.set_index('Characters'),on="Characters",how='left')
    
    #create a list of character rather than a simple string (mainly if there is multiple character)
    all_sentence_completed["Characters"] = all_sentence_completed["Characters"].str.split("&").apply(list)
    
    
    #adding a column with the film ID
    film_id = path[len("data/corenlp_plot_summaries/"):].replace(".txt.xml","")
    all_sentence_completed["film_id"] = film_id
    
    
    

    #launche the function to seek for characters' gender
    all_sentence_completed_2 = assign_gender_from_characters_metadata(all_sentence_completed,characters_data,film_id)
    
  
    #save result in the new dataset
    all_summary = all_summary_by_characters.append(all_sentence_completed_2)
    
    
    return all_summary

# MAIN

The main function/script is a simple loop on each summaries, it applies the above function in the good order and result in a single dataframe containing all the summaries lemmas, specific words, gov/dep words and gender based on each characters

In [None]:
all_summary_by_characters = pd.DataFrame(columns=["film_id","Characters","Lemma"],dtype="object")

summary_index = 0

# assign directory
directory = 'data/corenlp_plot_summaries'

#open characters data
path_charac = 'data\MovieSummaries\character.metadata.tsv'
colnames=['Wikipedia_movie_ID', 'Freebase_movie_ID', 'Movie_release_date', 'Character_name','Actor_date_of_birth', 'Actor_gender', 'Actor_height', 'Actor_ethnicity', 'Actor_name', 'Actor_age_at_movie_release', 'Freebase_character/actor_map_ID', 'Freebase_character_ID', 'Freebase_actor_ID'] 
characters_raw_data = pd.read_csv(path_charac, sep='\t',names = colnames, header=None)

characters_data = characters_raw_data.copy()


for filename in tqdm(os.listdir(directory)):
    path = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(path):
        last_summary = path
    else:
        print("error in file reading")
        
        
    with open(path, 'r') as f:
        data = f.read()
        
        
        Bs_data = BeautifulSoup(data, "xml")
        
        #coreference treatment
        all_link = coreference_treatment(Bs_data)
        
        #dependence treatment
        dependence_by_summary = dependence_treatment(Bs_data)
        
        
        #sentences treatment
        all_sentence,sentence_matrix,test= sentence_treatment(Bs_data)
        
        
        #coreference from numerical links to words linked dataframe
        all_link_completed=link_words(all_link,sentence_matrix)
        
        
        #for sentences with no characters, use coreferences to assign a character
        all_sentence_completed=assign_characters_to_sentence_without_explicit_character_name(all_sentence,all_link_completed)
        
       
            
        
        all_summary_by_characters = group_by_characters(all_sentence_completed,all_summary_by_characters,dependence_by_summary,characters_data)
        
        
    summary_index+=1
    
    

        

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
  arr_value = np.array(value)
 23%|████████████████▋                                                       | 9829/42306 [2:45:56<10:15:44,  1.14s/it]

In [None]:
#save to pickle
all_summary_by_characters.to_pickle("data\summary_final.pkl")