# Summary parsing using Flair & Classification of our characters

In this notebook, we are going to:
- cycle through all summaries in the `plot_summaries.txt` file and count the number of occurrences of each character name in each summary
- use these counts to classify the characters listed in the `character.metadata.tsv` file into three categories:
    - **Primary:** the character name accounts for more than 10% of all character-name mentions in the summary
    - **Secondary:** the character name is mentioned, but accounts for 10% or less of all character-name mentions
    - **Missed:** the character name is not mentioned in the movie summary at all
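
The three categories above can be written as a simple rule on a character's share of all name mentions. The sketch below is only illustrative: `classify_role` and its `threshold` parameter are hypothetical names that are not used elsewhere in this notebook, and the share is assumed to have been computed already.

```python
def classify_role(share, threshold=0.1):
    """Map a character's share of all name mentions to a role (hypothetical helper)."""
    if share > threshold:
        # More than 10% of all mentions
        return 'Primary'
    if share > 0:
        # Mentioned, but 10% or less of all mentions
        return 'Secondary'
    # Never mentioned in the summary
    return 'Missed'

# Toy shares, not taken from the dataset
for share in (0.4, 0.05, 0.0):
    print(share, classify_role(share))
```

A share of exactly 10% falls into the Secondary category under this rule, since Primary requires strictly more than the threshold.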

### Library imports

In [2]:
import pandas as pd
import numpy as np
import ast
import missingno as msno
from geopy.geocoders import Nominatim
import geopandas as gpd
import re
import pycountry_convert as pc
from itertools import combinations
import matplotlib.pyplot as plt
import seaborn as sns

import dataframes as RAW

In [3]:
from flair.nn import Classifier
from flair.data import Sentence

# Load the model
tagger = Classifier.load('ner-fast')

2023-12-12 15:40:28,272 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


### Useful functions

The function `extract_character_names_flair` takes a string (for example, one of the summaries) and runs the Flair NER tagger on it. Every entity tagged as a person (`PER`) is appended to a list, which is returned at the end of the function.

The function `count_appearances` takes a large string (for example, a summary) and a list of strings (for example, the character names found by the first function) and counts how many times each string in the list appears in the large text. The comparison is case-insensitive and based on substrings. The function returns a dictionary that maps each string in the list to its occurrence count in the large text.

These two functions will be used in the following way for all summaries:
- Use `extract_character_names_flair` to identify all character names inside
- Use `count_appearances` to count the number of appearances of all character names in the summary.
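
To see how the counting step behaves without loading the Flair model, here is a small sketch on a made-up sentence. It restates the logic of `count_appearances` in compact form; the text and the list of names are invented for illustration and are not taken from the dataset.

```python
def count_appearances(larger_string, string_list):
    # Same logic as the notebook's function: case-insensitive substring counts
    text = larger_string.lower()
    return {s: text.count(s.lower()) for s in string_list}

# Made-up text, used only to illustrate the behaviour
toy_summary = "John Doe meets Ann. Ann trusts Doe, but Joanna does not."
toy_counts = count_appearances(toy_summary, ["John Doe", "Doe", "Ann"])
print(toy_counts)
```

Because the count is based on substrings, `'Doe'` is also counted inside `'John Doe'` and `'does'`, and `'Ann'` is counted inside `'Joanna'`. Short names can therefore be over-counted, which is worth keeping in mind when reading the results.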

In [4]:
def extract_character_names_flair(summary):
    # Create a Flair Sentence
    sentence = Sentence(summary)

    # Run NER on the sentence
    tagger.predict(sentence)

    # Extract character names (NER tags labeled as PER, indicating a person)
    character_names = []

    for entity in sentence.get_spans('ner'):
        if entity.tag == 'PER':
            character_names.append(entity.text)

    return character_names

def count_appearances(larger_string, string_list):
    # Initialize an empty dictionary to store counts
    appearances_dict = {}

    # Iterate over each string in the list
    for search_string in string_list:
        # Count occurrences using the count() method (we convert to lowercase so that case differences do not hide any occurrence)
        count = larger_string.lower().count(search_string.lower())
        
        # Store the count in the dictionary
        appearances_dict[search_string] = count

    return appearances_dict

### Summary parsing

First, let us import the summary data:

In [5]:
summaries = RAW.summaries.copy()
summaries.head()

Unnamed: 0,Wiki ID,Summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


Now, we can use the two functions described above to cycle through all summaries and create the character appearance dictionaries:

In [6]:
# Grab a subset of the data (it takes about 8h for 10,000 summaries)
sub_summaries = summaries.iloc[:10, :].copy()

parsing_results = []

for index, row in sub_summaries.iterrows():
    # Print the index to keep track of where we are in the parsing
    print(index)

    # Extract the names from the summary
    names = set(extract_character_names_flair(row['Summary']))

    # Count the appearances of every name
    counts = count_appearances(row['Summary'], names)

    # Append the dictionary to the result list
    parsing_results.append(counts)

0
1
2
3
4
5
6
7
8
9


The list `parsing_results` now contains one dictionary of character counts per summary. To get a better view of the distribution of the characters in each movie summary, we can sort each dictionary by count in descending order (so that the most common names come first):

In [7]:
parsing_results = [
    {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    for d in parsing_results
]

sub_summaries['Characters'] = parsing_results
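
The sorting step relies on Python dictionaries preserving insertion order (guaranteed since Python 3.7): rebuilding a dictionary from its sorted items yields a dictionary ordered by count. A minimal sketch, using three of the counts that appear in the output of this notebook but deliberately shuffled:

```python
# Counts in arbitrary order
toy_counts = {'Rue': 11, 'Katniss': 24, 'Peeta': 16}

# Sort the (name, count) pairs by count, highest first, and rebuild the dict
ranked = dict(sorted(toy_counts.items(), key=lambda item: item[1], reverse=True))
print(list(ranked))
```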

Here is the resulting dataframe:

In [9]:
sub_summaries

Unnamed: 0,Wiki ID,Summary,Characters
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...","{'Shlykov': 1, 'Lyosha': 1}"
1,31186339,The nation of Panem consists of a wealthy Capi...,"{'Katniss': 24, 'Peeta': 16, 'Rue': 11, 'Cato'..."
2,20663735,Poovalli Induchoodan is sentenced for six yea...,"{'Induchoodan': 18, 'Menon': 12, 'Manapally': ..."
3,2231378,"The Lemon Drop Kid , a New York City swindler,...","{'Kid': 35, 'Charley': 18, 'Moran': 8, 'Nellie..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...,"{'Lindy': 7, 'Michael': 4, 'Azaria': 4, 'Chamb..."
5,5272176,The president is on his way to give a speech. ...,"{'Thomas': 17, 'Baldwin': 7, 'Kate': 5, 'Steve..."
6,1952976,"{{plot}} The film opens in 1974, as a young gi...","{'Dahlia': 21, 'Cecilia': 19, 'Natasha': 12, '..."
7,24225279,"The story begins with Hannah, a young Jewish t...","{'Hannah': 15, 'Dominic': 14, 'Miss Lombardo':..."
8,2462689,Infuriated at being told to write one final co...,"{'Doe': 12, 'John Doe': 10, 'Mitchell': 8, 'Wi..."
9,20532852,A line of people drool at the window of the s...,"{'Buzz': 5, 'Woody': 4}"


In [56]:
sub_summaries.to_csv('parsing_15000_19999.csv', index=False)

### Character classification

Now, the goal is to classify all characters from the character data into three roles: Primary, Secondary and Missed (explained earlier). First, let us import the data and add a `Role` column (filled with NaN values):

In [50]:
characters = RAW.character_data.copy()
characters['Role'] = np.nan
characters.head()

Unnamed: 0,Wiki ID,Freebase ID,Release date,Character name,Actor DOB,Actor gender,Actor height,Actor ethnicity,Actor name,Actor age at release,Map ID,Character ID,Actor ID,Role
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7,
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4,
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l,
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc,
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.65,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg,


Now, we will classify the characters:

- First version: characters who take up over 10% of all names are primary and the rest are secondary (many characters are classified)

In [None]:
'''

for index, row in sub_summaries.iterrows():
    print(index)

    # Wiki ID of the movie to consider
    wiki_id = row['Wiki ID']

    # Dictionary of the parsing results for this movie
    parsing_result = row['Characters']

    # All characters who belong to this movie
    sub_characters = characters[characters['Wiki ID'] == wiki_id]
    
    # If the movie features actors inside of the character dataframe then proceed
    if not(sub_characters.empty):
        for i, r in sub_characters.iterrows():
            # Take one of the characters
            character = r['Character name']

            # If the considered character has a valid name then proceed
            if not(pd.isna(character)):
                # Split the character in all of its words (name, surname, etc)
                split_character_name = character.split()

                count = 0
                total = 0

                for key, value in parsing_result.items():
                    # Add all values to the total
                    total += value

                    for item in split_character_name:
                        if item in key:
                            # If we find a match then add to the count and stop (to avoid counting twice)
                            count += value
                            break
                    
                if total != 0:
                    # Compute ratio
                    ratio = count / total
                else:
                    # Empty dictionary: the character is a miss
                    ratio = 0

                if ratio > 0.1:
                    # Primary character: accounts for more than 10% of all name mentions
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Primary'

                elif ratio <= 0.1 and ratio > 0:
                    # Secondary character: mentioned, but accounts for 10% or less of all name mentions
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Secondary'

                else:
                    # Missed: the character was not mentioned in the summary
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Missed'

'''
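
The ratio used in the first version can be isolated into a small function, which makes the matching rule easier to inspect. This is a sketch under stated assumptions: `name_share` is a hypothetical helper that does not exist in this notebook, and the counts are toy values. As in the loop above, a dictionary key counts as a match when any word of the character name is a substring of that key, and each key is counted at most once.

```python
def name_share(character_name, counts):
    """Share of all counted mentions whose key matches the character name (hypothetical helper)."""
    words = character_name.split()
    total = sum(counts.values())
    # A key matches if any word of the character name is a substring of it
    matched = sum(
        value for key, value in counts.items()
        if any(word in key for word in words)
    )
    # An empty dictionary means the character is a miss
    return matched / total if total else 0

# Toy counts, chosen for easy arithmetic
toy_counts = {'Katniss': 24, 'Peeta': 16, 'Rue': 10}
print(name_share('Katniss Everdeen', toy_counts))
print(name_share('Primrose Everdeen', toy_counts))
```

Note that the match is case-sensitive, unlike the counting step, so a character name written in a different case than in the summary would not be matched.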

- Second version: the most common character is primary and the second most common is secondary (only 2 characters are classified per movie and there is no 'Missed' category)

In [51]:
for index, row in sub_summaries.iterrows():
    print(index)

    # Wiki ID of the movie to consider
    wiki_id = row['Wiki ID']

    # Dictionary of the parsing results for this movie
    parsing_result = row['Characters']

    # All characters who belong to this movie
    sub_characters = characters[characters['Wiki ID'] == wiki_id]

    primary = None
    secondary = None

    most = -1
    second_most = -1
    
    # If the movie features actors inside of the character dataframe then proceed
    if not(sub_characters.empty):
        for i, r in sub_characters.iterrows():
            # Take one of the characters
            character = r['Character name']

            # If the considered character has a valid name then proceed
            if not(pd.isna(character)):
                # Split the character in all of its words (name, surname, etc)
                split_character_name = character.split()

                count = 0
                total = 0

                for key, value in parsing_result.items():
                    # Add all values to the total
                    total += value

                    for item in split_character_name:
                        if item in key:
                            # If we find a match then add to the count and stop (to avoid counting twice)
                            count += value
                            break
                    
                if total != 0:
                    # Compute ratio
                    ratio = count / total
                else:
                    # Empty dictionary: the character is a miss
                    ratio = 0

                # Found a new character that appears more often than the current first
                if ratio > most:
                    # Current first becomes second
                    second_most = most
                    secondary = primary

                    # New character gets first place
                    most = ratio
                    primary = character

                else:
                    # Found a new character that appears as often as the current first while there is still no second
                    if ratio == most and secondary is None:
                        second_most = ratio
                        secondary = character

                    # Found a new character that appears less often than the current first and more often than the current second
                    if ratio < most and ratio > second_most:
                        second_most = ratio
                        secondary = character

    # Check that the primary and secondary characters are valid and assign the role if correct
    # - if we couldn't classify both a primary and a secondary character then the movie is not useful
    # - if the character name classified as primary matches with the most appearing name in the summary then it's good
    # - same for secondary

    keys = list(parsing_result.keys())
    key_index = 0

    assign_primary = False
    assign_secondary = False
    
    if len(keys) >= 1 and primary is not None:
        if keys[key_index] in primary:
            assign_primary = True

        # Skip every leading key that is part of the primary name
        # (the bounds check comes first to avoid an IndexError)
        while key_index < len(keys) and keys[key_index] in primary:
            key_index += 1

    if key_index < len(keys) and secondary is not None:
        if keys[key_index] in secondary:
            assign_secondary = True

    if assign_primary and assign_secondary:
        characters.loc[(characters['Character name'] == primary) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Primary'
        characters.loc[(characters['Character name'] == secondary) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Secondary'

0
1
2
3
4
5
6
7
8
9
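
For comparison, the selection of the two most frequent characters can also be sketched with a sort instead of the running comparison used in the loop above. This is an alternative formulation, not the code that produced the results in this notebook: `top_two` is a hypothetical helper, the shares are toy values, and ties may be resolved differently than in the loop.

```python
def top_two(shares):
    """Return the two names with the highest shares, padded with None (hypothetical helper)."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    names = [name for name, _ in ranked[:2]]
    # Pad with None when fewer than two characters are available
    return tuple(names + [None] * (2 - len(names)))

# Toy shares, not computed from the dataset
toy_shares = {'Peeta Mellark': 0.2, 'Katniss Everdeen': 0.3, 'Rue': 0.1}
primary, secondary = top_two(toy_shares)
print(primary, secondary)
```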


Characters that still have a NaN value in their `Role` column were not classified as primary or secondary (for instance because they are not mentioned in their movie's summary), so they will not be useful. Therefore, we keep only the characters that were assigned a role:

In [55]:
result = characters[characters['Role'].notna()]
result

Unnamed: 0,Wiki ID,Freebase ID,Release date,Character name,Actor DOB,Actor gender,Actor height,Actor ethnicity,Actor name,Actor age at release,Map ID,Character ID,Actor ID,Role
4633,20663735,/m/051zjwb,2000,M.K. Menon,1935-12-10,M,,/m/0dryh9k,Thilakan,64.0,/m/059t6pp,/m/0h73lnb,/m/02hkvw,Secondary
4639,20663735,/m/051zjwb,2000,Marancheri Induchoodan,1960-05-21,M,1.72,/m/0dryh9k,Mohanlal,39.0,/m/059t6p_,/m/0h8gtfl,/m/02fbpz,Primary
107615,595909,/m/02tqm5,1988-11-03,Michael Chamberlain,1947-09-14,M,1.822,/m/02jvpv,Sam Neill,41.0,/m/02tbjj2,/m/0h2qv0j,/m/01ckhj,Secondary
107616,595909,/m/02tqm5,1988-11-03,Lindy Chamberlain,1949-06-22,F,1.68,,Meryl Streep,39.0,/m/02tb1h6,/m/05z0x_h,/m/0h0wc,Primary
128473,2462689,/m/07ftxt,1941-05-03,Ann Mitchell,1907-07-16,F,1.65,,Barbara Stanwyck,33.0,/m/0k0kbm,/m/0h57bdz,/m/0bw6y,Secondary
128474,2462689,/m/07ftxt,1941-05-03,Long John Willoughby - 'John Doe',1901-05-07,M,1.905,,Gary Cooper,39.0,/m/0k0kbg,/m/0h2svgl,/m/0c2tf,Primary
363229,31186339,/m/0gkz15s,2012-03-12,Katniss Everdeen,1990-08-15,F,1.75,,Jennifer Lawrence,21.0,/m/0gw7kv0,/m/0c01vfc,/m/02x0dzw,Primary
363230,31186339,/m/0gkz15s,2012-03-12,Peeta Mellark,1992-10-12,M,1.7,,Josh Hutcherson,19.0,/m/0gw7kvp,/m/0c03gdc,/m/08wjf4,Secondary


Finally, we store the result inside of a CSV file:

In [43]:
result.to_csv('character_classification_15000_19999.csv', index=False)