# Summary parsing using Flair & Classification of our characters

In this notebook, we are going to:
- cycle through all summaries in the `plot_summaries.txt` file and count the number of occurrences of each character name in each summary
- use these counts to classify the characters listed in the `character.metadata.tsv` file into three categories:
    - **Primary:** the character name accounts for more than 10% of all character-name mentions in the movie summary
    - **Secondary:** the character name is mentioned, but accounts for 10% or less of all character-name mentions
    - **Missed:** the character name is not mentioned in the movie summary at all
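
The three categories can be summarized as a small threshold function. This is a minimal sketch: `classify_ratio` and its `threshold` parameter are hypothetical names introduced for illustration, and `ratio` stands for the share of all name mentions in a summary that match the character.

```python
def classify_ratio(ratio, threshold=0.1):
    """Map a mention share (between 0 and 1) to a role label."""
    if ratio > threshold:
        return 'Primary'
    if ratio > 0:
        return 'Secondary'
    return 'Missed'

print(classify_ratio(0.4))   # Primary
print(classify_ratio(0.05))  # Secondary
print(classify_ratio(0))     # Missed
```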

### Library imports

In [None]:
import pandas as pd
import numpy as np
import ast
import missingno as msno
from geopy.geocoders import Nominatim
import geopandas as gpd
import re
import pycountry_convert as pc
from itertools import combinations
import matplotlib.pyplot as plt
import seaborn as sns

import dataframes as RAW

In [None]:
from flair.nn import Classifier
from flair.data import Sentence

# Load the model
tagger = Classifier.load('ner-fast')

### Useful functions

The function `extract_character_names_flair` takes a string (typically one of the summaries), runs the Flair NER tagger on it and collects every entity tagged as a person (`PER`). The list of names is returned at the end of the function.

The function `count_appearances` takes a large string (typically a summary) and a list of strings (typically the names found by the first function), and counts how many times each string in the list appears in the large string, using a case-insensitive substring match. The function returns a dictionary that maps each string of the list to its occurrence count in the large string.

These two functions will be used in the following way for all summaries:
- Use `extract_character_names_flair` to identify all character names inside
- Use `count_appearances` to count the number of appearances of all character names in the summary.

In [None]:
def extract_character_names_flair(summary):
    # Create a Flair Sentence
    sentence = Sentence(summary)

    # Run NER on the sentence
    tagger.predict(sentence)

    # Extract character names (NER tags labeled as PER, indicating a person)
    character_names = []

    for entity in sentence.get_spans('ner'):
        if entity.tag == 'PER':
            character_names.append(entity.text)

    return character_names

def count_appearances(larger_string, string_list):
    # Initialize an empty dictionary to store counts
    appearances_dict = {}

    # Iterate over each string in the list
    for search_string in string_list:
        # Count occurrences using the count() method (we convert to lowercase to avoid missing any occurrence)
        count = larger_string.lower().count(search_string.lower())
        
        # Store the count in the dictionary
        appearances_dict[search_string] = count

    return appearances_dict
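
Because `count_appearances` relies on plain substring matching, a short name can also be counted inside longer words (for example, `Al` inside `also`). If that turns out to matter, a variant based on word boundaries is one option. The sketch below is an alternative under that assumption, not the method used in this notebook; `count_whole_word_appearances` is a hypothetical name and the example text is made up.

```python
import re

def count_whole_word_appearances(larger_string, string_list):
    """Count case-insensitive, whole-word occurrences of each string."""
    appearances = {}
    for search_string in string_list:
        # \b anchors the match at word boundaries; re.escape keeps
        # characters such as '.' literal instead of treating them as regex syntax
        pattern = r'\b' + re.escape(search_string) + r'\b'
        matches = re.findall(pattern, larger_string, flags=re.IGNORECASE)
        appearances[search_string] = len(matches)
    return appearances

text = "Al also meets Sal. Later, Al leaves."
print(text.lower().count('al'))                    # substring count: 4
print(count_whole_word_appearances(text, ['Al']))  # whole-word count: {'Al': 2}
```

Names that end with a non-word character (such as a trailing period) would need extra care with this pattern.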

### Summary parsing

First, let us import the summary data:

In [None]:
summaries = RAW.summaries.copy()
summaries.head()

Now, we can use the two functions mentioned above to cycle through all summaries and create the character appearance dictionaries:

In [None]:
# Grab a subset of the data (it takes about 8h for 10,000 summaries)
sub_summaries = summaries.iloc[15000:20000, :].copy()

parsing_results = []

for index, row in sub_summaries.iterrows():
    # Print the index to keep track of where we are in the parsing
    print(index)

    # Extract the names from the summary
    names = set(extract_character_names_flair(row['Summary']))

    # Count the appearances of every name
    counts = count_appearances(row['Summary'], names)

    # Append the dictionary to the result list
    parsing_results.append(counts)

The list `parsing_results` now contains one dictionary of character counts per summary. To get a better view of the distribution of the characters in each movie summary, we sort each dictionary by count in descending order (so that the most common names come first):

In [None]:
parsing_results = [
    {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    for d in parsing_results
]

sub_summaries['Characters'] = parsing_results

Here is the resulting dataframe:

In [None]:
sub_summaries

In [None]:
sub_summaries.to_csv('parsing_15000_19999.csv', index=False)
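
Note that a CSV file stores each dictionary of the `Characters` column as plain text, so the column has to be converted back to dictionaries when the file is reloaded in a later session. The sketch below illustrates the round trip with the standard `csv` module and made-up counts; the same `ast.literal_eval` call can be applied to the values of a column loaded with pandas.

```python
import ast
import csv
import io

counts = {'Alice': 3, 'Bob': 1}  # made-up parsing result

# Write one row to an in-memory CSV file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Wiki ID', 'Characters'])
writer.writerow([42, counts])

# Read it back: every field is now a string
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_value = rows[1][1]
print(type(raw_value).__name__)

# ast.literal_eval safely turns the text back into a dictionary
restored = ast.literal_eval(raw_value)
print(restored == counts)
```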

### Character classification

Now, the goal is to classify all characters from the character data into three roles: Primary, Secondary and Missed (explained earlier). First, let us import the data and add a `Role` column (filled with NaN values):

In [None]:
characters = RAW.character_data.copy()
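# Use an object column so that string labels can be assigned later without a dtype warning
characters['Role'] = pd.Series(np.nan, index=characters.index, dtype=object)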
characters.head()

Now, we will classify the characters:

- First version: characters who take up over 10% of all names are primary and the rest are secondary (many characters are classified)

In [None]:
'''

for index, row in sub_summaries.iterrows():
    print(index)

    # Wiki ID of the movie to consider
    wiki_id = row['Wiki ID']

    # Dictionary of the parsing results for this movie
    parsing_result = row['Characters']

    # All characters who belong to this movie
    sub_characters = characters[characters['Wiki ID'] == wiki_id]
    
    # If the movie features actors inside of the character dataframe then proceed
    if not(sub_characters.empty):
        for i, r in sub_characters.iterrows():
            # Take one of the characters
            character = r['Character name']

            # If the considered character has a valid name then proceed
            if not(pd.isna(character)):
                # Split the character in all of its words (name, surname, etc)
                split_character_name = character.split()

                count = 0
                total = 0

                for key, value in parsing_result.items():
                    # Add all values to the total
                    total += value

                    for item in split_character_name:
                        if item in key:
                            # If we find a match then add to the count and stop (to avoid counting twice)
                            count += value
                            break
                    
                if total != 0:
                    # Compute ratio
                    ratio = count / total
                else:
                    # Empty dictionary: the character is a miss
                    ratio = 0

                if ratio > 0.1:
                    # Primary character: accounts for more than 10% of the mentions
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Primary'

                elif ratio <= 0.1 and ratio > 0:
                    # Secondary character: mentioned, but accounts for 10% or less of the mentions
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Secondary'

                else:
                    # None: The character was not mentioned in the summary
                    characters.loc[(characters['Character name'] == character) & (characters['Wiki ID'] == wiki_id), 'Role'] = 'Missed'

'''
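
The matching rule used in the loop above can be isolated into a small helper, which makes it easier to check on toy data. This is a sketch only: `mention_ratio` is a hypothetical name and the counts are invented for illustration.

```python
def mention_ratio(character_name, counts):
    """Share of all name mentions whose key contains a word of the character name."""
    name_parts = character_name.split()
    total = sum(counts.values())
    if total == 0:
        # Empty dictionary: nothing was detected in the summary
        return 0
    # A detected name matches as soon as it contains any word of the character name
    matched = sum(
        value for key, value in counts.items()
        if any(part in key for part in name_parts)
    )
    return matched / total

counts = {'Alice Smith': 4, 'Bob': 3, 'Carol': 3}  # invented counts
print(mention_ratio('Alice', counts))          # 0.4
print(mention_ratio('Dr. Bob Jones', counts))  # 0.3
print(mention_ratio('Dave', counts))           # 0.0
```

As in the loop, the match is a case-sensitive substring test, so a character word such as `Al` would also match a detected name like `Alice`.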

- Second version: the most common character is primary and the second most common is secondary (only 2 characters are classified per movie and there is no 'Missed' category)

In [None]:
for index, row in sub_summaries.iterrows():
    print(index)

    # Wiki ID of the movie to consider
    wiki_id = row['Wiki ID']

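    # Dictionary of the parsing results for this movie
    # (stored as a string if `sub_summaries` was reloaded from the CSV file)
    parsing_result = row['Characters']
    if isinstance(parsing_result, str):
        parsing_result = ast.literal_eval(parsing_result)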

    # All characters who belong to this movie
    sub_characters = characters[characters['Wiki ID'] == wiki_id]

    primary = None
    secondary = None

    most = -1
    second_most = -1
    
    # If the movie features actors inside of the character dataframe then proceed
    if not(sub_characters.empty):
        for i, r in sub_characters.iterrows():
            # Take one of the characters
            character = r['Character name']
            char_index = i

            # If the considered character has a valid name then proceed
            if not(pd.isna(character)):
                # Split the character in all of its words (name, surname, etc)
                split_character_name = character.split()

                count = 0
                total = 0

                for key, value in parsing_result.items():
                    # Add all values to the total
                    total += value

                    for item in split_character_name:
                        if item in key:
                            # If we find a match then add to the count and stop (to avoid counting twice)
                            count += value
                            break
                    
                if total != 0:
                    # Compute ratio
                    ratio = count / total
                else:
                    # Empty dictionary: the character is a miss
                    ratio = 0

                # Found a new character that appears more often than the current first
                if ratio > most:
                    # Current first becomes second
                    second_most = most
                    secondary = primary

                    # New character gets first place
                    most = ratio
                    primary = i

                # Otherwise, the character becomes the new second if it appears more often
                # than the current second (this also covers a tie with the current first)
                elif ratio > second_most:
                    second_most = ratio
                    secondary = i

        # Only assign roles if both a primary and a secondary character were found;
        # otherwise the movie is not useful
        if primary is not None and secondary is not None:
            characters.at[primary, 'Role'] = 'Primary'
            characters.at[secondary, 'Role'] = 'Secondary'
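
A top-two selection of this kind can also be written more compactly by sorting the per-character ratios. The sketch below is an alternative formulation rather than the code used above; `pick_primary_and_secondary` is a hypothetical helper, and the row labels and ratios are invented. Python's sort is stable, so characters with equal ratios keep the order in which they were encountered.

```python
def pick_primary_and_secondary(ratios):
    """Return the keys with the highest and second-highest ratio, or (None, None)."""
    if len(ratios) < 2:
        # Fewer than two candidates: the movie cannot be classified
        return None, None
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[0], ranked[1]

# Invented example: row labels of three characters and their mention ratios
ratios = {101: 0.5, 102: 0.2, 103: 0.5}
primary, secondary = pick_primary_and_secondary(ratios)
print(primary, secondary)  # 101 103
```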

The characters who still have a NaN value in their `Role` column either belong to a movie outside the processed subset or are not among the two most-mentioned characters of their movie, so they will not be useful. Therefore, we keep only the characters who were assigned a role:

In [None]:
result = characters[characters['Role'].notna()]
result

Finally, we store the result inside of a CSV file:

In [None]:
result.to_csv('character_classification_15000_19999.csv', index=False)