# Note:

The approaches below are demonstrated on a small sample of the dataset (the first 16 stories). They scale to the entire dataset as well; the full analysis is omitted due to computational limitations.

# Approach 1
1. Create a list of the ten most frequent noun phrases in each story.
2. To identify the people on the list, check how often each term is immediately followed by a verb. A term that is followed by a verb more than 40% of the time is most likely a person.
3. Discard terms that do not meet the 40% threshold; the most frequent remaining term is taken as the protagonist.

In [None]:
#run this cell only for colab
from google.colab import drive
drive.mount('/content/drive')
!cp /content/drive/MyDrive/plots ./plots
!cp /content/drive/MyDrive/titles ./titles


In [None]:
!pip install TextBlob
!python -m textblob.download_corpora

In [1]:
# read the first 16 stories; each story ends with a line containing <EOS>
stories = []
story = ""
with open('plots') as f:
    for line in f:
        if len(stories) == 16:
            break
        if '<EOS>' in line:
            stories.append(story)
            story = ""
        else:
            story = story + line
# strip newlines and backslashes
stories = [story.replace("\n", "").replace("\\", "") for story in stories]

In [2]:
with open('titles') as titles_file:
  titles = titles_file.readlines()
titles = [title.replace("\n", "") for title in titles]

In [4]:
import string
from textblob import TextBlob
import nltk
# nltk.download('averaged_perceptron_tagger')
# nltk.download('brown')
# nltk.download('punkt')

#clean up
# clearingSymbols = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’‘'
# for c in clearingSymbols:
#     text = text.replace(c, '')

for index, story in enumerate(stories):
    blob = TextBlob(story)

    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    a = blob.tags  #pos tagging

    # returns the 10 most frequent noun phrases
    def prn(plotblob):
        d = plotblob.np_counts # noun counter 
        d = dict(d)
        top = sorted(d, key=d.get, reverse=True)
        return top[0:10]

    dictionary = blob.np_counts
    # checks whether the noun phrase is followed by a verb more than 40% of the time
    for word in prn(blob):
        count = 0
        for i in range(0, len(a)-1):
            if a[i][0] == word or a[i][0] == word.title():
                if a[i+1][1] in verbs:
                    count = count + 1
        if count / dictionary[word] > 0.4:
            protagonist = word
            print("The protagonist of the story " + titles[index] + " is :-" + protagonist)
            break


The protagonist of the story Animal Farm is :-napoleon
The protagonist of the story A Clockwork Orange (novel) is :-alex
The protagonist of the story The Plague is :-tarrou
The protagonist of the story Actaeon is :-actaeon
The protagonist of the story A Fire Upon the Deep is :-lab
The protagonist of the story All Quiet on the Western Front is :-paul
The protagonist of the story Anyone Can Whistle is :-fay
The protagonist of the story A Funny Thing Happened on the Way to the Forum is :-senex
The protagonist of the story Army of Darkness is :-ash
The protagonist of the story The Birth of a Nation is :-cameron
The protagonist of the story Blade Runner is :-deckard
The protagonist of the story Blazing Saddles is :-bart
The protagonist of the story Blue Velvet (film) is :-dreams
The protagonist of the story Barry Lyndon is :-barry
The protagonist of the story Buffy the Vampire Slayer (film) is :-pike


# Limitation
This approach fails in some edge cases. For example, in **"Apple purchased Mitsubishi stock, and Apple fans are overjoyed."**, Apple is correctly tagged as a noun and is followed by a verb, yet in the context of the sentence it is an organization rather than a person. The results above show the same problem: terms such as "lab" and "dreams" are selected as protagonists. As a result, I have discarded this approach.
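
To make the failure concrete, the sketch below applies the same verb-following test to the example sentence. The `(token, tag)` pairs are written by hand for illustration rather than produced by TextBlob, and `verb_follow_ratio` is a hypothetical helper that is not part of this notebook. Unlike the cell above, it divides by the number of token occurrences instead of the `np_counts` value.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def verb_follow_ratio(tagged, word):
    """Fraction of occurrences of `word` that are immediately followed by a verb."""
    total = 0
    followed = 0
    for i, (token, _tag) in enumerate(tagged):
        if token.lower() != word.lower():
            continue
        total += 1
        if i + 1 < len(tagged) and tagged[i + 1][1] in VERB_TAGS:
            followed += 1
    return followed / total if total else 0.0


# Hand-written (token, tag) pairs (an assumption); a real tagger's output may differ.
tagged = [
    ("Apple", "NNP"), ("purchased", "VBD"), ("Mitsubishi", "NNP"),
    ("stock", "NN"), (",", ","), ("and", "CC"), ("Apple", "NNP"),
    ("fans", "NNS"), ("are", "VBP"), ("overjoyed", "JJ"), (".", "."),
]

ratio = verb_follow_ratio(tagged, "apple")
print(ratio)        # 0.5
print(ratio > 0.4)  # True, so "apple" would pass the person test
```

With these tags, "Apple" clears the 40% threshold even though it names a company, which is the edge case described above.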

# Approach 2

This is the preferred approach. It consists of:
1. A **Flair NER tagger**, which labels person entities (`PER`) in each story.
2. The protagonist is extracted by calculating the **frequency of person occurrences** in each story.
3. **Coreference resolution** is performed to enrich the frequency information and to identify the gender of the protagonist.
4. Finally, where coreference resolution fails to identify the gender, a **naive Bayes classifier** trained on gendered name data is used to classify the protagonist's gender.

In [None]:
!pip install flair
!pip install torch==1.9.0
!pip install allennlp==2.1.0 allennlp-models==2.1.0
!pip install transformers==4.1.1


In [None]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from tqdm import tqdm
import re
import string
from itertools import combinations
from collections import Counter
from flair.models import SequenceTagger
from flair.data import Sentence
from nltk.corpus import names
from nltk import NaiveBayesClassifier as NBC
from nltk import classify
import random
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('names')


In [7]:
#training gender classifier 

def gender_features(word):
    return {'last_letter': word[-1]}


maleNames = [(name, 'male') for name in names.words('male.txt')]
femaleNames = [(name, 'female') for name in names.words('female.txt')]
allNames = maleNames + femaleNames
random.shuffle(allNames)
featureData = [(gender_features(name), gender) for (name, gender) in allNames]
# hold out the first 800 names; they can be scored with classify.accuracy(classifier, test_data)
test_data = featureData[:800]
train_data = featureData[800:]
classifier = NBC.train(train_data)

In [None]:
# Use flair named entity recognition
tagger = SequenceTagger.load('ner')

In [None]:
#extracting all the persons
all_story_names = []
for s in stories:
  # called person_names so the `names` corpus imported from nltk.corpus is not overwritten
  person_names = []
  sent = nltk.tokenize.sent_tokenize(s)
  # Get all the names of entities tagged as people
  for line in tqdm(sent):
    sentence = Sentence(line)
    tagger.predict(sentence)
    for entity in sentence.to_dict(tag_type='ner')['entities']:
      if entity['labels'][0].value == 'PER':
        person_names.append(entity['text'])
  all_story_names.append(person_names)

In [None]:
all_story_names[0]

In [10]:
# Remove any punctuation within the names, then pick the most frequent name

protagonists = []

for story_names in all_story_names:
  cleaned_names = []
  for name in story_names:
    cleaned_names.append(name.translate(str.maketrans('', '', string.punctuation)))

  #extracting the protagonist by calculating the frequency
  protagonist = Counter(cleaned_names).most_common(1)[0][0]
  protagonists.append(protagonist)
  # print(Counter(cleaned_names).most_common())



In [11]:
protagonists

['Napoleon',
 'Alex',
 'Rieux',
 'Artemis',
 'Pham',
 'Paul',
 'Fay',
 'Senex',
 'Ash',
 'Cameron',
 'Deckard',
 'Bart',
 'Jeffrey',
 'Rachael',
 'Barry',
 'Buffy']

### Gender identification using coreference resolution and Gender classifier

In [None]:
from allennlp.predictors.predictor import Predictor
model_url = 'https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz'
predictor = Predictor.from_path(model_url)

In [13]:
for index, protagonist in enumerate(protagonists):
  pred = predictor.predict(
      document=stories[index]
  )
  clusters = pred['clusters']
  document = pred['document']

  # convert each coreference cluster from token spans to a flat list of tokens
  clus_all = []
  for one_cl in clusters:
      cluster = []
      for start, end in one_cl:
          # spans are inclusive token indices into `document`
          cluster.extend(document[start:end + 1])
      clus_all.append(cluster)
  # print(clus_all)  # shows all coreference clusters


  gender = "na"


  for cluster in clus_all:
    if protagonist in cluster:
      cluster_lower = [word.lower() for word in cluster]
      if "she" in cluster_lower or "her" in cluster_lower or "hers" in cluster_lower:
        gender = "female"
      if "he" in cluster_lower or "him" in cluster_lower or "his" in cluster_lower:
        gender = "male"
  
  if gender == "na":
    gender = classifier.classify(gender_features(protagonist))
  
  print("gender of protagonist " + protagonist + " of the story " + titles[index] + " is " + gender)






gender of protagonist Napoleon of the story Animal Farm is male
gender of protagonist Alex of the story A Clockwork Orange (novel) is male
gender of protagonist Rieux of the story The Plague is male
gender of protagonist Artemis of the story Actaeon is female
gender of protagonist Pham of the story A Fire Upon the Deep is male
gender of protagonist Paul of the story All Quiet on the Western Front is male
gender of protagonist Fay of the story Anyone Can Whistle is female
gender of protagonist Senex of the story A Funny Thing Happened on the Way to the Forum is male
gender of protagonist Ash of the story Army of Darkness is male
gender of protagonist Cameron of the story The Birth of a Nation is female
gender of protagonist Deckard of the story Blade Runner is male
gender of protagonist Bart of the story Blazing Saddles is male
gender of protagonist Jeffrey of the story Blue Velvet (film) is male
gender of protagonist Rachael of the story Blade Runner 2: The Edge of Human is male
gender
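
The pronoun check in the cell above assigns "male" whenever a matching cluster contains any male pronoun, even if female pronouns are more common, and a later cluster can overwrite an earlier one. One possible variant, sketched here as an assumption rather than as the method used for the results above, counts pronouns across all matching clusters and defers to the name-based classifier on a tie. `gender_from_clusters` and the toy clusters are hypothetical.

```python
from collections import Counter

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def gender_from_clusters(protagonist, clusters):
    """Return 'female', 'male', or 'na' by counting pronouns in matching clusters."""
    votes = Counter()
    for cluster in clusters:
        if protagonist not in cluster:
            continue
        for token in cluster:
            word = token.lower()
            if word in FEMALE_PRONOUNS:
                votes["female"] += 1
            elif word in MALE_PRONOUNS:
                votes["male"] += 1
    if votes["female"] == votes["male"]:
        return "na"  # no evidence or a tie: fall back to the name classifier
    return "female" if votes["female"] > votes["male"] else "male"


# Toy clusters in the same shape as `clus_all` (lists of tokens); made up for illustration.
toy_clusters = [
    ["Buffy", "she", "her", "She", "his"],
    ["Pike", "he", "him"],
]

print(gender_from_clusters("Buffy", toy_clusters))    # female
print(gender_from_clusters("Pike", toy_clusters))     # male
print(gender_from_clusters("Merrick", toy_clusters))  # na
```

In the first toy cluster a single stray "his" no longer flips the result, because the three female pronouns outvote it.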

# Improvements

1. In some cases the antagonist might have the highest frequency. One way to handle this would be to calculate the average sentiment of the sentences containing the top protagonist candidates and select those with a net positive sentiment.
2. Stories with multiple protagonists could be detected by examining the distribution of character occurrences. If the distribution is roughly even, the story may have multiple protagonists or none.
3. An external ground truth could be created using search engine APIs, by converting the title to a question (e.g. "Who is the protagonist of Buffy the Vampire Slayer?") and then extracting the answer from the search engine results. If search engine querying is too expensive, the open source movie database TMDB can be used to obtain the protagonist as well as their gender.
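
The second improvement could be prototyped with a simple evenness test on the mention counts. The sketch below is one possible reading of it, not an implemented part of this notebook: `protagonist_candidates` is a hypothetical helper, the names are toy data, and the 0.8 ratio is an arbitrary threshold that would need tuning.

```python
from collections import Counter


def protagonist_candidates(mentions, ratio=0.8):
    """Return every character whose mention count is at least `ratio` times the top count."""
    counts = Counter(mentions)
    if not counts:
        return []
    top_count = counts.most_common(1)[0][1]
    return [name for name, c in counts.most_common() if c >= ratio * top_count]


# Toy mention lists (made up): two characters are mentioned almost equally often.
mentions = ["Ann"] * 10 + ["Bob"] * 9 + ["Cat"] * 2
print(protagonist_candidates(mentions))                         # ['Ann', 'Bob']
print(protagonist_candidates(["Ann"] * 10 + ["Bob"] * 3))       # ['Ann']
```

A result with more than one name would flag the story as possibly having multiple protagonists, while a single name matches the behaviour of the frequency-based selection used above.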
