# Heroes and Heroines, Villains and Villainesses


The basic approach to this problem is the same as for any typical NLP problem: build a training set and a test set, then make predictions. The catch is that no machine learning library may be used, so the solution relies on core Python only.

The following methodology has been used:

* Read the `corpus.txt` file, clean it, and convert it to lowercase.
* Tokenize the corpus.
* Read all the names whose gender has to be predicted.
* Create two `sets`, namely `males` and `females`. This is one of the most important steps: each set contains words that are strongly associated with males and females respectively.
* The accuracy of the model depends heavily on these two sets. The larger and more relevant they are, the more accurate the model is likely to be.
* For every name in the input data, find each occurrence of the name in the tokenized corpus and look at roughly 25 words before and after it.
* Each word in that window that belongs to the `males` set increments the male score, and each word that belongs to the `females` set increments the female score.
* The gender with the higher score is printed as the result for the name. In the code below, a tie is printed as `Female`.
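
The steps above can be sketched end to end on a toy corpus. This is only an illustration: the names `tokenize`, `predict_gender`, `MALE_CUES`, and `FEMALE_CUES`, the simplified tokenizer, the tiny cue sets, and the sample sentence are all assumptions made for this example, not the code submitted to HackerRank.

```python
import re

# Hypothetical, deliberately tiny cue sets; the real sets are much larger.
MALE_CUES = {"he", "his", "him", "king"}
FEMALE_CUES = {"she", "her", "hers", "queen"}


def tokenize(text):
    """Lowercase the text and split it into tokens of at least two letters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if len(t) > 1]


def predict_gender(name, tokens, window=25):
    """Count cue words around each occurrence of `name` and pick a label."""
    male = female = 0
    for i, token in enumerate(tokens):
        if token != name:
            continue
        # Clamp the start so the slice never wraps around to the end of the list.
        context = tokens[max(0, i - window):i + window]
        male += sum(1 for w in context if w in MALE_CUES)
        female += sum(1 for w in context if w in FEMALE_CUES)
    # A tie falls through to 'Female'.
    return "Male" if male > female else "Female"


toy_corpus = ("Arthur drew his sword and smiled at his knights. "
              "Far away, Guinevere said she missed her home.")
tokens = tokenize(toy_corpus)
print(predict_gender("arthur", tokens, window=5))     # Male
print(predict_gender("guinevere", tokens, window=5))  # Female
```

A small window is used here only because the toy corpus is short. Note that a name that never appears in the corpus scores 0 to 0 and is therefore labelled `Female`.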


### Getting and Cleaning Data

In [1]:
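import re
import sys

# Read the whole corpus into memory.
with open('corpus.txt') as f:
    corpus = f.read()

# Split on spaces, newlines, and common punctuation (including hyphens and apostrophes).
pattern = re.compile(r" |:|;|,|-|\n|'|\.|\"|!")
corpus = pattern.split(corpus.lower())
# Strip stray whitespace, then discard empty and single-character tokens.
corpus = [w.strip() for w in corpus]
corpus = [w for w in corpus if len(w) > 1]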

These steps read the corpus, convert it to lowercase, and split it into a list of tokens.
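
To see what this cleaning does to a piece of text, here is a minimal sketch that applies the same kind of separator pattern to a single sentence. The helper name `tokenize` and the sample sentence are assumptions made for illustration.

```python
import re

# Separators: space, colon, semicolon, comma, hyphen, newline,
# apostrophe, full stop, double quote, exclamation mark.
pattern = re.compile(r" |:|;|,|-|\n|'|\.|\"|!")


def tokenize(text):
    """Lowercase, split on the separators, and drop tokens shorter than two characters."""
    parts = pattern.split(text.lower())
    return [w.strip() for w in parts if len(w.strip()) > 1]


print(tokenize("Mr. Darcy's half-brother said: wait!"))
# ['mr', 'darcy', 'half', 'brother', 'said', 'wait']
```

Two side effects are visible here: the possessive `s` is dropped because it is a single character, and a hyphenated word is split into two separate tokens.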


### Read the input

In [None]:
# The first line of input holds the number of names.
N = int(input())

# Each remaining line is treated as one name.
names = sys.stdin.readlines()
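
The same reading logic can be tried without HackerRank by feeding it an in-memory stream. The sketch below assumes the input is a count line followed by one name per line, which is inferred from the code above rather than from the problem statement. The function `read_names` and the sample input are hypothetical.

```python
import io


def read_names(stream):
    """Read a count line, then return up to that many non-empty names in lowercase."""
    n = int(stream.readline())
    names = [line.strip().lower() for line in stream if line.strip()]
    return names[:n]


sample_input = io.StringIO("2\nArthur\nGuinevere\n")
print(read_names(sample_input))  # ['arthur', 'guinevere']
```

Passing `sys.stdin` instead of the `io.StringIO` object would read from standard input in the same way.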


### Create the males and females dataset

In [None]:
males = set(['he', 'his', 'him', 'himself', 'father', 'brother', 'uncle', 'half-brother', 'halfbrother', 'son', 'boy','boys', 'dad', 'grandfather', 'king', 'nephew', 'actor', 'steward', 'barman', 'groom', 'chairman', 'man', 'gentleman', 'hero', 'host', 'husband', 'landlord', 'lord', 'monk', 'prince', 'waiter', 'widower', 'character', 'marquis', 'earl', 'italian', 'sir', 'cousin', 'englishman', 'attack', 'war', 'ranger', 'businessman', 'crowned','co-founder','fisherman','technology','engineering','slays','intemperate', 'washerman', 'berating', 'wayward', 'kill','buried','ruin','settle','exile','verses','cocky','abusive','aggressive','ruthless','accident','charming','young','mr','corporation'])
females = set(['she','girl','girls', 'hers', 'her', 'herself', 'mother', 'sister', 'aunt', 'half-sister', 'halfsister', 'daughter', 'girl', 'mom', 'grandmother', 'queen', 'niece', 'actress', 'stewardess', 'barmaid', 'bride', 'chairwoman', 'lady', 'headmistress', 'heroine', 'hostess', 'wife', 'landlady', 'lady', 'nun', 'princess','waitress', 'widow', 'dear', 'little', 'businesswoman','impure','forest','abducted','marries','purity','listen','earth','furrow','goddess','woman','female','women','dedication', 'self-sacrifice','wifely', 'womanly', 'virtues','fertile','feminine','doctor','caring','cute','tender','young','cute','slave','beautiful','mrs'])

As noted earlier, the accuracy of the model can likely be improved by expanding these two sets.
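
When expanding the sets, two quick checks may help, sketched below on small excerpts. A word that appears in both sets (such as `young`) adds the same amount to both scores and so never changes the outcome, and a word that contains a character the tokenizer splits on (such as a hyphen) can never equal a token. The helper `unreachable` and the `SEPARATORS` constant are assumptions for this example.

```python
# Small excerpts of the two cue sets, for illustration only.
males = {"he", "his", "half-brother", "co-founder", "young"}
females = {"she", "her", "half-sister", "self-sacrifice", "young"}

# Characters the tokenizer splits on; assumed to match the cleaning step.
SEPARATORS = set(" :;,-\n'.\"!")


def unreachable(cues):
    """Return cue words containing a separator, which no token can ever equal."""
    return sorted(w for w in cues if any(ch in SEPARATORS for ch in w))


shared = sorted(males & females)
print(shared)                # ['young']
print(unreachable(males))    # ['co-founder', 'half-brother']
print(unreachable(females))  # ['half-sister', 'self-sacrifice']
```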

### Make the predictions and display the result

In [None]:
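for name in names:
    name = name.strip().lower()

    # Positions of every occurrence of the name in the tokenized corpus.
    pos_list = [i for i, x in enumerate(corpus) if x == name]

    male = 0
    female = 0
    for i in pos_list:
        # Context window around the name. The start is clamped to 0 so that a
        # negative index does not wrap around to the end of the list.
        window = corpus[max(0, i - 25):i + 25]
        male += sum(window.count(w) for w in males)
        female += sum(window.count(w) for w in females)

    # Ties are reported as 'Female'.
    if male > female:
        print('Male')
    else:
        print('Female')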

The above algorithm generates the output described earlier. This code achieves a best score of 32.62/50 on HackerRank, which is unfortunately below the passing score of 36.96. However, the accuracy of the model can likely be improved by enlarging the `males` and `females` sets.
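
Besides the cue sets, the context window itself deserves a check near the start of the corpus. In Python, a negative slice start counts from the end of the list, so a window computed as `i - 25` for a small `i` can silently come back empty. The helper `context_window` below is a hypothetical sketch of a clamped window, not part of the submitted code.

```python
def context_window(tokens, i, radius=25):
    """Return the tokens from `radius` positions before i up to `radius` positions from i."""
    start = max(0, i - radius)  # clamp: a negative start would count from the end
    return tokens[start:i + radius]


tokens = ["w%d" % k for k in range(100)]
print(len(tokens[5 - 25:5 + 25]))      # 0, because the slice is tokens[-20:30]
print(len(context_window(tokens, 5)))  # 30
```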

![Result](Capture.PNG)

Note: The code might not run in this notebook, as it has been written for the HackerRank input/output format. Please download the .py file to execute it.