<a href="https://colab.research.google.com/github/averyPike/languageBiasCheck/blob/main/finalnotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pronoun Analysis in Literature: Gender Prediction and Frequency Comparison

## Introduction

Personal pronouns are a critical element of language that can provide insights into character portrayal and gender representation in literature. This study investigates two primary questions:
1. Is there a statistically significant difference between the frequencies of subject and object pronouns for feminine and masculine personal pronouns?
2. Is it possible to predict the gender of a character in a story using the pronouns associated with them?

To address these questions, we analyze subjective pronouns (he, she) and objective pronouns (him, her). While "him" is straightforward to identify as an objective pronoun, "her" can be either an objective pronoun or a possessive adjective. To distinguish between these usages, we implemented a function for part-of-speech (POS) tagging that identifies instances of "her" not followed by a noun (NN or NNS type) as objective, and the rest as possessive. We also considered possessive adjectives (his, her) and possessive pronouns (his, hers), treating "his" as both an adjective and a pronoun for simplicity.
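The rule for "her" can be sketched in a few lines. This is a minimal illustration rather than the notebook's own implementation: `classify_her` is a hypothetical helper, and the example sentence is tagged by hand with Penn Treebank-style tags instead of being run through a tagger.

```python
def classify_her(tagged_sentence):
    """Label each 'her' in a list of (token, pos) pairs.

    Returns one label per occurrence: 'possessive' when the next token
    is tagged NN or NNS, 'objective' otherwise.
    """
    labels = []
    for idx, (token, _pos) in enumerate(tagged_sentence):
        if token.lower() != "her":
            continue
        # Tag of the following token, or '' when 'her' ends the sentence.
        next_pos = tagged_sentence[idx + 1][1] if idx + 1 < len(tagged_sentence) else ""
        labels.append("possessive" if next_pos in ("NN", "NNS") else "objective")
    return labels


# Hand-tagged example; the tags are assumptions made for illustration.
sentence = [("He", "PRP"), ("gave", "VBD"), ("her", "PRP"),
            ("her", "PRP$"), ("book", "NN"), (".", ".")]
print(classify_her(sentence))
```

Because the rule looks only one token ahead, a phrase with an intervening adjective (such as "her old book") would be labeled objective under this sketch.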

## Methodology

### Datasets

We used the following literary works from Calvin's Project Gutenberg repository:
- *Pride and Prejudice* by Jane Austen
- *Frankenstein: Or, the Modern Prometheus* by Mary Shelley
- *Wuthering Heights* by Emily Brontë

### Technologies

We used Spark NLP for POS tagging and spaCy for named-entity recognition (NER).

### Process

1. **Corpus Preparation**: Identified the corpus of interest and cleaned the text by removing the periods after honorifics (Mr., Mrs., Ms., Dr.) and removing hyphens and underscores.
2. **Text Processing**: Split each book into sentences on periods, then reinserted the period at the end of each sentence to facilitate the identification of objective "her." This ensured that instances of the possessive adjective "her" followed by a noun were correctly identified.
3. **Pronoun Counting**: Counted instances of male and female pronouns in each sentence, associating counts with character names in a dictionary.
4. **Dictionary Conversion**: Converted the dictionary to a list for iteration and further processing.
5. **Character Filtering**: Excluded characters with fewer than 10 pronoun occurrences to focus on significant data points.
6. **Gender Prediction**: Predicted the gender of each character based on pronoun counts.
7. **Manual Verification**: The function `manual_verify` allowed manual verification of each character's gender, providing the true labels against which predictions were compared.
8. **Accuracy Calculation**: Compared true and predicted values to determine accuracy, including total accuracy, male accuracy, and female accuracy.
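Steps 3, 5, and 6 can be sketched without any NLP libraries. The sketch below is illustrative only: `accumulate`, `predict`, and the character names and counts are hypothetical, and the per-sentence counts are supplied by hand instead of coming from the tagger. The labels follow the notebook's convention of 1 for male and 0 for female.

```python
from collections import defaultdict

MIN_PRONOUNS = 10  # threshold from step 5


def accumulate(sentence_records):
    """Sum pronoun counts per character.

    sentence_records: iterable of (names, male_count, female_count),
    one record per sentence that mentions at least one person.
    """
    totals = defaultdict(lambda: [0, 0])
    for names, male, female in sentence_records:
        for name in names:
            totals[name][0] += male
            totals[name][1] += female
    return totals


def predict(totals):
    """Return {name: 1 (male) or 0 (female)} for characters at or above the threshold."""
    return {
        name: 1 if male > female else 0
        for name, (male, female) in totals.items()
        if male + female >= MIN_PRONOUNS
    }


# Hypothetical per-sentence counts.
records = [
    (["Alice"], 0, 6),
    (["Alice", "Bob"], 2, 5),
    (["Bob"], 7, 1),
    (["Carol"], 0, 3),
]
print(predict(accumulate(records)))
```

In this sketch, every person named in a sentence receives all of that sentence's pronoun counts, so characters who share sentences also share counts.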

## Hypotheses

1. Given that all books in the corpus are authored by women, we hypothesized no statistical difference between the counts of subjective male pronouns, subjective female pronouns, objective male pronouns, and objective female pronouns.
2. We hypothesized that gender prediction based on pronoun counts in relevant sentences would be more accurate than a random guess.

## Evaluation

### Character Gender Prediction

Accuracy was calculated by dividing the number of correct predictions by the total number of predictions. Results were split by book to evaluate model performance across different texts, and we also measured accuracy by gender. Notably, the model achieved 100% accuracy on *Frankenstein*, likely due to its smaller cast of characters.
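As a small worked example of the accuracy calculation, the sketch below computes total and per-gender accuracy from (prediction, actual) pairs. The helper `accuracy` and the sample rows are hypothetical; labels are 1 for male and 0 for female.

```python
def accuracy(rows, gender=None):
    """Fraction of rows where prediction == actual.

    rows: list of (prediction, actual) pairs.
    gender: if given, only rows whose actual label equals it are scored.
    Returns None when there are no rows to score.
    """
    if gender is not None:
        rows = [row for row in rows if row[1] == gender]
    if not rows:
        return None
    correct = sum(1 for predicted, actual in rows if predicted == actual)
    return correct / len(rows)


# Hypothetical predictions for five characters.
rows = [(1, 1), (0, 1), (0, 0), (0, 0), (1, 0)]
print(accuracy(rows))            # total accuracy
print(accuracy(rows, gender=1))  # accuracy on male characters
print(accuracy(rows, gender=0))  # accuracy on female characters
```

Returning `None` for an empty group avoids a division by zero when a book has no verified characters of one gender.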

### Pronoun Frequency Analysis

We conducted statistical tests to compare the frequencies of male and female subjective and objective pronouns. Our findings indicated no statistically significant differences between the frequencies of male and female subjective pronouns, nor between male and female objective pronouns.
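The notebook runs these comparisons with `scipy.stats.ttest_ind(..., equal_var=False)`, which is Welch's t-test. The sketch below computes the same statistic and its degrees of freedom with the standard library only, to make the arithmetic visible. The function `welch_t` and the counts are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b: sequences of per-book counts (at least two values each).
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-book counts, three books per group.
male_counts = [10, 12, 14]
female_counts = [9, 11, 16]
t_stat, dof = welch_t(male_counts, female_counts)
print(t_stat, dof)
```

With one observation per book, each group contains only three values, so a non-significant result should be read with that small sample size in mind.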

## Conclusions and Future Research

Our initial hypothesis, that a corpus written entirely by women would show no difference between the pronoun frequencies for each gender, was supported by our tests. We also had success with the accuracy of our gender predictions. These results are consistent with the notion that the gender of the author affects not only the frequency of the respective gender pronouns, but also predictive accuracy within the corpus.

This observation suggests the need for further research into pronoun usage concerning author gender. It would be irresponsible to suggest any concrete relationship between author gender and pronoun prediction accuracy from such a small corpus, but these initial findings and confirmation of our hypotheses are promising in suggesting that the gender of the author has significant influence on gender pronoun usage.

With confirmation from a larger corpus and appropriate comparative analysis with a corpus of male authors, there could be considerable real-world significance to the observations made from this study. Because most published materials in English have had male authorship, this could suggest that predictive models that do not consider the ratio of authorship between genders possess inherent bias against recognition of feminine pronouns. This point contributes to linguistic arguments that a lack of representation in the development of English has obfuscated feminine viewpoints, contributing to patriarchal ideological concepts such as the “Feminine Mystique.” By analyzing this corpus, this study suggests that there is validity in further research in this field.

pip installs, library imports, and global vars

In [None]:
%pip install sparknlp
%pip install pyspark
%pip install spacy

Collecting sparknlp
  Downloading sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Collecting spark-nlp (from sparknlp)
  Downloading spark_nlp-5.4.1-py2.py3-none-any.whl (579 kB)
Installing collected packages: spark-nlp, sparknlp
Successfully installed spark-nlp-5.4.1 sparknlp-1.0.0
Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=509b5340b72b4f0c3cb79f3e317ad33f090da941b69511634e11db2fb7bba152
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a6

In [None]:
import sparknlp
import os
import spacy
import pyspark
from pyspark.sql import SparkSession
from tqdm import tqdm
import datetime
import pandas as pd
from sparknlp.pretrained import PretrainedPipeline
import scipy.stats as stats
import matplotlib.pyplot as plt

In [None]:
spark = sparknlp.start()
root = os.getcwd() # working directory of the notebook, where the books are downloaded
pipeline = PretrainedPipeline("explain_document_ml")

explain_document_ml download started this may take some time.
Approx size to download 9 MB
[OK!]


get corpus

In [None]:
book_list = ['pg1342.txt', 'pg768.txt', 'pg84.txt']
# curl each book
for book in book_list:
  !curl "https://raw.githubusercontent.com/cd-public/books/main/{book}" -o {book}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  739k  100  739k    0     0  1620k      0 --:--:-- --:--:-- --:--:-- 1618k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  665k  100  665k    0     0  1478k      0 --:--:-- --:--:-- --:--:-- 1481k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  430k  100  430k    0     0  1371k      0 --:--:-- --:--:-- --:--:-- 1372k


In [None]:
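def read_txt(filename = 'pg1342.txt'):
  # read a whole book into a single string and close the file afterwards
  with open(os.path.join(root, filename), "r", encoding="utf-8") as file:
    return file.read()
pride = read_txt()
frank = read_txt('pg84.txt')
wuther = read_txt('pg768.txt')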

In [None]:
pronouns = ['he','him','his','she','her','hers']
def pronoun_check(x):
  '''
  x = list of sentences, each a list of (token, pos) tuples
  returns male and female pronoun counts; 'her' is counted only when objective
  '''
  male_count, female_count = 0, 0
  for sent in x:
    for idx, (token, pos) in enumerate(sent):
      token = token.lower()
      if token not in pronouns:
        continue
      if token == 'her':
        # 'her' is objective when the next token is not a noun (NN or NNS)
        next_pos = sent[idx + 1][1] if idx + 1 < len(sent) else ''
        if next_pos not in ('NN', 'NNS'):
          female_count += 1
      elif token in ('he', 'him', 'his'):
        male_count += 1
      else: # 'she' or 'hers'
        female_count += 1

  return male_count, female_count

#def count_pronouns(x):

In [None]:
def count_subObjPos(x):
  '''
  x = list of sentences, each a list of (token, pos) tuples
  returns total male and female counts, followed by the subjective,
  objective, and possessive counts for each gender
  '''
  male_countSub, female_countSub, male_countObj, female_countObj, male_countPos, female_countPos = 0, 0, 0, 0, 0, 0
  for sent in x:
    for idx, (token, pos) in enumerate(sent):
      token = token.lower()
      if token not in pronouns:
        continue
      if token == 'her':
        # 'her' followed by a noun (NN or NNS) is possessive; otherwise it is objective
        next_pos = sent[idx + 1][1] if idx + 1 < len(sent) else ''
        if next_pos in ('NN', 'NNS'):
          female_countPos += 1
        else:
          female_countObj += 1
      elif token == 'he':
        male_countSub += 1
      elif token == 'him':
        male_countObj += 1
      elif token == 'she':
        female_countSub += 1
      elif token == 'hers':
        female_countPos += 1
      elif token == 'his':
        male_countPos += 1
  male_count = male_countSub + male_countObj + male_countPos
  female_count = female_countSub + female_countObj + female_countPos

  return male_count, female_count, male_countSub, female_countSub, male_countObj, female_countObj, male_countPos, female_countPos

In [None]:
def book_cleaner(text):
  text = text.replace('Mr.','Mr')
  text = text.replace('Mrs.','Mrs')
  text = text.replace('Ms.','Ms')
  text = text.replace('Dr.','Dr')
  text = text.replace('_','')
  text = text.replace('-','')
  # text = text.replace('\n', ' ')
  # text = text.replace('""',' ')
  # text = text.replace('  ',' ')
  # # text = text.lower()
  return text

pridesplit = book_cleaner(pride).split('.')
franksplit = book_cleaner(frank).split('.')
wuthersplit = book_cleaner(wuther).split('.')

In [None]:
def period(sentences):
  sent_list = []
  for sent in sentences:
    sent = sent+'.'
    sent_list.append(sent)
  return sent_list
pridesplit = period(pridesplit)
franksplit = period(franksplit)
wuthersplit = period(wuthersplit)

For each sentence, find all people mentioned. Add the number of male and female pronouns counted in the sentence to a dict where the key is the person's name.

In [None]:
def book_person_dict_builder(split_book):
  nlp = spacy.load('en_core_web_sm')
  personGenderCount = {} # person: (malePronounCount, femalePronounCount)
  for sent in tqdm(split_book): # for each sentence in the book
    doc = nlp(sent) # create the spaCy doc
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON'] # get all people entities
    if len(people) == 0: # if there aren't any people
      continue # move on to the next sentence
    annotated_sent = pipeline.annotate(sent) # annotate the sentence with parts of speech
    # the sentence as a list of (token, part of speech) tuples, wrapped in a list
    zips = [list(zip(annotated_sent['token'], annotated_sent['pos']))]
    male_count, female_count, *_ = count_subObjPos(zips) # get male/female pronoun counts
    for person in people: # for each person in the sentence
      # add the new counts to any counts already stored for this person
      old_male, old_female = personGenderCount.get(person, (0, 0))
      personGenderCount[person] = old_male + male_count, old_female + female_count

  return personGenderCount

pride_dict = book_person_dict_builder(pridesplit)
frank_dict = book_person_dict_builder(franksplit)
wuther_dict = book_person_dict_builder(wuthersplit)


100%|██████████| 5516/5516 [11:31<00:00,  7.98it/s]
100%|██████████| 3129/3129 [02:44<00:00, 18.98it/s]
100%|██████████| 4922/4922 [08:06<00:00, 10.12it/s]


Convert the dict to a list so we can iterate through it, dropping characters with fewer than 10 pronoun occurrences.

In [None]:
def dict_to_list(person_dict):
  # convert personGenderCount to a list of [name, male_count, female_count]
  out = []
  for key, value in person_dict.items():
    m_count, f_count = value[0], value[1]
    if m_count + f_count < 10: # skip characters with fewer than 10 pronouns
      continue
    out.append([key, m_count, f_count])
  return out

personGenderList_pride = dict_to_list(pride_dict)
personGenderList_frank = dict_to_list(frank_dict)
personGenderList_wuther = dict_to_list(wuther_dict)


Manually verify each person's gender (do not presume to know unless prefaced by Mr, Mrs, etc.). Enter 1 for male, 0 for female, and -1 for other.

In [None]:
def manual_verify(personGenderList, book):
  out_list = [] # will consist of [name, prediction, actual]
  try:
    old_out_file = pd.read_csv(book + '_output' + '.csv')
  except FileNotFoundError: # no labels saved by an earlier run
    old_out_file = pd.DataFrame(columns=['name', 'prediction', 'actual'])
  for name, male_count, female_count in personGenderList:
    prediction = 1 if male_count > female_count else 0
    if name in old_out_file['name'].values:
      # reuse the label saved by an earlier run
      actual = old_out_file[old_out_file['name'] == name]['actual'].values[0]
    else:
      # ask for the label of every name without a saved one
      actual = input('Is ' + name + ': ' + str(prediction) + '? ') # 1 for M, 0 for F, -1 for other
    out_list.append([name, prediction, actual])
  return out_list

out_list_pride = manual_verify(personGenderList_pride, 'pride')
out_list_frank = manual_verify(personGenderList_frank, 'frank')
out_list_wuther = manual_verify(personGenderList_wuther, 'wuther')

Is Mary: 0? Yes
Is Austen: 0? Yes
Is Wickham: 1? Yes
Is Jane: 0? Yes
Is Darcy: 1? No
Is Elizabeth: 0? No
Is Bennet: 0? 
Is Collins: 1? no
Is Mrs
Bennet: 1? no
Is Mrs Bennet: 0? no
Is Mr Collins: 1? no
Is Lady Catherine de Bourgh: 1? no
Is Lady Catherine: 0? o
Is Kitty: 0? yes
Is Mrs Bennet’s: 0? 
Is Bingleys: 0? 
Is Pemberley: 0? 
Is Mr Bingley: 0? 


KeyboardInterrupt: Interrupted by user

save to file

In [None]:
def save_books_person_dict(out_list, book):
  df = pd.DataFrame(out_list, columns=['name', 'prediction', 'actual'])
  df = df.astype({'actual': int})
  df.to_csv(book + '_output' + '.csv', index=False)
  return df

df_pride = save_books_person_dict(out_list_pride, 'pride')
df_frank = save_books_person_dict(out_list_frank, 'frank')
df_wuther = save_books_person_dict(out_list_wuther, 'wuther')

NameError: name 'out_list_pride' is not defined

In [None]:
# df[df['name'] == 'ellen'][actual] = 0
df_wuther.loc[df_wuther['name'] == 'Ellen', 'actual'] = 0

Calculation of Accuracy

In [None]:
def accuracy_calc(df, book):
  print(f'Book: {book}')
  # total accuracy (in percent): a prediction is correct when predicted == actual
  accuracy = (df['prediction'] == df['actual']).sum() / len(df) * 100
  print(f'Total Accuracy: {round(accuracy,2)}%')
  # female accuracy
  female_df = df[df['actual'] == 0]
  female_accuracy = (female_df['prediction'] == female_df['actual']).sum() / len(female_df) * 100
  print(f'Female Accuracy: {round(female_accuracy,2)}%')
  # male accuracy
  male_df = df[df['actual'] == 1]
  male_accuracy = (male_df['prediction'] == male_df['actual']).sum() / len(male_df) * 100
  print(f'Male Accuracy: {round(male_accuracy,2)}%\n')
  # return the percentages so they can be plotted
  return book, accuracy, female_accuracy, male_accuracy

accuracy_calc(df_pride, 'Pride and Prejudice')
accuracy_calc(df_frank, 'Frankenstein')
accuracy_calc(df_wuther, 'Wuthering Heights')
df_all = pd.concat([df_pride, df_frank, df_wuther])
accuracy_calc(df_all, 'All Books')

In [None]:
books = [
    accuracy_calc(df_pride, 'Pride and Prejudice'),
    accuracy_calc(df_frank, 'Frankenstein'),
    accuracy_calc(df_wuther, 'Wuthering Heights')
]

df_all = pd.concat([df_pride, df_frank, df_wuther])
books.append(accuracy_calc(df_all, 'All Books'))

fig, axs = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Accuracy of Predictions for Each Book')

for i, (book, total_acc, female_acc, male_acc) in enumerate(books):
    ax = axs[i//2, i%2]
    ax.bar(['Total', 'Female', 'Male'], [total_acc, female_acc, male_acc], color=['green', 'purple', 'orange'])
    ax.set_title(book)
    ax.set_ylim([0, 100])
    for index, value in enumerate([total_acc, female_acc, male_acc]):
        ax.text(index, value + 1, f'{value:.2f}%', ha='center')

# Adjust layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

The model predicted female characters very well but male characters poorly. This could be a result of the main subject matter of these books being women. A comparative project could analyze whether there is a correlation between gender recognition accuracy and the gender of the author.

In [None]:
# Pronoun Frequencies

# rejoin each book's sentences into a single string, one document per book
books = [' '.join(pridesplit), ' '.join(franksplit), ' '.join(wuthersplit)]

def analyze_pronouns(corpus):
    """
    corpus: list of documents, where each document is one book as a single string.
    returns: dictionary with t-test results comparing male and female subjective
             counts and male and female objective counts across the books.
    """
    male_countSub_list = []
    female_countSub_list = []
    male_countObj_list = []
    female_countObj_list = []

    for doc in corpus:
      annotated_doc = pipeline.annotate(doc) # annotate the book with parts of speech
      # the book as a list of (token, part of speech) tuples, wrapped in a list
      zips = [list(zip(annotated_doc['token'], annotated_doc['pos']))]
      _, _, male_countSub, female_countSub, male_countObj, female_countObj, _, _ = count_subObjPos(zips)

      male_countSub_list.append(male_countSub)
      female_countSub_list.append(female_countSub)
      male_countObj_list.append(male_countObj)
      female_countObj_list.append(female_countObj)

    # Perform Welch's t-tests (equal_var=False), with one observation per book
    ttest_results = {}

    ttest_results['male_sub_vs_female_sub'] = stats.ttest_ind(male_countSub_list, female_countSub_list, equal_var=False)
    ttest_results['male_obj_vs_female_obj'] = stats.ttest_ind(male_countObj_list, female_countObj_list, equal_var=False)

    return ttest_results

In [None]:
ttest = analyze_pronouns(books)

In [None]:
ttest