# Gender Bias in Autocomplete AI

This notebook builds a simple model that predicts the next words from what a user has typed, much like autocomplete in Google Search. The goal is to make the autocomplete algorithm less biased toward a particular gender.

The code learns from an existing corpus (a text-based dataset) and performs autocomplete on a word entered by the user.
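
As a rough illustration of that idea, the sketch below trains on a tiny hand-written token list rather than the notebook's dataset. The names `toy_tokens`, `train_toy_model`, and `window` are assumptions made for this example only. Each word is mapped to the words seen near it, and a suggestion is a random draw from that list.

```python
import collections
import random

# Hypothetical toy corpus, already tokenized; not taken from the notebook's dataset.
toy_tokens = ["women", "kind", "men", "strong", "women", "kind"]

def train_toy_model(tokens, window=2):
    """Map each word to the words seen within `window` positions of it."""
    model = collections.defaultdict(list)
    for i, key in enumerate(tokens):
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        model[key].extend(before + after)
    return model

toy_model = train_toy_model(toy_tokens)

# Autocomplete is a random draw from the recorded neighbours, so a word
# that co-occurs more often is proportionally more likely to be suggested.
random.seed(0)  # fixed seed only to make the draw repeatable
suggestion = random.choice(toy_model["women"])
print(toy_model["women"])
print(suggestion)
```

Because duplicates are kept in each list, whatever the corpus says most often about a word is also what the model suggests most often, which is how bias in the text carries over into the predictions.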

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import pandas as pd
import collections

## Training with Bias

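Run the cell below and enter "women" or "men" when prompted. The model is trained on the biased dataset, so the adjectives it suggests reflect that dataset.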

In [None]:
def preprocess(text):
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

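dataset = pd.read_csv("/content/drive/My Drive/Autocomplete/Autocomplete Dataset - Biased.csv")
# Replace line breaks with a space so the words on either side are not fused together
dataset["Comments"] = dataset["Comments"].str.replace('\r\n', ' ', regex=False)
# Drop empty comments, which would otherwise break the join below
text_list = dataset["Comments"].dropna().tolist()
text = ' '.join(text_list)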

# Preprocess the text
tokens = preprocess(text)

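# Context window: for each word, record the two words before and the two after it
def train_model(tokens):
    model = collections.defaultdict(list)
    for i in range(len(tokens)):
        key = tokens[i]
        # max(0, ...) keeps the slice start from going negative, which would
        # yield an empty slice and drop the context before the second token
        values = tokens[max(0, i-2):i] + tokens[i+1:i+3]
        model[key].extend(values)
    return model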

# Train the model
model = train_model(tokens)

import random

def generate_prediction(model, prefix):
    if prefix in model:
        suffixes = model[prefix]
        return random.choice(suffixes)
    else:
        return None

def check_adjective(word):
    tagged_word = nltk.pos_tag([word])
    pos = tagged_word[0][1]
    return pos.startswith('JJ')

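# Take input from the user
input_str = input("Enter word: ")
output_num = int(input("How many words do you want to generate: "))

# Preprocess the input (stop words and non-alphabetic tokens are dropped)
input_tokens = preprocess(input_str)

# Generate predictions: keep sampling until enough adjectives are collected.
# max_attempts is an arbitrary cap so the loop cannot spin forever when the
# model has no adjective to offer for the current word.
predictions = []
max_attempts = 1000
attempts = 0
while input_tokens and len(predictions) < output_num and attempts < max_attempts:
    attempts += 1
    new_word = generate_prediction(model, input_tokens[-1])
    if new_word is None:
        break  # the current word never appeared in the training text
    if new_word not in ("women", "men") and check_adjective(new_word):
        predictions.append(new_word)
        input_tokens.append(new_word)

# Print the prediction
if predictions:
    print("Next word prediction:", predictions)
else:
    print("No prediction found.")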

FileNotFoundError: ignored
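
One way to see what the model has picked up is to compare what it would suggest for each prompt. `generate_prediction` draws uniformly from a list that keeps duplicates, so a word's share of that list is its chance of being drawn. The sketch below is a minimal illustration using a hand-written `toy_model`; the words in it and the helper `suggestion_shares` are hypothetical, not results from the dataset. It also ignores the adjective filter applied in the generation loop.

```python
import collections

# Hypothetical stand-in for the trained `model`: each prompt maps to the
# context words recorded for it, duplicates included.
toy_model = {
    "women": ["emotional", "kind", "emotional", "caring"],
    "men": ["strong", "strong", "rational", "kind"],
}

def suggestion_shares(model, prompt):
    """Return each candidate word's share of the entries stored for `prompt`."""
    counts = collections.Counter(model.get(prompt, []))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Print the shares side by side so the two prompts can be compared.
for prompt in ("women", "men"):
    print(prompt, suggestion_shares(toy_model, prompt))
```

Running the same comparison on the real `model` for "women" and "men" would show which words dominate each prompt, and it could be repeated after retraining on a more balanced dataset to check whether the gap narrows.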