# Testing Additional Biases

This notebook tests de-biasing of word embeddings with respect to socioeconomic status and age. It is based on the example notebook provided by the authors of "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (https://arxiv.org/abs/1607.06520).

The scripts used and the example IPython notebook can be found in the authors' GitHub repository at https://github.com/tolga-b/debiaswe

This is a separate student project meant to build on some of the results discussed in the report and code located at
https://github.com/nathanag/debiaswe

## Part 1: Setup

### Step 1: Import relevant libraries and modules

In [95]:
from __future__ import print_function, division
%matplotlib inline
from matplotlib import pyplot as plt
import json
import random
import numpy as np

import debiaswe as dwe
import debiaswe.we as we
from debiaswe.we import WordEmbedding
from debiaswe.data import load_professions
from debiaswe.debias import debias

### Step 2: Import Data

In [96]:
# load google news word2vec
E = WordEmbedding('./embeddings/w2v_gnews_small.txt')

# load professions
professions = load_professions()
profession_words = [p[0] for p in professions]

*** Reading data from ./embeddings/w2v_gnews_small.txt
(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
Loaded professions
Format:
word,
definitional female -1.0 -> definitional male 1.0
stereotypical female -1.0 -> stereotypical male 1.0


## Part 2: Age Based De-Biasing

In [97]:
# bind a second name to the word embedding for Age de-biasing
# note: this is an alias, not a copy, so debias() below also modifies E
E_age = E
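
Plain assignment in Python binds a second name to the same object rather than copying it. The sketch below illustrates the difference with a hypothetical `ToyEmbedding` class (it is not part of debiaswe); whether `copy.deepcopy` is the most suitable way to duplicate a real `WordEmbedding` is an assumption.

```python
import copy
import numpy as np


class ToyEmbedding(object):
    """Hypothetical stand-in for an embedding object holding a matrix of vectors."""

    def __init__(self, vecs):
        self.vecs = np.array(vecs, dtype=float)


toy = ToyEmbedding([[1.0, 0.0], [0.0, 1.0]])
alias = toy                        # same object under a second name
independent = copy.deepcopy(toy)   # separate object with its own array

# An in-place change made through the alias is visible through the original name.
alias.vecs[0] = [0.5, 0.5]

print(toy.vecs[0])          # reflects the change
print(independent.vecs[0])  # still the original row
```

With a deep copy, each bias experiment could start from the same untouched embedding.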

### Step 1: Define Age Direction

We define the bias direction as elderly - youthful, since these words are more explicitly tied to human age than, say, old - young.
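
As a rough illustration of what such a direction means, the sketch below builds a unit-length difference vector from two toy word vectors and projects other words onto it. The vectors and the helper names `unit_diff` and `rank_by_projection` are made up for this example, and it assumes that `diff` returns a normalized difference, which is not confirmed here.

```python
import numpy as np


def unit_diff(u, v):
    """Return the unit-length vector pointing from v to u."""
    d = u - v
    return d / np.linalg.norm(d)


def rank_by_projection(words, vecs, direction):
    """Sort words by the dot product of their vector with the direction."""
    return sorted((float(vecs[w].dot(direction)), w) for w in words)


# Toy 2-d unit vectors, purely illustrative.
toy_vecs = {
    "elderly": np.array([1.0, 0.0]),
    "youthful": np.array([0.0, 1.0]),
    "pensioner": np.array([0.8, 0.6]),
    "student": np.array([0.6, 0.8]),
}

toy_direction = unit_diff(toy_vecs["elderly"], toy_vecs["youthful"])
ranked = rank_by_projection(["pensioner", "student"], toy_vecs, toy_direction)
for score, word in ranked:
    print("%-10s %+.3f" % (word, score))
```

Words with a positive projection lean toward the first word of the pair, and words with a negative projection lean toward the second.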

In [98]:
# age direction
v_age = E_age.diff('elderly','youthful')

### Step 2: Generating Analogies of "Elderly : x :: Youthful : y"

In [99]:
# analogies age
a_age = E_age.best_analogies_dist_thresh(v_age)

for (a,b,c) in a_age:
    print(a+"-"+b)

Computing neighbors
Mean: 10.219732808538016
Median: 7.0
young-youthful
elderly-middle_aged
children-youngsters
older-younger
caring-nurturing
mentally_ill-schizophrenic
considerate-personable
pensioner-schoolgirl
seniors-sophomores
kindness-humility
poor-mediocre
youths-youth
sick-tired
drunks-rowdy
farmers-crop
concerned-optimistic
woman-teenager
mothers-motherhood
revitalization-revitalized
appreciative-enthused
hatchback-sporty
adopts-embraces
idyllic-carefree
dog-bulldog
apartment-dorm_room
preparing-primed
osteoporosis-acne
adults-tweens
victims-perpetrators
bewildered-wide_eyed
plight-travails
collapsed-imploded
foster-nurture
uncooperative-combative
receiving-garnering
anxiety-excitement
lends-exudes
recognizes-embodies
heartless-ruthless
dementia-psychosis
courageous-fearless
antiques-vintage
unwillingness-eagerness
sentiment-optimism
peasants-revolutionaries
unconscionable-foolish
affectionate-playful
inadequate-lacking
husband-younger_brother
dozens-bevy
callous-arrogant
sle
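
The paper scores a candidate pair (x, y) by the cosine between x - y and the bias direction, considering only pairs whose vectors lie within a distance threshold of each other. The sketch below applies that idea to a toy vocabulary. It is a simplified brute-force version with made-up vectors and a hypothetical helper `analogy_pairs`, not the implementation behind `best_analogies_dist_thresh`.

```python
import numpy as np


def analogy_pairs(words, vecs, direction, thresh=1.0):
    """Greedy toy version: score nearby ordered pairs by cosine with the direction."""
    direction = direction / np.linalg.norm(direction)
    scored = []
    for i, x in enumerate(words):
        for j, y in enumerate(words):
            if i == j:
                continue
            d = vecs[i] - vecs[j]
            dist = np.linalg.norm(d)
            if 0 < dist <= thresh:
                score = d.dot(direction) / dist
                if score > 0:
                    scored.append((score, x, y))
    scored.sort(key=lambda t: -t[0])
    # Use each word at most once on the left and once on the right.
    used_left, used_right, result = set(), set(), []
    for score, x, y in scored:
        if x in used_left or y in used_right:
            continue
        used_left.add(x)
        used_right.add(y)
        result.append((x, y, score))
    return result


toy_words = ["elderly", "youthful", "pensioner", "schoolgirl", "table"]
raw = np.array([
    [0.4, 0.9, 0.1],
    [-0.4, 0.9, 0.1],
    [0.4, 0.1, 0.9],
    [-0.4, 0.1, 0.9],
    [0.0, -1.0, 0.0],
])
toy_matrix = raw / np.linalg.norm(raw, axis=1, keepdims=True)
toy_pairs = analogy_pairs(toy_words, toy_matrix, toy_matrix[0] - toy_matrix[1])
for x, y, score in toy_pairs:
    print(x + "-" + y)
```

The distance threshold keeps the two words of a pair semantically close, which is why unrelated words such as "table" in the toy vocabulary never appear.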

### Step 3: Analyzing age bias in word vectors associated with professions

In [100]:
# profession analysis age
sp = sorted([(E_age.v(w).dot(v_age), w) for w in profession_words])

sp[0:20], sp[-20:]

([(-0.20439817, 'firebrand'),
  (-0.17037404, 'skipper'),
  (-0.16300799, 'protege'),
  (-0.15634924, 'soft_spoken'),
  (-0.15512511, 'captain'),
  (-0.14762147, 'character'),
  (-0.14514217, 'understudy'),
  (-0.13874087, 'maestro'),
  (-0.13791606, 'vocalist'),
  (-0.12872238, 'midfielder'),
  (-0.12415519, 'alter_ego'),
  (-0.117710695, 'coach'),
  (-0.11735272, 'drummer'),
  (-0.11567702, 'performer'),
  (-0.11487635, 'boss'),
  (-0.11357656, 'dancer'),
  (-0.108110696, 'crooner'),
  (-0.10684258, 'stylist'),
  (-0.09997366, 'commander'),
  (-0.099105045, 'protagonist')],
 [(0.14806971, 'parishioner'),
  (0.15078013, 'pharmacist'),
  (0.15097722, 'firefighter'),
  (0.15188062, 'constable'),
  (0.1610317, 'advocate'),
  (0.16266564, 'serviceman'),
  (0.16472465, 'trucker'),
  (0.16831487, 'nanny'),
  (0.17195654, 'laborer'),
  (0.17439684, 'handyman'),
  (0.17853151, 'employee'),
  (0.1785875, 'policeman'),
  (0.18029651, 'shopkeeper'),
  (0.18213417, 'taxi_driver'),
  (0.18763213, 

### Step 4: Debias based on age

In [101]:
# Let's load some age-related word lists to help us with debiasing
with open('./data/age_definitional_pairs.json', "r") as f:
    age_defs = json.load(f)
print("definitional", age_defs)

with open('./data/age_equalize_pairs.json', "r") as f:
    age_equalize_pairs = json.load(f)
print("equalize", age_equalize_pairs)
with open('./data/age_specific_seed.json', "r") as f:
    age_specific_words = json.load(f)
print("age specific", len(age_specific_words), age_specific_words[:10])

definitional [['aged', 'young'], ['parent', 'child'], ['elder', 'youth'], ['past', 'future'], ['old', 'new'], ['death', 'birth'], ['mature', 'immature'], ['expert', 'novice'], ['grandparent', 'grandchild'], ['elderly', 'youthful']]
equalize [['parent', 'child'], ['teacher', 'student'], ['grandparent', 'grandkid'], ['grandparents', 'grandkids'], ['grandfather', 'grandson'], ['grandmother', 'granddaughter'], ['retirement_home', 'nursery'], ['retiree', 'entry_level_worker'], ['adult', 'kid'], ['senior', 'freshman'], ['father', 'son'], ['mother', 'daughter']]
age specific 24 ['parent', 'child', 'teacher', 'student', 'grandparent', 'grandkid', 'grandparents', 'grandkids', 'grandfather', 'grandson']


In [102]:
debias(E_age, age_specific_words, age_defs, age_equalize_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
{('teacher', 'student'), ('Parent', 'Child'), ('senior', 'freshman'), ('RETIREMENT_HOME', 'NURSERY'), ('mother', 'daughter'), ('ADULT', 'KID'), ('Grandparents', 'Grandkids'), ('Retirement_Home', 'Nursery'), ('GRANDFATHER', 'GRANDSON'), ('father', 'son'), ('RETIREE', 'ENTRY_LEVEL_WORKER'), ('Senior', 'Freshman'), ('PARENT', 'CHILD'), ('retirement_home', 'nursery'), ('grandmother', 'granddaughter'), ('Adult', 'Kid'), ('FATHER', 'SON'), ('MOTHER', 'DAUGHTER'), ('Grandparent', 'Grandkid'), ('grandparents', 'grandkids'), ('parent', 'child'), ('TEACHER', 'STUDENT'), ('SENIOR', 'FRESHMAN'), ('grandfather', 'grandson'), ('Mother', 'Daughter'), ('GRANDPARENTS', 'GRANDKIDS'), ('Grandfather', 'Grandson'), ('grandparent', 'grandkid'), ('Teacher', 'Student'), ('adult', 'kid'), ('Father', 'Son'), ('GRANDPARENT', 'GRANDKID'), ('GRANDMOTHER', 'GRANDDAUGHTER'), ('Grandmother', 'Granddaughter'), ('Retiree', 'Entry_Level_Wo
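
The hard de-biasing described in the paper has two steps: neutralize (remove the bias component from words that should be neutral) and equalize (make each pair in the equalize list differ only along the bias direction). The sketch below shows both steps on toy unit vectors. The direction `g`, the vectors, and the helper names are assumptions for illustration, and this is not the code that `debias` runs.

```python
import numpy as np


def drop(u, g):
    """Remove from u its component along the unit vector g."""
    return u - g * u.dot(g)


def neutralize(w, g):
    """Project w off the bias direction and rescale to unit length."""
    v = drop(w, g)
    return v / np.linalg.norm(v)


def equalize(a, b, g):
    """Replace unit vectors a and b by unit vectors that differ only along g."""
    y = drop((a + b) / 2, g)                 # shared part, orthogonal to g
    z = np.sqrt(1 - np.linalg.norm(y) ** 2)  # size of the component along g
    if (a - b).dot(g) < 0:
        z = -z
    return y + z * g, y - z * g


def unit(values):
    v = np.array(values, dtype=float)
    return v / np.linalg.norm(v)


g = np.array([1.0, 0.0, 0.0])   # toy bias direction
w = unit([0.3, 0.8, 0.5])       # a word that should be neutral
a = unit([0.6, 0.7, 0.2])       # first word of an equalize pair
b = unit([-0.5, 0.6, 0.4])      # second word of the pair

w_n = neutralize(w, g)
a_e, b_e = equalize(a, b, g)

print(abs(w_n.dot(g)) < 1e-12)                  # neutral word has no bias component
print(np.isclose(w_n.dot(a_e), w_n.dot(b_e)))   # and is equally similar to both pair words
```

After both steps, any neutralized word has the same dot product with the two words of an equalized pair.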

### Step 5: Analyzing age bias in professions post debiasing

In [103]:
# profession analysis age
sp_debiased = sorted([(E_age.v(w).dot(v_age), w) for w in profession_words])

sp_debiased[0:20], sp_debiased[-20:]

([(-0.3091138, 'student'),
  (-0.20400825, 'soft_spoken'),
  (-0.1516062, 'firebrand'),
  (-0.14507702, 'skipper'),
  (-0.14192595, 'maestro'),
  (-0.13913538, 'vocalist'),
  (-0.13761961, 'crooner'),
  (-0.13602431, 'guitarist'),
  (-0.13209036, 'drummer'),
  (-0.13118091, 'captain'),
  (-0.12988138, 'protege'),
  (-0.12250668, 'stylist'),
  (-0.12145236, 'singer'),
  (-0.11868984, 'understudy'),
  (-0.112814255, 'dancer'),
  (-0.11172544, 'confesses'),
  (-0.10882473, 'midfielder'),
  (-0.107521325, 'alter_ego'),
  (-0.10380973, 'investment_banker'),
  (-0.101790324, 'doctoral_student')],
 [(0.03428898, 'rabbi'),
  (0.03568951, 'handyman'),
  (0.037688643, 'janitor'),
  (0.04216937, 'tutor'),
  (0.042986553, 'housekeeper'),
  (0.04361617, 'pharmacist'),
  (0.04777269, 'warden'),
  (0.047919422, 'lifeguard'),
  (0.047979757, 'sheriff_deputy'),
  (0.048308235, 'advocate'),
  (0.0492267, 'firefighter'),
  (0.050726738, 'fireman'),
  (0.05111678, 'landlord'),
  (0.053630207, 'receptionis
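
To compare the two listings word by word, the projections before and after de-biasing can be lined up. This is a minimal sketch using a handful of values copied from the outputs above; the `before` and `after` dictionaries are illustrative stand-ins for `sp` and `sp_debiased`.

```python
# Projections onto v_age copied from the listings above (before and after debias).
before = {
    "handyman": 0.17439684,
    "pharmacist": 0.15078013,
    "firefighter": 0.15097722,
    "firebrand": -0.20439817,
    "soft_spoken": -0.15634924,
}
after = {
    "handyman": 0.03568951,
    "pharmacist": 0.04361617,
    "firefighter": 0.0492267,
    "firebrand": -0.1516062,
    "soft_spoken": -0.20400825,
}

# Positive shrinkage means the word moved closer to zero along the direction.
shrinkage = {w: abs(before[w]) - abs(after[w]) for w in before}
for w in sorted(shrinkage, key=shrinkage.get, reverse=True):
    print("%-12s %+.3f -> %+.3f" % (w, before[w], after[w]))
```

In this sample most words move toward zero, while soft_spoken moves further from it. Note that `debias` is given the definitional pairs rather than `v_age`, so the direction it removes need not coincide with `v_age`, and projections onto `v_age` are not guaranteed to vanish.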

## Part 3: Socioeconomic Based De-Biasing

In [105]:
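# bind a second name to the word embedding for Socioeconomic de-biasing
# note: this is an alias, not a copy; unless E is reloaded first, it still carries the age de-biasing from Part 2
E_se = E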

### Step 1: Define Socioeconomic Direction

We define the bias direction as wealthy - impoverished, since these words are more explicitly tied to socioeconomic status/class than, say, rich - poor.

In [106]:
# socioeconomic direction
v_se = E_se.diff('wealthy','impoverished')

### Step 2: Generating Analogies of "Wealthy : x :: Impoverished : y"

In [107]:
# analogies socioeconomic status
a_se = E_se.best_analogies_dist_thresh(v_se)

for (a,b,c) in a_se:
    print(a+"-"+b)

Computing neighbors
Mean: 10.160844718616357
Median: 7.0
wealthiest-poorest
wealthier-poorer
poorer-impoverished
wealthy-richest
yachts-fishing_boats
luxuries-basic_necessities
educated-illiterate
yacht-vessel
capitalists-peasants
secluded-desolate
cottages-shacks
corporations-governments
terrorists-guerrillas
communists-communist
greedy-corrupt
revolt-uprising
earners-earner
modest-meager
affluent-populous
socialite-actress
civil_liberties-human_rights
despicable-deplorable
gullible-uneducated
insiders-observers
cynicism-hopelessness
mansions-palaces
cosmopolitan-metropolis
disadvantaged-poverty_stricken
inequality-poverty
hectic-chaotic
banker-lender
terrorism-militancy
millionaire-mogul
obesity-malnutrition
steak-fried_chicken
skinny-emaciated
compromise-accord
opulent-lush
distinguished-exemplary
lousy-abysmal
liberal-leftist
speculators-traders
oversupply-shortages
maids-migrant_workers
renovated-dilapidated
steaks-beef
stuffy-drab
residences-neighborhoods
sleazy-seedy
mountain-mo

### Step 3: Analyzing socioeconomic bias in word vectors associated with professions

In [108]:
# profession analysis socioeconomic status
sp_se = sorted([(E_se.v(w).dot(v_se), w) for w in profession_words])

sp_se[0:20], sp_se[-20:]

([(-0.1363303, 'missionary'),
  (-0.118660614, 'performer'),
  (-0.11678807, 'lyricist'),
  (-0.109490275, 'artiste'),
  (-0.10128166, 'envoy'),
  (-0.08714663, 'laborer'),
  (-0.08581598, 'minister'),
  (-0.08471248, 'ranger'),
  (-0.08247355, 'technician'),
  (-0.08150867, 'nun'),
  (-0.07534849, 'drug_addict'),
  (-0.07504965, 'pastor'),
  (-0.07445507, 'major_leaguer'),
  (-0.07175174, 'soldier'),
  (-0.070720546, 'singer'),
  (-0.07058622, 'prisoner'),
  (-0.06899183, 'choreographer'),
  (-0.068508714, 'warden'),
  (-0.06536086, 'understudy'),
  (-0.064544156, 'pathologist')],
 [(0.14112504, 'restaurateur'),
  (0.1414366, 'architect'),
  (0.14270231, 'housekeeper'),
  (0.14558473, 'jeweler'),
  (0.15168333, 'baron'),
  (0.152001, 'interior_designer'),
  (0.15693706, 'accountant'),
  (0.16227153, 'collector'),
  (0.16358331, 'philanthropist'),
  (0.16491355, 'tycoon'),
  (0.16985941, 'lawyer'),
  (0.17088652, 'butler'),
  (0.18480188, 'realtor'),
  (0.20747714, 'stockbroker'),
  (0

### Step 4: Debias based on socioeconomic status

In [113]:
# Let's load some socioeconomic-status-related word lists to help us with debiasing
with open('./data/se_definitional_pairs.json', "r") as f:
    se_defs = json.load(f)
print("definitional", se_defs)

with open('./data/se_equalize_pairs.json', "r") as f:
    se_equalize_pairs = json.load(f)
print("equalize", se_equalize_pairs)
with open('./data/se_specific_seed.json', "r") as f:
    se_specific_words = json.load(f)
print("socioeconomic specific", len(se_specific_words), se_specific_words[:10])

definitional [['rich', 'poor'], ['wealthy', 'impoverished'], ['wealth', 'poverty'], ['privileged', 'disadvantaged'], ['white_collar', 'blue_collar'], ['owner', 'employee'], ['upper', 'lower'], ['extravagant', 'basic'], ['opulent', 'modest'], ['surplus', 'deficit']]
equalize [['yacht', 'dinghy'], ['yachts', 'dinghies'], ['mansion', 'apartment'], ['mansions', 'apartments'], ['white_collar', 'blue_collar'], ['employer', 'employee'], ['employers', 'employees'], ['new_car', 'used_car'], ['lord', 'surf']]
socioeconomic specific 16 ['yacht', 'dinghy', 'yachts', 'dinghies', 'mansion', 'apartment', 'mansions', 'apartments', 'white_collar', 'blue_collar']
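
In the paper, the bias direction is identified from definitional pairs like those printed above: each pair is centered on its own midpoint, and the top principal component of the centered vectors is taken as the direction. The sketch below does this on toy vectors with a plain SVD. The helper `pair_direction` and the vectors are hypothetical, and whether `debias` computes its direction in exactly this way is not confirmed here.

```python
import numpy as np


def pair_direction(pairs):
    """Top principal direction of pair-centered vectors (unit length, sign arbitrary)."""
    rows = []
    for a, b in pairs:
        center = (a + b) / 2
        rows.append(a - center)
        rows.append(b - center)
    # The rows already sum to zero, so the first right singular vector
    # is the first principal component.
    _, _, vt = np.linalg.svd(np.array(rows), full_matrices=False)
    return vt[0]


# Toy pairs that differ mostly along the first axis.
toy_definitional = [
    (np.array([0.9, 0.1, 0.3]), np.array([-0.8, 0.2, 0.3])),
    (np.array([0.7, -0.2, 0.5]), np.array([-0.9, -0.1, 0.4])),
    (np.array([1.0, 0.4, -0.2]), np.array([-0.6, 0.3, -0.1])),
]

toy_bias_direction = pair_direction(toy_definitional)
print(np.round(np.abs(toy_bias_direction), 2))
```

Using several pairs rather than a single difference such as wealthy - impoverished makes the estimated direction less sensitive to the quirks of any one word.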


In [114]:
debias(E_se, se_specific_words, se_defs, se_equalize_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
{('new_car', 'used_car'), ('Mansion', 'Apartment'), ('employer', 'employee'), ('white_collar', 'blue_collar'), ('MANSION', 'APARTMENT'), ('YACHTS', 'DINGHIES'), ('New_Car', 'Used_Car'), ('EMPLOYER', 'EMPLOYEE'), ('lord', 'surf'), ('LORD', 'SURF'), ('MANSIONS', 'APARTMENTS'), ('yachts', 'dinghies'), ('yacht', 'dinghy'), ('Yacht', 'Dinghy'), ('EMPLOYERS', 'EMPLOYEES'), ('NEW_CAR', 'USED_CAR'), ('mansion', 'apartment'), ('Lord', 'Surf'), ('WHITE_COLLAR', 'BLUE_COLLAR'), ('employers', 'employees'), ('YACHT', 'DINGHY'), ('mansions', 'apartments'), ('Employer', 'Employee'), ('Mansions', 'Apartments'), ('White_Collar', 'Blue_Collar'), ('Employers', 'Employees'), ('Yachts', 'Dinghies')}
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine


### Step 5: Analyzing socioeconomic bias in professions post debiasing

In [115]:
# profession analysis socioeconomic status
sp_se_debiased = sorted([(E_se.v(w).dot(v_se), w) for w in profession_words])

sp_se_debiased[0:20], sp_se_debiased[-20:]

([(-0.16797233, 'employee'),
  (-0.13171333, 'artiste'),
  (-0.11455773, 'lyricist'),
  (-0.110702515, 'performer'),
  (-0.10799282, 'envoy'),
  (-0.10408275, 'entertainer'),
  (-0.10313272, 'choreographer'),
  (-0.09875187, 'vocalist'),
  (-0.09643534, 'singer'),
  (-0.08758396, 'protege'),
  (-0.08690682, 'missionary'),
  (-0.08083978, 'maestro'),
  (-0.07951636, 'soloist'),
  (-0.0752451, 'dancer'),
  (-0.07424241, 'cinematographer'),
  (-0.07296255, 'assistant_professor'),
  (-0.072925545, 'minister'),
  (-0.07290186, 'composer'),
  (-0.07273125, 'warden'),
  (-0.071610466, 'understudy')],
 [(0.07854535, 'psychologist'),
  (0.07855992, 'fighter_pilot'),
  (0.07865458, 'businesswoman'),
  (0.08034091, 'handyman'),
  (0.08201269, 'pundit'),
  (0.082914785, 'broker'),
  (0.083283305, 'investment_banker'),
  (0.083394065, 'cab_driver'),
  (0.08365481, 'accountant'),
  (0.08433533, 'lawyer'),
  (0.08477233, 'fireman'),
  (0.09080547, 'landlord'),
  (0.092140645, 'financier'),
  (0.09338