# Download and prep lexicon

In this notebook, I download and prepare the extended lexicon created by [Nicolas et al. (2021)](https://onlinelibrary.wiley.com/doi/epdf/10.1002/ejsp.2724), in which words are annotated with either +1 or -1 to indicate a positive or negative association with a given sub-dimension of the Stereotype Content Model.

### Table of Contents:
- [1. Download lexicon](#1-download-lexicon)
- [2. Prepare lexicon](#2-prepare-lexicon)
    - [2.1. Apply filters](#21-apply-filters)
    - [2.2. Reformat data frame](#22-reformat-data-frame)

In [2]:
# Import dependencies
import pandas as pd
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /Users/brienna/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/brienna/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [3]:
# Declare variables
LEXICON_RAW_PATH = '../data/raw/lexicon.csv' # raw lexicon that was downloaded from the paper
LEXICON_INPUT_PATH = '../data/input/lexicon.csv' # processed lexicon that is ready for use

## 1. Download lexicon



In [7]:
# Download lexicon
scm_df = pd.read_csv('https://osf.io/m9nb5/download')
scm_df.head()

Unnamed: 0,original word,preprocessed word 1 (lacksknowledge to NA),preprocessed word 2 (no spaces),preprocessed word 3 (lemmatized),preprocessed word 4 (minus one trailing s),Positive valence,Negative valence,Neutral valence,Sum,Sociability dictionary,...,body_covering dictionary,beauty dictionary,beauty direction,insults dictionary,stem dictionary,humanities dictionary,art dictionary,social_groups dictionary,Lacks_knowledge dictionary,fortune dictionary
0,aa,aa,aa,aa,aa,0.0,0.0,1.0,1,0,...,0,0,,0,0,0,0,0,0,0
1,ab,ab,ab,ab,ab,0.0,0.0,1.0,1,0,...,0,0,,0,0,0,0,0,0,0
2,aba,aba,aba,aba,aba,0.0,0.0,1.0,1,0,...,0,0,,0,0,0,0,0,0,0
3,abandon,abandon,abandon,abandon,abandon,0.06,0.38,0.56,2,0,...,0,0,,0,0,0,0,0,0,0
4,abandoned infant,abandoned infant,abandonedinfant,abandonedinfant,abandonedinfant,0.0,0.0,1.0,1,0,...,0,0,,0,0,0,0,0,0,0


In [8]:
# Persist locally
scm_df.to_csv(LEXICON_RAW_PATH, index=False)
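
Writing to `LEXICON_RAW_PATH` assumes that the `../data/raw/` directory already exists, since `to_csv` does not create missing directories. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing, then return the path.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Possible usage in this notebook (assumes scm_df and LEXICON_RAW_PATH are defined):
# scm_df.to_csv(ensure_parent_dir(LEXICON_RAW_PATH), index=False)
```

The same guard would apply to `LEXICON_INPUT_PATH` before the processed lexicon is saved.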

## 2. Prepare lexicon

### 2.1. Apply filters 

Here, I apply three filters:
- Remove attributes that are not annotated (+1 or -1) in any of the 4 sub-dimensions (Morality, Sociability, Ability, Agency).
- Remove attributes that are not adjectives.
    - Our selected template sentences only work with adjectives.
- Remove multi-word attributes.
    - I originally added this filter to limit the variation between any two attributes. This way, if BERT responds differently to them, we can be more confident that the difference isn't a side effect of something such as sentence length. However, it turns out that keeping only adjectives already removes every multi-word attribute, so this filter currently changes nothing.
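
One likely reason the multi-word filter ends up being a no-op: `is_target_pos` in this notebook compares the raw word with the first component of each WordNet synset name, and synset names join multi-word lemmas with underscores rather than spaces. The sketch below reproduces only that string comparison. The helper `head_lemma_matches` is hypothetical, and the name strings are illustrative examples in WordNet's `lemma.pos.nn` format rather than results of a real lookup.

```python
def head_lemma_matches(word, synset_name):
    """Mirror the comparison in is_target_pos: first name component vs. the raw word."""
    return synset_name.split('.')[0] == word


# Illustrative synset-name strings (not looked up in WordNet here)
single_word = head_lemma_matches('able', 'able.a.01')
multi_word = head_lemma_matches('ice cream', 'ice_cream.n.01')

print(single_word)  # a single word can equal the head lemma
print(multi_word)   # a space never equals an underscore, so this cannot match
```

Replacing spaces with underscores before the lookup would change which attributes survive, so the counts reported in this notebook would no longer apply.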

In [28]:
# Read in lexicon again
scm_df = pd.read_csv(LEXICON_RAW_PATH)
print('Attributes:', len(scm_df))

Attributes: 14449


In [29]:
# Remove attributes that have no +1/-1 annotation in any of the 4 sub-dimensions
direction_cols = ['Morality direction', 'Sociability direction', 'Ability direction', 'Agency direction']
mask_any_dimension = scm_df[direction_cols].isin([-1.0, 1.0]).any(axis=1)
scm_df = scm_df.loc[mask_any_dimension, :].reset_index(drop=True)
print('Attributes:', len(scm_df))

Attributes: 4630


In [30]:
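# Remove attributes that are not adjectives
def is_target_pos(word, target_pos='a'):
    """Return True if WordNet has a synset named after `word` with part of speech `target_pos`."""
    pos_list = []
    for synset in wn.synsets(word):
        # Synset names look like 'able.a.01'; keep only synsets whose head lemma is the word itself
        synset_components = synset.name().split('.')
        if synset_components[0] == word:
            pos_list.append(synset.pos())

    return target_pos in pos_list

# WordNet tags adjectives as 'a' (adjective) or 's' (satellite adjective)
scm_df = scm_df[scm_df['original word'].apply(lambda w: is_target_pos(w, 'a') or is_target_pos(w, 's'))].reset_index(drop=True)
print('Attributes:', len(scm_df))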

Attributes: 1256


In [31]:
# Remove multi-word attributes
mask_one_word = (scm_df['original word'].str.split(' ').apply(len) == 1)
scm_df = scm_df.loc[mask_one_word,:].reset_index(drop=True)
print('Attributes:', len(scm_df))

Attributes: 1256


This leaves 1,256 attributes in this lexicon.

### 2.2. Reformat data frame

In [32]:
# Rename "original word" column, drop some other columns
scm_df = scm_df.rename(columns={'original word': 'Attribute'})
scm_df = scm_df[['Attribute', 'Ability direction', 'Morality direction', 'Sociability direction', 'Agency direction']]

In [33]:
# Reformat columns
direction_cols = ['Sociability direction', 'Morality direction', 'Agency direction', 'Ability direction']
dummies_df = pd.get_dummies(scm_df, columns=direction_cols)
dummies_df = dummies_df.rename(columns={
    'Sociability direction_-1.0': 'Unsociable',
    'Sociability direction_1.0': 'Sociable',
    'Morality direction_-1.0': 'Immoral',
    'Morality direction_1.0': 'Moral',
    'Agency direction_-1.0': 'Dependent',
    'Agency direction_1.0': 'Independent',
    'Ability direction_-1.0': 'Unable',
    'Ability direction_1.0': 'Able',
})
# Drop the indicator columns for "no direction" (0.0)
scm_df = dummies_df.drop([f'{col}_0.0' for col in direction_cols], axis=1)
scm_df.head()

Unnamed: 0,Attribute,Unsociable,Sociable,Immoral,Moral,Dependent,Independent,Unable,Able
0,aberrant,False,False,True,False,False,False,False,False
1,abhorrent,False,False,True,False,False,False,False,False
2,abject,False,False,False,False,True,False,False,False
3,able,False,False,False,False,False,False,False,True
4,abnormal,False,False,True,False,False,False,False,False
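
The `get_dummies` step relies on every direction column containing all three values (-1.0, 0.0, 1.0); if one value were absent, the corresponding rename or drop would not find its column. As a hedged alternative, the same wide format can be built with explicit comparisons. The sketch below runs on a toy frame whose three rows copy the labels shown in the output above; `toy_df`, `LABELS`, and `wide_df` are names introduced here for illustration, and the 0.0 entries are assumed.

```python
import pandas as pd

# Toy rows shaped like the filtered lexicon; 0.0 entries are assumed for illustration
toy_df = pd.DataFrame({
    'Attribute': ['aberrant', 'abject', 'able'],
    'Ability direction': [0.0, 0.0, 1.0],
    'Morality direction': [-1.0, 0.0, 0.0],
    'Sociability direction': [0.0, 0.0, 0.0],
    'Agency direction': [0.0, -1.0, 0.0],
})

# Map each direction column to its (negative, positive) label pair
LABELS = {
    'Sociability direction': ('Unsociable', 'Sociable'),
    'Morality direction': ('Immoral', 'Moral'),
    'Agency direction': ('Dependent', 'Independent'),
    'Ability direction': ('Unable', 'Able'),
}

wide_df = toy_df[['Attribute']].copy()
for column, (negative, positive) in LABELS.items():
    # Each label becomes a boolean column, independent of which values occur in the data
    wide_df[negative] = toy_df[column] == -1.0
    wide_df[positive] = toy_df[column] == 1.0

print(wide_df)
```

This variant always produces the same eight label columns in the same order, even when a sub-dimension has no negative or no positive entries.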


In [34]:
# Save processed lexicon
scm_df.to_csv(LEXICON_INPUT_PATH, index=False)