# Topic Searching with BERT

In [1]:
import pandas as pd
import numpy as np
import torch
import string
from scipy.spatial.distance import cosine

In [2]:
# Import BERT model
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

In [3]:
#Load survey data
path = "../data/surveys_clean.csv"
df = pd.read_csv(path, na_filter=False)

## Method 1: String-Matching Keyword Search

Build a list of keywords relating to a topic of interest, and return any responses that contain any of these keywords. For our purposes, we wish to find responses that discuss race relations, so we fill our keyword list with related words found in survey responses.

In [4]:
def contains_key(df, col, keys):
    """Check whether each string in a column contains any keyword.

    :param df: a DataFrame object
    :param col: name of the column to apply the check to
    :type col: string
    :param keys: keywords to search for (substring match)
    :type keys: list of strings
    :rtype: pandas Series of bool
    :return: True for rows that contain any keyword, False otherwise
    """

    return df[col].apply(lambda x: any(k in x for k in keys))

In [5]:
keywords = [
    "negro",
    "negros",
    "color",
    "colored",
    "black",
    "blacks",
    "white",
    "whites",
    "race",
    "races",
    "racial"
]

df['about_race'] = contains_key(df, 'long', keywords)

Below is an example of a response the keyword search failed to pick up. 

In [6]:
display(df.iloc[[2936]])
print(df['long'][2936])

| Unnamed: 0 | ind_id | subject_id | image_name | image_name_2 | outfits | outfits_comment | long | racial_group | index | about_race |
|---|---|---|---|---|---|---|---|---|---|---|
| 2936 | 2920 | 15828618 | 2521127-12-0315.jpg | | | | i think that all colerd men should be train in... | black | 2937 | False |


i think that all colerd men should be train in the north they would get a better chance to make better solder


This highlights one of the drawbacks of the simple keyword method. Without prior knowledge of this specific response, one might not have guessed to add "colerd" to the list of keywords. Given a large corpus of raw text data, it is impractical to manually log all possible keyword spelling variations.
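To make the limitation concrete, here is a minimal sketch that applies the same substring check as `contains_key` to plain strings. The first sentence is the start of the response shown above. The second is a hypothetical sentence, added only to illustrate that substring matching can also produce false positives.

```python
# Plain-Python version of the substring check used in contains_key
keywords = ["negro", "color", "colored", "black", "white", "race", "racial"]

def contains_any(text, keys):
    """Return True if any keyword occurs as a substring of text."""
    return any(k in text for k in keys)

# From the survey response above: the misspelling "colerd" is missed
missed = "i think that all colerd men should be train in the north"

# Hypothetical sentence: "race" is a substring of "embrace"
spurious = "we should embrace the new training schedule"

print(contains_any(missed, keywords))    # False
print(contains_any(spurious, keywords))  # True
```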

## Method 2: BERT-Contextualized Keyword Search

The idea here is simple. We use BERT to get an embedding for each keyword and each word in a given response. Then, we check whether any keyword/response word pairs have similar embeddings. If this similarity is above some threshold, we consider it a match.
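As a rough illustration of the matching rule, the sketch below computes cosine similarity on toy vectors and compares it to a threshold. The vectors and the helper names `cosine_similarity` and `is_match` are hypothetical stand-ins, not part of the notebook; real `bert-base-uncased` token vectors have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(token_vec, key_vec, thresh):
    """True if the token is at least `thresh` similar to the keyword."""
    return cosine_similarity(token_vec, key_vec) >= thresh

# Toy 2-dimensional vectors (hypothetical values)
key_vec = [3.0, 4.0]
similar_token = [4.0, 3.0]     # points in nearly the same direction
unrelated_token = [4.0, -3.0]  # orthogonal to key_vec

print(is_match(similar_token, key_vec, 0.5))    # True
print(is_match(unrelated_token, key_vec, 0.5))  # False
```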

First, we borrow a function for returning the embeddings of each token in a given text. [1]

In [7]:
def get_token_embeddings(text):
    
    # Tokenize the text
    split_text = text.split(". ")
    marked_text = "[CLS] " + " [SEP] ".join(split_text) + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)[:512] # Truncate if longer than 512
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Mark tokens belonging to a sentence
    segment_ids = [0]*len(tokenized_text)
    is_zero = True
    for i in range(len(tokenized_text)):
        segment_ids[i] = 0 if is_zero else 1
        if tokenized_text[i] == "[SEP]":
            is_zero = not is_zero

    # Convert to torch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segment_ids])

    # Run through BERT
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        hidden_states = outputs[2]

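    # Stack the per-layer hidden states into one tensor of shape
    # [layers, batch, tokens, features], drop the batch dimension,
    # and reorder to [tokens, layers, features]
    token_embeddings = torch.stack(hidden_states, dim=0)
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    token_embeddings = token_embeddings.permute(1,0,2)

    # Represent each token by the sum of its last four hidden layers
    token_vecs_sum = []
    for token in token_embeddings:
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)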
    
    return token_vecs_sum

Note that the tokenizer adds special \[CLS\] and \[SEP\] tokens and may split certain words into multiple tokens. As such, the embedding corresponding to your keyword may not be at the index you expect.

For example, say we want to use "white" as a keyword. The word "white" alone may be interpreted in multiple ways, so we will give it context by using it in a short phrase, "the white man".

In [8]:
def tokenize(text):
    split_text = text.split(". ")
    marked_text = "[CLS] " + " [SEP] ".join(split_text) + " [SEP]"
    return tokenizer.tokenize(marked_text)  

In [9]:
print(tokenize("the white man"))

['[CLS]', 'the', 'white', 'man', '[SEP]']


When we pass this phrase into the tokenizer, we see that our keyword is at index 2. Therefore, the embedding we want will be at index 2 of the list returned by **get_token_embeddings**.
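Instead of counting positions by hand, the index could be looked up from the token list. The helper below, `find_token_index`, is a hypothetical convenience function rather than part of the notebook. It only finds keywords that survive tokenization as a single token, and the word-piece split in the second call is made up for illustration.

```python
def find_token_index(tokens, keyword):
    """Return the index of the first token equal to keyword, or None.

    None also covers the case where the tokenizer split the keyword
    into several word pieces, so no single token equals it.
    """
    for i, tok in enumerate(tokens):
        if tok == keyword:
            return i
    return None

# Token list printed by tokenize("the white man") above
tokens = ['[CLS]', 'the', 'white', 'man', '[SEP]']
print(find_token_index(tokens, "white"))  # 2

# Hypothetical word-piece split of a misspelled word
pieces = ['[CLS]', 'col', '##er', '##d', '[SEP]']
print(find_token_index(pieces, "colerd"))  # None
```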

In [10]:
keys = [
    {"text": "negro soldier", "idx": 1, "embed": None},
    {"text": "the white man", "idx": 2, "embed": None}
]

for k in keys:
    embed = get_token_embeddings(k['text'])
    k['embed'] = embed[k['idx']]

Now, we simply compare our key token embeddings to the token embeddings of each response. If a response contains a token with a high-enough* similarity to a key token, label it as "topic_race". 

<small>\* Begin with an arbitrary threshold for similarity, and adjust as needed. Consider checking the cosine similarity scores to get an idea of what scores are high/low. Also print which keywords "matched" to determine if the threshold is too high/low.</small>

In [12]:
# df: Dataframe
# column: column to consider for labeling
# label: name of column to store results in
# keys: keyword dictionary
# thresh: similarity threshold
def label_topic(df, column, label, keys, thresh):

    # Initialize/Reset column
    df[label] = 0
    df[label+"_score"] = float("-inf")
    
    # Track tokens that matched to keywords
    token_matches = []
    
    # Search
    for i in range(len(df)):
        best_sim = float("-inf")
        embed = get_token_embeddings(df[column][i])
        for j in range(len(embed)):
            for k in keys:
                sim = 1 - cosine(embed[j], k['embed'])
                if sim > best_sim:
                    best_sim = sim
                if sim >= thresh:
                    df.at[i, label] += 1
                    
                    # Get the token that matched to a keyword
                    split_text = df[column][i].split(". ")
                    marked_text = "[CLS] " + " [SEP] ".join(split_text) + " [SEP]"
                    tokenized_text = tokenizer.tokenize(marked_text)
                    token_matches.append((tokenized_text[j], k['text']))
                    
                    # Count each token at most once, even if it matches several keys
                    break
        
        df.at[i, label+"_score"] = best_sim
        
        # Track progress
        if i%100==0:
            print(i,"/",len(df))
                    
    return token_matches
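Before settling on a value for `thresh`, it can help to see how many responses each candidate threshold would label. The sketch below does this for a list of hypothetical best-similarity scores. In the notebook, the corresponding values would be those stored in the `label + "_score"` column, and `count_above` is an illustrative helper, not an existing function.

```python
def count_above(scores, thresholds):
    """Map each threshold to the number of scores at or above it."""
    return {t: sum(s >= t for s in scores) for t in thresholds}

# Hypothetical best-similarity scores, one per response
scores = [0.12, 0.35, 0.48, 0.52, 0.67, 0.91]

for thresh, n in count_above(scores, [0.3, 0.5, 0.7]).items():
    print(f"thresh={thresh}: {n} of {len(scores)} responses labeled")
```

A sharp drop in the count between two neighboring thresholds is a reasonable place to inspect the matched tokens by hand.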

In [16]:
# Labeling is slow; uncomment to (re)compute token_matches
#token_matches = label_topic(df, 'long', 'topic_race', keys, 0.5)
print(set(token_matches))

{('##o', 'negro soldier'), ('northern', 'negro soldier'), ('negro', 'negro soldier'), ('crow', 'negro soldier'), ('##ented', 'negro soldier'), ('chinese', 'negro soldier'), ('minorities', 'negro soldier'), ('union', 'negro soldier'), ('north', 'negro soldier'), ('##gger', 'negro soldier'), ('segregation', 'negro soldier'), ('chinese', 'the white man'), ('black', 'the white man'), ('mississippi', 'negro soldier'), ('southern', 'negro soldier'), ('poor', 'negro soldier'), ('african', 'negro soldier'), ('tough', 'negro soldier'), ('american', 'negro soldier'), ('slavery', 'negro soldier'), ('colored', 'negro soldier'), ('lynch', 'negro soldier'), ('white', 'the white man'), ('men', 'negro soldier'), ('america', 'negro soldier'), ('races', 'negro soldier'), ('people', 'negro soldier'), ('americans', 'negro soldier'), ('##ial', 'negro soldier'), ('peoples', 'negro soldier'), ('racial', 'negro soldier'), ('color', 'the white man'), ('whites', 'the white man'), ('blacks', 'negro soldier'), ('

Below are responses labeled as "topic_race" which did not contain a keyword. A simple keyword search using the same keywords would not pick up any of these responses.

In [14]:
for r in df['long'].loc[df['topic_race'] > 0]:
    if not any([k in r for k in keywords]):
        print(r, "\n")

why is it when we a sick we sleep in the same ward use same toilet, eat out same dishes, but, when well enough to go out are separated in mess halls. why cant we have  refreshments etc? there is no separate fronts in africa etc for us to fight. why not equal rights for all in actuality and not on "paper" 

this war would be better with the southern soldier stay in the south. this war would be better if us northern soldiers could stay in the north. there would be not so many fight with the soldier. they do not feed us well in the army. the army is no life for me. 

i think the questions ask was very very good one concerning the camps and different personels. it was also a good idea for enlisted men to tell his opions of different think which he had a very good chance to express. 

the army has wasted a lot of money thru segregation and it has not help the morale of the soldier at all. 

as a whole i don't like the word that i hear often in this army camp that (nigger) i think any man ca

Next are responses that were not labeled as "topic_race" even though they contain a keyword. A simple keyword search using the same keywords would have picked up all of these responses. Skipping them is useful when a keyword is used with a different meaning, but some of them are false negatives. False negatives may be remedied by adding more key phrases (for example, several short sentences for the same keyword), by adjusting the similarity threshold, or perhaps by using a different threshold for each dictionary element.

In [15]:
for r in df['long'].loc[df['topic_race'] == 0]:
    if any([k in r for k in keywords]):
        print(r, "\n")

i dont like the army. i had rather be on the out side i dont think i have any thing to fight for. the white have all of the privedlages and they should do the fighting 

i highly approve of this questionnaire it gives me an opportunity to express my views. i firmly believe the whole cause of  is due to lack of intelligence and understanding. avoiding a problem never remedies it. we are all americans no matter what color the supreme being or nature chose to make us. a man is only a man in spirit, body, and in blood. may god grant that we all as human beings soon realize that fact. thanks again for this opportunity. 

i have been asked to give my honest + frank opinion of the army and the war, its effect for the negro. so i have done that and feel very confident that the information i have given, the true and i still rely on god and his all mighty power. personally i like the army and have  in peace time soldier. lets all hope + pray for the best, i dont think there is no man who in his 
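One of the remedies mentioned above, a separate threshold per dictionary element, could be sketched as follows. The `"thresh"` field, the helper `matching_keys`, and the 2-dimensional toy embeddings are all assumptions made for illustration; the notebook's `label_topic` takes a single `thresh` argument.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each key carries its own threshold (hypothetical field and toy embeddings)
keys = [
    {"text": "negro soldier", "embed": [1.0, 0.0], "thresh": 0.5},
    {"text": "the white man", "embed": [0.0, 1.0], "thresh": 0.9},
]

def matching_keys(token_vec, keys):
    """Return the text of every key whose own threshold the token meets."""
    return [
        k["text"]
        for k in keys
        if cosine_similarity(token_vec, k["embed"]) >= k["thresh"]
    ]

# Toy token vector: similarity 0.6 to the first key, 0.8 to the second
token_vec = [0.6, 0.8]
print(matching_keys(token_vec, keys))  # ['negro soldier']
```

Here the token is closer to the second key, yet only the first key matches, because the second key was given a stricter threshold.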

### Searching Other Topics

Now let's use the same ideas to search for responses that talk about women/gender relations. We will use a few example phrases directly from the surveys for context.

In [17]:
keys = [
    {"text": "the white woman and the negro man", "idx": 3, "embed": None},
    {"text": "forcing the woman to go and leave children", "idx": 3, "embed": None},
    {"text": "run after the womens", "idx": 4, "embed": None}, #ok
    {"text": "take your mother out and hang her", "idx": 3, "embed": None},
]
    
for k in keys:
    embed = get_token_embeddings(k['text'])
    k['embed'] = embed[k['idx']]

In [19]:
token_matches = label_topic(df, 'long', 'topic_gender', keys, 0.6)
print(set(token_matches))

{('women', 'the white woman and the negro man'), ('father', 'take your mother out and hang her'), ('wife', 'forcing the woman to go and leave children'), ('men', 'forcing the woman to go and leave children'), ('mother', 'forcing the woman to go and leave children'), ('people', 'the white woman and the negro man'), ('girls', 'the white woman and the negro man'), ('girl', 'the white woman and the negro man'), ('parent', 'take your mother out and hang her'), ('wife', 'the white woman and the negro man'), ('sister', 'take your mother out and hang her'), ('man', 'the white woman and the negro man'), ('men', 'the white woman and the negro man'), ('mother', 'take your mother out and hang her'), ('man', 'forcing the woman to go and leave children'), ('woman', 'the white woman and the negro man'), ('women', 'run after the womens'), ('she', 'forcing the woman to go and leave children'), ('girl', 'forcing the woman to go and leave children')}


From the set of tokens that matched the key embeddings, we see that "woman" frequently matches "man", even in contexts that aren't necessarily about gender. The two words are clearly related, but we would like to discourage them from matching while preserving the rest of each word's meaning. To do this, we took inspiration from the vector arithmetic popularized by Word2Vec. Rather than subtracting one embedding from the other, we subtract the projection of the keyword embedding onto the "man" embedding, which leaves only the component orthogonal to "man".

In [20]:
keys = [
    #{"text": "forcing the woman to go and leave children", "idx": 3, "embed": None},
    #{"text": "respectable young women", "idx": 3, "embed": None},
    #{"text": "there is no color women", "idx": 5, "embed": None},
    #{"text": "run after the womens", "idx": 4, "embed": None}, #ok
    #{"text": "take your mother out and hang her", "idx": 3, "embed": None},
    {"text": "woman", "idx": 1, "embed": None}
]

man_embed = get_token_embeddings("man")[1]
for k in keys:
    embed = get_token_embeddings(k['text'])
    key_embed = embed[k['idx']]
    # Subtract the projection of the key embedding onto the "man" embedding
    k['embed'] = key_embed - (torch.dot(key_embed, man_embed) / torch.dot(man_embed, man_embed)) * man_embed
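The projection step in the cell above can be checked on toy vectors. The sketch below uses hypothetical 2-dimensional stand-ins for the "woman" and "man" embeddings and plain Python lists instead of tensors. It shows that the adjusted vector has no remaining component along "man", so its dot product (and therefore its cosine similarity) with "man" is zero.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def remove_projection(v, u):
    """Return v minus its projection onto u (the part of v orthogonal to u)."""
    scale = dot(v, u) / dot(u, u)
    return [x - scale * y for x, y in zip(v, u)]

# Toy stand-ins for the two embeddings (hypothetical values)
woman = [2.0, 1.0]
man = [1.0, 0.0]

adjusted = remove_projection(woman, man)
print(adjusted)            # [0.0, 1.0]
print(dot(adjusted, man))  # 0.0
```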

In [23]:
token_matches = label_topic(df, 'long', 'topic_gender', keys, 0.3)
print(set(token_matches))

{('women', 'woman'), ('girl', 'woman'), ('ladies', 'woman'), ('marriage', 'woman'), ('with', 'woman'), ('civilians', 'woman'), ('sister', 'woman'), ('men', 'woman'), ('her', 'woman'), ('wives', 'woman'), ('mother', 'woman'), ('wife', 'woman'), ('girls', 'woman'), ('married', 'woman'), ('she', 'woman'), ('woman', 'woman')}


Finally, we'll save our results to a new file to avoid having to refilter later.

In [22]:
df.to_csv("../data/surveys_clean_filtered.csv")

## Bibliography

[1] Chris McCormick and Nick Ryan. (2019, May 14). BERT Word Embeddings Tutorial. Retrieved from http://www.mccormickml.com