The tweets in Founta et al. have been classified using the model from Blodgett et al., which assigns each tweet a vector of probabilities, one for each of four racialized language models. To use this information in the STM, I need to convert these probabilities into discrete values.

Note that Blodgett et al. warn against using the other and Asian categories, so for the topic model I should discard tweets whose highest-probability category is neither white nor African-American.
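
A minimal sketch of that filtering step, assuming the four probability columns are named afam, hisp, asian and white as in the annotated file. The helper name keep_white_or_afam and the toy rows are hypothetical and only illustrate the idea; they are not part of the analysis itself.

```python
import pandas as pd

RACE_COLS = ['afam', 'hisp', 'asian', 'white']

def keep_white_or_afam(frame):
    """Hypothetical helper: keep rows whose top category is white or afam."""
    # idxmax(axis=1) returns, per row, the column name with the largest value
    top = frame[RACE_COLS].idxmax(axis=1)
    return frame[top.isin(['white', 'afam'])]

# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'afam':  [0.70, 0.10, 0.20],
    'hisp':  [0.10, 0.60, 0.10],
    'asian': [0.10, 0.20, 0.10],
    'white': [0.10, 0.10, 0.60],
})
kept = keep_white_or_afam(toy)
```

Here the second toy row is dropped because its most probable category is hisp. A probability threshold could be combined with this filter, since the top category alone says nothing about how confident the assignment is.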

In [60]:
import pandas as pd

In [61]:
df = pd.read_csv('data/founta_race_annotated.csv')

In [62]:
df.shape

(99996, 6)

In [63]:
df.head()


Unnamed: 0,text,label,afam,hisp,asian,white
0,Beats by Dr. Dre urBeats Wired In-Ear Headphon...,spam,0.379062,0.222205,0.193619,0.205114
1,RT @Papapishu: Man it would fucking rule if we...,abusive,0.187467,0.187928,0.118104,0.506501
2,It is time to draw close to Him &#128591;&#127...,normal,0.182347,0.458019,0.077331,0.282302
3,if you notice me start to act different or dis...,normal,0.466657,0.331978,0.007351,0.194014
4,"Forget unfollowers, I believe in growing. 7 ne...",normal,0.106735,0.186906,0.089628,0.616732


Note that some rows have values of -9 in the probability columns. This indicates that no words remained after tokenization with the Blodgett et al. algorithm. These rows should be dropped from the analysis.
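
One way to drop those rows could look like the sketch below. It assumes the -9 sentinel may appear in any of the four probability columns, whereas the notebook itself only checks afam. The helper name drop_untokenized and the toy rows are hypothetical.

```python
import pandas as pd

RACE_COLS = ['afam', 'hisp', 'asian', 'white']
MISSING = -9  # sentinel for tweets with no words left after tokenization

def drop_untokenized(frame):
    """Hypothetical helper: remove rows where any probability is the sentinel."""
    has_sentinel = (frame[RACE_COLS] == MISSING).any(axis=1)
    return frame[~has_sentinel]

# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'text':  ['a', 'b', 'c'],
    'afam':  [0.5, -9, 0.2],
    'hisp':  [0.2, -9, 0.1],
    'asian': [0.1, -9, 0.1],
    'white': [0.2, -9, 0.6],
})
clean = drop_untokenized(toy)
```

Applying a filter like this before computing the highest-probability category avoids labelling sentinel rows with whichever column happens to come first.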

In [64]:
def getHighestProbAndThreshold(df, threshold=.6):
    """Add the most probable category, its probability, and a threshold flag."""
    race_cats = ['afam', 'hisp', 'asian', 'white']
    probs = df[race_cats].astype(float)
    # On ties idxmax returns the first column, matching list.index().
    # Rows where every value is -9 are therefore labelled 'afam' and
    # must be filtered out separately.
    df['max_race'] = probs.idxmax(axis=1)
    df['max_vals'] = probs.max(axis=1)
    df['threshold_met'] = df['max_vals'] >= threshold
    return df

In [65]:
df = getHighestProbAndThreshold(df)

In [66]:
df.tail()

Unnamed: 0,text,label,afam,hisp,asian,white,max_race,max_vals,threshold_met
99991,RT @shangros: my fucking queen https://t.co/wa...,abusive,0.216189,0.305741,0.289655,0.188415,hisp,0.305741,False
99992,#Osteporosis treated with #PEMF - rebuild bone...,normal,0.218286,0.215283,0.183034,0.383396,white,0.383396,False
99993,@LGUSAMobile why does my phone screen keeps fl...,normal,0.28719,0.459262,0.00852,0.245029,hisp,0.459262,False
99994,#bigdata vs. #reality ... but equally applies ...,normal,0.09839,0.141443,0.219687,0.54048,white,0.54048,False
99995,"you can do whatever you choose, if you first g...",normal,0.082188,0.306929,0.003522,0.607361,white,0.607361,True


In [67]:
from collections import Counter
Counter(df.query('label != "spam" and afam != -9')['max_race'])

Counter({'white': 53765, 'hisp': 11261, 'afam': 12348, 'asian': 8334})

In [68]:
df.shape

(99996, 9)

In [69]:
df.query('threshold_met').shape

(22134, 9)

In [70]:
df.query('afam == -9').shape

(514, 9)

In [71]:
df.to_csv('data/founta_race_annotated_2.csv')