# User Flair Analysis

(no longer valid due to lack of data)

## Method

### Flair data
1. DONE - Q: Does r/asianamerican have flair templates? A: No, it is only customizable.
2. Decipher each user's ethnicity (Chinese, Japanese, Korean, Vietnamese, other) from the flair text
   - How should we deal with multi-ethnic flairs (Chinese-Thai, Korean/Black, etc.)?
   - Most of the comments with flair text come from a small subset of flaired users who have commented numerous times
3. DONE - Flairs can contain up to 10 emojis, so can we use emojis to decipher ethnicity? A: I don't think we need emoji data; emojis don't seem to be used very much.
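One possible way to handle the multi-ethnic flair question from step 2 is to let a flair map to a set of labels rather than a single one. The sketch below is only an illustration: the `ETHNICITY_KEYWORDS` table and the `classify_flair` helper are hypothetical names, the `'viet'` keyword is my own assumption, and short substrings such as `'abc'` or `'jap'` can produce false positives on unrelated flairs.

```python
# Hypothetical keyword table, adapted from the substrings used in this notebook.
# 'viet' is an assumed keyword; it does not come from the notebook.
ETHNICITY_KEYWORDS = {
    "Chinese": ("china", "chines", "abc"),
    "Korean": ("kor", "abk", "gyopo", "hanguk"),
    "Japanese": ("jap", "abj"),
    "Vietnamese": ("viet",),
}


def classify_flair(flair_text):
    """Return the set of ethnicity labels whose keywords occur in the flair."""
    if not isinstance(flair_text, str):
        return set()  # missing flair (None or NaN) carries no label
    text = flair_text.lower()
    return {
        label
        for label, keywords in ETHNICITY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


# A multi-ethnic flair simply yields more than one label.
print(classify_flair("Chinese-Thai"))
print(classify_flair("Chinese/Korean"))
print(classify_flair(None))
```

Returning a set keeps the decision about multi-ethnic users (count them in every group, or in a separate "mixed" group) out of the matching step.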

#### Flair data EDA

In [1]:
import pandas as pd
import numpy as np

In [2]:
comments_df = pd.read_csv('../data/top_100_post_comments_user_flair.txt', header=None, names=['username', 'flair_text', 'body'])

In [6]:
print(comments_df.shape)
comments_df.head(10)

(3623, 3)


Unnamed: 0,username,flair_text,body
0,Tungsten_,,Thanks to everyone who engaged in insightful a...
1,ProudBlackMatt,Chinese-American,I would prefer using a process that takes into...
2,TomatoCanned,,"u/Tungsten_, Thanks for creating a section jus..."
3,bad-fengshui,,As with anything related to Asians in politics...
4,Pancake_muncher,,Yet colleges will allow alumni and doners in e...
5,suberry,,I just hated Affirmative Action as a distracti...
6,Puzzled-Painter3301,,My own feeling is that I was never in love wit...
7,e9967780,,Anti Asian racism whether against East Asians ...
8,,,Can we overturn legacy and athlete admissions ...
9,OkartoIceCream,,"I want to remind people that in California, on..."


1. How many comments have flair text? A: Of 3623 rows, 3085 do NOT have flairs and 538 do.
   - Seems like we could use more data... but I'm not sure if there is more to collect
2. How many comments are by users with Chinese/Chinese-American flairs?

In [7]:
print(comments_df.isnull().sum())

username       833
flair_text    3085
body             0
dtype: int64
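The flair counts above, and the earlier observation that a few users write most of the flaired comments, can be checked directly with pandas. This is a minimal sketch on a made-up toy frame (`toy_df`, its usernames, and its flairs are all hypothetical), not a rerun of the real data.

```python
import pandas as pd

# Toy stand-in for comments_df; usernames and flairs are made up.
toy_df = pd.DataFrame({
    "username": ["a", "a", "a", "b", "c", None],
    "flair_text": ["Chinese-American", "Chinese-American", "Chinese-American",
                   "Korean", None, None],
    "body": ["x"] * 6,
})

has_flair = toy_df["flair_text"].notna()
n_flaired = int(has_flair.sum())
n_unflaired = int((~has_flair).sum())

# Comments per flaired user, most active first (missing usernames are dropped).
per_user = toy_df.loc[has_flair, "username"].value_counts()

# Share of flaired comments written by the single most active flaired user.
top_share = per_user.iloc[0] / n_flaired

print(n_flaired, n_unflaired)
print(per_user)
print(top_share)
```

On the real `comments_df`, a large `top_share` would support the concern that the flaired comments reflect only a handful of voices.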


In [3]:
# use find() to search the array of flair texts -- make the flair texts lowercase first
# substrings to find:
# Chinese: 'china', 'chines', 'abc'
# Korean: 'korea', 'kor', 'abk', 'gyopo'
# Japanese: 'jap', 'abj'
# Filipino: 'filip', 'philippi', 'pinoy', 'abf', 'abp'
# Indian: 'indian', 'abi'
# South Asian: 'desi', 'south asia'

# Series of flair_text
flair_text = comments_df['flair_text']

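# fillna(0) swaps NaN for the integer 0, but .str.lower() returns NaN for
# non-string values, so comments without flair are still NaN in flair_text_clean
flair_text_nona = flair_text.fillna(0)
flair_text_clean = flair_text_nona.str.lower()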

#### Chinese flairs

In [4]:
# empty matrix to hold indices of substring
chine_matrix = np.empty((flair_text_clean.shape[0],3))

# each column is for a different type of identifying substring
chine_matrix[:,0] = flair_text_clean.str.find('china')
chine_matrix[:,1] = flair_text_clean.str.find('chines')
chine_matrix[:,2] = flair_text_clean.str.find('abc')

In [5]:
print(chine_matrix)
# row of nan is comment w/o flair

[[nan nan nan]
 [-1.  0. -1.]
 [nan nan nan]
 ...
 [-1. -1. -1.]
 [nan nan nan]
 [-1. -1. -1.]]


In [6]:
# change nan to -1 (no substring found)
chine_matrix_clean = np.nan_to_num(chine_matrix, nan=-1)
print(chine_matrix_clean)

[[-1. -1. -1.]
 [-1.  0. -1.]
 [-1. -1. -1.]
 ...
 [-1. -1. -1.]
 [-1. -1. -1.]
 [-1. -1. -1.]]


In [8]:
# identify rows with one of the keywords (has value other than -1)
print(chine_matrix_clean != -1)
chine_rows = (chine_matrix_clean != -1).any(axis=1)

[[False False False]
 [False  True False]
 [False False False]
 ...
 [False False False]
 [False False False]
 [False False False]]


In [9]:
print(chine_rows.shape)
print(chine_rows.sum()) #97 comments of 3623 have Chinese flair


(3623,)
97


In [11]:
chi_comments_df = comments_df[chine_rows]
num_unique_users = len(pd.unique(chi_comments_df['username']))

print(f'Num of unique users with Chinese flair: {num_unique_users}')

Num of unique users with Chinese flair: 16


Chinese flair summary:
- 97 comments with Chinese flair
- 16 unique users with Chinese flair

#### Korean flairs

- Korean substrings: 'kor', 'abk', 'gyopo', 'hanguk'


In [12]:
# empty matrix to hold korean substring indices
kor_matrix = np.empty((flair_text_clean.shape[0],4))

kor_matrix[:,0] = flair_text_clean.str.find('kor')
kor_matrix[:,1] = flair_text_clean.str.find('abk')
kor_matrix[:,2] = flair_text_clean.str.find('gyopo')
kor_matrix[:,3] = flair_text_clean.str.find('hanguk')

In [13]:
# change nan (no flair) to -1 (no substring found)
kor_matrix_clean = np.nan_to_num(kor_matrix, nan=-1)
print(kor_matrix_clean)

[[-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 ...
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]]


In [14]:
# identify rows with one of the keywords (has value other than -1)
print(kor_matrix_clean != -1)
kor_rows = (kor_matrix_clean != -1).any(axis=1)
print(kor_rows)

[[False False False False]
 [False False False False]
 [False False False False]
 ...
 [False False False False]
 [False False False False]
 [False False False False]]
[False False False ... False False False]


In [15]:
print(kor_rows.shape)
print(kor_rows.sum()) # 12 comments of 3623 have Korean flair

# get indexes of Korean flair comments
#kor_idx = np.where(kor_rows==1)[0]
kor_comments_df = comments_df[kor_rows]

(3623,)
12


In [17]:
kor_comments_df = comments_df[kor_rows]
num_unique_kor_users = len(pd.unique(kor_comments_df['username']))
print(f'Num of unique users with Korean flair: {num_unique_kor_users}')

Num of unique users with Korean flair: 3


Korean flairs summary:
- 12 comments with Korean flair
- 3 unique users with Korean flair
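The Chinese and Korean sections repeat the same steps (find each substring, replace NaN, test for any match), so they could be folded into one helper. The sketch below is one possible refactor, assuming `Series.str.contains` with `na=False` is an acceptable replacement for the `find()` matrix; `flair_summary` and the toy frame are hypothetical names and data.

```python
import re

import pandas as pd


def flair_summary(df, keywords):
    """Count comments and unique users whose flair contains any keyword."""
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    # case=False ignores case; na=False treats missing flairs as non-matches.
    mask = df["flair_text"].str.contains(pattern, case=False, na=False, regex=True)
    matched = df[mask]
    # nunique() ignores missing usernames, unlike len(pd.unique(...)).
    return {
        "comments": int(mask.sum()),
        "unique_users": int(matched["username"].nunique()),
    }


# Toy stand-in for comments_df; usernames and flairs are made up.
toy_df = pd.DataFrame({
    "username": ["u1", "u1", "u2", "u3", None],
    "flair_text": ["Chinese-American", "Chinese-American", "ABC", "Korean", None],
})

print(flair_summary(toy_df, ["china", "chines", "abc"]))
print(flair_summary(toy_df, ["kor", "abk", "gyopo", "hanguk"]))
```

One difference to keep in mind: `len(pd.unique(...))` counts a missing username as its own "user", so the unique-user figures above could come out one lower with `nunique()` if any matching comment has no username.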

# Topic Modeling

In [15]:
import pandas as pd
import numpy as np

# cluster detection
import sklearn
import sklearn.feature_extraction.text
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.datasets
import sklearn.cluster
import sklearn.decomposition
import sklearn.metrics

import nltk # for collocations
import scipy #For hierarchical clustering and some visuals
#import scipy.cluster.hierarchy
import gensim #For topic modeling
import requests #For downloading our datasets
import matplotlib.pyplot as plt #For graphics
import matplotlib.cm #Still for graphics
import seaborn as sns #Makes the graphics look nicer

### 1. Get data

In [29]:
comments_df = pd.read_csv('../data/comments_df.csv')

Note: when read from the CSV, the normalized_tokens column holds strings, not lists

In [33]:
normalized_tokens = comments_df['normalized_tokens'][0]
print(type(normalized_tokens))

<class 'str'>


In [35]:
from ast import literal_eval

def converter(x):
    return literal_eval(x)

comments_df = pd.read_csv('../data/comments_df.csv', converters={'tokens_new':converter, 'normalized_tokens':converter})

In [37]:
normalized_tokens = comments_df['normalized_tokens'][0]
tokens_new = comments_df['tokens_new'][0]
print(type(normalized_tokens))
print(type(tokens_new))

<class 'list'>
<class 'list'>


Token columns are now lists.
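The reason the converter is needed: CSV has no list type, so a list column is written out as its string representation and comes back as text. The round trip below is a small standard-library sketch of that behaviour (the in-memory `buffer` and the sample tokens are made up for illustration).

```python
import csv
import io
from ast import literal_eval

# Write one row whose second field is a Python list; csv stores it as str(list).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "normalized_tokens"])
writer.writerow([0, ["thank", "engage", "insightful"]])

# Read it back: the field is now plain text that merely looks like a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["normalized_tokens"]

# literal_eval parses Python literals only, so it is safer than eval here.
tokens = literal_eval(raw)

print(type(raw), type(tokens))
```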

In [38]:
comments_df.head(10)

Unnamed: 0.1,Unnamed: 0,username,flair_text,body,tokens_new,normalized_tokens,normalized_tokens_count,word_count
0,0,Tungsten_,,Thanks to everyone who engaged in insightful a...,"[Thanks, to, everyone, who, engaged, in, insig...","[thank, engage, insightful, respectful, discou...",9,20
1,1,ProudBlackMatt,Chinese-American,I would prefer using a process that takes into...,"[I, would, prefer, using, a, process, that, ta...","[prefer, process, take, account, poverty, inst...",52,103
2,2,TomatoCanned,,"u/Tungsten_, Thanks for creating a section jus...","[u/Tungsten_,, Thanks, for, creating, a, secti...","[u/tungsten_,, thank, create, section, discuss...",126,269
3,3,bad-fengshui,,As with anything related to Asians in politics...,"[As, with, anything, related, to, Asians, in, ...","[relate, asians, politic, m, see, lot, non, as...",25,59
4,4,Pancake_muncher,,Yet colleges will allow alumni and doners in e...,"[Yet, colleges, will, allow, alumni, and, done...","[college, allow, alumnus, doner, easily, consi...",19,40
5,5,suberry,,I just hated Affirmative Action as a distracti...,"[I, just, hated, Affirmative, Action, as, a, d...","[hate, affirmative, action, distraction, banda...",78,171
6,6,Puzzled-Painter3301,,My own feeling is that I was never in love wit...,"[My, own, feeling, is, that, I, was, never, in...","[feeling, love, affirmative, action, possible,...",102,231
7,7,e9967780,,Anti Asian racism whether against East Asians ...,"[Anti, Asian, racism, whether, against, East, ...","[anti, asian, racism, east, asians, south, asi...",21,46
8,8,,,Can we overturn legacy and athlete admissions ...,"[Can, we, overturn, legacy, and, athlete, admi...","[overturn, legacy, athlete, admission, point, ...",15,29
9,9,OkartoIceCream,,"I want to remind people that in California, on...","[I, want, to, remind, people, that, in, Califo...","[want, remind, people, california, progressive...",104,200


Text has already been tokenized, lemmatized, and normalized.

### 1.1 Collocations

In [51]:
comments_df.tail(5)

Unnamed: 0.1,Unnamed: 0,username,flair_text,body,tokens_new,normalized_tokens,normalized_tokens_count,word_count,normalized_tokens_str
3618,3618,aduogetsatastegouda,,But that's irrelevant. The right not to be dis...,"[But, that, 's, irrelevant, The, right, not, t...","[irrelevant, right, discriminate, base, race, ...",38,84,irrelevant right discriminate base race subjec...
3619,3619,rentonwong,Support Asian-American Media!,"Despite my dislike of AA, at least 2/3rds of A...","[Despite, my, dislike, of, AA, at, least, 2/3r...","[despite, dislike, aa, 2/3rds, asian, american...",19,32,despite dislike aa 2/3rds asian americans base...
3620,3620,rentonwong,Support Asian-American Media!,> If 1/3 of a racial minority's members say th...,"[>, If, 1/3, of, a, racial, minority, 's, memb...","[>, racial, minority, member, want, discrimina...",27,61,> racial minority member want discriminate ove...
3621,3621,,,I'm just annoyed at how there's so much handwa...,"[I, 'm, just, annoyed, at, how, there, 's, so,...","[m, annoyed, handwaving, consequence, pro, aa,...",48,117,m annoyed handwaving consequence pro aa anti a...
3622,3622,rentonwong,Support Asian-American Media!,The current system as it stands preserves whil...,"[The, current, system, as, it, stands, preserv...","[current, system, stand, preserve, privilege, ...",49,102,current system stand preserve privilege play m...


In [59]:
bigrams = nltk.collocations.BigramCollocationFinder.from_words(comments_df['normalized_tokens'].sum())
# note: bigrams.N is the total number of tokens scanned, not the number of distinct bigrams
print(f'There are {bigrams.N} bigrams in the finder.')

There are 130955 bigrams in the finder.


In [61]:
# score each bigram by its raw count, so nbest() returns the most frequent bigrams
def bigramScoring(count, wordsTuple, total):
    return count
bigrams.nbest(bigramScoring, 50)

[('affirmative', 'action'),
 ('asian', 'americans'),
 ('asian', 'american'),
 ('white', 'people'),
 ('high', 'school'),
 ('college', 'admission'),
 ('race', 'base'),
 ('asian', 'student'),
 ('legacy', 'admission'),
 ('test', 'score'),
 ('ivy', 'league'),
 ('white', 'student'),
 ('high', 'education'),
 ('black', 'hispanic'),
 ('support', 'affirmative'),
 ('black', 'people'),
 ('student', 'body'),
 ('asian', 'applicant'),
 ('model', 'minority'),
 ('black', 'latino'),
 ('chinese', 'americans'),
 ('middle', 'class'),
 ('supreme', 'court'),
 ('black', 'student'),
 ('asian', 'kid'),
 ('sit', 'score'),
 ('feel', 'like'),
 ('african', 'american'),
 ('admission', 'officer'),
 ('east', 'asians'),
 ('m', 'sure'),
 ('admission', 'process'),
 ('asian', 'people'),
 ('minority', 'group'),
 ('holistic', 'admission'),
 ('white', 'supremacy'),
 ('african', 'americans'),
 ('base', 'affirmative'),
 ('personality', 'score'),
 ('american', 'student'),
 ('elite', 'school'),
 ('low', 'income'),
 ('united', 's

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigrams.score_ngrams(bigram_measures.likelihood_ratio)[:40]
# other scoring options include student_t, chi_sq, pmi
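To see why the choice of scoring function matters, here is a hand-rolled comparison of raw counts against pointwise mutual information (PMI). It is an illustration only, not NLTK's implementation: `bigram_scores` is a hypothetical helper, the probabilities are simple relative frequencies, and unlike the `.sum()` concatenation used earlier it does not form bigrams across document boundaries.

```python
import math
from collections import Counter


def bigram_scores(token_lists):
    """Return {bigram: (count, pmi)} for adjacent token pairs within each document."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_words = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        # PMI compares the observed pair probability with the independence baseline.
        scores[(w1, w2)] = (count, math.log2(p_pair / (p_w1 * p_w2)))
    return scores


# Made-up toy documents.
toy_docs = [
    ["affirmative", "action", "policy"],
    ["affirmative", "action", "debate"],
    ["rare", "pair"],
]
scores = bigram_scores(toy_docs)
for bigram, (count, pmi) in sorted(scores.items(), key=lambda item: -item[1][0]):
    print(bigram, count, round(pmi, 3))
```

In this toy corpus the frequent pair wins on raw count, while the pair seen only once gets the higher PMI. That is the usual trade-off: counts favour common phrases, PMI favours rare but exclusive ones.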

### 2. CountVectorizer

In [4]:
count_vectorizer = sklearn.feature_extraction.text.CountVectorizer()
count_vector = count_vectorizer.fit_transform(comments_df['normalized_tokens'])

In [5]:
print(count_vector.shape)

(3623, 10425)


- 3623 rows (comments), 10425 columns (unique tokens)

### 3. TF-IDF Vectorizer

In [5]:
tdidf_transformer = sklearn.feature_extraction.text.TfidfTransformer()
tdidf_vector = tdidf_transformer.fit_transform(count_vector)

In [6]:
list(zip(count_vectorizer.vocabulary_.keys(), tdidf_vector.data))[:20]

[('thank', 0.2790647239686926),
 ('engage', 0.2624916204470818),
 ('insightful', 0.4308677651786394),
 ('respectful', 0.2588701492057808),
 ('discourse', 0.3724050122335431),
 ('news', 0.4155587375062185),
 ('thread', 0.3355191219247845),
 ('lock', 0.3528365045862098),
 ('comment', 0.2282958742756067),
 ('prefer', 0.08961425241998294),
 ('process', 0.16812507837038776),
 ('take', 0.10333828523481878),
 ('account', 0.09791761829024669),
 ('poverty', 0.12307005687760202),
 ('instead', 0.11641353787807514),
 ('generation', 0.11641353787807514),
 ('family', 0.10922874743933599),
 ('come', 0.14501625345093178),
 ('america', 0.13812266943966214),
 ('painfully', 0.11936149821893847)]

### 4. Prune Matrix of features

In [7]:
#initialize
prune_vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=3, max_features=1000, stop_words='english', norm='l2') # why norm=l2?
#train
pruned_vec = prune_vectorizer.fit_transform(comments_df['normalized_tokens'])

- min_df=3 (minimum document frequency) because terms that appear in very few documents get inflated tf-idf weights
- An idea: visualize the document frequency of each word
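As a starting point for the document-frequency idea, the sketch below counts how many documents contain each term and shows how the idf weight grows as document frequency shrinks. The formula is scikit-learn's default smoothed idf (`smooth_idf=True`); the helper names and toy documents are hypothetical.

```python
import math
from collections import Counter


def document_frequency(token_lists):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def smooth_idf(df, n_docs):
    """idf = ln((1 + n_docs) / (1 + df)) + 1, scikit-learn's smoothed default."""
    return math.log((1 + n_docs) / (1 + df)) + 1


# Made-up toy documents.
toy_docs = [
    ["college", "admission", "college"],
    ["college", "legacy"],
    ["admission", "score"],
]
df_counts = document_frequency(toy_docs)
for term, df in df_counts.most_common():
    print(term, df, round(smooth_idf(df, len(toy_docs)), 3))

# With 3623 comments, a term seen in 1 document gets a larger idf than one seen in 3.
print(smooth_idf(1, 3623) > smooth_idf(3, 3623))
```

The sorted `df_counts` could be passed straight to a histogram or bar plot to see how many terms fall below a given `min_df` cutoff.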

In [8]:
pruned_vec

<3623x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 73451 stored elements in Compressed Sparse Row format>

Now the matrix has only 1000 terms/columns.

In [14]:
# try to find term in matrix
termtofind = 'vector'
try:
    print(prune_vectorizer.vocabulary_[termtofind])
except KeyError:
    print(f'"{termtofind}" is missing')
    print('The available words are: {} ...'.format(list(prune_vectorizer.vocabulary_.keys())[:10]))

"vector" is missing
The available words are: ['thank', 'news', 'thread', 'comment', 'prefer', 'process', 'account', 'poverty', 'instead', 'generation'] ...
