# User Flair Analysis

## Method

### Flair data
1. DONE - Q: Does r/asianamerican have flair templates? A: No, flairs are free-text (user-customizable) only.
2. Determine each user's ethnicity (Chinese, Japanese, Korean, Vietnamese, other) from the flair text.
   - How should we deal with multi-ethnic flairs (Chinese-Thai, Korean/Black, etc.)?
   - Most comments with flair text appear to come from a small subset of flaired users who comment frequently.
3. DONE - Q: Flairs can contain up to 10 emojis, so can we use emojis to determine ethnicity? A: Probably not needed; emojis do not seem to be used much.
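
One possible way to handle multi-ethnic flairs is to let a flair map to a set of labels instead of a single one. The sketch below is illustrative only: `KEYWORDS` and `label_flair` are hypothetical names, and the keyword lists are assumptions, not a vetted lexicon.

```python
# Hypothetical keyword table; the substrings are assumptions for illustration.
KEYWORDS = {
    "chinese": ["china", "chines"],
    "korean": ["korea"],
    "japanese": ["japan"],
    "vietnamese": ["viet"],
    "thai": ["thai"],
}


def label_flair(flair):
    """Return the set of labels whose keywords occur in the flair text."""
    if not isinstance(flair, str):
        return set()  # missing flair (NaN/None) gets no label
    text = flair.lower()
    return {
        label
        for label, subs in KEYWORDS.items()
        if any(sub in text for sub in subs)
    }


# A multi-ethnic flair yields more than one label.
print(sorted(label_flair("Chinese-Thai")))
print(sorted(label_flair("Korean/Black")))
```

With set-valued labels, a comment can be counted once per matching group, or multi-label flairs can be filtered out, depending on what the analysis needs.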

#### Flair data EDA

In [1]:
import pandas as pd
import numpy as np

In [2]:
comments_df = pd.read_csv('../data/top_100_post_comments_user_flair.txt', header=None, names=['username', 'flair_text', 'body'])

In [6]:
print(comments_df.shape)
comments_df.head(10)

(3623, 3)


| | username | flair_text | body |
|---|---|---|---|
| 0 | Tungsten_ | | Thanks to everyone who engaged in insightful a... |
| 1 | ProudBlackMatt | Chinese-American | I would prefer using a process that takes into... |
| 2 | TomatoCanned | | u/Tungsten_, Thanks for creating a section jus... |
| 3 | bad-fengshui | | As with anything related to Asians in politics... |
| 4 | Pancake_muncher | | Yet colleges will allow alumni and doners in e... |
| 5 | suberry | | I just hated Affirmative Action as a distracti... |
| 6 | Puzzled-Painter3301 | | My own feeling is that I was never in love wit... |
| 7 | e9967780 | | Anti Asian racism whether against East Asians ... |
| 8 | | | Can we overturn legacy and athlete admissions ... |
| 9 | OkartoIceCream | | I want to remind people that in California, on... |


1. How many comments have flair text? A: Of 3623 rows, 538 have a flair and 3085 do not.
   - We could use more data, but it is unclear whether there is more to collect.
2. How many comments are by users with a Chinese/Chinese-American flair?

In [7]:
print(comments_df.isnull().sum())

username       833
flair_text    3085
body             0
dtype: int64
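
To check whether flaired comments really come from a few repeat commenters, one could count comments per flaired user. This is a minimal sketch on a made-up toy frame (`toy_df` and its rows are invented for illustration); on the real data, `comments_df` would take its place.

```python
import pandas as pd

# Toy stand-in for comments_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "username": ["a", "a", "a", "b", "c", None],
    "flair_text": ["Chinese-American", "Chinese-American",
                   "Chinese-American", "Korean", None, None],
    "body": ["..."] * 6,
})

# Keep only comments that carry a flair.
flaired = toy_df[toy_df["flair_text"].notna()]

# Comments per flaired user, most active first (NaN usernames are dropped).
per_user = flaired["username"].value_counts()

# Share of flaired comments written by the single most active user.
top_share = per_user.iloc[0] / len(flaired)

print(per_user)
print(f"Top user share of flaired comments: {top_share:.2f}")
```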


In [3]:
# use str.find() to search the Series of flair texts -- make the flair texts lowercase first
# substrings to find:
# Chinese: 'china', 'chines', 'abc'
# Korean: 'kor', 'abk', 'gyopo', 'hanguk'
# Japanese: 'jap', 'abj'
# Filipino: 'filip', 'philippi', 'pinoy', 'abf', 'abp'
# Indian: 'indian', 'abi'
# South Asian: 'desi', 'south asia'

# Series of flair_text
flair_text = comments_df['flair_text']

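# note: fillna(0) does not actually remove the NaNs for our purposes --
# .str.lower() returns NaN for non-string values such as 0,
# so rows without a flair are still NaN in flair_text_clean
flair_text_nona = flair_text.fillna(0)
flair_text_clean = flair_text_nona.str.lower()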
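
As an alternative to building a matrix of `find()` indices, the same kind of check can be written as a single boolean mask with `Series.str.contains`. The sketch below uses a small made-up Series (`flairs`) and is not the method used in this notebook. Short substrings such as 'abc', 'kor', or 'jap' may also match unrelated words, so matches are worth spot-checking.

```python
import pandas as pd

# Made-up flair texts for illustration; None stands for a comment without flair.
flairs = pd.Series(["Chinese-American", None, "ABC", "Korean", "Taiwanese"])

# Regex alternation: True if any of the substrings occurs in the flair.
pattern = "china|chines|abc"

# case=False makes the match case-insensitive; na=False treats missing flairs as no match.
is_chinese = flairs.str.contains(pattern, case=False, na=False)

print(is_chinese.tolist())
print(is_chinese.sum())
```

This avoids the separate lowercase, NaN-to-(-1), and `any(axis=1)` steps, at the cost of no longer knowing where in the flair the substring was found.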

#### Chinese flairs

In [4]:
# empty matrix to hold indices of substring
chine_matrix = np.empty((flair_text_clean.shape[0],3))

# each column is for a different type of identifying substring
chine_matrix[:,0] = flair_text_clean.str.find('china')
chine_matrix[:,1] = flair_text_clean.str.find('chines')
chine_matrix[:,2] = flair_text_clean.str.find('abc')

In [5]:
print(chine_matrix)
# row of nan is comment w/o flair

[[nan nan nan]
 [-1.  0. -1.]
 [nan nan nan]
 ...
 [-1. -1. -1.]
 [nan nan nan]
 [-1. -1. -1.]]


In [6]:
# change nan to -1 (no substring found)
chine_matrix_clean = np.nan_to_num(chine_matrix, nan=-1)
print(chine_matrix_clean)

[[-1. -1. -1.]
 [-1.  0. -1.]
 [-1. -1. -1.]
 ...
 [-1. -1. -1.]
 [-1. -1. -1.]
 [-1. -1. -1.]]


In [8]:
# identify rows with one of the keywords (has value other than -1)
print(chine_matrix_clean != -1)
chine_rows = (chine_matrix_clean != -1).any(axis=1)

[[False False False]
 [False  True False]
 [False False False]
 ...
 [False False False]
 [False False False]
 [False False False]]


In [9]:
print(chine_rows.shape)
print(chine_rows.sum())  # 97 of 3623 comments have a Chinese flair


(3623,)
97


In [11]:
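# select the comments whose flair matched one of the Chinese substrings
chi_comments_df = comments_df[chine_rows]
# note: pd.unique() counts NaN (missing username) as a distinct value;
# Series.nunique() would exclude it
num_unique_users = len(pd.unique(chi_comments_df['username']))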

print(f'Num of unique users with Chinese flair: {num_unique_users}')

Num of unique users with Chinese flair: 16


Chinese flair summary:
- 97 comments with Chinese flair
- 16 unique users with Chinese flair

#### Korean flairs

- Korean substrings: 'kor', 'abk', 'gyopo', 'hanguk'


In [12]:
# empty matrix to hold korean substring indices
kor_matrix = np.empty((flair_text_clean.shape[0],4))

kor_matrix[:,0] = flair_text_clean.str.find('kor')
kor_matrix[:,1] = flair_text_clean.str.find('abk')
kor_matrix[:,2] = flair_text_clean.str.find('gyopo')
kor_matrix[:,3] = flair_text_clean.str.find('hanguk')

In [13]:
# change nan to -1 (treat "no flair" the same as "no substring found")
kor_matrix_clean = np.nan_to_num(kor_matrix, nan=-1)
print(kor_matrix_clean)

[[-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 ...
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]]


In [14]:
# identify rows with one of the keywords (has value other than -1)
print(kor_matrix_clean != -1)
kor_rows = (kor_matrix_clean != -1).any(axis=1)
print(kor_rows)

[[False False False False]
 [False False False False]
 [False False False False]
 ...
 [False False False False]
 [False False False False]
 [False False False False]]
[False False False ... False False False]


In [15]:
print(kor_rows.shape)
print(kor_rows.sum()) # 12 comments of 3623 have Korean flair

# select the Korean-flair comments with the boolean mask
kor_comments_df = comments_df[kor_rows]

(3623,)
12


In [17]:
kor_comments_df = comments_df[kor_rows]
num_unique_kor_users = len(pd.unique(kor_comments_df['username']))
print(f'Num of unique users with Korean flair: {num_unique_kor_users}')

Num of unique users with Korean flair: 3


Korean flairs summary:
- 12 comments with Korean flair
- 3 unique users with Korean flair
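
The Chinese and Korean sections repeat the same steps with different substrings, so the remaining groups could reuse a small helper. The following is a sketch under assumptions: `flair_summary` is a hypothetical helper and `toy_df` is invented data. It counts users with `nunique()`, which ignores missing usernames, so its user counts can differ from the `len(pd.unique(...))` counts above.

```python
import re

import pandas as pd


def flair_summary(df, substrings):
    """Return (number of matching comments, number of unique named users)."""
    # Escape each substring so it is matched literally, then OR them together.
    pattern = "|".join(re.escape(s) for s in substrings)
    mask = df["flair_text"].str.contains(pattern, case=False, na=False)
    matched = df[mask]
    # nunique() skips NaN usernames (e.g. deleted accounts).
    return len(matched), matched["username"].nunique()


# Invented rows for illustration only.
toy_df = pd.DataFrame({
    "username": ["u1", "u1", "u2", None, "u3"],
    "flair_text": ["Korean-American", "Korean-American", "Gyopo", "ABK", None],
})

n_comments, n_users = flair_summary(toy_df, ["kor", "abk", "gyopo", "hanguk"])
print(f"{n_comments} comments, {n_users} unique users")
```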