## Motivation

Proper audience targeting is a critical component of a successful advertising campaign. This notebook aims to identify customer demographics that may be particularly receptive to Game Revenant's (GR) advertising. Analysis is performed on unstructured social media data, specifically from Twitter. Compared to other social media, Twitter's API allows greater access to user data, enabling more effective data mining. 

GR currently has too few followers on Twitter to allow for productive data mining. However, the Twitter accounts of rival companies can be analyzed instead. Game critics and customer reviewers frequently compare *Where Shadows Slumber* (WSS) to *Monument Valley*, a mobile puzzle game produced by USTWO Games. As of this writing, @ustwogames has over 126k followers. 

Customer interests were estimated by 1) examining the profile descriptions of @ustwogames followers, and 2) examining the most popular friends among @ustwogames followers. In Twitter's official terminology, a 'friend' is an account a user follows. Preprocessed descriptions were clustered via two approaches, K-modes and DBSCAN. The most popular friends were clustered via K-modes. The rationale behind the unorthodox application of K-modes is explained later in the notebook. 

## Methods and Results

The Twitter profiles of @ustwogames followers were mined via the Python package Tweepy and written to a SQLite database (refer to *pull_data_twitter.py* and *sqlite_fx.py* for the script and helper functions). 

In [2]:
# Autoreload to accommodate script updates without restarting notebook
%load_ext autoreload
%autoreload 2
# Move to main directory of the Customer-Segmentation project
%cd ..

C:\Users\Vincent S\game-revenant\Customer-Segmentation


In [3]:
import pandas as pd
import sqlalchemy as sa
from pathlib import Path

DB_NAME = 'customer-segmentation'
TAB_NAME = 'ustwo_followers'

# Pull interim data from SQLite DB
e = sa.create_engine('sqlite:///./data/interim/' + DB_NAME + '.sqlite')
query = 'SELECT * FROM ' + TAB_NAME
users = pd.read_sql_query(query, e)

users.head()

Unnamed: 0,index,id,id_string,name,screen_name,location,url,description,protected,verified,followers_count,friend_count,listed_count,favourites_count,statuses_count,created_at,default_profile,default_profile_image
0,0,1114109046792032256,1114109046792032256,Hiyoru,Hiyoru6,,,so para fotos de desenhos,0,0,1,21,0,139,5,Fri Apr 05 10:14:30 +0000 2019,1,0
1,1,1120790192128954371,1120790192128954371,Tom Baines 🏳️‍🌈 🇪🇺,TomBaines16,"North West, England",,Extreme sports calendar model. \nKeeping retro...,0,0,29,169,0,961,456,Tue Apr 23 20:42:59 +0000 2019,0,0
2,2,113652889,113652889,Marmalade Games,MarmaladeGames,"London, UK",http://t.co/sPUf5LShHE,,0,0,506,262,13,304,471,Fri Feb 12 15:07:07 +0000 2010,0,0
3,3,363116063,363116063,MIT SHAH,mitshah97,Ahmedabad,,#Unity3d #Game #Developer,0,0,39,279,1,22,13,Sat Aug 27 15:13:29 +0000 2011,1,0
4,4,972169508,972169508,Kyrie E.H.C.,KyrieEHC,"Madison, WI",https://t.co/ajE03P8x2g,Still believes in the warmth in interaction. S...,0,0,333,1103,18,1118,734,Mon Nov 26 15:27:59 +0000 2012,0,0


In [4]:
print(str(len(users)) + ' @ustwogames follower profiles were mined')

125925 @ustwogames follower profiles were mined


### Text wrangling of profile descriptions

Although a number of parameters were mined, only the profile description and the identifying user id are of interest in the following analysis. 

Prior to analysis, unstructured text data must be preprocessed. Punctuation, emojis, and excess white space were removed. All words were converted to lowercase and tokenized. Stopwords were removed. The NLTK stopword list was extended by several words that were found to be common in profile descriptions, but did not add any useful information about user interests. 

In [5]:
import nltk
import re
import numpy as np

def normalize_doc(doc, stop_words):
    # replace everything except ASCII letters and whitespace with a space.
    # This filters out non-Latin scripts!!
    # Note: flags must be passed by keyword; the 4th positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc, flags=re.ASCII)
    # remove single characters
    doc = re.sub(r'\b[a-zA-Z]\b', '', doc, flags=re.ASCII)
    # remove whitespace at beginning and end of string
    doc = doc.strip()
    # convert all characters to lowercase
    doc = doc.lower()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # recreate doc from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

stop_words = nltk.corpus.stopwords.words('english')
stop_words.extend(['co', 'https', 'http', 'gmail', 'com', 'like', 'love'])

users['clean_desc'] = users['description'].map(lambda doc: normalize_doc(doc, stop_words) 
                                                if doc is not None
                                                else np.nan)

users[['description','clean_desc']].head()

Unnamed: 0,description,clean_desc
0,so para fotos de desenhos,para fotos de desenhos
1,Extreme sports calendar model. \nKeeping retro...,extreme sports calendar model keeping retro al...
2,,
3,#Unity3d #Game #Developer,unity game developer
4,Still believes in the warmth in interaction. S...,still believes warmth interaction studied game...


For now, only profile descriptions in English were analyzed. Language identification of short text is challenging, and the more common language identification libraries (e.g. langid) misclassified certain profile descriptions. Facebook's fastText library offers an alternative that has shown relatively high accuracy on short-text language classification (http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html). 

English profiles were lemmatized using the NLP library spaCy, completing the text preprocessing stage of the analysis pipeline. Lemmatization can take several minutes, so the updated DataFrame was saved to a csv file. 

In [13]:
import os
import fasttext
import spacy

nlp = spacy.load('en') #download 'small' version of english model

def lemmatize_doc(doc):
    doc = nlp(doc)
    doc = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return doc

# import and apply FastText model for language identification
idlang_path = 'data/external/fasttext_training_data/lid.176.bin'
idlang_model = fasttext.FastText.load_model(idlang_path)

users['lang_id'] = users['clean_desc'].map(lambda doc: idlang_model.predict(doc) 
                                            if doc == doc 
                                            else np.nan)

# only English profiles are lemmatized; empty profiles are also ignored
users['clean_desc_en'] = users.apply(lambda row: lemmatize_doc(row['clean_desc']) 
                                        if  row['lang_id'] == row['lang_id']
                                        and row['lang_id'][0][0] == '__label__en'
                                        and row['clean_desc'] == row['clean_desc'] 
                                        else np.nan,
                                        axis=1)


ModuleNotFoundError: No module named 'fasttext'

In [30]:
# Push processed data into SQLite DB. Profiles with non-English or empty descriptions were dropped
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '-clean.sqlite')

cleaned_tab = TAB_NAME + '_clean'
users[['id', 'clean_desc_en']].to_sql(cleaned_tab, e)

users[['description','clean_desc','clean_desc_en']].head()

Unnamed: 0,description,clean_desc,clean_desc_en
0,so para fotos de desenhos,para fotos de desenhos,
1,Extreme sports calendar model. \nKeeping retro...,extreme sports calendar model keeping retro al...,extreme sport calendar model keep retro alive ...
2,,,
3,#Unity3d #Game #Developer,unity game developer,unity game developer
4,Still believes in the warmth in interaction. S...,still believes warmth interaction studied game...,still believe warmth interaction study game cu...


### Clustering profile descriptions

After preprocessing, the profile descriptions were clustered via two approaches, K-modes and DBSCAN. 

#### K-modes clustering

Given the short length of a typical profile, a word will typically appear no more than once within a given profile description. **It would thus be more appropriate to vectorize profile descriptions as binary categorical vectors, as opposed to numeric vectors.** Profile descriptions were count vectorized as binary categorical data, where a value of *1* indicates that a given word is present in a given profile description and *0* that it is absent.  
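
As a rough illustration of the difference between binary and count vectorization, the sketch below builds both kinds of vectors for two made-up descriptions using only the standard library. The helper `vectorize` and the toy documents are hypothetical and are not part of the notebook's pipeline, which relies on scikit-learn's `CountVectorizer(binary=True)` instead.

```python
from collections import Counter

def vectorize(docs, binary=True):
    """Return (vocabulary, rows), one row per document.

    With binary=True a cell is 1 if the word occurs at all;
    otherwise the cell holds the raw occurrence count.
    """
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        if binary:
            rows.append([1 if word in counts else 0 for word in vocab])
        else:
            rows.append([counts[word] for word in vocab])
    return vocab, rows

# Hypothetical toy descriptions; 'game' is repeated in the second one
docs = ['unity game developer', 'game artist game lover']

vocab, binary_rows = vectorize(docs, binary=True)
_, count_rows = vectorize(docs, binary=False)

print(vocab)
print(binary_rows)
print(count_rows)
```

The two representations differ only where a word repeats within a description, which is uncommon in short profiles; the binary form simply makes the "present or absent" interpretation explicit.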


K-means is likely the most popular approach to clustering. However, its use is limited to numerical data. The K-modes algorithm was developed as an analog to K-means for categorical data (Huang 1997, 1998). Like K-means, the K-modes algorithm requires that the number of clusters be predetermined.  
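
The core idea of K-modes can be sketched in a few lines: points are assigned to the nearest cluster by the number of mismatching attributes (Hamming distance), and each cluster centre is updated to the per-column mode of its members. The code below is a simplified illustration under stated assumptions, not the implementation in the kmodes library: the function names are hypothetical, and the initial modes are simply the first distinct rows rather than Huang's initialization.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column of a group of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def simple_kmodes(data, n_clusters, n_iter=10):
    # Assumption: naive initialization with the first distinct rows
    modes = []
    for row in data:
        if row not in modes:
            modes.append(row)
        if len(modes) == n_clusters:
            break
    if len(modes) < n_clusters:
        raise ValueError('fewer distinct rows than clusters')

    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: nearest mode by Hamming distance
        labels = [min(range(n_clusters), key=lambda k: hamming(row, modes[k]))
                  for row in data]
        # Update step: recompute each mode from its members
        new_modes = []
        for k in range(n_clusters):
            members = [row for row, lab in zip(data, labels) if lab == k]
            new_modes.append(column_modes(members) if members else modes[k])
        if new_modes == modes:  # converged
            break
        modes = new_modes
    return labels, modes

# Hypothetical binary word-presence vectors
data = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 0),
        (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
labels, modes = simple_kmodes(data, n_clusters=2)
print(labels, modes)
```

Because the mode replaces the mean, the cluster centres remain valid categorical vectors, which is what makes the approach applicable to binary word-presence data.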

The kmodes Python library was used to perform K-modes clustering. No optimal number of clusters was assumed a priori. Consequently, K-modes clustering was executed for a range of cluster numbers (2 through 15, as set by `NUM_CLUSTERS_RNG` in the code below). 

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from kmodes.kmodes import KModes

DB_NAME = 'customer-segmentation-clean'
TAB_NAME = 'ustwo_followers_clean'
NUM_CLUSTERS_RNG = 16 # exclusive upper bound of the cluster range to evaluate (k = 2..15)

# Pull processed data from SQLite DB
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '.sqlite')

#SQL query includes a WHERE statement that filters out NA values and empty strings
query = 'SELECT id, clean_desc_en FROM ' + TAB_NAME + ' WHERE clean_desc_en != \'\''
users = pd.read_sql_query(query, e)

cv = CountVectorizer(ngram_range=(1, 2), min_df=100, max_df=1.0,
                     stop_words=stop_words, binary=True)
cat_matrix = cv.fit_transform(users['clean_desc_en'])

# Connect to DB where cluster results are to be saved
DB_RESULTS_NAME = 'customer-segmentation-cluster'
e = sa.create_engine('sqlite:///./models/' + DB_RESULTS_NAME + '.sqlite')
TAB_RESULTS_PREFIX = 'ustwo_followers_k_'

for n_clusters in range(2, NUM_CLUSTERS_RNG):
    kmod = KModes(n_clusters=n_clusters, init='Huang', random_state=42, n_jobs=-1)
    y_pred = kmod.fit_predict(cat_matrix.toarray())
    users['cluster'] = kmod.labels_
    tab_name = TAB_RESULTS_PREFIX + str(n_clusters)
    users.to_sql(tab_name, e)


The K-modes algorithm will force the data to be organized into the predefined number of clusters. Clusters are not guaranteed to be generated in a meaningful way. 

To gauge the distinctiveness of the clusters, a vocabulary list of the most common unigrams, bigrams, and trigrams within the profile descriptions of a given cluster is generated (*gen_vocab_list*) and ordered by frequency of occurrence. The degree of similarity in the top vocabulary between two clusters serves as a rough measure of the distance between those clusters: the more top terms two clusters share, the less distinct they are.
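
One way to put a number on this vocabulary similarity is the Jaccard overlap of the top terms of two clusters. The notebook itself compares the printed vocabulary lists by eye, so the metric, the helper names `top_terms` and `vocab_overlap`, and the toy corpora below are all assumptions made for illustration. For brevity the sketch uses unigrams only.

```python
from collections import Counter

def top_terms(corpus, n_terms=10):
    """Set of the n_terms most frequent unigrams in a list of documents."""
    counts = Counter(word for doc in corpus for word in doc.split())
    return {term for term, _ in counts.most_common(n_terms)}

def vocab_overlap(corpus_a, corpus_b, n_terms=10):
    """Jaccard overlap of the top vocabularies: 1.0 = identical, 0.0 = disjoint."""
    top_a = top_terms(corpus_a, n_terms)
    top_b = top_terms(corpus_b, n_terms)
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Hypothetical cluster corpora
cluster_a = ['game developer', 'indie game developer']
cluster_b = ['graphic designer', 'game designer']

overlap = vocab_overlap(cluster_a, cluster_b)
print(overlap)
```

A value near 1 would suggest that two clusters are dominated by the same terms and are therefore hard to tell apart, while a value near 0 suggests distinct vocabularies.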

In [33]:
from collections import Counter

# function generates a vocab list of unigrams, bigrams, and trigrams found in the corpus of profile descriptions
def gen_vocab_list(corpus):
    unigrams = [words for doc in corpus for words in doc.split()]
    ngrams = [bigram for doc in corpus for bigram in nltk.ngrams(doc.split(), 2)]
    bigrams = [token[0] + ' ' + token[1] for token in ngrams]
    ngrams = [bigram for doc in corpus for bigram in nltk.ngrams(doc.split(), 3)]
    trigrams = [token[0] + ' ' + token[1] + ' ' + token[2] for token in ngrams]
    vocab = unigrams + bigrams + trigrams
    return vocab

# function to order keywords based on their frequency in the corpus
# the min_freq argument sets a minimum frequency for a vocabulary word to be considered a keyword
def get_freq_keywords(corpus, min_freq):
    unigrams = [words for doc in corpus for words in doc.split()]
    key_counter = Counter(unigrams).most_common()
    keywords = [key[0] for key in key_counter if key[1] >= min_freq]
    return keywords

# function to detect keywords present in a document (a profile description)
def detect_keywords(doc, keywords):
    valid_words = [word for word in doc.split() if word in keywords]
    valid_words = ' '.join(valid_words)
    if not valid_words:
        valid_words = np.nan
    return valid_words

vocab = gen_vocab_list(users['clean_desc_en'].dropna())
keywords = get_freq_keywords(users['clean_desc_en'].dropna(), 500)

users['keywords'] = users['clean_desc_en'].map(lambda doc: detect_keywords(doc, keywords) 
                                                if doc is not None 
                                                else np.nan)

Cluster 2

In [77]:
def print_cluster_results(df, col_clust, col_list, clust_rng, n_keywords=10, n_desc=5):
    '''Evaluate clustering results by printing the most popular vocab words within a cluster corpus, and the occurrence of those words in the cluster corpus 
    
    param df (DataFrame): contains columns with profile features and cluster assignment
    param col_clust (str): name of column containing cluster assignment
    param col_list (str list): names of columns to be printed; the first entry is the corpus to be analyzed
    param clust_rng (tuple): bounds of the range of clusters to be printed (upper bound exclusive). Assumes integer increments of 1. 
    param n_keywords (int): number of top vocab words to be analyzed for occurrence in the corpus
    param n_desc (int): number of randomly sampled twitter profiles whose features listed in col_list are to be printed
    '''
    n_total = len(df) 
    
    for cluster in range(clust_rng[0], clust_rng[1]):
        corpus = df.loc[df[col_clust] == cluster]
        n_corpus = len(corpus)
        per_label = round(n_corpus/n_total*100) # percentage of documents assigned to the cluster label 
        vocab = gen_vocab_list(corpus[col_list[0]])
        print('Cluster # ' + str(cluster))
        print(str(n_corpus) + ' (' + str(per_label) + '%) users with valid description in this cluster')
        print(Counter(vocab).most_common(n_keywords))
        print(corpus[col_list].sample(n=n_desc))
        print('\n')

Rundown of ALL results

In [78]:
# NUM_CLUSTER = 2

DB_NAME = 'customer-segmentation-cluster'
e = sa.create_engine('sqlite:///./models/' + DB_NAME + '.sqlite')

for n_clusters in range(2, NUM_CLUSTERS_RNG):

    TAB_NAME = 'ustwo_followers_k_' + str(n_clusters)
    query = 'SELECT id, clean_desc_en, cluster FROM ' + TAB_NAME + ' WHERE clean_desc_en != \'\''
    users = pd.read_sql_query(query, e)

    print('CLUSTER K = ' + str(n_clusters))
    print('\n')
    print_cluster_results(df=users, col_clust='cluster', col_list=['clean_desc_en'], clust_rng=(0,n_clusters))

CLUSTER K = 2


Cluster # 0
47166 (98%) users with valid description in this cluster
[('game', 10080), ('designer', 3933), ('make', 2676), ('developer', 2441), ('design', 2342), ('gamer', 2325), ('artist', 2257), ('video', 2101), ('play', 2036), ('follow', 1895)]
                                                                                               clean_desc_en
14509  home world gallifrey time lock kind waste time sharing awesomeness also work programmer dovetail game
16151  champlain college class game artist movie lover micro painter                                        
5814   yinzer interior designer                                                                             
10799  music sport                                                                                          
47628  web app designer developer postcapitalist veg founder code boss totallyco                            


Cluster # 1
1058 (2%) users with valid description in this cluster
[('creative',

44354  project manager intpd founder psychoactive et gamecraftedu former president beckerigda sip igdascholar             


CLUSTER K = 5


Cluster # 0
3365 (7%) users with valid description in this cluster
[('designer', 3469), ('graphic', 556), ('game', 489), ('graphic designer', 470), ('ux', 464), ('design', 356), ('product', 336), ('illustrator', 333), ('artist', 301), ('ui', 278)]
                                                                                                         clean_desc_en
32028  let introduce name maria turk ux designer year experience check website                                        
46283  freelance ui ux designer senior designer verotruesocial travel addict                                          
5566   level designer secondhandgame spare time work solitarythegame episodic sci fi game indiegamedev unity          
30525  indie game developer graphic designer music art lover funny dood                                               
44580  paddyduke

30424  graphic designer day video game fisherman night                                                                          


Cluster # 4
485 (1%) users with valid description in this cluster
[('hi', 487), ('game', 111), ('name', 84), ('follow', 67), ('hi name', 63), ('play', 56), ('youtube', 55), ('video', 51), ('make', 45), ('i', 43)]
                                                                                                                  clean_desc_en
4192   hi idk man guess                                                                                                        
9669   yetee game love beef jerky eat mythical beast working artist make hi quality affordable clothe art avatar sarahgraleyart
32084  hi wayside gaming video game clan clan relaxed version clan allow game pace                                             
29118  hi gamerrepublic upload awesome video go show channel subscribe thank                                                   
4085   hi i be j

109 (0%) users with valid description in this cluster
[('artist', 112), ('writer', 111), ('artist writer', 39), ('game', 26), ('writer artist', 20), ('designer', 17), ('gamer', 16), ('lover', 11), ('musician', 9), ('make', 8)]
                                                                                                                      clean_desc_en
8552   writer artist avid reader former bookstore slave                                                                            
15795  livestreamer law student consultant writer artist mother five word know hell anymore                                        
5306   commission open then teammanticore writer comic artist creator oh hell donna deddrie maker weird plushie half manticore team
28459  artist try find entertain musician artist writer                                                                            
8642   artist art manager player writer fantasy lover thing dragon                                               

13260  young movie director blogger vg lover create short movie honor                                                                 


CLUSTER K = 9


Cluster # 0
35017 (73%) users with valid description in this cluster
[('gamer', 1811), ('design', 1597), ('follow', 1501), ('artist', 1360), ('youtube', 1226), ('life', 1221), ('gaming', 1185), ('developer', 1108), ('art', 1070), ('music', 1058)]
                                                                                    clean_desc_en
24942  rexwal                                                                                    
34011  hello name agitatedsky gamer minecraft                                                    
27454  karaoke balloon                                                                           
39601  creative director peter jackson wingnut ar formerly head weta digital mostly random musing
20306  look atwood soundcloud hear exclusive music atwood                                        


Cluster # 

[('developer', 2465), ('game', 1374), ('game developer', 790), ('designer', 290), ('indie', 289), ('web', 276), ('software', 252), ('io', 217), ('work', 205), ('indie game', 203)]
                                                                                                           clean_desc_en
17227  professional gamer game developer wait mention also game developer oh oh okay well yeah                          
29699  game developer programmer qa game designer team emoveo khsdl                                                     
27305  entrepreneur retail shoe commerce web developer marketing shoe store sweden                                      
19271  make great game come true senior unity developer pvp studio former programm ubisoftmobile craiova driverspeedboat
36267  computer engineer game developer founder cubesoftbilisim                                                         


Cluster # 1
32476 (67%) users with valid description in this cluster
[('designer', 1972), ('

9728   unity gameplay programmer ai pathfinde procedual content generation shader programming learning machine learning tweet stuff relate gamming


Cluster # 9
1248 (3%) users with valid description in this cluster
[('artist', 1300), ('designer', 152), ('illustrator', 133), ('art', 123), ('concept', 80), ('concept artist', 74), ('animator', 73), ('writer', 65), ('freelance', 63), ('artist illustrator', 62)]
                                                                                                                clean_desc_en
7566   artist computer science major derby chick queer kid                                                                   
22403  artist st designer philosopher realitysculptor strive full potential entity manifest positive thought structure always
8634   kuroi ame kuroiamedream ruin spirit spiritruine bokeh deathbybokeh video artist ceo purelifetape                      
27044  imagine life recording artist kanye west episode hbo curb enthusiasm         

[('team', 812), ('game', 170), ('competitive', 79), ('look', 73), ('player', 66), ('gaming', 60), ('new', 55), ('call', 55), ('make', 52), ('us', 52)]
                                                                              clean_desc_en
41883  go team follow back team everyone cherish follower unless dick join pack            
8369   team nac                                                                            
47145  maker game enjoyer cinema baker bread team hide variable also instagram sixteenbit  
47998  freelance interaction designer animator part time potter proud member gojauntly team
34488  pro gamer part kaos clan team overwatch retire truck driver                         


Cluster # 8
874 (2%) users with valid description in this cluster
[('go', 923), ('youtube', 139), ('game', 137), ('follow', 121), ('channel', 105), ('check', 86), ('play', 78), ('subscribe', 73), ('go check', 72), ('video', 69)]
                                                                  

9104   work natural language processing information retrieval large biological dbs love drink tea post random pic blog


Cluster # 6
515 (1%) users with valid description in this cluster
[('try', 522), ('game', 125), ('make', 114), ('try make', 76), ('get', 72), ('play', 45), ('try get', 45), ('life', 43), ('thing', 42), ('video', 42)]
                                                              clean_desc_en
26126  insane crew recruiter voice play cod try mlg                        
25733  mad foolish guy try figure right wrong thing                        
10567  guy try make decent game                                            
21432  gamer play cod battlefield trickshotter try hard play snd domination
25983  youtuber ya ever try eat rainbow                                    


Cluster # 7
7294 (15%) users with valid description in this cluster
[('game', 8603), ('video', 1282), ('video game', 1151), ('developer', 970), ('play', 923), ('designer', 912), ('make', 871), ('game deve

43289  product builder passionate tech entrepreneurship startup travel art design                                


Cluster # 3
508 (1%) users with valid description in this cluster
[('instagram', 511), ('follow', 151), ('follow instagram', 75), ('youtube', 72), ('snapchat', 56), ('facebook', 34), ('designer', 31), ('artist', 30), ('gamer', 29), ('illustrator', 24)]
                                                                                 clean_desc_en
8410   kurucu yay netmeni bigumigu founder editor chief bigumigu instagram nl mfazelz         
23455  snapchat marquelkt instagram thompsonismyname                                          
38454  follow instagram i be nvp thank much                                                   
45431  follow us instagram artsyunicorn                                                       
2773   illustrator artist love fantasy sci fi avid climber love dance instagram miss doubtfire


Cluster # 4
292 (1%) users with valid description in this c

34755  coo thebotplatform founder touchpaperorg founder beard member raspberry pi foundation former oil cfo


CLUSTER K = 14


Cluster # 0
276 (1%) users with valid description in this cluster
[('support', 283), ('game', 74), ('follow', 27), ('please', 20), ('work', 19), ('us', 19), ('gamer', 18), ('help', 18), ('developer', 16), ('thank', 16)]
                                                                                                                   clean_desc_en
2759   work support uk early stage game development community                                                                   
25140  thank inquire oblivious gaming call duty network support xbox xbox one                                                   
27589  ascend nation gaming entertainment organization bring viewer good content possible show support subscribe follow interact
18401  try go pro gaming would support let grow together                                                                        
46718  c

[('man', 368), ('game', 53), ('one', 30), ('family', 28), ('family man', 25), ('developer', 19), ('life', 18), ('fan', 18), ('one man', 17), ('gamer', 16)]
                                                                                                            clean_desc_en
11229  take shower shine shoe get time lose young man must live                                                          
233    man brazilian ligado                                                                                              
26854  one distinguish characteristic louis represent every day man well get well right pill                             
4658   angry little artist man slash ya tire                                                                             
20957  reflex future play reflex clan trickshotte pubstompe clan bill day reflex giver man reflex nebula awsome great day


Cluster # 10
475 (1%) users with valid description in this cluster
[('us', 522), ('follow', 128), ('game', 107

20202  lil ole country boy hunt fish girl                                                            


Cluster # 5
1287 (3%) users with valid description in this cluster
[('thing', 1357), ('game', 313), ('make', 243), ('designer', 139), ('design', 119), ('make thing', 114), ('lover', 85), ('work', 82), ('artist', 80), ('art', 74)]
                                                                                              clean_desc_en
43636  name delme design thing make awesome                                                                
16410  play game thing oemsb eak psn hardcorefolife                                                        
29443  thing sometimes                                                                                     
44647  freelance editor graphic designer stylist cook addict workout experimenter collector beautiful thing
42938  thought thing non sequitur                                                                          


Cluster # 6
7843

Cluster # 14
173 (0%) users with valid description in this cluster
[('would', 182), ('i', 46), ('i would', 42), ('follow', 17), ('youtube', 15), ('psn', 12), ('xbox', 12), ('join', 12), ('get', 11), ('make', 11)]
                                                                                   clean_desc_en
17218  hi matt would share meme consider hit discord gigabyte                                   
37823  switch sw nintendo i would tmrikac                                                       
386    another version vibe suicide would push button ya bowin let cuttin rage machine          
2688   artist also food blogger steam i would                                                   
37108  make gaming video stuff would see stuff check youtube feel free subscribe drop like video




#### DBSCAN clustering

DBSCAN (density-based spatial clustering of applications with noise) is a density-based clustering algorithm. Unlike K-modes, which will force each data point to be assigned to a cluster, DBSCAN will mark data in low-density regions as outliers. 
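
The outlier behaviour can be illustrated with a compact, simplified DBSCAN written from scratch. This is a sketch for intuition only and is not scikit-learn's implementation, which the notebook actually uses; the function name `simple_dbscan` and the toy points are hypothetical. As in scikit-learn, a point counts as its own neighbour and noise receives the label -1.

```python
import math

def simple_dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1          # not dense enough: provisionally noise
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # only core points expand the cluster
    return labels

# Two dense groups and one isolated point (hypothetical data)
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = simple_dbscan(points, eps=1.0, min_samples=3)
print(labels)
```

The isolated point is left unassigned rather than being forced into the nearest group, which is the property that distinguishes DBSCAN from K-modes here.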

Descriptions were vectorized by normalized term frequency. 
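
Normalization matters for the choice of DBSCAN's `eps`. Assuming the term-frequency vectors are scaled to unit L2 length (the default `norm='l2'` of scikit-learn's `TfidfVectorizer`), the Euclidean distance between two descriptions depends only on their cosine similarity, d² = 2 − 2·cos, and can never exceed √2. The sketch below checks this relationship on made-up descriptions; the helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_unit_vector(doc, vocab):
    """Term-frequency vector of a document, scaled to unit L2 length."""
    counts = Counter(doc.split())
    raw = [counts[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy descriptions
docs = ['game developer', 'game designer', 'music lover']
vocab = sorted({word for doc in docs for word in doc.split()})
v_dev, v_des, v_music = (tf_unit_vector(doc, vocab) for doc in docs)

d_shared = math.dist(v_dev, v_des)      # one of two words shared
d_disjoint = math.dist(v_dev, v_music)  # no words shared

print(d_shared, cosine(v_dev, v_des))
print(d_disjoint, cosine(v_dev, v_music))

# Cosine similarity implied by a given eps on unit vectors
eps = 1.2
min_cosine = 1 - eps ** 2 / 2
print(min_cosine)
```

Under this assumption, an `eps` of 1.2 treats two descriptions as neighbours when their cosine similarity is at least about 0.28, and any `eps` of √2 or more would make every pair of descriptions neighbours.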

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

DB_NAME = 'customer-segmentation-clean'
TAB_NAME = 'ustwo_followers_clean'

# Pull processed data from SQLite DB
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '.sqlite')

#SQL query includes a WHERE statement that filters out NA values and empty strings
query = 'SELECT id, clean_desc_en FROM ' + TAB_NAME + ' WHERE clean_desc_en != \'\''
users = pd.read_sql_query(query, e)
N_DESC = 10
N_KEYWORDS = 10

tf = TfidfVectorizer(use_idf=False)
tf_matrix = tf.fit_transform(users['clean_desc_en']) #tf is normalized as opposed to cv

dbscan = DBSCAN(eps=1.2, min_samples=1000)

y_pred = dbscan.fit_predict(tf_matrix)
users['cluster'] = dbscan.labels_

# Push processed data into SQLite DB. Profiles with non-English or empty descriptions were dropped
DB_RESULTS_NAME = 'customer-segmentation-cluster'
TAB_RESULTS_NAME = 'ustwo_followers_dbscan'
e = sa.create_engine('sqlite:///./models/' + DB_RESULTS_NAME + '.sqlite')
users.to_sql(TAB_RESULTS_NAME, e)

print_cluster_results(df=users, col_clust='cluster', col_list=['clean_desc_en'], clust_rng=(0,n_clusters))

ValueError: Table 'ustwo_followers_dbscan' already exists.

In [92]:
# Push processed data into SQLite DB. Profiles with non-English or empty descriptions were dropped
DB_RESULTS_NAME = 'customer-segmentation-cluster'
TAB_RESULTS_NAME = 'ustwo_followers_dbscan'
e = sa.create_engine('sqlite:///./models/' + DB_RESULTS_NAME + '.sqlite')
users.to_sql(TAB_RESULTS_NAME, e, if_exists='replace')

n_clusters = len(np.unique(dbscan.labels_))
clust_rng = (-1, n_clusters-1) # offset by -1 since DBSCAN labels noise points as -1; real clusters start at 0
print_cluster_results(df=users, col_clust='cluster', col_list=['clean_desc_en'], clust_rng=clust_rng)

Cluster # -1
14653 (30%) users with valid description in this cluster
[('world', 391), ('founder', 381), ('work', 380), ('tweet', 337), ('get', 328), ('new', 327), ('fan', 316), ('tech', 315), ('team', 303), ('good', 299)]
                                                                                                                           clean_desc_en
21616  serve joseph kalamaraki real fm radio show playlist since early                                                                  
25161  fire xxevilsnipexx                                                                                                               
45334  digital life home automation geek software dev pro eu fervent european xfgysmh remain supporter eu citizen stopbrexit peoplesvote
36638  cheap skin                                                                                                                       
24631  use work ea try help madden community software coin cash generator player dm detail  

In [None]:
users = pd.read_csv('preprocess_ustwo_users.csv')

len(users['clean_desc_en'].dropna())

In [91]:
len(np.unique(dbscan.labels_))

2

In [29]:
DB_NAME = 'customer-segmentation'
TAB_NAME = 'ustwo_followers'

In [None]:
import pandas as pd
import nltk
import re
import numpy as np
import os
import fasttext
import spacy




In [93]:
# %% K-modes clustering, show top keywords in each cluster

NUM_CLUSTER = 3
N_KEYWORDS = 50
N_DESC = 0

PATH = os.path.normpath('c:\\Users\\Vincent\\Game-Revenant\\Shadows\\ustwo_sampled_friend_fol_cluster')
FILE_PREFIX = 'kmode_idlimit-75_'
FULL_PATH = os.path.join(PATH, FILE_PREFIX + str(NUM_CLUSTER) + '.csv')
results = pd.read_csv(FULL_PATH).dropna()
n_total = len(results)

for cluster in range(0, NUM_CLUSTER):
    corpus = results.loc[results['cluster'] == cluster]
    n_corpus = len(corpus)
    vocab = gen_vocab_list(corpus['friend_name'])
    print('Cluster: ' + str(cluster))
    print(str(n_corpus) + ' (' + str(round(n_corpus/n_total*100)) + '%) users with valid description in this cluster')
    print(Counter(vocab).most_common(N_KEYWORDS))
#    print(corpus[['description','keywords']].sample(n=N_DESC))
    print('\n')

Cluster: 0
4500 (41%) users with valid description in this cluster
[('BarackObama', 1708), ('elonmusk', 1590), ('NASA', 1299), ('PlayStation', 1276), ('BillGates', 1170), ('RockstarGames', 1132), ('Twitter', 1112), ('steam_games', 1073), ('YouTube', 1072), ('Ubisoft', 998), ('NintendoAmerica', 979), ('IGN', 953), ('Xbox', 946), ('HIDEO_KOJIMA_EN', 889), ('AppStore', 886), ('Polygon', 860), ('GooglePlay', 813), ('Twitch', 791), ('Kotaku', 788), ('TheEllenShow', 769), ('gamasutra', 758), ('tha_rami', 749), ('jimmyfallon', 735), ('TimOfLegend', 725), ('notch', 718), ('GameSpot', 671), ('SupergiantGames', 654), ('ID_AA_Carmack', 633), ('telltalegames', 601), ('bethesda', 501), ('2K', 491), ('femfreq', 427), ('PocketGamer', 408), ('engadgetgaming', 397), ('toucharcade', 385), ('helvetica', 377), ('GamesRadar', 368), ('fullbright', 367), ('bfod', 349), ('leighalexander', 345), ('majornelson', 344), ('levine', 342), ('CallofDuty', 341), ('popcap', 327), ('br', 318), ('brandonnn', 315), ('chri
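
The per-cluster lists above are frequency tallies of account names. `gen_vocab_list` is defined elsewhere in the notebook; the sketch below assumes it flattens the whitespace-separated `friend_name` strings into a single token list. The data and the `top_accounts` helper are invented for illustration:

```python
from collections import Counter

# Invented cluster assignments; friend_name lists the top accounts a user follows
results = [
    {'cluster': 0, 'friend_name': 'NASA elonmusk'},
    {'cluster': 0, 'friend_name': 'NASA PlayStation'},
    {'cluster': 1, 'friend_name': 'PlayStation Xbox'},
]


def top_accounts(rows, cluster, n):
    """Return the n most followed accounts among users in one cluster."""
    tokens = [name for row in rows if row['cluster'] == cluster
              for name in row['friend_name'].split()]
    return Counter(tokens).most_common(n)


print(top_accounts(results, cluster=0, n=1))
```

Because each user contributes an account at most once, a count can be read as the number of users in the cluster who follow that account.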

In [None]:
FULL_PATH = os.path.normpath('c:\\Users\\Vincent\\Game-Revenant\\Customer-Segmentation\\ustwo_sampled_friend_fol_cluster')
FILE_PREFIX = 'kmode_idlimit-' + str(ID_LIMIT) + '_'

# One binary column per top account: 1 if the user follows it, 0 otherwise
cv = CountVectorizer(min_df=1, max_df=1.0, binary=True)
cat_matrix = cv.fit_transform(friends_feat['friend_name'])

NUM_CLUSTERS_RNG = 20  # exclusive upper bound: fits 2 through 19 clusters

for n_clusters in range(2, NUM_CLUSTERS_RNG):
    kmod = KModes(n_clusters=n_clusters, init='Huang', random_state=42, n_jobs=-1)
    kmod.fit(cat_matrix.toarray())  # convert the sparse count matrix to a dense array
    friends_feat['cluster'] = kmod.labels_
    # Save one CSV of cluster assignments per cluster count
    filename = FILE_PREFIX + str(n_clusters) + '.csv'
    save_loc = os.path.join(FULL_PATH, filename)
    friends_feat.to_csv(save_loc, index=False)
    print('finished n_clusters = ' + str(n_clusters))
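
`CountVectorizer(binary=True)` turns each user's `friend_name` string into a 0/1 vector with one column per top account, and K-modes compares such vectors by counting the positions where they differ (simple matching dissimilarity). A small pure-Python sketch of both steps follows. The three users are invented, and `binary_encode` and `matching_dissimilarity` are hypothetical helper names, not functions from scikit-learn or kmodes:

```python
def binary_encode(docs):
    """Map whitespace-separated account names to 0/1 rows over a sorted vocabulary."""
    vocab = sorted({name for doc in docs for name in doc.split()})
    rows = [[int(name in doc.split()) for name in vocab] for doc in docs]
    return vocab, rows


def matching_dissimilarity(a, b):
    """Number of positions where two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Invented example users; each string lists the top accounts a user follows
docs = ['NASA elonmusk', 'NASA PlayStation', 'PlayStation Xbox']
vocab, rows = binary_encode(docs)

print(vocab)
print(rows)
print(matching_dissimilarity(rows[0], rows[1]))  # users sharing one account
print(matching_dissimilarity(rows[0], rows[2]))  # users sharing no accounts
```

Users who follow more of the same accounts end up closer together. Shared absences also count as matches under this measure, which is worth keeping in mind when most entries are 0.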

In [None]:
# %% K-modes clustering of the most popular friends among USTWO Games followers

def id_to_name(doc, profiles):
    """Translate a space-separated string of user ids into their screen names."""
    tokens = doc.split()
#    names = [profiles[profiles['id']==int(id_token)]['name'].item() + '(' +
#             profiles[profiles['id']==int(id_token)]['screen_name'].item() +
#             ')' for id_token in tokens]
    names = [profiles[profiles['id']==int(id_token)]['screen_name'].item() for id_token in tokens]
    translated_doc = ' '.join(names)
    return translated_doc


# Generate list of user profiles of interest in clustering (ordered by popularity)
ID_LIMIT = 75
top_friends = pd.read_csv('ustwo_sampled_fol_friends.csv')
top_friends = top_friends.sort_values(by='popularity_count', ascending=False)
top_friends = top_friends.iloc[1:ID_LIMIT] # skip position 0, the USTWO account that every follower follows; keeps ID_LIMIT - 1 accounts

# Check each follower's friend ids against the collection above
friends = pd.read_csv('ustwo_sampled_friends_ids.csv').drop_duplicates()
friends['is_top_friend'] = friends['friend_id'].isin(top_friends['id'])

# Convert ids to strings. 'friend_name' is a space-separated string of the screen names of all of a user's top friends
friends_select = friends.loc[friends['is_top_friend'], ['user_id', 'friend_id']].copy()  # copy avoids SettingWithCopyWarning
friends_select['friend_id'] = friends_select['friend_id'].astype(str)
friends_feat = friends_select.groupby(by='user_id')['friend_id'].apply(' '.join).reset_index()
friends_feat['friend_name'] = friends_feat['friend_id'].map(lambda doc: id_to_name(doc, top_friends))

#=============================================

FULL_PATH = os.path.normpath('c:\\Users\\Vincent\\Game-Revenant\\Shadows\\ustwo_sampled_friend_fol_cluster')
FILE_PREFIX = 'kmode_idlimit-' + str(ID_LIMIT) + '_'

# One binary column per top account: 1 if the user follows it, 0 otherwise
cv = CountVectorizer(min_df=1, max_df=1.0, binary=True)
cat_matrix = cv.fit_transform(friends_feat['friend_name'])

NUM_CLUSTERS_RNG = 20  # exclusive upper bound: fits 2 through 19 clusters

for n_clusters in range(2, NUM_CLUSTERS_RNG):
    kmod = KModes(n_clusters=n_clusters, init='Huang', random_state=42, n_jobs=-1)
    kmod.fit(cat_matrix.toarray())  # convert the sparse count matrix to a dense array
    friends_feat['cluster'] = kmod.labels_
    # Save one CSV of cluster assignments per cluster count
    filename = FILE_PREFIX + str(n_clusters) + '.csv'
    save_loc = os.path.join(FULL_PATH, filename)
    friends_feat.to_csv(save_loc, index=False)
    print('finished n_clusters = ' + str(n_clusters))
          
          
    
   
# # %% 
# results = results.rename(columns={'cluster':'cluster_friends', 'user_id':'id'}) 

# df = pd.merge(sampled_users, results, on='id')

# CLUSTER_FRIEND = 2
# vocab = gen_vocab_list(df[df['cluster_friends']==CLUSTER_FRIEND]['clean_desc_en'].dropna())
# print(Counter(vocab).most_common(20))
