## Motivation

Proper audience targeting is a critical component of a successful advertising campaign. This notebook aims to identify customer demographics that may be particularly receptive to Game Revenant's (GR) advertising. Analysis is performed on unstructured social media data, specifically from Twitter. Compared to other social media, Twitter's API allows greater access to user data, enabling more effective data mining. 

GR currently has too few followers on Twitter to allow for productive data mining. However, the Twitter accounts of rival companies can be analyzed as well. *Where Shadows Slumber* (WSS) is frequently compared to *Monument Valley*, a mobile puzzle game produced by USTWO Games, by game critics and customer reviewers. As of this writing, @ustwogames has over 126k followers. 

Customer interests were estimated by 1) examining the profile descriptions of @ustwogames followers, and 2) examining the most popular friends among @ustwogames followers. In Twitter's official terminology, a 'friend' is an account a user follows. Preprocessed descriptions were examined via two clustering approaches, K-modes and DBSCAN. The most popular friends were clustered via K-modes. The rationale behind the unorthodox application of K-modes will be explained later in the notebook. 

## Methods and Results

The Twitter profiles of @ustwogames followers were mined via the Python package Tweepy and written to a SQLite database (refer to *pull_data_twitter.py* and *sqlite_fx.py* for the script and function code). 

In [2]:
# Autoreload to accommodate script updates without restarting notebook
%load_ext autoreload
%autoreload 2
# Move to main directory of the Customer-Segmentation project
%cd ..

C:\Users\Vincent S\game-revenant\Customer-Segmentation


In [3]:
import pandas as pd
import sqlalchemy as sa
from pathlib import Path

DB_NAME = 'customer-segmentation'
TAB_NAME = 'ustwo_followers'

# Pull interim data from SQLite DB
e = sa.create_engine('sqlite:///./data/interim/' + DB_NAME + '.sqlite')
query = 'SELECT * FROM ' + TAB_NAME
users = pd.read_sql_query(query, e)

users.head()

Unnamed: 0,index,id,id_string,name,screen_name,location,url,description,protected,verified,followers_count,friend_count,listed_count,favourites_count,statuses_count,created_at,default_profile,default_profile_image
0,0,1114109046792032256,1114109046792032256,Hiyoru,Hiyoru6,,,so para fotos de desenhos,0,0,1,21,0,139,5,Fri Apr 05 10:14:30 +0000 2019,1,0
1,1,1120790192128954371,1120790192128954371,Tom Baines 🏳️‍🌈 🇪🇺,TomBaines16,"North West, England",,Extreme sports calendar model. \nKeeping retro...,0,0,29,169,0,961,456,Tue Apr 23 20:42:59 +0000 2019,0,0
2,2,113652889,113652889,Marmalade Games,MarmaladeGames,"London, UK",http://t.co/sPUf5LShHE,,0,0,506,262,13,304,471,Fri Feb 12 15:07:07 +0000 2010,0,0
3,3,363116063,363116063,MIT SHAH,mitshah97,Ahmedabad,,#Unity3d #Game #Developer,0,0,39,279,1,22,13,Sat Aug 27 15:13:29 +0000 2011,1,0
4,4,972169508,972169508,Kyrie E.H.C.,KyrieEHC,"Madison, WI",https://t.co/ajE03P8x2g,Still believes in the warmth in interaction. S...,0,0,333,1103,18,1118,734,Mon Nov 26 15:27:59 +0000 2012,0,0


In [4]:
print(str(len(users)) + ' @ustwogames follower profiles were mined')

125925 @ustwogames follower profiles were mined


### Text wrangling of profile descriptions

Although a number of profile fields were mined, only the profile description and the identifying user id are of interest in the following analysis. 

Prior to analysis, unstructured text data must be preprocessed. Punctuation, emojis, and excess white space were removed. All words were converted to lowercase and tokenized. Stopwords were removed. The NLTK stopword list was extended by several words that were found to be common in profile descriptions, but did not add any useful information about user interests. 

In [5]:
import nltk
import re
import numpy as np

def normalize_doc(doc, stop_words):
    # replace every character that is not an ASCII letter or whitespace with a space.
    # This filters out non-Latin scripts!
    # (flags must be passed by keyword: the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc, flags=re.ASCII)
    # remove single characters
    doc = re.sub(r'\b[a-zA-Z]\b', '', doc, flags=re.ASCII)
    # remove whitespace at beginning and end of string
    doc = doc.strip()
    # convert all characters to lowercase
    doc = doc.lower()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # recreate doc from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

stop_words = nltk.corpus.stopwords.words('english')
stop_words.extend(['co', 'https', 'http', 'gmail', 'com', 'like', 'love'])

users['clean_desc'] = users['description'].map(lambda doc: normalize_doc(doc, stop_words) 
                                                if doc is not None
                                                else np.nan)

users[['description','clean_desc']].head()

Unnamed: 0,description,clean_desc
0,so para fotos de desenhos,para fotos de desenhos
1,Extreme sports calendar model. \nKeeping retro...,extreme sports calendar model keeping retro al...
2,,
3,#Unity3d #Game #Developer,unity game developer
4,Still believes in the warmth in interaction. S...,still believes warmth interaction studied game...


For now, only profile descriptions in English were analyzed. Language identification of short text is challenging, and the more common language identification libraries (e.g. langid) misclassified certain profile descriptions. Facebook's fastText library offers an alternative that has shown relatively high accuracy on short-text language classification (http://alexott.blogspot.com/2017/10/evaluating-fasttexts-models-for.html). 

English profiles were lemmatized using the NLP library spaCy, completing the text preprocessing stage of the analysis pipeline. Lemmatization can take several minutes, so the updated DataFrame was written to disk (a SQLite database in the cells below). 

In [13]:
import os
import fasttext
import spacy

nlp = spacy.load('en')  # load the small English model (the 'en' shortcut and '-PRON-' lemma are spaCy 2.x conventions)

def lemmatize_doc(doc):
    doc = nlp(doc)
    doc = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
    return doc

# import and apply FastText model for language identification
idlang_path = 'data/external/fasttext_training_data/lid.176.bin'
idlang_model = fasttext.FastText.load_model(idlang_path)

users['lang_id'] = users['clean_desc'].map(lambda doc: idlang_model.predict(doc) 
                                            if doc == doc 
                                            else np.nan)

# only English profiles are lemmatized; empty profiles are also ignored
users['clean_desc_en'] = users.apply(lambda row: lemmatize_doc(row['clean_desc']) 
                                        if  row['lang_id'] == row['lang_id']
                                        and row['lang_id'][0][0] == '__label__en'
                                        and row['clean_desc'] == row['clean_desc'] 
                                        else np.nan,
                                        axis=1)


ModuleNotFoundError: No module named 'fasttext'

In [19]:
# Push processed data into SQLite DB. Non-English or empty descriptions are stored as NULL
# File name matches the one read back by the clustering cells ('customer-segmentation-clean.sqlite')
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '-clean.sqlite')

cleaned_tab = TAB_NAME + '_clean'
users[['id', 'clean_desc_en']].to_sql(cleaned_tab, e)

users[['description','clean_desc','clean_desc_en']].head()

Unnamed: 0,description,clean_desc,clean_desc_en
0,so para fotos de desenhos,para fotos de desenhos,
1,Extreme sports calendar model. \nKeeping retro...,extreme sports calendar model keeping retro al...,extreme sport calendar model keep retro alive ...
2,,,
3,#Unity3d #Game #Developer,unity game developer,unity game developer
4,Still believes in the warmth in interaction. S...,still believes warmth interaction studied game...,still believe warmth interaction study game cu...


### Clustering profile descriptions

After preprocessing, the profile descriptions were clustered via two approaches, K-modes and DBSCAN. 

#### K-modes clustering

Given the short length of a typical profile, a word will typically appear no more than once within a given profile description. **It would thus be more appropriate to vectorize profile descriptions as binary categorical vectors, as opposed to numeric vectors.** Profile descriptions were count vectorized as binary categorical data, where a value of *1* indicates that a given word is present in a given profile description.  
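
The binary encoding described above can be illustrated with a small dependency-free sketch. The helper `binary_vectorize` and the toy descriptions are hypothetical and only mimic, for unigrams, what a count vectorizer in binary mode does; the notebook itself relies on scikit-learn's `CountVectorizer`.

```python
def binary_vectorize(docs, min_df=1):
    """Encode whitespace-tokenized docs as 0/1 word-presence vectors."""
    # Document frequency: number of docs that contain each word at least once
    doc_freq = {}
    for doc in docs:
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Keep only words that occur in at least min_df documents
    vocab = sorted(word for word, df in doc_freq.items() if df >= min_df)
    rows = []
    for doc in docs:
        present = set(doc.split())
        rows.append([1 if word in present else 0 for word in vocab])
    return vocab, rows

# Toy descriptions (made up for illustration)
toy_docs = ['game developer', 'indie game game designer', 'designer']
vocab, matrix = binary_vectorize(toy_docs)
print(vocab)
for row in matrix:
    print(row)
```

Note that the repeated word in the second toy description still produces a single *1*: the encoding records presence, not frequency.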


K-means is likely the most popular approach to clustering. However, its use is limited to numerical data. The K-modes algorithm was developed as an analog to K-means for categorical data (Huang 1997, 1998). Like K-means, the K-modes algorithm requires that the number of clusters be predetermined.  
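
As a rough illustration of how K-modes differs from K-means, the sketch below replaces the Euclidean distance with a simple mismatch count (Hamming dissimilarity) and the cluster mean with the per-column mode. It is a simplified sketch, not the kmodes library's implementation: the function names are hypothetical, and the initial modes are passed in by hand rather than chosen by Huang's initialization.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions in which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column of a list of equal-length rows."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

def kmodes(rows, init_modes, max_iter=10):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = [list(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under Hamming dissimilarity
        new_labels = [min(range(len(modes)), key=lambda k: hamming(row, modes[k]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change either
        labels = new_labels
        # Update step: each mode becomes the column-wise mode of its members
        for k in range(len(modes)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:
                modes[k] = column_modes(members)
    return labels, modes

# Toy binary data: two visibly different groups of rows (made up for illustration)
rows = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
        [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]
labels, modes = kmodes(rows, init_modes=[rows[0], rows[3]])
print(labels)
print(modes)
```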

The kmodes Python library was used to perform K-modes clustering. No optimal number of clusters was assumed a priori. Consequently, K-modes clustering was executed for a range of cluster numbers (2 through 20; the cell below, as configured, covers 2 through 10). 

In [6]:
import os
from sklearn.feature_extraction.text import CountVectorizer
from kmodes.kmodes import KModes

DB_NAME = 'customer-segmentation'
TAB_NAME = 'ustwo_followers_clean'  # table written by the preprocessing cell above

# Pull processed data from SQLite DB
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '-clean.sqlite')

query = 'SELECT id, clean_desc_en FROM ' + TAB_NAME
users = pd.read_sql_query(query, e)
# keep only profiles with a valid (English, non-empty) description
desc_valid = users[['id', 'clean_desc_en']].dropna()

# FULL_PATH = os.path.normpath('c:\\Users\\Vincent\\Game-Revenant\\Shadows\\ustwo_desc_cluste') # save location
FULL_PATH = Path('.models/usto_fol_desc_kmodes')
MIN_DF = 100 # minimum document frequency. 75, 200, and 400 were also tried
FILE_PREFIX = 'kmode_mindf-' + str(MIN_DF) + '_'

cv = CountVectorizer(ngram_range=(1, 2), min_df=MIN_DF, max_df=1.0,
                     stop_words=stop_words, binary=True)
cat_matrix = cv.fit_transform(desc_valid['clean_desc_en'])

NUM_CLUSTERS_RNG = 11

for n_clusters in range(2, NUM_CLUSTERS_RNG):
    results = desc_valid.copy()  # copy so each run starts from the unlabeled data
    kmod = KModes(n_clusters=n_clusters, init='Huang', random_state=42, n_jobs=-1)
    # fit on the dense binary matrix; fit_predict returns one cluster label per profile
    results['cluster'] = kmod.fit_predict(cat_matrix.toarray())
    # clustering results saved to a csv for easier comparison between number of clusters
    filename = FILE_PREFIX + str(n_clusters) + '.csv'
    save_loc = os.path.join(FULL_PATH, filename)
    results.to_csv(save_loc, index=False)
    print('processed cluster #' + str(n_clusters))

ModuleNotFoundError: No module named 'kmodes'


The K-modes algorithm will force the data to be organized into the predefined number of clusters. Clusters are not guaranteed to be generated in a meaningful way. 

In order to gauge the distinctiveness of the clusters, a vocabulary list of the most common unigrams, bigrams, and trigrams within the profile descriptions of a given cluster is generated (*gen_vocab_list*) and ordered by frequency of occurrence. The degree of similarity in the top vocabulary between two clusters is a measure of the distance between those clusters.
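
One way to turn the vocabulary comparison into a number is sketched below. It assumes a Jaccard overlap of the top-*n* vocabularies, which is an illustrative choice rather than a measure the notebook defines, and for brevity it counts unigrams only; `top_vocab` and `vocab_overlap` are hypothetical helper names.

```python
from collections import Counter

def top_vocab(corpus, n_top):
    """Return the n_top most frequent unigrams in a list of documents."""
    counts = Counter(word for doc in corpus for word in doc.split())
    return [word for word, _ in counts.most_common(n_top)]

def vocab_overlap(corpus_a, corpus_b, n_top=10):
    """Jaccard overlap of two top vocabularies: 0 = disjoint, 1 = identical."""
    top_a = set(top_vocab(corpus_a, n_top))
    top_b = set(top_vocab(corpus_b, n_top))
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)

# Toy clusters (made up for illustration)
cluster_a = ['game developer', 'indie game developer', 'game artist']
cluster_b = ['music producer game', 'music fan', 'game music']
overlap = vocab_overlap(cluster_a, cluster_b, n_top=2)
print(round(overlap, 2))
```

An overlap close to 1 would suggest that two clusters share most of their top vocabulary and are therefore hard to tell apart, while a value near 0 points to distinct clusters.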

In [17]:
from collections import Counter

# function generates a vocab list of unigrams, bigrams, and trigrams found in the corpus of profile descriptions
def gen_vocab_list(corpus):
    unigrams = [words for doc in corpus for words in doc.split()]
    ngrams = [bigram for doc in corpus for bigram in nltk.ngrams(doc.split(), 2)]
    bigrams = [token[0] + ' ' + token[1] for token in ngrams]
    ngrams = [bigram for doc in corpus for bigram in nltk.ngrams(doc.split(), 3)]
    trigrams = [token[0] + ' ' + token[1] + ' ' + token[2] for token in ngrams]
    vocab = unigrams + bigrams + trigrams
    return vocab

# function to order keywords based on their frequency in the corpus
# the min_freq argument sets a minimum frequency for a vocabulary word to be considered a keyword
def get_freq_keywords(corpus, min_freq):
    unigrams = [words for doc in corpus for words in doc.split()]
    key_counter = Counter(unigrams).most_common()
    keywords = [key[0] for key in key_counter if key[1] >= min_freq]
    return keywords

# function to detect keywords present in a document (a profile description)
def detect_keywords(doc, keywords):
    valid_words = [word for word in doc.split() if word in keywords]
    valid_words = ' '.join(valid_words)
    if not valid_words:
        valid_words = np.nan
    return valid_words

vocab = gen_vocab_list(users['clean_desc_en'].dropna())
keywords = get_freq_keywords(users['clean_desc_en'].dropna(), 500)

# isinstance check skips both None (from SQL) and NaN (from csv) descriptions
users['keywords'] = users['clean_desc_en'].map(lambda doc: detect_keywords(doc, keywords) 
                                                if isinstance(doc, str) 
                                                else np.nan)

Cluster 2

In [None]:
NUM_CLUSTER = 9
N_KEYWORDS = 10
N_DESC = 10

PATH = os.path.normpath('c:\\Users\\Vincent\\Game-Revenant\\Shadows\\ustwo_desc_cluster')
FILE_PREFIX = 'kmode_mindf-100_'
FULL_PATH = os.path.join(PATH, FILE_PREFIX + str(NUM_CLUSTER) + '.csv')
desc_valid = pd.read_csv(FULL_PATH).dropna()
n_total = len(desc_valid)

#results.rename(columns={'user_id': 'id'})
desc_cluster = desc_valid.drop(['clean_desc_en'], axis=1)
df = pd.merge(users, desc_cluster, on='id', how='right')

#df = pd.merge(users, desc_valid, on='id', how='right') # users v results originally
for cluster in range(0, NUM_CLUSTER):
    # filter on the merged frame's own cluster column so the boolean mask and rows share an index
    corpus = df.loc[df['cluster'] == cluster, ['description','clean_desc_en','keywords']]
    n_corpus = len(corpus)
    vocab = gen_vocab_list(corpus['clean_desc_en'])
    print('Cluster: ' + str(cluster))
    print(str(n_corpus) + ' (' + str(round(n_corpus/n_total*100)) + '%) users with valid description in this cluster')
    print(Counter(vocab).most_common(N_KEYWORDS))
    # sample at most N_DESC rows; small clusters may hold fewer
    print(corpus[['description','keywords']].sample(n=min(N_DESC, n_corpus)))
    print('\n')

#### DBSCAN clustering

DBSCAN (density-based spatial clustering of applications with noise) is a density-based clustering algorithm. Unlike K-modes, which will force each data point to be assigned to a cluster, DBSCAN will mark data in low-density regions as outliers. 

Descriptions were vectorized by L2-normalized term frequency, without inverse-document-frequency weighting. 
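
Normalization matters for the choice of DBSCAN's `eps`. For two unit-length vectors, the squared Euclidean distance equals 2 - 2·cos, where cos is their cosine similarity. Since term frequencies are non-negative, the distance between any two descriptions therefore cannot exceed √2 ≈ 1.41. The dependency-free sketch below checks this on toy descriptions; the helper names are hypothetical and the vectors are plain dictionaries rather than the sparse matrix used in the notebook.

```python
import math
from collections import Counter

def l2_term_freq(doc):
    """Term-frequency vector of a doc, scaled to unit Euclidean length."""
    counts = Counter(doc.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty description: no terms to weight
    return {word: c / norm for word, c in counts.items()}

def cosine(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in words))

# Toy descriptions (made up for illustration)
a = l2_term_freq('indie game developer')
b = l2_term_freq('game artist')     # shares one word with a
c = l2_term_freq('music producer')  # shares no word with a

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)
print(round(d_ab, 3), round(d_ac, 3), round(math.sqrt(2), 3))
```

Under this identity, an `eps` of 1.2 corresponds to a cosine similarity of at least 1 - 1.2²/2 = 0.28 between neighboring descriptions. Whether that threshold is appropriate depends on the data.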

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

DB_NAME = 'customer-segmentation-clean'
TAB_NAME = 'ustwo_followers_clean'

# Pull processed data from SQLite DB
e = sa.create_engine('sqlite:///./data/processed/' + DB_NAME + '.sqlite')

#SQL query includes a WHERE statement that filters out NA values and empty strings
query = 'SELECT id, clean_desc_en FROM ' + TAB_NAME + ' WHERE clean_desc_en != \'\''
users = pd.read_sql_query(query, e)
N_DESC = 10
N_KEYWORDS = 10

tf = TfidfVectorizer(use_idf=False)
tf_matrix = tf.fit_transform(users['clean_desc_en']) #tf is normalized as opposed to cv

dbscan = DBSCAN(eps=1.2, min_samples=1000)

y_pred = dbscan.fit_predict(tf_matrix)
users['cluster'] = y_pred  # label -1 marks noise points (outliers)
results = users  # every row pulled by the query has a valid description

n_total = len(results)
pd.set_option('display.max_colwidth', None)  # show full descriptions (-1 is deprecated)
for cluster in np.unique(dbscan.labels_):
    df = users.loc[users['cluster'] == cluster]
    n_corpus = len(df)
    vocab = gen_vocab_list(df['clean_desc_en'])
    print('Cluster: ' + str(cluster))
    print(str(n_corpus) + ' (' + str(round(n_corpus/n_total*100)) + '%) users with valid description in this cluster')
    print(Counter(vocab).most_common(N_KEYWORDS))
    # sample at most N_DESC rows; small clusters may hold fewer
    print(df[['clean_desc_en']].sample(n=min(N_DESC, n_corpus)))
    print('\n')

results.to_csv('dbscan_game_cluster.csv', index=False)

Cluster: -1
14653 (30%) users with valid description in this cluster
[('world', 391), ('founder', 381), ('work', 380), ('tweet', 337), ('get', 328), ('new', 327), ('fan', 316), ('tech', 315), ('team', 303), ('good', 299)]
                                                                                   clean_desc_en
46921  vivo base de wifi mutant proud                                                           
42782  engineer poet optimist hu wbs old boy                                                    
5403   sir hunt signal lva currently work sci fi horror light keep us safe buy wishlist cvxwfdns
28083  dantdm squid nugget                                                                      
47036  hail fly baby elephant uranus                                                            
26358  designzz ag member ag part owner                                                         
8778   angry pescatarian asu                                                                    
23

In [None]:
# Push clustering results into SQLite DB
DB_RESULTS_NAME = 'customer_segmentation_cluster'
TAB_RESULTS_NAME = 'ustwo_followers_'
e = sa.create_engine('sqlite:///./models/processed/' + DB_RESULTS_NAME + '.sqlite')

users[['id', 'clean_desc_en', 'cluster']].to_sql(TAB_RESULTS_NAME, e)

In [None]:
users = pd.read_csv('preprocess_ustwo_users.csv')

len(users['clean_desc_en'].dropna())



