# Clustering Analysis Notebook

#### This notebook demonstrates the tools needed to cluster Twitter data.

In [1]:
import process as proc
import clustering as cluster
#TODO: move this to a TRT Python file
import org_research as org 

[nltk_data] Downloading package wordnet to /Users/g/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction import text 

## PREPROCESSING

### Variables for Analysis

In [49]:
# Set the path to the parent directory containing all Tweets of interest
DIRECTORY = './data/test/*'
# Set to True to keep only English-language tweets
ENGLISH = True

### Load Tweets and Generate DataFrame

In [50]:
tweet_objects = proc.loadTweetObjects(DIRECTORY)
df = proc.convertTweetsToDataframe(tweet_objects, ENGLISH)

Initial size: 183735
Dropping duplicates...
Final size: 121727
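The internals of `proc.loadTweetObjects` and `proc.convertTweetsToDataframe` are not shown in this notebook. As a rough sketch of what a load, filter, and de-duplicate step can look like, the example below assumes one JSON tweet per line and the `id_str` and `lang` fields of the classic Twitter tweet object. The function names are hypothetical and are not part of the `process` module.

```python
import glob
import json


def load_tweet_objects(pattern):
    """Read one JSON tweet per line from every file matching the glob pattern.

    Assumption: files are newline-delimited JSON; the real loader may differ.
    """
    tweets = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    tweets.append(json.loads(line))
    return tweets


def filter_and_dedupe(tweets, english_only=True):
    """Keep the first tweet seen for each id, optionally only English ones."""
    seen_ids = set()
    kept = []
    for tweet in tweets:
        if english_only and tweet.get("lang") != "en":
            continue
        tweet_id = tweet.get("id_str")
        if tweet_id in seen_ids:
            continue
        seen_ids.add(tweet_id)
        kept.append(tweet)
    return kept


# Tiny in-memory sample so the sketch runs without any data files
sample = [
    {"id_str": "1", "lang": "en", "text": "hello"},
    {"id_str": "1", "lang": "en", "text": "hello"},
    {"id_str": "2", "lang": "es", "text": "hola"},
]
unique = filter_and_dedupe(sample)
print("Initial size:", len(sample))
print("Final size:", len(unique))
```

De-duplicating on the tweet id is only one possible choice; the notebook's own helper may compare on text or another column.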


### Extract Potential Cashtags

In [51]:
ctdf = proc.extractPossibleCashtags(df)

Total potential Cashtags: 1098
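How `proc.extractPossibleCashtags` identifies candidates is not shown here. One common approach is a regular expression that looks for a dollar sign followed by a short run of letters. The pattern and the `extract_cashtags` helper below are illustrative assumptions, not the notebook's implementation.

```python
import re
from collections import Counter

# Assumed pattern: "$" followed by 1-6 letters, not preceded by a word character.
# Amounts such as "$12" do not match because digits are excluded.
CASHTAG_RE = re.compile(r"(?<!\w)\$([A-Za-z]{1,6})\b")


def extract_cashtags(texts):
    """Count candidate cashtags (upper-cased) across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        for symbol in CASHTAG_RE.findall(text):
            counts[symbol.upper()] += 1
    return counts


sample_texts = [
    "Watching $AAPL and $tsla today",
    "Lunch cost $12 but $AAPL is up",
]
cashtag_counts = extract_cashtags(sample_texts)
print("Total potential Cashtags:", len(cashtag_counts))
```

These are only potential cashtags: a pattern like this also picks up ordinary words written after a dollar sign, so the candidates still need manual review.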


### Removing Noisy Tweets

In [109]:
'''
*** List terms that identify noisy tweets so those tweets can be filtered out.
*** Enter the terms in the noisy_terms list below.
'''
noisy_terms = ['foxnews','foxandfriends','abc','nbc','cbs','cnn',
               'nowthisnews','foxbusiness','reuters','youtube',
               'saveshadowhunters','pickupshadowhunters','motoshowtime',
               'wsj','nytimes','nyt']
cldf = proc.removeNoisyTerms(df, noisy_terms)

Removed 58415 noisy terms.
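The filtering logic inside `proc.removeNoisyTerms` is not shown. A minimal sketch of the idea, assuming a case-insensitive substring match against the tweet text, might look like the following. The `remove_noisy_tweets` helper is hypothetical.

```python
def remove_noisy_tweets(texts, noisy_terms):
    """Drop every text that contains any noisy term (case-insensitive substring).

    Returns the kept texts and the number of texts removed.
    """
    lowered_terms = [term.lower() for term in noisy_terms]
    kept = [
        text
        for text in texts
        if not any(term in text.lower() for term in lowered_terms)
    ]
    return kept, len(texts) - len(kept)


sample_texts = [
    "Saw this on @CNN today",
    "My phone bill is wrong again",
    "ABC report on the storm",
]
kept, removed = remove_noisy_tweets(sample_texts, ["cnn", "abc"])
print("Removed", removed, "noisy tweets.")
```

Substring matching is aggressive: a short term such as `abc` or `nyt` also matches inside longer words and handles. If that removes too much, matching on whole tokens or on mentions only is a stricter alternative.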


### Remove Retweets

In [110]:
cldf = proc.removeRetweets(cldf)
print("Cleaned Tweets: " + str(cldf.shape[0]))

Removed 905 duplicates.
Cleaned Tweets: 61719
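The output above reports removed duplicates, which suggests that `proc.removeRetweets` collapses repeated text, but its exact rules are not shown. The sketch below is one plausible reading under two assumptions: texts beginning with `RT @` are retweets, and repeated texts are kept once. The `remove_retweets` helper is hypothetical.

```python
def remove_retweets(texts):
    """Drop "RT @..." texts and repeated texts, keeping the first occurrence.

    Returns the kept texts and the number of texts removed.
    """
    seen = set()
    kept = []
    for text in texts:
        if text.startswith("RT @"):
            continue
        key = text.strip().lower()  # compare case-insensitively, ignoring outer whitespace
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept, len(texts) - len(kept)


sample_texts = [
    "RT @user: big news",
    "big news",
    "Big news ",
    "something else",
]
kept, removed = remove_retweets(sample_texts)
print("Cleaned Tweets:", len(kept))
```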


## CLUSTERING

In [111]:
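# Extra tokens to ignore during clustering
additional_stop_words = ['1','2','3','4','5','6','7','8','9','wa','ha',
                         'trump','realdonaldtrump','rt','amp','doe','le',
                         'change','greeting','regard','en','el','la','te',
                         'por','se','ya','10']
# Combine scikit-learn's English stop words with the org_research list and the extras above
stop_words = text.ENGLISH_STOP_WORDS.union(org.STOP_WORDS)
stop_words = stop_words.union(additional_stop_words)
# Placeholder for temporary, experiment-specific stop words (empty here)
temp_stop_words = []
stop_words = stop_words.union(temp_stop_words)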

In [116]:
'''
*** Clustering depends on several tuning parameters.
*** They are defined below and should be set based on your project.
'''
n_FEATURES = 300
n_TOPICS = 12
n_TOP_WORDS = 10
n_TOP_TWEETS = 15
NGRAM = 1

In [117]:
tfidf, tfidf_feature_names = cluster.tfidf(cldf, n_FEATURES,
                                           NGRAM, stop_words)

In [118]:
km, kmeans_embedding = cluster.KMeans(tfidf, n_TOPICS)

Initialization complete
Iteration  0, inertia 46063.003
Iteration  1, inertia 44669.655
Iteration  2, inertia 44399.077
Iteration  3, inertia 44327.823
Iteration  4, inertia 44301.342
Iteration  5, inertia 44295.699
Iteration  6, inertia 44295.573
Iteration  7, inertia 44295.536
Iteration  8, inertia 44295.513
Iteration  9, inertia 44295.496
Iteration 10, inertia 44295.486
Converged at iteration 10: center shift 0.000000e+00 within tolerance 2.727324e-07


In [119]:
cluster.printClusterResults(cldf, km, tfidf, tfidf_feature_names,
                            n_TOP_WORDS, n_TOP_TWEETS, n_TOPICS)

Topic 0:
send
 message
 dm
 http
 email
 account
 assist
 address
 happy
 information

1) @VodafoneIN Simply copy pasting same message instead of solving the issue....
2) @MarisaW79253706 Sorry for they delay. We’ve received a lot more messages than usual the last little while so it’s taking a bit more time for us to respond. I have replied to your DM though. ~Talia
3) @MainGinger13 We're sorry you're experiencing this, Allison. This isn't something we'd expect with our Non-slip white strips, so when you have time, please send us a DM. We'd be happy to help you with this further.
4) @U2Brad This is not the kind of feedback we want to receive from our loyal customers. This promotion also is for existing customers adding a line. Please send us a DM for more information. -EC https://t.co/rMApsV8PQY
5) @Ingemeg78 Hello Gerardo, thank you for contacting us. In this case we need you to fill all the data requested, since your travelling to the USA and there's some information which is mandato