## TweetGroups

Twitter is an integral part of marketing and can’t be ignored.  Twitter interactions are not only a good metric for tracking a marketing campaign’s performance; they can also drive the success or failure of a product or brand.

In recent years we have all seen examples of bad tweets that have ruined reputations and tarnished brands, so making sure that your company's tweets are thoroughly planned is essential.  The process of creating and maintaining an effective presence on Twitter is a complex one, but TweetGroups can help you get started.

#### TweetGroups helps to answer two fundamental marketing questions:
1.  What are our market segments (groups of customers)?
2.  How do we engage these customers?

### Why is Dimensionality Reduction Important?

The text written on social media can be random and arbitrary, and it contains a wide variety of tokens (words, phrases, emojis).  Without a way to reduce the resulting high-dimensional token matrices, you can run into performance issues and clustering may be difficult.  This is particularly important in our case, where we are using clustering algorithms that do not need an `n_topics` parameter (e.g. Affinity Propagation, DBSCAN, Mean Shift).
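To make the problem concrete, here is a minimal sketch of how a token count matrix grows one column per distinct token, and how a truncated SVD can project it down to a few dimensions. The toy tweets, the whitespace tokenizer, and the choice of `k = 2` are assumptions for illustration only; truncated SVD is one common reduction technique and is not necessarily the one TweetGroups uses.

```python
from collections import Counter

import numpy as np

# Toy stand-ins for tweets (hypothetical data, not the GoPro tweet list).
tweets = [
    "love my new camera",
    "new camera love it",
    "surfing video looks great",
    "great surfing video today",
]

# Build the vocabulary: every distinct token becomes one column.
vocab = sorted({tok for tweet in tweets for tok in tweet.split()})
col = {tok: j for j, tok in enumerate(vocab)}

# Document-term count matrix: rows are tweets, columns are tokens.
counts = np.zeros((len(tweets), len(vocab)))
for i, tweet in enumerate(tweets):
    for tok, n in Counter(tweet.split()).items():
        counts[i, col[tok]] = n

# Truncated SVD: keep only the k strongest components (k is an assumption).
k = 2
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :k] * S[:k]  # each tweet is now a k-dimensional vector

print(counts.shape, "->", reduced.shape)
```

With real tweets the vocabulary runs to thousands of columns, so the reduced vectors are what a clustering algorithm such as DBSCAN would typically be given.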



### Text Pre-processing with spaCy

For the text processing step we chose the Natural Language Processing library spaCy because of a few advantages over other libraries:

1.  Performance: spaCy is written in Cython and contains a wide array of NLP functions that can be parallelized and execute quickly

2.  Flexibility: spaCy has functions that are particularly useful for social media data, including the ability to tokenize emojis, part-of-speech tagging, and named entity recognition

3.  Usability: spaCy code is clear, concise, well-documented and actively supported

Let's show a quick performance comparison:

In [20]:
import time
import spacy
import timeit
import textacy
# %timeit is applied to individual statements in the cells below

In [17]:
# Loading a pickled list of tweets
import pandas as pd
tweet_list = pd.read_pickle('tweet_list_gopro.pkl')
# this list contains about 14000 tweets

The slowest run took 4.34 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 6.3 ms per loop


textacy is a text-processing package built on top of spaCy; here it tokenizes the tweets into a corpus.

In [23]:
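# Variables assigned inside %timeit are not kept, so time the call first
# and then build the corpus once more to inspect it.
%timeit textacy.TextCorpus.from_texts('en', tweet_list)
corpus = textacy.TextCorpus.from_texts('en', tweet_list)
corpus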

1 loop, best of 3: 12.1 s per loop


TextCorpus(2510 docs; 232695 tokens)

In [24]:
import textblob

In [30]:
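def tokenize_textblob(docs):
    # Wrap each tweet in a TextBlob. TextBlob tokenizes lazily (on first
    # access to .words or .tokens), so this builds the blobs without
    # tokenizing them yet.
    return [textblob.TextBlob(doc) for doc in docs]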

In [31]:
%timeit tokenize_textblob(tweet_list)

10 loops, best of 3: 112 ms per loop
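Outside a notebook, where the `%timeit` magic is not available, the same best-of-3 measurement can be sketched with the standard-library `timeit` module. The sample documents and the `tokenize_whitespace` function are hypothetical placeholders; swap in the real tweet list and the tokenizer call being compared.

```python
import timeit

# Hypothetical stand-in data; replace with the real list of tweets.
docs = ["just got a new camera", "surfing video looks great"] * 500


def tokenize_whitespace(texts):
    """Split each text on whitespace (a placeholder tokenizer)."""
    return [text.split() for text in texts]


# Run the call `number` times per trial and repeat for 3 trials, then
# report the fastest trial, which is what "best of 3" refers to.
number = 10
trials = timeit.repeat(lambda: tokenize_whitespace(docs), repeat=3, number=number)
best_per_loop = min(trials) / number

print(f"{number} loops, best of 3: {best_per_loop * 1e3:.3f} ms per loop")
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to slow down a single trial.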


In [32]:
# Here I can walk through the process using sklearn tools

# from sklearn.feature_extraction.text import CountVectorizer
# count_vectorizer = CountVectorizer()