# "Jeep - The Middle" User Clustering
## Background Information
In one of the most politically divided climates in recent memory, Jeep was one of the few brands that attempted to wade into politics during Super Bowl LV. With the cost of Super Bowl ads remaining at all-time highs, addressing political and social issues carries a lot of inherent risk. The premise of Jeep's ad, ["The Middle"](https://driving.ca/auto-news/news/middling-the-internets-best-reactions-to-jeeps-super-bowl-lv-ad), was that there is a place in the middle where everyone can meet, Red or Blue, Democrat or Republican, and that America can be united again. In Jeep's own words: "to the ReUnited States of America".

### Response to the Ad
The response to the ad was varied, and overall the ad appears to have fallen short of Jeep's goal of reuniting the country. Social media backlash pointed to a politically divided audience that appeared to have no intent of "meeting in the middle". The campaign was further complicated by backlash against the ad's celebrity spokesperson, Bruce Springsteen. News broke that Springsteen had been arrested two months prior to the filming of "The Middle" for a [DUI](https://adage.com/article/cmo-strategy/jeep-pulls-bruce-springsteen-super-bowl-ad-after-news-his-dwi-arrest/2313336) (Driving Under the Influence) violation. Given Jeep's industry and its attempt to wade into the political atmosphere, pulling the ad was the prudent move, and Jeep did so.

## Analytics
### The Data
For its annual [*Game Day Analytics Challenge*](https://eccles.utah.edu/programs/undergraduate/game-day-ad-analytics/), the University of Utah collected 1.2 million tweets touching on 64 different ad campaigns. The analysis in this notebook is based on that dataset, found under `data/2021-all-Ads-tweets.csv`. To speed up loading and analysis, the data has been split and stored as `.feather` files under `data/clean_data`. If you have more questions about how the data was preprocessed, feel free to inspect the `./cleaning.ipynb` Jupyter notebook.

### Analysis Objective
Considering the political nature of Jeep's ad, the purpose of this analysis is to determine the groups (clusters) of users that engaged with Jeep because of the ad. The base premise is that the political nature of the ad would draw politically minded Twitter users to engage, which makes it worth analyzing what types of users did so. This is done by building a vector matrix of users' Twitter bios and clustering the bios based on their vector similarities. If you have more questions about the analysis and data preparation, feel free to check out the project [GitHub](https://github.com/drewipson/game_day_analytics).

### Project Dependencies
For a complete install of the project dependencies, I recommend using the `requirements.txt` file located at the root of this project by running the command `python3 -m pip install -r requirements.txt`. The dependencies used in this notebook are imported below. We'll use a combination of the seaborn, matplotlib.pyplot, and bokeh libraries for our visualizations.

In [1]:
import pandas as pd, json, logging, numpy as np, matplotlib.pyplot as plt, seaborn as sns, itertools as it, hdbscan, pickle, os
from classes.user_preprocessor import UserPreProcessor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from MulticoreTSNE import MulticoreTSNE as TSNE
from bokeh.plotting import figure, ColumnDataSource, show, output_notebook; output_notebook()
from bokeh.models import HoverTool

### Styling Our Visualizations & Importing Our User Data PreProcessor
The `UserPreProcessor` class is located in the `classes/user_preprocessor.py` file. It contains the data cleaner, stopword generator, and tweet tokenizer methods used by our TfidfVectorizer below. The better this preprocessing is, the better the model output will be later.

In [2]:
plt.style.use("bmh")
UPP = UserPreProcessor()
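
For readers who do not have the project's class at hand, the sketch below shows roughly what a preprocessor of this kind might do. It is a hypothetical stand-in, not the project's implementation: only the method names `replace_www`, `tweet_tokenizer`, and `generate_stopwords` come from this notebook, while the regular expressions, the lowercasing, and the stopword list are assumptions made for illustration.

```python
import re


class SketchPreProcessor:
    """Hypothetical stand-in for UserPreProcessor (not the project's code)."""

    # Assumed patterns: URLs, and word tokens that may start with # or @.
    URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
    TOKEN_PATTERN = re.compile(r"[#@]?\w+")

    def replace_www(self, text: str) -> str:
        # Lowercase the bio and swap every URL for a placeholder token.
        return self.URL_PATTERN.sub(" url ", text.lower())

    def tweet_tokenizer(self, text: str) -> list:
        # Keep hashtags and mentions intact as single tokens.
        return self.TOKEN_PATTERN.findall(text)

    def generate_stopwords(self) -> list:
        # A tiny illustrative list; a real one would be much longer.
        return ["a", "an", "and", "the", "of", "to", "with", "by", "url"]


spp = SketchPreProcessor()
cleaned = spp.replace_www("We amplify #HR voices. Provided by http://partwell.io")
stopwords = set(spp.generate_stopwords())
tokens = [t for t in spp.tweet_tokenizer(cleaned) if t not in stopwords]
print(tokens)
```

Keeping hashtags and mentions as single tokens is a design choice of this sketch; it lets tags such as `#ksleg` survive as features instead of being split into punctuation and text.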

### Building Our Data Structures
#### Handling Dataset Size
The original data set `.csv` file was over 6GB, so to improve processing time the data is stored as `.feather` files and read one file at a time. Using our `track_users` function, we will extract user data from the dataframe rows and add it to a `jeep_users` dictionary. Since we have 23 different files to go through, we will use the `os.listdir` method to read in each file, filter to the rows whose Ad Name is the Jeep ad, and load the results into a temporary dataframe. We will then loop through that dataframe and pass each row's user object to our `track_users` function to gather the data.

In [4]:
jeep_users = {}
def track_users(obj: dict):
    """Tally tweets per user in `jeep_users`, keyed by user id."""
    if obj['id'] in jeep_users:
        jeep_users[obj['id']]['count'] += 1
    else:
        # first tweet seen from this user: store the profile and start the tally at 1
        jeep_users[obj['id']] = {
            "screen_name": obj['screen_name'],
            "created_at": obj['created_at'],
            "description": obj['description'],
            "followers_count": obj['followers_count'],
            "location": obj['location'],
            "lang": obj['lang'],
            "verified": obj['verified'],
            "id_str": obj['id_str'],
            "count": 1
        }
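
The bookkeeping in `track_users` can be checked on a handful of made-up records. The sketch below is a simplified version that uses `collections.Counter`; the sample user objects are hypothetical, and it assumes the tally should read as a tweet count, so the first tweet from a user counts as 1.

```python
from collections import Counter

# Hypothetical user objects shaped like the `user` field of a tweet.
sample_users = [
    {"id": 1, "screen_name": "alice"},
    {"id": 2, "screen_name": "bob"},
    {"id": 1, "screen_name": "alice"},
]

tweet_counts = Counter()  # user id -> number of tweets seen
profiles = {}             # user id -> first profile seen for that user

for user in sample_users:
    tweet_counts[user["id"]] += 1
    # setdefault keeps the first profile and ignores later duplicates
    profiles.setdefault(user["id"], user)

print(tweet_counts[1], tweet_counts[2], len(profiles))
```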

In [5]:
# read in feather files for processing
file_path = "../gda_data/processed/feathers/"
for file in os.listdir(file_path):
    df = pd.read_feather(file_path + file)
    temp_df = df[df['Ad Name'] == 'Jeep - middle']
    for index, row in temp_df['user'].items():
        track_users(row)


### Save `jeep_users` Dictionary to JSON File
Because the data set is large, let's store our data in a more accessible file format (JSON) so we can retrieve it much faster later if needed. The code below will load the data back into memory:
```
with open('../gda_data/interim/descriptions/jeep_users.json', 'r') as fh:
    jeep_users = json.load(fh)
```
Since this is our first run, we'll write it out to file.

In [6]:
with open('../gda_data/interim/descriptions/jeep_users.json', 'w') as fh:
    fh.write(json.dumps(jeep_users))
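
One detail to keep in mind when reloading the file: JSON object keys are always strings, so a dictionary keyed by integer ids comes back keyed by strings. The sketch below demonstrates this with a hypothetical record; it assumes the ids in `jeep_users` are integers, which this notebook does not state explicitly.

```python
import json

# Hypothetical record keyed by an integer user id.
users_by_id = {1234567890: {"screen_name": "example_user", "count": 1}}

# Round-trip through JSON, as happens when the file is written and reloaded.
restored = json.loads(json.dumps(users_by_id))
print(list(restored))  # the key is now a string

# Convert the keys back to integers if later lookups use numeric ids.
restored_int_keys = {int(user_id): data for user_id, data in restored.items()}
```

If the `id_str` field is used for lookups instead, no conversion is needed.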

## Vectorized Twitter Bios
### TF-IDF
Since we want to group our Twitter users into different clusters, we need the data in a format that lets us analyze the words contained in their bios. One of the most popular ways to do that is TF-IDF. The TfidfVectorizer from sklearn is a great tool for extracting features from text and building a matrix of TF-IDF features. TF-IDF counts how often each word (feature) appears in a bio (the term frequency) and multiplies that count by the word's inverse document frequency, which shrinks as the word appears in more bios across the dataset. Words that are frequent in one bio but rare overall therefore receive the highest weights. We'll get a better idea of the types of features extracted into our matrix below.
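
To make the weighting concrete, here is a small from-scratch sketch on three made-up bios. It uses the smoothed inverse document frequency and L2 row normalisation that scikit-learn documents as the TfidfVectorizer defaults; the bios and the whitespace tokenisation are assumptions for illustration, not the preprocessing used in this project.

```python
import math
from collections import Counter

# Hypothetical bios, tokenised by simple whitespace splitting.
bios = [
    "proud father and husband",
    "proud veteran",
    "journalist covering news",
]
docs = [bio.split() for bio in bios]
n_docs = len(docs)

# Document frequency: in how many bios does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenised bio."""
    counts = Counter(doc)
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}


weights = tfidf(docs[1])
print(weights)
```

In the second bio, "veteran" appears in only one bio while "proud" appears in two, so "veteran" ends up with the larger weight.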

### Building the Vectorizer
The variable `bio_matrix` will hold the matrix of our feature data. Let's establish the TfidfVectorizer object as `vectorizer` and pass along some parameters. From our UPP pre-processing object, we can generate a list of stopwords to pass into the vectorizer so that we build a feature matrix of strong, distinctive words. A good rule of thumb is a ratio of 1:100 features to observations, so we compute the maximum number of features from the user count and pass it in the `max_features` argument. The preprocessor method, tokenizer, and stop_words value are also derived from the UPP class. For more details, inspect `classes/user_preprocessor.py`.
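
As a quick worked example of the 1:100 rule, the snippet below plugs in the row count that the `bio_matrix` output reports later in this notebook (26144 users); the integer division mirrors the `user_count//100` expression passed to the vectorizer.

```python
user_count = 26144                 # rows reported for bio_matrix in this notebook
max_features = user_count // 100   # one feature per 100 observations, rounded down
print(max_features)                # 261
```

This matches the 261 columns shown in the sparse matrix output.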

### Building the Data Objects
Now that we have our vectorizer object, let's run the data through. Since we're only interested in clustering the user bios, we'll separate those out into a list called `bios`. We'll add the corresponding usernames to a separate list object called `users`. We can then run the `bios` list object through the vectorizer and transform it into our `bio_matrix` variable. When we're ready to link the clustered users back to their bios we'll join in the `users` list object later.

In [7]:
user_count = len(jeep_users)

In [12]:
stopwords = UPP.generate_stopwords()
vectorizer = TfidfVectorizer(preprocessor = UPP.replace_www, tokenizer = UPP.tweet_tokenizer, stop_words = stopwords, max_features = user_count//100)

In [8]:
users = []
bios = []
for user in jeep_users.values():
    users.append(user['screen_name'])
    # a missing bio (None) becomes an empty string so the vectorizer can process it
    bios.append(user['description'] or '')

In [13]:
%%time
bio_matrix = vectorizer.fit_transform(bios)
bio_matrix



CPU times: user 12.5 s, sys: 0 ns, total: 12.5 s
Wall time: 12.5 s


<26144x261 sparse matrix of type '<class 'numpy.float64'>'
	with 133592 stored elements in Compressed Sparse Row format>

In [10]:
type(bio_matrix)

scipy.sparse.csr.csr_matrix

### View Bio Makeup
We'll print out the first 20 bios in our `bios` list object to see how people write their bios. This gives a better idea of the makeup of the data before we try to group it.

In [11]:
for i, bio in enumerate(bios[:20]):
    print(i, ': ', bio.replace('\n', ' '))

0 :  Enamorat dels Boxers, tot i que mai porto calçotets. Peó de la República Catalana. Vull tornar als llocs on no he estat mai. 354. Memorial 1714
1 :  Farm kid frm Jewell County—Public school advocate—Transparency, Accountability, Checks & Balances in Gov't—Let's work TOGETHER to solve problems #ksleg #ksed
2 :  
3 :  A Ravenclaw with a BFA. She’s crafty.
4 :  rule-breaker, CEO @LigaInsider, gamechanger-podcast | ⚽🎾🏀🚴🏐🏊‍♂️⛳🎱🏈🇫🇷🇵🇹☀️🏄‍♂️🎧
5 :  Program Director KLBJ-FM and Host of the national radio show LA Lloyd Rock 30 Countdown featuring the top-30 rock songs with Rock's biggest artists as co-hosts!
6 :  Dallas Cowboys Big D Germany NFL College Football Arkansas Razorbacks, Bvb,NBA und allem anderen außer Curling und Snooker
7 :  Don’t ask, don’t know.
8 :  Fußball - Podcasts - Rants ⚽️ @drei90  ⚽️ @WettBroetchen  ⚾️ @JB_Podcast
9 :  because laughing is better than crying.
10 :  We amplify #HR voices. Provided with ❤️  by http://partwell.io. Automate exit interviews and rehiring wit

## KMeans Clustering of User Bios
### Testing Cluster Count
To get a better idea of how many clusters we should specify in our KMeans object, we'll test a range of cluster counts and see how well the model performs at each. We'll score each cluster count using sklearn's [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) (see [silhouette method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_silhouette_method)) and [inertia](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) calculations to determine which cluster count best fits our data. We can plot the results and use [the elbow method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_elbow_method) to see what count works best.
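
To show what the two metrics measure, the sketch below computes both by hand on four made-up 2D points with fixed cluster labels. Inertia is the sum of squared distances from each point to its cluster centroid (lower means tighter clusters), and the silhouette score compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster (closer to 1 means better separated). The points and labels are assumptions for illustration, and the sketch assumes every cluster has at least two points.

```python
import math
from statistics import mean

# Hypothetical points forming two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(mean(axis) for axis in zip(*members))


clusters = {}
for point, label in zip(points, labels):
    clusters.setdefault(label, []).append(point)

# Inertia: squared distance of every point to its own centroid, summed.
inertia = sum(
    math.dist(point, centroid(members)) ** 2
    for members in clusters.values()
    for point in members
)


def silhouette(i):
    """Silhouette value of point i: (b - a) / max(a, b)."""
    own = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    a = mean([math.dist(points[i], points[j]) for j in own])
    b = min(
        mean([math.dist(points[i], other) for other in members])
        for label, members in clusters.items()
        if label != labels[i]
    )
    return (b - a) / max(a, b)


silhouette_avg = mean([silhouette(i) for i in range(len(points))])
print(inertia, round(silhouette_avg, 3))
```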

In [12]:
%%time
ks = [2, 50, 200, 500]
sil_scores = []
inertias = []

for k in ks:
    logging.warning(f'fitting model for {k}')
    model = KMeans(n_clusters=k, n_jobs=-1, random_state = 42)
    model.fit(bio_matrix)
    labels = model.labels_
    sil_scores.append(silhouette_score(bio_matrix, labels))
    inertias.append(model.inertia_)

# plot the quality metrics for inspection
fig, (ax_inertia, ax_sil) = plt.subplots(2, 1, sharex=True)

# top panel: inertia for each tested k
ax_inertia.plot(ks, inertias, 'o--')
ax_inertia.set_ylabel('inertia')
ax_inertia.set_title('kmeans parameter search')

# bottom panel: silhouette score for each tested k
ax_sil.plot(ks, sil_scores, 'o--')
ax_sil.set_ylabel('silhouette score')
ax_sil.set_xlabel('k')



## Run our KMeans Model @ 200 Clusters
Based on the elbow method mentioned above, 200 clusters seems to be the best fit for our data. We'll go ahead and specify our KMeans object with additional parameters and fit the model to our `bio_matrix`.

In [21]:
kn_model = KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=200, n_init=10, n_jobs=-1, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
kn_model.fit(bio_matrix)



KMeans(n_clusters=200, n_jobs=-1, precompute_distances='auto', random_state=42)

## View Strongest Features
The function below lists each cluster along with the strongest features from our fitted data. We pass it the fitted model and our vectorizer object to see the strongest features in each cluster. This is a great text-based way to view the clusters.

In [None]:
def view_strongest_features(model, vectorizer, topk=10):
    model_name = model.__class__.__name__
    features = vectorizer.get_feature_names()
    if model_name == 'KMeans':
        relevant_labels = list(set(model.labels_))
        centroids = model.cluster_centers_.argsort()[:,::-1]
        for label in relevant_labels:
            print(f'Cluster {label}:', end=' ')
            for ind in centroids[label, :topk]:
                print(f'{features[ind]}', end=' ')
            print()
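
The ranking step in this function relies on sorting each centroid's feature weights in descending order, which `argsort()[:, ::-1]` does in NumPy. The sketch below reproduces that step in plain Python on a made-up vocabulary and two made-up centroids, so every name and value in it is hypothetical.

```python
# Hypothetical vocabulary and cluster centroids (one weight per feature).
features = ["veteran", "father", "news", "jesus"]
centroids = [
    [0.70, 0.10, 0.05, 0.00],
    [0.02, 0.10, 0.60, 0.30],
]
topk = 2

top_features = []
for row in centroids:
    # Feature indices ordered from the largest centroid weight to the smallest.
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    top_features.append([features[i] for i in order[:topk]])

print(top_features)
```

Each inner list holds the words that pull hardest on that cluster's centre, which is why they serve as a rough description of the cluster.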


In [None]:
view_strongest_features(kn_model, vectorizer, topk=10)


## Reducing Dimensionality for Plotting
Viewing our clusters' strongest features is difficult in a text-based format, and we can't see how individual bios fit into each cluster. We can plot our bios in a more easily viewed format, but first we have to reduce the dimensionality of our `bio_matrix` variable. If you remember from when we first built the matrix, the dimensions were 26144x261 (or approx. 100 to 1). We can get that down to 2 dimensions for plotting by using t-SNE. sklearn provides a `TSNE` class for reducing dimensionality, but it can be quite slow. Instead we can use the [Multicore TSNE](https://github.com/DmitryUlyanov/Multicore-TSNE#benchmark) project available on GitHub, whose benchmarks report much faster runtimes. Once we are done, we'll also be able to join our username data for viewing along with the bios and cluster numbers.

For demonstration purposes, if you clone this repo the function below will load a cached 2-dimensional version of the matrix for faster processing and store it in a variable called `bio_matrix_2d`. If no cached file is found, for example when you run the Jupyter notebook yourself on fresh data, the function reduces `bio_matrix` instead and saves the result.

In [16]:
def maybe_fit_tsne():
    file = "../models/full_bio_matrix_2d.npy"
    try:
        bio_matrix_2d = np.load(file)
        logging.warning("loading cached TSNE file")
    except FileNotFoundError:
        logging.warning("Fitting TSNE")
        tsne = TSNE(n_components=2,
                    n_jobs=-1,
                    random_state=42)
        # convert the sparse matrix to a dense ndarray (toarray) rather than np.matrix (todense)
        bio_matrix_2d = tsne.fit_transform(bio_matrix.toarray())

        np.save(file, bio_matrix_2d)
    return bio_matrix_2d

In [17]:
%%time
bio_matrix_2d = maybe_fit_tsne()



CPU times: user 5min 28s, sys: 501 ms, total: 5min 28s
Wall time: 1min 55s


In [None]:
type(bio_matrix)

## Visualizing Our Data
Now that we have our data in a more manageable format, let's plot it. Bokeh is a great interactive library that lets us plot the data and hover over individual points to inspect them. We can pass our bios, usernames, and cluster numbers to the chart and explore them far more efficiently than in text. We'll first transform our `bio_matrix_2d` into a pandas dataframe so we can plot it, passing in the `users` list object along with the `bios` and the 2D coordinates. Using seaborn, we'll generate a color scheme for the clusters and store it in a column called `color`. Once we have our dataframe, we can plot the clusters in Bokeh and explore.

In [18]:
def build_plottable_dataframe(users: list, bios: list, coord: object, labels: list):
    num_labels = len(set(labels))
    colors = sns.color_palette('hls', num_labels).as_hex()
    color_lookup = {label: color for color, label in zip(colors, set(labels))}
    df = pd.DataFrame({
        'user_name': users,
        'text': bios,
        'x_val': coord[:,0],
        'y_val': coord[:,1],
        'cluster': labels
    })
    df['color'] = list(map(lambda x: color_lookup[x], labels))
    return df
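
The colour assignment in `build_plottable_dataframe` boils down to giving each distinct cluster label its own evenly spaced hue and then looking that colour up for every point. The sketch below shows the idea with the standard library's `colorsys` module instead of seaborn; the labels are made up, and the lightness and saturation values are assumptions chosen to resemble an HLS-style palette, not values taken from this project.

```python
import colorsys

# Hypothetical cluster labels, one per plotted user.
labels = [2, 0, 1, 0, 2]
unique_labels = sorted(set(labels))  # sorted so the mapping is deterministic


def hex_color(hue, lightness=0.6, saturation=0.65):
    """Convert an HLS colour (components in 0..1) to a hex string."""
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# One evenly spaced hue per cluster label.
color_lookup = {
    label: hex_color(index / len(unique_labels))
    for index, label in enumerate(unique_labels)
}
point_colors = [color_lookup[label] for label in labels]
print(point_colors)
```

With 200 clusters, neighbouring hues become hard to tell apart, which is why the hover tooltip showing the cluster number matters when reading the plot.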

In [19]:
def plot_cluster(df, title='t-SNE plot'):
    # add our DataFrame as a ColumnDataSource for Bokeh
    plot_data = ColumnDataSource(df)
    # configure the chart
    tsne_plot = figure(title=title, plot_width=800, plot_height=700, tools=('pan, box_zoom, reset'))
    # add a hover tool to display the cluster, user name, and bio on roll-over
    tsne_plot.add_tools(
        HoverTool(tooltips = """<div style="width: 400px;"><strong>Cluster: @cluster</strong> | <u>User Name: @user_name</u> | <i>Bio: @text</i></div>""")
    )
    # draw each user as a circle on the plot
    tsne_plot.circle('x_val', 'y_val',
                     source=plot_data,
                     color='color',
                     line_alpha=0.2,
                     fill_alpha=0.1,
                     size=7,
                     hover_line_color='black')
    # configure visual elements of the plot
    tsne_plot.title.text_font_size = '12pt'
    tsne_plot.xaxis.visible = True
    tsne_plot.yaxis.visible = True
    tsne_plot.grid.grid_line_color = None
    tsne_plot.outline_line_color = None
    return tsne_plot

In [22]:
df = build_plottable_dataframe(users, bios, bio_matrix_2d, kn_model.labels_)

In [23]:
show(plot_cluster(df, 'Projection of KMeans Clustered Super Bowl Users'))

## Cluster Insights
Below is a table of insights already identified about the users. We can see that politically oriented groups responded to the Jeep ad, alongside the demographics typically associated with Jeep owners.

| Coordinates | Color            | Observation                                       |
|-------------|------------------|---------------------------------------------------|
| ~ 30, -8    | Green            | Patriotic Users & High Use of American Flag Emoji |
| ~ 19, 24    | Green            | Users who identify as Fathers and Husbands        |
| ~ 34, 6     | Yellowish Green  | LGBTQ Identifiers, Pride Flag Emoji               |
| ~ -15, -30  | Yellowish Orange | Christian & Jesus                                 |
| ~ -30, 14   | Pink             | News, Journalism                                  |

# Acknowledgements
This project has been a great opportunity for me to try analytics techniques I hadn't had the chance to use before and to learn a lot about big data analysis with machine learning algorithms. Thank you to the [University of Utah's Game Day Analytics Challenge](https://eccles.utah.edu/programs/undergraduate/game-day-ad-analytics/) for providing the data set and for their hard work in collecting and aggregating the data. The [Twitter Dev GitHub](https://github.com/twitterdev/do_more_with_twitter_data) was an awesome resource for the user clustering analysis, and Josh Montague's Jupyter notebook was a big help.
