<h1> Capstone 3: Processing and Modeling </h1><a id='Capstone_3_Processing_and_Modeling'></a>

## Table of Contents<a id='Table_of_Contents'></a>
* [1 Imports](#Imports)
    * [1.1 Import Libraries](#Import_Libraries)
    * [1.2 Import Data](#Import_Data)
    
* [2 Task](#Task)
* [3 Develop Bag-of-Words Model for Sentiment Analysis](#Develop_Bag-of-Words_Model_for_Sentiment_Analysis) 
    * [3.1 Train-Test Split](#Train-Test_Split)
    * [3.2 Encode Text as Vectors](#Encode_Text_as_Vectors)
    * [3.3 Develop a Multi-Layer Perceptron](#Develop_a_Multi-Layer_Perceptron)
* [4 Predict Tweet Sentiment](#Predict_Tweet_Sentiment)
* [5 Topic Modeling](#Topic_Modeling)

## Imports <a id="Imports"></a>

### Import Libraries <a id="Import_Libraries"></a>

In [94]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import pandas as pd
import numpy as np
import pickle
from collections import Counter
import gensim
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

from keras.preprocessing.text import Tokenizer
from keras import Sequential
from keras.layers import Dense



### Import Data <a id="Import_Data"></a>

In [2]:
with open("../data/processed/twcs_dict.pkl", "rb") as pkl_file:
    twcs_dict = pickle.load(pkl_file)

In [3]:
with open("../data/processed/pos_tweets.pkl", "rb") as pkl_file:
    pos_tweets = pickle.load(pkl_file)

In [4]:
with open("../data/processed/neg_tweets.pkl", "rb") as pkl_file:
    neg_tweets = pickle.load(pkl_file)

## Task <a id="Task"></a>

In this step I will:

1. develop a neural bag-of-words model for sentiment analysis
    * The training data are the **neg_tweets** and **pos_tweets** lists, which hold tweets labelled as positive or negative. The trained model is then applied to the dataframes in **twcs_dict**, which hold customer tweets from before and after a customer service interaction.
    * [click here to see the previous notebook in which I process and clean the raw data](https://github.com/bmensah/springboard/blob/main/capstone3/notebooks/Cap3_Wrangling_and_EDA.ipynb)


2. apply Latent Dirichlet Allocation (LDA) Topic Modeling on the customer tweets to find out whether there are certain trends that appear in customer issues.


3. explore differentiating factors between companies that are effective at changing customer sentiment and those that are not

Helpful links:


pyLDAvis: https://github.com/bmabey/pyLDAvis

LDA: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

## Develop Bag-of-Words Model for Sentiment Analysis <a id="Develop_Bag-of-Words_Model_for_Sentiment_Analysis"></a>

The labelled tweets were fully processed in the previous notebook. Here, I will split the data into train and test sets and encode the text as vectors.

#### Positive Tweets

<font color="red">To run this notebook locally, I am cutting each list down from 800,000 tweets to 40,000.</font>

In [6]:
print("previous length:",len(pos_tweets))
pos_tweets = pos_tweets[:len(pos_tweets)//20]
print("new length:",len(pos_tweets))

previous length: 800000
new length: 40000


In [7]:
type(pos_tweets)

list

In [8]:
print(pos_tweets[:3])

[['love', 'guys', 'best'], ['im', 'meeting', 'one', 'besties', 'tonight', 'cant', 'wait', 'girl', 'talk'], ['thanks', 'twitter', 'add', 'got', 'meet', 'hin', 'show', 'dc', 'area', 'sweetheart']]


#### Negative Tweets

In [9]:
print("previous length:",len(neg_tweets))
neg_tweets = neg_tweets[:len(neg_tweets)//20]
print("new length:",len(neg_tweets))

previous length: 800000
new length: 40000


In [10]:
type(neg_tweets)

list

In [11]:
print(neg_tweets[:3])

[['awww', 'thats', 'bummer', 'you', 'shoulda', 'got', 'david', 'carr', 'third', 'day'], ['upset', 'cant', 'update', 'facebook', 'texting', 'might', 'cry', 'result', 'school', 'today', 'also', 'blah'], ['many', 'times', 'ball', 'managed', 'save', 'the', 'rest', 'go', 'bounds']]


### Train-Test Split<a id="Train-Test_Split"></a>

In [12]:
# create labels: 1 is positive, 0 is negative
pos_labels = np.ones(len(pos_tweets))
neg_labels = np.zeros(len(neg_tweets))

In [32]:
# train-test split
split = int(len(pos_tweets)*.75)

X_train = pos_tweets[:split]
X_train.extend(neg_tweets[:split])
X_test = pos_tweets[split:]
X_test.extend(neg_tweets[split:])
y_train = np.append(pos_labels[:split], neg_labels[:split])
y_test = np.append(pos_labels[split:], neg_labels[split:])

In [39]:
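# report the size of each split (a dict avoids looking names up with eval)
splits = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
for name, arr in splits.items():
    print("{} length: {:,}".format(name, len(arr)))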

X_train length: 60,000
X_test length: 20,000
y_train length: 60,000
y_test length: 20,000


I will now develop a vocabulary based only on the train set to simulate a real scenario in which the test set would not be available. 

In [14]:
vocab = Counter()
for tweet in X_train:
    vocab.update(tweet)

In [15]:
print("most common words:")
vocab.most_common(10)

most common words:


[('im', 6250),
 ('good', 3388),
 ('day', 3100),
 ('get', 2988),
 ('like', 2867),
 ('go', 2822),
 ('work', 2624),
 ('going', 2581),
 ('today', 2546),
 ('dont', 2448)]

In [16]:
print("least common words:")
vocab.most_common()[-10:]

least common words:


[('altough', 1),
 ('puedo', 1),
 ('oily', 1),
 ('bane', 1),
 ('hace', 1),
 ('speeches', 1),
 ('fuckkkk', 1),
 ('havta', 1),
 ('dwn', 1),
 ('nooooooooooooooo', 1)]

In [17]:
print("vocab size: {:,} words".format(len(vocab)))

vocab size: 23,028 words


In [18]:
# "im" is the most common token but is unlikely to help prediction
del vocab["im"]

# keep only words that occur more than 5 times in the training set
trimmed_vocab = Counter([k for k,v in vocab.items() if v > 5])

# see new size
print("trimmed vocab size: {:,} words".format(len(trimmed_vocab)))

trimmed vocab size: 6,387 words
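To make the cutoff explicit, here is a minimal sketch on made-up data. The helper name `trim_vocab` and the toy tweets are assumptions for illustration only. Note that the condition `v > 5` used in the cell above corresponds to `min_count=6` in this helper: a word must appear at least six times to be kept.

```python
from collections import Counter

def trim_vocab(counts, min_count):
    """Return the set of words whose count is at least min_count."""
    return {word for word, n in counts.items() if n >= min_count}

# Toy token lists standing in for X_train (made up for illustration)
toy_tweets = [["flight", "late"], ["flight", "great"], ["flight", "late", "again"]]
toy_counts = Counter()
for tweet in toy_tweets:
    toy_counts.update(tweet)

# "flight" appears 3 times and "late" twice; the other words appear once
kept = trim_vocab(toy_counts, min_count=2)
print(sorted(kept))
```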


In [19]:
# drop words that are not in the trimmed vocabulary (tweets stay as token lists)
for i in range(len(X_train)):
    X_train[i] = [w for w in X_train[i] if w in trimmed_vocab]

In [20]:
print(X_train[:5])

[['love', 'guys', 'best'], ['meeting', 'one', 'besties', 'tonight', 'cant', 'wait', 'girl', 'talk'], ['thanks', 'twitter', 'add', 'got', 'meet', 'show', 'dc', 'area', 'sweetheart'], ['being', 'sick', 'really', 'cheap', 'hurts', 'much', 'eat', 'real', 'food', 'plus', 'friends', 'make', 'soup'], ['effect', 'everyone']]


### Encode Text as Vectors<a id="Encode_Text_as_Vectors"></a>

In [21]:
# transform the vocabulary into a list of strings
tokens = list(trimmed_vocab)

# encode vocab
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokens)

In [22]:
# encode train and test data
X_train_en = tokenizer.texts_to_matrix(X_train, mode="freq")
X_test_en = tokenizer.texts_to_matrix(X_test, mode="freq")

In [23]:
print("train data shape:", X_train_en.shape)
print("test data shape:", X_test_en.shape)

train data shape: (60000, 6388)
test data shape: (20000, 6388)


### Develop a Multi-Layer Perceptron<a id="Develop_a_Multi-Layer_Perceptron"></a>

In [24]:
# the input width is the number of columns in the encoded matrix (vocabulary size + 1)
n_words = X_train_en.shape[1]

In [25]:
# create model
model = Sequential()
# define hidden layer
model.add(Dense(50, input_shape=(n_words,), activation="relu"))
# define output layer
model.add(Dense(1, activation="sigmoid"))
# compile
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 50)                319450    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
Total params: 319,501
Trainable params: 319,501
Non-trainable params: 0
_________________________________________________________________
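
The parameter counts in the summary follow directly from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. A small check, using the input width of 6,388 reported for the encoded training matrix (the helper name `dense_params` is made up for this sketch):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(6388, 50)  # input width taken from X_train_en.shape[1]
output = dense_params(50, 1)     # 50 hidden units feeding one sigmoid unit
print(hidden, output, hidden + output)
```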


In [26]:
model.fit(X_train_en, y_train, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f84d5d1c2e0>

In [40]:
# evaluate
loss, acc = model.evaluate(X_test_en, y_test, verbose=1)



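The text for this model was encoded as word frequency within each document (`freq`). Now I will try the other three encoding modes: `tfidf`, which weights each word count by how rare the word is across documents; `binary`, which marks only whether a word is present; and `count`, which uses raw word counts.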
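
To show what the four encoding modes compute for a single tweet, here is a rough pure-Python sketch. It reflects my understanding of how the Keras `Tokenizer` fills each row and is not a copy of its source; the function `encode` and its parameters `vocab_index`, `n_docs`, and `doc_freq` are names invented for this example.

```python
import math
from collections import Counter

def encode(doc, vocab_index, mode, n_docs=0, doc_freq=None):
    """Encode one tokenized document as a fixed-length vector.

    vocab_index maps word -> column; column 0 is left unused, which is why
    the encoded matrices have one more column than the vocabulary has words.
    """
    doc_freq = doc_freq or {}
    row = [0.0] * (len(vocab_index) + 1)
    known = [w for w in doc if w in vocab_index]  # unknown words are dropped
    for word, c in Counter(known).items():
        j = vocab_index[word]
        if mode == "count":
            row[j] = float(c)
        elif mode == "binary":
            row[j] = 1.0
        elif mode == "freq":
            row[j] = c / len(known)
        elif mode == "tfidf":
            # assumed weighting: damped term frequency times smoothed idf
            tf = 1 + math.log(c)
            idf = math.log(1 + n_docs / (1 + doc_freq.get(word, 0)))
            row[j] = tf * idf
        else:
            raise ValueError(f"unknown mode: {mode}")
    return row

vocab_index = {"flight": 1, "late": 2, "great": 3}
doc = ["flight", "late", "flight", "unknown"]
for mode in ("count", "binary", "freq"):
    print(mode, encode(doc, vocab_index, mode))
```

In this notebook the tokenizer was fitted on the vocabulary words themselves rather than on the tweets. If Keras computes document frequencies the way this sketch assumes, every word would receive the same idf weight, which is worth checking before reading much into the tfidf score.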

In [42]:
modes = ["tfidf","binary","count"]
for mode in modes:
    # encode according to mode
    X_train_en = tokenizer.texts_to_matrix(X_train, mode=mode)
    X_test_en = tokenizer.texts_to_matrix(X_test, mode=mode)
    # define model
    model = Sequential()
    # define hidden layer
    model.add(Dense(50, input_shape=(n_words,), activation="relu"))
    # define output layer
    model.add(Dense(1, activation="sigmoid"))
    # compile
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    # fit model
    model.fit(X_train_en, y_train, epochs=10, verbose=0)
    # evaluate and print results
    loss, acc = model.evaluate(X_test_en, y_test, verbose=0)
    print("Mode:", mode)
    print("Loss:", loss)
    print("Accuracy:", acc)
    print("-----------")

Mode: tfidf
Loss: 1.5128040313720703
Accuracy: 0.7335000038146973
-----------
Mode: binary
Loss: 0.9243593811988831
Accuracy: 0.7432500123977661
-----------
Mode: count
Loss: 0.9496979713439941
Accuracy: 0.7421500086784363
-----------


With a test accuracy of 0.76, frequency encoding, the first option tried, scores best, ahead of binary (0.743), count (0.742), and tfidf (0.734).
<font color="red">Because neural network training is stochastic, I should run about 10 trials for each encoding mode. The scores are close, so freq may not come out on top in every run. After that, I should pick the best encoding mode and train a model on all of the data in a cloud environment (AWS, GCP).</font>
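
One possible shape for the repeated-trial comparison is sketched below. It assumes a caller-supplied function that builds, fits, and evaluates a fresh model and returns its test accuracy; `repeat_trials` and `fake_score` are hypothetical names, and `fake_score` is only a stand-in so the sketch runs without Keras.

```python
import random
import statistics

def repeat_trials(train_and_score, n_trials=10):
    """Call train_and_score() n_trials times and summarize the accuracies."""
    scores = [train_and_score() for _ in range(n_trials)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_trials > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Stand-in scorer: in the notebook this would train one model for a given
# encoding mode and return the accuracy from model.evaluate.
rng = random.Random(0)

def fake_score():
    return 0.74 + rng.uniform(-0.01, 0.01)

summary = repeat_trials(fake_score, n_trials=10)
print(summary)
```

Comparing the mean and spread per mode, rather than a single run, gives a fairer picture when the scores are within a couple of points of each other.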

In [45]:
X_train_en = tokenizer.texts_to_matrix(X_train, mode="freq")
X_test_en = tokenizer.texts_to_matrix(X_test, mode="freq")
best_model = Sequential()
# define hidden layer
best_model.add(Dense(50, input_shape=(n_words,), activation="relu"))
# define output layer
best_model.add(Dense(1, activation="sigmoid"))
# compile
best_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
best_model.fit(X_train_en, y_train, epochs=10, verbose=0)

<keras.callbacks.History at 0x7f83d1344370>

## Predict Tweet Sentiment<a id="Predict_Tweet_Sentiment"></a>

<font color="red">Note: for this part, I only need the transformed text from the previous part, so I can make this easier by sending only that through pickle.
    
I also need to pass on the airline tweets via pickle. I may run sentiment analysis on those as well: get a score for each company and see whether it correlates with its customers' sentiment scores. This could be shown in a correlation heatmap with other features (number of tweets, length of tweets, response time, etc.).</font>

In [53]:
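def get_sentiment(text):
    """Return a 0/1 sentiment label for each tokenized tweet in text."""
    # keep only vocabulary words; build a new list so the input is not modified
    docs = [" ".join(w for w in tweet if w in trimmed_vocab) for tweet in text]
    encoded = tokenizer.texts_to_matrix(docs, mode="freq")
    yhat = best_model.predict(encoded, verbose=0)
    # threshold the sigmoid outputs: above 0.5 is positive (1), otherwise negative (0)
    return [int(p[0] > 0.5) for p in yhat]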

In [64]:
for airline,dfs in twcs_dict.items():
    pre_tweets = dfs["pre"].loc[:,"transformed_text"].to_list()
    pre_sentiment = get_sentiment(pre_tweets)
    post_tweets = dfs["post"].loc[:,"transformed_text"].to_list()
    post_sentiment = get_sentiment(post_tweets)
    
    print("Airline:", airline)
    print("percent of positive sentiment tweets before customer service interaction:")
    print((sum(pre_sentiment)/len(pre_sentiment))*100)
    print("percent of positive sentiment tweets after customer service interaction:")
    print((sum(post_sentiment)/len(post_sentiment))*100)
    print("number of original issues:", len(pre_sentiment))
    print("number of customers who follow up after their issue has been addressed:", len(post_sentiment))
    print("percent followup:", (len(post_sentiment)/len(pre_sentiment))*100)
    print("------------------")
    print()

Airline: Delta
percent of positive sentiment tweets before customer service interaction:
39.62673611111111
percent of positive sentiment tweets after customer service interaction:
49.705535924617195
number of original issues: 11520
number of customers who follow up after their issue has been addressed: 5094
percent followup: 44.21875
------------------

Airline: AmericanAir
percent of positive sentiment tweets before customer service interaction:
35.72460688482788
percent of positive sentiment tweets after customer service interaction:
49.60658962380133
number of original issues: 11765
number of customers who follow up after their issue has been addressed: 8134
percent followup: 69.13727156821079
------------------

Airline: British_Airways
percent of positive sentiment tweets before customer service interaction:
40.967269174401565
percent of positive sentiment tweets after customer service interaction:
56.05313092979127
number of original issues: 10235
number of customers who follow u

These results are interesting but could be slightly misleading. Not every customer has a "post" tweet, meaning not every customer follows up after their customer service issue has been addressed.

This may not matter much, since I am only looking at general trends. The question is: does customer sentiment improve after a customer service interaction on Twitter? I don't need to look at each conversation; I can compare the pre and post tweets in aggregate.

For every airline, the answer appears to be yes.

The caveat is the follow-up rate. Fewer than half of the customers who tweeted an issue to Delta followed up with customer service. We can't count all of the others as having had a negative experience. They could have not followed up for many reasons: they got distracted, resolved their issue some other way, and so on. We cannot reasonably assume that everyone who did not follow up stayed silent because they were angry.
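
The comparison can be condensed into two numbers per airline: the percentage-point change in positive tweets and the follow-up rate. The sketch below uses only the two airlines whose figures are printed in full above, rounded to two decimals; the dictionary layout and the `summarize` helper are assumptions made for illustration.

```python
# Figures copied from the printed output above (percentages rounded)
results = {
    "Delta": {"pre_pos": 39.63, "post_pos": 49.71, "n_pre": 11520, "n_post": 5094},
    "AmericanAir": {"pre_pos": 35.72, "post_pos": 49.61, "n_pre": 11765, "n_post": 8134},
}

def summarize(stats):
    """Percentage-point change in positive share, plus the follow-up rate."""
    return {
        "change_pts": stats["post_pos"] - stats["pre_pos"],
        "followup_pct": 100 * stats["n_post"] / stats["n_pre"],
    }

for airline, stats in results.items():
    s = summarize(stats)
    print(f"{airline}: {s['change_pts']:+.2f} points, {s['followup_pct']:.1f}% follow-up")
```

Reporting the follow-up rate next to the change keeps the caveat visible: a large improvement measured on a small share of the original customers is weaker evidence than the same improvement with a high follow-up rate.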

There is a lot I could visualize here. The web app could be formatted like a dashboard.

## Topic Modeling<a id="Topic_Modeling"></a>

I want to take a look at the most common topics in the customer tweets before and after a customer service interaction. 

In [70]:
# get processed pre-tweets and post-tweets into separate lists
pre_corpus = []
post_corpus = []
for airline,dfs in twcs_dict.items():
    pre_corpus.extend(dfs["pre"].transformed_text.to_list())
    post_corpus.extend(dfs["post"].transformed_text.to_list())

In [99]:
def model_topics(corpus):
    dic = gensim.corpora.Dictionary(corpus)
    bow = [dic.doc2bow(doc) for doc in corpus]
    lda_model = gensim.models.LdaMulticore(bow,
                                          num_topics=4,
                                          id2word=dic,
                                          passes=10,
                                          workers=2)
    pyLDAvis.enable_notebook()
    vis = gensimvis.prepare(lda_model, bow, dic)
    return vis

In [103]:
model_topics(twcs_dict['AlaskaAir']['pre'].transformed_text.to_list())

In [72]:
pre_dict = gensim.corpora.Dictionary(pre_corpus)
pre_bow = [pre_dict.doc2bow(doc) for doc in pre_corpus]

post_dict = gensim.corpora.Dictionary(post_corpus)
post_bow = [post_dict.doc2bow(doc) for doc in post_corpus]
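
For readers unfamiliar with the bag-of-words format that LDA consumes, the following simplified sketch mimics what the dictionary and `doc2bow` steps produce: each document becomes a list of `(token_id, count)` pairs. It is not gensim's implementation; in particular, the ids here are assigned in first-seen order, which may differ from the ids gensim assigns. `build_vocab` and `to_bow` are made-up names.

```python
from collections import Counter

def build_vocab(corpus):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in corpus:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy corpus standing in for pre_corpus (made up for illustration)
toy_corpus = [["flight", "delayed", "flight"], ["bag", "lost"]]
token2id = build_vocab(toy_corpus)
print(to_bow(toy_corpus[0], token2id))
```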

In [73]:
pre_lda_model = gensim.models.LdaMulticore(pre_bow, 
                                   num_topics = 4, 
                                   id2word = pre_dict,                                    
                                   passes = 10,
                                   workers = 2)
pre_lda_model.show_topics()

[(0,
  '0.021*"please" + 0.020*"check" + 0.017*"bag" + 0.016*"get" + 0.015*"help" + 0.015*"baggage" + 0.014*"boarding" + 0.014*"luggage" + 0.013*"still" + 0.011*"call"'),
 (1,
  '0.035*"flight" + 0.032*"service" + 0.021*"customer" + 0.017*"gate" + 0.015*"great" + 0.013*"thank" + 0.013*"thanks" + 0.013*"crew" + 0.010*"today" + 0.009*"airline"'),
 (2,
  '0.041*"flight" + 0.024*"booking" + 0.023*"hi" + 0.021*"help" + 0.016*"please" + 0.014*"change" + 0.012*"ticket" + 0.011*"need" + 0.010*"book" + 0.010*"can"'),
 (3,
  '0.065*"flight" + 0.019*"delayed" + 0.015*"get" + 0.014*"time" + 0.012*"flights" + 0.011*"hours" + 0.010*"going" + 0.010*"plane" + 0.009*"seat" + 0.009*"delay"')]

In [96]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(pre_lda_model, pre_bow, pre_dict)
vis

In [97]:
post_lda_model = gensim.models.LdaMulticore(post_bow, 
                                   num_topics = 4, 
                                   id2word = post_dict,                                    
                                   passes = 10,
                                   workers = 2)
post_lda_model.show_topics()

[(0,
  '0.047*"flight" + 0.015*"get" + 0.013*"plane" + 0.012*"us" + 0.010*"hours" + 0.010*"we" + 0.009*"the" + 0.009*"time" + 0.009*"gate" + 0.009*"bag"'),
 (1,
  '0.077*"thanks" + 0.023*"service" + 0.022*"customer" + 0.021*"you" + 0.015*"guys" + 0.014*"great" + 0.014*"good" + 0.013*"ok" + 0.011*"like" + 0.010*"first"'),
 (2,
  '0.033*"sent" + 0.018*"just" + 0.016*"ba" + 0.013*"sure" + 0.013*"flight" + 0.012*"dont" + 0.012*"im" + 0.012*"never" + 0.011*"hope" + 0.011*"dm"'),
 (3,
  '0.080*"thank" + 0.033*"please" + 0.027*"dm" + 0.025*"still" + 0.019*"no" + 0.019*"done" + 0.016*"yes" + 0.016*"help" + 0.015*"email" + 0.014*"number"')]

In [98]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(post_lda_model, post_bow, post_dict)
vis