In [1]:
from IPython.display import HTML, display


## An NLP (LDA & Word2Vec) trend analysis (topic modelling) exercise on tweets sent by customers to different airlines

The data contains 14,640 tweets sent by customers to the support Twitter handles of 6 different airlines on various topics. The data is already categorized by various criteria, e.g. by sentiment ('positive', 'negative', 'neutral') or by issue type ('luggage', 'reservation', 'late', etc.).

However, I will use only the 'text' field, which holds the customer's tweet, to train an unsupervised model and identify the trends/issues/topics.

#### Potential business uses:
- A weekly report to the airline's executive management indicating the trends/issues in complaint data (from the Twitter channel)
- A dialog system (chatbot)
- Redirect, classify and summarize complaint/inquiry text from airline's support pages

### Objective: To quickly train an LDA model on the 'text' field and identify trends/topics/sentiment.

#### Data: The data file 'Tweets.csv' can be downloaded from Kaggle [here](https://www.kaggle.com/crowdflower/twitter-airline-sentiment#Tweets.csv)

#### Load and display a few rows

In [4]:
import numpy as np
import pandas as pd
import os
import warnings

warnings.filterwarnings("ignore")

dirname=r'D:\ML-Data\airline-tweet-sentiment' # local directory where the file is downloaded (raw string avoids invalid escape sequences)
filename='Tweets.csv'
maindf=pd.read_csv(os.path.join(dirname,filename))

print('shape:',maindf.shape)
maindf.head(2)

shape: (14640, 15)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)


#### Check for nulls, especially in the 'text' column

In [5]:
maindf.isnull().sum()

tweet_id                            0
airline_sentiment                   0
airline_sentiment_confidence        0
negativereason                   5462
negativereason_confidence        4118
airline                             0
airline_sentiment_gold          14600
name                                0
negativereason_gold             14608
retweet_count                       0
text                                0
tweet_coord                     13621
tweet_created                       0
tweet_location                   4733
user_timezone                    4820
dtype: int64

#### Three columns are more than 90% null, so drop them. Also drop the 'name' column, which holds the person's name.

In [6]:
maindf.drop(labels=['airline_sentiment_gold','negativereason_gold','tweet_coord', 'name'], axis='columns',inplace=True)
print('Columns dropped')

Columns dropped
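
To see how sparse each column is, the null counts printed above can be turned into fractions of the 14,640 rows. This is a small sketch that uses only the numbers shown in the output; the 0.9 cut-off and the variable names are assumptions chosen for illustration.

```python
# Null counts copied from the isnull().sum() output above (non-zero entries only).
null_counts = {
    'negativereason': 5462,
    'negativereason_confidence': 4118,
    'airline_sentiment_gold': 14600,
    'negativereason_gold': 14608,
    'tweet_coord': 13621,
    'tweet_location': 4733,
    'user_timezone': 4820,
}
n_rows = 14640      # from maindf.shape
threshold = 0.9     # assumed cut-off for "mostly null"

null_fraction = {col: count / n_rows for col, count in null_counts.items()}
mostly_null = [col for col, frac in null_fraction.items() if frac > threshold]

# Print the columns from most to least sparse.
for col, frac in sorted(null_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print('{:30s} {:6.1%}'.format(col, frac))
print('columns above threshold:', mostly_null)
```

On the DataFrame itself, the same fractions can be obtained with `maindf.isnull().mean()`.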


#### Check some categorical columns and their counts

In [7]:
cat_cols=[col for col in maindf.select_dtypes('object').columns if col not in ['user_timezone','tweet_created','text']]
print('categorical cols:',cat_cols)
for col in cat_cols:
    print('***********',col)
    print(maindf[col].value_counts())

categorical cols: ['airline_sentiment', 'negativereason', 'airline', 'tweet_location']
*********** airline_sentiment
negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
*********** negativereason
Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight                847
Lost Luggage                    724
Bad Flight                      580
Flight Booking Problems         529
Flight Attendant Complaints     481
longlines                       178
Damaged Luggage                  74
Name: negativereason, dtype: int64
*********** airline
United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64
*********** tweet_location
Boston, MA                        157
New York, NY                      156
Washington, DC                    150
New York                          127
USA    

#### Check a few 'text' column entries for all the different airlines

In [8]:
airlines=maindf.airline.unique()
for al in airlines:
    display(maindf.loc[maindf['airline']==al,['airline','text']].head(2))

Unnamed: 0,airline,text
0,Virgin America,@VirginAmerica What @dhepburn said.
1,Virgin America,@VirginAmerica plus you've added commercials t...


Unnamed: 0,airline,text
504,United,@united thanks
505,United,@united Thanks for taking care of that MR!! Ha...


Unnamed: 0,airline,text
4326,Southwest,@SouthwestAir still waiting. Just hit one hour.
4327,Southwest,@SouthwestAir although I'm not happy you Cance...


Unnamed: 0,airline,text
6746,Delta,@JetBlue Yesterday on my way from EWR to FLL j...
6747,Delta,@JetBlue I hope so because I fly very often an...


Unnamed: 0,airline,text
8966,US Airways,@USAirways is there a better time to call? My...
8967,US Airways,@USAirways and when will one of these agents b...


Unnamed: 0,airline,text
11879,American,@AmericanAir why would I even consider continu...
11880,American,@AmericanAir we've already made other arrangem...


#### All 'Delta' entries are mislabeled: they are actually tweets to @JetBlue. Change 'Delta' to 'JetBlue'

In [10]:
maindf['airline']=maindf['airline'].replace(to_replace='Delta', value='JetBlue') # assign back instead of a chained in-place replace
display(maindf.loc[maindf['airline']=='JetBlue',['airline','text']].head(3))

Unnamed: 0,airline,text
6746,JetBlue,@JetBlue Yesterday on my way from EWR to FLL j...
6747,JetBlue,@JetBlue I hope so because I fly very often an...
6748,JetBlue,"@JetBlue flight 1041 to Savannah, GA"
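
Before relabelling a whole airline, it can help to check which handle each label's tweets actually start with. The sketch below does this for a handful of rows copied from the earlier display, using their original labels; the helper names are assumptions, and on the full data the same idea would be applied to every row.

```python
import re
from collections import Counter, defaultdict

# A few (airline label, text) pairs copied from the rows displayed earlier,
# with the labels as they were before the relabelling.
rows = [
    ('Virgin America', '@VirginAmerica What @dhepburn said.'),
    ('United', '@united thanks'),
    ('Southwest', '@SouthwestAir still waiting. Just hit one hour.'),
    ('Delta', '@JetBlue Yesterday on my way from EWR to FLL j...'),
    ('Delta', '@JetBlue I hope so because I fly very often an...'),
    ('US Airways', '@USAirways is there a better time to call? My...'),
]

# Count the leading @handle of each tweet, grouped by airline label.
handle_counts = defaultdict(Counter)
for airline, text in rows:
    match = re.match(r'@(\w+)', text)   # handle at the very start of the tweet
    handle = match.group(1).lower() if match else None
    handle_counts[airline][handle] += 1

for airline, counts in handle_counts.items():
    print(airline, dict(counts))
```

A label whose tweets consistently start with another airline's handle is a candidate for the kind of fix applied above.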


## Some quick trend analysis and topic modelling

Extract all text in the 'text' column into a list and display a few entries

In [11]:
orig_documents=list(maindf.text)

documents=orig_documents.copy()
documents[:2]    

['@VirginAmerica What @dhepburn said.',
 "@VirginAmerica plus you've added commercials to the experience... tacky."]

#### First some text preprocessing

- Lowercase the text
- Remove URLs and special characters
- Tokenize the text
- Remove purely numeric tokens
- Remove tokens shorter than 3 characters
- Remove stopwords
- Lemmatize words

#### Display the first and last few cleaned, tokenized documents

In [12]:
# Cleanup and tokenize the documents

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re

STOPWORDS=stopwords.words('english')
def clean_docs(documents):
    
    # Build a new list so the caller's list of raw strings is left untouched.
    tokenizer = RegexpTokenizer(r'\w+')
    cleaned = []
    for doc in documents:
        doc = doc.lower()                        # Convert to lowercase.
        doc = re.sub(r'http\S+', '', doc)        # Remove URLs.
        cleaned.append(tokenizer.tokenize(doc))  # Split into words.
    documents = cleaned

    # Remove numbers, but not words that contain numbers.
    documents = [[token for token in doc if not token.isnumeric()] for doc in documents]

    # Remove words up to 2 characters long.
    documents = [[token for token in doc if len(token) > 2] for doc in documents]

    # Remove stopwords
    documents=[[token for token in doc if token not in STOPWORDS] for doc in documents]

    # Lemmatize the documents (stemming is crude and produces unreadable words)
    lemmatizer = WordNetLemmatizer()
    documents = [[lemmatizer.lemmatize(token) for token in doc] for doc in documents]
    
    return documents

In [13]:
documents=clean_docs(documents)
print(documents[:2])
print(documents[-2:])

[['virginamerica', 'dhepburn', 'said'], ['virginamerica', 'plus', 'added', 'commercial', 'experience', 'tacky']]
[['americanair', 'money', 'change', 'flight', 'answer', 'phone', 'suggestion', 'make', 'commitment'], ['americanair', 'ppl', 'need', 'know', 'many', 'seat', 'next', 'flight', 'plz', 'put', 'standby', 'people', 'next', 'flight']]


#### Remove airline names from text, e.g. 'americanair', 'virginamerica'

In [14]:
airlines=maindf.airline.unique()
altags=[]
for al in airlines:
    x=maindf.loc[maindf['airline']==al,['text']].text.tolist()[0].partition(' ')[0]
    x=x.lower()
    x=x[1:]         #remove '@'
    altags.append(x)
print('tags to remove:')
print(altags)

STOPWORDS.extend(altags)
STOPWORDS.extend(['http'])
documents=[[token for token in doc if token not in STOPWORDS] for doc in documents]
print('tags removed:')
print(documents[:2])
print(documents[-2:])

tags to remove:
['virginamerica', 'united', 'southwestair', 'jetblue', 'usairways', 'americanair']
tags removed:
[['dhepburn', 'said'], ['plus', 'added', 'commercial', 'experience', 'tacky']]
[['money', 'change', 'flight', 'answer', 'phone', 'suggestion', 'make', 'commitment'], ['ppl', 'need', 'know', 'many', 'seat', 'next', 'flight', 'plz', 'put', 'standby', 'people', 'next', 'flight']]


## Construct a document-term matrix

- Detect bigrams in the documents and append them to each document.
- Build a dictionary using Gensim's Dictionary() class
  - it assigns a unique integer id to each unique token and records token counts
- Remove rare & common words from the dictionary
- Convert each document into a bag-of-words using the dictionary
  - this creates a list of vectors, one per document. Each document vector is a series of tuples (term ID, term frequency)
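
The steps above can be sketched without Gensim to show what the dictionary and the bag-of-words vectors contain. This is a minimal illustration on toy documents; the documents, the scaled-down thresholds and the helper `doc2bow` are assumptions that only mimic the idea behind the library calls.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, in the style of the cleaned tweets).
docs = [
    ['flight', 'delayed', 'hour'],
    ['flight', 'cancelled', 'flight', 'rebooked'],
    ['bag', 'lost', 'hour'],
    ['flight', 'bag', 'lost'],
]

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep tokens that are neither too rare nor too common. The thresholds are
# scaled down for the toy data (the notebook uses no_below=20, no_above=0.5).
no_below, no_above = 2, 0.5
kept = sorted(t for t, df in doc_freq.items()
              if df >= no_below and df / len(docs) <= no_above)

# Assign an integer id to every kept token.
token2id = {token: idx for idx, token in enumerate(kept)}

def doc2bow(doc):
    """Return sorted (token id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

Here 'flight' is dropped because it appears in 3 of 4 documents, which is the same reason very common words disappear from the real dictionary.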

In [15]:
from gensim.corpora import Dictionary
from gensim.models import Phrases

def make_docmatrix(documents,ngram=2):
    
    if ngram not in (1,2):
        raise ValueError('ngram should be 1 or 2')
    if ngram==2:
        # Add bigrams that appear 20 times or more to the documents.
        bigram = Phrases(documents, min_count=20)
        for idx in range(len(documents)):
            for token in bigram[documents[idx]]:
                if '_' in token:
                    # Token is a bigram, add to document.
                    documents[idx].append(token)
    
    # For dictionary representation of the documents.
    dictionary = Dictionary(documents)

    # Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
    dictionary.filter_extremes(no_below=20, no_above=0.5)

    # Vectorize representation of words & n-grams.
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    print('Number of unique tokens: %d' % len(dictionary))
    print('Number of documents: %d' % len(corpus))
    print(dictionary)
    return documents,dictionary,corpus

In [16]:
documents,dictionary,corpus=make_docmatrix(documents,ngram=2)
print('Done. corpus & dictionary constructed.')

Number of unique tokens: 991
Number of documents: 14640
Dictionary(991 unique tokens: ['said', 'added', 'experience', 'plus', 'another']...)
Done. corpus & dictionary constructed.
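
For intuition on how bigrams such as 'customer_service' get picked, the sketch below scores adjacent word pairs with the count-based formula commonly used for phrase detection: (count(a,b) - min_count) * vocabulary size / (count(a) * count(b)). It is a simplified illustration on toy data, not Gensim's implementation; the documents, `min_count=1` and `threshold=1.0` are assumptions.

```python
from collections import Counter

# Toy tokenized documents (hypothetical).
docs = ([['customer', 'service', 'worst']] * 3
        + [['worst', 'flight'], ['worst', 'delay'], ['worst', 'airline']])

unigrams = Counter(token for doc in docs for token in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

def phrase_score(a, b, min_count):
    """(count(a,b) - min_count) * vocab_size / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])

min_count, threshold = 1, 1.0   # assumed toy values; the notebook uses min_count=20
detected = []
for (a, b), n in bigrams.items():
    score = phrase_score(a, b, min_count)
    if score > threshold:
        detected.append(a + '_' + b)
        print('{}_{}: count={}, score={:.2f}'.format(a, b, n, score))
```

'service worst' occurs as often as 'customer service', but 'worst' also appears in many other contexts, so its pair score stays below the threshold.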


## Train an LDA model

The text data is preprocessed and ready for LDA model training. LSA is a linear-algebra method based on matrix factorization (SVD), whereas LDA is a probabilistic generative model and often gives more interpretable topics.
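
To make the "probabilistic" part concrete, LDA assumes each document is a mixture of topics and each topic is a distribution over words. The sketch below only simulates that generative story with made-up numbers; the topics, probabilities and helper names are all hypothetical, and training works in the opposite direction by inferring these distributions from the text.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (each sums to 1).
topics = {
    'delay':   {'flight': 0.4, 'delayed': 0.3, 'gate': 0.2, 'hour': 0.1},
    'luggage': {'bag': 0.5, 'lost': 0.3, 'claim': 0.1, 'hour': 0.1},
}
# Hypothetical topic mixture of one document.
doc_topic_mix = {'delay': 0.7, 'luggage': 0.3}

def sample(dist):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def generate_document(n_words):
    """For each word: pick a topic from the document's mixture, then a word from that topic."""
    return [sample(topics[sample(doc_topic_mix)]) for _ in range(n_words)]

generated = generate_document(8)
print(generated)
```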

In [17]:
from gensim.models import LdaMulticore

import logging

logfn='ldatrain.log'
print('Writing training logs to file:',logfn)
logging.basicConfig(filename=logfn,level=logging.DEBUG)
logger = logging.getLogger()

#logger.setLevel(logging.DEBUG)
#logging.debug("test")

# Set training parameters.
num_topics = 10
chunksize = 500
passes = 100
iterations =400
eval_every =5 

print('num_topics:', num_topics)
print('chunksize: ', chunksize)
print('passes:    ', passes)
print('iterations:', iterations)
print('eval_every:', eval_every)

print('Training started......')
model = LdaMulticore(corpus=corpus,
                 id2word=dictionary,
                 chunksize=chunksize,
                 #alpha='auto',   # not supported by LdaMulticore; available in the single-core LdaModel
                 eta='auto',
                 iterations=iterations, 
                 num_topics=num_topics,
                 passes=passes, 
                 eval_every=eval_every,
                 workers=11,
                 random_state=123)

logger.setLevel(logging.CRITICAL) # set logging back to critical to reduce messages
print('Training done.')

Writing training logs to file: ldatrain.log
num_topics: 10
chunksize:  500
passes:     100
iterations: 400
eval_every: 5
Training started......
Training done.
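
As a rough sense of scale for the settings above: `chunksize` is the number of documents processed per training chunk and `passes` is the number of full sweeps over the corpus, so the amount of work can be estimated with simple arithmetic. This is only a back-of-the-envelope sketch; how chunks are distributed across workers is left out.

```python
import math

n_docs = 14640      # documents in the corpus
chunksize = 500     # documents handed to the model at a time
passes = 100        # full sweeps over the corpus

chunks_per_pass = math.ceil(n_docs / chunksize)
total_chunks = chunks_per_pass * passes
print('chunks per pass:', chunks_per_pass)
print('chunks processed over all passes:', total_chunks)
```

With these values each pass covers 30 chunks, so 100 passes process 3,000 chunks in total, which is why lowering `passes` is the quickest way to shorten training.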


#### Let's see the top topics (trends) identified by the LDA model

In [18]:
# Rank all topics by topic coherence; topn=10 keeps the 10 most probable words of each topic
topcohr=model.top_topics(corpus=corpus, dictionary=dictionary, topn=10)

# make a dictionary of trends with most significant words
topcohrDict={}
for idx, elem in enumerate(topcohr):
    topic_words=[]
    for y in elem[0]:
        topic_words.append(y[1])
    topcohrDict[idx]=topic_words

# display the trends    
for idx,elem in enumerate(topcohrDict.values()):
    print('Comment trend {:2d}: {}'.format(idx+1,'-'.join(elem)))

Comment trend  1: hold-hour-phone-help-get-trying-time-wait-call-minute
Comment trend  2: flight-cancelled-cancelled_flightled-flightled-tomorrow-get-rebooked-got-dfw-today
Comment trend  3: flight-delayed-gate-time-plane-delay-hour-crew-waiting-connection
Comment trend  4: flight-seat-late-get-home-change-late_flight-flighted-cancelled_flighted-cancelled
Comment trend  5: call-problem-back-number-email-yes-get-sent-booking-say
Comment trend  6: bag-plane-luggage-hour-lost-baggage-still-jfk-weather-amp
Comment trend  7: great-thanks-time-right-good-look-trip-airline-hope-flying
Comment trend  8: thanks-thank-help-agent-please-guy-much-need-love-class
Comment trend  9: would-like-fly-staff-better-want-never-anything-people-make
Comment trend 10: service-customer-customer_service-ever-worst-airline-experience-attendant-fleek-fleet
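
The ranking above is based on topic coherence; `top_topics` uses the UMass measure by default, which rewards topics whose top words tend to appear in the same documents. The sketch below is a simplified version of that idea on toy documents. The documents and helper names are assumptions, and Gensim's implementation differs in details such as smoothing and aggregation.

```python
import math
from itertools import combinations

# Toy tokenized documents (hypothetical).
docs = [
    ['bag', 'lost', 'luggage'],
    ['bag', 'lost', 'claim'],
    ['bag', 'luggage'],
    ['thanks', 'great', 'flight'],
]
doc_sets = [set(doc) for doc in docs]

def doc_count(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))

def umass_coherence(top_words):
    """Sum of log((D(later, earlier) + 1) / D(earlier)) over ranked word pairs."""
    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(later, earlier) + 1) / doc_count(earlier))
    return score

coherent = umass_coherence(['bag', 'lost', 'luggage'])   # words that co-occur
mixed = umass_coherence(['bag', 'thanks', 'great'])      # words that rarely co-occur
print(coherent, mixed)
```

Scores closer to zero indicate more coherent topics, so the first word list ranks above the second.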


#### Create query functions to find topics in new text

In [19]:
### Build a DataFrame of topic numbers, words and probabilities
topics_matrix=model.show_topics(formatted=False, num_words=20)
topdf=pd.DataFrame()
topidx=[]
topwords=[]
wordprob=[]
for idx,topic in enumerate(topics_matrix):
    #print('idx=',idx)
    for tupl in topic[1]:
        #print(tupl[1])
        topidx.append(idx)
        topwords.append(tupl[0])
        wordprob.append(tupl[1])
topdf['topidx']=topidx
topdf['topwords']=topwords
topdf['wordprob']=wordprob

def topic_print(topic):
    # topic is a list of (topic id, probability) pairs for one document
    qrydf=pd.DataFrame(topic,columns=['topids','tprobs'])
    qrydf.sort_values(by='tprobs',ascending=False,inplace=True)
    top_ids=qrydf.topids.iloc[:2]   # the two most probable topics (fewer if the model returns fewer)
    
    print('+++++++++++++++++Trends/topics found++++++++++++')
    for idx,val in enumerate(top_ids):
        elem=list(topdf['topwords'].loc[topdf.topidx==val])
        print('++Comment trend {:1d}: {}'.format(idx+1,'-'.join(elem)))

def find_topics(query):
    print('+++++++++++++++Original comment text++++++++++:')
    print(query,'\n')

    query1=clean_docs(query)
    tempcorpus = [dictionary.doc2bow(doc) for doc in query1]
    query_topics=model[tempcorpus]
    for topic in query_topics:
        topic_print(topic)
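
The core of the query step is small: the model returns (topic id, probability) pairs for a document, and the two most probable topics are reported. A minimal sketch of that selection, using made-up probabilities in place of real model output:

```python
import heapq

# Hypothetical output of model[bow] for one document: (topic id, probability) pairs.
doc_topics = [(0, 0.05), (2, 0.48), (5, 0.31), (7, 0.16)]

# The two most probable topics, highest probability first.
top_two = heapq.nlargest(2, doc_topics, key=lambda pair: pair[1])
print(top_two)
```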

#### Find trends/topics in new complaints (predictions)

In [20]:
query=['sitting in the plane at the gate for an hour']
find_topics(query)

+++++++++++++++Original comment text++++++++++:
['sitting in the plane at the gate for an hour'] 

+++++++++++++++++Trends/topics found++++++++++++
++Comment trend 1: flight-delayed-gate-time-plane-delay-hour-crew-waiting-connection-boarding-min-miss-still-update-passenger-going-stuck-pilot-minute
++Comment trend 2: bag-plane-luggage-hour-lost-baggage-still-jfk-weather-amp-get-left-one-checked-check-sitting-night-waiting-claim-lax


In [21]:
query=['cant find my luggage']
find_topics(query)

+++++++++++++++Original comment text++++++++++:
['cant find my luggage'] 

+++++++++++++++++Trends/topics found++++++++++++
++Comment trend 1: bag-plane-luggage-hour-lost-baggage-still-jfk-weather-amp-get-left-one-checked-check-sitting-night-waiting-claim-lax
++Comment trend 2: hold-hour-phone-help-get-trying-time-wait-call-minute-online-reservation-book-agent-flight-tried-website-change-line-amp


In [22]:
query=['took an hour to book a flight']
find_topics(query)

+++++++++++++++Original comment text++++++++++:
['took an hour to book a flight'] 

+++++++++++++++++Trends/topics found++++++++++++
++Comment trend 1: hold-hour-phone-help-get-trying-time-wait-call-minute-online-reservation-book-agent-flight-tried-website-change-line-amp
++Comment trend 2: bag-plane-luggage-hour-lost-baggage-still-jfk-weather-amp-get-left-one-checked-check-sitting-night-waiting-claim-lax


#### Let's try an original tweet

In [23]:
query=[orig_documents[55]]
find_topics(query)

+++++++++++++++Original comment text++++++++++:
["@VirginAmerica hi! i'm so excited about your $99 LGA-&gt;DAL deal- but i've been trying 2 book since last week &amp; the page never loads. thx!"] 

+++++++++++++++++Trends/topics found++++++++++++
++Comment trend 1: hold-hour-phone-help-get-trying-time-wait-call-minute-online-reservation-book-agent-flight-tried-website-change-line-amp
++Comment trend 2: bag-plane-luggage-hour-lost-baggage-still-jfk-weather-amp-get-left-one-checked-check-sitting-night-waiting-claim-lax


## Quick play with Word2Vec for topic modelling

Word2Vec is a shallow, two-layer neural network that learns dense word embeddings, typically with a few hundred dimensions, from the contexts in which words appear. The resulting embeddings are context rich.

I will quickly train a Word2Vec model on the corpus of complaint documents, with a 100-dimensional vector per word, using skip-gram (instead of CBOW). Skip-gram learns to predict the surrounding context words from a single target word.
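
To illustrate what skip-gram trains on, the sketch below lists the (target, context) pairs produced from one tokenized document with a fixed window. It is a simplified illustration: the function name and the example tokens are assumptions, and real implementations add details such as a randomly shrunk window and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token and its neighbours within the window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

tokens = ['bag', 'lost', 'jfk', 'still', 'waiting']
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs)
```

Each pair is one training example: given the target word, the network is asked to predict the context word.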

In [24]:
import gensim
w2vmodel = gensim.models.word2vec.Word2Vec(sentences=documents, 
                                           size=100,  # 100 dimensional vector for each token (gensim < 4.0 name; renamed vector_size in gensim 4.0)
                                           sg=1,      # use skip-gram instead of CBOW
                                           window=5,  # Maximum distance between the current and predicted word within a sentence
                                           workers=12 # CPU threads
                                          )
print('Model trained...print a few words:')
wordlist = list(w2vmodel.wv.index2word)  # gensim < 4.0 name; renamed index_to_key in gensim 4.0
print('vocabulary size=',len(wordlist))
print(wordlist[:20])

Model trained...print a few words:
vocabulary size= 2678
['flight', 'get', 'hour', 'thanks', 'cancelled', 'service', 'time', 'customer', 'help', 'bag', 'plane', 'amp', 'hold', 'need', 'thank', 'one', 'still', 'call', 'please', 'customer_service']


#### Try the Word2Vec model with the same queries

In [25]:
# generate topics using word2vec trained model
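def w2v_pred(query):
    print('++Original Text:',query)
    print('')
    clean_query=clean_docs(query)[0]
    # Keep only tokens the model has a vector for; unknown words would raise a KeyError
    clean_query=[token for token in clean_query if token in w2vmodel.wv]

    print('++Word2Vec produced context words:')
    display(w2vmodel.wv.most_similar(positive=clean_query,topn=10))   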

In [26]:
query=['sitting in the plane at the gate for an hour']
w2v_pred(query)

++Original Text: ['sitting in the plane at the gate for an hour']

++Word2Vec produced context words:


[('tarmac', 0.9787759184837341),
 ('runway', 0.957625687122345),
 ('sitting_plane', 0.9564936757087708),
 ('half', 0.9446218013763428),
 ('waiting', 0.9443740844726562),
 ('left', 0.9437344074249268),
 ('landed', 0.9436891078948975),
 ('sat', 0.9414244890213013),
 ('boarded', 0.9312993288040161),
 ('ground', 0.9284708499908447)]
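
The scores in this output are cosine similarities. For a multi-word query, the query words' vectors are normalized and averaged, and every other word is ranked by its cosine similarity to that average. The sketch below reproduces the idea with made-up 3-dimensional vectors; the vectors and helper names are hypothetical, and a trained model would use its own 100-dimensional embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained model would have 100 dimensions.
vectors = {
    'plane':   [0.9, 0.1, 0.0],
    'gate':    [0.8, 0.3, 0.1],
    'tarmac':  [0.85, 0.2, 0.05],
    'luggage': [0.1, 0.9, 0.2],
    'thanks':  [0.0, 0.1, 0.95],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(query_words, topn=2):
    """Rank the other words by cosine similarity to the mean of the query vectors."""
    units = [unit(vectors[w]) for w in query_words]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    scores = {w: sum(a * b for a, b in zip(mean, unit(v)))
              for w, v in vectors.items() if w not in query_words}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

result = most_similar(['plane', 'gate'])
print(result)
```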

In [27]:
query=['cant find my luggage']
w2v_pred(query)

++Original Text: ['cant find my luggage']

++Word2Vec produced context words:


[('promised', 0.9742435812950134),
 ('went', 0.9700623750686646),
 ('delivered', 0.9678585529327393),
 ('drop', 0.9677227139472961),
 ('international', 0.9675685167312622),
 ('dropped', 0.9672185182571411),
 ('midnight', 0.9671964645385742),
 ('land', 0.9660167098045349),
 ('advance', 0.9650110602378845),
 ('one', 0.9643750786781311)]

In [28]:
query=[orig_documents[55]]
w2v_pred(query)

++Original Text: ["@VirginAmerica hi! i'm so excited about your $99 LGA-&gt;DAL deal- but i've been trying 2 book since last week &amp; the page never loads. thx!"]

++Word2Vec produced context words:


[('3rd', 0.9923524856567383),
 ('wed', 0.9887734055519104),
 ('track', 0.9883015155792236),
 ('pushing', 0.9882409572601318),
 ('birmingham', 0.9878718852996826),
 ('button', 0.987572968006134),
 ('client', 0.9874402284622192),
 ('jacksonville', 0.9873881340026855),
 ('operating', 0.9873400330543518),
 ('wth', 0.986469030380249)]

#### Word2Vec conclusion: Not too shabby for such a quick and dirty try. With a little more time spent on tuning hyperparameters and trying different parameters, it has good potential to find topics.

### Remarks

Interesting results considering an hour's work. 

Many things could be done to improve the model:
1. Get (or scrape) the airlines' responses to these queries/comments, so that a predictive model can be trained to respond (a chatbot)
2. Hyperparameter tuning
3. Try more iterations and different batch sizes
4. Try trigrams and 4-grams
5. This quick example uses only the 'text' data; the non-text data (time, day of week, origin, destination and other fields) could also be used to improve the model
6. Improve, tune and retrain the Word2Vec model
