In [1]:
import pandas as pd
import glob
import os
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
files = glob.glob(os.path.join("conversation_data", "*.txt"))

I pulled 302 conversations from Intercom and used them for this proof of concept (PoC).

In [3]:
len(files)

302

The text needed some cleaning first: stripping newline characters and collapsing repeated spaces.

In [4]:
def remove_multiple_spacing(string):
    return re.sub(' +', ' ', string.strip())

In [5]:
def remove_newLine_character(string):
    return string.replace('\n', '')

In [6]:
conversations = list()

for file in files:
    with open(file) as f:
        conversations.append(remove_multiple_spacing(remove_newLine_character(f.read())))
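One thing to watch in the cleaning step: replacing '\n' with an empty string fuses the last word of one line with the first word of the next whenever the line does not end in a space. A minimal sketch of the difference, using a made-up two-line string (the helper `clean_whitespace` is hypothetical and not part of the notebook):

```python
import re

def clean_whitespace(text):
    # Collapse every run of whitespace (spaces, tabs, newlines, \xa0) into one space.
    return re.sub(r"\s+", " ", text).strip()

sample = "Hi, I created a USSD code\nin the Sandbox"

# Same steps as the two helpers above: drop '\n', then collapse spaces.
fused = re.sub(" +", " ", sample.replace("\n", "").strip())
print(fused)                     # Hi, I created a USSD codein the Sandbox
print(clean_whitespace(sample))  # Hi, I created a USSD code in the Sandbox
```

If the exported files always end each line with a space, the original helpers behave the same; the single-regex version just does not rely on that.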

Here's what one conversation looks like at this point.
The number at the start is the conversation ID.
For now, I have dropped the context of who says what in the dialog and am working with just the content of the messages.

In [7]:
conversations[0]

"12750452841 Hi, I have created a USSD service code in the Sandbox and I am not able to access it externally is it possible to do this? Africa's Talking typically replies in under 5m. In the meantime, these articles might help: Why am I receiving ‘Dear customer, the network is experiencing technical problems and your request was not processed. Please try again later’ response from AT API? Grace Why am I getting the error 'Supplied Authentication is Invalid'? Liz Kathure Why am I receiving 'Connection MMI code' response from AT API? Grace More in the Help Center Good afternoon When testing the service on sandbox the code will be delivered to the simulator and not the phone. You need to launch the simulator. Thanks - I keep getting this error message on the simulator A moment please for tech team. Anthony Maina Kindly look into this and revert. Hi What's your username? timothy.muchai@m-kopa.com Hi Timothy,\xa0 Sorry for the delayed response that error you are getting is due the fact that

I tokenize the conversations, which just means turning each one from a single string into a list of words.

In [8]:
tokenize = lambda doc: doc.lower().split(" ")

In [9]:
tokenized_conversations = [tokenize(d) for d in conversations]
tokenized_conversations[0]

['12750452841',
 'hi,',
 'i',
 'have',
 'created',
 'a',
 'ussd',
 'service',
 'code',
 'in',
 'the',
 'sandbox',
 'and',
 'i',
 'am',
 'not',
 'able',
 'to',
 'access',
 'it',
 'externally',
 'is',
 'it',
 'possible',
 'to',
 'do',
 'this?',
 "africa's",
 'talking',
 'typically',
 'replies',
 'in',
 'under',
 '5m.',
 'in',
 'the',
 'meantime,',
 'these',
 'articles',
 'might',
 'help:',
 'why',
 'am',
 'i',
 'receiving',
 '‘dear',
 'customer,',
 'the',
 'network',
 'is',
 'experiencing',
 'technical',
 'problems',
 'and',
 'your',
 'request',
 'was',
 'not',
 'processed.',
 'please',
 'try',
 'again',
 'later’',
 'response',
 'from',
 'at',
 'api?',
 'grace',
 'why',
 'am',
 'i',
 'getting',
 'the',
 'error',
 "'supplied",
 'authentication',
 'is',
 "invalid'?",
 'liz',
 'kathure',
 'why',
 'am',
 'i',
 'receiving',
 "'connection",
 'mmi',
 "code'",
 'response',
 'from',
 'at',
 'api?',
 'grace',
 'more',
 'in',
 'the',
 'help',
 'center',
 'good',
 'afternoon',
 'when',
 'testing',
 

Then comes the TF-IDF.
TF = Term Frequency. This is the number of times a word appears in a document (each conversation is a document), normalized by the number of words in the document.
IDF = Inverse Document Frequency. This is the logarithm of the total number of documents divided by the number of documents that contain the term, so a word that shows up in nearly every conversation scores low and a rare word scores high.

Multiplying the two values gives us the TF-IDF of each term in each document.
This process also turns the conversations into vectors of the same length. With a mathematical representation of the conversations, we can now start using them in mathematical ways...read ML.
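To make the arithmetic concrete, here is a minimal hand-rolled sketch on a made-up three-document corpus. It takes TF as count over document length and IDF as the natural log of total documents over documents containing the term; the function names and the toy data are illustrative only.

```python
import math
from collections import Counter

# Toy corpus: three tiny "conversations", already tokenized.
docs = [
    ["ussd", "code", "sandbox", "ussd"],
    ["sms", "delivery", "report"],
    ["ussd", "simulator", "error"],
]

def term_frequency(term, doc):
    # Occurrences of the term divided by the number of tokens in the document.
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    # log(total documents / documents containing the term).
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(tf_idf("ussd", docs[0], docs))     # frequent here, but found in 2 of 3 documents
print(tf_idf("sandbox", docs[0], docs))  # rarer across the corpus, so a higher IDF
```

scikit-learn's `TfidfVectorizer` computes a variant of this: it starts from raw counts, `sublinear_tf=True` replaces a count tf with 1 + ln(tf), `smooth_idf=False` gives an IDF of ln(n / df) + 1, and `norm='l2'` rescales each document vector to unit length. The scores will therefore not match this sketch number for number.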

In [10]:
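# min_df=1 keeps every term; newer scikit-learn releases expect an integer min_df to be at least 1
tfidf = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize, stop_words='english')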

tfidf_representation = tfidf.fit_transform(conversations)

Each conversation is now a vector of length 5016, which is the number of unique terms left in the whole corpus after English stop words are removed. A vector holds zero wherever the term is not in the document and the term's TF-IDF score wherever it is.

In [11]:
len(tfidf_representation.toarray()[0].tolist())

5016

In [17]:
len(tfidf_representation.toarray()[1].tolist())

5016

In [18]:
len(tfidf_representation.toarray()[2].tolist())

5016
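The length checks above convert the whole sparse matrix to a dense array each time. A lighter way to confirm the same thing is to read the matrix shape directly. A small sketch on a made-up corpus with a default `TfidfVectorizer` (so the vocabulary size will differ from the 5016 above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the conversations.
toy_conversations = [
    "ussd code not working in sandbox",
    "sms delivery report missing",
    "ussd simulator shows error",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_conversations)

# Rows are documents, columns are vocabulary terms, so every row has the same width.
n_documents, n_terms = matrix.shape
print(n_documents, n_terms)

# Stored (non-zero) entries in the first row, read without building a dense array.
print(matrix[0].nnz)
```

On the real data, `tfidf_representation.shape` would give the number of conversations and the vocabulary size in one call.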

In [19]:
tfidf_representation.toarray()[1].tolist()

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 ...

I summed each term's TF-IDF score across all conversations and ordered the terms by that total, in descending order. Remember, this score tells us which terms are deemed important.

In [22]:
def display_scores(vectorizer, tfidf_result):
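    # http://stackoverflow.com/questions/16078015/
    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
    scores = zip(vectorizer.get_feature_names_out(),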
                 np.asarray(tfidf_result.sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    
    for item in sorted_scores:
        print("{0:50} Score: {1}".format(item[0], item[1]))

In [23]:
display_scores(tfidf, tfidf_representation)

good                                               Score: 11.033274821008682
talking                                            Score: 10.243935691162273
africa's                                           Score: 10.201436401483356
way                                                Score: 9.60753187334947
reach                                              Score: 9.603043820493477
you:                                               Score: 9.489083138559138
kathure                                            Score: 9.167721806064488
liz                                                Score: 9.167721806064488
hello                                              Score: 8.816842543125015
hi                                                 Score: 8.646615989940456
assist                                             Score: 8.427008284347888
account                                            Score: 8.296337867848871
you?                                               Score: 7.653115606614245
help      

# Conclusion

The output isn't useful yet, but at the least it tells us that TF-IDF is a good start. I was hoping to see the various products emerge among the top-scored terms; instead we have:
good
talking
africa's
way
reach

First, consider that the corpus used here has only 302 conversations. In total we have about 20,000, so we should get more meaningful results by running this on the whole dataset.
Second, these particular terms all happen to come from the autoresponses the bot gives when there's no one online to handle a client request. Moving forward, I'm thinking of filtering those out first.
Third, I still need to clean the data a whole lot more. I see "hi", "hello" and "hey", which are all variants of the same thing, along with contractions like "I'm" and "I've". Those also need to be taken care of.
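The second and third points could be handled in one cleaning pass before vectorizing. A rough sketch, assuming the bot phrases are the ones visible in the sample conversation earlier and that a small hand-made lookup is enough for greetings and contractions (`AUTO_RESPONSES`, `GREETINGS`, `CONTRACTIONS` and both function names are hypothetical):

```python
import re

# Bot phrases copied from the sample conversation earlier in this notebook.
# The full list is an assumption and would need to come from Intercom.
AUTO_RESPONSES = [
    "Africa's Talking typically replies in under 5m.",
    "In the meantime, these articles might help:",
    "More in the Help Center",
]
# Hypothetical lookup tables; extend them after inspecting the real vocabulary.
GREETINGS = {"hi", "hello", "hey"}
CONTRACTIONS = {"i'm": "i am", "i've": "i have"}

def strip_auto_responses(text):
    # Remove each known bot phrase before any other processing.
    for phrase in AUTO_RESPONSES:
        text = text.replace(phrase, " ")
    return text

def normalize_tokens(text):
    text = strip_auto_responses(text).lower().replace("’", "'")
    # Expand the listed contractions while the apostrophes are still present.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Keep runs of letters, digits and apostrophes, which drops ',', '?', ':' and so on.
    tokens = [t.strip("'") for t in re.findall(r"[a-z0-9']+", text)]
    # Map every greeting variant onto one placeholder token.
    return ["<greeting>" if t in GREETINGS else t for t in tokens if t]

sample = (
    "Hi, I'm stuck with my USSD code. "
    "Africa's Talking typically replies in under 5m. "
    "Hello? I've tried the simulator"
)
print(normalize_tokens(sample))
```

A function like `normalize_tokens` could be passed as the `tokenizer` argument of `TfidfVectorizer` in place of the whitespace split. The suggested article titles that follow the bot message vary per conversation, so a fixed phrase list would not catch them.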

# Moving Forward

I did some reading on what we are trying to achieve, which is essentially 'Topic Modelling'.

I'd like to try clustering the documents before getting the TF-IDF scores. I want to see if the different products will emerge in those clusters given the vocabulary used around them and then compare that method with getting TF-IDF scores from the entire corpus.

Once we are satisfied with the results of the TF-IDF, then we can train a classification model that can do real-time tagging of the conversations as they are happening and see how that goes.
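As a rough idea of what that last step could look like, here is a sketch of a TF-IDF plus logistic regression pipeline. The texts and tags are invented for illustration, and logistic regression is just one reasonable first model, not a decision that has been made:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples; real tags would come from the clustering
# step or from manual labelling.
texts = [
    "ussd code not reachable from sandbox",
    "ussd session ends on the simulator",
    "sms delivery report is missing",
    "bulk sms not delivered to customers",
]
tags = ["ussd", "ussd", "sms", "sms"]

# The pipeline vectorizes the text and feeds the vectors to the classifier.
tagger = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
tagger.fit(texts, tags)

# Tag a new, unseen message.
print(tagger.predict(["my ussd menu returns an error"]))
```

Because the vectorizer lives inside the pipeline, a new message is transformed with the vocabulary learned at training time, which is what tagging live conversations would need.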