This script is a first pass at classifying which tickets can be automated. Using a hand-labeled set of "automate/manual" tickets, we want to determine whether the automate label correlates with the frequency of key words. The first idea is to choose a set of key words by hand, but it is difficult to know these a priori, so counting all words or running a topic analysis will likely be a better approach.
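
As a minimal sketch of the "count all words" idea, assuming each ticket has already been reduced to a list of tokens and carries an A/M label, the per-label word frequencies could be tallied with `collections.Counter`. The token lists, the labels, and the helper name `count_words_by_label` below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical tokenized tickets with hand labels ("A" = automate, "M" = manual)
labeled_tokens = [
    (["cancel", "order", "refund"], "A"),
    (["order", "hack", "email"], "M"),
    (["cancel", "groupon", "refund"], "A"),
]

def count_words_by_label(labeled_tokens):
    """Return {label: Counter of all tokens seen in tickets with that label}."""
    counts = {}
    for tokens, label in labeled_tokens:
        counts.setdefault(label, Counter()).update(tokens)
    return counts

word_counts = count_words_by_label(labeled_tokens)
for label, counter in sorted(word_counts.items()):
    # most_common(3) lists the three most frequent tokens for this label
    print(label, counter.most_common(3))
```

Words that are frequent under one label and rare under the other are candidates for the curated word list, without having to guess them in advance.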


In [None]:
#load spacy to use for text preprocessing. It takes a lot of memory, and needs its own cell.
import spacy
import re
nlp = spacy.load('en')


In [32]:
def cleantext(text):
    """
    input: text, str. The raw text from an individual ticket.
    output: token_lemma, list of str. The lemmatized tokens.
    The raw text is processed to remove junk, then tokenized. Stopwords, punctuation,
    numbers, and whitespace are dropped, and each remaining word is lemmatized.
    """
    #remove the drop down menu text, user name, and order
    dd1 = re.search("(Customer Contact Reason Drop-down1:.*\n)", text)
    if dd1:
        text = text[dd1.end():]
    dd2 = re.search("(Customer Contact Reason Drop-down2:.*\n)", text)
    if dd2:
        text = text[dd2.end():]
    issue = re.search("Please give us a quick idea of what's going on:", text)
    if issue:
        text = text[issue.end():]
    user = re.search("User Name.*\n", text)
    if user:
        text = text[:user.start()]
    order = re.search("Order:.*\n", text)
    if order:
        text = text[:order.start()]
    parsed_text = nlp(text)
    #lemmatize text, remove stopwords, punctuation, numbers, and whitespace
    token_lemma = [token.lemma_ for token in parsed_text
                   if not token.is_stop and not token.is_punct
                   and not token.like_num and not token.is_space]
    #returns a list of words
    return token_lemma

def wordcounter(ticket_text, words):
    """
    input:  ticket_text: list of strings. A tokenized version of the ticket text
            words: list of strings. The words to be counted
    output: dict mapping each word to its count in ticket_text
    """
    return {word: ticket_text.count(word) for word in words}
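
The regex steps in cleantext can be tried without loading spaCy. The sketch below repeats only the form-field stripping on a ticket invented for illustration. The helper name `strip_form_fields` and the sample text are assumptions, and the final `strip()` is added here for readability (cleantext instead relies on spaCy to discard whitespace tokens).

```python
import re

def strip_form_fields(text):
    """Keep only the free-text part of a ticket, mirroring the regex steps in cleantext."""
    # Drop everything up to and including each leading form field, if present
    for pattern in (r"Customer Contact Reason Drop-down1:.*\n",
                    r"Customer Contact Reason Drop-down2:.*\n",
                    r"Please give us a quick idea of what's going on:"):
        match = re.search(pattern, text)
        if match:
            text = text[match.end():]
    # Drop everything from each trailing form field onward, if present
    for pattern in (r"User Name.*\n", r"Order:.*\n"):
        match = re.search(pattern, text)
        if match:
            text = text[:match.start()]
    return text.strip()

# Hypothetical ticket, invented for illustration
sample = (
    "\nCustomer Contact Reason Drop-down1: Item delivery\n"
    "Customer Contact Reason Drop-down2: Cancel order\n"
    "Please give us a quick idea of what's going on: I want to cancel my order.\n"
    "User Name: Jane Doe\n"
    "Order: 12345\n"
)
print(strip_form_fields(sample))
```

Text that contains none of the form fields passes through unchanged, which matches the `if` guards in cleantext.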
        
    

In [11]:
#load data
import pandas as pd
import os
data_file_path = os.path.join('..', 'SatisfAI-data', 'randomsample500.csv')
out_dir = os.path.join('..', 'SatisfAI-data', 'Dec_data')
randomsample500 = pd.read_csv(data_file_path)
randomsample500.head()


Unnamed: 0,tickets_id,tickets_assignee_id,tickets_created_at,tickets_custom_field_20180577,tickets_custom_field_21024293,tickets_custom_field_22526998,tickets_custom_field_24112003,tickets_custom_field_27311858,tickets_custom_field_505895,tickets_description,...,metric_sets_reply_time_in_minutes_calendar,metric_sets_requester_updated_at,metric_sets_requester_wait_time_in_minutes_business,metric_sets_requester_wait_time_in_minutes_calendar,metric_sets_solved_at,metric_sets_status_updated_at,metric_sets_updated_at,ticket_channel,lob,automate
0,52814760,14143076948,24:01.0,622331045.0,refund_issued,reason4_2g13,ctlamkl@yahoo.com,LG-TNVK-M1FC-JX4T-R2FC,gg-xbox-360-camouflage-wireless-controller,\nCustomer Contact Reason Drop-down1: Item del...,...,557,24:01.0,401,557,40:48.0,09:00.0,40:48.0,Contact Us,Goods,A
1,52412687,3958975757,02:06.0,608726425.0,refund_not_discussed,reason4_2f1,rescuerenee@gmail.com,LG-YWSX-72ZV-7H74-KPKH,wow-starbucks-6,\nCustomer Contact Reason Drop-down1: Managing...,...,4,25:05.0,0,8,04:09.0,07:29.0,04:10.0,Contact Us,Local,M
2,51805897,1180628008,24:40.0,255136503.0,refund_issued,reason4_2f3,d_castrosr@yahoo.com,241897357-0-1,mymms-com-9-boston,\nCustomer Contact Reason Drop-down1: Problem ...,...,101,02:33.0,1329,3324,27:38.0,09:57.0,27:39.0,Contact Us,Local,A
3,52297843,6071705957,13:59.0,619034955.0,refund_not_discussed,reason4_2g19,vkuhn@nycap.rr.com,GG-19N2-7XKL-32XN-N759,gg-tulipmist-ultrasonic-essential-oil-diffuser,\nCustomer Contact Reason Drop-down1: Item del...,...,525,13:58.0,406,525,58:42.0,06:38.0,58:43.0,Contact Us,Goods,M
4,52424023,14676246228,03:36.0,628595395.0,refund_issued,reason4_2f10,snewtonrobbins@gmail.com,LG-7VRW-36BM-WC29-BSTM,shari-s-berries-1103-orange-county,\nCustomer Contact Reason Drop-down1: Somethin...,...,477,12:05.0,0,477,00:26.0,03:46.0,00:26.0,Contact Us,Local,


In [12]:
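#nab the columns we are interested in for text processing
#.copy() makes subset an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
subset = randomsample500[["tickets_id", "tickets_description", "automate"]].copy()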
subset.head()

Unnamed: 0,tickets_id,tickets_description,automate
0,52814760,\nCustomer Contact Reason Drop-down1: Item del...,A
1,52412687,\nCustomer Contact Reason Drop-down1: Managing...,M
2,51805897,\nCustomer Contact Reason Drop-down1: Problem ...,A
3,52297843,\nCustomer Contact Reason Drop-down1: Item del...,M
4,52424023,\nCustomer Contact Reason Drop-down1: Somethin...,


In [18]:
#test on one ticket
text = randomsample500["tickets_description"][0]
tokens = cleantext(text)
counts = wordcounter(tokens, ["cancel", "order"])


{'cancel': 1}
{'cancel': 1, 'order': 2}


In [21]:
counts = wordcounter(tokens, ["cancel", "order", "rabbit"])
print(counts)

cancel
{'cancel': 1}
order
{'cancel': 1, 'order': 2}
rabbit
{'cancel': 1, 'order': 2, 'rabbit': 0}
end loop  {'cancel': 1, 'order': 2, 'rabbit': 0}
{'cancel': 1, 'order': 2, 'rabbit': 0}


In [30]:
#tokenize the text of all tickets
subset["tokens"] = subset['tickets_description'].apply(cleantext)



In [31]:
subset.head()

Unnamed: 0,tickets_id,tickets_description,automate,tokens
0,52814760,\nCustomer Contact Reason Drop-down1: Item del...,A,"[order, activity, regard, shipping, like, canc..."
1,52412687,\nCustomer Contact Reason Drop-down1: Managing...,M,"[think, hack, order, item, not, recognize, ema..."
2,51805897,\nCustomer Contact Reason Drop-down1: Problem ...,A,"[be, have, issue, redeem, groupon]"
3,52297843,\nCustomer Contact Reason Drop-down1: Item del...,M,"[wonder, track, item, say, ship, click, happen]"
4,52424023,\nCustomer Contact Reason Drop-down1: Somethin...,,"[cancel, groupon, shipping, charge, price, item]"


In [33]:
#get the word count for user curated word list
wordlist = ["cancel", "order", "rabbit"]
for w in wordlist:
    subset[w] = subset["tokens"].apply(lambda tok: tok.count(w))



In [34]:
subset.head()

Unnamed: 0,tickets_id,tickets_description,automate,tokens,cancel,order,rabbit
0,52814760,\nCustomer Contact Reason Drop-down1: Item del...,A,"[order, activity, regard, shipping, like, canc...",1,2,0
1,52412687,\nCustomer Contact Reason Drop-down1: Managing...,M,"[think, hack, order, item, not, recognize, ema...",0,2,0
2,51805897,\nCustomer Contact Reason Drop-down1: Problem ...,A,"[be, have, issue, redeem, groupon]",0,0,0
3,52297843,\nCustomer Contact Reason Drop-down1: Item del...,M,"[wonder, track, item, say, ship, click, happen]",0,0,0
4,52424023,\nCustomer Contact Reason Drop-down1: Somethin...,,"[cancel, groupon, shipping, charge, price, item]",1,0,0


In [35]:
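#find the correlation between the word counts and automate
#note: automate holds the strings "A"/"M", so it does not appear in the correlation matrix below; it needs to be mapped to numbers first
corr_cols = ["automate"] + wordlist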
print(corr_cols)
print(subset[corr_cols].corr())

['automate', 'cancel', 'order', 'rabbit']
         cancel    order  rabbit
cancel  1.00000  0.26646     NaN
order   0.26646  1.00000     NaN
rabbit      NaN      NaN     NaN
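
The matrix above has no automate row because that column holds the strings "A" and "M". One possible way to bring it into the correlation, sketched here on a made-up miniature table rather than the real data, is to map the labels to 1/0 and drop the unlabeled rows first. The frame `toy`, its values, and the column name `automate_num` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the subset table; values invented for illustration
toy = pd.DataFrame({
    "automate": ["A", "M", "A", "M", None],
    "cancel":   [1, 0, 2, 0, 1],
    "order":    [2, 2, 0, 1, 0],
    "rabbit":   [0, 0, 0, 0, 0],
})

# Map the string labels to numbers; unlabeled rows (None) become NaN
toy["automate_num"] = toy["automate"].map({"A": 1, "M": 0})

# Drop unlabeled rows, then take Pearson correlations among the numeric columns
labeled = toy.dropna(subset=["automate_num"])
corr = labeled[["automate_num", "cancel", "order", "rabbit"]].corr()
print(corr["automate_num"])
```

A column that is constant, like rabbit here, has zero variance, so its Pearson correlation is undefined and shows up as NaN, just as in the output above.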
