## Problem Statement

Given a chat script containing a group of chat transcripts between customers and agents, classify each customer question into one of several classification groups. These groups will later be tied to intents for creating chatbots.


### Phase 1: Data Cleanup and Preprocessing
__Step 1:__ Before loading the file, add a line *Raw chat* as the first line of the txt file. This ensures that pandas picks up the first line as the column name; otherwise it will use the first data line, such as *"Activity Id: 1286172"*, as the column name.

In [None]:
import pandas as pd

chat_df = pd.read_fwf(filepath_or_buffer="eGain Transcript sample.txt", delimiter="\n" )
# Fix the column name
chat_df = chat_df.rename(index=str, columns={'Raw chat': 'raw_chat'})


In [18]:
chat_df.head()

Unnamed: 0,raw_chat
0,Activity ID: 143595
1,2017-06-16 18:36:06 Stuart Parker: Refinancing...
2,2017-06-16 18:36:14 system: You are now chatti...
3,2017-06-16 18:36:17 David L.: ^0aWelcome to We...
4,2017-06-16 18:36:28 David L.: The first step w...
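
The manual header edit in Step 1 could be avoided by reading the lines in Python and naming the column directly. The following is only a sketch under assumptions: `load_raw_chat` is a hypothetical helper, the two-line sample stands in for the real transcript file, and blank lines are assumed to carry no information.

```python
import io
import pandas as pd

def load_raw_chat(source):
    """Read one transcript line per row into a single 'raw_chat' column.

    Hypothetical helper: `source` is any iterable of lines, such as an open file.
    """
    lines = [line.rstrip("\n") for line in source]
    # Assumption: blank lines are noise, so every remaining row is either
    # an "Activity ID: xxx" header or a timestamped chat line.
    lines = [line for line in lines if line.strip()]
    return pd.DataFrame({"raw_chat": lines})

# Small in-memory sample standing in for the transcript file.
sample = io.StringIO(
    "Activity ID: 143595\n"
    "2017-06-16 18:36:06 Stuart Parker: Refinancing a recent auto loan\n"
)
sample_df = load_raw_chat(sample)
print(sample_df)
```

With the real file, the same helper would be called on an open file handle, for example `with open("eGain Transcript sample.txt") as fh: chat_df = load_raw_chat(fh)`.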


__Step 2__: Extract the timestamp and the chat text into two separate columns

In [19]:
# The timestamp, e.g. "2017-06-16 18:36:06", is 19 characters long and is followed by a space.
# Lines that start with "Activity ID: xxx" carry no timestamp, so they are skipped.
def getTimeStamp(row):
    raw_chat = row['raw_chat']
    if raw_chat.count("Activity") == 0:
        return raw_chat[0:19]
    return None  # Activity ID header rows have no timestamp


In [20]:
#Cut time stamp and create a new column
chat_df['timestamp'] = chat_df.apply(getTimeStamp, axis=1)

In [21]:
chat_df.head(30)

Unnamed: 0,raw_chat,timestamp
0,Activity ID: 143595,
1,2017-06-16 18:36:06 Stuart Parker: Refinancing...,2017-06-16 18:36:06
2,2017-06-16 18:36:14 system: You are now chatti...,2017-06-16 18:36:14
3,2017-06-16 18:36:17 David L.: ^0aWelcome to We...,2017-06-16 18:36:17
4,2017-06-16 18:36:28 David L.: The first step w...,2017-06-16 18:36:28
5,2017-06-16 18:36:44 Stuart Parker:,2017-06-16 18:36:44
6,2017-06-16 18:36:44 Stuart Parker: I purchased...,2017-06-16 18:36:44
7,2017-06-16 18:37:02 Stuart Parker: and I had i...,2017-06-16 18:37:02
8,2017-06-16 18:37:29 Stuart Parker: but I was u...,2017-06-16 18:37:29
9,2017-06-16 18:37:29 Stuart Parker:,2017-06-16 18:37:29


>To extract just the chat text, strip the leading timestamp from each row. Rows that start with Activity ID have no timestamp and are kept as they are.

In [22]:
# If the row is not an "Activity ID: xxxx" header, drop the 19-character timestamp
# and the space after it, returning the text from position 20 onward.
def getChatText(row):
    raw_chat = row['raw_chat']
    if raw_chat.count("Activity") == 0:
        return raw_chat[20:]
    else:
        return raw_chat

In [23]:
chat_df['chat'] = chat_df.apply(getChatText, axis=1)

In [24]:
chat_df.head()

Unnamed: 0,raw_chat,timestamp,chat
0,Activity ID: 143595,,Activity ID: 143595
1,2017-06-16 18:36:06 Stuart Parker: Refinancing...,2017-06-16 18:36:06,Stuart Parker: Refinancing a recent auto loan
2,2017-06-16 18:36:14 system: You are now chatti...,2017-06-16 18:36:14,system: You are now chatting with David L.
3,2017-06-16 18:36:17 David L.: ^0aWelcome to We...,2017-06-16 18:36:17,David L.: ^0aWelcome to Wells Fargo. How can I...
4,2017-06-16 18:36:28 David L.: The first step w...,2017-06-16 18:36:28,David L.: The first step would be to fill out ...
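
As an alternative to the two row-wise functions in Step 2, the timestamp and the chat text could be split in one pass with a regular expression. This is a sketch, not the notebook's method: the sample rows and the `pattern` name are assumptions for illustration. Because the pattern is anchored at the start of the line, a chat message that merely mentions the word "Activity" would still be treated as a chat line.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the raw_chat column.
sample_df = pd.DataFrame({"raw_chat": [
    "Activity ID: 143595",
    "2017-06-16 18:36:06 Stuart Parker: Refinancing a recent auto loan",
    "2017-06-16 18:36:44 Stuart Parker:",
]})

# Group "timestamp": the 19-character timestamp at the start of the line.
# Group "chat": everything after the single space that follows it (may be empty).
pattern = r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ?(?P<chat>.*)$"
parts = sample_df["raw_chat"].str.extract(pattern)

sample_df["timestamp"] = parts["timestamp"]
# Rows without a timestamp (the Activity ID headers) keep their full text.
sample_df["chat"] = parts["chat"].fillna(sample_df["raw_chat"])
print(sample_df[["timestamp", "chat"]])
```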


__Step 3:__ Drop the column **raw_chat**, then create a column **activity** and populate it with the chat activity ID. Not every row in **chat** contains the keyword Activity ID. So on the first occurrence of __Activity ID__, extract the ID and save it in the global variable **activity_id**. For every line after that, populate the column from the global variable. When the next **Activity ID: xxxx** is found, a new chat session has started, so update the value of the global variable **activity_id**.

In [25]:
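# Global variable holding the ID of the current chat session
activity_id = 0

def getActivityId(row):
    global activity_id
    chat = row['chat']
    if chat.count("Activity") == 0:
        return activity_id
    else:
        # Header row: remember the value after the text "Activity ID: " (13 characters)
        activity_id = chat[13:]
        return None  # header rows themselves get no ID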

In [26]:
# Drop the raw_chat column
chat_df = chat_df.drop(columns="raw_chat")

# Create the activity column and fill it row by row
chat_df['activity'] = chat_df.apply(getActivityId, axis=1)

In [27]:
chat_df.head(30)

Unnamed: 0,timestamp,chat,activity
0,,Activity ID: 143595,
1,2017-06-16 18:36:06,Stuart Parker: Refinancing a recent auto loan,143595.0
2,2017-06-16 18:36:14,system: You are now chatting with David L.,143595.0
3,2017-06-16 18:36:17,David L.: ^0aWelcome to Wells Fargo. How can I...,143595.0
4,2017-06-16 18:36:28,David L.: The first step would be to fill out ...,143595.0
5,2017-06-16 18:36:44,Stuart Parker:,143595.0
6,2017-06-16 18:36:44,Stuart Parker: I purchased a car recently (***...,143595.0
7,2017-06-16 18:37:02,Stuart Parker: and I had intended to have my d...,143595.0
8,2017-06-16 18:37:29,Stuart Parker: but I was unaware that I needed...,143595.0
9,2017-06-16 18:37:29,Stuart Parker:,143595.0
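
The carry-forward logic of Step 3 can also be expressed without a global variable, by extracting the ID on header rows and forward-filling it. This is a minimal sketch under assumptions: the second session (`143596`, "Jane Doe") is invented sample data, and the ID is assumed to consist only of digits.

```python
import pandas as pd

# Hypothetical sample with two chat sessions; the second one is made up.
sample_df = pd.DataFrame({"chat": [
    "Activity ID: 143595",
    "Stuart Parker: Refinancing a recent auto loan",
    "system: You are now chatting with David L.",
    "Activity ID: 143596",
    "Jane Doe: Hello",
]})

# Pull the digits out of header rows; every other row becomes NaN.
ids = sample_df["chat"].str.extract(r"^Activity ID:\s*(\d+)", expand=False)

# Forward-fill so each chat line inherits the most recent header's ID,
# then blank out the header rows themselves, as in the output above.
is_header = ids.notna()
sample_df["activity"] = ids.ffill().mask(is_header)
print(sample_df)
```

Because nothing is stored outside the DataFrame, re-running the cell gives the same result regardless of earlier runs, which is not guaranteed when a global variable keeps its last value.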


**Step 4:** Drop all rows where **chat** starts with **Activity ID: xxx**, or where the **timestamp** or **activity** column is **None**. These rows are not required since we have already captured the activity ID information in its own column.

In [28]:
import numpy as np

# Remove the Activity ID header rows (their timestamp and activity are missing)
chat_df = chat_df.replace(to_replace='None', value=np.nan).dropna()

In [29]:
chat_df.head()

Unnamed: 0,timestamp,chat,activity
1,2017-06-16 18:36:06,Stuart Parker: Refinancing a recent auto loan,143595
2,2017-06-16 18:36:14,system: You are now chatting with David L.,143595
3,2017-06-16 18:36:17,David L.: ^0aWelcome to Wells Fargo. How can I...,143595
4,2017-06-16 18:36:28,David L.: The first step would be to fill out ...,143595
5,2017-06-16 18:36:44,Stuart Parker:,143595
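
To make the Step 4 condition explicit, the drop can be restricted to the two columns named in the text. The sketch below uses a small hypothetical frame; it relies on the fact that pandas treats Python `None` as a missing value in `dropna`.

```python
import pandas as pd

# Hypothetical sample: one header row followed by two chat rows.
sample_df = pd.DataFrame({
    "timestamp": [None, "2017-06-16 18:36:06", "2017-06-16 18:36:14"],
    "chat": [
        "Activity ID: 143595",
        "Stuart Parker: Refinancing a recent auto loan",
        "system: You are now chatting with David L.",
    ],
    "activity": [None, "143595", "143595"],
})

# Drop only the rows where timestamp or activity is missing (the header rows).
cleaned = sample_df.dropna(subset=["timestamp", "activity"])
print(cleaned)
```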


__Step 5:__ Drop all the unnecessary lines, such as:
- "Welcome to Wells Fargo. How can I help you today?"
- "system: Chat has been initiated by customer."
- "system: You are now chatting with David L."
- "David L.: David L. has ended the chat"

In [32]:
chat_df=chat_df[~chat_df['chat'].str.contains('Welcome to Wells Fargo.')]
chat_df=chat_df[~chat_df['chat'].str.contains('has ended the chat')]
chat_df=chat_df[~chat_df['chat'].str.contains('Chat has been initiated by customer')]
chat_df=chat_df[~chat_df['chat'].str.contains('You are now chatting with')]

In [33]:
chat_df.describe()

Unnamed: 0,timestamp,chat,activity
count,29868,29868,29868
unique,28427,22438,4555
top,2017-07-25 17:12:58,David L.: Thank you for your interest in Wells...,1103696
freq,4,1278,52
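
The Step 5 filters could also be combined into a single pass by joining the phrases into one pattern. This is a sketch on hypothetical sample rows; the names `boilerplate` and `pattern` are introduced here for illustration. Note one difference from the cell above: `re.escape` makes the trailing "." match a literal period, whereas `str.contains` by default treats it as a regular-expression wildcard.

```python
import re
import pandas as pd

# Hypothetical sample rows modelled on the transcript lines shown earlier.
sample_df = pd.DataFrame({"chat": [
    "Stuart Parker: Refinancing a recent auto loan",
    "system: You are now chatting with David L.",
    "David L.: ^0aWelcome to Wells Fargo. How can I help you today?",
    "system: Chat has been initiated by customer.",
    "David L.: David L. has ended the chat",
]})

# Phrases that mark system or greeting lines rather than customer questions.
boilerplate = [
    "Welcome to Wells Fargo.",
    "has ended the chat",
    "Chat has been initiated by customer",
    "You are now chatting with",
]

# Escape each phrase so it matches literally, then join with "|" (regex OR).
pattern = "|".join(re.escape(phrase) for phrase in boilerplate)
filtered = sample_df[~sample_df["chat"].str.contains(pattern)]
print(filtered)
```

Keeping the phrases in one list means a new boilerplate line can be filtered by adding a single entry.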
