# FAQ Chatbot from Twitter Data
The objectives of this assignment are:
1. Fetch the data from Google Drive by mounting the storage (completed).
2. Perform Exploratory Data Analysis (EDA) to reveal patterns and trends that are relevant to the business problem identified.
3. Clean the data so that its quality is good enough for the machine learning model to give the best results.
4. Train and evaluate the model on the data.

Prior to the above activity, we wish to ensure a live link that can respond to API calls so that this task doesn't cause a bottleneck.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#NLTK for natural language processing
import nltk
#for maths
import numpy as np
#for string manipulation
import string
#for importing and managing our dataset
import pandas as pd
#for pre-processing our dataset
import re

nltk.download('stopwords')

import plotly.express as px
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**First, when developing the chatbot, I imported the dataset the chatbot should work from.**

**Import the CSV file from Google Drive**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Chatbot_dataset/twcs.csv')

In [None]:
df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


**Performing Exploratory Data Analysis (EDA)**

When `inbound` is True, the tweet was sent by a customer (usually a question or complaint); when it is False, it is a reply from a company support account.

In [None]:
inbound_count = df.inbound.value_counts()
px.pie(names = inbound_count.index, values = inbound_count.values,title = 'Inbound = True vs False',width = 500,height = 300)

Here we analyze the number of tweets per author and display the top 50.

In [None]:
cap = 50
brand_count = df.author_id.value_counts().head(cap)
px.bar(brand_count,title = 'tweets per author - top {}'.format(cap),width = 800,height = 400)

In [None]:
counter = Counter()
for line in df['text']:
    for word in line.split():
      counter[word]+=1
px.bar(pd.Series(dict(counter.most_common(50))),title = 'Most common 50 words')
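
A small illustration of why counts taken from raw, whitespace-split text are noisy. The three sample tweets are made up for this example and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample tweets, invented for this illustration
sample_tweets = [
    "The update broke my phone",
    "the phone is slow after the update.",
    "My phone keeps freezing",
]

raw_counter = Counter()
for line in sample_tweets:
    for word in line.split():
        raw_counter[word] += 1

# Case and trailing punctuation create separate keys for the same word
print(raw_counter["The"], raw_counter["the"])        # 1 2
print(raw_counter["update"], raw_counter["update."])  # 1 1
print(raw_counter.most_common(1))                     # [('phone', 3)]
```

Lower-casing and punctuation removal merge these variants into a single key.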


## Code for data cleansing

The cleaning steps are applied in this order:
convert all words to lower case,
remove stop words (words that occur commonly across all documents),
remove punctuation, and lemmatize the remaining words.


In [None]:
stopwords = nltk.corpus.stopwords.words('english')
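# spaCy 3.x no longer accepts the 'en' shortcut; load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])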

def to_lower(x):
    return str(x).lower()
def remove_stopwords(x):
    return ' '.join(i for i in x.split() if i not in stopwords)
def remove_punctuation(x):
    punctuations = string.punctuation
    return x.translate(str.maketrans('','', punctuations))
def lemmatized(x):
    text = nlp(x)
    return ' '.join(token.lemma_ for token in text)

# Master function with sub-function calls()
def clean_text(x):
    return lemmatized(remove_punctuation(remove_stopwords(to_lower(x))))
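
To see what the first three cleaning steps do without loading NLTK data or a spaCy model, here is a minimal sketch. The tiny `DEMO_STOPWORDS` set and the `basic_clean` name are assumptions made for this illustration; the notebook itself uses the full NLTK English stop-word list and adds lemmatization on top.

```python
import string

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's full list)
DEMO_STOPWORDS = {"i", "my", "the", "is", "a", "and", "to", "have"}

def basic_clean(text):
    """Lower-case, drop stop words, then strip punctuation (no lemmatization)."""
    lowered = str(text).lower()
    kept = " ".join(w for w in lowered.split() if w not in DEMO_STOPWORDS)
    return kept.translate(str.maketrans("", "", string.punctuation))

print(basic_clean("@AppleSupport My iPhone is FROZEN, and I have to restart!"))
# applesupport iphone frozen restart
```

Because stop words are removed before punctuation, a token such as `the,` is not recognised as a stop word and survives as `the`. Stripping punctuation first would avoid this, though it would also change how contractions such as `don't` are matched.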


In [None]:
# Take the first 200 customer tweets; .copy() avoids writing to a view of df
df_q = df[df['inbound']].head(200).copy()
df_q['cleaned'] = df_q['text'].apply(clean_text)

To show each question and its response in separate columns, we merge the table with itself.

In [None]:
merged = pd.merge(df[df['inbound']],df,left_on = 'tweet_id', right_on = 'in_response_to_tweet_id',how = 'left').dropna(subset = ['tweet_id_y'])
merged.head()

Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
1,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1,4.0,1.0,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
2,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4,6.0,4.0,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
3,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,6.0,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0
4,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,9.0,sprintcare,False,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...,,8.0
5,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,10.0,sprintcare,False,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...,,8.0
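
The self-merge can be checked on a toy table. The three rows below are invented for the illustration; only the column names and the merge call follow the notebook.

```python
import pandas as pd

# Hypothetical three-tweet conversation: customer, company reply, customer again
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "inbound": [True, False, True],
    "text": ["my phone is slow", "try a restart", "thanks"],
    "in_response_to_tweet_id": [None, 1.0, 2.0],
})

# Left side: customer tweets; right side: any tweet that replies to them
pairs = pd.merge(
    toy[toy["inbound"]], toy,
    left_on="tweet_id", right_on="in_response_to_tweet_id",
    how="left",
).dropna(subset=["tweet_id_y"])

print(pairs[["text_x", "text_y"]].values.tolist())
# [['my phone is slow', 'try a restart']]
```

The customer tweet "thanks" has no reply, so its `tweet_id_y` is missing after the left join and `dropna` removes it.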


In [None]:
merged.shape,df.shape

((1450335, 14), (2811774, 7))

Since this is a big dataset, we select only the conversations answered by AppleSupport.

In [None]:
# .copy() so that new columns are added to an independent frame, not a view of merged
data = merged[merged['author_id_y']=='AppleSupport'].copy()


In [None]:
data['cleaned_txt'] = data['text_x'].apply(clean_text)

In [None]:
data['cleaned_txt_author'] = data['cleaned_txt'].str.replace('applesupport','')
data['cleaned_txt_author']

252               new update i️ make sure download yesterday
253                                      httpstconv0yucs0 lb
255                        try reset setting   restart phone
256                             look like httpstcoxcqu2l4xub
257                                  i️ iphone 7 plus yes i️
                                 ...                        
1684095    anyone issue osx highsierra slack zoom multipl...
1684159    hey    able duplicate file page search really ...
1684160    yo  weird glitch w capital " i️ " attempt make...
1684196    fuck  phone keep hang call show " call failure...
1684226    anyone iphone issue phone freeze randomly 7 pu...
Name: cleaned_txt_author, Length: 106646, dtype: object

In [None]:
data[['cleaned_txt_author','text_y']]

Unnamed: 0,cleaned_txt_author,text_y
252,new update i️ make sure download yesterday,@115854 Lets take a closer look into this issu...
253,httpstconv0yucs0 lb,@115854 We're here for you. Which version of t...
255,try reset setting restart phone,@115855 Let's go to DM for the next steps. DM ...
256,look like httpstcoxcqu2l4xub,@115855 Any steps tried since it started last ...
257,i️ iphone 7 plus yes i️,@115855 That's great it has iOS 11.1 as we can...
...,...,...
1684095,anyone issue osx highsierra slack zoom multipl...,@823737 We're happy to help out with your conc...
1684159,hey able duplicate file page search really ...,@689907 We're certainly glad to get you pointe...
1684160,"yo weird glitch w capital "" i️ "" attempt make...",@823765 We'd love to help! Which device are yo...
1684196,"fuck phone keep hang call show "" call failure...",@823779 We'd like to help. Send us a DM and we...


In [None]:
counter = Counter()
for line in data['cleaned_txt']:
    for word in line.split():
      counter[word]+=1
px.bar(pd.Series(dict(counter.most_common(50))),title = 'Most common 50 words')


In [None]:
data[['text_x','text_y']].head()

Unnamed: 0,text_x,text_y
252,@AppleSupport The newest update. I️ made sure ...,@115854 Lets take a closer look into this issu...
253,@AppleSupport https://t.co/NV0yucs0lB,@115854 We're here for you. Which version of t...
255,@AppleSupport Tried resetting my settings .. r...,@115855 Let's go to DM for the next steps. DM ...
256,@AppleSupport This is what it looks like https...,@115855 Any steps tried since it started last ...
257,@AppleSupport I️ have an iPhone 7 Plus and yes...,@115855 That's great it has iOS 11.1 as we can...


In [None]:
json_input = {
    "intents":[
    {
     "tag":"iTunes",
    "patterns": ["Music doesn't seem to work","Can't play any songs","Apple Music is stuck"],
    "responses":["Try contacting our iTunes Store team here for more help: https://t.co/SDIe7UiyJN",
                "Sorry to hear that. Please DM us with your apple ID and we will look into this"]
    },
    {
     "tag":"hardware",
    "patterns": ["button not working","home button","Slow after update"],
    "responses":["Oh, this seems like a hardware problem. Please reach out to us over DM",
                "Uh-oh this doesn't seem like something we can solve over chat. Can you visit an apple service center"]
    },
    {
     "tag":"software",
    "patterns": ["app not working","camera bug","slow phone"],
    "responses":["Please check if restarting helps the issue?",
                "Sorry to hear that. Let's start with a quick restart test?"]
    },
    {
    "tag": "opentoday",
     "patterns": ["Are you open today?", "When do you open today?", "What are your hours today?"],
     "responses": ["We're open every day from 9am-9pm", "Our hours are 9am-9pm every day"]
    },
    {"tag": "greeting",
     "patterns": ["Hi", "How are you", "Is anyone there?", "Hello", "Good day","Hey"],
     "responses": ["Hello, thanks for visiting", "Good to see you again", "Hi there, how can I help?"],
     "context_set": ""
    },
    {"tag": "goodbye",
     "patterns": ["Bye", "See you later", "Goodbye"],
     "responses": ["See you later, thanks for visiting", "Have a nice day", "Bye! Come back again soon."]
    },
    {"tag": "thanks",
     "patterns": ["Thanks", "Thank you", "That's helpful"],
     "responses": ["Happy to help!", "Any time!", "My pleasure"]
    },
    {
     "tag": "AppleWatch",
     "patterns": ["My apple watch is not staying up long", "Battery life of Apple Watch is too less", "Apple Watch drains battery life"],
     "responses": ["Happy to help!", "Any time!", "My pleasure"]
    }
    ]
}

We use TFLearn, a high-level library built on TensorFlow. TensorFlow takes multi-dimensional arrays (tensors) and constructs a graph of operations from inputs to output.

In [None]:
!pip install tflearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tflearn
  Downloading tflearn-0.5.0.tar.gz (107 kB)
     |████████████████████████████████| 107 kB 7.6 MB/s
Building wheels for collected packages: tflearn
  Building wheel for tflearn (setup.py) ... done
  Created wheel for tflearn: filename=tflearn-0.5.0-py3-none-any.whl size=127299 sha256=8a3c1b5f73eead76d3e56ea206e6e5891d51bf386d6a1fc66b0ff9451e086c8d
  Stored in directory: /root/.cache/pip/wheels/5f/14/2e/1d8e28cc47a5a931a2fb82438c9e37ef9246cc6a3774520271
Successfully built tflearn
Installing collected packages: tflearn
Successfully installed tflearn-0.5.0


In [None]:
import nltk
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

import numpy
import tflearn
import tensorflow
import random
nltk.download('punkt')

json_input

Instructions for updating:
non-resource variables are not supported in the long term
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


{'intents': [{'patterns': ["Music doesn't seem to work",
    "Can't play any songs",
    'Apple Music is stuck'],
   'responses': ['Try contacting our iTunes Store team here for more help: https://t.co/SDIe7UiyJN',
    'Sorry to hear that. Please DM us with your apple ID and we will look into this'],
   'tag': 'iTunes'},
  {'patterns': ['button not working', 'home button', 'Slow after update'],
   'responses': ['Oh, this seems like a hardware problem. Please reach out to us over DM',
    "Uh-oh this doesn't seem like something we can solve over chat. Can you visit an apple service center"],
   'tag': 'hardware'},
  {'patterns': ['app not working', 'camera bug', 'slow phone'],
   'responses': ['Please check if restarting helps the issue?',
    "Sorry to hear that. Let's start with a quick restart test?"],
   'tag': 'software'},
  {'patterns': ['Are you open today?',
    'When do you open today?',
    'What are your hours today?'],
   'responses': ["We're open every day from 9am-9pm",
  

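The loop below goes through the intents JSON, tokenizes each pattern, and collects the vocabulary (`words`), the tokenized patterns (`docs_x`), the tag of each pattern (`docs_y`), and the list of distinct tags (`labels`), so that each pattern can later be looked up by its category.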

In [None]:
words = []
labels = []
docs_x = []
docs_y = []

for intent in json_input['intents']:
    for pattern in intent['patterns']:
        wrds = nltk.word_tokenize(pattern)
        words.extend(wrds)
        docs_x.append(wrds)
        docs_y.append(intent["tag"])
        
    if intent['tag'] not in labels:
        labels.append(intent['tag'])

In [None]:

words = [stemmer.stem(w.lower()) for w in words if w != "?"]
words = sorted(list(set(words)))

labels = sorted(labels)

training = []
output = []

out_empty = [0 for _ in range(len(labels))]

for x, doc in enumerate(docs_x):
    bag = []

    wrds = [stemmer.stem(w.lower()) for w in doc]

    for w in words:
        if w in wrds:
            bag.append(1)
        else:
            bag.append(0)

    output_row = out_empty[:]
    output_row[labels.index(docs_y[x])] = 1

    training.append(bag)
    output.append(output_row)


training = numpy.array(training)
output = numpy.array(output)
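
The encoding built by the loop above can be illustrated on a single pattern. This is a minimal sketch: the five-word vocabulary, the two tags and the `encode` helper are hypothetical and much smaller than what the notebook produces.

```python
import numpy as np

# Hypothetical, already-stemmed vocabulary and label list (sorted, as in the notebook)
vocab = ["button", "hello", "hi", "not", "work"]
tags = ["greeting", "hardware"]

def encode(tokens, tag):
    """Return (bag-of-words vector, one-hot label) for one tokenized pattern."""
    bag = [1 if w in tokens else 0 for w in vocab]
    one_hot = [0] * len(tags)
    one_hot[tags.index(tag)] = 1
    return np.array(bag), np.array(one_hot)

bag, label = encode(["button", "not", "work"], "hardware")
print(bag)    # [1 0 0 1 1]
print(label)  # [0 1]
```

Each row of `training` is such a bag vector and each row of `output` is the matching one-hot label.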

## Neural Network 

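Building the neural network to train the chatbot: an input layer as wide as the vocabulary, two fully connected hidden layers of 8 units each, and a softmax output layer with one unit per intent tag.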

In [None]:
# tensorflow.reset_default_graph()
tensorflow.compat.v1.reset_default_graph()

net = tflearn.input_data(shape=[None, len(training[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(output[0]), activation="softmax")
net = tflearn.regression(net)

model = tflearn.DNN(net)
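
To make the shapes concrete, the following NumPy sketch mimics the forward pass of this architecture with random, untrained weights. It is only an illustration, not how TFLearn implements it: the sizes `n_words = 40` and `n_tags = 8` are assumed for the example (8 matches the number of intent tags above, 40 is arbitrary), biases start at zero, and the hidden layers are kept linear on the assumption that `tflearn.fully_connected` is left at its default activation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_hidden, n_tags = 40, 8, 8  # assumed sizes for illustration

# Random, untrained weights; biases start at zero
W1, b1 = rng.normal(size=(n_words, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_hidden)), np.zeros(n_hidden)
W3, b3 = rng.normal(size=(n_hidden, n_tags)), np.zeros(n_tags)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    h1 = x @ W1 + b1              # first hidden layer (linear)
    h2 = h1 @ W2 + b2             # second hidden layer (linear)
    return softmax(h2 @ W3 + b3)  # one probability per intent tag

x = rng.integers(0, 2, size=(1, n_words)).astype(float)  # one bag-of-words row
probs = forward(x)
print(probs.shape)  # (1, 8)
```

The output row sums to 1, so each entry can be read as the probability of one tag.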

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Training the chatbot for 1000 epochs with a batch size of 8

In [None]:
model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)

Training Step: 3999  | total loss: 0.03458 | time: 0.024s
| Adam | epoch: 1000 | loss: 0.03458 - acc: 1.0000 -- iter: 24/27
Training Step: 4000  | total loss: 0.03563 | time: 0.033s
| Adam | epoch: 1000 | loss: 0.03563 - acc: 1.0000 -- iter: 27/27
--


Saving the trained chatbot model

In [None]:
model.save("model_new")

INFO:tensorflow:/content/model_new is not in all_model_checkpoint_paths. Manually adding it.
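
The saved checkpoint holds only the network weights. Prediction also needs the `words` vocabulary and the `labels` list in exactly the same order, so it may be worth storing them next to the model. A minimal sketch, assuming a JSON file; the file name `vocab_labels.json`, the helper names and the sample values are hypothetical:

```python
import json
import os
import tempfile

def save_vocab(path, words, labels):
    """Write the vocabulary and label order needed at prediction time."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"words": words, "labels": labels}, f)

def load_vocab(path):
    """Read back the vocabulary and labels in their original order."""
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)
    return saved["words"], saved["labels"]

# Sample values for illustration only
demo_words = ["button", "hello", "hi"]
demo_labels = ["greeting", "hardware"]

path = os.path.join(tempfile.mkdtemp(), "vocab_labels.json")
save_vocab(path, demo_words, demo_labels)
restored_words, restored_labels = load_vocab(path)
print(restored_words == demo_words, restored_labels == demo_labels)  # True True
```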


`bag_of_words` converts a user sentence into the same bag-of-words vector that was used for training. `chat` runs the prediction loop: for each input it predicts the most likely tag and prints a random response for that tag.

In [None]:
def bag_of_words(s, words):
    bag = [0 for _ in range(len(words))]

    s_words = nltk.word_tokenize(s)
    s_words = [stemmer.stem(word.lower()) for word in s_words]

    for se in s_words:
        for i, w in enumerate(words):
            if w == se:
                bag[i] = 1
            
    return numpy.array(bag)

def chat():
    print("Start talking with the bot (type quit to stop)!")
    while True:
        inp = input("You: ")
        if inp.lower() == "quit":
            break

        results = model.predict([bag_of_words(inp, words)])
        results_index = numpy.argmax(results)
        tag = labels[results_index]

        for tg in json_input["intents"]:
            if tg['tag'] == tag:
                responses = tg['responses']

        print(random.choice(responses))


In [None]:
model.predict([bag_of_words('Hey', words)])[0]

array([1.9161955e-05, 7.4997260e-03, 9.5861787e-01, 1.9166844e-04,
       5.2523818e-03, 1.3167132e-02, 1.5032289e-02, 2.1965800e-04],
      dtype=float32)
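
The vector above holds one probability per entry of the sorted `labels` list, so index 2 (about 0.96) corresponds to `greeting`. `chat()` always answers with the top tag, however low its probability. One possible refinement, which is not part of the original notebook, is to fall back to a clarifying message when the model is unsure. A minimal sketch, where `pick_tag` and the 0.7 threshold are assumptions:

```python
import numpy as np

# Labels in the sorted order used by the notebook
labels = ["AppleWatch", "goodbye", "greeting", "hardware",
          "iTunes", "opentoday", "software", "thanks"]

def pick_tag(probs, labels, threshold=0.7):
    """Return the most likely tag, or None if its probability is below threshold."""
    best = int(np.argmax(probs))
    return labels[best] if probs[best] >= threshold else None

# Probabilities printed above for the input 'Hey'
hey = np.array([1.9161955e-05, 7.4997260e-03, 9.5861787e-01, 1.9166844e-04,
                5.2523818e-03, 1.3167132e-02, 1.5032289e-02, 2.1965800e-04])
print(pick_tag(hey, labels))                # greeting

# A made-up, flat distribution: no tag is confident enough
print(pick_tag(np.full(8, 0.125), labels))  # None
```

When `pick_tag` returns `None`, the chat loop could ask the user to rephrase instead of choosing a response.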

Start the chat loop to test that the bot works.

In [None]:
chat()

Start talking with the bot (type quit to stop)!
You: hi
Good to see you again
You: i need help
Any time!
