# DOMAIN: Digital content management
## CONTEXT: 
Classification is probably the most common task you will deal with in practice. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing anything about him or her. We are going to create a classifier that predicts multiple features of the author of a given text. We have framed it as a multi-label classification problem.

### DATA DESCRIPTION: 
Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

• 8240 "10s" blogs (ages 13-17),

• 8086 "20s" blogs (ages 23-27) and

• 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

## PROJECT OBJECTIVE: 
The need is to build an NLP classifier which can use input text parameters to determine the label/s of the blog.
## Steps and tasks: [ Total Score: 20 points]
1. Import and analyse the data set.
2. Perform data pre-processing on the data:

*   Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.
*   Target/label merger and transformation
*   Train and test split
*   Vectorisation, etc.

3. Design, train, tune and test the best text classifier.
4. Display and explain in detail the classification report
5. Print the true vs predicted labels for any 5 entries from the dataset.

* Hint: The aim here is to import the text and process it in such a way that it can be taken as an input to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.

In [27]:
import warnings
warnings.filterwarnings('ignore')

In [28]:
!find / -iname 'libdevice'
!find / -iname 'libnvvm.so'


/usr/local/cuda-10.1/nvvm/libdevice
/usr/local/cuda-10.0/nvvm/libdevice
/usr/local/cuda-11.0/nvvm/libdevice
/usr/local/cuda-10.1/nvvm/lib64/libnvvm.so
/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so
/usr/local/cuda-11.0/nvvm/lib64/libnvvm.so


In [29]:
import os
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/local/cuda-11.0/nvvm/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/local/cuda-11.0/nvvm/lib64/libnvvm.so"

In [32]:
from numba import vectorize 
from numba import jit, cuda 

In [2]:
import tensorflow as tf
# Enable GPU in the Colab settings for running the code faster
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [5]:
from textblob import TextBlob
import spacy
from wordcloud import WordCloud
from nltk.stem.snowball import SnowballStemmer

# 1. Import and analyse the data set.


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
project_path = '/content/drive/MyDrive/Colab/NLP/Project1/'

In [69]:
df = pd.read_csv(project_path + 'blogtext.csv', index_col=False)
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [70]:
df.shape

(681284, 7)

In [71]:
df.dtypes

id         int64
gender    object
age        int64
topic     object
sign      object
date      object
text      object
dtype: object

In [72]:
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


In [73]:
# Taking a smaller sample data for initial analysis
df = df.sample(50000, random_state=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 25639 to 266424
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      50000 non-null  int64 
 1   gender  50000 non-null  object
 2   age     50000 non-null  int64 
 3   topic   50000 non-null  object
 4   sign    50000 non-null  object
 5   date    50000 non-null  object
 6   text    50000 non-null  object
dtypes: int64(2), object(5)
memory usage: 3.1+ MB


# 2. Perform data pre-processing on the data:


In [74]:
df = df.drop(['id', 'date'], axis=1)
df.head()

Unnamed: 0,gender,age,topic,sign,text
25639,male,33,indUnk,Pisces,Let's say you have friends that hav...
216060,male,15,Technology,Aries,Was officially the COOLEST FUCKING ...
633204,male,17,Student,Gemini,"Apparently, a few people consider..."
582291,male,27,indUnk,Aries,His nose is too big for his face. Eyes...
366878,female,27,indUnk,Gemini,urlLink urlLink 16-feb-04


In [75]:
#df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,gender,age,topic,sign,text
25639,male,33,indUnk,Pisces,Let's say you have friends that hav...
216060,male,15,Technology,Aries,Was officially the COOLEST FUCKING ...
633204,male,17,Student,Gemini,"Apparently, a few people consider..."
582291,male,27,indUnk,Aries,His nose is too big for his face. Eyes...
366878,female,27,indUnk,Gemini,urlLink urlLink 16-feb-04


## Data preprocessing:
Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.

In [76]:
import re
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [77]:
#@jit(target='cuda')
def process_data(text):
  # Remove unwanted characters, keeping only alphabetic words
  alpha_text = re.sub(r'[^A-Za-z]+', ' ', text)
  # Convert to lowercase
  lower_text = alpha_text.lower()
  # Remove stop words
  stop_words = set(stopwords.words('english'))
  text_wo_sw = ' '.join([word for word in lower_text.split() if word not in stop_words])
  # Lemmatization
  # NOTE: text_wo_sw is a string, so this loop calls the lemmatizer on one
  # character at a time and the text is returned essentially unchanged.
  # Lemmatizing text_wo_sw.split() and joining with ' ' would work per word.
  lemma = WordNetLemmatizer()
  processed_text = [lemma.lemmatize(word) for word in text_wo_sw]
  processed_text = "".join(processed_text)
  return processed_text
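
In `process_data`, the lemmatizer loop iterates over `text_wo_sw`, which is a string, so each call receives a single character rather than a word. A minimal sketch of a word-level variant follows. To keep it runnable without NLTK data, `STOP_WORDS` and `toy_lemmatize` are hypothetical stand-ins for `stopwords.words('english')` and `WordNetLemmatizer().lemmatize`; they are not what the notebook uses, and the toy rule only strips a plural "s".

```python
import re

# Hypothetical stand-ins (assumptions) so the sketch runs without NLTK data.
STOP_WORDS = {"has", "been", "the", "a", "is"}

def toy_lemmatize(word):
    """Toy rule: strip a trailing plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def process_words(text, stop_words=STOP_WORDS, lemmatize=toy_lemmatize):
    # Keep alphabetic runs only, then lowercase
    lowered = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # Split once, drop stop words, and lemmatize each remaining word
    tokens = [lemmatize(w) for w in lowered.split() if w not in stop_words]
    # Join with a space so word boundaries are preserved
    return " ".join(tokens)

sample = "Info has been found (+/- 100 pages,. urlLink urlLink 16-feb-04"
print(process_words(sample))  # info found page urllink urllink feb
```

Passing the real NLTK lemmatizer as `lemmatize` would be the natural substitution, at the cost of changing the processed text and everything computed from it downstream.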

In [78]:
# Check how the process_data function works on a smaller text
t = "Info has been found (+/- 100 pages,. urlLink urlLink 16-feb-04"
t1 = process_data(t)
print(t1)

info found pages urllink urllink feb


In [80]:
# Once the above step is successful, apply the function to the full dataset
df['processed_text'] = df.text.apply(process_data)
df.head()

Unnamed: 0,gender,age,topic,sign,text,processed_text
25639,male,33,indUnk,Pisces,Let's say you have friends that hav...,let say friends stood past return always good ...
216060,male,15,Technology,Aries,Was officially the COOLEST FUCKING ...,officially coolest fucking day ever blake brot...
633204,male,17,Student,Gemini,"Apparently, a few people consider...",apparently people considered cory gang massive...
582291,male,27,indUnk,Aries,His nose is too big for his face. Eyes...,nose big face eyes soft little browns intimate...
366878,female,27,indUnk,Gemini,urlLink urlLink 16-feb-04,urllink urllink feb


## Target/label merger and transformation

As there are multiple classes (gender, age, topic and sign), we will merge them into a single list of labels per post.

In [81]:
# Copy the slice so the later dtype conversion does not modify a view of df
all_labels = df[['gender', 'age', 'topic', 'sign']].copy()
all_labels.head()

Unnamed: 0,gender,age,topic,sign
25639,male,33,indUnk,Pisces
216060,male,15,Technology,Aries
633204,male,17,Student,Gemini
582291,male,27,indUnk,Aries
366878,female,27,indUnk,Gemini


In [82]:
all_labels.dtypes

gender    object
age        int64
topic     object
sign      object
dtype: object

In [83]:
# Age is int64, so lets convert it into string
all_labels['age'] = all_labels['age'].astype('str')
all_labels.dtypes

gender    object
age       object
topic     object
sign      object
dtype: object

In [84]:
all_labels.shape

(50000, 4)

In [85]:
# One list of [gender, age, topic, sign] per row
m = all_labels.values.tolist()


In [86]:
df['labels'] = m

In [87]:
df.head()

Unnamed: 0,gender,age,topic,sign,text,processed_text,labels
25639,male,33,indUnk,Pisces,Let's say you have friends that hav...,let say friends stood past return always good ...,"[male, 33, indUnk, Pisces]"
216060,male,15,Technology,Aries,Was officially the COOLEST FUCKING ...,officially coolest fucking day ever blake brot...,"[male, 15, Technology, Aries]"
633204,male,17,Student,Gemini,"Apparently, a few people consider...",apparently people considered cory gang massive...,"[male, 17, Student, Gemini]"
582291,male,27,indUnk,Aries,His nose is too big for his face. Eyes...,nose big face eyes soft little browns intimate...,"[male, 27, indUnk, Aries]"
366878,female,27,indUnk,Gemini,urlLink urlLink 16-feb-04,urllink urllink feb,"[female, 27, indUnk, Gemini]"


In [88]:
# Create a new dataframe with processed_text and the labels only
new_df = df[['processed_text', 'labels']]
new_df.sample(5)

Unnamed: 0,processed_text,labels
395438,got home picking kids checked phone messages o...,"[female, 33, indUnk, Scorpio]"
253710,professionally totally urllink stressed trying...,"[female, 25, indUnk, Sagittarius]"
614190,yesh got spot blogspot pun intended hm wait co...,"[male, 16, Engineering, Libra]"
426766,hot pants got hot pants,"[female, 24, Arts, Libra]"
420790,urllink protect personal info,"[female, 24, Student, Libra]"


In [89]:
# Check for the most frequent label combinations
new_df.labels.astype('str').value_counts()

['male', '34', 'indUnk', 'Aries']           335
['female', '16', 'Student', 'Libra']        272
['female', '16', 'Student', 'Aries']        265
['female', '23', 'indUnk', 'Aries']         262
['female', '26', 'indUnk', 'Cancer']        250
                                           ... 
['female', '33', 'Banking', 'Cancer']         1
['female', '26', 'Accounting', 'Gemini']      1
['female', '26', 'Fashion', 'Pisces']         1
['male', '39', 'Internet', 'Aquarius']        1
['male', '35', 'indUnk', 'Leo']               1
Name: labels, Length: 3740, dtype: int64

In [90]:
# Check for null values
new_df.isna().sum()

processed_text    0
labels            0
dtype: int64

## Train and test split
First separate the data into X and y, then do a train-test split.

In [91]:
X = new_df['processed_text']
y = new_df['labels']

In [92]:
print(X.shape, y.shape)

(50000,) (50000,)


In [93]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

In [94]:
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(35000,) (35000,) (15000,) (15000,)


## Vectorization

In [95]:
from sklearn.feature_extraction.text import CountVectorizer

In [96]:
vectorizer = CountVectorizer(ngram_range=(1,2), binary=True)

In [97]:
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
X_train_vect

<35000x2230871 sparse matrix of type '<class 'numpy.int64'>'
	with 6038756 stored elements in Compressed Sparse Row format>

In [98]:
vectorizer.get_feature_names()[:10]

['aa',
 'aa aa',
 'aa aaa',
 'aa aaaa',
 'aa aaaaa',
 'aa aaaaaa',
 'aa aaaaaaa',
 'aa aaaaaaaa',
 'aa aaaaaaaaa',
 'aa aaaaaaaaaaaa']
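
The vectorizer above is configured with `ngram_range=(1,2)` and `binary=True`, so each feature records whether a unigram or bigram occurs in a post, not how often. The following is a rough pure-Python sketch of that idea. It tokenizes with `str.split`, which is an assumption and differs from `CountVectorizer`'s own token pattern; the helper names `ngrams` and `encode` are hypothetical.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n in [n_min, n_max] as strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# Two short processed posts taken from the sample rows shown earlier
docs = ["hot pants got hot pants", "urllink protect personal info"]

# Vocabulary: sorted set of every unigram and bigram seen in the documents
vocab = sorted({g for d in docs for g in ngrams(d.split())})

def encode(doc):
    """Binary presence vector over vocab (repeats count once)."""
    present = set(ngrams(doc.split()))
    return [1 if term in present else 0 for term in vocab]

print(len(vocab))            # 13
print(sum(encode(docs[0])))  # 6
```

Bigrams multiply the vocabulary quickly, which is consistent with the 2,230,871 columns reported for the training matrix.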

In [99]:
print(X_train_vect[:3], vectorizer.get_feature_names()[:3])

  (0, 302028)	1
  (0, 134140)	1
  (0, 1793634)	1
  (0, 1740066)	1
  (0, 1979318)	1
  (0, 1356137)	1
  (0, 2171836)	1
  (0, 2227169)	1
  (0, 1653204)	1
  (0, 1150089)	1
  (0, 888211)	1
  (0, 1125477)	1
  (0, 1971384)	1
  (0, 1749840)	1
  (0, 1915272)	1
  (0, 57182)	1
  (0, 1325152)	1
  (0, 669806)	1
  (0, 1785274)	1
  (0, 1141122)	1
  (0, 1271649)	1
  (0, 449003)	1
  (0, 1726989)	1
  (0, 1981710)	1
  (0, 1284698)	1
  :	:
  (1, 54944)	1
  (1, 1832435)	1
  (1, 63090)	1
  (1, 744667)	1
  (2, 1971384)	1
  (2, 2061775)	1
  (2, 2193773)	1
  (2, 2114587)	1
  (2, 1948708)	1
  (2, 1221823)	1
  (2, 733824)	1
  (2, 1192394)	1
  (2, 1203022)	1
  (2, 435811)	1
  (2, 768721)	1
  (2, 2115245)	1
  (2, 2068555)	1
  (2, 1952831)	1
  (2, 2195564)	1
  (2, 1221855)	1
  (2, 735475)	1
  (2, 1973953)	1
  (2, 1193495)	1
  (2, 1203209)	1
  (2, 436032)	1 ['aa', 'aa aa', 'aa aaa']


In [100]:
# Use the labels dataframe created earlier to build a dictionary of label values and their counts
all_labels.head()

Unnamed: 0,gender,age,topic,sign
25639,male,33,indUnk,Pisces
216060,male,15,Technology,Aries
633204,male,17,Student,Gemini
582291,male,27,indUnk,Aries
366878,female,27,indUnk,Gemini


In [101]:
keys = []
values = []
for i in range(all_labels.shape[1]):
  col = all_labels.iloc[:, i].value_counts()
  for j in range(col.shape[0]):
    keys.append(col.index[j])
    values.append(col.iloc[j])

In [102]:
# Check a sample key value pair
i = 32
print(keys[i], values[i])

Education 2151


In [103]:
dictionary = dict(zip(keys, values))

### Convert the labels using MultiLabelBinarizer

In [104]:
from sklearn.preprocessing import MultiLabelBinarizer

In [105]:
mlb = MultiLabelBinarizer(classes=sorted(dictionary.keys()))
y_train_mlb = mlb.fit_transform(y_train)
y_test_mlb = mlb.transform(y_test)


In [106]:
y_train_mlb[0]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [107]:
y_test_mlb[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

In [108]:
# To verify the binary label transformation, let's check one value
y_train.iloc[1]


['male', '26', 'Law', 'Capricorn']

In [109]:
# And its inverse transformed value
mlb.inverse_transform(y_train_mlb)[1]

('26', 'Capricorn', 'Law', 'male')
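
The inverse-transformed tuple comes back in a different order from the original list because the binarizer was given a sorted class list, and the indicator vector only records which classes are present. A small pure-Python sketch of that encoding is shown here; `binarize` and `inverse` are hypothetical helpers written for illustration, and the class list is a reduced subset chosen for the example.

```python
def binarize(label_sets, classes):
    """Encode each label list as a 0/1 indicator row over `classes`."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

def inverse(rows, classes):
    """Decode indicator rows back to tuples, in class order."""
    return [tuple(c for c, bit in zip(classes, row) if bit) for row in rows]

# Reduced class list for illustration; sorted() puts digits first, then
# capitalized strings, then lowercase strings
classes = sorted({'male', 'female', '26', '33', 'Law', 'indUnk',
                  'Capricorn', 'Aries'})

rows = binarize([['male', '26', 'Law', 'Capricorn']], classes)
print(rows[0])                    # [1, 0, 0, 1, 1, 0, 0, 1]
print(inverse(rows, classes)[0])  # ('26', 'Capricorn', 'Law', 'male')
```

The round trip preserves the set of labels but not their original order, which matches the output above.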

# 3. Design, train, tune and test the best text classifier

In [110]:
print(X_train_vect.shape, y_train_mlb.shape, X_test_vect.shape, y_test_mlb.shape)

(35000, 2230871) (35000, 80) (15000, 2230871) (15000, 80)


In [114]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

logit = LogisticRegression(solver='lbfgs')
ovr = OneVsRestClassifier(logit)

# scikit-learn estimators run on the CPU; the tf.device scope does not move this fit to the GPU
with tf.device('/device:GPU:0'):
  ovr.fit(X_train_vect, y_train_mlb)
y_pred_logit = ovr.predict(X_test_vect)
y_pred_logit

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1]])

# 4. Display and explain in detail the classification report

In [115]:
from sklearn.metrics import classification_report
print(classification_report(y_test_mlb, y_pred_logit ))

              precision    recall  f1-score   support

           0       0.60      0.02      0.04       285
           1       0.48      0.05      0.10       613
           2       0.34      0.04      0.07       937
           3       0.47      0.08      0.14      1584
           4       0.46      0.09      0.14      1796
           5       0.27      0.02      0.04      1579
           6       0.37      0.03      0.06      1743
           7       0.30      0.02      0.05      1494
           8       0.34      0.01      0.02      1242
           9       0.35      0.02      0.04       997
          10       0.17      0.00      0.00       419
          11       0.58      0.03      0.06       486
          12       0.50      0.02      0.03       366
          13       0.25      0.00      0.01       292
          14       0.94      0.07      0.14       201
          15       1.00      0.04      0.07       159
          16       0.00      0.00      0.00        97
          17       0.89    
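
In the rows shown, precision is moderate to high while recall is close to zero, which suggests the per-label classifiers rarely predict a label as present: the few positive predictions are often right, but most true instances are missed. The sketch below shows how the three columns relate. The counts are illustrative assumptions chosen to be consistent with the rounded row for label 15 (support 159); they are not read from the model.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Assumed counts: 6 of 159 true instances found, no false positives
precision, recall, f1 = prf(tp=6, fp=0, fn=153)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.04 0.07
```

Because F1 is the harmonic mean of precision and recall, a near-zero recall pulls it down even when precision is perfect.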

# 5. Print the true vs predicted labels for any 5 entries from the dataset

In [116]:
#sample_pred = y_pred[:5]
sample_pred = y_pred_logit[:5]
actual_label = y_test_mlb[:5]

In [117]:
actual_label = mlb.inverse_transform(actual_label)

In [118]:
actual_label

[('33', 'Aries', 'female', 'indUnk'),
 ('27', 'Libra', 'indUnk', 'male'),
 ('35', 'Libra', 'Technology', 'male'),
 ('17', 'Cancer', 'indUnk', 'male'),
 ('17', 'Gemini', 'indUnk', 'male')]

In [119]:
sample_pred = mlb.inverse_transform(sample_pred)
sample_pred

[('male',), ('indUnk', 'male'), ('male',), ('16', 'male'), ('indUnk', 'male')]
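
To make the comparison easier to read, the five true and predicted label tuples above can be compared as sets. This is a small sketch using the values printed in the two outputs; the `compare` helper is hypothetical.

```python
actual = [('33', 'Aries', 'female', 'indUnk'),
          ('27', 'Libra', 'indUnk', 'male'),
          ('35', 'Libra', 'Technology', 'male'),
          ('17', 'Cancer', 'indUnk', 'male'),
          ('17', 'Gemini', 'indUnk', 'male')]
predicted = [('male',), ('indUnk', 'male'), ('male',),
             ('16', 'male'), ('indUnk', 'male')]

def compare(true_labels, pred_labels):
    """Return (correct, missed, extra) label lists for one sample."""
    t, p = set(true_labels), set(pred_labels)
    return sorted(t & p), sorted(t - p), sorted(p - t)

total_hits = 0
for t, p in zip(actual, predicted):
    hits, missed, extra = compare(t, p)
    total_hits += len(hits)
    print("correct:", hits, "| missed:", missed, "| extra:", extra)

print(total_hits, "of", sum(len(t) for t in actual), "true labels recovered")
```

For these five posts the model recovers 6 of the 20 true labels, mostly `male` and `indUnk`, and never predicts a sign.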

# DOMAIN: Customer support
## CONTEXT: 
Great Learning has an academic support department which receives numerous support requests every day throughout the year. Teams are spread across geographies and try to provide support round the year. Sometimes, due to heavy workload, certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic, where a proper resolution procedure delivered to the user can solve the problem. The company is looking to design an automation which can interact with the user, understand the problem, and display the resolution procedure [if found as a generic request] or redirect the request to an actual human support executive if the request is complex or not in its database.
## DATA DESCRIPTION: 
A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.
## PROJECT OBJECTIVE: 
Design a Python-based interactive semi-rule-based chatbot which can do the following:
1. Start the chat session with greetings and ask what the user is looking for.
2. Accept dynamic text-based questions from the user. Reply with a relevant answer from the designed corpus.
3. End the chat session only if the user requests to end it; otherwise ask what the user is looking for. The loop continues until the user asks to end it.

Please use the sample chatbot demo video for reference.

## EVALUATION: 
GL evaluator will use linguistics to twist and turn sentences to ask questions on the topics described in DATA DESCRIPTION and check if the bot is giving relevant replies.

### Hint: 
There are many techniques for cleaning and preparing the data used to train an ML/DL classifier. Hence, it might require you to experiment, research, self-learn and implement the above classifier. There might be many iterations between hand-building the corpus and designing the best-fit text classifier. As the quality and quantity of the corpus increases, the model’s performance, i.e. its ability to answer questions correctly, also increases.

#### Reference: https://www.mygreatlearning.com/blog/basics-of-building-an-artificial-intelligence-chatbot/

In [179]:
# Importing Corpus GL Bot
import json

with open(project_path + 'GL Bot.json') as file:
  corpus = json.load(file)

print(corpus)

{'intents': [{'tag': 'Intro', 'patterns': ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], 'responses': ['Hello! how can i help you ?'], 'context_set': ''}, {'tag': 'Exit', 'patterns': ['thank you', 'thanks', 'cya', 'see you', 'later', 'see you later', 'goodbye', 'i am leaving', 'have a Good day', 'you helped me', 'thanks a lot', 'thanks a ton', 'you are the best', 'great help', 'too good', 'you are a good learning buddy'], 'responses': ['I hope I was able to assist you, Good Bye'], 'context_set': ''}, {'tag': 'Olympus', 'patterns': ['olympus', 'explain me how olympus works', 'I am not able to understand olympus', 'olympus window not working', 'no access to olympus', 'unable to see link in olympus', 'no link visible on olympus', 'whom to contact for olympus', 'lot of p

In [180]:
import nltk
nltk.download('punkt')

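# Extract data
# All tokens across patterns (becomes the stemmed vocabulary)
W = []

# Unique intent tags
L = []

# Tokenized pattern for each training example
X = []

# Intent tag for each training example
y = []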

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [181]:
for intent in corpus['intents']:
  for pattern in intent['patterns']:
    w_tmp = nltk.word_tokenize(pattern)
    W.extend(w_tmp)
    X.append(w_tmp)
    y.append(intent['tag'])
  
  if (intent['tag'] not in L):
    L.append(intent['tag'])

In [182]:
print(W)

['hi', 'how', 'are', 'you', 'is', 'anyone', 'there', 'hello', 'whats', 'up', 'hey', 'yo', 'listen', 'please', 'help', 'me', 'i', 'am', 'learner', 'from', 'i', 'belong', 'to', 'aiml', 'batch', 'aifl', 'batch', 'i', 'am', 'from', 'my', 'pm', 'is', 'blended', 'online', 'i', 'am', 'from', 'hey', 'ya', 'talking', 'to', 'you', 'for', 'first', 'time', 'thank', 'you', 'thanks', 'cya', 'see', 'you', 'later', 'see', 'you', 'later', 'goodbye', 'i', 'am', 'leaving', 'have', 'a', 'Good', 'day', 'you', 'helped', 'me', 'thanks', 'a', 'lot', 'thanks', 'a', 'ton', 'you', 'are', 'the', 'best', 'great', 'help', 'too', 'good', 'you', 'are', 'a', 'good', 'learning', 'buddy', 'olympus', 'explain', 'me', 'how', 'olympus', 'works', 'I', 'am', 'not', 'able', 'to', 'understand', 'olympus', 'olympus', 'window', 'not', 'working', 'no', 'access', 'to', 'olympus', 'unable', 'to', 'see', 'link', 'in', 'olympus', 'no', 'link', 'visible', 'on', 'olympus', 'whom', 'to', 'contact', 'for', 'olympus', 'lot', 'of', 'proble

In [183]:
print(y)

['Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Intro', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Exit', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'Olympus', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'SL', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'NN', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Bot', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Profane', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Ticket', 'Tick

In [184]:
# Stemming
stemmer = SnowballStemmer('english')
W = [stemmer.stem(w.lower()) for w in W if w != '?']
W = sorted(list(set(W)))
L = sorted(L)

In [185]:
L

['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']

In [186]:
X_train = []
y_train = []

empty = [0 for _ in range(len(L))]
empty

[0, 0, 0, 0, 0, 0, 0, 0]

In [187]:
for i, doc in enumerate(X):
  bag = []
  w_tmp = [stemmer.stem(w.lower()) for w in doc]

  for w in W:
    if (w in w_tmp):
      bag.append(1)
    else:
      bag.append(0)
  out_row = empty[:]
  out_row[L.index(y[i])] = 1

  X_train.append(bag)
  y_train.append(out_row)


In [188]:
print(len(X_train), len(y_train))

128 128


In [189]:
len(y_train[1])

8

In [190]:
j = 2
print(X_train[j], '\n', y_train[j])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 
 [0, 0, 1, 0, 0, 0, 0, 0]


In [228]:
# Define a neural Network to train the model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()

In [229]:
model.add(Dense(64,  input_dim=(len(X_train[0])), activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(len(y_train[0]), activation='softmax'))

In [230]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, amsgrad=True)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
# Convert the nested lists to NumPy arrays before fitting
model.fit(np.array(X_train), np.array(y_train), batch_size=32, epochs=100)

Epoch 1/100
Epoch 2/100
...

<tensorflow.python.keras.callbacks.History at 0x7f74e3dd0a10>

In [221]:
for i in range(len(y_train)):
  if (len(y_train[i])> 8):
    print(len(y_train[i]))

In [222]:
for i in range(len(X_train)):
  if (len(X_train[i])> 154):
    print(len(X_train[i]))

In [223]:
def bow(input_text):
  # Tokenize and stem the input the same way as the training patterns
  input_tmp = nltk.word_tokenize(input_text)
  w_tmp = [stemmer.stem(w.lower()) for w in input_tmp]

  # Binary bag-of-words vector over the sorted vocabulary W
  bag = []
  for w in W:
    if (w in w_tmp):
      bag.append(1)
    else:
      bag.append(0)
  return bag


In [224]:
test = bow("Hi, how are you doing?")
print(test)
print(len(test))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
154


In [225]:
# Text chat function
import random
def chat():
  print("Chat with Chatbot(type: stop to quit)")
  print('If you are not satisfied by the response: (type *)')
  while True:
    inp = input("\n\nYou: ")
    if (inp.lower() == 'stop'):
      break
    if (inp == "*"):
      print("BOT: Please rephrase your question and try again")
      # Do not classify the feedback marker; wait for the next input
      continue

    # Debug output: echo the input and its bag-of-words vector
    print(inp)
    token = bow(inp)
    print(token)
    print(len(token))
    # Predict the intent tag with the highest softmax probability
    results = model.predict(np.array([token]))
    results_index = np.argmax(results)
    tag = L[results_index]

    # Look up the responses registered for the predicted tag
    for tg in corpus['intents']:
      if (tg['tag'] == tag):
        responses = tg['responses']

    #print(random.choice(responses ))
    print(responses)
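
The project context asks for a redirect to a human support executive when a request is not covered by the corpus, while `chat()` always answers with the top-scoring tag. One possible way to add that behaviour is a confidence threshold on the softmax output, sketched below. The `choose_reply` helper, the 0.5 threshold and the fallback message are all assumptions for illustration; the tag list and the two responses are taken from the outputs above.

```python
def choose_reply(probabilities, tags, responses_by_tag, threshold=0.5,
                 fallback="Let me connect you to a human support executive."):
    """Return (tag, reply); fall back when the top probability is below threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        # Low confidence: treat the request as outside the corpus
        return None, fallback
    tag = tags[best]
    return tag, responses_by_tag[tag][0]

tags = ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']
responses_by_tag = {
    'Intro': ['Hello! how can i help you ?'],
    'Exit': ['I hope I was able to assist you, Good Bye'],
}

# Hypothetical softmax outputs: one confident, one nearly uniform
confident = [0.01, 0.02, 0.90, 0.01, 0.02, 0.01, 0.02, 0.01]
unsure = [0.12, 0.13, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12]

print(choose_reply(confident, tags, responses_by_tag))
print(choose_reply(unsure, tags, responses_by_tag))
```

A suitable threshold would need to be tuned against held-out questions; with only 128 training patterns the probabilities may not be well calibrated.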



In [226]:
chat()

Chat with Chatbot(type: stop to quit)
If you are not satisfied by the response: (type *)


You: hello
hello
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
154
['Hello! how can i help you ?']


You: olympus
olympus
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [227]:
model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_45 (Dense)             (None, 64)                9920      
_________________________________________________________________
dropout_30 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_46 (Dense)             (None, 64)                4160      
_________________________________________________________________
dropout_31 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_47 (Dense)             (None, 8)                 520       
Total params: 14,600
Trainable params: 14,600
Non-trainable params: 0
_________________________________________________________________
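
The parameter counts in the summary follow from the size of each Dense layer: one weight per input-unit pair plus one bias per unit. The short check below reproduces the figures above, using the bag-of-words length of 154 as the input size. Note that the summary lists 64 units for the second Dense layer, whereas the model definition shown in cell In [229] uses 32, so the summary appears to come from an earlier build of the model; the second calculation shows what the 32-unit version would give under the same formula.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer sizes as listed in the summary: 154 -> 64 -> 64 -> 8
summary_layers = [(154, 64), (64, 64), (64, 8)]
summary_counts = [dense_params(i, o) for i, o in summary_layers]
print(summary_counts, sum(summary_counts))  # [9920, 4160, 520] 14600

# Layer sizes as defined in cell In [229]: 154 -> 64 -> 32 -> 8
defined_layers = [(154, 64), (64, 32), (32, 8)]
defined_counts = [dense_params(i, o) for i, o in defined_layers]
print(defined_counts, sum(defined_counts))  # [9920, 2080, 264] 12264
```

Dropout layers contribute no parameters, which is why they appear with a count of 0.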
