<a href="https://colab.research.google.com/github/aanchal0431/chatbot/blob/main/SEP_728_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chatbot

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split

In [2]:
print(tf.__version__)

2.6.0


Git commands to clone the repository and to pull or push data

In [3]:
!git clone https://github.com/aanchal0431/chatbot.git
# !git pull
%cd chatbot/
#!git config --global user.name "aanchal0431"
#!git config --global user.email "aanchal0431@gmail.com"
%ls


Cloning into 'chatbot'...
remote: Enumerating objects: 644, done.
remote: Counting objects: 100% (644/644), done.
remote: Compressing objects: 100% (627/627), done.
remote: Total 644 (delta 87), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (644/644), 6.44 MiB | 7.29 MiB/s, done.
Resolving deltas: 100% (87/87), done.
/content/chatbot
Data/  README.md  SEP_728_Chatbot.ipynb


### Data Preprocessing

*   Load datasets
*   Append question and answer datasets
*   Remove duplicate questions
*   Split into train and test
*   Drop irrelevant columns
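
The steps listed above can be sketched end to end on a small stand-in table. This is only an illustration under assumptions, not the notebook's actual cells: the two toy frames and their rows are made up, and only the `Question` and `Answer` column names and the `source_text`/`target_text` layout come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the S08 and S09 tables (hypothetical rows).
part_a = pd.DataFrame({
    "ArticleTitle": ["A", "A", "B", "B"],
    "Question": ["q1", "q1", "q2", "q3"],
    "Answer": ["yes", "Yes.", "no", "1832"],
})
part_b = pd.DataFrame({
    "ArticleTitle": ["C", "C", "C"],
    "Question": ["q3", "q4", "q5"],
    "Answer": ["1832", "six", "yes"],
})

# Append the tables; ignore_index gives the result a fresh 0..n-1 index.
combined = pd.concat([part_a, part_b], ignore_index=True)

# Keep the first row for each distinct question.
deduped = combined.drop_duplicates(subset=["Question"])

# Split the deduplicated questions and answers into train and test.
q_train, q_test, a_train, a_test = train_test_split(
    deduped["Question"], deduped["Answer"],
    shuffle=True, test_size=0.2, random_state=5,
)

# Keep only the two columns needed downstream, under the names simpleT5 expects.
train_df = pd.DataFrame({"source_text": q_train, "target_text": a_train})
test_df = pd.DataFrame({"source_text": q_test, "target_text": a_test})

print(combined.shape, deduped.shape, train_df.shape, test_df.shape)
```

Selecting only the question and answer columns when building the final frames is what drops the remaining columns, so no separate `drop` call is needed in this sketch.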






In [4]:
cur_path = 'Data/Question_Answer_Dataset_v1.2/'
data_s8 = pd.read_csv(cur_path + 'S08/question_answer_pairs.txt', delimiter="\t")
data_s9 = pd.read_csv(cur_path + 'S09/question_answer_pairs.txt', delimiter="\t")
#data_s10 = pd.read_csv(cur_path + 'S10/question_answer_pairs.txt', delimiter="\t") #Issue: Couldn't bring in s10
print("Shape s8:", data_s8.shape)
print("Shape s9:", data_s9.shape)
#print("Shape s10:", data_s10.shape)
data_s8.head()

Shape s8: (1715, 6)
Shape s9: (825, 6)


Unnamed: 0,ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
0,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,yes,easy,easy,data/set3/a4
1,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,Yes.,easy,easy,data/set3/a4
2,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,yes,easy,medium,data/set3/a4
3,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,Yes.,easy,easy,data/set3/a4
4,Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,medium,data/set3/a4


In [5]:
#append all questions into one data set
#data_all = pd.concat([data_s8, data_s9, data_s10]) #Issue: append s10 as well
# pd.concat replaces DataFrame.append, which is deprecated in newer pandas releases
data_all = pd.concat([data_s8, data_s9])
print("Shape:", data_all.shape)
data_all.head()


Shape: (2540, 6)


Unnamed: 0,ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
0,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,yes,easy,easy,data/set3/a4
1,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,Yes.,easy,easy,data/set3/a4
2,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,yes,easy,medium,data/set3/a4
3,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,Yes.,easy,easy,data/set3/a4
4,Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,medium,data/set3/a4


In [6]:
#remove duplicate questions
data_all = data_all.drop_duplicates(subset=['Question'])
print("Shape:", data_all.shape)
data_all.head()

Shape: (1628, 6)


Unnamed: 0,ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
0,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,yes,easy,easy,data/set3/a4
2,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,yes,easy,medium,data/set3/a4
4,Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,medium,data/set3/a4
6,Abraham_Lincoln,How many long was Lincoln's formal education?,18 months,medium,easy,data/set3/a4
8,Abraham_Lincoln,When did Lincoln begin his political career?,1832,medium,easy,data/set3/a4


In [7]:
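# divide into train and test
# Note: this splits data_s8 only, not the combined, deduplicated data_all built above.
# Pass data_all['Question'] and data_all['Answer'] to split the deduplicated questions instead.
X_train, X_test, y_train, y_test = train_test_split(data_s8['Question'], data_s8['Answer'],
          shuffle=True, test_size=0.1, random_state=5)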

#Format for simpleT5
train = pd.DataFrame({'source_text': X_train, 'target_text': y_train})    
test = pd.DataFrame({'source_text': X_test, 'target_text': y_test}) 
train.head()

Unnamed: 0,source_text,target_text
1214,Are otters playful animals?,yes
123,Did the scientific community not reserve great...,yes
1084,How many municipalities are within Oberland?,6.
917,What information did he record in his diary?,He wrote descriptions of events and impression...
823,"What does ""Era of Good Feelings"" refers to?","Monroe allowed his political base to decay, wh..."


### Train a Simple Model
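A pretrained T5 model, fine-tuned through the simpleT5 wrapper, is used to test the question-answering process. simpleT5 handles tokenization internally, and the model is given only the question text, with no supporting context passage.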

In [10]:
!pip install --upgrade simplet5

Collecting simplet5
  Downloading simplet5-0.1.3.tar.gz (7.2 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting transformers==4.10.0
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting pytorch-lightning==1.4.5
  Downloading pytorch_lightning-1.4.5-py3-none-any.whl (919 kB)
Collecting fsspec[http]!=2021.06.0,>=2021.05.0
  Downloading fsspec-2021.10.1-py3-none-any.whl (125 kB)
Collecting future>=0.17.1
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting PyYAML>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux201

In [11]:
# import
from simplet5 import SimpleT5

# instantiate
model = SimpleT5()

# load (supports t5, mt5, byT5 models)
model.from_pretrained("t5","t5-base")

# train
model.train(train_df=train.applymap(str), # pandas dataframe with 2 columns: source_text & target_text
            eval_df=test.applymap(str), # pandas dataframe with 2 columns: source_text & target_text
            source_max_token_len = 512, #Issue: not sure of max len
            target_max_token_len = 128, #Issue: not sure of max len
            batch_size = 8,
            max_epochs = 2,
            use_gpu = False,
            outputdir = '/model/simpleT5',
            early_stopping_patience_epochs = 0,
            )

Global seed set to 42


Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892M [00:00<?, ?B/s]

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Global seed set to 42
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Training: -1it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
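
The `source_max_token_len` and `target_max_token_len` values in the training call above are flagged as uncertain. One way to pick them is to measure how long the questions and answers actually are and take a high percentile. The sketch below is a rough illustration: `suggest_max_len` is a hypothetical helper, and the default whitespace split is only a stand-in that undercounts subword tokens. For real numbers, pass the model's own tokenizer function as `tokenize`.

```python
import math

def suggest_max_len(texts, tokenize=str.split, percentile=95):
    """Return the token count that covers `percentile` percent of `texts`.

    `tokenize` is any callable mapping a string to a list of tokens.
    The default whitespace split is a crude placeholder (assumption).
    """
    lengths = sorted(len(tokenize(t)) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile: smallest length covering the requested share.
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return lengths[rank - 1]

# Sample questions taken from the training preview shown earlier.
questions = [
    "Are otters playful animals?",
    "How many municipalities are within Oberland?",
    "What information did he record in his diary?",
]
print(suggest_max_len(questions, percentile=100))
```

Applied to `train['source_text']` and `train['target_text']` separately, this would give one candidate value for each of the two length parameters. Sequences longer than the chosen limit would be truncated.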

In [20]:
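#Issue: Unable to load and predict with model
# load_model expects a directory containing the saved model files (including config.json).
# 'lightning_logs' holds PyTorch Lightning logs, so the load below fails.
model.load_model("t5",'lightning_logs', use_gpu=False)


# run a prediction on a single sample question
model.predict("is this a question")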

file lightning_logs/config.json not found


OSError: ignored
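
The error above says that no `config.json` exists in the directory passed to `load_model`. A quick way to find a loadable directory is to search the training output location for that file. This is a minimal sketch, assuming the training run wrote its saved model somewhere under the `outputdir` given to `model.train`. `find_model_dirs` is a hypothetical helper, not part of simpleT5.

```python
from pathlib import Path

def find_model_dirs(root):
    """Return directories under `root` that contain a config.json file."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search if the directory does not exist.
        return []
    return sorted(p.parent for p in root.rglob("config.json"))

# '/model/simpleT5' is the outputdir used in the training call above.
for model_dir in find_model_dirs("/model/simpleT5"):
    print(model_dir)
```

Any directory this prints is a candidate for the second argument of `model.load_model`. If nothing is printed, the model was either not saved or was saved to a different location.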