#Bringing the Chatbot to 2020: Introducing the State of the Art (SOTA) in Machine Learning and Applying it to Chatbots

In our previous notebook, we looked at how neural networks can be used in the context of speech recognition and chatbots. We studied how we can model text using recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) to predict, for example, what words will come next in a conversation.

Would it surprise you to know that we can actually do much better than RNNs and LSTMs today? Today's cutting-edge technology relies on more advanced neural networks that build on the ideas behind RNNs and LSTMs, ushering in a new era of natural language processing!

In this notebook, we'll dive into the state of the art in NLP and look at how we can use that to help create our mental health chatbot. Like before, we'll train our neural network, then create an interface that'll allow us to see how well our new chatbot is doing. 

###Outline:

This'll be the outline for today's notebook:

1. Introduction to Transformers
2. Let's create our own Transformer!
3. Exploring and initializing our chat dataset
4. Training our Transformer on our data and using pre-trained models
5. Can we do better? An introduction to transfer learning
6. What does this mean? Where do we go from here?



###**Important: Before starting, set your runtime type to GPU!**

In [None]:
#@title Run this cell to import our libraries and download the pretrained weights
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import spacy
import tensorflow as tf
import tensorflow_datasets as tfds
tf.random.set_seed(1)
import os

# set pandas viewing options
pd.set_option("display.max_columns", None)
pd.set_option('display.width', None)
pd.set_option("max_colwidth", None)
#pd.reset_option("max_colwidth")

# the source of our data is: https://github.com/nbertagnolli/counsel-chat

# load pretrained weights:
# import gdown 
# gdown.download('https://drive.google.com/uc?export=download&id=1rR0HAOKgs0yGAyZwqeJkX1U3W8234BgR','chatbot_transformer_v4.h5',True);
!wget 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Mental%20Health%20Chatbots/chatbot_transformer_v4.h5'

--2022-03-12 03:03:29--  https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Mental%20Health%20Chatbots/chatbot_transformer_v4.h5
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.152.128, 173.194.195.128, 173.194.198.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.152.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37056308 (35M) [application/octet-stream]
Saving to: ‘chatbot_transformer_v4.h5’


2022-03-12 03:03:29 (64.0 MB/s) - ‘chatbot_transformer_v4.h5’ saved [37056308/37056308]



##Introduction to Transformers

###The Rise of the Transformers

In 2017, researchers from Google introduced a ground-breaking new approach to the field of natural language processing (NLP).

> **Why not teach a neural network to figure out what words it should pay attention to?**

They developed a model to do just that, which we call a **transformer**. Since then, transformers have revolutionized NLP, with transformer-based architectures such as [**GPT-2**](https://en.wikipedia.org/wiki/OpenAI#GPT-2) and [**GPT-3**](https://en.wikipedia.org/wiki/GPT-3) producing ground-breaking results. In this section, we'll explore what a transformer is and how it enables a neural network to figure out which words it should pay attention to.

**Optional**: Read [their original paper on transformers](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). While the details may be complex, skimming papers for key information is a good habit to develop!

**Optional**: Watch [this YouTube video](https://www.youtube.com/watch?v=4Bdc55j80l8) for a detailed overview on transformers.

**Optional**: For a bit more light-hearted fun, try out [AI Dungeon](https://play.aidungeon.io/main/home), an impressive text role-playing game created using GPT-3.

###What does it mean for a machine to pay attention?

To illustrate what it means for a machine to pay attention, let's look at the following example:

> *The animal didn't cross the street because it was too tired.*

**Question**: In this sentence, what is **it** referring to?

Now, look at this example:

> *The animal didn't cross the street because it was too wide.*

**Question**: In *this* sentence, what is **it** referring to?

Now, let's think about how we might program a computer to accomplish this task.

**Question**: Could we hard-code rules to accomplish this task? If so, what would such rules look like? What might be some problems or difficulties with this approach (if any)?

**Question**: Could we use an RNN to accomplish this task? If so, how should we set up this RNN? What might be some problems or difficulties with this approach (if any)?


Transformers were revolutionary because they enable computers to consider the *entire* conversation at once when guessing what semi-ambiguous words such as **it** refer to. We'll explore step-by-step how exactly transformers accomplish this task by implementing a transformer-based architecture for our chatbot and exploring its properties!

##Let's create our own Transformer!

**Note**: We invite you to ask lots of questions as you complete this project! This is the cutting edge of AI and ML, so it's relatively new even to instructors!



###Attention

Remember that what made Transformers unique was their ability to pay attention!

What does this mean? 

Let's look at a demonstration: 

![link](https://raw.githubusercontent.com/jessevig/bertviz/master/images/head_thumbnail_right.gif)

In this demo, the Transformer steps through each word in the left column and works out which words in the sentence are most relevant to that word. The darker the blue box on the right, the more relevant the program thinks that word is to the word in the grey box. For example, when the grey box is on "rabbit", you should see that the word "hopped" on the right is highlighted in dark blue. As the Transformer steps through each word on the left, it looks at the other words in the sentence and, based on those words and where they sit in the sentence, tries to figure out what word should come next.

This is how the Transformer learns what words it should be paying attention to, and we'll create this in the code chunks below!
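
To make the idea concrete before we build the real thing, here is a minimal sketch of attention in plain Python. It is only an illustration, not the TensorFlow code this notebook uses: the helper names `softmax` and `toy_attention` and the tiny two-number word vectors are made up for this example. Each word's query is compared with every word's key, the scores are turned into weights that sum to 1, and those weights are used to blend the value vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(queries, keys, values):
    """Scaled dot-product attention on small lists of vectors (illustrative only)."""
    depth = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # how similar is this query to every key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(depth)
                  for k in keys]
        weights = softmax(scores)
        # blend the value vectors using the attention weights
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# made-up 2-dimensional vectors for three words
words = ["rabbit", "quickly", "hopped"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = toy_attention(vectors, vectors, vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Because the made-up vector for "hopped" points in nearly the same direction as the one for "rabbit", the "rabbit" row puts more weight on "hopped" than on "quickly". In a real Transformer these vectors are learned during training rather than written by hand.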



###Multi-Headed Attention
Transformers take the "Attention" process further by using something called **"Multi-Headed Attention"**.

What does this mean?

Intuitively, it means that the model is taking several different looks at the input data. It's similar to how, for example, if you look at a painting from a different angle or if you look at a building during a different time of day, you might get a different view. 

Imagine if this dog were paying attention to you:

![link](http://vignette2.wikia.nocookie.net/naruto/images/5/5a/Multi_headed_dog2.PNG/revision/latest?cb=20150209111333)

Each head of the dog is paying attention to you, but each head gets a slightly different view of you. 

In the same way, Transformers use several "heads" to pay attention to the data in different ways, learning slightly different patterns with each head. 
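
One way to picture the "heads" is as equal slices of each word's vector, with every slice attended to separately and the results glued back together afterwards. The sketch below shows only that slicing and gluing step in plain Python. The helper names `split_vector_into_heads` and `merge_heads` and the example numbers are assumptions made for illustration.

```python
def split_vector_into_heads(vector, num_heads):
    """Cut one word vector of size d_model into num_heads equal pieces."""
    d_model = len(vector)
    assert d_model % num_heads == 0, "d_model must divide evenly among the heads"
    depth = d_model // num_heads  # how many numbers each head gets
    return [vector[h * depth:(h + 1) * depth] for h in range(num_heads)]

def merge_heads(pieces):
    """Glue the per-head pieces back into one vector."""
    return [x for piece in pieces for x in piece]

# made-up word vector with d_model = 8
word_vector = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

heads = split_vector_into_heads(word_vector, num_heads=2)
print(heads)
print(merge_heads(heads) == word_vector)
```

Each head works with `d_model // num_heads` numbers per word, which is why `d_model` has to be divisible by the number of heads.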

Looks complicated? Don't worry, we'll simplify it for you!

In [None]:
#@title Run to get our function for calculating "attention"
def scaled_dot_product_attention(query, key, value, mask):
  """Compute scaled dot-product attention and return the attention-weighted values."""
  matmul_qk = tf.matmul(query, key, transpose_b=True)

  # scale matmul_qk
  depth = tf.cast(tf.shape(key)[-1], tf.float32)
  logits = matmul_qk / tf.math.sqrt(depth)

  # add a large negative number at masked positions so softmax gives them ~0 weight
  if mask is not None:
    logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k)
  attention_weights = tf.nn.softmax(logits, axis=-1)

  output = tf.matmul(attention_weights, value)

  return output

In [None]:
#@title Run this to create the multi-head attention mechanism
class MultiHeadAttention(tf.keras.layers.Layer):

  def __init__(self, d_model, num_heads, name="multi_head_attention"):
    super(MultiHeadAttention, self).__init__(name=name)
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.query_dense = tf.keras.layers.Dense(units=d_model)
    self.key_dense = tf.keras.layers.Dense(units=d_model)
    self.value_dense = tf.keras.layers.Dense(units=d_model)

    self.dense = tf.keras.layers.Dense(units=d_model)

  def split_heads(self, inputs, batch_size):
    inputs = tf.reshape(
        inputs, shape=(batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(inputs, perm=[0, 2, 1, 3])

  def call(self, inputs):
    query, key, value, mask = inputs['query'], inputs['key'], inputs[
        'value'], inputs['mask']
    batch_size = tf.shape(query)[0]

    # linear layers
    query = self.query_dense(query)
    key = self.key_dense(key)
    value = self.value_dense(value)

    # split heads
    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size)

    # scaled dot-product attention
    scaled_attention = scaled_dot_product_attention(query, key, value, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

    # concatenation of heads
    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))

    # final linear layer
    outputs = self.dense(concat_attention)

    return outputs

####Masking

Remember how we used padding to add `<pad>` tokens to fill in sentences that were too short? We don't want our model to think that `<pad>` is an actual word (it's only there as a placeholder), so we'll use something called a **mask**, which will cover up the `<pad>` instances in our sequence so that our model ignores them.
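
Here is a small plain-Python sketch of what a mask looks like, assuming (as the notebook's code does) that token id 0 stands for `<pad>`. The function names `padding_mask` and `look_ahead_mask` and the example token ids are made up for illustration. The second function also hides *later* words, which is the idea behind the look-ahead mask the decoder uses so that it cannot peek at words it has not generated yet.

```python
def padding_mask(token_ids, pad_id=0):
    """1 marks a padding position to ignore, 0 marks a real token."""
    return [1 if t == pad_id else 0 for t in token_ids]

def look_ahead_mask(token_ids, pad_id=0):
    """Row i marks (with 1) every position word i may NOT look at:
    positions that come later in the sentence, plus any padding."""
    pad = padding_mask(token_ids, pad_id)
    n = len(token_ids)
    return [[1 if (j > i or pad[j]) else 0 for j in range(n)]
            for i in range(n)]

sentence = [7, 12, 5, 0, 0]  # made-up token ids; 0 stands for <pad>

print(padding_mask(sentence))
for row in look_ahead_mask(sentence):
    print(row)
```

Inside the attention function, every position marked with 1 has a very large negative number added to its score, so it ends up with a weight of practically zero.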

In [None]:
#@title Run this chunk to create the masks
def create_padding_mask(x):
  mask = tf.cast(tf.math.equal(x, 0), tf.float32)
  # (batch_size, 1, 1, sequence length)
  return mask[:, tf.newaxis, tf.newaxis, :]

#print(create_padding_mask(tf.constant([[1, 2, 0, 3, 0], [0, 0, 0, 4, 5]])))

def create_look_ahead_mask(x):
  seq_len = tf.shape(x)[1]
  look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
  padding_mask = create_padding_mask(x)
  return tf.maximum(look_ahead_mask, padding_mask)

#print(create_look_ahead_mask(tf.constant([[1, 2, 0, 4, 5]])))



In [None]:
#@title Run this chunk as well, for more processing

# note: this gives positional information (i.e., where the words are in a sentence
# and includes it in the model) since our model isn't recurrent anymore

class PositionalEncoding(tf.keras.layers.Layer):

  def __init__(self, position, d_model):
    super(PositionalEncoding, self).__init__()
    self.pos_encoding = self.positional_encoding(position, d_model)

  def get_angles(self, position, i, d_model):
    angles = 1 / tf.pow(10000, (2 * (i // 2)) / tf.cast(d_model, tf.float32))
    return position * angles

  def positional_encoding(self, position, d_model):
    angle_rads = self.get_angles(
        position=tf.range(position, dtype=tf.float32)[:, tf.newaxis],
        i=tf.range(d_model, dtype=tf.float32)[tf.newaxis, :],
        d_model=d_model)
    # apply sin to even index in the array
    sines = tf.math.sin(angle_rads[:, 0::2])
    # apply cos to odd index in the array
    cosines = tf.math.cos(angle_rads[:, 1::2])

    pos_encoding = tf.concat([sines, cosines], axis=-1)
    pos_encoding = pos_encoding[tf.newaxis, ...]
    return tf.cast(pos_encoding, tf.float32)

  def call(self, inputs):
    return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

# example of this class: 

#sample_pos_encoding = PositionalEncoding(50, 512)

#plt.pcolormesh(sample_pos_encoding.pos_encoding.numpy()[0], cmap='RdBu')
#plt.xlabel('Depth')
#plt.xlim((0, 512))
#plt.ylabel('Position')
#plt.colorbar()
#plt.show()
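
Because a Transformer looks at every word at once, it needs to be told where each word sits in the sentence. As a rough illustration of the sine and cosine pattern computed by the class above, here is a plain-Python sketch for a single position. The function name `positional_encoding_for` is made up for this example, and the sketch follows the same sines-then-cosines layout as the notebook's code.

```python
import math

def positional_encoding_for(position, d_model):
    """Return the d_model numbers that describe one position in a sentence."""
    angles = [position / math.pow(10000, (2 * (i // 2)) / d_model)
              for i in range(d_model)]
    sines = [math.sin(a) for a in angles[0::2]]    # even indices
    cosines = [math.cos(a) for a in angles[1::2]]  # odd indices
    return sines + cosines

# every position gets its own pattern of numbers
for position in range(3):
    print(position, [round(x, 3) for x in positional_encoding_for(position, 4)])
```

These numbers are added to each word's embedding, so the same word at two different positions looks slightly different to the model.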

###Encoders

Remember encoders from the last notebook? They process the inputs and then feed those processed inputs into the decoder. We feed our patient questions into the encoder, the encoder figures out "the gist" of the patient's question, and then it feeds that into the decoder so the decoder can respond appropriately (in this case, predict the appropriate therapist response).

For our Transformer model, we take all the steps above (Multi-Headed Attention and masking) and combine them into a series of layers, which will make up our encoder. Our input into this encoder is going to be our patient questions, and the output of our encoder (and, subsequently, the input to our decoder) will be a matrix (think of it like an Excel table) that has each word, weighted by "how important" it is in the context of the entire sentence (which gives us a measure of how much the model should "pay attention" to that word).

Run the code chunks below to create our encoders!

In [None]:
#@title Let's create our encoder!
# individual encoder layer
def encoder_layer(units, d_model, num_heads, dropout, name="encoder_layer"):
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  attention = MultiHeadAttention(
      d_model, num_heads, name="attention")({
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': padding_mask
      })
  attention = tf.keras.layers.Dropout(rate=dropout)(attention)
  attention = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(inputs + attention)

  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  outputs = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention + outputs)

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

# complete encoder architecture
def encoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name="encoder"):
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name="padding_mask")

  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
  embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)

  for i in range(num_layers):
    outputs = encoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name="encoder_layer_{}".format(i),
    )([outputs, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)


###Decoders

Now that we've created our encoder, we need to create the decoder as well. The purpose of the decoder is to take what the encoder spits out and "decode" it into something that's useful (in our case, a therapist's response). For Transformers, the Decoder will look very similar to our Encoder since it also uses multi-headed attention.

The end goal of the Decoder is to figure out how the patient's question relates to what the therapist's response was. It takes the output of the Encoder step, which essentially tells it how much "attention" should be placed on each word in the patient's question, and from there learns what a therapist would normally say in that situation. 

Run the code chunk below to create our decoders!

In [None]:
#@title Let's create our decoder!
def decoder_layer(units, d_model, num_heads, dropout, name="decoder_layer"):
  inputs = tf.keras.Input(shape=(None, d_model), name="inputs")
  enc_outputs = tf.keras.Input(shape=(None, d_model), name="encoder_outputs")
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name="look_ahead_mask")
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')

  attention1 = MultiHeadAttention(
      d_model, num_heads, name="attention_1")(inputs={
          'query': inputs,
          'key': inputs,
          'value': inputs,
          'mask': look_ahead_mask
      })
  attention1 = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention1 + inputs)

  attention2 = MultiHeadAttention(
      d_model, num_heads, name="attention_2")(inputs={
          'query': attention1,
          'key': enc_outputs,
          'value': enc_outputs,
          'mask': padding_mask
      })
  attention2 = tf.keras.layers.Dropout(rate=dropout)(attention2)
  attention2 = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(attention2 + attention1)

  outputs = tf.keras.layers.Dense(units=units, activation='relu')(attention2)
  outputs = tf.keras.layers.Dense(units=d_model)(outputs)
  outputs = tf.keras.layers.Dropout(rate=dropout)(outputs)
  outputs = tf.keras.layers.LayerNormalization(
      epsilon=1e-6)(outputs + attention2)

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

# decoder itself
def decoder(vocab_size,
            num_layers,
            units,
            d_model,
            num_heads,
            dropout,
            name='decoder'):
  inputs = tf.keras.Input(shape=(None,), name='inputs')
  enc_outputs = tf.keras.Input(shape=(None, d_model), name='encoder_outputs')
  look_ahead_mask = tf.keras.Input(
      shape=(1, None, None), name='look_ahead_mask')
  padding_mask = tf.keras.Input(shape=(1, 1, None), name='padding_mask')
  
  embeddings = tf.keras.layers.Embedding(vocab_size, d_model)(inputs)
  embeddings *= tf.math.sqrt(tf.cast(d_model, tf.float32))
  embeddings = PositionalEncoding(vocab_size, d_model)(embeddings)

  outputs = tf.keras.layers.Dropout(rate=dropout)(embeddings)

  for i in range(num_layers):
    outputs = decoder_layer(
        units=units,
        d_model=d_model,
        num_heads=num_heads,
        dropout=dropout,
        name='decoder_layer_{}'.format(i),
    )(inputs=[outputs, enc_outputs, look_ahead_mask, padding_mask])

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

###Putting it all together!

Now that we have our encoder and decoder, let's combine them into one model, our Transformer model!

Run the code chunk below to create our Transformer model!

In [None]:
#@title Let's create our Transformer!
def transformer(vocab_size,
                num_layers,
                units,
                d_model,
                num_heads,
                dropout,
                name="transformer"):
  inputs = tf.keras.Input(shape=(None,), name="inputs")
  dec_inputs = tf.keras.Input(shape=(None,), name="dec_inputs")

  enc_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='enc_padding_mask')(inputs)
  # mask the future tokens for decoder inputs at the 1st attention block
  look_ahead_mask = tf.keras.layers.Lambda(
      create_look_ahead_mask,
      output_shape=(1, None, None),
      name='look_ahead_mask')(dec_inputs)
  # mask the encoder outputs for the 2nd attention block
  dec_padding_mask = tf.keras.layers.Lambda(
      create_padding_mask, output_shape=(1, 1, None),
      name='dec_padding_mask')(inputs)

  enc_outputs = encoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[inputs, enc_padding_mask])

  dec_outputs = decoder(
      vocab_size=vocab_size,
      num_layers=num_layers,
      units=units,
      d_model=d_model,
      num_heads=num_heads,
      dropout=dropout,
  )(inputs=[dec_inputs, enc_outputs, look_ahead_mask, dec_padding_mask])

  outputs = tf.keras.layers.Dense(units=vocab_size, name="outputs")(dec_outputs)

  return tf.keras.Model(inputs=[inputs, dec_inputs], outputs=outputs, name=name)

Awesome! Now that we've finished creating our Transformer model, let's use it to create a state-of-the-art chatbot!

## Preparing our dataset

In [None]:
#@title Run this cell to download our dataset

# load in our data with this code chunk: 
chat_data = pd.read_csv("https://raw.githubusercontent.com/nbertagnolli/counsel-chat/master/data/20200325_counsel_chat.csv")

We will use the same dataset as last time. As a recap, our dataset contains questions and answers taken from conversations between patients and licensed mental health professionals.

**Disclaimer**: Once again, this dataset was made freely available and all data was provided consensually, and in anonymized form. Remember, when working with sensitive data such as medical data, you should *always* get permission first!

### Exploring our data

As usual, let's begin by exploring our data. Our data is in a pandas dataframe named `chat_data`.

**Exercise**: Print out the first 5 rows in the dataset.

In [None]:
## YOUR CODE HERE
chat_data.head()

Unnamed: 0.1,Unnamed: 0,questionID,questionTitle,questionText,questionLink,topic,therapistInfo,therapistURL,answerText,upvotes,views,split
0,0,0,Can I change my feeling of being worthless to everyone?,"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n How can I change my feeling of being worthless to everyone?",https://counselchat.com/questions/can-i-change-my-feeling-of-being-worthless-to-everyone,depression,"Sherry Katz, LCSWCouples and Family Therapist, LCSW",https://counselchat.com/therapists/sherry-katz-lcsw,"If everyone thinks you're worthless, then maybe you need to find new people to hang out with.Seriously, the social context in which a person lives is a big influence in self-esteem.Otherwise, you can go round and round trying to understand why you're not worthless, then go back to the same crowd and be knocked down again.There are many inspirational messages you can find in social media. Maybe read some of the ones which state that no person is worthless, and that everyone has a good purpose to their life.Also, since our culture is so saturated with the belief that if someone doesn't feel good about themselves that this is somehow terrible.Bad feelings are part of living. They are the motivation to remove ourselves from situations and relationships which do us more harm than good.Bad feelings do feel terrible. Your feeling of worthlessness may be good in the sense of motivating you to find out that you are much better than your feelings today.",1,2899,train
1,1,0,Can I change my feeling of being worthless to everyone?,"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n How can I change my feeling of being worthless to everyone?",https://counselchat.com/questions/can-i-change-my-feeling-of-being-worthless-to-everyone,depression,"Robin Landwehr, DBH, LPCC, NCCMental Health in a Primary Care Setting",https://counselchat.com/therapists/robin-landwehr-dbh-lpcc-ncc,"Hello, and thank you for your question and seeking advice on this. Feelings of worthlessness is unfortunately common. In fact, most people, if not all, have felt this to some degree at some point in their life. You are not alone. Changing our feelings is like changing our thoughts - it's hard to do. Our minds are so amazing that the minute you change your thought another one can be right there to take it's place. Without your permission, another thought can just pop in there. The new thought may feel worse than the last one! My guess is that you have tried several things to improve this on your own even before reaching out on here. People often try thinking positive thoughts, debating with their thoughts, or simply telling themselves that they need to ""snap out of it"" - which is also a thought that carries some self-criticism. Some people try a different approach, and there are counselors out there that can help you with this. The idea is that instead of trying to change the thoughts, you change how you respond to them. You learn skills that allow you to manage difficult thoughts and feelings differently so they don't have the same impact on you that they do right now. For some people, they actually DO begin to experience less hurtful thoughts once they learn how to manage the ones they have differently. 
Acceptance and Commitment Therapy may be a good choice for you. There is information online and even self-help books that you can use to teach you the skills that I mentioned. Because they are skills, they require practice, but many people have found great relief and an enriched life by learning them. As for suicidal thoughts, I am very glad to read that this has not happened to you. Still, you should watch out for this because it can be a sign of a worsening depression. If you begin to think about this, it is important to reach out to a support system right away. The National Suicide Prevention Lifeline is 1-800-273-8255. The text line is #741741. I hope some other colleagues will provide you more suggestions. Be well...Robin Landwehr, DBH, LPCC",1,3514,train
2,2,0,Can I change my feeling of being worthless to everyone?,"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n How can I change my feeling of being worthless to everyone?",https://counselchat.com/questions/can-i-change-my-feeling-of-being-worthless-to-everyone,depression,Lee KingI use an integrative approach to treatment and have an online therapy practice.,https://counselchat.com/therapists/lee-king,First thing I'd suggest is getting the sleep you need or it will impact how you think and feel. I'd look at finding what is going well in your life and what you can be grateful for. I believe everyone has talents and wants to find their purpose in life. I think you can figure it out with some help.,0,5,train
3,3,0,Can I change my feeling of being worthless to everyone?,"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n How can I change my feeling of being worthless to everyone?",https://counselchat.com/questions/can-i-change-my-feeling-of-being-worthless-to-everyone,depression,"Shauntai Davis-YearginPersonalized, private online counseling for individuals and couples",https://counselchat.com/therapists/shauntai-davis-yeargin,"Therapy is essential for those that are feeling depressed and worthless. When I work with those that are experiencing concerns related to feeling of depression and issues with self esteem. I generally work with my client to help build coping skills to reduce level of depression and to assist with strengthening self esteem, by guiding my client with CBT practices. CBT helps with gaining a better awareness of how your thought process influences your belief system, and how your beliefs impact your actions and the outcome of your behaviors. This process isn’t easy but it helps teach an individual that we don’t always have control over what happens in our lives but we can control how we interpret, feel, and behave. CBT is good for individuals dealing with depression, anxiety, toxic relationships, stress, self esteem, codependency, etc.",0,31,train
4,4,0,Can I change my feeling of being worthless to everyone?,"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n How can I change my feeling of being worthless to everyone?",https://counselchat.com/questions/can-i-change-my-feeling-of-being-worthless-to-everyone,depression,Jordan WhiteLicensed Social Worker at Oak Roots Dynamic,https://counselchat.com/therapists/jordan-white,I first want to let you know that you are not alone in your feelings and there is always someone there to help. You can always change your feelings and change your way of thinking by being open to trying to change. You can always make yourself available to learning new things or volunteering so that you can make a purpose for yourself.,0,620,train


**Question**: What does each row in our dataset represent?

Now, let's set our `X` and `y` variables to be, respectively, the input and output for our chatbot.

**Question**: What should we set `X` to be? What should we set `y` to be?

In [None]:
X = chat_data["questionTitle"] ## YOUR CODE HERE
y = chat_data["answerText"] ## YOUR CODE HERE

### Cleaning our data

Before we can begin to use our data, we must preprocess it to clean up any data which we don't want to see. Run the following cell to perform some initial cleaning of our data.



In [None]:
def preprocess_text(phrase): 
  phrase = re.sub(r"\xa0", "", phrase) # removes "\xa0"
  phrase = re.sub(r"\n", "", phrase) # removes "\n"
  phrase = re.sub("[.]{1,}", ".", phrase) # removes duplicate "."s
  phrase = re.sub("[ ]{1,}", " ", phrase) # removes duplicate spaces
  return phrase

# run cleaning function
X = X.apply(preprocess_text)
y = y.apply(preprocess_text)

### Splitting up our questions and answers

There's a little more preprocessing we need to do, however! We want to keep our phrases relatively short, but some of the questions and answers in our dataset are several sentences long.

To solve this problem, we'll split up each question and each answer into their constituent sentences. We'll then pair the first sentence of the question with the first sentence of the answer, the second sentence of the question with the second sentence of the answer, and so on until we can't form any more pairs.

For example, suppose we have the following question-answer pair:
> **Q**: "I am not feeling well today. I feel sad."

> **A**: "Tell me more about how you feel. What have you been up to today?"

First, we would split up the question into its constituent sentences, resulting in `["I am not feeling well today.", "I feel sad."]`. Similarly, we would split up the answer into its constituent sentences, resulting in `["Tell me more about how you feel.", "What have you been up to today?"]`. Finally, we would pair each sentence of the question with its corresponding sentence of the answer, ultimately resulting in two separate question-answer pairs:
> **Q**: "I am not feeling well today."

> **A**: "Tell me more about how you feel."

and

> **Q**: "I feel sad."

> **A**: "What have you been up to today?"
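
The pairing idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact cell used in this notebook: the function name `pair_sentences` is made up, and unlike the notebook's loop it also trims spaces and drops empty pieces left over after splitting.

```python
def pair_sentences(question, answer):
    """Pair the i-th sentence of the question with the i-th sentence of the answer."""
    # split on "." and keep only non-empty pieces (an assumption of this sketch)
    question_sentences = [s.strip() for s in question.split(".") if s.strip()]
    answer_sentences = [s.strip() for s in answer.split(".") if s.strip()]
    # zip stops as soon as the shorter list runs out
    return list(zip(question_sentences, answer_sentences))

pairs = pair_sentences(
    "I am not feeling well today. I feel sad.",
    "Tell me more about how you feel. What have you been up to today?",
)
for q, a in pairs:
    print("Q:", q, "| A:", a)
```

Note that splitting on `"."` removes the periods themselves, and that a question or answer with extra sentences simply loses the ones that have no partner.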

In [None]:
# run this code chunk, to store all of our question/answer pairs
question_answer_pairs = []

In [None]:
# loop through each combination of question + answer
for (question, answer) in zip(X, y):

  # clean up text inputs

  # example: 
  # question = "I am not feeling well today. I feel sad."
  # answer = "Tell me more about how you feel. What have you been up to today?"

  question = preprocess_text(question) 
  answer = preprocess_text(answer)

  # split by .
  # example
  # question_arr = ["I am not feeling well today", " I feel sad", ""]
  # answer_arr = ["Tell me more about how you feel", " What have you been up to today?"]
  question_arr = question.split(".")
  answer_arr = answer.split(".")

  # number of pairs we can form: the smaller of the two sentence counts
  max_sentences = min(len(question_arr), len(answer_arr))

  # for each combination of question + answer, pair them up
  for i in range(max_sentences):

    # set up Q/A pair
    q_a_pair = []

    # append question, answer to pair (e.g., first sentence of question + first sentence of answer, etc.)
    q_a_pair.append(question_arr[i])
    q_a_pair.append(answer_arr[i])

    # append to question_answer_pairs
    question_answer_pairs.append(q_a_pair)
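
The pairing loop above can also be written as a small standalone function, which makes it easy to try on a single example. The sketch below is a minimal illustration and not the notebook's exact code: the name `pair_sentences` is hypothetical, and it makes two extra assumptions that the loop above does not, namely stripping the whitespace around each fragment and dropping the empty string that `split(".")` leaves behind after a trailing period.

```python
def pair_sentences(question, answer):
    """Pair the i-th sentence of the question with the i-th sentence of the answer."""
    # split on periods, then strip whitespace and drop empty fragments (assumption)
    question_arr = [s.strip() for s in question.split(".") if s.strip()]
    answer_arr = [s.strip() for s in answer.split(".") if s.strip()]

    # zip stops at the shorter list, so unmatched sentences are discarded
    return [[q, a] for q, a in zip(question_arr, answer_arr)]


pairs = pair_sentences(
    "I am not feeling well today. I feel sad.",
    "Tell me more about how you feel. What have you been up to today?",
)

for q, a in pairs:
    print("Q:", q, "| A:", a)
```

Note that splitting on `"."` alone does not break sentences that end in `?` or `!`, so a question followed by another sentence stays together as one fragment.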

### Tokenizing and padding our data

The next preprocessing steps that we need to implement are going to be tokenization and padding (which we reviewed in our last notebook). If you recall, tokenization is the process of turning a sentence into an array of the individual words (aka tokens), while padding is a way to add "filler" to make short sentences the same length as long sentences. Now that we've seen how tokenization and padding work, let's actually implement that on our dataset.
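
To make these two ideas concrete before running them on the real dataset, here is a toy sketch in plain Python. It is only an illustration: it splits on whitespace and builds a word-level vocabulary, whereas the notebook's tokenizer works on subwords. The names `toy_tokenize`, `toy_pad`, and `vocab` are made up for this example.

```python
def toy_tokenize(sentence, vocab):
    """Turn a sentence into a list of integer ids, adding unseen words to vocab."""
    ids = []
    for word in sentence.lower().split():
        if word not in vocab:
            # ids start at 1 because 0 is reserved for padding
            vocab[word] = len(vocab) + 1
        ids.append(vocab[word])
    return ids


def toy_pad(ids, max_length):
    """Append zeros until the list has max_length entries (assumes it is not longer)."""
    return ids + [0] * (max_length - len(ids))


vocab = {}
short_ids = toy_tokenize("i feel sad", vocab)
long_ids = toy_tokenize("i am not feeling well today", vocab)

# pad the short sentence to the length of the long one
padded_short = toy_pad(short_ids, len(long_ids))

print(long_ids)
print(padded_short)
```

After padding, both sentences have the same length, which is what lets us stack them into a single array for training.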

In [None]:
#@title Run this cell to tokenize our data

# Build tokenizer using tfds for both questions and answers
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    [arr[0] + arr[1] for arr in question_answer_pairs], target_vocab_size=2**13)

# Define start and end token to indicate the start and end of a sentence
START_TOKEN, END_TOKEN = [tokenizer.vocab_size], [tokenizer.vocab_size + 1]

# Vocabulary size plus start and end token
VOCAB_SIZE = tokenizer.vocab_size + 2



In [None]:
#@title Run this cell to define our tokenization and padding functions!

# maximum sentence length
MAX_LENGTH = 100 # chosen arbitrarily

# tokenize, filter, pad sentences
def tokenize_and_filter(inputs, outputs):
  """
    Tokenize, filter, and pad our inputs and outputs
  """

  # store results
  tokenized_inputs, tokenized_outputs = [], []

  # loop through inputs, outputs
  for (sentence1, sentence2) in zip(inputs, outputs):

    # tokenize sentence
    sentence1 = START_TOKEN + tokenizer.encode(sentence1) + END_TOKEN
    sentence2 = START_TOKEN + tokenizer.encode(sentence2) + END_TOKEN
    
    # check tokenized sentence max length
    if len(sentence1) <= MAX_LENGTH and len(sentence2) <= MAX_LENGTH:
      tokenized_inputs.append(sentence1)
      tokenized_outputs.append(sentence2)

  # pad tokenized sentences
  tokenized_inputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_inputs, maxlen = MAX_LENGTH, padding = "post")
  tokenized_outputs = tf.keras.preprocessing.sequence.pad_sequences(
      tokenized_outputs, maxlen = MAX_LENGTH, padding = "post")
    
  return tokenized_inputs, tokenized_outputs

In [None]:
#@title Run this cell to run tokenization and padding on our dataset
# get questions, answers
questions, answers = tokenize_and_filter([arr[0] for arr in question_answer_pairs], 
                                         [arr[1] for arr in question_answer_pairs])

print('Vocab size: {}'.format(VOCAB_SIZE))
print('Number of samples: {}'.format(len(questions)))

BATCH_SIZE = 64
BUFFER_SIZE = 20000

# decoder inputs use the previous target as input
# remove START_TOKEN from targets
dataset = tf.data.Dataset.from_tensor_slices((
    {
        'inputs': questions,
        'dec_inputs': answers[:, :-1]
    },
    {
        'outputs': answers[:, 1:]
    },
))

dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

print(dataset)
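
The slicing in the cell above (`answers[:, :-1]` for the decoder input and `answers[:, 1:]` for the target) is what lets the decoder learn to predict each token from the ones before it. Here is a small NumPy sketch on one made-up padded answer. The token ids are invented for illustration, with 100 and 101 standing in for the start and end tokens.

```python
import numpy as np

# one padded answer: START=100, three word tokens, END=101, then padding zeros
answers = np.array([[100, 7, 8, 9, 101, 0, 0, 0]])

dec_inputs = answers[:, :-1]  # drops the last position; still begins with START
outputs = answers[:, 1:]      # drops START; this is what the model must predict

# at each position, the decoder sees dec_inputs and should produce outputs
for position in range(4):
    print(dec_inputs[0, position], "->", outputs[0, position])
```

Both slices are one element shorter than the padded answer, which is why the loss and accuracy functions later reshape the labels to `MAX_LENGTH - 1`.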

## Training our Transformer

Now that we've done all the hard work of setting up and building our model, let's actually train it!

###Initializing our model

First, we must initialize our Transformer object.

**Exercise**: Use our Transformer model above to create a Transformer object with these six parameters: 
1. vocab_size = `VOCAB_SIZE` (we've defined this variable earlier, so you can use the variable name as is)
2. num_layers = 2
3. units = 512
4. d_model = 256
5. num_heads = 8
6. dropout = 0.1


In [None]:
tf.keras.backend.clear_session()

model = None ## YOUR CODE HERE

In [None]:
#@title Run this cell to initialize our loss function and learning rate

def loss_function(y_true, y_pred):
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  
  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction='none')(y_true, y_pred)

  mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
  loss = tf.multiply(loss, mask)

  return tf.reduce_mean(loss)


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    # store d_model as a float so it can be passed to tf.math.rsqrt
    self.d_model = tf.cast(d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    # the optimizer may pass an integer step, so cast it before taking rsqrt
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps**-1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# example of how we can adjust learning rate
#sample_learning_rate = CustomSchedule(d_model=128)

#plt.plot(sample_learning_rate(tf.range(200000, dtype=tf.float32)))
#plt.ylabel("Learning Rate")
#plt.xlabel("Train Step")
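
The schedule above computes `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`: the learning rate rises linearly during warmup and then decays with the inverse square root of the step. The NumPy sketch below mirrors that formula outside of TensorFlow so you can check where it peaks. `schedule` is a hypothetical helper, and `d_model=256` is taken from the exercise above.

```python
import numpy as np


def schedule(step, d_model=256, warmup_steps=4000):
    """NumPy version of the warmup-then-decay learning rate formula."""
    step = np.asarray(step, dtype=np.float64)
    arg1 = step ** -0.5                  # decay term, dominates after warmup
    arg2 = step * warmup_steps ** -1.5   # linear warmup term
    return d_model ** -0.5 * np.minimum(arg1, arg2)


steps = np.arange(1, 20001)
rates = schedule(steps)

# the two terms are equal at step == warmup_steps, which is where the rate peaks
peak_step = steps[np.argmax(rates)]
print("peak step:", peak_step)
print("peak learning rate:", rates.max())
```

A larger `warmup_steps` gives a slower ramp and a lower peak, which can help keep early training stable.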

In [None]:
#@title Run this cell to compile our model
learning_rate = CustomSchedule(D_MODEL)

optimizer = tf.keras.optimizers.Adam(
    learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

def accuracy(y_true, y_pred):
  # ensure labels have shape (batch_size, MAX_LENGTH - 1)
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)

model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])
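
The `loss_function` defined earlier multiplies each token's loss by a mask that is 0 wherever the label is the padding id 0, so padded positions contribute nothing to the total. Below is a minimal NumPy sketch of the same idea on made-up numbers. `masked_loss` is a hypothetical name, and like the notebook's version it averages over every position, padded ones included.

```python
import numpy as np


def masked_loss(y_true, logits):
    """Sparse categorical cross-entropy from logits, ignoring padding (label 0)."""
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # negative log-probability of the correct token at each position
    per_token = -np.take_along_axis(log_probs, y_true[..., None], axis=-1)[..., 0]

    # zero out the loss wherever the label is padding
    mask = (y_true != 0).astype(np.float64)
    return (per_token * mask).mean()


# one sequence of length 4 over a 5-token vocabulary; the last two labels are padding
y_true = np.array([[2, 3, 0, 0]])
logits = np.zeros((1, 4, 5))  # uniform predictions, so each token costs log(5)

loss = masked_loss(y_true, logits)
print(loss)
```

Only the two real tokens are counted here, so the result is half of what it would be if all four positions held real labels.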

###Fitting our data

**Exercise**: We will now train the model we built on Colab's free GPU! Train your model for 20 epochs.

**Note: Once it starts working, stop the code chunk!** You'll know that it works when you see a progress bar starting with `Epoch 1/20`.

Why are we stopping early? Our model isn't even close to done training; however, actually training it would take *waaaaay* too long for our class. For reference, if we were to train the GPT-3 Transformer-based model on a single GPU, then [it would take an estimated *355 years* to train](https://lambdalabs.com/blog/demystifying-gpt-3/#1)! Quite a long class period!

In [None]:
## YOUR CODE HERE

So... do we just end here? Well, we just so happen to have a version of this model that was trained on this same data. So we'll evaluate that instead!

In [None]:
#@title Run this cell to import our pretrained model
#### TODO: Include path to .h5 file 

model.load_weights("chatbot_transformer_v4.h5")

###Evaluating our model

Now that we have a working model, let's start playing around with it and see how it does!

In [None]:
#@title Run this code chunk to get the functions that we'll need for this section!
def evaluate(sentence):
  sentence = preprocess_text(sentence)

  sentence = tf.expand_dims(
      START_TOKEN + tokenizer.encode(sentence) + END_TOKEN, axis=0)

  output = tf.expand_dims(START_TOKEN, 0)

  for i in range(MAX_LENGTH):
    predictions = model(inputs=[sentence, output], training=False)

    # select the prediction for the last token along the seq_len dimension
    predictions = predictions[:, -1:, :]
    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # return the result if the predicted_id is equal to the end token
    if tf.equal(predicted_id, END_TOKEN[0]):
      break

    # concatenate the predicted_id to the output, which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0)


def predict(sentence):
  prediction = evaluate(sentence)

  predicted_sentence = tokenizer.decode(
      [i for i in prediction if i < tokenizer.vocab_size])

  print('Input: {}'.format(sentence))
  print('Output: {}'.format(predicted_sentence))

  return predicted_sentence
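
`evaluate` performs what is usually called greedy decoding: at every step it keeps the single most likely next token, appends it to the output, and stops at the end token or after `MAX_LENGTH` steps. The toy below shows the same loop without a neural network. Everything in it is made up for illustration: `toy_model` is a lookup table standing in for the Transformer, and the token ids are arbitrary.

```python
START, END, MAX_STEPS = 1, 2, 10

# hypothetical stand-in for the model: scores for the next token given the last one
NEXT_TOKEN_SCORES = {
    1: {5: 0.9, 6: 0.1},
    5: {6: 0.7, 2: 0.3},
    6: {2: 0.8, 5: 0.2},
}


def toy_model(output):
    """Return next-token scores based only on the most recent token."""
    return NEXT_TOKEN_SCORES[output[-1]]


def greedy_decode():
    output = [START]
    for _ in range(MAX_STEPS):
        scores = toy_model(output)
        # argmax: keep the highest-scoring token
        predicted_id = max(scores, key=scores.get)
        # stop as soon as the end token is predicted
        if predicted_id == END:
            break
        # otherwise feed the prediction back in as part of the next input
        output.append(predicted_id)
    return output


decoded = greedy_decode()
print(decoded)
```

Because greedy decoding always takes the top choice, the same input always produces the same reply, which is worth keeping in mind when you test the chatbot.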

###Testing the Transformer's performance

Let's tell our program something and see what it says!

In [None]:
#@title Type in a phrase and let's see what our mental health chatbot thinks!

sentence = "What will make me happy?" #@param {type:"string"}
print("--------------------")
#output = predict(sentence)
predict(sentence)
print("--------------------")

###Discussion

1. How did our model do? Do you think it does better on certain inputs than others?
2. Try typing in one sentence, then replacing one word in that sentence, and look at how the output changes. For example, try running

  `I'm feeling very happy today`

  and then

  `I'm feeling very sad today`

  See how your model does when you change just one word. Does this surprise you?

3. Does this chatbot sound more conversational than our previous chatbots? Is it "good enough" or do you think there's room for improvement?

###Transfer Learning

One flaw in our design above is that we are training our model on a relatively small set of conversations. Because of that, our program has been exposed to only a small sample of the questions it could ever receive.

One way to improve our chatbot is to take a machine learning program that someone else has already built using millions and millions of conversations and teach that program to be sensitive to mental health questions.

This is called **transfer learning**, which is the process of using a pre-trained model to solve a new problem. In this case, we'll use a powerful machine learning program called BERT (Bidirectional Encoder Representations from Transformers), which leverages the same Transformer architecture that we learned about before but was trained on much more data.
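
To get a feel for the idea, here is a tiny NumPy sketch of transfer learning in general. It is not BERT and not this notebook's model: `W_pretrained` is a random matrix standing in for weights learned elsewhere, the dataset is made up, and only a small new "head" is trained on top while the pretrained part stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

# "pretrained" layer: stands in for weights learned on a large dataset; kept frozen
W_pretrained = rng.normal(size=(4, 8))
W_before = W_pretrained.copy()

# small made-up dataset for the new task
X = rng.normal(size=(64, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)


def features(inputs):
    """Run inputs through the frozen pretrained layer."""
    return np.tanh(inputs @ W_pretrained)


def head_loss(w, b):
    """Binary cross-entropy of the new head on top of the frozen features."""
    p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# new task-specific head: the only parameters we train
w, b = np.zeros(8), 0.0
initial_loss = head_loss(w, b)

for _ in range(200):
    F = features(X)
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    grad = p - y
    # gradient descent updates touch the head only, never W_pretrained
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

final_loss = head_loss(w, b)
print(initial_loss, final_loss)
```

The design choice to notice is that far fewer parameters are trained than the full model contains, which is why transfer learning can work with a small dataset like ours.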

**TODO: Do transfer learning with BERT**

###Woebot Revisited

Now, let's revisit *Woebot*, the mental health chatbot that served as an inspiration for our project!

In [None]:
#@title Run this to watch a YouTube video describing how Woebot works!
from IPython.display import YouTubeVideo
YouTubeVideo('Ym5-e4E6Saw')

As you can hopefully see, our chatbot is well on its way to looking more and more like Woebot! With some more work, you'll be able to make an app like Woebot as well, using many of the concepts and principles that we've talked about in this project!

##Where do we go from here?

Congratulations! You've gotten a little bit of experience now in creating a mental health chatbot. You've learned about how we can leverage the power of NLP to create a program that can talk with someone and provide them with a constant companion. Where do we go from here?

Here are some possible suggestions to talk about:

1. More data: any AI/ML algorithm does better when it's given more training data. 
  - Where could we get more data? 
  - How much more data do we need? 

2. Can we modify our program to make it better? If so, how so?
  - Could we perhaps combine our machine learning based method with some rules?
  - Do we want our program to, say, be able to send the user to mental health resources or be able to contact live mental health professionals?

3. What do you think are some of the ethical implications of these chatbots?
  - What happens if our chatbot says something discouraging and just causes someone to feel worse?
  - Should our program be trying to give advice to users or is it more of a companion that just provides an ear to talk to?
  - Chatbots in the news: [A GPT-3 bot posted comments on Reddit for a week and no one noticed](https://www.technologyreview.com/2020/10/08/1009845/a-gpt-3-bot-posted-comments-on-reddit-for-a-week-and-no-one-noticed/)

4. What do we do with the program that we just created? 
  - Turn it into an app! Look into deploying the pretrained model that you used today. For example, you can use Flask/Django to create a web app where you give people a text box where they can type in questions and talk to our new chatbot. 
  - Use this same technology, but for something else! We created a chatbot in the context of mental health, but as you've seen, there are chatbots for many types of applications, so get creative! If you like sports, maybe create a chatbot that'll give you the stats of your favorite player. If you want a study buddy, maybe create a chatbot that can read the same material that you're studying and quiz you on it. If you just like hearing yourself talk, take all your texts, WeChat messages, WhatsApp conversations, and everything else in between, feed that into a chatbot, and train it to talk just like you! As you can see, the possibilities are limitless!





##Conclusion

During this deep dive, you've gotten an introduction to how we can create chatbots using AI and machine learning. You first began by learning about rule-based approaches, where we have to write every single phrase that a chatbot could hear and then design an appropriate response for each. We saw how that was inefficient, so we discussed how machine learning, specifically deep learning and neural networks, can be used to create a chatbot with a richer understanding of language. Finally, in this notebook, we experimented with some of the current state-of-the-art NLP algorithms, specifically Transformers, and we learned about how they can be used to create a more conversational chatbot. You've learned so much over the course of this project and have reviewed some very advanced topics, which will hopefully fuel your growth as you start your career in AI and machine learning! Happy coding!