# Text Classification with BERT

![bert](https://res.cloudinary.com/practicaldev/image/fetch/s--ozy733MJ--/c_imagga_scale,f_auto,fl_progressive,h_420,q_auto,w_1000/https://dev-to-uploads.s3.amazonaws.com/i/q5e65ugnue96bir3usyk.png)

BERT (Bidirectional Encoder Representations from Transformers) is an NLP model released by Google in 2018. It comes pre-trained on a corpus of about 2,500M words (roughly 170 GB) from Wikipedia; the original training data also included the 800M-word BooksCorpus.

To accomplish a particular NLP task, the pre-trained BERT model is used as a base and refined by adding an additional layer; the model can then be trained on a labeled dataset dedicated to the NLP task to be performed. This is the very principle of transfer learning. It is important to note that BERT is a very large model with 12 layers, 12 attention heads and 110 million parameters (BERT base).
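To make the "additional layer" concrete, here is a minimal sketch in plain Python, with no deep learning framework involved. It assumes that BERT base turns each sentence into a 768-dimensional vector and that the new layer is a single dense layer followed by a softmax. The sentence vector is random and `NUM_LABELS` is a hypothetical value chosen for illustration, so only the shapes and the parameter count are meaningful.

```python
import math
import random

HIDDEN_SIZE = 768   # size of the sentence vector produced by BERT base
NUM_LABELS = 5      # hypothetical number of classes, for illustration only


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(sentence_vector, weights, biases):
    """One dense layer followed by a softmax, stacked on top of BERT's output."""
    scores = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


rng = random.Random(0)
# Stand-in for the vector BERT would output for one sentence (random here).
sentence_vector = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
# These weights and biases are the only new parameters; BERT's own are reused.
weights = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
biases = [0.0] * NUM_LABELS

probabilities = classification_head(sentence_vector, weights, biases)
new_parameters = NUM_LABELS * HIDDEN_SIZE + NUM_LABELS
print(f"{len(probabilities)} class probabilities, {new_parameters} new parameters")
```

Compared with the 110 million pre-trained parameters, the new layer is tiny, which is why a modest labeled dataset can be enough for fine-tuning.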

BERT can be adapted to many tasks:

*   Classification
*   Question-answering
*   Translation and text generation (BERT is an encoder only, so these tasks need an additional decoder)
*   ...

### Why BERT?

A look at the General Language Understanding Evaluation ([GLUE](https://gluebenchmark.com/)) benchmark [leaderboard](https://gluebenchmark.com/leaderboard) makes it easy to see that many models on the list are variants of BERT.

## Let's go!
To use BERT you need either PyTorch or TensorFlow installed in your environment. Access to a GPU is also preferable. If you don't have one, you can use [Google Colab](https://colab.research.google.com/).

Next, let’s install the `transformers` package from Hugging Face. It provides pre-trained models such as BERT together with PyTorch and TensorFlow interfaces to them.




In [None]:
#!pip install transformers
#conda install -c conda-forge transformers

In [20]:
import transformers

In [21]:
import pandas as pd
import numpy as np


In [22]:
import tensorflow as tf
# TFBertTokenizer runs inside the TensorFlow graph and also needs the tensorflow_text package
from transformers import TFBertTokenizer

In [9]:
import GPUtil
GPUtil.getAvailable()


[0]

In [None]:
import torch
use_cuda = torch.cuda.is_available()
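
Since the notebook mixes PyTorch and TensorFlow imports, it can help to check which packages are actually available before running anything heavy. The sketch below uses only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True when the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


for name in ("torch", "tensorflow", "transformers"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If both frameworks are reported as missing, install one of them before importing `transformers` model classes.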


## Load the Data

For this project we will use the data from Odile. Odile is a bot that tries to answer general questions on a few BeCode Discord servers. The sentences all come from conversations between learners and Odile on Discord.

You'll find the data in `./dataset/odile_data.csv`. Load it into a DataFrame and display it.

**Tip:** if you are using Google Colab, you can upload the CSV to your Google Drive and connect your notebook to it (search online for how to do that!).





In [23]:
import pandas as pd
df = pd.read_csv('./dataset/odile_data.csv', encoding = 'utf-8')

In [24]:
df.head(10)

Unnamed: 0,sentence,intent
0,who are you?,smalltalk_agent_acquaintance
1,all about you,smalltalk_agent_acquaintance
2,what is your personality,smalltalk_agent_acquaintance
3,define yourself,smalltalk_agent_acquaintance
4,what are you,smalltalk_agent_acquaintance
5,say about you,smalltalk_agent_acquaintance
6,introduce yourself,smalltalk_agent_acquaintance
7,describe yourself,smalltalk_agent_acquaintance
8,about yourself,smalltalk_agent_acquaintance
9,tell me about you,smalltalk_agent_acquaintance


In [None]:
df.sentence.count()

In [None]:
df['intent'].value_counts()

In [None]:
df['intent'].nunique()

## Explore the data

It's time to take a quick look at our data.

As you can see, the questions from the learners are classified by intent (i.e. the goal the user has in mind when typing a question or comment).

**Exercise:** Use your data exploration and visualization skills to answer the following questions:

*   How many observations does the dataset contain?
*   How many different labels does the dataset contain?
*   Which labels contain the most observations?
*   Which labels contain the fewest observations?

## It's time to clean up!


Not all NLP tasks require the same preprocessing. In this case, we have to ask ourselves some questions: 

- Are there unwanted characters in the dataset? For example, do you want to keep the smileys or not?
  - If, for example, you want to create labels to analyze feelings, it might be preferable to keep the smileys.
- Is it relevant to keep capital letters in sentences?
  - In this case, capital letters don't really matter: not everyone starts their sentences with a capital letter when chatting, and the sentences are short and addressed directly to Odile.
- Is it necessary to limit the length of a sentence?
  - In this case it may be preferable to limit the number of words. Questions to Odile are supposed to be short, and overly long sentences could hurt the classification because they mix too much information.

There is no universal answer. Everything will depend on the expected result. 

**Exercise:** Clean the dataset.
- Remove all unnecessary characters. You can choose to keep the smileys or not.
- Put all sentences in lower case.
- Limit each sentence to 256 words.

What other preprocessing steps can you think of?
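
As one possible way to combine the three steps of the exercise, here is a small sketch built on the standard `re` module. It is not the notebook's own solution: the emoticon pattern is a deliberately tiny, hypothetical one, and the character filter assumes plain English text (accented letters would be dropped).

```python
import re

# Hypothetical, deliberately small set of text emoticons such as :) ;-) :d
# (lower-case letters only, because the text is lower-cased first).
EMOTICON_PATTERN = re.compile(r"[:;=]-?[)(dp]")


def clean_sentence(text, keep_emoticons=True, max_words=256):
    """Lower-case the text, strip unwanted characters and cap the word count."""
    kept = []
    for word in text.lower().split():
        if keep_emoticons and EMOTICON_PATTERN.fullmatch(word):
            kept.append(word)  # leave the emoticon untouched
            continue
        word = re.sub(r"[^a-z0-9]", "", word)  # keep only letters and digits
        if word:
            kept.append(word)
    return " ".join(kept[:max_words])


print(clean_sentence("Who ARE you?? :)"))
print(clean_sentence("Who ARE you?? :)", keep_emoticons=False))
print(clean_sentence("one two three four", max_words=2))
```

The function can be applied to the whole column with `df['sentence'].apply(clean_sentence)`.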

In [25]:
# convert to lower case
df['sentence']=df['sentence'].str.lower()
df['sentence']

0                                who are you?
1                               all about you
2                    what is your personality
3                             define yourself
4                                what are you
                        ...                  
1550       do you want to dominate the world?
1551    do you want humans to be your slaves?
1552            do you want to control humans
1553        do you want to control the world?
1554       do robots want to dominate humans?
Name: sentence, Length: 1555, dtype: object

In [26]:
# Remove all punctuation characters
import string

def cleaning_punctuations(text):
    # Translation table that deletes every ASCII punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

df['sentence'] = df['sentence'].apply(cleaning_punctuations)
df['sentence']

0                                who are you
1                              all about you
2                   what is your personality
3                            define yourself
4                               what are you
                        ...                 
1550       do you want to dominate the world
1551    do you want humans to be your slaves
1552           do you want to control humans
1553        do you want to control the world
1554       do robots want to dominate humans
Name: sentence, Length: 1555, dtype: object

In [27]:
# Limit text to 256 words
max_size = 256

df['sentence'] = df['sentence'].str.split(n=max_size).str[:max_size].str.join(' ')
df['sentence']

0                                who are you
1                              all about you
2                   what is your personality
3                            define yourself
4                               what are you
                        ...                 
1550       do you want to dominate the world
1551    do you want humans to be your slaves
1552           do you want to control humans
1553        do you want to control the world
1554       do robots want to dominate humans
Name: sentence, Length: 1555, dtype: object

## Defining observations (`X`) and labels (`y`)

As you know, training a model requires a set of observations (`X`) and their corresponding labels (`y`).

In this case, `X` is your clean text and `y` is the intent.

Do not forget that we are dealing with a multi-class classification problem, so you may have to **one-hot encode** the target value. Keep track of the mapping between the encoded values and the labels in a dictionary.
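
The idea can be sketched in plain Python. The intent names below are hypothetical and only serve to show the two dictionaries and the one-hot vectors. The labels are sorted before numbering, which is also what scikit-learn's `LabelEncoder` does.

```python
intents = ["greeting", "goodbye", "greeting", "thanks"]  # hypothetical labels

# Sort the unique labels so the mapping is reproducible.
label2id = {label: index for index, label in enumerate(sorted(set(intents)))}
id2label = {index: label for label, index in label2id.items()}


def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at the position of the class."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector


encoded = [one_hot(label2id[label], len(label2id)) for label in intents]
print(label2id)
print(encoded)
```

Keeping `id2label` around makes it possible to turn a predicted class index back into a readable intent name later on.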

Map the textual labels to integers using `LabelEncoder`:

In [28]:
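# Map textual labels to integers using LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Keep the fitted encoder so the mapping can be recovered later
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["intent"])
# Dictionary from integer id back to intent name
id2label = dict(enumerate(label_encoder.classes_))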

In [12]:
df

Unnamed: 0,sentence,intent,label
0,who are you,smalltalk_agent_acquaintance,4
1,all about you,smalltalk_agent_acquaintance,4
2,what is your personality,smalltalk_agent_acquaintance,4
3,define yourself,smalltalk_agent_acquaintance,4
4,what are you,smalltalk_agent_acquaintance,4
...,...,...,...
1550,do you want to dominate the world,smalltalk_bot_world_dominate,44
1551,do you want humans to be your slaves,smalltalk_bot_world_dominate,44
1552,do you want to control humans,smalltalk_bot_world_dominate,44
1553,do you want to control the world,smalltalk_bot_world_dominate,44


In [29]:
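# Ensure every sentence is a string; the integer `label` column must stay numeric
df['sentence'] = df['sentence'].astype(str)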

## Split your dataset!

After all this time, I dare to hope that it is not necessary to explain this step anymore!

**Exercise:** Create the variables `X_train`, `X_val`, `X_test`, `y_train`, `y_val` and `y_test`.

In [None]:
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(df["sentence"].values, df["label"].values, test_size=0.2, random_state=42)


In [30]:
#Let's isolate our `X` and `y`:
X=df['sentence']

In [None]:
X

In [None]:
print(type(X))

In [14]:
# Use only the encoded label as the target (dropping columns would also keep `Unnamed: 0`)
y = df['label']

In [None]:
y

In [None]:
print(type(y))

In [16]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, 
    y,
    test_size=0.2, 
    random_state=42, 
    shuffle=True
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, 
    y_train_val,
    test_size=0.2, 
    random_state=42, 
    shuffle=True
)
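
Two successive 80/20 splits do not give an 80/10/10 partition. The short calculation below shows the resulting sizes for the 1,555 sentences of this dataset. It assumes the held-out part of each split is rounded up, which is my understanding of how `train_test_split` handles fractional sizes; the helper `split_sizes` is made up for this example.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.2):
    """Sizes of the train/val/test sets after two successive splits."""
    n_test = math.ceil(n_samples * test_size)
    n_train_val = n_samples - n_test
    n_val = math.ceil(n_train_val * val_size)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(1555)
for name, size in (("train", n_train), ("val", n_val), ("test", n_test)):
    print(f"{name}: {size} sentences ({size / 1555:.0%})")
```

In other words, the model is trained on roughly 64% of the data, tuned on 16% and tested on 20%.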

## Tokenization 
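If you need a refresher on tokenization, look [here](../1.preprocessing/1.tokenization.ipynb).

We will use the tokenizer that ships with BERT. Its vocabulary was built during pre-training, so we do not have to train our own.

**Exercise:** Create a `tokenizer` variable by calling `from_pretrained()` on a BERT tokenizer class from `transformers` (for example `BertTokenizer`) and load the `bert-base-uncased` checkpoint. ("Uncased" means the text is lower-cased before tokenization, so the model is case-insensitive.) The lighter DistilBERT has its own `DistilBertTokenizer` class and `distilbert-base-uncased` checkpoint.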

Read more: [Tokenizer documentation](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer).
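
To give an intuition of what the BERT tokenizer does, here is a toy version of its WordPiece scheme: each word is split greedily into the longest pieces found in the vocabulary, and pieces that continue a word carry a `##` prefix. The vocabulary below is a hypothetical handful of entries (the real `bert-base-uncased` vocabulary has about 30,000), so the output only illustrates the mechanism.

```python
def wordpiece(word, vocab, unknown="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"who", "are", "you", "token", "##ization", "##s"}


def tokenize(sentence):
    """Lower-case, split on spaces, apply WordPiece and add the special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, toy_vocab))
    return tokens + ["[SEP]"]


print(tokenize("Who are you"))
print(tokenize("tokenization"))
print(tokenize("xyz"))
```

The real tokenizer then replaces each token by its integer id in the vocabulary, which is what the model actually receives.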

In [None]:
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

In [17]:
import tensorflow as tf
from transformers import TFBertTokenizer

tokenizer = TFBertTokenizer.from_pretrained("bert-base-uncased")
#tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


In [None]:
#tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
#from transformers import DistilBertTokenizerFast

#tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

### Tokenize the dataset

Good! We have instantiated our tokenizer, but we have not yet converted our sentences into vectors.
To do this we apply the tokenizer to our dataset. This turns each text into a vector of token ids.


**Exercise:** Create the `train_encodings`, `val_encodings` and `test_encodings` by calling the tokenizer on `X_train`,  `X_val` and `X_test`.

You need to know 3 parameters. 

- **max_length:** Maximum length of the sequence. You can set it to 200.
- **truncation:** Truncates to the maximum length specified by the `max_length` argument. For a pair of sequences, tokens are removed one at a time from the longest sequence in the pair until the proper length is reached. You can set it to `True`.
- **padding:** Makes all vectors the same length. You can set it to `True`, which is equivalent to `'longest'` (pad to the longest sequence in the batch). Note that the in-graph `TFBertTokenizer` only accepts the strings `'longest'` or `'max_length'`.

[More info here](https://huggingface.co/docs/transformers/preprocessing)
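
The effect of these three parameters can be illustrated without loading any model. The sketch below works on made-up lists of token ids and ignores special tokens such as `[CLS]` and `[SEP]`, which a real tokenizer keeps in place when it truncates. The padding id `0` matches BERT's `[PAD]` token; the function name is invented for this example.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7, 8, 9], [5, 6, 7]]  # made-up token ids
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model tell real tokens from padding, which is why tokenizers return it alongside the ids.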

In [32]:
# TFBertTokenizer only accepts padding='longest' or padding='max_length';
# padding=True raises "ValueError: Padding must be either 'longest' or 'max_length'!"
train_encodings = tokenizer(X_train.tolist(), max_length=200, truncation=True, padding='longest')

In [19]:
# TFBertTokenizer only accepts padding="longest" or padding="max_length";
# padding=True raises "ValueError: Padding must be either 'longest' or 'max_length'!"
val_encodings = tokenizer(X_val.tolist(), max_length=200, truncation=True, padding="longest")

In [None]:
# padding must be "longest" or "max_length" for TFBertTokenizer (padding=True is rejected)
test_encodings = tokenizer(X_test.tolist(), max_length=200, truncation=True, padding="longest")
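
The tokenizer calls above combine `max_length`, `truncation` and `padding`. As a rough illustration of how the two accepted padding modes differ, here is a plain-Python sketch. The helper `pad_batch` and the token ids are made up for this example; a real tokenizer also handles special tokens during truncation, which this sketch ignores.

```python
def pad_batch(sequences, padding, max_length, pad_id=0):
    """Truncate to max_length, then pad with pad_id.

    padding="longest" pads to the longest sequence in the batch,
    padding="max_length" pads every sequence to max_length.
    """
    truncated = [seq[:max_length] for seq in sequences]
    if padding == "longest":
        target = max(len(seq) for seq in truncated)
    elif padding == "max_length":
        target = max_length
    else:
        raise ValueError("Padding must be either 'longest' or 'max_length'!")
    return [seq + [pad_id] * (target - len(seq)) for seq in truncated]


batch = [[101, 7592, 102], [101, 2129, 2024, 2017, 102]]  # made-up token ids
print(pad_batch(batch, "longest", max_length=8))
print(pad_batch(batch, "max_length", max_length=8))
```

With short sentences like the ones in this dataset, `"longest"` keeps the tensors much smaller than padding everything to 200 tokens.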

## Prepare the datasets for training

You can now convert your training, validation and test sets into datasets that contain both the observations and the labels. Use the [from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) method from TensorFlow to create the datasets. Here we pass it a tuple with two elements:

*   The encodings that you have just created (cast to a `dict`)
*   The labels
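
To make the slicing concrete, here is a plain-Python sketch of what `from_tensor_slices` does conceptually with a `(features, labels)` tuple: it cuts every array along its first axis and yields one `(features, label)` pair per sentence. The helper `slice_pairs` and the toy values are hypothetical and only mimic the behaviour; the real method works on tensors.

```python
def slice_pairs(features, labels):
    """Yield one (features, label) pair per example, mimicking the slicing
    that tf.data.Dataset.from_tensor_slices performs along the first axis."""
    n_examples = len(labels)
    # Every feature array must have one row per label.
    for name, rows in features.items():
        if len(rows) != n_examples:
            raise ValueError(f"{name} has {len(rows)} rows, expected {n_examples}")
    for i in range(n_examples):
        yield {name: rows[i] for name, rows in features.items()}, labels[i]


# Toy encodings for two padded sentences (values are made up).
toy_encodings = {
    "input_ids": [[101, 7592, 102, 0], [101, 9061, 2085, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [[1, 0], [0, 1]]  # one-hot labels

pairs = list(slice_pairs(toy_encodings, toy_labels))
print(pairs[0])
```

Each element of the resulting dataset therefore carries everything the model needs for one sentence, which is why the encodings and the labels must have the same number of rows.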



In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

Create the `val_dataset` and the `test_dataset` in the same way:

In [None]:
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_val
))

In [None]:
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

## Training

### Load BERT model

You will need to load the pre-trained BERT model by using the class `TFBertForSequenceClassification`.

⚠️ You must use the same checkpoint as the one used for tokenization. So in our case `bert-base-uncased`.

* [BERT for Sequence Classification Documentation](https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification)

**Exercise:** Create a `model` variable and load it by using `TFBertForSequenceClassification.from_pretrained()`. As a parameter, you must indicate the number of labels (get this number from your original dataframe).



In [1]:
from transformers import TFBertForSequenceClassification

# Load the same checkpoint as the tokenizer. num_labels must match the number
# of distinct intents (95 here; check it against your dataframe).
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=95)



In [None]:
from transformers import *

In [None]:
#from transformers import DistilBertForSequenceClassification
#model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=95)

### Training arguments

Let's define the training arguments and compile our model:

*   Define the optimizer (Adam) and its learning rate
*   Define the loss function that will be used (remember that we have one-hot encoded output data)
*   Define appropriate evaluation metrics
*   Compile the model with the right metrics
*   Display the model summary
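
Because the labels are one-hot encoded and a sequence-classification head returns raw logits, the loss has to apply a softmax before comparing the prediction with the label. Here is a small worked example in plain Python. The logits are made-up numbers, and `softmax` and `categorical_crossentropy` are illustrative helpers written for this sketch, not the Keras functions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, logits):
    """Cross-entropy between a one-hot label and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


one_hot_label = [0, 1, 0]           # the true class is index 1
confident_logits = [0.1, 4.0, 0.3]  # the model favours class 1
wrong_logits = [4.0, 0.1, 0.3]      # the model favours class 0

loss_good = categorical_crossentropy(one_hot_label, confident_logits)
loss_bad = categorical_crossentropy(one_hot_label, wrong_logits)
print(f"loss when right: {loss_good:.3f}, loss when wrong: {loss_bad:.3f}")
```

The same reasoning explains why a "sparse" loss or metric, which expects integer class indices, does not fit one-hot encoded labels.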

In [None]:
OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-08)
# The labels are one-hot encoded and the model outputs raw logits, hence from_logits=True
LOSS = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
# CategoricalAccuracy expects one-hot labels (the Sparse variant expects integer class ids)
METRICS = [tf.keras.metrics.CategoricalAccuracy('accuracy')]

model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
model.summary()



### Training

First define the number of epochs and the batch size for the training.

The batch size will depend on your machine. If your GPU has little memory, I advise you to use 8 or 16.

The number of epochs will depend on your machine, the batch size, etc. You can start with 5, for example.
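
To get a feeling for how these two values interact, the number of optimisation steps can be worked out by hand. The sketch below assumes a hypothetical training set of 1000 sentences; replace it with the real size of your training set.

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)


n_train = 1000  # hypothetical training-set size
n_epochs = 5
for batch_size in (8, 16, 32):
    steps = steps_per_epoch(n_train, batch_size)
    print(f"batch_size={batch_size}: {steps} steps per epoch, "
          f"{steps * n_epochs} steps for {n_epochs} epochs")
```

A smaller batch size needs less GPU memory per step but takes more steps to complete one epoch.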

In [None]:
BATCH_SIZE = 8
EPOCHS = 5

In [None]:
with tf.device('/GPU:0'):
    
    history = model.fit(
        train_dataset.batch(BATCH_SIZE),
        epochs=EPOCHS,
        validation_data=val_dataset.batch(BATCH_SIZE)
    )

### Plot the learning curve of your model

In [None]:
from matplotlib import pyplot as plt

def plot_history(history):
    """ This helper function takes the tensorflow.python.keras.callbacks.History
    that is output from your `fit` method to plot the loss and accuracy of
    the training and validation set.
    """
    fig, axs = plt.subplots(1,2, figsize=(12,6))
    axs[0].plot(history.history['accuracy'], label='training set')
    axs[0].plot(history.history['val_accuracy'], label = 'validation set')
    axs[0].set(xlabel = 'Epoch', ylabel='Accuracy', ylim=[0, 1])

    axs[1].plot(history.history['loss'], label='training set')
    axs[1].plot(history.history['val_loss'], label = 'validation set')
    axs[1].set(xlabel = 'Epoch', ylabel='Loss', ylim=[0, 10])
    
    axs[0].legend(loc='lower right')
    axs[1].legend(loc='lower right')
    
plot_history(history)

## Model Evaluation

We can now evaluate our model on the test set. Use the `model.evaluate()` function.

In [None]:
loss, accuracy = model.evaluate(test_dataset.batch(BATCH_SIZE))
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

**Exercise:** Is accuracy the best metric for this dataset? Explain your answer!
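
As a hint, the toy example below compares accuracy with macro-averaged recall on a deliberately imbalanced set of labels. The label names and counts are invented for illustration and say nothing about the real distribution of intents in the Odile data, which you should check yourself.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of the per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical imbalanced test set: 9 "greeting" sentences, 1 "goodbye".
y_true = ["greeting"] * 9 + ["goodbye"]
# A lazy model that always predicts the majority class.
y_pred = ["greeting"] * 10

print(f"accuracy:     {accuracy(y_true, y_pred):.2f}")
print(f"macro recall: {macro_recall(y_true, y_pred):.2f}")
```

When some intents have far fewer examples than others, a metric that averages over classes can reveal problems that accuracy hides.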

## Test your model

Well done, you did it :-)

Oh... I have an idea! Try to classify the sentence *Well done !* with your model.

Remember to apply all the preprocessing steps, then predict the intent of the user.

**Tip:** use the mapping you created above to retrieve the original label of the prediction!
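
The last step, going from the model output back to a label, can be sketched in plain Python. The mapping `index_to_label`, its label names and the logits below are hypothetical placeholders; use the mapping you built from your own dataframe and the logits returned by your model.

```python
def predict_label(logits, index_to_label):
    """Return the label whose logit is the largest."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return index_to_label[best_index]


# Hypothetical mapping and model output for a single sentence.
index_to_label = {0: "greeting", 1: "compliment", 2: "goodbye"}
logits = [0.2, 3.1, -1.0]

print(predict_label(logits, index_to_label))
```

Applying a softmax is not needed to pick the winner, because the largest logit always corresponds to the largest probability.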

In [None]:
text = "Well done !"




In [None]:
#Let's isolate our `X` and `y`:
X=df.drop('intent',axis=1) 
y=df['intent'] 

## One-hot encoding

[One-Hot Encoding in Scikit-Learn with OneHotEncoder](https://datagy.io/sklearn-one-hot-encode/)
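
Before using scikit-learn, it can help to see what one-hot encoding produces on a tiny example. The sketch below is a plain-Python illustration with made-up intents; `one_hot_encode` is a hypothetical helper, not part of scikit-learn. It sorts the categories so that the column order is deterministic.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the column of this value's category
        rows.append(row)
    return categories, rows


intents = ["greeting", "goodbye", "greeting", "thanks"]  # made-up intents
categories, rows = one_hot_encode(intents)
print(categories)
for intent, row in zip(intents, rows):
    print(intent, row)
```

Each row contains exactly one `1`, and its position identifies the intent, so the list of categories also serves as the mapping back from a predicted index to a label.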

In [None]:
# One-hot encoding a single column
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
transformed = ohe.fit_transform(df[['intent']])
print(transformed.toarray())

In [None]:
# Getting one hot encoded categories
print(ohe.categories_)

In [None]:
df[ohe.categories_[0]] = transformed.toarray()
df.head()