<a href="https://colab.research.google.com/github/adamzki99/nlp-zlatan/blob/main/nlp_zlatan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Connect to Google Drive

This notebook is designed to be used together with Google Colab. We start by connecting the notebook to our personal Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Make sure that the dataset is stored under the same file path in your Drive.

In [None]:
%cd /content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia

/content/drive/MyDrive/nlp-datasets/wizard_of_wikipedia


## Reading the dataset

The dataset is deeply nested, hard to navigate, and difficult to wrap one's head around. We recommend reading this [resource](https://parl.ai/projects/wizard_of_wikipedia/) to get a better understanding.

In [None]:
import json

with open('data.json', 'r') as file:
    json_data = file.read()
    data = json.loads(json_data)

print('Datatype:', type(data))

Datatype: <class 'list'>


Use the following keys to double-check against the [resource](https://parl.ai/projects/wizard_of_wikipedia/) that you have loaded the right dataset.

In [None]:
data[0]['dialog'][0].keys()
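
Since the nesting is hard to keep in mind, a small helper that prints the key structure of one entry can help. The sketch below is only an illustration: `describe` and `toy_entry` are made-up names, and the toy record merely imitates the shape of one entry of `data`.

```python
def describe(obj, depth=0, max_depth=3):
    """Return indented lines describing the nesting of dicts and lists."""
    indent = "  " * depth
    if depth >= max_depth:
        return [f"{indent}..."]
    if isinstance(obj, dict):
        lines = []
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, list):
        lines = [f"{indent}[list of {len(obj)}]"]
        if obj:
            # Only the first element is inspected; the rest are assumed alike.
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return []


# Toy record, a made-up stand-in for one entry of `data`.
toy_entry = {
    "chosen_topic": "Science fiction",
    "dialog": [{"speaker": "0_Wizard", "text": "Hi", "checked_sentence": {"k": "v"}}],
}
summary = describe(toy_entry)
print("\n".join(summary))
```

In the notebook, `describe(data[0])` would give the same kind of overview for a real conversation.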

## Data exploration

How big is the dataset, and how does it look?

In [None]:
len(data)

In [None]:
data[:5]

In [None]:
data[0].keys()

In [None]:
data[0]['chosen_topic_passage']

### Dictionary keys of Wizard

In [None]:
data[0]['dialog'][0].keys()

### Dictionary keys of Apprentice

In [None]:
data[0]['dialog'][1].keys()

### Dialog example

In [None]:
for i in range(10):
    print(i, ":", data[0]['dialog'][i]['text'])

### Exploring unique types

We explore how many unique "chosen_topic", "persona" and "wizard_eval" values there are in the dataset.

In [None]:
topics = []
personas = []
wizardEvals = []

for entry in data:

  topics.append(entry['chosen_topic'])
  personas.append(entry['persona'])
  wizardEvals.append(entry['wizard_eval'])

# Keep only the unique items in each list
topics = list(set(topics))
personas = list(set(personas))
wizardEvals = list(set(wizardEvals))

print("topic:", len(topics), "persona:", len(personas), "wizard_eval:", len(wizardEvals))

#### "Wizard evals"

Why are there more than 5 different "wizard_eval"s? The paper only mentions a rating from 1-5. What are the other 2?

In [None]:
# Print the rating values themselves (not list items indexed by rating)
for entry in sorted(wizardEvals):
  print(entry)


What's up with -1 and 0? The paper only mentions ratings from 1 to 5.

How often does each rating occur in "wizard_eval"? We visualize all the instances in a histogram.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

wEval = []

for entry in data:
    wEval.append(entry['wizard_eval'])

plt.hist(wEval, bins=2*len(set(wEval))) #the number of bins can probably be improved to look nicer
plt.yscale('log')
plt.show()
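
The histogram shows the shape of the distribution, but exact counts per rating are easier to read from a tally. A minimal sketch, assuming made-up ratings in `toy_ratings` in place of the real `wizard_eval` values:

```python
from collections import Counter

# Made-up ratings standing in for [entry['wizard_eval'] for entry in data].
toy_ratings = [5, 4, 5, -1, 0, 3, 5, 4]

rating_counts = Counter(toy_ratings)
for rating in sorted(rating_counts):
    print(f"wizard_eval = {rating:>2}: {rating_counts[rating]} conversations")
```

Replacing `toy_ratings` with `wEval` from the cell above would give the counts for the real dataset.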

#### Topics

In [None]:
# What is a topic?

topics[:10]

#### Personas

In [None]:
# What is a persona?

personas[:10]

## all-MiniLM-L6-v2

This implementation is based on the all-MiniLM-L6-v2 model, which is available from [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

all-MiniLM-L6-v2 is a sentence-transformers model. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. It is a later model than the one shown in one of the tutorials, but it is used in more or less the same way.

We chose a BERT-style model because we wanted to explore the possibility of creating a "vector database". The use case is as follows:

From a natural user input, we want to retrieve the correct Wikipedia passage, so that the input required from the user is as small as possible.

The reduction in input comes from the exclusion of topics etc.

### Splitting the data

We split the data 80/20. We use 80% of the original dataset to fine-tune all-MiniLM-L6-v2; the rest is used to validate the result. Note that the code below assigns every usable Wizard turn whose position in the dialog is a multiple of 4 to the test set, so the exact ratio depends on the dialogs and is printed in the next section.

We aim for the model to be able to search the vector space with new input and still find the correct Wiki-passage.

It is important to note that this is a best-case scenario, as the input is generated from text that is present in the Wiki-passage.

In [None]:
import pandas as pd

data_extract_train = {
    "chosen_topic": [],
    "speaker_passage": [],
    "checked_sentence": [],
    "chosen_topic_passage": []
}

data_extract_test = {
    "chosen_topic": [],
    "speaker_passage": [],
    "checked_sentence": [],
    "chosen_topic_passage": []
}

for i, conversation in enumerate(data):

  for j, dialog in enumerate(conversation['dialog']):    

    if "Wizard" in dialog['speaker']:

      checked_sentence = list(dialog['checked_sentence'].values())

      if "no_passages_used" not in checked_sentence:

        if j % 4 == 0:

          data_extract_test['chosen_topic'].append(conversation['chosen_topic'])
          data_extract_test['speaker_passage'].append(dialog['text'])
          data_extract_test['checked_sentence'].append(checked_sentence)
          data_extract_test['chosen_topic_passage'].append(conversation['chosen_topic_passage'])

        else:
      
          data_extract_train['chosen_topic'].append(conversation['chosen_topic'])
          data_extract_train['speaker_passage'].append(dialog['text'])
          data_extract_train['checked_sentence'].append(checked_sentence)
          data_extract_train['chosen_topic_passage'].append(conversation['chosen_topic_passage'])

extract_train_df = pd.DataFrame(data_extract_train)

extract_test_df = pd.DataFrame(data_extract_test)

extract_test_df
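
The loop above decides between train and test from the position of a turn in its dialog. A possible alternative, sketched here under the assumption that a plain random split is acceptable, is to shuffle indices with a fixed seed and cut off the test share. `split_indices` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import random


def split_indices(n_items, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_items=10)
print("train:", sorted(train_idx))
print("test: ", sorted(test_idx))
```

Applied to conversations rather than single turns, such a split would also keep all turns of one conversation on the same side.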

### Reducing the size

The extracted dataset is too large for the amount of VRAM available on the GPU. Therefore we need to reduce its size.

We also found that using more data just resulted in overfitting the model.

In [None]:
testing_size = len(extract_test_df.index)*0.04
testing_size = int(testing_size)

training_size = len(extract_train_df.index)*0.04
training_size = int(training_size)

print("Ratio %:", testing_size/training_size * 100)

In [None]:
extract_test_df = extract_test_df.sample(testing_size)
extract_test_df

In [None]:
# Downsample the training data as well, using the size computed above
extract_train_df = extract_train_df.sample(training_size)
extract_train_df

### Using the model


In [None]:
%pip install transformers

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


To use the model we need to construct sentence pairs. Each pair consists of a "user input" and a sentence from the Wiki-passage.

As stated earlier, the "user input" is a generated, human-like input. It is generated from the same Wiki-passage sentence that it is paired with.

We note that this setup is not ideal, as it can be interpreted as the dataset training itself, which creates a circular dependence. We regard it as an "optimal" scenario instead.

In [None]:
def data_division(dataframe, sample_size:int):

  selected_sentences = []
  selected_conversation_topics = []

  for c, row in dataframe.sample(sample_size).iterrows():
    
    selected_conversation_topics.append(row['chosen_topic'])

    for resp in row['checked_sentence']:
      pair = (row['speaker_passage'], resp)
      selected_sentences.append(pair)

  return selected_sentences, selected_conversation_topics

In [None]:
selected_sentences_training, conversation_topics_traning = data_division(extract_train_df, len(extract_train_df.index))

In [None]:
selected_sentences_training[0]

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(device)

# Move model to GPU
model.to(device)

### Using mean-pooling

Because the input has variable length, we need to transform it into a fixed-length representation so that we can use it for training.

The process takes the average of all the token embeddings in the sequence: the embeddings are summed and the sum is divided by the number of tokens in the sequence.

Note that mean-pooling does not consider positional information or the relative importance of individual tokens within the sequence. The attention mask does not change this; it only makes sure that padding tokens are left out of the average.

The **mean_pooling** function performs mean pooling on token embeddings while considering an attention mask for correct averaging. It takes *model_output* and *attention_mask* as inputs.

The function first extracts the token embeddings from *model_output*. It then expands the attention mask to match the dimensions of the token embeddings. The expanded mask is used to zero out the embeddings that should be ignored.

Next, the masked token embeddings are summed along the second dimension (axis 1) to obtain one summed embedding per sequence. The attention mask is also summed along the second dimension and clamped to avoid division by zero.

Finally, the summed embeddings are divided by the clamped attention mask sum to compute the mean pooling. The resulting mean-pooled embeddings are returned.

In [None]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
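
To make the averaging concrete, here is a plain-Python sketch of the same computation for a single sequence. `masked_mean` and the toy values are made up for illustration; the notebook itself uses the PyTorch `mean_pooling` function above.

```python
def masked_mean(token_embeddings, attention_mask, eps=1e-9):
    """Average the token vectors of one sequence, skipping masked positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for d in range(dim):
            summed[d] += vector[d] * keep
    # Lower bound on the divisor, mirroring torch.clamp(..., min=1e-9)
    n_tokens = max(sum(attention_mask), eps)
    return [value / n_tokens for value in summed]


# Three token slots with 2-dimensional embeddings; the last slot is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)
```

The padding vector does not influence the result, because its mask value is 0 and only the two real tokens are counted in the divisor.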

The embedding is performed in the same way as described in the documentation for the all-MiniLM-L6-v2 model, including the final L2 normalization of the pooled embedding.

Normalization can provide further benefits, such as more stable training.

The main reason we normalize is to get comparable distance measurements when evaluating the performance of the model: every embedding then has unit length, so the distance between two embeddings depends only on the angle between them.

In [None]:
def perform_embedding(documents:list, device, model):

  encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')

  encoded_documents.to(device)
  with torch.no_grad():
      model_output_documents = model(**encoded_documents)

  # Perform pooling
  embedding = mean_pooling(model_output_documents, encoded_documents['attention_mask'])

  # Normalize embedding
  embedding = F.normalize(embedding, p=2, dim=1)

  return embedding

In [None]:
sentence_embeddings = perform_embedding(documents = selected_sentences_training, device = device, model = model)

### Visualizing Cluster with Hypertools

To get a better understanding of the dataset, we use Hypertools to reduce the very high-dimensional space to something we can visualize.

We generate a 3-dimensional view of the embeddings and color the points according to k-means clusters. The number of clusters corresponds to the number of topics included in the dataset.

In [None]:
%pip install hypertools

In [None]:
import hypertools as hyp

n_clusters = len(set(conversation_topics_traning))

print("Number of clusters:", n_clusters)

hyp.plot(sentence_embeddings.cpu().detach().numpy(), '.', n_clusters = n_clusters)

### Creating the index with the model output

In [None]:
%pip install hnswlib

To search the embedding vector space we perform a **k-nearest neighbors** (KNN) query. We first create an index using hnswlib, which makes the search more efficient.

We then embed the query, which in our case is the *speaker_passage*, and calculate the distance to the *k* closest elements in the index.

In [None]:
import hnswlib

# Create the HNSW index
index = hnswlib.Index(space='l2', dim=sentence_embeddings.shape[1])
index.init_index(max_elements=len(sentence_embeddings), ef_construction=200, M=16)

# Add sentence embeddings to the index
index.add_items(sentence_embeddings.cpu().numpy())
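
Because the embeddings are L2-normalized, ranking by L2 distance gives the same order as ranking by cosine similarity: for unit vectors, the squared L2 distance equals 2 - 2 * cosine similarity. To our understanding of the hnswlib documentation, the `l2` space reports this squared distance. The sketch below checks the identity on two made-up vectors in plain Python; all helper names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

cosine = dot(a, b)          # cosine similarity of two unit vectors
distance = squared_l2(a, b)
print(f"cosine similarity: {cosine:.3f}, squared L2 distance: {distance:.3f}")
print("2 - 2 * cosine:", round(2 - 2 * cosine, 3))
```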

In [None]:
# Perform a similarity search
def search_embeddings(query:str, k, device, model):

  query_embedding = perform_embedding(documents=query, device=device, model=model)

  # Convert to a NumPy array, as is done when adding items to the index
  indexes, distances = index.knn_query(query_embedding.cpu().numpy(), k=k)

  return indexes[0], distances, query_embedding
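
HNSW is an approximate search, so for a small sample it can be useful to compare its results against an exact search. A minimal brute-force sketch in plain Python, with made-up vectors and the hypothetical helper `brute_force_knn`:

```python
def brute_force_knn(query, vectors, k):
    """Return (index, squared L2 distance) for the k vectors closest to query."""
    scored = []
    for i, vector in enumerate(vectors):
        distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((i, distance))
    scored.sort(key=lambda item: item[1])
    return scored[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
neighbours = brute_force_knn(query=[0.9, 0.1], vectors=toy_vectors, k=2)
print(neighbours)
```

This scans every vector, so it is only practical for small samples, which is the reason an index is used for the full set of embeddings.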

Get a random speaker passage from the training dataset to verify that we can use the model.

In [None]:
random_message = list(extract_train_df.sample(1).to_dict()['speaker_passage'].values())[0]
message = [random_message]

message

In [None]:
indexes, distances, query_embedding = search_embeddings(query=message, k=10, device=device, model=model)

print(indexes)

In [None]:
query_subset = []

for i, ind in enumerate(indexes):
  print("Distance:", distances[0][i], "\t", selected_sentences_training[ind][1])
  query_subset.append(selected_sentences_training[ind])

### Looking at the result

Hypertools has its limitations, so to check how the results look in comparison to the query embedding we use *matplotlib.pyplot*.

Note: The plot below only shows the first two dimensions of the embeddings, which is not an optimal representation of such a high-dimensional vector space. But it is better than nothing 😉.

In [None]:
selected_sentences_embedding = perform_embedding(documents=query_subset, device=device, model=model)

In [None]:
import matplotlib.pyplot as plt

plt.scatter(sentence_embeddings.cpu()[:,0] , sentence_embeddings.cpu()[:,1], c = '#a9a9a9')
plt.scatter(selected_sentences_embedding.cpu()[:,0] , selected_sentences_embedding.cpu()[:,1], c = '#4363d8')
plt.scatter(query_embedding.cpu()[:,0] , query_embedding.cpu()[:,1], color = '#ffe119')
plt.show()

### Testing the model

We have now extracted the data, fine-tuned the model, and shown that it works once. Next we have to show that it works for more cases.

Earlier we set aside 20% of the original data for testing. Because we are aiming to create something that works as a vector database, we want exact matches and are not interested in similarity. This is why we use a one-to-one comparison and not a BLEU evaluation or similar.

In [None]:
# Get testing data
selected_sentences_testing, _ = data_division(extract_test_df, len(extract_test_df.index))

In [None]:
score = 0

for _, sentence_pair in enumerate(selected_sentences_testing):

  indexes, distances, query_embedding = search_embeddings(query=sentence_pair[0], k=1, device=device, model=model)

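  # The index was built from the training pairs, so the returned ids refer to
  # selected_sentences_training. Collect the Wiki sentences that were retrieved.
  results = []
  for i in indexes:
    results.append(selected_sentences_training[i][1])

  # Count a hit when the retrieved Wiki sentence is the one behind the query
  if sentence_pair[1] in results:
    score += 1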
  
print("Accuracy:", score/len(selected_sentences_testing)*100, "%")

That is quite bad. How does it look if we use the training data?

In [None]:
score = 0

for _, sentence_pair in enumerate(selected_sentences_training):

  indexes, distances, query_embedding = search_embeddings(query=sentence_pair[0], k=1, device=device, model=model)

  results = []
  for _, i in enumerate(indexes):

    if i >= len(selected_sentences_training):
      break

    results.append(selected_sentences_training[i])
  

  if sentence_pair in results:
    score += 1
  
print("Accuracy:", score/len(selected_sentences_training)*100, "%")

### Finding the correct Wikipedia passage

To wrap up, we want to find the correct Wiki-passage. This is done by locating the retrieved sentence in the larger Wiki-passage and presenting that passage to the user.

In [None]:
def find_article(checked_sentence:str, data_extract):

  for passage in data_extract['chosen_topic_passage']:

    # Join the lines of the passage into one string
    extracted_passage = " ".join(passage)

    # Test for membership; str.find returns a position (or -1), not a flag
    if checked_sentence in extracted_passage:

      return extracted_passage

  # No passage contains the sentence
  return None

In [None]:
print("Sentence found:", query_subset[0][1])

# Here the whole data_extract_train is passed in, so there is a lot of unnecessary searching.
complete_wiki_passge = find_article(checked_sentence=query_subset[0][1], data_extract=data_extract_train)

print("Wiki-passage:", complete_wiki_passge)
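
If many sentences have to be looked up, the repeated scan over all passages could be replaced by a dictionary that is built once. The sketch below assumes that a checked sentence matches one line of a passage exactly; `build_sentence_index` and the toy passages are made up for illustration.

```python
def build_sentence_index(passages):
    """Map every sentence to the full text of the first passage containing it."""
    sentence_to_passage = {}
    for passage_lines in passages:
        full_text = " ".join(passage_lines)
        for sentence in passage_lines:
            # setdefault keeps the first passage if a sentence occurs twice
            sentence_to_passage.setdefault(sentence, full_text)
    return sentence_to_passage


toy_passages = [
    ["Cats are small mammals.", "They are often kept as pets."],
    ["Chess is a board game.", "It is played by two players."],
]
sentence_index = build_sentence_index(toy_passages)
found = sentence_index.get("It is played by two players.")
print(found)
```

A lookup is then a single dictionary access, and `.get` returns `None` for sentences that do not occur in any passage.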

## Doc2Vec Approach

In [None]:
%pip install --upgrade gensim

In [None]:
%pip install numpy pyblas

In [None]:
import gensim
import scipy.linalg
from scipy.linalg import blas

### Preparing the Data

#### Preparing the training set

We first have to decide which data we want to use to train the model, i.e. what goal we are trying to achieve.
As we want to retrieve the correct passage for each turn, we should probably train the model on the given passages and then try to retrieve the chosen passage given a sentence from the dialogue.

In [None]:
import pandas as pd

pd.DataFrame(data[0]["chosen_topic_passage"])

So we want to take all the sentences from each "chosen_topic_passage" and separately use those as the training data.

In [None]:
passages = [[passage for passage in sample['chosen_topic_passage']] for sample in data]
pd.DataFrame(passages[:2])

Now we have a nested list of lists, let's unfold that list in a way that the nested entries of those lists are their own entries.

In [None]:
sentences = []
for i in passages:
  for entry in i:
    sentences.append(entry)
    
pd.DataFrame(sentences[:10])

Let's check our dataset for duplicates.

In [None]:
print(f"Dataset with duplicates: {len(sentences)}")

#Let's turn the list into a dictionary and then back into a list to eliminate duplicates
unique_sentences = list(dict.fromkeys(sentences))
print(f"Cleaned up Dataset: {len(unique_sentences)}")

We reduced our dataset to 6% of the original one by removing the duplicates.
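
The reason for using `dict.fromkeys` rather than `set` is that a dictionary keeps the order in which keys were first inserted, so the deduplicated list stays in the original order. A small sketch with made-up values:

```python
toy_sentences = ["b", "a", "b", "c", "a"]

# Keys keep their first-seen order, so later duplicates are simply dropped.
deduplicated = list(dict.fromkeys(toy_sentences))
kept_share = len(deduplicated) / len(toy_sentences)

print(deduplicated)
print(f"Kept {kept_share:.0%} of the entries")
```

A stable order matters here because the position of a sentence in the list is later used as its document tag.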

#### Preparing the test set

For the test set we need all the sentences written by the wizard that are based on sentences from Wikipedia articles (i.e. the training set), so that we can test the similarity between those sentences and the training set.
This way we want to be able to recover the sentence that was used to craft a response given by the wizard.
We should also store the sentence that was actually used in a dictionary that links each response to its source sentence, so that we can evaluate the model.

Let's take a look at the structure of the dialogue using pandas

In [None]:
import pandas as pd

df_dialog = pd.DataFrame(data[0]['dialog'][:2])
df_dialog

In [None]:
def get_value_from_dict(dictionary):
    # Return the first value of the dictionary (None if it is empty)
    for _, value in dictionary.items():
        return value

In [None]:
#Create dictionary with responses and chosen sentences and list with just responses
response_sentence_pairs = {}
wizard_resps = []

for dialogue in data:
  for entry in dialogue['dialog']:
    if not 'Wizard' in entry['speaker']: #the apprentice doesn't have any responses based on sentences from training set
      continue

    if get_value_from_dict(entry['checked_sentence']) == 'no_passages_used' or get_value_from_dict(entry['checked_sentence']) is None:
      continue

    extracted_text = get_value_from_dict(entry['checked_sentence'])

    response_sentence_pairs.update({entry['text']:extracted_text})
    wizard_resps.append(entry['text'])

Let's check our new dictionary.

In [None]:
dict_items = response_sentence_pairs.items()
print(list(dict_items)[:2])

Check out the list.

In [None]:
wizard_resps[:2]

Great, now we have a list with all the responses given by the wizard and a dictionary linking all the responses to the original source sentences.

Do we have any duplicates?

In [None]:
print(f"Dataset with duplicates: {len(wizard_resps)}")

#Eliminating a few duplicates
unique_resps = list(dict.fromkeys(wizard_resps))
print(f"Cleaned up Dataset: {len(unique_resps)}")

It seems so, but just a few. How come we have around ten times more responses than source sentences?
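
One way to investigate this is to group the responses by the source sentence they were based on and look at the group sizes. The sketch below uses made-up pairs in `toy_pairs`; in the notebook the pairs would come from `response_sentence_pairs.items()`.

```python
from collections import defaultdict

# Made-up (response, source sentence) pairs for illustration.
toy_pairs = [
    ("I love sci-fi films.", "Science fiction is a genre of speculative fiction."),
    ("Sci-fi is speculative fiction.", "Science fiction is a genre of speculative fiction."),
    ("Chess has two players.", "Chess is a two-player board game."),
]

responses_by_source = defaultdict(list)
for response, source in toy_pairs:
    responses_by_source[source].append(response)

for source, responses in responses_by_source.items():
    print(len(responses), "response(s) ->", source)
```

If many responses share one source sentence, that would account for at least part of the difference in size.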

#### Preprocess the Data

Let's define a function for preprocessing our data.

Note:

- Sadly simple_preprocess removes numbers which would be very useful for retrieval of very specific data.
- Also consider taking out stopwords.

In [None]:
def preprocess(data,tokens_only=False):
  for i, line in enumerate(data):
    tokens = gensim.utils.simple_preprocess(line, min_len=2, max_len=20)
    if tokens_only:
      yield tokens
    else:
      # For training data, add tags
      yield gensim.models.doc2vec.TaggedDocument(tokens, [i])
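
If keeping numbers turns out to matter, one option is a custom tokenizer. The sketch below is a hypothetical alternative and not part of gensim: it lowercases the text and keeps runs of letters and digits, with the same length bounds as above. Unlike `simple_preprocess` it handles ASCII characters only.

```python
import re

# Runs of lowercase ASCII letters and digits count as tokens.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize_keep_numbers(text, min_len=2, max_len=20):
    """Lowercase text and return alphanumeric tokens within the length bounds."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


example_tokens = tokenize_keep_numbers("Blue was first used in 1400 BC, a long time ago!")
print(example_tokens)
```

Swapping this in for `gensim.utils.simple_preprocess` inside `preprocess` would keep tokens such as years in the corpus.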

In [None]:
train_corpus = list(preprocess(unique_sentences))
test_corpus = list(preprocess(wizard_resps,tokens_only=True))

Visualization of structure of train_corpus and test_corpus.

In [None]:
pd.DataFrame(train_corpus[:10])

In [None]:
pd.DataFrame(test_corpus[:10])

### Training the model

We instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 80 times.

If the evaluation on the test set is bad, we could try decreasing min_count to 0 so that rare words are not lost.

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=80)

Build a vocabulary

In [None]:
model.build_vocab(train_corpus)

Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the model.wv.get_vecattr() method. For example, to see how many times the word obama appeared in the training corpus:

In [None]:
print(f"Word 'obama' appeared {model.wv.get_vecattr('obama', 'count')} times in the training corpus.")

Train the model on the corpus (Took 1 minute with 80 epochs with cleaned up dataset)


In [None]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

### Model assessment

To assess our new model, we first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity.

Note: *This took 6 minutes to execute with the cleaned up dataset*

In [None]:
ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])

Let’s count how each document ranks with respect to the training corpus

In [None]:
import collections

counter = collections.Counter(ranks)
print(counter)

We look at an example by picking a random document from the corpus and inferring a vector from the model.

In [None]:
import random

doc_id = random.randint(0, len(train_corpus) - 1)
inferred_vector = model.infer_vector(train_corpus[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)

for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0. However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different), and the reasoning becomes obvious when we examine the text itself.


We can run the next cell repeatedly to see a sampling of other target-document comparisons.

In [None]:
import random

doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
sim_id = second_ranks[doc_id]

print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

This doesn't look good. The sentences are probably too short for this approach to work well. Omitting the numbers also causes a loss of information.

### Testing on single examples

In [None]:
import random

doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))


# Compare and print the 10 most similar documents from the train corpus
print("Test Document ({}): «{}»\n".format(doc_id, ' '.join(test_corpus[doc_id])))

for index in range(10):
    print(f"{index+1}. {sims[index]}: «{' '.join(train_corpus[sims[index][0]].words)}»")
print("\n\n")

#Similarity score of original sentence
tokenized_sentence = gensim.utils.simple_preprocess(response_sentence_pairs[wizard_resps[doc_id]], min_len=2, max_len=20)
for index in range(len(sims)):
  if train_corpus[sims[index][0]].words == tokenized_sentence:
    print(f"Similarity of original sentence: \n{index+1}. {sims[index]}: «{' '.join(train_corpus[sims[index][0]].words)}»")

print("\n\n")
print(f"Untokenized Wizard response: {wizard_resps[doc_id]}")
print(f"Original source sentence: {response_sentence_pairs[wizard_resps[doc_id]]}")

### Evaluating model performance on subset of test data

We evaluate how often the original sentence is among the top 10 and top 20 most similar sentences, using 20% of the test data (takes 2 minutes).

In [None]:
#Create test subset
test_subset_corpus = test_corpus

#counter that keeps track how often the right source sentence was in the top 10
counter_10 = 0
#counter that keeps track how often the right source sentence was in the top 20
counter_20 = 0

test_size = 0.2

for i in range(int(len(test_corpus)*test_size)):
  doc_id = i

  inferred_vector = model.infer_vector(test_subset_corpus[doc_id])

  sims = model.dv.most_similar([inferred_vector], topn=20)

  tokenized_sentence = gensim.utils.simple_preprocess(response_sentence_pairs[wizard_resps[doc_id]], min_len=2, max_len=20)
  
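  for index in range(len(sims)):
    if train_corpus[sims[index][0]].words == tokenized_sentence:
      counter_20 += 1
      # index is 0-based, so the top 10 are the positions 0 to 9
      if index < 10:
        counter_10 += 1
      # Count each test sample at most once
      break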

print(f"Number of test samples: {int(len(test_corpus)*test_size)}")

print(f"Number of times sentence was in Top 10: {counter_10}")
print(f"Number of times sentence was in Top 20: {counter_20}")

print(f"Accuracy for Top 10: {100*counter_10/int(len(test_corpus)*test_size)}%")
print(f"Accuracy for Top 20: {100*counter_20/int(len(test_corpus)*test_size)}%")
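
Because list positions start at 0, "in the top 10" means a position below 10. The counting can be separated from the retrieval loop with a small helper, sketched here with made-up positions; `hits_at_k` and `toy_ranks` are illustrative names only.

```python
def hits_at_k(ranks, k):
    """Count how many 0-based ranks fall inside the top k (rank < k)."""
    return sum(1 for rank in ranks if rank is not None and rank < k)


# Made-up 0-based positions of the correct sentence; None means not retrieved.
toy_ranks = [0, 3, 9, 10, 19, None]

top_10 = hits_at_k(toy_ranks, 10)
top_20 = hits_at_k(toy_ranks, 20)
print(f"Top 10: {top_10}/{len(toy_ranks)}, Top 20: {top_20}/{len(toy_ranks)}")
```

Storing one rank per test sample also makes it easy to report further cut-offs later without running the retrieval again.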