# HW5: Building and evaluating a part-of-speech tagger

In this assignment, we are interested in what factors affect how well a trained part-of-speech tagger will do on unseen data. The homework will assess your ability to manipulate data (the output of neural language models) and your ability to discuss the results from each of the classifiers that you train. You will train four classifiers that will focus on different ways of looking at the data.

We specifically want you to assess the performance of a deep neural language model ([RoBERTa; Liu et al., 2019](https://arxiv.org/abs/1907.11692)) that is readily available in the `huggingface` package. We would like you to compare and contrast the performance of classifiers trained on representations obtained from a lower layer of RoBERTa with classifiers trained on a higher layer of RoBERTa. This will allow you to see how different parts of a neural model's architecture do or do not encode the same kind of information.

We also want to see the effect of how close the training data is to the test data. This is like a real-world scenario, where we often train on the recent past to predict the present. So, we want to get the right part-of-speech tags for this year's abstracts (2021) either from classifiers trained on abstract part-of-speech tags from the immediately preceding year (2020) or any year prior to 2020. Think about how language and science can change while you are building these classifiers. What does it mean if 2020 and pre-2020 influence performance on tagging the 2021 dataset?

### Before you start: What are part-of-speech tags?

Part-of-speech tags are labels we assign to words depending on what kind of syntactic role they play in a sentence. While we have not studied part-of-speech tags yet in class, they have come up when we have talked about nouns, verbs, modifiers, etc. Building a good classifier that can do part-of-speech tagging can help us better understand things like the syntactic structure of a sentence. In order to understand the meaning of a chunk of language, we need to know what kind of "role" each word is playing.

There are many resources for learning more about part-of-speech tags. For this assignment, we will work with the most basic set of categories: "Universal" labels. These categories are designed to work for as many languages as possible. We are trying to predict the following categories in context to the best of our ability:

    VERB - verbs (all tenses and modes)
    NOUN - nouns (common and proper)
    PRON - pronouns
    ADJ - adjectives
    ADV - adverbs
    ADP - adpositions (prepositions and postpositions)
    CONJ - conjunctions
    DET - determiners
    NUM - cardinal numbers
    PRT - particles or other function words
    X - other: foreign words, typos, abbreviations
    . - punctuation

So, our classifiers will try to learn what makes something an "adjective", what makes something a "noun", and so on.

## Warning: This assignment will probably take a long time!!
## Lots of moving parts and many computations are very slow.
## Please heed the advice below:

* ### We recommend that you prototype on very small subsets of the data (e.g., `train_2020_only[0:5]` and `test_2021[0:5]`)
* ### Only once you are ready to submit your assignment and start writing up your results should you run through the whole dataset.
* ### Running RoBERTa and training your classifier can easily take half an hour or more, depending on the efficiency of your implementation. Once you have finished prototyping, budget a full 3-4 hours for the complete run, just in case.

## DO NOT wait until the last minute to start. It will only lead to avoidable suffering.

# Q1: Installing prerequisites (2 points)

In order to do this assignment, you need to install the `transformers` package from `huggingface`. Do that in the cell below.

In [15]:
!pip3 install transformers



# Q2: Imports (1 point)

Put all of the imports you will use here. Also include the neural language model and tokenizer that you will use. We will be using the RoBERTa models; for examples, refer to lecture notebooks. Keep in the `model.eval()` code below.

In [16]:
# your imports go here
from google.colab import drive, files
import json
from transformers import RobertaModel, RobertaTokenizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix

model = RobertaModel.from_pretrained('roberta-base', output_hidden_states=True)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

model.eval()

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Drop


# Q3: Data preprocessing (6 points)

## Q3A: Loading in the three datasets (3 points)

For this assignment, we are going to use best practices in machine learning and keep our training data and our test data separate. We have two training datasets for you, which we described above. The 2020-only dataset is called `train_2020-only.json` and the pre-2020 dataset is called `train_pre-2020.json`. Each line corresponds to one `json` object. Load in each of these training datasets as `train_2020_only` and `train_pre2020` respectively using our familiar friend `json.loads`, reading in the data line by line.

Our test data is stored in the file `test_2021.json`. It is structured exactly the same way as the training files, but when you load it in, name it `test_2021`.

All of the datasets are stored in the `data/` subdirectory.
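
The line-by-line loading described above can be sketched as a small helper. The function name `load_json_lines` and the demo file are hypothetical and not part of the assignment; the sketch assumes one JSON value per line and skips blank lines.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Read a file with one JSON value per line and return them as a list."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. an empty line at the end
                entries.append(json.loads(line))
    return entries


# Demo on a throwaway file (invented content in the same shape as the data).
demo_sentences = [
    [["Dialogue", "NOUN"], ["Act", "NOUN"]],
    [["Unfortunately", "ADV"], [",", "."]],
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for sentence in demo_sentences:
        tmp.write(json.dumps(sentence) + "\n")
    tmp.write("\n")  # a blank line that the loader should ignore
    demo_path = tmp.name

loaded = load_json_lines(demo_path)
os.remove(demo_path)
print(len(loaded), loaded[0][0])
```

In the notebook, the same helper could be called once per file, for instance on the path to `train_2020-only.json` inside the `data/` subdirectory.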

In [17]:
drive.mount("/content/drive/", force_remount=True)

with open("/content/drive/MyDrive/Fall 2021 Computational Linguistics Notebooks/Archive/data/train_2020-only.json") as train_data:
    train_2020_only = []
    for x in train_data:
        # skip blank lines; every other line holds one JSON object
        if x.strip():
            train_2020_only.append(json.loads(x))

with open("/content/drive/MyDrive/Fall 2021 Computational Linguistics Notebooks/Archive/data/train_pre-2020.json") as pre_train:
    train_pre2020 = []
    for x in pre_train:
        if x.strip():
            train_pre2020.append(json.loads(x))

with open("/content/drive/MyDrive/Fall 2021 Computational Linguistics Notebooks/Archive/data/test_2021.json") as test_data:
    test_2021 = []
    for x in test_data:
        if x.strip():
            test_2021.append(json.loads(x))

Mounted at /content/drive/


## Q3B: Preview data (1 point)

Print out the first two entries of `train_2020_only`.

In [18]:
train_2020_only[0:2]

[[['Dialogue', 'NOUN'],
  ['Act', 'NOUN'],
  ['(', '.'],
  ['DA', 'NOUN'],
  [')', '.'],
  ['tagging', 'NOUN'],
  ['is', 'VERB'],
  ['crucial', 'ADJ'],
  ['for', 'ADP'],
  ['spoken', 'ADJ'],
  ['language', 'NOUN'],
  ['understanding', 'VERB'],
  ['systems', 'NOUN'],
  [',', '.'],
  ['as', 'ADP'],
  ['it', 'PRON'],
  ['provides', 'VERB'],
  ['a', 'DET'],
  ['general', 'ADJ'],
  ['representation', 'NOUN'],
  ['of', 'ADP'],
  ['speakers', 'NOUN'],
  ['{', '.'],
  ["'", 'PRT'],
  ['}', '.'],
  ['intents', 'NOUN'],
  [',', '.'],
  ['not', 'ADV'],
  ['bound', 'VERB'],
  ['to', 'PRT'],
  ['a', 'DET'],
  ['particular', 'ADJ'],
  ['dialogue', 'NOUN'],
  ['system', 'NOUN'],
  ['.', '.']],
 [['Unfortunately', 'ADV'],
  [',', '.'],
  ['publicly', 'ADV'],
  ['available', 'ADJ'],
  ['data', 'NOUN'],
  ['sets', 'NOUN'],
  ['with', 'ADP'],
  ['DA', 'NOUN'],
  ['annotation', 'NOUN'],
  ['are', 'VERB'],
  ['all', 'DET'],
  ['based', 'VERB'],
  ['on', 'ADP'],
  ['different', 'ADJ'],
  ['annotation', 'NOU


## Q3C: What are each of the lines? (2 points)

What kind of data structure is it? What are the elements?

Each dataset is a list of sentences. Each sentence is itself a list of two-element lists, one per word: the first element is the word and the second element is that word's part-of-speech tag.
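
To make the nesting concrete, the sketch below unpacks one sentence into parallel lists of words and tags. The variable `sentence` is a shortened copy of the first entry printed above, and the names `words` and `tags` are illustrative.

```python
# A shortened copy of the first sentence printed above:
# each sentence is a list of [word, tag] pairs.
sentence = [
    ["Dialogue", "NOUN"],
    ["Act", "NOUN"],
    ["(", "."],
    ["DA", "NOUN"],
    [")", "."],
    ["tagging", "NOUN"],
    ["is", "VERB"],
    ["crucial", "ADJ"],
]

# Split the pairs into two parallel lists.
words = [pair[0] for pair in sentence]
tags = [pair[1] for pair in sentence]

for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```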

# Q4: Creating embeddings for each utterance (25 points) for your training data and producing four models

For hints, check out the `natural_language_inference.ipynb` functions.

Use the function below to take a single sentence and turn it into embeddings that we can use for our classifiers. The function automatically keeps only the first word piece of each word, so you do not have to worry about how RoBERTa handles word pieces.

Pay attention to the `# note` in the code below for a clue to a later question.

In [19]:
def embed_words_roberta(single_data_entry, model, tokenizer):
  words_only = [x[0] for x in single_data_entry] # note
  tokenized = tokenizer(words_only, return_tensors='pt',
                        is_split_into_words=True)
  embedded = model(**tokenized)
  embeddings = embedded['hidden_states']
  token_strings = tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0].tolist())
  dimensions_to_keep = [i for i, x in enumerate(token_strings)
                        if x.startswith("Ġ") or i==1]
  subsetted_embeddings = [x[:, dimensions_to_keep].detach().numpy()
                          for x in embeddings]
  return subsetted_embeddings
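
The filtering step inside `embed_words_roberta` can be illustrated without loading the model. The token strings below are hypothetical, chosen only to show the pattern; real output from the RoBERTa tokenizer may split words differently. The sketch keeps the positions of tokens that begin a word (marked with `Ġ`), plus position 1, mirroring the condition in the function above.

```python
# Hypothetical token strings for illustration; not actual tokenizer output.
token_strings = ["<s>", "ĠDialogue", "ĠAct", "Ġtag", "ging", "Ġis", "Ġcrucial", "</s>"]

# Keep a position if its token starts a new word ("Ġ" prefix) or if it is
# position 1, the first token after the <s> start marker.
positions_to_keep = [
    i for i, token in enumerate(token_strings)
    if token.startswith("Ġ") or i == 1
]

kept_tokens = [token_strings[i] for i in positions_to_keep]
print(positions_to_keep)
print(kept_tokens)
```

One position survives per word, so the number of kept positions should match the number of words in the sentence.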

## Q4A: Extract the embeddings at a specific layer (2 points)

Please print out the 7th layer from the output of `embed_words_roberta(train_2020_only[0])`.

Then print out the 3rd layer.

Remember Python indexing.
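
As a reminder of how the layer index works, the sketch below uses placeholder arrays in place of real model output. It assumes the list returned by `embed_words_roberta` has 13 entries for `roberta-base` (the embedding output followed by the 12 encoder layers), each of shape `(1, n_words, 768)`; `n_words` and the array contents are made up for illustration.

```python
import numpy as np

n_words = 5  # hypothetical sentence length

# Placeholder for the function's return value: one array per entry,
# filled with the entry's own index so each one is easy to recognise.
fake_hidden_states = [np.full((1, n_words, 768), float(i)) for i in range(13)]

seventh = fake_hidden_states[6]  # 7th entry: Python counts from 0
third = fake_hidden_states[2]    # 3rd entry

print(seventh.shape, seventh[0, 0, 0])
print(third.shape, third[0, 0, 0])
```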

In [20]:
# compute the embeddings once, then index into the list of layers
layer_embeddings = embed_words_roberta(train_2020_only[0], model, tokenizer)
print(layer_embeddings[6])  # 7th layer
print(layer_embeddings[2])  # 3rd layer

[[[ 0.14016572 -0.11448487 -0.011288   ...  0.22353216 -0.01660346
   -0.11273213]
  [ 0.4115105   0.99736625 -0.31471416 ...  0.20526893  0.00814523
   -0.41467914]
  [-0.03960834 -0.93892    -0.17954792 ...  0.2429632  -0.00834235
   -0.452601  ]
  ...
  [ 0.35614103 -0.38965848 -0.27384546 ... -0.02938731 -0.15403344
    0.07221778]
  [ 0.02199963 -0.27218857  0.13880713 ... -0.04169824  0.26157796
    0.09182303]
  [ 0.17606992 -1.1034262  -0.12621473 ... -0.2525679   0.11628371
    0.2366817 ]]]
[[[ 9.3297042e-02 -2.7078649e-01 -1.0472690e-02 ...  1.2782261e-01
   -1.6816127e-01  9.0169740e-01]
  [ 3.3740988e-01  4.2805174e-01 -4.9100149e-01 ...  2.8090021e-01
    2.3827232e-01  4.1614282e-01]
  [ 8.2606383e-02 -1.1995723e+00  1.2631418e-02 ...  3.7670016e-01
   -1.9059345e-01  4.3387216e-01]
  ...
  [ 1.7335072e-02 -8.9443630e-01 -3.9580902e-01 ... -1.1768020e-01
   -1.0037434e-01  1.4096325e-02]
  [ 6.3420452e-02 -2.1349691e-01  1.3322832e-01 ...  3.3628538e-01
    9.3517121e-04

## Q4B: Complete the function `process_training_dataset` (7 points)

Your code below should be able to take a given dataset that you loaded in above and produce word embeddings for each word. Then, these word embeddings will be used to train a classifier. All you need to do is make sure your Xs and ys are shaped right and you should be good to go.

The function `process_training_dataset` critically must take in:

* A dataset (e.g., any of the above)
* A neural language model
* A tokenizer that the neural language model can work with
* A specific layer number that we subset to when building our training data

And it will return:
* A trained classifier that can assign part-of-speech tags given word embeddings

**Note**: Pay attention to your answer in the previous question so you can better extract the right word embeddings for all of your classifiers!
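
One way to check that the Xs and ys line up is to stack the per-sentence embedding matrices and compare the row count with the number of tags. The sketch below uses random arrays as stand-ins for real embeddings; the sentence lengths and tags are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two sentences' embeddings at one layer: (n_words, 768) each.
sentence_embeddings = [rng.normal(size=(3, 768)), rng.normal(size=(2, 768))]

# One tag per word, flattened across sentences in the same order.
sentence_tags = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
ys = [tag for tags in sentence_tags for tag in tags]

# Stack the sentence matrices row-wise into one (total_words, 768) matrix.
Xs = np.vstack(sentence_embeddings)

print(Xs.shape, len(ys))
assert Xs.shape[0] == len(ys), "each embedding row needs exactly one tag"
```

If the row count and the number of tags differ, the classifier cannot be fit, so a check like this is a cheap sanity test while prototyping on a small subset.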

In [21]:
def process_training_dataset(dataset, neural_model, neural_tokenizer, layer_number):
    # define your Xs and ys (embeddings and POS tags)
    Xs, ys = [], []
    # loop over every sentence in the dataset
    for records in dataset:
        # extract POS tags for that sentence and combine ys with them
        ys.extend(record[1] for record in records)
        # extract embeddings for that sentence at a specific layer
        embeddings = embed_words_roberta(records, neural_model, neural_tokenizer)[layer_number]
        # drop the batch dimension: (1, n_words, 768) -> (n_words, 768)
        Xs.append(embeddings[0])

    # stack all sentences into one (total_words, 768) matrix
    Xs = np.vstack(Xs)
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(Xs, ys)
    return classifier

**NOTE: Because of runtime constraints, I used only the first 1000 entries from each of the two training JSON files.**

## Q4C: Train your model @ layer 0 on the 2020-only data (4 points)

In [29]:
train_2020_layer_0 = process_training_dataset(train_2020_only[0:1000], model, tokenizer, 0)
print(train_2020_layer_0)

LogisticRegression(max_iter=1000)


## Q4D: Model @ layer 0, on pre-2020 data (4 points)

In [23]:
train_pre2020_layer_0 = process_training_dataset(train_pre2020[0:1000], model, tokenizer, 0)
print(train_pre2020_layer_0)

LogisticRegression(max_iter=1000)


## Q4E: Model @ layer 10, on 2020-only data (4 points)

In [24]:
train_2020_layer_10 = process_training_dataset(train_2020_only[0:1000], model, tokenizer, 10)
print(train_2020_layer_10)

LogisticRegression(max_iter=1000)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




## Q4F: Model @ layer 10, on pre-2020 data (4 points)

In [25]:
train_pre2020_layer_10 = process_training_dataset(train_pre2020[0:1000], model, tokenizer, 10)
print(train_pre2020_layer_10)

LogisticRegression(max_iter=1000)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




# Q5: Evaluate and compare all models (18 points)

For this question, we would like you to compare each of the models to each other along several dimensions. The four models cross the layer within the model (0 or 10) and whether the model is trained on older or newer data (pre-2020 or 2020 data). In order to compare the models, you need to get each of your models to generate **predicted** labels for each of the test items. First, you will need to **construct your test dataset** and then evaluate each model along the following dimensions using the following functions:

* Precision (`sklearn.metrics.precision_score`)
* Recall (`sklearn.metrics.recall_score`)
* F1 (`sklearn.metrics.f1_score`)

Then, you will be asked to report the performance of each of these four models in a table. You can find examples of this in some of the previous lecture notebooks (e.g., the natural language inference and evaluation lecture notebooks).
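
For multi-class labels such as part-of-speech tags, these `sklearn` functions need an averaging strategy. The sketch below computes macro-averaged precision, recall, and F1 by hand on a tiny invented example, to show what that kind of averaging does: each class is scored separately and the per-class scores are averaged with equal weight. The helper name `macro_scores` and the toy labels are hypothetical, and a class with no predictions or no true instances simply scores 0 in this sketch.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over the labels present."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy gold and predicted tags (invented for illustration).
y_true = ["NOUN", "NOUN", "VERB", "VERB", "ADJ", "ADJ"]
y_pred = ["NOUN", "VERB", "VERB", "VERB", "ADJ", "NOUN"]

precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because every class counts equally under macro averaging, a rare tag that is often mislabeled can pull the score down noticeably even when most words are tagged correctly.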

## Q5A: Test data processing (3 points)

To assess how well our models above perform, we also have to process our test data. The way `scikit-learn` produces predictions is very simple. When we _train_ a `LogisticRegression` model, we give the model as input some set of $X$ values (e.g., a matrix of word embeddings). The model we train tries to optimize the fit between $X$ and the $y$ values we give -- such as the labels associated with part-of-speech tags. Getting _predictions_ from our trained models is simply a matter of giving them new $X$ values -- from our test dataset.

In order for this to work, we also have to process our test data to conform to the same structure as our training data. So, for this question, we would like you to make a function `process_test_dataset` that is just like `process_training_dataset`, except that there is no need to train a model at the end. Instead, the function _only_ needs to return `Xs` (a matrix containing word embeddings) and `ys` (part-of-speech tags). The stub of what you need to do is below.

In [26]:
def process_test_dataset(layer_number):
    Xs, ys = [], []
    # the test version of process_training_dataset: same steps, no classifier
    for records in test_2021:
        # extract POS tags for that sentence and combine ys with them
        ys.extend(record[1] for record in records)
        # extract embeddings for that sentence at a specific layer
        embeddings = embed_words_roberta(records, model, tokenizer)[layer_number]
        # drop the batch dimension: (1, n_words, 768) -> (n_words, 768)
        Xs.append(embeddings[0])

    Xs = np.vstack(Xs)
    return Xs, ys

## Q5B: Score all four models (12 points)

Loop through each of your four models (the output of the last four notebook cells in Q4) and print the precision, recall, and F1 scores.

In [27]:
test_X_0, test_y_0 = process_test_dataset(0)
test_X_10, test_y_10 = process_test_dataset(10)

# pair each classifier with the test data from the layer it was trained on
models = [
    (train_2020_layer_0, test_X_0, test_y_0),
    (train_pre2020_layer_0, test_X_0, test_y_0),
    (train_2020_layer_10, test_X_10, test_y_10),
    (train_pre2020_layer_10, test_X_10, test_y_10),
]

# use precision_score, recall_score, f1_score (macro-averaged over the tags)
for classifier, test_X, test_y in models:
    predictions = classifier.predict(test_X)  # predict once, reuse for all scores
    print(precision_score(test_y, predictions, average='macro'))
    print(recall_score(test_y, predictions, average='macro'))
    print(f1_score(test_y, predictions, average='macro'))

0.8861238524782241
0.8734297379642223
0.8779961787711147
0.9043365410857915
0.8712839775343517
0.876164333102519
0.9117594565613115
0.8934593470549509
0.9010359861706775
0.9226297220235957
0.8956899743416384
0.90478276877555


## Q5C: Free response (3 points)

Using the outputs above, fill out a table showing performance across each of the 4 models, along all 3 measures. Describe in words how the models differ in their performance. Are there any patterns you notice that determine model performance? Were any of the results surprising to you? Why or why not? If the differences are small, can you think of a reason why we might not trust these results?

<table style="width:100%">
  <tr>
    <th>Models</th>
    <th>Precision</th>
    <th>Recall</th>
    <th>F1 Score</th>
  </tr>
  <tr>
    <td>train_2020_layer_0</td>
    <td>0.8861238524782241</td>
    <td>0.8734297379642223</td>
    <td>0.8779961787711147</td>
  </tr>
  <tr>
    <td>train_pre2020_layer_0</td>
    <td>0.9043365410857915</td>
    <td>0.8712839775343517</td>
    <td>0.876164333102519</td>
  </tr>
  <tr>
    <td>train_2020_layer_10</td>
    <td>0.9117594565613115</td>
    <td>0.8934593470549509</td>
    <td>0.9010359861706775</td>
  </tr>
  <tr>
    <td>train_pre2020_layer_10</td>
    <td>0.9226297220235957</td>
    <td>0.8956899743416384</td>
    <td>0.90478276877555</td>
  </tr>
</table>

Precision is higher than both recall and F1 in all four models. Layer 10 outperforms layer 0 on every measure, which suggests that the higher layer encodes more of the information needed for part-of-speech tagging; with only two layers compared, however, we cannot conclude that scores keep improving as the layer number increases. In both layers, the classifier trained on `train_pre2020` has higher precision than the one trained on `train_2020_only`. For recall and F1, the 2020-only classifier is slightly higher at layer 0, while the pre-2020 classifier is slightly higher at layer 10. I found this surprising: the pre-2020 classifier has the better precision in both layers, yet it falls behind on recall and F1 at layer 0. The differences between the models are small, and I find the results hard to trust: every score is above 0.85 even though each classifier was trained on only a small amount of data, so I am not confident that the classifiers were trained properly.

# Bonus: Error analysis (6 points; 3 for code, 3 for free response)

*   Take the best-performing model
*   Construct a confusion matrix in any way that you would like, comparing the output of the best model on your test set and the true labels.
*   What categories are most confusable? What linguistic reasons might that be the case?
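
One way to approach the confusability question is to count (true, predicted) pairs and rank the ones that disagree. The sketch below does this in plain Python on invented labels; the helper name `most_confused` is hypothetical, and in the assignment the inputs would be the test labels and the best model's predictions.

```python
from collections import Counter


def most_confused(y_true, y_pred, top_n=3):
    """Return the top_n most frequent (true, predicted) pairs that disagree."""
    errors = Counter(
        (true, pred) for true, pred in zip(y_true, y_pred) if true != pred
    )
    return errors.most_common(top_n)


# Toy labels (invented for illustration).
y_true = ["NOUN", "NOUN", "NOUN", "ADJ", "ADJ", "VERB", "VERB", "NOUN"]
y_pred = ["NOUN", "ADJ", "ADJ", "NOUN", "ADJ", "NOUN", "VERB", "VERB"]

for (true, pred), count in most_confused(y_true, y_pred):
    print(f"true={true:<5} predicted={pred:<5} count={count}")
```

The same information sits in the off-diagonal cells of a confusion matrix; listing the pairs by name just avoids having to map row and column positions back to tags.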

In [28]:
confusion_matrix(test_y_10, train_pre2020_layer_10.predict(np.vstack(test_X_10)))

array([[18258,     7,     0,     0,     1,     0,    12,     4,     0,
           18,     2,     0],
       [    1, 14975,    14,   294,     4,    30,  1968,    42,     2,
            0,   818,     1],
       [    1,    32, 17059,    67,     5,   111,    32,     0,     0,
            1,    54,     2],
       [    5,   261,    58,  4853,     3,    14,   128,     0,     7,
            2,    86,     0],
       [    0,     3,     6,    24,  4627,    12,     1,     0,     1,
            0,     1,     3],
       [    0,    81,    80,    17,     0, 13948,    18,     8,    15,
            1,     2,     0],
       [   18,  1964,    28,   107,     1,    29, 43327,    31,    13,
            2,  1119,    16],
       [    1,    35,     0,     4,     0,    14,    20,  2199,     0,
            0,     1,     0],
       [    0,    23,     3,     9,     0,    36,    13,     1,  4636,
            0,     2,     0],
       [    9,     3,    41,     1,     0,     0,    11,     0,     0,
         3685,    13


<font color="red">Your bonus question answer goes here.</font>