# Lab3.3 Part-of-Speech tagger as Token in Sequence classification

In Lab3.2, you represented a text as a Bag-of-Words or BoW. In a BoW, the order of the words does not matter. This may be fine for text classification tasks such as emotion detection and topic classification, in which the text as a whole can be associated with an interpretation. However, for many other NLP tasks, words or phrases need to be interpreted in the order in which they occur. For this we need to represent a text as a **sequence** of words and we need to classify words or phrases as being part of such a sequence.

The task of assigning an interpretation to a word or phrase in a sequence is called **Token in Sequence** or **TiS** classification. Typical **TiS** classification tasks are: Part-of-Speech tagging, Named Entity Recognition, Syntactic Structure and Dependencies, Event Recognition and Semantic Role Labelling.

Whereas in the case of a Text Classification task, we represent the complete text as a feature vector such as a BoW, in the case of TiS we need to represent each token in a sequence using a feature vector.

Assume we use a feature vector with four dimensions for the presence of the words "the", "chicken", "produced" and "egg", in this order. The difference between a very simple BoW and a TiS representation for the following two sentences would be as follows:

```
the chicken produced the egg
    BoW = [1, 1, 1, 1]
    TiS = [1, 0, 0, 0][0, 1, 0, 0][0, 0, 1, 0][1, 0, 0, 0][0, 0, 0, 1]
the egg produced the chicken
    BoW = [1, 1, 1, 1]
    TiS = [1, 0, 0, 0][0, 0, 0, 1][0, 0, 1, 0][1, 0, 0, 0][0, 1, 0, 0]
```

The BoW representation of the two sentences is the same, but the TiS representations are different. Here the TiS representation only captures the word itself as a so-called one-hot vector. In reality, we package a lot of features in the representation of a token, among which the position in the sentence, word shape, suffixes and prefixes, and previous and following words.
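The contrast between the two representations can be reproduced with a few lines of plain Python. This is only a minimal sketch for the four-word toy vocabulary; the helper names `one_hot`, `bag_of_words` and `token_in_sequence` are made up for this illustration.

```python
# Toy vocabulary from the example; the order fixes the vector dimensions.
VOCAB = ["the", "chicken", "produced", "egg"]

def one_hot(word, vocab=VOCAB):
    """Vector with a 1 at the position of `word` in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocab]

def bag_of_words(tokens, vocab=VOCAB):
    """One vector for the whole text: 1 if the word occurs anywhere."""
    return [1 if entry in tokens else 0 for entry in vocab]

def token_in_sequence(tokens, vocab=VOCAB):
    """One vector per token, in the order in which the tokens occur."""
    return [one_hot(token, vocab) for token in tokens]

sentence_1 = "the chicken produced the egg".split()
sentence_2 = "the egg produced the chicken".split()

print(bag_of_words(sentence_1) == bag_of_words(sentence_2))            # True
print(token_in_sequence(sentence_1) == token_in_sequence(sentence_2))  # False
```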

In this notebook, we are going to demonstrate how you can train a so-called **Conditional Random Field** or **CRF** classifier to predict the part-of-speech of words in a sequence. For this we are going to represent the training data as TiS and feed this to a CRF classifier. 

## 1. Conditional Random Fields or CRF

**CRF** is a discriminative classifier that models the probability of a sequence of states (labels) given a sequence of observations, taking into account the dependencies between neighbouring states. In this case, it estimates the probability that a word observed in a context belongs to a specific PoS. At prediction time, it selects the best sequence of states given the current observations and the learned weights.

In the next image, X1 to X4 represent a sequence of input tokens or words. The goal is to predict Y1 to Y4 as labels given the sequence of input tokens. For each X there is a potential $\Phi$ (an unnormalised score that plays the role of a probability) for the token to predict a label Y. We also see that there is another potential $\Phi$ for a label Y being followed by another Y. In a sequence classification task these label-to-label dependencies are taken into account as well. The algorithm progresses through the sequence to choose the optimal set of corresponding labels.

<div>
<img src="images/CRF.png" width="400"/>
</div>

Further mathematical details can be found in: https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776
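To make the role of the two kinds of $\Phi$ more concrete, here is a minimal sketch of how a best label sequence can be found from token-to-label scores and label-to-label scores (a Viterbi-style search). The scores are invented toy numbers, not learned weights, and the function `best_label_sequence` is a hypothetical helper, not part of any CRF library.

```python
def best_label_sequence(emissions, transitions, labels):
    """Viterbi-style search over unnormalised scores.

    emissions:   one dict per token, mapping label -> score for that token
    transitions: dict mapping (previous_label, label) -> score
    Missing entries count as 0.0.
    """
    # best[label] = (score of the best path ending in label, that path)
    best = {label: (emissions[0].get(label, 0.0), [label]) for label in labels}
    for token_scores in emissions[1:]:
        step = {}
        for label in labels:
            # Pick the previous label that gives the highest score so far
            score, path = max(
                (best[prev][0] + transitions.get((prev, label), 0.0), best[prev][1])
                for prev in labels
            )
            step[label] = (score + token_scores.get(label, 0.0), path + [label])
        best = step
    return max(best.values())[1]

labels = ["DET", "NOUN", "VERB"]
# Hypothetical scores for the two tokens "the flies"
emissions = [{"DET": 2.0}, {"NOUN": 1.0, "VERB": 1.2}]
transitions = {("DET", "NOUN"): 1.0, ("DET", "VERB"): -1.0}

print(best_label_sequence(emissions, {}, labels))           # ['DET', 'VERB']
print(best_label_sequence(emissions, transitions, labels))  # ['DET', 'NOUN']
```

Without transition scores, "flies" gets the label with the highest token score. Once the toy transition scores are added, the preference for a noun after a determiner changes the outcome.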

## 2. Creating a token in sequence representation of a sentence

Token-in-Sequence or TiS representations are typically created for sentences and not for long documents. The sequential relations between words are stronger in a sentence with some grammatical structure. Long-distance (beyond the sentence boundary) dependencies between words are more difficult to capture.

In order to make a prediction for each word in a sequence, you need to know a lot about the word and about the sequence in which it occurs. Furthermore, you want the model to generalise over the word itself. For example, the sequence ```determiner - adjective``` is very likely to be followed by ```noun``` in English, regardless of the words. Whether something could be an adjective or a noun can also be learned from the beginning (prefix) and ending (suffix) of a word. In the same vein, the use of capitals, digits, or the position of the word in the sentence can play a role.

For TiS classification tasks, very rich feature bundles are used to describe each word or token in a sequence. Below is a very simple function that takes a sentence as a list of tokens and returns a feature bundle for the n-th word in the sentence. Take a minute to study the next function. Note that the ```re``` module is used for regular expressions, which are small grammars that define patterns of characters in strings.

In [3]:
#Regex module for checking alphanumeric values.
import re

### Function that takes a list of words (presumably a sentence) and an integer (index) pointing to the nth word of the list as input.
### Returns a Python dictionary with features and values based on the properties of the word
def extract_features(sentence: list, index: int):
  return {
      'word':sentence[index], ## the word itself is a feature, where index points to the nth word in the sentence
      'is_first':index==0, ## True if it is the first word in a sentence otherwise False
      'is_last':index ==len(sentence)-1, ## True if it is the last word in a sentence, otherwise False
      'is_capitalized':sentence[index][0].upper() == sentence[index][0], ## True if the first character equals its upper-case form (also True for digits and punctuation), otherwise False
      'is_all_caps': sentence[index].upper() == sentence[index], ## True if the word equals its upper-case form (also True for digits and punctuation), otherwise False
      'is_all_lower': sentence[index].lower() == sentence[index], ## True if the word equals its lower-case form (also True for digits and punctuation), otherwise False
      'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence[index])))), ## 1 if the word contains a letter and ends in a digit, otherwise 0
      'prefix-1':sentence[index][0], ## first character of a word
      'prefix-2':sentence[index][:2], ## first two chars of a word
      'prefix-3':sentence[index][:3], ## first three chars of a word
      'suffix-1':sentence[index][-1], ## last character of a word
      'suffix-2':sentence[index][-2:], ## last two chars of a word
      'suffix-3':sentence[index][-3:], ## last three chars of a word
      'prev_word':'' if index == 0 else sentence[index-1], ## previous word if any
      'next_word':'' if index == len(sentence)-1 else sentence[index+1], ## next word if any
      'has_hyphen': '-' in sentence[index], ## True if it has a hyphen, otherwise False
      'is_numeric': sentence[index].isdigit(), ## only digits
      'capitals_inside': sentence[index][1:].lower() != sentence[index][1:] ## True if any capitals inside the word, otherwise False
  }
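A quick way to understand a regular expression is to try it on a few strings. The sketch below does this for the pattern used for ```is_alphanumeric```. As written, the pattern appears to match only words that contain a letter and end in a digit. The second pattern, `mixed_pattern`, is a hypothetical alternative that accepts letters and digits in any order; using it instead would change the features and possibly the scores reported in this notebook.

```python
import re

# Pattern copied from extract_features
pattern = '^(?=.*[0-9]$)(?=.*[a-zA-Z])'
# Hypothetical alternative (not used in this notebook): a letter and a digit anywhere
mixed_pattern = '^(?=.*[0-9])(?=.*[a-zA-Z])'

for word in ['B52', 'A4', '4A', 'banana', '61']:
    print(word, bool(re.match(pattern, word)), bool(re.match(mixed_pattern, word)))
```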

By tokenizing a sentence using ```NLTK```, we can turn a sentence into a list of words and pass these one by one to the function through their position in the list. The function will return the feature dictionary for each word, which we append to the feature sequence.

In [4]:
import nltk

sentence = "Fruit flies like a banana."
token_sequence_sentence = nltk.word_tokenize(sentence)

## The sequence of features that we will extract
feature_sequence_sentence = []
for index in range(len(token_sequence_sentence)):
    word_feature_dictionary = extract_features(token_sequence_sentence, index)
    feature_sequence_sentence.append(word_feature_dictionary)

for item in feature_sequence_sentence:
    print(item)
    print()

{'word': 'Fruit', 'is_first': True, 'is_last': False, 'is_capitalized': True, 'is_all_caps': False, 'is_all_lower': False, 'is_alphanumeric': 0, 'prefix-1': 'F', 'prefix-2': 'Fr', 'prefix-3': 'Fru', 'suffix-1': 't', 'suffix-2': 'it', 'suffix-3': 'uit', 'prev_word': '', 'next_word': 'flies', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}

{'word': 'flies', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'is_alphanumeric': 0, 'prefix-1': 'f', 'prefix-2': 'fl', 'prefix-3': 'fli', 'suffix-1': 's', 'suffix-2': 'es', 'suffix-3': 'ies', 'prev_word': 'Fruit', 'next_word': 'like', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}

{'word': 'like', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'is_alphanumeric': 0, 'prefix-1': 'l', 'prefix-2': 'li', 'prefix-3': 'lik', 'suffix-1': 'e', 'suffix-2': 'ke', 'suffix-3': 'ike', 'prev_word': 'flies'

We see that each word is now represented by a dictionary with feature names and values. The classifier will turn these feature representations internally into numeric representations and associate these with a sequence of labels, e.g.:

In [5]:
label_sequence = ["NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
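# A small sanity check, as a sketch: every token needs exactly one label,
# so the token list and the label list must have the same length.
# The token list is written out by hand here (it is what the tokenizer
# is expected to return for "Fruit flies like a banana.").

```python
tokens = ["Fruit", "flies", "like", "a", "banana", "."]
label_sequence = ["NOUN", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]

# One label per token, in the same order
assert len(tokens) == len(label_sequence)

pairs = list(zip(tokens, label_sequence))
for token, label in pairs:
    print(f"{token}\t{label}")
```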

In order to train a classifier, we need a lot more than one sentence and the corresponding labels. We therefore use a real data set next and convert it into sequences of features with the above function.

## Training a Part-of-Speech tagger using the Penn Treebank

The [Penn Treebank](https://paperswithcode.com/dataset/penn-treebank) is a famous data set in which sentences have been labeled with Part-of-Speech tags. The data set is included in NLTK but you may have to download it separately. You only need to download it once to use it.

In [6]:
#Ensure that the treebank corpus is downloaded
nltk.download('treebank')

[nltk_data] Downloading package treebank to /Users/piek/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

The data has a specific structure in which the text is stored as lists of words that can be retrieved through separate file ids. The corresponding sequences of Part-of-Speech tags can be retrieved with the same id. The next code pulls out sequences of words and sequences of tags that we pair as a tuple (note the "(", ")" brackets) and save in a list called ```penn_treebank```.

In [7]:
#Load the treebank corpus class
from nltk.corpus import treebank

#Now we iterate over all files in the corpus (a file can contain more than one sentence)
#and retrieve the words and their pre-labeled PoS tags. Each file is added as a tuple with
#a list of words and a list of their respective PoS tags (in the same order).
penn_treebank = []
for fileid in treebank.fileids():
  tokens = []
  tags = []
  for word, tag in treebank.tagged_words(fileid):
    tokens.append(word)
    tags.append(tag)
  penn_treebank.append((tokens, tags))

print('Number of paired sentences and tags in Penn', len(penn_treebank))

Number of paired sentences and tags in Penn 199


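So there are 199 entries in this data set, each with a sequence of Part-of-Speech tags. Note that the file ids of the Treebank sample refer to files rather than single sentences, so an entry can contain more than one sentence; in this notebook each entry is nevertheless treated as one sequence.
Each entry is now represented as two lists: a sequence of words and a sequence of tags.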
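If you prefer one sequence per sentence rather than one per file, NLTK corpus readers also offer `tagged_sents()`, which returns a list of sentences, each a list of `(word, tag)` tuples. The sketch below shows how such input could be turned into the same paired format. The helper `to_sequence_pairs` is hypothetical and is demonstrated on hand-written toy input; passing `treebank.tagged_sents()` instead is an assumption that is not run here and would change all the numbers reported in this notebook.

```python
def to_sequence_pairs(tagged_sentences):
    """Turn [[(word, tag), ...], ...] into [(words, tags), ...]."""
    pairs = []
    for tagged_sentence in tagged_sentences:
        words = [word for word, _ in tagged_sentence]
        tags = [tag for _, tag in tagged_sentence]
        pairs.append((words, tags))
    return pairs

# Toy input in the format that tagged_sents() returns
toy_tagged_sentences = [
    [("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ"), ("chairman", "NN"), (".", ".")],
    [("the", "DT"), ("board", "NN")],
]

toy_pairs = to_sequence_pairs(toy_tagged_sentences)
print(len(toy_pairs))
print(toy_pairs[0])
```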

In [8]:
### Showing the first pair of word and tag sequences
for token_tag_sequences in penn_treebank:
    print(token_tag_sequences[0], token_tag_sequences[1])
    break

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.'] ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.', 'NNP', 'NNP', 'VBZ', 'NN', 'IN', 'NNP', 'NNP', ',', 'DT', 'NNP', 'VBG', 'NN', '.']


We can now use the feature extraction function that we described above to represent all sentences and associate these with the sequences of tags.

Before we transform the Penn Treebank with the above function, we split it into 80% training and 20% test data.

In [9]:
# Split into train and test
penn_train_size = int(0.8*len(penn_treebank))
penn_training = penn_treebank[:penn_train_size]
penn_testing = penn_treebank[penn_train_size:]
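# The split above keeps the corpus order: the first 80% is used for training.
# As a sketch of an alternative (not used in this notebook), the pairs could be
# shuffled with a fixed seed before splitting. The helper shuffled_split and the
# seed value are hypothetical, and a shuffled split would change the reported scores.

```python
import random

def shuffled_split(pairs, train_fraction=0.8, seed=42):
    """Shuffle a copy of the pairs reproducibly and split it into train and test."""
    shuffled = list(pairs)                 # copy, so the input list stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split every run
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy data: ten one-word "sentences"
toy_pairs = [([f"w{i}"], ["NN"]) for i in range(10)]
toy_train, toy_test = shuffled_split(toy_pairs)
print(len(toy_train), len(toy_test))  # 8 2
```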

We will define a new function ```transform_to_dataset``` that will take the list of paired lists of words and tags, convert each word into a feature dictionary and pair the word features with the corresponding label.

In [10]:
#This function returns the data as two lists, one of Dicts of features and the other with the labels.
def transform_to_dataset(tagged_sentence_pairs):
  feature_sequences, label_sequences = [], []
  for sentence, tags in tagged_sentence_pairs:
    sent_word_features, sent_tags = [],[]
    for index in range(len(sentence)):
        sent_word_features.append(extract_features(sentence, index))
        sent_tags.append(tags[index])
    feature_sequences.append(sent_word_features)
    label_sequences.append(sent_tags)
  return feature_sequences, label_sequences

In [11]:
# Extract for each the feature sequences and label sequences
penn_train_feature_sequences, penn_train_label_sequences = transform_to_dataset(penn_training)
penn_test_feature_sequences, penn_test_label_sequences = transform_to_dataset(penn_testing)

In [12]:
print('Nr. of annotated sentences', len(penn_train_feature_sequences), len(penn_train_label_sequences))

### Showing the first feature sequence
for features in penn_train_feature_sequences:
    for feature in features:
        print(feature)
        print()
    break

### Showing the first tag sequence
for tags in penn_train_label_sequences:
    print(tags)
    break

Nr. of annotated sentences 159 159
{'word': 'Pierre', 'is_first': True, 'is_last': False, 'is_capitalized': True, 'is_all_caps': False, 'is_all_lower': False, 'is_alphanumeric': 0, 'prefix-1': 'P', 'prefix-2': 'Pi', 'prefix-3': 'Pie', 'suffix-1': 'e', 'suffix-2': 're', 'suffix-3': 'rre', 'prev_word': '', 'next_word': 'Vinken', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}

{'word': 'Vinken', 'is_first': False, 'is_last': False, 'is_capitalized': True, 'is_all_caps': False, 'is_all_lower': False, 'is_alphanumeric': 0, 'prefix-1': 'V', 'prefix-2': 'Vi', 'prefix-3': 'Vin', 'suffix-1': 'n', 'suffix-2': 'en', 'suffix-3': 'ken', 'prev_word': 'Pierre', 'next_word': ',', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}

{'word': ',', 'is_first': False, 'is_last': False, 'is_capitalized': True, 'is_all_caps': True, 'is_all_lower': True, 'is_alphanumeric': 0, 'prefix-1': ',', 'prefix-2': ',', 'prefix-3': ',', 'suffix-1': ',', 'suffix-2': ',', 'suffix-3': 

We can now feed the data to the CRF classifier as is.

## 3. Training the CRF PoS tagger

Now, we use the Conditional Random Fields (CRF) algorithm that is provided in a separate package with a scikit-learn-style interface called ```sklearn_crfsuite``` to train a Token in Sequence or TiS classifier that assigns PoS tags to sequences of words in a sentence. The package should already be installed at the start of the course. Otherwise, run the next cell after removing the comment tag:

In [13]:
#!pip install sklearn_crfsuite==0.5.0

We can create an instance of ```CRF``` with the default settings as follows:

In [14]:
from sklearn_crfsuite import CRF
penn_crf_pos = CRF()

We created the CRF instance without specifying any parameters. This means we are using the default settings. There are a number of parameters that can be defined. Here are some:

* algorithm: the optimisation algorithm used for training. Default is lbfgs (a gradient-based quasi-Newton method).
* c1 and c2: coefficients used for L1 and L2 regularization.
* max_iterations: maximum number of training iterations
* all_possible_transitions: CRF creates a "network" of transitions between labels; this option also lets it include transitions that are not directly present in the training data.

You could also specify these settings explicitly when creating the instance:
```
penn_crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True)
```

We call the ```fit``` function to pass the feature representation of the sentences and the corresponding PoS tag sequences.

In [15]:
#The fit method is the conventional name used by scikit-learn style estimators to start training.
print("Started training on Penn Treebank corpus!")
penn_crf_pos.fit(penn_train_feature_sequences, penn_train_label_sequences)
print("Finished training on Penn Treebank corpus!")

Started training on Penn Treebank corpus!
Finished training on Penn Treebank corpus!


## 4. Evaluation

For evaluating the Penn Treebank test set that we split off, we can feed the feature sequence representation of the test set directly to the classifier to get the predictions.

In [16]:
penn_predictions=penn_crf_pos.predict(penn_test_feature_sequences)
print(len(penn_test_label_sequences))
print(len(penn_predictions))
print('The first sentence gold labels in sequence')
print(penn_test_label_sequences[0])
print('The first sentence predicted labels in sequence')
print(penn_predictions[0])

40
40
The first sentence gold labels in sequence
['NNP', 'NNP', 'VBD', 'DT', 'NN', 'JJ', 'NN', 'IN', '$', 'CD', 'CD', '-NONE-', ',', 'CC', 'CD', 'NNS', 'DT', 'NN', ',', 'VBN', 'IN', 'JJ', 'NN', 'IN', '$', 'CD', 'CD', '-NONE-', ',', 'CC', 'CD', 'NN', 'DT', 'NN', '.', 'DT', 'NN', 'IN', 'DT', 'NNP', ',', 'JJ', 'NN', 'VBD', '-NONE-', 'NNS', 'VBD', 'DT', 'NN', 'IN', '$', 'CD', 'CD', '-NONE-', 'IN', 'DT', 'NN', ':', 'IN', 'NN', ',', 'DT', 'NN', 'VBD', 'VBN', '-NONE-', 'IN', 'VBG', 'NNS', 'VBG', '$', 'CD', 'CD', '-NONE-', 'CC', '$', 'CD', 'CD', '-NONE-', 'IN', 'NN', 'NNS', 'IN', 'PRP', 'VBD', '-NONE-', 'IN', '``', 'JJ', '.', "''", 'DT', 'NNS', 'VBD', 'RB', 'VBN', '-NONE-', 'IN', 'DT', '$', 'CD', 'CD', '-NONE-', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNS', 'IN', 'CD', 'JJ', 'NNS', ',', 'PRP', 'VBD', '-NONE-', '-NONE-', '.', 'NN', 'VBD', 'CD', 'NN', 'TO', '$', 'CD', 'CD', '-NONE-', ',', 'IN', '$', 'CD', 'CD', '-NONE-', 'DT', 'NN', 'JJR', '.', 'NNP', 'VBD', '``', 'DT', 'JJ', 'NN', 'IN', 'DT', 'NN', 'IN'

In [17]:
for gold, predict in zip(penn_test_label_sequences[0], penn_predictions[0]):
    print('Gold', gold, 'Predicted', predict)

Gold NNP Predicted NNP
Gold NNP Predicted NNP
Gold VBD Predicted VBD
Gold DT Predicted DT
Gold NN Predicted JJ
Gold JJ Predicted JJ
Gold NN Predicted NN
Gold IN Predicted IN
Gold $ Predicted $
Gold CD Predicted CD
Gold CD Predicted CD
Gold -NONE- Predicted -NONE-
Gold , Predicted ,
Gold CC Predicted CC
Gold CD Predicted CD
Gold NNS Predicted NNS
Gold DT Predicted DT
Gold NN Predicted NN
Gold , Predicted ,
Gold VBN Predicted VBN
Gold IN Predicted IN
Gold JJ Predicted JJ
Gold NN Predicted NN
Gold IN Predicted IN
Gold $ Predicted $
Gold CD Predicted CD
Gold CD Predicted CD
Gold -NONE- Predicted -NONE-
Gold , Predicted ,
Gold CC Predicted CC
Gold CD Predicted CD
Gold NN Predicted NN
Gold DT Predicted DT
Gold NN Predicted NN
Gold . Predicted .
Gold DT Predicted DT
Gold NN Predicted NN
Gold IN Predicted IN
Gold DT Predicted DT
Gold NNP Predicted NNP
Gold , Predicted ,
Gold JJ Predicted VBN
Gold NN Predicted NN
Gold VBD Predicted VBD
Gold -NONE- Predicted -NONE-
Gold NNS Predicted NNS
Gold VB
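Reading such a long list by eye is tedious. As a small sketch, the mismatches between a gold and a predicted sequence can be counted with a `Counter`. The helper `confusion_pairs` is hypothetical, and the two short lists are the first seven labels printed above.

```python
from collections import Counter

def confusion_pairs(gold_sequence, predicted_sequence):
    """Count (gold, predicted) pairs for the positions where the two disagree."""
    return Counter(
        (gold, predicted)
        for gold, predicted in zip(gold_sequence, predicted_sequence)
        if gold != predicted
    )

# First seven gold and predicted labels from the output above
gold_labels = ["NNP", "NNP", "VBD", "DT", "NN", "JJ", "NN"]
predicted_labels = ["NNP", "NNP", "VBD", "DT", "JJ", "JJ", "NN"]

errors = confusion_pairs(gold_labels, predicted_labels)
print(errors)
```

Applied to a full test sequence, the most frequent pairs show which tags the model tends to confuse.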

The predictions are made sequence by sequence and the result is a list of 40 predicted label sequences. To evaluate the results, we need to use a specific function ```flat_classification_report``` from ```sklearn_crfsuite.metrics``` to handle lists of lists as predictions.

In [18]:
from sklearn_crfsuite import metrics
print("## Penn Treebank CRF PoS tagger##")
print(metrics.flat_classification_report(penn_test_label_sequences, penn_predictions, labels=penn_crf_pos.classes_, digits=3, zero_division=0))

## Penn Treebank CRF PoS tagger##
              precision    recall  f1-score   support

         NNP      0.947     0.965     0.956      1213
           ,      1.000     1.000     1.000       592
          CD      1.000     0.990     0.995       683
         NNS      0.938     0.978     0.958       740
          JJ      0.854     0.896     0.874       731
          MD      0.993     1.000     0.996       135
          VB      0.970     0.930     0.949       313
          DT      0.992     0.993     0.992      1062
          NN      0.952     0.933     0.942      1899
          IN      0.978     0.981     0.980      1285
           .      1.000     1.000     1.000       509
         VBZ      0.952     0.900     0.925       219
         VBG      0.933     0.908     0.921       185
          CC      1.000     0.997     0.998       287
         VBD      0.957     0.947     0.952       492
         VBN      0.907     0.907     0.907       279
      -NONE-      0.998     1.000     0.999    

The weighted average is pretty high but not state of the art. On average, almost one out of twenty words still gets a wrong Part-of-Speech.
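The proportion of correctly tagged words can also be computed directly. The following is a minimal sketch on toy data that flattens the lists of label sequences and counts matching positions; the function `token_accuracy` is a hypothetical helper and the toy labels are not taken from the model output.

```python
from itertools import chain

def token_accuracy(gold_sequences, predicted_sequences):
    """Fraction of tokens whose predicted label equals the gold label."""
    gold = list(chain.from_iterable(gold_sequences))
    predicted = list(chain.from_iterable(predicted_sequences))
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy data: five tokens, one of them tagged wrongly
toy_gold = [["DT", "NN", "VBD"], ["NNP", "VBZ"]]
toy_predicted = [["DT", "JJ", "VBD"], ["NNP", "VBZ"]]

print(token_accuracy(toy_gold, toy_predicted))  # 0.8
```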

We can see that the Penn Treebank uses a large variety of PoS tags and that the performance varies across these tags, which also correlates with the support. Lower support means less training data and less reliable test scores. For example, ```EX``` has a support of 3 and a score of 0.75 for precision.

Due to the variation, there is a large difference between the macro average and the weighted average. Adapting the parameters when creating the CRF instance may improve the macro average results.
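The difference between the two averages follows from how they are computed: the macro average gives every label the same weight, while the weighted average weights each label by its support. A minimal sketch with invented scores (the function `macro_and_weighted` and the numbers are hypothetical, not taken from the report):

```python
def macro_and_weighted(scores):
    """scores maps label -> (f1, support); returns (macro average, weighted average)."""
    macro = sum(f1 for f1, _ in scores.values()) / len(scores)
    total_support = sum(support for _, support in scores.values())
    weighted = sum(f1 * support for f1, support in scores.values()) / total_support
    return macro, weighted

# Toy scores: one frequent label that is tagged well, one rare label that is not
toy_scores = {"NN": (0.9, 90), "EX": (0.5, 10)}

macro, weighted = macro_and_weighted(toy_scores)
print(round(macro, 2), round(weighted, 2))
```

The rare label pulls the macro average down much more than the weighted average, which is the same effect as in the report.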

The ```Counter``` class from ```collections``` can be used to show the frequency of the PoS labels in the training data set.

In [149]:
from collections import Counter
label_counts = Counter(metrics.flatten(penn_train_label_sequences))
print(label_counts)

Counter({'NN': 11267, 'IN': 8572, 'NNP': 8197, 'DT': 7103, '-NONE-': 5721, 'NNS': 5307, 'JJ': 5103, ',': 4294, '.': 3365, 'CD': 2863, 'VBD': 2551, 'RB': 2526, 'VB': 2241, 'CC': 1978, 'VBZ': 1906, 'TO': 1881, 'VBN': 1855, 'PRP': 1566, 'VBG': 1275, 'VBP': 1209, 'MD': 792, 'POS': 700, 'PRP$': 692, '``': 657, "''": 642, '$': 554, ':': 514, 'WDT': 383, 'JJR': 334, 'WP': 228, 'NNPS': 207, 'RP': 191, 'JJS': 158, 'WRB': 156, 'RBR': 123, '-RRB-': 110, '-LRB-': 104, 'EX': 85, 'RBS': 34, 'PDT': 23, '#': 16, 'LS': 13, 'WP$': 10, 'FW': 4, 'UH': 3, 'SYM': 1})


Tags that occur less frequently could perhaps be merged or generalised.
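One possible way to do this, shown here only as a sketch: replace every tag that occurs fewer than a chosen number of times by a single fallback label. The function `merge_rare_tags`, the threshold and the label `OTHER` are all hypothetical choices, and the same mapping would have to be applied to the test labels as well.

```python
from collections import Counter

def merge_rare_tags(label_sequences, min_count, fallback="OTHER"):
    """Replace tags that occur fewer than min_count times by the fallback label."""
    counts = Counter(tag for sequence in label_sequences for tag in sequence)
    return [
        [tag if counts[tag] >= min_count else fallback for tag in sequence]
        for sequence in label_sequences
    ]

# Toy label sequences: UH and SYM occur only once
toy_labels = [["NN", "DT", "NN", "UH"], ["NN", "DT", "SYM"]]
merged = merge_rare_tags(toy_labels, min_count=2)
print(merged)  # [['NN', 'DT', 'NN', 'OTHER'], ['NN', 'DT', 'OTHER']]
```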

## 5. Representation afterthought

The DictVectorizer class of Sklearn converts the data to a numerical vector representation. It extracts all possible values for each string-valued feature from the data and creates one dimension per feature-value combination, so that a 1 marks which value is present (a one-hot encoding). If you inspect the above representation, you will see that many features only have two values, True and False, and each of these takes a single dimension, but others are more open, such as the word itself and the suffixes. The latter require long sparse vectors as one-hot encodings.

The CRF implementation does not provide functions to directly show the feature vectors that it builds. However, we can use the DictVectorizer to create vectors from our feature representation and print these to inspect them. Let's do that for our "Fruit flies" sentence.

In [20]:
from sklearn.feature_extraction import DictVectorizer

print('Feature representation of the "Fruit flies like a banana." sentence.')
print(feature_sequence_sentence)
print()

Feature representation of the "Fruit flies like a banana." sentence.
[{'word': 'Fruit', 'is_first': True, 'is_last': False, 'is_capitalized': True, 'is_all_caps': False, 'is_all_lower': False, 'is_alphanumeric': 0, 'prefix-1': 'F', 'prefix-2': 'Fr', 'prefix-3': 'Fru', 'suffix-1': 't', 'suffix-2': 'it', 'suffix-3': 'uit', 'prev_word': '', 'next_word': 'flies', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}, {'word': 'flies', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'is_alphanumeric': 0, 'prefix-1': 'f', 'prefix-2': 'fl', 'prefix-3': 'fli', 'suffix-1': 's', 'suffix-2': 'es', 'suffix-3': 'ies', 'prev_word': 'Fruit', 'next_word': 'like', 'has_hyphen': False, 'is_numeric': False, 'capitals_inside': False}, {'word': 'like', 'is_first': False, 'is_last': False, 'is_capitalized': False, 'is_all_caps': False, 'is_all_lower': True, 'is_alphanumeric': 0, 'prefix-1': 'l', 'prefix-2': 'li', 'prefix-3': 'lik', 'suf

We will create an instance of the DictVectorizer that takes the above dictionary representation and converts it to a vector representation comparable to what the CRF uses internally.

In [21]:

vec = DictVectorizer(sparse=True)
sentence_vec = vec.fit_transform(feature_sequence_sentence)  # Sparse matrix of shape (n_tokens, n_features)

# Dense vector for easier viewing
print('Shape', sentence_vec.shape)
print()
for token in sentence_vec:
    print(token.toarray())

print()
# Feature names (column names of the vector)
print(vec.get_feature_names_out())

Shape (6, 62)

[[0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]]
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
  0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.
  0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]]
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]]
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
  0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
[[0. 0. 1. 1. 0

The ```shape``` of sentence_vec is (6, 62), which means an ordered list of 6 words or tokens, each represented by a vector with 62 dimensions.
Below the vector representation, we printed the names of the 62 dimensions. You can see that the 6 words themselves are dimensions, as well as prefixes and suffixes of the word.
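To see where these dimension names come from, the conversion can be imitated in plain Python. This is only a sketch of the idea, not the actual DictVectorizer implementation: string values become a `feature=value` column with a 1, while boolean and numeric values keep a single column holding the value as a number. The function `vectorize` is a hypothetical helper.

```python
def vectorize(feature_dicts):
    """Return (column names, rows) for a list of feature dictionaries."""
    def column(name, value):
        # Strings get one column per feature-value pair, other values one column per feature
        return f"{name}={value}" if isinstance(value, str) else name

    names = sorted({column(n, v) for d in feature_dicts for n, v in d.items()})
    position = {name: i for i, name in enumerate(names)}

    rows = []
    for features in feature_dicts:
        row = [0.0] * len(names)
        for name, value in features.items():
            row[position[column(name, value)]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return names, rows

toy_features = [
    {"word": "Fruit", "is_first": True},
    {"word": "flies", "is_first": False},
]
names, rows = vectorize(toy_features)
print(names)  # ['is_first', 'word=Fruit', 'word=flies']
for row in rows:
    print(row)
```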

What will happen if we need to represent the full Penn Treebank corpus? How many dimensions does it need to create?

We will not do this for the full data set but for a small selection. To get the vectors we need to flatten the list of sequence representations to a single list of all the tokens. We also need some other functions to print these, as the vectors get very long. We will only print the vector for the first token, represented by the dimensions extracted from the first three entries only.

In [23]:
from itertools import chain
import numpy as np
import sys 

# Flatten the list of feature sequences into a single list of token features
# This ignores the sequence structure and creates a single list of the tokens of all selected entries
tokens_flat = list(chain.from_iterable(penn_train_feature_sequences[:3])) ## selecting the first three entries
print('Total sequence of words', len(tokens_flat))
print()
vec = DictVectorizer(sparse=True)
penn_vec = vec.fit_transform(tokens_flat)  # Sparse matrix of shape (n_tokens, n_features)

print('We now have a list of tokens for the selection of sentences from the Penn treebank, each represented by a long sparse vector.')
print('Shape for the total set of tokens', penn_vec.shape)
print()
# Print the dense vector of the first token only; the full matrix would be too long to read
with np.printoptions(threshold=sys.maxsize):
    print(penn_vec[0].toarray())
print()
    
# Feature names (column names of the vector)
print(list(vec.get_feature_names_out()))

Total sequence of words 840

We now have a list of tokens for the selection of sentences from the Penn treebank, each represented by a long sparse vector.
Shape for the total set of tokens (840, 2181)

[[0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

Each token from the first three entries in the Penn Treebank is represented by the same features, but the number of values has increased a lot. From the first three entries it extracted a flat list of 840 tokens, which resulted in 2181 dimensions, each representing a possible feature-value combination.

This is a lot more than the 62 feature-value combinations or dimensions that we got for our "Fruit flies" sentence. This is because the number of values of the open features grows with each sentence. Try to include more than three entries from the Penn Treebank and see how the dimensions explode.

With 20 entries, we get a shape of (5986, 8304), so more than three times the number of dimensions. At some point, the increase in dimensions will slow down as the same words are repeated.
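The slowing growth can be illustrated on a toy scale without building any vectors. The sketch below counts the distinct feature-value columns after each additional token; the helper `count_dimensions` is hypothetical and uses only two features instead of the full feature bundle.

```python
def count_dimensions(feature_dicts):
    """Number of distinct columns: one per string feature-value pair, one per other feature."""
    columns = set()
    for features in feature_dicts:
        for name, value in features.items():
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    return len(columns)

toy_tokens = [
    {"word": word, "is_first": index == 0}
    for index, word in enumerate("the chicken produced the egg".split())
]

growth = [count_dimensions(toy_tokens[:n]) for n in range(1, len(toy_tokens) + 1)]
print(growth)  # [2, 3, 4, 4, 5]
```

The fourth token repeats the word "the", so it adds no new dimension.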

## End of notebook

In [24]:
tokens_flat = list(chain.from_iterable(penn_train_feature_sequences[:20])) ## selecting the first 20 entries
print('Total sequence of words', len(tokens_flat))
print()
vec = DictVectorizer(sparse=True)
penn_vec = vec.fit_transform(tokens_flat)  # Sparse matrix of shape (n_tokens, n_features)

print('We now have a list of tokens for the selection of sentences from the Penn treebank, each represented by a long sparse vector.')
print('Shape for the total set of tokens', penn_vec.shape)
print()

    
# Feature names (column names of the vector)
print(list(vec.get_feature_names_out()))

Total sequence of words 5986

We now have a list of tokens for the selection of sentences from the Penn treebank, each represented by a long sparse vector.
Shape for the total set of tokens (5986, 8304)

['capitals_inside', 'has_hyphen', 'is_all_caps', 'is_all_lower', 'is_alphanumeric', 'is_capitalized', 'is_first', 'is_last', 'is_numeric', 'next_word=', 'next_word=$', 'next_word=%', 'next_word=&', "next_word='", "next_word=''", "next_word='re", "next_word='s", 'next_word=*', 'next_word=*-1', 'next_word=*-10', 'next_word=*-11', 'next_word=*-12', 'next_word=*-13', 'next_word=*-14', 'next_word=*-15', 'next_word=*-16', 'next_word=*-17', 'next_word=*-18', 'next_word=*-19', 'next_word=*-2', 'next_word=*-20', 'next_word=*-21', 'next_word=*-22', 'next_word=*-23', 'next_word=*-24', 'next_word=*-25', 'next_word=*-26', 'next_word=*-27', 'next_word=*-28', 'next_word=*-29', 'next_word=*-3', 'next_word=*-4', 'next_word=*-5', 'next_word=*-6', 'next_word=*-7', 'next_word=*-8', 'next_word=*-9', 'next_