This notebook roughly covers materials in https://huggingface.co/docs/datasets/tutorial, looking specifically at the Universal Dependencies (UD) dataset(s). You can find the documentation about the UD dataset here: https://universaldependencies.org/introduction.html.

In [2]:
!pip install transformers datasets


In [3]:
!pip install conllu



In [4]:
from datasets import load_dataset, get_dataset_config_names, get_dataset_split_names

In [5]:
name = "universal_dependencies"

In [6]:
ud_config = get_dataset_config_names(name)

In [7]:
print([x for x in ud_config if 'en_' in x])

['en_esl', 'en_ewt', 'en_gum', 'en_gumreddit', 'en_lines', 'en_partut', 'en_pronouns', 'en_pud']


#2A)

In [8]:
ud_ewt_train = load_dataset(name, 'en_ewt', split="train")
ud_ewt_val = load_dataset(name, 'en_ewt', split="validation")

Generating train split:   0%|          | 0/12543 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2002 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2077 [00:00<?, ? examples/s]

In [9]:
ud_ewt_train

Dataset({
    features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
    num_rows: 12543
})

In [10]:
ud_ewt_val

Dataset({
    features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
    num_rows: 2002
})

In [11]:
ud_ewt_train[0]['tokens']

['Al',
 '-',
 'Zaman',
 ':',
 'American',
 'forces',
 'killed',
 'Shaikh',
 'Abdullah',
 'al',
 '-',
 'Ani',
 ',',
 'the',
 'preacher',
 'at',
 'the',
 'mosque',
 'in',
 'the',
 'town',
 'of',
 'Qaim',
 ',',
 'near',
 'the',
 'Syrian',
 'border',
 '.']

In [12]:
#Training
avg = 0
for x in ud_ewt_train['tokens']:
  avg = avg + len(x)
avg = avg / len(ud_ewt_train['tokens'])
print("# of sentences in training set: ", len(ud_ewt_train['tokens']))
print("Average # of words per sentence: ", avg)

# of sentences in training set:  12543
Average # of words per sentence:  16.50745435701188


In [13]:
#Validation
avg = 0
for x in ud_ewt_val['tokens']:
  avg = avg + len(x)
avg = avg / len(ud_ewt_val['tokens'])
print("# of sentences in validation set: ", len(ud_ewt_val['tokens']))
print("Average # of words per sentence: ", avg)

# of sentences in validation set:  2002
Average # of words per sentence:  12.726273726273726
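The two cells above repeat the same computation for each split. As a minimal sketch, the logic could be wrapped in one helper; `split_stats` and the toy sentences are hypothetical names and data used only for illustration, and the same call should work on `ud_ewt_train['tokens']` or `ud_ewt_val['tokens']`, assuming each entry is a list of token strings.

```python
def split_stats(sentences):
    """Return (sentence count, mean tokens per sentence) for a list of token lists."""
    n_sentences = len(sentences)
    n_tokens = sum(len(sent) for sent in sentences)
    return n_sentences, n_tokens / n_sentences

# Toy stand-in for a dataset split: each sentence is a list of tokens.
toy_sentences = [
    ["Hello", "world", "."],
    ["UD", "is", "a", "treebank", "project", "."],
]
count, mean_len = split_stats(toy_sentences)
print("# of sentences:", count)
print("Average # of tokens per sentence:", mean_len)
```

Note that the average counts every token, punctuation included, so "words per sentence" above is really tokens per sentence.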


#2B)

In [14]:
#Tokens
tokens = [token for sent in ud_ewt_train['tokens'] for token in sent]
print("Number of tokens in training set", len(tokens))

Number of tokens in training set 207053


In [15]:
#Types (ie different words)
types = set([token for sent in ud_ewt_train['tokens'] for token in sent])
print("Number of different types in training set", len(types))

Number of different types in training set 20132


#2C)

In [16]:
# 50 most frequent types in the training data (case-insensitive)
from collections import Counter

tokens = [token.lower() for sent in ud_ewt_train['tokens'] for token in sent]

# most_common(50) returns (type, count) pairs sorted by descending count;
# keep only the types
popular = [tkn for tkn, _ in Counter(tokens).most_common(50)]
print(popular)

['the', '.', ',', 'to', 'and', 'a', 'of', 'i', 'in', 'is', 'you', 'that', 'it', 'for', '-', 'have', '"', 'on', 'was', 'with', 'this', 'be', 'are', 'they', 'not', 'as', 'we', "'s", 'my', ')', 'do', '(', 'will', 'he', 'at', '?', 'but', 'if', 'or', 'your', 'from', "n't", 'by', 'can', 'would', 'me', ':', 'there', 'so', '!']


The most common types in the training data tend to be punctuation marks and the English articles. The period and comma are both among the three most frequent types, while 'the' and 'a' are two of the six most frequent; together these four make up 2/3 of the top six types. Pronouns are also frequent: 'i', 'you', and 'it' are in the top 15, and 'they' and 'we' are in the top 30.
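To put a number on how dominant the most frequent types are, one option is to compute the share of all tokens covered by the k most frequent types. The sketch below uses a hypothetical helper `top_k_coverage` and a toy token list; on the real data the input would be the lowercased training tokens.

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    freq = Counter(tokens)
    covered = sum(count for _, count in freq.most_common(k))
    return covered / len(tokens)

# Toy data: '.' occurs 3 times and 'the' twice, out of 9 tokens.
toy_tokens = ["the", "cat", ".", "the", "dog", ".", "a", "bird", "."]
coverage = top_k_coverage(toy_tokens, 2)
print(f"Top-2 types cover {coverage:.1%} of the tokens")
```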

#2D)

I believe the easiest way to collapse these items would be to group them by part of speech. This may oversimplify the data, since it is still important to distinguish individual words from others that share their part of speech, but it may lead to more interesting discoveries about the frequencies of the parts of speech in the language. Similarly, punctuation marks could be grouped by function, for example '.', '!', and '?' as sentence-ending marks.

It could also be a good strategy to group words by their lemma, the form under which a lexeme is indexed. This would group together inflected variants of the same root word, such as "breaks," "break," and "breaking" (Wikipedia, https://en.wikipedia.org/wiki/Lemma_(morphology)). This could be used to show the frequency of different ideas rather than of specific individual word forms. Conversely, it could be used to show how often each form of a given lexeme appears in writing.
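As a rough illustration of the lemma idea, the sketch below counts surface forms and lemmas side by side. The toy sentences and their lemmas are made up for illustration; with the real data one would presumably read the `lemmas` column listed in the dataset features, assuming it is aligned token-for-token with `tokens`.

```python
from collections import Counter

# Hypothetical toy data: tokens and their lemmas, aligned position by position.
toy_tokens = [["He", "breaks", "it"], ["They", "break", "things"], ["Breaking", "news"]]
toy_lemmas = [["he", "break", "it"], ["they", "break", "thing"], ["break", "news"]]

# Counting lowercased surface forms keeps 'breaks', 'break', 'breaking' apart.
surface_counts = Counter(tok.lower() for sent in toy_tokens for tok in sent)

# Counting lemmas collapses them into the single entry 'break'.
lemma_counts = Counter(lem for sent in toy_lemmas for lem in sent)

print("surface types:", len(surface_counts), "| lemma types:", len(lemma_counts))
print("count of 'break' as a surface form:", surface_counts["break"])
print("count of 'break' as a lemma:", lemma_counts["break"])
```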

#2E)

In [17]:
from collections import Counter

tokens = [token for sent in ud_ewt_train['tokens'] for token in sent]
tokens = [token.lower() for token in tokens]

count = Counter(tokens).most_common()
count = [word for word in count if word[1] < 50]
print(count)



##i)

In my opinion, words start to look fairly "standard" at about 20 occurrences in this training set. Words with about 10 occurrences or fewer are often very specific: proper nouns, website links, words with apostrophes, specific numbers, and similar cases. Words around the 20-occurrence mark still show some of that over-specificity, but more often they are ordinary words that appear frequently in English sentences. In other words, they have become more general.

##ii)

As words become more "standard," I would argue that they also become more "reasonable," meaning they are more likely to help the learning model as individual tokens. Quite a few of the "non-standard" tokens would be unhelpful for training the model as individual instances. I would propose grouping some of these "non-standard" tokens so that they can be identified as "ideas" rather than as individual tokens. As above, the words could be grouped by meaning or root word to help the model learn the meanings of specific tokens more accurately.
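One very simple form of grouping, shown here only as a sketch and not as the meaning-based grouping proposed above, is to map every type below a frequency threshold to a single placeholder token. The helper `replace_rare`, the `<unk>` placeholder, and the threshold are illustrative assumptions.

```python
from collections import Counter

def replace_rare(sentences, min_count, unk="<unk>"):
    """Lowercase tokens and map types seen fewer than min_count times to unk."""
    freq = Counter(tok.lower() for sent in sentences for tok in sent)
    return [
        [tok.lower() if freq[tok.lower()] >= min_count else unk for tok in sent]
        for sent in sentences
    ]

# Toy data: 'the' and 'sat' occur at least twice; 'cat', 'dog', 'emu' occur once.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "emu"]]
collapsed = replace_rare(toy, min_count=2)
print(collapsed)
```

On the training data, a threshold near the 10 to 20 occurrences discussed above would be one candidate value to try.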

#2F)

##i)

In [18]:
#Tokens
tokens_val = [token for sent in ud_ewt_val['tokens'] for token in sent]
print("Number of tokens in validation set", len(tokens_val))

Number of tokens in validation set 25478


##ii)

In [19]:
tokens_train = [token for sent in ud_ewt_train['tokens'] for token in sent]
types_train = set(tokens_train)
types_val = set(tokens_val)

# remove every type that also occurs in the training set (in-place set
# difference), leaving only the out-of-vocabulary types in types_val
types_val -= types_train

print("Number of out of vocabulary words in validation set: ", len(types_val))

Number of out of vocabulary words in validation set:  1706


##iii)

In [20]:
OOV = len(types_val)
types_val = set(tokens_val)
print("Number of Training types: ", len(types_train))
print("Number of Validation types: ", len(types_val))
print("Proportion of OOV types: ", OOV / len(types_val) * 100, "%")

Number of Training types:  20132
Number of Validation types:  5624
Proportion of OOV types:  30.33428165007112 %


The number of OOV words did not seem too surprising to me. It makes sense that, to test the model, about 30% of the validation types should be new. Having about 70% of the validation types also appear in the training split gives the model a basis for recognizing some of the information directly, while for the remaining 30% the model is forced to use its training to handle new information.
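The 30% figure is measured over types. Because OOV types are usually rare, the share of running validation tokens that are OOV may be noticeably lower. The sketch below computes both rates on toy data; `oov_rates` is a hypothetical helper, and no token-level figure for the real dataset is claimed here.

```python
def oov_rates(train_tokens, val_tokens):
    """Return (OOV rate over validation types, OOV rate over validation tokens)."""
    train_types = set(train_tokens)
    val_types = set(val_tokens)
    oov_types = val_types - train_types
    type_rate = len(oov_types) / len(val_types)
    token_rate = sum(tok in oov_types for tok in val_tokens) / len(val_tokens)
    return type_rate, token_rate

# Toy data: only 'emu' is unseen in training.
train = ["the", "cat", "sat", "the", "dog"]
val = ["the", "emu", "sat", "the", "the"]
type_rate, token_rate = oov_rates(train, val)
print(f"OOV by type: {type_rate:.1%}, OOV by token: {token_rate:.1%}")
```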

#3A)

torch.nn.Linear is a module (a class in PyTorch's torch.nn neural network package) that implements a "fully connected layer" in a neural network. It multiplies the input features by a learned weight matrix and, by default, adds a learned bias vector, so each output feature is a different weighted sum of the input features. These layers are what allow the model to find hidden features in the data, beyond what can be displayed with just a set of input features.

#3B)

In [21]:
import torch
f = torch.nn.Linear(3, 2, bias=False)
f.weight = torch.nn.Parameter(torch.Tensor(
    [[3, 2, 1],
     [1, 4, 1]]))
x = torch.Tensor([1, 0, 0])
y = f(x)
y


ModuleNotFoundError: No module named 'torch'

The output of the weighted sum is the tensor [3, 1], which as a column vector is the product of the weight matrix and the input:
$$
y = W x =
\begin{bmatrix}
  3 & 2 & 1\\
  1 & 4 & 1\\
\end{bmatrix}
\begin{bmatrix}
  1 \\
  0 \\
  0 \\
\end{bmatrix}
=
\begin{bmatrix}
  3 \\
  1 \\
\end{bmatrix}
$$
torch.nn.Linear computes y = x W^T, so the 1-D input [1, 0, 0] is treated as a row vector multiplied by the transpose of the weight matrix. For a single input this gives the same result as multiplying the weight matrix by x written as a 3x1 column vector. Because x selects only the first input feature, y is simply the first column of the weight matrix.
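Since the cell above could not import torch in this environment, the same weighted sum can be checked by hand without PyTorch. This is a plain-Python sketch of the arithmetic only, not of how `torch.nn.Linear` is implemented.

```python
# Weight matrix and input copied from the cell above.
W = [[3, 2, 1],
     [1, 4, 1]]
x = [1, 0, 0]

# Each output feature is the dot product of one weight row with x.
y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
print(y)  # [3, 1]
```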


#3C)

In [290]:
z = y.sum()
z.backward()  # returns None; the gradients are stored in f.weight.grad
print(z)
print(f.weight.grad)

tensor(4., grad_fn=<SumBackward0>)
tensor([[1., 0., 0.],
        [1., 0., 0.]])


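The gradient being computed is the derivative of z, which plays the role of the loss here, with respect to each weight in f.weight. Since z is the sum of the outputs and each output is a weighted sum of x, the derivative of z with respect to the weight in row i, column j is just x_j. The initial weight values therefore have no bearing on the gradient, and x = [1, 0, 0] is copied directly into each row of f.weight.grad.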
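The claim that each row of the gradient equals x can be sanity-checked numerically without autograd. The sketch below uses forward finite differences in plain Python; `z_of` and the step size `eps` are illustrative choices, not part of the original notebook.

```python
def z_of(W, x):
    """z = sum over all outputs of the weighted sums W @ x."""
    return sum(sum(w * x_j for w, x_j in zip(row, x)) for row in W)

W = [[3.0, 2.0, 1.0], [1.0, 4.0, 1.0]]
x = [1.0, 0.0, 0.0]
eps = 1e-6

# Nudge one weight at a time and measure how much z changes.
grad = []
for i in range(len(W)):
    grad_row = []
    for j in range(len(W[i])):
        W_plus = [row[:] for row in W]
        W_plus[i][j] += eps
        grad_row.append((z_of(W_plus, x) - z_of(W, x)) / eps)
    grad.append(grad_row)

print([[round(g, 4) for g in row] for row in grad])
```

Changing the starting values in `W` should leave the result unchanged, which matches the observation that the initial weights do not affect this gradient.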



#4A)

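The derivative of the function f_μ(x) = exp(-(1/3)(x-μ)^3) with respect to μ

is: (x-μ)^2 (exp(-(1/3)(x-μ)^3))

i.e., it just uses the chain rule: the derivative of the exponent -(1/3)(x-μ)^3 with respect to μ is (x-μ)^2, which multiplies the original exponential. (With respect to x, the same expression appears with a minus sign.)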

#4B)

In [299]:
# Exact Answer
from math import exp
x = 1.5
u = 5
fx = pow((x-u), 2) * exp((-1/3)*pow((x-u), 3))
print(fx)

19720960.318729453


In [300]:
# using pyTorch
x = torch.tensor(1.5, requires_grad=True)
u = torch.tensor(5.0, requires_grad=True)

fx = torch.exp(-(1/3)*pow((x - u), 3))

fx.backward()

u.grad


tensor(19720968.)
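The PyTorch value differs from the exact answer in the seventh significant digit, which is consistent with `torch.tensor(1.5)` defaulting to single precision (float32). As an independent check in double precision, the sketch below compares the analytic derivative with a central finite difference in μ; the helper names and the step size `h` are illustrative assumptions.

```python
from math import exp

def f(x, mu):
    """f_mu(x) = exp(-(1/3) * (x - mu)^3)."""
    return exp(-(1 / 3) * (x - mu) ** 3)

def df_dmu(x, mu):
    """Analytic derivative of f with respect to mu (chain rule)."""
    return (x - mu) ** 2 * f(x, mu)

x, mu, h = 1.5, 5.0, 1e-5

# Central difference: (f(mu + h) - f(mu - h)) / (2h)
numeric = (f(x, mu + h) - f(x, mu - h)) / (2 * h)
analytic = df_dmu(x, mu)
rel_err = abs(numeric - analytic) / analytic

print("analytic:", analytic)
print("numeric: ", numeric)
print("relative error:", rel_err)
```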

#4C)

The gradient of the function f(x) = log(ax^4 + bx^2 - 1) with respect to a and b is:

With respect to a: (x^4)/(ax^4 + bx^2 - 1)

With respect to b: (x^2)/(ax^4 + bx^2 - 1)
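The chain-rule step behind both partial derivatives can be written out as follows. The intermediate symbol u is introduced here only for illustration.

```latex
\begin{aligned}
u &= a x^4 + b x^2 - 1, \qquad f = \log u \\
\frac{\partial f}{\partial a} &= \frac{1}{u}\,\frac{\partial u}{\partial a}
  = \frac{x^4}{a x^4 + b x^2 - 1} \\
\frac{\partial f}{\partial b} &= \frac{1}{u}\,\frac{\partial u}{\partial b}
  = \frac{x^2}{a x^4 + b x^2 - 1}
\end{aligned}
```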

#4D)

In [310]:
# Exact Answer
x = -2
a = 1
b = 1

fx = a * pow(x,4) + b * pow(x, 2) - 1
dfxda = pow(x, 4) / fx
dfxdb = pow(x, 2) / fx
print("With respect to a: ", dfxda)
print("With respect to b: ", dfxdb)

With respect to a:  0.8421052631578947
With respect to b:  0.21052631578947367


In [311]:
# using pyTorch
x = torch.tensor(-2.0, requires_grad=True)
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)


fx = torch.log(a * pow(x,4) + b * pow(x, 2) - 1)

fx.backward()

print("With respect to a: ", a.grad)
print("With respect to b: ", b.grad)

With respect to a:  tensor(0.8421)
With respect to b:  tensor(0.2105)


#4E & 4F

See Attached JPG File in submission

#4G)

In [314]:
A = torch.tensor([[4, 2, 0], [-1, 0, -1]])
At = torch.transpose(A, 0, 1)

AtA = torch.matmul(At, A)
AAt = torch.matmul(A, At)

print("ATA = ", AtA)
print("AAT = ", AAt)

ATA =  tensor([[17,  8,  1],
        [ 8,  4,  0],
        [ 1,  0,  1]])
AAT =  tensor([[20, -4],
        [-4,  2]])
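Both products can be double-checked without PyTorch. The sketch below uses small hypothetical helpers `transpose` and `matmul` on nested lists, and also confirms that A^T A is symmetric, as any product of a matrix with its own transpose must be.

```python
def transpose(M):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    """Matrix product of P (n x k) and Q (k x m) as nested lists."""
    Q_cols = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Q_cols] for row in P]

A = [[4, 2, 0], [-1, 0, -1]]
At = transpose(A)

AtA = matmul(At, A)  # 3 x 3
AAt = matmul(A, At)  # 2 x 2

print("ATA =", AtA)
print("AAT =", AAt)
print("ATA symmetric:", AtA == transpose(AtA))
```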


#References

Wikipedia article on Lemmas (https://en.wikipedia.org/wiki/Lemma_(morphology))

Code based on work from Professor Ferraro's colab file 9/6