# Agenda for today

* Review some concepts from `distributional_semantics_bow_dimensionality.ipynb`
* Discuss implementation of latent semantic analysis and principal component analysis
* Learn how to inspect your datasets with unsupervised word vectors

# Agenda for Monday

* Going over HW2 in class so you can look out for tips for HW3
* Discussing the final project assignment and requirements

# HW3 - Due Friday October 8th by midnight

## HW3 tip (Question 4)

Getting Question 4 correct is effectively a prereq for getting Questions 5, 6, 7, and the bonus correct. 

If you build a dictionary that maps words to their morphemes and you want a dictionary that does the opposite, you need to "invert" the dictionary. For each of roots, prefixes, and suffixes, you want a data structure (a dictionary) whose `key`s are roots, prefixes, and suffixes, respectively. Each `value` should be a `set` that contains all the words containing that root, prefix, or suffix. If you have a word like "costumers", it will have one root (`"roots": ["costume"]`), no prefixes (`"prefixes": []`), and two suffixes (`"suffixes": ['er', 's']`). To find other words containing `"costume"` as a root, you need to search through all the words in the homework (each line in the original file or each entry in the dictionary you build in Question 2) to see whether `"costume"` is in the roots **sub**dictionary. Since we want to do this _for all roots/prefixes/suffixes_, you need to make sure you loop through _all_ keys and make separate sets for each key.
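
As a rough illustration of the inversion described above, here is a minimal sketch. The `word_to_morphemes` dictionary, its three toy entries, and the `invert_morphemes` helper are hypothetical stand-ins, not the homework's actual data or a required solution.

```python
from collections import defaultdict

# Hypothetical toy data: word -> sub-dictionary of morphemes (not the homework file)
word_to_morphemes = {
    "costumers": {"roots": ["costume"], "prefixes": [], "suffixes": ["er", "s"]},
    "costumes": {"roots": ["costume"], "prefixes": [], "suffixes": ["s"]},
    "unlocks": {"roots": ["lock"], "prefixes": ["un"], "suffixes": ["s"]},
}

def invert_morphemes(word_to_morphemes, kind):
    """Map each morpheme of one kind ("roots", "prefixes", or "suffixes")
    to the set of words that contain it."""
    inverted = defaultdict(set)
    for word, morphemes in word_to_morphemes.items():
        for morpheme in morphemes[kind]:
            inverted[morpheme].add(word)
    return dict(inverted)

# One inverted dictionary per kind of morpheme
root_to_words = invert_morphemes(word_to_morphemes, "roots")
prefix_to_words = invert_morphemes(word_to_morphemes, "prefixes")
suffix_to_words = invert_morphemes(word_to_morphemes, "suffixes")

print(root_to_words["costume"])
```

Using a `set` as the value means a word is only recorded once per morpheme, even if the morpheme appears in it more than once.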

# Latent semantic analysis using NLTK and scikit-learn's `CountVectorizer`

In [None]:
from google.colab import drive, files

abstract_file = files.upload()

abstracts = abstract_file['abstracts.tsv'].decode("utf-8").split('\n')

Saving abstracts.tsv to abstracts (3).tsv


## Earliest word embedding method: Latent semantic analysis

The easiest method to learn word embeddings is to build a pipeline that implements Latent Semantic Analysis. The basic ingredients are as follows:

1. Bag-of-words representations (using a sparse matrix package)
  * Decide what vocabulary terms to keep
    * Stop word removal
    * Casing or text normalization
    * What kind of tokenizer to use for segmentation
2. Principal components analysis (PCA)
  * Decide the number of dimensions you want

The bag-of-words representations shown in previous classes are slow and unoptimized. We can use others' implementations of sparse matrices and others' tokenizers to make this job easier for us. Specifically, we will do bag-of-words preprocessing using our familiar `nltk.word_tokenize` and a brand new tool, `sklearn.feature_extraction.text.CountVectorizer`. `CountVectorizer` will turn each document's list of words into a vector of word counts that ignores word order.

Recall that each dimension of a bag-of-words representation corresponds to counts of a *single* word. That means that the `CountVectorizer` is going to give us a vector that is as long as our vocabulary.
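
To make the "one dimension per word" idea concrete, here is a minimal pure-Python sketch of what a count vector looks like. The two toy documents, the whitespace tokenizer, and the `to_count_vector` helper are assumptions for illustration; `CountVectorizer` does the same kind of counting but stores the result sparsely.

```python
from collections import Counter

# Hypothetical toy corpus; whitespace splitting stands in for a real tokenizer
toy_docs = ["the parser parses the sentence", "the model parses text"]
toy_tokenized = [doc.split() for doc in toy_docs]

# One dimension per distinct word, in a fixed (sorted) order
toy_vocabulary = sorted({token for tokens in toy_tokenized for token in tokens})

def to_count_vector(tokens, vocabulary):
    """Return word counts in vocabulary order; word order in `tokens` is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

toy_bow = [to_count_vector(tokens, toy_vocabulary) for tokens in toy_tokenized]

print(toy_vocabulary)
for row in toy_bow:
    print(row)
```

Every row has exactly `len(toy_vocabulary)` entries, and most of them are zero, which is why sparse storage pays off for a real vocabulary.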

In [None]:
import nltk
nltk.download("punkt")
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer( # instantiate sparse matrix-creator
    tokenizer=word_tokenize, # with our tokenization algorithm
    stop_words=stopwords.words("english"), # typically remove stop words
    lowercase=True) # optionally lowercase words
# basically just one line to get a giant matrix
bow_abstracts = vectorizer.fit_transform(abstracts)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




In [None]:
bow_abstracts

<27471x74178 sparse matrix of type '<class 'numpy.int64'>'
	with 1950546 stored elements in Compressed Sparse Row format>

### Quiz yourself:

<details>
<summary>How many dimensions does this matrix have in it (how large is its vocabulary)? 
</summary>
74,178 vocabulary items </details>

<details>
<summary>How many documents are in the corpus?
</summary>
27,471 documents </details>

<details>
<summary>How many total words are in the corpus?
</summary>
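The matrix stores 1,950,546 nonzero entries, but each entry is a (document, word) pair, not a single word occurrence. The total number of word tokens (after stop word removal) is the sum of all the counts, `bow_abstracts.sum()`, which is at least that large. </details>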
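
The three quantities in the quiz are easy to mix up, so here is a small sketch on a dense toy matrix (the matrix and the `toy_` names are made up for illustration). For a SciPy sparse matrix such as `bow_abstracts`, the analogous values are `.shape`, `.nnz`, and `.sum()`.

```python
import numpy as np

# Hypothetical toy document-term matrix: 3 documents x 4 vocabulary items
toy_counts = np.array([
    [2, 0, 1, 0],
    [0, 0, 3, 1],
    [0, 0, 0, 0],  # an empty document is an all-zero row
])

n_documents, vocabulary_size = toy_counts.shape
n_stored = int(np.count_nonzero(toy_counts))  # entries a sparse matrix would store
n_tokens = int(toy_counts.sum())              # total word tokens that were counted

print(n_documents, vocabulary_size, n_stored, n_tokens)
```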



## Principal Components Analysis

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/1280px-GaussianScatterPCA.svg.png" width=500/> 

By <a href="//commons.wikimedia.org/wiki/User:Nicoguaro" title="User:Nicoguaro">Nicoguaro</a> - <span class="int-own-work" lang="en">Own work</span> <a href="https://creativecommons.org/licenses/by/4.0" title="Creative Commons Attribution 4.0">CC BY 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=46871195">Link</a>

</center>

The goal of PCA is to learn several dimensions that are each geometrically **orthogonal** to, or statistically **uncorrelated** with, the other dimensions. We can use PCA to transform large, complex spaces with correlations into smaller, more orderly spaces.

The output of principal components analysis is a **projection matrix** that maps every input dimension (here, every vocabulary item) to a lower-dimensional representation. For our purposes this means we get **lower-dimensional, latent vector representations of words**. But we can also use this projection matrix to transform all of our documents (e.g., each abstract) into a latent document representation, too. Let's get a sense of how this works.
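
Before moving to a library, it may help to see the projection matrix on a tiny example. This is a minimal sketch using NumPy's singular value decomposition without mean-centering; the toy matrix and all `toy_`-prefixed names are made up for illustration.

```python
import numpy as np

# Hypothetical toy count matrix: 4 documents x 5 vocabulary items
toy_X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])
toy_k = 2  # number of latent dimensions to keep

# SVD: toy_X = U @ diag(S) @ Vt, where the rows of Vt are orthonormal
toy_U, toy_S, toy_Vt = np.linalg.svd(toy_X, full_matrices=False)

toy_components = toy_Vt[:toy_k]                 # (k, vocabulary): the projection matrix
toy_word_vecs = toy_components.T                # (vocabulary, k): one latent vector per word
toy_doc_vecs = toy_X @ toy_word_vecs            # (documents, k): one latent vector per document

print(toy_word_vecs.shape, toy_doc_vecs.shape)
# The kept directions are orthonormal, so this product is the identity matrix
print(np.allclose(toy_components @ toy_components.T, np.eye(toy_k)))
```

The same matrix serves both purposes: read column-wise it gives word vectors, and multiplying the count matrix by it gives document vectors.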

We can use `scikit-learn` to build a Principal Components Analysis model as well.

In [None]:
from sklearn.decomposition import TruncatedSVD # like PCA, but works on sparse matrices (no mean-centering)

N_COMPONENTS = 100

pca = TruncatedSVD(n_components=N_COMPONENTS)
pca.fit(bow_abstracts)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
             random_state=None, tol=0.0)

In [None]:
# investigate the size of the components
pca.components_.shape

(100, 74178)

In [None]:
pca.components_[0].shape # 0th dimension values, for all vocabulary items

(74178,)

In [None]:
word_vectors = pca.components_.T

In [None]:
# let's look at the most similar words to 'parsing'
# or any other word of your choice
from sklearn.metrics.pairwise import cosine_similarity

_index = vectorizer.vocabulary_['parsing']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
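# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn removed
_to_similarities = dict(zip(vectorizer.get_feature_names_out().tolist(),
                            word_similarities[0].tolist()))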
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{'a*': 0.8696203469011479,
 'algebras': 0.9121852366777992,
 'automaton-based': 0.8845172266719301,
 'constituency': 0.8887536414738242,
 'corner': 0.8657388453578299,
 'non-projective': 0.8863091397526464,
 'parse': 0.8477854776675771,
 'parser': 0.9399177243565137,
 'parsers': 0.9193941345745146,
 'parsing': 1.0,
 'projective': 0.8690298389224291,
 'shift-reduce': 0.9111620440324334,
 'subalgebras': 0.8845172266719301,
 'transition-based': 0.9478722739564406,
 'well-typedness': 0.8848748407807268}

In [None]:
# try again with algorithms, or any other word of your choice
_index = vectorizer.vocabulary_['algorithms']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
_to_similarities = dict(zip(vectorizer.get_feature_names_out().tolist(),
                            word_similarities[0].tolist()))
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{'algorithm': 0.7346482899086904,
 'algorithms': 0.9999999999999998,
 'automl': 0.5373835366555085,
 'context-free': 0.6002903666609867,
 'earley': 0.539177082826312,
 'earley-like': 0.5674692888379949,
 'equations': 0.5339359442375131,
 'grammars': 0.5453604809733716,
 'lols': 0.5923359445604042,
 'n4': 0.5479863906197606,
 'non-deterministic': 0.558096062981535,
 'noncrossing': 0.567667089512389,
 'programming': 0.5398624042758533,
 'stochastic': 0.6197453377639178,
 'tractable': 0.5936803414716441}

In [None]:
import pandas as pd

pd.DataFrame(word_vectors)

| | 0 | 1 | 2 | 3 | 4 | ... | 99 |
|---|---|---|---|---|---|---|---|
| 0 | 2.508522e-04 | 0.000346 | 0.000509 | -0.000189 | 1.353582e-06 | ... | 0.001443 |
| 1 | 1.294723e-04 | 0.000270 | 0.000076 | -0.000026 | 1.000498e-05 | ... | -0.001133 |
| 2 | 1.420705e-03 | 0.003052 | 0.001362 | 0.009152 | -2.549469e-05 | ... | 0.195337 |
| 3 | 1.128181e-02 | 0.019869 | -0.021352 | 0.032185 | 1.165490e-03 | ... | 0.006304 |
| 4 | 1.011907e-03 | 0.000950 | 0.001972 | 0.004322 | 1.514078e-05 | ... | 0.000681 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 74173 | 5.508020e-07 | 0.000003 | 0.000044 | -0.000001 | 3.902008e-08 | ... | -0.000007 |
| 74174 | 4.454054e-07 | 0.000002 | 0.000035 | 0.000001 | 2.599545e-08 | ... | -0.000006 |
| 74175 | 5.823410e-07 | 0.000003 | 0.000048 | -0.000013 | 7.095883e-08 | ... | -0.000007 |
| 74176 | 5.270179e-07 | 0.000003 | 0.000044 | -0.000012 | 6.521394e-08 | ... | -0.000006 |


In [None]:
pca.explained_variance_ratio_

array([0.40516975, 0.10441562, 0.01776554, 0.0148028 , 0.01274815,
       0.0064132 , 0.00601737, 0.00544769, 0.00459983, 0.00425061,
       0.00363456, 0.00355068, 0.00336084, 0.00317367, 0.00291097,
       0.00256747, 0.0024449 , 0.00228277, 0.00224959, 0.00215169,
       0.00203389, 0.00199196, 0.00193537, 0.00189715, 0.00184204,
       0.00177774, 0.00167116, 0.00165467, 0.00161592, 0.00160973,
       0.00153325, 0.00151213, 0.0014922 , 0.00145026, 0.00142907,
       0.00141081, 0.00140521, 0.00137056, 0.00136726, 0.00132098,
       0.00131121, 0.00128607, 0.00126673, 0.00125625, 0.00121832,
       0.00120132, 0.00117512, 0.00117264, 0.00115996, 0.00114554,
       0.0011279 , 0.00111756, 0.00109896, 0.00109112, 0.00107385,
       0.00103719, 0.00102193, 0.00100193, 0.00099382, 0.00097781,
       0.00097445, 0.00095492, 0.00095173, 0.00092537, 0.00092223,
       0.00090699, 0.00090035, 0.00088959, 0.00088761, 0.00088148,
       0.00087294, 0.00085711, 0.00084424, 0.00084023, 0.00083
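
One way to read these ratios is cumulatively: how much of the variance is captured by the first few components together? The sketch below copies only the first five values from the output above, so the totals cover just those five; with the fitted model you could apply the same `np.cumsum` to the full `pca.explained_variance_ratio_` array.

```python
import numpy as np

# First five ratios copied from the output above (the remaining components are omitted here)
first_ratios = np.array([0.40516975, 0.10441562, 0.01776554, 0.0148028, 0.01274815])

# Running total: share of variance captured by the first 1, 2, ... components
cumulative_share = np.cumsum(first_ratios)
for n_kept, share in enumerate(cumulative_share, start=1):
    print(f"first {n_kept} component(s): {share:.3f}")
```

The first two components alone account for roughly half of the variance, and each later component adds very little.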

In [None]:
# try again with "transformer"
_index = vectorizer.vocabulary_['transformer']
word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                      word_vectors)
_to_similarities = dict(zip(vectorizer.get_feature_names_out().tolist(),
                            word_similarities[0].tolist()))
dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-15:])

{'+0.85': 0.6888951447558753,
 '+2.58': 0.6888951447558753,
 'decoder': 0.676695790520244,
 'down-sizing': 0.70342135214239,
 'heads': 0.7374774125445663,
 'iwslt-2017': 0.6888951447558753,
 'layers': 0.6937435166546098,
 'multi-head': 0.7250317776702224,
 'multihead': 0.6903641856083451,
 'self-attention': 0.7940999368347967,
 'straddles': 0.70342135214239,
 'transformer': 1.0000000000000004,
 'transformer-based': 0.6870660863846572,
 'un-pruned': 0.70342135214239,
 'wmt-2017': 0.6888951447558753}

In [None]:
# let's turn the above into a function
def get_sims(word, word_vectors, vectorizer, top_n=15):
  _index = vectorizer.vocabulary_[word]
  word_similarities = cosine_similarity(word_vectors[_index].reshape(1, -1),
                                        word_vectors)
  _to_similarities = dict(zip(vectorizer.get_feature_names_out().tolist(),
                              word_similarities[0].tolist()))
  return dict(sorted(_to_similarities.items(), key=lambda item: item[1])[-top_n:])

In [None]:
get_sims("authorship", word_vectors, vectorizer, top_n=10)

{'attribution': 0.7070898248655817,
 'authors': 0.6200433892746091,
 'authorship': 1.0000000000000002,
 'genre': 0.5401800015899374,
 'loadings': 0.5392901024670296,
 'mbsp': 0.5369898907746585,
 'sensing-intuitive': 0.5369898907746585,
 'stylometric': 0.6822366028875404,
 'sub-genre': 0.5392901024670296,
 'thinking-feeling': 0.5369898907746585}

In [None]:
get_sims("medical", word_vectors, vectorizer, top_n=10)

{'clinical': 0.7473857450222506,
 'doctors': 0.6860717648173797,
 'healthcare': 0.666439776048538,
 'hospital': 0.7059888950615654,
 'medical': 0.9999999999999998,
 'medicine': 0.7282930902760398,
 'notes': 0.704215952721517,
 'patient': 0.8081778307214346,
 'records': 0.7508346603583982,
 'treatments': 0.6796447992735681}
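
The similarity lookups above build a dictionary over the whole vocabulary and then sort it. A possible alternative, sketched here with NumPy only, is to sort the similarity scores directly and return the results most-similar-first. The `top_n_similar` helper and the toy vectors are hypothetical and are not taken from the abstracts corpus.

```python
import numpy as np

def top_n_similar(query_index, vectors, feature_names, top_n=3):
    """Return (word, cosine similarity) pairs, most similar first."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero vectors
    unit = vectors / norms[:, None]
    sims = unit @ unit[query_index]        # cosine similarity to the query word
    best = np.argsort(sims)[::-1][:top_n]  # indices of the highest scores
    return [(feature_names[i], float(sims[i])) for i in best]

# Hypothetical toy vectors: two "parsing" words and two "medical" words
toy_names = ["parser", "parsing", "hospital", "patient"]
toy_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

toy_neighbors = top_n_similar(0, toy_vectors, toy_names, top_n=2)
print(toy_neighbors)
```

Returning a list of pairs keeps the ranking explicit, whereas a printed dictionary may be displayed in alphabetical order.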

# Interpreting the word vector dimensions

Finally, we can try to interpret the top words for each dimension. Let's print the top 7 words for each of the first 15 dimensions.

What we'll find is that some of them are interpretable, and others are less so. For example, the 0th component appears to capture frequent non-English words (e.g., "de", "é" from French) and symbols (\{, \}). The fourth component (=3) includes $LaTeX$ formatting symbols and other simple non-alphabetic characters. The third (=2) looks to be academic code words (e.g., "et", "al.", "e.g."), and so on.

The degree to which your space is interpretable depends on a few factors:

1. How many vocabulary terms you are using at the beginning and how sparse they are
2. How many dimensions you want to learn
3. What your learning algorithm is to generate word vectors (e.g., PCA vs. co-occurrence/mutual information vs. word2vec)

In [None]:
for dim in range(15):
  dim_vecs = word_vectors.T[dim]
  dim_vecs_named = dict(zip(vectorizer.get_feature_names_out().tolist(),
                            dim_vecs.tolist()))
  print(dim)
  print('\t'.join([x[0] for x in sorted(dim_vecs_named.items(), key=lambda item: item[1])[-7:]]))
  print("-" * 100)

0
'	de	.	,	\'e	{	}
----------------------------------------------------------------------------------------------------
1
models	model	language	(	)	.	,
----------------------------------------------------------------------------------------------------
2
tutorial	including	al.	e.g.	et	;	,
----------------------------------------------------------------------------------------------------
3
%	1	\	:	;	(	)
----------------------------------------------------------------------------------------------------
4
116	105	110	97	111	101	32
----------------------------------------------------------------------------------------------------
5
''	{	}	%	models	\	model
----------------------------------------------------------------------------------------------------
6
\	'	''	{	}	system	corpus
----------------------------------------------------------------------------------------------------
7
training	machine	languages	models	data	translation	language
----------------------------------------------

# <font color="red">NOTE: We did not get to anything below on 10/1/2021! We will cover this when we return to semantics after next week</font>.

## Obtaining document representations with LSA

In general, latent semantic analysis (LSA) is a great place to start to explore your data. You can use LSA word vectors in a wide variety of tasks. 

But, because of the way PCA works, we can also create a _document_ representation that lives in the same size space. Basically, we do matrix multiplication between our word embeddings (`word_vectors`) and our original bag-of-words matrix (`bow_abstracts`). 

`bow_abstracts * word_vectors`

In this example, we would obtain a lower-dimensional document matrix that has 100 dimensions instead of 74,178 (one per vocabulary item).
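
To see what that multiplication does to a single document, here is a minimal sketch on made-up numbers (the `toy_` arrays are assumptions, not values from the corpus). Each document embedding is just the count-weighted sum of the vectors of the words it contains.

```python
import numpy as np

# Hypothetical toy inputs: 3 documents x 4 vocabulary items, and 4 words x 2 latent dimensions
toy_bow = np.array([
    [2, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 0, 0],  # an empty document
])
toy_word_vectors = np.array([
    [0.5, 0.1],
    [0.4, -0.2],
    [0.1, 0.7],
    [-0.3, 0.6],
])

toy_doc_embeddings = toy_bow @ toy_word_vectors  # shape (3, 2)

# Row 0 is the count-weighted sum of its words' vectors: 2 * word 0 + 1 * word 2
by_hand = 2 * toy_word_vectors[0] + 1 * toy_word_vectors[2]

print(toy_doc_embeddings.shape)
print(np.allclose(toy_doc_embeddings[0], by_hand))
print(toy_doc_embeddings[2])  # the empty document maps to the zero vector
```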

In [None]:
document_embeddings = pca.transform(bow_abstracts)
print(document_embeddings)
print(document_embeddings.shape)

[[ 4.13830514e+00  1.11154095e+01 -3.03283940e-01 ... -2.98927800e-01
  -8.14621672e-02 -1.11000604e-01]
 [ 3.21887023e+00  8.60508405e+00  1.73180506e+00 ...  9.02496108e-02
  -1.82919169e-02  3.24698321e-01]
 [ 3.10511647e+00  8.49275678e+00  5.39884682e-01 ...  1.69585186e-01
  -1.86813782e-02 -1.04224344e-01]
 ...
 [ 3.04857387e+01  1.25141350e+00  4.79350365e+00 ...  1.65147718e-01
   6.43308758e-01 -7.18689582e-01]
 [ 7.30122405e+00  1.31476322e+00  1.47368206e+00 ...  2.88248065e-02
   3.23041172e-01 -2.52280566e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]]
(27471, 100)


In [None]:
# getting document embeddings is a multiplication problem
a = bow_abstracts @ word_vectors  # @ is explicit matrix multiplication
a==document_embeddings # test for equivalence in methods (np.allclose is safer for floats)

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

# Exploring the document embeddings

### Sort by each dimension to find the "top" match along that dimension.

This document scores the highest on the 0th/first dimension because it is incredibly French.

In [None]:
document_embeddings_df = pd.DataFrame(document_embeddings).rename(columns=str)  # columns '0'..'99'
abstracts[document_embeddings_df.sort_values('0', ascending=False).iloc[0].name]

"La quasi-totalit{\\'e} des {\\'e}tiqueteurs grammaticaux mettent en oeuvre des r{\\`e}gles qui portent sur les successions ou collocations permises de deux ou trois cat{\\'e}gories grammaticales. Leurs performances s{'}{\\'e}tablissent {\\`a} hauteur de 96{\\%} de mots correctement {\\'e}tiquet{\\'e}s, et {\\`a} moins de 57{\\%} de phrases correctement {\\'e}tiquet{\\'e}es. Ces r{\\`e}gles binaires et ternaires ne repr{\\'e}sentent qu{'}une fraction du total des r{\\`e}gles de succession que l{'}on peut extraire {\\`a} partir des phrases d{'}un corpus d{'}apprentissage, alors m{\\^e}me que la majeure partie des phrases (plus de 98{\\%} d{'}entre elles) ont une taille sup{\\'e}rieure {\\`a} 3 mots. Cela signifie que la plupart des phrases sont analys{\\'e}es au moyen de r{\\`e}gles reconstitu{\\'e}es ou simul{\\'e}es {\\`a} partir de r{\\`e}gles plus courtes, ternaires en l{'}occurrence dans le meilleur des cas. Nous montrons que ces r{\\`e}gles simul{\\'e}es sont majoritairement agram

Likewise, this one is clearly a machine translation paper (dimension 8).

In [None]:
abstracts[document_embeddings_df.sort_values('8', ascending=False).iloc[0].name]

"We wrote this report in Japanese and translated it by NEC's machine translation system PIVOT/JE.) IBS (International Business Service) is the company which does the documentation service which contains translation business. We introduced a machine translation system into translation business in earnest last year. The introduction of a machine translation system changed the form of our translation work. The translation work was divided into some steps and the person who isn't experienced became able to take it of the work of each of translation steps. As a result, a total translation cost reduced. In this paper, first, we report on the usage of our machine translation system. Next, we report on translation quality and the translation cost with a machine translation system. Lastly, we report on the merit which was gotten by introducing machine translation."

But the dimensions are not particularly interpretable -- consider if we look at the *worst* matches. In what way is this paper the least similar to that dimension?

In [None]:
abstracts[document_embeddings_df.sort_values('8', ascending=True).iloc[0].name]

'The infrastructure Global Open Resources and Information for Language and Linguistic Analysis (GORILLA) was created as a resource that provides a bridge between disciplines such as documentary, theoretical, and corpus linguistics, speech and language technologies, and digital language archiving services. GORILLA is designed as an interface between digital language archive services and language data producers. It addresses various problems of common digital language archive infrastructures. At the same time it serves the speech and language technology communities by providing a platform to create and share speech and language data from low-resourced and endangered languages. It hosts an initial collection of language models for speech and natural language processing (NLP), and technologies or software tools for corpus creation and annotation. GORILLA is designed to address the Transcription Bottleneck in language documentation, and, at the same time to provide solutions to the general 

What does the least "French" document look like, then?

In [None]:
abstracts[document_embeddings_df.sort_values('0', ascending=True).iloc[0].name]

''

Oh. (Well that explains it. A matrix of 0s will give you 0s everywhere.)
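
One plausible source of an empty abstract is a blank line or trailing newline in the file, since splitting on `'\n'` keeps empty strings. A minimal sketch of filtering them out before vectorizing follows; `raw_text` is a hypothetical stand-in for the decoded file contents.

```python
# Hypothetical stand-in for the decoded file contents; note the trailing newlines
raw_text = "First abstract.\nSecond abstract.\n\n"

lines = raw_text.split("\n")
print(len(lines))  # empty strings are included in the count

# Keep only lines that contain something other than whitespace
non_empty = [line for line in lines if line.strip()]
print(len(non_empty))
```

Filtering changes the row indices, so any lookup from a matrix row back to its abstract should use the filtered list.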

# Closing notes for this week

Thanks for sticking with it! Next week will be very hands on. Please try to come to class in any modality so you can get a running start on the final paper.