## fastText tutorial for unsupervised learning

See [A Visual Guide to FastText Word Embeddings](https://amitness.com/2020/06/fasttext-embeddings/) for an introduction.

Also see https://fasttext.cc/docs/en/unsupervised-tutorial.html

We’ll use a small portion of text data extracted from English Wikipedia (50,000 random lines), which we’ve placed in `data/wiki_sample.txt`.

In [3]:
%%bash

set -x
wc -l /workspace/search_with_machine_learning_course/data/wiki_sample.txt

+ wc -l /workspace/search_with_machine_learning_course/data/wiki_sample.txt


50000 /workspace/search_with_machine_learning_course/data/wiki_sample.txt


In [4]:
%%bash
set -x

mkdir -p /workspace/fasttext-unsupervised
cp /workspace/search_with_machine_learning_course/data/wiki_sample.txt /workspace/fasttext-unsupervised

+ mkdir -p /workspace/fasttext-unsupervised
+ cp /workspace/search_with_machine_learning_course/data/wiki_sample.txt /workspace/fasttext-unsupervised


### Create unsupervised model

This will create the `wiki.bin` model file and the `wiki.vec` file, which lists each word with its vector.

We’ve also used `-maxn 0` to exclude subword information. Subwords – that is, parts of words – can be very useful, especially for unknown or rare words, stemming variations, etc. But they can also introduce noise, and for this smallish dataset, we’re optimizing for simple, clean results.
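To make "subwords" concrete, here is a minimal sketch of how character n-grams can be extracted from a word. It assumes the scheme described for fastText: the word is wrapped in `<` and `>` boundary markers and every n-gram of length `minn` to `maxn` is collected. The function name `char_ngrams` is made up for illustration, and the real library additionally hashes n-grams into buckets, so treat this as a simplification rather than the library's implementation.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    if maxn <= 0 or minn > maxn:
        # Mirrors the effect of `-maxn 0`: no subword units at all.
        return []
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(max(minn, 1), maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
print(char_ngrams("where", maxn=0))
```

With `maxn=0` the list is empty, which is why the model trained here learns one vector per whole word and nothing else.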

One useful parameter that we didn’t explore before is `-minCount`, which specifies the minimum number of times a word must occur in the corpus to be included in the model. It defaults to 5, but setting a higher number (e.g., 50) will remove lots of misspellings and other rare words. Of course, setting it too high may remove words we would rather keep. Everything is a trade-off.
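The effect of a minimum-count threshold can be sketched in a few lines of Python. This is only an illustration of the idea, assuming plain whitespace tokenization; `build_vocab` and the toy corpus are hypothetical, and fastText's own vocabulary construction may differ in its details.

```python
from collections import Counter


def build_vocab(lines, min_count=5):
    """Count whitespace-separated tokens and keep those seen at least `min_count` times.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    counts = Counter(token for line in lines for token in line.split())
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy corpus: "dgo" stands in for a rare misspelling.
corpus = ["the cat sat", "the cat ran", "the dgo ran"]
vocab = build_vocab(corpus, min_count=2)
print(sorted(vocab))
```

Raising `min_count` drops the misspelling `dgo`, but it also drops `sat`, a legitimate word that happens to be rare in this sample. That is the trade-off in miniature.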

In [6]:
%%bash
cd /workspace/fasttext-unsupervised
set -x

fasttext skipgram -input wiki_sample.txt -output wiki -maxn 0
wc -l wiki.vec

+ fasttext skipgram -input wiki_sample.txt -output wiki -maxn 0
Read 0M words
Number of words:  9871
Number of labels: 0
Progress: 100.0% words/sec/thread:   99647 lr:  0.000000 avg.loss:  2.473000 ETA:   0h 0m 0s
+ wc -l wiki.vec


9872 wiki.vec
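The line count is one higher than the 9871 words reported during training because the `.vec` text format starts with a header line giving the vocabulary size and the vector dimension. Every following line holds a word and then its vector components, separated by spaces. A minimal sketch of a reader for that layout follows; `load_vec` is a hypothetical helper, and the sample data is invented so the snippet runs without the real file.

```python
import io


def load_vec(handle):
    """Parse the .vec text layout: a 'count dim' header, then 'word v1 v2 ...' lines.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors


# Invented two-word sample standing in for wiki.vec.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nof 0.4 0.5 0.6\n")
n_words, dim, vectors = load_vec(sample)
print(n_words, dim, len(vectors))
```

To read the real file, pass `open("wiki.vec", encoding="utf-8")` instead of the in-memory sample.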


### Nearest neighbors as synonyms

The fastText library comes with a nearest-neighbor method that we can use to obtain synonyms.

To query the model interactively, run:

`fasttext nn wiki.bin`

In [16]:
%%bash
cd /workspace/fasttext-unsupervised

echo "politics" | fasttext nn wiki.bin

echo "linux" | fasttext nn wiki.bin

Query word? governors 0.881063
municipal 0.877317
offices 0.867158
privy 0.866503
constitutional 0.864229
ministers 0.862625
elections 0.8536
approved 0.852035
senators 0.851854
cooperation 0.851632
Query word? Query word? inputs 0.940591
sampling 0.937156
unix 0.936299
compiler 0.934591
random 0.933981
kernel 0.93357
java 0.93094
files 0.929595
ascii 0.926187
behavior 0.924533
Query word? 
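In the output above, each `Query word?` prompt is printed before the results, so the first ten lines are the neighbors of `politics` and the last ten are the neighbors of `linux`. The scores can be read as cosine similarities between word vectors. The sketch below shows that ranking on a toy vocabulary, assuming vectors are held in a plain dictionary; the function names and the three-dimensional vectors are invented for illustration and are not taken from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, vectors, k=10):
    """Return the k words most similar to `query`, excluding the query itself."""
    target = vectors[query]
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:k]]


# Invented toy vectors; real fastText vectors have 100 dimensions by default.
toy = {
    "linux": [1.0, 0.9, 0.0],
    "unix": [0.9, 1.0, 0.1],
    "politics": [0.0, 0.1, 1.0],
    "elections": [0.1, 0.0, 0.9],
}
neighbors = nearest_neighbors("linux", toy, k=2)
print(neighbors)
```

Note that a high similarity means the words appear in similar contexts, which is why the real output mixes close relatives (`unix`, `kernel`) with loosely related terms (`random`, `behavior`) rather than strict synonyms.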