# Unsupervised Learning Example - Semantic Spaces

The first machine learning model we will examine in more detail belongs to the class of unsupervised models.
Semantic spaces (also called word embeddings) are a class of machine learning models that try to learn the relationships between words. 

What does this mean? While for a human it is simple to identify that speaking and spoken are very closely related and cat and dog share a relationship, this is not evident for a computer. How can we represent concepts in a form that allows computers to operate based on their meaning?

Early attempts to do so (for example as part of search engines) were based on linguistic analysis of words. Identifying the common base form of verbs and plural and singular nouns allowed computer programs to "understand" that "rocket" and "rockets" are very closely related concepts.

Modern approaches to representing the meaning of concepts compute a numeric representation (a vector of values) for each concept. Once concepts are vectors, we can calculate how similar two of them are by using a metric (e.g. the Euclidean distance or the cosine similarity).
Probably the most famous of these semantic word spaces is Word2Vec ([original Word2Vec paper by Mikolov et al. at Google](https://arxiv.org/pdf/1301.3781.pdf)).
The idea behind Word2Vec is pretty simple. We make the assumption that "you can tell the meaning of a word by the company it keeps" (the linguist J. R. Firth, 1957). This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context. 
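
To make the "company it keeps" idea concrete, here is a minimal sketch that simply counts which words appear near each other. The toy sentences and the helper name `context_counts` are made up for illustration; Word2Vec itself learns dense vectors rather than raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration.
toy_sentences = [
    "i was shocked by the news".split(),
    "i was appalled by the news".split(),
    "i was astonished by the result".split(),
    "the cat sat on the mat".split(),
]

def context_counts(sentences, window=2):
    """Count, for every word, which words occur within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

counts = context_counts(toy_sentences)
shared_similar = sorted(set(counts["shocked"]) & set(counts["appalled"]))
shared_unrelated = sorted(set(counts["shocked"]) & set(counts["cat"]))
print(shared_similar)    # ['by', 'i', 'the', 'was']
print(shared_unrelated)  # ['the']
```

Words used in the same way share most of their neighbors, while unrelated words share only very common ones.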

### Imports and logging

First, we start with our imports and get logging established.

If you are working in a fresh environment, the `gensim` package will probably be missing.
`gensim` is a library specialized in Natural Language Processing methods.
It is an open source library developed by a machine learning consulting company from the Czech Republic. 
Among other algorithms it contains a reliable implementation of `Word2Vec`.

Use conda to install the gensim library on your system.
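
If you are unsure whether the package is already present, a quick check like the following can help. This is only a sketch: the helper name `is_installed` is made up, and the install command in the message assumes the conda-forge channel, which may differ in your setup.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None

if is_installed("gensim"):
    print("gensim is available")
else:
    # Assumed install command; adjust the channel to your own setup.
    print("gensim is missing, try: conda install -c conda-forge gensim")
```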

In [None]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 

Apart from the implementation of the model, the second ingredient that is necessary for us to start with the training of unsupervised models is a dataset. 

You will find the following datasets in the `data` directory of the zip for today's lecture. 

* `swiss-sms.txt.gz` is a set of Swiss German text messages (http://www.sms4science.ch/)
* `reviews_data.txt.gz` is a collection of  English reviews from different web sites.

In [None]:
# to start with specify the swiss-sms.txt.gz data file below. 
data_file="data/swiss-sms.txt.gz"

with gzip.open (data_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


### Read files into a list

Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. Notice in the code below that I am reading the compressed file directly.

I'm also doing some mild pre-processing of each line using `gensim.utils.simple_preprocess(line)`. This performs basic steps such as tokenization (splitting the text into individual words) and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html). 
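
To get a feeling for what such a pre-processing step does, here is a rough pure-Python approximation. It is not gensim's implementation: the function name `rough_preprocess` and the regular expression are assumptions made for this sketch, which lowercases the text, keeps alphabetic tokens and drops very short or very long ones.

```python
import re

# Runs of word characters that contain no digits (an approximation).
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or long tokens."""
    if isinstance(text, bytes):
        # gzip files opened in 'rb' mode yield bytes, so decode them first
        text = text.decode("utf8", errors="ignore")
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

example_tokens = rough_preprocess(b"Hoi Anna, I bi am 8i dehei!\n")
print(example_tokens)  # ['hoi', 'anna', 'bi', 'am', 'dehei']
```

Note how punctuation, digits and single-letter tokens disappear, which is usually what we want before training word vectors.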



In [None]:

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} lines".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized lines into a list
# each line becomes a series of words
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")    

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the lines that we read in the previous step (the `documents`). So we are essentially passing a list of lists, where each inner list contains the tokens of one line of the dataset (one text message or one review). Word2Vec uses all these tokens to internally create a vocabulary, that is, a set of unique words.

Passing `documents` to the constructor already builds the vocabulary and runs a first round of training. Calling `train(...)` afterwards continues training for additional epochs. 
Training time depends on the size of the training data. 
The Swiss SMS dataset is rather small and should train very quickly.

Behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 
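
Why the hidden-layer weights are the word vectors becomes clearer with a tiny sketch. The three-word vocabulary and the weight values below are made up; the point is only that multiplying a one-hot input vector with the weight matrix selects exactly one row of it.

```python
toy_vocabulary = ["gangi", "gahne", "dr"]

# Hidden-layer weights: one row per vocabulary word, one column per dimension.
# The numbers are invented; real training would learn them.
hidden_weights = [
    [0.9, 0.1],   # gangi
    [0.8, 0.2],   # gahne
    [-0.3, 0.7],  # dr
]

def one_hot(word):
    """Encode a word as a vector with a single 1.0 at the word's index."""
    return [1.0 if w == word else 0.0 for w in toy_vocabulary]

def hidden_activation(word):
    """Multiply the one-hot input vector with the weight matrix."""
    x = one_hot(word)
    dims = len(hidden_weights[0])
    return [
        sum(x[i] * hidden_weights[i][d] for i in range(len(toy_vocabulary)))
        for d in range(dims)
    ]

print(hidden_activation("gahne"))  # [0.8, 0.2], the row for 'gahne'
```

Looking up a word vector is therefore just a row lookup in the trained weight matrix.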

In [None]:
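# note: in gensim 4.0 and later the `size` argument is called `vector_size`
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=15, workers=10)
# continue training on the same data for 10 more epochs
model.train(documents, total_examples=len(documents), epochs=10)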

## Inspecting the Semantic Space

This first example shows a simple case of looking up the most similar words. All we need to do here is to call the `most_similar` function and provide a word as input. This returns the top 10 similar words. 

In [None]:

w1 = "gang"
model.wv.most_similar (positive=w1)


That looks pretty good, right? Let's look at a few more. As you can see below the `topn` parameter specifies the number of similar words to return. 

In [None]:
# look up top 6 words similar to 'finito'
w1 = ["finito"]
model.wv.most_similar (positive=w1,topn=6)


That's nice. You can even specify several positive examples to get words that are related in the given context, and provide negative examples to say what should not be considered related. In the example below we are asking for words that relate to `gangi` and `gahne` but not to `fuessball`:

In [None]:
# get words related to 'gangi' and 'gahne' but not to 'fuessball'
w1 = ['gangi','gahne']
w2 = ['fuessball']
model.wv.most_similar (positive=w1,negative=w2,topn=10)
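
One plausible way to think about positive and negative examples is as vector arithmetic: add the (normalised) positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The sketch below illustrates this with invented 2-dimensional vectors and a made-up helper `toy_most_similar`; it is not gensim's actual code.

```python
import math

# Made-up 2-d vectors for illustration only.
toy_vectors = {
    "gangi":     [0.9, 0.1],
    "gahne":     [0.8, 0.3],
    "cho":       [0.7, 0.2],
    "fuessball": [0.1, 0.9],
    "match":     [0.2, 0.8],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def toy_most_similar(positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(toy_vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(toy_vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(toy_vectors[word]))]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in toy_vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar(["gangi", "gahne"], negative=["fuessball"])
print(ranking)
```

In this toy space `cho` ranks first, because it points in the direction of the positive words and away from the negative one.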


### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 
This is the basic usage of the representations of the concepts.
Having a numerical representation allows us to compute the similarity between any two words.



In [None]:
# similarity between two different words
model.wv.similarity(w1="gangi",w2="gahne")

In [None]:
# similarity between two identical words
model.wv.similarity(w1="gangi",w2="gangi")

In [None]:
# similarity between two unrelated words
model.wv.similarity(w1="gangi",w2="dr")

Under the hood, the above three snippets compute the cosine similarity between the word vectors of the two specified words. Comparing a word with itself gives a score of 1.0, the maximum, because the two vectors point in exactly the same direction. In general the cosine similarity lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
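
As a minimal sketch of the computation itself, cosine similarity can be written in a few lines of plain Python. The vectors here are arbitrary examples, not real word vectors; for arbitrary vectors the value can range from -1 (opposite direction) to 1 (same direction).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
same = cosine_similarity(v, v)                          # same direction, about 1.0
opposite = cosine_similarity(v, [-1.0, -2.0, -3.0])     # opposite direction, about -1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions, 0.0
print(same, opposite, orthogonal)
```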

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["gangi","gahne","dr"])
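
A simple way to picture how an odd-one-out search can work: average the (normalised) vectors of all given words and return the word that is least similar to that average. The sketch below uses invented vectors and a made-up function name `odd_one_out`; treat it as an illustration of the idea rather than gensim's exact procedure.

```python
import math

# Made-up 2-d vectors for illustration only.
toy_vectors = {
    "gangi": [0.9, 0.1],
    "gahne": [0.8, 0.3],
    "dr":    [-0.2, 0.9],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {w: unit(toy_vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(words, key=lambda w: sum(x * y for x, y in zip(units[w], mean)))

print(odd_one_out(["gangi", "gahne", "dr"]))  # dr
```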

## Understanding some of the (hyper-) parameters
To train the model earlier, we had to set some hyperparameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=15, workers=10)
```

### `size`
The size of the dense vector that represents each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. Values of 100-150 dimensions are common. In gensim 4.0 and later this parameter is called `vector_size`.

### `window`
The maximum distance between the target word and a neighboring word. Words that are further away from the target word than the window width, to the left or to the right, are not considered as being related to the target word. In theory, a smaller window should strengthen relationships that are synonymous, while larger window sizes favor associative relationships.

    The distinction between synonymous and associative relationships is based on findings in cognitive linguistics. Based on word priming experimentation, two main relations between words have been identified (see [CHIA1990]): synonymous relations (also referred to as similar or semantic relations in the cognitive science literature) and associative relations. As outlined in [CHIA1990], the distinction for both relationship types is not exclusive; that is, word relations are not exclusively synonymous or associative. Doctor - Nurse is an example of a word relation that can be considered as being of a synonymous-associative nature.


    Two terms/words are associatively related if they often appear in a shared context. The following are examples of this type of relationship:

            Spider - Web
            Google - Page rank
            Smoke - Cigarette
            Phone - Call
            Lighter - Fire

    Two terms/words are synonymously related if they share common features. The following are examples of this type of relationship:

            Wolf - Dog
            Cetacean - Whale
            Astronaut - Man
            Car - Van
            Smartphone - iPhone 4s

[CHIA1990] Chiarello, Christine, et al. "Semantic and associative priming in the cerebral hemispheres: Some words do, some words don’t ... sometimes, some places." Brain and Language 38.1 (1990): 75-104.
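
The effect of `window` can be illustrated by listing which neighbors a target word gets paired with. The following is a simplified sketch with a made-up sentence and helper name `context_pairs`; real implementations may add refinements on top of this basic scheme.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for all words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

toy_tokens = ["the", "spider", "sits", "in", "its", "web"]
narrow = [c for t, c in context_pairs(toy_tokens, window=1) if t == "spider"]
wide = [c for t, c in context_pairs(toy_tokens, window=5) if t == "spider"]
print(narrow)  # ['the', 'sits']
print(wide)    # ['the', 'sits', 'in', 'its', 'web']
```

Only the wider window pairs `spider` with `web`, which is the kind of associative relationship described above.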


### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant (spelling mistakes, non-words), so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
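
The filtering step can be pictured with a small sketch. The toy messages and the function name `build_vocabulary` are made up for illustration; the sketch only shows the counting idea, not how gensim stores its vocabulary.

```python
from collections import Counter

# Invented toy data; the last entry is a typo that appears only once.
toy_documents = [
    ["hoi", "wie", "gahts"],
    ["hoi", "guet", "und", "dir"],
    ["hoi", "wie", "gahts", "dir"],
    ["gahtss"],
]

def build_vocabulary(documents, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

full_vocab = sorted(build_vocabulary(toy_documents, min_count=1))
filtered_vocab = sorted(build_vocabulary(toy_documents, min_count=2))
print(full_vocab)
print(filtered_vocab)  # ['dir', 'gahts', 'hoi', 'wie']
```

With `min_count=2` the one-off typo and the other words seen only once are dropped.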

### `workers`
The number of worker threads used for training behind the scenes.
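
If you do not want to hard-code this value, one option is to derive it from the number of CPU cores. The rule of leaving one core free is an assumption of this sketch, not a gensim requirement.

```python
import os

# os.cpu_count() may return None, so fall back to 1 in that case.
available_cores = os.cpu_count() or 1
# Leave one core free for the rest of the system (a rule of thumb).
suggested_workers = max(1, available_cores - 1)
print(suggested_workers)
```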


# Exercise - Hyperparameters


Use the hyperparameters to try to improve generating semantic spaces (word embeddings) based on the Swiss SMS corpus. Follow the steps below to generate different trained word2vec models.

* Change the hyperparameters
* Train a new semantic space
* Explore your newly trained semantic space

Changing hyperparameters is one way to change the behavior of the training.
Hyperparameters are meant as tools to adapt the training process to the nature, context and amount of training data. Since you will train many variants, it is a great advantage if the training process is fast. 

You should address the following questions:

* Which hyperparameters impact the training the most? Try to get an understanding of the effect of each hyperparameter. How do the results change if you adjust it?
* Compare some of your trained semantic spaces. What is your setting that works best for the SMS dataset?
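
To work through the settings systematically, you could enumerate all combinations of a few candidate values. This is only a sketch: the candidate values and the helper name `iter_settings` are made up, and the training call is shown as a comment because it depends on your gensim version.

```python
from itertools import product

# Hypothetical candidate values; pick ranges that suit your data.
grid = {
    "size": [50, 150],
    "window": [2, 10],
    "min_count": [2, 15],
}

def iter_settings(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

settings = list(iter_settings(grid))
print(len(settings))  # 2 * 2 * 2 = 8 combinations
print(settings[0])    # {'size': 50, 'window': 2, 'min_count': 2}

# Each setting could then be passed to the training call used earlier, e.g.
# model = gensim.models.Word2Vec(documents, workers=10, **setting)
# (in gensim 4.0 and later, `size` is called `vector_size`)
```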





# Exercise - Training Data


## Train with Reviews Data

There is a second dataset available for unsupervised training. 

    reviews_data.txt.gz

This dataset consists of review data collected from a variety of sources.
You can find details here: http://kavita-ganesan.com/entity-ranking-data/

## More Data -> Better Results

Adjusting the data used for training is the second major possibility for optimising models we train.
Generally it can be said that:
* more data -> better results
* higher quality data -> better results



However the **domain** of the data must fit our goals.
E.g. review data might be very good to train the meaning of adjectives but less viable to train vocabulary relating to engineering. 

## Re-visiting Hyperparameter Effect

Sometimes the effect of hyperparameters can only be observed if we have enough high quality training data.
Training on datasets that are too small leads to 'random' results, and the observed impact of hyperparameters becomes 'random' as well.

Given the bigger dataset, please re-visit the analysis we did before.

* Switch to the new dataset and start evaluating the effect of the following two hyperparameters:
   * window
   * min_count
   
Having a larger dataset should allow us to make ***more reliable*** observations.

## Saving Models to Disk

When training these larger models we should also start to write the training output to disk.

You can use the `model.wv.save_word2vec_format(path)` function of the model to do so.

### Versioning of Models

Machine learning is a game of complexity.
Even in the simplest setups you will usually deal with quite a range of factors (algorithm implementations, a large number of packages, pre-processing pipelines, training material).

It is therefore ***very important*** to get into the habit of clean and exact house-keeping from the very beginning.

At the least your setup should document and version:

* Training material used (which collection, e.g. a timestamped snapshot or an S3 bucket)
* Codebase used (git tag or commit hash)
* Hyperparameters used

It can be handy to encode some of this information into the serialised model name as shown below. 

E.g. `model.wv.save_word2vec_format('./word2vec_swiss_text_d150_ww10_min2')`
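
A small helper can keep such names consistent and record the remaining details next to the model. The function names `model_name` and `metadata_json`, the field names and the placeholder commit hash below are all assumptions of this sketch.

```python
import json

def model_name(corpus, size, window, min_count):
    """Build a file name that encodes the corpus and the hyperparameters."""
    return "word2vec_{0}_d{1}_ww{2}_min{3}".format(corpus, size, window, min_count)

def metadata_json(corpus_file, code_version, **hyperparameters):
    """Serialise the provenance information as JSON text."""
    record = {
        "corpus_file": corpus_file,
        "code_version": code_version,
        "hyperparameters": hyperparameters,
    }
    return json.dumps(record, indent=2, sort_keys=True)

name = model_name("swiss_text", size=150, window=10, min_count=2)
metadata = metadata_json(
    "data/swiss-sms.txt.gz", "<git commit hash>", size=150, window=10, min_count=2
)
print(name)  # word2vec_swiss_text_d150_ww10_min2
print(metadata)
```

The JSON text could be written to a file with the same base name as the model, so that the two always travel together.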


When training with different hyperparameters during this exercise please save the resulting models. 
We will need them for the next notebook. 
