# Machine Learning with Python

## Machine Learning Concepts

A very brief introduction to machine learning concepts - by no means comprehensive.

### Data Representation

Data has to be organized and quantified so the learning algorithm can operate on it. For example, for an algorithm to recognize text in an image, the image has to be analyzed to determine the outlines of characters, and the characters have to be analyzed to determine words (a little more detail at https://www.quora.com/How-does-the-Tesseract-API-for-OCR-work). Or, for the example we'll look at today, a learning algorithm can learn to recognize documents by using a dictionary of words and counting the number of times each word appears in the paragraphs of the documents.

### Bag of Words (BoW) Models

A simple and common way of representing documents is to create a list of words found in the documents, and then list the number of times each word is found in the document.

Using two simple documents from https://en.wikipedia.org/wiki/Bag-of-words_model:

1. `John likes to watch movies. Mary likes movies too.`
1. `John also likes to watch football games.`

The documents can be represented by these bags of words:

1. `{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}`
1. `{"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}`

After analyzing a set of documents, a list of all the distinct words can be used to create an array of the number of times each known word appears in each document. So, for these documents, we'd use this list of words:

```
["John","likes","to","watch","movies","Mary","too","also","football","games"]
```

Then, we can represent each document by a list of numbers of times each word appears:

1. `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]`
1. `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]`
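
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library does it: the `tokenize` and `to_vector` helpers are hypothetical names, and the tokenizer simply keeps runs of letters and drops punctuation.

```python
from collections import Counter
import re

def tokenize(text):
    # Keep runs of letters only; punctuation is dropped (an assumption of this sketch).
    return re.findall(r"[A-Za-z]+", text)

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

# Build the vocabulary in order of first appearance across all documents.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(text, vocabulary):
    # Count each word, then read the counts out in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```

The printed vocabulary and vectors should match the word list and the two count lists shown above.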

#### Context

Bag-of-words representations are limited because they discard word order, so there is no context between words. "Rain fall" produces the same bag as "fall rain", even though the two read differently to us humans. We can extend the vocabulary to phrases by using N-grams: grouping adjacent words, like "John likes", "likes to", and "to watch", to capture some context between individual words.
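
As a rough sketch of the N-gram idea, the hypothetical helper `ngrams` below slides a window of `n` words over a token list. Splitting on whitespace is an assumption made to keep the example short.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "John likes to watch movies".split()
bigrams = ngrams(tokens, 2)
print(bigrams)

# Unlike single-word counts, bigrams distinguish word order.
same_order = ngrams(["rain", "fall"], 2) == ngrams(["fall", "rain"], 2)
print(same_order)
```

Each bigram can then be treated as one vocabulary entry and counted exactly like a single word.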

#### Junk Words (Stop words)

Some words occur so frequently that they are not helpful. 'A', 'an', and 'the' are mostly useless to machine learning algorithms. A textual machine-learning algorithm will typically ignore them.
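
A minimal sketch of stop-word filtering follows. The `STOP_WORDS` set here is a tiny, made-up list for illustration; real lists, such as the one in `nltk.corpus.stopwords`, are much longer.

```python
# A tiny, hypothetical stop-word list used only for this example.
STOP_WORDS = {"a", "an", "the", "to", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so that "The" and "the" are both removed.
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The cat sat on a mat".split()
filtered = remove_stop_words(tokens)
print(filtered)
```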


### Learning (Training) and Testing

There are two general steps. During learning, an algorithm is trained using input data. During testing, the learned concept is checked against test data to determine its effectiveness. A common approach is to randomly select 80% of the input documents for training, perform the training operation, and then test the learned concept on the other 20% of the documents to determine efficacy.
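
The 80/20 split described above might look like the following sketch. The function name `split_documents` and the fixed `seed` are assumptions made for illustration; the seed only makes the shuffle repeatable.

```python
import random

def split_documents(items, train_fraction=0.8, seed=0):
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

documents = [f"doc{i}" for i in range(10)]
train, test = split_documents(documents)
print(len(train), "training documents,", len(test), "test documents")
```

Every document lands in exactly one of the two lists, so the test set contains only documents the algorithm has not seen during training.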

### Supervised Learning

Training data specifies output (results) - each input has a "label" that the algorithm learns.
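
To make the idea of labels concrete, here is a deliberately crude sketch: it counts the words seen under each label and predicts the label whose training words overlap most with a new text. The labels, the training texts, and the `train`/`predict` helpers are all hypothetical, and this word-overlap scorer is a stand-in for a real learning algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training data: (text, label) pairs.
training_data = [
    ("John likes to watch movies", "film"),
    ("Mary likes movies too", "film"),
    ("John also likes to watch football games", "sport"),
]

def train(examples):
    # For each label, count how often each word appears in its texts.
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts

def predict(model, text):
    # Score each label by how often it has seen the words in the new text.
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

model = train(training_data)
prediction = predict(model, "football games tonight")
print(prediction)
```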

### Unsupervised Learning

Training data lacks specified results, so the learning algorithm has to find groupings or other structure in the data on its own.

### Semi-supervised Learning

Some training data includes specific labels, and the algorithm has to generalize its learning from those labels.

### Feedback Learning

An oracle observes the learning algorithm's outputs and gives feedback to the algorithm so it can improve its learned concept.


## Gensim

The Python gensim module provides a number of machine learning algorithms that work with text:
* Word2Vec
* Doc2Vec
* FastText
* Latent Semantic Analysis (LSI, LSA, see LsiModel)
* Latent Dirichlet Allocation (LDA, see LdaModel)

These algorithms learn statistical patterns in text. They are unsupervised, which means no human labeling is necessary: you only need a corpus (a large set) of document files.

An explanation of how word2vec works: http://www.deeplearningweekly.com/blog/demystifying-word2vec 

> You can also generalize inputs to algorithms like these to work with non-text data. If you can construct a "dictionary" or "vocabulary" of possible inputs over your data (like a set of words), the algorithms can be applied to your data just like it would operate on plain text.
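
As one hedged illustration of that idea, the sketch below turns made-up, non-text event records into string tokens, so that each session looks like a "sentence" of "words". The session data and the `event_to_token` helper are hypothetical.

```python
# Hypothetical non-text data: each session is a sequence of (action, page) events.
sessions = [
    [("view", "home"), ("view", "product"), ("click", "buy")],
    [("view", "home"), ("search", "results"), ("view", "product")],
]

def event_to_token(event):
    # Encode one event as a single string so it can play the role of a word.
    action, page = event
    return f"{action}:{page}"

sentences = [[event_to_token(event) for event in session] for session in sessions]
vocabulary = sorted({token for sentence in sentences for token in sentence})
print(sentences)
print(vocabulary)
```

A list of token lists like `sentences` has the same shape as the tokenized sentences that text algorithms such as Word2Vec take as input.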

You'll need to use pip to install gensim. Also install nltk for easy access to Project Gutenberg (public domain) documents. In the Gerdin lab:
```
pip install gensim
pip install nltk
```
Or, on your own Windows laptop:
```
pip install --user gensim
pip install --user nltk
```

Prebuilt wheels containing compiled code for the gensim submodules are available. Pip should be able to download and install them.

Let's download the Gutenberg corpus and the punctuation-aware tokenizer from nltk, read and parse Macbeth, train a Word2Vec model on it, and look up the words most similar to a given word:

In [52]:
import nltk
from nltk.corpus import gutenberg
import gensim
from gensim import models, similarities

nltk.download('gutenberg')
nltk.download('punkt')
macbeth_sentence = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_model = gensim.models.Word2Vec(macbeth_sentence, min_count=1, vector_size=32)  # gensim 4.x; use size=32 on gensim 3.x

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/ghelmer/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ghelmer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [53]:
macbeth_model.wv.most_similar('King')


[('and', 0.9994088411331177),
 ('of', 0.9994024038314819),
 (';', 0.9993935227394104),
 ('d', 0.9993923306465149),
 ('my', 0.9993640184402466),
 ('me', 0.9993475675582886),
 ('with', 0.9993475079536438),
 ('to', 0.9993457198143005),
 ('The', 0.9993445873260498),
 ('be', 0.9993438720703125)]

Not the greatest results -- we need more data than just one book.

Let's do something similar with Milton:

In [48]:
milton_sentence = gutenberg.sents('milton-paradise.txt')
milton_model = gensim.models.Word2Vec(milton_sentence, min_count=1, vector_size=32)  # gensim 4.x; use size=32 on gensim 3.x

In [50]:
milton_model.wv.most_similar('Mercy')


[('risen', 0.9618339538574219),
 ('provoke', 0.9600409865379333),
 ('stole', 0.959398627281189),
 ('courage', 0.9587253332138062),
 ('dove', 0.9584486484527588),
 ('feed', 0.9582011103630066),
 ('vouchsafed', 0.9578553438186646),
 ('endured', 0.9577709436416626),
 ('Sea', 0.9576237797737122),
 ('fabled', 0.9570697546005249)]

In [54]:
# Lookup words similar to Evil
mm_evil = ['Evil']
milton_model.wv.most_similar(positive=mm_evil, topn=10)


[('rash', 0.9622334241867065),
 ('Companion', 0.9583115577697754),
 ('oak', 0.9513239860534668),
 ('growth', 0.9511553049087524),
 ('listen', 0.9502556324005127),
 ('crushed', 0.9500308632850647),
 ('tumultuous', 0.9494795799255371),
 ('loth', 0.94921875),
 ('excels', 0.9474302530288696),
 ('defiance', 0.9470477104187012)]

Again, not great results. We would need a larger corpus to get better results.
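
The scores returned by `most_similar` are cosine similarities between word vectors. When nearly every word scores around 0.999, as in the Macbeth output, the vectors all point in almost the same direction, which suggests the model has not yet learned to tell the words apart. The sketch below computes cosine similarity by hand; the helper name `cosine_similarity` is ours, and the two vectors are the bag-of-words count lists from earlier, used only as sample input.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
doc2 = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
similarity = cosine_similarity(doc1, doc2)
print(similarity)
```

A value of 1.0 means two vectors point in the same direction, and 0.0 means they share nothing.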

## Word2Vec Visualization

Handy visualizations from Word2Vec data:

https://github.com/anvaka/word2vec-graph 

