# Machine learning basics

This notebook explains the basics of machine learning. By the end of this notebook, you will have learned:

- the basic principles of machine learning 
- how features are represented as vectors
- how to train a classifier from vector representations
- how to train and apply a classifier to text represented by its words
- what a bag-of-words representation is
- what the information value (TF*IDF) of a word is

**Background reading:**

NLTK Book
Chapter 6, sections 1 and 3: https://www.nltk.org/book/ch06.html



## 1. Machine Learning schema

The overall process of machine learning is shown in the image below, which is taken from Chapter 6 of the NLTK book. In general, machine learning consists of a training phase in which an algorithm associates data features with certain labels (e.g. sentiment, part-of-speech). The training results in a classifier model that can be applied to unseen data. The classifier compares the features of the unseen data with the previously seen data and predicts a label on the basis of some similarity calculation.

![title](images/ml-schema.pdf)


Two things are crucial in this process: 1) the features that represent the data and 2) the algorithm that is used. In this course, we are not going to discuss the various machine learning algorithms in depth; instead, we focus on text features and how they are represented as so-called vectors. In the case of a text, we need to define the features that characterize the text. These features are transformed into a feature vector representation that the algorithm and model can handle. To compare unseen text with the training texts, it is crucial that features are extracted and represented in the same way during training and application.

**Preparations**

We are going to use the Scikit-learn package to transform the diverse feature values into a vector representation:

https://scikit-learn.org/stable/install.html

Scikit-learn is a package that contains many machine learning algorithms, as well as functions for dealing with features and for carrying out evaluation and error analysis. To install it, run one of the following commands from the command line:

- pip install -U scikit-learn

or
 
- conda install scikit-learn

We are also using a package called "numpy": https://numpy.org.

Install "numpy" from the command line following the instructions on the website. After installing, you can import it.
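A quick way to confirm that both packages are installed is to import them and print their versions. This is only a sanity check and not part of the original instructions; the version numbers on your machine will differ.

```python
# Sanity check: both imports should succeed after installation.
import numpy
import sklearn

print('numpy version:', numpy.__version__)
print('scikit-learn version:', sklearn.__version__)
```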

### 1.1 Vector representations


Before we turn to a text example, we are going to use a very simple data set. We show how to train and evaluate an SVM (Support-Vector-Machine) using a made-up example of multi-class classification for a non-linguistic dataset. The goal is to predict someone's weight category (say: skinny, fit, average, overweight) based on their properties.

We use three features:
* **age in years**
* **height in cms**
* **number of ice cream cones eaten per year**


The feature representation (for 5 people) is an array of arrays. Each instance (or person) is represented by an array of numbers in which the first is the age, the second the height in cm, and the third the number of cones per year: 

In [18]:
X = [[30, 180, 1000], 
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]
    ]

The first person is thus 30 years old, 180 cms tall and eats 1000 cones per year. The next command prints the data for the first instance.

In [7]:
print('First instance in the data set X =', X[0])

First instance in the data set X = [30, 180, 1000]


An array of numbers in which each position holds a value for a specific feature is what we call a feature vector. Every instance in the data set must have a feature vector of the same length. If an instance has no value for a feature, that position is set to zero.
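The following sketch illustrates the fixed-length requirement with plain Python. The feature names and the helper `to_vector` are hypothetical and only serve the illustration; a missing feature is filled with zero, as described above.

```python
# Hypothetical feature names; their order fixes the positions in the vector.
FEATURES = ['age', 'height_cm', 'cones_per_year']

def to_vector(person):
    """Return a fixed-length feature vector; missing features become 0."""
    return [person.get(name, 0) for name in FEATURES]

people = [
    {'age': 30, 'height_cm': 180, 'cones_per_year': 1000},
    {'age': 80, 'height_cm': 180},  # number of cones is unknown
]

vectors = [to_vector(person) for person in people]
print(vectors)
```

Both vectors have three positions, even though the second person has no value for the third feature.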

In addition to the data that is now assigned to the variable 'X', we also need the label that goes with each instance, i.e., the value the model should learn to predict. For this we use another array, which we assign to the variable 'Y'. 

In [11]:
Y = ["overweight", 
     "skinny",
     "fit",
     "average",
     "average"]

We need to have as many values as we have instances in our data set, as the software pairs the elements in X with the elements in Y.

In [12]:
print('The length of the data set =', len(X))
print('The length of the predictions =', len(Y))
print('The first prediction =', Y[0])

The length of the data set = 5
The length of the predictions = 5
The first prediction = overweight
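The pairing of instances and labels is purely positional. A small sketch with Python's built-in `zip` makes this explicit; the variable name `pairs` is only used for this illustration.

```python
X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

# The i-th feature vector in X belongs to the i-th label in Y.
assert len(X) == len(Y), 'X and Y must have the same number of elements'

pairs = list(zip(X, Y))
for features, label in pairs:
    print(features, '->', label)
```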


### 1.2 Using Scikit-learn to build a classifier

Now that we have the data and the labels, we can train a model. We are going to use the **svm** module from **sklearn**, from which we will select the **LinearSVC** (Linear Support Vector Classification) class. For now it is not important to know the details of this algorithm. You will learn about that in the machine learning class. We instantiate a model with the variable name 'lin_classifier' (any name will do). We will use this instance for training and classifying.

In [14]:
from sklearn import svm

lin_classifier = svm.LinearSVC()

Now we train the model by feeding it the data set 'X' and the labels 'Y'. We do this with the 'fit' function.

In [15]:
lin_classifier.fit(X,Y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

While training the model, you might get a warning stating that:
```
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
```
This is to be expected given that we only train using five instances.
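Such warnings are also often related to features that live on very different scales (here: age in tens, cones in hundreds). One common remedy, sketched below as an assumption and not as part of the original notebook, is to standardize the features and to allow more iterations. `StandardScaler` and the `max_iter` parameter are regular scikit-learn names; whether the warning actually disappears depends on your data and library version.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

# Rescale every feature column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Allow more iterations than the default of 1000.
scaled_classifier = LinearSVC(max_iter=10000)
scaled_classifier.fit(X_scaled, Y)
print(scaled_classifier.classes_)
```

If you train on scaled data, any new instance has to be passed through `scaler.transform` before classification, so that it is represented in the same way as the training data.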

### 1.3 Using Scikit-learn to classify unseen data

Let's now apply the model to a new instance 'Z': what weight category does the SVM predict for someone who is 18 years old, 171 cm tall, and eats 400 ice cream cones per year?

In [17]:
Z=[[18, 171, 400]]
predicted_label = lin_classifier.predict(Z)
print(predicted_label)

['average']


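Apparently the SVM predicts **average**. This is not surprising: in our tiny training set, the two **average** instances eat 400 and 500 cones per year, close to the 400 cones of the new instance, and **number of ice cream cones eaten per year** and **height** seem to correlate highly with the weight categories.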
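To see a little more than the final label, you can inspect the score the model assigns to every class. The sketch below retrains the same kind of model and uses `decision_function`, which for `LinearSVC` returns one score per class; the predicted label is the class with the highest score. The exact scores are not shown here because they can vary between runs and library versions, and a convergence warning may appear.

```python
import numpy
from sklearn import svm

X = [[30, 180, 1000],
     [80, 180, 100],
     [50, 180, 100],
     [40, 160, 500],
     [15, 160, 400]]
Y = ["overweight", "skinny", "fit", "average", "average"]

lin_classifier = svm.LinearSVC()
lin_classifier.fit(X, Y)

Z = [[18, 171, 400]]
scores = lin_classifier.decision_function(Z)  # shape: (1, number of classes)

# classes_ lists the labels in the same order as the score columns.
for label, score in zip(lin_classifier.classes_, scores[0]):
    print(label, round(float(score), 2))

best_label = lin_classifier.classes_[numpy.argmax(scores[0])]
print('Highest-scoring label:', best_label)
```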

## 2. Representing a text as a Bag-Of-Words

A critical component of almost any machine learning approach is **feature representation**. 
This is not strange since we need to somehow convert a textual unit, e.g., word, sentence, tweet, or document, into something meaningful that can not only be interpreted by a computer, but is also useful for the type of learning we want to do. 

A text consists of a sequence of words on which we impose syntax and semantics. A machine needs to learn to associate the structural properties of the text with some interpretation.
We can use various properties to do this:

- the words (regardless of the order)
- the words and their frequency
- the part-of-speech of words
- word pairs
- the characters that make up the words
- sentences with words
- phrases
- the meaning of words
- the meaning of combinations of words
- etc....

We get some of the above properties for free if we split a text into tokens (the words), e.g. by using spaces. Still, we need to consider what to do with punctuation and how to treat upper/lower case (the word shape). Other properties are not explicit, such as the part-of-speech of words, phrases, syntax and the meaning.
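The effect of punctuation and case can be seen in a small sketch with plain Python. The example sentence and the regular expression are illustrative choices, not the tokenizer used later in this notebook.

```python
import re

text = "A rose is a Rose, a ROSE!"

# Splitting on spaces keeps punctuation attached and preserves case.
space_tokens = text.split()
print(space_tokens)

# Lowercasing and keeping only word characters merges the variants of "rose".
word_tokens = re.findall(r"\w+", text.lower())
print(word_tokens)
```

With the first approach, 'rose', 'Rose,' and 'ROSE!' are three different tokens; with the second they all become 'rose'.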

For now, we are only considering the words of a text as features. In fact, we are going to ignore the order of the words and consider a text as a *Bag-Of-Words*.

**If you want to learn more: (information from these blogs was used in this notebook)**
* [bag of words introduction](http://www.insightsbot.com/blog/R8fu5/bag-of-words-algorithm-in-python-introduction)
* [TF-IDF introduction](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3)
* [another TF-IDF introduction](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

In the next notebook of this course, we explain how other features can be combined with a word representation.

### 2.1 Bag of words
The bag-of-words approach consists of two main steps that result in a word-to-document index:

* 1 we extract all the unique words from a collection of textual units, e.g., documents
* 2 we compute the frequency of each word in each document.
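The two steps can be sketched by hand with plain Python before turning to scikit-learn. This is a minimal illustration that assumes lowercasing and splitting on spaces as tokenization; the variable names are only used here.

```python
from collections import Counter

docs = ['A rose is a rose',
        'A rose is a flower',
        'A book is nice']

# Simplifying assumption: lowercase and split on spaces.
tokenized = [doc.lower().split() for doc in docs]

# Step 1: collect the unique words of all documents (sorted for a fixed order).
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: count how often each vocabulary word occurs in each document.
counts = []
for tokens in tokenized:
    frequencies = Counter(tokens)
    counts.append([frequencies[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```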


In [33]:
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk

Let's try this for the following three sentences that we list in an array (note that sentences can also be complete documents).

In [34]:
sents = ['A rose is a rose',
         'A rose is a flower',
         "A book is nice"]

We will use the **CountVectorizer** to create the bag of words representation from the above array.

In [35]:
# you can adapt min_df to restrict the representation to more frequent words e.g. 2, 3, etc..

vectorizer = CountVectorizer(min_df=1, # in how many documents the term minimally occurs
                             tokenizer=nltk.word_tokenize) # we use the nltk tokenizer to split the text into tokens
sents_counts = vectorizer.fit_transform(sents)

Printing the so-called "shape" of sents_counts shows us that we have 3 documents and 6 unique words spread over these documents:

In [36]:
# sents_counts has a dimension of 3 (document count) by 6 (# of unique words)
print(sents_counts.shape)
print('unique words:', list(vectorizer.vocabulary_.keys()))

(3, 6)
unique words: ['a', 'rose', 'is', 'flower', 'book', 'nice']


The bag-of-words representation for the text data looks like this:

In [43]:
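# this vector is small enough to view in full! 
# note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('The vocabulary of all the sentences  consists of the following words:', vectorizer.get_feature_names_out().tolist())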

print('The vector representation of the sentences looks as follows:')
sents_counts.toarray()

The vocabulary of all the sentences  consists of the following words: ['a', 'book', 'flower', 'is', 'nice', 'rose']
The vector representation of the sentences looks as follows:


array([[2, 0, 0, 1, 0, 2],
       [2, 0, 1, 1, 0, 1],
       [1, 1, 0, 1, 1, 0]], dtype=int64)

This looks familiar (think about the age, height and cones data set we have seen before). What happened to each sentence representation?

The first array has 6 positions representing the complete vocabulary. The first position represents the first word "a" and it has value '2', which means it occurs twice in the sentence. The fourth slot is for "is" which occurs once and the sixth slot is for "rose" which occurs twice. The other slots are zero because these words do not occur in the first sentence.

Try to figure out if you understand the representation of the other two sentences!
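If you want to check your answer, the sketch below maps every non-zero position back to its word using `vocabulary_`, which stores the column index of each word. As an assumption for self-containment, it uses a `token_pattern` that keeps one-letter words instead of the NLTK tokenizer used above; for these three sentences both yield the same tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose',
         'A rose is a flower',
         "A book is nice"]

# Assumption: this token_pattern stands in for nltk.word_tokenize.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
sents_counts = vectorizer.fit_transform(sents).toarray()

# vocabulary_ maps word -> column index; invert it to read the vectors.
column_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

readable = []
for sentence, row in zip(sents, sents_counts):
    present = {column_to_word[i]: int(count) for i, count in enumerate(row) if count > 0}
    readable.append(present)
    print(sentence, '->', present)
```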


### 2.2 TF-IDF
One big problem of the bag-of-words approach is that it treats all words equally. Why is that a disadvantage? It means that words that occur in many documents, such as *a*, contribute as much to the decisions of the machine learning approach as words that are much more informative, e.g., *rose*. 
TF-IDF addresses this problem by assigning less weight to words that occur in many documents.
You can read a nice introduction to TF-IDF [here](https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3).

This is how you can do it in Python:

In [40]:
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

In [41]:
tf_idf_array = sents_tfidf.toarray()
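# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())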
print(numpy.round(tf_idf_array, decimals=1))

['a', 'book', 'flower', 'is', 'nice', 'rose']
[[0.6 0.  0.  0.3 0.  0.8]
 [0.6 0.  0.5 0.3 0.  0.4]
 [0.4 0.6 0.  0.4 0.6 0. ]]


This is a good result! In the bag-of-words approach, the words *"a"* and *"book"* both had a frequency of 1 in the third sentence. Now that we have applied TF-IDF, we see that the word *"book"* has a higher weight (0.6) than the word *"a"* (0.4), since *"a"* occurs in all three sentences and *"book"* in only one, which might indicate that *"book"* is more informative.
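To see where these numbers come from, the weights can be recomputed by hand. The sketch below follows the default settings of `TfidfTransformer` as described in the scikit-learn documentation: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, multiplied by the raw count, after which every row is scaled to unit length. Other TF-IDF variants exist, so treat this as one specific formulation.

```python
import numpy
from sklearn.feature_extraction.text import TfidfTransformer

# Count vectors of the three sentences (columns: a, book, flower, is, nice, rose).
counts = numpy.array([[2, 0, 0, 1, 0, 2],
                      [2, 0, 1, 1, 0, 1],
                      [1, 1, 0, 1, 1, 0]])

n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)  # in how many documents each word occurs

# Smoothed inverse document frequency: rare words get a higher value.
idf = numpy.log((1 + n_documents) / (1 + document_frequency)) + 1

# Weight the counts and scale each row to unit (L2) length.
weighted = counts * idf
manual_tfidf = weighted / numpy.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()

print(numpy.round(manual_tfidf, decimals=1))
print('Same as scikit-learn:', numpy.allclose(manual_tfidf, sklearn_tfidf))
```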

### 2.3 Training a classifier with word vectors
We have now seen how we can turn a text into a vector representation. We can associate these text representations with labels, as we did above when predicting somebody's weight category. We now use different labels, but note that for the algorithm the labels are meaningless: they could be numbers or any other label.

In [46]:
labels = ["tautology",
          "hyponymy",
          "sentiment"]

In [49]:
lin_classifier_count = svm.LinearSVC()
lin_classifier_count.fit(sents_counts,labels)

lin_classifier_weight = svm.LinearSVC()
lin_classifier_weight.fit(tf_idf_array,labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
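The remark that labels are arbitrary to the algorithm can be illustrated with scikit-learn's `LabelEncoder`, which replaces each distinct label by an integer and can map the integers back. This is only an illustration; the classifiers above accept string labels directly.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["tautology", "hyponymy", "sentiment"]

encoder = LabelEncoder()
numeric_labels = encoder.fit_transform(labels)  # integers follow the sorted label order

print(list(encoder.classes_))
print(numeric_labels)

# The original strings can be recovered from the integers.
recovered = encoder.inverse_transform(numeric_labels)
print(list(recovered))
```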

### 2.4 Representing a new text using the word vectors from the training vocabulary

Next, we can use the same vocabulary to make a new sentence and represent it in the same way as the training example. This means we can only represent words that have been observed during training and we ignore new words.

In [61]:
print('Unique words from the training data:', list(vectorizer.vocabulary_.keys()))

Unique words from the training data: ['a', 'rose', 'is', 'flower', 'book', 'nice']


In [62]:
new_text="a good book is a rose"

# the vector representation using the vocabulary of the training data 
# (column order: 'a', 'book', 'flower', 'is', 'nice', 'rose')
# would look as follows:
new_text_vector=[[2, 1, 0, 1, 0, 1]]

Note that the word "good" is not represented, as it does not occur in the training vocabulary. The trained model has never seen this feature, so it cannot take it into account.
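Instead of writing the vector by hand, you can let the fitted vectorizer produce it with `transform`, which reuses the training vocabulary and silently drops unknown words. As an assumption for self-containment, the sketch uses a `token_pattern` that keeps one-letter words instead of the NLTK tokenizer; for these sentences both give the same tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

sents = ['A rose is a rose',
         'A rose is a flower',
         "A book is nice"]

# Assumption: this token_pattern stands in for nltk.word_tokenize.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(sents)  # learn the vocabulary from the training sentences only

new_text = "a good book is a rose"

# transform() uses the existing vocabulary; "good" is ignored.
new_text_vector = vectorizer.transform([new_text]).toarray()
print(new_text_vector)
```

The result is the same vector as the hand-written one above.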

In [63]:
predicted_label = lin_classifier_count.predict(new_text_vector)
print(predicted_label)

['sentiment']


In [64]:
predicted_label = lin_classifier_weight.predict(new_text_vector)
print(predicted_label)

['tautology']
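Note that the last call passes raw counts to a classifier that was trained on TF-IDF weights. Since features should be represented in the same way during training and application, a more consistent variant would first weight the new vector with the already fitted `TfidfTransformer`. The sketch below rebuilds the steps in one place; as an assumption, it uses a `token_pattern` instead of the NLTK tokenizer, and no expected label is shown because the outcome on such a tiny data set can vary.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sents = ['A rose is a rose',
         'A rose is a flower',
         "A book is nice"]
labels = ["tautology", "hyponymy", "sentiment"]

# Training: counts -> TF-IDF weights -> classifier.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # assumption: replaces nltk.word_tokenize
sents_counts = vectorizer.fit_transform(sents)
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)

lin_classifier_weight = svm.LinearSVC()
lin_classifier_weight.fit(sents_tfidf, labels)

# Application: reuse the fitted vectorizer and transformer (transform, not fit_transform).
new_counts = vectorizer.transform(["a good book is a rose"])
new_tfidf = tfidf_transformer.transform(new_counts)

predicted_label = lin_classifier_weight.predict(new_tfidf)
print(predicted_label)
```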


End of this notebook. Please continue with notebook Lab4.2.ml.linguistic_features.