<a href="https://colab.research.google.com/github/Vlasovets/Deep_learning_course_assistantship/blob/master/mouse_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced topics in User Interfaces

## **Learning goals** 
The tutorial covers the topics explained during the lecture. Its goals are:
1.   to get introduced to common deep learning models (e.g. LSTM)
2.   to understand the pipeline of model creation in TensorFlow
3.   to train and test models on given datasets
4.   to work with different parameters of the model

## 1 Introduction



### 1.1. Mouse movement 
Machine learning has become critical for marketers and advertisers, who need to analyze a constant stream of signals in real time and deliver ads at the right moments to the right people. Many researchers have focused on identifying whether a series of actions was performed by a human or a bot. It is a very applied research area, since people increasingly shop online and the placement of an advertisement makes a difference for the company.

**Example**

By placing ads properly on a website, a company can directly see what its users' preferences are. On top of these insights, the company also learns which people are actually interested in its products.
 

In [42]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://image.winudf.com/v2/image/Y29tLnRodW5rYWJsZS5hbmRyb2lkLm9mZmljaWFsYXBwNDYxOC5Nb25leV9NYWtlcl9pY29uXzE1MTQzNDk2OTZfMDEy/icon.png?w=170&fakeurl=1", width=200, height=200)

### 1.2. Dataset
This tutorial provides a glimpse of how data from mouse movements can be used. The dataset contains the following information:


*   User ID
*   View point width
*   View point height
*   Age
*   Movements

*Please,add some details about the dataset*

### 1.3. Prerequisites

In [43]:
# complementary code
%tensorflow_version 1.x


TensorFlow is already loaded. Please restart the runtime to change versions.


Make sure that you use the right version of TensorFlow: this notebook relies on the TensorFlow 1.x API (for example `tf.ConfigProto` and `tf.Session`).
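
One way to check the version is to look at its major number. The sketch below uses a small, hypothetical helper (`is_tf1` is not part of TensorFlow) that works on a plain version string, so it can be tried without TensorFlow installed. The version strings are illustrative values, not read from a real installation.

```python
def is_tf1(version):
    """Return True if a version string such as '1.15.2' has major version 1."""
    major = version.split('.')[0]
    return major == '1'


# Example version strings (illustrative values only).
print(is_tf1('1.15.2'))  # True
print(is_tf1('2.4.1'))   # False
```

In the notebook itself, the string to check would be `tf.__version__` after `import tensorflow as tf`.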

In [44]:
from google.colab import drive
drive.mount('/content/drive') # complementary code

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Do not forget to upload the data (`mousemoves.csv`) before running the cells below.

In [0]:
import csv
import numpy as np
from sklearn.model_selection import train_test_split
# Use the public `tensorflow.keras` API instead of the private `tensorflow.python` package.
from tensorflow.keras.models import Sequential, save_model
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences

Import all necessary libraries and set a configuration for the following model.

In [46]:
# Some GPUs require setting the `allow_growth` setting.
# Comment out this code if you don't have a GPU card.
import tensorflow.compat.v1 as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)

<tensorflow.python.client.session.Session at 0x7f46304eb898>

## 2 Model

### 2.1. Data cleaning

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete or improperly formatted. 

In real life, it is very rare for the provided data to be clean and ready to use without additional manipulation. As a common rule of thumb, about 80% of the typical workload of a person working with data goes into preparing the data for the actual analysis.
There are several methods for cleaning data, depending on the dataset; here we use the following steps:

In [0]:
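def load_dataset(filename):
    """Read the tab-separated dataset and return movements (X) and age labels (y)."""
    X, y = [], []
    with open(filename, newline='') as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter='\t')
        for row in csv_reader:
            moves = parse_moves(row['movements'])
            age = parse_age(row['age'])
            X.append(moves)
            y.append(age)
    # Sequences may differ in length, so X is stored as an array of Python objects.
    X, y = np.array(X, dtype=object), np.array(y)
    return X, y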

`def` declares a function named `load_dataset` with the parameter `filename` (more [info](https://wiki.python.org/moin/BeginnersGuide)). The `for` loop then reads each row and parses the two columns we need, `movements` and `age`. Finally, we create an [np.array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) for our input (X) and target (y) variables. 

In [0]:
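def parse_moves(seq):
    """Turn 'x;y;t,x;y;t,...' into a list of [x, y] integer pairs."""
    result = []
    coords = seq.split(',')
    for coord in coords:
        # The third field (t) is read but not used as a feature.
        x, y, t = coord.split(';')
        result.append([int(x), int(y)])
    return result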


In [0]:
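def parse_age(age):
    """Map an age group name to an integer class label."""
    if age == 'young':
        return 0
    elif age == 'adult':
        return 1
    elif age == 'elder':
        return 2
    # Fail loudly instead of silently returning None for unexpected values.
    raise ValueError('Unknown age group: {!r}'.format(age))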

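The functions `parse_moves` and `parse_age` convert the raw text columns into numbers: `parse_moves` turns the movement string into a list of `[x, y]` coordinates, and `parse_age` encodes the three age groups as the class labels 0, 1 and 2: 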


*   young
*   adult
*   elder
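
To see what these two functions produce, the sketch below applies them to one hand-made row. The sample values are made up for illustration and are not taken from `mousemoves.csv`; the functions are repeated in compact form so that the block runs on its own.

```python
def parse_moves(seq):
    # Each move is 'x;y;t'; moves are separated by commas.
    return [[int(x), int(y)] for x, y, _t in (c.split(';') for c in seq.split(','))]


def parse_age(age):
    # Integer class labels for the three age groups.
    return {'young': 0, 'adult': 1, 'elder': 2}[age]


# Hypothetical row with three mouse positions.
row = {'movements': '10;20;0,15;25;40,30;5;90', 'age': 'adult'}

print(parse_moves(row['movements']))  # [[10, 20], [15, 25], [30, 5]]
print(parse_age(row['age']))          # 1
```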



In [0]:
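def create_model(shape):
    # shape is (time steps, features); the number of time steps is reused as the LSTM unit count.
    units = shape[0]
    model = Sequential()
    model.add(Bidirectional(LSTM(units), input_shape=shape))
    # Dropout randomly zeroes half of the activations during training.
    model.add(Dropout(0.5))
    # One output per age group; softmax turns the outputs into class probabilities.
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model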

Finally, we create our BiLSTM (bidirectional LSTM) model. The idea of the model is straightforward: the recurrent layer is duplicated so that there are two layers side by side; the first layer receives the input sequence as-is and the second receives a reversed copy of it. The output layer uses the softmax function as its activation function, and the model is trained with sparse categorical crossentropy as the loss function. 
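
The two layers side by side also show up in the size of the model. The following sketch counts the trainable parameters with the standard LSTM formula (four gates, each with input weights, recurrent weights and a bias). The input shape of 222 time steps with 2 features is an assumption, inferred from the output width of 444 in the model summary printed in Section 2.2.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and one bias per unit.
    return 4 * units * (input_dim + units + 1)


def dense_params(inputs, outputs):
    # One weight per input/output pair plus one bias per output.
    return inputs * outputs + outputs


time_steps, features = 222, 2   # assumed input shape (see the text above)
units = time_steps              # create_model uses shape[0] as the unit count

bilstm = 2 * lstm_params(units, features)  # forward layer + backward layer
dense = dense_params(2 * units, 3)         # both directions are concatenated

print(bilstm)          # 399600
print(dense)           # 1335
print(bilstm + dense)  # 400935
```

Under these assumptions the numbers agree with the `Param #` column of the summary, which suggests that the bidirectional wrapper simply doubles a single LSTM layer.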

For evaluation we pick the accuracy metric.
It is the ratio of the number of correct predictions to the total number of input samples. It works well if each class has roughly the same number of samples.
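
A small sketch shows why class balance matters. The labels below are made up for illustration: a model that always predicts the majority class reaches a high accuracy on imbalanced data without having learned anything.

```python
def accuracy(y_true, y_pred):
    # Ratio of correct predictions to the total number of samples.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical imbalanced labels: 8 samples of class 1, one each of classes 0 and 2.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 2]
always_adult = [1] * len(y_true)

print(accuracy(y_true, always_adult))  # 0.8
```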

In [52]:
from google.colab import files # complementary code
uploaded = files.upload()

Saving mousemoves.csv to mousemoves (2).csv


### 2.2. Model training
**How to Train Deep Learning Models?**

It might seem like it took us a while to get here, but professional data scientists actually spend the bulk of their time on the steps leading up to this one:
* Exploring the data.
* Cleaning the data.
* Engineering new features.

Again, that’s because better data beats fancier algorithms.

Now we have finished the steps above and are ready to fit the model as follows:

In [53]:
if __name__ == '__main__':
    # The saved model file will be named like the dataset file.
    dataset_file = 'mousemoves.csv'

    X, y = load_dataset(dataset_file)

    # All sequences must have the same length.
    X = pad_sequences(X)

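    # Create partitions. A fixed random_state makes the split reproducible,
    # so the same test partition can be recreated later.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)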

    # Set callbacks, for monitoring progress.
    cb_tensorboard = TensorBoard(log_dir='/tmp/mouse_logs')
    cb_earlystopping = EarlyStopping(patience=10)
    cb_checkpoint = ModelCheckpoint('/tmp/mouse_logs/best.h5', save_best_only=True)

    # Train the model.
    model = create_model(X_train[0].shape)
    model.summary()
    history = model.fit(
        X_train,
        y_train,
        validation_data=(X_test, y_test),
        epochs=100,
        callbacks=[cb_tensorboard, cb_earlystopping, cb_checkpoint]
    )

    # Evaluate the model.
    loss, acc = model.evaluate(X_test, y_test)
    print('ACC: {:.2f}'.format(acc))

    # Save the model.
    save_model(model, '{}.h5'.format(dataset_file))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 444)               399600    
_________________________________________________________________
dropout (Dropout)            (None, 444)               0         
_________________________________________________________________
dense (Dense)                (None, 3)                 1335      
Total params: 400,935
Trainable params: 400,935
Non-trainable params: 0
___________

In [66]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://miro.medium.com/max/1024/1*cDhZ56QNC5mrl6kjE0C2JA.png", width=500, height=350)

You might wonder what an 'epoch' in the training output is.

In deep learning, the number of epochs is a [hyperparameter](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) that is defined before training a model. One epoch is one complete pass of the entire training dataset through the neural network, forward and backward.

An entire dataset is usually too large to feed to the computer at once, so we divide it into several smaller batches. We use more than one epoch because passing the entire dataset through the neural network once is not enough: the weights are updated iteratively, so we need to pass the full dataset through the same network multiple times. The batch size is the number of training examples in a single batch, and the number of iterations is the number of batches needed to complete one epoch.

**Example**: 

If we divide a dataset of 2000 training examples into batches of 500 examples, then 4 iterations will complete 1 epoch.
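
The same arithmetic as a short sketch. The second call uses the 204 training samples of this tutorial together with a batch size of 32, which is the Keras default for `model.fit` when `batch_size` is not given; the last batch is then smaller than the others.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    # A final, smaller batch still counts as one iteration, hence the ceiling.
    return math.ceil(num_samples / batch_size)


print(iterations_per_epoch(2000, 500))  # 4
print(iterations_per_epoch(204, 32))    # 7
```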

**Our result for training**:

Training on 204 training and 51 validation samples finished with an accuracy of 0.37, which is very low: with three classes, random guessing would already reach about 0.33 if the classes were equally frequent. One possible explanation is a large amount of noise in the data. You can read about how to tune the model and increase the accuracy [here](https://www.kdnuggets.com/2019/01/fine-tune-machine-learning-models-forecasting.html).

### 2.3. Model testing

Once we have trained our model, we would like to understand how it works on real data. We treat the test dataset as real data for evaluating our model. The reason is simple: we would like to find out whether we are overfitting. [Overfitting](https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/) refers to a model that fits the training data too well and, as a result, performs worse on data it has not seen before. 
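
A common sign of overfitting is a validation loss that stops improving while training continues. The `EarlyStopping(patience=10)` callback used during training, which monitors the validation loss by default, reacts to this. The sketch below is a simplified re-implementation of that idea, not the Keras code itself, and the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: improving at first, then getting worse.
val_losses = [1.10, 1.05, 1.02, 1.03, 1.06, 1.09, 1.12]

print(early_stopping_epoch(val_losses, patience=3))  # 5
```

Here the best validation loss is reached at epoch index 2, and the three following epochs without improvement trigger the stop.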

In [56]:
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split
# Uncomment the next line if your model code is stored in a separate file.
#from mouse_train import load_dataset

# Some GPUs require setting the `allow_growth` setting.
# Comment out this code if you don't have a GPU card.
import tensorflow.compat.v1 as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)

<tensorflow.python.client.session.Session at 0x7f45e5775940>

In [57]:
dataset_file = 'mousemoves.csv'

X, y = load_dataset(dataset_file)

# All sequences must have the same length.
X = pad_sequences(X)

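# Create partitions. The split must be identical to the one used for training
# (same random_state); otherwise test samples may have been seen during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)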

# Don't compile the model when testing new data.
model = load_model('{}.h5'.format(dataset_file), compile=False)
probs = model.predict(X_test)
y_pred = [np.argmax(x) for x in probs]

# Let's see how good those predictions were.
precision, recall, fmeasure, _ = precision_recall_fscore_support(y_test, y_pred, pos_label=None, average='weighted')
print('Precision: {:.2f}%'.format(precision * 100))
print('Recall: {:.2f}%'.format(recall * 100))
print('F-measure: {:.2f}%'.format(fmeasure * 100))

# Finally compute the ROC AUC of the prediction confidence: does a higher
# top probability go along with a correct prediction?
binary_labels = [p == y_test[i] for i, p in enumerate(y_pred)]
y_probs = [x.max() for x in probs]
auc = roc_auc_score(binary_labels, y_probs, average='weighted')
print('AUC: {:.2f}%'.format(auc * 100))

Precision: 50.28%
Recall: 49.02%
F-measure: 49.13%
AUC: 47.38%


### 2.4. Results
The output of the last cell gives us four different metrics for our model.



**Precision** and **recall** are two extremely important model evaluation metrics. Precision is the fraction of the samples predicted as a given class that actually belong to it, while recall is the fraction of the samples that actually belong to a class that the algorithm correctly identified. Unfortunately, it is usually not possible to maximize both metrics at the same time, as improving one tends to come at the cost of the other. 

In [88]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://miro.medium.com/max/1068/1*EXa-_699fntpUoRjZeqAFQ.jpeg", width=400, height=150)

For simplicity, there is another metric available, called the F1 score (printed above as F-measure), which is the harmonic mean of precision and recall. One can select the model that maximizes this F1 score.
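
As a worked example, the sketch below computes precision, recall and the F1 score for a single class from made-up labels. The tutorial's own numbers come from scikit-learn's `precision_recall_fscore_support` with `average='weighted'`, which averages such per-class values weighted by class frequency; the helper here is only an illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for the three age groups (0, 1, 2).
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 0]

p, r, f = precision_recall_f1(y_true, y_pred, label=1)
print('{:.2f} {:.2f} {:.2f}'.format(p, r, f))  # 0.75 1.00 0.86
```

For class 1 the model finds every true sample (recall 1.00) but also labels one sample of class 0 as class 1, which lowers precision to 0.75; the F1 score lies between the two.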

## 3 Conclusion

Now you know:

1.   a common deep learning model
2.   the general pipeline of model creation in TensorFlow
3.   how to work with different parameters of the model
4.   a commonly used application of deep learning

Do not hesitate to ask questions at otorrent@mail.ru

Thank you for your attention and see you at the next exercise session!