<a href="https://colab.research.google.com/github/Vlasovets/Deep_learning_course_assistantship/blob/master/mouse_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced topics in User Interfaces

## **Learning goals** 
The tutorial covers the topics explained during the lecture and has the following goals:
1.   to get acquainted with common deep learning models (e.g. LSTM)
2.   to understand the pipeline for creating a model in TensorFlow
3.   to train and test models on given datasets
4.   to work with different parameters of the model

## 1 Introduction



### 1.1. Mouse movement 
Machine learning has become critical for marketers and advertisers, who need to analyze a continuous stream of signals in real time and deliver ads to the right people at the right moment. Many researchers have focused on identifying whether a series of actions was performed by a human or by a bot. This is a highly applied research area: people increasingly shop online, and where an advertisement is placed makes a difference for the company.

**Example**

By placing ads well on its website, a company can directly observe its users' preferences. Beyond insights into those preferences, the company also learns which people are actually interested in its products.
 

In [19]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://image.winudf.com/v2/image/Y29tLnRodW5rYWJsZS5hbmRyb2lkLm9mZmljaWFsYXBwNDYxOC5Nb25leV9NYWtlcl9pY29uXzE1MTQzNDk2OTZfMDEy/icon.png?w=170&fakeurl=1", width=200, height=200)

### 1.2. Dataset
This tutorial provides a glimpse of how mouse movement data can be used. The dataset contains the following information:


*   User ID
*   Viewport width
*   Viewport height
*   Age
*   Movements

*Please,add some details about the dataset*

### 1.2. Prerequisites

In [28]:
%tensorflow_version 1.x # complementary code

`%tensorflow_version` only switches the major version: `1.x` or `2.x`.
You set: `1.x # complementary code`. This will be interpreted as: `1.x`.


TensorFlow 1.x selected.


Make sure that you are using TensorFlow 1.x: the code below relies on the 1.x session API.

In [30]:
from google.colab import drive
drive.mount('/content/drive') # complementary code

Mounted at /content/drive


Do not forget to upload the data file (`mousemoves.csv`) before running the training cell.

In [0]:
import csv
import numpy as np
from sklearn.model_selection import train_test_split
# Import from the public `tensorflow.keras` namespace rather than the private `tensorflow.python` one.
from tensorflow.keras.models import Sequential, save_model
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences

Import all necessary libraries and set a configuration for the following model.

In [32]:
# Some GPUs require setting the `allow_growth` option.
# Comment out this code if you don't have a GPU card.
import tensorflow.compat.v1 as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
tf.Session(config=config)

<tensorflow.python.client.session.Session at 0x7f4680dc07f0>

### 1.3. Data cleaning

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete or improperly formatted. 

In real life, the data provided is rarely clean enough to be used without additional manipulation. As a result, a large share of the typical workload of a person working with data (often quoted as about 80%) goes into preparing the data for the actual analysis.
There are several methods for cleaning data, depending on the dataset; this tutorial uses the following steps:

In [0]:
def load_dataset(filename):
    X, y = [], []
    with open(filename) as csv_file:
        csv_reader = csv.DictReader(csv_file, delimiter='\t')
        for row in csv_reader:
            moves = parse_moves(row['movements'])
            age = parse_age(row['age'])
            X.append(moves)
            y.append(age)
    X, y = np.array(X), np.array(y)
    return X, y

`def` declares a function named `load_dataset` with the parameter `filename` (more [info](https://wiki.python.org/moin/BeginnersGuide)). The `for` loop then reads the file row by row and keeps only the columns we need, `movements` and `age`. Finally, we create an [np.array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) for our input (X) and target (y) variables. 
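
To see what `csv.DictReader` hands to the loop, here is a minimal sketch that reads a tiny in-memory sample instead of the real file. The two rows are made up for illustration; only the column names `movements` and `age` and the tab delimiter are taken from `load_dataset`.

```python
import csv
import io

# Hypothetical two-row sample in the tab-separated layout load_dataset expects.
# The values are invented; the real file may contain more columns.
sample_tsv = (
    "movements\tage\n"
    "10;20;0,15;25;16\tyoung\n"
    "300;400;0,310;405;20,320;410;40\telder\n"
)

# DictReader uses the header line as keys, so each row is a dict.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["age"], "->", row["movements"])
```

Because every row is a dictionary keyed by the header, the loop can pick out columns by name and ignore the rest.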

In [0]:
def parse_moves(seq):
    result = []
    coords = seq.split(',')
    for coord in coords:
        x, y, t = coord.split(';')
        result.append([int(x), int(y)])
    return result


In [0]:
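def parse_age(age):
    # Map each age group to an integer class label.
    if age == 'young':
        return 0
    elif age == 'adult':
        return 1
    elif age == 'elder':
        return 2
    # Fail loudly instead of silently returning None for an unexpected value.
    raise ValueError('Unknown age group: {!r}'.format(age))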

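The function `parse_moves` converts the raw movement string into a list of `[x, y]` coordinates (the third field of each point, `t`, is discarded). The function `parse_age` encodes the three age groups as the integer class labels 0, 1 and 2, in this order: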


*   young
*   adult
*   elder
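
As a quick check of both parsers, the sketch below runs them on a hand-written input. The movement string is invented for illustration, `parse_moves` is repeated so that the snippet runs on its own, and the dictionary `AGE_LABELS` is a hypothetical stand-in for `parse_age` with the same mapping.

```python
def parse_moves(seq):
    # Same logic as the tutorial function, repeated to keep the snippet standalone.
    result = []
    for coord in seq.split(','):
        x, y, t = coord.split(';')   # t is read but not used
        result.append([int(x), int(y)])
    return result

# Hypothetical stand-in for parse_age: same labels, written as a lookup table.
AGE_LABELS = {'young': 0, 'adult': 1, 'elder': 2}

moves = parse_moves('10;20;0,15;25;16,21;30;33')
print(moves)                  # [[10, 20], [15, 25], [21, 30]]
print(AGE_LABELS['adult'])    # 1
```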



In [0]:
def create_model(shape):
    units = shape[0]
    model = Sequential()
    model.add(Bidirectional(LSTM(units), input_shape=shape))
    model.add(Dropout(0.5))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Finally, we create our BiLSTM model. The idea of the model is straightforward: the recurrent layer is duplicated so that two LSTM layers sit side by side. The first layer receives the input sequence as-is, the second receives a reversed copy of it, and their outputs are concatenated. A dropout layer with a rate of 0.5 follows to reduce overfitting. The output layer uses the softmax function as its activation, and the model is trained with sparse categorical cross-entropy as the loss function. 
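
The bidirectional idea can be sketched in a few lines of NumPy. This is a simplification, not the Keras implementation: it uses a plain tanh recurrent cell instead of an LSTM, random weights, and toy sizes chosen only for illustration.

```python
import numpy as np

def simple_rnn(inputs, w_in, w_rec):
    """Run a plain tanh RNN over a sequence and return the final hidden state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        h = np.tanh(w_in @ x_t + w_rec @ h)
    return h

rng = np.random.default_rng(0)
timesteps, features, units = 5, 2, 4   # toy sizes, not the tutorial's

sequence = rng.normal(size=(timesteps, features))

# Two independent sets of weights, one per direction.
fw_in, fw_rec = rng.normal(size=(units, features)), rng.normal(size=(units, units))
bw_in, bw_rec = rng.normal(size=(units, features)), rng.normal(size=(units, units))

h_forward = simple_rnn(sequence, fw_in, fw_rec)          # reads the sequence first to last
h_backward = simple_rnn(sequence[::-1], bw_in, bw_rec)   # reads the sequence last to first

# Concatenate the two final states, so the output is twice as wide as one layer.
merged = np.concatenate([h_forward, h_backward])
print(merged.shape)  # (8,)
```

The doubling of the output width is the same effect that makes the bidirectional layer in `create_model` output twice as many values as the `units` passed to `LSTM`.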

For evaluation we use the accuracy metric.
It is the ratio of the number of correct predictions to the total number of input samples. It works well when each class has roughly the same number of samples.
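
To make the loss and the metric concrete, here is a small pure-Python sketch of softmax, sparse categorical cross-entropy for a single sample, and accuracy. The scores and labels are invented for illustration, and the helper functions are hypothetical re-implementations, not the Keras ones.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability of the true class, given as an integer index."""
    return -math.log(probs[label])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

probs = softmax([2.0, 1.0, 0.1])
loss = sparse_categorical_crossentropy(probs, label=0)

# Imbalanced toy labels: eight samples of class 1, one each of classes 0 and 2.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 2]
always_class_1 = [1] * len(y_true)
print(accuracy(y_true, always_class_1))  # 0.8
```

The last line shows why accuracy can mislead on imbalanced data: a model that always predicts the majority class scores 0.8 here without having learned anything.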

In [33]:
from google.colab import files # Oleg's code
uploaded = files.upload()

Saving mousemoves.csv to mousemoves (1).csv


In [0]:
if __name__ == '__main__':
    # The saved model file will be named like the dataset file.
    dataset_file = 'mousemoves.csv'

    X, y = load_dataset(dataset_file)

    # All sequences must have the same length.
    X = pad_sequences(X)

    # Create partitions.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

    # Set callbacks, for monitoring progress.
    cb_tensorboard = TensorBoard(log_dir='/tmp/mouse_logs')
    cb_earlystopping = EarlyStopping(patience=10)
    cb_checkpoint = ModelCheckpoint('/tmp/mouse_logs/best.h5', save_best_only=True)

    # Train the model.
    model = create_model(X_train[0].shape)
    print(model.summary())
    history = model.fit(
        X_train,
        y_train,
        validation_data=(X_test, y_test),
        epochs=100,
        callbacks=[cb_tensorboard, cb_earlystopping, cb_checkpoint]
    )

    # Evaluate the model.
    loss, acc = model.evaluate(X_test, y_test)
    print('ACC: {:.2f}'.format(acc))

    # Save the model.
    save_model(model, '{}.h5'.format(dataset_file))

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_3 (Bidirection (None, 444)               399600    
_________________________________________________________________
dropout_3 (Dropout)          (None, 444)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 1335      
Total params: 400,935
Trainable params: 400,935
Non-trainable params: 0
_________________________________________________________________
None
Train on 204 samples, validate on 51 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
ACC: 0.37


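Training finished with an accuracy of 0.37, which is very low: with three classes, random guessing would already reach about 0.33 if the classes were balanced. One possible explanation is a large amount of noise in the data; the evaluation set is also small (51 samples).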
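
As a sanity check on the model summary, the parameter counts can be reproduced by hand. The sketch assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. The value of 222 units is inferred from the summary (output width 444 = 2 × 222), not stated in the tutorial, and the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

units, input_dim = 222, 2     # units inferred from the summary; x and y per time step

bilstm = 2 * lstm_params(units, input_dim)   # forward and backward LSTM
dense = dense_params(2 * units, 3)           # 444 inputs, 3 age classes

print(bilstm, dense, bilstm + dense)  # 399600 1335 400935
```

These numbers match the `Param #` column of the summary. With roughly 400,000 trainable parameters and 204 training samples, the model has far more parameters than examples.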