# Simple User Guide for Matching Scientific Variables

This is a tutorial for running this project. It will cover training machine learning models from provided files, then loading those models to make your own predictions.

First, import the relevant packages and the main functions of the project. Change the path to reflect where the project is located on your computer, and do the same for datasets_path in resource_creation.py.

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append('C:/Users/Anna/Documents/Matching-Predicting-Scientific-Variables/')
#sys.path.append('C:/Users/AND522/Documents/Matching-Predicting-Scientific-Variables/')
from main import nn_train, train_classifier, app_load_models, user_input_loop

Using TensorFlow backend.


## 1. Training machine learning models
This project uses two different types of models to classify text for different purposes.

First, it uses a Recurrent Neural Network (RNN) built with Keras to classify words as either "property" or "unit" to segment user input. Then, it sends these segmented strings to separate Naive Bayes classifiers (built with scikit-learn) which will attempt to classify "property" words as a more specific property, and do the same for "unit" words. In total, three machine learning models are used, though the last two are identical in structure and differ only in the data they ingested. 

The property and unit models are separate because input can contain both a property and a unit. Separate models give separate sets of results. Dividing the possible classes between two models should also benefit accuracy.
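The two-stage flow described above can be sketched as follows. This is only an illustration: `segment_input`, `is_unit`, and the toy word list are hypothetical stand-ins, not functions from this project. In the real pipeline the word-level decision comes from the trained RNN, and each segment is then passed to its own Naive Bayes classifier.

```python
# Hypothetical sketch of the segmentation step; not the project's actual code.
# A toy lookup stands in for the RNN's per-word "property" vs "unit" decision.
TOY_UNIT_WORDS = {"km", "hr", "kg", "m", "s"}  # assumed examples only


def is_unit(word):
    """Stand-in for the binary classifier: True if the word looks like a unit."""
    return word.lower() in TOY_UNIT_WORDS


def segment_input(text):
    """Split user input into a property string and a unit string."""
    property_words, unit_words = [], []
    for word in text.split():
        (unit_words if is_unit(word) else property_words).append(word)
    return " ".join(property_words), " ".join(unit_words)


property_part, unit_part = segment_input("wind speed km hr")
print(property_part)  # wind speed
print(unit_part)      # km hr
```

Each returned string would then go to the matching multi-class classifier, so one input can yield both a property ranking and a unit ranking.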

You can skip this step and simply make predictions from pre-trained models in part 2. However, if you want to run training you'll get more information about the models.

### 1.1 Train binary text classifier for input segmentation

Run the function below to train the binary classification model. It produces a file, 'binary.h5', which stores the trained model. The model is defined in ml_keras.py and can be modified, for example by changing the number of epochs, layer sizes, or activation functions.

Training should take about a minute.

In [2]:
nn_train('property_or_unit.csv')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          (None, 15)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 15, 50)            45000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                29440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257       
__________
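The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 900 is an inference from 45000 / 50, not a value stated in this guide, and the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding_1": embedding_params(900, 50),  # 900 is inferred, see above
    "lstm_1": lstm_params(50, 64),
    "FC1": dense_params(64, 256),
    "out_layer": dense_params(256, 1),
}
for name, n in counts.items():
    print(name, n)
```

The activation and dropout layers have no trainable weights, which is why the summary shows 0 for them.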

### 1.2 Train multi-class classifiers for input classification

Run the function below to train the property and unit multi-class Complement Naive Bayes classifiers. The classifier code is located in machine_learning.py. Training these models is significantly faster; you'll barely notice it.

Set print_report=True if you want to see detailed information about how the classifier performs for each class in the training set. Information about this report is at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html. 

Run the function without arguments to only output accuracies.

In [2]:
train_classifier(print_report=True)

( my_unit.csv ) Test set accuracy:  0.7357615894039735
                                                         precision    recall  f1-score   support

                                               abampere       0.20      1.00      0.33         2
                         abampere per square centimeter       0.00      0.00      0.00         2
                                              abcoulomb       0.50      1.00      0.67         2
                        abcoulomb per square centimeter       0.00      0.00      0.00         2
                                                abfarad       0.50      1.00      0.67         2
                                 abfarad per centimeter       0.00      0.00      0.00         2
                                                abhenry       1.00      1.00      1.00         2
                                                  abohm       1.00      1.00      1.00         2
                                               absiemen       1.00     
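
To make the report columns concrete, here is a small worked example for the `abampere` row. The counts are an assumption chosen to be consistent with the printed figures (support 2, recall 1.00, precision 0.20): both true samples found, plus eight samples from other classes wrongly labelled `abampere`. The function name is hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute per-class precision, recall, and F1 from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumed counts for "abampere": 2 correct, 8 false alarms, 0 missed.
p, r, f1 = precision_recall_f1(true_pos=2, false_pos=8, false_neg=0)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.2 1.0 0.33
```

A row of all zeros, such as `abampere per square centimeter`, corresponds to a class with no true positives in the test set.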



## 2. Load machine learning models

We'll use models loaded from files to make our predictions. These files are overwritten whenever a model is re-trained, so they're always current. They contain all the information needed for prediction, and loading them is much quicker and safer than re-training every time you want to predict.

The function below loads the models and initialises their class instances as global variables, so it only has to be run once when running the program.

In [2]:
app_load_models()

## 3. Running the user input loop to make predictions

Enter a string composed of scientific properties and/or units to predict a match for them in the vocabulary. Ten ranked results are returned for each, with the associated URL, abbreviation, and suggested unit where relevant. Missing values will be displayed as NaN. Attempts were made to make these results look decent, but the web application gives the best display.

Enter 'xxx' to stop the cell.

In [3]:
user_input_loop(notebook=True)

Enter a string to predict:wind speed km hr


Unnamed: 0,processed_name,proper_name,suggested_unit,suggested_unit_url,url
0,wind speed,wind speed,Kilometer per Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...,http://registry.it.csiro.au/def/environment/pr...
1,wind direction,wind direction,Degree Angle,http://registry.it.csiro.au/def/qudt/1.1/qudt-...,http://registry.it.csiro.au/def/environment/pr...
2,maximum wind gust,maximum wind gust,,,
3,wind velocity,wind velocity,,,
4,carbon concentration,carbon concentration,,,http://registry.it.csiro.au/def/environment/pr...
5,oxygen concentration,oxygen concentration,,,http://registry.it.csiro.au/def/environment/pr...
6,water temperature,water temperature,Degree Celsius,http://qudt.org/vocab/unit#DegreeCelsius,http://registry.it.csiro.au/def/environment/pr...
7,nitrogen concentration,nitrogen concentration,Milligrams Per Litre | Milligrams Per Cubic Me...,http://registry.it.csiro.au/def/environment/un...,http://registry.it.csiro.au/def/environment/pr...
8,hydrocarbon concentration,hydrocarbon concentration,,,http://registry.it.csiro.au/def/environment/pr...
9,phosphorus concentration,phosphorus concentration,Milligrams Per Litre,http://registry.it.csiro.au/def/environment/un...,http://registry.it.csiro.au/def/environment/pr...


Unnamed: 0,abbreviation,processed_name,proper_name,url
0,km,kilometer,Kilometer,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
1,km/hr,kilometer per hour,Kilometer per Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
2,km/s,kilometer per second,Kilometer per Second,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
3,km^3/s^2,cubic kilometer per second squared,Cubic Kilometer per Second Squared,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
4,J/(km-K-Pa),joule per kilogram kelvin per pascal,Joule per Kilogram Kelvin per Pascal,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
5,hr,hour,Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
6,ft/hr,foot per hour,Foot per Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
7,Btu/hr,btu per hour,BTU per Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
8,lb/hr,pound per hour,Pound per Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...
9,degF-hr,degree fahrenheit hour,Degree Fahrenheit Hour,http://registry.it.csiro.au/def/qudt/1.1/qudt-...


-----------------------------------------------------------
Enter a string to predict:xxx
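
The prompt-and-stop behaviour shown above follows a common sentinel-loop pattern. The sketch below is an assumption about the general shape, not the project's `user_input_loop`; `run_loop`, `read`, and `predict` are hypothetical names, and input is injected so the loop can run without a keyboard.

```python
def run_loop(read, predict, sentinel="xxx"):
    """Read strings until the sentinel appears, collecting one result per input."""
    results = []
    while True:
        text = read("Enter a string to predict:").strip()
        if text == sentinel:
            break
        results.append(predict(text))
    return results


# Scripted input replaces the built-in input() for this demonstration.
scripted = iter(["wind speed km hr", "xxx"])
outputs = run_loop(lambda prompt: next(scripted), predict=str.upper)
print(outputs)  # ['WIND SPEED KM HR']
```

Passing the built-in `input` as `read` gives the interactive version.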


And that's that for the base functionality! 

The rest of the main program functions mostly deal with dataset creation and manipulation. The web application has a page that lets the user save their input to a training set. This is not currently linked to re-running training on that set, because it is easy to supply bad data and additions would require human validation, though it could be straightforwardly extended to do so.