<img width=150 src=https://raw.githubusercontent.com/autonomio/signs/master/logo.png><center><font size=3>Signs is a set of tools for text preparation, vectorization and processing. The examples below cover many of the commonly used workflows.</font></center>

In [1]:
import sys
sys.path.insert(0, '/Users/mikko/Documents/GitHub/talos/')
sys.path.insert(0, '/Users/mikko/Documents/GitHub/signs/')
sys.path.insert(0, '/Users/mikko/Documents/GitHub/dedomena/')

from pandas import read_csv

import signs
import wrangle as wr
from kerasplotlib import TrainingLog
import talos

%matplotlib inline

Using TensorFlow backend.


Let's first read some data and take the title of each document for training a model. We're going to use a fake news dataset for these examples.

In [2]:
import dedomena
df = dedomena.datasets.autonomio('fake_news')

# prepare the documents from the dataset (titles of the first 1000 rows)
docs = df.title[:1000].astype(str)

# create y data
y = df.label[:1000].values

### Embeddings | `signs.Embeds()`
**Signs** provides a very convenient way to create embeddings for a TF/Keras model. You can read in pretrained vectors in any of the supported vector types:

- GloVe
- Word2Vec
- FastText

In [3]:
embeds = signs.Embeds("/Volumes/KINGSTON/glove.twitter.27B.25d.txt")
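
For orientation, a GloVe text file stores one token per line followed by its vector components, separated by spaces. The sketch below shows how such a file can be parsed into a lookup table. It is an illustration only, not how `signs.Embeds` is implemented; `load_vectors` is a hypothetical helper and the two toy vectors are made up.

```python
import io


def load_vectors(handle):
    """Parse GloVe-style text lines ("token v1 v2 ...") into a dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# toy stand-in for a real file such as glove.twitter.27B.25d.txt
toy_file = io.StringIO("news 0.1 -0.2 0.3\nfake 0.0 0.5 -0.4\n")
vectors = load_vectors(toy_file)
print(len(vectors), len(vectors['news']))
```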

The `embeds` object now contains a Keras embedding layer that we can use to ingest our documents.

In [4]:
# here we also get the embedding layer for keras
embedding_layer, x = embeds.layer(docs)
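
An embedding layer consumes integer sequences of equal length, so documents have to be tokenized, mapped to word ids and padded first. The sketch below illustrates that idea in plain Python. It assumes whitespace tokenization, id 0 reserved for padding and zeros added at the front; `encode` is a hypothetical helper and the preprocessing inside `embeds.layer()` may differ.

```python
def encode(docs, max_len):
    """Map each document to a fixed-length list of word ids."""
    word_index = {}
    sequences = []
    for doc in docs:
        ids = []
        for token in doc.lower().split():
            if token not in word_index:
                # id 0 is reserved for padding, so ids start at 1
                word_index[token] = len(word_index) + 1
            ids.append(word_index[token])
        ids = ids[:max_len]  # drop tokens beyond max_len
        sequences.append([0] * (max_len - len(ids)) + ids)
    return sequences, word_index


sequences, word_index = encode(["Fake news spreads", "news"], max_len=4)
print(sequences)
```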

Now that we have our `x` data and the embedding layer for the model ready, we can finally split the data before moving on to the model. We will use 30% of the data to validate the results after the hyperparameter scanning process is finished.

In [5]:
import wrangle
x_train, y_train, x_val, y_val = wrangle.array_split(x, y, .3)
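
As a rough picture of what a 70/30 split does, the sketch below separates paired `x` and `y` data and returns the parts in the same order as above. It assumes the rows are shuffled with a fixed seed before splitting, which may not match what `wrangle.array_split` does; `split_arrays` is a hypothetical helper.

```python
import random


def split_arrays(x, y, val_fraction, seed=0):
    """Shuffle row indices, then hold out `val_fraction` of the rows."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(x) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(x, train_idx), pick(y, train_idx),
            pick(x, val_idx), pick(y, val_idx))


# toy data: 1000 rows, with y a simple function of x
x_demo = list(range(1000))
y_demo = [i % 2 for i in x_demo]
xd_train, yd_train, xd_val, yd_val = split_arrays(x_demo, y_demo, .3)
print(len(xd_train), len(xd_val))
```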

For hyperparameter optimization we're going to use another Autonomio solution, Talos.

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten


def fake_news(x_train, y_train, x_val, y_val, params):

    model = Sequential()
    model.add(params['embedding_layer'])
    model.add(Flatten())
    model.add(Dropout(params['dropout']))

    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer=params['optimizer'],
                  loss=params['losses'],
                  metrics=['acc'])

    # validation_data takes precedence over validation_split in Keras,
    # so only the explicit validation set is passed
    out = model.fit(x_train, y_train,
                    epochs=params['epochs'],
                    batch_size=params['batch_size'],
                    verbose=0,
                    validation_data=(x_val, y_val))
    
    return out, model

In addition to the input model, Talos requires us to provide a parameter dictionary with the parameters for the experiment.

In [7]:
params = {'embedding_layer': [embedding_layer],
          'batch_size': (10, 30, 5),
          'epochs': [50],
          'dropout': (0.1, 0.3, 10),
          'optimizer': ['Adam', 'Nadam'],
          'losses': ['binary_crossentropy', 'logcosh'],
          'activation':['relu', 'elu']}

Because `embedding_layer` is just another parameter, we could also try several embedding layers (based on different pretrained vectors, for example) as part of the same experiment.

In [8]:
h = talos.Scan(x_train, y_train,
               params=params,
               experiment_name='fake_news_test',
               model=fake_news,
               fraction_limit=0.1,
               clear_session=False)

100%|██████████| 40/40 [05:49<00:00, 10.16s/it]
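
The 40 rounds in the progress bar are consistent with the parameter dictionary. The sketch below counts the permutations, assuming each `(min, max, steps)` tuple expands into `steps` evenly spaced values and `fraction_limit=0.1` keeps a random 10% of all permutations. The `expand` helper and the string placeholder for the embedding layer are hypothetical, and the exact values and sampling used by Talos may differ. Note that `activation` counts towards the grid even though the model function above never reads it.

```python
import itertools
import random


def expand(value):
    """Turn a (start, stop, steps) tuple into `steps` values; keep lists."""
    if isinstance(value, tuple):
        start, stop, steps = value
        width = (stop - start) / steps
        return [start + i * width for i in range(steps)]
    return list(value)


params_demo = {
    'embedding_layer': ['glove_25d'],  # placeholder for the layer object
    'batch_size': (10, 30, 5),
    'epochs': [50],
    'dropout': (0.1, 0.3, 10),
    'optimizer': ['Adam', 'Nadam'],
    'losses': ['binary_crossentropy', 'logcosh'],
    'activation': ['relu', 'elu'],
}

# every combination of the expanded parameter values
grid = [expand(v) for v in params_demo.values()]
permutations = list(itertools.product(*grid))

# keep a random 10% of the permutations, as fraction_limit=0.1 suggests
n_rounds = round(len(permutations) * 0.1)
sampled = random.Random(0).sample(permutations, n_rounds)
print(len(permutations), len(sampled))
```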


### Predictions | `signs.Preds()`
Next, let's put the best model from the experiment to use and see what the results look like. For this, we first need to get the best model from the Talos `Scan()` object.

In [13]:
# get the best model from the experiment
model = h.best_model()

# prepare the predictions object
preds = signs.Preds(x_val, y_val, embeds.word_index, model)

There are several ways we can learn more about the model. 

In [17]:
# Examples of true positives and true negatives
preds.hits()

In [18]:
# Examples of false positives and false negatives
preds.misses()

In [20]:
# display the full results
preds.results.head()

| | text | pred | truth |
|---|---|---|---|
| 0 | donald trump obama thanksgiving your weekend b... | 0.040221 | 0 |
| 1 | orcs of a different domain fighting with heart... | 0.002642 | 0 |
| 2 | loserpalooza 9 craziest scenes from anti trump... | 0.183427 | 0 |
| 3 | descubre que ha llevado siempre un trozo de le... | 0.23998 | 1 |
| 4 | builds 150 million war chest doubling donald t... | 0.007459 | 0 |
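
To make the distinction between hits and misses concrete, the sketch below classifies the five rows shown above. It assumes a decision threshold of 0.5, which the output above does not state; `outcome` is a hypothetical helper and not part of `signs.Preds`.

```python
# (text, pred, truth) rows copied from the results above
rows = [
    ("donald trump obama thanksgiving your weekend b...", 0.040221, 0),
    ("orcs of a different domain fighting with heart...", 0.002642, 0),
    ("loserpalooza 9 craziest scenes from anti trump...", 0.183427, 0),
    ("descubre que ha llevado siempre un trozo de le...", 0.23998, 1),
    ("builds 150 million war chest doubling donald t...", 0.007459, 0),
]


def outcome(pred, truth, threshold=0.5):
    """Label a prediction as a true/false positive/negative."""
    label = 1 if pred >= threshold else 0
    if label == truth:
        return "true positive" if truth == 1 else "true negative"
    return "false positive" if label == 1 else "false negative"


outcomes = [outcome(pred, truth) for _, pred, truth in rows]
hits = [o for o in outcomes if o.startswith("true")]
misses = [o for o in outcomes if o.startswith("false")]
print(len(hits), len(misses))
```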


In [22]:
# a summary of model predictions
preds.summary()

### Evaluation | `talos.Evaluate()`
Finally, let's evaluate the results objectively on the validation data to see how well the model is doing. We will use Talos for this.

In [30]:
evl = talos.Evaluate(h)
evl.evaluate(x_val, y_val, mode='binary')

[0.870967741935484,
 0.819672131147541,
 0.7878787878787877,
 0.84375,
 0.8666666666666667]
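
A list of scores like this is easier to read as a mean and a spread. The sketch below summarizes the five values above, assuming each one is the score of a separate evaluation fold (the output itself does not say which metric is used).

```python
import statistics

# the five scores returned by evl.evaluate() above
scores = [0.870967741935484,
          0.819672131147541,
          0.7878787878787877,
          0.84375,
          0.8666666666666667]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation
print(f"mean={mean_score:.3f} stdev={spread:.3f} "
      f"min={min(scores):.3f} max={max(scores):.3f}")
```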