# One-Hot Encoded Char-Level Recurrent Model

Using the balanced data.

Set whether the model looks at abstracts or titles here:

In [None]:
# 'abstract' or 'title'
text_field = 'abstract'
assert text_field in ('abstract', 'title'), 'text_field must be one of title or abstract.'

# Imports and Setup

Imports and colab setup

In [None]:
%%capture import_capture --no-stderr
# Jupyter magic methods
# For auto-reloading when external modules are changed
%load_ext autoreload
%autoreload 2
# For showing plots inline
%matplotlib inline

# pip installs needed in Colab for arxiv_vixra_models
!pip install wandb --upgrade
!pip install pytorch-lightning==1.4.9 # v1.5.0 breaks wandb sweeps.
!pip install unidecode
# Update sklearn
!pip uninstall scikit-learn -y
!pip install -U scikit-learn

import math
import pandas as pd
import numpy as np
import torch
pd.set_option('display.float_format', '{:f}'.format)

# pl and wandb installation and setup.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb

`wandb` log in:

In [None]:
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

Google drive access

In [None]:
# Give the notebook access to the rest of your google drive files.
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
# Enter the relevant foldername
FOLDERNAME = '/content/drive/My Drive/ML/arxiv_vixra'
assert FOLDERNAME is not None, "[!] Enter the foldername."
# For importing modules stored in FOLDERNAME or a subdirectory thereof:
import sys
sys.path.append(FOLDERNAME)

Mounted at /content/drive


Import models, loaders, and utility functions from our external package:

In [None]:
import arxiv_vixra_models as avm

Check the compute specs. Save the number of processors to pass as `num_workers` to the datamodule, and record CUDA availability for other flags.

In [None]:
# GPU. Save availability to is_cuda_available.
gpu_info= !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
  is_cuda_available = False
else:
  print(f"GPU\n{50 * '-'}\n", gpu_info, '\n')
  is_cuda_available = True

# Memory.
from psutil import virtual_memory, cpu_count
ram_gb = virtual_memory().total / 1e9
print(f"Memory\n{50 * '-'}\n", 'Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb), '\n')

# CPU.
print(f"CPU\n{50 * '-'}\n",f'CPU Processors: {cpu_count()}')
# Determine the number of workers to use in the datamodule
num_processors = cpu_count()

GPU
--------------------------------------------------
 Sun Nov  7 01:32:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-------------------------------

Copy data to CWD and read in with pandas.

In [None]:
# Copy data and dictionary of chars to cwd
train_data_file_name = 'balanced_filtered_normalized_data_train.feather'
val_data_file_name = 'balanced_filtered_normalized_data_validation.feather'
SUBDIR = '/data/data_splits/'
train_data_path = FOLDERNAME + SUBDIR + train_data_file_name
val_data_path = FOLDERNAME + SUBDIR + val_data_file_name
chars_file_name = 'normalized_char_set.feather'
chars_path = FOLDERNAME + SUBDIR + chars_file_name
!cp '{train_data_path}' .
!cp '{val_data_path}' .
!cp '{chars_path}' .

In [None]:
# load with pd
train_data_df = pd.read_feather(train_data_file_name)
val_data_df = pd.read_feather(val_data_file_name)
chars_df = pd.read_feather(chars_file_name)

Use notebook name as `wandb` `project` string. Remove the file extension and any "Copy of" or "Kopie van" text which arises from copying notebooks and running in parallel.  **If the notebook 

In [None]:
from requests import get
project_str = get('http://172.28.0.2:9000/api/sessions').json()[0]['name']
project_str = project_str.replace('.ipynb', '').replace('Kopie%20van%20', '').replace('Copy%20of%20', '')
print(project_str)

balanced_abstract_one-hot-char_recurrent


# Validation Set Filtering

Filter any overlap between the training and validation datasets. Most complete duplicates should already have been filtered, but some may not have evaluated as equal prior to text normalization, or may coincide in only one of the two text columns (e.g., there are distinct papers in the dataset which share a title but have different abstracts). We perform strict filtering below.

In [None]:
val_data_intersections_filtered_df = val_data_df.merge(train_data_df, on=[text_field], how='outer', suffixes=['', '_'], indicator=True).loc[lambda x: x['_merge'] == 'left_only'].iloc[:,:3]
print(f'{len(val_data_df) - len(val_data_intersections_filtered_df)} items removed from val set, {100 * (len(val_data_df) - len(val_data_intersections_filtered_df)) / len(val_data_df):.3f} percent.')

62 items removed from val set, 0.823 percent.
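The filtering above is an anti-join: keep only the validation rows whose text never occurs in the training set. The sketch below shows the same idea on toy data. It is a variation rather than the notebook's exact call (it uses `how="left"` and merges against the key column only), and the frames and the `drop_overlap` helper are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for the real splits (all values are made up).
train_df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "abstract": ["x", "y", "z"],
    "source": ["arxiv", "vixra", "arxiv"],
})
val_df = pd.DataFrame({
    "title": ["d", "e", "f"],
    "abstract": ["y", "u", "v"],  # "y" also appears in train_df
    "source": ["vixra", "arxiv", "vixra"],
})


def drop_overlap(val, train, column):
    """Return the rows of `val` whose `column` value never occurs in `train`."""
    merged = val.merge(
        train[[column]].drop_duplicates(),  # only the key column is needed
        on=column,
        how="left",
        indicator=True,  # adds a `_merge` column: 'left_only' or 'both'
    )
    kept = merged.loc[merged["_merge"] == "left_only", val.columns]
    return kept.reset_index(drop=True)


filtered_val_df = drop_overlap(val_df, train_df, "abstract")
print(filtered_val_df)
```

Merging against only the deduplicated key column keeps the original validation columns intact, so no positional slicing is needed afterwards.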


Inspect using `print` and `to_string()` to avoid Colab JavaScript errors.

In [None]:
print(train_data_df.head().to_string())

title

In [None]:
print(val_data_intersections_filtered_df.head().to_string())

title    abstract    source

# Model Testing

Setting hyperparameters and performing a small test run.

Dictionary args for model and datamodule.

In [None]:
# Data was prepared with a max of 2048 chars for abstract and 128 for titles.
seq_len = 2048 if text_field == 'abstract' else 128
model_args_dic = {'seq_len': seq_len,
              'num_layers': 2,
              'bidirectional': True,
              'rnn_type': 'GRU',
              'hidden_size': 128,
              'dropout': None,
              }
# The batch size we set here determines the number of datapoints used
# in the histogram that avm.WandbOneHotTextCallback logs
data_args_dic = {'seq_len': seq_len,
                 'train_data': train_data_df,
                 'val_data': val_data_intersections_filtered_df,
                 'chars': chars_df, 
                 'text_column': text_field,
                 'check_normalization': False,
                 'num_workers': num_processors,
                 'batch_size': 512,
                 'pin_memory': is_cuda_available,
                 'persistent_workers': False
                 }

Small test run.

In [None]:
small_example_data_module = avm.OneHotCharDataModule(**{**data_args_dic, **{'sample_size': data_args_dic['batch_size']}})
small_example_data_module.setup()
small_example_loader = small_example_data_module.train_dataloader()
small_example_inputs, small_example_targets = next(iter(small_example_loader))
# Print the first few input texts
for tensor, source in zip(small_example_inputs[:3], small_example_targets[:3]):
    stripped_text = avm.one_hot_decoding(tensor, chars_file_name)
    print(f"text: {stripped_text}",
          f'len: {len(stripped_text)}',
          f'source: {source.item()}',
          sep='\n')
# Test the model        
small_example_model = avm.LitOneHotCharRecurrent(input_size=len(chars_df),
                                             **model_args_dic)
small_example_preds, small_example_losses, _ = small_example_model.preds_losses_hiddens(small_example_inputs, small_example_targets)
print('\npreds shape:', small_example_preds.shape, '\n')
print('\nactual loss:', small_example_losses.item(), '\n')
print('\nexpected approx loss', -math.log(.5), '\n')

text: day after day the world stuck more and more in wars , pollution and so many other risk that threaten the environment . with a population of more than 7 . 3 billion , the planet suffers from continuous damage from human activity . as a result of these human distortions , climate change is one of the most fatal challenges that face the world . climate change won ' t be stopped or slowed by a single action , but with the help of too many small contributions from different fields , it will have an impressive impact . changing to electricity generation , manufacturing , and transportation generate most headlines , but the technology field can also play a critical role . the internet of things ( iot ) in particular , has the potential to reduce greenhouse emissions and help slow the rise of global temperatures . iot includes more than super brilliant new gadgets and smart widgets . it also influences the earth ' s condition , from its available resources to its climate . in this paper 
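The "expected approx loss" printed by the test cell comes from a simple sanity check: an untrained binary classifier on balanced data should predict a probability near 0.5 for every example, which gives a cross-entropy of $-\ln 0.5 = \ln 2 \approx 0.693$. The sketch below checks this with the standard library only. It assumes the model is trained with binary cross-entropy, which the notebook does not state explicitly, and the `binary_cross_entropy` helper is hypothetical.

```python
import math


def binary_cross_entropy(prob_positive, target):
    """Cross-entropy of a single prediction; `target` is 0 or 1."""
    return -(target * math.log(prob_positive)
             + (1 - target) * math.log(1 - prob_positive))


# An untrained classifier on balanced data should output roughly 0.5 for
# every example, whatever the true label is.
targets = [0, 1, 1, 0]
losses = [binary_cross_entropy(0.5, t) for t in targets]
mean_loss = sum(losses) / len(losses)

print(mean_loss, -math.log(0.5))
```

An initial loss far from this value usually points to a problem with the labels, the loss function, or the weight initialization.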

For logging purposes, take another sample from the validation set which will be used to visualize predictions.

In [None]:
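val_sample = next(iter(small_example_data_module.val_dataloader()))
# Decode each one-hot input tensor back to text (avoid shadowing the builtin `input`).
val_sample_text = [avm.one_hot_decoding(one_hot_text, chars_file_name) for one_hot_text in val_sample[0]]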
print(*val_sample_text[:3], sep='\n')

fuzzy classification has become very necessary because of its ability to use simple linguistically interpretable rules and has get control over the limitations of symbolic or crisp rule based classifiers . this paper mainly deals with classification on the basis of soft computing techniques fuzzy cognitive maps and fuzzy inference system . but the data available for classification contain some missing or ambiguous data so it is better to use the neutrosophic logic for classification .
in this paper , by the method of heat flow and the method of exhaustion , we prove an existence theorem of hermitian - yang - mills - higgs metrics on holomorphic line bundle over a class of non - compact gauduchon manifold .
we consider a relatively new hybrid generalized f - contraction involving a pair of mappings and utilize the same to prove a common fixed point theorem for a hybrid pair of occasionally coincidentally idempotent mappings satisfying generalized $ ( f , \ varphi ) $ - contraction condi
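The decoding used above inverts a character-level one-hot encoding. The sketch below illustrates the round trip in plain Python. The character set, the all-zero padding convention, and both helper functions are assumptions made for illustration; the actual `avm.one_hot_decoding` implementation may differ.

```python
# Hypothetical character set; the notebook reads its own from a feather file.
CHARS = list("abcdefghijklmnopqrstuvwxyz ")
CHAR_TO_IDX = {ch: i for i, ch in enumerate(CHARS)}


def one_hot_encode(text, seq_len):
    """Encode `text` as seq_len rows, each a one-hot vector over CHARS.

    Positions past the end of the text stay all-zero (assumed padding).
    """
    rows = [[0] * len(CHARS) for _ in range(seq_len)]
    for pos, ch in enumerate(text[:seq_len]):
        rows[pos][CHAR_TO_IDX[ch]] = 1
    return rows


def one_hot_decode(rows):
    """Invert one_hot_encode, skipping all-zero padding rows."""
    return "".join(CHARS[row.index(1)] for row in rows if 1 in row)


encoded = one_hot_encode("fuzzy logic", seq_len=16)
decoded = one_hot_decode(encoded)
print(len(encoded), len(encoded[0]), decoded)
```

Each text becomes a `seq_len` by `len(CHARS)` array, which is why the model's `input_size` is set to the number of characters.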

# Single, Local Model Training

Train a single model.

In [None]:
# local_model_args_dic = {'seq_len': seq_len,
#               'num_layers': 2,
#               'bidirectional': True,
#               'rnn_type': 'GRU',
#               'hidden_size': 256,
#               'truncated_bptt_steps': 256,
#               'dropout': None,
#               'lr': 0.0001024,
#               'save_best_models_wandb': True
#               }

# local_data_args_dic = {**data_args_dic,
#                        **{'batch_size': 16,}
#                        }
# local_trainer_dic = {'gpus': -1 if is_cuda_available else 0,
#                      'logger': WandbLogger(),
#                      'max_epochs': 16,
#                      'stochastic_weight_avg': True,
#                      'callbacks': [avm.WandbOneHotTextCallback(val_sample=val_sample,
#                                                                chars=chars_df,
#                                                                labels=('arxiv', 'vixra'))],                                   
#                }

# local_group_str_elements = [local_model_args_dic['rnn_type'],
#                           f"{'-bidirectional' if local_model_args_dic['bidirectional'] else ''}",
#                           f"-hidden_size_{local_model_args_dic['hidden_size']}",
#                           f"-{local_model_args_dic['num_layers']}_layers",
#                           f"-{local_trainer_dic['max_epochs']}_epochs"]
# local_group_str = ''.join(local_group_str_elements)

# local_name_str_elements = [f"lr_{local_model_args_dic['lr']:.3E}",
#                            f"{'-dropout_' + str(local_model_args_dic['dropout']) if local_model_args_dic['dropout'] else ''}"]
# local_name_str = ''.join(local_name_str_elements)

# wandb.init(project=project_str,
#            config=local_model_args_dic,
#            group=local_group_str,
#            name=local_name_str
#            )

In [None]:
# local_data_module = avm.OneHotCharDataModule(**local_data_args_dic)
# local_model = avm.LitOneHotCharRecurrent(input_size=len(chars_df),
#                                          **local_model_args_dic)
# local_trainer = Trainer(**local_trainer_dic)
# local_trainer.fit(local_model, datamodule=local_data_module)

# `wandb` Hyperparameter Tuning



Perform a hyperparameter sweep which is externally coordinated by `wandb`.

Set fixed hyperparameters and the configuration file for the `wandb` sweep.  Notes on the setup below:
* In `sweep_config['parameters']` we only include those parameters which are to be swept over.
* All fixed parameters are put in `fixed_hyperparam_dic`.
* For each run, `wandb` generates a dictionary of values for the hyperparameters listed in `sweep_config`. This dictionary is merged with `fixed_hyperparam_dic` and the result is passed to the model.
* It is simplest to only sweep over hyperparameters which don't change the size of the model so that we can optimize with a single `batch_size`.  This also helps prevent `CUDA` memory errors which can occur if the sweep generates a model which is too large for the given `batch_size`.
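The merge described in the notes above relies on Python's dictionary unpacking, where later mappings win on key collisions. A minimal sketch with illustrative values (the key names mirror the notebook's dictionaries, but the numbers are made up):

```python
# Illustrative values only; the key names mirror the notebook's dictionaries.
fixed_hyperparams = {"rnn_type": "GRU", "hidden_size": 256, "dropout": None}
swept_hyperparams = {"dropout": 0.13}  # what a single sweep run might sample

# Later mappings win on key collisions, so the swept value replaces the fixed one.
merged = {**fixed_hyperparams, **swept_hyperparams}
print(merged)
```

Placing the swept dictionary last means a swept parameter always overrides any default left in the fixed dictionary.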

In [None]:
fixed_hyperparam_dic = {'seq_len': seq_len,
                        'rnn_type': 'GRU',
                        'num_layers': 2,
                        'bidirectional': True,
                        'hidden_size': 256,
                        'truncated_bptt_steps': 256,
                        'lr': 0.0001024058707833583,
                        'save_best_models_wandb': True
                        }

sweep_config = {'method': 'random'}
sweep_config['early_terminate'] = {'type': 'hyperband',
                                    'min_iter': 3
                                   }
sweep_config['metric'] = {'name': 'val_loss',
                           'goal': 'minimize'
                           }
sweep_config['parameters'] =  {'dropout': {'distribution': 'uniform',
                                      'min': .05,
                                      'max': .25
                                      }
                               }

Re-instantiate the data using the full dataset and a non-trivial batch size. Implement truncated backpropagation through time, if desired, and set `max_epochs` and any other trainer parameters in a dictionary for the trainer, which is also merged into `fixed_hyperparam_dic` for logging purposes.
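Truncated backpropagation through time splits each long sequence into shorter chunks: the hidden state is carried across chunks, but gradients only flow within a chunk. The sketch below shows just the chunking arithmetic for the values used in this notebook. The `split_into_chunks` helper is hypothetical, and how the library performs the split internally is not shown here.

```python
def split_into_chunks(sequence, chunk_len):
    """Split a sequence into consecutive chunks of at most chunk_len items."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


seq_len = 2048              # abstract length used in this notebook
truncated_bptt_steps = 256  # value set in fixed_hyperparam_dic

# Stand-in for the time steps of one abstract.
chunks = split_into_chunks(list(range(seq_len)), truncated_bptt_steps)
print(len(chunks), len(chunks[0]))
```

With these settings each abstract is processed as 8 chunks of 256 characters, which bounds the memory needed for the backward pass.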

In [None]:
# Setting 'batch_size' in data_args_dic controls the batch size, while setting
# this key in fixed_hyperparam_dic just lets wandb track this hyperparameter.
data_args_dic['batch_size'] = fixed_hyperparam_dic['batch_size'] = 16 #2 **10

# Set max_epochs in a dict, along with any other optional trainer kwargs.
trainer_dic = {'max_epochs': 8,
               'stochastic_weight_avg': True
               }
fixed_hyperparam_dic = {**fixed_hyperparam_dic, **trainer_dic}
def sweep_iteration():
    # Group by various properties
    group_str_elements = [fixed_hyperparam_dic['rnn_type'],
                          f"{'-bidirectional' if fixed_hyperparam_dic['bidirectional'] else ''}",
                          f"-hidden_size_{fixed_hyperparam_dic['hidden_size']}",
                          f"-{fixed_hyperparam_dic['num_layers']}_layers",
                          f"-{fixed_hyperparam_dic['max_epochs']}_epochs"]
    group_str = ''.join(group_str_elements)
    with wandb.init(group=group_str) as run:
        data = avm.OneHotCharDataModule(**data_args_dic)
        config = wandb.config
        # Merge the swept values in wandb.config with fixed_hyperparam_dic;
        # swept values take precedence on key collisions.
        config = {**fixed_hyperparam_dic, **config}
        # The merged dict carries both the fixed values and the `parameters` sampled from sweep_config.
        model = avm.LitOneHotCharRecurrent(input_size=len(chars_df), **config)
        # Overwrite the random run names chosen by wandb.
        name_str_elements = [f"lr_{config['lr']:.3E}",
                             f"{'-dropout_' + str(config['dropout']) if config['dropout'] else ''}"]
        name_str = ''.join(name_str_elements)
        run.name = name_str
        trainer = Trainer(
            logger=WandbLogger(),
            gpus=-1 if is_cuda_available else 0,
            callbacks=[avm.WandbOneHotTextCallback(val_sample=val_sample,
                                                   chars=chars_df,
                                                   labels=('arxiv', 'vixra'),
                                                   name=group_str)],
            **trainer_dic
            )
        trainer.fit(model, datamodule=data)

In [None]:
sweep_id = wandb.sweep(sweep_config, project=project_str)

Create sweep with ID: geemu5rm
Sweep URL: https://wandb.ai/garrett361/balanced_abstract_one-hot-char_recurrent/sweeps/geemu5rm


In [None]:
wandb.agent(sweep_id, function=sweep_iteration)

wandb: Agent Starting Run: 6whjkkc4 with config:
wandb: 	dropout: 0.1306458471823354
wandb: Currently logged in as: garrett361 (use `wandb login --relogin` to force relogin)


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name        | Type     | Params
-----------------------------------------
0 | train_acc   | Accuracy | 0     
1 | val_acc     | Accuracy | 0     
2 | test_acc    | Accuracy | 0     
3 | rnn         | GRU      | 1.7 M 
4 | class_layer | Linear   | 513   
-----------------------------------------
1.7 M     Trainable params
0         Non-trainable params
1.7 M     Total params
6.742     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Saved at global step: 0
Epoch: 0
Validation accuracy: 0.90625
Validation Loss: 0.6814248561859131

Saved at global step: 2196
Epoch: 0
Validation accuracy: 0.8319496512413025
Validation Loss: 0.39780354499816895

Saved at global step: 4393
Epoch: 1
Validation accuracy: 0.846679151058197
Validation Loss: 0.3482213616371155

Saved at global step: 6590
Epoch: 2
Validation accuracy: 0.8788163065910339
Validation Loss: 0.28392428159713745

Saved at global step: 8787
Epoch: 3
Validation accuracy: 0.892072856426239
Validation Loss: 0.26708337664604187

Saved at global step: 10984
Epoch: 4
Validation accuracy: 0.8956882953643799
Validation Loss: 0.2538934350013733

Saved at global step: 13181
Epoch: 5
Validation accuracy: 0.9039903879165649
Validation Loss: 0.24079185724258423

Saved at global step: 17575
Epoch: 7
Validation accuracy: 0.9088109135627747
Validation Loss: 0.24891512095928192

0,1
epoch,▁▁▂▂▃▃▄▄▅▅▆▆▇▇██
global_step,▁▂▃▄▄▅▆▇█
train_acc,▁▅▆▆▇▇██
train_loss,█▅▄▃▃▂▂▁
trainer/global_step,▁▁▁▂▂▂▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇██
val_acc,▁▂▅▆▇███
val_loss,█▆▃▂▂▁▁▁

0,1
epoch,7.0
global_step,17575.0
train_acc,0.70664
train_loss,0.49342
trainer/global_step,17575.0
val_acc,0.90881
val_loss,0.24892


wandb: Agent Starting Run: wf2tngjo with config:
wandb: 	dropout: 0.11422905085050372


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
