<a href="https://colab.research.google.com/github/felipefreitas93/Colab_Notebooks/blob/master/XLNet_imdb_GPU_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/felipefreitas93/NLPdatasets
!git clone https://github.com/felipefreitas93/xlnet.git

import pandas as pd

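# Candidate values for FRAC (fraction of the training set to keep):
# 0.036127167630057806, 0.07225433526011561, 0.14450867052023122, 0.21676300578034682, 0.43352601156069365

dataset_used = 'SST2'
FRAC = 0.43352601156069365

DATA_DIR = f'NLPdatasets/{dataset_used}'
# Subsample the training set. Note that train.tsv is overwritten in place below,
# so re-running this cell subsamples the already reduced file again.
train = pd.read_csv(DATA_DIR + '/train.tsv', sep='\t', names=['y','x']).dropna().sample(frac=FRAC)
dataset_len = train.shape[0]/FRAC  # approximate size of the full training set
train.to_csv(DATA_DIR + '/train.tsv', sep='\t', index=False, header=False)

# dataset_len*FRAC is the subsampled size; 32 is the batch size assumed in this formula
NUM_TRAIN_STEPS = 5.12*dataset_len*FRAC/32 #batch size #maybe 4 epochs
WARMUP_STEPS = 0.125*5.12*dataset_len*FRAC/32 #batch size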

Cloning into 'NLPdatasets'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 64 (delta 18), reused 60 (delta 17), pack-reused 0
Unpacking objects: 100% (64/64), done.
Cloning into 'xlnet'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 38 (delta 7), reused 33 (delta 7), pack-reused 0
Unpacking objects: 100% (38/38), done.
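
The step counts set in the first cell follow from a little arithmetic that is easy to lose in the inline comments. The sketch below reproduces it as standalone functions; `training_schedule` and `effective_epochs` are hypothetical helper names, and the 3,000-example dataset is made up for illustration. Note that the divisor of 32 does not match the `--train_batch_size=16` passed to `run_classifier.py` further down. Assuming each training step consumes one batch of 16 examples, the run covers roughly 5.12 × 16 / 32 = 2.56 passes over the subsampled data.

```python
def training_schedule(num_examples, multiplier=5.12, divisor=32, warmup_fraction=0.125):
    """Return (train_steps, warmup_steps) using the same formula as the first cell.

    num_examples is the size of the subsampled training set
    (dataset_len * FRAC in the notebook).
    """
    train_steps = multiplier * num_examples / divisor
    warmup_steps = warmup_fraction * train_steps
    return int(train_steps), int(warmup_steps)


def effective_epochs(train_steps, batch_size, num_examples):
    """Number of passes over the data, assuming one batch per training step."""
    return train_steps * batch_size / num_examples


if __name__ == "__main__":
    # Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
    steps, warmup = training_schedule(3000)
    print("train steps:", steps, "warmup steps:", warmup)
    print("epochs at batch size 16:", effective_epochs(steps, 16, 3000))
    print("epochs at batch size 32:", effective_epochs(steps, 32, 3000))
```

If the intent is a fixed number of epochs, passing the real batch size as `divisor` keeps the step count and the command-line flag in agreement.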


# XLNet IMDB movie review classification project

This notebook fine-tunes XLNet to classify the [IMDB sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It should be easy to edit this notebook to run any of the classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). Whilst you cannot expect to reproduce the paper's state-of-the-art results on a single GPU, the model should still score very highly.

## Setup
Install dependencies

In [2]:
! pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 2.7MB/s 
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.83


Download the pretrained XLNet model and unzip

In [3]:
# only needs to be done once
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
! unzip cased_L-12_H-768_A-12.zip 

--2019-09-13 12:08:06--  https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.203.128, 2404:6800:4008:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433638019 (414M) [application/zip]
Saving to: ‘cased_L-12_H-768_A-12.zip’


2019-09-13 12:08:15 (50.4 MB/s) - ‘cased_L-12_H-768_A-12.zip’ saved [433638019/433638019]

Archive:  cased_L-12_H-768_A-12.zip
   creating: xlnet_cased_L-12_H-768_A-12/
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.index  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.data-00000-of-00001  
  inflating: xlnet_cased_L-12_H-768_A-12/spiece.model  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.meta  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_config.json  
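
Before launching a long fine-tuning run, it can be worth confirming that the archive unpacked completely. The following is a minimal sketch, not part of the original notebook: `missing_model_files` is a hypothetical helper, and the file list is simply copied from the `unzip` output above.

```python
import os

# File names taken from the unzip output above.
EXPECTED_FILES = [
    "spiece.model",
    "xlnet_config.json",
    "xlnet_model.ckpt.index",
    "xlnet_model.ckpt.meta",
    "xlnet_model.ckpt.data-00000-of-00001",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present in model_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_model_files("xlnet_cased_L-12_H-768_A-12")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All pretrained model files are present.")
```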


Download and extract the IMDB dataset, suppressing output

Git clone the XLNet repo for access to run_classifier and the rest of the xlnet module

## Define Variables
Define all the dirs: data, xlnet scripts & pretrained model. 
If you would like to save models, you can authenticate a GCP account and use it for the OUTPUT_DIR & CHECKPOINT_DIR. You will need a large amount of storage to fit these models.

Alternatively, it is easy to integrate a Google Drive account; check out this guide for [I/O in colab](https://colab.research.google.com/notebooks/io.ipynb), but remember that the models will take up a large amount of storage.


In [0]:
SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
OUTPUT_DIR = 'proc_data/imdb' #@param {type:"string"}
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-12_H-768_A-12' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/imdb' #@param {type:"string"}
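
As a rough illustration of the Drive option, the directories can be rooted under a single base path so that switching between local storage and a mounted Drive folder is a one-line change. This is a sketch under assumptions: `make_run_dirs` is a hypothetical helper, and the Drive path in the comment is only the usual mount location, so check the I/O guide linked above for your own setup.

```python
import os


def make_run_dirs(base_dir, task="imdb"):
    """Create the output and checkpoint directories under base_dir and return their paths."""
    dirs = {
        "OUTPUT_DIR": os.path.join(base_dir, "proc_data", task),
        "CHECKPOINT_DIR": os.path.join(base_dir, "exp", task),
    }
    for path in dirs.values():
        os.makedirs(path, exist_ok=True)
    return dirs


if __name__ == "__main__":
    # Local storage: lost when the Colab runtime disconnects.
    dirs = make_run_dirs(".")
    # Assumed Drive location after running, in Colab:
    #   from google.colab import drive
    #   drive.mount('/content/drive')
    # dirs = make_run_dirs('/content/drive/MyDrive/xlnet_runs')  # hypothetical folder name
    print(dirs)
```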

## Run Model
This will set off the fine-tuning of XLNet. There are a few things to note here:

1. The first cell below trains the model and the second evaluates it.
2. The results are stored locally on Colab and will be lost when you are disconnected from the runtime.
3. This uses the base version of the model (`cased_L-12_H-768_A-12`), as downloaded above.
4. We are using a max seq length of 256 with a batch size of 16; please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for the memory constraints behind these choices.
5. This will take approximately 4 hours to run on a GPU.



In [5]:
%%time
train_command = f'python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --eval_all_ckpt=False \
  --task_name=imdb \
  --data_dir={DATA_DIR} \
  --output_dir={OUTPUT_DIR} \
  --model_dir={CHECKPOINT_DIR} \
  --uncased=False \
  --spiece_model_file={PRETRAINED_MODEL_DIR}/spiece.model \
  --model_config_path={PRETRAINED_MODEL_DIR}/xlnet_config.json \
  --init_checkpoint={PRETRAINED_MODEL_DIR}/xlnet_model.ckpt \
  --max_seq_length=256 \
  --train_batch_size=16 \
  --eval_batch_size=16 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps={int(NUM_TRAIN_STEPS)} \
  --warmup_steps={int(WARMUP_STEPS)} \
  --save_steps=5000 \
  --iterations=500'

! {train_command}





W0913 12:08:27.223542 139822679500672 deprecation_wrapper.py:119] From xlnet/run_classifier.py:639: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0913 12:08:27.223773 139822679500672 deprecation_wrapper.py:119] From xlnet/run_classifier.py:639: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0913 12:08:27.223963 139822679500672 deprecation_wrapper.py:119] From xlnet/run_classifier.py:663: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.


W0913 12:08:27.224187 139822679500672 deprecation_wrapper.py:119] From xlnet/run_classifier.py:664: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.


W0913 12:08:27.281117 139822679500672 deprecation_wrapper.py:119] From /content/xlnet/model_utils.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.


W0913 12:08:27.281531 139822679500672 deprecat
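
The training command above and the evaluation command that follows share most of their flags and differ in only a handful (`do_train`, `do_eval`, and the step settings). One way to keep the two in sync is to build both from a shared dictionary. The sketch below is an illustration rather than the notebook's own approach: `build_command` is a hypothetical helper, and the directory values are copied from the variables defined earlier.

```python
def build_command(script, flags):
    """Join a script path and a dict of flags into a single shell command string."""
    parts = [f"python {script}"]
    parts += [f"--{name}={value}" for name, value in flags.items()]
    return " ".join(parts)


PRETRAINED_MODEL_DIR = "xlnet_cased_L-12_H-768_A-12"

# Flags common to training and evaluation, copied from the cells in this notebook.
shared_flags = {
    "task_name": "imdb",
    "data_dir": "NLPdatasets/SST2",
    "output_dir": "proc_data/imdb",
    "model_dir": "exp/imdb",
    "uncased": False,
    "spiece_model_file": f"{PRETRAINED_MODEL_DIR}/spiece.model",
    "model_config_path": f"{PRETRAINED_MODEL_DIR}/xlnet_config.json",
    "init_checkpoint": f"{PRETRAINED_MODEL_DIR}/xlnet_model.ckpt",
    "max_seq_length": 256,
    "train_batch_size": 16,
    "eval_batch_size": 16,
}

# Only the mode switches differ between the two commands in this sketch.
train_command = build_command(
    "xlnet/run_classifier.py", {**shared_flags, "do_train": True, "do_eval": False}
)
eval_command = build_command(
    "xlnet/run_classifier.py", {**shared_flags, "do_train": False, "do_eval": True}
)

if __name__ == "__main__":
    print(train_command)
    print(eval_command)
```

In a notebook cell, either string can then be run with `! {train_command}` or `! {eval_command}`, as in the surrounding cells.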

In [6]:
%%time
# Evaluation only: same flags as training, but with do_train=False and do_eval=True.
eval_command = f'python xlnet/run_classifier.py \
  --do_train=False \
  --do_eval=True \
  --eval_all_ckpt=False \
  --task_name=imdb \
  --data_dir={DATA_DIR} \
  --output_dir={OUTPUT_DIR} \
  --model_dir={CHECKPOINT_DIR} \
  --uncased=False \
  --spiece_model_file={PRETRAINED_MODEL_DIR}/spiece.model \
  --model_config_path={PRETRAINED_MODEL_DIR}/xlnet_config.json \
  --init_checkpoint={PRETRAINED_MODEL_DIR}/xlnet_model.ckpt \
  --max_seq_length=256 \
  --train_batch_size=16 \
  --eval_batch_size=16 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=5000 \
  --warmup_steps=0 \
  --save_steps=1000 \
  --iterations=500'

! {eval_command}





W0913 12:11:23.544735 140486616070016 deprecation_wrapper.py:119] From xlnet/run_classifier.py:639: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0913 12:11:23.544989 140486616070016 deprecation_wrapper.py:119] From xlnet/run_classifier.py:639: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0913 12:11:23.545181 140486616070016 deprecation_wrapper.py:119] From xlnet/run_classifier.py:663: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.


W0913 12:11:23.617311 140486616070016 deprecation_wrapper.py:119] From /content/xlnet/model_utils.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.


W0913 12:11:23.617779 140486616070016 deprecation_wrapper.py:119] From /content/xlnet/model_utils.py:36: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Single device mode.
I09