<a href="https://colab.research.google.com/github/felipefreitas93/Colab_Notebooks/blob/master/XLNet_imdb_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
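# Fraction of the training reviews to keep and of the full training schedule to run (1.0 = full run)
FRAC = 0.02
NUM_TRAIN_STEPS = 4000*FRAC  # float here; cast to int when building the training command
WARMUP_STEPS = 500*FRAC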

# XLNet IMDB movie review classification project

This notebook fine-tunes XLNet to classify the [IMDB sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It is easy to edit this notebook to run the other classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). Whilst you cannot expect to reproduce the state-of-the-art results in the paper on a single GPU, the model can still score very highly.

## Setup
Install dependencies

In [2]:
! pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/00/95/7f357995d5eb1131aa2092096dca14a6fc1b1d2860bd99c22a612e1d1019/sentencepiece-0.1.82-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 8.9MB/s 
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.82


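Download the pretrained XLNet model and unzip it. The same cell defines `get_to_keep`, a helper that keeps a random fraction of the training reviews and deletes the rest.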

In [3]:
# only needs to be done once
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
! unzip cased_L-12_H-768_A-12.zip 

import os
import pandas as pd

def get_to_keep(frac):
    """Keep a random fraction `frac` of the training reviews and delete the rest."""
    for path in ('aclImdb/train/pos', 'aclImdb/train/neg'):
        # os.listdir order is arbitrary, so sort to make the seeded sample reproducible
        files = pd.Series(sorted(os.listdir(path)))
        # A set gives constant-time membership checks while deleting
        to_keep = set(files.sample(frac=frac, random_state=1))
        for name in files:
            if name not in to_keep:
                os.remove(os.path.join(path, name))

--2019-07-23 01:07:19--  https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.126.128, 2a00:1450:4013:c01::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.126.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433638019 (414M) [application/zip]
Saving to: ‘cased_L-12_H-768_A-12.zip’


2019-07-23 01:07:25 (68.6 MB/s) - ‘cased_L-12_H-768_A-12.zip’ saved [433638019/433638019]

Archive:  cased_L-12_H-768_A-12.zip
   creating: xlnet_cased_L-12_H-768_A-12/
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.index  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.data-00000-of-00001  
  inflating: xlnet_cased_L-12_H-768_A-12/spiece.model  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_model.ckpt.meta  
  inflating: xlnet_cased_L-12_H-768_A-12/xlnet_config.json  


Download and extract the IMDB dataset (suppressing the extraction output), then keep only a fraction `FRAC` of the training reviews

In [4]:
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
! tar zxf aclImdb_v1.tar.gz
get_to_keep(FRAC)

--2019-07-23 01:07:34--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2019-07-23 01:07:41 (11.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



Git clone the XLNet repo for access to `run_classifier.py` and the rest of the xlnet module

In [5]:
! git clone https://github.com/zihangdai/xlnet.git

Cloning into 'xlnet'...
remote: Enumerating objects: 118, done.
remote: Total 118 (delta 0), reused 0 (delta 0), pack-reused 118
Receiving objects: 100% (118/118), 135.28 KiB | 626.00 KiB/s, done.
Resolving deltas: 100% (57/57), done.


## Define Variables
Define all the dirs: data, XLNet scripts, processed-data output, pretrained model and checkpoints. 
If you would like to save models, you can authenticate a GCP account and use it for the OUTPUT_DIR & CHECKPOINT_DIR - you will need a large amount of storage to fit these models. 

Alternatively, it is easy to integrate a Google Drive account; check out this guide for [I/O in colab](https://colab.research.google.com/notebooks/io.ipynb), but remember that the models will take up a large amount of storage. 
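
As a rough illustration of the Google Drive option, the sketch below mounts Drive when it runs inside Colab and otherwise falls back to local storage. The helper names `get_base_dir` and `make_run_dirs` and the folder name `xlnet_imdb` are hypothetical and not part of this notebook; the mount path `/content/drive/MyDrive` is an assumption about the usual Colab layout.

```python
import os


def get_base_dir():
    """Return a Drive-backed base directory inside Colab, else the current directory."""
    try:
        # google.colab only exists inside a Colab runtime; mounting asks for authorisation
        from google.colab import drive
    except ImportError:
        return "."  # outside Colab: keep everything local
    drive.mount("/content/drive")
    # Hypothetical folder name; adjust to wherever you want the models stored
    return "/content/drive/MyDrive/xlnet_imdb"


def make_run_dirs(base_dir, task="imdb"):
    """Create and return (output_dir, checkpoint_dir) under base_dir."""
    output_dir = os.path.join(base_dir, "proc_data", task)
    checkpoint_dir = os.path.join(base_dir, "exp", task)
    for directory in (output_dir, checkpoint_dir):
        os.makedirs(directory, exist_ok=True)
    return output_dir, checkpoint_dir
```

With these helpers, `OUTPUT_DIR, CHECKPOINT_DIR = make_run_dirs(get_base_dir())` could stand in for the two hard-coded paths in the variables cell, so that checkpoints survive a runtime disconnect.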


In [0]:
SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
DATA_DIR = 'aclImdb' #@param {type:"string"}
OUTPUT_DIR = 'proc_data/imdb' #@param {type:"string"}
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-12_H-768_A-12' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/imdb' #@param {type:"string"}

## Run Model
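This will set off the fine-tuning of XLNet. There are a few things to note here:


1.   The first cell below fine-tunes the model; the second cell evaluates the fine-tuned model.
2.   The results are stored locally on Colab and will be lost when you are disconnected from the runtime.
3.   This uses the base version of the model (`cased_L-12_H-768_A-12`: 12 layers, hidden size 768, 12 attention heads).
4.   We are using a max seq length of 256 with a batch size of 16; please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for how these two settings affect GPU memory.
5.   With `FRAC = 0.02`, training runs for only 80 steps (10 of them warmup) on 2% of the training reviews. The approx 4hrs originally quoted for GPU training presumably applies to a full run (`FRAC = 1`, 4000 steps).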
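
The step counts passed to the training command follow from `FRAC` by simple scaling. A minimal sketch of that arithmetic is shown here; the helper `scaled_schedule` is hypothetical, and it uses `round()` where the notebook uses `int()`, an assumption made so that floating-point error cannot truncate a step count by one.

```python
FULL_TRAIN_STEPS = 4000   # full-run values from the first cell of this notebook
FULL_WARMUP_STEPS = 500


def scaled_schedule(frac):
    """Scale the full training schedule by frac and return whole step counts."""
    # round() rather than int(): a product such as 28.999999999999996 should give 29, not 28
    train_steps = max(1, round(FULL_TRAIN_STEPS * frac))
    warmup_steps = round(FULL_WARMUP_STEPS * frac)
    return train_steps, warmup_steps


print(scaled_schedule(0.02))  # (80, 10)
print(scaled_schedule(1.0))   # (4000, 500)
```

At a batch size of 16, 80 steps amount to 1,280 training examples drawn from the subsampled training set.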



In [7]:
%%time
train_command = f'python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --eval_all_ckpt=False \
  --task_name=imdb \
  --data_dir={DATA_DIR} \
  --output_dir={OUTPUT_DIR} \
  --model_dir={CHECKPOINT_DIR} \
  --uncased=False \
  --spiece_model_file={PRETRAINED_MODEL_DIR}/spiece.model \
  --model_config_path={PRETRAINED_MODEL_DIR}/xlnet_config.json \
  --init_checkpoint={PRETRAINED_MODEL_DIR}/xlnet_model.ckpt \
  --max_seq_length=256 \
  --train_batch_size=16 \
  --eval_batch_size=16 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps={int(NUM_TRAIN_STEPS)} \
  --warmup_steps={int(WARMUP_STEPS)} \
  --save_steps=5000 \
  --iterations=500'

! {train_command}


W0723 01:07:56.953937 140243552819072 deprecation_wrapper.py:119] From /content/xlnet/model_utils.py:295: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0723 01:07:56.965300 140243552819072 deprecation_wrapper.py:119] From xlnet/run_classifier.py:855: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0723 01:07:56.965924 140243552819072 deprecation_wrapper.py:119] From xlnet/run_classifier.py:637: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0723 01:07:56.966089 140243552819072 deprecation_wrapper.py:119] From xlnet/run_classifier.py:637: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0723 01:07:56.966229 140243552819072 deprecation_wrapper.py:119] From xlnet/run_classifier.py:661: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0723 01:07:56.966390 140243552819072 deprecation_wr

In [11]:
%%time
# Evaluation only: the training-related flags below are passed for parity with the training cell
eval_command = f'python xlnet/run_classifier.py \
  --do_train=False \
  --do_eval=True \
  --eval_all_ckpt=False \
  --task_name=imdb \
  --data_dir={DATA_DIR} \
  --output_dir={OUTPUT_DIR} \
  --model_dir={CHECKPOINT_DIR} \
  --uncased=False \
  --spiece_model_file={PRETRAINED_MODEL_DIR}/spiece.model \
  --model_config_path={PRETRAINED_MODEL_DIR}/xlnet_config.json \
  --init_checkpoint={PRETRAINED_MODEL_DIR}/xlnet_model.ckpt \
  --max_seq_length=256 \
  --train_batch_size=16 \
  --eval_batch_size=16 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=5000 \
  --warmup_steps=0 \
  --save_steps=1000 \
  --iterations=500'

! {eval_command}


W0723 01:26:41.481908 140497577334656 deprecation_wrapper.py:119] From /content/xlnet/model_utils.py:295: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0723 01:26:41.484388 140497577334656 deprecation_wrapper.py:119] From xlnet/run_classifier.py:855: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0723 01:26:41.484971 140497577334656 deprecation_wrapper.py:119] From xlnet/run_classifier.py:637: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0723 01:26:41.485121 140497577334656 deprecation_wrapper.py:119] From xlnet/run_classifier.py:637: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0723 01:26:41.485248 140497577334656 deprecation_wrapper.py:119] From xlnet/run_classifier.py:661: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0723 01:26:41.542330 140497577334656 deprecation_wr