<a href="https://colab.research.google.com/github/agrudkow/xlnet/blob/master/notebooks/colab_imdb_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XLNet IMDB movie review classification project

This notebook is for classifying the [IMDB sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/). It is easy to edit in order to run any of the classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). While you cannot expect to reproduce the paper's state-of-the-art results on a GPU, the model will still score highly.

## Setup
Install dependencies

In [None]:
! pip install sentencepiece

Download the pretrained XLNet model and unzip

In [None]:
# Optional: large model (only needs to be done once)
#! wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
#! unzip cased_L-24_H-1024_A-16.zip

In [None]:
# Download and unzip base model
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip
! unzip cased_L-12_H-768_A-12.zip
! rm cased_L-12_H-768_A-12.zip
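
After unzipping, it may be worth confirming that the files referenced by the later commands are actually present. The sketch below assumes the archive unpacks to `xlnet_cased_L-12_H-768_A-12` (the value used for `PRETRAINED_MODEL_DIR` later in this notebook); the helper name `missing_model_files` is hypothetical. The checkpoint is stored as several files sharing the `xlnet_model.ckpt` prefix, so it is matched by prefix rather than by exact name.

```python
import os

def missing_model_files(model_dir):
    """Return the expected pretrained-model files that are absent from model_dir."""
    entries = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    # Files passed to run_classifier.py via --spiece_model_file and --model_config_path.
    missing = [name for name in ("spiece.model", "xlnet_config.json") if name not in entries]
    # The checkpoint is a group of files that share this prefix.
    if not any(entry.startswith("xlnet_model.ckpt") for entry in entries):
        missing.append("xlnet_model.ckpt*")
    return missing

# Assumed directory name; an empty list means nothing is missing.
print(missing_model_files("xlnet_cased_L-12_H-768_A-12"))
```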

Clone the XLNet repo to get `run_classifier.py` and the rest of the xlnet module

In [None]:
! git clone https://github.com/agrudkow/xlnet.git

In [None]:
%cd /content/xlnet
! git pull
%cd /content

Downgrade tensorflow to v1

In [12]:
%tensorflow_version 1.x

TensorFlow 1.x selected.
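
Before starting a long run, a quick check of the selected TensorFlow version and the GPU can save time. This is a minimal sketch: `is_tf1` is a hypothetical helper, and the import is guarded so the cell still runs if TensorFlow is not installed.

```python
def is_tf1(version):
    """Return True if a version string such as '1.15.2' belongs to the 1.x series."""
    return version.split(".")[0] == "1"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not available in this environment

if tf is not None:
    print("TensorFlow", tf.__version__, "is 1.x:", is_tf1(tf.__version__))
    # gpu_device_name() returns an empty string when no GPU is visible.
    print("GPU device:", tf.test.gpu_device_name() or "none found")
```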


## Define Variables
Define all the dirs: data, xlnet scripts & pretrained model. 
If you would like to save models, you can authenticate a GCP account and use it for the OUTPUT_DIR & CHECKPOINT_DIR. You will need a large amount of storage to fit these models.

Alternatively, it is easy to integrate a Google Drive account; check out this guide for [I/O in colab](https://colab.research.google.com/notebooks/io.ipynb), but remember that the models will take up a large amount of storage.


In [18]:
SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
DATASET_NAME = 'answers-students' #@param ["answers-students", "headlines", "images"] {type:"string"}
TASK_NAME = 'ists' #@param {type:"string"}
DATA_DIR = 'xlnet/' + TASK_NAME + '/' + DATASET_NAME
OUTPUT_DIR = 'proc_data/' + TASK_NAME
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-12_H-768_A-12' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/' + TASK_NAME
PREDICIT_DIR = 'xlnet/pred/' + TASK_NAME + '/' + DATASET_NAME
METRICS_DIR = 'xlnet/metrics/' + TASK_NAME
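
The directories above are plain relative paths, so they can be created up front and printed as absolute paths to see where the outputs will land. This is a sketch only: `ensure_dirs` is a hypothetical helper, the values repeat the ones defined in the cell above so the block is self-contained, and the XLNet scripts may well create some of these directories themselves.

```python
import os

def ensure_dirs(*paths):
    """Create each directory (with parents) if needed and return the absolute paths."""
    resolved = []
    for path in paths:
        os.makedirs(path, exist_ok=True)
        resolved.append(os.path.abspath(path))
    return resolved

if __name__ == "__main__":
    # Same layout as the variables cell above.
    task_name = "ists"
    dataset_name = "answers-students"
    for path in ensure_dirs(
        "proc_data/" + task_name,                         # OUTPUT_DIR
        "exp/" + task_name,                               # CHECKPOINT_DIR
        "xlnet/pred/" + task_name + "/" + dataset_name,   # PREDICIT_DIR
        "xlnet/metrics/" + task_name,                     # METRICS_DIR
    ):
        print(path)
```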

## Run Model
This will set off the fine-tuning of XLNet. There are a few things to note here:


1.   With the flags below, this script only trains the model (`--do_eval=False`); set `--do_eval=True` to evaluate as well
2.   Results are stored locally on Colab and will be lost when you are disconnected from the runtime
3.   This uses the base version of the model (`cased_L-12_H-768_A-12`), downloaded in the setup section
4.   We are using a max seq length of 128 with a training batch size of 32 and an evaluation batch size of 8; please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for the memory constraints behind these choices
5.   This will take approximately 4 hours to run on a GPU



In [None]:
train_command = "CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --eval_all_ckpt=True \
  --eval_split=test \
  --task_name="+TASK_NAME+" \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=1500 \
  --warmup_steps=250 \
  --save_steps=250"

! {train_command}
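
To put the schedule flags in the command above into perspective, the arithmetic below works out how many examples the run sees, what share of training is warm-up, and the nominal checkpoint steps. It only uses the flag values from the command; the list of checkpoint steps is an assumption based on `--save_steps`, and the files actually written may differ (for example, an extra checkpoint at the start or a cap on how many are kept).

```python
train_steps = 1500
warmup_steps = 250
save_steps = 250
train_batch_size = 32

# Each step consumes one batch.
examples_seen = train_steps * train_batch_size
# Share of the run spent ramping the learning rate up.
warmup_fraction = warmup_steps / train_steps
# Nominal steps at which a checkpoint is due.
checkpoint_steps = list(range(save_steps, train_steps + 1, save_steps))

print(examples_seen)              # 48000
print(round(warmup_fraction, 3))  # 0.167
print(checkpoint_steps)           # [250, 500, 750, 1000, 1250, 1500]
```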


## Predict classes for test set

In [None]:
predict_command = "CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py \
  --do_predict=True \
  --pred_all_ckpt=True \
  --eval_split=test \
  --task_name="+TASK_NAME+" \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --predict_dir="+PREDICIT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --predict_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1"

! {predict_command}
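
The training and prediction commands are assembled by string concatenation, which makes it easy to drop a space or a quote. One alternative is to keep the flags in a dictionary and join them in one place. This is a sketch, not part of the XLNet repo: `build_command` is a hypothetical helper, and only a few of the flags from the cell above are repeated to keep the example short.

```python
import shlex

def build_command(script, flags, env=None):
    """Join environment variables, a script path and a dict of flags into one shell command."""
    parts = ["{}={}".format(key, shlex.quote(str(value))) for key, value in (env or {}).items()]
    parts += ["python", shlex.quote(script)]
    # shlex.quote protects values that contain spaces or shell metacharacters.
    parts += ["--{}={}".format(name, shlex.quote(str(value))) for name, value in flags.items()]
    return " ".join(parts)

predict_flags = {
    "do_predict": True,
    "max_seq_length": 128,
    "predict_batch_size": 8,
}
command = build_command("xlnet/run_classifier.py", predict_flags, env={"CUDA_VISIBLE_DEVICES": 0})
print(command)
# CUDA_VISIBLE_DEVICES=0 python xlnet/run_classifier.py --do_predict=True --max_seq_length=128 --predict_batch_size=8
```

The resulting string can be run the same way as the cells above, with `! {command}`.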

## Calculate metrics for iSTS task

In [None]:
metrics_command = "python xlnet/run_classifier.py \
  --calc_ists_metrics=True \
  --data_dir="+DATA_DIR+" \
  --metrics_dir="+METRICS_DIR+" \
  --predict_dir="+PREDICIT_DIR

! {metrics_command}

# Push results to GitHub

#### Check repo status

In [None]:
%cd /content/xlnet
!git status
%cd /content

#### Check repo diff

In [None]:
%cd /content/xlnet
!git diff
%cd /content

#### Set up GitHub environment vars

In [None]:
%cd /content/xlnet

files = 'pred/ists/answers-students/*' #@param {type:"string"}
branch = 'master' #@param {type:"string"}

%cd /content

#### Commit changes

In [None]:
# &> /dev/null - hide output
%cd /content/xlnet

from getpass import getpass

uname = getpass('User name:')
email = getpass('Email:')
# token -> https://docs.github.com/en/github/authenticating-to-github/keeping-your-account-and-data-secure/creating-a-personal-access-token
# It is enough to select the 'Access public repositories' option
token = getpass('Token:')

!git config --global user.email $email 

# Change the name below to your own
!git config --global --replace-all user.name 'Artur Grudkowski'
!git remote set-url origin https://{uname}:{token}@github.com/agrudkow/xlnet.git &> /dev/null

# check out the target branch, then stage the selected files
!git checkout $branch
!git add $files
!git commit -m 'feat(pred): add predictions for answers-students' -m "Config: base-xlnet, 1500 steps, 250 warm-up steps, 32 batch size"
!git pull --rebase 
!git push origin $branch

uname = ''
email = ''
token = ''
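# restore the credential-free remote URL so the token is not kept in .git/config
!git remote set-url origin https://github.com/agrudkow/xlnet.git &> /dev/null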

%cd /content
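
A user name or token that contains reserved characters such as `@` or `/` will break a remote URL if it is pasted in as-is. The sketch below percent-encodes both parts before building the URL and also builds the credential-free form; restoring that form after the push keeps later `git pull` calls working without leaving the token in the repo config. The helper names and the sample credentials are hypothetical.

```python
from urllib.parse import quote

def authenticated_remote(user, token, repo_path, host="github.com"):
    """Build an HTTPS remote URL with percent-encoded credentials."""
    return "https://{}:{}@{}/{}".format(quote(user, safe=""), quote(token, safe=""), host, repo_path)

def plain_remote(repo_path, host="github.com"):
    """Build the same remote URL without any credentials."""
    return "https://{}/{}".format(host, repo_path)

# Placeholder credentials, for illustration only.
print(authenticated_remote("some-user", "t0k/en", "agrudkow/xlnet.git"))
# https://some-user:t0k%2Fen@github.com/agrudkow/xlnet.git
print(plain_remote("agrudkow/xlnet.git"))
# https://github.com/agrudkow/xlnet.git
```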


# Copy files to/from Google drive

##### Mount drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##### Zip and remove checkpoints

In [None]:
%cd /content/exp/ists/
!zip -r  /content/answers-students-1500-ckpt.zip *.ckpt-*

In [None]:
! rm *.ckpt-*
%cd /content

/content/exp/ists
/content
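
Since the checkpoints are deleted right after zipping, it can be safer to verify the archive first. The following sketch does the same job as the `zip` cell using Python's `zipfile` module and adds an integrity check; `zip_checkpoints` and `archive_is_intact` are hypothetical helpers, and the paths in the usage lines are the ones used in the cells above.

```python
import glob
import os
import zipfile

def zip_checkpoints(ckpt_dir, archive_path, pattern="*.ckpt-*"):
    """Store every file in ckpt_dir matching pattern in archive_path; return the stored names."""
    names = sorted(os.path.basename(p) for p in glob.glob(os.path.join(ckpt_dir, pattern)))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.write(os.path.join(ckpt_dir, name), arcname=name)
    return names

def archive_is_intact(archive_path):
    """Return True if every member of the archive passes its CRC check."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.testzip() is None

if os.path.isdir("/content/exp/ists"):
    stored = zip_checkpoints("/content/exp/ists", "/content/answers-students-1500-ckpt.zip")
    print(len(stored), "files archived; intact:",
          archive_is_intact("/content/answers-students-1500-ckpt.zip"))
```

Deleting the originals only after `archive_is_intact` returns `True` avoids losing checkpoints to a truncated archive.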


##### Copy selected files

In [None]:
%cp -av "/content/answers-students-1500-ckpt.zip" "/content/drive/MyDrive/nlp"

'/content/answers-students-1500-ckpt.zip' -> '/content/drive/MyDrive/nlp/answers-students-1500-ckpt.zip'


##### Download selected files

In [None]:
%cp -av "/content/drive/MyDrive/nlp/answers-students-4000-ckpt.zip" "/content/exp/ists"

'/content/drive/MyDrive/nlp/answers-students-4000-ckpt.zip' -> '/content/exp/ists/answers-students-4000-ckpt.zip'


###### Unzip and remove the checkpoint archive

In [None]:
! unzip  /content/exp/ists/answers-students-4000-ckpt.zip -d /content/exp/ists


In [None]:
! rm /content/exp/ists/answers-students-4000-ckpt.zip
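
After restoring an archive, listing the checkpoint steps that are now on disk is a quick way to confirm the extraction worked. This sketch assumes the usual TensorFlow naming `model.ckpt-<step>.index`; the `model.ckpt` prefix and the helper name `checkpoint_steps` are assumptions, so adjust the prefix if your files are named differently.

```python
import os
import re

def checkpoint_steps(ckpt_dir, prefix="model.ckpt"):
    """Return the sorted training steps for which an index file exists in ckpt_dir."""
    pattern = re.compile(re.escape(prefix) + r"-(\d+)\.index$")
    steps = set()
    for name in os.listdir(ckpt_dir):
        match = pattern.match(name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)

if os.path.isdir("/content/exp/ists"):
    print(checkpoint_steps("/content/exp/ists"))
```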

## Running & Results
These are the results that I got from running this experiment
### Params
*   `--max_seq_length=128`
*   `--train_batch_size=8`

### Times
*   Training: 1hr 11mins
*   Evaluation: 2.5hr

### Results
*  Most accurate model on final step
*  Accuracy: 0.92416, eval_loss: 0.31708
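
To read the accuracy as a count of reviews, the short calculation below converts it into correct and incorrect predictions. It assumes the evaluation covered the full IMDB test split of 25,000 reviews, which is not stated in the results above.

```python
accuracy = 0.92416
test_examples = 25000  # assumption: the full IMDB test split was evaluated

correct = round(accuracy * test_examples)
errors = test_examples - correct
print(correct, errors)  # 23104 1896
```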


### Model

*   The trained model checkpoints can be found in 'exp/imdb'

