# Frame Transfer Learning Exploration

Welcome to [frame.ai's](https://frame.ai) exploration of transfer learning with [fast.ai](https://github.com/fastai/fastai) and [FloydHub](https://www.floydhub.com). This notebook assumes you are running it on a FloydHub machine with PyTorch 1.0. If you are running it locally instead, make sure PyTorch 1.0 is installed, along with the dependencies set up by `setup.sh` and listed in `floyd_requirements.txt`.

## Run FloydHub Jobs

To kick off jobs on FloydHub, just run the following two cells. On FloydHub itself, no further setup is needed. If you are running locally, make sure the FloydHub CLI (`floyd`) is installed.

In [1]:
# if running samples larger than 1000, please use --gpu2 instead of --cpu
def train_grid(exp_name, sample_sizes):
    # launch one FloydHub job per (sample size, global_lm) combination
    for size in sample_sizes:
        for global_lm in [True, False]:
            !floyd run "bash setup.sh && python train.py {exp_name} floyd \
                --sample-size={size} --global-lm={global_lm}" \
                --env pytorch-1.0 --cpu \
                -m "sample size {size}, global_lm {global_lm}"

In [None]:
# in our experiment we ran sample_sizes of [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
train_grid('frame_blog_experiment', sample_sizes=[500, 1000])
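
Before submitting anything, it can help to preview what the grid expands to. The following is a minimal sketch and not part of the original experiment: `build_floyd_commands` is a hypothetical helper that copies the command template from `train_grid` and returns the command strings instead of running them. Passing `exp_name` as the first positional argument to `train.py` is an assumption.

```python
from itertools import product


def build_floyd_commands(exp_name, sample_sizes):
    """Return the `floyd run` command strings for the grid without running them.

    Hypothetical helper: the template is copied from train_grid, and using
    exp_name as the first argument to train.py is an assumption.
    """
    commands = []
    # One job per (sample size, global_lm) pair, in the same order as nested loops.
    for size, global_lm in product(sample_sizes, [True, False]):
        commands.append(
            f'floyd run "bash setup.sh && python train.py {exp_name} floyd '
            f'--sample-size={size} --global-lm={global_lm}" '
            f'--env pytorch-1.0 --cpu '
            f'-m "sample size {size}, global_lm {global_lm}"'
        )
    return commands


# Preview the four jobs that sample_sizes=[500, 1000] would submit.
for command in build_floyd_commands('frame_blog_experiment', [500, 1000]):
    print(command)
```

Each sample size produces two jobs, one with `global_lm` set to `True` and one with `False`, so the eight sample sizes mentioned above would submit sixteen jobs.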

## Experiment Playground
If you'd like to play around with fast.ai and the experiment itself, you can do so in the cells below.

**Please note** that even the smallest sample sizes are demanding for a machine without a GPU. We advise against running more than 1000 domain samples locally, and against running more than 16000 samples on anything less powerful than a FloydHub `GPU2` machine.
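
The sizing advice above can be written down as a small guard. This is only a sketch: `pick_instance_flag` and `is_reasonable_locally` are hypothetical names that do not exist in `train.py`, and the thresholds are simply the ones quoted in this notebook (1000 samples for CPU and local runs, with `--gpu2` suggested beyond that).

```python
# Thresholds taken from the advice in this notebook, not measured here.
CPU_MAX_SAMPLES = 1000    # above this, use --gpu2 instead of --cpu
LOCAL_MAX_SAMPLES = 1000  # above this, avoid running locally


def pick_instance_flag(sample_size):
    """Return the floyd instance flag suggested for a given sample size."""
    return '--gpu2' if sample_size > CPU_MAX_SAMPLES else '--cpu'


def is_reasonable_locally(sample_size):
    """Return True if the sample size is within the suggested local limit."""
    return sample_size <= LOCAL_MAX_SAMPLES


for size in [500, 1000, 2000]:
    print(size, pick_instance_flag(size), is_reasonable_locally(size))
```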

In [3]:
from train import * 

In [9]:
sample_size = 500
exp_name = 'frame_blog_experiment'
env = 'local'
global_lm = True

In [5]:
data_dir = '_'.join([exp_name, str(sample_size)])
data_dir = Path(f'./{data_dir}/')

# grab and parse imdb review sentiment data
df_trn, df_val = get_imdb_data(data_dir)

# make sure we have wikitext language model
model_path = data_dir / 'models'
model_path.mkdir(exist_ok=True)
url = 'http://files.fast.ai/models/wt103_v1/'
download_url(f'{url}lstm_wt103.pth', model_path / 'lstm_wt103.pth')
download_url(f'{url}itos_wt103.pkl', model_path / 'itos_wt103.pkl')

# create csv samples to feed into fast.ai 
sample_for_experiment(train_df=df_trn,
                      test_df=df_val,
                      sample_size=sample_size,
                      dst=data_dir)

In [10]:
print("Training Language Model...")
lm_encoder_name = 'lm1_enc'
lm_learner, lm_data = train_language_model(
    data_dir, env, global_lm)
lm_learner.save_encoder(lm_encoder_name)

Training Language Model...
Tokenizing train_lm.


Numericalizing train_lm.
Tokenizing valid.


Numericalizing valid.
LM Train Vocabulary size: 3986
LM Vocabulary size: 3986
LM Embedding dim: 400


Total time: 41:18
epoch  train loss  valid loss  accuracy
1      4.847859    4.075024    0.247282  (05:06)
2      4.613337    3.959728    0.255174  (04:59)
3      4.441497    3.936216    0.257463  (04:52)
4      4.307472    3.934896    0.257945  (05:10)
5      4.184432    3.936925    0.257398  (05:20)
6      4.082025    3.942329    0.258600  (05:06)
7      3.974310    3.955752    0.257507  (04:58)
8      3.876446    3.971110    0.255521  (05:42)



In [11]:
print("Training Sentiment Classifier...")
sentiment_learner = train_classification_model(
    data_dir, env, lm_data, lm_encoder_name)

Training Sentiment Classifier...
Tokenizing train_clas.


Numericalizing train_clas.
Classifier Train Data Vocabulary size: 3986
Classifier Vocabulary size: 3986
Classifier Embedding dim: 400


Total time: 21:55
epoch  train loss  valid loss  accuracy
1      0.644545    0.674134    0.534000  (02:33)
2      0.586209    0.612849    0.708000  (02:44)
3      0.553098    0.536815    0.744000  (02:41)
4      0.529575    0.503825    0.772000  (03:08)
5      0.492469    0.498575    0.756000  (02:31)
6      0.469248    0.490786    0.752000  (02:53)
7      0.450089    0.489501    0.758000  (02:39)
8      0.432364    0.496952    0.756000  (02:43)



'0.5340000014305115'
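
The per-epoch logs can also be scanned programmatically. The sketch below is illustrative and not part of the original experiment: it copies the classifier's validation metrics from the training log above into a list and picks out the epochs with the lowest validation loss and the highest accuracy. `best_epoch` is a hypothetical helper, not a fast.ai function.

```python
# (epoch, valid_loss, accuracy) copied from the classifier training log above.
classifier_log = [
    (1, 0.674134, 0.534000),
    (2, 0.612849, 0.708000),
    (3, 0.536815, 0.744000),
    (4, 0.503825, 0.772000),
    (5, 0.498575, 0.756000),
    (6, 0.490786, 0.752000),
    (7, 0.489501, 0.758000),
    (8, 0.496952, 0.756000),
]


def best_epoch(log, column, lowest=True):
    """Return the epoch number whose value in `column` is lowest (or highest)."""
    chooser = min if lowest else max
    return chooser(log, key=lambda row: row[column])[0]


print(best_epoch(classifier_log, column=1))                # lowest validation loss -> 7
print(best_epoch(classifier_log, column=2, lowest=False))  # highest accuracy -> 4
```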