# IMDB Data Setup
Run the following if you want to download the IMDB reviews, process them into CSVs, and upload them as your own FloydHub dataset.
### NOTE: 
You'll likely need to run the train_test_split shuffling in CPU2 mode, because it has more RAM.

In [1]:
from fastai import *
from sklearn.model_selection import train_test_split

### Download
The last two lines of the cell below download and extract the dataset, then list the data directory. Comment them out if you have already downloaded it.

In [2]:
DATA_PATH=Path('data/')
DATA_PATH.mkdir(exist_ok=True)
! wget -c http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -O - | tar -xz -C {DATA_PATH}
! ls data/    

--2018-10-19 00:41:46--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘STDOUT’


2018-10-19 00:41:54 (10.2 MB/s) - written to stdout [84125825/84125825]

aclImdb


### Standardize format
This is lifted straight from the fastai IMDB notebook: https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb

In [14]:
PATH=Path('data/aclImdb/')

CSV_PATH=Path('data/csv/')
CSV_PATH.mkdir(exist_ok=True)

The IMDB dataset has three classes: positive, negative, and unsupervised (sentiment unknown). There are 75k training reviews (12.5k pos, 12.5k neg, 50k unsup) and 25k validation reviews (12.5k pos, 12.5k neg, no unsup).

Refer to the README file in the imdb corpus for further information about the dataset.
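As a quick sanity check of the counts above, the following is a minimal sketch that counts review files per class folder. The helper name `count_reviews` is hypothetical, and the sketch builds a tiny stand-in directory so it runs without the download. It assumes the corpus layout of one folder per class with one `.txt` file per review.

```python
import tempfile
from pathlib import Path

CLASSES = ['neg', 'pos', 'unsup']

def count_reviews(split_dir):
    """Return {class_name: number of review files} for one split directory."""
    counts = {}
    for label in CLASSES:
        class_dir = Path(split_dir) / label
        # A missing folder (e.g. test/unsup) simply counts as zero reviews.
        counts[label] = len(list(class_dir.glob('*.txt'))) if class_dir.is_dir() else 0
    return counts

# Tiny stand-in for data/aclImdb/train so the sketch runs without the download.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / 'train'
    for label, n in [('neg', 2), ('pos', 3), ('unsup', 4)]:
        (train_dir / label).mkdir(parents=True)
        for i in range(n):
            (train_dir / label / f'{i}_0.txt').write_text('a review', encoding='utf-8')
    demo_counts = count_reviews(train_dir)
    test_counts = count_reviews(Path(tmp) / 'test')

print(demo_counts)
```

On the real corpus, calling the same helper on data/aclImdb/train should report the 12.5k / 12.5k / 50k breakdown quoted above, if the layout assumption holds.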

In [5]:
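# 'unsup' must be included: it supplies the 50k unlabeled training reviews (label 2) used below.
CLASSES = ['neg', 'pos', 'unsup']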

def get_texts(path):
    texts,labels = [],[]
    for idx,label in enumerate(CLASSES):
        for fname in (path/label).glob('*.*'):
            texts.append(fname.open('r', encoding='utf-8').read())
            labels.append(idx)
    return np.array(texts),np.array(labels)

trn_texts,trn_labels = get_texts(PATH/'train')
val_texts,val_labels = get_texts(PATH/'test')

In [6]:
len(trn_texts),len(val_texts)

(75000, 25000)

In [7]:
col_names = ['labels','text']

We shuffle the reviews by indexing the text and label arrays with the same random permutation, which keeps every review aligned with its label.

In [8]:
np.random.seed(42)
trn_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))

In [9]:
trn_texts = trn_texts[trn_idx]
val_texts = val_texts[val_idx]

trn_labels = trn_labels[trn_idx]
val_labels = val_labels[val_idx]
trn_labels[:20], val_labels[:20]

(array([2, 0, 1, 2, 2, 2, 2, 0, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2, 2]),
 array([1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]))
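The shuffle above only works because the texts and the labels are indexed with the same permutation. Here is a small sketch of that idea on toy data. The arrays are made up for illustration, and it uses a local `RandomState` rather than the global seed the notebook sets.

```python
import numpy as np

texts = np.array(['bad film', 'great film', 'no label', 'awful'])
labels = np.array([0, 1, 2, 0])

rng = np.random.RandomState(42)      # local generator; the notebook seeds the global one instead
idx = rng.permutation(len(texts))    # one permutation shared by both arrays

shuffled_texts = texts[idx]
shuffled_labels = labels[idx]

# Every shuffled text still carries its original label.
original = dict(zip(texts, labels))
assert all(original[t] == l for t, l in zip(shuffled_texts, shuffled_labels))
```

Drawing a separate permutation for the labels would silently scramble the text/label pairing, so the index array is computed once and reused.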

In [10]:
df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)

The pandas DataFrame stores the text data in an emerging standard format: a label column followed by one or more text columns. This was influenced by a paper by Yann LeCun (LINK REQUIRED). fastai adopts this format for NLP datasets. In the case of IMDB, there is only one text column.

In [15]:
# Keep only labeled reviews for the classifier; label 2 marks the unsup reviews.
df_trn[df_trn['labels']!=2].to_csv(CSV_PATH/'train_clas.csv', header=False, index=False)
df_val.to_csv(CSV_PATH/'valid_clas.csv', header=False, index=False)

(CSV_PATH/'classes.txt').open('w', encoding='utf-8').writelines(f'{o}\n' for o in CLASSES)
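The CSVs written above have no header row and put the label first, then the text. Reviews often contain commas, quotes, and line breaks, so it can help to see that this layout survives a round trip. The sketch below uses Python's standard csv module as a stand-in for pandas, with made-up rows; it is an illustration of the file format, not the code this notebook runs.

```python
import csv
import io

rows = [
    (1, 'Loved it, start to finish.'),
    (0, 'He said "never again"\nand I agree.'),
]

buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)   # no header row: label first, then text

buffer.seek(0)
parsed = [(int(label), text) for label, text in csv.reader(buffer)]
print(parsed == rows)
```

Fields that contain a comma, a quote, or a newline are quoted on write, which is why the text column can be read back unchanged.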

Next, we create the data for the language model (LM). The LM's goal is to learn the structure of the English language. It learns by trying to predict the next word given the words that precede it. Since the LM does not classify reviews, the labels can be ignored.
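To make the next-word objective concrete, here is a toy sketch of how a sentence turns into (previous words, next word) training pairs. The function name `next_word_pairs`, the whitespace tokenization, and the fixed `context_size` window are all assumptions for illustration. The pretrained model used later is an LSTM with its own tokenizer and does not use a fixed window.

```python
def next_word_pairs(text, context_size=3):
    """Yield (context, target) pairs: the previous words and the word to predict."""
    tokens = text.lower().split()   # naive whitespace tokenization, for illustration only
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        yield context, tokens[i]

pairs = list(next_word_pairs('This movie was surprisingly good'))
for context, target in pairs:
    print(context, '->', target)
```

No sentiment label appears anywhere in these pairs, which is why the label column can be filled with a placeholder for the LM data.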

The LM can benefit from all the textual data and there is no need to exclude the unsup/unclassified movie reviews.

We first concatenate all the train (pos/neg/unsup = 75k) and test (pos/neg = 25k) reviews into one set of 100k reviews. Then we use sklearn's train_test_split to divide the 100k texts into 90% training and 10% validation sets.

In [11]:
trn_texts,val_texts = train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)

In [12]:
len(trn_texts), len(val_texts)

(90000, 10000)
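For intuition, the split above amounts to shuffling the pooled texts and holding out a tenth of them. The following numpy-only sketch shows that idea on 100 placeholder strings. The helper `split_90_10` is hypothetical and is not sklearn's implementation.

```python
import numpy as np

def split_90_10(items, test_size=0.1, seed=0):
    """Shuffle items and split them into (train, valid) arrays."""
    items = np.asarray(items)
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(items))
    n_valid = int(round(len(items) * test_size))   # size of the held-out share
    return items[idx[n_valid:]], items[idx[:n_valid]]

all_texts = np.array([f'review {i}' for i in range(100)])
demo_trn, demo_val = split_90_10(all_texts)
print(len(demo_trn), len(demo_val))
```

The two parts are disjoint and together cover every input text, the same property the 90000/10000 split above relies on.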

In [16]:
df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(CSV_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CSV_PATH/'valid.csv', header=False, index=False)

Now, let's save these CSVs as a dataset. Be sure to create a dataset with the same name used below in Floyd's dataset tab, and don't forget your username in the data init path. First, let's confirm that classes.txt is already in this directory.

In [17]:
! cat {CSV_PATH}/classes.txt

neg
pos
unsup


### Download the pretrained wikitext language model
Let's set this up as a separate data source for transfer learning later.

In [19]:
model_path = CSV_PATH/'models'
model_path.mkdir(exist_ok=True)
url = 'http://files.fast.ai/models/wt103_v1/'
download_url(f'{url}lstm_wt103.pth', model_path/'lstm_wt103.pth')
download_url(f'{url}itos_wt103.pkl', model_path/'itos_wt103.pkl')


Don't forget to create a dataset with the name below before running the next cell.

In [20]:
%cd {CSV_PATH}
username = 'frame'
!floyd data init {username}/imdb_reviews_wt103
!floyd data upload

/floyd/home/data/csv
Data source "frame/imdb_reviews_wt103" initialized in current directory

    You can now upload your data to Floyd by:
        floyd data upload
    
Compressing data...
Making create request to server...
Initializing upload...
Uploading compressed data. Total upload size: 273.3MiB
Removing compressed data...
Upload finished.
Waiting for server to unpack data.
You can exit at any time and come back to check the status with:
	floyd data upload -r
Waiting for unpack......

NAME
-----------------------------------
frame/datasets/imdb_reviews_wt103/1


### Next up
Finished! Now try running the '00_start_here' workspace to spin off jobs.