In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
import torch
import fastai
from fastai import text

In [0]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('device:', device)

fastai.core.defaults.device = torch.device(device)

if device == 'cuda':
    torch.backends.cudnn.benchmark = True

device: cuda


In [0]:
path = text.untar_data(text.URLs.IMDB_SAMPLE)
path.ls()

[PosixPath('/root/.fastai/data/imdb_sample/data_save.pkl'),
 PosixPath('/root/.fastai/data/imdb_sample/texts.csv')]

In [0]:
df = text.pd.read_csv(path/'texts.csv')
df.head()

,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [0]:
df.iloc[1]

label                                                positive
text        This is a extremely well-made film. The acting...
is_valid                                                False
Name: 1, dtype: object

## Tokenization

The first processing step is to split the raw sentences into words, or more precisely, tokens. The simplest approach would be to split the string on spaces, but we can be smarter:

* we need to take care of punctuation
* some words are contractions of two different words, like isn't or don't
* we may need to clean some parts of our texts, if there's HTML code for instance
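
As a rough illustration of these three points, here is a minimal sketch of a rule-based tokenizer written with the standard library only. It is not fastai's tokenizer: the function name `simple_tokenize` and the specific regular expressions are assumptions made for this example, and a real tokenizer handles many more cases.

```python
import html
import re

def simple_tokenize(text):
    """Toy tokenizer (hypothetical helper, not part of fastai)."""
    # Clean-up: replace HTML tags with a space, then decode entities like &amp;.
    text = re.sub(r'<[^>]+>', ' ', text)
    text = html.unescape(text)
    # Contractions: split "n't" off its verb, so "isn't" becomes "is n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Contractions: split other clitics, so "you're" becomes "you 're".
    text = re.sub(r"(\w)'(s|re|ve|ll|d|m)\b", r"\1 '\2", text)
    # Punctuation: every non-word, non-space character becomes its own token.
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)

print(simple_tokenize("It isn't <b>bad</b>, but you're late!"))
# ['It', 'is', "n't", 'bad', ',', 'but', 'you', "'re", 'late', '!']
```

Splitting on spaces alone would have produced `bad</b>,` and `late!` as single tokens, which is exactly what the rules above avoid.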

To see what the tokenizer has done behind the scenes, let's look at a few texts in a batch.

In [0]:
data = text.TextDataBunch.from_csv(path, 'texts.csv')
data.save()

In [0]:
data = text.load_data(path)
data.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj",negative
"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj sydney , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n \n xxmaj it 's usually satisfying to watch a film director change his style /",negative
"xxbos xxmaj this film sat on my xxmaj tivo for weeks before i watched it . i dreaded a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj yorkers . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj ronde",positive
"xxbos i really wanted to love this show . i truly , honestly did . \n \n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance",negative
"xxbos \n \n i 'm sure things did n't exactly go the same way in the real life of xxmaj homer xxmaj hickam as they did in the film adaptation of his book , xxmaj rocket xxmaj boys , but the movie "" xxmaj october xxmaj sky "" ( an xxunk of the book 's title ) is good enough to stand alone . i have not read xxmaj",positive


In [0]:
len(data.vocab.itos)

8776

In [0]:
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

In [0]:
data.train_ds[0][0]

Text xxbos xxmaj this movie is one of the most wildly distorted portrayals of history . xxmaj horribly inaccurate , this movie does nothing to honor the hundreds of thousands of xxmaj dutch , xxmaj british , xxmaj chinese , xxmaj american and xxunk enslaved xxunk that the sadistic xxmaj japanese killed and tortured to death . xxmaj the bridge was to be built " over the bodies of the white man " as stated by the head xxmaj japanese xxunk . xxmaj it is disgusting that such xxunk horrors committed by the xxmaj japanese captors is the source of a movie , where the bridge itself , is n't even close to accurate to the actual bridge . xxmaj the actual bridge was built of steel and concrete , not wood . xxmaj what of the survivors who are still alive today ? xxmaj they hate the movie and all that it is supposed to represent . xxmaj their friends were starved , tortured , and murdered by cruel xxunk . xxmaj those that did n't die of xxunk , xxunk , or disease are deeply hurt by the movie that m

In [0]:
data.train_ds[0][0].data[:10]

array([   2,    5,   20,   28,   16,   44,   14,    9,  110, 3749])
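
The integers above line up with the vocabulary: index 2 is `xxbos`, index 5 is `xxmaj`, and index 9 is `the`, matching the start of the text shown earlier. The mapping can be sketched in plain Python as below. This is a toy illustration under stated assumptions, not fastai's implementation: the helper names `numericalize` and `textify` and the fallback to index 0 for unknown tokens are choices made for this example, and only the first ten vocabulary entries from the output above are used.

```python
# First ten vocabulary entries, copied from the data.vocab.itos[:10] output.
itos = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
        'xxmaj', 'xxup', 'xxrep', 'xxwrep', 'the']

# Reverse lookup from token string to integer index.
stoi = {token: index for index, token in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to ids; tokens missing from the vocab fall back to 0 ('xxunk')."""
    return [stoi.get(token, 0) for token in tokens]

def textify(ids):
    """Map ids back to a space-separated string of tokens."""
    return ' '.join(itos[i] for i in ids)

# 'bridge' is not in this ten-entry toy vocabulary, so it maps to 'xxunk'.
ids = numericalize(['xxbos', 'xxmaj', 'the', 'bridge'])
print(ids)           # [2, 5, 9, 0]
print(textify(ids))  # xxbos xxmaj the xxunk
```

The round trip is lossy for out-of-vocabulary words, which is why `xxunk` appears in the batch shown earlier.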