# IMDB

An introduction to text analysis using IMDB movie reviews. Let's start by importing the modules we need from fastai.

In [2]:
from fastai import *
from fastai.text import *

First, let's download the dataset we are going to study. The [dataset](http://ai.stanford.edu/~amaas/data/sentiment/) was curated by Andrew Maas et al. and contains a total of 100,000 reviews from IMDB. 25,000 of them are labelled as positive or negative for training, and another 25,000 are labelled for testing (in both cases they are highly polarized). The remaining 50,000 are additional unlabelled reviews (but we will find a use for them nonetheless). Here we start with a small sample of it, `URLs.IMDB_SAMPLE`.

In [6]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[PosixPath('/home/anass/.fastai/data/imdb_sample/texts.csv')]

It's just a `csv` file, so we can open it with **pandas**:

In [10]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


The data contains the label in the *first* column and the **text** in the second. The third column, `is_valid`, determines whether the row belongs to the **validation** set or not.
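As a rough illustration of how the `is_valid` flag can be used, here is a minimal pandas sketch that splits a frame into training and validation rows. The toy rows are made up for the example (they are not taken from `texts.csv`), and the names `toy`, `train_df` and `valid_df` are only illustrative.

```python
import pandas as pd

# Toy rows (hypothetical, for illustration only) with the same three columns as texts.csv
toy = pd.DataFrame({
    "label": ["negative", "positive", "negative", "positive"],
    "text": ["bad film", "great film", "dull plot", "lovely cast"],
    "is_valid": [False, False, True, True],
})

# Rows flagged is_valid go to the validation set, the rest to the training set
train_df = toy[~toy["is_valid"]]
valid_df = toy[toy["is_valid"]]

print(len(train_df), len(valid_df))
```

fastai performs this split for us later on, so the sketch is only meant to show what the flag encodes.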

In [12]:
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

Let's create a `DataBunch` from the csv file:

In [15]:
data_lm = TextDataBunch.from_csv(path,'texts.csv')

A text is composed of words, and we can't apply mathematical functions to them directly. 

We first have to convert them to numbers. This is done in two different steps:
 * tokenization 
 * numericalization.
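To make the two steps concrete, here is a minimal pure-Python sketch. It is not fastai's tokenizer, which applies many more rules; the helpers `tokenize`, `build_vocab` and `numericalize` are hypothetical names introduced just for this example, and the two sentences are made up.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenization: lowercase, then split into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists, specials=("xxunk", "xxpad")):
    # Special tokens come first, then the corpus tokens from most to least frequent
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens missing from the vocabulary map to the id of "xxunk"
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in tokens]

# Made-up sentences, for illustration only
texts = ["The movie was great.", "The plot was thin."]
itos, stoi = build_vocab([tokenize(t) for t in texts])

# "ending" and "!" were never seen, so they become xxunk
ids = numericalize(tokenize("The ending was great!"), stoi)
print(ids)
print([itos[i] for i in ids])
```

The `itos` list (id to string) plays the same role as the `vocab.itos` attribute used further down.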
 
A `TextDataBunch` does all of that behind the scenes for you. Let's save the processed data so we don't have to redo this work:

In [18]:
data_lm.save()

In [23]:
data_lm.vocab.itos[:10]  # show the list of tokens

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']
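Most of the first entries are special tokens rather than words. As far as I understand fastai's conventions, `xxbos` marks the beginning of a text, `xxmaj` signals that the next word was capitalized, and `xxup` that it was written in all caps. The sketch below only mimics that idea; `mark_caps` is a hypothetical helper, not fastai's implementation, and the sentence is made up.

```python
def mark_caps(words):
    # Lowercase every word, but keep the casing information as a separate marker token
    out = []
    for w in words:
        if len(w) > 1 and w.isupper():
            out += ["xxup", w.lower()]   # word written entirely in capitals
        elif w[:1].isupper():
            out += ["xxmaj", w.lower()]  # word starting with a capital letter
        else:
            out.append(w)
    return out

# Made-up sentence, for illustration only
tokens = ["xxbos"] + mark_caps("This movie is GREAT".split())
print(tokens)
```

Keeping the casing as separate tokens lets the vocabulary store a single lowercase entry per word.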

Let's take a look at a single data element:

In [26]:
data_lm.train_ds[0][0].data[:10]  # an array of ids, each one the index of a token in the vocabulary

array([   2,    5,   82,  308,   12,   19, 2008,   21,   31,  103])

Let's do the same with the `data block` API:

In [32]:
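data = (TextList.from_csv(path,'texts.csv',cols='text')  # inputs come from the 'text' column
       .split_from_df(col=2)    # column 2 (is_valid) decides training vs. validation
       .label_from_df(cols=0)   # column 0 holds the label
       .databunch())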

For **NLP** problems, we must create two models:

* **Language model**: with word embeddings
* **Classifier**: with recurrent neural networks

In [33]:
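# create a language model; the architecture (`arch`) is a required argument
# AWD_LSTM is one of the architectures shipped with fastai.text
learn = language_model_learner(data, AWD_LSTM)

Without the architecture argument, the call fails with:

TypeError: language_model_learner() missing 1 required positional argument: 'arch'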