## Movie Review

This is for the ai-tampa-study-group effort to compete in the Kaggle competition:
https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/

The movie reviews have been downloaded into the data/reviews subdirectory.

This code is based upon the IMDB notebook from the 2018 Deep Learning 2 course (i.e., not the Deep Learning 1 notebook).

You can read some excellent notes on Lesson 10 here:  https://medium.com/@hiromi_suenaga/deep-learning-2-part-2-lesson-10-422d87c3340c

This notebook is not complete, but can provide a foundation for your efforts to use Lesson 10 for this competition.

In [1]:
from fastai.text import *
import html

In [2]:
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

PATH=Path('data/reviews/')

## Standardize format

In [3]:
CLAS_PATH=Path('data/review_clas/')
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH=Path('data/review_lm/')
LM_PATH.mkdir(exist_ok=True)

The Kaggle dataset has 5 classes of reviews. The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive


In [4]:
# Goal: trn_texts, trn_labels => an array of phrases and an array of labels, each label in {0, 1, 2, 3, 4}
# val_texts, val_labels => the same, built from the held-out rows

In [5]:
train_df = pd.read_table(PATH/'train.tsv')
len(train_df)

156060

In [6]:
train_df.head(10)

|   | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what... | 2 |
| 6 | 7 | 1 | of | 2 |
| 7 | 8 | 1 | escapades demonstrating the adage that what is... | 2 |
| 8 | 9 | 1 | escapades | 2 |
| 9 | 10 | 1 | demonstrating the adage that what is good for ... | 2 |


Split the training data into training and validation sets. Remember this technique.

In [7]:
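keep_pct = 0.1
# val_pct = min(0.01/keep_pct, 0.1), so roughly 10% of the rows are held out for validation
val_idxs = get_cv_idxs(len(train_df), val_pct=min(0.01/keep_pct, 0.1))
# split_by_idx returns one (val, trn) pair per array passed in; train_df is passed twice,
# so both pairs are identical and the second assignment simply repeats the first
((val_df,trn_df),(val_df,trn_df)) = split_by_idx(val_idxs, train_df, train_df)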
len(val_df),len(trn_df)

(15605, 140455)
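
The hold-out split above can be illustrated without fastai. The following is a minimal NumPy sketch of the idea, not the library's implementation; the helper name `holdout_split` and its parameters are made up for illustration.

```python
import numpy as np

def holdout_split(n_rows, val_pct=0.1, seed=42):
    """Return (val_idxs, trn_idxs) as disjoint arrays of row positions."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)      # random ordering of 0..n_rows-1
    n_val = int(n_rows * val_pct)           # number of rows to hold out
    return shuffled[:n_val], shuffled[n_val:]

val_idxs, trn_idxs = holdout_split(1000, val_pct=0.1)
print(len(val_idxs), len(trn_idxs))
```

Fixing the seed keeps the validation set identical between runs, which makes scores comparable across experiments.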

In [8]:
trn_df.head(5)

|   | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what... | 2 |


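Extract the text and label columns so we can shuffle them. Note that these are still pandas Series that keep the original row labels, not raw NumPy arrays.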

In [9]:
trn_texts = trn_df['Phrase']
trn_labels = trn_df['Sentiment']
val_texts  = val_df['Phrase']
val_labels = val_df['Sentiment']

Set up an index to randomly shuffle the train and validation sets.

In [10]:
np.random.seed(42)
trn_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))

In [11]:
trn_texts = trn_texts[trn_idx]
val_texts = val_texts[val_idx]

trn_labels = trn_labels[trn_idx]
val_labels = val_labels[val_idx]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]


In [12]:
trn_texts.head(5)

98685                          an emotional level , funnier
23274                                                   NaN
82478     'll be shaking your head all the way to the cr...
96836                                 , inspirational drama
131881    got to be a more graceful way of portraying th...
Name: Phrase, dtype: object
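
The NaN in the output above is most likely a side effect of indexing a pandas Series with a permutation: `series[idx]` looks rows up by label, and the labels that went to the validation set no longer exist in `trn_texts`. The warning's own suggestion, `.reindex()`, reproduces that behavior. A minimal sketch with toy data (the Series below is made up for illustration) shows the difference between label-based and position-based selection:

```python
import numpy as np
import pandas as pd

# Toy Series whose index has a gap (label 1 is missing), like trn_texts after the split.
s = pd.Series(["a", "b", "c"], index=[0, 2, 3])
perm = np.array([2, 0, 1])

by_label = s.reindex(perm)      # label lookup: label 1 does not exist, so it becomes NaN
by_position = s.iloc[perm]      # positional lookup: always valid for a permutation of range(len(s))
as_array = s.to_numpy()[perm]   # same shuffle applied to a raw NumPy array

print(by_label.isna().sum())
print(list(by_position))
```

In the notebook this would mean shuffling with `trn_texts.iloc[trn_idx]` or converting with `.values` first. That is a suggestion, not something the original code does.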

Preparing the data so we can write out the file in the standard Fast.ai NLP data file format.

In [13]:
col_names = ['labels','text']
df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)

The pandas dataframe stores the text data in a newly evolving standard format: label columns first, followed by one or more text columns. This was influenced by a paper by Yann LeCun (LINK REQUIRED). Fastai adopts this new format for NLP datasets. As with IMDB, there is only one text column here.

In [14]:
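# 0 - negative 1 - somewhat negative 2 - neutral 3 - somewhat positive 4 - positive
CLASSES = ['neg', 'somewhat_neg', 'neutral', 'somewhat_pos', 'pos']

# Keep every row: in the IMDB notebook label 2 marked unlabelled reviews and was filtered out,
# but in this dataset 2 is the 'neutral' class and must stay in the training data.
df_trn.to_csv(CLAS_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False)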

(CLAS_PATH/'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)

We start by creating the data for the language model (LM). The LM's goal is to learn the structure of the English language. It learns language by trying to predict the next word given a set of previous words (n-grams). Since the LM does not classify reviews, the labels can be ignored.
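
To make the next-word objective concrete, here is a toy sketch in plain Python (not the fastai data loader) of how a token sequence yields input/target pairs shifted by one position. The example sentence is made up.

```python
tokens = "the movie was surprisingly good".split()

# Each input token is paired with the token that follows it.
inputs = tokens[:-1]
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x} -> {y}")
```

Because the targets come from the text itself, no sentiment labels are needed at this stage.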

The LM can benefit from all the textual data, so there is no need to exclude any phrases. In the original IMDB notebook this meant keeping the unsup/unclassified movie reviews as well.

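In the original IMDB notebook, all the train (pos/neg/unsup = 75k) and test (pos/neg = 25k) reviews are concatenated into one big chunk of 100k reviews. Here we concatenate the training and validation phrases from above (156,060 in total) and then use the sklearn splitter to divide them into 90% training and 10% validation sets.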

TODO: Go back and load the real test set into a dataframe. Not sure why Jeremy changed the test data into validation data.

In [15]:
trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)

In [16]:
len(trn_texts), len(val_texts)

(140454, 15606)

In [17]:
df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)

## Language model tokens

In this section, we start cleaning up the messy text. There are 2 main activities we need to perform:

1. Clean up extra spaces, tab characters, newline characters and other characters, and replace them with standard ones.
2. Use the awesome spacy library to tokenize the data. Since spacy does not provide a parallel/multicore version of the tokenizer, the fastai library adds this functionality. This parallel version uses all the cores of your CPUs and runs much faster than the serial version of the spacy tokenizer.

Tokenization is the process of splitting the text into separate tokens so that each token can be assigned a unique index. This means we can convert the text into integer indexes our models can use.
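
As a rough illustration of the token-to-index idea, the sketch below builds a vocabulary from two short phrases. Whitespace splitting stands in for the spacy tokenizer, and the names `itos` and `stoi` are illustrative choices rather than anything defined in this notebook.

```python
from collections import Counter

texts = ["a series of escapades", "a series"]

# Whitespace splitting stands in for the spacy tokenizer here.
tokenized = [t.split() for t in texts]

# Most frequent tokens get the lowest indexes; ties keep first-seen order.
freq = Counter(tok for doc in tokenized for tok in doc)
itos = [tok for tok, _ in freq.most_common()]      # index -> token
stoi = {tok: i for i, tok in enumerate(itos)}      # token -> index

ids = [[stoi[tok] for tok in doc] for doc in tokenized]
print(ids)
```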

We use an appropriate chunk size because the tokenization process is memory intensive.

In [18]:
chunksize=24000
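
For readers unfamiliar with chunked reading, here is a small sketch of what passing `chunksize` to `pd.read_csv` does. The in-memory CSV and the chunk size of 2 are made up for illustration; the notebook itself uses 24000.

```python
import io
import pandas as pd

# Five label,text rows standing in for the CSV files written earlier.
csv_data = io.StringIO("0,first\n0,second\n0,third\n0,fourth\n0,fifth\n")

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_csv(csv_data, header=None, chunksize=2)
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```

Each chunk can then be tokenized on its own, so only one chunk's worth of text has to be in memory at a time.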

In [19]:
re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

In [20]:
def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = texts.apply(fixup).values.astype(str)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

In [21]:
def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels

In [22]:
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)

In [23]:
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)

0


OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
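
The error above comes from spaCy: the tokenizer asks for a model named 'en', which is not installed in this environment. With spaCy 2.x the usual remedy is to run `python -m spacy download en`; spaCy 3 dropped shortcut names, so the package name `en_core_web_sm` is used there instead. The following is a small sketch for checking which name loads. The helper `first_loadable_model` is hypothetical and not part of spaCy or fastai.

```python
def first_loadable_model(names=("en", "en_core_web_sm")):
    """Return (name, nlp) for the first spaCy model that loads, or (None, None)."""
    try:
        import spacy
    except ImportError:
        return None, None  # spaCy itself is not installed
    for name in names:
        try:
            return name, spacy.load(name)
        except OSError:
            # Raised when no model can be found under this name.
            continue
    return None, None

model_name, nlp = first_loadable_model()
print(model_name)
```

If this prints `None`, the tokenization cell will keep failing until a model is downloaded.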