# Movie Review Classification: A Step-by-Step Tutorial

In this document, we will go through the steps of training a classification algorithm on textual data. The process includes the following steps:
<ol>
<li>Loading and exploring the data using Pandas</li>
<li>Preprocessing the data using regular expressions (Python's <code>re</code> module)</li>
<li>Using TF-IDF to create document vectors</li>
<li>
    Training a classifier using Sklearn
    <ol>
      <li>Logistic Regression as a baseline classifier </li>
      <li>Searching for the best hyperparameters with Grid Search</li>
      <li>Wrapping the model for a human-friendly interface</li>
    </ol>
</li>
</ol>

# I - Loading Dataset

We load the IMDB dataset in the next cells.

In [5]:
import pandas as pd

In [6]:
# loading the dataset in a pandas DataFrame is as simple as a single line of code
df = pd.read_csv("IMDB Dataset.csv")

In [7]:
print(df)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [8]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


#### Printing the beginning of the DataFrame and the first review to make sure we loaded the data correctly

In [9]:
print(df[:5])

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [10]:
text = df.review[0]  # equivalent to df['review'][0]
print(df.sentiment[0])
print(text)

positive
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due t

# II - Preprocessing

## II.1 - Char filters

In [11]:
import re

In [12]:
def remove_html_tags(text):
    """Returns text without HTML tags
    
    Args:
        text (str): input text to clean
    
    Returns:
        str: cleaned text
    """
    # raw string, so that \w reaches the regex engine without an invalid-escape warning
    return re.sub(r'<\w{2,6}( /)?>', '', text)

def remove_repetition(text):
    """Returns text where character repetitions are bounded.
    
    Any character other than a dot repeated 3 or more times is reduced to 2 occurrences,
    runs of 4 or more dots become '...', and runs of whitespace are collapsed to a single space.
    
    Args:
        text (str): input text to clean
    
    Returns:
        str: cleaned text
    """
    # handling any character other than a dot (letters, digits, punctuation, blank spaces)
    text = re.sub(r'([^\.])(\1){2,}', r'\1\1', text)
    
    # handling dots
    text = re.sub(r'\.{4,}', r'...', text)
    
    # handling blank spaces
    text = re.sub(r'\s{2,}', ' ', text)
    
    return text.strip()  # strip removes leading and trailing white spaces


def lowercase(text):
    return text.lower()

def char_filter(text):
    """Returns a clean version of the text using the 3 functions above
    
    Args:
        text (str): input raw text
    
    Returns:
        str: cleaned text
    """
    text = lowercase(text)
    text = remove_html_tags(text)
    text = remove_repetition(text)
    return text

In [13]:
text


"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [14]:
char_filter

<function __main__.char_filter(text)>

## Test your code

In [15]:
from test.main import simple_test_char_filter

In [16]:
simple_test_char_filter(char_filter)

All test succeeded - well done !
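
To make the effect of each step concrete, here is a small self-contained run of the same three filters on a made-up review. The function bodies repeat the ones defined above (written with raw strings); the `sample` string is invented for illustration only.

```python
import re

def remove_html_tags(text):
    # a tag name of 2 to 6 word characters, optionally followed by " /"
    return re.sub(r'<\w{2,6}( /)?>', '', text)

def remove_repetition(text):
    text = re.sub(r'([^\.])(\1){2,}', r'\1\1', text)  # 3+ repeats of a non-dot character -> 2
    text = re.sub(r'\.{4,}', r'...', text)            # 4+ dots -> '...'
    text = re.sub(r'\s{2,}', ' ', text)               # runs of whitespace -> one space
    return text.strip()

def char_filter(text):
    # same order as above: lowercase, strip tags, bound repetitions
    return remove_repetition(remove_html_tags(text.lower()))

sample = "  I LOOOOVED it!!!!<br /><br />Sooooo   good.....  "  # made-up review
print(char_filter(sample))  # prints: i looved it!!soo good...
```

Note that "it!!" and "soo" end up glued together: the tags are replaced by an empty string, so two sentences separated only by `<br />` tags are merged. Whether that matters depends on the tokenizer used afterwards.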


## II.2 - Tokenization

### Easy tokenization using split

In [17]:
## Easiest tokenization: split on white spaces

tokens = text.split(' ')
print(tokens[:20]) # problems with "you'll", "hooked.", "right,"

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'Oz', 'episode', "you'll", 'be', 'hooked.', 'They', 'are', 'right,']


### Using regular expression

In [18]:
import re

In [19]:
# Creating the regular expression and compiling it
# compiling the regular expression makes the operation faster when used multiple times
# (raw string, so that the backslashes reach the regex engine untouched)
re_tokens = re.compile(r"'\w+|\w+|[.,?!:;]+")

tokens = re_tokens.findall(text)
print(tokens[:20])

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'Oz', 'episode', 'you', "'ll", 'be', 'hooked', '.', 'They']
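
The difference between the two approaches is easier to see on a short sentence. The sketch below uses the same pattern as above on a made-up sentence: the first alternative (`'\w+`) captures clitics such as `'ll`, the second captures plain words, and the third groups consecutive punctuation marks into a single token.

```python
import re

# same pattern as above, written as a raw string
re_tokens = re.compile(r"'\w+|\w+|[.,?!:;]+")

sentence = "Don't stop, you'll love it!!!"  # made-up sentence

split_tokens = sentence.split(' ')
regex_tokens = re_tokens.findall(sentence)

print(split_tokens)  # punctuation stays attached to the words
print(regex_tokens)  # clitics and punctuation become separate tokens
```

With this pattern, "Don't" is cut into `Don` and `'t`, which differs from NLTK's usual `Do` / `n't` split, so vocabularies built with the two tokenizers will not be identical.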


### Using NLTK

In [20]:
from nltk.tokenize import word_tokenize

In [21]:
# Using off-the-shelf tokenizer we can tokenize a text in one line of code 
tokens = word_tokenize(text, preserve_line=True) # preserve_line = True is used to prevent sentence tokenization
print(tokens[:20])

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'Oz', 'episode', 'you', "'ll", 'be', 'hooked.', 'They', 'are']


## II.3 - Putting everything together

We now have the building blocks to create our cleaned and tokenized dataset. We will store the corpus in a list, each item being another list containing the tokens of one document. We will also store the associated labels: 1 for positive and 0 for negative.

In [23]:
from nltk import word_tokenize
def create_dataset(df):
    """Create the dataset from the pandas DataFrame
    
    Args:
        df (pd.DataFrame): raw dataset
    Returns:
        list[list[str]]: cleaned & tokenized texts
        list[int]: labels. 1 -> positive; 0 -> negative
    """
    documents = []
    labels= []
    for i in range(0, len(df)):
        if i % 1000 == 0:
            print(i)
        text = df.review[i]
        text = char_filter(text)
        tokens = word_tokenize(text, preserve_line=True)
        label = 1 if df.sentiment[i] == "positive" else 0
        documents.append(tokens)
        labels.append(label)
    return documents, labels

In [24]:
documents, labels = create_dataset(df)

# test
print(documents[0][:20])
print(labels[0])

0
1000
2000
...
49000
['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'oz', 'episode', 'you', "'ll", 'be', 'hooked.', 'they', 'are']
1


### Let's have a look at our vocabulary size so far 

In [26]:
import time

In [27]:
# version with for loops
start = time.time()

vocabulary = set()
for doc in documents:
    for token in doc:
        vocabulary.add(token)
print("vocabulary size of the overall dataset is " + str(len(vocabulary))) # should output something like 230k but it depends on the tokenizer used

end = time.time()
print("elapsed time : {}s".format(end - start))

vocabulary size of the overall dataset is 233923
elapsed time : 1.8916151523590088s


In [28]:
# preferred version using list comprehension
start = time.time()

vocabulary = set([token for document in documents for token in document])
print("vocabulary size of the overall dataset is " + str(len(vocabulary))) # should output something like 230k but it depends on the tokenizer used

end = time.time()
print("elapsed time : {}s".format(end - start))

vocabulary size of the overall dataset is 233923
elapsed time : 1.2161848545074463s


### Reducing the vocabulary size using a threshold 

In [77]:
def find_most_common_words(documents, min_count=5):
    """Returns the words that appear at least *min_count* times in the dataset.
    The function returns a dictionary where keys are the words to keep and values are the ids of those words.
    In order to save space, the most frequent words should have the smallest ids.
    
    Args:
        documents (list[list[str]]): list of documents
        min_count (int, optional): min count for a token to keep it in the vocabulary
    
    Returns:
        dict[str, int]: vocabulary. Keys are words and values are the ids of the words.
    """
    
    # Count every token with the WordCounter class defined in the next cell,
    # then assign ids in decreasing order of frequency
    wc = WordCounter()
    for doc in documents:
        wc.add(doc)
    tokens = wc.get_sorted_tokens()

    mapping = {}
    for token,count in tokens:
        if count < min_count:
            return mapping
        mapping[token] = len(mapping) 
    return mapping

In [58]:
class WordCounter:
    
    def __init__(self):
        self.counts = {} # dictionary containing frequencies
    
    def add(self, tokens):
        """Add a list of tokens to the counter
        
        Args:
            tokens (list[str]): list of tokens
        """
        for token in tokens:
            self.counts[token] = self.counts.get(token, 0) + 1
            
    def __getitem__(self, key):
        return self.counts.get(key, 0)
    
    def get_sorted_tokens(self):
        """Returns the list of tokens sorted by frequency
        
        Returns:
            list[(str, int)]: list of tuples (token, count) in decreasing order
        """
        token_freq = []
        for key, value in self.counts.items():
            token_freq.append((key, value))
        token_freq.sort(key=lambda x: x[1], reverse=True)
        return token_freq 

In [79]:
#example: to count words


docs = [["the","cat","ate","the","mouse"],["the","mouse","ate","cheese"]]
wc = WordCounter()
for doc in docs:
    wc.add(doc)
wc.get_sorted_tokens()


[('the', 3), ('ate', 2), ('mouse', 2), ('cat', 1), ('cheese', 1)]

In [78]:
min_count = 5
mapping = find_most_common_words(documents, min_count)

print("vocabulary size after reduction : " + str(len(mapping)))  # should output 50683

print(mapping["the"])  # should output 0
print(mapping["movie"])  # should output 18
print(mapping["great"])  # should output 97

vocabulary size after reduction : 50683
0
18
97
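
For comparison, the same vocabulary can be built with `collections.Counter` from the standard library instead of the hand-written `WordCounter`. This is only a sketch of an alternative: the function name `find_most_common_words_counter` is made up for this example, and the ids it assigns to words with equal counts may differ from the ones printed above.

```python
from collections import Counter

def find_most_common_words_counter(documents, min_count=5):
    """Hypothetical variant of find_most_common_words built on collections.Counter."""
    counts = Counter(token for doc in documents for token in doc)
    mapping = {}
    # most_common() yields (token, count) pairs in decreasing order of count
    for token, count in counts.most_common():
        if count < min_count:
            break  # every remaining token is rarer than min_count
        mapping[token] = len(mapping)  # the next free id
    return mapping

docs = [["the", "cat", "ate", "the", "mouse"], ["the", "mouse", "ate", "cheese"]]
toy_mapping = find_most_common_words_counter(docs, min_count=2)
print(toy_mapping)
```

On the toy corpus, "cat" and "cheese" appear only once and are dropped, while "the", the most frequent word, receives id 0.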


# III - TF-IDF Representations

In [85]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [88]:
vectorizer = TfidfVectorizer()
corpus = list(df.review.apply(lambda text: char_filter(text)))
vectors = vectorizer.fit_transform(corpus)

In [89]:
print(type(vectors))

<class 'scipy.sparse.csr.csr_matrix'>


In [91]:
print(vectors.shape)

(50000, 103580)


In [95]:
def split_in_tokens(text):
    return re_tokens.findall(text)

In [97]:
vectorizer = TfidfVectorizer(vocabulary=mapping, tokenizer=split_in_tokens)

In [98]:
vectors = vectorizer.fit_transform(corpus)

In [99]:
print(type(vectors))
print(vectors.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(50000, 50683)


In [103]:
from sklearn.decomposition import TruncatedSVD

In [None]:
dim = 100
reducer = TruncatedSVD(n_components=dim)  # Python has no `new` keyword


In [34]:
import os
import math
import pickle
from joblib import dump, load
import numpy as np
from scipy import sparse as sp
from collections import Counter
from sklearn.decomposition import TruncatedSVD

In [35]:
class TfIdfVectorizer:

    def __init__(self, vocabulary, stopwords, documents):
        
        # initializing punctuation and stop-words
        self.stopwords = set(stopwords)
        self.punctuation = set(['?', '!', ',', '.', ';', ')', '('])
        self.vocabulary = vocabulary
        
        self.tf = []  # list: element *i* in tf should contain a word Counter associated with the document *i*
        self.doc_sizes = []  # list: element *i* of doc_sizes should contain the token size of document *i*
        self.idf = {}  # dict: key are words, values are number of documents in which key was encountered
        self.size = len(documents)
        
        self.dim_reducer = None  # will be defined later
        
        count = 0
        for document in documents:
            count += 1
            if count % 100 == 0:
                self.print_progress(count)
            self.add_document(document)
    
    def print_progress(self, i):
        progress = i * 50 // self.size
        if progress == 0:
            print('[{}]   {}k documents added'.format('-'*50, i // 1000), end= '\r')
        elif progress < 50:
            print('[{}>{}]   {}k documents added'.format('='*(progress - 1), '-'*(50 - progress), i // 1000), end='\r')
        else:
            print('[{}]   {}k documents added'.format('='*50, i // 1000), end= '\r')
    
    def add_document(self, tokens):
        """Add a single document to the collection
        
        Args:
            tokens (list[str]): document represented as a list of tokens
        """
        # First step: filtering the tokens
        tokens = self.filter_tokens(tokens)
        
        # Second Step: counting tokens in document
        counter = Counter(tokens)
        
        # updating tf and doc_sizes scores
        self.tf.append(counter)
        self.doc_sizes.append(len(tokens))
        
        # updating IDF scores
        # TODO HERE : update the inverse document frequency here
        
    def filter_tokens(self, tokens):
        """Returns the list of tokens belonging to the vocabulary, without punctuation and stopwords
        
        Args:
            tokens (list[str]): input list of tokens
        
        Returns:
            list[str]: the cleaned list of tokens
        """
        # TODO HERE : fill the function to clean the input tokens by removing out-of-vocabulary tokens, punctuations and stop-words
        pass
        
        
    def get_tf_idf_score(self, doc_id, token):
        """Compute the TF-IDF score of the given token in the given document
        
        Args:
            doc_id (int): id of the document
            token (str): token on which tf-idf will be computed
        
        Returns:
            float: tf-idf score of token in document
        """
        # TODO HERE : implement the TF-IDF scoring function


    def compute_tf_idf_matrix_sparse(self, dim=100):
        """Compute the TF-IDF matrix of the corpus and perform dimensionality reduction
        
        Args:
            dim (int, optional): output vector dimension
        
        Returns:
            np.array: Shape (corpus_size, dim). The document vectors
        """
        row = []
        col = []
        data = []
        for i in range(len(self.tf)):
            for key in self.tf[i]:
                row.append(i)
                col.append(self.vocabulary[key])
                data.append(self.get_tf_idf_score(i, key))
        m = sp.coo_matrix((data, (row, col)), shape=(self.size, len(self.vocabulary)))
        self.dim_reducer = TruncatedSVD(dim)
        output = self.dim_reducer.fit_transform(m)
        return output
    
    def vectorize_document(self, tokens):
        """Returns the tf-idf vector of a new document
        
        Args:
            tokens (list[str]): document to vectorize
            
        Returns:
            np.array: tf-idf vector of the input document after dimensionality reduction (TruncatedSVD)
        """
        # TODO HERE : starting from the function compute_tf_idf_matrix_sparse above
        # fill the function that will vectorize a new document
          
    def save(self, folder):
        if not os.path.isdir(folder):
            os.makedirs(folder)
        # the vocabulary is saved as well: without it, a reloaded vectorizer
        # cannot map tokens to matrix columns
        pkl_dump(self.vocabulary, os.path.join(folder, 'vocabulary.pkl'))
        pkl_dump(self.tf, os.path.join(folder, 'tf.pkl'))
        pkl_dump(self.idf, os.path.join(folder, 'idf.pkl'))
        pkl_dump(self.doc_sizes, os.path.join(folder, 'doc_sizes.pkl'))
        pkl_dump(self.stopwords, os.path.join(folder, 'stopwords.pkl'))
        pkl_dump(self.punctuation, os.path.join(folder, 'punctuation.pkl'))
        dump(self.dim_reducer, os.path.join(folder, 'dim_reducer.pkl'))


def pkl_dump(obj, file):
    with open(file, 'wb') as f:
        pickle.dump(obj, f)


def pkl_load(file):
    with open(file, 'rb') as f:
        return pickle.load(f)


def reload_vectorizer(folder):
    # start from an empty vectorizer, then restore every saved attribute
    result = TfIdfVectorizer([], [], [])
    result.vocabulary = pkl_load(os.path.join(folder, 'vocabulary.pkl'))
    result.tf = pkl_load(os.path.join(folder, 'tf.pkl'))
    result.idf = pkl_load(os.path.join(folder, 'idf.pkl'))
    result.doc_sizes = pkl_load(os.path.join(folder, 'doc_sizes.pkl'))
    result.size = len(result.tf)
    result.dim_reducer = load(os.path.join(folder, 'dim_reducer.pkl'))
    result.stopwords = pkl_load(os.path.join(folder, 'stopwords.pkl'))
    result.punctuation = pkl_load(os.path.join(folder, 'punctuation.pkl'))
    return result
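
The TODOs in the class above leave the exact weighting open. As a reference point, here is a minimal standalone sketch of one common TF-IDF variant, assuming the term frequency is the token count divided by the document size and the inverse document frequency is `log(N / df)`. The helper names `build_statistics` and `tf_idf_score` are invented for this example and are not part of the class; other weightings (smoothing, normalisation) are equally valid.

```python
import math
from collections import Counter

def build_statistics(documents):
    """Collect per-document counts, document sizes and document frequencies."""
    tf, doc_sizes, doc_freq = [], [], Counter()
    for tokens in documents:
        counter = Counter(tokens)
        tf.append(counter)
        doc_sizes.append(len(tokens))
        doc_freq.update(counter.keys())  # a document counts once per distinct token
    return tf, doc_sizes, doc_freq

def tf_idf_score(tf, doc_sizes, doc_freq, doc_id, token):
    """Assumed weighting: (count / document size) * log(N / document frequency)."""
    if doc_sizes[doc_id] == 0 or doc_freq[token] == 0:
        return 0.0  # empty document or token never seen in the corpus
    term_frequency = tf[doc_id][token] / doc_sizes[doc_id]
    inverse_doc_frequency = math.log(len(tf) / doc_freq[token])
    return term_frequency * inverse_doc_frequency

# made-up corpus of three tokenized documents
docs = [["great", "movie"], ["bad", "movie"], ["great", "great", "cast"]]
tf, doc_sizes, doc_freq = build_statistics(docs)

score_movie = tf_idf_score(tf, doc_sizes, doc_freq, 0, "movie")  # frequent word, low idf
score_cast = tf_idf_score(tf, doc_sizes, doc_freq, 2, "cast")    # rare word, high idf
print(score_movie, score_cast)
```

Under this weighting a token that appears in every document gets a score of 0, since `log(N / N) = 0`.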

In [None]:
stopwords = set(['the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'this', 'that', 'was', 'as', 'with', 'for', 'on', 'at'])

# adding the collection of documents to the TfIdfVectorizer object
tfidf = TfIdfVectorizer(mapping, stopwords, documents)

# computing the tf-idf matrix
dim = 100
matrix = tfidf.compute_tf_idf_matrix_sparse(dim)

In [None]:
print(matrix.shape) # should output (50000, 100)

### Saving the document matrix with associated labels

In [None]:
# saving vectorizer (optional)
tfidf.save('data/vectorizer_model_{}_{}'.format(dim, min_count))

In [None]:
X = np.concatenate([matrix, np.array(labels).reshape(-1,1)], axis=1)
np.save('data/doc_vectors_{}_{}.npy'.format(dim, min_count), X)

# IV - Classification Algorithms

### Loading document vectors and creating train and test sets

In [None]:
import numpy as np

In [None]:
def load_train_test(file, prop_test=0.2):
    """load the document vectors and split the np.array in train and test sets using the given proportion
    
    Args: 
        file (str): file containing the numpy array
        prop_test (float): proportion of data to keep for validating the model
    
    Returns:
        np.array: X_train, training document vectors
        np.array: Y_train, training labels associated with training data
        np.array: X_test, test document vectors
        np.array: Y_test, test labels associated with test data
    """
    if prop_test < 0 or prop_test > 1:
        raise ValueError("proportion of test data is not valid")
    
    X = load_and_shuffle(file)
    boundary = int(X.shape[0] * (1 - prop_test))
    X_train = X[:boundary, :-1]
    Y_train = X[:boundary, -1]
    X_test = X[boundary:, :-1]
    Y_test = X[boundary:, -1]
    return X_train, Y_train, X_test, Y_test

def load_and_shuffle(file):
    X = np.load(file)
    np.random.shuffle(X)
    return X

### Baseline Model: Logistic Regression

To get a sense of the difficulty of the task, it is good practice to build a simple baseline that more complex models can be compared against. For a binary classification task such as ours, logistic regression is one of the simplest models we can use.

To perform a logistic regression we need an annotated corpus $\mathcal{C} = (\mathbf{x}_i, y_i)_{i=1 \ldots N}$ where
<ul>
  <li>$\mathbf{x}_i = [x_{i,1}, \ldots, x_{i,d}]$ is a feature vector; in our case $\mathbf{x}_i \in \mathbb{R}^d$ is the TF-IDF document vector</li>
  <li>$y_i \in \{0, 1\}$ is the (binary) label associated with example $i$; in our case it corresponds to the sentiment of the document (positive = 1, negative = 0)</li>
</ul>

In the logistic regression setting, we model the probability of obtaining the label 1 given a feature vector $\mathbf{x}$ as:
$$
p_{\theta}(y=1\mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{\theta}^T\mathbf{x}}}
$$
so that $p_{\theta}(y=0\mid \mathbf{x}) = 1 - p_{\theta}(y=1\mid \mathbf{x})$. The model is trained on the annotated corpus: the goal is to find the parameter vector $\mathbf{\theta}$ for which the probability of the corpus
$$
p(\mathcal{C}) = \prod_{i = 1}^{N}p_{\theta}(y_i \mid \mathbf{x}_i)
$$
is maximal.
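
The two formulas can be checked numerically with a few lines of plain Python. The sketch below evaluates the logarithm of the corpus probability (maximizing the product is equivalent to maximizing the sum of the logs) for two candidate parameter vectors. The function names and the tiny two-dimensional corpus are made up for this illustration; scikit-learn's solvers additionally apply regularization by default.

```python
import math

def predict_proba(theta, x):
    """p_theta(y=1 | x) = 1 / (1 + exp(-theta . x))"""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood(theta, corpus):
    """Sum over the corpus of log p_theta(y_i | x_i)."""
    total = 0.0
    for x, y in corpus:
        p = predict_proba(theta, x)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# made-up corpus: (feature vector, label)
corpus = [([1.0, 2.0], 1), ([-1.0, -1.5], 0), ([0.5, 1.0], 1)]

ll_zero = log_likelihood([0.0, 0.0], corpus)  # every example gets probability 0.5
ll_ones = log_likelihood([1.0, 1.0], corpus)  # separates the two classes better
print(ll_zero, ll_ones)
```

The second parameter vector yields a log-likelihood closer to 0, meaning it assigns a higher probability to the observed labels; training searches for the vector that pushes this value as high as possible.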

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# input file
input_file = 'data/doc_vectors_{}_{}.npy'.format(dim, min_count)
X_train, Y_train, X_test, Y_test = load_train_test(input_file)

In [None]:
# creating and training model

baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, Y_train)

In [None]:
baseline_model.score(X_test, Y_test)

### Find the best hyperparameters with Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
import pandas as pd  # used to visualize grid search results

In [None]:
# input file
input_file = 'data/doc_vectors_{}_{}.npy'.format(dim, min_count)
X_train, Y_train, X_test, Y_test = load_train_test(input_file)

In [None]:
# Grid search works on a base model that we define here
base_model = LogisticRegression()

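# The following list of dictionaries contains the hyper-parameters of the model that we want to optimize;
# each dictionary is a separate sub-grid, so only compatible penalty/solver pairs are combined
# (note: recent scikit-learn releases expect penalty=None instead of the string 'none')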
parameters = [
    {'penalty': ['l1','l2'], 'solver': ['liblinear', 'saga'], 'max_iter': [150]},
    {'penalty': ['l2'], 'C': [0.1, 1], 'solver': ['lbfgs'], 'max_iter': [150]},
    {'penalty': ['none'], 'solver': ['lbfgs'], 'max_iter': [150]},
    {'penalty': ['elasticnet'], 'C': [0.1, 1], 'l1_ratio': [0.4, 0.6], 'solver': ['saga'], 'max_iter': [150]},
    {'penalty': ['none'], 'solver': ['saga'], 'max_iter': [150]},
]

# defining the grid search model
grid_search_model = GridSearchCV(base_model, parameters, verbose=2, cv=3)
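
Before launching the search, it helps to know how many models will be trained. The sketch below expands the same list of sub-grids with `itertools.product` from the standard library; the helper `expand` is written for this example and only mimics how each dictionary is turned into combinations.

```python
from itertools import product

# same grid as in the cell above
parameters = [
    {'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga'], 'max_iter': [150]},
    {'penalty': ['l2'], 'C': [0.1, 1], 'solver': ['lbfgs'], 'max_iter': [150]},
    {'penalty': ['none'], 'solver': ['lbfgs'], 'max_iter': [150]},
    {'penalty': ['elasticnet'], 'C': [0.1, 1], 'l1_ratio': [0.4, 0.6], 'solver': ['saga'], 'max_iter': [150]},
    {'penalty': ['none'], 'solver': ['saga'], 'max_iter': [150]},
]

def expand(grid):
    """Yield every combination of values of one sub-grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

candidates = [combo for grid in parameters for combo in expand(grid)]
n_folds = 3  # the cv value passed to GridSearchCV above
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
# prints: 12 candidates, 36 cross-validation fits
```

With the default `refit=True`, one more model is then trained on the whole training set using the best combination.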

In [None]:
grid_search_model.fit(X_train, Y_train)

Once the grid search has been fitted to the training data, detailed information about the cross-validation procedure is available in the `cv_results_` attribute of `grid_search_model`.
In the next cell we gather some of this information in a pandas DataFrame to make it easier to read.

In [None]:
columns_to_keep = ['param_C', 'param_max_iter', 'param_penalty', 'param_solver', 'param_l1_ratio', 'mean_test_score', 'std_test_score', 'rank_test_score']
pd.DataFrame({key: grid_search_model.cv_results_[key] for key in columns_to_keep})

In [None]:
# We test on the held-out data that were not used to optimize the parameters.
# We should get a higher score than with the simple Logistic Regression model
grid_search_model.score(X_test, Y_test)

#### Great! We now have a trained model, but how do we use it on new documents?
For the moment, our model takes vectors as input and returns an integer; it would be nice to have a more user-friendly interface.

In [None]:
toy_document = np.random.normal(1, 1, (1, dim))  # creates a fake document
print(grid_search_model.predict(toy_document)) # use the model to predict the label associated with our fake document

## Wrapping the model

Now that we have a fairly good trained model, it would be nice to see how well it works by sending it an input string and getting the associated sentiment as output, with the following simple lines of code:
```
wrapped_model = WrappedModel(trained_model, ...)
wrapped_model.predict('I really loved this movie')  # should output 'positive'
```
We will create the WrappedModel class in the following section, using all the steps we have gone through so far.


In [None]:
class WrappedModel:
    
    def __init__(self, mapping, char_filter, tokenizer, vectorizer, model):
        self.mapping = mapping
        self.char_filter = char_filter
        self.tokenizer = tokenizer
        self.vectorizer = vectorizer
        self.model = model
    
    def predict(self, text):
        """Returns the sentiment associated to the given document
        
        Args:
            text (str): input raw text
        
        Returns:
            str: sentiment of the text
        """
        # TODO HERE: Fill the code that will predict the sentiment associated with our text
        # You need to process the text the same way we did before : char_filter, tokenizer, vectorizer, model prediction
        pass

In [None]:
int2label = {0: 'negative', 1: 'positive'}
wrapped_model = WrappedModel(int2label, char_filter, word_tokenize, tfidf, grid_search_model)


# Testing our model is now really simple:
print(wrapped_model.predict('I really liked this film!'))  # should output positive
print(wrapped_model.predict('I did not really like this movie!'))  # should output negative
print(wrapped_model.predict('At first I was amazed by the graphics. ' +
                            'However the scenario is too simple. Overall I was a bit disappointed'))  # should output negative
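
One possible shape for the `predict` method is sketched below. It assumes that the vectorizer exposes `vectorize_document(tokens)` returning a single row of features and that the model exposes `predict(rows)`, as in the classes above. `KeywordVectorizer` and `ThresholdModel` are made-up stand-ins so that the example runs on its own; they are not the trained components.

```python
class WrappedModelSketch:
    """Illustrative wrapper: raw text in, sentiment string out."""

    def __init__(self, mapping, char_filter, tokenizer, vectorizer, model):
        self.mapping = mapping
        self.char_filter = char_filter
        self.tokenizer = tokenizer
        self.vectorizer = vectorizer
        self.model = model

    def predict(self, text):
        cleaned = self.char_filter(text)                     # same cleaning as at training time
        tokens = self.tokenizer(cleaned)                     # same tokenization as at training time
        vector = self.vectorizer.vectorize_document(tokens)  # assumed to return one row of features
        label = self.model.predict(vector)[0]                # the model returns one label per row
        return self.mapping[int(label)]                      # int() also handles float labels


class KeywordVectorizer:
    """Stand-in vectorizer: counts a few hand-picked positive and negative words."""

    def vectorize_document(self, tokens):
        positive = sum(token in {"liked", "loved", "great"} for token in tokens)
        negative = sum(token in {"bad", "boring", "disappointed"} for token in tokens)
        return [[positive, negative]]  # one row, mimicking a (1, dim) array


class ThresholdModel:
    """Stand-in classifier: label 1 when positive words outnumber negative ones."""

    def predict(self, rows):
        return [1 if positive > negative else 0 for positive, negative in rows]


sketch = WrappedModelSketch({0: 'negative', 1: 'positive'}, str.lower, str.split,
                            KeywordVectorizer(), ThresholdModel())
print(sketch.predict("I really liked this film"))  # prints: positive
print(sketch.predict("Boring and bad"))            # prints: negative
```

The important design point is that the text goes through exactly the same character filter and tokenizer as the training data; otherwise the tokens will not match the vocabulary the vectorizer was built on.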