# Support Vector Machine IMDB Movie Review Dataset  

## Importing IMDB Dataset and cleaning reviews

Import the required libraries: pandas, re, nltk, and bs4. Make sure all of these packages are installed before importing them.
The stopword list and the WordNet data must be downloaded before nltk.corpus and the lemmatizer can use them, so run the commands nltk.download('stopwords') and nltk.download('wordnet') once before running the whole script.

In [1]:
# Import Libraries
import nltk
import pandas as pd
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

The CSV file is read with pandas and assigned to the df variable. The encoding is set to “Latin-1” here to avoid a decoding error.

In [2]:
## Read the dataset
df = pd.read_csv('IMDB Dataset.csv', encoding = 'Latin-1')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


At this point, we can map the two values in the sentiment column to ones and zeros: positive becomes 1 and negative becomes 0.

In [4]:
## Replacing labels with 0 and 1 for classification
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
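
As a small sanity check on the mapping above, the sketch below uses a made-up Series (not the IMDB data) to show that Series.map returns NaN for any label missing from the dictionary. Counting missing values afterwards is therefore a cheap way to catch unexpected labels. The variable names are illustrative assumptions.

```python
import pandas as pd

# Toy labels invented for illustration; "Positive" is deliberately miscapitalized
labels = pd.Series(["positive", "negative", "Positive"])
mapped = labels.map({"positive": 1, "negative": 0})

# Labels that are not dictionary keys become NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("unmapped labels:", n_unmapped)
```

If the count is greater than zero, the labels need cleaning (for example lowercasing) before the mapping is applied.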

# TEXT PROCESSING

## Tokenization

Tokenization breaks a full document or a sentence down into a sequence of smaller units called tokens, typically words and punctuation marks, which makes the text easier to manipulate. For example, an algorithm has no interest in blank spaces, blank lines, or line breaks. The result of the tokenization process contains only words and punctuation.
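
A minimal sketch of the idea, using only the standard library: the hypothetical helper simple_tokenize below is an assumption for illustration and not the tokenizer used later in this notebook, which simply splits cleaned text on whitespace.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "A wonderful little production.\n\nThe filming technique is unassuming!"
tokens = simple_tokenize(sample)
print(tokens)
```

The blank line and the spaces in the sample disappear, while every word and punctuation mark becomes its own token.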

## Lemmatization and Stopwords

Lemmatization reduces words to their canonical, basic form. For example, the verbs writing, writes, wrote, and written can all be represented by the word write, the associated lemma. Lemmatization simplifies the process because the algorithm can refer to a single word instead of all its forms. Stopword removal, on the other hand, eliminates words that add little to no value to the sentiment computation. The articles “a” and “an”, for instance, do not indicate any positive or negative sentiment.
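
To illustrate both ideas without any downloads, the sketch below replaces WordNet with a tiny hand-written lemma dictionary and a five-word stopword set. Both TOY_LEMMAS and TOY_STOPWORDS are hypothetical stand-ins; the notebook itself relies on nltk for the real lists.

```python
# Hypothetical mini-lexicon standing in for WordNet; real lemmatizers cover far more words
TOY_LEMMAS = {"writing": "write", "writes": "write", "wrote": "write", "written": "write"}
# Hypothetical stopword set standing in for nltk's English list
TOY_STOPWORDS = {"a", "an", "the", "she", "has"}

def normalize(tokens):
    """Map each token to its lemma, then drop stopwords."""
    lemmas = [TOY_LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in lemmas if tok not in TOY_STOPWORDS]

first = normalize("she has written a letter".split())
second = normalize("she wrote an essay".split())
print(first)
print(second)
```

Both sentences end up sharing the token write, which is exactly the reduction in vocabulary that lemmatization is meant to achieve.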

In [5]:
## Defining stop_words and Lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [6]:
print(stop_words)

{'am', 'further', 'mustn', 'on', 'such', "hasn't", 'off', "wouldn't", 're', 'll', "don't", 'the', 'any', 'yourself', 'them', 'myself', 'was', 'own', 'didn', 'now', 'between', 'their', 'have', 't', 'before', 'there', 'and', 'who', 'against', 'no', 'what', 'aren', 'these', 'not', "shan't", 'under', 'don', 'will', 'doesn', 'up', "needn't", 'for', "mightn't", 'being', 'wouldn', 'as', 'his', 'shan', 'ours', 'can', 'we', 's', 'be', 'they', 'why', 'my', 'yours', 'himself', 'isn', 'mightn', 'did', "mustn't", 'ain', 'haven', 'i', 'an', 'doing', 'were', 'both', 'y', 'here', 'or', 'couldn', 'been', "you'll", 'after', 'same', 'd', "isn't", 'below', 'shouldn', 'too', 'where', 'so', 'having', 'needn', 'her', 'until', "she's", 'during', 'it', 'but', 'how', 'our', "should've", "won't", 'more', 'him', "couldn't", "didn't", 'had', 'a', 'if', 'its', 'you', 'because', "you've", 'to', 'down', 'hadn', 'ma', 'out', 'is', 'other', "haven't", 'm', 'only', 'again', 'whom', 'o', "you'd", 'has', 'from', 've', 'on

In [7]:
## Removing the HTML tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

The clean_text function can now be defined. The re library provides regular expression matching operations. To summarize, the code strips HTML tags such as line breaks, replaces every character that is not a letter or a digit with a space, and converts the text to lowercase. Each token is then lemmatized, first as a noun (the default) and then as a verb. As a final step, stopwords are removed.

In [8]:
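## Defining clean_text function
def clean_text(text):
    # Drop HTML tags, then replace every run of non-alphanumeric characters with a space
    text = strip_html(text)
    text = re.sub(r'[^A-Za-z0-9]+',' ',text)
    text = text.lower()
    # split() without arguments avoids empty tokens from leading or trailing spaces
    text = [lemmatizer.lemmatize(token) for token in text.split()]
    # Second pass treats each token as a verb
    text = [lemmatizer.lemmatize(token, "v") for token in text]
    text = [word for word in text if word not in stop_words]
    text = " ".join(text)
    return text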
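
To make the regular-expression step concrete, the sketch below applies the same pattern to a made-up sentence (not taken from the dataset) and contrasts two ways of splitting the result into tokens. Only the standard library is needed.

```python
import re

sample = "Great movie!!! 10/10, would watch again."

# Same pattern as in clean_text: every run of non-alphanumeric characters becomes one space
cleaned = re.sub(r'[^A-Za-z0-9]+', ' ', sample).lower()

tokens_on_space = cleaned.split(" ")   # keeps an empty string for the trailing space
tokens_default = cleaned.split()       # splits on any whitespace, no empty strings

print(repr(cleaned))
print(tokens_on_space)
print(tokens_default)
```

The final period turns into a trailing space, so splitting on a literal space leaves an empty token at the end, while the argument-free split does not.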

The clean_text function is finally applied to each row of the “review” column, and the result is stored in a new column called “Processed_Reviews”.

In [9]:
### Creating new column for processed reviews
process_reviews = df['Processed_Reviews'] = df.review.apply(clean_text)

Each review is now ready to be processed by the SVM algorithm. The printed output of the next cell shows the result of this first major section.
As you might notice, every period, comma, question mark, and parenthesis has been filtered out. Pronouns and line breaks are gone as well. The noun lemmatizer also leaves a few artifacts, such as “ha” and “wa” from “has” and “was”. By completing this first major step of text processing, we have reduced the complexity of the data. The model now has far fewer symbols and distinct words to process, which is likely to improve its generalization while maintaining high accuracy.

In [10]:
print(process_reviews)

0        one reviewer ha mention watch 1 oz episode hoo...
1        wonderful little production film technique una...
2        think wa wonderful way spend time hot summer w...
3        basically family little boy jake think zombie ...
4        petter mattei love time money visually stun fi...
                               ...                        
49995    think movie right good job creative original f...
49996    bad plot bad dialogue bad act idiotic direct a...
49997    catholic teach parochial elementary school nun...
49998    go disagree previous comment side maltin one s...
49999    one expect star trek movie high art fan expect...
Name: review, Length: 50000, dtype: object


# Model Deployment

The following code trains the SVM model with its default settings and measures its accuracy, without any hyperparameter optimization.

After importing the sklearn library and its sub-functions, we can define the input and the target variables.

In [11]:
## Deploying SVM model on available data

## Importing libraries
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [12]:
## Defining input and target variable
x = df['Processed_Reviews']
y = df['sentiment']

As in every machine learning workflow, the next step is to split the data into training and test sets. In this case, the test set holds 20% of the records in the data set and the training set the remaining 80%.

In [13]:
## Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
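
The sketch below repeats the split on ten invented records to make the 80/20 proportions visible. The stratify argument is an optional extra that the cell above does not use; it keeps the class balance identical in both subsets. All toy_ names are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 10 "reviews" with balanced labels
toy_x = [f"review {i}" for i in range(10)]
toy_y = [0, 1] * 5

# stratify=toy_y preserves the 50/50 label ratio in both subsets
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)
print(len(tr_x), len(te_x))
print(sorted(te_y))
```

With ten records, eight land in the training set and two in the test set, one from each class.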

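For the project, I decided to use a bag-of-words (BoW) feature extractor with its default parameters. The count vectorizer builds a vocabulary (a dictionary mapping each word to a column index) and then transforms both the X_train and X_test subsets into word-count vectors according to that vocabulary. Note that in the code below the vocabulary is fit on the full Processed_Reviews column, test rows included; fitting it on X_train only would keep the test set fully unseen.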

In [14]:
## Vectorization and Bag of words method with default parameters
count_vect = CountVectorizer().fit(df['Processed_Reviews'].values.astype('U'))
bow_train = count_vect.transform(X_train.values.astype('U'))
bow_test = count_vect.transform(X_test.values.astype('U'))
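
A toy illustration of what the vectorizer produces, on three invented documents. Unlike the cell above, this sketch fits the vocabulary on the training documents only, which is an assumption made here to show how unseen words are handled.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["good ending"]   # "ending" never appears in the training documents

# Learn the vocabulary from the training text only
vect = CountVectorizer().fit(train_docs)

# Columns follow the alphabetical order of the vocabulary
vocab = sorted(vect.vocabulary_)
train_counts = vect.transform(train_docs).toarray().tolist()
test_counts = vect.transform(test_docs).toarray().tolist()

print(vocab)
print(train_counts)
print(test_counts)
```

Each row is a document and each column a vocabulary word; the word "ending" is silently ignored at transform time because it was not part of the fitted vocabulary.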

The SVC() object is assigned to the SVM variable to instantiate the model, which is then fit on the records provided by the data set for the training portion of the script.


In [15]:
## Instantiate the model (using the default parameters)
SVM = SVC()
SVM

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [16]:
## Fit the model with pre-processed data
SVM.fit(bow_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Finally, we can run SVM.predict(), which applies the trained model to the unseen records of the test set. The classification report then compares the actual sentiment with the predicted one and reports precision, recall, F1-score, and accuracy.

In [17]:
### Perform classification and prediction on samples in bow_test
predicted_SVM = SVM.predict(bow_test)
print(classification_report(y_test, predicted_SVM))

              precision    recall  f1-score   support

           0       0.89      0.85      0.87      5035
           1       0.86      0.89      0.87      4965

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



The initial set of results is quite comforting: the overall accuracy stands at 87% when testing the algorithm on 5035 negative reviews and 4965 positive ones. The SVM algorithm performed much better.
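
For readers unfamiliar with the columns of the classification report, the sketch below computes precision, recall, and F1 by hand on eight invented labels. The helper precision_recall_f1 and the toy data are assumptions for illustration; they are not the predictions of the model above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: one positive review missed, one negative review wrongly flagged
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Precision answers "of the reviews predicted positive, how many really are?", while recall answers "of the truly positive reviews, how many were found?". The support column of the report is simply the number of true records per class.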

# Hyperparameter Optimization

## Finding best parameters for Count Vectorizer and SVM model

We now know the overall accuracy with the default settings for both the vectorizer and the SVM model. The objective of the following code is to find the combination of hyperparameters that improves the model performance the most.

We can start by importing the required classes from the sklearn library: RepeatedStratifiedKFold, GridSearchCV, SVC, Pipeline, and CountVectorizer.

In [25]:
#Importing libraries
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

Then we can create a pipeline. In computing, the term pipeline usually refers to a data pipeline: a chain of data processing elements in which the output of one element is the input of the next. The first element of our pipeline is CountVectorizer(), which we named “vect”, while the second element is SVC(), named “SVM”. Informally speaking, we need a pipeline so that the vectorizer and the classifier are cross-validated together as a single estimator.

In [26]:
### Creating a Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('SVM', SVC())
])
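
To show that a pipeline behaves like a single estimator, the sketch below fits one on four invented reviews and predicts on raw strings directly. The toy data and the linear kernel are assumptions chosen only to keep the example small; the notebook's pipeline uses the default SVC().

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy reviews invented for illustration
toy_texts = ["good great film", "great fun movie", "bad boring film", "awful boring movie"]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("SVM", SVC(kernel="linear")),   # linear kernel chosen only for this toy example
])

# fit() runs the vectorizer's fit_transform, then trains the classifier on its output
toy_pipeline.fit(toy_texts, toy_labels)

# predict() accepts raw text: the fitted vectorizer is applied automatically
predictions = toy_pipeline.predict(["great film", "boring movie"])
print(predictions)
```

Because vectorization happens inside fit, a cross-validation routine that refits the pipeline on each fold also rebuilds the vocabulary from that fold's training portion only.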

The parameter grid is built so that each key combines the name of a pipeline step with the name of one of its parameters, separated by a double underscore, and maps to the list of values to try. For example, CountVectorizer accepts the parameters max_df and ngram_range. The name vect__max_df tells us that the parameter max_df belongs to the “vect” step previously defined in the pipeline.

In [27]:
### Defining Hyperparameters
parameters = {
    'vect__max_df':[0.1,0.2,0.3,0.4,0.5,0.6,0.7],
    'vect__ngram_range':  [(1,1), (1,2), (1,3)],
    'SVM__kernel': ['poly', 'rbf', 'sigmoid'],
    'SVM__C': [50, 10, 1.0, 0.1, 0.01]}
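
The naming convention can be checked directly on a pipeline object. The sketch below rebuilds an equivalent pipeline under the hypothetical name demo_pipeline, lists two of its parameters, and sets two others by hand, which is roughly what the grid search does for each candidate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

demo_pipeline = Pipeline([("vect", CountVectorizer()), ("SVM", SVC())])

# Every tunable parameter is exposed as "<step name>__<parameter name>"
all_params = demo_pipeline.get_params()
print(all_params["vect__max_df"], all_params["SVM__C"])

# Setting parameters by their double-underscore names updates the underlying steps
demo_pipeline.set_params(vect__ngram_range=(1, 2), SVM__kernel="sigmoid")
print(demo_pipeline.named_steps["vect"].ngram_range)
print(demo_pipeline.named_steps["SVM"].kernel)
```

A key that does not match a step name and parameter name raises an error, so a typo in the grid is caught before any fitting starts.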

The grid search combines the pipeline and the parameter grid to find the combination of hyperparameters that maximizes the SVM performance. Of course, the concept is much more complicated than that and will be covered in future articles. For now, we are interested in the mere mechanics of the code. The grid search fits and scores every single hyperparameter combination on our data. Running that on all 50,000 reviews would require a lot of computation, which is why I reduced the data to roughly the first 5,000 rows for this particular scenario.

The code runs for more than an hour.
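
The long runtime follows from simple arithmetic, sketched below. The grid is repeated under the hypothetical name grid so the snippet runs on its own, and n_folds is set to 5 to match the cv=5 argument of the GridSearchCV call in the next cell.

```python
from math import prod

# Same values as the parameter grid defined earlier
grid = {
    'vect__max_df': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'SVM__kernel': ['poly', 'rbf', 'sigmoid'],
    'SVM__C': [50, 10, 1.0, 0.1, 0.01],
}

# One candidate per combination of values: 7 * 3 * 3 * 5
n_candidates = prod(len(values) for values in grid.values())

n_folds = 5   # matches cv=5 in the GridSearchCV call
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)   # 315 1575
```

These numbers agree with the first line of the grid search log. Had a 10-split, 3-repeat cross-validation (30 folds) been used instead, the total would rise to 9,450 fits.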

In [28]:
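### Define Grid Search
# Note: this RepeatedStratifiedKFold object is defined but never passed to GridSearchCV,
# which uses plain 5-fold cross-validation (cv=5) as the log below confirms
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(pipeline, param_grid=parameters, refit = True, verbose = 3, cv=5)
# .loc slicing is label-based and inclusive, so :5000 selects rows 0 through 5000
grid_result = grid_search.fit(df.loc[:5000, 'Processed_Reviews'].values.astype('U'), df.loc[:5000, 'sentiment'].values.astype('U'))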

Fitting 5 folds for each of 315 candidates, totalling 1575 fits
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.718, total=  13.4s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.4s remaining:    0.0s


[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.700, total=  11.6s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   25.0s remaining:    0.0s


[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.760, total=  12.9s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.769, total=  11.5s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.733, total=  12.9s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2), score=0.535, total=  26.1s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2), score=0.524, total=  26.2s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2), s

[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.769, total=  18.9s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.751, total=  16.9s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.764, total=  19.4s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.671, total=  33.2s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.641, total=  30.8s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2), s

[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.779, total=  16.5s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.745, total=  16.9s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.785, total=  16.6s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.736, total=  30.0s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.694, total=  31.3s
[CV] SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2), s

[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.814, total=  13.4s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.832, total=  13.4s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.857, total=  13.5s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.860, total=  26.1s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.860, total=  26.1s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.845,

[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.838, total=  14.3s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.856, total=  14.4s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.847, total=  25.9s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.853, total=  26.2s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.845, total=  26.4s
[CV] SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.852,

[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.770, total=   5.5s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.780, total=   5.8s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.722, total=  13.8s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.762, total=  14.2s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.742, total=  13.8s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df

[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.711, total=   5.8s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.748, total=   6.0s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.748, total=   5.9s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.738, total=  10.7s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.742, total=  11.2s
[CV] SVM__C=50, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=50, SVM__kernel=sigmoid, vect__max_df

[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.631, total=  13.0s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.568, total=  12.9s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.586, total=  13.1s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.610, total=  12.9s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2), score=0.654, total=  29.1s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2), s

[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.823, total=  15.5s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.743, total=  14.2s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.772, total=  15.0s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.827, total=  15.0s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.753, total=  26.8s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2), s

[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.766, total=  14.0s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.797, total=  15.5s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.774, total=  14.5s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.807, total=  16.8s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.745, total=  26.1s
[CV] SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2), s

[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.854, total=  15.6s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.820, total=  15.4s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.834, total=  16.0s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.859, total=  16.0s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.860, total=  30.2s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.859,

[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.820, total=  14.3s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.844, total=  14.2s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.861, total=  14.5s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.848, total=  26.9s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.853, total=  26.2s
[CV] SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.845,

[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.765, total=   5.6s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.806, total=   6.1s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.797, total=   5.7s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.803, total=  15.0s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.797, total=  38.1s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df

[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.751, total=   5.6s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.720, total=   5.3s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.760, total=   5.5s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.744, total=   5.5s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.770, total=  11.4s
[CV] SVM__C=10, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=10, SVM__kernel=sigmoid, vect__max_df

[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.527, total=  11.7s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.523, total=  12.1s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.518, total=  11.6s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.521, total=  12.1s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.530, total=  11.5s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_rang

[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.626, total=  14.4s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.603, total=  14.7s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.585, total=  14.5s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.584, total=  14.8s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.595, total=  14.6s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.4, vect__ngram_rang

[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.740, total=  16.1s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.740, total=  16.2s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.753, total=  16.1s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.777, total=  16.8s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.781, total=  16.1s
[CV] SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=1.0, SVM__kernel=poly, vect__max_df=0.7, vect__ngram_rang

[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.850, total=  11.9s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.864, total=  12.0s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.820, total=  11.6s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.835, total=  11.8s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.864, total=  11.9s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.3, vect__ngram_range=(1, 2), s

[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.835, total=  12.7s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.854, total=  12.8s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.827, total=  12.2s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.835, total=  12.2s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.873, total=  12.6s
[CV] SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=1.0, SVM__kernel=rbf, vect__max_df=0.6, vect__ngram_range=(1, 2), s

[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3), score=0.852, total=  28.4s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.841, total=   8.7s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.862, total=   8.8s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.831, total=   8.5s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.839, total=   8.5s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, v

[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3), score=0.847, total=  27.2s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3), score=0.867, total=  27.2s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3), score=0.876, total=  28.1s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.797, total=   9.2s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.827, total=   9.2s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, v

[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3), score=0.827, total=  29.2s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3), score=0.849, total=  29.4s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3), score=0.805, total=  27.3s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3), score=0.835, total=  27.5s
[CV] SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3) 
[CV]  SVM__C=1.0, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3), score=0.837, total=  28.8s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__m

[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.507, total=  28.5s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3), score=0.506, total=  38.7s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3), score=0.507, total=  37.6s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3), score=0.507, total=  36.6s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3), score=0.506, total=  37.1s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_rang

[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.516, total=  28.7s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.507, total=  38.3s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.511, total=  39.4s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.507, total=  37.5s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.508, total=  38.0s
[CV] SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_rang

[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.508, total=  27.5s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3), score=0.506, total=  39.0s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3), score=0.506, total=  36.1s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3), score=0.506, total=  36.3s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3), score=0.506, total=  36.2s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 3), s

[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.666, total=  28.0s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3), score=0.619, total=  37.5s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3), score=0.573, total=  37.4s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3), score=0.520, total=  36.6s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3), score=0.532, total=  37.2s
[CV] SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 3), s

[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 2), score=0.787, total=  26.6s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3), score=0.590, total=  36.7s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3), score=0.598, total=  37.9s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3), score=0.578, total=  36.5s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3), score=0.570, total=  37.4s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, v

[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.791, total=  26.3s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.809, total=  26.5s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 2), score=0.823, total=  26.5s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3), score=0.796, total=  36.3s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3), score=0.812, total=  35.2s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, v

[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.768, total=  26.7s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.773, total=  26.3s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.789, total=  25.7s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.770, total=  26.8s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 2), score=0.779, total=  25.7s
[CV] SVM__C=0.1, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.1, SVM__kernel=sigmoid, v

[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 1), score=0.507, total=  15.7s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.506, total=  28.7s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.506, total=  29.2s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.506, total=  28.4s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2), score=0.506, total=  28.0s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.3, vect_

[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.507, total=  16.7s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 1), score=0.508, total=  17.2s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.506, total=  28.8s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.507, total=  29.6s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2), score=0.506, total=  29.0s
[CV] SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=poly, vect__max_df=0.6, vect_

[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.506, total=  14.3s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.506, total=  14.3s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 1), score=0.507, total=  15.5s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.506, total=  27.2s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2), score=0.506, total=  28.0s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.2, vect__ngram_rang

[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.506, total=  16.5s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.506, total=  16.3s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 1), score=0.507, total=  16.4s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.506, total=  28.6s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2), score=0.506, total=  28.4s
[CV] SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=rbf, vect__max_df=0.5, vect__ngram_rang

[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.506, total=  12.7s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.506, total=  12.4s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.506, total=  11.7s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 1), score=0.507, total=  12.3s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 2), score=0.506, total=  26.9s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.1, vect__ngram_range=(1, 2) 
[CV]  SVM__C=0.01, SVM__kernel

[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.3, vect__ngram_range=(1, 3), score=0.507, total=  36.1s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.506, total=  16.7s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.506, total=  15.9s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.506, total=  15.8s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1), score=0.506, total=  16.2s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.4, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel

[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.506, total=  37.2s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.506, total=  38.0s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.6, vect__ngram_range=(1, 3) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.6, vect__ngram_range=(1, 3), score=0.507, total=  38.0s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.507, total=  18.1s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 1), score=0.506, total=  16.4s
[CV] SVM__C=0.01, SVM__kernel=sigmoid, vect__max_df=0.7, vect__ngram_range=(1, 1) 
[CV]  SVM__C=0.01, SVM__kernel

[Parallel(n_jobs=1)]: Done 1575 out of 1575 | elapsed: 639.8min finished
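
The total of 1575 fits reported above is consistent with a grid of 315 hyperparameter combinations, each scored on 5 folds. The following is a minimal sketch of that arithmetic. The kernels, max_df values, and n-gram ranges are read off the [CV] lines, and the fold count is inferred from the five scores logged per combination. The complete list of C values is not visible in this excerpt, so the value 10 below is a hypothetical placeholder.

```python
from itertools import product

# Values read off the [CV] log lines. The second C value is NOT visible in the
# excerpt; 10 is a hypothetical placeholder used only to reach five C values.
param_grid = {
    "SVM__C": [50, 10, 1.0, 0.1, 0.01],
    "SVM__kernel": ["poly", "rbf", "sigmoid"],
    "vect__max_df": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
}
n_folds = 5  # inferred: each combination is logged with five scores

# Every combination of the four lists is one candidate model.
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 315 1575
```

The log line also shows n_jobs=1, meaning the fits ran one after another. GridSearchCV accepts an n_jobs argument (for example n_jobs=-1 to use all available cores), which may shorten a search of this size considerably.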


It seems that the best hyperparameters are:

    C = 50
    kernel = rbf
    max_df = 0.5
    ngram_range = (1, 2)

The last step summarizes the results: it prints the best mean cross-validated accuracy together with the hyperparameters that achieved it, followed by the mean and standard deviation of the score for every combination tested.

In [29]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.861830 using {'SVM__C': 1.0, 'SVM__kernel': 'sigmoid', 'vect__max_df': 0.4, 'vect__ngram_range': (1, 3)}
0.736056 (0.025628) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 1)}
0.536693 (0.012899) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 2)}
0.501100 (0.001743) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.1, 'vect__ngram_range': (1, 3)}
0.759850 (0.017164) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.2, 'vect__ngram_range': (1, 1)}
0.582683 (0.017768) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.2, 'vect__ngram_range': (1, 2)}
0.520697 (0.007182) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.2, 'vect__ngram_range': (1, 3)}
0.760047 (0.009491) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.3, 'vect__ngram_range': (1, 1)}
0.637872 (0.010197) with: {'SVM__C': 50, 'SVM__kernel': 'poly', 'vect__max_df': 0.3, 'vec
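
Because the full printout is long, it can help to sort the combinations by mean score and look only at the top few. The following is a sketch that assumes a dictionary shaped like grid_result.cv_results_. The small dictionary is a stand-in filled with four rows copied from the printout above, not the complete results, and top_candidates is a hypothetical helper name.

```python
# Stand-in for grid_result.cv_results_, filled with four rows from the printout.
cv_results = {
    "mean_test_score": [0.736056, 0.536693, 0.501100, 0.759850],
    "std_test_score": [0.025628, 0.012899, 0.001743, 0.017164],
    "params": [
        {"SVM__C": 50, "SVM__kernel": "poly", "vect__max_df": 0.1, "vect__ngram_range": (1, 1)},
        {"SVM__C": 50, "SVM__kernel": "poly", "vect__max_df": 0.1, "vect__ngram_range": (1, 2)},
        {"SVM__C": 50, "SVM__kernel": "poly", "vect__max_df": 0.1, "vect__ngram_range": (1, 3)},
        {"SVM__C": 50, "SVM__kernel": "poly", "vect__max_df": 0.2, "vect__ngram_range": (1, 1)},
    ],
}


def top_candidates(results, n=3):
    """Return the n (mean, std, params) rows with the highest mean score."""
    rows = zip(results["mean_test_score"], results["std_test_score"], results["params"])
    return sorted(rows, key=lambda row: row[0], reverse=True)[:n]


for mean, std, params in top_candidates(cv_results):
    print(f"{mean:.6f} ({std:.6f}) with: {params!r}")
```

With the real search object, passing grid_result.cv_results_ instead of the stand-in should work the same way, since the three keys used here are the ones already read in the summary cell.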

The accuracy score calculated on only 5,000 records is 84.5%. It is lower than the one we obtained at first, but this time the model was trained on far fewer records than the 40,000 used previously, which explains the difference in performance. It is now time to retrain the model on the full dataset with the tuned hyperparameters and re-check the final score. Running the model with the new hyperparameters requires two code changes:

    1. “countvect = CountVectorizer()” becomes “countvect = CountVectorizer(ngram_range=(1,2), max_df=0.5)”. This tells the vectorizer to count both single words and pairs of consecutive words (ngram_range) and to ignore terms that appear in more than 50% of the documents (max_df).
    2. “SVM = SVC()” becomes “SVM = SVC(C=50, kernel='rbf')”. This tells the SVM to use 50 as the C parameter and rbf as the kernel.
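
The two changes can be sketched together as follows. This is a minimal illustration, not the notebook's training cell: the four-review corpus and its labels are hypothetical stand-ins for the cleaned IMDB data, and the variable name count_vect follows the test cell further down.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the cleaned reviews (1 = positive, 0 = negative).
reviews = [
    "wonderful movie with great ending",
    "great acting in wonderful movie",
    "boring movie with terrible acting",
    "terrible plot in boring movie",
]
labels = [1, 1, 0, 0]

# Unigrams and bigrams; terms found in more than 50% of the documents are dropped.
count_vect = CountVectorizer(ngram_range=(1, 2), max_df=0.5)
X = count_vect.fit_transform(reviews)

# "movie" occurs in every review, so max_df=0.5 removes it from the vocabulary.
print("movie" in count_vect.vocabulary_)

# SVM with the hyperparameters named in the list above.
SVM = SVC(C=50, kernel="rbf")
SVM.fit(X, labels)

# New text must go through transform (not fit_transform) so the vocabulary stays fixed.
print(SVM.predict(count_vect.transform(["great ending"])))
```

On the real data the same two constructor calls replace the default ones, and the rest of the training code stays unchanged.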

## Testing Algorithm on single sentences

In [33]:
### Defining test sentences
test = ['The movie was really good, I could have not imagined a better ending']
test_1 = ['The movie was generally bad, the plot was boring and the characters badly interpreted']
# Vectorize with the already fitted CountVectorizer (transform, not fit_transform)
test = count_vect.transform(test).toarray()
test_1 = count_vect.transform(test_1).toarray()
# Printing prediction
print(SVM.predict(test))
print(SVM.predict(test_1))

[1]
[0]
