# Fine Tuning
In this chapter we will go over several strategies for boosting the accuracy of a
classifier.

## Try a different classifier

One classifier that is always worth trying is `LogisticRegression` from
scikit-learn. It is very versatile and especially good with text. Its main advantage
is that it works well with its default parameters, just like the Naive Bayes
classifier we’ve been experimenting with. The only change you need to make to the
previous script is to swap the classifier:

**Try LogisticRegression Classifier**

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [4]:
data = pd.read_csv('./twitter_sentiment_analysis.csv')

In [5]:
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)

In [7]:
vectorizer = CountVectorizer(lowercase=True)
# Compute the vocabulary only on the training data
vectorizer.fit(X_train)

# Transform the text list to a matrix form
X_train_vectorized = vectorizer.transform(X_train)

classifier = LogisticRegression()

# Train the classifier
classifier.fit(X_train_vectorized, y_train)

# Vectorize the test data
X_test_vectorized = vectorizer.transform(X_test)

# Check our classifier performance
score = classifier.score(X_test_vectorized, y_test)
print("Accuracy=", score)

Accuracy= 0.826781051846326


Just by making this change, we got a boost up to 0.82 in accuracy. Nice!

## Use Ngrams Instead of Words

The `Scikit-Learn` vectorizer API allows us to use ngrams rather than just words.
Remember what we’ve covered in the previous chapters about ngrams? It’s exactly
the same procedure. Instead of using only single-word features, we use consecutive,
multi-word features as well. The changes to the previous script needed to make this
happen are minimal.

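
To make the idea of ngram features concrete, here is a minimal pure-Python sketch of what an `ngram_range` of `(1, 3)` means. The helper `word_ngrams` is a hypothetical name made up for this illustration; it only approximates what the vectorizer does internally and is not the library’s implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return every run of consecutive tokens whose length is within ngram_range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "this is not good".split()
features = word_ngrams(tokens, ngram_range=(1, 3))
print(features)
```

For this four-word sentence the sketch yields four single words, three two-word features and two three-word features. Features such as "not good" carry information that the single words "not" and "good" lose when counted separately.
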
**Using Ngram Features**

In [12]:
from nltk.tokenize.casual import TweetTokenizer
tweet_tokenizer = TweetTokenizer(strip_handles=True)

tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)

In [13]:
vectorizer = CountVectorizer(
    lowercase=True,
    tokenizer=tweet_tokenizer.tokenize,
    ngram_range=(1, 3))

# Compute the vocabulary only on the training data
vectorizer.fit(X_train)

# Transform the text list to a matrix form
X_train_vectorized = vectorizer.transform(X_train)

classifier = LogisticRegression()

# Train the classifier
classifier.fit(X_train_vectorized, y_train)

# Vectorize the test data
X_test_vectorized = vectorizer.transform(X_test)

# Check our classifier performance
score = classifier.score(X_test_vectorized, y_test)
print("Accuracy=", score)

Accuracy= 0.9376352107422603


Happy day, now we’re up to 0.93 in accuracy.

### Using a Pipeline

Using a pipeline has a bunch of benefits:

- the ensemble of components behaves as a single classifier
- the code is clean and encapsulated
- it is easy to iterate on improving the model (more about that in the following
section)
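
As a small illustration of the first point, the sketch below trains a pipeline on a tiny made-up dataset (the four toy texts and their labels are invented for this example and are not part of the Twitter data). A single `fit` call trains both steps, and a single `predict` call takes raw text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only
toy_texts = ["good great", "great love", "bad awful", "awful hate"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression()),
])

# One fit call builds the vocabulary and trains the classifier
toy_pipeline.fit(toy_texts, toy_labels)

# One predict call vectorizes the raw text and classifies it
predictions = toy_pipeline.predict(["great good", "awful bad"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, there is no separate `transform` step to remember when new text comes in.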

**Using a Pipeline**

In [14]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize.casual import TweetTokenizer

In [15]:
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')

tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Split the data for training and for testing and shuffle it
X_train, X_test, y_train, y_test = train_test_split(tweets, sentiments,
test_size=0.2, shuffle=True)

# Put everything in a Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        lowercase=True,
        tokenizer=tweet_tokenizer.tokenize,
        ngram_range=(1, 3))),
    ('classifier', LogisticRegression())
])

# Train the vectorizer and the classifier in one call
pipeline.fit(X_train, y_train)

# Check our classifier performance
score = pipeline.score(X_test, y_test)

In [None]:
print("Accuracy=", score)

## Cross Validation


This strategy might seem a bit harder to grasp, but bear with me and we’ll get to the
bottom of it.

I stated earlier in the book that we need to keep the test data separate from the
training data, so that it does not influence the classifier’s results. Did you wonder
why that is? Well, if we test the system with the same data we trained on, we would
obviously get awesome, but biased, results. In order to get valid results, we need to
test the system with data it hasn’t seen yet.

If we continuously tweak the parameters to improve the results on the test set, we
indirectly overfit the system to the test dataset. That is undesirable because it
makes the system worse at generalizing: if we test on unseen data, outside of the
test set, it will underperform. One way to fix this problem is to keep some more data
aside and never test on it while we tune the parameters. This data is called the
Validation Set. Once we’re satisfied with the results on the test set, then and only
then do we use the Validation Set to check how our system is doing on unseen data.
(Be aware that many texts use these two names the other way around: they tune on a
validation set and keep the test set for the final check.)
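
A three-way split like the one just described can be sketched by calling `train_test_split` twice. The 100 numbered samples, their alternating labels, the `demo_` variable names and the 60/20/20 proportions are assumptions chosen only to keep the example small.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 numbered samples with made-up alternating labels
samples = list(range(100))
labels = [i % 2 for i in samples]

# First put 20% aside as the Validation Set; it stays untouched while tuning
demo_rest, demo_validation, y_rest, y_validation = train_test_split(
    samples, labels, test_size=0.2, shuffle=True, random_state=0)

# Then split what is left into train (75%) and test (25%)
demo_train, demo_test, y_demo_train, y_demo_test = train_test_split(
    demo_rest, y_rest, test_size=0.25, shuffle=True, random_state=0)

print(len(demo_train), len(demo_test), len(demo_validation))
```

Only 60 of the 100 samples end up being used for training.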

This approach has a huge drawback: Data is usually scarce, and we will be putting
even more data aside that’s not going to be used for training.

One way around this drawback is Cross Validation. It means splitting the dataset
into N folds. The system is trained N times, each time on all the data except one
fold, and the excluded fold is used to score that run. At the end, the scores of all
N runs are averaged. This way we don’t waste much data.
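
The splitting into folds can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical name written for this illustration; it only shows which samples are trained on and which are held out in each round, and it assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, held_out_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread the samples as evenly as possible over the folds
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, held_out in folds:
    print("train on", train, "-> score on", held_out)
```

Every sample is held out exactly once, so each one contributes to both training and scoring over the N rounds.
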
Here’s an example of Cross Validation with N = 5 folds:

**Cross Validation Score**

In [16]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from nltk.tokenize.casual import TweetTokenizer

In [17]:
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)

In [18]:
# Put everything in a Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        lowercase=True,
        tokenizer=tweet_tokenizer.tokenize,
        ngram_range=(1, 3))),
    ('classifier', LogisticRegression())
])
tweets, sentiments = shuffle(tweets, sentiments)

In [None]:
%%time
print("MeanAccuracy=", cross_val_score(pipeline, tweets, sentiments, cv=5).mean())



Using the Cross Validation strategy, the mean accuracy comes out at around 0.82.

## Grid Search
As we’ve seen so far, there are quite a few parameters we can tune to improve
accuracy, and we have not explored that many yet. Moreover, there’s no way to
know for sure what will be the effects of tuning a parameter in a certain way. There’s
no exact algorithm for tuning a model. Mastering this implies curiosity and lots of
practice.

However, there is a simple way of searching for good parameter combinations, called
Grid Search. This technique runs Cross Validation for every possible parameter
combination. That’s a lot of work, so it will take a while.

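
To get a feel for how much work that is, the sketch below enumerates a grid by hand with `itertools.product`. The grid values are the ones searched in the script that follows; the enumeration is only an illustration of the idea, not how `GridSearchCV` is implemented.

```python
from itertools import product

# The same grid that is searched in the script that follows
param_grid = {
    'vectorizer__ngram_range': ((1, 2), (2, 3), (1, 3)),
    'vectorizer__binary': (True, False),
}
n_folds = 5

names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

for combination in combinations:
    print(combination)

print("combinations:", len(combinations))
print("cross-validation fits:", len(combinations) * n_folds)
```

Three ngram ranges times two `binary` settings give 6 combinations, and with 5 folds each that means 30 training runs. By default `GridSearchCV` then refits the best combination once more on all the data.
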
**Tuning Parameters with GridSearch**

In [None]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize.casual import TweetTokenizer

In [None]:
tweet_tokenizer = TweetTokenizer(strip_handles=True)
data = pd.read_csv('./twitter_sentiment_analysis.csv')
tweets = data['tweet'].values.astype(str)
sentiments = data['sentiment'].values.astype(str)
# Shuffle the data
tweets, sentiments = shuffle(tweets, sentiments)

In [None]:
# Put everything in a Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        lowercase=True,
        tokenizer=tweet_tokenizer.tokenize,
        ngram_range=(1, 3))),
    ('classifier', LogisticRegression())
])

classifier = GridSearchCV(pipeline, {
    # try out different ngram ranges
    'vectorizer__ngram_range': ((1, 2), (2, 3), (1, 3)),
    # check if setting all non zero counts to 1 makes a difference
    'vectorizer__binary': (True, False),
}, n_jobs=-1, verbose=True, error_score=0.0, cv=5)
# Compute the vocabulary and train the classifier
classifier.fit(tweets, sentiments)

In [None]:
print("Best Accuracy: ", classifier.best_score_)
print("Best Parameters: ", classifier.best_params_)
# Best Accuracy: 0.81920859947