# Plan:

1.  **Get tokens** for positive and negative tweets (by `token` in this
    context we mean `word`).
2.  **Lemmatize** them (convert to base word forms). For that we will
    use a Part-of-Speech tagger.
3.  **Clean ’em up** (remove mentions, URLs, stop words).
4.  **Prepare models** for the classifier, based on cleaned-up tokens.
5.  **Run the Naive Bayes classifier**.

First, download the prepared tweet samples that ship with NLTK.

In [None]:
import nltk

In [None]:
nltk.download('twitter_samples')

Get some sample positive/negative tweets.

In [None]:
from nltk.corpus import twitter_samples


We can either get the actual string content of those tweets:

In [None]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
positive_tweets[50]

Or we can get a list of tokens using the [`tokenized`
method](https://www.nltk.org/howto/twitter.html) on `twitter_samples`.

In [None]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[50])

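Now let’s set up a Part-of-Speech (PoS) tagger. Download the averaged
perceptron model that the tagger uses. Newer NLTK releases may ask for
`averaged_perceptron_tagger_eng` instead; the `LookupError` message
names the resource to download.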

In [None]:
nltk.download('averaged_perceptron_tagger')

Import the Part-of-Speech tagger that will be used for lemmatization.

In [None]:
from nltk.tag import pos_tag

Check how it works. Note that it returns a list of `(token, tag)`
tuples, where the second element is the Part-of-Speech tag.

In [None]:
pos_tag(tweet_tokens[50])

Let’s write a function that will lemmatize tweet tokens.

For that, let’s first fetch the WordNet resource. WordNet is a
semantically oriented dictionary of English; see section 2.5 of the
NLTK book (section 5 of chapter 2 in the online version,
[here](https://www.nltk.org/book/ch02.html)).

In [None]:
nltk.download('wordnet')

Now compute PoS tags for the tokens so that they can be passed to
`WordNetLemmatizer`.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
tokens = tweet_tokens[50]

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
# Convert PoS tags into a format used by the lemmatizer
# and run lemmatize
for word, tag in pos_tag(tokens):
    if tag.startswith('NN'):
        pos = 'n'
    elif tag.startswith('VB'):
        pos = 'v'
    else:
        pos = 'a'
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
print(lemmatized_sentence)

Note that it converts words to their base forms (‘are’ -> ‘be’, ‘comes’
-> ‘come’).
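
The tag-to-`pos` mapping used in the loop can be pulled into a small helper so it can be checked without any NLTK downloads. This is only a sketch: `to_wordnet_pos` and its `default` parameter are hypothetical names, and the adverb branch is an addition that the loop itself does not have.

```python
def to_wordnet_pos(tag, default='a'):
    """Map a Penn Treebank tag to a WordNet PoS letter.

    `to_wordnet_pos` and `default` are hypothetical names for this sketch.
    """
    if tag.startswith('NN'):
        return 'n'  # noun
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb (an addition; the loop has no adverb branch)
    return default  # everything else falls back to adjective, as in the loop


for tag in ['NNS', 'VBZ', 'RB', 'JJ', 'DT']:
    print(tag, '->', to_wordnet_pos(tag))
```

The result could then be passed as the second argument to `lemmatizer.lemmatize(word, pos)`.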

Now we can proceed to processing. During processing, we will perform
cleanup:

- remove URLs and mentions using regexes
- after lemmatization, remove *stopwords*

In [None]:
nltk.download('stopwords')

What are these stopwords? Let’s see some.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
for i in range(10):
    print(stop_words[i])
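
To get a feel for what stop-word removal does, here is a minimal sketch on a hand-picked subset of stop words. `sample_stop_words` and `remove_stop_words` are hypothetical names; the real list comes from `stopwords.words('english')`.

```python
# Hypothetical hand-picked subset; the real list comes from stopwords.words('english')
sample_stop_words = {'i', 'me', 'my', 'the', 'is', 'was', 'so'}


def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ['The', 'service', 'was', 'so', 'bad']
print(remove_stop_words(tokens, sample_stop_words))
```

Storing the stop words in a `set` keeps each membership test cheap, which matters when the check runs once per token over thousands of tweets.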


Here comes the `process_tokens` function:

In [None]:
import re
import string

def process_tokens(tweet_tokens):
    """Drop URLs and mentions, lemmatize, lowercase, remove stop words and punctuation."""
    cleaned_tokens = []
    # A set makes the stop-word membership test fast
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Now note the sheer size of regex for URLs :)
        # Mentions regex is comparatively short and sweet
        if (re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', token) or
            re.search(r'(@[A-Za-z0-9_]+)', token)):
            continue

        # Map the Penn Treebank tag to a WordNet PoS; anything else is treated as an adjective
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # `in string.punctuation` is a substring test: single marks are dropped, emoticons like ':)' are kept
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

Let’s test `process_tokens`:

In [None]:
print("Before:", tweet_tokens[50])
print("After:", process_tokens(tweet_tokens[50]))
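
To see which tokens the two regexes in `process_tokens` discard, they can be tried in isolation. This is a small sketch with made-up tokens; `URL_RE`, `MENTION_RE` and `is_url_or_mention` are hypothetical names.

```python
import re

# Same patterns as in process_tokens, compiled once for reuse
URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
MENTION_RE = re.compile(r'(@[A-Za-z0-9_]+)')


def is_url_or_mention(token):
    """Return True if the token would be skipped by the regex check."""
    return bool(URL_RE.search(token) or MENTION_RE.search(token))


# Made-up tokens for illustration
sample = ['@someuser', 'thanks', 'https://t.co/abc123', ':)', 'user@example.com']
for token in sample:
    print(repr(token), is_url_or_mention(token))
```

Because `search` matches anywhere in the token, an email-like token such as `user@example.com` is also treated as a mention and dropped.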

Run `process_tokens` on all positive/negative tweets.

In [None]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = [process_tokens(tokens) for tokens in positive_tweet_tokens]
negative_cleaned_tokens_list = [process_tokens(tokens) for tokens in negative_tweet_tokens]

Let’s see how the processing went.

In [None]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

Let’s see which words are most common in the positive tweets. Add a
helper function `get_all_words`:

In [None]:
def get_all_words(cleaned_tokens_list):
    return [w for tokens in cleaned_tokens_list for w in tokens]

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Perform frequency analysis using `FreqDist`:

In [None]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))
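
`FreqDist` shares its `most_common` interface with the standard library’s `collections.Counter`, so the counting step can be tried on a toy input without downloading any corpus. The token lists below are made up for illustration, and `Counter` is a stand-in here, not what the notebook itself uses.

```python
from collections import Counter

# Made-up cleaned tokens for three tweets
toy_cleaned_tokens_list = [
    [':)', 'thanks', 'follow'],
    ['love', ':)'],
    ['thanks', ':)', 'love'],
]

# Flatten the per-tweet lists into one stream of words, as get_all_words does
all_words = [w for tokens in toy_cleaned_tokens_list for w in tokens]

counts = Counter(all_words)
print(counts.most_common(3))
```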

Fine. Now we’ll convert these to a data structure usable for NLTK’s
naive Bayes classifier ([docs
here](https://www.nltk.org/_modules/nltk/classify/naivebayes.html)):

In [None]:
positive_cleaned_tokens_list[0]

In [None]:
def get_token_dict(tokens):
    # Each distinct token becomes a feature whose value is True
    return {token: True for token in tokens}

def get_tweets_for_model(cleaned_tokens_list):
    return [get_token_dict(tweet_tokens) for tweet_tokens in cleaned_tokens_list]

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
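
NLTK classifiers take each example as a dictionary that maps feature names to values. The sketch below shows what the conversion produces for one made-up tweet; note that a repeated token collapses into a single feature.

```python
def get_token_dict(tokens):
    # Each distinct token becomes a feature whose value is True
    return {token: True for token in tokens}


# Made-up cleaned tokens for one tweet
toy_tokens = ['love', 'love', 'sunny', 'day']
features = get_token_dict(toy_tokens)
print(features)

# A labeled example pairs the feature dict with its class label
labeled_example = (features, 'Positive')
print(labeled_example)
```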

Create two labeled datasets for positive and negative tweets, then
shuffle them together. The corpus has 5,000 tweets per class, so a
7000/3000 split gives the train and test data.

In [None]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]
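
Since the shuffle is random, the accuracy will vary a little from run to run. If repeatable results are wanted, one option is a seeded `random.Random` instance. The sketch below uses a toy dataset; the seed value and variable names are arbitrary assumptions.

```python
import random

# Toy stand-in for the real dataset: 10 labeled examples
toy_dataset = ([({'w%d' % i: True}, 'Positive') for i in range(5)] +
               [({'w%d' % i: True}, 'Negative') for i in range(5, 10)])

rng = random.Random(42)      # fixed seed (arbitrary choice) for a repeatable shuffle
shuffled = toy_dataset[:]    # copy, so the original order is left untouched
rng.shuffle(shuffled)

split = len(shuffled) * 7 // 10   # same 70/30 proportion as 7000/3000
train_data, test_data = shuffled[:split], shuffled[split:]

print(len(train_data), len(test_data))
print(sum(1 for _, label in train_data if label == 'Positive'), 'positive examples in train')
```

Counting the labels in the training slice is a quick check that the shuffle left both classes reasonably represented.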

Finally, we train NLTK’s `NaiveBayesClassifier` on the training data
we’ve just created:

In [None]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints its table itself and returns None
classifier.show_most_informative_features(10)

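Note the Positive:Negative ratios: each one says how much more likely a
token is to appear in tweets of one class than in tweets of the other.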
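
To make such ratios concrete, here is a simplified, hand-rolled sketch that estimates how much likelier a token is in one class than in the other. It is not NLTK’s own computation: the data is made up, the names are hypothetical, and the smoothing constant `gamma=0.5` is an assumption chosen for illustration.

```python
from collections import Counter

# Made-up labeled feature sets, in the same (dict, label) shape as the dataset
toy_train = [
    ({'happy': True, 'day': True}, 'Positive'),
    ({'happy': True, 'love': True}, 'Positive'),
    ({'day': True}, 'Positive'),
    ({'sad': True, 'day': True}, 'Negative'),
    ({'sad': True}, 'Negative'),
    ({'miss': True, 'day': True}, 'Negative'),
]

# Number of tweets per label, and per label the number of tweets containing each token
label_counts = Counter(label for _, label in toy_train)
token_counts = {label: Counter() for label in label_counts}
for features, label in toy_train:
    token_counts[label].update(features.keys())


def presence_prob(token, label, gamma=0.5):
    """Smoothed estimate of P(token present | label); gamma is an assumed constant."""
    # Two outcomes per token (present / absent), hence 2 * gamma in the denominator
    return (token_counts[label][token] + gamma) / (label_counts[label] + 2 * gamma)


def ratio(token):
    """How many times likelier the token is in Positive tweets than in Negative ones."""
    return presence_prob(token, 'Positive') / presence_prob(token, 'Negative')


for token in ['happy', 'day', 'sad']:
    print(token, round(ratio(token), 2))
```

A ratio well above 1 points to a token typical of positive tweets, a ratio well below 1 to one typical of negative tweets, and a ratio near 1 to a token that says little about sentiment.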

Let’s check a test phrase. First, download the punkt sentence tokenizer
([docs here](https://www.nltk.org/api/nltk.tokenize.punkt.html)), which
`word_tokenize` relies on. Newer NLTK releases may ask for `punkt_tab`
instead.

In [None]:
nltk.download('punkt')

Now we won’t rely on `twitter_samples.tokenized`, but rather will use a
generic tokenization routine - `word_tokenize`.

In [None]:
from nltk.tokenize import word_tokenize

custom_tweet = "the service was so bad"

custom_tokens = process_tokens(word_tokenize(custom_tweet))

print(classifier.classify(get_token_dict(custom_tokens)))

Let’s package it as a function:

In [None]:
def get_sentiment(text):
    custom_tokens = process_tokens(word_tokenize(text))
    return classifier.classify(get_token_dict(custom_tokens))

texts = ["bad", "service is bad", "service is really bad", "service is so terrible", "great service", "they stole my money"]
for t in texts:
    print(t, ": ", get_sentiment(t))


Seems ok!