### Plan:
1. **Get tokens** for positive and negative tweets (by `token` in this context we mean `word`).
2. **Lemmatize** them (convert to base word forms). For that we will use a Part-of-Speech tagger.
3. **Clean'em up** (remove mentions, URLs, stop words).
4. **Prepare the input data** for the classifier, based on cleaned-up tokens.
5. **Run the Naive Bayes classifier**.

First, download the necessary prepared samples.

In [1]:
import nltk

In [2]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

Get some sample positive/negative tweets.

In [3]:
from nltk.corpus import twitter_samples


We can either get the actual string content of those tweets:

In [4]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [5]:
positive_tweets[50]

'@groovinshawn they are rechargeable and it normally comes with a charger when u buy it :)'

Or we can get a list of tokens using the [`tokenized` method](https://www.nltk.org/howto/twitter.html) on `twitter_samples`.

In [6]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[50])

['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Now let's set up a Part-of-Speech (PoS) tagger. First, download the averaged perceptron model that the PoS tagger will use.

In [7]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Import the Part-of-Speech tagger that will be used for lemmatization.

In [8]:
from nltk.tag import pos_tag

Check how it works. Note that it returns a list of tuples, where the second element of each tuple is a Part-of-Speech tag.

In [9]:
pos_tag(tweet_tokens[50])

[('@groovinshawn', 'NN'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('rechargeable', 'JJ'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('normally', 'RB'),
 ('comes', 'VBZ'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('charger', 'NN'),
 ('when', 'WRB'),
 ('u', 'JJ'),
 ('buy', 'VB'),
 ('it', 'PRP'),
 (':)', 'JJ')]

In [10]:
tweet_tokens[50]

['@groovinshawn',
 'they',
 'are',
 'rechargeable',
 'and',
 'it',
 'normally',
 'comes',
 'with',
 'a',
 'charger',
 'when',
 'u',
 'buy',
 'it',
 ':)']

Let's write a function that will lemmatize twitter tokens.

For that, let's first fetch the WordNet resource. WordNet is a semantically oriented dictionary of English; see chapter 2.5 of the NLTK book. In the online version, this is part 5 [here](https://www.nltk.org/book/ch02.html).

In [11]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Now get the PoS tags so that they can be passed to `WordNetLemmatizer`.

In [12]:
from nltk.stem.wordnet import WordNetLemmatizer
tokens = tweet_tokens[50]

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
# Convert PoS tags into a format used by the lemmatizer
# and run lemmatize
for word, tag in pos_tag(tokens):
    if tag.startswith('NN'):
        pos = 'n'
    elif tag.startswith('VB'):
        pos = 'v'
    else:
        pos = 'a'
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
print(lemmatized_sentence)

['@groovinshawn', 'they', 'be', 'rechargeable', 'and', 'it', 'normally', 'come', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Note that it converts words to their base forms ('are' -> 'be', 'comes' -> 'come').
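The tag conversion inside the loop above can be factored into a small helper. The sketch below is one possible way to do it; the name `penn_to_wordnet_pos` is hypothetical, and the separate `'r'` (adverb) branch is an assumption that is not part of the original loop, which treats everything that is neither a noun nor a verb as an adjective.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (as returned by pos_tag) to a WordNet POS letter."""
    if tag.startswith('NN'):
        return 'n'  # noun
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb (assumption: the loop above has no separate adverb case)
    return 'a'      # everything else falls back to adjective, as in the loop above

for tag in ['NN', 'VBZ', 'JJ', 'RB', 'PRP']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

The returned letter can be passed as the second argument of `lemmatizer.lemmatize(word, pos)`.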

Now we can proceed to processing. 
During processing, we will perform cleanup:
  - remove URLs and mentions using regexes
  - after lemmatization, remove *stopwords*
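The regex part of this cleanup can be tried on its own, using only the standard library. In this minimal sketch the helper name `is_url_or_mention` is hypothetical, and the sample tokens come from one tweet in the positive sample.

```python
import re

# Same patterns as used for filtering in this notebook, compiled once
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
MENTION_RE = re.compile(r'@[A-Za-z0-9_]+')

def is_url_or_mention(token):
    """Return True if the token contains a URL or an @mention."""
    return bool(URL_RE.search(token) or MENTION_RE.search(token))

tokens = ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame',
          '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
kept = [t for t in tokens if not is_url_or_mention(t)]
print(kept)
```

Hashtags and emoticons survive this step; only the mention and the link are dropped.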

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

What are these stopwords? Let's see some.

In [14]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
for i in range(10):
    print(stop_words[i])


179
i
me
my
myself
we
our
ours
ourselves
you
you're


Here comes the `process_tokens` function:

In [15]:
import re, string

def process_tokens(tweet_tokens):

    cleaned_tokens = []
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Now note the sheer size of regex for URLs :)
        # Mentions regex is comparatively short and sweet
        if (re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', token) or 
            re.search(r'(@[A-Za-z0-9_]+)', token)):
            continue

        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
   
        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

Let's test `process_tokens`:

In [16]:
print("Before:", tweet_tokens[50])
print("After:", process_tokens(tweet_tokens[50]))

Before: ['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']
After: ['rechargeable', 'normally', 'come', 'charger', 'u', 'buy', ':)']


Run `process_tokens` on all positive/negative tokens.

In [17]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = [process_tokens(tokens) for tokens in positive_tweet_tokens]
negative_cleaned_tokens_list = [process_tokens(tokens) for tokens in negative_tweet_tokens]

Let's see how the processing went.

In [18]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']


Let's see what is most common there. Add a helper function `get_all_words`:

In [19]:
def get_all_words(cleaned_tokens_list):
    return [w for tokens in cleaned_tokens_list for w in tokens]

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Perform frequency analysis using `FreqDist`:

In [20]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
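`FreqDist` is a subclass of the standard library's `collections.Counter`, so the same counting idea can be sketched without NLTK. The token lists below are made up for illustration.

```python
from collections import Counter

# Made-up cleaned tokens for three tweets
cleaned_tokens_list = [
    ['thanks', 'follow', ':)'],
    ['love', ':)'],
    ['thanks', 'good', ':)'],
]

# Flatten the per-tweet lists into one list of words, then count
all_words = [w for tokens in cleaned_tokens_list for w in tokens]
counts = Counter(all_words)
print(counts.most_common(2))
```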


Fine. Now we'll convert these to a data structure usable for NLTK's naive Bayes classifier ([docs here](https://www.nltk.org/_modules/nltk/classify/naivebayes.html)):

In [21]:
positive_cleaned_tokens_list[0]

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

In [22]:
def get_token_dict(tokens):
    # Each token becomes a feature with the value True
    return {token: True for token in tokens}

def get_tweets_for_model(cleaned_tokens_list):
    # One feature dict per tweet
    return [get_token_dict(tweet_tokens) for tweet_tokens in cleaned_tokens_list]

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
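To make the target data structure concrete: NLTK's classifier trains on a list of `(features, label)` pairs, where `features` is a dict. The sketch below reimplements the dict-building step under a hypothetical name, `to_feature_dict`, so that it runs on its own; the token list is the first cleaned positive tweet shown above.

```python
def to_feature_dict(tokens):
    # Each token becomes a feature whose value is True ("this word is present")
    return {token: True for token in tokens}

tokens = ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']
features = to_feature_dict(tokens)

# A single labeled training example in the shape the classifier expects
labeled_example = (features, "Positive")
print(labeled_example)
```

Because the features live in a dict, a word that occurs several times in a tweet is recorded only once.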

Create two datasets for positive and negative tweets. Use a 7000/3000 split for the train and test data.

In [23]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]
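Because the split depends on `random.shuffle`, the exact train and test sets change from run to run. Below is a sketch of a reproducible split with a quick label-balance check, using placeholder data instead of the real tweets; the seed value and the helper name `split_dataset` are arbitrary choices for illustration.

```python
import random
from collections import Counter

def split_dataset(dataset, train_size, seed=42):
    """Shuffle a copy of the dataset with a fixed seed and split it."""
    rng = random.Random(seed)   # local generator: leaves the global random state alone
    shuffled = list(dataset)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# Placeholder data: 5000 "Positive" and 5000 "Negative" examples
toy_dataset = ([({'id': i}, "Positive") for i in range(5000)] +
               [({'id': i}, "Negative") for i in range(5000, 10000)])

train, test = split_dataset(toy_dataset, 7000)
print(len(train), len(test))
print(Counter(label for _, label in train))
```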

Finally, we train NLTK's `NaiveBayesClassifier` on the training data we've just created:

In [24]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints the table itself and returns None,
# so it is called directly rather than wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.9953333333333333
Most Informative Features
                      :( = True           Negati : Positi =   2076.9 : 1.0
                      :) = True           Positi : Negati =    984.0 : 1.0
                follower = True           Positi : Negati =     22.0 : 1.0
                     sad = True           Negati : Positi =     17.8 : 1.0
                      aw = True           Negati : Positi =     14.5 : 1.0
                     x15 = True           Negati : Positi =     13.1 : 1.0
                  arrive = True           Positi : Negati =     12.9 : 1.0
              appreciate = True           Positi : Negati =     12.2 : 1.0
                     ugh = True           Negati : Positi =     11.8 : 1.0
                    blog = True           Positi : Negati =     11.5 : 1.0


Note the Positive:Negative ratios.
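Each ratio compares how likely a feature is under one label versus the other. A simplified sketch of that computation with made-up counts follows. The add-0.5 smoothing is an assumption meant to mirror the expected-likelihood estimate that NLTK's classifier uses by default; the real implementation may differ in detail.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Estimate P(feature = True | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the token appears in 40 of 3500 negative tweets
# and in 2 of 3500 positive tweets
ratio = likelihood_ratio(40, 3500, 2, 3500)
print(f"Negative : Positive = {ratio:.1f} : 1.0")
```

A token that is frequent under one label and almost absent under the other, such as `:(`, produces a very large ratio.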

Let's check a test phrase. First, download the Punkt sentence tokenizer ([docs here](https://www.nltk.org/api/nltk.tokenize.punkt.html)).

In [25]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we won't rely on `twitter_samples.tokenized`; instead, we will use a generic tokenization routine, `word_tokenize`.

In [26]:
from nltk.tokenize import word_tokenize

custom_tweet = "the service was #bad"

custom_tokens = process_tokens(word_tokenize(custom_tweet))

print(classifier.classify(get_token_dict(custom_tokens)))

Positive


Let's package it as a function:

In [27]:
def get_sentiment(text):
    custom_tokens = process_tokens(word_tokenize(text))
    return classifier.classify(get_token_dict(custom_tokens))

texts = ["bad", "service is bad", "service is really bad", "service is so terrible", "great service", "they stole my money", "#good"]
for t in texts:
    print(t, ": ", get_sentiment(t))


bad :  Negative
service is bad :  Positive
service is really bad :  Negative
service is so terrible :  Positive
great service :  Positive
they stole my money :  Negative
#good :  Positive


Seems mostly OK, although "service is bad" and "service is so terrible" are classified as Positive.

### This is where the homework begins

In [28]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [29]:
# Import the movie reviews corpus
from nltk.corpus import movie_reviews

positive = movie_reviews.fileids('pos')
negative = movie_reviews.fileids('neg')

In [30]:
# Load the raw text of the positive and the negative reviews.
# Newlines are replaced with spaces so that words on adjacent lines are not glued together.
positive_reviews = [movie_reviews.raw(fileid).replace("\n", " ") for fileid in positive]
negative_reviews = [movie_reviews.raw(fileid).replace("\n", " ") for fileid in negative]


In [None]:
# Prepare the token feature dicts for later use with the model
positive_tokens_for_model = get_tweets_for_model(
    [process_tokens(word_tokenize(review)) for review in positive_reviews])
negative_tokens_for_model = get_tweets_for_model(
    [process_tokens(word_tokenize(review)) for review in negative_reviews])

In [None]:
# Create the positive and negative samples for the model
positive_dataset = [(token, "Positive")
                     for token in positive_tokens_for_model]

negative_dataset = [(token, "Negative")
                     for token in negative_tokens_for_model]

In [None]:
# Shuffle the combined dataset
test_dataset = positive_dataset + negative_dataset

random.shuffle(test_dataset)

In [None]:
# Measure accuracy on the movie reviews sample
print("Accuracy is:", classify.accuracy(classifier, test_dataset))


### Conclusion: on the movie reviews dataset, the model trained on tweets showed a poor result.

In [None]:
def process_tokens_test(tweet_tokens):

    cleaned_tokens = []
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Now note the sheer size of regex for URLs :)
        # Mentions regex is comparatively short and sweet
        if (re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', token) or 
            re.search(r'(@[A-Za-z0-9_]+)', token) or re.search(r'(#[A-Za-z0-9_]+)', token)):
            continue

        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
   
        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

positive_tweet_tokens_test = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens_test = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list_test = [process_tokens_test(tokens) for tokens in positive_tweet_tokens_test]
negative_cleaned_tokens_list_test = [process_tokens_test(tokens) for tokens in negative_tweet_tokens_test]

positive_tokens_for_model_test = get_tweets_for_model(positive_cleaned_tokens_list_test)
negative_tokens_for_model_test = get_tweets_for_model(negative_cleaned_tokens_list_test)

In [None]:
positive_dataset_test = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model_test]

negative_dataset_test = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model_test]

dataset_test = positive_dataset_test + negative_dataset_test

random.shuffle(dataset_test)

train_data_new = dataset_test[:7000]
test_data_new = dataset_test[7000:]

In [None]:
classifier_new = NaiveBayesClassifier.train(train_data_new)
acc_hashon = []
acc_hashoff = []
# Caveat: both classifiers were trained once on fixed splits, so every reshuffled
# slice below overlaps with their training data; these are not held-out accuracies.
# The slice is also taken from `dataset`, whose features still include hashtags.
for i in range(1500):
    random.shuffle(dataset)
    test_data = dataset[7000:]
    acc_hashon.append(classify.accuracy(classifier, test_data))
    acc_hashoff.append(classify.accuracy(classifier_new, test_data))

# Compute each mean once and reuse it
mean_hashon = sum(acc_hashon) / len(acc_hashon)
mean_hashoff = sum(acc_hashoff) / len(acc_hashoff)

print("Mean accuracy with hashtags is:", mean_hashon)
print("Mean accuracy without hashtags is:", mean_hashoff)
if mean_hashon > mean_hashoff:
    print(f"Mean accuracy with hashtags is higher by {round(100 * (mean_hashon - mean_hashoff), 4)} percentage points")
else:
    print(f"Mean accuracy without hashtags is higher by {round(100 * (mean_hashoff - mean_hashon), 4)} percentage points")
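When comparing two preprocessing variants, a stricter protocol retrains on every split, so that no test example has been seen during training. Below is a generic sketch of such a repeated hold-out loop. `train_fn` and `accuracy_fn` are hypothetical callables (with NLTK they could be `NaiveBayesClassifier.train` and `classify.accuracy`), and the toy majority-class model exists only to make the snippet runnable without NLTK.

```python
import random
from collections import Counter

def repeated_holdout(dataset, train_fn, accuracy_fn, train_size, n_repeats=10, seed=0):
    """Mean held-out accuracy over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(dataset)
        rng.shuffle(shuffled)
        train, test = shuffled[:train_size], shuffled[train_size:]
        model = train_fn(train)                  # retrain on this split only
        scores.append(accuracy_fn(model, test))  # score on examples unseen in training
    return sum(scores) / len(scores)

# Toy stand-ins so that the sketch runs on its own
def train_majority(train):
    # The "model" is simply the most frequent label in the training split
    return Counter(label for _, label in train).most_common(1)[0][0]

def majority_accuracy(model, test):
    return sum(label == model for _, label in test) / len(test)

toy = [({'w': True}, "Positive")] * 60 + [({'w': True}, "Negative")] * 40
mean_acc = repeated_holdout(toy, train_majority, majority_accuracy, train_size=70)
print(mean_acc)
```

Retraining costs far more than re-scoring, so with a real classifier the number of repetitions would likely need to be much smaller than 1500.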

### Conclusion: a model trained on a dataset excluding hashtags yields more accurate results.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

In [None]:
# Separate the feature dicts from their labels
features = [feature_dict for feature_dict, _ in dataset]
labels = [label for _, label in dataset]

vectorizer = DictVectorizer()

log_reg_cls = LogisticRegression()

pipeline = make_pipeline(vectorizer, log_reg_cls)

accuracy_scores = cross_val_score(pipeline, features, labels, cv=5, scoring='accuracy')

print("Mean accuracy:", accuracy_scores.mean())
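`DictVectorizer` is what bridges the NLTK-style feature dicts and scikit-learn: it assigns each distinct feature name a column and turns every dict into a row of numbers. Below is a conceptual, pure-Python sketch of that idea. It is not scikit-learn's implementation (which returns a sparse matrix by default); the function name `vectorize_dicts` and the sample dicts are made up.

```python
def vectorize_dicts(feature_dicts):
    """Turn a list of {token: True} dicts into dense rows of 0.0/1.0 values."""
    # One column per distinct feature name, in sorted order
    vocabulary = sorted({name for d in feature_dicts for name in d})
    column = {name: i for i, name in enumerate(vocabulary)}
    rows = []
    for d in feature_dicts:
        row = [0.0] * len(vocabulary)
        for name, value in d.items():
            row[column[name]] = float(value)  # True becomes 1.0
        rows.append(row)
    return vocabulary, rows

vocab, X = vectorize_dicts([{'good': True, ':)': True}, {'sad': True, ':(': True}])
print(vocab)
print(X)
```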


In [None]:
if accuracy_scores.mean() < sum(acc_hashon)/len(acc_hashon):
    print("Mean accuracy of NaiveBayesClassifier is higher than mean accuracy of LogisticRegression")
else:
    print("Mean accuracy of LogisticRegression is higher than mean accuracy of NaiveBayesClassifier")

### Conclusion: NaiveBayesClassifier is better than LogisticRegression in this situation.

### To speed up code execution, you can make the following changes:

1. Reduce the number of loops.
2. Whenever possible, use comprehensions or generator expressions instead of explicit loops.
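One concrete instance of these suggestions in this notebook: `process_tokens` reloads the stop-word list on every call and then searches a list for each token. Building a set once makes each lookup constant-time on average. The sketch below uses a made-up stop-word list (the real one comes from `stopwords.words('english')`); the names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Built once at module level instead of inside the per-tweet function
STOP_WORDS = frozenset(['i', 'me', 'my', 'they', 'are', 'and', 'it', 'with', 'a', 'when'])

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Set membership is O(1) on average; list membership is O(len(list))
    return [t.lower() for t in tokens if t.lower() not in stop_words]

tokens = ['they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger']
print(remove_stop_words(tokens))
```

The same hoisting applies to the `WordNetLemmatizer` instance, which can be created once and reused across tweets.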

### To improve the model accuracy, you can do the following:

1. Choose a different classification model.
2. Utilize various hyperparameter tuning methods, such as GridSearchCV or RandomizedSearchCV.