# University-Related Coronavirus Sentiment Analysis
---

**Insert generic writeup here**

This project uses Python's NLTK package for its natural language processing tasks. Let's start with some basic setup.

## Download Data for Training and Testing our Model
---

In [None]:
import nltk
nltk.download('twitter_samples')

This command downloads the ```twitter_samples``` corpus that ships with NLTK. It contains 30,000 tweets in total (10,000 labeled by sentiment and 20,000 unlabeled), which will be used to train and test our model.



In [13]:
from nltk.corpus import twitter_samples

This imports the corpus reader for the "twitter_samples" folder we downloaded earlier, which gives us access to 3 datasets:

```negative_tweets.json```: A collection of 5,000 tweets with negative sentiment <br/>
```positive_tweets.json```: A collection of 5,000 tweets with positive sentiment <br/>
```tweets.20150430-223406.json```: A collection of 20,000 tweets with no sentiment label

The 10,000 positive and negative tweets will be used to train our model, and the remaining 20,000 unlabeled tweets will be used to test it.

## Tokenizing the Data
---

There are numerous ways we can "clean" our data to make our final model better. First, we will do what is called "tokenizing."
This process takes each tweet as a whole and splits it into smaller units called tokens, such as words, hashtags, and emoticons.
Working with tokens makes it much easier for a machine to analyze the text when developing the model.
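To make the idea concrete, here is a minimal sketch of a tokenizer built on a regular expression. The pattern and the name `toy_tokenize` are assumptions made for illustration only; this is not how NLTK tokenizes tweets, and it handles far fewer cases.

```python
import re

# Toy pattern (an assumption for illustration, not NLTK's tokenizer):
# try emoticons first, then @mentions/#hashtags, then plain words.
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][\-']?[()DPp]      # simple emoticons such as :( :) ;-) :D
    | [@\#]\w+              # @mentions and #hashtags
    | \w+(?:'\w+)?          # words, optionally with an apostrophe
    | [^\w\s]               # any other single punctuation character
    """,
    re.VERBOSE,
)

def toy_tokenize(text):
    """Split a tweet into a list of token strings."""
    return TOKEN_PATTERN.findall(text)

print(toy_tokenize("hopeless for tmr :("))
# ['hopeless', 'for', 'tmr', ':(']
```

Note that the emoticon stays together as a single token, which matters for sentiment analysis.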

Start by storing the positive, negative, and general tweets as strings.

In [24]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
nonpolar = twitter_samples.strings('tweets.20150430-223406.json')

Fortunately, NLTK contains another helpful resource known as ```punkt```. This is a pre-trained model that allows us to easily tokenize our data.

To get the ```punkt``` resource, we run the following command:

In [None]:
nltk.download('punkt')

Now we are able to use NLTK's tokenization tools. The corpus reader's ```.tokenized()``` method returns each tweet as a list of tokens.

To demonstrate how this works, let's compare the first tweet in ```negative_tweets.json``` as a raw string and as a list of tokens:

In [26]:
print(twitter_samples.strings('negative_tweets.json')[0])   # String
print(twitter_samples.tokenized('negative_tweets.json')[0]) # The same string, tokenized

hopeless for tmr :(
['hopeless', 'for', 'tmr', ':(']


Let's go ahead and tokenize ```positive_tweets.json``` for later:

In [27]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

## Normalizing the Data
---

Normalization, in terms of natural language processing, is the process of transforming text into a canonical (standard) form.
For example, "gooood" and "gud" can both be resolved to the normalized form "good." This also applies to different inflections of the same word. For example, "ran," "runs," and "running" are all forms of "run."
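As a rough illustration, normalization can be pictured as a lookup from noisy spellings to one canonical form. The table and the function name `normalize_tokens` below are hypothetical, and a real pipeline would rely on stemming or lemmatization rather than a hand-written dictionary.

```python
# Hypothetical lookup table mapping noisy or inflected forms to a canonical form.
CANONICAL_FORMS = {
    "gooood": "good",
    "gud": "good",
    "ran": "run",
    "runs": "run",
    "running": "run",
}

def normalize_tokens(tokens):
    """Lowercase each token and replace it with its canonical form if one is known."""
    return [CANONICAL_FORMS.get(token.lower(), token.lower()) for token in tokens]

print(normalize_tokens(["Gooood", "gud", "running", "today"]))
# ['good', 'good', 'run', 'today']
```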

<br/>

#### There are a few things at work here:

Stemming is the process of removing suffixes and prefixes from words. As an example, it reduces the inflection in words such as "troubled" and "troubles" to their root form "trouble."

Here are some stemming examples produced with Porter's algorithm, one of the most common stemming algorithms:

<html>
<img src="Documents/StemmingExample.PNG" alt="drawing" width="275"/>
</html>
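The suffix-stripping idea behind stemming can be sketched in a few lines. The rule set and the name `toy_stem` are assumptions for illustration; this is a deliberately crude stand-in, not Porter's algorithm, which applies many more rules and conditions.

```python
# Suffixes to strip, checked in this order. This is a toy rule set,
# not Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ("troubled", "troubles", "troubling", "trouble"):
    print(word, "->", toy_stem(word))
```

With these rules the three inflected forms all collapse to "troubl", while "trouble" itself is left alone. Stems do not have to be dictionary words, which is one of the main differences from lemmatization.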

Lemmatization is similar to stemming, but rather than just cutting off the affixes, it transforms the word to its dictionary root (its lemma). As an example, it may transform the word "better" to "good."

Here are some examples of lemmatization using a dictionary mapping for the lookups:

<html>
<img src="Documents/LemmatizationExample.PNG" alt="drawing" width="275"/>
</html>
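A dictionary-based lemmatizer can be sketched as a lookup keyed on both the word and its part of speech. The table below is a tiny hypothetical sample, and `lookup_lemma` is a name invented for this example; WordNet plays the role of this table in the rest of the notebook.

```python
# Hypothetical mapping from (word, part of speech) to lemma.
# 'a' = adjective, 'v' = verb, 'n' = noun.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("members", "n"): "member",
}

def lookup_lemma(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lookup_lemma("better", "a"))   # good
print(lookup_lemma("better", "n"))   # better (no entry for this part of speech)
```

The second call shows why the part of speech matters: the same surface word can map to different lemmas depending on how it is used.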

<br/>
<br/>


This processing is essential for noisy social-media posts, as abbreviations and misspellings are very common!

We will be using lemmatization for our data, so let's download ```wordnet```, a lexical database, and ```averaged_perceptron_tagger```, which will help us determine each word's part of speech.

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Before using the lemmatizer, we must determine the role of each word within our tweets, that is, its part of speech. To do this, we use what's called a tagging algorithm. Fortunately, NLTK provides a function for this.

Let's test it here:

In [33]:
from nltk.tag import pos_tag
print(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]


Here are some common tags and their meanings:
- NNP: Noun, proper, singular
- NN: Noun, common, singular or mass
- IN: Preposition or conjunction, subordinating
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
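
Because related tags share a prefix, the first two letters of a tag are enough to recover a coarse word class. The sketch below groups the tagger output shown above into nouns, verbs, and everything else; the helper name `coarse_category` is an assumption made for this example.

```python
from collections import Counter

# Tagger output for the first positive tweet, copied from the cell above.
tagged = [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'),
          ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'),
          ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'),
          ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

def coarse_category(tag):
    """Collapse a Penn Treebank tag into 'noun', 'verb', or 'other' by its prefix."""
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB'):
        return 'verb'
    return 'other'

counts = Counter(coarse_category(tag) for _, tag in tagged)
print(counts['noun'], counts['verb'], counts['other'])
# 7 2 6
```

Notice that the emoticon ```:)``` is counted as a noun: the tagger was not trained on tweets, so tags on hashtags, mentions, and emoticons should be treated with caution.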

Tags starting with ```NN``` are nouns and tags starting with ```VB``` are verbs, so we can use these prefixes in a function that lemmatizes our data:

In [35]:
from nltk.stem.wordnet import WordNetLemmatizer

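def lemmatize_sentence(tokens):
    """Lemmatize a list of tokens, using each token's POS tag to pick the WordNet category."""
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to a WordNet part of speech
        if tag.startswith('NN'):
            pos = 'n'  # noun
        elif tag.startswith('VB'):
            pos = 'v'  # verb
        else:
            pos = 'a'  # everything else is treated as an adjective
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence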

This function gets the tag of each token within the Tweet, and lemmatizes accordingly.
Let's test it here:

In [36]:
print(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']
