# University Related Coronavirus Sentiment Analysis
---

**Insert generic writeup here**

We use the NLTK package in Python for the natural language processing tasks in this project. Let's start with some basic setup.

There are many ways to "clean" our data so that the computer can process it more easily and our final predictions are more accurate. Let's walk through some of these methods below.

## Getting our Data
---

We have collected our data in CSV format, so let's load it into a Pandas DataFrame and pull the Tweet text out of the `content` column.

In [53]:
import pandas as pd
import nltk
# Read only the date and content columns from the CSV
fields = ['date', 'content']
d = pd.read_csv('combined_csv.csv', usecols=fields)

uni_related_tweets = d['content']

## Tokenizing the Data
---

First, we will do what is called "tokenizing." This process takes each Tweet as a whole and splits it into smaller units called tokens. Tokens make it much easier for a machine to work with the text when we develop the model.

To demonstrate how tokenization works, let's tokenize one of the Tweets:

In [56]:
from nltk.tokenize import word_tokenize
print(uni_related_tweets[0])   # String
print()
print(word_tokenize(uni_related_tweets[0])) # The same string, tokenized

Marcus Yoder (HCB ‘92) has been appointed senior vice president of sales for @GameAccountNtwk, a leading business-to-business supplier of internet gambling software-as-a-service solutions. https://t.co/a8uu63502V https://t.co/WqS4p0PVN9

['Marcus', 'Yoder', '(', 'HCB', '‘', '92', ')', 'has', 'been', 'appointed', 'senior', 'vice', 'president', 'of', 'sales', 'for', '@', 'GameAccountNtwk', ',', 'a', 'leading', 'business-to-business', 'supplier', 'of', 'internet', 'gambling', 'software-as-a-service', 'solutions', '.', 'https', ':', '//t.co/a8uu63502V', 'https', ':', '//t.co/WqS4p0PVN9']


Let's go ahead and tokenize all of these tweets, and store them for use later:

In [58]:
# Tokenize every tweet, iterating over the values directly instead of by index
tweet_tokens = [word_tokenize(tweet) for tweet in uni_related_tweets]
tweet_tokens

[['Marcus',
  'Yoder',
  '(',
  'HCB',
  '‘',
  '92',
  ')',
  'has',
  'been',
  'appointed',
  'senior',
  'vice',
  'president',
  'of',
  'sales',
  'for',
  '@',
  'GameAccountNtwk',
  ',',
  'a',
  'leading',
  'business-to-business',
  'supplier',
  'of',
  'internet',
  'gambling',
  'software-as-a-service',
  'solutions',
  '.',
  'https',
  ':',
  '//t.co/a8uu63502V',
  'https',
  ':',
  '//t.co/WqS4p0PVN9'],
 ['Congratulations',
  'to',
  'the',
  'Haslam',
  'College',
  'of',
  'Business',
  '@',
  'SHRM',
  'student',
  'chapter',
  'for',
  'being',
  'awarded',
  'a',
  '2019-2020',
  'Outstanding',
  'Student',
  'Chapter',
  'merit',
  'award',
  '!',
  'The',
  'group',
  'has',
  'been',
  'recognized',
  'for',
  'providing',
  'superior',
  'growth',
  'and',
  'development',
  'opportunities',
  'to',
  'its',
  'members',
  '.',
  'https',
  ':',
  '//t.co/4vW7Mxk5tA'],
 ['#',
  'FacultyFriday',
  'Supply',
  'chain',
  'management',
  'professor',
  'Chris',
  

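As you can see above, the tokenization process took each original string and split it into smaller units known as tokens. Note that `word_tokenize` also breaks Twitter handles and URLs apart: `@` and `GameAccountNtwk`, for example, become separate tokens.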
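
Because `word_tokenize` is aimed at general prose, it splits handles and links into several pieces, as the output above shows. As a rough illustration of what a tweet-aware tokenizer might do instead, here is a minimal regular-expression sketch. The pattern, the sample text, and the `simple_tweet_tokenize` name are assumptions made for this example. This is not how NLTK tokenizes, and it ignores many edge cases (emoji, contractions, punctuation glued to the end of a URL).

```python
import re

# Hypothetical pattern for illustration; alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # a URL, kept as one token
    r"|[@#]\w+"          # an @mention or #hashtag
    r"|\w+(?:-\w+)*"     # a word, optionally hyphenated
    r"|[^\w\s]"          # any single punctuation character
)

def simple_tweet_tokenize(text):
    """Split text into tokens, keeping URLs, mentions and hashtags whole."""
    return TOKEN_PATTERN.findall(text)

# Made-up sample text, not taken from the dataset
sample = "Thanks @example_user for the business-to-business tips! https://example.com/a1 #FacultyFriday"
print(simple_tweet_tokenize(sample))
```

Keeping a handle or link as a single token makes it easier to filter such tokens out later, should they turn out to add noise to the sentiment model.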

## Normalizing the Data
---

Normalization, in terms of natural language processing, is the process of transforming a text into a canonical (standard) form.
For example, "gooood" and "gud" can be resolved to the normalized form "good." This also applies to different forms of the same word: "ran," "runs," and "running" are all forms of "run."
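
To make the idea concrete, here is a minimal sketch of two simple normalization steps: squeezing stretched-out letters and looking up informal spellings in a table. The `SLANG` table and the `normalize_word` name are assumptions for illustration only. A real pipeline would need a much larger mapping or a spell-checking step.

```python
import re

# Hypothetical lookup table mapping informal spellings to a standard form
SLANG = {"gud": "good", "thx": "thanks"}

def normalize_word(word):
    """Lower-case a word, squeeze stretched letters, then apply the lookup table."""
    word = word.lower()
    # Collapse any run of three or more identical characters down to two,
    # so "gooood" becomes "good" while a legitimate "oo" is left alone.
    word = re.sub(r"(.)\1{2,}", r"\1\1", word)
    return SLANG.get(word, word)

print([normalize_word(w) for w in ["gooood", "gud", "Good"]])
# ['good', 'good', 'good']
```

The squeezing rule is only a heuristic: "sooo" becomes "soo" rather than "so", so the result would still need a dictionary check.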

<br/>

#### There are a few things at work here:

Stemming is the process of stripping affixes (mostly suffixes) from words by rule. As an example, it reduces inflected words such as "troubled" and "troubles" to a common stem. The stem need not be a dictionary word: Porter's algorithm maps both of these to "troubl."

Here are some stemming examples made using Porter's algorithm, one of the most common stemming algorithms:

<html>
<img src="Documents/StemmingExample.PNG" alt="drawing" width="275"/>
</html>
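
To show the general idea of rule-based suffix stripping, here is a deliberately naive sketch. It is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list and the `naive_stem` name are assumptions for illustration.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["troubled", "troubles", "troubling"]:
    print(w, "->", naive_stem(w))
```

All three inflected forms end up as `troubl`. A stem only has to be consistent across related words; it does not have to be a real word.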

Lemmatization is similar to stemming, but rather than just cutting off the affixes, it transforms the word to its dictionary form (its lemma). As an example, it may transform the word "better" to "good."

Here are some examples of lemmatization using a dictionary mapping for the translations:

<html>
<img src="Documents/LemmatizationExample.PNG" alt="drawing" width="275"/>
</html>
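
A dictionary-based lemmatizer can be pictured as a simple lookup. The sketch below uses a tiny hand-written table; the `LEMMAS` mapping and the `lookup_lemma` name are assumptions for illustration, whereas a real lemmatizer draws on a large lexical database.

```python
# Hypothetical, hand-written mapping from word forms to lemmas
LEMMAS = {
    "better": "good",
    "ran": "run",
    "runs": "run",
    "running": "run",
    "troubled": "trouble",
    "troubles": "trouble",
}

def lookup_lemma(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word.lower(), word)

for w in ["better", "running", "troubles", "tweet"]:
    print(w, "->", lookup_lemma(w))
```

Unlike a stem, the result is always a real word, but only for forms the table knows about; unknown words such as "tweet" pass through unchanged.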

<br/>
<br/>


This processing is essential for noisy social-media posts, as abbreviations and misspellings are very common!

We will be using lemmatization for our data, so let's download `wordnet`, a lexical database, and `averaged_perceptron_tagger`, a part-of-speech tagger that will help us determine the role each word plays in its sentence.

In [59]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/tucker/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/tucker/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Before using the lemmatizer, we must determine the part of speech (noun, verb, adjective, and so on) of each word within our tweets, since the correct lemma depends on it. To do this, we use what's called a tagging algorithm. Fortunately, NLTK provides a function for this.

Let's test it here:

In [60]:
from nltk.tag import pos_tag
print(pos_tag(tweet_tokens[0]))

[('Marcus', 'NNP'), ('Yoder', 'NNP'), ('(', '('), ('HCB', 'NNP'), ('‘', 'NNP'), ('92', 'CD'), (')', ')'), ('has', 'VBZ'), ('been', 'VBN'), ('appointed', 'VBN'), ('senior', 'JJ'), ('vice', 'NN'), ('president', 'NN'), ('of', 'IN'), ('sales', 'NNS'), ('for', 'IN'), ('@', 'JJ'), ('GameAccountNtwk', 'NNP'), (',', ','), ('a', 'DT'), ('leading', 'VBG'), ('business-to-business', 'JJ'), ('supplier', 'NN'), ('of', 'IN'), ('internet', 'JJ'), ('gambling', 'VBG'), ('software-as-a-service', 'JJ'), ('solutions', 'NNS'), ('.', '.'), ('https', 'NN'), (':', ':'), ('//t.co/a8uu63502V', 'JJ'), ('https', 'NN'), (':', ':'), ('//t.co/WqS4p0PVN9', 'NN')]


Here are some common tags and their meanings:
- NNP: Noun, proper, singular
- NN: Noun, common, singular or mass
- IN: Preposition or conjunction, subordinating
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle

Tags starting with `NN` are nouns, and tags starting with `VB` are verbs. We can use this fact to write a function that lemmatizes our data; for simplicity, every other tag is treated as an adjective:

In [61]:
from nltk.stem.wordnet import WordNetLemmatizer

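def lemmatize_sentence(tokens):
    """Lemmatize a list of tokens, using each token's POS tag as a hint."""
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to a WordNet part-of-speech code
        if tag.startswith('NN'):
            pos = 'n'   # noun
        elif tag.startswith('VB'):
            pos = 'v'   # verb
        else:
            pos = 'a'   # everything else is treated as an adjective
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence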

This function gets the tag of each token within the Tweet, and lemmatizes accordingly.
Let's test it here:

In [62]:
print(lemmatize_sentence(tweet_tokens[0]))

['Marcus', 'Yoder', '(', 'HCB', '‘', '92', ')', 'have', 'be', 'appoint', 'senior', 'vice', 'president', 'of', 'sale', 'for', '@', 'GameAccountNtwk', ',', 'a', 'lead', 'business-to-business', 'supplier', 'of', 'internet', 'gamble', 'software-as-a-service', 'solution', '.', 'http', ':', '//t.co/a8uu63502V', 'http', ':', '//t.co/WqS4p0PVN9']
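
`lemmatize_sentence` sends every tag that is not a noun or a verb to the adjective branch. One possible refinement, sketched below under the assumption that the tags follow the Penn Treebank convention listed earlier, is to map adjective (`JJ*`) and adverb (`RB*`) tags explicitly and fall back to noun otherwise. The helper name `penn_to_wordnet` is hypothetical; the single-letter codes `'n'`, `'v'`, `'a'` and `'r'` are the part-of-speech values WordNet uses.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else

# A few (word, tag) pairs copied from the tagger output shown earlier
tagged = [("has", "VBZ"), ("leading", "VBG"), ("senior", "JJ"), ("sales", "NNS"), ("of", "IN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Inside `lemmatize_sentence`, the `if`/`elif`/`else` chain could then be replaced by `pos = penn_to_wordnet(tag)`. Whether this changes the results noticeably on this dataset has not been checked here.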
