# Vader Sentiment Analysis Average Over Time Data Prep
This notebook prepares tweets to give an average sentiment over time as well as a count of tweets over time. You can group tweets by day, week, month, or year depending on your data and needs. To view the output as a line graph, use the "vaderAverageOverTimeResults" notebook.

For more information about how Vader works behind the scenes see here: https://github.com/cjhutto/vaderSentiment

###  Before we begin
Before we start, you will need to have set up a [Carbonate account](https://kb.iu.edu/d/aolp) in order to access [Research Desktop (ReD)](https://kb.iu.edu/d/apum). You will also need to have access to ReD through the [thinlinc client](https://kb.iu.edu/d/aput). If you have not done any of this, or have only done some of this, but not all, you should go to our [textPrep-Py.ipynb](https://github.com/cyberdh/Text-Analysis/blob/master/textPrep-Py.ipynb) before you proceed further. The textPrep-Py notebook provides information and resources on how to get a Carbonate account, how to set up ReD, and how to get started using the Jupyter Notebook on ReD.   

### Run CyberDH environment
The code in the cell below points to a Python environment specifically for use with the Python Jupyter Notebooks created by Cyberinfrastructure for Digital Humanities. It allows for the use of the different packages in our notebooks and their subsequent data sets.

##### Packages
- **sys:** Provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available.
- **os:** Provides a portable way of using operating system dependent functionality.

#### NOTE: This cell is only for use with Research Desktop. You will get an error if you try to run this cell on your personal device!!

In [1]:
import sys
import os
sys.path.insert(0,"/N/u/cyberdh/Carbonate/dhPyEnviron/lib/python3.6/site-packages")
os.environ["NLTK_DATA"] = "/N/u/cyberdh/Carbonate/dhPyEnviron/nltk_data"

### Include necessary packages for notebook 

Python's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without reinventing the wheel. Some packages are included in the basic installation of Python; others, created by Python users, are available for download.

In your terminal, packages can be installed by simply typing `pip install nameofpackage --user`. However, since you are using ReD and our Python environment, you will not need to install any of the packages below to use this notebook. Anytime you need to make use of a package, however, you need to import it so that Python knows to look in these packages for any functions or commands you use. Below is a brief description of the packages we are using in this notebook:    

- **nltk:** Platform for building Python programs to work with human language data. Here we bring in the VADER sentiment analysis tool which is now a part of the nltk package.

- **pandas:** An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

- **glob:** Finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

- **zipfile:** Allows for handling of zipfiles.

- **warnings:** Allows for the manipulation of warning messages in Python.

Notice we import some of the packages differently. In some cases we just import the entire package when we say `import XYZ`. For some packages which are small, or, from which we are going to use a lot of the functionality it provides, this is fine. 

Sometimes when we import the package directly we say `import XYZ as X`. All this does is allow us to type `X` instead of `XYZ` when we use certain functions from the package. So we can now say `X.function()` instead of `XYZ.function()`. This saves time typing and eliminates errors from having to type out longer package names. I could just as easily type `import XYZ as potato` and whenever I use a function from the `XYZ` package I would need to type `potato.function()`. What we import the package as is up to you, but some commonly used packages have abbreviations that are standard amongst Python users such as `import pandas as pd` or `import matplotlib.pyplot as plt`. You do not need to use `pd` or `plt`; however, these are widely used, and using something else could confuse other users and is generally considered bad practice.

Other times we import only specific elements or functions from a package. This is common with packages that are very large and provide a lot of functionality, but from which we are only using a couple functions or a specific subset of the package that contains the functionality we need. This is seen when we say `from XYZ import ABC`. This is saying I only want the `ABC` function from the `XYZ` package. Sometimes we need to point to the specific location where a function is located within the package. We do this by adding periods in between the directory names, so it would look like `from XYZ.123.A1B2 import LMN`. This says we want the `LMN` function which is located in the `XYZ` package and then the `123` and `A1B2` directory in that package. 

You can also import more than one function from a package by separating the functions with commas like this `from XYZ import ABC, LMN, QRS`. This imports the `ABC`, `LMN` and `QRS` functions from the `XYZ` package.
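
The import styles described above can be compared side by side. The sketch below uses only modules from Python's standard library; `math`, `statistics`, and `os.path` are stand-ins chosen for illustration and are not packages this notebook depends on.

```python
# Import the whole package and call functions through its name
import math

# Import a package under a shorter alias
import statistics as st

# Import one function from a sub-module (package.submodule)
from os.path import join

# Import several names from one package at once
from math import floor, ceil, sqrt

wholePackage = math.sqrt(16)          # called as package.function()
aliased = st.mean([1, 2, 3])          # called as alias.function()
singleName = join("data", "twitter")  # called directly, with no prefix
several = (floor(2.5), ceil(2.5), sqrt(9))

print(wholePackage, aliased, singleName, several)
```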

In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import glob
import zipfile
import warnings

This will ignore future warnings. None of the warnings raised by this code are a cause for concern; they will not break the code or cause errors in the results.

In [3]:
warnings.filterwarnings("ignore",category=FutureWarning)

### Variables
Here we create some variables for use later in our code. We do this to minimize the number and complexity of the changes you will need to make later.

First we need to decide if we want to read in a single file or a whole directory of files. If you want to read in a whole directory then set `source` equal to `"*"`. This is a wildcard (interpreted by the `glob` package, not a regular expression) that means 'all', so we are reading in 'all' the files of the chosen file type in a directory. If you wish to read in a single file then set `source` equal to the name of the desired file in quotes, but leaving the '.json' or '.csv' off at the end. So a single file should look like this: `source = "myFileName"`.
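
To see how the value of `source` selects files, here is a small sketch that builds a throwaway folder and runs the same kind of `glob` pattern the notebook uses later. The file names are made up for the example.

```python
import glob
import os
import tempfile

# Build a temporary folder holding three empty example files (names are made up)
with tempfile.TemporaryDirectory() as tmpDir:
    for name in ["tweetsA.json", "tweetsB.json", "notes.txt"]:
        open(os.path.join(tmpDir, name), "w").close()

    fileType = ".json"

    # source = "*" matches every file that ends in fileType
    pattern = os.path.join(tmpDir, "*" + fileType)
    allJson = sorted(os.path.basename(p) for p in glob.glob(pattern))

    # source = "tweetsA" matches only that one file
    pattern = os.path.join(tmpDir, "tweetsA" + fileType)
    oneJson = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(allJson)
print(oneJson)
```

The `.txt` file is never matched because the pattern always ends with the chosen `fileType`.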

Next we assign the file type our data comes in to a variable. At the moment the only options are '.csv' or '.json' as these are the most popular twitter data formats. We assign the format to the `fileType` variable. It should look like this: `fileType = ".json"`.

The `textColIndex` variable is where we put the header name of the dataframe column that will contain the content we are interested in from our tweets. Generally the content of the tweets is labeled as "text" since this is the label given to the tweet content when it is pulled directly from the Twitter API. For this reason our default value assigned to the `textColIndex` is `"text"`. If for some reason the tweet content has a different label or header, and you need to change this, remember to keep the quotes around the new label.

We assign a unit of time to our `timeLength` variable. This is used later to get our average sentiment and tweet counts by day, week, month, or year. To signify a per day mean/count we use `"D"` (with the quotes). For a per week average we use `"W"`, followed by `"M"` for month, and `"Y"` for year if your data happens to span multiple years and that is what you are most interested in. Note that this will affect the graphical output in the "vaderAverageOverTimeResults" notebook as it will show the means/counts based on the setting here.

The `remove` variable is assigned a boolean of either **True** or **False**. If it is **True** it means that you want to "remove" the terms in the `remWords` list below from the vader lexicon. If you set it to **False**, you do not have any words to remove.

The `add` variable is assigned a boolean of either **True** or **False**. If it is **True** it means you want to add the key/value pairs in the dictionary `newWords` to the vader lexicon. The `newWords` dictionary (just below the `add` boolean variable) contains **"word": vader polarity score** for words you would like to add to the vader lexicon. The scores in the example dictionary were made up (they did not follow vader protocol). If you wanted to add terms properly, you would need to follow a protocol similar to vader's: find ten people to score the word between -4 (most negative) and 4 (most positive), with 0 as a possible score, and then take the average score (which is the number after each word in the `newWords` dictionary). You would also need to determine the standard deviation, as the creators of vader did not include words whose ratings had a standard deviation of 2.5 or more.
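
As a rough illustration of the scoring protocol described above, the sketch below averages ten ratings for one candidate word and checks the spread. The ratings are invented for the example, and using the population standard deviation (`pstdev`) is an assumption; the vader authors may have calculated it differently.

```python
import statistics

# Hypothetical ratings from ten people for one candidate word,
# each on the -4 (most negative) to 4 (most positive) scale
ratings = [-2, -3, -2, -1, -3, -2, -2, -4, -2, -3]

# The average is the value that would go after the word in newWords
meanScore = statistics.mean(ratings)

# Population standard deviation of the ratings (an assumed choice)
spread = statistics.pstdev(ratings)

# Keep the word only if the raters agreed closely enough and the mean is not zero
keepWord = spread < 2.5 and meanScore != 0

print(meanScore, round(spread, 3), keepWord)
```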

**NOTE:** If you want to change the score/polarity of an existing word in the dictionary, first remove the word by including it in the `remWords` list, then add the word with a new polarity score in the `newWords` dictionary. This removes the current word and score and adds the word with a new score in the algorithm. 

The variable `encoding` is where you determine what type of encoding to use (ascii, ISO-8859-1, utf-8, etc.). We have it set to utf-8 at the moment as we have found it is less likely to cause problems.

In [4]:
source = "coronaVirus01-21Jan2020"
fileType = ".json"
textColIndex = "text"
timeLength = "D"
remove = True
remWords = ["novel", "ha", "l", "gt", "positive"]
add = True
newWords = {"virus": -1.7, "outbreak": -0.6, "epidemic": -2.3, "pandemic": -3.1, "quarantine": -2.6, "positive": -2.6}
encoding = "utf-8"

### File paths
Here we assign file paths we will need throughout this notebook to variables. This way we only need to make changes here and they will be implemented throughout the code. The `homePath` variable reads the `HOME` entry from `environ`, a dictionary-like object in the `os` package that holds your environment variables. On Linux and macOS (including ReD) this entry points to your home directory. On Windows `HOME` is often not set, so you may need to assign the path to `homePath` yourself.

Then we join the `homePath` variable to folders that point to where our data is stored and we assign this file path to the variable `dataHome`. The folder names are in quotes and separated by a comma. 

Finally, we again use the `homePath` variable and join it with a file path that points to a folder where we can save cleaned and prepped data to be used in the "vaderAverageOverTimeResults" notebook. We assign this file path to a variable called `dataClean`.

You can change any of these to better match where your data can be found (`dataHome`) and where you want your cleaned/prepped data stored (`dataClean`).

In [5]:
homePath = os.environ['HOME']
dataHome = os.path.join(homePath,'Text-Analysis-master', 'data', 'twitter')
dataClean = os.path.join(homePath,"Text-Analysis-master","VADERSentimentAnalysis", "cleanedData")
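
If you run this notebook somewhere the `HOME` environment variable might not be set, one possible alternative is `os.path.expanduser("~")`. This is only a sketch of that option, not the approach the notebook itself uses; the folder names are the same ones shown above.

```python
import os

# expanduser("~") resolves the home directory on Linux, macOS, and Windows,
# whereas os.environ["HOME"] raises a KeyError when HOME is not set
homePath = os.path.expanduser("~")

dataHome = os.path.join(homePath, "Text-Analysis-master", "data", "twitter")
dataClean = os.path.join(homePath, "Text-Analysis-master", "VADERSentimentAnalysis", "cleanedData")

# Optional: create the output folder if it is missing so later saves do not fail
# os.makedirs(dataClean, exist_ok=True)

print(dataHome)
print(dataClean)
```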

#### Shorten SentimentIntensityAnalyzer Function
We create an instance of `SentimentIntensityAnalyzer()` and assign it to the variable `vader`, which gives us a shorter name to type.

In [6]:
vader = SentimentIntensityAnalyzer()

#### Remove words
Here we have an "if...else" statement. If we assigned **True** to the variable `remove` above then we call the `.pop` method on the vader lexicon (a Python dictionary) for each word in the `remWords` list, which removes that word and its score.

If we assigned **False** to `remove` then we do nothing.

In [7]:
if remove:
    # pop(x, None) removes the word if present and skips it otherwise
    for x in remWords:
        vader.lexicon.pop(x, None)
else:
    pass

#### Add words
Here we have another "if...else" statement. If we assigned **True** to `add` then we use the `.update` method on the vader lexicon (a Python dictionary) to add each `{"key": value}` pair from our `newWords` dictionary above.

If we assigned **False** to `add` then we do nothing.

In [8]:
if add:
    vader.lexicon.update(newWords)
else:
    pass
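
Because the vader lexicon behaves like an ordinary Python dictionary, the remove-then-add steps can be tried out on a small stand-in dictionary first. The words and scores below are made up and are not taken from the real lexicon.

```python
# A small stand-in for vader.lexicon; the words and scores are made up
lexicon = {"novel": 1.3, "positive": 2.6, "great": 3.1}

remWords = ["novel", "positive", "notInLexicon"]
newWords = {"virus": -1.7, "positive": -2.6}

# pop(word, None) removes the word if present and ignores it otherwise
for word in remWords:
    lexicon.pop(word, None)

# update() adds every key/value pair, overwriting any key that already exists
lexicon.update(newWords)

print(lexicon)
```

Note that "positive" ends up with its new score, which is the remove-then-add pattern used to change the polarity of an existing word.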

### Unzip files

Here we are unzipping files. Since Twitter data can be rather large it is often necessary to compress it into a '.zip' file in order to upload it to places such as GitHub. For this reason, we have set up some code to automatically extract all the items in a compressed '.zip' file so you don't have to and so you don't get errors later. Note that each '.zip' file is deleted once its contents have been extracted. If the data is not in a '.zip' file there is no need to worry; the code will not give an error if there are no files ending in '.zip' in your directory.

The only changes you may wish to make are in the first two lines. These are the lines that point to the file paths where your '.zip' files are stored. If you have '.zip' files stored in another folder you will want to change the path. Note that the first line points to the directory and the second line points to the files.  

In [9]:
direct = os.path.join(dataHome, "JSON" if fileType == ".json" else "CSV")
allZipFiles = glob.glob(os.path.join(direct, "*.zip"))
for item in allZipFiles:
    # Extract into the same folder, then delete the archive
    with zipfile.ZipFile(item, "r") as zipRef:
        zipRef.extractall(direct)
    os.remove(item)
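
The extract-then-delete behavior can be tried safely in a temporary folder before pointing the code at real data. This sketch creates its own example archive; the file names and contents are made up.

```python
import glob
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as direct:
    # Create an example archive holding one made-up file
    zipPath = os.path.join(direct, "example.zip")
    with zipfile.ZipFile(zipPath, "w") as zipRef:
        zipRef.writestr("tweets.json", '{"text": "hello"}\n')

    # Extract every archive in the folder, then delete the archive
    for item in glob.glob(os.path.join(direct, "*.zip")):
        with zipfile.ZipFile(item, "r") as zipRef:
            zipRef.extractall(direct)
        os.remove(item)

    # Only the extracted file should be left in the folder
    remaining = sorted(os.listdir(direct))

print(remaining)
```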

### Reading in .csv and .json files

If you chose `".json"` as your `fileType` up above, then the first `if` statement in the code below reads in ".json" files and saves the contents to a dataframe using the Pandas package. It will read in either an entire directory or a single ".json" file depending on what you chose for `source` above. 

Once we have read in the ".json" files using the Pandas `read_json` function, we concatenate them into a single data frame, which matters when there are multiple files. Because of this it is important that your ".json" files share the same keys with identical names; otherwise the combined data frame will contain mismatched columns and later steps may fail. If you have a single ".json" file then you should be fine for this step. We assign the result to the variable `cdf` so we can use it later.

Next we pass `cdf` through `pd.DataFrame` to make sure it is a pandas dataframe. This allows for easier manipulation of the data in the next line.

Finally, we convert the column holding the tweet content (`"text"` by default, the value we assigned to `textColIndex` earlier) to strings and print the number of tweets that were read in.

If you chose `".csv"` for your fileType, then the second `if` statement will read in ".csv" files and save the content to a dataframe using the Pandas package much like the ".json" file process described above. The only difference is that we use the Pandas function `read_csv` instead of `read_json`. Everything else is exactly the same as what is described above in the ".json" section. 

In [10]:
if fileType == ".json":
    # Read every matching line-delimited .json file and stack them into one data frame
    allFiles = glob.glob(os.path.join(dataHome,"JSON",source + fileType))     
    df = (pd.read_json(f, encoding = encoding, lines = True) for f in allFiles)
    cdf = pd.concat(df, ignore_index=True)
    cdf = pd.DataFrame(cdf)
if fileType == ".csv":
    # Read every matching .csv file and stack them into one data frame
    allFiles = glob.glob(os.path.join(dataHome, "CSV", source + fileType))     
    df = (pd.read_csv(f, engine = "python") for f in allFiles)
    cdf = pd.concat(df, ignore_index=True)
    cdf = pd.DataFrame(cdf)
# Store the tweet content as strings and report how many tweets were read
cdf[textColIndex] = cdf[textColIndex].astype(str)
print(len(cdf[textColIndex]))

14887
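
To make the read-and-concatenate step concrete, the sketch below reads two tiny in-memory "files" of line-delimited JSON instead of files on disk. The tweets are invented, and `io.StringIO` is only used here so the example runs without any data files.

```python
import io
import pandas as pd

# Two made-up line-delimited JSON "files": one JSON object per line
fileA = io.StringIO(
    '{"created_at": "2020-01-01 11:10:10", "text": "first tweet"}\n'
    '{"created_at": "2020-01-01 12:08:45", "text": "second tweet"}\n'
)
fileB = io.StringIO(
    '{"created_at": "2020-01-03 09:42:22", "text": "third tweet"}\n'
)

# lines=True tells pandas to parse each line as its own record
frames = (pd.read_json(f, lines=True) for f in [fileA, fileB])

# ignore_index=True renumbers the rows across all files
cdf = pd.concat(frames, ignore_index=True)
cdf["text"] = cdf["text"].astype(str)

print(len(cdf["text"]))
```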


### Run VADER

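Now we run vader over our tweets. We do this by "applying" `vader.polarity_scores` to each tweet in the tweet text column ("text" by default) of our `cdf` data frame. Each result is a dictionary with four scores: "neg", "neu", "pos", and "compound".

Then we concatenate the scores to the end of the data frame as four new columns.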

Lastly we display the first five rows so we can check and make sure we have what we need.

In [11]:
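# Score every tweet; each result is a dictionary of neg, neu, pos, and compound
sentiment = cdf[textColIndex].apply(lambda x: vader.polarity_scores(x))
# Expand the dictionaries into columns and attach them to the right of cdf
cdf = pd.concat([cdf, sentiment.apply(pd.Series)], axis=1)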
cdf.head(5)

| | coordinates | created_at | entities | favorited | id_str | in_reply_to_screen_name | in_reply_to_status_id_str | otherfields | place | retweet_count | ... | source | text | truncated | user | user_id_str | user_screen_name | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 2020-01-01 11:10:10 | `{'hashtags': [{'text': 'coronavirus', 'indices...` | False | 1212330167269515264 | | | `{'display_text_range': '[0,140]', 'favorite_co...` | | 1 | ... | `<a href="http://twitter.com/download/android" ...` | Simply...si no és #coronavirus, NO hi ha #SARS... | True | `{'created_at': 'Mon Aug 29 16:48:30 -0400 2011...` | 364476368 | XavierAbadMdG | 0.165 | 0.835 | 0.0 | -0.6289 |
| 1 | | 2020-01-01 12:08:45 | `{'hashtags': [{'text': 'coronavirus', 'indices...` | False | 1212344909665046528 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://twitter.com/#!/download/ipad" ...` | RT @XavierAbadMdG: Simply...si no és #coronavi... | False | `{'created_at': 'Sat Nov 24 15:43:45 -0500 2012...` | 968761746 | RossellRos | 0.204 | 0.796 | 0.0 | -0.6289 |
| 2 | | 2020-01-03 09:42:22 | `{'hashtags': [{'text': 'Patients', 'indices': ...` | False | 1213032848086773760 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://www.agenparl.com/" rel="nofoll...` | Update on cluster of #Patients infected with #... | True | `{'created_at': 'Mon Jul 19 04:38:21 -0400 2010...` | 168425731 | Agenparl | 0.158 | 0.842 | 0.0 | -0.4939 |
| 3 | | 2020-01-03 17:08:04 | `{'hashtags': [{'text': 'HK', 'indices': [0, 3]...` | False | 1213145011774218240 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="https://www.bloglovin.com" rel="nofol...` | #HK, Two suspected #MERS #Coronavirus cases re... | False | `{'created_at': 'Wed Aug 06 15:51:01 -0400 2008...` | 15754217 | ironorehopper | 0.137 | 0.863 | 0.0 | -0.2263 |
| 4 | | 2020-01-04 08:23:03 | `{'hashtags': [{'text': 'Coronavirus', 'indices...` | False | 1213375273590251520 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://twitter.com/download/android" ...` | Suspicion de Syndrome Respiratoire du Moyen-Or... | False | `{'created_at': 'Sun May 01 11:50:43 -0400 2016...` | 726801029963108352 | Azeria64Azeria | 0.224 | 0.776 | 0.0 | -0.3818 |
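
The way the score dictionaries become columns can be seen on a tiny example. The function `fakePolarityScores` below is a hypothetical stand-in for `vader.polarity_scores` so the sketch runs without the vader lexicon; it returns the same four keys but placeholder numbers. Building the columns with `pd.DataFrame(sentiment.tolist())` is an alternative to `sentiment.apply(pd.Series)` that is usually faster on large data sets, offered here as an option rather than as the notebook's method.

```python
import pandas as pd

def fakePolarityScores(text):
    # Hypothetical stand-in for vader.polarity_scores: it returns the same
    # four keys, but the numbers are placeholders rather than real scores
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

cdf = pd.DataFrame({"text": ["first tweet", "second tweet"]})

# Each row of `sentiment` holds one dictionary of scores
sentiment = cdf["text"].apply(fakePolarityScores)

# Turn the dictionaries into four columns and attach them side by side
scores = pd.DataFrame(sentiment.tolist(), index=cdf.index)
cdf = pd.concat([cdf, scores], axis=1)

print(cdf.columns.tolist())
```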


#### Get dates

Now we need to get the dates for the tweets so we can organize the tweets chronologically. 

We start by sorting the tweets in order based on the dates in our "created_at" column in our data frame. If your dates are labeled by another name then change "created_at" to the name of the key (.json) or column header (.csv) containing your dates in your original dataset.

Then we have an "if...else" statement that removes the timezone info if our data was in ".csv" format, since this is not done automatically as it is for the ".json" format. The code does this by dropping the last six characters of each date, which are presumably a UTC offset such as "+00:00". Having the timezone info causes issues with the graphical output later, and removing it does not affect the results.

Next we convert the dates in the "created_at" column to pandas datetime values using the `to_datetime` function from the pandas package. These are displayed in a YYYY-MM-DD HH:MM:SS format.

Then we remove rows that do not contain a date as we have no way of knowing where to group the tweet. This means we will only get the average sentiment for tweets with dates.

Now we make the column containing our dates the index column as this is necessary for getting the mean for the dates.

Next we change the name of the index so we don't have two columns with the same name as this would cause errors later.

Lastly, we take a look at the first five rows to see our results.

In [12]:
cdf.sort_values(by="created_at", inplace=True)

if fileType == ".csv":
    cdf["created_at"] = cdf["created_at"].astype(str).str[:-6]
else:
    None

cdf["created_at"] = pd.to_datetime(cdf["created_at"])
cdf = cdf.dropna(subset=["created_at"])
cdf.index = cdf["created_at"]
cdf.index.names = ["dateTime"]
cdf.head(5)

| dateTime | coordinates | created_at | entities | favorited | id_str | in_reply_to_screen_name | in_reply_to_status_id_str | otherfields | place | retweet_count | ... | source | text | truncated | user | user_id_str | user_screen_name | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 11:10:10 | | 2020-01-01 11:10:10 | `{'hashtags': [{'text': 'coronavirus', 'indices...` | False | 1212330167269515264 | | | `{'display_text_range': '[0,140]', 'favorite_co...` | | 1 | ... | `<a href="http://twitter.com/download/android" ...` | Simply...si no és #coronavirus, NO hi ha #SARS... | True | `{'created_at': 'Mon Aug 29 16:48:30 -0400 2011...` | 364476368 | XavierAbadMdG | 0.165 | 0.835 | 0.0 | -0.6289 |
| 2020-01-01 12:08:45 | | 2020-01-01 12:08:45 | `{'hashtags': [{'text': 'coronavirus', 'indices...` | False | 1212344909665046528 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://twitter.com/#!/download/ipad" ...` | RT @XavierAbadMdG: Simply...si no és #coronavi... | False | `{'created_at': 'Sat Nov 24 15:43:45 -0500 2012...` | 968761746 | RossellRos | 0.204 | 0.796 | 0.0 | -0.6289 |
| 2020-01-03 09:42:22 | | 2020-01-03 09:42:22 | `{'hashtags': [{'text': 'Patients', 'indices': ...` | False | 1213032848086773760 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://www.agenparl.com/" rel="nofoll...` | Update on cluster of #Patients infected with #... | True | `{'created_at': 'Mon Jul 19 04:38:21 -0400 2010...` | 168425731 | Agenparl | 0.158 | 0.842 | 0.0 | -0.4939 |
| 2020-01-03 17:08:04 | | 2020-01-03 17:08:04 | `{'hashtags': [{'text': 'HK', 'indices': [0, 3]...` | False | 1213145011774218240 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="https://www.bloglovin.com" rel="nofol...` | #HK, Two suspected #MERS #Coronavirus cases re... | False | `{'created_at': 'Wed Aug 06 15:51:01 -0400 2008...` | 15754217 | ironorehopper | 0.137 | 0.863 | 0.0 | -0.2263 |
| 2020-01-04 08:23:03 | | 2020-01-04 08:23:03 | `{'hashtags': [{'text': 'Coronavirus', 'indices...` | False | 1213375273590251520 | | | `{'favorite_count': '0', 'filter_level': 'low',...` | | 0 | ... | `<a href="http://twitter.com/download/android" ...` | Suspicion de Syndrome Respiratoire du Moyen-Or... | False | `{'created_at': 'Sun May 01 11:50:43 -0400 2016...` | 726801029963108352 | Azeria64Azeria | 0.224 | 0.776 | 0.0 | -0.3818 |
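
The date handling steps can be followed on a few made-up rows. In this sketch the raw dates carry a "+00:00" offset (an assumption about what the timezone info looks like), one row has no usable date, and `errors="coerce"` is an optional setting, not used in the cell above, that turns unparseable values into missing dates instead of raising an error.

```python
import pandas as pd

cdf = pd.DataFrame({
    "created_at": ["2020-01-03 09:42:22+00:00", "not a date", "2020-01-01 11:10:10+00:00"],
    "text": ["c", "b", "a"],
})

# Drop the last six characters, assumed here to be the "+00:00" offset
cdf["created_at"] = cdf["created_at"].astype(str).str[:-6]

# errors="coerce" turns unparseable values into NaT instead of raising
cdf["created_at"] = pd.to_datetime(cdf["created_at"], errors="coerce")

# Drop rows without a usable date, then sort chronologically
cdf = cdf.dropna(subset=["created_at"]).sort_values(by="created_at")

# Use the dates as the index under a different name than the column
cdf.index = cdf["created_at"]
cdf.index.names = ["dateTime"]

print(cdf)
```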


### Create two separate data frames and export data
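Now we get the mean for the sentiment compound score for each unit of time we chose in `timeLength` (each day by default) and assign the results to a new data frame called "meanDF". Any period with no tweets has no mean, so we fill it in with 0.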

Then we name the two columns in the new data frame. The names are "date" and "mean".

Next we export the data frame as a ".csv" file. This ".csv" will be used in the "vaderAverageOverTimeResults" notebook to create the interactive line graph displaying our results.

Then we look at the first five rows of the data frame to make sure it worked.

In [13]:
meanDF = cdf.groupby(pd.Grouper(freq = timeLength))["compound"].mean().fillna(0).sort_index().reset_index()
meanDF.columns = ["date", "mean"]
meanDF.to_csv(os.path.join(dataClean, "vaderAvg.csv"))
meanDF.head(5)

| | date | mean |
|---|---|---|
| 0 | 2020-01-01 | -0.6289 |
| 1 | 2020-01-02 | 0.0 |
| 2 | 2020-01-03 | -0.3601 |
| 3 | 2020-01-04 | -0.278133 |
| 4 | 2020-01-05 | -0.8885 |
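
The grouping step can be reproduced on three made-up compound scores to show what happens on a day with no tweets. The numbers below are invented for the sketch.

```python
import pandas as pd

# Made-up compound scores indexed by tweet time
scores = pd.DataFrame(
    {"compound": [-0.6, -0.2, 0.4]},
    index=pd.to_datetime(["2020-01-01 11:10", "2020-01-01 12:08", "2020-01-03 09:42"]),
)

# "D" makes one group per calendar day, including days with no tweets
dailyMean = scores.groupby(pd.Grouper(freq="D"))["compound"].mean()

# Empty days have no mean (NaN); fillna(0) replaces that with 0
dailyMeanFilled = dailyMean.fillna(0).reset_index()
dailyMeanFilled.columns = ["date", "mean"]

# count() reports 0 for the same empty days
dailyCount = scores.groupby(pd.Grouper(freq="D"))["compound"].count()

print(dailyMeanFilled)
print(dailyCount)
```

One thing to keep in mind when reading the results: a mean of 0 can come either from a day with no tweets or from a day whose tweets average out to neutral, so the tweet counts are what tell the two cases apart.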


Here we get the number of tweets per date and assign the results to a new data frame called "countDF".

Then we name the two columns in the new data frame. The names are "date" and "count".

Next we export the data frame as a ".csv" file. This ".csv" will be used in the "vaderAverageOverTimeResults" notebook to create the interactive line graph displaying our results.

Then we look at the first five rows of the data frame to make sure it worked.

In [14]:
countDF = cdf.groupby(pd.Grouper(freq = timeLength))["compound"].count().sort_index().reset_index()
countDF.columns = ["date", "count"]
countDF.to_csv(os.path.join(dataClean, "tweetCount.csv"))
countDF.head(5)

| | date | count |
|---|---|---|
| 0 | 2020-01-01 | 2 |
| 1 | 2020-01-02 | 0 |
| 2 | 2020-01-03 | 2 |
| 3 | 2020-01-04 | 3 |
| 4 | 2020-01-05 | 1 |
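
By default `to_csv` also writes the row numbers as an extra, unnamed first column. Whether the "vaderAverageOverTimeResults" notebook expects that column is not stated here, so the following is only a sketch of the difference between the default and `index=False`. It writes to in-memory buffers instead of real files, and the data is made up.

```python
import io
import pandas as pd

countDF = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "count": [2, 0]})

# Default: the row index is written as an unnamed first column
withIndex = io.StringIO()
countDF.to_csv(withIndex)
withIndex.seek(0)
colsWithIndex = pd.read_csv(withIndex).columns.tolist()

# index=False leaves the row index out of the file
noIndex = io.StringIO()
countDF.to_csv(noIndex, index=False)
noIndex.seek(0)
colsNoIndex = pd.read_csv(noIndex).columns.tolist()

print(colsWithIndex)
print(colsNoIndex)
```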


## VOILA!!

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.