# Sentiment Analysis

We anticipate that, despite its known shortcomings, sentiment analysis could provide an interesting view into our conversation data. This notebook takes the previously generated random samples of Twitter and Reddit conversations and adds sentiment scores to them using [VADER](https://github.com/cjhutto/vaderSentiment). See Melanie Walsh's description of VADER in [this Jupyter notebook](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/04-Sentiment-Analysis.html) for more context on the approach taken here.

In [9]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
vader.polarity_scores("This is a terrible mistake.")

{'neg': 0.647, 'neu': 0.353, 'pos': 0.0, 'compound': -0.6705}

As Walsh suggests, we will use the `compound` value as a single aggregate measure in place of the separate `neg`, `neu` and `pos` scores.

The sampled Twitter and Reddit conversations are stored in one zip file per original dataset. For the "racism" tweets, for example, there is a zip file containing 30 CSV files, each holding the metadata for the replies in one thread found in the Twitter search results for "racism". Similarly, for the "racism" Reddit posts there is a zip file containing 30 CSV files, each holding the metadata for the comments in one thread found in the Reddit search results for "racism".
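
When reading the `compound` scores, it can help to bucket them into coarse labels. The sketch below uses the ±0.05 cutoffs suggested in the VADER README. The function name `label_from_compound` and its `threshold` parameter are illustrative assumptions and not part of VADER; this notebook itself stores only the raw `compound` value.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a coarse label.

    The 0.05 default follows the cutoffs suggested in the VADER README;
    treat it as a convention that may need tuning for your data.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The compound score from the example sentence above.
print(label_from_compound(-0.6705))  # prints "negative"
```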

This notebook unpacks those zip files, adds the sentiment scores, and then zips them back up again.

## Twitter

While working with the zip files, it is easiest to change directory to where they are stored.

We can then write a function that takes a zip file and the name of the column holding the text, reads each CSV in the zip file, writes the CSV back out with a new *sentiment* column, and then zips everything up again.

In [33]:
import os
import sh
import pandas

from pathlib import Path

tweets_dir = Path('/home/ubuntu/jupyter/data/tweets.pull')
os.chdir(tweets_dir)

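def process(zip_path, text_col):
    # Unpack into the current directory; the archive is expected to
    # contain a folder named after the zip file (foo_30.zip -> foo_30/).
    sh.unzip('-u', zip_path)
    sample_dir = Path(zip_path.stem)
    for csv_file in sample_dir.glob('*.csv'):
        print(csv_file)
        df = pandas.read_csv(csv_file)
        # Keep only VADER's aggregate compound score for each text.
        df['sentiment'] = df[text_col].apply(lambda t: vader.polarity_scores(t)['compound'])
        df.to_csv(csv_file, index=False)
    # -r makes zip recurse into the directory so the rewritten CSVs
    # (not just the directory entry) are stored in the archive.
    sh.zip('-r', zip_path.name, sample_dir.name)
    sh.rm('-rf', sample_dir)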
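
The function above relies on the `unzip`, `zip` and `rm` command line tools through `sh`. As a rough alternative for environments where those are not available, here is a minimal sketch of the same unpack, score, repack idea using only the Python standard library. The name `add_scores_to_zip` and the `score` callable are assumptions made for this illustration, and it reads CSVs with the `csv` module instead of pandas. Note that it imports the standard `zipfile` module, which would clash with the loop variable named `zipfile` used elsewhere in this notebook.

```python
import csv
import io
import zipfile
from pathlib import Path
from typing import Callable


def add_scores_to_zip(zip_path: Path, text_col: str,
                      score: Callable[[str], float]) -> None:
    """Rewrite every CSV in zip_path with an extra 'sentiment' column."""
    # Write to a temporary archive first, then swap it into place.
    tmp_path = zip_path.with_name(zip_path.name + ".tmp")
    with zipfile.ZipFile(zip_path) as src, \
            zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".csv"):
                text = io.StringIO(data.decode("utf-8"), newline="")
                reader = csv.DictReader(text)
                fieldnames = list(reader.fieldnames or [])
                if "sentiment" not in fieldnames:
                    fieldnames.append("sentiment")
                out = io.StringIO(newline="")
                writer = csv.DictWriter(out, fieldnames=fieldnames)
                writer.writeheader()
                for row in reader:
                    # Missing cells are scored as the empty string.
                    row["sentiment"] = score(row[text_col] or "")
                    writer.writerow(row)
                data = out.getvalue().encode("utf-8")
            # Non-CSV entries are copied through unchanged.
            dst.writestr(info.filename, data)
    tmp_path.replace(zip_path)


# Hypothetical usage with the analyzer created earlier in this notebook
# (the file name is a placeholder):
# add_scores_to_zip(Path("some_sample_30.zip"), "text",
#                   lambda t: vader.polarity_scores(t)["compound"])
```

Passing the scorer in as a callable keeps the archive handling separate from VADER, which also makes the function easy to try out with a dummy scorer.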

In [None]:
for zipfile in tweets_dir.glob('*_30.zip'):
    print(zipfile)
    process(zipfile, 'text')
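
One thing to watch for when running this over real data: pandas reads empty CSV cells as `NaN`, which is a float and not a string, and passing that to `polarity_scores` would likely raise an error. The sketch below shows one way to guard against it. The helper name `safe_compound` and its `score` argument are assumptions for illustration and are not used by the `process` function above.

```python
from typing import Callable, Optional


def safe_compound(text: object, score: Callable[[str], float]) -> Optional[float]:
    """Return score(text) for non-blank strings, otherwise None."""
    # NaN, None, numbers and blank strings are all left unscored.
    if not isinstance(text, str) or not text.strip():
        return None
    return score(text)


# Hypothetical use inside the apply call, assuming the notebook's `vader`:
# df['sentiment'] = df[text_col].apply(
#     lambda t: safe_compound(t, lambda s: vader.polarity_scores(s)['compound']))
```

Returning `None` for missing text, instead of a score of 0, should leave those cells empty in the written CSV so they are not mistaken for neutral replies.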

## Reddit

In [35]:
reddit_dir = Path('/home/ubuntu/jupyter/data/reddit.pull')
os.chdir(reddit_dir)

for zipfile in reddit_dir.glob('*_30.zip'):
    print(zipfile)
    process(zipfile, 'body')

/home/ubuntu/jupyter/data/reddit.pull/reddit_black_people_convs_30.zip
reddit_black_people_convs_30/lspdve.csv
reddit_black_people_convs_30/hc1qww.csv
reddit_black_people_convs_30/irr3vx.csv
reddit_black_people_convs_30/it2tgn.csv
reddit_black_people_convs_30/iu73ac.csv
reddit_black_people_convs_30/i6et5p.csv
reddit_black_people_convs_30/kumrlb.csv
reddit_black_people_convs_30/i4rsdb.csv
reddit_black_people_convs_30/huknvk.csv
reddit_black_people_convs_30/ihixet.csv
reddit_black_people_convs_30/gye49e.csv
reddit_black_people_convs_30/hqqpr8.csv
reddit_black_people_convs_30/n5c6rl.csv
reddit_black_people_convs_30/hd4266.csv
reddit_black_people_convs_30/gwfghb.csv
reddit_black_people_convs_30/ir7jei.csv
reddit_black_people_convs_30/lppo6o.csv
reddit_black_people_convs_30/gw55oe.csv
reddit_black_people_convs_30/ifn3o9.csv
reddit_black_people_convs_30/i4tskq.csv
reddit_black_people_convs_30/n0t5jw.csv
reddit_black_people_convs_30/ifa34n.csv
reddit_black_people_convs_30/iryadd.csv
reddit_bl