<a href="https://colab.research.google.com/github/barbara-balcon/Reddit-Sentiment-Analysis/blob/main/Dataframe_Reddit_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Obtaining Reddit data: accessing the Reddit API
The first preliminary step to performing any kind of sentiment analysis on Reddit data is establishing a **Reddit instance** via **API**. This can be done with the **Python Reddit API Wrapper (PRAW)**, see documentation: https://praw.readthedocs.io/en/latest/

In order to access the API, a **Reddit account** is needed, login details below.
> E-mail: barbara.balcon@studbocconi.it
Username: thesis_3078976
Password: class_of_2021

I created an **application** at the following link: https://www.reddit.com/prefs/apps
> - Name: Sentiment Analysis
I selected 'script' (Script for personal use. Will only have access to the developers accounts
description)
- Description: Data will be used to perform sentiment analysis
- About url: blank
- Redirect url: http://www.example.com/unused/redirect/url (this is empty)

After creating the app, the following is generated.
> ID: L_wWPUNiMBmn1Q  
Secret: bUYKyV21W3xXK4GzXsY5tsNzpG3pRw



In [None]:
#installing packages we will later import and use
!pip install praw
!pip install mysql.connector
!pip install vaderSentiment
!pip install unidecode

In [25]:
#import of the relevant libraries 
import praw
import pandas as pd
from praw.models import MoreComments

In [26]:
#create a reddit connection with reddit api details
reddit=praw.Reddit(client_id='L_wWPUNiMBmn1Q', client_secret='bUYKyV21W3xXK4GzXsY5tsNzpG3pRw', user_agent='ua')

The following pulls the **five hottest posts** in the subreddit wallstreetbets, printing the title and ID of each. This step is for illustrating purposes only and does *not* feed into the sentiment analysis performed later.

In [27]:
subreddit=reddit.subreddit('wallstreetbets')
for submission in subreddit.hot(limit=5):
    print(submission.title)
    print('Submission ID = ', submission.id, '\n')

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



What Are Your Moves Tomorrow, April 06, 2021
Submission ID =  mktg0s 

WSB Rules - Please Read Before Posting
Submission ID =  mkkfno 

I have come to the realization that a stock with this amount of volatility that Game stop has does not give a FLYING F*CK (smooth brains - the star represents a U) about Technical Analysis!
Submission ID =  ml0bsy 

26k-> 99k in a day . Mainly thanks to $SPY yolo 😀
Submission ID =  mktvom 

I built a tool for us to track US Representatives Stock Trades
Submission ID =  mkpdr9 



# Alternative storage of the data: creating a dataframe


As a more readily accessible alternative to a SQL database, in this version of the code we are creating a Pandas dataframe. 


A dataframe can be thought of as table: each column is a characteristic of a post in r/wallstreetbets e.g. author, and each row stores a post.


The database has defined columns, but it is empty for now, will be populated in subsequent steps.

In [28]:
import pandas as pd
import datetime 
import time #packages for handling date and time data

 
df = pd.DataFrame(columns=['Current time','Subreddit','Author','Title','Body','Sentiment'])
#creation of a database for storing reddit data 

print(df)
#creating variables we are about to create instances of when analysing sentiment
 

Empty DataFrame
Columns: [Current time, Subreddit, Author, Title, Body, Sentiment]
Index: []


# Performing post sentiment analysis and storing the results

## Introduction to VADER
Now we are ready to **live stream** comments from Reddit and perform sentiment analysis via **VADER** (Valence Aware Dictionary and sEntiment Reasoner). It is an open-source tool that was designed for social media specifically. It is lexicon and rule-based. 
See: 
> Hutto, C.J. & Gilbert, E.E. (2014). *VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.* Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

It returns a **polarity score (-1, +1)** for each post: a normalized, weighted composite score that acts as a metric of the overall sentiment of a given post.  

## Storing the data in the database
In the *while* loop, we are **populating the table** in the SQL database we created earlier with data streamed from the Reddit API. For each post, we include:
- current date and time;
- subreddit (wallstreetbets);
- author;
- title;
- body;
- compound sentiment score

## A practical note
Since this is live streaming, the code cell below will keep running indefinitely: just use the 'stop' button to interrupt it.This might return:

> KeyboardInterrupt                         Traceback (most recent call last)

It is not an actual error, it just indicates that the run was manually interrupted. Also, please ignore the message calling for use of anyncronous PRAW,  the current one works just fine.

In [29]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer= SentimentIntensityAnalyzer() #just an abbreviation 
from unidecode import unidecode #a package for dealing with Unicode


columns=list(df)
data=[]
body=''
while True:
    try:
        
        subreddit = reddit.subreddit('wallstreetbets')
        for comment in subreddit.stream.comments(skip_existing=True):

                vs = analyzer.polarity_scores(unidecode(body))
                sentiment = vs['compound'] #we are interested in the compound score
                
                if len(str(comment.body)) < 2000:
                    body = str(comment.body)
                elif len(str(comment.body)) > 2000:
                    body = "data is too large"
                
                values = [datetime.datetime.now(), str(comment.subreddit),
                          str(comment.author), str(comment.link_title), body, sentiment]
                
               
                zipped=zip(columns, values)
                a_dict=dict(zipped)
                data.append(a_dict)


                df=df.append(data, True)
    except Exception as e:
        print(str(e))
        time.sleep(10)
'''We keep an exception so that in case of error we do not hit
the API multiple times''' 

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readt

KeyboardInterrupt: ignored

In [None]:
print(df) # prints the full content of the dataframe


In [30]:
df.head() # returns the first 5 rows of the dataframe (recommended)

Unnamed: 0,Current time,Subreddit,Author,Title,Body,Sentiment
0,2021-04-06 08:50:07.441350,wallstreetbets,mamadidntraiseabitch,"What Are Your Moves Tomorrow, April 06, 2021",What happened during that one year?,0.0
1,2021-04-06 08:50:07.441350,wallstreetbets,mamadidntraiseabitch,"What Are Your Moves Tomorrow, April 06, 2021",What happened during that one year?,0.0
2,2021-04-06 08:50:11.830204,wallstreetbets,VisualMod,If only I’d have known...,I am a bot. You submitted a picture of a banne...,0.0


In [None]:
df.to_csv('df.csv')# saves the file as a csv in this cloud environment
from google.colab import files
files.download('df.csv') # downloads the file to the local machine



To **open the csv file** created: click on the folder icon in the top left of the screen, then double-click on df.csv: it will open in the current window.

Please, note: this is just a version of the same code as in the other notebook  that uses a **pandas dataframe of a SQL database** to store data.

It allows to skip the download, installation and setup of SQL software and can be fully executed and reproduced in this cloud environment.


Nevertheless, a dataframe cannot fully replace a relational database: for this reason, I would keep the latter for the actual development of the project, and use the dataframe for *illustrating* purposes only.