# Getting Data from the Reddit API

### Sources used:
[PRAW - Python Reddit API Wrapper](https://www.geeksforgeeks.org/python-praw-python-reddit-api-wrapper/)<br>
[How to Use the Reddit API in Python](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c)<br>
[Scraping Reddit Data](https://towardsdatascience.com/scraping-reddit-data-1c0af3040768)<br>
[NER for Extracting Stock Mentions on Reddit](https://towardsdatascience.com/ner-for-extracting-stock-mentions-on-reddit-aa604e577be)


#1. Setup
Download the required modules: 
1. PRAW - to interact with the Reddit API
1. spaCy - for natural language processing
  1. Download the spaCy `en_core_web_sm` model

In [None]:
!pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting praw
  Downloading praw-7.6.0-py3-none-any.whl (188 kB)
     |████████████████████████████████| 188 kB 15.6 MB/s 
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.4.1-py3-none-any.whl (55 kB)
     |████████████████████████████████| 55 kB 3.5 MB/s 
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.6.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.4.1


In [None]:

!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!python -m spacy download en_core_web_sm

2022-10-14 05:47:54.275209: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 12.3 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


#2. Initializing the Program

Import all required libraries:

In [None]:
import requests
import praw
import spacy
import pandas as pd

To access the Reddit API, we need to authenticate with OAuth2.

To do this, we first create an app with our account through the [Reddit API](https://www.reddit.com/wiki/api/), then pass the app's credentials to PRAW to create a `reddit` instance.

In [None]:
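# placeholders: replace these with the credentials of your own Reddit app
client_id = 'YOUR_CLIENT_ID'
secret = 'YOUR_CLIENT_SECRET'
name = 'positech'  # user agent string that identifies this script
username = 'YOUR_REDDIT_USERNAME'
password = 'YOUR_REDDIT_PASSWORD'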

reddit = praw.Reddit(client_id = client_id,
                     client_secret = secret,
                     user_agent = name,
                     username = username,
                     password = password)

reddit.read_only = True
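
Credentials written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the values are stored in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and not part of PRAW:

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = {
    'client_id': 'REDDIT_CLIENT_ID',
    'client_secret': 'REDDIT_CLIENT_SECRET',
    'username': 'REDDIT_USERNAME',
    'password': 'REDDIT_PASSWORD',
}

def load_reddit_credentials(env=os.environ):
    """Return a dict of credentials read from `env`, or raise if any are missing."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in REQUIRED_VARS.items()}
```

The keys of the returned dict match the keyword arguments used in the `praw.Reddit(...)` call above, so it could be passed as `praw.Reddit(user_agent=name, **load_reddit_credentials())`.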

We then load the spaCy model that we will use to analyze the text:

In [None]:
nlp = spacy.load('en_core_web_sm')

Next, we define a function that uses the spaCy model to extract the organization (`ORG`) entities from the given text:

In [None]:
def get_orgs(post):
  """Return the unique ORG entities that the spaCy model finds in `post`."""
  doc = nlp(post)
  # a set drops repeated mentions of the same organization
  orgs = {entity.text for entity in doc.ents if entity.label_ == 'ORG'}
  return list(orgs)


In [None]:
test = '''
The Consumer Price Index for All Urban Consumers (CPI-U) rose 0.4 percent in September on a seasonally adjusted basis after rising 0.1 percent in August, the U.S. Bureau of Labor Statistics reported today. Over the last 12 months, the all items index increased 8.2 percent before seasonal adjustment.

Increases in the shelter, food, and medical care indexes were the largest of many contributors to the monthly seasonally adjusted all items increase. These increases were partly offset by a 4.9-percent decline in the gasoline index. The food index continued to rise, increasing 0.8 percent over the month as the food at home index rose 0.7 percent. The energy index fell 2.1 percent over the month as the gasoline index declined, but the natural gas and electricity indexes increased.

The index for all items less food and energy rose 0.6 percent in September, as it did in August. The indexes for shelter, medical care, motor vehicle insurance, new vehicles, household furnishings and operations, and education were among those that increased over the month. There were some indexes that declined in September, including those for used cars and trucks, apparel, and communication.

The all items index increased 8.2 percent for the 12 months ending September, a slightly smaller figure than the 8.3-percent increase for the period ending August. The all items less food and energy index rose 6.6 percent over the last 12 months. The energy index increased 19.8 percent for the 12 months ending September, a smaller increase than the 23.8-percent increase for the period ending August. The food index increased 11.2 percent over the last year.
'''

In [None]:
get_orgs(test)

['the U.S. Bureau of Labor Statistics']
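
On informal Reddit text the `ORG` label can be noisy: the dataframe built later in this notebook contains entries such as `+4.13%`, and both `Fed` and `FED`. A minimal sketch of one possible post-filter, assuming we only want entries with at least two letters and want to treat differently cased duplicates as the same name. `clean_orgs` is a hypothetical helper, not part of spaCy:

```python
import re

def clean_orgs(orgs):
    """Drop entries with fewer than two letters and merge case-insensitive duplicates."""
    cleaned = []
    seen = set()
    for org in orgs:
        name = ' '.join(org.split())  # collapse newlines and repeated spaces
        if len(re.findall(r'[A-Za-z]', name)) < 2:
            continue  # e.g. '+4.13%' contains no letters
        key = name.lower()
        if key not in seen:
            seen.add(key)
            cleaned.append(name)
    return cleaned
```

It could be chained with the function above as `clean_orgs(get_orgs(text))`. The two-letter threshold is an arbitrary choice and would still keep entries such as `High-Low %`.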

#3. Getting the Raw Data
From our `reddit` instance, we create a `subreddit` instance that combines three subreddits: [r/stocks](https://www.reddit.com/r/stocks/), [r/wallstreetbets](https://www.reddit.com/r/wallstreetbets/) and [r/investing](https://www.reddit.com/r/investing/).

From the `subreddit` instance, we request the 1,000 most recent posts.*

`posts` is a lazy `ListingGenerator`, and every element it yields is a `Submission` instance.

In [None]:
subreddit = reddit.subreddit('stocks+WallStreetBets+Investing')

posts = subreddit.new(limit=1000)

posts

<praw.models.listing.generator.ListingGenerator at 0x7fd0354ecb90>

Keywords to filter the data:*

In [None]:
keywords = ['AAPL', 'Apple', 'GOOGL', 'GOOG', 'Google', 'META', 'IBM', 'AMZN', 'Amazon']
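
The list is not applied in the cells that follow. As a sketch of how such a filter might look, the function below matches each keyword as a whole word, ignoring case. `find_keywords` is a hypothetical helper and the sample sentence is made up for illustration:

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that appear in `text` as whole words (case-insensitive)."""
    found = []
    for kw in keywords:
        # \b keeps 'GOOG' from matching inside the longer token 'GOOGL'
        if re.search(r'\b' + re.escape(kw) + r'\b', text, flags=re.IGNORECASE):
            found.append(kw)
    return found

sample = 'I bought GOOGL because google search is everywhere'
print(find_keywords(sample, ['GOOGL', 'GOOG', 'Google', 'IBM']))
```

Ignoring case is a trade-off: it catches `google` but would also match an ordinary word such as "meta" for the `META` ticker, so a stricter variant may be preferable for tickers.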

#4. Creating the DataFrame for Analysis
Create a `pandas` dataframe that holds the text of each post, the entities found in the post (the `keywords` column), the date the post was created and the link to the post.

In [None]:
column_names = ['post', 'keywords', 'date', 'link']

df = pd.DataFrame(columns=column_names)


For every `submission`, get the relevant data and append it to the dataframe as a new row:

In [None]:
for p in posts:
    orgs = get_orgs(p.selftext)
    row = [p.selftext, orgs, p.created_utc, p.url]
    df.loc[len(df.index)] = row



df


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.


| | post | keywords | date | link |
|---|---|---|---|---|
| 0 | | [] | 1.665731e+09 | https://i.redd.it/uje2vtztypt91.jpg |
| 1 | Did you guys hear about this new hedge fund ow... | [] | 1.665730e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 2 | | [] | 1.665730e+09 | https://i.redd.it/3bhyq14uvpt91.jpg |
| 3 | | [] | 1.665729e+09 | https://i.redd.it/0ml82g9bvpt91.jpg |
| 4 | Im new to stocks (like everyone bud) but i got... | [] | 1.665729e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| ... | ... | ... | ... | ... |
| 995 | | [] | 1.665438e+09 | https://i.redd.it/wbzpw6jpt1t91.jpg |
| 996 | Thought we’d see a new low, and it started to ... | [] | 1.665438e+09 | https://www.reddit.com/gallery/y0qywz |
| 997 | | [] | 1.665438e+09 | https://i.redd.it/mf9ywqggs1t91.png |
| 998 | | [] | 1.665438e+09 | https://v.redd.it/mukav5lpr1t91 |
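
The `date` column holds `created_utc`, the creation time of the post as Unix epoch seconds, which is hard to read in the table. A minimal sketch of converting one value with the standard library. `epoch_to_iso` is a hypothetical helper:

```python
from datetime import datetime, timezone

def epoch_to_iso(created_utc):
    """Convert Unix epoch seconds (as stored in `created_utc`) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()

print(epoch_to_iso(1665731000))  # 2022-10-14T07:03:20+00:00
```

With pandas, the whole column could probably be converted in one step with `pd.to_datetime(df['date'], unit='s', utc=True)`.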


Clean up the dataframe so that it only includes rows that have text:

In [None]:
df = df[df.post != '']
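
The comparison above keeps posts whose body is only whitespace. As a sketch of a slightly stricter check, the function below also drops the `[removed]` and `[deleted]` placeholders that Reddit may show in place of a post body. Whether such rows occur in this dataset is an assumption, and `has_text` is a hypothetical helper:

```python
# Placeholder bodies that, as an assumption, we treat as "no text"
PLACEHOLDERS = {'[removed]', '[deleted]'}

def has_text(selftext):
    """Return True if the post body contains real text."""
    stripped = selftext.strip()
    return bool(stripped) and stripped not in PLACEHOLDERS
```

It could be applied to the dataframe as `df = df[df.post.apply(has_text)]`.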

In [None]:
df

| | post | keywords | date | link |
|---|---|---|---|---|
| 1 | Did you guys hear about this new hedge fund ow... | [] | 1.665730e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 4 | Im new to stocks (like everyone bud) but i got... | [] | 1.665729e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 6 | Hi, it seems a lot of people are shocked by SP... | [YahooFinance, SPY, High-Low %, +4.13%, +6.89%] | 1.665729e+09 | https://www.reddit.com/r/stocks/comments/y3m5e... |
| 9 | Can anyone explain what is going on with the o... | [] | 1.665728e+09 | https://www.reddit.com/r/stocks/comments/y3luh... |
| 10 | \n\nTHE CONCEPT\n\n(October 2022)\n\nSuperfic... | [Fed] | 1.665727e+09 | https://www.reddit.com/r/stocks/comments/y3ljg... |
| ... | ... | ... | ... | ... |
| 988 | So this morining at market open, I realized th... | [] | 1.665441e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 989 | A common piece of advice on this subreddit at ... | [Fed, S&P] | 1.665441e+09 | https://www.reddit.com/r/stocks/comments/y0s4j... |
| 992 | 50% of my portfolio is invested into index fun... | [YOY, Fundies, Home, ROE, EPS\n\n&#] | 1.665440e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 994 | [https://www.cnbc.com/2022/10/10/arks-cathie-w... | [FED, Tech] | 1.665438e+09 | https://www.reddit.com/r/wallstreetbets/commen... |


In [None]:
df = df.reset_index(drop=True)
df

| | post | keywords | date | link |
|---|---|---|---|---|
| 0 | Did you guys hear about this new hedge fund ow... | [] | 1.665730e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 1 | Im new to stocks (like everyone bud) but i got... | [] | 1.665729e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 2 | Hi, it seems a lot of people are shocked by SP... | [YahooFinance, SPY, High-Low %, +4.13%, +6.89%] | 1.665729e+09 | https://www.reddit.com/r/stocks/comments/y3m5e... |
| 3 | Can anyone explain what is going on with the o... | [] | 1.665728e+09 | https://www.reddit.com/r/stocks/comments/y3luh... |
| 4 | \n\nTHE CONCEPT\n\n(October 2022)\n\nSuperfic... | [Fed] | 1.665727e+09 | https://www.reddit.com/r/stocks/comments/y3ljg... |
| ... | ... | ... | ... | ... |
| 510 | So this morining at market open, I realized th... | [] | 1.665441e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 511 | A common piece of advice on this subreddit at ... | [Fed, S&P] | 1.665441e+09 | https://www.reddit.com/r/stocks/comments/y0s4j... |
| 512 | 50% of my portfolio is invested into index fun... | [YOY, Fundies, Home, ROE, EPS\n\n&#] | 1.665440e+09 | https://www.reddit.com/r/wallstreetbets/commen... |
| 513 | [https://www.cnbc.com/2022/10/10/arks-cathie-w... | [FED, Tech] | 1.665438e+09 | https://www.reddit.com/r/wallstreetbets/commen... |


Export the dataframe to a CSV file and download it to the local machine:



In [None]:
from google.colab import files
df.to_csv('reddit.csv') 
files.download('reddit.csv')


#5. Conclusion
We managed to create a dataframe that can be used to analyze the current state of the stock market. This data extraction has some limitations and room for improvement:


*   Initially, the idea was to extract Reddit posts by time. I used `subreddit.top(time_filter='week')`, but this did not return enough posts for the dataframe because, as seen above, the majority of the posts do not have any text.
*   I used `subreddit.new(limit=1000)` to get the 1,000 most recent posts from the given subreddits. The initial plan was to filter the posts by keyword (*using the keyword list given above), but the filtered dataframe was too small, so I have left the entire dataframe as is.
*   When I tried to extract more than a thousand posts, I kept getting a server-side error.
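
For the first limitation, one option is to keep fetching new posts as above and apply the time window locally on `created_utc`. A sketch under that assumption. `within_last_days` is a hypothetical helper and the sample timestamps are made up for illustration:

```python
import time

SECONDS_PER_DAY = 86400

def within_last_days(created_utc, days, now=None):
    """Return True if the epoch timestamp falls within the last `days` days."""
    if now is None:
        now = time.time()
    return 0 <= now - created_utc <= days * SECONDS_PER_DAY

# Made-up timestamps, for illustration only
now = 1665731000
samples = [1665730000, 1665438000, 1660000000]
recent = [ts for ts in samples if within_last_days(ts, days=7, now=now)]
print(recent)
```

This does not increase the number of posts available; it only makes the time window explicit. On the dataframe it could be applied as `df[df.date.apply(lambda ts: within_last_days(ts, 7))]`.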






