# SI699 :: Seminar :: Book 1 of 3 :: Using PRAW to Extract from R/UofM

<b>Setup</b>
- Get a Reddit account and generate API keys (See "References" #1)
- Download PRAW via conda, pip, etc.

To get credentials, go <a href="https://www.reddit.com/prefs/apps">here</a> while logged into your Reddit account and click "create application" to get set up. The name and description are not particularly important.

# Tutorial Roadmap

<b>Acquisition (Part 1 of 3 :: PRAW // Data Gathering)</b>
- We're going to use the PRAW library with API keys to extract data from the Reddit platform
- PRAW is an unofficial library, but it is quite popular and is built to follow Reddit's API rules so that you stay compliant with the platform's terms of service
- You instantiate a "Reddit" instance with your keys
- You connect to a specified subreddit
- You sample data based on some ordering ("hot"/"new"/etc.)
- Divide it up to get a chunk or chunks you'll annotate (We'll get to that later)
- Save to external files, database, etc.

<b>Preparation (Part 2 of 3 :: Data Annotation)</b>
- Import the textual data we want to make into training data
- Provide annotations so the model can "learn"

<b>Execution (Part 3 of 3 :: Named Entity Recognition Training)</b>
- Use our labelled data to train an NER model in spaCy and observe our results

***

Credentials:
- `client_id` is going to show up right under the specified application name
- `client_secret` is going to show up in the "secret" field
- `user_agent` can be pretty much anything, really
- Other fields are not necessary since we only need read-only access to grab the data

***

References:

<a href="https://github.com/reddit-archive/reddit/wiki/OAuth2#getting-started">Getting Started (Authentication)</a>

<a href="https://github.com/praw-dev/praw">Github :: Praw</a> ++ <a href="https://praw.readthedocs.io/en/latest/">Praw Documentation</a>

***

### Setup Connection to Reddit

In [None]:
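# Keys.txt is expected to hold three lines of the form "label,value",
# in this order: client ID, client secret, user agent.
with open("Keys.txt") as file:
    Client = file.readline().split(',')[1].strip()
    Secret = file.readline().split(',')[1].strip()
    Agent = file.readline().split(',')[1].strip()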

FileNotFoundError: ignored
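
The cell above assumes `Keys.txt` holds three lines of the form `label,value`, in the order client ID, secret, user agent. A slightly more defensive way to read such a file might look like the sketch below. The function name `load_keys`, the label names, and the placeholder values are all hypothetical and are not part of PRAW. Unlike the cell above, this version splits on the first comma only, so a value may itself contain commas.

```python
import os
import tempfile


def load_keys(path):
    """Read `label,value` lines and return the values in file order."""
    values = []
    with open(path) as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first comma only, so values may contain commas
            label, sep, value = line.partition(",")
            if not sep:
                raise ValueError(f"Expected 'label,value' but got: {line!r}")
            values.append(value.strip())
    if len(values) != 3:
        raise ValueError(f"Expected 3 credentials, found {len(values)}")
    return values


# Demonstration with a throwaway file holding placeholder (not real) credentials
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "Keys.txt")
    with open(demo_path, "w") as file:
        file.write("client_id,abc123\n")
        file.write("client_secret,xyz789\n")
        file.write("user_agent,si699 demo script\n")
    Client, Secret, Agent = load_keys(demo_path)

print(Client, Secret, Agent)
```

With a real file you would call `load_keys("Keys.txt")` and get a readable error message if a line is malformed or missing.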

In [None]:
import praw
reddit = praw.Reddit(client_id=Client, client_secret=Secret, user_agent=Agent)
reddit.read_only

True

### Select a Subreddit
In this case, let's use <a href="https://www.reddit.com/r/uofm/new/">R/UofM</a> because, yeah, Go Blue

Instantiate a subreddit object 

In [None]:
subreddit = reddit.subreddit('uofm')
print(subreddit.display_name, subreddit.title)
# print(subreddit.description)  # uncomment to see the full sidebar description

uofm University of Michigan


In [None]:
#### Use dir(object) to see all methods and attributes, if interested:
# [i for i in dir(subreddit) if i[0]!="_"]

Now we've connected to the subreddit. To sample, we can write `subreddit.YYY(limit=N)` where YYY is a standard Reddit sorting option ("new" posts / "hot" posts / etc) and N is how many posts we would like sampled.
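
As a minimal sketch of the `subreddit.YYY(limit=N)` pattern, the helper below looks up the sorting method by name. `sample_posts`, `SORTS`, and `FakeSubreddit` are hypothetical names introduced for illustration. The fake class exists only so the example runs without credentials; with a real connection you would pass the `subreddit` object from above instead.

```python
# Sort names assumed to correspond to listing methods on a PRAW subreddit object
SORTS = {"hot", "new", "top", "rising", "controversial"}


def sample_posts(subreddit, sort="new", limit=100):
    """Call subreddit.<sort>(limit=limit) after checking the sort name."""
    if sort not in SORTS:
        raise ValueError(f"Unknown sort {sort!r}; expected one of {sorted(SORTS)}")
    return getattr(subreddit, sort)(limit=limit)


class FakeSubreddit:
    """Stand-in used only so this sketch runs without Reddit credentials."""

    def new(self, limit=100):
        # Yields made-up post labels, one at a time, like a listing would
        return (f"post_{i}" for i in range(limit))


demo = list(sample_posts(FakeSubreddit(), sort="new", limit=3))
print(demo)
```

The set of sort names is an assumption based on the standard Reddit sorting options; check the PRAW documentation for the version you have installed.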

In [None]:
sample = subreddit.new(limit=1000)

***
### A Quick Aside :: Generators

In [None]:
type(sample)

praw.models.listing.generator.ListingGenerator

Sampling from a subreddit object technically returns a generator object (here, a PRAW `ListingGenerator`). <u>What's a generator? Why use this? How does this compare to just making a list?</u>

A quick analogy:
- Pretend you have 40 apples on your desk that you want to measure
- Just as you can't really hold 40 apples at once, your computer can only hold so much data for active use at once (RAM)
- Using a list of 10,000,000,000 Reddit posts might be more than your computer can handle
- A generator effectively "points" to the data but only loads in one record at a time, as you prompt it to
  - either with `next(generatorObject)` or by iterating `for i in generatorObject`
- This would be analogous to saying "pick up one apple at a time, then do your individual operation on it, then move on"

In this case we're just grabbing 1000 records, so memory isn't much of an issue. The PRAW package just uses this design as a best practice for good performance in larger-scale data operations.
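
The apple analogy can be sketched in plain Python, without PRAW. In this minimal sketch, `measure_apples` and the `picked` list are made-up names for illustration. The point is only that creating the generator does no work, and each `next()` call handles exactly one item.

```python
picked = []  # records which apples have actually been handled so far


def measure_apples(count):
    """Yield one apple measurement at a time instead of building a full list."""
    for number in range(1, count + 1):
        picked.append(number)
        yield number * 10  # pretend weight in grams


apples = measure_apples(40)
picked_at_start = list(picked)    # still empty: creating the generator does no work

first = next(apples)
picked_after_one = list(picked)   # only apple 1 has been handled

print(picked_at_start, first, picked_after_one)
```

The other 39 apples are never touched unless you keep calling `next(apples)` or loop over `apples`.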

Generators do have one particular quirk, however: they are <u>exhaustible</u>, meaning that once you've iterated through one, its contents are gone.

In [None]:
temp = subreddit.hot(limit=10)

In [None]:
#### If you run this list comprehension, it'll print the post ID of all sampled posts.
[i for i in temp]

[Submission(id='j3k4t8'),
 Submission(id='m0l37o'),
 Submission(id='mgxsrq'),
 Submission(id='mgxtl2'),
 Submission(id='mgxwjs'),
 Submission(id='mgq62r'),
 Submission(id='mguwpg'),
 Submission(id='mgm7mq'),
 Submission(id='mgzf40'),
 Submission(id='mgjzcc')]

In [None]:
#### If you run it a second time, it'll be empty because the generator iterated out all elements.
[i for i in temp]

[]
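
If you need to go over the same sample more than once, one common workaround is to copy the generator's contents into a list first. The sketch below uses a made-up stand-in, `fake_listing`, rather than a real PRAW listing, so it runs without credentials; the same `list(...)` call should work on any Python iterable.

```python
def fake_listing(limit):
    """Stand-in for a listing generator; yields made-up post IDs."""
    for i in range(limit):
        yield f"post_{i}"


temp = fake_listing(3)
saved = list(temp)                 # pulls every record out of the generator once

first_pass = [i for i in saved]
second_pass = [i for i in saved]   # a list can be re-read as often as needed
leftover = [i for i in temp]       # the generator itself is now empty

print(first_pass, second_pass, leftover)
```

The trade-off is memory: a list holds every record at once, which is fine for a few hundred posts but gives up the benefit described in the apple analogy.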

***
### Getting Post Attributes
While we are just using original-post text for this demonstration, note that PRAW also gives you access to a lot of other post data, including comment trees you can parse.

Below you can see extraction of a post's title, body-text, and unique primary-key identifier.

In [None]:
temp = subreddit.hot(limit=10)

In [None]:
#### next(generator) --> "Generator, please give us the next record off the top"
one_post = next(temp)

In [None]:
print(f"\t[{one_post.id}] :: {one_post.title}")
print("="*80)
print(one_post.selftext)

	[j3k4t8] :: Michigan Mental Health Resources
Hey everyone! I've seen some people talking about how difficult things are right now on here so I am putting the list of Michigan mental health resources compiled by the Unmasked team here!

- [Unmasked Project](https://www.unmaskedproject.com/) Anonymous peer support app for Michigan Students
- [Counseling and Psychological Services](https://caps.umich.edu) Speak with a mental health professional 24/7 on their phone line or make an appointment for a mental health consultation
- [MiTalk](https://caps.umich.edu//mitalk) Mental health resources for specific groups around campus (undocumented students, first generation, graduate students, international students, etc.)
- [U-M Community Provider Database](https://umcpd.umich.edu/) Database of off-campus mental health professionals in the Ann Arbor area
- [Office of the Ombuds](https://ombuds.umich.edu/) A place where all students are welcome to come and talk in confidence about any campus issue,

In [None]:
#### Disclaimer :: When pulling from Hot, your first and second results CAN end up being stickied posts
one_post.stickied

True
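
If stickied posts are not wanted in a sample, one option is to filter on the `stickied` attribute shown above. This is a minimal sketch: `skip_stickied` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins that mimic only the `id` and `stickied` attributes so the example runs without Reddit.

```python
from types import SimpleNamespace


def skip_stickied(posts):
    """Yield only the posts whose `stickied` attribute is falsy."""
    for post in posts:
        if not post.stickied:
            yield post


# Made-up stand-ins that mimic the two attributes used here
fake_hot = [
    SimpleNamespace(id="aaa111", stickied=True),
    SimpleNamespace(id="bbb222", stickied=True),
    SimpleNamespace(id="ccc333", stickied=False),
    SimpleNamespace(id="ddd444", stickied=False),
]

kept_ids = [post.id for post in skip_stickied(fake_hot)]
print(kept_ids)
```

With a live connection the call would look like `skip_stickied(subreddit.hot(limit=10))`. Note that filtering after sampling can leave you with fewer than `limit` posts.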

### Conclusion: Grab Samples & Go
Grabbing the samples and dumping them:

In [None]:
import pandas as pd

sample = subreddit.new(limit=500)

# Collect one row per post; "fused" joins the title and the body text.
# (DataFrame.append was removed in pandas 2.0, so build the rows first.)
rows = [{"id": each.id, "fused": f"{each.title} {each.selftext}"} for each in sample]
posts = pd.DataFrame(rows, columns=["id", "fused"]).set_index("id")

In [None]:
posts

id,fused
mh5n9v,Transferring within Taubman College Hi all! I’...
mh5anj,Michigan Mentorship Program Peer Mentor Decisi...
mgy95u,Any Attractive People? (This thread may come o...
mgzf40,Newly admitted student to CoE I've been admitt...
mgxwjs,Pain.
...,...
lvvp5i,Anyone heard from SURE 2021? The website says ...
lvvg51,just copped a whole 45% on my 203 midterm Can ...
lvvfo4,Should I drop 203? 203 exam scores just got po...
lvud33,Eecs autograder Do they only look at your most...


And now that the posts are extracted, just use .to_csv to dump them out into a file. In this case, I'm splitting them up into a testing set of unlabelled post data, plus a few subsets of training data for our team to divide-and-conquer to get things labelled.

<b>Won't run as written below; it needs a filepath you have access to, be it local or on cloud / drive</b>

In [None]:
posts.tail(400).to_csv("SOME_PATH/reddit_test_data.csv")

In [None]:
top_100 = posts.head(100)

In [None]:
top_100.tail(60).head(20).to_csv('SOME_PATH/to_label_1.csv')

In [None]:
top_100.tail(40).head(20).to_csv('SOME_PATH/to_label_2.csv')

In [None]:
top_100.tail(20).to_csv('SOME_PATH/to_label_3.csv')

In [None]:
top_100.head(40).to_csv('SOME_PATH/to_label_rest.csv')
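
To see which rows each of the chained `head`/`tail` calls above selects, the sketch below mirrors them with plain list slicing, where `head(n)` corresponds to `[:n]` and `tail(n)` to `[-n:]`. The `head` and `tail` functions and the integer `rows` list are stand-ins for illustration, not pandas itself, and the sketch assumes the sample really contains 500 posts.

```python
def head(seq, n):
    """Mirror of DataFrame.head(n): the first n rows."""
    return seq[:n]


def tail(seq, n):
    """Mirror of DataFrame.tail(n): the last n rows."""
    return seq[-n:]


rows = list(range(500))      # stand-in for 500 sampled posts, in sample order
top_100 = head(rows, 100)

splits = {
    "reddit_test_data": tail(rows, 400),
    "to_label_1": head(tail(top_100, 60), 20),
    "to_label_2": head(tail(top_100, 40), 20),
    "to_label_3": tail(top_100, 20),
    "to_label_rest": head(top_100, 40),
}

for name, chunk in splits.items():
    print(f"{name}: rows {chunk[0]}-{chunk[-1]} ({len(chunk)} rows)")

# Check that every row lands in exactly one file
all_rows = sorted(row for chunk in splits.values() for row in chunk)
no_overlap = all_rows == rows
print(no_overlap)
```

Under that assumption the test set is rows 100-499 and the labelling files cover rows 40-59, 60-79, 80-99, and 0-39 respectively. If the sample returns fewer than 500 posts, `tail(400)` and `head(100)` would overlap, so it is worth checking `len(posts)` before writing the files.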