# Predicting Developer/Publisher Success
## Data collection, cleaning, and analysis

In [1]:
import praw
import requests
import pandas as pd
import datetime as dt
import csv
from bs4 import BeautifulSoup
# lxml was installed with "pip install lxml" and is then imported into the python3 environment
import lxml.html as lh

# Making a Reddit App for Authorization

1. Make a Reddit account and log in
2. Go to https://www.reddit.com/prefs/apps/
3. Create an App
4. Fill out the create application form 
  1. Choose the "script" option
  2. For our class, a redirect URI of http://soic.indiana.edu will suffice
5. After you've created the app, you'll see a window with your app's settings
  1. Get the client id - it's under your app's name
  2. Get the client secret
  


# Creating PRAW Reddit api object
### The parameters in the variables below are as follows: 

- client_id='PERSONAL_USE_SCRIPT_14_CHARS'
- client_secret='SECRET_KEY_27_CHARS'
- user_agent='YOUR_APP_NAME'
- username='YOUR_REDDIT_USER_NAME'
- password='YOUR_REDDIT_LOGIN_PASSWORD'

In [2]:
client_id = "WhkpjLo6_5t5zQ" # insert your client ID here
client_secret = "nZhrnOnulzDse-k6AujCKkGPyh4" # client secret here
user_agent = "IU-SMM-2" # a string identifying your app to Reddit; it is courteous to include your contact info
# username = "psuaggie"

r = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
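
As an alternative to typing the credentials into the notebook, here is a minimal sketch that reads them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical, chosen only for illustration; the returned keys match the `praw.Reddit` parameters used above.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=os.environ):
    """Return keyword arguments that could be passed to praw.Reddit(**kwargs)."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Usage (needs the variables to be set and praw to be imported):
# r = praw.Reddit(**load_reddit_credentials())
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.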

# Analysis Notes
- The end goal is a tabular dataset of aggregated sentiment for studios and titles. For instance, since Bethesda makes Fallout, I would capture comments mentioning either 'Bethesda' or 'Fallout', rate the sentiment of those comments, and then aggregate the sentiment, along with the other features, by date.
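
The aggregation step described above could look roughly like the sketch below. It assumes each comment has already been reduced to a `(date, keyword, sentiment)` tuple by some scoring step that is not shown; the function name `aggregate_sentiment` and the example records are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(records):
    """Average sentiment per (date, keyword) pair.

    `records` is an iterable of (date, keyword, sentiment) tuples, where
    sentiment is a float produced by a scoring step not shown here.
    """
    buckets = defaultdict(list)
    for date, keyword, sentiment in records:
        buckets[(date, keyword)].append(sentiment)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Made-up records, for illustration only.
example_records = [
    ("2018-09-25", "Bethesda", 0.5),
    ("2018-09-25", "Bethesda", -0.5),
    ("2018-09-25", "Fallout", 1.0),
    ("2018-09-26", "Bethesda", 0.25),
]
daily_sentiment = aggregate_sentiment(example_records)
```

The same grouping could be done with a pandas `groupby`; the plain-Python version is used here only to keep the sketch dependency-free.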

## Anticipated problems/obstacles
- How should submissions and comments be treated: the same way? Does it matter whether they are related?
- Can we obtain other information about a given comment (e.g. score or controversiality)?

## Needed Features: 
1. Comment body
  - Needs to be filtered on particular keywords (e.g. studio name, title name)
2. Date
3. Score
4. Platform
  - Maybe there's some difference in how users of different platforms feel about a given title
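
A minimal sketch of turning one comment into the four features listed above. It assumes the comment object exposes `body`, `score`, and `created_utc` (a Unix timestamp), as PRAW comments do, and that the platform is supplied by the caller, for example as the subreddit name. The helper names and the stand-in comment are hypothetical.

```python
import datetime as dt
import re
from types import SimpleNamespace


def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def extract_features(comment, platform, pattern):
    """Return a feature dict for a matching comment, or None if no keyword matches."""
    if not pattern.search(comment.body):
        return None
    created = dt.datetime.fromtimestamp(comment.created_utc, tz=dt.timezone.utc)
    return {
        "body": comment.body,
        "date": created.date().isoformat(),
        "score": comment.score,
        "platform": platform,
    }


# Offline stand-ins for comments, so the sketch runs without a Reddit connection.
pattern = build_keyword_pattern(["Bethesda", "Fallout"])
matching = SimpleNamespace(body="I hope fallout gets a patch.", score=3, created_utc=0)
unrelated = SimpleNamespace(body="I love Zelda.", score=5, created_utc=0)

features = extract_features(matching, "ps4", pattern)
skipped = extract_features(unrelated, "wii", pattern)
```

Whole-word matching avoids counting a keyword that only appears inside a longer word, at the cost of missing spelling variants such as "Fall Out"; those would need to be added to the keyword list explicitly.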

# Getting Content from Subreddits

Let's make a Python object representing each subreddit.

In [21]:
subreddit_gaming = r.subreddit("gaming")
subreddit_ps4 = r.subreddit("ps4")
subreddit_wii = r.subreddit("wii")
subreddit_xboxone = r.subreddit("xboxone")

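Get the 100 most recent comments from each subreddit, along with each comment's score, its replies, and whether it is a top-level comment.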

In [19]:
comments_gaming = []
comments_ps4 = []
comments_wii = []
comments_xboxone = []

for c in subreddit_gaming.comments(limit=100):
    comments_gaming.append((c.body, c.score, c.replies, c.is_root))
    
for c in subreddit_xboxone.comments(limit=100):
    comments_xboxone.append((c.body, c.score, c.replies, c.is_root))    

for c in subreddit_ps4.comments(limit=100):
    comments_ps4.append((c.body, c.score, c.replies, c.is_root))

for c in subreddit_wii.comments(limit=100):
    comments_wii.append((c.body, c.score, c.replies, c.is_root))
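
The four loops above differ only in the subreddit they read from, so they could be folded into one helper. The sketch below is one way to do that; `collect_recent_comments` and `FakeSubreddit` are hypothetical names, and the fake class exists only so the example runs offline. It assumes each subreddit object has a `comments(limit=...)` method, as used in the cell above.

```python
from types import SimpleNamespace


def collect_recent_comments(subreddits, limit=100):
    """Map each subreddit name to a list of (body, score, replies, is_root) tuples."""
    return {
        name: [(c.body, c.score, c.replies, c.is_root)
               for c in subreddit.comments(limit=limit)]
        for name, subreddit in subreddits.items()
    }


class FakeSubreddit:
    """Offline stand-in (not part of PRAW) exposing a comments(limit=...) method."""

    def __init__(self, comments):
        self._comments = comments

    def comments(self, limit=100):
        return iter(self._comments[:limit])


fake = FakeSubreddit([
    SimpleNamespace(body="first", score=1, replies=[], is_root=True),
    SimpleNamespace(body="second", score=2, replies=[], is_root=False),
])
recent = collect_recent_comments({"gaming": fake, "ps4": fake}, limit=1)
```

With the real objects, the call would be something like `collect_recent_comments({"gaming": subreddit_gaming, "ps4": subreddit_ps4, "wii": subreddit_wii, "xboxone": subreddit_xboxone})`.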


In [20]:
print(comments_gaming[0:10])
print('*'* 50)

print(comments_ps4[0:10])
print('*'* 50)

##print(comments_xbox360[0:10])
##print('*'* 50)

print(comments_xboxone[0:10])
print('*'* 50)

print(comments_wii[0:10])


[('My parents used to know how to pause online games... they unplugged the internet cable.', 1, <praw.models.comment_forest.CommentForest object at 0xa914cb6c>, True), ("I don't game online but I do play some games that can't be paused. I just let my wife know ahead of time and she's okay with it. Of course we have a kid so it's not like we can just play whenever we want to.", 1, <praw.models.comment_forest.CommentForest object at 0xa914cecc>, False), ('"You Died."', 1, <praw.models.comment_forest.CommentForest object at 0xa914c20c>, True), ('i think you mean 17 colossi', 1, <praw.models.comment_forest.CommentForest object at 0xa914cacc>, False), ('He could have gotten it anywhere, from friends, siblings, it doesn’t have to be the internet ', 1, <praw.models.comment_forest.CommentForest object at 0xa914c7ac>, False), ('Gamers RISE UP\n\nBOTTOM TEXT ', 1, <praw.models.comment_forest.CommentForest object at 0xa914cf2c>, True), ('👏👏👏 **ＹＯＵ\u3000ＡＲＥ\u3000Ａ\u3000ＢＡＤ\u3000ＰＡＲＥＮＴＳ** 👏👏👏', 2, 

# Getting comments on submissions

Getting comments on submissions is a little complicated. With users or entire subreddits, it's simple, because there is a one-to-many correspondence between user/subreddit and the comments it has. However, comments on a submission are organized in a *tree-like structure*; that is, the submission itself may have comments, and those comments may have comments on them, and so on. Because of this, we don't have helpful organizing functions like ``new`` or ``top``. We have to get them all and organize them ourselves. 

First, let's get the top submissions of the past week from a subreddit, take the first one, and collect the *top-level comments* on that submission.

In [7]:
newssubreddit = r.subreddit("news")

submissions = []

for submission in newssubreddit.top(limit=5, time_filter="week"):
    submissions.append(submission)
    
s = submissions[0] # Let's work with the first submission

In [8]:
comments = []

for top_level_comment in s.comments:
    comments.append(top_level_comment)

In [9]:
# Text of the first top-level comment
comments[0].body

"Teacher here. Just wanted to chime in and explain why these policies exist, as it was explained to me years ago:\n\nThe idea is that if a student can't receive lower than a 50 per marking period, there is never a point where it is impossible for them to pass for the year. Technically, they could not show up for three quarters, pull a 100 in the fourth, and still pass for the year. \n\nNow, an optimist would say this is a good thing, as it means the students will always have that opportunity to make a comeback. Particularly in low-income districts that lack parent engagement, the last thing you want is a kid realizing they can't possibly pass for the year and deciding to spend their day on the street instead of wasting it in school. I've seen firsthand students who really only kept coming to class because they still had that chance -- however small -- to pass for the year.\n\nThe more pessimistic view is what many have already pointed out in these comments: that it allows schools to ke

Comments can have comments themselves. Here's how to extract the child comments of the first comment on the original submission.

In [10]:
replies = []

for reply in comments[0].replies:
    replies.append(reply)
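
Since replies can themselves have replies, going beyond one level calls for a recursive walk. The sketch below is a generic depth-first traversal, not a PRAW function: `walk_replies` is a hypothetical helper, and it assumes every item in `replies` has a `body` and its own `replies`. The stand-in tree lets it run offline.

```python
from types import SimpleNamespace


def walk_replies(comment, depth=0):
    """Yield (depth, body) for a comment and all of its descendants, depth-first."""
    yield depth, comment.body
    for reply in comment.replies:
        yield from walk_replies(reply, depth + 1)


# Stand-in comment tree for illustration.
tree = SimpleNamespace(body="root", replies=[
    SimpleNamespace(body="a", replies=[
        SimpleNamespace(body="a1", replies=[]),
    ]),
    SimpleNamespace(body="b", replies=[]),
])
flattened = list(walk_replies(tree))
```

Recording the depth alongside the body keeps enough information to tell direct replies apart from deeper ones after flattening.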

In [11]:
replies[0].body

"For extenuating circumstances and crises, a blank or N/A is sufficient and won't hurt or help a grade. You could always assign an extra credit assignment that takes a lot of effort or allow past work to be made up for half credit. Basically, you can still give students a chance, but they have to bust their ass for a bit and show that they want it. I see way too many students who have been pushed through the system bragging that they do no work and still pass."

Reddit does not return every comment in a single response, so PRAW inserts "MoreComments" objects into the comment tree as placeholders for comments that have not been loaded yet. For example, at the time of writing, the last item in replies is a "MoreComments" object.

In [12]:
replies

[Comment(id='e6m8zqf'),
 Comment(id='e6mbi4r'),
 Comment(id='e6macx5'),
 Comment(id='e6mk93s'),
 <MoreComments count=132, children=['e6m9i4i', 'e6nnt3a', 'e6mn1my', '...']>]

We can open up a MoreComments object, but this requires sending another request to Reddit.

In [93]:
mc = replies[-1]

comments = []
for c in mc.comments():
    comments.append(c)

In [94]:
comments[0].body

'Thank you for your work'

If you work with Reddit comment forests, make sure your code handles "MoreComments" objects gracefully. If you need help writing the code to do this, let me know.
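
One way to handle the placeholders, sketched under assumptions: the helper below relies on duck typing (anything with a `body` counts as a comment, anything else as a placeholder) so that it runs offline on stand-in objects. `split_comment_forest` is a hypothetical name. With PRAW itself, an explicit `isinstance(item, praw.models.MoreComments)` check would be the more direct test, and `submission.comments.replace_more(limit=0)` drops the placeholders before you iterate.

```python
from types import SimpleNamespace


def split_comment_forest(items):
    """Separate comments from MoreComments-style placeholders, recursively.

    Returns (comments, placeholders). Placeholders are left unopened, so no
    extra requests are triggered by this function.
    """
    comments, placeholders = [], []
    for item in items:
        if hasattr(item, "body"):
            comments.append(item)
            child_comments, child_placeholders = split_comment_forest(item.replies)
            comments.extend(child_comments)
            placeholders.extend(child_placeholders)
        else:
            placeholders.append(item)
    return comments, placeholders


# Stand-in forest: two comments, one nested reply, and two placeholders.
forest = [
    SimpleNamespace(body="top", replies=[
        SimpleNamespace(body="nested", replies=[]),
        SimpleNamespace(count=132, children=["x", "y"]),
    ]),
    SimpleNamespace(body="other", replies=[]),
    SimpleNamespace(count=5, children=["z"]),
]
found, pending = split_comment_forest(forest)
unloaded_total = sum(p.count for p in pending)
```

Keeping the placeholders in a separate list lets you decide later whether the unloaded comments are worth the extra requests.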