# Toward Predicting Developer/Publisher success 
## Data collection, cleaning, and analysis. 

In [2]:
import praw
import requests
import pandas as pd
import datetime as dt
import csv
from bs4 import BeautifulSoup
## installed lxml.html using "...$ pip install lxml", then import into python3 env
import lxml.html as lh
import pprint

# Making a Reddit App for Authorization

1. Make a Reddit account and log in
2. Go to https://www.reddit.com/prefs/apps/
3. Create an App
4. Fill out the create application form 
  1. Choose the "script" option
  2. For our class, a redirect uri of http://soic.indiana.edu will suffice
5. After you've created the app, you'll see a window with your app's settings
  1. Get the client id - it's under your app's name
  2. Get the client secret
  


# Creating PRAW Reddit api object
### The parameters in the variables below are as follows: 

- client_id='PERSONAL_USE_SCRIPT_14_CHARS', \
- client_secret='SECRET_KEY_27_CHARS ', \
- user_agent='YOUR_APP_NAME', \
- username='YOUR_REDDIT_USER_NAME', \
- password='YOUR_REDDIT_LOGIN_PASSWORD')

In [3]:
client_id = "WhkpjLo6_5t5zQ" # insert your client ID here
client_secret = "nZhrnOnulzDse-k6AujCKkGPyh4" # client secret here
user_agent = "IU-SMM-2" # a string identifying your app to agents; it is courteous practice to provide your contact info
# username = "psuaggie"

r = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

# Analysis Notes
- The end goal of the structured data will be to have a tabular dataset of aggregated sentiment for studios and titles. For instance, if Bethesda makes fallout, I would capture comments with either 'Bethesda' or 'Fall Out' and rate the sentiment of those comments, along with the other features, then aggregate by the date. 

## Anticipated problems/obstacles
- how do I treat submissions and comments - the same? does it matter if they're related? 
- can we obtain other information about a given comment (e.g. score or controversial-ity)

## Needed Features: 
1) Comment body
- needs to be filtered on particular keywords (e.g. studio name, title name)
2) Date
3) Score
4) Platform.
- Maybe there's some difference in how users of different platforms feel about a given title

## To do: 
- [ ] create list to iterate through: controversial, gilded, hot, new, rising, top
- [ ] grab top 50 comments for the year for each subreddit, including ID, title, and body
- [ ] what variables are available - use vars and pprint. 

## Creating subreddit objects
- Nintendo 
- Playstation
- X Box

In [4]:
subreddit_nintendo = r.subreddit("nintendo")
subreddit_ps4 = r.subreddit("ps4")
subreddit_xbox = r.subreddit("xboxone")

I want to check to see what variables are available in each comment and submission. 
First I need a submission ID: 

In [5]:
for submission in subreddit_nintendo.hot(limit=1):
    print(submission.title)
    print(submission.score) 
    print(submission.id)    
    print(submission.url)

r/Nintendo are looking for more moderators!
51
9nu4wn
https://www.reddit.com/r/nintendo/comments/9nu4wn/rnintendo_are_looking_for_more_moderators/


In [6]:
# Take the submission ID: 9nu4wn and pass it into a new submission variable
submission = r.submission(id='9nu4wn')
print(submission.title) # to make it non-lazy
pprint.pprint(vars(submission))

r/Nintendo are looking for more moderators!
{'_comments': <praw.models.comment_forest.CommentForest object at 0xb4a584ac>,
 '_comments_by_id': {'t1_e7oyz0m': Comment(id='e7oyz0m'),
                     't1_e7p0s7c': Comment(id='e7p0s7c'),
                     't1_e7p1edq': Comment(id='e7p1edq'),
                     't1_e7p3d4o': Comment(id='e7p3d4o'),
                     't1_e7p3kly': Comment(id='e7p3kly'),
                     't1_e7p3s99': Comment(id='e7p3s99'),
                     't1_e7p48av': Comment(id='e7p48av'),
                     't1_e7p6480': Comment(id='e7p6480'),
                     't1_e7p6jy2': Comment(id='e7p6jy2'),
                     't1_e7p6rmd': Comment(id='e7p6rmd'),
                     't1_e7p8hm8': Comment(id='e7p8hm8'),
                     't1_e7pbza1': Comment(id='e7pbza1'),
                     't1_e7pffgs': Comment(id='e7pffgs'),
                     't1_e7pk2jc': Comment(id='e7pk2jc'),
                     't1_e7pk6ne': Comment(id='e7pk6ne'),
       

The variables of interest include 
- ```created``` which gives us the date the submission was created. 
- ```ups```
- ```downs```
- ```comments``` which is a comment forest
- ```selftext``` this is the body of the text
- ```upvote_ratio```

Now I can pass those params into an iterator and pull in top submissions for each subreddit for the year.  
Before I do that, since reddit typically has a limit of 1000 items returned (not sure if this is an API call), I'll want to figure out how to save each call to an output file, then come back to it later. This way I can issue multiple calls and get the data I need. 

Get most recent comments and scores.

In [9]:
# We'll start out with the top 100 submissions over the last month in the 'ps4' subreddit, 
# then save it to a file for later evaluation
ps4_top_submissions = []
for s in subreddit_ps4.top(limit=100, time_filter="month"):
    ps4_top_submissions.append([s.title, s.created, s.ups, s.downs, s.selftext, s.upvote_ratio])
    
print(ps4_top_submissions[0:10])

[["Red Dead Redemption 2 - If you turn off the mini-map, NPC's dialogue will change - giving you directions involving routes and landmarks", 1539067494.0, 42231, 0, '', 0.9], ['[Image] The Last of Us Part II Ellie Theme overlaps the menu 😍', 1538119662.0, 21802, 0, '', 0.88], ['[Image] Say what you will about Ubisoft, but this is pretty cool of them (regarding TellTale layoffs)', 1537605361.0, 17764, 0, '', 0.9], ['Red Dead Redemption 2 Main Campaign is around 65 hours long', 1539601854.0, 17197, 0, '', 0.92], ['Can we stop pretending that digital downloads are worth the same amount of money as physical copies?', 1539723773.0, 15978, 0, 'I really have never understood why online purchases of games are often the same price or more expensive than physical copies of the same game. ', 0.85], ['[Image] Sony please integrate all these into one app x-post from r/gaming', 1537212408.0, 14740, 0, '', 0.91], ["[Image]Sharks in Assassin's Creed Odyssey will actually eat bodies floating on the sur

In [16]:
comments_ps4 = []
comments_xbox = []
comments_nintendo = []

for c in subreddit_nintendo.comments(limit=100):
    comments_nintendo.append((c.body, c.score, c.ups, c.downs))
    
for c in subreddit_xbox.comments(limit=100):
    comments_xboxone.append((c.body, c.score, c.ups, c.downs))    

for c in subreddit_ps4.comments(limit=100):
    comments_ps4.append((c.body, c.score, c.ups, c.downs))

In [21]:
ps4_top_submissions = []
xbox_top_submissions = []
nintendo_top_submissions = []

for s in subreddit_ps4.top(limit=100, time_filter="month"):
    ps4_top_submissions.append((s.title, s.score, s.ups, s.downs))

for s in subreddit_xbox.top(limit=100, time_filter="month"):
    xbox_top_submissions.append((s.title, s.score, s.ups, s.downs))

for s in subreddit_nintendo.top(limit=100, time_filter="month"):
    nintendo_top_submissions.append((s.title, s.score, s.ups, s.downs))

ps4_top_submissions[0:10]
nintendo_top_submissions[0:10]
xbox_top_submissions[0:10]


#for s in subreddit_democrat.top(limit=10, time_filter="week"):
#    d_top_submissions.append((s.title, s.score))
#for submission in subreddit_nintendo:
#    print(submission.title)

[('I jokingly tweeted Xbox asking for a birthday present and they delivered!',
  20861,
  20861,
  0),
 ("Red Dead Redemption 2 - If you turn off the mini-map, NPC's dialogue will change - giving you directions involving routes and landmarks",
  14648,
  14648,
  0),
 ('Driving around in Forza Horizon 4 & I come across the Windows XP Desktop',
  11415,
  11415,
  0),
 ('Red Dead Redemption 2 will feature full first person mode at launch',
  8550,
  8550,
  0),
 ('Would anybody else like the option to remove these items from the home screen?',
  9053,
  9053,
  0),
 ('Footage Of A Harry Potter RPG Has Apparently Leaked', 7774, 7774, 0),
 ('Forza Horizon 4', 7492, 7492, 0),
 ('Can we take a moment to really appreciate the people at Microsoft working hard on backwards compatibility?',
  7195,
  7195,
  0),
 ('Buy an elite controller they said, it will be fun they said... 5th or 6th time this has happened to my lb buttons. I can’t be the only one. It’s so frustrating.',
  6601,
  6601,
  0

In [22]:
print(comments_nintendo[0:10])
print('*'* 50)

print(comments_ps4[0:10])
print('*'* 50)

##print(comments_xbox360[0:10])
##print('*'* 50)

print(comments_xbox[0:10])
print('*'* 50)

print(comments_wii[0:10])


[("Yeah, the podcast doesn't count since I'm too lazy to watch these videos and I'd rather just read it, because I read faster than I listen, I zone out less, and I can just skim it over instead of viewing the full thing. It could be included with Switch Online perhaps.", 1, 1, 0), ('=IF("Current Year", "No Mother 3", 0)', 1, 1, 0), ('"marketing" lol no, tons of gaming sites would cover it and they wouldnt have to pay anything', 1, 1, 0), ("When held in a sideways formation having your hands close together isn't nearly as much of an issue as having to reach all the buttons on the Joycon in an upright position with nothing but a thumb. I think the current setup is the optimal formation for both modes.", 1, 1, 0), ("Testing, quality assurance and marketing that could all go to products that are going to make them more money. There's opportunity cost. ", 1, 1, 0), ("Because the sides aren't interchangeable i wish the left one was like that, at least then you'd have one good controller", 1

NameError: name 'comments_xbox' is not defined

In [23]:
nintendo_comments = []
for comment in comments_nintendo:
    nintendo_comments.append(comment)
    
nintendo_comments[0:10]
nintendo_df = pd.DataFrame(comments_nintendo, columns=("Comment", "Score", "Ups", "Downs"))
nintendo_df.sort_values(by='Score', ascending=False)
nintendo_df.head()

Unnamed: 0,Comment,Score,Ups,Downs
0,"Yeah, the podcast doesn't count since I'm too ...",1,1,0
1,"=IF(""Current Year"", ""No Mother 3"", 0)",1,1,0
2,"""marketing"" lol no, tons of gaming sites would...",1,1,0
3,When held in a sideways formation having your ...,1,1,0
4,"Testing, quality assurance and marketing that ...",1,1,0


# Getting comments on submissions

Getting comments on submissions is a little complicated. With users or entire subreddits, it's simple, because there is a one-to-many correspondence between user/subreddit and the comments it has. However, comments on a submission are organized in a *tree-like structure*; that is, the submission itself may have comments, and those comments may have comments on them, and so on. Because of this, we don't have helpful organizing functions like ``new`` or ``top``. We have to get them all and organize them ourselves. 

First, let's get the most recent submission of a subreddit and try to get the *top-level comments* on that submission. 

In [7]:
newssubreddit = r.subreddit("news")

submissions = []

for submission in newssubreddit.top(limit=5, time_filter="week"):
    submissions.append(submission)
    
s = submissions[0] # Let's work with the first submission

In [8]:
comments = []

for top_level_comment in s.comments:
    comments.append(top_level_comment)

In [9]:
# Text of the 11th comment
comments[0].body

"Teacher here. Just wanted to chime in and explain why these policies exist, as it was explained to me years ago:\n\nThe idea is that if a student can't receive lower than a 50 per marking period, there is never a point where it is impossible for them to pass for the year. Technically, they could not show up for three quarters, pull a 100 in the fourth, and still pass for the year. \n\nNow, an optimist would say this is a good thing, as it means the students will always have that opportunity to make a comeback. Particularly in low-income districts that lack parent engagement, the last thing you want is a kid realizing they can't possibly pass for the year and deciding to spend their day on the street instead of wasting it in school. I've seen firsthand students who really only kept coming to class because they still had that chance -- however small -- to pass for the year.\n\nThe more pessimistic view is what many have already pointed out in these comments: that it allows schools to ke

Comments can have comments themselves. Here's how extract the children comments of the first comment on the original submission.

In [10]:
replies = []

for reply in comments[0].replies:
    replies.append(reply)

In [11]:
replies[0].body

"For extenuating circumstances and crises, a blank or N/A is sufficient and won't hurt or help a grade. You could always assign an extra credit assignment that takes a lot of effort or allow past work to be made up for half credit. Basically, you can still give students a chance, but they have to bust their ass for a bit and show that they want it. I see way too many students who have been pushed through the system bragging that they do no work and still pass."

PRAW deals with Reddit rate limitations on comments by inserting "MoreComments" objects into the comment tree. For example, at the time of me writing this code, the fourth item in replies is a "MoreComments" object.

In [12]:
replies

[Comment(id='e6m8zqf'),
 Comment(id='e6mbi4r'),
 Comment(id='e6macx5'),
 Comment(id='e6mk93s'),
 <MoreComments count=132, children=['e6m9i4i', 'e6nnt3a', 'e6mn1my', '...']>]

We can open up a MoreComments object, but this necessitates sending another request to reddit.

In [93]:
mc = replies[-1]

comments = []
for c in mc.comments():
    comments.append(c)

In [94]:
comments[0].body

'Thank you for your work'

It's important to keep in mind if you work with reddit Comment Forests to calibrate your code to handle "MoreComments" objects gracefully. If you need help writing the code to do this, let me know.