# Predicting Developer/Publisher success 
## Data collection, cleaning, and analysis. 

In [8]:
import praw
import requests
import pandas as pd
import datetime as dt
import csv
from bs4 import BeautifulSoup
## installed lxml.html using "...$ pip install lxml", then import into python3 env
import lxml.html as lh


# Scraping html data

1. Create a handle to store the data
2. Store the contents of the data under one website
3. Parse the data that's stored between the <tr> elements of HTML

In [64]:
url = "http://www.vgchartz.com/preorders/43338/USA/"
response = requests.get(url)
response.text[:100] 

'\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/x'

In [65]:
class HTMLTableParser:
       
        def parse_url(self, url):
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'lxml')
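            # table.get('id') returns None instead of raising KeyError
            # when a <table> has no id attribute
            return [(table.get('id'), self.parse_html_table(table))
                    for table in soup.find_all('table')]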
    
        def parse_html_table(self, table):
            n_columns = 0
            n_rows=0
            column_names = []
    
            # Find number of rows and columns
            # we also find the column titles if we can
            for row in table.find_all('tr'):
                
                # Determine the number of rows in the table
                td_tags = row.find_all('td')
                if len(td_tags) > 0:
                    n_rows+=1
                    if n_columns == 0:
                        # Set the number of columns for our table
                        n_columns = len(td_tags)
                        
                # Handle column names if we find them
                th_tags = row.find_all('th') 
                if len(th_tags) > 0 and len(column_names) == 0:
                    for th in th_tags:
                        column_names.append(th.get_text())
    
            # Safeguard on Column Titles
            if len(column_names) > 0 and len(column_names) != n_columns:
                raise Exception("Column titles do not match the number of columns")
    
            columns = column_names if len(column_names) > 0 else range(0,n_columns)
            df = pd.DataFrame(columns = columns,
                              index= range(0,n_rows))
            row_marker = 0
            for row in table.find_all('tr'):
                column_marker = 0
                columns = row.find_all('td')
                for column in columns:
                    df.iat[row_marker,column_marker] = column.get_text()
                    column_marker += 1
                if len(columns) > 0:
                    row_marker += 1
                    
            # Convert to float if possible
            for col in df:
                try:
                    df[col] = df[col].astype(float)
                except ValueError:
                    pass
            
            return df

In [66]:
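# Assumes `df` holds one table from the parser, e.g.
# df = HTMLTableParser().parse_url(url)[0][1]
# Without parentheses, `df.head` is the bound method, so its repr prints the
# whole frame (as below); use `df.head()` for just the first five rows.
df.head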

<bound method NDFrame.head of       Spider-Man (PS4) (PS4)Sony Interactive Entertainment, Action-Adventure
0        Super Smash Bros. (2018) (NS)Nintendo, Fighting                    
1      Red Dead Redemption 2 (PS4)Rockstar Games, Act...                    
2      Red Dead Redemption 2 (XOne)Rockstar Games, Ac...                    
3      Kingdom Hearts III (PS4)Square Enix, Role-Playing                    
4      Call of Duty: Black Ops IIII (PS4)Activision, ...                    
5      Days Gone (PS4)Sony Interactive Entertainment,...                    
6                        NBA 2K19 (PS4)2K Sports, Sports                    
7                 Dead Island 2 (PS4)Deep Silver, Action                    
8      Spyro Reignited Trilogy (PS4)Activision, Platform                    
9      Call of Duty: Black Ops IIII (XOne)Activision,...                    
10                 Super Mario Party (NS)Nintendo, Party                    
11     Shadow of the Tomb Raider (PS4)Square E
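
The preview above shows the title, platform, publisher, and genre fused into one string per row (for example `NBA 2K19 (PS4)2K Sports, Sports`). One possible way to pull those fields apart is a regular expression. This is only a sketch: the pattern and the name `split_game_cell` are assumptions based on the rows visible above, not part of the original parser, and a publisher name containing a comma would need extra handling.

```python
import re

# Assumed cell layout: "<title> (<platform>)<publisher>, <genre>".
# The greedy title group lets titles that themselves contain parentheses,
# such as "Super Smash Bros. (2018)", keep them.
CELL_PATTERN = re.compile(
    r"^(?P<title>.+) \((?P<platform>[^()]+)\)(?P<publisher>[^,]+), (?P<genre>.+)$"
)


def split_game_cell(cell):
    """Return a dict of title/platform/publisher/genre, or None if no match."""
    match = CELL_PATTERN.match(cell.strip())
    return match.groupdict() if match else None


print(split_game_cell("NBA 2K19 (PS4)2K Sports, Sports"))
```

Returning `None` for cells that do not fit the pattern makes it easy to count how many rows the assumption misses before trusting the split.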

# Making a Reddit App for Authorization

1. Make a Reddit account and log in
2. Go to https://www.reddit.com/prefs/apps/
3. Create an App
4. Fill out the create application form 
  1. Choose the "script" option
  2. For our class, a redirect URI of http://soic.indiana.edu will suffice
5. After you've created the app, you'll see a window with your app's settings
  1. Get the client id - it's under your app's name
  2. Get the client secret
  


# Creating PRAW Reddit API object
### The parameters in the variables below are as follows: 

- client_id='PERSONAL_USE_SCRIPT_14_CHARS'
- client_secret='SECRET_KEY_27_CHARS'
- user_agent='YOUR_APP_NAME'
- username='YOUR_REDDIT_USER_NAME'
- password='YOUR_REDDIT_LOGIN_PASSWORD'

In [20]:


client_id = "WhkpjLo6_5t5zQ" # insert your client ID here
client_secret = "nZhrnOnulzDse-k6AujCKkGPyh4" # client secret here
user_agent = "IU-SMM-2" # a string identifying your app to agents; it is courteous practice to provide your contact info
# username = "psuaggie"

r = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
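
Because the client ID and secret are credentials, it may be safer to keep them out of the notebook itself. A minimal sketch, assuming the values are exported as environment variables before the notebook starts; the variable names and the helper `load_reddit_credentials` are hypothetical and not part of PRAW:

```python
import os


def load_reddit_credentials(env=None):
    """Collect praw.Reddit keyword arguments from environment variables.

    The variable names below are hypothetical; any names would work as
    long as they match what is exported in the shell.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (requires the variables to be set first):
# r = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing names tends to be easier to debug than an authentication error from Reddit later on.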

# Analysis Notes
- The end goal is a tabular dataset of aggregated sentiment for studios and titles. For instance, since Bethesda makes Fallout, I would capture comments mentioning either 'Bethesda' or 'Fallout', rate the sentiment of those comments, keep the other features alongside, and then aggregate by date. 
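
The aggregation step described above could look roughly like the following sketch. It assumes each captured comment has already been reduced to a `(keyword, created_utc, sentiment)` tuple, where `created_utc` is a Unix timestamp and `sentiment` is a number from whatever scoring method is chosen later; the function name and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def aggregate_sentiment(rows):
    """Average sentiment per (keyword, UTC date).

    Each row is (keyword, created_utc, sentiment). The sentiment values
    are placeholders here; no scoring method is implied.
    """
    buckets = defaultdict(list)
    for keyword, created_utc, sentiment in rows:
        day = datetime.fromtimestamp(created_utc, tz=timezone.utc).date()
        buckets[(keyword, day)].append(sentiment)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up rows purely to show the shape of the input
sample = [("Bethesda", 0, 0.5), ("Bethesda", 3600, -0.5), ("Fallout", 86400, 1.0)]
print(aggregate_sentiment(sample))
```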

## Anticipated problems/obstacles
- How should submissions and comments be treated: the same way? Does it matter whether they're related?
- Can we obtain other information about a given comment (e.g. score or controversiality)?

## Needed Features: 
1) Comment body
    - needs to be filtered on particular keywords (e.g. studio name, title name)
2) Date
3) Score
4) Platform
    - Maybe there's some difference in how users of different platforms feel about a given title
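
The four features listed above can be sketched as a small record type plus a keyword filter. The class and function names are hypothetical, and using the subreddit name as the platform is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CommentRecord:
    body: str
    created_utc: float  # Unix timestamp of the comment
    score: int
    platform: str  # assumed here to be the subreddit the comment came from


def matched_keywords(body, keywords):
    """Return the keywords that occur in body, ignoring case."""
    text = body.lower()
    return [kw for kw in keywords if kw.lower() in text]


def filter_records(records, keywords):
    """Keep only records whose body mentions at least one keyword."""
    return [rec for rec in records if matched_keywords(rec.body, keywords)]


# Made-up records purely to show usage
records = [
    CommentRecord("Bethesda did it again", 1535155200.0, 3, "gaming"),
    CommentRecord("Nice weather today", 1535155300.0, 1, "ps4"),
]
print(filter_records(records, ["Bethesda", "Fallout"]))
```

Plain substring matching is deliberately simple; it will also match keywords inside longer words, which may or may not be acceptable for studio names.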

# Getting Content from Subreddits

Let's make a Pythonic object representing the subreddit.

In [4]:
subreddit_gaming = r.subreddit("gaming")
subreddit_ps4 = r.subreddit("ps4")
## subreddit_xbox360 = r.subreddit("xbox360")
subreddit_wii = r.subreddit("wii")
subreddit_xboxone = r.subreddit("xboxone")
gaming = r.subreddit('gaming')

Get most recent comments and scores.

In [5]:
comments_gaming = []
comments_ps4 = []
## comments_xbox360 = []
comments_wii = []
comments_xboxone = []

for c in subreddit_gaming.comments(limit=10):
    comments_gaming.append((c.body, c.score))
    
for c in subreddit_xboxone.comments(limit=10):
    comments_xboxone.append((c.body, c.score))    

for c in subreddit_ps4.comments(limit=10):
    comments_ps4.append((c.body, c.score))

##for c in subreddit_xbox360.comments(limit=10):
  ##  comments_xbox360.append((c.body, c.score))

for c in subreddit_wii.comments(limit=10):
    comments_wii.append((c.body, c.score))
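
The four loops above differ only in the subreddit they read from, so they could be folded into one helper. A minimal sketch, assuming each argument behaves like the subreddit objects created earlier (calling `.comments(limit=...)` yields items with `body` and `score`); the helper names are hypothetical.

```python
def collect_comments(subreddit, limit=10):
    """Return (body, score) pairs for the newest comments in a subreddit."""
    return [(c.body, c.score) for c in subreddit.comments(limit=limit)]


def collect_all(subreddits, limit=10):
    """Map each name in a {name: subreddit} dict to its list of pairs."""
    return {name: collect_comments(sub, limit) for name, sub in subreddits.items()}


# Usage with the objects defined earlier:
# recent = collect_all({"gaming": subreddit_gaming, "ps4": subreddit_ps4,
#                       "xboxone": subreddit_xboxone, "wii": subreddit_wii})
```

Keeping the results in one dictionary keyed by subreddit name also preserves the platform label for each comment.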


In [6]:
print(comments_gaming[0:10])
print('*'* 50)

print(comments_ps4[0:10])
print('*'* 50)

##print(comments_xbox360[0:10])
##print('*'* 50)

print(comments_xboxone[0:10])
print('*'* 50)

print(comments_wii[0:10])


[(' NO. 1st ONE!', 1), ('Why not let people enjoy what they are doing. No point in stopping them. ', 1), ("Should I make this post again or does it add nothing to the thread and I'm only doing it because I'm a loser? Nah I'm right I'll do it again - u/tastyboye", 1), ('remember when bad rats was the only grossly low quality game on steam? good times.', 1), ("I was wondering if Play E had any other outlets overseas but it seems they're entirely local. Hello there fellow singaporean. ", 1), ('Half-life is a classic. Usually library updates. ', 1), ('r/sbubby ', 1), ('2nd one', 1), ('Praise be Randy, we give thanks for the 5 fucking cargo drops in a row filled with human skin cowboy hats. Amen.', 1), ('I honestly had the same game from block buster', 1)]
**************************************************
[('Yes it is great', 1), ('https://www.google.com/amp/s/www.dailymail.co.uk/news/article-4190178/amp/Jaws-death-story-USS-Indianapolis.html', 1), ('Detroit Become Human ', 1), ("Because I

In [16]:
html_string = '''
    <table>
        <tr>
            <td> Hello! </td>
            <td> Table </td>
        </tr>
    </table>
'''

soup = BeautifulSoup(html_string, 'lxml') # Parse the HTML as a string

table = soup.find_all('table')[0] # Grab the first table

new_table = pd.DataFrame(columns=range(0,2), index = [0]) # I know the size 

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1

new_table

In [15]:
import csv
with open("/home/khickman/IUMSDS/SMM/Project/data/25_Aug_2018.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    d = list(reader)
print (d) # 248

[['Pos ', 'Game ', 'Weeks to Launch ', 'Weekly Change ', 'Total'], ['1 ', ''], ['Spider-Man (PS4) Wiki | Gamewise'], ['', 'Spider-Man (PS4) (PS4)'], ['Sony Interactive Entertainment, Action-Adventure'], ['', '2 ', '13,142 ', '295,073'], ['2 ', ''], ['Super Smash Bros. (2018) Wiki | Gamewise'], ['', 'Super Smash Bros. (2018) (NS)'], ['Nintendo, Fighting'], ['', '15 ', '29,610 ', '272,702'], ['3 ', ''], ['Red Dead Redemption 2 Wiki | Gamewise'], ['', 'Red Dead Redemption 2 (PS4)'], ['Rockstar Games, Action-Adventure'], ['', '9 ', '6,474 ', '253,289'], ['4 ', ''], ['Red Dead Redemption 2 Wiki | Gamewise'], ['', 'Red Dead Redemption 2 (XOne)'], ['Rockstar Games, Action-Adventure'], ['', '9 ', '3,330 ', '176,801'], ['5 ', ''], ['Kingdom Hearts III Wiki | Gamewise'], ['', 'Kingdom Hearts III (PS4)'], ['Square Enix, Role-Playing'], ['', '23 ', '3,744 ', '169,742'], ['6 ', ''], ['Call of Duty: Black Ops IIII Wiki | Gamewise'], ['', 'Call of Duty: Black Ops IIII (PS4)'], ['Activision, Shooter']
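
In the printed rows above, each game appears to span five consecutive rows after the header: position, wiki link text, title with platform, publisher with genre, and the three numeric columns. A sketch of regrouping them into one record per game follows. It assumes that five-row pattern holds for the whole file and that the numeric fields are always comma-formatted integers, neither of which has been verified beyond the rows shown; the function names are hypothetical.

```python
def to_int(text):
    """Convert strings such as '13,142 ' to int."""
    return int(text.strip().replace(",", ""))


def group_preorder_rows(rows, rows_per_game=5):
    """Turn the ragged rows after the header into one dict per game.

    Incomplete trailing groups are skipped.
    """
    body = rows[1:]  # drop the header row
    games = []
    for start in range(0, len(body) - rows_per_game + 1, rows_per_game):
        pos, _wiki, title, pub_genre, stats = body[start:start + rows_per_game]
        # Split on the last ", " so the genre is whatever follows it
        publisher, genre = pub_genre[0].rsplit(", ", 1)
        games.append({
            "pos": to_int(pos[0]),
            "game": title[1].strip(),
            "publisher": publisher,
            "genre": genre,
            "weeks_to_launch": to_int(stats[1]),
            "weekly_change": to_int(stats[2]),
            "total": to_int(stats[3]),
        })
    return games


# Usage with the list read above:
# preorder_records = group_preorder_rows(d)
# preorder_df = pd.DataFrame(preorder_records)
```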

In [6]:
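# pandas has no read_txt; read_csv reads tab-delimited text when sep="\t".
# The path points at the file loaded with csv.reader in the previous cell.
preorder = pd.read_csv('~/IUMSDS/SMM/Project/data/25_Aug_2018.txt', sep="\t", header=None)
# preorder.columns = ["a", "b", "c", "etc."]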

# Getting comments on submissions

Getting comments on submissions is a little complicated. With users or entire subreddits, it's simple, because there is a one-to-many correspondence between user/subreddit and the comments it has. However, comments on a submission are organized in a *tree-like structure*; that is, the submission itself may have comments, and those comments may have comments on them, and so on. Because of this, we don't have helpful organizing functions like ``new`` or ``top``. We have to get them all and organize them ourselves. 

First, let's get the top submissions of the past week from a subreddit and try to get the *top-level comments* on the first of them. 

In [7]:
newssubreddit = r.subreddit("news")

submissions = []

for submission in newssubreddit.top(limit=5, time_filter="week"):
    submissions.append(submission)
    
s = submissions[0] # Let's work with the first submission

In [8]:
comments = []

for top_level_comment in s.comments:
    comments.append(top_level_comment)

In [9]:
# Text of the first top-level comment
comments[0].body

"Teacher here. Just wanted to chime in and explain why these policies exist, as it was explained to me years ago:\n\nThe idea is that if a student can't receive lower than a 50 per marking period, there is never a point where it is impossible for them to pass for the year. Technically, they could not show up for three quarters, pull a 100 in the fourth, and still pass for the year. \n\nNow, an optimist would say this is a good thing, as it means the students will always have that opportunity to make a comeback. Particularly in low-income districts that lack parent engagement, the last thing you want is a kid realizing they can't possibly pass for the year and deciding to spend their day on the street instead of wasting it in school. I've seen firsthand students who really only kept coming to class because they still had that chance -- however small -- to pass for the year.\n\nThe more pessimistic view is what many have already pointed out in these comments: that it allows schools to ke

Comments can have comments themselves. Here's how to extract the child comments of the first comment on the original submission.

In [10]:
replies = []

for reply in comments[0].replies:
    replies.append(reply)
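
The same loop can be applied recursively to walk the whole reply tree rather than a single level. A minimal sketch, assuming every real comment exposes `body` and an iterable `replies`, and that anything without a `body` is a placeholder to skip; the function name `walk_comments` is hypothetical.

```python
def walk_comments(comments, depth=0):
    """Yield (depth, comment) for every comment in a nested reply tree.

    Entries without a `body` attribute (placeholders rather than real
    comments) are skipped instead of raising AttributeError.
    """
    for comment in comments:
        if not hasattr(comment, "body"):
            continue
        yield depth, comment
        yield from walk_comments(comment.replies, depth + 1)


# Usage with the submission selected earlier:
# for depth, c in walk_comments(s.comments):
#     print("  " * depth + c.body[:60])
```

Tracking the depth keeps the parent/child relationship recoverable even after the comments are flattened into a table.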

In [11]:
replies[0].body

"For extenuating circumstances and crises, a blank or N/A is sufficient and won't hurt or help a grade. You could always assign an extra credit assignment that takes a lot of effort or allow past work to be made up for half credit. Basically, you can still give students a chance, but they have to bust their ass for a bit and show that they want it. I see way too many students who have been pushed through the system bragging that they do no work and still pass."

PRAW deals with Reddit rate limitations on comments by inserting "MoreComments" objects into the comment tree. For example, at the time of writing this code, the fifth (last) item in replies is a "MoreComments" object.

In [12]:
replies

[Comment(id='e6m8zqf'),
 Comment(id='e6mbi4r'),
 Comment(id='e6macx5'),
 Comment(id='e6mk93s'),
 <MoreComments count=132, children=['e6m9i4i', 'e6nnt3a', 'e6mn1my', '...']>]

We can open up a MoreComments object, but this requires sending another request to Reddit.

In [93]:
mc = replies[-1]

comments = []
for c in mc.comments():
    comments.append(c)

In [94]:
comments[0].body

'Thank you for your work'

If you work with Reddit comment forests, make sure your code handles "MoreComments" objects gracefully. If you need help writing the code to do this, let me know.
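
One common way to handle these placeholders is PRAW's `replace_more` method on a submission's comment forest, followed by `list()` to flatten the tree. The wrapper below is only a sketch: `flatten_comments` and its `more_limit` parameter are hypothetical names, and the default of `0` (drop placeholders without fetching them) is an assumption chosen to avoid extra requests.

```python
def flatten_comments(submission, more_limit=0):
    """Return every loaded comment on a submission as a flat list.

    replace_more(limit=0) removes all MoreComments placeholders without
    fetching them; a larger limit (or None for no limit) fetches more
    comments at the cost of extra requests to Reddit.
    """
    submission.comments.replace_more(limit=more_limit)
    return submission.comments.list()


# Usage with the submission selected earlier:
# all_comments = flatten_comments(s)
# bodies = [c.body for c in all_comments]
```

After `replace_more(limit=0)` the flattened list should contain only real comments, so accessing `.body` on each item is safe.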