## Import Libraries

In [2]:
import pandas as pd
import requests, json, time, datetime

import praw

## PRAW Authorization

- **Step 1** Log into your Reddit account here: https://www.reddit.com/prefs/apps to identify your id, secret, username and password.

- **Step 2** Save your credential as following in a json file.

```
{"id": "Enter your id" ,
"secret": "Enter your secret",
"user": "Enter your username",
"pass": "Enter your password"}```

- **Step 3** Name the json file "creds.json" and save it in the directory this jupyter notebook is located.

In [3]:
# Use your reddit credentials to instantiate the reddit class from PRAW

creds_file = open('../assets/creds.json', 'r')

reddit_creds = json.loads(creds_file.read())

In [4]:
# Check if logged in
reddit = praw.Reddit(
    client_id = reddit_creds['id'],
    client_secret = reddit_creds['secret'],
    username = reddit_creds['user'],
    password = reddit_creds['pass'],
    user_agent = 'dae')

# Return False, if you're logged in
reddit.read_only

False

## Scrapping

### Function: scrape_reddits_new()

Use class **subreddit** to scrape 'new' items.

The following from Reddit posts were collected.
- *title*: Title of the reddit post
- *body*: Body of the reddit post
- *coms*: Comments on the reddit post
- *num_coms*: Number of comments in the reddit post
- *age*: Number of days the reddit has been up since it was scraped.
- *subreddit*: Subreddit name
___
### Parameters
- `subreddit_name`: *str* | subreddit name
- `save_csv`: *boolean* | if True, save resulting dataframe as csv
- `save_path`: *str* | path to the directory to save resulting dataframe

Visit the documentation page for subreddit class [here](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html).

In [13]:
def scrape_reddits_new(subreddit_name ,save_csv = False, save_path = './'):
    red = reddit.subreddit(subreddit_name)
    new_spot = [i for i in red.new(limit = 2_000)]

    title = [post.title for post in new_spot]
    body = [post.selftext for post in new_spot]
    num_coms = [post.num_comments for post in new_spot]
    coms = [[comment.body for comment in post.comments] \
            if (post.num_comments > 0) else 0 for post in new_spot]
    age = [(time.time() - post.created_utc)/60/1440 for post in new_spot]
    
    spot_dict = {
        'title': title,
        'body': body,
        'coms': coms,
        'num_coms': num_coms,
        'age': age,
        'subreddit': 'spotify'} 
    
    spot_df = pd.DataFrame(spot_dict)
    
    if save_csv:
        spot_df.to_csv(save_path ,index=False)
    
    return pd.DataFrame(spot_dict)

### Scrape hot items from Spotify subreddit

In [20]:
spot= reddit.subreddit('Spotify')

In [21]:
hot_spot = [i for i in spot.hot(limit = 2_000)]

In [22]:
title = [post.title for post in hot_spot]
body = [post.selftext for post in hot_spot]

num_coms = [post.num_comments for post in hot_spot]
coms = [[comment.body for comment in post.comments] \
        if (post.num_comments > 0) else 0 for post in hot_spot]
age = [(time.time() - post.created_utc)/60/1440 for post in hot_spot]

In [23]:
spot_dict = {
    'title': title,
    'body': body,
    'coms': coms,
    'num_coms': num_coms,
    'age': age,
    'subreddit': 'spotify'} 

In [24]:
spot_df = pd.DataFrame(spot_dict)

In [26]:
spot_df.to_csv('../data/web_data/spot_hot_101419.csv',index=False)

### Scrape controversial items from Spotify subreddit

In [38]:
cont_spot = [i for i in spot.controversial(limit = 2_000)]

In [39]:
title = [post.title for post in cont_spot]
body = [post.selftext for post in cont_spot]
num_coms = [post.num_comments for post in cont_spot]
coms = [[comment.body for comment in post.comments] \
        if (post.num_comments > 0) else 0 for post in cont_spot]
age = [(time.time() - post.created_utc)/60/1440 for post in cont_spot]

In [40]:
spot_dict = {
    'title': title,
    'body': body,
    'coms': coms,
    'num_coms': num_coms,
    'age': age,
    'subreddit': 'spotify'
    } 

In [41]:
spot_df = pd.DataFrame(spot_dict)

In [44]:
spot_df.to_csv('../data/web_data/spot_cont_101419.csv',index=False)

### Scrape new items from Spotify subreddit 
use 'scrap_reddits_new()'

In [15]:
spot_df = scrape_reddits_new('spotify', save_csv = True, save_path = '../data/spot_new_101419.csv')

In [16]:
spot_df.head()

Unnamed: 0,title,body,coms,num_coms,age,subreddit
0,"Melodic rap playlist, your goto spot for daily...",,0,0,0.017879,spotify
1,Southern Rock and American Rock,,0,0,0.019338,spotify
2,Spotify from Pandora,How do I make awful songs go away! Such bad mu...,0,0,0.022393,spotify
3,I literally can't find a way to listen to a fu...,"I have spotify free, so I can't really go that...","[uninterrupted*, sorry, You need Spotify premi...",5,0.035148,spotify
4,Acoustic Guitar Attic - Beautiful acoustic ins...,,0,0,0.044037,spotify


### Scrape hot items from Apple Music subreddit

In [56]:
app.hot(limit = 2_000)

<praw.models.listing.generator.ListingGenerator at 0x1107f3198>

In [57]:
hot_app = [i for i in app.hot(limit = 2_000)]

In [58]:
title = [post.title for post in hot_app]
body = [post.selftext for post in hot_app]
num_coms = [post.num_comments for post in hot_app]
coms = [[comment.body for comment in post.comments] \
        if (post.num_comments > 0) else 0 for post in hot_app]
age = [(time.time() - post.created_utc)/60/1440 for post in hot_app]

In [59]:
app_dict = {
    'title': title,
    'body': body,
    'coms': coms,
    'num_coms': num_coms,
    'age': age,
    'subreddit': 'apple_music'} 

In [60]:
app_df = pd.DataFrame(app_dict)

In [63]:
app_df.to_csv('../data/web_data/app_hot_101419.csv',index=False)

### Scrape controversial items from Apple Music subreddit

In [64]:
app.controversial(limit = 2_000)

<praw.models.listing.generator.ListingGenerator at 0x111d1a080>

In [65]:
cont_app = [i for i in app.controversial(limit = 2_000)]

In [66]:
title = [post.title for post in cont_app]
body = [post.selftext for post in cont_app]
num_coms = [post.num_comments for post in cont_app]
coms = [[comment.body for comment in post.comments] \
        if (post.num_comments > 0) else 0 for post in cont_app]
age = [(time.time() - post.created_utc)/60/1440 for post in cont_app]

In [67]:
app_dict = {
    'title': title,
    'body': body,
    'coms': coms,
    'num_coms': num_coms,
    'age': age,
    'subreddit': 'apple_music'} 

In [68]:
app_df = pd.DataFrame(app_dict)

In [71]:
app_df.to_csv('../data/web_data/app_cont_101419.csv',index=False)

### Scrape new items from AppleMusic subreddit 
use 'scrap_reddits_new()'

In [27]:
app_df = scrape_reddits_new('AppleMusic', save_csv = True, save_path = '../data/app_new_101419.csv')

In [29]:
app_df.head()

Unnamed: 0,title,body,coms,num_coms,age,subreddit
0,"Bedroom Diamonds, slow tunes for fast nights (...",,0,0,0.035848,spotify
1,To get you through the dark days of fall and w...,,0,0,0.074054,spotify
2,Brandon Kai & Kory Ryan - Overtime #Newmusic,[https://music.apple.com/us/album/overtime-sin...,0,0,0.109031,spotify
3,Switching from Google Play Music - Questions,In light of Google's recent news to end GPM an...,"[Yeah, you can have it all in the same playlis...",7,0.254355,spotify
4,Apple Music gone haywire after App Store count...,Switched countries on the App Store and Apple ...,"[I just want the service im paying for..., Did...",3,0.260142,spotify


# Result

'new', 'hot', and 'controversial' items from 'Spotify' and 'AppleMusic' Subreddits were scraped. I scraped 'new' items daily using 'scrap_reddits_new()' function to collect new reddit posts. A total of 7368 reddit posts were scraped.