<div style="text-align: center;">
    <img src="../images/ga_logo_large.png">
</div>

---
## **Project 3: Web APIs and NLP**

---
**Script that uses the Reddit API to scrape data**

In [1]:
# imports
import pandas as pd
import os
import requests
import time
import getpass
from datetime import datetime

---
**Access Info**

In [3]:
client_id = getpass.getpass()       # alphanumeric string provided under "personal use script"
client_secret = getpass.getpass()   # alphanumeric string provided as "secret"
user_agent = getpass.getpass()      # name of application
username = getpass.getpass()        # Reddit username
password = getpass.getpass()        # Reddit password

 ········
 ········
 ········
 ········
 ········
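
Typing five secrets on every run can get tedious. A minimal sketch of one alternative, assuming the credentials may be stored in environment variables: the helper `get_credential` and the variable name `REDDIT_CLIENT_ID` are hypothetical and not part of the original notebook. The helper falls back to the same `getpass` prompt when the variable is unset.

```python
import getpass
import os


def get_credential(env_name, prompt, env=None):
    """Return a credential from the environment, prompting only if it is unset."""
    # `env` can be any mapping; it defaults to the real process environment
    env = os.environ if env is None else env
    value = env.get(env_name)
    if value:
        return value
    # fall back to the interactive prompt used in the cell above
    return getpass.getpass(prompt)


# Demonstration with a plain dict standing in for os.environ (made-up value)
demo_env = {'REDDIT_CLIENT_ID': 'abc123'}
client_id_demo = get_credential('REDDIT_CLIENT_ID', 'client id: ', env=demo_env)
print(client_id_demo)
```

In real use the `env` argument would be omitted, so the value is read from `os.environ` and the prompt appears only when the variable is missing.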


**Parsing Function**

In [5]:
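def parse_data(response, subreddit):
    """Append one batch of posts to the subreddit's CSV; return (last post id, DataFrame)."""
    # each child of the listing is one post
    batch = response.json()['data']['children']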

    data = []
    output_path = '../data/reddit--' + subreddit + '.csv'

    for item in batch:
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        post = {
            'post_id' : item['data']['name'],
            'post_title' : item['data']['title'],
            'post_text' : item['data']['selftext'],
            'published_on': item['data']['created_utc'],  # Unix epoch seconds (UTC)
            'scraped_on': timestamp  # local time of this scrape
        }
        data.append(post)
        
    # ===== ChatGPT was consulted for some of the lines in this block =====
    # stop early if the batch contained no posts
    if not data:
        print("No new posts to process, try later")
        return None, pd.DataFrame()
    # ======================================================================

    # build df
    df = pd.DataFrame(data)
    
    # check if file already exists to write header
    if not os.path.isfile(output_path): # <<<---- chatgpt help for this logic
        # write header if file does not exist
        df.to_csv(output_path, index = False)
    else:
        # append to existing file, omit header
        df.to_csv(output_path, mode = 'a', header = False, index = False)
    
    # fetch last post id
    last_id = data[-1]['post_id']
    
    return last_id, df
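
Because `parse_data` appends to the CSV on every run, re-running the scraper can write the same post more than once, and `published_on` is stored as raw epoch seconds. A minimal sketch of a possible clean-up step, using made-up rows in the same column layout; the sample values and the choice to keep the most recent scrape are assumptions, not the notebook's method.

```python
import pandas as pd

# Made-up rows shaped like the CSV written by parse_data;
# 't3_aaa' appears twice, as if it had been scraped on two separate runs
rows = [
    {'post_id': 't3_aaa', 'post_title': 'First', 'post_text': '',
     'published_on': 1700000000.0, 'scraped_on': '2023-11-15 10:00:00'},
    {'post_id': 't3_bbb', 'post_title': 'Second', 'post_text': 'body',
     'published_on': 1700000600.0, 'scraped_on': '2023-11-15 10:00:00'},
    {'post_id': 't3_aaa', 'post_title': 'First', 'post_text': '',
     'published_on': 1700000000.0, 'scraped_on': '2023-11-16 09:30:00'},
]
posts = pd.DataFrame(rows)

# convert epoch seconds to timezone-aware UTC timestamps
posts['published_on'] = pd.to_datetime(posts['published_on'], unit='s', utc=True)

# keep only the most recently scraped copy of each post
deduped = posts.drop_duplicates(subset='post_id', keep='last').reset_index(drop=True)
print(deduped[['post_id', 'published_on', 'scraped_on']])
```

With a real file, the same two steps could be applied to the result of `pd.read_csv` on the path that `parse_data` writes to.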

---
## **Full Scraping Script**

Run this cell to scrape data.
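
The scraping cell pages through each subreddit by passing the id of the last post received as the `after` parameter of the next request, and it stops when a request returns no posts. A minimal offline sketch of that cursor pattern follows; `fetch_page` is a hypothetical stand-in for the HTTP request and the post ids are made up.

```python
def fetch_page(after=None, limit=2):
    """Stand-in for one listing request: return up to `limit` ids after the cursor."""
    all_ids = ['t3_a', 't3_b', 't3_c', 't3_d', 't3_e']  # made-up post ids
    start = 0 if after is None else all_ids.index(after) + 1
    return all_ids[start:start + limit]


def scrape_all(limit=2):
    """Follow the cursor until a page comes back empty."""
    collected = []
    after = None  # no cursor on the first request
    while True:
        page = fetch_page(after=after, limit=limit)
        if not page:
            break  # an empty page means there is nothing left to fetch
        collected.extend(page)
        after = page[-1]  # the last id becomes the cursor for the next request
    return collected


result = scrape_all()
print(result)
```

The loop makes one final request that returns nothing, which is how it learns that the listing is exhausted.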

In [7]:
# authorization request and initial connection-----------------------------------------

#-----------  these lines are adapted from the API walkthrough ------------------------
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

data = {'grant_type': 'password',
       'username': username,
       'password': password}

headers = {'User-Agent': user_agent}  # application name entered in the Access Info cell

response = requests.post('https://www.reddit.com/api/v1/access_token',
                        auth = auth,
                        data = data,
                        headers = headers)

# response confirmation: stop here on failure, since no token can be read from a failed response
if response.status_code == 200:
    print('Established connection')
else:
    raise RuntimeError(f'Failed to connect (status {response.status_code}), troubleshoot...')

# retrieve access token and attach it to all subsequent requests
token = response.json()['access_token']
headers['Authorization'] = f'bearer {token}'
# verify the token with an authenticated request
if requests.get('https://oauth.reddit.com/api/v1/me', headers = headers).status_code == 200:
    print('Success, token retrieved')
else:
    print('Failed to retrieve token, troubleshoot...')
# --------------------------------------------------------------------------------------

# ----  API connection, data scraping and parsing --------------------------------------
# ----  Iteratively improved, some lines adapted/debugged after consulting with chatgpt
base_url = 'https://oauth.reddit.com/r/'
subreddits = ['RealEstate', 'travel']


    
for sub in subreddits:
    count = 1
    last = None # reset the pagination cursor for each new subreddit
    params = {'limit': 100, 'after': last} # requests omits parameters whose value is None

    while True: # scrape until no more posts are returned


        response = requests.get(base_url + sub,
                            headers = headers,
                            params = params)

        if response.status_code == 200:
            print(f"Scraping request number {count} for {sub} subreddit in progress. Connection status: {response.status_code}")
            last, df = parse_data(response, sub)

            if last is None:
                print(f"No more posts available for subreddit: {sub}")
                break
            # otherwise update the key to pass on to next request
            params['after'] = last

        else:
            print(f"Failed to scrape subreddit {sub}.  Status code: {response.status_code}")
            break

        # increase the request counter
        count += 1

        # wait 15 seconds before sending the next request
        time.sleep(15)


Established connection
Success, token retrieved
Scraping request number 1 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 2 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 3 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 4 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 5 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 6 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 7 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 8 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 9 for RealEstate subreddit in progress. Connection status: 200
Scraping request number 10 for RealEstate subreddit in progress. Connection status: 200
No new posts to process, try later
No more posts available for subreddit: RealEst