# Getting data from an API
This notebook walks you through the steps of collecting data from Reddit using the Pushshift.io API.

We will use the **Python Pushshift.io API Wrapper (PSAW)**, which is documented at https://psaw.readthedocs.io/en/latest/

### Import package
This wrapper package lets you search public Reddit submissions and comments.

In [1]:
from psaw import PushshiftAPI
# https://pushshift.io/
# -> lists the API endpoints, which let you retrieve or search
# comments, submissions, and subreddits
# -> documentation: https://github.com/pushshift/api

import pandas as pd
# -> pandas is a library for working with tables (DataFrames) in Python
# -> it can also read CSV and Excel files


api = PushshiftAPI()
# -> if you type "api.", then press TAB after this point, 
# then you will see the functions.

# NOTE: API calls need an Internet connection.
# If the connection drops, the request stops.

# -> Using an API is mostly about understanding the structure of the data
# you get back and reading its documentation.

### Get the 5 most recent posts in all of Reddit

In [2]:
posts = api.search_submissions(limit=5, filter=['full_link','author', 'title', 'subreddit', 'created_utc'])
results = list(posts)

# -> setting a limit keeps the request small; retrieving a very large
# amount of data might cause the program to crash or stop
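The `list(posts)` step matters because, according to the PSAW documentation, `search_submissions` returns a generator rather than a list. The sketch below illustrates that behaviour offline. `fake_search` is a hypothetical stand-in, not part of PSAW, so no Internet connection is needed to run it.

```python
def fake_search(limit):
    """Hypothetical stand-in for api.search_submissions: yields items lazily."""
    for i in range(limit):
        yield {"title": f"post {i}"}


posts = fake_search(limit=5)
results = list(posts)    # consumes the generator and stores every item
leftover = list(posts)   # the generator is exhausted, so nothing is left

print(len(results))
print(len(leftover))
```

If the real results behave the same way, keep the list (or the DataFrame built from it) and reuse that, rather than looping over `posts` a second time.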



How do you get the first element?

In [3]:
# your code here
results[0]

submission(author='steven_exit', created_utc=1649505353, full_link='https://www.reddit.com/r/wichsbros3/comments/tzrom7/telegram_gruppe_mit_corinna_kopf_elena_kamperi/', subreddit='wichsbros3', title='Telegram Gruppe mit Corinna kopf Elena kamperi etc. täglich 100 onlyfans ,ohne invites, Tausche gegen deutsche Teens bei Interesse dm', created=1649476553.0, d_={'author': 'steven_exit', 'created_utc': 1649505353, 'full_link': 'https://www.reddit.com/r/wichsbros3/comments/tzrom7/telegram_gruppe_mit_corinna_kopf_elena_kamperi/', 'subreddit': 'wichsbros3', 'title': 'Telegram Gruppe mit Corinna kopf Elena kamperi etc. täglich 100 onlyfans ,ohne invites, Tausche gegen deutsche Teens bei Interesse dm', 'created': 1649476553.0})

In [5]:
type(results[0])
# -> As we can see, each result is a PSAW submission object, not a plain
# dictionary, which is why we cannot work with it like one directly.
# -> the d_={} part is the JSON object. It repeats the fields of the
# submission as a dictionary.

psaw.PushshiftAPI.submission
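Judging from the printed result, each item exposes its fields both as attributes and through the `d_` dictionary. The sketch below mimics that shape with a hypothetical `Submission` namedtuple and made-up values. It is only an illustration of the two access styles, not PSAW's actual class definition.

```python
from collections import namedtuple

# Hypothetical stand-in shaped like the output above (not PSAW's own class)
Submission = namedtuple("submission", ["author", "subreddit", "created_utc", "d_"])

fields = {"author": "example_user", "subreddit": "philippines", "created_utc": 1649505353}
item = Submission(**fields, d_=fields)

by_attribute = item.author     # attribute access on the object itself
by_key = item.d_["author"]     # key access on the dictionary stored in d_

print(by_attribute, by_key)
```

With a real result, the equivalent calls would be `results[0].author` and `results[0].d_['author']`, assuming the object behaves as its printed form suggests.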

Check if you actually got only 5 results.

In [6]:
# your code here
len(results)

5

### Get the 5 most recent posts from r/philippines

In [7]:
posts = api.search_submissions(limit=5, subreddit="philippines", filter=['full_link','author', 'title', 'subreddit', 'created_utc', 'selftext'])
posts_df = pd.DataFrame([thing.d_ for thing in posts])
# NOTE: each result's d_ dictionary becomes one row of the DataFrame.

Display the `DataFrame`

In [12]:
# your code here
posts_df

# -> created_utc and created use epoch time (seconds since 1970-01-01 UTC),
# which is why the dates show up as plain numbers

# TWO WAYS TO ACCESS A SPECIFIC COLUMN:
# 1. posts_df.full_link
# 2. posts_df['full_link'] -> the same bracket syntax we use for dictionaries

# for a specific column of a specific row -> posts_df['full_link'][0]

| | author | created_utc | full_link | selftext | subreddit | title | created |
|---|---|---|---|---|---|---|---|
| 0 | ziesPrime95 | 1649505997 | https://www.reddit.com/r/Philippines/comments/... | | Philippines | Sobrang kutya sa teleprompter pero ano to? 🤡 | 1649477000.0 |
| 1 | venom029 | 1649505893 | https://www.reddit.com/r/Philippines/comments/... | | Philippines | BBM Solid Kakampinks ✌️❤️💚🌷 | 1649477000.0 |
| 2 | Whenthingsgotwrong | 1649505813 | https://www.reddit.com/r/Philippines/comments/... | Pinoys of reddit what is the worst punishment ... | Philippines | worst punishment | 1649477000.0 |
| 3 | joshuuuu214 | 1649505807 | https://www.reddit.com/r/Philippines/comments/... | | Philippines | A Change of Heart | 1649477000.0 |
| 4 | Bombooclat | 1649505789 | https://www.reddit.com/r/Philippines/comments/... | Hi! Just a short preface na hindi na ako masya... | Philippines | How does Leni's legal team work? | 1649477000.0 |
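The epoch values in `created_utc` are hard to read as dates. A minimal sketch of converting them with pandas, using two values copied from the table above. The column name `created_at` is only an illustrative choice.

```python
import pandas as pd

# Two created_utc values copied from the table above
df = pd.DataFrame({"created_utc": [1649505997, 1649505893]})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
df["created_at"] = pd.to_datetime(df["created_utc"], unit="s")

print(df)
```

The resulting timestamps are in UTC. Philippine local time is 8 hours ahead.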


Retrieve the `full_link` of the first row

In [15]:
# your code here
posts_df['full_link'][0]

'https://www.reddit.com/r/Philippines/comments/tvx9ns/question_about_sss/'
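Note that `posts_df['full_link'][0]` looks up the row *labelled* 0, which happens to be the first row here. The sketch below, on made-up rows, shows position-based access with `.iloc`, which keeps working after rows are filtered or reordered.

```python
import pandas as pd

# Made-up rows standing in for posts_df
df = pd.DataFrame({
    "author": ["a", "b", "c"],
    "full_link": [
        "https://example.com/1",
        "https://example.com/2",
        "https://example.com/3",
    ],
})

# After filtering, the remaining index labels are 1 and 2, not 0 and 1
subset = df[df["author"] != "a"]

first_by_position = subset["full_link"].iloc[0]   # first row, whatever its label
first_by_label = subset.loc[1, "full_link"]       # row whose index label is 1

print(first_by_position)
```

On `subset`, the expression `subset['full_link'][0]` would raise a `KeyError`, because no row carries the label 0 any more.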

### Get posts before March 11, 2021 in r/philippines

In [16]:
import datetime as dt

sub="philippines"
start="2021-03-11"

start_date=pd.to_datetime(start)
# -> pandas converts the string into a datetime (Timestamp) object,
# which can then be converted to an epoch value

start_epoch=int(start_date.timestamp())
# -> timestamp() returns the epoch as a float; we convert it to an
# integer because the API expects an integer

# -> We are going to get the 10 posts just before March 11, 2021
posts = api.search_submissions(limit=10, 
                               subreddit=sub, 
                               before=start_epoch,
                               filter=['full_link','author', 'title', 'subreddit', 'created_utc'])
posts_df = pd.DataFrame([thing.d_ for thing in posts])



In [17]:
# Display the dataframe
posts_df

| | author | created_utc | full_link | subreddit | title | created |
|---|---|---|---|---|---|---|
| 0 | Intelligent_Ear3155 | 1615420404 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Cuzette is a good jewelry brand, they offer go... | 1615392000.0 |
| 1 | ladyfromthedarkside | 1615419908 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Makati’s strict implementation of wearing of f... | 1615391000.0 |
| 2 | Logical_Ad_3556 | 1615419483 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Hong Kong Toymakers Are Philippines’ New Targe... | 1615391000.0 |
| 3 | setardo | 1615418893 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Early Morning Coconut Trees View - Siargao, Ph... | 1615390000.0 |
| 4 | CommunicationFar116 | 1615418058 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Filipino on Guam Musician | 1615389000.0 |
| 5 | Reach_Round | 1615417483 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Crypto to Peso ? | 1615389000.0 |
| 6 | VeterinarianDry7601 | 1615415742 | https://www.reddit.com/r/Philippines/comments/... | Philippines | https://app.shopback.com/pK2fNgYuweb | 1615387000.0 |
| 7 | luvie06 | 1615414525 | https://www.reddit.com/r/Philippines/comments/... | Philippines | PLS ANSWER I need this for my research :(( | 1615386000.0 |
| 8 | the_yaya | 1615413301 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Daily random discussion - Mar 11, 2021 | 1615385000.0 |
| 9 | threehappypenguins | 1615411232 | https://www.reddit.com/r/Philippines/comments/... | Philippines | Mail Forwarding Service | 1615382000.0 |
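The date string is turned into an epoch without any timezone information, and pandas treats such a timestamp as UTC. A short sketch of what that means for the March 11, 2021 cutoff, compared with midnight in the Philippines. The timezone name `Asia/Manila` is an addition for illustration and does not appear in the original cell.

```python
import pandas as pd

start = "2021-03-11"

# A timestamp without a timezone is treated as UTC by .timestamp()
utc_epoch = int(pd.Timestamp(start).timestamp())

# Midnight in the Philippines (UTC+8) happens 8 hours earlier
manila_epoch = int(pd.Timestamp(start, tz="Asia/Manila").timestamp())

print(utc_epoch, manila_epoch)
print(utc_epoch - manila_epoch)  # difference in seconds
```

The first row of the table above has `created_utc` 1615420404, just under the UTC cutoff of 1615420800, which is consistent with the cutoff being midnight UTC rather than midnight local time.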


## Challenge 

Let's put this in a loop to get the first 20 posts for every day of a month.

In [14]:
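# getting up to 20 posts for every day of March 2022
start = '2022-03-01'
end = '2022-03-31'

# one timestamp per day, from start to end inclusive
dates = pd.date_range(start=start, end=end, freq='D')

daily_frames = []

for date in dates:
    # each request covers one day: from midnight to the next midnight (UTC)
    day_start = int(date.timestamp())
    day_end = int((date + pd.Timedelta(days=1)).timestamp())
    # results come back newest first (as in the outputs above), so these
    # are the 20 latest posts of each day
    posts = api.search_submissions(limit=20, 
                               subreddit=sub, 
                               after=day_start,
                               before=day_end,
                               filter=['full_link','author', 'title', 'subreddit', 'created_utc'])
    daily_frames.append(pd.DataFrame([thing.d_ for thing in posts]))

# DataFrame.append never modified posts_df in place, so we collect the
# daily tables in a list and combine them once at the end
posts_df = pd.concat(daily_frames, ignore_index=True)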
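Before sending a month of requests, the date arithmetic for the loop can be checked offline. The sketch below builds one `(after, before)` epoch pair per day and combines per-day tables with `pd.concat`. Both `daily_windows` and `fake_fetch` are hypothetical helpers written for this illustration, and `fake_fetch` returns made-up rows in place of a real API call.

```python
import pandas as pd


def daily_windows(start, end):
    """Return one (after, before) pair of epoch seconds per day, end inclusive."""
    days = pd.date_range(start=start, end=end, freq="D")
    one_day = pd.Timedelta(days=1)
    return [(int(d.timestamp()), int((d + one_day).timestamp())) for d in days]


def fake_fetch(after, before):
    """Hypothetical stand-in for the API call: one made-up row per day."""
    return [{"created_utc": after, "title": "placeholder"}]


windows = daily_windows("2022-03-01", "2022-03-31")

# Build one small DataFrame per day, then combine them in a single step
frames = [pd.DataFrame(fake_fetch(after, before)) for after, before in windows]
posts_df = pd.concat(frames, ignore_index=True)

print(len(windows), windows[0])
print(posts_df.shape)
```

Once the windows look right, `fake_fetch` can be swapped for a function that calls `api.search_submissions` with the same `after` and `before` values.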