# Module 2: Reddit \& Bing Search APIs

In this demo I will demonstrate how to utilize the reddit API and Bing Search to pull news articles and posts as a source of external data.

First, I will show how to create a Reddit personal use script for accessing the Reddit API. This will require having a reddit account, if you don't have one, follow along using the provided excel file.

Then, we will all create a university account on Azure, and then create a Bing Search resource to access the Bing Search API.

Use this link to create a personal use script for the Reddit API [Click Here](https://www.reddit.com/prefs/apps)

## Load in Dependcies, pip install praw

In [2]:
!pip install praw
import praw

from datetime import datetime
from datetime import date

import pandas as pd
import re
import string
from google.colab import userdata



## Specify Reddit credentials and subreddits to be scraped

In [4]:
# Create a Reddit instance
reddit = praw.Reddit(client_id='4aygbvUFqWfGAllqJemgvQ',
                     client_secret=userdata.get('client_secret'),
                     user_agent='reddit_app/v1')


# not a secure way to store credentials, consider using a separate file, creating environment variables, keyvault, etc.

In [5]:
# Specify the subreddit names you want to retrieve posts from
subreddit_names = ['powsurf', 'gxor', 'exmormon', 'datascience']

## Pull in selected Post Attributes, store and convert to dataframe

In [21]:
# create an empty post_attributes list
post_attributes = []

for subreddit_name in subreddit_names:
    subreddit = reddit.subreddit(subreddit_name) # set subreddits
    posts = subreddit.top(time_filter='month', limit=20) # set post parameters

    for post in posts: # pull in the following post attributes
        post_attributes.append({
            'Title': post.title,
            'Content': post.selftext,
            'URL': post.url,
            'Date': datetime.utcfromtimestamp(post.created_utc).strftime('%Y-%m-%d'),
            'Provider': subreddit_name
        })

df_red = pd.DataFrame(post_attributes) # create dataframe
df_red['All_Text'] = df_red['Title'] + ' ' + df_red['Content'] # create all_text column
df_red['Provider'] = 'r/' + df_red['Provider'] # create provider column
df_red.head()

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Unnamed: 0,Title,Content,URL,Date,Provider,All_Text
0,Got out for a little urban powsurf in my city,,https://v.redd.it/m0yfs73e52bc1,2024-01-07,r/powsurf,Got out for a little urban powsurf in my city
1,POW TIME,,https://v.redd.it/rpn97sgteucc1,2024-01-16,r/powsurf,POW TIME
2,Getting some Midwest turns in!,Found this community and picked up a board a f...,https://v.redd.it/y4t90y3hjtdc1,2024-01-21,r/powsurf,Getting some Midwest turns in! Found this comm...
3,All good dogs deserve faceshots,,https://www.reddit.com/gallery/196xax6,2024-01-15,r/powsurf,All good dogs deserve faceshots
4,Yukiita Piatra,"Finished my first board a couple of days ago, ...",https://www.reddit.com/r/Powsurf/comments/190y...,2024-01-07,r/powsurf,Yukiita Piatra Finished my first board a coupl...


In [None]:
# df_red = pd.read_excel('Reddit_posts.xlsx')
# df_red.head()

## Clean Data Function (not in template??)

In [28]:
# from text corpora notebook
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
nltk.download('punkt')
nltk.download('stopwords')

def clean_text(text):
    # cleaned_text = BeautifulSoup(text, 'html.parser').get_text()
    if not isinstance(text, str):
        # Convert non-string types to string
        text = str(text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower() # lowercase
    text = text.strip() # remove extra whitespaces

    stop_words = set(stopwords.words("english")) # bring in stopwords
    tokens = nltk.word_tokenize(text) # tokenize
    cleaned_tokens = [token for token in tokens if token not in stop_words] # remove stopwords

    cleaned_text = ' '.join(cleaned_tokens) #rejoin tokens
    print(cleaned_text)
    return cleaned_text

df_red['Clean_All'] = df_red['All_Text'].apply(clean_text)
df_red.head()

got little urban powsurf city
pow time
getting midwest turns found community picked board weeks back lets keep thing going
good dogs deserve faceshots
yukiita piatra finished first board couple days ago still waiting snow go test run httpsredditcomlink190yxoavideohpk43yyfb2bc1player x200b
dumb question split surfers put bindings looks like pretty standard voile binding hardware something like throw bindings pack descent
aesmo vs grassroots ive burton backseat driver past two seasons looking upgrade nicer board ridden aesmo grassroots one prefer including price thanks
snowboard powsurfing anybody ever experienced using surfboard splitboard powsurf feel like lot shapes similar powsurfs good traction pad could work also easier find used market cheaper could use snowboardsplitboard times anybody tried would advise x200b
epoxy base hey powsurf reddit glad found community last year made powsurf old birch ply extra experimenting shape turned cool decided epoxy base thinking thatd good slick s

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Title,Content,URL,Date,Provider,All_Text,Clean_All
0,Got out for a little urban powsurf in my city,,https://v.redd.it/m0yfs73e52bc1,2024-01-07,r/powsurf,Got out for a little urban powsurf in my city,got little urban powsurf city
1,POW TIME,,https://v.redd.it/rpn97sgteucc1,2024-01-16,r/powsurf,POW TIME,pow time
2,Getting some Midwest turns in!,Found this community and picked up a board a f...,https://v.redd.it/y4t90y3hjtdc1,2024-01-21,r/powsurf,Getting some Midwest turns in! Found this comm...,getting midwest turns found community picked b...
3,All good dogs deserve faceshots,,https://www.reddit.com/gallery/196xax6,2024-01-15,r/powsurf,All good dogs deserve faceshots,good dogs deserve faceshots
4,Yukiita Piatra,"Finished my first board a couple of days ago, ...",https://www.reddit.com/r/Powsurf/comments/190y...,2024-01-07,r/powsurf,Yukiita Piatra Finished my first board a coupl...,yukiita piatra finished first board couple day...


## Filter dataframe to external URLs

In [30]:
filtered_df = df_red[~(df_red['URL'].str.startswith('https://www.reddit.com') | df_red['URL'].str.startswith('https://i.redd/it') | df_red['URL'].str.startswith('https://v.redd/it'))]
filtered_df.head()

Unnamed: 0,Title,Content,URL,Date,Provider,All_Text,Clean_All
0,Got out for a little urban powsurf in my city,,https://v.redd.it/m0yfs73e52bc1,2024-01-07,r/powsurf,Got out for a little urban powsurf in my city,got little urban powsurf city
1,POW TIME,,https://v.redd.it/rpn97sgteucc1,2024-01-16,r/powsurf,POW TIME,pow time
2,Getting some Midwest turns in!,Found this community and picked up a board a f...,https://v.redd.it/y4t90y3hjtdc1,2024-01-21,r/powsurf,Getting some Midwest turns in! Found this comm...,getting midwest turns found community picked b...
5,Dumb question for split surfers. Where do you ...,So it looks like these both have pretty standa...,https://i.redd.it/xod44yml5wcc1.jpeg,2024-01-17,r/powsurf,Dumb question for split surfers. Where do you ...,dumb question split surfers put bindings looks...
13,Happy New Year GXOR! Everyone stay safe! 🥂,,https://i.redd.it/2sg027d1qq9c1.jpeg,2024-01-01,r/gxor,Happy New Year GXOR! Everyone stay safe! 🥂,happy new year gxor everyone stay safe


In [32]:
filtered_df['URL'][21]

'https://i.redd.it/rfzcnhlthtbc1.jpeg'

**What is the purpose of filtering out internal URLs? When might you want to use straight reddit posts vs reddit posts linking to external sources?**

In order to find news via reddit. If you looked at internal URLs they would not lead to news, rather just more reddit posts.

## Bing Search API

Next, we will use Microsoft Azure to create a Bing Search Resource to Access the Bing Search API. Why not Google? Google got rid of their Google News Search API so Bing is what we've got!

Being by going to the [Azure Portal](https://portal.azure.com/#home) and creating a university account, then we will walk through the steps of creating a Bing Search resource and storing our secret keys within an environment variable.

In [33]:
import json
import os
from pprint import pprint
import requests

## Set up Bing Credentials \& Specify Search Query

In [35]:
# Set subscription key and endpoint variables
subscription_key = userdata.get('bing_secret')
endpoint = 'https://api.bing.microsoft.com/v7.0/news/search'

# Query term(s) to search for
query = "snowboard"

## Set Bing Search Parameters

In [36]:
mkt = 'en-US'
params = { 'q': query,                                                  # Search query string
          'mkt': mkt,                                                 # Only searching US currently
         'freshness': 'week',                                          # set freshness
         'count': 100}                                                # max articles returned
headers = { 'Ocp-Apim-Subscription-Key': subscription_key}

## Call to API, store info in JSON

In [37]:
# Call the API
try:
    response = requests.get(endpoint, headers=headers, params=params) # fill in parameters
    response.raise_for_status()


    json_data = response.json()


    with open('response.json', 'w', encoding='utf-8') as json_file:
        json.dump(json_data, json_file, ensure_ascii=False, indent=4)

    print("JSON response saved to response.json.")

except Exception as ex:
    raise ex

JSON response saved to response.json.


## Extract JSON response store in dataframe

In [40]:
# create empty list to store data
data = []
for article in json_data['value']:
    name = article.get('name','')
    description = article.get('description','')
    url = article.get('url','')

    provider_list = article.get('provider','')
    provider_name = next((provider.get('name', '') for provider in provider_list if provider.get('name')), '')

    date_published = article.get('datePublished','')
    formatted_date = datetime.strptime(date_published[:-18], "%Y-%m-%d")

    category = article.get('category','')
    data.append([name, description, url,provider_name,formatted_date,category])

# Define the column names
columns = ['Name','Description','URL','Provider','Date Published','Category']

# Create a DataFrame from the extracted data
df_news = pd.DataFrame(data, columns=columns)
df_news.head()

Unnamed: 0,Name,Description,URL,Provider,Date Published,Category
0,Snowboard Community Commemorates Legend and Pi...,There were no winners in Chicago on Friday nig...,https://www.yahoo.com/lifestyle/snowboard-comm...,Yahoo,2024-01-22,Entertainment
1,X Games Street Ski And Snowboard Events Return...,X Games Aspen 2024 will feature the return of ...,https://www.forbes.com/sites/michellebruton/20...,Forbes,2024-01-19,
2,"Peak Chic: Balenciaga’s $5,600 Snowboard Leads...",With the Sundance film festival now in full sw...,https://www.yahoo.com/entertainment/peak-chic-...,Yahoo,2024-01-20,
3,X Games Aspen 2024: Predicting who takes home ...,"Keep an eye on Colorado’s Chris Corning, the l...",https://www.aspentimes.com/news/x-games-aspen-...,The Aspen Times,2024-01-22,Sports
4,"Fresh snow lures skiers, snowboarders to Lee C...",Las Vegas and snow skiing don't seem like they...,https://news.yahoo.com/fresh-snow-lures-skiers...,YAHOO!News,2024-01-23,
