## Algorithms

Today's studio is all about algorithms and automation! 

Last week we looked at how tables and tabular forms of data work (along with their associated HTML and XML) and their weird griddy structures. Now, using Python, we're going to start examining some more relational forms: how different kinds of information connect across the internet. 

Also, we're going to scrape Reddit. 

>Jupyter is an interesting structure with which to do this, since it doesn't operate like a normal script. You can run each section separately to make sure it works, and you don't have to put all the imports at the top. I'll put a sample python (.py) script in the Algorithms Studio folder on GitHub so you can see what it looks like and with instructions on how to make it work. 

### 1. Check out your page.

The first thing we need to do is take a look at the website we want to scrape. There are hundreds of different ways of constructing algorithmic structures (and of using algorithms that others have made for us) - but we are also encountering information structures that others have made.

So, go to http://old.reddit.com/r/*chooseyoursubreddit*

I'm going to go to http://old.reddit.com/r/Wallstreetbets

Then, using our friendly neighbourhood `Dev Tools`, let's take a look at the underlying structure of the page for something that we might be able to grab. Keep this open - we'll come back to it in a second.

### 2. Download the HTML

Now that we have a view of the underlying structure of our page, let's grab the HTML. Our process for doing so looks like this:

1. Import necessary packages (I'll only say this once ;-) )
2. Set up a variable for our URL
3. Set up headers to make Reddit think we are accessing the html via a browser
4. Launch a request to access the URL and its data
5. Check out what we've got and make sure it's worked. In python, we use the 'print' function, and we just want to print the first 100 characters (chars) of the text.
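
Step 5 checks the result by eye. Reddit sometimes refuses or rate-limits scripted requests, and in that case the text you get back is an error page rather than the subreddit. Here is a minimal sketch of a fetch that stops with an error when that happens. `fetch_html` is a hypothetical helper name (it isn't part of this notebook), and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the page's HTML as text, or raise an error if the request failed."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    # (for example 429, "too many requests") instead of silently continuing.
    response.raise_for_status()
    return response.text


# Example use (needs an internet connection, so it is left commented out):
# html_text = fetch_html("https://old.reddit.com/r/canada/")
# print(html_text[:100])
```

If the request fails, you get a clear error at the point of the download, rather than a confusing one later when the parser can't find what it's looking for.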

In [5]:
import requests # imports the "requests" package (an HTTP library), so that you can use it in your script.

url = "https://old.reddit.com/r/canada/" # defines the URL. Put your URL here.
headers = {'User-Agent': 'Mozilla/5.0'} # browser spoof settings, to make Reddit think we are accessing the html via a browser

page = requests.get(url, headers=headers) # uses the requests "get" function to fetch the url, using the headers as a spoof
print(page.text[:100])

<!doctype html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>Canad


Now, we're going to use another python package, `Beautiful Soup`, which helps us sort through (or parse) all the html data, pulling out the bits that we want. 

To do this, we need to:
1. Set up a variable called "soup", and define that it's the result of the "BeautifulSoup" package reading the text of our "page" response, using the 'html.parser' to parse the HTML.
2. Set up a variable called "html", and define that it's the result of everything in the soup that meets a specific criterion (otherwise we'd just get everything). If we go back to our Subreddit and DevTools, you can see that the table that holds all the posts (as opposed to the headers and everything else) is contained within this unique tag: `<div id="siteTable" class="sitetable linklisting">`. So, we can specify that we're looking for the "div" with the id attribute of "siteTable".
3. Print out the HTML so we can look more closely at it. It will be messy, so we'll use the handy `.prettify()` to make the code more readable. 
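
To try these steps without downloading anything, here is a small sketch that runs them on a tiny hand-written page. The sample HTML is made up for illustration and only borrows the `siteTable` id and class names from the Reddit page.

```python
from bs4 import BeautifulSoup

# A made-up miniature page, standing in for the real subreddit HTML.
sample_html = """
<html><body>
  <div id="header">Site header</div>
  <div id="siteTable" class="sitetable linklisting">
    <div class="thing">First post</div>
    <div class="thing">Second post</div>
  </div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first tag that matches, here the div with id "siteTable".
table = sample_soup.find("div", attrs={"id": "siteTable"})
print(table["class"])  # ['sitetable', 'linklisting']

# If nothing matches, find() returns None rather than raising an error.
missing = sample_soup.find("div", attrs={"id": "doesNotExist"})
print(missing)  # None
```

That `None` result matters later: calling `.text` or `.prettify()` on `None` is what causes an error when a tag you expected isn't on the page.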


In [6]:
from bs4 import BeautifulSoup 

soup = BeautifulSoup(page.text,'html.parser')
html = soup.find("div",attrs={'id':'siteTable'})
print(html.prettify())

<div class="sitetable linklisting" id="siteTable">
 <div class="thing id-t3_k423qh linkflair linkflair-cv odd stickied link self" data-author="OrzBlueFog" data-author-fullname="t2_hnfvn" data-comments-count="105" data-context="listing" data-domain="self.canada" data-fullname="t3_k423qh" data-gildings="0" data-nsfw="false" data-num-crossposts="0" data-oc="false" data-permalink="/r/canada/comments/k423qh/covid19_health_support_megathread_8_reminder/" data-promoted="false" data-rank="" data-score="271" data-spoiler="false" data-subreddit="canada" data-subreddit-fullname="t5_2qh68" data-subreddit-prefixed="r/canada" data-subreddit-type="public" data-timestamp="1606761645000" data-type="link" data-url="/r/canada/comments/k423qh/covid19_health_support_megathread_8_reminder/" data-whitelist-status="all_ads" id="thing_t3_k423qh" onclick="click_thing(this)">
  <p class="parent">
  </p>
  <span class="rank">
  </span>
  <div class="midcol unvoted">
   <div aria-label="upvote" class="arrow up log

### 3. Identify your object

So, that's cool but still waay too much. We need to figure out a way to define what an individual post is, so we need to look for something unique in the enclosing HTML. More tricky, some of the posts are actually advertisements, so we need to make sure we're only getting posts.

If you look closely, all posts have a "class" that looks like this:
```
<div class="thing id-t3_k423qh linkflair linkflair-cv odd stickied link self"....
```
So, we can use the "thing" class to identify a post. Then we check what type of object we get back (it should be a "ResultSet", which behaves like a list: if you imagine a table, each item in it is an individual entry or line - we'll work more with this next week), and find out how many posts there are.

The steps look like this:
1. Search the HTML, and find every div that has a class of "thing", and call the collection of objects "posts"
2. Check what type of object "posts" is
3. Count how many "posts" objects there are.


In [7]:
posts = html.find_all('div', class_ = 'thing')
print(type(posts))
print(len(posts)) #len is short for "length"

<class 'bs4.element.ResultSet'>
28
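
The search above collects every "thing", advertisements included. The printed HTML shows a `data-promoted="false"` attribute on a normal post, so one possible way to keep only real posts is to filter on it. This is a sketch on made-up HTML, and it assumes that advertisements carry `data-promoted="true"`, which you should confirm in DevTools for your own subreddit.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two ordinary posts and one promoted post (an advert).
sample_html = """
<div id="siteTable">
  <div class="thing" data-promoted="false" data-author="alice">A real post</div>
  <div class="thing" data-promoted="true" data-author="some_brand">An advert</div>
  <div class="thing" data-promoted="false" data-author="bob">Another real post</div>
</div>
"""

sample_posts = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="thing")

# Keep a post unless its data-promoted attribute is exactly "true" (our assumption for ads).
# .get() returns None when the attribute is missing, so such posts are kept as well.
real_posts = [post for post in sample_posts if post.get("data-promoted") != "true"]

print(len(sample_posts), len(real_posts))  # 3 2
print([post["data-author"] for post in real_posts])  # ['alice', 'bob']
```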


Now, let's take a closer look at the individual posts, and examine the relationships in the html tree hierarchy. This is easier to understand via example rather than explanation. 

Let's take a look at a single post by creating a "first_post" object, and look more closely at the html levels inside.

In [8]:
first_post = posts[0]
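posts # shows the whole collection, starting with the first post; run first_post on its own to see just that one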

[<div class="thing id-t3_k423qh linkflair linkflair-cv odd stickied link self" data-author="OrzBlueFog" data-author-fullname="t2_hnfvn" data-comments-count="105" data-context="listing" data-domain="self.canada" data-fullname="t3_k423qh" data-gildings="0" data-nsfw="false" data-num-crossposts="0" data-oc="false" data-permalink="/r/canada/comments/k423qh/covid19_health_support_megathread_8_reminder/" data-promoted="false" data-rank="" data-score="271" data-spoiler="false" data-subreddit="canada" data-subreddit-fullname="t5_2qh68" data-subreddit-prefixed="r/canada" data-subreddit-type="public" data-timestamp="1606761645000" data-type="link" data-url="/r/canada/comments/k423qh/covid19_health_support_megathread_8_reminder/" data-whitelist-status="all_ads" id="thing_t3_k423qh" onclick="click_thing(this)"><p class="parent"></p><span class="rank"></span><div class="midcol unvoted"><div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" 

Whoa, okay. Let's go down a level to the next "div" below and see what that looks like.

In [9]:
first_post.div

<div class="midcol unvoted"><div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" tabindex="0"></div><div class="score dislikes" title="271">271</div><div class="score unvoted" title="272">272</div><div class="score likes" title="273">273</div><div aria-label="downvote" class="arrow down login-required access-required" data-event-action="downvote" role="button" tabindex="0"></div></div>

Cool. Now, let's experiment a little moving around some of the divs. Use the section below (it won't affect the python script) to try your own. Any tag will work - `first_post.a`, `first_post.div.div`, `first_post.p` - what do you think is happening, what are the relationships?
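
If the real page is too noisy to see what's going on, here is a sketch of the same kind of navigation on a tiny made-up post. The HTML is invented for illustration and just reuses some of the class names from Reddit.

```python
from bs4 import BeautifulSoup

# A made-up, stripped-down post with the same kind of nesting as a Reddit "thing".
sample_html = """
<div class="thing">
  <div class="midcol">
    <div class="score likes">273</div>
  </div>
  <a class="author" href="/user/example_user">example_user</a>
</div>
"""

sample_post = BeautifulSoup(sample_html, "html.parser").div

# .div means "the first <div> anywhere inside this tag".
print(sample_post.div["class"])  # ['midcol']

# Chaining goes one level deeper each time.
print(sample_post.div.div.text)  # 273

# .a finds the first link inside the post, however deeply it is nested.
print(sample_post.a.text)  # example_user

# .parent goes back up the tree.
print(sample_post.div.div.parent["class"])  # ['midcol']
```

Each dot is a step down the tree to the first matching tag, and `.parent` is a step back up.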

In [10]:
first_post.p

<p class="parent"></p>

### 4. Find the data that you want and turn it into a list

Using the above method helps us figure out how we can point our script directly to where it can find the specific data that we want. 

For this example, we want to find the following information for each post: title, author, comments and likes.

- **Title** seems to be defined by the `<p>` tag with `class` "title"
- **Author** seems to be defined by the `<a>` tag with `class` "author"
- **Comments** seem to be defined by the `<a>` tag with `class` "comments"
- **Likes** are a little more complicated, but we can narrow them down to the `<div>` tag with a `class` of "score likes"

> `class` is a reserved word in Python (it's used to define classes), so it can't be used as an argument name. To specify the div class, use `class_=`

Taking it slowly, before we go ahead and put it all together, it helps to make sure you've got the right directions, so we will:
1. build each piece of information from the first post;
2. turn it into text and get rid of the html, and
3. print to make sure that it's what we want. 
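
The cells below use two spellings for the class search: `class_=` and `attrs={"class": ...}`. As a quick check on a one-line made-up snippet, this sketch shows that they find the same tag, and why the underscore is needed at all.

```python
import keyword

from bs4 import BeautifulSoup

# "class" is one of Python's reserved words, which is why Beautiful Soup uses "class_".
print(keyword.iskeyword("class"))  # True

snippet = BeautifulSoup('<div class="score likes">273</div>', "html.parser")

# Two ways of asking for the same thing.
by_keyword = snippet.find("div", class_="score likes")
by_attrs = snippet.find("div", attrs={"class": "score likes"})

print(by_keyword is by_attrs)  # True: both point at the same tag
print(by_keyword.text)  # 273
```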

In [11]:
first_title = first_post.find('p', class_="title") #defines "first_title" as the result of finding the <p> tag with class "title" in the first_post html
first_title = first_title.text # pulls out the text (.text), leaving html behind
print(first_title) #prints the result

COVID-19 Health & Support Megathread #8 - REMINDER: Abide by local health orders and guidelines - reduce social circles, wear a mask, wash your hands, social distance. Do not post pandemic misinformation.COVID-19 (self.canada)


In [12]:
first_author = first_post.find('a', class_="author")
first_author = first_author.text
print(first_author)

OrzBlueFog


In [13]:
first_comment = first_post.find('a', class_="comments")
first_comment = first_comment.text.split()[0] # the text reads "x comments"; .split() breaks it up at the spaces and [0] keeps only the first piece, the number
print(first_comment)

105


In [14]:
first_like = first_post.find("div", attrs={"class": "score likes"})
first_like = first_like.text.split()[0]
print(first_like)

273


Looking good! Now we know how to find the info, let's bring it all together into a list, so that we can turn it into a table. 

Remember, this course is all about the spatial aspects of digital technologies and computing: we're rearranging information here, grouping it together, organising and classifying it. 

The first thing we need to do is define what the titles of our lists are - titles, authors, comments and likes - and provide a scaffold for the information to be entered into with "[]", so we know it's grouped together.

Then, we need to return to our giant "posts" html, and ask our script to:
1. Sort through and find the info with the right class
2. Add it to the right list section.

In [15]:
# List Titles
titles = []
authors = []
comments = []
likes = []

# Extract data from individual posts
for post in posts:
    title = post.find('p', class_='title').text
    titles.append(title)
    author = post.find('a', class_='author').text
    authors.append(author)
    # There's an issue with comments: if a post has no comments link, find() comes back empty (or "None").
    # A "None" object isn't made of text, so there is no text to extract and split, and you get an error.
    # So, first we look for the comments link.
    comment = post.find('a', class_='comments')
    # Then, as long as the comment isn't None, we do our usual .text.split to just get the number
    if comment is not None:
        comment = comment.text.split()[0]
    # But if it IS None, or empty, then make it "0"
    else:
        comment = "0"
    # Now, we can attach it to the "comments" list.
    comments.append(comment)
    like = post.find("div", attrs={'class': 'score likes'}).text.split()[0]
    likes.append(like)
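
The same "None" problem can hit any of the four fields, not just comments. One possible way to handle them all the same way is a small helper that returns a default when the tag is missing. This is a sketch on made-up HTML: `text_or_default` is a hypothetical name, not something from Beautiful Soup.

```python
from bs4 import BeautifulSoup


def text_or_default(post, tag_name, css_class, default=""):
    """Return the text of the first matching tag inside post, or default if there is none."""
    found = post.find(tag_name, class_=css_class)
    return found.text if found is not None else default


# A made-up post that has a title and an author, but no comments link.
sample_html = """
<div class="thing">
  <p class="title">A post with an author but no comments link</p>
  <a class="author">example_user</a>
</div>
"""
sample_post = BeautifulSoup(sample_html, "html.parser").find("div", class_="thing")

title = text_or_default(sample_post, "p", "title")
author = text_or_default(sample_post, "a", "author", default="[unknown]")
# The default "0 comments" has the same shape as the real text, so .split()[0] still works.
comment = text_or_default(sample_post, "a", "comments", default="0 comments").split()[0]

print(author, comment)  # example_user 0
```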

### 5. Make it a DataFrame

Right now, we're just imagining what this might look like, so let's pull all these lists together into a table. There are many different data formats - CSV (which we've already looked at) and JSON (later!) - but really, the most flexible kind for Python is the DataFrame.

DataFrames can be built using the "pandas" python package. We will work a little more with this next week when we look more at interoperability, but for now, we can build a simple DataFrame (or df), using each of our lists as columns. 

The steps are:
1. Import pandas, and let the script know you'll be using the alias "pd"
2. Define a "reddit" data frame, with 4 columns made of our lists.
3. Print some info to make sure we have the same number of lines and it looks fine
4. Take a look at our snazzy table!


In [16]:
import pandas as pd
reddit = pd.DataFrame({
'title': titles,
'author': authors,
'comments': comments,
'likes': likes
})
print(reddit.info())
reddit


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     28 non-null     object
 1   author    28 non-null     object
 2   comments  28 non-null     object
 3   likes     28 non-null     object
dtypes: object(4)
memory usage: 1.0+ KB
None


| | title | author | comments | likes |
|---|---|---|---|---|
| 0 | COVID-19 Health & Support Megathread #8 - REMI... | OrzBlueFog | 105 | 273 |
| 1 | Canadian government to force incoming travelle... | Vaynar | 1027 | 1483 |
| 2 | Rocket Pro℠ Insight gives real estate agents t... | rocketmortgage | 0 | • |
| 3 | Ontario to implement mandatory COVID-19 tests ... | cyclinginvancouver | 172 | 1137 |
| 4 | Moderna to cut deliveries to Canada in new blo... | amirsadeghi | 215 | 423 |
| 5 | Singh says day traders are ‘not the problem,’ ... | BeerAndADart | 689 | 14.7k |
| 6 | Canadians growing more anxious over federal va... | FlyingDutchman997 | 233 | 340 |
| 7 | Julie Payette upset that her RCMP security det... | Hfx1987 | 128 | 374 |
| 8 | Johnson & Johnson single-shot COVID-19 vaccine... | trackofalljades | 68 | 166 |
| 9 | Huawei exec's application for more freedom den... | cyclinginvancouver | 18 | 83 |
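
Notice that `info()` reports the comments and likes columns as "object": they are still text, and some values (like "14.7k" or "•") aren't plain numbers at all. If you want to sort or add them up, they need converting first. Here is a minimal sketch of one way to do that. `parse_count` is a hypothetical helper, and it assumes that a trailing "k" means thousands.

```python
def parse_count(text):
    """Convert a scraped count such as '105' or '14.7k' to an int; return None if it isn't a number."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned.endswith("k"):  # assumption: "k" is shorthand for thousands
        multiplier = 1000
        cleaned = cleaned[:-1]
    try:
        return int(round(float(cleaned) * multiplier))
    except ValueError:
        # Anything that isn't a number (for example the "•" on a promoted post) becomes None.
        return None


print([parse_count(value) for value in ["273", "14.7k", "•"]])  # [273, 14700, None]
```

You could apply it to each item before appending to the lists, or to a whole column once the DataFrame is built.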


### 6. Save it as a .csv

Finally, just so we have a proper copy of the data (and so you know how to get it out of python and into your favourite spreadsheet software), let's save it to csv.

This is stupidly easy with python. See below:

In [17]:
reddit.to_csv("reddit.csv")
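
One thing to know: by default, `to_csv` also writes the row numbers (the index) as a first column with no name, which is where a column such as `Unnamed: 0` comes from when the file is read back in. This sketch shows the difference on a made-up two-row table. It writes to in-memory buffers instead of files so that nothing is saved to disk.

```python
import io

import pandas as pd

# A made-up miniature version of the reddit table.
sample = pd.DataFrame({"title": ["First post", "Second post"], "likes": ["273", "1483"]})

with_index = io.StringIO()
sample.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # index=False leaves the row numbers out

print(with_index.getvalue().splitlines()[0])  # ,title,likes
print(without_index.getvalue().splitlines()[0])  # title,likes
```

With a real file, the equivalent would be `reddit.to_csv("reddit.csv", index=False)`.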

### OMG JSON EASTER EGG!!

A final point of joy - one reddit user, u/MrSirStevo, has discovered that adding .json to the end of a URL on the new Reddit website brings up a JSON file.

So, try typing http://reddit.com/r/*yoursubreddit*/.json - mine looks like http://reddit.com/r/Wallstreetbets/.json

It's really easy to get this info!

In [18]:
import requests
import json

url = 'http://reddit.com/r/Wallstreetbets/.json'
json_page = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})

len(json_page.json()['data']['children'])

27

JSON - JavaScript Object Notation - is a data format. It's not the most comfortable format to work with in python, but it's used all the time in JavaScript and web development - and GeoJSON is a really popular format for storing geographic data. 
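
Once loaded, JSON turns into ordinary Python dictionaries and lists, so you get at the data by following keys instead of searching for tags. This sketch uses a made-up sample with the same `data` / `children` / `data` nesting as the Reddit listing. The field names `title`, `author`, `num_comments` and `score` are assumptions, so check them against the real JSON before relying on them.

```python
import json

# A made-up, cut-down listing with the same nesting as Reddit's JSON.
sample_text = """
{"data": {"children": [
  {"data": {"title": "First post", "author": "example_user", "num_comments": 105, "score": 273}},
  {"data": {"title": "Second post", "author": "another_user", "num_comments": 0, "score": 12}}
]}}
"""

listing = json.loads(sample_text)  # text in, dictionaries and lists out

rows = []
for child in listing["data"]["children"]:
    post = child["data"]
    rows.append({
        "title": post["title"],
        "author": post["author"],
        "comments": post["num_comments"],
        "likes": post["score"],
    })

print(len(rows))  # 2
print(rows[0]["title"], rows[0]["likes"])  # First post 273
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get the same kind of table as before.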

I've included it because the new reddit JSON data gives us a different range of information from the HTML of the old reddit. Let's take a look, using the json.dumps formatter:

In [19]:
print(json.dumps(json_page.json(), indent=2, sort_keys=True))

{
  "data": {
    "after": "t3_l7vf3i",
    "before": null,
    "children": [
      {
        "data": {
          "all_awardings": [
            {
              "award_sub_type": "GLOBAL",
              "award_type": "global",
              "awardings_required_to_grant_benefits": null,
              "coin_price": 150,
              "coin_reward": 0,
              "count": 4,
              "days_of_drip_extension": 0,
              "days_of_premium": 0,
              "description": "Thank you stranger. Shows the award.",
              "end_date": null,
              "giver_coin_reward": null,
              "icon_format": null,
              "icon_height": 2048,
              "icon_url": "https://i.redd.it/award_images/t5_22cerq/klvxk1wggfd41_Helpful.png",
              "icon_width": 2048,
              "id": "award_f44611f1-b89e-46dc-97fe-892280b13b82",
              "is_enabled": true,
              "is_new": false,
              "name": "Helpful",
              "penny_donate": null,
 

That's all for now, folks. Hopefully, you've learnt a little about algorithmic structures! To improve further in your own time, I'd strongly recommend copying this notebook and trying your hand at scraping some information beyond the four categories above, thinking in steps about what you need to tell the code to do to get what you want (and laughing/crying when you just get random rows of junk).