# Outline for Wednesday, April 7
## Web 1 - How to get data from the 'net

Core ideas:
 - Network structure
     - IP addresses
     - host/domain names
     - client/server
     - request/response
 - HTTP protocol
     - URL
     - GET/POST
     - headers
     - status codes
 - The requests module
     - Etiquette
     - requests.get
     - requests.post

## Networking basics

Client (computer) sends **request** to server (another computer)
 - Request may contain data
 
Server sends back **response** to client
 - Response definitely contains data
 
How do we find the right server?
 - IP address (Example: 18.216.110.65 ) <-- IPv4 address (you may also encounter IPv6)
 - Use a "Domain Name" (a nickname / alias for an IP Address)
     - Example: www.msyamkumar.com
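
To make the IPv4 / IPv6 / domain name distinction concrete, here is a small sketch using Python's built-in `ipaddress` module. The IPv6 address `2001:db8::1` is not from these notes; it is taken from a range reserved for documentation examples.

```python
import ipaddress

# The example address from the notes is IPv4: four numbers from 0 to 255.
v4 = ipaddress.ip_address("18.216.110.65")
print(v4.version)

# An IPv6 address is longer and written as hexadecimal groups.
# (2001:db8::1 is an assumed example from a documentation-only range.)
v6 = ipaddress.ip_address("2001:db8::1")
print(v6.version)

# A domain name is a nickname, not an IP address, so parsing it as one fails.
try:
    ipaddress.ip_address("www.msyamkumar.com")
    is_ip = True
except ValueError:
    is_ip = False
print(is_ip)
```

The domain name has to be translated into an IP address before a request can be sent; the browser or the requests module does that lookup for you.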
 
Once we've found the server, how do we find the right program?
 - Use the "port number" to find the correct program
     - Often can use a default port
     - Example: Your browser defaults to port 80 for HTTP or port 443 for HTTPS. ( 18.216.110.65:80 )

## HTTP
HyperText Transfer Protocol

What is it?
 - It's an agreed-upon standard format for making requests to servers

What is a URL?
 - Uniform Resource Locator
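     - Protocol (Example: http or https)
     - Domain name
     - Port number (optional; the default port is used if it is left out)
     - Resource (path to a file on the server)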
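
As a rough illustration, the standard library's `urlsplit` can pull a URL apart into these pieces. The helper name `describe_url` and the `DEFAULT_PORTS` table are made up for this sketch; the port defaults are the 80 / 443 values mentioned earlier in these notes.

```python
from urllib.parse import urlsplit

# Default ports for the two common protocols (hypothetical helper table).
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_url(url):
    """Split a URL into protocol, domain, port, and resource."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not spell out a port.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "domain": parts.hostname,
        "port": port,
        "resource": parts.path,
    }

info = describe_url("https://www.msyamkumar.com/hello.txt")
print(info)

explicit = describe_url("http://18.216.110.65:80/hello.txt")
print(explicit)
```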

GET? POST?
 - GET means a request for a simple download of data
 - POST means we're uploading some data as part of our request
 - Note: Never use GET for sensitive information! (Data sent with GET becomes part of the URL, which can end up in browser history and server logs.)
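
A minimal sketch of where the data ends up in each case, using only the standard library. The `example.com` URL and the form values are invented for illustration.

```python
from urllib.parse import urlencode

# Made-up data for illustration only.
form = {"user": "alice", "note": "hello world"}
encoded = urlencode(form)
print(encoded)

# GET: the data rides along in the URL itself, after the '?'.
get_url = "https://example.com/search?" + encoded
print(get_url)

# POST: the URL stays clean and the data travels in the request body as bytes.
post_url = "https://example.com/search"
post_body = encoded.encode("utf-8")
print(post_url, post_body)
```

With the requests module the same split is made by passing `params=` to `requests.get` and `data=` to `requests.post`.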

What is an HTTP header?
 - Extra metadata sent along with every request and response, written as name/value pairs (another text format, like JSON or CSV)
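
On the wire, headers are plain lines of `Name: value` text. The sketch below parses a few of them by hand into a dict; the header values are made up and do not come from a real server.

```python
# Made-up header lines, in the "Name: value" layout HTTP uses.
raw_headers = (
    "Content-Type: text/plain\r\n"
    "Content-Length: 42\r\n"
    "Server: ExampleServer\r\n"
)

headers = {}
for line in raw_headers.splitlines():
    # Split on the first colon only, since values may contain colons too.
    name, value = line.split(":", 1)
    headers[name.strip()] = value.strip()

print(headers)
```

You rarely need to do this yourself: after `r = requests.get(url)`, the parsed headers are available as the dict-like `r.headers`.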

HTTP Status Codes overview
- 1XX : Informational
- 2XX : Successful
    - 200 : Standard "this worked!"
- 3XX : Redirection
- 4XX : Client Error
    - 404 : Not Found
- 5XX : Server Error

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
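
The first digit of the code picks the family. As a quick sketch, Python's built-in `http.HTTPStatus` knows the standard phrase for each code; the helper `status_family` is a hypothetical name written just for this example.

```python
from http import HTTPStatus

def status_family(code):
    """Map a status code to its family using the first digit."""
    families = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return families.get(code // 100, "Unknown")

for code in (200, 404, 429, 500):
    print(code, HTTPStatus(code).phrase, "->", status_family(code))
```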

In [8]:
import requests
import json
from pandas import DataFrame

## DEMO: Simple string example
- URL: https://www.msyamkumar.com/hello.txt

In [9]:
url = "https://www.msyamkumar.com/hello.txt"
r = requests.get(url)
assert r.status_code == 200
print(type(r.text))
r.text

<class 'str'>


'Hello CS220 / CS319 students! Welcome to my website. Hope you are staying safe and healthy!\n'

In [10]:
typo_url = "https://www.msyamkumar.com/hello.txtt"
r = requests.get(typo_url)
assert r.status_code == 200
print(type(r.text))
r.text

AssertionError: 

In [11]:
r = requests.get(typo_url)
r.raise_for_status() # raises an HTTPError if the status code is 4XX or 5XX
r.text

HTTPError: 404 Client Error: Not Found for url: https://www.msyamkumar.com/hello.txtt

## DEMO: JSON file example
- URL: https://www.msyamkumar.com/scores.json
- json.load(FILE_OBJECT)
- json.loads(STRING)
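
A small offline sketch of the difference between the two. The text mirrors the scores data used in this demo, and `io.StringIO` stands in for a real file object you would get from `open(...)`.

```python
import io
import json

text = '{"alice": 100, "bob": 200, "cindy": 300}'

# json.loads: parse JSON that you already hold as a string.
from_string = json.loads(text)

# json.load: parse JSON by reading it from a file object.
file_like = io.StringIO(text)  # stand-in for an opened file
from_file = json.load(file_like)

print(from_string == from_file)
print(type(from_string))
```

Since `r.text` is a string, `json.loads` is the one that applies to web responses.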

In [15]:
url = "https://www.msyamkumar.com/scores.json"
r = requests.get(url)
r.raise_for_status()
urltext = r.text
print(urltext)

d = json.loads(urltext)
print(type(d))

#Shortcut to bypass using json.loads()
d2 = r.json()
print(d2)
print(type(d2))

{
  "alice": 100,
  "bob": 200,
  "cindy": 300
}

<class 'dict'>
{'alice': 100, 'bob': 200, 'cindy': 300}
<class 'dict'>


## Etiquette

Core idea: Don't make a lot of requests to the same server all at once.
 - Requests use the server's time
 - Professional websites/servers will ban you (sometimes permanently) if you make too many requests
 - If you don't get banned, you may break the server
     - DoS "Denial of Service" attack
     - DDoS "Distributed Denial of Service" attack
     - Don't ever do either of these!
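
One simple way to stay polite is to pause between requests. The sketch below is only one possible approach: `fetch_politely` and `fake_fetch` are hypothetical names, and the delay value is an assumption rather than a rule any particular server publishes.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break
        results.append(fetch(url))
    return results

# Offline stand-in so the example runs without touching a real server.
def fake_fetch(url):
    return "contents of " + url

pages = fetch_politely(["a.txt", "b.txt"], fake_fetch, delay_seconds=0.01)
print(pages)
```

In real use, `fetch` could be a small function that calls `requests.get(url)`, checks `raise_for_status()`, and returns `r.text`.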


## DEMO 1: reddit json processing
- URL: https://www.reddit.com/r/UWMadison.json or https://www.msyamkumar.com/cs220/f20/materials/lectureDemo_code/lec-30/UWMadison.json

THE FIRST LINK IS TO A LIVE WEBPAGE - Review requests etiquette before running!

In [17]:
#url = "https://www.reddit.com/r/UWMadison.json" #In testing, failed with 429 Client Error: Too Many Requests
url = "https://www.msyamkumar.com/cs220/f20/materials/lectureDemo_code/lec-30/UWMadison.json"
r = requests.get(url)
r.raise_for_status()
d = r.json()
print(type(d))

<class 'dict'>


### How to explore an unknown JSON?
- If you run into a dict, try the .keys() method to look at the keys of the dictionary
- If you run into a list, iterate over the list and print each item
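
Those two rules can be rolled into a small helper that reports the shape of nested data. This is only a sketch: `describe` is a hypothetical function, it peeks at just the first item of each list, and the sample data is a hand-written stand-in rather than the real reddit response.

```python
def describe(value, depth=0, max_depth=3):
    """Return lines describing the shape of parsed JSON data.

    depth counts indentation levels; nothing is expanded past max_depth.
    """
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [f"{indent}dict, keys: {list(value.keys())}"]
        if depth < max_depth:
            for key in value:
                lines.append(f"{indent}  {key!r}:")
                lines.extend(describe(value[key], depth + 2, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{indent}list, length: {len(value)}"]
        if value and depth < max_depth:
            # Look at the first item only; the rest usually look the same.
            lines.extend(describe(value[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(value).__name__}"]

# Hand-written stand-in data, not the real response.
sample = {"kind": "example", "data": {"children": [{"kind": "t3"}, {"kind": "t3"}]}}
report = describe(sample)
print("\n".join(report))
```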

In [18]:
d.keys()

dict_keys(['kind', 'data'])

In [21]:
type(d["data"])
d["data"].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [23]:
type(d["data"]["children"])
len(d["data"]["children"])

25

In [24]:
d["data"]["children"]

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'UWMadison',
   'selftext': '',
   'author_fullname': 't2_3r6u0pqt',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'This person clearly doesn’t care about attendance smh. /s',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/UWMadison',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'top_awarded_type': None,
   'hide_score': False,
   'name': 't3_jrzboi',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'upvote_ratio': 0.99,
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 111,
   'total_awards_received': 0,
   'media_embed': {},
   'author_flair_template_id': None,
   'is_original_content': False,
   'user_reports': [],
   'secure_media': None,
   'is_reddit_media_domain': True,
   'is_meta': False,
   'category': None,
   'secure_media_embed': {},
   'link_flair_text

In [26]:
for child in d["data"]["children"]:
    print(child["data"].keys())
    break

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'url_overridden_by_de

In [27]:
for child in d["data"]["children"]:
    print(child["data"]["score"], child["data"]["title"])

111 This person clearly doesn’t care about attendance smh. /s
98 Here we go again
64 Breaking my lease for next semester?
124 What resources does UW provide that you would otherwise have to pay for (and that people don’t know about)?
12 Sophomore Dorm
7 The Nick track
18 UW Thanksgiving To Go - Including free meals for students in need
3 I want to cancel my housing contract and get an apartment for spring but they say theyre not releasing kids for that reason. They will release me if I say Im going to live back home, though. So what if I said I was studying remotely at home and got a campus apartment anyway?
52 What to do when I can’t afford food in college?
6 How can I meet people
11 The DoIT Help Desk is hiring for remote student jobs!
3 Fresh Market - I'm looking for a job
7 For anyone who is in one of the bands/orchestras how is that going this semester?
8 Math 340 Professors
247 I've been procrastinating this morning/afternoon by making this graphic of our Capitol
5 Need adv

## DEMO 2: State populations
- URL: https://www.msyamkumar.com/cs220/f20/materials/lectureDemo_code/lec-30/data/state_files.txt

Challenge problem: Each line in state_files.txt contains the name of a .json file (in the same directory on the server). Using `requests.get`, load the contents of all of the JSON files and combine them into one DataFrame. You will probably need to explore the data!

In [None]:
url = "https://www.msyamkumar.com/cs220/f20/materials/lectureDemo_code/lec-30/data/state_files.txt"
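
One possible way to structure the download step is sketched below. It is not a full solution: `load_all_records`, `fetch_text`, and `fetch_json` are hypothetical names, the `example.com` data is a made-up offline stand-in, and the sketch assumes each JSON file holds one flat record, which you would need to confirm by exploring the real files.

```python
import time
from urllib.parse import urljoin

def load_all_records(index_url, fetch_text, fetch_json, delay_seconds=0.5):
    """Read file names from an index file, then fetch each file next to it."""
    text = fetch_text(index_url)
    names = [line.strip() for line in text.splitlines() if line.strip()]
    records = []
    for i, name in enumerate(names):
        if i > 0:
            time.sleep(delay_seconds)  # etiquette: pause between requests
        # urljoin swaps the last part of the index URL for the file name.
        records.append(fetch_json(urljoin(index_url, name)))
    return records

# Made-up offline data standing in for the server.
fake_site = {
    "https://example.com/data/index.txt": "first.json\nsecond.json\n",
    "https://example.com/data/first.json": {"name": "first", "value": 1},
    "https://example.com/data/second.json": {"name": "second", "value": 2},
}

records = load_all_records(
    "https://example.com/data/index.txt",
    fake_site.get,
    fake_site.get,
    delay_seconds=0,
)
print(records)
```

For the real challenge, `fetch_text` and `fetch_json` could be small functions that call `requests.get`, check `raise_for_status()`, and return `r.text` or `r.json()`. If each file does turn out to be a flat dict, passing the list to `DataFrame(records)` would give one combined table.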
