# Ways to Get Data from/through the Internet

## Best - Backend API
- structured calls (as intended by the dev)
- good for us because we (usually) get useful, clean-ish data
- good for the dev because they have more control over what data we get

## Second Best (though usually much worse) - Frontend Web Scraping
- generally requires looping through the page's code rather than making a single call
- frontend is usually not as carefully structured as the backend
- dev has little control over requests
- frontends change more often and more rapidly than backends, which can cause web scrapers to break
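
To make the "loop through code" point concrete, here is a minimal sketch that walks a small HTML fragment cell by cell. The `SAMPLE_HTML` fragment and the `TableRows` class are made up for illustration, and the standard-library `html.parser` stands in for a scraping library so the snippet runs without extra installs.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, standing in for a downloaded frontend page.
SAMPLE_HTML = """
<table id="roster">
  <tr><th>No.</th><th>Player</th><th>Pos</th></tr>
  <tr><td>13</td><td>Jaren Jackson Jr.</td><td>PF</td></tr>
  <tr><td>12</td><td>Ja Morant</td><td>PG</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


parser = TableRows()
parser.feed(SAMPLE_HTML)

# First row is the header; pair it with each remaining row.
header, *players = parser.rows
records = [dict(zip(header, row)) for row in players]
print(records)
```

Every piece of data has to be dug out of the markup tag by tag, so any change to the page's tags can change what this loop collects.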

## Usually the Worst Option by Far - Regex
- "I have to solve a problem with regex. Now I have two problems."
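
A small sketch of why regex tends to be the fragile choice for HTML. Both markup strings are hypothetical; the only point is that a pattern written against one exact piece of markup silently stops matching after a minor frontend change.

```python
import re

# Hypothetical markup: the same cell before and after a small frontend change.
before = "<td>Ja Morant</td>"
after = '<td class="player">Ja Morant</td>'

# Pattern written against the original markup only.
cell = re.compile(r"<td>(.*?)</td>")

print(cell.findall(before))  # ['Ja Morant']
print(cell.findall(after))   # [] (the added attribute breaks the match)
```

The second call raises no error. It just returns nothing, which is what makes this kind of breakage easy to miss.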

# Web Scraping Example - Old Way

In [21]:
import urllib.request
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

In [22]:
# download the raw HTML of the team page
source = urllib.request.urlopen('https://www.basketball-reference.com/teams/MEM/2022.html').read()

In [23]:
# name the parser explicitly ('html.parser' ships with Python)
soup = BeautifulSoup(source, 'html.parser')

In [24]:
# old way: grab a table by its position on the page (fragile if the layout changes)
table = soup.find_all('table')[1]

In [25]:
# look the roster table up by its id instead of by position
roster_html = soup.find(id='roster')

# Web Scraping Example - Pandas

In [27]:
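# read into DataFrame
# (wrap the HTML in StringIO: newer pandas versions expect a file-like object, not a literal HTML string)
from io import StringIO
roster = pd.read_html(StringIO(str(roster_html)))[0]
roster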

| | No. | Player | Pos | Ht | Wt | Birth Date | Unnamed: 6 | Exp | College |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | Jaren Jackson Jr. | PF | 6-11 | 242 | September 15, 1999 | us | 3 | Michigan State |
| 1 | 22 | Desmond Bane | SF | 6-5 | 215 | June 25, 1998 | us | 1 | TCU |
| 2 | 4 | Steven Adams | C | 6-11 | 265 | July 20, 1993 | nz | 8 | Pitt |
| 3 | 0 | De'Anthony Melton | SG | 6-2 | 200 | May 28, 1998 | us | 3 | USC |
| 4 | 21 | Tyus Jones | PG | 6-0 | 196 | May 10, 1996 | us | 6 | Duke |
| 5 | 46 | John Konchar | SG | 6-5 | 210 | March 22, 1996 | us | 2 | Purdue-Fort Wayne |
| 6 | 1 | Kyle Anderson | PF | 6-9 | 230 | September 20, 1993 | us | 7 | UCLA |
| 7 | 15 | Brandon Clarke | PF | 6-8 | 215 | September 19, 1996 | ca | 2 | San Jose State, Gonzaga |
| 8 | 8 | Ziaire Williams | SF | 6-8 | 215 | September 12, 2001 | us | R | Stanford |
| 9 | 12 | Ja Morant | PG | 6-3 | 174 | August 10, 1999 | us | 2 | Murray State |


In [31]:
# old way: select the table by position, then hand it to pandas
from io import StringIO
roster_old_way = soup.find_all('table')[0]
roster2 = pd.read_html(StringIO(str(roster_old_way)))
roster2[0]

| | No. | Player | Pos | Ht | Wt | Birth Date | Unnamed: 6 | Exp | College |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | Jaren Jackson Jr. | PF | 6-11 | 242 | September 15, 1999 | us | 3 | Michigan State |
| 1 | 22 | Desmond Bane | SF | 6-5 | 215 | June 25, 1998 | us | 1 | TCU |
| 2 | 4 | Steven Adams | C | 6-11 | 265 | July 20, 1993 | nz | 8 | Pitt |
| 3 | 0 | De'Anthony Melton | SG | 6-2 | 200 | May 28, 1998 | us | 3 | USC |
| 4 | 21 | Tyus Jones | PG | 6-0 | 196 | May 10, 1996 | us | 6 | Duke |
| 5 | 46 | John Konchar | SG | 6-5 | 210 | March 22, 1996 | us | 2 | Purdue-Fort Wayne |
| 6 | 1 | Kyle Anderson | PF | 6-9 | 230 | September 20, 1993 | us | 7 | UCLA |
| 7 | 15 | Brandon Clarke | PF | 6-8 | 215 | September 19, 1996 | ca | 2 | San Jose State, Gonzaga |
| 8 | 8 | Ziaire Williams | SF | 6-8 | 215 | September 12, 2001 | us | R | Stanford |
| 9 | 12 | Ja Morant | PG | 6-3 | 174 | August 10, 1999 | us | 2 | Murray State |


# Can Also Do It Locally (Instead of Spamming URLs)

In [32]:
import pandas as pd
from bs4 import BeautifulSoup

In [33]:
# placeholder: replace with the path to a locally saved copy of the page
html_file = "downloaded html file"

with open(html_file, 'rb') as f:
    soup = BeautifulSoup(f, 'lxml')

FileNotFoundError: [Errno 2] No such file or directory: 'downloaded html file'

In [34]:
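# same lookup and parse as before, now on the soup built from the local file
from io import StringIO
roster_html = soup.find(id='roster')
roster = pd.read_html(StringIO(str(roster_html)))[0]
roster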

| | No. | Player | Pos | Ht | Wt | Birth Date | Unnamed: 6 | Exp | College |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | Jaren Jackson Jr. | PF | 6-11 | 242 | September 15, 1999 | us | 3 | Michigan State |
| 1 | 22 | Desmond Bane | SF | 6-5 | 215 | June 25, 1998 | us | 1 | TCU |
| 2 | 4 | Steven Adams | C | 6-11 | 265 | July 20, 1993 | nz | 8 | Pitt |
| 3 | 0 | De'Anthony Melton | SG | 6-2 | 200 | May 28, 1998 | us | 3 | USC |
| 4 | 21 | Tyus Jones | PG | 6-0 | 196 | May 10, 1996 | us | 6 | Duke |
| 5 | 46 | John Konchar | SG | 6-5 | 210 | March 22, 1996 | us | 2 | Purdue-Fort Wayne |
| 6 | 1 | Kyle Anderson | PF | 6-9 | 230 | September 20, 1993 | us | 7 | UCLA |
| 7 | 15 | Brandon Clarke | PF | 6-8 | 215 | September 19, 1996 | ca | 2 | San Jose State, Gonzaga |
| 8 | 8 | Ziaire Williams | SF | 6-8 | 215 | September 12, 2001 | us | R | Stanford |
| 9 | 12 | Ja Morant | PG | 6-3 | 174 | August 10, 1999 | us | 2 | Murray State |
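
One way to avoid hitting the site repeatedly is to download the page once and reuse the saved copy afterwards. The sketch below is only an illustration: `load_html` and its `fetch` parameter are hypothetical names, and the demo uses a fake fetcher and a temporary folder so that it runs without touching the network.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url):
    """Default fetcher: one real HTTP request."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def load_html(url, cache_path, fetch=download):
    """Return the page bytes, downloading only if cache_path does not exist yet."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_bytes()   # reuse the local copy
    html = fetch(url)                    # first time only
    cache_path.write_bytes(html)
    return html


# Demo with a fake fetcher so the example stays offline.
calls = []

def fake_fetch(url):
    calls.append(url)
    return b"<html><table id='roster'></table></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page.html"
    first = load_html("https://example.com/page", path, fetch=fake_fetch)
    second = load_html("https://example.com/page", path, fetch=fake_fetch)

print(len(calls))  # 1: the second call read the saved file
```

The returned bytes can be passed to `BeautifulSoup` the same way as the open file handle in the cell above.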


# Regular Expressions
- Tons of exercises in the textbook
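
Regex is a better fit once the data is already clean text. As a small sketch, the feet-inches strings from the roster's `Ht` column (for example `6-11`) can be pulled apart with one pattern. The helper name `height_to_inches` is made up for this example.

```python
import re

# Feet and inches separated by a hyphen, e.g. "6-11".
HEIGHT = re.compile(r"(\d+)-(\d+)")


def height_to_inches(ht):
    """Convert a 'feet-inches' string such as '6-11' to total inches."""
    match = HEIGHT.fullmatch(ht)
    if match is None:
        raise ValueError(f"unexpected height format: {ht!r}")
    feet, inches = map(int, match.groups())
    return feet * 12 + inches


print(height_to_inches("6-11"))  # 83
print(height_to_inches("6-0"))   # 72
```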

# API Example

In [35]:
import pandas as pd
import urllib.request, urllib.parse
import json

In [36]:
api_url = 'https://www.thecocktaildb.com/api/json/v1/1/search.php?s=margarita'

# open the URL (uh = "URL handle") and read the response body as text
uh = urllib.request.urlopen(api_url)

data = uh.read().decode()
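
The search term in the URL above is a query parameter, so it can be built with `urllib.parse` rather than typed by hand. This is a minimal sketch: `search_url` is a hypothetical helper, and it only assembles the URL without sending a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the cell above, without the query string.
BASE_URL = 'https://www.thecocktaildb.com/api/json/v1/1/search.php'


def search_url(name):
    """Build the search URL for a drink name, escaping spaces and symbols."""
    return f"{BASE_URL}?{urlencode({'s': name})}"


print(search_url('margarita'))
print(search_url('blue margarita'))  # the space is encoded as '+'
```

Letting `urlencode` handle the escaping keeps search terms with spaces or punctuation from producing a malformed URL.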

In [39]:
# parse the JSON text, then flatten the list of drink records into a DataFrame
drinks_json = json.loads(data)
drinks_df = pd.json_normalize(drinks_json['drinks'])
drinks_df

| | idDrink | strDrink | strDrinkAlternate | strTags | strVideo | strCategory | strIBA | strAlcoholic | strGlass | strInstructions | ... | strMeasure10 | strMeasure11 | strMeasure12 | strMeasure13 | strMeasure14 | strMeasure15 | strImageSource | strImageAttribution | strCreativeCommonsConfirmed | dateModified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11007 | Margarita | | IBA,ContemporaryClassic | | Ordinary Drink | Contemporary Classics | Alcoholic | Cocktail glass | Rub the rim of the glass with the lime slice t... | ... | | | | | | | https://commons.wikimedia.org/wiki/File:Klassi... | Cocktailmarler | Yes | 2015-08-18 14:42:59 |
| 1 | 11118 | Blue Margarita | | | | Ordinary Drink | | Alcoholic | Cocktail glass | Rub rim of cocktail glass with lime juice. Dip... | ... | | | | | | | | | Yes | 2015-08-18 14:51:53 |
| 2 | 17216 | Tommy's Margarita | | IBA,NewEra | | Ordinary Drink | New Era Drinks | Alcoholic | Old-Fashioned glass | Shake and strain into a chilled cocktail glass. | ... | | | | | | | | | No | 2017-09-02 18:37:54 |
| 3 | 16158 | Whitecap Margarita | | | | Other/Unknown | | Alcoholic | Margarita/Coupette glass | Place all ingredients in a blender and blend u... | ... | | | | | | | | | No | 2015-09-02 17:00:22 |
| 4 | 12322 | Strawberry Margarita | | | | Ordinary Drink | | Alcoholic | Cocktail glass | Rub rim of cocktail glass with lemon juice and... | ... | | | | | | | | | No | 2015-08-18 14:41:51 |
| 5 | 178332 | Smashed Watermelon Margarita | | | | Cocktail | | Alcoholic | Collins glass | In a mason jar muddle the watermelon and 5 min... | ... | | | | | | | | | No | |
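
To show what `pd.json_normalize` is doing, here is a small offline sketch on a made-up payload. The nested `glass` field and the `to_frame` helper are hypothetical and are not part of the real response. The guard against a missing or null `drinks` value assumes the API may return no list when nothing matches.

```python
import pandas as pd

# Made-up payload shaped like the response above, with one nested field added
# to show how nested keys become dotted column names.
payload = {
    "drinks": [
        {"idDrink": "1", "strDrink": "Example A", "glass": {"type": "Cocktail glass"}},
        {"idDrink": "2", "strDrink": "Example B", "glass": {"type": "Collins glass"}},
    ]
}


def to_frame(payload):
    """Flatten the 'drinks' records; fall back to an empty frame if there are none."""
    records = payload.get("drinks") or []   # assumption: 'drinks' may be null/missing
    return pd.json_normalize(records)


df = to_frame(payload)
print(df)

empty = to_frame({"drinks": None})
print(empty.empty)  # True
```

Each dictionary becomes one row, and the nested `glass` dictionary is flattened into a `glass.type` column.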
