## Live Coding Mod 3

### H. Diana McSpadden (hdm5s)

Don't worry: JSON data isn't as scary as Jason (source: https://www.pinterest.com/pin/40673202872115688/)

In this live coding session we will work with three datasets that are available on the internet without supplying any API keys or other credentials (we'll cover APIs with credentials next week).

The goal is to examine the structure of the data, decide what is metadata and what constitutes the content of the dataframe we are trying to build, and to use the various tools available to us to convert the data to a pandas dataframe in our Python environment.
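
A quick way to start examining structure is to print the keys and value types at each level before deciding which part holds the records. The helper below is a minimal sketch; `summarize` and the `sample` dictionary are made up for illustration and only imitate the shape of a search response.

```python
def summarize(node, depth=0, max_depth=3):
    """Return one line per key describing the type found at each level."""
    lines = []
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Look only at the first element; records in a list usually share a shape.
        lines.append(f"{indent}[0]: {type(node[0]).__name__} (of {len(node)})")
        if depth < max_depth:
            lines.extend(summarize(node[0], depth + 1, max_depth))
    return lines


# Hypothetical sample data, not a real API response.
sample = {
    "batchcomplete": "",
    "query": {
        "searchinfo": {"totalhits": 2},
        "search": [
            {"title": "Virginia", "pageid": 32432},
            {"title": "West Virginia", "pageid": 32905},
        ],
    },
}

structure = summarize(sample)
print("\n".join(structure))
```

In a summary like this, a list of dictionaries is usually the content of the dataframe, and the scalar values around it are usually metadata.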

We'll work on the following three examples together:

1. Cocktail recipes from The Cocktail DB: http://www.thecocktaildb.com/api/json/v1/1/filter.php?c=Cocktail
1. The published works of J.K. Rowling from OpenLibrary.org: https://openlibrary.org/authors/OL23919A/works.json
1. Data on the Wikipedia pages that come up when searching for the term "Virginia": https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Virginia&format=json&srlimit=500 (This search returns the first 500 hits, but there are 210391 results. If time allows, we will use the sroffset parameter described in the API documentation to get the full list.)
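
Paging with `sroffset` could look roughly like the sketch below. It assumes each response carries a `continue` block with the next `sroffset` while more results remain. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real request so the loop can be tried offline.

```python
def fetch_all(fetch_page, limit=500, max_pages=10):
    """Collect search hits page by page.

    fetch_page(offset, limit) must return the decoded JSON for one page.
    max_pages is a safety cap so the loop cannot run forever.
    """
    hits, offset = [], 0
    for _ in range(max_pages):
        page = fetch_page(offset, limit)
        hits.extend(page["query"]["search"])
        # Assumption: a "continue" block is present only while results remain.
        next_offset = page.get("continue", {}).get("sroffset")
        if next_offset is None:
            break
        offset = next_offset
    return hits


def fake_fetch(offset, limit):
    """Offline stand-in that imitates the paged response shape."""
    titles = [f"Page {i}" for i in range(7)]
    page = {"query": {"search": [{"title": t} for t in titles[offset:offset + limit]]}}
    if offset + limit < len(titles):
        page["continue"] = {"sroffset": offset + limit}
    return page


all_hits = fetch_all(fake_fetch, limit=3)
print(len(all_hits))
```

For the live API, `fetch_page` could wrap `requests.get(url, params={..., "sroffset": offset, "srlimit": limit}).json()`. Keep `max_pages` small while experimenting, since the API may limit how deep paging can go.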

In [1]:
import numpy as np
import pandas as pd
import requests
import json
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks, this has been standard in other notebooks for class

In [2]:
pd.set_option('display.max_columns', None)

## Cocktail recipes from The Cocktail DB

In [3]:
# http://www.thecocktaildb.com/api/json/v1/1/filter.php?c=Cocktail
url = 'http://www.thecocktaildb.com/api/json/v1/1/filter.php?c=Cocktail'

r = requests.get(url) # get or post are the two main methods
r


<Response [200]>

In [4]:
json.loads(r.text) # the "s" means "string": loads() parses a string in your kernel, load() reads from an external file

# json.loads() # for string : string in kernel to dict/list
# json.load() # for file : external file to dict/list
# json.dumps() # for string : dict/list to string in kernel
# json.dump() # for file : dict/list to external file/socket

{'drinks': [{'strDrink': '155 Belmont',
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/yqvvqs1475667388.jpg',
   'idDrink': '15346'},
  {'strDrink': '57 Chevy with a White License Plate',
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/qyyvtu1468878544.jpg',
   'idDrink': '14029'},
  {'strDrink': '747 Drink',
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/i9suxb1582474926.jpg',
   'idDrink': '178318'},
  {'strDrink': '9 1/2 Weeks',
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/xvwusr1472669302.jpg',
   'idDrink': '16108'},
  {'strDrink': "A Gilligan's Island",
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/wysqut1461867176.jpg',
   'idDrink': '16943'},
  {'strDrink': 'A True Amaretto Sour',
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/rptuxy1472669372.jpg',
   'idDrink': '17005'},
  {'strDrink': 'A.D.M. (After Dinner Mint)',
   'strDrinkThumb': 'ht
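
The four `json` functions listed in the comments above can be compared in one round trip. This is a small sketch using a single made-up record and an in-memory `io.StringIO` buffer in place of a real file.

```python
import io
import json

# One small record, shaped like an entry in the 'drinks' list.
record = {"strDrink": "Mojito", "idDrink": "11000"}

text = json.dumps(record)        # dict -> JSON string in the kernel
from_text = json.loads(text)     # JSON string -> dict

buffer = io.StringIO()           # stands in for an open file
json.dump(record, buffer)        # dict -> file-like object
buffer.seek(0)                   # rewind before reading back
from_file = json.load(buffer)    # file-like object -> dict

print(type(text).__name__, from_text == record, from_file == record)
```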

In [5]:
myjson = json.loads(r.text) 
myjson['drinks'][0]['strDrinkThumb'] # thumbnail url for the first drink in the JSON

'https://www.thecocktaildb.com/images/media/drink/yqvvqs1475667388.jpg'

In [6]:
[x['strDrinkThumb'] for x in myjson['drinks'][0:5]] # list comprehension for first 5 drinks, just to make it short

['https://www.thecocktaildb.com/images/media/drink/yqvvqs1475667388.jpg',
 'https://www.thecocktaildb.com/images/media/drink/qyyvtu1468878544.jpg',
 'https://www.thecocktaildb.com/images/media/drink/i9suxb1582474926.jpg',
 'https://www.thecocktaildb.com/images/media/drink/xvwusr1472669302.jpg',
 'https://www.thecocktaildb.com/images/media/drink/wysqut1461867176.jpg']

#### Ways to put the data in a dataframe

**Way one**

json_normalize()

In [7]:
pd.json_normalize(myjson, 'drinks') # normalize the JSON into a dataframe, you need to know the ROOT node

Unnamed: 0,strDrink,strDrinkThumb,idDrink
0,155 Belmont,https://www.thecocktaildb.com/images/media/dri...,15346
1,57 Chevy with a White License Plate,https://www.thecocktaildb.com/images/media/dri...,14029
2,747 Drink,https://www.thecocktaildb.com/images/media/dri...,178318
3,9 1/2 Weeks,https://www.thecocktaildb.com/images/media/dri...,16108
4,A Gilligan's Island,https://www.thecocktaildb.com/images/media/dri...,16943
...,...,...,...
95,Michelada,https://www.thecocktaildb.com/images/media/dri...,178343
96,Midnight Mint,https://www.thecocktaildb.com/images/media/dri...,14842
97,Mojito,https://www.thecocktaildb.com/images/media/dri...,11000
98,Mojito Extra,https://www.thecocktaildb.com/images/media/dri...,15841

**Way two**

pd.DataFrame()


In [8]:
cocktail_json = requests.get(url).json() # .json() decodes the response body, same result as json.loads(r.text)
cocktail_df = pd.DataFrame(cocktail_json['drinks'])
cocktail_df.head()

Unnamed: 0,strDrink,strDrinkThumb,idDrink
0,155 Belmont,https://www.thecocktaildb.com/images/media/dri...,15346
1,57 Chevy with a White License Plate,https://www.thecocktaildb.com/images/media/dri...,14029
2,747 Drink,https://www.thecocktaildb.com/images/media/dri...,178318
3,9 1/2 Weeks,https://www.thecocktaildb.com/images/media/dri...,16108
4,A Gilligan's Island,https://www.thecocktaildb.com/images/media/dri...,16943
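
When the records are a flat list of dictionaries, `pd.json_normalize()` and the plain `pd.DataFrame()` constructor should give the same table. The sketch below checks that on a two-row sample (hypothetical data, trimmed to two fields) and also converts `idDrink`, which arrives as a string, to an integer.

```python
import pandas as pd

# Hypothetical two-record sample shaped like the cocktail response.
sample = {
    "drinks": [
        {"strDrink": "155 Belmont", "idDrink": "15346"},
        {"strDrink": "Mojito", "idDrink": "11000"},
    ]
}

via_normalize = pd.json_normalize(sample, record_path="drinks")
via_constructor = pd.DataFrame(sample["drinks"])

# Compare the two results cell by cell.
same = via_normalize.equals(via_constructor)
print(same)

# The IDs are strings in the JSON; convert them if numeric IDs are needed.
via_constructor["idDrink"] = via_constructor["idDrink"].astype(int)
```

`json_normalize()` starts to matter once the records contain nested dictionaries, because it flattens them into dotted column names.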


In [9]:
cocktail_json['drinks'][0]['idDrink']

'15346'

## Wikipedia Links

In [10]:
# https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Virginia&format=json&srlimit=500

url = 'https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Virginia&format=json&srlimit=500'

r = requests.get(url) # get or post are the two main methods
print(r.status_code)
wikijson = json.loads(r.text) 


200


In [11]:
df_wiki = pd.json_normalize(wikijson, record_path=['query','search']) # record_path lists the keys leading from the ROOT node down to the list of records

In [12]:
df_wiki.head(10)

Unnamed: 0,ns,title,pageid,size,wordcount,snippet,timestamp
0,0,Virginia,32432,297339,25683,"<span class=""searchmatch"">Virginia</span>, off...",2023-01-28T19:08:31Z
1,0,West Virginia,32905,183261,17506,"West <span class=""searchmatch"">Virginia</span>...",2023-01-29T20:30:07Z
2,0,"Virginia Beach, Virginia",91239,130491,11788,"<span class=""searchmatch"">Virginia</span> Beac...",2023-01-31T15:42:01Z
3,0,Virginia Woolf,32742,329612,31706,"Adeline <span class=""searchmatch"">Virginia</sp...",2023-01-29T16:50:04Z
4,0,"Norfolk, Virginia",57898,141027,13515,(/ˈnɔːrfʊk/ (listen) NOR-fuk) is an independen...,2023-01-26T04:19:08Z
5,0,List of United States senators from West Virginia,416472,28623,115,of United States senators from West <span clas...,2023-01-04T00:55:30Z
6,0,"Richmond, Virginia",53274,192563,17111,Richmond (/ˈrɪtʃmənd/) is the capital city of ...,2023-01-31T00:33:00Z
7,0,Virginia-class submarine,32726,130525,10175,"The <span class=""searchmatch"">Virginia</span> ...",2022-12-08T18:19:07Z
8,0,Flag and seal of Virginia,1203783,19590,2168,"of the Commonwealth of <span class=""searchmatc...",2023-01-23T01:10:58Z
9,0,2000 West Virginia gubernatorial election,42579648,6314,148,"The 2000 West <span class=""searchmatch"">Virgin...",2023-01-29T06:16:35Z
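
Metadata that sits outside the record list can be carried into the dataframe with the `meta` argument of `pd.json_normalize()`. The sketch below uses a made-up two-row sample and assumes the total hit count lives at `query.searchinfo.totalhits`. Check the actual response before relying on that path.

```python
import pandas as pd

# Hypothetical sample imitating the nested search response.
sample = {
    "query": {
        "searchinfo": {"totalhits": 210391},
        "search": [
            {"title": "Virginia", "pageid": 32432},
            {"title": "West Virginia", "pageid": 32905},
        ],
    }
}

df = pd.json_normalize(
    sample,
    record_path=["query", "search"],              # where the rows live
    meta=[["query", "searchinfo", "totalhits"]],  # value repeated on every row
)
print(df)
```

The metadata column is named by joining the path with dots, so here it appears as `query.searchinfo.totalhits`.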


## TechCrunch Headlines

In [13]:
import dotenv
import os
# load the env file
dotenv.load_dotenv('live.env')
API_KEY = os.getenv('API_KEY')

In [14]:

url = "https://newsapi.org/v2/top-headlines?sources=techcrunch&apiKey=" + API_KEY
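
If the env file is missing or the variable name is misspelled, `os.getenv` returns `None` and the string concatenation fails with a `TypeError` that does not say why. A small guard can make that failure clearer. This is a sketch only: `build_headlines_url` is a hypothetical helper and `PLACEHOLDER_KEY` is not a real key.

```python
from urllib.parse import urlencode


def build_headlines_url(api_key, source="techcrunch"):
    """Build the top-headlines URL, failing early if the key is missing."""
    if not api_key:
        raise ValueError("API_KEY is not set; check the env file and variable name")
    # urlencode escapes any special characters in the parameter values.
    query = urlencode({"sources": source, "apiKey": api_key})
    return "https://newsapi.org/v2/top-headlines?" + query


demo_url = build_headlines_url("PLACEHOLDER_KEY")
print(demo_url)
```

An alternative is to pass the same dictionary as `params=` to `requests.get()`, which also builds the query string.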

In [15]:
r = requests.get(url) # get or post are the two main methods
print(r.status_code)
newsjson = json.loads(r.text) 

200


In [16]:
#newsjson

In [17]:
df_news = pd.json_normalize(newsjson, record_path=['articles']) # normalize the JSON into a dataframe, you need to know the ROOT node

In [18]:
df_news.head()

Unnamed: 0,author,title,description,url,urlToImage,publishedAt,content,source.id,source.name
0,Ingrid Lunden,"Zopa, the UK neobank, raises $93M more at a $1...",After raising $300 million in a round led by S...,https://techcrunch.com/2023/02/01/zopa-the-uk-...,https://techcrunch.com/wp-content/uploads/2018...,2023-02-02T00:12:28Z,After raising $300 million in a round led by S...,techcrunch,TechCrunch
1,Harri Weber,'Keep Lex filthy': Users react to queer dating...,"Lex, the venture-backed queer dating app, is c...",https://techcrunch.com/2023/02/01/keep-lex-fil...,https://techcrunch.com/wp-content/uploads/2020...,2023-02-01T23:58:23Z,"Lex, the hookup and social app that launched i...",techcrunch,TechCrunch
2,Natasha Mascarenhas,Disclo aims to inspire inclusive workplaces - ...,Disclo is building software that helps employe...,https://techcrunch.com/2023/02/01/disclo-aims-...,https://techcrunch.com/wp-content/uploads/2021...,2023-02-01T23:21:06Z,Disclo CEO and co-founder Hannah Olson was dia...,techcrunch,TechCrunch
3,Devin Coldewey,'Inaudible' watermark could identify AI-genera...,Resemble AI's proposal for watermarking genera...,https://techcrunch.com/2023/02/01/inaudible-wa...,https://techcrunch.com/wp-content/uploads/2022...,2023-02-01T23:09:07Z,The growing ease with which anyone can create ...,techcrunch,TechCrunch
4,Taylor Hatmaker,Meta stock perks up as the company promises a ...,Meta's stock shot up around 15 percent in afte...,https://techcrunch.com/2023/02/01/meta-q4-2022...,https://techcrunch.com/wp-content/uploads/2022...,2023-02-01T22:35:37Z,"Meta is all-in on becoming a lean, mean cash-p...",techcrunch,TechCrunch
