<a href="https://colab.research.google.com/github/oliverrmaa/data-wrangling-springboard/blob/main/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection

In [22]:
# Installations:
#!pip install tabula-py

# Imports:
import requests
import json
from bs4 import BeautifulSoup
from tabula.io import read_pdf
import pandas as pd

# Google Drive mount authentication
#from google.colab import drive
#drive.mount('/content/drive')

# 1. Fetch Data From API

## 1.1 Exchange Rate Demo

https://exchangeratesapi.io

In [23]:
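# Access key issued by exchangeratesapi.io; it is sent with every request
API_KEY = "7d621c8860a896ffe2d06c2f27317372"
URL = "http://api.exchangeratesapi.io/v1/latest?access_key={}".format(API_KEY)

# Request the latest rates and decode the JSON body into a Python dict
response = requests.get(URL)
print(response)
text = response.text
json_data = json.loads(text)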

<Response [200]>
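
Printing the response only shows the status code. A minimal sketch of a more defensive fetch follows; the helper name `fetch_json`, its `http` parameter, and the 10-second timeout are assumptions made for illustration, not part of the original notebook. It relies on `raise_for_status()` and `response.json()` from `requests`.

```python
import requests


def fetch_json(url, params=None, timeout=10, http=requests):
    """Hypothetical helper: GET a URL and return the decoded JSON body.

    `http` defaults to the requests module; any object with a compatible
    .get() method (for example a requests.Session) can be passed instead.
    """
    response = http.get(url, params=params, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    # Equivalent to json.loads(response.text)
    return response.json()


# Example call (needs network access and a valid key, so it is left commented out):
# json_data = fetch_json("http://api.exchangeratesapi.io/v1/latest",
#                        params={"access_key": API_KEY})
```

Passing the key through `params` lets `requests` build the query string, so the URL does not need `str.format`.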


In [24]:
print(json_data)
print(json_data.keys())

{'success': True, 'timestamp': 1629562939, 'base': 'EUR', 'date': '2021-08-21', 'rates': {'AED': 4.296196, 'AFN': 100.619999, 'ALL': 121.829999, 'AMD': 574.159996, 'ANG': 2.0967, 'AOA': 744.52696, 'ARS': 113.634699, 'AUD': 1.639383, 'AWG': 2.106045, 'AZN': 1.993134, 'BAM': 1.95583, 'BBD': 2.3584, 'BDT': 99.401899, 'BGN': 1.9595, 'BHD': 0.441, 'BIF': 2317.449983, 'BMD': 1.1697, 'BND': 1.5918, 'BOB': 8.0771, 'BRL': 6.3256, 'BSD': 1.1681, 'BTC': 2.3699672e-05, 'BTN': 86.859899, 'BWP': 13.2435, 'BYN': 2.938, 'BYR': 22926.11983, 'BZD': 2.3545, 'CAD': 1.499842, 'CDF': 2341.739806, 'CHF': 1.07314, 'CLF': 0.033346, 'CLP': 920.121567, 'CNY': 7.604809, 'COP': 4520.899967, 'CRC': 724.829995, 'CUC': 1.1697, 'CUP': 30.99705, 'CVE': 110.264999, 'CZK': 25.566426, 'DJF': 207.939998, 'DKK': 7.435437, 'DOP': 66.638, 'DZD': 158.411999, 'EGP': 18.364, 'ERN': 17.55113, 'ETB': 53.1318, 'EUR': 1, 'FJD': 2.480876, 'FKP': 0.845229, 'GBP': 0.858653, 'GEL': 3.638224, 'GGP': 0.845229, 'GHS': 7.0551, 'GIP': 0.8452

In [25]:
df = pd.DataFrame({"rate": json_data['rates']})
df.head()

Unnamed: 0,rate
AED,4.296196
AFN,100.619999
ALL,121.829999
AMD,574.159996
ANG,2.0967


In [26]:
# Move the currency codes out of the index and into a regular column
df = df.reset_index()
df.head()

Unnamed: 0,index,rate
0,AED,4.296196
1,AFN,100.619999
2,ALL,121.829999
3,AMD,574.159996
4,ANG,2.0967


In [27]:
df = df.rename(columns={'index': 'currency'})
df.head()

Unnamed: 0,currency,rate
0,AED,4.296196
1,AFN,100.619999
2,ALL,121.829999
3,AMD,574.159996
4,ANG,2.0967


In [28]:
df['base_currency'] = json_data['base']
df['date_accessed'] = json_data['date']
df.head()

Unnamed: 0,currency,rate,base_currency,date_accessed
0,AED,4.296196,EUR,2021-08-21
1,AFN,100.619999,EUR,2021-08-21
2,ALL,121.829999,EUR,2021-08-21
3,AMD,574.159996,EUR,2021-08-21
4,ANG,2.0967,EUR,2021-08-21
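
The reshaping steps above (build the frame, reset the index, rename, add the metadata columns) can be collected into one function. This is a sketch under assumptions: `rates_to_frame` is a hypothetical name, and `sample` is a three-currency excerpt of the response printed earlier, used so the example runs without calling the API.

```python
import pandas as pd


def rates_to_frame(payload):
    """Hypothetical helper: turn an exchange-rate payload into a tidy DataFrame."""
    # The 'rates' dict becomes a Series indexed by currency code
    rates = pd.Series(payload["rates"], name="rate")
    # Name the index, then move it into a regular 'currency' column
    df = rates.rename_axis("currency").reset_index()
    # Scalars are broadcast to every row
    df["base_currency"] = payload["base"]
    df["date_accessed"] = payload["date"]
    return df


# Excerpt of the response shown above
sample = {
    "base": "EUR",
    "date": "2021-08-21",
    "rates": {"AED": 4.296196, "AFN": 100.619999, "ALL": 121.829999},
}
tidy = rates_to_frame(sample)
print(tidy)
```

Wrapping the steps in a function makes it easy to apply the same reshaping to later responses.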


## 1.2 Exercise: How Many People Are In Space?

Let's use an open API so that we do not need to worry about authentication. Use it to check how many people are currently in space, and be sure to reshape the resulting JSON into a proper DataFrame. For the URL, please use:
`"http://api.open-notify.org/astros.json"`

In [29]:
url = "http://api.open-notify.org/astros.json"
response = requests.get(url)
print(response)
text = response.text
json_data = json.loads(text)
print(json_data)

<Response [200]>
{'people': [{'name': 'Mark Vande Hei', 'craft': 'ISS'}, {'name': 'Oleg Novitskiy', 'craft': 'ISS'}, {'name': 'Pyotr Dubrov', 'craft': 'ISS'}, {'name': 'Thomas Pesquet', 'craft': 'ISS'}, {'name': 'Megan McArthur', 'craft': 'ISS'}, {'name': 'Shane Kimbrough', 'craft': 'ISS'}, {'name': 'Akihiko Hoshide', 'craft': 'ISS'}, {'name': 'Nie Haisheng', 'craft': 'Tiangong'}, {'name': 'Liu Boming', 'craft': 'Tiangong'}, {'name': 'Tang Hongbo', 'craft': 'Tiangong'}], 'number': 10, 'message': 'success'}


In [30]:
df = pd.DataFrame(json_data)
df

Unnamed: 0,people,number,message
0,"{'name': 'Mark Vande Hei', 'craft': 'ISS'}",10,success
1,"{'name': 'Oleg Novitskiy', 'craft': 'ISS'}",10,success
2,"{'name': 'Pyotr Dubrov', 'craft': 'ISS'}",10,success
3,"{'name': 'Thomas Pesquet', 'craft': 'ISS'}",10,success
4,"{'name': 'Megan McArthur', 'craft': 'ISS'}",10,success
5,"{'name': 'Shane Kimbrough', 'craft': 'ISS'}",10,success
6,"{'name': 'Akihiko Hoshide', 'craft': 'ISS'}",10,success
7,"{'name': 'Nie Haisheng', 'craft': 'Tiangong'}",10,success
8,"{'name': 'Liu Boming', 'craft': 'Tiangong'}",10,success
9,"{'name': 'Tang Hongbo', 'craft': 'Tiangong'}",10,success


In [31]:
# Build the frame from the list under 'people' so that each key becomes a column
df2 = pd.DataFrame(json_data['people'])
df2

Unnamed: 0,name,craft
0,Mark Vande Hei,ISS
1,Oleg Novitskiy,ISS
2,Pyotr Dubrov,ISS
3,Thomas Pesquet,ISS
4,Megan McArthur,ISS
5,Shane Kimbrough,ISS
6,Akihiko Hoshide,ISS
7,Nie Haisheng,Tiangong
8,Liu Boming,Tiangong
9,Tang Hongbo,Tiangong
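
To answer the exercise question directly, the rows can be counted and grouped by craft. The sketch below reuses the response printed above as a literal so it runs offline; the variable names `people`, `total`, and `per_craft` are illustrative.

```python
import pandas as pd

# Response copied from the output above
payload = {
    "people": [
        {"name": "Mark Vande Hei", "craft": "ISS"},
        {"name": "Oleg Novitskiy", "craft": "ISS"},
        {"name": "Pyotr Dubrov", "craft": "ISS"},
        {"name": "Thomas Pesquet", "craft": "ISS"},
        {"name": "Megan McArthur", "craft": "ISS"},
        {"name": "Shane Kimbrough", "craft": "ISS"},
        {"name": "Akihiko Hoshide", "craft": "ISS"},
        {"name": "Nie Haisheng", "craft": "Tiangong"},
        {"name": "Liu Boming", "craft": "Tiangong"},
        {"name": "Tang Hongbo", "craft": "Tiangong"},
    ],
    "number": 10,
    "message": "success",
}

# One row per person
people = pd.DataFrame(payload["people"])

# Total head count, and head count per spacecraft
total = len(people)
per_craft = people["craft"].value_counts()

print(total)
print(per_craft)
```

The row count should agree with the `number` field that the API reports.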


# 2. Fetch Data By Parsing HTML

## 2.1 Video Game Music Demo

We will parse a webpage full of links (https://www.vgmusic.com/music/console/nintendo/nes/).

This is useful when harvesting data that a web-based source exposes as links. Some important background on HTML:
- `<a>` is the anchor element in HTML; we usually search for these elements when parsing for links (https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a).
- `href` is an attribute of `<a>` that holds the hyperlink's destination.
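
Both points can be seen on a small snippet before touching the real page. The HTML string below is an abbreviated sample modeled on the anchors this page returns; note that an `<a>` without an `href` yields `None`.

```python
from bs4 import BeautifulSoup

# Three anchors: one with only a name, two with an href
html = """
<a name="1943">1943</a>
<a href="1943.mid">Title Screen Song</a>
<a href="/file/c6d8c1b732822f614e3b5892b703c58f.html#disqus_thread">Comments</a>
"""

soup = BeautifulSoup(html, "html.parser")

# .get("href") returns None when the attribute is missing
hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the anchors that actually carry an href
with_href = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs)
print(with_href)
```

The `None` entries are why later filtering steps need to handle missing values.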

In [32]:
URL = 'https://www.vgmusic.com/music/console/nintendo/nes/'
response = requests.get(URL)
text = response.text
soup = BeautifulSoup(text, 'html.parser')

In [33]:
soup.find_all("a")[0:10]

[<a href="http://www.vgmusic.com/information/donate.php">Please contribute today</a>,
 <a href="/"><img alt="You are surfing the Videogame Music Archive" border="0" height="60" src="/images/mikey57a.gif" width="468"/></a>,
 <a name="10Yard_Fight">10-Yard Fight</a>,
 <a href="10-Yard_Fight-Kick_Off.mid">Kick Off</a>,
 <a href="/file/debcd7c61535f6aba8d4b88d8d0182db.html#disqus_thread">Comments</a>,
 <a name="1943">1943</a>,
 <a href="1943.mid">"Raid and Pacific Attack" Title Screen Song</a>,
 <a href="/file/c6d8c1b732822f614e3b5892b703c58f.html#disqus_thread">Comments</a>,
 <a href="1943sab.mid">Assault on Surface Forces B</a>,
 <a href="/file/07f297682c9731ad956da14180f2aa65.html#disqus_thread">Comments</a>]

In [34]:
len(soup.find_all("a"))

8851

In [35]:
href_list = [link.get("href") for link in soup.find_all("a")]

# code without using list comprehension:
# href_list = []
# for link in soup.find_all("a"):
#   href_list.append(link.get("href"))

In [36]:
df = pd.DataFrame({"song_list": href_list, "base_url": URL})

In [37]:
len(df.song_list)

8851

In [38]:
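# regex=False matches the literal ".mid" (as a regex, "." would match any character);
# .copy() lets us add columns to aux later without a SettingWithCopyWarning
aux = df[df["song_list"].str.contains(".mid", regex=False, na=False)].copy()
aux.head()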

Unnamed: 0,song_list,base_url
3,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...
6,1943.mid,https://www.vgmusic.com/music/console/nintendo...
8,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...
10,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...
12,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...


In [39]:
aux.reset_index(inplace=True)
aux.head()

Unnamed: 0,index,song_list,base_url
0,3,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...
1,6,1943.mid,https://www.vgmusic.com/music/console/nintendo...
2,8,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...
3,10,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...
4,12,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...


In [40]:
# axis=1 is for columns 
aux = aux.drop('index', axis=1)

In [41]:
aux['full_path'] = aux['base_url'] + aux["song_list"]
aux.head()

Unnamed: 0,song_list,base_url,full_path
0,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
1,1943.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
2,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
3,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
4,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...


In [42]:
len(aux)

4204

In [43]:
aux.full_path[0]

'https://www.vgmusic.com/music/console/nintendo/nes/10-Yard_Fight-Kick_Off.mid'
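
String concatenation works here because every `.mid` link on this page is relative to the page URL. As an alternative sketch, `urllib.parse.urljoin` from the standard library also handles root-relative and absolute links; the helper name `midi_urls` is hypothetical, and the sample hrefs are taken from the `find_all` output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.vgmusic.com/music/console/nintendo/nes/"


def midi_urls(base_url, hrefs):
    """Hypothetical helper: absolute URLs for hrefs ending in .mid."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        # Skip anchors without an href (None) and links to non-MIDI pages
        if href and href.lower().endswith(".mid")
    ]


sample = [
    None,
    "10-Yard_Fight-Kick_Off.mid",
    "/file/debcd7c61535f6aba8d4b88d8d0182db.html#disqus_thread",
    "1943.mid",
]
urls = midi_urls(BASE_URL, sample)
print(urls)
```

Checking the suffix with `endswith` is stricter than a substring match, since it ignores links that merely contain `.mid` somewhere in the middle.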

## 2.2 Exercise: Google Scholar Link Extraction

Let's practice link extraction from HTML. For the following exercise we will go to Google Scholar and extract every link that leads to an academic article. Be sure to put the results into a tidy list or DataFrame showing just the necessary links.

Please go to Google Scholar: https://scholar.google.com

Search for "asteroids", then copy the URL of the results page to use in the code below.

In [50]:
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=asteroids&oq=astero"
response2 = requests.get(url)
text2 = response2.text
soup2 = BeautifulSoup(text2, 'html.parser')

In [51]:
soup2.find_all("a")[0:10]

[<a aria-label="Cancel" class="gs_btnCLS gs_md_x gs_md_hdr_c gs_in_ib gs_btn_lrge" data-mdx="gs_cit" href="javascript:void(0)" id="gs_cit-x" role="button"><span class="gs_ico"></span><span class="gs_lbl"></span></a>,
 <a aria-label="Cancel" class="gs_btnCLS gs_md_x gs_md_hdr_c gs_in_ib gs_btn_lrge" data-mdx="gs_asd" href="javascript:void(0)" id="gs_asd-x" role="button"><span class="gs_ico"></span><span class="gs_lbl"></span></a>,
 <a aria-controls="gs_hdr_drw" aria-label="Options" class="gs_btnMNT gs_in_ib gs_btn_lrge" href="javascript:void(0)" id="gs_hdr_drw_mnu" role="button"><span class="gs_ico"></span><span class="gs_lbl"></span></a>,
 <a aria-label="Homepage" href="/schhp?hl=en&amp;oe=ASCII&amp;as_sdt=0,5" id="gs_hdr_drw_lgo"></a>,
 <a class="gs_in_ib gs_md_li gs_md_lix gs_in_gray gs_sel" href="/scholar?as_sdt=0,5&amp;q=asteroids&amp;hl=en&amp;oe=ASCII" role="menuitem"><span class="gs_ico"></span><span class="gs_lbl">Articles</span></a>,
 <a class="gs_in_ib gs_md_li gs_md_lix gs_i
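
The first anchors are all page controls (`javascript:void(0)`, links back into Scholar), so some filtering is needed to isolate article links. One possible heuristic, stated as an assumption rather than a rule of the page: keep only `http`/`https` links whose host differs from the results page. The helper `external_links` is hypothetical, and the `example.org` entry is a placeholder standing in for an article link.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://scholar.google.com/scholar?q=asteroids"


def external_links(page_url, hrefs):
    """Hypothetical helper: keep http(s) links that leave the page's host."""
    page_host = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        if not href:
            continue  # anchor without an href
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != page_host:
            kept.append(absolute)
    return kept


sample = [
    "javascript:void(0)",                  # page control, from the output above
    "/schhp?hl=en&oe=ASCII&as_sdt=0,5",    # link back into Scholar
    "https://example.org/asteroid-paper",  # placeholder for an article link
]
print(external_links(PAGE_URL, sample))
```

On the real page this would be called with `[a.get("href") for a in soup2.find_all("a")]`; the result may still contain non-article links and should be inspected by hand.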

# 3. Fetch Data From PDF

https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf

In [52]:
URL = "https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf"
df_list = read_pdf(URL, output_format="dataframe", pages="all")

JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
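
The error above is raised because tabula-py needs a `java` executable on the PATH. A small pre-flight check, sketched with the standard library only (the helper name `java_available` is an assumption), makes the failure easier to diagnose before calling `read_pdf`:

```python
import shutil


def java_available():
    """Hypothetical helper: True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


if java_available():
    print("java found at", shutil.which("java"))
else:
    print("java not found; install a Java runtime before calling read_pdf")
```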

In [None]:
df = df_list[0]
df.head()

In [None]:
df.columns

In [None]:
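# "Percent Fuel Savings" holds two space-separated values per row;
# split them into separate columns
df['decrease_accel'] = df["Percent Fuel Savings"].str.split(" ").str[0]
df['eliminate_stops'] = df["Percent Fuel Savings"].str.split(" ").str[1]
df.head()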

In [None]:
# Keep rows 3 through 6 and give the unnamed columns descriptive names
df = df[3:7]
df = df.rename(columns={
    "Unnamed: 0": "cycle_name",
    "Unnamed: 1": "KI",
    "Unnamed: 2": "distance",
    "Unnamed: 3": "improved_speed",
    "Unnamed: 5": "decrease_idle"
})

In [None]:
df = df[["cycle_name", "KI", "distance", "improved_speed", "decrease_accel", "eliminate_stops", "decrease_idle"]]
df.head()
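
Since the cells above did not run, here is a self-contained sketch of the column-splitting step. The values are placeholders invented for illustration, not figures from the PDF; only the column names come from the cells above. `str.split(..., expand=True)` splits once and returns both parts as columns.

```python
import pandas as pd

# Placeholder values, NOT taken from the PDF
raw = pd.DataFrame({"Percent Fuel Savings": ["1.0% 2.0%", "3.0% 4.0%"]})

# Split each string at the first space; expand=True returns a DataFrame
# whose columns 0 and 1 hold the two parts
parts = raw["Percent Fuel Savings"].str.split(" ", n=1, expand=True)
raw["decrease_accel"] = parts[0]
raw["eliminate_stops"] = parts[1]

print(raw)
```

This avoids calling `str.split` twice on the same column.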

# 4. Fetch Data From CSV / Save Data To CSV 

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

In [53]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(URL)

In [54]:
df.head()

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA


In [55]:
PATH = "/content/drive/MyDrive/00_temp/countries.csv"
df.to_csv(PATH, index=False)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/00_temp/countries.csv'
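
`to_csv` does not create missing folders, so the call fails when the target directory does not exist (for example when Google Drive is not mounted, or `00_temp` has not been created). A sketch that creates the parent directory first; the helper name `save_csv` is an assumption, and a temporary directory stands in for the Drive path so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Hypothetical helper: write df to path, creating parent folders if needed."""
    path = Path(path)
    # parents=True creates intermediate folders; exist_ok=True tolerates reruns
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# First two rows of the countries table shown above
df = pd.DataFrame({"Country": ["Algeria", "Angola"], "Region": ["AFRICA", "AFRICA"]})

# A temporary folder stands in for /content/drive/MyDrive
out = save_csv(df, Path(tempfile.mkdtemp()) / "00_temp" / "countries.csv")
print(out)
```

In Colab, the Drive mount cell at the top of the notebook would still need to be run before writing under `/content/drive`.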