<a href="https://colab.research.google.com/github/oliverrmaa/data-wrangling-springboard/blob/main/solutions/data_collection_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection

In [2]:
# Installations:
!pip install tabula-py

# Imports:
import requests
import json
from bs4 import BeautifulSoup
from tabula.io import read_pdf
import pandas as pd

# Google Drive mount authentication
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 1. Fetch Data From API

## 1.1 Exchange Rate Demo

https://exchangeratesapi.io

In [3]:
API_KEY = "7d621c8860a896ffe2d06c2f27317372"
URL = "http://api.exchangeratesapi.io/v1/latest?access_key={}".format(API_KEY)
 
response = requests.get(URL)
print(response)
text = response.text
json_data = json.loads(text)

<Response [200]>


In [4]:
print(json_data)
print(json_data.keys())

{'success': True, 'timestamp': 1629496924, 'base': 'EUR', 'date': '2021-08-20', 'rates': {'AED': 4.297482, 'AFN': 100.917283, 'ALL': 121.923875, 'AMD': 573.453668, 'ANG': 2.101145, 'AOA': 744.749732, 'ARS': 113.763929, 'AUD': 1.639874, 'AWG': 2.106675, 'AZN': 1.99373, 'BAM': 1.959976, 'BBD': 2.363399, 'BDT': 99.612608, 'BGN': 1.960138, 'BHD': 0.441132, 'BIF': 2324.889312, 'BMD': 1.17005, 'BND': 1.595174, 'BOB': 8.094222, 'BRL': 6.295375, 'BSD': 1.158349, 'BTC': 2.3766687e-05, 'BTN': 87.044022, 'BWP': 13.271573, 'BYN': 2.944228, 'BYR': 22932.979624, 'BZD': 2.359491, 'CAD': 1.49953, 'CDF': 2342.440485, 'CHF': 1.072878, 'CLF': 0.033356, 'CLP': 920.396879, 'CNY': 7.607085, 'COP': 4530.433526, 'CRC': 726.366467, 'CUC': 1.17005, 'CUP': 31.006324, 'CVE': 110.716027, 'CZK': 25.568171, 'DJF': 207.941743, 'DKK': 7.43696, 'DOP': 66.763512, 'DZD': 158.459398, 'EGP': 18.369495, 'ERN': 17.556381, 'ETB': 52.828215, 'EUR': 1, 'FJD': 2.479629, 'FKP': 0.845482, 'GBP': 0.858752, 'GEL': 3.639313, 'GGP': 0

In [5]:
df = pd.DataFrame({"rate": json_data['rates']})
df.head()

Unnamed: 0,rate
AED,4.297482
AFN,100.917283
ALL,121.923875
AMD,573.453668
ANG,2.101145


In [6]:
df = df.reset_index(inplace=False)
df.head()

Unnamed: 0,index,rate
0,AED,4.297482
1,AFN,100.917283
2,ALL,121.923875
3,AMD,573.453668
4,ANG,2.101145


In [7]:
df = df.rename(columns={'index': 'currency'})
df.head()

Unnamed: 0,currency,rate
0,AED,4.297482
1,AFN,100.917283
2,ALL,121.923875
3,AMD,573.453668
4,ANG,2.101145


In [8]:
df['base_currency'] = json_data['base']
df['date_accessed'] = json_data['date']
df.head()

Unnamed: 0,currency,rate,base_currency,date_accessed
0,AED,4.297482,EUR,2021-08-20
1,AFN,100.917283,EUR,2021-08-20
2,ALL,121.923875,EUR,2021-08-20
3,AMD,573.453668,EUR,2021-08-20
4,ANG,2.101145,EUR,2021-08-20
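
The reshaping steps above (build the frame, reset the index, rename, attach the base currency and date) can also be collected into one helper. The following is a minimal sketch rather than the notebook's own method: the function name `rates_to_frame` and the three-currency `sample` payload are assumptions made for illustration, with the values copied from the response shown earlier.

```python
import pandas as pd


def rates_to_frame(payload: dict) -> pd.DataFrame:
    """Reshape an exchange-rate payload into one row per currency."""
    # A Series built from the 'rates' dict uses the currency codes as its index.
    df = (
        pd.Series(payload["rates"], name="rate")
        .rename_axis("currency")  # name the index so reset_index yields 'currency'
        .reset_index()
    )
    # Scalars are broadcast to every row.
    df["base_currency"] = payload["base"]
    df["date_accessed"] = payload["date"]
    return df


# Hypothetical sample: a three-currency subset of the response printed above.
sample = {
    "base": "EUR",
    "date": "2021-08-20",
    "rates": {"AED": 4.297482, "AFN": 100.917283, "ALL": 121.923875},
}
rates_df = rates_to_frame(sample)
print(rates_df)
```

Keeping the reshaping in a function makes it easy to rerun on a fresh response without repeating each cell.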


## 1.2 Exercise: How Many People Are In Space?

Let's use an open API so that we do not need to worry about authentication. Use it to check how many people are currently in space, and be sure to reshape the resulting JSON into a proper DataFrame. For the URL, please use:
`"http://api.open-notify.org/astros.json"`

In [9]:
# YOUR CODE GOES HERE

In [10]:
URL = "http://api.open-notify.org/astros.json"
response = requests.get(URL)
print(response)

<Response [200]>


In [11]:
text = response.text
json_data = json.loads(text)

In [12]:
df = pd.DataFrame(json_data["people"])

df.head()

Unnamed: 0,name,craft
0,Mark Vande Hei,ISS
1,Oleg Novitskiy,ISS
2,Pyotr Dubrov,ISS
3,Thomas Pesquet,ISS
4,Megan McArthur,ISS
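
Once the `people` list is in a DataFrame, the question the exercise asks (how many people are in space) reduces to counting rows. Here is a small sketch that assumes a hand-written `sample` payload in place of a live response; it reuses the five names from the output above, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample in the same shape as json_data["people"] above.
sample = {
    "people": [
        {"name": "Mark Vande Hei", "craft": "ISS"},
        {"name": "Oleg Novitskiy", "craft": "ISS"},
        {"name": "Pyotr Dubrov", "craft": "ISS"},
        {"name": "Thomas Pesquet", "craft": "ISS"},
        {"name": "Megan McArthur", "craft": "ISS"},
    ]
}

people_df = pd.DataFrame(sample["people"])

# One row per person, so the row count is the head count.
total_people = len(people_df)

# Head count per spacecraft.
people_per_craft = people_df["craft"].value_counts()

print(total_people)
print(people_per_craft)
```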


# 2. Fetch Data By Parsing HTML

## 2.1 Video Game Music Demo

We will parse a webpage full of links (https://www.vgmusic.com/music/console/nintendo/nes/).

This would be of interest if we were harvesting data stored in links from a web-based source. Some important background on HTML:
- `<a>` is the anchor element in HTML; we usually search for these elements when parsing for links (https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a).
- `href` is an attribute of `<a>` that holds the URL the hyperlink points to.
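
To make the two points above concrete, here is a minimal sketch that parses a tiny hand-written HTML snippet instead of a live page. The snippet is an assumption modelled on the kinds of anchors this page contains; note that an anchor with only a `name` attribute has no `href`, so `.get("href")` returns `None` for it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two anchors with href, one with only a name attribute.
html = """
<a href="1943.mid">Title Screen Song</a>
<a name="1943">1943</a>
<a href="/file/abc.html#disqus_thread">Comments</a>
"""

soup = BeautifulSoup(html, "html.parser")

# .get returns None when the attribute is missing instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['1943.mid', None, '/file/abc.html#disqus_thread']
```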

In [13]:
URL = 'https://www.vgmusic.com/music/console/nintendo/nes/'
response = requests.get(URL)
text = response.text
soup = BeautifulSoup(text, 'html.parser')

In [14]:
soup.find_all("a")[0:10]

[<a href="http://www.vgmusic.com/information/donate.php">Please contribute today</a>,
 <a href="/"><img alt="You are surfing the Videogame Music Archive" border="0" height="60" src="/images/mikey57a.gif" width="468"/></a>,
 <a name="10Yard_Fight">10-Yard Fight</a>,
 <a href="10-Yard_Fight-Kick_Off.mid">Kick Off</a>,
 <a href="/file/debcd7c61535f6aba8d4b88d8d0182db.html#disqus_thread">Comments</a>,
 <a name="1943">1943</a>,
 <a href="1943.mid">"Raid and Pacific Attack" Title Screen Song</a>,
 <a href="/file/c6d8c1b732822f614e3b5892b703c58f.html#disqus_thread">Comments</a>,
 <a href="1943sab.mid">Assault on Surface Forces B</a>,
 <a href="/file/07f297682c9731ad956da14180f2aa65.html#disqus_thread">Comments</a>]

In [15]:
len(soup.find_all("a"))

8851

In [16]:
href_list = [link.get("href") for link in soup.find_all("a")]

# code without using list comprehension:
# href_list = []
# for link in soup.find_all("a"):
#   href_list.append(link.get("href"))

In [17]:
df = pd.DataFrame({"song_list": href_list, "base_url": URL})

In [18]:
len(df.song_list)

8851

In [19]:
# regex=False matches the literal text ".mid"; by default "." is a regex wildcard
aux = df[df["song_list"].str.contains(".mid", na=False, regex=False)]
aux.head()

Unnamed: 0,song_list,base_url
3,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...
6,1943.mid,https://www.vgmusic.com/music/console/nintendo...
8,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...
10,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...
12,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...


In [20]:
aux.reset_index(inplace=True)
aux.head()

Unnamed: 0,index,song_list,base_url
0,3,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...
1,6,1943.mid,https://www.vgmusic.com/music/console/nintendo...
2,8,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...
3,10,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...
4,12,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...


In [21]:
# axis=1 is for columns 
aux = aux.drop('index', axis=1)

In [22]:
aux['full_path'] = aux['base_url'] + aux["song_list"]
aux.head()

Unnamed: 0,song_list,base_url,full_path
0,10-Yard_Fight-Kick_Off.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
1,1943.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
2,1943sab.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
3,1943-lev1.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...
4,43pbos1.mid,https://www.vgmusic.com/music/console/nintendo...,https://www.vgmusic.com/music/console/nintendo...


In [23]:
len(aux)

4204

In [24]:
aux.full_path[0]

'https://www.vgmusic.com/music/console/nintendo/nes/10-Yard_Fight-Kick_Off.mid'
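
String concatenation works here because every `.mid` link is relative to the page's own directory. For pages that mix relative, root-relative, and absolute links, the standard library's `urllib.parse.urljoin` is a possible alternative. This is only a sketch: the three example links are taken from the `soup.find_all("a")` output above, and the variable names are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.vgmusic.com/music/console/nintendo/nes/"

# Relative link: resolved against the directory of the base URL.
relative = urljoin(BASE_URL, "10-Yard_Fight-Kick_Off.mid")

# Root-relative link: resolved against the site root, not the current directory.
rooted = urljoin(BASE_URL, "/file/debcd7c61535f6aba8d4b88d8d0182db.html")

# Absolute link: returned unchanged.
absolute = urljoin(BASE_URL, "http://www.vgmusic.com/information/donate.php")

print(relative)
print(rooted)
print(absolute)
```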

## 2.2 Exercise: Google Scholar Link Extraction

Let's practice link extraction from HTML. For the following exercise, we will go to Google Scholar and extract every link that leads to an academic article. Be sure to put the results into a list or DataFrame showing just the necessary links.

Please go to Google Scholar: https://scholar.google.com

Type in "astroids", then copy and paste the resulting URL to use.


In [25]:
# YOUR CODE GOES HERE

In [26]:
URL = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=astroids&btnG="
response = requests.get(URL)
text = response.text
soup = BeautifulSoup(text, 'html.parser')

In [27]:
href_list = [link.get("href") for link in soup.find_all("a")]

# `link and ...` skips anchors without an href, for which link.get("href") returns None
filtered_list = [link for link in href_list if link and "https" in link and "google" not in link]

filtered_list

[]
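
A substring test such as `"google" not in link` also discards any article whose URL merely contains the word somewhere in its path. One possible refinement, sketched below, is to check only the host name with `urllib.parse.urlparse`. The helper name `external_links` and the sample links are hypothetical and are not taken from a real Google Scholar response.

```python
from urllib.parse import urlparse


def external_links(hrefs):
    """Keep absolute https links whose host is not a Google domain."""
    kept = []
    for href in hrefs:
        if not href:  # skip None and empty strings
            continue
        parts = urlparse(href)
        host = parts.hostname or ""
        if parts.scheme == "https" and "google" not in host:
            kept.append(href)
    return kept


# Hypothetical sample of href values, including a missing one.
sample = [
    None,
    "/scholar?q=astroids",
    "https://scholar.google.com/citations",
    "https://example.org/paper.pdf",
    "https://example.org/google-study",
]
print(external_links(sample))
```

In this sample the last link survives because only its path, not its host, mentions "google".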

# 3. Fetch Data From PDF

https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf

In [28]:
URL = "https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf"
df_list = read_pdf(URL, output_format="dataframe", pages="all")

In [29]:
df = df_list[0]
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Percent Fuel Savings,Unnamed: 5
0,Cycle,KI,Distance,,,
1,Name,(1/km),(mi),Improved,Decreased Eliminate,Decreased
2,,,,Speed,Accel Stops,Idle
3,2012_2,3.30,1.3,5.9%,9.5% 29.2%,17.4%
4,2145_1,0.68,11.2,2.4%,0.1% 9.5%,2.7%


In [30]:
df.columns

Index(['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3',
       'Percent Fuel Savings', 'Unnamed: 5'],
      dtype='object')

In [31]:
df['decrease_accel'] = df["Percent Fuel Savings"].str.split(" ").str[0]
df['eliminate_stops'] = df["Percent Fuel Savings"].str.split(" ").str[1]
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Percent Fuel Savings,Unnamed: 5,decrease_accel,eliminate_stops
0,Cycle,KI,Distance,,,,,
1,Name,(1/km),(mi),Improved,Decreased Eliminate,Decreased,Decreased,Eliminate
2,,,,Speed,Accel Stops,Idle,Accel,Stops
3,2012_2,3.30,1.3,5.9%,9.5% 29.2%,17.4%,9.5%,29.2%
4,2145_1,0.68,11.2,2.4%,0.1% 9.5%,2.7%,0.1%,9.5%


In [32]:
df = df[3:7]
df = df.rename(columns={
    "Unnamed: 0": "cycle_name", 
    "Unnamed: 1": "KI",
    "Unnamed: 2": "distance",
    "Unnamed: 3": "improved_speed",
    "Unnamed: 5": "decrease_idle"
  })

In [33]:
df = df[["cycle_name", "KI", "distance", "improved_speed", "decrease_accel", "eliminate_stops", "decrease_idle"]]
df.head()

Unnamed: 0,cycle_name,KI,distance,improved_speed,decrease_accel,eliminate_stops,decrease_idle
3,2012_2,3.3,1.3,5.9%,9.5%,29.2%,17.4%
4,2145_1,0.68,11.2,2.4%,0.1%,9.5%,2.7%
5,4234_1,0.59,58.7,8.5%,1.3%,8.5%,3.3%
6,2032_2,0.17,57.8,21.7%,0.3%,2.7%,1.2%
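
The percentage columns in the cleaned table are still strings such as `"5.9%"`, so they cannot be averaged or compared numerically yet. A possible follow-up step, sketched here on a two-row copy of the table (the `savings` frame and `percent_cols` list are illustrative names), strips the `%` sign and casts to float.

```python
import pandas as pd

# Hypothetical two-row excerpt of the cleaned table above.
savings = pd.DataFrame({
    "cycle_name": ["2012_2", "2145_1"],
    "improved_speed": ["5.9%", "2.4%"],
    "decrease_accel": ["9.5%", "0.1%"],
})

percent_cols = ["improved_speed", "decrease_accel"]
for col in percent_cols:
    # Remove the trailing percent sign, then convert the text to a number.
    savings[col] = savings[col].str.rstrip("%").astype(float)

print(savings.dtypes)
print(savings)
```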


# 4. Fetch Data From CSV / Save Data To CSV 

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

In [34]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(URL)

In [35]:
df.head()

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA


In [37]:
PATH = "/content/drive/MyDrive/countries.csv"
df.to_csv(PATH, index=False)
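
To see what `index=False` changes, the sketch below writes a small frame to an in-memory buffer instead of Google Drive and reads it back. The `io.StringIO` buffers and the two-row `countries` frame are assumptions made so the example runs without a mounted drive.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt of the countries data above.
countries = pd.DataFrame({
    "Country": ["Algeria", "Angola"],
    "Region": ["AFRICA", "AFRICA"],
})

# With index=False, only the data columns are written.
buffer = io.StringIO()
countries.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
restored_columns = restored.columns.tolist()

# With the default index=True, the row index is written as an unnamed first
# column, which read_csv then labels 'Unnamed: 0'.
with_index = io.StringIO()
countries.to_csv(with_index)
with_index.seek(0)
with_index_columns = pd.read_csv(with_index).columns.tolist()

print(restored_columns)    # ['Country', 'Region']
print(with_index_columns)  # ['Unnamed: 0', 'Country', 'Region']
```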