
# Data Collection

In [2]:
# Installations:
!pip install tabula-py

# Imports:
import requests
import json
from bs4 import BeautifulSoup
from tabula.io import read_pdf
import pandas as pd

# Google Drive mount authentication
from google.colab import drive
drive.mount('/content/drive')

Collecting tabula-py
  Downloading tabula_py-2.2.0-py3-none-any.whl (11.7 MB)
Collecting distro
  Downloading distro-1.5.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.5.0 tabula-py-2.2.0
Mounted at /content/drive


# 1. Fetch Data From API

## 1.1 Exchange Rate Demo

https://exchangeratesapi.io

In [None]:
API_KEY = "7d621c8860a896ffe2d06c2f27317372"
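# Build the request URL; the access key is passed as a query parameter
URL = f"http://api.exchangeratesapi.io/v1/latest?access_key={API_KEY}"

response = requests.get(URL)
print(response)  # <Response [200]> means the request succeeded
text = response.text
json_data = json.loads(text)  # parse the JSON body into a Python dict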

In [None]:
print(json_data)
print(json_data.keys())

In [None]:
df = pd.DataFrame({"rate": json_data['rates']})
df.head()

In [None]:
# Move the currency codes out of the index into a regular column named "index"
df = df.reset_index()
df.head()

In [None]:
df = df.rename(columns={'index': 'currency'})
df.head()

In [None]:
df['base_currency'] = json_data['base']
df['date_accessed'] = json_data['date']
df.head()
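
The reshaping steps above can also be collected into one helper. This is a minimal sketch, not the notebook's own method: `rates_to_frame` is a hypothetical name, and the sample payload only mimics the `base`, `date`, and `rates` keys used above, with made-up numbers so that it runs without an API key.

```python
import pandas as pd

# Hypothetical payload with the same keys as the response used above;
# the rates are invented for illustration.
sample = {
    "base": "EUR",
    "date": "2021-01-01",
    "rates": {"USD": 1.2, "GBP": 0.9, "JPY": 126.0},
}


def rates_to_frame(payload):
    """Return one row per currency, with the base currency and date attached."""
    frame = pd.DataFrame(list(payload["rates"].items()), columns=["currency", "rate"])
    frame["base_currency"] = payload["base"]
    frame["date_accessed"] = payload["date"]
    return frame


rates_df = rates_to_frame(sample)
print(rates_df)
```

Building the frame from `(currency, rate)` pairs gives the `currency` column directly, so the `reset_index` and `rename` steps are not needed.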

## 1.2 Exercise: How Many People Are In Space?

Let's use an open API so we do not need to worry about authentication. Use it to check how many people are currently in space, and be sure to reshape the resulting JSON into a proper DataFrame. For the URL, please use:
`"http://api.open-notify.org/astros.json"`

In [None]:
# YOUR CODE GOES HERE

In [3]:
URL = "http://api.open-notify.org/astros.json"
response = requests.get(URL)
print(response)

<Response [200]>


In [4]:
text = response.text
json_data = json.loads(text)

In [5]:
df = pd.DataFrame(json_data["people"])

df.head()

| | name | craft |
|---|---|---|
| 0 | Mark Vande Hei | ISS |
| 1 | Oleg Novitskiy | ISS |
| 2 | Pyotr Dubrov | ISS |
| 3 | Thomas Pesquet | ISS |
| 4 | Megan McArthur | ISS |
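
To answer the "how many" part of the exercise, the `people` list can simply be counted. The sketch below works offline on a hypothetical response text that contains only the `people` key used in the solution; the names and the second craft are placeholders, not real API output.

```python
import json
from collections import Counter

# Hypothetical response text; only the "people" key from the solution is assumed.
sample_text = """
{"people": [{"name": "Person A", "craft": "ISS"},
            {"name": "Person B", "craft": "ISS"},
            {"name": "Person C", "craft": "Tiangong"}]}
"""

data = json.loads(sample_text)

# Total head count, and a breakdown by spacecraft
total = len(data["people"])
per_craft = Counter(person["craft"] for person in data["people"])

print(total)
print(dict(per_craft))
```

With the live response, the same two lines applied to `json_data["people"]` would give the current count.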


# 2. Fetch Data by Parsing HTML

## 2.1 Video Game Music Demo

We will parse a webpage full of links (https://www.vgmusic.com/music/console/nintendo/nes/).

This is useful when harvesting data that is stored behind links on a web page. Some important background on HTML:
- `<a>` is the anchor element in HTML; we usually search for these elements when parsing for links (https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a).
- `href` is an attribute of `<a>` that holds the URL the hyperlink points to.
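
To make the two bullet points concrete, here is a small offline sketch. The notebook itself uses BeautifulSoup; this example swaps in the standard library's `html.parser.HTMLParser` only so that it runs without extra installs. `HrefCollector` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (None when the tag has no href)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


# Hypothetical HTML: two links and one anchor without an href
snippet = """
<p>Songs:
  <a href="song1.mid">Song 1</a>
  <a href="/about.html">About</a>
  <a name="top">anchor without a link</a>
</p>
"""

collector = HrefCollector()
collector.feed(snippet)
print(collector.hrefs)
```

An anchor without an `href` yields `None`, which is also what BeautifulSoup's `link.get("href")` returns in that case, so any later filtering has to tolerate missing values.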

In [None]:
URL = 'https://www.vgmusic.com/music/console/nintendo/nes/'
response = requests.get(URL)
text = response.text
soup = BeautifulSoup(text, 'html.parser')

In [None]:
soup.find_all("a")[0:10]

In [None]:
len(soup.find_all("a"))

In [None]:
href_list = [link.get("href") for link in soup.find_all("a")]

# code without using list comprehension:
# href_list = []
# for link in soup.find_all("a"):
#   href_list.append(link.get("href"))

In [None]:
df = pd.DataFrame({"song_list": href_list, "base_url": URL})

In [None]:
len(df.song_list)

In [None]:
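# regex=False matches the literal text ".mid" (in a regex, "." matches any character);
# .copy() gives an independent DataFrame so later column assignments do not warn
aux = df[df["song_list"].str.contains(".mid", regex=False, na=False)].copy()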
aux.head()

In [None]:
aux.reset_index(inplace=True)
aux.head()

In [None]:
# axis=1 is for columns 
aux = aux.drop('index', axis=1)

In [None]:
aux['full_path'] = aux['base_url'] + aux["song_list"]
aux.head()

In [None]:
len(aux)

In [None]:
aux.full_path[0]
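
Concatenating `base_url` and the link text works when every `href` is a bare relative file name, which the demo assumes. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles root-relative and absolute links. The `href` values below are hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://www.vgmusic.com/music/console/nintendo/nes/"

# Hypothetical href values: relative, root-relative, and absolute
hrefs = ["song.mid", "/music/other.mid", "https://example.com/a.mid"]

# urljoin resolves each href against the base URL the way a browser would
full_paths = [urljoin(base_url, href) for href in hrefs]
for path in full_paths:
    print(path)
```

In the DataFrame, the same idea could be applied row by row, for example with `aux["song_list"].apply(lambda href: urljoin(URL, href))`.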

## 2.2 Exercise: Google Scholar Link Extraction

Let's practice link extraction from HTML. For the following exercise we will go to Google Scholar and extract every link that leads to an academic article. Be sure to put the results into a tidy list or DataFrame showing just the necessary links.

Please go to Google Scholar: https://scholar.google.com

Search for "astroids", then copy the URL of the results page and use it in your code.


In [12]:
# YOUR CODE GOES HERE

In [13]:
URL = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=astroids&btnG="
response = requests.get(URL)
text = response.text
soup = BeautifulSoup(text, 'html.parser')

In [15]:
href_list = [link.get("href") for link in soup.find_all("a")]

# link.get("href") returns None for anchors without an href, so skip those first
filtered_list = [link for link in href_list if link and "https" in link and "google" not in link]

filtered_list

[]
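
The substring checks in the filter can be made a little stricter by parsing each URL. The following is a sketch under stated assumptions rather than the solution's method: `external_links` is a hypothetical helper, the sample links are invented, and unlike the filter above it accepts `http` as well as `https` and looks for "google" only in the host name.

```python
from urllib.parse import urlparse


def external_links(hrefs, blocked="google"):
    """Keep absolute http(s) links whose host does not contain `blocked`."""
    kept = []
    for href in hrefs:
        if not href:  # skip None and empty strings
            continue
        parts = urlparse(href)
        if parts.scheme in ("http", "https") and blocked not in parts.netloc:
            kept.append(href)
    return kept


# Hypothetical hrefs, including a missing one and a relative one
sample = [
    None,
    "/scholar?q=x",
    "https://accounts.google.com/login",
    "https://example.org/paper.pdf",
    "http://example.edu/article",
]
print(external_links(sample))
```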

# 3. Fetch Data From PDF

https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf

In [None]:
URL = "https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf"
df_list = read_pdf(URL, output_format="dataframe", pages="all")

In [None]:
df = df_list[0]
df.head()

In [None]:
df.columns

In [None]:
df['decrease_accel'] = df["Percent Fuel Savings"].str.split(" ").str[0]
df['eliminate_stops'] = df["Percent Fuel Savings"].str.split(" ").str[1]
df.head()
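
The two `str.split` calls above can also be written as a single split with `expand=True`, which returns one column per piece. This is a small sketch on hypothetical values that only imitate the "two numbers in one cell" shape; the percent signs are an assumption about the data.

```python
import pandas as pd

# Hypothetical cells holding two space-separated values each
combined = pd.Series(["1.0% 2.0%", "3.5% 4.5%"], name="Percent Fuel Savings")

# expand=True turns the split pieces into separate columns (labelled 0 and 1)
parts = combined.str.split(" ", expand=True)
parts.columns = ["decrease_accel", "eliminate_stops"]

# Optional: strip the percent sign and convert the text to floats
numeric = parts.apply(lambda col: col.str.rstrip("%").astype(float))
print(numeric)
```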

In [None]:
df = df[3:7]
df = df.rename(columns={
    "Unnamed: 0": "cycle_name", 
    "Unnamed: 1": "KI",
    "Unnamed: 2": "distance",
    "Unnamed: 3": "improved_speed",
    "Unnamed: 5": "decrease_idle"
})

In [None]:
df = df[["cycle_name", "KI", "distance", "improved_speed", "decrease_accel", "eliminate_stops", "decrease_idle"]]
df.head()

# 4. Fetch Data From CSV / Save Data To CSV 

https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

In [None]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(URL)

In [None]:
df.head()

In [None]:
PATH = "/content/drive/MyDrive/00_temp/countries.csv"
df.to_csv(PATH, index=False)
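
The `index=False` argument keeps the row labels out of the file. The sketch below shows the difference without touching Google Drive, by writing to a string instead of a path; the two-row frame and its column names are hypothetical stand-ins for the countries data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the countries DataFrame
df_small = pd.DataFrame({"Country": ["A", "B"], "Region": ["X", "Y"]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header has only the real columns

# Reading the index-free text back gives the same columns and shape
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

Without `index=False`, reading the file back with `pd.read_csv` produces an extra `Unnamed: 0` column holding the old row labels.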