![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fhackathon&branch=master&subPath=SustainabilityOnMars/Tutorials/accessing-open-data.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Accessing Open Data

Many agencies and governments publish open data. You can find these datasets through a web search or from the Wikipedia article [Open data in Canada](https://en.wikipedia.org/wiki/Open_data_in_Canada).

There are different ways to access those data from a Jupyter notebook. We'll show you examples using CSV (comma-separated values), XLSX (Excel), JSON (JavaScript Object Notation), and an API (Application Programming Interface).

## CSV Data

One of the more common ways to access data is via a CSV link, as in the following example using [COVID-19 statistics from Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19).

If you get an **invalid start byte** error, the file is probably not UTF-8 encoded. Try passing an explicit encoding, for example `df = pd.read_csv(csv_link, encoding='windows-1251')`.

In [None]:
import pandas as pd 
csv_link = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-05-2020.csv'
df = pd.read_csv(csv_link)
df
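
The encoding tip above can be automated by trying several encodings in turn. The following is a minimal sketch, not part of the original example: the helper name `read_csv_with_fallback`, the list of encodings, and the sample data are all assumptions made for illustration. It works on bytes held in memory, so it runs without a network connection.

```python
import io
import pandas as pd

def read_csv_with_fallback(raw_bytes, encodings=("utf-8", "windows-1252", "latin-1")):
    """Hypothetical helper: decode CSV bytes with the first encoding that works.

    Returns a tuple of (DataFrame, name of the encoding that was used).
    """
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
        return pd.read_csv(io.StringIO(text)), encoding
    raise ValueError("none of the encodings could decode the data")

# Made-up sample data, deliberately saved in a non-UTF-8 encoding
sample = "city,province\nMontréal,Québec\n".encode("windows-1252")
df_sample, used = read_csv_with_fallback(sample)
print(used)
df_sample
```

For a real link you could download the bytes first, for example with `requests.get(csv_link).content`, and pass them to the helper. Keep in mind that a successful decode does not guarantee the right encoding was picked, so check the accented characters in the result.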

## Excel Data

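If you have a link to an XLSX file, try this example from [Ontario Open Data](https://data.ontario.ca). It installs `openpyxl`, which pandas uses to read XLSX files; older XLS files need the `xlrd` package instead.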

In [None]:
!pip install openpyxl --user 
import pandas as pd
xlsx_link = 'https://data.ontario.ca/dataset/fb3a7c18-90af-453e-bc0a-a76ecc471862/resource/523b98e0-c677-4ac4-b453-08e9727cb712/download/publicly_funded_schools_xlsx_april_2020_en.xlsx'
df = pd.read_excel(xlsx_link)
df

## JSON Data

If the data are in JSON format, try this example from [Vancouver Open Data](https://opendata.vancouver.ca/pages/home/).

In [None]:
import requests
import pandas as pd
# pd.json_normalize needs pandas 1.0 or newer; on older versions, import json_normalize from pandas.io.json instead

json_link = "https://opendata.vancouver.ca/api/records/1.0/search/?dataset=public-art&rows=500&facet=type&facet=status&facet=sitename&facet=siteaddress&facet=primarymaterial&facet=ownership&facet=neighbourhood&facet=artists&facet=photocredits"
data = requests.get(json_link).json()
df = pd.json_normalize(data['records'])  # flatten the list of nested records into columns
df
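
To see what the normalizing step does, it can help to run it on a small nested structure. This is only a sketch: the records below are made up to imitate a nested API response and are not real Vancouver data, and it assumes pandas 1.0 or newer so that `pd.json_normalize` is available.

```python
import pandas as pd

# Made-up records that imitate a nested JSON response (an assumption for illustration)
data = {
    "records": [
        {"recordid": "a1", "fields": {"type": "Sculpture", "status": "In place"}},
        {"recordid": "b2", "fields": {"type": "Mural", "status": "Removed"}},
    ]
}

# Each record becomes one row; nested keys become dotted column names such as 'fields.type'
df_sample = pd.json_normalize(data["records"])
print(df_sample.columns.tolist())
df_sample
```

Without this step, `pd.DataFrame(data["records"])` would leave each nested dictionary sitting in a single `fields` column.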

## Socrata API

If the data are published using Socrata, you can also use the SODA API ([Socrata Open Data Application Programming Interface](https://dev.socrata.com/)). Have a look at the [Getting Started](https://dev.socrata.com/consumers/getting-started.html) and [API Docs](https://dev.socrata.com/docs/endpoints.html) pages, as well as the following example with [Edmonton Open Data](https://data.edmonton.ca).

In [None]:
import requests
import io
import pandas as pd

domain = 'https://data.edmonton.ca/resource/'
uuid = 'ceg3-ihxx'  # https://data.edmonton.ca/Surveys/Cat-Strategy-Edmonton-Insight-Community/ceg3-ihxx
query = 'SELECT *'

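session = requests.Session()
# Pass the query as a parameter so that requests URL-encodes the space and the asterisk
results = session.get(domain + uuid + '.csv', params={'$query': query})
results.raise_for_status()  # stop here if the request failed
df = pd.read_csv(io.StringIO(results.content.decode('utf-8')))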
df
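
The query string in a request like the one above has to be URL-encoded, because it contains spaces and symbols. Here is a minimal sketch of building such a URL by hand with the standard library. The helper name `build_soda_url` and the `LIMIT 10` query are assumptions for illustration; check the SODA documentation for the clauses your dataset supports. Nothing is downloaded, so it runs offline.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

domain = 'https://data.edmonton.ca/resource/'
uuid = 'ceg3-ihxx'

def build_soda_url(domain, uuid, query, file_format='csv'):
    """Hypothetical helper: return a SODA request URL with the query URL-encoded."""
    return domain + uuid + '.' + file_format + '?' + urlencode({'$query': query})

url = build_soda_url(domain, uuid, 'SELECT * LIMIT 10')
print(url)

# Decoding the query string again recovers the original query text
decoded = parse_qs(urlsplit(url).query)
print(decoded['$query'][0])
```

A URL built this way could then be passed straight to `pd.read_csv(url)`, assuming the dataset is still published at that address.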

## Conclusion

These are a few examples of ways to load online data into a pandas DataFrame, depending on the available format.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)