![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-science-and-artificial-intelligence&branch=main&subPath=12-getting-data.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Getting Data

We will look at a few different ways to get more data from sources on the internet.

## Weather Data

A website that will give us current and forecasted weather data is [WTTR](https://wttr.in/). It will try to guess your location, or you can specify a location like https://wttr.in/calgary or https://wttr.in/vancouver,+bc or https://wttr.in/yeg (using a three-letter airport code).

We can also ask for data in [JSON](https://en.wikipedia.org/wiki/JSON) format (with `?format=j1`), which is useful for programming. We'll use the [Requests](https://requests.readthedocs.io/en/latest/) library for downloading data.

In [None]:
import pyodide_http
pyodide_http.patch_all()
import requests
r = requests.get('https://wttr.in/?format=j1')
r.json()['current_condition'][0]
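
To see what that last line is indexing into, here is a minimal offline sketch. The JSON text below is a made-up, heavily trimmed sample in the general shape of a `?format=j1` reply (the values are illustrative assumptions, not real weather data), and it is parsed with the standard `json` module rather than `r.json()`.

```python
import json

# Hypothetical, trimmed sample shaped like a ?format=j1 reply (values invented)
sample_text = '{"current_condition": [{"temp_C": "-5", "humidity": "80", "uvIndex": "1"}]}'

weather = json.loads(sample_text)          # nested dictionaries and lists, like r.json() gives
current = weather['current_condition'][0]  # the list holds one dictionary of current readings
temp_c = int(current['temp_C'])            # in this sample the values are strings, so convert first

print(current)
print(temp_c)
```

Because the readings in this sample are stored as strings, they need converting before they can be compared or plotted as numbers.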

Let's try getting some data to compare the weather in different locations.

In [None]:
locations = ['Edmonton, AB', 'Calgary, AB', 'Victoria, BC']

import pyodide_http
pyodide_http.patch_all()
import requests
import pandas as pd

data = pd.DataFrame(locations, columns=['location'])  # create a dataframe with locations
def get_weather(location):
    r = requests.get('https://wttr.in/'+location+'?format=j1')
    return r.json()['current_condition'][0]
data['weather'] = data['location'].apply(get_weather) # add weather column to dataframe
data

That gave us a dataframe with all of the weather data in a single column. Let's expand that column into separate columns and convert any numeric values to numbers.

In [None]:
data = data.join(pd.DataFrame(data['weather'].tolist())).drop('weather', axis=1)

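for column in data.columns:
    try:
        data[column] = pd.to_numeric(data[column]) # convert the column if every value is numeric
    except (ValueError, TypeError):
        pass # leave non-numeric columns unchanged (errors='ignore' is deprecated in newer pandas)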

data
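
The expand-and-convert step can be tried without an internet connection. This is a small sketch under stated assumptions: the two rows and their values are invented, and only the `temp_C` and `humidity` keys are included.

```python
import pandas as pd

# Hypothetical sample: one dictionary per location, with numbers stored as strings
data = pd.DataFrame({
    'location': ['Edmonton, AB', 'Calgary, AB'],
    'weather': [{'temp_C': '-5', 'humidity': '80'}, {'temp_C': '2', 'humidity': '55'}],
})

expanded = pd.DataFrame(data['weather'].tolist())   # each dictionary key becomes a column
data = data.join(expanded).drop('weather', axis=1)  # attach the new columns, remove the original

for column in ['temp_C', 'humidity']:
    data[column] = pd.to_numeric(data[column])      # strings such as '80' become integers

print(data.dtypes)
```

The `join` lines rows up by index, which works here because both dataframes share the default 0, 1, 2, ... index.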

Now let's try a visualization of some of those columns.

In [None]:
import plotly.express as px
px.bar(data, x='location', y=['humidity','temp_C','visibility','uvIndex'], barmode='group', title='Weather Data')

## Data Tables on Pages

If a webpage contains data tables, we can read them using `pd.read_html()` from the [pandas](https://pandas.pydata.org/) library. It returns a list of dataframes, one for each table it finds on the page.

### Wikipedia

For example, we can read the tables on a Wikipedia page such as [List of Alberta general elections](https://en.wikipedia.org/wiki/List_of_Alberta_general_elections).

In [None]:
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Alberta_general_elections')
for table in tables:
    display(table.head(3))

The second data table looks the most interesting. We can access it with `tables[1]` (the first table is `tables[0]`). In case the page changes and the tables move, we can instead search through the data tables to find the one that contains a `'Seats'` column.

In [None]:
data = [t for t in tables if 'Seats' in t.columns][0]
data.head()
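
Indexing the list comprehension with `[0]` raises an `IndexError` if no table matches. A slightly more defensive variant, sketched here with hand-built dataframes standing in for the list that `pd.read_html()` returns (the table contents are assumptions for illustration), uses `next()` with a default value.

```python
import pandas as pd

# Hypothetical stand-ins for the list of tables read from a page
tables = [
    pd.DataFrame({'Party': ['A', 'B'], 'Leader': ['X', 'Y']}),
    pd.DataFrame({'Year': [1905, 1909], 'Seats': [25, 41]}),
]

# next() returns the first matching table, or None when no table has the column
match = next((t for t in tables if 'Seats' in t.columns), None)

if match is None:
    print("No table with a 'Seats' column was found")
else:
    print(match.head())
```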

If we want to create a visualization of the data, we'll need to convert the values in the columns to numbers. We'll also make sure there are only digits in those columns.

In [None]:
import piplite
await piplite.install(['plotly','nbformat'])
import plotly.express as px

data['Seats'] = data['Seats'].str.extract(r'(\d+)', expand=False) # only keep the first run of digits
data['Seats'] = pd.to_numeric(data['Seats'])
try:
    data['Year'] = data['Year'].str.extract(r'(\d+)', expand=False)
except AttributeError:
    pass # if it's not a column of strings, don't do anything
data['Year'] = pd.to_numeric(data['Year'])
px.line(data, x='Year', y='Seats', title='Seats in Alberta Legislature')
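
To see what the digit extraction does on its own, here is a short sketch using invented cell values of the kind a scraped table might contain (a footnote marker and some trailing text).

```python
import pandas as pd

# Hypothetical cell values, similar to text scraped from a web table
seats = pd.Series(['25', '41[a]', '56 seats'])

digits = seats.str.extract(r'(\d+)', expand=False)  # first run of digits in each string
numbers = pd.to_numeric(digits)                      # turn the digit strings into integers

print(numbers.tolist())
```

Note that the pattern keeps only the first run of digits, so a value written with a thousands separator such as `'1,234'` would come out as `1`. Removing the commas first would be needed in that case.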

### Another Webpage Example

Next let's get the Alberta electrical supply and demand report dashboard from [AESO (Alberta Electric System Operator)](http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet). We'll need to use `header=0` to read the data tables properly.

In [None]:
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
tables = pd.read_html('http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet', header=0)

for table in tables:
    display(table.head(3))

Let's check out `tables[5]` (the sixth data table) that shows us the totals for each type of energy generation technology.

In [None]:
tables[5]

Because of the way it was formatted, the titles are actually on the first row. Let's fix that, convert values to numbers, and then create a bar chart.

In [None]:
data = tables[5]
data.columns = data.iloc[0] # set the column names to the first row
data = data.drop(0) # drop the labels row
data = data.drop(9) # drop the 'TOTAL' row

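for column in data.columns:
    try:
        data[column] = pd.to_numeric(data[column]) # convert the column if every value is numeric
    except (ValueError, TypeError):
        pass # leave non-numeric columns unchanged (errors='ignore' is deprecated in newer pandas)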

px.bar(data, x='GROUP', y='MC', title='Maximum Capacity of Electrical Generation Technologies in Alberta')
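
The header fix can also be tried on a small, made-up table. In this sketch the group names and numbers are invented, and the summary row is removed by its label rather than by its row number, which is a swapped-in alternative that keeps working if the number of rows on the page changes.

```python
import pandas as pd

# Hypothetical table where the real column titles ended up in row 0 (values invented)
table = pd.DataFrame([
    ['GROUP', 'MC'],
    ['GAS', '100'],
    ['WIND', '50'],
    ['TOTAL', '150'],
])

table.columns = table.iloc[0]                    # use the first row as the column names
table = table.drop(0)                            # remove the row that held the titles
table = table[table['GROUP'] != 'TOTAL'].copy()  # drop the summary row by label, not position
table['MC'] = pd.to_numeric(table['MC'])         # convert the capacity strings to numbers

print(table)
```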

---

<span style="color:#663399">Your **assignment** is to create a data visualization from one of the other data tables in this notebook, then paste your visualization into a document.</span>

<span style="color:#FF6633">An **optional advanced challenge** is to create a data visualization from an online table and paste your visualization into a document.</span>

---

That is the end of the "Data Science and Artificial Intelligence" series of notebooks. If you are interested in exploring more, check out [Callysto.ca](https://www.callysto.ca/).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)