In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Web Scraping and Examining Solar Energy Usage:

In this notebook we continue our analysis of the solar energy data collected here:

http://www.energy.ca.gov/almanac/renewables_data/solar/index.php


In the previous notebook we developed a few key functions to process the tables on the energy website.  We will reuse those here:

In [None]:
def find_table(name, tables):
    # Each table's title sits in its first cell, so index the tables by title
    return {t.iloc[0,0]: t for t in tables}[name].copy()

In [None]:
def clean_solar_table(table):
    table = table.copy()
    # Extract and set the column names
    table.columns = table.iloc[1,:].values
    # drop headers and summary at end
    table = table.iloc[2:-1]
    # Change types
    table = table.astype({"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
    return table.reset_index(drop=True)

In [None]:
def extract_and_combine_pv_and_thermal(tables):
    thermal_table = clean_solar_table(find_table("Solar Thermal", tables))
    pv_table = clean_solar_table(find_table("Solar PV", tables))
    thermal_table["Kind"] = "Thermal"
    pv_table["Kind"] = "PV"
    return pd.concat([thermal_table, pv_table]).reset_index(drop=True)
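
To make the cleaning steps concrete, here is a minimal sketch that applies the same steps to a small, made-up table. It assumes the layout the functions above rely on (title in the first cell, column names in the second row, a summary row at the end). The plant names and numbers are invented for illustration and do not come from the website.

```python
import pandas as pd

# Made-up raw table in the layout the cleaning function expects:
# row 0 = title, row 1 = column names, last row = summary.
raw = pd.DataFrame([
    ["Solar PV", None, None, None],
    ["Year", "Plant Name", "Capacity (MW)", "Net MWh"],
    ["2012", "Hypothetical Plant A", "10.5", "20000"],
    ["2012", "Hypothetical Plant B", "5", "9000"],
    ["Total", None, "15.5", "29000"],
])

# The title in the first cell is what find_table matches on
title = raw.iloc[0, 0]

# Same steps as clean_solar_table, written out one at a time
cleaned = raw.copy()
cleaned.columns = cleaned.iloc[1, :].values      # promote row 1 to column names
cleaned = cleaned.iloc[2:-1]                     # drop title, header, and summary rows
cleaned = cleaned.astype(
    {"Year": "int", "Net MWh": "float", "Capacity (MW)": "float"})
cleaned = cleaned.reset_index(drop=True)

print(title)
print(cleaned)
```

Only the two plant rows survive, and the numeric columns are now real numbers rather than strings.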

# Examining Sources Over Time

What if we wanted to examine these sources over time?  We can get more information from the website:

Try opening the page using developer tools in [Chrome](https://developer.chrome.com/devtools) or [Safari](https://developer.apple.com/safari/tools/).


![Webpage](webpage_form.png)

Notice that we can select a different year.  To do this we need to send some additional information to the web server.  If you look at the HTML source on the right, you can see that the website requires `POST`ing additional values to access a particular year.

We can do this using the Python [`requests` library](http://docs.python-requests.org/en/master/user/quickstart/):

In the following we make a `POST` request with the body containing `newYear=2012`.

In [None]:
import requests
resp = requests.post(
    "http://www.energy.ca.gov/almanac/renewables_data/solar/index.php", 
    data = {'newYear':'2012'})
resp

### Examining the Request

In [None]:
resp.request.method

In [None]:
resp.request.path_url

In [None]:
for k in resp.request.headers:
    print(k, "=", resp.request.headers[k])

In [None]:
resp.request.body
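
When `requests` is given a dictionary as `data`, it form-encodes it into the request body. As a rough sketch of what that encoding looks like, the standard library can produce and parse the same kind of string without contacting the server (the variable names here are only for illustration):

```python
from urllib.parse import urlencode, parse_qs

# The same form values we passed to requests.post above
form_values = {"newYear": "2012"}

# Form-encode the dictionary into key=value pairs
body = urlencode(form_values)
print(body)            # newYear=2012

# A server would parse the body back into keys and lists of values
print(parse_qs(body))  # {'newYear': ['2012']}
```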

### Examining the Response

In [None]:
resp.status_code

In [None]:
for k in resp.headers:
    print(k, ":", resp.headers[k])

In [None]:
resp.content[0:500]

Notice that the content is returned as raw bytes.  If we want to work with the text version of the content, we need to decode it using the correct character encoding.  This can be done using the `decode` method and the encoding named in the `Content-Type` response header.

In [None]:
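# The encoding follows "charset=" in the Content-Type header.
# This simple split assumes the header actually includes a charset.
encoding = resp.headers['Content-Type'].split("=")[-1]
encoding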

In [None]:
resp.content.decode(encoding)[0:500]
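
Splitting the header on `=` only works when the server includes a charset. A slightly more defensive sketch, using the standard library's header parser, is shown below. The helper name `charset_from_content_type` and the `utf-8` fallback are assumptions made for this example, not part of the original analysis.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption for this sketch) when the
    header does not name a charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # utf-8 (fallback)

# Decoding raw bytes with the charset we found
raw_bytes = "caf\u00e9".encode("utf-8")
print(raw_bytes.decode(charset_from_content_type("text/html; charset=UTF-8")))
```

The `requests` response object also exposes `resp.encoding` and `resp.text`, which may be more convenient when you do not need to inspect the header yourself.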

## Loading the Response into Pandas

Alternatively, we can pass the HTML to pandas to parse into tables as before.  Notice that the tables now reflect the year we requested.

In [None]:
tables = pd.read_html(resp.content, encoding=encoding)
for t in tables:
    display(t.head())

# Downloading All Years

We would now like to programmatically extract the data for all the available years.  We can break this into three steps:

1. Get the list of possible years
1. Download the data for each year
1. Combine the data into a single DataFrame

## Get the list of possible years

We would like to programmatically extract the possible years we can submit to the form.  To do this we will use the [`BeautifulSoup` (version 4) library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).  This is a fairly sophisticated library for reading and navigating HTML documents.  We won't cover this library in detail in Data100, but it will be helpful for you to know about it:

In [None]:
from bs4 import BeautifulSoup

### Beautiful Soup Makes HTML Readable

You can use Beautiful Soup to parse an HTML document and reindent it:

In [None]:
dom = BeautifulSoup(resp.content.decode(encoding), "html.parser")
print(dom.prettify())

### Finding the list of years

If we return to the web page and explore the DOM we find that the form has an `id`.  

![goYear](goyear.png)

**Element `id`s are (should be) unique so we can look for this form in the DOM tree by searching for its `id`.**

In [None]:
forms = dom.find_all(id="goYear")
forms

In [None]:
print(forms[0].prettify())

Notice that this form contains several option tags.   We can find all the option tags in this form:


In [None]:
form = forms[0]
opt_tags = form.find_all("option")
opt_tags

This again returns a list of elements.  Notice that each option has a `value` attribute, which is the value posted when the form is submitted.  Let's extract these attributes:

In [None]:
[o.attrs for o in opt_tags]

The attributes are Python dictionaries, so we can get the year by using the key `"value"`:

In [None]:
# Deduplicate with a set, then sort so the years are processed in a stable order
years = sorted({o.attrs["value"] for o in opt_tags})
years
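
If you cannot load the page, the following sketch runs the same lookup on a small, hand-written form. The markup is hypothetical: only the `goYear` id and the `newYear` field name come from the steps above, and the option values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real form on the page
demo_html = """
<form id="goYear" method="post">
  <select name="newYear">
    <option value="2011">2011</option>
    <option value="2012" selected>2012</option>
    <option value="2013">2013</option>
  </select>
</form>
"""

demo_dom = BeautifulSoup(demo_html, "html.parser")

# Look up the form by its id, then read the field name and option values
demo_form = demo_dom.find(id="goYear")
field_name = demo_form.find("select")["name"]
demo_years = sorted(o["value"] for o in demo_form.find_all("option"))

print(field_name, demo_years)  # newYear ['2011', '2012', '2013']
```

The `name` of the `select` element is the key we send in the `POST` body, and each option's `value` is a candidate for what we send with it.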

## Download All the Data

In the following block of code we use the years that we just collected to **repeatedly** query the website and download all the data for each year.

In [None]:
dfs = []

for y in years:
    print("Downloading Year:", y)

    # Get the data
    r = requests.post("http://www.energy.ca.gov/almanac/renewables_data/solar/index.php", 
              data = {'newYear': y})
    
    # Get all the tables
    tables = pd.read_html(r.content)

    # Get the two tables
    df = extract_and_combine_pv_and_thermal(tables)

    # Save the dataframe
    dfs.append(df)
    
    


## Combine All the Data into a Single DataFrame

In [None]:
data = pd.concat(dfs).reset_index(drop=True)

In [None]:
data.head()
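
The `reset_index(drop=True)` call matters because each per-year DataFrame starts its index at 0. A small sketch with toy data (not from the website) shows the difference:

```python
import pandas as pd

# Two toy per-year frames, each with its own 0-based index
df_a = pd.DataFrame({"Year": [2011, 2011], "Kind": ["PV", "Thermal"]})
df_b = pd.DataFrame({"Year": [2012, 2012], "Kind": ["PV", "Thermal"]})

# Concatenating keeps the original index labels, so they repeat
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())   # [0, 1, 0, 1]

# Resetting the index gives every row a unique label again
combined = stacked.reset_index(drop=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```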

### Save a Backup

In [None]:
# index=False avoids writing the row index as an extra unnamed column
data.to_csv("cal_energy_data_all_years.csv", index=False)

# Finishing the Analysis

Let's examine the growth in the Thermal and PV energy production over the past decade.

1. Construct a Pivot Table of **year** by **kind** with the entries containing the total **Net MWh**.
1. Plot it!

In [None]:
data.pivot_table(values="Net MWh", 
                 index="Year", columns="Kind", aggfunc="sum")
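
To see what the pivot table is doing, here is the same call on a tiny toy DataFrame. The numbers are made up for illustration.

```python
import pandas as pd

# Toy data: two PV plants and one thermal plant in 2011, one of each in 2012
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012, 2012],
    "Kind": ["PV", "PV", "Thermal", "PV", "Thermal"],
    "Net MWh": [100.0, 50.0, 300.0, 400.0, 350.0],
})

# One row per year, one column per kind, each cell the summed Net MWh
toy_pivot = toy.pivot_table(values="Net MWh",
                            index="Year", columns="Kind", aggfunc="sum")
print(toy_pivot)
```

The two PV rows for 2011 collapse into a single cell holding their sum, which is exactly the per-year total we want to plot.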

In [None]:
# DataFrame.plot returns a matplotlib Axes (not a Figure), so name it accordingly
ax = (
    data.pivot_table(values="Net MWh", 
                     index="Year", columns="Kind", aggfunc="sum")
        .plot(kind='line')
)
plt.setp(ax.lines, marker="x")
ax.set_ylabel("Net MWh")

## Interactive Plotly Visualization

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected=False)

import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf

cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

In [None]:
(
    data.pivot_table(values="Net MWh", 
                     index="Year", columns="Kind", aggfunc="sum")
        .iplot(kind='line',  yTitle = "Net MWh")
)

In [None]:
(
    data.pivot_table(values="Capacity (MW)", 
                     index="Year", columns="Kind", aggfunc="sum")
        .iplot(kind='line',  yTitle = "Total Capacity (MW)")
)

Who are the big providers?

In [None]:
(
    data.groupby("Plant Name")[["Capacity (MW)"]].sum()
        .sort_values("Capacity (MW)")
        .tail(30)
        .iplot(kind='bar',  yTitle = "Total Capacity (MW)")
)

## The Big Producers:

1. [Topaz Solar](https://en.wikipedia.org/wiki/Topaz_Solar_Farm) (PV)
1. [Solar Energy Generating Systems](https://en.wikipedia.org/wiki/Solar_Energy_Generating_Systems) (Thermal)
1. [Desert Sunlight](https://en.wikipedia.org/wiki/Desert_Sunlight_Solar_Farm) (PV)