In [1]:
import zipfile

import numpy as np
import pandas as pd

#with zipfile.ZipFile('bakery.csv.zip', 'r') as zip_ref:
#    zip_ref.extractall('.')
with zipfile.ZipFile('melb_clean.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('.')
#with zipfile.ZipFile('nba.csv.zip', 'r') as zip_ref:
#    zip_ref.extractall('.')
#with zipfile.ZipFile('stocks_cleaned.csv.zip', 'r') as zip_ref:
#    zip_ref.extractall('.')

melb_clean = pd.read_csv("melb_clean.csv")
north = melb_clean[melb_clean["region"] == "Northern"]
south = melb_clean[melb_clean["region"] == "Southern"]

## Switch bokeh to the notebook mode
from bokeh.io import output_notebook
output_notebook()

## Import the libraries we need
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.io import output_file, show

## Colors, legend, and theme

For your first assignment, the estate agents would like a visualization to represent the relationship between the year a property was built and its total land area, factoring in how this varies between the Northern and Southern regions of Melbourne. You decide to use one of Bokeh's custom themes for the plot.

Two subsets of melb have been created based on which region a property is located in, north and south, as shown below:

```python
north = melb.loc[melb["region"] == "Northern"]
south = melb.loc[melb["region"] == "Southern"]
```

A figure, fig, has been preloaded for you. You will update the theme, and add circle glyphs using different colors for each region. You will then add a legend_label so they can be easily distinguished.

### Instructions
    - Import curdoc.
    - Update the theme to "contrast".
    - Add circle glyphs for north, setting x and y to represent year_built and land_area, respectively, setting color to "yellow" and legend_label to "North".
    - Repeat for south, setting color to "red" and legend_label to "South".

In [2]:
# Import curdoc
from bokeh.io import curdoc

# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Year Built", y_axis_label="Land Area (Meters Squared)")

# Add north circle glyphs
fig.circle(x=north["year_built"], y=north["land_area"], color="yellow", legend_label="North")

# Add south circle glyphs
fig.circle(x=south["year_built"], y=south["land_area"], color="red", legend_label="South")

output_file(filename="north_vs_south.html")
show(fig)

In [3]:
houses = melb_clean.loc[melb_clean["type"] == "h"]
units = melb_clean.loc[melb_clean["type"] == "u"]
townhouses = melb_clean.loc[melb_clean["type"] == "t"]

## Customizing glyphs

The estate agents have requested a plot displaying the relationship between the year a property was built and its distance to the Central Business District (CBD), distinguishing between houses, units, and townhouses. You decide to use different colors and glyphs for each of the three property types.

Three subsets of melb have been created and preloaded for you:

```python
houses = melb.loc[melb["type"] == "h"]
units = melb.loc[melb["type"] == "u"]
townhouses = melb.loc[melb["type"] == "t"]
```

### Instructions
    - Create a figure, fig, setting x and y-axes labels to "Year Built" and "Distance from CBD (km)", respectively.
    - Add purple circle glyphs to represent houses, plotting "year_built" on the x-axis and "distance" on the y-axis, and labeling the legend as "House".
    - Repeat the above, this time using red square glyphs for units, with the legend labeled as "Unit".
    - Repeat once more using green triangle glyphs to represent townhouses and setting the legend label to "Townhouse".

In [4]:
# Create figure
fig = figure(x_axis_label="Year Built", y_axis_label="Distance from CBD (km)")

# Add circle glyphs for houses
fig.circle(x=houses["year_built"], y=houses["distance"], legend_label="House", color="purple")

# Add square glyphs for units
fig.square(x=units["year_built"], y=units["distance"], legend_label="Unit", color="red")

# Add triangle glyphs for townhouses
fig.triangle(x=townhouses["year_built"], y=townhouses["distance"], legend_label="Townhouse", color="green")
output_file(filename="year_built_vs_distance_by_property_type.html")
show(fig)

## Average building size

The estate agents are interested in understanding if there have been changes in building size over time.

You will group the melb dataset by date and calculate the average building size. You will then convert the grouped DataFrame into a Bokeh source object and build a line plot.

### Instructions
    - Group melb by date and calculate the mean of "building_area".
    - Create fig, labeling the x and y-axes as "Date" and "Building Size (Meters Squared)", respectively, and setting the x-axis ticks to datetime format.
    - Add line glyphs to fig to visualize building_area versus date, using source.
    - Generate an HTML file called "property_size_by_date.html".
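
Grouping on raw date strings orders the groups alphabetically rather than chronologically, so it can help to parse the dates before grouping. Here is a minimal sketch on a hypothetical three-row table; the day/month/year string format is an assumption, and the actual format in the dataset may differ:

```python
import pandas as pd

# Hypothetical rows standing in for melb; day/month/year strings are an assumption
toy = pd.DataFrame({
    "date": ["03/12/2016", "01/02/2017", "03/12/2016"],
    "building_area": [100.0, 200.0, 140.0],
})

# Parse the strings into timestamps so the groups sort chronologically
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%Y")

# One row per sale date, holding the mean building area for that date
prop_size_toy = toy.groupby("date", as_index=False)["building_area"].mean()
print(prop_size_toy)
```

With string dates, "01/02/2017" would sort ahead of "03/12/2016"; after parsing, the December 2016 group comes first.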

In [5]:
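melb = melb_clean.copy()
melb["date"] = pd.to_datetime(melb["date"], format="mixed")  # parse once so datetime axes get real timestamps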

In [None]:
# Group by date and calculate average building size
prop_size = melb.groupby("date", as_index=False)["building_area"].mean()
source = ColumnDataSource(data=prop_size)

# Create the figure
fig = figure(x_axis_label="Date", y_axis_label="Building Size (Meters Squared)", x_axis_type="datetime")

# Add line glyphs
fig.line(x="date", y="building_area", source=source)

# Generate the HTML file
output_file(filename="property_size_by_date.html")
show(fig)

## Sales over time

The estate agents have now asked you to examine house market activity to visualize changes in total sales over time. The melb DataFrame has been grouped on date, this time calculating total sales using the sum of the price column, and stored as melb_sales:

```python
melb_sales = melb.groupby("date", as_index=False)["price"].sum()
```

source has been created from melb_sales, and preloaded for you. Your task is to format a plot to display the visualization with meaningful axes allowing for insights to be drawn.

### Instructions
    - Import the classes required to change the axis labels to datetime and numeric format.
    - Add line glyphs to the figure, assigning y as "price" versus x as "date" from source.
    - Update the format of the x-axis to months as three characters, and years as 4 digits.
    - Set the y-axis format as "$0a" to display in millions of dollars.

In [7]:
melb_sales = melb.groupby("date", as_index=False)["price"].sum()
source = ColumnDataSource(data=melb_sales)

In [None]:
# Import the tick formatters
from bokeh.models import NumeralTickFormatter, DatetimeTickFormatter
fig = figure(x_axis_label="Date", y_axis_label="Sales", x_axis_type="datetime")

# Add line glyphs
fig.line(x="date", y="price", source=source)

# Format the x-axis ticks
fig.xaxis[0].formatter = DatetimeTickFormatter(months="%b %Y")

# Format the y-axis ticks
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")

output_file(filename="melbourne_sales.html")
show(fig)

## Categorical column subplots

The estate agents would like you to analyze how property size and amount of land vary by region in Melbourne. With the column layout, you can create two subplots displaying these relationships, using region as the x-axis.

The melb DataFrame has been grouped by region, and the average values for land_area and building_area have been calculated. This has been set up as a Bokeh data object called source, preloaded for you.

### Instructions
    - Import column from the associated Bokeh module.
    - Add bar glyphs to building_size, plotting the "building_area" for each "region".
    - Add bar glyphs to land_size, representing the "land_area" for each "region".
    - Generate an HTML file called "my_first_column.html", and complete the call of show() to display both subplots.

In [None]:
# Import column
from bokeh.layouts import column
regions = ["Eastern", "Southern", "Western", "Northern"]
building_size = figure(x_axis_label="Region", y_axis_label="Building Size (Meters Squared)", 
                       x_range=regions)
land_size = figure(x_axis_label="Region", y_axis_label="Land Size (Meters Squared)", 
                   x_range=regions)

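# Build the region-level source described above (mean land and building area per region)
region_means = melb.groupby("region", as_index=False)[["land_area", "building_area"]].mean()
source = ColumnDataSource(data=region_means)

# Add bar glyphs
building_size.vbar(x="region", top="building_area", source=source)
land_size.vbar(x="region", top="land_area", source=source)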

# Generate HTML file and display the subplots
output_file(filename="my_first_column.html")
show(column(building_size, land_size))

## Size, location, and price

Next, the estate agents would like to understand how price is related to the size of the property and its distance from the Central Business District (CBD).

In this case, the y-axis of both figures will have the same units, so making a row of subplots is an appropriate choice. source has been set up as a Bokeh object using the melb dataset, and preloaded for you.

### Instructions
    - Import row from the associated Bokeh module.
    - Add circle glyphs to both figures, representing "price" on the y-axis versus "building_area" in building_size, and "price" on the y-axis versus "distance" in distance.
    - Update the y-axis of both figures to display in the format of $0a, for millions of dollars.
    - Complete the call of show() to display both subplots.

In [None]:
# Import row
from bokeh.layouts import row
building_size = figure(x_axis_label="Building Area (Meters Squared)", y_axis_label="Sales")
distance = figure(x_axis_label="Distance from CBD (km)", y_axis_label="Sales")

# Build source from the full melb dataset, as described above
source = ColumnDataSource(data=melb)

# Add circle glyphs
building_size.circle(x="building_area", y="price", source=source)
distance.circle(x="distance", y="price", source=source)

# Update the y-axis format for both figures
building_size.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
distance.yaxis[0].formatter = NumeralTickFormatter(format="$0a")

# Display the subplots
output_file(filename="my_first_row.html")
show(row(building_size, distance))

## Using gridplot

The estate agents would like to examine how the relationship between property size and price varies across the four regions of Melbourne:

"Northern", "Western", "Eastern", and "Southern".

This is a great opportunity to use gridplot, displaying one subplot for each region!

### Instructions
    - Import gridplot.
    - Create df by filtering melb for the desired region.
    - Complete the code to add circle glyphs to fig, representing x as the building area column and y as price, from source, and legend_label as region.
    - Display the subplots in a grid using two columns.
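
When gridplot receives a flat list together with ncols, it fills the grid row by row. The sketch below mimics that arrangement in plain Python, using region names as stand-ins for the figures. It only illustrates the resulting layout and is not Bokeh's implementation:

```python
# Region names stand in for the four figures that will be passed to gridplot
regions_in_order = ["Northern", "Western", "Southern", "Eastern"]
ncols = 2

# Chunk the flat list into rows of ncols items, left to right, top to bottom
grid_rows = [
    regions_in_order[i:i + ncols]
    for i in range(0, len(regions_in_order), ncols)
]

for grid_row in grid_rows:
    print(grid_row)
```

With this ordering, the Northern and Western subplots sit on the top row, and Southern and Eastern on the bottom row.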

In [11]:
# Import gridplot
from bokeh.layouts import gridplot
plots = []

# Complete for loop to create plots
for region in ["Northern", "Western", "Southern", "Eastern"]:
  df = melb.loc[melb["region"] == region]
  source = ColumnDataSource(data=df)
  fig = figure(x_axis_label="Building Area (Meters Squared)", y_axis_label="Price")
  fig.circle(x="building_area", y="price", source=source, legend_label=region)
  fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
  plots.append(fig)

# Display plot
output_file(filename="gridplot.html")
show(gridplot(plots, ncols=2))

## Changing size

The estate agents have fed back that the subplots are quite large! They have asked you to make the next round of visualizations a bit smaller. These will display the relationship between the year a property was built and a) its distance from the CBD and b) its building size.

You will manually specify the size of two subplots in row format.

### Instructions
    - Set height to 300 pixels and width to 400 pixels for both distance_vs_year and building_size_vs_year.
    - Add circle glyphs to distance_vs_year, with x being the year_built and y representing distance from CBD.
    - Repeat for building_size_vs_year with the same x-axis, but setting y to building_area.
    - Finish the show() call to display the subplots in a row layout.

In [12]:
# Set up figures
distance_vs_year = figure(x_axis_label="Year Built", y_axis_label="Distance from CBD (km)", height=300, width=400)
building_size_vs_year = figure(x_axis_label="Year Built", y_axis_label="Building Size (Meters Squared)", height=300, width=400)

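# Rebuild source from the full dataset (the gridplot loop left it holding a single region)
source = ColumnDataSource(data=melb)

# Add circle glyphs to distance_vs_year
distance_vs_year.circle(x="year_built", y="distance", source=source)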

# Add circle glyphs to building_size_vs_year
building_size_vs_year.circle(x="year_built", y="building_area", source=source)

# Generate HTML file and display plot
output_file(filename="custom_size_plot.html")
show(row(distance_vs_year, building_size_vs_year))

## High to low prices by region

Now that you know how to sort a DataFrame, the estate agents have asked you to create a bar plot visualizing the average property price by region from largest to smallest.

regions has been created by grouping melb by region and calculating the average price, and preloaded for you:

```python
regions = melb.groupby("region", as_index=False)["price"].mean()
```

### Instructions
    - Sort regions by price in descending order.
    - Create the figure, setting x_range equal to the "region" column of regions and labeling the x- and y-axes as "Region" and "Sales", respectively.
    - Add bar glyphs from regions, showing the price on the y-axis against each region on the x-axis, and setting the width to 0.9.
    - Update the y-axis format to display in millions of dollars with 1 decimal place.

In [20]:
regions = melb[["region", "price"]].groupby("region").mean().reset_index()

In [21]:
# Sort df by price in descending order
regions = regions.sort_values("price", ascending=False)

# Create figure
fig = figure(x_range=regions["region"], x_axis_label="Region", y_axis_label="Sales")

# Add bar glyphs
fig.vbar(x=regions["region"], top=regions["price"], width=0.9)

# Format the y-axis to numeric format
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0.0a")

output_file(filename="sorted_barplot.html")
show(fig)

## Creating nested categories

For your final plot, the estate agents would like you to present property sales across the year, displaying months and quarters on the x-axis.

Some of the code to add months and quarters into the Melbourne dataset has been preloaded for you. The factors variable, which will represent months and their corresponding quarters, needs to be created. The data must also be grouped by these two newly created columns to calculate total sales by taking the sum of the "price" column.

### Instructions
    - Complete factors, entering the relevant quarters and associated months.
    - Create grouped_melb by grouping melb by "month" and "quarter", calculating the total of the "price" column.
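
The quarter and month pairs can also be generated rather than typed out. Here is a small sketch using the standard library's calendar module; the name generated_factors is illustrative, and the month names assume an English locale because calendar.month_name follows the active locale:

```python
import calendar

# Month numbers 1-12 map to quarters 1-4 in blocks of three
generated_factors = [
    (f"Q{(month - 1) // 3 + 1}", calendar.month_name[month])
    for month in range(1, 13)
]

# Show the first quarter's pairs
print(generated_factors[:3])
```

The resulting list has the same (quarter, month) tuple shape that a nested categorical axis expects.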

In [27]:
melb["date"] = pd.to_datetime(melb["date"], format='mixed')

In [28]:
melb["month"] = melb["date"].dt.month
quarters = {1: "Q1", 2:"Q1", 3:"Q1", 4:"Q2", 5:"Q2", 6:"Q2", 7:"Q3", 8:"Q3", 9:"Q3", 10:"Q4", 11:"Q4", 12:"Q4"}
melb["quarter"] = melb["month"].replace(quarters)
melb["month"] = melb["month"].replace({1:"January", 2:"February", 3:"March", 4:"April", 5:"May", 6:"June", 7:"July", 8:"August", 9:"September", 10:"October", 11:"November", 12:"December"})

# Create factors
factors = [("Q1", "January"), ("Q1", "February"), ("Q1", "March"), 
           ("Q2", "April"), ("Q2", "May"), ("Q2", "June"), 
           ("Q3", "July"), ("Q3", "August"), ("Q3", "September"), 
           ("Q4", "October"), ("Q4", "November"), ("Q4", "December")]

# Calculate total sales by month and quarter
grouped_melb = melb.groupby(["month", "quarter"], as_index=False)["price"].sum()
grouped_melb.sort_values("quarter", inplace=True)
print(grouped_melb.head())

      month quarter         price
3  February      Q1  3.892941e+08
4   January      Q1  2.612622e+08
7     March      Q1  1.614641e+09
0     April      Q2  1.308897e+09
6      June      Q2  1.570334e+09


## Visualizing sales by period

Now that you have created your factors, it is time to build a bar plot visualizing sales per month, grouped into quarters!

grouped_melb, a pandas DataFrame containing one row for each month, its respective quarter, and total sales for that month, has been preloaded for you. Additionally, factors, which is a list of tuples containing each quarter and month pair, has also been preloaded.

### Instructions
    - Import NumeralTickFormatter and FactorRange.
    - Create the figure, using FactorRange() and factors for the x-axis, and labeling the y-axis as "Sales".
    - Add bar glyphs, setting x as factors, top as the "price" column of grouped_melb, and width as 0.9.
    - Rotate the x-axis labels to 45 degrees.
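
The bars only line up with their labels if the prices arrive in the same order as factors. One way to guarantee that, sketched here on hypothetical first-quarter totals, is to index the grouped table by (quarter, month) and reindex it to the factor order. The names toy_factors, toy_grouped, and prices_in_factor_order are illustrative:

```python
import pandas as pd

# Factor order wanted on the x-axis (first quarter only, for brevity)
toy_factors = [("Q1", "January"), ("Q1", "February"), ("Q1", "March")]

# Hypothetical totals, listed in the alphabetical month order that groupby produces
toy_grouped = pd.DataFrame({
    "month": ["February", "January", "March"],
    "quarter": ["Q1", "Q1", "Q1"],
    "price": [20.0, 10.0, 30.0],
})

# Index by (quarter, month) so each row can be looked up by its factor
order = pd.MultiIndex.from_tuples(toy_factors, names=["quarter", "month"])
aligned = toy_grouped.set_index(["quarter", "month"]).reindex(order)

# Prices now follow the factor order and can be passed as bar heights
prices_in_factor_order = aligned["price"].tolist()
print(prices_in_factor_order)
```

A factor with no matching row would come back as NaN after reindexing, which makes a missing month easy to spot.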

In [29]:
# Import NumeralTickFormatter and FactorRange
from bokeh.models import NumeralTickFormatter, FactorRange

# Create figure
fig = figure(x_range=FactorRange(*factors), y_axis_label="Sales")

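# Create bar glyphs, building x from grouped_melb so each bar is paired with its own price
x = list(zip(grouped_melb["quarter"], grouped_melb["month"]))
fig.vbar(x=x, top=grouped_melb["price"], width=0.9)
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0.0a")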

# Rotate the x-axis labels by 45 degrees (Bokeh interprets numeric values as radians)
fig.xaxis.major_label_orientation = np.pi / 4

output_file(filename="sales_by_period.html")
show(fig)