# Lab 2 - Data Warehouse / Interactive Pattern - Interactive Querying with Spark, LLAP and Power BI

AdventureWorks would like to create some visualizations of their data to better understand their customers. They are interested in using the powerful visualization capabilities of Power BI and its ability to allow them to share those visualizations, but aren't sure how they can pull in the data to create the dashboards.

They have provided all the weblogs and user tables that you can use to quickly explore the data, and have the product information available in flat files. You will prepare the data to be used in Power BI, explore the data using Spark SQL and Jupyter's built-in visualizations, as well as Matplotlib for more advanced control. Finally, you will import the data into Power BI to create interactive dashboards and reports.

## Explore the weblog and user data

Let's take a look at the data and come up with some interesting visualizations based on what we find.

First, let's import the Python modules and functions we will use in this notebook.

In [None]:
from pyspark import SparkContext
from pyspark.sql import *
from pyspark.sql.types import *

User actions are captured in the weblogs as users navigate through the site. Let's find out which actions are captured, and how many of each action users performed. We'll sort from the highest count to the lowest.

Use the `%%sql` magic parameters to save the query results to a [Pandas](http://pandas.pydata.org/) DataFrame in the `%%local` context. This way, we can use the output in our Matplotlib charts.

> To learn more about the `%%sql` magic, and other magics available with the PySpark kernel, see [Kernels available on Jupyter notebooks with Spark HDInsight clusters](https://docs.microsoft.com/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-kernels#parameters-supported-with-the-sql-magic).

In [None]:
%%sql #Complete this line#
-- TODO: on the line above, use the sql magic's output parameter to save the query result to a local DataFrame named query1
select Action, Count(weblogs.*) as Ct from weblogs
inner join users on weblogs.userid = users.id
group by #Complete this line#
order by Ct desc

After executing the above query, use the built-in tabs to change the display Type from a Table to other visualizations, such as Pie and Bar charts.

Now, let's alter the query to show the same actions and their counts by gender. This will help us spot differences in how men and women use the site, and which group is ultimately most likely to make a purchase.

As we did with the previous query, use the `%%sql` magic parameters to save the query results to a new Pandas DataFrame.

In [None]:
%%sql #Complete this line#
-- TODO: on the line above, use the sql magic's output parameter to save the query result to a local DataFrame named query2
select Action, Gender, Count(weblogs.*) as Ct from weblogs
inner join users on weblogs.userid = users.id
group by #Complete this line#
order by Ct desc

As you did previously, use the built-in Pie chart visualization to view the breakdown of Action percentages. Use the Bar chart to display the total count of each action. Now change the chart's settings to display the total count of all actions by gender. One limitation of this built-in chart is that you cannot use it to compare the total count of each action by gender. Let's configure and use Matplotlib to help us with this visualization.

First, we need to switch to the local context and import the required libraries.

In [None]:
%%local
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

We need to make sure the Matplotlib commands are run in the local context. To test things out, let's recreate the pie chart showing the Action percentages, using Matplotlib.

In [None]:
# TODO: switch to the local context (the magic must be the first line of the cell, above this comment)

labels = query1['Action']
counts = query1#Complete this line#
colors = ['turquoise', 'seagreen', 'mediumslateblue']
plt.pie(counts, labels=labels, autopct='%1.1f%%', colors=#Complete this line#
plt.axis('equal')
plt.show()

To compare actions made by gender, we need to split up the local Pandas [DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) into smaller DataFrames that we can use for our charts.

> A quick reference guide you can use to get up and running quickly with Pandas is the [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) page on their site.

Looking at the pie chart above, we can see that the vast majority of the site's logged data is from people browsing the site. We need to break down the DataFrames further into two groups of actions so we can more easily drill down into the details of these groups: Purchasing, and Browsing.

Create the following DataFrames from the second Pandas DataFrame you created using the `%%sql` magic above:

1. men: Gender = Male, sorted by Action in descending order
2. women: Gender = Female, sorted by Action in descending order
3. men_purchasing: Gender = Male, includes the following actions: Add to Cart and Purchased
4. women_purchasing: Gender = Female, includes the following actions: Add to Cart and Purchased
5. men_browsing: Gender = Male, with the Browsed action
6. women_browsing: Gender = Female, with the Browsed action
7. labels: All of the actions in either the men_purchasing DataFrame, or the women_purchasing DataFrame
8. men_purchasing_counts: The Ct values from the men_purchasing DataFrame
9. women_purchasing_counts: The Ct values from the women_purchasing DataFrame
10. men_browsing_counts: The Ct values from the men_browsing DataFrame
11. women_browsing_counts: The Ct values from the women_browsing DataFrame
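
The filtering steps in the list above rely on three pandas operations: boolean masks, `isin`, and `sort_values`. The following is a minimal sketch of those operations on made-up rows; the `sample` DataFrame, its counts, and the exact action strings are assumptions for illustration, not values from the AdventureWorks data.

```python
import pandas as pd

# Made-up rows shaped like a query result: one row per (Action, Gender) pair.
sample = pd.DataFrame({
    "Action": ["Browsed", "Add to Cart", "Purchased",
               "Browsed", "Add to Cart", "Purchased"],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Ct": [700, 30, 10, 300, 90, 40],
})

# Boolean mask: keep the rows for one gender, then sort by Action, descending.
male_rows = sample[sample["Gender"] == "Male"].sort_values(by="Action", ascending=False)

# isin() keeps only the rows whose Action appears in the given list.
male_cart_rows = male_rows[male_rows["Action"].isin(["Add to Cart", "Purchased"])]

# Selecting a single column returns a pandas Series of its values.
male_cart_counts = male_cart_rows["Ct"]

print(male_rows)
print(male_cart_counts.tolist())
```

Because the filtered frames keep the sort order of their parent, the labels and counts taken from them stay aligned, which is what the charts later depend on.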

In [None]:
%%local

# TODO: Complete the Pandas DataFrames below, referencing the list above

men = query2[query2['Gender'] == 'Male'].sort_values(by=#Complete this line#
women = query2[query2['Gender'] == 'Female'].sort_values(by=#Complete this line#
men_purchasing = men[men['Action'].isin([#Complete this line#
women_purchasing = women[women['Action'].isin([#Complete this line#
men_browsing = men[men['Action'] == 'Browsed']
women_browsing = women[women['Action'] == 'Browsed']
labels = men_purchasing['Action']

men_purchasing_counts = men_purchasing[#Complete this line#
women_purchasing_counts = women_purchasing[#Complete this line#
men_browsing_counts = men_browsing[#Complete this line#
women_browsing_counts = women_browsing[#Complete this line#

Print the `men`, `women`, and `labels` DataFrames to take a quick look at the data we're working with.

In [None]:
%%local

# TODO: Print the following DataFrames: men, women, labels

print(#Complete this line#
print(#Complete this line#
print(#Complete this line#

Now that we have our DataFrames defined, let's create and configure a Matplotlib stacked Bar chart with women's values on top and men's values underneath. The values should come from the men's and women's purchasing counts DataFrames so we can compare how many women ultimately purchase items from the AdventureWorks store vs. men.

Color code the bars as you see fit, and make certain to add a legend indicating either Men or Women.
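
For reference, here is a minimal sketch of how Matplotlib stacks one bar series on another through the `bottom` argument of `bar`. The counts, labels, and group names are made up, and the `Agg` backend is an assumption used only so the sketch runs outside a notebook; in Jupyter you would keep `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values, not taken from the lab data.
labels = ["Add to Cart", "Purchased"]
first_counts = np.array([30, 10])
second_counts = np.array([90, 40])

ind = np.arange(len(labels))  # x locations for the groups
width = 0.35                  # width of each bar

fig, ax = plt.subplots()
lower = ax.bar(ind, first_counts, width, label="Group A")
# bottom= shifts each bar upward so it starts where the first series ends.
upper = ax.bar(ind, second_counts, width, bottom=first_counts, label="Group B")

ax.set_xticks(ind)
ax.set_xticklabels(labels)
ax.legend()

# The top of each stack is the sum of both series.
stack_tops = [float(rect.get_y() + rect.get_height()) for rect in upper]
print(stack_tops)
```

Without `bottom`, the second series would be drawn from zero and would overlap the first instead of stacking on it.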

In [None]:
%%local

N = 2
ind = np.arange(N)    # the x locations for the groups
width = 0.35       # the width of the bars: can also be len(x) sequence

p1 = plt.bar(ind, men_purchasing_counts, width, color='#d62728')
# TODO: Set the bottom of the bar to men_purchasing_counts
p2 = plt.bar(ind, women_purchasing_counts, width, #Complete this line#

# TODO: Set the ylabel to "Counts"
plt.#Complete this line#
plt.title('Purchasing actions by gender')
plt.xticks(ind, labels)
plt.legend((p1[0], p2[0]), ('Men', 'Women'))
#plt.xkcd(scale=1, length=100, randomness=2)
plt.show()

The chart shows that women perform the large majority of purchasing actions on the website. In both groups, many more users add items to the cart than actually complete the purchase.

Let us compare browsing statistics between the genders. This should also be a bar chart, but display men's values next to the women's.

In [None]:
%%local

fig, ax = plt.subplots()
labels = men_browsing['Action']
ind = np.arange(1)
p1 = ax.bar(ind, men_browsing_counts, width, color='#d62728')
p2 = ax.bar(ind + width, women_browsing_counts, width)

ax.set_ylabel('Counts')
ax.set_title('Browsing counts by gender')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(labels)
ax.legend((p1[0], p2[0]), ('Men', 'Women'))

plt.show()

As you can see from this chart, although women greatly outperformed men in purchases, men browse AdventureWorks' product categories more than twice as often. For some reason, they do not appear to be converting to buyers. Perhaps this is something the marketing team, or maybe the website's content team, needs to look into.

One thing that makes this chart harder to read is that the y axis uses an exponential label (1e7) because the values are in the tens of millions, so it shows 0 - 7 as the count values instead of the actual counts.

Let's make this easier to read by displaying the actual count values above each bar. To do this, modify the bar chart code to include a function that accepts a rectangle collection, and sets the text based on the height value of each. Pass the men and women bars to this function.

In [None]:
%%local

fig, ax = plt.subplots()
labels = men_browsing['Action']
ind = np.arange(1)
p1 = ax.bar(ind, men_browsing_counts, width, color='#d62728')
p2 = ax.bar(ind + width, women_browsing_counts, width)

ax.set_ylabel('Counts')
ax.set_title('Browsing counts by gender')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(labels)
ax.legend((p1[0], p2[0]), ('Men', 'Women'))

def autolabel(rects):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')

# TODO: Pass both bars to the autolabel function
autolabel(#Complete this line#
autolabel(#Complete this line#
plt.show()
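
Labeling each bar is the approach this lab takes. As an optional alternative, and only as a sketch, the y axis itself can be made to show full numbers by attaching a `FuncFormatter` from `matplotlib.ticker`. The helper name `full_count` and the bar heights below are hypothetical, and the `Agg` backend is used only so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def full_count(value, pos):
    """Format a tick value as a whole number with thousands separators."""
    return "{:,.0f}".format(value)


fig, ax = plt.subplots()
# Made-up heights in the tens of millions, which would normally trigger 1e7.
ax.bar([0, 0.35], [70000000, 30000000], 0.35)

# Replace the default scalar formatter on the y axis with the custom one.
ax.yaxis.set_major_formatter(FuncFormatter(full_count))

print(full_count(70000000, 0))
```

Matplotlib calls the formatter with the tick value and its position; only the value is used here.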

## Explore the product data

Now we need to load and parse the product data from Azure Storage so we can work with it.

In [None]:
products_schema = StructType([
        StructField('ProductId',IntegerType(),False), 
        StructField('ProductName', StringType()), 
        StructField('Price', FloatType()), 
        StructField('CategoryId', StringType()), 
        StructField('Ignore1', StringType()), 
        StructField('Ignore2', StringType()), 
        StructField('Ignore3', StringType()), 
        StructField('Category', StringType()), 
        StructField('Department', StringType())
    ])

products_DF = spark.read.csv("/retaildata/rawdata/ProductFile/part{*}", 
                    schema=products_schema,
                    header=False)

products_DF_with_price = products_DF.select("ProductId", "ProductName", "Price", "CategoryId", "Category", "Department")

Save the product data to a Hive table named Products.

In [None]:
products_DF_with_price.write.mode("overwrite").saveAsTable("Products")

Verify that the table was successfully created.

In [None]:
%%sql
show tables

Only the tables that have false under the isTemporary column are Hive tables that are stored in the metastore and can be accessed from Power BI. We will be using the products, users, and weblogs tables.

Let's take a look at some of the data in the new Products table.

In [None]:
%%sql
-- TODO: select the top 10 rows from the new products table
select * #Complete this line#

It's important to know the top products sold, and which categories they are part of. Another interesting data point for marketing may be the average age of the purchasers of those products.

In [None]:
%%sql
select ProductName, Category, Avg(Age), Count(weblogs.*) as Ct from products inner join weblogs
on weblogs.productid = products.productid inner join users on weblogs.userid = users.id
where weblogs.Action = 'Purchased'
group by --Complete this line
order by ct desc
limit 5
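
To make the shape of this query concrete, the sketch below reproduces the same join, filter, and aggregation steps in pandas over a few made-up rows. The table contents, product names, and ages are invented for illustration, and the column names only mirror those used in the SQL; it is not how the lab computes the result.

```python
import pandas as pd

# Tiny, hypothetical stand-ins for the three tables.
products = pd.DataFrame({
    "productid": [1, 2],
    "ProductName": ["Helmet", "Gloves"],
    "Category": ["Accessories", "Clothing"],
})
users = pd.DataFrame({"id": [10, 11, 12], "Age": [20, 30, 40]})
weblogs = pd.DataFrame({
    "userid": [10, 11, 12, 10, 11],
    "productid": [1, 1, 1, 2, 2],
    "Action": ["Purchased", "Purchased", "Browsed", "Purchased", "Browsed"],
})

# WHERE: keep only purchase events.
purchases = weblogs[weblogs["Action"] == "Purchased"]

# INNER JOINs: attach product details, then the purchasing user's age.
joined = (
    purchases.merge(products, on="productid")
    .merge(users, left_on="userid", right_on="id")
)

# GROUP BY + aggregates, then ORDER BY count descending and LIMIT 5.
summary = (
    joined.groupby(["ProductName", "Category"])
    .agg(AvgAge=("Age", "mean"), Ct=("Age", "size"))
    .reset_index()
    .sort_values("Ct", ascending=False)
    .head(5)
)
print(summary)
```

Every non-aggregated column in the output appears in the grouping key, which is the same rule the SQL `group by` clause has to satisfy.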

Now that we have a good sense of the data, we can start building the charts in Power BI. We'll work on that next.

## Reference tables using DirectQuery in Power BI

Power BI will allow us to quickly create these charts, now that we've explored the data for a bit.

You will need to download the [Power BI Desktop](https://powerbi.microsoft.com/desktop/) software to complete these steps.

1. From Power BI Desktop, select the dropdown under **Get Data**, then **More...**. In the Get Data window, select Azure, then Azure HDInsight Spark (Beta).

![Get Data](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/get-data.png)

2. Enter your Spark cluster's server name in the Server field. This will be [YOUR_CLUSTER_NAME].azurehdinsight.net.
3. Select DirectQuery as the data connectivity mode, then click OK.

![Get Data](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/get-data-directquery.png)

4. Enter the credentials you defined when you provisioned the cluster.
5. After authenticating, continue to the next step and check the boxes next to the `weblogs`, `users`, and `products` tables, then click Load.

![Get Data](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/directquery.png)


## Configure table relationships

Before we can start combining columns from related tables for our charts, we first need to configure the table relationships.

Create the following relationships before continuing:

| From: Table (Column) | To: Table (Column) |
| -- | -- |
| weblogs (ProductId) | products (ProductId) |
| weblogs (UserId) | users (id) |

## Create the charts

Now that the tables are loaded using DirectQuery, you will see them listed on the right-hand side of the Power BI page, under the Fields heading. When you select fields from this list, a table visualization is automatically created. Use the visualization icons to use the selected data for new types of charts and other visualizations.

> Power BI displays details of each chart item when you mouse over them. For instance, when you hover your mouse over the slices of the pie chart, Power BI will helpfully display the action name, count, and percentage of that action. This feature is available on all visualization types.

### Create a Pie Chart visualization

Create a new Pie Chart visualization to display the three actions from the weblogs table, and their percentages. This will be very similar to the pie chart we created in Jupyter.

The result should look similar to this:

![Pie Chart](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/piechart.png)

### Change the title of the Pie Chart

Power BI makes a best guess at what the title of each visualization should be. However, we can change the title to better describe what the visualization shows.

Change the title of the pie chart to "User Actions".

### Add a Stacked Column Chart

Similar to the stacked bar charts we created with Matplotlib, we'll create a stacked column chart in Power BI to compare purchasing actions (Add to Cart, and Purchased) by gender.

The chart we create should look like the following, and be titled "Purchasing actions by gender":

![Stacked Column Chart](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/stacked-columnchart.png)

### Add a Clustered Column Chart

Because the number of browsed actions is so high, we'll display those counts in their own chart. We can display the browsed count for males and females side-by-side using the clustered column chart visualization.

The chart we create should look like the following, and be titled "Browsing counts by gender":

![Clustered Column Chart](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/clustered-columnchart.png)

### Add a Waterfall Chart

The sales department is interested in seeing the top 6 products sold at any time, including how many of each product were sold, and the combined total for those six products.

The chart we create should look like the following, and be titled "Top 6 products sold":

![Waterfall Chart](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/waterfallchart.png)

After you are done, your page should look similar to the following:

![Completed page](https://raw.githubusercontent.com/ZoinerTejada/hdi-labs/master/Labs/Lab02/images/completed-report.png)

## Conclusion

In this lab, you learned how to use Spark SQL (and PySpark) to quickly explore gigabytes of data, simplify understanding the data through visualization within Jupyter, and then create interactive versions of those visuals within Power BI.

Specifically, you:
* Queried data stored in Hive tables.
* Copied product data from a Spark SQL DataFrame into a new Hive table that can be accessed from Power BI.
* Explored Jupyter's built-in visualizations, and used the more powerful Matplotlib charts to effectively explore the data.
* Learned how to work with the Pandas DataFrames within the local context.
* Set up a DirectQuery connection in Power BI to Spark.
* Created visualizations in Power BI directly from the Spark data.