# Analysis

Remember to run the previous notebooks first.

## Utilities

In [0]:
database_name = "northwind"
spark.sql(f"USE SCHEMA {database_name};")

Out[9]: DataFrame[]

## Analysis

#### What are the 5 least sold products?
Using **PySpark** and the **gl_products_sales** table

In [0]:
from operator import attrgetter

# takeOrdered(n, key) returns the n rows with the smallest key, here the lowest sales
display(spark.table("gl_products_sales").rdd.takeOrdered(5, attrgetter("sales")))

product_name,sales
Mishi Kobe Niku,5
Genen Shouyu,6
Chocolade,6
Gravad lax,6
Louisiana Hot Spiced Okra,8
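
As a rough plain-Python analogue of what `takeOrdered` does with a key function, the sketch below applies `heapq.nsmallest` to the five rows shown in the output above. The `ProductSales` namedtuple is a hypothetical stand-in for a Spark `Row`, and no Spark session is involved. It also illustrates why a cutoff of 3 would be somewhat arbitrary for this data: three products tie at 6 sales.

```python
import heapq
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for a Spark Row; the values are copied from the output above,
# deliberately shuffled so the ordering is done by the code and not by the input.
ProductSales = namedtuple("ProductSales", ["product_name", "sales"])

rows = [
    ProductSales("Louisiana Hot Spiced Okra", 8),
    ProductSales("Genen Shouyu", 6),
    ProductSales("Mishi Kobe Niku", 5),
    ProductSales("Chocolade", 6),
    ProductSales("Gravad lax", 6),
]

# nsmallest(n, iterable, key) is equivalent to sorted(iterable, key=key)[:n],
# so rows with equal sales keep their relative input order.
least_sold = heapq.nsmallest(3, rows, key=attrgetter("sales"))

for row in least_sold:
    print(row.product_name, row.sales)
```

When several rows tie at the cutoff, which of them is returned depends on input order. On a distributed table that order may depend on how the data is partitioned, so adding a secondary key (for example the product name) is one way to make the result deterministic.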


#### Who are the 5 customers with the highest number of purchases?
Using **Koalas** and **gl_customer_number_of_purchases**

In [0]:
!pip install koalas  # Needed in the community edition (as the cluster has to be recreated)


In [0]:
import databricks.koalas as ks

display(ks.read_table("gl_customer_number_of_purchases").nlargest(5, "number_of_purchases"))

number_of_purchases,company_name,contact_name,contact_title
31,Save-a-lot Markets,Jose Pavarotti,Sales Representative
30,Ernst Handel,Roland Mendel,Sales Manager
28,QUICK-Stop,Horst Kloss,Accounting Manager
19,Folk och fä HB,Maria Larsson,Owner
19,Hungry Owl All-Night Grocers,Patricia McKenna,Sales Associate


#### Who are the 5 customers with the highest purchase value?

Using the **pandas API on Spark** and **gl_customer_value_of_purchases**

In [0]:
import pyspark.pandas as ps

display(ps.DataFrame(spark.table("gl_customer_value_of_purchases")).nlargest(5, "value_of_purchases"))

value_of_purchases,company_name,contact_name,contact_title
101205.5025,QUICK-Stop,Horst Kloss,Accounting Manager
100529.81,Save-a-lot Markets,Jose Pavarotti,Sales Representative
96662.34,Ernst Handel,Roland Mendel,Sales Manager
80290.07400000001,Rattlesnake Canyon Grocery,Paula Wilson,Assistant Sales Representative
58042.04750000001,Hungry Owl All-Night Grocers,Patricia McKenna,Sales Associate


#### Which employee made the most sales in the most recent year?

Using **Spark SQL** and **gl_employees_sales_per_year**

In [0]:
spark.sql("""
    WITH last_year AS (
        SELECT *
        FROM gl_employees_sales_per_year
        WHERE year = (SELECT MAX(year) FROM gl_employees_sales_per_year)
    )
    SELECT *
    FROM last_year
    WHERE number_of_sales = (SELECT MAX(number_of_sales) FROM last_year)
""").display()

employee_id,first_name,last_name,year,number_of_sales
4,Margaret,Peacock,1998,44
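
To check the query logic without a cluster, the same CTE can be run against an in-memory SQLite database, as in the sketch below. This is only an approximation: SQLite is not Spark SQL, and every row except the one shown in the output above is hypothetical sample data, added so that the year filter and the `MAX` filter both have something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gl_employees_sales_per_year (
        employee_id INTEGER,
        first_name TEXT,
        last_name TEXT,
        year INTEGER,
        number_of_sales INTEGER
    )
""")

rows = [
    (4, "Margaret", "Peacock", 1998, 44),          # taken from the output above
    (4, "Margaret", "Peacock", 1997, 60),          # hypothetical: earlier year
    (101, "Hypothetical", "EmployeeA", 1998, 30),  # hypothetical: same year, fewer sales
    (102, "Hypothetical", "EmployeeB", 1997, 70),  # hypothetical: highest count, earlier year
]
conn.executemany(
    "INSERT INTO gl_employees_sales_per_year VALUES (?, ?, ?, ?, ?)", rows
)

# Step 1 (CTE): keep only rows from the most recent year present in the table.
# Step 2: within that year, keep the row(s) with the highest number of sales.
query = """
    WITH last_year AS (
        SELECT *
        FROM gl_employees_sales_per_year
        WHERE year = (SELECT MAX(year) FROM gl_employees_sales_per_year)
    )
    SELECT *
    FROM last_year
    WHERE number_of_sales = (SELECT MAX(number_of_sales) FROM last_year)
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

Filtering on `MAX(number_of_sales)` instead of using `ORDER BY ... LIMIT 1` means that employees tied for first place are all returned. Note that "most recent year" here is the latest year present in the table, not the calendar year before today.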
