# Intro To DataFrames, Lab #4
## What-The-Monday?

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

The data source for this lab is located on DBFS at **/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet**.

As we saw in the previous notebook...
* There are a lot more requests for sites on Monday than on any other day of the week.
* The variance is **NOT** unique to the mobile or desktop site.

Your mission, should you choose to accept it, is to demonstrate conclusively why there are more requests on Monday than on any other day of the week.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom Setup"

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Data Source
parquetFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet"

# Read the data and cache it to speed up the repeated queries below
pageviewDF = (spark.read
  .parquet(parquetFile)  # Parquet files embed their schema, so inferSchema is not needed
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp(col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp"))
  .orderBy(col("capturedAt"), col("site"))
  .cache()
)

In [0]:
display(pageviewDF)

capturedAt,site,requests
2015-03-16T00:00:00.000+0000,desktop,2343
2015-03-16T00:00:00.000+0000,mobile,1628
2015-03-16T00:00:01.000+0000,desktop,2382
2015-03-16T00:00:01.000+0000,mobile,1636
2015-03-16T00:00:02.000+0000,desktop,2546
2015-03-16T00:00:02.000+0000,mobile,1619
2015-03-16T00:00:03.000+0000,desktop,2402
2015-03-16T00:00:03.000+0000,mobile,1776
2015-03-16T00:00:04.000+0000,desktop,2370
2015-03-16T00:00:04.000+0000,mobile,1716


In [0]:
# Count how many duplicate records are in the data
recordCount = pageviewDF.count()
distinctRecordCount = pageviewDF.distinct().count()

# The difference between the two counts is the number of surplus (duplicate) rows
print("There are {} records, {} of which are duplicates".format(recordCount, recordCount - distinctRecordCount))
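
The count above relies on a simple identity: the number of surplus rows equals the total row count minus the distinct row count. A minimal plain-Python sketch of that arithmetic, outside Spark, might look like the following. The first two rows are taken from the `display(pageviewDF)` output above; the third row is a hypothetical duplicate added purely for illustration, and `rows` is an illustrative name, not part of the lab.

```python
# Plain-Python sketch of "total minus distinct" duplicate counting.
# The third row is a HYPOTHETICAL duplicate of the first, added for illustration.
rows = [
    ("2015-03-16T00:00:00", "desktop", 2343),
    ("2015-03-16T00:00:00", "mobile", 1628),
    ("2015-03-16T00:00:00", "desktop", 2343),  # hypothetical duplicate
]

record_count = len(rows)
distinct_record_count = len(set(rows))  # a set keeps one copy of each identical tuple
duplicate_count = record_count - distinct_record_count

print("There are {} records with {} duplicate records".format(record_count, duplicate_count))
```

Note that this counts surplus copies only: a row that appears twice contributes one duplicate, not two.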

In [0]:
from pyspark.sql.functions import col

# Get all duplicate records
dupDF = (pageviewDF
         .groupBy(col("capturedAt"), col("site"), col("requests"))
         .count()
         .where("count > 1")
         .orderBy(col("count").desc(), col("capturedAt"))
         .drop(col("count"))
        )
display(dupDF)

capturedAt,site,requests
2015-04-20T00:00:00.000+0000,mobile,1577
2015-04-20T00:00:00.000+0000,desktop,2273
2015-04-20T00:00:01.000+0000,desktop,2369
2015-04-20T00:00:01.000+0000,mobile,1604
2015-04-20T00:00:02.000+0000,desktop,2270
2015-04-20T00:00:02.000+0000,mobile,1699
2015-04-20T00:00:03.000+0000,mobile,1663
2015-04-20T00:00:03.000+0000,desktop,2651
2015-04-20T00:00:04.000+0000,desktop,2260
2015-04-20T00:00:04.000+0000,mobile,1611


In [0]:
# Add column day_of_week
interDF = (dupDF
  .withColumn("day_of_week", date_format(col("capturedAt"), "E"))
  .orderBy("capturedAt", "site")
)
display(interDF)
                  

capturedAt,site,requests,day_of_week
2015-04-20T00:00:00.000+0000,desktop,2273,Mon
2015-04-20T00:00:00.000+0000,mobile,1577,Mon
2015-04-20T00:00:01.000+0000,desktop,2369,Mon
2015-04-20T00:00:01.000+0000,mobile,1604,Mon
2015-04-20T00:00:02.000+0000,desktop,2270,Mon
2015-04-20T00:00:02.000+0000,mobile,1699,Mon
2015-04-20T00:00:03.000+0000,desktop,2651,Mon
2015-04-20T00:00:03.000+0000,mobile,1663,Mon
2015-04-20T00:00:04.000+0000,desktop,2260,Mon
2015-04-20T00:00:04.000+0000,mobile,1611,Mon
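
As a quick cross-check outside Spark, Python's standard `datetime` module can confirm the weekday of the dates that appear in the outputs above. This is only a sketch; `is_monday` is a hypothetical helper, not part of the lab.

```python
from datetime import date

def is_monday(d):
    """Return True when the date falls on a Monday (weekday() counts Monday as 0)."""
    return d.weekday() == 0

# Dates taken from the display outputs in this notebook
first_capture = date(2015, 3, 16)  # first rows of pageviewDF
duplicate_day = date(2015, 4, 20)  # rows shown in dupDF

print(is_monday(first_capture), is_monday(duplicate_day))
```

Comparing `weekday()` against `0` avoids any dependence on locale-specific day names.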


In [0]:
# Confirm all duplicate records are on Monday
notMonDupRecords = (interDF
              .filter(col("day_of_week") != "Mon")
              .count()
      )
print("There are {} duplicate records not on Monday".format(notMonDupRecords))

In [0]:
# Sum the requests in duplicate Monday records, per site (desktop and mobile)
dupRequests = (interDF
              .filter(col("day_of_week") == "Mon")
              .groupBy(col("site"))
              .sum("requests")
              .withColumnRenamed("sum(requests)", "totalDupRequests")
              .collect()
              )

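# groupBy() does not guarantee row order, so look the totals up by site name
dupRequestsBySite = {row["site"]: row["totalDupRequests"] for row in dupRequests}
desktopDupRequests = dupRequestsBySite["desktop"]
mobileDupRequests = dupRequestsBySite["mobile"]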

print("There are more requests on Monday than on any other day of the week because the data contains:")
print("\t + {} duplicate requests on desktop".format(desktopDupRequests))
print("\t + {} duplicate requests on mobile on Monday.".format(mobileDupRequests))
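
To see why duplicated rows inflate one day's total, the sketch below sums requests per weekday before and after removing exact duplicates, in plain Python rather than Spark. The request values are copied from the outputs above, but the row set itself (including the repeated 2015-04-20 rows) is a hypothetical miniature of the dataset, and `total_by_weekday` is an illustrative helper that does not appear in the lab.

```python
from collections import defaultdict
from datetime import datetime

def total_by_weekday(rows):
    """Sum requests per weekday index (0 = Monday) for (timestamp, site, requests) rows."""
    totals = defaultdict(int)
    for captured_at, _site, requests in rows:
        weekday = datetime.strptime(captured_at, "%Y-%m-%dT%H:%M:%S").weekday()
        totals[weekday] += requests
    return dict(totals)

# HYPOTHETICAL miniature dataset: values come from the outputs above,
# and the two 2015-04-20 rows are repeated to mimic the duplicates.
rows = [
    ("2015-04-20T00:00:00", "desktop", 2273),
    ("2015-04-20T00:00:00", "mobile", 1577),
    ("2015-04-20T00:00:00", "desktop", 2273),  # duplicate
    ("2015-04-20T00:00:00", "mobile", 1577),   # duplicate
]

with_duplicates = total_by_weekday(rows)
# dict.fromkeys keeps the first copy of each row and preserves order
without_duplicates = total_by_weekday(list(dict.fromkeys(rows)))

print(with_duplicates, without_duplicates)
```

In the notebook itself, the analogous step would presumably be to call `dropDuplicates()` (or `distinct()`) on `pageviewDF` before aggregating by day of week.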