<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 2 - Lab

## Getting Started

Let's start by creating a SparkSession and a few useful variables.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
from pyspark.sql.types import *
from pyspark.sql.functions import *

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder 
    .master("local[*]")
    .appName("test")
    .getOrCreate()
        )
qcutils.init_spark_session(spark)

spark

In [None]:
qcutils.print_s3_bucket_object(key='training/wikipedia_pagecount.tsv')

## Data Ingestion

For this lab we will use the file `wikipedia_pagecount.tsv`.

This file contains recent Wikipedia web traffic data, less than 1 hour old. It captures 1 hour of page counts across all Wikipedia languages and projects.

In each line, the first column (like `en`) is the Wikimedia project name. A suffix after the language code identifies the project, using the following abbreviations:
```
wikipedia mobile: ".mw"
wiktionary: ".d"
wikibooks: ".b"
wikimedia: ".m"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
mediawiki: ".w"
```

Projects without a period and a following character are Wikipedia projects. So, any line whose first column is `en` refers to the English language Wikipedia (the requests can come from either a mobile or a desktop client).

There will be only one line whose first column is `en.mw`; it holds the total number of requests to English language Wikipedia's mobile edition.

`en.d` refers to English language Wiktionary. 

`fr` is French Wikipedia. There are over 290 language possibilities.
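To make the naming rule concrete, here is a minimal plain-Python sketch that splits a first-column value into its language and project. The mapping is copied from the abbreviation list above; the function name `split_project_code` and the `"unknown"` fallback are illustrative assumptions, not part of the lab material.

```python
# Suffix-to-project mapping, copied from the abbreviation list above.
SUFFIX_TO_PROJECT = {
    "mw": "wikipedia mobile",
    "d": "wiktionary",
    "b": "wikibooks",
    "m": "wikimedia",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "w": "mediawiki",
}


def split_project_code(code):
    """Split a value such as 'en.d' into a (language, project) pair.

    Hypothetical helper: a code without a period is treated as Wikipedia,
    and an unrecognised suffix is reported as 'unknown'.
    """
    language, _, suffix = code.partition(".")
    if not suffix:
        return language, "wikipedia"
    return language, SUFFIX_TO_PROJECT.get(suffix, "unknown")


for code in ["en", "en.mw", "en.d", "fr"]:
    print(code, "->", split_project_code(code))
```

The same rule can later be expressed with DataFrame column functions when filtering for English projects.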

Read it as a DataFrame in two ways:
* Let the system infer the schema
* Create the schema yourself

In [None]:
wpc = (spark.read
       ...
      )

wpc

In [None]:
schema = StructType(
  [
    ...
  ]
)

wpc = (spark.read
       ...
      )

wpc.printSchema()
wpc

## Business Questions

Let's now tackle the same business questions and solve them using the DataFrame API.

* Question # 1) How many articles in English Wikipedia were requested in the past hour?
* Question # 2) How many requests total did English Wikipedia get in the past hour?
* Question # 3) How many total requests did each Wikipedia project get during this hour?
* Question # 4) How many different English Wikimedia projects saw traffic in the past hour?
* Question # 5) How much traffic did each English Wikimedia project get in the past hour?
* Question # 6) What were the 25 most popular English articles in the past hour?
* Question # 7) How many requests did the "Apache Spark" article receive during this hour?
* Question # 8) Which Apache project received the most requests during this hour?
* Question # 9) What percentage of the 5.1 million English articles were requested in the past hour?
* Question # 10) How many total requests were there to English Wikipedia Desktop edition in the past hour?
* Question # 11) How many total requests were there to English Wikipedia Mobile edition in the past hour?
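Question 9 comes down to a simple ratio once the number of distinct English articles requested is known (the answer to Question 1). A minimal sketch of that arithmetic follows; the input value `1_275_000` is a made-up placeholder, not a result from the data, and `percent_requested` is a hypothetical helper name.

```python
TOTAL_ENGLISH_ARTICLES = 5_100_000  # figure quoted in Question 9


def percent_requested(distinct_requested, total=TOTAL_ENGLISH_ARTICLES):
    """Return the share of articles requested, as a percentage of the total."""
    return 100.0 * distinct_requested / total


# Placeholder count purely for illustration; replace it with your own result.
print(percent_requested(1_275_000))
```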

[PySpark API reference](https://spark.apache.org/docs/2.4.5/api/python/pyspark.html)

[PySpark DataFrame API reference](https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.DataFrame)

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.