<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 1 - Lab

## Analyzing the Wikipedia PageCounts with RDDs
### Technical Accomplishments:

* Learn how to use the following RDD actions: `count`, `take`, `takeSample`, `collect`
* Learn the following RDD transformations: `filter`, `map`, `groupByKey`, `reduceByKey`, `sortBy`
* Learn how to convert your RDD code to DataFrames (PySpark does not expose the typed Dataset API)
* Learn how to cache an RDD and view its number of partitions and total size in memory

## Getting Started

Let's start by creating a SparkSession and a few useful variables.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

sc = spark.sparkContext

spark

In [None]:
qcutils.print_s3_bucket_object(key='training/wikipedia_pagecount.tsv')

## Data Ingestion

For this lab we will use the file `wikipedia_pagecount.tsv`.

This file contains recent Wikipedia web traffic data that is less than 1 hour old. It captures 1 hour of page counts for all Wikipedia languages and projects.

In each line, the first column (like `en`) is the Wikimedia project name. The following suffixes are appended to the language code in the first column to identify the project:
```
wikipedia mobile: ".mw"
wiktionary: ".d"
wikibooks: ".b"
wikimedia: ".m"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
mediawiki: ".w"
```

Projects without a period and a following character are Wikipedia projects. So, any line starting with the column `en` refers to the English language Wikipedia (and can be requests from either a mobile or desktop client).

There will only be one line starting with the column `en.mw`, which will have a total count of the number of requests to English language Wikipedia's mobile edition. 

`en.d` refers to English language Wiktionary. 

`fr` is French Wikipedia. There are over 290 language possibilities.
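
The naming rule above can be captured in a small helper. The following is a minimal sketch, not part of the original lab: the function name `describe_project` and the `"unknown"` fallback are assumptions made for illustration, and the suffix table is copied from the list above.

```python
# Suffix-to-project table, copied from the abbreviation list above.
SUFFIX_TO_PROJECT = {
    "mw": "wikipedia mobile",
    "d": "wiktionary",
    "b": "wikibooks",
    "m": "wikimedia",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "w": "mediawiki",
}


def describe_project(code):
    """Split a first-column value such as 'en.d' into (language, project).

    Hypothetical helper: a code without a period is treated as Wikipedia,
    and an unrecognized suffix is reported as 'unknown'.
    """
    language, _, suffix = code.partition(".")
    if not suffix:
        return language, "wikipedia"
    return language, SUFFIX_TO_PROJECT.get(suffix, "unknown")


for code in ["en", "en.mw", "en.d", "fr"]:
    print(code, "->", describe_project(code))
```

A helper like this is plain Python, so it could later be passed to an RDD `map` once the first column has been extracted from each line.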

Read the file into an RDD as a text file:

In [None]:
wpc = sc.textFile(baseUri+"wikipedia_pagecount.tsv")

wpc.cache().count()

## Business Questions

Try to answer the following questions:

* Question # 1) How many unique articles in English Wikipedia were requested in the past hour?
* Question # 2) How many total requests did English Wikipedia get in the past hour?
* Question # 3) How many total requests did each Wikipedia project get during this hour?

[Pyspark API reference](https://spark.apache.org/docs/2.4.5/api/python/pyspark.html)

[RDD API reference](https://spark.apache.org/docs/2.4.5/api/python/pyspark.html#pyspark.RDD)

## Preliminary Work

Before working on the questions, some preliminary work needs to be done.

We ingested a text file, so we need to create a data structure that can be queried:

0. Look into the RDD
    * Can you see a common separator?
0. Change the content of each line to ease the following work
    * Can you split the content of the lines in a smart way?
0. Remove lines that do not fit the "schema"
    * Is there a header in the content?
    * Can you remove it?
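
The steps above can be sketched on a handful of lines in plain Python before applying them to the RDD. This is only one possible approach under stated assumptions: the separator is a tab (suggested by the `.tsv` extension), each valid line has four fields with a numeric request count in the third one, and the sample lines, `split_line`, and `fits_schema` are all invented for illustration. Check the real file with `wpc.take(5)` before relying on any of them.

```python
# Invented sample lines (NOT taken from the real file): a header, two
# well-formed records, and one line that does not fit the assumed layout.
sample_lines = [
    "project\tarticle\trequests\tbytes_served",
    "en\tApache_Spark\t12\t34567",
    "fr\tParis\t3\t8901",
    "a line without any separator",
]

EXPECTED_FIELDS = 4  # assumption: project, article, requests, bytes


def split_line(line):
    """Split one raw line on the assumed tab separator."""
    return line.split("\t")


def fits_schema(fields):
    """Keep records with the assumed field count and a numeric request count.

    A header row is dropped as well, because its third field is not a number.
    """
    return len(fields) == EXPECTED_FIELDS and fields[2].isdigit()


records = [f for f in map(split_line, sample_lines) if fits_schema(f)]
print(records)
```

On the RDD the same two functions could be chained as `wpc.map(split_line).filter(fits_schema)`. Both `map` and `filter` are lazy transformations, so nothing runs until an action such as `count` or `take` is called.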

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.