# Exercises on Spark Core API
<font color='violet'>Any changes to the notebook text made for the homework assignment are indicated in violet.</font>

This notebook contains exercises on three different datasets. The goal is to solve these exercises using the **Spark Core API**.

We start by installing pyspark (only execute if this is needed, e.g., if you are running this on Google Colab), and downloading the datasets. The exercises follow.

### Useful documentation to do these exercises.

The PySpark Documentation is available at https://spark.apache.org/docs/latest/api/python/index.html. 

Instructions on how to install PySpark on your local PC may be found at https://spark.apache.org/docs/latest/api/python/getting_started/install.html. Note that by installing PySpark in this way, you automatically have a local copy of Spark.

The Spark Core API that we use below is documented at https://spark.apache.org/docs/latest/api/python/reference/pyspark.html, which is a useful reference to keep at hand.

#### Installing PySpark

In [1]:
# This installs pyspark in the current python environment.
# By installing pyspark, we automatically also install spark.
# You **need** to run this cell when running this notebook in Google Colab
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=cacedca3a145da50c67f00b399f6cdc343e216dd24317b31286543e05d2aa25d
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

#### General imports and starting Spark

In [2]:
#This is needed to start a Spark session from the notebook
#You may adjust the memory used by the driver program based on your machine's settings
import os 
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=3g  pyspark-shell"

from pyspark.sql import SparkSession

In [3]:
# -------------------------------
# Start Spark in LOCAL mode
# -------------------------------

#The following lines are just there to allow this cell to be re-executed multiple times:
#if a spark session was already started, we stop it before starting a new one
#(there can be only one spark context per jupyter notebook)
try:
    spark
    print("Spark application already started. Terminating existing application and starting new one")
    spark.stop()
except NameError:
    # No Spark session exists yet, so there is nothing to stop
    pass

# Create a new spark session (note, the * indicates to use all available CPU cores)
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("demoRDD") \
    .getOrCreate()
    
#When dealing with RDDs, we work with the sparkContext object. See https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
sc=spark.sparkContext

# We print the sparkcontext. This prints general information about the spark instance we have connected to. 
# In particular, the hyperlink allows us to open the spark UI (useful for seeing what is going on)
# Note: this hyperlink won't work when running this notebook in Google Colab.
sc

### Downloading data

The next cell downloads the data required for the exercises.

In [4]:
!mkdir -p downloads
!wget 'https://drive.google.com/u/0/uc?id=1Xlmwiku1RKLyACAPMFZMg1-RsWY-8tYS&export=download' -O downloads/data-spark-exercises.zip
!unzip -q downloads/data-spark-exercises.zip
!ls data

--2023-03-31 19:32:19--  https://drive.google.com/u/0/uc?id=1Xlmwiku1RKLyACAPMFZMg1-RsWY-8tYS&export=download
Resolving drive.google.com (drive.google.com)... 172.253.114.101, 172.253.114.139, 172.253.114.102, ...
Connecting to drive.google.com (drive.google.com)|172.253.114.101|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://drive.google.com/uc?id=1Xlmwiku1RKLyACAPMFZMg1-RsWY-8tYS&export=download [following]
--2023-03-31 19:32:19--  https://drive.google.com/uc?id=1Xlmwiku1RKLyACAPMFZMg1-RsWY-8tYS&export=download
Reusing existing connection to drive.google.com:443.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-50-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/i301kdqge3ojlpgpjtot6uqccfg4ef01/1680291075000/12785547293638390956/*/1Xlmwiku1RKLyACAPMFZMg1-RsWY-8tYS?e=download&uuid=753cb20f-0633-4a5e-a604-2ccdc5bee5b5 [following]
--2023-03-31 19:32:26--  https://doc-0o-50-docs.googleuserconten

## 1. Sensor data exercises
In the file `data/sensors/sensor-sample.txt`, each line contains the following whitespace-separated fields: Date (date), Time (time), RoomId-SensorId (two integers joined by a hyphen), Value1 (float), Value2 (float).
Using this file, use Spark to answer the following queries:

1. Count the number of entries for each day.
2. Count the number of measures for each pair of RoomId-SensorId.
3. Compute the average of Value1.

<font color='violet'>Reading in the data:</font>

In [5]:
fileName = 'data/sensors/sensor-sample.txt'
sensorRDD = sc.textFile(fileName)
sensorRDD = sensorRDD.map(lambda x: x.split())
sensorRDD.take(3)

[['2017-03-31', '03:38:16.508', '1-0', '122.153', '2.03397'],
 ['2017-03-31', '03:38:15.967', '1-1', '-3.91901', '2.09397'],
 ['2017-03-31', '03:38:16.577', '1-2', '11.04', '2.07397']]
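
After `split()` every field is still a string. The following is a minimal sketch of a typed parser, assuming the five whitespace-separated fields shown above; `parse_sensor_line` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
from datetime import datetime

def parse_sensor_line(line):
    """Turn one raw line into (timestamp, room_id, sensor_id, value1, value2)."""
    date_str, time_str, room_sensor, value1, value2 = line.split()
    # The third field has the form "<RoomId>-<SensorId>"
    room_id, sensor_id = (int(part) for part in room_sensor.split("-"))
    timestamp = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S.%f")
    return timestamp, room_id, sensor_id, float(value1), float(value2)

# First line of the sample output above
record = parse_sensor_line("2017-03-31 03:38:16.508 1-0 122.153 2.03397")
print(record)
```

It could be applied with `sc.textFile(fileName).map(parse_sensor_line)`, but note that the solutions below index into the untyped lists, so they would need adjusting.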

<font color='violet'>1. Count the number of entries for each day:</font>


In [6]:
dayRDD = sensorRDD.map(lambda x: x[0]).countByValue()
dayRDD

defaultdict(int,
            {'2017-03-31': 3393,
             '2017-02-28': 62103,
             '2017-03-01': 33423,
             '2017-03-02': 32403,
             '2017-03-03': 29727,
             '2017-03-04': 30225,
             '2017-03-05': 26019,
             '2017-03-06': 24315,
             '2017-03-07': 26625,
             '2017-03-08': 29343,
             '2017-03-09': 27288,
             '2017-03-21': 19410,
             '2017-03-22': 10989,
             '2017-03-10': 12483,
             '2017-03-23': 24213,
             '2017-03-24': 13467,
             '2017-03-11': 19059,
             '2017-03-12': 25089,
             '2017-03-25': 12225,
             '2017-03-13': 24783,
             '2017-03-26': 13587,
             '2017-03-14': 23418,
             '2017-03-27': 14544,
             '2017-03-15': 11901,
             '2017-03-28': 22338,
             '2017-03-29': 12120,
             '2017-03-16': 13869,
             '2017-03-17': 26922,
             '2017-03-30': 5814,

<font color='violet'>
The sorting is not strictly necessary, but it helps to verify the correctness of the result and to compare it with notebook 4:</font>

In [7]:
myKeys = list(dayRDD.keys())
myKeys.sort()
{i: dayRDD[i] for i in myKeys}

{'2017-02-28': 62103,
 '2017-03-01': 33423,
 '2017-03-02': 32403,
 '2017-03-03': 29727,
 '2017-03-04': 30225,
 '2017-03-05': 26019,
 '2017-03-06': 24315,
 '2017-03-07': 26625,
 '2017-03-08': 29343,
 '2017-03-09': 27288,
 '2017-03-10': 12483,
 '2017-03-11': 19059,
 '2017-03-12': 25089,
 '2017-03-13': 24783,
 '2017-03-14': 23418,
 '2017-03-15': 11901,
 '2017-03-16': 13869,
 '2017-03-17': 26922,
 '2017-03-18': 17427,
 '2017-03-19': 21999,
 '2017-03-20': 21942,
 '2017-03-21': 19410,
 '2017-03-22': 10989,
 '2017-03-23': 24213,
 '2017-03-24': 13467,
 '2017-03-25': 12225,
 '2017-03-26': 13587,
 '2017-03-27': 14544,
 '2017-03-28': 22338,
 '2017-03-29': 12120,
 '2017-03-30': 5814,
 '2017-03-31': 3393,
 '2017-04-01': 537}
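
Since `countByValue()` returns an ordinary Python dictionary on the driver, the sort can also be written as a single expression. A small sketch on three of the counts shown above (the name `day_counts` is illustrative only):

```python
# Three of the per-day counts from the output above, in arbitrary order
day_counts = {"2017-03-31": 3393, "2017-02-28": 62103, "2017-03-01": 33423}

# ISO-formatted dates sort chronologically when compared as strings
sorted_counts = dict(sorted(day_counts.items()))
print(sorted_counts)
```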

<font color='violet'>2. Count the number of measures for each pair of RoomId-SensorId:</font>

In [8]:
roomsensorRDD = sensorRDD.map(lambda x: x[2]).countByValue()

<font color='violet'>Again, sorting is not strictly needed:</font>

In [9]:
myKeys = list(roomsensorRDD.keys())
myKeys.sort()
{i: roomsensorRDD[i] for i in myKeys}

{'1-0': 43047,
 '1-1': 43047,
 '1-2': 43047,
 '2-0': 46915,
 '2-1': 46915,
 '2-2': 46915,
 '3-0': 46634,
 '3-1': 46634,
 '3-2': 46634,
 '4-0': 43793,
 '4-1': 43793,
 '4-2': 43793,
 '5-0': 35,
 '5-1': 35,
 '5-2': 35,
 '6-0': 35666,
 '6-1': 35666,
 '6-2': 35666,
 '7-0': 14910,
 '7-1': 14910,
 '7-2': 14910}

<font color='violet'>3. Compute the average of Value1:
</font>

In [10]:
sensorRDD.map(lambda x: float(x[3])).mean()

92.8069927576456

## 2. Movielens movie data exercises

Movielens (https://movielens.org/) is a website that provides non-commercial, personalised movie recommendations. GroupLens Research has collected and made available rating data sets from the MovieLens web site for the purpose of research into making recommendation services. In this exercise, we will use one of these datasets (the movielens latest dataset, http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) and compute some basic queries on it.
The dataset has already been downloaded and is available at data/movielens/movies.csv, data/movielens/ratings.csv, data/movielens/tags.csv, data/movielens/links.csv

1. Inspect the dataset's [README file](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html), in particular the section titled "Content and Use of Files", to learn the structure of these files.
2. Compute all pairs (`movieid`, `rat`) where `movieid` is a movie id (as found in ratings.csv) and `rat` is the average rating of that movie id. (Hint: use aggregateByKey to compute first the sum of all ratings as well as the number of ratings per key).
3. Compute all pairs (`title`, `rat`) where `title` is a full movie title (as found in the movies.csv file), and `rat` is the average rating of that movie (computed over all possible ratings for that movie, as found in the ratings.csv file).
4. [_Extra_] Compute all pairs (`title`, `tag`) where `title` is a full movie title that has an average rating of at least 3.5, and `tag` is a tag for that movie (as found in the tags.csv file).

Extra: if you want to experiment with larger datasets, download the 10m dataset (http://files.grouplens.org/datasets/movielens/ml-10m.zip, 250 MB uncompressed) and redo the exercises above.

<font color='violet'> 1. Inspect the dataset's README file, in particular the section titled "Content and Use of Files", to learn the structure of these files:</font>

In [11]:
readme = 'data/movielens/README.txt'
# The context manager closes the file even if printing fails
with open(readme, "r") as file_obj:
    for line in file_obj:
        print(line)

Summary




This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.



Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.



The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.



This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.



This and other GroupLens data sets are publicly availab

<font color='violet'>Reading in data:</font>

In [12]:
fileName = 'data/movielens/ratings.csv'
ratingsRDD = sc.textFile(fileName)
ratingsRDD = ratingsRDD.map(lambda x:x.split(","))
ratingsRDD.take(3)

[['1', '1', '4.0', '964982703'],
 ['1', '3', '4.0', '964981247'],
 ['1', '6', '4.0', '964982224']]

<font color='violet'>Notice the commas present in the movie titles, and the additional quotes for the 11th movie:</font>

In [13]:
moviesRDD = sc.textFile('data/movielens/movies.csv')
moviesRDD.take(11)

['1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
 '2,Jumanji (1995),Adventure|Children|Fantasy',
 '3,Grumpier Old Men (1995),Comedy|Romance',
 '4,Waiting to Exhale (1995),Comedy|Drama|Romance',
 '5,Father of the Bride Part II (1995),Comedy',
 '6,Heat (1995),Action|Crime|Thriller',
 '7,Sabrina (1995),Comedy|Romance',
 '8,Tom and Huck (1995),Adventure|Children',
 '9,Sudden Death (1995),Action',
 '10,GoldenEye (1995),Action|Adventure|Thriller',
 '11,"American President, The (1995)",Comedy|Drama|Romance']

<font color='violet'>Since using dataframes is not allowed in this notebook, we have to parse the data manually. The first comma in each line always separates the movieID from the rest of the line, and the last comma always separates the genres from the rest of the line. We also have to strip the extra quotes present in some titles; see, in particular, the 11th movie:</font>

In [14]:
moviesRDD = moviesRDD.map(lambda x:x.split(",",1)).map(lambda x:[x[0],x[1].rsplit(",",1)[0].strip('\"')])
moviesRDD.take(11)

[['1', 'Toy Story (1995)'],
 ['2', 'Jumanji (1995)'],
 ['3', 'Grumpier Old Men (1995)'],
 ['4', 'Waiting to Exhale (1995)'],
 ['5', 'Father of the Bride Part II (1995)'],
 ['6', 'Heat (1995)'],
 ['7', 'Sabrina (1995)'],
 ['8', 'Tom and Huck (1995)'],
 ['9', 'Sudden Death (1995)'],
 ['10', 'GoldenEye (1995)'],
 ['11', 'American President, The (1995)']]
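
An alternative to the manual splitting is Python's standard `csv` module, which understands double-quoted fields. This is only a sketch and not the approach used in this notebook; `parse_csv_line` is a hypothetical helper name.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line, honouring double-quoted fields that contain commas."""
    return next(csv.reader([line]))

# The 11th movie from the output above, whose title contains a comma
fields = parse_csv_line('11,"American President, The (1995)",Comedy|Drama|Romance')
movie_id, title = fields[0], fields[1]
print(movie_id, title)
```

In Spark, such a helper could be applied with `sc.textFile('data/movielens/movies.csv').map(parse_csv_line)`.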

<font color='violet'>Similarly, for the tags dataset:</font>

In [15]:
tagsRDD = sc.textFile('data/movielens/tags.csv')
tagsRDD = tagsRDD.map(lambda x:x.split(","))
tagsRDD.take(3)

[['2', '60756', 'funny', '1445714994'],
 ['2', '60756', 'Highly quotable', '1445714996'],
 ['2', '60756', 'will ferrell', '1445714992']]

<font color='violet'>2. Compute all pairs (movieid, rat) where movieid is a movie id (as found in ratings.csv) and rat is the average rating of that movie id. (Hint: use aggregateByKey to compute first the sum of all ratings as well as the number of ratings per key).

<font color='violet'>We build a pair RDD in which each key is a movie id and each value is a list holding one rating and the constant 1. Summing these lists element-wise with reduceByKey gives, per key, the sum of the ratings and the number of ratings. Dividing the former by the latter gives the average rating. Again, sorting is not strictly needed.</font>
</font>

In [16]:
pairRDD = ratingsRDD.map(lambda x:(int(x[1]),[float(x[2]),1]))
pairRDD.reduceByKey(lambda x,y: [x[0]+y[0],x[1]+y[1]]).mapValues(lambda x:x[0]/x[1]).sortByKey().take(5)

[(1, 3.9209302325581397),
 (2, 3.4318181818181817),
 (3, 3.2596153846153846),
 (4, 2.357142857142857),
 (5, 3.0714285714285716)]
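
The exercise hint mentions `aggregateByKey`, whereas the cell above uses `reduceByKey`. The sketch below shows the two functions that `aggregateByKey` needs and imitates, in plain Python, how they are applied within and across partitions. The partition contents are hypothetical, and the Spark call in the final comment assumes `ratingsRDD` from the cells above.

```python
from functools import reduce

# Zero value: (running sum of ratings, number of ratings seen so far)
ZERO = (0.0, 0)

def seq_op(acc, rating):
    """Fold one rating into the accumulator of a single partition."""
    return (acc[0] + rating, acc[1] + 1)

def comb_op(a, b):
    """Merge the accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical ratings of one movie, spread over two partitions
partitions = [[4.0, 3.5], [5.0]]
partials = [reduce(seq_op, part, ZERO) for part in partitions]
total, count = reduce(comb_op, partials, ZERO)
average = total / count
print(partials, average)

# Spark counterpart:
# ratingsRDD.map(lambda x: (int(x[1]), float(x[2]))) \
#           .aggregateByKey(ZERO, seq_op, comb_op) \
#           .mapValues(lambda s: s[0] / s[1])
```

With `aggregateByKey` the values can stay plain floats, because the `(sum, count)` accumulator is introduced by the zero value rather than by an extra `map`.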

<font color='violet'>3. Compute all pairs (title, rat) where title is a full movie title (as found in the movies.csv file), and rat is the average rating of that movie (computed over all possible ratings for that movie, as found in the ratings.csv file)

The averaging is achieved by a similar trick as in the previous exercise, though indexing is a bit more involved here. Again, sorting is not strictly necessary.
</font>

In [17]:
pairRDD1 = ratingsRDD.map(lambda x:(x[1],[float(x[2]),1]))
pairRDD1.take(3)

[('1', [4.0, 1]), ('3', [4.0, 1]), ('6', [4.0, 1])]

In [18]:
pairRDD2 = moviesRDD.map(lambda x:(x[0],x[1]))
pairRDD2.take(11)

[('1', 'Toy Story (1995)'),
 ('2', 'Jumanji (1995)'),
 ('3', 'Grumpier Old Men (1995)'),
 ('4', 'Waiting to Exhale (1995)'),
 ('5', 'Father of the Bride Part II (1995)'),
 ('6', 'Heat (1995)'),
 ('7', 'Sabrina (1995)'),
 ('8', 'Tom and Huck (1995)'),
 ('9', 'Sudden Death (1995)'),
 ('10', 'GoldenEye (1995)'),
 ('11', 'American President, The (1995)')]
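
A `join` of two pair RDDs yields elements of the form `(key, (left_value, right_value))`, which explains the nested indexing in the next cell. The plain-Python sketch below only imitates that shape on one hand-written element; it does not call Spark.

```python
# One element as produced by pairRDD1.join(pairRDD2): (movieId, ([rating, 1], title))
joined = ("1", ([4.0, 1], "Toy Story (1995)"))

movie_id = joined[0]
rating_sum, rating_count = joined[1][0]  # the [sum, count] list coming from pairRDD1
title = joined[1][1]                     # the title coming from pairRDD2

print(title, rating_sum / rating_count)
```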

In [19]:
pairRDD1.join(pairRDD2).reduceByKey(lambda x,y: ([x[0][0]+y[0][0],x[0][1]+y[0][1]],y[1])).map(lambda x:(x[1][1],x[1][0][0]/x[1][0][1])).sortByKey().take(32)

[("'71 (2014)", 4.0),
 ("'Hellboy': The Seeds of Creation (2004)", 4.0),
 ("'Round Midnight (1986)", 3.5),
 ("'Salem's Lot (2004)", 5.0),
 ("'Til There Was You (1997)", 4.0),
 ("'Tis the Season for Love (2015)", 1.5),
 ("'burbs, The (1989)", 3.176470588235294),
 ("'night Mother (1986)", 3.0),
 ('(500) Days of Summer (2009)', 3.6666666666666665),
 ('*batteries not included (1987)', 3.2857142857142856),
 ('...All the Marbles (1981)', 2.0),
 ('...And Justice for All (1979)', 3.1666666666666665),
 ('00 Schneider - Jagd auf Nihil Baxter (1994)', 4.5),
 ('1-900 (06) (1994)', 4.0),
 ('10 (1979)', 3.375),
 ('10 Cent Pistol (2015)', 1.25),
 ('10 Cloverfield Lane (2016)', 3.6785714285714284),
 ('10 Items or Less (2006)', 2.6666666666666665),
 ('10 Things I Hate About You (1999)', 3.5277777777777777),
 ('10 Years (2011)', 3.5),
 ('10,000 BC (2008)', 2.7058823529411766),
 ('100 Girls (2000)', 3.25),
 ('100 Streets (2016)', 2.5),
 ('101 Dalmatians (1996)', 3.074468085106383),
 ('101 Dalmatians (One

<font color='violet'>4. [_Extra_] Compute all pairs (title, tag) where title is a full movie title that has an average rating of at least 3.5, and tag is a tag for that movie (as found in the tags.csv file)

Again, sorting is not strictly needed, and the average rating is only included to verify that no ratings below 3.5 are listed.
</font>


In [20]:
ratingsRDD.map(lambda x:(x[1],[float(x[2]),1]))\
          .reduceByKey(lambda x,y: [x[0]+y[0],x[1]+y[1]])\
          .mapValues(lambda x:x[0]/x[1])\
          .filter(lambda x: x[1] >= 3.5)\
          .join(moviesRDD.map(lambda x:(x[0],x[1])))\
          .join(tagsRDD.map(lambda x:(x[1],x[2])))\
          .map(lambda x: (x[1][0][1],[x[1][1],x[1][0][0]]))\
          .sortBy(lambda x:x[1][1])\
          .take(10)

[('Ghost World (2001)', ['adolescence', 3.5]),
 ('Prince of Egypt, The (1998)', ['Bible', 3.5]),
 ('Prince of Egypt, The (1998)', ['Moses', 3.5]),
 ('Babel (2006)', ['Brad Pitt', 3.5]),
 ('Babel (2006)', ['cate blanchett', 3.5]),
 ('Babel (2006)', ['multiple storylines', 3.5]),
 ('Babel (2006)', ['social commentary', 3.5]),
 ('Father of the Bride (1991)', ['remake', 3.5]),
 ('Father of the Bride (1991)', ['wedding', 3.5]),
 ('Owning Mahowny (2003)', ['gambling', 3.5])]

## 3. GitHub log data exercises
GitHub makes activity logs publicly available at https://www.githubarchive.org/. One such log file, which contains activity data for 2015-03-01 between 00:00 and 01:00, has been downloaded and is available at `data/github/2015-03-01-0.json.gz`. This (compressed) file contains multiple JSON objects, one per line. Here is a sample line of this file, neatly formatted:

`{ "id": "2614896652",
    "type": "CreateEvent",
    "actor": {
        "id": 739622,
        "login": "treydock",
        "gravatar_id": "",
        "url": "https://api.github.com/users/treydock",
        "avatar_url": "https://avatars.githubusercontent.com/u/739622?"
    },
    "repo": {
        "id": 23934080,
        "name": "Early-Modern-OCR/emop-dashboard",
        "url": "https://api.github.com/repos/Early-Modern-OCR/emop-dashboard"
    },
    "payload": {
        "ref": "development",
        "ref_type": "branch",
        "master_branch": "master",
        "description": "",
        "pusher_type": "user"
    },
    "public": true,
    "created_at": "2015-03-01T00:00:00Z",
    "org": {
        "id": 10965476,
        "login": "Early-Modern-OCR",
        "gravatar_id": "",
        "url": "https://api.github.com/orgs/Early-Modern-OCR",
        "avatar_url": "https://avatars.githubusercontent.com/u/10965476?"
    }
}`

This log entry has type `CreateEvent` and its `payload.ref_type` is `branch`. So someone named "treydock" (`actor.login`) created a repository branch called "development" (`payload.ref`) in the first second of March 1, 2015 (`created_at`).
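
The fields named above can be read from a parsed line with ordinary dictionary indexing. A minimal sketch, using an abridged version of the sample line (most fields are left out for brevity):

```python
import json

# Abridged version of the sample log line; the real line has more fields
line = (
    '{"id": "2614896652", "type": "CreateEvent", '
    '"actor": {"login": "treydock"}, '
    '"payload": {"ref": "development", "ref_type": "branch"}, '
    '"created_at": "2015-03-01T00:00:00Z"}'
)

event = json.loads(line)  # JSON text -> nested Python dictionaries
summary = (
    event["type"],
    event["payload"]["ref_type"],
    event["actor"]["login"],
    event["payload"]["ref"],
    event["created_at"],
)
print(summary)
```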

1. Load the textfile into an RDD (note: spark can read gzipped files directly!). Convert this RDD (which consists of string elements) to an RDD where each element is a JSON object (hint: use the `json.loads` function from the `json` module to convert a string into a JSON object).

2. Filter this RDD of JSON objects to retain only those objects that represent push activities (where `type` equals `PushEvent`)

3. Count the number of push events.

4. Compute the number of push events, grouped per `actor.login`. 

5. Retrieve the results of (4) in sorted order, where logins with higher number of pushes come first. Retrieve the 10 first such results (which contain the highest number of pushes)

6. You are representing a company and need to retrieve the number of pushes for every employee in the company. The file `data/github/employees.txt` contains a list of all employee login names at your company.

Extra: if you want to experiment with larger datasets, download more log data from the github archive website and re-do the exercises above

<font color='violet'>1. Load the textfile into an RDD (note: spark can read gzipped files directly!). Convert this RDD (which consists of string elements) to an RDD where each element is a JSON object (hint: use the json.loads function from the json module to convert a string into a JSON object):</font>

In [21]:
githubRDD = sc.textFile('data/github/2015-03-01-0.json.gz')
import json
githubRDD = githubRDD.map(lambda x:json.loads(x))
githubRDD.take(2)

[{'id': '2614896652',
  'type': 'CreateEvent',
  'actor': {'id': 739622,
   'login': 'treydock',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/treydock',
   'avatar_url': 'https://avatars.githubusercontent.com/u/739622?'},
  'repo': {'id': 23934080,
   'name': 'Early-Modern-OCR/emop-dashboard',
   'url': 'https://api.github.com/repos/Early-Modern-OCR/emop-dashboard'},
  'payload': {'ref': 'development',
   'ref_type': 'branch',
   'master_branch': 'master',
   'description': '',
   'pusher_type': 'user'},
  'public': True,
  'created_at': '2015-03-01T00:00:00Z',
  'org': {'id': 10965476,
   'login': 'Early-Modern-OCR',
   'gravatar_id': '',
   'url': 'https://api.github.com/orgs/Early-Modern-OCR',
   'avatar_url': 'https://avatars.githubusercontent.com/u/10965476?'}},
 {'id': '2614896653',
  'type': 'PushEvent',
  'actor': {'id': 9063348,
   'login': 'bezerrathm',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bezerrathm',
   'avatar_url': 'https://avatar

<font color='violet'> 2. Filter this RDD of JSON objects to retain only those objects that represent push activities (where type equals PushEvent).

Again, sorting is not strictly needed:</font>

In [22]:
pushRDD = githubRDD.filter(lambda x:x['type']=='PushEvent').sortBy(lambda x:x["actor"]["login"])
pushRDD.take(3)

[{'id': '2614921736',
  'type': 'PushEvent',
  'actor': {'id': 4584144,
   'login': '0000marcell',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/0000marcell',
   'avatar_url': 'https://avatars.githubusercontent.com/u/4584144?'},
  'repo': {'id': 31145226,
   'name': '0000marcell/BLANK_SITE',
   'url': 'https://api.github.com/repos/0000marcell/BLANK_SITE'},
  'payload': {'push_id': 588078540,
   'size': 1,
   'distinct_size': 1,
   'ref': 'refs/heads/master',
   'head': '1bc3f18a8ccc5bd51f4568533af3530a14511252',
   'before': '8a3b9fbbb7445ae809a6441414d6989112fd20e6',
   'commits': [{'sha': '1bc3f18a8ccc5bd51f4568533af3530a14511252',
     'author': {'email': '3b41845c55bd0cbc15aa56425f563af645db6204@gmail.com',
      'name': 'Marcell Monteiro Cruz'},
     'message': 'coffee-script implemented',
     'distinct': True,
     'url': 'https://api.github.com/repos/0000marcell/BLANK_SITE/commits/1bc3f18a8ccc5bd51f4568533af3530a14511252'}]},
  'public': True,
  'created_at': '2

<font color='violet'>3. Count the number of push events:</font>

In [23]:
pushRDD.count()

8793


<font color='violet'> 4. Compute the number of push events, grouped per `actor.login`.

Again, sorting is not strictly needed:</font>

In [24]:
pushbyloginRDD = pushRDD.groupBy(lambda x: x['actor']['login']).map(lambda x: (x[0], len(x[1]))).sortByKey()
pushbyloginRDD.take(3)

[('0000marcell', 1), ('01000101', 1), ('05K4R1N', 2)]
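
`groupBy` collects every event of a login into a group before the group is counted. A common alternative is to map each event to `(login, 1)` and sum the ones with `reduceByKey`. The sketch below imitates that counting logic in plain Python on hypothetical, heavily abridged events; the Spark counterpart in the final comment assumes `pushRDD` from the cells above.

```python
from collections import Counter

# Hypothetical, heavily abridged push events
events = [
    {"type": "PushEvent", "actor": {"login": "05K4R1N"}},
    {"type": "PushEvent", "actor": {"login": "0000marcell"}},
    {"type": "PushEvent", "actor": {"login": "05K4R1N"}},
]

# Count one per event, keyed by the login of the actor
push_counts = Counter(event["actor"]["login"] for event in events)
print(sorted(push_counts.items()))

# Spark counterpart:
# pushRDD.map(lambda x: (x["actor"]["login"], 1)).reduceByKey(lambda a, b: a + b)
```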

<font color='violet'>5. Retrieve the results of (4) in sorted order, where logins with higher number of pushes come first. Retrieve the 10 first such results (which contain the highest number of pushes):</font>


In [25]:
pushbyloginRDD.sortBy(lambda x:x[1],False).take(10)

[('greatfirebot', 192),
 ('diversify-exp-user', 146),
 ('KenanSulayman', 72),
 ('manuelrp07', 45),
 ('mirror-updates', 42),
 ('tryton-mirror', 37),
 ('Somasis', 26),
 ('direwolf-github', 24),
 ('EmanueleMinotto', 22),
 ('hansliu', 21)]
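
When only the first few entries are wanted, a full sort does more work than needed. `RDD.takeOrdered(num, key)` returns the `num` smallest elements under `key`, so negating the count selects the largest. The plain-Python sketch below shows the same idea with `heapq.nlargest` on four of the pairs listed above.

```python
import heapq

# Four of the (login, pushes) pairs from the output above, in arbitrary order
pairs = [("Somasis", 26), ("greatfirebot", 192), ("hansliu", 21), ("KenanSulayman", 72)]

# Keep only the two pairs with the highest push count
top2 = heapq.nlargest(2, pairs, key=lambda x: x[1])
print(top2)

# Spark counterpart:
# pushbyloginRDD.takeOrdered(10, key=lambda x: -x[1])
```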

<font color='violet'> 6. You are representing a company and need to retrieve the number of pushes for every employee in the company. The file `data/github/employees.txt` contains a list of all employee login names at your company.

IMPORTANT: All the employees in the file data/github/employees.txt seem to have at least one push. I have added additional employees with zero pushes to make sure my code is robust:
</font>

In [26]:
employeesRDD = sc.textFile('data/github/employees.txt')
employeesRDD = employeesRDD.union(sc.parallelize(['An employee with zero pushes','Another employee with zero pushes']))
employeesRDD.sortBy(lambda x: x).take(5)

['AiMadobe',
 'Akkyie',
 'An employee with zero pushes',
 'Another employee with zero pushes',
 'BatMiles']

<font color='violet'>A left outer join, applied to employeesRDD, makes sure that all employees, even those with no pushes, are present in the result. Employees with no pushes get the value `None`, which must be replaced by `0`. Sorting is done for convenience, but is not strictly necessary.
</font>

In [27]:
pushersRDD = employeesRDD.groupBy(lambda x: x).leftOuterJoin(pushbyloginRDD).mapValues(lambda x: x[1] if x[1] is not None else 0).sortBy(lambda x: x[1])
pushersRDD.take(10)

[('Another employee with zero pushes', 0),
 ('An employee with zero pushes', 0),
 ('barnardn', 1),
 ('Ramzawulf', 1),
 ('summersd', 1),
 ('eckardjf', 1),
 ('elhaddad1', 1),
 ('mikebronner', 1),
 ('serranoarevalo', 1),
 ('alexanderdidenko', 1)]
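
The effect of the left outer join followed by the `None` replacement can be imitated in plain Python. The sketch below uses a hypothetical helper, `pushes_per_employee`, and hand-written inputs; it illustrates the semantics and does not call Spark.

```python
def pushes_per_employee(employees, push_counts):
    """Keep every employee; a missing push count becomes 0."""
    result = []
    for name in employees:
        count = push_counts.get(name)  # None when the employee never pushed
        result.append((name, count if count is not None else 0))
    return result

employees = ["KenanSulayman", "An employee with zero pushes"]
push_counts = {"KenanSulayman": 72, "greatfirebot": 192}  # greatfirebot is not an employee
print(pushes_per_employee(employees, push_counts))
```

Logins that appear only on the right-hand side, such as `greatfirebot` here, are dropped, which is what distinguishes a left outer join from a full outer join.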

In [28]:
pushersRDD.sortBy(lambda x: x[1], ascending = False).take(3)


[('KenanSulayman', 72), ('manuelrp07', 45), ('Somasis', 26)]

<font color='violet'>Stopping Spark:</font>

In [29]:
sc.stop()