You can use this notebook to develop your answers. Make sure to look at intermediate results using `take()` for debugging.

In [1]:
import json 

## Load data into RDDs
usersRDD = sc.textFile("datafiles/se_users.json").map(json.loads)
postsRDD = sc.textFile("datafiles/se_posts.json").map(json.loads)
playRDD = sc.textFile("datafiles/play.txt")
logsRDD = sc.textFile("datafiles/NASA_logs_sample.txt")
amazonInputRDD = sc.textFile("datafiles/amazon-ratings.txt")
nobelRDD = sc.textFile("datafiles/prize.json").map(json.loads)
amazonBipartiteRDD = amazonInputRDD.map(lambda x: x.split(" ")).map(lambda x: (x[0], x[1])).distinct()

In [77]:
for t in postsRDD.take(3): print(t)

{'id': 2, 'posttypeid': 1, 'title': 'How can a group track database schema changes?', 'acceptedanswerid': 4, 'parentid': None, 'creationdate': '2011-01-03', 'score': 68, 'viewcount': 11533, 'owneruserid': 7, 'lasteditoruserid': 97, 'tags': '<mysql><version-control><schema>'}
{'id': 3, 'posttypeid': 1, 'title': 'What is an effective way of labeling columns in a database?', 'acceptedanswerid': None, 'parentid': None, 'creationdate': '2011-01-03', 'score': 30, 'viewcount': 1302, 'owneruserid': 17, 'lasteditoruserid': 97, 'tags': '<database-design><erd>'}
{'id': 4, 'posttypeid': 2, 'title': None, 'acceptedanswerid': None, 'parentid': 2, 'creationdate': '2011-01-03', 'score': 46, 'viewcount': None, 'owneruserid': 18, 'lasteditoruserid': 1396, 'tags': None}


Task 1 (0.25): Use filter to find all posts whose tags are not null (None in Python) and that are tagged 'postgresql-9.4', then use a map so that the output RDD has tuples of the form (ID, Title, Tags). Note that postsRDD contains dictionaries; see the contents by running postsRDD.take(10).

In [35]:
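def task1(postsRDD):
    # Keep posts that have tags and carry the exact tag 'postgresql-9.4'
    # (matching on '<...>' avoids accepting longer tags that contain it),
    # then project each post to (ID, Title, Tags).
    res = postsRDD.filter(
        lambda x: x.get("tags") is not None).filter(
        lambda x: "<postgresql-9.4>" in x.get("tags")).map(
        lambda x: (x.get("id"), x.get("title"), x.get("tags"))
    )
    return res
a = task1(postsRDD)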

In [39]:
for t in a.take(10): print(t)

(89480, 'PostgreSQL timezone setting', '<postgresql><postgresql-9.4>')
(89555, 'Retrieving latest record using DISTINCT ON is slow', '<postgresql><index><performance><postgresql-9.4><query-performance>')
(89746, 'Use result of aggregate in same select?', '<postgresql><postgresql-9.4>')
(89971, 'PostgreSql JSONB SELECT against multiple values', '<postgresql><json><postgresql-9.4>')
(90002, 'PostgreSQL operator uses index but underlying function does not', '<postgresql><index-tuning><json><postgresql-9.4><operator>')
(90360, 'Rely on .pgpass in CREATE USER MAPPING', '<postgresql><postgresql-9.4><foreign-data>')
(95214, 'Working with Materialized View', '<postgresql><materialized-view><postgresql-9.4><pgbouncer>')
(95758, 'PostgreSQL update and delete property from JSONB column', '<postgresql><postgresql-9.4>')
(95778, 'Clarification on UNION ALL of JSONB_EACH result', '<postgresql><postgresql-9.4>')
(45870, 'How to do incremental backup every hour in Postgres?', '<postgresql><backup><win
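
A plain substring test on the raw tag string would also accept any longer tag that merely contains the text `postgresql-9.4`. A minimal sketch of the difference in plain Python, with no Spark needed; the tag `postgresql-9.4-beta` and the helper `has_tag` are hypothetical, made up for illustration:

```python
def has_tag(tag_string, tag):
    """Return True if tag_string (e.g. '<a><b>') contains exactly `tag`."""
    return tag_string is not None and "<" + tag + ">" in tag_string

exact = "<postgresql><postgresql-9.4>"
longer = "<postgresql><postgresql-9.4-beta>"  # hypothetical tag

print("postgresql-9.4" in longer)         # substring test accepts the longer tag
print(has_tag(longer, "postgresql-9.4"))  # delimiter-aware test rejects it
print(has_tag(exact, "postgresql-9.4"))
```

Since `has_tag` also handles `None`, a helper like this could serve as the only predicate in a single `filter` call.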

Task 2 (0.25): Use flatMap on the postsRDD to create an RDD (ID, Tag), listing all the tags for each post as a separate tuple. If a post has no tags, it should not appear in the output RDD.

In [74]:
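def task2FlatMapper(dic):
    # Tags look like '<mysql><schema>': turn them into a list of tag names.
    tags = dic.get("tags").replace("<", "").replace(">", " ").split(" ")
    tags.pop()  # drop the empty string left after the final '>'
    # Emit one (ID, Tag) tuple per tag.
    return [(dic.get("id"), tag) for tag in tags]


def task2(postsRDD):
    # Skip posts without tags, then expand each post into its (ID, Tag) tuples.
    return postsRDD.filter(
        lambda x: x.get("tags") is not None).flatMap(
        task2FlatMapper
    )
a = task2(postsRDD)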
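
The tag string could also be parsed with a regular expression instead of chained `replace` calls. A minimal sketch in plain Python, with no Spark needed; `parse_tags` and `post_to_pairs` are hypothetical helper names, not part of the assignment:

```python
import re

def parse_tags(tag_string):
    """Return the tag names inside a string such as '<mysql><schema>'."""
    if tag_string is None:
        return []
    # Each tag is the text between a '<' and the next '>'.
    return re.findall(r"<([^<>]+)>", tag_string)

def post_to_pairs(post):
    """Return one (ID, Tag) tuple per tag of a post dictionary."""
    return [(post.get("id"), tag) for tag in parse_tags(post.get("tags"))]

sample = {"id": 2, "tags": "<mysql><version-control><schema>"}
pairs = post_to_pairs(sample)
print(pairs)
```

Because `parse_tags` returns an empty list for `None`, a function like `post_to_pairs` passed to `flatMap` would contribute nothing for untagged posts, so the separate `filter` step would likely become optional.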

Task 3 (0.25): The goal here is to find the 5 lexicographically smallest tags for each year, among the posts from that year. So the output RDD should contain tuples of the form: ('2001', ['tag1', 'tag2', ..., 'tag5']), with 'tag1' < 'tag2' < ... < 'tag5' and 'tag5' being smaller (lexicographically) than any other tag for a post from that year. All five (or fewer for some of the years) tags should be distinct. Use a map followed by reduceByKey to do this.

In [86]:
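def task3MapA(dic):
    # Key each post by its creation year; the value is the set of its tags.
    year = dic["creationdate"][:4]
    tags = dic.get("tags").replace("<", "").replace(">", " ").split(" ")
    tags.pop()  # drop the empty string left after the final '>'
    return (year, set(tags))

def task3MapB(tu):
    # Sort the year's distinct tags and keep the five smallest.
    return (tu[0], sorted(tu[1])[:5])

def task3(postsRDD):
    # Union the tag sets per year, then trim each year to its five smallest tags.
    return postsRDD.filter(
        lambda x: x.get("tags") is not None).map(
        task3MapA
    ).reduceByKey(lambda v1, v2: v1 | v2).map(task3MapB)
a = task3(postsRDD)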

In [87]:
for t in a.take(10): print(t)
# print(len(a.collect()))

('2011', ['access-control', 'active-directory', 'activity-monitor', 'ado.net', 'aggregate'])
('2014', ['access-control', 'acid', 'active-directory', 'activity-monitor', 'address'])
('2015', ['access-control', 'active-directory', 'ado.net', 'aggregate', 'alter-database'])
('2010', ['data-warehouse', 'database-design', 'dbcc', 'export', 'import'])
('2012', ['access-control', 'active-directory', 'activity-monitor', 'address', 'ado.net'])
('2013', ['access-control', 'acid', 'active-directory', 'activity-monitor', 'address'])
('2009', ['career', 'ssas'])
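
The union in `reduceByKey` above carries every distinct tag of a year until the final map. One possible alternative, sketched here in plain Python, trims the merged result to five tags at every step. `merge_smallest` is a hypothetical name, and `functools.reduce` stands in for what `reduceByKey` would do to the values of a single key:

```python
from functools import reduce

def merge_smallest(v1, v2, k=5):
    """Merge two tag collections, keeping the k smallest distinct tags."""
    return sorted(set(v1) | set(v2))[:k]

# Per-post tag lists for one hypothetical year.
per_post = [
    ["mysql", "schema"],
    ["backup", "mysql"],
    ["acid", "index", "json", "zfs"],
]
smallest = reduce(merge_smallest, per_post)
print(smallest)
```

One caveat with this approach: a year with a single post never passes through the reducer, so the preceding map step would itself need to emit `sorted(set(tags))[:5]` for the output to be sorted and trimmed in every case.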


In [6]:
# One way: collect the distinct years, then for each year filter the posts, gather all their tags, and sort them lexicographically.
# A second way (with some redundant work): sort each post's tags lexicographically first, then combine per year?
def task3FlatMapper(dic):
    res = []
    tags = dic.get("tags").replace("<", "").replace(">", " ").split(" ")
    tags.pop()
    
    return res
def task3(postsRDD):
    res = []
    years = postsRDD.map(lambda x: x["creationdate"][:4]).distinct().collect()
    return years
#     for year in years:
#         postsRDD.filter(lambda x: x["creationdate"][:4] == year).map
#         tags = 
#         res.append( (year, tags))
#     return sc.parallelize(res)
a = task3(postsRDD)

In [9]:
for t in a: print(t)

TypeError: 'PipelinedRDD' object is not iterable

In [94]:
dd = {"name" :"joshua", 
      "date": "2011-01-03"}

In [97]:
dd["date"][:4]

'2011'

In [5]:
"2011" >"2012"

False
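
Strings compare character by character using code points, which is why year strings of equal length order like numbers. A small sketch of where that intuition may stop holding; the sample values are made up for illustration:

```python
# Four-digit year strings compare the same way the numbers would.
same_length = "2011" < "2012"

# Strings of different length do not: '9' sorts after '1'.
different_length = "9" < "10"

# Digits sort before uppercase letters, which sort before lowercase letters.
mixed = sorted(["b", "A", "a", "10", "9"])
print(same_length, different_length, mixed)
```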

In [56]:
{1,2}.add({1,2})

TypeError: unhashable type: 'set'
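
The error above arises because set elements must be hashable, and a mutable `set` is not. A minimal sketch of three ways to combine sets instead, depending on what is intended:

```python
s = {1, 2}

# Merge the elements of another set in place.
s.update({2, 3})

# Build a new set without modifying either operand.
merged = {1, 2} | {2, 3}

# To nest a set inside a set, use the immutable, hashable frozenset.
nested = {frozenset({1, 2})}
nested.add(frozenset({1, 2}))  # duplicate element, so the size stays 1
print(s, merged, len(nested))
```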

In [72]:
a = {5,5,5,3,2}

In [73]:
list(a).sort()
print(a)

{2, 3, 5}
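
In the cell above, `list(a).sort()` sorts a temporary list that is immediately discarded, so `a` is still the original set when printed. A short sketch of the difference between `list.sort()` and `sorted()`:

```python
a = {5, 5, 5, 3, 2}

# list.sort() sorts in place and returns None, so the sorted copy is lost.
discarded = list(a).sort()

# sorted() returns a new sorted list and leaves the set unchanged.
ordered = sorted(a)
print(discarded, ordered)
```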


In [79]:
tu = ("2001", {"mysql", "joshua", "daniel"})

In [81]:
x= tu[1]
x = list(x)
x.sort()
tu[1] = x[:5]

TypeError: 'tuple' object does not support item assignment
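
Tuples are immutable, so the assignment to `tu[1]` fails. A minimal sketch of the usual workaround, which builds a new tuple from the pieces:

```python
tu = ("2001", {"mysql", "joshua", "daniel"})

# Unpack the tuple, then construct a new one with the sorted, trimmed tags.
year, tags = tu
result = (year, sorted(tags)[:5])
print(result)
```

Returning a fresh tuple like this is also the natural shape for a function passed to an RDD `map`.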