# CMU 10-405/605 Machine Learning with Large Datasets
## **Welcome to Recitation 2**
This notebook walks you through basic Spark transformations and actions. We will also process a large Wikipedia dump to find interesting insights about your favorite celebrities.

#### **Part 1: Basic Spark Transformations and Actions**
Refer to the Spark guide: https://spark.apache.org/docs/2.4.4/rdd-programming-guide.html

In [3]:
# create a sample list of the integers 1 through 9,999,999
my_list = list(range(1, 10000000))

# parallelize the data
rdd_0 = sc.parallelize(my_list)

# add value 4 to each number
rdd_1 = rdd_0.map(lambda x : x + 4)

# return the first 10 elements to the driver (the notebook displays them)
rdd_1.take(10)

# What will the above line print?
# What happens to rdd_0?
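
One way to reason about the two questions above: `map` is a transformation, so it returns a new RDD and leaves `rdd_0` unchanged, and nothing is computed until an action such as `take` runs. The sketch below mimics that behaviour in plain Python with a lazy `map` object. It is an analogy that runs without Spark, not Spark code, and the names `source`, `plus_four`, and `first_ten` are illustrative only.

```python
from itertools import islice

# Plain-Python stand-in for rdd_0 (assumption: a range plays the role of the RDD)
source = range(1, 10000000)

# Like a Spark transformation, map() is lazy: no element is touched yet
plus_four = map(lambda x: x + 4, source)

# Like take(10), islice pulls only the first ten results
first_ten = list(islice(plus_four, 10))
print(first_ten)   # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

# The source is unchanged, just as rdd_0 is after rdd_0.map(...)
print(source[0], source[-1])   # 1 9999999
```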

In [4]:
# Make tuples
rdd_2 = rdd_0.map(lambda x: (x, x+1))

rdd_2.take(10)

In [5]:
# Emit x - 0.5, x and x + 0.5 for each x, flattened into a single RDD
rdd_3 = rdd_0.flatMap(lambda x: (x-0.5, x, x+0.5))

rdd_3.take(10)
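
The difference between `map` and `flatMap` is easiest to see side by side: `map` yields one output element per input (here a tuple), while `flatMap` flattens each returned tuple into the output. The following plain-Python sketch imitates both with `itertools`; the helper name `neighbours` is hypothetical and the code does not use Spark.

```python
from itertools import chain, islice

numbers = range(1, 10000000)

def neighbours(x):
    # Same function as the lambda passed to flatMap above
    return (x - 0.5, x, x + 0.5)

# map-like: one output element (a tuple) per input element
mapped = map(neighbours, numbers)
print(list(islice(mapped, 2)))
# [(0.5, 1, 1.5), (1.5, 2, 2.5)]

# flatMap-like: each tuple is flattened into the output stream
flat = chain.from_iterable(map(neighbours, numbers))
first_ten_flat = list(islice(flat, 10))
print(first_ten_flat)
# [0.5, 1, 1.5, 1.5, 2, 2.5, 2.5, 3, 3.5, 3.5]
```

Note that values such as 1.5 and 2.5 appear twice, because consecutive integers share a neighbour.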

In [6]:
# Remove the duplicates
rdd_4 = rdd_3.distinct().cache()

rdd_4.take(10)

In [7]:
# Sort the values; sortByKey needs (key, value) pairs, so wrap each value and unwrap it afterwards
rdd_5 = rdd_4.map(lambda x: (x, 1)).sortByKey().map(lambda x: x[0])

rdd_5.take(10)
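
`distinct()` generally involves a shuffle, so the elements returned by `rdd_4.take(10)` should not be expected to come back in sorted order, which is why the sort step is a separate cell. A small plain-Python analogy, using a hypothetical sample list rather than the real RDD:

```python
# Hypothetical sample with duplicates, standing in for rdd_3
values = [1.5, 0.5, 1, 1.5, 2, 2.5, 2.5, 3]

# Like distinct(): duplicates are removed, but no order is promised
unique_values = set(values)

# Like the sortByKey step: impose an explicit order afterwards
sorted_values = sorted(unique_values)
print(sorted_values)   # [0.5, 1, 1.5, 2, 2.5, 3]
```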

In [8]:
# Get all the duplicates using groupByKey
rdd_6 = rdd_3.map(lambda x: (x, 1)).groupByKey().filter(lambda x: len(x[1]) > 1)
rdd_6.take(10)

In [9]:
# Get all the duplicates using reduceByKey
rdd_7 = rdd_3.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).filter(lambda x: x[1] > 1)
rdd_7.take(10)
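
The two cells above find the same duplicate keys in different ways. `groupByKey` collects every value for a key and then measures the group, whereas `reduceByKey` folds the values into a running sum and can combine them within each partition before the shuffle, which usually moves less data. In Spark the results also differ in shape: `rdd_6` holds (key, iterable) pairs and `rdd_7` holds (key, count) pairs. The sketch below reproduces both strategies in plain Python on a hypothetical sample. It illustrates the logic only, not Spark's distributed execution.

```python
from collections import defaultdict

# Hypothetical sample standing in for rdd_3
values = [0.5, 1, 1.5, 1.5, 2, 2.5, 2.5, 3, 3.5]
pairs = [(v, 1) for v in values]

# groupByKey-like: keep every value for a key, then inspect the group size
groups = defaultdict(list)
for key, one in pairs:
    groups[key].append(one)
dups_by_group = sorted(k for k, ones in groups.items() if len(ones) > 1)

# reduceByKey-like: fold the values into a running sum per key
counts = defaultdict(int)
for key, one in pairs:
    counts[key] += one
dups_by_reduce = sorted(k for k, n in counts.items() if n > 1)

print(dups_by_group)    # [1.5, 2.5]
print(dups_by_reduce)   # [1.5, 2.5]
```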

#### **Part 2: Exploratory Data Analysis on a Wikipedia Dump**

In [11]:
# Load the Wikipedia dataset
wikiRDD = sc.wholeTextFiles("/FileStore/tables/text/*/*")

wikiRDD.take(10)

# What's the difference between textFile and wholeTextFiles?
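
As a hint for the question above: `wholeTextFiles` returns one (path, content) pair per file, while `textFile` returns one record per line and drops the file boundaries. Keeping each file whole matters here because a single page spans many lines. The sketch below imitates both behaviours in plain Python on two temporary files. The file names and contents are hypothetical and no Spark call is involved.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical files standing in for the dump shards
    for name, text in [("wiki_00", "line a\nline b\n"), ("wiki_01", "line c\n")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)

    paths = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))

    # wholeTextFiles-like: one (path, full content) pair per file
    whole = []
    for p in paths:
        with open(p) as f:
            whole.append((p, f.read()))

    # textFile-like: one record per line, file boundaries are lost
    lines = []
    for p in paths:
        with open(p) as f:
            lines.extend(f.read().splitlines())

print(len(whole))   # 2
print(lines)        # ['line a', 'line b', 'line c']
```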

In [12]:
# Get all the pages
pages = wikiRDD.flatMap(lambda x: x[1].split('</doc>')).cache()

pages.count()

Extract the title and content of each article and return an RDD containing tuples of (title, actual content).

In [14]:
def get_title_and_content(content):
    """
    Generate a (title, actual content) tuple from the Wiki dump text of one page.
    Returns ('', '') when the page is empty or has fewer than three lines.
    """
    # Remove any leading or trailing whitespace
    content = content.strip()
    title, actual_content = '', ''
    try:
        if content:
            # Split into at most three parts: first line, second line, remainder
            arr = content.split("\n", 2)
            # Second line is the title
            title = arr[1]
            # Rest is the actual content
            actual_content = arr[2]
    except IndexError:
        # Fewer than three lines: treat the page as empty
        title, actual_content = '', ''
    return (title, actual_content)
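
To see why the function reads the title from index 1, it helps to run the same `split("\n", 2)` on one page. The sample below is hypothetical and assumes the layout the function implies: a `<doc ...>` tag on the first line, the title on the second, and the body after that. The URL and id are placeholders.

```python
# Hypothetical page in the assumed dump layout: <doc ...> tag, title, then body
sample_page = (
    '<doc id="1" url="https://example.org/1" title="Alan Turing">\n'
    'Alan Turing\n'
    '\n'
    'Alan Mathison Turing (23 June 1912 - 7 June 1954) was an English mathematician.\n'
    'Second line of the body.\n'
)

# maxsplit=2 yields at most three parts: tag line, title line, everything else
header, title, body = sample_page.strip().split("\n", 2)
print(title)                   # Alan Turing
print(body.startswith("\n"))   # True, the blank line after the title is kept

# A fragment with fewer than three lines gives a shorter list, so indexing
# position 1 or 2 would raise IndexError
fragment_parts = "only one line".split("\n", 2)
print(len(fragment_parts))     # 1
```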

In [15]:
# Extract the title and actual content
pages = pages.map(get_title_and_content)

pages.count()

Keep only the articles about people (celebrities) in the dataset

In [17]:
import re

def check_if_person(content):
    """
    Heuristically check whether a page is about a person (celebrity) by
    looking for a date, such as a birth date, near the start of the text.
    """
    # Check only the first 150 characters
    content = content[:150]
    # Format like 12 August 1993
    list1 = re.findall(r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}", content)
    # Format like August 12, 1993
    list2 = re.findall(r"[ADFJMNOS]\w* [\d]{1,2}[,] [\d]{4}", content)
    return len(list1) > 0 or len(list2) > 0
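
The date check is a heuristic, and trying the two patterns on a few strings shows both what it catches and where it overcounts. The sketch below copies the two regular expressions from `check_if_person` into a standalone helper. The helper name `looks_like_person` and all sample sentences are hypothetical.

```python
import re

# The two patterns used by check_if_person
DAY_FIRST = r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}"
MONTH_FIRST = r"[ADFJMNOS]\w* [\d]{1,2}[,] [\d]{4}"

def looks_like_person(text):
    # Hypothetical standalone copy of the same check, limited to 150 characters
    head = text[:150]
    return bool(re.search(DAY_FIRST, head) or re.search(MONTH_FIRST, head))

samples = {
    "person, day first": "Jane Doe (born 12 August 1993) is a fictional singer.",
    "person, month first": "John Roe (born August 12, 1993) is a fictional actor.",
    "no date": "Pittsburgh is a city in Pennsylvania.",
    "event with a date": "The fictional treaty was signed on 3 March 1901.",
}
results = {label: looks_like_person(text) for label, text in samples.items()}
for label, flag in results.items():
    print(label, flag)
# person, day first True
# person, month first True
# no date False
# event with a date True
```

The last sample is a false positive: any article that opens with a date in one of these formats passes the filter, so the resulting counts are an upper-bound style estimate rather than an exact number of people.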



In [18]:
# Get all the celebrities pages
celebrities = pages.filter(lambda x : check_if_person(x[1]))
celebrities.count()

Count the celebrities whose articles mention Carnegie Mellon

In [22]:
# Count the celebrities whose articles mention "Carnegie Mellon"
celebrities.filter(lambda x: "Carnegie Mellon" in x[1]).map(lambda x: x[0]).count()

Get the names of the celebrities who are linked with Carnegie Mellon and have won a Turing Award

In [27]:
# Get the names of the celebrities whose articles mention both "Carnegie Mellon" and "Turing" (a rough proxy for having won a Turing Award)
celebrities.filter(lambda x: "Carnegie Mellon" in x[1] and "Turing" in x[1]).map(lambda x: x[0]).collect()

In [28]:
# Get the names of the celebrities whose articles mention both "Stanford" and "Turing" (a rough proxy for having won a Turing Award)
celebrities.filter(lambda x: "Stanford" in x[1] and "Turing" in x[1]).map(lambda x: x[0]).collect()