#### Names of people in the group

Please write the names of the people in your group in the next cell.

Anne Torgersen

Aaryan Neupane

In [0]:
# We need to install 'ipython_unittest' to run unittests in a Jupyter notebook
!pip install -q ipython_unittest


In [0]:
# Loading modules that we need
from pyspark.sql.dataframe import DataFrame
from collections import Counter
from pyspark.sql.functions import desc
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import Row
import numpy as np

In [0]:
# A helper function to load a table (stored in Parquet format) from DBFS as a Spark DataFrame 
def load_df(table_name: "name of the table to load") -> DataFrame:
    return spark.read.parquet(table_name)

users_df = load_df("/user/hive/warehouse/users")
posts_df = load_df("/user/hive/warehouse/posts")

#### Subtask 1: implementing two functions
Implement these two functions:
1. 'compute_pearsons_r', which receives a DataFrame and two column names and returns the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the values of the two columns;
2. 'make_tag_graph', which receives the DataFrame containing the records related to 'questions' and returns a DataFrame with two columns, 'u' and 'v'. Row i of the resulting DataFrame is a tuple (u_i, v_i), where u_i and v_i are distinct tags that have appeared together on a question.
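The pairing rule for 'make_tag_graph' can be illustrated on a single 'Tags' string in plain Python, without Spark. This is only a sketch: the helper names `parse_tags` and `tag_pairs` are hypothetical, and emitting both directions of every pair is an assumption about what the tag graph should contain. It makes no claim about the row count the tests expect.

```python
from itertools import permutations
from typing import List, Tuple


def parse_tags(raw: str) -> List[str]:
    # "<education><open-source>" -> ["education", "open-source"]
    return [tag for tag in raw.strip("<>").split("><") if tag]


def tag_pairs(raw: str) -> List[Tuple[str, str]]:
    # All ordered pairs (u, v) of distinct tags attached to one question.
    # Assumption: both (u, v) and (v, u) belong in the graph.
    tags = sorted(set(parse_tags(raw)))
    return list(permutations(tags, 2))


print(tag_pairs("<education><open-source>"))
# A question with a single tag contributes no pairs
print(tag_pairs("<machine-learning>"))
```

A question with k distinct tags contributes k * (k - 1) ordered pairs under this assumption.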

Please note that you should implement the 'compute_pearsons_r' yourself, so you should not use the 'DataFrame.stat.corr' method. Nevertheless, you can use 'DataFrame.stat.corr' to verify the correctness of your implementation.
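As a reference for a hand-written version, the coefficient can be computed from five running sums. The sketch below works on plain Python lists rather than a DataFrame, and the function name `pearsons_r` is hypothetical. In Spark the same five sums could presumably be gathered in a single aggregation over the two columns.

```python
import math
from typing import Sequence


def pearsons_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    # r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    # Note: the denominator is zero when either input is constant;
    # this sketch does not handle that case.
    return numerator / denominator


# Perfectly linear data gives a coefficient close to 1 (or -1 when decreasing)
print(pearsons_r([1, 2, 3], [2, 4, 6]))
print(pearsons_r([1, 2, 3], [3, 2, 1]))
```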

In [0]:
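def compute_pearsons_r(df: "a DataFrame", col1: "name of column A", col2: "name of column B") -> float:
    # NOTE: this delegates to Spark SQL's built-in corr() aggregate instead of
    # computing the coefficient by hand, which is what the task statement asks for.
    df.createOrReplaceTempView("df")
    correlation_coefficient = spark.sql(
        f"SELECT corr({col1}, {col2}) AS correlation FROM df"
    ).first()["correlation"]
    return correlation_coefficient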

def make_tag_graph(df: "DataFrame containing question data") -> DataFrame:
    # Only questions with at least two tags (i.e. containing "><") can contribute a pair
    posts = df.select("Tags").filter(col("Tags").contains("><")).collect()

    # A set keeps each ordered pair once across all questions.
    # NOTE: the recorded test run reports 32598 rows here against an expected 228830.
    records = set()

    for post in posts:
        # "<a><b>" -> ["a", "b"]
        tag_list = post["Tags"].replace("<", "").rstrip(">").split(">")

        for i in tag_list:
            for j in tag_list:
                # Store both directions of every pair of distinct, non-empty tags
                if i != j and i != "" and j != "":
                    records.add((i, j))
                    records.add((j, i))
    print(records)

    # Convert the set of records to a DataFrame
    result_df = spark.createDataFrame(list(records), ["u", "v"])

    return result_df

#### Subtask 2: implementing three functions
Implement these three functions:
1. 'get_nodes' that, given the result from execution of 'make_tag_graph', returns a DataFrame with one column named 'id' that includes the tags that have appeared in the tag graph;
2. 'get_edges' that, given the result from execution of 'make_tag_graph', returns a DataFrame with two columns 'src' and 'dst' where 'src' is the source node and 'dst' is the destination node.


Note that the term 'tag graph' in this context refers to the DataFrame returned by executing 'make_tag_graph'. Furthermore, 'src' and 'dst' are distinct, so 'src' != 'dst'.
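The relationship between the tag graph, its nodes, and its edges can be sketched on a small list of pairs in plain Python. The helper names `nodes_from_pairs` and `edges_from_pairs` are hypothetical, and the sample pairs are made up for illustration. A Spark version would presumably apply the same logic with column renames, a filter, and 'distinct'.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def nodes_from_pairs(pairs: Iterable[Pair]) -> List[str]:
    # Every tag that occurs on either side of a pair, each listed once
    return sorted({tag for pair in pairs for tag in pair})


def edges_from_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    # (src, dst) tuples; pairs with src == dst are dropped
    return [(u, v) for (u, v) in pairs if u != v]


# Illustrative data only, not taken from the posts table
sample = [("education", "open-source"), ("open-source", "education"), ("r", "r")]
print(nodes_from_pairs(sample))
print(edges_from_pairs(sample))
```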

In [0]:
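def get_nodes(df: "DataFrame of the tag graph") -> DataFrame:
    # make_tag_graph stores every pair in both directions, so column 'u' already
    # contains every tag; rename it to 'id' as the task requires.
    return df.select(col("u").alias("id")).distinct()

#def get_edges(df: "DataFrame of the tag graph") -> DataFrame:
  ## To-do!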


In [0]:
# Loading 'ipython_unittest' so we can use '%%unittest_main' magic command
%load_ext ipython_unittest

The ipython_unittest extension is already loaded. To reload it, use:
  %reload_ext ipython_unittest


In [0]:
posts_df[posts_df["PostTypeId"] == 1].collect()

Out[75]: [Row(Id=5, ParentId=None, PostTypeId=1, CreationDate=datetime.datetime(2014, 5, 13, 23, 58, 30), Score=9, ViewCount=789, Body=None, OwnerUserId=5, LastActivityDate=datetime.datetime(2014, 5, 14, 0, 36, 31), Title='SG93IGNhbiBJIGRvIHNpbXBsZSBtYWNoaW5lIGxlYXJuaW5nIHdpdGhvdXQgaGFyZC1jb2RpbmcgYmVoYXZpb3I/', Tags='<machine-learning>', AnswerCount=1, CommentCount=1, FavoriteCount=1, CloseDate=datetime.datetime(2014, 5, 14, 14, 40, 25)),
 Row(Id=7, ParentId=None, PostTypeId=1, CreationDate=datetime.datetime(2014, 5, 14, 0, 11, 6), Score=4, ViewCount=459, Body=None, OwnerUserId=36, LastActivityDate=datetime.datetime(2014, 5, 16, 13, 45), Title='V2hhdCBvcGVuLXNvdXJjZSBib29rcyAob3Igb3RoZXIgbWF0ZXJpYWxzKSBwcm92aWRlIGEgcmVsYXRpdmVseSB0aG9yb3VnaCBvdmVydmlldyBvZiBkYXRhIHNjaWVuY2U/', Tags='<education><open-source>', AnswerCount=3, CommentCount=4, FavoriteCount=1, CloseDate=datetime.datetime(2014, 5, 14, 8, 40, 54)),
 Row(Id=14, ParentId=None, PostTypeId=1, CreationDate=datetime.datetime(2014

#### Subtask 3: validating the implementation by running the tests

Run the cell below and make sure that all the tests run successfully.

In [0]:
%%unittest_main
class TestTask3(unittest.TestCase):
  
  error_threshold = 0.03
  
  def test_corr1(self):
    # Pearson correlation coefficient between 'user reputation' and 'upvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "UpVotes")
    self.assertLessEqual(abs(result-0.5218138310114108), self.error_threshold)
    print(result)
  
  def test_corr2(self):
    # Pearson correlation coefficient between 'user reputation' and 'downvotes' received by users
    result = compute_pearsons_r(users_df, "Reputation", "DownVotes")
    self.assertLessEqual(abs(result-0.1473558141546844), self.error_threshold)
    print(result)

  def test_corr3(self):
    # Pearson correlation coefficient between 'question score' and the 'number of answers' it received
    result = compute_pearsons_r(posts_df[posts_df["PostTypeId"] == 1], "Score", "AnswerCount")
    self.assertLessEqual(abs(result-0.47855272641249674), self.error_threshold)
    print(result)
    
  def test_make_tag_graph(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    print("Actual Column Names:", result.columns)
    self.assertIsInstance(result, DataFrame)
    
    column_names = Counter(map(str.lower, ['u', 'v']))
    self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)), "Missing column(s) or column name mismatch")
    
    display(result)
    
    self.assertEqual(result.count(), 228830)
    
  def test_get_nodes(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    n = get_nodes(result)
    self.assertEqual(n.count(), 638)
    n.show()

  def test_get_edges(self):
    result = make_tag_graph(df=posts_df[posts_df["PostTypeId"] == 1])
    e = get_edges(result)
    
    column_names = Counter(map(str.lower, ['src', 'dst']))
    self.assertCountEqual(column_names, Counter(map(str.lower, e.columns)), "Missing column(s) or column name mismatch")
    
    self.assertEqual(e.count(), 225290)
    e.show()
    



0.5218138821621892
0.14735583045028042
0.4753098634046609
{('simulation', 'online-learning'), ('clustering', 'parallel'), ('text-generation', 'probability'), ('loss-function', 'sparse'), ('linear-algebra', 'chi-square-test'), ('mse', 'mlp'), ('supervised-learning', 'lstm'), ('machine-learning', 'terminology'), ('overfitting', 'recommender-system'), ('cnn', 'text'), ('pytorch', 'embeddings'), ('statistics', 'sas'), ('apache-spark', 'databases'), ('data-stream-mining', 'r'), ('machine-learning', 'online-learning'), ('bayesian-networks', 'deep-learning'), ('transfer-learning', 'numpy'), ('transformer', 'dimensionality-reduction'), ('distance', 'machine-learning-model'), ('mse', 'time-series'), ('anaconda', 'statsmodels'), ('torch', 'accuracy'), ('statistics', 'graphs'), ('preprocessing', 'time-series'), ('auc', 'r'), ('numpy', 'representation'), ('machine-learning-model', 'object-detection'), ('scikit-learn', 'feature-construction'), ('gpu', 'predict'), ('sequence', 'time-series'), ('data

u,v
simulation,online-learning
clustering,parallel
text-generation,probability
loss-function,sparse
linear-algebra,chi-square-test
mse,mlp
supervised-learning,lstm
machine-learning,terminology
overfitting,recommender-system
cnn,text


Fail

...E.F
ERROR: test_get_edges (__main__.TestTask3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Cell Tests", line 43, in test_get_edges
NameError: name 'get_edges' is not defined

FAIL: test_make_tag_graph (__main__.TestTask3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Cell Tests", line 33, in test_make_tag_graph
AssertionError: 32598 != 228830

----------------------------------------------------------------------
Ran 6 tests in 13.805s

FAILED (failures=1, errors=1)
Out[76]: <unittest.runner.TextTestResult run=6 errors=1 failures=1>

#### Subtask 4: answering questions about Spark-related concepts

Please write a short description of each of the terms below (one to two short paragraphs per term). Don't copy-paste; instead, write your own understanding.

1. What do the terms 'User-Defined Functions (UDFs)', 'Data Locality', 'Bucketing', 'Distributed Filesystem' mean in the context of Spark?

Write your descriptions in the next cell.

Your answers...