# task: create a user-user network from the raw `ratings` dataset

* load ratings file (code provided)
* generate a set of edges of the form `[(u,v,w)]`, where each edge records that users `u` and `v` have both rated the same `w` movies (i.e. if they have both rated movie1 and movie2, the edge weight is 2)
 * you should have 164054 edges
* save the set of edges to a file as indicated below. **Make sure you name the file using your `<id>` as indicated below, to avoid conflicts with other students within the file workspace**
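
To make the edge definition concrete, here is a minimal pure-Python sketch on a made-up toy dataset (the `toy_ratings` values and all variable names are hypothetical and are not taken from the ratings file). It is only meant to illustrate what the weight `w` counts, not to replace the Spark solution.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ratings as (userId, movieId) pairs; not from the real dataset.
toy_ratings = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 30)]

# Group the users by the movie they rated.
users_by_movie = {}
for user, movie in toy_ratings:
    users_by_movie.setdefault(movie, set()).add(user)

# Every pair of users sharing a movie adds 1 to that pair's weight.
pair_counts = Counter()
for users in users_by_movie.values():
    pair_counts.update(combinations(sorted(users), 2))

toy_edges = sorted((u, v, w) for (u, v), w in pair_counts.items())
print(toy_edges)  # [(1, 2, 2), (1, 3, 1), (2, 3, 1)]
```

Users 1 and 2 share movies 10 and 20, so their edge has weight 2; movie 30 has a single rater and produces no edge.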

In [2]:
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import *
from pyspark.sql import Row
from operator import add
from collections import Counter

In [3]:
# 1. Load ratings file (code provided)
# IN
RATINGS_SMALL_PARQUET = "/FileStore/tables/ratings-small.parquet"

# OUT
EDGES_SMALL_PARQUET = "/FileStore/tables/180571930/edges-small.parquet"

In [4]:
ratings = spark.read.parquet(RATINGS_SMALL_PARQUET)
display(ratings)

In [5]:
ratings.count()

### Count the number of distinct users. This is the number of unique `userId` values that you will have in your adjacency list:

In [7]:
ratings.agg(f.countDistinct('userId')).show()

In [8]:
ratings.agg(f.countDistinct('movieId')).show()

# Add your code here.

You need to construct an RDD consisting of edges of the form:

`    [(source_node, target_node, weight)]`
    
    for example: `[(1,6,1), (1,8,1),(2,3,1), (2,4,1)]`
    
Please name the RDD with edges `weightedEdges`.

In [10]:
# 2. Generate a set of edges of the form [(u,v,w)]

In [11]:
# 2.1 Drop two attributes - rating, timestamp
ratings = ratings.drop('rating','timestamp')
ratings.take(10)

In [12]:
# 2.2 Convert dataframe to rdd
r = ratings.rdd
# type(r)
r.take(10)

In [13]:

# 2.3 Swap - (movieId, userId)
movie_user = r.map(lambda x: (x[1], x[0]))
movie_user.collect()

In [14]:
# 2.4 Group by key: movie, {user1, user5, ...}
output = movie_user.groupByKey().mapValues(list)
# output.count() # 9724
output.collect()

In [15]:
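# 2.5 Combine every pair of users who rated the same movieId
import itertools

def combinations(row):
  k = row[0]
  # Sort the users first so a pair is always emitted as (smaller, larger);
  # groupByKey does not guarantee the order of the grouped values.
  return [(k, v) for v in itertools.combinations(sorted(row[1]), 2)]

b = output.flatMap(combinations)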
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4)), ...]

# 2.6 Calculate the count
c = b.map(lambda x: x[1]).countByValue()
# [(1, 2), (1, 4), (2, 4), (3, 4), (3, 5), (4, 5), (1, 2), (1, 3), (2, 3)]
# defaultdict(<class 'int'>, {(1, 2): 2, (1, 3): 1, (2, 3): 1, (4, 5): 1, (3, 4): 1, (2, 4): 1, (1, 4): 1, (3, 5): 1})

# 2.7 Turn the counts back into an RDD of ((u, v), w) items
d = sc.parallelize(list(c.items()))
# [((1, 2), 2), ((1, 3), 1),((2, 3), 1),...]

d.count() # 164054
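
One thing worth checking in step 2.5 is the order of the users inside each pair. As far as I know, Spark does not guarantee the order of the values returned by `groupByKey`, so the same two users could show up as `(u, v)` for one movie and `(v, u)` for another and be counted as two different edges. The sketch below reproduces this in plain Python on a hypothetical `grouped` list (not real data) and shows that sorting each user list before pairing avoids it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grouped output: the same two users in a different order per movie.
grouped = [(10, [1, 2]), (20, [2, 1])]

# Pairing the users as-is treats (1, 2) and (2, 1) as different keys.
unsorted_counts = Counter(
    pair for _, users in grouped for pair in combinations(users, 2)
)

# Sorting first gives every pair one canonical form.
sorted_counts = Counter(
    pair for _, users in grouped for pair in combinations(sorted(users), 2)
)

print(dict(unsorted_counts))  # {(1, 2): 1, (2, 1): 1}
print(dict(sorted_counts))    # {(1, 2): 2}
```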

In [16]:
d.take(10) # display the result

In [17]:
# 2.8 Flatten each item to (u, v, w) and sort by source_node, then target_node
e = d.map(lambda x: (x[0][0], x[0][1], x[1])).sortBy(lambda x: (x[0], x[1]))  # [(266, 351, 6),...]

In [18]:
e.collect()

In [19]:
# 2.9 Convert the RDD to a DataFrame
weightedEdges = sqlContext.createDataFrame(e, ['source_node', 'target_node', 'weight'])
display(weightedEdges)

In [20]:
# 3. Save the set of edges to a file as indicated below. 

In [21]:
# The column is already named 'weight', so no rename is needed before writing
weightedEdges.write.parquet(EDGES_SMALL_PARQUET, mode="overwrite")

In [22]:
r = spark.read.parquet(EDGES_SMALL_PARQUET)
display(r)

In [23]:
r.count()
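
Beyond comparing the row count with the expected number of edges, a few structural checks on the edges can catch common mistakes. The helper below is a hypothetical sketch (`check_edges` is not part of the assignment or of Spark); it works on plain `(u, v, w)` tuples and flags self-loops, non-positive weights, and pairs that appear more than once in either order.

```python
def check_edges(edges):
    """Return a list of problems found in an iterable of (u, v, w) tuples."""
    problems = []
    seen = set()
    for u, v, w in edges:
        if u == v:
            problems.append(f"self-loop: {(u, v, w)}")
        if w < 1:
            problems.append(f"non-positive weight: {(u, v, w)}")
        # Treat (u, v) and (v, u) as the same undirected pair.
        key = (min(u, v), max(u, v))
        if key in seen:
            problems.append(f"duplicate pair: {key}")
        seen.add(key)
    return problems

# Hypothetical sample: the last tuple repeats the pair (1, 2) in reverse order.
sample = [(1, 2, 2), (1, 3, 1), (2, 1, 1)]
print(check_edges(sample))  # ['duplicate pair: (1, 2)']
```

In the notebook this could be applied to a collected sample of rows from the saved file, assuming each row unpacks into the three columns in order.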