# Counting the number of mutual friends

For each user whose ID appears in the column `userId`, count the number of friends he or she has in common with every other user whose ID appears in the column `userId`.

Print the 49 pairs of users with the largest number of common friends, sorted in descending order first by the common friends count, then by the ID of user1, and finally by the ID of user2.

The format is as follows:

```
count user1 user2
```

**Example:**

```
234 54719 767867
120 54719 767866
97 50787 327676
```

The overall plan could look like this:

* Create a new column `friend` by exploding the column `friends` (as in the demo iPython notebook)
* Group the resulting dataframe by the column `friend` (as in the demo iPython notebook)
* Create a column `users` by collecting all users that share the same ID in the column `friend` (as in the demo iPython notebook)
* Sort the elements in the column `users` with the function _sort_array_
* Keep only the rows that have more than 1 element in the column `users`
* For each row, emit all possible ordered pairs of users from the column `users` (tip: write a user-defined function for this)
* Count the number of times each pair appears
* With the help of a window function (as in the demo iPython notebook), select the 49 pairs of users who have the largest number of common friends

The sample dataset is located at `/data/graphDFSample`.

Part of the result on the sample dataset:

```
...
3044 21864412 51640390
3021 17139850 51640390
3010 14985079 51640390
2970 17139850 21864412
2913 20158643 27967558
...
```

### Solution description

1. The original data has the following schema:
   * `user` - the ID of a user
   * `friends` - a list of the IDs of that user's friends
2. Reverse the original data frame to the following schema:
   * `friend` - the ID of a user
   * `users` - a list of the IDs of the users who have this friend
3. Add a column `user_size=len(users)` and keep only the rows with `user_size > 1`
4. Sort the array `users` in each row
5. Use a UDF to create all possible pairs of elements from the `users` array in each row.
   Add the result as a new column `user_pairs` (named `pairs` in the code below) with type Array(Struct(user1, user2)).
   As a result, we get the following data schema:
   ```
   <friend> [(<user1_1>, <user2_1>), (<user1_k>, <user2_j>),]
   ```
6. Explode `user_pairs` into the `mutual_friends` field.
7. Group by the `mutual_friends` column and count the rows in each group.

As a result, we get, for every pair of users, the number of friends they have in common.
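
The steps above can be sketched in plain Python as a small reference implementation, which may help when checking the Spark result on tiny inputs. The function name `count_mutual_friends` and the toy graph are assumptions made for illustration; the sketch also assumes that no friend list contains duplicate IDs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_mutual_friends(graph):
    """Count common friends per user pair; `graph` maps user -> list of friend IDs."""
    # Step 2: invert the graph, friend -> users who list this friend.
    users_by_friend = defaultdict(list)
    for user, friends in graph.items():
        for friend in friends:
            users_by_friend[friend].append(user)

    pair_counts = Counter()
    for users in users_by_friend.values():
        # Step 3: a friend listed by a single user produces no pairs.
        if len(users) < 2:
            continue
        # Steps 4-7: sort, build pairs, and add one common friend per pair.
        pair_counts.update(combinations(sorted(users), 2))
    return pair_counts


# Hypothetical toy graph: user -> friends
toy_graph = {1: [10, 11], 2: [10, 11, 12], 3: [11, 12]}
toy_counts = count_mutual_friends(toy_graph)
```

Here users 1 and 2 share friends 10 and 11, users 2 and 3 share friends 11 and 12, and users 1 and 3 share only friend 11.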

### Step 1. Connect and read data

In [None]:
import pyspark.sql.types as t

from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import explode, collect_list, size, col, desc, sort_array, udf, count


GRAPH_PATH = "/data/graphDFSample"


spark_session = SparkSession.builder.enableHiveSupport().master("local").getOrCreate()
users_relations_graph = spark_session.read.parquet(GRAPH_PATH)

### Step 2. Write a UDF to build all pairs from an array

In [None]:
def make_pairs(arr):
    """
    Returns an array with all possible pairs (as tuple)
    from the original array.
    """
    arr_len = len(arr)
    return [(arr[i], arr[j]) for i in range(arr_len) for j in range(i + 1, arr_len)]


# Simple test: the exact pairs, in order, for a sorted three-element array
assert make_pairs([1, 2, 3]) == [(1, 2), (1, 3), (2, 3)]

# Create custom type and register UDF
pair_type = t.StructType([
    t.StructField("p", t.IntegerType(), False),
    t.StructField("q", t.IntegerType(), False),
])
make_pairs = udf(make_pairs, t.ArrayType(pair_type))
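
As a side note, the nested loops in the pair-building function should be equivalent to `itertools.combinations` with `r=2`, which yields pairs in the same positional order. The sketch below uses the hypothetical names `pairs_via_combinations` and `pair_count` (neither appears in the notebook) and also shows that an array of `n` users produces `n * (n - 1) / 2` pairs, so very popular friends dominate the cost of this step.

```python
from itertools import combinations


def pairs_via_combinations(arr):
    """Return all position-ordered pairs from `arr` as a list of tuples."""
    return list(combinations(arr, 2))


def pair_count(n):
    """Number of pairs produced from an array of n elements."""
    return n * (n - 1) // 2


three_pairs = pairs_via_combinations([1, 2, 3])
ten_user_pairs = pairs_via_combinations(list(range(10)))
```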

### Step 3. Reverse graph and prepare for calculation

In [None]:
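# Invert the graph: for every friend, collect the sorted list of users who have
# that friend, and keep only the friends shared by at least two users.
reversed_graph = users_relations_graph.withColumn("friend", explode("friends")) \
                                      .groupBy("friend").agg(collect_list("user").alias("users")) \
                                      .withColumn("users", sort_array("users")) \
                                      .filter(size(col("users")) > 1)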
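
Sorting the `users` array matters because `collect_list` does not guarantee the order of the collected elements. Without sorting, the same two users could be emitted as `(a, b)` for one friend and `(b, a)` for another, and their counts would be split. A minimal plain-Python sketch of that effect, with made-up lists:

```python
from collections import Counter
from itertools import combinations

# Hypothetical collected lists: two friends shared by users 3 and 7,
# collected in a different order for each friend.
users_per_friend = [[7, 3], [3, 7]]

unsorted_counts = Counter(
    pair for users in users_per_friend for pair in combinations(users, 2)
)
sorted_counts = Counter(
    pair for users in users_per_friend for pair in combinations(sorted(users), 2)
)
```

With unsorted input the pair is counted once under each orientation; after sorting, both occurrences land on `(3, 7)`.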

### Step 4. Find mutual friends

In [None]:
mutual_friends_df = reversed_graph.withColumn("pairs", make_pairs("users")) \
                                  .withColumn("mutual_friends", explode("pairs")) \
                                  .groupBy("mutual_friends").agg(count("mutual_friends").alias("friends_count"))

### Step 5. Collect and print result

In [None]:
# Keep the 49 pairs with the most common friends (all three sort keys descending)
top_49_by_friends = mutual_friends_df.select(col("friends_count"), 
                                             col("mutual_friends.p").alias("user1"), 
                                             col("mutual_friends.q").alias("user2")) \
                                     .orderBy(desc("friends_count"), desc("user1"), desc("user2")) \
                                     .limit(49)

In [None]:
for row in top_49_by_friends.collect():
    print(row.friends_count, row.user1, row.user2, sep="\t")
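
The ordering and output format of this step can be mirrored in plain Python, which may help when comparing against expected output on a small example. The names `top_pairs` and `format_row` and the sample counts are assumptions for illustration. The sketch sorts all three keys in descending order, as the `orderBy` call above does; whether the task expects the user IDs in ascending order instead is worth checking against the reference output.

```python
def top_pairs(pair_counts, n):
    """Return up to n rows (count, user1, user2), all three keys descending."""
    rows = [(cnt, u1, u2) for (u1, u2), cnt in pair_counts.items()]
    # Tuples compare element by element, so reverse=True sorts by count,
    # then user1, then user2, each in descending order.
    return sorted(rows, reverse=True)[:n]


def format_row(row):
    """Render a row as tab-separated text: count, user1, user2."""
    return "\t".join(str(value) for value in row)


# Hypothetical counts keyed by (user1, user2)
sample_counts = {(1, 2): 2, (2, 3): 2, (1, 3): 1}
top_two = top_pairs(sample_counts, 2)
```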