#### Common warnings:

1. __Back up your solution to the 'work' directory inside the home directory ('/home/jovyan'). It is the only directory whose state is saved across sessions.__

1. Please make sure that you call the right interpreter (python2 or python3). Do not write just "python" without the major version: there is no guarantee that any particular version of Python is the default in the grading system.

1. One cell must contain only one programming language.
E.g. if a cell contains Python code and you also want to call a bash command (using “!”) in it, you should move the bash command to another cell.

1. Our IPython converter is an improved version of the standard converter nbconvert, and it can process most of Jupyter's magic commands correctly (e.g. it understands "%%bash" and executes the cell as a bash script). However, we highly recommend avoiding magics wherever possible.
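
To illustrate the interpreter warning above, here is a minimal sketch that reports which major Python version is actually executing a cell. The helper name `interpreter_label` is hypothetical and is not part of the grading system.

```python
import sys

def interpreter_label(version_info=sys.version_info):
    """Return 'python2' or 'python3' depending on the major version."""
    # version_info[0] is the major version number of the interpreter
    return "python%d" % version_info[0]

# Prints the label for the interpreter running this cell
print(interpreter_label())
```

Running this once at the top of a notebook is a cheap way to confirm that the kernel matches the interpreter you intend to submit with.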

#### Spark specific warnings:

1. It is good practice to run Spark with master "yarn". However, the performance of the containerized system is limited. If you see repeated Py4JJavaError or Py4JNetworkError exceptions that you believe are unrelated to your code, feel free to change the master to “local”.

1. You should eliminate extra symbols in the output (such as quotes, brackets, etc.). When you finally get the resulting dataframe, it is easier to print wiki.take(1) than to traverse the RDD with a for loop. But in that case a lot of junk symbols will be printed, like: `[['Anarchism', 'is', .. ]]`. See the correct output example in the task.
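
The output warning above can be sketched in plain Python, with no Spark involved. The list `rows` is a hypothetical stand-in for what `wiki.take(1)` returns, and both the helper `format_row` and the space separator are assumptions: use whatever format the task's example output shows.

```python
def format_row(words, sep=" "):
    """Join the items of one row into a single string without quotes or brackets."""
    return sep.join(str(w) for w in words)

# Hypothetical stand-in for wiki.take(1): a list holding one list of words
rows = [["Anarchism", "is"]]

print(rows)                    # [['Anarchism', 'is']]  (junk symbols included)
for words in rows:
    print(format_row(words))   # Anarchism is
```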

#### Task hint
Each of these tasks is a continuation of the previous one, so you may use the same IPython notebook for all the programming assignments this week.

In [43]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import Window

spark_session = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()

In [44]:
data = spark_session.read.parquet("/data/sample264")
meta = spark_session.read.parquet("/data/meta")

In [45]:
data.show(10)

+------+-------+--------+----------+
|userId|trackId|artistId| timestamp|
+------+-------+--------+----------+
| 13065| 944906|  978428|1501588527|
|101897| 799685|  989262|1501555608|
|215049| 871513|  988199|1501604269|
|309769| 857670|  987809|1501540265|
|397833| 903510|  994595|1501597615|
|501769| 818149|  994975|1501577955|
|601353| 958990|  973098|1501602467|
|710921| 916226|  972031|1501611582|
|  6743| 801006|  994339|1501584964|
|152407| 913509|  994334|1501571055|
+------+-------+--------+----------+
only showing top 10 rows



In [49]:
user_artist = data.groupBy('userId', 'artistId').count()

In [50]:
user_artist.show()

+------+--------+-----+
|userId|artistId|count|
+------+--------+-----+
|484714| 1000564|    2|
|685378|  974357|    8|
|531701|  969480|    5|
|341232|  977291|    1|
|554281|  985827|    1|
|395708|  975337|    1|
|646244| 1001300|    2|
|108592|  991179|    1|
|245658|  997265|    3|
|485786|  993060|   20|
|277044| 1000564|    7|
|  1533|  995917|    3|
| 47200| 1000564|    1|
|612770|  986570|    1|
|  1750|  985681|    1|
|731637|  979654|    1|
|481421|  975605|    4|
|490000|  969480|    1|
|718583|  982782|    1|
|790358|  997721|    1|
+------+--------+-----+
only showing top 20 rows
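
The grouping in cell In [49] counts how many rows each (userId, artistId) pair has. As a plain-Python stand-in (no Spark involved), the same idea can be sketched with `collections.Counter`. The events below are hypothetical and do not come from `/data/sample264`, and `count_user_artist` is an illustrative name.

```python
from collections import Counter

# Hypothetical (userId, trackId, artistId) listening events
events = [
    (1, 10, 100),
    (1, 11, 100),
    (1, 12, 200),
    (2, 10, 100),
]

def count_user_artist(events):
    """Count events per (userId, artistId) pair, ignoring the track."""
    return Counter((user_id, artist_id) for user_id, _track_id, artist_id in events)

counts = count_user_artist(events)
for (user_id, artist_id), count in sorted(counts.items()):
    print("%s %s %s" % (user_id, artist_id, count))
```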



#### Normalization can be done with the following function

In [51]:
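from pyspark.sql import Window
from pyspark.sql.functions import row_number, sum, col  # note: this sum shadows Python's built-in sum

# Keep the n heaviest edges per key1 and normalize their weights
# (divide the weight of each edge by the sum of the weights of the kept edges).
# key2 is not used in the computation.
def norm(df, key1, key2, field, n):

    # Number the rows within each key1 group by weight, heaviest first
    window = Window.partitionBy(key1).orderBy(col(field).desc())

    # Keep only the top n rows of each group
    topsDF = df.withColumn("row_number", row_number().over(window)) \
        .filter(col("row_number") <= n) \
        .drop(col("row_number"))

    # Sum of the kept weights per key1 (groupBy already keeps the key1 column)
    tmpDF = topsDF.groupBy(col(key1)).agg(sum(col(field)).alias("sum_" + field))

    # Attach the per-group sum to every row and divide
    normalizedDF = topsDF.join(tmpDF, key1, "inner") \
        .withColumn("norm_" + field, col(field) / col("sum_" + field)) \
        .cache()

    return normalizedDF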
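
To make the arithmetic of `norm` concrete, here is a minimal plain-Python sketch of the same two steps: keep the top `n` weights per key, then divide each kept weight by the sum of the kept weights. It is not the Spark implementation. The function name `norm_rows` and the sample counts are hypothetical, and ties at the cut-off may be resolved differently than by `row_number()`.

```python
from collections import defaultdict
from math import fsum

def norm_rows(rows, n):
    """rows: iterable of (key, item, weight) tuples.

    Keeps the n largest weights per key and returns (key, item, share) tuples,
    where share = weight / sum of the kept weights for that key.
    """
    by_key = defaultdict(list)
    for key, item, weight in rows:
        by_key[key].append((item, weight))

    result = []
    for key, items in by_key.items():
        # Heaviest first, then cut to the top n
        top = sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        total = fsum(weight for _item, weight in top)
        result.extend((key, item, weight / total) for item, weight in top)
    return result

# Hypothetical counts for one user: 3, 2 and 1 plays of three artists
sample = [("u1", "a", 3), ("u1", "b", 2), ("u1", "c", 1)]
normalized = norm_rows(sample, 100)
for row in normalized:
    print(row)
```

With all three rows kept, the shares are 3/6, 2/6 and 1/6. With `n = 2`, only the two heaviest rows are kept and the shares become 3/5 and 2/5.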

In [52]:
user_artist_norm = norm(user_artist, 'userId', 'artistId', 'count', 100) \
        .select('userId', 'artistId', 'norm_count')

In [53]:
user_artist_norm.take(50)

[Row(userId=3175, artistId=981306, norm_count=0.2222222222222222),
 Row(userId=3175, artistId=995274, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=986492, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=976051, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=1000709, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=984798, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=969751, norm_count=0.1111111111111111),
 Row(userId=3175, artistId=1000564, norm_count=0.1111111111111111),
 Row(userId=5518, artistId=978963, norm_count=0.5),
 Row(userId=5518, artistId=984128, norm_count=0.3333333333333333),
 Row(userId=5518, artistId=969429, norm_count=0.16666666666666666),
 Row(userId=5803, artistId=982335, norm_count=1.0),
 Row(userId=6654, artistId=1002715, norm_count=0.2),
 Row(userId=6654, artistId=985758, norm_count=0.2),
 Row(userId=6654, artistId=987351, norm_count=0.2),
 Row(userId=6654, artistId=987809, norm_count=0.2),
 Row(userId=66

In [58]:
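# Rank all pairs globally by norm_count, largest first
# (no partitionBy, so Spark evaluates this window in a single partition)
window = Window.orderBy(f.col('norm_count').desc())

# Keep positions below 40, sort by user and artist, and collect 40 rows
user_ArtistList = user_artist_norm.withColumn('position', f.rank().over(window)) \
    .filter(f.col('position') < 40) \
    .orderBy('userId', 'artistId') \
    .select('userId', 'artistId') \
    .take(40)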
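
The cell above uses `rank()`, which gives tied rows the same position. The plain-Python sketch below (no Spark; `rank_desc` and the scores are hypothetical) shows the SQL-style ranking rule. With many tied `norm_count` values, a filter such as `position < 40` can therefore keep more than 39 rows, which is presumably why the cell sorts the result and calls `take(40)` afterwards.

```python
def rank_desc(values):
    """SQL-style RANK() over values sorted in descending order.

    Tied values share a rank, and the rank after a tie skips ahead
    by the number of tied rows.
    """
    ordered = sorted(values, reverse=True)
    ranks = []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            ranks.append(ranks[-1])   # same value as the previous row: same rank
        else:
            ranks.append(i + 1)       # rank is the 1-based position in the ordering
    return list(zip(ordered, ranks))

scores = [1.0, 1.0, 1.0, 0.5, 0.5, 0.2]
print(rank_desc(scores))
# [(1.0, 1), (1.0, 1), (1.0, 1), (0.5, 4), (0.5, 4), (0.2, 6)]
```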

In [59]:
for val in user_ArtistList:
    print("%s %s" % val)

66 993426
116 974937
128 1003021
131 983068
195 997265
215 991696
235 990642
288 1000564
300 1003362
321 986172
328 967986
333 1000416
346 982037
356 974846
374 1003167
428 993161
431 969340
445 970387
488 970525
542 969751
612 987351
617 970240
649 973851
658 973232
662 975279
698 995788
708 968848
746 972032
747 972032
776 997265
784 969853
806 995126
811 996436
837 989262
901 988199
923 977066
934 990860
957 991171
989 975339
999 968823
