#### Common warnings:

1. __Back up your solution to the 'work' directory inside the home directory ('/home/jovyan'). It is the only directory whose state is saved between sessions.__

1. Make sure you call the right interpreter (python2 or python3). Do not write just "python" without the major version: there is no guarantee that any particular version of Python is set as the default in the grading system.

1. Each cell must contain only one programming language.
E.g. if a cell contains Python code and you also want to call a bash command (using “!”) in it, move the bash command to another cell.

1. Our IPython converter is an improved version of the standard converter, nbconvert, and it processes most of Jupyter's magic commands correctly (e.g. it understands "%%bash" and executes the cell as a bash script). However, we highly recommend avoiding magics wherever possible.

#### Spark-specific warnings:

1. It is good practice to run Spark with master "yarn". However, the performance of the containerized system is limited. If you see repeated Py4JJavaError or Py4JNetworkError exceptions that you believe are unrelated to your code, feel free to change the master to “local”.

1. Eliminate extra symbols in the output (such as quotes, brackets, etc.). Once you have the resulting dataframe, it is easier to print wiki.take(1) than to traverse the RDD with a for loop, but then a lot of junk symbols are printed, e.g. `[['Anarchism', 'is', .. ]]`. See the correct output example in the task.
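
The difference between raw and cleaned output can be sketched in plain Python. The rows below are hypothetical stand-ins for what a `take()` call returns, and `format_row` is an illustrative helper, not part of the assignment; the exact format expected by the grader is the one given in each task.

```python
# Hypothetical rows, standing in for the result of a take() call.
rows = [("Anarchism", "is"), ("political", "philosophy")]

def format_row(row, sep=" "):
    """Join the fields of one row into a plain string without quotes or brackets."""
    return sep.join(str(field) for field in row)

# Printing the list directly shows quotes, brackets, and commas.
print(rows)

# Printing row by row shows only the values.
for row in rows:
    print(format_row(row))
```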

#### Task hint
Each of these tasks is a continuation of the previous one, so you may use the same IPython notebook for all the programming assignments in this week.

In [61]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import Window

spark_session = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()

In [62]:
data = spark_session.read.parquet("/data/sample264")
meta = spark_session.read.parquet("/data/meta")

In [63]:
data.show(10)

+------+-------+--------+----------+
|userId|trackId|artistId| timestamp|
+------+-------+--------+----------+
| 13065| 944906|  978428|1501588527|
|101897| 799685|  989262|1501555608|
|215049| 871513|  988199|1501604269|
|309769| 857670|  987809|1501540265|
|397833| 903510|  994595|1501597615|
|501769| 818149|  994975|1501577955|
|601353| 958990|  973098|1501602467|
|710921| 916226|  972031|1501611582|
|  6743| 801006|  994339|1501584964|
|152407| 913509|  994334|1501571055|
+------+-------+--------+----------+
only showing top 10 rows



In [64]:
artist_track = data.groupBy('artistId', 'trackId').count()

In [65]:
artist_track.show()

+--------+-------+-----+
|artistId|trackId|count|
+--------+-------+-----+
|  986534| 829140|    5|
|  995135| 967720|   25|
|  983387| 829641|  135|
|  969750| 955248|   29|
|  970395| 929329|   23|
|  988199| 870619|   82|
|  995788| 885715|   16|
|  987932| 958532|   36|
| 1000709| 852389|    1|
|  991186| 824970|    2|
|  977073| 864053|   12|
|  994213| 844903|   23|
|  978874| 851005|    2|
|  983741| 948079|    1|
|  969750| 842192|    1|
|  997782| 860339|   37|
|  997189| 944578|   15|
|  993554| 823329|   14|
|  997983| 851182|    2|
|  983132| 847276|   12|
+--------+-------+-----+
only showing top 20 rows



#### Normalization can be done with the following function

In [51]:
from pyspark.sql import Window
# Alias sum so that it does not shadow Python's built-in sum().
from pyspark.sql.functions import row_number, sum as sum_, col

# Keep the top-n rows per key1 (ordered by `field`, descending), then normalize
# the edge weights: divide each weight by the sum of the weights kept for that key1.
# key2 is not used in the computation; it is kept so that existing calls still work.
def norm(df, key1, key2, field, n):

    window = Window.partitionBy(key1).orderBy(col(field).desc())

    topsDF = df.withColumn("row_number", row_number().over(window)) \
        .filter(col("row_number") <= n) \
        .drop("row_number")

    # Sum of the retained weights per key1 (groupBy already keeps the key1 column).
    tmpDF = topsDF.groupBy(col(key1)).agg(sum_(col(field)).alias("sum_" + field))

    normalizedDF = topsDF.join(tmpDF, key1, "inner") \
        .withColumn("norm_" + field, col(field) / col("sum_" + field)) \
        .cache()

    return normalizedDF
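
To check what `norm` computes without a cluster, the same logic can be sketched in plain Python. This is an analogue under stated assumptions, not the Spark code: `norm_rows` and the toy counts are hypothetical, and ties at the cut-off are broken by input order here, whereas Spark's `row_number()` gives no such guarantee.

```python
from collections import defaultdict

def norm_rows(rows, n):
    """rows: iterable of (key1, key2, weight).

    Returns (key1, key2, weight / sum of the kept weights for key1),
    keeping only the n heaviest edges per key1.
    """
    by_key = defaultdict(list)
    for key1, key2, weight in rows:
        by_key[key1].append((key2, weight))

    result = []
    for key1, edges in by_key.items():
        # sorted() is stable, so equal weights keep their input order.
        top = sorted(edges, key=lambda e: e[1], reverse=True)[:n]
        total = sum(w for _, w in top)
        result.extend((key1, key2, w / total) for key2, w in top)
    return result

# Hypothetical toy counts: artist 1 has three tracks, artist 2 has one.
toy = [(1, "a", 6), (1, "b", 3), (1, "c", 1), (2, "d", 5)]
normalized = norm_rows(toy, n=2)
for row in normalized:
    print(row)
```

After normalization the weights kept for each key sum to 1, which is why an artist with a single track gets `norm_count=1.0` and an artist with four equally played tracks gets `0.25` for each.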

In [67]:
artist_track_norm = norm(artist_track, 'artistId', 'trackId', 'count', 100) \
        .select('artistId', 'trackId', 'norm_count')

In [68]:
artist_track_norm.take(50)

[Row(artistId=968694, trackId=827354, norm_count=0.25),
 Row(artistId=968694, trackId=820606, norm_count=0.25),
 Row(artistId=968694, trackId=897139, norm_count=0.25),
 Row(artistId=968694, trackId=925696, norm_count=0.25),
 Row(artistId=969344, trackId=933592, norm_count=1.0),
 Row(artistId=969479, trackId=959227, norm_count=0.44166666666666665),
 Row(artistId=969479, trackId=819606, norm_count=0.2),
 Row(artistId=969479, trackId=929291, norm_count=0.10833333333333334),
 Row(artistId=969479, trackId=798826, norm_count=0.075),
 Row(artistId=969479, trackId=890444, norm_count=0.05),
 Row(artistId=969479, trackId=826621, norm_count=0.041666666666666664),
 Row(artistId=969479, trackId=860239, norm_count=0.025),
 Row(artistId=969479, trackId=882651, norm_count=0.016666666666666666),
 Row(artistId=969479, trackId=886945, norm_count=0.008333333333333333),
 Row(artistId=969479, trackId=944749, norm_count=0.008333333333333333),
 Row(artistId=969479, trackId=927174, norm_count=0.008333333333333

In [71]:
window = Window.orderBy(f.col('norm_count').desc())
    
artist_Track_List = artist_track_norm.withColumn('position', f.rank().over(window)) \
    .filter(f.col('position') < 40) \
    .orderBy('artistId', 'trackId') \
    .select('artistId', 'trackId') \
    .take(40)
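
Spark's `rank()` gives tied values the same position and leaves gaps after them, so a filter such as `position < 40` can pass more than 39 rows when many pairs share the same `norm_count`; the final `take(40)` then caps the result. The tie handling can be sketched in plain Python (an analogue with hypothetical scores, where `rank_desc` is an illustrative helper and not the Spark implementation):

```python
def rank_desc(values):
    """Rank values in descending order; ties share the position of their first occurrence."""
    ordered = sorted(values, reverse=True)
    first_pos = {}
    for pos, value in enumerate(ordered, start=1):
        # Only the first occurrence of a value fixes its rank.
        first_pos.setdefault(value, pos)
    return [(value, first_pos[value]) for value in ordered]

# Hypothetical scores with a three-way tie at the top.
scores = [1.0, 1.0, 1.0, 0.5, 0.25]
ranked = rank_desc(scores)
print(ranked)

# All three tied rows have rank 1, so "rank < 3" keeps three rows, not two.
kept = [value for value, rank in ranked if rank < 3]
print(len(kept))
```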

In [72]:
for val in artist_Track_List:
    print("%s %s" % val)

967993 869415
967998 947428
968004 927380
968017 859321
968022 852786
968034 807671
968038 964150
968042 835935
968043 913568
968046 935077
968047 806127
968065 907906
968073 964586
968086 813446
968092 837129
968118 914441
968125 821410
968140 953008
968148 877445
968161 809793
968163 803065
968168 876119
968189 858639
968221 896937
968224 892880
968232 825536
968237 932845
968238 939177
968241 879045
968242 911250
968248 953554
968255 808494
968259 880230
968265 950148
968266 824437
968269 913243
968272 816049
968278 946743
968285 847460
968286 940006
