## 0: Section Overview

In this section, we will describe how to access the data, complete some Exploratory Data Analysis in order to prepare our data for use, and split our data into Test and Train datasets.

## 1: Necessary Imports

In [43]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

## 2: Data Access

We used the Last.FM data which records 92,800 artist listening records from 1892 users. Our data can be accessed [here](https://grouplens.org/datasets/hetrec-2011/) by downloading the .zip file under the header Last.FM. These daatsets are also available in the [data folder](https://github.com/elimiller7/dst-assessment-2/tree/main/data) in our GitHub repository.

## 3: Test and Train Split


In [3]:
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('last_fm_data/user_artists.dat', header=True, sep='\t')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/11 17:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Since we have 2100 users, we are going to separate 20% of users to use as our test data, and the remaining 80% will be our training data. Assuming that the users are ordered randomly, we can select the first 1664 entries of the dataframe to be our training data. This represents the preferences of the first 33 users.

In [49]:
df_train = df.limit(1664)
df_test = df.orderBy('userID', ascending=False).limit(436)

Now, for each user in the test dataset, we will randomly split their artist preferences into a set for testing our model performance, and a seperate validation set for checking how accurate the model's recommendations were. Again, we will choose to retain 20% of artists for our validation set.

In [67]:
users = df_test.select('userID').distinct().collect()

for user in users:
    user_id = user['userID']
    
    user_data = df_test.filter(df_test['userID'] == user_id)

    x, y = user_data.randomSplit([0.8, 0.2], seed=42)
    
    test_df = df_test.union(x)
    validation_df = df_test.union(y)

test_df.show()

+------+--------+------+
|userID|artistID|weight|
+------+--------+------+
|   999|      65|   433|
|   999|     152|   401|
|   999|     154|   218|
|   999|     188|   128|
|   999|     190|   344|
|   999|     227|   100|
|   999|     233|    72|
|   999|     234|   128|
|   999|     321|   422|
|   999|     344|   129|
|   999|     355|    80|
|   999|     377|   327|
|   999|     418|   167|
|   999|     461|   206|
|   999|     486|   140|
|   999|     498|   125|
|   999|     506|   565|
|   999|     511|   307|
|   999|     533|   131|
|   999|     538|   232|
+------+--------+------+
only showing top 20 rows



In [66]:
validation_df.show()

+------+--------+------+
|userID|artistID|weight|
+------+--------+------+
|   999|      65|   433|
|   999|     152|   401|
|   999|     154|   218|
|   999|     188|   128|
|   999|     190|   344|
|   999|     227|   100|
|   999|     233|    72|
|   999|     234|   128|
|   999|     321|   422|
|   999|     344|   129|
|   999|     355|    80|
|   999|     377|   327|
|   999|     418|   167|
|   999|     461|   206|
|   999|     486|   140|
|   999|     498|   125|
|   999|     506|   565|
|   999|     511|   307|
|   999|     533|   131|
|   999|     538|   232|
+------+--------+------+
only showing top 20 rows

