# Spotify Behaviors Project
### Brian Huang, Victor Thai, Annie Fan, Aishani Mohapatra

## Generating Sampled Data
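Because the full dataset is over 500 GB, we work with a sample rather than loading everything at once. We sample by session: we pick a random set of session IDs and keep every row that belongs to them, so each sampled session stays complete.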

```Python
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

N_SESSIONS = 50000  # number of sessions to sample

# Start (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()

# Replace with the path to your copy of the training set.
spark_fp = os.path.join("/", "Volumes", "Marceline Jr.", "Spotify Dataset", "training_set")

# Load the raw CSVs; the first row of each file is the header.
df = spark.read.option("header", "true").csv(spark_fp)

# Get the unique session IDs; we sample whole sessions, not individual rows.
ids = df.select("session_id").distinct()

# Shuffle the IDs and keep the first N_SESSIONS of them.
sampled_users = ids.orderBy(f.rand()).limit(N_SESSIONS)

# sampleBy takes a {key: fraction} dict; a fraction of 1.0 keeps every row of a
# sampled session, and keys missing from the dict are dropped.
sampled_users_list = list(sampled_users.toPandas()["session_id"])
samp_fracs = {key: 1.0 for key in sampled_users_list}

samp_df = df.sampleBy("session_id", fractions=samp_fracs)

# Spark writes a directory of part files at this path, not a single CSV.
samp_df.write.csv(f"./sampled_users_{N_SESSIONS}.csv", header=True)
```
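
To make the idea of session-level sampling concrete, here is a minimal pure-Python sketch on toy data. It is not part of our pipeline: the `sample_sessions` helper, the `position` column, and the toy rows are all made up for illustration. It only shows why picking session IDs first and then filtering keeps each sampled session whole.

```python
import random


def sample_sessions(rows, n_sessions, seed=0):
    """Keep every row belonging to n_sessions randomly chosen session IDs.

    Hypothetical helper for illustration; the Spark code above does the
    same thing at scale with distinct() and sampleBy().
    """
    # Sort the distinct IDs so the same seed always picks the same sessions.
    session_ids = sorted({row["session_id"] for row in rows})
    rng = random.Random(seed)
    chosen = set(rng.sample(session_ids, min(n_sessions, len(session_ids))))
    # Filter on membership so each chosen session is kept in full.
    return [row for row in rows if row["session_id"] in chosen]


# Toy data (made up): 4 sessions with 3 rows each.
rows = [
    {"session_id": f"s{i}", "position": pos}
    for i in range(4)
    for pos in range(1, 4)
]

sample = sample_sessions(rows, n_sessions=2, seed=42)
kept_ids = {row["session_id"] for row in sample}
print(len(kept_ids), len(sample))  # 2 sessions, 6 rows
```

Sampling rows directly would instead leave most sessions with missing rows, which matters for any analysis that looks at behavior within a session.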