# Task one
## Stage 2

Taking the output from stage 1, create a folder called `genres` and, inside
it, write one Parquet output for each individual genre in the dataset. For
example, `Action.parquet` will contain all the Die Hard / Mission: Impossible
style films, and so on.

Example valid names:
- `Adventure`
- `genre=Action`
- `Faith_and_Spirituality`
- `genre=Faith_and_Spirituality`

The schema of each genre Parquet output must match the schema from stage 1.

A genre name like `Action, adventure` should be broken down into separate genres: `["Action", "Adventure"]`.
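
The splitting rule above can be sketched in plain Python. The function name `split_genres` is hypothetical, and both the capitalisation of the first letter and the replacement of spaces with underscores are assumptions inferred from the example names, not stated requirements.

```python
def split_genres(raw):
    """Split a comma-separated genre string into cleaned genre names."""
    names = []
    for part in raw.split(','):
        name = part.strip()
        if not name:
            continue  # skip empty pieces such as a trailing comma
        # Upper-case only the first character so 'adventure' becomes 'Adventure'
        name = name[0].upper() + name[1:]
        # Assumption: whitespace becomes underscores so the name is path-safe
        names.append('_'.join(name.split()))
    return names


print(split_genres('Action, adventure'))
print(split_genres('Faith and Spirituality'))
```

If stage 1 already stores `genres` as an array of clean names, this step is not needed and the array can be used directly.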

In [1]:
import pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [2]:
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('imdb-munging') \
    .getOrCreate()

sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/19 17:56:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
# Load the stage 1 output
input_path = "../output/films"
df = spark.read.parquet(input_path)
df.show(20, truncate=False)

In [None]:
# Main genre = first entry of `genres`; a film listing several genres is assigned to that one only
df = df.withColumn('genre', f.col('genres')[0]).repartition('genre').sortWithinPartitions('persons')
#df.show(20, truncate=False)
df.printSchema()

In [37]:
# a quick count of the main genres
df.groupBy('genre').count().sort(f.desc('count')).show()

+-----------+-----+
|      genre|count|
+-----------+-----+
|      Drama|85765|
|Documentary|84270|
|     Comedy|56628|
|     Action|30127|
|     Horror|13616|
|  Biography|12341|
|      Crime|11951|
|  Adventure|10559|
|   Thriller| 5631|
|      Adult| 5520|
|  Animation| 3844|
|    Romance| 3365|
|     Family| 2407|
|    Fantasy| 2020|
|      Music| 1775|
|    Mystery| 1771|
|     Sci-Fi| 1480|
|    Musical|  932|
|    History|  788|
|      Sport|  678|
+-----------+-----+
only showing top 20 rows



In [43]:
# Preferred approach: a single partitioned write, which creates one `genre=<name>` sub-folder per genre

output_path = '../output/genres'

df.write.parquet(output_path, mode='overwrite', partitionBy='genre')

# Alternative (disabled): save the result as a table partitioned by genre
#df.write.saveAsTable('genres', mode='overwrite', partitionBy='genre')


In [34]:
rows = df.groupBy('genre').agg(f.count('genre')).collect()
genre_list = [row['genre'] for row in rows]
print(genre_list)

['Crime', 'Romance', 'Thriller', 'Adventure', 'Drama', 'War', 'Documentary', 'Reality-TV', 'Family', 'Fantasy', 'Adult', 'History', 'Mystery', 'Musical', 'Animation', 'Music', 'Horror', 'Western', 'Biography', 'Comedy', 'Action', 'Sport', 'Talk-Show', 'Sci-Fi', 'News', 'Game-Show']


In [33]:
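# A different solution: for each genre, write a separate `<genre>.parquet` output.
# Left disabled: it launches one Spark job per genre.

'''
for g in genre_list:
    output_file = '../output/genres/%s.parquet' % g
    # mode='overwrite' lets the cell be re-run without failing on existing output
    df.filter(f.col('genre') == g).write.parquet(output_file, mode='overwrite')
'''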

In [54]:
# Read a partitioned parquet

file = "../output/genres/genre=Adventure"
#file = "../output/genres/Adventure.parquet"
gdf = spark.read.parquet(file)

gdf.printSchema()

root
 |-- film_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- year: date (nullable = true)
 |-- duration: integer (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- persons: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- genre: string (nullable = true)



In [14]:
sc.stop()