# Movies without a Genre

In this notebook we will find all the movies that do not have a genre.
We will be using PySpark to solve the problem!

### Getting the data in

The movie genre data is stored in the **data/movies.csv** file.
The first thing we need to do is to read the data into Spark.

In [33]:
# Dependencies
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# Setting a Spark Session
spark = SparkSession.builder.appName("Movies_With_No_Genre").getOrCreate()

# Reading the data
movies = spark.read.csv(
    path="./data/movies.csv",
    sep=",",
    header=True,
    quote='"',
    schema="movieID INT, title STRING, genres STRING"
)

# Let's check our data
movies.printSchema()
movies.show(n=10, truncate=False, vertical=False)

root
 |-- movieID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

+-------+----------------------------------+-------------------------------------------+
|movieID|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
|6      |Heat (1995)                       |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                    |Comedy|Romance                    

### Exploring the data

OK, cool, we've got our data in. However, before we can solve our problem we need to apply a couple of transformations. The genres of each movie are packed into a single column, separated by the pipe character ("|"), so to proceed with the analysis we need to reshape the dataset into the form (movieID, title, genre), with one row per movie and genre.

In [34]:
# Transforming the data
movie_genre = (
    (
        movies
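            # f.split takes a regex pattern, so the pipe is escaped (raw string avoids an invalid "\|" escape)
            .withColumn("genres_array", f.split("genres", r"\|"))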
            .withColumn("genre", f.explode("genres_array"))
    )
    .select("movieId", "title", "genre")
)

# Let's check what is in our dataset
movie_genre.show(5)

+-------+----------------+---------+
|movieId|           title|    genre|
+-------+----------------+---------+
|      1|Toy Story (1995)|Adventure|
|      1|Toy Story (1995)|Animation|
|      1|Toy Story (1995)| Children|
|      1|Toy Story (1995)|   Comedy|
|      1|Toy Story (1995)|  Fantasy|
+-------+----------------+---------+
only showing top 5 rows
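
To make the split-and-explode step above more concrete, here is a minimal plain-Python sketch of the same idea, without Spark. It is only an illustration: the helper `explode_genres` and the hard-coded `rows` list are made up for this example (the values are copied from the sample output shown earlier). It also shows why the pipe is escaped: like `f.split`, `re.split` treats its pattern as a regular expression, where an unescaped "|" means alternation rather than a literal pipe.

```python
import re

# Toy rows in the (movieID, title, genres) layout; values copied from the sample output
rows = [
    (3, "Grumpier Old Men (1995)", "Comedy|Romance"),
    (5, "Father of the Bride Part II (1995)", "Comedy"),
]


def explode_genres(rows):
    """Yield one (movieID, title, genre) tuple per genre of each movie."""
    for movie_id, title, genres in rows:
        # r"\|" matches a literal pipe character
        for genre in re.split(r"\|", genres):
            yield (movie_id, title, genre)


exploded = list(explode_genres(rows))
for row in exploded:
    print(row)
```

Each input row produces as many output rows as it has genres, which is the shape of the `movie_genre` DataFrame above.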



In [56]:
# Next we want to get all the genres available
available_genres = movie_genre.select("genre").distinct()
# Number of available genres
nGenres = available_genres.count()

# Display available genres
print("We have {0} genres available in the database.".format(nGenres))
available_genres.sort("genre").show(n=nGenres, truncate=False)

We have 20 genres available in the database.
+------------------+
|genre             |
+------------------+
|(no genres listed)|
|Action            |
|Adventure         |
|Animation         |
|Children          |
|Comedy            |
|Crime             |
|Documentary       |
|Drama             |
|Fantasy           |
|Film-Noir         |
|Horror            |
|IMAX              |
|Musical           |
|Mystery           |
|Romance           |
|Sci-Fi            |
|Thriller          |
|War               |
|Western           |
+------------------+



### Getting a result

Cool! We have a genre named "(no genres listed)". That is what we are after! To finalize our solution we need to extract all the movies with "(no genres listed)", and if the resulting dataset is not too large, we will print it.

In [63]:
# Movies without a genre
movies_without_genre = movie_genre.where("genre = '(no genres listed)'")
# If the number of movies with no genre is less than 100, then we will print it
nMoviesWithoutGenre = movies_without_genre.count()
print("We have got {} movies without a genre in our dataset.".format(nMoviesWithoutGenre))
if nMoviesWithoutGenre < 100:
    movies_without_genre.sort("title").show(n=nMoviesWithoutGenre, truncate=False)

We have got 34 movies without a genre in our dataset.
+-------+----------------------------------------------------------------------------------+------------------+
|movieId|title                                                                             |genre             |
+-------+----------------------------------------------------------------------------------+------------------+
|182727 |A Christmas Story Live! (2017)                                                    |(no genres listed)|
|149330 |A Cosmic Christmas (1977)                                                         |(no genres listed)|
|159779 |A Midsummer Night's Dream (2016)                                                  |(no genres listed)|
|159161 |Ali Wong: Baby Cobra (2016)                                                       |(no genres listed)|
|122888 |Ben-hur (2016)                                                                    |(no genres listed)|
|176601 |Black Mirror                             