# Top 10 Movies

In this notebook we find the 10 movies that received the most ratings from users.
The task needs two data files:

- **movies.csv**
- **ratings.csv**

Our output dataframe will contain:

- **movieID**
- **title**
- **number_of_rates**
- **ratings_sum**
- **average_rating**

The result contains 10 records in total: the 10 most-rated movies in the dataset.
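Before turning to Spark, the aggregation described above can be sketched in plain Python. This is only an illustration of the logic (count, sum and average per movie, then keep the most-rated ones). The sample ratings, the titles and the helper name `top_n_most_rated` are made up for this sketch and are not part of the dataset or the notebook.

```python
from collections import defaultdict

# Hypothetical sample data: (userId, movieId, rating) tuples, invented for illustration.
sample_ratings = [
    (1, 10, 4.0),
    (2, 10, 5.0),
    (3, 10, 3.0),
    (1, 20, 2.5),
    (2, 20, 3.5),
    (1, 30, 5.0),
]
# Hypothetical movieId -> title lookup, standing in for movies.csv.
sample_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}


def top_n_most_rated(ratings, titles, n=10):
    """Return the n movies with the most ratings, with their sum and average."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for _user_id, movie_id, rating in ratings:
        counts[movie_id] += 1
        sums[movie_id] += rating

    rows = [
        {
            "movieID": movie_id,
            "title": titles.get(movie_id),  # None when the title is missing, like a left join
            "number_of_rates": counts[movie_id],
            "ratings_sum": sums[movie_id],
            "average_rating": sums[movie_id] / counts[movie_id],
        }
        for movie_id in counts
    ]
    # Most-rated first; ties are broken by movieID so the result is deterministic.
    rows.sort(key=lambda row: (-row["number_of_rates"], row["movieID"]))
    return rows[:n]


top_two = top_n_most_rated(sample_ratings, sample_titles, n=2)
for row in top_two:
    print(row)
```

The Spark version follows the same steps, with `groupBy` and `agg` doing the per-movie bookkeeping and `limit` keeping the first 10 rows.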

In [24]:
# Project Dependencies
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# Setting a Spark Session
spark = SparkSession.builder.appName("Top_10_Movies").getOrCreate()

In [25]:
# Reading movies.csv data
movies = spark.read.csv(
    path="./data/movies.csv",
    sep=",",
    header=True,
    quote='"',
    schema="movieID INT, title STRING, genres STRING",
)

# Reading ratings.csv data
ratings = (
    spark.read.csv(
        path="./data/ratings.csv",
        sep=",",
        header=True,
        quote='"',
        schema="userId INT, movieId INT, rating DOUBLE, timestamp INT",
    )
    .withColumn("timestamp", f.to_timestamp(f.from_unixtime("timestamp")))
)

# Top 10 Most Rated Movies
top_10_rated_movies = (
    ratings
    .groupBy("movieID")
    .agg(
        f.count("rating").alias("number_of_rates"),
        f.sum("rating").alias("ratings_sum"),
        f.avg("rating").alias("average_rating"),
    )
    # Left join keeps every rated movie, even one missing from movies.csv
    .join(movies, ["movieID"], "left")
    .select("movieID", "title", "number_of_rates", "ratings_sum", "average_rating")
    # Sort after the join: a join is not guaranteed to preserve input row order
    .orderBy(f.col("number_of_rates").desc())
    .limit(10)
)

print("\n*** Top 10 Most Rated Movies ***\n")
top_10_rated_movies.show(truncate = False)


*** Top 10 Most Rated Movies ***

+-------+-----------------------------------------+---------------+-----------+-----------------+
|movieID|title                                    |number_of_rates|ratings_sum|average_rating   |
+-------+-----------------------------------------+---------------+-----------+-----------------+
|356    |Forrest Gump (1994)                      |329            |1370.0     |4.164133738601824|
|318    |Shawshank Redemption, The (1994)         |317            |1404.0     |4.429022082018927|
|296    |Pulp Fiction (1994)                      |307            |1288.5     |4.197068403908795|
|593    |Silence of the Lambs, The (1991)         |279            |1161.0     |4.161290322580645|
|2571   |Matrix, The (1999)                       |278            |1165.5     |4.192446043165468|
|260    |Star Wars: Episode IV - A New Hope (1977)|251            |1062.0     |4.231075697211155|
|480    |Jurassic Park (1993)                     |238            |892.5      |3.75             |

We are all done!
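As a quick sanity check, the three numeric columns should be consistent with each other: `average_rating` should equal `ratings_sum / number_of_rates`. The sketch below checks this in plain Python for the rows visible in the output above. The values are copied from that output, while the helper name `average_matches` and the tolerance are assumptions made for this illustration.

```python
import math

# title -> (number_of_rates, ratings_sum, average_rating), copied from the printed output.
shown_rows = {
    "Forrest Gump (1994)": (329, 1370.0, 4.164133738601824),
    "Shawshank Redemption, The (1994)": (317, 1404.0, 4.429022082018927),
    "Pulp Fiction (1994)": (307, 1288.5, 4.197068403908795),
    "Silence of the Lambs, The (1991)": (279, 1161.0, 4.161290322580645),
    "Matrix, The (1999)": (278, 1165.5, 4.192446043165468),
    "Star Wars: Episode IV - A New Hope (1977)": (251, 1062.0, 4.231075697211155),
    "Jurassic Park (1993)": (238, 892.5, 3.75),
}


def average_matches(count, total, average, rel_tol=1e-9):
    """Check that average equals total / count up to floating-point noise."""
    return math.isclose(total / count, average, rel_tol=rel_tol)


# Collect the titles whose reported average disagrees with sum / count.
mismatches = [
    title
    for title, (count, total, average) in shown_rows.items()
    if not average_matches(count, total, average)
]
print("Inconsistent rows:", mismatches)
```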

Reference notebook: Initial_Data_Analysis.ipynb