# Movie Ratings: Computing a Mean Rating

1) First, we calculate the average rating for each movie in the dataset, both overall and by user gender.

2) We then display the top three movies by average rating: for all users, for male users only, and for female users only.

3) Next, we find the three movies with the greatest difference in average rating between male and female users.

4) Finally, we calculate how men and women rate movies on average.

Before proceeding with the analysis, I would like to provide a brief description of the rating system.
The lowest rating that a user can assign is 1 star and the highest is 5 stars.

**Interpretation of the rating values:**
- 1 star: very bad
- 2 stars: bad
- 3 stars: average
- 4 stars: good
- 5 stars: excellent



In [1]:
# Settings 
import os
import numpy as np
import pandas as pd
import sqlite3
from sqlite3 import Error as SQLiteError

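# Pandas: show four decimal places. The full option name is used because
# the short form 'precision' can be ambiguous in newer pandas versions.
pd.set_option('display.precision', 4)

# SQLite
dbfile = "sqlitedb/movielens.db"
# sqlite3.connect() creates a missing file, so check for the file first
if not os.path.isfile(dbfile):
    raise FileNotFoundError("Failed to detect the database file: " + dbfile)

# Establish DB Connection (connect() raises on failure; it never returns None)
try:
    conn = sqlite3.connect(dbfile)
except SQLiteError as e:
    print("Failed to establish DB connection:", e)
    raise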

In [2]:
# Data Query
query = """
    SELECT r.movie_id AS MovieID, 
        m.movie_name AS MovieName,
        r.user_id AS UserID,
        CASE u.user_gender
            WHEN 0 THEN 'M'
            ELSE 'F'
        END Gender,
        r.rating as Rating       
        
    FROM ratings AS r
        LEFT JOIN movies as m ON m.movie_id = r.movie_id
        LEFT JOIN users as u ON u.user_id = r.user_id
    ORDER BY r.movie_id, r.user_id
"""

summary = pd.read_sql_query(query, conn)

print("\nSummary DataFrame:\n")
summary.head(n=5)


Summary DataFrame:



| | MovieID | MovieName | UserID | Gender | Rating |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story | 139 | M | 2 |
| 1 | 1 | Toy Story | 755 | M | 2 |
| 2 | 1 | Toy Story | 1577 | F | 4 |
| 3 | 1 | Toy Story | 1940 | M | 4 |
| 4 | 1 | Toy Story | 2765 | M | 4 |
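To try the query without the original database file, a small in-memory database can stand in for it. The sketch below is only an illustration: the table and column names are taken from the query above, but the column types and all sample rows are assumptions, since the actual schema of `movielens.db` is not shown. The names `conn_demo`, `demo_query` and `demo` are introduced here for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for movielens.db; column types and rows are assumed.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, movie_name TEXT);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_gender INTEGER);
    CREATE TABLE ratings (movie_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO movies VALUES (1, 'Movie A'), (2, 'Movie B');
    INSERT INTO users VALUES (10, 0), (11, 1);
    INSERT INTO ratings VALUES (1, 10, 4), (1, 11, 2), (2, 11, 5);
""")

# Same shape as the notebook query: one row per rating, with movie name and gender
demo_query = """
    SELECT r.movie_id AS MovieID,
        m.movie_name AS MovieName,
        r.user_id AS UserID,
        CASE u.user_gender WHEN 0 THEN 'M' ELSE 'F' END AS Gender,
        r.rating AS Rating
    FROM ratings AS r
        LEFT JOIN movies AS m ON m.movie_id = r.movie_id
        LEFT JOIN users AS u ON u.user_id = r.user_id
    ORDER BY r.movie_id, r.user_id
"""
demo = pd.read_sql_query(demo_query, conn_demo)
conn_demo.close()
print(demo)
```

One thing the sketch makes easy to check: the `ELSE` branch labels every value other than 0 as 'F', which includes a rating whose user is missing from `users` (the `LEFT JOIN` yields NULL in that case).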


## 1. Ranking Movies by Average User Rating

### 1.1 Movie Rating Using Mean Value (Average)

In [3]:
# Movie Mean Ratings
movie_ratings = summary.groupby(['MovieName'])
movie_mean_ratings = movie_ratings[['Rating']].mean().sort_values('Rating', ascending=False)
# Rating
movie_mean_ratings

| MovieName | Rating |
|---|---|
| Shawshank Redemption, The | 3.6 |
| Star Wars: Episode IV - A New Hope | 3.2667 |
| Blade Runner | 3.2222 |
| Groundhog Day | 3.1667 |
| Silence of the Lambs, The | 3.0625 |
| Babe | 3.0 |
| Saving Private Ryan | 3.0 |
| Star Wars: Episode VI - Return of the Jedi | 3.0 |
| Schindler's List | 3.0 |
| Pulp Fiction | 3.0 |


### 1.2 Average Rating by User Gender

In this section we compute the **average movie rating by user gender**, using the pandas **groupby** method to accomplish the task.

In [26]:
movie_mean_rating_by_gender = summary.groupby(['MovieName', 'Gender'])[['Rating']].mean()
movie_mean_rating_by_gender

| MovieName | Gender | Rating |
|---|---|---|
| Babe | F | 3.4286 |
| Babe | M | 2.0 |
| Blade Runner | F | 3.5 |
| Blade Runner | M | 3.0 |
| Forrest Gump | F | 3.0 |
| Forrest Gump | M | 2.25 |
| Gladiator | F | 3.0 |
| Gladiator | M | 2.8333 |
| Groundhog Day | F | 2.8333 |
| Groundhog Day | M | 3.5 |
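The long table above can also be viewed side by side, with one column per gender. A minimal sketch, using a few made-up ratings in place of the `summary` DataFrame (the sample values and the names `sample`, `by_gender` and `wide` are illustrative assumptions, not data from the notebook), unstacks the `Gender` level into columns:

```python
import pandas as pd

# Made-up sample in the same shape as `summary` (values are illustrative only)
sample = pd.DataFrame({
    "MovieName": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "Gender":    ["F", "F", "M", "F", "M"],
    "Rating":    [4, 2, 5, 3, 1],
})

# Mean rating per (movie, gender), then move Gender from the index into columns
by_gender = sample.groupby(["MovieName", "Gender"])["Rating"].mean()
wide = by_gender.unstack("Gender")
print(wide)
```

In this layout a movie rated by only one gender would show `NaN` in the other column, which makes such gaps easy to spot.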


## 2. Top 3 Movies by Average Rating

### 2.1. Top 3 Movies Selected by Users

In [27]:
# Top 3 movies
movie_mean_ratings.head(n=3) 

| MovieName | Rating |
|---|---|
| Shawshank Redemption, The | 3.6 |
| Star Wars: Episode IV - A New Hope | 3.2667 |
| Blade Runner | 3.2222 |


### 2.2. Top 3 Movies by Female Users

In [28]:
female_summary = summary[summary['Gender']=='F']
female_mean_ratings = female_summary.groupby(['MovieName'])[['Rating']].mean()
female_mean_ratings.sort_values('Rating', ascending=False).head(n=3)

| MovieName | Rating |
|---|---|
| Shakespeare in Love | 4.25 |
| Pulp Fiction | 4.0 |
| Shawshank Redemption, The | 3.8 |


### 2.3. Top 3 Movies by Male Users

In [29]:
male_summary = summary[summary['Gender']=='M']
male_mean_ratings = male_summary.groupby(['MovieName'])[['Rating']].mean()
male_mean_ratings.sort_values('Rating', ascending=False).head(n=3)

| MovieName | Rating |
|---|---|
| Raiders of the Lost Ark | 3.6667 |
| Schindler's List | 3.5 |
| Groundhog Day | 3.5 |


These results show that, in this dataset, the average tastes of the male and female audiences are quite different. The top female movie, **"Shakespeare in Love"**, got only **2.1429** from the male audience, and the top male movie, **"Raiders of the Lost Ark"**, got only **2.0000** from the female audience. Next, we are going to find the three movies with the biggest difference between male and female ratings.
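A per-gender mean can rest on only a few ratings, so it may help to show the number of ratings next to each mean before comparing audiences. The following sketch uses made-up rows (an assumption for illustration, not the notebook's data) and the hypothetical names `sample`, `stats`, `MeanRating` and `NumRatings`:

```python
import pandas as pd

# Made-up sample in the same shape as `summary` (values are illustrative only)
sample = pd.DataFrame({
    "MovieName": ["Movie A"] * 4 + ["Movie B"] * 3,
    "Gender":    ["F", "F", "M", "M", "F", "M", "M"],
    "Rating":    [5, 4, 2, 3, 1, 4, 4],
})

# Named aggregation: mean and number of ratings for each (movie, gender) pair
stats = (
    sample.groupby(["MovieName", "Gender"])["Rating"]
          .agg(MeanRating="mean", NumRatings="count")
)
print(stats)
```

Here the female mean for "Movie B" comes from a single rating, which is exactly the kind of case the extra column is meant to expose.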

## 3. Movies with the Highest Rating Difference in Male and Female Audiences

Before moving on, we will construct a summary dataset of the mean ratings.
The dataset is going to contain the following columns:

1. MovieName - movie name
2. Rating - movie average rating across all users
3. FRating - movie average rating by the female audience
4. MRating - movie average rating by the male audience
5. DELTA - difference between the female and male averages (FRating - MRating)


In [30]:
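# Average rating data
# Inner merges keep only movies that have ratings from both audiences
mean_rating_summary = pd.merge(movie_mean_ratings, female_mean_ratings,
                              on='MovieName', how='inner')
mean_rating_summary = pd.merge(mean_rating_summary, male_mean_ratings,
                              on='MovieName', how='inner')
# After the merges the columns are Rating_x, Rating_y, Rating; give them clear names
mean_rating_summary.columns = ['Rating', 'FRating', 'MRating']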

# Let's add a DELTA column: DELTA = FRating - MRating

mean_rating_summary['DELTA'] = mean_rating_summary['FRating'] - mean_rating_summary['MRating']

mean_rating_summary

| MovieName | Rating | FRating | MRating | DELTA |
|---|---|---|---|---|
| Shawshank Redemption, The | 3.6 | 3.8 | 3.4 | 0.4 |
| Star Wars: Episode IV - A New Hope | 3.2667 | 3.4286 | 3.125 | 0.3036 |
| Blade Runner | 3.2222 | 3.5 | 3.0 | 0.5 |
| Groundhog Day | 3.1667 | 2.8333 | 3.5 | -0.6667 |
| Silence of the Lambs, The | 3.0625 | 2.7143 | 3.3333 | -0.619 |
| Babe | 3.0 | 3.4286 | 2.0 | 1.4286 |
| Saving Private Ryan | 3.0 | 2.75 | 3.1429 | -0.3929 |
| Star Wars: Episode VI - Return of the Jedi | 3.0 | 3.0 | 3.0 | 0.0 |
| Schindler's List | 3.0 | 2.5 | 3.5 | -1.0 |
| Pulp Fiction | 3.0 | 4.0 | 2.625 | 1.375 |


With this table it is easy to compare the tastes of the male and female audiences.

### 3.1. Movies Loved by Women and Hated by Men

In [31]:
mean_rating_summary.sort_values('DELTA', ascending=False).head(n=3)

| MovieName | Rating | FRating | MRating | DELTA |
|---|---|---|---|---|
| Shakespeare in Love | 2.9091 | 4.25 | 2.1429 | 2.1071 |
| Babe | 3.0 | 3.4286 | 2.0 | 1.4286 |
| Pulp Fiction | 3.0 | 4.0 | 2.625 | 1.375 |


### 3.2. Movies Loved by Men and Hated by Women

In [32]:
mean_rating_summary.sort_values('DELTA', ascending=True).head(n=3)

| MovieName | Rating | FRating | MRating | DELTA |
|---|---|---|---|---|
| Raiders of the Lost Ark | 2.9091 | 2.0 | 3.6667 | -1.6667 |
| Schindler's List | 3.0 | 2.5 | 3.5 | -1.0 |
| Matrix, The | 2.8333 | 2.4 | 3.1429 | -0.7429 |
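The two rankings above can also be combined into a single one by sorting on the absolute value of DELTA. The sketch below re-enters the FRating and MRating values of the six movies listed in sections 3.1 and 3.2; the `sample` frame, the `AbsDelta` column and `top_gap` are names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# FRating / MRating copied from the output tables in sections 3.1 and 3.2
sample = pd.DataFrame(
    {
        "FRating": [4.25, 3.4286, 4.0, 2.0, 2.5, 2.4],
        "MRating": [2.1429, 2.0, 2.625, 3.6667, 3.5, 3.1429],
    },
    index=pd.Index(
        ["Shakespeare in Love", "Babe", "Pulp Fiction",
         "Raiders of the Lost Ark", "Schindler's List", "Matrix, The"],
        name="MovieName",
    ),
)
sample["DELTA"] = sample["FRating"] - sample["MRating"]
sample["AbsDelta"] = sample["DELTA"].abs()

# Three movies with the largest gap, regardless of direction
top_gap = sample.sort_values("AbsDelta", ascending=False).head(3)
print(top_gap[["FRating", "MRating", "DELTA"]].round(4))
```

With these values the three largest gaps belong to "Shakespeare in Love", "Raiders of the Lost Ark" and "Babe", in that order. Because the six movies are the three extremes on each side, the same three would come out on top for the full summary table.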


## 4. Overall Average Movie Rating

In this section we will calculate the average of all ratings in the dataset, first for all users and then separately for the female and male audiences.

### 4.1. Overall Average Rating

In [33]:
# Overall Average Rating
overall_average_rating = summary['Rating'].mean()
print("Average Rating: ", round(overall_average_rating, 4))

Average Rating:  2.9253


In [34]:
# Total Ratings
nRatings = summary['Rating'].count()
print("Total Number of Ratings: {0}".format(nRatings))

Total Number of Ratings: 241


### 4.2. Female Overall Average Rating

In [35]:
# Female Overall Average Rating
female_overall_average_rating = female_summary['Rating'].mean()
print("Average Rating by Females' Audience: ", round(female_overall_average_rating, 4))

Average Rating by Females' Audience:  2.9474


In [36]:
# Total Female Ratings
nFemaleRatings = female_summary['Rating'].count()
print("Total Number of Ratings Provided by Females: {0}".format(nFemaleRatings))

Total Number of Ratings Provided by Females: 114


### 4.3. Male Overall Average Rating

In [37]:
# Male Overall Average Rating
male_overall_average_rating = male_summary['Rating'].mean()
print("Average Rating by Males' Audience: ", round(male_overall_average_rating, 4))

Average Rating by Males' Audience:  2.9055


In [38]:
# Total Male Ratings
nMaleRatings = male_summary['Rating'].count()
print("Total Number of Ratings Provided by Males: {0}".format(nMaleRatings))

Total Number of Ratings Provided by Males: 127
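The overall average is not the simple midpoint of the two gender averages, because the two groups contributed different numbers of ratings. As a quick consistency check on the figures printed above, the overall mean should equal the count-weighted mean of the female and male means. The variable names below are introduced for this check only.

```python
# Figures printed in the cells above
female_mean, n_female = 2.9474, 114
male_mean, n_male = 2.9055, 127

# Count-weighted mean of the two group means
weighted = (female_mean * n_female + male_mean * n_male) / (n_female + n_male)
# Unweighted midpoint, for comparison
midpoint = (female_mean + male_mean) / 2

print("Weighted mean:", round(weighted, 4))
print("Midpoint:", midpoint)
```

The weighted value rounds to 2.9253, matching the overall average reported in section 4.1, and the two counts add up to the 241 ratings in total. The midpoint comes out slightly higher, at about 2.926, because it gives the smaller female group the same weight as the larger male group.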


## 5. Summary

OK! Let's summarize the results!

The top three movies selected by all users are:

1. The Shawshank Redemption, with a rating of **3.6000** (Females: 3.8000, Males: 3.4000)
2. Star Wars: Episode IV - A New Hope, with a rating of **3.2667** (Females: 3.4286, Males: 3.1250)
3. Blade Runner, with a rating of **3.2222** (Females: 3.5000, Males: 3.0000)

The movies in the top three are rated at or above "average" (3 stars) by both the male and female audiences, though none reaches "good" (4 stars). For all three, the males' ratings are slightly lower than the females' ratings.

If we look at the top three movies selected by the male and female audiences separately, we notice that their preferences differ considerably. For example, the female audience's favorite, "Shakespeare in Love", got only 2.14 from the male voters, which under our rating system corresponds to a "bad" movie. On the opposite side is "Raiders of the Lost Ark": this movie got the highest rating from males (**3.6667**), while women gave it only **2.0000**. With only 241 ratings in total, these differences are descriptive and have not been tested for statistical significance.

This is something that needs to be taken into consideration when planning a movie date!