# Movie Ratings: Popularity

In this notebook we compute movie popularity scores. We will keep things simple and say that the most popular movie is the one with the largest number of user ratings.

**MOVIE_POPULARITY = NUMBER_OF_USER_RATINGS_RECEIVED**

One of the analysis goals is to find the three most popular movies. Suppose several movies have the same **Popularity** score. How do we rank them? From the previous analysis we have a **Mean Rating** estimate for each movie, so among movies with the same **Popularity** score we will treat the one with the higher **Mean Rating** as more popular.
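To make the ranking rule concrete, here is a minimal sketch on a made-up set of ratings (the movie names `A`, `B`, `C` and their ratings are hypothetical, not taken from the database). It counts ratings per movie, computes the mean, and sorts by both keys.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieLens database
ratings = pd.DataFrame({
    "MovieName": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Rating":    [4, 4, 4, 5, 3, 2, 5, 5],
})

# Popularity = number of ratings received; Mean Rating is the tie-breaker
scores = ratings.groupby("MovieName")["Rating"].agg(["count", "mean"])
scores.columns = ["Popularity", "Mean Rating"]

# Sort by Popularity first, then by Mean Rating, both descending
ranking = scores.sort_values(["Popularity", "Mean Rating"], ascending=False)
print(ranking)
```

Movies `A` and `B` both have three ratings, so `A` comes first because of its higher mean; `C` has the best mean but only two ratings, so it ends up last.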

In [1]:
# Settings 
import os
import numpy as np
import pandas as pd
import sqlite3
from sqlite3 import Error as SQLiteError

# Pandas
pd.set_option('display.precision', 4)  # full option name; bare 'precision' can be ambiguous in newer pandas

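# SQLite
dbfile = "sqlitedb/movielens.db"
if not os.path.isfile(dbfile):
    # Stop here: sqlite3.connect() would silently create an empty database
    raise FileNotFoundError(f"Failed to detect the database file: {dbfile}")

# Establish DB Connection (sqlite3.connect() raises on failure, it never returns None)
try:
    conn = sqlite3.connect(dbfile)
except SQLiteError as err:
    print(f"Failed to establish DB connection: {err}")
    raise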

In [2]:
# Data Query
query = """
    SELECT r.movie_id AS MovieID,
        m.movie_name AS MovieName,
        r.user_id AS UserID,
        CASE u.user_gender
            WHEN 0 THEN 'M'
            ELSE 'F'
        END AS Gender,
        r.rating AS Rating
    FROM ratings AS r
        LEFT JOIN movies AS m ON m.movie_id = r.movie_id
        LEFT JOIN users AS u ON u.user_id = r.user_id
    ORDER BY r.movie_id, r.user_id
"""

summary = pd.read_sql_query(query, conn)

print("\nSummary DataFrame:\n")
print(summary.head(n=5))


Summary DataFrame:

   MovieID  MovieName  UserID Gender  Rating
0        1  Toy Story     139      M       2
1        1  Toy Story     755      M       2
2        1  Toy Story    1577      F       4
3        1  Toy Story    1940      M       4
4        1  Toy Story    2765      M       4
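As a side note, the popularity score could also be computed inside SQLite instead of pandas. The sketch below is only an illustration: it uses an in-memory database with made-up rows, and it assumes the table and column names seen in the query above (`ratings`, `movies`, `movie_id`, `movie_name`, `rating`).

```python
import sqlite3

# In-memory database with a hypothetical, tiny copy of the two tables;
# table and column names follow the query above, the rows are made up.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, movie_name TEXT);
    CREATE TABLE ratings (movie_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO movies VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO ratings VALUES
        (1, 10, 4), (1, 11, 4), (1, 12, 4),
        (2, 10, 5), (2, 11, 3), (2, 12, 2),
        (3, 10, 5), (3, 11, 5);
""")

# Count ratings per movie and use the average rating as the second sort key
popularity_sql = """
    SELECT m.movie_name AS MovieName,
        COUNT(r.rating) AS Popularity,
        AVG(r.rating) AS MeanRating
    FROM ratings AS r
        LEFT JOIN movies AS m ON m.movie_id = r.movie_id
    GROUP BY r.movie_id, m.movie_name
    ORDER BY Popularity DESC, MeanRating DESC
"""
rows = conn_demo.execute(popularity_sql).fetchall()
for row in rows:
    print(row)
conn_demo.close()
```

In this notebook we keep the row-level `summary` DataFrame instead, because the per-gender breakdowns later on need the individual ratings.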


## 1. Movie Popularity Rating

### 1.1. Popularity Rating

In [41]:
# Popularity Rating
popularity_summary = summary[['MovieName', 'Rating']].groupby('MovieName').agg(
    popularity = ('Rating', 'count'),   # number of ratings received
    mean_rating = ('Rating', 'mean')    # used as the tie-breaker
)
popularity_summary.columns = ['Popularity', 'Mean Rating']
popularity_rating = popularity_summary.sort_values(['Popularity', 'Mean Rating'], ascending=False)
print(popularity_rating)

                                            Popularity  Mean Rating
MovieName                                                          
Toy Story                                           17       2.8235
Silence of the Lambs, The                           16       3.0625
Star Wars: Episode IV - A New Hope                  15       3.2667
Star Wars: Episode VI - Return of the Jedi          14       3.0000
Independence Day (ID4)                              13       2.7692
Groundhog Day                                       12       3.1667
Schindler's List                                    12       3.0000
Gladiator                                           12       2.9167
Matrix, The                                         12       2.8333
Sixth Sense, The                                    12       2.8333
Total Recall                                        12       1.9167
Pulp Fiction                                        11       3.0000
Saving Private Ryan                             

### 1.2. Top 3 Movies by Popularity

In [36]:
# Top 3 Movies
print(popularity_rating.head(n=3))

                                    Popularity  Mean Rating
MovieName                                                  
Toy Story                                   17       2.8235
Silence of the Lambs, The                   16       3.0625
Star Wars: Episode IV - A New Hope          15       3.2667


## 2. Popularity Rating by Female Audience

### 2.1. Popularity Rating by Females

In [40]:
# Movie Popularity Rating by Female Audience

female_summary = summary[summary['Gender']=='F']
female_popularity_summary = female_summary[['MovieName', 'Rating']].groupby('MovieName').agg(
    popularity = ('Rating', 'count'),   # number of ratings received
    mean_rating = ('Rating', 'mean')    # used as the tie-breaker
)
female_popularity_summary.columns = ['Popularity', 'Mean Rating']
female_popularity_rating = female_popularity_summary.sort_values(['Popularity', 'Mean Rating'], 
    ascending=False)
print(female_popularity_rating)

                                            Popularity  Mean Rating
MovieName                                                          
Toy Story                                            7       3.5714
Babe                                                 7       3.4286
Star Wars: Episode IV - A New Hope                   7       3.4286
Silence of the Lambs, The                            7       2.7143
Stand by Me                                          7       2.4286
Total Recall                                         7       1.7143
Forrest Gump                                         6       3.0000
Gladiator                                            6       3.0000
Sixth Sense, The                                     6       3.0000
Star Wars: Episode VI - Return of the Jedi           6       3.0000
Groundhog Day                                        6       2.8333
Independence Day (ID4)                               6       2.6667
Schindler's List                                

### 2.2. Top 3 Most Popular Movies Selected by Females

Please note that within the female audience we have 6 movies with a **Popularity** score equal to 7,
hence in order to rank these movies we will use the movie **Mean Rating** value.

In [45]:
# Top 3 Movies Selected by Females
print(female_popularity_rating.head(n=3))

                                    Popularity  Mean Rating
MovieName                                                  
Toy Story                                    7       3.5714
Babe                                         7       3.4286
Star Wars: Episode IV - A New Hope           7       3.4286


...and oops! We've got two movies with the same **Popularity** score and the same **Mean Rating** value!
What are we going to do about that? Well, I would say let's do nothing. Why?

1) This is not going to happen too often. I would say it is the exception rather than the rule.

2) Our analysis dataset is small, so quirks and inconsistencies should be expected.

3) Finally, our goal is to choose the top 3 most popular movies, and we did that. We do not care too much which movie is more or less popular within the top 3.
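If a reproducible order were ever needed, full ties could be detected and broken with a third key. The sketch below is only one possible approach and uses hypothetical scores (the movie names and values are made up); sorting alphabetically by `MovieName` is an arbitrary choice for illustration, not part of the analysis above.

```python
import pandas as pd

# Hypothetical scores shaped like female_popularity_rating; values are illustrative
scores = pd.DataFrame(
    {"Popularity": [7, 7, 7, 6], "Mean Rating": [3.5, 3.4, 3.4, 3.0]},
    index=pd.Index(["Movie A", "Movie C", "Movie B", "Movie D"], name="MovieName"),
)

# Flag rows whose (Popularity, Mean Rating) pair occurs more than once
tied = scores[scores.duplicated(["Popularity", "Mean Rating"], keep=False)]
print(tied)

# Optional: break the remaining ties alphabetically so the order is reproducible
ordered = (
    scores.reset_index()
    .sort_values(["Popularity", "Mean Rating", "MovieName"],
                 ascending=[False, False, True])
    .set_index("MovieName")
)
print(ordered)
```

Note that `duplicated` compares the mean ratings for exact equality, which works here because tied means come from identical sums and counts.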


## 3. Popularity Rating by Male Audience

### 3.1. Popularity Rating by Males

In [47]:
# Movie Popularity Rating by Male Audience

male_summary = summary[summary['Gender']=='M']
male_popularity_summary = male_summary[['MovieName', 'Rating']].groupby('MovieName').agg(
    popularity = ('Rating', 'count'),   # number of ratings received
    mean_rating = ('Rating', 'mean')    # used as the tie-breaker
)
male_popularity_summary.columns = ['Popularity', 'Mean Rating']
male_popularity_rating = male_popularity_summary.sort_values(['Popularity', 'Mean Rating'], 
    ascending=False)
print(male_popularity_rating)

                                            Popularity  Mean Rating
MovieName                                                          
Toy Story                                           10       2.3000
Silence of the Lambs, The                            9       3.3333
Star Wars: Episode IV - A New Hope                   8       3.1250
Star Wars: Episode VI - Return of the Jedi           8       3.0000
Pulp Fiction                                         8       2.6250
Matrix, The                                          7       3.1429
Saving Private Ryan                                  7       3.1429
Independence Day (ID4)                               7       2.8571
Shakespeare in Love                                  7       2.1429
Raiders of the Lost Ark                              6       3.6667
Groundhog Day                                        6       3.5000
Schindler's List                                     6       3.5000
Gladiator                                       

### 3.2. Top 3 Movies Selected by Males

In [48]:
# Top 3 Movies Selected by Males
print(male_popularity_rating.head(n=3))

                                    Popularity  Mean Rating
MovieName                                                  
Toy Story                                   10       2.3000
Silence of the Lambs, The                    9       3.3333
Star Wars: Episode IV - A New Hope           8       3.1250
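Sections 1 to 3 repeat the same aggregation on different subsets of the data. A small helper could remove that duplication. The function name `popularity_rating_for` and the demo data below are hypothetical; the column names `MovieName`, `Gender` and `Rating` are the ones used in the `summary` DataFrame.

```python
import pandas as pd

def popularity_rating_for(df):
    """Rank movies by number of ratings, then by mean rating (both descending)."""
    result = df.groupby("MovieName")["Rating"].agg(["count", "mean"])
    result.columns = ["Popularity", "Mean Rating"]
    return result.sort_values(["Popularity", "Mean Rating"], ascending=False)

# Hypothetical data in the same layout as the summary DataFrame
demo = pd.DataFrame({
    "MovieName": ["A", "A", "A", "B", "B", "A", "B"],
    "Gender":    ["F", "F", "M", "F", "M", "M", "M"],
    "Rating":    [5, 4, 2, 3, 4, 1, 5],
})

# One call per audience instead of three copies of the same code
for gender in ["F", "M"]:
    print(f"\nGender: {gender}")
    print(popularity_rating_for(demo[demo["Gender"] == gender]).head(n=3))
```

With the real data, the same call on `summary`, `female_summary` and `male_summary` should give the three tables shown above.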


## 4. Summary

... and here we go! We have the movie popularity results. They are interesting, but not as surprising as the **Mean Rating** results. A likely reason is that popularity reflects what a community is watching: people who follow what is going on around them hear about the same movies, whatever their gender. Our results are consistent with this: the overall, female, and male top 3 lists largely overlap.
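The claim that popularity barely depends on gender can be checked by comparing the three top 3 lists. The sketch below simply copies the movie names from sections 1.2, 2.2 and 3.2; the variable names are hypothetical.

```python
# Top 3 lists copied from sections 1.2, 2.2 and 3.2
top3_all = ["Toy Story", "Silence of the Lambs, The",
            "Star Wars: Episode IV - A New Hope"]
top3_female = ["Toy Story", "Babe", "Star Wars: Episode IV - A New Hope"]
top3_male = ["Toy Story", "Silence of the Lambs, The",
             "Star Wars: Episode IV - A New Hope"]

# Movies that appear in both the female and the male top 3
shared = set(top3_female) & set(top3_male)
print(sorted(shared))

# The male top 3 matches the overall top 3 exactly
print(top3_male == top3_all)
```

Two of the three movies are shared between the female and male lists, and the male list is identical to the overall one.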

The most popular movie is **"Toy Story"**, a family movie produced by Pixar, which received the largest number of ratings in the user community. Interestingly, women quite liked it (3.5714), while men did not (2.3000). One possible reading is that women enjoy and value the time spent with their family and kids, while men would prefer being elsewhere, although 17 ratings are far too few to settle that. Let's have a look at the **"Toy Story"** ratings:

In [57]:
# Toy Story Ratings

idx = summary['MovieName']=='Toy Story'
summary[idx][['MovieName', 'Gender', 'Rating']].sort_values(['Gender', 'Rating'], ascending=False)

    MovieName Gender  Rating
3   Toy Story      M       4
4   Toy Story      M       4
0   Toy Story      M       2
1   Toy Story      M       2
9   Toy Story      M       2
10  Toy Story      M       2
12  Toy Story      M       2
14  Toy Story      M       2
16  Toy Story      M       2
13  Toy Story      M       1
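The female and male mean ratings quoted for "Toy Story" come from two separate tables. A pivot could put them side by side for every movie at once. This is a minimal sketch on hypothetical data in the layout of the `summary` DataFrame; the column name `Gap (F - M)` is made up for illustration.

```python
import pandas as pd

# Hypothetical ratings in the same layout as the summary DataFrame
demo = pd.DataFrame({
    "MovieName": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Gender":    ["F", "F", "M", "M", "F", "F", "M", "M"],
    "Rating":    [5, 4, 2, 1, 3, 3, 4, 2],
})

# One row per movie, one column per gender, mean rating in each cell
by_gender = demo.pivot_table(index="MovieName", columns="Gender",
                             values="Rating", aggfunc="mean")

# Positive gap: rated higher by women; negative gap: rated higher by men
by_gender["Gap (F - M)"] = by_gender["F"] - by_gender["M"]
print(by_gender.sort_values("Gap (F - M)", ascending=False))
```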


The second most popular movie on the list is **"The Silence of the Lambs"**, which was popular with both male and female audiences. It got a mean rating of **3.3333** from the male audience and **2.7143** from the female audience. So even though the movie is hugely popular and has received plenty of worldwide recognition, our audience didn't particularly enjoy it. Here is what the ratings look like:

In [58]:
# The Silence of the Lambs Ratings

idx = summary['MovieName']=='Silence of the Lambs, The'
summary[idx][['MovieName', 'Gender', 'Rating']].sort_values(['Gender', 'Rating'], ascending=False)

                     MovieName Gender  Rating
103  Silence of the Lambs, The      M       5
95   Silence of the Lambs, The      M       4
98   Silence of the Lambs, The      M       4
106  Silence of the Lambs, The      M       4
109  Silence of the Lambs, The      M       4
99   Silence of the Lambs, The      M       3
107  Silence of the Lambs, The      M       3
94   Silence of the Lambs, The      M       2
108  Silence of the Lambs, The      M       1
104  Silence of the Lambs, The      F       5


If you look at the women's Top 3 list, you may notice that the second position is taken by **"Babe"**. As mentioned, movies are ranked by **Popularity** score first and by **Mean Rating** second. Within the female audience **"The Silence of the Lambs"** and **"Babe"** both have a popularity score of 7, but **"Babe"** was liked more (3.4286 versus 2.7143).

Interestingly, **"Babe"** is a movie about _"a pig raised by sheepdogs, learns to herd sheep with a little help from Farmer Hoggett"_. In other words, we are again dealing with a family movie, and again it was rated highly by female users and almost ignored by male users (only 3 male users rated it, each giving it a 2).

In [59]:
# Babe Ratings

idx = summary['MovieName']=='Babe'
summary[idx][['MovieName', 'Gender', 'Rating']].sort_values(['Gender', 'Rating'], ascending=False)

    MovieName Gender  Rating
17       Babe      M       2
22       Babe      M       2
25       Babe      M       2
19       Babe      F       5
21       Babe      F       5
24       Babe      F       4
26       Babe      F       4
18       Babe      F       3
20       Babe      F       2
23       Babe      F       1
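The per-gender numbers for "Babe" can be reproduced from the ten ratings listed above. The sketch below copies those ratings into a small DataFrame (the variable names are hypothetical) and computes the count and the mean for each gender.

```python
import pandas as pd

# "Babe" ratings copied from the output above (3 male and 7 female ratings)
babe = pd.DataFrame({
    "Gender": ["M"] * 3 + ["F"] * 7,
    "Rating": [2, 2, 2, 5, 5, 4, 4, 3, 2, 1],
})

# Number of ratings and mean rating per gender
per_gender = babe.groupby("Gender")["Rating"].agg(["count", "mean"])
per_gender.columns = ["Popularity", "Mean Rating"]
print(per_gender.round(4))
```

This gives 7 female ratings with a mean of 3.4286 and 3 male ratings with a mean of 2.0, matching the figures in the tables above.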


Number three on our popularity list is **"Star Wars: Episode IV - A New Hope"**. I do not think anyone will question the popularity of "Star Wars". I would also say that this is a typical **"OK"** movie that one can safely watch with friends, a girlfriend, or a boyfriend, and most likely have a good time together. The ratings for the movie are as follows:

In [62]:
# Star Wars: Episode IV - A New Hope Ratings

idx = summary['MovieName']=='Star Wars: Episode IV - A New Hope'
summary[idx][['MovieName', 'Gender', 'Rating']].sort_values(['Gender', 'Rating'], ascending=False)

                             MovieName Gender  Rating
38  Star Wars: Episode IV - A New Hope      M       5
39  Star Wars: Episode IV - A New Hope      M       5
31  Star Wars: Episode IV - A New Hope      M       4
40  Star Wars: Episode IV - A New Hope      M       4
27  Star Wars: Episode IV - A New Hope      M       3
30  Star Wars: Episode IV - A New Hope      M       2
28  Star Wars: Episode IV - A New Hope      M       1
36  Star Wars: Episode IV - A New Hope      M       1
35  Star Wars: Episode IV - A New Hope      F       5
29  Star Wars: Episode IV - A New Hope      F       4
