# Movie Rating Data Analysis 
### Part 1: Hypothesis Testing
In this first half, 11 questions are investigated using appropriate hypothesis tests. To cut down on false positives, the per-test significance level 𝛼 is set to 0.005 (following Benjamin et al., 2018).

In [1]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
# Load data
movies = pd.read_csv('movieReplicationSet.csv')
movies.head()

Unnamed: 0,The Life of David Gale (2003),Wing Commander (1999),Django Unchained (2012),Alien (1979),Indiana Jones and the Last Crusade (1989),Snatch (2000),Rambo: First Blood Part II (1985),Fargo (1996),Let the Right One In (2008),Black Swan (2010),...,When watching a movie I cheer or shout or talk or curse at the screen,When watching a movie I feel like the things on the screen are happening to me,As a movie unfolds I start to have problems keeping track of events that happened earlier,"The emotions on the screen ""rub off"" on me - for instance if something sad is happening I get sad or if something frightening is happening I get scared",When watching a movie I get completely immersed in the alternative reality of the film,Movies change my position on social economic or political issues,When watching movies things get so intense that I have to stop watching,Gender identity (1 = female; 2 = male; 3 = self-described),Are you an only child? (1: Yes; 0: No; -1: Did not respond),Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond)
0,,,4.0,,3.0,,,,,,...,1.0,6.0,2.0,5.0,5.0,5.0,1.0,1.0,0,1
1,,,1.5,,,,,,,,...,3.0,1.0,1.0,6.0,5.0,3.0,2.0,1.0,0,0
2,,,,,,,,,,,...,5.0,4.0,3.0,5.0,5.0,4.0,4.0,1.0,1,0
3,,,2.0,,3.0,,,,,4.0,...,3.0,1.0,1.0,4.0,5.0,3.0,1.0,1.0,0,1
4,,,3.5,,0.5,,0.5,1.0,,0.0,...,2.0,3.0,2.0,5.0,6.0,4.0,4.0,1.0,1,1


In [18]:
# convert all data to numeric
movies = movies.apply(pd.to_numeric, errors='coerce')

#### 1)  Are movies that are more popular (operationalized as having more ratings) rated higher than movies that are less popular? 

In [32]:
# subset only movie ratings
movies_ratings = movies.iloc[:,:400]
# median-split of popularity to determine high vs. low popularity movies
median_1 = movies_ratings.count().median()
print(median_1)
low_pop = movies_ratings.loc[:, movies_ratings.count() < median_1]
high_pop = movies_ratings.loc[:, movies_ratings.count() >= median_1]

197.5


In [44]:
# vectorize lower half
low_pop_arr = low_pop.values.flatten()
# remove missing values
low_pop_arr = low_pop_arr[np.isfinite(low_pop_arr)]
print(low_pop.shape)
np.median(low_pop_arr)

(1097, 200)


2.5

In [45]:
# vectorize higher half
high_pop_arr = high_pop.values.flatten()
# remove missing values
high_pop_arr = high_pop_arr[np.isfinite(high_pop_arr)]
print(high_pop.shape)
np.median(high_pop_arr)

(1097, 200)


3.0

In [46]:
# U test
u1, pu1 = stats.mannwhitneyu(low_pop_arr, high_pop_arr, alternative = 'less')
print(u1, pu1)
pu1<0.005

741899855.5 0.0


True

Since the p-value is less than our alpha of 0.005, we reject the null hypothesis and conclude
that more popular movies are rated higher than less popular movies.
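A printed p-value of `0.0` most likely reflects floating-point underflow rather than an exact zero, and with pooled samples this large even small shifts can reach significance. As a hedged sketch (not part of the original analysis), the helper below derives a common-language effect size from the rank sums that underlie the U statistic. The name `common_language_effect` and the toy arrays are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def common_language_effect(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) from the ranks behind the U test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rank the pooled sample; tied values share their average rank
    ranks = stats.rankdata(np.concatenate([x, y]))
    # U statistic of the first sample, computed from its rank sum
    u_x = ranks[:len(x)].sum() - len(x) * (len(x) + 1) / 2
    return u_x / (len(x) * len(y))

# Toy arrays (illustrative only, not taken from the dataset)
demo_effect = common_language_effect([3, 4, 5], [1, 2, 3])
print(demo_effect)
```

Applied to the arrays above, `common_language_effect(high_pop_arr, low_pop_arr)` would estimate the probability that a rating drawn from a more popular movie exceeds one drawn from a less popular movie; a value of 0.5 indicates no tendency either way.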

#### 2) Are movies that are newer rated differently than movies that are older?

In [57]:
# median-split of year of release to contrast movies in terms of whether they are old or new
movies_year = movies_ratings.columns.str[-5:-1].values.astype(int)
median_2 = np.median(movies_year)

old = movies_ratings.loc[:,movies_ratings.columns.str[-5:-1].values.astype(int) < median_2]
new = movies_ratings.loc[:,movies_ratings.columns.str[-5:-1].values.astype(int) >= median_2]

In [58]:
# vectorize lower half
old_arr = old.values.flatten()
# remove missing values
old_arr = old_arr[np.isfinite(old_arr)]
print(old.shape)
np.median(old_arr)

(1097, 197)


3.0

In [59]:
# vectorize higher half
new_arr = new.values.flatten()
# remove missing values
new_arr = new_arr[np.isfinite(new_arr)]
print(new.shape)
np.median(new_arr)

(1097, 203)


3.0

In [62]:
# U test
u2, pu2 = stats.mannwhitneyu(old_arr, new_arr, alternative = 'two-sided')
print(u2, pu2)
pu2<0.005

1502583861.0 1.2849216001533932e-06


True

Since the p-value is less than our alpha of 0.005, we reject the null hypothesis and conclude
that newer movies are rated differently than older movies.
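The year split relies on fixed character positions (`str[-5:-1]`), which assumes every column name ends in exactly `(YYYY)` with no trailing whitespace. A slightly more defensive sketch, using a regular expression on a few toy titles, is shown below; the variable names are hypothetical, and the same idea can separate the title text from the year when title length is needed.

```python
import pandas as pd

# Toy titles (illustrative only); the last one has a trailing space
titles = pd.Series(['Alien (1979)', 'Fargo (1996)', 'Black Swan (2010) '])

# Capture a four-digit year in parentheses at the end of the string
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)

# Strip the year suffix to recover the bare title, then measure its length
names = titles.str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
name_lengths = names.str.len()

print(years.tolist())
print(names.tolist())
print(name_lengths.tolist())
```

A title that does not match the pattern yields a missing value from `str.extract`, so the integer conversion fails loudly instead of silently producing a wrong year.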

#### 3) Is enjoyment of ‘Shrek (2001)’ gendered, i.e. do male and female viewers rate it differently?

In [64]:
# Subset the movie and create a male group and a female group
Shrek = movies[['Shrek (2001)', 'Gender identity (1 = female; 2 = male; 3 = self-described)']]
female = Shrek[Shrek['Gender identity (1 = female; 2 = male; 3 = self-described)'] == 1]['Shrek (2001)']
male = Shrek[Shrek['Gender identity (1 = female; 2 = male; 3 = self-described)'] == 2]['Shrek (2001)']

In [65]:
# female half
# remove missing values
female_arr = female.values[np.isfinite(female)]
np.median(female_arr)

3.5

In [66]:
# male half
# remove missing values
male_arr = male.values[np.isfinite(male)]
np.median(male_arr)

3.0

In [67]:
# U test
u3, pu3 = stats.mannwhitneyu(female_arr, male_arr, alternative = 'two-sided')
print(u3, pu3)
pu3<0.005

96830.5 0.050536625925559006


False

Since the p-value is not less than our alpha of 0.005, we fail to reject the null hypothesis:
there is insufficient evidence that male and female viewers rate ‘Shrek (2001)’ differently.

#### 4) What proportion of movies are rated differently by male and female viewers?

In [71]:
# holder list
u_stats4 = []

# iterate through all 400 movies
# in each iteration, do a male/female split and record whether the U test is significant
for i in range(400):
    movie = movies.iloc[:,[i, 474]]
    movie_title = movie.columns[0]
    gender_title = movie.columns[1]
    
    female = movie[movie[gender_title] == 1][movie_title]
    female_arr = female.values[np.isfinite(female)]
    male = movie[movie[gender_title] == 2][movie_title]
    male_arr = male.values[np.isfinite(male)]
    
    u4, pu4 = stats.mannwhitneyu(female_arr, male_arr, alternative = 'two-sided')
    if pu4<0.005:
        u_stats4.append(1)
    else:
        u_stats4.append(0)

In [72]:
# calculate proportion of movies showing a difference
np.sum(u_stats4)/len(u_stats4)

0.125

12.5% of movies are rated differently by male and female viewers.
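The per-movie loop recurs for questions 4, 6, and 8 with only the grouping column changed, and it addresses that column by a hardcoded position. One way to factor it out, sketched under the assumption that the grouping column is passed by name, is shown below. `proportion_significant` is a hypothetical helper and the toy DataFrame is illustrative, not drawn from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

def proportion_significant(df, rating_cols, group_col, level_a, level_b, alpha=0.005):
    """Share of rating columns whose two groups differ under a two-sided U test."""
    flags = []
    for col in rating_cols:
        # Split ratings by group level and drop missing values
        a = df.loc[df[group_col] == level_a, col].dropna().to_numpy()
        b = df.loc[df[group_col] == level_b, col].dropna().to_numpy()
        _, p = stats.mannwhitneyu(a, b, alternative='two-sided')
        flags.append(p < alpha)
    return float(np.mean(flags))

# Toy data: 'Movie A' differs sharply between groups, 'Movie B' does not
demo = pd.DataFrame({
    'Movie A (2000)': [1.0] * 20 + [4.0] * 20,
    'Movie B (2001)': [1.0, 2.0, 3.0, 4.0] * 10,
    'group': [1] * 20 + [2] * 20,
})
demo_share = proportion_significant(
    demo, ['Movie A (2000)', 'Movie B (2001)'], 'group', 1, 2
)
print(demo_share)
```

With the real data, the call would take the 400 rating columns and the full gender column name, which avoids relying on positions such as 474 staying fixed if the file layout changes.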

#### 5) Do people who are only children enjoy ‘The Lion King (1994)’ more than people with siblings?

In [122]:
# Subset the movie and create an only-child group and a with-siblings group
Lion_King = movies[['The Lion King (1994)', 'Are you an only child? (1: Yes; 0: No; -1: Did not respond)']]
only_child = Lion_King[Lion_King['Are you an only child? (1: Yes; 0: No; -1: Did not respond)'] == 1]['The Lion King (1994)']
with_siblings = Lion_King[Lion_King['Are you an only child? (1: Yes; 0: No; -1: Did not respond)'] == 0]['The Lion King (1994)']

In [123]:
# only-child half
# remove missing values
only_child_arr = only_child.values[np.isfinite(only_child)]
np.median(only_child_arr)

3.5

In [124]:
# with-siblings half
# remove missing values
with_siblings_arr = with_siblings.values[np.isfinite(with_siblings)]
np.median(with_siblings_arr)

4.0

In [125]:
# U test
u5, pu5 = stats.mannwhitneyu(only_child_arr, with_siblings_arr, alternative = 'greater')
print(u5, pu5)
pu5<0.005

52929.0 0.978419092554931


False

Since the p-value is not less than our alpha of 0.005, we fail to reject the null hypothesis:
there is insufficient evidence that only children enjoy ‘The Lion King (1994)’ more than people with siblings.

#### 6) What proportion of movies exhibit an “only child effect”, i.e. are rated differently by viewers with siblings vs. those without?

In [130]:
# holder list
u_stats6 = []

# iterate through all 400 movies
# in each iteration, split by only-child status and record whether the U test is significant
for i in range(400):
    movie = movies.iloc[:,[i, 475]]
    movie_title = movie.columns[0]
    status_title = movie.columns[1]
    
    only_child = movie[movie[status_title] == 1][movie_title]
    only_child_arr = only_child.values[np.isfinite(only_child)]
    with_siblings = movie[movie[status_title] == 0][movie_title]
    with_siblings_arr = with_siblings.values[np.isfinite(with_siblings)]
    
    u6, pu6 = stats.mannwhitneyu(only_child_arr, with_siblings_arr, alternative = 'two-sided')
    if pu6<0.005:
        u_stats6.append(1)
    else:
        u_stats6.append(0)

In [131]:
# calculate proportion of movies showing a difference
np.sum(u_stats6)/len(u_stats6)

0.0175

Only 1.75% of movies exhibit an “only child effect”.

#### 7) Do people who like to watch movies socially enjoy ‘The Wolf of Wall Street (2013)’ more than those who prefer to watch them alone?

In [132]:
# Subset the movie and create an alone group and a social group
WoWS = movies[['The Wolf of Wall Street (2013)', 'Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond)']]
alone = WoWS[WoWS['Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond)'] == 1]['The Wolf of Wall Street (2013)']
socially = WoWS[WoWS['Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond)'] == 0]['The Wolf of Wall Street (2013)']

In [133]:
# prefer alone half
# remove missing values
alone_arr = alone.values[np.isfinite(alone)]
np.median(alone_arr)

3.5

In [134]:
# prefer socially half
# remove missing values
socially_arr = socially.values[np.isfinite(socially)]
np.median(socially_arr)

3.0

In [135]:
# U test
u7, pu7 = stats.mannwhitneyu(alone_arr, socially_arr, alternative = 'less')
print(u7, pu7)
pu7<0.005

56806.5 0.9436657996253056


False

Since the p-value is not less than our alpha of 0.005, we fail to reject the null hypothesis:
there is insufficient evidence that people who like to watch movies socially enjoy ‘The Wolf of Wall Street (2013)’ more than those who prefer to watch them alone.

#### 8) What proportion of movies exhibit such a “social watching” effect?


In [138]:
# holder list
u_stats8 = []

# iterate through all 400 movies
# in each iteration, split by watching preference and record whether the U test is significant
for i in range(400):
    movie = movies.iloc[:,[i, 476]]
    movie_title = movie.columns[0]
    status_title = movie.columns[1]
    
    alone = movie[movie[status_title] == 1][movie_title]
    alone_arr = alone.values[np.isfinite(alone)]
    socially = movie[movie[status_title] == 0][movie_title]
    socially_arr = socially.values[np.isfinite(socially)]
    
    u8, pu8 = stats.mannwhitneyu(alone_arr, socially_arr, alternative = 'two-sided')
    if pu8<0.005:
        u_stats8.append(1)
    else:
        u_stats8.append(0)

In [139]:
# calculate proportion of movies showing a difference
np.sum(u_stats8)/len(u_stats8)

0.025

Only 2.5% of movies exhibit a “social watching” effect.

#### 9) Is the ratings distribution of ‘Home Alone (1990)’ different than that of ‘Finding Nemo (2003)’? 

In [140]:
# Subset the movie
home_alone = movies['Home Alone (1990)']
# remove missing values
home_alone_arr = home_alone.values[np.isfinite(home_alone)]

# Subset the movie
finding_nemo = movies['Finding Nemo (2003)']
# remove missing values
finding_nemo_arr = finding_nemo.values[np.isfinite(finding_nemo)]

In [141]:
# Kolmogorov-Smirnov (KS) test
ks, p9 = stats.kstest(home_alone_arr, finding_nemo_arr, alternative = 'two-sided')
print(ks, p9)
p9<0.005

0.15269080020897632 6.379397182836346e-10


True

Since the p-value is less than our alpha of 0.005, we reject the null hypothesis and conclude
that the ratings distributions of the two movies are different.
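The KS statistic reported above is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes that gap by hand on toy data and compares it with `scipy.stats.ks_2samp`; `ks_statistic` is a hypothetical helper added for illustration, not a replacement for the library call.

```python
import numpy as np
from scipy import stats

def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side='right') / len(x)
    cdf_y = np.searchsorted(y, grid, side='right') / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

# Toy samples (illustrative only)
demo_x = [1, 2, 3, 4]
demo_y = [3, 4, 5, 6]
manual_ks = ks_statistic(demo_x, demo_y)
scipy_ks = stats.ks_2samp(demo_x, demo_y).statistic
print(manual_ks, scipy_ks)
```

Because ratings take only a handful of discrete values, ties are heavy, and the KS p-value should be read with some caution since the test assumes continuous distributions.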

#### 10) There are ratings on movies from several franchises (‘Star Wars’, ‘Harry Potter’, ‘The Matrix’, ‘Indiana Jones’, ‘Jurassic Park’, ‘Pirates of the Caribbean’, ‘Toy Story’, ‘Batman’) in this dataset. How many of these are of consistent quality, as experienced by viewers? 

In [248]:
# find movies under the ‘Star Wars’ franchise
sw = movies.loc[:,movies.columns.str.contains('Star Wars')].values
print(sw.shape)
# find movies under the ‘Harry Potter’ franchise
hp = movies.loc[:,movies.columns.str.contains('Harry Potter')].values
print(hp.shape)
# find movies under the ‘The Matrix’ franchise
tm = movies.loc[:,movies.columns.str.contains('The Matrix')].values
print(tm.shape)
# find movies under the ‘Indiana Jones’ franchise
ij = movies.loc[:,movies.columns.str.contains('Indiana Jones')].values
print(ij.shape)
# find movies under the ‘Jurassic Park’ franchise
jp = movies.loc[:,movies.columns.str.contains('Jurassic Park')].values
print(jp.shape)
# find movies under the ‘Pirates of the Caribbean’ franchise
potc = movies.loc[:,movies.columns.str.contains('Pirates of the Caribbean')].values
print(potc.shape)
# find movies under the ‘Toy Story’ franchise
ts = movies.loc[:,movies.columns.str.contains('Toy Story')].values
print(ts.shape)
# find movies under the ‘Batman’ franchise
b = movies.loc[:,movies.columns.str.contains('Batman')].values
print(b.shape)

(1097, 6)
(1097, 4)
(1097, 3)
(1097, 4)
(1097, 3)
(1097, 3)
(1097, 3)
(1097, 3)


In [249]:
# Kruskal-Wallis test of each franchise
h1, p_1 = stats.kruskal(sw.T[0], sw.T[1], sw.T[2], sw.T[3], sw.T[4], sw.T[5], nan_policy='omit')
h2, p_2 = stats.kruskal(hp.T[0], hp.T[1], hp.T[2], hp.T[3], nan_policy='omit')
h3, p_3 = stats.kruskal(tm.T[0], tm.T[1], tm.T[2], nan_policy='omit')
h4, p_4 = stats.kruskal(ij.T[0], ij.T[1], ij.T[2], ij.T[3], nan_policy='omit')
h5, p_5 = stats.kruskal(jp.T[0], jp.T[1], jp.T[2], nan_policy='omit')
h6, p_6 = stats.kruskal(potc.T[0], potc.T[1], potc.T[2], nan_policy='omit')
h7, p_7 = stats.kruskal(ts.T[0], ts.T[1], ts.T[2], nan_policy='omit')
h8, p_8 = stats.kruskal(b.T[0], b.T[1], b.T[2], nan_policy='omit')

In [250]:
print(h1, p_1)
print(p_1<0.005)
print(h2, p_2)
print(p_2<0.005)
print(h3, p_3)
print(p_3<0.005)
print(h4, p_4)
print(p_4<0.005)
print(h5, p_5)
print(p_5<0.005)
print(h6, p_6)
print(p_6<0.005)
print(h7, p_7)
print(p_7<0.005)
print(h8, p_8)
print(p_8<0.005)

230.5841753686405 8.01647736660335e-48
True
3.331230732890868 0.34331950837289205
False
48.378866521305774 3.1236517880781424e-11
True
45.79416340261569 6.27277563979608e-10
True
46.59088064385298 7.636930084362221e-11
True
20.64399756002606 3.2901287079094474e-05
True
24.38599493626327 5.065805156537524e-06
True
190.53496872634642 4.2252969509030006e-42
True


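Only the 'Harry Potter' franchise shows no significant difference in ratings across its movies (p ≈ 0.34), so it is the only one of the eight judged to be of consistent quality.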
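The eight near-identical Kruskal-Wallis calls can be collapsed into a loop over franchise names. The following is a minimal sketch under the assumption that a franchise is identified by a literal substring of the column name, as in the cells above; `franchise_kruskal` and the toy franchises 'Alpha' and 'Beta' are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

def franchise_kruskal(df, franchises, alpha=0.005):
    """Run a Kruskal-Wallis test across the movies of each franchise."""
    results = {}
    for name in franchises:
        # Select columns whose title contains the franchise name literally
        cols = df.columns[df.columns.str.contains(name, regex=False)]
        samples = [df[c].dropna().to_numpy() for c in cols]
        h, p = stats.kruskal(*samples)
        results[name] = (h, p, p < alpha)
    return results

# Toy data: 'Alpha' movies share one distribution, 'Beta' movies clearly differ
demo = pd.DataFrame({
    'Alpha (2001)': [1.0, 2.0, 3.0, 4.0] * 4,
    'Alpha II (2003)': [1.0, 2.0, 3.0, 4.0] * 4,
    'Alpha III (2005)': [1.0, 2.0, 3.0, 4.0] * 4,
    'Beta (2000)': [1.0] * 15 + [np.nan],
    'Beta II (2002)': [2.5] * 16,
    'Beta III (2004)': [4.0] * 16,
})
demo_results = franchise_kruskal(demo, ['Alpha', 'Beta'])
for name, (h, p, significant) in demo_results.items():
    print(name, h, p, significant)
```

Substring matching can also pick up unrelated titles that happen to contain the franchise name, so printing the matched columns for each franchise is a sensible check before trusting the result.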

#### 11) Are movies with longer titles rated higher than movies with shorter titles?

In [152]:
# median-split of title length to determine long title and short title movies
title_len = np.vectorize(len)(movies_ratings.columns.str[:-7])
median_10 = np.median(title_len)

short = movies_ratings.loc[:,title_len < median_10]
long = movies_ratings.loc[:,title_len >= median_10]
print(short.shape)
print(long.shape)

(1097, 182)
(1097, 218)


In [153]:
# vectorize shorter half
short_arr = short.values.flatten()
# remove missing values
short_arr = short_arr[np.isfinite(short_arr)]
np.median(short_arr)

3.0

In [154]:
# vectorize longer half
long_arr = long.values.flatten()
# remove missing values
long_arr = long_arr[np.isfinite(long_arr)]
np.median(long_arr)

3.0

In [155]:
# U test
u10, pu10 = stats.mannwhitneyu(short_arr, long_arr, alternative = 'less')
print(u10, pu10)
pu10<0.005

1545200237.0 0.0012739213298039243


True


Since the p-value is less than our alpha of 0.005, we reject the null hypothesis and conclude
that movies with longer titles are rated higher than movies with shorter
titles.