<a href="https://colab.research.google.com/github/gKorada/PythonNotebooks/blob/main/Data_Exploration_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Movie Analysis: Does a Movie gain a higher Rating based on it's rating?

First, let us get our imports out of the way.

In [None]:
import pandas as pd
import io

Now that we have our imports out of the way, let us upload all of our data frames.

In [None]:
links_df = pd.read_csv('/content/links.csv')
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
movie_df = pd.read_csv('/content/movies.csv')
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
ratings_df = pd.read_csv('/content/ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


So the goal of our analysis is to see if depending on the genre a movie can gain a higher rating. To do that, we must first merge the movie data set and the ratings data set.

In [None]:
movie_rating = pd.merge(movie_df,ratings_df, on = 'movieId')

movie_rating.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


As you can see, the two dataframes are merged into one data frame. However, for effective utilization, we have to seperate the genres so we can use a groupby statement to conduct our EDA.

To do this, I will use the '.get_dummies' function in pandas to easily see what values, in this case genres, are satisfied by each movie.


In [None]:
genre_seperated = movie_rating['genres'].str.get_dummies(sep = '|')

genre_seperated.head()

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0


Now the genres are seperated, but we lost a few key columns. So let's add those back, drop the genres column and then re-merge the ratings dataset do we can perfom our EDA


In [None]:
genre_seperated['movieId'] = movie_df['movieId']

movie_rating_genre = pd.merge(movie_rating , genre_seperated, on = 'movieId')

movie_rating_genre.drop(columns = ['genres'], inplace = True)
movie_rating_genre.head()


Unnamed: 0,movieId,title,userId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),1,4.0,964982703,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,Toy Story (1995),5,4.0,847434962,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,1,Toy Story (1995),7,4.5,1106635946,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,1,Toy Story (1995),15,2.5,1510577970,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,1,Toy Story (1995),17,4.5,1305696483,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


We have successfully made our working data frame! Now, we can start our EDA.

Wait, before we proceed further, let's clean the data. At this point, duplicate rows or missing values haven't affected our process, but we should still check. (if 'movieId' had a missing value we would have had hiccups already so we sould be good there)

In [None]:
movie_rating_genre.duplicated().sum()

0

No duplicates

In [None]:
movie_rating_genre.isnull().sum()

movieId               0
title                 0
userId                0
rating                0
timestamp             0
(no genres listed)    0
Action                0
Adventure             0
Animation             0
Children              0
Comedy                0
Crime                 0
Documentary           0
Drama                 0
Fantasy               0
Film-Noir             0
Horror                0
IMAX                  0
Musical               0
Mystery               0
Romance               0
Sci-Fi                0
Thriller              0
War                   0
Western               0
dtype: int64

No columns have null/missing values, so our data frame is good to go.

Let's first try grouping by rating to see which genres pop up more.

In [None]:
rating_grouped = movie_rating_genre.groupby('rating')

rating_grouped.head()

Unnamed: 0,movieId,title,userId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),1,4.0,964982703,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,Toy Story (1995),5,4.0,847434962,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,1,Toy Story (1995),7,4.5,1106635946,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,1,Toy Story (1995),15,2.5,1510577970,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,1,Toy Story (1995),17,4.5,1305696483,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
5,1,Toy Story (1995),18,3.5,1455209816,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
6,1,Toy Story (1995),19,4.0,965705637,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
7,1,Toy Story (1995),21,3.5,1407618878,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
8,1,Toy Story (1995),27,3.0,962685262,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9,1,Toy Story (1995),31,5.0,850466616,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


Let's look at the extremes (rating = 0.5 /rating = 5.0)

In [None]:
movie_rating_genre.loc[movie_rating_genre['rating'] == 0.5 ].head()

Unnamed: 0,movieId,title,userId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
26,1,Toy Story (1995),76,0.5,1439165548,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
261,2,Jumanji (1995),298,0.5,1450452897,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
355,3,Grumpier Old Men (1995),308,0.5,1421374465,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
424,5,Father of the Bride Part II (1995),490,0.5,1324370305,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
722,10,GoldenEye (1995),517,0.5,1487957717,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
movie_rating_genre.loc[movie_rating_genre['rating'] == 5 ].head()

Unnamed: 0,movieId,title,userId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
9,1,Toy Story (1995),31,5.0,850466616,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
12,1,Toy Story (1995),40,5.0,832058959,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
13,1,Toy Story (1995),43,5.0,848993983,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
16,1,Toy Story (1995),46,5.0,834787906,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
19,1,Toy Story (1995),57,5.0,965796031,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0


Ok, so grouping by rating and finding the pole's didn't do much. Let's go at another angle, let's average out the rating of each movie to get a good average for each movie.

In [None]:
movie_avg = movie_rating_genre.groupby('movieId').apply(lambda x:x['rating'].mean())
movie_avg.head()

movieId
1    3.920930
2    3.431818
3    3.259615
4    2.357143
5    3.071429
dtype: float64

Now lets sort the data and look the first 10 and last 10 to see if we see any commonalities without conducting a statistical analysis.

In [None]:
movie_avg_sorted = movie_avg.sort_values(ascending = False)

In [None]:
movie_avg_sorted.head(10)

movieId
88448     5.0
100556    5.0
143031    5.0
143511    5.0
143559    5.0
6201      5.0
102217    5.0
102084    5.0
6192      5.0
145994    5.0
dtype: float64

In [None]:
movie_avg_sorted.tail(10)

movieId
44243     0.5
72424     0.5
6371      0.5
82684     0.5
137517    0.5
157172    0.5
85334     0.5
53453     0.5
8494      0.5
71810     0.5
dtype: float64

Ok, we have the sorted movies, however, this might be a lot more work to see commanilites movie by movie like this. Maybe it's best we just the built in Pandas Statistical tools.

In [None]:
movie_rating_genre.describe()

Unnamed: 0,movieId,userId,rating,timestamp,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,...,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0
mean,19435.295718,326.127564,3.501557,1205946000.0,0.0,0.251924,0.269051,0.080378,0.16268,0.353713,...,0.004026,0.034154,0.016145,0.01072,0.102592,0.165943,0.101382,0.30382,0.039073,0.007884
std,35530.987199,182.618491,1.042529,216261000.0,0.0,0.43412,0.443469,0.271879,0.369075,0.478124,...,0.063326,0.181627,0.126034,0.102983,0.303427,0.372031,0.301836,0.459908,0.19377,0.088442
min,1.0,1.0,0.5,828124600.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1199.0,177.0,3.0,1019124000.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2991.0,325.0,3.5,1186087000.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,8122.0,477.0,4.0,1435994000.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,193609.0,610.0,5.0,1537799000.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Ok, so this doesn't really tell us much, so let's do it a different way where we can see the stats in terms of ratings

In [None]:
movie_rating_genre.iloc[:, 5:].multiply(movie_rating_genre['rating'], axis = 'index').describe()


Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0,100836.0
mean,0.0,0.882844,0.949829,0.280683,0.573406,1.208358,0.874152,0.012629,1.397641,0.474493,0.015649,0.120081,0.057346,0.038548,0.358845,0.579416,0.353966,1.063702,0.133097,0.027842
std,0.0,1.608305,1.657202,0.99523,1.368937,1.745926,1.615711,0.218445,1.847636,1.263571,0.252391,0.667385,0.465779,0.38458,1.110765,1.365095,1.102128,1.709434,0.691788,0.325212
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.5,2.0,0.0,0.0,3.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0
max,0.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


Now that we have a proper distribution, we can draw a proper conclusion.

As we can see, there are very few genres which have an average higher than 1, and these genres are Comedy, Drama, and Thriller, with average ratings of 1.21, 1.40 ,1.06 respectively. These 3 genres also have higher standard deviations, meaing they also have a more diverse contents. This may be the reason why these 3 genres have the highest average ratings as well!