# MovieLens 1M Dataset

GroupLens Research provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms.

While we do not explore machine learning techniques in detail in this notebook, I will show you how to slice and dice datasets like these into the exact form you need. The MovieLens 1M dataset contains <b>1 million ratings</b> collected from 6,000 users on 4,000 movies. It’s spread across three tables: 
<ul><li>ratings</li><li>user information</li><li>movie information</li></ul>
    
***

You can find this dataset by searching for "MovieLens 1M Dataset" or <a href="https://grouplens.org/datasets/movielens/1m/" target="_blank">here</a>. <br> The three files are zipped, so extract them into your working directory.

***

In [18]:
import pandas as pd

<h2>Let's first read users data</h2>

In [22]:
users = pd.read_table('Pandas/users.dat')

In [21]:
users

Unnamed: 0,1::F::1::10::48067
0,2::M::56::16::70072
1,3::M::25::15::55117
2,4::M::45::7::02460
3,5::M::25::20::55455
4,6::F::50::9::55117
...,...
6034,6036::F::25::15::32603
6035,6037::F::45::1::76006
6036,6038::F::56::1::14706
6037,6039::F::45::0::01060


***
<h2>We have a few observations to make here </h2><br>
<h3>
<ol>
    <li>Data is all squeezed into one column</li>
    <li>Data elements are separated by the <b><code>"::"</code></b> separator</li>
    <li>There are no column headers</li>
    <li>First row of data is read as headers</li>
</ol> 
</h3>
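
The four observations above can be reproduced on a tiny in-memory sample. This is a minimal sketch, assuming three lines copied from the output shown earlier stand in for the real `users.dat`; the names `sample`, `raw`, `parsed`, and `cols` are illustrative only.

```python
import io

import pandas as pd

# Three records copied from the users.dat output shown above.
sample = "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"

# Default settings: tab separator, first line treated as the header.
raw = pd.read_table(io.StringIO(sample))
print(raw.shape)  # a single column, and the first record was consumed as the header

# Splitting on '::' with no header row and explicit column names.
# A multi-character separator is handled by the Python parsing engine.
cols = ["user_id", "gender", "age", "occupation", "zip"]
parsed = pd.read_table(
    io.StringIO(sample), sep="::", header=None, names=cols, engine="python"
)
print(parsed)
```

Comparing `raw.shape` with `parsed.shape` is a quick way to confirm that both the separator and the header handling changed as intended.
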
<h4>The documentation on the page where you downloaded the dataset lists the column headers for each table, along with other relevant details. You will often have to read a dataset's documentation before you start using it.</h4><br>
<h4>Now that we understand the data better, let's work on the dataset to make it usable. This step is called <code>data cleansing</code>.</h4>
<br><h4>Let's read all three tables</h4>

***

In [24]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator requires the Python engine; naming it explicitly avoids a ParserWarning
users = pd.read_table('Pandas/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('Pandas/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('Pandas/movies.dat', sep='::', header=None, names=mnames, engine='python')



***

<h4>You will see we have addressed all our observations:<br>
<ul>
<li>OBSERVATIONS 1 and 2: by using <code>sep='::'</code></li>
<li>OBSERVATION 3: by creating our own list of header names and passing it as <code>names=unames</code></li>
    <li>OBSERVATION 4: by using <code>header=None</code>, so the first row is read as data</li></ul></h4>
Let's see if these worked.

***

In [27]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [28]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [29]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


***

Note that ages and occupations are coded as integers indicating groups described in
the dataset’s README file. Analyzing the data spread across three tables is not a simple
task; for example, suppose you wanted to compute mean ratings for a particular
movie by sex and age. As you will see, this is much easier to do with all of the data
merged together into a single table. Using pandas’s <b><code>merge</code></b> function, we first merge
ratings with users and then merge that result with the movies data. pandas infers
which columns to use as the merge (or join) keys based on overlapping names:

***

In [32]:
data = pd.merge(pd.merge(ratings, users), movies)
data

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000204,5949,2198,5,958846401,M,18,17,47901,Modulations (1998),Documentary
1000205,5675,2703,3,976029116,M,35,14,30030,Broken Vessels (1998),Drama
1000206,5780,2845,1,958153068,M,18,17,92886,White Boys (1999),Drama
1000207,5851,3607,5,957756608,F,18,20,55410,One Little Indian (1973),Comedy|Drama|Western


***
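
The merge above relies on pandas inferring the join keys from overlapping column names. As a hedged alternative, the keys can be named explicitly, which makes the intent visible and guards against an accidental match on an unrelated shared column. The sketch below uses made-up toy tables that only borrow the MovieLens column names; `validate="many_to_one"` asks pandas to raise an error if a key is duplicated on the right-hand side.

```python
import pandas as pd

# Toy tables with the same column names as the MovieLens frames (values are made up).
ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Name the join keys explicitly instead of relying on overlapping column names.
merged = (
    ratings
    .merge(users, on="user_id", validate="many_to_one")
    .merge(movies, on="movie_id", validate="many_to_one")
)

# Every rating should survive the two inner joins.
print(len(merged) == len(ratings))
print(merged)
```
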

<h4>The ratings table shows that each user has multiple ratings, yet only one row for user 1 is visible above. Where did the others go? Let's check the user with user_id = 1.</h4>

***

In [53]:
data[data['user_id']==1][10:]

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
12759,1,595,5,978824268,F,1,10,48067,Beauty and the Beast (1991),Animation|Children's|Musical
13819,1,938,4,978301752,F,1,10,48067,Gigi (1958),Musical
14006,1,2398,4,978302281,F,1,10,48067,Miracle on 34th Street (1947),Drama
14386,1,2918,4,978302124,F,1,10,48067,Ferris Bueller's Day Off (1986),Comedy
15859,1,1035,5,978301753,F,1,10,48067,"Sound of Music, The (1965)",Musical
16741,1,2791,4,978302188,F,1,10,48067,Airplane! (1980),Comedy
18472,1,2687,3,978824268,F,1,10,48067,Tarzan (1999),Animation|Children's
18914,1,2018,4,978301777,F,1,10,48067,Bambi (1942),Animation|Children's
19503,1,3105,5,978301713,F,1,10,48067,Awakenings (1990),Drama
20183,1,2797,4,978302039,F,1,10,48067,Big (1988),Comedy|Fantasy


***
<h4>Lesson learned: never rely solely on what is displayed. Because the data is large, pandas truncates the output to show it concisely, so always interrogate the data properly.</h4>
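
A few quick checks can replace eyeballing a truncated display. This is a minimal sketch on a made-up stand-in for the merged `data` frame; the values are illustrative and not taken from MovieLens.

```python
import pandas as pd

# Made-up stand-in for the merged `data` frame.
data = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 1],
    "title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "rating": [5, 4, 3, 4, 5],
})

print(data.shape)                      # total number of rows and columns
per_user = data["user_id"].value_counts()
print(per_user)                        # number of ratings per user
n_user_1 = (data["user_id"] == 1).sum()
print(n_user_1)                        # number of rows belonging to user 1
```

On the real data, comparing `len(data)` with `len(ratings)` is another way to confirm that no ratings were lost in the merge.
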


To get the mean rating for each film grouped by gender, we can use the <code>pivot_table</code> method:
***

In [36]:
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]   # show only the first 5 rows as a quick check that the output is as intended

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024


***
<h4>This produced another DataFrame containing mean ratings with movie titles as row
labels (the “index”) and gender as column labels. Next, I filter down to movies that
received at least 250 ratings (a completely arbitrary number). To do this, I first group
the data by title and use <code><b>size()</b></code> to get a Series of group sizes for each title:</h4>

***

In [37]:
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]

title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

In [38]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

***
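
The counting and filtering steps above can be tried on toy data. In this sketch the threshold `min_ratings` is a hypothetical stand-in for 250, and the frame contents are made up. It also shows that `value_counts` yields the same counts, ordered by frequency rather than by title.

```python
import pandas as pd

# Made-up ratings: Movie A has 3, Movie B has 1, Movie C has 2.
data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Number of ratings per title, indexed by title.
counts = data.groupby("title").size()

# Keep only the titles that meet the threshold.
min_ratings = 2  # hypothetical threshold, standing in for 250
active = counts.index[counts >= min_ratings]
print(list(active))

# Same counts, sorted by frequency instead of alphabetically.
same_counts = data["title"].value_counts()
print(same_counts)
```
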
<h4>The index of titles receiving at least 250 ratings can then be used to select rows from mean_ratings:</h4>

***

In [40]:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings     # note that this table has fewer rows than the previous mean_ratings table

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


***
<h4>To see the top films among female viewers, we can sort by the F column in descending order:</h4>

***

In [42]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.572650,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
...,...,...
"Avengers, The (1998)",1.915254,2.017467
Speed 2: Cruise Control (1997),1.906667,1.863014
Rocky V (1990),1.878788,2.132780
Barb Wire (1996),1.585366,2.100386


***
<h2>Measuring Rating Disagreement</h2>
Suppose you wanted to find the movies that are most divisive between male and female viewers. One way is to add a column to <code>mean_ratings</code> containing the difference in means, then sort by that:

***

In [45]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings   # new column added.

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"'burbs, The (1989)",2.793478,2.962085,0.168607
10 Things I Hate About You (1999),3.646552,3.311966,-0.334586
101 Dalmatians (1961),3.791444,3.500000,-0.291444
101 Dalmatians (1996),3.240000,2.911215,-0.328785
12 Angry Men (1957),4.184397,4.328421,0.144024
...,...,...,...
Young Guns (1988),3.371795,3.425620,0.053825
Young Guns II (1990),2.934783,2.904025,-0.030758
Young Sherlock Holmes (1985),3.514706,3.363344,-0.151362
Zero Effect (1998),3.864407,3.723140,-0.141266


***
<h4>Sorting by 'diff' in ascending order puts the movies with the most negative difference first, which are the ones women rated higher than men did:</h4>

***

In [46]:
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
Anastasia (1997),3.8,3.281609,-0.518391
"Rocky Horror Picture Show, The (1975)",3.673016,3.160131,-0.512885
"Color Purple, The (1985)",4.158192,3.659341,-0.498851
"Age of Innocence, The (1993)",3.827068,3.339506,-0.487561
Free Willy (1993),2.921348,2.438776,-0.482573


***
<h4>Reversing the order of the rows and again slicing off the top 10 rows, we get the movies preferred by men that women didn’t rate as highly: </h4>

***

In [48]:
sorted_by_diff[::-1][:10]  # note the slice [::-1] reverses the order and [:10] slices top 10 records

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787
Evil Dead II (Dead By Dawn) (1987),3.297297,3.909283,0.611985
"Hidden, The (1987)",3.137931,3.745098,0.607167
Rocky III (1982),2.361702,2.943503,0.581801
Caddyshack (1980),3.396135,3.969737,0.573602
For a Few Dollars More (1965),3.409091,3.953795,0.544704


***
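
The two sorted views above each show one direction of the gap. As a hedged variation, sorting on the absolute value of the difference ranks titles by the size of the gap regardless of which group rated higher. The sketch below rebuilds the pivot table on made-up data; the name `most_divisive` is illustrative.

```python
import pandas as pd

# Made-up ratings for two titles, two ratings per gender each.
data = pd.DataFrame({
    "title": ["Movie A"] * 4 + ["Movie B"] * 4,
    "gender": ["F", "F", "M", "M"] * 2,
    "rating": [5, 4, 2, 3, 3, 3, 4, 3],
})

# Mean rating per title and gender, then the M minus F difference.
mean_ratings = data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

# Rank by the size of the gap, ignoring its sign.
most_divisive = mean_ratings["diff"].abs().sort_values(ascending=False)
print(mean_ratings)
print(most_divisive)
```
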
<h4>Suppose instead you wanted the movies that elicited the most disagreement among
viewers, independent of gender identification. Disagreement can be measured by the
variance or standard deviation of the ratings: </h4>

***

In [51]:
rating_std_by_title = data.groupby('title')['rating'].std()  # group by title and compute the standard deviation of ratings
rating_std_by_title = rating_std_by_title.loc[active_titles] # filter the result down to the active titles
rating_std_by_title.sort_values(ascending=False)[:10]    # sort the Series by value in descending order

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
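
To see why standard deviation captures disagreement, it can help to look at it next to the mean and the count. This is a minimal sketch on made-up ratings: both titles have similar means, but one has polarized ratings. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Movie A is polarized (1s and 5s); Movie B is rated consistently.
data = pd.DataFrame({
    "title": ["Movie A"] * 4 + ["Movie B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Mean, sample standard deviation, and number of ratings per title.
spread = data.groupby("title")["rating"].agg(["mean", "std", "count"])
print(spread.sort_values("std", ascending=False))
```
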

<h1> END OF SECTION </h1>