# MovieLens 1M dataset

The MovieLens 1M dataset contains about one million ratings collected from roughly 6,000 users on roughly 4,000 films. The data is spread across three tables: ratings, user information, and film information.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [5]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('datasets/movielens/users.dat', sep='::',
                      header=None, names=unames, engine='python')
users

,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [6]:
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
rating = pd.read_table('datasets/movielens/ratings.dat', sep='::',
                       header=None, names=rnames, engine='python')
rating

,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [7]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies

,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


- Merging the three tables into one: first merge rating with users, then merge the result with movies

In [23]:
data = pd.merge(pd.merge(rating, users), movies)
data

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000204,5949,2198,5,958846401,M,18,17,47901,Modulations (1998),Documentary
1000205,5675,2703,3,976029116,M,35,14,30030,Broken Vessels (1998),Drama
1000206,5780,2845,1,958153068,M,18,17,92886,White Boys (1999),Drama
1000207,5851,3607,5,957756608,F,18,20,55410,One Little Indian (1973),Comedy|Drama|Western
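
The nested `pd.merge` call above names no join keys, so pandas joins on the column names the two frames have in common (`user_id` for the first merge, `movie_id` for the second). The sketch below uses small made-up frames, not the real dataset, to show that stating the keys with `on=` gives the same result; the toy values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables (illustrative values, not from the dataset)
rating = pd.DataFrame({'user_id': [1, 1, 2],
                       'movie_id': [10, 20, 10],
                       'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the columns shared by both frames
implicit = pd.merge(pd.merge(rating, users), movies)

# Explicit keys: the same joins, with the key columns spelled out
explicit = rating.merge(users, on='user_id').merge(movies, on='movie_id')

print(explicit)
```

Spelling out `on=` is a little more verbose, but it guards against an accidental join on some other column that happens to share a name.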


In [31]:
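# Number of ratings per genre combination; split() with no separator splits on whitespace, so each pipe-separated genre string stays whole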
results = pd.Series([x.split()[0] for x in data.genres.dropna()])
r = results.value_counts()
r[:15]

Comedy                     116883
Drama                      111423
Comedy|Romance              42712
Comedy|Drama                42245
Drama|Romance               29170
Action|Thriller             26759
Horror                      22563
Drama|Thriller              18248
Thriller                    17851
Action|Adventure|Sci-Fi     17783
Drama|War                   14656
Action|Sci-Fi               14309
Action|Sci-Fi|Thriller      13970
Action                      12311
Action|Drama|War            12224
Name: count, dtype: int64
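
The counts above are per genre combination (for example `Comedy|Romance`), and they count rating rows rather than distinct films. If counts per individual genre are wanted instead, one possible approach is to split on `'|'` and give each genre its own row with `explode`. This is a sketch on a made-up frame called `toy`, which stands in for `data`; its values are assumptions.

```python
import pandas as pd

# Toy frame standing in for `data` (illustrative values only)
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'C'],
    'genres': ['Comedy|Romance', 'Comedy|Romance', 'Drama', 'Comedy|Drama'],
})

# Split each pipe-separated string into a list, then one row per genre
per_genre = toy['genres'].dropna().str.split('|').explode()

# Number of rating rows that mention each individual genre
genre_counts = per_genre.value_counts()
print(genre_counts)
```

A rating for a film tagged with several genres is counted once under each of them, so these counts add up to more than the number of rows.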

In [42]:
# Mean rating of each film, broken down by the viewer's gender
mean_rating = data.pivot_table('rating', index='title',
                               columns='gender', aggfunc='mean')
mean_rating

title,F,M
"$1,000,000 Duck (1971)",3.375000,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
...,...,...
"Zed & Two Noughts, A (1985)",3.500000,3.380952
Zero Effect (1998),3.864407,3.723140
Zero Kelvin (Kjærlighetens kjøtere) (1995),,3.500000
Zeus and Roxanne (1997),2.777778,2.357143
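
To make the pivot step concrete: `pivot_table` with `aggfunc='mean'` behaves like grouping by both keys, averaging, and then moving one key into the columns. The sketch below checks this on a small made-up frame (the `toy` name and its values are assumptions). A title with no ratings from one gender gets `NaN` in that column, as with "Zero Kelvin" above.

```python
import pandas as pd

# Toy ratings (illustrative values only); title B has no ratings from F
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'M', 'M'],
    'rating': [4, 2, 4, 5, 3],
})

# One row per title, one column per gender, cells hold the mean rating
via_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Same table built by hand: group, average, then unstack gender into columns
via_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(via_pivot)
```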


- Filtering to films that received at least 250 ratings: first group the data by title to get the number of ratings per title

In [53]:
rating_by_title = data.groupby('title').size()
rating_by_title[500:515]

title
Boys on the Side (1995)                            195
Boys, The (1997)                                    11
Braddock: Missing in Action III (1988)              81
Brady Bunch Movie, The (1995)                      418
Brain That Wouldn't Die, The (1962)                 56
Braindead (1992)                                    70
Bram Stoker's Dracula (1992)                       672
Brandon Teena Story, The (1998)                     49
Brassed Off (1996)                                 238
Braveheart (1995)                                 2443
Brazil (1985)                                      913
Bread and Chocolate (Pane e cioccolata) (1973)      67
Breakdown (1997)                                   354
Breaker Morant (1980)                              360
Breakfast Club, The (1985)                        1539
dtype: int64

In [51]:
active_titles = rating_by_title.index[rating_by_title >= 250]
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

In [54]:
mean_rating = mean_rating.loc[active_titles]
mean_rating

title,F,M
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140
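
The filtering above works in three steps: count the ratings per title, keep the index labels whose count meets the threshold, and select those labels with `.loc`. The sketch below walks through the same pattern on made-up data; `toy` and `min_ratings` are hypothetical names, and the threshold of 2 replaces the notebook's 250 only so that the toy data has something to filter.

```python
import pandas as pd

# Toy ratings (illustrative values only)
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 3, 2, 4, 4],
})
min_ratings = 2  # hypothetical threshold; the notebook uses 250

# Step 1: number of ratings per title
counts = toy.groupby('title').size()

# Step 2: boolean mask on the counts selects the index labels to keep
active = counts.index[counts >= min_ratings]

# Step 3: restrict another per-title result to those labels
means = toy.groupby('title')['rating'].mean()
filtered = means.loc[active]
print(filtered)
```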


- Top-rated films among female viewers

In [56]:
top_female_ratings = mean_rating.sort_values(by='F', ascending=False)
top_female_ratings.drop(columns='M')

title,F
"Close Shave, A (1995)",4.644444
"Wrong Trousers, The (1993)",4.588235
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.572650
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107
Schindler's List (1993),4.562602
...,...
"Avengers, The (1998)",1.915254
Speed 2: Cruise Control (1997),1.906667
Rocky V (1990),1.878788
Barb Wire (1996),1.585366


- Measuring the gap between male and female ratings

In [61]:
# add a column with the difference in mean ratings (M minus F)
mean_rating['diff'] = mean_rating['M'] - mean_rating['F']
sorted_by_diff = mean_rating.sort_values(by='diff')
# films preferred by women (most negative diff)
sorted_by_diff[:15]

title,F,M,diff
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
Anastasia (1997),3.8,3.281609,-0.518391
"Rocky Horror Picture Show, The (1975)",3.673016,3.160131,-0.512885
"Color Purple, The (1985)",4.158192,3.659341,-0.498851
"Age of Innocence, The (1993)",3.827068,3.339506,-0.487561
Free Willy (1993),2.921348,2.438776,-0.482573


In [62]:
# films preferred by men (most positive diff)
sorted_by_diff[::-1][:15]

title,F,M,diff
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787
Evil Dead II (Dead By Dawn) (1987),3.297297,3.909283,0.611985
"Hidden, The (1987)",3.137931,3.745098,0.607167
Rocky III (1982),2.361702,2.943503,0.581801
Caddyshack (1980),3.396135,3.969737,0.573602
For a Few Dollars More (1965),3.409091,3.953795,0.544704
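
Because `diff` is defined as `M - F`, the most negative values mark films women rated higher and the most positive values mark films men rated higher, so the two ends of the sorted frame answer the two questions. The sketch below reproduces this on a made-up table (`toy` and its values are assumptions). It also shows `nlargest`, which returns the same rows as reversing the frame and slicing when there are no ties.

```python
import pandas as pd

# Toy mean ratings per title (illustrative values only)
toy = pd.DataFrame({'F': [3.8, 3.5, 2.7, 4.0],
                    'M': [3.0, 4.2, 3.3, 4.1]},
                   index=['A', 'B', 'C', 'D'])

toy['diff'] = toy['M'] - toy['F']          # positive means men rated it higher
sorted_by_diff = toy.sort_values(by='diff')

preferred_by_women = sorted_by_diff[:2]        # most negative diffs first
preferred_by_men = sorted_by_diff[::-1][:2]    # most positive diffs first

# Same selection as preferred_by_men, without reversing the whole frame
top_men = toy.nlargest(2, 'diff')
print(preferred_by_women, preferred_by_men, sep='\n\n')
```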


- Films with the most disagreement among viewers, regardless of gender

In [64]:
# Standard deviation of ratings, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# restrict to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]

# sort the Series in descending order
rating_std_by_title.sort_values(ascending=False)

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
                                           ...   
Wrong Trousers, The (1993)               0.708666
Shawshank Redemption, The (1994)         0.700443
Great Escape, The (1963)                 0.692585
Rear Window (1954)                       0.688946
Close Shave, A (1995)                    0.667143
Name: rating, Length: 1216, dtype: float64
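
As a rough illustration of why standard deviation works as a measure of disagreement, the sketch below compares a polarizing title with a consensus title on made-up ratings (the `toy` frame and its values are assumptions). pandas' `std` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy ratings: A splits viewers between 1 and 5, B clusters around 3
toy = pd.DataFrame({
    'title': ['A'] * 4 + ['B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation (ddof=1) of the ratings for each title
spread = toy.groupby('title')['rating'].std()

# Most divisive titles first
most_divisive = spread.sort_values(ascending=False)
print(most_divisive)
```

Both titles have a mean rating close to 3, so the mean alone would not separate them; the spread does.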