# Downloading data

## The MovieLens dataset

[MovieLens](http://grouplens.org/datasets/movielens/) provides a number of datasets.

This notebook is an example of downloading [the small dataset](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip), which is described as follows:

> Small: 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016.

### Step 1: download the zipfile into a data/ subdirectory rather than the current directory

In [43]:
import os

cwd = os.getcwd()
data_dir = os.path.join(cwd, "data")
print(cwd, data_dir)

# if data_dir does not exist, create it
# (the mode must be an octal literal: 0o755, not the decimal number 755)
if not os.path.isdir(data_dir):
    os.mkdir(data_dir, 0o755)

/Users/ko/git/cs109-2015 /Users/ko/git/cs109-2015/data
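
As an alternative sketch (not the notebook's own method), `os.makedirs` with `exist_ok=True` removes the need for the explicit `isdir` check. The helper name `ensure_dir` is hypothetical, and the demonstration runs in a throwaway temporary directory rather than the real working directory.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    # exist_ok=True turns the call into a no-op when the directory is already there
    os.makedirs(path, mode=0o755, exist_ok=True)
    return path


# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = ensure_dir(os.path.join(tmp, "data"))
    ensure_dir(demo_dir)  # calling it a second time does not raise
    created = os.path.isdir(demo_dir)

print(created)
```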


In [18]:
import urllib.request

url = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
file_path = os.path.join(data_dir, "ml-latest-small.zip")

# only download if file doesn't exist:
if not os.path.isfile(file_path):
    print("Starting download...")
    urllib.request.urlretrieve(url, file_path)
    print("Downloaded", file_path)
else:
    print(f"File {file_path} already downloaded")

File /Users/ko/git/cs109-2015/data/ml-latest-small.zip already downloaded
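
One caveat of the existence check is that an interrupted download leaves a partial file that later runs would treat as complete. A minimal sketch of one way around this, assuming it is acceptable to write to a temporary `.part` file first and rename it afterwards, is shown below. The helper name `download_if_missing` is hypothetical, and the demonstration uses a local `file://` URL so it runs offline.

```python
import os
import pathlib
import tempfile
import urllib.request


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(dest):
        return False
    partial = dest + ".part"
    urllib.request.urlretrieve(url, partial)
    # Rename only after the transfer finished, so `dest` never holds a partial file
    os.replace(partial, dest)
    return True


# Offline demonstration: copy a local file through a file:// URL
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "source.bin"
    src.write_bytes(b"example payload")
    dest = os.path.join(tmp, "copy.bin")
    first = download_if_missing(src.as_uri(), dest)   # performs the copy
    second = download_if_missing(src.as_uri(), dest)  # skipped, file exists
    with open(dest, "rb") as fh:
        payload = fh.read()

print(first, second)
```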


Now to inspect and extract the downloaded zipfile:

In [104]:
import zipfile

z = zipfile.ZipFile(file_path)

# this unzips the zipfile inside the data_dir
z.extractall(data_dir)
z.close()
os.listdir(data_dir)

['.DS_Store', 'ml-latest-small', 'ml-latest-small.zip']
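
To look inside an archive before extracting it, `ZipFile.namelist()` and `ZipFile.infolist()` list its members. The sketch below builds a tiny in-memory archive with made-up contents so that it runs without the real download; only the member paths mimic the layout of the MovieLens zipfile.

```python
import io
import zipfile

# Build a small in-memory archive (toy contents, not the real dataset)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ml-latest-small/movies.csv", "movieId,title,genres\n")
    zf.writestr("ml-latest-small/README.txt", "toy readme\n")

# The context manager closes the archive even if an error occurs
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()  # member paths, in archive order
    sizes = {info.filename: info.file_size for info in zf.infolist()}

print(names)
print(sizes)
```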

The archive contains a top-level directory with the same name as the zipfile, so extracting it creates that directory inside the given path. The actual files are:

In [56]:
csv_path = os.path.join(data_dir, "ml-latest-small")
os.listdir(csv_path)

['links.csv', 'movies.csv', 'ratings.csv', 'README.txt', 'tags.csv']

To list only the CSV files, which hold the data:

In [98]:
[f for f in os.listdir(csv_path) if f.endswith(".csv")]

['links.csv', 'movies.csv', 'ratings.csv', 'tags.csv']
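
The same listing can be sketched with `glob`, which filters by pattern directly. The example below creates a few empty files in a temporary directory as stand-ins for the extracted dataset.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Empty stand-in files; the names mirror some of the extracted files
    for name in ["links.csv", "movies.csv", "README.txt"]:
        open(os.path.join(tmp, name), "w").close()

    # glob returns full paths in arbitrary order, so strip the directory and sort
    matches = glob.glob(os.path.join(tmp, "*.csv"))
    csv_files = sorted(os.path.basename(p) for p in matches)

print(csv_files)  # ['links.csv', 'movies.csv']
```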

### Step 2: take a look at the csv files

In [348]:
import pandas as pd

links = pd.read_csv(csv_path + "/links.csv", index_col="movieId")
movies = pd.read_csv(csv_path + "/movies.csv", index_col="movieId")
ratings = pd.read_csv(csv_path + "/ratings.csv")
tags = pd.read_csv(csv_path + "/tags.csv", index_col="movieId")

In [338]:
links.head()

movieId,imdbId,tmdbId
1,114709,862.0
2,113497,8844.0
3,113228,15602.0
4,114885,31357.0
5,113041,11862.0


In [349]:
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [351]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [340]:
tags.head()

movieId,userId,tag,timestamp
339,15,sandra 'boring' bullock,1138537770
1955,15,dentist,1193435061
7478,15,Cambodia,1170560997
32892,15,Russian,1170626366
34162,15,forgettable,1141391765


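To select only the first two columns of a DataFrame (the output below comes from a run in which `tags` was loaded without `index_col`, so `movieId` appears as a regular column):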

In [127]:
tags[tags.columns[:2]].head()

,userId,movieId
0,15,339
1,15,1955
2,15,7478
3,15,32892
4,15,34162


To filter rows by a condition, for example ratings above 3.0:

In [132]:
ratings[ratings["rating"] > 3.0].head()

,userId,movieId,rating,timestamp
4,1,1172,4.0,1260759205
8,1,1339,3.5,1260759125
12,1,1953,4.0,1260759191
13,1,2105,4.0,1260759139
20,2,10,4.0,835355493
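
Several conditions can be combined in one filter. The sketch below uses a small hand-built frame whose rows are copied from the `ratings.head()` output above; the name `ratings_demo` is only for illustration.

```python
import pandas as pd

# Toy data: the first five rows of the ratings table shown earlier
ratings_demo = pd.DataFrame({
    "userId": [1, 1, 1, 1, 1],
    "movieId": [31, 1029, 1061, 1129, 1172],
    "rating": [2.5, 3.0, 3.0, 2.0, 4.0],
})

# Each condition needs its own parentheses; & is the element-wise "and"
mid_range = ratings_demo[
    (ratings_demo["rating"] >= 2.5) & (ratings_demo["rating"] <= 3.0)
]

# The same filter expressed as a query string
mid_range_q = ratings_demo.query("rating >= 2.5 and rating <= 3.0")

print(mid_range["movieId"].tolist())  # [31, 1029, 1061]
```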


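### Show comedy movies rated >= 4

To select comedy movies, filter on the `genres` column. Note that `str.match` anchors the pattern at the start of the string, so this keeps only movies whose genre list begins with Comedy; Toy Story, which lists Comedy fourth, is excluded. `str.contains("Comedy")` would match it anywhere in the list. (The `ratings` column in the output below comes from the assignment in the final cell, which was executed before this one.)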

In [385]:
c_movies = movies[movies["genres"].str.match("Comedy")]
c_movies.head(3)

movieId,title,genres,ratings
3,Grumpier Old Men (1995),Comedy|Romance,3.161017
4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.384615
5,Father of the Bride Part II (1995),Comedy,3.267857
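
The difference between matching at the start of the genre string and matching anywhere in it can be seen on the five movies from the `movies.head()` output above. The frame `movies_demo` is a hand-built copy of those rows, used here only for illustration.

```python
import pandas as pd

# Toy data copied from the movies.head() output
movies_demo = pd.DataFrame(
    {
        "title": [
            "Toy Story (1995)",
            "Jumanji (1995)",
            "Grumpier Old Men (1995)",
            "Waiting to Exhale (1995)",
            "Father of the Bride Part II (1995)",
        ],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Adventure|Children|Fantasy",
            "Comedy|Romance",
            "Comedy|Drama|Romance",
            "Comedy",
        ],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="movieId"),
)

# str.match only succeeds when the pattern matches at the start of the string
starts_with = movies_demo[movies_demo["genres"].str.match("Comedy")]
# str.contains succeeds when the pattern occurs anywhere in the string
anywhere = movies_demo[movies_demo["genres"].str.contains("Comedy")]

print(starts_with.index.tolist())  # [3, 4, 5]
print(anywhere.index.tolist())     # [1, 3, 4, 5]
```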


The ratings table holds many ratings per movie, so we group by `movieId` and take the mean to get one average rating per movie. The averaged `userId` and `timestamp` columns carry no meaning and can be ignored:

In [353]:
ratings_avg = ratings.groupby("movieId").mean()
ratings_avg.head()

movieId,userId,rating,timestamp
1,338.558704,3.87247,1103116000.0
2,318.906542,3.401869,1069321000.0
3,374.423729,3.161017,966242900.0
4,355.538462,2.384615,927779700.0
5,320.785714,3.267857,996720100.0
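
Since only the rating is worth averaging, one option is to select that column before aggregating, and to compute the number of ratings alongside the mean. The following is a sketch on made-up toy data, not the real ratings table.

```python
import pandas as pd

# Made-up toy ratings: movie 1 has two ratings, movie 2 has three
ratings_demo = pd.DataFrame({
    "userId": [1, 2, 1, 2, 3],
    "movieId": [1, 1, 2, 2, 2],
    "rating": [4.0, 3.0, 5.0, 3.0, 4.0],
})

# Select the rating column first, then aggregate it per movie
stats = ratings_demo.groupby("movieId")["rating"].agg(["mean", "count"])

print(stats)
```

The `count` column helps judge how trustworthy each average is, since a mean over a handful of ratings is much noisier than one over hundreds.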


We now have a DataFrame of movies and another of average ratings, both indexed by `movieId`. Assigning a Series as a new column aligns on that shared index, so each movie receives its own average rating:

In [383]:
movies["ratings"] = ratings_avg["rating"]
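
One possible way to finish the step named in the heading is to combine the genre filter with a rating threshold. The sketch below uses made-up titles and ratings (all values are hypothetical) and `str.contains`, so Comedy is found anywhere in the genre list; it also shows that a movie without any rating ends up with `NaN` after the index-aligned assignment.

```python
import pandas as pd

# Hypothetical toy movies, indexed by movieId
movies_demo = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "genres": ["Adventure|Comedy", "Drama", "Comedy|Romance", "Comedy"],
    },
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)

# Hypothetical average ratings; movie 2 has none
ratings_avg_demo = pd.Series(
    [4.2, 3.1, 4.5],
    index=pd.Index([1, 3, 4], name="movieId"),
    name="rating",
)

# Assignment aligns on the movieId index labels, not on row position
movies_demo["ratings"] = ratings_avg_demo

# Keep comedies whose average rating is at least 4 (NaN compares as False)
is_comedy = movies_demo["genres"].str.contains("Comedy")
top_comedies = movies_demo[is_comedy & (movies_demo["ratings"] >= 4)]

print(top_comedies)
```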