Summary
This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files: genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follow.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.


This is where I start to explore the data.

I will be using the MovieLens Dataset.

CITATION:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Also see the MovieLens 20M YouTube Trailers Dataset for links between MovieLens movies and movie trailers hosted on YouTube.




<h4>MovieLens 20M Dataset</h4>
<!-- see https://developers.google.com/search/docs/data-types/dataset -->
<article typeof="dcat:Dataset">
  <p>
    <span property="dc:title">MovieLens 20M</span> 
    <span rel="dc:subject">movie ratings</span>.
    <span property="dc:description">Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.</span>
  </p>
  
  <ul>
    <li>
      <a href="http://files.grouplens.org/datasets/movielens/ml-20m-README.html">README.txt</a>
    </li>
    <li>
      <a rel="dcat:distribution" href="http://files.grouplens.org/datasets/movielens/ml-20m.zip"><span property="dcat:mediaType" content="application/zip">ml-20m.zip</span></a> (size: 190 MB, <a href="http://files.grouplens.org/datasets/movielens/ml-20m.zip.md5">checksum</a>)
    </li>
  </ul>
  <p>
  Also see the <a href="https://grouplens.org/datasets/movielens/20m-youtube/">MovieLens 20M YouTube Trailers Dataset</a> for links between MovieLens movies and movie trailers hosted on YouTube.
  </p>
  <p>
    Permalink:
    <a href="https://grouplens.org/datasets/movielens/20m/">https://grouplens.org/datasets/movielens/20m/</a>
  </p>
</article>


In [None]:
import os.path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

DATAPATH = '/Users/ergonyc/Projects/Insight/Data'
DATA_DIR = os.path.join(DATAPATH, 'ml-20m')
MOVIE_CSV_FILE = os.path.join(DATA_DIR, 'movies.csv')
RATINGS_CSV_FILE = os.path.join(DATA_DIR, 'ratings.csv')
TAGS_CSV_FILE = os.path.join(DATA_DIR, 'tags.csv')
LINKS_CSV_FILE = os.path.join(DATA_DIR, 'links.csv')

pd.set_option("display.max_rows", 5)




In [None]:
# Load movies data.
movies = pd.read_csv(MOVIE_CSV_FILE, sep=',')

# Replace the '(no genres listed)' placeholder with an empty string.
movies['genres'] = np.where(movies['genres'] == '(no genres listed)', '', movies['genres'])
movies.head()



In [None]:
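# Split the pipe-separated genres string into a list of genre names.
def get_genres(s):
    if len(s) == 0:
        # np.NaN was removed in NumPy 2.0; np.nan works in all versions.
        return np.nan
    return s.split('|')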

genres_list = movies['genres'].apply(get_genres).dropna()
genres = sorted(set().union(*genres_list))
print(genres)
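
With the genre strings split, a per-genre movie count is a natural next look at the data. The sketch below is only an illustration: `demo_movies` is a made-up three-row stand-in for the real `movies` table, and `Series.str.get_dummies` is a swapped-in alternative to the `get_genres` helper above, not something the notebook itself relies on.

```python
import pandas as pd

# Hypothetical stand-in for the real movies table (same column names).
demo_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (1995)', 'B (1995)', 'C (1996)'],
    'genres': ['Adventure|Comedy', 'Comedy', ''],
})

# One indicator column per genre; an empty genres string gives an all-zero row.
genre_flags = demo_movies['genres'].str.get_dummies(sep='|')

# Number of movies carrying each genre, most common first.
genre_counts = genre_flags.sum().sort_values(ascending=False)
print(genre_counts)
```

On the real data, the same two lines applied to `movies['genres']` should yield one column per entry of `genres`.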

In [None]:
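import re

# Titles end with the release year in parentheses, e.g. "Toy Story (1995)".
# For a range of the form (YYYY-YYYY), only the first year is kept.
YEAR_PATTERN = re.compile(r'^(.*) \(([0-9]+)(?:-[0-9]*)?\)\s*$')

def get_year(s):
    result = YEAR_PATTERN.match(s)
    if result:
        return int(result.group(2))
    # No trailing "(year)" found in the title.
    return np.nan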

movies['year'] = movies['title'].apply(get_year)
movies.head()
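
The row-by-row `apply` above can also be written as a single vectorized string operation. This is a minimal sketch under assumptions: `demo_titles` is a hand-made sample (only the first title is taken from the dataset, the other two are hypothetical), and the nullable `Int64` dtype is my choice so that missing years do not force the column to float.

```python
import pandas as pd

demo_titles = pd.Series([
    'Toy Story (1995)',
    'Some Series (2007-2010)',  # hypothetical title with a year range
    'Untitled Project',         # hypothetical title without a year
])

# Capture the first run of digits inside the trailing parentheses.
year_text = demo_titles.str.extract(r'\(([0-9]+)(?:-[0-9]*)?\)\s*$', expand=False)

# Titles without a match become <NA> instead of NaN.
demo_years = pd.to_numeric(year_text).astype('Int64')
print(demo_years)
```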

In [None]:
tags = pd.read_csv(TAGS_CSV_FILE, sep=',')
# Convert Unix seconds to datetimes in one vectorized call.
tags['datetime'] = pd.to_datetime(tags['timestamp'], unit='s')
tags.head()

In [6]:
links = pd.read_csv(LINKS_CSV_FILE, sep=',')
links.head()

   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0
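
The `tmdbId` column shows up as a float (862.0), which in pandas usually means some rows have no value. The sketch below converts it to a nullable integer and builds an IMDb URL from `imdbId`. Assumptions: `demo_links` is a small stand-in whose third row is invented to show a missing `tmdbId`, and the URL pattern (seven-digit, zero-padded `tt` identifier) is the form I understand the MovieLens README to describe, so check it against the README before relying on it.

```python
import pandas as pd

demo_links = pd.DataFrame({
    'movieId': [1, 2, 999999],            # last row is hypothetical
    'imdbId': [114709, 113497, 1234567],
    'tmdbId': [862.0, 8844.0, None],      # missing value for illustration
})

# Nullable integer dtype keeps missing ids as <NA> rather than NaN floats.
demo_links['tmdbId'] = demo_links['tmdbId'].astype('Int64')

# Assumed URL pattern: zero-pad imdbId to seven digits after the "tt" prefix.
demo_links['imdb_url'] = (
    'http://www.imdb.com/title/tt'
    + demo_links['imdbId'].astype(str).str.zfill(7)
    + '/'
)
print(demo_links)
```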


In [None]:
ratings = pd.read_csv(RATINGS_CSV_FILE, sep=',')
# Convert Unix seconds to datetimes in one vectorized call.
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings.head()
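
Before going further, it may be worth checking the loaded ratings against the figures quoted in the summary at the top (20000263 ratings, 138493 users, at least 20 ratings per user). The helper below is a hypothetical sketch, not part of the dataset tooling; `summarize_ratings` and `demo_ratings` are names introduced here, and the five demo rows are made up.

```python
import pandas as pd

# Hypothetical miniature of ratings.csv with the same column names.
demo_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 30, 40],
    'rating': [3.5, 4.0, 5.0, 0.5, 2.0],
    'timestamp': [1112486027, 1112484676, 974820889, 974820889, 974821000],
})

def summarize_ratings(df):
    """Return headline counts that can be compared with the dataset summary."""
    per_user = df.groupby('userId').size()
    when = pd.to_datetime(df['timestamp'], unit='s')
    return {
        'n_ratings': len(df),
        'n_users': df['userId'].nunique(),
        'n_movies': df['movieId'].nunique(),
        'min_ratings_per_user': int(per_user.min()),
        'first_rating': when.min(),
        'last_rating': when.max(),
    }

summary = summarize_ratings(demo_ratings)
print(summary)
```

Calling `summarize_ratings(ratings)` on the full table should give numbers in line with the summary; a mismatch would suggest a truncated download or a different dataset version.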

# stub for world happiness report... (SUICIDED DATA SET)

DATAPATH = '/Users/ergonyc/Projects/Insight/Data'
DATA_DIR = os.path.join(DATAPATH, 'world-happiness-report')

YEAR015_FILE = os.path.join(DATA_DIR, '2015.csv')
YEAR016_FILE = os.path.join(DATA_DIR, '2016.csv')
YEAR017_FILE = os.path.join(DATA_DIR, '2017.csv')


In [None]:
# Load the World Happiness Report data (one CSV file per year).
year15 = pd.read_csv(YEAR015_FILE, sep=',')
year16 = pd.read_csv(YEAR016_FILE, sep=',')
year17 = pd.read_csv(YEAR017_FILE, sep=',')
