# Description of the data used

I'm using data collected by the GroupLens group. The data can be obtained from their site [here](https://grouplens.org/datasets/movielens/).
I'm using the dataset updated in Aug 2017 (note that the data may change over time). For a basic description I'll quote the MovieLens site directly:
_26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags._

The dataset consists of 6 files:
- movies.csv
- ratings.csv
- links.csv
- tags.csv
- genome-scores.csv
- genome-tags.csv

For now I'll concentrate on the first 2.

### movies.csv
Contains information about films: their ID, title (usually with the production year) and, where available, their genres.

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv('movies.csv')

In [6]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [7]:
movies.dtypes

movieId     int64
title      object
genres     object
dtype: object

In [8]:
movies.shape

(45843, 3)

Stating the obvious: movies.csv has 3 columns (a numeric movieId plus string title and genres) and 45843 rows, each representing a film. The title usually includes the release year in parentheses. One film can have more than one genre; genres are separated by the '|' sign.
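The title and genres layout described above can be unpacked with pandas string methods. This is only a minimal sketch on a few rows copied from the `head()` output; the column names `year` and `genre_list` are my own, and it assumes the year, when present, is the last thing in the title.

```python
import pandas as pd

# Tiny stand-in for movies.csv (rows copied from the head() output above).
sample = pd.DataFrame({
    "movieId": [1, 2, 5],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy",
    ],
})

# Pull a trailing four-digit year in parentheses out of the title.
# Assumption: titles without such a suffix would simply yield NaN here.
year_text = sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
sample["year"] = pd.to_numeric(year_text)

# Turn the '|'-separated string into a Python list per film.
sample["genre_list"] = sample["genres"].str.split("|")

# One row per (film, genre) pair, handy for counting films per genre.
per_genre = sample.explode("genre_list")
genre_counts = per_genre["genre_list"].value_counts()
print(genre_counts)
```

Exploding the list gives one row per film and genre, which makes per-genre counts or filters straightforward without parsing the string each time.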

One thing that seems logical at this point is to set movieId as the index of this dataframe.

In [9]:
movies.set_index('movieId',inplace=True)
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


To finish, let's have some fun with it :)

In [18]:
searchText=input()
movies[movies['title'].str.contains(searchText)]

Godfather


movieId,title,genres
858,"Godfather, The (1972)",Crime|Drama
1221,"Godfather: Part II, The (1974)",Crime|Drama
2023,"Godfather: Part III, The (1990)",Crime|Drama|Mystery|Thriller
8607,Tokyo Godfathers (2003),Adventure|Animation|Drama
25934,3 Godfathers (1948),Drama|Western
100180,"Last Godfather, The (2010)",Comedy
106704,Disco Godfather (1979),Action|Crime|Drama
121519,The New Godfathers (1979),Crime
124791,Three Godfathers (1936),Drama|Western
156972,Onimasa: A Japanese Godfather (1982),Action
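The search above treats the typed text as a regular expression and is case-sensitive, since those are the defaults of `str.contains`. A possible variant, sketched here on a few titles from the outputs above, matches the text literally and ignores case; the helper name `find_titles` is hypothetical.

```python
import pandas as pd

# Small stand-in for the movies dataframe indexed by movieId.
sample = pd.DataFrame(
    {"title": ["Godfather, The (1972)", "Tokyo Godfathers (2003)", "Heat (1995)"]},
    index=pd.Index([858, 8607, 6], name="movieId"),
)

def find_titles(frame, text):
    """Return rows whose title contains `text` (literal match, case ignored)."""
    mask = frame["title"].str.contains(text, case=False, regex=False, na=False)
    return frame[mask]

# Lower-case input still finds both Godfather titles.
print(find_titles(sample, "godfather"))

# An unbalanced parenthesis is fine here; with regex=True it would be
# an invalid pattern.
print(find_titles(sample, "(19"))
```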


Based on this we can also see that some entries cover a whole series (like The Godfather Trilogy), while the individual films can be rated separately as well.
Genres may also be misleading at times. For example, I wouldn't say that Godfather: Part II and Godfather: Part III differ in type of film, yet only Part III is also tagged Mystery and Thriller.

### ratings.csv 
This file contains the ratings users gave to films. Note that it has more than 26 million records, so a file this large can be difficult to work with.
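One common way to make a file of this size easier to handle is to read it with narrower dtypes, or to stream it in chunks and aggregate along the way. The sketch below uses a four-row in-memory stand-in for ratings.csv; the choice of `int32` and `float32` is an assumption (ids must fit in 32 bits), not something the dataset documentation prescribes.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.csv (first rows of the real file).
csv_text = """userId,movieId,rating,timestamp
1,110,1.0,1425941529
1,147,4.5,1425942435
1,858,5.0,1425941523
1,1221,5.0,1425941546
"""

# Narrower dtypes than the int64/float64 defaults.
# Assumption: ids fit in int32; ratings in steps of 0.5 are exact in float32.
dtypes = {"userId": "int32", "movieId": "int32",
          "rating": "float32", "timestamp": "int64"}

small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small.dtypes)

# Alternatively, stream the file in chunks and aggregate as you go,
# so the whole table never has to sit in memory at once.
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    total += float(chunk["rating"].sum())
    count += len(chunk)
mean_rating = total / count
print(mean_rating)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `'ratings.csv'` and a much larger `chunksize`.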

In [10]:
ratings = pd.read_csv('ratings.csv')

In [11]:
ratings.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
5,1,1968,4.0,1425942148
6,1,2762,4.5,1425941300
7,1,2918,5.0,1425941593
8,1,2959,4.0,1425941601
9,1,4226,4.0,1425942228


In [12]:
ratings.shape

(26024289, 4)

In [13]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [15]:
ratings.rating.describe()

count    2.602429e+07
mean     3.528090e+00
std      1.065443e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

In [22]:
ratings.movieId.nunique()

45115

In [23]:
ratings.userId.nunique()

270896

In [26]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

The data in ratings.csv has 4 numeric columns and 26024289 rows, each representing one rating for a user-movie pair. Neither userId nor movieId alone is a good candidate for an index (although technically possible here), as both contain repeated values. The userId-movieId pair, on the other hand, would be a good candidate (but I'll leave it as is for now). Users rate movies on a scale from 0.5 to 5 in steps of 0.5; the mean is about 3.53, consistent with the median of 3.5. There are no missing values. About 270 thousand unique users (270896) rated 45115 films (not all movies are rated, but almost all).
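The index reasoning above can be checked directly rather than assumed. This is a rough sketch on a toy frame: the first three rows come from the `head()` output, while the row for user 2 and the `movie_ids` index are made up for illustration.

```python
import pandas as pd

# Toy ratings: three real rows plus one hypothetical row for user 2.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "movieId": [110, 147, 858, 110],
    "rating": [1.0, 4.5, 5.0, 3.0],
})

# Hypothetical stand-in for movies.index (movieId values in movies.csv).
movie_ids = pd.Index([1, 110, 147, 858], name="movieId")

# Each column on its own contains repeats, so neither is a unique key.
user_unique = ratings_sample["userId"].is_unique
movie_unique = ratings_sample["movieId"].is_unique

# True if any (userId, movieId) pair occurs more than once.
has_duplicate_pairs = bool(
    ratings_sample.duplicated(subset=["userId", "movieId"]).any()
)

# Catalogue ids that never appear in the ratings.
unrated = movie_ids.difference(ratings_sample["movieId"].unique())

print(user_unique, movie_unique, has_duplicate_pairs)
print(list(unrated))
```

If `has_duplicate_pairs` comes out `False` on the full data, `ratings.set_index(['userId', 'movieId'])` would give a unique two-level index.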

[next](part2.ipynb) [index](index.ipynb)