<h1 style="color: #2A8B66;">Movie Data Analysis</h1>
<h3>
Exploring the MovieLens dataset with Pandas.
</h3>
<ul>
    <li><b>Data source: </b><span>filename ml-20m.zip</span></li>
    <li><b>Website: </b><a>https://grouplens.org/datasets/movielens/</a></li>
</ul>

#### Imports

In [7]:
import pandas as pd

#### Show the contents of the directory that stores the CSV files

In [2]:
!ls ./movielens/

genome-scores.csv  links.csv   ratings.csv  tags.csv
genome-tags.csv    movies.csv  README.txt


#### Number of lines in the "movies.csv" file (one header row plus one line per movie)

In [5]:
!echo "Number of lines (including the header row):"
!cat ./movielens/movies.csv | wc -l

Number of lines (including the header row):
27279
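
The `wc -l` figure counts the header row as well, so the number of movie records is one less than the line count. A minimal sketch of the difference, using a three-row inline CSV as a stand-in for `movies.csv` (the names `csv_text` and `sample_movies` are only illustrative):

```python
import io
import pandas as pd

# Tiny stand-in for movies.csv: the header plus the first three rows of the dataset.
csv_text = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# What `wc -l` reports: every newline-terminated line, header included.
line_count = csv_text.count("\n")

# What pandas reports: data rows only, because the header becomes the column names.
sample_movies = pd.read_csv(io.StringIO(csv_text))
movie_count = len(sample_movies)

print(line_count, movie_count)
```

On the real file, `len(movies)` after `pd.read_csv` should give the record count directly.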


<h2 style="color: #2A738B;">Use Pandas to read the dataset</h2>
<ul>
    <li><b>ratings.csv: </b><span>userId,movieId,rating,timestamp</span></li>
    <li><b>tags.csv: </b><span>userId,movieId,tag,timestamp</span></li>
    <li><b>movies.csv: </b><span>movieId,title,genres</span></li>
</ul>

In [24]:
movies = pd.read_csv("./movielens/movies.csv", sep=",")
print(type(movies))
movies.head()

<class 'pandas.core.frame.DataFrame'>


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [25]:
tags = pd.read_csv("./movielens/tags.csv", sep=",")
tags.head(5)

   userId  movieId            tag   timestamp
0      18     4141    Mark Waters  1240597180
1      65      208      dark hero  1368150078
2      65      353      dark hero  1368150079
3      65      521  noir thriller  1368149983
4      65      592      dark hero  1368150078


In [27]:
ratings = pd.read_csv("./movielens/ratings.csv", sep=",")
ratings.head()

   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580
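
The `timestamp` column is easier to read once converted to datetimes. The sketch below assumes the values are Unix seconds (UTC), as the MovieLens README describes; the frame is rebuilt inline from the first rows shown above so it runs without the CSV file, and `parsed_time` is a hypothetical column name.

```python
import pandas as pd

# First rows of ratings.csv, copied from the output above.
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [2, 29, 32],
    "rating": [3.5, 3.5, 3.5],
    "timestamp": [1112486027, 1112484676, 1112484819],
})

# unit="s" tells pandas to interpret the integers as seconds since 1970-01-01 UTC.
sample_ratings["parsed_time"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")

print(sample_ratings[["timestamp", "parsed_time"]])
```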


### Series
#### Each row of the DataFrame is a Series object

In [30]:
tags_row_0 = tags.iloc[0]
print(type(tags_row_0))
tags_row_0

<class 'pandas.core.series.Series'>


userId                18
movieId             4141
tag          Mark Waters
timestamp     1240597180
Name: 0, dtype: object

In [31]:
tags_row_0.index

Index(['userId', 'movieId', 'tag', 'timestamp'], dtype='object')

In [32]:
tags_row_0["movieId"]

4141

In [35]:
print("'tag' in tags_row_0? {0}".format("tag" in tags_row_0))
print("'rating' in tags_row_0? {0}".format("rating" in tags_row_0))

'tag' in tags_row_0? True
'rating' in tags_row_0? False
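
Note that `in` on a Series tests the index labels, not the stored values. A small sketch, rebuilding the first row of `tags` by hand (the variable names are only illustrative):

```python
import pandas as pd

# First row of tags, with values copied from the output above.
row = pd.Series(
    {"userId": 18, "movieId": 4141, "tag": "Mark Waters", "timestamp": 1240597180},
    name=0,
)

# `in` looks at the index labels, so a column name matches...
label_found = "tag" in row
# ...but a stored value does not.
value_found_by_in = "Mark Waters" in row
# To test membership among the values, convert them to a list first.
value_found = "Mark Waters" in row.tolist()

print(label_found, value_found_by_in, value_found)
```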


### DataFrame

In [37]:
tags.index

RangeIndex(start=0, stop=465564, step=1)

In [39]:
tags.columns

Index(['userId', 'movieId', 'tag', 'timestamp'], dtype='object')
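
The index and the columns together determine the frame's shape. A minimal sketch on a three-row sample copied from `tags` (the name `sample_tags` is an assumption, used only so the example runs without the file):

```python
import pandas as pd

sample_tags = pd.DataFrame({
    "userId": [18, 65, 65],
    "movieId": [4141, 208, 353],
    "tag": ["Mark Waters", "dark hero", "dark hero"],
    "timestamp": [1240597180, 1368150078, 1368150079],
})

# shape is (number of index entries, number of columns)
n_rows, n_cols = sample_tags.shape
print(n_rows == len(sample_tags.index), n_cols == len(sample_tags.columns))

# dtypes lists the type pandas chose for each column
print(sample_tags.dtypes)
```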

#### Descriptive Statistics

In [60]:
ratings["rating"].describe()

count    2.000026e+07
mean     3.525529e+00
std      1.051989e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

In [61]:
ratings["rating"].mean()

3.5255285642993797

In [62]:
ratings["rating"].min()

0.5

In [64]:
ratings["rating"].max()

5.0

In [65]:
ratings["rating"].std()

1.0519889192942424

In [66]:
ratings["rating"].mode()

0    4.0
dtype: float64
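
The mode is the most frequent value, so it should agree with the first entry of `value_counts()`, which also shows the whole distribution. A sketch on made-up ratings (not taken from the dataset):

```python
import pandas as pd

# Made-up ratings, for illustration only.
sample = pd.Series([4.0, 3.5, 4.0, 5.0, 3.0, 4.0, 3.5], name="rating")

# value_counts sorts by frequency, most common value first.
counts = sample.value_counts()
most_common = counts.index[0]

print(counts)
print(most_common == sample.mode()[0])
```

The same call on `ratings["rating"]` would show how often each rating from 0.5 to 5.0 was given.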

In [67]:
ratings.corr()

              userId   movieId     rating  timestamp
userId           1.0  -0.00085   0.001175  -0.003101
movieId     -0.00085       1.0   0.002606   0.459096
rating      0.001175  0.002606        1.0  -0.000512
timestamp  -0.003101  0.459096  -0.000512        1.0
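
`ratings` contains only numeric columns, so `corr()` works on it directly. For a frame with a text column, such as `tags`, one option is to select the numeric columns first. This is a sketch on a four-row sample copied from `tags`, not necessarily how the notebook would handle it:

```python
import pandas as pd

sample_tags = pd.DataFrame({
    "userId": [18, 65, 65, 65],
    "movieId": [4141, 208, 353, 521],
    "tag": ["Mark Waters", "dark hero", "dark hero", "noir thriller"],
    "timestamp": [1240597180, 1368150078, 1368150079, 1368149983],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_corr = sample_tags.select_dtypes(include="number").corr()

print(numeric_corr.columns.tolist())
```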


In [76]:
# boolean mask: True where the rating is greater than 4
filter_1 = ratings["rating"] > 4
# check whether any element of the Series is True
print(filter_1.any())
# check whether all elements of the Series are True
print(filter_1.all())

True
False
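
A boolean Series like `filter_1` can also be used to select the matching rows. A minimal sketch with made-up ratings (`sample_ratings`, `high` and `high_rated` are illustrative names):

```python
import pandas as pd

sample_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [2, 29, 32, 47],
    "rating": [3.5, 4.5, 5.0, 4.0],  # made-up values for illustration
})

# Boolean Series with one entry per row.
high = sample_ratings["rating"] > 4
# Indexing with the mask keeps only the rows where it is True.
high_rated = sample_ratings[high]

print(high.any(), high.all())
# True counts as 1, so sum() gives the number of matching rows.
print(len(high_rated), high.sum())
```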
