# Questions for the MovieLens Parquet Dataset

## Set up Spark SQL

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Speed Test") \
    .enableHiveSupport() \
    .getOrCreate()

In [2]:
%load_ext sparksql_magic

In [3]:
%%sparksql

SHOW DATABASES

| namespace |
| --- |
| default |
| movielens |
| movielens_parquet |
| movielens_parquet_compressed |
| post |
| taxi |
| test_db |
| text |


In [4]:
%%sparksql

USE movielens_parquet_compressed

## Playground

### How many movies do we have?

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 6.9 ms, sys: 153 µs, total: 7.05 ms
Wall time: 125 ms
```

In [5]:
%%time 
%%sparksql

SELECT count(*) FROM movies


CPU times: user 13.4 ms, sys: 1.37 ms, total: 14.8 ms
Wall time: 7.79 s


                                                                                

| count(1) |
| --- |
| 62423 |


### How many ratings do we have?
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 1.17 ms, sys: 5.38 ms, total: 6.55 ms
Wall time: 116 ms
```

In [6]:
%%time 
%%sparksql

SELECT count(*) FROM ratings



CPU times: user 9.76 ms, sys: 5.8 ms, total: 15.6 ms
Wall time: 5.2 s


                                                                                

| count(1) |
| --- |
| 25000095 |


### How many users do we have?
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 8.24 ms, sys: 1.12 ms, total: 9.36 ms
Wall time: 30.6 s
```

In [7]:
%%time 
%%sparksql

SELECT COUNT(DISTINCT(userid)) FROM ratings



CPU times: user 11.3 ms, sys: 3.24 ms, total: 14.6 ms
Wall time: 3.79 s


                                                                                

| count(DISTINCT userid) |
| --- |
| 162541 |


### Which movies have the most genres?
```
* hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 9.34 ms, sys: 1.38 ms, total: 10.7 ms
Wall time: 27.4 s
```

In [8]:
%%time 
%%sparksql

select title, year, genres, size(genres) as num_gen from movies order by num_gen desc limit 2


CPU times: user 5.12 ms, sys: 5.11 ms, total: 10.2 ms
Wall time: 1.51 s


                                                                                

| title | year | genres | num_gen |
| --- | --- | --- | --- |
| Rubber | 2010 | ['Action', 'Adventure', 'Comedy', 'Crime', 'Drama', 'Film-Noir', 'Horror', 'Mystery', 'Thriller', 'Western'] | 10 |
| Motorama | 1991 | ['Adventure', 'Comedy', 'Crime', 'Drama', 'Fantasy', 'Mystery', 'Sci-Fi', 'Thriller'] | 8 |


### Show all movies with "terminator" in the title

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 8.31 ms, sys: 0 ns, total: 8.31 ms
Wall time: 232 ms
```

In [9]:
%%time 
%%sparksql

select movieid, title, year from movies where lower(title) like '%terminator%'

CPU times: user 6.61 ms, sys: 2.59 ms, total: 9.2 ms
Wall time: 791 ms


                                                                                

| movieid | title | year |
| --- | --- | --- |
| 589 | Terminator 2: Judgment Day | 1991 |
| 1240 | Terminator, The | 1984 |
| 4934 | Exterminator, The | 1980 |
| 6537 | Terminator 3: Rise of the Machines | 2003 |
| 68791 | Terminator Salvation | 2009 |
| 102425 | Lady Terminator (Pembalasan ratu pantai selatan) | 1989 |
| 120799 | Terminator Genisys | 2015 |
| 126713 | Exterminator 2 | 1984 |
| 136200 | The Terminators | 2009 |


### How many movies do we have from 1984?
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 10.5 ms, sys: 0 ns, total: 10.5 ms
Wall time: 26.1 s
```

In [10]:
%%time 
%%sparksql

select count(*) from movies where year = 1984

CPU times: user 7.81 ms, sys: 0 ns, total: 7.81 ms
Wall time: 1.09 s


| count(1) |
| --- |
| 470 |


### Show the distribution of movies per year (where year >= 2000), sorted by year
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 12.9 ms, sys: 3.02 ms, total: 15.9 ms
Wall time: 54.5 s
```

In [11]:
%%time 
%%sparksql

select year, count(title) from movies where year >= 2000 group by year order by year asc

CPU times: user 6.81 ms, sys: 1.06 ms, total: 7.87 ms
Wall time: 1.3 s


| year | count(title) |
| --- | --- |
| 2000 | 929 |
| 2001 | 969 |
| 2002 | 1023 |
| 2003 | 1028 |
| 2004 | 1169 |
| 2005 | 1250 |
| 2006 | 1440 |
| 2007 | 1494 |
| 2008 | 1625 |


### Movies with the most ratings
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 11.7 ms, sys: 2.6 ms, total: 14.3 ms
Wall time: 28.5 s
```

In [12]:
%%time 
%%sparksql

select title, year, num_rating, median_rating from movie_rating order by num_rating DESC limit 10

CPU times: user 3.43 ms, sys: 3.28 ms, total: 6.72 ms
Wall time: 571 ms


| title | year | num_rating | median_rating |
| --- | --- | --- | --- |
| Forrest Gump | 1994 | 81491 | 4.0 |
| Shawshank Redemption, The | 1994 | 81482 | 4.0 |
| Pulp Fiction | 1994 | 79672 | 4.0 |
| Silence of the Lambs, The | 1991 | 74127 | 4.0 |
| Matrix, The | 1999 | 72674 | 4.0 |
| Star Wars: Episode IV - A New Hope | 1977 | 68717 | 4.0 |
| Jurassic Park | 1993 | 64144 | 4.0 |
| Schindler's List | 1993 | 60411 | 4.0 |
| Braveheart | 1995 | 59184 | 4.0 |
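
The `movie_rating` table queried above is not defined in this notebook. As a rough sketch of what its `num_rating` and `median_rating` columns presumably hold (an assumption: the per-movie rating count and the per-movie median rating), the plain-Python example below derives both values from a handful of made-up ratings. The sample data and the `movie_rating` dictionary layout are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (movieid, rating) pairs, standing in for the `ratings` table.
ratings = [
    (1, 4.0), (1, 5.0), (1, 3.0),
    (2, 2.0), (2, 4.0),
]

# Group the ratings by movie, like GROUP BY movieid.
ratings_by_movie = defaultdict(list)
for movieid, rating in ratings:
    ratings_by_movie[movieid].append(rating)

# Assumed meaning of the two columns: count and median of the ratings per movie.
movie_rating = {
    movieid: {"num_rating": len(values), "median_rating": median(values)}
    for movieid, values in ratings_by_movie.items()
}

for movieid, stats in movie_rating.items():
    print(movieid, stats["num_rating"], stats["median_rating"])
```

Because a median over half-star ratings can only land on a small set of values, many movies tie on `median_rating`, which is why the queries in this section add `num_rating` as a second sort key.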


### Top ten best rated movies (by median) with more than 100 ratings
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 27.9 ms, sys: 3.95 ms, total: 31.8 ms
Wall time: 27.4 s
```

In [13]:
%%time 
%%sparksql

select title, year, num_rating, median_rating 
from movie_rating
where num_rating > 100
order by median_rating DESC, num_rating DESC
limit 10

CPU times: user 2.93 ms, sys: 4.29 ms, total: 7.21 ms
Wall time: 716 ms


| title | year | num_rating | median_rating |
| --- | --- | --- | --- |
| Forrest Gump | 1994 | 81491 | 4.0 |
| Shawshank Redemption, The | 1994 | 81482 | 4.0 |
| Pulp Fiction | 1994 | 79672 | 4.0 |
| Silence of the Lambs, The | 1991 | 74127 | 4.0 |
| Matrix, The | 1999 | 72674 | 4.0 |
| Star Wars: Episode IV - A New Hope | 1977 | 68717 | 4.0 |
| Jurassic Park | 1993 | 64144 | 4.0 |
| Schindler's List | 1993 | 60411 | 4.0 |
| Braveheart | 1995 | 59184 | 4.0 |


### Top ten worst rated movies (by median) with more than 100 ratings

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 12.8 ms, sys: 858 µs, total: 13.6 ms
Wall time: 28.2 s
```

In [14]:
%%time 
%%sparksql

select title, year, num_rating, median_rating 
from movie_rating
where
    num_rating is not null
    and num_rating > 100
order by median_rating ASC, num_rating DESC
limit 10

CPU times: user 6.43 ms, sys: 0 ns, total: 6.43 ms
Wall time: 390 ms


| title | year | num_rating | median_rating |
| --- | --- | --- | --- |
| Disaster Movie | 2008 | 557 | 0.0 |
| Pokemon 4 Ever (a.k.a. Pokémon 4: The Movie) | 2002 | 494 | 0.0 |
| From Justin to Kelly | 2003 | 417 | 0.0 |
| Pokémon Heroes | 2003 | 355 | 0.0 |
| Bratz: The Movie | 2007 | 209 | 0.0 |
| SuperBabies: Baby Geniuses 2 | 2004 | 208 | 0.0 |
| Faces of Death 4 | 1990 | 169 | 0.0 |
| Faces of Death 6 | 1996 | 162 | 0.0 |
| Faces of Death 5 | 1996 | 147 | 0.0 |


### How often is each genre used?

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 14.4 ms, sys: 2.46 ms, total: 16.9 ms
Wall time: 54.5 s
```

In [15]:
%%time 
%%sparksql

SELECT genre, COUNT(genre) AS cnt FROM (
    SELECT EXPLODE(genres) genre FROM movies
)t
GROUP BY genre
ORDER BY cnt DESC

CPU times: user 5.74 ms, sys: 1.72 ms, total: 7.46 ms
Wall time: 1.06 s


| genre | cnt |
| --- | --- |
| Drama | 25606 |
| Comedy | 16870 |
| Thriller | 8654 |
| Romance | 7719 |
| Action | 7348 |
| Horror | 5989 |
| Documentary | 5605 |
| Crime | 5319 |
|  | 5062 |
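
To make the `EXPLODE` plus `GROUP BY` pattern from the query above concrete, here is a minimal plain-Python sketch of the same counting logic. It does not use Spark, and the three sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows (title, genres), mimicking the array-typed `genres` column.
movies = [
    ("Movie A", ["Drama", "Comedy"]),
    ("Movie B", ["Drama"]),
    ("Movie C", ["Comedy", "Thriller", "Drama"]),
]

# EXPLODE(genres): emit one entry per (movie, genre) pair.
exploded = [genre for _title, genres in movies for genre in genres]

# GROUP BY genre with COUNT(genre), then ORDER BY cnt DESC.
genre_counts = Counter(exploded).most_common()

for genre, cnt in genre_counts:
    print(genre, cnt)
```

A movie with several genres contributes one entry per genre, so the counts add up to more than the number of movies.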


## Naïve Movie Recommender

### Step 1: Find the `movieid` of two movies you like a lot

- 4011: Snatch
- 1270: Back to the Future
 
 ```
  * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 9.52 ms, sys: 1.28 ms, total: 10.8 ms
Wall time: 247 ms
```

In [16]:
%%time 
%%sparksql

select movieid, title, year from movies where lower(title) like '%snatch%'

CPU times: user 3.81 ms, sys: 4.08 ms, total: 7.89 ms
Wall time: 397 ms


| movieid | title | year |
| --- | --- | --- |
| 426 | Body Snatchers | 1993 |
| 1337 | Body Snatcher, The | 1945 |
| 2664 | Invasion of the Body Snatchers | 1956 |
| 4011 | Snatch | 2000 |
| 7001 | Invasion of the Body Snatchers | 1978 |
| 97031 | Goke, Body Snatcher from Hell (Kyuketsuki Gokemidoro) | 1968 |
| 102388 | Candy Snatchers, The | 1973 |
| 122037 | The Bone Snatcher | 2003 |
| 130656 | The 21 Carat Snatch | 1971 |


```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 9.57 ms, sys: 0 ns, total: 9.57 ms
Wall time: 231 ms
```

In [17]:
%%time 
%%sparksql

select movieid, title, year from movies where lower(title) like '%back to the%'


CPU times: user 6.89 ms, sys: 0 ns, total: 6.89 ms
Wall time: 452 ms


| movieid | title | year |
| --- | --- | --- |
| 1270 | Back to the Future | 1985 |
| 1863 | Major League: Back to the Minors | 1998 |
| 2011 | Back to the Future Part II | 1989 |
| 2012 | Back to the Future Part III | 1990 |
| 4081 | Back to the Beach | 1987 |
| 4445 | T-Rex: Back to the Cretaceous | 1998 |
| 5936 | Come Back to the Five and Dime, Jimmy Dean, Jimmy Dean | 1982 |
| 102666 | Ivan Vasilievich: Back to the Future (Ivan Vasilievich menyaet professiyu) | 1973 |
| 106549 | Back to the USSR - takaisin Ryssiin | 1992 |


### Find people who also liked these movies and save them in a temporary view
```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 7.76 ms, sys: 3.01 ms, total: 10.8 ms
Wall time: 30.8 s
```

In [22]:
%%time 
%%sparksql

CREATE TEMPORARY VIEW similar_people as 
select distinct(userid) userid
from ratings 
where (movieid = 4011 or movieid = 1270) and rating = 5

CPU times: user 7.3 ms, sys: 0 ns, total: 7.3 ms
Wall time: 143 ms
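
Note that the `or` in the view definition selects users who gave five stars to either movie, not necessarily to both. The sketch below contrasts the two readings on a few made-up ratings. It uses the standard-library `sqlite3` module instead of Spark so that it runs anywhere, and the user IDs and ratings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        (1, 4011, 5.0), (1, 1270, 5.0),  # user 1: five stars for both movies
        (2, 4011, 5.0), (2, 1270, 3.0),  # user 2: five stars for Snatch only
        (3, 1270, 5.0),                  # user 3: five stars for Back to the Future only
        (4, 4011, 4.0),                  # user 4: no five-star rating
    ],
)

# Same predicate as the view above: five stars for either movie.
either = [row[0] for row in conn.execute(
    "SELECT DISTINCT userid FROM ratings "
    "WHERE (movieid = 4011 OR movieid = 1270) AND rating = 5 "
    "ORDER BY userid"
)]

# Stricter variant (not used in this notebook): five stars for both movies.
both = [row[0] for row in conn.execute(
    "SELECT userid FROM ratings "
    "WHERE (movieid = 4011 OR movieid = 1270) AND rating = 5 "
    "GROUP BY userid HAVING COUNT(DISTINCT movieid) = 2 "
    "ORDER BY userid"
)]

print("either:", either)
print("both:", both)
```

The looser "either" reading gives a larger pool of similar users, at the cost of including people who share only one of the two preferences.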


### Basic checks for `similar_people`

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 4.24 ms, sys: 3.32 ms, total: 7.57 ms
Wall time: 126 ms
```

In [23]:
%%time 
%%sparksql

select * from similar_people limit 2



CPU times: user 10.5 ms, sys: 1.52 ms, total: 12 ms
Wall time: 1.78 s


                                                                                

| userid |
| --- |
| 1238 |
| 1342 |


```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 8.16 ms, sys: 110 µs, total: 8.27 ms
Wall time: 102 ms
```

In [24]:
%%time 
%%sparksql

select count(*) from similar_people



CPU times: user 7.4 ms, sys: 6.23 ms, total: 13.6 ms
Wall time: 1.28 s


                                                                                

| count(1) |
| --- |
| 16455 |


### Join `similar_people` with `movies` and `ratings` and get movie recommendations

```
 * hive://hadoop@localhost:10000/movielens_parquet_compressed
Done.
CPU times: user 11.3 ms, sys: 5.8 ms, total: 17.1 ms
Wall time: 1min 2s
```

In [25]:
%%time 
%%sparksql

SELECT m.title, count(*) as five_star_count from ratings r
INNER JOIN similar_people sp ON r.userid = sp.userid
INNER JOIN movies m ON r.movieid = m.movieid
WHERE rating = 5
GROUP BY m.title
ORDER BY five_star_count DESC
LIMIT 20



CPU times: user 13.6 ms, sys: 6.19 ms, total: 19.7 ms
Wall time: 3.34 s


                                                                                

| title | five_star_count |
| --- | --- |
| Back to the Future | 11602 |
| Shawshank Redemption, The | 6662 |
| Matrix, The | 6545 |
| Star Wars: Episode IV - A New Hope | 6263 |
| Pulp Fiction | 6228 |
| Star Wars: Episode V - The Empire Strikes Back | 6062 |
| Snatch | 5836 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) | 5554 |
| Fight Club | 5535 |
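
The two seed movies, Back to the Future and Snatch, show up in their own recommendation list because the query counts every five-star rating from the similar users. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to filter the seed `movieid` values out. The example runs the adjusted query with the standard-library `sqlite3` module on a tiny made-up dataset. The table contents are hypothetical, and `similar_people` is a plain table here instead of a temporary view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movieid INTEGER, title TEXT);
CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating REAL);
CREATE TABLE similar_people (userid INTEGER);
""")
conn.executemany("INSERT INTO movies VALUES (?, ?)", [
    (4011, "Snatch"), (1270, "Back to the Future"),
    (1, "Movie A"), (2, "Movie B"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", [
    (10, 4011, 5), (10, 1270, 5), (10, 1, 5), (10, 2, 5),
    (11, 1270, 5), (11, 1, 5), (11, 2, 3),
    (12, 1, 5),  # user 12 is not in similar_people and is ignored by the join
])
conn.executemany("INSERT INTO similar_people VALUES (?)", [(10,), (11,)])

# Same joins as the query above, plus a filter that drops the seed movies.
recommendations = conn.execute("""
    SELECT m.title, COUNT(*) AS five_star_count
    FROM ratings r
    INNER JOIN similar_people sp ON r.userid = sp.userid
    INNER JOIN movies m ON r.movieid = m.movieid
    WHERE r.rating = 5
      AND r.movieid NOT IN (4011, 1270)
    GROUP BY m.title
    ORDER BY five_star_count DESC
    LIMIT 20
""").fetchall()

for title, five_star_count in recommendations:
    print(title, five_star_count)
```

The same `NOT IN` predicate should carry over to the Spark SQL cell unchanged, since it is standard SQL.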
