# Class 2: Overview

[MovieLens](https://movielens.org/) is a project run by the [GroupLens](https://grouplens.org/) research lab at the University of Minnesota. It has collected millions of movie ratings over many years to promote research into recommendation systems.

They provide the collected review data, over 30 million ratings, for non-commercial use.

We will be using the "small" version of the data for this exercise, which contains only 100,000 reviews.


Citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872




In [2]:
import duckdb
import os
import pandas as pd
import urllib.request
from zipfile import ZipFile
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances


from matplotlib import pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [3]:
filename = "ml-latest-small.zip"
path = "/Users/yashwanth/Documents/GWU/Sem 3/Data Mining/Class 2/Class Material/"
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
if os.path.isfile(path+filename):
    print(f'file already downloaded: {filename}')
else:
    print(f'downloading file: {filename}')
    headers = urllib.request.urlretrieve(url, filename=path+filename)

with ZipFile(path+filename, 'r') as zip_file:
    files = zip_file.namelist()
    for f in files: print(f)
    zip_file.extractall(path=path)
    print("Extracted")



downloading file: ml-latest-small.zip
ml-latest-small/
ml-latest-small/links.csv
ml-latest-small/tags.csv
ml-latest-small/ratings.csv
ml-latest-small/README.txt
ml-latest-small/movies.csv
Extracted


In [4]:
extracted_path = path+'ml-latest-small/'

duckdb.sql(f'CREATE TEMPORARY VIEW links AS (SELECT * FROM "{extracted_path}links.csv")')
duckdb.sql(f'CREATE TEMPORARY VIEW tags AS (SELECT * FROM "{extracted_path}tags.csv")')
duckdb.sql(f'CREATE TEMPORARY VIEW ratings AS (SELECT * FROM "{extracted_path}ratings.csv")')
duckdb.sql(f'CREATE TEMPORARY VIEW movies AS (SELECT * FROM "{extracted_path}movies.csv")')

In [5]:
duckdb.sql("SELECT * FROM ratings LIMIT 5").df()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
duckdb.sql("SELECT * FROM movies LIMIT 5").df()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
duckdb.sql("SELECT * FROM tags LIMIT 5").df()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [8]:
# The "links" data allows you to link a movie to its IMDB page or themoviedb.org page
# This is useful if you are building an application and wish to create links or pull in content
# from those sources, but we will just ignore this here.
duckdb.sql("SELECT * FROM links LIMIT 5").df()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862


# Sorting out the facts and the dimensions

In looking at this data, I believe you can separate out a few "dimensions" in the data:

* **Users**: the only reference to users in the data is the `userId`, but presumably MovieLens itself keeps track of other information about the user. The `userId` field is the primary key here.
* **Movies**: movies are dimensions as well, with primary key `movieId`.
* **Genres**: each genre would be its own dimension, with a relationship to Movies.
    * Note: this is sometimes called an "outrigger dimension" because it is a dimension table that only joins to another dimension table. It can also be an example of "snowflaking", which is an alternative method of constructing a data warehouse that is more normalized, but also harder for users to query. In this 'dimensional modeling' approach, outriggers and snowflaking should be kept to a minimum.
* **Tags**: each individual tag is a dimension. However, we again have very little data here.

We also have one that is debatable:
* **Time** or **Date**: we have a "timestamp" field, which captures the time of the rating. Whether we include this would depend on the business and typical queries. Let's say that we are interested in dates, but not specific timestamps.

As for facts, it seems we have 2 distinct sets:

* **Ratings**: This is our primary "fact" table and would presumably be the one we find the most interesting/useful.
* **Tag Events**: tags are somewhat like ratings, in that they are discrete events that occur. There are not a huge number of tagging events or tags themselves in this data. (A `COUNT(*)` returns only about 3600 tags.)

Let's go ahead and create each of these tables and then look at what querying them looks like.

In [10]:
## First, we will create a User table.
## Because there is no other information about users, we will derive a field, that we will call "date joined", based on their earliest review

query = '''

CREATE OR REPLACE TABLE user_dim AS (
    WITH ratings_w_date AS

    (
        SELECT
            UserId
            , DATE_TRUNC('day', MAKE_TIMESTAMP(1000000*timestamp) ) AS rating_date
        FROM
            ratings
    )

    , user_data AS
    (
        SELECT
            UserId
            , MIN(rating_date) AS date_joined
        FROM
            ratings_w_date
        GROUP BY UserId
    )

    SELECT * FROM user_data
)'''

duckdb.sql(query)
duckdb.sql('SELECT * FROM user_dim LIMIT 10').df()

Unnamed: 0,userId,date_joined
0,42,2001-07-27
1,50,2017-07-20
2,70,2012-12-11
3,74,2008-04-06
4,107,1996-04-12
5,108,2003-01-17
6,124,2012-05-07
7,125,2016-09-17
8,128,1998-06-28
9,130,1996-05-20
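
As an aside on the `1000000*timestamp` factor above: the MovieLens `timestamp` column holds seconds since the Unix epoch, while the single-argument `MAKE_TIMESTAMP` works in microseconds. The same conversion can be sketched in plain Python. The helper name `rating_date` is made up for illustration, and UTC is assumed.

```python
from datetime import datetime, timezone

def rating_date(epoch_seconds: int):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()

# Timestamp of the first rating shown in the ratings sample earlier
print(rating_date(964982703))  # 2000-07-30
```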


In [11]:
## Now, let's look at the movies and genres dimensions.
## We'll create the movie dimension table, dropping the genres, then we will create a separate genre table.

# Raw string, so the regex backslashes (\s, \d) are passed to DuckDB unchanged
query = r'''

CREATE OR REPLACE TABLE movie_dim AS
(
    SELECT
        movieId
        , title
        , REGEXP_REPLACE(title, '\s\(\d+\)', '') AS extracted_title
        , REGEXP_EXTRACT(title, '\((\d+)\)', 1) AS year_released
        , imdbId
        , tmdbId

    FROM
        movies m
    LEFT JOIN
        links l
    USING (movieId)
)

'''
duckdb.sql(query)
duckdb.sql('SELECT * FROM movie_dim LIMIT 10').df()

Unnamed: 0,movieId,title,extracted_title,year_released,imdbId,tmdbId
0,1,Toy Story (1995),Toy Story,1995,114709,862
1,3,Grumpier Old Men (1995),Grumpier Old Men,1995,113228,15602
2,4,Waiting to Exhale (1995),Waiting to Exhale,1995,114885,31357
3,5,Father of the Bride Part II (1995),Father of the Bride Part II,1995,113041,11862
4,6,Heat (1995),Heat,1995,113277,949
5,7,Sabrina (1995),Sabrina,1995,114319,11860
6,8,Tom and Huck (1995),Tom and Huck,1995,112302,45325
7,9,Sudden Death (1995),Sudden Death,1995,114576,9091
8,11,"American President, The (1995)","American President, The",1995,112346,9087
9,12,Dracula: Dead and Loving It (1995),Dracula: Dead and Loving It,1995,112896,12110
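
To see what the two regular expressions in the query are doing, here is a rough Python equivalent using the standard `re` module. The function name `split_title` is hypothetical, and the sketch mirrors my understanding of the DuckDB defaults (replace only the first match, return an empty string when nothing matches). Note the raw strings (`r"..."`), which keep Python from treating `\s` and `\d` as string escapes.

```python
import re

def split_title(title: str):
    """Return (title without the year suffix, year as a string or '')."""
    # Remove the first " (digits)" group, e.g. " (1995)"
    clean = re.sub(r"\s\(\d+\)", "", title, count=1)
    # Capture the digits inside the first "(digits)" group
    match = re.search(r"\((\d+)\)", title)
    return clean, (match.group(1) if match else "")

print(split_title("Toy Story (1995)"))
print(split_title("American President, The (1995)"))
```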


In [12]:
## Now, onto genres
## Note:

query = '''

CREATE OR REPLACE TABLE genre_dim AS
(
    SELECT
        movieId
        , UNNEST(SPLIT(genres, '|')) AS genre
    FROM
        movies
)
'''

duckdb.sql(query)
duckdb.sql('SELECT * FROM genre_dim LIMIT 10').df()

Unnamed: 0,movieId,genre
0,1,Adventure
1,1,Animation
2,1,Children
3,1,Comedy
4,1,Fantasy
5,2,Adventure
6,2,Children
7,2,Fantasy
8,3,Comedy
9,3,Romance
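
The `UNNEST(SPLIT(...))` step turns one row per movie into one row per (movie, genre) pair. A small plain-Python sketch of the same reshaping, using two rows copied from the movies sample above:

```python
# (movieId, genres) pairs taken from the movies sample output
movies = [
    (1, "Adventure|Animation|Children|Comedy|Fantasy"),
    (3, "Comedy|Romance"),
]

# One output row per genre, repeating the movieId each time
genre_rows = [
    (movie_id, genre)
    for movie_id, genres in movies
    for genre in genres.split("|")
]

for row in genre_rows:
    print(row)
```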


### Dates

We are going to skip creating a dimension table for tags, because it would only include the tag itself (and possibly a tagId that we create). But let's address dates. One thing about date dimension tables is that we should _not_ derive these from the data itself: we may not have data that covers every day. Since we know in advance what all possible dates are, we should create this type of table independently.

Because this is a fairly standard table used in many different situations, I found an example [here, as a gist](https://gist.github.com/adityawarmanfw/0612333605d351f2f1fe5c87e1af20d2) and made a quick modification to expand the date range.

This table is kind of overkill for this use! But there are places (imagine sales analysis for a product/company) where all of the detailed date information is useful for grouping.
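
The idea of generating every date up front, rather than deriving dates from the ratings, can be sketched in a few lines of plain Python. This is only an illustration with a handful of columns; `build_date_dim` and `DAY_NAMES` are names I made up here, not part of the gist.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_date_dim(start: date, end: date):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current,
            "day_name": DAY_NAMES[current.weekday()],  # weekday(): Monday is 0
            "month_key": f"{current.year}{current.month:02d}",
            "quarter_of_year": (current.month - 1) // 3 + 1,
        })
        current += timedelta(days=1)
    return rows

# Every day of 2000 is present, whether or not anyone rated a movie that day
date_dim_2000 = build_date_dim(date(2000, 1, 1), date(2000, 12, 31))
print(len(date_dim_2000))  # 366, since 2000 is a leap year
```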

In [14]:

query = '''
CREATE OR REPLACE TABLE date_dim AS (
     WITH generate_date AS (
        SELECT CAST(RANGE AS DATE) AS date_key
          FROM RANGE(DATE '1900-01-01', DATE '2101-01-01', INTERVAL 1 DAY)  -- RANGE excludes its end bound, so the last date is 2100-12-31
          )
   SELECT date_key AS date_key,
          DAYOFYEAR(date_key) AS day_of_year,
          YEARWEEK(date_key) AS week_key,
          WEEKOFYEAR(date_key) AS week_of_year,
          DAYOFWEEK(date_key) AS day_of_week,
          ISODOW(date_key) AS iso_day_of_week,
          DAYNAME(date_key) AS day_name,
          DATE_TRUNC('week', date_key) AS first_day_of_week,
          DATE_TRUNC('week', date_key) + 6 AS last_day_of_week,
          YEAR(date_key) || RIGHT('0' || MONTH(date_key), 2) AS month_key,
          MONTH(date_key) AS month_of_year,
          DAYOFMONTH(date_key) AS day_of_month,
          LEFT(MONTHNAME(date_key), 3) AS month_name_short,
          MONTHNAME(date_key) AS month_name,
          DATE_TRUNC('month', date_key) AS first_day_of_month,
          LAST_DAY(date_key) AS last_day_of_month,
          CAST(YEAR(date_key) || QUARTER(date_key) AS INT) AS quarter_key,
          QUARTER(date_key) AS quarter_of_year,
          CAST(date_key - DATE_TRUNC('Quarter', date_key) + 1 AS INT) AS day_of_quarter,
          ('Q' || QUARTER(date_key)) AS quarter_desc_short,
          ('Quarter ' || QUARTER(date_key)) AS quarter_desc,
          DATE_TRUNC('quarter', date_key) AS first_day_of_quarter,
          LAST_DAY(DATE_TRUNC('quarter', date_key) + INTERVAL 2 MONTH) as last_day_of_quarter,
          CAST(YEAR(date_key) AS INT) AS year_key,
          DATE_TRUNC('Year', date_key) AS first_day_of_year,
          DATE_TRUNC('Year', date_key) - 1 + INTERVAL 1 YEAR AS last_day_of_year,
          ROW_NUMBER() OVER (PARTITION BY YEAR(date_key), MONTH(date_key), DAYOFWEEK(date_key) ORDER BY date_key) AS ordinal_weekday_of_month
     FROM generate_date
)
'''

duckdb.sql(query)
duckdb.sql('SELECT * FROM date_dim LIMIT 10').df()


Unnamed: 0,date_key,day_of_year,week_key,week_of_year,day_of_week,iso_day_of_week,day_name,first_day_of_week,last_day_of_week,month_key,month_of_year,day_of_month,month_name_short,month_name,first_day_of_month,last_day_of_month,quarter_key,quarter_of_year,day_of_quarter,quarter_desc_short,quarter_desc,first_day_of_quarter,last_day_of_quarter,year_key,first_day_of_year,last_day_of_year,ordinal_weekday_of_month
0,1900-01-05,5,190001,1,5,5,Friday,1900-01-01,1900-01-07,190001,1,5,Jan,January,1900-01-01,1900-01-31,19001,1,5,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,1
1,1900-01-12,12,190002,2,5,5,Friday,1900-01-08,1900-01-14,190001,1,12,Jan,January,1900-01-01,1900-01-31,19001,1,12,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,2
2,1900-01-19,19,190003,3,5,5,Friday,1900-01-15,1900-01-21,190001,1,19,Jan,January,1900-01-01,1900-01-31,19001,1,19,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,3
3,1900-01-26,26,190004,4,5,5,Friday,1900-01-22,1900-01-28,190001,1,26,Jan,January,1900-01-01,1900-01-31,19001,1,26,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,4
4,1900-02-04,35,190005,5,0,7,Sunday,1900-01-29,1900-02-04,190002,2,4,Feb,February,1900-02-01,1900-02-28,19001,1,35,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,1
5,1900-02-11,42,190006,6,0,7,Sunday,1900-02-05,1900-02-11,190002,2,11,Feb,February,1900-02-01,1900-02-28,19001,1,42,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,2
6,1900-02-18,49,190007,7,0,7,Sunday,1900-02-12,1900-02-18,190002,2,18,Feb,February,1900-02-01,1900-02-28,19001,1,49,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,3
7,1900-02-25,56,190008,8,0,7,Sunday,1900-02-19,1900-02-25,190002,2,25,Feb,February,1900-02-01,1900-02-28,19001,1,56,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,4
8,1900-03-07,66,190010,10,3,3,Wednesday,1900-03-05,1900-03-11,190003,3,7,Mar,March,1900-03-01,1900-03-31,19001,1,66,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,1
9,1900-03-14,73,190011,11,3,3,Wednesday,1900-03-12,1900-03-18,190003,3,14,Mar,March,1900-03-01,1900-03-31,19001,1,73,Q1,Quarter 1,1900-01-01,1900-03-31,1900,1900-01-01,1900-12-31,2


In [15]:
# Fact tables
# We only need a few minor updates to the ratings table and will leave the tags table alone.

query = '''
CREATE OR REPLACE TABLE ratings_fact AS
(
    SELECT
          userId
        , movieId
        , rating
        , timestamp
        , DATE_TRUNC('day', MAKE_TIMESTAMP(1000000*timestamp) ) AS rating_date
    FROM
        ratings
)
'''
duckdb.sql(query)
duckdb.sql('SELECT * FROM ratings_fact LIMIT 10').df()

Unnamed: 0,userId,movieId,rating,timestamp,rating_date
0,1,1,4.0,964982703,2000-07-30
1,1,3,4.0,964981247,2000-07-30
2,1,6,4.0,964982224,2000-07-30
3,1,47,5.0,964983815,2000-07-30
4,1,50,5.0,964982931,2000-07-30
5,1,70,3.0,964982400,2000-07-30
6,1,101,5.0,964980868,2000-07-30
7,1,110,4.0,964982176,2000-07-30
8,1,151,5.0,964984041,2000-07-30
9,1,157,5.0,964984100,2000-07-30


![schema](./movieLens_dimensional_schame_example.png)

In [17]:
## Most queries related to reviews are now straightforward queries starting from the "fact" table
## and then joining against various dimension tables.
## Query: Which movies have the most reviews?

duckdb.sql("""
SELECT
    m.title
    , AVG(r.rating) AS avg_rating
    , COUNT(r.rating) AS num_ratings
FROM
    ratings_fact r
LEFT JOIN
    movie_dim m
USING (movieId)
GROUP BY
    m.title
ORDER BY num_ratings DESC
LIMIT 10
""").df()

Unnamed: 0,title,avg_rating,num_ratings
0,Forrest Gump (1994),4.164134,329
1,"Shawshank Redemption, The (1994)",4.429022,317
2,Pulp Fiction (1994),4.197068,307
3,"Silence of the Lambs, The (1991)",4.16129,279
4,"Matrix, The (1999)",4.192446,278
5,Star Wars: Episode IV - A New Hope (1977),4.231076,251
6,Jurassic Park (1993),3.75,238
7,Braveheart (1995),4.031646,237
8,Terminator 2: Judgment Day (1991),3.970982,224
9,Schindler's List (1993),4.225,220


## Query: Do people write more reviews on weekends?

duckdb.sql("""
SELECT
    d.day_name
     , AVG(r.rating) AS avg_rating
    , COUNT(r.rating) AS num_ratings
FROM
    ratings_fact r
LEFT JOIN
    date_dim d
ON r.rating_date = d.date_key
GROUP BY
    d.day_name
ORDER BY num_ratings DESC
LIMIT 10
""").df()

# Blank spaces for extra queries in class

# Creating a design matrix

Our design matrix is going to have _users_ for rows.

Conveniently, we have a user dimension table that we can use as our starting point. Let's create our design matrix table, along with the first few features.

To start, we can make the following features:
* date_joined
* Number of Movie Reviews
* Average rating

I am going to name this table the "driver" table. Because the process of creating features can be many steps, it is a good pattern to first create a basic shell of the table you want in the end, where each thing (user) you want to analyze/model has its own row.

In [21]:
query = '''

CREATE OR REPLACE TABLE driver AS

(
    SELECT
          u.userId
        , COUNT(*) AS number_reviews
        , AVG(rating) AS average_rating
    FROM
        user_dim u
    LEFT JOIN
        ratings_fact r
    USING (userId)
    GROUP BY u.userID
)

'''

duckdb.sql(query)
duckdb.sql("SELECT * FROM driver LIMIT 10").df()

Unnamed: 0,userId,number_reviews,average_rating
0,1,232,4.366379
1,2,29,3.948276
2,3,39,2.435897
3,4,216,3.555556
4,5,44,3.636364
5,6,314,3.493631
6,7,152,3.230263
7,8,47,3.574468
8,9,46,3.26087
9,10,140,3.278571
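
One subtlety worth knowing about the query above: after a `LEFT JOIN`, `COUNT(*)` counts a row even for a user with no matching ratings, while `COUNT(r.rating)` does not. That cannot happen here, because `user_dim` was derived from the ratings, but it matters when the driver table comes from somewhere else. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption: this is standard SQL and I expect both engines to agree), with made-up rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user_dim (userId INTEGER);
    CREATE TABLE ratings_fact (userId INTEGER, rating REAL);
    INSERT INTO user_dim VALUES (1), (2);             -- user 2 has no ratings
    INSERT INTO ratings_fact VALUES (1, 4.0), (1, 5.0);
""")

rows = con.execute("""
    SELECT
          u.userId
        , COUNT(*)        AS count_star
        , COUNT(r.rating) AS count_rating
        , AVG(r.rating)   AS average_rating
    FROM user_dim u
    LEFT JOIN ratings_fact r ON u.userId = r.userId
    GROUP BY u.userId
    ORDER BY u.userId
""").fetchall()

for row in rows:
    print(row)
```

For the user without ratings, `count_star` is 1 and `count_rating` is 0, so counting a column from the fact table is the safer habit.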


### Adding some genre features

Perhaps we want to create features based on the average rating of movies by the user for each genre.

That will require a column per genre.

Let's look at how many genres there are.

In [23]:
duckdb.sql("SELECT DISTINCT genre FROM genre_dim").df()

Unnamed: 0,genre
0,Comedy
1,Romance
2,Crime
3,Animation
4,IMAX
5,Children
6,Thriller
7,Western
8,Fantasy
9,Action


In [24]:
query = '''
CREATE OR REPLACE TABLE driver2 AS (
    WITH genre_rating AS (

        SELECT
              userId
            , rating
            , g.genre
        FROM
            ratings_fact r
        LEFT JOIN
            genre_dim g
        USING (movieId)

    )
    , genre_features AS (
        PIVOT genre_rating
        ON  genre
        USING
            AVG(rating) AS avg_rating_genre_feature
            , COUNT(rating) AS num_rating_genre_feature
    )

    SELECT
        *
        FROM
            driver d
        LEFT JOIN
            genre_features gf
        USING (userId)
)
'''
duckdb.sql(query)
duckdb.sql("SELECT * FROM driver2 LIMIT 100").df()

Unnamed: 0,userId,number_reviews,average_rating,(no genres listed)_avg_rating_genre_feature,(no genres listed)_num_rating_genre_feature,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature
0,1,232,4.366379,,0,4.322222,90,4.388235,85,4.689655,29,4.547619,42,4.277108,83,4.355556,45,,0,4.529412,68,4.297872,47,5.0,1,3.470588,17,,0,4.681818,22,4.166667,18,4.307692,26,4.225,40,4.145455,55,4.5,22,4.285714,7
1,2,29,3.948276,,0,3.954545,11,4.166667,3,,0,,0,4.0,7,3.8,10,4.333333,3,3.882353,17,,0,,0,3.0,1,3.75,4,,0,4.0,2,4.5,1,3.875,4,3.7,10,4.5,1,3.5,1
2,3,39,2.435897,,0,3.571429,14,2.727273,11,0.5,4,0.5,5,1.0,9,0.5,2,,0,0.75,16,3.375,4,,0,4.6875,8,,0,0.5,1,5.0,1,0.5,5,4.2,15,4.142857,7,0.5,5,,0
3,4,216,3.555556,,0,3.32,25,3.655172,29,4.0,6,3.8,10,3.509615,104,3.814815,27,4.0,2,3.483333,120,3.684211,19,4.0,4,4.25,4,3.0,1,4.0,16,3.478261,23,3.37931,58,2.833333,12,3.552632,38,3.571429,7,3.8,10
4,5,44,3.636364,,0,3.111111,9,3.25,8,4.333333,6,4.111111,9,3.466667,15,3.833333,12,,0,3.8,25,4.142857,7,,0,3.0,1,3.666667,3,4.4,5,4.0,1,3.090909,11,2.5,2,3.555556,9,3.333333,3,3.0,2
5,6,314,3.493631,,0,3.609375,64,3.893617,47,4.071429,14,3.617021,47,3.370079,127,3.285714,35,,0,3.614286,140,3.538462,26,2.5,2,3.263158,19,4.666667,3,4.166667,12,3.733333,15,3.614286,70,3.47619,21,3.544118,68,3.583333,12,3.818182,11
6,7,152,3.230263,,0,3.257812,64,3.314815,54,3.392857,14,3.2,15,3.163265,49,3.307692,26,,0,3.131579,57,3.065217,23,3.25,2,4.0,5,2.454545,11,3.666667,9,3.178571,14,2.65,30,3.154762,42,3.430233,43,3.291667,12,1.5,1
7,8,47,3.574468,,0,3.333333,12,3.545455,11,5.0,1,4.25,4,3.208333,24,3.888889,9,,0,3.789474,19,3.25,4,,0,4.5,2,4.5,2,5.0,1,4.0,3,3.5,14,3.25,4,3.75,16,3.666667,3,3.0,2
8,9,46,3.26087,,0,3.125,8,3.8,10,4.0,1,4.0,1,3.666667,15,3.142857,7,,0,3.428571,21,5.0,2,4.0,1,1.8,5,3.0,1,3.0,1,4.0,3,3.166667,6,3.0,8,2.545455,11,3.5,2,4.0,1
9,10,140,3.278571,,0,3.5,26,3.580645,31,3.866667,15,3.607143,14,3.265823,79,3.115385,13,,0,3.152778,72,3.441176,17,,0,1.75,2,3.361111,18,3.333333,9,2.166667,3,3.333333,78,2.0,5,3.076923,13,3.75,4,,0
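
If the `PIVOT` step feels like magic, here is roughly what it does, written out in plain Python: group the long (user, genre, rating) rows by cell, then emit one average column and one count column per genre. The input rows are invented, and the column naming simply imitates the output above.

```python
from collections import defaultdict

# Long format, like the genre_rating CTE: (userId, genre, rating)
genre_rating = [
    (1, "Comedy", 4.0),
    (1, "Comedy", 5.0),
    (1, "Drama", 3.0),
    (2, "Drama", 2.0),
]

genres = sorted({genre for _, genre, _ in genre_rating})
users = sorted({user_id for user_id, _, _ in genre_rating})

# Collect the ratings that fall into each (user, genre) cell
ratings_by_cell = defaultdict(list)
for user_id, genre, rating in genre_rating:
    ratings_by_cell[(user_id, genre)].append(rating)

# Wide format: one row per user, two columns per genre
wide = {}
for user_id in users:
    row = {}
    for genre in genres:
        values = ratings_by_cell.get((user_id, genre), [])
        # No ratings in a cell gives a null average and a count of 0
        row[f"{genre}_avg_rating_genre_feature"] = (
            sum(values) / len(values) if values else None
        )
        row[f"{genre}_num_rating_genre_feature"] = len(values)
    wide[user_id] = row

print(wide[2])
```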


## Null values

OK, so one thing that you might notice here is that there are some null values.

There appear to be very few people who rate movies with no genre listed (even looking at more rows, you will see this).

But even for relatively common genres, like `Children`, not everyone rates those.

Dealing with null values is a common problem in data mining. While some tools may handle these values gracefully, it is nearly always better to deal with null values yourself.

There are a lot of different ways we can deal with nulls:

* For a column like `(no genres listed)_avg_rating_genre_feature`, because there are so many null values, it may be best to just drop that column entirely.
* Dropping rows (users) is a trickier process, because those are what we are studying. If there's a row that has very poor data, perhaps we can drop it from our study, but we then need to decide what we will do when we encounter similar rows later.
* We can replace null values with some other number. This is called **_imputation_**.

Some common imputation strategies are:

* Replace all nulls with zeros:
    * This is simple and works well if 0 is a reasonable value to take. Here it is not: saying a user who has not rated any children's movies would rate them all as 0 (out of 5) is stating that that user _hates_ children's movies.
* Replace all nulls with some sort of average (also called _mean imputation_):
    * We could replace the null children's average rating with either the average rating of that user or, possibly, the average rating of all children's movies.
* Try something more sophisticated, by inferring what value it should have.

We will just use a mean imputation, assigning the user's average rating to all null genres.
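
To make the difference between the two simple strategies concrete, here is a tiny plain-Python sketch. The values are user 2's from the table above (that user has no `Children` ratings); the `impute` helper is hypothetical and only for illustration.

```python
def impute(row: dict, fill: float) -> dict:
    """Return a copy of row with every None replaced by fill."""
    return {key: (fill if value is None else value) for key, value in row.items()}

user = {
    "average_rating": 3.948276,
    "Action_avg_rating_genre_feature": 3.954545,
    "Children_avg_rating_genre_feature": None,  # no Children ratings
}

zero_filled = impute(user, 0.0)                     # claims the user hates the genre
mean_filled = impute(user, user["average_rating"])  # assumes typical behavior

print(zero_filled["Children_avg_rating_genre_feature"])
print(mean_filled["Children_avg_rating_genre_feature"])
```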



In [26]:
duckdb.sql("DESCRIBE TABLE driver2").df()['column_name'].to_list()

['userId',
 'number_reviews',
 'average_rating',
 '(no genres listed)_avg_rating_genre_feature',
 '(no genres listed)_num_rating_genre_feature',
 'Action_avg_rating_genre_feature',
 'Action_num_rating_genre_feature',
 'Adventure_avg_rating_genre_feature',
 'Adventure_num_rating_genre_feature',
 'Animation_avg_rating_genre_feature',
 'Animation_num_rating_genre_feature',
 'Children_avg_rating_genre_feature',
 'Children_num_rating_genre_feature',
 'Comedy_avg_rating_genre_feature',
 'Comedy_num_rating_genre_feature',
 'Crime_avg_rating_genre_feature',
 'Crime_num_rating_genre_feature',
 'Documentary_avg_rating_genre_feature',
 'Documentary_num_rating_genre_feature',
 'Drama_avg_rating_genre_feature',
 'Drama_num_rating_genre_feature',
 'Fantasy_avg_rating_genre_feature',
 'Fantasy_num_rating_genre_feature',
 'Film-Noir_avg_rating_genre_feature',
 'Film-Noir_num_rating_genre_feature',
 'Horror_avg_rating_genre_feature',
 'Horror_num_rating_genre_feature',
 'IMAX_avg_rating_genre_feature',

In [27]:
query_snippet = '''

CREATE OR REPLACE TABLE user_matrix AS (

    SELECT
        userId'''


for column in duckdb.sql("DESCRIBE TABLE driver2").df()['column_name'].to_list():
    if 'no genres' not in column and 'userId' not in column:
        if 'genre_feature' not in column or 'num_rating' in column:
            query_snippet += f'''\n\t\t, "{column}"'''
        else:
            query_snippet += f'''\n\t\t, COALESCE("{column}", "average_rating") AS "{column}" '''

query_snippet += """
    FROM
        driver2
)"""
print(query_snippet)




CREATE OR REPLACE TABLE user_matrix AS (

    SELECT
        userId
		, "number_reviews"
		, "average_rating"
		, COALESCE("Action_avg_rating_genre_feature", "average_rating") AS "Action_avg_rating_genre_feature" 
		, "Action_num_rating_genre_feature"
		, COALESCE("Adventure_avg_rating_genre_feature", "average_rating") AS "Adventure_avg_rating_genre_feature" 
		, "Adventure_num_rating_genre_feature"
		, COALESCE("Animation_avg_rating_genre_feature", "average_rating") AS "Animation_avg_rating_genre_feature" 
		, "Animation_num_rating_genre_feature"
		, COALESCE("Children_avg_rating_genre_feature", "average_rating") AS "Children_avg_rating_genre_feature" 
		, "Children_num_rating_genre_feature"
		, COALESCE("Comedy_avg_rating_genre_feature", "average_rating") AS "Comedy_avg_rating_genre_feature" 
		, "Comedy_num_rating_genre_feature"
		, COALESCE("Crime_avg_rating_genre_feature", "average_rating") AS "Crime_avg_rating_genre_feature" 
		, "Crime_num_rating_genre_feature"
		, COALESCE("D

In [28]:
duckdb.sql(query_snippet)
duckdb.sql('SELECT * FROM user_matrix LIMIT 100').df()

Unnamed: 0,userId,number_reviews,average_rating,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature
0,1,232,4.366379,4.322222,90,4.388235,85,4.689655,29,4.547619,42,4.277108,83,4.355556,45,4.366379,0,4.529412,68,4.297872,47,5.0,1,3.470588,17,4.366379,0,4.681818,22,4.166667,18,4.307692,26,4.225,40,4.145455,55,4.5,22,4.285714,7
1,2,29,3.948276,3.954545,11,4.166667,3,3.948276,0,3.948276,0,4.0,7,3.8,10,4.333333,3,3.882353,17,3.948276,0,3.948276,0,3.0,1,3.75,4,3.948276,0,4.0,2,4.5,1,3.875,4,3.7,10,4.5,1,3.5,1
2,3,39,2.435897,3.571429,14,2.727273,11,0.5,4,0.5,5,1.0,9,0.5,2,2.435897,0,0.75,16,3.375,4,2.435897,0,4.6875,8,2.435897,0,0.5,1,5.0,1,0.5,5,4.2,15,4.142857,7,0.5,5,2.435897,0
3,4,216,3.555556,3.32,25,3.655172,29,4.0,6,3.8,10,3.509615,104,3.814815,27,4.0,2,3.483333,120,3.684211,19,4.0,4,4.25,4,3.0,1,4.0,16,3.478261,23,3.37931,58,2.833333,12,3.552632,38,3.571429,7,3.8,10
4,5,44,3.636364,3.111111,9,3.25,8,4.333333,6,4.111111,9,3.466667,15,3.833333,12,3.636364,0,3.8,25,4.142857,7,3.636364,0,3.0,1,3.666667,3,4.4,5,4.0,1,3.090909,11,2.5,2,3.555556,9,3.333333,3,3.0,2
5,6,314,3.493631,3.609375,64,3.893617,47,4.071429,14,3.617021,47,3.370079,127,3.285714,35,3.493631,0,3.614286,140,3.538462,26,2.5,2,3.263158,19,4.666667,3,4.166667,12,3.733333,15,3.614286,70,3.47619,21,3.544118,68,3.583333,12,3.818182,11
6,7,152,3.230263,3.257812,64,3.314815,54,3.392857,14,3.2,15,3.163265,49,3.307692,26,3.230263,0,3.131579,57,3.065217,23,3.25,2,4.0,5,2.454545,11,3.666667,9,3.178571,14,2.65,30,3.154762,42,3.430233,43,3.291667,12,1.5,1
7,8,47,3.574468,3.333333,12,3.545455,11,5.0,1,4.25,4,3.208333,24,3.888889,9,3.574468,0,3.789474,19,3.25,4,3.574468,0,4.5,2,4.5,2,5.0,1,4.0,3,3.5,14,3.25,4,3.75,16,3.666667,3,3.0,2
8,9,46,3.26087,3.125,8,3.8,10,4.0,1,4.0,1,3.666667,15,3.142857,7,3.26087,0,3.428571,21,5.0,2,4.0,1,1.8,5,3.0,1,3.0,1,4.0,3,3.166667,6,3.0,8,2.545455,11,3.5,2,4.0,1
9,10,140,3.278571,3.5,26,3.580645,31,3.866667,15,3.607143,14,3.265823,79,3.115385,13,3.278571,0,3.152778,72,3.441176,17,3.278571,0,1.75,2,3.361111,18,3.333333,9,2.166667,3,3.333333,78,2.0,5,3.076923,13,3.75,4,3.278571,0


## Normalizing values

We are going to load the current design matrix into a pandas DataFrame and use some tools from scikit-learn for the next few steps. The data is pretty tiny, so pulling it all into memory is not a problem.

In [30]:
user_matrix = duckdb.sql('SELECT * FROM user_matrix ORDER BY userId').df()
print(user_matrix.shape)  # (number of users, number of columns)
user_matrix

(610, 41)


Unnamed: 0,userId,number_reviews,average_rating,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature
0,1,232,4.366379,4.322222,90,4.388235,85,4.689655,29,4.547619,42,4.277108,83,4.355556,45,4.366379,0,4.529412,68,4.297872,47,5.000000,1,3.470588,17,4.366379,0,4.681818,22,4.166667,18,4.307692,26,4.225000,40,4.145455,55,4.500000,22,4.285714,7
1,2,29,3.948276,3.954545,11,4.166667,3,3.948276,0,3.948276,0,4.000000,7,3.800000,10,4.333333,3,3.882353,17,3.948276,0,3.948276,0,3.000000,1,3.750000,4,3.948276,0,4.000000,2,4.500000,1,3.875000,4,3.700000,10,4.500000,1,3.500000,1
2,3,39,2.435897,3.571429,14,2.727273,11,0.500000,4,0.500000,5,1.000000,9,0.500000,2,2.435897,0,0.750000,16,3.375000,4,2.435897,0,4.687500,8,2.435897,0,0.500000,1,5.000000,1,0.500000,5,4.200000,15,4.142857,7,0.500000,5,2.435897,0
3,4,216,3.555556,3.320000,25,3.655172,29,4.000000,6,3.800000,10,3.509615,104,3.814815,27,4.000000,2,3.483333,120,3.684211,19,4.000000,4,4.250000,4,3.000000,1,4.000000,16,3.478261,23,3.379310,58,2.833333,12,3.552632,38,3.571429,7,3.800000,10
4,5,44,3.636364,3.111111,9,3.250000,8,4.333333,6,4.111111,9,3.466667,15,3.833333,12,3.636364,0,3.800000,25,4.142857,7,3.636364,0,3.000000,1,3.666667,3,4.400000,5,4.000000,1,3.090909,11,2.500000,2,3.555556,9,3.333333,3,3.000000,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606,1115,3.657399,3.178808,151,3.503401,147,3.714286,42,3.448980,49,3.565321,421,3.654135,133,3.800000,5,3.787966,698,3.597938,97,3.812500,8,3.346154,52,3.062500,16,3.727273,44,3.791209,91,3.740845,355,3.556962,79,3.525126,199,3.792308,65,3.411765,17
606,607,187,3.786096,3.722222,72,3.466667,45,3.333333,6,3.421053,19,3.327273,55,3.814815,27,3.786096,0,4.012195,82,3.571429,21,3.786096,0,4.114286,35,5.000000,1,3.600000,5,4.647059,17,3.517241,29,3.250000,36,4.114754,61,4.166667,6,4.000000,2
607,608,831,3.134176,3.330325,277,3.220994,181,3.118182,55,2.460227,88,2.736620,355,3.613014,146,3.000000,6,3.437500,280,3.000000,111,3.750000,4,3.319588,97,4.000000,12,2.757576,33,3.550725,69,2.886792,106,3.296407,167,3.536680,259,3.578947,19,2.636364,11
608,609,37,3.270270,3.090909,11,3.200000,10,3.000000,1,3.000000,2,3.285714,7,3.500000,6,3.000000,2,3.368421,19,3.000000,1,3.270270,0,3.500000,2,3.000000,1,3.270270,0,3.270270,0,3.200000,5,3.000000,5,3.285714,14,3.500000,4,4.000000,1
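
The normalization in this section is min-max scaling: each column is rescaled independently as `(x - min) / (max - min)`, so its smallest value becomes 0 and its largest becomes 1. A minimal plain-Python sketch for a single column, with made-up values (the function name `min_max_scale` is mine, and mapping a constant column to 0.0 is my reading of how the scikit-learn scaler behaves):

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

number_reviews = [20, 65, 110]
print(min_max_scale(number_reviews))  # [0.0, 0.5, 1.0]
```

Putting every feature on the same 0-to-1 scale keeps large-valued columns, such as review counts, from dominating small-valued ones, such as average ratings, when we compute distances between users.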


In [31]:
# scikit-learn has a built-in min-max scaler. userId is an identifier, not a feature, so we remove it before scaling and add it back afterwards

scaler =  MinMaxScaler()

without_userid = user_matrix.drop('userId', axis=1)

normalized = scaler.fit_transform(without_userid)

normalized_df = pd.DataFrame(normalized, columns = without_userid.columns)

normalized_df['userId'] = user_matrix['userId']
normalized_df

Unnamed: 0,number_reviews,average_rating,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature,userId
0,0.079164,0.829900,0.843590,0.118734,0.864052,0.185996,0.931034,0.177914,0.899471,0.237288,0.819277,0.076923,0.856790,0.107399,0.850913,0.000000,0.889273,0.051223,0.824468,0.188,1.000000,0.020833,0.660131,0.056106,0.859195,0.00,0.929293,0.164179,0.814815,0.110429,0.846154,0.050980,0.827778,0.095694,0.810101,0.0880,0.888889,0.177419,0.841270,0.134615,1
1,0.003361,0.717658,0.758741,0.014512,0.814815,0.006565,0.766284,0.000000,0.766284,0.000000,0.750000,0.006487,0.733333,0.023866,0.843137,0.032258,0.737024,0.012232,0.737069,0.000,0.737069,0.000000,0.555556,0.003300,0.722222,0.04,0.766284,0.000000,0.777778,0.012270,0.888889,0.001961,0.750000,0.009569,0.711111,0.0160,0.888889,0.008065,0.666667,0.019231,2
2,0.007095,0.311650,0.670330,0.018470,0.494949,0.024070,0.000000,0.024540,0.000000,0.028249,0.000000,0.008341,0.000000,0.004773,0.396682,0.000000,0.000000,0.011468,0.593750,0.016,0.358974,0.000000,0.930556,0.026403,0.430199,0.00,0.000000,0.007463,1.000000,0.006135,0.000000,0.009804,0.822222,0.035885,0.809524,0.0112,0.000000,0.040323,0.430199,0.000000,3
3,0.073189,0.612230,0.612308,0.032982,0.701149,0.063457,0.777778,0.036810,0.733333,0.056497,0.627404,0.096386,0.736626,0.064439,0.764706,0.021505,0.643137,0.090979,0.671053,0.076,0.750000,0.083333,0.833333,0.013201,0.555556,0.01,0.777778,0.119403,0.661836,0.141104,0.639847,0.113725,0.518519,0.028708,0.678363,0.0608,0.682540,0.056452,0.733333,0.192308,4
4,0.008962,0.633923,0.564103,0.011873,0.611111,0.017505,0.851852,0.036810,0.802469,0.050847,0.616667,0.013902,0.740741,0.028640,0.679144,0.000000,0.717647,0.018349,0.785714,0.028,0.659091,0.000000,0.555556,0.003300,0.703704,0.03,0.866667,0.037313,0.777778,0.006135,0.575758,0.021569,0.444444,0.004785,0.679012,0.0144,0.629630,0.024194,0.555556,0.038462,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,0.408887,0.639570,0.579725,0.199208,0.667423,0.321663,0.714286,0.257669,0.655329,0.276836,0.641330,0.390176,0.700919,0.317422,0.717647,0.053763,0.714815,0.532875,0.649485,0.388,0.703125,0.166667,0.632479,0.171617,0.569444,0.16,0.717172,0.328358,0.731380,0.558282,0.720188,0.696078,0.679325,0.188995,0.672250,0.3184,0.731624,0.524194,0.647059,0.326923,606
606,0.062360,0.674120,0.705128,0.094987,0.659259,0.098468,0.629630,0.036810,0.649123,0.107345,0.581818,0.050973,0.736626,0.064439,0.714376,0.000000,0.767575,0.061927,0.642857,0.084,0.696524,0.000000,0.803175,0.115512,1.000000,0.01,0.688889,0.037313,0.921569,0.104294,0.670498,0.056863,0.611111,0.086124,0.803279,0.0976,0.814815,0.048387,0.777778,0.038462,607
607,0.302838,0.499108,0.614690,0.365435,0.604665,0.396061,0.581818,0.337423,0.435606,0.497175,0.434155,0.329008,0.691781,0.348449,0.529412,0.064516,0.632353,0.213303,0.500000,0.444,0.687500,0.083333,0.626575,0.320132,0.777778,0.12,0.501684,0.246269,0.677939,0.423313,0.530398,0.207843,0.621424,0.399522,0.674818,0.4144,0.684211,0.153226,0.474747,0.211538,608
608,0.006348,0.535643,0.559441,0.014512,0.600000,0.021882,0.555556,0.006135,0.555556,0.011299,0.571429,0.006487,0.666667,0.014320,0.529412,0.021505,0.616099,0.013761,0.500000,0.004,0.567568,0.000000,0.666667,0.006601,0.555556,0.01,0.615616,0.000000,0.615616,0.000000,0.600000,0.009804,0.555556,0.011962,0.619048,0.0224,0.666667,0.032258,0.777778,0.019231,609
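
To make the scaling step concrete, here is a minimal sketch on a made-up three-user frame (the `toy` values are hypothetical, not taken from MovieLens). Min-max scaling maps every column to the range [0, 1] using (x - min) / (max - min), so the manual computation below should agree with `MinMaxScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: three users, two features on very different scales
toy = pd.DataFrame({
    "number_reviews": [20, 120, 220],
    "average_rating": [2.0, 3.5, 5.0],
})

# Manual min-max scaling, applied column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual.to_numpy(), scaled))
```

After scaling, a count such as `number_reviews` and an average such as `average_rating` contribute on the same scale to any distance computed later.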


In [32]:
distances = euclidean_distances(normalized, normalized)
distances

array([[0.        , 0.76024771, 2.83679029, ..., 1.5963914 , 1.33477816,
        2.20936878],
       [0.76024771, 0.        , 2.44479253, ..., 1.66339849, 0.85152723,
        2.48191034],
       [2.83679029, 2.44479253, 0.        , ..., 2.23642112, 1.88654805,
        3.33192543],
       ...,
       [1.5963914 , 1.66339849, 2.23642112, ..., 0.        , 1.4579445 ,
        1.54417867],
       [1.33477816, 0.85152723, 1.88654805, ..., 1.4579445 , 0.        ,
        2.53727685],
       [2.20936878, 2.48191034, 3.33192543, ..., 1.54417867, 2.53727685,
        0.        ]])
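
Each entry `distances[i, j]` is the straight-line distance between the feature vectors of row `i` and row `j`, which is why the diagonal is zero and the matrix is symmetric. The sketch below checks this on three hypothetical vectors (the `toy` array is an assumption for illustration only).

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical feature vectors for three users, already scaled to [0, 1]
toy = np.array([
    [0.0, 0.0],
    [0.6, 0.8],
    [1.0, 0.0],
])

toy_distances = euclidean_distances(toy, toy)

# Distance between user 0 and user 1, computed directly from the definition:
# the square root of the summed squared differences
by_hand = np.sqrt(np.sum((toy[0] - toy[1]) ** 2))

print(toy_distances)
print(by_hand)
```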

In [33]:
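def find_most_similar(distances, normalized_df, index):
    # Scan row `index` of a pairwise matrix and return the position and value of
    # its smallest and largest entries, ignoring the entry for the user itself.
    # For distances the smallest entry is the most similar user; for similarities
    # it is the largest. (normalized_df is unused; kept so existing calls still work.)
    min_index = -1
    min_value = float('inf')
    max_value = float('-inf')
    max_index = -1
    for ix, val in enumerate(distances[index,:]):
        if ix == index:
            continue
        if val < min_value:
            min_index = ix
            min_value = val
        # a separate `if`, not `elif`: the first value seen is both the minimum and the maximum so far
        if val > max_value:
            max_index = ix
            max_value = val
    print(min_index, min_value, max_index, max_value)
    return (min_index, min_value, max_index, max_value)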


index = 0
min_index, min_value, max_index, max_value = find_most_similar(distances, normalized_df, index)

user_matrix.loc[[index, min_index, max_index],:] # it is easier to interpret the non-normalized values

165 0.3944590775291274 598 3.723797336952958


Unnamed: 0,userId,number_reviews,average_rating,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature
0,1,232,4.366379,4.322222,90,4.388235,85,4.689655,29,4.547619,42,4.277108,83,4.355556,45,4.366379,0,4.529412,68,4.297872,47,5.0,1,3.470588,17,4.366379,0,4.681818,22,4.166667,18,4.307692,26,4.225,40,4.145455,55,4.5,22,4.285714,7
165,166,190,4.073684,3.994048,84,4.084906,53,4.5,15,4.2,15,4.067308,52,4.201923,52,4.0,2,4.166667,84,4.034483,29,4.666667,3,3.75,14,4.428571,7,4.5,3,3.975,20,4.214286,28,4.029412,34,4.014706,68,4.136364,11,4.333333,3
598,599,2478,2.64205,2.736148,758,2.766816,446,2.91411,163,2.39759,166,2.422051,975,2.812649,419,3.067308,52,2.823267,1010,2.652,250,3.303571,28,2.397959,196,2.932692,52,2.615942,69,2.828221,163,2.666213,367,2.760766,418,2.703518,597,2.89881,84,2.653061,49
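
The same lookup can be done without an explicit loop. The following is only a sketch: `nearest_and_farthest` is a hypothetical helper name and the 4×4 matrix is made up. It masks out the user's own entry and then uses NumPy's `argmin` and `argmax`.

```python
import numpy as np

def nearest_and_farthest(distances, index):
    """Return (nearest_index, farthest_index) for row `index`, ignoring the row itself."""
    row = distances[index].astype(float)
    others = np.arange(row.size) != index        # True everywhere except the user itself
    nearest = np.where(others, row, np.inf).argmin()
    farthest = np.where(others, row, -np.inf).argmax()
    return int(nearest), int(farthest)

# Hypothetical symmetric distance matrix for four users
toy_distances = np.array([
    [0.0, 2.0, 0.5, 3.0],
    [2.0, 0.0, 1.0, 1.5],
    [0.5, 1.0, 0.0, 2.5],
    [3.0, 1.5, 2.5, 0.0],
])

print(nearest_and_farthest(toy_distances, 0))  # (2, 3)
```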


$Cosine(x,y) = \frac{x \cdot y}{|x||y|}$
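
As a small illustration of the formula, the sketch below computes the cosine by hand on hypothetical vectors (`a`, `b`, and `c` are made up). Cosine similarity depends only on the direction of the vectors, not their length, which is the main way it differs from Euclidean distance.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice as long
c = np.array([3.0, 0.0, -1.0])   # perpendicular to a (dot product is 0)

print(cosine(a, b))  # approximately 1: same direction
print(cosine(a, c))  # 0.0: no directional overlap
```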

In [35]:
# Measuring Similarity: Cosine Similarity

similarities = cosine_similarity(normalized, normalized)

index = 0
# Higher cosine similarity means more alike, so the roles flip relative to distances:
# min_index is now the least similar user and max_index the most similar one
min_index, min_value, max_index, max_value = find_most_similar(similarities, normalized_df, index)

user_matrix.loc[[index, min_index, max_index],:] # it is easier to interpret the non-normalized values

598 0.5986467590459792 128 0.9965737142598751


Unnamed: 0,userId,number_reviews,average_rating,Action_avg_rating_genre_feature,Action_num_rating_genre_feature,Adventure_avg_rating_genre_feature,Adventure_num_rating_genre_feature,Animation_avg_rating_genre_feature,Animation_num_rating_genre_feature,Children_avg_rating_genre_feature,Children_num_rating_genre_feature,Comedy_avg_rating_genre_feature,Comedy_num_rating_genre_feature,Crime_avg_rating_genre_feature,Crime_num_rating_genre_feature,Documentary_avg_rating_genre_feature,Documentary_num_rating_genre_feature,Drama_avg_rating_genre_feature,Drama_num_rating_genre_feature,Fantasy_avg_rating_genre_feature,Fantasy_num_rating_genre_feature,Film-Noir_avg_rating_genre_feature,Film-Noir_num_rating_genre_feature,Horror_avg_rating_genre_feature,Horror_num_rating_genre_feature,IMAX_avg_rating_genre_feature,IMAX_num_rating_genre_feature,Musical_avg_rating_genre_feature,Musical_num_rating_genre_feature,Mystery_avg_rating_genre_feature,Mystery_num_rating_genre_feature,Romance_avg_rating_genre_feature,Romance_num_rating_genre_feature,Sci-Fi_avg_rating_genre_feature,Sci-Fi_num_rating_genre_feature,Thriller_avg_rating_genre_feature,Thriller_num_rating_genre_feature,War_avg_rating_genre_feature,War_num_rating_genre_feature,Western_avg_rating_genre_feature,Western_num_rating_genre_feature
0,1,232,4.366379,4.322222,90,4.388235,85,4.689655,29,4.547619,42,4.277108,83,4.355556,45,4.366379,0,4.529412,68,4.297872,47,5.0,1,3.470588,17,4.366379,0,4.681818,22,4.166667,18,4.307692,26,4.225,40,4.145455,55,4.5,22,4.285714,7
598,599,2478,2.64205,2.736148,758,2.766816,446,2.91411,163,2.39759,166,2.422051,975,2.812649,419,3.067308,52,2.823267,1010,2.652,250,3.303571,28,2.397959,196,2.932692,52,2.615942,69,2.828221,163,2.666213,367,2.760766,418,2.703518,597,2.89881,84,2.653061,49
128,129,140,3.921429,4.013158,76,3.92029,69,4.113636,22,3.944444,18,3.858974,39,4.071429,14,3.921429,0,3.986842,38,4.083333,30,4.5,1,3.5,7,3.9375,8,4.055556,9,3.545455,11,4.125,24,3.895349,43,3.824324,37,4.25,8,3.921429,0


# Summary

We are only scratching the surface with these similarity measurements!

Clearly, there are some 'critics' in the data set with a very large number of reviews. Since half the features are counts (even though they are normalized), critics will all be more similar to each other than they are to a regular user. Perhaps that is the goal you have in mind. But if you only care whether the _ratings_ two customers give in a genre are similar, you can drop all of the count-based features.
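
One way to drop the count-based features is to select columns by name. This is a sketch on a hypothetical two-user slice that follows the column naming pattern used in this notebook; the values are made up.

```python
import pandas as pd

# Hypothetical slice of the user matrix with the same column naming pattern
toy_matrix = pd.DataFrame({
    "userId": [1, 2],
    "number_reviews": [232, 29],
    "average_rating": [4.37, 3.95],
    "Action_avg_rating_genre_feature": [4.32, 3.95],
    "Action_num_rating_genre_feature": [90, 11],
})

# Keep only the columns that describe ratings, not counts
rating_columns = [
    c for c in toy_matrix.columns
    if c == "average_rating" or c.endswith("_avg_rating_genre_feature")
]
ratings_only = toy_matrix[rating_columns]

print(ratings_only.columns.tolist())
```

The reduced frame could then be scaled and compared in the same way as the full matrix.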

Also, different people use different rating scales. Some (like our critic, userId = 599) are very critical and have low average ratings, while others are enthusiastic and rate most movies highly. In this case, perhaps our business problem is better served by features that take the difference between the user's genre rating average and the user's overall rating average. Then the question is not whether they _like_ Sci-Fi, but whether they like it _more or less_ than other genres.
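
A relative-preference feature of this kind could be sketched as follows. The frame and its values are hypothetical, and the `_relative_rating` suffix is a made-up naming choice, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical users: an enthusiastic rater and a critical one
toy_matrix = pd.DataFrame({
    "userId": [1, 599],
    "average_rating": [4.4, 2.6],
    "Sci-Fi_avg_rating_genre_feature": [4.2, 2.8],
    "Horror_avg_rating_genre_feature": [3.5, 2.4],
})

genre_columns = [c for c in toy_matrix.columns if c.endswith("_avg_rating_genre_feature")]

# Subtract each user's overall average from their per-genre averages (row by row)
relative = toy_matrix[genre_columns].sub(toy_matrix["average_rating"], axis=0)
relative.columns = [
    c.replace("_avg_rating_genre_feature", "_relative_rating") for c in genre_columns
]

print(relative.round(2))
```

In this toy data the critical user rates Sci-Fi above their own average even though the raw number (2.8) is low, which is exactly the signal the raw averages hide. In the user matrix shown earlier, genres with a count of 0 appear to carry the user's overall average, so this feature would come out as 0 for them.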

But, rest assured, we will have lots of practice with feature engineering throughout the rest of this course!

Finally, we are going to export our entire database so that we can reuse the tables we created for homework!

In [38]:
path_to_export = "/Users/yashwanth/Documents/GWU/Sem 3/Data Mining/Class 2/Class Material/MovieLensExport/"
duckdb.sql(f'''EXPORT DATABASE '{path_to_export}' (FORMAT PARQUET)''') 