# Data science over small movie dataset -- Part 2

<p style="font-size: 20px; font-weight: bold;">Sparse matrix recommender</p>

Anton Antonov   
October 2025  
November 2025  

---

## Introduction

This notebook shows how to transform a movie dataset into a form more suitable for making a movie recommender system.

The movie data was downloaded from here: ["IMDB Movie Ratings Dataset"](https://www.kaggle.com/datasets/thedevastator/imdb-movie-ratings-dataset). That dataset was chosen because:

- It has the right size for demonstration of data wrangling techniques
    - ≈5000 rows and 15 columns (each row corresponding to a movie)
- It is "real life" data with expected skewness of variable distributions
- It is diverse enough over movie years and genres
- There are no missing values

---

## Setup

Load packages used in the notebook:

In [None]:
use Math::SparseMatrix;
use ML::SparseMatrixRecommender;
use ML::SparseMatrixRecommender::Utilities;
use Statistics::OutlierIdentifiers;

In [None]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

In [None]:
#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

In [None]:
my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = '#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10},link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;

---

## Ingest transformed data

Ingest the movie data already ***transformed*** in the first notebook, [AAn1]:

---

## Recommender system

One way to investigate (browse) the data is to make a recommender system and use it to explore different aspects of the movie dataset, such as movie profiles and the distribution of nearest-neighbor similarities.

### Make the recommender

In order to make a more meaningful recommender, we put the values of the numerical variables into "buckets", i.e. intervals derived from the distribution of each variable's values. The interval boundaries can form a regular grid, correspond to quantile values, or be specially made. Here we use quantiles:

In [None]:
my @bucketVars = <score votes_count reviews_count>;
my @dsMovieDataLongForm2;
sink for @dsMovieDataLongForm.map(*<TagType>).unique -> $var {
    if $var ∈ @bucketVars {
        my %bucketizer = ML::SparseMatrixRecommender::Utilities::categorize-to-intervals(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*<Tag>)».Numeric, probs => (0..6) >>/>> 6, :interval-names):pairs;
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*.clone).map({ $_<Tag> = %bucketizer{$_<Tag>}; $_ }))
    } else {
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var))
    }
}
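
The quantile bucketing used above can be illustrated with a minimal sketch in core Raku. The subs `quantile-boundaries` and `bucket-index` are hypothetical helpers written for this illustration (they are not part of "ML::SparseMatrixRecommender"), the boundary rule is a simple nearest-rank one, and the vote counts are made up:

```raku
# Hypothetical helper: nearest-rank quantile boundaries that split
# the sorted values into (at most) $n-buckets groups of similar size.
sub quantile-boundaries(@values, Int $n-buckets) {
    my @sorted = @values.sort;
    my @boundaries = (1 ..^ $n-buckets).map(-> $k {
        @sorted[ floor($k / $n-buckets * @sorted.elems) min @sorted.end ]
    }).unique;
    return @boundaries;
}

# Hypothetical helper: the bucket index is the number of boundaries
# that are less than or equal to the value.
sub bucket-index($x, @boundaries) {
    @boundaries.grep(* <= $x).elems
}

# Made-up, heavily skewed "votes count" values
my @votes = 3, 5, 8, 13, 21, 34, 55, 89, 144, 10_000, 50_000, 900_000;

my @boundaries = quantile-boundaries(@votes, 3);

# Count how many values land in each bucket
my %bucket-sizes = @votes
    .classify({ bucket-index($_, @boundaries) })
    .map(-> $p { $p.key => $p.value.elems });

say @boundaries;
say %bucket-sizes;
```

For these values the boundaries are 21 and 144, and each of the three buckets gets four values. A regular grid over the same value range would put almost all values into the first interval, which is why quantiles are a reasonable choice for skewed variables.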

In [None]:
sink records-summary(@dsMovieDataLongForm2)

Here we make a Sparse Matrix Recommender (SMR):

In [None]:
my $smrObj = 
    ML::SparseMatrixRecommender.new
    .create-from-long-form(
        @dsMovieDataLongForm2.append(@dsMovieGenreLongForm), 
        item-column-name => 'Item', 
        tag-type-column-name => 'TagType',
        tag-column-name => 'Tag',
        :add-tag-types-to-column-names)        
    .apply-term-weight-functions('IDF', 'None', 'Cosine')
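
To make the term-weighting step more concrete, here is a minimal sketch in core Raku. It assumes that the three arguments stand for the global weight (IDF), the local weight (none), and the row normalizer (cosine). The formula log(N / df) is one common IDF variant and the package's exact formula may differ; the items and tags are made up:

```raku
# Made-up items with their tags
my %item-tags =
    m1 => <genre:Drama year:1999>,
    m2 => <genre:Drama year:2001>,
    m3 => <genre:History year:1999>,
    m4 => <genre:Drama year:2005>;

# Document frequency: the number of items that have each tag
my %df;
for %item-tags.values -> @tags { %df{$_}++ for @tags }

# Global weight, assumed here to be IDF = log(N / df)
my $n = %item-tags.elems;
my %idf = %df.kv.map(-> $tag, $count { $tag => log($n / $count) });

# Local weight 'None' keeps the 0/1 entries, so each row holds the IDF
# values of its tags; the cosine normalizer scales rows to unit length
my %weighted;
for %item-tags.kv -> $item, @tags {
    my %row  = @tags.map(-> $tag { $tag => %idf{$tag} });
    my $norm = sqrt(%row.values.map(* ** 2).sum);
    %weighted{$item} = %row.kv.map(-> $tag, $w { $tag => $w / $norm }).Hash;
}

.say for %weighted.sort(*.key);
```

In this toy example the rare tag `genre:History` gets a larger weight than the frequent tag `genre:Drama`, and every row has unit Euclidean length.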

Here are the recommender sub-matrices dimensions (rows and columns):

In [None]:
.say for $smrObj.take-matrices.deepmap(*.dimensions).sort(*.key)

Note that the sub-matrices of "reviews_count", "score", and "votes_count" have a small number of columns, corresponding to the number of probabilities specified when categorizing to intervals.

### Enhance one-hot embedding

In [None]:
my $mat = $smrObj.take-matrices<year>;

my $matUp = Math::SparseMatrix.new(
    diagonal => 1/2 xx ($mat.columns-count - 1), k => 1, 
    row-names => $mat.column-names,
    column-names => $mat.column-names
);

my $matDown = $matUp.transpose;

# mat = mat + mat . matUp + mat . matDown
$mat = $mat.add($mat.dot($matUp)).add($mat.dot($matDown));
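
The cell above appears to spread each movie's year indicator to the two neighboring years with weight 1/2, so that movies from adjacent years become partially similar. Here is a minimal sketch of the same arithmetic for a single one-hot row, using plain arrays instead of "Math::SparseMatrix" (the year range is made up):

```raku
my @years = 1997 .. 2001;

# One-hot row of a movie released in 1999
my @row = @years.map({ $_ == 1999 ?? 1 !! 0 });

# Multiplying a row by a matrix with 1/2 on the superdiagonal shifts the
# row one position to the right and halves it; the transpose shifts left
my @up   = (0, |@row[0 ..^ @row.end]).map(* / 2);
my @down = (|@row[1 .. @row.end], 0).map(* / 2);

# row + row . matUp + row . matDown
my @smoothed = (^@row.elems).map(-> $i { @row[$i] + @up[$i] + @down[$i] });

say @years Z=> @smoothed;
```

For this row the smoothed values are 0, 0.5, 1, 0.5, 0: the movie keeps full weight for 1999 and gets half weight for 1998 and 2000.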

In [None]:
#%js
 my %opts = margins => {top => 30, left => 16, right => 16, bottom => 16}, :$tick-labels-font-size, :$tick-labels-color, :$title-color, :tooltip, :$tooltip-color, :$color-palette, :$tooltip-background-color, :$background;
$mat[(^$mat.rows-count).pick(50).sort; 'year:' X~ (1970..2000)].Array
==> js-d3-matrix-plot(:600width, :400height, |%opts)

In [None]:
#% js
js-d3-list-plot($mat.tuples, :$background, :600width, :500height, point-size => 1, :!axes)

In [None]:
my %matrices = $smrObj.take-matrices;
%matrices<year> = $mat;
my $smrObj2 = ML::SparseMatrixRecommender.new(%matrices)

### Recommendations

Example recommendation by profile:

In [None]:
sink $smrObj
.apply-tag-type-weights({genre => 2})
.recommend-by-profile(<genre:History year:1999>, 12, :!normalize)
.join-across(select-columns(@dsMovieData, @field-names), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@field-names])})
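
Conceptually, a recommendation by profile scores each item by the dot product of its matrix row with the profile vector, and returns the top-scoring items. Here is a minimal sketch in core Raku with a made-up weighted matrix. Doubling the genre weight in the profile only loosely imitates the effect of `apply-tag-type-weights`; how that method acts internally is an assumption here:

```raku
# Made-up weighted item-by-tag rows
my %matrix =
    m1 => { 'genre:History' => 0.9, 'year:1999' => 0.4 },
    m2 => { 'genre:Drama'   => 0.3, 'year:1999' => 0.9 },
    m3 => { 'genre:History' => 0.8, 'year:2005' => 0.6 };

# Profile with the genre tag weighted twice as much as the year tag
my %profile = 'genre:History' => 2, 'year:1999' => 1;

# Score = dot product of the item row with the profile;
# keep the two best items with non-zero scores
my @recs = %matrix.kv.map(-> $item, %row {
    $item => %profile.kv.map(-> $tag, $w { $w * (%row{$tag} // 0) }).sum
}).grep(*.value > 0).sort(-*.value).head(2);

.say for @recs;
```

Here `m1` (score 2.2) and `m3` (score 1.6) are returned, ahead of `m2` (score 0.9), because the genre match counts double.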

Recommendation by history:

In [None]:
sink $smrObj
.recommend(<2125 2308>, 12, :!normalize, :!remove-history)
.join-across(select-columns(@dsMovieData, @field-names), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@field-names])})
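
A recommendation by history can be thought of as a two-step process: first aggregate the matrix rows of the history items into a profile, then score all items against that profile. Here is a minimal sketch under that assumption, with a made-up matrix; the `$remove-history` variable only imitates the `:!remove-history` option used above:

```raku
# Made-up weighted item-by-tag rows
my %matrix =
    m1 => { 'genre:History' => 0.9, 'year:1999' => 0.4 },
    m2 => { 'genre:Drama'   => 0.3, 'year:1999' => 0.9 },
    m3 => { 'genre:History' => 0.8, 'year:2005' => 0.6 };

my @history = <m1 m2>;
my $remove-history = False;

# Step 1: the profile is the sum of the rows of the history items
my %profile;
for @history -> $item {
    %profile{.key} += .value for %matrix{$item}.pairs;
}

# Step 2: score every item against the profile,
# optionally dropping the history items themselves
my @recs = %matrix.kv.map(-> $item, %row {
    $item => %row.kv.map(-> $tag, $w { $w * (%profile{$tag} // 0) }).sum
}).grep({ .value > 0 && !($remove-history && .key ∈ @history) }).sort(-*.value);

.say for @recs;
```

With the history kept, the history items `m1` and `m2` come first, followed by `m3`; setting `$remove-history` to `True` leaves only `m3`.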

### Profiles

Find movie IDs for a certain criterion (e.g. movies with a given actor):

In [None]:
my @movieIDs = $smrObj.recommend-by-profile('actor:Orlando Bloom', Inf).take-value».key;
deduce-type(@movieIDs)

In [None]:
my @profile = |$smrObj.profile(@movieIDs).take-value;
deduce-type(@profile)

In [None]:
outlier-identifier(@profile».value, identifier => &top-outliers o &quartile-identifier-parameters)
==> {@profile[$_]}()
==> my @profile2;

deduce-type(@profile2)
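
The idea of keeping only the "top outliers" of a profile can be sketched in core Raku. The sketch assumes the common Tukey upper fence Q3 + 1.5 IQR and a crude quantile rule; the thresholds computed by `quartile-identifier-parameters` may be defined differently, and the profile scores are made up:

```raku
# Made-up profile: tag => score pairs with two dominant tags
my @scores  = 30, 20, 5, 4, 4, 3, 3, 3, 2, 2, 1;
my @profile = @scores.kv.map(-> $i, $s { "tag$i" => $s });

# Crude quantile: the element at the floor of p times the last index
sub crude-quantile(@sorted, $p) { @sorted[ floor($p * @sorted.end) ] }

my @sorted = @profile».value.sort;
my ($q1, $q3) = crude-quantile(@sorted, 1/4), crude-quantile(@sorted, 3/4);

# Tukey upper fence (an assumption, see the note above)
my $upper = $q3 + 1.5 * ($q3 - $q1);

# Keep only the tags whose scores are above the fence
my @top = @profile.grep(*.value > $upper);

.say for @top;
```

For these scores the quartiles are 2 and 4, the upper fence is 7, and only the two dominant tags (scores 30 and 20) are kept.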

In [None]:
#%js
js-d3-list-plot(
    [|@profile».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'full profile' ) }), 
     |@profile2».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'outliers' ) })], 
    :$background,
    :300height,
    :600width
    )

---

## References

### Articles, blog posts

[AA1] Anton Antonov, ["Introduction to data wrangling with Raku"](https://rakuforprediction.wordpress.com/2021/12/31/introduction-to-data-wrangling-with-raku/), (2021), [RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

[AA2] Anton Antonov, ["Implementing Machine Learning algorithms in Raku (TRC-2022 talk)"](https://rakuforprediction.wordpress.com/2022/08/15/implementing-machine-learning-algorithms-in-raku-trc-2022-talk/), (2022), [RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Notebooks 

[AAn1] Anton Antonov, 
["Small movie dataset analysis"](), 
(2025),
[RakuForPrediction-blog at GitHub]().

[AAn2] Anton Antonov, 
["Small movie dataset recommender"](), 
(2025),
[RakuForPrediction-blog at GitHub]().

[AAn3] Anton Antonov, 
["Small movie dataset graph"](), 
(2025),
[RakuForPrediction-blog at GitHub]().




### Packages

[AAp1] Anton Antonov, [Data::Importers, Raku package](https://github.com/antononcube/Raku-Data-Importers), (2024-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov, [Data::Reshapers, Raku package](https://github.com/antononcube/Raku-Data-Reshapers), (2021-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov, [Data::Summarizers, Raku package](https://github.com/antononcube/Raku-Data-Summarizers), (2021-2024), [GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov, [Graph, Raku package](https://github.com/antononcube/Raku-Graph), (2024-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov, [JavaScript::D3, Raku package](https://github.com/antononcube/Raku-JavaScript-D3), (2022-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov, [Jupyter::Chatbook, Raku package](https://github.com/antononcube/Raku-Jupyter-Chatbook), (2023-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov, [Math::SparseMatrix, Raku package](https://github.com/antononcube/Raku-Math-SparseMatrix), (2024-2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov, [ML::AssociationRuleLearning, Raku package](https://github.com/antononcube/Raku-ML-AssociationRuleLearning), (2022-2024), [GitHub/antononcube](https://github.com/antononcube).

[AAp9] Anton Antonov, [ML::SparseMatrixRecommender, Raku package](https://github.com/antononcube/Raku-ML-SparseMatrixRecommender), (2025), [GitHub/antononcube](https://github.com/antononcube).

[AAp10] Anton Antonov, [Statistics::OutlierIdentifiers, Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers), (2022), [GitHub/antononcube](https://github.com/antononcube).


### Videos

[AAv1] Anton Antonov, ["Simplified Machine Learning Workflows Overview (Raku-centric)"](https://www.youtube.com/watch?v=p3iwPsc6e74), (2022), [YouTube/@AAA4prediction](https://www.youtube.com/@AAA4prediction).

[AAv2] Anton Antonov, ["TRC 2022 Implementation of ML algorithms in Raku"](https://www.youtube.com/watch?v=efRHfjYebs4), (2022), [YouTube/@AAA4prediction](https://www.youtube.com/@AAA4prediction).

[AAv3] Anton Antonov, ["Exploratory Data Analysis with Raku"](https://www.youtube.com/watch?v=YCnjMVSfT8w), (2024), [YouTube/@AAA4prediction](https://www.youtube.com/@AAA4prediction).

[AAv4] Anton Antonov, ["Raku RAG demo"](https://www.youtube.com/watch?v=JHO2Wk1b-Og), (2024), [YouTube/@AAA4prediction](https://www.youtube.com/@AAA4prediction).
