# Small movie dataset -- Part 3

<p style="font-size: 20px; font-weight: bold;">Relationships graphs</p>

Anton Antonov   
November 2025  

---

## Introduction

This notebook shows how to transform a movie dataset into a form more suitable for building a movie recommender system. 

The movie data was downloaded from here: ["IMDB Movie Ratings Dataset"](https://www.kaggle.com/datasets/thedevastator/imdb-movie-ratings-dataset). That dataset was chosen because:

- It has the right size for demonstration of data wrangling techniques
    - ≈5000 rows and 15 columns (each row corresponding to a movie)
- It is "real life" data with expected skewness of variable distributions
- It is diverse enough over movie years and genres
- There are no missing values

### Outline 

Here are the transformation and data analysis steps taken in this notebook:

1. Ingest the data 
    - Shape size and summaries
    - Numerical columns transformation
    - Renaming columns to have more convenient names  
    - Separating the non-uniform genres column into movie-genre associations
        - Into long format
2. Basic data analysis
    - Number of movies per year distribution
    - Movie-genre distribution
    - Pareto principle adherence for movie directors
    - Correlation between number of votes and rating
3. Association Rules Learning (ARL)
    - Converting long format dataset into "baskets" of genres
    - Most frequent combinations of genres
    - Implications between genres
        - E.g. a biography movie is also a drama movie 94% of the time
    - LLM-derived dictionary of most commonly used ARL measures
4. Recommender system creation
    - Conversion of numerical data into categorical data
    - Application of one-hot embedding
    - Experimenting / observing recommendation results
    - Getting familiar with the movie data by computing profiles for sets of movies

### Comments & observations

- In most "real life" data processing, most of the data transformation steps listed above are taken.
- ARL can also be used for deriving recommendations if the data is large enough.
- The Sparse Matrix Recommender (SMR) object is based on Nearest Neighbors finding over "bags of tags."
    - Latent Semantic Indexing (LSI) tag-weighting functions are applied.
- One-hot embedding is a common technique, which in this notebook is done via cross-tabulation.
- The categorization of numerical data means putting numbers into suitable bins or "buckets."
    - The bin or bucket boundaries can be on a regular grid or a quantile grid.
- For categorized numerical data, the one-hot embedding matrices can be processed to increase the similarity between numeric buckets that are close to each other.
- If the movies had reviews or summaries associated with them, then Latent Semantic Analysis (LSA) could be applied.
    - SMR can use both LSA-terms-based and LSA-topics-based representations of the movies.
    - LLMs can be used to derive the LSA representation.
    - This is *not done in this notebook*.
        - But there are plans to demonstrate such LSA application with Raku soon.
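
As a rough illustration of one-hot embedding via cross-tabulation, here is a minimal core-Raku sketch. It does not use the notebook's `cross-tabulate` function, and the toy records and the names `@toy`, `%one-hot`, and `@tags` are made up for this example:

```raku
# Toy long-form records (hypothetical data, not taken from the dataset)
my @toy =
    %( Item => 'm1', Tag => 'Action' ),
    %( Item => 'm1', Tag => 'Sci-Fi' ),
    %( Item => 'm2', Tag => 'Drama'  ),
    %( Item => 'm3', Tag => 'Action' );

# Cross-tabulate: rows are items, columns are tags, 1 marks "item has tag"
my %one-hot;
for @toy -> %r { %one-hot{%r<Item>}{%r<Tag>} = 1 }

# Print the dense 0/1 matrix with a fixed column order
my @tags = @toy.map(*<Tag>).unique.sort;
say 'item ', @tags.join(' ');
for %one-hot.keys.sort -> $item {
    say $item, ' ', @tags.map(-> $t { %one-hot{$item}{$t} // 0 }).join(' ');
}
```

Each row of the printed matrix is the one-hot (more precisely, multi-hot) vector of a movie. A sparse representation stores only the 1 entries, which matters when there are thousands of tag columns.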

---

## Setup

In [1]:
use Math::SparseMatrix;
use ML::SparseMatrixRecommender;
use ML::SparseMatrixRecommender::Utilities;
use Statistics::OutlierIdentifiers;

In [2]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

In [3]:
#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

In [4]:
my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = '#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10},link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;

{background => #1F1F1F, edge-thickness => 3, title-color => Silver, vertex-size => 6}

---

## Ingest data

Ingest the movie data:

In [5]:
my $fileName=$*HOME~'/Datasets/Kaggle The Movies Ratings Dataset/movie_data.csv';
my @dsMovieData=data-import($fileName, headers=>'auto');

deduce-type(@dsMovieData)

Vector(Assoc(Atom((Str)), Atom((Str)), 15), 5043)

Show a sample of the movie data:

In [6]:
#% html
#my @field-names = @dsMovieData.head.keys.sort;
my @field-names = <index movie_title title_year country duration language actor_1_name actor_2_name actor_3_name director_name imdb_score num_user_for_reviews num_voted_users movie_imdb_link>;
@dsMovieData.pick(8)
==> to-html(:@field-names)

index,movie_title,title_year,country,duration,language,actor_1_name,actor_2_name,actor_3_name,director_name,imdb_score,num_user_for_reviews,num_voted_users,movie_imdb_link
1505,The Horseman on the Roof,1995.0,France,135.0,French,Olivier Martinez,François Cluzet,Isabelle Carré,Jean-Paul Rappeneau,7.1,25.0,4885,http://www.imdb.com/title/tt0113362/?ref_=fn_tt_tt_1
2213,Risen,2016.0,USA,107.0,English,Peter Firth,Jan Cornet,María Botto,Kevin Reynolds,6.3,117.0,12276,http://www.imdb.com/title/tt3231054/?ref_=fn_tt_tt_1
2373,Spirited Away,2001.0,Japan,125.0,Japanese,Bunta Sugawara,Ryûnosuke Kamiki,Miyu Irino,Hayao Miyazaki,8.6,902.0,417971,http://www.imdb.com/title/tt0245429/?ref_=fn_tt_tt_1
1443,Something Borrowed,2011.0,USA,112.0,English,Ashley Williams,Steve Howey,Kirsten Day,Luke Greenfield,5.9,153.0,48019,http://www.imdb.com/title/tt0491152/?ref_=fn_tt_tt_1
3126,Employee of the Month,2006.0,USA,103.0,English,Dane Cook,Danny Woodburn,Jessica Simpson,Greg Coolidge,5.5,151.0,37681,http://www.imdb.com/title/tt0424993/?ref_=fn_tt_tt_1
4878,Jesus People,2007.0,USA,35.0,English,Victoria Jackson,Kate Flannery,Tim Bagley,Jason Naumann,6.9,,31,http://www.imdb.com/title/tt1003002/?ref_=fn_tt_tt_1
2317,Passchendaele,2008.0,Canada,114.0,English,Landon Liboiron,Michael Greyeyes,Caroline Dhavernas,Paul Gross,6.5,102.0,6904,http://www.imdb.com/title/tt1092082/?ref_=fn_tt_tt_1
4929,Perfect Cowboy,2014.0,USA,109.0,English,Charla Cochran,Joe Lev,Sienna Beckman,Ken Roht,7.0,,8,http://www.imdb.com/title/tt3581098/?ref_=fn_tt_tt_1


Convert string values of the numerical columns into numbers:

In [7]:
@dsMovieData .= map({ 
    $_<title_year> = $_<title_year>.trim.Int; 
    $_<imdb_score> = $_<imdb_score>.Numeric; 
    $_<num_user_for_reviews> = $_<num_user_for_reviews>.Int; 
    $_<num_voted_users> = $_<num_voted_users>.Int; 
    $_});
deduce-type(@dsMovieData)

Vector(Struct([actor_1_name, actor_2_name, actor_3_name, country, director_name, duration, genres, imdb_score, index, language, movie_imdb_link, movie_title, num_user_for_reviews, num_voted_users, title_year], [Str, Str, Str, Str, Str, Str, Str, Rat, Str, Str, Str, Str, Int, Int, Int]), 5043)
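
A small core-Raku sketch of how these string-to-number conversions behave. Note that an empty string converts to 0, which may explain the `Min => 0` entries in the summary that follows. The helper `to-int-or-missing` is a hypothetical name introduced here; it is not part of the notebook or of any package:

```raku
# Strings with a decimal part are converted via their numeric value
say '1995.0'.Int;   # 1995

# An empty string numifies to 0, so a missing value silently becomes 0
say ''.Int;         # 0

# Hypothetical helper: keep missing values as an undefined Int type object
sub to-int-or-missing(Str $s) {
    my $t = $s.trim;
    $t.chars ?? $t.Int !! Int
}

say to-int-or-missing('117.0');      # 117
say to-int-or-missing('').defined;   # False
```

Whether to keep zeros or undefined values for missing entries is a design choice; zeros are convenient for bucketing, but they distort statistics such as the mean.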

Summary of the data (over selected columns):

In [8]:
#% html
my @field-names = <index title_year imdb_score genres num_voted_users num_user_for_reviews>;
sink records-summary(select-columns(@dsMovieData, @field-names), :@field-names);

+-----------------+-----------------------+--------------------+------------------------------+------------------------+----------------------+
| index           | title_year            | imdb_score         | genres                       | num_voted_users        | num_user_for_reviews |
+-----------------+-----------------------+--------------------+------------------------------+------------------------+----------------------+
| 4436    => 1    | Min    => 0           | Min    => 1.6      | Drama                => 236  | Min    => 5            | Min    => 0          |
| 2658    => 1    | 1st-Qu => 1998        | 1st-Qu => 5.8      | Comedy               => 209  | 1st-Qu => 8589         | 1st-Qu => 64         |
| 327     => 1    | Mean   => 1959.585961 | Mean   => 6.442138 | Comedy|Drama         => 191  | Mean   => 83668.160817 | Mean   => 271.63494  |
| 495     => 1    | Median => 2005        | Median => 6.6      | Comedy|Drama|Romance => 187  | Median => 34359        | Median => 155  

Convert to long form by skipping special columns:

In [9]:
my @varnames = <movie_title title_year country actor_1_name actor_2_name actor_3_name num_voted_users num_user_for_reviews imdb_score director_name language>;
my @dsMovieDataLongForm = to-long-format(@dsMovieData, 'index', @varnames, variables-to => 'TagType', values-to => 'Tag');

deduce-type(@dsMovieDataLongForm)

Vector((Any), 55473)
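
The record count is consistent with 5043 movies times 11 variables (5043 × 11 = 55473). The reshaping idea can be sketched in core Raku as follows; this is only an illustration with made-up records, not the implementation of `to-long-format`:

```raku
# Two toy wide-form records (hypothetical titles)
my @wide =
    %( index => '0', movie_title => 'Title A', title_year => 2009 ),
    %( index => '1', movie_title => 'Title B', title_year => 2015 );

my @vars = <movie_title title_year>;

# One long-form record per (row, variable) pair
my @long = @wide.map(-> %r {
    |@vars.map(-> $v { %( index => %r<index>, TagType => $v, Tag => %r{$v} ) })
});

say @long.elems;   # 4, i.e. 2 rows × 2 variables
.say for @long;
```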

Show a sample of the converted data:

In [10]:
#% html
@dsMovieDataLongForm.pick(8)
==> to-html(field-names => <index TagType Tag>)

index,TagType,Tag
2663,actor_1_name,Christine Taylor
2876,actor_1_name,Steve Buscemi
2594,title_year,0
172,actor_3_name,Desmond Llewelyn
4921,imdb_score,8.5
1146,actor_1_name,Daniel Radcliffe
749,actor_3_name,Jake Lloyd
3805,director_name,Bob Saget


Give some tag types more convenient names:

In [11]:
my %toBetterTagTypes = 
    movie_title => 'title', 
    title_year => 'year', 
    director_name => 'director',
    actor_1_name => 'actor', actor_2_name => 'actor', actor_3_name => 'actor', 
    num_voted_users => 'votes_count', num_user_for_reviews => 'reviews_count',
    imdb_score => 'score', 
    ;

@dsMovieDataLongForm = @dsMovieDataLongForm.map({ $_<TagType> = %toBetterTagTypes{$_<TagType>} // $_<TagType>; $_ });
@dsMovieDataLongForm = |rename-columns(@dsMovieDataLongForm, {index=>'Item'});

deduce-type(@dsMovieDataLongForm)

Vector((Any), 55473)

Summarize the long form data:

In [12]:
sink records-summary(@dsMovieDataLongForm, :12max-tallies)

+------------------+------------------+------------------------+
| Tag              | Item             | TagType                |
+------------------+------------------+------------------------+
| English => 4704  | 1526    => 11    | actor         => 15129 |
| USA     => 3807  | 3741    => 11    | language      => 5043  |
| UK      => 448   | 2554    => 11    | reviews_count => 5043  |
| 2009    => 260   | 3011    => 11    | year          => 5043  |
| 2014    => 252   | 2       => 11    | title         => 5043  |
| 2006    => 239   | 4465    => 11    | country       => 5043  |
| 2013    => 237   | 4526    => 11    | votes_count   => 5043  |
| 2010    => 230   | 1424    => 11    | director      => 5043  |
| 2015    => 226   | 2744    => 11    | score         => 5043  |
| 2011    => 226   | 3710    => 11    |                        |
| 2008    => 225   | 4799    => 11    |                        |
| 6.7     => 223   | 4326    => 11    |                        |
| (Other) => 44396 | (Oth

Make a separate dataset with movie-genre associations:

In [13]:
my @dsMovieGenreLongForm = @dsMovieData.map({ $_<index> X $_<genres>.split('|', :skip-empty)}).flat(1).map({ <index genre> Z=> $_ })».Hash;
deduce-type(@dsMovieGenreLongForm)

Vector(Assoc(Atom((Str)), Atom((Str)), 2), 14504)

Give the genres long form the same columns as the long form of the rest of the movie metadata:

In [14]:
@dsMovieGenreLongForm = rename-columns(@dsMovieGenreLongForm, {index => 'Item', genre => 'Tag'}).map({ $_.push('TagType' => 'genre') });

deduce-type(@dsMovieGenreLongForm)

Vector(Assoc(Atom((Str)), Atom((Str)), 3), 14504)

In [15]:
#% html
@dsMovieGenreLongForm.head(8)
==> to-html(field-names => <Item TagType Tag>)

Item,TagType,Tag
0,genre,Action
0,genre,Adventure
0,genre,Fantasy
0,genre,Sci-Fi
1,genre,Action
1,genre,Adventure
1,genre,Fantasy
2,genre,Action


----

## Statistics

In this section we compute different statistics that should give us a better idea of what the data looks like.

Show movie years distribution:

In [16]:
#% js
js-d3-bar-chart(@dsMovieData.map(*<title_year>.Str).&tally.sort(*.head), title => 'Movie years distribution', :$background, :$title-color, :1000width)
~
js-d3-box-whisker-chart(@dsMovieData.map(*<title_year>)».Int.grep(*>1916), :horizontal, :$background)

Show movie genre distribution:

In [17]:
#% js
my %genreCounts = cross-tabulate(@dsMovieGenreLongForm, 'Item', 'Tag', :sparse).column-sums(:p);
js-d3-bar-chart(%genreCounts.sort, :$background)


Check Pareto principle adherence for director names:

In [18]:
#% js
pareto-principle-statistic(@dsMovieData.map(*<director_name>))
==> js-d3-list-line-plot(
        :$background,
        title => 'Pareto principle adherence for movie directors',
        y-label => 'probability', x-label => 'index',
        :grid-lines, :5stroke-width, :$title-color)
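
For intuition, a Pareto-style statistic can be sketched in core Raku as the cumulative share of movies covered by the most prolific directors. This is an assumption about the kind of curve being plotted, not the exact definition used by `pareto-principle-statistic`; the director labels are made up:

```raku
# Toy director column (hypothetical labels)
my @directors = <A A A A B B C D>;

# Movie counts per director, largest first
my @counts = @directors.Bag.values.sort.reverse;

# Cumulative fraction of movies covered by the top-k directors
my $total = @counts.sum;
my @cumulative = ([\+] @counts).map(* / $total);

say @cumulative;   # [0.5 0.75 0.875 1]
```

Here the top director (25% of the directors) accounts for 50% of the movies. Strong Pareto adherence would show roughly 80% of the movies covered by the top 20% of the directors.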

Plot the number of IMDB votes vs IMDB scores:

In [19]:
#% js
@dsMovieData.map({ %( x => $_<num_voted_users>».Num».log(10), y => $_<imdb_score>».Num ) })
==> js-d3-list-plot(
        :$background,
        title => 'Number of IMDB votes vs IMDB scores',
        x-label => 'Number of votes, lg', y-label => 'score',
        :grid-lines, point-size => 4, :$title-color)

---

## Association rules learning

It is interesting to see which genres are closely associated with each other. One way to find those associations is to use Association Rules Learning (ARL).

For each movie make a "basket" of genres:

In [20]:
my @baskets = cross-tabulate(@dsMovieGenreLongForm, 'Item', 'Tag').values».keys».List;
@baskets».elems.&tally

{1 => 633, 2 => 1355, 3 => 1628, 4 => 981, 5 => 349, 6 => 75, 7 => 18, 8 => 4}

Find frequent sets that are seen in at least 300 movies:

In [21]:
my @freqSets = frequent-sets(@baskets, min-support => 300, min-number-of-items => 2, max-number-of-items => Inf);
deduce-type(@freqSets):tally

Tuple([Pair(Vector(Atom((Str)), 2), Atom((Rat))) => 14, Pair(Vector(Atom((Str)), 3), Atom((Rat))) => 1], 15)

In [22]:
to-pretty-table(@freqSets.map({ %( FrequentSet => $_.key.join(' '), Frequency => $_.value) }).sort(-*<Frequency>), field-names => <FrequentSet Frequency>, align => 'l');

+----------------------+-----------+
| FrequentSet          | Frequency |
+----------------------+-----------+
| Drama Romance        | 0.146143  |
| Drama Thriller       | 0.138211  |
| Comedy Drama         | 0.131469  |
| Action Thriller      | 0.116796  |
| Comedy Romance       | 0.116796  |
| Crime Thriller       | 0.108665  |
| Crime Drama          | 0.104303  |
| Action Adventure     | 0.093198  |
| Comedy Family        | 0.070989  |
| Mystery Thriller     | 0.070196  |
| Action Drama         | 0.068412  |
| Action Sci-Fi        | 0.066627  |
| Crime Drama Thriller | 0.066032  |
| Action Crime         | 0.065041  |
| Adventure Comedy     | 0.061670  |
+----------------------+-----------+
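
Note that the threshold is given as a count (300 movies) while the table reports fractions of all baskets. The counting behind frequent pairs can be sketched in core Raku as below. This brute-force version is only for illustration (the algorithm inside `frequent-sets` is not shown in this notebook), and the toy baskets are made up:

```raku
# Toy baskets of genres (hypothetical)
my @toy-baskets = <Drama Romance>, <Crime Drama Thriller>, <Comedy Drama Romance>, ('Comedy',);

# Count in how many baskets each unordered pair of genres occurs
my %pair-count;
for @toy-baskets -> @b {
    for @b.combinations(2) -> @pair {
        %pair-count{ @pair.sort.join(' ') }++;
    }
}

# Keep pairs seen in at least $min-count baskets; report support as a fraction
my $min-count = 2;
my @frequent = %pair-count
    .grep(*.value >= $min-count)
    .map(-> $p { $p.key => $p.value / @toy-baskets.elems });

say @frequent;   # [Drama Romance => 0.5]
```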

Here are the corresponding association rules:

In [23]:
association-rules(@baskets, min-support => 0.025, min-confidence => 0.70)
==> { .sort(-*<confidence>) }()
==> { to-pretty-table($_, field-names => <antecedent consequent count support confidence lift leverage conviction>) }()

+---------------------+------------+-------+----------+------------+----------+----------+------------+
|      antecedent     | consequent | count | support  | confidence |   lift   | leverage | conviction |
+---------------------+------------+-------+----------+------------+----------+----------+------------+
|      Biography      |   Drama    |  275  | 0.054531 |  0.938567  | 1.824669 | 0.024646 |  7.904874  |
|       History       |   Drama    |  189  | 0.037478 |  0.913043  | 1.775049 | 0.016364 |  5.584672  |
|   Animation Comedy  |   Family   |  154  | 0.030537 |  0.895349  | 8.269678 | 0.026845 |  8.520986  |
| Adventure Animation |   Family   |  151  | 0.029942 |  0.893491  | 8.252520 | 0.026314 |  8.372364  |
|         War         |   Drama    |  190  | 0.037676 |  0.892019  | 1.734175 | 0.015950 |  4.497297  |
|      Animation      |   Family   |  205  | 0.040650 |  0.847107  | 7.824108 | 0.035455 |  5.832403  |
|    Crime Mystery    |  Thriller  |  129  | 0.025580 |  0.82165

### Measure cheat-sheet

The following (LLM-generated) table gives the formulas for the Association Rules Learning measures (confidence, lift, leverage, conviction), along with their minimum value, maximum value, and value of indifference:


<table border="1" cellpadding="5" cellspacing="0">
  <thead>
    <tr>
      <th>Measure</th>
      <th>Formula</th>
      <th>Min Value</th>
      <th>Max Value</th>
      <th>Value of Indifference</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Confidence</td>
      <td>
        Confidence(A → B) = P(B | A) = <br>
        <code>support(A ∪ B) / support(A)</code>
      </td>
      <td>0</td>
      <td>1</td>
      <td>support(B)</td>
    </tr>
    <tr>
      <td>Lift</td>
      <td>
        Lift(A → B) = <br>
        <code>confidence(A → B) / support(B) = support(A ∪ B) / (support(A) × support(B))</code>
      </td>
      <td>0</td>
      <td>∞ (unbounded above)</td>
      <td>1</td>
    </tr>
    <tr>
      <td>Leverage</td>
      <td>
        Leverage(A → B) = <br>
        <code>support(A ∪ B) - (support(A) × support(B))</code>
      </td>
      <td>-min(support(A)×support(B), support(¬A)×support(¬B))</td>
      <td>min(support(A)×support(¬B), support(¬A)×support(B))</td>
      <td>0</td>
    </tr>
    <tr>
      <td>Conviction</td>
      <td>
        Conviction(A → B) = <br>
        <code>(1 - support(B)) / (1 - confidence(A → B))</code>
      </td>
      <td>0</td>
      <td>∞ (unbounded above)</td>
      <td>1</td>
    </tr>
  </tbody>
</table>


### Explanation of terms:
- **support(X)** = P(X), the proportion of transactions containing itemset X.
- **¬A** = complement of A (transactions not containing A).
- Value of indifference generally means the value where the measure indicates independence or no association.  
- For Confidence, the baseline is support(B) (probability of B alone).
- For Lift and Conviction, 1 indicates no association.
- Leverage's minimum and maximum depend on the supports of A and B.
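
As a worked check of the formulas, here is a core-Raku sketch that recomputes the measures of the Biography → Drama rule. Only the count of 275 and the total of 5043 movies appear in the notebook outputs; the counts 293 (Biography) and 2594 (Drama) are back-computed here from the reported confidence and lift, so treat them as inferred values:

```raku
my $n    = 5043;   # number of baskets (movies)
my $n-ab = 275;    # baskets with both Biography and Drama ("count" column)
my $n-a  = 293;    # baskets with Biography (inferred: 275 / 0.938567)
my $n-b  = 2594;   # baskets with Drama (inferred: 5043 * confidence / lift)

my $support    = $n-ab / $n;
my $confidence = $n-ab / $n-a;
my $lift       = $confidence / ($n-b / $n);
my $leverage   = $support - ($n-a / $n) * ($n-b / $n);
my $conviction = (1 - $n-b / $n) / (1 - $confidence);

# Print the measures rounded to six decimal places
for (:$support, :$confidence, :$lift, :$leverage, :$conviction) -> $p {
    say $p.key, ' => ', $p.value.round(0.000001);
}
```

The rounded values should agree with the first row of the association rules table (support ≈ 0.054531, confidence ≈ 0.938567, lift ≈ 1.824669, leverage ≈ 0.024646, conviction ≈ 7.904874).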


#### LLM prompt

In [24]:
# % chat 
#Give the formulas for the Association Rules Learning measures: confidence, lift, leverage, and conviction.
#In an HTML table for each measure give the min value, max value, value of indifference. 

()

----

## Recommender system

One way to investigate (browse) the data is to make a recommender system and use it to explore different aspects of the movie dataset, such as movie profiles and the distribution of nearest-neighbor similarities.

### Make the recommender

In order to make a more meaningful recommender we put the values of the different numerical variables into "buckets" -- i.e. intervals derived from the distribution of values of each variable. The boundaries of the intervals can form a regular grid, correspond to quantile values, or be specially made. Here we use quantiles:

In [25]:
my @bucketVars = <score votes_count reviews_count>;
my @dsMovieDataLongForm2;
sink for @dsMovieDataLongForm.map(*<TagType>).unique -> $var {
    if $var ∈ @bucketVars {
        my %bucketizer = ML::SparseMatrixRecommender::Utilities::categorize-to-intervals(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*<Tag>)».Numeric, probs => (0..6) >>/>> 6, :interval-names):pairs;
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*.clone).map({ $_<Tag> = %bucketizer{$_<Tag>}; $_ }))
    } else {
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var))
    }
}
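
The idea of quantile bucketing can be sketched in core Raku as follows. This is a simplified stand-in for `categorize-to-intervals`: it picks quantiles by index without interpolation, and the helper names `quantile-boundaries` and `bucket-name` as well as the sample scores are hypothetical:

```raku
# Boundaries at the given probabilities (index-based quantiles, no interpolation)
sub quantile-boundaries(@values, @probs) {
    my @sorted = @values.sort;
    @probs.map(-> $p { @sorted[ floor($p * (@sorted.elems - 1)) ] }).unique.List
}

# Name of the interval that contains $v; the maximum falls into the last interval
sub bucket-name(@bounds, $v) {
    my $i = (^(@bounds.elems - 1)).first(-> $k { $v < @bounds[$k + 1] }) // @bounds.elems - 2;
    "{@bounds[$i]}≤v<{@bounds[$i + 1]}"
}

my @scores = 1.6, 5.8, 6.1, 6.6, 7.0, 7.5, 9.5;   # hypothetical sample
my @bounds = quantile-boundaries(@scores, (0..6).map(* / 6).List);

say bucket-name(@bounds, 6.3);   # 6.1≤v<6.6
say bucket-name(@bounds, 9.5);   # 7.5≤v<9.5
```

Quantile boundaries give buckets with roughly equal numbers of movies, which suits skewed variables such as the number of votes; a regular grid would put most movies into the first few buckets.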

In [26]:
sink records-summary(@dsMovieDataLongForm2)

+------------------+--------------------+------------------------+
| Item             | Tag                | TagType                |
+------------------+--------------------+------------------------+
| 2716    => 11    | English   => 4704  | actor         => 15129 |
| 3355    => 11    | USA       => 3807  | director      => 5043  |
| 2670    => 11    | 6.1≤v<6.6 => 901   | score         => 5043  |
| 37      => 11    | 7≤v<7.5   => 891   | title         => 5043  |
| 3917    => 11    | 7.5≤v<9.5 => 886   | reviews_count => 5043  |
| 4900    => 11    | 37≤v<91   => 846   | language      => 5043  |
| 4190    => 11    | 155≤v<253 => 845   | year          => 5043  |
| (Other) => 55396 | (Other)   => 42593 | (Other)       => 10086 |
+------------------+--------------------+------------------------+


Here we make a Sparse Matrix Recommender (SMR):

In [27]:
my $smrObj = 
    ML::SparseMatrixRecommender.new
    .create-from-long-form(
        @dsMovieDataLongForm2.append(@dsMovieGenreLongForm), 
        item-column-name => 'Item', 
        tag-type-column-name => 'TagType',
        tag-column-name => 'Tag',
        :add-tag-types-to-column-names)        
    .apply-term-weight-functions('IDF', 'None', 'Cosine')

ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<23319/23239825>), :tag-types(("votes_count", "year", "country", "title", "reviews_count", "score", "actor", "genre", "language", "director")))
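
To give an idea of what LSI-style weighting does, here is a core-Raku sketch under the assumption that the three arguments above stand for a global weight (IDF), a local weight (none), and a row normalizer (cosine). The IDF formula `log(N / df)` is one common variant and may differ from the one used by the package; the toy movies are made up:

```raku
# Toy "bags of tags" per movie (hypothetical)
my %items =
    m1 => <genre:Drama year:1999>,
    m2 => <genre:Drama year:2007>,
    m3 => <genre:Drama genre:History year:1999>;
my $n = %items.elems;

# Document frequency: in how many movies each tag occurs
my %df;
for %items.values -> @tags { %df{$_}++ for @tags.unique }

# Global weight (IDF); the local weight is left as it is ("None")
my %idf = %df.map(-> $p { $p.key => log($n / $p.value) });

# Cosine normalization: scale each movie's weight vector to unit length
my %weighted;
for %items.kv -> $item, @tags {
    my %row  = @tags.map(-> $t { $t => %idf{$t} });
    my $norm = sqrt(%row.values.map(* ** 2).sum);
    %weighted{$item} = $norm > 0 ?? %row.map(-> $p { $p.key => $p.value / $norm }).Hash !! %row;
}

say %weighted<m3>;
```

In this toy example `genre:Drama` occurs in every movie and therefore gets weight 0, while the rarer `genre:History` dominates the vector of `m3`. This is the intended effect: ubiquitous tags contribute little to similarity.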

Here are the recommender sub-matrices dimensions (rows and columns):

In [28]:
.say for $smrObj.take-matrices.deepmap(*.dimensions).sort(*.key)

actor => (5043 6256)
country => (5043 66)
director => (5043 2399)
genre => (5043 26)
language => (5043 48)
reviews_count => (5043 7)
score => (5043 7)
title => (5043 4917)
votes_count => (5043 7)
year => (5043 92)


Note that the sub-matrices of "reviews_count", "score", and "votes_count" have a small number of columns, corresponding to the number of probabilities specified when categorizing to intervals.

### Enhance one-hot embedding

In [29]:
my $mat = $smrObj.take-matrices<year>;

my $matUp = Math::SparseMatrix.new(
    diagonal => 1/2 xx ($mat.columns-count - 1), k => 1, 
    row-names => $mat.column-names,
    column-names => $mat.column-names
);

my $matDown = $matUp.transpose;

# mat = mat + mat . matUp + mat . matDown
$mat = $mat.add($mat.dot($matUp)).add($mat.dot($matDown));

Math::SparseMatrix(:specified-elements(14915), :dimensions((5043, 92)), :density(<14915/463956>))
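
The cell above appears to spread each movie's year indicator to the two adjacent year columns with weight 1/2 (assuming the year columns are in chronological order), so that movies from neighboring years become partially similar. The effect on a single one-hot row can be sketched in core Raku; the row below is a made-up example:

```raku
# One-hot row over five consecutive years; the movie is from the third year
my @row = 0, 0, 1, 0, 0;

# Add half of the left and right neighbors to each entry
my @smoothed = @row.keys.map(-> $j {
    my $left  = $j > 0        ?? @row[$j - 1] !! 0;
    my $right = $j < @row.end ?? @row[$j + 1] !! 0;
    @row[$j] + 0.5 * ($left + $right)
});

say @smoothed;   # [0 0.5 1 0.5 0]
```

This matches the growth in specified elements reported above: 14915 is close to three times 5043, with fewer than three entries only for movies in the first or last year column.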

In [30]:
#%js
 my %opts = margins => {top => 30, left => 16, right => 16, bottom => 16}, :$tick-labels-font-size, :$tick-labels-color, :$title-color, :tooltip, :$tooltip-color, :$color-palette, :$tooltip-background-color, :$background;
$mat[(^$mat.rows-count).pick(50).sort; 'year:' X~ (1970..2000)].Array
==> js-d3-matrix-plot(:600width, :400height, |%opts)

In [31]:
#% js
js-d3-list-plot($mat.tuples, :$background, :600width, :500height, point-size => 1, :!axes)

In [32]:
my %matrices = $smrObj.take-matrices;
%matrices<year> = $mat;
my $smrObj2 = ML::SparseMatrixRecommender.new(%matrices)

ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<79829/69719475>), :tag-types(("year", "country", "votes_count", "actor", "language", "reviews_count", "director", "genre", "score", "title")))

### Recommendations

Example recommendation by profile:

In [33]:
sink $smrObj
.apply-tag-type-weights({genre => 2})
.recommend-by-profile(<genre:History year:1999>, 12, :!normalize)
.join-across(select-columns(@dsMovieData, @field-names), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@field-names])})

+----------+-------+------------+------------+----------------------------------------------+-----------------+----------------------+
| score    | index | title_year | imdb_score | genres                                       | num_voted_users | num_user_for_reviews |
+----------+-------+------------+------------+----------------------------------------------+-----------------+----------------------+
| 2.775503 | 553   | 1999       | 6.700000   | Drama|History|Romance                        | 31080           | 217                  |
| 2.634951 | 215   | 1999       | 6.600000   | Action|Adventure|History                     | 101411          | 546                  |
| 2.135452 | 1016  | 1999       | 6.400000   | Adventure|Biography|Drama|History|War        | 55889           | 390                  |
| 2.000529 | 2468  | 1999       | 6.200000   | Action|Drama|History|Romance|War|Western     | 899             | 42                   |
| 2.000000 | 4767  | 2015       | 7.500000   | History 
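
Conceptually, a recommendation by profile scores every movie by the weighted overlap between the profile tags and the movie's tags. The following core-Raku sketch illustrates this with unit tag weights and a genre weight of 2. It is not the implementation of `recommend-by-profile`, and the actual scores above also reflect the IDF and cosine weighting; the toy rows and the `score` sub are hypothetical:

```raku
# Toy item-by-tag rows (hypothetical)
my %rows =
    m1 => %( 'genre:History' => 1, 'year:1999' => 1, 'genre:Drama' => 1 ),
    m2 => %( 'genre:History' => 1, 'year:2015' => 1 ),
    m3 => %( 'genre:Comedy'  => 1, 'year:1999' => 1 ),
    m4 => %( 'genre:Comedy'  => 1, 'year:2007' => 1 );

my %type-weight = genre => 2, year => 1;
my %profile     = 'genre:History' => 1, 'year:1999' => 1;

# Weighted dot product of a row with the profile
sub score(%row, %profile, %type-weight) {
    %profile.keys.map(-> $tag {
        my $type = $tag.split(':').head;
        %type-weight{$type} * %profile{$tag} * (%row{$tag} // 0)
    }).sum
}

my @scored = %rows.map(-> $p { $p.key => score($p.value, %profile, %type-weight) });
my @recs   = @scored.grep(*.value > 0).sort(-> $p { -$p.value });

.say for @recs;
```

With these toy rows, `m1` matches both profile tags (score 3), `m2` matches only the genre (score 2), `m3` matches only the year (score 1), and `m4` is dropped.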

Recommendation by history:

In [34]:
sink $smrObj
.recommend(<2125 2308>, 12, :!normalize, :!remove-history)
.join-across(select-columns(@dsMovieData, @field-names), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@field-names])})

+-----------+-------+------------+------------+----------------------------------------------+-----------------+----------------------+
| score     | index | title_year | imdb_score | genres                                       | num_voted_users | num_user_for_reviews |
+-----------+-------+------------+------------+----------------------------------------------+-----------------+----------------------+
| 17.040045 | 2125  | 2007       | 7.300000   | Comedy|History                               | 5166            | 27                   |
| 17.040045 | 2308  | 1999       | 7.400000   | Biography|Comedy|Drama|History|Music|Musical | 10037           | 202                  |
| 12.459325 | 1728  | 2007       | 7.100000   | Biography|Drama|History                      | 10175           | 23                   |
| 11.721195 | 3404  | 2010       | 7.200000   | Biography|Comedy|Drama|History               | 11158           | 73                   |
| 11.082548 | 1799  | 1984       | 7.400000   | 

### Profiles

Find movie IDs for a certain criterion (e.g. movies with a given actor):

In [35]:
my @movieIDs = $smrObj.recommend-by-profile('actor:Orlando Bloom', Inf).take-value».key;
deduce-type(@movieIDs)

Vector(Atom((Str)), 11)

In [36]:
my @profile = |$smrObj.profile(@movieIDs).take-value;
deduce-type(@profile)

Vector(Pair(Atom((Str)), Atom((Numeric))), 71)

In [37]:
outlier-identifier(@profile».value, identifier => &top-outliers o &quartile-identifier-parameters)
==> {@profile[$_]}()
==> my @profile2;

deduce-type(@profile2)

Vector(Pair(Atom((Str)), Atom((Numeric))), 17)

In [38]:
#%js
js-d3-list-plot(
    [|@profile».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'full profile' ) }), 
     |@profile2».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'outliers' ) })], 
    :$background,
    :300height,
    :600width
    )

----

## Graphs

Using the recommender we make the nearest neighbors graph for the movies from year 2014 to 2016.

In [87]:
my @focusMovieIDs = (2014...2016).map({ $smrObj.recommend-by-profile('year:' ~ $_, Inf).take-value».key }).flat;
@focusMovieIDs.elems

584

Change the tag type weights to reflect the view that:
- Common actors or directors mean the movies are similar
    - Or are seen by the "same" viewers
- Common genres are important
    - But not as much as directors or actors
- Release years are not important    


In [88]:
$smrObj.apply-tag-type-weights({ director => 1, actor => 1, genre => 0.5}, default => 0.2)

ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<23319/23239825>), :tag-types(("votes_count", "year", "country", "title", "reviews_count", "score", "actor", "genre", "language", "director")))

For each movie find its six nearest neighbors (top recommendations) and make the corresponding graph edges:

In [None]:
my @edges = @focusMovieIDs.map({ $_ X=> $smrObj.recommend($_, 6).take-value».key }).flat;
@edges.elems

3504

In [89]:
my @edges2 = @edges.grep({ $_.value ∈ @focusMovieIDs });
@edges2.elems

1419

In [80]:
sink my %indexToID = @dsMovieData.map({ $_<index> => "{$_<index>} {$_<movie_title>.trim} ($_<title_year>)" });

In [90]:
my $g = Graph.new(@edges2.map({ %indexToID{$_.key} => %indexToID{$_.value} }));

Graph(vertexes => 544, edges => 985, directed => False)

In [91]:
my @comps = $g.connected-components.sort(-*.elems);
deduce-type(@comps)

Vector((Any), 20)
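
For readers unfamiliar with connected components, here is a core-Raku sketch that finds them with a breadth-first search over an undirected edge list. It only illustrates the concept; it is not how the `Graph` class computes them, and the toy edges are made up:

```raku
# Toy undirected edges (hypothetical vertex names)
my @toy-edges = 'a' => 'b', 'b' => 'c', 'd' => 'e';

# Adjacency lists, both directions
my %adj;
for @toy-edges -> $e {
    %adj{$e.key}.push: $e.value;
    %adj{$e.value}.push: $e.key;
}

# Breadth-first search from every vertex not yet visited
my %seen;
my @components;
for %adj.keys.sort -> $start {
    next if %seen{$start};
    %seen{$start} = True;
    my @queue = $start;
    my @comp;
    while @queue {
        my $v = @queue.shift;
        @comp.push: $v;
        for %adj{$v}.list -> $w {
            next if %seen{$w};
            %seen{$w} = True;
            @queue.push: $w;
        }
    }
    @components.push: [@comp.sort];
}

# Largest components first
@components .= sort(-*.elems);
say @components;   # [[a b c] [d e]]
```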

In [101]:
my @focus-component = |@comps.grep({ $_.join(' ') ~~ /:i star / }).head;
@focus-component.elems

451

In [104]:
#%js
$g.edges(:dataset) 
==> js-d3-graph-plot(
    vertex-label-color => 'none',
    :$background, 
    title-color => 'gray',
    width => 1200, 
    edge-thickness => 1,
    vertex-size => 2,
    force => {charge => {strength => -20, iterations => 2}, collision => {radius => 1, iterations => 1}, link => {distance => 0}}
)

---

## References

### Articles, blog posts

### Packages