# Data pre-processing

To get the data needed to test the collaborative filtering (CF) and content-based (CB) approaches, I need:

* User ratings of the items (movies) (CF)
* Item-specific features (CB)

Given this, the following datasets are combined below:

* MovieLens 25M Dataset: It has almost no information about the movies, but it does have the user ratings.
* TMDB Movie Dataset: It has no per-user ratings like the previous dataset, but it has several movie features, which is what I need.

## Setup

In [61]:
%load_ext autoreload
%autoreload 2

BASE_PATH       = '..'
DATASETS_PATH   = f'{BASE_PATH}/datasets'
MOVIE_LENS_PATH = f'{DATASETS_PATH}/ml-25m'
TMDB_PATH       = f'{DATASETS_PATH}/tmdb'
DATABASE        = 'movies'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [62]:
import sys
sys.path.append(f'{BASE_PATH}/lib')

from database.mongo import Mongo
from pytorch_common.util import LoggerBuilder

In [63]:
LoggerBuilder().on_console().build()

<RootLogger root (INFO)>

# Pre-processing steps

#### 1. Import five collections into the **movies** MongoDB database:

    * From the MovieLens dataset:
        * ratings
        * movies
        * links
        * tags
    * From the movie database dataset:
        * movies_metadata

In [64]:
!mkdir -p {DATASETS_PATH}

!cd {DATASETS_PATH}; curl -LO http://files.grouplens.org/datasets/movielens/ml-25m.zip

!cd {DATASETS_PATH}; unzip -o ml-25m.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  249M  100  249M    0     0  9474k      0  0:00:27  0:00:27 --:--:-- 12.1M
Archive:  ml-25m.zip
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [65]:
!mkdir -p {TMDB_PATH}

Download the TMDB dataset (the archive.zip file) from [Kaggle](https://www.kaggle.com/datasets/hudsonmendes/tmdb-movies-20002020-with-imdb-id) into the following directory:

In [8]:
TMDB_PATH

'../datasets/tmdb'

In [70]:
!cd {TMDB_PATH}; unzip -o archive.zip
!cd {TMDB_PATH}; cat *.json > movies_metadata.json

Archive:  archive.zip
  inflating: tmdb-movies-2000.json   
  inflating: tmdb-movies-2001.json   
  inflating: tmdb-movies-2002.json   
  inflating: tmdb-movies-2003.json   
  inflating: tmdb-movies-2004.json   
  inflating: tmdb-movies-2005.json   
  inflating: tmdb-movies-2006.json   
  inflating: tmdb-movies-2007.json   
  inflating: tmdb-movies-2008.json   
  inflating: tmdb-movies-2009.json   
  inflating: tmdb-movies-2010.json   
  inflating: tmdb-movies-2011.json   
  inflating: tmdb-movies-2012.json   
  inflating: tmdb-movies-2013.json   
  inflating: tmdb-movies-2014.json   
  inflating: tmdb-movies-2015.json   
  inflating: tmdb-movies-2016.json   
  inflating: tmdb-movies-2017.json   
  inflating: tmdb-movies-2018.json   
  inflating: tmdb-movies-2019.json   
  inflating: tmdb-movies-2020.json   


In [73]:
MOVIE_LENS_FILES = [ f'{MOVIE_LENS_PATH}/{f}' for f in [ 
    'ratings.csv',
    'movies.csv',
    'links.csv',
    'tags.csv'
]]
TMDB_FILES = [ f'{TMDB_PATH}/movies_metadata.json']

In [74]:
MOVIE_LENS_FILES

['../datasets/ml-25m/ratings.csv',
 '../datasets/ml-25m/movies.csv',
 '../datasets/ml-25m/links.csv',
 '../datasets/ml-25m/tags.csv']

In [75]:
TMDB_FILES

['../datasets/tmdb/movies_metadata.json']

In [76]:
Mongo.drop(DATABASE, MOVIE_LENS_FILES)



In [77]:
Mongo.import_csv(DATABASE, MOVIE_LENS_FILES)

2023-12-10 17:16:09,269 - INFO - Success: b'2023-12-10T17:08:58.281-0300  connected to: mongodb://localhost/2023-12-10T17:09:01.281-0300  [........................] movies.ratings  7.27MB/647MB (1.1%)2023-12-10T17:09:04.281-0300  [........................] movies.ratings  15.6MB/647MB (2.4%)2023-12-10T17:09:07.281-0300  [........................] movies.ratings  23.6MB/647MB (3.6%)2023-12-10T17:09:10.281-0300  [#.......................] movies.ratings  29.6MB/647MB (4.6%)2023-12-10T17:09:13.282-0300  [#.......................] movies.ratings  34.0MB/647MB (5.3%)2023-12-10T17:09:16.281-0300  [#.......................] movies.ratings  38.2MB/647MB (5.9%)2023-12-10T17:09:19.281-0300  [#.......................] movies.ratings  42.7MB/647MB (6.6%)2023-12-10T17:09:22.281-0300  [#.......................] movies.ratings  47.0MB/647MB (7.3%)2023-12-10T17:09:25.282-0300  [#.......................] movies.ratings  51.1MB/647MB (7.9%)2023-12-10T17:09:28.282-0300  [##......................] movies.

In [81]:
Mongo.drop(DATABASE, TMDB_FILES)



In [82]:
Mongo.import_json(DATABASE, TMDB_FILES)

2023-12-10 17:18:24,708 - INFO - Success: b'2023-12-10T17:18:19.679-0300  connected to: mongodb://localhost/2023-12-10T17:18:22.680-0300  [##############..........] movies.movies_metadata  79.8MB/135MB (59.1%)2023-12-10T17:18:24.707-0300  [########################] movies.movies_metadata  135MB/135MB (100.0%)2023-12-10T17:18:24.707-0300  218444 document(s) imported successfully. 0 document(s) failed to import.'

#### 2. Convert the IMDb id to a number

In [83]:
Mongo.command(DATABASE,  """
db.getCollection('movies_metadata').aggregate([
    {
        $match: { 
            $and: [
                { id_imdb: { $ne: "" } },
                { id_imdb: { $ne: 0 } }
            ]
        }
    },
    {
        $addFields: {
            imdb_id: {$toLong: [ { $arrayElemAt: [ { $split: ["$id_imdb", "tt"]}, 1 ] }] }
        }
    },
    { $out: "movies_metadata_v2" }
]);
""")



#### 3. Add indexes to both the links and movies_metadata_v2 collections.

In [84]:
Mongo.command(DATABASE, """
db.getCollection('links').createIndex(
    { "movieId": 1 }, 
    { 
        unique: true, 
        name: "movieId_unique_index"
    }
);
""")



In [85]:
Mongo.command(DATABASE, """
db.getCollection('movies_metadata_v2').createIndex(
    { "imdb_id": 1 }, 
    { unique: false, name: "imdb_id_multiple_index" }
);
""")



#### 4. Add TMDB features to the movies collection (joined on imdb_id) into a new movies_v2 collection

In [86]:
Mongo.command(DATABASE, """
db.getCollection('movies').aggregate([
    {
        $lookup:
          {
            from: "links",
            foreignField: "movieId",
            localField: "movieId", 
            as: "links"
          }
     },
     { $match: { links: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            id: "$movieId",
            tmdb_id:      { "$arrayElemAt": ["$links.tmdbId", 0] },
            imdb_id:      { "$arrayElemAt": ["$links.imdbId", 0] },
            title:        { $arrayElemAt:   [ {$split:["$title","("]} ,  0 ] },
            release_year: { $arrayElemAt:   [ {$split:["$title","("]} ,  1 ] },
            genres:       { $split:         [ "$genres", "|" ] }
        } 
    }, 
    {
        $lookup:
          {
            from: "movies_metadata_v2",
            foreignField: "imdb_id",
            localField: "imdb_id", 
            as: "movies_metadata"
          }
     },
     { $match: {  movies_metadata: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            id: 1,
            tmdb_id: 1,
            imdb_id: 1,
            title: 1,
            genres: 1,
            for_adults:        { "$arrayElemAt": ["$movies_metadata.adult", 0] },
            budget:            { "$arrayElemAt": ["$movies_metadata.budget", 0] },
            original_language: { "$arrayElemAt": ["$movies_metadata.original_language", 0] },
            overview:          { "$arrayElemAt": ["$movies_metadata.overview", 0] },
            poster:            { "$arrayElemAt": ["$movies_metadata.poster_path", 0] },
            release:           { "$arrayElemAt": ["$movies_metadata.release_date", 0] },
            popularity:        { "$arrayElemAt": ["$movies_metadata.popularity", 0] },
            vote_mean:         { "$arrayElemAt": ["$movies_metadata.vote_average", 0] },
            vote_count:        { "$arrayElemAt": ["$movies_metadata.vote_count", 0] }
        }
    },
    { $out: "movies_v2" }
]);
""")



#### 5. Group tags per (user, movie) pair.

In [87]:
Mongo.command(DATABASE, """
db.getCollection('tags').aggregate(
    [
        { 
            $group: {
                _id: {
                    user_id: "$userId",
                    movie_id: "$movieId"
                },
                tags: { $push: { $toLower: "$tag" } }
            }
        },
        {
          $project: {
            _id: 0,
            user_id: "$_id.user_id",
            movie_id: "$_id.movie_id",
            user_movie_id: { $concat: [ { $toString: "$_id.user_id" } , "_", { $toString:"$_id.movie_id"} ] },
            tags: 1
          }  
        },
        { $out: "tags_v2" }
    ]
);
""")



#### 6. Create a unique index on user_movie_id in the new tags_v2 collection.

In [88]:
Mongo.command(DATABASE, """
db.getCollection('tags_v2').createIndex(
    { 'user_movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



#### 7. Add a user_movie_id field into a new ratings_v2 collection and also create a unique index.

In [89]:
Mongo.command(DATABASE, """
db.getCollection('ratings').aggregate([
        {
          $project: {
            user_id: "$userId",
            movie_id: "$movieId",
            user_movie_id: { $concat: [ { $toString: "$userId" } , "_", { $toString:"$movieId"} ] },
            rating: 1,
            timestamp: 1
          }  
        },
        { $out: "ratings_v2" }
    ]
);
""")



Remove duplicate documents, writing the result to ratings_v3.

In [90]:
Mongo.command(DATABASE, """
db.ratings_v2.aggregate([
    { 
        $group: { _id: "$user_movie_id", doc: { $first: "$$ROOT" } }
    },
    { 
        $replaceRoot: { newRoot: "$doc" }
    },
    { $out: "ratings_v3" }
]);
""")



In [91]:
Mongo.command(DATABASE, """
db.getCollection('ratings_v3').createIndex(
    { 'user_movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
);
""")



#### 8. Join the ratings_v3 and tags_v2 collections on user_movie_id into a new ratings_tags_v1 collection.

In [92]:
Mongo.command(DATABASE, """
db.getCollection('ratings_v3').aggregate([
    {
        $lookup:
          {
            from: "tags_v2",
            foreignField: "user_movie_id",
            localField: "user_movie_id", 
            as: "tags_v2"
          }
     },
     { $match: { tags_v2: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            user_id: 1,
            movie_id: 1,
            rating: 1,
            timestamp: 1,
            tags: "$tags_v2.tags"
        }
    },
    {
        $addFields: {  
            _id: { $concat: [ { $toString: "$user_id" } , "_", { $toString:"$movie_id"} ] },            
            tags: {
                "$reduce": {
                    "input": "$tags",
                    "initialValue": [],
                    "in": { "$setUnion": [ "$$value", "$$this" ] }
                }
            }
        }
    },
    { $out: "ratings_tags_v1" }
]);
""")



#### 9. Add a tags field to the movies_v2 collection, producing movies_v3:

In [93]:
Mongo.command(DATABASE, """
db.getCollection('movies_v2').createIndex(
    { 'id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



In [94]:
Mongo.command(DATABASE, """
db.tags_v2.aggregate([
    { 
        $group: { _id: "$movie_id", doc: { $first: "$$ROOT" } }
    },
    { 
        $replaceRoot: { newRoot: "$doc" }
    },
    { $out: "tags_v3" }
]);
""")



In [95]:
Mongo.command(DATABASE, """
db.getCollection('tags_v3').createIndex(
    { 'movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



In [96]:
Mongo.command(DATABASE, """
db.getCollection('movies_v2').aggregate([
    {
        $lookup:
          {
            from: "tags_v3",
            foreignField: "movie_id",
            localField: "id",
            as: "tags_v3"
          }
    },
    { $match: { tags_v3: { $exists: true, $not: {$size: 0} } } },
    { 
        $addFields: { 
            tags: {
                "$reduce": {
                    "input": "$tags_v3.tags",
                    "initialValue": [],
                    "in": { "$setUnion": [ "$$value", "$$this" ] }
                }
            }
        }
    },
    { $unset: ["tags_v3"] },
    { $addFields: {  _id: "$id" } },
    { $unset: ["id"] },
    { $out: "movies_v3" }
]);
""")



#### 10. Export the final collections to JSON files:

* movies_v3 to movies_v3.json
* ratings_tags_v1 to ratings_tags_v1.json

In [97]:
Mongo.export_to_json(database='movies', path=DATASETS_PATH, collections=['movies_v3'])

2023-12-10 17:24:28,351 - INFO - Success: b'2023-12-10T17:24:27.994-0300  connected to: mongodb://localhost/2023-12-10T17:24:28.349-0300  exported 18679 records'


In [98]:
Mongo.export_to_json(database='movies', path=DATASETS_PATH, collections=['ratings_tags_v1'])

2023-12-10 17:24:30,596 - INFO - Success: b'2023-12-10T17:24:28.396-0300  connected to: mongodb://localhost/2023-12-10T17:24:29.397-0300  [##########..............]  movies.ratings_tags_v1  88000/210725  (41.8%)2023-12-10T17:24:30.396-0300  [####################....]  movies.ratings_tags_v1  184000/210725  (87.3%)2023-12-10T17:24:30.594-0300  [########################]  movies.ratings_tags_v1  210725/210725  (100.0%)2023-12-10T17:24:30.594-0300  exported 210725 records'
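The exported files can then be read back for modelling. The sketch below assumes that `Mongo.export_to_json` wraps `mongoexport` with its default output, which writes one JSON document per line rather than a single JSON array; if the wrapper exports an array instead, `json.load` on the whole file would be the right call. `load_json_lines` is a hypothetical helper, and a throwaway file stands in for `movies_v3.json`.

```python
import json
import tempfile
from pathlib import Path


def load_json_lines(path):
    """Read a file holding one JSON document per line."""
    docs = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                docs.append(json.loads(line))
    return docs


# Demo with a temporary file standing in for an exported collection.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(
        '{"_id": 1, "title": "Toy Story "}\n{"_id": 2, "title": "Jumanji "}\n',
        encoding="utf-8",
    )
    sample_docs = load_json_lines(sample_path)

print(len(sample_docs))  # 2
```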