# Data pre-processing

To obtain the data needed to test the collaborative filtering (CF) and content-based (CB) approaches, I need:

* User ratings of the items (movies) (CF)
* Features of the items themselves (CB)

To that end, the following datasets are combined below:

* MovieLens 25M Dataset: It has almost no information about the movies, but it does have the user ratings.
* TMDB Movie Dataset: It has no per-user ratings like the previous dataset, but it has several movie features, which is what I need.

## Setup

In [1]:
%load_ext autoreload
%autoreload 2

BASE_PATH       = '..'
DATASETS_PATH   = f'{BASE_PATH}/datasets'
MOVIE_LENS_PATH = f'{DATASETS_PATH}/ml-25m'
TMDB_PATH       = f'{DATASETS_PATH}/tmdb'
DATABASE        = 'movies'

In [2]:
import sys
sys.path.append(f'{BASE_PATH}/lib')

from database.mongo import Mongo
from pytorch_common.util import LoggerBuilder

In [3]:
LoggerBuilder().on_console().build()

<RootLogger root (INFO)>

# Pre-processing steps

#### 1. Import five collections into the **movies** MongoDB database:

    * From movie lens dataset:
        * ratings
        * movies
        * links
        * tags
    * From the movie database dataset:
        * movies_metadata

In [4]:
!mkdir -p {DATASETS_PATH}

!cd {DATASETS_PATH}; curl -LO http://files.grouplens.org/datasets/movielens/ml-25m.zip

!cd {DATASETS_PATH}; unzip -o ml-25m.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  249M  100  249M    0     0  6394k      0  0:00:40  0:00:40 --:--:-- 4789k
Archive:  ml-25m.zip
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [5]:
!mkdir -p {TMDB_PATH}

Download the TMDB dataset (archive.zip file) from [here](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) into the following directory:

In [8]:
TMDB_PATH

'../datasets/tmdb'

In [7]:
!cd {TMDB_PATH}; unzip -o archive.zip
!cd {TMDB_PATH}; mv tmdb_5000_movies.csv movies_metadata.csv

unzip:  cannot find or open archive.zip, archive.zip.zip or archive.zip.ZIP.
mv: cannot stat 'tmdb_5000_movies.csv': No such file or directory


In [9]:
MOVIE_LENS_FILES = [ f'{MOVIE_LENS_PATH}/{f}' for f in [ 
    'ratings.csv',
    'movies.csv',
    'links.csv',
    'tags.csv'
]]
TMDB_FILES = [ f'{TMDB_PATH}/movies_metadata.csv']

In [10]:
MOVIE_LENS_FILES

['../datasets/ml-25m/ratings.csv',
 '../datasets/ml-25m/movies.csv',
 '../datasets/ml-25m/links.csv',
 '../datasets/ml-25m/tags.csv']

In [11]:
TMDB_FILES

['../datasets/tmdb/movies_metadata.csv']
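
A missing archive only shows up as a shell error in the cells above, so it may help to check that every CSV exists before importing. The following is a minimal sketch; `missing_files` is a hypothetical helper, not part of the notebook's `lib` package.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical usage with the lists defined in the notebook:
# missing = missing_files(MOVIE_LENS_FILES + TMDB_FILES)
# if missing:
#     raise FileNotFoundError(f'Missing dataset files: {missing}')
```

Failing early here avoids importing a partial set of collections into MongoDB.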

In [14]:
Mongo.drop(DATABASE, MOVIE_LENS_FILES)



In [15]:
Mongo.import_csv(DATABASE, MOVIE_LENS_FILES)

2023-11-06 21:46:32,251 - INFO - Success: b'2023-11-06T21:44:05.786-0300  connected to: mongodb://localhost/2023-11-06T21:44:08.787-0300  [........................] movies.ratings  12.4MB/647MB (1.9%)2023-11-06T21:44:11.787-0300  [........................] movies.ratings  24.9MB/647MB (3.9%)2023-11-06T21:44:14.787-0300  [#.......................] movies.ratings  37.5MB/647MB (5.8%)2023-11-06T21:44:17.787-0300  [#.......................] movies.ratings  50.4MB/647MB (7.8%)2023-11-06T21:44:20.787-0300  [##......................] movies.ratings  63.6MB/647MB (9.8%)2023-11-06T21:44:23.787-0300  [##......................] movies.ratings  77.0MB/647MB (11.9%)2023-11-06T21:44:26.787-0300  [###.....................] movies.ratings  90.4MB/647MB (14.0%)2023-11-06T21:44:29.787-0300  [###.....................] movies.ratings  104MB/647MB (16.1%)2023-11-06T21:44:32.787-0300  [####....................] movies.ratings  117MB/647MB (18.1%)2023-11-06T21:44:35.787-0300  [####....................] movie

In [16]:
Mongo.drop(DATABASE, TMDB_FILES)



In [17]:
Mongo.import_csv(DATABASE, TMDB_FILES)

2023-11-06 21:46:33,882 - INFO - Success: b'2023-11-06T21:46:32.699-0300  connected to: mongodb://localhost/2023-11-06T21:46:33.880-0300  45466 document(s) imported successfully. 0 document(s) failed to import.'


#### 2. Convert the IMDb id to a number

In [18]:
Mongo.command(DATABASE,  """
db.getCollection('movies_metadata').aggregate([
    {
        $match: { 
            $and: [
                { imdb_id: { $ne: "" } },
                { imdb_id: { $ne: 0 } }
            ]
        }
    },
    {
        $addFields: {
            imdb_id: {$toLong: [ { $arrayElemAt: [ { $split: ["$imdb_id", "tt"]}, 1 ] }] }
        }
    },
    { $out: "movies_metadata_v2" }
]);
""")



#### 3. Add indexes to both the links and movies_metadata_v2 collections.

In [19]:
Mongo.command(DATABASE, """
db.getCollection('links').createIndex(
    { "movieId": 1 }, 
    { 
        unique: true, 
        name: "movieId_unique_index"
    }
);
""")



In [23]:
Mongo.command(DATABASE, """
db.getCollection('movies_metadata_v2').createIndex(
    { "imdb_id": 1 }, 
    { unique: false, name: "imdb_id_multiple_index" }
);
""")



#### 4. Add TMDB features (joined on imdb_id) to the movies collection

In [24]:
Mongo.command(DATABASE, """
db.getCollection('movies').aggregate([
    {
        $lookup:
          {
            from: "links",
            foreignField: "movieId",
            localField: "movieId", 
            as: "links"
          }
     },
     { $match: { links: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            id: "$movieId",
            tmdb_id:      { "$arrayElemAt": ["$links.tmdbId", 0] },
            imdb_id:      { "$arrayElemAt": ["$links.imdbId", 0] },
            title:        { $arrayElemAt:   [ {$split:["$title","("]} ,  0 ] },
            release_year: { $arrayElemAt:   [ {$split:["$title","("]} ,  1 ] },
            genres:       { $split:         [ "$genres", "|" ] }
        } 
    }, 
    {
        $lookup:
          {
            from: "movies_metadata_v2",
            foreignField: "imdb_id",
            localField: "imdb_id", 
            as: "movies_metadata"
          }
     },
     { $match: {  movies_metadata: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            id: 1,
            tmdb_id: 1,
            imdb_id: 1,
            title: 1,
            genres: 1,
            for_adults:        { "$arrayElemAt": ["$movies_metadata.adult", 0] },
            budget:            { "$arrayElemAt": ["$movies_metadata.budget", 0] },
            original_language: { "$arrayElemAt": ["$movies_metadata.original_language", 0] },
            overview:          { "$arrayElemAt": ["$movies_metadata.overview", 0] },
            poster:            { "$arrayElemAt": ["$movies_metadata.poster_path", 0] },
            release:           { "$arrayElemAt": ["$movies_metadata.release_date", 0] },
            popularity:        { "$arrayElemAt": ["$movies_metadata.popularity", 0] },
            vote_mean:         { "$arrayElemAt": ["$movies_metadata.vote_average", 0] },
            vote_count:        { "$arrayElemAt": ["$movies_metadata.vote_count", 0] }
        }
    },
    { $out: "movies_v2" }
]);
""")



#### 5. Group tags per (user, movie) pair.

In [25]:
Mongo.command(DATABASE, """
db.getCollection('tags').aggregate(
    [
        { 
            $group: {
                _id: {
                    user_id: "$userId",
                    movie_id: "$movieId"
                },
                tags: { $push: { $toLower: "$tag" } }
            }
        },
        {
          $project: {
            _id: 0,
            user_id: "$_id.user_id",
            movie_id: "$_id.movie_id",
            user_movie_id: { $concat: [ { $toString: "$_id.user_id" } , "_", { $toString:"$_id.movie_id"} ] },
            tags: 1
          }  
        },
        { $out: "tags_v2" }
    ]
);
""")



#### 6. Create a unique index on user_movie_id in the new tags_v2 collection.

In [26]:
Mongo.command(DATABASE, """
db.getCollection('tags_v2').createIndex(
    { 'user_movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



#### 7. Add the user_movie_id field in a new ratings_v2 collection and create a unique index.

In [27]:
Mongo.command(DATABASE, """
db.getCollection('ratings').aggregate([
        {
          $project: {
            user_id: "$userId",
            movie_id: "$movieId",
            user_movie_id: { $concat: [ { $toString: "$userId" } , "_", { $toString:"$movieId"} ] },
            rating: 1,
            timestamp: 1
          }  
        },
        { $out: "ratings_v2" }
    ]
);
""")



Remove duplicate documents, writing the result to ratings_v3.

In [31]:
Mongo.command(DATABASE, """
db.ratings_v2.aggregate([
    { 
        $group: { _id: "$user_movie_id", doc: { $first: "$$ROOT" } }
    },
    { 
        $replaceRoot: { newRoot: "$doc" }
    },
    { $out: "ratings_v3" }
]);
""")



In [32]:
Mongo.command(DATABASE, """
db.getCollection('ratings_v3').createIndex(
    { 'user_movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
);
""")



#### 8. Join the ratings_v3 and tags_v2 collections on user_movie_id into a new ratings_tags_v1 collection.

In [33]:
Mongo.command(DATABASE, """
db.getCollection('ratings_v3').aggregate([
    {
        $lookup:
          {
            from: "tags_v2",
            foreignField: "user_movie_id",
            localField: "user_movie_id", 
            as: "tags_v2"
          }
     },
     { $match: { tags_v2: { $exists: true, $not: {$size: 0} } } },
     { 
        $project: { 
            user_id: 1,
            movie_id: 1,
            rating: 1,
            timestamp: 1,
            tags: "$tags_v2.tags"
        }
    },
    {
        $addFields: {  
            _id: { $concat: [ { $toString: "$user_id" } , "_", { $toString:"$movie_id"} ] },            
            tags: {
                "$reduce": {
                    "input": "$tags",
                    "initialValue": [],
                    "in": { "$setUnion": [ "$$value", "$$this" ] }
                }
            }
        }
    },
    { $out: "ratings_tags_v1" }
]);
""")



#### 9. Add a tags field to the movies_v2 collection, written to movies_v3:

In [34]:
Mongo.command(DATABASE, """
db.getCollection('movies_v2').createIndex(
    { 'id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



In [36]:
Mongo.command(DATABASE, """
db.tags_v2.aggregate([
    { 
        $group: { _id: "$movie_id", doc: { $first: "$$ROOT" } }
    },
    { 
        $replaceRoot: { newRoot: "$doc" }
    },
    { $out: "tags_v3" }
]);
""")



In [37]:
Mongo.command(DATABASE, """
db.getCollection('tags_v3').createIndex(
    { 'movie_id': 1 }, 
    { unique: true, name: 'id_unique_index' }
)
""")



In [38]:
Mongo.command(DATABASE, """
db.getCollection('movies_v2').aggregate([
    {
        $lookup:
          {
            from: "tags_v3",
            foreignField: "movie_id",
            localField: "id",
            as: "tags_v3"
          }
    },
    { $match: { tags_v3: { $exists: true, $not: {$size: 0} } } },
    { 
        $addFields: { 
            tags: {
                "$reduce": {
                    "input": "$tags_v3.tags",
                    "initialValue": [],
                    "in": { "$setUnion": [ "$$value", "$$this" ] }
                }
            }
        }
    },
    { $unset: ["tags_v3"] },
    { $addFields: {  _id: "$id" } },
    { $unset: ["id"] },
    { $out: "movies_v3" }
]);
""")



#### 10. Export the final collections to JSON files:

* movies_v3 to movies_v3.json
* ratings_tags_v1 to ratings_tags_v1.json

In [39]:
Mongo.export_to_json(database='movies', path=DATASETS_PATH, collections=['movies_v3'])

2023-11-06 23:59:03,293 - INFO - Success: b'2023-11-06T23:59:02.504-0300  connected to: mongodb://localhost/2023-11-06T23:59:03.291-0300  exported 35149 records'


In [40]:
Mongo.export_to_json(database='movies', path=DATASETS_PATH, collections=['ratings_tags_v1'])

2023-11-06 23:59:06,007 - INFO - Success: b'2023-11-06T23:59:03.974-0300  connected to: mongodb://localhost/2023-11-06T23:59:04.975-0300  [##########..............]  movies.ratings_tags_v1  96000/210725  (45.6%)2023-11-06T23:59:05.975-0300  [######################..]  movies.ratings_tags_v1  200000/210725  (94.9%)2023-11-06T23:59:06.002-0300  [########################]  movies.ratings_tags_v1  210725/210725  (100.0%)2023-11-06T23:59:06.002-0300  exported 210725 records'
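
To consume the exported files later, something like the following could work. It assumes the export helper writes one JSON document per line (the default `mongoexport` layout) to a file named after the collection; neither detail is confirmed by the notebook, and `read_json_lines` is a hypothetical helper.

```python
import json


def read_json_lines(path):
    """Load a file that holds one JSON document per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical usage, assuming the file name matches the collection name:
# movies = read_json_lines(f'{DATASETS_PATH}/movies_v3.json')
# print(len(movies))
```

If the helper exports a single JSON array instead, `json.load` on the whole file would be the appropriate call.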
