# Data Processing: Mock Data

This notebook prepares the data that we feed into our model for training. We strongly recommend creating a **virtual environment** before running the following code. Let's start by installing our dependencies.

In [5]:
!pip install tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets





## Imports

Next, let's import the necessary packages.

In [1]:
import os
import pprint

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

## Dataset

We will use the MovieLens 100K ratings dataset, which is available through TensorFlow Datasets (`tfds`).

In [5]:
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movielens/100k-movies", split="train")

Let's take a look at the data structure:

In [6]:
for x in ratings.take(1).as_numpy_iterator():
    print("Rating: ")
    pprint.pprint(x)

for x in movies.take(1).as_numpy_iterator():
    print("Movie: ")
    pprint.pprint(x)

Rating: 
{'bucketized_user_age': 45.0,
 'movie_genres': array([7], dtype=int64),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}
Movie: 
{'movie_genres': array([4], dtype=int64),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


To see more examples, increase the argument passed to `take()` in the loops above. The next step is to process the data. We only need `user_id` and `movie_title` to train our model.

In [7]:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])
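
To make the effect of the two `map` calls concrete, here is a plain-Python sketch applied to the sample records printed above. It is only an illustration: `keep_rating_fields` is a hypothetical helper, and the real pipeline applies the lambdas to tensors inside `tf.data` rather than to Python dicts.

```python
# Sample records copied from the output above (only a subset of the rating fields).
sample_rating = {
    "movie_id": b"357",
    "movie_title": b"One Flew Over the Cuckoo's Nest (1975)",
    "user_id": b"138",
    "user_rating": 4.0,
}
sample_movie = {
    "movie_id": b"1681",
    "movie_title": b"You So Crazy (1994)",
}


def keep_rating_fields(record):
    """Hypothetical helper mirroring the lambda passed to ratings.map."""
    return {
        "movie_title": record["movie_title"],
        "user_id": record["user_id"],
    }


projected_rating = keep_rating_fields(sample_rating)
projected_movie = sample_movie["movie_title"]  # mirrors the movies.map lambda

print(projected_rating)
print(projected_movie)
```

After this step, each rating carries only the two fields the model needs, and each movie is reduced to its title.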

Next, we identify the unique `user_id`s and `movie_title`s. These form the vocabularies needed to map each user and each movie to an embedding vector.

In [15]:
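# Batch the datasets so that each element is an array of values.
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

# Concatenate the batches and keep only the unique values.
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))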


print("Unique Movie Titles:", len(unique_movie_titles))
print("Unique User IDs:", len(unique_user_ids))

Unique Movie Titles: 1664
Unique User IDs: 943
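
The role of these vocabularies can be sketched without TensorFlow. The snippet below is a simplified illustration, not the notebook's model code: it assigns each unique ID an integer index and reserves index 0 for unseen values, which is similar in spirit to what a Keras `StringLookup` layer does by default. The names `build_lookup`, `to_index`, and `toy_user_ids` are hypothetical, and the toy IDs stand in for the 943 real ones.

```python
def build_lookup(vocabulary):
    """Map each vocabulary entry to an index, starting at 1 (0 means unknown)."""
    return {token: index for index, token in enumerate(vocabulary, start=1)}


# Toy stand-in for unique_user_ids: deduplicated and sorted, like np.unique output.
toy_user_ids = sorted({b"138", b"92", b"301", b"138"})
lookup = build_lookup(toy_user_ids)


def to_index(token):
    # Unseen tokens fall back to the reserved index 0.
    return lookup.get(token, 0)


print(to_index(b"138"))   # a known ID
print(to_index(b"9999"))  # an ID that is not in the vocabulary
```

Under this scheme, an embedding table indexed by `to_index` needs `len(vocabulary) + 1` rows so that the reserved index 0 has a row of its own.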


In [16]:
unique_movie_titles

array([b"'Til There Was You (1997)", b'1-900 (1994)',
       b'101 Dalmatians (1996)', ..., b'Zeus and Roxanne (1997)',
       b'unknown', b'\xc3\x81 k\xc3\xb6ldum klaka (Cold Fever) (1994)'],
      dtype=object)
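
The titles are stored as byte strings, which is why the last entry shows escape sequences such as `\xc3\x81`. As a small aside, and assuming the bytes are valid UTF-8 (as they appear to be here), they can be decoded for display. The variable names below are illustrative only.

```python
# Two entries copied from the array above.
raw_titles = [
    b"'Til There Was You (1997)",
    b"\xc3\x81 k\xc3\xb6ldum klaka (Cold Fever) (1994)",
]

# Decode each byte string into a regular Python str for display.
readable_titles = [title.decode("utf-8") for title in raw_titles]

for title in readable_titles:
    print(title)
```

Decoding is only needed for reading the titles. For the vocabulary itself it is probably simplest to keep the values as bytes, since that is the form the dataset yields.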