# Data Processing: Mock Data

We use this notebook to prepare the data that we feed into our model for training. We strongly recommend creating a **virtual environment** before running the code below, and installing TensorFlow and the other packages imported below into that environment.

## Imports

Let's import the necessary packages.

In [1]:
import os
import pprint

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

## Dataset

We use the MovieLens 100K ratings dataset, which is available through TensorFlow Datasets (`tensorflow_datasets`).

In [24]:
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movielens/100k-movies", split="train")

Let's take a look at the data structure:

In [25]:
for x in ratings.take(1).as_numpy_iterator():
    print("Rating: ")
    pprint.pprint(x)

for x in movies.take(1).as_numpy_iterator():
    print("Movie: ")
    pprint.pprint(x)

Rating: 
{'bucketized_user_age': 45.0,
 'movie_genres': array([7], dtype=int64),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}
Movie: 
{'movie_genres': array([4], dtype=int64),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


You can change the argument to `take()` in the loops above if you would like to see more examples. Next, we select the fields we need before saving the data to CSV files.

In [26]:
ratings = ratings.map(lambda x: {
    "media_id": x["movie_id"],
    "media_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"],
})

media_items = movies.map(lambda x: {
    "media_id": x["movie_id"],
    "media_title": x["movie_title"],
})

Now that we have some raw data, we will write it to separate files for further processing.

In [27]:
import csv

with open('../RawData/raw_ratings.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for rating in ratings.as_numpy_iterator():
        csv_writer.writerow([rating["media_id"], rating["media_title"], rating["user_id"], rating["user_rating"]])

In [29]:
with open('../RawData/raw_media_items.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for item in media_items.as_numpy_iterator():
        csv_writer.writerow([item["media_id"], item["media_title"]])
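One thing to be aware of: `as_numpy_iterator()` yields string features as `bytes`, and `csv.writer` converts non-string values with `str()`, so a field ends up in the file as the literal text `b'357'` rather than `357`. The sketch below reproduces this with an in-memory buffer and a hand-made sample row (values copied from the output shown earlier) and shows one possible way to avoid it by decoding first. `decode_field` and `write_row` are hypothetical helpers, not part of this notebook.

```python
import csv
import io

# Hand-made sample row shaped like one item from ratings.as_numpy_iterator().
row = {
    "media_id": b"357",
    "media_title": b"One Flew Over the Cuckoo's Nest (1975)",
    "user_id": b"138",
    "user_rating": 4.0,
}


def decode_field(value):
    """Hypothetical helper: decode bytes to str, leave other values untouched."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def write_row(values):
    """Hypothetical helper: write one CSV row to memory and return it as text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


raw_line = write_row(row.values())                               # bytes stringified
clean_line = write_row(decode_field(v) for v in row.values())   # bytes decoded

print(raw_line)
print(clean_line)
```

If the fields are decoded before writing, the files no longer contain the `b'...'` wrapper, and any later step that strips it would need to be adjusted accordingly.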

Now we are going to import some anime data from this wonderful [GitHub repository](https://github.com/manami-project/anime-offline-database). The objective is to get a list of anime titles and years that we can use for further processing.

In [40]:
import json

In [49]:
with open("../RawData/anime-offline-database.json", encoding='utf-8') as data_file:
    data = json.load(data_file)
    
def reduce(item):
    return {
        "media_title": item["title"].encode('utf-8'),
        "media_year": str(item["animeSeason"]["year"]).encode('utf-8'),
    }

animes = list(map(reduce, data["data"]))
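To see what this extraction produces without loading the full file, it can be tried on a tiny inline document. The two records below are made up for illustration and only mimic the `title` and `animeSeason` fields used above. Whether real entries can lack a year is an assumption here, and `reduce_safe` is a hypothetical variant that tolerates a missing year and returns plain strings instead of bytes.

```python
import json

# Made-up sample that mimics only the fields used above.
sample = json.loads("""
{"data": [
  {"title": "Artiswitch", "animeSeason": {"season": "UNDEFINED", "year": 2021}},
  {"title": "Untitled Project", "animeSeason": {"season": "UNDEFINED"}}
]}
""")


def reduce_safe(item):
    """Hypothetical variant of reduce(): plain str values, '' when the year is missing."""
    year = item.get("animeSeason", {}).get("year")
    return {
        "media_title": item["title"],
        "media_year": "" if year is None else str(year),
    }


sample_animes = [reduce_safe(item) for item in sample["data"]]
print(sample_animes)
```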

We will shuffle the data a little bit before we put it into a CSV file. We will also perform some adjustments for processing.

In [50]:
np.random.shuffle(animes)

with open('../RawData/raw_animes.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for anime in animes:
        csv_writer.writerow([anime["media_title"], anime["media_year"]])
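Note that `np.random.shuffle` reorders the list in place and gives a different order on each run unless a seed is set. If a reproducible order is useful (an assumption, the notebook does not require it), one option is to shuffle with a seeded generator. The sketch below uses the standard library's `random.Random`; the helper name `seeded_shuffle` and the seed value 42 are arbitrary choices for illustration.

```python
import random

# A few titles taken from the sample output of this notebook.
titles = ["Artiswitch", "Larva 3rd Season", "Petit Manga", "Tiger Mask Pilot Film"]


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy; the same seed always yields the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


first = seeded_shuffle(titles)
second = seeded_shuffle(titles)

print(first == second)                   # same seed, same order
print(sorted(first) == sorted(titles))   # same elements, only reordered
```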

Let's take a look at our data and at the same time do some cleanup.

In [51]:
with open('../RawData/raw_ratings.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    ratings_array = []
    for row in csv_reader:
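        # Fields were written as stringified bytes (e.g. "b'357'"), so strip the b'...' wrapper.
        media_id = row[0][2:-1]
        # For the title, also drop the trailing " (YYYY)" year suffix.
        media_title = row[1][2:-8]
        user_id = row[2][2:-1]
        user_rating = row[3]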
        ratings_array.append([media_id, media_title, user_id, user_rating])
        
ratings_array[:20]

[['357', "One Flew Over the Cuckoo's Nest", '138', '4.0'],
 ['709', 'Strictly Ballroom', '92', '2.0'],
 ['412', 'Very Brady Sequel, A', '301', '4.0'],
 ['56', 'Pulp Fiction', '60', '4.0'],
 ['895', 'Scream 2', '197', '3.0'],
 ['325', 'Crash', '601', '4.0'],
 ['95', 'Aladdin', '710', '3.0'],
 ['92', 'True Romance', '833', '2.0'],
 ['425', 'Bob Roberts', '916', '5.0'],
 ['271', 'Starship Troopers', '940', '2.0'],
 ['355', 'Sphere', '611', '1.0'],
 ['712', 'Tin Men', '707', '3.0'],
 ['825', 'Arrival, The', '699', '3.0'],
 ['240', 'Beavis and Butt-head Do America', '16', '4.0'],
 ['1150', 'Last Dance', '314', '4.0'],
 ['684', 'In the Line of Fire', '217', '5.0'],
 ['124', 'Lone Star', '276', '5.0'],
 ['294', 'Liar Liar', '510', '3.0'],
 ['265', 'Hunt for Red October, The', '757', '3.0'],
 ['465', 'Jungle Book, The', '881', '3.0']]
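
The slices used in this cleanup depend on how the values were written: `[2:-1]` drops the `b'` prefix and the closing quote of a stringified bytes value, and `[2:-8]` additionally drops a seven-character year suffix such as ` (1975)`. A title that does not end in a parenthesized four-digit year would be truncated without any error. A more defensive sketch uses `ast.literal_eval` and a regular expression; `clean_bytes_repr` and `strip_year` are hypothetical helpers, not part of this notebook.

```python
import ast
import re

# Matches a trailing " (YYYY)" suffix, including the whitespace before it.
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)$")


def clean_bytes_repr(text):
    """Turn text such as "b'357'" back into the plain string '357'."""
    value = ast.literal_eval(text)
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def strip_year(title):
    """Remove a trailing ' (YYYY)' if present; otherwise return the title unchanged."""
    return YEAR_SUFFIX.sub("", title)


# One row as csv.reader would return it from raw_ratings.csv.
row = ["b'357'", "b\"One Flew Over the Cuckoo's Nest (1975)\"", "b'138'", "4.0"]

cleaned = [
    clean_bytes_repr(row[0]),
    strip_year(clean_bytes_repr(row[1])),
    clean_bytes_repr(row[2]),
    row[3],
]
print(cleaned)
```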

In [52]:
with open('../RawData/raw_media_items.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    media_items_array = []
    for row in csv_reader:
        media_id = row[0][2:-1]
        media_title = row[1][2:-8]
        media_items_array.append([media_id, media_title])
        
media_items_array[:20]
len(media_items_array)

1682

In [55]:
with open('../RawData/raw_animes.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    animes_array = []
    for row in csv_reader:
        media_title = row[0][2:-1]
        media_year = row[1][2:-1]
        animes_array.append([media_title, media_year])
        
animes_array[:20]

[['Sentai Hero Sukiyaki Force: Gunma no Heiwa o Negau Season e, Mata?',
  '2018'],
 ['Guobao Te Gong 2nd Season', '2012'],
 ['Saiunkoku Monogatari Recaps', '2007'],
 ['Lupin the IIIrd: Chikemuri no Ishikawa Goemon', '2017'],
 ['Mofumofuiction', '2018'],
 ['Kurage-P: Check Check Check One Two!', '2016'],
 ['Larva 3rd Season', '2014'],
 ['Cardfight!! Vanguard G: Stride Gate-hen', '2016'],
 ['BanG Dream! Film Live 2nd Stage', '2021'],
 ['Petit Manga', '2009'],
 ['Wakakusa Monogatari: Nan to Jo-sensei', '1993'],
 ['Wo Qi Ku Liao Baiwan Xiulian Zhe', '2021'],
 ['Artiswitch', '2021'],
 ['Jishin da!! Mii-chan no Bousai Kunren', '2006'],
 ['Tiger Mask Pilot Film', '1969'],
 ['Yowamushi Pedal: New Generation', '2017'],
 ['Mikito-P: Endroll ni Boku no Namae wo Irenaide', '2014'],
 ['Joshi Ochi!!: 2-kai kara Ero Musume ga Futte kite, Ore no Areni!?', '2018'],
 ['Kagachi-sama Onagusame Tatematsurimasu: Netorare Mura Inya Hanashi The Animation',
  '2013'],
 ['Koneko no Chi: Ponponra Daibouken', '20