# Data Processing: Mock Data

We will use this notebook to modify the data that we feed into our model for training. We strongly recommend creating a **virtual environment** before running the following code. Don't forget to install the TensorFlow dependencies (`tensorflow`, `tensorflow-datasets`, and `tensorflow-recommenders`) into your environment.

## Imports

Next, let's import the necessary packages.

In [1]:
import os
import pprint

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

## Dataset

We intend to use the MovieLens 100K ratings dataset, which is available through TensorFlow Datasets (`tensorflow_datasets`).

In [91]:
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movielens/100k-movies", split="train")

Let's take a look at the data structure:

In [92]:
for x in ratings.take(1).as_numpy_iterator():
    print("Rating: ")
    pprint.pprint(x)

for x in movies.take(1).as_numpy_iterator():
    print("Movie: ")
    pprint.pprint(x)

Rating: 
{'bucketized_user_age': 45.0,
 'movie_genres': array([7], dtype=int64),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}
Movie: 
{'movie_genres': array([4], dtype=int64),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


You can change the argument of `take()` in the previous for-loops if you would like to see more examples. The next step is to keep only the fields we need so that we can save the data to CSV files.

In [93]:
ratings = ratings.map(lambda x: {
    "media_id": x["movie_id"],
    "media_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"],
})

media_items = movies.map(lambda x: {
    "media_id": x["movie_id"],
    "media_title": x["movie_title"],
})

Now that we have some raw data, we will write it to separate files for further processing.

In [94]:
import csv

with open('../RawData/raw_ratings.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for rating in ratings.as_numpy_iterator():
        csv_writer.writerow([rating["media_id"], rating["media_title"], rating["user_id"], rating["user_rating"]])

In [95]:
with open('../RawData/raw_media_items.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for item in media_items.as_numpy_iterator():
        csv_writer.writerow([item["media_id"], item["media_title"]])

Now we are going to import some anime data from this wonderful [GitHub repository](https://github.com/manami-project/anime-offline-database). The objective is to get a list of anime titles and years that we can use for further processing.

In [96]:
import json

In [97]:
with open("../RawData/anime-offline-database.json", encoding='utf-8') as data_file:
    data = json.load(data_file)
    
def reduce(item):
    return {
        "media_title": item["title"].encode('utf-8'),
        "media_year": str(item["animeSeason"]["year"]).encode('utf-8'),
    }

animes = list(map(reduce, data["data"]))

We will shuffle the data a little bit before we put it into a CSV file. We will also perform some adjustments for processing.

In [98]:
np.random.shuffle(animes)

with open('../RawData/raw_animes.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    for anime in animes:
        csv_writer.writerow([anime["media_title"], anime["media_year"]])
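
The writer cells above pass `bytes` values straight to `csv.writer`, which stores their `repr` (for example `b'357'`), so the cleanup step has to slice the `b'...'` wrapper off again. As an alternative sketch, not what this notebook does, the values could be decoded before writing. The helper name `decode_row` is hypothetical, and the example writes to an in-memory buffer instead of a file:

```python
import csv
import io


def decode_row(row):
    """Decode bytes fields to str (UTF-8); leave other values untouched."""
    return [v.decode("utf-8") if isinstance(v, bytes) else v for v in row]


# In-memory stand-in for a CSV file, so the sketch has no side effects.
buffer = io.StringIO()
csv_writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

# Sample rows shaped like the ratings and anime rows written above.
csv_writer.writerow(decode_row([b"357", b"One Flew Over the Cuckoo's Nest (1975)", b"138", 4.0]))
csv_writer.writerow(decode_row(["Classroom\u2606Crisis".encode("utf-8"), b"2015"]))

csv_text = buffer.getvalue()
print(csv_text)
```

With this approach the `[2:-1]` slicing in the cleanup cells would no longer be needed, and non-ASCII titles would stay readable, assuming the files are opened with `encoding='utf-8'`.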

Let's take a look at our data and at the same time do some cleanup.

In [99]:
with open('../RawData/raw_ratings.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    ratings_array = []
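    for row in csv_reader:
        # Values were written as bytes, so each field looks like "b'...'".
        # [2:-1] strips the leading b' and the trailing quote.
        media_id = row[0][2:-1]
        # [2:-8] also drops the trailing " (YYYY)" year suffix from the title.
        media_title = row[1][2:-8]
        user_id = row[2][2:-1]
        user_rating = row[3]
        ratings_array.append([media_id, media_title, user_id, user_rating])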
        
pprint.pprint(ratings_array[:20])
print("Ratings:", len(ratings_array))

[['357', "One Flew Over the Cuckoo's Nest", '138', '4.0'],
 ['709', 'Strictly Ballroom', '92', '2.0'],
 ['412', 'Very Brady Sequel, A', '301', '4.0'],
 ['56', 'Pulp Fiction', '60', '4.0'],
 ['895', 'Scream 2', '197', '3.0'],
 ['325', 'Crash', '601', '4.0'],
 ['95', 'Aladdin', '710', '3.0'],
 ['92', 'True Romance', '833', '2.0'],
 ['425', 'Bob Roberts', '916', '5.0'],
 ['271', 'Starship Troopers', '940', '2.0'],
 ['355', 'Sphere', '611', '1.0'],
 ['712', 'Tin Men', '707', '3.0'],
 ['825', 'Arrival, The', '699', '3.0'],
 ['240', 'Beavis and Butt-head Do America', '16', '4.0'],
 ['1150', 'Last Dance', '314', '4.0'],
 ['684', 'In the Line of Fire', '217', '5.0'],
 ['124', 'Lone Star', '276', '5.0'],
 ['294', 'Liar Liar', '510', '3.0'],
 ['265', 'Hunt for Red October, The', '757', '3.0'],
 ['465', 'Jungle Book, The', '881', '3.0']]
Ratings: 100000


In [100]:
with open('../RawData/raw_media_items.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    media_items_array = []
    for row in csv_reader:
        media_id = row[0][2:-1]
        # Note: a title without a " (YYYY)" suffix loses its last characters here.
        media_title = row[1][2:-8]
        media_items_array.append([media_id, media_title])
        
pprint.pprint(media_items_array[:20])
print("Movies:", len(media_items_array))

[['1681', 'You So Crazy'],
 ['1457', 'Love Is All There Is'],
 ['500', 'Fly Away Home'],
 ['838', 'In the Line of Duty 2'],
 ['1648', 'Niagara, Niagara'],
 ['547', "Young Poisoner's Handbook, The"],
 ['387', 'Age of Innocence, The'],
 ['1495', 'Flirt'],
 ['817', 'Frisk'],
 ['267', ''],
 ['1637', 'Girls Town'],
 ['1396', 'Stonewall'],
 ['498', 'African Queen, The'],
 ['852', 'Bloody Child, The'],
 ['685', 'Executive Decision'],
 ['231', 'Batman Returns'],
 ['719', 'Canadian Bacon'],
 ['308', 'FairyTale: A True Story'],
 ['445', 'Body Snatcher, The'],
 ['486', 'Sabrina']]
Movies: 1682
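
The fixed-width slice `[2:-8]` assumes every title ends in a ` (YYYY)` suffix; the empty title for id `267` in the output above is probably a title that lacks one. A minimal sketch of a more forgiving cleanup uses a regular expression that removes the suffix only when it is present. The names `YEAR_SUFFIX` and `strip_year` are hypothetical, and the input is assumed to be an already unwrapped title string:

```python
import re

# Matches optional whitespace followed by a four-digit year in parentheses
# at the very end of the title.
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)$")


def strip_year(title):
    """Remove a trailing ' (YYYY)' if present; otherwise return the title as-is."""
    return YEAR_SUFFIX.sub("", title)


cleaned = [strip_year(t) for t in ["You So Crazy (1994)", "Fly Away Home (1996)", "unknown"]]
print(cleaned)
```

Titles without a year pass through unchanged instead of being truncated.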


In [101]:
with open('../RawData/raw_animes.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    animes_array = []
    for row in csv_reader:
        media_title = row[0][2:-1]
        media_year = row[1][2:-1]
        animes_array.append([media_title, media_year])
        
pprint.pprint(animes_array[:20])
print("Animes:", len(animes_array))

[['Classroom\\xe2\\x98\\x86Crisis', '2015'],
 ['Highlander: The Search for Vengeance', '2007'],
 ['Chuuken Mochi Shiba', 'None'],
 ['Renou Xueyuan', '2021'],
 ['Jashin-chan Dropkick Episode 12', '2018'],
 ['Ketsuekigata-kun! 3', '2015'],
 ['Sanguo Yanyi 2nd Season: Zhulu Zhongyuan', '2019'],
 ['Danball Senki', '2011'],
 ['8-gatsu no Symphony: Shibuya 2002-2003', '2009'],
 ['Koisuru Boukun', '2010'],
 ['Chang Jian Fengyun 2', 'None'],
 ['Mahoutsukai no Yome', '2017'],
 ['New Big Head Son and Little Head Dad Season 3', '2015'],
 ['SINBAD', '2016'],
 ['Message Song', '1996'],
 ['Oren no yurai', '2003'],
 ['Yakimochi Caprice', '2011'],
 ['Kanojo to Kanojo no Neko: Everything Flows', '2016'],
 ['Garakuta-doori no Stain', '2003'],
 ['Huanbao Tegong Dui', 'None']]
Animes: 32967
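
Several entries above show the year as the string `'None'`, because `str(None)` is written whenever the source record has no year. If an empty value is preferred, the conversion could be guarded as in the sketch below. The helper `season_year` is hypothetical, and the sample records are made up to mirror the `animeSeason` and `year` fields that the notebook reads:

```python
def season_year(item):
    """Return the anime's year as a string, or '' when it is missing."""
    year = (item.get("animeSeason") or {}).get("year")
    return "" if year is None else str(year)


# Made-up records with the same shape as the fields used in this notebook.
samples = [
    {"title": "Danball Senki", "animeSeason": {"year": 2011}},
    {"title": "Chuuken Mochi Shiba", "animeSeason": {"year": None}},
]

years = [season_year(sample) for sample in samples]
print(years)
```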


Now we are going to replace half of the movies with anime titles. Since the anime titles are already shuffled, let's do the same with the movies too.

In [102]:
np.random.shuffle(media_items_array)

pprint.pprint(media_items_array[:20])

[['1198', 'Purple Noon'],
 ['939', 'Murder in the First'],
 ['1141', 'War Room, The'],
 ['1601', 'Office Killer'],
 ['922', 'Dead Man'],
 ['1190', 'That Old Feeling'],
 ['230', 'Star Trek IV: The Voyage Home'],
 ['607', 'Rebecca'],
 ['1435', 'Steal Big, Steal Little'],
 ['1557', 'Yankee Zulu'],
 ['914', 'Wild Things'],
 ['1398', 'Anna'],
 ['365', 'Powder'],
 ['1414', 'Coldblooded'],
 ['1078', 'Oliver & Company'],
 ['505', 'Dial M for Murder'],
 ['794', 'It Could Happen to You'],
 ['771', 'Johnny Mnemonic'],
 ['1133', 'Escape to Witch Mountain'],
 ['1127', 'Truman Show, The']]


In [103]:
for i in range(len(media_items_array) // 2):
    media_items_array[i][1] = animes_array[i][0]
    
pprint.pprint(media_items_array[:20])
print("Movies:", len(media_items_array))

[['1198', 'Classroom\\xe2\\x98\\x86Crisis'],
 ['939', 'Highlander: The Search for Vengeance'],
 ['1141', 'Chuuken Mochi Shiba'],
 ['1601', 'Renou Xueyuan'],
 ['922', 'Jashin-chan Dropkick Episode 12'],
 ['1190', 'Ketsuekigata-kun! 3'],
 ['230', 'Sanguo Yanyi 2nd Season: Zhulu Zhongyuan'],
 ['607', 'Danball Senki'],
 ['1435', '8-gatsu no Symphony: Shibuya 2002-2003'],
 ['1557', 'Koisuru Boukun'],
 ['914', 'Chang Jian Fengyun 2'],
 ['1398', 'Mahoutsukai no Yome'],
 ['365', 'New Big Head Son and Little Head Dad Season 3'],
 ['1414', 'SINBAD'],
 ['1078', 'Message Song'],
 ['505', 'Oren no yurai'],
 ['794', 'Yakimochi Caprice'],
 ['771', 'Kanojo to Kanojo no Neko: Everything Flows'],
 ['1133', 'Garakuta-doori no Stain'],
 ['1127', 'Huanbao Tegong Dui']]
Movies: 1682
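
The replacement step can also be sketched as a small function that returns a new list and leaves its input untouched, which makes the behaviour easy to check on a handful of rows. The name `replace_first_half` is hypothetical, and the guard against running out of replacement titles is an addition that the cell above does not have:

```python
def replace_first_half(items, new_titles):
    """Return a copy of items where the first half of the titles is replaced."""
    half = len(items) // 2
    if len(new_titles) < half:
        raise ValueError("not enough replacement titles")
    return [
        [item_id, new_titles[i]] if i < half else [item_id, title]
        for i, (item_id, title) in enumerate(items)
    ]


movies_sample = [["1", "A"], ["2", "B"], ["3", "C"], ["4", "D"], ["5", "E"]]
mixed = replace_first_half(movies_sample, ["X", "Y", "Z"])
print(mixed)
```

With an odd number of items, integer division rounds down, so two of the five sample rows are replaced.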


As we can see, our data keeps the same length. Let's write it to a new CSV file.

In [104]:
with open('../RawData/modified_media_items.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(["record_id", "media_id", "media_title"])

    # Number the records starting at 1.
    for record_id, item in enumerate(media_items_array, start=1):
        csv_writer.writerow([record_id, item[0], item[1]])

Next, let's also update the ratings data by replacing the old titles with our updated ones.

In [105]:
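# Build a lookup table once instead of scanning every media item for each rating.
titles_by_id = {item[0]: item[1] for item in media_items_array}

for rating in ratings_array:
    if rating[0] in titles_by_id:
        rating[1] = titles_by_id[rating[0]]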
            
pprint.pprint(ratings_array[:20])
print("Ratings:", len(ratings_array))

[['357', 'Sifan', '138', '4.0'],
 ['709', 'Strictly Ballroom', '92', '2.0'],
 ['412', 'Very Brady Sequel, A', '301', '4.0'],
 ['56', 'Pulp Fiction', '60', '4.0'],
 ['895', 'Kobo-chan: Matsuri ga Ippai!', '197', '3.0'],
 ['325', 'Crash', '601', '4.0'],
 ['95', 'Zhun Xing', '710', '3.0'],
 ['92', 'True Romance', '833', '2.0'],
 ['425', 'Bob Roberts', '916', '5.0'],
 ['271', 'Starship Troopers', '940', '2.0'],
 ['355', 'Sphere', '611', '1.0'],
 ['712', 'Baosheng Dadi Zhi Qi Er Duo Bao', '707', '3.0'],
 ['825', 'Arrival, The', '699', '3.0'],
 ['240', 'Zhu Zhu Xia: Jing Qiu Xiao Yingxiong', '16', '4.0'],
 ['1150', 'Last Dance', '314', '4.0'],
 ['684', 'In the Line of Fire', '217', '5.0'],
 ['124', 'Lone Star', '276', '5.0'],
 ['294', 'Karakuri Circus (TV)', '510', '3.0'],
 ['265', 'Haitoku no Kyoukai', '757', '3.0'],
 ['465', 'Chan Shuo A Kuan', '881', '3.0']]
Ratings: 100000


Finally, let's save our progress in a CSV file.

In [106]:
with open('../RawData/modified_ratings.csv', mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(["record_id", "media_id", "media_title", "user_id", "user_rating"])

    # Number the records starting at 1.
    for record_id, rating in enumerate(ratings_array, start=1):
        csv_writer.writerow([record_id, rating[0], rating[1], rating[2], rating[3]])

Since shuffling modifies the original arrays in place, we recommend running the notebook from the beginning if you would like to reset the data. Now we have everything we need to create TensorFlow Datasets.
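
If repeatable results are preferred over re-running the notebook, one option is to shuffle a copy with a seeded random generator. This is a sketch only: the helper `shuffled_copy` and the seed value are assumptions and not part of the notebook, which calls `np.random.shuffle` without a seed.

```python
import numpy as np


def shuffled_copy(rows, seed=42):
    """Return a shuffled copy of rows; the same seed always gives the same order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rows))
    return [rows[i] for i in order]


original = [["1", "A"], ["2", "B"], ["3", "C"], ["4", "D"]]
first = shuffled_copy(original)
second = shuffled_copy(original)
print(first)
```

Because the input list is never modified, the original order stays available without reloading the data.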