# 🎮 GameRx: App-Ready Master Dataset  

This notebook prepares the *final* dataset that the GameRx app will use.

We are not cleaning or rebuilding anything from earlier notebooks.  
This step is only about checking, validating, and packaging the finished data.

### ✨ What This Notebook Does  
- Loads the merged master dataset from Notebook 10  
- Runs full quality checks  
- Confirms column types, ranges, and missing values  
- Validates emotion, relief, and genre fields  
- Adds any last small features needed for the app  
- Creates a clear schema + data dictionary  
- Exports the final, app-ready dataset

### 💡 Goal  
Make sure the data is clean, stable, and ready for the GameRx recommendation app.

Nothing extra.  
Just a smooth final pass before deployment.

---

## Table of Contents

1. [Setup & Imports](#1-setup--imports)  
2. [Load Master Dataset](#2-load-master-dataset)  
3. [Quick Overview](#3-quick-overview)  
4. [Schema & Column Checks](#4-schema--column-checks)  
5. [Missing Values & Data Quality](#5-missing-values--data-quality)  
6. [Validate Key Fields](#6-validate-key-fields)
    - [Emotion Columns](#emotion-columns)
    - [Relief Tags](#relief-tags)
    - [Genres](#genres)
7. [Final Feature Adjustments](#7-final-feature-adjustments)  
8. [Data Dictionary](#8-data-dictionary)  
9. [Prepare App-Ready Exports](#9-prepare-app-ready-exports)  
10. [Save Final Files](#10-save-final-files)  
11. [Notes & Next Steps](#11-notes--next-steps)


---

## 1. Setup & Imports

This section sets up the notebook with the core libraries used for validation  
and final dataset preparation.

The focus here is on essentials only:
- inspecting the master dataset  
- running quality checks  
- building the data dictionary  
- exporting the final files

Just a light setup before the deeper checks begin.

In [4]:
# Core tools
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Validation and inspection
import json

# File handling
import os
from pathlib import Path

---

## 2. Load Master Dataset

This step brings in the master dataset built in Notebook 10.

The goal is simple:
- load the file  
- confirm it opens correctly  
- get it ready for quality checks

No changes to the data.  
Just a clean load into memory.

In [6]:
# Path to the master dataset from Notebook 10
data_path = Path("D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/10_master_dataset.csv")

# Load the dataset
master_df = pd.read_csv(data_path, low_memory=False)

# Quick shape check to confirm it loaded
master_df.shape

(137513, 125)
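
The load step can be made a little more defensive. The sketch below checks that the file exists and, optionally, that the shape matches what is expected. The helper name `load_master` and its `expected_shape` parameter are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_master(path, expected_shape=None):
    """Load the master CSV and optionally confirm its (rows, columns) shape.

    `load_master` and `expected_shape` are illustrative names only.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Master dataset not found: {path}")

    df = pd.read_csv(path, low_memory=False)

    # Fail early if the file does not have the shape we planned for
    if expected_shape is not None and df.shape != tuple(expected_shape):
        raise ValueError(f"Expected shape {tuple(expected_shape)}, got {df.shape}")
    return df


# Hypothetical usage with the path and shape from this notebook:
# master_df = load_master(data_path, expected_shape=(137513, 125))
```

Raising an error here stops the notebook before any later check runs on the wrong file.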

---

## 3. Quick Overview

This section gives a first look at the loaded dataset.

The goal is to check a few basics:
- the first rows  
- the shape of the data  
- the column list  

This helps confirm that everything loaded as expected before moving into deeper checks.

In [5]:
# Show first few rows
master_df.head()

# Check shape
master_df.shape

# List column names
master_df.columns.tolist()

['AppID',
 'Name',
 'Release date',
 'About the game',
 'Languages',
 'Developers',
 'Publishers',
 'Metacritic score',
 'User score',
 'Positive',
 'Negative',
 'Recommendations',
 'Genres',
 'Tags',
 'genre_list',
 'primary_genre',
 'genre_count',
 'anger_per_100w',
 'anticipation_per_100w',
 'disgust_per_100w',
 'fear_per_100w',
 'joy_per_100w',
 'sadness_per_100w',
 'surprise_per_100w',
 'trust_per_100w',
 'positive_per_100w',
 'negative_per_100w',
 'primary_emotion',
 'emotion_richness',
 'normalized_intensity',
 'relief_tag',
 'hybrid_relief_tag',
 'cluster_label',
 'archetype',
 'Average playtime forever',
 'Average playtime two weeks',
 'Median playtime forever',
 'Median playtime two weeks',
 'Categories',
 'Release date_hyb',
 'About the game_hyb',
 'Languages_hyb',
 'Metacritic score_hyb',
 'User score_hyb',
 'Positive_hyb',
 'Negative_hyb',
 'Recommendations_hyb',
 'Average playtime forever_hyb',
 'Average playtime two weeks_hyb',
 'Median playtime forever_hyb',
 'Median pl

### 🔍 Results: Quick Overview

The dataset loaded correctly.

A few things stand out from the first check:

- The shape is 137,513 rows by 125 columns.  
- All columns from the merged pipeline are present; none from earlier notebooks are missing.  
- The list includes metadata, genres, relief tags, emotions, clusters,  
  and all review-related features.  

This confirms the master dataset is ready for deeper validation.
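
The "all columns are present" check can also be done in code rather than by eye. This is a minimal sketch: the column names come from the list printed above, but which of them count as required for the app is an assumption, and `find_missing_columns` is a hypothetical helper.

```python
import pandas as pd

# Assumed set of columns the app depends on. The names appear in the
# column list above; treating them as "required" is an assumption.
REQUIRED_COLUMNS = [
    "AppID", "Name", "primary_genre", "primary_emotion",
    "relief_tag", "hybrid_relief_tag", "cluster_label", "archetype",
]


def find_missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    present = set(df.columns)
    return [col for col in required if col not in present]


# Small stand-in frame; in the notebook this would be master_df
sample = pd.DataFrame(columns=["AppID", "Name", "primary_genre"])
print(find_missing_columns(sample))
```

An empty list means every required column made it through the merge.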

---

## 4. Schema & Column Checks

This step reviews the structure of the dataset.

The goal is to confirm:
- column names  
- column types  
- basic consistency across the file

These checks help catch issues early,  
before moving into deeper validation.

In [6]:
# Check column info and data types
master_df.info()

# Quick summary of numeric columns
master_df.describe()

# Count of each data type
master_df.dtypes.value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137513 entries, 0 to 137512
Columns: 125 entries, AppID to negative_per_100w_cluster
dtypes: float64(66), int64(22), object(37)
memory usage: 131.1+ MB


float64    66
object     37
int64      22
Name: count, dtype: int64

### 🔍 Results: Schema & Column Checks

The dataset structure looks stable.

Key points:
- 125 columns are present, which matches the expected merge.  
- 88 columns are numeric (66 float64, 22 int64), and 37 are object columns that hold text.  
- No unexpected data types showed up.  
- Memory use is at least 131 MB (pandas reports 131.1+ MB), which is manageable for a dataset of this size.

Everything is in good shape for the next round of checks.
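
A type check like this can be written down so it runs the same way every time. The sketch below compares a few columns against an expected kind. `AppID` (int64) and `Name` (object) match the outputs in this notebook; treating `joy_per_100w` as numeric is an assumption, and `check_schema` is a hypothetical helper.

```python
import pandas as pd
from pandas.api import types as ptypes

# Expected kind per column (illustrative subset, partly assumed)
EXPECTED_KINDS = {
    "AppID": "integer",
    "Name": "text",
    "joy_per_100w": "numeric",
}

# Map each kind to a pandas dtype test
_CHECKS = {
    "integer": ptypes.is_integer_dtype,
    "numeric": ptypes.is_numeric_dtype,
    "text": lambda s: ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s),
}


def check_schema(df, expected=EXPECTED_KINDS):
    """Return {column: problem} for columns that are missing or of the wrong kind."""
    problems = {}
    for col, kind in expected.items():
        if col not in df.columns:
            problems[col] = "missing"
        elif not _CHECKS[kind](df[col]):
            problems[col] = f"expected {kind}, found {df[col].dtype}"
    return problems
```

An empty dictionary means the checked columns have the expected kinds.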

---

## 5. Missing Values & Data Quality

This section checks for missing values and basic data quality issues.

The goal is to see:
- which columns have nulls  
- how many are missing  
- whether any gaps could affect the app

These checks help highlight areas that may need small fixes  
before exporting the final dataset.

In [9]:
# Count missing values per column
missing_counts = master_df.isnull().sum()

# Show columns with any missing values
missing_counts[missing_counts > 0].sort_values(ascending=False)

primary_genre_emotion            104986
negative_emotion                 104986
Review_emotion                   104986
review_score_emotion             104986
review_votes_emotion             104986
review_clean                     104986
review_length                    104986
anger_emotion                    104986
anticipation_emotion             104986
disgust_emotion                  104986
fear_emotion                     104986
joy_emotion                      104986
sadness_emotion                  104986
surprise_emotion                 104986
trust_emotion                    104986
positive_emotion                 104986
review_words_emotion             104986
genre_count_emotion              104986
affect_terms_emotion             104986
affect_coverage_pct_emotion      104986
anger_per_100w_emotion           104986
anticipation_per_100w_emotion    104986
disgust_per_100w_emotion         104986
fear_per_100w_emotion            104986
joy_per_100w_emotion             104986


In [10]:
# Percent of missing values per column
missing_percent = master_df.isnull().mean() * 100

missing_percent[missing_percent > 0].sort_values(ascending=False)

primary_genre_emotion            76.346236
negative_emotion                 76.346236
Review_emotion                   76.346236
review_score_emotion             76.346236
review_votes_emotion             76.346236
review_clean                     76.346236
review_length                    76.346236
anger_emotion                    76.346236
anticipation_emotion             76.346236
disgust_emotion                  76.346236
fear_emotion                     76.346236
joy_emotion                      76.346236
sadness_emotion                  76.346236
surprise_emotion                 76.346236
trust_emotion                    76.346236
positive_emotion                 76.346236
review_words_emotion             76.346236
genre_count_emotion              76.346236
affect_terms_emotion             76.346236
affect_coverage_pct_emotion      76.346236
anger_per_100w_emotion           76.346236
anticipation_per_100w_emotion    76.346236
disgust_per_100w_emotion         76.346236
fear_per_10

### 🔍 Results: Missing Values & Data Quality

The dataset shows two clear groups of missing values.

#### 1. Emotion-related review fields  
Many columns linked to individual review text are missing in 104,986 of 137,513 rows (about 76 percent).  
This is expected.  
Not every game has review-level emotion data, and these fields are only used when available.

These columns will not affect the final app because the app relies on  
the aggregated per-100w emotion features, which are complete.

#### 2. Metadata fields  
Columns like Tags, Categories, Publishers, Developers,  
and About the game have small amounts of missing values.  
These gaps are common in Steam metadata and are not harmful.

#### Overall  
Nothing here blocks the app.  
The high-missing review fields are normal, and the metadata gaps are small.  
The dataset is stable enough to continue to the next checks.
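
The two groups described above can be separated automatically. This sketch puts the count and percentage of nulls in one table and labels each column as high or low. The 50 percent cutoff and the name `missing_summary` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df, high_threshold=50.0):
    """Summarize nulls per column: count, percent, and a high/low label."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    summary = summary[summary["missing_count"] > 0].copy()
    # Label columns at or above the (assumed) threshold as "high"
    summary["level"] = np.where(
        summary["missing_pct"] >= high_threshold, "high", "low"
    )
    return summary.sort_values("missing_count", ascending=False)
```

On the master dataset, the review-level emotion columns would land in the "high" group, and small metadata gaps in the "low" group.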

---

## 6. Validate Key Fields

This section checks the most important fields in the dataset.

The focus is on three areas:
- Emotion columns  
- Relief tags  
- Genres  

The goal is to confirm that these fields are complete, consistent,  
and ready for the app to use.

In [11]:
# -----------------------------
# Emotion Columns
# -----------------------------

emotion_cols = [
    'primary_emotion',
    'anger_per_100w', 'anticipation_per_100w', 'disgust_per_100w',
    'fear_per_100w', 'joy_per_100w', 'sadness_per_100w',
    'surprise_per_100w', 'trust_per_100w',
    'positive_per_100w', 'negative_per_100w'
]

# Value counts for primary emotion
emotion_primary_check = master_df['primary_emotion'].value_counts(dropna=False)
emotion_primary_check

# -----------------------------
# Relief Tags
# -----------------------------

relief_check = master_df['hybrid_relief_tag'].value_counts(dropna=False)
relief_check

# -----------------------------
# Genres
# -----------------------------

genre_primary_check = master_df['primary_genre'].value_counts(dropna=False)
genre_primary_check

primary_genre
Action                   73092
Adventure                22830
Casual                   21496
Indie                     9986
Simulation                2301
RPG                       1871
Strategy                  1500
Free to Play              1209
Racing                     577
Violent                    455
Animation & Modeling       420
Utilities                  356
Sports                     264
Design & Illustration      207
Education                  188
Audio Production           151
Massively Multiplayer      147
Sexual Content             108
Nudity                      61
Free To Play                56
Video Production            53
Software Training           44
Gore                        38
Early Access                36
Photo Editing               29
Accounting                  19
Game Development            13
Web Publishing               6
Name: count, dtype: int64

### 🔍 Results: Validate Key Fields

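The primary genre field is complete: the counts above sum to 137,513, so no row is missing a primary genre.

Only the genre counts are displayed, because a notebook cell shows just its last expression.  
The `primary_emotion` and `hybrid_relief_tag` counts are computed but not shown here.

A few things stand out:

- Action, Adventure, and Casual make up about 85 percent of the dataset.  
- Indie follows with 9,986 games, then Simulation, RPG, and Strategy with roughly 1,500 to 2,300 each.  
- Smaller categories like Racing and Utilities are still present.  
- Niche genres appear in small amounts, which is normal for Steam data.  
- `Free to Play` (1,209) and `Free To Play` (56) differ only in capitalization and should be treated as one genre.

Overall, the genre coverage is wide enough to support  
all emotion and relief tag recommendations in the app.

Apart from the capitalization duplicate, nothing here requires changes.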
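
The genre counts above list both `Free to Play` and `Free To Play`. A minimal sketch for finding and merging labels that differ only by letter case is shown below. Both helper names are hypothetical, and keeping the most frequent spelling is an assumed rule, not something the notebook specifies.

```python
import pandas as pd


def find_case_variants(series):
    """Group labels that are identical except for letter case or outer spaces."""
    counts = series.dropna().astype(str).value_counts()
    groups = {}
    for label, n in counts.items():
        key = label.strip().casefold()
        groups.setdefault(key, {})[label] = int(n)
    # Only groups with more than one spelling are a problem
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


def merge_case_variants(series):
    """Replace each variant with the most frequent spelling in its group."""
    mapping = {}
    for variants in find_case_variants(series).values():
        canonical = max(variants, key=variants.get)
        for label in variants:
            mapping[label] = canonical
    # Labels without variants are left unchanged
    return series.map(lambda value: mapping.get(value, value))


# Small stand-in series; in the notebook this would be master_df['primary_genre']
genres = pd.Series(["Free to Play"] * 3 + ["Free To Play"] + ["Action"] * 2)
print(find_case_variants(genres))
```

Running `find_case_variants` first keeps the merge reviewable, since it shows exactly which labels would be combined.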

---

## 7. Final Feature Adjustments

This section creates a few small fields that help the dataset work smoothly in the app.

The focus is on simple additions:
- cleaner display fields  
- consistent formats  
- small helper columns for the matching logic  

Nothing heavy here.  
Just light adjustments to make the final export easier to use.

In [12]:
# -----------------------------
# Final Feature Adjustments
# -----------------------------

# Clean display name (fallback if Name is missing)
master_df['game_display_name'] = master_df['Name'].fillna(master_df['Name_review'])

# Convert AppID to string for consistent app use
master_df['AppID_str'] = master_df['AppID'].astype(str)

# Create a short text preview for the game description
master_df['description_preview'] = master_df['About the game'].astype(str).str.slice(0, 220)

# Simple emotion helper column
# Shows primary emotion + relief tag together (helps with filtering in the app)
master_df['emotion_relief_combo'] = (
    master_df['primary_emotion'].astype(str) + "_" + master_df['hybrid_relief_tag'].astype(str)
)

# Flag missing key metadata (optional QA helper)
master_df['missing_metadata_flag'] = (
    master_df[['About the game', 'Genres', 'Tags']].isnull().any(axis=1)
)

# Quick check
master_df[['game_display_name', 'AppID_str', 'description_preview', 'emotion_relief_combo']].head()

Unnamed: 0,game_display_name,AppID_str,description_preview,emotion_relief_combo
0,Galactic Bowling,20200,Galactic Bowling is an exaggerated and stylize...,nan_Comfort
1,Train Bandit,655370,THE LAW!! Looks to be a showdown atop a train....,nan_Catharsis
2,Jolt Project,1732930,Jolt Project: The army now has a new robotics ...,nan_Catharsis
3,Henosis™,1355720,HENOSIS™ is a mysterious 2D Platform Puzzler w...,nan_Validation
4,Two Weeks in Painland,1139950,ABOUT THE GAME Play as a hacker who has arrang...,nan_Validation


### 🔍 Results: Final Feature Adjustments

The new helper columns are working as expected.

- **game_display_name** shows a clean title for each game.  
- **AppID_str** is now in a consistent format for the app.  
- **description_preview** gives a short text snippet for quick display.  
- **emotion_relief_combo** combines emotion and relief tags for easier filtering.

Some rows show “nan” for the emotion part of the combo, including all five rows in the preview above.  
This happens because `astype(str)` turns a missing `primary_emotion` into the literal text “nan”.  
These cases can be handled in the app or filtered out later.

Overall, the adjustments look good and support the final export.
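
If the literal "nan" text is not wanted, the helper columns can be built so that missing inputs stay missing. This is a sketch under that assumption, not the notebook's method. `build_combo` and `build_preview` are hypothetical names.

```python
import pandas as pd


def build_combo(df, emotion_col="primary_emotion", relief_col="hybrid_relief_tag"):
    """Join emotion and relief tag with "_"; the result is null if either part is null."""
    both_present = df[emotion_col].notna() & df[relief_col].notna()
    combo = df[emotion_col].astype(str) + "_" + df[relief_col].astype(str)
    # Blank out rows where one side was missing instead of keeping "nan_..."
    return combo.where(both_present)


def build_preview(df, text_col="About the game", max_chars=220):
    """Return the first max_chars characters, with an empty string for missing text."""
    return df[text_col].fillna("").astype(str).str.slice(0, max_chars)
```

The app can then test for a null combo directly instead of searching for strings that start with "nan_".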

---

## 8. Data Dictionary

This section builds a clear reference table for the final dataset.

The goal is to create:
- a list of all columns  
- short descriptions  
- data types  

This makes the dataset easier to understand  
and helps the app development stay organized.

In [13]:
# ------------------------------------
# Build Data Dictionary
# ------------------------------------

# Create a dataframe with column names and dtypes
data_dict = pd.DataFrame({
    "column_name": master_df.columns,
    "dtype": master_df.dtypes.astype(str)
})

# Add an empty description column for manual notes if needed
data_dict["description"] = ""

# Preview the first rows
data_dict.head()

Unnamed: 0,column_name,dtype,description
AppID,AppID,int64,
Name,Name,object,
Release date,Release date,object,
About the game,About the game,object,
Languages,Languages,object,


### 🔍 Results: Data Dictionary

The data dictionary was created successfully.

Each column now appears with:
- its name  
- its data type  
- an empty space for a description  

This table makes it easier to review the structure of the dataset  
and add notes as needed for the final app.

The sample shown confirms the format looks correct.
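
Descriptions can be filled in from a lookup instead of by hand. The sketch below builds the dictionary with a missing-value percentage and a description per column. The two example descriptions are placeholders, and `build_data_dictionary` is a hypothetical helper.

```python
import pandas as pd

# Placeholder descriptions; replace with the project's own definitions
DESCRIPTIONS = {
    "AppID": "Steam application identifier",
    "Name": "Game title",
}


def build_data_dictionary(df, descriptions=DESCRIPTIONS):
    """Return one row per column with dtype, null percentage, and description."""
    return pd.DataFrame({
        "column_name": df.columns,
        "dtype": df.dtypes.astype(str).to_numpy(),
        "missing_pct": (df.isnull().mean() * 100).round(2).to_numpy(),
        # Columns without an entry get an empty description
        "description": [descriptions.get(col, "") for col in df.columns],
    })
```

Columns that still have an empty description are easy to list afterwards with a filter on the `description` field.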

---

## 9. Prepare App-Ready Exports

This section creates the final files that the app will use.

The focus is on:
- exporting the cleaned master dataset  
- saving helper slices if needed  
- keeping the file sizes manageable  

These exports will be placed in a stable location  
so the app can load them without issues.

In [18]:
# ------------------------------------
# Prepare App-Ready Exports (Correct Location)
# ------------------------------------

# Correct export path inside 02 Data/cleaned
export_path = Path("D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/")
export_path.mkdir(parents=True, exist_ok=True)  # also create parent folders if needed

# Save the full master dataset (CSV + Parquet)
# File names use the "11_" prefix so they match the listing shown in the output
master_df.to_csv(export_path / "11_master_dataset_final.csv", index=False)
master_df.to_parquet(export_path / "11_master_dataset_final.parquet", index=False)

# Relief tag slices
master_df[master_df['hybrid_relief_tag'] == "Comfort"].to_csv(export_path / "comfort_games.csv", index=False)
master_df[master_df['hybrid_relief_tag'] == "Catharsis"].to_csv(export_path / "catharsis_games.csv", index=False)
master_df[master_df['hybrid_relief_tag'] == "Distraction"].to_csv(export_path / "distraction_games.csv", index=False)
master_df[master_df['hybrid_relief_tag'] == "Validation"].to_csv(export_path / "validation_games.csv", index=False)

# Save data dictionary
data_dict.to_csv(export_path / "data_dictionary.csv", index=False)

# Confirm
export_path, list(export_path.glob("*"))

(WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data'),
 [WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/11_master_dataset_final.csv'),
  WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/11_master_dataset_final.parquet'),
  WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/catharsis_games.csv'),
  WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/comfort_games.csv'),
  WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/data_dictionary.csv'),
  WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned
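
The four relief tag slices follow the same pattern, so they can be written in a loop. This is a sketch that assumes the same tag names and file naming as the cell above. `export_relief_slices` is a hypothetical helper.

```python
from pathlib import Path

import pandas as pd

# Tag names taken from the export cell above
RELIEF_TAGS = ["Comfort", "Catharsis", "Distraction", "Validation"]


def export_relief_slices(df, out_dir, tags=RELIEF_TAGS, tag_col="hybrid_relief_tag"):
    """Write one CSV per relief tag and return {tag: (path, row_count)}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for tag in tags:
        subset = df[df[tag_col] == tag]
        path = out_dir / f"{tag.lower()}_games.csv"
        subset.to_csv(path, index=False)
        written[tag] = (path, len(subset))
    return written
```

The returned row counts make it easy to spot a slice that came out empty, for example because of a misspelled tag.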

---

## 10. Save Final Files

This step confirms that all final files are saved in the correct location.

The goal is to:
- check the export folder  
- make sure all expected files are present  
- keep the structure clean and easy to use for the app

A quick verification helps ensure everything is ready for deployment.

In [19]:
# ------------------------------------
# Save Final Files: Confirmation Check
# ------------------------------------

# Confirm the export directory exists
export_path = Path("D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/")

export_path, export_path.exists()

# List all files saved for the app
list(export_path.glob("*"))

[WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/11_master_dataset_final.csv'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/11_master_dataset_final.parquet'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/catharsis_games.csv'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/comfort_games.csv'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/data_dictionary.csv'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned/app_data/distraction_games.csv'),
 WindowsPath('D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose
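
Reading a folder listing by eye is easy to get wrong, so the check can compare against a list of expected names. In this sketch the names come from the listing above, except `validation_games.csv`, which is assumed from the export code because the listing is cut off. `verify_exports` is a hypothetical helper.

```python
from pathlib import Path

# Names from the listing above; validation_games.csv is an assumption
EXPECTED_FILES = [
    "11_master_dataset_final.csv",
    "11_master_dataset_final.parquet",
    "catharsis_games.csv",
    "comfort_games.csv",
    "data_dictionary.csv",
    "distraction_games.csv",
    "validation_games.csv",
]


def verify_exports(folder, expected=EXPECTED_FILES):
    """Return (missing, empty): expected files that are absent or zero bytes."""
    folder = Path(folder)
    missing = [name for name in expected if not (folder / name).is_file()]
    empty = [
        name for name in expected
        if (folder / name).is_file() and (folder / name).stat().st_size == 0
    ]
    return missing, empty


# Hypothetical usage:
# missing, empty = verify_exports(export_path)
```

Two empty lists mean every expected file exists and has content.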

---

## 11. Notes & Next Steps

The final dataset is now complete and saved in the app_data folder.

A few quick notes:
- the core features are stable  
- all key fields passed validation  
- helper columns are ready for the app  
- relief tag slices were exported for faster loading  

### ➡️ Next Step  
- Move into **`12_app_data_preparation.ipynb`**
- connect the dataset to the Streamlit app  
- build the filtering and matching functions  
- test game recommendations across different emotions  
- adjust the display with short text previews and clean names  

The data is ready for the next phase of the project.