# 🎮 GameRx | Steam Game Metadata Cleaning  

This notebook starts the **GameRx metadata workflow**.  
I’m cleaning the raw `games.csv` file so it can link smoothly with review data and later power the app.  

#### What I’ll do  
- Fix column alignment (`AppID`, `Name`, `Release date`, `Estimated owners`, `Price`)  
- Drop unneeded columns  
- Combine `Supported languages` + `Full audio languages` → `languages`  
- Move language info into `About the game`  
- Replace broken `Reviews` with `"N/A"` (real ones come later)  
- Save a clean, trimmed file ready for next steps  

#### Goal  
End up with a **clean metadata file** that’s ready for genre tagging in the next notebook (`02_metadata_genre_cleaning.ipynb`).  

---

### Table of Contents

1. [Import Libraries](#1-import-libraries)  
2. [Load Dataset](#2-load-dataset)  
3. [Inspect Columns](#3-inspect-columns)  
4. [Build Languages Column](#4-build-languages-column)  
5. [Drop Extra Columns](#5-drop-extra-columns)  
6. [Light Clean & Type Fixes](#6-light-clean--type-fixes)  
7. [Re-check for Nulls](#7-re-check-for-nulls)  
8. [Filter to High-Quality Games](#8-filter-to-high-quality-games)  
9. [Save Final Cleaned Metadata](#9-save-final-cleaned-metadata)  
10. [Insights & Next Steps](#10-insights--next-steps)

---

## 1. Import Libraries

Starting simple.  
These are the basics I need for working with the Steam game metadata.

- `pandas` for data work  
- `pathlib` for file paths  

More tools will come in later steps when the cleaning gets deeper.

In [1]:
from pathlib import Path
import pandas as pd

# Display settings for cleaner previews
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', None)

---

## 2. Load Dataset

Time to bring in the Steam games metadata.

- This is `01_steam_games_clean.csv`, the version saved after the manual Excel pass  
- It still has loose ends: list-like text, blank strings, missing values  
- I’ll clean all of that in the next steps

Just loading it here so the cleanup can begin.

In [2]:
pd.reset_option("display.max_columns")
pd.reset_option("display.max_rows")
pd.reset_option("display.width")
pd.reset_option("display.colheader_justify")
pd.reset_option("display.precision")
pd.reset_option("display.max_colwidth")

In [4]:
import pandas as pd
from pathlib import Path

# Define cleaned data folder
CLEANED = Path(r"D:/YVC/YVC Portfolio Implementation/Data Analytics Projects/GameRx Your Digital Dose/02 Data/cleaned")

# Load Cleaned Dataset
CLEAN_FILE = CLEANED / "01_steam_games_clean.csv"

df = pd.read_csv(CLEAN_FILE, low_memory=False)
print("✅ Loaded:", CLEAN_FILE.name)
print("Shape:", df.shape)
# Show first 5 rows
df.head(5)

✅ Loaded: 01_steam_games_clean.csv
Shape: (111452, 19)


Unnamed: 0,AppID,Name,Release date,About the game,Languages,Metacritic score,User score,Positive,Negative,Recommendations,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags
0,20200,Galactic Bowling,10/21/2008,Galactic Bowling is an exaggerated and stylize...,['English'],0,0,6,11,30,0,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling"
1,655370,Train Bandit,10/12/2017,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",0,0,53,5,12,0,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc..."
2,1732930,Jolt Project,11/17/2021,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",0,0,0,0,0,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",
3,1355720,Henosis™,7/23/2020,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",0,0,3,0,0,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz..."
4,1139950,Two Weeks in Painland,2/3/2020,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",0,0,50,8,17,0,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,..."


---

## 3. Inspect Columns

Now that the dataset is loaded, I’m checking the column list.

- What needs cleaning  
- What should be renamed  
- What can be dropped  

This quick scan helps map out the next cleanup steps before saving the final file.

In [5]:
# Preview the full list of columns
print("Columns:", df.columns.tolist())

# Check a few basic stats
df.info()

# sample a few rows to look at weird cases
df.sample(3, random_state=1)

Columns: ['AppID', 'Name', 'Release date', 'About the game', 'Languages', 'Metacritic score', 'User score', 'Positive', 'Negative', 'Recommendations', 'Average playtime forever', 'Average playtime two weeks', 'Median playtime forever', 'Median playtime two weeks', 'Developers', 'Publishers', 'Categories', 'Genres', 'Tags']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 19 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   AppID                       111452 non-null  int64 
 1   Name                        111446 non-null  object
 2   Release date                111452 non-null  object
 3   About the game              104957 non-null  object
 4   Languages                   111452 non-null  object
 5   Metacritic score            111452 non-null  int64 
 6   User score                  111452 non-null  int64 
 7   Positive                    111452 non-null  i

Unnamed: 0,AppID,Name,Release date,About the game,Languages,Metacritic score,User score,Positive,Negative,Recommendations,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags
44693,1289810,Siralim Ultimate,12/3/2021,"Siralim Ultimate is a monster-catching, dungeo...",['English'],0,0,1216,63,314,737,503,854,503,Thylacine Studios,Thylacine Studios LLC,"Single-player,Full controller support","Indie,RPG","RPG,Indie,Creature Collector,Pixel Graphics,Tu..."
23449,335220,But to Paint a Universe,1/15/2015,Follow the story of a little girl as she strug...,['English'],0,0,19,14,3,306,0,306,0,Mårten Jonsson,JMJ Interactive,"Single-player,Steam Trading Cards","Casual,Indie","Indie,Casual,Puzzle,Music,2D,Atmospheric"
75809,2476620,一克大冒险,7/10/2023,一克大冒险是一克游戏工作室（一克游戏品牌）的开山之作，也是一克帝国迷你版的前身，该游戏是一款...,['Simplified Chinese'],0,0,0,0,0,0,0,0,0,"一克游戏,一克传媒,投资人王思北","成都一克弓传媒有限公司,岫岩满族自治县一克城贸易商行",Single-player,"Adventure,Casual,Indie,RPG",


### 🔍 Results: Column Inspection

The dataset loaded with **19 columns** and **111k+ rows**.  
Everything came in clean with no major issues.

#### What I noticed
- Key fields look good  
  (`AppID`, `Name`, `Release date`, `About the game`)
- Review and playtime fields are clean integers  
  (`Positive`, `Negative`, `User score`)
- `Languages` uses a list format and is ready for parsing
- Some games are fully in other languages  
  (may need filtering later)
- Metadata fields are present and usable  
  (`Developers`, `Publishers`, `Categories`, `Genres`, `Tags`)
- `Tags` has some missing or long entries  
  (will clean or shorten)

#### Summary
The structure is solid and ready for cleanup.  
Next step: fix data types, drop extra fields, and build a clean subset for the recommender.

---

## 4. Build Languages Column

I’ve already merged the supported and audio languages into one column: `Languages`.

Now I’m doing a quick check to make sure the values look consistent.

- Format looks list-like  
- Easy to use later for filtering or tagging  
- No issues so far  

No cleanup needed unless I spot anything unusual.

In [6]:
# Build Languages Column (Validation Only)

# Check a few samples from the 'Languages' column
df['Languages'].sample(5, random_state=42)

# Check if any values are strings instead of list-like
df['Languages'].apply(lambda x: isinstance(x, list)).value_counts()

Languages
False    111452
Name: count, dtype: int64

In [7]:
# Convert list-like strings to actual Python lists
import ast

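def safe_list_parse(val):
    # Parse strings that look like Python lists; leave everything else unchanged
    try:
        return ast.literal_eval(val) if isinstance(val, str) and val.startswith('[') else val
    except (ValueError, SyntaxError):
        # Malformed list text stays as-is for the follow-up fix in the next cell
        return val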

df['Languages'] = df['Languages'].apply(safe_list_parse)

# Recheck formatting
df['Languages'].apply(lambda x: isinstance(x, list)).value_counts()

Languages
True     110469
False       983
Name: count, dtype: int64

In [8]:
# Clean remaining rows in 'Languages'

def fix_remaining_lang(val):
    if isinstance(val, list):
        return val
    elif isinstance(val, str):
        val = val.strip().replace('"', '').replace("'", '')
        return [val] if val else []
    else:
        return []

df['Languages'] = df['Languages'].apply(fix_remaining_lang)

# Final validation
df['Languages'].apply(lambda x: isinstance(x, list)).value_counts()

Languages
True    111452
Name: count, dtype: int64

#### 🔍 Results: Build Languages Column

The `Languages` column looked like a list, but it was actually plain text.  
I converted everything into real Python lists so the values work correctly later.

#### What I did
- Checked types → all rows were strings  
- Used `ast.literal_eval()` to convert list-like text  
- Fixed 983 leftover rows  
  (single words like `"English"` or empty values)
- Wrapped plain strings in `[ ]`  
- Replaced broken entries with empty lists

#### Final Check
All rows are now real lists:
- `['English']`
- `['Spanish', 'English']`
- `[]`

This column is now ready for English-only filters and language analysis.
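
As a rough sketch of what those filters could look like once the column holds real lists, here is a small example on a toy frame. The `demo` frame and the variable names are made up for illustration; only the `Languages` column name comes from this notebook.

```python
import pandas as pd

# Toy frame standing in for the cleaned metadata (values are illustrative)
demo = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Languages": [["English"], ["Spanish - Spain", "English"], []],
})

# True where 'English' appears anywhere in the list
has_english = demo["Languages"].apply(lambda langs: "English" in langs)

# True where 'English' is the only listed language
english_only = demo["Languages"].apply(lambda langs: langs == ["English"])

# One row per (game, language) pair, then count how often each language appears.
# Empty lists become NaN after explode() and are skipped by value_counts().
language_counts = demo.explode("Languages")["Languages"].value_counts()

print(demo[has_english]["Name"].tolist())
print(language_counts)
```

Whether "English-only" should mean "includes English" or "lists nothing but English" is a choice for the later filtering step, so both masks are shown.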

---

## 5. Drop Extra Columns

This step covers the columns that don’t help with emotion analysis, genre mapping, or app filtering.
They were already dropped during the manual Excel cleanup before the file was loaded into Python, so no code runs here.

#### What I dropped
- `Peak CCU`
- `Required age`
- `Discount`
- `Website`
- Other metadata with no emotional or gameplay value

This keeps the dataset lighter and easier to merge.

#### What I kept
- **Game identity:** `AppID`, `Name`, `Developers`, `Publishers`
- **Descriptions:** `About the game`, `Genres`, `Tags`
- **Gameplay stats:** review counts, playtime values
- **Language info:** `Languages`

#### ⚠️ Note on `Reviews`
The `Reviews` column wasn’t real text.  
It was just a placeholder in the original file.

Since the true reviews come from the separate review dataset,  
I replaced the placeholder with `"N/A"` to avoid confusion.

Everything left now supports merging, analysis, or final app features.
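
The drop itself happened in Excel, so there is no cell for it here. For anyone who would rather repeat it in pandas, a minimal sketch could look like the one below. The `raw_demo` frame and its values are invented stand-ins; the column names are the ones listed above.

```python
import pandas as pd

# Toy stand-in for the raw file (one made-up row, not real data)
raw_demo = pd.DataFrame({
    "AppID": [1],
    "Name": ["Example Game"],
    "Peak CCU": [0],
    "Required age": [0],
    "Website": ["https://example.com"],
    "Reviews": [None],
})

drop_cols = ["Peak CCU", "Required age", "Discount", "Website"]

# errors="ignore" skips any name that is not present (here: 'Discount')
trimmed = raw_demo.drop(columns=drop_cols, errors="ignore")

# Overwrite the placeholder review text so it is not mistaken for real reviews
trimmed["Reviews"] = "N/A"

print(trimmed.columns.tolist())
```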

---

## 6. Light Clean & Type Fixes

Quick cleanup to get the data ready for filtering and modeling.

#### What I did
- Trimmed extra spaces from text columns  
  *(left `Languages` alone to protect the list format)*
- Turned empty strings into `NaN`  
  *(makes missing data easier to catch)*
- Converted review and playtime fields to numeric values  
  (`Positive`, `User score`, `Average playtime forever`, etc.)

Everything is now in a cleaner, consistent format for the next steps.

In [9]:
# Strip whitespace from all string columns EXCEPT 'Languages'
excluded_cols = ['Languages']
str_cols = [col for col in df.select_dtypes(include='object').columns if col not in excluded_cols]
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())

# Replace empty strings with NaN in safe string columns only
df[str_cols] = df[str_cols].replace(r'^\s*$', pd.NA, regex=True)

# Convert review stats and playtime fields to numeric (if they exist)
numeric_cols = [
    'Metacritic score', 'User score', 'Positive', 'Negative', 'Recommendations',
    'Average playtime forever', 'Average playtime two weeks',
    'Median playtime forever', 'Median playtime two weeks'
]
existing_numeric = [col for col in numeric_cols if col in df.columns]
df[existing_numeric] = df[existing_numeric].apply(pd.to_numeric, errors='coerce')

# Final check
df[existing_numeric].dtypes

Metacritic score              int64
User score                    int64
Positive                      int64
Negative                      int64
Recommendations               int64
Average playtime forever      int64
Average playtime two weeks    int64
Median playtime forever       int64
Median playtime two weeks     int64
dtype: object

In [10]:
# Show 5 random rows with AppID, Name, and Languages only
df[['AppID', 'Name', 'Languages']].sample(5, random_state=42)

Unnamed: 0,AppID,Name,Languages
99031,3222680,Reconquista,"[English, Korean, Simplified Chinese, Japanese]"
24971,1042960,We Are The Caretakers,"[English, French, German, Spanish - Spain, Por..."
23944,1106940,How to Raise a Wolf Girl,"[English, Japanese, Traditional Chinese]"
100098,3149840,Скуф на рыбалке,[Russian]
107578,3457210,Tumultus Playtest,[]


### 🔍 Results: Light Clean & Type Fixes

Quick cleanup finished and everything looks solid.

#### What I checked
- Extra spaces were removed from text columns  
  *(kept `Languages` untouched)*
- Empty strings → replaced with `NaN`
- Numeric fields were converted cleanly  
  (`User score`, `Recommendations`, playtime stats)

#### Language Column Check
All list values stayed intact:

- `['English', 'Korean', 'Simplified Chinese']`
- `['Russian']`
- `[]` for games with no language info

No corrupted entries.  
Column types are clean and consistent.

---

## 7. Re-check for Nulls

Quick pass to see which columns still have missing values.

#### What I found
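- `Metacritic score` has no nulls  
  (the previewed rows show `0` rather than a blank score)
- Core fields look good  
  `AppID`, `Languages`, and review stats are complete; `Name` is missing in only 6 rows
- Descriptive fields have gaps  
  (`About the game`, `Developers`, `Publishers`, `Categories`, `Genres`, `Tags`)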

#### Next Step
Filter out rows with nulls in **critical fields** before the next cleaning phase.

In [12]:
# Count total nulls per column
df.isnull().sum()

AppID                             0
Name                              6
Release date                      0
About the game                 6495
Languages                         0
Metacritic score                  0
User score                        0
Positive                          0
Negative                          0
Recommendations                   0
Average playtime forever          0
Average playtime two weeks        0
Median playtime forever           0
Median playtime two weeks         0
Developers                     6475
Publishers                     6778
Categories                     7566
Genres                         6440
Tags                          37423
dtype: int64

### 🔍 Results: Dataset Clean Review

After the cleanup, I checked for any remaining nulls.

#### What I found
- No nulls in core fields  
  (`AppID`, `Languages`, numeric review stats, playtime columns)
- Only 6 games were missing `Name`  
  → these will be dropped next
- Optional fields had expected gaps:  
  - `About the game`: ~6,500 nulls  
  - `Genres`: ~6,400 nulls  
  - `Tags`: ~37,000 nulls (user-generated and inconsistent)

#### What’s next
Drop rows missing **critical values** like `Name` and `Genres`.  
Keep the rest so filtering stays flexible later.
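
Raw null counts are easier to compare when they sit next to a percentage. The sketch below shows one way to build that view; `demo` and `null_report` are illustrative names and the values are made up.

```python
import pandas as pd

# Toy frame with a few gaps (values are illustrative)
demo = pd.DataFrame({
    "AppID": [1, 2, 3, 4],
    "Name": ["A", None, "C", "D"],
    "Genres": ["Indie", "RPG", None, None],
})

# Count nulls and express them as a share of all rows
null_report = pd.DataFrame({
    "nulls": demo.isna().sum(),
    "pct": (demo.isna().mean() * 100).round(1),
}).sort_values("nulls", ascending=False)

print(null_report)
```

Sorting by `nulls` puts the weakest columns on top, which makes it quicker to decide which fields count as critical.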

---

## 8. Filter to High-Quality Games

With the columns cleaned and the types fixed, this step narrows the dataset to games that are complete and ready for the app.

#### Criteria
A game must have:

- A valid `Name`  
- At least one entry in `Genres`  

Rows missing either of these were dropped.
Optional fields like `Tags`, `About the game`, and `Publishers` were allowed to stay empty.

#### Final Step
The removed rows can be saved to a separate file for review later; that export is left commented out in the cell below.

In [13]:
# Flag rows missing 'Name' or 'Genres'
missing_mask = df['Name'].isna() | df['Genres'].isna()

# (Optional) Save the removed rows for manual review (must run before the filter below)
# df[missing_mask].to_csv("games_with_missing_fields.csv", index=False)

# Keep only rows that have both fields
df = df[~missing_mask].copy()

# Preview remaining row count
print("Remaining games:", df.shape[0])

Remaining games: 105008


In [14]:
# 🔍 Check number of games remaining
total_games = df.shape[0]
print(f"🎮 Total games remaining: {total_games}")

# Optional: Preview the game name list (first 50)
df[['AppID', 'Name']].head(50)

🎮 Total games remaining: 105008


Unnamed: 0,AppID,Name
0,20200,Galactic Bowling
1,655370,Train Bandit
2,1732930,Jolt Project
3,1355720,Henosis™
4,1139950,Two Weeks in Painland
5,1469160,Wartune Reborn
6,1659180,TD Worlds
7,1968760,Legend of Rome - The Wrath of Mars
8,1178150,MazM: Jekyll and Hyde
9,320150,Deadlings: Rotten Edition


In [15]:
if 50 <= total_games <= 100:
    print("✅ Game list is in range for app.py!")
elif total_games < 50:
    print("⚠️ Not enough games; you may want to relax filters or add more.")
else:
    print("ℹ️ More than 100 games; you can sample a subset if needed.")

ℹ️ More than 100 games; you can sample a subset if needed.


### 🔍 Results: Filter to High-Quality Games

After removing rows missing a `Name` or `Genres`, the dataset is down to:

**🎮 105,008 games**

#### Why this matters
A cleaner set like this makes the next steps easier:

- Genre–emotion modeling  
- Clustering  
- Powering the app recommender

#### Next Step
For early app testing, I’ll sample a smaller group  
(about **50–100 games**) to keep things fast while prototyping.

The preview shows a good mix of genres (indie, simulators, RPGs, puzzles), which helps keep the final recommendations diverse.
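
A reproducible way to pull that prototype subset is sketched below. `SAMPLE_SIZE`, the seed, and the toy `demo` frame are my own placeholders, not values fixed anywhere in this project.

```python
import pandas as pd

# Toy frame standing in for the filtered metadata (values are illustrative)
demo = pd.DataFrame({
    "AppID": range(1, 1001),
    "Name": [f"Game {i}" for i in range(1, 1001)],
})

SAMPLE_SIZE = 100  # assumed upper end of the 50–100 range

# Cap n at the frame length so small inputs do not raise an error;
# a fixed random_state returns the same games on every run
prototype = (
    demo.sample(n=min(SAMPLE_SIZE, len(demo)), random_state=42)
        .reset_index(drop=True)
)

print(prototype.shape)
```

A plain random sample does not guarantee genre balance; if that matters for the app, sampling within each genre group would be the next thing to try.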

---

## 9. Save Final Cleaned Metadata
The dataset is fully cleaned, filtered, and validated.  
This version is ready for the next project steps.

#### What this version includes
- Only complete, high-quality games  
- Valid `Name` and `Genres` columns  
- Helpful context fields like playtime, tags, and languages

#### 📁 Save Details
File saved to:
`02 Data/cleaned/01_games_metadata_cleaned.csv`

#### How this file will be used
- Clustering and genre–emotion analysis  
- Connecting to Steam reviews for the recommender  
- Any future modeling or visual exploration 

In [16]:
# Define export path
from pathlib import Path

OUTPUT_PATH = Path(r"D:\YVC\YVC Portfolio Implementation\Data Analytics Projects\GameRx Your Digital Dose\02 Data\cleaned\01_games_metadata_cleaned.csv")

# Create folder if it doesn't exist
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

# Save the cleaned dataframe to CSV
df.to_csv(OUTPUT_PATH, index=False)

print(f"✅ Final cleaned metadata saved to: {OUTPUT_PATH}")

✅ Final cleaned metadata saved to: D:\YVC\YVC Portfolio Implementation\Data Analytics Projects\GameRx Your Digital Dose\02 Data\cleaned\01_games_metadata_cleaned.csv
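
One thing to keep in mind when this file is loaded again: CSV has no list type, so the `Languages` lists are written out as text and come back as strings, the same situation this notebook started with. The sketch below shows the round trip on a toy frame and one way to restore the lists at load time with `converters`. It writes to an in-memory buffer rather than the real output path.

```python
import ast
import io

import pandas as pd

# Toy frame with a real list column (values are illustrative)
demo = pd.DataFrame({
    "AppID": [1, 2],
    "Languages": [["English", "French"], []],
})

# Write to an in-memory buffer so the sketch does not touch the disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain reload: the lists come back as text such as "['English', 'French']"
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Reload with a converter: each cell is parsed back into a Python list
buffer.seek(0)
as_lists = pd.read_csv(buffer, converters={"Languages": ast.literal_eval})

print(type(as_text["Languages"].iloc[0]))
print(type(as_lists["Languages"].iloc[0]))
```

`ast.literal_eval` raises on cells that are not valid Python literals, so this assumes every saved value is a well-formed list, which holds after the cleanup in section 4.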


---

## 10. Insights & Next Steps

With the Steam metadata cleaned and trimmed, the next stage is working on the genre details.

Open **`02_metadata_genre_cleaning.ipynb`** to:

- Split and clean the `Genres` field  
- Build a clear `genre_list` for each game  
- Create `primary_genre` and `genre_count` features

This prepares the dataset for merging with Steam reviews and for deeper emotion and genre analysis.
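
As a preview, the three genre features could be built roughly like this. It is only a sketch: treating the first listed genre as the primary one is my assumption here, and the next notebook may define it differently. The two `Genres` strings are taken from the preview rows above.

```python
import pandas as pd

# Two rows using the comma-separated format seen in the 'Genres' column
demo = pd.DataFrame({
    "Name": ["Galactic Bowling", "Train Bandit"],
    "Genres": ["Casual,Indie,Sports", "Action,Indie"],
})

# Split on commas, trim stray spaces, and drop empty pieces
demo["genre_list"] = demo["Genres"].str.split(",").apply(
    lambda parts: [p.strip() for p in parts if p.strip()]
)

# Assumption: the first listed genre is treated as the primary one
demo["primary_genre"] = demo["genre_list"].str[0]

# Number of genres attached to each game
demo["genre_count"] = demo["genre_list"].str.len()

print(demo[["Name", "primary_genre", "genre_count"]])
```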