# Converting Genres to Binary

In preparation for clustering, this notebook converts the release date to a numeric feature and produces two datasets: one with genres expanded into binary columns, and one with a single binary indie flag.

In [6]:
import pandas as pd
df = pd.read_csv('../datasets/combined_filtered_genres.csv', index_col='name')
df.head()

| name | r-date | score | user score | critics | users | price | dlc_count | achievements | average_playtime_forever | average_playtime_2weeks | total_steam_reviews | percent_positive_steam | Combined_Genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Half-Life 2 | November 16, 2004 | 96 | 91.0 | 81 | 10773 | 9.99 | 1 | 33 | 606 | 185 | 146055 | 0.974756 | Action,Shooter,Arcade,Sci-Fi,First-Person |
| Half-Life | November 19, 1998 | 96 | 90.0 | 24 | 4321 | 9.99 | 1 | 0 | 459 | 102 | 75965 | 0.965524 | Action,Shooter,Arcade,Sci-Fi,First-Person |
| Portal 2 | April 18, 2011 | 95 | 91.0 | 52 | 7482 | 9.99 | 1 | 51 | 874 | 118 | 301938 | 0.987951 | Action,Shooter,Arcade,Sci-Fi,First-Person,Adve... |
| The Elder Scrolls V: Skyrim | November 10, 2011 | 94 | 83.0 | 32 | 11301 | 19.99 | 1 | 75 | 6458 | 596 | 309973 | 0.94827 | First-Person,RPG,Role-Playing |
| BioShock Infinite | March 25, 2013 | 94 | 86.0 | 68 | 8729 | 29.99 | 5 | 80 | 719 | 251 | 125615 | 0.951447 | Action,Shooter,Arcade,Sci-Fi,First-Person |


## Release Date as Numeric Feature

The release date is converted to the number of days elapsed since the earliest release date in the dataset, so that it can be used as a numerical feature in the clustering models.

In [7]:
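# Parse strings such as "November 16, 2004" into datetimes
df['r-date'] = pd.to_datetime(df['r-date'], format='%B %d, %Y')
# Use the earliest release date as day 0
min_date = df['r-date'].min()
df['release_date_numeric'] = (df['r-date'] - min_date).dt.days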
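
As a quick illustration of what this conversion produces, here is a minimal sketch on a toy frame. The three dates and the `toy` variable are made up for the example and are not taken from the dataset; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy data (hypothetical dates, not from the real dataset)
toy = pd.DataFrame({"r-date": ["January 1, 2000", "January 31, 2000", "January 1, 2001"]})

# Same steps as the cell above: parse, find the earliest date, take the day difference
toy["r-date"] = pd.to_datetime(toy["r-date"], format="%B %d, %Y")
min_date = toy["r-date"].min()
toy["release_date_numeric"] = (toy["r-date"] - min_date).dt.days

print(toy["release_date_numeric"].tolist())  # [0, 30, 366]  (2000 is a leap year)
```

The earliest game always maps to 0, and every other value is a non-negative integer count of days.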

## Indie Flag Dataset

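A binary `indie` column is added, set to 1 when a game's `Combined_Genres` string contains 'Indie' (a case-insensitive substring match) and 0 otherwise. The raw release date and genre string columns are dropped, and this version of the dataset is saved for use in the indie-specific clustering notebook.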

In [8]:
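# Work on a copy so df stays intact for the full genre expansion below
df2 = df.copy()
# 1 if the genre string contains "indie" in any casing, else 0
df2['indie'] = df2['Combined_Genres'].str.contains('Indie', case=False).astype(int)
# Drop the non-numeric columns; release_date_numeric is kept
df2 = df2.drop(columns=['r-date', 'Combined_Genres'])
df2.to_csv('../datasets/all_numerical_only_indie.csv', index=True)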
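
Because `str.contains` matches substrings, it would also flag any genre label that merely contains the word. Whether such labels exist in this dataset is not known; the sketch below uses a made-up label, `Indie-Pop`, and a hypothetical helper `has_genre` to show how an exact-token check would differ.

```python
import pandas as pd

# Toy data; "Indie-Pop" is an invented label used only to show the difference
toy = pd.DataFrame(
    {"Combined_Genres": ["Action,Indie", "Indie-Pop,Music", "RPG,Role-Playing"]},
    index=pd.Index(["Game A", "Game B", "Game C"], name="name"),
)

# Substring match, as in the cell above
substring_flag = toy["Combined_Genres"].str.contains("Indie", case=False).astype(int)

def has_genre(genre_string, target="indie"):
    """Return 1 if target equals one of the comma-separated genres (case-insensitive)."""
    tokens = [g.strip().lower() for g in genre_string.split(",")]
    return int(target in tokens)

# Exact-token match
token_flag = toy["Combined_Genres"].apply(has_genre)

print(substring_flag.tolist())  # [1, 1, 0]
print(token_flag.tolist())      # [1, 0, 0]
```

If every genre label containing "Indie" is meant to count as indie, the substring match is the right choice and this alternative is unnecessary.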


## Full Binary Genre Dataset

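Each distinct genre found in `Combined_Genres` is expanded into its own binary column. The genre string column, the intermediate genre list column, and the raw release date are then dropped, leaving a fully numerical dataset (the numeric release date is retained).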

In [9]:
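# Split the comma-separated genre string into a list per game
df['genres'] = df['Combined_Genres'].str.split(',')

# Collect every distinct genre; set order is not guaranteed,
# so the order of the new columns may differ between runs
all_genres = {genre for sublist in df['genres'] for genre in sublist}

# One binary column per genre: 1 if the game has it, else 0
for genre in all_genres:
    df[genre] = df['genres'].apply(lambda x: 1 if genre in x else 0)

# Drop the non-numeric columns; release_date_numeric is kept
df.drop(columns=['Combined_Genres', 'genres','r-date'], inplace=True)

print(df.head(1))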

             score  user score  critics  users  price  dlc_count  \
name                                                               
Half-Life 2     96        91.0       81  10773   9.99          1   

             achievements  average_playtime_forever  average_playtime_2weeks  \
name                                                                           
Half-Life 2            33                       606                      185   

             total_steam_reviews  ...  Adventure  Casual  Simulation  RPG  \
name                              ...                                       
Half-Life 2               146055  ...          0       0           0    0   

             Role-Playing  Fantasy  Indie  3D  Action Adventure  2D  
name                                                                 
Half-Life 2             0        0      0   0                 0   0  

[1 rows x 31 columns]
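
The same expansion can also be written with pandas' built-in `Series.str.get_dummies`, which splits on a separator and returns one 0/1 column per distinct value. The sketch below is an alternative on toy data, not the method used in this notebook; the `toy`, `genre_flags`, and `result` names are illustrative, and the explicit column sort is there only to make the column order reproducible.

```python
import pandas as pd

# Toy data mirroring the shape of the real frame (values are made up)
toy = pd.DataFrame(
    {
        "score": [96, 94, 90],
        "Combined_Genres": ["Action,Shooter", "RPG,Role-Playing", "Action,RPG"],
    },
    index=pd.Index(["Game A", "Game B", "Game C"], name="name"),
)

# One 0/1 column per distinct genre, with columns sorted for a stable order
genre_flags = toy["Combined_Genres"].str.get_dummies(sep=",").sort_index(axis=1)

# Replace the string column with the binary columns
result = toy.drop(columns="Combined_Genres").join(genre_flags)

print(result)
```

A fixed column order can be convenient when the saved CSV is compared across runs or loaded by position in a later notebook.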


In [10]:
df.to_csv('../datasets/all_numerical.csv', index=True)