# Initial Data Exploration

This notebook carries out exploratory analysis of a dataset of games available on Steam.

We start by loading and combining the data.

In [None]:
import pandas as pd
import numpy as np

all_games = pd.read_csv("../data/external/all_titles_20240617.csv", index_col="appid")
game_dates = pd.read_csv("../data/external/game_dates.csv", index_col="appid").drop_duplicates()
game_tags = pd.read_csv("../data/external/game_tags.csv", index_col="appid").drop_duplicates()

# Drop the leftover "Unnamed: 0" column, a row index written out when the data was pulled
for df in [all_games, game_dates, game_tags]:
    df.drop(columns="Unnamed: 0", inplace=True)

# Combine data into single dataset
games = all_games.merge(game_dates, left_index=True, right_index=True)
games = games.merge(game_tags, left_index=True, right_index=True)
games.head(10)
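
`merge` defaults to an inner join, so any `appid` missing from one of the files is silently dropped, and a repeated `appid` would multiply rows. A minimal sketch of how this could be checked, using small made-up frames in place of the real files (the values, and the choice of `validate` and `indicator`, are illustrative assumptions rather than part of the original notebook):

```python
import pandas as pd

# Toy stand-ins for all_games and game_dates (values are made up)
titles = pd.DataFrame(
    {"name": ["A", "B", "C"]}, index=pd.Index([10, 20, 30], name="appid")
)
dates = pd.DataFrame(
    {"date": ["2020-01-01", "2021-06-15"]}, index=pd.Index([10, 20], name="appid")
)

# validate="one_to_one" raises a MergeError if either side repeats an appid
inner = titles.merge(dates, left_index=True, right_index=True, validate="one_to_one")

# An outer join with indicator=True shows which rows the inner join would drop
audit = titles.merge(
    dates, left_index=True, right_index=True, how="outer", indicator=True
)
dropped = audit[audit["_merge"] != "both"]

print(len(inner), "rows kept;", len(dropped), "rows dropped")
```

If the `dropped` frame is large on the real data, a left join may be a better fit than the default inner join.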

The combined table holds basic metadata (name, publisher, release date, genres and tags), the number of owners expressed as a range, and usage statistics (average and median playtime, plus `ccu`, the peak number of concurrent users yesterday).

Let's look at the data types:

In [None]:
games.info()

Two variables need converting to more useful formats: the release date and the owner range.

In [None]:
# Convert date to datetime data type
games['date'] = pd.to_datetime(games['date'], errors="coerce")

# Add a year variable
games['year'] = games['date'].dt.year
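
With `errors="coerce"`, any value that cannot be parsed becomes `NaT` instead of raising an error, so it is worth counting how many rows are affected. A small sketch on made-up strings (the sample values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Made-up release-date strings, including one unparseable value and one missing value
raw = pd.Series(["2019-03-14", "2021-11-02", "Coming soon", None])

# Unparseable entries become NaT rather than raising
parsed = pd.to_datetime(raw, errors="coerce")
n_unparsed = int(parsed.isna().sum())

# The year is missing wherever the date is missing
years = parsed.dt.year

print(n_unparsed, "of", len(raw), "dates could not be parsed")
```

Because missing years are stored as `NaN`, the derived `year` column may come back as a float rather than an integer.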

In [None]:
# Convert owners to an ordered category data type
from pandas.api.types import CategoricalDtype

# Sort the range labels (e.g. "0 .. 20,000") by their numeric lower bound;
# dropna() keeps a missing value from breaking the string split
labels = games['owners'].dropna().unique()
categories = sorted(labels, key=lambda x: int(x.split(" .. ")[0].replace(",","")))

cat_type = CategoricalDtype(categories=categories, ordered=True)
games['owners'] = games['owners'].astype(cat_type)
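
The reason for sorting on the numeric lower bound is that the range labels are strings, and a plain alphabetical sort would put the buckets in the wrong order. A minimal sketch with made-up labels in the same `"low .. high"` format (the helper name `lower_bound` is hypothetical):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up owner ranges in the same "low .. high" string format
owners = pd.Series(
    ["20,000 .. 50,000", "0 .. 20,000", "1,000,000 .. 2,000,000", "0 .. 20,000"]
)

def lower_bound(label):
    """Numeric lower bound of a range label, e.g. '20,000 .. 50,000' -> 20000."""
    return int(label.split(" .. ")[0].replace(",", ""))

# Alphabetical order would place "1,000,000 .. 2,000,000" before "20,000 .. 50,000"
categories = sorted(owners.unique(), key=lower_bound)
owners = owners.astype(CategoricalDtype(categories=categories, ordered=True))

# Ordered categories support comparisons and min/max by bucket size
at_least_20k = owners >= "20,000 .. 50,000"
largest = owners.max()
```

The same ordering is what makes the buckets appear from smallest to largest on a plot axis.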

In [None]:
import seaborn as sns

# Create histogram
ax = sns.histplot(data=games, y="owners")

# Add labels
for container in ax.containers:
    ax.bar_label(container, fmt="{:,.0f}", padding=5)

# Increase x limit to fit labels
_ = ax.set_xlim(ax.get_xlim()[0], ax.get_xlim()[1]*1.1)

We can see that the vast majority of games have fewer than 20,000 owners, and the number of games falls with each larger bucket. Only a handful of games sit in the largest owner buckets.
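
To put a number on "vast majority", the share of games in each bucket can be computed directly. The sketch below uses made-up counts, so the figures it prints say nothing about the real data; in the notebook the same two lines would be applied to `games['owners']`.

```python
import pandas as pd

# Made-up distribution across three owner buckets
buckets = ["0 .. 20,000", "20,000 .. 50,000", "50,000 .. 100,000"]
owners = pd.Series(
    [buckets[0]] * 7 + [buckets[1]] * 2 + [buckets[2]],
    dtype=pd.CategoricalDtype(buckets, ordered=True),
)

# Share of games per bucket, in bucket order rather than frequency order
share = owners.value_counts(normalize=True).sort_index()

# Running total: share of games at or below each bucket
cumulative = share.cumsum()

print(pd.DataFrame({"share": share, "cumulative": cumulative}))
```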

Turning to the tags, let's look at how often each one appears:

In [None]:
games.filter(like="tag").stack().value_counts()[:40].plot.bar()

A couple of observations:
* The top 4 tags are very frequent, then there is a steep drop to the next four, another steep drop and then a much more gradual progression.
* Most tags describe the genre or gameplay ('action', 'strategy', 'platformer'), but some describe other meta characteristics, graphical style, business model, or just vibes ('VR', 'Pixel Graphics', 'Indie', 'Early access', 'Cute').
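
The frequency count works by collapsing every tag column into one long series and counting values. A small sketch of that chain on a made-up table (the column names `tag1` and `tag2` are assumptions; `filter(like="tag")` selects any column whose name contains "tag"):

```python
import pandas as pd

# Toy table with two hypothetical tag columns; game B has only one tag
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "tag1": ["Indie", "Action", "Indie"],
        "tag2": ["Action", None, "Casual"],
    },
    index=pd.Index([10, 20, 30], name="appid"),
)

# Keep the tag columns, stack them into one long Series, then count each tag
tag_counts = toy.filter(like="tag").stack().value_counts()

print(tag_counts)
```

Missing tags do not appear in the result, since `value_counts` ignores missing values by default.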

Finally, let's look at release dates:

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

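# Drop games without a parsed release date, then group on an integer year
# so the axis reads 2014 rather than 2014.0 if missing dates made the column float
games.dropna(subset=['year'], inplace=True)
games.groupby(games['year'].astype(int))['name'].count().plot(kind='bar', ax=ax)

# Label axes
ax.set_xlabel("Year of release on Steam")
ax.set_ylabel("Number of games released")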


Steam launched in 2003 and opened up to third-party games in 2005. But we can see that the number of games being released on Steam really started taking off in 2014, and has accelerated significantly since.

# Tags over time

In [None]:
# Make a table of tag rank by year
all_tags = pd.concat([games['year'], games.filter(like="tag")], axis=1)
all_tags = all_tags.melt(id_vars='year', ignore_index=False, value_name='tag').drop(columns='variable')

tags_per_year = all_tags.groupby('year').value_counts()
tag_ranks_per_year = tags_per_year.groupby(level='year').rank(ascending=False, method='min')
tag_ranks_per_year.name = 'rank'

tag_ranks_per_year = pd.DataFrame(tag_ranks_per_year).reset_index()
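
The ranking step is easier to follow on a tiny example: count each tag within each year, then rank the counts within the year so that rank 1 is the most common tag. A sketch on made-up data, following the same calls as the cell above:

```python
import pandas as pd

# Made-up long-format table: one row per (game, tag) pair
long_tags = pd.DataFrame(
    {
        "year": [2022, 2022, 2022, 2023, 2023, 2023, 2023],
        "tag": ["Indie", "Indie", "Action", "Action", "Action", "Indie", "Casual"],
    }
)

# Count each tag within each year; the result is indexed by (year, tag)
counts = long_tags.groupby("year").value_counts()

# Rank within each year; method="min" gives tied tags the same (best) rank
ranks = counts.groupby(level="year").rank(ascending=False, method="min")
ranks = ranks.rename("rank").reset_index()

print(ranks)
```

In 2023 the toy data has 'Indie' and 'Casual' tied on one game each, so both receive rank 2 and no tag receives rank 3.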

In [None]:
top_5_tags_2023 = tag_ranks_per_year[(tag_ranks_per_year['year'] == 2023) & (tag_ranks_per_year['rank'] <= 5)]['tag'].values
top_100_tags_2023 = tag_ranks_per_year[(tag_ranks_per_year['year'] == 2023) & (tag_ranks_per_year['rank'] <= 100)]['tag'].values

In [None]:
# Filter a DF that has the rank history for the top 5 tags in 2023
top_5_tags_history = tag_ranks_per_year.query(
    "2008 <= year <= 2023 and tag in @top_5_tags_2023"
)

# Line chart showing evolution over time
g = sns.lineplot(
    data = top_5_tags_history,
    x = 'year',
    y = 'rank',
    hue = 'tag'
)

_ = g.invert_yaxis()

The ranks vary over time, but not dramatically: action, casual and adventure stay within the top 5 throughout the period, while puzzle and simulation drift more but remain in the top 10.
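
Statements like these can be checked against the table rather than read off the chart, by taking the best and worst rank each tag reached. A sketch on made-up ranks (the numbers are illustrative only); on the real data the same aggregation would be applied to `top_5_tags_history`:

```python
import pandas as pd

# Made-up rank history for two tags
history = pd.DataFrame(
    {
        "year": [2021, 2022, 2023, 2021, 2022, 2023],
        "tag": ["Action"] * 3 + ["Puzzle"] * 3,
        "rank": [1.0, 2.0, 1.0, 7.0, 9.0, 5.0],
    }
)

# For ranks, "min" is the best position reached and "max" is the worst
rank_range = history.groupby("tag")["rank"].agg(["min", "max"])

# Tags whose worst rank never fell outside the top 5
stayed_top_5 = rank_range.index[rank_range["max"] <= 5].tolist()

print(rank_range)
```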

To explore more than just the top 5, we'll use Plotly for an interactive plot that lets us pick which tags to draw out from the mass. The design is inspired by the UK Office for National Statistics visualiser for baby name popularity: https://www.ons.gov.uk/visualisations/dvc363/babyindex.html

In [None]:
import plotly.express as px
import plotly.graph_objects as go

top_100_tags_history = tag_ranks_per_year.query(
    "2008 <= year <= 2023 and rank <= 100"
)

# Set up a line chart
# TO DO: change this to use Plotly Graph Objects instead of using Plotly Express 
fig = px.line(
    top_100_tags_history,
    x = 'year',
    y = 'rank',
    color = 'tag',
    title = 'Tag popularity over time'
)
# Set all lines to grey and not shown on legend to begin with
fig.update_traces(line_color='#e0e0e0', showlegend=False) 

# Reverse the y axis (so top-ranked on top)
fig.update_yaxes(autorange='reversed')

# Convert to a figure widget to enable click events
fw = go.FigureWidget(fig)

# Set up dictionary for tracking which lines are active
active_lines = {trace.name: False for trace in fw.data}
active_n = 0

# Set up colour palette
palette = px.colors.qualitative.Plotly.copy()

# Define click function to toggle clicked lines
def toggle_line(trace, points, selector):
    # Give function access to active_n variable outside of function scope
    global active_n
    
    # Skip when point clicked was outside trace
    if not points.point_inds:
        return
    
    # Toggle 
    if trace.showlegend:
        # Deactivate
        palette.insert(0, trace.line.color) # return line colour to the palette
        trace.line.color = '#e0e0e0'
        trace.showlegend = False
        #trace.zorder = 1
        #trace.mode = 'lines'
        active_n -= 1
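    elif palette:
        # Activate (skipped when every palette colour is already in use,
        # which avoids an IndexError from popping an empty list)
        trace.line.color = palette.pop(0)
        trace.showlegend = True
        #trace.zorder = 20
        #trace.mode = 'lines+markers'
        active_n += 1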
        
for trace in fw.data:
    trace.on_click(toggle_line)
fw
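
The click handler above hands out colours from a finite palette, so it needs a rule for what happens once every colour is in use. One way to keep that logic separate from Plotly, and testable without a figure, is a small pool object. `ColourPool` below is a hypothetical helper, not part of Plotly or the original notebook, and the two hex colours are sample values:

```python
class ColourPool:
    """Hand out colours from a fixed palette and take them back on release."""

    def __init__(self, colours, inactive="#e0e0e0"):
        self._free = list(colours)  # colours not currently assigned to a line
        self.inactive = inactive    # colour used for unselected lines

    def acquire(self):
        """Return the next free colour, or None when the palette is exhausted."""
        return self._free.pop(0) if self._free else None

    def release(self, colour):
        """Put a colour back at the front of the pool so it is reused first."""
        if colour != self.inactive and colour not in self._free:
            self._free.insert(0, colour)


pool = ColourPool(["#636EFA", "#EF553B"])
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()  # None: both colours are in use
pool.release(first)
again = pool.acquire()  # the released colour is handed out again
```

Inside a click handler, a `None` result from `acquire` would mean leaving the clicked line grey rather than raising an error.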