---
title: "Analyzing Gaming Addiction Trends and Player Engagement"
#subtitle: "Spring 2025"
author: "Bijay Adhikari"
bibliography: references.bib
nocite: |
  @*
number-sections: false
format:
  html:
    theme: default
    embed-resources: true
    code-fold: true
    code-tools: true
    toc: true
jupyter: python3
---



![Source: iStockPhoto](https://media.istockphoto.com/id/1132282499/photo/woman-playing-video-games.jpg?s=612x612&w=0&k=20&c=ljeQB2UiYW6fYNjAZONaCcdH1r7GPPE7SvXzQ5YJPMk=){fig-alt="A person absorbed in playing a video game."}


Have you ever found yourself so immersed in a video game that hours passed without you even noticing? For me, it was the Grand Theft Auto series. After school, I would sit down to play and quickly become absorbed in its open virtual world, filled with action, rich stories, and endless possibilities. It was my version of freedom. I could cruise through cities, outrun the police, complete missions, and later laugh about it with friends. Video games were fun, simple, and a great way to relax.

Today, the gaming industry looks very different. Video games are more immersive, cinematic, and competitive than ever. Streamers and esports players are treated like celebrities, with huge fan bases following everything they do. Video games are no longer just a form of entertainment. They have become a way to express yourself, get inspired, and even explore future career options.
That rise has a darker side: gaming addiction. As fun and exciting as games can be, excessive play can negatively impact mental health, productivity, and real-world relationships. Studies suggest that around 3-4% of gamers exhibit symptoms of gaming disorder, which can lead to anxiety, social withdrawal, and in some cases depression.

This blog takes a closer look at where we stand today when it comes to gaming addiction. By using real-time data from the Steam library, one of the world’s largest gaming platforms, this project explores how often people play, what types of games they spend the most time on, and what patterns might point to addictive behavior.


## Data Sources

For this project, Steam was selected as the primary source of gaming data due to its large user base, extensive game catalog, and publicly accessible API. Two main datasets were extracted through the Steam Web API, specifically the **ISteamApps/GetAppList** and **IPlayerService/GetOwnedGames** endpoints:

- **`steam_game_data.csv`** : This dataset contains detailed information about 91,690 games, including metadata such as the game title, developer, publisher, genre, release date, supported platforms, multiplayer support, graphics quality, story depth, and metrics like review scores and player engagement statistics.  

- **`user_playtime_data.csv`** : This dataset consists of 66,187 player-specific records capturing total playtime, recent playtime, and game ownership information. Each user (`user_id`) can have multiple games (`game_id`) associated with their account.


::: {.cell .code-fold="true" .code-summary="View user data collection script"}

In [None]:
import requests
import pandas as pd
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Generate random Steam User IDs (these are not guaranteed to be valid)
def generate_random_steam_ids(n=50000):
    # SteamID64 range (valid SteamID64 values generally start from 76561197960265728)
    base_id = 76561197960265728
    return [str(base_id + random.randint(0, 1000000000)) for _ in range(n)]

# columns for the output CSV
columns = ['user_id', 'game_count', 'appid', 'name', 'playtime_forever', 'playtime_2weeks']

# Generate 50,000 random Steam User IDs
user_ids = generate_random_steam_ids()


session = requests.Session()
retry = Retry(connect=5, backoff_factor=1, status_forcelist=[502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
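session.mount('http://', adapter)
session.mount('https://', adapter)  # the API URL below uses HTTPS, so retries must be mounted here too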


API_KEY = 'API_KEY' # removing this for privacy

# Store user game data
all_user_data = []

# Fetch data for each random user
for idx, user_id in enumerate(user_ids):
    try:
        url = f'https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/?key={API_KEY}&steamid={user_id}&include_appinfo=true&include_played_free_games=true'
        response = session.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
        
        if 'response' in data and 'games' in data['response']:
            games = data['response']['games']
            game_count = data['response'].get('game_count', 0)
            for game in games:
                game_data = {
                    'user_id': user_id,
                    'game_count': game_count,
                    'appid': game.get('appid'),
                    'name': game.get('name', 'Unknown'),
                    'playtime_forever': game.get('playtime_forever', 0),
                    'playtime_2weeks': game.get('playtime_2weeks', 0)
                }
                all_user_data.append(game_data)
        
        print(f'Fetched data for User ID {user_id} ({idx + 1}/{len(user_ids)})')
        
    except requests.exceptions.RequestException as e:
        print(f'Error fetching data for User ID {user_id}: {e}')

    # rate limiting
    time.sleep(0.5)


df = pd.DataFrame(all_user_data, columns=columns)

df.to_csv('user_playtime_data.csv', index=False)

print('Data collection complete. Saved to "user_playtime_data.csv"')
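
The request URL in the script above is assembled with an f-string and a hard-coded key placeholder. As a minimal sketch, the same URL could be built from a parameter dictionary, with the key read from the environment. The function name `build_owned_games_url` and the environment variable `STEAM_API_KEY` are assumptions made for illustration; only the endpoint and query parameters come from the script.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the collection script above.
BASE_URL = "https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/"


def build_owned_games_url(api_key: str, steam_id: str) -> str:
    """Return the GetOwnedGames URL for one SteamID64 with encoded query parameters."""
    params = {
        "key": api_key,
        "steamid": steam_id,
        "include_appinfo": "true",
        "include_played_free_games": "true",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# STEAM_API_KEY is a hypothetical variable name; fall back to the placeholder if unset.
api_key = os.environ.get("STEAM_API_KEY", "API_KEY")
example_url = build_owned_games_url(api_key, "76561197960265728")
print(example_url)
```

Keeping the key out of the source file means the script can be shared without editing it first.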


In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

import scipy.stats as stats
import time
import itertools

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')


games_data = pd.read_csv("../data/steam_game_data.csv")
users_data = pd.read_csv("../data/user_playtime_data.csv")
reviews_data = pd.read_csv("../data/reviews.csv")

## Data Preparation


The Steam library includes not only actively released games but also test builds, delisted content, and placeholders. A significant portion of the entries carried non-standard release indicators such as “Q2 2025” or “Coming soon” instead of a concrete date.

Entries with incomplete or ambiguous metadata were removed. Standardized formatting was applied to key fields, including release dates and platform tags, to produce a clean dataset suitable for reliable trend analysis.


In [None]:
# Removing rows with any missing values
games_data.dropna(inplace=True)

# Attempting to parse the dates with the known date formats:
# Possible formats known from: https://steamcommunity.com/sharedfiles/filedetails/?id=2554483179#:~:text=Date%20part%20order%20can%20be,format%20will%20be%20applied%20immediately.

valid_dates_format1 = pd.to_datetime(games_data['release_date'], format='%d %b, %Y', errors='coerce')
valid_dates_format2 = pd.to_datetime(games_data['release_date'], format='%b %d, %Y', errors='coerce')
valid_dates_format3 = pd.to_datetime(games_data['release_date'], format='%d %b %Y', errors='coerce')

# Identify rows that do not match any of the three formats
invalid_dates = games_data[(valid_dates_format1.isna()) & (valid_dates_format2.isna()) & (valid_dates_format3.isna())]

# let's remove those invalid dates
invalid_dates_list = invalid_dates['release_date'].tolist()
games_data = games_data[~games_data['release_date'].isin(invalid_dates_list)]


# Combining all parsed dates
games_data['release_date'] = valid_dates_format1.fillna(valid_dates_format2).fillna(valid_dates_format3)

# Convert dates to 'YYYY-MM-DD' format for consistency only
games_data['release_date'] = games_data['release_date'].dt.strftime('%Y-%m-%d')
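
To make the format fallback easier to follow, here is a minimal standard-library sketch of the same idea: try each known format in turn and treat anything that matches none of them as invalid. The helper name `parse_release_date` and the sample strings are assumptions for illustration, and `%b` depends on an English locale for month abbreviations.

```python
from datetime import datetime
from typing import Optional

# The three formats tried in the cell above.
DATE_FORMATS = ("%d %b, %Y", "%b %d, %Y", "%d %b %Y")


def parse_release_date(raw: str) -> Optional[str]:
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match, try the next one
    return None


# Hypothetical sample values, including the placeholders mentioned in the text.
samples = ["21 Oct, 2019", "Oct 21, 2019", "21 Oct 2019", "Q2 2025", "Coming soon"]
for raw in samples:
    print(raw, "->", parse_release_date(raw))
```

Entries such as "Q2 2025" come back as `None`, which corresponds to the rows the notebook drops.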


# Rename the column from 'appid' to 'game_id'
users_data.rename(columns={'appid': 'game_id'}, inplace=True)

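# Add an avg_playtime column for visualization: average playtime per day over the last two weeks
# (computed before the conversion below, so the unit here is minutes per day)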
users_data['avg_playtime'] = users_data['playtime_2weeks'] / 14

# Convert playtime from minutes to hours
users_data['playtime_2weeks'] = users_data['playtime_2weeks'] / 60
users_data['playtime_forever'] = users_data['playtime_forever'] / 60

##  Engagement Levels

In [None]:
# note: each row is one user-game record; copy so new columns can be added without a pandas warning
active_users = users_data[users_data['playtime_2weeks'] > 0].copy()

# calculate weekly playtime (in hours) for those active users
active_users['weekly_playtime_hours'] = (active_users['playtime_2weeks'] / 2)


# visualize the distribution of weekly playtime
plt.figure(figsize=(8, 6))
plt.hist(active_users['weekly_playtime_hours'], bins=30, color='skyblue', edgecolor='black')
plt.axvline(x=20, color='red', linestyle='--', label='Addiction Threshold (20 hours)')
plt.xlabel('Average Weekly Playtime (Hours)')
plt.ylabel('Number of Active Users')
plt.title('Distribution of Weekly Playtime Among Active Users')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Let's also categorize active users based on weekly playtime
def categorize_playtime(hours):
    if hours < 7:
        return 'Casual Gamer'
    elif hours < 20:
        return 'Moderate Gamer'
    else:
        return 'Potential Addiction'

active_users['playtime_category'] = active_users['weekly_playtime_hours'].apply(categorize_playtime)

# total active users in each category
active_category_counts = active_users['playtime_category'].value_counts()


plt.figure(figsize=(6, 6))
plt.pie(active_category_counts, labels=active_category_counts.index, autopct='%1.1f%%', startangle=140, colors=['#66c2a5', '#fc8d62', '#8da0cb'])
plt.title('Active User Distribution by Playtime Category')
plt.show()


When analyzing active user behavior, it turns out most people aren't glued to their screens for hours on end.

In fact, 75.4% of users are casual gamers, playing for less than 7 hours a week. This group makes up the overwhelming majority, indicating that for most users, gaming is a light and occasional pastime.

Another 17.9% fall into the moderate gamer category, playing between 7 and 20 hours per week. This suggests a healthy level of interest, frequent enough to show engagement, but not so much that it takes over their weekly schedule.

Only 6.6% of users exceed 20 hours of playtime per week, which may point to a possible risk of gaming addiction. However, this small percentage suggests that excessive gaming behavior isn’t widespread in this dataset.

A closer look at the playtime distribution reveals a long tail: a small number of users play for 40+ hours per week. These outliers might be worth exploring further, but overall, they don’t represent the norm.

So what does all of this tell us? Most people play games casually. A smaller group plays regularly, but still within a reasonable range. Very few exhibit playtime that raises concerns about overuse.

This paints a picture of gaming as something people enjoy in moderation, not as a widespread addictive behavior.
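
One caveat worth checking: the playtime file holds one row per user and game, so a user who spreads their hours over several titles could land in a lower category than their total would suggest. The sketch below shows one way to sum playtime per user before categorizing. The toy records are invented for illustration and this is not the computation behind the percentages above.

```python
import pandas as pd

# Hypothetical toy records: one row per (user, game), playtime in hours over two weeks.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "game_id": [10, 20, 10, 10, 20, 30],
    "playtime_2weeks": [6.0, 4.0, 30.0, 15.0, 15.0, 14.0],
})


def categorize_playtime(hours: float) -> str:
    """Same thresholds as the notebook: under 7, under 20, otherwise 20 or more hours per week."""
    if hours < 7:
        return "Casual Gamer"
    elif hours < 20:
        return "Moderate Gamer"
    return "Potential Addiction"


# Sum across games so each user appears once, then convert two weeks to one week.
weekly_per_user = toy.groupby("user_id")["playtime_2weeks"].sum() / 2
categories = weekly_per_user.apply(categorize_playtime)
print(categories.value_counts(normalize=True))
```

In this toy data, user `u3` never exceeds 7.5 weekly hours on any single game, yet totals 22 hours per week across all three.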



## What drives players to keep playing?

In [None]:
# Merging user_data with games_data on game_id
merged_data = users_data.merge(games_data, on='game_id', how='left')
merged_data.dropna(inplace=True)


merged_data = merged_data[merged_data['playtime_2weeks'] > 0]


merged_data['multiplayer_support'] = merged_data['multiplayer_support'].astype(int)


# game design elements for analysis
design_elements = ["DLC_count", "multiplayer_support", "max_concurrent_players", "average_review_score"]


merged_data['playtime_forever'] = np.log1p(merged_data['playtime_forever'])
merged_data['DLC_count'] = np.log1p(merged_data['DLC_count'])
merged_data['max_concurrent_players'] = np.log1p(merged_data['max_concurrent_players'])
merged_data['average_review_score'] = np.log1p(merged_data['average_review_score'])
merged_data['playtime_2weeks'] = np.log1p(merged_data['playtime_2weeks'])

correlation_results = merged_data[["playtime_forever"] + design_elements].corr()


# heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_results, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Between Game Design Elements and Playtime")
plt.show()
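
The cell above applies `log1p` before computing correlations because playtime and player counts are heavily skewed. As a rough illustration of why that matters, the sketch below computes Pearson's r by hand on raw and log-transformed toy values. The numbers are invented and the `pearson` helper is written here for the example; it is not taken from the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical skewed values: lifetime playtime (hours) and peak concurrent players.
playtime = [1, 5, 20, 100, 2000]
peak_players = [10, 100, 1000, 10000, 1000000]

raw_r = pearson(playtime, peak_players)
log_r = pearson([math.log1p(v) for v in playtime],
                [math.log1p(v) for v in peak_players])
print(f"raw r = {raw_r:.3f}, log1p r = {log_r:.3f}")
```

On raw values a single extreme observation can dominate the coefficient; after `log1p` every point contributes more evenly, and zeros are handled because `log1p(0)` is 0.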

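To test whether heavy play is associated with multiplayer support, a chi-square test of independence is run on a contingency table of `multiplayer_support` against an `addicted` flag.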

In [None]:
# hypothesis testing
import pandas as pd
from scipy.stats import chi2_contingency

merged_data = users_data.merge(games_data, on='game_id', how='left')
merged_data.dropna(inplace=True)

merged_data = merged_data[merged_data['playtime_2weeks'] > 0]
merged_data['multiplayer_support'] = merged_data['multiplayer_support'].astype(int)


df_filtered = merged_data[merged_data['playtime_2weeks'] > 0].copy()

# Create addicted column: 1 if playtime_2weeks > 40 hours (more than 20 hours per week on average), else 0
df_filtered['addicted'] = (df_filtered['playtime_2weeks'] > 40).astype(int)

# Create contingency table between multiplayer_support and addicted status
contingency_table = pd.crosstab(df_filtered['multiplayer_support'], df_filtered['addicted'])

print("Contingency Table:")
print(contingency_table)

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"\nChi-square Statistic: {chi2:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"P-value: {p}")


prop_df = df_filtered.groupby('multiplayer_support')['addicted'].mean().reset_index()
prop_df['Game Type'] = prop_df['multiplayer_support'].map({0: 'Single-player', 1: 'Multiplayer'})

plt.figure(figsize=(8, 5))
sns.barplot(x='Game Type', y='addicted', data=prop_df)
plt.ylabel("Proportion of Addicted Players")
plt.title("Proportion of Addicted Players by Game Type")

plt.ylim(0, 0.3) 
plt.show()
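
For readers who want to see what the chi-square statistic measures, here is a hand computation on a small 2x2 table: compare each observed count with the count expected if game type and the flag were independent. The counts are invented for illustration. Note also that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic will be somewhat smaller than this uncorrected version.

```python
def chi_square_statistic(table):
    """Uncorrected Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = single-player, multiplayer; columns = not flagged, flagged.
toy_table = [[90, 10], [80, 20]]
print(round(chi_square_statistic(toy_table), 4))
```

A larger statistic means the observed counts sit further from what independence would predict.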

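Two-sample t-tests then compare `DLC_count`, `max_concurrent_players`, and `average_review_score` between the flagged and non-flagged groups.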

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind


merged_data = users_data.merge(games_data, on='game_id', how='left')
merged_data.dropna(inplace=True)

merged_data = merged_data[merged_data['playtime_2weeks'] > 0]

merged_data['multiplayer_support'] = merged_data['multiplayer_support'].astype(int)


# Rebuild df_filtered with the same 'addicted' flag as before (more than 40 hours in two weeks)
df_filtered = merged_data[merged_data['playtime_2weeks'] > 0].copy()
df_filtered['addicted'] = (df_filtered['playtime_2weeks'] > 40).astype(int)

# Define groups
group_addicted = df_filtered[df_filtered['addicted'] == 1]
group_non_addicted = df_filtered[df_filtered['addicted'] == 0]

# Function to run and print t-test results for a given predictor
def run_ttest(predictor):
    t_stat, p_value = ttest_ind(group_addicted[predictor], group_non_addicted[predictor], nan_policy='omit')
    print(f"T-test for {predictor}:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {p_value:.4f}\n")

# Run t-tests for each predictor
run_ttest('DLC_count')
run_ttest('max_concurrent_players')
run_ttest('average_review_score')
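
The flagged group is likely much smaller than the non-flagged group, and the two may not share a variance. One option in that situation is Welch's t-test, which SciPy exposes as `ttest_ind(..., equal_var=False)`. The sketch below computes only the Welch t-statistic by hand on invented values; it swaps in a different variant from the default pooled-variance test used in the cell above.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic: difference in means scaled by unpooled standard error."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))


# Hypothetical predictor values for the two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(round(welch_t(group_a, group_b), 4))
```

The sign shows which group has the larger mean; a p-value would still need the Welch degrees of freedom, which SciPy handles.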

## Which genres attract the most playtime?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr


# Merge user data with game data
merged_data = users_data.merge(games_data, on="game_id", how="inner")

merged_data = merged_data[merged_data['playtime_2weeks'] > 40]

genre_playtime = merged_data.groupby("genre")["playtime_2weeks"].sum().sort_values(ascending=False)

# Plot genre vs. total playtime
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_playtime.index[:10], y=genre_playtime.values[:10])
plt.xticks(rotation=30, ha="right")  # Rotating labels for better readability
plt.xlabel("Game Genre")
plt.ylabel("Total Playtime (hours)")
plt.title("Top 10 Most Engaging Game Genres for addicted users by Playtime")
plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()

addiction_correlation, _ = pearsonr(merged_data["playtime_forever"], merged_data["average_review_score"])

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr


# Merge user data with game data
merged_data = users_data.merge(games_data, on="game_id", how="inner")

merged_data = merged_data[merged_data['playtime_2weeks'] > 0]

merged_data = merged_data[merged_data['playtime_2weeks'] < 40]

genre_playtime = merged_data.groupby("genre")["playtime_2weeks"].sum().sort_values(ascending=False)

# Plot genre vs. total playtime
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_playtime.index[:10], y=genre_playtime.values[:10])
plt.xticks(rotation=30, ha="right")  # Rotating labels for better readability
plt.xlabel("Game Genre")
plt.ylabel("Total Playtime (hours)")
plt.title("Top 10 Most Engaging Game Genres for non-addicted users by Playtime")
plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()

addiction_correlation, _ = pearsonr(merged_data["playtime_forever"], merged_data["average_review_score"])
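
The two charts above plot total hours, and the two groups differ in size, so the bar heights are not directly comparable. A possible complement is each genre's share of its group's total playtime. The sketch below is a plain-Python illustration on invented records; the `genre_shares` helper is hypothetical and not part of the notebook.

```python
from collections import defaultdict


def genre_shares(records):
    """Map each genre to its fraction of the total playtime in `records`."""
    totals = defaultdict(float)
    for genre, hours in records:
        totals[genre] += hours
    grand_total = sum(totals.values())
    return {genre: hours / grand_total for genre, hours in totals.items()}


# Hypothetical (genre, playtime_2weeks in hours) records for one group.
toy_records = [("Action", 10), ("RPG", 30), ("Action", 20), ("Indie", 40)]
shares = genre_shares(toy_records)
for genre, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{genre}: {share:.0%}")
```

Computing the shares separately for each group would put both charts on the same 0 to 100 percent scale.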

