# Steam User Interaction: Exploratory Data Analysis

This analysis uses a [dataset from Kaggle](https://www.kaggle.com/tamber/steam-video-games/data) containing about 200,000 Steam user interactions, covering both purchases and playtime. The dataset is somewhat limited: it records which products each user purchased and how many hours they played, but not when any of these actions occurred.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We begin by using pandas to load the data from the CSV file so that we can manipulate it flexibly.

The file has no header row, so the column names were added according to the Kaggle page. Also note that the final column consists only of 0's and carries no information, so it was dropped.

In [2]:
#names to assign to columns
column_names = ['user-id','game-title','behavior-name','value']

#dtypes to assign
dtypes = {'user-id':int, 'game-title':str, 'behavior-name':'category', 'value':np.float64}

#read in data from csv (please adjust the path if you want to run this)
#usecols keeps the first four columns, which drops the all-zero last column
df = pd.read_csv('./steam-200k.csv',
                usecols=[0,1,2,3],
                names=column_names,
                dtype=dtypes)
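
Before dropping a column it is worth confirming that it really is constant. The sketch below is only an illustration: it reads a hypothetical three-row sample laid out like the file described above (four data columns plus a trailing 0), not the real `steam-200k.csv`, and the names `sample_csv`, `raw`, and `last_column_is_all_zero` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the layout described above
sample_csv = io.StringIO(
    "151603712,The Elder Scrolls V Skyrim,purchase,1.0,0\n"
    "151603712,The Elder Scrolls V Skyrim,play,273.0,0\n"
    "151603712,Fallout 4,purchase,1.0,0\n"
)

# header=None treats the first row as data; columns are numbered 0..4
raw = pd.read_csv(sample_csv, header=None)

# True when the fifth column holds nothing but zeros
last_column_is_all_zero = bool((raw[4] == 0).all())
print(last_column_is_all_zero)
```

Running the same check on the full file would justify the `usecols=[0,1,2,3]` choice.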

In [3]:
df.head()

,user-id,game-title,behavior-name,value
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
1,151603712,The Elder Scrolls V Skyrim,play,273.0
2,151603712,Fallout 4,purchase,1.0
3,151603712,Fallout 4,play,87.0
4,151603712,Spore,purchase,1.0


Before going further, I want to clear up a few things about this dataset. The first question: does every play behavior HAVE to have a corresponding purchase behavior?

We'll check this for one user first and then iterate over all users.

In [4]:
#boolean mask to extract only for one user
user_mask = df['user-id'] == 151603712

#dataframe is filtered according to the mask above (i.e. only transactions from user 151603712)
ind_user_df = df[user_mask]

#creates masks to separate purchase and play data
purchase_only = ind_user_df['behavior-name'] == 'purchase'
play_only = ind_user_df['behavior-name'] == 'play'

#uses masks to split
ind_purch_df= ind_user_df[purchase_only]
ind_play_df = ind_user_df[play_only]

#creates list of unique values
ind_purch_list = list(ind_purch_df['game-title'].unique())
ind_play_list = list(ind_play_df['game-title'].unique())

#Is every game played purchased?
check = [game in ind_purch_list for game in ind_play_list]
print(all(check))
print(check)

True
[True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]


It appears that, for the user in question, every game they played has a corresponding purchase entry. In the next cell the code is converted into a function, and the cell after that applies it to all users.

In [5]:
def allplayedpurchased(userid):    
    
    #boolean mask to extract only for one user
    user_mask = df['user-id'] == userid

    #dataframe is filtered according to the mask above (i.e. only transactions from the given user)
    ind_user_df = df[user_mask]

    #creates masks to separate purchase and play data
    purchase_only = ind_user_df['behavior-name'] == 'purchase'
    play_only = ind_user_df['behavior-name'] == 'play'

    #uses masks to split
    ind_purch_df= ind_user_df[purchase_only]
    ind_play_df = ind_user_df[play_only]

    #creates list of unique values
    ind_purch_list = list(ind_purch_df['game-title'].unique())
    ind_play_list = list(ind_play_df['game-title'].unique())

    #Is every game played purchased?
    check = all(game in ind_purch_list for game in ind_play_list)
    return check

In [6]:
unique_user_ids = list(df['user-id'].unique())

confirmed_users = {}
for user in unique_user_ids:
    confirmed_users[user] = allplayedpurchased(user)

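#iterating over a dict yields its keys (the user ids), so check the stored results instead
print(all(confirmed_users.values()))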

True


Our function confirmed that every played game has a corresponding purchase entry. The converse is of course not true, because plenty of people have games in their Steam library that they haven't played.
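
The per-user loop is easy to follow, but it filters the full DataFrame once per user. One possible vectorized alternative is a left merge with `indicator=True`, sketched below on hypothetical toy data that reuses the notebook's column names. The variable names (`toy`, `matched`, `unpurchased`) are assumptions for illustration, and in this toy example user 2 deliberately plays a game without a purchase row.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    'user-id': [1, 1, 1, 2, 2],
    'game-title': ['A', 'A', 'B', 'A', 'C'],
    'behavior-name': ['purchase', 'play', 'purchase', 'purchase', 'play'],
})

keys = ['user-id', 'game-title']
purchases = toy.loc[toy['behavior-name'] == 'purchase', keys].drop_duplicates()
plays = toy.loc[toy['behavior-name'] == 'play', keys].drop_duplicates()

# Left-merge plays onto purchases; indicator=True adds a '_merge' column
matched = plays.merge(purchases, on=keys, how='left', indicator=True)

# Plays without a matching purchase row are marked 'left_only'
unpurchased = matched[matched['_merge'] == 'left_only']
all_played_purchased = unpurchased.empty
print(all_played_purchased)
print(unpurchased[keys])
```

A side benefit of this approach is that `unpurchased` lists the offending rows directly, rather than only reporting a single boolean.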

Next, let's create new tables that split the data into more useful views:

- a table of user stats, with columns describing each user's purchasing and playing habits

- tables of game stats, one based on purchases and another based on play

First, we'll create the user-based tables:

In [7]:
#create masks to split purchases and play
purch_mask = df['behavior-name'] == 'purchase'
play_mask = df['behavior-name'] == 'play'

#creates 2 new dataframes by splitting original
purch_df = df[purch_mask]
play_df = df[play_mask]

purch_df

,user-id,game-title,behavior-name,value
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
2,151603712,Fallout 4,purchase,1.0
4,151603712,Spore,purchase,1.0
6,151603712,Fallout New Vegas,purchase,1.0
8,151603712,Left 4 Dead 2,purchase,1.0
...,...,...,...,...
199990,128470551,Fallen Earth,purchase,1.0
199992,128470551,Magic Duels,purchase,1.0
199994,128470551,Titan Souls,purchase,1.0
199996,128470551,Grand Theft Auto Vice City,purchase,1.0


We'll likely be interested in knowing how many games each user bought, how many games they've played, and how many hours they've played in total. The next cell prepares these columns as pandas Series.

In [32]:
#number of games purchased aggregated per user-id
purchased_game_count = purch_df.groupby('user-id').count()['game-title'].rename('purchased-game-count')

#number of games played, and total hours played across all games
played_game_count = play_df.groupby('user-id').count()['game-title'].rename('played-game-count')
played_hours_count = play_df.groupby('user-id')['value'].sum().rename('played-hours-count')
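
As a side note, several per-user aggregates can be computed in a single `groupby` pass with pandas named aggregation. The sketch below uses a hypothetical `play_toy` frame, not the real data. Because the output column names contain hyphens, they are passed through a dict unpacked with `**`.

```python
import pandas as pd

# Hypothetical play records for two users
play_toy = pd.DataFrame({
    'user-id': [1, 1, 2],
    'game-title': ['A', 'B', 'A'],
    'value': [10.0, 2.5, 4.0],
})

# Each entry maps an output column name to (input column, aggregation)
play_stats = play_toy.groupby('user-id').agg(
    **{
        'played-game-count': ('game-title', 'count'),
        'played-hours-count': ('value', 'sum'),
    }
)
print(play_stats)
```

The result is a DataFrame indexed by `user-id`, so it can be joined with other per-user columns in the same way as the individual Series.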


In [25]:
#makes a list of all games purchased
purchased_gametitle_list = purch_df.groupby('user-id').apply(lambda x: list(x['game-title'])).rename('purhased-gametitles-list')

#makes a list of all games played at least once
play_gametitle_list = play_df.groupby('user-id').apply(lambda x: list(x['game-title'])).rename('played-gametitles-list')
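
Collecting titles into one list per user can also be written by selecting the column first and aggregating with `list`. This is a minimal sketch on hypothetical data; `purch_toy` and the Series name `purchased-titles` are made up for the example.

```python
import pandas as pd

# Hypothetical purchase records for two users
purch_toy = pd.DataFrame({
    'user-id': [1, 1, 2],
    'game-title': ['A', 'B', 'A'],
})

# One Python list of titles per user, in original row order
titles_per_user = (
    purch_toy.groupby('user-id')['game-title']
    .agg(list)
    .rename('purchased-titles')
)
print(titles_per_user)
```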

In [26]:
#These two lines create a column with the fraction of purchased games that were played, then fill NaN (users with no play records) with 0
percent_library_played = (played_game_count/purchased_game_count).rename('percent-library-played')
percent_library_played.fillna(0, inplace=True)
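
The division works because pandas aligns the two Series on their index before dividing. The sketch below illustrates this with hypothetical counts for three users, one of whom (user 30) has purchases but no play records; the values and variable names are assumptions for the example.

```python
import pandas as pd

# Hypothetical per-user counts; user 30 bought a game but never played
purchased = pd.Series([4, 2, 1], index=pd.Index([10, 20, 30], name='user-id'))
played = pd.Series([2, 2], index=pd.Index([10, 20], name='user-id'))

# Division aligns on user-id; user 30 has no play count, so the result is NaN
fraction_played = (played / purchased).fillna(0).rename('percent-library-played')
print(fraction_played)
```

Note that the result is a fraction between 0 and 1 rather than a percentage, despite the column name.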

In [18]:
#the average number of hours per game for the user, the standard deviation, and the maximum

played_hours_avg = play_df.groupby('user-id')['value'].mean().rename('played-hours-avg')
played_hours_std = play_df.groupby('user-id')['value'].std().rename('played-hours-std').fillna(0)
played_hours_max = play_df.groupby('user-id')['value'].max().rename('played-hours-max').fillna(0)

In [12]:
#the most played game for the user
most_played_game_idx = play_df.groupby('user-id')['value'].idxmax()

user_index = most_played_game_idx.index

In [13]:
#idxmax returns index labels, so look the titles up by label with .loc rather than by position with .iloc
most_played_game = pd.Series(df.loc[most_played_game_idx, 'game-title'].to_numpy(),
                             index=user_index,
                             name='most_played_game')
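
One detail worth illustrating here is that `idxmax` returns index labels, not row positions. The sketch below uses a hypothetical `play_toy` frame with non-contiguous labels (similar to what `play_df` has after filtering) and looks the titles up by label with `.loc`. All names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical play records with non-contiguous index labels
play_toy = pd.DataFrame(
    {
        'user-id': [1, 1, 2, 2],
        'game-title': ['A', 'B', 'C', 'D'],
        'value': [10.0, 25.0, 7.0, 3.0],
    },
    index=[0, 2, 5, 9],
)

# Label of the row with the most hours, per user (user 1 -> 2, user 2 -> 5)
idx = play_toy.groupby('user-id')['value'].idxmax()

# Look up titles by label, then re-index the result by user-id
titles = play_toy.loc[idx, 'game-title']
most_played = pd.Series(titles.to_numpy(), index=idx.index, name='most_played_game')
print(most_played)
```

Using `.iloc` with these labels would fail or pick the wrong rows here, since label 5 and position 5 are not the same thing in `play_toy`.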

In [38]:
columns_to_join_users = [purchased_game_count,
                         played_game_count,
                         played_hours_count,
                         purchased_gametitle_list,
                         play_gametitle_list,
                         percent_library_played,
                         played_hours_avg,
                         played_hours_std,
                         played_hours_max,
                         most_played_game]

users_stats_df = pd.concat(columns_to_join_users, axis=1)

In [34]:
fillna_dict_users = {'played-game-count':0,
                    'played-hours-count':0,
                    'played-gametitles-list':[],
                    'played-hours-avg':0,
                    'played-hours-std':0,
                    'played-hours-max':0,
                    'most_played_game': ''}

users_stats_df = 

user-id
5250          6
76767        20
86540        15
144736        1
181212        2
             ..
309434439     1
309554670     1
309626088     1
309824202     1
309903146     1
Name: played-game-count, Length: 11350, dtype: int32
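
The fill values defined above mix scalars with an empty list. As far as I know, pandas `fillna` does not accept a list as a fill value, so one possible approach is to fill the scalar columns with a dict and handle the list column separately. The sketch below runs on a hypothetical two-user frame (`stats_toy`) with only three of the columns; it is an illustration under those assumptions, not the notebook's actual cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stats for two users; user 30 has no play records
stats_toy = pd.DataFrame(
    {
        'played-game-count': [2.0, np.nan],
        'played-gametitles-list': [['A', 'B'], np.nan],
        'most_played_game': ['A', np.nan],
    },
    index=pd.Index([10, 30], name='user-id'),
)

# Scalar fill values can be passed per column as a dict
scalar_fills = {'played-game-count': 0, 'most_played_game': ''}
filled = stats_toy.fillna(scalar_fills)

# Replace missing entries in the list column with a fresh empty list
filled['played-gametitles-list'] = filled['played-gametitles-list'].apply(
    lambda titles: titles if isinstance(titles, list) else []
)
print(filled)
```

Creating the empty list inside the lambda gives each row its own list object, which avoids accidentally sharing one mutable list across users.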

In [40]:
users_stats_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12393 entries, 5250 to 309903146
Data columns (total 10 columns):
purchased-game-count        12393 non-null int64
played-game-count           11350 non-null float64
played-hours-count          11350 non-null float64
purhased-gametitles-list    12393 non-null object
played-gametitles-list      11350 non-null object
percent-library-played      12393 non-null float64
played-hours-avg            11350 non-null float64
played-hours-std            11350 non-null float64
played-hours-max            11350 non-null float64
most_played_game            11350 non-null object
dtypes: float64(6), int64(1), object(3)
memory usage: 1.0+ MB
