# Online Gaming Behavior EDA:
## Descriptive and Inferential Analyses

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statistics as stat

Read in the CSV file and look at the overall info of the dataset.

In [2]:
df = pd.read_csv('data/online_gaming_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40034 entries, 0 to 40033
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PlayerID                   40034 non-null  int64  
 1   Age                        40034 non-null  int64  
 2   Gender                     40034 non-null  object 
 3   Location                   40034 non-null  object 
 4   GameGenre                  40034 non-null  object 
 5   PlayTimeHours              40034 non-null  float64
 6   InGamePurchases            40034 non-null  int64  
 7   GameDifficulty             40034 non-null  object 
 8   SessionsPerWeek            40034 non-null  int64  
 9   AvgSessionDurationMinutes  40034 non-null  int64  
 10  PlayerLevel                40034 non-null  int64  
 11  AchievementsUnlocked       40034 non-null  int64  
 12  EngagementLevel            40034 non-null  object 
dtypes: float64(1), int64(7), object(5)
memory usag

No apparent null values from the output above, but there might be some disguised 'nulls', such as placeholder strings that info() still counts as non-null.
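One way to look for such disguised missing values is to scan the text columns for common placeholder strings. The sketch below is only an illustration: the placeholder list and the helper name `count_disguised_nulls` are assumptions, and the toy frame stands in for `df` rather than reproducing real rows.

```python
import pandas as pd

# Hypothetical placeholder strings; adjust to whatever the data source uses.
PLACEHOLDERS = ["", "?", "na", "n/a", "none", "null", "unknown"]

def count_disguised_nulls(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number")
    # Normalize case and surrounding whitespace before comparing.
    normalized = text_cols.apply(lambda col: col.astype(str).str.strip().str.lower())
    return normalized.isin(PLACEHOLDERS).sum()

# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Gender": ["Male", "Unknown", "Female"],
    "Location": ["USA", "N/A", " "],
    "Age": [43, 29, 22],
})
disguised = count_disguised_nulls(toy)
print(disguised)
```

Calling `count_disguised_nulls(df)` on the real frame would give one count per text column; a column of zeros would support the reading that nothing is missing.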

Let's get an idea of the range of our numerical values.

In [3]:
df.describe()

Unnamed: 0,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
count,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0
mean,29016.5,31.992531,12.024365,0.200854,9.471774,94.792252,49.655568,24.526477
std,11556.964675,10.043227,6.914638,0.400644,5.763667,49.011375,28.588379,14.430726
min,9000.0,15.0,0.000115,0.0,0.0,10.0,1.0,0.0
25%,19008.25,23.0,6.067501,0.0,4.0,52.0,25.0,12.0
50%,29016.5,32.0,12.008002,0.0,9.0,95.0,49.0,25.0
75%,39024.75,41.0,17.963831,0.0,14.0,137.0,74.0,37.0
max,49033.0,49.0,23.999592,1.0,19.0,179.0,99.0,49.0


In [4]:
df.head(4)

Unnamed: 0,PlayerID,Age,Gender,Location,GameGenre,PlayTimeHours,InGamePurchases,GameDifficulty,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked,EngagementLevel
0,9000,43,Male,Other,Strategy,16.271119,0,Medium,6,108,79,25,Medium
1,9001,29,Female,USA,Strategy,5.525961,0,Medium,5,144,11,10,Medium
2,9002,22,Female,USA,Sports,8.223755,0,Easy,16,142,35,41,High
3,9003,35,Male,USA,Action,5.265351,1,Easy,9,85,57,47,Medium


For this analysis I do not care about PlayerID or InGamePurchases. Also, since these are all different games, PlayerLevel is very ambiguous, and so is AchievementsUnlocked. One game could have only 10 achievements, so a percentage value would be more descriptive here. The same goes for PlayerLevel: what does a level mean across games? Level 100 might be the maximum in one game, while level 50 could be the maximum in another. We will drop these columns for all analyses moving forward, including the ML notebook.

In [5]:
df = df.drop(['PlayerID', 'InGamePurchases', 'PlayerLevel', 'AchievementsUnlocked'], axis=1)
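To illustrate why a percentage would be more descriptive than a raw achievement count, here is a small sketch. It assumes a per-game maximum is available, which is not the case in this dataset: the `MaxAchievements` column and all values below are hypothetical.

```python
import pandas as pd

# Hypothetical data: the real dataset has no per-game maximum column.
toy = pd.DataFrame({
    "AchievementsUnlocked": [10, 25, 40],
    "MaxAchievements": [10, 50, 200],  # hypothetical column
})

# Share of the available achievements that each player unlocked, in percent.
toy["AchievementsPct"] = 100 * toy["AchievementsUnlocked"] / toy["MaxAchievements"]
print(toy)
```

In this toy example the player with the fewest raw achievements has completed the largest share of their game, which is exactly the ambiguity that motivates dropping the raw columns.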

In [6]:
df

Unnamed: 0,Age,Gender,Location,GameGenre,PlayTimeHours,GameDifficulty,SessionsPerWeek,AvgSessionDurationMinutes,EngagementLevel
0,43,Male,Other,Strategy,16.271119,Medium,6,108,Medium
1,29,Female,USA,Strategy,5.525961,Medium,5,144,Medium
2,22,Female,USA,Sports,8.223755,Easy,16,142,High
3,35,Male,USA,Action,5.265351,Easy,9,85,Medium
4,33,Male,Europe,Action,15.531945,Medium,2,131,Medium
...,...,...,...,...,...,...,...,...,...
40029,32,Male,USA,Strategy,20.619662,Easy,4,75,Medium
40030,44,Female,Other,Simulation,13.539280,Hard,19,114,High
40031,15,Female,USA,RPG,0.240057,Easy,10,176,High
40032,34,Male,USA,Sports,14.017818,Medium,3,128,Medium


In [9]:
df.Gender.unique()

array(['Male', 'Female'], dtype=object)

This is a pet peeve, but Male/Female describes sex rather than gender, so we will rename this column to Sex.

In [16]:
df = df.rename(columns={'Gender': 'Sex'})
df

Unnamed: 0,Age,Sex,Location,GameGenre,PlayTimeHours,GameDifficulty,SessionsPerWeek,AvgSessionDurationMinutes,EngagementLevel
0,43,Male,Other,Strategy,16.271119,Medium,6,108,Medium
1,29,Female,USA,Strategy,5.525961,Medium,5,144,Medium
2,22,Female,USA,Sports,8.223755,Easy,16,142,High
3,35,Male,USA,Action,5.265351,Easy,9,85,Medium
4,33,Male,Europe,Action,15.531945,Medium,2,131,Medium
...,...,...,...,...,...,...,...,...,...
40029,32,Male,USA,Strategy,20.619662,Easy,4,75,Medium
40030,44,Female,Other,Simulation,13.539280,Hard,19,114,High
40031,15,Female,USA,RPG,0.240057,Easy,10,176,High
40032,34,Male,USA,Sports,14.017818,Medium,3,128,Medium


In [19]:
df.Location.unique()
# Looks like 4 regions total
df.Location.value_counts()

USA       16000
Europe    12004
Asia       8095
Other      3935
Name: Location, dtype: int64
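The regional counts are easier to compare as shares of the whole sample. A minimal sketch, using the counts copied from the output above rather than the full frame:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
location_counts = pd.Series(
    {"USA": 16000, "Europe": 12004, "Asia": 8095, "Other": 3935},
    name="Location",
)

# On the full frame, df.Location.value_counts(normalize=True) gives the same shares.
location_share = (location_counts / location_counts.sum()).round(3)
print(location_share)
```

The shares come out at roughly 40%, 30%, 20%, and 10%, so the regions are far from evenly represented.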

In [23]:
df.GameGenre.unique()
# And we have 5 genre groups
df.GameGenre.value_counts()

Sports        8048
Action        8039
Strategy      8012
Simulation    7983
RPG           7952
Name: GameGenre, dtype: int64

Let's look at our target column for our future exploration.

In [24]:
df.EngagementLevel.unique()

array(['Medium', 'High', 'Low'], dtype=object)

Okay, so there are only 3 groups to worry about here. We will probably encode this column later on for predictions.
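As a rough idea of what that encoding could look like, the sketch below maps the three levels to ordered integer codes with `pd.Categorical`. The ordering Low < Medium < High is an assumption based on the label names, and the toy series stands in for `df.EngagementLevel`.

```python
import pandas as pd

# Toy values standing in for df.EngagementLevel.
engagement = pd.Series(["Medium", "High", "Low", "Medium"], name="EngagementLevel")

# Assumed ordering of the target classes: Low < Medium < High.
levels = ["Low", "Medium", "High"]
encoded = pd.Categorical(engagement, categories=levels, ordered=True).codes
print(list(encoded))
```

Any value not listed in `levels` would be encoded as -1, which doubles as a quick check for unexpected labels.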

In [29]:
# Count players per age, ordered by age rather than by frequency
value_counts_sorted = df.Age.value_counts().sort_index()

value_counts_sorted

15    1101
16    1138
17    1149
18    1167
19    1139
20    1113
21    1128
22    1150
23    1130
24    1153
25    1108
26    1107
27    1217
28    1108
29    1187
30    1150
31    1228
32    1163
33    1123
34    1103
35    1151
36    1154
37    1219
38    1140
39    1128
40    1202
41    1111
42    1187
43    1180
44    1166
45    1108
46    1121
47    1102
48    1097
49    1106
Name: Age, dtype: int64

Looks like the ages represented run from 15 to 49, with a pretty even spread throughout: every age has between 1,097 and 1,228 players. That's pretty good sampling.
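The evenness of the age spread can also be quantified. The sketch below uses the counts copied from the output above and compares them with a perfectly uniform distribution. It is plain Python, so the variable names are illustrative only.

```python
# Per-age counts copied from the output above (ages 15 through 49).
age_counts = [
    1101, 1138, 1149, 1167, 1139, 1113, 1128, 1150, 1130, 1153,
    1108, 1107, 1217, 1108, 1187, 1150, 1228, 1163, 1123, 1103,
    1151, 1154, 1219, 1140, 1128, 1202, 1111, 1187, 1180, 1166,
    1108, 1121, 1102, 1097, 1106,
]

# Count per age if players were spread perfectly evenly.
expected = sum(age_counts) / len(age_counts)

# Largest relative deviation of any single age from that expectation.
max_rel_dev = max(abs(c - expected) / expected for c in age_counts)

# Pearson chi-square statistic against the uniform expectation.
chi2 = sum((c - expected) ** 2 / expected for c in age_counts)

print(f"expected per age: {expected:.1f}")
print(f"max relative deviation: {max_rel_dev:.1%}")
print(f"chi-square statistic: {chi2:.1f} on {len(age_counts) - 1} degrees of freedom")
```

With scipy already imported, `stats.chisquare(age_counts)` computes the same statistic against equal expected frequencies and also returns a p-value.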