# Imports

In [79]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Data Collection

In this exploratory data analysis we look at the five most played games on Steam as of June 2022: Lost Ark, Counter-Strike, Dota 2, ARK, and Apex Legends. The data are CSV files downloaded from https://steamdb.info/graph/ that record concurrent players from the start of June 2022 to the present day.

It is important to consider the limitations and strengths of the data set that we are examining:

Pros:
- Exclusively records players from the Steam platform
- Records updated every 10 minutes
- Includes Twitch viewers as an additional feature

Cons:
- Not all players are included (e.g. those on other platforms)
- Data limited to current week


In [80]:
# Data Collection

FILE_PATH = '../data/apex.csv'
df = pd.read_csv(FILE_PATH)

# Data Exploration

- Checking Assumptions (Dataframe datatypes)
- Handling Missing Data
- Removing Unnecessary Features
- Renaming Columns

In [81]:
df

Unnamed: 0,DateTime,Players,Players Trend,Twitch Viewers
0,2022-05-31 22:20:00,112300,177956.152778,96807
1,2022-05-31 22:30:00,111307,177912.427710,96756
2,2022-05-31 22:40:00,110563,177868.428131,98085
3,2022-05-31 22:50:00,110144,177825.616185,98080
4,2022-05-31 23:00:00,109717,177784.107016,99534
...,...,...,...,...
2011,2022-06-14 21:30:00,108148,161229.261770,69661
2012,2022-06-14 21:40:00,107093,161264.566413,68039
2013,2022-06-14 21:50:00,105766,161307.219453,69285
2014,2022-06-14 22:00:00,104124,161352.114239,77704
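
Before dropping anything, it can help to count how many values are actually missing in each column. The snippet below is a minimal sketch on a small stand-in frame: the rows are copied from the display above, and one `Players` value is blanked on purpose to simulate a missing reading. The names `sample`, `missing_per_column`, and `cleaned` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Rows copied from the display above; the second `Players` value is blanked
# on purpose to simulate a missing reading (it is not a real gap in the data).
sample = pd.DataFrame({
    "DateTime": ["2022-05-31 22:20:00", "2022-05-31 22:30:00", "2022-05-31 22:40:00"],
    "Players": [112300, np.nan, 110563],
    "Twitch Viewers": [96807, 96756, 98085],
})

# Count missing values per column before deciding what to drop
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop only the rows that lack a player count and report how many were removed
cleaned = sample.dropna(subset=["Players"])
print(len(sample) - len(cleaned), "row(s) dropped")
```

Running the same two calls on the full `df` would show whether the `dropna` step removes a handful of rows or a large share of the data.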


We will handle any missing data first. We also drop the `Players Trend` column: it is the moving average provided by SteamDB, and we will create our own player trend later on.

In [82]:
# Remove rows that have no value for players
df.dropna(subset=['Players'], inplace=True)

# Remove the Players Trend column (SteamDB's moving average)
df.drop(columns=['Players Trend'], inplace=True)

# Rename the columns for convenience in later DataFrame manipulations
df.columns = ['date', 'players', 'twitch_viewers']

df

Unnamed: 0,date,players,twitch_viewers
0,2022-05-31 22:20:00,112300,96807
1,2022-05-31 22:30:00,111307,96756
2,2022-05-31 22:40:00,110563,98085
3,2022-05-31 22:50:00,110144,98080
4,2022-05-31 23:00:00,109717,99534
...,...,...,...
2011,2022-06-14 21:30:00,108148,69661
2012,2022-06-14 21:40:00,107093,68039
2013,2022-06-14 21:50:00,105766,69285
2014,2022-06-14 22:00:00,104124,77704
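
Since the SteamDB trend column is dropped in favour of our own, here is one way such a trend could be computed: a rolling mean over the player counts. This is only a sketch under stated assumptions. The 3-sample window (30 minutes at the 10-minute resolution) is an arbitrary illustrative choice, the column name `players_trend` is hypothetical, and the frame is a small stand-in built from the first rows shown above.

```python
import pandas as pd

# Stand-in for the cleaned `df`, using the first five rows shown above
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-05-31 22:20:00", "2022-05-31 22:30:00", "2022-05-31 22:40:00",
        "2022-05-31 22:50:00", "2022-05-31 23:00:00",
    ]),
    "players": [112300, 111307, 110563, 110144, 109717],
})

# Assumed window size: 3 samples = 30 minutes at 10-minute resolution
WINDOW = 3

# min_periods=1 keeps the first rows instead of leaving them as NaN
sample["players_trend"] = (
    sample["players"].rolling(window=WINDOW, min_periods=1).mean()
)
print(sample)
```

A longer window (for example 144 samples for a full day) would smooth out the daily cycle; which window suits the analysis is a judgment call.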


Let's make sure that our columns have the correct datatypes. The `date` column is of type `object`, which indicates that it is being read as a string when we want it as a datetime.

In [83]:
# Checking to make sure that the dataframe has all the correct data types
df.dtypes

date              object
players            int64
twitch_viewers     int64
dtype: object

In [84]:
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

In [85]:
# Check that the date column is now a datetime rather than object
df.dtypes

date              datetime64[ns]
players                    int64
twitch_viewers             int64
dtype: object
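
As an alternative to converting after loading, `pd.read_csv` can parse the timestamps directly through its `parse_dates` parameter. The sketch below assumes the raw file has a `DateTime` header, as the first display suggests; the in-memory CSV text stands in for `../data/apex.csv`, and the name `raw` is illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for ../data/apex.csv, using two rows from the first display.
# The exact layout of the real file is an assumption.
csv_text = """DateTime,Players,Players Trend,Twitch Viewers
2022-05-31 22:20:00,112300,177956.152778,96807
2022-05-31 22:30:00,111307,177912.427710,96756
"""

# parse_dates converts the named column to datetime64 while reading
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateTime"])
print(raw.dtypes)
```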

In [86]:
# Check that the time interval between consecutive samples is consistent

df['date_interval'] = df['date'].diff()
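
Computing the intervals is only the first half of the check; they still need to be summarized. One possible approach, sketched below, is to count the distinct intervals and list any that differ from the expected 10 minutes. The timestamps come from the rows shown earlier, with the 22:50 sample deliberately left out to simulate a gap; `expected`, `interval_counts`, and `gaps` are illustrative names.

```python
import pandas as pd

# Timestamps from the rows shown earlier; 22:50 is omitted on purpose
# to simulate a gap in the recording.
dates = pd.Series(pd.to_datetime([
    "2022-05-31 22:20:00", "2022-05-31 22:30:00", "2022-05-31 22:40:00",
    "2022-05-31 23:00:00",
]))

# Difference between each timestamp and the previous one (first value is NaT)
intervals = dates.diff()

# The data set description states one record every 10 minutes
expected = pd.Timedelta(minutes=10)

# How often each interval occurs (NaT is excluded by default)
interval_counts = intervals.value_counts()
print(interval_counts)

# Rows whose interval differs from the expected one
gaps = intervals[intervals.notna() & (intervals != expected)]
print(gaps)
```

If `gaps` is empty on the real data, the sampling is regular; otherwise the listed rows mark where readings are missing.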

In [87]:
# Data Cleaning

In [88]:
# Data Preparation