# (1) Description of dataset
### players.csv
Contains survey data for each player: self-declared experience, subscription status, hashed email (a form of ID), self-reported play hours, name, gender and age.

Notes: 
1. Some people who took the survey never logged in to the server even once.
2. People may be lying in the survey (there is a small discrepancy between self-reported play time and the play time actually accumulated across sessions).
3. The individualId and organizationName columns do not seem to contain any useful or relevant data.
4. Each hashedEmail appears only once, so it can be used as a key for the table.
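
Notes 1 and 4 can be checked with a couple of pandas calls. The sketch below uses small invented frames (`players`, `sessions`) as stand-ins for the real files, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for players.csv and sessions.csv (values invented for illustration).
players = pd.DataFrame({
    "hashedEmail": ["aaa", "bbb", "ccc"],
    "played_hours": [30.3, 3.8, 0.0],
})
sessions = pd.DataFrame({"hashedEmail": ["aaa", "aaa", "ccc"]})

# True when no hashedEmail repeats, i.e. the column can act as a key.
is_key = players["hashedEmail"].is_unique

# Survey respondents with no recorded session at all.
never_logged_in = players[~players["hashedEmail"].isin(sessions["hashedEmail"])]

print(is_key)
print(never_logged_in["hashedEmail"].tolist())
```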

### sessions.csv
Contains one row per play session (presumably taken from Minecraft logs), including the player's hashedEmail as a form of ID, the start time and end time (to the nearest minute), and two columns of garbage data that are supposed to be some sort of epoch time but are messed up. We will not be using those two columns.

- hashedEmail: string, type of ID for each player (not a suitable key for the table)
- start_time: string for start time/log on time (to nearest minute)
- end_time: string for end time/log off time (to nearest minute)
- original_start_time/original_end_time: broken epoch time of start/end of session


Notes:
1. Contains rows where the start and end times are not recorded.
2. A single player may log multiple sessions, so hashedEmail is not a unique ID for each row.


# (2) Question

We will be answering question 3. Specifically: can we predict the length of a player's session given their join time (which day of the week, and at what hour, they join)?

Predictor variable(s): Minutes since the start of the week
Response/output variable(s): amount of time the player stays logged on to the server

Method to use: KNN regression. We are not classifying anything, and since preliminary visualization shows the data to be wave-like in nature, the model should be able to follow a curve (so we will not use linear regression).
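
As a minimal sketch of the predictor described above, the helper below converts parsed timestamps into minutes since Monday 00:00. The function name `minutes_since_week_start` is hypothetical, and the two sample timestamps are taken from the first rows of sessions.csv shown later in the notebook.

```python
import pandas as pd


def minutes_since_week_start(ts: pd.Series) -> pd.Series:
    """Minutes elapsed since Monday 00:00 for each timestamp (Monday is weekday 0)."""
    return ts.dt.weekday * 24 * 60 + ts.dt.hour * 60 + ts.dt.minute


starts = pd.to_datetime(
    pd.Series(["30/06/2024 18:12", "17/06/2024 23:33"]),
    format="%d/%m/%Y %H:%M",
)
week_minutes = minutes_since_week_start(starts)
print(week_minutes.tolist())
```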



# (3) Exploratory visualization
(see below)

# (4) My plan:
### what data to use?
I will be using sessions.csv, because players.csv does not give any useful information, is noisy, and is inconsistent with the session data.

The hashedEmail column is also not useful: we are not predicting activity on a per-player basis (we could, but there is not enough data per player).

### the actual plan to do stuff...
1. Use the start and end time columns to produce relevant epoch time columns.
2. Take the epoch time columns modulo one week to get a "seconds since start of week" column.
3. Plot player count (y axis) vs time of week.
4. Train a KNN regressor on the data.
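
Step 2 has one catch worth a worked example: the Unix epoch (1 January 1970) fell on a Thursday, so a bare modulus counts from Thursday rather than Monday. The sketch below assumes timestamps are in UTC; `seconds_since_monday` is a hypothetical helper name.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def seconds_since_monday(epoch_seconds: int) -> int:
    """Seconds since the most recent Monday 00:00 UTC.

    `epoch_seconds % SECONDS_PER_WEEK` alone counts from Thursday 00:00 UTC,
    so shift by three days to move the origin back to Monday.
    """
    return (epoch_seconds + 3 * SECONDS_PER_DAY) % SECONDS_PER_WEEK


# Sunday 30 June 2024, 18:12 UTC (the first session start shown in the notebook).
ts = int(datetime(2024, 6, 30, 18, 12, tzinfo=timezone.utc).timestamp())
offset = seconds_since_monday(ts)
print(offset, offset // 3600)
```

The hour value this yields for the sample timestamp is 162, which agrees with the `start_hr_of_week` column derived from the parsed date strings.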


### Why is this method appropriate?
  We are using a scalar predictor to produce a scalar output, and the relationship is not known to be linear. It is not a classification problem, so we will use a regressor. KNN regression does not assume a linear relationship, so it will be our choice.
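
To illustrate the reasoning above, here is a small sketch on synthetic wave-shaped data (invented for illustration, not the session data). It compares held-out RMSE for KNN regression and linear regression; the choice of 10 neighbours is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data: two full sine cycles over a 24-hour range plus a little noise.
X = rng.uniform(0, 24, size=(300, 1))
y = np.sin(X[:, 0] * 2 * np.pi / 12) + rng.normal(0, 0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def rmse(model):
    """Fit on the training split and return RMSE on the held-out split."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test)) ** 0.5


knn_rmse = rmse(KNeighborsRegressor(n_neighbors=10))
lin_rmse = rmse(LinearRegression())
print(knn_rmse, lin_rmse)
```

On data shaped like this a straight line cannot follow the oscillation, so the KNN error comes out lower. Whether the real session data behaves the same way is exactly what the analysis has to check.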
### Which assumptions are required, if any, to apply the method selected?
  n/a
### What are the potential limitations or weaknesses of the method selected?
  We might encounter over/underfitting, depending on the number of neighbours chosen.
  The model might not scale well to large datasets (but it is appropriate for a dataset as small as ours).
### How are you going to compare and select the model?
  25% of the data will be randomly held out to test the model for a given number of neighbours N. We can repeat this as many times as we want, randomizing the split each time, to properly gauge average model performance.
  We might also compare it to a linear regression model if the data resembles anything linear.
### How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
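  1. By turning the sessions data into a simple linear dataset of players/time.
  2. Yes: random splits, repeated multiple times, with 25% held out each time.
  3. The split happens once the data is completely wrangled.
  4. Yes, there will be a validation set.
  5. Yes, cross-validation will be used, repeated as many times as is computationally feasible.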
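
The repeated random 75/25 splits described above can be sketched with scikit-learn's `ShuffleSplit`. The data here is synthetic (invented for illustration), and the number of splits (20) and neighbours (10) are arbitrary assumptions rather than tuned values.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: hour of week vs. session length in hours.
X = rng.uniform(0, 168, size=(200, 1))
y = 1.0 + 0.5 * np.sin(X[:, 0] * 2 * np.pi / 24) + rng.normal(0, 0.2, size=200)

pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))

# 20 independent random splits, each holding out 25% of the rows.
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
scores = cross_val_score(
    pipe, X, y, cv=splitter, scoring="neg_root_mean_squared_error"
)

# scikit-learn reports negated RMSE, so flip the sign back.
rmse_per_split = -scores
print(rmse_per_split.mean(), rmse_per_split.std())
```

The spread of `rmse_per_split` gives a sense of how much a single random split can mislead, which is the motivation for repeating it.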




In [19]:
import warnings

# My local Python version is newer than the one on the UBC server and emits
# FutureWarnings that the server does not produce, so silence them here.
warnings.filterwarnings('ignore')



### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [2]:
play = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
sesh = pd.read_csv("https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB")

In [3]:
play.head(5)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,


In [4]:
sesh.head(5)

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0


In [5]:
from datetime import datetime
import numpy as np

format_code = "%d/%m/%Y %H:%M"

# Quick sanity check that the format string parses and that subtraction works.
date_string = "20/05/2024 14:30"
date_string2 = "20/05/2025 14:32" # for testing
parsed_date = datetime.strptime(date_string, format_code)
parsed_date2 = datetime.strptime(date_string2, format_code)

(parsed_date2 - parsed_date).total_seconds()


def catchWrongDate(d):
    """Zero-pad the hour of timestamps such as '20/05/2024 4:30' so they are 16 characters long."""
    s = str(d)
    if len(s) != 16:
        pos = 11  # index of the first hour digit in "dd/mm/yyyy hh:mm"
        s = s[:pos] + "0" + s[pos:]
    return s


# Treat empty end times as missing and drop sessions that never recorded one.
sesh.loc[sesh['end_time'] == '', 'end_time'] = np.nan
sesh.dropna(subset=['end_time'], inplace=True)
sesh.reset_index(inplace=True)

# Session length in seconds (end minus start).
sesh['time'] = sesh.apply(
    lambda row: (
        datetime.strptime(catchWrongDate(row["end_time"]), format_code)
        - datetime.strptime(catchWrongDate(row["start_time"]), format_code)
    ).total_seconds(),
    axis=1
)

sesh["start_time"] = pd.to_datetime(sesh['start_time'],format="%d/%m/%Y %H:%M")

# Hour of day (0-23), day of week (Monday = 0) and hour of week (0-167).
sesh['start_hr_of_day'] = sesh["start_time"].dt.hour
sesh['start_day_of_week'] = sesh["start_time"].dt.weekday
sesh['start_hr_of_week'] = sesh['start_hr_of_day']+(24*sesh['start_day_of_week'])

sesh.head(5)


Unnamed: 0,index,hashedEmail,start_time,end_time,original_start_time,original_end_time,time,start_hr_of_day,start_day_of_week,start_hr_of_week
0,0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,30/06/2024 18:24,1719770000000.0,1719770000000.0,720.0,18,6,162
1,1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,17/06/2024 23:46,1718670000000.0,1718670000000.0,780.0,23,0,23
2,2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,25/07/2024 17:57,1721930000000.0,1721930000000.0,1380.0,17,3,89
3,3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,25/07/2024 03:58,1721880000000.0,1721880000000.0,2160.0,3,3,75
4,4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,25/05/2024 16:12,1716650000000.0,1716650000000.0,660.0,16,5,136
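
The row-wise `apply` above could also be written in vectorised form. This is a sketch on a two-row toy frame copied from the first rows of the output, not a drop-in replacement: it assumes every timestamp already parses with the format string, so the zero-padding helper is not reproduced here.

```python
import pandas as pd

FORMAT = "%d/%m/%Y %H:%M"

# First two sessions from the table above.
toy = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46"],
})

start = pd.to_datetime(toy["start_time"], format=FORMAT)
end = pd.to_datetime(toy["end_time"], format=FORMAT)

# Session length in seconds, computed for the whole column at once.
toy["time"] = (end - start).dt.total_seconds()
# Hour of week: 0 at Monday 00:00, 167 at Sunday 23:00.
toy["start_hr_of_week"] = start.dt.hour + 24 * start.dt.weekday

print(toy[["time", "start_hr_of_week"]])
```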


In [6]:

# sesh = sesh.drop(columns=['original_start_time', 'original_end_time'])
sesh.head(5)


Unnamed: 0,index,hashedEmail,start_time,end_time,original_start_time,original_end_time,time,start_hr_of_day,start_day_of_week,start_hr_of_week
0,0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,30/06/2024 18:24,1719770000000.0,1719770000000.0,720.0,18,6,162
1,1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,17/06/2024 23:46,1718670000000.0,1718670000000.0,780.0,23,0,23
2,2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,25/07/2024 17:57,1721930000000.0,1721930000000.0,1380.0,17,3,89
3,3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,25/07/2024 03:58,1721880000000.0,1721880000000.0,2160.0,3,3,75
4,4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,25/05/2024 16:12,1716650000000.0,1716650000000.0,660.0,16,5,136


In [7]:
seshG = sesh.groupby("hashedEmail").size().reset_index(name='counts')
seshG #sessions grouped (just for looking)

Unnamed: 0,hashedEmail,counts
0,0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc9335...,2
1,060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe...,1
2,0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce02...,1
3,0d4d71be33e2bc7266ee4983002bd930f69d304288a866...,13
4,0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab...,2
...,...,...
120,fc0224c81384770e93ca717f32713960144bf0b52ff676...,1
121,fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e...,1
122,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,310
123,fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d...,1


In [8]:
instances = alt.Chart(seshG).mark_bar().encode(
        x=alt.X("count()",title="number of players"),
        y=alt.Y("counts",title="sessions per player"),
    
).properties(title="repeated join frequency of players")
instances

The visualization shows that the vast majority of players only played on the server once over the duration of the study. This tells us that it is impractical to predict their activity on a per-user basis.

In [9]:
hours = alt.Chart(sesh).mark_bar().encode(
    x=alt.X("start_hr_of_day:O",title="Hours after midnight"),
    y=alt.Y("count()",title="unscaled relative user log-ons")
).properties(
    title="unscaled cumulative frequency of logins at given time of day"
)
hours

This bar chart shows the relative distribution of player log-ons over the day. The Y axis is a raw count of sessions rather than a rate, but the relative heights show that most players prefer to join late at night, with the most log-ons falling between 2 and 4 a.m. This gives us a fun glimpse into the average UBC Minecrafter's sleep priorities.

In [10]:
days = alt.Chart(sesh).mark_bar().encode(
    x=alt.X("start_day_of_week:O",title="Days since monday"),
    y=alt.Y("count()",title="unscaled relative user log-ons")
).properties(
    title="unscaled cumulative frequency of logins at given day of week"
)
days

Here we see the relative number of log-ons on each day of the week.
It seems players actually log on least on Friday and most on the weekend.
Other trends seem very intuitive (fewer log-ons on weekdays).

Since the data starts in April and ends in September (172.925 days), the day of the week may not have as much of an impact as it would in a study done only during the winter sessions.

In [11]:
sesh["time_mins"] = sesh["time"] / 60
hours = alt.Chart(sesh).mark_bar().encode(
    y=alt.Y("time_mins",bin=alt.Bin(maxbins=20),title="time played each session (minutes)"),
    x=alt.X("count()",title="number of sessions")
).properties(
    title="time played vs number of sessions"
)
hours

Here we see more intuitively that most players play for less than 20 minutes per session, and the number of sessions drops off as sessions get longer.

In [76]:
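# Collect the modelling columns; "start" and "end" are epoch timestamps in milliseconds.
main_df = pd.DataFrame({"time":sesh["time"],"start":sesh["original_start_time"],"end":sesh["original_end_time"],"hr":sesh["start_hr_of_week"],"hrd":sesh["start_hr_of_day"]})
# Hours into the day and into the epoch week (86400000 ms per day, 604800000 ms per week).
# Note: the epoch week starts on Thursday 00:00 UTC, because 1 January 1970 was a Thursday.
main_df["startd"] = main_df["start"]%86400000 / 3600000
main_df["startw"] = main_df["start"]%604800000 / 3600000

# Session length: seconds -> hours.
main_df["time"] = main_df["time"]/60/60

main_df["endd"] = main_df["end"]%86400000 / 3600000
main_df["endw"] = main_df["end"]%604800000 / 3600000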
main_df

Unnamed: 0,time,start,end,hr,hrd,startd,startw,endd,endw
0,0.200000,1.719770e+12,1.719770e+12,162,18,17.888889,89.888889,17.888889,89.888889
1,0.216667,1.718670e+12,1.718670e+12,23,23,0.333333,120.333333,0.333333,120.333333
2,0.383333,1.721930e+12,1.721930e+12,89,17,17.888889,17.888889,17.888889,17.888889
3,0.600000,1.721880e+12,1.721880e+12,75,3,4.000000,4.000000,4.000000,4.000000
4,0.183333,1.716650e+12,1.716650e+12,136,16,15.222222,63.222222,15.222222,63.222222
...,...,...,...,...,...,...,...,...,...
1528,0.100000,1.715380e+12,1.715380e+12,119,23,22.444444,46.444444,22.444444,46.444444
1529,0.183333,1.719810e+12,1.719810e+12,4,4,5.000000,101.000000,5.000000,101.000000
1530,0.350000,1.722180e+12,1.722180e+12,159,15,15.333333,87.333333,15.333333,87.333333
1531,0.116667,1.721890e+12,1.721890e+12,78,6,6.777778,6.777778,6.777778,6.777778
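
A short arithmetic check helps explain why `startd` and `endd` coincide in the rows above. It assumes the epoch columns are rounded to steps of 10,000,000 ms, which is inferred from displayed values such as 1719770000000.0 and not confirmed by the data source.

```python
MS_PER_HOUR = 3_600_000
MS_PER_WEEK = 604_800_000

# Assumed rounding step of the epoch columns (inferred from the displayed values).
step_ms = 10_000_000
resolution_hours = step_ms / MS_PER_HOUR

# First row of sessions.csv as displayed in the notebook.
epoch_ms = 1_719_770_000_000
hours_since_thursday = epoch_ms % MS_PER_WEEK / MS_PER_HOUR

print(round(resolution_hours, 2))      # width of one rounding step, in hours
print(round(hours_since_thursday, 6))  # matches the startw value in row 0
```

Under that assumption, any session shorter than roughly 2.8 hours can have its start and end rounded to the same epoch value, which would make the start-based and end-based predictors nearly identical.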


In [77]:
def regress(x,t="hours since",y="time"):
    """Tune k for the KNN pipeline and plot its predictions over the training data.

    Relies on the globals `pipe`, `X_train`, `y_train` and `training`
    defined in the preceding cells.
    """
    np.random.seed(2019) # DO NOT CHANGE

    param_grid = {
        "kneighborsregressor__n_neighbors": range(1, 201, 1),
    }

    tuned = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        cv=5,
        n_jobs=-1,
        scoring="neg_root_mean_squared_error",
    )
    # Fit the grid search; `res` holds the per-k cross-validation results.
    res = pd.DataFrame(tuned.fit(X_train, y_train).cv_results_)

    # Predictions of the best model on the training rows.
    preds = training.assign(
        predictions=tuned.predict(X_train)
    )
    # Scatter of observed play time with the fitted curve overlaid.
    marathon_plot = alt.Chart(preds).mark_circle(opacity=0.4).encode(
        x = alt.X(x,title=t),
        y = alt.Y(y,title="time (hr)")
    )+ alt.Chart(preds).mark_line(color='black').encode(
        x=alt.X(x),
        y=alt.Y("predictions")
    )

    return marathon_plot

In [78]:
alt.Chart(main_df).mark_point().encode(
    x=alt.X("startd",title="start hr of day"),
    y=alt.Y("time",title="time played (hr)"),
    ).properties(title="playtime vs time started",width=1000)

In [79]:
alt.Chart(main_df).mark_point().encode(
    x=alt.X("startw",title="start hr of week"),
    y=alt.Y("time",title="time played (hr)"),
    ).properties(title="playtime vs time started")

In [80]:
training, testing = train_test_split(
    main_df,
    test_size=0.25,
    random_state=2000,  # Do not change the random_state
)
X_train = training[["startd"]]  # A single column data frame
y_train = training["time"]  # A series

X_test = testing[["startd"]]  # A single column data frame
y_test = testing["time"]  # A series

In [81]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

marathon_cv = pd.DataFrame(
    cross_validate(
        estimator=pipe,
        X=X_train,
        y=y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

marathon_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003952,0.002356,-1.062261,-0.81274
1,0.003039,0.002196,-0.936657,-0.886361
2,0.002992,0.0022,-0.962225,-0.854957
3,0.002972,0.002168,-0.826504,-0.88602
4,0.002954,0.002178,-1.036828,-0.842831


In [82]:
regress("startd","starting hour (hours since 12am)")

In [83]:
alt.Chart(main_df).mark_point().encode(
    x=alt.X("endd",title="end hr of day"),
    y=alt.Y("time",title="time played (hr)"),
    ).properties(title="playtime vs time ended",width=1000)

In [84]:
alt.Chart(main_df).mark_point().encode(
    x=alt.X("endw",title="end hr of week"),
    y=alt.Y("time",title="time played (hr)"),
    ).properties(title="playtime vs time ended")

In [85]:
training, testing = train_test_split(
    main_df,
    test_size=0.25,
    random_state=2000,  # Do not change the random_state
)
X_train = training[["endd"]]  # A single column data frame
y_train = training["time"]  # A series

X_test = testing[["endd"]]  # A single column data frame
y_test = testing["time"]  # A series

In [86]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

marathon_cv = pd.DataFrame(
    cross_validate(
        estimator=pipe,
        X=X_train,
        y=y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

marathon_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003905,0.002616,-1.043066,-0.823191
1,0.003043,0.00216,-0.931811,-0.854639
2,0.002962,0.002153,-0.986957,-0.870458
3,0.002951,0.002157,-0.872028,-0.878992
4,0.002944,0.002141,-1.01472,-0.824356


In [87]:
regress("endd","ending hour (hours since 12am)")

In [88]:
training, testing = train_test_split(
    main_df,
    test_size=0.25,
    random_state=2000,  # Do not change the random_state
)
X_train = training[["startw"]]  # A single column data frame
y_train = training["time"]  # A series

X_test = testing[["startw"]]  # A single column data frame
y_test = testing["time"]  # A series
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

marathon_cv = pd.DataFrame(
    cross_validate(
        estimator=pipe,
        X=X_train,
        y=y_train,
        cv=5,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)

marathon_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003552,0.002309,-1.015076,-0.756702
1,0.003005,0.002202,-0.924043,-0.793456
2,0.00296,0.002164,-0.920934,-0.788269
3,0.003005,0.002186,-0.83159,-0.807233
4,0.002938,0.002165,-1.05467,-0.761866


In [89]:
regress("startw","start hour (hours since 12am Thursday UTC)")
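
The cells above tune and plot on the training split only. A possible final step, sketched here on synthetic data (invented for illustration), is to score the tuned pipeline on the held-out 25%. The grid of 1 to 50 neighbours is an arbitrary assumption chosen to keep the example fast.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: hour of week vs. session length in hours.
X = rng.uniform(0, 168, size=(400, 1))
y = 1.0 + 0.5 * np.sin(X[:, 0] * 2 * np.pi / 24) + rng.normal(0, 0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2000
)

# Choose k by 5-fold cross-validation on the training split only.
tuned = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 51)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
tuned.fit(X_train, y_train)

best_k = tuned.best_params_["kneighborsregressor__n_neighbors"]
# RMSE on rows the model never saw during tuning.
test_rmse = mean_squared_error(y_test, tuned.predict(X_test)) ** 0.5
print(best_k, test_rmse)
```

Reporting `test_rmse` alongside the cross-validation scores gives an error estimate that was not used to pick k.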