In [None]:
https://github.com/dathwal/DSCI100-Project.git

First, we load in the necessary libraries.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(tidyclust)
options(repr.matrix.max.rows = 6)


The two datasets (players.csv and sessions.csv) are then loaded using the read_csv function. The glimpse function is used to preview the column names and types.

In [None]:
players <- read_csv("players.csv")

glimpse(players)


In [None]:
# Load the sessions data set
sessions <- read_csv("sessions.csv")
glimpse(sessions)


The question we wish to answer is exploratory, so the method we will use is clustering. We want to see whether the data form clusters that reveal a particular age group contributing the most played hours, so that advertising can be targeted at that group. First, we tidy the data by keeping only the two variables of interest.

In [None]:
players_cleaned <- players |>
  select(played_hours, Age)
players_cleaned
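
Base R's kmeans() stops with an error if the data contain NA, so it may be worth checking for missing values at this point. The following is a minimal sketch on a hypothetical toy data frame (the values are made up and are not taken from players.csv); whether the real data contain missing values is not shown here.

```r
# Hypothetical toy data standing in for players_cleaned (values are made up)
toy_players <- data.frame(
  played_hours = c(0.5, 3.2, NA, 12.0, 0.0),
  Age          = c(17, 21, 19, NA, 25)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy_players))
print(na_counts)

# Keep only the rows where every column is observed
toy_complete <- toy_players[complete.cases(toy_players), ]
print(nrow(toy_complete))
```

In the tidyverse style used in this notebook, tidyr's drop_na() would serve the same purpose as the complete.cases() filter.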

The cleaned players_cleaned data can now be plotted to look for any relationship between the two variables.

In [None]:
# Scatter plot of played hours against age
players_plot <- players_cleaned |>
  ggplot(aes(x = Age, y = played_hours)) +
  geom_point() +
  ggtitle("Played Hours vs Age") +
  xlab("Age") +
  ylab("Played Hours")

players_plot

The dataset can now be split into training and testing sets.

In [None]:
set.seed(1543)
players_split <- initial_split(players_cleaned, prop = 0.7, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

A recipe is then created from the training set to standardize both variables, so that neither one dominates the distance calculations used by K-means.

In [None]:
kmeans_recipe <- recipe(~ ., data = players_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
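
To make the effect of the two recipe steps concrete, here is a minimal base R sketch of the same arithmetic on a hypothetical toy data frame (made-up values, not the real training set). It assumes scaling means dividing by the column's standard deviation and centering means subtracting the column's mean, which is how step_scale and step_center are documented.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  played_hours = c(0, 1, 2, 10, 50),
  Age          = c(17, 19, 21, 25, 40)
)

# Scale: divide each column by its standard deviation
toy_scaled <- as.data.frame(lapply(toy, function(col) col / sd(col)))

# Center: subtract each (scaled) column's mean
toy_std <- as.data.frame(lapply(toy_scaled, function(col) col - mean(col)))

# After both steps, each column has mean 0 and standard deviation 1
print(round(colMeans(toy_std), 10))
print(sapply(toy_std, sd))
```

Without this step, a variable measured on a larger scale would contribute more to the Euclidean distances and could drive the cluster assignments on its own.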

A K-means model is then specified and tuned over a range of cluster counts, and the resulting metrics are collected.

In [None]:

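# Candidate numbers of clusters to try
grid_vals <- tibble(num_clusters = 1:10)

# K-means specification with the number of clusters left to be tuned
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")

# Fit one model per candidate value on the training set and collect the metrics.
# Note: kmeans_fit holds the table of metrics, not a fitted model.
kmeans_fit <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  tune_cluster(resamples = apparent(players_training), grid = grid_vals) |>
  collect_metrics()
kmeans_fit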
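
A common way to read tuning results like these is to look at how the total within-cluster sum of squares falls as the number of clusters grows, and to pick the point where the drop levels off (the "elbow"). The sketch below illustrates the idea with base R's kmeans() on hypothetical simulated data with three well-separated groups; the seed, the data, and the range of k are all assumptions made for illustration and say nothing about the players data.

```r
set.seed(2024)  # arbitrary seed so the toy example is reproducible

# Hypothetical toy data: three well-separated blobs of 20 points each
toy <- rbind(
  cbind(rnorm(20, mean = 0,  sd = 0.5), rnorm(20, mean = 0, sd = 0.5)),
  cbind(rnorm(20, mean = 5,  sd = 0.5), rnorm(20, mean = 5, sd = 0.5)),
  cbind(rnorm(20, mean = 10, sd = 0.5), rnorm(20, mean = 0, sd = 0.5))
)

# Total within-cluster sum of squares for each candidate number of clusters
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

elbow <- data.frame(num_clusters = ks, tot_withinss = wss)
print(elbow)
```

For data like this, the sum of squares should drop sharply up to three clusters and only slightly afterwards, which is the pattern to look for when plotting the collected metrics against num_clusters.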