# Term Project - Eric Xiao (Group 34)

In [None]:
# PLEASE RUN THIS ONE TIME

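library(tidyverse)
library(lubridate) # provides dmy_hm(); only attached by tidyverse itself from version 2.0.0 onward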
first_time_ran <- TRUE

## 1. Dataset Description

This dataset consists of two files: `players.csv` contains data about all the players and `sessions.csv` contains data about the play sessions that participants had.

**NOTE: the variables (column names) have been renamed to be PascalCase since they were previously not consistent**

*The following markdown tables were made using https://www.tablesgenerator.com/markdown_tables*

### Players.csv

There are a total of 7 variables and 196 observations.

| Variable    | Type | Interpreted Meaning                                                          | Statistical Summary                                          | Possible Issues                                                                                                                                 |
|-------------|------|------------------------------------------------------------------------------|--------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| Experience  | chr  | The amount of experience the player has with Minecraft                       | 35 Beginners, 63 Amateurs, 36 Regulars, 14 Pros, 48 Veterans | **Should convert type into fct.** The distinction between some of these labels such as "pro" and "veteran" is not all that clear in its meaning |
| Subscribe   | lgl  | Represents whether or not the player subscribed to a game-related newsletter | 52 not subscribed, 144 subscribed to the newsletter          | N/A                                                                                                                                             |
| HashedEmail | chr  | The email of the player, which has been hashed for their privacy             | N/A                                                          | Not really any uses for this variable in terms of data analysis                                                                                 |
| PlayedHours | dbl  | The total number of hours that the player has played on the server           | Average of 5.84 hours played                                 | Some players have 0 hours played                                                                                                                |
| Name        | chr  | Their name                                                                   | N/A                                                          | Not really any uses for this variable in terms of data analysis                                                                                 |
| Gender      | chr  | Their gender                                                                 | 124 Male, 37 Female, 35 other gender minorities              | **Should convert type into fct.** There is a large difference in number of female and male players                                              |
| Age         | dbl  | Their age                                                                    | Average age of 21.14                                         | Two players have NA as their age                                                                                                                |

### Sessions.csv

There are a total of 5 variables and 1535 observations.

| Variable          | Type | Interpreted Meaning                                                 | Statistical Summary                               | Possible Issues                                                                                                                             |
|-------------------|------|---------------------------------------------------------------------|---------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| HashedEmail       | chr  | The email of the player playing in this session, hashed for privacy | N/A                                               | Cannot be used for data analysis, but can be used to match the player in `players.csv` to the session                                       |
| StartTime         | chr  | The starting time of the play session                               | Average session duration 51 minutes\*             | Not formatted well to be interpreted as a time, should either split into multiple columns, **OR parse datetime string into dttm**           |
| EndTime           | chr  | The end time of the play session                                    | Longest session was 4 hours 19 minutes\*          | *Same issues as StartTime*                                                                                                                  |
| OriginalStartTime | dbl  | The Unix timestamp in milliseconds, representing the start time     | First session started on Apr 06 2024 (1.7124e+12) | Since we already have StartTime, this is not really necessary and is less accurate since it only goes to ±10000000 milliseconds = ±2h 46min |
| OriginalEndTime   | dbl  | The Unix timestamp in milliseconds, representing the end time       | Last session ended on Sep 26 2024 (1.72734e+12)   | *Same issues as OriginalStartTime*                                                                                                          |

\* = used multiple variables for statistic summary
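
As a rough illustration of the precision issue noted for `OriginalStartTime` and `OriginalEndTime`, the sketch below converts the two rounded values from the table into dates using base R. The variable names are hypothetical, and UTC is an assumption, since the dataset does not state a time zone.

```r
# Rounded Unix timestamps in milliseconds, copied from the table above
first_start_ms <- 1.7124e+12
last_end_ms    <- 1.72734e+12

# Convert milliseconds to seconds, then to a datetime (time zone assumed to be UTC)
first_start <- as.POSIXct(first_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
last_end    <- as.POSIXct(last_end_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(first_start, "%Y-%m-%d")
format(last_end, "%Y-%m-%d")

# A rounding step of 1e7 milliseconds, expressed in hours
rounding_step_hours <- 1e7 / 1000 / 3600
rounding_step_hours
```

Because the stored values are rounded this coarsely, the parsed `StartTime` and `EndTime` columns are the better source for session durations.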

## 2. Questions

Selected broad question: *What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? (question 1)*

Specific question: **Can a player's age, total hours played, experience, and average play time duration within a session predict if a player would subscribe to a game-related newsletter using the players and sessions dataset?**

This is a classification problem, so we use predictive variables to classify a player as subscribing or not subscribing to a game-related newsletter.

The `players.csv` data can easily be used since it has many variables about player characteristics. `sessions.csv` data can also be used since the hashed email can be linked to a player. One possible metric from `sessions.csv` is the average session duration for each player: we can create a new column for the session length based on EndTime - StartTime, group by email, and summarize an average called AverageSessionDuration for each email. Then, we can use `left_join` on HashedEmail to add AverageSessionDuration to players and use it as a predictive variable. A column bind such as `bind_cols` would only work if both tables had the same rows in the same order, so a join on the email is the safer choice.
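
A minimal sketch of this wrangling step, assuming the join is done by `HashedEmail`. The two toy tibbles and their values are made up for illustration and only mimic the column names of the real data; `SessionMinutes` is a hypothetical helper column.

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for the real data (values are made up for illustration)
players_toy <- tibble(
    HashedEmail = c("a1", "b2", "c3"),
    Subscribe = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(
    HashedEmail = c("a1", "a1", "b2"),
    StartTime = dmy_hm(c("01/05/2024 10:00", "02/05/2024 18:00", "03/05/2024 09:30")),
    EndTime = dmy_hm(c("01/05/2024 10:30", "02/05/2024 19:30", "03/05/2024 09:50"))
)

# Average session duration in minutes for each player
average_durations <- sessions_toy |>
    mutate(SessionMinutes = as.numeric(difftime(EndTime, StartTime, units = "mins"))) |>
    group_by(HashedEmail) |>
    summarize(AverageSessionDuration = mean(SessionMinutes, na.rm = TRUE))

# Join by HashedEmail so each average lands on the matching player row;
# players with no recorded sessions (here "c3") get NA
players_with_duration <- players_toy |>
    left_join(average_durations, by = "HashedEmail")

players_with_duration
```

Players without any sessions end up with `NA`, so the real analysis would need to decide whether to drop them or fill in a value such as 0.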

After wrangling the data, we can train a K-NN classification model. To get an honest estimate of its performance, we should split the data into training and testing sets, tune for the best k-value using only the training set, retrain with that k, and then apply the model to the testing set to determine how good it is.

## 3. Exploratory Data Analysis and Visualization

> Note that the mean values of the quantitative variables in `players.csv` are already reported in [section 1](#Players.csv)  
> (The code that computes these values is located near the end of the notebook)

Below is some code that makes the two datasets tidy.

There are also some visualizations that relate to the research question about predicting newsletter subscription based on player characteristics and behaviors.

### Experience Plot

Compares whether players with different experience levels subscribe at different rates. Overall, 144 of 196 players (about 73%) are subscribed, with pro, veteran, and amateur players subscribing a bit less than that.

### Gender Plot

Compares whether players of different genders differ in their likelihood of subscribing. There doesn't appear to be a large difference between male and female players, although no formal statistical test was run. The other gender minorities show some interesting trends, but these groups contain only a few observations.

### Hours Played and Age Plot

Scatter plot shows Hours Played vs Age, and colors points based on if they subscribed. It shows that players that have played a lot (> 20 hours) are almost guaranteed to be subscribed, while there is no clear trend within 1 - 6 hours of play time (this is shown through 2 plots, one zoomed in more).

In [None]:
players <- read_csv("https://raw.githubusercontent.com/avahbot/dsci-100-term-project/27334598713332bcae2f0dfbe56513e06786391f/data/players.csv") |>
    rename(Experience = experience,
           Subscribe = subscribe,
           HashedEmail = hashedEmail,
           PlayedHours = played_hours,
           Name = name,
           Gender = gender,
           Age = Age) # technically doesn't need to be here

sessions <- read_csv("https://raw.githubusercontent.com/avahbot/dsci-100-term-project/27334598713332bcae2f0dfbe56513e06786391f/data/sessions.csv") |>
    rename(HashedEmail = hashedEmail,
           StartTime = start_time,
           EndTime = end_time,
           OriginalStartTime = original_start_time,
           OriginalEndTime = original_end_time)

In [None]:
# Guard to make sure it doesn't try and parse the Start/EndTime multiple times, causing error
if (first_time_ran) {
    # Experience and Gender should be converted into factors
    players <- players |> mutate(Experience = as_factor(Experience), Gender = as_factor(Gender))

    # Start and End Time should be datetime variables (Other way is to separate them into different columns)
    sessions <- sessions |> mutate(StartTime = dmy_hm(StartTime), EndTime = dmy_hm(EndTime))

    first_time_ran <- FALSE
}

head(players)
head(sessions)

In [None]:
# Does experience level of player affect their likelihood of subscribing?
options(repr.plot.width = 10, repr.plot.height = 10)

players |> ggplot(aes(x = Experience, fill = Subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Player's Experience with Minecraft", y = "Proportion of Players", fill = "Subscribed?") +
    ggtitle("Percentage of players subscribed to newsletter by experience level") +
    theme(text = element_text(size = 16))

In [None]:
# Does gender of player affect their likelihood of subscribing?
options(repr.plot.width = 10, repr.plot.height = 10)

players |> ggplot(aes(x = Gender, fill = Subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Player's Gender", y = "Proportion of Players", fill = "Subscribed?") +
    ggtitle("Percentage of players subscribed to newsletter by gender") +
    theme(text = element_text(size = 16))

In [None]:
# Does play time of player and age affect their likelihood of subscribing?
options(repr.plot.width = 10, repr.plot.height = 10)

playtime_age_subscribed_plot <- players |> ggplot(aes(x = PlayedHours, y = Age, color = Subscribe)) +
    geom_point(size = 2, alpha = 0.7) +
    labs(x = "Total Time Played (Hours)", y = "Age", color = "Subscribed?") +
    ggtitle("Hours played vs Age, with their newsletter subscription status colored") +
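    theme(text = element_text(size = 16))

# Full range of play time on a linear axis (a log scale cannot show the 0-hour players)
# coord_cartesian() zooms without dropping the points outside the limits
playtime_age_subscribed_plot + coord_cartesian(xlim = c(-1, 250))

# Zoomed in on players with at most 6 hours of play time
playtime_age_subscribed_plot + coord_cartesian(xlim = c(-0.1, 6))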

## 4. Methods and Plan

We plan to use K-nearest neighbours (K-NN) classification.

> Why is this method appropriate?

The question asks us to predict a categorical label (subscribed to the newsletter or not) from several predictors, so K-NN classification is appropriate.

> Which assumptions are required, if any, to apply the method selected?

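K-NN makes few assumptions about the shape of the data. It does assume that players who are close together in predictor space tend to share the same label, so the predictors should be numeric and standardized before distances are computed. Experience is categorical, so it would need to be encoded numerically before it can be used.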
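
Since K-NN depends on distances between observations, the scale of each predictor matters. The base R sketch below uses three made-up players (the values are hypothetical, not from the dataset) to show how the nearest neighbour of a player can change once the columns are standardized.

```r
# Three hypothetical players: A and B have similar ages, A and C have similar hours played
toy <- data.frame(Age = c(20, 22, 40), PlayedHours = c(0.5, 150, 1.0))
rownames(toy) <- c("A", "B", "C")

# Unscaled: Euclidean distances are dominated by PlayedHours
raw_dist <- as.matrix(dist(toy))
raw_dist

# Standardized: each column is centred to mean 0 and scaled to standard deviation 1
scaled_dist <- as.matrix(dist(scale(toy)))
scaled_dist
```

On the raw values, player A's nearest neighbour is C, because the 149.5-hour gap to B swamps everything else. After standardizing, B becomes slightly closer to A than C is.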

> What are the potential limitations or weaknesses of the method selected?

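K-NN might overfit to this specific dataset, especially with only 196 observations (players). The classes are also imbalanced (144 subscribed vs 52 not subscribed), so a model can look accurate simply by predicting the majority class most of the time.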

> How are you going to compare and select the model?

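The model can be evaluated using its accuracy, recall, or precision, with subscribing to the newsletter treated as the positive class. Since 144 of 196 players (about 73%) are subscribed, accuracy should be compared against the roughly 73% that would be achieved by always predicting "subscribed".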
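
To make the three metrics concrete, here is a small base R sketch with subscribing (`TRUE`) as the positive class. The `truth` and `predicted` vectors are made up for illustration; only the 144 and 196 counts come from the dataset description.

```r
# Hypothetical true labels and model predictions (TRUE = subscribed)
truth     <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

true_pos  <- sum(predicted & truth)   # predicted subscribed, actually subscribed
false_pos <- sum(predicted & !truth)  # predicted subscribed, actually not
false_neg <- sum(!predicted & truth)  # predicted not subscribed, actually subscribed

accuracy  <- mean(predicted == truth)              # share of all predictions that are correct
precision <- true_pos / (true_pos + false_pos)     # share of predicted subscribers that are real
recall    <- true_pos / (true_pos + false_neg)     # share of real subscribers that were found

# Accuracy of always predicting "subscribed" on the players data
baseline_accuracy <- 144 / 196

c(accuracy = accuracy, precision = precision, recall = recall, baseline = baseline_accuracy)
```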

> How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

We will follow the process from class: split the data into training and testing sets, split the training set into folds for cross-validation, tune for the best k-value, retrain on the full training set with that k, and then apply the model to the testing data to see how good it is.
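
The steps above could look roughly like the following tidymodels sketch. It is not the final analysis: the data are randomly generated stand-ins, only two numeric predictors are used, and the 75/25 split, 5 folds, and grid of odd k-values from 1 to 15 are all assumptions chosen for illustration. It also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for the wrangled players data (values are random)
toy_players <- tibble(
    Age = runif(200, 10, 50),
    PlayedHours = rexp(200, rate = 0.2),
    Subscribe = factor(ifelse(PlayedHours + rnorm(200, sd = 3) > 3, "yes", "no"))
)

# Assumed 75/25 split, stratified on the label
players_split <- initial_split(toy_players, prop = 0.75, strata = Subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize predictors so no single variable dominates the distance
knn_recipe <- recipe(Subscribe ~ Age + PlayedHours, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Assumed 5-fold cross-validation over odd k from 1 to 15
players_folds <- vfold_cv(players_train, v = 5, strata = Subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

tune_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- tune_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Retrain on the full training set with the chosen k, then score the test set once
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_best_spec) |>
    fit(data = players_train)

test_accuracy <- predict(final_fit, players_test) |>
    bind_cols(players_test) |>
    metrics(truth = Subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")

test_accuracy
```

The testing set is only touched in the last step, so the reported accuracy is not influenced by the choice of k.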

## 5. GitHub

The GitHub repository for this project is here: https://github.com/avahbot/dsci-100-term-project

## 6. Appendix

In [None]:
# Used to get variable statistical summaries

players |> group_by(Experience) |> summarize(count = n())
players |> group_by(Subscribe) |> summarize(count = n())
players |> group_by(Gender) |> summarize(count = n())
players |> summarize(average_hours_played = mean(PlayedHours))
players |> summarize(average_age = mean(Age, na.rm = TRUE))
sessions |> summarize(first_session_start_time = min(OriginalStartTime))
sessions |> summarize(last_session_end_time = max(OriginalEndTime, na.rm = TRUE))

sessions |> summarize(average_session_duration = mean(EndTime - StartTime, na.rm = TRUE))
sessions |> summarize(max_session_duration = max(EndTime - StartTime, na.rm = TRUE))