# **DSCI100 Project: Predicting Playtime Range from Age and Experience**

#### Group 42
#### Members: Linda Zhu, Eunelsy Trillanes, Lavender Sun, and Kelly Ye

## **Introduction**

*Based on a player's age and experience, can we predict which range of played hours (zero, low, medium, high, or extreme) they will contribute?*

**Instructions:**

- provide relevant background information on the topic so that someone unfamiliar with it is prepared to understand the rest of the report
- clearly state the question you tried to answer with your project
- identify and fully describe the dataset that was used to answer the question

## **Methods and Results**

**Instructions:** describe the methods you used to perform your analysis from beginning to end, narrating the analysis code. Your report should include code which:

- loads the data
- wrangles and cleans the data into the format necessary for the planned analysis
- performs a summary of the dataset that is relevant for exploratory data analysis related to the planned analysis
- creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
- performs the data analysis
- creates a visualization of the analysis

Note: all figures should have a figure number and a legend.

In [2]:
# Loading the necessary packages to read and process the data:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

In [3]:
# Reading the data from a URL. Data is in .csv format, so read_csv is used.
url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
player_data <- read_csv(url) |>
        # Selecting only the relevant variables to reduce the dataset size
        select(experience, played_hours, age) 

head(player_data) # Preview the data!

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


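| experience | played_hours | age |
|------------|--------------|-----|
| `<chr>` | `<dbl>` | `<dbl>` |
| Pro | 30.3 | 9 |
| Veteran | 3.8 | 17 |
| Veteran | 0.0 | 17 |
| Amateur | 0.7 | 21 |
| Regular | 0.1 | 21 |
| Amateur | 0.0 | 17 |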


In [10]:
# Manipulating the data! First, convert played_hours into a categorical variable.

zero <- player_data |>
        filter(played_hours == 0) |>
        mutate(hours_range = "Zero")

low <- player_data |>
        filter(played_hours > 0 , played_hours <= 0.25) |>
        mutate(hours_range = "Low")

medium <- player_data |>
        filter(played_hours > 0.25 , played_hours <= 1) |>
        mutate(hours_range = "Medium")

high <- player_data |>
        filter(played_hours > 1, played_hours <= 5) |>
        mutate(hours_range = "High")

extreme <- player_data |>
        filter(played_hours > 5) |>
        mutate(hours_range = "Extreme")

# Combining into one dataframe:

player_data_classes <- rbind(zero, low, medium, high, extreme) |>
                    group_by(experience)
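
The same binning can also be written as a single `mutate()` with `case_when()`, which avoids building five intermediate data frames. The following is only a minimal sketch of that alternative, not the approach used in this report: `toy_players` and `toy_classes` are hypothetical names, and the rows are made-up stand-ins for `player_data`. The cut-offs are copied from the `filter()` calls above.

```r
library(dplyr)

# Hypothetical toy rows standing in for player_data (not the real dataset).
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  played_hours = c(30.3, 3.8, 0.7, 0.1, 0),
  age          = c(9, 17, 21, 21, 17)
)

# One condition per class, using the same boundaries as the filter() calls.
# Rows matching no condition (for example, a missing played_hours) get NA
# instead of being dropped, which differs from the filter-and-rbind version.
toy_classes <- toy_players |>
  mutate(hours_range = case_when(
    played_hours == 0                        ~ "Zero",
    played_hours > 0    & played_hours <= 0.25 ~ "Low",
    played_hours > 0.25 & played_hours <= 1    ~ "Medium",
    played_hours > 1    & played_hours <= 5    ~ "High",
    played_hours > 5                         ~ "Extreme"
  ))

print(toy_classes)
```

Unlike the `rbind()` version, this keeps the rows in their original order, so a later `head()` would show a mix of classes rather than only the first class bound.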

Next, we convert experience level into a numerical variable by assigning each level an integer value (Amateur = 0, Beginner = 1, Regular = 2, Pro = 3, Veteran = 4):

In [6]:
amateur <- player_data_classes |>
        filter(experience == "Amateur") |>
        mutate(exp_level = 0)

beginner <- player_data_classes |>
        filter(experience == "Beginner") |>
        mutate(exp_level = 1)

regular <- player_data_classes |>
        filter(experience == "Regular") |>
        mutate(exp_level = 2)

pro <- player_data_classes |>
        filter(experience == "Pro") |>
        mutate(exp_level = 3)

veteran <- player_data_classes |>
        filter(experience == "Veteran") |>
        mutate(exp_level = 4)
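
The experience-to-number mapping can also be expressed with a named lookup vector instead of one `filter()` per level. This is a minimal sketch of that idea under stated assumptions: `exp_lookup`, `toy_experience`, and `toy_levels` are hypothetical names, and the toy rows are made up for illustration. The numeric values are the ones assigned in the cell above.

```r
library(dplyr)

# Hypothetical lookup table; values copied from the mutate() calls above.
exp_lookup <- c(Amateur = 0, Beginner = 1, Regular = 2, Pro = 3, Veteran = 4)

# Hypothetical toy rows standing in for player_data_classes.
toy_experience <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner")
)

# Index the lookup vector by name; unname() drops the label names so that
# exp_level is a plain numeric column. Unknown labels would become NA.
toy_levels <- toy_experience |>
  mutate(exp_level = unname(exp_lookup[experience]))

print(toy_levels)
```

Keeping the mapping in one vector makes the chosen ordering of the levels easy to see and to change in a single place.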


In [9]:
final_dataset <- rbind(amateur, beginner, regular, pro, veteran) |>
                    group_by(hours_range) |>
                    select(age, hours_range, exp_level)

head(final_dataset)

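| age | hours_range | exp_level |
|-----|-------------|-----------|
| `<dbl>` | `<chr>` | `<dbl>` |
| 17 | Zero | 0 |
| 21 | Zero | 0 |
| 22 | Zero | 0 |
| 17 | Zero | 0 |
| 33 | Zero | 0 |
| 17 | Zero | 0 |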


## **Discussion**

- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact such findings could have
- discuss what future questions this could lead to

### **References**
