# Individual Project Planning #
## Project Overview:
**Broad question addressed:** We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Players contribute more data the longer they play, so we use playtime as a proxy for the amount of data a player contributes.

**Specific question:** Can age predict playtime in PLAICraft servers?

## (1) Data Description:
Full descriptive summary of the dataset:
1. Number of observations and summary statistics (2 decimal places) (Mean value for each quantitative variable in players.csv)
2. Variable: number, types and names, meaning, data collection methods
3. Exploratory visualizations (plots)
4. Data issues (observable and non-observable)



In [1]:
# load the data and the library:
library(tidyverse)
players <- read_csv("https://raw.githubusercontent.com/calentynes/INDVPRD100/refs/heads/master/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/calentynes/INDVPRD100/refs/heads/master/sessions.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


### 1. Number of observations and summary statistics:
Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase. Compute the mean value for each quantitative variable in the players.csv data set. Report the mean values in a table format.

In [6]:
#For players.csv:

summarystatistics_players <- summary(players)
summarystatistics_players #summary statistics

mean_players <- players |>
      summarise(
        mean_hours = mean(played_hours, na.rm = TRUE),
        mean_age = mean(Age, na.rm = TRUE))

mean_players #mean value in a table format



  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

mean_hours,mean_age
<dbl>,<dbl>
5.845918,21.13918
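Since the outline asks for summary statistics to 2 decimal places, here is a minimal sketch of one way to produce a rounded table of means for every numeric column at once. It uses only the five rows printed by `head(players)` in this notebook as a toy subset, so the numbers are not the full-data means. The names `players_head` and `mean_table` are hypothetical and chosen for illustration.

```r
library(dplyr)
library(tibble)

# Toy subset: the five rows shown by head(players) in this notebook
# (assumption: used here only so the example runs without downloading the file)
players_head <- tibble(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  Age          = c(9, 17, 17, 21, 21)
)

# Mean of every numeric column, ignoring NAs, rounded to 2 decimal places
mean_table <- players_head |>
  summarise(across(where(is.numeric), \(x) round(mean(x, na.rm = TRUE), 2)))

mean_table
```

Swapping `players_head` for the full `players` data frame should give the same table shape, with one column per quantitative variable.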


If we run `players`, we can see that the data is already tidy: each row is one observation, each column is one variable, and each cell holds a single value. Therefore, we don't need to tidy the data.

In [7]:
head(players, n = 5) #showing first 5 rows to make my point
head(sessions, n = 5)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21


hashedEmail,date,time,end_time,original_start_time,original_end_time,total_playtime
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024,18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0,0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024,23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0,0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024,17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0,0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024,03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0,0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024,16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0,0
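Beyond looking at the first rows, the "one observation per row" claim for players can be checked programmatically. The sketch below is one possible check, not part of the original analysis: it tests that no `hashedEmail` appears twice and counts missing values per column. It runs on a toy subset built from the rows printed above, with the hashes shortened for readability (an assumption made purely for illustration); `players_head`, `one_row_per_player`, and `na_counts` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Toy subset of players.csv; hashes are shortened here for readability
players_head <- tibble(
  hashedEmail  = c("f6daba42", "f3c81357", "b674dd7e", "23fe711e", "7dc01f10"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  Age          = c(9, 17, 17, 21, 21)
)

# TRUE when every player (hashedEmail) occupies exactly one row
one_row_per_player <- n_distinct(players_head$hashedEmail) == nrow(players_head)

# Number of missing values in each column
na_counts <- players_head |>
  summarise(across(everything(), \(x) sum(is.na(x))))

one_row_per_player
na_counts
```

On the full data, a `FALSE` result or non-zero NA counts would point to issues worth listing under the data issues heading.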


Now, we will compute summary statistics and tidy up data for sessions.csv:

The original_start_time and original_end_time columns appear to hold Unix timestamps in milliseconds, which are hard to read and not directly useful for our research question. What we want is the playtime of each session.

In [5]:
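# Re-read the raw file so this cell can be re-run safely: once `sessions` has been
# overwritten, `start_time` no longer exists and separate() fails.
sessions_raw <- read_csv("https://raw.githubusercontent.com/calentynes/INDVPRD100/refs/heads/master/sessions.csv")

sessions <- sessions_raw |> 
    # difference of the raw timestamps, in the timestamps' own units
    mutate(total_playtime = original_end_time - original_start_time) |>
    # split "date time" into two separate columns
    separate(col = start_time,
             into = c("date", "time"),
             sep = " ")

sessions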

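In the `head(sessions)` output above, the raw timestamps in each row are identical at the printed precision, which is why `total_playtime` shows 0. A possible alternative, sketched here under stated assumptions, is to parse the human-readable start and end strings with lubridate and take their difference in minutes. The sketch rebuilds three rows from that output, assumes the original `start_time` column was the date and time joined by a space (as the `separate()` call implies), and assumes the raw timestamps are Unix epoch milliseconds. `sessions_head`, `sessions_minutes`, `playtime_min`, and `start_from_epoch` are hypothetical names.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy subset rebuilt from the head(sessions) output in this notebook
sessions_head <- tibble(
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time            = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"),
  original_start_time = c(1719770000000, 1718670000000, 1721930000000)
)

sessions_minutes <- sessions_head |>
  mutate(
    start = dmy_hm(start_time),   # parse day/month/year hour:minute strings
    end   = dmy_hm(end_time),
    # session length in minutes
    playtime_min = as.numeric(difftime(end, start, units = "mins")),
    # assumption: raw value is milliseconds since the Unix epoch
    start_from_epoch = as_datetime(original_start_time / 1000)
  )

sessions_minutes
```

Minute resolution is the best these strings allow, since they carry no seconds.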

### 2. Variable: number, types and names, meaning, data collection methods

In players.csv:

- **Length: 196** — there are 196 observations
  
There are 7 variables:
1. **experience**
    - The level of Minecraft experience each player has, stored in the 'character' data type. Its levels are Beginner, Amateur, Regular, Veteran, and Pro. This data was collected from a question on a UBC Qualtrics form where players rate their Minecraft experience from 0 to 5, from "I've never played" (0, Beginner) to "I'm a pro" (5, Pro). A similar question precedes this one, asking about familiarity with PLAICraft on a scale from 0 to 5.
2. **subscribe**
   - Whether each player has subscribed, stored in the 'logical' data type (TRUE or FALSE); what the subscription refers to is not stated in the data. 52 of the 196 players have FALSE and 144 have TRUE.
3. **hashedEmail**
   - The email of each player in a hashed format (likely for privacy purposes), in the 'character' data type. Each player must provide their email address to play on the server, which is how the data was collected.
4. **played_hours**
   - The number of hours each player has played, in the 'numeric' data type. mean_hours shows that the average across all players is $\approx 5.85$ hours.
5. **name**
   - The first name of each player in the 'character' data type. This data was likely obtained when players entered their contact information.
6. **gender**
   - The gender of each player in the 'character' data type. The values present are Male, Female, Agender, Two-Spirited, Non-binary, and Prefer not to say, although it is unclear whether these were all the available choices or whether there were other options that no one picked. These values were likely taken from earlier survey questions, similar to **experience**.
7. **Age**
   - The age of each player in the 'numeric' data type. mean_age shows that the average age of all players is $\approx 21.14$. These values were likely taken from earlier survey questions, similar to **experience**.
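Several of the 'character' variables described above are really categorical, and **experience** has a natural order. As an optional illustration (not something the original analysis does), the sketch below converts `experience` to an ordered factor and `gender` to an unordered factor on a toy subset taken from the `head(players)` output. The level order Beginner < Amateur < Regular < Veteran < Pro is an assumption based on the order in which the levels are listed; `players_head`, `experience_levels`, and `players_typed` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Toy subset: experience and gender from the five rows shown by head(players)
players_head <- tibble(
  experience = c("Pro", "Veteran", "Veteran", "Amateur", "Regular"),
  gender     = c("Male", "Male", "Male", "Female", "Male")
)

# Assumed ordering of experience, lowest to highest
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

players_typed <- players_head |>
  mutate(
    experience = factor(experience, levels = experience_levels, ordered = TRUE),
    gender     = factor(gender)
  )

# Count players at each observed experience level, in level order
players_typed |> count(experience)
```

With an ordered factor, comparisons such as `experience >= "Veteran"` work directly, and plots list the levels in a meaningful order rather than alphabetically.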
   
