**Individual Planning Report (Minecraft Project)**



Abrahem Chaudhry

# **Data Description**

## **Files**

**players.csv:** one row per unique player, with demographic and account attributes (see the snippet below).

**sessions.csv:** one row per play session, with timing and session-level attributes (see the snippet below).

### **How the data were collected**
Players join a dedicated Minecraft server. Server-side logging captures session metadata (start/end or duration) and links each session to a player identifier. Demographics are collected at registration or via linked profiles. Potential issues include incomplete self-reported demographics, missing sessions for players who churned early, and clock or time zone inconsistencies when deriving durations.
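
The link between sessions and players described above can be checked with a join on `hashedEmail`. The following is a minimal sketch on toy data; the tibbles `players_toy` and `sessions_toy` and their short identifiers are hypothetical stand-ins, not values from the real files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the two files (hypothetical identifiers).
players_toy <- tibble(
  hashedEmail = c("p1", "p2", "p3"),
  experience  = c("Pro", "Veteran", "Amateur")
)
sessions_toy <- tibble(
  hashedEmail = c("p1", "p1", "p2", "p9"),
  start_time  = c("30/06/2024 18:12", "25/07/2024 03:22",
                  "17/06/2024 23:33", "25/05/2024 16:01")
)

# Number of sessions per player; players with no sessions get 0.
sessions_per_player <- players_toy |>
  left_join(count(sessions_toy, hashedEmail, name = "n_sessions"),
            by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

# Sessions whose identifier does not appear in the players table.
orphan_sessions <- anti_join(sessions_toy, players_toy, by = "hashedEmail")
```

Players with zero sessions and sessions without a matching player are both worth inspecting before any analysis that combines the two tables.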

### **High‑level size overview**

**players.csv:** 196 rows, 7 columns

**sessions.csv:** 1535 rows, 5 columns

All data were obtained from the DSCI 100 course materials on Canvas and loaded into R in a Jupyter notebook. A repository stores and manages the two datasets and the notebook.

In [25]:
library(tidyverse)  # attaches readr and dplyr, so separate library() calls are not needed

players <- read_csv('players.csv')
sessions <- read_csv('sessions.csv')




Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

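The column-specification message suggests declaring the column types explicitly. One way this might look is sketched below on an inline toy CSV (two shortened rows, with made-up hashes) so that it runs without the real file; the names `players_csv` and `players_typed` are hypothetical.

```r
library(readr)

# Inline toy CSV with the same column names as players.csv (hashes are made up).
players_csv <- I("experience,subscribe,hashedEmail,played_hours,name,gender,Age
Pro,TRUE,abc123,30.3,Morgan,Male,9
Veteran,FALSE,def456,0.0,Blake,Male,17
")

# Declaring col_types documents the expected schema and silences the message.
players_typed <- read_csv(
  players_csv,
  col_types = cols(
    experience   = col_character(),
    subscribe    = col_logical(),
    hashedEmail  = col_character(),
    played_hours = col_double(),
    name         = col_character(),
    gender       = col_character(),
    Age          = col_double()
  )
)
```

With the real data, the first argument would be `'players.csv'` instead of the inline text.
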

# Data: Players *(players.csv)*

In [13]:
head(players)
summary(players)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

### Variable Descriptions

| Variable       | Type      | Description                                                          |
| -------------- | --------- | -------------------------------------------------------------------- |
| `experience`   | Character | Player's gaming experience level (e.g., Pro, Veteran, Regular, Amateur) |
| `subscribe`    | Logical   | TRUE/FALSE: whether the player has subscribed                        |
| `hashedEmail`  | Character | Hashed (anonymized) email used as the player identifier              |
| `played_hours` | Numeric   | Total hours played                                                   |
| `name`         | Character | Name of the player the record belongs to                             |
| `gender`       | Character | Player's self-reported gender                                        |
| `Age`          | Numeric   | Age of the player in years                                           |

* The dataset contains 196 player records with variables describing experience level, subscription status, gender, age, and total hours played. Ages range from 9 to 58 with a mean of 21.14 years. Mean playtime is 5.85 hours, but the median is only 0.1 hours and the third quartile is 0.6 hours against a maximum of 223.1, so the distribution is strongly right-skewed by a few extreme values. Of the 196 players, 144 are subscribed and 52 are not, and several experience levels appear (e.g., Pro, Veteran, Regular, Amateur). Minor issues include inconsistent variable naming (`Age` is capitalized and `hashedEmail` is camelCase, while the other columns are lowercase) and possible missing values in `Age`; the `summary()` output above does not list an NA count for `Age`, so this should be confirmed with an explicit check.
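
A quick way to confirm the missing-value and naming points is sketched below. It uses a small toy tibble (`players_toy`, hypothetical values) rather than the real `players` data, and the lowercase name `age` is only one possible convention.

```r
library(dplyr)
library(tibble)

# Toy stand-in for players (hypothetical values, one missing Age).
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.1),
  Age          = c(9, 17, NA, 21)
)

# Count missing values in every column.
na_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Bring the one capitalized column in line with the others.
players_renamed <- players_toy |>
  rename(age = Age)
```

Renaming only `Age` leaves `hashedEmail` unchanged, which keeps it usable as a join key with sessions.csv.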

### Summary Statistics

| Variable     | Mean  | Min  | Max    |
|--------------|-------|------|--------|
| Hours played | 5.85  | 0.00 | 223.10 |
| Age (years)  | 21.14 | 9.00 | 58.00  |
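
A table like the one above could be produced directly with `summarise()` and `across()`. This is a sketch on toy values (the tibble `players_toy` is hypothetical), so its numbers differ from the real summary.

```r
library(dplyr)
library(tibble)

# Toy stand-in for players (hypothetical values, one missing Age).
players_toy <- tibble(
  played_hours = c(0, 0.1, 0.7, 3.8, 30.3),
  Age          = c(9, 17, 17, 21, NA)
)

# Mean, min and max for both numeric columns, ignoring missing values.
summary_tbl <- players_toy |>
  summarise(across(
    c(played_hours, Age),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      min  = ~ min(.x, na.rm = TRUE),
      max  = ~ max(.x, na.rm = TRUE)
    )
  ))
```

The result has one row with columns named `played_hours_mean`, `played_hours_min`, and so on; `na.rm = TRUE` matters because a single NA would otherwise make every statistic for that column NA.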


* There are clear outliers: `played_hours` reaches 223.1, while three quarters of players have 0.6 hours or less. The age range of 9 to 58 suggests either data entry errors or the inclusion of minors, which could raise ethical concerns. Hidden issues may include missing values in `gender` or `experience`, inconsistent category formatting (e.g., “Male” vs. “male”), and duplicate users that are hard to spot behind the anonymized `hashedEmail` field. To improve data quality, variable names should be standardized, missing and extreme values identified, and expected relationships (such as older players typically having more experience) checked against the data.
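
The checks listed above might be sketched as follows. The toy tibble `players_toy` is hypothetical, and the 1.5 × IQR rule is only one common convention for flagging extreme values, not a method prescribed by the project.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for players (hypothetical identifiers and values).
players_toy <- tibble(
  hashedEmail  = c("a", "b", "c", "a", "d"),
  gender       = c("Male", "male", "Female", "Male", "Female"),
  played_hours = c(0, 0, 0.1, 0.6, 223.1)
)

# Flag values above Q3 + 1.5 * IQR (assumed outlier rule).
upper_fence <- quantile(players_toy$played_hours, 0.75) +
  1.5 * IQR(players_toy$played_hours)
flagged <- players_toy |>
  mutate(is_outlier = played_hours > upper_fence)

# Normalize capitalization before counting categories.
gender_levels <- players_toy |>
  mutate(gender = str_to_title(gender)) |>
  count(gender)

# Identifiers that appear more than once.
duplicate_ids <- players_toy |>
  count(hashedEmail) |>
  filter(n > 1)
```

Flagged rows are candidates for inspection rather than automatic removal, since a very high playtime may be genuine.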

# Data: Sessions *(sessions.csv)*

In [24]:
head(sessions)
summary(sessions)

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

### Variable Descriptions

| Variable              | Type      | Description                                                                 |
|-----------------------|-----------|-----------------------------------------------------------------------------|
| `hashedEmail`         | Character | Hashed (anonymized) email identifying the player; also present in players.csv |
| `start_time`          | Character | Session start as a date-time string in `DD/MM/YYYY HH:MM` format            |
| `end_time`            | Character | Session end as a date-time string in `DD/MM/YYYY HH:MM` format              |
| `original_start_time` | Numeric   | Session start as a Unix timestamp (magnitude consistent with milliseconds)  |
| `original_end_time`   | Numeric   | Session end as a Unix timestamp (same units); 2 missing values              |

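* The dataset appears mostly clean, but a few issues are worth noting. Both `start_time` and `end_time` are stored as character strings (in `DD/MM/YYYY HH:MM` form) rather than date-time objects, so they must be parsed before durations can be calculated or sessions filtered by time. The numeric equivalents, `original_start_time` and `original_end_time`, are large Unix timestamps (~1.7e+12, a magnitude consistent with milliseconds). In the preview rows these values are identical for the start and end of the same session even when the string times differ (e.g., 18:12 and 18:24 both map to 1719770000000), which suggests the numeric columns have lost precision and should not be used on their own to compute durations.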
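
Parsing the string columns and converting the numeric timestamps could look like the sketch below. It uses two rows copied from the preview with hypothetical short identifiers, and it assumes the strings are in UTC and the numeric values are milliseconds since 1970-01-01; neither assumption is confirmed by the data description.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Two preview rows with shortened, hypothetical identifiers.
sessions_toy <- tibble(
  hashedEmail         = c("p1", "p2"),
  start_time          = c("30/06/2024 18:12", "23/06/2024 15:08"),
  end_time            = c("30/06/2024 18:24", "23/06/2024 17:10"),
  original_start_time = c(1719770000000, 1719160000000)
)

sessions_parsed <- sessions_toy |>
  mutate(
    # Day-month-year strings to date-time objects (UTC is an assumption).
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt   = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes, from the parsed strings.
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Milliseconds to seconds before converting (units are an assumption).
    original_start_dt = as_datetime(original_start_time / 1000, tz = "UTC")
  )
```

Comparing `start_dt` with `original_start_dt` on the real data is one way to test both the unit and the time zone assumptions.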

### Summary Statistics

| Variable            | Mean          |
|----------------------|---------------|
| original start time  | 1.719201e+12  |
| original end time    | 1.719196e+12  |


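* Missing timestamps are already visible: the `summary()` output reports 2 NA's in `original_end_time`, and `end_time` should be checked for the same rows. The mean end time (1.719196e+12) is slightly lower than the mean start time (1.719201e+12), which is likely an effect of those excluded rows and of rounding rather than of sessions ending before they start, but it is worth verifying. Other potential issues include duplicate session records for the same `hashedEmail` and time zone inconsistencies that distort time comparisons. Standardizing date-time formats, verifying that each end time follows its start time, and checking for missing or overlapping entries would improve data accuracy.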
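
The checks named above might be sketched as follows on already-parsed date-times. The toy tibble `sessions_toy` is hypothetical and deliberately contains one duplicate, one overlap, and one missing end time.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Toy sessions (hypothetical), built as vectors so one end time can be set to NA.
start_dt <- ymd_hm(c("2024-06-30 18:12", "2024-06-30 18:20", "2024-06-30 18:12",
                     "2024-06-17 23:33", "2024-06-18 10:00"), tz = "UTC")
end_dt   <- ymd_hm(c("2024-06-30 18:24", "2024-06-30 18:40", "2024-06-30 18:24",
                     "2024-06-17 23:46", "2024-06-18 11:00"), tz = "UTC")
end_dt[5] <- NA
sessions_toy <- tibble(
  hashedEmail = c("p1", "p1", "p1", "p2", "p2"),
  start_dt, end_dt
)

# Sessions with no recorded end.
missing_end <- filter(sessions_toy, is.na(end_dt))

# Sessions that end before they start.
reversed <- filter(sessions_toy, end_dt < start_dt)

# Exact duplicate records.
duplicates <- sessions_toy |>
  count(hashedEmail, start_dt, end_dt) |>
  filter(n > 1)

# Sessions that begin before the same player's previous session ended.
overlaps <- sessions_toy |>
  distinct() |>
  group_by(hashedEmail) |>
  arrange(start_dt, .by_group = TRUE) |>
  mutate(prev_end = lag(end_dt),
         overlaps_prev = !is.na(prev_end) & start_dt < prev_end) |>
  ungroup() |>
  filter(overlaps_prev)
```

Duplicates are removed with `distinct()` before the overlap check so that a repeated record is not also counted as overlapping itself.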