# Title

## Introduction

PlaiCraft is an online Minecraft server run by a research group in UBC Computer Science. Players who join the server provide a small amount of information about themselves before they begin to play. The server records each player's playtime, which is logged in a .csv file titled players.csv alongside other data: each player's gender, age, previous experience with the game, and whether they subscribed to the emailing list.

Our research group set out to answer the question: what kinds of players are most likely to contribute a large amount of data to the dataset? We further investigate whether a player's experience can be predicted by a model based on play time and the other variables in the dataset. We would also like to know how this information can be used during recruitment to target players who are likely to contribute more data. This matters because the researchers who maintain the PlaiCraft server need to know how much to expand the server's capacity to accommodate more players. The researchers could also recruit more effectively by targeting advertisements at the demographics that contribute the most data.

### The Data

The chosen dataset, players.csv, contains **nine** variables describing **196** PlaiCraft players. These nine variables are:
- **experience** - The self-reported level of experience players listed as having when signing up. Experience levels include Beginner, Amateur, Regular, Veteran, and Pro,
- **subscribe** - Players' subscription status to email updates about when other players are online on the server,
- **hashedEmail** - The players' hashed email address,
- **played_hours** - Each player's total playtime in hours,
- **name** - The name chosen by the player,
- **gender** - The self-reported gender of the player. Players were given the options: Male, Female, Agender, Non-Binary, Two-Spirited, Other, and 'Prefer not to say',
- **age** - The self-reported age in years of the player,
- **individualId** - The individual ID of the user,
- **organizationName** - The name of the organization the player is a part of.

To allow for data analysis, this dataset is loaded using pandas.

## Methods and Results

Before any analysis can begin, all necessary packages must be loaded. These packages are altair, numpy, pandas, and sklearn.

In [8]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

To ensure our methodology and findings are reproducible, the players.csv dataset is loaded directly from the web. The dataset is saved to the variable "players".

In [9]:
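# Sharing link for players.csv on Google Drive
url = 'https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit'
# The file ID is the second-to-last segment of the sharing link; use it to build a direct-download link
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
players = pd.read_csv(path)
players.head(3)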

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
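The download link used when loading the data is built by pulling the file ID out of the Google Drive sharing link. A minimal sketch of that step as a reusable function follows; the name `drive_download_url` is hypothetical, and the sketch assumes the sharing link has the form `https://drive.google.com/file/d/<FILE_ID>/<action>`.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive sharing URL into a direct-download URL.

    Assumption: the sharing URL looks like
    https://drive.google.com/file/d/<FILE_ID>/<action>,
    so the file ID is the second-to-last "/"-separated segment.
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit"
path = drive_download_url(share_url)
print(path)
```

The resulting `path` can be passed to `pd.read_csv` exactly as in the loading cell. If the sharing link ever takes a different form, the segment index would need to change.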


Once the dataset is loaded, it is clear that, although the data is tidy, many of the variables are irrelevant to the proposed questions. To keep the analysis accurate and easy to interpret, the columns for players' name, hashed email, individual ID, and organization name will be dropped. The name and hashed email do not provide insight into the "kind" of player someone is, and the individual ID and organization name columns do not contain any data.

This is done using pandas' drop function, and the new dataset is saved to "players_clean".

In [10]:
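# Drop the columns that are not relevant to the question, including "name"
players_clean = players.drop(columns=["name", "hashedEmail", "individualId", "organizationName"])
players_clean.head(3)

| Unnamed: 0 | experience | subscribe | played_hours | gender | age |
|---|---|---|---|---|---|
| 0 | Pro | True | 30.3 | Male | 9 |
| 1 | Veteran | True | 3.8 | Male | 17 |
| 2 | Veteran | False | 0.0 | Male | 17 |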
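The decision to drop individualId and organizationName rests on those columns containing no data. One way to check this programmatically, rather than by eye, is sketched below. The `toy` frame is made-up data that only mimics the layout of players.csv; on the real data the same calls would be applied to `players`.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the layout of players.csv (an assumption for illustration).
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.0],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Fraction of missing values per column; 1.0 means the column is entirely empty.
missing_fraction = toy.isna().mean()
empty_columns = missing_fraction[missing_fraction == 1.0].index.tolist()
print(empty_columns)

# Dropping returns a new DataFrame and leaves the original untouched.
toy_clean = toy.drop(columns=empty_columns)
print(toy_clean.columns.tolist())
```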


Now that the dataset only includes the data relevant to answering the question "Which 'kinds' of players are most likely to contribute a large amount of data?", analysis to identify these "kinds" of players can be conducted. The best way to do this is to determine how different variables influence the variable "played_hours", as the groups with the greatest play time will contribute the largest amount of data and should thus be targeted for recruiting efforts. 

However, some individuals appear to have misreported their age. For example, the 196th data point has an unrealistic age of 91 years. Including such data points would distort the results and lead to inaccurate conclusions about which age group contributes the most playtime. For this reason, only data for ages 15 to 30 (inclusive) will be used to analyze the relationship between age and playtime.
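The age restriction can be sketched on a small example. The values below are made up for illustration and are not taken from the full dataset; the sketch uses `Series.between`, which is inclusive at both ends by default and so matches an age range of 15 to 30.

```python
import pandas as pd

# Made-up ages and playtimes for illustration only.
toy = pd.DataFrame({
    "age": [9, 17, 17, 22, 30, 91],
    "played_hours": [30.3, 3.8, 0.0, 1.2, 0.5, 0.1],
})

# True for rows whose age lies in [15, 30].
in_range = toy["age"].between(15, 30)
kept = toy[in_range]
excluded = toy[~in_range]
print(len(kept), "rows kept;", len(excluded), "rows excluded")

# Mean playtime for each remaining age.
mean_by_age = kept.groupby("age", as_index=False)["played_hours"].mean()
print(mean_by_age)
```

Inspecting `excluded` before discarding it shows exactly which rows the filter removes, which is useful for judging whether the 15 to 30 cut-off is too strict.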

In addition to the relationship between age and average playtime, average playtime will be compared across experience level, gender, and subscription status. These comparisons will be made using bar plots.

In [14]:
# Plot based on experience
# Take the mean playtime for players at each experience level.
player_data_experience_mean = (
    players_clean.groupby('experience')
    .mean(numeric_only=True)
    .reset_index()
)

player_data_experience_bar = alt.Chart(player_data_experience_mean).mark_bar().encode(
    x=alt.X('experience', title='Experience Level'),
    y=alt.Y('played_hours', title='Average Hours Played (Hours)')
).properties(
    title='Average Number of Hours Played by Different Experience Levels'
)

In [15]:
display(player_data_experience_bar)

In [16]:
# Plot based on age
# Take the mean playtime of players at each age between 15 and 30 (inclusive).
player_data_age_mean = (
    players_clean[(players_clean["age"] >= 15) & (players_clean["age"] <= 30)]
    .groupby('age')
    .mean(numeric_only=True)
    .reset_index()
)

player_data_age_bar = alt.Chart(player_data_age_mean).mark_bar().encode(
    x=alt.X('age', title='Age of Players (Years)'),
    y=alt.Y('played_hours', title='Average Hours Played (Hours)')
).properties(
    title='Average Number of Hours Played by Players of Different Ages'
)

In [17]:
display(player_data_age_bar)

In [18]:
# Plot based on gender
# Take the mean playtime of players of each reported gender.
player_data_gender_mean = (
    players_clean.groupby('gender')
    .mean(numeric_only=True)
    .reset_index()
)

player_data_gender_bar = alt.Chart(player_data_gender_mean).mark_bar().encode(
    x=alt.X('gender', title='Reported Gender of Players'),
    y=alt.Y('played_hours', title='Average Hours Played (Hours)')
).properties(
    title='Average Number of Hours Played by Players of Different Genders'
)

In [19]:
display(player_data_gender_bar)

In [20]:
# Plot based on subscription status
# Take the mean playtime of players with each subscription status.
player_data_subscribe_mean = (
    players_clean.groupby('subscribe')
    .mean(numeric_only=True)
    .reset_index()
)

player_data_subscribe_bar = alt.Chart(player_data_subscribe_mean).mark_bar().encode(
    x=alt.X('subscribe', title='Subscription Status of Players'),
    y=alt.Y('played_hours', title='Average Hours Played (Hours)')
).properties(
    title='Average Number of Hours Played by Players of Different Subscription Status'
)

In [21]:
display(player_data_subscribe_bar)
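Each bar above is a group mean, and a mean can be driven by a handful of heavy players when a group is small. As a hedged supplement, not part of the analysis above, the sketch below tabulates the number of players and the median alongside the mean for each group. The `toy` data is made up for illustration; on the real data the same aggregation would be applied to `players_clean`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Veteran", "Veteran", "Veteran", "Amateur"],
    "played_hours": [30.3, 0.1, 3.8, 0.0, 0.2, 1.0],
})

# Group size, mean, and median of playtime per experience level,
# ordered from the highest mean to the lowest.
summary = (
    toy.groupby("experience")["played_hours"]
    .agg(n_players="count", mean_hours="mean", median_hours="median")
    .reset_index()
    .sort_values("mean_hours", ascending=False)
)
print(summary)
```

A large gap between `mean_hours` and `median_hours`, or a small `n_players`, suggests that the corresponding bar should be interpreted with caution.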

## Discussion

## References

# Plan:

## Step 1: Model Which age of each gender plays the most
## Step 1.5: Write, write, write.
## Step 2: Bar plot, y axis played hours, x axis a mix of variables attained from the model
