# Predicting Hours Played by Minecraft Users Based on Age and Experience Level


### Introduction 

Researchers in the Computer Science department at UBC are collecting data on how people play video games to answer a few questions. One of the questions the researchers are asking is “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?” 

The researchers have set up a Minecraft server to record players' actions as they navigate through the server’s world. Running the project is more involved than it may seem: the researchers must ensure they have enough resources (server hardware, software licences, etc.) to accommodate the number of players they recruit for the study.

To capture player demographics, participants answer a set of questions prepared by the research group before they play, including their age, gender, and experience level. They can then join the server, where their session activity is monitored and recorded. Together, these give the researchers both demographic information and gameplay duration for each player.

### Our Question

We address the researchers' question: “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?” Specifically, we explore whether the number of hours played can be predicted from a player's age and experience level.

### The Data Set 

We will be focusing on the `players.csv` dataset containing demographic and experience level information for each participant, which will allow us to examine the relationship between the number of hours played and the characteristics of the participants. 

The `sessions.csv` data set is not used in this analysis because it does not contain demographic information about the players, and we believed it did not provide any meaningful information for our study beyond what the `players.csv` file already gives.

Further analysis will allow us to observe which types of players are likely to contribute more hours when playing the game.

`players.csv` dataset:

- Number of observations: 196 (one per player in the study)
- Number of variables: 9

Variables:
- `experience` (string): player's experience level
- `subscribe` (boolean): subscription to study's mailing list.
- `hashedEmail` (string): hashed version of the player's email address.
- `played_hours` (float): number of hours the player has spent on the server.
- `name` (string): player's name.
- `gender` (string): player's gender.
- `age` (integer): age of the player in years.
- `individualId` (NoneType): contains no data. It may be intended as an alternative ID for the player.
- `organizationName` (NoneType): contains no data.


### Methods and Results

We started by importing the required libraries and functions into Jupyter.

In [1]:
import pandas as pd
import altair as alt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, train_test_split, cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

Then, we loaded the `players.csv` file into Jupyter.

In [3]:
players_data = pd.read_csv("data/players.csv")
players_data

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |
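
The summary given in the data set section (row and column counts, column types, and the two empty columns) can be checked directly once the file is loaded. The sketch below is one possible way to do that, not necessarily how the figures above were obtained. Because the CSV file itself is not reproduced here, it uses a hypothetical stand-in frame, `players_sample`, built from the first five rows printed above; `hashedEmail` is left out because the printed values are truncated.

```python
import pandas as pd

# Stand-in for players.csv: the first five rows printed above.
# hashedEmail is omitted because the printed values are truncated.
players_sample = pd.DataFrame(
    {
        "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
        "subscribe": [True, True, False, True, True],
        "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
        "name": ["Morgan", "Christian", "Blake", "Flora", "Kylie"],
        "gender": ["Male", "Male", "Male", "Female", "Male"],
        "age": [9, 17, 17, 21, 21],
        "individualId": [None] * 5,
        "organizationName": [None] * 5,
    }
)

# Number of observations (rows) and variables (columns).
n_rows, n_columns = players_sample.shape

# Columns in which every value is missing.
empty_columns = [
    column
    for column in players_sample.columns
    if players_sample[column].isna().all()
]

print(f"{n_rows} rows, {n_columns} columns")
print(players_sample.dtypes)
print("Columns with no data:", empty_columns)
```

On the full data, the same checks can be run with `players_data` in place of `players_sample`.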


### Exploratory Data Analysis and Visualization

To answer the researchers' question, “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?”, we will create two scatter plots of `played_hours` against `age`, one grouped by `gender` and one by `experience`, to see whether there are any patterns in the data.

We will do this by selecting the `experience`, `played_hours`, `gender`, and `age` columns using `[]` indexing. The remaining columns are dropped because they are not needed: they do not provide information on the players' demographics.

Our next step was to wrangle the data. First, we dropped the columns we did not need for our analysis, keeping only the `played_hours`, `experience`, and `age` columns using `[]`. Next, we used one-hot encoding to turn the `experience` column into numerical variables so that it could be used for regression. Lastly, we combined the original data frame with the new one-hot-encoded data frame using the `concat` function to get our final data frame.