# Title

## Introduction

For this project, my group and I were tasked with answering one of three questions about the analytics of a Minecraft server and its player base. I have chosen the question "Which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?" Two datasets were provided to aid in this analysis: players.csv and sessions.csv. Because information on individual sessions is not highly relevant to the chosen question, only players.csv will be used. It is loaded into the notebook in the Computation section.

### The Data

This dataset includes **nine** variables describing **196** players:
- **experience** - The level of experience players listed themselves as having when signing up. Experience levels include Beginner, Amateur, Regular, Veteran, and Pro,
- **subscribe** - Players' subscription status to email updates about the server,
- **hashedEmail** - The players' hashed email address,
- **played_hours** - Each player's total number of hours played,
- **name** - The name chosen by the player,
- **gender** - The self-reported gender of the player. Players could choose from Male, Female, Agender, Non-Binary, Two-Spirited, Other, and Prefer not to say,
- **age** - The self-reported age in years of the player,
- **individualId** - The individual ID of the user,
- **organizationName** - The name of the organization the player is a part of.

Before further analysis, the columns name, hashedEmail, individualId, and organizationName will be dropped because they hold no data relevant to the chosen question. The name and hashed email of a player give no insight into the "kind" of player they are, and the individualId and organizationName columns contain no data at all.
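As a rough illustration of this step, the sketch below checks which columns are entirely empty before dropping them along with the identifier columns. The `toy_players` table is a hypothetical miniature of players.csv (the hash strings in particular are made up), so this is a sketch of the idea rather than the notebook's actual cleaning cell.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of players.csv; values are for illustration only.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "name": ["Morgan", "Christian", "Flora"],
    "hashedEmail": ["f6da", "f3c8", "23fe"],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
empty_columns = toy_players.columns[toy_players.isna().all()].tolist()

# Identifier columns say who a player is, not what kind of player they are.
identifier_columns = ["name", "hashedEmail"]

toy_clean = toy_players.drop(columns=empty_columns + identifier_columns)
print(empty_columns)
print(toy_clean.columns.tolist())
```

Checking `isna().all()` first confirms the claim that a column is empty instead of relying on the truncated preview of the table.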

## Methods and Results

### Methods

Now that the dataset includes only the data relevant to the question "Which 'kinds' of players are most likely to contribute a large amount of data?", analysis to identify these "kinds" of players can be conducted. The approach is to determine how each variable relates to played_hours, on the assumption that the groups with the greatest playtime contribute the most data and should therefore be targeted in recruiting efforts.

That said, some individuals appear to have misreported their age. For example, the 196th data point has an unrealistic age of 91 years. Including such points would distort the results and lead to inaccurate conclusions about which age group contributes the most playtime. For this reason, only players aged 15 to 30 will be used in the analysis of the relationship between age and playtime.
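The age restriction could be expressed along the following lines. This is a minimal sketch on a hypothetical `toy_players` table; only the 15 and 30 bounds come from the text, and the names `AGE_MIN`, `AGE_MAX`, and `age_subset` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; only the 15-30 bounds come from the report text.
toy_players = pd.DataFrame({
    "age": [9, 15, 17, 21, 30, 91],
    "played_hours": [30.3, 1.0, 3.8, 0.7, 2.0, 0.2],
})

AGE_MIN, AGE_MAX = 15, 30

# Series.between is inclusive on both ends by default.
in_range = toy_players["age"].between(AGE_MIN, AGE_MAX)
age_subset = toy_players[in_range]

# Report how many rows the filter removes so the exclusion is transparent.
print(f"kept {in_range.sum()} of {len(toy_players)} rows")
print(age_subset)
```

Note that the same filter also excludes players younger than 15, such as the 9-year-old in the first row of the dataset preview, so it is worth reporting how many rows are removed.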

In addition to age, average playtime will be compared across experience level, gender, and subscription status. For now, these comparisons are made with bar plots, as a proof of concept that they can surface relevant patterns ahead of the full group project. The plots are produced in the Computation section.

### Results

From these bar plots, the demographics with the highest average playtime can be identified and made the focus of recruiting efforts. At this stage, the plots support basic conclusions about how playtime varies with each demographic variable.

Further analysis can split these variables in combination to answer questions such as "Which age of male player contributes the greatest playtime?" Recruitment could then target males of one age and females of another. The same splitting can be applied to other variables to refine the findings.
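One way such a split might look is a two-key `groupby`, sketched below on a hypothetical `toy_players` table. The column names match players.csv, but the rows and the helper names (`by_gender_age`, `top_age_per_gender`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows using the players.csv column names.
toy_players = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "age": [17, 17, 21, 17, 21, 21],
    "played_hours": [3.8, 2.2, 0.1, 0.0, 0.7, 1.3],
})

# Mean playtime and group size for every (gender, age) combination.
by_gender_age = (
    toy_players.groupby(["gender", "age"])["played_hours"]
    .agg(mean_hours="mean", n_players="count")
    .reset_index()
)

# For each gender, keep the age with the highest mean playtime.
top_age_per_gender = by_gender_age.loc[
    by_gender_age.groupby("gender")["mean_hours"].idxmax()
]
print(top_age_per_gender)
```

Keeping `n_players` next to the mean makes it visible when a group's average rests on only one or two players, which becomes more likely as the data is split further.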

Moving into the final report, this analysis will be carried out, and k-nearest neighbors (k-NN) regression models will be trained and tested to predict playtime from variables such as age. This will identify the variable values associated with the most playtime, so that players matching that demographic can be targeted for recruitment.
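A minimal sketch of what such a model might look like is given below, assuming scikit-learn's `KNeighborsRegressor` and a single `age` predictor. The data is synthetic (randomly generated, not drawn from players.csv), and `n_neighbors=5` is an arbitrary placeholder; in the final report the number of neighbors would more sensibly be chosen by cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned player table (values are made up).
rng = np.random.default_rng(0)
toy_players = pd.DataFrame({"age": rng.integers(15, 31, size=120)})
toy_players["played_hours"] = rng.exponential(scale=5.0, size=120)

X = toy_players[["age"]]
y = toy_players["played_hours"]

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The pipeline fits the scaler on the training rows only.
knn_reg = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_reg.fit(X_train, y_train)

# Root mean squared error on the held-out rows, in hours.
rmse = np.sqrt(mean_squared_error(y_test, knn_reg.predict(X_test)))
print(f"test RMSE: {rmse:.2f} hours")
```

The test RMSE is in the same unit as played_hours, which makes it straightforward to judge whether the prediction error is small relative to typical playtime.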

## Discussion

## References

# Plan:

## Step 1: Model which age of each gender plays the most
## Step 1.5: Write, write, write.
## Step 2: Bar plot, y axis played hours, x axis a mix of variables attained from the model

# Computation: 

In [29]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [30]:
url = 'https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
players_shit = pd.read_csv(path)
players_shit

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


The cell above pulls the required dataset from the web. Next, the columns that are not needed for the analysis are dropped.

In [31]:
players_clean = players_shit.drop(columns=["hashedEmail", "individualId", "organizationName"])
players_clean

Unnamed: 0,experience,subscribe,played_hours,name,gender,age
0,Pro,True,30.3,Morgan,Male,9
1,Veteran,True,3.8,Christian,Male,17
2,Veteran,False,0.0,Blake,Male,17
3,Amateur,True,0.7,Flora,Female,21
4,Regular,True,0.1,Kylie,Male,21
...,...,...,...,...,...,...
191,Amateur,True,0.0,Bailey,Female,17
192,Veteran,False,0.3,Pascal,Male,22
193,Amateur,False,0.0,Dylan,Prefer not to say,17
194,Amateur,False,2.3,Harlow,Male,17


In [33]:
#Plot Based on Experience
player_data_experience_mean = (players_clean.groupby('experience')
                        .mean(numeric_only = True)
                        .reset_index()
                              )

player_data_experience_bar = alt.Chart(player_data_experience_mean).mark_bar().encode(
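    # Order the bars by experience level instead of alphabetically
    x=alt.X('experience', sort=['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro'], title = 'Experience Level'),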
    y=alt.Y('played_hours', title = "Average Hours Played (Hours)")
).properties(
    title= 'Average Number of Hours Played by Different Experience Levels'
)

#Plot Based on Age
player_data_age_mean = (players_clean[(players_clean["age"] >= 15) & (players_clean["age"] <= 30)]
                        .groupby('age')
                        .mean(numeric_only = True)
                        .reset_index()
                       )


player_data_age_bar = alt.Chart(player_data_age_mean).mark_bar().encode(
    # Treat age as ordinal so each age gets its own full-width bar
    x=alt.X('age:O', title = 'Age of Players (Years)'),
    y=alt.Y('played_hours', title = "Average Hours Played (Hours)")
).properties(
    title= 'Average Number of Hours Played by Players of Different Ages'
)

#Plot Based on Gender
player_data_gender_mean = (players_clean.groupby('gender')
                        .mean(numeric_only = True)
                        .reset_index()
                       )

player_data_gender_bar = alt.Chart(player_data_gender_mean).mark_bar().encode(
    x=alt.X('gender', title = 'Reported Gender of Players'),
    y=alt.Y('played_hours', title = 'Average Hours Played (Hours)')
).properties(
    title = 'Average Number of Hours Played by Players of Different Genders'
)

#Plot Based on Subscription Status
player_data_subscribe_mean = (players_clean.groupby('subscribe')
                        .mean(numeric_only = True)
                        .reset_index()
                          )

player_data_subscribe_bar = alt.Chart(player_data_subscribe_mean).mark_bar().encode(
    x=alt.X('subscribe', title = 'Subscription Status of Players'),
    y=alt.Y('played_hours', title = 'Average Hours Played (Hours)')
).properties(
    title = 'Average Number of Hours Played by Subscription Status'
)

In [34]:
display(player_data_experience_bar)
display(player_data_age_bar)
display(player_data_gender_bar)
display(player_data_subscribe_bar)

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Preprocessing the dataset
# `players` is built from the raw table loaded earlier in the notebook

# Dropping irrelevant columns
players = players_shit.drop(columns=["hashedEmail", "individualId", "organizationName", "name"])

# Handling categorical variables using pandas Categorical
categorical_columns = ["gender", "experience"]
category_mappings = {}

for col in categorical_columns:
    players[col] = pd.Categorical(players[col])
    category_mappings[col] = players[col].cat.categories
    players[col] = players[col].cat.codes

# Splitting into features and target variables
X = players[["played_hours"]]
y = players[["gender", "age", "experience"]]

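# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the features (the scaler is fit on the training set only,
# so no information from the test set leaks into preprocessing)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)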

# Training a kNN model for each target variable
knn_gender = KNeighborsClassifier(n_neighbors=5)
knn_age = KNeighborsClassifier(n_neighbors=5)
knn_experience = KNeighborsClassifier(n_neighbors=5)

# Fitting the models
knn_gender.fit(X_train, y_train["gender"])
knn_age.fit(X_train, y_train["age"])
knn_experience.fit(X_train, y_train["experience"])

def predict_player_attributes(played_hours):
    """
    Predict gender, age, and experience level based on played hours.
    
    Args:
        played_hours (float): The number of hours played.

    Returns:
        dict: Predicted attributes (gender, age, experience).
    """
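    # Preparing the input for prediction; a DataFrame with the training
    # column name avoids scikit-learn's feature-name warning in the scaler
    input_data = pd.DataFrame({"played_hours": [played_hours]})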
    input_data_scaled = scaler.transform(input_data)

    # Predicting each attribute
    predicted_gender = knn_gender.predict(input_data_scaled)[0]
    predicted_age = knn_age.predict(input_data_scaled)[0]
    predicted_experience = knn_experience.predict(input_data_scaled)[0]

    # Decoding the categorical values
    decoded_gender = category_mappings["gender"][predicted_gender]
    decoded_experience = category_mappings["experience"][predicted_experience]

    return {
        "gender": decoded_gender,
        "age": predicted_age,
        "experience": decoded_experience,
    }

# Example usage
predicted_attributes = predict_player_attributes(100)
print(predicted_attributes)

{'gender': 'Male', 'age': np.int64(17), 'experience': 'Amateur'}


