In [3]:
import pandas as pd
import altair as alt
import numpy as np


In [15]:
game = 'https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz&export=download'
game_data = pd.read_csv(game)
game_data


Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


# 1. Data Description:
- 196 rows/observations, 9 columns, 9 variables
- Column names: experience, subscribe, hashed email, played hours, name, gender, age, individual id, organization name
- Subscribe column: true/false (boolean), are they subscribed?
- Experience column: self reported experience level out of 4 possible values => Amateur, Regular, Veteran, Pro
- Played hours: total number of hours each player has played
- Gender: self reported gender identity out of possible options => Female, Male, Non-binary, Agender, Two-Spirit, other, prefer not to say.
- Age: self reported age value
- "Hashed email" and "Name" are identifier variables; because each value belongs to a single player, they are hard to use as predictors
- Individual ID: NaN column
- Organization Name: NaN column

## Data issues:
- The NaN columns (individual ID, organization name) are to be removed, and the columns used for analysis are to be selected.
- When using KNN, the "age" and "played hours" data are to be standardized.
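
The two cleanup steps listed here could be sketched roughly as follows. This is only an illustration: `sample` is a stand-in built from the first five rows shown in the preview above, the names `cleaned`, `predictors`, and `scaled` are hypothetical, and `StandardScaler` from scikit-learn is an assumed choice for the standardization.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for game_data: the first five rows shown in the preview above.
sample = pd.DataFrame({
    "subscribe": [True, True, False, True, True],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
    "individualId": [np.nan] * 5,
    "organizationName": [np.nan] * 5,
})

# Drop every column in which all values are NaN.
cleaned = sample.dropna(axis="columns", how="all")

# Standardize the two numeric predictors to mean 0 and unit variance.
predictors = cleaned[["played_hours", "age"]]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(predictors),
    columns=predictors.columns,
    index=predictors.index,
)
```

In the actual analysis the scaler would normally be fit on the training split only, so that no information from the test set leaks into the standardization.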

# 2. Question

## Question Chosen:
- What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

## Proposed Question:
- How do age and hours played influence subscription to a game-related newsletter?

## To answer this:
- KNN classification using hours played and age as predictors, to assess how well these characteristics predict subscription.

# 3. Data Analysis

In [121]:
game_tidy = game_data[["subscribe","played_hours","age"]]
game_tidy


Unnamed: 0,subscribe,played_hours,age
0,True,30.3,9
1,True,3.8,17
2,False,0.0,17
3,True,0.7,21
4,True,0.1,21
...,...,...,...
191,True,0.0,17
192,False,0.3,22
193,False,0.0,17
194,False,2.3,17


In [122]:
game_columns = game_data[["subscribe","played_hours","age"]]
game_columns


Unnamed: 0,subscribe,played_hours,age
0,True,30.3,9
1,True,3.8,17
2,False,0.0,17
3,True,0.7,21
4,True,0.1,21
...,...,...,...
191,True,0.0,17
192,False,0.3,22
193,False,0.0,17
194,False,2.3,17


In [123]:
game_columns=game_columns.copy()
game_columns["subscribe"]=game_columns["subscribe"].astype("int")
game_columns


Unnamed: 0,subscribe,played_hours,age
0,1,30.3,9
1,1,3.8,17
2,0,0.0,17
3,1,0.7,21
4,1,0.1,21
...,...,...,...
191,1,0.0,17
192,0,0.3,22
193,0,0.0,17
194,0,2.3,17


In [125]:
hours_age_plot = alt.Chart(game_tidy).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Hours Played"),
    color=alt.Color("subscribe:N")
).properties(title="Hours Played by Age and Subscription Status"
)
hours_age_plot


In [129]:
age_count_plot = alt.Chart(game_tidy).mark_bar().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("count():Q").title("Number Of Players"),
    color=alt.Color("subscribe:N")
).facet(column=alt.Column("subscribe:N").title("Subscription")).properties(
    title="Number of Players subscribed by Age")
age_count_plot


In [137]:
game_percentage=(
    game_tidy.groupby(["subscribe","age"])
    .size()
    .reset_index(name="count")
)
# Divide by the number of rows rather than a hard-coded total
game_percentage["percentage"]=(
    game_percentage["count"]/len(game_tidy)
)

age_percentage_plot = alt.Chart(game_percentage).mark_bar().encode(
    x=alt.X(
        "age",
        bin=alt.Bin(maxbins=30),
        title=("Age")
    ),
    y=alt.Y(
        "percentage:Q", 
        axis=alt.Axis(format="%"),
        title=("Percentage of Total Players")
    ),
    color=alt.Color("subscribe:N").title("Subscription")
).properties(
    title="Percentage of Total Players subscribed by Age"
)

age_percentage_plot


In [134]:
age_sub_plot = alt.Chart(game_tidy).mark_bar().encode(
    x=alt.X(
        "age",
        bin=alt.Bin(maxbins=30),
        title="Age"
    ),
    y=alt.Y("count()").title("Number of players"),
    color=alt.Color("subscribe:N").title("Subscription")
).properties(title="Age Distribution by Subscription")
age_sub_plot


## Insights
- Players aged 15-25 log the most hours of Minecraft. The players with the most played hours are also subscribed.
- Most players are aged 15-25, that is, teenagers and young adults. This age range also contains the largest share of subscribed players.
- Subscribed players aged 15-20 make up about 35% of all players, and subscribed players aged 20-25 make up about 25%. Both figures are percentages of the total player count, not of each age group.
- There are subscribed players aged 90-100, but they make up less than 5% of players.
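
Because the charts use the total player count as the denominator, a within-group subscription rate may make the age groups easier to compare. A minimal sketch, using only the nine rows visible in the preview as a stand-in for `game_tidy` (the 5-year bin edges and the names `preview`, `age_group`, and `rate_by_age` are illustrative assumptions, and nine rows are far too few to draw conclusions from):

```python
import pandas as pd

# Stand-in for game_tidy: only the rows visible in the preview above.
preview = pd.DataFrame({
    "subscribe": [True, True, False, True, True, True, False, False, False],
    "age": [9, 17, 17, 21, 21, 17, 22, 17, 17],
})

# Group ages into 5-year bins: [5, 10), [10, 15), [15, 20), [20, 25).
preview["age_group"] = pd.cut(
    preview["age"], bins=[5, 10, 15, 20, 25], right=False
)

# Per age group: number of players, number subscribed, and subscription rate.
rate_by_age = (
    preview.groupby("age_group", observed=True)["subscribe"]
    .agg(players="count", subscribed="sum", rate="mean")
)
print(rate_by_age)
```

Swapping `preview` for the full `game_tidy` (with bin edges widened to cover every age) would give the rate for each age group in the real data.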

# 4. Methods and Plan
## KNN-Classifier:
    - By using KNN-Classifier, it will read the nearest neighbours to our observation and give us back a classification (subscribed/not subscribed) for our values. It is appropriate because we are trying to figure out player characteristics that contribute to if a player is subscribed to a game-newsletter or not.
    - No assumptions needed for KNN-Classifier because it works for most databases
    - No limitations for this classifier, as both classifying values are numerical and the classification is categorical.
    - KNN-Classifier might be weak as we have a higher percentage of subscribed players than non-subscribed but this may be fixed by increasing the "weight" of the unsubscribed data.
## Plan
- Split the data into training and test sets in a 75/25 ratio.
- Use 5-fold cross-validation (cv=5) with a grid search over 1-25 neighbours, a range that is reasonable to explore with 196 data points.
- Select the best number of neighbours, refit the model on the training data, and evaluate its accuracy and precision on the test set.
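
The plan could be sketched along the following lines. This is an assumed outline rather than the final analysis: `players` is randomly generated stand-in data (not the real dataset), the names `knn_pipeline`, `grid`, `best_k`, and `test_accuracy` are hypothetical, and the stratified split and fixed random seeds are choices made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for game_columns (NOT the real data): 196 random players.
rng = np.random.default_rng(0)
players = pd.DataFrame({
    "played_hours": rng.exponential(scale=5.0, size=196),
    "age": rng.integers(9, 60, size=196),
    "subscribe": rng.integers(0, 2, size=196),
})

X = players[["played_hours", "age"]]
y = players["subscribe"]

# 75/25 split, stratified so both sets keep a similar class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Scaling sits inside the pipeline so it is re-fit on each training fold.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 5-fold cross-validated grid search over 1-25 neighbours.
grid = GridSearchCV(
    knn_pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 26))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)
```

With the real data, `players` would be replaced by `game_columns`. Precision on the test set could be obtained with `sklearn.metrics.precision_score` applied to `grid.predict(X_test)`.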