In [1]:
# Import libraries
import pandas as pd 
import numpy as np 
import altair as alt 

<u>**Data Description**</u>

In the table below, there are 9 columns and 196 rows.
The variables describe the following: 

1. `experience` is a categorical variable giving the player's experience level, where `Veteran` is the highest, followed by `Pro`, `Regular`, `Amateur` and `Beginner`.

2. `subscribe` is a Boolean variable indicating whether the player has subscribed to the game's newsletter.

3. `hashedEmail` is an ID variable that refers to the player's email.

4. `played_hours` is a quantitative variable that gives the number of hours the player has played the game.

5. `name` refers to the name of the player.

6. `gender` is the player's gender, and `age` is the player's age.

In [16]:
# Load the players data from its URL
players_url = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(players_url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


The data is tidy, but some of the columns do not provide the information we need:
* `individualId` and `organizationName` contain no data (every value is NaN), so they would not be useful in answering our question.
* `hashedEmail` and `name` identify the players, but they are neither relevant nor useful for answering a possible question, because their values carry no interpretable information.
* The ranking implied by `experience` is unclear, as we do not know whether a "Veteran" counts as more experienced than a "Pro" in this game, so we might have to assume an ordering.
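
The claim that some columns hold no data can be checked directly. The following is a minimal sketch on a hypothetical stand-in table (`toy_players`, with made-up rows that only mirror some of the column names above); the same two lines would apply to `players` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `players` table (made-up rows, not the real data).
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular"],
    "subscribe": [True, True, False, True],
    "played_hours": [30.3, 3.8, 0.0, 0.1],
    "individualId": [np.nan] * 4,
    "organizationName": [np.nan] * 4,
})

# Fraction of missing values per column; 1.0 means the column is entirely NaN.
missing_fraction = toy_players.isna().mean()

# Columns with no data at all are candidates for dropping.
empty_columns = missing_fraction[missing_fraction == 1.0].index.tolist()
print(empty_columns)
```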

<u>**Research question**</u>

_Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?_

My proposed specific question, inspired by the question above, is as follows:

**Can we predict the likelihood of a player subscribing to the game newsletter based on their hours played and their experience level?**

* K-nearest neighbours (KNN) classification could be used to predict the likelihood of a player subscribing to the newsletter based on their hours played and experience level.
  - It is expected that players who have higher experience levels or more hours in the game might be more inclined to subscribe to the newsletter.
* `experience` will be used to rank players into different categories, but because it is ordinal, it will be converted to numbers so that the following labels can be used in KNN classification:
    1. `Beginner` will be labelled as 1
    2. `Amateur` will be labelled as 2 
    3. `Regular` will be labelled as 3
    4. `Pro` will be labelled as 4
    5. `Veteran` will be labelled as 5
* `subscribe` will be used to show whether these players have subscribed or not.
* `played_hours` will be used to compare the number of hours spent on the game by each player.

<u>**Exploratory Data Analysis and Visualization**</u>

First, we will drop the columns `individualId`, `organizationName`, `hashedEmail`, `gender`, `age` and `name`, as these are not relevant.

In [17]:
players.drop(columns=["individualId", "organizationName", "hashedEmail", "gender", "name", "age"], inplace=True)
players

Unnamed: 0,experience,subscribe,played_hours
0,Pro,True,30.3
1,Veteran,True,3.8
2,Veteran,False,0.0
3,Amateur,True,0.7
4,Regular,True,0.1
...,...,...,...
191,Amateur,True,0.0
192,Veteran,False,0.3
193,Amateur,False,0.0
194,Amateur,False,2.3


Then we plot a graph to illustrate the relationship between the hours played and the experience level, colour-coded by whether or not the players have subscribed to the newsletter.

In [18]:
players_plot = alt.Chart(players).mark_bar(opacity=0.5).encode(
    x = alt.X("experience").title("Player Experience Level"),
    y = alt.Y("played_hours").title("Time Played (hours)"),
    color = alt.Color("subscribe").title("Subscription to Newsletter"),
).properties(
    title=["A Bar Graph Representing the relation of Time Played in hours vs. Player", 
          "Experience Level with the Subscription status indicated by colour"]
)
players_plot

The bar graph indicates that most players, regardless of experience level, have subscribed to the newsletter. This imbalance might bias the results if KNN classification is used, as well over 90% of the values belong to players who have subscribed to the newsletter.
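
Because the bars stack hours played rather than count players, a direct count of each class may be a useful complement to the chart. The sketch below uses a hypothetical four-row stand-in (`toy_players`, made-up values); applied to `players["subscribe"]`, the same call would give the actual share of subscribers.

```python
import pandas as pd

# Hypothetical stand-in for the `subscribe` column (made-up values).
toy_players = pd.DataFrame({"subscribe": [True, True, True, False]})

# Share of each class; a heavily lopsided split signals class imbalance.
class_share = toy_players["subscribe"].value_counts(normalize=True)
print(class_share)
```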

To analyze further, assigning numerical values to the experience levels will aid in KNN classification, where (as stated previously):
1. `Beginner` will be labelled as 1
2. `Amateur` will be labelled as 2
3. `Regular` will be labelled as 3
4. `Pro` will be labelled as 4
5. `Veteran` will be labelled as 5

Now `experience_number` will be used for classification instead of `experience`.

In [35]:
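# Map each ordinal experience level to its numeric rank.
# (`map` avoids the downcasting warning that `replace` can raise here.)
players["experience_number"] = players["experience"].map({
    "Beginner": 1,
    "Amateur": 2,
    "Regular": 3,
    "Pro": 4,
    "Veteran": 5
}).astype(int)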

players[["experience", "experience_number"]].head(10)



Unnamed: 0,experience,experience_number
0,Pro,4
1,Veteran,5
2,Veteran,5
3,Amateur,2
4,Regular,3
5,Amateur,2
6,Regular,3
7,Amateur,2
8,Amateur,2
9,Veteran,5


<u>**Methods and Plan**</u>

* A possible method: `K-NN Classification`
  - KNN classification is suitable because we are trying to classify whether players are likely to subscribe or not, based on experience level and hours played.
* Few assumptions are needed, as KNN classification makes few assumptions about what the data should look like.
* Limitations: KNN classification does not work very well on imbalanced classes. In this case, more players are subscribed to the newsletter than not, which can result in inaccurate predictions. This can be addressed by re-weighting the data.
* The data will be split into training and testing sets in an 80/20 ratio.
* `GridSearch` will be used with `n_neighbors` values from 1 to 80 and 5-fold cross-validation to find the best `n_neighbors`.
* The model will then be fitted with the best `n_neighbors` and applied to the testing set to calculate the accuracy, recall and precision.
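
The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes scikit-learn's `GridSearchCV` and `KNeighborsClassifier` are the intended tools, it runs on synthetic stand-in data (`toy`, randomly generated, not the real players table), and the `stratify` and `random_state` arguments are choices made here rather than part of the plan.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the real players table): 200 made-up players.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "experience_number": rng.integers(1, 6, size=200),
    "played_hours": rng.exponential(5.0, size=200).round(1),
    "subscribe": rng.random(200) < 0.7,
})

X = toy[["experience_number", "played_hours"]]
y = toy["subscribe"]

# 80/20 split; stratifying on y is an assumption to keep class shares similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold cross-validated search over n_neighbors = 1..80.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 81))},
    cv=5,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training set by default.
y_pred = search.predict(X_test)
print("best n_neighbors:", search.best_params_["n_neighbors"])
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:", recall_score(y_test, y_pred, zero_division=0))
```

On the real data, `toy` would be replaced by `players` with the `experience_number` column created earlier. The scores printed for the synthetic data carry no meaning for the actual research question.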
