### Project starting in Jupyter

blah

## TITLE

## INTRODUCTION

- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and fully describe the dataset that was used to answer the question

In [2]:
import pandas as pd
import altair as alt

In [3]:
url2 = "https://raw.githubusercontent.com/agallagh/DSCI-Project/refs/heads/main/players.csv"
players_data = pd.read_csv(url2)
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [4]:
# tidying the data by dropping the unnecessary columns

tidy_players = players_data[["experience", "played_hours", "age"]]
tidy_players

# filtering data in age column for our demographic

filtered_age_df = tidy_players[tidy_players["age"] < 60]
filtered_age_df

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
1,Veteran,3.8,17
2,Veteran,0.0,17
3,Amateur,0.7,21
4,Regular,0.1,21
...,...,...,...
190,Amateur,0.0,20
191,Amateur,0.0,17
192,Veteran,0.3,22
193,Amateur,0.0,17


In [5]:
# creating a scatterplot for our variables and colouring by experience
age_chart = alt.Chart(filtered_age_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience")
).properties(width=700)

age_chart

In [6]:
# creating a bar plot of experience vs played hours

experience_chart = alt.Chart(filtered_age_df).mark_bar().encode(
    x=alt.X("experience").title("Experience"),
    y=alt.Y("played_hours").title("Played Hours")
).properties(width = 500).configure_axisX(labelAngle = -45)

experience_chart

Based on our first scatterplot, played hours vary widely between individuals within each experience level. The few outliers among amateurs and regulars who have played over 150 hours are therefore largely responsible for the differences between the bars in this chart. To make the data more representative of the overall demographic, we filter out those outliers by keeping only players with fewer than 100 played hours.
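
As a rough illustration of why a handful of extreme values matters, the sketch below compares per-level summaries of played hours before and after a 100-hour cutoff. The `demo` table and the `summarize` helper are hypothetical and made up for this example; they are not the project's data or code, and only the column names mirror `experience` and `played_hours`.

```python
import pandas as pd

# Hypothetical toy data (NOT the real players.csv): one extreme Amateur value.
demo = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Amateur", "Pro", "Pro", "Pro"],
    "played_hours": [0.5, 1.5, 160.0, 2.0, 3.0, 4.0],
})

def summarize(df):
    """Total, mean, and median played hours for each experience level."""
    return df.groupby("experience")["played_hours"].agg(["sum", "mean", "median"])

# Summaries with every row, then with the same cutoff used in this report.
before = summarize(demo)
after = summarize(demo[demo["played_hours"] < 100])

print(before)
print(after)
```

In this toy table the single 160-hour player moves the Amateur mean far more than the median, which is the kind of effect the cutoff is meant to limit.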

In [7]:
# filtering the played hours to be less than 100

filtered_hrs_df = filtered_age_df[filtered_age_df["played_hours"] < 100] 

In [8]:
filtered_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 500)



# Facet by experience to make the visualization clearer.
facetted_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 200).facet("experience", columns = 5)

facetted_chart

## METHODS AND RESULTS

- describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
- your report should include code which:
    - loads data 
    - wrangles and cleans the data to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis 
note: all figures should have a figure number and a legend

In [15]:
### Run this cell before continuing.
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [17]:
np.random.seed(2020) 

#Train model

# split the train/test data: 75% train, 25% test
player_train, player_test = train_test_split(filtered_hrs_df, train_size = 0.75, random_state = 31)

# make X/y objects
X_train = player_train[['played_hours', 'age']]
y_train = player_train['experience']


# make a preprocessor that standardizes the hours played and age columns
player_preprocessor = make_column_transformer(
    (StandardScaler(), ['played_hours', 'age']),
    remainder='passthrough',
    verbose_feature_names_out=False
)

#fit the data to the

#knn_spec = KNeighborsClassifier(n_neighbors = n, random_seed = 2000)

In [18]:
# combine the preprocessor and the classifier in one pipeline so that the
# features are standardized before distances are computed
knn_pipeline = make_pipeline(player_preprocessor, KNeighborsClassifier())

param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 16, 1),
}

knn_tune_grid = GridSearchCV(
    knn_pipeline, param_grid, return_train_score=True, n_jobs=-1, cv=5
)

knn_model_grid = knn_tune_grid.fit(X_train, y_train)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)

cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of neighbours").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean test score").scale(zero=False)
)

cross_val_plot
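
One possible next step after tuning is to read off the best number of neighbours and check accuracy on the held-out test split. The following is a minimal, self-contained sketch of that idea, not the project's analysis. It assumes a scaled K-nearest-neighbours pipeline, and the `demo` table is randomly generated stand-in data with random labels, so the printed accuracy carries no meaning about the real players.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the players table.
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "played_hours": rng.exponential(scale=5.0, size=n),
    "age": rng.integers(9, 60, size=n),
    "experience": rng.choice(["Amateur", "Regular", "Pro"], size=n),
})

train, test = train_test_split(demo, train_size=0.75, random_state=31)
features = ["played_hours", "age"]

# Scaling lives inside the pipeline, so each cross-validation fold is
# standardized using only its own training portion.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": range(2, 16)}, cv=5)
grid.fit(train[features], train["experience"])

# GridSearchCV refits the best pipeline on all training rows by default,
# so score() evaluates that refitted model on the unseen test rows.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(test[features], test["experience"])
print(best_k, round(test_accuracy, 3))
```

Keeping the test split out of the tuning step means the final accuracy is measured on rows the model never saw while choosing the number of neighbours.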

## DISCUSSION

- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact such findings could have
- discuss what future questions this could lead to

**references:** You may include references if necessary, as long as they all have a consistent citation style.