## TITLE: 
Predicting the experience level of Plaicraft participants under 60 years old from their age and hours played, to determine which experience levels are most likely to log the most playing hours.
We will use age and played hours as predictor variables to determine which experience levels of players under 60 years old are most likely to contribute large amounts of data.
What experience level will players under 60 be predicted to have, based on their age and number of played hours?

## METHODS

In [38]:
import pandas as pd
import altair as alt

In [39]:
url2 = "https://raw.githubusercontent.com/agallagh/DSCI-Project/refs/heads/main/players.csv"
players_data = pd.read_csv(url2)
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


Because our question targets players under 60, we need to filter the data to players under that age and drop the columns that are irrelevant to our analysis. That means dropping every column except experience, played hours, and age.

In [40]:
# tidying the data by keeping only the columns we need

tidy_players = players_data[["experience", "played_hours", "age"]]
tidy_players

# filtering data in age column for our demographic

filtered_age_df = tidy_players[tidy_players["age"] < 60]
filtered_age_df

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
1,Veteran,3.8,17
2,Veteran,0.0,17
3,Amateur,0.7,21
4,Regular,0.1,21
...,...,...,...
190,Amateur,0.0,20
191,Amateur,0.0,17
192,Veteran,0.3,22
193,Amateur,0.0,17


In [41]:
# creating a scatterplot for our variables and colouring by experience
age_chart = alt.Chart(filtered_age_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience")
).properties(width=700)

age_chart

#### Figure 1.
This scatterplot shows hours played against age for study participants under age 60. The individual points are colour coded by experience level.

In [42]:
# creating a bar plot of experience vs played hours

experience_chart = alt.Chart(filtered_age_df).mark_bar().encode(
    x=alt.X("experience").title("Experience"),
    y=alt.Y("sum(played_hours)").title("Total Played Hours")
).properties(width = 500).configure_axisX(labelAngle = -45)

experience_chart

#### Figure 2.
This bar graph shows the total played hours for each experience level, summed across all individuals.

Both Figure 1 and Figure 2 are needed to fully understand the data. On its own, the bar graph suggests that amateurs and regulars are most likely to contribute large amounts of played hours, which is what our question asks. However, the scatterplot shows that the bar plot is heavily influenced by a few outlier individuals: the amateurs and regulars who have played over 150 hours are largely responsible for the prediction we would be making. To make the dataset more representative of the overall demographic, we will filter out those outliers by keeping only rows where played hours is less than 100.
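
To illustrate why a few extreme values can dominate a bar chart of totals, here is a minimal sketch on made-up numbers. The `toy` DataFrame and its values are hypothetical and are not taken from `players.csv`; only the column names and the 100-hour cutoff mirror the analysis above.

```python
import pandas as pd

# Hypothetical toy data (NOT the real players.csv values).
toy = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Amateur", "Regular", "Regular", "Pro", "Pro"],
    "played_hours": [0.5, 1.0, 160.0, 2.0, 3.0, 4.0, 6.0],
})

# Total hours per experience level with every player included.
totals_all = toy.groupby("experience")["played_hours"].sum()

# The same totals after keeping only players with fewer than 100 hours.
under_100 = toy[toy["played_hours"] < 100]
totals_filtered = under_100.groupby("experience")["played_hours"].sum()

# Side-by-side comparison of the two totals.
comparison = pd.DataFrame({
    "all_players": totals_all,
    "under_100_hours": totals_filtered,
})
print(comparison)
```

In this toy data, Amateur has the largest total when all players are included, but Pro has the largest total once the single 160-hour player is removed, which is the kind of shift the filter is meant to guard against.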

In [43]:
# filtering the played hours to be less than 100

filtered_hrs_df = filtered_age_df[filtered_age_df["played_hours"] < 100] 

In [44]:
filtered_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 500)



# Facet by experience to make the visualization clearer.
facetted_chart = alt.Chart(filtered_hrs_df).mark_point().encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played Hours"),
    color=alt.Color("experience").title("Experience Level")
).properties(width = 200).facet("experience", columns = 5)

facetted_chart

#### Figure 3.
This faceted plot shows the filtered data for the individuals in each experience level. Each coloured point represents an individual.

Figure 3 shows that each experience level spans roughly the same age range, with most points clustered between ages 15 and 30. Each experience level also has a few outliers with high played hours, the amateur category having the most. In every panel, typical played hours are under 10. Even before doing the analysis, we can see no clear distinction between the groups and no clear relationship between age, experience level, and played hours, as each group has its own outliers. We therefore expect that the analysis may have low accuracy and may not perform well when predicting the test data or any new real data.
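
The visual reading of Figure 3 could be checked numerically with a per-group summary. The sketch below is one possible way to do that, not part of the original analysis. It uses a hypothetical `toy` DataFrame in place of `filtered_hrs_df`, and the summary column names (`min_age`, `median_hours`, and so on) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for filtered_hrs_df.
toy = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Pro", "Pro", "Veteran", "Veteran"],
    "age": [17, 25, 9, 22, 17, 30],
    "played_hours": [0.0, 48.0, 30.3, 0.2, 3.8, 0.3],
})

# One row per experience level: age range, typical hours, largest value,
# and the share of players with fewer than 10 played hours.
summary = toy.groupby("experience").agg(
    min_age=("age", "min"),
    max_age=("age", "max"),
    median_hours=("played_hours", "median"),
    max_hours=("played_hours", "max"),
    share_under_10=("played_hours", lambda s: (s < 10).mean()),
)
print(summary)
```

Applied to the real filtered data, similar age ranges and low median hours across all rows would support the observation that age and played hours do not separate the experience levels well.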