# Project Individual Proposal
## Sumedha Ray
### 87860987

In [11]:
import pandas as pd
import altair as alt

My proposal uses the 196 x 9 “players” dataset, collected by a Minecraft server that asks players for identifying information and records their activity. The variables, named as they appear in the data, are as follows:

- "experience" (categorical): possibly the player's level of familiarity with the game.
- "subscribe" (categorical): whether the player is subscribed to the game.
- "hashedEmail" (categorical): used for identification (unusable).
- "played_hours" (quantitative): the total amount of time the player has spent on the game over the study period.
- "name" (categorical): used as individual player identification.
- "gender" (categorical): the player's gender identity.
- "age" (quantitative): the player's age in years.
- "individualId": used for identification (private).
- "organizationName": used for identification (private).

The timeframe for “played hours” is unclear. I assumed it is the total hours played over the study period, because some values (e.g. 48 h) are unrealistic for a single sitting.

“Experience” has no specified criteria and is subjective. Specifying a quantifiable criterion to define “experience” would make it comparable to other variables.

My proposal counts the observations within each age range and gender (the explanatory variables) and plots these counts (the response variable) to determine which player age and gender contribute the most data points. It addresses the question: Which gender and age group account for most of the data points in this study?

In [12]:
players_data = pd.read_csv('data/players.csv')
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


The data is already tidy, so no reshaping is needed. To prepare it for analysis, I extract the 'gender' and 'age' columns and use bins and labels to create age ranges in a new column, 'age_group'. The dataframe displayed below contains only the variables 'gender' and 'age_group'.

In [17]:
players_cleaned = players_data[['gender','age']].copy()

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

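# Intervals are closed on the right (the pd.cut default) so that they match the labels,
# e.g. '11-20' covers ages 11 to 20; include_lowest=True keeps an age of exactly 0 in the first bin.
players_cleaned['age_group'] = pd.cut(players_cleaned['age'], bins=bins, labels=labels, include_lowest=True)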

players_cleaned[['gender','age_group']]

Unnamed: 0,gender,age_group
0,Male,0-10
1,Male,11-20
2,Male,11-20
3,Female,21-30
4,Male,21-30
...,...,...
191,Female,11-20
192,Male,21-30
193,Prefer not to say,11-20
194,Male,11-20
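
Because labels such as '11-20' imply that each range includes its upper bound, it is worth checking how `pd.cut` treats ages that fall exactly on a bin edge. The sketch below is only an illustration: the ages are hypothetical boundary values, not rows from the dataset, and only the first three bins are used.

```python
import pandas as pd

# Hypothetical boundary ages, chosen only to show where each one lands.
ages = pd.Series([9, 10, 11, 20, 21])

bins = [0, 10, 20, 30]
labels = ['0-10', '11-20', '21-30']

# right=True (the default): intervals are (0, 10], (10, 20], (20, 30];
# include_lowest=True also admits an age of exactly 0.
right_closed = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# right=False: intervals are [0, 10), [10, 20), [20, 30).
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'age': ages,
    'right_closed': right_closed.astype(str),
    'left_closed': left_closed.astype(str),
})
print(comparison)
```

With right-closed intervals, ages 10 and 20 fall in '0-10' and '11-20', which agrees with the labels. With `right=False`, they move up to '11-20' and '21-30'.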


The following plots show that, considered separately, males (first plot) and players aged 11 to 30 (second plot) account for the most data points. These plots serve as a point of comparison for the final result.

In [35]:
gender_counts = players_data['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'count']

gender_plot = alt.Chart(gender_counts).mark_bar().encode(
    x=alt.X('count:Q').title('Number of Data Points'),
    y=alt.Y('gender:N').title('Gender')
).properties(
    title='Number of Data Points for Each Gender'
)

gender_plot

In [36]:
age_counts = players_cleaned['age_group'].value_counts().reset_index()
age_counts.columns = ['age_group', 'count']

age_plot = alt.Chart(age_counts).mark_bar().encode(
    x=alt.X('count:Q').title('Number of Data Points'),
    y=alt.Y('age_group:N').title('Age Range (years)')
).properties(
    title='Number of Data Points for Each Age Group'
)

age_plot

My proposal uses the .groupby() method to count the data points for each gender and age-group combination. I would then sort the counts in descending order with .sort_values(), calculate 0.25 * len(dataframe), and pass that number to .head() to extract the top 25% of combinations. I would plot the results as a bar chart with age range on the x-axis, count on the y-axis, and bars coloured by gender.
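The counting and top-25% steps could be sketched as follows. This is a minimal sketch, not the final implementation: the `sample` rows are illustrative and not taken from players.csv, the names `combo_counts`, `n_top`, and `top_combos` are placeholders, and rounding the cutoff up with `math.ceil` is my own assumption.

```python
import math
import pandas as pd

# Illustrative rows only (not taken from players.csv).
sample = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'age_group': ['11-20', '11-20', '21-30', '11-20', '21-30', '11-20', '31-40', '0-10'],
})

# Count rows for each gender/age-group combination, largest first.
combo_counts = (
    sample.groupby(['gender', 'age_group'])
    .size()
    .reset_index(name='count')
    .sort_values('count', ascending=False)
)

# Keep the top 25% of combinations (rounded up, an assumption,
# so that at least one combination always survives).
n_top = math.ceil(0.25 * len(combo_counts))
top_combos = combo_counts.head(n_top)
print(top_combos)
```

On the real data, 'age_group' comes from `pd.cut` and is categorical, so passing `observed=True` to `.groupby()` would keep combinations that never occur out of the counts. Ties at the cutoff are broken arbitrarily by `.head()`, which is worth keeping in mind when interpreting the final chart.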

This method is straightforward and allows simultaneous analysis of both gender and age. Keeping only the top 25% of counts filters out the sparse combinations so the important information is easy to distinguish. A bar chart as the final visualization makes it easy to compare different gender-age combinations.

Although this method is limited by not taking experience, subscription status, or played hours into account, it targets the most fundamental audience identifiers in this dataset, gender and age, which are consistently defined and objective. The gender and age distribution can help the researchers shape marketing strategies and target underrepresented groups.

A visualization that also includes the other variables would likely be difficult to read. Additionally, splitting the data by too many variables leaves very few players in each combination, which would obscure any trend. We would need a much larger sample size to effectively include five variables.
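The sample-size concern can be made concrete with rough arithmetic. The sketch below assumes the number of levels per variable; the experience and gender levels are counted only from the rows visible in the preview above, so the true counts may be higher. Played hours is left out because it is continuous and would need binning, which would multiply the number of combinations further.

```python
import math

n_players = 196  # rows reported for the players dataset

# Assumed number of levels per variable. Experience and gender values are
# counted only from the rows visible in the preview, so they may be too low.
levels = {
    'age_group': 10,   # ten bins of ten years each
    'gender': 3,       # Male, Female, Prefer not to say
    'experience': 4,   # Pro, Veteran, Amateur, Regular
    'subscribe': 2,    # True, False
}

# Number of possible combinations with two variables versus four.
two_var_cells = levels['age_group'] * levels['gender']
four_var_cells = math.prod(levels.values())

print(two_var_cells, n_players / two_var_cells)
print(four_var_cells, n_players / four_var_cells)
```

Under these assumptions, gender and age alone give 30 possible combinations, while adding experience and subscription status gives 240, which is more combinations than there are players.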

Overall, my method is easy to implement and easy to read, and it covers the fundamental categorization of the players.