## DSCI 100 Group Project - PLAICRAFT Data Study
### Members: Justin Galimpin (59053306), Alexis Kuerbig (15606007), Arjun Sharma (61155750), Ahmad Khattab (90009473)

**Introduction**

The `players.csv` dataset has one row for each individual who has taken part in the **PLAI** research study, which collects data about how people play video games. The dataset has 196 observations and the following variables:

* `experience`: Object value that indicates the player's experience level with Minecraft: Beginner, Amateur, Regular, Veteran, or Pro.
* `subscribe`: Boolean value that indicates whether the player is subscribed to the mailing list.
* `hashedEmail`: Object value that indicates the hashed email of the player.
* `played_hours`: Float value that indicates the total number of hours a player has spent playing the game on this server.
* `name`: Object value that indicates the player's name.
* `gender`: Object value that indicates the player's gender.
* `age`: Int value that indicates the player's age.
* `individualId`: Unique Player ID (left NaN)
* `organizationName`: Player Organization (left NaN)
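
The column types listed above can be checked directly once the file is loaded. The sketch below is only an illustration: it uses a small hypothetical frame, `toy_players`, with a few of the same column names and made-up rows (not taken from `players.csv`), and reports each column's dtype and number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv: same column names, made-up rows.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "subscribe": [True, False, True],
    "played_hours": [2.5, 0.0, 1.5],
    "age": [20, 30, 40],
    "individualId": [np.nan, np.nan, np.nan],
})

# One row per column: its dtype and how many of its values are missing.
schema = pd.DataFrame({
    "dtype": toy_players.dtypes.astype(str),
    "n_missing": toy_players.isna().sum(),
})
print(schema)
```

Running the same two calls (`dtypes` and `isna().sum()`) on the real `players` frame would show whether columns such as `individualId` are entirely missing.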
  
**Our Question: Can we predict the total number of hours a player contributes to the study from their age and/or their experience level?**

**Response Variable:** `played_hours`

**Predictor Variables:** `experience`, `age`

Answering this question will give the research group a clear idea of which demographic may be able to contribute the most to their study. For example, the results may show that players with X type of experience contribute far more hours than those with Y type of experience, or that players with Z type of experience rarely contribute at all. The key variables to help us answer this question are `age`, `experience`, and `played_hours` in `players.csv`.
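
A quick way to see whether one experience group contributes more than another is a grouped summary of `played_hours`. The following is a minimal sketch on a hypothetical frame, `toy`, with made-up values; it is not a result from the study data.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from players.csv.
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur", "Veteran"],
    "played_hours": [10.0, 20.0, 1.0, 3.0, 0.0],
})

# Count, mean, and median hours per experience level, largest mean first.
summary = (
    toy.groupby("experience")["played_hours"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Reporting the median alongside the mean is a deliberate choice here: a few players with very large totals can pull a group's mean well above what a typical player contributes.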

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

**Method and Results**

In [2]:
players = pd.read_csv("players.csv")
sessions = pd.read_csv("sessions.csv")

In [4]:
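# Encode the ordered experience levels as numbers (Beginner = 1 ... Pro = 5)
# so they can be standardized and used as a numeric predictor
players["experience"] = players["experience"].replace({
    "Beginner": 1.0,
    "Amateur": 2.0,
    "Regular": 3.0,
    "Veteran": 4.0,
    "Pro": 5.0,
})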


# Split the data into training (80%) and test (20%) sets
players_training, players_testing = train_test_split(
    players,
    test_size=0.20,
    random_state=2000,
)
X_train = players_training[["experience", "age"]]
y_train = players_training["played_hours"]

X_test = players_testing[["experience", "age"]]
y_test = players_testing["played_hours"]

# Build the KNN regression pipeline: standardize the predictors, then fit K-nearest neighbors
players_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

players_cv = pd.DataFrame(
    cross_validate(
        players_pipe,
        X_train,
        y_train,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
        cv=5
    )
)
players_cv

# Tune the number of neighbors (k = 1 to 99) with 5-fold cross-validation
np.random.seed(101)
param_grid = {"kneighborsregressor__n_neighbors": range(1, 100)}
players_tuned = GridSearchCV(
    players_pipe,
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring="neg_root_mean_squared_error",
)
players_results = pd.DataFrame(players_tuned.fit(X_train, y_train).cv_results_)
players_results


# Best k, chosen by cross-validation (the k with the lowest RMSPE)
players_min = players_tuned.best_params_

# Cross-validated RMSPE for the best k (the score is negated, so flip its sign)
playersbest_RMSE = -players_tuned.best_score_

players_min
playersbest_RMSE

# Predict on the test set (X_test); by default GridSearchCV refits the best
# pipeline on the full training set, so predict() uses that refit model
np.random.seed(1234)
players_prediction = players_tuned.predict(X_test)



# Compute the test-set RMSPE by comparing y_test with the model's predictions
players_summary = mean_squared_error(y_test, players_prediction) ** (1/2)

players_prediction
players_summary

np.float64(25.362170401998327)
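
An RMSPE value is easier to interpret next to a simple reference point. As a minimal sketch (not part of the analysis above), the helper below computes RMSPE by hand and evaluates a baseline that always predicts the training mean. The names `rmspe`, `toy_y_train`, and `toy_y_test` are hypothetical, and the values are made up for illustration.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared prediction error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up training and test responses, for illustration only.
toy_y_train = np.array([0.0, 1.0, 2.0, 5.0])
toy_y_test = np.array([1.0, 3.0])

# Baseline: ignore the predictors and always predict the training mean.
baseline_pred = np.full(toy_y_test.shape, toy_y_train.mean())
baseline_rmspe = rmspe(toy_y_test, baseline_pred)
print(baseline_rmspe)
```

Applied to the real data, the same idea would use `y_train.mean()` as the constant prediction for `y_test`; a model whose test RMSPE is not clearly below this baseline is adding little beyond the average.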

In [None]:
# Linear regression model building
lm = LinearRegression()
lm_fit = lm.fit(X_train, y_train)

# Attach the model's predictions to the training and test sets
players_preds = players_training.assign(predictions=lm.predict(X_train))
test_preds = players_testing.assign(predictions=lm.predict(X_test))

# Test-set RMSPE for the linear regression model
lm_rmspe = mean_squared_error(y_test, test_preds["predictions"]) ** (1/2)
lm_rmspe
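
Beyond its RMSPE, a fitted linear model can be read through its coefficients. The sketch below is an illustration on made-up data, not the study's result: the response in `toy_y` is constructed as exactly `1 + 2 * experience + 0.5 * age`, so the fit should recover those numbers. The names `toy_X`, `toy_y`, and `toy_lm` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up predictors, for illustration only.
toy_X = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [20.0, 18.0, 25.0, 30.0, 22.0],
})
# Response built from a known linear rule so the fit can be checked.
toy_y = 1.0 + 2.0 * toy_X["experience"] + 0.5 * toy_X["age"]

toy_lm = LinearRegression().fit(toy_X, toy_y)

# Label each slope with its predictor name for easier reading.
coefs = pd.Series(toy_lm.coef_, index=toy_X.columns)
print(coefs)
print(toy_lm.intercept_)
```

The same two attributes, `coef_` and `intercept_`, are available on the fitted `lm` above; each slope is the predicted change in `played_hours` for a one-unit increase in that predictor with the other held fixed.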

**Discussion**

**References (If Any)**
