1. Background of Project
2. Hypotheses
3. Summary of Findings and Insights
4. Data Acquisition and Prep
5. Exploratory Data Analysis
6. Feature Engineering
7. Modeling
8. Evaluation

Notes:
- Image use footnote
    - PBS KIDS and the PBS KIDS Logo are registered trademarks of PBS. Used with permission. The contents of PBS KIDS Measure Up! were developed under a grant from the Department of Education. However, those contents do not necessarily represent the policy of the Department of Education, and you should not assume endorsement by the Federal Government. The app is funded by a Ready To Learn grant (PR/AWARD No. U295A150003, CFDA No. 84.295A) provided by the Department of Education to the Corporation for Public Broadcasting.

# Child Learning and Development with PBS KIDS Measure Up! App

![title](mu_image.jpeg)

## I. BACKGROUND

### MEASURE UP! APP  
The PBS KIDS Measure Up! app is designed to help children ages 3-5 develop their comprehension of early STEM concepts. The app takes users on an adventure through three worlds:
   - **Magma Peak** focuses on capacity and displacement.
   - **Crystal Caves** focuses on weight.
   - **Treetop City** focuses on length and height.

### 2019 KAGGLE DATA SCIENCE BOWL COMPETITION PRESENTED BY PBS KIDS & BOOZ ALLEN HAMILTON
This year's competition focuses on early childhood education through multimedia learning. PBS KIDS provided anonymized Measure Up! gameplay data. Each submission must provide a model that predicts the accuracy group of users.

The accuracy group is based on the number of attempts a user makes before completing the "challenge" (a.k.a. assessment). Based on the user's performance on each completed assessment, an accuracy group is assigned:
 - 3 - Successfully completed the assessment on 1 attempt
 - 2 - Completed assessment on 2 attempts
 - 1 - Completed assessment on 3 attempts
 - 0 - Completed assessment on more than 3 attempts
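
The mapping above can be written as a small function. This is a minimal sketch: `accuracy_group_from_attempts` is a hypothetical helper that does not appear in the competition data or elsewhere in this notebook, and it assumes `attempts` counts every submission up to and including the one that completes the assessment.

```python
def accuracy_group_from_attempts(attempts: int) -> int:
    """Map the number of attempts needed to complete an assessment to an accuracy group.

    Hypothetical helper illustrating the rules listed above.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if attempts == 1:
        return 3  # completed on the first attempt
    if attempts == 2:
        return 2
    if attempts == 3:
        return 1
    return 0      # more than 3 attempts


for n in (1, 2, 3, 4, 7):
    print(n, "attempt(s) -> accuracy group", accuracy_group_from_attempts(n))
```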

The model will help PBS KIDS improve game design and discover relationships between game engagement and learning processes.

### THE PROJECT
The project is inspired by 2019's Kaggle Data Science Bowl. By analyzing data publicly provided by PBS KIDS on Kaggle, our team has sought to identify and understand drivers of children’s success on assessments (a.k.a. "challenges") in the app. The insights inform teachers and parents about how well the app’s different activities prepare 3-to-5-year-old children for its final assessments.

A presentation documenting our findings and recommendations is delivered on Jan 30, 2020.

### PROBLEM STATEMENTS & HYPOTHESES:

**Problem 1:** What are the drivers of users' assessment accuracy?  
**Hypothesis:** Users will show improvement with more engagement.
  
**Problem 2:** Does one learning path prepare users for assessments better than the other? The two paths are linear progression and random (users choose activities at will).  
**Hypothesis:** There is no difference in performance between users who followed a linear progression and those who followed a random learning path.

> The stages of linear progression and their corresponding in-game activities are as follows:  
**Exposure** (video clip) → **Exploration** (activity) → **Practice** (game) → **Demonstration** (assessment)
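
One way to make the linear path testable is sketched below. Everything here is an assumption for illustration: `STAGE_ORDER` and `is_linear_path` are hypothetical names, the stage labels are assumed to match the `type` values in the data (`Clip`, `Activity`, `Game`, `Assessment`), and "linear" is taken to mean that a user visits all four stages and never steps back to an earlier one. The analysis itself may define the two paths differently.

```python
# Hypothetical ranking of the four stages, in the order shown above.
STAGE_ORDER = {"Clip": 0, "Activity": 1, "Game": 2, "Assessment": 3}


def is_linear_path(session_types):
    """Return True if the sequence covers all four stages without stepping backwards."""
    ranks = [STAGE_ORDER[t] for t in session_types]
    covers_all_stages = set(ranks) == set(STAGE_ORDER.values())
    never_steps_back = ranks == sorted(ranks)
    return covers_all_stages and never_steps_back


print(is_linear_path(["Clip", "Activity", "Game", "Assessment"]))
print(is_linear_path(["Game", "Clip", "Assessment"]))
```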

### SUMMARY OF FINDINGS AND INSIGHTS

### CONCLUSION AND RECOMMENDATION




---

## II. MODULES & LIBRARIES

In [1]:
import pandas as pd
import numpy as np

#viz
import matplotlib.pyplot as plt
import seaborn as sns

#feature engineering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV

#modeling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import warnings
warnings.filterwarnings("ignore")

---
## III. DATA ACQUISITION & PREP

- Get csv files from: https://www.kaggle.com/c/data-science-bowl-2019/data
    - train.csv
    - train_labels.csv
    - test.csv

In [2]:
train = pd.read_csv('train.csv')

In [3]:
train_labels = pd.read_csv('train_labels.csv')

Look at the shape of the train and train_labels dataframes.

In [4]:
train.shape

(11341042, 11)

In [5]:
train_labels.shape

(17690, 7)

Look at how many installation_ids are in each dataframe.

In [6]:
train.installation_id.nunique()

17000

In [7]:
train_labels.installation_id.nunique()

3614

How many installation_ids appear in both dataframes?

In [8]:
train[train.installation_id.isin(train_labels.installation_id.unique())].installation_id.nunique()

3614

Filter train down to the installation_ids that appear in train_labels. The target variable 'accuracy_group' is matched to these rows in a later step by merging on game_session.

In [9]:
df = train[train.installation_id.isin(train_labels.installation_id.unique())]

In [10]:
# Look at the shape of the new df.
df.shape

(7734558, 11)

In [11]:
11341042 - 7734558

3606484

We reduced our df by 3.6 million rows!!

Look at how many unique installation_ids are in the new df.

In [12]:
df.installation_id.nunique()

3614

What fraction of all users have at least one labeled assessment?

In [13]:
train_labels.installation_id.nunique()/train.installation_id.nunique() 

0.21258823529411763

In [14]:
df1 = pd.merge(df, train_labels, on = 'game_session', how = 'left')

In [15]:
df1.shape

(7734558, 17)
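
The unchanged row count suggests that each game_session matches at most one label row. The sketch below, on made-up toy frames, shows how pandas can enforce that with `validate="many_to_one"`, and how dropping the duplicate key column from the label frame before merging avoids `_x`/`_y` suffixes. The frame names and values are illustrative only.

```python
import pandas as pd

# Toy event log: several rows per game_session.
events = pd.DataFrame({
    "game_session": ["s1", "s1", "s2", "s3"],
    "installation_id": ["u1", "u1", "u1", "u2"],
    "event_code": [2000, 4100, 2000, 2000],
})

# Toy labels: at most one row per game_session.
labels = pd.DataFrame({
    "game_session": ["s1", "s3"],
    "installation_id": ["u1", "u2"],
    "accuracy_group": [3, 0],
})

merged = pd.merge(
    events,
    labels.drop(columns="installation_id"),  # avoid installation_id_x / installation_id_y
    on="game_session",
    how="left",
    validate="many_to_one",  # raises MergeError if a game_session repeats in labels
)

print(merged.shape)
print(merged)
```

Sessions without a label (here `s2`) keep their rows and receive NaN in `accuracy_group`.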

In [16]:
df1.columns

Index(['event_id', 'game_session', 'timestamp', 'event_data',
       'installation_id_x', 'event_count', 'event_code', 'game_time',
       'title_x', 'type', 'world', 'installation_id_y', 'title_y',
       'num_correct', 'num_incorrect', 'accuracy', 'accuracy_group'],
      dtype='object')

In [17]:
df1.drop(columns = ['installation_id_y', 'title_y', 'title_x'], inplace = True)

In [18]:
df1.rename(columns = {'installation_id_x': 'installation_id'}, inplace = True)

In [19]:
# Convert timestamp to datetime format on df1, the frame used in all later steps
df1['timestamp'] = pd.to_datetime(df1.timestamp)

#### Evaluate how many assessments each user completes.

In [20]:
df1[df1.type == 'Assessment'].accuracy_group.value_counts(dropna=False)

3.0    303575
0.0    255079
1.0    197291
2.0    109502
NaN     38158
Name: accuracy_group, dtype: int64

Drop the NaN accuracy_group rows.

In [21]:
# capture the index values of NaN assessments
na_assessments = df1[(df1.type == 'Assessment') & (df1.accuracy_group.isna())].index

In [22]:
# make sure all of the NaN assessments are captured.
len(na_assessments)

38158

In [23]:
df1.drop(na_assessments, inplace = True)

In [24]:
#Check the shape of df1: 7734558 - 7696400 = 38158
df1.shape

(7696400, 14)

In [25]:
assessments = df1.groupby(['installation_id', 'game_session', 'accuracy_group']).count().reset_index()

In [26]:
# Count how many times each user was in each accuracy_group. 
# Count how many assessments they took overall. 
# Drop the bottom 'All' ROW.
assessment_count = pd.crosstab(assessments.installation_id, assessments.accuracy_group, margins = True).drop('All')
assessment_count

accuracy_group   0.0  1.0  2.0  3.0  All
installation_id
0006a69f           1    0    1    3    5
0006c192           1    0    1    1    3
00129856           0    0    0    1    1
001d0ed0           2    0    1    2    5
00225f67           1    0    0    0    1
00279ac5           1    0    0    0    1
002db7e3           2    2    2    3    9
003372b0           1    0    1    4    6
004c2091           2    0    0    2    4
00634433           1    0    0    2    3


In [27]:
# Look at the overall distribution of how many assessments each user took.
assessment_count.All.describe()

count    3614.000000
mean        4.894853
std         6.887616
min         1.000000
25%         1.000000
50%         3.000000
75%         6.000000
max       156.000000
Name: All, dtype: float64

In [28]:
assessment_count.All.value_counts().sort_index()

1      1027
2       633
3       442
4       311
5       256
6       181
7       135
8       112
9        75
10       66
11       48
12       38
13       41
14       30
15       37
16       19
17       21
18       17
19       13
20       12
21        9
22       11
23        5
24        7
25        6
26        1
27        7
28        5
29        3
30        5
31        5
32        4
33        1
34        3
35        1
36        3
37        3
38        3
39        1
42        4
45        1
46        3
47        1
48        1
49        1
64        2
72        1
78        1
129       1
156       1
Name: All, dtype: int64

In [29]:
q1 = assessment_count.All.quantile(0.25)
q1

1.0

In [30]:
q3 = assessment_count.All.quantile(0.75)
q3

6.0

In [31]:
iqr = q3-q1
iqr

5.0

In [32]:
upper_fence = q3 + 3*iqr
upper_fence

21.0
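
The cutoff above is the outer Tukey fence, `q3 + 3 * iqr`, which flags only extreme outliers (the more common inner fence uses 1.5). A minimal sketch of the same calculation as a reusable function follows; `upper_fence` and the toy counts are hypothetical and chosen so the quartiles match the ones computed above.

```python
import pandas as pd


def upper_fence(values: pd.Series, k: float = 3.0) -> float:
    """Return q3 + k * IQR for a numeric series (k=3 gives the outer fence)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Toy assessment counts whose quartiles are q1 = 1 and q3 = 6.
counts = pd.Series([1, 1, 3, 6, 6])
fence = upper_fence(counts)
print(fence)

# Keep only the users at or below the fence.
candidates = pd.Series([2, 21, 22, 156])
print(candidates[candidates <= fence].tolist())
```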

In [33]:
assessment_count[assessment_count.All <= 21]

accuracy_group   0.0  1.0  2.0  3.0  All
installation_id
0006a69f           1    0    1    3    5
0006c192           1    0    1    1    3
00129856           0    0    0    1    1
001d0ed0           2    0    1    2    5
00225f67           1    0    0    0    1
00279ac5           1    0    0    0    1
002db7e3           2    2    2    3    9
003372b0           1    0    1    4    6
004c2091           2    0    0    2    4
00634433           1    0    0    2    3


In [34]:
df1[df1.installation_id.isin(assessment_count[assessment_count.All <= 21].index)].installation_id.nunique()

3523

### Create a new dataframe that only contains installation_ids with <= 21 assessments

In [None]:
# pd.read_csv('df_21.csv').shape

In [35]:
df_21 = df1[df1.installation_id.isin(assessment_count[assessment_count.All <= 21].index)]

In [36]:
df_21.shape

(6583204, 14)

In [37]:
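# Filter to assessment rows first, then sort; this avoids indexing a sorted frame with an unsorted boolean mask
assessments_21 = df_21[df_21.type == 'Assessment'].sort_values(by = ['installation_id', 'timestamp'])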

In [38]:
assessments_21.shape

(715397, 14)

Grab the last assessments from each installation id. These will be our final accuracy_group that we use as target variables in our model.

In [41]:
assessments_21.drop_duplicates(subset = ['installation_id'], keep = 'last', inplace = True)

In [42]:
assessments_21.shape

(3523, 14)
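
The "sort, then keep the last row per user" pattern used above can be checked on a toy frame. This is a sketch with made-up values; in the notebook the rows are individual events, so the row that survives is the final event of each user's last assessment session.

```python
import pandas as pd

toy = pd.DataFrame({
    "installation_id": ["u1", "u1", "u2", "u1", "u2"],
    "timestamp": pd.to_datetime([
        "2019-09-01 10:00", "2019-09-03 09:00", "2019-09-02 12:00",
        "2019-09-02 08:00", "2019-09-01 07:00",
    ]),
    "accuracy_group": [0, 3, 2, 1, 1],
})

# Sort chronologically within each user, then keep each user's final row.
last = (
    toy.sort_values(["installation_id", "timestamp"])
       .drop_duplicates(subset="installation_id", keep="last")
)

print(last)
```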

In [51]:
last_assessments_index = assessments_21.index

In [49]:
last_assessment_game_sessions = assessments_21.game_session.unique()
last_assessment_game_sessions

array(['a9ef3ecb3d1acc6a', '957406a905d59afd', 'ae691ec5ad5652cf', ...,
       '460e8bdc2822b340', 'b05a02b52d5c1f4c', '5448d652309a6324'],
      dtype=object)

### Create a dataframe that does not contain the last game sessions

In [46]:
df_new = df_21[~df_21.game_session.isin(last_assessment_game_sessions)]

In [47]:
df_new.shape

(6402301, 14)

In [48]:
df_new.installation_id.nunique()

3515

In [54]:
# Collect, for every user, the events logged after that user's last assessment.
after_parts = []
for i in assessments_21.index:
    last = assessments_21.loc[i]
    after_parts.append(df_new[(df_new.installation_id == last.installation_id) & (df_new.timestamp > last.timestamp)])
# Concatenate once at the end; DataFrame.append was removed in pandas 2.0 and is slow inside a loop.
after_last_assessment_df = pd.concat(after_parts)

In [55]:
after_last_assessment_df.game_session.nunique()

24241

In [56]:
after_last_assessment_df.installation_id.nunique()

2884

In [57]:
df_new.drop(after_last_assessment_df.index).shape

(5310606, 14)
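
The same "drop everything after the last assessment" filter can also be written without a Python loop by mapping each user's cutoff timestamp onto the event rows. The sketch below uses made-up data; `last_ts` stands in for a per-user series of last-assessment timestamps, which this notebook does not build under that name.

```python
import pandas as pd

events = pd.DataFrame({
    "installation_id": ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2019-09-01", "2019-09-02", "2019-09-05", "2019-09-01", "2019-09-04",
    ]),
})

# Hypothetical per-user cutoff: the timestamp of each user's last assessment.
last_ts = pd.Series(
    pd.to_datetime(["2019-09-03", "2019-09-02"]),
    index=["u1", "u2"],
)

# Align each event row with its user's cutoff, then compare element-wise.
cutoff = events["installation_id"].map(last_ts)
after_mask = events["timestamp"] > cutoff

after_last = events[after_mask]
up_to_last = events[~after_mask]

print(len(after_last), len(up_to_last))
```

Users without a cutoff get a missing value from `map`; the comparison is then False, so their rows are kept.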

In [None]:
# Write final df_new to csv to load and use after 
# df_new.to_csv('train_maybe_final')

In [58]:
df_test = pd.read_csv('train_maybe_final.csv')

In [59]:
df_test.shape

(5310606, 19)

---
## IV. EXPLORATORY DATA ANALYSIS 

### Feature Engineering

In [None]:
big = pd.read_csv("train_maybe_final.csv")

In [None]:
big.installation_id.nunique()

In [None]:
big.head()

In [None]:
assessment = big[big.type == "Assessment"]

#### MAX CEILING

In [None]:
mc = assessment.groupby(["installation_id","game_session"])[["accuracy"]].max().sort_values(by=["installation_id","accuracy"],ascending=False).reset_index()
mc = mc.drop_duplicates(subset="installation_id",keep="first")

max_ceiling = mc[["installation_id","accuracy"]].set_index("installation_id").rename(columns={"accuracy":"max_ceiling"})

In [None]:
max_ceiling.shape

#### LOWEST SCORE

In [None]:
ls = assessment.groupby(["installation_id","game_session"])[["accuracy"]].min().sort_values(by=["installation_id","accuracy"],ascending=True).reset_index()
ls = ls.drop_duplicates(subset="installation_id",keep="first")

lowest_score = ls[["installation_id","accuracy"]].set_index("installation_id").rename(columns={"accuracy":"low_score"})

In [None]:
lowest_score.shape

#### MEDIAN SCORE

In [None]:
# Median of each user's per-session accuracies (keeping the first row of an ascending sort would return the minimum instead).
md = assessment.groupby(["installation_id","game_session"])[["accuracy"]].median().reset_index()

median_score = md.groupby("installation_id")[["accuracy"]].median().rename(columns={"accuracy":"median_score"})

In [None]:
median_score.shape

#### NUMBER OF ACTIONS

In [None]:
actions = assessment.groupby(['installation_id','game_session'])[['event_id']].count().reset_index().groupby("installation_id").sum().reset_index()

actions = actions[["installation_id","event_id"]].set_index("installation_id").rename(columns={"event_id":"no_actions"})

In [None]:
actions.shape

#### NUMBER OF INCORRECT

In [None]:
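# num_incorrect is a per-session label repeated on every event row, so take one value
# per session and sum across each user's sessions (count() would only count event rows).
no_incorrect = assessment.groupby(['installation_id','game_session'])[['num_incorrect']].first().groupby("installation_id").sum()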

In [None]:
no_incorrect.shape
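
The per-user score features in this section can also be produced in one pass with pandas named aggregation. The sketch below assumes, as a simplification, that `accuracy` and `num_incorrect` are constant within a game_session, so one row per session is enough; the toy values are made up.

```python
import pandas as pd

# Toy event-level rows: session labels repeat on every event of a session.
toy_events = pd.DataFrame({
    "installation_id": ["u1", "u1", "u1", "u1", "u2", "u2"],
    "game_session":    ["s1", "s1", "s2", "s3", "s4", "s4"],
    "accuracy":        [0.0, 0.0, 0.5, 1.0, 0.25, 0.25],
    "num_incorrect":   [3, 3, 1, 0, 3, 3],
})

# Reduce to one row per session before aggregating per user.
per_session = toy_events.drop_duplicates(subset=["installation_id", "game_session"])

per_user = per_session.groupby("installation_id").agg(
    max_ceiling=("accuracy", "max"),
    low_score=("accuracy", "min"),
    median_score=("accuracy", "median"),
    num_incorrect=("num_incorrect", "sum"),
)

print(per_user)
```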

#### ACCURACY TENDENCY

In [None]:
tendency = assessment.groupby(['installation_id','game_session',"accuracy"])[['accuracy']].count().rename(columns={"accuracy":"acc"}).sort_values(by=["installation_id","acc"],ascending=False).reset_index()

tendency = tendency.drop_duplicates(subset="installation_id",keep="first")
tendency = tendency[["installation_id","accuracy"]]

In [None]:
condition_list = [tendency.accuracy == 0, tendency.accuracy == 0.5,tendency.accuracy == 1, (~tendency.accuracy.isin([0,1,0.5]))]
choice_list = ["low_scorer","avg_scorer","high_scorer","random_scorer"]

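# Use a string default so it shares a dtype with the string choices; the last condition already covers every remaining case.
tendency["group"] = np.select(condition_list,choice_list,default="random_scorer")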

In [None]:
tendency = pd.get_dummies(tendency,columns=["group"], prefix="tendency")
tendency = tendency[['installation_id','tendency_avg_scorer','tendency_high_scorer', 'tendency_low_scorer','tendency_random_scorer']].set_index("installation_id")

In [None]:
tendency.shape

In [None]:
qu = max_ceiling.join([lowest_score,median_score,actions,no_incorrect,tendency])

In [None]:
qu.head()

In [None]:
qu.to_csv("cris_df.csv")

---
## V. TEST-TRAIN SPLIT & FEATURE ENGINEERING 

Acquire "tidy" data frames with features and y for feature engineering.

In [None]:
# Features on separate data frames
qu = pd.read_csv("cris_df.csv")
be = pd.read_csv("beta.csv")

In [None]:
qu.set_index("installation_id",inplace=True)
be.set_index("installation_id",inplace=True)

In [None]:
print(f"shape:{be.shape}")
be.head()

In [None]:
print(f"shape:{qu.shape}")
qu.head()

#### CHECKING CLASS BALANCE

In [None]:
be.accuracy_group.value_counts().plot(kind="bar")

#### MERGING WITH BETA

In [None]:
features = be.join(qu,how="left").fillna(0)
features = features.copy().drop(columns="accuracy_group")
features.head()

In [None]:
features.shape

#### REMOVE NA_USERS ON TRAIN

In [None]:
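# Detect users with no assessment-derived features on the join itself:
# once missing values are filled with 0, max_ceiling has no NaN left to find.
joined = be.join(qu, how="left")
na_users = list(joined[joined.max_ceiling.isna()].index)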

In [None]:
features = features[~features.index.isin(na_users)]

In [None]:
# Data Frame with accuracy group, i.e., the "y" or target
y = pd.read_csv("last_assessments.csv")
y = y[["installation_id","accuracy_group"]].set_index("installation_id")

y = y[~y.index.isin(na_users)]

In [None]:
y.shape

In [None]:
print(f"Features Shape: {features.shape}")
print(f"y Shape: {y.shape}")

In [None]:
y.accuracy_group.value_counts()

### Scale

In [None]:
# Standardize the feature columns only; `be` still contains the target (accuracy_group), so it is not scaled.
scaler = StandardScaler()
scaled_features = pd.DataFrame(scaler.fit_transform(features),columns=features.columns,index=features.index)

In [None]:
scaled_features.head()

In [None]:
scaled_features.columns

In [None]:
df_feed = scaled_features.copy()

In [None]:
df_feed

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_feed,y, test_size=0.3, stratify=y["accuracy_group"],random_state=123)

In [None]:
y_train.accuracy_group.value_counts()

In [None]:
X_test.shape
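
An alternative ordering, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so that the test rows do not influence the scaling statistics. The arrays here are random placeholders, not the notebook's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrix and the four accuracy groups.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 4, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_tr)

# ...then apply the same transformation to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```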

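### Feature Selection

In [None]:
# Fit LassoCV on the training data; features whose coefficients shrink to zero are dropped below.
lasso = LassoCV()
lasso.fit(X_train, y_train["accuracy_group"])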

In [None]:
print(f"Best alpha using built-in LassoCV: {lasso.alpha_}")
print(f"Best score using built-in LassoCV: {lasso.score(X_train,y_train)}")

In [None]:
coef = pd.DataFrame(lasso.coef_, index = X_train.columns).rename(columns={0:"feature_weights"}).sort_values(by="feature_weights",ascending=False)

In [None]:
coef.head()

In [None]:
coef.plot(kind="barh",figsize=(15,12))

### Keep all non-zero features

In [None]:
weighted_coef = coef[coef.feature_weights != 0]

In [None]:
#see if all zero-value features are removed
weighted_coef.describe()

In [None]:
weighted_coef_list = list(weighted_coef.index)

In [None]:
X_train.head()

In [None]:
X_train = X_train[weighted_coef_list]
X_test = X_test[weighted_coef_list]

In [None]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

---
## VI. MODELING

### Random Forest

Create Random Forest object.  
Fit Train data.

In [None]:
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=6, 
                            random_state=600)

# Pass the target as a 1-D Series rather than a one-column DataFrame
rf.fit(X_train, y_train["accuracy_group"])

Check how the model uses each feature by inspecting feature importances.

In [None]:
rf_features = pd.DataFrame([rf.feature_importances_],columns = X_train.columns).T.rename(columns = {0: 'feature_importance'}).sort_values(by='feature_importance', ascending=False)

In [None]:
rf_features.plot(kind="barh",figsize=(15,12))

In [None]:
y_pred = pd.DataFrame(rf.predict(X_train),index = X_train.index)
y_pred_test = pd.DataFrame(rf.predict(X_test),index = X_test.index)

In [None]:
y_train.head()

In [None]:
predictions = y_train.copy().rename(columns={"accuracy_group":"actual_y"})
predictions[["predicted_y"]] = y_pred

In [None]:
predictions.head()

#### EVALUATION: RANDOM FOREST

In [None]:
print(f"RF Score for Train: {rf.score(X_train, y_train)}")
print(f"RF Score for Test: {rf.score(X_test, y_test)}")

In [None]:
confusion_matrix(predictions.actual_y, predictions.predicted_y)

In [None]:
print(classification_report(y_test,y_pred_test))

### Logistic Regression

---

# Checking - to be removed

In [None]:
from scipy import stats

#### DF THAT HAS TENDENCIES PER INSTALLATION ID

In [None]:
assessment.head()

In [None]:
# a_counts = assessment.groupby(["installation_id","game_session"]).agg(stats.mode)[["accuracy_group"]]
# # a_counts = a_counts.drop_duplicates(subset="installation_id",keep="first")

In [None]:
# Select the numeric column before taking the median so non-numeric columns are not aggregated
a_counts = assessment.groupby(["installation_id","game_session"])[["accuracy_group"]].median().reset_index().groupby("installation_id")[["accuracy_group"]].median().reset_index()
# a_counts = a_counts.drop_duplicates(subset="installation_id",keep="first")

In [None]:
a_counts.head()

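In [None]:
# Incomplete scratch cell, commented out so the notebook runs:
# a_counts["tendency"

In [None]:
a_counts = a_counts.drop_duplicates(subset="installation_id",keep="first")

In [None]:
# Incomplete scratch cell, commented out so the notebook runs:
# a_counts =

In [None]:
# Incomplete scratch cell, commented out so the notebook runs:
# a_counts[]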

In [None]:
assessment.groupby(["installation_id","accuracy_group","game_session"])[["game_session"]].sum()

#### HOW MUCH ARE THEY PLAYING AROUND BEFORE THEY HIT "DONE"/ATTEMPT

In [None]:
assessment_sub = assessment[assessment.accuracy_group.isin([1,2])]

In [None]:
assessment_sub.sample()

In [None]:
assessment_sub["num_incorrect"].value_counts().plot(kind="bar")

In [None]:
assessment_sub[["event_id","title"]]

In [None]:
assessment_sub.groupby(["installation_id","game_session","event_id","accuracy"]).count()[["num_incorrect"]]

In [None]:
assessment_sub.shape

In [None]:
assessment_sub.accuracy_group.value_counts()

In [None]:
assessment.num_correct.value_counts()

In [None]:
assessment[assessment.installation_id == "baedce19"]

In [None]:
assessment

---

In [None]:
subset_12 = features.join(y)

In [None]:
subset_12.head()

In [None]:
subset_12 = subset_12[subset_12.accuracy_group.isin([1,2])]
subset_12.accuracy_group = np.where(subset_12.accuracy_group == 1, "one","two")

In [None]:
subset_12[["Assessment","Clip","Game","Activity"]].sum().plot(kind="bar")

---

In [None]:
X_subset_12 = subset_12.drop(columns="accuracy_group")

In [None]:
y_subset_12 = subset_12[["accuracy_group"]]

In [None]:
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=6, 
                            random_state=600)

rf.fit(X_subset_12, y_subset_12)

In [None]:
subset12_features = pd.DataFrame([rf.feature_importances_],columns = X_subset_12.columns).T.rename(columns = {0: 'feature_importance'}).sort_values(by='feature_importance', ascending=False)

In [None]:
subset12_features.plot(kind="barh",figsize=(15,12))

In [None]:
print(f"RF Score for Train: {rf.score(X_subset_12, y_subset_12)}")