# Analytics Vidhya - CodeFest Enigma

#### Author - Akkash K N R
#### Machine - MacBook Pro, 8GB RAM
#### Software - Python 2.7, GraphLab Create


### Problem Statement

Online judges provide a platform where many users solve problems every day to improve their programming skills. The users can be beginners or experts in competitive programming. Some users may be good at solving a specific category of problems (e.g. greedy, graph algorithms, dynamic programming) while others are beginners in the same category. The goal of the machine learning model is to identify such patterns and model users' behaviour from them.
The goal of this challenge is to predict the range of attempts a user will make to solve a given problem, given the user and problem details.

#### Importing the data

In [2]:
import graphlab as gl
import pandas as pd
import numpy as np

In [3]:
train = pd.read_csv('train/train_submissions.csv')
user = pd.read_csv('train/user_data.csv')
problem = pd.read_csv('problem_mm.csv')
test = pd.read_csv('test_submissions.csv')

The given dataset consists of the following parts:

1. Train - an explicit dataset, where we have the user_id, the problem_id and the attempts the user made.
2. Test - an implicit dataset, where the attempts are missing and have to be predicted.
3. Problem & User - side information that helps the model, e.g. rank, ratings, problems solved etc.
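
As a rough illustration of what "side information" means here, the sketch below attaches user and problem attributes to each (user, problem) interaction. The records and the helper `attach_side_info` are hypothetical toy examples, not rows from the competition files and not part of the original notebook.

```python
# Toy records (hypothetical values, not taken from the competition data).
interactions = [
    {"user_id": "u1", "problem_id": "p1", "attempts_range": 1},
    {"user_id": "u2", "problem_id": "p1", "attempts_range": 3},
]
user_info = {"u1": {"rank": "advanced"}, "u2": {"rank": "beginner"}}
problem_info = {"p1": {"level_type": "A", "points": 500}}


def attach_side_info(rows, users, problems):
    """Return copies of the interaction rows enriched with side attributes."""
    enriched = []
    for row in rows:
        merged = dict(row)  # copy so the input rows stay untouched
        merged.update(users.get(row["user_id"], {}))
        merged.update(problems.get(row["problem_id"], {}))
        enriched.append(merged)
    return enriched


enriched = attach_side_info(interactions, user_info, problem_info)
print(enriched[0])
```

Each enriched row now carries the interaction itself plus whatever is known about the user and the problem, which is the kind of extra signal the side data is meant to provide.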

In [4]:
train.head()

Unnamed: 0,user_id,problem_id,attempts_range
0,user_232,prob_6507,1
1,user_3568,prob_2994,3
2,user_1600,prob_5071,1
3,user_2256,prob_703,1
4,user_2321,prob_356,1


In [5]:
user.head()

Unnamed: 0,user_id,submission_count,problem_solved,contribution,country,follower_count,last_online_time_seconds,max_rating,rating,rank,registration_time_seconds
0,user_3311,47,40,0,,4,1504111645,348.337,330.849,intermediate,1466686436
1,user_3028,63,52,0,India,17,1498998165,405.677,339.45,intermediate,1441893325
2,user_2268,226,203,-8,Egypt,24,1505566052,307.339,284.404,beginner,1454267603
3,user_480,611,490,1,Ukraine,94,1505257499,525.803,471.33,advanced,1350720417
4,user_650,504,479,12,Russia,4,1496613433,548.739,486.525,advanced,1395560498


In [6]:
problem.head()

Unnamed: 0,problem_id,level_type,points,binari,brute,construct,data,dfs,forceimplement,greedi,implement,math,pointer,searchdata,structur,theori
0,prob_3649,H,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
1,prob_6191,A,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
2,prob_2020,F,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
3,prob_313,A,500,0,0,0,0,0,0,0,0,0,0,0,0,0
4,prob_101,A,500,0,0,1,0,0,0,0,0,0,0,0,0,0


##### Train and test are already in explicit and implicit form, so they don't need any preprocessing. Let's work on the problem and user data.

In [7]:
# Let's check for NaN values in both problem and user
problem.isnull().sum()

problem_id          0
level_type        133
points              0
binari              0
brute               0
construct           0
data                0
dfs                 0
forceimplement      0
greedi              0
implement           0
math                0
pointer             0
searchdata          0
structur            0
theori              0
dtype: int64

In [8]:
user.isnull().sum()

user_id                         0
submission_count                0
problem_solved                  0
contribution                    0
country                      1153
follower_count                  0
last_online_time_seconds        0
max_rating                      0
rating                          0
rank                            0
registration_time_seconds       0
dtype: int64

Only country has missing values in the user data, so let's fill them. As a first pass we impute with 'Others'; this can be refined later.

In [9]:
user['country'] = user['country'].fillna('Others')

Now let's look at the levels in the problem dataset.

In [10]:
problem.groupby('level_type').count()

level_type,problem_id,points,binari,brute,construct,data,dfs,forceimplement,greedi,implement,math,pointer,searchdata,structur,theori
A,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042,1042
B,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017,1017
C,915,915,915,915,915,915,915,915,915,915,915,915,915,915,915
D,850,850,850,850,850,850,850,850,850,850,850,850,850,850,850
E,795,795,795,795,795,795,795,795,795,795,795,795,795,795,795
F,421,421,421,421,421,421,421,421,421,421,421,421,421,421,421
G,328,328,328,328,328,328,328,328,328,328,328,328,328,328,328
H,272,272,272,272,272,272,272,272,272,272,272,272,272,272,272
I,256,256,256,256,256,256,256,256,256,256,256,256,256,256,256
J,212,212,212,212,212,212,212,212,212,212,212,212,212,212,212


In [11]:
#Here we are imputing the missing values with the mode value
problem['level_type'] = problem['level_type'].fillna(problem['level_type'].value_counts().index[0])
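
The mode imputation above can be read as "replace every missing level with the most frequent observed level". A minimal pure-Python sketch of the same idea follows; the helper `fill_with_mode` and the toy list are assumptions for illustration, and `None` stands in for the NaN values pandas uses.

```python
from collections import Counter


def fill_with_mode(values, missing=None):
    """Replace missing entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not missing]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is missing else v for v in values]


# Toy column: "A" is the most frequent level, so it fills the gaps.
levels = ["A", "B", None, "A", "C", None]
filled = fill_with_mode(levels)
print(filled)
```

Note that this pushes all missing problems into the largest class, which is simple but may blur the difficulty signal for those problems.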

In [12]:
problem['points'].describe()

count    6544.000000
mean     1480.884322
std       500.733720
min        -1.000000
25%      1500.000000
50%      1500.000000
75%      1500.000000
max      5000.000000
Name: points, dtype: float64

In [13]:
#problem['points'] = problem['points'].fillna(problem['points'].mean())

In [14]:
#problem['tags'] = problem['tags'].fillna('No Tag')

KeyError: 'tags'

In [120]:
#from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
#cnt = CountVectorizer().fit(problem['tags'])
#vec = TfidfVectorizer(max_features=50).fit(problem['tags'])

In [121]:
#vec.get_feature_names()

In [15]:
problem.head()

Unnamed: 0,problem_id,level_type,points,binari,brute,construct,data,dfs,forceimplement,greedi,implement,math,pointer,searchdata,structur,theori
0,prob_3649,H,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
1,prob_6191,A,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
2,prob_2020,F,1500,0,0,0,0,0,0,0,0,0,0,0,0,0
3,prob_313,A,500,0,0,0,0,0,0,0,0,0,0,0,0,0
4,prob_101,A,500,0,0,1,0,0,0,0,0,0,0,0,0,0


Now we create the side data, which can help boost our model.

In [16]:
userData = gl.SFrame({'user_id':user['user_id'],'submission_count':user['submission_count'],'rank':user['rank'],
                     'country':user['country'],'max_rating':user['max_rating'],'follower_count':user['follower_count'],
                     'contribution':user['contribution'],'rating':user['rating'],})

In [17]:
itemData = gl.SFrame({'item_id':problem['problem_id'],'level_type':problem['level_type'],'points':problem['points'],
                     'binari':problem['binari'],'brute':problem['brute'],'construct':problem['construct'],
                      'data':problem['data'],'dfs':problem['dfs'],
                      'forceimplement':problem['forceimplement'],'greedi':problem['greedi'],'implement':problem['implement'],
                      'math':problem['math'],'pointer':problem['pointer'],'searchdata':problem['searchdata'],
                      'structur':problem['structur'],'theori':problem['theori']})

In [18]:
trainBasic = gl.SFrame({"user_id":train['user_id'],"item_id":train['problem_id'],"attempts":train['attempts_range']})

In [19]:
testBasic = gl.SFrame({'user_id':test['user_id'],'item_id':test['problem_id']})

Now let's create a model. Here we use Factorization Machines to predict the user-problem attempts. We chose this algorithm because the problem is quite similar to a recommendation use case: users stay users, problems play the role of items, and the attempts range plays the role of the rating.
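
To make the idea concrete, here is a minimal sketch of the standard second-order factorization machine score for a single feature vector. It is not GraphLab's implementation; whether the library computes exactly this internally is an assumption. The feature layout (one-hot user, one-hot item, two side features), the random weights and the function names are all hypothetical.

```python
import random


def fm_predict(x, w0, w, V):
    """Second-order FM score: bias + linear terms + factorized pairwise terms.

    Uses the identity
    sum_{i<j} <V[i], V[j]> x_i x_j
        = 0.5 * sum_f ((sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2).
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score computed with the explicit double loop over feature pairs."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


rng = random.Random(40)
n, k = 6, 3  # 6 features, 3 latent factors (toy sizes)
x = [1.0, 0.0, 0.0, 1.0, 0.5, 1.0]  # user one-hot, item one-hot, 2 side features
w0 = 0.1
w = [rng.uniform(-1, 1) for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
print(abs(fast - slow) < 1e-9)
```

Because every feature, including the side features, gets its own latent vector, the model can learn interactions such as "advanced users on level A problems" without seeing every user-problem pair.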

In [20]:
model1 = gl.factorization_recommender.create(trainBasic, target='attempts',
                                             user_data=userData,
                                             item_data=itemData,
                                             num_factors=80, side_data_factorization=True,
                                             random_seed=40, nmf=False,
                                             max_iterations=100)

Now let's predict the output.

In [21]:
prediction = model1.predict(testBasic,new_user_data=userData,new_item_data=itemData)

In [22]:
prediction_asint = prediction.astype(int)

In [23]:
prediction_round = prediction.to_numpy()

In [28]:
# Round each prediction to the nearest integer
round_pred = []
for i in prediction:
    round_pred.append(int(round(i)))

In [25]:
# Clamp the rounded predictions to the valid range 1 to 6
for n,i in enumerate(round_pred):
    if i < 1:
        round_pred[n] = 1
    if i > 6:
        round_pred[n] = 6

The reason for clamping is that I saw out-of-range values in the predictions. attempts_range cannot be 0 or greater than 6, so such values are converted to 1 and 6 respectively.
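
The rounding and clamping steps can also be combined into one small helper. The sketch below is an illustration under stated assumptions: the name `to_attempts_range` and the toy scores are hypothetical, and it rounds half up via `math.floor(s + 0.5)` so that it behaves the same on Python 2 and Python 3 (the built-in `round` differs between the two).

```python
import math


def to_attempts_range(scores, low=1, high=6):
    """Round raw scores to the nearest integer and clamp them to [low, high]."""
    result = []
    for s in scores:
        rounded = int(math.floor(s + 0.5))  # round half up
        result.append(min(max(rounded, low), high))
    return result


# Toy raw scores, including values that would fall outside 1..6.
raw_scores = [0.2, 2.5, 3.49, 6.7, -0.4]
clipped = to_attempts_range(raw_scores)
print(clipped)
```

By contrast, truncating with an integer cast (as in the `astype(int)` variant) maps 2.9 to 2 rather than 3 and leaves values such as 0 unclamped, which is why the two submissions can differ.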

In [29]:
submission_asint = gl.SFrame({'ID':test['ID'],'attempts_range':prediction_asint})
submission_asround = gl.SFrame({'ID':test['ID'],'attempts_range':round_pred})

In [30]:
submission_asint.save('asint_prediction_v3_23_sep.1143.csv',format="csv")
submission_asround.save('asround_prediction_v3_23_sep_1143.csv',format="csv")