In [1]:
import os 
import json

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

Cindicator is a smart asset management fund. Thousands of analysts answer questions about market movements; their predictions are aggregated, and trades are made based on the result.

Here is an example of one of those questions:

<img src="question_example.png">

An analyst is asked to predict a probability for a described event.

The data was generated by asking questions of this kind over time.

Multiple users answer each question.

Your task is to build a model that aggregates users' guesses and predicts whether an event will actually happen.

This notebook will help you understand the structure of the data.

On Dbrain the data is split into four parts:

- a preview dataset that you can download
- a train dataset that is used to train your model on our servers
- a public test set
- a private test set

Here we explore the preview dataset.

The dataset is split into multiple `.csv` files. Each file contains data related to questions that ended at the same time. 

Let's take a look at one of the files:

In [2]:
data_dir = "data/preview/"

In [3]:
csv_paths = sorted([os.path.join(data_dir, fname) for fname in os.listdir(data_dir) 
                    if fname.endswith(".csv")])

In [4]:
first_df = pd.read_csv(csv_paths[10], index_col=0)
first_df.head()

Unnamed: 0,question_id,started_question_time,finished_question_time,target_date,user_id,user_created_at,user_country,user_answer,user_answer_created_at,question_ticker,user_birthday,user_gender,question_answer,window_id
17234,1829,2017-12-09 00:00:00.000000,2017-12-10 00:00:00.000000,2018-12-25,24876,2017-09-04 16:22:28.356122,927c2718-87c9-40b2-96bc-9c573e4a1bb2,0.0,2017-12-10 05:50:43.108546,222,1968-05-20 00:00:00.000000,0.0,1,10
17251,1829,2017-12-09 00:00:00.000000,2017-12-10 00:00:00.000000,2018-12-25,4244,2017-10-16 19:50:00.495838,043ebed1-3332-4149-9c2a-cdda0aa8f041,0.0,2017-12-10 06:58:44.934519,222,1983-03-11 00:00:00.000000,1.0,1,10
17240,1829,2017-12-09 00:00:00.000000,2017-12-10 00:00:00.000000,2018-12-25,25739,2017-11-11 03:06:27.895552,927c2718-87c9-40b2-96bc-9c573e4a1bb2,0.9,2017-12-10 06:21:36.277638,222,,1.0,1,10
17235,1829,2017-12-09 00:00:00.000000,2017-12-10 00:00:00.000000,2018-12-25,659,2017-12-01 03:23:11.207051,78df5ee9-9e91-45fc-8fc5-3ffbb981443f,0.2,2017-12-10 05:51:25.581892,222,,,1,10
17236,1829,2017-12-09 00:00:00.000000,2017-12-10 00:00:00.000000,2018-12-25,22415,2017-11-14 20:09:52.229964,043ebed1-3332-4149-9c2a-cdda0aa8f041,0.75,2017-12-10 03:43:08.041030,222,1987-03-01 00:00:00.000000,,1,10


Each row contains a single analyst's answer to a question.

`question_id` - unique id for the question that analyst had to answer

`started_question_time` - when the question was made available

`finished_question_time` - when the question was closed (answers no longer accepted)

`target_value` - the price change of the ticker that the question asks about

`target_date` - the date that the question asks about

`user_id` - unique user (analyst) id

`user_created_at`, `user_country`, `user_birthday`, `user_gender` - self-explanatory

`user_answer` - probability score assigned by user to event in question

`question_ticker` - financial asset in question 

`question_answer` - true outcome of the question; this is what you are predicting!

`window_id` - technical field, derived from `finished_question_time`. All questions that finish at the same time have the same `window_id`.
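
The relation between `window_id` and `finished_question_time` described above can be checked with a small group-by. The following is a minimal sketch on made-up toy rows that only reuse the column names from the preview files; the variable names are illustrative. On the real data you would run the same two group-bys on the loaded frame.

```python
import pandas as pd

# Toy rows (made up); only the column names come from the preview files.
toy = pd.DataFrame({
    "question_id": [1, 1, 2, 3],
    "finished_question_time": [
        "2017-12-10 00:00:00",
        "2017-12-10 00:00:00",
        "2017-12-10 00:00:00",
        "2017-12-11 00:00:00",
    ],
    "window_id": [10, 10, 10, 11],
})
toy["finished_question_time"] = pd.to_datetime(toy["finished_question_time"])

# Number of distinct finish times inside each window; expected to be 1 everywhere.
times_per_window = toy.groupby("window_id")["finished_question_time"].nunique()
one_time_per_window = bool((times_per_window == 1).all())

# Number of distinct questions that share each window.
questions_per_window = toy.groupby("window_id")["question_id"].nunique()

print(one_time_per_window)
print(questions_per_window)
```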

Your task is to predict the `question_answer` for each `question_id`, based on all analyst predictions.
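
One simple way to aggregate the analysts' answers is to average `user_answer` per question and threshold the result. The sketch below is only a baseline under stated assumptions, not the intended solution: the toy rows are made up, the helper name `mean_baseline` and the 0.5 threshold are illustrative choices, and it assumes `question_answer` is binary (0 or 1).

```python
import pandas as pd

# Toy rows (made up); only the column names come from the preview files.
toy = pd.DataFrame({
    "question_id":     [1, 1, 1, 2, 2, 2],
    "user_id":         [10, 11, 12, 10, 11, 13],
    "user_answer":     [0.9, 0.7, 0.8, 0.2, 0.4, 0.3],
    "question_answer": [1, 1, 1, 0, 0, 0],
})

def mean_baseline(answers: pd.DataFrame) -> pd.Series:
    """Average the analysts' probabilities for each question (hypothetical helper)."""
    return answers.groupby("question_id")["user_answer"].mean()

# Aggregated probability per question.
predicted_prob = mean_baseline(toy)

# Assumed decision rule: predict that the event happens when the mean exceeds 0.5.
predicted_label = (predicted_prob > 0.5).astype(int)

# The true outcome is repeated on every row of a question, so take the first one.
true_label = toy.groupby("question_id")["question_answer"].first()

accuracy = (predicted_label == true_label).mean()
print(predicted_prob)
print("Accuracy:", accuracy)
```

A plain mean treats every analyst equally. Weighting analysts by their past accuracy is one possible direction for a stronger model.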

In [5]:
df = pd.concat(pd.read_csv(fname, index_col=0) for fname in csv_paths)

In [6]:
print("Total predictions:", df.shape[0])

Total predictions: 739850


In [7]:
print("Unique users:", df["user_id"].nunique())

Unique users: 39263
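
Since multiple users answer each question, it may also help to look at how many answers each question receives and how many questions each user answers. The following is a minimal sketch on made-up toy rows with the same column names; on the real data the same group-bys would be applied to `df`.

```python
import pandas as pd

# Toy rows (made up); only the column names come from the preview files.
toy = pd.DataFrame({
    "question_id": [1, 1, 1, 2, 2, 3],
    "user_id":     [10, 11, 12, 10, 11, 10],
    "user_answer": [0.9, 0.7, 0.8, 0.2, 0.4, 0.5],
})

# How many distinct analysts answered each question.
answers_per_question = toy.groupby("question_id")["user_id"].nunique()

# How many distinct questions each analyst answered.
questions_per_user = toy.groupby("user_id")["question_id"].nunique()

print("Unique questions:", toy["question_id"].nunique())
print(answers_per_question)
print(questions_per_user)
```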
