## First steps

We love neural networks, so we decided to build the recommendation system as a supervised learning problem

### Imports

In [None]:
import pandas as pd

In [None]:
from processors import QueProc, StuProc, ProProc # data preprocessors for questions, students and professionals
from generator import BatchGenerator # class that feeds data from pre-processed DataFrames to the model in the form of batches of NumPy arrays
from models import (
    Mothership, # main model which combines two encoders (for questions and professionals)
    Adam, # and Keras optimizer to train it
)
from evaluation import (
    permutation_importance, # calculate model feature importance via random permutations of feature values
    plot_fi, # and nicely plot it
)
from doc2vec import pipeline as pipeline_d2v # pipeline for training and saving embeddings for
                                             # professional's industries and question's tags via doc2vec algorithm

### Set some global parameters

In [None]:
data_path = '../../data/' # path to folder with initial .csv data files
dump_path = 'dump/' # path to all the dump data, like saved models, calculated embeddings etc.
split_date = '2019-01-01' # date used for splitting data into train and test subsets

### Read the data and split it into train and test sets

In [None]:
train, test = dict(), dict() # dictionaries with data split in train and test subsets

In [None]:
for var, file_name in [('que', 'questions.csv'), ('ans', 'answers.csv'),
                       ('pro', 'professionals.csv'), ('stu', 'students.csv')]:
    df = pd.read_csv(data_path + file_name) # read the data

    date_col = [col for col in df.columns if 'date' in col][0] # find the column which contains date
    df[date_col] = pd.to_datetime(df[date_col]) # convert it to datetime64 type

    train[var] = df[df[date_col] < split_date] # train keeps only rows dated before split_date, so nothing from the test period leaks into it
    test[var] = df # we will need to use all the data in test
    
tag_que = pd.read_csv(data_path + 'tag_questions.csv')
tags = pd.read_csv(data_path + 'tags.csv').merge(tag_que, left_on='tags_tag_id', right_on='tag_questions_tag_id')
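
To make the split logic concrete, here is a minimal sketch on a toy DataFrame. The table and its column names (`items_id`, `items_date_added`) are made up for illustration; only the comparison against `split_date` mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for one of the .csv files; the column names are hypothetical
toy = pd.DataFrame({
    'items_id': ['a', 'b', 'c'],
    'items_date_added': ['2018-06-01', '2018-12-31', '2019-03-15'],
})
toy['items_date_added'] = pd.to_datetime(toy['items_date_added'])

split_date = '2019-01-01'

# pandas parses the string when it is compared with a datetime64 column
toy_train = toy[toy['items_date_added'] < split_date]
toy_test = toy  # test keeps every row, including the ones that are also in train

print(len(toy_train), len(toy_test))
```

Note that train is a strict subset of test here: the test side needs the full history to compute features, so only the train side is filtered.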

## Train step

In [None]:
data = train

### doc2vec

### Feature engineering and data preprocessing

Some of the main points of our data preparation:  
- Questions and professionals are the two main entities in our recommendation system
- Student features are included in the question features
- Question features are designed to be time-independent, while some student and professional features inevitably depend on time. This is why we need to compute student and professional feature vectors for each moment in time when they change. These moments correspond to the appearance of a new answer
- So there are three entities in our dataset for which we compute features separately: question, student and professional
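
As an illustration of a time-dependent feature, the sketch below counts how many answers a professional had already written at the moment of each new answer. This is not necessarily what `ProProc` computes; the toy table and the `prev_answers` feature name are assumptions, and only the column naming follows `answers.csv`.

```python
import pandas as pd

# Hypothetical toy answers table
answers = pd.DataFrame({
    'answers_author_id': ['p1', 'p2', 'p1', 'p1'],
    'answers_date_added': pd.to_datetime(
        ['2018-01-05', '2018-02-01', '2018-03-10', '2018-07-20']),
})

# Order events in time so that "previous" is well defined
answers = answers.sort_values('answers_date_added')

# Number of answers the professional had written *before* this one;
# the value changes with every new answer, hence one feature vector per answer
answers['prev_answers'] = answers.groupby('answers_author_id').cumcount()

print(answers)
```

Each row now carries the state of the professional at that moment, which is the reason feature vectors have to be stored per answer rather than per professional.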

In [None]:
oblige_fit = True # whether to fit a new StandardScaler (for numerical features)
                  # or LabelEncoder (for categorical ones), or to reuse an existing one if there is one
                  # True in train mode, False in test
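
The fit-or-reuse behaviour behind such a flag can be sketched as follows. The helper `get_scaler` and the pickle file name are hypothetical and are not taken from the `processors` module; the sketch only assumes scikit-learn's `StandardScaler`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler


def get_scaler(values, path, oblige_fit):
    """Hypothetical helper: fit and save a scaler, or reuse a saved one."""
    if oblige_fit or not os.path.exists(path):
        scaler = StandardScaler().fit(values)
        with open(path, 'wb') as f:
            pickle.dump(scaler, f)
    else:
        with open(path, 'rb') as f:
            scaler = pickle.load(f)
    return scaler


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scaler.pkl')
    train_values = np.array([[1.0], [2.0], [3.0]])
    test_values = np.array([[10.0], [20.0]])

    fitted = get_scaler(train_values, path, oblige_fit=True)  # train mode: fit and dump
    reused = get_scaler(test_values, path, oblige_fit=False)  # test mode: load the dump

    train_mean = float(fitted.mean_[0])
    reused_mean = float(reused.mean_[0])

print(train_mean, reused_mean)
```

In test mode the scaler keeps the statistics learned on train, so test features are transformed exactly the way the model saw them during training.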

In [None]:
que_proc = QueProc(oblige_fit, dump_path)
que_data = que_proc.transform(data['que'], tags)

In [None]:
stu_proc = StuProc(oblige_fit, dump_path)
stu_data = stu_proc.transform(data['stu'], data['que'], data['ans'])

pro_proc = ProProc(oblige_fit, dump_path)
pro_data = pro_proc.transform(data['pro'], data['que'], data['ans'])

### Additional data computation

The general solution to the problem is to build a classifier which, for a given question and professional, predicts whether the professional will answer the question  
To train the classifier on this binary classification task, we need both positive and negative samples  
The positive ones are easy to obtain: we can compute them directly from the data

In [None]:
# construct dataframe used to extract positive pairs
df = data['que'].merge(data['ans'], left_on='questions_id', right_on='answers_question_id') \
    .merge(data['pro'], left_on='answers_author_id', right_on='professionals_id') \
    .merge(data['stu'], left_on='questions_author_id', right_on='students_id')
# select only relevant columns
df = df[['questions_id', 'students_id', 'professionals_id']]
# extract positive pairs themselves
pos_pairs = list(df.itertuples(index=False))
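
To see what the merge chain produces, here is the same sequence of joins on toy tables. The ids are made up; the column names follow the ones used above.

```python
import pandas as pd

# Toy tables: q2 has no answers, p3 never answered anything
que = pd.DataFrame({'questions_id': ['q1', 'q2'],
                    'questions_author_id': ['s1', 's2']})
ans = pd.DataFrame({'answers_question_id': ['q1', 'q1'],
                    'answers_author_id': ['p1', 'p2']})
pro = pd.DataFrame({'professionals_id': ['p1', 'p2', 'p3']})
stu = pd.DataFrame({'students_id': ['s1', 's2']})

# Inner joins keep only questions that actually received an answer
pairs = (
    que.merge(ans, left_on='questions_id', right_on='answers_question_id')
       .merge(pro, left_on='answers_author_id', right_on='professionals_id')
       .merge(stu, left_on='questions_author_id', right_on='students_id')
)
pairs = pairs[['questions_id', 'students_id', 'professionals_id']]
toy_pos_pairs = list(pairs.itertuples(index=False))

print(toy_pos_pairs)
```

Because `merge` defaults to an inner join, the unanswered question `q2` and the inactive professional `p3` never show up among the positive pairs.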

In [None]:
# mapping from a professional's id to their registration date. Used in batch generator
pro_dates = dict(zip(data['pro']['professionals_id'], data['pro']['professionals_date_joined']))