## First steps

We love neural networks, so we decided to build the recommendation system by framing it as a supervised learning problem

### Imports

In [None]:
import pandas as pd

### Set some global parameters

In [None]:
data_path = '../../data/' # path to folder with initial .csv data files
dump_path = 'dump/' # path to all the dump data, like saved models, calculated embeddings etc.
split_date = '2019-01-01' # date used for splitting data into train and test subsets

### Read the data and split it into train and test sets

In [None]:
train, test = dict(), dict() # dictionaries with data split in train and test subsets

In [None]:
for var, file_name, date_col in [
    ('que', 'questions.csv', 'questions_date_added'),
    ('ans', 'answers.csv', 'answers_date_added'),
    ('pro', 'professionals.csv', 'professionals_date_joined'),
    ('stu', 'students.csv', 'students_date_joined'),
]:
    df = pd.read_csv(data_path + file_name) # read the data
    df[date_col] = pd.to_datetime(df[date_col]) # convert it to datetime64 type

    train[var] = df[df[date_col] < split_date] # train keeps only rows dated before split_date, so nothing later leaks into it
    test[var] = df # test mode needs all the data
    
tag_que = pd.read_csv(data_path + 'tag_questions.csv')
tags = pd.read_csv(data_path + 'tags.csv').merge(tag_que, left_on='tags_tag_id', right_on='tag_questions_tag_id')

## Train step

In [None]:
data = train

### doc2vec

In [1]:
from doc2vec import pipeline as pipeline_d2v # pipeline for training and saving doc2vec embeddings
                                             # of professionals' industries and questions' tags

*some text*

### Feature engineering and data preprocessing

Some of the main points of our data preparation:  
- Questions and professionals are the two main entities in our recommendation system
- Question features are designed to be time-independent, while some student and professional features inevitably depend on time. This is why we need to compute student and professional feature vectors for each moment at which they change. These moments correspond to the appearance of a new answer
- Student features will be appended to question features later on, just before they are passed to the model  

So, there are three entities in our dataset for which we compute features separately: question, student and professional. Each of the three resulting DataFrames contains the features for each id at different timestamps
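A minimal sketch of how a feature vector "at a given moment" can be looked up from such a per-timestamp table. The column names (`id`, `time`, `n_answers`) and the use of `pandas.merge_asof` are assumptions made for illustration; the processors used in this notebook may organize the lookup differently.

```python
import pandas as pd

# Hypothetical feature table: one row per (id, time) at which the features changed
features = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b'],
    'time': pd.to_datetime(['2018-01-01', '2018-03-01', '2018-06-01', '2018-02-01']),
    'n_answers': [0, 1, 2, 0],
})

# Queries: which feature row was valid for this id at this moment?
queries = pd.DataFrame({
    'id': ['a', 'b'],
    'time': pd.to_datetime(['2018-04-15', '2018-05-01']),
})

# merge_asof requires both frames to be sorted by the "on" key;
# direction='backward' picks the latest feature row with time <= query time
looked_up = pd.merge_asof(
    queries.sort_values('time'),
    features.sort_values('time'),
    on='time', by='id', direction='backward',
)
print(looked_up)
```

Here the query for `a` in mid-April picks up the row written on 2018-03-01, because the June row did not exist yet at that moment.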

data_structure.jpg

We distinguish three main types of features: categorical, numerical and datetime-like
- For a categorical feature, we consider only its N most popular categories. We encode them via LabelEncoder with labels from 0 to N-1; all remaining categories are encoded as N, and NaNs as N+1. Later, in the model, we train an embedding for every label of each categorical feature
- For a numerical feature, we fill its NaNs with either zero or the mean and then standardize it via StandardScaler
- From a datetime-like feature we extract three new features: absolute time, and the sine and cosine of the scaled day of the year. We then treat these three features just like numerical ones
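The three rules above can be sketched as follows. This is an illustration under stated assumptions, not the code inside `processors`: it uses a plain dict mapping instead of LabelEncoder, computes the standardization by hand (the same formula StandardScaler applies), and assumes that "scaled day of the year" means mapping the day onto one full period of 2π. All function names are hypothetical.

```python
import numpy as np
import pandas as pd

def encode_top_n(series, n):
    """Map the n most frequent categories to 0..n-1, other categories to n, NaN to n+1."""
    top = series.value_counts().index[:n]        # value_counts ignores NaN by default
    mapping = {cat: i for i, cat in enumerate(top)}
    codes = series.map(mapping)                  # rare categories and NaN become NaN here
    codes = codes.where(codes.notna(), n)        # everything unmapped -> n
    codes = codes.where(series.notna(), n + 1)   # original NaN -> n + 1
    return codes.astype(int)

def standardize(series, fill='mean'):
    """Fill NaNs with the mean (or zero), then scale to zero mean and unit variance."""
    filled = series.fillna(series.mean() if fill == 'mean' else 0.0)
    return (filled - filled.mean()) / filled.std(ddof=0)  # ddof=0 matches StandardScaler

def datetime_features(series):
    """Absolute time in seconds plus sine/cosine of the day of the year."""
    seconds = (series - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
    angle = 2 * np.pi * series.dt.dayofyear / 365.25  # assumed scaling of day of year
    return pd.DataFrame({'time': seconds,
                         'doy_sin': np.sin(angle),
                         'doy_cos': np.cos(angle)})

locations = pd.Series(['NY', 'NY', 'CA', 'TX', np.nan, 'CA', 'NY'])
codes = encode_top_n(locations, n=2)       # NY -> 0, CA -> 1, TX -> 2, NaN -> 3
lengths = standardize(pd.Series([10.0, np.nan, 30.0, 20.0]))
dates = datetime_features(pd.Series(pd.to_datetime(['2018-01-01', '2018-07-01'])))
```

The sine/cosine pair keeps late December and early January close to each other, which a raw day-of-year number would not. The `time` column would then be standardized like any other numerical feature.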

In [None]:
oblige_fit = True # whether to fit a new StandardScaler (for numerical features)
                  # or LabelEncoder (for categorical ones), or to reuse an existing one if available
                  # True in train mode, False in test

In [None]:
from processors import QueProc, StuProc, ProProc # data preprocessors for questions, students and professionals

Question's features
- Numerical
    - Question's body length in symbols
- Datetime-like
    - Date question was added
- Averaged question's tags embeddings pre-trained via doc2vec

In [None]:
que_proc = QueProc(oblige_fit, dump_path)
que_data = que_proc.transform(data['que'], tags)

In [None]:
que_data

Student's features
- Categorical
    - Location
    - State - extracted from location
- Numerical
    - Number of asked questions
    - Average question age - time between question's date added and first answer
    - Average asked question body length
    - Average body length of answers to student's questions
    - Average number of answers to student's questions
- Datetime-like
    - Student's date joined
    - Time of previous student's question

In [None]:
stu_proc = StuProc(oblige_fit, dump_path)
stu_data = stu_proc.transform(data['stu'], data['que'], data['ans'])

In [None]:
stu_data

Professional's features
- Categorical
    - Industry
    - Location
    - State - extracted from location
- Numerical
    - Number of answered questions
    - Average answered question's body length
    - Average answer's body length
- Datetime-like
    - Professional's date joined
    - Time of previous professional's answer
- Industry embedding pre-trained via doc2vec

In [None]:
pro_proc = ProProc(oblige_fit, dump_path)
pro_data = pro_proc.transform(data['pro'], data['que'], data['ans'])

In [None]:
pro_data

### Additional data computation

The general solution to the problem is to build a classifier which, for a given question and professional, predicts whether the professional will answer the question  
Then we will partially reuse it to find similar questions, make recommendations, etc.  
To train the classifier on this binary classification task, we need both positive and negative samples  
Positive samples are easy to obtain: they are the question-professional pairs in which the professional answered the question, so we can compute them directly from the data

In [None]:
# construct dataframe used to extract positive pairs
df = data['que'].merge(data['ans'], left_on='questions_id', right_on='answers_question_id') \
    .merge(data['pro'], left_on='answers_author_id', right_on='professionals_id') \
    .merge(data['stu'], left_on='questions_author_id', right_on='students_id')
# select only relevant columns
df = df[['questions_id', 'students_id', 'professionals_id']]
# extract positive pairs themselves
pos_pairs = list(df.itertuples(index=False))

In [None]:
# mapping from professional's id to their registration date. Used in batch generator
pro_dates = dict(zip(data['pro']['professionals_id'], data['pro']['professionals_date_joined']))

### Batch generator and model

Negative samples are a bit trickier. The logic of sampling negative question-professional pairs is implemented in the batch generator. Its key points are:  
- To determine the exact feature vectors of both students and professionals, we need the concept of current time
- For positive professional-question pairs, the current time is the time of the answer that connects the given question and professional
- For negative pairs, we sample the current time as a random shift from the question's added date  

So, to sample a negative pair, we choose a random question, sample a random current time, and sample a random professional among those who were registered at the current time and do not form a positive pair with the selected question  
Finally, for a given tuple of question, student and professional, we use their features at the current time
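The negative sampling logic described above can be sketched as follows. This is not the code of `BatchGenerator`: the function name, the uniform time shift, the `max_shift_days` and `max_tries` parameters, and the toy ids are all assumptions made for illustration.

```python
import random
import pandas as pd

def sample_negative(que_dates, pro_dates, pos_pairs, rng,
                    max_shift_days=30, max_tries=100):
    """Return (question_id, professional_id, current_time) for one negative pair.

    que_dates: dict question_id -> date the question was added
    pro_dates: dict professional_id -> date the professional joined
    pos_pairs: set of (question_id, professional_id) positive pairs
    """
    for _ in range(max_tries):
        que = rng.choice(list(que_dates))
        # current time = question date plus a random shift (uniform shift is an assumption)
        current_time = que_dates[que] + pd.Timedelta(days=rng.uniform(0, max_shift_days))
        # keep professionals already registered at current_time
        # that do not form a positive pair with this question
        candidates = [pro for pro, joined in pro_dates.items()
                      if joined <= current_time and (que, pro) not in pos_pairs]
        if candidates:
            return que, rng.choice(candidates), current_time
    raise RuntimeError('could not sample a negative pair')

que_dates = {'q1': pd.Timestamp('2018-01-10'), 'q2': pd.Timestamp('2018-05-01')}
pro_dates = {'p1': pd.Timestamp('2018-01-01'), 'p2': pd.Timestamp('2018-03-01'),
             'p3': pd.Timestamp('2018-06-15')}
pos_pairs = {('q1', 'p1')}

que, pro, current_time = sample_negative(que_dates, pro_dates, pos_pairs,
                                         rng=random.Random(0))
```

In this toy data `q1` never yields a negative pair: its only registered professional within the time window is `p1`, who already answered it, so the sampler retries until it draws `q2`.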

In [None]:
from generator import BatchGenerator # class to ingest data from pre-processed DataFrames to model in form of batches of NumPy arrays

In [None]:
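# NOTE: nonneg_pairs is not defined earlier in this notebook and must be created before this cell;
# judging by its name, it is assumed to hold the pairs that must not be sampled as negatives
bg = BatchGenerator(que_data, stu_data, pro_data, 64, pos_pairs, nonneg_pairs,
                    que_proc.pp['questions_date_added_time'], pro_dates)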

Finally, we come to the model

model.jpg

*some text*

In [None]:
from models import (
    Mothership, # main model which combines two encoders (for questions and professionals)
    Adam,       # and Keras optimizer to train it
)

In [None]:
model = Mothership(
    # -2 is for id and time columns, +1 is for current time feature
    que_dim=len(que_data.columns) - 2 + len(stu_data.columns) - 2 + 1,
    que_input_embs=[102, 42], que_output_embs=[2, 2],
    pro_dim=len(pro_data.columns) - 2 + 1,
    pro_input_embs=[102, 102, 42], pro_output_embs=[2, 2, 2],
    inter_dim=10,
)

In [None]:
model.compile(Adam(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
model.fit_generator(bg, epochs=10, verbose=2)

In [None]:
model.save_weights(dump_path + 'model.h5')

### Model evaluation

In [None]:
from evaluation import (
    permutation_importance, # calculate model feature importance via random permutations of feature values
    plot_fi,                # and nicely plot it
)

*some text*

In [None]:
# dummy batch generator used to extract single big batch of data to calculate feature importance
bg = BatchGenerator(que_data, stu_data, pro_data, 512, pos_pairs, nonneg_pairs,
                    que_proc.pp['questions_date_added_time'], pro_dates)
# dict with descriptions of feature names, used for visualization of feature importance
fn = {"que": list(stu_data.columns[2:]) + list(que_data.columns[2:]) + ['que_current_time'],
      "pro": list(pro_data.columns[2:]) + ['pro_current_time'],
      'text': [f'que_emb_{i}' for i in range(10)] + [f'pro_emb_{i}' for i in range(10)]}

In [None]:
# calculate and plot feature importance
fi = permutation_importance(model, bg[0][0][0], bg[0][0][1], bg[0][1], fn, n_trials=3)

In [None]:
plot_fi(fi, fn)
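For readers unfamiliar with the technique, here is a minimal sketch of permutation importance on a toy predictor. It is not the `permutation_importance` function from `evaluation`: that one takes two model inputs and a dict of feature names, while this simplified version assumes a single feature matrix, a `predict` callable returning probabilities, and accuracy as the metric.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_trials=3, seed=0):
    """Importance of a feature = mean accuracy drop after shuffling its column."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) > 0.5) == y)  # accuracy on intact data
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_trials):
            X_perm = X.copy()
            # shuffle one column to break its link with the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(base - np.mean((predict(X_perm) > 0.5) == y))
        importances[col] = np.mean(drops)
    return importances

# Toy data: the label depends only on the first of three features
X = np.random.default_rng(1).normal(size=(500, 3))
y = X[:, 0] > 0

def toy_predict(X):
    # stand-in for a trained model: a sigmoid that looks only at feature 0
    return 1.0 / (1.0 + np.exp(-5.0 * X[:, 0]))

importances = permutation_importance_sketch(toy_predict, X, y)
```

Shuffling the first column destroys the prediction quality, while shuffling the other two changes nothing because the toy predictor never reads them.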

## Test

## Usage

## Future plans