Clue-WATTx hackathon
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
clustering
src
test
.dockerignore
.gitignore
Dockerfile
LICENSE
README.md
clusters+access_probability.pdf
dataset_analysis.ipynb
requirements.txt
run.sh
submit.py

README.md

clue-hackathon

This repository is what we used at Clue-WATTx hackathon, which won the best performing method.

The submission and starting point of train scripts are copied from the above repository.

IMPORTANT: Do NOT add the sample or the synthesized data to this repository or any clones. The data is the property of the Clue company (BioWink GmbH), and I believe we do not have the permission to publish it publicly. It is safe to copy the data under a "data" folder in a locally cloned repo, since it is included in the .gitignore file and won't automatically be added to the repo.

Installation

You can run the code using a python virtual environment (http://docs.python-guide.org/en/latest/dev/virtualenvs/).

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt
./run.sh

Data Description

The dataset includes information about users, their cycles, and recorded symptoms within each cycle. The data includes the following sections:

  • Users: Demographic information about users, including age, country, etc.
  • Active Days: consists of (user id, date, cycle id, day in cycle) tuples, showing days in which the user has had some interaction with the app.
  • Cycles: includes information about cycles, such as starting date, length, and cycle id
  • Tracking: this is the largest part of the dataset, including one row per (user id, cycle id, day in cycle, category, symptom).

There are almost 80 symotoms, which are in groups of four, each group called a category. In this analysis category is completely ignored, since symptom names are unique.

We approach the task as a regression problem, and not a classification problem. Then we take the predicted values as probabilities and export them as output. One supporting argument is that the recorded symptoms are simply there when the symptom as passed a certain threshold, and the symptoms themselves behave in a continues way and are not discrete phenomena.

Data Preparation

The data is transformed such that there is one row per user at the end. The tracking data is converted into columns, each column name being a (symptom, day in cycle) tuple. To better map cycles and day in cycles to eachother, all sycles are scaled into a 29 day cycle, which is the median cycle length. This is done by this transformation:

day in cycle => (day in cycle / cycle length) * 29

and the cycle length is fetched from the cycles table. This results in a matrix which has rows as the number of users, and over 2500 (number of symptoms times 29) columns. At the end we attach the median number of active days per cycle for each user, to this matrix.

X and Y

After maping all the cycles to the same length, we aggregate the data on cycle id, and count the number of times a user has reported a symptom on that each day of the cycle. This results in a NaN for days in the cycle for which the user has not ever reported the respective symptom, and an integer otherwise, reflecting the number of times the user has recorded that symptom on that day of the cycle. All the NaN values are later replaced by zeros.

Before creating the matrix, we exclude the last cycle of each user from the data, which gives us our input matrix, or X. And the data representing the last cycle results in a similar matrix, which we use as the output, or Y. Please note that the output mostly includes only ones and zeros, since the user has recorded a symptom on that day of the last cycle or not, unless the user has interacted with the app about a symptom more than once for the same recorded day, or that after scaling the cycle, two days are mapped into the same day.

The respective code for data preparation is done by process_level2(.) function in pre_process.py file.

Methods

Once the data is preprocessed and we have our X and Y matrices, we train a model for each symptom. This means the model takes the X matrix as input, and gives an output having dimensions (n_users, 29). If the model doesn't support a vector as an output, we need to train 29 separate models per symptom, and due to the computational cost of doing so we avoid such models.

The model is in this case a scikit-learn pipeline:

pipeline = Pipeline([
    ('remove_low_variance_features', VarianceThreshold(threshold=0.0)),
    ('standard_scale', StandardScaler()),
    ('estimator', Lasso()),
])

param_grid = {'estimator__alpha': [.1, .3, .5, .7, .8]}
model = GridSearchCV(pipeline, param_grid = param_grid, n_jobs = 4,
                     verbose=2)
model.fit(X, Y)

Y being the output for only one symptom. The actual code is found in pipeline.py file. The pipeline can be changed easily to set the desired preprocessing and/or prediction model.

The model returns 29 numbers per user per symptom, but we need to map that back to the expected cycle length. It's done using a polynomial interpolation, which you can find in dump_results.py file as:

def dump_cycle(f, user, ps, symptom, cl):
    """ Takes predicted values, dumps appropriate results accordingly.
    :param f: output file
    :param user: user id
    :param ps: predictions for this user
    :param symptom: symptom for which these predictions are for
    :param cl: expected cycle length
    """
    y = ps
    y[np.isnan(y)] = 0
    x = np.array(list(range(len(ps))))
    xx = np.linspace(x.min(), x.max(), cl)
    itp = interp1d(x, y, kind='linear')
    window_size, poly_order = 5, 3
    yy_sg = savgol_filter(itp(xx), window_size, poly_order)
    for i in range(int(cl)):
        lp = np.max([.001, np.min([yy_sg[int(i)], .99])])
        f.write("%s,%d,%s,%g\n" % (user, i, symptom, lp))

Results

We compared and submitted a few models in our evaluations. At the end, the best performing model was a LASSO and a grid search on its alpha parameter. Other evaluated models include:

  • Linear Regression
  • Decision Trees with a grid search on its depth and max number of used input features
  • Random Forests
  • Support Vector Machines with an RBF kernel and grid search on the gamma parameter of the kernel and the C of the optimization problem (computation was too expensive and never finished in time)

Potential Future Work

One important question dealing with this type of data is weather to train one single model for all the users, or a single model per user. Arguably clustering the users and training a model per cluster can be considered a middle ground. We tried clustering the users beforehand to train a model for each cluster, but using the computed clusters is something we didn't have time to put in our pipeline. The resulting related codes are located in clustering directory. We decided against training a model per user for two reasons:

  • computationally intensive
  • not enough data per user

An alternative approach is to train a general model for all the users, and then do further fitting for each given user. This uses the knowledge we have from the population, and then fits the model a bit further to the user. Some possible ways to do so are:

  • Train a Gaussian process on the whole data, then further train the GP (for only a few iterations if it's an iterative method) given only the data from the user. One important detail to consider is that GPs are usually prune to overfitting unless they are sparse, or the covariance function imposes sparsity, or a feature selection is done prior to the GP fit.
  • scikit-learn supports incremental fit for some models. The same strategy as explained above can be taken using these models. More information about partial_fit and the models can be found here

There are also a few tasks that we could do on the preprocessing side of the code:

  • The code in this repository assumes 1 whenever a symptom is tracked, and 0 otherwise. Then we sum over cycles of the user to calculate the input. Instead of taking the sum, we could take the average or the median of those values and see if it improves the performance. The downside of what we do is that users with more recorded cycles have higher values in their feature vector solely because they've been on the app longer, and not necessarily because they've been more active.
    • UPDATE: this did not improve the performance
  • At the moment we scale the whole cycle linearly into a 29 day cycle. Whilst it might make sense for some of the symptoms, other symptoms might show a different behavior. For instance, it might be better to keep the last 2 weeks of the cycle as it is, and scale the rest of it linearly into 15 days. There is some evidence that this is a better transformation for many symptoms. One idea is to transform all the symptoms using both explained approaches, and then for each symptom, test which transformation makes the population closer to each other. The transformation resulting in less variance among the population is probably a better fit.
  • The StandardScaler used in our code includes all the zeros in the estimated mean and variance before transformation, which in such a sparse data as we have in this task is wrong. As long as the prediction model does not assume having normally distributed input variables, this preprocessing step can be removed.
    • UPDATE: this did not improve the performance
  • compare a constant small output to other models
    • UPDATE (1): this was implemented, but got failed and never got debugged (competition over). It is the DummyModel in dump_results.py file.
    • UPDATE (2): according to the organizers a model returning a "small" constant value gives a score close to 0.09, whereas our model's score was 0.065.

A final remark: inside clusters+access_probability.pdf in the last part we tried to recreate profiles of click of a specific symptom given the fact that the user accessed the app. Such analysis it is really valuable as it shows the real behaviour of the sympthoms over time, and it can become especially powerful whenever users are clustered in different groups.