# Federated Learning

## Frecency Sampling

Prototyping the federated learning algorithm quickly requires a dataset.
This notebook uses a synthetic frecency dataset that was designed to be easy to interpret while staying close to the actual data.
The data generation assumes that the current frecency algorithm is perfect. By sampling under this assumption, we can check whether the learning algorithm actually works.

In [1]:
import numpy as np
from random import random

These are weights that describe how common certain features are.

In [2]:
type_weights = {
    "visited": 0.6,
    "typed": 0.2,
    "bookmarked": 0.1,
    "other_type": 0.1
}

In [3]:
recency_weights = {
    "4-days": 0.03,
    "14-days": 0.05,
    "31-days": 0.1,
    "90-days": 0.32,
    "other_recency": 0.5
}

A one-hot representation makes it easier to implement the rest of the formulas. NumPy lets us generate it easily by indexing the rows of the identity matrix with the sampled class indices.

In [17]:
def one_hot(num_choices, vector):
    return np.eye(num_choices)[vector]
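
As a small illustration of the indexing trick above, the following sketch encodes three class indices. The index values are arbitrary and chosen only for this example.

```python
import numpy as np

def one_hot(num_choices, vector):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing with an integer array selects one row per sample.
    return np.eye(num_choices)[vector]

indices = np.array([2, 0, 3])   # arbitrary example class indices
encoded = one_hot(4, indices)
print(encoded)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```

Taking `encoded.argmax(axis=1)` gives back the original indices, which is a convenient sanity check.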

In [18]:
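def sample_url_features(num_samples):
    # NOTE: `weights` is not defined in any cell shown here; the (3, 20)
    # output shape further down implies a dict with 20 entries.
    num_choices = len(weights)
    # list(...) is required on Python 3, where dict.values() returns a view
    choice_weights = list(weights.values())
    samples = np.random.choice(num_choices, num_samples, p=choice_weights)
    return one_hot(num_choices, samples)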

In [5]:
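def sample_weighted(num_samples, weight_dict):
    num_choices = len(weight_dict)
    # list(...) is required on Python 3, where dict.values() returns a view
    choice_weights = list(weight_dict.values())
    # Draw class indices according to the given probabilities
    samples = np.random.choice(num_choices, num_samples, p=choice_weights)
    return one_hot(num_choices, samples)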
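
To check that weighted sampling behaves as intended, the sketch below draws many samples from `type_weights` and compares the empirical column means with the configured probabilities. The fixed seed and the sample size of 100,000 are assumptions made only for this illustration.

```python
import numpy as np

type_weights = {"visited": 0.6, "typed": 0.2, "bookmarked": 0.1, "other_type": 0.1}

def sample_weighted(num_samples, weight_dict):
    num_choices = len(weight_dict)
    # list(...) is needed on Python 3, where dict.values() returns a view
    choice_weights = list(weight_dict.values())
    samples = np.random.choice(num_choices, num_samples, p=choice_weights)
    return np.eye(num_choices)[samples]

np.random.seed(0)  # fixed seed, only to make this sketch reproducible
X_type = sample_weighted(100000, type_weights)

# Each row is one-hot, so the column means are the empirical class frequencies
empirical = X_type.mean(axis=0)
for name, freq in zip(type_weights, empirical):
    print(name, round(float(freq), 3))
```

With this many samples the empirical frequencies should land close to 0.6, 0.2, 0.1 and 0.1.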

The weights in `frecency_points_dict` below are the ones found in the current frecency algorithm. With one-hot encoded features, frecency is just a linear function of the input.

In [6]:
def sample_type(num_samples):
    return sample_weighted(num_samples, type_weights)

In [378]:
frecency_points_dict = {
    "visited": 1.2,
    "typed": 2,
    "bookmarked": 1.4,
    "other_type": 0,
    "4-days": 100,
    "14-days": 70,
    "31-days": 50,
    "90-days": 30,
    "other_recency": 10
}

In [7]:
def sample_recency(num_samples):
    return sample_weighted(num_samples, recency_weights)

In [379]:
# To make sure that the order of keys is the same everywhere
key_order = recency_weights.keys() # type_weights.keys() + recency_weights.keys()
frecency_points = np.array([frecency_points_dict[key] for key in key_order])

In [380]:
def frecency(x):
    return x.dot(frecency_points)
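
The linearity can be made concrete with a short worked example. This sketch assumes the feature layout from the commented-out key order (type keys followed by recency keys, concatenated into one vector); the `encode` helper is hypothetical and exists only for this illustration.

```python
import numpy as np

frecency_points_dict = {
    "visited": 1.2, "typed": 2, "bookmarked": 1.4, "other_type": 0,
    "4-days": 100, "14-days": 70, "31-days": 50, "90-days": 30,
    "other_recency": 10,
}

# Assumed layout: type features first, then recency features
key_order = ["visited", "typed", "bookmarked", "other_type",
             "4-days", "14-days", "31-days", "90-days", "other_recency"]
frecency_points = np.array([frecency_points_dict[k] for k in key_order], dtype=float)

def encode(url_type, recency):
    # Hypothetical helper: set one type bit and one recency bit
    x = np.zeros(len(key_order))
    x[key_order.index(url_type)] = 1.0
    x[key_order.index(recency)] = 1.0
    return x

X_example = np.stack([
    encode("visited", "other_recency"),  # 1.2 + 10
    encode("typed", "4-days"),           # 2 + 100
])
scores = X_example.dot(frecency_points)
print(scores)
```

Under this layout the dot product simply adds the type points and the recency points of each sample, which matches values such as 11.2 in the outputs further down.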

In [20]:
sample_url_features(3).shape

(3, 20)

Finally, we sample from the above distributions and then call the frecency function.

In [381]:
def sample(num_samples):
    X = sample_url_features(num_samples)
    y = frecency(X)
    return X, y

## Linear Regression

In [351]:
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet

In [382]:
X, y = sample(1000000)

In [385]:
n = 10000000
noise = np.random.normal(0, 1, size=(n))
X, y = sample(n)
#y += noise

In [386]:
model = Ridge(fit_intercept=False)

In [387]:
model.fit(X, y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=False, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
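
For intuition about what this fit computes, the sketch below solves the same ridge objective in closed form with NumPy only: without an intercept, the coefficients are w = (XᵀX + αI)⁻¹ Xᵀy. It is not the solver scikit-learn uses internally. The concatenated type-plus-recency feature layout, the seed, and the sample size are assumptions made for this illustration.

```python
import numpy as np

type_probs = [0.6, 0.2, 0.1, 0.1]
recency_probs = [0.03, 0.05, 0.1, 0.32, 0.5]
# Points in the assumed column order: 4 type features, then 5 recency features
true_points = np.array([1.2, 2, 1.4, 0, 100, 70, 50, 30, 10], dtype=float)

np.random.seed(0)  # fixed seed, only to make this sketch reproducible
n = 100000
type_idx = np.random.choice(4, n, p=type_probs)
recency_idx = np.random.choice(5, n, p=recency_probs)
X = np.hstack([np.eye(4)[type_idx], np.eye(5)[recency_idx]])
y = X.dot(true_points)

alpha = 1.0  # same regularization strength as Ridge's default
d = X.shape[1]
# Closed-form ridge solution without an intercept
w_ridge = np.linalg.solve(X.T.dot(X) + alpha * np.eye(d), X.T.dot(y))

pred = X.dot(w_ridge)
print("max abs prediction error:", np.max(np.abs(pred - y)))
```

Because each one-hot block sums to one, the individual coefficients are only determined up to a constant shifted between the two blocks, so comparing predictions is more meaningful here than comparing coefficients with the original points.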

In [375]:
y

array([ 10. ,  11.2,  12. , ...,  31.2,  71.2,  51.2])

In [376]:
model.predict(X)

array([  9.99998722,  11.20001056,  12.00000027, ...,  31.20000638,
        71.1999236 ,  51.19998524])

In [366]:
X.dot(frecency_points)

array([  31.4,   31.2,  101.2, ...,   72. ,   32. ,  101.2])

In [339]:
model.score(X, y)

0.9977874144422556

In [336]:
model.predict(X)

array([ 31.19844101,  30.00479431,  51.20031098, ...,  31.19844101,
        11.20040165,  10.00675495])

In [337]:
y

array([ 30.826925  ,  28.70856841,  50.6918614 , ...,  31.49069645,
        10.50982997,  11.93372446])