The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

data=pd.read_csv('/datasets/insurance_us.csv')
print(data.head())
print(data.info())

   Gender   Age   Salary  Family members  Insurance benefits
0       1  41.0  49600.0               1                   0
1       0  46.0  38000.0               1                   1
2       0  29.0  21000.0               0                   0
3       0  21.0  41700.0               2                   0
4       1  28.0  26100.0               0                   0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
None


## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline, wrap it in dollar signs \$; to display it on its own line, use double dollar signs \$\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
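The training formula can be tried out numerically. The sketch below is a minimal illustration on synthetic data (the array sizes, the seed, and the names `features`, `true_w` and `w_lstsq` are assumptions made for this example, not part of the task); it compares the closed-form weights with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Synthetic data (an assumption for illustration, not the insurance dataset)
n_samples, n_features = 200, 3
features = rng.normal(size=(n_samples, n_features))
true_w = np.array([0.5, 2.0, -1.0, 3.0])  # zeroth element is the intercept

# Prepend a column of ones so that w[0] plays the role of the intercept
X = np.hstack([np.ones((n_samples, 1)), features])
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions: a = Xw
a = X @ w

print(np.allclose(w, w_lstsq))
```

In practice a solver such as `np.linalg.lstsq` is usually preferred over an explicit inverse for numerical stability; the explicit form is used here only because it mirrors the formula.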

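**Answer:** Multiplying the features by an invertible matrix $P$ does not change the predictions of linear regression.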

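Weights for the original features:

$$
w = (X^T X)^{-1} X^T y
$$

Weights for the transformed features $XP$:

$$
w_1 = ((XP)^T XP)^{-1} (XP)^T y
$$
$$
w_1 = (P^T X^T X P)^{-1} P^T X^T y
$$

$P^T$, $X^T X$ and $P$ are square and invertible, so the rule $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ applies with $A = P^T$, $B = X^T X$, $C = P$. The matrix $X$ itself is generally not square, so $(X^T X)^{-1}$ cannot be split any further.

$$
w_1 = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$
$$
w_1 = P^{-1} (X^T X)^{-1} X^T y
$$
$$
w_1 = P^{-1} w
$$

Predictions for the transformed features:

$$
a_1 = XPw_1
$$
$$
a_1 = XPP^{-1}w
$$
$$
a_1 = Xw
$$
$$
a_1 = a
$$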

**Justification:**

The product of an invertible matrix and its inverse is the identity matrix: $(P^T)^{-1} P^T = I$ and $P P^{-1} = I$. Substituting these into the equations above gives $a_1 = a$, which means that the predictions ($a$ and $a_1$) remain the same when the features are multiplied by an invertible matrix. The weights themselves do change: $w_1 = P^{-1} w$.
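The result can also be checked numerically. The following is a minimal sketch on synthetic data; the matrix `P` is a small hand-picked invertible matrix and `normal_equation` is a hypothetical helper, both introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Synthetic feature matrix and target (assumptions for illustration)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Hand-picked invertible matrix; its determinant is 5
P = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

def normal_equation(X, y):
    """Solve (X^T X) w = X^T y, which is equivalent to the training formula."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = normal_equation(X, y)        # weights for the original features
w1 = normal_equation(X @ P, y)   # weights for the transformed features

a = X @ w
a1 = (X @ P) @ w1

print(np.allclose(w1, np.linalg.inv(P) @ w))  # checks w1 = P^{-1} w
print(np.allclose(a1, a))                     # checks a1 = a
```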


## 3. Transformation algorithm

**Algorithm**

In [None]:
# Create a random 4x4 matrix (one row and one column per feature)

P = np.random.normal(size=(4, 4))
print(P)

# Check that the matrix is invertible: np.linalg.inv raises LinAlgError for a singular matrix
np.linalg.inv(P)

features = data.drop('Insurance benefits', axis=1)
target = data['Insurance benefits']

# Mask the original data by multiplying the features by the invertible matrix
X1 = np.dot(features, P)
print(X1)

[[ 0.14068234  1.39989287  0.09900651 -0.80568628]
 [-0.56985407 -0.57874414  0.28882811 -0.56369694]
 [ 1.1240096   0.02333567 -0.5541411   2.40318778]
 [-1.34731741 -1.02852514  0.21619226  0.77710822]]
[[ 55726.30572678   1134.09214364 -27473.24148037 119174.97357161]
 [ 42684.80436287    859.10474551 -21043.85956964  91295.98256355]
 [ 23587.67592451    473.26551264 -11628.58711519  50450.59609943]
 ...
 [ 38089.83387331    777.44731661 -18779.17439221  81458.34590768]
 [ 36738.6760048     748.65839075 -18113.31221551  78573.36460371]
 [ 45617.62739003    931.59477773 -22489.72633277  97553.61164154]]


**Justification**

As shown in section 2, multiplying the features by an invertible matrix does not change the predictions. The algorithm therefore generates a random matrix $P$, checks that it is invertible, and multiplies the feature matrix by it. This masks the original data while leaving the predictions unaffected.
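Because $P$ is invertible, the masking can be undone by anyone who holds $P$, so the matrix acts as a key and should be kept secret. The sketch below illustrates this on a small synthetic table; the helper `random_invertible_matrix`, its `max_cond` threshold, and the generated values are assumptions made for this example and are not taken from the dataset.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e3):
    """Draw Gaussian matrices until one is comfortably invertible.

    max_cond is an illustrative threshold on the condition number.
    """
    while True:
        P = rng.normal(size=(n, n))
        if np.linalg.cond(P) < max_cond:
            return P

rng = np.random.default_rng(12345)  # arbitrary seed

# Synthetic stand-in for the feature table:
# gender, age, salary, family members (values are made up)
features = np.column_stack([
    rng.integers(0, 2, size=10),
    rng.integers(18, 66, size=10),
    rng.integers(200, 800, size=10) * 100.0,
    rng.integers(0, 6, size=10),
]).astype(float)

P = random_invertible_matrix(features.shape[1], rng)
masked = features @ P

# Whoever holds P can reverse the masking
restored = masked @ np.linalg.inv(P)

# atol allows for floating-point rounding
print(np.allclose(restored, features, atol=1e-6))
```

Bounding the condition number is a design choice in this sketch: a nearly singular matrix would still pass an `np.linalg.inv` call but would amplify rounding errors when the data is restored.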

## 4. Algorithm test

In [None]:
# Split the data into training and validation sets

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)

#Linear regression on the original dataset
model=LinearRegression()
model.fit(features_train, target_train)
predicted=model.predict(features_valid)
print("R2 =", r2_score(target_valid, predicted))

R2 = 0.435227571270266


In [None]:
# Split the masked data into training and validation sets

trans_features_train, trans_features_valid, trans_target_train, trans_target_valid=train_test_split(X1, target, test_size=0.25, random_state=12345)

#Linear regression on the masked dataset
model_mask=LinearRegression()
model_mask.fit(trans_features_train, trans_target_train)
predicted_mask=model_mask.predict(trans_features_valid)
print("R2_mask =", r2_score(trans_target_valid, predicted_mask))


R2_mask = 0.43522757127197365


The target was predicted from both the original and the masked data. The two R2 scores agree to 11 decimal places (0.435227571270 vs 0.435227571272); the remaining difference is floating-point rounding. This confirms that multiplying the features by an invertible matrix does not affect the quality of the predictions.

This way, the original data is masked and the quality of machine learning models doesn't suffer. 
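The same comparison can be repeated for several random matrices without the dataset at hand. The sketch below uses synthetic data and NumPy only; `fit_predict` and `r2` are hypothetical helpers written for this example, and the intercept column is left untransformed, on the assumption that this mirrors how `LinearRegression` fits its intercept separately from the features.

```python
import numpy as np

def fit_predict(X_train, y_train, X_valid):
    """Least-squares fit with an intercept, then predict on X_valid."""
    A_train = np.hstack([np.ones((len(X_train), 1)), X_train])
    A_valid = np.hstack([np.ones((len(X_valid), 1)), X_valid])
    w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_valid @ w

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(7)  # arbitrary seed

# Synthetic data (an assumption for illustration)
X = rng.normal(size=(400, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=400)
X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

baseline = r2(y_valid, fit_predict(X_train, y_train, X_valid))

# Repeat the masking with several random matrices and record the R2 gap
diffs = []
for _ in range(5):
    P = rng.normal(size=(4, 4))
    masked = r2(y_valid, fit_predict(X_train @ P, y_train, X_valid @ P))
    diffs.append(abs(masked - baseline))

print(baseline, max(diffs))
```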