## Purpose

The Sure Tomorrow insurance company wants to protect its clients' data. The task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data, and to prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. The best model does not need to be picked.

## Table of Contents
<a href='#Data downloading'>Data downloading</a>

<a href='#Multiplication of matrices'>Multiplication of matrices</a>

<a href='#Transformation algorithm'>Transformation algorithm</a>

<a href='#Algorithm test'>Algorithm test</a>

<a href='#Conclusion'>Conclusion</a>

<a id='Data downloading'></a>
## Data downloading

First, the necessary modules are imported, the data is loaded, and an initial look at it is taken.

In [1]:
#Import necessary libraries and modules
import pandas as pd
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
from scipy import stats as st
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from collections import Counter

In [2]:
df = pd.read_csv('/datasets/insurance_us.csv')

df.head()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
df.describe()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


An initial look at the data shows no missing values and no other issues that need to be addressed.

<a id='Multiplication of matrices'></a>
## Multiplication of matrices

To begin the data masking, a proof will be conducted to verify that the quality of the model will not change as a result of transforming the data. The proof will begin with the linear regression equation and compare the model quality for the base equation as well as an equation multiplied by an invertible matrix.

Denote:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the intercept)

- $E$: identity matrix

The following matrix properties will be utilized throughout this proof:

1. $(AB)^{-1} = B^{-1}A^{-1}$ (for square, invertible $A$ and $B$)
2. $(AB)^T = B^TA^T$
3. $AA^{-1} = A^{-1}A = E$
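
As a quick sanity check, the three properties can be confirmed numerically. The sketch below is illustrative only: the matrices `A` and `B`, their size, and the seed are arbitrary assumptions and are not part of the project data.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility

# Hypothetical 4x4 matrices. Adding 4 on the diagonal makes them strictly
# diagonally dominant, which guarantees that they are invertible.
A = rng.random((4, 4)) + 4 * np.eye(4)
B = rng.random((4, 4)) + 4 * np.eye(4)
E = np.eye(4)

# Property 1: the inverse of a product is the product of inverses in reverse order
prop1 = np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# Property 2: the transpose of a product is the product of transposes in reverse order
prop2 = np.allclose((A @ B).T, B.T @ A.T)

# Property 3: a matrix times its inverse gives the identity, in either order
prop3 = np.allclose(A @ np.linalg.inv(A), E) and np.allclose(np.linalg.inv(A) @ A, E)

print(prop1, prop2, prop3)
```

The comparison uses `np.allclose` rather than exact equality because floating-point inversion introduces small rounding errors.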

**Standard LR equation**

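The linear regression predictions, denoted $a$, are:

$$
a = Xw
$$

The intercept $w_0$ does not appear as a separate term because it is the zeroth element of $w$ and the zeroth column of $X$ consists of ones.

For the standard linear regression equation, the minimum mean squared error (MSE) is attained when $w$ is equal to: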

$$
w = (X^TX)^{-1}X^Ty
$$
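
The formula for $w$ can be illustrated on synthetic data. This is a minimal sketch, assuming made-up features, weights (`true_w`), and noise that have nothing to do with the insurance dataset; it compares the closed-form solution with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, assumed for reproducibility

n, m = 200, 4  # hypothetical number of observations and features
raw_features = rng.normal(size=(n, m))

# Build X with a zeroth column of ones so that w[0] acts as the intercept
X = np.hstack([np.ones((n, 1)), raw_features])

# Hypothetical "true" weights and a noisy target, for illustration only
true_w = np.array([0.5, 1.0, -2.0, 0.3, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Closed-form solution w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_lstsq))
```

In practice a least-squares solver is usually preferred over an explicit inverse for numerical stability; the explicit form is kept here because it mirrors the equation.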

**Transformed LR equation**

For data masking, $X$ will be multiplied by an invertible matrix, $P$. The transformed feature matrix will be denoted $X'$. Therefore $X'$ is:

$$
X' = XP
$$

The predictions of the linear regression trained on the transformed features, denoted $a'$, are then:

$$
a' = X'w'
$$

where the minimum MSE occurs at:

$$
w' = (X'^TX')^{-1}X'^Ty
$$

Substituting $XP$ for $X'$ gives:

$$
w' = ((XP)^T(XP))^{-1}(XP)^Ty
$$

Using property #2, $w'$ can be rewritten as:

$$
w' = (P^TX^TXP)^{-1}P^TX^Ty
$$

Applying property #1 to the product of the three square matrices $P^T$, $X^TX$, and $P$ (this assumes $X^TX$ is invertible, which the formula for $w$ already requires), $w'$ becomes:

$$
w' = P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty
$$

With property #3, $(P^T)^{-1}P^T = E$, so $w'$ reduces to:

$$
w' = P^{-1}(X^TX)^{-1}EX^Ty = P^{-1}(X^TX)^{-1}X^Ty
$$

Since $w = (X^TX)^{-1}X^Ty$, $w'$ relates to $w$ by:

$$
w' = P^{-1}w
$$

**Proof Conclusion**

The weights for the transformed feature matrix equal the original weights multiplied on the left by the inverse of the masking matrix, $P^{-1}$. Substituting this relationship into the predictions of the transformed model gives:

$$
a' = X'w' = XPP^{-1}w = XEw = Xw = a
$$

The predictions $a$ and $a'$ are therefore identical and independent of the matrix $P$, which proves that the model's results, and hence its quality, do not change when the data is transformed.
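
The conclusion of the proof can also be checked numerically. The following is a sketch under stated assumptions: the data are synthetic, and the helper `normal_equation_weights` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, assumed for reproducibility

n, m = 200, 4  # hypothetical number of observations and features
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # zeroth column of ones
y = rng.normal(size=n)  # synthetic target, for illustration only

# Random masking matrix; the added diagonal makes it strictly diagonally
# dominant and therefore guaranteed to be invertible
P = rng.random((m + 1, m + 1)) + (m + 1) * np.eye(m + 1)


def normal_equation_weights(X, y):
    """Solve (X^T X) w = X^T y, i.e. w = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


w = normal_equation_weights(X, y)            # weights on the original features
w_prime = normal_equation_weights(X @ P, y)  # weights on the masked features

# w' = P^{-1} w
weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)

# a' = X' w' equals a = X w
preds_match = np.allclose((X @ P) @ w_prime, X @ w)

print(weights_match, preds_match)
```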

<a id='Transformation algorithm'></a>
## Transformation algorithm

The transformation algorithm is as follows:

1. Take the original features of the dataset and assign them to a matrix X.
2. Create a random matrix, P, of size m x m where m is the number of features in matrix X.
3. Check that P is invertible (that is, ensure its inverse exists).
4. Multiply matrix X by matrix P to receive matrix X', which is the transformed features matrix.
5. Split the available dataset into a training and testing dataset.
6. Create a linear regression model.
7. Train the linear regression model using the training dataset.
8. Use the trained linear regression model to predict the targets for the transformed features of the testing dataset.
9. Calculate a quality metric, such as MSE or the R2 score, for the predicted targets compared to the actual targets.

**Justification**:

Two steps in the above algorithm are critical for the masking to succeed: the random matrix P must be square, with a side equal to the number of features in the dataset, and P must be invertible. Since multiplying by P does not affect the linear regression model's predictions or quality, as shown in the proof above, the dataset features can be masked without consequence for the model. This will be shown using the actual dataset in the upcoming section.
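
Steps 1 to 4 can be sketched as a small helper. This is an illustrative sketch, not the notebook's implementation: the function name `make_masking_matrix`, the condition-number threshold `max_cond`, and the retry limit are assumptions. It also shows why invertibility matters for the data owner, since the original features can be recovered as $X = X'P^{-1}$ by anyone who holds $P$.

```python
import numpy as np


def make_masking_matrix(m, rng, max_cond=1e6, max_tries=100):
    """Draw random m x m matrices until one is safely invertible.

    A matrix is accepted when its condition number is below max_cond
    (an assumed threshold), which rules out singular and nearly
    singular matrices.
    """
    for _ in range(max_tries):
        P = rng.random((m, m))
        if np.linalg.cond(P) < max_cond:
            return P
    raise RuntimeError('No well-conditioned masking matrix found.')


rng = np.random.default_rng(3)   # arbitrary seed, assumed for reproducibility
X = rng.normal(size=(10, 4))     # synthetic stand-in for the feature matrix

P = make_masking_matrix(X.shape[1], rng)
X_masked = X @ P                 # step 4: X' = XP

# Only the holder of P can undo the masking: X = X' P^{-1}
X_recovered = X_masked @ np.linalg.inv(P)
recovered_ok = np.allclose(X_recovered, X)

print(recovered_ok)
```

Checking the condition number is a somewhat stricter test than checking that the inverse exists, because it also rejects matrices whose inverse would be numerically unreliable.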

<a id='Algorithm test'></a>
## Algorithm test

To begin testing the algorithm, the features and targets of a training and testing dataset must be determined, and the linear regression model must be created. The features matrix will be multiplied by an invertible matrix prior to splitting the dataset into a training and testing dataset to ensure they are both masked by the same matrix.
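
One detail connects this test to the proof. The proof assumes that $X$ contains a column of ones, whereas here only the raw feature columns are multiplied by $P$ and the intercept is left to `LinearRegression`, which fits it by default. The two setups are equivalent: appending a column of ones to the masked features is the same as multiplying the augmented matrix by a block-diagonal matrix built from $1$ and $P$, and that block matrix is itself invertible. The sketch below checks this on synthetic data; all names and sizes in it are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed, assumed for reproducibility

n, m = 100, 4                   # hypothetical number of observations and features
F = rng.normal(size=(n, m))     # synthetic raw features, without a ones column
ones = np.ones((n, 1))

# Invertible masking matrix for the raw features (strictly diagonally dominant)
P = rng.random((m, m)) + m * np.eye(m)

# Setup used in the test: mask the raw features, then add the ones column
masked_then_augmented = np.hstack([ones, F @ P])

# Setup used in the proof: add the ones column, then multiply by a
# block-diagonal matrix that leaves the ones column untouched
P_block = np.block([
    [np.ones((1, 1)), np.zeros((1, m))],
    [np.zeros((m, 1)), P],
])
augmented_then_masked = np.hstack([ones, F]) @ P_block

same = np.allclose(masked_then_augmented, augmented_then_masked)
block_invertible = np.linalg.matrix_rank(P_block) == m + 1

print(same, block_invertible)
```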

In [5]:
#Set the features and targets for the dataset
features = df.drop('Insurance benefits', axis=1)
target = df['Insurance benefits']

#Create a random masking matrix, P
P = np.random.rand(len(features.columns), len(features.columns))

P must be checked to ensure it is invertible:

In [6]:
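#Check that matrix P is invertible before masking the features
try:
    np.linalg.inv(P)  #raises LinAlgError if P is singular
except np.linalg.LinAlgError:
    print('Matrix P is non-invertible, an invertible masking matrix must be used.')
else:
    features_prime = features @ P  #masked features X' = XP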

In [7]:
#Separate into train and test dataset for both the masked and unmasked datasets
features_train, features_test, target_train, target_test = train_test_split(features, target, random_state=12345)
features_train_masked, features_test_masked, target_train_masked, target_test_masked = train_test_split(features_prime, target, random_state=12345)

#Create linear regression models
model_lr = LinearRegression()
model_lr_masked = LinearRegression()

Initially, the model is trained on the original dataset and the R2 score for the model is calculated.

In [12]:
#Fit model and calculate R2 Score
model_lr.fit(features_train, target_train)
predictions = model_lr.predict(features_test)

print('The R2 Score for the unmasked model is {}.'.format(r2_score(target_test, predictions)))

The R2 Score for the unmasked model is 0.435227571270266.


The R2 score for the model trained on the original dataset is 0.435. The same procedure is now repeated on the masked dataset, and the two R2 scores are compared. Based on the proof performed earlier, the two scores are expected to be almost identical.

In [9]:
#Fit model and calculate R2 Score
model_lr_masked.fit(features_train_masked, target_train_masked)
predictions_masked = model_lr_masked.predict(features_test_masked)

print('The R2 Score for the masked model is {}.'.format(r2_score(target_test, predictions_masked)))

The R2 Score for the masked model is 0.43522757127004286.


As predicted, the two scores are nearly identical: they differ by roughly 2e-13, which is consistent with floating-point rounding. This validates the proof that was performed.

<a id='Conclusion'></a>
## Conclusion

The purpose of this project was to develop a data transforming algorithm that would make it hard to recover personal information from the data.

The data was masked by multiplying the dataset features by an invertible matrix in order to "randomize" all the feature values. Prior to constructing the algorithm, a proof of this approach was performed to verify that this would not affect the model predictions or model quality. 

Next, an algorithm was proposed that described the process for masking the dataset and evaluating the model quality. The algorithm was subsequently tested by training models on both the original, unmasked dataset and the transformed, masked dataset. The R2 scores of the two models were nearly identical, validating the proof. The proposed algorithm therefore transforms the data to protect personal information without reducing the quality of the linear regression model.