# Personal data protection

**Project objective:** develop a data transformation algorithm and justify its correctness. The transformation must make it difficult to reconstruct personal information from the transformed data while leaving the quality of machine learning models unchanged.

## Data loading

In [7]:
# imports
import pandas as pd
import numpy as np
from numpy import linalg as LA

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# data
data = pd.read_csv('/datasets/insurance.csv')
data.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


The dataframe has 5 columns. Four of them are features: `Пол` (sex), `Возраст` (age), `Зарплата` (salary), and `Члены семьи` (family members).

The last column, `Страховые выплаты` (insurance payments: the number of payments in the last 5 years), is the target variable.
There are no missing values, and all columns are numeric.

The only preprocessing needed is to convert `Возраст` and `Зарплата` to an integer data type.

In [4]:
data = data.astype(int)

## Matrix multiplication

**Formulas for linear regression:**

Notation:
- $X$: feature matrix (the zeroth column consists of ones)
- $y$: target vector
- $P$: matrix by which the features are multiplied
- $w$: vector of linear regression weights (the zeroth element is the bias, i.e. the intercept)

Predictions:

$$
a = Xw
$$

Training problem:

$$
w = \arg\min_w MSE(Xw, y)
$$

Solution:

$$
w = (X^T X)^{-1} X^T y
$$
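
As a minimal sketch of the closed-form solution above, the snippet below builds a small synthetic feature matrix, prepends the column of ones, and computes $w$ with NumPy. The toy data and the names `normal_equation_weights`, `X_demo`, `y_demo` are assumptions made for illustration and are not part of the project. It solves the linear system $(X^T X) w = X^T y$ with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same result and is usually more stable numerically.

```python
import numpy as np

def normal_equation_weights(X, y):
    """Return w = (X^T X)^{-1} X^T y for a feature matrix X with a leading column of ones."""
    # Solving (X^T X) w = X^T y is equivalent to multiplying by the inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: 6 objects, 2 features
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 2))
true_w = np.array([1.0, 2.0, -3.0])  # [bias, weight_1, weight_2]

# Zeroth column of ones, as in the definition of X
X_demo = np.column_stack([np.ones(len(features)), features])
y_demo = X_demo @ true_w  # noise-free target, so the weights can be recovered

w_demo = normal_equation_weights(X_demo, y_demo)
a_demo = X_demo @ w_demo  # predictions a = Xw

print(np.round(w_demo, 6))
```

Because the toy target has no noise, the recovered weights should match `true_w` up to floating-point error.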

**Question:** The features are multiplied by an invertible matrix. Will the quality of the linear regression change?

**Answer:** No, it will not.

**Explanation:** The quality of a linear regression can be measured, for example, by the mean squared error (MSE). Let's train two models, one on the original feature matrix and one on the feature matrix multiplied by an invertible matrix, and compare the results.

In [5]:
X = data.drop(columns=['Страховые выплаты'])
y = data['Страховые выплаты']

In [6]:
# P matrix
P = np.random.randint(1, 30, (4, 4))
P

array([[15, 22, 14, 11],
       [26, 15, 16, 25],
       [18, 25,  4,  4],
       [ 6,  6, 18, 26]])

In [8]:
# invertibility check: P is invertible if its determinant is non-zero
LA.det(P)

44674.00000000003

In [9]:
XP = pd.DataFrame(np.array(X) @ P, columns=X.columns).astype(int)
XP.head()

| | Пол | Возраст | Зарплата | Члены семьи |
|---|---|---|---|---|
| 0 | 893887 | 1240643 | 199088 | 199462 |
| 1 | 685202 | 950696 | 152754 | 153176 |
| 2 | 378754 | 525435 | 84464 | 84725 |
| 3 | 751158 | 1042827 | 167172 | 167377 |
| 4 | 470543 | 652942 | 104862 | 105111 |


In [10]:
# linear regression fit on the initial feature matrix X
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)

print('MSE for the model trained on X:', round(mse, 10))

MSE for the model trained on X: 0.1233468894


In [11]:
# linear regression fit on feature matrix XP
model = LinearRegression()
model.fit(XP, y)
predictions = model.predict(XP)
mse = mean_squared_error(y, predictions)

print('MSE for the model trained on XP:', round(mse, 10))

MSE for the model trained on XP: 0.1233468894


The MSE values of the two models match to 10 decimal places, which is consistent with the answer.

## Transformation algorithm

**Algorithm** 

To protect clients' personal data, the following data preparation method can be used before model training. After preprocessing (handling outliers, missing values, etc.), multiply the feature matrix by an invertible matrix of random numbers (the key matrix). Any new feature matrix must then be multiplied by the same key matrix before it is passed to the model for predictions.

Additionally, if linear regression is used as the machine learning algorithm, the values can be scaled after multiplication by the key matrix, for better results and security.
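
The procedure above can be sketched as follows. This is a hedged illustration on synthetic data, not the project's code: the helper `generate_key_matrix`, the seed, and the toy arrays are assumptions made for the example. A key is redrawn until it is invertible, the same key is applied to the training features and to new features, and the predictions are compared with those of a model trained on unprotected data. A third model adds `StandardScaler` after the key multiplication to illustrate the optional scaling step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def generate_key_matrix(n_features, rng, max_tries=100):
    """Draw random square matrices until one is invertible (hypothetical helper)."""
    for _ in range(max_tries):
        key = rng.integers(1, 30, size=(n_features, n_features)).astype(float)
        # A square matrix is invertible exactly when it has full rank
        if np.linalg.matrix_rank(key) == n_features:
            return key
    raise RuntimeError('could not draw an invertible key matrix')

rng = np.random.default_rng(42)

# Synthetic stand-ins for preprocessed training features and new, unseen features
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(10, 4))

key = generate_key_matrix(X_train.shape[1], rng)

# Reference model trained on unprotected data
plain_model = LinearRegression().fit(X_train, y_train)
plain_pred = plain_model.predict(X_new)

# Model trained on protected data; new features are multiplied by the same key
protected_model = LinearRegression().fit(X_train @ key, y_train)
protected_pred = protected_model.predict(X_new @ key)

# Optional scaling applied after the key multiplication
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train @ key, y_train)
scaled_pred = scaled_model.predict(X_new @ key)

print(np.allclose(plain_pred, protected_pred))
print(np.allclose(plain_pred, scaled_pred))
```

Wrapping the scaler and the regression in one pipeline keeps the scaling parameters learned on the training data and reuses them for new data, so both are transformed the same way.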

**Explanation**

When the feature matrix is multiplied by the key matrix, the quality of the model does not change because the predictions are identical.

Predictions of the original model: $ a = Xw = X(X^T X)^{-1} X^T y $ \
Predictions of the model trained with the key matrix: $ a_{new} = XPw_{new} = XP ((XP)^T XP)^{-1} (XP)^T y $

Properties of matrices used:
$$(AB)^T = B^T A^T$$
$$(AB)^{-1} = B^{-1} A^{-1}$$
$$ A A^{-1} = A^{-1} A = E $$

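Proof (assuming $P$ and $X^T X$ are invertible):
$$ a_{new} = XP (P^T X^T XP)^{-1} P^T X^T y $$
Since $P^T$, $X^T X$ and $P$ are square and invertible, $(P^T (X^T X) P)^{-1} = P^{-1} (X^T X)^{-1} (P^T)^{-1}$, so
$$ a_{new} = X (P P^{-1}) (X^TX)^{-1} ((P^T)^{-1} P^T) X^T y$$
$$ a_{new} = XE (X^TX)^{-1} EX^T y$$
$$ a_{new} = X(X^TX)^{-1}X^T y = a$$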
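
The derivation can also be checked numerically. The sketch below uses synthetic data and illustrative names (`X_demo`, `P_demo`, `fit_weights`), all of which are assumptions for the example. It applies the closed-form solution directly, without sklearn. It also illustrates a consequence of the same algebra: the weights of the two models are related by $w_{new} = P^{-1} w$.

```python
import numpy as np

rng = np.random.default_rng(1)

X_demo = rng.normal(size=(50, 4))   # synthetic feature matrix
y_demo = rng.normal(size=50)        # synthetic target
P_demo = rng.normal(size=(4, 4))    # random Gaussian matrix, invertible with probability 1

def fit_weights(X, y):
    # Closed-form solution w = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

w = fit_weights(X_demo, y_demo)
w_new = fit_weights(X_demo @ P_demo, y_demo)

a = X_demo @ w
a_new = X_demo @ P_demo @ w_new

# Predictions coincide up to floating-point error
print(np.allclose(a, a_new))
# Weights are related through the inverse of the key matrix
print(np.allclose(w_new, np.linalg.inv(P_demo) @ w))
```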

## Algorithm check

In [13]:
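# X - DataFrame, key_matrix - square matrix whose size equals the number of features
def protected_data(X, key_matrix):
    # a (near-)zero determinant means the key matrix cannot be inverted
    if np.isclose(LA.det(key_matrix), 0):
        raise ValueError('The key matrix is not invertible!')

    # multiply the features by the key and keep the original column names
    return pd.DataFrame(np.array(X) @ key_matrix, columns=X.columns)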

In [14]:
# without encryption
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
r2 = r2_score(y, predictions)

print('R2 without encryption:', round(r2, 10))

R2 without encryption: 0.4249455031


In [15]:
# with encryption
key_matrix = np.random.randint(1, 30, (4, 4))

model = LinearRegression()
model.fit(protected_data(X, key_matrix), y)
predictions = model.predict(protected_data(X, key_matrix))
r2 = r2_score(y, predictions)

print('R2 with encryption:', round(r2, 10))

R2 with encryption: 0.4249455031


The algorithm works: the model trained on the original data and the model trained on the encrypted data show the same R2.
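
For completeness, the holder of the key can reverse the transformation by multiplying the protected features by the inverse of the key matrix. A minimal sketch, assuming a small hypothetical dataframe and a hand-picked invertible key (neither comes from the project data); `restore_data` is an illustrative helper, not part of the notebook:

```python
import numpy as np
import pandas as pd

def restore_data(X_protected, key_matrix):
    """Undo X @ key_matrix by multiplying with the inverse key (illustrative helper)."""
    restored = np.array(X_protected) @ np.linalg.inv(key_matrix)
    return pd.DataFrame(restored, columns=X_protected.columns)

# Hypothetical toy features
X_toy = pd.DataFrame({'age': [41, 46, 29], 'salary': [49600, 38000, 21000]})

# Hand-picked invertible key: det = 2*3 - 1*1 = 5
key_toy = np.array([[2, 1],
                    [1, 3]])

X_protected = pd.DataFrame(np.array(X_toy) @ key_toy, columns=X_toy.columns)
X_restored = restore_data(X_protected, key_toy)

# The restored values equal the originals up to floating-point error
print(np.allclose(X_restored.to_numpy(), X_toy.to_numpy()))
```

Without the key, the protected values alone do not reveal which combination of the original columns produced them, which is why the key has to be stored separately from the transformed data.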

**Conclusion:**
    
In this project, an algorithm for encrypting client data was proposed at the request of the company "Want a Flood". It transforms the feature dataframe by multiplying it by a key matrix, which makes the original values difficult to recover for anyone who does not know the key. Linear regression trained on the encrypted features gives the same results as linear regression trained on the original ones. This is proven mathematically and checked on a real dataset.