# Protecting customers' personal data

The goal is to protect the data of an insurance company's customers: develop a data transformation method that makes it hard to recover personal information from the transformed data.
The protection must not degrade the quality of machine learning models trained on the transformed data.

## Loading data

**Check the data for gaps and anomalies.**

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
data = pd.read_csv('/content/drive/My Drive/projects/personal_data_protection_algorithm/insurance.csv')
data.columns = ['Gender', 'Age', 'Salary', 'Family members', 'Insurance claims']  # translate column names into English
display(data.head())
display(data.info())
display(data.describe())
print('number of duplicate lines =', data.duplicated().sum())
data.corr()

| | Gender | Age | Salary | Family members | Insurance claims |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Gender            5000 non-null   int64  
 1   Age               5000 non-null   float64
 2   Salary            5000 non-null   float64
 3   Family members    5000 non-null   int64  
 4   Insurance claims  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


None

| | Gender | Age | Salary | Family members | Insurance claims |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


number of duplicate lines = 153


| | Gender | Age | Salary | Family members | Insurance claims |
|---|---|---|---|---|---|
| Gender | 1.0 | 0.002074 | 0.01491 | -0.008991 | 0.01014 |
| Age | 0.002074 | 1.0 | -0.019093 | -0.006692 | 0.65103 |
| Salary | 0.01491 | -0.019093 | 1.0 | -0.030296 | -0.014963 |
| Family members | -0.008991 | -0.006692 | -0.030296 | 1.0 | -0.03629 |
| Insurance claims | 0.01014 | 0.65103 | -0.014963 | -0.03629 | 1.0 |


**About 3% of the rows (153 of 5,000) are duplicates. There is also a fairly strong positive correlation (0.65) between the feature `Age` and the target feature `Insurance claims`. The duplicates are removed below.**

In [3]:
data = data.drop_duplicates()
print('number of duplicate lines =', data.duplicated().sum())

number of duplicate lines = 0


**Features: `Gender`, `Age`, `Salary`, `Family members`.**

**Target feature: `Insurance claims`, the number of insurance payments made to the client over the last 5 years.**

**The features `Age` and `Salary` are stored as floats, although an integer type suits them better, so we convert them to integers.**

**For convenience in further work, we keep the English column names and switch them to snake case.**

In [5]:
data.columns = ['gender', 'age', 'salary', 'family_members', 'insurance_claim']
data = data.astype(int)
display(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   gender           5000 non-null   int64
 1   age              5000 non-null   int64
 2   salary           5000 non-null   int64
 3   family_members   5000 non-null   int64
 4   insurance_claim  5000 non-null   int64
dtypes: int64(5)
memory usage: 195.4 KB


None
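
Casting floats with `astype(int)` silently truncates any fractional part, so it can be worth confirming first that the values are whole numbers. The sketch below illustrates such a check on a small hypothetical frame (`toy` and `is_whole` are illustrative names, not part of the notebook above).

```python
import pandas as pd

# Hypothetical stand-in for the float columns of the dataset
toy = pd.DataFrame({'Age': [41.0, 46.0, 29.0],
                    'Salary': [49600.0, 38000.0, 21000.0]})

# True only if every value has a zero fractional part
is_whole = bool((toy % 1 == 0).all().all())
print(is_whole)

# Cast only when no information would be lost
if is_whole:
    toy = toy.astype(int)
print(toy.dtypes)
```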

## Conclusions of section 1

**About 3% of the rows were duplicates; they were removed.**

**There is a fairly strong positive correlation (0.65) between the feature `Age` and the target feature `Insurance claims`.**

**The data has no gaps.**

**All columns were converted to an integer type.**

**All column names, including the target, are in English and in snake case.**

## Matrix multiplication

Notation:

- $X$: the feature matrix (its zeroth column consists of ones)

- $y$: the vector of the target feature

- $P$: the matrix by which the features are multiplied

- $w$: the vector of linear regression weights (its zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
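
As an illustration of the training formula, here is a minimal sketch that computes $w$ directly with NumPy on synthetic data and compares it with scikit-learn's `LinearRegression`. The data, the seed and the variable names are assumptions made for the example; they are not taken from the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed, synthetic data only

# Synthetic stand-in for the features: 200 rows, 4 columns
features = rng.normal(size=(200, 4))
target = (features @ np.array([0.5, -1.0, 2.0, 0.0])
          + 3.0 + rng.normal(scale=0.1, size=200))

# X gets a leading column of ones so that w[0] plays the role of the intercept
X = np.hstack([np.ones((features.shape[0], 1)), features])

# w = (X^T X)^{-1} X^T y, solved as a linear system instead of forming the inverse
w = np.linalg.solve(X.T @ X, X.T @ target)

# Reference fit from scikit-learn for comparison
reference = LinearRegression().fit(features, target)

print(np.allclose(w[0], reference.intercept_))
print(np.allclose(w[1:], reference.coef_))
```

Both comparisons should print `True`: the closed-form weights agree with the library fit up to floating-point error.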

**The features are multiplied by an invertible matrix. Let's determine whether the quality of linear regression changes (the model may be retrained).**

**The quality of linear regression will not change.**

**Rationale:**

Only a square matrix can be invertible, so the statement that the features are multiplied by an invertible matrix implies that $P$ is square.

In addition, the product of two matrices is defined only if the number of columns of the first factor equals the number of rows of the second (such matrices are called conformable). Hence $P$ must be a square matrix whose height and width equal the width of the feature matrix $X$ (the number of its columns). We also assume that $X^T X$ is invertible, which the training formula already requires.

**Original formulas for training (the weight vector) and for predictions:**

$$
w = (X^T X)^{-1} X^T y
$$

$$
a = X w
$$

**Now consider what happens to the training formula when the feature matrix is multiplied by the square matrix $P$:**

$$
w_p = (X_p^T X_p)^{-1} X_p^T y
$$

**Where:** $X_p = X P$

$$
w_p = ((X P)^T (X P))^{-1} (X P)^T y
$$

**Expanding the transpose with the rule $(AB)^T = B^T A^T$, we get:**
$$
w_p = (P^T X^T X P)^{-1} (P^T X^T) y
$$

**Grouping the factors inside the inverse, we get:**
$$
w_p = ((P^T) (X^T X) (P))^{-1} P^T X^T y
$$

**Since $P^T$, $X^T X$ and $P$ are all square and invertible, the rule $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ applies to the inverse:**
$$
w_p = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$

**The product $(P^T)^{-1} P^T$ is the identity matrix $E$, and multiplying by $E$ leaves a matrix unchanged, so this factor drops out:**
$$
w_p = P^{-1} (X^T X)^{-1} X^T y
$$

**Now let's write the formula for $a_p$, the predictions on the transformed features:**
$$
a_p = X_p w_p
$$

**Substituting the expression for $w_p$ found above, we get:**
$$
a_p = (X P) (P^{-1} (X^T X)^{-1} X^T y)
$$

**Removing the brackets, we get:**
$$
a_p = X P P^{-1} (X^T X)^{-1} X^T y
$$

**The product $P P^{-1}$ is the identity matrix $E$, and multiplying by $E$ leaves a matrix unchanged, so this factor drops out as well:**
$$
a_p = X (X^T X)^{-1} X^T y = X w = a
$$

**As a result:**
$$
a_p = a
$$

**This proves that the predictions are identical when the features are multiplied by an invertible matrix. Q.E.D.**
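
The derivation can also be checked numerically. The sketch below uses synthetic data and a random square matrix (all names and the seed are assumptions for the example) to confirm both intermediate results: $w_p = P^{-1} w$ and $a_p = a$.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, synthetic data only

# Synthetic X with a leading column of ones, as in the notation above
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)

# Random square P; a matrix drawn from a continuous distribution
# is invertible with probability 1
P = rng.normal(size=(X.shape[1], X.shape[1]))


def fit_weights(A, b):
    """Least-squares weights from the normal equation (A^T A) w = A^T b."""
    return np.linalg.solve(A.T @ A, A.T @ b)


w = fit_weights(X, y)
w_p = fit_weights(X @ P, y)

a = X @ w
a_p = (X @ P) @ w_p

print(np.allclose(w_p, np.linalg.solve(P, w)))  # checks w_p = P^{-1} w
print(np.allclose(a_p, a))                      # checks a_p = a
```

The weights change (they are multiplied by $P^{-1}$ on the left), while the predictions stay the same up to floating-point error.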

## Conversion algorithm

**Algorithm**

Based on the proof above, multiplying the features by an invertible matrix does not degrade the quality of linear regression.

Invertibility criterion: a square matrix is invertible if and only if it is non-singular (non-degenerate), that is, its determinant is not zero. Non-square matrices and singular matrices have no inverse.

Accordingly, we need to generate a square matrix whose dimension equals the number of features in the source data and whose determinant is not zero. The generated matrix must also not be the identity matrix: multiplying the feature matrix by the identity returns the same matrix, so no transformation happens and the data protection task fails.
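
In floating-point arithmetic the size of the determinant is a poor indicator of how close a matrix is to being singular. The sketch below illustrates this and shows one possible alternative based on the rank and the condition number; `is_invertible` and its `max_condition` threshold are hypothetical names chosen for the example.

```python
import numpy as np

# A perfectly well-conditioned matrix whose determinant is tiny
scaled_identity = 0.1 * np.eye(20)
print(np.linalg.det(scaled_identity))   # about 1e-20, yet trivially invertible
print(np.linalg.cond(scaled_identity))  # about 1

# A matrix with two identical rows: not invertible
singular = np.array([[1.0, 2.0],
                     [1.0, 2.0]])
print(np.linalg.matrix_rank(singular))  # 1, which is less than 2


def is_invertible(matrix, max_condition=1e12):
    """Hypothetical helper: square, full rank and not too ill-conditioned."""
    rows, cols = matrix.shape
    return (rows == cols
            and np.linalg.matrix_rank(matrix) == rows
            and np.linalg.cond(matrix) < max_condition)


print(is_invertible(scaled_identity))
print(is_invertible(singular))
```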

The algorithm can be written step by step as follows:

1. Generate a random square matrix of size M x M, where M is the number of features in the original dataset.
2. Check that the generated matrix is invertible (its determinant is not zero).
3. Train a linear regression model on the original feature matrix.
4. Use the trained model to predict the target feature from the original features.
5. Calculate the R2 quality metric for this model.
6. Multiply the original feature matrix by the generated matrix; call the result the encoded feature matrix.
7. Retrain the linear regression model on the encoded feature matrix.
8. Use the retrained model to predict the target feature from the encoded features.
9. Calculate the R2 quality metric for the retrained model.
10. Compare the R2 metric of the model trained on the original features with that of the model trained on the encoded features.
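
The encoding part of these steps can be sketched as a few small functions. The names `generate_key`, `encode` and `decode` are hypothetical, and the three rows are taken from the head of the dataset shown earlier. The sketch also shows that anyone who holds the matrix can restore the original values, so the matrix itself has to be kept secret.

```python
import numpy as np


def generate_key(n_features, rng):
    """Draw random square matrices until one has full rank."""
    while True:
        key = rng.random((n_features, n_features))
        if np.linalg.matrix_rank(key) == n_features:
            return key


def encode(features, key):
    """Encoded features: X P."""
    return features @ key


def decode(encoded, key):
    """Recover X from X P by solving P^T X^T = (X P)^T instead of inverting P."""
    return np.linalg.solve(key.T, encoded.T).T


rng = np.random.default_rng(42)  # fixed seed for the example only
features = np.array([[1, 41, 49600, 1],
                     [0, 46, 38000, 1],
                     [0, 29, 21000, 0]], dtype=float)

key = generate_key(features.shape[1], rng)
encoded = encode(features, key)
restored = decode(encoded, key)

# The round trip should reproduce the original values up to round-off
print(np.allclose(restored, features, atol=1e-6))
```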

**Rationale**

The "Matrix multiplication" section proves mathematically that multiplying the feature matrix by an invertible matrix does not change the predictions and therefore does not degrade the quality of linear regression.

## Algorithm check

Let's implement the algorithm using matrix operations and then check that the quality of scikit-learn's linear regression is the same before and after the transformation, using the R2 metric.

In [None]:
features = data.drop('insurance_claim', axis=1)
target = data['insurance_claim']
print('features.head before')
display(features.head())

# Baseline: train and evaluate linear regression on the original features
model = LinearRegression()
model.fit(features, target)
predictions_before = model.predict(features)
score_before = r2_score(target, predictions_before)
print(f'R2 score before encryption = {score_before}')

# Draw a square matrix with random elements from [0, 1) and redraw it until it has
# full rank (an exact `det == 0` comparison is unreliable in floating point)
n_features = features.shape[1]
P = np.random.random((n_features, n_features))
while np.linalg.matrix_rank(P) < n_features:
    P = np.random.random((n_features, n_features))

# Multiply the original features by the random square matrix
features_encrypted = features.to_numpy() @ P
print('\n\n\nencryption matrix')
print(P)
print('\n\n\nfeatures.head after encryption')
display(pd.DataFrame(features_encrypted, columns=features.columns).head())

# Retrain on the encrypted features and compare the metric
model.fit(features_encrypted, target)
predictions_after = model.predict(features_encrypted)
score_after = r2_score(target, predictions_after)
print(f'R2 score after encryption = {score_after}')
print('\n\n\nr2_score difference:', score_after - score_before)

features.head before


| | gender | age | salary | family_members |
|---|---|---|---|---|
| 0 | 1 | 41 | 49600 | 1 |
| 1 | 0 | 46 | 38000 | 1 |
| 2 | 0 | 29 | 21000 | 0 |
| 3 | 0 | 21 | 41700 | 2 |
| 4 | 1 | 28 | 26100 | 0 |


R2 score before encryption = 0.4302010046633359



encryption matrix
[[0.26130595 0.65787401 0.29584038 0.76437669]
 [0.84355721 0.28586644 0.30151503 0.79294882]
 [0.19892611 0.5234803  0.79208387 0.71328965]
 [0.40760029 0.73508471 0.72627317 0.40646642]]



features.head after encryption


| | gender | age | salary | family_members |
|---|---|---|---|---|
| 0 | 9901.989821 | 25977.73645 | 39300.744049 | 35412.848431 |
| 1 | 7598.403422 | 19906.136408 | 30113.782923 | 27141.888848 |
| 2 | 4201.911475 | 11001.376464 | 16642.50515 | 15002.078186 |
| 3 | 8313.7487 | 21836.601948 | 33037.681629 | 29761.643302 |
| 4 | 5215.852386 | 13671.49801 | 20682.127198 | 18639.826833 |


R2 score after encryption = 0.43020100466333566



r2_score difference: -2.220446049250313e-16


**The difference between the two R2 values is about $-2.2 \cdot 10^{-16}$, which is at the level of floating-point round-off. This confirms that the transformation works correctly and does not degrade the quality of linear regression.**
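
Because the difference is round-off rather than an exact zero, a tolerance-based comparison such as `np.isclose` is a safer check than `==`. The sketch below repeats the experiment on synthetic data (the data, seed and variable names are assumptions for the example). It also checks that, when scikit-learn fits the intercept separately from the features, the coefficients are transformed by $P^{-1}$ while the intercept stays the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)  # fixed seed, synthetic data only
features = rng.normal(size=(300, 4))
target = features @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=300)
P = rng.random((4, 4))  # random square matrix with elements from [0, 1)

model_before = LinearRegression().fit(features, target)
model_after = LinearRegression().fit(features @ P, target)

score_before = r2_score(target, model_before.predict(features))
score_after = r2_score(target, model_after.predict(features @ P))

# Compare with a tolerance instead of exact equality
print(np.isclose(score_before, score_after))

# Coefficients are multiplied by P^{-1}; the intercept is unchanged
print(np.allclose(model_after.coef_, np.linalg.solve(P, model_before.coef_)))
print(np.isclose(model_after.intercept_, model_before.intercept_))
```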

## Conclusion

**The original dataset consisted of 4 features and 1 target feature. Duplicates (about 3% of the rows) were removed, all columns were converted to an integer type, and the column names were converted to snake case for convenience in further work.**

**It was proven mathematically that multiplying the original feature matrix by an invertible matrix does not degrade the quality of a linear regression model, so the original feature values, which are confidential, can be encoded this way.**

**The code then demonstrated that, after multiplying the original feature matrix by a random invertible matrix and retraining linear regression on the encoded features, the R2 metric stays the same up to floating-point round-off (about 0.4302 in both cases). This transformation therefore does not degrade the quality of linear regression and can be used in practice.**