
# Protection of personal data of clients

We need to protect the customer data of the insurance company "Though the Flood". The task is to develop a data transformation that makes it difficult to recover personal information from the transformed data, and to justify that the method works correctly.

The transformation must not degrade the quality of machine learning models trained on the data.

## Loading data

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_spd_matrix
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

#ins = pd.read_csv('insurance.csv')
ins = pd.read_csv('/datasets/insurance.csv')

In [2]:
ins.info()
ins.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Пол                  5000 non-null int64
Возраст              5000 non-null float64
Зарплата             5000 non-null float64
Члены семьи          5000 non-null int64
Страховые выплаты    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


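|   | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |
| 5 | 1 | 43.0 | 41000.0 | 2 | 1 |
| 6 | 1 | 39.0 | 39700.0 | 2 | 0 |
| 7 | 1 | 25.0 | 38600.0 | 4 | 0 |
| 8 | 1 | 36.0 | 49700.0 | 1 | 0 |
| 9 | 1 | 32.0 | 51700.0 | 1 | 0 |

The column names are kept in Russian because the code refers to them: Пол is sex, Возраст is age, Зарплата is salary, Члены семьи is family members, and Страховые выплаты is insurance payouts.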


## Matrix multiplication

Notation:

- $X$: feature matrix (its zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (its zeroth element is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
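As a minimal sketch of the training formula, the snippet below computes $w$ on a small synthetic dataset with NumPy. The data, the seed, and all variable names are illustrative assumptions and are not taken from the insurance dataset. `np.linalg.solve` is used instead of an explicit inverse because it solves the same normal equations with better numerical behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# Synthetic data: 50 objects, 3 features, plus a leading column of ones.
X_raw = rng.normal(size=(50, 3))
X = np.hstack([np.ones((50, 1)), X_raw])
true_w = np.array([2.0, 1.0, -3.0, 0.5])  # hypothetical "true" weights
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```

Both routes minimise the same MSE, so they should agree up to floating-point error.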

Suppose the features are multiplied by an invertible matrix. Will the quality of linear regression change? (The model can be retrained.)

**Answer:** The quality of the linear regression will not change.

**Explanation:**
Substitute the product $XP$ for the feature matrix $X$ in the training formula and denote the resulting weight vector by $w_P$:
$$
w_P = ((X P)^T X P)^{-1} (X P)^T y
$$

To transform this expression we use three properties: $(AB)^T = B^T A^T$; $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ for square invertible matrices, which applies here because $P$ is invertible by assumption and $X^T X$ must be invertible for the training formula to exist; and $(A^T)^{-1} A^T = E$.
$$ w_P = (P^T X^T XP)^{-1} P^T X^T y $$
$$ w_P = P^{-1}(X^T X)^{-1}(P^T)^{-1} P^T X^T y $$
$$ w_P = P^{-1}(X^TX)^{-1} X^T y $$

The factor $(X^T X)^{-1} X^T y$ in the resulting expression is exactly the usual training formula, so we can replace it with the weight vector $w$:
$$ w_P = P^{-1} w $$

Next, we substitute $w_P$ and $XP$ into the prediction formula:
$$ a_P = X P w_P $$
$$ a_P = X P P^{-1} w $$

Multiplying a matrix by its inverse gives the identity matrix $E$.
$$ a_P = X E w $$
$$ a_P = X w $$
$$a_P = a$$

Since $a_P = a$, the predictions on the transformed features coincide with the original predictions, so the quality of the linear regression does not change.
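The derivation can be checked numerically. The sketch below uses synthetic data and a random Gaussian matrix as $P$; both are assumptions made for illustration, and the helper `fit` is a hypothetical name, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

# Synthetic feature matrix with a leading column of ones, and a random target.
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 3))])
y = rng.normal(size=40)

# A Gaussian random matrix is invertible with probability 1;
# the determinant check guards against the degenerate case.
P = rng.normal(size=(4, 4))
assert abs(np.linalg.det(P)) > 1e-8

def fit(features, target):
    """Solve the normal equations (F^T F) w = F^T t."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w = fit(X, y)
w_P = fit(X @ P, y)

# w_P should equal P^{-1} w, and both models should predict the same values.
print(np.allclose(w_P, np.linalg.inv(P) @ w))
print(np.allclose(X @ w, (X @ P) @ w_P))
```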

## Conversion algorithm

**Algorithm**

Based on the result proved above, we multiply the feature matrix $X$ by a random invertible matrix $P$. This transformation makes the customer data hard to recover without knowing $P$, while leaving the quality of linear regression unchanged.
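One way to package the algorithm is sketched below. The function names `make_invertible_matrix`, `transform_features`, and `restore_features` are hypothetical, the determinant threshold is an arbitrary assumption, and a Gaussian random matrix stands in for whatever generator is actually chosen. The three sample rows are the first rows of the dataset shown earlier.

```python
import numpy as np

def make_invertible_matrix(n, rng, min_abs_det=1e-6):
    """Draw random n x n matrices until one is safely invertible."""
    while True:
        candidate = rng.normal(size=(n, n))
        if abs(np.linalg.det(candidate)) > min_abs_det:
            return candidate

def transform_features(X, P):
    """Mask the features by multiplying them by the key matrix P."""
    return X @ P

def restore_features(X_masked, P):
    """Recover the original features with the inverse of P."""
    return X_masked @ np.linalg.inv(P)

rng = np.random.default_rng(42)  # illustrative seed
X = np.array([[1, 41.0, 49600.0, 1],
              [0, 46.0, 38000.0, 1],
              [0, 29.0, 21000.0, 0]])

P = make_invertible_matrix(X.shape[1], rng)
X_masked = transform_features(X, P)
X_restored = restore_features(X_masked, P)

# The round trip is exact only up to floating-point error.
print(np.allclose(X_restored, X, atol=1e-6))
```

In this sketch the matrix $P$ plays the role of a key: whoever holds it can undo the transformation, so it would need to be stored separately from the masked data.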

**Explanation**

To demonstrate this, we first split the data into features and target.

In [3]:
features = ins.drop('Страховые выплаты', axis=1)
target = ins['Страховые выплаты']

Let's create a random square matrix using `make_spd_matrix`, scikit-learn's generator of random symmetric positive-definite matrices. The matrix size is 4 because the data has 4 features.

In [4]:
p_matrix = make_spd_matrix(n_dim=4, random_state=12345)
p_matrix

array([[ 1.37245706, -1.03845957, -0.84389737, -0.26033015],
       [-1.03845957,  2.87886199,  1.67157893,  0.48470484],
       [-0.84389737,  1.67157893,  2.10204907,  0.3257384 ],
       [-0.26033015,  0.48470484,  0.3257384 ,  1.01695329]])
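A symmetric positive-definite matrix has strictly positive eigenvalues, so its determinant is positive and it is always invertible. The sketch below checks this for the matrix printed above, copied by hand with the displayed eight decimals (so it is a rounded copy, not the exact generator output).

```python
import numpy as np

# Rounded copy of the matrix printed by make_spd_matrix(n_dim=4, random_state=12345).
p_matrix = np.array([
    [ 1.37245706, -1.03845957, -0.84389737, -0.26033015],
    [-1.03845957,  2.87886199,  1.67157893,  0.48470484],
    [-0.84389737,  1.67157893,  2.10204907,  0.3257384 ],
    [-0.26033015,  0.48470484,  0.3257384 ,  1.01695329],
])

is_symmetric = np.allclose(p_matrix, p_matrix.T)
eigenvalues = np.linalg.eigvalsh(p_matrix)  # eigenvalues of a symmetric matrix

print(is_symmetric)
print((eigenvalues > 0).all())        # positive definite, hence invertible
print(np.linalg.cond(p_matrix))       # a moderate condition number keeps the inverse accurate
```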

Let's check that our random matrix is invertible, to avoid errors later on.

In [5]:
try:
    np.linalg.inv(p_matrix)
    print('Matrix P is invertible')
except np.linalg.LinAlgError:
    # np.linalg.inv raises LinAlgError for a singular matrix
    print('Matrix P is not invertible')

Matrix P is invertible


In [6]:
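# Regenerate P with a random seed until its determinant is non-zero.
# An SPD matrix is always invertible, so the loop normally runs once.
# Note: this replaces the seeded matrix from In [4], and r is not fixed,
# so P differs between runs.
det = 0
while det == 0:
    r = np.random.randint(100)
    p_matrix = make_spd_matrix(n_dim=4, random_state=r)
    det = np.linalg.det(p_matrix)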

We multiply the feature matrix by the random matrix $P$, then check that the protection can be removed by multiplying by the inverse matrix $P^{-1}$.

In [7]:
features_p = features @ p_matrix
features_check = features_p @ np.linalg.inv(p_matrix)
features

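|   | Пол | Возраст | Зарплата | Члены семьи |
|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 |
| 1 | 0 | 46.0 | 38000.0 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 |
| 4 | 1 | 28.0 | 26100.0 | 0 |
| ... | ... | ... | ... | ... |
| 4995 | 0 | 28.0 | 35700.0 | 2 |
| 4996 | 0 | 34.0 | 52400.0 | 1 |
| 4997 | 0 | 20.0 | 33900.0 | 2 |
| 4998 | 1 | 22.0 | 32700.0 | 3 |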


In [8]:
# Note: astype('int') truncates toward zero, so a value such as 40.999... becomes 40.
features_check = features_check.astype('int')
features_check

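|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 40 | 49600 | 1 |
| 1 | 0 | 46 | 38000 | 0 |
| 2 | 0 | 28 | 21000 | 0 |
| 3 | 0 | 20 | 41700 | 2 |
| 4 | 1 | 27 | 26100 | 0 |
| ... | ... | ... | ... | ... |
| 4995 | 0 | 27 | 35700 | 2 |
| 4996 | 0 | 33 | 52400 | 1 |
| 4997 | 0 | 19 | 33900 | 2 |
| 4998 | 0 | 21 | 32700 | 3 |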


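Comparing the two tables, the recovered values either match the originals or fall short by exactly one (for example, 41 becomes 40). This is consistent with `astype('int')` truncating floating-point results that land just below the true value, such as 40.999...; rounding before the cast, or comparing with `np.allclose`, is the appropriate check. Up to this floating-point error the original features are recovered, so the proposed algorithm can be used as a transformation to protect the data.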
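The effect of truncation can be reproduced in isolation. In the sketch below the error of `1e-9` is a simulated stand-in for the rounding error left after multiplying by $P$ and $P^{-1}$; it is an assumption, not a value measured on the dataset.

```python
import numpy as np

original = np.array([41.0, 29.0, 1.0])

# Simulated floating-point error: each value lands just below the true one.
recovered = original - 1e-9

truncated = recovered.astype(int)          # truncates toward zero: 40, 28, 0
rounded = np.round(recovered).astype(int)  # rounds to nearest: 41, 29, 1

print(truncated)
print(rounded)
print(np.allclose(recovered, original))    # True: equal within tolerance
```

Rounding or a tolerance-based comparison shows that the values were recovered, whereas truncation makes them look wrong by one.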

## Verification of the algorithm

For the final verification of the proposed algorithm, we compare the quality of linear regression using the R2 metric.

In [9]:
features_train, features_valid, target_train, target_valid = train_test_split(features, 
                                                                              target, 
                                                                              test_size=0.25, 
                                                                              random_state=12345)

In [10]:
features_p_train, features_p_valid, target_p_train, target_p_valid = train_test_split(features_p, 
                                                                                      target, 
                                                                                      test_size=0.25, 
                                                                                      random_state=12345)

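The `train_test_split` function splits the data into training and validation samples in a 3:1 ratio (75% and 25%). Using the same `random_state` in both calls places the same rows in each sample, so the two models are compared on identical objects.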

In [11]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_valid)
print("R2 of the linear regression model before transformation:", 
      r2_score(target_valid, predictions)
     )

R2 of the linear regression model before transformation: 0.435227571270266


In [12]:
model_p = LinearRegression()
model_p.fit(features_p_train, target_p_train)
predictions_p = model_p.predict(features_p_valid)
print("R2 of the linear regression model after transformation:", 
      r2_score(target_p_valid, predictions_p)
     )

R2 of the linear regression model after transformation: 0.43522757127032763


After training the two linear regression models, we obtained almost identical R2 values: they agree to 12 decimal places, and the remaining difference is attributable to floating-point rounding.
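The derivation earlier placed the column of ones inside $X$, whereas the experiment multiplies only the four feature columns by $P$ and lets the model fit the intercept separately. The sketch below suggests the result carries over to that setup. It uses NumPy only, with synthetic data, a fixed split, and hypothetical helpers `add_intercept`, `r2`, and `fit_predict`; it is not a reproduction of the scikit-learn run above.

```python
import numpy as np

rng = np.random.default_rng(2)  # illustrative seed

# Synthetic features and target with a non-zero intercept.
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 3.0 + rng.normal(scale=0.5, size=200)
P = rng.normal(size=(4, 4))     # random key matrix, invertible with probability 1

def add_intercept(features):
    """Prepend a column of ones so the model can learn a bias term."""
    return np.hstack([np.ones((features.shape[0], 1)), features])

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(train_X, train_y, valid_X):
    """Least-squares fit with intercept, then predict on the validation rows."""
    w, *_ = np.linalg.lstsq(add_intercept(train_X), train_y, rcond=None)
    return add_intercept(valid_X) @ w

# Same rows for training and validation in both experiments.
train, valid = slice(0, 150), slice(150, 200)
XP = X @ P

r2_plain = r2(y[valid], fit_predict(X[train], y[train], X[valid]))
r2_masked = r2(y[valid], fit_predict(XP[train], y[train], XP[valid]))

print(np.isclose(r2_plain, r2_masked))
```

The reason is that appending ones after the transformation is equivalent to multiplying the extended matrix by a block-diagonal matrix with 1 and $P$ on the diagonal, which is itself invertible.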

We conclude that the data transformation algorithm considered here can be used to protect customer information without degrading the quality of linear regression.