# Protection of personal data of clients

You need to protect the data of clients of the insurance company "Though the Flood". Develop a data transformation that makes it difficult to recover personal information from the transformed data, and justify why the method works.

The transformation must not degrade the quality of machine learning models. Selecting the best model is not required.

## Loading data

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
data = pd.read_csv('datasets/insurance.csv')
print(data.head())
print(data.info())
print(data.describe())

   Пол  Возраст  Зарплата  Члены семьи  Страховые выплаты
0    1     41.0   49600.0            1                  0
1    0     46.0   38000.0            1                  1
2    0     29.0   21000.0            0                  0
3    0     21.0   41700.0            2                  0
4    1     28.0   26100.0            0                  0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
None
               Пол      Возраст      Зарплата  Члены семьи  Страховые выплаты
count  5000.000000  5000.000000   5000.000000  5000.000000        5000.0000

## Matrix multiplication

Formulas in *Jupyter Notebook* can be written in the *LaTeX* markup language.

To write a formula inline, surround it with single dollar signs (\$); to set it on its own line, use double dollar signs (\$\$).

As an example, the linear regression formulas are written out below. You can copy and edit them to solve the problem.

Using *LaTeX* is optional.

Notation:

- $X$ is the feature matrix (its zeroth column consists of ones)

- $y$ is the target vector

- $P$ is the matrix by which the features are multiplied

- $w$ is the vector of linear regression weights (its zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
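
The training formula can be tried out numerically. The sketch below is only an illustration on made-up data (the arrays `features`, `true_w` and the noise level are assumptions, not part of the insurance dataset): it computes $w$ from the normal equation and compares it with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 objects, 3 features, plus a leading column of ones.
features = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), features])
true_w = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Training formula w = (X^T X)^{-1} X^T y, solved as a linear system
# instead of forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions a = Xw.
a = X @ w

print(np.allclose(w, w_lstsq))
```

Solving the system with `np.linalg.solve` is numerically safer than calling `np.linalg.inv` and multiplying, although both express the same formula.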

**Answer:** multiplying the features by an invertible matrix does not change the quality of linear regression.

**Rationale:**
*Problem:* Determine whether the quality of linear regression changes when the feature matrix is multiplied by an invertible matrix.

*Hypothesis:* if the matrix $X$ of shape $(A, B)$ is multiplied by an invertible matrix $C$ of shape $(B, B)$ (denoted $P$ in the notation above), the predictions $a$ do not change.

*Proof:*
The new predictions $a_{2}$ and weights $w_{2}$ are given by:
$$
a_{2} = XCw_{2}
$$

$$
w_{2} = ((XC)^T XC)^{-1}(XC)^T y
$$

Substitute $w_{2}$ into $a_{2}$:

$$
a_{2} = XC((XC)^T XC)^{-1}(XC)^T y
$$
Expand the brackets using $(XC)^T = C^T X^T$ and $(AB)^{-1} = B^{-1}A^{-1}$, which applies because $C$, $C^T$ and $X^T X$ are square and invertible:
$$
\begin{aligned}
a_{2} &= XC(C^T X^T X C)^{-1} C^T X^T y \\
&= XC(X^T X C)^{-1}(C^T)^{-1} C^T X^T y \\
&= XCC^{-1}(X^T X)^{-1}(C^T)^{-1} C^T X^T y
\end{aligned}
$$
Since the product of a matrix and its inverse is the identity matrix $E$, we get:
$$
a_{2} = XE(X^T X)^{-1}EX^T y = X(X^T X)^{-1}X^T y = Xw = a
$$
Since
$$
Xw = XCw_{2}
$$
and $X^T X$ is invertible (so $X$ has full column rank), it follows that
$$w = Cw_{2}$$
The weights are therefore related linearly: $w_{2} = C^{-1}w$.
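
The two identities from the proof, $a_{2} = a$ and $w = Cw_{2}$, can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `fit_normal_equation`; it illustrates the algebra and is not the notebook's own check.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: X already contains the column of ones.
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 4))])
y = rng.normal(size=200)

# Random square matrix C; a Gaussian matrix is invertible with probability 1.
C = rng.normal(size=(X.shape[1], X.shape[1]))


def fit_normal_equation(features, target):
    """Return w = (F^T F)^{-1} F^T t, solved as a linear system."""
    return np.linalg.solve(features.T @ features, features.T @ target)


w = fit_normal_equation(X, y)        # weights on the original features
w2 = fit_normal_equation(X @ C, y)   # weights on the transformed features

a = X @ w                            # original predictions
a2 = X @ C @ w2                      # predictions after the transformation

print(np.allclose(a, a2, atol=1e-6))       # predictions coincide
print(np.allclose(w, C @ w2, atol=1e-6))   # w = C w2
```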

## Conversion algorithm

**Algorithm**

1. Train a linear regression model on the matrix X, then obtain predictions for the training sample and the r2 value.

2. Determine the shape of X and generate a random square matrix C whose number of rows and columns equals the number of columns of X. Check that C is invertible.

3. Multiply the matrix X by C.

4. Train a linear regression model on the matrix XC, obtain predictions for the training sample and the r2 value, and compare them with the results obtained for X.

**Rationale**

If the proof is correct, multiplying the features by an invertible matrix should not change the quality of linear regression.
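
Steps 2 and 3 of the algorithm could be wrapped in small helpers, as in the sketch below. The function names `make_invertible_matrix` and `transform_features`, the `max_condition` threshold and the demo array are assumptions made for illustration; invertibility is checked here through the condition number rather than through an explicit inverse.

```python
import numpy as np


def make_invertible_matrix(size, rng, max_condition=1e6):
    """Draw Gaussian matrices until one is well-conditioned (hence invertible)."""
    while True:
        candidate = rng.normal(size=(size, size))
        if np.linalg.cond(candidate) < max_condition:
            return candidate


def transform_features(features, rng):
    """Multiply the features by a random invertible matrix; return both."""
    features = np.asarray(features, dtype=float)
    key = make_invertible_matrix(features.shape[1], rng)
    return features @ key, key


rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(10, 4))  # hypothetical stand-in for the feature table
X_protected, C_demo = transform_features(X_demo, rng)

print(X_protected.shape, C_demo.shape)
```

Rejecting ill-conditioned matrices is a design choice: a matrix that is technically invertible but nearly singular would amplify rounding errors when the data is transformed or restored.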

## Algorithm check

In [3]:
# split the data into features and the target
data_train = data.drop(columns='Страховые выплаты')
data_target = data['Страховые выплаты']


In [4]:
# train the model on the original data and compute r2 for predictions on the training set
model = LinearRegression()
model.fit(data_train, data_target)
predictions = model.predict(data_train)
r2 = r2_score(data_target, predictions)
print('r2 on source data: ', r2)

r2 on source data:  0.42494550286668


In [5]:
# generate a random square matrix whose size equals the number of feature columns
# and check that it is invertible
X_columns = data_train.shape[1]
C = np.random.normal(size=(X_columns, X_columns))
C_1 = np.linalg.inv(C)  # raises LinAlgError if C is exactly singular
# regenerate the matrix until the product of C and its inverse
# is numerically equal to the identity matrix
while not np.allclose(np.dot(C, C_1), np.eye(C.shape[0])):
    C = np.random.normal(size=(X_columns, X_columns))
    C_1 = np.linalg.inv(C)
data_train_2 = np.dot(data_train, C)

In [6]:
# train the model on the transformed data and compute r2 for predictions on the new training set
model = LinearRegression()
model.fit(data_train_2, data_target)
predictions2 = model.predict(data_train_2)
r2_2 = r2_score(data_target, predictions2)
print('r2 on the converted data: ', r2_2)

r2 on the converted data:  0.4249455028666158


CONCLUSION:

The r2 values of the linear regression models trained on the original and transformed data agree to within floating-point error (0.42494550286668 vs 0.4249455028666158), so multiplying the features by a random invertible matrix can be used to protect personal data without degrading model quality.
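
As a complementary illustration, the sketch below shows that whoever holds the matrix C can restore the original features by multiplying by its inverse. It uses the first five feature rows printed in the data loading section; the seed and variable names are assumptions, and the example makes no claim about how hard recovery is without C.

```python
import numpy as np

rng = np.random.default_rng(7)

# First five feature rows from data.head(): sex, age, salary, family members.
X = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
    [0, 21.0, 41700.0, 2],
    [1, 28.0, 26100.0, 0],
])

# Random key matrix; a Gaussian matrix is invertible with probability 1.
C = rng.normal(size=(X.shape[1], X.shape[1]))

X_protected = X @ C                          # what is stored or shared
X_restored = X_protected @ np.linalg.inv(C)  # recovery with the key

# Largest absolute difference between the restored and original values.
max_error = np.abs(X_restored - X).max()
print(max_error)
```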