<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading data</a></span></li><li><span><a href="#Matrix-multiplication" data-toc-modified-id="Matrix-multiplication-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Matrix multiplication</a></span></li><li><span><a href="#Conversion-algorithm" data-toc-modified-id="Conversion-algorithm-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conversion algorithm</a></span></li><li><span><a href="#Algorithm-check" data-toc-modified-id="Algorithm-check-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Algorithm check</a></span><ul class="toc-item"><li><span><a href="#Initial-matrix" data-toc-modified-id="Initial-matrix-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Initial matrix</a></span></li><li><span><a href="#New-matrix" data-toc-modified-id="New-matrix-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>New matrix</a></span></li><li><span><a href="#Data-decryption" data-toc-modified-id="Data-decryption-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Data decryption</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Protection of personal data of clients

The task is to protect the data of an insurance company's clients. Develop a data transformation method that makes it difficult to recover personal information, and justify that it works correctly.

The protection must not degrade the quality of machine learning models trained on the transformed data. Selecting the best model is not required.

## Loading data

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
RANDOM_STATE=12345

In [2]:
data=pd.read_csv("insurance.csv")
data.sample(10)

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
883,0,21.0,39400.0,1,0
2753,1,20.0,44400.0,0,0
4606,0,42.0,59100.0,0,1
240,1,31.0,41500.0,0,0
4024,1,39.0,35500.0,0,0
2065,0,23.0,49700.0,0,0
1014,0,38.0,43300.0,1,0
3168,1,49.0,38200.0,1,2
2197,1,39.0,42300.0,0,0
1975,0,24.0,46800.0,0,0


In [3]:
data.describe()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
data.duplicated().sum()

153

In [6]:
data.isna().sum()

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [7]:
data=data.drop_duplicates()

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4847 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                4847 non-null   int64  
 1   Возраст            4847 non-null   float64
 2   Зарплата           4847 non-null   float64
 3   Члены семьи        4847 non-null   int64  
 4   Страховые выплаты  4847 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 227.2 KB


**Conclusion**

Using the describe() method, we examined the client data. Age ranges from 18 to 65, salary from 5.3 thousand to 79 thousand, and clients have about 1.2 family members on average (maximum 6).

The maximum number of insurance payments per client is 5, with a mean of about 0.15. No anomalies or errors are apparent in the data.

We also checked for missing values (none found) and removed 153 duplicate rows, leaving 4847 records.

## Matrix multiplication

**Features are multiplied by an invertible matrix. Will the quality of linear regression change?**
 
  a. Will change. Give examples of matrices.
 
  b. Will not change. Indicate how the linear regression parameters in the original problem and in the transformed one are related.

Notation:

- $X$: feature matrix (column zero consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (element zero is the shift)

Predictions:

$$
a = Xw
$$

Learning Objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
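
A minimal sketch of the training formula in NumPy. The data, seed, and coefficient values are assumptions made for illustration and are not the insurance dataset. The system $X^TXw = X^Ty$ is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which gives the same $w$ whenever $X^TX$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration only)
n_samples, n_features = 200, 4
features = rng.normal(size=(n_samples, n_features))
noise = rng.normal(scale=0.1, size=n_samples)
y = features @ np.array([1.5, -2.0, 0.5, 3.0]) + 4.0 + noise

# Prepend the column of ones so that w[0] plays the role of the shift
X = np.column_stack([np.ones(n_samples), features])

# w = (X^T X)^{-1} X^T y, computed by solving the linear system X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions a = Xw
a = X @ w
```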

**Answer:** Multiplying the features by an invertible matrix does not change the quality of linear regression.

**Rationale:**

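Let $Z = XP$, where $X$ is the feature matrix (column zero consists of ones) and $P$ is an invertible square matrix. We assume $X^TX$ is invertible, as the training formula requires.

Predictions and weights for the original features:

$a = Xw$

$w = (X^TX)^{-1}X^Ty$

Predictions and weights for the transformed features:

$a_1 = Zw_1$

$w_1 = (Z^TZ)^{-1}Z^Ty = ((XP)^TXP)^{-1}(XP)^Ty = (P^TX^TXP)^{-1}(XP)^Ty = P^{-1}(X^TX)^{-1}(P^T)^{-1}(XP)^Ty$

Since $(XP)^T = P^TX^T$ and $(P^T)^{-1}P^T = E$ (the identity matrix), this simplifies to

$w_1 = P^{-1}(X^TX)^{-1}X^Ty = P^{-1}w$

Now check whether $a$ and $a_1$ are equal by simplifying $Zw_1$:

$a_1 = XPP^{-1}(X^TX)^{-1}(P^T)^{-1}(XP)^Ty$

$a_1 = XE(X^TX)^{-1}(P^T)^{-1}(XP)^Ty$

$a_1 = X(X^TX)^{-1}(P^T)^{-1}P^TX^Ty$

$a_1 = X(X^TX)^{-1}X^Ty$

$a = Xw = X(X^TX)^{-1}X^Ty$

Hence $a_1 = a$: the predictions coincide, so the quality of the regression is unchanged. The weights are related by $w_1 = P^{-1}w$.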
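
The derivation can also be checked numerically. The sketch below uses synthetic data and a hand-picked matrix `P` (both are assumptions for illustration). It fits the weights on $X$ and on $Z = XP$, then compares the predictions. It also compares the transformed weights with $P^{-1}w$, which the same algebra predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the feature matrix and target (illustration only)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = rng.normal(size=100)

# Strictly diagonally dominant, hence invertible
P = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 5.0],
])

def fit(features, target):
    """Least-squares weights from the training formula."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w = fit(X, y)    # weights on the original features
Z = X @ P        # transformed features
w1 = fit(Z, y)   # weights on the transformed features

same_predictions = np.allclose(X @ w, Z @ w1)
same_weights = np.allclose(w1, np.linalg.inv(P) @ w)
print(same_predictions, same_weights)
```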

**Conclusion**

We proved that multiplying the feature matrix by an invertible matrix does not change the quality of linear regression. This result is used in section 3.

## Conversion algorithm

**Algorithm**

1. Create a random 4x4 matrix.
2. Check that the matrix is invertible by computing its determinant. If the determinant is zero, generate a new matrix and repeat until it is non-zero.
3. Multiply the original feature matrix by the encoding matrix. The result is difficult to decrypt without knowing the encoding matrix.
4. The function returns the transformed features and the encoding matrix.
5. Calculate the R2 metric for a model trained on the original features.
6. Calculate the R2 metric for a model trained on the protected features. It should equal the metric for the original features, which means the transformed matrix can be used to train models without transmitting personal data.

In [9]:
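def secure(features):
    """Multiply the features by a random invertible matrix; return both."""
    n = features.shape[1]  # the encoding matrix must be square: n x n
    det = 0
    # Redraw until the determinant is not (numerically) zero;
    # an exact == 0 comparison is unreliable for floating-point values
    while np.isclose(det, 0):
        encryption_matrix = np.random.normal(-100, 100, (n, n))
        det = np.linalg.det(encryption_matrix)
    new_features = features @ encryption_matrix
    return new_features, encryption_matrix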

**Rationale**

Section 2 proves that multiplying by an invertible matrix does not affect the quality of linear regression. The data are protected because the matrix is generated randomly: without knowing it, the original values are difficult to recover.
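
One detail is worth spelling out. Section 2 assumes that $X$ contains a column of ones, while `secure` multiplies only the four feature columns and `LinearRegression` fits the shift separately. A short sketch of why the result still applies, writing $F$ for the feature matrix without the column of ones and $\mathbf{1}$ for that column (notation introduced here for illustration):

$$
\begin{pmatrix} \mathbf{1} & FP \end{pmatrix} = \begin{pmatrix} \mathbf{1} & F \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & P \end{pmatrix}, \qquad \det \begin{pmatrix} 1 & 0 \\ 0 & P \end{pmatrix} = \det P \neq 0
$$

The block matrix is invertible whenever $P$ is, so transforming only the features is a special case of multiplying the full matrix $X$ by an invertible matrix.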

**Conclusion**

Based on the result of section 2, we created an algorithm that transforms the original feature matrix and helps protect personal data.
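
The determinant test in step 2 rejects only exactly singular matrices. A possible alternative, which the original project does not use, is to test the condition number. That also rejects matrices that are nearly singular and would amplify rounding errors during decryption. In the sketch below the helper name `random_invertible_matrix` and the threshold `max_cond` are hypothetical.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e6):
    """Draw n x n Gaussian matrices until one is well conditioned.

    max_cond is an illustrative threshold, not a value from the project.
    """
    while True:
        matrix = rng.normal(size=(n, n))
        # The condition number is very large (or inf) for singular
        # and nearly singular matrices
        if np.linalg.cond(matrix) < max_cond:
            return matrix

rng = np.random.default_rng(42)
P = random_invertible_matrix(4, rng)

# How far P @ inv(P) is from the identity matrix
identity_error = np.abs(P @ np.linalg.inv(P) - np.eye(4)).max()
```

Passing a seeded generator makes the encoding matrix reproducible. For real protection the seed would have to be kept secret along with the matrix.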

## Algorithm check

### Initial matrix

In [10]:
features = data.drop('Страховые выплаты', axis = 1)
target = data['Страховые выплаты']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=RANDOM_STATE)
print(features_train.shape[0], features_test.shape[0])
print(target_train.shape[0], target_test.shape[0])

3877 970
3877 970


In [11]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
print("R2 in the original matrix =", r2_score(target_test, predictions))
features.head(5)

R2 in the original matrix = 0.41605492161510926


Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,1,41.0,49600.0,1
1,0,46.0,38000.0,1
2,0,29.0,21000.0,0
3,0,21.0,41700.0,2
4,1,28.0,26100.0,0


### New matrix

In [12]:
model = LinearRegression()
features = data.drop('Страховые выплаты', axis = 1)
target = data['Страховые выплаты']
features, encryption_matrix = secure(features)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=RANDOM_STATE)

model.fit(features_train, target_train)
predictions = model.predict(features_test)
print("R2 in the new matrix =", r2_score(target_test, predictions))
features.head()

R2 in the new matrix = 0.4160549216148012


Unnamed: 0,0,1,2,3
0,-4089606.0,-5772087.0,2480267.0,-6177497.0
1,-3134959.0,-4423662.0,1897538.0,-4732324.0
2,-1732826.0,-2444976.0,1048061.0,-2615025.0
3,-3436837.0,-4851388.0,2087465.0,-5194181.0
4,-2152658.0,-3037981.0,1304072.0,-3250379.0


In [13]:
print("Coding matrix:")
encryption_matrix

Coding matrix:


array([[  45.33656695,  -68.55288519,   44.97948693,  -24.81133755],
       [-118.36626948, -104.77817966, -178.66369112,   31.30715965],
       [ -82.35208622, -116.28271444,   50.15438444, -124.56824415],
       [-134.87994016,  -99.45558118, -110.37658903, -171.30628829]])

### Data decryption

In [14]:
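# Multiply by the inverse of the encoding matrix to recover the original features
encryption_matrix_inv = np.linalg.inv(encryption_matrix)
decrypted = features @ encryption_matrix_inv
# round() removes floating-point error; abs() turns -0.0 into 0.0
# (valid here only because all features are non-negative)
decrypted = decrypted.round().abs()
decrypted.columns = ['Пол', 'Возраст', 'Зарплата', 'Члены семьи']
decrypted.head()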

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,1.0,41.0,49600.0,1.0
1,0.0,46.0,38000.0,1.0
2,0.0,29.0,21000.0,0.0
3,0.0,21.0,41700.0,2.0
4,1.0,28.0,26100.0,0.0


**Conclusion**

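The check confirms that the algorithm works in practice. R2 is about 0.41605 both before and after the transformation (the two values differ by roughly 3e-13, which is floating-point rounding), and multiplying by the inverse matrix restores the original features.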
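
The round trip can also be verified programmatically instead of by comparing printed rows. The sketch below is a minimal illustration on synthetic rows shaped like the features, with a hand-picked invertible key. The data and the key are assumptions, not the values used above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic rows in the same order as the features:
# sex, age, salary, family members (illustration only)
original = np.column_stack([
    rng.integers(0, 2, size=50),
    rng.integers(18, 66, size=50),
    rng.integers(53, 791, size=50) * 100,
    rng.integers(0, 7, size=50),
]).astype(float)

# Strictly diagonally dominant, hence invertible
key = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 5.0],
])

encrypted = original @ key
decrypted = encrypted @ np.linalg.inv(key)

# Largest absolute deviation from the original values
max_error = np.abs(decrypted - original).max()
# Tolerance-based comparison of the whole matrix
restored = np.allclose(decrypted, original)
```

Comparing with `np.allclose` does not rely on rounding or on taking absolute values, so it would also work for features that can be negative or fractional.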

## Conclusion
1. Pre-processed the data: removed duplicates and checked the quality of the source data (no missing values or anomalies).
2. Prepared the theoretical basis: proved that multiplying the feature matrix by an invertible matrix does not change the quality of linear regression.
3. Created an algorithm that encrypts personal data, based on the result of the second step.
4. Tested the algorithm in practice and confirmed that it works correctly.
5. The R2 metrics before and after encryption are equal: **0.41605**.