<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-description" data-toc-modified-id="Project-description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project description</a></span></li><li><span><a href="#Data-description" data-toc-modified-id="Data-description-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data description</a></span></li><li><span><a href="#Data-loading" data-toc-modified-id="Data-loading-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data loading</a></span><ul class="toc-item"><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></li><li><span><a href="#Multiplying-matrices" data-toc-modified-id="Multiplying-matrices-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Multiplying matrices</a></span><ul class="toc-item"><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Algorithm-of-transformation" data-toc-modified-id="Algorithm-of-transformation-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Algorithm of transformation</a></span></li><li><span><a href="#Checking-algorithm" data-toc-modified-id="Checking-algorithm-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Checking algorithm</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Protecting customers' personal information

## Project description

**Brief information:** an insurance company plans to use data encryption to keep personal information secure.

**Objective:** protect user data from leaks as the company develops machine learning models.

**Tasks:** develop a method for transforming user data to make it harder to recover personal information, while not degrading the quality of machine learning models.

## Data description

The dataset contains one table with the following columns:
* `Пол`: the client's gender;
* `Возраст`: the client's age;
* `Зарплата`: the client's salary;
* `Члены семьи`: the number of people in the client's family;
* `Страховые выплаты`: the number of insurance payments to the client in the last 5 years.

## Data loading

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import make_spd_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from numpy.linalg import inv, cond, matrix_rank

In [2]:
data = pd.read_csv('https://code.s3.yandex.net/datasets/insurance.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
data.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [5]:
data.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


The dataset consists of 5 columns. The target is the number of insurance payments to the client in the last 5 years. The remaining columns are features that describe the client's socio-demographic profile (gender, age, number of family members) and consumer profile (salary).

A quick exploratory analysis of the data revealed no missing values or anomalies in the data set.

Next, we'll check the data for explicit duplicates.

In [6]:
#checking for duplicates
print(f'Number of duplicates: {data.duplicated().sum()}')

Number of duplicates: 153


In [7]:
#showing first 15 rows of duplicates (sorting by age)
data[data.duplicated(keep=False)].sort_values(by='Возраст').head(15)

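| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 1566 | 1 | 18.0 | 39800.0 | 2 | 0 |
| 2429 | 1 | 18.0 | 39800.0 | 2 | 0 |
| 2694 | 1 | 19.0 | 52600.0 | 0 | 0 |
| 712 | 1 | 19.0 | 52600.0 | 0 | 0 |
| 851 | 0 | 19.0 | 51700.0 | 0 | 0 |
| 887 | 1 | 19.0 | 35500.0 | 0 | 0 |
| 664 | 1 | 19.0 | 35500.0 | 0 | 0 |
| 1336 | 1 | 19.0 | 32700.0 | 0 | 0 |
| 1751 | 0 | 19.0 | 38600.0 | 0 | 0 |
| 1990 | 1 | 19.0 | 41600.0 | 1 | 0 |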


The data contains exact duplicate rows, and a complete coincidence across all five columns seems unlikely. The duplicates may be due to a technical error when the data was downloaded. Before developing the encryption algorithm and testing its quality, let's remove the repeated rows.

In [8]:
#dropping duplicates and rebuilding the index so it has no gaps
data = data.drop_duplicates().reset_index(drop=True)
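
The duplicate checks above count rows in two different ways: `duplicated()` flags only the repeated occurrences, while `duplicated(keep=False)` flags every row that has a copy. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical
toy = pd.DataFrame({
    "age": [18.0, 19.0, 18.0],
    "salary": [39800.0, 52600.0, 39800.0],
})

# Default keep='first': only the later copy is flagged
n_repeats = toy.duplicated().sum()
# keep=False: both copies are flagged
n_involved = toy.duplicated(keep=False).sum()
# drop_duplicates keeps the first occurrence of each row
n_unique = len(toy.drop_duplicates())

print(n_repeats, n_involved, n_unique)  # 1 2 2
```

This is why the count printed by `duplicated().sum()` is smaller than the number of rows returned when filtering with `keep=False`.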

For further work, let's rename the columns for convenience.

In [9]:
data.columns = ['sex', 'age', 'salary', 'family_members', 'payments_number']

### Conclusions

* **A quick EDA showed that there are no missing values or anomalies in the observations. However, there are 153 exact duplicate rows. The duplicates may be due to a technical error during the data download. The duplicate rows have been removed.**
* **The dataset consists of 5 columns. The target is the number of insurance payments to the client in the last 5 years. The other columns are features that describe the socio-demographic profile (gender, age, number of family members) and the consumer profile (client's salary).**

## Multiplying matrices

Notation:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Learning objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$
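
The learning formula can be checked numerically. The sketch below is illustrative only: it uses synthetic data rather than the insurance dataset, and the names `features`, `true_w`, `w_normal` and `w_lstsq` are assumptions introduced here. It compares the closed-form weights with NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the insurance dataset)
rng = np.random.default_rng(0)
n_samples, n_features = 200, 4
features = rng.normal(size=(n_samples, n_features))
true_w = np.array([1.5, -2.0, 0.5, 3.0, 0.25])  # intercept first

# X: feature matrix with a leading column of ones, as in the notation above
X = np.column_stack([np.ones(n_samples), features])
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Learning formula: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

The formula requires $X^T X$ to be invertible, which holds here because the synthetic columns are linearly independent.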

By the condition of the problem, the features are multiplied by an invertible matrix. It is necessary to answer the following question:

**The question:** does the quality of linear regression change when the features are multiplied by an invertible matrix?


**Solution:** 

The quality of the model does not change if the predictions obtained from the transformed features are equal to those obtained from the original features.

To prove this, we multiply the features $X$ by the invertible matrix $P$ and obtain the encoded features $X_1$:

$$
X_1 = XP
$$

If we substitute the encoded features into the learning formula, the new weight vector $w_1$ takes the form:

$$
w_1 = ((X P)^T X P)^{-1} (X P)^T y
$$

Let's plug the resulting weights into the prediction formula to get the new predictions $a_1$:

$$
a_1 = XP ((X P)^T X P)^{-1} (X P)^T y
$$

Expand the transpose $(XP)^T = P^T X^T$:

$$
a_1 = XP (P^T X^T X P)^{-1} P^T X^T y
$$

Expand the inverse $(P^T X^T X P)^{-1}$. The matrices $P^T$, $X^T X$ and $P$ are square and of the same size, and each is invertible ($P$ by the condition of the problem, $X^T X$ because the learning formula requires it), so the rule $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ applies:

$$
a_1 = X P P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$

Since $P P^{-1} = E$ and $(P^T)^{-1} P^T = E$, where $E$ is the identity matrix:

$$
a_1 = XE(X^T X)^{-1}E X^T y
$$

Therefore:

$$
a_1 = X (X^T X)^{-1} X^T y = Xw = a
$$

**The answer:** the predictions are exactly the same, so the quality of the linear regression does not change when the features are multiplied by an invertible matrix.

**Justification:** a square invertible matrix $P$ has an inverse $P^{-1}$, and the product of $P$ and $P^{-1}$ is the identity matrix $E$. Multiplying any matrix by the identity matrix leaves it unchanged, so $P$ cancels out of the prediction formula. Thus, the quality of the linear regression does not change.
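
The same steps also show how the weights themselves change. As a sketch under the same assumptions as the derivation ($P$ and $X^T X$ are invertible), and writing $w_1$ for the weights learned from the transformed features $XP$ (a label used here for illustration):

$$
w_1 = ((XP)^T XP)^{-1} (XP)^T y = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w
$$

So the model trained on the transformed features has weights $P^{-1} w$, and its predictions are $XP w_1 = XPP^{-1}w = Xw$.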

### Conclusion

* **Multiplying the features by a random invertible matrix $P$ does not change the quality of the linear regression: substituting the transformed features into the formulas gives the same predictions as for the original features. Thus, to encrypt user information, the features can be multiplied by a random invertible matrix $P$. To decrypt them, the inverse matrix $P^{-1}$ must be used.**
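
The analytical result can also be checked numerically. The following is a minimal sketch on synthetic data, not the insurance dataset; the helper `fit_normal_equation` and the variable names `w_1` and `a_1` (weights and predictions for the transformed features) are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for the feature matrix X (leading column of ones) and target y
X = np.column_stack([np.ones(100), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

# Random square matrix P; full rank means it is invertible
P = rng.random((5, 5))
assert np.linalg.matrix_rank(P) == 5

def fit_normal_equation(X, y):
    """Return w = (X^T X)^{-1} X^T y, computed with a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit_normal_equation(X, y)        # weights for the original features
w_1 = fit_normal_equation(X @ P, y)  # weights for the transformed features

a = X @ w            # original predictions
a_1 = (X @ P) @ w_1  # predictions from the transformed features

print("max |a - a_1|:", np.abs(a - a_1).max())
print("w_1 equals P^-1 w:", np.allclose(w_1, np.linalg.inv(P) @ w))
```

The predictions agree only up to floating-point round-off, so comparisons should use a tolerance rather than exact equality.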

## Algorithm of transformation

**Algorithm**

1. To encrypt the features, create a random square matrix $P$ of size $4 \times 4$ (since there are 4 features in the dataset).
2. Check that it is invertible by computing its rank, which must equal its dimension. Repeat step 1 if the condition is not met.
3. Multiply the features of the training and test sets by the resulting random matrix.
4. Train the model on the encrypted features and calculate the $R^2$ score on the test set.
5. Train the model on the unencrypted features and calculate the $R^2$ score on the test set.
6. Compare the obtained metrics. If they are equal, then the algorithm worked without errors.
7. As an additional check, decrypt the features and compare the result with the original feature matrix.

**Validation**

Multiplying the features by a random invertible matrix does not change the quality of the linear regression, because the product of the inverse matrix $P^{-1}$ and $P$ is the identity matrix, and multiplying any matrix by the identity matrix produces the same matrix.

## Checking algorithm

Let's select the features and the target values (the number of insurance payments to the client in the last 5 years) to test the algorithm.

In [10]:
features = data.drop('payments_number', axis=1)
target = data['payments_number']

Let's create the feature matrix.

In [11]:
features_matrix = features.values

Let's define a function that creates a random invertible matrix of the desired size, and encrypt the features with the resulting matrix.

In [12]:
#random matrix generator function with rank checking
def mat_generator(x):
    matrix_size = x.shape[1]
    while True:
        rand_matrix = np.random.rand(matrix_size, matrix_size)
        #a square matrix is invertible only if it has full rank
        if matrix_rank(rand_matrix) == matrix_size:
            return rand_matrix
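
A full-rank matrix can still be close to singular, which makes the later decryption with `inv` numerically less accurate. The notebook imports `cond` from `numpy.linalg` but does not use it in the cells shown. One possible variant, given here only as a sketch, also rejects poorly conditioned matrices. The function name `mat_generator_well_conditioned`, the `max_cond` threshold of `1e3` and the `rng` argument are assumptions for illustration, not part of the original algorithm.

```python
import numpy as np
from numpy.linalg import cond, matrix_rank

def mat_generator_well_conditioned(n, max_cond=1e3, rng=None):
    """Return a random n x n matrix with full rank and condition number below max_cond.

    Hypothetical helper: the max_cond default is an arbitrary illustrative threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    while True:
        candidate = rng.random((n, n))
        # Full rank means invertible; a small condition number means
        # the inverse can be computed with little loss of precision
        if matrix_rank(candidate) == n and cond(candidate) < max_cond:
            return candidate

P = mat_generator_well_conditioned(4, rng=np.random.default_rng(0))
print(P.shape, cond(P))
```

Passing a seeded generator makes the key matrix reproducible, which matters if the encrypted features have to be decrypted later.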

In [13]:
#random matrix
rand_matrix = mat_generator(features)

#features encryption function
def encryption(features, random_matrix):
    matrix_transf = features @ random_matrix
    features_cipher = list(matrix_transf)
    return features_cipher, matrix_transf

#encrypted features
features_cipher, matrix_transf = encryption(features_matrix, rand_matrix)

In [14]:
rand_matrix

array([[0.53538319, 0.3886399 , 0.38970879, 0.06758633],
       [0.16717813, 0.52127807, 0.47296582, 0.51615975],
       [0.74907736, 0.45518234, 0.44541015, 0.57540569],
       [0.8870556 , 0.81383199, 0.67532486, 0.95400307]])

Let's split our dataset into train and test sets.

In [15]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=42)

Let's create two linear regression models with and without encryption.

In [16]:
model_no_cipher = LinearRegression()
model_cipher = LinearRegression()

Let's fit the model without encryption and check the $R^2$ score.

In [17]:
model_no_cipher.fit(features_train, target_train)
predicted = model_no_cipher.predict(features_test)
print("R2:", round(r2_score(target_test, predicted), 5))

R2: 0.44346


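Let's split the encrypted feature matrix into train and test sets, using the same `test_size` and `random_state` as before so that the rows match `target_train` and `target_test`.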

In [18]:
features_train_cipher, features_test_cipher = train_test_split(features_cipher, test_size=0.25, random_state=42)
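
Reusing `target_train` and `target_test` with the encrypted split relies on `train_test_split` selecting rows by position: two calls on inputs of the same length with the same `test_size` and `random_state` pick the same rows. A minimal sketch on toy arrays (the arrays `a` and `b` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two toy arrays of equal length; b is a row-wise transformation of a
a = np.arange(20).reshape(10, 2)
b = a * 10

a_train, a_test = train_test_split(a, test_size=0.25, random_state=42)
b_train, b_test = train_test_split(b, test_size=0.25, random_state=42)

# The same row positions end up in the train and test parts of both splits
same_train = np.array_equal(a_train * 10, b_train)
same_test = np.array_equal(a_test * 10, b_test)
print(same_train, same_test)
```

If the two inputs had different lengths, or the seeds differed, the rows would no longer line up with the targets.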

In [19]:
model_cipher.fit(features_train_cipher, target_train)
predicted_cipher = model_cipher.predict(features_test_cipher)
print("R2:", round(r2_score(target_test,predicted_cipher), 5))

R2: 0.44346


The coefficient of determination is the same for both models, so the algorithm for encrypting the features worked without error. As an additional check, we decrypt the features using the inverse of the previously created random matrix.

In [20]:
#recovering the original matrix with the inverse of the random matrix
original_matrix = matrix_transf @ inv(rand_matrix)
#elementwise comparison after rounding away floating-point error
np.round(features_matrix) == np.round(original_matrix)

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       ...,
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

The recovered matrix matches the original feature matrix. The algorithm worked without errors.
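
Rounding before comparing hides floating-point error, but it also hides any real difference smaller than the rounding step, and the elementwise output has to be inspected by eye. A stricter alternative is `np.allclose`, which returns a single boolean under explicit tolerances. It is sketched here on synthetic data rather than the notebook's variables; `features_demo` and `key_demo` are hypothetical names.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(1)

# Synthetic stand-ins for the feature matrix and the random key matrix
features_demo = rng.integers(1, 100, size=(6, 4)).astype(float)
key_demo = rng.random((4, 4))

encrypted = features_demo @ key_demo
recovered = encrypted @ inv(key_demo)

# Exact equality is sensitive to floating-point round-off
print("exact match:", np.array_equal(features_demo, recovered))
# allclose compares within relative and absolute tolerances
print("close match:", np.allclose(features_demo, recovered))
```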

## Summary

* **During the development of the encryption algorithm for user data, exact duplicate rows were detected and removed. The duplicates may be due to a technical error during the data download.**
* **The target is the number of insurance payments to the client in the last 5 years. The remaining columns are features that describe the socio-demographic profile (gender, age, number of family members) and the consumer profile (client's salary). All of this is user data that must be encrypted.**
* **To encrypt user data, the features should be multiplied by a random invertible matrix. The quality of the linear regression model does not change because the random matrix is invertible: the product of a matrix and its inverse is the identity matrix, and multiplying any matrix by the identity matrix leaves that matrix unchanged.**