# Protecting the personal data of insurance company clients

Insurance company "X" wants to protect its customers' data. The task is to develop a data transformation method that makes it difficult to recover personal information from the transformed data.

There is no need to select the best model.

**Project goal:**

Protect the data in such a way that the quality of machine learning models does not deteriorate after the transformation.

**Project objectives:**

- Test the hypothesis: if the features are multiplied by an invertible matrix, the quality of linear regression does not change.
- Propose a data encryption algorithm and implement it using matrix operations. Verify with the R2 metric that the quality of sklearn's linear regression is the same before and after the transformation.

**Project plan**

1. Load and review the data.
2. Test the hypothesis: if the features are multiplied by an invertible matrix, does the quality of linear regression change?
3. Choose a data encryption algorithm and justify the choice.
4. Compare two models (before and after feature encryption) using the R2 metric.
5. Conclusions.

## Loading data

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
clients_insurance = pd.read_csv(r'C:\Users\Vadim\Documents\Datasets\insurance.csv')

In [3]:
clients_insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
clients_insurance.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [5]:
clients_insurance.columns = ['sex', 'age', 'salary', 'number of family members', 'insurance payments']

In [6]:
clients_insurance.head(5)

| | sex | age | salary | number of family members | insurance payments |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [7]:
clients_insurance.duplicated().value_counts()

False    4847
True      153
dtype: int64

In [8]:
def uniqu(tab):
    for column in tab.columns:
        print('Unique value of the column:', column)
        print(tab[column].unique())
        print('----------------------------------------------------------------------')

In [9]:
uniqu(clients_insurance)

Unique value of the column: sex
[1 0]
----------------------------------------------------------------------
Unique value of the column: age
[41. 46. 29. 21. 28. 43. 39. 25. 36. 32. 38. 23. 40. 34. 26. 42. 27. 33.
 47. 30. 19. 31. 22. 20. 24. 18. 37. 48. 45. 44. 52. 49. 35. 56. 65. 55.
 57. 54. 50. 53. 51. 58. 59. 60. 61. 62.]
----------------------------------------------------------------------
Unique value of the column: salary
[49600. 38000. 21000. 41700. 26100. 41000. 39700. 38600. 49700. 51700.
 36600. 29300. 39500. 55000. 43700. 23300. 48900. 33200. 36900. 43500.
 36100. 26600. 48700. 40400. 38400. 34600. 34800. 36800. 42200. 46300.
 30300. 51000. 28100. 64800. 30400. 45300. 38300. 49500. 19400. 40200.
 31700. 69200. 33100. 31600. 34500. 38700. 39600. 42400. 34900. 30500.
 24200. 49900. 14300. 47000. 44800. 43800. 42700. 35400. 57200. 29600.
 37400. 48100. 33700. 61800. 39400. 15600. 52600. 37600. 52500. 32700.
 51600. 60900. 41800. 47400. 26500. 45900. 35700. 34300. 26700. 2570

## Matrix multiplication

Notation:

- $X$ - feature matrix

- $y$ - target vector

- $P$ - matrix by which the features are multiplied

- $w$ - vector of linear regression weights (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Closed-form solution (normal equation):

$$
w = (X^T X)^{-1} X^T y
$$

**Question:** Features are multiplied by an invertible matrix. Will the quality of linear regression change?

**Hypothesis:** The quality of the linear regression will not change.

**Rationale:**

We multiply the feature matrix by a square invertible matrix $P$. The model trained on the transformed features $XP$ has weights $w'$ and predictions:

$$ a' = XPw' $$

Linear regression weights for the transformed features:

$$ w' = ((XP)^T (XP))^{-1} (XP)^T y $$

Expand the transposes using $(XP)^T = P^T X^T$:

$$ w' = (P^T X^T X P)^{-1} P^T X^T y $$

The matrix $X$ is generally not square and has no inverse, so the product may only be split into the square invertible factors $P^T$, $X^T X$ and $P$, using $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$:

$$ w' = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y $$

Since $(P^T)^{-1} P^T = E$, where $E$ is the identity matrix:

$$ w' = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w $$

Substitute the weights into the predictions:

$$ a' = XPw' = XPP^{-1}w = XEw = Xw = a $$

The predictions are identical, so multiplying the features by an invertible matrix does not affect the quality of linear regression. The weights themselves do change: $w' = P^{-1} w$.
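
The derivation can also be checked numerically. The snippet below is a minimal sketch on synthetic data (the matrix sizes, the random seed and the helper name `fit_normal_equation` are assumptions made for illustration, not part of the project data). The transformation matrix is made strictly diagonally dominant, which guarantees that it is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 observations, a column of ones (intercept) plus 4 features
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 4))])
y = rng.normal(size=100)

# Strictly diagonally dominant matrix: off-diagonal entries lie in [-1, 1),
# the diagonal is shifted by 6, so the matrix is guaranteed to be invertible
P = rng.uniform(-1, 1, size=(5, 5)) + 6 * np.eye(5)


def fit_normal_equation(X, y):
    """Least-squares weights from the normal equation (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


w = fit_normal_equation(X, y)          # weights on the original features
w_p = fit_normal_equation(X @ P, y)    # weights on the transformed features

a = X @ w                # original predictions
a_p = (X @ P) @ w_p      # predictions after the transformation

print(np.allclose(w_p, np.linalg.inv(P) @ w))  # weights are related by P^{-1}
print(np.allclose(a, a_p))                     # predictions coincide
```

Both checks should pass up to floating-point rounding, which mirrors the two results of the derivation: the weights change to $P^{-1} w$ while the predictions stay the same.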

**Checking the calculation of linear regression coefficients**

In [10]:
features = clients_insurance.drop('insurance payments',axis=1)
target = clients_insurance['insurance payments']

In [11]:
X = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)
y = target
w = np.linalg.inv(X.T @ X) @ X.T @ y
display(w[1:])
model = LinearRegression()
model.fit(features, target)
model.coef_

array([ 7.92580543e-03,  3.57083050e-02, -1.70080492e-07, -1.35676623e-02])

array([ 7.92580543e-03,  3.57083050e-02, -1.70080492e-07, -1.35676623e-02])

The manually calculated coefficients match the coefficients computed by sklearn.
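
As a side note, the explicit inverse in the normal equation can be replaced by a least-squares solver. The sketch below uses synthetic data as a stand-in for the insurance table (the sizes, the true coefficients and the noise level are assumptions for illustration) and compares the explicit-inverse formula with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the feature table: 200 rows, 4 features (assumption)
features = rng.normal(size=(200, 4))
target = (features @ np.array([0.5, -1.0, 2.0, 0.0])
          + 3.0
          + rng.normal(scale=0.1, size=200))

# Prepend a column of ones so that w[0] plays the role of the intercept
X = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)

# Normal equation with an explicit inverse
w_inv = np.linalg.inv(X.T @ X) @ X.T @ target

# Least-squares solver: does not form the inverse explicitly
w_lstsq, *_ = np.linalg.lstsq(X, target, rcond=None)

print(np.allclose(w_inv, w_lstsq))
```

On well-conditioned data the two approaches agree; the solver is usually preferred when the features are strongly correlated and $X^T X$ is close to singular.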

## Conversion algorithm

**Algorithm**

Based on the formulas above, the feature encryption algorithm is to multiply the feature matrix by an invertible matrix before the features are passed to the model.

**Rationale**

For the algorithm to work correctly, two conditions must be met:
- the transformation matrix is square, with as many rows as the feature matrix has columns, so that the product is defined and the number of features is preserved
- the determinant of the transformation matrix is not equal to zero (otherwise the matrix is not invertible)
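
The two conditions can be sketched in code as follows. This is an illustrative example on synthetic values, not the project's function: the toy matrix `X`, the seed and the use of `np.linalg.matrix_rank` as the invertibility check are assumptions. It also shows that whoever holds the matrix can restore the original features with its inverse.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy feature matrix: 5 rows, 4 features (synthetic values, not the insurance data)
X = rng.uniform(0, 100, size=(5, 4))
n = X.shape[1]

# Condition 1: a square n x n matrix, so that the product X @ P is defined
P = rng.integers(1, 10, size=(n, n)).astype(float)

# Condition 2: redraw until the matrix has full rank, i.e. is invertible
while np.linalg.matrix_rank(P) < n:
    P = rng.integers(1, 10, size=(n, n)).astype(float)

X_encrypted = X @ P                           # forward transformation
X_restored = X_encrypted @ np.linalg.inv(P)   # inverse transformation with the key

print(X_encrypted.shape == X.shape)
print(np.allclose(X, X_restored))
```

A rank check is used here instead of comparing a floating-point determinant with zero, since the computed determinant of a singular matrix can be a tiny nonzero number.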

## Algorithm check

We check the algorithm by comparing the quality of the models before and after feature encryption. The models are evaluated with the R2 metric.
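
For reference, R2 (the coefficient of determination) is defined as $1 - SS_{res} / SS_{tot}$, which is the quantity `r2_score` from sklearn computes for a single target. A minimal sketch of the formula, with a hypothetical helper `r2_manual` and made-up numbers:

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only
y_true = [0, 1, 0, 2, 0]
y_pred = [0.1, 0.8, 0.0, 1.5, 0.2]
print(r2_manual(y_true, y_pred))
```

A perfect model gives R2 = 1, and a model that always predicts the mean of the target gives R2 = 0, so identical predictions before and after encryption must produce identical R2 values.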

In [12]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=7072020)

In [13]:
regressor = LinearRegression()
scaler = StandardScaler()
pipeline = Pipeline([("standard_scaler", scaler), ("linear_regression", regressor)])
pipeline.fit(features_train, target_train)
Score = r2_score(target_test, pipeline.predict(features_test))
print("R2 =", Score)

R2 = 0.4184108151220888


Feature encryption function:

In [17]:
def encryption_features(features):
    n = features.shape[1]
    np.random.seed(7072020)   # fixed seed so that the result is reproducible
    encrypted_matrix = np.random.randint(1, 10, (n, n))   # candidate square transformation matrix
    # redraw (without reseeding, so the draws differ) until the determinant is nonzero,
    # i.e. until the matrix is invertible
    while np.isclose(np.linalg.det(encrypted_matrix), 0):
        encrypted_matrix = np.random.randint(1, 10, (n, n))
    encrypted_features = features @ encrypted_matrix
    return encrypted_features, encrypted_matrix   # function returns the transformed features and the invertible matrix

In [18]:
display(features.head())
encrypted_features, encrypted_matrix = encryption_features(features)
display(encrypted_features.head())
encrypted_matrix

| | sex | age | salary | number of family members |
|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 |
| 1 | 0 | 46.0 | 38000.0 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 |
| 4 | 1 | 28.0 | 26100.0 | 0 |


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 397184.0 | 49981.0 | 99373.0 | 248097.0 |
| 1 | 304422.0 | 38421.0 | 76185.0 | 190101.0 |
| 2 | 168261.0 | 21261.0 | 42116.0 | 105058.0 |
| 3 | 333805.0 | 41903.0 | 83486.0 | 208560.0 |
| 4 | 209059.0 | 26357.0 | 52320.0 | 130562.0 |


array([[7, 5, 8, 6],
       [9, 9, 4, 2],
       [8, 1, 2, 5],
       [8, 7, 1, 9]])

Now the data is encrypted. Retrain the model and check the R2 metric.

In [19]:
features_train, features_test, target_train, target_test = train_test_split(
    encrypted_features, target, test_size=0.25, random_state=7072020)

In [20]:
regressor = LinearRegression()
scaler = StandardScaler()
pipeline = Pipeline([("standard_scaler", scaler), ("linear_regression", regressor)])
pipeline.fit(features_train, target_train)
Score_encrypted = r2_score(target_test, pipeline.predict(features_test))
print("R2 =", Score_encrypted)

R2 = 0.4184108151220888


In [21]:
result = pd.DataFrame(data= [Score,
                      Score_encrypted], 
                     columns=['R2'], 
                     index=['Linear Regression',
                            'Linear Regression on Transformed Features'])
result

| | R2 |
|---|---|
| Linear Regression | 0.418411 |
| Linear Regression on Transformed Features | 0.418411 |


## Conclusion

This study tested the hypothesis that multiplying the feature matrix by an invertible matrix does not change the quality of linear regression. The hypothesis was first justified analytically and then checked with a data encryption algorithm based on it: a model was trained before and after encryption, and the two models were compared using the R2 metric.

**The hypothesis is confirmed: both models reach the same R2 (0.418411). Multiplying the feature matrix by an invertible matrix does not change the quality of linear regression, so it can be used to obscure the data without degrading the model.**