# Protection of personal data of clients

The task is to protect the personal data of an insurance company's customers. We need to develop a data transformation method that makes it difficult to recover personal information from the transformed data, and to justify why the method works.

The transformation must not degrade the quality of machine learning models.

## Data overview

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
try:
    df = pd.read_csv('insurance.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/insurance.csv')

In [3]:
def info(df):
    df.info()
    print(100*'=')
    display(df.describe())
    print(100*'=')
    display(df.head())
    print(100*'=')
    display(f'Shape: {df.shape}')
    
info(df) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |




| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |




'Shape: (5000, 5)'

All columns are numeric (`int64` or `float64`), and none of them contains missing values.

The dataset describes 5,000 clients: gender (`Пол`), age (`Возраст`), salary (`Зарплата`), number of family members (`Члены семьи`) and insurance payments (`Страховые выплаты`).

## Matrix multiplication



Notation:

- $X$ - feature matrix (its first column, with index 0, consists of ones)

- $y$ - target vector

- $P$ - matrix by which the features are multiplied

- $w$ - vector of linear regression weights (the element with index 0 is the shift, i.e. the intercept)

Prediction:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula (normal equation):

$$
w = (X^T X)^{-1} X^T y
$$
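The training formula can be checked numerically. The sketch below is only an illustration on synthetic data (not the insurance dataset); the variable names and the coefficients used to generate the target are assumptions made for this example. It compares the normal equation with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data for illustration only

# Synthetic features and target (assumed coefficients, not the insurance data).
features = rng.normal(size=(100, 3))
target = features @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Prepend the column of ones so that w[0] plays the role of the shift.
X = np.column_stack([np.ones(len(features)), features])

# Normal equation: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ target

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, target, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Both routes minimize the same MSE, so they should agree up to rounding error as long as $X^T X$ is well conditioned.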

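Suppose the features are multiplied by an invertible square matrix $P$, and let $w'$ be the weights trained on $XP$. Since $PP^{-1} = E$ (the identity matrix), the original predictions can be rewritten as

$$
a = Xw = XEw = XPP^{-1}w = (XP)P^{-1}w
$$

so the predictions stay the same provided that $w' = P^{-1}w$. To check this, substitute $XP$ for $X$ in the training formula:

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

Expand the transposes using $(XP)^T = P^T X^T$:

$$
w' = (P^T (X^T X) P)^{-1} P^T X^T y
$$

The matrices $P^T$, $X^T X$ and $P$ are all square and invertible (assuming $X^T X$ is invertible, which the training formula already requires), so the inverse of the product is the product of the inverses in reverse order:

$$
w' = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$

Since $(P^T)^{-1} P^T = E$:

$$
w' = P^{-1} (X^T X)^{-1} E X^T y = P^{-1} (X^T X)^{-1} X^T y = P^{-1}w
$$

The predictions of the model trained on the transformed features are therefore

$$
a' = (XP)w' = XPP^{-1}w = Xw = a
$$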
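The relation $w' = P^{-1}w$ can also be confirmed numerically. This is a minimal sketch on synthetic data; `normal_equation` is a hypothetical helper name, and here $X$ already contains the column of ones, as in the notation of this section.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; synthetic data for illustration only

# X includes the column of ones, as in the notation of this section.
X = np.column_stack([np.ones(200), rng.normal(size=(200, 4))])
y = rng.normal(size=200)
P = rng.normal(size=(5, 5))  # square, one row and column per column of X


def normal_equation(X, y):
    """Return w = (X^T X)^{-1} X^T y (hypothetical helper for this example)."""
    return np.linalg.inv(X.T @ X) @ X.T @ y


w = normal_equation(X, y)            # weights on the original matrix
w_prime = normal_equation(X @ P, y)  # weights on the transformed matrix

weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ w, (X @ P) @ w_prime)
print(weights_match, predictions_match)
```

The weights themselves differ between the two models; only after multiplying $w$ by $P^{-1}$ do they coincide, while the predictions coincide directly.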

## Transformation algorithm

**Algorithm**

The steps are:

- Generate a random invertible matrix of size 4×4 (one row and one column per feature).
- Multiply the feature matrix by this invertible matrix.
- Train one model on the original features and another on the transformed features.
- Compare the R2 scores of the two models.


**Justification**

As shown in the previous section, multiplying the features by an invertible matrix $P$ changes the weights to $w' = P^{-1}w$ but leaves the predictions unchanged. Therefore, the R2 scores of the two models must match.
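The first step of the algorithm relies on the random matrix actually being invertible. A Gaussian random matrix is singular with probability zero, but it can occasionally be ill-conditioned, which makes its inverse numerically unreliable. The sketch below shows one possible guard; `make_invertible_matrix`, `max_condition` and `max_tries` are hypothetical names and thresholds chosen for this example, not part of the original notebook.

```python
import numpy as np


def make_invertible_matrix(size, rng, max_condition=1e6, max_tries=100):
    """Draw random square matrices until one is well conditioned (hypothetical helper)."""
    for _ in range(max_tries):
        candidate = rng.normal(size=(size, size))
        # A moderate condition number means the inverse is numerically usable.
        if np.linalg.cond(candidate) < max_condition:
            return candidate
    raise RuntimeError("no well-conditioned matrix found")


rng = np.random.default_rng(42)
P = make_invertible_matrix(4, rng)

# P @ inv(P) should be the identity matrix up to rounding error.
identity_check = np.allclose(P @ np.linalg.inv(P), np.eye(4))
print(identity_check)
```

Passing a seeded generator makes the matrix reproducible, which matters if the same matrix has to be regenerated later to decode the data.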

## Algorithm testing

In [4]:
features = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']

Let's generate a random 4×4 matrix.

In [5]:
df_matrix = np.random.normal(size=(4, 4))

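Let's compute its inverse. `np.linalg.inv` raises `LinAlgError` for a singular matrix, so a successful call confirms that the matrix is invertible. The inverse is itself invertible, and it is the matrix used to transform the features.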

In [6]:
df_inv = np.linalg.inv(df_matrix)
df_inv

array([[ 0.34485232, -0.97114931,  1.4610764 , -0.57791344],
       [-0.60319162, -0.25182567, -1.5462653 ,  0.29951298],
       [-0.62044378,  0.30063757, -0.21405901,  0.43207225],
       [-0.26517208,  0.61987907,  0.54817347, -0.54529364]])

Multiply the features by the invertible matrix `df_inv`.

In [7]:
x = features.values
new_x = x@df_inv
new_x

array([[ 350567.04289896, -213084.06879331,  689366.61403979,
        -890874.02329922],
       [ 268469.59821054, -163199.43001596,  527925.3539854 ,
        -682235.48694315],
       [ 148337.90257192,  -90176.22969761,  291694.8679473 ,
        -376954.043353  ],
       ...,
       [ 239663.06097261, -145665.84328365,  471282.1220344 ,
        -609046.20374746],
       [ 231155.34466911, -140499.52854978,  454552.63026253,
        -587424.93140322],
       [ 286998.02739098, -174439.12708417,  564363.19173292,
        -729333.76425658]])
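The transformed values no longer resemble the original columns, but anyone who holds the matrix can undo the transformation by multiplying by its inverse. A minimal sketch with a small synthetic table follows; `x_demo`, `key`, `encoded` and `decoded` are names assumed for this example, and the value ranges only loosely mimic the dataset.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed; synthetic data for illustration only

# Synthetic stand-in for the feature table: gender, age, salary, family members.
x_demo = np.column_stack([
    rng.integers(0, 2, size=5),
    rng.integers(18, 66, size=5),
    rng.integers(5, 80, size=5) * 1000,
    rng.integers(0, 7, size=5),
]).astype(float)

key = rng.normal(size=(4, 4))            # secret invertible matrix
encoded = x_demo @ key                   # what would be stored or shared
decoded = encoded @ np.linalg.inv(key)   # recovery requires the key

# Largest absolute difference between the decoded and the original values.
max_error = np.abs(decoded - x_demo).max()
print(max_error)
```

The decoded values differ from the originals only by floating-point rounding error, so the matrix has to be kept secret in the same way as an encryption key.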

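Train the models. The linear regression is implemented with the normal equation; the class defined here shadows the `LinearRegression` imported from `sklearn` in the first cell.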

In [8]:
class LinearRegression:
    """Linear regression trained with the normal equation."""

    def fit(self, train_features, train_target):
        # Prepend a column of ones so that w[0] is the intercept.
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        # Normal equation: w = (X^T X)^{-1} X^T y
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
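The class prepends the column of ones after the features have been transformed, whereas the derivation multiplies the full matrix $X$, ones column included, by $P$. The two views agree if $P$ is extended to a block-diagonal matrix with 1 in the top-left corner, which is invertible whenever $P$ is. The sketch below checks this on synthetic data; `features_demo` and `P_full` are names assumed for this example.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed; synthetic data for illustration only

features_demo = rng.normal(size=(50, 4))  # synthetic features, no ones column
P = rng.normal(size=(4, 4))               # matrix applied to the features only

ones = np.ones((50, 1))
X = np.hstack([ones, features_demo])      # design matrix of the original model

# Design matrix built from the transformed features.
X_transformed = np.hstack([ones, features_demo @ P])

# Block-diagonal extension: 1 for the ones column, P for the features.
P_full = np.zeros((5, 5))
P_full[0, 0] = 1.0
P_full[1:, 1:] = P

same_design = np.allclose(X_transformed, X @ P_full)
print(same_design)
```

Because the design matrix of the second model equals $X$ times an invertible matrix, the derivation applies to the code unchanged.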

In [9]:
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
print(f'R2 score: {r2_score(target, predictions)}')

R2 score: 0.42494550286668


In [10]:
model_inv = LinearRegression()
model_inv.fit(new_x, target)
predictions = model_inv.predict(new_x)
print(f'R2 score (transformed features): {r2_score(target, predictions)}')

R2 score (transformed features): 0.4249455002483188


The R2 scores match to eight decimal places (0.42494550); the remaining difference is floating-point rounding error.
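Because the two printed scores are not bit-for-bit identical, comparing them with `==` would report a mismatch. A tolerance-based comparison is a more suitable check; the sketch below uses the two values printed in the cells above and NumPy's default tolerances.

```python
import numpy as np

r2_original = 0.42494550286668       # score printed for the original features
r2_transformed = 0.4249455002483188  # score printed for the transformed features

exactly_equal = r2_original == r2_transformed
close_enough = np.isclose(r2_original, r2_transformed)
difference = abs(r2_original - r2_transformed)

print(exactly_equal, close_enough, difference)
```

The difference is far below any level that would matter for model quality, so the two models can be treated as equivalent.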

If customer data needs to be protected, multiplying the features by a random invertible matrix is a suitable approach: the original values cannot be read directly from the transformed data, and the quality of the linear regression does not change.