# Protection of customers' personal data

## Loading data

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('/project/datasets/insurance.csv')

In [3]:
df.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
df.isna().mean()

Пол                  0.0
Возраст              0.0
Зарплата             0.0
Члены семьи          0.0
Страховые выплаты    0.0
dtype: float64

In [6]:
df.duplicated().sum()

153

In [7]:
df[df.duplicated()].head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
281,1,39.0,48100.0,1,0
488,1,24.0,32900.0,1,0
513,0,31.0,37400.0,2,0
718,1,22.0,32600.0,1,0
785,0,20.0,35800.0,0,0


In [8]:
df.dtypes

Пол                    int64
Возраст              float64
Зарплата             float64
Члены семьи            int64
Страховые выплаты      int64
dtype: object

In [9]:
df['Зарплата'] = df['Зарплата'].astype('int')
df['Возраст'] = df['Возраст'].astype('int')

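### Conclusion
The data contains 5,000 rows and five columns: gender (`Пол`), age (`Возраст`), salary (`Зарплата`), number of family members (`Члены семьи`), and insurance payouts (`Страховые выплаты`). There are no missing values. There are 153 fully identical rows (about 3% of the data). Since the table has no customer identifier, they may be different customers with matching attributes rather than true duplicates, so we leave them as is. After converting age and salary, all columns are of integer type.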
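If the identical rows were instead treated as true duplicates, they could be counted and removed roughly as follows. This is only a sketch on a toy frame: the column names `age` and `salary` and all values are made up for illustration, and the notebook itself keeps the rows.

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the first one
toy = pd.DataFrame({
    'age': [39, 24, 39, 31],
    'salary': [48100, 32900, 48100, 37400],
})

# duplicated() marks each repeat of an earlier row; the first occurrence is not marked
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```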

## Matrix multiplication

**Answer the question:** The features are multiplied by an invertible matrix. Will the quality of the linear regression change? (The model can be retrained.)

Notation:

- $X$ - feature matrix (column zero consists of ones)

- $y$ - target vector

- $P$ - matrix by which the features are multiplied

- $w$ - vector of linear regression weights (element zero is the bias)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
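As a minimal sketch of the closed-form solution above, the snippet below computes the weights on synthetic data and compares them with NumPy's least-squares solver. The data, the seed, and the names `features_demo`, `X_demo`, `y_demo`, and `w_normal` are illustrative assumptions and have nothing to do with the insurance dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed, for reproducibility only

# Synthetic features and target (assumed data, for illustration)
features_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.1, size=200)
y_demo = features_demo @ np.array([1.5, -2.0, 0.5]) + 4.0 + noise

# Prepend a column of ones so that w[0] plays the role of the bias
X_demo = np.hstack([np.ones((200, 1)), features_demo])

# w = (X^T X)^{-1} X^T y, solved as a linear system instead of forming the inverse
w_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Reference solution from NumPy's least-squares routine
w_lstsq = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print(np.allclose(w_normal, w_lstsq))
```

Solving the linear system avoids computing $(X^T X)^{-1}$ explicitly, which is usually the numerically safer choice.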

**Answer:** If the feature matrix is multiplied by an invertible matrix and the model is retrained, the quality of the linear regression does not change.

**Rationale:**   

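Replace $X$ with $XP$, where $P$ is an invertible matrix, and let $a'$ denote the predictions of the model retrained on $XP$.  
The proof uses the following identities, where $E$ is the identity matrix, and assumes that $X^TX$ is invertible, as the training formula already requires:  

$$
(AB)^T = B^TA^T, \quad (AB)^{-1} = B^{-1}A^{-1}, \quad AA^{-1} = A^{-1}A = E
$$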

$$
a' = XP((XP)^TXP)^{-1}(XP)^Ty = XP(P^TX^TXP)^{-1}P^TX^Ty = XPP^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty = XE(X^TX)^{-1}EX^Ty = X(X^TX)^{-1}X^Ty
$$

Since $w = (X^T X)^{-1} X^T y$, this gives
 
$$
a' = Xw
$$

$$
a' = a
$$
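The same identities also show how the weights themselves change. Writing $w'$ for the weights learned on the transformed features $XP$ (a symbol introduced here, not used above), and again assuming $X^TX$ is invertible, a short worked derivation gives:

$$
w' = ((XP)^T XP)^{-1} (XP)^T y = P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty = P^{-1}(X^TX)^{-1}X^Ty = P^{-1}w
$$

So the retrained model has different weights, $w' = P^{-1}w$, while its predictions $XPw' = XPP^{-1}w = Xw$ stay the same.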

## Transformation algorithm and its validation

1. Create a random 4x4 matrix
2. Find its inverse, which also confirms that the matrix is invertible
3. Train a linear regression model on the original features
4. Retrain the model on the features multiplied by the inverse matrix and compare the quality

Let's create a random matrix of size 4x4

In [10]:
matrix_P = np.random.normal(size=(4, 4))
matrix_P

array([[ 0.63856358, -0.89248069,  0.94651448, -0.69051953],
       [-0.493704  , -0.46422892,  0.41918516, -1.38642664],
       [ 0.12238673, -0.50285247,  0.34521165, -0.21889049],
       [-0.39273841, -0.3198608 ,  1.98625478,  0.42249399]])

Let's find its inverse matrix

In [11]:
i_matrix_P = np.linalg.inv(matrix_P)
i_matrix_P

array([[ 1.41868595, -0.47608271, -1.92512914, -0.2409877 ],
       [ 1.04680802,  0.18606207, -4.13296591,  0.18020578],
       [ 0.59296941,  0.06242646, -1.39657793,  0.45044201],
       [-0.67641913, -0.59517264,  1.6471564 ,  0.1616663 ]])
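A quick sanity check of the inversion can look like the sketch below. It draws its own matrix with a seeded generator, so the names `P_demo` and `P_demo_inv` and the seed are assumptions and the values differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility

# A random Gaussian matrix is invertible with probability 1,
# but an explicit check guards against a badly conditioned draw
P_demo = rng.normal(size=(4, 4))

# np.linalg.inv raises LinAlgError if the matrix is exactly singular
P_demo_inv = np.linalg.inv(P_demo)

# The product with the inverse should equal the identity up to rounding error
is_identity = np.allclose(P_demo @ P_demo_inv, np.eye(4))

# The condition number indicates how strongly rounding errors are amplified
condition_number = np.linalg.cond(P_demo)

print(is_identity, condition_number)
```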

In [12]:
features = df.drop(columns=['Страховые выплаты'])
target = df['Страховые выплаты']


Let's split the dataset into training and validation samples.

In [13]:
X_train, X_valid, y_train, y_valid = train_test_split(features, target, test_size=0.25,
                                                    random_state=42)

In [14]:
new_X_valid = X_valid @ i_matrix_P
new_X_train = X_train @ i_matrix_P
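Because the transformation matrix is invertible, whoever holds it can undo the transformation, while the transformed table alone does not show the original values. A minimal sketch on a toy array (the values, the seed, and the names `key_matrix`, `masked`, and `restored` are hypothetical, not taken from the notebook):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility only

# Toy feature rows: gender, age, salary, family members (made-up values)
toy_features = np.array([
    [1, 41, 49600, 1],
    [0, 46, 38000, 1],
    [0, 29, 21000, 0],
], dtype=float)

# Random matrix that plays the role of the transformation matrix
key_matrix = rng.normal(size=(4, 4))

# Transformed features: each column is now a mix of all original columns
masked = toy_features @ key_matrix

# Multiplying by the inverse of the key matrix restores the original values
restored = masked @ np.linalg.inv(key_matrix)

# The absolute tolerance is a loose, assumed value that absorbs rounding error
print(np.allclose(restored, toy_features, atol=1e-6))
```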

Train the model on the original training sample and evaluate it on the validation sample

In [15]:
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_valid)
print('Linear regression quality metric R2:',r2_score(y_valid, predictions))


Linear regression quality metric R2: 0.4254778535754764


In [16]:
model = LinearRegression()
model.fit(new_X_train, y_train)
predictions = model.predict(new_X_valid)
print('Linear regression quality metric R2 after multiplication by the random inverse matrix:',r2_score(y_valid, predictions))

Linear regression quality metric R2 after multiplication by the random inverse matrix: 0.42547785357556756


## Conclusion

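Two linear regression models were trained and evaluated: one on the original features and one on the features multiplied by a random invertible matrix. The R2 scores on the validation sample agree to 12 decimal places (0.4254778535754764 vs 0.42547785357556756), and the remaining difference is consistent with floating-point rounding. The transformation does not change the quality of the model, so the algorithm works.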
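The comparison of the two scores can also be made explicit in code. The sketch below copies the two R2 values reported above; the tolerance passed to `np.isclose` is an assumed choice, not a standard threshold.

```python
import numpy as np

# R2 values copied from the two runs reported above
r2_original = 0.4254778535754764
r2_transformed = 0.42547785357556756

# Absolute difference between the two scores
difference = abs(r2_original - r2_transformed)

# Treat the scores as equal if they differ by less than the assumed tolerance
same_quality = bool(np.isclose(r2_original, r2_transformed, rtol=0, atol=1e-10))

print(difference, same_quality)
```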