# Protection of Clients' Personal Data

You need to protect the data of clients of the insurance company "Even Flood." Develop a method of transforming the data so that it is difficult to reconstruct personal information from it. Justify that the method works correctly.

It is necessary to protect the data in such a way that the quality of machine learning models does not deteriorate during transformation. It is not necessary to search for the best model.

# Project Execution Order:

1. Upload and study the data.
2. Answer the question and justify the solution.
3. If features are multiplied by an invertible matrix, will the quality of linear regression change? (To check, retrain it.)
    - a. It will change. Provide examples of matrices.
    - b. It will not change. Specify how the parameters of linear regression in the original problem and in the transformed one are related.
4. Propose a data transformation algorithm to solve the problem. Provide justification for why the quality of linear regression will not change.
5. Program this algorithm, applying matrix operations. Check whether the quality of linear regression from sklearn differs before and after transformation. Apply the R2 metric.

## Loading Data

Import all the necessary libraries for the project.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Load the file and save it to a variable.

In [2]:
pth1_ar = ['/Users/daniyardjumaliev/Jupyter/Projects/datasets/insurance.csv']
pth2_ar = ['backup_path1.csv']
df = None

# Try each primary path, then its backup; stop at the first file that exists.
for pth1, pth2 in zip(pth1_ar, pth2_ar):
    if os.path.exists(pth1):
        df = pd.read_csv(pth1)
        break
    elif os.path.exists(pth2):
        df = pd.read_csv(pth2)
        break
    else:
        print(f'Warning: File not found in both paths for {pth1} and {pth2}')

# Report success only if a file was actually read.
if df is not None:
    print('File loaded successfully.')

File loaded successfully.


Right now there is only one source for the file. If alternative locations appear later, they can be added to the second list (or to a third, fourth, and so on, in the same way), so that loading does not fail when one path is unavailable.
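The fallback idea can also be expressed as a single ordered list of candidate paths. The sketch below is only an illustration: the helper name `first_existing_path` and the candidate list are hypothetical, and the actual `pd.read_csv` call is left to the caller.

```python
import os

def first_existing_path(candidates):
    """Return the first path in `candidates` that exists, or None (hypothetical helper)."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

# Hypothetical candidate list, ordered by preference.
candidates = ['datasets/insurance.csv', 'backup_path1.csv']
found = first_existing_path(candidates)

if found is None:
    print('Warning: no candidate path exists:', candidates)
else:
    print('Would load:', found)
```

With this shape, adding a new location means appending one string to the list rather than creating another array.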

Let's check the quality of the data.

In [3]:
display(df)
display(df.describe())
display(df.info())
display(f'Duplicates: {df.duplicated().sum()}')
display(f'Missing values:')
display(df.isna().sum())

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


None

'Duplicates: 153'

'Missing values:'

Пол                  0
Возраст              0
Зарплата             0
Члены семьи          0
Страховые выплаты    0
dtype: int64

In [4]:
df.drop_duplicates(inplace=True)
display(f'Duplicates: {df.duplicated().sum()}')

'Duplicates: 0'

Let's check the correlation of the data.

In [5]:
correlation_df = df.corr()
cmap = plt.get_cmap('coolwarm')
styled_corr_matrix = correlation_df.style.background_gradient(cmap=cmap, axis=None)
display(styled_corr_matrix)

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
Пол,1.0,0.001953,0.015456,-0.007315,0.011565
Возраст,0.001953,1.0,-0.017386,-0.009064,0.654964
Зарплата,0.015456,-0.017386,1.0,-0.031687,-0.013123
Члены семьи,-0.007315,-0.009064,-0.031687,1.0,-0.039303
Страховые выплаты,0.011565,0.654964,-0.013123,-0.039303,1.0


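The features are essentially uncorrelated with one another (all pairwise coefficients are below 0.04 in absolute value). The only notable correlation is between age and insurance payouts, about 0.65; the other features show almost no linear relationship with the target.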

_Conclusion_: The data is of good quality: there are no missing values, and the summary statistics show no obvious outliers. 153 exact duplicates were removed from the dataset.

## Matrix Multiplication

Notations:

- $X$ — feature matrix (the zero column consists of ones)

- $y$ — vector of the target feature

- $P$ — matrix by which features are multiplied

- $w$ — vector of linear regression weights (the zero element is the bias)

Predictions:

$$
a = Xw
$$

Training task:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
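As a quick sanity check of the training formula, here is a minimal sketch on toy data. The data is invented for illustration (generated from $y = 1 + 2x$), so the exact weights are known in advance; `np.linalg.lstsq` is used only as an independent reference.

```python
import numpy as np

# Toy data generated from y = 1 + 2x, so the exact weights are [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Feature matrix with a leading column of ones for the bias term.
X = np.column_stack([np.ones_like(x), x])

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Independent reference solution from NumPy's least-squares solver.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)  # approximately [1. 2.]
```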

Next, let's answer the question and justify the solution.

- If features are multiplied by an invertible matrix, will the quality of linear regression change? (To check, retrain it.)
    - a. It will change. Provide examples of matrices.
    - b. It will not change. Specify how the parameters of linear regression in the original problem and in the transformed one are related.

To do this, let's write our own linear regression class and train it under different conditions. We will also write our own function to calculate the R2 metric.

**Justification:**

Properties used:
$$
(AB)^T=B^T A^T \tag 1
$$
$$
(AB)^{-1} = B^{-1} A^{-1} \tag 2
$$
$$
A A^{-1} = A^{-1} A = E \tag 3
$$
$$
AE = EA = A \tag 4
$$
$$
A(BC) = (AB)C \tag 5
$$

Property (2) applies only to square invertible matrices. Here $P$ (and therefore $P^T$) is invertible by assumption, and $X^T X$ is invertible because the training formula already requires it.

It is required to prove that the predictions do not change, that is, $a = a'$, where $a = Xw$ and $a' = X'w'$ with $X' = XP$.

Proof:

The weights in the original and in the transformed problem are
$$
w = (X^T X)^{-1} X^T y
$$
$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

Expand the transposes using properties (1) and (5):
$$
w' = (P^T (X^T X) P)^{-1} P^T X^T y
$$

Apply property (2) with $A=P^T$, $B=(X^TX)P$:
$$
w' = ((X^TX)P)^{-1} (P^T)^{-1} P^T X^T y
$$

By properties (3) and (4), $(P^T)^{-1} P^T = E$ drops out:
$$
w' = ((X^TX)P)^{-1} X^T y
$$

Apply property (2) again with $A=X^TX$, $B=P$:
$$
w' = P^{-1}(X^TX)^{-1} X^T y
$$

Since $w = (X^T X)^{-1} X^T y$, this gives
$$
w' = P^{-1}w
$$

Substituting into the predictions:
$$
a' = (XP)w' = XPP^{-1}w = XEw = Xw = a
$$

Therefore $a = a'$. QED
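The relation $w' = P^{-1}w$ can also be checked numerically. The sketch below uses synthetic data and a small hand-picked invertible matrix (both are assumptions made for illustration, not part of the project data); `fit` is a hypothetical helper that applies the training formula directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features and target (illustrative only).
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# A fixed invertible matrix: its determinant is 5.
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])

def fit(X, y):
    """Training formula: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

w = fit(X, y)          # weights for the original features
w_p = fit(X @ P, y)    # weights for the transformed features

# The new weights should equal P^{-1} w, and the predictions should coincide.
weights_related = np.allclose(w_p, np.linalg.inv(P) @ w)
predictions_equal = np.allclose(X @ w, (X @ P) @ w_p)
print(weights_related, predictions_equal)
```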

In [6]:
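class LinearRegressionCust:
    """Linear regression fitted with the closed-form training formula."""

    def fit(self, train_features, train_target):
        # Prepend a column of ones so that w[0] acts as the bias term.
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        # Training formula: w = (X^T X)^{-1} X^T y
        w = np.linalg.inv(X.T @ X) @ X.T @ y
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        # Prediction a = Xw, written as features @ weights + bias.
        return test_features.dot(self.w) + self.w0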

In [7]:
def r2_score_cust(y_true, y_pred):
    ssr = ((y_true - y_pred) ** 2).sum()
    sst = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1 - (ssr / sst)
    return r2

Let's separate the features from the target and split the data into training and test sets in a 70/30 ratio.

In [8]:
features_train = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']

X_train, X_test, y_train, y_test = train_test_split(features_train, target, test_size=0.3, random_state=12345)

display(f'Training features: {X_train.shape}')
display(f'Training target feature: {y_train.shape}')
display(f'Test features: {X_test.shape}')
display(f'Test target feature: {y_test.shape}')

'Training features: (3392, 4)'

'Training target feature: (3392,)'

'Test features: (1455, 4)'

'Test target feature: (1455,)'

To verify the posed question, let's train the model on the original data, check its quality with the R2 metric. Then, we'll create an invertible matrix using the numpy library, multiply our data by it, and check the R2 metric again.

In [9]:
model = LinearRegressionCust()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

r2 = r2_score_cust(y_test, predictions)
print("R2 score:", r2)

R2 score: 0.43287552621918113


So, the R2 score of our linear regression model is 0.43287552621918113. Now, let's check the same with the data that has been multiplied by the invertible matrix.

In [10]:
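num_features = df.shape[1] - 1
A = np.random.rand(num_features, num_features)

# Check invertibility before inverting: np.linalg.inv raises LinAlgError
# on a singular matrix, so the check has to come first.
if np.linalg.det(A) != 0:
    print("Matrix A is invertible.")
    A_inv = np.linalg.inv(A)
else:
    print("Matrix A is not invertible. Try another matrix.")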

df_transformed = df.drop('Страховые выплаты', axis=1).dot(A)
df_transformed['Страховые выплаты'] = df['Страховые выплаты']

display(df_transformed)

Matrix A is invertible.


Unnamed: 0,0,1,2,3,Страховые выплаты
0,46739.378299,21549.527156,26429.612344,20140.565561,0
1,35812.299536,16510.999812,20250.802812,15430.394484,1
2,19791.607738,9124.757655,11191.775799,8527.436798,0
3,39292.096168,18116.004532,22216.979739,16931.415068,0
4,24596.122504,11340.176506,13909.058801,10598.836923,0
...,...,...,...,...,...
4995,33641.367814,15510.369534,19022.385842,14495.798246,0
4996,49375.308910,22764.975561,27918.787908,21276.124091,0
4997,31943.495273,14727.728471,18061.988209,13764.588769,0
4998,30814.218083,14206.964258,17424.390778,13278.496196,0


In [11]:
feature_transformed = df_transformed.drop('Страховые выплаты', axis=1)
target_transformed = df_transformed['Страховые выплаты']

X_train, X_test, y_train, y_test = train_test_split(feature_transformed, target_transformed, \
                                                    test_size=0.3, random_state=12345)

model = LinearRegressionCust()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r2 = r2_score_cust(y_test, y_pred)
print("R2 score after multiplying data by an invertible matrix:", r2)

R2 score after multiplying data by an invertible matrix: 0.43287539941220443


_Conclusion_: The two R2 values agree to six decimal places (0.4328755 vs. 0.4328754); the remaining difference is consistent with floating-point rounding in the explicit matrix inversion. Therefore, the answer to the question of whether the quality of linear regression changes when the features are multiplied by an invertible matrix is no.

Next, let's check whether the data multiplied by an invertible matrix can indeed be restored. To do this, we multiply the transformed features by the inverse matrix A_inv and compare the restored dataframe with the original one.

In [12]:
df_restored = df_transformed.drop('Страховые выплаты', axis=1).dot(A_inv)
df_restored['Страховые выплаты'] = df_transformed['Страховые выплаты']

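# Compare by position (.values): df.dot(A) relabels the columns as 0..3, so the
# label-aligned subtraction df - df_restored would give NaN for every feature column.
original = df.drop('Страховые выплаты', axis=1).values
restored = df_restored.drop('Страховые выплаты', axis=1).values

if np.allclose(original, restored):
    print("Two dataframes are identical, taking into account the margin of error.")
else:
    print("Two dataframes are different.")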

Two dataframes are identical, taking into account the margin of error.


The error is small enough to be neglected: the restored data matches the original up to floating-point rounding.

_Answer_: The quality of linear regression (in our case, the R2 metric) will not change when the training features are multiplied by an invertible matrix.

_Justification_: Linear regression models the relationship between the independent variables (features) and the dependent variable (target) as a linear combination of the features. Its parameters, the weights of the features, are chosen to minimize the MSE between the predicted and the actual values of the target.

When we multiply the features by an invertible matrix, each new feature is a linear combination of the original ones, and the transformation can be undone. The set of predictions that a linear model can produce therefore stays the same, so the information about the relationship between the features and the target is preserved.

As a result, the parameters in the original and in the transformed problem are related by $w' = P^{-1}w$, and the model's quality does not change. The weights are different, but the predictions are the same.

Note that this equality is exact only in theory. In practice the matrix should also be well conditioned: a nearly singular matrix amplifies floating-point rounding errors, which can distort the transformed features and degrade the model's quality.
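One common numerical check for whether a matrix is safe to use in this role is its condition number, available as `np.linalg.cond`. The two matrices below are made-up examples: the first is well behaved, the second is technically invertible but almost singular.

```python
import numpy as np

# A well-conditioned matrix (illustrative example).
P_good = np.array([[2.0, 1.0],
                   [1.0, 3.0]])

# Nearly singular: the two rows are almost identical.
P_bad = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-12]])

cond_good = np.linalg.cond(P_good)
cond_bad = np.linalg.cond(P_bad)

# A small condition number means rounding errors are barely amplified;
# a huge one means the inverse is numerically unreliable.
print(f'cond(P_good) = {cond_good:.3f}')
print(f'cond(P_bad)  = {cond_bad:.3e}')
```

A determinant test such as `det(A) != 0` would accept both matrices, which is why the condition number is the more informative check.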

## Transformation Algorithm

Algorithm:

Transformation of data by multiplying training features by an invertible matrix.

Justification:

The features are multiplied by a randomly generated invertible matrix. As shown in the calculations above, this does not affect the quality of the linear regression model. The transformed values have no direct meaning compared with the original columns and are hard for a person to interpret, so client confidentiality is preserved as long as the matrix is kept secret. For the model, however, the linear relationships between the features and the target are preserved, so it trains and predicts as if it were given the original, untransformed data.
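A minimal sketch of this algorithm, with the matrix treated as a secret key, might look as follows. The function names (`make_key`, `encrypt_features`, `decrypt_features`) and the `max_cond` threshold are assumptions introduced for illustration; the three sample rows are taken from the head of the dataset shown earlier.

```python
import numpy as np

def make_key(n, rng, max_cond=1e3):
    """Draw random n x n matrices until one is safely invertible (hypothetical helper)."""
    while True:
        P = rng.random((n, n))
        # The condition number guards against nearly singular matrices.
        if np.linalg.cond(P) < max_cond:
            return P

def encrypt_features(X, P):
    """Transform the features: X' = XP."""
    return X @ P

def decrypt_features(X_enc, P):
    """Restore the features: X = X'P^{-1}. Requires the key matrix P."""
    return X_enc @ np.linalg.inv(P)

rng = np.random.default_rng(42)

# First three feature rows of the dataset: gender, age, salary, family members.
X = np.array([[1, 41.0, 49600.0, 1],
              [0, 46.0, 38000.0, 1],
              [0, 29.0, 21000.0, 0]])

P = make_key(X.shape[1], rng)
X_enc = encrypt_features(X, P)
X_dec = decrypt_features(X_enc, P)

print(X_enc.round(2))
print(X_dec.round(2))
```

Returning the key separately from the transformed data keeps the two apart: the transformed features can be shared for modeling, while restoring the originals requires `P`.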

## Algorithm Verification

Let's create a function to transform data by multiplying it by an invertible matrix.

In [13]:
def Transform(df):
    num_features = df.shape[1] - 1
    A = np.random.rand(num_features, num_features)

    # Only the determinant check is needed here: the inverse was computed
    # but never used, and np.linalg.inv raises on a singular matrix.
    if np.linalg.det(A) != 0:
        print("Matrix A is invertible.")
    else:
        print("Matrix A is not invertible. Try another matrix.")

    df_transformed = df.drop('Страховые выплаты', axis=1).dot(A)
    df_transformed['Страховые выплаты'] = df['Страховые выплаты']
    
    return df_transformed

Let's test our algorithm on linear regression from the sklearn library.

We will create two types of data: original and transformed.

In [14]:
df_transformed = Transform(df)
display(df)
display(df_transformed)

Matrix A is invertible.


Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0
...,...,...,...,...,...
4995,0,28.0,35700.0,2,0
4996,0,34.0,52400.0,1,0
4997,0,20.0,33900.0,2,0
4998,1,22.0,32700.0,3,0


Unnamed: 0,0,1,2,3,Страховые выплаты
0,46527.692309,1501.450797,12035.332279,29026.206072,0
1,35647.932171,1157.287443,9232.717498,22247.034364,1
2,19700.157633,641.181855,5105.167105,12296.573753,0
3,39115.650434,1255.625985,10106.634157,24394.134194,0
4,24484.056226,793.300073,6338.790717,15278.190464,0
...,...,...,...,...,...
4995,33489.161748,1079.988999,8661.184350,20890.872474,0
4996,49152.173180,1581.189509,12705.932462,30658.016831,0
4997,31799.809216,1022.304085,8218.847780,19833.256704,0
4998,30676.218540,988.108346,7931.376977,19133.937224,0


In [15]:
features_train_s = df.drop('Страховые выплаты', axis=1)
target_s = df['Страховые выплаты']

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(features_train_s, target_s, \
                                                            test_size=0.3, random_state=12345)

features_train_t = df_transformed.drop('Страховые выплаты', axis=1)
target_t = df_transformed['Страховые выплаты']

X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(features_train_t, target_t, \
                                                            test_size=0.3, random_state=12345)

In [16]:
model_s = LinearRegression()
model_s.fit(X_train_s, y_train_s)

y_pred_s = model_s.predict(X_test_s)

r2 = r2_score_cust(y_test_s, y_pred_s)
print("R2 score for the original data:", r2)

model_t = LinearRegression()
model_t.fit(X_train_t, y_train_t)

y_pred_t = model_t.predict(X_test_t)

r2 = r2_score_cust(y_test_t, y_pred_t)
print("R2 score for the transformed data:", r2)

R2 score for the original data: 0.4328755262191937
R2 score for the transformed data: 0.43287552621917313


_Conclusion_: With sklearn's linear regression, the R2 values on the original and on the transformed data agree to about 13 decimal places, so multiplying the features by an invertible matrix left the model's quality unchanged. This transformation method can therefore be considered effective.

__Overall Conclusion:__ 
- During the data loading stage, 153 duplicates were identified and removed from the dataset. The data quality check found no missing values and no obvious outliers. The features are practically uncorrelated with one another; the only notable correlation with the target is age (about 0.65).

- While developing and verifying the data-protection algorithm, a custom linear regression class and a custom R2 function were implemented. The data was split into training and test sets, and the model trained on the original features scored R2 = 0.43287552621918113. After the features were multiplied by an invertible matrix and the model was retrained, the score was R2 = 0.43287539941220443. The two values agree to six decimal places, so the model quality did not change. A crucial condition is that the matrix must be invertible; this check is included in the transformation function. The justification is as follows: linear regression models the target as a linear combination of the features, with weights chosen to minimize the MSE between the predicted and the actual values. Multiplying the features by an invertible matrix $P$ replaces them with new linear combinations but preserves the information about their relationship with the target. The weights change to $w' = P^{-1}w$, and the predictions stay the same. Finally, the transformed data was multiplied by the inverse matrix, and the recovered data showed no critical differences from the original.

- At the algorithm verification stage, linear regression from sklearn was used, with R2 computed by the custom function. A function that transforms the features by multiplying them by an invertible matrix, including an invertibility check, was also implemented. Training the sklearn model on both the original and the transformed data showed that multiplying the data by an invertible matrix does not change the model quality: for the original data R2 = 0.4328755262191937, and for the transformed data R2 = 0.43287552621917313.

Based on all the checks, multiplying the features by an invertible matrix can be recommended for preserving confidentiality. It retains all the linear relationships that the linear regression model relies on during training, while turning confidential data into a form that a person cannot interpret without the matrix.