The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('/datasets/insurance_us.csv')
df.head()

Unnamed: 0,Gender,Age,Salary,Family members,Insurance benefits
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [3]:
df.describe()

Unnamed: 0,Gender,Age,Salary,Family members,Insurance benefits
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
features = df.drop(columns=['Insurance benefits'])
targets = df['Insurance benefits']
print(len(features), len(targets))

5000 5000


In [6]:
train_X, test_X, train_y, test_y = train_test_split(features, targets, shuffle=True, random_state=123345)
print(len(train_X), len(test_X), len(train_y), len(test_y))

3750 1250 3750 1250


#### The data contains no missing or negative values
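The claim above can be checked directly rather than read off `df.describe()` and `df.info()`. The sketch below is illustrative only: `sample` is a hypothetical three-row frame built from the first rows shown by `df.head()`, and the variable names are assumptions. The same two expressions could be applied to `df` itself.

```python
import pandas as pd

# Hypothetical stand-in for df: the first three rows shown by df.head().
sample = pd.DataFrame({
    'Gender': [1, 0, 0],
    'Age': [41.0, 46.0, 29.0],
    'Salary': [49600.0, 38000.0, 21000.0],
    'Family members': [1, 1, 0],
    'Insurance benefits': [0, 1, 0],
})

# Count missing values and negative values in every column.
missing_per_column = sample.isna().sum()
negative_per_column = (sample < 0).sum()

# True if any column has at least one missing / negative value.
has_missing = bool(missing_per_column.any())
has_negative = bool(negative_per_column.any())
```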

## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline, frame it with dollar signs \$; to set it apart from the text, use double dollar signs \$\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$ — feature matrix (the zeroth column consists of ones)

- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — linear regression weight vector (the zeroth element is the bias term)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
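As a minimal sketch of the training formula, the snippet below computes $w$ with NumPy on a small synthetic dataset. The data, the seed, and the names `X`, `y`, `true_w` are assumptions made for illustration and are not the insurance data. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $w$ when $X^T X$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only

# Synthetic features with a leading column of ones for the bias term.
n_samples, n_features = 100, 3
raw = rng.normal(size=(n_samples, n_features))
X = np.hstack([np.ones((n_samples, 1)), raw])

# Synthetic target generated from known weights plus a little noise.
true_w = np.array([1.0, 2.0, -0.5, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=n_samples)

# Training formula w = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions a = Xw.
a = X @ w
```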

**Answer:** The predictions do not change. The predictions $a'$ obtained after multiplying the features by an invertible matrix $P$ are the same as the predictions $a$ obtained without the multiplication.

**Justification:** Replace $X$ with $XP$ in the training formula and check that the predictions stay the same. The new weight vector $w'$ is related to the original one by



$$
w' = P^{-1} w
$$
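Substituting this into the equations

$$
a' = X'w', \quad X' = XP
$$
gives

$$
a' = XPP^{-1}w = Xw
$$
so

$$
a = a'
$$

The relation $w' = P^{-1} w$ follows from the training formula, assuming $P$ and $X^T X$ are invertible: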

$$
w = (X^T X)^{-1} X^T y
$$

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

$$
w' = (P^T X^T X P)^{-1} P^T X^T y
$$

$$
w' = (X^T X P)^{-1} (P^T)^{-1} P^T X^T y = (X^T X P)^{-1} X^T y
$$

$$
w' = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w
$$

$$
a' = X'w' = XPP^{-1}w = Xw = a
$$
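The derivation can also be checked numerically. The sketch below is an illustration on synthetic data, not part of the proof. The data, the seed, and the helper `fit` are assumptions. It fits the weights before and after multiplying by a random invertible $P$ and compares them against $w' = P^{-1} w$ and $a' = a$.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed

# Synthetic feature matrix with a leading column of ones, and a random target.
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.normal(size=50)

# Random square matrix; full rank means it is invertible.
P = rng.random((4, 4))
assert np.linalg.matrix_rank(P) == 4

def fit(X, y):
    """Training formula w = (X^T X)^{-1} X^T y, solved as a linear system."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit(X, y)            # weights on the original features
w_prime = fit(X @ P, y)  # weights on the transformed features

# w' = P^{-1} w and a' = a, up to floating-point rounding.
weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ w, (X @ P) @ w_prime)
```

Both flags are compared with `np.allclose` rather than `==` because the two computations round differently in floating point.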

## 3. Transformation algorithm

**Algorithm:** Multiply the feature matrix by any invertible matrix $P$.


In [7]:
def Transformation(X):
    # Step 1: draw an arbitrary square matrix P, one row and column per feature.
    n_features = X.shape[1]
    while True:
        P = np.random.rand(n_features, n_features)
        try:
            # Step 2: np.linalg.inv raises LinAlgError if P is singular.
            # In that case go back to step 1 and draw a new P.
            np.linalg.inv(P)
            break
        except np.linalg.LinAlgError:
            continue
    # Step 3: multiply the feature matrix X by the matrix P.
    return X @ P

In [8]:
# Let's define an arbitrary matrix
X = np.random.rand(5,2)
X

array([[0.23530086, 0.05890506],
       [0.70399409, 0.60418652],
       [0.83823441, 0.75071496],
       [0.64834822, 0.36551665],
       [0.06758834, 0.5627006 ]])

In [9]:
Transformation(X)

array([[0.08637608, 0.15745216],
       [0.2793202 , 0.57792687],
       [0.3341111 , 0.69594792],
       [0.24792145, 0.48457984],
       [0.05145596, 0.18149373]])

**Justification:**
1. We generate an arbitrary square matrix $P$ whose size equals the number of features.
2. We check whether $P$ is invertible. If it is not, we generate another matrix and repeat until an invertible one is found. For example, the zero matrix is not invertible, so it cannot be used for the transformation.
3. We multiply the feature matrix $X$ by $P$.


$P$ must be invertible; otherwise the derivation in Section 2, which relies on $P^{-1}$, does not hold.
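Invertibility also means that whoever holds $P$ can undo the transformation, since $X'P^{-1} = XPP^{-1} = X$. The sketch below illustrates this with a hypothetical variant, `transform_with_key`, which is not the notebook's `Transformation` function. It returns $P$ together with the transformed features and checks invertibility through the matrix rank. The function name, the seed, and the data are assumptions.

```python
import numpy as np

def transform_with_key(X, rng):
    """Hypothetical variant: return both X @ P and the matrix P itself."""
    n_features = X.shape[1]
    while True:
        P = rng.random((n_features, n_features))
        # A square matrix is invertible exactly when it has full rank.
        if np.linalg.matrix_rank(P) == n_features:
            return X @ P, P

rng = np.random.default_rng(42)  # hypothetical seed
X = rng.random((5, 2))           # arbitrary small feature matrix

X_transformed, P = transform_with_key(X, rng)

# Whoever holds P can recover the original features: X = X' P^{-1}.
X_recovered = X_transformed @ np.linalg.inv(P)
```

Keeping $P$ is a design choice: it lets the same matrix be applied to new data later, but it also acts as the key to the original values and has to be stored accordingly.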

## 4. Algorithm test

Let's compute the predictions for the original features.

In [10]:
reg = LinearRegression().fit(train_X, train_y)
a = reg.predict(test_X)

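We will multiply the feature matrix by a random invertible matrix P of shape 4 × 4. The same P must be applied to the training and test features, so the random seed is reset before each call to make both calls draw the same matrix. The transformed model is then evaluated on the transformed test features.

In [11]:
np.random.seed(12345)
mod_train_X = Transformation(train_X)
np.random.seed(12345)
mod_test_X = Transformation(test_X)

In [12]:
mod_reg = LinearRegression().fit(mod_train_X, train_y)
a_desh = mod_reg.predict(mod_test_X)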

In [13]:
r2_score(test_y, a)

0.38898265234717

In [14]:
r2_score(test_y, a_desh)

0.38898265234717

We get the same R2 score with both approaches. This shows that the quality of linear regression is not affected by multiplying the features by a matrix, as long as the matrix is invertible.
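One detail differs between the derivation and the test. In Section 2 the column of ones is part of $X$, whereas `LinearRegression` fits the intercept separately by default and $P$ multiplies only the features. The sketch below suggests the conclusion still holds in that setting. It uses synthetic data and plain NumPy least squares, and the data, the seed, and the helpers `fit_predict` and `r2` are assumptions. The intercept column is added after the transformation, and the same $P$ is applied to the training and test features.

```python
import numpy as np

rng = np.random.default_rng(7)  # hypothetical seed

# Synthetic train/test data with 4 features (not the insurance data).
X_train, X_test = rng.normal(size=(200, 4)), rng.normal(size=(50, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.0])
y_train = 3.0 + X_train @ true_w + 0.1 * rng.normal(size=200)
y_test = 3.0 + X_test @ true_w + 0.1 * rng.normal(size=50)

# Random square matrix; full rank means it is invertible.
P = rng.random((4, 4))
assert np.linalg.matrix_rank(P) == 4

def add_ones(X):
    """Prepend a column of ones so the model has an intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_predict(X_tr, y_tr, X_te):
    """Least-squares fit with an intercept, then predict on X_te."""
    w, *_ = np.linalg.lstsq(add_ones(X_tr), y_tr, rcond=None)
    return add_ones(X_te) @ w

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

a = fit_predict(X_train, y_train, X_test)
a_prime = fit_predict(X_train @ P, y_train, X_test @ P)

r2_original = r2(y_test, a)
r2_transformed = r2(y_test, a_prime)
```

The two scores agree up to floating-point rounding, so they are best compared with a tolerance rather than with exact equality.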

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  Step 1 performed: the data was downloaded
- [x]  Step 2 performed: the answer to the matrix multiplication problem was provided
    - [x]  The correct answer was chosen
    - [x]  The choice was justified
- [x]  Step 3 performed: the transform algorithm was proposed
    - [x]  The algorithm was described
    - [x]  The algorithm was justified
- [x]  Step 4 performed: the algorithm was tested
    - [x]  The algorithm was realized
    - [x]  Model quality was assessed before and after the transformation