# Linear Algebra project

The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [2]:
data = pd.read_csv('/datasets/insurance_us.csv')
print('-' * 100)
display(data.info())
print('-' * 100)
display(data.describe())
print('-' * 100)
display(data.head())
print('-' * 100)
display(data.tail())
print('Duplicated rows:',data.duplicated().sum())

----------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


None

----------------------------------------------------------------------------------------------------


| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


----------------------------------------------------------------------------------------------------


| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


----------------------------------------------------------------------------------------------------


| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 4995 | 0 | 28.0 | 35700.0 | 2 | 0 |
| 4996 | 0 | 34.0 | 52400.0 | 1 | 0 |
| 4997 | 0 | 20.0 | 33900.0 | 2 | 0 |
| 4998 | 1 | 22.0 | 32700.0 | 3 | 0 |
| 4999 | 1 | 28.0 | 40600.0 | 1 | 0 |


Duplicated rows: 153


## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline with the text, frame it with single dollar signs \\$; to put it on its own line, use double dollar signs \\$\\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the shift)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
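The training formula can be checked numerically. The following is a minimal sketch on synthetic data (the array sizes, the seed, and the names `features`, `true_w`, and `w_lstsq` are assumptions made for illustration, not part of the insurance dataset). It computes $w$ from the formula above and compares it with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not the insurance dataset): 100 rows, 3 features
features = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5, 3.0])  # [shift, w1, w2, w3]

# Prepend a column of ones so that w[0] plays the role of the shift
X = np.hstack([np.ones((features.shape[0], 1)), features])
y = X @ true_w + rng.normal(scale=0.1, size=X.shape[0])

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions: a = Xw
a = X @ w

print("weights from the formula:", w)
print("weights from lstsq:      ", w_lstsq)
```

In practice a solver such as `np.linalg.lstsq` is usually preferred over an explicit inverse for numerical stability; the explicit form is used here only to mirror the formula.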

============ Description ============

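$$
a = Xw + w_0
$$

This is the linear regression formula with the shift $w_0$ written out separately; when the zeroth column of $X$ consists of ones, the shift is absorbed into $w$ and the formula reduces to $a = Xw$. Now let's look at the formula for the weights $w$: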

$$
w = (X^T X)^{-1} X^T y
$$

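This is how the weights are calculated.

$$
(X^T X)^T = X^T X
$$

That is, $X^T X$ is a symmetric matrix: it equals its own transpose. It is also square, so it can be inverted whenever the columns of $X$ are linearly independent.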

$P$ is a square invertible matrix, so $P^{-1}$ exists.

**Answer:** The quality of linear regression does not change.

Multiplying $X$ by $P$ gives the transformed feature matrix $X'$:
$$
X ^{'} = XP
$$

And our prediction equation becomes:
$$
a^{'} = X^{'} w^{'} = X P w^{'}
$$

Substituting $X^{'} = XP$ into the training formula gives:

$$
w^{'} = (X^{'T} X^{'})^{-1} X^{'T} y
$$

Since $(XP)^{T} = P^{T} X^{T}$:

$$
w^{'} = (P^{T} X^{T} X P)^{-1} P^{T} X^{T} y
$$

**Justification:**

Now let's substitute the training formula into the prediction equation:

$$
a^{'} = X^{'} w^{'}
$$

Then:

$$
a^{'} = X P (P^{T} X^{T} X P)^{-1} P^{T} X^{T} y
$$

$P^{T}$, $X^{T} X$ and $P$ are all square and invertible ($X^{T} X$ because the training formula already assumes it), so the inverse of their product is the product of their inverses in reverse order: $(P^{T} X^{T} X P)^{-1} = P^{-1} (X^{T} X)^{-1} (P^{T})^{-1}$. Regrouping the factors gives:

$$
a^{'} = X (P P^{-1}) (X^{T} X)^{-1} (P^{T})^{-1} P^{T} X^{T} y
$$

We know that $P P^{-1} = E$, where $E$ is the identity matrix, and that $(P^{T})^{-1} = (P^{-1})^{T}$, so $(P^{T})^{-1} P^{T} = (P^{-1})^{T} P^{T} = (P P^{-1})^{T}$. Multiplying any matrix by the identity matrix leaves it unchanged. Applying this to the formula above:

$$
a^{'} = X (P P^{-1}) (X^{T} X)^{-1} (P P^{-1})^{T} X^{T} y
$$

$$
a^{'} = X (X^{T} X)^{-1} X^{T} y
$$

The right-hand side is $Xw = a$, so:

$$
a^{'} = a
$$

So the quality of the model remains the same for the transformed features.

The weights of the two models are related by:

$$
w^{'} = P^{-1} w
$$
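Both results of this section can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `fit` that applies the training formula directly; the matrix sizes and the seed are arbitrary assumptions. It verifies that the predictions are unchanged and that the new weights equal $P^{-1} w$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic feature matrix with a leading column of ones (hypothetical data)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 4))])
y = rng.normal(size=50)

# Random square matrix; assumed invertible (true with probability close to 1)
P = rng.normal(size=(X.shape[1], X.shape[1]))


def fit(X, y):
    """Hypothetical helper: weights from w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(X.T @ X) @ X.T @ y


w = fit(X, y)          # weights on the original features
X_new = X @ P          # transformed features X' = XP
w_new = fit(X_new, y)  # weights on the transformed features

a = X @ w              # original predictions
a_new = X_new @ w_new  # predictions after the transformation

predictions_match = np.allclose(a, a_new)
weights_match = np.allclose(w_new, np.linalg.inv(P) @ w)

print("a' == a:          ", predictions_match)
print("w' == P^{-1} w:   ", weights_match)
```

The comparison uses `np.allclose` rather than exact equality because the two computations differ by floating-point rounding.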

## 3. Transformation algorithm

**Algorithm**

1. Create a random square matrix whose size equals the number of features:
    - `transform_matrix = np.random.normal(size=(features.shape[1], features.shape[1]))`
2. Check that the matrix is invertible:
    - `np.linalg.inv(transform_matrix)` (raises `LinAlgError` if the matrix is singular)
3. Multiply the feature matrix by the random invertible matrix.
4. Compare the model quality before and after the transformation.

**Justification**

1. The random matrix has shape (number of features) x (number of features), because an A x B matrix multiplied by a B x B matrix is again an A x B matrix.
2. This step checks that the random matrix is invertible. The chance of randomly generating a non-invertible matrix is close to 0.
3. Here the feature matrix is multiplied by the invertible matrix. This process is also known as **data masking** or **data obfuscation**.
4. Here we check whether the `masking` algorithm works correctly, i.e. whether the model quality is preserved.
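The steps above can be sketched as a small helper. This is one possible shape for the algorithm, not the notebook's exact code: the function name `make_invertible_matrix`, the retry limit, and the rank-based invertibility check are assumptions made for illustration. The sample rows are the feature columns of the first five records shown in the data preview.

```python
import numpy as np


def make_invertible_matrix(n, rng, max_tries=10):
    """Draw random n x n matrices until one has full rank (i.e. is invertible)."""
    for _ in range(max_tries):
        candidate = rng.normal(size=(n, n))
        # A square matrix is invertible exactly when its rank equals n
        if np.linalg.matrix_rank(candidate) == n:
            return candidate
    raise RuntimeError("could not generate an invertible matrix")


rng = np.random.default_rng(12345)

# Gender, Age, Salary, Family members for the first five rows of the dataset
features = np.array([
    [1, 41.0, 49600.0, 1],
    [0, 46.0, 38000.0, 1],
    [0, 29.0, 21000.0, 0],
    [0, 21.0, 41700.0, 2],
    [1, 28.0, 26100.0, 0],
])

transform_matrix = make_invertible_matrix(features.shape[1], rng)

# Step 3: mask the features
masked = features @ transform_matrix

# The original values can be restored only with the inverse matrix
recovered = masked @ np.linalg.inv(transform_matrix)

print(masked)
print(np.round(recovered, 6))
```

Anyone who holds `transform_matrix` can undo the masking, so in this scheme the matrix acts as a secret key and should be stored separately from the transformed data.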

## 4. Algorithm test

In [3]:
# Let's split our data into train and test sets
features = data.drop('Insurance benefits',axis=1)
target = data['Insurance benefits']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.25, random_state = 12345)

In [4]:
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

(3750, 4) (1250, 4)
(3750,) (1250,)


In [5]:
# Create random matrix
transform_matrix = np.random.normal(size = (x_train.shape[1],x_train.shape[1]))
transform_matrix

array([[ 0.26102701, -2.50668469, -1.31878333, -2.67179316],
       [-0.03135847, -0.53317037, -0.17298387,  1.53181359],
       [-0.40455515, -0.73510315,  1.13530726,  0.78397582],
       [-0.66499236, -0.04634583,  0.15543759, -0.58376283]])

In [6]:
# Check that the matrix is invertible (np.linalg.inv raises LinAlgError for a singular matrix)
np.linalg.inv(transform_matrix)

array([[ 0.08429321, -0.57220901,  0.22732536, -1.58200201],
       [-0.25340827, -0.21988189, -0.34434658,  0.12038626],
       [-0.06895035, -0.68288211,  0.75920922, -0.45673573],
       [-0.09426328,  0.48745768, -0.02946564, -0.0420617 ]])

In [7]:
x_train_transformed = x_train @ transform_matrix
x_test_transformed = x_test @ transform_matrix

In [8]:
model = LinearRegression()

# Before transformation
model.fit(x_train, y_train)
predict_before = model.predict(x_test)
r2_before = r2_score(y_test, predict_before)
print('R2 score before transformation:', r2_before)

# After transformation
model.fit(x_train_transformed, y_train)
predict_after = model.predict(x_test_transformed)
r2_after = r2_score(y_test, predict_after)
print('R2 score after transformation:', r2_after)
print('Difference between before and after transformation:', 100 * abs(r2_after - r2_before) / r2_before, '%')


R2 score before transformation: 0.435227571270266
R2 score after transformation: 0.43522757127014333
Difference between before and after transformation: 2.818747072089756e-11 %


As we can see, the `R2 score` is almost the same before and after the transformation.
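A stricter check than comparing two R2 values is to compare the predictions element by element. The sketch below does this on synthetic data standing in for the insurance features (the data, the coefficients, and the seed are assumptions made for illustration); it assumes scikit-learn is installed, as in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)

# Synthetic stand-in for the features and target (hypothetical data)
features = rng.normal(size=(1000, 4))
target = features @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.5, size=1000)

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

# Random matrix, assumed invertible
transform_matrix = rng.normal(size=(features.shape[1], features.shape[1]))

# Fit one model on the original features and one on the masked features
predict_before = LinearRegression().fit(x_train, y_train).predict(x_test)
predict_after = (
    LinearRegression()
    .fit(x_train @ transform_matrix, y_train)
    .predict(x_test @ transform_matrix)
)

# Element-wise comparison of predictions, up to floating-point tolerance
predictions_match = np.allclose(predict_before, predict_after)
r2_gap = abs(r2_score(y_test, predict_before) - r2_score(y_test, predict_after))

print("Predictions match:", predictions_match)
print("Absolute R2 difference:", r2_gap)
```

If the predictions agree element-wise, every metric computed from them (R2, MSE, and so on) agrees as well, which makes this a more direct test of $a^{'} = a$.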

###  Conclusion

According to our findings, `r2_score` is the same for the original and transformed datasets up to floating-point error: the relative difference is about 2.8e-11 %, which is negligible. This is consistent with the proof derived above.

## Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  Step 1 performed: the data was downloaded
- [x]  Step 2 performed: the answer to the matrix multiplication problem was provided
    - [x]  The correct answer was chosen
    - [x]  The choice was justified
- [x]  Step 3 performed: the transform algorithm was proposed
    - [x]  The algorithm was described
    - [x]  The algorithm was justified
- [x]  Step 4 performed: the algorithm was tested
    - [x]  The algorithm was implemented
    - [x]  Model quality was assessed before and after the transformation