The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## Table of Contents
1. [Data downloading](#step1)
2. [Multiplication of matrices](#step2)
3. [Transformation algorithm](#step3)
4. [Algorithm test](#step4)
5. [Conclusion](#step5)

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

## 1. Data downloading <a name="step1"></a>

In [2]:
data=pd.read_csv('/datasets/insurance_us.csv') #saves the csv file as a dataframe
data.head() #first 5 rows

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [3]:
data.info() #general info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


There are no missing values - a welcome sight! Next, we check for duplicate rows.

In [4]:
data.duplicated().sum() #tells us the number of duplicate rows

153

There are 153 duplicate rows. We don't need to remove them, because the task is only to check whether a matrix transformation changes model quality.

## 2. Multiplication of matrices <a name="step2"></a>

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline, wrap it in single dollar signs (\$); to set it on its own line, wrap it in double dollar signs (\$\$). These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
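
As a minimal sketch of the training formula, assuming a small synthetic dataset (the shapes, seed, and variable names below are illustrative and not taken from the project data), the weights can be computed directly with NumPy and compared with the least-squares solution from `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Synthetic data (an assumption for illustration): 100 objects, 3 features
features = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), features])  # zeroth column consists of ones
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference least-squares solution for comparison
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

a = X @ w  # predictions: a = Xw
print(np.allclose(w, w_lstsq))
```

The explicit inverse mirrors the formula; in practice a least-squares solver is usually the more numerically stable choice.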

**Answer:** The predictions, and therefore the model quality, do not change, provided that $P$ is invertible.

Multiplying the feature matrix $X$ by the transformation matrix $P$ gives the transformed feature matrix $X'$:

$$
X' = XP
$$

We call the new predictions $a'$ and the new weight vector $w'$. The prediction equation becomes:

$$
a' = X'w' = XPw'
$$

Using $(XP)^T = P^T X^T$, the new weight vector is:

$$
w' = (X'^T X')^{-1} X'^T y = (P^T X^T X P)^{-1} P^T X^T y
$$

**Justification:**

Substituting $w'$ into $a'$:

$$
a' = XP (P^T X^T X P)^{-1} P^T X^T y
$$

$P$ is square and invertible, and $X^T X$ is invertible whenever the training formula is defined, so the inverse of the product splits as $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$:

$$
a' = X (P P^{-1}) (X^T X)^{-1} ((P^T)^{-1} P^T) X^T y
$$

Since a matrix multiplied by its inverse gives the identity matrix, this simplifies to:

$$
a' = X (X^T X)^{-1} X^T y = Xw = a
$$

This proves that the predictions do not change, so the model quality does not suffer.

The relation between $w'$ and $w$:

$$
w' = (P^T X^T X P)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w
$$
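
The derivation can be checked numerically. The sketch below assumes synthetic data and a hypothetical helper `fit_weights` that applies the training formula; it is an illustration, not part of the project code. It checks that the predictions match and that the new weights equal $P^{-1}w$:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Synthetic stand-in data: 200 objects, 4 features plus a column of ones
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 4))])
y = rng.normal(size=200)

# Random square matrix; singular only with probability zero
P = rng.normal(size=(X.shape[1], X.shape[1]))


def fit_weights(features, target):
    """Hypothetical helper: closed-form weights w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target


w = fit_weights(X, y)
w_prime = fit_weights(X @ P, y)

a = X @ w                  # predictions on the original features
a_prime = (X @ P) @ w_prime  # predictions on the transformed features

print(np.allclose(a, a_prime))
print(np.allclose(w_prime, np.linalg.inv(P) @ w))
```

Both comparisons use a floating-point tolerance, since the two computations agree exactly only in exact arithmetic.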

## 3. Transformation algorithm <a name="step3"></a>

**Algorithm**
1. Create a random square transformation matrix whose size equals the number of feature columns, e.g. t_matrix=np.random.normal(size=(features_columns, features_columns))

2. Check that the random matrix is invertible by calling np.linalg.inv(t_matrix). If the call raises np.linalg.LinAlgError, the matrix is singular and a new one must be generated, but such a case is very rare.

3. Multiply the feature matrix by the random invertible matrix.

4. Compute the R2 scores of the models trained before and after the transformation.


**Justification**

The claim that data obfuscation does not affect model quality is confirmed empirically if the R2 scores of the models before and after the transformation are the same (or differ only negligibly, within floating-point rounding).
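
The first three steps of the algorithm can be sketched as follows. The helper name `make_invertible_matrix`, the retry limit, and the random stand-in feature matrix are assumptions made for illustration:

```python
import numpy as np


def make_invertible_matrix(n, rng, max_tries=10):
    """Hypothetical helper: return a random n x n matrix and its inverse.

    Draws a new matrix whenever np.linalg.inv reports a singular one.
    """
    for _ in range(max_tries):
        candidate = rng.normal(size=(n, n))
        try:
            inverse = np.linalg.inv(candidate)
        except np.linalg.LinAlgError:
            continue  # singular matrix, draw again
        return candidate, inverse
    raise RuntimeError("could not generate an invertible matrix")


rng = np.random.default_rng(12345)  # arbitrary seed
features = rng.normal(size=(6, 4))  # stand-in for the real feature matrix

# Steps 1 and 2: random matrix with a verified inverse
t_matrix, t_inverse = make_invertible_matrix(features.shape[1], rng)

# Step 3: obfuscated features
transformed = features @ t_matrix
```

Note that `np.linalg.inv` raises only for an exactly singular matrix; a nearly singular one can pass, so checking `np.linalg.cond(t_matrix)` as well is an option.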

## 4. Algorithm test <a name="step4"></a>

First, we split the data into features and target, and then into training and test sets.

In [5]:
features=data.drop('Insurance benefits', axis=1) 
#defines the features as all columns except 'Insurance benefits'

target=data['Insurance benefits'] #defines the target as the 'Insurance benefits' column
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=12345)
#splits our data into training and test sets with the test set making up 20% of the data

We can now create our transformation matrix whose number of rows and number of columns each equal the number of our features

In [6]:
t_matrix=np.random.normal(size=(x_train.shape[1], x_train.shape[1]))
#creates a random matrix whose number of rows and columns = the number of columns of our data features
t_matrix

array([[ 0.37919105, -1.78908388,  0.65317713,  1.23380321],
       [-1.40882454,  1.44913229, -0.17634509,  1.20112431],
       [-0.83140365, -0.50702834,  1.29321238, -0.98499947],
       [-1.42193691,  0.41360468, -0.27213911,  0.25784954]])

Let us check if our random transformation matrix is invertible by calculating its inverse

In [7]:
np.linalg.inv(t_matrix) #calculates the inverse of the matrix

array([[-0.04866213,  0.11500852, -0.1197118 , -0.76019564],
       [-0.30724436,  0.52985255,  0.08879559, -0.65881865],
       [ 0.09808786,  0.59858934,  0.61088707, -0.92409659],
       [ 0.3280082 ,  0.41607574, -0.15785423, -0.23247118]])

Our matrix is invertible so we can go on and transform our data by performing a matrix multiplication of the features by the transformation matrix

In [8]:
x_train_trans=x_train@t_matrix #gets the transformed version of the training features
x_test_trans=x_test@t_matrix #gets the transformed version of the test features
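
The transformation matrix acts as a key: whoever holds it can restore the original values by multiplying by its inverse. The sketch below assumes a toy frame built from the first rows shown by `data.head()` and a freshly drawn matrix named `key` (a hypothetical name standing in for `t_matrix`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed

# Toy rows with the same feature columns as the dataset
toy = pd.DataFrame({
    "Gender": [1, 0, 0],
    "Age": [41.0, 46.0, 29.0],
    "Salary": [49600.0, 38000.0, 21000.0],
    "Family members": [1, 1, 0],
})

key = rng.normal(size=(toy.shape[1], toy.shape[1]))  # plays the role of t_matrix
obfuscated = toy.to_numpy() @ key                    # transformed features
restored = obfuscated @ np.linalg.inv(key)           # undo the transformation

# The tolerance allows for floating-point rounding in the round trip
print(np.allclose(restored, toy.to_numpy(), atol=1e-6))
```

Without the matrix, each obfuscated column is a mix of all the original columns, which is what makes the personal values hard to read off directly.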

We can now train our Linear Regression model and compare the r2 scores for the models before and after transformation

In [9]:
model=LinearRegression() #Linear Regression model
#Before transformation
model.fit(x_train, y_train) #trains the model with the original training features and target
pred1=model.predict(x_test) #gets predictions for the original test features
r2_before=r2_score(y_test, pred1) #calculates the r2 score
rmse_before=(mean_squared_error(y_test, pred1))**0.5

#After transformation
model.fit(x_train_trans, y_train) #trains the model with the transformed training features and target
pred2=model.predict(x_test_trans) #gets predictions for the transformed test features
r2_after=r2_score(y_test, pred2) #calculates the r2 score
rmse_after=(mean_squared_error(y_test, pred2))**0.5

print('-----Before transformation-----')
print('R2 score: {:.2f}, RMSE: {:.2f}'.format(r2_before, rmse_before))
print('-----After transformation-----')
print('R2 score: {:.2f}, RMSE: {:.2f}'.format(r2_after, rmse_after))

-----Before transformation-----
R2 score: 0.41, RMSE: 0.33
-----After transformation-----
R2 score: 0.41, RMSE: 0.33


To two decimal places, the R2 scores and RMSEs are identical before and after the transformation, so obfuscation did not affect model quality.
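
Rounded metrics can hide small differences, so it can also help to compare the predictions themselves. The sketch below uses synthetic data rather than the insurance dataset (the shapes, coefficients, and seed are assumptions) and reports the largest absolute difference between the two sets of predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2024)  # arbitrary seed

# Synthetic stand-in data: 500 objects, 4 features
X = rng.normal(size=(500, 4))
y = X @ np.array([0.3, -1.2, 0.8, 2.0]) + rng.normal(scale=0.5, size=500)
P = rng.normal(size=(4, 4))  # random transformation matrix

X_train, X_test = X[:400], X[400:]
y_train = y[:400]

# Fit one model on the original features and one on the transformed features
pred_before = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_after = LinearRegression().fit(X_train @ P, y_train).predict(X_test @ P)

max_diff = np.max(np.abs(pred_before - pred_after))
print(f"largest absolute difference: {max_diff:.2e}")
```

A difference at the level of floating-point rounding indicates that the two models make the same predictions, which is a stronger check than matching rounded scores.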

## 5. Conclusion <a name="step5"></a>
1. We proved theoretically that multiplying the features by an invertible matrix doesn't change the predictions of linear regression
2. We trained Linear Regression models for the scenario without transformation and with transformation
3. We got the R2 scores and RMSEs for both scenarios and saw that they match to two decimal places, confirming that the transformation doesn't affect model quality

<div class="alert alert-success" role="alert">
Reviewer's comment v. 1:
    
Yes, you proved this algorithm :)
</div>
