
## Review

Hi Doron. Soslan here again :). As always, I've added all my comments in new cells with different colors.

<div class="alert alert-success" role="alert">
  If you did something great, I use green for my comment.
</div>

<div class="alert alert-warning" role="alert">
If I want to give you advice or think that something can be improved, then I'll use yellow. This is an optional recommendation.
</div>

<div class="alert alert-danger" role="alert">
  If the topic requires some extra work before I can accept it, the color will be red.
</div>

I like your project. Correct, compact and clean. So I'm accepting it. Good work.

---

In [11]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:11,.2f}'.format
np.set_printoptions(precision=2)

## Project Description
The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transforming algorithm that would make it hard to recover personal information from the transformed data. This is called data masking, or data obfuscation. You are also expected to prove that the algorithm works correctly. Additionally, the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model. Follow these steps to develop a new algorithm:
- construct a theoretical proof using properties of models and the given task;
- formulate an algorithm for this proof;
- check that the algorithm is working correctly when applied to real data.

We will use a simple method of data masking, based on an invertible matrix.
## Instructions
1. Download and look into the data.
2. Provide a theoretical proof based on the equation of linear regression. The features are multiplied by an invertible matrix. Show that the quality of the model is the same for both sets of parameters: the original features and the features after multiplication. How are the weight vectors from MSE minimums for these models related?
3. State an algorithm for data transformation to solve the task. Explain why the linear regression quality won't change based on the proof above.
4. Program your algorithm using matrix operations. Make sure that the quality of linear regression from sklearn is the same before and after transformation. Use the R2 metric.


# 1. Downloading and looking at the data

In [12]:
data = pd.read_csv('https://raw.githubusercontent.com/dnevo/Practicum/master/datasets/insurance_us.csv')
data.head()

|   | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [13]:
data.describe()

|   | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.5 | 30.95 | 39916.36 | 1.19 | 0.15 |
| std | 0.5 | 8.44 | 9900.08 | 1.09 | 0.46 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [14]:
data['Insurance benefits'].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: Insurance benefits, dtype: int64

As shown above, the target distribution is imbalanced: in about 89% of the examples (4436 of 5000), Insurance benefits is zero.
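
The class shares can be computed directly from the counts printed above. This is a minimal sketch in which the counts are copied by hand into a `pd.Series`. In the notebook itself, `data['Insurance benefits'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 4436, 1: 423, 2: 115, 3: 18, 4: 7, 5: 1})

# Fraction of examples that fall into each class
shares = counts / counts.sum()
print(shares.round(4))
```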

In [15]:
features = data.drop('Insurance benefits', axis=1)
target = data['Insurance benefits']

<div class="alert alert-success" role="alert">
Nice start</div>

# 2. Effect of Feature transformation on Regression quality

**Let's recall how the prediction (a) is computed using the normal equation:**
<br>
> $\mathrm a = \mathrm X \mathrm w = \mathrm X (\mathrm X^T \mathrm X)^{-1} \mathrm X^T \mathrm y$

**Now, let's see what happens after the transformation:**
<br>
>$\widetilde{\mathrm X}$ (the transformed features matrix) is the result of  ${\mathrm X}$ multiplied by an invertible matrix ${\mathrm A}$:

> $\widetilde{\mathrm X}={\mathrm X}{\mathrm A}$

> $\widetilde{\mathrm a} = \widetilde{\mathrm X} \widetilde{\mathrm w} = \widetilde{\mathrm X} (\widetilde{\mathrm X}^T \widetilde{\mathrm X})^{-1} \widetilde{\mathrm X}^T \mathrm y$

**Using different examples of ${\mathrm A}$, we can show that ${\mathrm a}$ (the prediction) stays the same - i.e.:**

> $\widetilde{\mathrm a}={\mathrm a}$

In [16]:
def calc_predict(features, target):
    # Prepend a column of ones so that the first weight acts as the intercept
    X = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)
    y = target
    # Normal equation: a = X (X^T X)^{-1} X^T y
    return np.dot(X, np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y))

a = calc_predict(features, target)
for i in range(10):
    np.random.seed(i)
    # A random Gaussian matrix is invertible with probability 1
    A = np.random.normal(size=(features.shape[1], features.shape[1]))
    a_trans = calc_predict(features @ A, target)
    print(f'A({i}) - MSE: {mean_squared_error(a, a_trans):.15f}')

A(0) - MSE: 0.000000000000001
A(1) - MSE: 0.000000000000240
A(2) - MSE: 0.000000000000001
A(3) - MSE: 0.000000000000000
A(4) - MSE: 0.000000000000002
A(5) - MSE: 0.000000000000000
A(6) - MSE: 0.000000000000012
A(7) - MSE: 0.000000000001119
A(8) - MSE: 0.000000000000006
A(9) - MSE: 0.000000000001131


**As shown above, across 10 different matrices A, a_trans equals a up to floating-point error (the MSE between them never exceeds about 1.2e-12)**
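
The same result can also be derived analytically. The sketch below assumes that $\mathrm X^T \mathrm X$ is invertible (i.e. $\mathrm X$ has full column rank) and that $\mathrm A$ is square and invertible. Substituting $\widetilde{\mathrm X} = \mathrm X \mathrm A$ into the normal equation gives the relation between the weight vectors:

> $\widetilde{\mathrm w} = ((\mathrm X \mathrm A)^T \mathrm X \mathrm A)^{-1} (\mathrm X \mathrm A)^T \mathrm y = (\mathrm A^T \mathrm X^T \mathrm X \mathrm A)^{-1} \mathrm A^T \mathrm X^T \mathrm y = \mathrm A^{-1} (\mathrm X^T \mathrm X)^{-1} (\mathrm A^T)^{-1} \mathrm A^T \mathrm X^T \mathrm y = \mathrm A^{-1} \mathrm w$

> $\widetilde{\mathrm a} = \widetilde{\mathrm X} \widetilde{\mathrm w} = \mathrm X \mathrm A \mathrm A^{-1} \mathrm w = \mathrm X \mathrm w = \mathrm a$

When a column of ones is prepended for the intercept, as in `calc_predict`, the transformed design matrix is the original one multiplied by the block-diagonal matrix $\mathrm{diag}(1, \mathrm A)$, which is invertible whenever $\mathrm A$ is, so the same argument applies and the intercept is unchanged.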

<div class="alert alert-success" role="alert">
Although it is provable with pure math, I can accept such proof too :) Nice random testing.</div>

# 3. Algorithm for Linear Regression using masking

1. Feature masking
   - Generate an invertible matrix ($A$) with dimensions $n\times n$ ($n$ is the number of features)
   - Transform the features ($F$) by multiplying them by the invertible matrix: $\widetilde F=FA$
2. Perform linear regression using the transformed features ($\widetilde F$)

The linear regression quality will not change, because the predictions are the same for the original and the transformed features, as shown in the previous section.
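
The masking step above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: `make_invertible_matrix` and its `max_cond` threshold are hypothetical names introduced here, and `F` is a small synthetic stand-in for the real feature matrix. The condition-number check is one possible way to confirm that the random matrix is safely invertible.

```python
import numpy as np

def make_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices until one is well-conditioned (hypothetical helper)."""
    while True:
        A = rng.normal(size=(n, n))
        # A moderate condition number means A can be inverted reliably
        if np.linalg.cond(A) < max_cond:
            return A

rng = np.random.default_rng(12345)

# Synthetic stand-in for the feature matrix F (5 rows, 4 features)
F = rng.normal(size=(5, 4))

A = make_invertible_matrix(F.shape[1], rng)
F_masked = F @ A                            # step 1: mask the features
F_recovered = F_masked @ np.linalg.inv(A)   # unmasking requires knowing A

print(np.allclose(F, F_recovered))
```

Whoever holds `A` can recover the original features, while the masked matrix alone does not reveal them directly.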

<div class="alert alert-success" role="alert">
Correct algorithm</div>

# 4. Algorithm Implementation

## 4.1 Generate Invertible matrix

In [17]:
np.random.seed(12345)
# A random Gaussian matrix is invertible with probability 1 (not checked explicitly here)
A = np.random.normal(size=(features.shape[1],features.shape[1]))

## 4.2 Linear Regression using matrix operations

In [18]:
class LinearRegressionMat:
    def fit(self, train_features, train_target):
        # Prepend a column of ones so that w[0] is the intercept
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        # Normal equation: w = (X^T X)^{-1} X^T y
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0

In [19]:
model = LinearRegressionMat()
model.fit(features, target)
predictions = model.predict(features)
print('R2 score (without transformation):', r2_score(target, predictions))

features_trans = features @ A   # perform features transformation
model.fit(features_trans, target)
predictions = model.predict(features_trans)
print('R2 score (with transformation):', r2_score(target, predictions))

R2 score (without transformation): 0.4249455028666801
R2 score (with transformation): 0.4249455028666522


**As we can see above, the model quality (R2) stays the same: the two scores agree to 13 decimal places**
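
The remaining difference in the last digits is floating-point rounding, which an explicit matrix inverse can amplify. The sketch below is an alternative under stated assumptions, not the notebook's method: it fits the weights with `np.linalg.lstsq` instead of `np.linalg.inv`, on synthetic data rather than the insurance dataset, and also checks the weight relation $\widetilde w = A^{-1} w$. The helper name `fit_lstsq` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not the insurance dataset)
n_samples, n_features = 200, 4
F = rng.normal(size=(n_samples, n_features))
y = F @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.7 + rng.normal(scale=0.1, size=n_samples)
A = rng.normal(size=(n_features, n_features))  # masking matrix

def fit_lstsq(features, target):
    """Return (w0, w) from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[0], coef[1:]

w0, w = fit_lstsq(F, y)            # fit on the original features
w0_t, w_t = fit_lstsq(F @ A, y)    # fit on the masked features

pred = F @ w + w0
pred_t = (F @ A) @ w_t + w0_t

print(np.allclose(pred, pred_t))                # predictions agree
print(np.allclose(w_t, np.linalg.solve(A, w)))  # masked weights equal A^{-1} w
print(np.isclose(w0, w0_t))                     # intercept is unchanged
```

`np.linalg.solve(A, w)` computes $A^{-1} w$ without forming the inverse explicitly.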

## 4.3 Linear Regression using sklearn

In [20]:
model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)
print('R2 score (without transformation):', r2_score(target, predictions))

features_trans = features @ A   # perform features transformation
model.fit(features_trans, target)
predictions = model.predict(features_trans)
print('R2 score (with transformation):', r2_score(target, predictions))

R2 score (without transformation): 0.42494550286668
R2 score (with transformation): 0.4249455028666811


**As shown above, we get the same results with sklearn, up to floating-point rounding**

<div class="alert alert-success" role="alert">
Great. Correct checking.</div>