<font color='blue'>
    
# Intro

</font>


### An insurance company wants to protect its clients' data. In this notebook I use properties of linear algebra to mask the data so that personal information is hard to recover from the transformed data.


---
**Features:** insured person's gender, age, salary, and number of family members.

**Target:** number of insurance benefits received by the insured person over the last five years.

**The goal of this notebook is not to achieve the best prediction score but to demonstrate a way to mask sensitive data.**

-----

In [1]:
# imports

import numpy as np
from numpy.linalg import inv
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

# Importing data

In [2]:
data = pd.read_csv('insurance_us.csv')
data.head()

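|   | Gender | Age | Salary | Family members | Insurance benefits |
|---|--------|-----|--------|----------------|--------------------|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |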


In [3]:
# rename columns for convenience

data.columns = data.columns.str.replace(' ', '_').str.lower()

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
gender                5000 non-null int64
age                   5000 non-null float64
salary                5000 non-null float64
family_members        5000 non-null int64
insurance_benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


There are no missing values, and every column has an appropriate dtype.

In [5]:
# target

data.insurance_benefits.value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: insurance_benefits, dtype: int64

Because the target takes only a few distinct values, the task could be treated as either classification or regression. I'll use regression.

# Modeling - LinearRegression

In [6]:
X = data.drop('insurance_benefits', axis=1)
y = data.insurance_benefits

X.shape, y.shape

((5000, 4), (5000,))

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
steps = [
    ('scaler', MinMaxScaler()),
    ('model', LinearRegression())
]
pipe = Pipeline(steps=steps)
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
print(f'validation score: {scores.mean():.3f}')

validation score: 0.406


In [9]:
# pipe.score would return the same R2 score, but I'll calculate it manually with r2_score

pipe.fit(X_train, y_train)
y_test_preds = pipe.predict(X_test)
test_score = r2_score(y_test, y_test_preds)

print(f'test score: {test_score:.3f}')

test score: 0.437


# Data Masking

## Theoretical proof for obtaining the same predictions using linear algebra and linear regression properties

$X$ - original matrix

$A$ - invertible matrix

$I$ - identity matrix

$a$ - original predictions

$w$ - original weights

$a'$ - new predictions

$w'$ - new weights
_______________________


$a = Xw$

$a' = XAw'$

The linear regression weights minimize the MSE loss:  
> $w = \arg\min_w MSE(Xw, y)$

> $w = (X^TX)^{-1}X^Ty$

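$w' = ((XA)^T(XA))^{-1}(XA)^Ty$

$w' = (A^T(X^TX)A)^{-1}A^TX^Ty$

Since $A$, $A^T$ and $X^TX$ are square and invertible, the inverse of their product is the product of their inverses in reverse order. $X$ itself is generally not square, so it is never inverted on its own:

$w' = A^{-1}(X^TX)^{-1}(A^T)^{-1}A^TX^Ty$

$w' = A^{-1}(X^TX)^{-1}IX^Ty$

$w' = A^{-1}(X^TX)^{-1}X^Ty$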

$w' = A^{-1}w$
- - - 
$a' = XAA^{-1}w$

$a' = Xw$       


**SAME PREDICTIONS**
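As a quick numerical sanity check of the derivation, the sketch below solves the normal equations for both the original and the masked features. It uses synthetic data: the arrays `X`, `y`, `A` and the helper `ols_weights` are made up for illustration and are not the notebook's data or functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 200 rows, 4 features, a random masking matrix
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
A = rng.normal(size=(4, 4))

def ols_weights(features, target):
    """Solve the normal equations (F^T F) w = F^T target for w."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w = ols_weights(X, y)             # weights on the original features
w_masked = ols_weights(X @ A, y)  # weights on the masked features

# w' should equal A^{-1} w, and both models should give the same predictions
weights_match = np.allclose(w_masked, np.linalg.solve(A, w))
predictions_match = np.allclose(X @ w, (X @ A) @ w_masked)
print(weights_match, predictions_match)
```

Both checks should hold up to floating-point error as long as the randomly drawn `A` is not close to singular.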

### And now, the application on real data:

In [10]:
def make_key(mask_shape):
    """Return a random invertible (mask_shape x mask_shape) matrix to use as the masking key."""
    while True:
        random_matrix = np.random.normal(size=(mask_shape, mask_shape))
        try:
            # np.linalg.inv raises LinAlgError when the matrix is singular
            np.linalg.inv(random_matrix)
        except np.linalg.LinAlgError:
            # singular draw: try again with a new random matrix
            continue
        return random_matrix

def masking(X, mask_matrix):
    """Mask the features (a DataFrame) by multiplying them by the key matrix."""
    return X.values.dot(mask_matrix)

def unmasking(X, mask_matrix):
    """Recover the original features from masked data (a NumPy array) using the same key."""
    return X.dot(np.linalg.inv(mask_matrix))

In [11]:
mask_matrix = make_key(X.shape[1])
X_train_mask, X_test_mask = masking(X_train, mask_matrix), masking(X_test, mask_matrix)

In [12]:
X_train_mask.shape, X_test_mask.shape

((4000, 4), (1000, 4))

In [13]:
# masked row vs. original row
X_train_mask[0,:], X_train.values[0,:]

(array([ -9758.93883348, -26728.93544587,  89634.58937376, -39296.28695247]),
 array([0.00e+00, 1.80e+01, 4.94e+04, 1.00e+00]))
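The unmasking step can be checked in isolation. The sketch below is only an illustration under stated assumptions: it takes three rows that appear earlier in this notebook, masks them with a freshly drawn key (not the `mask_matrix` used above), and checks that multiplying by the inverse key recovers the rows up to floating-point error. The names `X_demo`, `key`, `masked` and `recovered` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three rows taken from the outputs above; columns: gender, age, salary, family_members
X_demo = np.array([
    [0.0, 18.0, 49400.0, 1.0],
    [1.0, 41.0, 49600.0, 1.0],
    [0.0, 46.0, 38000.0, 1.0],
])

# Hypothetical key for this demo only (a fresh random 4x4 matrix)
key = rng.normal(size=(4, 4))

masked = X_demo @ key                    # what would be shared
recovered = masked @ np.linalg.inv(key)  # what the key holder can restore

# Loose absolute tolerance because salary values are ~5e4 and some entries are 0
round_trip_ok = np.allclose(recovered, X_demo, atol=1e-6)
print(round_trip_ok)
```

Anyone holding the key can reverse the transformation in this way, so in this scheme the key has to be treated as the secret.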

## Re-training and comparing results

In [14]:
steps = [
    ('scaler', MinMaxScaler()),
    ('model', LinearRegression())
]
pipe_mask = Pipeline(steps=steps)
scores_mask = cross_val_score(pipe_mask, X_train_mask, y_train, cv=5, scoring='r2')
print(f'validation score: {scores_mask.mean():.3f}')

validation score: 0.406


In [15]:
pipe_mask.fit(X_train_mask, y_train)
y_test_preds_mask = pipe_mask.predict(X_test_mask)
test_score_mask = r2_score(y_test, y_test_preds_mask)

print(f'test score: {test_score_mask:.3f}')

test score: 0.437


**The validation and test scores for the masked data and the original data are the same!**
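The proof above covers plain least squares, while the pipeline also applies `MinMaxScaler` and fits an intercept. The scores still match because min-max scaling is an affine transform of each column, and with an intercept the model spans the same space of predictions either way. The sketch below illustrates this on synthetic data with NumPy only. The helpers `min_max` and `fit_predict` are hypothetical stand-ins for the scikit-learn pipeline, not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins (assumptions), not the insurance data
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, 1.0, -1.5, 2.0]) + 0.3 + rng.normal(scale=0.1, size=300)
A = rng.normal(size=(4, 4))

def min_max(features):
    """Scale every column to the [0, 1] range."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / (hi - lo)

def fit_predict(features, target):
    """Least squares with an intercept on min-max scaled features; returns in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), min_max(features)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

preds_original = fit_predict(X, y)
preds_masked = fit_predict(X @ A, y)

same_fit = np.allclose(preds_original, preds_masked)
print(same_fit)
```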