# Personal Data Encryption

> An insurance company needs to protect user data by encrypting personal information. We develop a data transformation algorithm such that the accuracy of a machine learning model is the same after the transformation as before it.

- toc: true
- badges: true
- comments: true
- categories: [Machine Learning, Python, Linear Algebra, Data Encryption, Linear Regression, pandas, numpy, scikit-learn]
- image: images/encryption.PNG

# **Project Description**

---

The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transforming algorithm that would make it hard to recover personal information from the transformed data. This is called **data masking**, or **data obfuscation**. You are also expected to prove that the algorithm works correctly. Additionally, the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model. Follow these steps to develop a new algorithm:

* construct a theoretical proof using properties of models and the given task;
* formulate an algorithm for this proof;
* check that the algorithm is working correctly when applied to real data.

We will use a simple method of data masking, based on an invertible matrix.

## **Data description**

The dataset is stored in file `/datasets/insurance_us.csv`.

*   **Features**: insured person's gender, age, salary, and number of family members.

*   **Target**: number of insurance benefits received by the insured person over the last five years.

# Load Data

---

Download and look into the data.

In [None]:
# Import the libraries used in the project

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [None]:
df = pd.read_csv('/datasets/insurance_us.csv')

In [None]:
# Functions to get descriptions and info from dataframe

def get_information(df):
  """ Prints general info about the dataframe to get an idea of what it looks like"""
  print('Head: \n')
  display(df.head())
  print('*'*100, '\n') # Prints a break to separate the printed sections
  
  print('Info: \n')
  display(df.info())
  print('*'*100, '\n')

  print('Describe: \n')
  display(df.describe())
  print('*'*100, '\n')

  print('Columns with nulls: \n')
  display(get_null_df(df,4))
  print('*'*100, '\n')

  print('Shape: \n')
  display(df.shape)
  print('*'*100, '\n')

  print('Duplicated: \n')
  print('Number of duplicated rows: {}'.format(df.duplicated().sum()))

def get_null_df(df, num):
  """Gets percentage of null values per column per dataframe"""
  df_nulls = pd.DataFrame(df.isna().sum(), columns=['missing_values'])
  df_nulls['percent_of_nulls'] = round(df_nulls['missing_values'] / df.shape[0], num) *100
  return df_nulls

def get_null(df, num=2):
  """Prints the share and count of null values for each column that has any"""
  count = 0
  s = (df.isna().sum() / df.shape[0])
  for column, percent in zip(s.index, s.values):
    num_of_nulls = df[column].isna().sum()
    if num_of_nulls == 0:
      continue
    count += 1
    # num sets the number of decimal places shown in the percentage
    print('Column {} has {:.{}%} nulls ({} null values)'.format(column, percent, num, num_of_nulls))

  # Report the summary once, after all columns have been checked
  if count != 0:
    print('Number of columns with NA: {}'.format(count))
  else:
    print('\nNo NA columns found')

In [None]:
get_information(df)

Head: 



| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


**************************************************************************************************** 

Info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


None

**************************************************************************************************** 

Describe: 



| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


**************************************************************************************************** 

Columns with nulls: 



| | missing_values | percent_of_nulls |
|---|---|---|
| Gender | 0 | 0.0 |
| Age | 0 | 0.0 |
| Salary | 0 | 0.0 |
| Family members | 0 | 0.0 |
| Insurance benefits | 0 | 0.0 |


**************************************************************************************************** 

Shape: 



(5000, 5)

**************************************************************************************************** 

Duplicated: 

Number of duplicated rows: 153


In [None]:
df.columns = [columns.lower().replace(' ', '_') for columns in df.columns]
df.columns

Index(['gender', 'age', 'salary', 'family_members', 'insurance_benefits'], dtype='object')

In [None]:
features = df.drop('insurance_benefits', axis=1)
target = df['insurance_benefits']

The dataset has 5000 entries and no missing values. There are 5 columns (gender, age, salary, family_members, and insurance_benefits). There are 153 duplicated rows. These may simply be different clients with identical information; without a unique identifying column, true duplicates cannot be told apart from coincidental matches, so the rows are kept. The dataset is split as follows:

* **Features**: insured person's gender, age, salary, and number of family members.

* **Target**: number of insurance benefits received by the insured person over the last five years.

# Multiplication of matrices

---

A theoretical proof based on the linear regression equation is presented here. We take the feature matrix and multiply it by an invertible matrix.

The quality of the model will be the same for both sets of parameters: the original features and the features after multiplication.

**Matrix properties:**

An identity matrix (unit matrix) is a square matrix with 1s on the main diagonal and 0s everywhere else. For example:
$$ E = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

If any matrix $A$ is multiplied by an identity matrix (or vice versa), the same matrix $A$ is obtained.

$$ AE = EA = A $$

If a square matrix $A$ is multiplied by its inverse matrix ($A^{-1}$), the product is the identity matrix.

$$ AA^{-1} = A^{-1}A = E $$

It follows that the identity matrix is its own inverse:
$$ E = E^{-1} $$

Matrices that have inverses are called **invertible** matrices. Not all matrices have inverses.
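The identities above can be checked numerically. The sketch below uses NumPy with an arbitrary 3 x 3 matrix `A` picked only for illustration; comparisons go through `np.allclose` because a floating-point inverse is only approximate.

```python
import numpy as np

# Arbitrary invertible matrix (determinant 5), chosen only for illustration
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
E = np.eye(3)               # 3 x 3 identity matrix
A_inv = np.linalg.inv(A)    # raises LinAlgError if A is singular

# AE = EA = A
identity_ok = np.allclose(A @ E, A) and np.allclose(E @ A, A)
# A A^-1 = A^-1 A = E
inverse_ok = np.allclose(A @ A_inv, E) and np.allclose(A_inv @ A, E)
print(identity_ok, inverse_ok)

# A matrix with linearly dependent rows has no inverse
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
print(np.linalg.det(singular))  # zero, up to floating-point rounding
```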

**Invertible matrix properties:**

1. $ (A^{-1})^{-1} = A $
2. $ (A^T)^{-1} = (A^{-1})^T $
> The transpose of an invertible matrix is also invertible, and its inverse is the transpose of the inverse of the original matrix.

3. $ (kA)^{-1} = k^{-1}A^{-1} $ for non-zero scalar $k$
4. For any **two invertible square matrices** $A$ and $B$ of the same size, $ (AB)^{-1} = B^{-1}A^{-1}$

The transpose of a product follows the same order-reversing pattern: $ (AB)^T = B^T A^T $

> Note that the order of the factors reverses. From this one can deduce that a square matrix $A$ is invertible if and only if $A^T$ is invertible, and in this case we have $(A^{-1})^T = (A^T)^{-1}$. By induction, this result extends to the general case of multiple matrices.
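As a quick sanity check, the listed properties can be verified on random matrices. This is a sketch rather than a proof: the seed, the matrix size, and the scalar `k` are arbitrary choices, and a random Gaussian matrix is invertible only with probability 1, not by guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, an assumption for repeatability
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
k = 2.5                          # any non-zero scalar
inv = np.linalg.inv

checks = {
    "(A^-1)^-1 = A": np.allclose(inv(inv(A)), A),
    "(A^T)^-1 = (A^-1)^T": np.allclose(inv(A.T), inv(A).T),
    "(kA)^-1 = k^-1 A^-1": np.allclose(inv(k * A), inv(A) / k),
    "(AB)^-1 = B^-1 A^-1": np.allclose(inv(A @ B), inv(B) @ inv(A)),
    "(AB)^T = B^T A^T": np.allclose((A @ B).T, B.T @ A.T),
}
for name, ok in checks.items():
    print(name, ok)
```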



Denote:

- $X$ — feature matrix (zero column consists of ones)

- $y$ — target vector

- $P$ — matrix by which the features are multiplied

- $w$ — linear regression weight vector (zero element is the shift, i.e. the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
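A minimal NumPy sketch of the training formula, on made-up data where the answer is known in advance. The helper name `fit_normal_equation` is hypothetical and not part of the project code.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return w = (X^T X)^-1 X^T y; X must already contain the column of ones."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Tiny synthetic example (values are made up): y = 1 + 2*x exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # zero column consists of ones
y = 1.0 + 2.0 * x

w = fit_normal_equation(X, y)   # w[0] is the shift, w[1] the slope
a = X @ w                       # predictions a = Xw
print(w)                        # approximately [1. 2.]
```

The explicit inverse mirrors the formula for readability. In practice `np.linalg.solve` or `np.linalg.lstsq` is usually preferred because it avoids forming the inverse and is numerically more stable.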

**Answer**:

$$ a' = a $$

Predictions ($a'$) made using the feature matrix multiplied by an invertible matrix are the same as the predictions ($a$) made using the original feature matrix.

**Justification:**

Known:
* $ a = Xw $
* $ w = (X^T X)^{-1} X^T y $
* $ a' = X'w' $
* $ X' = XP $

We want to prove that:  
* $ a' = X'w' = XP w' = Xw = a $

To find $w'$, substitute $X' = XP$ for $X$ in the training formula:

$$ w' = ((XP)^T (XP))^{-1} (XP)^T y $$

Apply the transpose property $(AB)^T = B^T A^T$:

$$ w' = (P^T X^T X P)^{-1} P^T X^T y $$

$X$ is generally not square, so $(XP)^{-1}$ and $(X^T)^{-1}$ do not exist and the inverse cannot be split between $X^T$ and $X$. However, $P^T$, $X^T X$, and $P$ are all square (*n x n*) matrices. $P$ is invertible by construction, and $X^T X$ is assumed to be invertible (the training formula already requires it), so the inverse of the product reverses the order of these three factors:

$$ w' = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y $$

Since $(P^T)^{-1} P^T = E$, these two factors cancel:

$$ w' = P^{-1} (X^T X)^{-1} X^T y $$

Everything after $P^{-1}$ is exactly the training formula for $w$, so substitution gives:

$$ w' =  P^{-1} w $$

Now that $w'$ is known, it can be substituted into the prediction equation from earlier:

$$ a' = X'w' $$
$$ a' = XP w' $$ 
$$ a' = XP (P^{-1} w) $$
$$ a' = XP P^{-1} w $$

Since $P$ is invertible, $P P^{-1} = E$ and the two factors cancel:

$$ a' = X w  = a $$


The weight vectors that minimize MSE for the two models are related by $w' = P^{-1} w$: training on the transformed features multiplies the original weight vector by the inverse of the matrix used to transform the features. When predictions are computed, the $P$ in $X' = XP$ meets this $P^{-1}$ and the two cancel. Both models therefore produce the same predictions, so the quality of the model does not change.
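The result can also be checked numerically. The sketch below assumes synthetic data (200 rows, two features plus a column of ones) and a fixed 3 x 3 matrix `P` with determinant 5; none of these values come from the insurance dataset.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen arbitrarily
n_rows = 200
# Feature matrix with a leading column of ones, as in the notation above
X = np.column_stack([np.ones(n_rows), rng.normal(size=(n_rows, 2))])
y = rng.normal(size=n_rows)
# Fixed invertible matrix (determinant 5), chosen only for illustration
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])

def weights(features, target):
    # Training formula: w = (X^T X)^-1 X^T y
    return np.linalg.inv(features.T @ features) @ features.T @ target

w = weights(X, y)            # weights for the original features
w_masked = weights(X @ P, y) # weights for the transformed features

weights_match = np.allclose(w_masked, np.linalg.inv(P) @ w)   # w' = P^-1 w
predictions_match = np.allclose((X @ P) @ w_masked, X @ w)    # a' = a
print(weights_match, predictions_match)
```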

# Transformation algorithm

---

State an algorithm for data transformation to solve the task. Explain why the linear regression quality won't change based on the proof above.


**Algorithm**

$$ X' = XP $$
$$ w = (X^T X)^{-1} X^T y $$
$$ w' =  P^{-1} w $$


- $X'$ — transformed feature matrix
- $X$ — feature matrix (zero column consists of ones)
- $P$ — invertible matrix by which the features are multiplied
- $y$ — target vector
- $w$ — linear regression weight vector (zero element is equal to the shift)
- $w'$ — linear regression weight vector for the transformed features

**Justification**

A new matrix ($X'$) is created by multiplying the original feature matrix ($X$) by an invertible matrix ($P$).

The linear regression quality will not change, because the proof above shows that predictions made with the original features and predictions made with the transformed features are the same.
...

<div class="alert alert-success">
    Yep! The algorithm and its justification are correct, provided you fix the problems with the proof in the prior section.
</div>
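The masking step and its reversal can be sketched as two small functions. The names `mask_features` and `unmask_features` are hypothetical, the rows are the first three records from the head output above, and the fixed matrix `P` (determinant -3) is an arbitrary choice for illustration.

```python
import numpy as np

def mask_features(X, P):
    """Masking step of the algorithm: X' = XP."""
    return X @ P

def unmask_features(X_masked, P):
    """Recover X = X' P^-1; only possible for whoever holds P."""
    return X_masked @ np.linalg.inv(P)

# Column order: gender, age, salary, family_members
X = np.array([[1.0, 41.0, 49600.0, 1.0],
              [0.0, 46.0, 38000.0, 1.0],
              [0.0, 29.0, 21000.0, 0.0]])

# Fixed invertible matrix, chosen only for illustration
P = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0, 1.0]])

X_masked = mask_features(X, P)
X_recovered = unmask_features(X_masked, P)

print(X_masked[0])                                  # no longer readable as a client record
print(np.allclose(X_recovered, X, atol=1e-6))       # original values come back
```

Each masked column mixes several original columns, so a value such as age cannot be read off directly. Anyone who obtains `P` can undo the masking, so the matrix has to be kept secret.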

# Algorithm test

---

The algorithm is programmed using matrix operations. Make sure that the quality of linear regression from sklearn is the same before and after transformation. Use the R2 metric.

In [None]:
RANDOM_STATE = 12345 # Random_State

# Splits dataset to test/train (validate/train) datasets

def get_train_valid(df):
    df_train, df_valid = train_test_split(df, test_size=0.25, random_state=RANDOM_STATE) # Splits data up to 75% train and 25% test
    return df_train, df_valid

In [None]:
features_train, features_valid = get_train_valid(features)
target_train, target_valid = get_train_valid(target)

assert features_train.shape[0] == target_train.shape[0]
assert features_valid.shape[0] == target_valid.shape[0]

print('Datasets: \n')
print('Features Train:', features_train.shape, ' Target Train:', target_train.shape)
print('Features Validation:', features_valid.shape, ' Target Validation:', target_valid.shape)

Datasets: 

Features Train: (3750, 4)  Target Train: (3750,)
Features Validation: (1250, 4)  Target Validation: (1250,)


In [None]:
# Original features test

model = LinearRegression()
model.fit(features_train, target_train)
predictions_original = model.predict(features_valid)
print('R2 Score with original features:', r2_score(target_valid, predictions_original))

R2 Score with original features: 0.435227571270266


In [None]:
def create_matrix_P(features):
  matrix_P = np.random.normal(size=(features.shape[1], features.shape[1])) # Create matrix_P with shape of n_features x n_features
  return matrix_P

In [None]:
# Draw a random matrix_P and retry until it is invertible
while True:
    matrix_P = create_matrix_P(features)
    try:
        np.linalg.inv(matrix_P)  # raises LinAlgError for a singular matrix
        break
    except np.linalg.LinAlgError:
        print("Matrix is not invertible")

matrix_P.shape

(4, 4)
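`np.linalg.inv` raises an error only when the matrix is detected as singular. A nearly singular matrix can pass that check and still invert inaccurately. One possible safeguard, sketched below, is to reject candidates with a large condition number. The function name `create_well_conditioned_P` and the threshold `MAX_CONDITION` are assumptions, not part of the project code.

```python
import numpy as np

MAX_CONDITION = 1e6   # hypothetical threshold, tune for the data at hand

def create_well_conditioned_P(n_features, rng, max_condition=MAX_CONDITION):
    """Draw random square matrices until one is comfortably invertible."""
    while True:
        candidate = rng.normal(size=(n_features, n_features))
        # A large condition number signals a nearly singular matrix
        if np.linalg.cond(candidate) < max_condition:
            return candidate

rng = np.random.default_rng(12345)   # same value as RANDOM_STATE, for repeatability
P = create_well_conditioned_P(4, rng)
print(P.shape, np.linalg.cond(P) < MAX_CONDITION)
```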

In [None]:
# Multiply the features by matrix_P to get the transformed features (X' = XP)
feat_P = features.dot(matrix_P)

feat_P_train, feat_P_valid = get_train_valid(feat_P)

assert feat_P_train.shape[0] == target_train.shape[0]
assert feat_P_valid.shape[0] == target_valid.shape[0]

print('Datasets: \n')
print('Transformed Features Train:', feat_P_train.shape, ' Target Train:', target_train.shape)
print('Transformed Features Validation:', feat_P_valid.shape, ' Target Validation:', target_valid.shape)

Datasets: 

Transformed Features Train: (3750, 4)  Target Train: (3750,)
Transformed Features Validation: (1250, 4)  Target Validation: (1250,)


In [None]:
# Transformed features test

model = LinearRegression()
model.fit(feat_P_train, target_train)
predictions_trans = model.predict(feat_P_valid)
print('R2 Score with features multiplied by an invertible matrix:', r2_score(target_valid, predictions_trans))

R2 Score with features multiplied by an invertible matrix: 0.43522757127045497


The dataset was split into train and validation sets; the model was trained on the training set and then evaluated on the validation set. After creating the invertible matrix P, the feature matrix was transformed by multiplying it by P. The R2 scores of the two linear regression models agree to 12 decimal places (0.435227571270266 for the original features vs 0.43522757127045497 for the transformed features). The remaining difference of roughly 2e-13 is consistent with floating-point rounding.
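The proof treats $X$ as containing a column of ones, while the test masks only the four feature columns and lets the model fit its intercept separately. The result still holds in that setting, because appending a column of ones to $XP$ is the same as multiplying the augmented $X$ by an invertible block matrix built from 1 and $P$. The sketch below illustrates this with plain NumPy least squares on synthetic data; the data, the fixed matrix `P` (determinant -3), and the helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_rows = 400
X = rng.normal(size=(n_rows, 4))   # synthetic stand-in for the four features
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 3.0 + rng.normal(scale=0.5, size=n_rows)
P = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0, 1.0]])   # fixed invertible matrix

def add_intercept(M):
    """Prepend a column of ones so the model can fit a shift."""
    return np.column_stack([np.ones(len(M)), M])

def fit_predict(X_train, y_train, X_valid):
    """Least squares fit with an intercept added after any masking."""
    w, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
    return add_intercept(X_valid) @ w

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

train, valid = slice(0, 300), slice(300, None)
pred_original = fit_predict(X[train], y[train], X[valid])
pred_masked = fit_predict(X[train] @ P, y[train], X[valid] @ P)

# Largest absolute gap between the two sets of predictions
max_gap = np.max(np.abs(pred_original - pred_masked))
print(r2(y[valid], pred_original), r2(y[valid], pred_masked), max_gap)
```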

# Overall Conclusion

We showed that multiplying the original features by an invertible matrix leaves the predictions of the linear regression model unchanged. The proof above establishes this: the weight vector for the transformed features picks up a factor of $P^{-1}$, which cancels the $P$ in the transformed features. In the test, the R2 scores of the two models match up to floating-point rounding.