The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

df = pd.read_csv('.csv')

df.head(5)

Unnamed: 0,Gender,Age,Salary,Family members,Insurance benefits
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


There are no missing values and the datatypes are as expected.

## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline, frame it with dollar signs \$; to set it on its own line, use double dollar signs \$\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$: feature matrix (the zero column consists of ones); in code, the features are `df.drop('Insurance benefits', axis=1)`

- $y$: target vector; in code, `df['Insurance benefits']`

- $P$: matrix by which the features are multiplied, assumed here to be invertible. The inverse of a square matrix $A$ is the matrix $A^{-1}$ whose product with $A$, taken in either order, is the identity matrix.

- $w$: linear regression weight vector (the zero element is the shift)
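
The defining property of the inverse can be checked numerically. The sketch below is illustrative only: the 2x2 matrix `P` is a made-up example, not a matrix from the project, and `np.allclose` is used because floating-point products match the identity only approximately.

```python
import numpy as np

# Hypothetical 2x2 matrix chosen for illustration; det = 2*1 - 1*1 = 1
P = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# np.linalg.inv raises LinAlgError if the matrix is singular
P_inv = np.linalg.inv(P)

E = np.eye(2)
left_ok = np.allclose(P_inv @ P, E)   # inverse applied on the left
right_ok = np.allclose(P @ P_inv, E)  # inverse applied on the right
print(left_ok, right_ok)
```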

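Predictions:

$$
a' = X'w' = XPw'
$$

For a single observation $x$ with weights $w$ and shift $w_0$, the linear model reads $a = w^T x + w_0$.

Training objective:

$$
w' = \arg\min_w MSE(X'w, y)
$$

Training formula:

$$
w' = (X'^T X')^{-1} X'^T y
$$

Substituting the trained weights into the prediction formula:

$$
a' = X'(X'^T X')^{-1} X'^T y
$$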

**Answer:** The quality of the linear regression does not change. Any invertible matrix multiplied by its inverse gives the identity, so the matrix $P$ cancels out of the predictions.

**Justification:** For the identity matrix $E$ and an invertible matrix $A$:

$$
AE = EA = A
$$

$$
AA^{-1} = A^{-1}A = E
$$
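
The cancellation can be written out explicitly. The following is a sketch that assumes both $P$ and $X^T X$ are invertible, and uses the identities $(AB)^T = B^T A^T$ and $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$. Here $w = (X^T X)^{-1} X^T y$ and $a = Xw$ denote the weights and predictions for the untransformed features.

$$
w' = ((XP)^T XP)^{-1} (XP)^T y = (P^T X^T X P)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w
$$

$$
a' = XPw' = XPP^{-1}w = Xw = a
$$

The predictions on the transformed features coincide with those on the original features, so any quality metric computed from them is unchanged.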

## 3. Transformation algorithm

In [3]:
features = df.drop('Insurance benefits', axis=1)
target = df['Insurance benefits']

Defining the feature and target columns.

In [4]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)

Splitting the data into train and test datasets.

In [5]:
# Baseline: predict the test-set mean for every observation (no model is fitted here)
predictions = pd.Series(target_test.mean(), index=target_test.index)
rmse = mean_squared_error(target_test, predictions)**0.5

print('rmse before transformation is = '+"{:.10}".format(rmse))


rmse before transformation is = 0.4543830983


In [6]:
print(r2_score(target_test, predictions))

0.0
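
An R2 of exactly 0.0 is what the definition $R^2 = 1 - SS_{res}/SS_{tot}$ gives for a constant prediction equal to the mean of the evaluated targets, because then $SS_{res} = SS_{tot}$. A minimal sketch with made-up target values follows; the array `y_true` is hypothetical, and `r2` is a hand-written helper rather than the `r2_score` call used above.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets for illustration only
y_true = np.array([0.0, 1.0, 0.0, 2.0, 0.0])

# Constant baseline: every prediction equals the mean of y_true
baseline = np.full_like(y_true, y_true.mean())
baseline_r2 = r2(y_true, baseline)
print(baseline_r2)
```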


In [7]:
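# Answer: linear regression trained with the formula w = (X^T X)^{-1} X^T y.
# Note: this class shadows the LinearRegression imported from sklearn in cell 1.

class LinearRegression:
    def fit(self, features_train, target_train):
        # Prepend a column of ones so that w[0] is the shift (intercept)
        X = np.concatenate((np.ones((features_train.shape[0], 1)), features_train), axis=1)
        y = target_train
        # Training formula; requires X^T X to be invertible
        w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, features_test):
        # Weighted sum of the features plus the shift
        return features_test.dot(self.w) + self.w0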

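**Justification:** The class above is a minimal linear regression built on the training formula $w = (X^T X)^{-1} X^T y$. It is the model used to assess quality in the test below.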
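
The notebook cells above do not themselves multiply the features by a matrix. One way such a transformation could look is sketched below, under stated assumptions: the helper names `make_invertible_matrix` and `transform` are hypothetical, the condition-number threshold is an arbitrary choice, and the three feature rows are copied from the `df.head(5)` output in section 1 (without the index column and the target).

```python
import numpy as np

def make_invertible_matrix(n, rng, max_cond=100.0):
    """Draw random n x n matrices until one is well conditioned (hence invertible)."""
    while True:
        P = rng.normal(size=(n, n))
        if np.linalg.cond(P) < max_cond:
            return P

def transform(features, P):
    """Obfuscate the features by right-multiplying them with P."""
    return np.asarray(features, dtype=float) @ P

rng = np.random.default_rng(12345)

# Columns: Gender, Age, Salary, Family members
X = np.array([[1, 41.0, 49600.0, 1],
              [0, 46.0, 38000.0, 1],
              [0, 29.0, 21000.0, 0]])

P = make_invertible_matrix(X.shape[1], rng)
X_transformed = transform(X, P)

# With P known, the original features can be recovered via the inverse
X_recovered = X_transformed @ np.linalg.inv(P)
print(np.allclose(X_recovered, X, atol=1e-6))
```

The matrix `P` plays the role of a key in this sketch: the transformed values no longer resemble the original columns, while multiplying by `np.linalg.inv(P)` restores them up to floating-point error.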
   

## 4. Algorithm test

In [8]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
print(r2_score(target_test, predictions))

0.43522757127026657


In [9]:
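# Note: this overwrites the model's predictions with the constant test-set mean,
# so the RMSE printed below is the baseline value again, not the model's RMSE
predictions = pd.Series(target_test.mean(), index=target_test.index)
rmse = mean_squared_error(target_test, predictions)**0.5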

print('rmse after transformation is = '+"{:.10}".format(rmse))

rmse after transformation is = 0.4543830983


Both RMSE values equal 0.4543830983 because both cells compute the RMSE of the same constant prediction (the test-set mean) rather than of the model's output. They therefore say nothing about the effect of a transformation, and the cells above do not apply one.

The constant prediction has an R2 of 0.0, while the linear regression reaches an R2 of about 0.435 on the test set. R2 cannot exceed 1, and a value above 0 means the model predicts better than the constant baseline, so the implementation works.
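
A before/after comparison in the sense of section 2 would fit the same model on the original features and on the features multiplied by an invertible matrix. The sketch below does this on synthetic data and is meant only as an illustration: the data, the random matrix `P`, and the helper names `fit_predict` and `r2` are all assumptions, not part of the notebook.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Linear regression via the training formula, with a column of ones for the shift."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    w = np.linalg.inv(A_train.T @ A_train) @ A_train.T @ y_train
    return A_test @ w

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic regression data (not the insurance dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Random matrix, redrawn until it is comfortably invertible
P = rng.normal(size=(4, 4))
while np.linalg.cond(P) > 100:
    P = rng.normal(size=(4, 4))

r2_original = r2(y_test, fit_predict(X_train, y_train, X_test))
r2_transformed = r2(y_test, fit_predict(X_train @ P, y_train, X_test @ P))
print(r2_original, r2_transformed)
```

The two printed values should agree up to floating-point error, which is the numerical counterpart of the answer in section 2.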

## Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  Step 1 performed: the data was downloaded
- [ ]  Step 2 performed: the answer to the matrix multiplication problem was provided
    - [ ]  The correct answer was chosen
    - [ ]  The choice was justified
- [ ]  Step 3 performed: the transform algorithm was proposed
    - [ ]  The algorithm was described
    - [ ]  The algorithm was justified
- [ ]  Step 4 performed: the algorithm was tested
    - [ ]  The algorithm was realized
    - [ ]  Model quality was assessed before and after the transformation