In [14]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

import numpy as np

## 1. Data loading

In [3]:
data = pd.read_csv('/datasets/insurance_us.csv')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


## 2. Multiplication of matrices

Denote:

- $X$ — feature matrix (the zeroth column consists of ones)

- $y$ — target vector

- $P$ — invertible matrix by which the features are multiplied

- $w$ — linear regression weight vector (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
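As a minimal sketch of the training formula, the snippet below fits the weights on small synthetic data. The data, the seed, and the helper name `fit_normal_equation` are illustrative assumptions and not part of the notebook. It calls `np.linalg.solve` on the system $(X^T X)w = X^T y$ instead of forming the inverse explicitly, which gives the same $w$ when $X^T X$ is invertible.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y; X must already contain the column of ones."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)                # hypothetical seed, for reproducibility
features = rng.normal(size=(100, 4))          # synthetic stand-in for the four features
X = np.column_stack([np.ones(len(features)), features])  # zeroth column of ones
true_w = np.array([0.5, 1.0, -2.0, 0.0, 3.0]) # made-up weights, intercept first
y = X @ true_w                                # noiseless target, so the fit is exact

w = fit_normal_equation(X, y)                 # training formula
a = X @ w                                     # predictions a = Xw
```

Because the synthetic target is noiseless, the recovered `w` should match `true_w` up to floating-point rounding.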

**Answer:** We can multiply the feature matrix by an invertible matrix $P$ before training. This makes it harder to recover personal information from the transformed data, and it does not change the model's predictions. The proof below shows why.

**Justification:** 

with

$$
X' = XP
$$

then

$$
w' = ((X')^T X')^{-1} (X')^T y
$$

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

since

$$
(AB)^T = B^TA^T
$$

thus

$$
w' = (P^TX^T XP)^{-1} P^TX^T y
$$

since, for invertible square matrices $A$ and $B$,

$$
(AB)^{-1} = B^{-1}A^{-1}
$$

thus

$$
w' = P^{-1}(P^TX^TX)^{-1} P^TX^T y
$$

$$
w' = P^{-1}(X^TX)^{-1}(P^T)^{-1} P^TX^T y
$$

since

$$
A^{-1}A = I
$$

thus

$$
w' = P^{-1}(X^TX)^{-1}X^T y
$$

$$
w' = P^{-1}w
$$

To verify that the predictions are unchanged, start from:

$$
a' = X'w'
$$

knowing that

$$
X' = XP
$$

$$
w' = P^{-1}w
$$

$$
a = Xw
$$

then

$$
a' = XPP^{-1}w
$$

$$
a' = Xw
$$

$$
a' = a
$$
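The derivation above can be checked numerically. The sketch below uses synthetic data, and all names and the seed are assumptions made for illustration. Here $P$ is $5 \times 5$ because $X$ includes the column of ones, as in the notation of this section. A random Gaussian matrix is invertible with probability 1, so no explicit invertibility check is made.

```python
import numpy as np

def normal_equation(X, y):
    """Training formula w = (X^T X)^{-1} X^T y, solved as a linear system."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)                      # hypothetical seed
X = np.column_stack([np.ones(200), rng.normal(size=(200, 4))])
y = rng.normal(size=200)                            # arbitrary synthetic target
P = rng.normal(size=(5, 5))                         # random transformation matrix

w = normal_equation(X, y)                           # weights on original features
X_t = X @ P                                         # X' = XP
w_t = normal_equation(X_t, y)                       # weights on transformed features

# Check w' = P^{-1} w (solve P w' = w rather than inverting P) and a' = a
weights_match = np.allclose(w_t, np.linalg.solve(P, w))
preds_match = np.allclose(X_t @ w_t, X @ w)
print(weights_match, preds_match)
```

Both comparisons use `np.allclose` rather than exact equality, since the two computations differ by floating-point rounding.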

## 3. Transformation algorithm

**Algorithm**

The algorithm draws a random $4 \times 4$ matrix $P$, checks that it is invertible, multiplies the feature matrix by it, and trains on the transformed features using

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$

**Justification**

Since the proof shows that the predictions remain the same, the algorithm should not affect model quality. The weights do change ($w' = P^{-1}w$), but the predictions do not. We'll test this next.
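One way to generate a usable $P$ is sketched below. The helper name `make_invertible_matrix`, the condition-number threshold, and the seed are assumptions chosen for illustration. Rejecting badly conditioned draws is a stricter test than invertibility alone. The last lines show that the original features can be restored only by someone who knows $P$.

```python
import numpy as np

def make_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices until one is comfortably invertible."""
    while True:
        P = rng.normal(size=(n, n))
        # A huge condition number means P is numerically close to singular
        if np.linalg.cond(P) < max_cond:
            return P

rng = np.random.default_rng(42)        # hypothetical seed
P = make_invertible_matrix(4, rng)

X = rng.normal(size=(10, 4))           # synthetic stand-in for the feature matrix
X_obfuscated = X @ P                   # the data that would be shared
X_recovered = X_obfuscated @ np.linalg.inv(P)  # recovery requires knowing P
```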

## 4. Algorithm test

Our first steps will be to create our training and testing features/targets.

In [10]:
# Split into training and test sets
train, test = train_test_split(data, test_size=0.25, random_state=12345)

# Separate features and target
train_features = train.drop('Insurance benefits', axis=1)
train_target = train['Insurance benefits']

test_features = test.drop('Insurance benefits', axis=1)
test_target = test['Insurance benefits']


Next, we'll get the R2 score of a model trained on the untransformed features.

In [22]:
#create model
model = LinearRegression()

#fit model
model.fit(train_features,train_target)

#test model
predictions = model.predict(test_features)

#get r2 score
print("R2 score of untransformed model:",r2_score(test_target, predictions))

R2 score of untransformed model: 0.435227571270266


Now we'll check the R2 score of a model trained on the transformed features.

In [23]:
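# Draw a random 4x4 matrix (mean 1, standard deviation 10000)
P = np.random.normal(1, 10000, size=(4, 4))

# np.linalg.inv raises LinAlgError if P is singular, so this doubles as an invertibility check
print(np.linalg.inv(P))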

#transform features
transformed_train_features = train_features.dot(P)
transformed_test_features = test_features.dot(P)

model.fit(transformed_train_features, train_target)

predictions2 = model.predict(transformed_test_features)

#get r2 score
print("R2 score of transformed model:",r2_score(test_target, predictions2))


[[ 6.17638733e-04 -4.17246211e-04  4.36747663e-04 -1.63817251e-04]
 [ 1.22609343e-03 -9.02158366e-04  7.15473292e-04 -6.72547022e-04]
 [ 6.86970018e-04 -5.65587072e-04  5.30414054e-04 -2.10143799e-04]
 [ 1.54327992e-04 -1.38726895e-04  7.31581085e-05 -4.79524908e-05]]
R2 score of model: 0.43522757127013845


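As the test shows, the R2 scores of the models trained on the untransformed and transformed features agree to 12 decimal places (0.435227571270266 versus 0.43522757127013845). The remaining difference of about $10^{-13}$ is floating-point rounding.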
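One detail is worth checking. The proof in section 2 treats the column of ones as part of $X$, whereas `LinearRegression` fits the intercept separately and $P$ here multiplies only the four feature columns. The predictions still coincide, because the columns of $[1, XP]$ span the same space as those of $[1, X]$ when $P$ is invertible. The NumPy-only sketch below illustrates this on synthetic data. It uses `np.linalg.lstsq` in place of scikit-learn, and the helper name `fit_predict`, the data, and the seed are assumptions.

```python
import numpy as np

def fit_predict(train_X, train_y, test_X):
    """Least-squares fit with a separately added intercept column, then predict."""
    A_train = np.column_stack([np.ones(len(train_X)), train_X])
    A_test = np.column_stack([np.ones(len(test_X)), test_X])
    w, *_ = np.linalg.lstsq(A_train, train_y, rcond=None)
    return A_test @ w

rng = np.random.default_rng(7)                      # hypothetical seed
train_X = rng.normal(size=(300, 4))                 # synthetic training features
test_X = rng.normal(size=(100, 4))                  # synthetic test features
noise = rng.normal(scale=0.1, size=300)
train_y = train_X @ np.array([1.0, -0.5, 2.0, 0.0]) + 3.0 + noise

P = rng.normal(size=(4, 4))                         # applied to features only

plain = fit_predict(train_X, train_y, test_X)
transformed = fit_predict(train_X @ P, train_y, test_X @ P)
max_diff = np.max(np.abs(plain - transformed))      # expected to be rounding-level
```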