#### In the project, implement:
- the algorithm (without relying on `LinearRegression`) and use it to demonstrate that $\hat{y} = \hat{y}_Z$, where $Z$ is an invertible matrix.

$$
\begin{align*}
\hat{y}=Xw,&\quad w=(X^T X)^{-1} X^T y \\
\hat{y_Z}=XZw_Z,&\quad w_Z = ((XZ)^T (XZ))^{-1} (XZ)^T y
\end{align*}
$$

</span>

You need to protect the customer data of the insurance company “Hot potop”. Develop a data transformation method that makes it difficult to recover personal information from the transformed data, and justify why the method works.
The data must be protected in such a way that the quality of machine learning models does not degrade. Searching for the best model is not required.
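
A minimal sketch of one such transformation is given here, assuming the approach taken later in this notebook: multiplying the features by an invertible matrix. The helper name `random_invertible_matrix`, the seed, the conditioning threshold and the three sample rows are illustrative assumptions, not part of the original project.

```python
import numpy as np

def random_invertible_matrix(n, seed=0):
    """Draw random n x n matrices until one is well-conditioned (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    while True:
        Z = rng.normal(size=(n, n))
        # A small condition number means Z can be inverted reliably
        if np.linalg.cond(Z) < 100:
            return Z

# Three sample rows with four features each (illustrative values)
X = np.array([[1, 41.0, 49600.0, 1],
              [0, 46.0, 38000.0, 1],
              [0, 29.0, 21000.0, 0]])

Z = random_invertible_matrix(X.shape[1])      # the secret "key"
X_protected = X @ Z                           # transformed data
X_restored = X_protected @ np.linalg.inv(Z)   # recovery requires knowing Z

print(np.allclose(X, X_restored, atol=1e-6))
```

Without `Z`, the rows of `X_protected` are mixtures of all four features, while anyone who holds `Z` can undo the transformation exactly.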

## 1. Data Loading

In [1]:
import numpy as np
import pandas as pd
from numpy.linalg import inv
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('/Users/fidanb/Downloads/yandex praktikum/all datasets/insurance.csv')
df

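| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 4995 | 0 | 28.0 | 35700.0 | 2 | 0 |
| 4996 | 0 | 34.0 | 52400.0 | 1 | 0 |
| 4997 | 0 | 20.0 | 33900.0 | 2 | 0 |
| 4998 | 1 | 22.0 | 32700.0 | 3 | 0 |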


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Пол                  5000 non-null int64
Возраст              5000 non-null float64
Зарплата             5000 non-null float64
Члены семьи          5000 non-null int64
Страховые выплаты    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
df.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


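#### Conclusion: The data is clean: there are no missing values, and only two data types are present, int64 and float64.

The column names are in Russian: `Пол` is sex, `Возраст` is age, `Зарплата` is salary, `Члены семьи` is family members, and `Страховые выплаты` is insurance payouts (the target).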

## 2. Matrix Multiplication

In this project, you can write formulas directly in Jupyter Notebook.
To include a formula within the text, enclose it in single dollar signs (\$). For a standalone formula (displayed separately), use double dollar signs (\$\$). These formulas are written in LaTeX syntax.
As an example, we wrote linear regression formulas. You can copy and edit them to solve the task.
Using LaTeX is optional.

Notation:

 $X$ — feature matrix (the zeroth column consists of ones)

 $y$ — target vector

 $P$ — matrix by which the features are multiplied (called $Z$ in the answer below)

 $w$ — vector of linear regression weights (the zeroth element corresponds to the intercept)

Predictions:

$$
a = Xw
$$

Learning task:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
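
As a quick illustration of the training formula, the sketch below fits noise-free synthetic data whose true weights are known in advance. The data, the seed and the weights are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features with a leading column of ones (the zeroth column of X)
features = rng.normal(size=(50, 3))
X = np.hstack([np.ones((50, 1)), features])

true_w = np.array([2.0, -1.0, 0.5, 3.0])   # the zeroth element is the intercept
y = X @ true_w                              # noise-free target

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Predictions: a = Xw
a = X @ w

print(np.allclose(w, true_w), np.allclose(a, y))
```

Because the target contains no noise, the formula should recover the weights up to floating-point rounding.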

#### Answer:
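The predictions, and therefore the quality of the model, do not change when the features are multiplied by an invertible matrix. In the prediction formula the matrix $Z$ ends up next to its inverse $Z^{-1}$; their product is the identity matrix, which has no effect on the result of the multiplication.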

**Justification:** 

Predictions:

$$
a = Xw
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

Thus, the prediction calculation takes the form:

$$
a = X (X^T X)^{-1} X^T y
$$


Suppose we have an invertible square matrix $Z$ whose size equals the number of features.
If we multiply the feature matrix $X$ by $Z$, the predictions are computed as:

$$
a_Z = XZw_Z
$$

The training formula is:

$$
w_Z = (Z^T X^T  X Z)^{-1} Z^T X^T  y
$$

Substituting into the prediction formula:

$$
a_Z = X Z(Z^T X^T  X Z)^{-1} Z^T X^T  y
$$

The matrix $X$ is generally not square, so it has no inverse, and the bracket cannot be split into $X^{-1}$ and $(X^T)^{-1}$. The three factors $Z^T$, $X^T X$ and $Z$, however, are square and invertible ($X^T X$ must be invertible for the training formula to be defined at all), so the rule $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ applies:

$$
a_Z = X Z Z^{-1} (X^T X)^{-1} (Z^T)^{-1} Z^T X^T  y
$$

Multiplying a square matrix by its inverse yields the identity matrix:

$$ A A^{-1} = A^{-1} A = E $$

Therefore $Z Z^{-1} = E$ and $(Z^T)^{-1} Z^T = E$:

$$
a_Z = X E (X^T X)^{-1} E X^T  y
$$

Multiplying by the identity matrix leaves a matrix unchanged:

$$ A E = E A = A $$

This gives the final formula, which coincides with the original prediction formula:

$$
a_Z = X(X^T X )^{-1}  X^T  y = Xw = a
$$
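
The claim that the predictions do not depend on $Z$ can be checked numerically without any target vector by comparing the matrices that map $y$ to the predictions. The sketch below uses random synthetic features and the matrix $Z$ from the practical example further down; the helper name `hat_matrix` is an assumption made for illustration.

```python
import numpy as np

def hat_matrix(M):
    """Return M (M^T M)^{-1} M^T, the matrix that maps y to the predictions."""
    return M @ np.linalg.inv(M.T @ M) @ M.T

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))   # 20 objects, 4 features: X is not square

# Invertible 4 x 4 matrix (the same one as in the practical example)
Z = np.array([[1, 2, 6, 9],
              [5, 3, 7, 8],
              [7, 1, 5, 1],
              [9, 3, 2, 4]], dtype=float)

H = hat_matrix(X)         # a   = H y
H_Z = hat_matrix(X @ Z)   # a_Z = H_Z y

print(np.allclose(H, H_Z))
```

If the two matrices coincide, then $a_Z = a$ for every possible $y$, not just for one dataset.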


<span style="color:Dark Blue">
Formula for calculating the coefficients $w$:

$$
w = (X^T  X)^{-1} X^T  y
$$

Formula for calculating the coefficients $w_Z$ after multiplying the features by an invertible matrix $Z$:

$$
w_Z = (Z^T X^T  X Z)^{-1} Z^T X^T  y
$$

The bracket is a product of three square invertible matrices, $Z^T$, $X^T X$ and $Z$ (the matrix $X$ itself is not square and has no inverse), so its inverse is the product of the inverses in reverse order:

$$
w_Z = Z^{-1} (X^T X)^{-1} (Z^T)^{-1} Z^T X^T  y
$$

Since multiplying a square matrix by its inverse gives the identity matrix:

$$ A A^{-1} = A^{-1} A = E $$

the factor $(Z^T)^{-1} Z^T$ reduces to $E$:

$$
w_Z = Z^{-1} (X^T X)^{-1} E X^T  y = Z^{-1}(X^T X )^{-1}  X^T  y
$$

The remaining product $(X^T X)^{-1} X^T y$ is exactly the formula for the coefficients $w$:

$$
w_Z = Z^{-1}w
$$

This shows that multiplying by $Z$ only changes the basis of the feature space: the coefficients change from $w$ to $Z^{-1}w$, but the predictions stay the same, because $XZw_Z = XZZ^{-1}w = Xw$.
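
The relation $w_Z = Z^{-1}w$ can also be verified numerically. The following sketch uses random synthetic data and a hypothetical helper `normal_equation`; neither comes from the original notebook.

```python
import numpy as np

def normal_equation(M, target):
    """Weights from the training formula (M^T M)^{-1} M^T target."""
    return np.linalg.inv(M.T @ M) @ M.T @ target

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # synthetic features
y = rng.normal(size=30)        # synthetic target

Z = np.array([[1, 2, 6, 9],
              [5, 3, 7, 8],
              [7, 1, 5, 1],
              [9, 3, 2, 4]], dtype=float)

w = normal_equation(X, y)
w_Z = normal_equation(X @ Z, y)

# The weights differ, but they are related through Z^{-1}
print(np.allclose(w_Z, np.linalg.inv(Z) @ w))
# The predictions coincide
print(np.allclose(X @ Z @ w_Z, X @ w))
```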

#### Demonstrate this with a practical example.

In [5]:
from sklearn.linear_model import LinearRegression

In [6]:
X = df.drop('Страховые выплаты', axis=1)
y = df['Страховые выплаты']

In [7]:
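class LinearRegression:
    """Custom linear regression fitted with the normal equation.

    Note: this name shadows the sklearn class imported in the previous cell.
    """

    def fit(self, X_train, y_train):
        # Prepend a column of ones so that W[0] becomes the intercept
        X = np.concatenate((np.ones((X_train.shape[0], 1)), X_train), axis=1)
        y = y_train
        # Training formula: W = (X^T X)^{-1} X^T y
        W = np.linalg.inv(X.T@X)@X.T@y
        self.w = W[1:]   # feature weights
        self.w0 = W[0]   # intercept

    def predict(self, X_test):
        return X_test.dot(self.w) + self.w0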

In [8]:
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print(r2_score(y, predictions))

0.4249455028666801


In [9]:
# Suppose we have an invertible matrix:
Z = np.array([[1,2,6,9],
              [5,3,7,8],
              [7,1,5,1],
              [9,3,2,4]
             ])
Z

array([[1, 2, 6, 9],
       [5, 3, 7, 8],
       [7, 1, 5, 1],
       [9, 3, 2, 4]])
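
The matrix above is only assumed to be invertible. A quick check along the following lines would confirm it before the matrix is used; the tolerance is an arbitrary choice for this sketch.

```python
import numpy as np

Z = np.array([[1, 2, 6, 9],
              [5, 3, 7, 8],
              [7, 1, 5, 1],
              [9, 3, 2, 4]])

det = np.linalg.det(Z)
rank = np.linalg.matrix_rank(Z)

# Z is invertible if it has full rank, i.e. a non-zero determinant
is_invertible = bool(rank == Z.shape[0] and abs(det) > 1e-9)
print(det, rank, is_invertible)

# Sanity check: Z times its inverse should give the identity matrix
print(np.allclose(Z @ np.linalg.inv(Z), np.eye(4)))
```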

In [11]:
X_z = X@Z

In [15]:
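class LinearRegression_Z:
    """Same normal-equation fit as LinearRegression, used on the transformed features."""

    def fit(self, X_train_Z, y_train_Z):
        # The column of ones is added after the features were multiplied by Z
        X1 = np.concatenate((np.ones((X_train_Z.shape[0], 1)), X_train_Z), axis=1)
        y_z = y_train_Z
        # Training formula applied to the transformed features
        W = np.linalg.inv(X1.T@X1)@X1.T@y_z
        self.w = W[1:]   # feature weights
        self.w0 = W[0]   # intercept

    def predict(self, X_test):
        return X_test@self.w + self.w0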

In [16]:
model = LinearRegression_Z()
model.fit(X_z, y)
predictions = model.predict(X_z)
print(r2_score(y, predictions))

0.4249455028665676
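
The two R2 values agree to about 12 decimal places, and the remaining gap is consistent with floating-point rounding. In both models the column of ones is added after the features are multiplied by $Z$, which corresponds to transforming the augmented feature matrix by a block matrix with 1 in the top-left corner and $Z$ below it; that block matrix is still invertible, so the proof applies unchanged. The sketch below repeats the comparison on synthetic data (an assumption, since the insurance file is not bundled here) and compares the predictions element by element; `fit_predict` is a hypothetical helper.

```python
import numpy as np

def fit_predict(features, target):
    """Fit by the normal equation with an intercept and predict on the same data."""
    X1 = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)
    W = np.linalg.inv(X1.T @ X1) @ X1.T @ target
    return X1 @ W

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))   # synthetic stand-in for the four features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

Z = np.array([[1, 2, 6, 9],
              [5, 3, 7, 8],
              [7, 1, 5, 1],
              [9, 3, 2, 4]], dtype=float)

pred = fit_predict(X, y)          # model on the original features
pred_Z = fit_predict(X @ Z, y)    # model on the transformed features

max_diff = np.max(np.abs(pred - pred_Z))
print(max_diff, np.allclose(pred, pred_Z))
```

Comparing predictions directly is a stricter check than comparing a single metric, because equal R2 values alone do not guarantee equal predictions.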
