# Sure Tomorrow Insurance Modeling: Project Description

The Sure Tomorrow insurance company wants to solve several tasks with the help of Machine Learning and you are asked to evaluate that possibility. After loading, cleaning and performing basic exploratory data analysis, the following four tasks will be completed.

- Task 1: Find customers who are similar to a given customer. This will help the company's agents with marketing.
- Task 2: Predict whether a new customer is likely to receive an insurance benefit. Can a prediction model do better than a dummy model?
- Task 3: Predict the number of insurance benefits a new customer is likely to receive using a linear regression model.
- Task 4: Protect clients' personal data without breaking the model from the previous task. It's necessary to develop a data transformation algorithm that would make it hard to recover personal information if the data fell into the wrong hands. This is called data masking, or data obfuscation. But the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model, just prove that the algorithm works correctly.

Overall conclusions will be summarized at the end of the project.

# Data Preprocessing & Exploration

## 1 Initialization

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns

import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing

from scipy.spatial import distance

from sklearn.model_selection import train_test_split

from IPython.display import display

import math


## 2 Load Data

Load data and conduct a basic check that it's free from obvious issues.

In [4]:
df = pd.read_csv('~/Projects/TT_S11/insurance_us.csv')


In [5]:
# rename columns to make the code more consistent
df = df.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})

In [None]:
# print sample
df.sample(10)

In [None]:
# print info for data types

df.info()

In [8]:
# change age type from float to int
df['age'] = df['age'].astype('int')

In [None]:
# confirm conversion was successful
df.info()

In [None]:
# check descriptive statistics for anomalies
df.describe()

In [None]:
# Print details about insurance benefits variable given descriptive stats for that variable

df.groupby('insurance_benefits').count()

**Results and Conclusions** The variable ranges look appropriate based on the spread of data for each. For family members, the minimum is 0 and the median 1, indicating that the variable is additional family members other than the primary insured individual.

The descriptive statistics say little about the insurance benefits variable, so counts were run for each of its values. The counts show that only about 10% of members receive benefits, which may have implications for the analysis. For the time being, the counts confirm that the data looks appropriate.

## 3 Exploratory Data Analysis
Check whether there are certain groups of customers by looking at the pair plot.

In [None]:
g = sns.pairplot(df, kind='hist')
g.fig.set_size_inches(12, 12)

It is difficult to spot obvious clusters (groups) from the pairwise distributions alone. The next step is to use other techniques to find similar customers.

## Task 1. Similar Customers

Write a function that returns k nearest neighbors for an $n^{th}$ object based on a specified distance metric. The number of received insurance benefits should not be taken into account for this task. 

Test it for the four combinations of two cases:
- Scaling
  - the data is not scaled
  - the data is scaled with the [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) scaler
- Distance Metrics
  - Euclidean
  - Manhattan


In [16]:
# create list of feature names to use below

feature_names = ['gender', 'age', 'income', 'family_members']

In [17]:
# create function that returns k nearest neighbors

def get_knn(df, n, k, metric):
    
    """
    Returns k nearest neighbors

    :param df: pandas DataFrame used to find similar objects within
    :param n: object no for which the nearest neighbours are looked for
    :param k: the number of the nearest neighbours to return
    :param metric: name of distance metric
    """

    nbrs = sklearn.neighbors.NearestNeighbors(metric=metric)
    nbrs.fit(df[feature_names])
    # query with a one-row DataFrame so the feature names match those seen in fit()
    nbrs_distances, nbrs_indices = nbrs.kneighbors(df.iloc[[n]][feature_names], k, return_distance=True)
    
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]], 
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
        ], axis=1)
    
    return df_res

In [18]:
# scale the data
transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled = df.copy()
# assign whole columns so the integer columns are replaced by their float-valued scaled versions
df_scaled[feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

In [None]:
df_scaled.sample(5)

In [None]:
# identify similar records for a given one for every combination

# original data with Manhattan metric 
get_knn(df, 50, 6, distance.cityblock)

In [None]:
# original data with Euclidean metric 
get_knn(df, 50, 6, distance.euclidean)

In [None]:
# scaled data with Manhattan metric 
get_knn(df_scaled, 50, 6, distance.cityblock)

In [None]:
# scaled data with Euclidean metric 
get_knn(df_scaled, 50, 6, distance.euclidean)

**Results and Conclusions**

**Does the data being not scaled affect the kNN algorithm? If so, how does that appear?** 

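Yes. Scaling changes which observations are identified as most similar to the test object. Without scaling, features with large numeric values (here, income) dominate the distance, so the nearest neighbors are essentially the customers with the closest income. When identifying the five nearest neighbors, only one observation was the same in the scaled and unscaled results.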

**How similar are the results using the Manhattan distance metric (regardless of the scaling)?** 

Using Manhattan and Euclidean distance metrics gives very similar results. The observations identified as the nearest neighbors were nearly identical for both metrics when running for the five nearest neighbors. The distances were slightly different but not by large degrees.

## Task 2. Is a Customer Likely to Receive an Insurance Benefit?

Evaluate whether the kNN classification approach performs better than a dummy model for determining whether a customer is likely to receive an insurance benefit. The target is whether the `insurance_benefits` variable is greater than zero, which makes this a binary classification task.

For this task:
- Build a kNN-based classifier and measure its quality with the F1 metric for k=1..10 for both the original data and the scaled data. It would be interesting to see how k may influence the evaluation metric, and whether scaling the data makes any difference.
- Build the dummy model which is just random for this case. It should return "1" with some probability. Let's test the model with four probability values: 0, the probability of paying any insurance benefit, 0.5, 1.

The probability of paying any insurance benefit can be defined as

$$
P\{\text{insurance benefit received}\}=\frac{\text{number of clients who received any insurance benefit}}{\text{total number of clients}}.
$$

Split the whole data in the 70:30 proportion for the training/testing parts.


In [None]:
# calculate the target

# 1 if the customer received at least one benefit, otherwise 0
# (vectorized; avoids chained assignment inside a row-by-row loop)
df['insurance_benefits_received'] = (df['insurance_benefits'] > 0).astype(int)


In [None]:
# check target calculated accurately
print(df.sample(15))

In [30]:
# Create a scaled data set
transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled_bin = df.copy()
# assign whole columns so the integer columns are replaced by their float-valued scaled versions
df_scaled_bin[feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

In [None]:
# print to confirm data was scaled
print(df_scaled_bin.head())

In [None]:
# Create features and target data for the main df
features = df[feature_names]
target = df['insurance_benefits_received']

print(features.head())
print()
print(target.head())

In [None]:
# Create features and target data for scaled data
features_sc = df_scaled_bin[feature_names]
target_sc = df_scaled_bin['insurance_benefits_received']

print(features_sc.head())
print()
print(target_sc.head())

In [None]:
# create training and test data subsets for main df

features_train, features_test, target_train, target_test = train_test_split(features,
                                                                   target,
                                                                   test_size=0.3,
                                                                   random_state=12345)

print("Training data:", len(features_train)/len(df), "Test data:", len(features_test)/len(df))

In [None]:
# Create training and test subsets for scaled data

features_train_sc, features_test_sc, target_train_sc, target_test_sc = train_test_split(features_sc,
                                                                   target_sc,
                                                                   test_size=0.3,
                                                                   random_state=12345)

print("Training data:", len(features_train_sc)/len(df_scaled_bin), "Test data:", len(features_test_sc)/len(df_scaled_bin))


In [None]:
# check for the class imbalance with value_counts() for main df data

print("Counts")
print("Full data set:", target.value_counts())
print()
print("Proportion")
print("Full data set:", target.value_counts(normalize=True))
print()
print("Training data set:", target_train.value_counts(normalize=True))
print()
print("Test data set:", target_test.value_counts(normalize=True))

In [None]:
# check for the class imbalance for scaled data

print("Counts")
print("Full data set:", target_sc.value_counts())
print()
print("Proportion")
print("Full data set:", target_sc.value_counts(normalize=True))
print()
print("Training data set:", target_train_sc.value_counts(normalize=True))
print()
print("Test data set:", target_test_sc.value_counts(normalize=True))


In [38]:
# create function to run F1 score and confusion matrix

def eval_classifier(y_true, y_pred):
    
    f1_score = sklearn.metrics.f1_score(y_true, y_pred)
    print(f'F1: {f1_score:.2f}')
    
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, normalize='all')
    print('Confusion Matrix')
    print(cm)

In [None]:
# Use a kNN classifier to predict target on test data for k=1...10

for i in range(1, 11):
    print("Integer", i)
    neigh=sklearn.neighbors.KNeighborsClassifier(n_neighbors=i, metric='euclidean')
    neigh.fit(features_train, target_train)
    eval_classifier(target_test, neigh.predict(features_test))
    print()


In [None]:
# Use a kNN classifier to predict target on scaled test data for k=1...10

for i in range(1, 11):
    print("Integer", i)
    neigh=sklearn.neighbors.KNeighborsClassifier(n_neighbors=i, metric='euclidean')
    neigh.fit(features_train_sc, target_train_sc)
    eval_classifier(target_test_sc, neigh.predict(features_test_sc))
    print()


In [41]:
# generating output of a random model

def rnd_model_predict(P, size, seed=42):

    rng = np.random.default_rng(seed=seed)
    return rng.binomial(n=1, p=P, size=size)

In [None]:
# run probability and F1 score for random model

for P in [0, df['insurance_benefits_received'].sum() / len(df), 0.5, 1]:

    print(f'The probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, len(df))
        
    eval_classifier(df['insurance_benefits_received'], y_pred_rnd)
    
    print()

**Results and Conclusions**
The scaled data, k=1, produces the best results with F1 score of 0.97, which is a very strong result. Overall, the scaled data produces stronger results (F1 score range of 0.97-0.88 for k=1-10) than the original data (F1 score 0.61-0.00 for k=1-10). However, both data sets produced stronger results than the random model, where the F1 score did not surpass 0.20.
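
The low ceiling for the random model follows from the definition of F1. As a rough sketch, assume a model that predicts 1 with probability `p_predict` independently of the true label, and a share `prevalence` of customers with any benefit. The value 0.10 below is an assumption based on the "about 10%" observed earlier, and `expected_f1` is a hypothetical helper that works with expected shares rather than sampled predictions.

```python
def expected_f1(prevalence, p_predict):
    """Approximate F1 of a model that predicts 1 at random with probability p_predict."""
    tp = prevalence * p_predict            # expected share of true positives
    fp = (1 - prevalence) * p_predict      # expected share of false positives
    fn = prevalence * (1 - p_predict)      # expected share of false negatives
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

prevalence = 0.10  # assumed share of customers with any benefit (illustrative)

for p_predict in [0, prevalence, 0.5, 1]:
    print(f"P = {p_predict:.2f} -> expected F1 ~ {expected_f1(prevalence, p_predict):.2f}")
```

Under these assumptions the expression simplifies to 2 × prevalence × P / (prevalence + P), which is largest at P = 1 and still stays below 0.20 when only about one customer in ten receives a benefit.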

## Task 3. Regression (with Linear Regression)

With `insurance_benefits` as the target, evaluate what RMSE would be for a Linear Regression model.

Build an implementation of LR. Check RMSE for both the original data and the scaled one. 

The following will guide the task:
- $X$ — feature matrix, each row is a case, each column is a feature, the first column consists of unities
- $y$ — target (a vector)
- $\hat{y}$ — estimated target (a vector)
- $w$ — weight vector

The task of linear regression in the language of matrices can be formulated as

$$
y = Xw
$$

The training objective is to find such $w$ that it would minimize the L2-distance (MSE) between $Xw$ and $y$:

$$
\min_w d_2(Xw, y) \quad \text{or} \quad \min_w \text{MSE}(Xw, y)
$$

The analytical solution for the above:

$$
w = (X^T X)^{-1} X^T y
$$

The formula above can be used to find the weights $w$ and the latter can be used to calculate predicted values:

$$
\hat{y} = X_{val}w
$$
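
The closed-form solution above can be sketched with NumPy on synthetic data (the data and the names `X_demo`, `true_w`, `w_solve`, and `w_lstsq` are made up for illustration). Instead of forming the inverse explicitly, this sketch solves the linear system $(X^TX)w = X^Ty$, which gives the same weights and is usually better behaved numerically; `np.linalg.lstsq` is shown as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 rows, 3 features, known weights
X_demo = rng.normal(size=(200, 3))
true_w = np.array([0.5, 2.0, -1.0, 0.3])               # intercept first, then feature weights
X1 = np.hstack([np.ones((len(X_demo), 1)), X_demo])    # prepend the column of ones
y_demo = X1 @ true_w + rng.normal(scale=0.01, size=len(X_demo))

# Normal equation, solved as a linear system rather than via an explicit inverse
w_solve = np.linalg.solve(X1.T @ X1, X1.T @ y_demo)

# Least-squares solver as a cross-check
w_lstsq, *_ = np.linalg.lstsq(X1, y_demo, rcond=None)

print(np.round(w_solve, 2))
print(np.allclose(w_solve, w_lstsq))

# Predictions for new rows use the same column of ones
X_val = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 3))])
y_hat = X_val @ w_solve
```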

In [43]:
# create LR class

class MyLinearRegression:
    
    def __init__(self):
        
        self.weights = None
    
    def fit(self, X, y):
        
        # adding the unities
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        self.weights = np.linalg.inv(np.dot(X2.T, X2)).dot(X2.T).dot(y)

    def predict(self, X):
        
        # adding the unities
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        y_pred = X2.dot(self.weights)
        
        return y_pred

In [44]:
# create RMSE function

def eval_regressor(y_true, y_pred):
    
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    
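    # note: this prints the square root of the R2 score rather than R2 itself,
    # and math.sqrt raises ValueError if the score is negative
    r2_score = math.sqrt(sklearn.metrics.r2_score(y_true, y_pred))
    print(f'R2: {r2_score:.2f}')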

In [None]:
# run LR on original data and evaluate with RMSE

X = df[['age', 'gender', 'income', 'family_members']].to_numpy()
y = df['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)

In [None]:
# run LR and evaluate RMSE with scaled data

X_sc = df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
y_sc = df_scaled['insurance_benefits'].to_numpy()

X_sc_train, X_sc_test, y_sc_train, y_sc_test = train_test_split(X_sc, y_sc, test_size=0.3, random_state=12345)

lr_sc = MyLinearRegression()

lr_sc.fit(X_sc_train, y_sc_train)
print(lr_sc.weights)

y_test_pred_sc = lr_sc.predict(X_sc_test)
eval_regressor(y_sc_test, y_test_pred_sc)

**Results and Conclusions**
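The RMSE is the same for the original and scaled data at 0.34, and the reported R2 value is 0.66 for both (note that `eval_regressor` prints the square root of the R2 score). Identical results are expected: MaxAbsScaler divides each column by a constant, which linear regression absorbs into the weights. Given that insurance benefits range from 0 to 5, an RMSE of 0.34 is a modest result, suggesting the model is a modest fit for both scaled and unscaled data.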
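
For reference, the two metrics can be written out by hand. The sketch below uses made-up targets and predictions, and the helper names `rmse` and `r2` are assumptions for this example. RMSE is expressed in the units of the target (number of benefits), while R2 is the share of the target's variance explained by the model.

```python
import math

def rmse(y_true, y_pred):
    # square root of the mean squared residual
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean)
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true_demo = [0, 0, 1, 0, 2, 0]
y_pred_demo = [0.1, -0.1, 0.8, 0.2, 1.5, 0.0]

print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.2f}")
print(f"R2:   {r2(y_true_demo, y_pred_demo):.2f}")
```

A model that always predicts the mean of the target has an R2 of 0, so R2 measures improvement over that baseline.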

## Task 4. Obfuscating Data

Obfuscate data by multiplying the numerical features (matrix $X$) by an invertible matrix $P$. 

$$
X' = X \times P
$$


In [47]:
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = df[personal_info_column_list]


In [48]:
X = df_pn.to_numpy()

In [49]:
# generate random matrix P
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

In [50]:
# create function to check invertibility by checking the determinant is not equal to 0

def inv_P(matrix):
    if matrix.shape[0] != matrix.shape[1]:
        return "not square"
    det = np.linalg.det(matrix)
    return "invertible" if det != 0 else "not invertible"

In [None]:
# run function for P to confirm matrix P is invertible

print(inv_P(P))
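
Comparing a floating-point determinant with exactly zero can be fragile, because a nearly singular matrix may still have a tiny non-zero determinant. A possible alternative, sketched here under the assumption that a rank check plus a condition-number threshold is acceptable, is shown below. The function name `is_safely_invertible` and the threshold `max_cond` are hypothetical choices for this example.

```python
import numpy as np

def is_safely_invertible(matrix, max_cond=1e10):
    """Return True if matrix is square, full rank, and not badly conditioned.

    max_cond is an arbitrary threshold chosen for this sketch.
    """
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    if np.linalg.matrix_rank(matrix) < matrix.shape[0]:
        return False
    return bool(np.linalg.cond(matrix) < max_cond)

rng_demo = np.random.default_rng(seed=42)
P_example = rng_demo.random(size=(4, 4))
singular = np.array([[1.0, 2.0], [2.0, 4.0]])   # second row is twice the first

print(is_safely_invertible(P_example))
print(is_safely_invertible(singular))
```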

In [52]:
# create transformed data matrix
X_transformed = np.dot(X, P)

In [None]:
# print original and transformed matrices to check if ages or income apparent
print(X[:5])

print(X_transformed[:5])


After the transformation, ages and incomes are no longer decipherable. Next, check whether the original data can be recovered from $X'$ if $P$ is known.

In [None]:
# to recover the original data multiply the transformed data by the inverted P matrix

# first create the inverted matrix for P
P_inverse = np.linalg.inv(P)

print(P_inverse)

In [55]:
# multiply the inverted P matrix with transformed data using the dot function
X_recovered = np.dot(X_transformed, P_inverse)

In [None]:
# check if recovered and original matrix are the same
print("Original data recovered accurately:", np.allclose(X, X_recovered))

In [None]:
# print original data, transformed data, and reversed (recovered) data
# print sample from each data set

print()
print("Original X")
print(X[:5])
print()
print("Transformed X")
print(X_transformed[:5])
print()
print("Recovered X")
print(X_recovered[:5])

**Results and Conclusions**

The recovered data matches the original up to very small differences. Recovery multiplies the transformed data by the inverse of $P$; both are stored as floating-point numbers with finite precision, so rounding errors produce slightly different recovered values.

In this section, a random, invertible matrix, P, was created and multiplied with the features data to create a new, obfuscated data set where data such as age and income was unrecognizable (e.g., if there was a data security breach). This data was then retransformed using an inverted version of the P matrix; this was done to determine if the original data could be accurately recovered, which would be required in an actual business application. 
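
The round trip can be reproduced on a few made-up rows. In this sketch, `X_demo` and `P_demo` are illustrative, and adding a multiple of the identity to the random matrix is an assumption made here only to guarantee that the example matrix is invertible (it becomes strictly diagonally dominant).

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

# Made-up rows: [gender, age, income, family_members]
X_demo = np.array([
    [1, 41, 49600.0, 1],
    [0, 46, 38000.0, 1],
    [0, 29, 21000.0, 0],
])

# Random entries in [0, 1) plus 4 on the diagonal: strictly diagonally dominant, hence invertible
P_demo = rng_demo.random(size=(4, 4)) + 4 * np.eye(4)

X_masked = X_demo @ P_demo                    # obfuscate
X_back = X_masked @ np.linalg.inv(P_demo)     # recover

print(np.abs(X_back - X_demo).max())          # size of the floating-point rounding error
print(np.allclose(X_back, X_demo))            # equal within tolerance
```

Comparing with `np.allclose` rather than `==` is the usual way to account for such rounding differences.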


## Proof that Data Obfuscation Can Work with LR

Analytical proof that the above obfuscation method does not affect linear regression predicted values.

For this proof, I will first solve for the coefficient vector $w_P$ using matrix properties. I will then substitute the simplified result into the linear regression prediction formula $a = Xw$.

We can use the following properties of matrices to do this, which will be referenced in the proof.

<table>
<tr>
<td>Distributivity</td><td>$A(B+C)=AB+AC$</td>
</tr>
<tr>
<td>Non-commutativity</td><td>$AB \neq BA$</td>
</tr>
<tr>
<td>Associative property of multiplication</td><td>$(AB)C = A(BC)$</td>
</tr>
<tr>
<td>Multiplicative identity property</td><td>$IA = AI = A$</td>
</tr>
<tr>
<td>Inverse property</td><td>$A^{-1}A = AA^{-1} = I$</td>
</tr>
<tr>
<td>Inverse of a product (for invertible $A$ and $B$)</td><td>$(AB)^{-1} = B^{-1}A^{-1}$</td>
</tr>
<tr>
<td>Reversivity of the transpose of a product of matrices</td><td>$(AB)^T = B^TA^T$</td>
</tr>
</table>


**4.1 Solving for $w_P$**

We solve for the coefficient vector $w_P$ of the obfuscated data using matrix properties. This shows how $w$ and $w_P$ are linked. Start from the least squares formula with $XP$ in place of $X$:

$w_P = [(XP)^T XP]^{-1} (XP)^T y$

Separate the transposes using the reversivity of the transpose of a product, $(AB)^T = B^TA^T$:

$w_P = [P^TX^TXP]^{-1}P^TX^Ty$

Group the bracket as a product of three square matrices, $P^T$, $X^TX$ and $P$, using the associative property. $X$ itself is generally not square and has no inverse, so $X^TX$ must be kept together; it is assumed invertible, as the formula for $w$ already requires:

$w_P = [P^T(X^TX)P]^{-1}P^TX^Ty$

Apply the inverse of a product, $(AB)^{-1} = B^{-1}A^{-1}$, twice:

$w_P = P^{-1}(X^TX)^{-1}(P^T)^{-1}P^TX^Ty$

Replace $(P^T)^{-1}P^T$ with the identity matrix:

$w_P = P^{-1}(X^TX)^{-1}IX^Ty$

Based on the multiplicative identity property, $IA = AI = A$, we can remove the identity matrix:

$w_P = P^{-1}(X^TX)^{-1}X^Ty$

Replace the ordinary least squares formula $(X^TX)^{-1}X^Ty$ with $w$:

$w_P = P^{-1}w$


**4.2 Consider the effect of obfuscation on the linear regression model**

We next look at the predicted values $a'$ for the obfuscated data.

The linear regression model (predicted values) for the original data is $a = Xw$.

With obfuscation, the features become $XP$ and the weights become $w_P$: $a' = XPw_P$

Replace $w_P$ with the value from above:

$a' = XPP^{-1}w$

Replace $PP^{-1}$ with the identity matrix and then remove it:

$a' = XIw$

$a' = Xw$

Replace $Xw$ with $a$ from the linear regression model and we get:

$a' = a$

This shows that we get the same predictions using the original and obfuscated data.

**Analytical Proof**
Implications for the quality of the linear regression when measured by RMSE are as follows. Based on the logic and algebra above, the coefficient vector for the obfuscated data (the feature matrix multiplied by an invertible matrix $P$) equals the original coefficient vector multiplied by the inverse of $P$, i.e., $P^{-1}w$. By extension, the linear regression predictions for the original data and the obfuscated data will be the same, i.e., $a = a'$. This means that the RMSE will also be the same for the original and obfuscated data.
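
The two identities can also be checked numerically. The sketch below uses synthetic data and fits the model without an intercept column, matching the algebra of the proof; `X_demo`, `y_demo`, `P_demo`, and `ols_weights` are names made up for this illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=1)

# Synthetic features and target (illustrative only)
X_demo = rng_demo.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng_demo.normal(scale=0.1, size=100)

# Example invertible matrix: strictly diagonally dominant by construction
P_demo = rng_demo.random(size=(4, 4)) + 4 * np.eye(4)

def ols_weights(X, y):
    # w = (X^T X)^{-1} X^T y, solved as a linear system
    return np.linalg.solve(X.T @ X, X.T @ y)

w = ols_weights(X_demo, y_demo)
w_P = ols_weights(X_demo @ P_demo, y_demo)

print(np.allclose(w_P, np.linalg.inv(P_demo) @ w))         # checks w_P = P^{-1} w
print(np.allclose(X_demo @ w, (X_demo @ P_demo) @ w_P))    # checks a' = a
```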

## 5 Test Linear Regression with Data Obfuscation

Build a procedure or a class that runs Linear Regression optionally with the obfuscation. Run Linear Regression for the original data and the obfuscated one, compare the predicted values and the RMSE, $R^2$ metric values. 

**Procedure**

- Create a square matrix $P$ of random numbers.
- Check that it is invertible. If not, repeat the first point until we get an invertible matrix.
- Create features and target from the scaled data used earlier in the project. Scaled data is used because it performed better for kNN classification; for linear regression it performed the same as the original non-scaled data.
- Obfuscate features data and save as $XP$ matrix. 
- Create training and test data for scaled data, both regular scaled and obfuscated scaled data sets.
- Run linear regression model on regular scaled data and obfuscated scaled data. Run accompanying RMSE and R2 for data and compare between obfuscated and unobfuscated data.

In [58]:
# create square matrix P

rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

In [None]:
# print sample to confirm created and check for invertibility
print(P[:5])

print(inv_P(P))

In [60]:
# create a scaled data set

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled_lr = df.copy()
# assign whole columns so the integer columns are replaced by their float-valued scaled versions
df_scaled_lr[feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

# create features and target from scaled data

features_sc = df_scaled_lr[feature_names]
target_sc = df_scaled_lr['insurance_benefits']


In [61]:
# create obfuscated data for features, XP

XP = np.dot(features_sc, P)


In [62]:
# create training and test data for orig scaled data

X_sc_train, X_sc_test, y_sc_train, y_sc_test = train_test_split(features_sc, 
                                                                target_sc, 
                                                                test_size=0.3, 
                                                                random_state=12345)


In [63]:
# create training and test data for obf data - target is the same data as orig scaled data

XP_train, XP_test, y_obf_train, y_obf_test = train_test_split(XP, 
                                                                target_sc, 
                                                                test_size=0.3, 
                                                                random_state=12345)


In [None]:
# run model on orig scaled data with eval metrics

lr_sc = MyLinearRegression()

lr_sc.fit(X_sc_train, y_sc_train)
print(lr_sc.weights)

y_test_pred_sc = lr_sc.predict(X_sc_test)
eval_regressor(y_sc_test, y_test_pred_sc)

In [None]:
# run model on obf data with eval metrics

lr_obf = MyLinearRegression()

lr_obf.fit(XP_train, y_obf_train)
print(lr_obf.weights)

y_test_pred_obf = lr_obf.predict(XP_test)
eval_regressor(y_obf_test, y_test_pred_obf)

**Results and Conclusions**
In this section, a Linear Regression model was tested with scaled data, comparing the results with the scaled data and with scaled data in which identifiable variables (i.e., features) were obfuscated. The RMSE and R2 results were the same for both data sets, demonstrating that obfuscated data can be used in place of unobfuscated data without compromising results.

# Conclusions

In this project, the Sure Tomorrow insurance company wanted to look at customer data to determine the likelihood that customers would receive an insurance benefit. To do this, customers were compared to find similar customers, and classification and regression models were developed to predict whether a given customer would likely receive an insurance benefit and how many benefits they would receive. Given the sensitivity of customer insurance data, customer data was obfuscated to better protect customer information, and model results were compared with those for unobfuscated data to ensure continued model accuracy. The independent variables in this data set were gender, age, income level, and family size (number of additional family members other than the primary beneficiary). The dependent variable was the number of insurance benefits received.

The following was completed:
- Libraries and data loaded, and data reviewed to ensure data types appropriate and no major data anomalies. Data types adjusted as appropriate.
- Basic exploratory data analysis completed (graphs) to determine any obvious patterns.
- A k nearest neighbors (kNN) algorithm was used to identify similar data points within the data set, i.e., for any given customer, $k $ customers most similar to the originally identified customer. 
- The kNN algorithm was tested to compare Euclidean and Manhattan distance metrics and the effect of scaling customer data (independent variables).
- A k nearest neighbors classification algorithm was tested to predict whether customers would receive insurance benefits or not. For this model, the target was whether or not a customer received any benefits (not the number of benefits received). Original and scaled data were tested and compared to determine whether scaled data resulted in a more accurate model. The number of neighbors ($k$) was varied from 1 to 10 to determine the best fit. The kNN classification model was also compared to a dummy, random model.
- A linear regression model was tested with original and scaled data to predict the number of insurance benefits received by a given customer. To do this, a class was developed that used a features matrix with a unities column, then finding the weight vector that best results in predicted number of benefits received. Predicted results and actual results were evaluated using RMSE and R2.
- Data was masked by multiplying features data with an invertible matrix, P. The obfuscated data was then recovered by an inverted P matrix to confirm it was recoverable and comparable to the original data.
- An analytical proof was developed to demonstrate that obfuscated data resulted in the same results when using a linear regression model.
- Finally, a linear regression model was used with the obfuscated data and results compared to unobfuscated data. This step complements the analytical proof above and tests practically whether obfuscated data results in differences in the linear regression model.

The following conclusions emerged:
- Scaling data resulted in meaningful differences in identifying similar data points, as well as in more accurate data modeling with a classification model. Scaled data performed the same when using linear regression based on RMSE and R2 evaluation metrics.
- Using Euclidean and Manhattan distance metrics resulted in only small differences in distances when identifying similar data points.
- The kNN classification model had strong results with scaled data, with the best results for k=1 (F1 score = 0.97). The kNN model performed better overall than the random model for which the highest F1 score was 0.20 based on performance testing.
- The linear regression model had modest results (RMSE = 0.34, R2 = 0.66), with no differences in evaluation metrics results for scaled vs. unscaled data.
- Obfuscated data masked identifiable data, such as income level or age, and was recoverable using an inverted version of the P matrix that was initially used to transform the data. Recovered data was slightly different than original data, but not significantly (i.e., differences were within tolerance values). 
- Using linear algebra and matrix properties, it was proved that obfuscated data will not affect linear regression model accuracy as compared to unobfuscated data.
- The final linear regression model testing further demonstrated that obfuscating data does not hinder the accuracy of the model. RMSE and R2 results were the same for both unobfuscated and obfuscated data (RMSE = 0.34, R2 = 0.66).

Based on the full project, a kNN classification model can accurately predict whether a given insurance customer will receive any insurance benefits, particularly when using scaled features (independent variables). Furthermore, a linear regression model can predict the number of benefits a customer will receive from the independent variables (gender, age, income, number of family members), though with only modest accuracy (RMSE = 0.34). Using obfuscated data protects clients' data without compromising model results.