**Main objective of the analysis that specifies whether your model will be focused on prediction or interpretation.**

The model is focused on prediction: several regression models are trained on the dataset and the best one is chosen based on predictive accuracy, measured by explained variance.

**Brief description of the data set you chose and a summary of its attributes**

The data contains medical information and costs billed by health insurance companies. It contains 1338 rows of data and the following columns: age, gender, BMI, children, smoker, region, insurance charges.

| S No. | Column | Description| Data Type | Category|
| --- | --- | --- | --- | --- |
|1 | Age | age of primary beneficiary | Int | Discrete |
|2 | Sex | insurance contractor gender, female, male | String | Nominal |
|3 | BMI | Body mass index, indicating whether body weight is relatively high or low relative to height | Float | Continuous |
|4 |Children | Number of children covered by health insurance / Number of dependents | Int | Discrete |
|5 | Smoker | Smoking status of contractor, yes, no | String | Nominal |
|6 | Region | the beneficiary's residential area in the US, northeast, southeast, southwest, northwest. | String | Nominal |
|7 | Charges | Individual medical costs billed by health insurance| Float | Continuous |
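
As a quick sanity check on the table above, the following sketch compares a DataFrame against the documented types. The lowercase column names and the two sample rows are assumptions made for illustration (they are not taken from the dataset), and `check_schema` is a hypothetical helper, not part of pandas.

```python
import pandas as pd
from pandas.api import types as ptypes

# Expected type family per column, following the attribute table.
# Lowercase names are an assumption about the raw CSV header.
EXPECTED = {
    'age': 'int', 'sex': 'string', 'bmi': 'float', 'children': 'int',
    'smoker': 'string', 'region': 'string', 'charges': 'float',
}

CHECKS = {
    'int': ptypes.is_integer_dtype,
    'float': ptypes.is_float_dtype,
    'string': ptypes.is_string_dtype,
}

def check_schema(frame, expected=EXPECTED):
    """Return a list of readable problems; an empty list means the frame matches."""
    problems = []
    for column, kind in expected.items():
        if column not in frame.columns:
            problems.append(f'missing column: {column}')
        elif not CHECKS[kind](frame[column]):
            problems.append(f'{column}: expected {kind}, got {frame[column].dtype}')
    return problems

# Two made-up rows, only to exercise the helper.
sample = pd.DataFrame({
    'age': [30, 45], 'sex': ['female', 'male'], 'bmi': [22.5, 31.0],
    'children': [0, 2], 'smoker': ['no', 'yes'],
    'region': ['northeast', 'southwest'], 'charges': [1000.0, 20000.0],
})
print(check_schema(sample))
```

Calling `check_schema(df)` right after loading, before any column renaming, would flag a column that was read with an unexpected type.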

**Plan for Data Exploration, Feature Engineering and Modelling**

The steps in solving the Regression Problem are as follows:
1. Packages to be installed
2. Load the libraries
3. Load the dataset
4. General information about the dataset
5. Exploratory Data Analysis (EDA)
6. Modeling
7. Recommendations

## Packages to be installed

1. tpot
2. auto-sklearn
3. scipy

In [None]:
!conda install -c anaconda swig dask[distributed] --yes
!pip install deap update_checker tqdm stopit xgboost
!pip install tpot
!pip install auto-sklearn
!pip install 'ray[default]'
!pip install scikit-optimize
!pip install scipy==1.7.0
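
Because the commands above install several packages and pin `scipy`, it can help to confirm what actually ended up in the environment. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper, and the package list is just a subset of the names from the install commands.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names as they appear in the install commands.
for name, version in installed_versions(['scipy', 'tpot', 'auto-sklearn', 'xgboost']).items():
    print(f'{name}: {version or "not installed"}')
```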

## Load the libraries

1. numpy
2. pandas
3. matplotlib
4. seaborn
5. sklearn
6. autosklearn
7. tpot

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, PolynomialFeatures
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, RepeatedKFold, cross_val_score, GridSearchCV, cross_val_predict
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, ElasticNet, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import explained_variance_score
import warnings
from scipy.linalg import LinAlgWarning
from sklearn.exceptions import ConvergenceWarning
import autosklearn.regression
from tpot import TPOTRegressor

## Load the dataset

location of dataset

In [None]:
dataset = 'https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv'

reading the dataset into dataframe

In [None]:
df = pd.read_csv(dataset)

## General information about the dataset

sampling the data

In [None]:
print(df.head())

number of rows and columns in dataset

In [None]:
print(df.shape)

dataset information

In [None]:
# info() prints its report itself and returns None, so no print() is needed
df.info()

**Actions taken for data cleaning and feature engineering**

Capitalize column names

In [None]:
df.columns = df.columns.str.capitalize()

Classifying columns as Numerical or Categorical

In [None]:
num_cols = df.select_dtypes('number').columns.tolist()
cat_cols = df.select_dtypes('object').columns.tolist()

Features Encoding

In [None]:
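oe = OrdinalEncoder()
sc = StandardScaler()
# copy the feature columns so the assignments below do not write into a view of df
df2 = df.iloc[:, :-1].copy()
# numerical features are all numerical columns except the target (Charges)
num_features = num_cols[:-1]
df2[cat_cols] = oe.fit_transform(df2[cat_cols])
df2[num_features] = sc.fit_transform(df2[num_features])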
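
The cell above fits the encoder and the scaler on the whole dataset, before the train/test split, so statistics from the test rows influence the transformation. One possible alternative, sketched here on a small made-up frame, is to wrap the same two transformers in a `ColumnTransformer` inside a `Pipeline`, so that they are fitted on training rows only. The toy rows, the column lists, and the choice of `LinearRegression` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up rows, only to make the sketch runnable.
toy = pd.DataFrame({
    'Age': [19, 33, 45, 52, 28, 61, 37, 24],
    'Bmi': [27.9, 22.7, 30.1, 35.2, 24.3, 29.8, 26.5, 31.4],
    'Smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no'],
    'Charges': [16000.0, 4000.0, 8000.0, 40000.0, 3500.0, 45000.0, 6000.0, 3000.0],
})
toy_num, toy_cat = ['Age', 'Bmi'], ['Smoker']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), toy_num),
    ('cat', OrdinalEncoder(), toy_cat),
])
toy_model = Pipeline(steps=[('preprocess', preprocess), ('regressor', LinearRegression())])

# Fit on the first six rows only; the last two play the role of a test set.
train, test = toy.iloc[:6], toy.iloc[6:]
toy_model.fit(train[toy_num + toy_cat], train['Charges'])
predictions = toy_model.predict(test[toy_num + toy_cat])
print(predictions.shape)
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, which then refit the preprocessing inside every fold.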

Split the data into train and test sets

In [None]:
features = df2.columns
target = df.columns[-1]
X = df2
Y = df[target]
# a single shuffled 80/20 split; random_state makes it reproducible
ss = ShuffleSplit(n_splits=1, test_size=.2, random_state=0)
train_index, test_index = next(ss.split(X, y=Y))
# split() yields positional indices, so select rows with iloc
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
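
A single shuffled split like the one above can also be written with `train_test_split`. The sketch below uses stand-in arrays with the dataset's row count (1338) and six feature columns; the array contents are placeholders, not the real data. With `test_size=0.2` it yields 1070 training rows and 268 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays: 1338 rows, six feature columns, one target value per row.
X_demo = np.arange(1338 * 6).reshape(1338, 6)
y_demo = np.arange(1338)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
```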

## Exploratory Data Analysis (EDA)

Summary Statistics for Numerical columns

In [None]:
print(df.describe())

Summary Statistics for Categorical columns

In [None]:
print(df[cat_cols].describe().T)

Visual Exploration of Numerical Columns

distribution of Age for candidates

In [None]:
df['Age'].hist(bins=5);

correlation between Age and Charges

In [None]:
x = df['Age']
y = df['Charges']
plt.scatter(x,y)
plt.xlabel('Age')
plt.ylabel('Charges');

Visual Exploration of Categorical columns

number of candidates in each region

In [None]:
sns.countplot(data=df, x='Region',order=df['Region'].value_counts().index);

distribution of charges for each region

In [None]:
sns.boxplot(y=df['Region'], x=df['Charges'],order=df['Region'].value_counts().index);

Pair plot of numerical features

In [None]:
sns.pairplot(df[num_cols], plot_kws=dict(alpha=.1, edgecolor='none'));

heatmap of numerical features

In [None]:
sns.heatmap(df[num_cols]);

correlation plot of numerical features

In [None]:
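# restrict to numerical columns: DataFrame.corr() raises on string columns in pandas >= 2.0
corr = df[num_cols].corr()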
mask = np.triu(corr)
sns.heatmap(corr, mask=mask, cmap='Wistia', center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5},  annot= True);
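
`np.triu(corr)` returns the upper-triangle correlation values, not booleans, so it works as a mask only where those values are non-zero. A common alternative is an explicit boolean mask that depends on position alone. The sketch below shows the difference on a made-up 3×3 matrix (the numbers are illustrative, not results from this dataset).

```python
import numpy as np

# Made-up symmetric matrix standing in for corr.
corr_demo = np.array([
    [1.0, 0.1, 0.3],
    [0.1, 1.0, 0.0],
    [0.3, 0.0, 1.0],
])

value_mask = np.triu(corr_demo).astype(bool)               # depends on the values
bool_mask = np.triu(np.ones_like(corr_demo, dtype=bool))   # depends only on position

print(value_mask)
print(bool_mask)
```

The cell holding an exact zero above the diagonal is hidden only by `bool_mask`; passing such a mask as `mask=` to `sns.heatmap` hides the whole upper triangle regardless of the values.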

Feature Importance

In [None]:
fe = ExtraTreesRegressor(n_estimators=10)
fe.fit(X, Y)
fedf = pd.DataFrame({'Feature':features,'Feature_importance %' : fe.feature_importances_ * 100})
fedf = fedf.sort_values(by=['Feature_importance %'], ascending=False)
print(fedf)
fedf.plot.bar(x='Feature',y='Feature_importance %');

## Modeling

**Summary of training at least three linear regression models which should be variations that cover using a simple linear regression as a baseline, adding polynomial effects, and using a regularization regression. Preferably, all use the same training and test splits, or the same cross-validation method.**

Regression models trained on the training set, followed by their results:
1. LinearRegression
2. Polynomial
3. Regularization(ElasticNet)
4. SGD
5. Decision Tree 
6. Random Forest
7. KNN

In [None]:
#training the models
warnings.filterwarnings('ignore', category=LinAlgWarning)
warnings.filterwarnings('ignore', category=ConvergenceWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

models = []
models.append(('LR', LinearRegression()))
models.append(('Polynomial', LinearRegression()))
models.append(('Regularization', ElasticNet()))
models.append(('SGD', SGDRegressor()))
models.append(('DT', DecisionTreeRegressor()))
models.append(('RF', RandomForestRegressor()))
models.append(('KNN', KNeighborsRegressor()))
results = []
names = []
for name, model in models:
    if name == 'Polynomial':
        evss = []
        degrees = np.arange(1, 10)
        # track the degree with the highest explained variance on the held-out split
        best_evs, best_deg = -np.inf, 1
        for deg in degrees:
            poly_features = PolynomialFeatures(degree=deg, include_bias=False)
            x_poly_train = poly_features.fit_transform(X_train)
            poly_reg = model
            poly_reg.fit(x_poly_train, y_train)
            # reuse the transformer fitted on the training data
            x_poly_test = poly_features.transform(X_test)
            poly_predict = poly_reg.predict(x_poly_test)
            poly_evs = explained_variance_score(y_test, poly_predict)
            evss.append(poly_evs)
            if poly_evs > best_evs:
                best_evs = poly_evs
                best_deg = deg

        poly_features = PolynomialFeatures(degree=best_deg, include_bias=False)
        x_poly_train = poly_features.fit_transform(X_train)
        pipeline_p = Pipeline(steps=[('regressor', model)])
        rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # score the polynomial pipeline itself, not the one left over from the 'LR' iteration
        cv_results_p = cross_val_score(pipeline_p, x_poly_train, y_train, cv=rkfold, scoring='explained_variance')
        results.append(cv_results_p.mean())
        names.append(name)
    elif name == 'Regularization':
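        # 'normalize' is left out of the grid: it was removed from ElasticNet in scikit-learn 1.2,
        # and the numerical features are already standardized
        params = {'selection':['cyclic', 'random'],
                  'l1_ratio':np.arange(0, 1, 0.01), 'alpha':np.logspace(-4, 0, 100)}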
        rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        search_r = GridSearchCV(model, params, scoring='explained_variance', cv=rkfold, n_jobs=-1)
        search_r.fit(X_train, y_train)
        results.append(search_r.best_score_)
        names.append(name)
    elif name == 'KNN':
        weights = ['uniform', 'distance']
        algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
        leaf_size = list(range(1,50))
        n_neighbors = list(range(1,30))
        p=[1,2]
        hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p, algorithm=algorithm, weights=weights)
        rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        search_k = GridSearchCV(model, hyperparameters, scoring='explained_variance', cv=rkfold, n_jobs=-1)
        search_k.fit(X_train, y_train)
        results.append(search_k.best_score_)
        names.append(name)
        #dfgrid = pd.DataFrame(search_k.cv_results_)
    else:
        pipeline = Pipeline(steps=[('regressor',model)])
        rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        cv_results = cross_val_score(pipeline, X_train, y_train, cv=rkfold , scoring='explained_variance')
        results.append(cv_results.mean())
        names.append(name)
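
The polynomial branch above picks the degree by scoring on the test split. A possible alternative is to treat the degree as a hyperparameter and let `GridSearchCV` choose it by cross-validation on training data only. The sketch below uses synthetic quadratic data and a reduced degree range, both assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y depends quadratically on x, plus a little noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 2.0 * x_demo[:, 0] ** 2 - x_demo[:, 0] + rng.normal(0, 0.5, size=200)

poly_pipeline = Pipeline(steps=[
    ('poly', PolynomialFeatures(include_bias=False)),
    ('regressor', LinearRegression()),
])
# Every pipeline step is refitted on the training part of each fold.
degree_search = GridSearchCV(
    poly_pipeline,
    {'poly__degree': [1, 2, 3, 4]},
    scoring='explained_variance',
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=1),
)
degree_search.fit(x_demo, y_demo)
print(degree_search.best_params_, round(degree_search.best_score_, 3))
```

With this layout the test split is touched only once, for the final evaluation of the selected model.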

In [None]:
# Compare Algorithms
results_df = pd.DataFrame({'Regressor': names, 'Explained_Variance': results})
results_df = results_df.sort_values(by=['Explained_Variance'], ascending=False)
print(results_df)
results_df.plot.bar(x='Regressor',y='Explained_Variance');
plt.title('Algorithm Comparison');
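
The models above are ranked by explained variance, which for a single target is 1 - Var(y - y_pred) / Var(y). Unlike R², it does not penalize a constant offset in the predictions. The sketch below implements the formula with NumPy on made-up numbers; `explained_variance` is a hypothetical helper that is meant to mirror `sklearn.metrics.explained_variance_score` for one-dimensional targets.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true) for one-dimensional targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

y_demo = np.array([3.0, 5.0, 7.0, 9.0])

print(explained_variance(y_demo, y_demo))            # perfect predictions
print(explained_variance(y_demo, y_demo + 100.0))    # constant bias is not penalized
print(explained_variance(y_demo, np.full(4, 6.0)))   # predicting the mean explains nothing
```

The first two calls both give 1.0 and the third gives 0.0, which is why a high explained variance should be read alongside an error metric if systematic bias matters.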

**A paragraph explaining which of your regressions you recommend as a final model that best fits your needs in terms of accuracy and explainability.**

The Random Forest Regressor has the best cross-validated score of all regression models on the training set, so it is chosen to make predictions on the test set.

In [None]:
rkfold_rf = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
tuned_parameters = {'max_depth': [x for x in range(1, 8)] + [None],
    'max_features': [x for x in range(1, X_train.shape[1])],
    'min_samples_split': np.linspace(0.1, 1.0, 10),
    'n_estimators': [x for x in range(1, 100)]}
search_rf = GridSearchCV(RandomForestRegressor(), tuned_parameters, scoring='explained_variance', cv=rkfold_rf, n_jobs=-1)
search_rf.fit(X_train, y_train)
testing_predictions = search_rf.predict(X_test)
# the score is explained variance, not classification accuracy
test_evs = explained_variance_score(y_test, testing_predictions)
print("Test explained variance: ", test_evs)
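
The grid above contains 8 × 5 × 10 × 99 = 39,600 parameter combinations (assuming the six feature columns of this dataset), and with 9 cross-validation fits each that comes to 356,400 model fits. If that is too slow, `RandomizedSearchCV` can sample a fixed number of combinations from the same kind of space. The sketch below is one way to do this; the `make_regression` data, `n_iter=10`, and the narrower `n_estimators` range are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold

# Synthetic stand-in for X_train / y_train with six features.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

param_distributions = {
    'max_depth': list(range(1, 8)) + [None],
    'max_features': list(range(1, X_demo.shape[1])),
    'min_samples_split': [float(v) for v in np.linspace(0.1, 1.0, 10)],
    'n_estimators': list(range(10, 50)),
}
search_demo = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=10,  # number of sampled combinations instead of the full grid
    scoring='explained_variance',
    cv=RepeatedKFold(n_splits=3, n_repeats=3, random_state=1),
    random_state=0,
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_)
```

Raising `n_iter` trades run time for a more thorough search, and the fitted object exposes the same `best_params_` and `predict` interface as `GridSearchCV`.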

**Summary Key Findings and Insights, which walks your reader through the main drivers of your model and insights from your data derived from your linear regression model.**

* Being a smoker is the main factor in determining a candidate's insurance charges.
* The Random Forest Regressor gave the best results on the training set even without a grid search, so it was tuned further to give the best possible predictions.
* The Polynomial, Regularization, and KNN models were tuned on the training set (best degree, grid search over parameters, best k), but they still produced less accurate results.

## Recommendations

**Suggestions for next steps in analyzing this data, which may highlight possible flaws in the model and a plan of action to revisit this analysis with additional data or different predictive modeling techniques to achieve a better explanation or a better prediction**

Automated machine learning can yield better results than manually selected or grid-searched models.

For this dataset, auto-sklearn and TPOT are used, and their results are compared with the results obtained before.

In [None]:
automl = autosklearn.regression.AutoSklearnRegressor(
    n_jobs=4,
    tmp_folder='/tmp/autosklearn_regression_example_tmp',
)
automl.fit(X_train, y_train, dataset_name='insurance')
#print(automl.leaderboard())
#print(automl.show_models())
# score on the held-out test set so the value is comparable with the TPOT score
test_predictions = automl.predict(X_test)
asd = {'Regressor': 'AutoSklearn', 'Explained_Variance': explained_variance_score(y_test, test_predictions)}
#print("Test Explained_Variance_Score:", asd['Explained_Variance'])

In [None]:
rkfold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
tpot = TPOTRegressor(n_jobs=-1, generations=10, population_size=10, offspring_size=10, verbosity=0, cv=rkfold, scoring='explained_variance',random_state=1)
tpot.fit(X_train, y_train)
tpd = {'Regressor': 'TPOT', 'Explained_Variance': tpot.score(X_test, y_test)}
#print('Best Pipeline Score= ', tpd['Explained_Variance'])

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
results_df = pd.concat([results_df, pd.DataFrame([asd, tpd])], ignore_index=True)
results_df = results_df.sort_values(by=['Explained_Variance'], ascending=False)
print(results_df)
results_df.plot.bar(x='Regressor',y='Explained_Variance');
plt.title('Algorithm Comparison');

The default configurations of the automated machine learning tools give better results than the manually selected and tuned models.