## Objective – Explore the dataset and extract insights from the data, using statistical evidence to answer the following:

* Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't.
* Prove (or disprove) with statistical evidence that the BMI of females is different from that of males.
* Is the proportion of smokers significantly different across different regions?
* Is the mean BMI of women with no children, one child, and two children the same?

# Data Set 
- Age :- This is an integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
- Sex :- This is the policy holder's gender, either male or female.
- BMI :- This is the body mass index (BMI), which provides a sense of how over or under-weight a person is relative to their height. BMI is equal to weight (in kilograms) divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9.
- Children :- This is an integer indicating the number of children / dependents covered by the insurance plan.
- Smoker :- This is yes or no depending on whether the insured regularly smokes tobacco.
- Region :- This is the beneficiary's place of residence in the U.S., divided into four geographic regions - northeast, southeast, southwest, or northwest.
- Charges :- Individual medical costs billed to health insurance.

# Questions to be answered by EDA
- Are there more male beneficiaries?
- Are there more smokers?
- Which region has the highest medical costs billed to health insurance?
- What is the age of the beneficiaries?
- Do beneficiaries with more dependents have higher medical costs billed?

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import skew
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols      # For n-way ANOVA
from statsmodels.stats.anova import anova_lm 
from   scipy.stats import chi2_contingency # for chi square test

In [None]:
#Reading the file
df=pd.read_csv(r"C:\Users\aashi\python_files\insurance.csv")

In [None]:
df.head()

In [None]:
# viewing the shape (printed explicitly, since only the last expression of a cell is displayed)
print("Shape:", df.shape)
print("Missing values\n\n",df.isnull().sum())
print('------------------')
print("UNIQUE VALUES\n",df.nunique())

In [None]:
df.info()

**Types of variables**

- Categorical variables - sex, smoker, region, children
- Quantitative variables - age, bmi, charges. Here children is a discrete variable, whereas age, bmi and charges are continuous variables.

- There are no missing values.

In [None]:
insured=df.copy()

In [None]:
#changing object dtype to category  to save memory
insured.sex=insured['sex'].astype("category")
insured.smoker=insured['smoker'].astype("category")
insured.region=insured['region'].astype("category")

In [None]:
insured.describe()

**Observations**
 - The average age of the primary beneficiary is 39.2 and the maximum age is 64.
 - The average BMI is 30.66, which is outside the normal BMI range; the maximum BMI is 53.13.
 - The average medical cost billed to health insurance is 13270, the median is 9382 and the maximum is 63770.
 - The median is less than the mean for charges, indicating that the distribution is positively skewed.
 - A customer has 1 child on average.
 - For age, BMI and children, the mean is almost equal to the median, suggesting these distributions are roughly symmetric (this alone does not establish normality).
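
The link between "median below mean" and positive skew can be checked numerically. The snippet below is a minimal sketch on a synthetic right-skewed sample; the sample itself and its parameters are assumptions for illustration and are not taken from the insurance data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for a cost-like variable
sample = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

sample_mean = sample.mean()
sample_median = np.median(sample)
sample_skew = skew(sample)  # positive values indicate a longer right tail

print(f"mean={sample_mean:.0f}, median={sample_median:.0f}, skewness={sample_skew:.2f}")
```

On the real data the same call would be `skew(insured['charges'])`, assuming the `insured` frame from the cells above.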
 
    

In [None]:
insured.describe(include='category')

In [None]:
# Are there more male beneficiaries?
# Are there more smokers?
# Which region has the maximum number of claims?

list_col = insured.select_dtypes(['category']).columns
for i in range(len(list_col)):
    print(insured[list_col[i]].value_counts())
    print()
    

**Observations**
 - 676 male and 662 female, indicating the sample has slightly more males than females.
 - 1064 nonsmokers and 274 smokers, indicating the sample has more nonsmokers.
 - The number of claims from customers who reside in the southeast region is higher than in the other regions.





# Exploratory Data Analysis





## Univariate Analysis

In [None]:
def dist_box(data):
    # Plots a combined graph for univariate analysis of a continuous variable
    # to check spread, central tendency, dispersion and outliers
    Name = data.name.upper()
    fig,axes=plt.subplots(2,1,gridspec_kw = {"height_ratios": (.25, .75)},figsize=(10,5))

    mean=data.mean()
    median=data.median()
    mode=data.mode().tolist()[0]
    fig.suptitle("SPREAD OF DATA FOR "+ Name , fontweight='bold')
    sns.boxplot(x=data,showmeans=True,ax=axes[0])
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(data,kde=False,ax=axes[1],color='blue')
    # label each reference line so the legend maps colours to statistics
    axes[1].axvline(mean,color='g',label='Mean')
    axes[1].axvline(median,color='r',label='Median')
    axes[1].axvline(mode,color='y',label='Mode')
    axes[1].legend()



In [None]:
#select all quantitative columns for checking the spread
list_col=  insured.select_dtypes([np.number]).columns
sns.set(style="darkgrid")
for i in range(len(list_col)):
    dist_box(insured[list_col[i]])
    

**Observations**
- Age of the primary beneficiary lies approximately between 18 and 64. Average age is approx. 40. The most common ages are in the 18 to 20s range.
- BMI is approximately normally distributed and the average BMI of beneficiaries is about 30, which is outside the normal BMI range. There are a lot of outliers at the upper end.
- Most of the beneficiaries have no children.
- The charges distribution is unimodal and right skewed. The average cost incurred by the insurer is approx. 13,000 and the highest charge is 63770. There are a lot of outliers at the upper end.

In [None]:
def bar_perc(plot, feature):
    total = len(feature) # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # x position of the label (near the bar centre)
        y = p.get_y() + p.get_height()           # y position of the label (top of the bar)
        plot.annotate(percentage, (x, y), size = 12)

In [None]:
def bar_perc1(plot, feature):
    total = len(feature)
    # use the axes passed in as `plot` rather than a global `ax`
    for p in plot.patches:
        height = p.get_height()

        percentage = (height / total) * 100
        plot.text(p.get_x() + p.get_width() / 2, height, f'{percentage:.1f}%', ha='center', va='bottom')


In [None]:
#get all category datatype 
list_col=  insured.select_dtypes(['category']).columns
fig1, axes1 =plt.subplots(1,3,figsize=(14, 5))
for i in range(len(list_col)):
    order = insured[list_col[i]].value_counts(ascending=False).index # to display bars in descending order of count
    axis=sns.countplot(x=list_col[i], data=insured , order=order,ax=axes1[i],palette='inferno').set(title=list_col[i].upper())
    bar_perc(axes1[i],insured[list_col[i]])

**Observations**
 - 50.5% of beneficiaries are male and 49.5% are female, so there are approximately the same number of male and female beneficiaries.
 - 20.5% of beneficiaries are smokers.
 - Beneficiaries are evenly distributed across regions, with the southeast being the most populous (~27%) and each of the other regions containing around 24%.

## Bivariate & Multivariate Analysis

In [None]:
plt.figure(figsize=(12,5))
sns.heatmap(insured.corr(numeric_only=True),annot=True ,cmap="YlGn" )
plt.show()


**Observation**
 - Charges are only weakly correlated with age and with BMI.


In [None]:
sns.pairplot(data=insured,hue='sex',kind='scatter',corner=True,palette='viridis')
plt.show()

In [None]:
category_columns = insured.select_dtypes(['category']).columns
category_columns

In [None]:
#Sex vs all numerical variable

fig ,axes1 = plt.subplots(2,2,figsize=(15,10))
list_col = insured.select_dtypes([np.number]).columns

for i in range(len(list_col)):
    row= i//2
    column = i%2
    ax= axes1[row,column]
    sns.boxplot(x=insured['sex'],y=insured[list_col[i]],ax=ax,palette='viridis').set(title='SEX  VS '+list_col[i].upper())
    ax.set(xlabel='')

**Observation**
 - The average age of female beneficiaries is slightly higher than that of male beneficiaries.
 - The number of children is the same for male and female beneficiaries.
 - BMI of male policy holders has many outliers, and the average BMI of males is slightly higher than that of females.
 - Male policy holders have incurred more charges than female policy holders. There are a lot of outliers among female policy holders.

In [None]:
#Smoker vs all numerical variable

fig ,axes1 = plt.subplots(2,2,figsize=(15,10))
list_col = insured.select_dtypes([np.number]).columns

for i in range(len(list_col)):
    row= i//2
    column = i%2
    ax= axes1[row,column]
    sns.boxplot(x=insured['smoker'],y=insured[list_col[i]],ax=ax,palette='PiYG').set(title='SMOKER  VS '+list_col[i].upper())
    ax.set(xlabel='')

**Observation**
- Smokers have incurred more cost to insurance than nonsmokers. There are outliers among nonsmokers that need further analysis.
- BMI of nonsmokers has a lot of outliers.

In [None]:
#region vs all numerical variable

fig ,axes1 = plt.subplots(2,2,figsize=(15,10))
list_col = insured.select_dtypes([np.number]).columns

for i in range(len(list_col)):
    row= i//2
    column = i%2
    ax= axes1[row,column]
    sns.boxplot(x=insured['region'],y=insured[list_col[i]],ax=ax,palette='viridis').set(title='REGION  VS '+list_col[i].upper())
    ax.set(xlabel='')

**Observations**
 - Age and number of children are almost the same across regions.
 - The average BMI of policy holders from the southeast is higher than in other regions.
 - Charges incurred by policy holders from the southeast are higher than in other regions.
 - There are a lot of outliers at the upper end of charges.


In [None]:
#smoker vs Sex
ax=sns.countplot(x='smoker',hue='sex',data=insured)
bar_perc(ax,insured['sex'])
ax.set(title="Smoker vs Sex")
plt.show()

In [None]:
#smoker vs charges
sns.barplot(x=insured.smoker,y=insured.charges,palette='rainbow').set(title="Smoker vs Charges")

In [None]:
#region vs smoker
plt.figure(figsize=(14,5))
ax=sns.countplot(x='region',hue='smoker',data=insured,palette='rainbow')
bar_perc(ax,insured['smoker'])
ax.set(title="Smoker vs Region")
plt.show()

**Observation**
- There are more male smokers than female smokers.
- The southeast region has more smokers.
- Smokers have costlier claims than nonsmokers.

In [None]:
# sex vs region
plt.figure(figsize=(13,5))
ax2=sns.countplot(x='region',hue='sex',data=insured,palette='inferno')
bar_perc(ax2,insured['sex'])
plt.show()


**Observations**
 - There are more smokers in southeast region compared to other regions.

In [None]:
#CHILDREN VS CHARGES
plt.figure(figsize=(12,5))
sns.barplot(x='children',y='charges',data=insured,palette='rainbow').set(title='CHILDREN VS CHARGES')
plt.show()

In [None]:
#Sex Vs Charges

sns.barplot(x=insured.sex,y=insured.charges,palette='Set2').set(title='Sex Vs Charges')

In [None]:
#Region Vs Charges
sns.barplot(x='region',y='charges',data=insured,palette='Set2').set(title='Region Vs Charges')

In [None]:
#COST INCURED BY AGE FOR MALE AND FEMALE

plt.figure(figsize=(14,5))
sns.lineplot(x='age',y='charges',hue='sex',data= insured,palette='inferno',ci=0).set(title=' COST INCURED BY AGE FOR MALE AND FEMALE')
plt.show()

In [None]:
# charges incured by smoker  for male and female 
sns.barplot(x='smoker',y='charges',hue='sex',data=insured,palette='rainbow')

**Observations**
 - Charges incurred for males are higher than charges incurred for females.
 - Charges increase with the age of the policy holder for both males and females.
 - There are some spikes for females at approximate ages of 23, 28 and 43.
 - Most claims are from the southeast region.
 - Males who smoke have the most claims and the highest bills.
 - The number of claims made by females who don't smoke is higher than by females who smoke.
 
 


In [None]:
category= pd.cut(insured.bmi,bins=[15,25,35,45,55],labels=['15-25','25-35','35-45','45-55'])
insured.insert(5,'BMI_group',category)

In [None]:
insured.head()

In [None]:
charges_by_chilren=insured.groupby(insured.children).charges.mean()
charges_by_chilren=charges_by_chilren.reset_index()
plt.figure(figsize=(14,5))
sns.barplot(x='children',y='charges',data=charges_by_chilren,palette='coolwarm')
charges_by_chilren

**Observations**
 - There is no clear relation between the number of children and the charges incurred.

In [None]:

insured.groupby(insured.BMI_group).charges.mean()

In [None]:
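# adding a new column
# include_lowest=True keeps age 18 in the first bin; otherwise the interval (18, 28] would exclude it
category1=pd.cut(insured.age,bins=[18,28,38,48,58,68],labels=['18-28','28-38','38-48','48-58','58-68'],include_lowest=True)
insured.insert(6,'AgeBin',category1)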

In [None]:
insured.groupby(insured.AgeBin).charges.mean()

In [None]:
#Age Vs Charges
sns.barplot(x=insured.AgeBin,y=insured.charges,palette='viridis').set(title='Age Vs Charges')


In [None]:
sns.barplot(x=insured.BMI_group,y=insured.charges,palette='viridis')


In [None]:


plt.figure(figsize=(15,7))
sns.barplot(x=insured["BMI_group"],y=insured["age"],hue=insured['sex'],ci=0,palette='GnBu').set(title= 'Age and Bmi of Males and Females')
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()

In [None]:
sns.barplot(x='BMI_group',y='charges',hue='sex',data=insured,palette='GnBu').set(title="Fig 2:BMI group and Charges " )

In [None]:
sns.scatterplot(x='bmi',y='charges',hue='smoker',data=insured,palette='GnBu').set(title="Fig 2:BMI difference for smokers and non smokers" )

**Observations**
- Females in the highest BMI group have incurred the most charges to the insurance company.
- BMI for males and females is about the same.
- Beneficiaries with higher BMI have incurred more cost to insurance.

# Statistical Analysis

## 1. Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't.

<div class ="alert alert-block alert-info">
    <font size=3><b>    Step 1: Define null and alternative hypothesis</b></font><br>
$\ H_0  :  \mu_1 \leq \mu_2  $ The average charges of smokers are less than or equal to those of nonsmokers
 <br>
 

$\ H_a  :\mu_1 > \mu_2 $ The average charges of smokers are greater than those of nonsmokers  <br>
</div>

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 2: Decide the significance level. If the p-value is less than alpha, reject the null hypothesis.</b></font>

α = 0.05

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 3: Identify the test</b></font>

The population standard deviation is not known, so we will perform a t-test. The > sign in the alternative hypothesis indicates that the test is right-tailed, that is, all t values that would cause us to reject the null hypothesis lie in the right tail of the sampling distribution curve.
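
A right-tailed two-sample t-test of this kind can be sketched on synthetic data as follows. The two samples, their means and spreads, and the choice of Welch's variant (`equal_var=False`) are assumptions made for illustration; they are not the notebook's data or its exact method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical samples: group_a is drawn with a higher true mean than group_b
group_a = rng.normal(loc=32000, scale=11000, size=274)
group_b = rng.normal(loc=8500, scale=6000, size=1064)

# Welch's variant (equal_var=False) assumes neither equal variances nor equal sizes
t_stat, p_value = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="greater"
)
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3g}")
```

Because the test handles unequal group sizes directly, the two groups do not need to be trimmed to the same length before calling `ttest_ind`.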

    


<div class ="alert alert-block alert-info">
    <font size=3><b>Step 4: Calculate the test-statistics and p-value</b></font>

In [None]:
smoker = insured.loc[insured.smoker=='yes']
non_smoker = insured.loc[insured.smoker=='no']


In [None]:
print(smoker.count())
non_smoker.count()

In [None]:
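# Keep the last 274 non-smoker rows so both groups have equal size
# (ttest_ind does not require equal group sizes; this subsample is a simplification)
non_smoker=non_smoker[-274:]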

In [None]:
non_smoker.count()

In [None]:
print(f"Average charges incurred for smokers is {smoker.charges.mean()} \nAverage charges incurred for non smokers is {non_smoker.charges.mean()}")

In [None]:
#smoker vs charges
sns.boxplot(y=insured.charges,x=insured.smoker,data=insured,palette='Set2').set(title="Fig:1 Smoker vs Charges")
plt.show()

In [None]:
charges_smoker = smoker.charges
charges_non_smoker = non_smoker.charges

In [None]:
alpha=0.05

#one tailed t-test
t_statistic , p_value = stats.ttest_ind(charges_smoker,charges_non_smoker,alternative='greater') # alternative parameter specify it is one tailed test 


In [None]:
if p_value < alpha :
    print(f"Reject the null hypothesis because p_value {p_value} is less than {alpha}")
    print( 'Smokers have significantly higher medical claims')
else:
    print(f"Failed to reject the null hypothesis because p_value {p_value} is greater than or equal to {alpha}")

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 5: Decide whether to reject or fail to reject the null hypothesis</b></font><br>    
    We reject the null hypothesis and conclude that people who smoke have, on average, larger medical claims than people who don't smoke. A similar result can also be seen in Fig. 1, Smoker vs Charges.

## 2. Prove (or disprove) with statistical evidence that the BMI of females is different from that of males.

<div class ="alert alert-block alert-info">
    Let $\mu_1$ and $\mu_2$ be the respective population means for BMI of males and BMI of females<br>
    <font size=3><b>    Step 1: Define null and alternative hypothesis</b></font><br>
$\ H_0  : \mu_1 - \mu_2 = 0$ There is no difference between the BMI of males and the BMI of females.<br>
$\ H_a  : \mu_1 - \mu_2 \neq 0 $ There is a difference between the BMI of males and the BMI of females. <br>


</div>


<div class ="alert alert-block alert-info">
    <font size=3><b>Step 2: Decide the significance level</b></font>

α = 0.05

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 3: Identify the test</b></font><br>The population standard deviation is not known, so we will perform a t-test. The not-equal sign in the alternative hypothesis indicates that it is a two-tailed test.

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 4: Calculate the test-statistics and p-value</b></font>

In [None]:
#get all observation for male.
df_male=insured.loc[insured.sex=="male"]
#get all observation for females
df_female=insured.loc[insured.sex=="female"]


In [None]:
#get bmi of male and female
bmi_female=df_female.bmi
bmi_male=df_male.bmi

In [None]:
# kdeplot replaces the deprecated distplot(hist=False)
sns.kdeplot(bmi_male,color='green',label='male')
sns.kdeplot(bmi_female,color='red',label='female')
plt.legend()
plt.show()

In [None]:
t_statistic2,p_value2= stats.ttest_ind(bmi_female,bmi_male)
print(f't-statistic: {t_statistic2}\np-value: {p_value2}')

In [None]:
if p_value2 < alpha :
    print(f"Reject the null hypothesis because p_value {p_value2} is less than {alpha}")
    print( 'BMI of male and female differ significantly')
else:
    print(f"Failed to reject the null hypothesis because p_value {p_value2} is greater than or equal to {alpha}")
    print( 'There is no significant difference in BMI between male and female')

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 5: Decide whether to reject or fail to reject the null hypothesis</b></font><br>
     We fail to reject the null hypothesis: there is no statistically significant difference between the BMI of females and the BMI of males.

## 3. Is the proportion of smokers significantly different across different regions?

<div class ="alert alert-block alert-info">
    <font size=3><b>    Step 1: Define null and alternative hypotheses</b></font>

* H<sub>0</sub>: The proportion of smokers is not significantly different across regions
* H<sub>a</sub>: The proportion of smokers is different across regions  <br>
</div>

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 2: Decide the significance level</b></font>

α = 0.05

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 3: Identify Test</b></font><br>
    Here we are comparing two categorical variables, smoker and region, so we perform a chi-square test of independence.
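
How the chi-square statistic is built from a contingency table can be sketched as follows. The counts in `observed` are made up for illustration and are not the region and smoker counts from the dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are four groups, columns are (no, yes)
observed = np.array([[30, 10],
                     [25, 15],
                     [35, 5],
                     [28, 12]])

chi2, p_val, dof, expected = chi2_contingency(observed, correction=False)

# Expected counts under independence: row total * column total / grand total
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2={chi2:.3f}, manual chi2={manual_chi2:.3f}, dof={dof}, p={p_val:.4f}")
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is 3 for a 4 × 2 table.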

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 4: Calculate the test-statistics and p-value</b></font>

In [None]:
contigency =pd.crosstab(insured.region,insured.smoker)
contigency.index

In [None]:
# performing chi square test
chi2 , p_val , dof ,exp_frequencies = chi2_contingency(contigency,correction=False)

In [None]:
print('chi-square statistic: {}\nPvalue: {} \nDegree of freedom: {} \n\n\nexpected frequencies: {} '.format(chi2, p_val, dof, exp_frequencies))

In [None]:
if (p_val < 0.05):
    print('Reject Null Hypothesis')
else:
    print('Failed to reject Null Hypothesis')

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 5: Decide whether to reject or fail to reject the null hypothesis</b></font><br>
     We fail to reject the null hypothesis and conclude that the proportion of smokers is not significantly different across regions.
    


## 4. Is the mean BMI of women with no children, one child, and two children the same? Explain your answer with statistical evidence.

<div class ="alert alert-block alert-info">
    <font size=3><b>    Step 1: Define null and alternative hypotheses</b></font>

* H<sub>0</sub>: μ1 = μ2 = μ3  The mean BMI of women with no children, one child and two children is the same <br>

* H<sub>a</sub>:  At least one of the mean BMIs differs from the others <br>
</div>

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 2: Decide the significance level</b></font>

α = 0.05

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 3: Identify Test</b></font><br>
    
One-way ANOVA - tests the equality of population means by comparing the variance between groups with the variance within groups.
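
The F statistic behind one-way ANOVA can be sketched by hand and compared with `scipy.stats.f_oneway`. The three groups below are synthetic and their sizes and parameters are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Three hypothetical groups drawn from the same distribution
groups = [rng.normal(loc=30.0, scale=6.0, size=n) for n in (120, 80, 60)]

f_stat, p_value = f_oneway(*groups)

# Manual F: mean square between groups divided by mean square within groups
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
manual_f = (ss_between / df_between) / (ss_within / df_within)

print(f"F={f_stat:.4f}, manual F={manual_f:.4f}, p={p_value:.4f}")
```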

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 4: Calculate the test-statistics and p-value</b></font>

In [None]:
df_female_child = df_female.loc[df_female['children']<=2]

In [None]:
df_female_child.groupby('children')['bmi'].mean()

In [None]:
# Applying ANOVA to compare BMI across each children count (0, 1, 2)
formula = 'bmi ~ C(children)'
model = ols(formula, df_female_child).fit()
aov_table = anova_lm(model)
aov_table

<div class ="alert alert-block alert-info">
    <font size=3><b>Step 5: Decide whether to reject or fail to reject the null hypothesis</b></font><br>
     The p-value is 0.715858, which is greater than alpha (0.05). We fail to reject the null hypothesis: there is no significant evidence that the mean BMI of women with no children, one child and two children differs.
    

### Recommendation
- Based on EDA and statistical evidence, customers who smoke or have a higher BMI have higher claims. We can encourage customers to quit smoking by providing incentive points for talking to a life coach, getting help with improving lifestyle habits, or completing a Quit Tobacco 28-day program, and give gift cards when a customer accumulates a specific number of points.
- We can run active wellness programs, which can help us reduce claims related to BMI.
- High BMI is often linked to unhealthy lifestyle choices. We can provide customers with diet plans and wellness health coaches to help them make the right choices.
- Provide discount coupons for gyms or fitness devices to encourage customers to exercise.


# Linear Regression

## Definition & Working principle
Let's build a model using **Linear regression**.

Linear regression is a **supervised learning** algorithm used when the target / dependent variable is a **continuous** real number. It establishes a relationship between the dependent variable $y$ and one or more independent variables $x$ using a best-fit line. It works on the principle of ordinary least squares $(OLS)$ / mean squared error $(MSE)$. In statistics, OLS is a method to estimate the unknown parameters of a linear regression function; its goal is to minimize the sum of squared differences between the observed dependent variable in the given data set and the values predicted by the linear regression function.


## Data Preprocessing
### Encoding
Most machine learning algorithms cannot work with categorical data directly; categorical data must be converted to numbers. The relevant concepts are:
 1. Label Encoding
 2. One hot encoding
 3. Dummy variable trap

## 1. Label Encoding:
Label encoding converts each unique category in a feature into an integer value. This method is suitable for ordinal categorical variables (those that have a meaningful order, such as low, medium, high).

Cons:
Not suitable for nominal (unordered) categorical variables, as it introduces an arbitrary order which can confuse machine learning algorithms.
                                                                                                        
                                                                                                        
##  2. One hot encoding
One-hot encoding converts each unique category into a separate binary column (0 or 1). This is useful for nominal categorical variables (those that do not have a meaningful order, such as colors or countries).

Cons:
Can lead to the dummy variable trap.
If a feature has many unique categories, it can lead to a high-dimensional dataset (known as the "curse of dimensionality").


##  3. Dummy variable trap
The dummy variable trap occurs when one of the dummy variables (binary columns from one-hot encoding) is redundant. If you have k categories, the information for the last category can be derived from the first k−1 columns, which leads to multicollinearity.
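
The redundancy described above can be seen on a small example. This is a minimal sketch on a hypothetical five-row frame named `toy`; with all k dummy columns every row sums to 1, which duplicates the information carried by a model intercept, and `drop_first=True` removes one column to break that dependency.

```python
import pandas as pd

# Hypothetical toy frame with one nominal feature (k = 4 categories)
toy = pd.DataFrame(
    {"region": ["northeast", "southeast", "southwest", "northwest", "southeast"]}
)

# Full one-hot encoding: k columns
full = pd.get_dummies(toy, columns=["region"], dtype="int8")
# drop_first=True keeps k - 1 columns and avoids the dummy variable trap
reduced = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype="int8")

row_sums = full.sum(axis=1)  # every row sums to exactly 1 across the k dummies
print(full.columns.tolist())
print(reduced.columns.tolist())
print(row_sums.tolist())
```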




By using the pandas get_dummies function we can do all of the above in one line of code. We will use this function to get dummy variables for the sex, children, smoker and region features. Setting drop_first=True avoids the dummy variable trap by dropping one dummy column per feature; the original categorical column is dropped as well. pandas makes our life easy.

In [None]:
# dummy variables

categorical_columns = ['sex','children', 'smoker', 'region']
df_encode = pd.get_dummies(data=df,prefix='OLE',prefix_sep='_',columns=categorical_columns,drop_first=True,dtype='int8')
df_encode.head()

In [None]:
# Let's verify the dummy variable process
print('Columns in original data frame:\n',df.columns.values)
print('\nNumber of rows and columns in the dataset:',df.shape)
print('\nColumns in data frame after encoding dummy variable:\n',df_encode.columns.values)
print('\nNumber of rows and columns in the dataset:',df_encode.shape)

In [None]:
fig ,ax = plt.subplots(1,2,figsize=(15,5))

sns.distplot(df['charges'],bins=30,color='r',ax=ax[0])
ax[0].set_title('Distribution of insurance charges')


# the values are already log10-transformed, so the x axis is left on a linear scale
sns.distplot(np.log10(df['charges']),bins=40,color='b',ax=ax[1])
ax[1].set_title('Distribution of insurance charges in $log_{10}$ scale');

### Box-Cox transformation
A Box-Cox transformation is a way to transform a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation means that you are able to run a broader number of tests. All that we need to perform this transformation is to find the lambda value and apply the rule shown below to the variable.  
$$\mathbf{ y_i^{(\lambda)} = \begin {cases}\frac {y_i^\lambda - 1}{\lambda}, & \lambda \neq 0 \\
 \log(y_i), & \lambda = 0 \end{cases}}$$
The trick of the Box-Cox transformation is to find the lambda value; in practice this is inexpensive. `scipy.stats.boxcox` returns the transformed variable, the lambda value and, when `alpha` is given, the confidence interval for lambda.

Where:

- $y_i$ is the original data,
- $\lambda$ is the transformation parameter, estimated from the data to maximize the likelihood of achieving normality.

When to use the Box-Cox transformation:

- When your data is positively skewed.
- When your data has heteroscedasticity, i.e., unequal variance in residuals.
- When you're working with models that assume normally distributed errors.

Limitations:

- The Box-Cox transformation can only be applied to positive data. If your data contains negative or zero values, you'll need to shift it by adding a constant to make all values positive before applying the transformation.
- If the data contains negative values, you can consider using the Yeo-Johnson transformation, which is a generalization of Box-Cox that works for both positive and negative data.
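
The transformation rule can be checked numerically against `scipy.stats.boxcox`. This is a minimal sketch; the array `y` and the fixed value `lam = 0.5` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import boxcox

# Hypothetical strictly positive values
y = np.array([1.5, 2.0, 3.5, 8.0, 20.0, 55.0])

# Case lambda != 0: (y^lambda - 1) / lambda
lam = 0.5
manual = (y ** lam - 1) / lam
from_scipy = boxcox(y, lmbda=lam)

# Case lambda == 0: natural log
log_case = boxcox(y, lmbda=0)

# With no lmbda given, scipy estimates lambda by maximum likelihood
y_bc, lam_hat = boxcox(y)
print(f"estimated lambda: {lam_hat:.3f}")
```

When lambda is fixed, `boxcox` returns only the transformed array; when it is estimated, the fitted lambda is returned alongside it.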

In [None]:
from scipy.stats import boxcox
y_bc,lam, ci= boxcox(df_encode['charges'],alpha=0.05)
print(ci,lam)
#df['charges'] = y_bc  
# it did not perform better for this model, so log transform is used

In [None]:
#df_encode['charges'] = y_bc
df_encode['charges'] = np.log(df_encode['charges'])

In [None]:
from sklearn.preprocessing import StandardScaler
numeric_columns = ['age', 'bmi', 'charges']
scalar = StandardScaler()
df_scaled = df_encode.copy()
df_scaled[numeric_columns]=scalar.fit_transform(df_scaled[numeric_columns])
df_scaled.head()

In [None]:
x=df_scaled.drop(columns='charges') # removing the dependent variable column 
y=df_scaled['charges'] # taking the dependent column

In [None]:

from sklearn.model_selection import train_test_split
x_train , x_test ,y_train , y_test = train_test_split(x,y,test_size=0.2,random_state=23)
x_train.head()

## Model building

In [None]:
from sklearn.linear_model import LinearRegression
lr= LinearRegression()
lr.fit(x_train,y_train)

print(lr.intercept_,lr.coef_)

In [None]:

print('intercept is',lr.intercept_,'\n\nSlope of the columns are\n',lr.coef_)

## Model evaluation
We will predict the value of the target variable for the test data set using our model parameters, then compare the predicted values with the actual values in the test set. We compute the **Mean Square Error** using the formula
$$\mathbf{ J(\theta) = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$

$\mathbf{R^2}$ is a statistical measure of how close the data are to the fitted regression line. For a least-squares fit with an intercept, evaluated on the training data, $\mathbf{R^2}$ lies between 0 and 100%; on held-out data it can be negative. 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all of the variability of the response data around its mean.

$$\mathbf{R^2 = 1 - \frac{SSE}{SST}}$$
**SSE = Sum of Square Error**  
**SST = Sum of Square Total**  
$$\mathbf{SSE = \sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$$
$$\mathbf{SST = \sum_{i=1}^{m}(y_i - \bar{y})^2}$$
Here $\mathbf{\hat{y}}$ is the predicted value and $\mathbf{\bar{y}}$ is the mean value of $\mathbf{y}$.
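
The formulas above can be worked through on a tiny example. The four actual and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

sse = np.sum((y_hat - y_true) ** 2)          # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
mse = sse / len(y_true)                      # mean squared error, J(theta)
r2 = 1 - sse / sst                           # coefficient of determination

print(f"SSE={sse:.3f}, SST={sst:.3f}, MSE={mse:.4f}, R2={r2:.4f}")
```

Here SSE is about 0.10 and SST is 5.0, so MSE is about 0.025 and $R^2$ is about 0.98.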

In [None]:
y_pred = lr.predict(x_train)

from sklearn.metrics import mean_squared_error , r2_score

residual = y_train - y_pred

# Evaluation: MSE computed manually
print('MSE : ',(np.square(residual)).sum()/residual.count())

# the same value from scikit-learn (arguments are y_true, y_pred)
mse = mean_squared_error(y_train,y_pred)
print(mse)

r2 = r2_score(y_train,y_pred)
print('Rsquare is : ',r2)

ax=sns.distplot(residual,bins=15)
ax.axvline(residual.mean(),color='r',linestyle='--')
plt.show()

In [None]:
#test data set

y_test_pred = lr.predict(x_test)

mse2 = mean_squared_error(y_test,y_test_pred)
print('MSE is :',mse2)  # report the test-set MSE (mse2), not the training MSE

rsqu= r2_score(y_test,y_test_pred)
print('r score is :',rsqu)

The model returns an $R^2$ value of 80.2%, so it fits our test data well, but we can still improve its performance using different techniques. Please note that we have transformed our target variable by applying the natural log (followed by standard scaling). When we put the model into production, these steps must be inverted, ending with the antilog.
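
Because the target was log-transformed and then standardized, a prediction has to be mapped back through both steps before it can be read as a charge. The sketch below uses made-up values and a hypothetical helper `to_original_units`; in the notebook the mean and scale would come from the fitted scaler's entry for the `charges` column, which is an assumption about how it would be wired up.

```python
import numpy as np

# Hypothetical charges in original units
charges = np.array([1200.0, 4500.0, 9800.0, 23000.0, 41000.0])

# Forward transform: natural log, then standardization
log_charges = np.log(charges)
mu = log_charges.mean()
sigma = log_charges.std()  # population std (ddof=0), as StandardScaler uses
scaled = (log_charges - mu) / sigma

def to_original_units(pred_scaled, mu, sigma):
    """Hypothetical helper: undo the scaling, then apply the antilog."""
    return np.exp(pred_scaled * sigma + mu)

recovered = to_original_units(scaled, mu, sigma)
print(recovered)
```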

### Adjusted $R^2$

In [None]:
n = x_test.shape[0]  # number of samples
p = x_test.shape[1]  # number of predictors

# Calculate Adjusted R-squared
adjusted_r_squared = 1 - (1 - rsqu) * (n - 1) / (n - p - 1)
print("ADJUSTED R 2 IS : ",adjusted_r_squared)

In [None]:
residual2 = y_test - y_test_pred
ax=sns.distplot(residual2,bins=15)
ax.axvline((y_test - y_test_pred).mean(),color='r',linestyle='--')
plt.show()

## Model Validation
In order to validate the model we need to check a few assumptions of the linear regression model. The common assumptions for a *Linear Regression* model are the following:
1. Linear relationship: in linear regression the relationship between the dependent and independent variables should be *linear*. This can be checked with a scatter plot of actual vs predicted values.
2. The residual errors should be *normally* distributed.
3. The *mean* of the *residual errors* should be 0 or as close to 0 as possible.
4. Multivariate normality: the errors are assumed to be normally distributed, which is best checked with a Q-Q plot of the residuals.
5. Linear regression assumes that there is little or no *multicollinearity* in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. The variance inflation factor *VIF* identifies correlation between independent variables and the strength of that correlation. $\mathbf{VIF_j = \frac {1}{1-R_j^2}}$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors. If VIF > 1 and VIF < 5 there is moderate correlation; VIF > 5 is a critical level of multicollinearity.
6. Homoscedasticity: the data are homoscedastic, meaning the variance of the residuals is equal along the regression line. We can look at a scatter plot of residuals vs fitted values; if the data are heteroscedastic, the plot exhibits a funnel-shaped pattern.
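
A per-feature variance inflation factor can be sketched with plain NumPy by regressing each predictor on the others. The helper `vif_per_feature` and the synthetic columns `a`, `b` and `c` are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np

def vif_per_feature(X):
    """Hypothetical helper: VIF_j = 1 / (1 - R_j^2), regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Design matrix: intercept plus all other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

rng = np.random.default_rng(7)
a = rng.normal(size=500)
b = rng.normal(size=500)             # independent of a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a

vifs = vif_per_feature(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The independent column `b` gets a VIF close to 1, while `a` and `c`, which are almost copies of each other, get values far above 5.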

In [None]:
# check for linearity

f,ax = plt.subplots(1,2,figsize=(14,5))
sns.scatterplot(x=y_test,y=y_test_pred,ax=ax[0],color='r')
ax[0].set_title('Check for Linearity:\n Actual Vs Predicted value')
# Check for Residual normality & mean

sns.distplot((y_test - y_test_pred),ax=ax[1],color='b')
ax[1].axvline((y_test - y_test_pred).mean(),color='k',linestyle='--')
ax[1].set_title('Check for Residual normality & mean: \n Residual error');

In [None]:
# Check for Multivariate Normality
# Quantile-Quantile plot 
f,ax = plt.subplots(1,2,figsize=(14,6))
import scipy as sp
_,(_,_,r)= sp.stats.probplot(residual2,fit=True,plot=ax[0])
ax[0].set_title('Check for Multivariate Normality: \nQ-Q Plot')

#Check for Homoscedasticity
sns.scatterplot(y = residual2, x= y_test_pred, ax = ax[1],color='r') 
ax[1].set_title('Check for Homoscedasticity: \nResidual Vs Predicted')


In [None]:
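# check for multicollinearity
# note: this uses the model's overall R^2, so it is only a rough proxy;
# a per-feature VIF regresses each predictor on the remaining predictors
VIF = 1/(1-rsqu)
VIF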

The model assumption checks for linear regression are as follows:
1. In our model the actual vs predicted plot is curved, so the linearity assumption fails.
2. The residual mean is zero and the residual error plot is right skewed.
3. The Q-Q plot shows that values greater than 1.5 (on the log scale) tend to increase.
4. The plot exhibits heteroscedasticity; the error increases after a certain point.
5. The variance inflation factor value is less than 5, so no critical multicollinearity is indicated; multicollinearity was not taken into account.

In [None]:
df_scaled.corr()