# Insurance Case Study

## Problem Statement

MedicaInsure is a medical insurance provider. Leveraging customer information is of paramount importance for most businesses. In the case of an insurance company, analysis of customer attributes like age, sex, smoking habits, etc. can be crucial in making decisions regarding the premium amount to be charged. 

The insurance company wants to know whether the proportion of smokers among its female customers differs from the proportion of smokers among its male customers.

They have provided a sample dataset of customers and the charges claimed by them.

### Import the necessary libraries

In [14]:
import numpy as np
import pandas as pd

# import the required function
from statsmodels.stats.proportion import proportions_ztest

### Reading the data into the DataFrame

In [15]:
df = pd.read_csv('insurance.csv')

In [16]:
# checking the shape of the data
df.shape

(1338, 7)

* The dataset consists of 1338 rows and 7 columns.

In [17]:
# inspecting the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


* There are 4 numeric variables and 3 categorical variables

In [18]:
# checking if there are any missing values
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

* There are no missing values in the data

In [19]:
# checking the first 5 rows of the data
df.head()

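|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |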


## Step 1: Define null and alternative hypotheses

* 'sex' and 'smoker' are two categorical variables.
* We want to see if the proportion of smokers in the female population is significantly different from the proportion of smokers in the male population.

**$H_0:$ The proportion of smokers in the female population is equal to the proportion of smokers in the male population.**

**$H_a:$ The proportion of smokers in the female population is not equal to the proportion of smokers in the male population.**

## Step 2: Select the appropriate test

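The formulated hypotheses compare the proportions of a binary outcome (smoker: yes/no) in two independent groups (female and male customers). A two-sample test of proportions is suited to this setting, provided each group contains enough smokers and non-smokers for the normal approximation to hold. We shall use the two-proportion z-test for this problem.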
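For reference, the statistic behind a pooled two-proportion z-test can be written as below. The symbols are introduced here for illustration only and do not appear elsewhere in this notebook: $x_f, x_m$ are the numbers of female and male smokers, $n_f, n_m$ the group sizes, and $\hat{p}$ the pooled proportion. This is the standard textbook form, and it assumes the pooled-variance version of the test.

$$
\hat{p}_f = \frac{x_f}{n_f}, \qquad
\hat{p}_m = \frac{x_m}{n_m}, \qquad
\hat{p} = \frac{x_f + x_m}{n_f + n_m}
$$

$$
z = \frac{\hat{p}_f - \hat{p}_m}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_f} + \dfrac{1}{n_m}\right)}}
$$

Under $H_0$, $z$ is approximately standard normal, so the two-sided p-value is $2\,P(Z \ge |z|)$.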

## Step 3: Decide the significance level

Here, we select α = 0.05.

## Step 4: Data Preparation

### Preparing data for test

In [20]:
# number of female smokers
female_smokers = df[df['sex']=='female'].smoker.value_counts()['yes']
# number of male smokers
male_smokers = df[df['sex']=='male'].smoker.value_counts()['yes']

print('The numbers of female and male smokers are {0} and {1} respectively'.format(female_smokers, male_smokers))

# number of females in the data
n_females = df.sex.value_counts()['female']

# number of males in the data
n_males = df.sex.value_counts()['male']

print('The total numbers of females and males are {0} and {1} respectively'.format(n_females, n_males))

The numbers of female and male smokers are 115 and 159 respectively
The total numbers of females and males are 662 and 676 respectively


In [21]:
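# compute the sample proportions from the counts above instead of hard-coding them
print(f' The proportions of smokers in females and males are {round(female_smokers/n_females, 2)}, {round(male_smokers/n_males, 2)} respectively')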

 The proportions of smokers in females and males are 0.17, 0.24 respectively


* The proportions in the sample are different. Let's conduct the test to see if this difference is significant.
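Before running the z-test, it can be worth confirming that the normal approximation is reasonable. The sketch below applies a common rule of thumb (at least 10 smokers and 10 non-smokers in each group). The threshold of 10 is an assumption, since some texts use 5, and the counts are copied from the printed output above rather than recomputed from the DataFrame.

```python
# Counts copied from the printed output above (not recomputed from the CSV)
groups = {
    'female': {'smokers': 115, 'total': 662},
    'male': {'smokers': 159, 'total': 676},
}

# Rule-of-thumb threshold (assumption): 10 is a common convention, some texts use 5
MIN_COUNT = 10

checks = {}
for name, g in groups.items():
    non_smokers = g['total'] - g['smokers']
    # both the "success" and "failure" counts should be large enough
    checks[name] = g['smokers'] >= MIN_COUNT and non_smokers >= MIN_COUNT
    print(f"{name}: smokers={g['smokers']}, non-smokers={non_smokers}, ok={checks[name]}")
```

Both groups comfortably exceed the threshold, so a z-test is a reasonable choice here.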

## Step 5: Calculate the p-value

In [22]:
# find the p-value using proportions_ztest
stat, pval = proportions_ztest([female_smokers, male_smokers], [n_females, n_males], alternative='two-sided')

# print the p-value
print('The p-value is '+ str(pval))

The p-value is 0.005324114164320532
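As a sanity check, the same p-value can be reproduced by hand with only the standard library. This is a minimal sketch that assumes the pooled-variance form of the two-proportion z-test. The variable names (`x_f`, `n_f`, and so on) are illustrative, and the counts are copied from the output of the data preparation step.

```python
from math import sqrt, erfc

# Counts copied from the data preparation step
x_f, n_f = 115, 662   # female smokers, total females
x_m, n_m = 159, 676   # male smokers, total males

# Sample proportions and pooled proportion under H0
p_f = x_f / n_f
p_m = x_m / n_m
p_pooled = (x_f + x_m) / (n_f + n_m)

# Standard error of the difference under H0 (pooled variance)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_f + 1 / n_m))

# z statistic and two-sided p-value; erfc(|z|/sqrt(2)) equals 2 * P(Z >= |z|)
z = (p_f - p_m) / se
p_value = erfc(abs(z) / sqrt(2))

print(f'z = {z:.4f}, p-value = {p_value:.6f}')
```

The negative sign of `z` reflects that the sample proportion of smokers is lower among females than among males.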


## Step 6: Compare the p-value with $\alpha$

In [23]:
# print the conclusion based on p-value
if pval < 0.05:
    print(f'As the p-value {pval} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {pval} is greater than the level of significance, we fail to reject the null hypothesis.')

As the p-value 0.005324114164320532 is less than the level of significance, we reject the null hypothesis.


## Step 7: Conclusion

Since the p-value (about 0.0053) is less than the significance level of 0.05, we reject the null hypothesis. Hence, we have enough statistical evidence to say that the proportion of smokers in the female population is different from the proportion of smokers in the male population.
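The test tells us that the proportions differ, but not by how much. One way to gauge the size of the difference is a 95% Wald confidence interval for the difference in proportions, sketched below. This interval is not part of the original analysis. It uses the unpooled standard error, which is the usual choice for an interval though not the only one, and the counts are copied from the data preparation step.

```python
from math import sqrt
from statistics import NormalDist

# Counts copied from the data preparation step
x_f, n_f = 115, 662   # female smokers, total females
x_m, n_m = 159, 676   # male smokers, total males

p_f, p_m = x_f / n_f, x_m / n_m
diff = p_f - p_m  # female minus male

# Unpooled standard error (each group keeps its own variance estimate)
se_unpooled = sqrt(p_f * (1 - p_f) / n_f + p_m * (1 - p_m) / n_m)

# Critical value for a 95% two-sided interval (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)

ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled

print(f'difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})')
```

If the interval excludes 0, it agrees with the decision to reject $H_0$ at α = 0.05.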

### Insight

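The proportion of smokers among female customers is different from the proportion of smokers among male customers in the insurance company's customer population. In the sample, about 17% of female customers smoke, compared with about 24% of male customers.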