# U.S. Medical Insurance Costs
## Scoping

### Insurance data
7 variables per record: age, sex, bmi, children, smoker, region, charges.
<br> Some age data might need stripping, but there is no missing data otherwise.
<br> The period covered by the data is unknown.

### Main questions
Interesting data to obtain might be: share of males and females, mean age, mean BMI, mean number of children, share of smokers, mean charges. This would help to build the profile of an average person in our database. Is it close to the average person in the country?
<br> How can we link the different variables together? Which ones appear to be linked to one another?
<br> Are charges consistent across geography? What is the mean persona for each region? Are there stark differences?

### Organisation of our notebook
#### 1: data formatting (import csv, create lists for the variables, define dictionaries)
#### 2: data analysis
> ##### 2.1.: General analysis of the population
> ##### 2.2.: Analysis by sex
> ##### 2.3.: Analysis by age
> ##### 2.4.: Analysis by geography
> ##### 2.5.: Linear regression

### List of all functions defined in this notebook

data_dict: creates a dictionary of all the data in our CSV, with IDs as keys.
<br> transform_quant_nb (nb = 2,4): for a list of qualitative data, returns a list of quantitative data with indexes to define.
<br> key_dict: for a given dictionary, returns a new dictionary with a chosen category of data as key.

<br> count_var_dict: for a given dictionary and key, returns the sum and length of the chosen category of data.
<br> mean_var (uses count_var_dict): returns the average value for the chosen data.

<br> unique_key: for a given list of data, returns the unique values of the data.
<br> unique_key_count (uses unique_key): for a given list of data, returns the frequency of each unique value in the list.
<br> share_unique_key (uses unique_key_count): for a given list of data, returns the percentage of each unique value in the list.
<br> dictionary_repartition_type (type = abs, pc) (uses unique_key and (unique_key_count or share_unique_key)): for a list of data, returns a dictionary of the distribution of its unique values, either in absolute value or in percentage, depending on the function.

<br> classification_nb (nb = 5): for a list of data and nb upper bounds, sorts the data points into classes.
<br> dictionary_repartition_class_nb_type (nb = 5, type = abs, pc) (uses classification_nb): returns a dictionary of the distribution by class, either in absolute value or in percentage.

## Data formatting

In [1]:
#### 1st step: Import csv
import csv

We need to have a first look at insurance.csv.

In [2]:
#### Let's define lists where we'll put the data from the csv file:
ages_data = []
sexes_data = []
bmis_data = []
children_data = []
smokers_data = []
regions_data = []
charges_data = []


In [3]:
#### Let's fill these lists with the data from the file:
with open('insurance.csv') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        ages_data.append(row['age'])
        sexes_data.append(row['sex'])
        bmis_data.append(row['bmi'])
        children_data.append(row['children'])
        smokers_data.append(row['smoker'])
        regions_data.append(row['region'])
        charges_data.append(row['charges'])

In [4]:
#### And let's build a list of IDs to "identify" our individuals:
numbers_data = []
for i in range(len(ages_data)):
    numbers_data.append(i)

In [5]:
#### Debugging
if len(numbers_data) == len(ages_data):
    print("Debugging OK")

i = 0
print(numbers_data[i], ages_data[i], sexes_data[i], bmis_data[i], children_data[i], smokers_data[i], regions_data[i], charges_data[i])

Debugging OK
0 19 female 27.9 0 yes southwest 16884.924


We now have, for each category of data, lists that contain the data.
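
Every value in these lists is still a string, which is why later cells call `int()` and `float()` repeatedly. A minimal sketch of converting values while reading, assuming the same column names as `insurance.csv`; the `SAMPLE_CSV` text and the `parse_row` helper are hypothetical and only illustrate the idea:

```python
import csv
import io

# Hypothetical two-row sample using the same header as insurance.csv.
# The second age is padded with spaces on purpose, to show the stripping.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
 18 ,male,33.77,1,no,southeast,1725.5523
"""

def parse_row(row):
    """Strip whitespace and convert the numeric fields to int or float."""
    return {
        'age': int(row['age'].strip()),
        'sex': row['sex'].strip(),
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        'smoker': row['smoker'].strip(),
        'region': row['region'].strip(),
        'charges': float(row['charges']),
    }

rows = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[1]['age'], rows[1]['charges'])
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by the opened `insurance.csv` handle.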

#### Let's create a first dictionary that has the ID as a key, and a dictionary of all the pieces of data as a value.

In [6]:
#### A function that defines the dictionary of all data:
def data_dict(ids, ages, sexes, bmis, children, smokers, regions, charges):
    indexes = ['ID', 'Age', 'Sex', 'BMI', 'Children', 'Smoker', 'Region', 'Charges']
    master_insurance = {}
    for i in range(len(ids)):
        #print(i)
        #print(ids[i], ages[i], sexes[i], bmis[i], smokers[i], regions[i], charges[i])
        data_index = [ids[i], ages[i], sexes[i], bmis[i], children[i], smokers[i], regions[i], charges[i]]
        zipped_data = zip(indexes, data_index)
        data_insurance = {index:data for index, data in zipped_data}
        master_insurance.update({ids[i]:data_insurance})
    return master_insurance

In [7]:
#### The dictionary by ID:
insurance_data = data_dict(numbers_data, ages_data, sexes_data, bmis_data, children_data, smokers_data, regions_data, charges_data)

In [8]:
#### Test
print(insurance_data[15])
#print(insurance_data)

{'ID': 15, 'Age': '19', 'Sex': 'male', 'BMI': '24.6', 'Children': '1', 'Smoker': 'no', 'Region': 'southwest', 'Charges': '1837.237'}


##### However, for ease of analysis, we'd like all of our data to be quantitative.

In [9]:
#### Function that transforms qualitative data into quantitative data (two options):

def transform_quant_2(name_list, option_0, option_1):
    name_list_nb = []
    for item in name_list:
        if item == option_0:
            name_list_nb.append('0')
        else:
            # Any value other than option_0 is coded as '1'; option_1 is not checked.
            name_list_nb.append('1')
    return name_list_nb

#### Region codes (as set in the call below): 0 for northeast, 1 for northwest, 2 for southeast, 3 for southwest

#### Function that transforms qualitative data into quantitative data (four options):
def transform_quant_4(name_list, option_0, option_1, option_2, option_3):
    name_list_nb = []
    for item in name_list:
        if item == option_0:
            name_list_nb.append('0')
        elif item == option_1:
            name_list_nb.append('1')
        elif item == option_2:
            name_list_nb.append('2')
        else:
            name_list_nb.append('3')
    return name_list_nb

#### New lists of quantitative data (smokers, sexes; regions):
smokers_data_nb = transform_quant_2(smokers_data, 'no', 'yes')            
sexes_data_nb = transform_quant_2(sexes_data, 'male', 'female')
regions_data_nb = transform_quant_4(regions_data, 'northeast', 'northwest', 'southeast', 'southwest')

In [10]:
#### Debugging
#print(regions_data_nb)

i = 0
print(numbers_data[i], ages_data[i], sexes_data_nb[i], bmis_data[i], children_data[i], smokers_data_nb[i], regions_data_nb[i], charges_data[i])

0 19 1 27.9 0 1 3 16884.924


We now have lists with fully quantitative data.
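
`transform_quant_2` and `transform_quant_4` differ only in the number of options, and both silently give the last code to any unexpected label. One possible generalisation is sketched below; the helper name `transform_quant` is hypothetical and not part of this notebook:

```python
def transform_quant(name_list, options):
    """Map each label to its position in `options`, returned as a string code."""
    codes = {label: str(index) for index, label in enumerate(options)}
    # A KeyError here flags a label that is missing from `options`.
    return [codes[item] for item in name_list]

sample_regions = ['southwest', 'southeast', 'northeast']
region_options = ['northeast', 'northwest', 'southeast', 'southwest']
print(transform_quant(sample_regions, region_options))
```

The same call shape would cover the two-option case, for instance `transform_quant(smokers_data, ['no', 'yes'])`.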
#### Let's define a new dictionary with this reformatted data.

In [11]:
### New dictionary with quantitative data only:
insurance_data_nb = data_dict(numbers_data, ages_data, sexes_data_nb, bmis_data, children_data, smokers_data_nb, regions_data_nb, charges_data)

In [12]:
print(insurance_data_nb[0])

{'ID': 0, 'Age': '19', 'Sex': '1', 'BMI': '27.9', 'Children': '0', 'Smoker': '1', 'Region': '3', 'Charges': '16884.924'}


##### What if we wish to have a dictionary with something other than 'ID' as a key?
#### Let's create a dictionary with a certain category as a key (for instance, the region), and a list of the dictionaries as a value.

In [13]:
#### A function that defines the dictionary by a given key:
def key_dict(dictionary, key_choice):
    #### The local dictionary gets its own name so that it does not shadow the function:
    grouped = {}
    for key in dictionary:
        current_key_item = dictionary[key][key_choice]
        current_record = dictionary[key]
        if current_key_item in grouped:
            grouped[current_key_item].append(current_record)
        else:
            grouped[current_key_item] = [current_record]
    return grouped

In [14]:
#### The dictionaries by regions (test):
regions_dictionary = key_dict(insurance_data, 'Region')
regions_dictionary_nb = key_dict(insurance_data_nb, 'Region')

#print(regions_dictionary['southwest'])
print(regions_dictionary_nb.keys())

dict_keys(['3', '2', '1', '0'])
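
The grouping done by `key_dict` can also be written with `collections.defaultdict`, which removes the explicit membership test. A minimal sketch on a hypothetical three-person sample shaped like `insurance_data_nb`; the name `group_by` is an assumption made for illustration:

```python
from collections import defaultdict

def group_by(dictionary, key_choice):
    """Group the inner dictionaries by the value stored under `key_choice`."""
    groups = defaultdict(list)
    for record in dictionary.values():
        groups[record[key_choice]].append(record)
    return dict(groups)

# Hypothetical sample in the same shape as insurance_data_nb.
sample = {
    0: {'ID': 0, 'Region': '3', 'Charges': '16884.924'},
    1: {'ID': 1, 'Region': '2', 'Charges': '1725.5523'},
    2: {'ID': 2, 'Region': '2', 'Charges': '4449.462'},
}
grouped = group_by(sample, 'Region')
print({region: len(people) for region, people in grouped.items()})
```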


Now, we can go on to analyse the data.

## Data analysis

### A general profile of our population

In [15]:
#### Define a function that will give the running count and the total value of a list variable:
def count_var_dict(dictionary, variable_name):
    total_var = float(0)
    running_count = 0
    for key in dictionary:
        current_id = dictionary[key]
        #print(current_id[variable_name])
        running_count += 1
        total_var += float(current_id[variable_name])
    return total_var, running_count

In [16]:
#### Define a function that will compute the mean of a given variable, and formats it:
def mean_var(dictionary, variable_name, result_format):
    total_var, running_count = count_var_dict(dictionary, variable_name)
    mean_variable = result_format.format(total_var / running_count)
    return mean_variable

In [17]:
#### Use the functions so that we get the profile of an average person in our database:
mean_age = mean_var(insurance_data_nb, 'Age', '{:.3}')
share_of_women = mean_var(insurance_data_nb, 'Sex', '{:.2%}')
mean_bmi = mean_var(insurance_data_nb, 'BMI', '{:.3}')
mean_nb_children = mean_var(insurance_data_nb, 'Children', '{:.3}')
share_of_smokers = mean_var(insurance_data_nb, 'Smoker', '{:.2%}')
mean_charges = mean_var(insurance_data_nb, 'Charges', '{:,.6}')

#### Analysis
explanation = "The average person in our database is " + str(mean_age) + " years old, has a BMI of " + str(mean_bmi) + ", " + str(mean_nb_children) + " children, has a " + str(share_of_women) + " chance of being a woman and a " + str(share_of_smokers) + " chance of being a smoker, and pays $" + str(mean_charges) + " in charges."
#print(mean_age, share_of_women, mean_bmi, mean_nb_children, share_of_smokers, mean_charges)
print(explanation)

The average person in our database is 39.2 years old, has a BMI of 30.7, 1.09 children, has a 49.48% chance of being a woman and a 20.48% chance of being a smoker, and pays $13,270.4 in charges.


###### What do we observe? 
Well, 30.7 is a very high average BMI. A BMI over 30 indicates obesity, whereas a healthy BMI is between 18.5 and 24.9.
According to the CDC, the BMI of an average American adult is 26.5 (which is in the overweight range). The CDC also estimates the share of smokers among US adults at 14%, while our sample average is close to 20.5%.
<br> Therefore, our sample does not appear to be representative of the American population; it skews towards people whose health needs more attention and who are therefore probably more likely to require health insurance.
##### This means that the average charges paid by our sample are probably higher than they would be for the US population as a whole.

### General instruments of study

#### Let's see how we can split up the data so that we can make sense of it.
First, we need to be able to categorize our data with the frequency of its datapoints.

In [18]:
##### Define a function that will return the unique datapoints for a given list:
def unique_key(list_data):
    unique_key_list = []
    for item in list_data:
        if unique_key_list.count(item) == 0:
            unique_key_list.append(item)
    return sorted(unique_key_list)

#### Test
children_unique = unique_key(children_data)
print(children_unique)

['0', '1', '2', '3', '4', '5']


In [19]:
#### Define a function that will return the population for each entry:
def unique_key_count(list_data):
    unique_count_list = []
    for item in unique_key(list_data):
        unique_count_list.append(list_data.count(item))
    return unique_count_list

#### Test
nb_children_unique_count = unique_key_count(children_data)
print(nb_children_unique_count)

[574, 324, 240, 157, 25, 18]


In [20]:
#### Define a function that will return the share of total for each entry:
def share_unique_key(list_data):
    key_unique_count_pc = []
    eff = len(list_data)
    for num in unique_key_count(list_data):
        key_unique_count_pc.append('{:.2%}'.format(num / eff))
    return key_unique_count_pc

#### Test
nb_children_unique_count_pc = share_unique_key(children_data)
print(nb_children_unique_count_pc)

['42.90%', '24.22%', '17.94%', '11.73%', '1.87%', '1.35%']


In [21]:
#### Define a function that returns a dictionary of the repartition depending on the parameter, in absolute value:
def dictionary_repartition_abs(list_data):
    key_repartition_abs = {key:value for key, value in zip(unique_key(list_data), unique_key_count(list_data))}
    return key_repartition_abs

#### Define a function that returns a dictionary of the repartition depending on the parameter, in percentage:
def dictionary_repartition_pc(list_data):
    key_repartition_pc = {key:value for key, value in zip(unique_key(list_data), share_unique_key(list_data))}
    return key_repartition_pc 

#### Test
repartition_children_abs = dictionary_repartition_abs(children_data)
print(repartition_children_abs)
repartition_children_pc = dictionary_repartition_pc(children_data)
print(repartition_children_pc)

{'0': 574, '1': 324, '2': 240, '3': 157, '4': 25, '5': 18}
{'0': '42.90%', '1': '24.22%', '2': '17.94%', '3': '11.73%', '4': '1.87%', '5': '1.35%'}


However, when a variable takes many distinct values, this breakdown becomes hard to read. An example below:

In [22]:
age_repartition_abs = dictionary_repartition_abs(ages_data)
print(age_repartition_abs)
age_repartition_pc = dictionary_repartition_pc(ages_data)
print(age_repartition_pc)

{'18': 69, '19': 68, '20': 29, '21': 28, '22': 28, '23': 28, '24': 28, '25': 28, '26': 28, '27': 28, '28': 28, '29': 27, '30': 27, '31': 27, '32': 26, '33': 26, '34': 26, '35': 25, '36': 25, '37': 25, '38': 25, '39': 25, '40': 27, '41': 27, '42': 27, '43': 27, '44': 27, '45': 29, '46': 29, '47': 29, '48': 29, '49': 28, '50': 29, '51': 29, '52': 29, '53': 28, '54': 28, '55': 26, '56': 26, '57': 26, '58': 25, '59': 25, '60': 23, '61': 23, '62': 23, '63': 23, '64': 22}
{'18': '5.16%', '19': '5.08%', '20': '2.17%', '21': '2.09%', '22': '2.09%', '23': '2.09%', '24': '2.09%', '25': '2.09%', '26': '2.09%', '27': '2.09%', '28': '2.09%', '29': '2.02%', '30': '2.02%', '31': '2.02%', '32': '1.94%', '33': '1.94%', '34': '1.94%', '35': '1.87%', '36': '1.87%', '37': '1.87%', '38': '1.87%', '39': '1.87%', '40': '2.02%', '41': '2.02%', '42': '2.02%', '43': '2.02%', '44': '2.02%', '45': '2.17%', '46': '2.17%', '47': '2.17%', '48': '2.17%', '49': '2.09%', '50': '2.17%', '51': '2.17%', '52': '2.17%', '53

To clarify the data, we need to segment it further. 
#### Let's do the same thing as above but categorizing our data in classes beforehand.

In [23]:
#### Let's create a function that will create classes among a list of data:
def classification_5(data_list, sup1, sup2, sup3, sup4, sup5):
    data_modified = []
    for item in data_list:
        if int(item) < sup1:
            data_modified.append(sup1)
        elif int(item) < sup2:
            data_modified.append(sup2)
        elif int(item) < sup3:
            data_modified.append(sup3)
        elif int(item) < sup4:
            data_modified.append(sup4)
        elif int(item) < sup5:
            data_modified.append(sup5)
        else:
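            raise ValueError("Some values exceed the maximum bound.")
    data_classes = ['Under ' + str(sup1), 'Under ' + str(sup2), 'Under ' + str(sup3), 'Under ' + str(sup4), 'Under ' + str(sup5)]
    #### Count each class explicitly so that an empty class stays aligned with its label:
    data_classified = [data_modified.count(sup) for sup in (sup1, sup2, sup3, sup4, sup5)]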
    return data_classes, data_classified, data_modified


#### Test
new_ages_data = classification_5(ages_data, 25, 35, 45, 55, 65)[2]

In [24]:
#### Define a function that does the same thing as dictionary_repartition_abs, but with our classified data:
def dictionary_repartition_class_5_abs(data_list, sup1, sup2, sup3, sup4, sup5):
    data_classes, data_classified, data_modified = classification_5(data_list, sup1, sup2, sup3, sup4, sup5)
    key_repartition_abs = {key:value for key, value in zip(data_classes, data_classified)}
    return key_repartition_abs

#### Define a function that does the same thing as dictionary_repartition_pc, but with our classified data:
def dictionary_repartition_class_5_pc(data_list, sup1, sup2, sup3, sup4, sup5):
    data_classes, data_classified, data_modified = classification_5(data_list, sup1, sup2, sup3, sup4, sup5)
    data_classified_pc = ['{:.2%}'.format(item / sum(data_classified)) for item in data_classified]
    key_repartition_pc = {key:value for key, value in zip(data_classes, data_classified_pc)}
    return key_repartition_pc

#### Test
rep_abs = dictionary_repartition_class_5_abs(ages_data, 25, 35, 45, 55, 65)
print(rep_abs)
rep_pc = dictionary_repartition_class_5_pc(ages_data, 25, 35, 45, 55, 65)
print(rep_pc)

{'Under 25': 278, 'Under 35': 271, 'Under 45': 260, 'Under 55': 287, 'Under 65': 242}
{'Under 25': '20.78%', 'Under 35': '20.25%', 'Under 45': '19.43%', 'Under 55': '21.45%', 'Under 65': '18.09%'}
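
`classification_5` is tied to exactly five bounds. A sketch of the same rule for any number of ascending bounds, using `bisect_right` from the standard library so that a value equal to a bound falls into the next class, as with the `<` tests above. The name `classify` and the sample ages are hypothetical:

```python
from bisect import bisect_right

def classify(data_list, bounds):
    """Count how many values fall under each upper bound in `bounds` (ascending)."""
    counts = [0] * len(bounds)
    for item in data_list:
        position = bisect_right(bounds, int(item))
        if position == len(bounds):
            raise ValueError('Value {} exceeds the maximum bound.'.format(item))
        counts[position] += 1
    return {'Under ' + str(bound): count for bound, count in zip(bounds, counts)}

# Hypothetical sample; note that no age falls in the 'Under 55' class.
sample_ages = ['18', '24', '25', '34', '44', '64']
print(classify(sample_ages, [25, 35, 45, 55, 65]))
```

Empty classes are reported with a count of 0 instead of being dropped.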


After having our different categories, either "natural" or defined as above, we'll need to define a profile for each of them.

#### Let's define functions that will compute the mean value of all the data categories for the chosen category.

In [25]:
#### Def counting loops:
def int_count(current_key_cane, var_key):
    #### Sum of an integer variable over a list of records:
    var = 0
    for number in current_key_cane:
        var += int(number[var_key])
    return var

def float_count(current_key_cane, var_key):
    #### Sum of a float variable over a list of records:
    var = 0
    for number in current_key_cane:
        var += float(number[var_key])
    return var

In [26]:
#### Define a function that will compute the values for each region from the dictionary:
def mean_key_profile_fun(dictionary_nb_key):
    mean_profile = {}
    for cane in sorted(dictionary_nb_key):
        print(cane)
        current_id = dictionary_nb_key[cane]
        count = len(current_id)
        average_profile_key = {}
        #### Add the mean values to the new dictionary:
        average_profile_key['Number of people'] = count 
        average_profile_key['Age'] = '{:.3}'.format(int_count(current_id, 'Age') / count)
        average_profile_key['Share of women'] = '{:.2%}'.format(int_count(current_id, 'Sex') / count)
        average_profile_key['BMI'] = '{:.3}'.format(float_count(current_id, 'BMI') / count)
        average_profile_key['Children'] = '{:.3}'.format(int_count(current_id, 'Children') / count)
        average_profile_key['Share of smokers'] = '{:.2%}'.format(int_count(current_id, 'Smoker') / count)
        average_profile_key['Region'] = int(int_count(current_id, 'Region') / count)
        average_profile_key['Charges'] = '{:,.7}'.format(float_count(current_id, 'Charges') / count)
        #explanation = "For n = " + str(average_profile_key['Number of people']) + ", the average person in the " +  str(average_profile_key['Region']) + " region is " + str(average_profile_key['Age']) + " years old, has a BMI of " + str(average_profile_key['BMI']) + ", " + str(average_profile_key['Children']) + " children, has a " + str(average_profile_key['Share of women']) + " chance of being a woman and a " + str(average_profile_key['Share of smokers']) + " chance of being a smoker, and pays $" + str(average_profile_key['Charges']) + " in charges."
        #print(explanation)
        #### Build a dictionary for the regions:
        mean_profile[cane] = average_profile_key
        print(average_profile_key)
    return mean_profile

Now, we're ready to actually study the data.

### A study of the sex variable

Is there a link between sex and the other variables? We look at sex against smoking status (do women smoke more than men?), charges (do women pay more than men?) and BMI (do women have a higher BMI than men?).

In [27]:
repartition_sexes_pc = dictionary_repartition_pc(sexes_data)
print(repartition_sexes_pc)

{'female': '49.48%', 'male': '50.52%'}


Remember: we defined earlier that 'female' = 1, 'male' = 0.
<br> Our sample's sex ratio is close to that of the US population, although the split is reversed: the US population is roughly 50.8% female and 49.2% male. With a sample of this size, a gap of about 1.3 points is within ordinary sampling variation.
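
A rough check of how far the sample share of women sits from the US figure, assuming a simple random sample and a normal approximation. The sample size of 1,338 is the sum of the counts printed in the children breakdown earlier; the variable names are hypothetical:

```python
import math

n = 1338                   # sample size (sum of the counts printed earlier)
sample_share = 0.4948      # share of women in the sample
reference_share = 0.508    # US share of women quoted above

# Standard error of a proportion if the true share were the reference share.
standard_error = math.sqrt(reference_share * (1 - reference_share) / n)
z_score = (sample_share - reference_share) / standard_error
print('standard error = {:.4f}, z = {:.2f}'.format(standard_error, z_score))
```

An absolute z-score below roughly 2 is usually read as compatible with sampling variation alone.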

In [28]:
#### The dictionary by sexes:
sexes_dictionary_nb = key_dict(insurance_data_nb, 'Sex')
mean_sex_data = mean_key_profile_fun(sexes_dictionary_nb)
print(mean_sex_data)

0
{'Number of people': 676, 'Age': '38.9', 'Share of women': '0.00%', 'BMI': '30.9', 'Children': '1.12', 'Share of smokers': '23.52%', 'Region': 1, 'Charges': '13,956.75'}
1
{'Number of people': 662, 'Age': '39.5', 'Share of women': '100.00%', 'BMI': '30.4', 'Children': '1.07', 'Share of smokers': '17.37%', 'Region': 1, 'Charges': '12,569.58'}
{'0': {'Number of people': 676, 'Age': '38.9', 'Share of women': '0.00%', 'BMI': '30.9', 'Children': '1.12', 'Share of smokers': '23.52%', 'Region': 1, 'Charges': '13,956.75'}, '1': {'Number of people': 662, 'Age': '39.5', 'Share of women': '100.00%', 'BMI': '30.4', 'Children': '1.07', 'Share of smokers': '17.37%', 'Region': 1, 'Charges': '12,569.58'}}


Women smoke much less than men on average and have a slightly lower BMI, which may explain why, despite being a bit older, they tend to pay lower charges than men.

### A study of the age variable

#### Let's find the distribution of people by age:

In [29]:
repartition_ages_pc = dictionary_repartition_pc(ages_data)
print(repartition_ages_pc)
print(rep_pc)

{'18': '5.16%', '19': '5.08%', '20': '2.17%', '21': '2.09%', '22': '2.09%', '23': '2.09%', '24': '2.09%', '25': '2.09%', '26': '2.09%', '27': '2.09%', '28': '2.09%', '29': '2.02%', '30': '2.02%', '31': '2.02%', '32': '1.94%', '33': '1.94%', '34': '1.94%', '35': '1.87%', '36': '1.87%', '37': '1.87%', '38': '1.87%', '39': '1.87%', '40': '2.02%', '41': '2.02%', '42': '2.02%', '43': '2.02%', '44': '2.02%', '45': '2.17%', '46': '2.17%', '47': '2.17%', '48': '2.17%', '49': '2.09%', '50': '2.17%', '51': '2.17%', '52': '2.17%', '53': '2.09%', '54': '2.09%', '55': '1.94%', '56': '1.94%', '57': '1.94%', '58': '1.87%', '59': '1.87%', '60': '1.72%', '61': '1.72%', '62': '1.72%', '63': '1.72%', '64': '1.64%'}
{'Under 25': '20.78%', 'Under 35': '20.25%', 'Under 45': '19.43%', 'Under 55': '21.45%', 'Under 65': '18.09%'}


In [30]:
### New dictionary with updated age data:
insurance_data_nb_age = data_dict(numbers_data, new_ages_data, sexes_data_nb, bmis_data, children_data, smokers_data_nb, regions_data_nb, charges_data)

#### The dictionary by age:
ages_dictionary_nb = key_dict(insurance_data_nb_age, 'Age')
mean_age_data = mean_key_profile_fun(ages_dictionary_nb)
print(mean_age_data)

25
{'Number of people': 278, 'Age': '25.0', 'Share of women': '48.20%', 'BMI': '30.0', 'Children': '0.604', 'Share of smokers': '21.58%', 'Region': 1, 'Charges': '9,011.34'}
35
{'Number of people': 271, 'Age': '35.0', 'Share of women': '48.71%', 'BMI': '30.1', 'Children': '1.28', 'Share of smokers': '20.66%', 'Region': 1, 'Charges': '10,352.39'}
45
{'Number of people': 260, 'Age': '45.0', 'Share of women': '49.62%', 'BMI': '30.4', 'Children': '1.49', 'Share of smokers': '23.46%', 'Region': 1, 'Charges': '13,134.17'}
55
{'Number of people': 287, 'Age': '55.0', 'Share of women': '50.17%', 'BMI': '31.1', 'Children': '1.39', 'Share of smokers': '19.16%', 'Region': 1, 'Charges': '15,853.93'}
65
{'Number of people': 242, 'Age': '65.0', 'Share of women': '50.83%', 'BMI': '31.8', 'Children': '0.682', 'Share of smokers': '17.36%', 'Region': 1, 'Charges': '18,513.28'}
{25: {'Number of people': 278, 'Age': '25.0', 'Share of women': '48.20%', 'BMI': '30.0', 'Children': '0.604', 'Share of smokers':

Unsurprisingly, the cost of insurance rises steadily with age. People under 25 pay on average $9,011, people aged 25 to 34 $10,352, people aged 35 to 44 $13,134, people aged 45 to 54 $15,854, and people aged 55 to 64 $18,513.
<br> The average BMI also increases, but only a little, as does the share of women; the share of smokers peaks in the 35 to 44 class and then goes down. Age therefore seems to be a major factor in the cost of insurance.
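
To put a number on the link between age and charges, a Pearson correlation coefficient can be computed in plain Python. The sketch below runs on the five class means printed above, which overstates the link compared with individual-level data; the function name `pearson` is an assumption:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Upper bounds of the five age classes and their mean charges, from the output above.
class_bounds = [25, 35, 45, 55, 65]
class_mean_charges = [9011.34, 10352.39, 13134.17, 15853.93, 18513.28]
print('{:.3f}'.format(pearson(class_bounds, class_mean_charges)))
```

For the individual-level figure, the same function could be applied to `ages_data` and `charges_data` after converting both to floats.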

### A study of the geography variable

Let's find the distribution of people across regions:

In [31]:
region_repartition_abs = dictionary_repartition_abs(regions_data)
print(region_repartition_abs)

{'northeast': 324, 'northwest': 325, 'southeast': 364, 'southwest': 325}


We find that the population is quite evenly distributed across regions, with a slightly larger group in the Southeast region.

In [32]:
#### Mean profile per region:
mean_per_region_nb2 = mean_key_profile_fun(regions_dictionary_nb)
print(mean_per_region_nb2)

0
{'Number of people': 324, 'Age': '39.3', 'Share of women': '49.69%', 'BMI': '29.2', 'Children': '1.05', 'Share of smokers': '20.68%', 'Region': 0, 'Charges': '13,406.38'}
1
{'Number of people': 325, 'Age': '39.2', 'Share of women': '50.46%', 'BMI': '29.2', 'Children': '1.15', 'Share of smokers': '17.85%', 'Region': 1, 'Charges': '12,417.58'}
2
{'Number of people': 364, 'Age': '38.9', 'Share of women': '48.08%', 'BMI': '33.4', 'Children': '1.05', 'Share of smokers': '25.00%', 'Region': 2, 'Charges': '14,735.41'}
3
{'Number of people': 325, 'Age': '39.5', 'Share of women': '49.85%', 'BMI': '30.6', 'Children': '1.14', 'Share of smokers': '17.85%', 'Region': 3, 'Charges': '12,346.94'}
{'0': {'Number of people': 324, 'Age': '39.3', 'Share of women': '49.69%', 'BMI': '29.2', 'Children': '1.05', 'Share of smokers': '20.68%', 'Region': 0, 'Charges': '13,406.38'}, '1': {'Number of people': 325, 'Age': '39.2', 'Share of women': '50.46%', 'BMI': '29.2', 'Children': '1.15', 'Share of smokers': '

##### Results

We observe that the Southeast region has the highest average charges, at $14,735.41.

The most striking difference is its share of smokers: at 25%, it is far higher than the 17.85% observed in the Southwest and Northwest regions, and higher even than the 20.68% in the Northeast region.

### Linear regression

A quick study of different pairs of variables: BMI and charges, age and charges, children and charges, smoking status and charges.
###### So let's define a function that will help us compare any pair we like.

In [33]:
#### Line equation (y = m * x + b):
def get_y(m, b, x):
    y = m * x + b
    return y

#### Absolute error between the line and one point:
def calculate_error(m, b, point):
    x_point, y_point = point
    diff = get_y(m, b, x_point) - y_point
    dist = abs(diff)
    return dist

#### Total absolute error over all points:
def calculate_all_error(m, b, points):
    total = 0
    for point in points:
        total += calculate_error(m, b, point)
    return total

#### Grid search for the line that minimises the total absolute error:
def linear_reg(dictionary, choice_1, choice_2, inf_ms, sup_ms, inf_bs, sup_bs):
    #### (x, y) datapoints for the two chosen variables:
    datapoints = [(float(dictionary[key][choice_1]), float(dictionary[key][choice_2])) for key in dictionary]
    #### candidate slopes (m) and intercepts (b), in steps of 10:
    possible_ms = [m * 10 for m in range(inf_ms, sup_ms)]
    possible_bs = [b * 10 for b in range(inf_bs, sup_bs)]
    #### initialization of the variables for the following loop:
    smallest_error = float("inf")
    best_m = 0
    best_b = 0
    #### loop that computes the closest approximation for the linear regression:
    for m in possible_ms:
        for b in possible_bs:
            test = calculate_all_error(m, b, datapoints)
            if test < smallest_error:
                best_m = m
                best_b = b
                smallest_error = test
    #### print the results, then return them so that the caller can reuse the fit:
    print(best_m, best_b, '{:,.9}'.format(smallest_error))
    return best_m, best_b, smallest_error

In [34]:
#### Test of the linear regression function:
lin_reg_bmi_charges = linear_reg(insurance_data, 'BMI', 'Charges', 10, 50, 400, 600)

130 5420 11,133,909.5


In [35]:
lin_reg_age_charges = linear_reg(insurance_data, 'Age', 'Charges', 10, 30, -400, -200)

270 -3230 8,975,710.34


In [36]:
lin_reg_children_charges = linear_reg(insurance_data, 'Children', 'Charges', 0, 50, 800, 1000)

140 9150 11,171,118.0


In [37]:
lin_reg_smoker_charges = linear_reg(insurance_data_nb, 'Smoker', 'Charges', 0, 50, 800, 1000)

490 9380 11,039,431.0
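
The grid search only tries slopes and intercepts in steps of 10 within the ranges passed in, so its answer depends on those ranges. In the smoker run, the slope of 490 is the largest candidate produced by `range(0, 50)`, which suggests the best slope lies outside the grid. As a cross-check, ordinary least squares has a closed-form solution. Note that it minimises squared errors, not the absolute errors used above, so the two methods will not agree exactly. The helper `least_squares` and the sample points are hypothetical:

```python
def least_squares(points):
    """Return (slope, intercept) minimising the sum of squared errors."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 250 * x + 1000.
sample_points = [(20.0, 6000.0), (30.0, 8500.0), (40.0, 11000.0), (50.0, 13500.0)]
slope, intercept = least_squares(sample_points)
print(slope, intercept)
```

On the notebook's data, the points could be built the same way as `datapoints` in `linear_reg`, for example from the 'Smoker' and 'Charges' entries of `insurance_data_nb`.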
