# U.S. Medical Insurance Costs

- The purpose of this project is to analyze the population in the 'insurance.csv' file and inspect how different
  factors, such as number of children, smoking and BMI, can impact insurance costs.

  To that end, an exploratory analysis using functions, dictionaries, loops and basic Python arithmetic is carried out over the course
  of the project, sorting the data for clarity and effectiveness.

  Lastly, the data is analyzed, resulting in conclusions about the data, possible trends regarding health factors and costs, and
  possible bias in the data collected.

First, the CSV file is imported and the `dataset` dictionary is created to make the rest of the work easier.

In [1]:
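import csv
with open('insurance.csv') as insurance_file:
    # Call the reader `reader` rather than `dict` so the built-in dict type is not shadowed
    reader = csv.DictReader(insurance_file, delimiter=',')

    # Map a running row index to each row; DictReader yields every value as a string
    dataset = {}
    for i, row in enumerate(reader):
        dataset[i] = row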
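
As a rough illustration of the structure built above, the sketch below runs the same steps on a two-row in-memory CSV instead of `insurance.csv`. The sample rows and the names `sample_csv` and `sample_dataset` are made up for the example. It also shows that `csv.DictReader` returns every value as a string, which is why the later cells convert with `int()` and `float()`.

```python
import csv
import io

# Hypothetical two-row sample with the same columns as insurance.csv.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

# enumerate() supplies the running index used as the dictionary key.
sample_dataset = {i: row for i, row in enumerate(csv.DictReader(sample_csv))}

print(sample_dataset[0]["age"])        # the age of the first row
print(type(sample_dataset[0]["age"]))  # a string, not a number
```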



Analysis:
* `Average age` of the patients

In [2]:
def avg_age(dataset):
    sum_ages = 0
    for subdict in dataset.values():
        sum_ages += int(subdict['age'])
    average = sum_ages / len(dataset.keys())
    return average

average_age = avg_age(dataset)
print("The average age is", average_age)

def age_spread(dataset):
    # Collect every age as a float so max() and min() compare numbers, not strings
    ages = []
    for subset in dataset.values():
        ages.append(float(subset['age']))
    
    max_age = max(ages)
    min_age = min(ages)
    print("Max age:", max_age)
    print("Min age", min_age)

age_spread(dataset)

The average age is 39.20702541106129
Max age: 64.0
Min age 18.0
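
The mean, minimum and maximum alone say little about how the ages are spread. A minimal sketch using the standard `statistics` module, run on a made-up list of ages rather than the real column, shows how the median and standard deviation could be added to this check:

```python
import statistics

# Hypothetical ages; in the notebook these could come from
# [int(row['age']) for row in dataset.values()].
sample_ages = [18, 23, 31, 39, 45, 52, 64]

mean_age = statistics.mean(sample_ages)
median_age = statistics.median(sample_ages)
stdev_age = statistics.stdev(sample_ages)  # sample standard deviation

print("Mean:", mean_age)
print("Median:", median_age)
print("Standard deviation:", round(stdev_age, 2))
```

A median close to the mean would support the idea that the ages are not strongly skewed toward either end of the range.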


* `Number of men and women`

In [3]:
def genre(dataset):
    men = 0
    women = 0
    for subdict in dataset.values():
        if subdict['sex'] == 'male':
            men += 1
        else: women += 1
    return men, women

men, women = genre(dataset)
print("Number of men:", men)
print("Number of women:", women)

# Here the population is split into two datasets (male_data, female_data) for further analysis
def male_divider(dataset):
    men_dataset = {}
    i=0
    for subdict in dataset.values():
        if subdict['sex'] == 'male':
            men_dataset.update({i:subdict})
            i +=1        
    return men_dataset

def female_divider(dataset):
    female_dataset = {}
    i=0
    for subdict in dataset.values():
        if subdict['sex'] == 'female':
            female_dataset.update({i:subdict})
            i +=1        
    return female_dataset

male_data = male_divider(dataset)
female_data = female_divider(dataset)



Number of men: 676
Number of women: 662


* `Fertility by men / women`

In [4]:
def fertility(dataset):
    fertility = 0
    for subdict in dataset.values():
        fertility += int(subdict['children'])
    return fertility

men_fertility = fertility(male_data)
women_fertility = fertility(female_data)

print("Men fertility:", men_fertility)
print("Women fertility:", women_fertility)


avg_children_men = men_fertility / men
avg_children_women = women_fertility / women

print("Average nº of children of men:", avg_children_men)
print("Average nº of children of women:", avg_children_women)

# Rough bias check: the extra children counted among men, divided by the number of extra men
children_diff = men_fertility - women_fertility
men_surplus = men - women
print("Average children of the", men_surplus, "surplus men:", children_diff / men_surplus)

Men fertility: 754
Women fertility: 711
Average nº of children of men: 1.1153846153846154
Average nº of children of women: 1.0740181268882176
Average children of the 14 surplus men: 3.0714285714285716


* Costs by `number of children`

In [5]:
def children_cost_classifier(dataset):
    children_costs = {"0":0, "1":0, "2":0, "3":0, "+4":0}
    parents_counter = {"0":0, "1":0, "2":0, "3":0, "+4":0}
    for subdict in dataset.values():
        if subdict['children'] == '0':
            children_costs["0"] += float(subdict['charges'])
            parents_counter["0"] += 1
        elif subdict['children'] == '1':
            children_costs["1"] += float(subdict['charges'])
            parents_counter["1"] += 1
        elif subdict['children'] == '2':
            children_costs["2"] += float(subdict['charges'])
            parents_counter["2"] += 1
        elif subdict['children'] == '3':
            children_costs["3"] += float(subdict['charges'])
            parents_counter["3"] += 1
        else:
            children_costs["+4"] += float(subdict['charges'])
            parents_counter["+4"] += 1
    return children_costs, parents_counter

children_cost, parent_counter = children_cost_classifier(dataset)

print(children_cost, parent_counter)

def avg_children_cost(total_costs, parent_counts):
    # Both dictionaries share the same keys, so each total is divided by its own group size
    for key, cost in total_costs.items():
        average_cost = cost / parent_counts[key]
        print("Average cost for", key, "children:", average_cost)

# The function only prints, so its return value is not stored
avg_children_cost(children_cost, parent_counter)
        

{'0': 7098069.995338997, '1': 4124899.673449997, '2': 3617655.296149999, '3': 2410784.983589999, '+4': 504415.04222999985} {'0': 574, '1': 324, '2': 240, '3': 157, '+4': 43}
Average cost for 0 children: 12365.975601635882
Average cost for 1 children: 12731.171831635793
Average cost for 2 children: 15073.563733958328
Average cost for 3 children: 15355.31836681528
Average cost for +4 children: 11730.582377441857


* Classification by `regions - total costs per region`

In [6]:
def population_classifier(dataset):
    population = {"southwest":[0,0], "southeast":[0,0], "northwest":[0,0], "northeast":[0,0]}
    for subset in dataset.values():
        if subset['region'] == 'southwest':
            population["southwest"][0] += 1
            population["southwest"][1] += float(subset['charges'])
        elif subset['region'] == 'southeast':
            population["southeast"][0] += 1
            population["southeast"][1] += float(subset['charges'])
        elif subset['region'] == 'northwest':
            population["northwest"][0] += 1
            population["northwest"][1] += float(subset['charges'])
        else:
            population["northeast"][0] += 1
            population["northeast"][1] += float(subset['charges'])
    return population
        
regions = population_classifier(dataset)
print(regions)

{'southwest': [325, 4012754.647620001], 'southeast': [364, 5363689.763290002], 'northwest': [325, 4035711.9965399993], 'northeast': [324, 4343668.583308999]}
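
Because the regions differ in population, the totals are easier to compare once divided by the number of people in each region. The sketch below does that with the figures printed above; `region_totals` and `average_by_region` are names introduced only for this example.

```python
# Population and total charges per region, copied from the output above.
region_totals = {
    "southwest": [325, 4012754.647620001],
    "southeast": [364, 5363689.763290002],
    "northwest": [325, 4035711.9965399993],
    "northeast": [324, 4343668.583308999],
}

# Divide each region's total by its population to get a per-person average.
average_by_region = {
    region: total / population
    for region, (population, total) in region_totals.items()
}

for region, average in average_by_region.items():
    print(region, round(average, 2))
```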


* `Smokers` Analysis:

In [7]:
def smoker_classifier(dataset):
    # "min cost" starts at +infinity (instead of an arbitrary value) so the first charge always replaces it
    smokers = {"population":0, "min cost":float("inf"), "max cost":0, "total cost":0, "average cost":0}
    non_smokers = {"population":0, "min cost":float("inf"), "max cost":0, "total cost":0, "average cost":0}
    for subset in dataset.values():
        # Convert the charge once and pick the dictionary that matches the smoker flag
        charge = float(subset['charges'])
        group = smokers if subset['smoker'] == 'yes' else non_smokers
        group["population"] += 1
        group["total cost"] += charge
        group["max cost"] = max(group["max cost"], charge)
        group["min cost"] = min(group["min cost"], charge)

    smokers["average cost"] = smokers["total cost"] / smokers["population"]
    non_smokers["average cost"] = non_smokers["total cost"] / non_smokers["population"]
    
    return smokers, non_smokers

smokers_classification, non_s_classification = smoker_classifier(dataset)

print("Smokers classification:", smokers_classification) 
print("Non smokers classification:", non_s_classification)

non_s_percentage = (non_s_classification['population'] / (men + women)) * 100
print(non_s_percentage) 
    



Smokers classification: {'population': 274, 'min cost': 12829.4551, 'max cost': 63770.42801, 'total cost': 8781763.52184, 'average cost': 32050.23183153285}
Non smokers classification: {'population': 1064, 'min cost': 1121.8739, 'max cost': 36910.60803, 'total cost': 8974061.468918996, 'average cost': 8434.268297856199}
79.52167414050822


* `BMI` Analysis:

In [8]:
def bmi_classifier(dataset, population):
    # "max BMI" and "min BMI" each hold [BMI extreme, charge extreme] for the group. The two values are
    # computed independently, so the charge is not necessarily that of the person with that BMI.
    bmi_classification = {"average BMI":0, "max BMI":[0,0], "min BMI":[0,0], "average charge": 0}
    total_bmi = 0
    total_charge = 0
    bmis = []
    charges = []

    for subdict in dataset.values():
        total_bmi += float(subdict['bmi'])
        total_charge += float(subdict['charges'])
        bmis.append(float(subdict['bmi']))
        charges.append(float(subdict['charges']))
    
    bmi_classification['max BMI'][0] = max(bmis)
    bmi_classification['max BMI'][1] = max(charges)
    bmi_classification['min BMI'][0] = min(bmis)
    bmi_classification['min BMI'][1] = min(charges)
    # `population` is the number of people in the group (men or women)
    bmi_classification['average BMI'] = total_bmi / population
    bmi_classification['average charge'] = total_charge / population

    return bmi_classification

men_bmi = bmi_classifier(male_data, men)
women_bmi = bmi_classifier(female_data, women)

print("Men classification:", men_bmi)
print("Women classification:", women_bmi)


Men classification: {'average BMI': 30.943128698224832, 'max BMI': [53.13, 62592.87309], 'min BMI': [15.96, 1121.8739], 'average charge': 13956.751177721886}
Women classification: {'average BMI': 30.377749244713023, 'max BMI': [48.07, 63770.42801], 'min BMI': [16.815, 1607.5101], 'average charge': 12569.57884383534}
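
Note that `bmi_classifier` takes the BMI extremes and the charge extremes independently, so the second value in each pair is the group's highest or lowest charge, not the charge of the person with that BMI. One possible way to pair each BMI extreme with that person's own charge is sketched below; `bmi_extremes` and the three sample records are hypothetical.

```python
def bmi_extremes(dataset):
    """Return (bmi, charge) for the records with the highest and lowest BMI."""
    records = list(dataset.values())
    # key= makes max()/min() compare records by BMI and return the whole record.
    highest = max(records, key=lambda row: float(row['bmi']))
    lowest = min(records, key=lambda row: float(row['bmi']))
    return (
        (float(highest['bmi']), float(highest['charges'])),
        (float(lowest['bmi']), float(lowest['charges'])),
    )

# Made-up records in the same string format that csv.DictReader produces.
sample = {
    0: {'bmi': '22.5', 'charges': '9000.0'},
    1: {'bmi': '41.0', 'charges': '1500.0'},
    2: {'bmi': '17.2', 'charges': '30000.0'},
}

max_pair, min_pair = bmi_extremes(sample)
print("Max BMI and its charge:", max_pair)
print("Min BMI and its charge:", min_pair)
```

In the sample, the highest BMI comes with the lowest charge, which is exactly the kind of case the independent `max()` calls would hide.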


## Conclusions:

1. #### Age distribution:
    The ages in the population range from 18 to 64 years with an average of 39.2, close to the midpoint of that range (41), which suggests a reasonably balanced age distribution for the analysis.

2. #### Gender distribution:
    The data is well balanced, with 676 males and 662 females, which gives a similar share of each gender in the population.



3. #### Fertility:
    The fertility analysis produced the following data:

    * Male:          Average number of children = 1.12          Total children of the male population = 754
    * Female:        Average number of children = 1.07          Total children of the female population = 711

        * Average children from the male surplus: 3.07 per male  (surplus of 14 males)
    
    From this data, men in the population have slightly more children on average than women. The surplus figure also seems plausible, so there doesn't appear to be any
    bias regarding this data.



4. #### Family Costs:
    The population is classified by number of children, from 0 to 4 or more ("+4"), to see how this impacts insurance costs.

    * Average Costs:
        * 0 children = 12366
        * 1 child = 12731
        * 2 children = 15074
        * 3 children = 15355
        * +4 children = 11731
    
    Average insurance costs rise with the number of children up to 3 and then clearly drop for +4 children. This could be due to government
    aid that favors and helps big families, although only 43 people fall in the +4 group, so that average is less reliable.


5. #### Regions:
    First the population is divided by region to see whether the data is spread evenly, and then the total insurance charge per region is analyzed.

    * Population:   Southwest = 325       Southeast = 364       Northwest = 325       Northeast = 324
    * Costs:        Southwest = 4012755        Southeast = 5363690        Northwest = 4035712        Northeast = 4343669 

    The population is spread fairly evenly, with somewhat more participants from the Southeast region, which is also the region with the highest total insurance charge. Its average charge per person (about 14735) is the highest of the four regions as well.


6. #### Smokers:
    The analysis is conducted to see the health trend in the population and how strongly smoking impacts insurance costs.

    * Smokers: 274      Average cost = 32050         Max cost = 63770 
    * Non smokers: 1064      Average cost = 8434         Max cost = 36911 

    From this data it is concluded that the population is mostly healthy in this respect, with about 79.5% of the population being non-smokers.
    It is also clear that smoking is strongly associated with higher insurance costs: the average cost for smokers is roughly 3.8 times that of non-smokers, making it the most impactful factor in the whole analysis.


7. #### BMI:
    The population is again divided into males and females, to analyze any possible trends in BMI and whether it correlates with higher costs.

    * Male:
        Average BMI = 30.94     Average cost = 13957
        * Min BMI = 15.96       Min cost = 1122
        * Max BMI = 53.13       Max cost = 62593
    
    * Female:
        Average BMI = 30.38     Average cost = 12570
        * Min BMI = 16.815       Min cost = 1608
        * Max BMI = 48.07       Max cost = 63770

    The costs listed next to the minimum and maximum BMI are the lowest and highest charges within each group, not the charges of the individuals with those BMI values.

    From this data it is plausible to say that there is no marked difference in BMI between genders; it is distributed similarly for males and females.
    BMI is also considered the second most impactful variable on insurance costs, although this split by gender does not measure that effect directly.
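
To check how BMI relates to cost more directly, charges could be averaged per BMI range rather than per gender. The sketch below assumes the commonly used adult cut-offs (18.5, 25 and 30) and runs on made-up records; `bmi_category` and `average_charge_by_bmi` are illustrative names, not part of the analysis above.

```python
def bmi_category(bmi):
    """Map a BMI value to a range label (assumed cut-offs: 18.5, 25, 30)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def average_charge_by_bmi(dataset):
    """Average the charges of all records that fall in each BMI range."""
    totals = {}  # label -> [number of people, summed charges]
    for row in dataset.values():
        entry = totals.setdefault(bmi_category(float(row['bmi'])), [0, 0.0])
        entry[0] += 1
        entry[1] += float(row['charges'])
    return {label: total / count for label, (count, total) in totals.items()}

# Made-up records in the same string format that csv.DictReader produces.
sample = {
    0: {'bmi': '22.0', 'charges': '4000.0'},
    1: {'bmi': '24.0', 'charges': '6000.0'},
    2: {'bmi': '33.5', 'charges': '12000.0'},
}
bmi_averages = average_charge_by_bmi(sample)
print(bmi_averages)
```

Running the same function on `dataset` would give one average charge per BMI range, which could then be compared with the smoker and non-smoker averages.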



After analyzing all the variables, no apparent bias was found, so it is concluded that the data is reliable.

