# U.S. Medical Insurance Costs

This data analysis project investigates how medical insurance costs differ across groups of the population. Using Python fundamentals, I explore the insurance.csv dataset and draw insights from it.

In [2]:
# Import the csv module from the standard library
import csv

As the next step, I will utilize the **csv.DictReader** class, which enables me to read a CSV file and access the data using dictionary keys. This will facilitate easier manipulation and analysis of the data.

In [3]:
# Specify the CSV file path
csv_file_path = 'insurance.csv'

# Create an empty list to store the patient records
data = []
# Open the CSV file in read mode
with open(csv_file_path, 'r', newline='') as file:
    # Create a DictReader object
    csv_dict_reader = csv.DictReader(file)

    # Iterate through each row in the CSV file
    for row in csv_dict_reader:
        # Access each value using dictionary keys
        age = row['age']
        sex = row['sex']
        bmi = row['bmi']
        children = row['children']
        smoker = row['smoker']
        region = row['region']
        charges = row['charges']

        # Create a dictionary representing a patient record
        record = {
            'age': age,
            'sex': sex,
            'bmi': bmi,
            'children': children,
            'smoker': smoker,
            'region': region,
            'charges': charges
        }

        # Append the record to the data list
        data.append(record)


**insurance.csv** contains the following columns:

* ***age***: age of primary beneficiary

* ***sex***: gender of the insurance contractor (female or male)

* ***bmi***: body mass index, an objective measure of body weight relative to height (weight in kg divided by height in m squared, kg / m ^ 2); the ideal range is 18.5 to 24.9

* ***children***: number of children covered by health insurance / number of dependents

* ***smoker***: whether the beneficiary smokes (yes or no)

* ***region***: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)

* ***charges***: Individual medical costs billed by health insurance
  
There are no signs of missing data. 
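The missing-data observation can also be checked programmatically. The following is a minimal sketch, assuming the rows are dictionaries as produced by `csv.DictReader`; the helper name `count_missing` and the two sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io


def count_missing(records):
    """Count empty or absent values per column in a list of row dictionaries."""
    missing = {}
    for record in records:
        for column, value in record.items():
            # csv.DictReader fills short rows with None; blank cells arrive as ''
            if value is None or (isinstance(value, str) and value.strip() == ''):
                missing[column] = missing.get(column, 0) + 1
    return missing


# Hypothetical two-row sample in the same layout as insurance.csv
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,,1,no,southeast,1725.5523\n"
)
sample_rows = list(csv.DictReader(sample_csv))
print(count_missing(sample_rows))
```

Calling `count_missing(data)` on the full dataset should return an empty dictionary if no values are missing.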

Once our data is imported, neatly organized, and stored as a list of dictionaries, we can begin our analysis. First of all, we need to plan what to investigate. The following operations will be implemented:

- Find the proportions between males and females in the dataset.
- Calculate the average insurance charge for each gender.

Based on the results obtained, we can further explore the reasons behind the differences (or lack thereof) in charges between males and females. We will perform the following investigations:

- Compare the average age for each gender group.
- Calculate the difference in average insurance charges between males and females for each number of children.
- Analyze the distribution of overweight individuals (BMI from 25.0 to below 30) and obese individuals (BMI >= 30) between males and females.
- Calculate the percentage of smokers in each gender group.

By conducting these analyses, we can gain insights into potential factors influencing the differences in charges between males and females.
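The BMI thresholds in the plan above can be expressed as a small helper. This is only a sketch under the stated thresholds; the names `bmi_category` and `bmi_from_measurements` are hypothetical and are not part of the `InsuranceAnalysis` class.

```python
def bmi_category(bmi):
    """Classify a BMI value using the thresholds from the analysis plan."""
    bmi = float(bmi)  # values read from the CSV file are strings
    if bmi >= 30:
        return 'obese'
    if bmi >= 25.0:
        return 'overweight'
    return 'not overweight'


def bmi_from_measurements(weight_kg, height_m):
    """BMI is weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


print(bmi_category('27.9'))
print(round(bmi_from_measurements(70, 1.75), 2))
```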

To perform these inspections, a class called `InsuranceAnalysis` has been built, which contains six methods:
* `gender_proportions()`
* `average_charge_per_gender()`
* `average_age_per_gender()`
* `children_discrepancy()`
* `bmi_distribution()`
* `smoker_percentage()`

The class has been built out below.

In [124]:
class InsuranceAnalysis:
    def __init__(self, data):
        self.data = data
        
    # Method to calculate the proportions of males and females (as percentages) within the dataset
    def gender_proportions(self):
        total_count = len(self.data)
        male_count = sum(1 for record in self.data if record['sex'] == 'male')
        female_count = sum(1 for record in self.data if record['sex'] == 'female')
        male_proportion = male_count / total_count 
        female_proportion = female_count / total_count 
        return "The proportion of males is: {:.2%}, females: {:.2%}".format(male_proportion, female_proportion)

    # Method to find the average insurance charge for each gender
    def average_charge_per_gender(self):
        total_male_count = 0
        total_male_charge = 0
        total_female_count = 0
        total_female_charge = 0
        for record in self.data:
            if record['sex'] == 'male':
                total_male_count += 1
                total_male_charge += float(record['charges'])
            elif record['sex'] == 'female':
                total_female_count += 1
                total_female_charge += float(record['charges'])
        average_male_charge = total_male_charge / total_male_count
        average_female_charge = total_female_charge / total_female_count
        return "Average charge fo males:", round(average_male_charge,2), "females: ", round(average_female_charge,2)

    # Method to investigate the average age among males and females
    def average_age_per_gender(self):
        total_male_count = 0
        total_male_age = 0
        total_female_count = 0
        total_female_age = 0
        for record in self.data:
            if record['sex'] == 'male':
                total_male_count += 1
                total_male_age += int(record['age'])
            elif record['sex'] == 'female':
                total_female_count += 1
                total_female_age += int(record['age'])
        average_male_age = total_male_age / total_male_count
        average_female_age = total_female_age / total_female_count
        return "Average age of males is:", round(average_male_age), "females: ", round(average_female_age)

    # Method to analyze the impact of the number of children on insurance charges for males and females
    def children_discrepancy(self):
        children_discrepancy = {}
        
        # Group the data by the number of children
        children_groups = {}
        for record in self.data:
            children = record['children']
            if children not in children_groups:
                children_groups[children] = {'male_charges': [], 'female_charges': []}
            if record['sex'] == 'male':
                children_groups[children]['male_charges'].append(float(record['charges']))
            elif record['sex'] == 'female':
                children_groups[children]['female_charges'].append(float(record['charges']))
        
        # Calculate the discrepancy of insurance charges between males and females for each group
        for children, charges in children_groups.items():
            male_average_charge = sum(charges['male_charges']) / len(charges['male_charges'])
            female_average_charge = sum(charges['female_charges']) / len(charges['female_charges'])
            children_discrepancy[children] = female_average_charge - male_average_charge
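
        # Sort once, after the loop, by number of children in ascending order
        # (the keys are strings read from the CSV file, so compare them as integers)
        sorted_discrepancy = dict(sorted(children_discrepancy.items(), key=lambda item: int(item[0])))

        return sorted_discrepancy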
    
    # Method to find the proportions of overweight and obese individuals within each gender
    def bmi_distribution(self):
        overweight_male_count = 0
        obese_male_count = 0
        total_male_count = 0
        overweight_female_count = 0
        obese_female_count = 0
        total_female_count = 0
        for record in self.data:
            if record['sex'] == 'male':
                total_male_count += 1
                bmi = float(record['bmi'])
                if 25.0 <= bmi < 30:
                    overweight_male_count += 1
                elif bmi >= 30:
                    obese_male_count += 1
            if record['sex'] == 'female':
                total_female_count += 1
                bmi = float(record['bmi'])
                if 25.0 <= bmi < 30:
                    overweight_female_count += 1
                elif bmi >= 30:
                    obese_female_count += 1
        overweight_male_proportion = overweight_male_count / total_male_count
        obese_male_proportion = obese_male_count / total_male_count
        overweight_female_proportion = overweight_female_count / total_female_count
        obese_female_proportion = obese_female_count / total_female_count
        return "The proportion of overweight between male and female is {:.2%} and {:.2%} respectively".format(overweight_male_proportion,overweight_female_proportion), "The proportion of obesity between male and female is {:.2%} and {:.2%} respectively".format(obese_male_proportion,obese_female_proportion)
        
    # Method to explore the prevalence of smokers among males and females
    def smoker_percentage(self):
        total_male_count = 0
        smoker_male_count = 0
        total_female_count = 0
        smoker_female_count = 0
        for record in self.data:
            if record['sex'] == 'male':
                total_male_count += 1
                if record['smoker'] == 'yes':
                    smoker_male_count += 1
            if record['sex'] == 'female':
                total_female_count += 1
                if record['smoker'] == 'yes':
                    smoker_female_count += 1
        smoker_male_percentage = (smoker_male_count / total_male_count)
        smoker_female_percentage = (smoker_female_count / total_female_count)
        return "Percentage of smokers among males is {:.2%}, females: {:.2%}".format(smoker_male_percentage,smoker_female_percentage)
        

The next step is to create an instance of the class, named `insurance`. With this instance, each method can be called to see the results of the analysis.

In [125]:
# Create an instance of the InsuranceAnalysis class with the data argument
insurance = InsuranceAnalysis(data)

In [37]:
insurance.gender_proportions()

'The proportion of males is: 50.52%, females: 49.48%'

This step of the analysis examines the gender balance in the **insurance.csv** dataset. It is important to check whether the dataset is representative of a broader population. If the dataset were used to build a classification model, an imbalanced attribute could bias that model.

The results indicate that the ***distribution of genders in the sample is nearly equal***, which is a positive factor as it helps in creating a more representative and unbiased classification model.

In [60]:
insurance.average_charge_per_gender()

('Average charge fo males:', 13956.75, 'females: ', 12569.58)

On average, males tend to pay slightly more than females, with a difference of approximately $1400 per year. This finding highlights ***a gender-based discrepancy in insurance charges***, suggesting that there may be factors influencing the pricing that vary between genders.
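The size of this gap follows directly from the two averages reported above. A short worked calculation, using only those reported values (the variable names are illustrative):

```python
# Averages as reported by average_charge_per_gender() above
average_male_charge = 13956.75
average_female_charge = 12569.58

# Absolute gap in dollars, and the gap relative to the female average
absolute_gap = average_male_charge - average_female_charge
relative_gap = absolute_gap / average_female_charge

print("Absolute gap: {:.2f}".format(absolute_gap))
print("Relative gap: {:.2%}".format(relative_gap))
```

Expressing the gap relative to the female average is one possible choice of baseline; using the male average instead would give a slightly smaller percentage.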

In [66]:
insurance.average_age_per_gender()

('Average age of males is:', 39, 'females: ', 40)

The average age of males is approximately 39 years, while for females it is around 40 years. Younger individuals tend to have lower insurance charges than older individuals, yet males are on average slightly younger and are still charged more. This suggests that ***age alone may not be a significant factor contributing to the gender-based difference in insurance charges***.

To delve deeper into this observation, it would be beneficial to conduct a more comprehensive analysis. This analysis should include examining the range and standard deviation of the age distribution within the 'insurance.csv' dataset. 
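Such a follow-up could look like the sketch below, which uses the standard `statistics` module. It assumes the same record layout as `data`; the function name `age_summary_per_gender` and the sample records are hypothetical and are not results from the dataset.

```python
import statistics


def age_summary_per_gender(records):
    """Return min, max, mean and standard deviation of age for each gender."""
    ages = {}
    for record in records:
        ages.setdefault(record['sex'], []).append(int(record['age']))

    summary = {}
    for sex, values in ages.items():
        summary[sex] = {
            'min': min(values),
            'max': max(values),
            'mean': statistics.mean(values),
            # The sample standard deviation needs at least two values
            'stdev': statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Hypothetical records, only to show the shape of the result
sample = [
    {'sex': 'male', 'age': '20'},
    {'sex': 'male', 'age': '40'},
    {'sex': 'female', 'age': '30'},
    {'sex': 'female', 'age': '50'},
]
result = age_summary_per_gender(sample)
print(result)
```

On the real dataset the call would be `age_summary_per_gender(data)`.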

In [126]:
insurance.children_discrepancy()

{'0': -926.9824600129268,
 '1': -1112.162043343762,
 '2': -2245.777998489488,
 '3': -2923.5623521363686,
 '4': 155.38973324675135,
 '5': 1922.3481087499986}

The result shows that the gap in average insurance costs between males and females varies with the number of children. Each value is the female average minus the male average. For groups with zero to three children the values are negative, meaning that males have higher average costs than females. For groups with four or five children the values are positive, meaning that females have higher average costs than males.

In [105]:
insurance.bmi_distribution()

('The proportion of overweight between male and female is 27.66% and 30.06% respectively',
 'The proportion of obesity between male and female is 55.18% and 50.45% respectively')

***More than 55% of males in the dataset are classified as having obesity, almost 5 percentage points more than females.*** Although being overweight is slightly more common among females (by approximately 2.4 percentage points), obesity is a significant risk factor that contributes to the development of various diseases and, accordingly, to higher insurance costs.

Obesity is known to be associated with various health risks and medical conditions, which may lead to increased healthcare costs and higher insurance charges. Therefore, the higher proportion of males with obesity could potentially explain the difference in insurance payments between genders.

In [99]:
insurance.smoker_percentage()

'Percentage of smokers among males is 23.52%, females: 17.37%'

***Males exhibit a noticeably higher smoking prevalence, more than 6 percentage points above that of females.*** The increased risk associated with smoking may result in higher insurance costs for males compared to females.
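One way to probe this further is to compare average charges by smoking status within each gender. The following is a minimal sketch, assuming the same record layout as `data`; the function name `average_charge_by_gender_and_smoker` and the sample records are hypothetical and are not taken from the dataset.

```python
def average_charge_by_gender_and_smoker(records):
    """Average charges for each (sex, smoker) combination."""
    totals = {}
    for record in records:
        key = (record['sex'], record['smoker'])
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + float(record['charges']), count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical records, only to show the shape of the result
sample = [
    {'sex': 'male', 'smoker': 'yes', 'charges': '30000'},
    {'sex': 'male', 'smoker': 'yes', 'charges': '20000'},
    {'sex': 'male', 'smoker': 'no', 'charges': '5000'},
    {'sex': 'female', 'smoker': 'yes', 'charges': '22000'},
    {'sex': 'female', 'smoker': 'no', 'charges': '4000'},
    {'sex': 'female', 'smoker': 'no', 'charges': '6000'},
]
averages = average_charge_by_gender_and_smoker(sample)
print(averages)
```

If smokers show clearly higher averages within both genders on the real data, that would support smoking prevalence as one driver of the gap.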

## Conclusion

* The analysis reveals a gender-based discrepancy in insurance charges, with ***males tending to pay slightly more than females***, approximately $1400 per year. This finding suggests that there are factors influencing the pricing that vary between genders.

* For groups with ***zero to three children, males tend to have higher costs than females***. However, for groups with ***four or five children, females tend to have higher costs than males***.
  
* Among the explored factors (age, BMI, smoking), one potential explanation for this difference is the ***tendency of males to have a higher prevalence of obesity*** compared to females. Obesity is known to be a significant risk factor for various diseases, and individuals with obesity are generally considered to be at higher risk, resulting in increased healthcare expenses and potentially higher insurance premiums.

* Another contributing factor to the gender-based discrepancy in insurance charges is the ***higher prevalence of smoking among males***. Smoking is widely recognized as a major risk factor for several health conditions, which can lead to higher healthcare costs and insurance premiums.

* By taking into account factors such as number of children, obesity and smoking prevalence, insurers can develop more accurate and fair pricing strategies.

* Moreover, promoting healthy lifestyle choices and raising awareness about the health risks of obesity and smoking can help reduce the prevalence of these risk factors among both males and females. This, in turn, may lead to improved overall health outcomes and potentially lower insurance costs for everyone.