# U.S. Medical Insurance Costs

---

# Look over your dataset
Download a zip file here with the necessary datasets and an empty Jupyter Notebook where you can write your code.

Open **insurance.csv** and take a look at the file. Take note of how information is organized. How will this affect how you analyze the data in Python? Is there anything of particular interest to you in the dataset that you want to investigate? Think about these things before you jump into analyzing it.

---

Some important notes about this dataset:

+ There is no missing data.
+ There are seven columns.
+ Some columns are numerical while some are categorical.

---

Before diving into the analysis, I took some time to carefully inspect the dataset provided in the `insurance.csv` file. This dataset contains information on individuals' age, sex, BMI, number of children, smoking status, region, and insurance charges. A few key observations:

- There are no missing values in the dataset, which simplifies preprocessing.
- The dataset has seven columns: four numerical (`age`, `bmi`, `children`, `charges`) and three categorical (`sex`, `smoker`, `region`).
- Each row represents an individual, giving us insight into health, lifestyle, and their corresponding insurance costs.

## Insights on Organization
The way the information is structured means numerical columns can be analyzed for correlations and trends, while categorical columns are suited for comparisons and group-based analysis. One specific observation of interest is the `smoker` column, which likely has a significant impact on `charges`. Another column worth exploring is `region`—there could be geographical differences in costs.

To fully utilize this dataset, I will need to process and segment it carefully for analysis. For example:
- Numerical columns will enable statistical and visual analysis (e.g., correlation heatmaps, scatterplots).
- Categorical columns will require grouping or encoding for comparisons.

This exploration phase helps identify potential areas of interest for the upcoming analysis.
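As a rough sketch of this inspection step, the snippet below labels each column as numerical or categorical and counts empty cells. It assumes the seven-column layout described above and uses a small inline sample so it runs on its own; `is_number` and `profile_columns` are illustrative names, not part of the project.

```python
import csv
import io

# Inline sample so the sketch runs without the file; pass
# open("insurance.csv", newline="") instead to profile the real dataset.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def profile_columns(file_obj):
    """Label each column numerical or categorical and count empty cells."""
    reader = csv.DictReader(file_obj)
    kinds = {}
    missing = {}
    for row in reader:
        for name, value in row.items():
            missing[name] = missing.get(name, 0) + (value.strip() == "")
            # A column stays numerical only while every non-empty value parses.
            if value.strip():
                kinds[name] = kinds.get(name, True) and is_number(value)
    labels = {name: "numerical" if numeric else "categorical"
              for name, numeric in kinds.items()}
    return labels, missing

column_kinds, missing_counts = profile_columns(io.StringIO(SAMPLE))
print(column_kinds)
print(missing_counts)
```

On the sample rows this should mark `age`, `bmi`, `children`, and `charges` as numerical and the other three columns as categorical, matching the split described above.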

---

# Scoping Your Project
Now that you have looked over your dataset, plan out what you want to analyze. What is it that you want to find out about this dataset? Based on the way information is organized, certain inspections may be easier to perform than others. As you map out the process, consider the scope of your analysis as well.

Properly scoping your project will greatly benefit you; scoping creates structure while requiring you to think through your entire project before you begin. You should start by stating the goals for your project, then gathering the data, and considering the analytical steps required. A proper project scope can be a great road map for your project, but keep in mind that some down-stream tasks may become dead ends which will require adjustment to the scope.

*Figure: flow chart showing that the project goals relate to the analysis and the data. Analysis relates to the data and evaluation. Evaluation relates to output.*

If you would like some inspiration, we provide some ideas in the hint.

---
Some possible ideas for analysis are the following:

+ Find out the average age of the patients in the dataset.
+ Analyze where a majority of the individuals are from.
+ Look at the different costs between smokers vs. non-smokers.
+ Figure out what the average age is for someone who has at least one child in this dataset.

The University of Chicago’s Data Science Project Scoping Guide can help as you consider the scope of your analysis.

Main components that you will want to include:

+ Goals
+ Data
+ Analysis

These are just some ideas and we hope they give you a good starting point. As you think of ideas, also consider what the implications of some of the results would be. 

For example, we may find that this dataset is mainly composed of individuals who have children or that it is imbalanced in terms of representation of males vs. females. Taking information like this into consideration when looking at data can give you insight into potential use cases as well as where certain biases can impact results.

---

## Goals
The goal of this project is to analyze the dataset and uncover patterns or trends that affect insurance charges. These insights could be useful for understanding how health, lifestyle, and demographics influence insurance costs. Specific questions I aim to investigate:
1. What is the average age of individuals in the dataset?
2. How does smoking status impact insurance charges?
3. What is the average age of individuals with at least one child?
4. Are there correlations between BMI, age, and insurance charges?
5. Are insurance charges affected by geographical location (region)?

## Process
To structure the project, I scoped it out as follows:
1. **Data Exploration**:
   - Import the dataset and inspect its contents.
   - Organize columns into numerical and categorical data for easier analysis.
2. **Analysis**:
   - Investigate relationships between numerical variables (e.g., BMI, charges) using visualizations and statistical methods.
   - Explore the effects of categorical variables (e.g., smoker status, region) on charges.
3. **Evaluation**:
   - Summarize the trends and patterns found in the dataset.
   - Assess any limitations or biases present in the dataset (e.g., imbalances in representation by region or gender).
4. **Output**:
   - Create visualizations and summaries to clearly present findings.

## Analytical Strategy
My analysis will include:
- Statistical summaries for numerical variables.
- Group comparisons for categorical variables (e.g., smokers vs. non-smokers).
- Visualizations (scatterplots, heatmaps, etc.) to highlight correlations and patterns.

By defining these goals and mapping out the steps required for analysis, I’ve created a structured approach that will guide the project effectively. As I proceed, I’ll adjust the scope if new questions or challenges arise.
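A minimal sketch of the "statistical summaries" step, assuming the standard-library `statistics` module is acceptable for this project. The helper name `summarize` is hypothetical, and the input is just the first five ages from the dataset, used as sample data.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# First five ages from the dataset, used here only as sample input.
sample_ages = [19, 18, 28, 33, 32]
age_summary = summarize(sample_ages)
print(age_summary)
```

The same helper could be applied to any numerical column once the data has been loaded into lists.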

---

# Import your dataset
Import **insurance.csv** into your Python file and inspect the contents.

---

You may need to use a library here that helps with importing .csv files. One standard library used for this is the `csv` library. If you feel stuck, the Python documentation for the `csv` module provides examples of how to read in your files.

**Note**: One of the columns contains BMI data. While insurance companies do use BMI in their calculations, and that is reflected in this project, BMI is not necessarily an accurate predictor of health. As data scientists, we should always be skeptical of quantitative measures like BMI that reduce complex phenomena to a single number.

In [27]:
import csv

# Open and read the CSV file
with open('insurance.csv', newline='') as file:  # newline='' is recommended by the csv docs
    reader = csv.reader(file)
    headers = next(reader)  # Read the header row
    rows = [row for row in reader]  # Extract the remaining rows

# Display the headers and the first 5 rows
print("Headers:", headers)
print("First 5 rows:")
for row in rows[:5]:
    print(row)

Headers: ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
First 5 rows:
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061']
['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552']


### **What This Code Does**
1. **`csv.reader(file)`:** Returns an iterator that yields each row of the CSV file as a list of strings.
2. **`next(reader)`:** Extracts the first line of the file as the headers.
3. **`rows = [row for row in reader]`:** Stores all remaining rows in a list.

This will allow you to inspect the data without needing external libraries like pandas.

---

# **Save your dataset via Python variables**

Save the features of your dataset (the columns) from **insurance.csv** by storing them in variables that can be used for analysis. As you consider what types of variables to use and how many you plan to create, think ahead about the parameters you wish to investigate and how your organization will impact this analysis.

---

You can keep using the `csv` library to store information from **insurance.csv** in Python variables. One helpful `csv` class is `DictReader`. The Python documentation shows examples of using it. With `DictReader`, you can iterate through rows of the dataset and store columns in specific lists or dictionaries that will be useful for analysis.

In [28]:
import csv

# Open and read the CSV file using DictReader
with open('insurance.csv', newline='') as file:  # newline='' is recommended by the csv docs
    reader = csv.DictReader(file)
    
    # Initialize empty lists for each column
    ages = []
    sexes = []
    bmis = []
    children_counts = []
    smokers = []
    regions = []
    charges = []

    # Iterate through rows and append data to the respective lists
    for row in reader:
        ages.append(int(row['age']))
        sexes.append(row['sex'])
        bmis.append(float(row['bmi']))
        children_counts.append(int(row['children']))
        smokers.append(row['smoker'])
        regions.append(row['region'])
        charges.append(float(row['charges']))

# Display a preview of the saved variables
print("Ages:", ages[:5])
print("Sexes:", sexes[:5])
print("BMIs:", bmis[:5])
print("Children:", children_counts[:5])
print("Smokers:", smokers[:5])
print("Regions:", regions[:5])
print("Charges:", charges[:5])

Ages: [19, 18, 28, 33, 32]
Sexes: ['female', 'male', 'male', 'male', 'male']
BMIs: [27.9, 33.77, 33.0, 22.705, 28.88]
Children: [0, 1, 3, 0, 0]
Smokers: ['yes', 'no', 'no', 'no', 'no']
Regions: ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']
Charges: [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]


### **What This Code Does**
1. **`DictReader(file)`:** Reads each row of the file as a dictionary, where keys are the column headers and values are the cell data.
2. **Lists for Each Column:** Saves each column in a separate list (`ages`, `sexes`, etc.), making it easy to analyze specific variables.
3. **Data Conversion:** Converts the numerical columns (`age`, `bmi`, `children`, `charges`) to their proper data types (e.g., `int` or `float`).
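An alternative layout, sketched under the assumption that a single dictionary of columns is more convenient than seven separate lists. The `CONVERTERS` mapping and the `load_columns` helper are hypothetical names introduced for illustration; the inline sample stands in for the real file.

```python
import csv
import io

# Assumed mapping from column name to the type it should be stored as.
CONVERTERS = {
    "age": int, "sex": str, "bmi": float, "children": int,
    "smoker": str, "region": str, "charges": float,
}

def load_columns(file_obj, converters=CONVERTERS):
    """Read a CSV into a dict of column name -> list of converted values."""
    columns = {name: [] for name in converters}
    for row in csv.DictReader(file_obj):
        for name, convert in converters.items():
            columns[name].append(convert(row[name]))
    return columns

# Two sample rows; use open("insurance.csv", newline="") for the real data.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
columns = load_columns(sample)
print(columns["age"], columns["charges"])
```

Keeping the type conversions in one mapping means a new column only needs one extra entry rather than a new list and a new `append` call.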

---

# Build out analysis functions or class methods

You now have everything you need to begin your analysis. You have organized the information from insurance.csv and have spent some time thinking about what it is you would like to investigate.

Now is the time to build out how you perform these investigations. Use the Python fundamentals you have learned so far to accomplish these tasks. There are many different ways you can achieve these analyses. In our hint, we will provide some ideas for how you can use Python to analyze data.

---

The two main options you have at your disposal are the following:

+ Build functions that perform each analysis you desire.
+ Build a class that contains methods for your analysis.

Both are excellent options and can produce clean, modular code.

---

### **Option 1: Functions for Analysis**

In [30]:
# Function to calculate the average age of individuals
def calculate_average_age(ages):
    return sum(ages) / len(ages)

# Function to calculate the distribution of regions
def region_distribution(regions):
    distribution = {}
    for region in regions:
        distribution[region] = distribution.get(region, 0) + 1
    return distribution

# Function to calculate average charges for smokers vs non-smokers
def smoker_charges_analysis(smokers, charges):
    smoker_charges = []
    non_smoker_charges = []
    # Walk the two lists in parallel instead of indexing by position
    for status, charge in zip(smokers, charges):
        if status == "yes":
            smoker_charges.append(charge)
        else:
            non_smoker_charges.append(charge)
    avg_smoker_charges = sum(smoker_charges) / len(smoker_charges) if smoker_charges else 0
    avg_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges) if non_smoker_charges else 0
    return avg_smoker_charges, avg_non_smoker_charges

# Function to calculate average age of individuals with at least one child
def average_age_with_children(ages, children):
    ages_with_children = [age for age, count in zip(ages, children) if count > 0]
    return sum(ages_with_children) / len(ages_with_children) if ages_with_children else 0

### **Example Usage with Saved Variables**
You can use the previously extracted variables (`ages`, `smokers`, `charges`, `regions`, etc.) as inputs:

In [31]:
# Calculate the average age
average_age = calculate_average_age(ages)
print("Average Age:", average_age)

# Get the distribution of regions
region_dist = region_distribution(regions)
print("Region Distribution:", region_dist)

# Analyze smoker vs non-smoker charges
avg_smoker, avg_non_smoker = smoker_charges_analysis(smokers, charges)
print("Average Charges (Smokers):", avg_smoker)
print("Average Charges (Non-Smokers):", avg_non_smoker)

# Find the average age of individuals with children
avg_age_with_kids = average_age_with_children(ages, children_counts)
print("Average Age (with at least one child):", avg_age_with_kids)

Average Age: 39.20702541106129
Region Distribution: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
Average Charges (Smokers): 32050.23183153285
Average Charges (Non-Smokers): 8434.268297856199
Average Age (with at least one child): 39.78010471204188
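
The smoker comparison above is one case of a more general pattern: averaging a numerical column within each value of a categorical column. The sketch below shows one way this could be generalized, for example toward the region question in the goals. `average_by_group` is a hypothetical helper, and the inputs are only the first five rows of the dataset.

```python
from collections import defaultdict

def average_by_group(groups, values):
    """Return the mean of values for each distinct label in groups."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(groups, values):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

# First five rows of the dataset, used only as sample input.
sample_regions = ["southwest", "southeast", "southeast", "northwest", "northwest"]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
region_averages = average_by_group(sample_regions, sample_charges)
print(region_averages)
```

Calling `average_by_group(regions, charges)` or `average_by_group(smokers, charges)` on the full lists would give the per-region and per-smoker-status averages in a single dictionary.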


---

### **Option 2: A Class with Methods**

If your analysis requires a more organized or complex structure, encapsulating functionality in a class is an excellent choice. Here’s how you can set it up:

In [32]:
class InsuranceAnalysis:
    def __init__(self, ages, sexes, bmis, children_counts, smokers, regions, charges):
        self.ages = ages
        self.sexes = sexes
        self.bmis = bmis
        self.children_counts = children_counts
        self.smokers = smokers
        self.regions = regions
        self.charges = charges

    def calculate_average_age(self):
        return sum(self.ages) / len(self.ages)

    def region_distribution(self):
        distribution = {}
        for region in self.regions:
            distribution[region] = distribution.get(region, 0) + 1
        return distribution

    def smoker_charges_analysis(self):
        smoker_charges = []
        non_smoker_charges = []
        # Walk the two lists in parallel instead of indexing by position
        for status, charge in zip(self.smokers, self.charges):
            if status == "yes":
                smoker_charges.append(charge)
            else:
                non_smoker_charges.append(charge)
        avg_smoker_charges = sum(smoker_charges) / len(smoker_charges) if smoker_charges else 0
        avg_non_smoker_charges = sum(non_smoker_charges) / len(non_smoker_charges) if non_smoker_charges else 0
        return avg_smoker_charges, avg_non_smoker_charges

    def average_age_with_children(self):
        ages_with_children = [age for age, count in zip(self.ages, self.children_counts) if count > 0]
        return sum(ages_with_children) / len(ages_with_children) if ages_with_children else 0

### **Example Usage for the Class**

In [None]:

# Create an instance of the class with your saved variables
analysis = InsuranceAnalysis(ages, sexes, bmis, children_counts, smokers, regions, charges)

# Perform analyses using the class methods
print("Average Age:", analysis.calculate_average_age())
print("Region Distribution:", analysis.region_distribution())
avg_smoker, avg_non_smoker = analysis.smoker_charges_analysis()
print("Average Charges (Smokers):", avg_smoker)
print("Average Charges (Non-Smokers):", avg_non_smoker)
print("Average Age (with at least one child):", analysis.average_age_with_children())

Average Age: 39.20702541106129
Region Distribution: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
Average Charges (Smokers): 32050.23183153285
Average Charges (Non-Smokers): 8434.268297856199
Average Age (with at least one child): 39.78010471204188


---
### **Which Should You Use?**
1. **Use Functions**:
   - If your analysis is simple and doesn't require extensive organization.
   - When you want straightforward, modular functions for specific tasks.

2. **Use a Class**:
   - If you have multiple analyses and want to group them in a single organized structure.
   - If you need to share data between methods, reducing repetitive function parameters.

---
# Project Extensions
You’re welcome to expand your analysis beyond what you have already done! Some potential extra features to add to your portfolio project are the following:

+ Organize your findings into dictionaries, lists, or another convenient datatype.
+ Make predictions about what features are the most influential for an individual’s medical insurance charges based on your analysis.
+ Explore areas where the data may include bias and how that would impact potential use cases.

Congrats on completing your portfolio project!

## Organizing Findings
To make my findings more structured, I organized key results into dictionaries. This helps keep the analysis modular and clear, while also making it easier to reference results for further exploration or presentation.

In [34]:
# Dictionary to store findings
analysis_results = {
    "average_age": calculate_average_age(ages),
    "region_distribution": region_distribution(regions),
    "smoker_vs_non_smoker_charges": {
        "avg_smoker": avg_smoker,
        "avg_non_smoker": avg_non_smoker
    },
    "average_age_with_children": average_age_with_children(ages, children_counts)
}

# Print organized results for clarity
for key, value in analysis_results.items():
    print(f"{key}: {value}")


average_age: 39.20702541106129
region_distribution: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
smoker_vs_non_smoker_charges: {'avg_smoker': 32050.23183153285, 'avg_non_smoker': 8434.268297856199}
average_age_with_children: 39.78010471204188


This dictionary approach streamlines the findings and ensures all insights are stored in one place. It’s a great way to transition into further analysis or creating visual outputs.

## Project Results

### Summary of Findings

#### Average Age
The average age of individuals in the dataset is approximately **39.21 years**. This provides insight into the general age demographic covered by the dataset.

#### Region Distribution
The distribution of individuals by region is as follows:
- **Southwest**: 325
- **Southeast**: 364
- **Northwest**: 325
- **Northeast**: 324

The Southeast region contains the highest number of individuals, which might slightly influence averages or trends observed in the dataset.

#### Insurance Charges: Smokers vs Non-Smokers
The average charges for smokers and non-smokers highlight a significant disparity:
- **Smokers**: \$32,050.23
- **Non-Smokers**: \$8,434.27

Smoker status appears to be a major factor influencing insurance charges, with smokers incurring nearly four times the cost compared to non-smokers.
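
To make the "nearly four times" figure concrete, the arithmetic can be checked directly from the two averages reported above. This is only a worked calculation; the variable names are illustrative.

```python
# Averages reported by the smoker vs. non-smoker analysis above.
avg_smoker_charge = 32050.23183153285
avg_non_smoker_charge = 8434.268297856199

ratio = avg_smoker_charge / avg_non_smoker_charge
difference = avg_smoker_charge - avg_non_smoker_charge

print(f"Smokers are charged about {ratio:.2f} times as much on average")
print(f"Average difference: ${difference:,.2f}")
```

The ratio works out to roughly 3.8, so "nearly four times" is a fair summary of the gap.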

#### Average Age of Individuals with Children
The average age of individuals with at least one child is **39.78 years**, only about half a year higher than the overall average of 39.21 years. The gap is small, so parents in this dataset are not noticeably older than the dataset as a whole.

---

## Predicting Influential Features
Based on the analysis so far, smoker status appears to be a strong predictor of insurance charges. To see how the numerical features compare, I computed the correlation of each one with `charges`.

In [35]:
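# Collect the numerical column lists saved earlier into a DataFrame (requires pandas)
import pandas as pd

numeric_data = pd.DataFrame({"age": ages, "bmi": bmis, "children": children_counts, "charges": charges})

# Find the Pearson correlation between each numerical feature and charges
charges_correlation = {
    "age_correlation": numeric_data["age"].corr(numeric_data["charges"]),
    "bmi_correlation": numeric_data["bmi"].corr(numeric_data["charges"]),
    "children_correlation": numeric_data["children"].corr(numeric_data["charges"])
}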

print("Feature correlations with charges:")
for feature, correlation in charges_correlation.items():
    print(f"{feature}: {correlation}")


Feature correlations with charges:
age_correlation: 0.29900819333064776
bmi_correlation: 0.1983409688336289
children_correlation: 0.06799822684790487


## Prediction Hypothesis:
+ **Smoker status**: Charges are significantly higher for smokers compared to non-smokers, as shown by the group averages computed earlier (about \$32,050 vs. \$8,434).

+ **BMI**: Higher BMI values show a modest positive correlation with charges (about 0.20), indicating some influence on health-related costs.

The analysis suggests that smoker status should be prioritized when modeling insurance costs, with age (correlation of about 0.30) and BMI as secondary candidates.

## Project Results

### Summary of Findings

#### Feature Correlations with Insurance Charges
The following correlations were identified between numerical variables and insurance charges:
- **Age**: 0.2990
- **BMI**: 0.1983
- **Children**: 0.0680

Age shows the strongest positive correlation with charges, followed by BMI. However, the correlation for the number of children is relatively weak, suggesting it may have a limited impact on insurance costs.
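
For readers who want to stay with the standard library used in the earlier steps, the Pearson coefficient reported above can also be computed by hand. The function below is a minimal sketch of the textbook formula (covariance divided by the product of the spreads); `pearson_correlation` is a hypothetical helper, and the inputs are made-up values chosen so the result is easy to verify.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return covariance / (spread_x * spread_y)

# Tiny made-up inputs: a perfectly increasing and a perfectly decreasing pair.
increasing = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
decreasing = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(increasing, decreasing)
```

Applied to the saved lists, `pearson_correlation(ages, charges)` should agree with the age figure above up to floating-point rounding. Encoding smoker status as `[1 if s == "yes" else 0 for s in smokers]` would let the same function put that categorical feature on the same scale as the numerical ones.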

---

## Exploring Bias in Data
Although this dataset has no missing values and covers 1,338 individuals, there may be biases that affect its use in predictive modeling or real-world decision-making.

### Potential Biases:
1. **Gender Representation**: The dataset may not have balanced representation of males and females, impacting conclusions about gender-based trends.

In [36]:
male_count = sexes.count("male")
female_count = sexes.count("female")
print(f"Male count: {male_count}, Female count: {female_count}")

Male count: 676, Female count: 662


2. **Regional Bias**: Certain regions may have disproportionately higher representation, skewing average insurance costs.

In [37]:
region_counts = region_distribution(regions)
print("Region representation:", region_counts)

Region representation: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


3. **Simplified BMI Measure**: BMI, while included in the dataset, is not always an accurate predictor of health. This simplification could lead to biased conclusions about health and costs.

## Implications of Bias:
These biases could limit the generalizability of findings. For example, if smokers are underrepresented in certain regions, conclusions about smoking and geographic trends might be inaccurate. Understanding these nuances is critical for fair and accurate interpretations.
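
One way to check the smoker-by-region concern is to compute the share of smokers within each region. The sketch below assumes the `regions` and `smokers` lists have the same row order; `smoker_share_by_region` is a hypothetical helper, and the demo inputs are made up purely to show the shape of the result.

```python
from collections import defaultdict

def smoker_share_by_region(regions, smokers):
    """Return the fraction of smokers within each region."""
    totals = defaultdict(int)
    smoker_totals = defaultdict(int)
    for region, status in zip(regions, smokers):
        totals[region] += 1
        if status == "yes":
            smoker_totals[region] += 1
    return {region: smoker_totals[region] / totals[region] for region in totals}

# Made-up sample values, only to show the shape of the result.
demo_regions = ["southwest", "southwest", "southeast",
                "southeast", "southeast", "southeast"]
demo_smokers = ["yes", "no", "no", "no", "yes", "no"]
demo_shares = smoker_share_by_region(demo_regions, demo_smokers)
print(demo_shares)
```

Large differences between regions in the real data would be a signal to compare smokers and non-smokers separately before drawing geographic conclusions.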

## Project Results

### Summary of Findings

#### Gender Representation
The dataset is fairly balanced in terms of gender:
- **Male**: 676 individuals
- **Female**: 662 individuals

This balance reduces the likelihood of gender bias impacting the analysis.

#### Region Representation
The regions are fairly evenly represented, although the Southeast has roughly 12% more individuals than each of the other regions:
- Southwest: 325 individuals
- Southeast: 364 individuals
- Northwest: 325 individuals
- Northeast: 324 individuals

This uniformity in representation enhances the reliability of geographic comparisons.
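
Expressing the counts as percentages makes the balance easier to judge. The snippet below is a small sketch using the region counts reported above; `shares` is a hypothetical helper name.

```python
# Region counts reported in the summary above.
region_counts_reported = {"southwest": 325, "southeast": 364,
                          "northwest": 325, "northeast": 324}

def shares(counts):
    """Convert a dict of counts into a dict of percentages of the total."""
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}

region_shares = shares(region_counts_reported)
for region, share in region_shares.items():
    print(f"{region}: {share:.1f}%")
```

Each region holds close to a quarter of the 1,338 rows, with the Southeast slightly above the others. The same helper could be applied to the male and female counts.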

---

### Implications
1. **Smoker Status**: Strongly influences charges, highlighting potential avenues for insurance cost reduction through health interventions or smoking cessation programs.
2. **BMI**: While correlated with charges, its role as a predictor may be limited due to its simplification of health metrics.
3. **Age Trends**: Older individuals tend to incur higher charges, which aligns with typical health-related expenses as age increases.

Exploring these patterns can provide actionable insights for insurers and individuals alike.

---
