You are consulted by a health insurance company to analyze its insurance dataset. The goal is to produce a set of descriptive statistics. The dataset is a text file (insurance.txt) and is available under the homework folder.

The file includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year. The features are:
1. age: An integer indicating the age of the primary beneficiary (excluding those above 64 years, since they are generally covered by the government).
2. sex: The policy holder's gender, either male or female. 
3. bmi: The body mass index (BMI), which provides a sense of how over- or under-weight a person is relative to their height. BMI is equal to weight (in kilograms) divided by height (in meters) squared. An ideal BMI is within the range of 18.5 to 24.9. A person with a BMI value within the range of 25 to 29.9 is considered overweight. A person with a BMI value of 30 or above is considered obese.
4. children: An integer indicating the number of children/dependents covered by the insurance plan.
5. smoker: A yes or no categorical variable that indicates whether the insured regularly smokes tobacco. 
6. region: The beneficiary's place of residence in the US, divided into four geographic regions: northeast, southeast, southwest, or northwest.
7. expense: The total medical expenses charged to the plan for the calendar year.
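
The BMI definition and the categories in item 3 can be sketched as follows. This is only an illustration: the helper names `bmi` and `bmi_category` are hypothetical, the weights and heights are made up, and the cut-offs are assumed to sit at exactly 18.5, 25, and 30 so that values such as 24.95 are not left unclassified.

```python
import numpy as np

def bmi(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in meters squared."""
    weight_kg = np.asarray(weight_kg, dtype=float)
    height_m = np.asarray(height_m, dtype=float)
    return weight_kg / height_m ** 2

def bmi_category(bmi_values):
    """Label BMI values (hypothetical helper; cut-offs assumed at 18.5, 25, 30)."""
    labels = np.array(["underweight", "ideal", "overweight", "obese"])
    # np.digitize returns 0 for x < 18.5, 1 for 18.5 <= x < 25, and so on
    return labels[np.digitize(np.asarray(bmi_values, dtype=float), [18.5, 25.0, 30.0])]

# Made-up weights (kg) for four people who are all 1.75 m tall
values = bmi([50, 70, 85, 100], [1.75, 1.75, 1.75, 1.75])
print(values.round(1))
print(bmi_category(values))
```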

Using the numpy library, analyze the data. In particular, read the data file (numpy.loadtxt()), produce the following analyses, and store the results in a text file (numpy.savetxt()):
1.	Mean, standard deviation and median of age.
2.	Mean, standard deviation and median of BMI.
3.	Mean, standard deviation and median of BMI grouped by sex.
4.	Mean, standard deviation and median of BMI for smokers and non-smokers.
5.	Mean, standard deviation and median of BMI grouped by region.
6.	Mean, standard deviation and median of BMI of those who have more than 2 children.
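
As a rough sketch of the read, group, and save workflow requested above, the snippet below runs on a tiny in-memory sample rather than the real file. The three data rows are made up, and the layout (a header row followed by whitespace-separated columns) is an assumption about insurance.txt, not something the assignment states.

```python
import io
import numpy as np

# Hypothetical sample in the assumed layout; the real insurance.txt may differ.
sample = io.StringIO(
    "age sex bmi children smoker region expense\n"
    "19 female 27.9 0 yes southwest 16884.92\n"
    "18 male 33.8 1 no southeast 1725.55\n"
    "28 male 33.0 3 no southeast 4449.46\n"
)

data = np.loadtxt(sample, dtype=str, skiprows=1)  # skip the header row
bmi_col = data[:, 2].astype(float)

# Boolean mask: keep BMI values only for rows where sex == "male"
male_bmi = bmi_col[data[:, 1] == "male"]

summary = np.array([
    ["Group", "Mean", "Std", "Median"],
    ["BMI of Male", f"{male_bmi.mean():.2f}", f"{male_bmi.std():.2f}",
     f"{np.median(male_bmi):.2f}"],
])

# Write to an in-memory buffer here; pass a file name to write to disk instead.
out = io.StringIO()
np.savetxt(out, summary, fmt="%s")
print(out.getvalue())
```

The same masking pattern extends to smoker, region, and the number of children by changing the column index and the comparison.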

In [1]:
import numpy as np

def calculator(result, name, np_df):
    # Compute the mean, standard deviation, and median of a NumPy array, append them as a new row
    # [name, mean, std, median] to an existing 2-D array (a header row or earlier results), and return it.
    # Note: np.std uses its default ddof=0, i.e. the population standard deviation.
    df_mean = np.mean(np_df).round(2)
    df_std = np.std(np_df).round(2)
    df_median = np.median(np_df).round(2)
    result = np.append(result, np.array([[name, df_mean, df_std, df_median]]), axis = 0)
    return result

def mode(result, np_df):
    # Compute the relative frequency (in percent) of each unique value in a NumPy array, append one row
    # [value, frequency] per value to an existing 2-D array, and return it. The most frequent value is the mode.
    u_list = np.unique(np_df)
    for i in u_list:
        freq = round(np.mean(np_df == i) * 100, 1)
        result = np.append(result, np.array([[i, freq]]), axis = 0)
    return result
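
The mode function above lists the frequency of every value and leaves picking the most common one to the reader. A minimal sketch of extracting the mode directly is shown below; `most_frequent` is a hypothetical helper that is not part of the notebook, and the smoker values are made up.

```python
import numpy as np

def most_frequent(values):
    """Return the most common value and its share in percent (hypothetical helper)."""
    labels, counts = np.unique(values, return_counts=True)
    top = counts.argmax()  # on ties, the first label in sorted order wins
    return labels[top], round(counts[top] / counts.sum() * 100, 1)

smoker = np.array(["no", "yes", "no", "no", "yes"])  # made-up sample
label, share = most_frequent(smoker)
print(label, share)
```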

In [2]:
import numpy as np
# Import the data
df = np.loadtxt("insurance.txt", dtype = str)

df = df[1:] # Ignore the header
BMI = df[:,2].astype(float)

result = np.array([["Case Study", "Mean", "Std", "Median"]])
result = calculator(result, "Age", df[:,0].astype(int))
result = calculator(result, "BMI of All", df[:,2].astype(float))
result = calculator(result, "BMI of Male", BMI[df[:,1] == "male"])
result = calculator(result, "BMI of Female", BMI[df[:,1] == "female"])
result = calculator(result, "BMI of Smoker", BMI[df[:,4] == "yes"])
result = calculator(result, "BMI of Non-smoker", BMI[df[:,4] == "no"])
result = calculator(result, "BMI of NorthEast", BMI[df[:,5] == "northeast"])
result = calculator(result, "BMI of SouthEast", BMI[df[:,5] == "southeast"])
result = calculator(result, "BMI of SouthWest", BMI[df[:,5] == "southwest"])
result = calculator(result, "BMI of NorthWest", BMI[df[:,5] == "northwest"])
result = calculator(result, "BMI of > 2 Children", BMI[df[:,3].astype(int) > 2])
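
One detail of the calls above is worth keeping in mind: np.std defaults to ddof=0, which is the population standard deviation. If the beneficiaries are treated as a sample from a larger population, ddof=1 may be the more appropriate choice; which one the assignment intends is not stated. The values below are made up purely to show the difference.

```python
import numpy as np

bmi_sample = np.array([22.0, 27.5, 31.0, 35.5])  # made-up values for illustration

population_std = np.std(bmi_sample)        # ddof=0: divide by N
sample_std = np.std(bmi_sample, ddof=1)    # ddof=1: divide by N - 1

print(round(population_std, 2), round(sample_std, 2))
```

With 1,338 rows the two estimates differ very little, so the conclusions drawn from the table are unlikely to change.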

How do the following factors affect BMI? Justify your comments with supporting descriptive statistics (mean, standard deviation and median). 
1.	Smoking habit
2.	Region
3.	Children

In [3]:
# Discussion of the statistics.
print("""1. Smoking habit has little impact on BMI across the entire data set, since the mean, standard deviation, 
and median of BMI are close for smokers and non-smokers.""")
print()
print("""2. Comparing BMI across the 4 regions, people in the southeast have a larger mean, standard deviation, 
and median than the other regions. This means that people living in the southeast tend to be heavier relative to 
their height than those in other regions, and that their BMI also varies more.""")
print()
print("""3. Comparing BMI for the entire data set with those who have more than 2 children, the mean and median are 
almost the same, but people with more than 2 children show smaller variation in BMI.""")

1. Smoking habit has little impact on BMI across the entire data set, since the mean, standard deviation, 
and median of BMI are close for smokers and non-smokers.

2. Comparing BMI across the 4 regions, people in the southeast have a larger mean, standard deviation, 
and median than the other regions. This means that people living in the southeast tend to be heavier relative to 
their height than those in other regions, and that their BMI also varies more.

3. Comparing BMI for the entire data set with those who have more than 2 children, the mean and median are 
almost the same, but people with more than 2 children show smaller variation in BMI.


What are the primary reasons for the top 20% of the expenses? In particular, sort the data by expense, and compute the mean and standard deviation of BMI and the mode of smoker and region. How do these values differ from those of the remaining 80% of the population?
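
The sort-and-split step described above can be sketched on a small made-up array. The expense values are illustrative only, and rounding the group size with round() is one possible choice, not something the assignment prescribes.

```python
import numpy as np

# Made-up expenses for ten beneficiaries (illustration only)
expense = np.array([1200.0, 45000.0, 3100.0, 800.0, 27000.0,
                    5400.0, 950.0, 15000.0, 2200.0, 39000.0])

order = np.argsort(expense)[::-1]      # row indices, highest expense first
n_top = round(len(expense) * 0.2)      # size of the top 20% group
top_idx, rest_idx = order[:n_top], order[n_top:]

print(expense[top_idx])   # expenses in the top 20% group
print(len(rest_idx))      # number of rows in the remaining 80%
```

Indexing any other column with top_idx or rest_idx then yields the BMI, smoker, or region values for each group. Note that Python's built-in round() rounds exact halves to the nearest even integer, so the group size can be one smaller than expected for some row counts.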

In [4]:
# Sort the data by expense in descending order, then split it into the top 20% and the remaining 80%.
df_sort_by_expense = df[df[:,-1].astype(float).argsort()][::-1]
n_top = round(df.shape[0]*0.2) # Number of rows in the top 20%
df_20 = df_sort_by_expense[:n_top]
df_80 = df_sort_by_expense[n_top:]

# Compute the respective mean, standard deviation, and median for top 20% and rest 80%
result = calculator(result, "BMI of Top 20%", df_20[:,2].astype(float))
result = calculator(result, "BMI of Rest 80%", df_80[:,2].astype(float))

# Save the result to a .txt file.
np.savetxt("Result.txt", result, fmt = ['%19s', '%10s', '%10s', '%10s'])

# Smoking habit for top 20%.
smoke_habit_top20 = np.array([["Condition", "Frequency(%)"]])
smoke_habit_top20 = mode(smoke_habit_top20, df_20[:,4])

# Smoking habit for rest 80%.
smoke_habit_rest80 = np.array([["Condition", "Frequency(%)"]])
smoke_habit_rest80 = mode(smoke_habit_rest80, df_80[:,4])

# Region distribution for top 20%.
region_top20 = np.array([["Region", "Frequency(%)"]])
region_top20 = mode(region_top20, df_20[:,5])

# Region distribution for rest 80%.
region_rest80 = np.array([["Region", "Frequency(%)"]])
region_rest80 = mode(region_rest80, df_80[:,5])

print("""Comparing the BMI mean, standard deviation, and median for the top 20% and the rest 80%, we can see that 
people in the top 20% of insurance expenses have a larger BMI mean and median than the rest 80%. This is reasonable 
because people with a higher BMI generally have a higher chance of illness, so more medical expenses are charged to 
the plan for them.

However, to look deeper, we compare the smoking habit and region for both the top 20% and the rest 80%.

We can observe that most people in the top 20% of insurance expenses are smokers (93.8%). Meanwhile, most people in 
the rest 80% of insurance expenses are non-smokers (77.6%). This is reasonable because smokers have a higher chance 
of developing lung cancer, high blood pressure, and other health issues.

On the other hand, region is not a primary factor in the difference in insurance expenses. The largest difference between 
the top 20% and the rest 80% is in the southeast, but it is only 7%, which is not large. So region does not appear to 
be a primary factor in insurance expenses.""")

Comparing the BMI mean, standard deviation, and median for the top 20% and the rest 80%, we can see that 
people in the top 20% of insurance expenses have a larger BMI mean and median than the rest 80%. This is reasonable 
because people with a higher BMI generally have a higher chance of illness, so more medical expenses are charged to 
the plan for them.

However, to look deeper, we compare the smoking habit and region for both the top 20% and the rest 80%.

We can observe that most people in the top 20% of insurance expenses are smokers (93.8%). Meanwhile, most people in 
the rest 80% of insurance expenses are non-smokers (77.6%). This is reasonable because smokers have a higher chance 
of developing lung cancer, high blood pressure, and other health issues.

On the other hand, region is not a primary factor in the difference in insurance expenses. The largest difference between 
the top 20% and the rest 80% is in the southeast, but it is onl