# U.S. Medical Insurance Costs

In this project, a CSV file of medical insurance cost data is analyzed using Python fundamentals. The goal is to extract and summarize the patient information in the file, which may suggest potential use cases for the dataset.

The only library used in this project so far is `csv`. Depending on what we analyze in **insurance.csv**, more libraries might be needed later, but for now `csv` is enough.

In [1]:
# importing csv library
import csv

First, we read through **insurance.csv** to understand the data given. The following should be identified in the data:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)

Identifying these features helps us think more critically about the analysis and plan how to import the data into our Python file.
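
A quick inspection along these lines could be sketched as follows. This is only a minimal sketch, assuming the file has a header row; `inspect_csv` is a hypothetical helper name that is not part of the project, and the inline sample rows are made up to stand in for the real file.

```python
import csv
import io


def inspect_csv(file_obj):
    """Return (column_names, row_count, missing_count) for an open CSV file."""
    reader = csv.DictReader(file_obj)
    row_count = 0
    missing_count = 0
    for row in reader:
        row_count += 1
        for value in row.values():
            # a cell counts as missing if it is absent (None) or blank
            if value is None or (isinstance(value, str) and value.strip() == ""):
                missing_count += 1
    return reader.fieldnames, row_count, missing_count


# made-up sample standing in for insurance.csv (the last row has a blank bmi)
sample = io.StringIO("age,sex,bmi\n19,female,27.9\n18,male,\n")
columns, rows, missing = inspect_csv(sample)
print(columns, rows, missing)
```

For the real data, the same function could be called on a file opened with `open("insurance.csv", newline="")`.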

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. From this information, seven empty lists are created to store each individual column of data from **insurance.csv**.

In [2]:
# The columns are:
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_costs = []


Below is a function that loads the values of a given column in the csv file into the corresponding list.

In [3]:
# function to load one column of the csv data into a list
def load_data(column_list, csv_file, column_name):
    # open the csv file (newline="" is what the csv module documentation recommends)
    with open(csv_file, newline="") as file_obj:
        # read each row as a dictionary keyed by the header names
        csv_dict = csv.DictReader(file_obj)
        # loop through the rows
        for row in csv_dict:
            # append the value of the requested column to the list created before
            column_list.append(row[column_name])
    # return the list
    return column_list
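
Calling `load_data` once per column reopens and rereads the file each time. As an alternative, all columns could be collected in a single pass. The sketch below is only one way this might look: `load_columns` is a hypothetical name, and the sample rows are invented for illustration.

```python
import csv
import io


def load_columns(file_obj, column_names):
    """Read the CSV once and return a dict mapping each column name to a list of values."""
    columns = {name: [] for name in column_names}
    for row in csv.DictReader(file_obj):
        for name in column_names:
            # values stay strings here, exactly as load_data stores them
            columns[name].append(row[name])
    return columns


# made-up sample standing in for insurance.csv
sample = io.StringIO("age,sex,charges\n19,female,16884.924\n18,male,1725.5523\n")
data = load_columns(sample, ["age", "charges"])
print(data)
```

The trade-off is that the result is one dictionary of lists rather than seven separate list variables.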

In [4]:
# loading the data according to the column name
load_data(ages, "insurance.csv", "age")
load_data(sexes, "insurance.csv", "sex")
load_data(bmis, "insurance.csv", "bmi")
load_data(num_children, "insurance.csv", "children")
load_data(smoker_statuses, "insurance.csv", "smoker")
load_data(regions, "insurance.csv", "region")
load_data(insurance_costs, "insurance.csv", "charges")



['16884.924',
 '1725.5523',
 '4449.462',
 '21984.47061',
 '3866.8552',
 '3756.6216',
 '8240.5896',
 '7281.5056',
 '6406.4107',
 '28923.13692',
 '2721.3208',
 '27808.7251',
 '1826.843',
 '11090.7178',
 '39611.7577',
 '1837.237',
 '10797.3362',
 '2395.17155',
 '10602.385',
 '36837.467',
 '13228.84695',
 '4149.736',
 '1137.011',
 '37701.8768',
 '6203.90175',
 '14001.1338',
 '14451.83515',
 '12268.63225',
 '2775.19215',
 '38711',
 '35585.576',
 '2198.18985',
 '4687.797',
 '13770.0979',
 '51194.55914',
 '1625.43375',
 '15612.19335',
 '2302.3',
 '39774.2763',
 '48173.361',
 '3046.062',
 '4949.7587',
 '6272.4772',
 '6313.759',
 '6079.6715',
 '20630.28351',
 '3393.35635',
 '3556.9223',
 '12629.8967',
 '38709.176',
 '2211.13075',
 '3579.8287',
 '23568.272',
 '37742.5757',
 '8059.6791',
 '47496.49445',
 '13607.36875',
 '34303.1672',
 '23244.7902',
 '5989.52365',
 '8606.2174',
 '4504.6624',
 '30166.61817',
 '4133.64165',
 '14711.7438',
 '1743.214',
 '14235.072',
 '6389.37785',
 '5920.1041',
 '176

Now the data has been organized into labeled lists and is ready to be analyzed. Many aspects of the data could be examined. Here are the ones analyzed in this project for now:
* average age of the patients
* number of men vs. women counted in the dataset
* average BMI overall and for men and women separately
* average number of children
* proportion of smokers among all patients
* geographical location of the patients
* average yearly medical charges of the patients
* a dictionary that contains all patient information


To perform these analyses, a class called `PatientInfo` will be built, which contains eight methods corresponding to the aspects above:
* average_age()
* analyze_sex()
* analyze_bmi()
* average_num_children()
* ratio_smoker()
* analyze_region()
* average_medical_charges()
* create_patient_dict()
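
Several of these methods share the same pattern: convert each string to a number, add it up, and divide by the count. That pattern could be sketched as a small standalone helper. `average_of` is a hypothetical name that does not appear in the project, and the inputs below are just a few illustrative values.

```python
def average_of(values, cast=float, digits=2):
    """Convert each string with `cast`, then return the mean rounded to `digits` places."""
    if not values:
        raise ValueError("cannot average an empty list")
    total = 0
    for value in values:
        total += cast(value)
    return round(total / len(values), digits)


# ages are whole numbers, so they are converted with int
print(average_of(["19", "18", "28"], cast=int))
# bmis and charges contain decimals, so the default float conversion applies
print(average_of(["27.9", "33.0"]))
```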

In [5]:
class PatientInfo:
    # init method that takes in each list parameter
    def __init__(self, patients_ages, patients_sexes, patients_bmis, patients_num_children,
                patients_smoker_statuses, patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges
        
    # method used to calculate the average age of the patients in the medical insurance record
    def average_age(self):
        # initialize total age
        total_age = 0
        # iterate through the ages list
        for age in self.patients_ages:
            # add each age to the total
            total_age += int(age)
        # print the average age, rounded to two decimals
        print("Average Patient Age: " + str(round(total_age/len(self.patients_ages), 2)) + " years")
    
    # method used to count the men and women in the dataset
    def analyze_sex(self):
        # initialize the counts for both women and men
        count_men = 0
        count_women = 0
        # iterate through the sex list
        for sex in self.patients_sexes:
            # add to the count according to the sex (any value other than "female" is counted as male)
            if sex == "female":
                count_women += 1
            else:
                count_men += 1
        print("Number of males: " + str(count_men))
        print("Number of females: " + str(count_women))
        return (count_men, count_women)
    
    # method used to calculate the average bmi for all patients and for men and women separately
    def analyze_bmi(self):
        # initialize the total bmi for each case
        total_bmi = 0
        total_bmi_women = 0
        total_bmi_men = 0
        # iterating through the bmis list
        for i in range(len(self.patients_bmis)):
            # add to the total_bmi
            total_bmi += float(self.patients_bmis[i])
            # add based on the gender
            if self.patients_sexes[i] == "female":
                total_bmi_women += float(self.patients_bmis[i])
            else:
                total_bmi_men += float(self.patients_bmis[i])
                
        # calculate the averages for each case
        average_bmi = round(total_bmi/len(self.patients_bmis), 2)
        # here we use the analyze_sex method to get the total number of men and women
        # (each call also prints the counts, so they appear twice in the output)
        average_men_bmi = round(total_bmi_men/self.analyze_sex()[0], 2)
        average_women_bmi = round(total_bmi_women/self.analyze_sex()[1], 2)
        print("Average bmi: " + str(average_bmi))
        print("Average male bmi: " + str(average_men_bmi))
        print("Average female bmi: " + str(average_women_bmi))
        return (average_bmi, average_men_bmi, average_women_bmi)
    
    # method used to determine the average number of children the patients have
    def average_num_children(self):
        # initialize the total number of children
        total_children = 0
        # iterate through the num_children list
        for children in self.patients_num_children:
            # add each patient's number of children to the total
            total_children += int(children)
        # return the average number of children, rounded to a whole number
        return ("Average Number of children: " + str(int(round(total_children/len(self.patients_num_children), 0))) + " children")
    
    # method used to determine the proportion of smokers among all patients
    def ratio_smoker(self):
        # initialize the smoker count
        total_smoker = 0
        for smoker in self.patients_smoker_statuses:
            # check the status and count the patient if they smoke
            if smoker == "yes":
                total_smoker += 1
        # calculate the ratio of smokers to the total number of patients in the record
        smoker_ratio = round(total_smoker/len(self.patients_smoker_statuses), 2)
        return ("Ratio smoker is " + str(smoker_ratio))
    
    # method used to determine the unique geographical regions patients are from and find where most patients came from
    def analyze_region(self):
        # create an empty dictionary where each key is a unique region and each value is the number of patients
        # from that region
        region_dict = {}
        # iterate through the region list
        for region in self.patients_regions:
            # check whether the region is not yet in the region dict
            if region not in region_dict:
                # add the region and start counting the people in the new region
                region_dict[region] = 1
            else:
                # continue counting the people in a region that is already a key
                region_dict[region] += 1

        # print the region with the most patients
        find_max = max(region_dict, key = region_dict.get)
        print("Most patients came from " + find_max + ".")
        # return region dict
        return region_dict
    
    # method used to calculate the average medical charges of the patients
    def average_medical_charges(self):
        # initialize the total charges
        total_charges = 0
        # iterate through the charges list
        for charge in self.patients_charges:
            # add the value to the total charges
            total_charges += float(charge)
        # return the average, rounded to two decimals
        return ("Average medical charges: " + str(round(total_charges/len(self.patients_charges), 2)) + " dollars")
    
    # method to create dictionary with all patients information
    def create_patient_dict(self):
        self.patients_dict = {}
        self.patients_dict["age"] = [int(age) for age in self.patients_ages]
        self.patients_dict["sex"] = self.patients_sexes
        self.patients_dict["bmi"] = [round(float(bmi),2) for bmi in self.patients_bmis]
        self.patients_dict["num_children"] = [int(num_children) for num_children in self.patients_num_children]
        # encode smoker status as ones and zeros, where 1 means the patient is a smoker and 0 means non-smoker
        # the same could be done for the sex key
        self.patients_dict["smoker"] = [1 if smoker == "yes" else 0 for smoker in self.patients_smoker_statuses]
        self.patients_dict["region"] = self.patients_regions
        self.patients_dict["charges"] = [round(float(charge),2) for charge in self.patients_charges]
        return self.patients_dict
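
The column-oriented dictionary returned by `create_patient_dict()` lends itself to grouped summaries, because values at the same position in each list belong to the same patient. As a rough sketch of that idea, the function below compares average charges for smokers and non-smokers. `average_charges_by_smoker` is a hypothetical helper, not part of the class, and the miniature dictionary is made up; it only reuses the `"smoker"` and `"charges"` keys.

```python
def average_charges_by_smoker(patients_dict):
    """Return (smoker_average, non_smoker_average) from a column-oriented patient dict."""
    totals = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    # pair each patient's smoker flag (1 or 0) with that patient's charge
    for flag, charge in zip(patients_dict["smoker"], patients_dict["charges"]):
        totals[flag] += charge
        counts[flag] += 1
    # guard against a group with no patients
    smoker_avg = round(totals[1] / counts[1], 2) if counts[1] else None
    non_smoker_avg = round(totals[0] / counts[0], 2) if counts[0] else None
    return smoker_avg, non_smoker_avg


# made-up miniature dictionary using the same keys as create_patient_dict()
sample_dict = {"smoker": [1, 0, 0, 1], "charges": [200.0, 100.0, 300.0, 400.0]}
print(average_charges_by_smoker(sample_dict))
```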
        

In [6]:
patient_info = PatientInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_costs)

In [7]:
patient_info.average_age()

Average Patient Age: 39.21 years


In [8]:
patient_info.analyze_sex()

Number of males: 676
Number of females: 662


(676, 662)

In [9]:
patient_info.analyze_bmi()

Number of males: 676
Number of females: 662
Number of males: 676
Number of females: 662
Average bmi: 30.66
Average male bmi: 30.94
Average female bmi: 30.38


(30.66, 30.94, 30.38)

In [10]:
patient_info.average_num_children()

'Average Number of children: 1 children'

In [11]:
patient_info.analyze_region()

Most patients came from southeast.


{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
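
Tallying occurrences like this is a common task, and the standard library offers `collections.Counter` for it. The sketch below goes beyond the `csv`-only approach used in this project and runs on a made-up list of region labels rather than the real data.

```python
from collections import Counter

# made-up sample of region labels, for illustration only
sample_regions = ["southwest", "southeast", "southeast", "northwest", "southeast", "northeast"]

# Counter builds the region -> count mapping in one step
region_counts = Counter(sample_regions)
# most_common(1) returns a list holding the single (region, count) pair with the highest count
top_region, top_count = region_counts.most_common(1)[0]

print(dict(region_counts))
print("Most patients came from " + top_region + ".")
```

A useful sanity check with either approach is that the counts add up to the number of patients.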

In [12]:
patient_info.ratio_smoker()

'Ratio smoker is 0.2'

The value 0.2 = 1/5 is the rounded share of smokers among all patients, which means that roughly 1 in every 5 patients in the record is a smoker.
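
Note that `ratio_smoker()` divides the number of smokers by the total number of patients, which is not the same as the ratio of smokers to non-smokers. The difference can be seen on a small made-up list (the variable names here are illustrative only):

```python
# made-up statuses: 1 smoker among 5 patients
sample_statuses = ["yes", "no", "no", "no", "no"]

smokers = sample_statuses.count("yes")
non_smokers = len(sample_statuses) - smokers

# share of smokers among all patients (the quantity ratio_smoker computes)
smoker_proportion = smokers / len(sample_statuses)
# smokers per non-smoker, which is a different quantity
smoker_to_non_smoker = smokers / non_smokers

print(smoker_proportion)
print(smoker_to_non_smoker)
```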

In [13]:
patient_info.average_medical_charges()

'Average medical charges: 13270.42 dollars'

In [14]:
patient_info.create_patient_dict()

{'age': [19,
  18,
  28,
  33,
  32,
  31,
  46,
  37,
  37,
  60,
  25,
  62,
  23,
  56,
  27,
  19,
  52,
  23,
  56,
  30,
  60,
  30,
  18,
  34,
  37,
  59,
  63,
  55,
  23,
  31,
  22,
  18,
  19,
  63,
  28,
  19,
  62,
  26,
  35,
  60,
  24,
  31,
  41,
  37,
  38,
  55,
  18,
  28,
  60,
  36,
  18,
  21,
  48,
  36,
  40,
  58,
  58,
  18,
  53,
  34,
  43,
  25,
  64,
  28,
  20,
  19,
  61,
  40,
  40,
  28,
  27,
  31,
  53,
  58,
  44,
  57,
  29,
  21,
  22,
  41,
  31,
  45,
  22,
  48,
  37,
  45,
  57,
  56,
  46,
  55,
  21,
  53,
  59,
  35,
  64,
  28,
  54,
  55,
  56,
  38,
  41,
  30,
  18,
  61,
  34,
  20,
  19,
  26,
  29,
  63,
  54,
  55,
  37,
  21,
  52,
  60,
  58,
  29,
  49,
  37,
  44,
  18,
  20,
  44,
  47,
  26,
  19,
  52,
  32,
  38,
  59,
  61,
  53,
  19,
  20,
  22,
  19,
  22,
  54,
  22,
  34,
  26,
  34,
  29,
  30,
  29,
  46,
  51,
  53,
  19,
  35,
  48,
  32,
  42,
  40,
  44,
  48,
  18,
  30,
  50,
  42,
  18,
  54,
  32,
  37,
  4

Now we have an organized patient medical record dictionary that can be used for further analysis. Possible next steps include plotting histograms or scatter plots for each case, or even predicting medical charges using simple or multiple linear regression. But for now let's stop here and continue the analysis after I'm done studying the whole Data Science course through Codecademy. FUN STUFF!!!