#Header
COP 4045 - Python Programming - Dr. Oge Marques - FAU - Summer 2021

A9:       Diabetes Classifier
_________________
Name:     Anthony Polese

Z Number: Z23582157

Due:      28 July 2021

Note:     Everything done by Anthony Polese
_________________

#Instructions
The program expects 2 .csv files: one for training the program and one for testing the program. After starting the program, you will be prompted to:
1. upload the files from your computer
2. enter the names of the files (including the ".csv" extension)

From there, the program should run successfully.
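As a rough guide to the input the program expects, the sketch below checks that every data row has nine columns, with the outcome last. Only the `Pregnancies` header name comes from the code in this notebook. The helper name `check_csv_layout`, the placeholder header names, the stricter 0-or-1 outcome check, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

def check_csv_layout(stream):
    """Return the number of data rows; raise ValueError on a malformed row.

    Hypothetical helper, not part of the assignment code.
    """
    row_count = 0
    for line_number, row in enumerate(csv.reader(stream), start=1):
        # skip blank lines and the header row
        if not row or row[0] == "Pregnancies":
            continue
        if len(row) != 9:
            raise ValueError(f"line {line_number}: expected 9 columns, found {len(row)}")
        # assumption: the outcome column holds only 0 (negative) or 1 (positive)
        if row[8] not in ("0", "1"):
            raise ValueError(f"line {line_number}: outcome must be 0 or 1")
        row_count += 1
    return row_count

# Made-up sample in the assumed layout: 8 attributes followed by the outcome
sample = io.StringIO(
    "Pregnancies,a2,a3,a4,a5,a6,a7,a8,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
print(check_csv_layout(sample))  # 2
```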

#Imports

In [47]:
import csv
from google.colab import files

#Functions

In [48]:
def sum_lists(list1, list2):
    '''
    for two lists with a length of 8, each corresponding element is added
    and the sums are returned as a list in the same order as added
    '''
    sums_list = []
    #the elements at each index are added and the sum is appended to sums_list
    for index in range(8):
        sums_list.append(list1[index] + list2[index])
    return sums_list

def make_averages(sums_list, count):
    '''
    each element in sums_list is divided by count and appended to a new list,
    and this new list is returned
    '''
    averages_list = []
    
    for value in sums_list:
        avg_value = value / count
        averages_list.append(avg_value)
    
    return averages_list

def make_data_set(reader):
    '''
    given a .csv reader, data is read in and returned as a list of tuples with 
    the format (id, outcome, 8 patient attributes to predict diabetes diagnosis)
    '''
    training_set_list = []
    id_count          = 0 #attached to identify each tuple, incremented each loop
    for line_list in reader:
        #skip header
        if line_list[0] == "Pregnancies":
            continue

        #convert outcome int data to more descriptive string identifier
        outcome_str = line_list[8]
        if outcome_str == "0": #diabetes negative
            outcome_str = "n" 
        else:                  #diabetes positive
            outcome_str = "p" 

        #(id, outcome, 8 patient attributes)
        patient_tuple = (str(id_count),       outcome_str, \
                         int(line_list[0]),   int(line_list[1]), \
                         int(line_list[2]),   int(line_list[3]), \
                         int(line_list[4]),   float(line_list[5]), \
                         float(line_list[6]), int(line_list[7]))
        training_set_list.append(patient_tuple)
        
        id_count += 1
    
    return training_set_list

def get_negative_positive_averages_lists(training_set_list):
    '''
    given training_set_list, returns lists containing the average value 
    for each attribute for diabetes negative and positive patients
    '''
    negative_sums_list = [0] * 8 #list of sum of attribute values for diabetes negative patients
    positive_sums_list = [0] * 8 #list of sum of attribute values for diabetes positive patients
    negative_count     = 0       #number of patients who are diabetes negative
    positive_count     = 0       #number of patients who are diabetes positive
    
    #make lists of sums of attribute values for both patient types
    for patient_tuple in training_set_list:
        if patient_tuple[1] == "n": #diabetes negative
            negative_sums_list = sum_lists(negative_sums_list, patient_tuple[2:])
            negative_count += 1
        else:                       #diabetes positive
            positive_sums_list = sum_lists(positive_sums_list, patient_tuple[2:])
            positive_count += 1

    #make lists of average attribute values for both patient types
    negative_averages_list = make_averages(negative_sums_list, negative_count)
    positive_averages_list = make_averages(positive_sums_list, positive_count)

    return negative_averages_list, positive_averages_list

def get_classifier_list(neg_avgs_list, pos_avgs_list):
    '''
    returns a list containing the average value between diabetes negative 
    and positive patients for each attribute
    '''
    #sums the corresponding values between each list
    #then take each new value and divide by 2
    classifier_list = make_averages(sum_lists(neg_avgs_list, pos_avgs_list), 2)

    return classifier_list

def classify_test_set_list(testing_set_list, classifier_list, min_index):
    '''
    returns list of patient_tuples now in format of (id, # of attributes suggesting
    diabetes negative, # of attributes suggesting positive, actual diagnosis)
    '''
    result_list = []
    #go through each patient_tuple and convert to new format
    for patient_tuple in testing_set_list:
        negative_count = 0 #number of attributes suggesting diabetes negative
        positive_count = 0 #number of attributes suggesting diabetes positive
        id_str, diagnosis_str = patient_tuple[:2]
        #check each attribute for given patient_tuple
        for index in range(8):
            #in order to have odd # of attributes, the attribute with the smallest 
            #difference between diabetes neg. and pos. patients is skipped/not considered
            if index == min_index: 
                continue           
            
            #if value is indicative of diabetes positive
            if patient_tuple[index + 2] > classifier_list[index]:
                positive_count += 1
            #if value is indicative of diabetes negative
            else:
                negative_count += 1
        
        #store results and append to list
        result_tuple = (id_str, negative_count, positive_count, diagnosis_str)
        result_list.append(result_tuple)
    
    return result_list

def report_results(result_list):
    '''
    Counts number of misdiagnoses by comparing predicted results to actual results
    and displays the results
    '''
    total_count      = 0 #number of patients
    inaccurate_count = 0 #number of times the predictor predicted wrong diagnosis
    
    #iterate through each patient
    for result_tuple in result_list:
        negative_count, positive_count, diagnosis_str = result_tuple[1:4]
        total_count += 1
        #if predicted diabetes negative but actually was diabetes positive
        if   (negative_count > positive_count) and (diagnosis_str == 'p'):
            inaccurate_count += 1
        #if predicted diabetes positive but was actually negative
        elif (negative_count < positive_count) and (diagnosis_str == 'n'):
            inaccurate_count += 1
    
    #display results
    print(f"Out of {total_count} patients, there were {inaccurate_count} misdiagnoses")


def get_least_discriminative_feature_index(neg_avgs_list, pos_avgs_list):
    '''
    finds attribute value between negative_averages_list and positive_averages_list 
    with the smallest difference between lists and returns index of that attribute
    '''
    difference_list = [] #stores difference between lists for each attribute

    #iterate through all attributes and find differences
    iteration_range_int = len(neg_avgs_list) #neg. and pos. lists should have 
                                             #same length so either is fine
    for i in range(iteration_range_int):
        difference = abs(neg_avgs_list[i] - pos_avgs_list[i])
        difference_list.append(difference)
    
    min_difference = min(difference_list)

    #find index value containing the min_difference
    min_index = 0
    for i in range(iteration_range_int):
        if difference_list[i] == min_difference:
            min_index = i 

    return min_index

def get_file_streams():
    '''
    prompts user for file names of training file and testing file. Reprompts user
    until file streams are successfully opened. File streams returned
    '''
    #get training file stream
    invalid_bool = True
    while invalid_bool: #loop until file stream is successfully opened
        try:
            print("\nEnter name of data training file (include file format, such as \".csv\"): ")
            training_file_name    = input()
            training_input_stream = open(training_file_name, "r")
            invalid_bool          = False #opening file stream was successful
        except FileNotFoundError:
            print("\nInvalid file name. Try again.")
    
    #get testing file stream
    invalid_bool = True
    while invalid_bool: #loop until file stream is successfully opened
        try:
            print("\nEnter name of data testing file (include file format, such as \".csv\"): ")
            testing_file_name     = input()
            testing_input_stream  = open(testing_file_name, "r")
            invalid_bool          = False #opening file stream was successful
        except FileNotFoundError:
            print("\nInvalid file name. Try again.")
    
    return training_input_stream, testing_input_stream            

#Program

In [49]:
#prompt user to upload training and testing files
print("Upload training file")
uploaded = files.upload()
print("\nUpload testing file")
uploaded = files.upload()

#get training and testing file streams
training_input_stream, testing_input_stream = get_file_streams()
#open .csv readers for training and testing file streams
training_reader = csv.reader(training_input_stream)
testing_reader  = csv.reader(testing_input_stream)

#reading in training data to make a list of tuples
print("Reading in training data...")
training_set_list = make_data_set(training_reader)
print("Done reading training data.\n")

print("Training classifier...")
#getting average attribute values for diabetes negative and positive patients
neg_avgs_list, pos_avgs_list = get_negative_positive_averages_lists(training_set_list)
#using these averages to find index of attribute with smallest difference between
#negative and positive patients
min_index = get_least_discriminative_feature_index(neg_avgs_list, pos_avgs_list)
#creating classifier, a list of the average attribute values between 
#diabetes positive and negative patients
classifier_list = get_classifier_list(neg_avgs_list, pos_avgs_list)
print("Done training classifier.\n")

#reading in testing data into a list of tuples
print("Reading in test data...")
testing_set_list = make_data_set(testing_reader)
print("Done reading test data.\n")

#file streams no longer needed
training_input_stream.close()
testing_input_stream.close()

#classifying each tuple in testing_set_list by whether its attributes are
#more typical of diabetes positive or negative patients
print("Classifying records...")
result_list = classify_test_set_list(testing_set_list, classifier_list, min_index)
print("Done classifying.\n")

#comparing predicted diagnoses to actual diagnoses 
report_results(result_list)

print("Program finished.")

Upload training file


Saving diabetes_training.csv to diabetes_training (2).csv

Upload testing file


Saving diabetes_testing.csv to diabetes_testing (1).csv

Enter name of data training file (include file format, such as ".csv"): 
diabetes_training.csv

Enter name of data testing file (include file format, such as ".csv"): 
diabetes_testing.csv
Reading in training data...
Done reading training data.

Training classifier...
Done training classifier.

Reading in test data...
Done reading test data.

Classifying records...
Done classifying.

Out of 384 patients, there were 119 misdiagnoses
Program finished.


#References/Sources of Inspiration
* The Practice of Computing Using Python (3rd Edition) by William Punch and Richard Enbody. In particular, sections 10.3: The Breast Cancer Classifier and 10.4: Designing the Classifier Algorithm were referenced heavily in this assignment.
* Help from Dr. Marques in the Canvas discussion board
* Help from classmates in our course's WhatsApp group chat



#Project Notes

#Design Choices:
* In order to have an odd number of attributes (so that the negative and positive counts can never tie), the "least discriminative feature" amongst the patient data is not considered in predicting the diagnosis of the patient. That is, the attribute with the smallest difference between the averages for diabetes negative and diabetes positive patients is skipped.
* There are other ways to get an odd number of attributes. Alternatively, I could have found the attribute with the most missing values (a cursory glance strongly suggests it is the Insulin data) and removed that attribute from consideration instead.
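The alternative in the second bullet could be sketched roughly as below: count how many times each attribute is recorded as 0 and pick the attribute with the most zeros. This is not part of the submitted program. The function name `count_zero_values`, the assumed attribute order (1 = glucose, 2 = blood pressure, 3 = skin thickness, 4 = insulin, 5 = BMI), and the sample tuples are hypothetical.

```python
def count_zero_values(data_set_list, attribute_indexes):
    """Count 0 entries per attribute.

    Tuples use the notebook's (id, outcome, 8 attributes) format, so
    attribute i is stored at position i + 2.
    """
    zero_counts = {index: 0 for index in attribute_indexes}
    for patient_tuple in data_set_list:
        for index in attribute_indexes:
            if patient_tuple[index + 2] == 0:
                zero_counts[index] += 1
    return zero_counts

# Made-up patients (illustration only, not rows from the real dataset)
sample_set_list = [
    ("0", "p", 6, 148, 72, 35, 0, 33.6, 0.627, 50),
    ("1", "n", 1, 85, 0, 29, 0, 26.6, 0.351, 31),
    ("2", "p", 8, 183, 64, 0, 0, 0.0, 0.672, 32),
]

# Assumed order: 1=glucose, 2=blood pressure, 3=skin thickness, 4=insulin, 5=BMI
zero_counts = count_zero_values(sample_set_list, [1, 2, 3, 4, 5])
most_missing_index = max(zero_counts, key=zero_counts.get)
print(zero_counts)         # {1: 0, 2: 1, 3: 1, 4: 3, 5: 1}
print(most_missing_index)  # 4
```

The resulting index could then be passed to `classify_test_set_list` in place of `min_index`. Pregnancies is left out of the count because 0 is a legitimate value for it.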

#Limitations/Future Improvements:
One notable limitation is the nature of the data. Some values are missing, which shows up as a value of 0 for attributes such as BMI and blood pressure that cannot really be 0. Glucose, blood pressure, skin thickness, insulin, and BMI all occasionally have missing values in the given dataset. As of right now, all patient data is taken as is for both the training set and the testing set (which Dr. Marques clarified in the discussion board was acceptable).

There are a couple of ways to deal with this:
1. replace each missing value with the average value among all patients that have that attribute recorded
2. discard all patient samples with missing values. 

Implementing either of these options in the future would be a good way to try to improve the accuracy of the predictor.
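A minimal sketch of the two options, assuming a 0 marks a missing value and that patients are stored in the notebook's (id, outcome, 8 attributes) tuple format. Neither function exists in the submitted program. The names `impute_with_average` and `discard_incomplete` and the sample data are made up for illustration.

```python
def impute_with_average(data_set_list, attribute_indexes):
    """Option 1: replace each 0 with the average of the recorded (non-zero) values."""
    averages = {}
    for index in attribute_indexes:
        recorded = [p[index + 2] for p in data_set_list if p[index + 2] != 0]
        # if nothing was recorded for this attribute, there is no average to use
        averages[index] = sum(recorded) / len(recorded) if recorded else 0
    imputed_list = []
    for patient_tuple in data_set_list:
        values = list(patient_tuple)
        for index in attribute_indexes:
            if values[index + 2] == 0:
                values[index + 2] = averages[index]
        imputed_list.append(tuple(values))
    return imputed_list

def discard_incomplete(data_set_list, attribute_indexes):
    """Option 2: keep only patients with no 0 in the listed attributes."""
    return [patient_tuple for patient_tuple in data_set_list
            if all(patient_tuple[index + 2] != 0 for index in attribute_indexes)]

# Made-up patients; attribute 2 (position 4) and attribute 4 (position 6)
# are assumed to be blood pressure and insulin
sample_set_list = [
    ("0", "p", 6, 148, 72, 35, 100, 33.6, 0.627, 50),
    ("1", "n", 1, 85, 0, 29, 0, 26.6, 0.351, 31),
]
imputed = impute_with_average(sample_set_list, [2, 4])
print(imputed[1][4], imputed[1][6])                   # 72.0 100.0
print(len(discard_incomplete(sample_set_list, [2, 4])))  # 1
```

Averaging keeps every patient in the training set, while discarding avoids inventing values at the cost of a smaller data set.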




#Conclusion
When using ~340 patient samples to train the program and the 384 patients in the testing file to test it, there were 119 misdiagnoses, which works out to about 69% accuracy. That seems pretty good. Considering that there are some clear ways to improve the program (mentioned above in "Limitations/Future Improvements"), this current accuracy could probably be increased fairly easily.
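For reference, the accuracy figure can be recomputed from the run output shown earlier (384 patients, 119 misdiagnoses). The variable names here are just for illustration.

```python
total_count = 384       # patients in the test run shown in the program output
inaccurate_count = 119  # misdiagnoses reported by the program

correct_count = total_count - inaccurate_count
accuracy = correct_count / total_count
print(correct_count)       # 265
print(f"{accuracy:.1%}")   # 69.0%
```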

It is surprising how accurate this program is as of right now considering the improvements still possible. For me, a takeaway from making this program is that you do not need extremely complex logic and computation to make a relatively accurate predictor. Of course, this depends on other things too, like the attribute you are trying to predict, but it seems like as long as you have access to a large amount of meaningful data, some sort of useful prediction can be made with it.

I found it interesting to perform what seemed to be some very basic form of machine learning. It is not a field with which I am very familiar, so my major takeaways were related to how we used large amounts of data in this assignment to make predictions about the patients. 

For example, I now know that using classifier values for attributes (i.e. for each attribute, the midpoint between the averages of two different groups, which in this case were diabetes negative and diabetes positive patients) can be an effective way to make predictions about given data.
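As a small worked example of that idea, the sketch below uses three made-up attributes instead of the seven the program actually votes on, and it does not skip a least discriminative feature. The function names `midpoints` and `predict` and all of the numbers are hypothetical.

```python
def midpoints(neg_avgs_list, pos_avgs_list):
    """Midpoint between the two group averages for each attribute."""
    return [(neg + pos) / 2 for neg, pos in zip(neg_avgs_list, pos_avgs_list)]

def predict(attribute_values, classifier_list):
    """Majority vote: an attribute votes positive when it exceeds its midpoint."""
    positive_votes = sum(value > cutoff
                         for value, cutoff in zip(attribute_values, classifier_list))
    negative_votes = len(classifier_list) - positive_votes
    return "p" if positive_votes > negative_votes else "n"

# Made-up group averages for three attributes (not taken from the dataset)
neg_avgs = [110.0, 30.0, 31.0]
pos_avgs = [140.0, 34.0, 37.0]

cutoffs = midpoints(neg_avgs, pos_avgs)
print(cutoffs)                            # [125.0, 32.0, 34.0]
print(predict([150, 33.0, 29], cutoffs))  # p (two of three values exceed their midpoints)
```

With an odd number of attributes, the two vote counts can never be equal, so every patient gets a prediction.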

I also learned some ways to clean up the data in cases when some fields are missing (i.e. discard the data or replace it with the average value for that attribute).