# Exploring Societal Inequity's Effect on (Model-Perceived) Health Outcomes 

### MODELING & ANALYSIS DOCUMENT

## Abstract

## Introduction

For our project, we aim to explore the relationship between diseases and social factors such as sex, race, and town of residence, and how these relationships may reflect societal and environmental inequities. Our approach is to identify the most accurate predictive model for our dataset, then use that model to generate risk scores and evaluate the relationship between different diseases and characteristics indicative of societal inequality. We then analyze what these risk scores imply about inequitable, identity-based differences in health outcomes and complications. Our project consists of three documents: one in which we clean our original data, one in which we explore the data visually, and this final one, in which we build and explore our models.

Building on the general trends we observed in our data visualization document, we carried out the second half of our project: building a model that predicts risk scores. By comparing the risk scores, we wanted to see whether trends emerged in terms of socioeconomic status (which we measure by the proxy of town of residence), race, gender, and ethnicity.

## Values Statement

## Material & Methods

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

### Reading in our Data

In [56]:
conditions_diabetes = pd.read_csv('conditions_diabetes.csv')
conditions_pregnancy = pd.read_csv('conditions_pregnancy.csv')
conditions_cancer = pd.read_csv('conditions_cancer.csv')
conditions_heart = pd.read_csv('conditions_heart.csv')
conditions_lungs = pd.read_csv('conditions_lungs.csv')

observations = pd.read_csv('observations_pivot.csv')
patients = pd.read_csv('patient_clean.csv')

#### Note: All of our datasets are grouped by related diseases (for example, diabetes and comorbidities such as diabetic retinopathy). For the rest of the post, when we say "diabetes" or "pregnancy complications," we mean diabetes and all present comorbidities, or a grouping of pregnancy complications such as preeclampsia, antepartum eclampsia, and miscarriage.

# Diabetes Modeling & Analysis

### 1. Prepping Data

To prep our data for modeling, we label encoded each of the qualitative variables (keeping track of the codes so we could decode them again later). We wrapped this in a function so we could reuse it for each disease group.
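
The encode-then-decode bookkeeping can be sketched on a toy column as follows. This is a minimal illustration, not the notebook's own cell: the `toy` dataframe, `encoder`, and `race_lookup` names are made up for the example, and it relies only on the fact that `LabelEncoder.classes_` is sorted so that position `i` holds the label encoded as `i`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the values are illustrative only (hypothetical, not from our dataset)
toy = pd.DataFrame({"race": ["white", "asian", "black", "white"]})

encoder = LabelEncoder()
toy["race_code"] = encoder.fit_transform(toy["race"])

# classes_ is sorted, so position i is the label that was encoded as i
race_lookup = dict(enumerate(encoder.classes_))

# Decode the integer codes back into the original labels
decoded = toy["race_code"].map(race_lookup)
print(race_lookup)
print(decoded.tolist())
```

Keeping one lookup dictionary per encoded column is what lets the summary tables later show names rather than integers.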

In [3]:
le = LabelEncoder()

# our data-prepping function for modeling
def prep_data(patients, conditions, illness_descriptions, observations):

    # make patients column match others for merging, drop unnecessary information and NA vals
    patients.rename(columns={'patient':'PATIENT'}, inplace=True)
    patients = patients.drop(columns=['birthdate', 'marital','deathdate','ssn', 'address', 'drivers', 'passport', 'prefix', 'first', 'last', 'suffix', 'maiden'])
    patients = patients.dropna()
    conditions = conditions.dropna()

    # merge datasets (patient info and corresponding conditions)
    merged_df = pd.merge(patients, conditions, on='PATIENT', how='left')
    merged_df = pd.merge(merged_df, observations, on='PATIENT', how='left')

    # create y
    merged_df["y"] = (merged_df[illness_descriptions] == 1).any(axis=1).astype(int)
    merged_df = merged_df.drop(columns=illness_descriptions)

    # label encode all qualitative vars
    
    # race
    merged_df["race"] = le.fit_transform(merged_df["race"]) 
    race_code = {code: race for code, race in enumerate(le.classes_)}

    # ethnicity
    merged_df["ethnicity"] = le.fit_transform(merged_df["ethnicity"])
    eth_code = {code: ethnicity for code, ethnicity in enumerate(le.classes_)}

    # gender
    merged_df["gender"] = le.fit_transform(merged_df["gender"])  
    gen_code = {code: gender for code, gender in enumerate(le.classes_)}

    # birthplace
    merged_df["birthplace"] = le.fit_transform(merged_df["birthplace"]) 
    bp_code = {code: bp for code, bp in enumerate(le.classes_)}

    # current town of residence
    merged_df["curr_town"] = le.fit_transform(merged_df["curr_town"]) 
    curr_code = {code: bp for code, bp in enumerate(le.classes_)}

    # split data into test and train
    train, test = train_test_split(merged_df, test_size=0.2, random_state=42)
    
    X_train = train.drop(columns=['y'])
    y_train = train['y']
    
    X_test = test.drop(columns=['y'])
    y_test = test['y']
    
    # return split x, y, and all of the code tracking dicts
    return X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code

In [4]:
# using above function to create diabetes test and train set

illness_descriptions = ['PATIENT','Diabetes_CONDITIONS','Prediabetes_CONDITIONS','Diabetic retinopathy associated with type II diabetes mellitus (disorder)_CONDITIONS', 
                        'Nonproliferative diabetic retinopathy due to type 2 diabetes mellitus (disorder)_CONDITIONS', 'Macular edema and retinopathy due to type 2 diabetes mellitus (disorder)_CONDITIONS', 
                        'Microalbuminuria due to type 2 diabetes mellitus (disorder)_CONDITIONS', 'Diabetic renal disease (disorder)_CONDITIONS', 'Neuropathy due to type 2 diabetes mellitus (disorder)_CONDITIONS']
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(patients, conditions_diabetes, illness_descriptions, observations)

### 2. Finding optimal model

Next, we created a function we could reuse that identifies the best performing model on our data from the options random forest, SVC, logistic regression, and decision trees. The best model is what we use to predict the probability that each person has a certain disease (for our purposes, their risk score).

In [5]:
# our model-finding function
def train_model(X_train, y_train):

    #LogisticRegression
    LR = LogisticRegression(max_iter=10000000000000000000)
    LRScore = cross_val_score(LR, X_train, y_train, cv=5).mean()

    #DecisionTreeClassifier
    param_grid = { 'max_depth': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None ]}

    tree = DecisionTreeClassifier()
    grid_search = GridSearchCV(tree, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    DTCScore  = grid_search.best_score_
    bestDTCDepth = grid_search.best_params_


    # Random Forest Classifier
    forrest = RandomForestClassifier(random_state=0)
    grid_search = GridSearchCV(forrest, param_grid, cv=5)
    grid_search.fit(X_train, y_train)

    RFCScore  = grid_search.best_score_
    bestRFCDepth = grid_search.best_params_

    #SVC
    SVM = SVC()

    # use grid search to find best gamma for SVM
    g = {'gamma': 10.0 ** np.arange(-5, 5) }
    grid_search = GridSearchCV(SVM, g, cv=5)
    grid_search.fit(X_train, y_train)

    SVMScore  = grid_search.best_score_   


    print("best LR :", LRScore)
    print("best DTC:", DTCScore)
    print("best max depth: ", bestDTCDepth)
    print("best RFC: ", RFCScore)
    print("best max depth: ", bestRFCDepth)
    print("best SVM: ", SVMScore)

    # pick the model with the highest cross-validated score
    max_score = 0
    max_model = ""
    if LRScore > max_score:
        max_score = LRScore
        max_model = "LR"
    if DTCScore > max_score:
        max_score = DTCScore
        max_model = "DTC"
    if RFCScore > max_score:
        max_score = RFCScore
        max_model = "RFC"
    if SVMScore > max_score:
        max_score = SVMScore
        max_model = "SVM"

    print("best score overall is: ", max_score, " with model: ", max_model)
    
# run model finding function on our diabetes data
train_model(X_train, y_train)

best LR : 0.9041854664172261
best DTC: 0.9178790213124979
best max depth:  {'max_depth': 3}
best RFC:  0.9153112505043837
best max depth:  {'max_depth': 5}
best SVM:  0.9016066908770772
best score overall is:  0.9178790213124979  with model:  DTC


The results of our function show that the decision tree classifier is the best of the four models, with a cross-validated accuracy of about 91.8%. Our accuracies are generally lower than they could be because of the limited information we allowed the model to have: we wanted to see what the model would do when it predicted on identity factors such as race, ethnicity, and birthplace, not how it would predict given information on the specific procedures and allergies a patient had.
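
One way to put accuracies like these in context is to score a majority-class baseline alongside the candidate models. The sketch below is an assumption-laden illustration rather than our actual pipeline: it uses synthetic, imbalanced data from `make_classification`, fixed hyperparameters instead of a grid search, and scikit-learn's `DummyClassifier`, which does not appear elsewhere in this document.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (roughly 90% negatives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Hypothetical candidate set; "baseline" always predicts the majority class
candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "LR": LogisticRegression(max_iter=1000),
    "DTC": DecisionTreeClassifier(max_depth=3, random_state=0),
    "RFC": RandomForestClassifier(max_depth=5, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

If a model's accuracy is close to the baseline's, most of that accuracy comes from class imbalance rather than from what the model learned about the features.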

### 3. Create Risk Scores

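Predict probabilities for all our entries using a random forest with `max_depth=5`, the best depth found for that model. The decision tree scored marginally higher in cross-validation (0.918 vs. 0.915), so the two are nearly equivalent here.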

In [6]:
forrest = RandomForestClassifier(max_depth=5)
forrest.fit(X_train, y_train)
pred_prob = forrest.predict_proba(X_test)

For convenience, we created a risk-finding function that can be reused across factors and diseases.

In [7]:
def find_risk(code, col, probs):
    # finds the corresponding subset of our probability data
    indices = (X_test[col] == code)
    prob_subset = probs[indices]
    # finds the average of this subset
    av_prob = np.mean(prob_subset[:, 1]) 
    return av_prob   
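
The averaging that `find_risk` performs for one code at a time can also be expressed as a single pandas `groupby`. This is a minimal sketch on made-up numbers: `X_demo` and `probs_demo` are hypothetical stand-ins for `X_test` and the `predict_proba` output, and it assumes binary 0/1 labels so that column 1 holds the probability of the positive class.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_test and the predict_proba output (values are made up)
X_demo = pd.DataFrame({"gender": [0, 1, 0, 1]})
probs_demo = np.array([[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.9, 0.1]])

# Column 1 of predict_proba is the probability of the positive class (y == 1)
risk_by_gender = (
    pd.Series(probs_demo[:, 1], index=X_demo.index)
    .groupby(X_demo["gender"])
    .mean()
)
print(risk_by_gender)
```

The result is indexed by the integer code, so it can be mapped back to names with the same code dictionaries returned by `prep_data`.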

### 4. Compare Across Race, Gender, Ethnicity
Next, we find the average risk score for different demographic characteristics: Race, Gender, and Ethnicity.

#### Race

In [8]:
diabetesRaceRisk = []

# find risk for each race (after finding on their code from the label encoder)
for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    diabetesRaceRisk.append(newRow)

# print summary table
diabetesRaceRisk = pd.DataFrame(diabetesRaceRisk)
diabetesRaceRisk = diabetesRaceRisk.sort_values(by='risk', ascending=False)
diabetesRaceRisk

Unnamed: 0,race,risk
0,asian,0.479858
2,hispanic,0.323339
3,white,0.315055
1,black,0.242059


Our model tells us that the group most susceptible to diabetes is Asian, followed by Hispanic and White, with Black being the least susceptible. These results were interesting in that they indicate there may be a difference according to race, and they made us think about how we could explore demographic information about Massachusetts (where our data is "from") to understand whether these trends reflect larger ones.

#### Gender

In [9]:
diabetesGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    diabetesGenderRisk.append(newRow)

diabetesGenderRisk = pd.DataFrame(diabetesGenderRisk)
diabetesGenderRisk = diabetesGenderRisk.sort_values(by='risk', ascending=False)
diabetesGenderRisk

Unnamed: 0,gender,risk
0,F,0.369851
1,M,0.265544


Our model assigns women a higher average risk of diabetes (or comorbidities) than men (0.37 vs. 0.27), which is in line with medical research we've seen.

#### Ethnicity

In [10]:
av_risk_eth = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    av_risk_eth.append(new_row)

av_risk_eth_df = pd.DataFrame(av_risk_eth)
av_risk_eth_df = av_risk_eth_df.sort_values(by='risk', ascending=False)
av_risk_eth_df


Unnamed: 0,eth,risk
2,asian_indian,0.717222
13,polish,0.582208
9,german,0.495592
12,mexican,0.429141
1,american,0.422651
14,portuguese,0.393905
6,english,0.371395
17,scottish,0.333342
11,italian,0.315572
5,dominican,0.313424


This table gives us a lot of information about risk by ethnicity. Most interestingly, it agrees with our race finding that Asian people are more likely to experience diabetes, in that our most at-risk ethnicity was Asian Indian. However, Chinese and West Indian, the two other Asian ethnicities in the dataset, are at the bottom of the risk hierarchy, which made us consider that the risk of Asian Indian people specifically, and alone, was driving our race findings.
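
One way to examine whether a single ethnicity is driving a race-level average is to look at mean risk and group size together, broken down by race and ethnicity. The sketch below uses a made-up table (`demo`) with already-decoded labels and invented risk values, so it only illustrates the shape of the check, not our results.

```python
import pandas as pd

# Toy table: one row per test patient, with decoded labels and a made-up risk score
demo = pd.DataFrame({
    "race": ["asian", "asian", "asian", "white", "white"],
    "ethnicity": ["asian_indian", "chinese", "chinese", "english", "english"],
    "risk": [0.70, 0.10, 0.20, 0.30, 0.40],
})

# Mean risk and number of patients per (race, ethnicity) pair
summary = (
    demo.groupby(["race", "ethnicity"])["risk"]
    .agg(["mean", "count"])
    .reset_index()
    .sort_values("mean", ascending=False)
)
print(summary)
```

Showing the count next to the mean makes it visible when a high average rests on only a handful of patients.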

### 5. Compare Across Wealthier & Poorer Towns of Residence/Birthplace

In order to compare outcomes across towns of varying socioeconomic status, we compiled a list of the richest and poorest towns present in our dataset (using Census data).

In [11]:
# richest towns in Mass
richTowns = ["Dover", "Weston", "Wellesley", "Lexington", "Sherborn", "Cohasset", "Lincoln", "Carlisle", "Hingham", "Winchester", 
                "Medfield", "Concord", "Needham", "Sudbury", "Hopkinton", "Boxford", "Brookline", "Andover",  
                  "Southborough", "Belmont", "Acton", "Marblehead", "Newton", "Nantucket", "Duxbury", "Boxborough", "Westwood","Natick", 
                  "Longmeadow", "Marion", "Groton", "Newbury", "North Andover", "Sharon", "Arlington", "Norwell", "Reading", 
                  "Lynnfield", "Marshfield", "Holliston", "Medway", "Canton", "Milton", "Ipswich", "Littleton", "Westford", "North Reading", "Chelmsford", "Dedham",
                  "Walpole", "Mansfield", "Shrewsbury", "Norwood", "Hanover", "Stow", "Newburyport", "Chatham", "Orleans", "Harwich",
                  "Swampscott","Fairhaven", "Salem"]

# poorest towns in Mass
poorTowns = ["Springfield", "Lawrence", "Holyoke", "Amherst", "New Bedford", "Chelsea", "Fall River", "Athol", "Orange", "Lynn", "Fitchburg", "Gardner", "Brockton", "Malden", "Worcester", "Chicopee", "North Adams", "Everett",
    "Ware", "Dudley", "Greenfield Town", "Weymouth Town", "Montague", "Revere", "Taunton", "Adams", "Huntington", "Charlemont", "Leominster", "Florida", "Colrain", "Hardwick",
    "Palmer Town", "Peabody", "Somerville", "Lowell", "Westfield", "Billerica"]

Next, we create a dataframe with the code and test-set count for each of the rich and poor towns.

In [12]:
def find_town_info_row(town, bp_code_swapped, townCounts_df, code_name):
    code = bp_code_swapped[town]
    
    if not townCounts_df[townCounts_df[code_name] == code].empty:
        count = townCounts_df[townCounts_df[code_name] == code]['count'].values[0]
    else:
        count = 0
    
    new_row = {code_name: town, 'code': code, 'count': count}
    
    new_row_df = pd.DataFrame([new_row])
    
    return new_row_df

In [13]:
def find_town_info_all(counts, code_name):
    
    townCounts_df = pd.merge(X_test, counts, on=code_name)
    town_info_rich = pd.DataFrame(columns=[code_name, 'code', 'count'])
    town_info_poor = pd.DataFrame(columns=[code_name, 'code', 'count'])

    bp_code_swapped = {value: key for key, value in bp_code.items()}

    for town in richTowns:
        
        new_row_df = find_town_info_row(town, bp_code_swapped, townCounts_df, code_name)
        town_info_rich = pd.concat([town_info_rich, new_row_df], ignore_index=True)

    for town in poorTowns:
        
        new_row_df = find_town_info_row(town, bp_code_swapped, townCounts_df, code_name)
        town_info_poor= pd.concat([town_info_poor, new_row_df], ignore_index=True)
        
    return town_info_rich, town_info_poor

birthplace_counts = X_test.groupby('birthplace').size().reset_index(name='count')

town_info_rich, town_info_poor = find_town_info_all(birthplace_counts, 'birthplace')

We then use the following code to walk down each list, adding towns until the cumulative count exceeds 65 people. This gives us comparably sized groups from the richest and the poorest towns.

In [14]:
def get_towns_by_sum_pop(town_info, code_name):
    
    townsUsed = set()
    peopleCount = 0

    for index, row in town_info.iterrows():
        
        if peopleCount > 65:
            break
        
        name = row[code_name]
        count = row['count']
        townsUsed.add(name)
        peopleCount += count
    
    return townsUsed, peopleCount

richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'birthplace')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'birthplace')
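
The same cumulative selection can be sketched without an explicit loop by using `cumsum`. The counts below are made up, and `town_info_demo`, `limit`, and `collected_before` are hypothetical names for the example; the rule mirrors the loop above, where a town is still added as long as the running total before it has not exceeded the limit.

```python
import pandas as pd

# Toy counts in priority order (richest first); the numbers are made up
town_info_demo = pd.DataFrame({
    "town": ["Dover", "Weston", "Wellesley", "Lexington"],
    "count": [30, 0, 40, 25],
})
limit = 65

# People already collected before each town is considered
collected_before = town_info_demo["count"].cumsum() - town_info_demo["count"]

# Keep a town while the running total so far has not exceeded the limit
selected = town_info_demo[collected_before <= limit]
towns_used = set(selected["town"])
people_count = int(selected["count"].sum())
print(towns_used, people_count)
```

Note that the final total can overshoot the limit, because the town that crosses it is still included.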

### Birthplace

In [15]:
def get_av_prob_bp(townsUsed, code_name, bp_code):
    
    town_codes = []
    bp_code_swapped = {value: key for key, value in bp_code.items()}


    for town_full in townsUsed:
        town_codes.append(bp_code_swapped[town_full])
        
    indices = X_test[code_name].isin(town_codes)
    prob_subset = pred_prob[indices]
    av_prob = np.mean(prob_subset[:, 1]) 

    return av_prob

In [16]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.33055564471630217 av_poor_prob:  0.3226796176138018


We find that there is not much difference in the average risk of diabetes when comparing poor and rich birthplace towns. 

### Current Town of Residence

Create a dataframe with the information for rich and poor towns, then select towns from each list until the cumulative count exceeds 65 people.

In [17]:
curr_counts = X_test.groupby('curr_town').size().reset_index(name='count')
town_info_rich, town_info_poor = find_town_info_all(curr_counts, 'curr_town')

richTownsUsed, richPeopleCount = get_towns_by_sum_pop(town_info_rich, 'curr_town')
poorTownsUsed, poorPeopleCount = get_towns_by_sum_pop(town_info_poor, 'curr_town')

In [18]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.24245054097390775 av_poor_prob:  0.2795095286478775


In this comparison, we find that people currently residing in rich towns have a lower average predicted risk of diabetes than those residing in poorer towns (0.24 vs. 0.28).
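
Because `prep_data` calls `fit_transform` separately for `birthplace` and `curr_town`, the same town name is not guaranteed to receive the same integer in both columns. The following is a minimal sketch, on toy data, of keeping one name-to-code mapping per column; `name_to_code` is a hypothetical helper dictionary, and for `curr_town` it plays the role that `curr_code` (rather than `bp_code`) plays in our function's return values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: the two columns contain different sets of towns
demo = pd.DataFrame({
    "birthplace": ["Dover", "Lowell", "Salem"],
    "curr_town": ["Boston", "Dover", "Lowell"],
})

# One name -> code mapping per encoded column (hypothetical helper)
name_to_code = {}
for col in ["birthplace", "curr_town"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    name_to_code[col] = {name: code for code, name in enumerate(enc.classes_)}

# "Dover" gets a different integer in each column
dover_codes = {col: int(name_to_code[col]["Dover"]) for col in name_to_code}
print(dover_codes)
```

Looking up a town through the mapping that belongs to the column being filtered keeps the selected rows aligned with the intended towns.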

# Pregnancy Analysis

In [19]:
preg_descriptions = ['PATIENT', 'Miscarriage in first trimester_CONDITIONS',
                        'Miscarriage in second trimester_CONDITIONS',
                        'Complication occuring during pregnancy_CONDITIONS',
                        'Preeclampsia_CONDITIONS', 'Antepartum eclampsia_CONDITIONS',
                        'Tubal pregnancy_CONDITIONS', 'Congenital uterine anomaly_CONDITIONS',
                        'Blighted ovum_CONDITIONS']
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(patients, conditions_pregnancy, preg_descriptions, observations)

In [20]:
train_model(X_train, y_train)

best LR : 0.9538094714060378
best DTC: 0.9632185172957705
best max depth:  {'max_depth': 1}
best RFC:  0.9632185172957705
best max depth:  {'max_depth': 1}
best SVM:  0.9632185172957705
best score overall is:  0.9632185172957705  with model:  DTC


### Compute Average Risk scores

Predict probabilities for all our entries using the best model type we found (decision tree), here fit with `max_depth=5`.

In [21]:
DTC = DecisionTreeClassifier(max_depth=5)
DTC.fit(X_train, y_train)
pred_prob = DTC.predict_proba(X_test)

### Race

In [22]:
pregRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    pregRaceRisk.append(newRow)

pregRaceRisk = pd.DataFrame(pregRaceRisk)
pregRaceRisk = pregRaceRisk.sort_values(by='risk', ascending=False)
pregRaceRisk

Unnamed: 0,race,risk
1,black,0.075689
2,hispanic,0.06128
3,white,0.036454
0,asian,0.0


Here we can see that being Black gives a patient more than double the predicted risk of pregnancy issues compared with being White (0.076 vs. 0.036). Hispanic patients have the second highest risk, and Asian patients have a risk of zero, probably because there are few or no Asian patients with pregnancy complications in our dataset.

### Gender

In [23]:
pregGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    pregGenderRisk.append(newRow)

pregGenderRisk = pd.DataFrame(pregGenderRisk)
pregGenderRisk = pregGenderRisk.sort_values(by='risk', ascending=False)
pregGenderRisk

Unnamed: 0,gender,risk
0,F,0.085255
1,M,0.0


This result may seem a bit redundant or silly, but it makes sense, as men generally do not get pregnant.

### Ethnicity

In [24]:
av_risk_eth = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    av_risk_eth.append(new_row)

av_risk_eth_df = pd.DataFrame(av_risk_eth)
av_risk_eth_df = av_risk_eth_df.sort_values(by='risk', ascending=False)


In [25]:
av_risk_eth_df

Unnamed: 0,eth,risk
17,scottish,0.159329
5,dominican,0.147799
19,west_indian,0.119497
1,american,0.112636
15,puerto_rican,0.073537
7,french,0.047799
8,french_canadian,0.039832
12,mexican,0.039832
11,italian,0.038756
6,english,0.038547


### Birthplace

In [26]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.02573778422835027 av_poor_prob:  0.04847605224963716


### Current Town of Residence 

In [27]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.04847605224963716 av_poor_prob:  0.018384131591678763


We note that for birthplace towns, those in less-wealthy areas have a higher risk of pregnancy complications. However, in the risk scores for current town of residence, people in wealthier areas have the higher risk.

# Cancer Analysis 


In [28]:
cancer_descriptions = ['PATIENT', 'Non-small cell lung cancer (disorder)_CONDITIONS',
                        'Non-small cell carcinoma of lung  TNM stage 4 (disorder)_CONDITIONS',
                        'Primary small cell malignant neoplasm of lung  TNM stage 4 (disorder)_CONDITIONS',
                        'Non-small cell carcinoma of lung  TNM stage 2 (disorder)_CONDITIONS',
                        'Suspected lung cancer (situation)_CONDITIONS',
                        'Malignant tumor of colon_CONDITIONS','Overlapping malignant neoplasm of colon_CONDITIONS']


X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(patients, conditions_cancer, cancer_descriptions, observations)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
train_model(X_train, y_train)

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

best LR : 0.9777557683137083
best DTC: 0.9854554124940391
best max depth:  {'max_depth': 1}
best RFC:  0.9803198708778108
best max depth:  {'max_depth': 7}
best SVM:  0.9546641722607386
best score overall is:  0.9854554124940391  with model:  DTC


Once again, we find that the model with the best score is the decision tree classifier (DTC), with about 98.5% accuracy.


In [29]:
DTC = DecisionTreeClassifier(max_depth=5)
DTC.fit(X_train, y_train)
pred_prob = DTC.predict_proba(X_test)

### Race

In [30]:
cancerRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    cancerRaceRisk.append(newRow)

cancerRaceRisk = pd.DataFrame(cancerRaceRisk)
cancerRaceRisk = cancerRaceRisk.sort_values(by='risk', ascending=False)
cancerRaceRisk


Unnamed: 0,race,risk
0,asian,0.081074
1,black,0.042912
2,hispanic,0.036358
3,white,0.035829


We find that Asian people are the most likely to have, or get diagnosed with, cancer. 

### Gender

In [31]:
cancerGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    cancerGenderRisk.append(newRow)

cancerGenderRisk = pd.DataFrame(cancerGenderRisk)
cancerGenderRisk = cancerGenderRisk.sort_values(by='risk', ascending=False)
cancerGenderRisk

Unnamed: 0,gender,risk
1,M,0.052269
0,F,0.024786


A notable result is that men are over twice as likely as women to get, or be diagnosed with, cancer (0.052 vs. 0.025).

### Ethnicity 

In [53]:
cancerEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    cancerEthRisk.append(new_row)

cancerEthRisk = pd.DataFrame(cancerEthRisk)
cancerEthRisk = cancerEthRisk.sort_values(by='risk', ascending=False)

cancerEthRisk

Unnamed: 0,eth,risk
13,polish,0.182222
12,mexican,0.18
2,asian_indian,0.154286
18,swedish,0.15
15,puerto_rican,0.135769
4,chinese,0.135714
0,african,0.117692
6,english,0.115806
1,american,0.112727
16,russian,0.1


There are not significant ethnicity distinctions in risk rates. This may be because we grouped together several kinds of cancer (lung and colon). With finer-grained groupings, it is possible that greater differences in risk rates would emerge.

### Birthplace

In [33]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.011676528599605522 av_poor_prob:  0.04303747534516765


### Current Town of Residence

In [34]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.03700197238658777 av_poor_prob:  0.06381766381766382


We note that for both birthplace and current town of residence, people in rich towns have a lower predicted risk of cancer than people from poorer towns: roughly a quarter of the risk by birthplace (0.012 vs. 0.043) and a bit over half by current residence (0.037 vs. 0.064). This suggests environmental and systemic issues that contribute to the poorer health of those from less-wealthy areas.
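
The size of the rich-versus-poor gap can be checked with a short calculation on the printed averages. The numbers below are copied from the two outputs above; the `rich_to_poor_ratio` helper is a hypothetical name introduced only for this example.

```python
# Average predicted cancer risk, copied from the printed outputs above
birthplace = {"rich": 0.011676528599605522, "poor": 0.04303747534516765}
curr_town = {"rich": 0.03700197238658777, "poor": 0.06381766381766382}

def rich_to_poor_ratio(risks):
    """Return the rich-town average risk as a fraction of the poor-town one."""
    return risks["rich"] / risks["poor"]

bp_ratio = rich_to_poor_ratio(birthplace)
ct_ratio = rich_to_poor_ratio(curr_town)

print(f"birthplace ratio: {bp_ratio:.2f}")
print(f"current town ratio: {ct_ratio:.2f}")
```

A ratio below 1 means the rich-town group has the lower average risk; the further below 1, the larger the gap.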

# Heart Analysis

In [48]:
heart_descriptions = ['PATIENT','Coronary Heart Disease_CONDITIONS','History of cardiac arrest (situation)_CONDITIONS','Cardiac Arrest_CONDITIONS',
                      'History of myocardial infarction (situation)_CONDITIONS','Myocardial Infarction_CONDITIONS']

np.random.seed(123)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(patients, conditions_heart, heart_descriptions, observations)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
train_model(X_train, y_train)

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

best LR : 0.8793808004108433
best DTC: 0.8973478595796193
best max depth:  {'max_depth': 1}
best RFC:  0.8999156303877335
best max depth:  {'max_depth': None}
best SVM:  0.8973478595796193
best score overall is:  0.8999156303877335  with model:  RFC


### Compute Average Risk scores
We found that the best model to predict probabilities for all our entries in this case is the random forest classifier (RFC).

In [49]:
RFC = RandomForestClassifier(random_state=0)
param_grid = { 'max_depth': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None ]}
grid_search = GridSearchCV(RFC, param_grid, cv=5)
grid_search.fit(X_train, y_train)

RFCScore  = grid_search.best_score_
bestRFCDepth = grid_search.best_params_

pred_prob = grid_search.best_estimator_.predict_proba(X_test)

### 4. Compare Across Race, Gender, Ethnicity

### Race

In [50]:
heartRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    heartRaceRisk.append(newRow)

heartRaceRisk = pd.DataFrame(heartRaceRisk)
heartRaceRisk = heartRaceRisk.sort_values(by='risk', ascending=False)
heartRaceRisk

Unnamed: 0,race,risk
0,asian,0.145
2,hispanic,0.124872
3,white,0.096351
1,black,0.078276


### Gender

In [51]:
heartGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    heartGenderRisk.append(newRow)

heartGenderRisk = pd.DataFrame(heartGenderRisk)
heartGenderRisk = heartGenderRisk.sort_values(by='risk', ascending=False)
heartGenderRisk

Unnamed: 0,gender,risk
1,M,0.117651
0,F,0.083125


### Ethnicity

In [62]:
heartEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    heartEthRisk.append(new_row)

heartEthRisk = pd.DataFrame(heartEthRisk)
heartEthRisk = heartEthRisk.sort_values(by='risk', ascending=False)

heartEthRisk

Unnamed: 0,eth,risk
13,polish,0.583031
12,mexican,0.580761
17,scottish,0.577455
2,asian_indian,0.577379
18,swedish,0.560368
0,african,0.55998
16,russian,0.549236
1,american,0.548145
9,german,0.521056
14,portuguese,0.516368


### Birthplace

In [54]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.10892307692307693 av_poor_prob:  0.1103076923076923


### Current Town of Residence 

In [55]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.08661538461538462 av_poor_prob:  0.09256410256410255


# Lungs Analysis 

In [65]:
lungs_descriptions = ['PATIENT','Asthma_CONDITIONS','Pulmonary emphysema (disorder)_CONDITIONS','Seasonal allergic rhinitis_CONDITIONS',
                      'Acute bronchitis (disorder)_CONDITIONS','Chronic obstructive bronchitis (disorder)_CONDITIONS','Childhood asthma_CONDITIONS',
                      'Perennial allergic rhinitis with seasonal variation_CONDITIONS','Perennial allergic rhinitis_CONDITIONS',
                      'Acute bacterial sinusitis (disorder)_CONDITIONS','Chronic sinusitis (disorder)_CONDITIONS','Sinusitis (disorder)_CONDITIONS'
]

np.random.seed(123)
X_train, y_train, X_test, y_test, race_code, eth_code, gen_code, bp_code, curr_code = prep_data(patients, conditions_lungs, lungs_descriptions, observations)

#getting rid of few NaN values
X_train.fillna(0.0, inplace=True)
#train the model
train_model(X_train, y_train)

STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

best LR : 0.569718645684311
best DTC: 0.6030666519936907
best max depth:  {'max_depth': 5}
best RFC:  0.6210850665786289
best max depth:  {'max_depth': 4}
best SVM:  0.5971387696709585
best score overall is:  0.6210850665786289  with model:  RFC


### Compute Average Risk scores
We found that the best model to predict probabilities for all our entries in this case is the random forest classifier (RFC).

In [66]:
RFC = RandomForestClassifier(random_state=0)
param_grid = { 'max_depth': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None ]}
grid_search = GridSearchCV(RFC, param_grid, cv=5)
grid_search.fit(X_train, y_train)

RFCScore  = grid_search.best_score_
bestRFCDepth = grid_search.best_params_

pred_prob = grid_search.best_estimator_.predict_proba(X_test)

### 4. Compare Across Race, Gender, Ethnicity

### Race

In [67]:
lungsRaceRisk = []

for code, race in race_code.items():
    avRisk = find_risk(code, 'race', pred_prob)
    newRow = {'race': race, 'risk': avRisk}
    lungsRaceRisk.append(newRow)

lungsRaceRisk = pd.DataFrame(lungsRaceRisk)
lungsRaceRisk = lungsRaceRisk.sort_values(by='risk', ascending=False)
lungsRaceRisk

Unnamed: 0,race,risk
0,asian,0.514338
1,black,0.509092
3,white,0.507817
2,hispanic,0.492106


### Gender

In [68]:
lungsGenderRisk = []

for code, gender in gen_code.items():
    avRisk = find_risk(code, 'gender', pred_prob)
    newRow = {'gender': gender, 'risk': avRisk}
    lungsGenderRisk.append(newRow)

lungsGenderRisk = pd.DataFrame(lungsGenderRisk)
lungsGenderRisk = lungsGenderRisk.sort_values(by='risk', ascending=False)
lungsGenderRisk

Unnamed: 0,gender,risk
0,F,0.519215
1,M,0.49355


### Ethnicity

In [69]:
lungsEthRisk = []

for code, name in eth_code.items():
    av = find_risk(code, 'ethnicity', pred_prob)
    new_row = {'eth': name, 'risk': av}
    lungsEthRisk.append(new_row)

lungsEthRisk = pd.DataFrame(lungsEthRisk)
lungsEthRisk = lungsEthRisk.sort_values(by='risk', ascending=False)

lungsEthRisk

Unnamed: 0,eth,risk
13,polish,0.583031
12,mexican,0.580761
17,scottish,0.577455
2,asian_indian,0.577379
18,swedish,0.560368
0,african,0.55998
16,russian,0.549236
1,american,0.548145
9,german,0.521056
14,portuguese,0.516368


### Birthplace

In [70]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'birthplace', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'birthplace', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.4872429058942045 av_poor_prob:  0.5189382015534418


### Current Town of Residence

In [71]:
av_rich_prob = get_av_prob_bp(richTownsUsed, 'curr_town', bp_code)
av_poor_prob = get_av_prob_bp(poorTownsUsed, 'curr_town', bp_code)

print("av_rich_prob: ", av_rich_prob, "av_poor_prob: ", av_poor_prob)

av_rich_prob:  0.5039833131472166 av_poor_prob:  0.5124758651408807


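The lung risk scores are close to 0.5 for every group, which is surprising. One possibility is that lung conditions are overrepresented in the dataset. The best model also reached only about 62% cross-validated accuracy here, so these scores may carry little signal about group differences.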
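
One way to check the overrepresentation idea is to compare the share of positive labels with the average predicted risk. The sketch below uses made-up arrays (`y_demo` and `probs_demo` are hypothetical stand-ins for `y_train` and `pred_prob`), so it only shows the form of the check: if roughly half of the patients have at least one of the listed conditions, scores near 0.5 for every group are what a weakly discriminating model would tend to produce.

```python
import numpy as np

# Made-up labels and predicted probabilities standing in for y_train and pred_prob
y_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs_demo = np.array([
    [0.5, 0.5], [0.6, 0.4], [0.4, 0.6], [0.5, 0.5],
    [0.5, 0.5], [0.6, 0.4], [0.4, 0.6], [0.5, 0.5],
])

prevalence = y_demo.mean()            # share of positive labels
mean_risk = probs_demo[:, 1].mean()   # average predicted probability of y == 1

print(f"prevalence: {prevalence:.2f}, mean predicted risk: {mean_risk:.2f}")
```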