# COGS 108 - The Effects of Income and Lifestyle Choices on the Risk of Diabetes

# Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Names
- Brian Dinh
- Candy Zhang
- Edwin Liang
- Sindhu Kothe
- Vaishnavi Ramanujan

# Abstract

Please write one to four paragraphs that describe a very brief overview of why you did this, how you did, and the major findings and conclusions.

# Research Question

What is the relationship between the rate of diabetes and individual-specific body conditions? Would lifestyle choices such as smoking, drinking, and an unhealthy diet affect the risk of diabetes? What are the effects of income on the rate of diabetes?

## Background and Prior Work

Diabetes is a chronic metabolic disease that affects an overwhelming number of people in the United States. For this reason, it is paramount that we be able to predict whether or not an individual is at risk of developing diabetes given a set amount of information.

We found that there is prior research<a name="diabetes-rate"></a>[<sup>1</sup>](#dia-rate) exploring the mortality rate in people with type 2 diabetes. From this, we can see that the mortality rate of people with type 2 diabetes increases as they grow older. The paper focuses on predicting 5-year mortality rates for older people with diabetes, which would help determine whether an immediate intervention is required. We also discovered that there is a clear correlation between diabetes and gender. According to one study<a name="diabetes-gender"></a>[<sup>2</sup>](#dia-gender), the population of older people diagnosed with diabetes is overwhelmingly male. On the other hand, the lower-risk populations that were studied show an overwhelming female bias. However, when people with diabetes in the younger age groups were studied, there was a fairly even split between the sexes.

Furthermore, we see that there is in fact a correlation between substance use and the risk of diabetes. According to one study hosted by the NCBI<a name="diabetes-smoke"></a>[<sup>3</sup>](#dia-smoke) (National Center for Biotechnology Information), men who smoked around 25 cigarettes daily were at greater risk of diabetes. Also, men who drank 30.0 - 49.9 grams of alcohol had a higher relative risk of diabetes. However, the study was conducted only on male professionals in the age range of 40 - 75. The study is also relatively old, so we don't know how the correlation plays out in today's world. There has also been some research on the relationship between diabetes and socioeconomic status, including education and income level, that we hope to build on with our project. This research<a name="diabetes-income"></a>[<sup>4</sup>](#dia-income) shows that there is a strong correlation between income and type 2 diabetes in Canada, and we would like to do research primarily based in the United States. The data we are using, while primarily based in the United States, contains not only information about individual body conditions such as BMI and insulin, but also information about a person's nutrition, income, and lifestyle habits. This data should allow us to determine whether or not the given factors impact an individual's risk of diabetes. We hope to address the gaps we found in these studies and conduct a more representative analysis covering both eating habits and socioeconomic conditions.

1. <a name="dia-rate"></a> [^](#diabetes-rate) *Journal of Diabetes Research*. https://www.hindawi.com/journals/jdr/2024/1741878/
2. <a name="dia-gender"></a> [^](#diabetes-gender) *Diabetologia Journal*. https://link.springer.com/article/10.1007/s001250051573
3. <a name="dia-smoke"></a> [^](#diabetes-smoke) *National Center for Biotechnology Information*. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2548937/
4. <a name="dia-income"></a> [^](#diabetes-income) *National Center for Biotechnology Information*. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4603875/


# Hypothesis


We expect that a person who smokes, drinks, and has unhealthy eating habits is more likely to be diagnosed with diabetes. We also think there is a correlation between having a lower income and having less access to healthy, nutrient-rich food. Essentially, we believe that healthy lifestyle choices and socioeconomic equality play important roles in reducing the prevalence of diabetes and its associated risk factors. We expect income inequalities and unhealthy food and lifestyle choices to increase an individual’s risk of developing diabetes. Conversely, we believe that proper nutrition and reduced intake of alcohol / smoking will lower the rate of diabetes.

# Data

## Data overview

## Dataset 1: Diabetes Health Indicators Dataset

- Dataset Name: Diabetes Health Indicators Dataset
  - Link to dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
  - Number of observations: 253,680
  - Number of variables: 22
  - Description of important variables:
    - HighBP: 0 = no high blood pressure, 1 = high blood pressure
    - Heavy Alcohol Consumption: 0 = no heavy alcohol consumption, 1 = heavy alcohol consumption
    - Smoker: 0 = non-smoker, 1 = smoker
    - Insulin: To express the Insulin level in blood
    - BMI: To express the Body mass index
    - Diabetes: If the individual has diabetes - 0 = no diabetes, 1 = prediabetes,  2 = has diabetes
    - Income: On a scale 1-8 where 1 = less than 10,000 usd a year, 5 = less than 35,000 usd per year, 8 = 75,000 usd or more per year 
    - Fruits / Veggies: 1 =  consumption of fruits and vegetables at least once per day, 0 = no fruits or veggies
    - Sex: 0 = male, 1 = female
    - Age: scale of 1 to 13, where 1 = 18-24, 9 = 60-64, and 13 = 80 or older
  - Description of Dataset:
    Because all of the important variables and other columns contain only numeric values, we will not need a significant amount of pre-processing for the data. However, for the diabetes column, we plan to merge the prediabetes and diabetes categories, since our goal is to have a binary output. We will also merge the fruits and vegetables columns so that we can further streamline the input variables. The fruits and veggies columns are a proxy for healthy eating, while blood pressure, insulin, and heart disease are proxies for individual-specific body conditions. Because there are no missing values, we only need to look through the dataset for outliers and normalize for gender. Because this is a general health survey, we also need to remove variables that are unnecessary for our question, such as whether a person has health care, difficulty walking up and down stairs, and state of mental health. We plan to merge the two datasets we have by the diabetes column, as both datasets include whether or not an individual has diabetes.

In [None]:
#import statements 
import pandas as pd
import numpy as np

In [None]:
#reading the dataset in as a csv
diabetes_df = pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")

#We only want to observe the important variables as those are the relevant ones to our research question
important_vars = ['Age', 'HighBP', 'HvyAlcoholConsump', 'Smoker',
                 'BMI', 'Income', 'Fruits', 'Veggies', 'Sex', 'Diabetes_012']
diabetes_df = diabetes_df[important_vars]
diabetes_df_cleaned = diabetes_df.dropna()

#display dataset
diabetes_df_cleaned
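
The plan above mentions merging the fruits and vegetables columns, but the notebook does not show how. The following is a minimal sketch of one possible way to do it, run on made-up rows. The column name `FruitsAndVeggies` and the choice of requiring both foods daily are our assumptions for illustration, not a decision recorded elsewhere in this project.

```python
import pandas as pd

# Toy rows using the BRFSS column names; the values are invented for illustration.
toy = pd.DataFrame({
    "Fruits":  [1, 0, 1, 0, 1],
    "Veggies": [1, 1, 0, 0, 1],
})

# Hypothetical combined indicator: 1 only if both fruits and vegetables
# are eaten at least once per day, 0 otherwise.
toy["FruitsAndVeggies"] = ((toy["Fruits"] == 1) & (toy["Veggies"] == 1)).astype(int)

print(toy)
```

A looser alternative would be to flag anyone who eats either food daily (replace `&` with `|`); which definition is the better proxy for healthy eating is a judgment call.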

## Dataset 2: Food Environment Atlas Dataset

- Dataset Name: Food Environment Atlas
    - Link to the dataset: https://www.ers.usda.gov/data-products/food-environment-atlas/data-access-and-documentation-downloads/
    - Number of observations: 102
    - Number of variables: 4
    - Description of Important Variables:
        - State: List of the US States
        - Variables: Specifies whether `mean` describes income (`median_income`) or the diabetes rate (`diabetes_rate`)
        - mean: state average of either income or diabetes rate
    - Description of the dataset:
        This dataset has data regarding nutrition and income, as well as the circumstances of those receiving assistance. There were many more variables in the `Variables` column, but as they were unnecessary for our analysis, they were removed. This leaves us with the two variables that are useful to our analysis: the median income and the diabetes rate. We will be exploring whether there is a correlation between income and diabetes through this dataset. We will create an additional dataset from this data that focuses on nutrition information and the diabetes rate. Because the nutrition and income variables were divided by different geographic regions (the income rate was based on county, and nutrition information was based on state), it was difficult to create one dataset with both those features. Instead, we decided to split the data into two datasets and conduct our EDA on each.


In [None]:
#read in the dataset as a csv 
df2 = pd.read_csv("StateAndCountyData.csv")

#keep only the rows for median household income and adult diabetes rate
df2 = df2[df2["Variable_Code"].isin(["MEDHHINC15", "PCT_DIABETES_ADULTS13"])]
#drop the county-level identifier columns
df2 = df2.drop(columns = ["FIPS", "County"])

#average the county values within each state
df2 = df2.groupby(["State", "Variable_Code"])["Value"].agg(["mean"]).reset_index()
#rename the variable codes with more intuitive values
df2["Variable_Code"] = df2["Variable_Code"].replace({"MEDHHINC15": "median_income", "PCT_DIABETES_ADULTS13": "diabetes_rate"})
df2 = df2.rename(columns = {"Variable_Code": "Variables"})

# display dataset
df2
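
Because `df2` stores income and diabetes rate as separate rows for each state, the two values have to sit side by side before they can be correlated. The following is a sketch of that reshaping step on an invented table with the same three columns. The state codes and numbers are placeholders, and using the Pearson coefficient is our assumption rather than a choice made in this notebook.

```python
import pandas as pd

# Toy long-format table shaped like df2; state codes and values are invented.
long_df = pd.DataFrame({
    "State": ["AA", "AA", "BB", "BB", "CC", "CC"],
    "Variables": ["median_income", "diabetes_rate"] * 3,
    "mean": [40000.0, 12.0, 55000.0, 10.0, 70000.0, 8.0],
})

# Reshape to one row per state and one column per variable.
wide_df = long_df.pivot(index="State", columns="Variables", values="mean")

# Pearson correlation between state-level income and diabetes rate.
r = wide_df["median_income"].corr(wide_df["diabetes_rate"])

print(wide_df)
print(round(r, 3))
```

The same `pivot` call should work on `df2` directly, since each state and variable pair appears only once after the `groupby`.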

# Results

## Exploratory Data Analysis

## EDA Section 1 - Dataset 1: Diabetes Health Indicators

In [None]:
# Get mean, medians, and spread
statistics = diabetes_df_cleaned.describe()
statistics

Looking at the statistics, the mean age category is about 8 (on a scale of 1 to 13), and the mean of the Sex column is 0.44 (where 0 = male and 1 = female). Although the sample skews somewhat older and leans slightly male, neither imbalance appears large enough to bias our conclusions.

In [None]:
#import statements 
import seaborn as sns
import matplotlib.pyplot as plt
import math

# Calculate the number of rows and columns based on the number of columns in the DataFrame
num_columns = len(diabetes_df_cleaned.columns)
num_rows = math.ceil(num_columns / 4)  # Assuming 4 plots per row

# Create subplots for each variable
fig, axs = plt.subplots(nrows=num_rows, ncols=4, figsize=(20, 5 * num_rows))

# Flatten the axs array for easier iteration
axs = axs.flatten()

# Iterate through each column and plot the histogram
for i, column in enumerate(diabetes_df_cleaned.columns):
    axs[i].hist(diabetes_df_cleaned[column], bins=20, alpha=0.7)
    axs[i].set_title(column)
    axs[i].set_xlabel('Values')
    axs[i].set_ylabel('Frequency')

# Hide the remaining empty subplots
for j in range(i + 1, len(axs)):
    axs[j].axis('off')

plt.tight_layout()
plt.show()

In [None]:
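#Rename Diabetes_012 to Diabetes_01 and fold value 2 (diabetes) into 1,
#so that 0 = no diabetes and 1 = prediabetes or diabetes
diabetes_df_cleaned = diabetes_df_cleaned.rename(columns = {"Diabetes_012": "Diabetes_01"})
#assign the result back instead of using inplace=True on a column selection
diabetes_df_cleaned["Diabetes_01"] = diabetes_df_cleaned["Diabetes_01"].replace(2, 1)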

Looking at the above plots, most individuals in this dataset do not have diabetes (the 0 bar is the highest in the Diabetes graph), and far fewer have diabetes or prediabetes. Additionally, income is left-skewed, with responses concentrated at the value 8, which means the most common bracket in this dataset is the highest one, at least 75,000 USD a year. In terms of the fruits and veggies graphs, more people appear to consume fruits and veggies than not, and very few appear to be heavy alcohol drinkers. These are all factors to consider as we continue our EDA, to ensure the skewed distributions do not wrongly influence our conclusions.

In [None]:
#view the distribution of the dataset once again
diabetes_df_cleaned.describe()

Next, we will be creating a scatterplot for BMI vs. Blood Pressure with Diabetes as Hue to look at how individual specific body conditions impact the diabetes rate:

In [None]:
#create a scatterplot with bmi vs blood pressure
sns.scatterplot(data=diabetes_df_cleaned, x='BMI', y='HighBP', hue='Diabetes_01')
plt.title('BMI vs. Blood Pressure with Diabetes')
plt.xlabel('BMI')
plt.ylabel('Blood Pressure')
plt.show()

Barplot for Blood Pressure vs Diabetes Rate:

In [None]:
sns.barplot(data=diabetes_df_cleaned, x='HighBP', y='Diabetes_01')
plt.title('Blood Pressure vs. Diabetes Rate')
plt.xlabel('Blood Pressure(1 for high 0 otherwise)')
plt.ylabel('Diabetes Rate')
plt.show()

Barplot for Heavy Alcohol Consumption vs. Diabetes Rate:

In [None]:
sns.barplot(data=diabetes_df_cleaned, x='HvyAlcoholConsump', y='Diabetes_01')
plt.title('Heavy Alcohol Consumption vs. Diabetes Rate')
plt.xlabel('Heavy Alcohol Consumption')
plt.ylabel('Diabetes Rate')
plt.show()

Barplot for Smoking vs. Diabetes Rate:

In [None]:
sns.barplot(data=diabetes_df_cleaned, x='Smoker', y='Diabetes_01')
plt.title('Smoking vs. Diabetes Rate')
plt.xlabel('Smoker')
plt.ylabel('Diabetes Rate')
plt.show()

Barplot for Fruits Consumption vs. Diabetes Rate:

In [None]:
sns.barplot(data=diabetes_df_cleaned, x='Fruits', y='Diabetes_01')
plt.title('Fruits Consumption vs. Diabetes Rate')
plt.xlabel('Fruits Consumption')
plt.ylabel('Diabetes Rate')
plt.show()

Barplot for Veggies Consumption vs. Diabetes Rate:

In [None]:
sns.barplot(data=diabetes_df_cleaned, x='Veggies', y='Diabetes_01')
plt.title('Veggies Consumption vs. Diabetes Rate')
plt.xlabel('Veggies Consumption')
plt.ylabel('Diabetes Rate')
plt.show()
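
The barplots above compare diabetes rates between two groups visually. Since both axes are binary, one way to check whether such a difference could be due to chance is a chi-square test of independence. The sketch below uses made-up counts with the notebook's column names. Applying this test here is our suggestion, not an analysis already performed in this project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy binary data with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "Smoker":      [0] * 50 + [1] * 50,
    "Diabetes_01": [0] * 40 + [1] * 10 + [0] * 25 + [1] * 25,
})

# 2x2 table of counts: rows are Smoker, columns are Diabetes_01.
table = pd.crosstab(toy["Smoker"], toy["Diabetes_01"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

With a sample as large as ours, even small differences will likely be statistically significant, so the size of the difference in rates matters as much as the p-value.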

## Removing Outliers

We want to remove outliers for each variable. We could remove outliers overall (i.e., drop a row if any one of its variables is an outlier), but that would remove a large chunk of this dataset (from 253k samples down to 85k). Instead, we remove outliers with respect to each variable separately and keep one smaller table per variable, which preserves more data.
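
To illustrate the trade-off described above, the following sketch applies the 1.5 × IQR rule to a small invented table and counts how many rows survive under each strategy. The helper name `iqr_mask` and the toy values are ours, for illustration only.

```python
import pandas as pd

def iqr_mask(series: pd.Series) -> pd.Series:
    """Return True where a value lies within 1.5 * IQR of the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Toy data: one extreme BMI and one extreme Income, in different rows.
toy = pd.DataFrame({
    "BMI":    [22, 24, 25, 26, 27, 28, 29, 60],
    "Income": [6, 7, 7, 8, 8, 8, 1, 7],
})

# One boolean column per variable.
masks = toy.apply(iqr_mask)

# Row-wise filtering keeps a row only if every variable passes.
rows_kept_jointly = int(masks.all(axis=1).sum())

# Per-variable filtering keeps each column's own set of valid rows.
rows_kept_per_column = masks.sum()

print(rows_kept_jointly)
print(rows_kept_per_column)
```

Here each variable loses only its own outlier, while the row-wise filter loses both rows. The more variables there are, the larger this gap tends to become.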

In [None]:
def drop_iqr_outliers(df, column):
    """Keep the rows whose `column` value lies within 1.5 * IQR of the quartiles."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # between() is inclusive on both ends, matching >= and <=
    return df[df[column].between(lower_bound, upper_bound)]

# One table per quantitative variable, each paired with the diabetes label
age_no_outliers = drop_iqr_outliers(diabetes_df_cleaned[['Age', 'Diabetes_01']], 'Age')
bmi_no_outliers = drop_iqr_outliers(diabetes_df_cleaned[['BMI', 'Diabetes_01']], 'BMI')
income_no_outliers = drop_iqr_outliers(diabetes_df_cleaned[['Income', 'Diabetes_01']], 'Income')

## Analyzing quantitative variables with no outliers

Now that we've gotten rid of outliers we should analyze the quantitative variables by plotting the box plots comparing the quantitative variables for diabetic and non-diabetic samples. We will also analyze the central tendencies and spread values of diabetic and non-diabetic samples for each quantitative variable. Let's start off with age:

**Age**

In [None]:
sns.boxplot(data=age_no_outliers, y='Age', x='Diabetes_01')
plt.show()

In [None]:
age_no_outliers[age_no_outliers['Diabetes_01'] == 0].describe()

In [None]:
age_no_outliers[age_no_outliers['Diabetes_01'] == 1].describe()

It seems that diabetic samples are typically somewhat older. We also notice that the range of ages is more diverse for non-diabetic samples: both the IQR and the standard deviation are higher, indicating a larger spread. This makes sense since most people in this dataset are not diabetic, so we expect a more diverse range of values. Diabetic people, however, are concentrated at older ages, and few diabetic samples are young. All of this indicates a positive correlation between Age and the likelihood of diabetes. This may be due to factors such as older people being less healthy on average, which would make them more likely to have diabetes.

**BMI**

In [None]:
sns.boxplot(data=bmi_no_outliers, y='BMI', x='Diabetes_01')
plt.show()

In [None]:
bmi_no_outliers[bmi_no_outliers['Diabetes_01'] == 0].describe()

In [None]:
bmi_no_outliers[bmi_no_outliers['Diabetes_01'] == 1].describe()

Observing the box plots, the only major difference is that the central tendency of BMI is higher for diabetic patients: both the mean and the median BMI are higher. The spread is roughly the same, with diabetic patients having a slightly higher spread. This supports the idea that a higher BMI typically correlates with a higher likelihood of diabetes. It may also be a factor in causing diabetes, although this analysis alone cannot establish that.

**Income**

In [None]:
sns.boxplot(data=income_no_outliers, y='Income', x='Diabetes_01')
plt.show()

In [None]:
income_no_outliers[income_no_outliers['Diabetes_01'] == 0].describe()

In [None]:
income_no_outliers[income_no_outliers['Diabetes_01'] == 1].describe()

The mean and median income of non-diabetic patients are higher, which is what we expected, since wealthier people tend to have better access to healthcare and to healthier food. This suggests a negative correlation between income and the likelihood of diabetes. In addition, the spread of income among diabetic patients is larger, as seen by the higher IQR. A possible explanation is that, while most diabetic samples have an income close to the mean or median, the worse living conditions associated with low income produce many lower-income diabetic patients, which widens the spread.

## Correlations between the quantitative variables

It's helpful to learn the correlations between the quantitative variables to learn their relationships and to check for confounding. If two independent variables correlate with each other, but both correlate with the likelihood of diabetes in some way, one of them may be a confounding variable.
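
As a complement to the scatterplots in this section, the relationships can also be summarized numerically in a single correlation matrix. The sketch below uses invented rows with the notebook's column names. Choosing Spearman's rank correlation is our assumption, made because Age and Income are ordinal categories rather than continuous measurements.

```python
import pandas as pd

# Toy rows with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "Age":         [1, 3, 5, 7, 9, 11, 13, 8],
    "BMI":         [22, 31, 25, 28, 24, 30, 26, 27],
    "Income":      [8, 7, 7, 6, 5, 4, 2, 6],
    "Diabetes_01": [0, 0, 0, 0, 1, 1, 1, 0],
})

# Pairwise rank correlations between every pair of columns.
corr = toy.corr(method="spearman")

print(corr.round(2))
```

If two predictors are strongly correlated with each other and with `Diabetes_01`, that pair is a candidate for confounding and is worth examining within subgroups before drawing conclusions.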

In [None]:
age_bmi_df = diabetes_df_cleaned[['Age', 'BMI']]
Q1 = age_bmi_df.quantile(0.25)
Q3 = age_bmi_df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
age_bmi_df = age_bmi_df[(age_bmi_df >= lower_bound) & (age_bmi_df <= upper_bound)]
age_bmi_df = age_bmi_df.dropna()

x = age_bmi_df['Age']
y = age_bmi_df['BMI']
plt.scatter(x,y)
plt.xlabel('Age')
plt.ylabel('BMI')
b1, b0 = np.polyfit(np.array(x), np.array(y), 1)
plt.plot(np.array(x), b1 * np.array(x) + b0, color='r')
plt.show()

There seems to be no correlation between Age and BMI.

In [None]:
df = diabetes_df_cleaned[['Age', 'Income']]
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df = df[(df >= lower_bound) & (df <= upper_bound)]
df = df.dropna()

x = df['Age']
y = df['Income']
plt.scatter(x,y)
plt.xlabel('Age')
plt.ylabel('Income')
b1, b0 = np.polyfit(np.array(x), np.array(y), 1)
plt.plot(np.array(x), b1 * np.array(x) + b0, color='r')
plt.show()

There seems to be some negative correlation between age and income. This can likely be explained by retirement and reduced income as people get older.

In [None]:
df = diabetes_df_cleaned[['BMI', 'Income']]
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df = df[(df >= lower_bound) & (df <= upper_bound)]
df = df.dropna()

x = df['BMI']
y = df['Income']
plt.scatter(x,y)
plt.xlabel('BMI')
plt.ylabel('Income')
b1, b0 = np.polyfit(np.array(x), np.array(y), 1)
plt.plot(np.array(x), b1 * np.array(x) + b0, color='r')
plt.show()

There is a negative correlation between Income and BMI. This is expected, as those with lower income may be forced to live under unhealthier conditions.

## T-tests for quantitative variables

In [None]:
from scipy.stats import ttest_ind
group0 = age_no_outliers[age_no_outliers['Diabetes_01'] == 0]
group1 = age_no_outliers[age_no_outliers['Diabetes_01'] == 1]
ttest_ind(np.array(group0['Age']), np.array(group1['Age']))

In [None]:
group0 = bmi_no_outliers[bmi_no_outliers['Diabetes_01'] == 0]
group1 = bmi_no_outliers[bmi_no_outliers['Diabetes_01'] == 1]
ttest_ind(group0['BMI'], group1['BMI'])

In [None]:
group0 = income_no_outliers[income_no_outliers['Diabetes_01'] == 0]
group1 = income_no_outliers[income_no_outliers['Diabetes_01'] == 1]
ttest_ind(group0['Income'], group1['Income'])
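
The tests above use the default settings of `ttest_ind`, which assume both groups have equal variance. Because the diabetic and non-diabetic groups differ in size and spread, a variant worth considering is Welch's t-test together with an effect size. The following sketch runs on synthetic numbers, not on our data. Both the use of Welch's test and the use of Cohen's d are our suggestions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins for BMI in the two groups (not the real data).
group0 = rng.normal(loc=27.0, scale=4.0, size=500)  # non-diabetic
group1 = rng.normal(loc=30.0, scale=5.0, size=120)  # diabetic

# Welch's t-test: does not assume the two groups share a variance.
t_stat, p_value = ttest_ind(group0, group1, equal_var=False)

# Cohen's d with a pooled standard deviation, as a rough effect size.
n0, n1 = len(group0), len(group1)
pooled_var = ((n0 - 1) * group0.var(ddof=1) + (n1 - 1) * group1.var(ddof=1)) / (n0 + n1 - 2)
cohens_d = (group1.mean() - group0.mean()) / np.sqrt(pooled_var)

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, d = {cohens_d:.2f}")
```

With hundreds of thousands of rows, p-values will tend to be extremely small, so the effect size is the more informative number for judging whether a difference matters in practice.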

# Ethics & Privacy

There are some ethical concerns regarding equitability we must consider. Diabetes rates may vary by gender and among people under 21. Additionally, there are claims that ethnicity affects diabetes rates<a name="diabetes-ethnicity"></a>[<sup>1</sup>](#dia-ethnicity). It’s not clear why this correlation is reported, but it is a possibility to consider when searching for a dataset. Certain disabilities can also have an impact on the diabetes rate, and ideally we’d want a dataset that accounts for those disabilities too. Furthermore, we should ensure that our dataset contains a diverse set of class and wealth levels, as lower-income individuals may be forced to have a lower quality of life, which may cause a higher risk of diabetes. Therefore, it’s important to sample a diverse set of people in our dataset to account for as many of these differences as possible. We can check for such datasets either by reading the description of the dataset and gauging how diverse it is, or by checking whether age, gender, ethnicity, disabilities, and class are variables in the dataset and doing EDA to make sure that the distributions of those variables are diverse. If we cannot find data that accounts for all of these differences, then we must add disclaimers stating as much in order to prevent misinterpretation.

Even if we have a diverse dataset, though, there may still be some issues with equitability. If we don’t have enough variables in the dataset, we can get correlations that lead us to identify false causes and draw false conclusions. For example, suppose we exclude BMI. Certain population groups may happen to have higher BMIs on average for a variety of reasons. BMI could be the actual reason why they typically have a higher risk for diabetes, but if we exclude BMI, we may falsely conclude that certain demographics are naturally more likely to suffer from diabetes, which can be problematic. False conclusions can lead to treatments that are harmful to people or groups. To address this, we would have to consider datasets that have many diverse variables. Lastly, with an ideal dataset that addresses all of these problems, the only major remaining issue would be privacy. We wouldn’t want hospitals to release sensitive information from patients without their knowledge or proper consent, for instance. The data has to be sourced from willing participants and handled in an appropriate manner, so that no data can be linked back to an individual patient.

To handle these concerns, we will carefully inspect the datasets we have chosen and their descriptions to ensure that the data, at least on the surface, appears to be representative, including diverse ages, genders, and income levels. We will also conduct EDA to check that the distributions of these variables are not heavily skewed. Additionally, we will conduct extensive research to ensure that we are not dropping variables that affect our research question and conclusions.

1. <a name="dia-ethnicity"></a> [^](#diabetes-ethnicity) *Diabetes UK*. https://www.diabetes.org.uk/diabetes-the-basics/types-of-diabetes/type-2/diabetes-ethnicity

# Discussion and Conclusion

Wrap it all up here.  Somewhere between 3 and 10 paragraphs roughly.  A good time to refer back to your Background section and review how this work extended the previous stuff. 


Note: the first dataset comes from a telephone survey, so it may be biased by who picks up the phone (and by the income of those who do), and respondents may not admit to being heavy smokers or drinkers.


# Team Contributions

Specify who did what.  This should be pretty granular, perhaps bullet points, no more than a few sentences per person.