# PFDA Project 2021
## A Simulation of Risk Factors for Type 2 Diabetes in Ireland
### Introduction
The objective of this project is to simulate the risk of developing Type 2 diabetes in Ireland using a set of generally accepted risk factors that are described in the project.

What is Diabetes? https://www.diabetes.ie/about-us/what-is-diabetes/
- Diabetes, or Diabetes Mellitus to give it its full name, is a condition caused by an insufficient supply of insulin from the pancreas. As the body consumes sugars and carbohydrates, the pancreas regulates the supply of insulin so that sugar in the blood can get into muscles and cells. When the pancreas cannot produce any insulin, or cannot produce enough, the body cannot process the sugar in the bloodstream to make it available to other organs, cells and the brain. The resulting build-up of excess sugar in the blood has potentially serious, long-term health impacts if not recognised and treated.

- Diabetes is one of the great global health issues of our time. It was there long before Covid and in all likelihood will be there long after. An estimated 450m people worldwide have diabetes, and by far the greater share (~90%) have Type 2. Globally it consumes in excess of USD 750 billion of health spend annually, with the US accounting for 40% of this. Overall costs associated with the disorder are in the order of trillions of USD. It is interesting to note that Ireland ranks sixth in the world for mean annual expenditure per head. Both the incidence of diabetes and the costs associated with it are trending substantially upwards, with projections indicating a rise to ~650m people worldwide by 2040 and pro-rata increases expected in Ireland.

https://care.diabetesjournals.org/content/41/5/963

https://www.mayoclinic.org/diseases-conditions/diabetes/symptoms-causes/syc-20371444

https://www.diabetesresearchclinicalpractice.com/article/S0168-8227(20)30138-8/fulltext

https://en.wikipedia.org/wiki/Diabetes


There are three main classifications of diabetes:
1. Type 1 
   - It is unclear exactly what triggers Type 1 diabetes; there may be genetic and/or environmental factors. It is characterised by the immune system attacking and destroying the cells in the pancreas that produce insulin. It is not known exactly what triggers this autoimmune response against the insulin-producing cells.
2. Type 2
   - This type of diabetes is characterised by a reduction in the ability of cells to respond to insulin, potentially leading to a reduction in the amount of insulin produced by the pancreas. It is often, but not strictly, a disease of later life, and there is a clear set of risk factors that can predispose a person to developing this disorder. It affects men and women equally.
3. Gestational 
   - This type is specific to pregnancy and develops in women with no previous history of diabetes. It is unclear why it arises and it generally resolves post birth. 

For the purpose of this project I will only be considering Type 2 diabetes, because there are some very clear risk factors that can predispose a person to developing the condition, unlike Type 1, where the trigger is unknown, and Gestational, which is specifically pregnancy related. Unless specifically stated otherwise, the term "diabetes" as used in the project refers to Type 2.

As stated above, the objective is to assess these risk factors on a statistical basis and from that create a simulation based on synthesised data. This will be carried out in Python using the numpy.random package as well as pandas and pyplot. The project will be created in a Jupyter notebook, and it and its associated files will be hosted on GitHub.

![[Test]](https://els-jbs-prod-cdn.jbs.elsevierhealth.com/cms/attachment/6b40d444-3dbe-4511-8432-2f944ea240a5/gr3.jpg)

### Project approach.
The project will be carried out in the following stages:
- Online research of primary risk factors that can be used for the simulation.
- From the online research translate the risk factors into tables using numpy.
- Combine the risk factor data to produce an overall data set where the output is an overall risk factor in the range 0 - 1 of developing Type 2 diabetes, where 0 is no or very low risk and 1 is very high risk. It should be borne in mind that the risk factor is just that, i.e. an indication of the likelihood of developing diabetes. A score of 0 does not guarantee that there is no possibility of developing the disease, and a score of 1 does not mean that it is certain.
- Plot the outcomes and comment on the findings. 
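
As a rough illustration of the third stage, the sketch below combines several per-person factor scores into one overall value in the range 0 - 1 using a weighted average. The weights and the individual scores are placeholders chosen purely for illustration; they are not taken from the research, and the weighting finally used in the project may well differ.

```python
import numpy as np

# Hypothetical weights, one per risk factor (assumed values that sum to 1).
weights = np.array([0.3, 0.4, 0.2, 0.1])

# Hypothetical per-person scores, each already scaled to the range 0 - 1.
# Rows are people, columns are risk factors in the same order as `weights`.
scores = np.array([
    [0.0, 0.0, 0.0, 0.0],   # lowest score on every factor
    [0.5, 0.5, 0.5, 0.5],   # mid-range score on every factor
    [1.0, 1.0, 1.0, 1.0],   # highest score on every factor
])

# Weighted average per person; it stays within 0 - 1 because the
# weights are non-negative and sum to 1.
overall_risk = scores @ weights
print(overall_risk)
```

A weighted average is only one possible way to combine the factors; it is used here because it keeps the result inside the 0 - 1 range without any further rescaling.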

#### Stage 1 - Online Research.
This section reviews available data related to the generally accepted risk factors associated with diabetes. Given the scale, costs, and health and social impacts of diabetes globally, and here in Ireland, it is not surprising that there is an enormous amount of research and data available. The challenge is to find and assess supporting data that is specific to an Irish context, and to that end I have largely restricted the research to this area. However, where necessary I will use information that is not specific to Ireland, in particular where it relates to the disease itself rather than its geographic context.

Diabetes Risk Factors. https://www.diabetes.ie/risk-factors/

The Diabetes Ireland website is a valuable source of information and data related to the disease in Ireland. According to the data there are approximately 200,000 people with the condition, and of these about 30,000 are undiagnosed. There may also be up to 150,000 in what is termed a pre-diabetes stage, i.e. people who have blood sugar levels higher than normal but have not yet progressed to developing diabetes. Diabetes is termed a "silent killer" because many people can have the condition without any recognisable symptoms. If left undiagnosed and untreated there can be serious health outcomes.

Primary Risk Factors. 
Below is the list of risk factors that I will be looking at to create the overall risk factor data set. While there are quite a few other risk factors, having reviewed a number of online publications these 4 are common across all and are deemed to be amongst the highest risk. 
- Age. The risk of developing diabetes increases with age. The risk increases for those over 40, and especially over 60. Increasing age is generally associated with weight gain and less exercise, 2 other risk factors. Younger individuals can also develop the condition.
  - For the project I will be segmenting (binning) the data into 6 age groups: 20-29, 30-39, 40-49, 50-59, 60-69, 70-79.
- Weight. This is a primary risk factor, and in the developed world the growth in diabetes closely reflects increasing weight in populations. Body cells become more resistant to insulin the greater the amount of fatty tissue in the body.
  - For this risk factor I will be using the Body Mass Index (BMI) classification, where BMI is calculated as BMI = kg/m², where kg is a person's weight in kilograms and m² is their height in metres squared.
- Physical Activity. Exercise is an important aspect of diabetes prevention. It helps to control weight and helps cells use up sugar in the blood. At least 30 minutes a day is recommended. On the flip side, a lack of exercise has the opposite effect.
- Smoking. Smoking is a high risk factor as chemicals in cigarettes interfere with normal cell functions and can decrease the effectiveness of insulin in the body. According to the US FDA, smokers have a 30% to 40% higher risk of developing diabetes than nonsmokers.  https://www.fda.gov/tobacco-products/health-effects-tobacco-use/cigarette-smoking-risk-factor-type-2-diabetes
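
The BMI formula given under Weight can be sketched in a few lines of Python. The classification bands used here are the widely quoted WHO adult cut-offs (18.5, 25 and 30); the text above does not list the bands, so treat them as an assumption, and the example person is hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Classify a BMI value using the WHO adult cut-offs (assumed bands)."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal"
    if value < 30:
        return "Overweight"
    return "Obese"


# Hypothetical person: 80 kg and 1.75 m tall.
example = bmi(80, 1.75)
print(round(example, 1), bmi_category(example))   # 26.1 Overweight
```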

Data for Risk Factors
Having identified the risk factors to be used in the project, the next stage is to look at the available data associated with them, as this will be needed to quantify the risk factors used in the synthesised data set.

#### Age https://www.cso.ie/en/releasesandpublications/ep/p-pme/populationandmigrationestimatesapril2021/mainresults/
The most reliable source of population estimates is the Central Statistics Office (CSO), www.cso.ie. I am using the most recent data, as of April 2021, which gives an overall population for the country of 5.01m. Table 1.8 from this report provides a complete breakdown of the population by age category from 2015 to 2021. The 2021 data will be used and is stored in the Datasets folder as IrelandPop_AgeCat.csv.

The data table is shown below. 

In [95]:
import pandas as pd
import numpy as np
from numpy import random


In [96]:
#Import the population datatable
df = pd.read_csv("./Datasets/IrelandPop_AgeCat.csv")
df

Unnamed: 0,Category,Yr2015,Yr2016,Yr2017,Yr2018,Yr2019,Yr2020,Yr2021
0,0 - 4,337.9,331.4,324.6,319.3,315.2,309.5,302.6
1,5 - 9,349.3,355.3,359.1,356.9,352.3,344.1,335.0
2,10 - 14,314.4,318.9,323.3,332.6,341.4,350.0,358.0
3,15 - 19,296.1,301.2,308.5,316.2,319.9,323.9,325.3
4,20 - 24,275.5,273.5,276.3,289.3,298.1,307.2,310.3
5,25 - 29,298.7,296.7,292.4,291.1,289.2,292.2,291.6
6,30 - 34,369.2,360.3,347.9,335.8,330.8,324.0,317.2
7,35 - 39,379.8,388.1,394.6,398.3,397.0,386.5,375.7
8,40 - 44,353.0,356.5,363.1,369.9,379.9,393.7,402.1
9,45 - 49,317.0,324.9,333.1,341.2,351.1,358.5,364.4


In [105]:
# Getting just the 2021 data.
df_21 = df[['Category' , 'Yr2021']]
df_21

Unnamed: 0,Category,Yr2021
0,0 - 4,302.6
1,5 - 9,335.0
2,10 - 14,358.0
3,15 - 19,325.3
4,20 - 24,310.3
5,25 - 29,291.6
6,30 - 34,317.2
7,35 - 39,375.7
8,40 - 44,402.1
9,45 - 49,364.4


In [107]:
# Rows 4 and 5 are the two five-year bands that make up the 20-29 age group.
df_21.loc[4:5]

Unnamed: 0,Category,Yr2021
4,20 - 24,310.3
5,25 - 29,291.6
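
The two rows selected above together make up the 20-29 age group. As a minimal sketch of how the five-year CSO bands could be combined into the ten-year groups used in this project, the code below rebuilds the visible Yr2021 rows (values copied from the table above, which appear to be in thousands given the 5.01m total) and sums them by decade. The column name `AgeGroup`, the seed and the sample size are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Yr2021 values copied from the rows shown above; only the bands from
# 20 to 49 are visible there, so the older bands are left out here.
bands = pd.DataFrame({
    "Category": ["20 - 24", "25 - 29", "30 - 34", "35 - 39", "40 - 44", "45 - 49"],
    "Yr2021": [310.3, 291.6, 317.2, 375.7, 402.1, 364.4],
})

# Lower bound of each band, e.g. "25 - 29" -> 25.
lower = bands["Category"].str.split(" - ").str[0].astype(int)

# Label each band with its ten-year group, e.g. 25 -> "20-29".
decade = (lower // 10) * 10
bands["AgeGroup"] = decade.astype(str) + "-" + (decade + 9).astype(str)

# Sum the two five-year bands inside each ten-year group.
by_decade = bands.groupby("AgeGroup")["Yr2021"].sum()
print(by_decade)

# Draw simulated age groups in proportion to population (arbitrary seed).
rng = np.random.default_rng(2021)
probabilities = (by_decade / by_decade.sum()).to_numpy()
sample = rng.choice(by_decade.index.to_numpy(), size=1000, p=probabilities)
```

Sampling with `rng.choice` and population shares as probabilities is one simple way to make the synthesised ages follow the CSO distribution; the same pattern would extend to the 50-79 bands once their rows are included.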


In [116]:
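# Spread the age categories across columns; the original row index is kept,
# so each value sits on its own row and the other cells are empty.
df_21.pivot(index=None, columns='Category', values='Yr2021')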

Category,0 - 4,10 - 14,15 - 19,20 - 24,25 - 29,30 - 34,35 - 39,40 - 44,45 - 49,5 - 9,50 - 54,55 - 59,60 - 64,65 - 69,70 - 74,75 - 79,80 - 84,85 years and over
0,302.6,,,,,,,,,,,,,,,,,
1,,,,,,,,,,335.0,,,,,,,,
2,,358.0,,,,,,,,,,,,,,,,
3,,,325.3,,,,,,,,,,,,,,,
4,,,,310.3,,,,,,,,,,,,,,
5,,,,,291.6,,,,,,,,,,,,,
6,,,,,,317.2,,,,,,,,,,,,
7,,,,,,,375.7,,,,,,,,,,,
8,,,,,,,,402.1,,,,,,,,,,
9,,,,,,,,,364.4,,,,,,,,,
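
The pivot above keeps the original row index, so each value lands on its own row and most cells are empty; the columns are also sorted as text, which is why "5 - 9" appears after "45 - 49". If a single-row wide layout is what is wanted (an assumption about the intent), one possible approach is to set Category as the index and transpose. The sketch rebuilds a few rows of df_21 from the values shown earlier so that it runs on its own.

```python
import pandas as pd

# A few rows of df_21 rebuilt from the values shown earlier.
df_21 = pd.DataFrame({
    "Category": ["0 - 4", "5 - 9", "10 - 14", "15 - 19"],
    "Yr2021": [302.6, 335.0, 358.0, 325.3],
})

# Category becomes the index; transposing then turns it into the columns,
# giving a single Yr2021 row and keeping the original band order.
wide = df_21.set_index("Category").T
print(wide)
```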
