# Data Mining 


## Abstract

Five datasets were obtained from the 2017-March 2020 Pre-pandemic National Health and Nutrition Examination Survey (NHANES). These datasets were cleaned and merged in preparation for further analysis. Initial exploratory analysis was performed with NumPy and Pandas, and visualisation with the Bokeh Python library. The goal of this project is to complete a data mining and data wrangling task using real-world data while learning to use a new visualisation package that supports interactive plots. The analysis and visualisation aim to explore relationships between various medical measurements, particularly between various types of cholesterol and glucose measurements.



Import preliminary modules for setting up DataFrame

In [1]:
import numpy as np
import pandas as pd
from bokeh.io import output_notebook

pd.set_option("display.notebook_repr_html", False)  # disable "rich" output

Allow Bokeh plots to display inline within a Jupyter notebook

In [2]:
output_notebook()

**Setting up the DataFrame** 

Load `.xpt` files downloaded from https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2017-2020.

Convert `.xpt` files into `.csv` files.

Read `.csv` files into Pandas individual DataFrames.

Merge the individual DataFrames into a combined DataFrame, `merged_data`.

Remove unnecessary columns with `.drop()`, keeping only the columns of interest, and assign the result to `working_df`.

Re-name remaining columns for better readability.

In [3]:
# Load xpt files
P_TRIGLY = pd.read_sas('P_TRIGLY.xpt', index=None)
P_HDL = pd.read_sas('P_HDL.xpt', index=None)
P_TCHOL = pd.read_sas('P_TCHOL.xpt', index=None)
P_GHB = pd.read_sas('P_GHB.xpt', index=None)
P_GLU = pd.read_sas('P_GLU.xpt', index=None)

# Convert xpt files to csv
P_TRIGLY.to_csv('P_TRIGLY.csv',index=False)
P_HDL.to_csv('P_HDL.csv',index=False)
P_TCHOL.to_csv('P_TCHOL.csv',index=False)
P_GHB.to_csv('P_GHB.csv',index=False)
P_GLU.to_csv('P_GLU.csv',index=False)

# Read CSV Files
ldl = pd.read_csv("P_TRIGLY.csv", comment="#")
hdl = pd.read_csv("P_HDL.csv", comment="#")
total_chol = pd.read_csv("P_TCHOL.csv", comment="#")
fasting_glucose = pd.read_csv("P_GLU.csv", comment="#")
hba1c = pd.read_csv("P_GHB.csv", comment="#")

# Join DataFrames
merged_data = pd.merge(ldl, hdl, on='SEQN', how='inner')
merged_data = pd.merge(merged_data, total_chol, on='SEQN', how='inner')
merged_data = pd.merge(merged_data, fasting_glucose, on='SEQN', how='inner')
merged_data = pd.merge(merged_data, hba1c, on='SEQN', how='inner')

# Keep the columns of interest by dropping everything else
columns_to_remove = ["WTSAFPRP_x", "WTSAFPRP_y", "LBXTR", "LBDLDL", "LBDLDLSI", "LBDLDLM", "LBDLDMSI", "LBDLDLN", "LBDLDNSI", "LBDHDD", "LBXTC", "LBXGLU"]
working_df = merged_data.drop(columns=columns_to_remove)

# Rename the remaining columns
working_df.columns = ["Respondent", "LDL & Triglycerides", "HDL", "Total Chol", "Fasting Glucose", "Glycohemoglobin"]
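
As a minimal sketch of what the inner joins above do, the toy example below merges two small tables on `SEQN`. The values and the `A`/`B` columns are hypothetical; only `SEQN` and `WTSAFPRP` are names taken from the notebook. It illustrates why only respondents present in both tables survive, and where the `WTSAFPRP_x`/`WTSAFPRP_y` names come from.

```python
import pandas as pd

# Hypothetical toy tables standing in for two NHANES files
left = pd.DataFrame({"SEQN": [1, 2, 3], "WTSAFPRP": [10.0, 20.0, 30.0], "A": [0.5, 0.6, 0.7]})
right = pd.DataFrame({"SEQN": [2, 3, 4], "WTSAFPRP": [21.0, 31.0, 41.0], "B": [5.1, 5.2, 5.3]})

# An inner join keeps only the SEQN values found in both tables
merged = pd.merge(left, right, on="SEQN", how="inner")

# Non-key columns that share a name receive the default _x / _y suffixes
print(merged.columns.tolist())
print(merged["SEQN"].tolist())
```

Because each inner join can only shrink the set of respondents, the order of the merges does not affect which `SEQN` values remain.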

**Understanding the data** 

We can perform some initial exploratory data analysis using `.describe()` to get an idea of the cleaned DataFrame.

We can see the counts of each column as well as their mean, standard deviation, min, max and 25%, 50% (median) and 75% quartiles.
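
The counts reported by `.describe()` can differ between columns because missing values are not counted. The sketch below uses a hypothetical toy DataFrame (not NHANES data) to show this behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical values with some measurements missing
toy = pd.DataFrame({
    "HDL": [1.1, 1.3, np.nan, 1.6],
    "Total Chol": [4.2, np.nan, np.nan, 5.0],
})

summary = toy.describe()

# "count" is the number of non-missing entries in each column
print(summary.loc["count"])
```

This is one plausible reason why `Respondent` can have a higher count than the measurement columns in the table that follows.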


In [4]:
working_df.describe()

          Respondent  LDL & Triglycerides          HDL   Total Chol  \
count    5090.000000          4650.000000  4661.000000  4661.000000   
mean   117177.326523             1.170958     1.386486     4.635748   
std      4487.335243             1.014096     0.400713     1.061997   
min    109264.000000             0.113000     0.280000     1.890000   
25%    113303.500000             0.632000     1.090000     3.900000   
50%    117265.500000             0.948000     1.320000     4.530000   
75%    121046.500000             1.423000     1.580000     5.250000   
max    124822.000000            30.302000     4.840000    11.530000   

       Fasting Glucose  Glycohemoglobin  
count      4744.000000      4777.000000  
mean          6.171739         5.789324  
std           2.016121         1.103826  
min           2.610000         2.800000  
25%           5.270000         5.300000  
50%           5.660000         5.500000  
75%           6.220000         5.900000  
max          29.100000  

Comparing the *max* of each column to its *mean*, we can see that there are some obvious outliers, especially when we take the *standard deviation* into consideration.
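
One rough way to make this comparison concrete is to express each maximum as a number of standard deviations above the mean. The sketch below copies the figures from the summary table; `Glycohemoglobin` is left out because its maximum is not shown in the printed output.

```python
import pandas as pd

# Figures copied from the .describe() output above
stats = pd.DataFrame(
    {
        "mean": [1.170958, 1.386486, 4.635748, 6.171739],
        "std": [1.014096, 0.400713, 1.061997, 2.016121],
        "max": [30.302, 4.84, 11.53, 29.1],
    },
    index=["LDL & Triglycerides", "HDL", "Total Chol", "Fasting Glucose"],
)

# Distance of the maximum from the mean, in units of standard deviation
sds_above_mean = (stats["max"] - stats["mean"]) / stats["std"]
print(sds_above_mean.round(1))
```

Every maximum here lies more than six standard deviations above its mean, with `LDL & Triglycerides` the most extreme.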

**Initial visualisation of Data**

Next we can begin to visualise this data and perhaps start to see these outliers.

## Boxplots 

### Boxplots of Cholesterol

We can use boxplots to visualise the above table much more clearly using Bokeh.

In [5]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, Whisker
from bokeh.transform import factor_cmap

box_plot_df = working_df.loc[:, ["LDL & Triglycerides", "HDL", "Total Chol", "Fasting Glucose", "Glycohemoglobin"]]
chol_box_plot_df = working_df.loc[:, ["LDL & Triglycerides", "HDL", "Total Chol"]]

cats = np.array(["LDL & Triglycerides", "HDL", "Total Chol"])
 
# compute quantiles
qs = chol_box_plot_df.quantile([0.25, 0.5, 0.75])  # only the three columns plotted here
qs = qs.transpose().reset_index() # Transpose DataFrame
qs.columns = ["cats", "q1", "q2", "q3"]
qs["iqr"] = qs.loc[:, "q3"] - qs.loc[:, "q1"] # Add IQR
qs["upper"] = qs.loc[:, "q3"] + 1.5*qs.loc[:, "iqr"] # Add upper bound 
qs["lower"] = qs.loc[:, "q1"] - 1.5*qs.loc[:, "iqr"] # Add lower bound

source = ColumnDataSource(qs)

b = figure(x_range=cats, tools="", toolbar_location=None,
           title="Boxplots of LDL, HDL and Total Cholesterol",
           background_fill_color="#eaefef", y_range=(-1,12), width=400)

whisker = Whisker(base="cats", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
b.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("cats", "TolRainbow7", cats)
b.vbar("cats", 0.4, "q2", "q3", source=source, color=cmap, line_color="black")
b.vbar("cats", 0.4, "q1", "q2", source=source, color=cmap, line_color="black")

# Outliers: points beyond the whiskers (more than 1.5 * IQR outside Q1/Q3)
bounds = qs.set_index("cats")  # whisker bounds indexed by column name

for name in cats:
    values = box_plot_df[name].dropna()
    inside = values.between(bounds.loc[name, "lower"], bounds.loc[name, "upper"])
    y_outliers = values[~inside].to_numpy()
    x_outliers = [str(name)] * len(y_outliers)  # one category label per point, no hard-coded counts
    b.scatter(x_outliers, y_outliers, size=6, color="#F38630", fill_alpha=0.6)

b.xgrid.grid_line_color = None
b.axis.major_label_text_font_size="14px"
b.axis.axis_label_text_font_size="12px"

show(b)

Here we visualise the three cholesterol-related columns onto the same boxplot.

Looking at this data we can see that `LDL & Triglycerides` and `HDL` have broadly similar Q1, Q2 and Q3, whilst `LDL & Triglycerides` has a much larger spread (longer whiskers).

As `Total Cholesterol` is a measured combination of both `LDL & Triglycerides` and `HDL`, its values are expected to be higher than those of `LDL & Triglycerides` and `HDL` individually.

The outliers are expected to represent individuals with unhealthy levels of cholesterol, as they fall outside the 'normal' range.
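
The whisker bounds used in the boxplot follow the usual 1.5 × IQR rule. A minimal sketch of that rule on hypothetical readings is shown below; the helper name `tukey_fences` is an illustration and not part of the notebook.

```python
import pandas as pd

def tukey_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds lying k * IQR beyond Q1 and Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings, not NHANES data
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
lower, upper = tukey_fences(readings)

# Anything outside the bounds is drawn as an individual outlier point
outliers = readings[(readings < lower) | (readings > upper)]
print(lower, upper, outliers.tolist())
```

Note that the rule is purely statistical: a point is flagged because it is far from the bulk of the sample, not because it crosses a clinical threshold.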

### Boxplots of Glucose measures

Plasma Fasting Glucose (`Fasting Glucose`) measures blood glucose at a single point in time, while Glycohemoglobin measures long-term glucose control.

As the two are interlinked, they are expected to show a similar pattern when visualised.

In [6]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, Whisker
from bokeh.transform import factor_cmap

glucose_box_plot_df = working_df.loc[:, ["Fasting Glucose", "Glycohemoglobin"]]

cats = np.array(["Fasting Glucose", "Glycohemoglobin"])
 
# compute quantiles
qs = glucose_box_plot_df.quantile([0.25, 0.5, 0.75])  # only the two columns plotted here
qs = qs.transpose().reset_index() # Transpose DataFrame
qs.columns = ["cats", "q1", "q2", "q3"]
qs["iqr"] = qs.loc[:, "q3"] - qs.loc[:, "q1"] # Add IQR
qs["upper"] = qs.loc[:, "q3"] + 1.5*qs.loc[:, "iqr"] # Add upper bound 
qs["lower"] = qs.loc[:, "q1"] - 1.5*qs.loc[:, "iqr"] # Add lower bound

source = ColumnDataSource(qs)

p = figure(x_range=cats, tools="", toolbar_location=None,
           title="Boxplots of Fasting Glucose and Glycohemoglobin",
           background_fill_color="#eaefef", y_range=(2,25), width=400)

whisker = Whisker(base="cats", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
p.add_layout(whisker)

# quantile boxes
cmap = factor_cmap("cats", "TolRainbow7", cats)
p.vbar("cats", 0.4, "q2", "q3", source=source, color=cmap, line_color="black")
p.vbar("cats", 0.4, "q1", "q2", source=source, color=cmap, line_color="black")

# Outliers: points beyond the whiskers (more than 1.5 * IQR outside Q1/Q3)
bounds = qs.set_index("cats")  # whisker bounds indexed by column name

for name in cats:
    values = box_plot_df[name].dropna()
    inside = values.between(bounds.loc[name, "lower"], bounds.loc[name, "upper"])
    y_outliers = values[~inside].to_numpy()
    x_outliers = [str(name)] * len(y_outliers)  # one category label per point, no hard-coded counts
    p.scatter(x_outliers, y_outliers, size=6, color="#F38630", fill_alpha=0.5)

p.xgrid.grid_line_color = None
p.axis.major_label_text_font_size="14px"
p.axis.axis_label_text_font_size="12px"

show(p)

Here we can see that `Fasting Glucose` and `Glycohemoglobin` have similar medians (5.66 mmol/L and 5.5% respectively), although the two are measured in different units.

The whiskers extend 1.5 × IQR beyond the first and third quartiles and should roughly bracket the range in which a healthy individual's fasting blood glucose falls. The outliers in orange should therefore indicate individuals with elevated fasting blood glucose, outside the normal range.

These individuals are likely the diabetics who took part in the study.

According to the World Health Organisation, normal fasting blood glucose should be between 3.9 mmol/L and 5.6 mmol/L.<sup>1</sup> The whiskers here span roughly 3.8 to 7.6 mmol/L (computed from the quartiles in the summary table), so the lower end agrees closely with this range while the upper end is more permissive.

According to the Centers for Disease Control and Prevention, normal HbA1c levels should be below 5.7%, which sits between the median (5.5%) and third quartile (5.9%) seen in the box plot.<sup>2</sup>

As the outliers above the upper whiskers fall well outside these normal ranges, they would, by these definitions, likely represent the diabetics who participated in this study.
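
The quoted reference ranges can also be applied directly to the data rather than read off the plot. The sketch below does this for a few hypothetical participants; the thresholds are the WHO and CDC figures cited above, and the new column names are illustrative only.

```python
import pandas as pd

# Hypothetical participants, not NHANES data
toy = pd.DataFrame({
    "Fasting Glucose": [4.8, 5.4, 7.9, 11.2],  # mmol/L
    "Glycohemoglobin": [5.1, 5.6, 6.8, 9.4],   # %
})

# WHO range for normal fasting glucose: 3.9 to 5.6 mmol/L (inclusive here)
toy["glucose_in_normal_range"] = toy["Fasting Glucose"].between(3.9, 5.6)

# CDC threshold for a normal HbA1c: below 5.7%
toy["hba1c_below_5_7"] = toy["Glycohemoglobin"] < 5.7

print(toy)
```

Applied to `working_df`, the same two comparisons would give a count of participants outside each range that could be set against the number of boxplot outliers.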

1. https://www.who.int/data/gho/indicator-metadata-registry/imr-details/2380#:~:text=The%20expected%20values%20for%20normal,and%20monitoring%20glycemia%20are%20recommended.
2. https://www.cdc.gov/diabetes/managing/managing-blood-sugar/a1c.html#:~:text=A%20normal%20A1C%20level%20is,6.5%25%20or%20more%20indicates%20diabetes.

## Histograms and Scatterplots

We can also use histograms to visualise the distribution of the data.

Furthermore, we can use scatterplots to observe relationships between our chosen variables.
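
Bokeh has no single histogram call in the cell that follows; instead the bins are computed with `np.histogram` and drawn as rectangles. As a small sketch on hypothetical values, the example below shows why the bin edges are sliced into `edges[:-1]` and `edges[1:]`: there is always one more edge than there are bins.

```python
import numpy as np

values = np.array([0.5, 1.5, 1.7, 2.2, 3.9])  # hypothetical measurements
bin_edges = np.linspace(0, 4, 5)               # five edges: 0, 1, 2, 3, 4

hist, edges = np.histogram(values, bins=bin_edges)

# Each bar spans from one edge to the next
lefts, rights = edges[:-1], edges[1:]
for left, right, count in zip(lefts, rights, hist):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Values that fall outside the first and last edge are silently left out of the counts, so the chosen edges should cover the range of the data.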


In [7]:
from bokeh.plotting import figure, show
from bokeh.layouts import column, layout
from bokeh.models import RangeSlider, HoverTool, Div, Spinner, ColumnDataSource
import numpy as np

# ======================================  Histograms  ======================================  

# Create histogram 1 for LDL and Trigly
ldl_series = box_plot_df.loc[:, "LDL & Triglycerides"]
h1 = figure(title="LDL & Triglycerides", x_axis_label="mmol/L", y_axis_label="Number of Participants") # Create a Bokeh figure

hist, edges = np.histogram(ldl_series.dropna(), bins=np.linspace(0,14,20)) # Bin counts; start at 0 so values below 3 mmol/L are included
h1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white", fill_color="blue", alpha=0.7)

range_slider1 = RangeSlider(
    title='Adjust x-axis range',
    value=(0, ldl_series.max() + 0.2),
    start=0,
    end=ldl_series.max() + 0.2,
)
# set up RangeSlider1 for h1
range_slider1.js_link("value", h1.x_range, "start", attr_selector=0) 
range_slider1.js_link("value", h1.x_range, "end", attr_selector=1) 

# Create histogram 2 for HDL
hdl_series = box_plot_df.loc[:, "HDL"]
h2 = figure(title="HDL", x_axis_label="mmol/L", y_axis_label="Number of Participants") # Create a Bokeh figure
# Create histogram
hist, edges = np.histogram(hdl_series.dropna(), bins=np.linspace(0,10,30)) # Start at 0: most HDL values lie below 3 mmol/L
h2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white", fill_color="blue", alpha=0.7)

# set up RangeSlider2 for h2
range_slider2 = RangeSlider(
    title='Adjust x-axis range',
    value=(0, hdl_series.max() + 5),
    start=0,
    end=hdl_series.max() + 5,
)
range_slider2.js_link("value", h2.x_range, "start", attr_selector=0) 
range_slider2.js_link("value", h2.x_range, "end", attr_selector=1) 

# Create histogram 3 for Total Chol
total_chol_series = box_plot_df.loc[:, "Total Chol"]
h3 = figure(title="Total Cholesterol", x_axis_label="mmol/L", y_axis_label="Number of Participants") # Create a Bokeh figure

hist, edges = np.histogram(total_chol_series, bins=np.linspace(0,14,50)) # Create histogram
h3.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white", fill_color="blue", alpha=0.7)

# set up RangeSlider3 for h3
range_slider3 = RangeSlider(
    title='Adjust x-axis range',
    value=(0, total_chol_series.max() + 5),
    start=0,
    end=total_chol_series.max() + 5,
)
range_slider3.js_link("value", h3.x_range, "start", attr_selector=0) 
range_slider3.js_link("value", h3.x_range, "end", attr_selector=1) 

# Create histogram 4 for Fasting Glucose
glucose_series = box_plot_df.loc[:, "Fasting Glucose"]
h4 = figure(title="Fasting Glucose", x_axis_label="mmol/L", y_axis_label="Number of Participants") # Create a Bokeh figure
hist, edges = np.histogram(glucose_series, bins=np.linspace(3,25,40)) # Create histogram
h4.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white", fill_color="blue", alpha=0.7)

# set up RangeSlider4 for h4
range_slider4 = RangeSlider(
    title='Adjust x-axis range',
    value=(0, glucose_series.max() + 5),
    start=0,
    end=glucose_series.max() + 5,
)
range_slider4.js_link("value", h4.x_range, "start", attr_selector=0) 
range_slider4.js_link("value", h4.x_range, "end", attr_selector=1) 

# Create histogram 5 for Glycohemoglobin
glycohemoglobin_series = box_plot_df.loc[:, "Glycohemoglobin"]
h5 = figure(title="Glycohemoglobin", x_axis_label="%", y_axis_label="Number of Participants") # Create a Bokeh figure
hist, edges = np.histogram(glycohemoglobin_series, bins=np.linspace(3,25,40)) # Create histogram
h5.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="white", fill_color="blue", alpha=0.7)

# set up RangeSlider5 for h5 (based on the Glycohemoglobin range, not Fasting Glucose)
range_slider5 = RangeSlider(
    title='Adjust x-axis range',
    value=(0, glycohemoglobin_series.max() + 5),
    start=0,
    end=glycohemoglobin_series.max() + 5,
)
range_slider5.js_link("value", h5.x_range, "start", attr_selector=0) 
range_slider5.js_link("value", h5.x_range, "end", attr_selector=1) 


# ======================================  Scatterplots  ======================================  
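# Setting up for scatterplots: standardise each measurement (z-scores) for better comparison
def standardise(x):
    return (x - x.mean()) / x.std(ddof=1)  # pandas skips missing values by default

# Exclude the respondent ID, which is an identifier rather than a measurement
standardised_df = working_df.drop(columns="Respondent").apply(standardise)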

source = ColumnDataSource(standardised_df)

# Create figure 1
p = figure(title='LDL and HDL vs Total Cholesterol',
           x_axis_label='LDL and HDL',
           y_axis_label='Total Cholesterol')

# LDL & Triglycerides (x) against Total Chol (y), in blue
points = p.scatter(x='LDL & Triglycerides', y='Total Chol', source=source, size=8, color='blue', legend_label='LDL & Triglycerides', alpha=0.3)

# HDL (x) against Total Chol (y), in red
points2 = p.scatter(x='HDL', y='Total Chol', source=source, size=8, color='red', legend_label='HDL', alpha=0.3)

# Total Chol (x) against HDL (y), in yellow: the red series with its axes swapped
points3 = p.scatter(x='Total Chol', y='HDL', source=source, size=8, color='yellow', legend_label='Total Cholesterol', alpha=0.3)

# Set the plot aesthetics
p.title.text_font_size = '16pt'
p.xaxis.axis_label_text_font_size = '14pt'
p.yaxis.axis_label_text_font_size = '14pt'
p.xaxis.major_label_text_font_size = '12pt'
p.yaxis.major_label_text_font_size = '12pt'

# Customize the plot (add legend, grid, etc.)
p.legend.location = "top_right"
p.legend.click_policy = "hide"  # Click on legend items to hide/show

# set up textarea (div)
div1 = Div(
    text="""
          <p>Select the circle's size using this control element:</p>
          """,
    width=200,
    height=30,
)

# Create figure2
p2 = figure(title='Scatter Plot - Glycohemoglobin vs Fasting Glucose',
           x_axis_label='Fasting Glucose',
           y_axis_label='Glycohemoglobin')

# Fasting Glucose (x) against Glycohemoglobin (y), in blue
points4 = p2.scatter(x='Fasting Glucose', y='Glycohemoglobin', source=source, size=8, color='blue', legend_label='Fasting Glucose', alpha=0.3)

# Glycohemoglobin (x) against Fasting Glucose (y), in red: the same pairs with the axes swapped
points5 = p2.scatter(x='Glycohemoglobin', y='Fasting Glucose', source=source, size=8, color='red', legend_label='Glycohemoglobin', alpha=0.3)

# Set the plot aesthetics
p2.title.text_font_size = '16pt'
p2.xaxis.axis_label_text_font_size = '14pt'
p2.yaxis.axis_label_text_font_size = '14pt'
p2.xaxis.major_label_text_font_size = '12pt'
p2.yaxis.major_label_text_font_size = '12pt'

# Customize the plot (add legend, grid, etc.)
p2.legend.location = "top_left"
p2.legend.click_policy = "hide"  # Click on legend items to hide/show

hover2 = HoverTool()
hover2.tooltips = [("Fasting Glucose", "@{Fasting Glucose}"), ("Glycohemoglobin", "@{Glycohemoglobin}")]
p2.add_tools(hover2)

# set up textarea (div)
div2 = Div(
    text="""
          <p>Select the circle's size using this control element:</p>
          """,
    width=200,
    height=30,
)

# set up spinner
spinner1 = Spinner(
    title="LDL & Triglyceride size",
    low=0,
    high=20,
    step=1,
    value=points.glyph.size,
    width=200,
)
spinner1.js_link("value", points.glyph, "size")

# set up spinner2
spinner2 = Spinner(
    title="HDL size",
    low=0,
    high=20,
    step=1,
    value=points2.glyph.size,
    width=200,
)
spinner2.js_link("value", points2.glyph, "size")

# set up spinner3
spinner3 = Spinner(
    title="Total Cholesterol Size",
    low=0,
    high=20,
    step=1,
    value=points3.glyph.size,
    width=200,
)
spinner3.js_link("value", points3.glyph, "size")

# set up spinner4
spinner4 = Spinner(
    title="Fasting Glucose size",
    low=0,
    high=20,
    step=1,
    value=points4.glyph.size,
    width=200,
)
spinner4.js_link("value", points4.glyph, "size")

# set up spinner5
spinner5 = Spinner(
    title="Glycohemoglobin size",
    low=0,
    high=20,
    step=1,
    value=points5.glyph.size,
    width=200,
)
spinner5.js_link("value", points5.glyph, "size")

# ======================================  Setting up for layouts  ======================================  
# Named `dashboard` so that the imported `layout` function is not overwritten
dashboard = layout([
    [[range_slider1],[h1]],
    [[range_slider2],[h2]],
    [[range_slider3],[h3]],
    [[range_slider4],[h4]],
    [[range_slider5],[h5]],
    [[div1], [spinner1], [spinner2], [spinner3]],
    [p],
    [[div2], [spinner4], [spinner5]],
    [p2],
                ])

# Show the plot
show(dashboard)


**Histogram observation**

Looking at the histograms we can see that `LDL & Triglycerides` and `HDL` both have a strong positive skew.

This is expected, as healthy individuals should have lower levels of these cholesterols and only individuals with hypercholesterolaemia would have elevated levels.

Even when we change the x-axis range to zoom in on the histograms, we don't observe any bars below a certain point: the summary table gives minimum values of 0.113 mmol/L, 0.28 mmol/L and 1.89 mmol/L for `LDL & Triglycerides`, `HDL` and `Total Chol` respectively.

We can't have zero cholesterol, so this suggests that humans require at least some baseline level of cholesterol to function.

Interestingly, `Total Cholesterol` appears to be approximately normally distributed, with perhaps a very slight positive skew.

Taking this into consideration, most individuals in this sample have a total cholesterol of around 4 to 4.5 mmol/L (the median is 4.53 mmol/L). Some individuals have less, but nobody in the sample falls below about 2 mmol/L (the minimum is 1.89 mmol/L).

Looking at the histograms of `Fasting Glucose` and `Glycohemoglobin` we also observe a strong positive skew.

Again, this makes sense, as healthy individuals will have lower levels of blood glucose and only unhealthy/diabetic individuals will have elevated blood glucose.
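
Skew can also be checked numerically rather than by eye. As a minimal sketch on hypothetical values, pandas' `Series.skew()` returns a positive number for a right-tailed sample and approximately zero for a symmetric one; in a positively skewed sample the mean also tends to sit above the median.

```python
import pandas as pd

# Hypothetical samples, not NHANES data
right_tailed = pd.Series([1.0, 1.1, 1.2, 1.3, 1.5, 6.0])
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

print(right_tailed.skew())   # positive: long tail towards high values
print(symmetric.skew())      # approximately zero

# The single large value pulls the mean above the median
print(right_tailed.mean(), right_tailed.median())
```

Calling `working_df.skew()` would give the same statistic for every column at once.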

**Scatterplot observation**

We also standardise the values in the dataframe to allow for proper comparison.
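
Standardising here means converting each column to z-scores, $z = (x - \bar{x}) / s$, where $s$ is the sample standard deviation. The sketch below applies this to a hypothetical toy DataFrame and checks that each standardised column has mean 0 and standard deviation 1, with missing values left as missing.

```python
import numpy as np
import pandas as pd

def standardise(x: pd.Series) -> pd.Series:
    # Sample standard deviation (ddof=1); pandas ignores NaN in both statistics
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical values, not NHANES data
toy = pd.DataFrame({
    "Fasting Glucose": [5.0, 5.5, 6.0, np.nan, 9.5],
    "Glycohemoglobin": [5.2, 5.4, 5.7, 6.0, 8.1],
})

z = toy.apply(standardise)
print(z.mean().round(12))   # each column is centred on 0
print(z.std(ddof=1))        # each column has unit spread
```

After this transformation the columns share a common unitless scale, which is what allows mmol/L and % measurements to be drawn on the same axes.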

*Cholesterol Scatterplot*

When we plot LDL and HDL against total cholesterol, we do not observe a strong linear relationship between either of them and total cholesterol.

This may seem surprising, as `Total Cholesterol` comprises both `LDL & Triglycerides` and `HDL`.

However, as `LDL & Triglycerides` is generally considered the 'bad' cholesterol and `HDL` the 'good' cholesterol, it would be expected that individuals with elevated `LDL & Triglycerides` have lower levels of `HDL`, and vice versa.

The combined values would therefore tend to even out.

*Blood Glucose Scatterplot*

Looking at the scatterplot of `Glycohemoglobin` vs `Fasting Glucose` we can see a linear relationship with a cluster of points towards the lower values.

The cluster of points towards lower values indicates the majority of healthy individuals that have lower measured blood glucose versus the hyperglycaemic individuals with higher measured blood glucose.

The linear correlation is expected as both variables are measures of blood glucose: `Fasting Glucose` measures an individual's blood glucose at a single point in time, while `Glycohemoglobin` measures long-term blood glucose control. Individuals with poor blood glucose control in the short term would be expected to have poor control in the long term as well, hence the linear correlation.
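
The strength of a linear relationship like this one can be summarised with the Pearson correlation coefficient. The sketch below computes it on hypothetical paired values with `Series.corr`, which ignores pairs containing a missing value, and also checks that standardising the columns leaves the coefficient unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements, not NHANES data
toy = pd.DataFrame({
    "Fasting Glucose": [5.0, 5.5, 6.0, 7.5, 10.0, np.nan],
    "Glycohemoglobin": [5.2, 5.4, 5.8, 6.9, 8.8, 5.5],
})

# Pearson correlation; the row with a missing glucose value is skipped
r = toy["Fasting Glucose"].corr(toy["Glycohemoglobin"])

# Standardising rescales the axes but does not change the correlation
z = (toy - toy.mean()) / toy.std(ddof=1)
r_z = z["Fasting Glucose"].corr(z["Glycohemoglobin"])

print(round(r, 3), round(r_z, 3))
```

On the real data, `working_df[["Fasting Glucose", "Glycohemoglobin"]].corr()` would give the corresponding figure for the scatterplot above.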




## Conclusion

As long as multiple datasets share a common column/key, we can merge them into a larger dataset, allowing more extensive analysis across a wider variety of data.

From the data analysed in this report, we can see that the measured values tend to agree with reported healthy and unhealthy levels of cholesterol and blood glucose.

We can also observe that fasting glucose and glycohemoglobin are linearly correlated, suggesting that poor short-term blood glucose control tends to go hand in hand with poor long-term blood glucose control.

