# **Data Exploration**

In this script, features within the initial CVD dataset are visualised to help gain a further understanding of the dataset being worked with.

In [1]:
import os

# Find the latest version of Spark 3.x from http://www.apache.org/dist/spark/ and enter as the spark version
# For example:
# spark_version = 'spark-3.4.1'
spark_version = 'spark-3.4.1'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop3"

# Initialise findspark so that pyspark can be imported (the SparkSession itself is created in a later cell)
import findspark
findspark.init()

0% [Working]            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 338 kB in 4s (87.8 kB/s)
Reading package lists... Done


In [2]:
# Import dependencies
import requests
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import pyspark.sql.functions as F

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType


# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

In [3]:
# Retrieve the CSV data from Google Sheets
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSDchXr1EhgCSsxlxJ3lWPhh1kT5EJS3yv4DJ2YLeMIC3y4uq-Pp4EQknrs9zAiaI3ulne2Jyi6gR6G/pub?gid=602879552&single=true&output=csv"
response = requests.get(url)
response.raise_for_status()  # Fail early if the download did not succeed

# Write the CSV data to a local file
with open("cvd.csv", "wb") as f:
    f.write(response.content)

# Read the local CSV file using Spark
cvd_df = spark.read.csv("cvd.csv", header=True, sep=",", inferSchema=True)

# Show DataFrame
cvd_df.show()

+--------------+--------------------+--------+-------------+-----------+------------+----------+--------+---------+------+------------+---------+---------+-----+---------------+-------------------+-----------------+----------------------------+-----------------------+
|general_health|             checkup|exercise|heart_disease|skin_cancer|other_cancer|depression|diabetes|arthritis|   sex|age_category|height_cm|weight_kg|  bmi|smoking_history|alcohol_consumption|fruit_consumption|green_vegetables_consumption|friedpotato_consumption|
+--------------+--------------------+--------+-------------+-----------+------------+----------+--------+---------+------+------------+---------+---------+-----+---------------+-------------------+-----------------+----------------------------+-----------------------+
|          Poor|Within the past 2...|      No|           No|         No|          No|        No|      No|      Yes|Female|       70-74|      150|    32.66|14.54|            Yes|                

In [4]:
# Summarise the DataFrame (describe() returns every column as a string; printSchema() shows the actual data types)
cvd_df.describe()

DataFrame[summary: string, general_health: string, checkup: string, exercise: string, heart_disease: string, skin_cancer: string, other_cancer: string, depression: string, diabetes: string, arthritis: string, sex: string, age_category: string, height_cm: string, weight_kg: string, bmi: string, smoking_history: string, alcohol_consumption: string, fruit_consumption: string, green_vegetables_consumption: string, friedpotato_consumption: string]

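Any numeric columns are explicitly cast to their respective data types. The all-string output above comes from `describe()`, which returns its summary statistics as strings, so it does not on its own show how the CSV values were read in.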

In [5]:
# Convert data types
cvd_df = cvd_df.withColumn("height_cm", col("height_cm").cast(IntegerType()))
cvd_df = cvd_df.withColumn("weight_kg", col("weight_kg").cast(FloatType()))
cvd_df = cvd_df.withColumn("bmi", col("bmi").cast(FloatType()))
cvd_df = cvd_df.withColumn("alcohol_consumption", col("alcohol_consumption").cast(IntegerType()))
cvd_df = cvd_df.withColumn("fruit_consumption", col("fruit_consumption").cast(IntegerType()))
cvd_df = cvd_df.withColumn("green_vegetables_consumption", col("green_vegetables_consumption").cast(IntegerType()))
cvd_df = cvd_df.withColumn("friedpotato_consumption", col("friedpotato_consumption").cast(IntegerType()))

In [6]:
# Create temporary view
cvd_df.createOrReplaceTempView('cvd')

### **Heart Disease vs Age**

In [7]:
# Spark SQL query to calculate Heart Disease Count by Age
query = """
SELECT
  age_category,
  COUNT(*) AS total_count,
  SUM(
    CASE WHEN heart_disease = 'Yes' THEN 1 ELSE 0 END
    ) AS heart_disease_count
FROM
  cvd
GROUP BY
  age_category
ORDER BY
  age_category
"""

# Execute the SQL query and store the result in a Spark DataFrame
age_df = spark.sql(query)

# Convert the Spark DataFrame to a Pandas DataFrame for plotting
age_pandas_df = age_df.toPandas()

# Add a calculated column which computes the percentage of the age group with heart disease
age_pandas_df['heart_disease_percentage'] = (age_pandas_df['heart_disease_count'] / age_pandas_df['total_count']) * 100

# Display the dataframe
age_pandas_df

|  | age_category | total_count | heart_disease_count | heart_disease_percentage |
|---|---|---|---|---|
| 0 | 18-24 | 18474 | 93 | 0.50341 |
| 1 | 25-29 | 15196 | 113 | 0.743617 |
| 2 | 30-34 | 17963 | 193 | 1.074431 |
| 3 | 35-39 | 19913 | 258 | 1.295636 |
| 4 | 40-44 | 20857 | 412 | 1.975356 |
| 5 | 45-49 | 20295 | 651 | 3.207687 |
| 6 | 50-54 | 24259 | 1118 | 4.608599 |
| 7 | 55-59 | 27134 | 1913 | 7.050195 |
| 8 | 60-64 | 31268 | 2893 | 9.252271 |
| 9 | 65-69 | 32321 | 3691 | 11.41982 |
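
The percentage column is simply the case count divided by the group size, multiplied by 100. As a quick sanity check, the sketch below recomputes it in plain Python for three rows copied from the output above. The `prevalence_pct` helper is a hypothetical name introduced for illustration and is not part of the notebook.

```python
def prevalence_pct(cases: int, total: int) -> float:
    """Return the percentage of a group that has the condition."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return cases / total * 100


# (age_category, total_count, heart_disease_count) copied from the output above
rows = [
    ("18-24", 18474, 93),
    ("40-44", 20857, 412),
    ("65-69", 32321, 3691),
]

# Recompute the percentage for each row, rounded for display
recomputed = {
    age: round(prevalence_pct(cases, total), 4) for age, total, cases in rows
}

for age, pct in recomputed.items():
    print(f"{age}: {pct}%")
```

The recomputed values should agree with the `heart_disease_percentage` column to within rounding.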


In [8]:
# Create an interactive bar chart to visualise the total count of each age group in the dataset using Plotly
fig = px.bar(
             age_pandas_df,
             x='age_category',
             y='total_count',
             labels={'age_category': 'Age Category', 'total_count': 'Total Count'},
             title='<b>Total Count by Age Category</b>',
             color='total_count',
             color_continuous_scale='Reds',
             range_color=[min(age_pandas_df['total_count']), max(age_pandas_df['total_count'])],
             template='simple_white'
            )

# Customise bar chart appearance
fig.update_layout(
    title_font=dict(color='black', size=28),
    coloraxis_colorbar=dict(title_font=dict(color='black'))
)

# Show the bar chart
fig.show()

There are more 65-69 year olds in our dataset than in any other age group, which points to an aging population. The remaining age groups are still reasonably well represented, but it should be noted that the age distribution is slightly left-skewed.

In [9]:
# Create an interactive bar chart to visualise the heart disease count of each age group in the dataset using Plotly
fig = px.bar(
             age_pandas_df,
             x='age_category',
             y='heart_disease_count',
             labels={'age_category': 'Age Category', 'heart_disease_count': 'Heart Disease Count'},
             title='<b>Heart Disease Count by Age Category</b>',
             color='heart_disease_count',
             color_continuous_scale='Reds',
             range_color=[min(age_pandas_df['heart_disease_count']), max(age_pandas_df['heart_disease_count'])],
             template='simple_white'
            )

# Customise bar chart appearance
fig.update_layout(
                  title_font=dict(color='black', size=28),
                  coloraxis_colorbar=dict(title_font=dict(color='black'))
                 )

# Show the bar chart
fig.show()

Although there were more 65-69 year olds in the dataset, those aged 80+ were the most likely to have heart disease. There is also a drop in the number of 75-79 year olds with heart disease in comparison with neighbouring age categories, but this could be down to the number of 75-79 year olds in our data. The heart disease prevalence in each age group would therefore help provide further insight.

In [10]:
# Create an interactive line chart to visualise Heart Disease Prevalence by Age using Plotly
fig = px.line(
              age_pandas_df,
              x='age_category',
              y='heart_disease_percentage',
              labels={'age_category': 'Age Category', 'heart_disease_percentage': 'Heart Disease Prevalence (%)'},
              title='<b>Heart Disease Prevalence by Age</b>',
              markers=True,
              template='simple_white',
              )

# Customise the line chart appearance
fig.update_traces(line=dict(color='#8D021F'))
fig.update_layout(title_font=dict(size=28), xaxis_title_font=dict(size=16), yaxis_title_font=dict(size=16), xaxis_tickfont=dict(size=12), yaxis_tickfont=dict(size=12))

# Show the line chart
fig.show()

The proportion of each age group with heart disease increases with age, and the increase steepens noticeably after the age of 40.
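
One way to see how the prevalence changes with age is to look at the percentage-point change between neighbouring age groups. The sketch below does this in plain Python using the prevalence values from the `age_pandas_df` output shown earlier. Only the ten rows visible in that output are used, and the variable names are illustrative.

```python
# Prevalence (%) per age group, copied from the age_pandas_df output
prevalence = {
    "18-24": 0.50341,
    "25-29": 0.743617,
    "30-34": 1.074431,
    "35-39": 1.295636,
    "40-44": 1.975356,
    "45-49": 3.207687,
    "50-54": 4.608599,
    "55-59": 7.050195,
    "60-64": 9.252271,
    "65-69": 11.41982,
}

groups = list(prevalence)

# Percentage-point change from each age group to the next
steps = {
    f"{prev} -> {curr}": round(prevalence[curr] - prevalence[prev], 3)
    for prev, curr in zip(groups, groups[1:])
}

for label, change in steps.items():
    print(f"{label}: +{change} points")
```

Each step up to 35-39 is at most about a third of a percentage point, whereas every step from 40-44 onwards is at least about two thirds of a point.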

### **Heart Disease vs Gender**

In [11]:
# Spark SQL query to calculate heart disease breakdown by gender
query = """
SELECT
  sex,
  COUNT(*) AS total_count,
  SUM(
    CASE WHEN heart_disease = 'Yes' THEN 1 ELSE 0 END
    ) AS heart_disease_count
FROM
  cvd
GROUP BY
  sex
ORDER BY
  sex
"""

# Execute the SQL query and store the result in a Spark DataFrame
sex_df = spark.sql(query)

# Convert the Spark DataFrame to a Pandas DataFrame for plotting
sex_pandas_df = sex_df.toPandas()

# Add a calculated column which computes the percentage of the sex group with heart disease
sex_pandas_df['heart_disease_percentage'] = (sex_pandas_df['heart_disease_count'] / sex_pandas_df['total_count']) * 100

# Display the dataframe
sex_pandas_df

|  | sex | total_count | heart_disease_count | heart_disease_percentage |
|---|---|---|---|---|
| 0 | Female | 153867 | 9492 | 6.168964 |
| 1 | Male | 145445 | 14589 | 10.030596 |


In [12]:
# Create an interactive bar chart to visualise the total count of each sex in the dataset using Plotly
fig = px.bar(
            sex_pandas_df,
            x='sex',
            y='total_count',
            labels={'sex': 'Sex', 'total_count': 'Total Count'},
            title='<b>Total Count by Sex</b>',
            color='sex',
            color_discrete_sequence=['#CD5C5C', '#8D021F'],
            template='simple_white'
            )

# Customise chart appearance
fig.update_layout(
    title_font=dict(color='black', size=28),
)

# Show the bar chart
fig.show()

There were marginally more females in our dataset than males.

In [13]:
# Create an interactive pie chart to visualise the share of heart disease cases by sex in the dataset using Plotly
fig = px.pie(
             sex_pandas_df,
             names='sex',
             values='heart_disease_count',
             title='<b>Heart Disease Breakdown by Gender</b>',
             color_discrete_sequence=['#8D021F', '#CD5C5C'],
             template='simple_white'
            )

# Customise the chart appearance
fig.update_traces(textinfo='percent+label', pull=[0.1, 0], hole=0.3)
fig.update_layout(title_font=dict(size=28))

# Show the pie chart
fig.show()

Of all individuals with heart disease in our dataset, roughly 61% are male, which is about 1.5 times the number of females.

In [14]:
# Create an interactive bar chart to visualise Heart Disease Prevalence by Sex using Plotly
fig = px.bar(
            sex_pandas_df,
            x='sex',
            y='heart_disease_percentage',
            labels={'sex': 'Sex', 'heart_disease_percentage': 'Heart Disease Prevalence (%)'},
            title='<b>Heart Disease Prevalence by Sex</b>',
            color='sex',
            color_discrete_sequence=['#CD5C5C', '#8D021F'],
            template='simple_white',
            )

# Customise the appearance of the bar chart
fig.update_layout(title_font=dict(size=28), xaxis_title_font=dict(size=16), yaxis_title_font=dict(size=16), xaxis_tickfont=dict(size=12), yaxis_tickfont=dict(size=12))

# Show the bar chart
fig.show()


A higher proportion of the male group had heart disease (about 10.0% compared with 6.2% of females). This is in line with the prior charts because the two sex groups are of a similar size.
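
The pie chart compares raw case counts, while the bar chart compares prevalence within each sex. A little arithmetic keeps the two apart. The sketch below recomputes both from the `sex_pandas_df` output, and the variable names are illustrative only.

```python
# Counts copied from the sex_pandas_df output
totals = {"Female": 153867, "Male": 145445}
cases = {"Female": 9492, "Male": 14589}

# Share of all heart disease cases that are male (what the pie chart shows)
male_share = cases["Male"] / sum(cases.values()) * 100

# Ratio of male to female case counts
count_ratio = cases["Male"] / cases["Female"]

# Ratio of male to female prevalence (this accounts for the group sizes)
prevalence = {sex: cases[sex] / totals[sex] * 100 for sex in totals}
prevalence_ratio = prevalence["Male"] / prevalence["Female"]

print(f"Male share of cases: {male_share:.1f}%")
print(f"Male-to-female case ratio: {count_ratio:.2f}")
print(f"Male-to-female prevalence ratio: {prevalence_ratio:.2f}")
```

Because the two groups are of a similar size, the two ratios come out close to each other (roughly 1.5 and 1.6).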

### **Heart Disease vs General Health over Lifetime**

In [15]:
# Spark SQL query to calculate Heart Disease Count by General Health
query = """
SELECT
  general_health,
  COUNT(*) AS total_count,
  SUM(
    CASE WHEN heart_disease = 'Yes' THEN 1 ELSE 0 END
    ) AS heart_disease_count
FROM
  cvd
GROUP BY
  general_health
ORDER BY
  general_health
"""

# Execute the SQL query and store the result in a Spark DataFrame
health_df = spark.sql(query)

# Convert the Spark DataFrame to a Pandas DataFrame for plotting
health_pandas_df = health_df.toPandas()

# Add a calculated column which computes the percentage of the general health group with heart disease
health_pandas_df['heart_disease_percentage'] = (health_pandas_df['heart_disease_count'] / health_pandas_df['total_count']) * 100

# Display the dataframe
health_pandas_df

|  | general_health | total_count | heart_disease_count | heart_disease_percentage |
|---|---|---|---|---|
| 0 | Excellent | 55002 | 1084 | 1.970837 |
| 1 | Fair | 34228 | 6532 | 19.083791 |
| 2 | Good | 91775 | 8338 | 9.085263 |
| 3 | Poor | 10858 | 3451 | 31.783017 |
| 4 | Very Good | 107449 | 4676 | 4.351832 |


In [16]:
# Create an interactive bar chart to visualise the total count of each general health group in the dataset using Plotly
fig = px.bar(
             health_pandas_df,
             x='general_health',
             y='total_count',
             labels={'general_health': 'General Health', 'total_count': 'Total Count'},
             title='<b>Total Count by General Health over Lifetime</b>',
             color='total_count',
             color_continuous_scale='Reds',
             range_color=[min(health_pandas_df['total_count']), max(health_pandas_df['total_count'])],
             template='simple_white',
             category_orders={"general_health": ["Poor", "Fair", "Good", "Very Good", "Excellent"]}
            )

# Customise bar chart appearance
fig.update_layout(
    title_font=dict(color='black', size=28),
    coloraxis_colorbar=dict(title_font=dict(color='black'))
)

# Show the bar chart
fig.show()

The most common response in our dataset was very good health over the individual's lifetime, whilst very few reported poor overall health.

In [17]:
# Create an interactive bar chart to visualise the heart disease count of each general health group in the dataset using Plotly
fig = px.bar(
             health_pandas_df,
             x='general_health',
             y='heart_disease_count',
             labels={'general_health': 'General Health', 'heart_disease_count': 'Heart Disease Count'},
             title='<b>Heart Disease Count by General Health</b>',
             color='heart_disease_count',
             color_continuous_scale='Reds',
             range_color=[min(health_pandas_df['heart_disease_count']), max(health_pandas_df['heart_disease_count'])],
             category_orders={"general_health": ["Poor", "Fair", "Good", "Very Good", "Excellent"]},
             template='simple_white'
            )

# Customise bar chart appearance
fig.update_layout(
                  title_font=dict(color='black', size=28),
                  coloraxis_colorbar=dict(title_font=dict(color='black'))
                 )

# Show the bar chart
fig.show()

Most individuals in our dataset who had heart disease reported good or fair health over the course of their lifetime. The lower number of cases among individuals with poor health is likely due to the small size of that group, which is evident in the prior general health breakdown.
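
The effect of group size on the raw counts can be made concrete by comparing each group's share of the whole sample with its share of the heart disease cases. The following is a minimal sketch in plain Python using the counts from the `health_pandas_df` output, with illustrative variable names.

```python
# Counts copied from the health_pandas_df output
totals = {"Excellent": 55002, "Fair": 34228, "Good": 91775, "Poor": 10858, "Very Good": 107449}
cases = {"Excellent": 1084, "Fair": 6532, "Good": 8338, "Poor": 3451, "Very Good": 4676}

all_people = sum(totals.values())
all_cases = sum(cases.values())

# Share of the whole sample, and of all heart disease cases, in each group
for level in totals:
    sample_share = totals[level] / all_people * 100
    case_share = cases[level] / all_cases * 100
    print(f"{level}: {sample_share:.1f}% of sample, {case_share:.1f}% of cases")

# Headline figures referred to in the text
good_or_fair_share = (cases["Good"] + cases["Fair"]) / all_cases * 100
poor_sample_share = totals["Poor"] / all_people * 100
poor_case_share = cases["Poor"] / all_cases * 100
```

The Poor group makes up under 4% of the sample yet contributes roughly 14% of the cases, which is why the counts alone understate its risk.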

In [18]:
# Create an interactive bar chart to visualise Heart Disease Prevalence by General Health using Plotly
fig = px.bar(
            health_pandas_df,
            x='general_health',
            y='heart_disease_percentage',
            labels={'general_health': 'General Health', 'heart_disease_percentage': 'Heart Disease Prevalence (%)'},
            title='<b>Heart Disease Prevalence by General Health over Lifetime</b>',
            color='general_health',
            color_discrete_sequence=['#CD5C5C', '#8D021F'],
            category_orders={"general_health": ["Poor", "Fair", "Good", "Very Good", "Excellent"]},
            template='simple_white',
            )

# Customise the appearance of the bar chart
fig.update_layout(title_font=dict(size=28), xaxis_title_font=dict(size=16), yaxis_title_font=dict(size=16), xaxis_tickfont=dict(size=12), yaxis_tickfont=dict(size=12))

# Show the bar chart
fig.show()

The prevalence chart shows that the poorer an individual's health has been over their lifetime, the more likely they are to have heart disease.
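
This ordering can be checked directly against the `health_pandas_df` output. The following sketch, with illustrative variable names, arranges the prevalence values in the same Poor-to-Excellent order used for the charts and tests whether prevalence falls at every step.

```python
# Prevalence (%) per general health group, copied from the health_pandas_df output
prevalence = {
    "Excellent": 1.970837,
    "Fair": 19.083791,
    "Good": 9.085263,
    "Poor": 31.783017,
    "Very Good": 4.351832,
}

# Same ordering as the category_orders argument used in the charts
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
ordered = [prevalence[level] for level in health_order]

# True when each group has a lower prevalence than the one before it
strictly_decreasing = all(a > b for a, b in zip(ordered, ordered[1:]))
print(strictly_decreasing)
```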