## Health Data Analysis

Data analysis helps businesses and organizations turn raw data into informed decisions. The main types are:

1\. Descriptive Analysis
------------------------

*   **What it does:** Summarizes historical data to show what has happened. Uses techniques like mean, median, mode, standard deviation, and range.

2\. Diagnostic Analysis
-----------------------

*   **What it does:** Explores why something happened by drilling down into data to identify patterns, correlations, and anomalies.

3\. Predictive Analysis
-----------------------

*   **What it does:** Uses historical data and statistical techniques to predict future outcomes.

4\. Prescriptive Analysis
-------------------------

*   **What it does:** Recommends actions to take by using optimization techniques to identify the best course of action.

5\. Exploratory Analysis
------------------------

*   **What it does:** Discovers patterns and relationships in data, often used in early stages to gain understanding and generate hypotheses.

6\. Inferential Analysis
------------------------

*   **What it does:** Uses statistical methods to draw conclusions about a population based on a sample of data.

7\. Causal Analysis
-------------------

*   **What it does:** Identifies cause-and-effect relationships between variables.

8\. Mechanistic Analysis
------------------------

*   **What it does:** Focuses on understanding the underlying mechanisms that drive a phenomenon.

### Setup Environment

In [2]:
%run initializespark.ipynb

24/12/09 13:21:29 WARN Utils: Your hostname, Nihars-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.51.226 instead (on interface en0)
24/12/09 13:21:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/niharmalali/.ivy2/cache
The jars for the packages stored in: /Users/niharmalali/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2099af43-eaf9-4892-b387-a91b3df7a14e;1.0
	confs: [default]


:: loading settings :: url = jar:file:/Volumes/D/WORKSPACE/PYTHON/notebooktest/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found io.delta#delta-spark_2.12;3.2.1 in central
	found io.delta#delta-storage;3.2.1 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 97ms :: artifacts dl 4ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.2.1 from central in [default]
	io.delta#delta-storage;3.2.1 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-2099af43-eaf9-4892-b387-a91b3df7a14e
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/3ms)

+-------------+
|      Message|
+-------------+
|Testing Spark|
+-------------+



### Loading Data

In [3]:
# Load the data into a DataFrame. Parquet files carry their own schema,
# so the CSV-only options `header` and `inferSchema` are not needed.
health_df = spark.read.parquet("testdata/health_data_dev.parquet")

# Show the first 20 rows (the default for show())
health_df.show()

+---------+---+------+------+-------------+
|PatientID|Age|Height|Weight|BloodPressure|
+---------+---+------+------+-------------+
|        1| 64|   161|    95|          161|
|        2| 67|   154|    61|          115|
|        3| 73|   156|    66|          171|
|        4| 20|   154|    74|          120|
|        5| 23|   197|    79|          116|
|        6| 79|   153|    71|          128|
|        7| 23|   162|    96|          105|
|        8| 59|   186|    75|          147|
|        9| 29|   190|    66|          115|
|       10| 39|   164|    69|          110|
|       11| 41|   165|    83|          109|
|       12| 70|   170|    90|          113|
|       13| 56|   185|    82|           98|
|       14| 43|   173|    86|           97|
|       15| 26|   165|    56|          173|
|       16| 44|   163|    71|          164|
|       17| 44|   171|    81|           82|
|       18| 32|   198|    63|          149|
|       19| 78|   199|    57|           92|
|       20| 21|   155|    74|   

1\. Descriptive Analysis
------------------------

*   **What it does:** Summarizes historical data to show what has happened. Uses techniques like mean, median, mode, standard deviation, and range.

In [4]:
# Summary statistics (count, mean, stddev, min, max) for every numeric column
health_df.describe().show()

+-------+------------------+------------------+-----------------+------------------+----------------+
|summary|         PatientID|               Age|           Height|            Weight|   BloodPressure|
+-------+------------------+------------------+-----------------+------------------+----------------+
|  count|               100|               100|              100|               100|             100|
|   mean|              50.5|             48.62|           173.29|             74.91|          131.52|
| stddev|29.011491975882016|18.117896727219243|15.59804118756227|14.045449315157079|29.4952418693133|
|    min|                 1|                20|              150|                50|              80|
|    max|               100|                79|              199|                97|             179|
+-------+------------------+------------------+-----------------+------------------+----------------+
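
The description above lists median, mode, and range, which `describe()` does not report. As a minimal sketch, the snippet below computes them in plain Python (not Spark) for the `Age` values of the 19 complete rows visible in the `show()` output earlier. The variable names are illustrative, and the results describe only that visible sample, not the full 100-row dataset.

```python
from statistics import median, multimode

# Age values copied from the 19 complete rows shown by health_df.show()
ages = [64, 67, 73, 20, 23, 79, 23, 59, 29, 39,
        41, 70, 56, 43, 26, 44, 44, 32, 78]

median_age = median(ages)          # middle value of the sorted list
modes_age = multimode(ages)        # every value tied for most frequent
range_age = max(ages) - min(ages)  # spread between largest and smallest

print(f"median={median_age}, modes={modes_age}, range={range_age}")
```

On the full DataFrame, an approximate median could likely be obtained with `percentile_approx` from `pyspark.sql.functions` (available in Spark 3.1 and later), and the range follows directly from the `min` and `max` rows of the `describe()` output.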



2\. Diagnostic Analysis
-----------------------

*   **What it does:** Explores why something happened by drilling down into data to identify patterns, correlations, and anomalies.

In [9]:
from pyspark.sql.functions import col, corr, mean, stddev

# Pearson correlation between 'Age' and 'BloodPressure'
age_bp_corr = health_df.select(corr('Age', 'BloodPressure')).first()[0]
print(f"Correlation between age and blood pressure: {age_bp_corr}")

# Compute the mean and sample standard deviation of blood pressure in one pass
bp_stats = health_df.agg(
    mean('BloodPressure').alias('mean_bp'),
    stddev('BloodPressure').alias('stddev_bp'),
).first()
mean_bp, stddev_bp = bp_stats['mean_bp'], bp_stats['stddev_bp']

# Treat values more than 3 standard deviations from the mean as anomalies
threshold = 3 * stddev_bp

# Keep only the anomalous rows
anomalies = health_df.filter(
    (col('BloodPressure') > mean_bp + threshold)
    | (col('BloodPressure') < mean_bp - threshold)
)
print("Anomalies in blood pressure:")
anomalies.show()

Correlation between age and blood pressure: 0.1883905284146517
Anomalies in blood pressure:
+---------+---+------+------+-------------+
|PatientID|Age|Height|Weight|BloodPressure|
+---------+---+------+------+-------------+
+---------+---+------+------+-------------+
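
The empty result can be checked by hand from the `describe()` output. The sketch below, in plain Python with the summary values copied from that table (variable names are illustrative), computes the 3-standard-deviation band and compares it with the observed minimum and maximum.

```python
# Summary values for BloodPressure, copied from the describe() output
mean_bp = 131.52
stddev_bp = 29.4952418693133
min_bp, max_bp = 80, 179

# Band of mean +/- 3 standard deviations
lower = mean_bp - 3 * stddev_bp
upper = mean_bp + 3 * stddev_bp
print(f"3-sigma band: [{lower:.2f}, {upper:.2f}]")

# A row can only be flagged if the observed range extends outside the band
any_outside = min_bp < lower or max_bp > upper
print(f"Any value outside the band: {any_outside}")
```

Because the whole observed range of 80 to 179 lies inside the band, no row is flagged. A tighter threshold such as 2 standard deviations would be one option to explore if a more sensitive check is wanted.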

