## Health Data Analysis

Data analysis turns raw data into evidence that businesses and organizations can use to make informed decisions. Sections 1 to 4 are demonstrated below with PySpark on a sample health dataset. The main types are:

1\. Descriptive Analysis
------------------------

*   **What it does:** Summarizes historical data to show what has happened. Uses techniques like mean, median, mode, standard deviation, and range.

2\. Diagnostic Analysis
-----------------------

*   **What it does:** Explores why something happened by drilling down into data to identify patterns, correlations, and anomalies.

3\. Predictive Analysis
-----------------------

*   **What it does:** Uses historical data and statistical techniques to predict future outcomes.

4\. Prescriptive Analysis
-------------------------

*   **What it does:** Recommends actions to take by using optimization techniques to identify the best course of action.

5\. Exploratory Analysis
------------------------

*   **What it does:** Discovers patterns and relationships in data, often used in early stages to gain understanding and generate hypotheses.

6\. Inferential Analysis
------------------------

*   **What it does:** Uses statistical methods to draw conclusions about a population based on a sample of data.
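As a rough illustration of drawing a conclusion about a population from a sample, the sketch below computes a normal-approximation confidence interval for a mean using only the standard library. The readings and the helper name `mean_confidence_interval` are made up for this example. For a sample this small, a t-based interval would be somewhat wider.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    center = mean(sample)
    # Standard error of the mean, based on the sample standard deviation
    standard_error = stdev(sample) / n ** 0.5
    # Two-sided critical value (about 1.96 for 95% confidence)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


# Hypothetical systolic readings (mmHg) from a sample of ten people
sample_bp = [118, 124, 131, 115, 122, 128, 135, 119, 126, 121]
low, high = mean_confidence_interval(sample_bp)
print(f"Sample mean: {mean(sample_bp):.1f}")
print(f"95% CI for the population mean: ({low:.1f}, {high:.1f})")
```

The interval narrows as the sample grows, because the standard error shrinks with the square root of the sample size.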

7\. Causal Analysis
-------------------

*   **What it does:** Identifies cause-and-effect relationships between variables.
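One common way to support a cause-and-effect claim is a randomized experiment, where chance alone decides who receives the treatment. The sketch below is a minimal exact permutation test on made-up data: it asks how often a random relabeling of the groups would produce a difference in means at least as low as the one observed. The function name and the numbers are assumptions for illustration only.

```python
from itertools import combinations
from statistics import mean


def permutation_test(treated, control):
    """Exact one-sided permutation test for a lower mean in the treated group."""
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    indices = range(len(pooled))
    as_extreme = 0
    total = 0
    # Enumerate every way of assigning len(treated) units to the treated group
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if mean(group_a) - mean(group_b) <= observed:
            as_extreme += 1
    return observed, as_extreme / total


# Hypothetical change in blood pressure (mmHg) after a randomized trial
treated_change = [-12, -9, -15, -7, -11]
control_change = [-2, 1, -4, 0, -3]

difference, p_value = permutation_test(treated_change, control_change)
print(f"Difference in mean change: {difference:.1f} mmHg")
print(f"One-sided p-value: {p_value:.4f}")
```

The causal reading rests on the random assignment, not on the test itself. The same calculation on observational data would only show an association.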

8\. Mechanistic Analysis
------------------------

*   **What it does:** Focuses on understanding the underlying mechanisms that drive a phenomenon.

### Setup Environment

In [None]:
%run initializespark.ipynb

### Loading Data

In [None]:
# Load the data into a dataframe.
# Parquet files store their own schema, so the CSV-only options
# `header` and `inferSchema` are not needed here.
health_df = spark.read.parquet("testdata/health_data_dev.parquet")

# Show the first few rows of the dataframe
health_df.show()

### 1\. Descriptive Analysis

*   **What it does:** Summarizes historical data to show what has happened. Uses techniques like mean, median, mode, standard deviation, and range.

In [None]:
# Perform descriptive analysis: count, mean, stddev, min, and max for each column
health_df.describe().show()
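`describe()` reports count, mean, standard deviation, minimum, and maximum, but not the median, mode, or range mentioned above (`DataFrame.summary()` adds approximate quartiles, including the 50% value). As a small illustration of those remaining statistics, here is a plain-Python sketch on made-up readings. It is independent of Spark and of `health_df`.

```python
from statistics import mean, median, multimode, stdev

# Hypothetical blood pressure readings (mmHg); not taken from health_df
readings = [118, 122, 122, 125, 130, 135, 141]

summary = {
    "mean": mean(readings),
    "median": median(readings),
    # multimode returns every most-frequent value, so ties are not hidden
    "mode": multimode(readings),
    "stddev": stdev(readings),  # sample standard deviation
    "range": max(readings) - min(readings),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```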

### 2\. Diagnostic Analysis

*   **What it does:** Explores why something happened by drilling down into data to identify patterns, correlations, and anomalies.

In [None]:
# Import necessary libraries
from pyspark.sql.functions import col, corr, mean, stddev

# Example: Pearson correlation between 'Age' and 'BloodPressure'
age_bp_corr = health_df.select(corr("Age", "BloodPressure")).collect()[0][0]
print(f"Correlation between age and blood pressure: {age_bp_corr}")

# Identify anomalies using standard deviation.
# Compute the mean and sample standard deviation in a single pass.
stats = health_df.select(
    mean("BloodPressure").alias("mean_bp"),
    stddev("BloodPressure").alias("stddev_bp"),
).collect()[0]
mean_bp, stddev_bp = stats["mean_bp"], stats["stddev_bp"]

# Define a threshold for anomalies (e.g., 3 standard deviations from the mean)
threshold = 3 * stddev_bp

# Keep only the anomalous rows (those further than the threshold from the mean)
anomalies = health_df.filter(
    (col("BloodPressure") > mean_bp + threshold)
    | (col("BloodPressure") < mean_bp - threshold)
)
print("Anomalies in blood pressure:")
anomalies.show()
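To show what the two diagnostics above compute, the following plain-Python sketch reimplements Pearson correlation and the 3-standard-deviation rule on small made-up lists. The helper names `pearson` and `outliers` and all values are assumptions for illustration. The Spark code above remains the version that runs against `health_df`.

```python
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5


def outliers(values, k=3.0):
    """Values more than k sample standard deviations from the mean."""
    center, spread = mean(values), stdev(values)
    return [v for v in values if abs(v - center) > k * spread]


# Hypothetical data: blood pressure rising with age
ages = [25, 35, 45, 55, 65]
pressures = [115, 120, 126, 131, 138]
print(f"Correlation: {pearson(ages, pressures):.3f}")

# Hypothetical readings with one implausible value
bp_readings = [118, 121, 119, 122, 120, 117, 123, 120, 119, 121,
               118, 122, 120, 121, 119, 120, 122, 118, 121, 220]
print(f"Outliers: {outliers(bp_readings)}")
```

One caveat: an extreme value inflates the very standard deviation it is compared against. In a sample of n points no value can lie more than (n - 1) / sqrt(n) sample standard deviations from the mean, so the 3-standard-deviation rule cannot flag anything when n is 10 or fewer.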

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

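# Fit a baseline model to see how well Age, Height, and Weight explain BloodPressure.
# Assemble the feature columns into a single vector column.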
feature_columns = ["Age", "Height", "Weight"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(health_df)

# Split the data into training and test sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

# Initialize and train the linear regression model
lr = LinearRegression(featuresCol="features", labelCol="BloodPressure")
lr_model = lr.fit(train_data)

# Make predictions on the test set
predictions = lr_model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(
    labelCol="BloodPressure", predictionCol="prediction", metricName="rmse"
)
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on test data: {rmse}")

# Show some predictions
predictions.select("Age", "Height", "Weight", "BloodPressure", "prediction").show()

### 3\. Predictive Analysis

*   **What it does:** Uses historical data and statistical techniques to predict future outcomes.

In [None]:
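def train_and_evaluate_linear_regression(
    data, feature_columns, label_column, test_size=0.2, seed=42
):
    """Fit a linear regression on a random train split and report test RMSE.

    Returns the fitted model, the RMSE, and the test-set predictions.
    """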
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # Assemble features
    assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
    data = assembler.transform(data)

    # Split the data into training and test sets
    train_data, test_data = data.randomSplit([1 - test_size, test_size], seed=seed)

    # Initialize and train the linear regression model
    lr = LinearRegression(featuresCol="features", labelCol=label_column)
    lr_model = lr.fit(train_data)

    # Make predictions on the test set
    predictions = lr_model.transform(test_data)

    # Evaluate the model
    evaluator = RegressionEvaluator(
        labelCol=label_column, predictionCol="prediction", metricName="rmse"
    )
    rmse = evaluator.evaluate(predictions)
    print(f"Root Mean Squared Error (RMSE) on test data: {rmse}")

    # Show some predictions
    predictions.select(*feature_columns, label_column, "prediction").show()

    return lr_model, rmse, predictions


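def predict_blood_pressure(lr_model, age, height, weight):
    """Predict blood pressure for one person.

    The inputs must use the same units as the training columns and the same
    feature order the model was trained with (Age, Height, Weight).
    """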
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import Row

    # Create a single row dataframe with the input features
    input_data = spark.createDataFrame(
        [
            Row(
                Age=age,
                Height=height,
                Weight=weight,
                features=Vectors.dense([age, height, weight]),
            )
        ]
    )

    # Make prediction
    prediction = lr_model.transform(input_data)
    predicted_bp = prediction.select("prediction").collect()[0][0]
    return predicted_bp


# Example usage
lr_model, rmse, predictions = train_and_evaluate_linear_regression(
    health_df, ["Age", "Height", "Weight"], "BloodPressure"
)

In [None]:
predicted_bp = predict_blood_pressure(lr_model, 44, 70, 168)
print(f"Predicted Blood Pressure: {predicted_bp}")
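To make concrete what the fit and the RMSE metric represent, here is a minimal single-feature sketch in plain Python: a closed-form least-squares line followed by the root mean squared error on held-out points. All numbers and helper names are hypothetical. It is not a substitute for the Spark `LinearRegression` pipeline above, which handles several features.

```python
from math import sqrt
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one feature."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(mean(squared_errors))


# Hypothetical training data: age (years) and blood pressure (mmHg)
train_ages = [25, 35, 45, 55, 65]
train_bp = [115, 120, 126, 131, 138]
slope, intercept = fit_simple_linear_regression(train_ages, train_bp)

# Hypothetical held-out data used only for evaluation
test_ages = [30, 50]
test_bp = [118, 128]
test_predictions = [intercept + slope * age for age in test_ages]
test_rmse = rmse(test_bp, test_predictions)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, test RMSE={test_rmse:.3f}")
```

RMSE is expressed in the same units as the label, so here it reads as a typical prediction error in mmHg.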

### 4\. Prescriptive Analysis

*   **What it does:** Recommends actions to take by using optimization techniques to identify the best course of action.

In [49]:
from scipy.optimize import linprog

# Define the coefficients of the linear regression model
coefficients = lr_model.coefficients.toArray()
intercept = lr_model.intercept

# Define the target blood pressure
target_bp = 120

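# Decision variables: age, height, weight, and a free slack variable s.
# Note: the objective below is the same linear expression as the equality
# constraint, so it takes the same value at every feasible point and does not
# actually measure the distance between predicted and target blood pressure.
c = [coefficients[0], coefficients[1], coefficients[2], -1]

# Constraint: predicted blood pressure minus the slack equals the target, i.e.
# coefficients . [age, height, weight] - s = target_bp - intercept.
# Because s is unbounded, every point within the bounds below is feasible.
A_eq = [[coefficients[0], coefficients[1], coefficients[2], -1]]
b_eq = [target_bp - intercept]

# Define the bounds for age, height, and weight
age_bounds = (20, 80)
height_bounds = (150, 200)
weight_bounds = (50, 150)
slack_variable_bounds = (None, None)

bounds = [age_bounds, height_bounds, weight_bounds, slack_variable_bounds]

# Solve the linear programming problem
result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")

# Stop if the solver did not find a solution
if not result.success:
    raise RuntimeError(result.message)

# Extract the returned values for age, height, and weight.
# With a constant objective these are one feasible point, not a unique optimum.
optimal_age, optimal_height, optimal_weight, _ = result.x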

print(f"Optimal Age: {optimal_age}")
print(f"Optimal Height: {optimal_height}")
print(f"Optimal Weight: {optimal_weight}")

Optimal Age: 20.0
Optimal Height: 150.0
Optimal Weight: 50.0
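The values above sit exactly on the lower bounds, which suggests the solver had nothing to trade off. Age and height are also not something a person can change. A simpler prescriptive sketch, under those assumptions, treats age and height as fixed inputs and solves the fitted linear equation for the one adjustable feature, weight, clipping the answer to its allowed range. The coefficients and intercept below are hypothetical placeholders, not the values from `lr_model`, and the helper `recommend_weight` is an illustration rather than part of any library.

```python
def recommend_weight(coefficients, intercept, age, height, target_bp, weight_bounds):
    """Weight within bounds whose predicted blood pressure is closest to target.

    Assumes a linear model: bp = intercept + c_age*age + c_height*height
    + c_weight*weight.
    """
    c_age, c_height, c_weight = coefficients
    if c_weight == 0:
        raise ValueError("weight has no effect on the prediction")
    # Solve the linear equation for weight with age and height held fixed
    fixed_part = intercept + c_age * age + c_height * height
    unconstrained = (target_bp - fixed_part) / c_weight
    # The prediction is monotone in weight, so clipping gives the closest
    # achievable prediction when the exact solution is out of range
    low, high = weight_bounds
    weight = min(max(unconstrained, low), high)
    return weight, fixed_part + c_weight * weight


# Hypothetical model parameters (Age, Height, Weight) and intercept
example_coefficients = (0.4, -0.1, 0.5)
example_intercept = 80.0

weight, predicted_bp = recommend_weight(
    example_coefficients, example_intercept,
    age=44, height=170, target_bp=120, weight_bounds=(50, 150),
)
print(f"Recommended weight: {weight:.1f}, predicted blood pressure: {predicted_bp:.1f}")
```

A regression fitted to observational data describes association, so a recommendation like this is only as trustworthy as the assumption that changing weight would move blood pressure the way the coefficient suggests.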
