# Multivariate Analysis

Multivariate analysis (MVA) covers statistical techniques that analyze more than one variable at a time. It is especially useful in healthcare, where outcomes depend on many interrelated variables. The sections below summarize why it matters:

### Importance of Multivariate Analysis:

1. **Simultaneous Analysis of Multiple Variables**:
    - Allows for the simultaneous examination of multiple variables, mirroring the complex, multidimensional nature of healthcare phenomena.

2. **Uncovering Complex Relationships**:
    - Helps in understanding complex relationships among different variables, which is critical for identifying underlying patterns and interactions affecting health outcomes.

3. **Controlling Confounding Variables**:
    - Provides a framework to adjust for measured confounding variables, giving a less biased estimate of the relationship of interest.

4. **Prediction and Classification**:
    - Useful in developing predictive models to forecast health outcomes or classify individuals into different risk groups.

5. **Data Reduction**:
    - Techniques like Principal Component Analysis (PCA) and Factor Analysis reduce data dimensionality, making it more manageable and interpretable.

6. **Identification of Subgroups**:
    - Helps in identifying distinct subgroups within a population, which is crucial for personalized medicine and targeted interventions.
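
As a rough illustration of the data-reduction point in the list above, the sketch below runs PCA with plain NumPy on synthetic data. The simulated measurements and all variable names are assumptions made for this example; they are not part of the patient dataset used later.

```python
import numpy as np

# Synthetic example (assumed data): 200 patients, 3 measurements
rng = np.random.default_rng(0)
latent = rng.normal(size=200)                 # one hidden shared factor
X = np.column_stack([
    latent + 0.1 * rng.normal(size=200),      # measurement driven by the factor
    latent + 0.1 * rng.normal(size=200),      # second measurement driven by it
    rng.normal(size=200),                     # an unrelated measurement
])

# Center and scale each column, then decompose with SVD
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance carried by each principal component
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two components (3 columns reduced to 2)
scores = Z @ Vt[:2].T

print(explained_variance_ratio.round(3))
print(scores.shape)
```

Because two of the three columns share one underlying factor, most of the variance concentrates in the first component, which is the situation where dropping later components loses little information.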

### Significance for Scientific Papers in Healthcare:

1. **Evidence-Based Insights**:
    - Yields quantitative insights into complex healthcare phenomena that support evidence-based practice.

2. **Policy and Decision-Making**:
    - The findings from MVA can inform policy and clinical decision-making by unveiling the multifaceted relationships among variables.

3. **Enhanced Understanding**:
    - Enhances understanding of disease etiology, treatment effectiveness, and other critical healthcare issues by analyzing multiple variables together.

4. **Improved Predictive Models**:
    - Leads to the development of more accurate predictive models by considering multiple predictors simultaneously.

5. **Transparency and Rigor**:
    - Adds rigor and transparency to data analysis, which is essential for the credibility and replicability of scientific findings.

6. **Interdisciplinary Research**:
    - Facilitates interdisciplinary research by enabling the analysis of complex, multivariate data common in collaborative healthcare research.

7. **Improved Communication**:
    - Through graphical representations and structured findings, MVA helps in better communication of complex relationships to a broad audience, including policy-makers and the public.

8. **Resource Allocation**:
    - Informs resource allocation by identifying key factors and subgroups that may require targeted interventions or resources.

In summary, Multivariate Analysis is indispensable in healthcare research for unraveling complex relationships, making accurate predictions, and informing evidence-based policy and practice. Its utilization in scientific papers significantly contributes to the advancement of knowledge, the credibility of findings, and the overall impact of the research in improving healthcare outcomes.

# Scripts

In [11]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/csanicola74/Important_Reference_Repo/main/Data/sample_patient_data.csv')

# Set style for seaborn
sns.set(style="whitegrid")

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

print("Categorical Columns:", categorical_cols)

# Identify numeric columns (integer-coded identifiers such as zipcode are picked up here too)
continuous_cols = df.select_dtypes(include=['int64', 'float64']).columns

print("Continuous Columns:", continuous_cols)

Categorical Columns: Index(['patient_id', 'race', 'gender', 'icd10_diagnosis_1',
       'icd10_diagnosis_2', 'cpt_1', 'cpt_2', 'surgery_date', 'surgeon',
       'admit_date', 'discharge_date', 'seen_30days', 'occurrence_y/n',
       'occurrence_type'],
      dtype='object')
Continuous Columns: Index(['age', 'zipcode', 'weight_kg', 'height_cm', 'blood_pressure_systolic',
       'blood_pressure_diastolic', 'cholesterol_level', 'blood_glucose_level'],
      dtype='object')
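
One caveat with dtype-based selection: identifier-like columns stored as integers, such as `zipcode` in the output above, land in the continuous list. The sketch below shows one possible way to handle this. It assumes a small hypothetical frame (`demo`, with made-up ages and weights) and a hand-maintained list of code-like columns (`code_like_cols`).

```python
import pandas as pd

# Hypothetical miniature of the patient data (ages and weights are made up)
demo = pd.DataFrame({
    "age": [34, 58, 71],
    "zipcode": [44786, 38612, 94097],
    "weight_kg": [70.5, 82.0, 64.3],
    "race": ["Other", "Asian", "Asian"],
})

# Integer-coded columns that are labels rather than quantities (assumed list)
code_like_cols = ["zipcode"]
demo = demo.astype({col: str for col in code_like_cols})

# Split on "numeric vs. everything else" after the cast
continuous_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Continuous:", continuous_cols)
print("Categorical:", categorical_cols)
```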


## Data Cleaning/Transformation

In [12]:
# Identify columns with more than 5 unique values
cols_with_more_than_5_unique = [col for col in df.columns if df[col].nunique() > 5]

print("Columns with more than 5 unique values:", cols_with_more_than_5_unique)

print(df['icd10_diagnosis_1'].unique())
print(df['cpt_1'].unique())

Columns with more than 5 unique values: ['patient_id', 'age', 'zipcode', 'icd10_diagnosis_1', 'icd10_diagnosis_2', 'cpt_1', 'cpt_2', 'surgery_date', 'surgeon', 'admit_date', 'discharge_date', 'weight_kg', 'height_cm', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'cholesterol_level', 'blood_glucose_level']
['D744' 'D132' 'D502' 'D465' 'D719' 'D252' 'D398' 'D305' 'D302' 'D655'
 'D757' 'D116' 'D121' 'D927' 'D691' 'D455' 'D653' 'D625' 'D761' 'D894'
 'D921' 'D930' 'D323' 'D471' 'D608' 'D698' 'D441' 'D226' 'D747' 'D893'
 'D332' 'D403' 'D460' 'D701' 'D366' 'D888' 'D129' 'D963' 'D138' 'D741'
 'D365' 'D111' 'D382' 'D754' 'D538' 'D954' 'D629' 'D827' 'D630' 'D845'
 'D423' 'D620' 'D109' 'D446' 'D348' 'D163' 'D918' 'D857' 'D702' 'D899'
 'D520' 'D511' 'D357' 'D381' 'D540' 'D871' 'D115' 'D950' 'D641' 'D900'
 'D419' 'D736' 'D840' 'D628' 'D648' 'D841' 'D854' 'D493' 'D985' 'D801'
 'D542' 'D822' 'D202' 'D948' 'D416' 'D728' 'D457' 'D850' 'D992' 'D802'
 'D565' 'D589' 'D607' 'D933' 'D136' 'D904' '

In [13]:
# Group the high-cardinality code columns (more than 5 unique values) by their leading digit
# ICD Grouping
df['grouped_icd10_1'] = 'D' + (df['icd10_diagnosis_1'].str[1:4].astype(int) // 100).astype(str).str.zfill(3)
print(df['grouped_icd10_1'].value_counts())
df['grouped_icd10_2'] = 'D' + (df['icd10_diagnosis_2'].str[1:4].astype(int) // 100).astype(str).str.zfill(3)
print(df['grouped_icd10_2'].value_counts())

# CPT Grouping
df['grouped_cpt_1'] = 'C' + (df['cpt_1'].str[1:5].astype(int) // 1000).astype(str).str.zfill(4)
print(df['grouped_cpt_1'].value_counts())
df['grouped_cpt_2'] = 'C' + (df['cpt_2'].str[1:5].astype(int) // 1000).astype(str).str.zfill(4)
print(df['grouped_cpt_2'].value_counts())

grouped_icd10_1
D008    125
D003    121
D009    118
D002    112
D006    110
D007    106
D001    103
D005    103
D004    102
Name: count, dtype: int64
grouped_icd10_2
D003    129
D007    123
D008    115
D009    115
D001    113
D004    113
D002    105
D005     98
D006     89
Name: count, dtype: int64
grouped_cpt_1
C0005    125
C0008    124
C0001    123
C0009    117
C0004    117
C0006    111
C0003     98
C0002     98
C0007     87
Name: count, dtype: int64
grouped_cpt_2
C0007    120
C0009    119
C0005    116
C0002    114
C0001    113
C0003    108
C0008    104
C0006    103
C0004    103
Name: count, dtype: int64
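
The four grouping statements follow the same pattern, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's own code: `group_codes` is a hypothetical name, and the sample CPT strings are made up on the assumption that codes are one leading character followed by digits, as the slicing in the cell above implies.

```python
import pandas as pd

def group_codes(codes: pd.Series, prefix: str, n_digits: int, bucket: int) -> pd.Series:
    """Bucket codes such as 'D744' by their leading digit.

    Takes the `n_digits` digits after the first character, integer-divides
    them by `bucket`, zero-pads the result to `n_digits`, and adds `prefix`.
    """
    number = codes.str[1:1 + n_digits].astype(int) // bucket
    return prefix + number.astype(str).str.zfill(n_digits)

# Sample values: ICD strings in the format printed earlier, CPT strings assumed
icd = pd.Series(["D744", "D132", "D502"])
cpt = pd.Series(["C8123", "C1045"])

icd_groups = group_codes(icd, prefix="D", n_digits=3, bucket=100)
cpt_groups = group_codes(cpt, prefix="C", n_digits=4, bucket=1000)

print(icd_groups.tolist())  # ['D007', 'D001', 'D005']
print(cpt_groups.tolist())  # ['C0008', 'C0001']
```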


In [14]:
# Convert the 'surgery_date' column to a datetime object
df['surgery_date'] = pd.to_datetime(df['surgery_date'])

# Extract month and year
df['year'] = df['surgery_date'].dt.year
df['month'] = df['surgery_date'].dt.month

# Combine year and month into a single column
df['sx_year_month'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2)

print(df['sx_year_month'])

0      2017-10
1      2017-07
2      2019-07
3      2021-08
4      2019-03
        ...   
995    2020-02
996    2022-12
997    2021-09
998    2019-03
999    2019-06
Name: sx_year_month, Length: 1000, dtype: object
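
The same year-month label can also be produced in one step with `dt.strftime`, or as a period value with `dt.to_period`. A small sketch on made-up dates (the `dates` series is an assumption for illustration):

```python
import pandas as pd

# Made-up surgery dates
dates = pd.to_datetime(pd.Series(["2017-10-03", "2019-07-21", "2021-08-15"]))

# String labels, equivalent to joining year and zero-padded month by hand
year_month_str = dates.dt.strftime("%Y-%m")

# Period values keep calendar semantics (ordering, month arithmetic)
year_month_period = dates.dt.to_period("M")

print(year_month_str.tolist())  # ['2017-10', '2019-07', '2021-08']
print(year_month_period.astype(str).tolist())
```

The string form is convenient as a categorical label, while the period form may be preferable if the months need to be sorted or differenced later.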


In [15]:
# select the target variable
target_variable = df['gender']

# select the categorical columns
categorical_variables = ['zipcode', 'race', 'grouped_icd10_1', 'grouped_icd10_2', 'grouped_cpt_1', 'grouped_cpt_2', 'sx_year_month', 'surgeon', 'seen_30days', 'occurrence_y/n', 'occurrence_type']
print(df[categorical_variables])

# select the continuous columns
continuous_variables = ['age', 'weight_kg', 'height_cm', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'cholesterol_level', 'blood_glucose_level']
print(df[continuous_variables])

     zipcode      race grouped_icd10_1 grouped_icd10_2 grouped_cpt_1  \
0      44786     Other            D007            D003         C0008   
1      38612     Asian            D001            D007         C0001   
2      94097     Asian            D005            D008         C0007   
3      70401     Black            D004            D005         C0003   
4      84258     Black            D007            D007         C0001   
..       ...       ...             ...             ...           ...   
995    34141     Other            D001            D009         C0004   
996    46330     Black            D007            D007         C0009   
997    99960  Hispanic            D007            D009         C0005   
998    41535  Hispanic            D002            D006         C0008   
999    21884     White            D002            D009         C0002   

    grouped_cpt_2 sx_year_month        surgeon seen_30days occurrence_y/n  \
0           C0007       2017-10    Sarah Davis         Yes

## Analysis Scripts

Suppose you want to understand how patient characteristics (such as age, gender, race, weight_kg, and height_cm) influence cholesterol_level. Multiple linear regression fits this scenario.
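
Written out, and assuming only the predictors named above with gender and race entering as dummy indicators, the model has roughly this form. The symbols $\beta$, $\gamma$, $D_{ik}$, and $\varepsilon_i$ are notation introduced here for illustration.

```latex
\mathrm{cholesterol\_level}_i
  = \beta_0
  + \beta_1\,\mathrm{age}_i
  + \beta_2\,\mathrm{weight\_kg}_i
  + \beta_3\,\mathrm{height\_cm}_i
  + \sum_{k} \gamma_k\, D_{ik}
  + \varepsilon_i
```

Here $D_{ik}$ is 1 when patient $i$ belongs to category $k$ of gender or race (one reference category per variable is dropped), and ordinary least squares chooses the coefficients that minimize $\sum_i \varepsilon_i^2$.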

In [18]:
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Convert 'surgery_date' and 'discharge_date' to datetime and extract year
df['surgery_year'] = pd.to_datetime(df['surgery_date']).dt.year
df['discharge_year'] = pd.to_datetime(df['discharge_date']).dt.year

# Convert 'seen_30days' and 'occurrence_y/n' to numeric values (1s and 0s)
df['seen_30days'] = df['seen_30days'].map({'Yes': 1, 'No': 0})
df['occurrence_y/n'] = df['occurrence_y/n'].map({'Yes': 1, 'No': 0})

# Convert categorical columns 'gender', 'race', and 'occurrence_type' to dummy variables
df_dummies = pd.get_dummies(df[['gender', 'race', 'occurrence_type']], drop_first=True)

# Create a dataframe with selected independent variables and the dummy variables
X = pd.concat([df[['age', 'weight_kg', 'height_cm', 'surgery_year', 'discharge_year', 'seen_30days', 'occurrence_y/n']], 
               df_dummies], axis=1)

# Add a constant to the independent variables matrix (for the intercept)
X = sm.add_constant(X)

# Define the dependent variable
y = df['cholesterol_level']

# Cast the design matrix to float before fitting. pd.get_dummies returns boolean
# columns in recent pandas versions; mixed with numeric columns they turn X into
# an object array, and sm.OLS then fails with:
#   ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
X = X.astype(float)

# Perform Multiple Linear Regression using statsmodels
model = sm.OLS(y, X).fit()

# Display summary of the regression model
print(model.summary())

# Visualization of the coefficients (excluding the intercept)
coefs = model.params.drop('const')
plt.figure(figsize=(10, 6))
coefs.sort_values().plot(kind='barh')
plt.title('Regression Coefficients')
plt.xlabel('Coefficient Value')
plt.ylabel('Predictors')
plt.tight_layout()
plt.savefig('regression_coefficients.png')
plt.show()
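
The `ValueError: Pandas data cast to numpy dtype of object` message typically appears when boolean dummy columns are combined with numeric columns in the design matrix. The sketch below reproduces that dtype behaviour on a tiny hypothetical frame (`demo`, with made-up values) and shows that a float cast resolves it. Whether this is the only cause in the notebook's data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature design matrix (values are made up)
demo = pd.DataFrame({
    "age": [34.0, 58.0, 71.0, 45.0],
    "race": ["Asian", "Black", "Asian", "White"],
})

# dtype=bool is passed explicitly to mimic the default of recent pandas versions
dummies = pd.get_dummies(demo[["race"]], drop_first=True, dtype=bool)
X_mixed = pd.concat([demo[["age"]], dummies], axis=1)

# pandas does not merge float and boolean columns into one numeric dtype,
# so the converted array falls back to object
print(np.asarray(X_mixed).dtype)    # object

# Casting every column to float gives a purely numeric array
X_numeric = X_mixed.astype(float)
print(np.asarray(X_numeric).dtype)  # float64
```

An alternative with the same effect is to request numeric dummies up front with `pd.get_dummies(..., dtype=float)`.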