# Question 1: Dataset Overview
---

- Begin by loading the dataset and using summary functions to gain an initial understanding of the variables, their types, and the presence of any missing data. Discuss the importance of assessing data structures in the context of statistical modeling.

## Step 0: Install the necessary packages and load them
---

In [1]:
install.packages("dplyr")
install.packages("fastDummies")
install.packages("tableone")


The downloaded binary packages are in
	/var/folders/nv/sjj_9gb52674c8ktybqghkm80000gq/T//Rtmph6jHIk/downloaded_packages



In [237]:
library(dplyr)
library(fastDummies)
library(tableone)

## Step 1: Load the dataset
---

In [238]:
load("Df_regression.RData")

## Step 2: Check the content of the data
---

In [239]:
ls()

## Step 3: Check the class of the data
---

In [240]:
class(Df_regression_unique)

### Explanation:
---
1. `ls()` lists two objects: `Df_regression_unique`, whose class is `data.frame`, and `dataset`, whose class is `character`.
2. `Df_regression_unique` is therefore the object that holds the data.
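
Since the question asks for summary functions, a minimal sketch of a first look at structure, types, and missing values may help. The data frame `toy` and its values are hypothetical and only mimic a few columns of `Df_regression_unique`.

```r
# Toy data frame standing in for Df_regression_unique (hypothetical values)
toy <- data.frame(
  ptnum  = c("p1", "p2", "p3", "p4"),
  age    = c(72, 70, 65, NA),
  gender = c("m", "m", "f", "f"),
  asthma = c(TRUE, NA, NA, NA),
  stringsAsFactors = FALSE
)

str(toy)       # column types and a preview of the values
summary(toy)   # per-column summaries, with NA counts for numeric and logical columns

# One row per column: storage type and number of missing values
overview <- data.frame(
  column    = names(toy),
  type      = vapply(toy, function(x) class(x)[1], character(1)),
  n_missing = colSums(is.na(toy)),
  row.names = NULL
)
print(overview)
```

The same three calls can be pointed at `Df_regression_unique` directly; the `overview` table is just a compact way to see types and missingness side by side.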

## Step 4: Get a snapshot of the data
---
1. Snapshots are useful because they show what the data looks like without printing the entire dataset.

2. In practice I usually look at the tail of the dataset, for two reasons:

    a. The index of the last row shows how many rows the dataset has.

    b. The printed header shows the data type of each column.

3. Change the column names.

In [241]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,C-185086009,C-21908-9,C-233604007,C-195967001,C-132281000119108,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<lgl>,<chr>,<lgl>,<lgl>,<lgl>,<int>,<dbl>
5933,p9972,0,53,f,white,never,True,29.66,,,,,,822,1
5934,p9976,0,70,m,white,former,True,30.06,,,,,,3042,1
5935,p9982,0,70,m,black,never,True,31.07,,,,,,1233,1
5936,p9992,1,75,m,white,never,True,29.6,,,,,,1838,1
5937,p9996,0,59,m,white,never,True,27.3,,,,,,2374,2
5938,p9998,0,46,f,white,former,True,27.6,,,,,,356,1


In [242]:
load("Ontology.RData")
# Create a named vector from the codes dictionary
name_map <- setNames(Codes_dictionary$name, Codes_dictionary$code)
# Rename only matching columns
names(Df_regression_unique) <- ifelse(
  names(Df_regression_unique) %in% names(name_map), # Check if column name exists in the dictionary
  name_map[names(Df_regression_unique)],           # Replace with the new name if it matches
  names(Df_regression_unique)                      # Keep original name if no match
)
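
The renaming logic above can be sketched on a small example. This is an illustrative alternative using `match()`, not the notebook's method; the `codes` dictionary and `toy` data frame are hypothetical stand-ins for `Codes_dictionary` and `Df_regression_unique`.

```r
# Hypothetical dictionary with the same two columns as Codes_dictionary
codes <- data.frame(
  code = c("C-424144002", "C-263495000"),
  name = c("age", "gender"),
  stringsAsFactors = FALSE
)

# Hypothetical one-row data frame with coded column names
toy <- data.frame(ptnum = "p1", `C-424144002` = 72, `C-263495000` = "m",
                  check.names = FALSE)

# Position of each column name in the dictionary (NA where there is no entry)
idx <- match(names(toy), codes$code)

# Rename only the columns that were found; the others keep their names
names(toy)[!is.na(idx)] <- codes$name[idx[!is.na(idx)]]
names(toy)
```

Columns without a dictionary entry, such as `ptnum`, are left untouched, which mirrors the "keep original name if no match" branch of the `ifelse()` call.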

In [243]:
Df_regression_unique

Unnamed: 0_level_0,ptnum,label,age,gender,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<lgl>,<chr>,<lgl>,<lgl>,<lgl>,<int>,<dbl>
1,p10000,1,72,m,asian,former,TRUE,28.10,,,,,,985,1
2,p10005,0,70,m,white,never,TRUE,27.90,,,,,,2897,2
3,p10006,0,65,m,white,never,TRUE,28.70,,,TRUE,,,2244,2
4,p10009,0,66,m,white,never,TRUE,29.92,,,,,,1103,1
5,p10018,0,67,m,white,never,TRUE,28.00,,,,,,1761,1
6,p10019,0,69,m,white,former,TRUE,34.20,,,,,,1597,2
7,p10026,1,69,m,white,former,TRUE,29.97,,,,,,3151,1
8,p10029,0,62,f,white,former,TRUE,29.26,TRUE,,,,,1149,1
9,p10034,1,72,m,white,never,TRUE,27.40,,,,,,1662,1
10,p10037,0,68,f,white,never,,30.30,,,,,,844,1


In [244]:
# Check the values of 'Stage group.clinical Cancer' column
unique(Df_regression_unique$'Stage group.clinical Cancer')

In [245]:
# Replace NA values with "no_cancer"
Df_regression_unique$`Stage group.clinical Cancer`[is.na(Df_regression_unique$`Stage group.clinical Cancer`)] <- "no_cancer"

# Verify the changes
unique(Df_regression_unique$`Stage group.clinical Cancer`)

In [246]:
# Checking the column names
colnames(Df_regression_unique)

## Step 5: Counting the number of nulls in the columns.
---
1. The earlier snapshot shows that the columns C-185086009, C-21908-9, C-233604007, C-195967001, and C-132281000119108 contain nulls (5,796, 5,715, 5,754, 5,932, and 5,880 respectively).

2. Strategy: 

    a. The dataset has 5,938 rows, so we compare the number of nulls in each of these columns against that total.
    b. We check whether the share of nulls in a column exceeds 50%, where:
    
    $\text{\% null} = \frac{\text{total number of nulls in a column}}{\text{total number of rows}} \times 100$

In [247]:
colSums(is.na(Df_regression_unique))
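
The percentage of nulls per column can be computed in one step with `colMeans()`. The sketch below uses a hypothetical data frame `toy`; the 50% threshold is the one named in the strategy above.

```r
# Hypothetical data with different amounts of missingness per column
toy <- data.frame(
  bmi       = c(28.1, 27.9, NA, 29.9),
  pneumonia = c(NA, NA, TRUE, NA),
  asthma    = c(NA, NA, NA, NA)
)

# is.na() gives a TRUE/FALSE matrix; the column mean of it is the share of NAs
pct_null <- colMeans(is.na(toy)) * 100
pct_null

# Names of the columns where more than half of the values are missing
names(pct_null)[pct_null > 50]
```

Replacing `toy` with `Df_regression_unique` gives the same summary for the real data.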

## Step 6: Changing values of NA.
---

Replace NA with 0 and TRUE with 1 in the logical columns that contain NAs. This treats a missing entry as absence of the condition.

In [248]:
Df_regression_unique$'Chronic obstructive bronchitis (disorder)'[is.na(Df_regression_unique$'Chronic obstructive bronchitis (disorder)')] <- 0
Df_regression_unique$'Chronic obstructive bronchitis (disorder)' <- as.integer(Df_regression_unique$'Chronic obstructive bronchitis (disorder)')

Df_regression_unique$'Pneumonia (disorder)'[is.na(Df_regression_unique$'Pneumonia (disorder)')] <- 0
Df_regression_unique$'Pneumonia (disorder)' <- as.integer(Df_regression_unique$'Pneumonia (disorder)')

Df_regression_unique$'Asthma'[is.na(Df_regression_unique$'Asthma')] <- 0
Df_regression_unique$'Asthma' <- as.integer(Df_regression_unique$'Asthma')

Df_regression_unique$'Acute deep venous thrombosis (disorder)'[is.na(Df_regression_unique$'Acute deep venous thrombosis (disorder)')] <- 0
Df_regression_unique$'Acute deep venous thrombosis (disorder)' <- as.integer(Df_regression_unique$'Acute deep venous thrombosis (disorder)')
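
The four near-identical replacements above could also be written as a loop over the logical columns. This is a sketch on a hypothetical data frame `toy`, and it assumes, as the cell above does, that NA means the condition was not recorded.

```r
# Hypothetical data: two logical flag columns and one numeric column
toy <- data.frame(
  pneumonia = c(NA, TRUE, NA),
  asthma    = c(TRUE, NA, NA),
  bmi       = c(28.1, NA, 30.3)
)

# Select only logical columns, so numeric columns such as bmi are left alone
flag_cols <- names(toy)[vapply(toy, is.logical, logical(1))]

for (col in flag_cols) {
  # TRUE -> 1; FALSE and NA -> 0 (assumption: NA means "not recorded")
  toy[[col]] <- as.integer(!is.na(toy[[col]]) & toy[[col]])
}
toy
```

Selecting columns by type avoids retyping long column names, but it would also recode any other logical column, so the selection should be checked before running it on the full dataset.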

In [249]:
unique(Df_regression_unique$'Chronic obstructive bronchitis (disorder)')

In [250]:
unique(Df_regression_unique$'Stage group.clinical Cancer')

In [251]:
unique(Df_regression_unique$'Pneumonia (disorder)')

In [252]:
unique(Df_regression_unique$'Asthma')

In [254]:
unique(Df_regression_unique$'Acute deep venous thrombosis (disorder)')

## Step 7: Check that the columns were recoded correctly
---

In [255]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<dbl>
5933,p9972,0,53,f,white,never,True,29.66,0,no_cancer,0,0,0,822,1
5934,p9976,0,70,m,white,former,True,30.06,0,no_cancer,0,0,0,3042,1
5935,p9982,0,70,m,black,never,True,31.07,0,no_cancer,0,0,0,1233,1
5936,p9992,1,75,m,white,never,True,29.6,0,no_cancer,0,0,0,1838,1
5937,p9996,0,59,m,white,never,True,27.3,0,no_cancer,0,0,0,2374,2
5938,p9998,0,46,f,white,former,True,27.6,0,no_cancer,0,0,0,356,1


## Discussion:
---
Assessing data structures is a crucial step in preparing for statistical modeling. It ensures that the data is correctly understood, processed, and interpreted, ultimately leading to valid and reliable results. Here’s why it’s important:

1. Understanding Data Types
Significance:
Each variable type (e.g., numeric, categorical, logical) dictates the kinds of analyses and transformations that can be performed.
For example, regression models treat numeric and categorical variables differently.
Potential Issues:
A categorical variable coded as numeric may lead to incorrect interpretation (e.g., treating regions as continuous numbers).
Logical variables may need to be converted to binary (0/1) for functions that expect numeric input.
Solution:
Use functions like str() and summary() to ensure variables are properly typed.


2. Detecting and Addressing Missing Data
Significance:
Missing data can skew model results, reduce statistical power, and introduce biases.
Understanding patterns of missingness (e.g., Missing Completely at Random, Missing Not at Random) informs appropriate imputation methods.
Potential Issues:
Ignoring missing data can result in smaller, unrepresentative datasets if rows with missing values are removed.
Arbitrary imputation can distort relationships in the data.
Solution:
Visualize missing data patterns (naniar in R) and assess whether imputation, removal, or advanced techniques (e.g., multiple imputation) are needed.


3. Identifying Outliers and Inconsistent Values
Significance:
Outliers can have a disproportionate influence on statistical models, especially in regression and clustering.
Inconsistent values (e.g., negative ages, nonsensical categories) can indicate data entry errors.
Potential Issues:
Models may overfit or misinterpret trends due to extreme or incorrect values.
Solution:
Perform exploratory data analysis (EDA) to detect and decide how to handle outliers.


4. Ensuring Data Conforms to Model Requirements
Significance:
Many statistical models have assumptions (e.g., linear regression assumes linearity, normally distributed residuals, and homoscedasticity).
Proper formatting (e.g., encoding categorical variables as factors) ensures compatibility with modeling functions.
Potential Issues:
Non-normal residuals in a model that assumes normality can lead to invalid inferences.
Misformatted data might cause errors or incorrect results in statistical software.
Solution:
Check distributions, relationships, and data transformations (e.g., log transformations for skewed data).


5. Reducing Dimensionality and Improving Interpretability
Significance:
High-dimensional data can cause overfitting and make results harder to interpret.
Dimensionality reduction techniques (e.g., PCA, feature selection) require an understanding of variable roles.
Potential Issues:
Irrelevant variables add noise and reduce the efficiency of the model.
Solution:
Assess variable relevance and multicollinearity before modeling.


6. Recognizing Data Relationships
Significance:
Relationships between variables (e.g., collinearity, interactions) influence model complexity and performance.
Potential Issues:
Ignoring relationships can lead to misspecified models or incorrect interpretations.
Solution:
Explore relationships using scatterplots, correlation matrices, or advanced methods like variance inflation factors (VIF).


7. Ensuring Ethical and Accurate Representation
Significance:
Assessing data structures ensures that the dataset represents the population accurately.
Potential Issues:
Imbalanced datasets can bias models (e.g., under-representation of minority groups in healthcare studies).
Solution:
Stratify data, balance classes, and ensure fair sampling.
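
Two of the checks discussed above, inconsistent values and relationships between variables, can be sketched in base R. The data frame `toy` and the plausibility thresholds are hypothetical and chosen only for illustration.

```r
# Hypothetical data containing one impossible age and one extreme BMI
toy <- data.frame(
  age      = c(72, 70, -5, 66, 67),
  bmi      = c(28.1, 27.9, 28.7, 95.0, 28.0),
  followup = c(985, 2897, 2244, 1103, 1761)
)

# Plausibility rules (illustrative thresholds, not clinical guidance)
implausible <- toy$age < 0 | toy$age > 120 | toy$bmi < 10 | toy$bmi > 80
toy[implausible, ]

# Pairwise correlations among the numeric columns, excluding flagged rows
round(cor(toy[!implausible, ]), 2)
```

Flagged rows are only listed here, not deleted; whether to correct, exclude, or keep them is a separate decision that depends on the clinical context.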

# Question 2: Handling Missing Data
---
Identify columns with logical (TRUE/FALSE) values and apply methods for addressing missing values in these columns, replacing them with 0 or 1 based on their logical value. Reflect on the implications of missing data and the
strategies used for handling it.

## Step 0: Direct Conversion in Base R
---

Logical to integer: in R, `as.integer()` maps TRUE to 1 and FALSE to 0, so a logical column converts directly. NA values remain NA after the conversion.

In [256]:
Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)' <- as.integer(Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)')
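
A short sketch of how `as.integer()` treats each logical value, including NA, on a hypothetical vector `x`:

```r
x <- c(TRUE, FALSE, NA)

as.integer(x)          # 1 0 NA: the missing value is preserved
sum(x, na.rm = TRUE)   # 1: logical values count as 0/1 in arithmetic

# Replacing NA is a separate, explicit step
y <- as.integer(x)
y[is.na(y)] <- 0L
y                      # 1 0 0
```

Because the conversion alone does not remove NA, any replacement with 0 is a modeling decision rather than a side effect of the type change.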

## Step 1: Verify the Changes
----

In [257]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<int>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<dbl>
5933,p9972,0,53,f,white,never,1,29.66,0,no_cancer,0,0,0,822,1
5934,p9976,0,70,m,white,former,1,30.06,0,no_cancer,0,0,0,3042,1
5935,p9982,0,70,m,black,never,1,31.07,0,no_cancer,0,0,0,1233,1
5936,p9992,1,75,m,white,never,1,29.6,0,no_cancer,0,0,0,1838,1
5937,p9996,0,59,m,white,never,1,27.3,0,no_cancer,0,0,0,2374,2
5938,p9998,0,46,f,white,former,1,27.6,0,no_cancer,0,0,0,356,1


## Step 2: If there are NA values
---

If the column contains NA values, decide how to handle them. For instance, you might replace them with 0.

In [258]:
Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)'[is.na(Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)')] <- 0
Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)' <- as.integer(Df_regression_unique$'Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)')

In [259]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<int>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<dbl>
5933,p9972,0,53,f,white,never,1,29.66,0,no_cancer,0,0,0,822,1
5934,p9976,0,70,m,white,former,1,30.06,0,no_cancer,0,0,0,3042,1
5935,p9982,0,70,m,black,never,1,31.07,0,no_cancer,0,0,0,1233,1
5936,p9992,1,75,m,white,never,1,29.6,0,no_cancer,0,0,0,1838,1
5937,p9996,0,59,m,white,never,1,27.3,0,no_cancer,0,0,0,2374,2
5938,p9998,0,46,f,white,former,1,27.6,0,no_cancer,0,0,0,356,1


## Discussion
---

Missing data is a common challenge in statistical modeling and data analysis. Its presence can significantly impact the validity, reliability, and interpretability of results. Proper handling of missing data is crucial to ensure robust and unbiased conclusions.

Implications of Missing Data:

1. Reduced Statistical Power
Missing data reduces the sample size available for analysis, weakening the statistical power of tests.
Smaller datasets may lead to inconclusive results or increased Type II errors (failing to detect true effects).

2. Bias in Results
Missing data that is not random (e.g., people with certain traits systematically omit answers) can bias estimates.
For example, in medical studies, patients with severe symptoms may drop out, leading to underestimation of the disease severity.


3. Distorted Relationships
Missing data can change the relationships between variables if handled improperly.
For instance, imprecise imputation might introduce spurious correlations.


4. Loss of Representativeness
If a particular subgroup is more likely to have missing data, the remaining data may no longer represent the entire population.


5. Impact on Model Performance
Machine learning and statistical models often fail or perform poorly with missing data unless explicitly accounted for.


## Strategies for Handling Missing Data
---

1. Prevention

    Careful Data Collection: Design surveys and data collection processes to minimize missing values.
    
    Data Validation: Use real-time checks during data entry to catch missing or incorrect inputs.

<br>
2. Understanding the Missingness Mechanism

    Missing Completely at Random (MCAR): Data is missing with no systematic relationship to the observed or unobserved data. This is the least problematic type.

    Missing at Random (MAR): The likelihood of missingness depends on observed variables but not unobserved ones.

    Missing Not at Random (MNAR): Missingness depends on unobserved variables. This requires specialized techniques or assumptions.

<br>
3. Strategies to Handle Missing Data

    A. Deletion Methods
        Listwise Deletion: Remove rows with any missing values.
        
        Advantages: Simple and retains consistency across analyses.
        
        Disadvantages: Reduces sample size, can bias results if data is not MCAR.
        
        Pairwise Deletion: Use available data for each analysis without discarding entire rows.
        
        Advantages: Retains more data.
        
        Disadvantages: Results may vary across analyses.
        

    B. Imputation Methods
        Mean/Median/Mode Imputation: Replace missing values with the mean or median (for numeric) or the mode (for categorical).
        Advantages: Simple and quick.
        Disadvantages: Reduces variability, can introduce bias.
        Regression Imputation: Use regression models to predict and fill missing values based on other variables.
        Advantages: Accounts for relationships between variables.
        Disadvantages: Assumes the imputation model is correct, underestimates variability.
        Multiple Imputation: Create multiple plausible datasets by imputing missing values with different estimates and combining results.
        Advantages: Accounts for uncertainty and variability in imputations.
        Disadvantages: Computationally intensive, complex to implement.
        Hot Deck Imputation: Replace missing values with observed values from similar cases.
        Advantages: Retains observed data distribution.
        Disadvantages: Depends on finding similar cases.

    C. Model-Based Approaches
        Maximum Likelihood Estimation (MLE): Estimate parameters directly while accounting for missing data.
        Advantages: Efficient and unbiased under MAR.
        Disadvantages: Requires complex algorithms.
        Bayesian Methods: Use prior distributions to estimate missing values.
        Advantages: Flexible and accounts for uncertainty.
        Disadvantages: Requires expertise and computational power.

    D. Advanced Techniques
        K-Nearest Neighbors (KNN): Impute missing values based on the closest neighbors in the dataset.
        Advantages: Considers the data's structure.
        Disadvantages: Computationally expensive for large datasets.
        Machine Learning: Use predictive models like Random Forest to impute missing values.
        Advantages: Handles complex relationships.
        Disadvantages: May overfit and requires validation.
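
To make the trade-off between deletion and simple imputation concrete, here is a minimal sketch on a hypothetical `bmi` vector. It compares listwise deletion with mean imputation only; the other methods listed above need dedicated packages and are not shown.

```r
# Hypothetical BMI values with two missing entries
bmi <- c(28.1, 27.9, NA, 29.9, NA, 34.2)

# Listwise deletion: drop the missing entries
complete <- bmi[!is.na(bmi)]

# Mean imputation: fill every gap with the observed mean
imputed <- ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi)

length(complete); length(imputed)   # deletion shrinks the sample
mean(complete); mean(imputed)       # the mean is unchanged by mean imputation
sd(complete); sd(imputed)           # the spread is smaller after imputation
```

The smaller standard deviation after imputation is the "reduces variability" disadvantage noted above: the filled-in values sit exactly at the mean, so the data look more certain than they are.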

# Question 3: Categorical Variable Conversion
---

Convert relevant variables into categorical (factor) variables as necessary, and consider the role of this transformation in regression analyses. Discuss when and why categorical conversion is crucial in epidemiological research.

## Step 0: Encode gender as a binary variable
---

Explanation

ifelse Function:

ifelse(condition, value_if_true, value_if_false) checks the condition for each element.

If the value is "f", it assigns 1.

Otherwise (e.g., if the value is "m"), it assigns 0.

---

Replace the original column: the `gender` column (originally C-263495000) is overwritten with the transformed values.

In [260]:
Df_regression_unique$"gender" <- ifelse(Df_regression_unique$"gender" == "f", 1, 0)

In [261]:
colnames(Df_regression_unique)[colnames(Df_regression_unique) == "gender"] <- "gender_female"

In [262]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender_female,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<dbl>
5933,p9972,0,53,1,white,never,1,29.66,0,no_cancer,0,0,0,822,1
5934,p9976,0,70,0,white,former,1,30.06,0,no_cancer,0,0,0,3042,1
5935,p9982,0,70,0,black,never,1,31.07,0,no_cancer,0,0,0,1233,1
5936,p9992,1,75,0,white,never,1,29.6,0,no_cancer,0,0,0,1838,1
5937,p9996,0,59,0,white,never,1,27.3,0,no_cancer,0,0,0,2374,2
5938,p9998,0,46,1,white,former,1,27.6,0,no_cancer,0,0,0,356,1


## Step 1: Check the unique values of columns with categorical data
---

In [263]:
unique(Df_regression_unique$"Tobacco smoking status NHIS")

In [264]:
unique(Df_regression_unique$"race")

## Step 2: Convert Tobacco smoking status NHIS into a binary variable
---
Coding legend:

'former' = 1

'never' = 0 (any other value is also coded 0 by the `ifelse()` call below)

In [265]:
Df_regression_unique$"Tobacco smoking status NHIS" <- ifelse(Df_regression_unique$"Tobacco smoking status NHIS" == "former", 1, 0)

## Step 3: Verify the change
---

In [266]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender_female,race,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Stage group.clinical Cancer,Pneumonia (disorder),Asthma,Acute deep venous thrombosis (disorder),Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<int>,<dbl>,<int>,<chr>,<int>,<int>,<int>,<int>,<dbl>
5933,p9972,0,53,1,white,0,1,29.66,0,no_cancer,0,0,0,822,1
5934,p9976,0,70,0,white,1,1,30.06,0,no_cancer,0,0,0,3042,1
5935,p9982,0,70,0,black,0,1,31.07,0,no_cancer,0,0,0,1233,1
5936,p9992,1,75,0,white,0,1,29.6,0,no_cancer,0,0,0,1838,1
5937,p9996,0,59,0,white,0,1,27.3,0,no_cancer,0,0,0,2374,2
5938,p9998,0,46,1,white,1,1,27.6,0,no_cancer,0,0,0,356,1


## Step 4: Create dummy variables for the Race and Stage group.clinical Cancer
---

In [267]:
Df_regression_unique <- Df_regression_unique %>%
  mutate(
    asian = ifelse(`race` == "asian", 1, 0),
    white = ifelse(`race` == "white", 1, 0),
    black = ifelse(`race` == "black", 1, 0),
    hawaiian = ifelse(`race` == "hawaiian", 1, 0),
    native = ifelse(`race` == "native", 1, 0)
  ) %>%
  select(-"race")  # Remove the original column if not needed

In [268]:
unique(Df_regression_unique$"Stage group.clinical Cancer")

In [269]:
Df_regression_unique <- Df_regression_unique %>%
  mutate(
    no_cancer = ifelse(`Stage group.clinical Cancer` == 'no_cancer', 1, 0),
    stage4 = ifelse(`Stage group.clinical Cancer` == 'stage4', 1, 0),
    stage1a = ifelse(`Stage group.clinical Cancer` == 'stage1a', 1, 0),
    stage2a = ifelse(`Stage group.clinical Cancer` == 'stage2a', 1, 0),
    stage3a = ifelse(`Stage group.clinical Cancer` == 'stage3a', 1, 0),
    stage1b = ifelse(`Stage group.clinical Cancer` == 'stage1b', 1, 0),
    stage2b = ifelse(`Stage group.clinical Cancer` == 'stage2b', 1, 0),
    stage3b = ifelse(`Stage group.clinical Cancer` == 'stage3b', 1, 0)
  ) %>%
  select(-"Stage group.clinical Cancer")  # Remove the original column if not needed
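
The cells above create one indicator column for every level of race and cancer stage. If all of them enter a regression that has an intercept, the indicators sum to 1 in every row and are perfectly collinear, so one level normally serves as the reference. The sketch below shows one way to get that behaviour with `factor()` and `model.matrix()`; the data frame `toy` and the choice of "white" as reference are assumptions made for illustration.

```r
# Hypothetical race values
toy <- data.frame(race = c("asian", "white", "black", "white", "native"),
                  stringsAsFactors = FALSE)

# Factor with "white" as the reference category (an illustrative choice)
toy$race <- relevel(factor(toy$race), ref = "white")
levels(toy$race)

# model.matrix() builds the indicator columns and omits the reference level
X <- model.matrix(~ race, data = toy)
colnames(X)
```

Modeling functions such as `lm()` and `glm()` apply the same encoding internally when a predictor is a factor, so manual dummy columns are needed mainly when a function only accepts numeric input.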

## Step 5: Verify the changes
---

In [270]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender_female,Tobacco smoking status NHIS,Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure),Body Mass Index,Chronic obstructive bronchitis (disorder),Pneumonia (disorder),Asthma,⋯,hawaiian,native,no_cancer,stage4,stage1a,stage2a,stage3a,stage1b,stage2b,stage3b
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,p9972,0,53,1,0,1,29.66,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5934,p9976,0,70,0,1,1,30.06,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5935,p9982,0,70,0,0,1,31.07,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5936,p9992,1,75,0,0,1,29.6,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5937,p9996,0,59,0,0,1,27.3,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5938,p9998,0,46,1,1,1,27.6,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0


In [271]:
# Renaming some columns for consistency
Df_regression_unique <- Df_regression_unique %>%
  rename(smoking_status = `Tobacco smoking status NHIS`,
         alcohol_assessment = `Assessment using Alcohol Use Disorders Identification Test - Consumption (procedure)`,
         bmi = `Body Mass Index`,
         chronic_obstructive_bronchitis = `Chronic obstructive bronchitis (disorder)`,
         pneumonia = `Pneumonia (disorder)`,
         asthma = `Asthma`, 
         dvt = `Acute deep venous thrombosis (disorder)`,
         statin = Exposure)

In [272]:
Df_regression_unique

Unnamed: 0_level_0,ptnum,label,age,gender_female,smoking_status,alcohol_assessment,bmi,chronic_obstructive_bronchitis,pneumonia,asthma,⋯,hawaiian,native,no_cancer,stage4,stage1a,stage2a,stage3a,stage1b,stage2b,stage3b
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,p10000,1,72,0,1,1,28.10,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
2,p10005,0,70,0,0,1,27.90,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
3,p10006,0,65,0,0,1,28.70,0,1,0,⋯,0,0,1,0,0,0,0,0,0,0
4,p10009,0,66,0,0,1,29.92,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5,p10018,0,67,0,0,1,28.00,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
6,p10019,0,69,0,1,1,34.20,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
7,p10026,1,69,0,1,1,29.97,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
8,p10029,0,62,1,1,1,29.26,1,0,0,⋯,0,0,1,0,0,0,0,0,0,0
9,p10034,1,72,0,0,1,27.40,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
10,p10037,0,68,1,0,0,30.30,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0


## Step 6: Check the unique values of the Exposure column (renamed `statin`)
---

In [273]:
unique(Df_regression_unique$"statin")

## Step 7: Convert the Exposure column into a binary variable
---
Coding legend (original value → new value):

1 → 0

2 → 1

In [274]:
Df_regression_unique$"statin" <- ifelse(Df_regression_unique$"statin" == 2, 1, 0)
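
Note that `ifelse(x == 2, 1, 0)` maps every value other than 2 to 0, so an unexpected code would be silently recoded. A hedged alternative is to check the codes first and then recode through an explicit lookup; the vector `exposure` below is hypothetical.

```r
# Hypothetical exposure codes, including one missing value
exposure <- c(1, 2, 1, NA, 2)

# Fail early if an unexpected code appears (NA is tolerated here)
stopifnot(all(exposure %in% c(1, 2, NA)))

# Explicit lookup: 1 -> 0, 2 -> 1; NA stays NA
lookup <- c(`1` = 0L, `2` = 1L)
statin <- unname(lookup[as.character(exposure)])
statin
```

The lookup makes the coding legend part of the code itself, which is easier to audit than a condition that only names one of the two values.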

## Step 8: Verify the changes
---

In [275]:
unique(Df_regression_unique$"statin")

## Discussion
---
Epidemiological research often involves variables that naturally fall into discrete categories, such as demographic data (e.g., age groups, gender), exposure groups, or disease classifications. Converting such variables into categorical formats is a critical step in data preparation and analysis, as it ensures that these variables are correctly represented and interpreted in statistical models.

## When Categorical Conversion is Crucial
---

a. When Variables Are Nominal or Ordinal

    Nominal Variables: Variables with distinct categories that have no inherent order (e.g., blood type, ethnicity, or disease type).
    
    Ordinal Variables: Variables with categories that have a meaningful order but no consistent scale (e.g., disease severity: mild, moderate, severe).
    Example: Converting "mild, moderate, severe" into ordered factors to preserve their natural hierarchy.

b. When Applying Statistical Models

    Many statistical models, such as linear regression, ANOVA, and logistic regression, require categorical variables to be explicitly coded as factors. Without conversion, the software might misinterpret numerically coded categories as continuous, leading to incorrect model specification.

c. When Handling Group Comparisons

    Epidemiological studies often involve comparing groups (e.g., exposed vs. unexposed, treatment vs. control). Categorical conversion ensures clear group delineation for stratified analysis or interaction testing.

d. When Using Machine Learning or Modeling Algorithms

    Machine learning models often require variables to be encoded as numerical representations (dummy variables or one-hot encoding). Categorical conversion facilitates the creation of these encodings.
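
The ordinal case mentioned above can be sketched with an ordered factor. The `severity` values are hypothetical and follow the mild/moderate/severe example.

```r
# Hypothetical severity ratings
severity <- c("mild", "severe", "moderate", "mild")

# Declare the levels in their natural order and mark the factor as ordered
sev <- factor(severity, levels = c("mild", "moderate", "severe"), ordered = TRUE)
sev

sev >= "moderate"   # comparisons respect the declared order
table(sev)          # counts are reported in level order, not alphabetically
as.integer(sev)     # underlying codes: 1 = mild, 2 = moderate, 3 = severe
```

Without the explicit `levels` argument, R would order the categories alphabetically, which happens to be correct here but would not be for labels such as "low", "medium", "high".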
    
    
## Why Categorical Conversion Is Crucial
---

a. Preserves the Integrity of Categorical Data

    Without proper conversion, categorical variables might be treated as continuous, leading to meaningless operations (e.g., averaging categorical values like "male" and "female"). This can distort model outputs and make the findings invalid.
    
b. Enables Appropriate Statistical Interpretation

    Categorical conversion allows models to calculate meaningful metrics, such as odds ratios in logistic regression and hazard ratios in survival analysis, each expressed relative to a reference category.
    Proper conversion helps interpret the role of exposure, disease outcomes, or risk factors accurately.

c. Facilitates Group-Level Analysis

    In epidemiology, researchers often analyze data stratified by categories like age, sex, or socioeconomic status. Categorical conversion ensures these groupings are well-defined and allows for subgroup analysis, such as identifying differential risks across strata.

d. Prevents Misrepresentation of Ordinal Data

    For ordinal variables, conversion to factors ensures that statistical methods respect the natural ordering of categories. For example, in a disease severity scale, treating categories as unordered would ignore their hierarchical relationship.

e. Essential for Interaction Testing

    Testing interactions between categorical variables (e.g., age group × treatment type) requires proper encoding to understand joint effects.

f. Improves Computational Efficiency

    Storing categorical variables as factors rather than character strings can reduce memory usage and speed up some computations in R, because each value is stored as an integer code with a shared set of level labels.

g. Ensures Reproducibility

    Explicitly converting and documenting categorical variables improves transparency and ensures consistent data handling across analyses.
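
To illustrate the point about interpretation, the sketch below fits a logistic regression with a factor predictor and reads off odds ratios relative to the reference level. The data frame `toy`, its outcome values, and the three smoking categories are hypothetical and constructed so the result is easy to verify by hand.

```r
# Hypothetical data: 4 people per smoking group, with 1, 2, and 3 events
toy <- data.frame(
  outcome = c(0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1),
  smoking = rep(c("never", "former", "current"), each = 4)
)

# "never" is listed first, so it becomes the reference category
toy$smoking <- factor(toy$smoking, levels = c("never", "former", "current"))

fit <- glm(outcome ~ smoking, data = toy, family = binomial)

# Exponentiated coefficients: baseline odds, then odds ratios vs. "never"
exp(coef(fit))
```

The odds are 1/3 for "never", 1 for "former", and 3 for "current", so the odds ratios against "never" come out at about 3 and 9. Changing the first entry of `levels` changes the reference group and therefore every reported ratio.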

# Question 4: Data Cleaning
---
Implement imputation or replacement techniques to address missing values in
key categorical variables, such as disease stages. Discuss the ethical and
methodological considerations when cleaning clinical datasets.

## Step 0: Check whether there are missing values
---

In [276]:
colSums(is.na(Df_regression_unique))

## Step 1: There are no missing values in the dataset
---
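
The missing disease stages were already replaced with "no_cancer" earlier in the notebook. For comparison, here is a minimal sketch of two other common replacement strategies for a categorical variable, shown on a hypothetical `stage` vector: keeping missingness as its own category, and mode imputation.

```r
# Hypothetical stage values with two missing entries
stage <- c("stage1a", NA, "stage2b", "stage1a", NA, "stage4")

# Option 1: keep missingness visible as its own category
stage_unknown <- ifelse(is.na(stage), "unknown", stage)

# Option 2: mode imputation (most frequent observed value)
mode_value <- names(which.max(table(stage)))   # table() drops NA by default
stage_mode <- ifelse(is.na(stage), mode_value, stage)

table(stage_unknown)
table(stage_mode)
```

Option 1 preserves the information that a value was missing, while option 2 inflates the most common stage; which is appropriate depends on why the stage is missing in the first place.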

## Discussion
---

Cleaning clinical datasets is a critical step in preparing data for analysis, but it involves both ethical and methodological challenges. Clinical data often deals with sensitive health information, and how the data is handled can influence the validity of research findings and impact clinical decisions.

## Ethical Considerations
---

1. Patient Privacy and Confidentiality

    Importance: Clinical datasets often contain personally identifiable information (PII) such as names, dates of birth, and medical history. Maintaining privacy is crucial to comply with regulations like HIPAA (Health Insurance Portability and Accountability Act) or GDPR (General Data Protection Regulation).
    
    Actions:
    
        - Remove or anonymize PII before cleaning.
        - Use encryption and secure storage methods to protect data.
        - Limit access to sensitive information on a need-to-know basis.      

2. Informed Consent

    Importance: Patients must consent to their data being used for research purposes.
    
    Challenges: Retrospective datasets may lack explicit consent. Data cleaning activities, such as creating derived variables, must remain within the bounds of the consent obtained.
    
    Actions:
    
        - Ensure that the research use aligns with the terms of consent.
        - Obtain ethics committee approval for secondary uses of data.

3. Bias Introduction

    Importance: Cleaning processes, such as imputing missing values or removing outliers, can introduce biases that affect the validity of findings.
    
    Examples:
    
        - Excluding patients with incomplete records may disproportionately affect underrepresented groups.
        - Over-imputation of missing data may create artificial trends.
    
    Actions:
    
        - Document all cleaning decisions.
        - Use transparent, reproducible methods to minimize bias.

4. Equity and Representation

    Importance: Clinical datasets often underrepresent certain populations (e.g., minorities, rural communities).

    Challenges: Cleaning that removes records with missing demographic or clinical data may exacerbate disparities.

    Actions:
    
        - Assess the impact of cleaning decisions on subgroup representation.
        - Consider weighting or stratification to ensure equitable analyses.

5. Accountability and Transparency

    Importance: Ethical research requires transparency in data handling to ensure findings are reproducible and credible.
    
    Actions:
    
        - Maintain logs of all cleaning steps.
        - Share cleaned datasets only with appropriate documentation.
        
## Methodological Considerations
---

1. Understanding the Nature of Missing Data

    Challenge: Missing data can arise from various mechanisms:
    
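    Missing Completely at Random (MCAR): The probability of missingness is unrelated to both observed and unobserved data.
    
    Missing at Random (MAR): The probability of missingness depends only on observed data.
    
    Missing Not at Random (MNAR): The probability of missingness depends on unobserved data, including the missing values themselves.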
    
    Actions:
    
        - Use appropriate techniques (e.g., multiple imputation) to address missingness.
        - Avoid arbitrary deletion of records, as it may introduce bias.
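
Before choosing a technique, it helps to quantify missingness and check whether it is related to observed variables. The following is a minimal sketch on simulated data (the `toy` data frame and its MAR mechanism are assumptions made for illustration). An association between missingness and an observed variable is evidence against MCAR, but it cannot distinguish MAR from MNAR.

```r
# Simulated data: bmi is more likely to be missing for older patients (MAR by construction)
set.seed(42)
toy <- data.frame(
  age = round(rnorm(200, mean = 65, sd = 8)),
  bmi = rnorm(200, mean = 28, sd = 2)
)
p_miss <- plogis((toy$age - 65) / 5 - 1.5)
toy$bmi[runif(200) < p_miss] <- NA

# Count and proportion of missing values per column
miss_summary <- data.frame(
  n_missing    = colSums(is.na(toy)),
  prop_missing = colMeans(is.na(toy))
)
print(miss_summary)

# Does missingness in bmi depend on an observed variable (age)?
toy$bmi_missing <- is.na(toy$bmi)
fit <- glm(bmi_missing ~ age, data = toy, family = binomial)
print(summary(fit)$coefficients)
```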

2. Handling Outliers

    Challenge: Outliers in clinical data may reflect true extreme cases or data entry errors.
    
    Actions:
    
        - Investigate the cause of outliers before removing or modifying them.
        - Use robust statistical methods that are less sensitive to outliers.
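
As an illustration, outliers can be flagged for review (rather than deleted) with the common 1.5 x IQR rule, and robust summaries such as the median and MAD can be reported alongside the mean and SD. The values below are toy numbers; the last one is an invented entry error with a misplaced decimal point.

```r
# Toy BMI values; 290.6 is a made-up data entry error
bmi <- c(27.3, 27.6, 29.6, 29.66, 30.06, 31.07, 290.6)

# Flag values outside 1.5 * IQR from the quartiles
q     <- unname(quantile(bmi, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
flagged <- bmi < lower | bmi > upper

# Values to investigate before any removal or correction
print(bmi[flagged])

# Mean and SD are pulled by the outlier; median and MAD are much less affected
print(c(mean = mean(bmi), sd = sd(bmi), median = median(bmi), mad = mad(bmi)))
```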

3. Standardization of Variables

    Challenge: Clinical datasets often contain non-standardized values (e.g., varying units for lab results).
    
    Actions:
        
        - Standardize units and formats (e.g., mg/dL vs. mmol/L).
        - Ensure consistent coding for categorical variables.

4. Data Duplication

    Challenge: Duplicate records can distort analyses if not handled correctly.
    
    Actions:
    
        - Identify and resolve duplicates through patient IDs, timestamps, and other key identifiers.
        - Decide whether to retain or combine duplicates based on clinical context.
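
The distinction between exact duplicates and legitimate repeated measures can be sketched as follows. The `visits` data frame and the `visit_date` column are hypothetical; keeping the most recent record per patient is only one possible rule, and the right choice depends on the clinical context.

```r
# Toy visit records: p1 has an exact duplicate, p2 has two genuine visits
visits <- data.frame(
  ptnum      = c("p1", "p1", "p2", "p2", "p3"),
  visit_date = as.Date(c("2020-01-05", "2020-01-05", "2020-02-01",
                         "2020-03-01", "2020-01-20")),
  bmi        = c(29.6, 29.6, 27.3, 27.9, 31.0),
  stringsAsFactors = FALSE
)

# Exact duplicates: every column identical to an earlier row
exact_dup <- duplicated(visits)
print(visits[exact_dup, ])
dedup <- visits[!exact_dup, ]

# Repeated patient IDs that remain are repeated measures, not duplicates
print(table(dedup$ptnum))

# One possible rule: keep the most recent record per patient
dedup  <- dedup[order(dedup$ptnum, dedup$visit_date), ]
latest <- dedup[!duplicated(dedup$ptnum, fromLast = TRUE), ]
print(latest)
```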

5. Ethical Imputation and Replacement

    Challenge: Imputing missing clinical values can affect the accuracy of analyses.
    
    Actions:
     
        - Choose methods that respect the clinical context (e.g., avoid imputing unrealistic values for lab tests).
        - Clearly report imputation methods and their limitations.

6. Longitudinal and Time-Dependent Data

    Challenge: Clinical datasets often include repeated measures or time-series data.
    
    Actions:
    
        - Maintain the temporal sequence of data during cleaning.
        - Use appropriate methods for handling missingness in time-dependent variables.
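
As a minimal sketch of both actions, the code below sorts repeated measures by patient and time, then carries the last observation forward within each patient only. The `long` data frame, its columns, and the `locf` helper are hypothetical. LOCF is shown because it is easy to follow; it can bias results, and model-based approaches are often more appropriate.

```r
# Toy repeated measures, deliberately out of order
long <- data.frame(
  ptnum = c("p1", "p1", "p1", "p2", "p2", "p2"),
  day   = c(30, 0, 60, 0, 30, 60),
  sbp   = c(NA, 120, 125, NA, 140, NA),
  stringsAsFactors = FALSE
)

# Restore the temporal sequence within each patient
long <- long[order(long$ptnum, long$day), ]

# Last observation carried forward; leading NAs stay NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent observed value
  c(NA, x[!is.na(x)])[idx + 1]   # idx 0 maps to NA
}

# Apply within patient so values never leak across patients
long$sbp_locf <- ave(long$sbp, long$ptnum, FUN = locf)
print(long)
```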

7. Data Integration

    Challenge: Combining datasets from multiple sources (e.g., EHR systems, clinical trials) may introduce inconsistencies.
    
    Actions:
    
        - Harmonize variable definitions and formats.
        - Use linkage techniques to merge records accurately.
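
A small sketch of harmonizing and linking two sources with base R `merge()`. Both data frames and their column names are invented; real linkage often needs more than an exact key match.

```r
# Two toy sources with different key formats and codings
ehr <- data.frame(
  ptnum = c("p1", "p2", "p3"),
  sex   = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
trial <- data.frame(
  patient_id = c("P2", "p3 ", "p4"),
  gender     = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# Harmonize the key format and the categorical coding
trial$ptnum <- tolower(trimws(trial$patient_id))
trial$sex   <- unname(c(female = "F", male = "M")[trial$gender])

# Full outer join so unmatched records stay visible
merged <- merge(ehr, trial[, c("ptnum", "sex")],
                by = "ptnum", all = TRUE,
                suffixes = c("_ehr", "_trial"))
print(merged)

# Records present in only one source
unmatched <- merged[!complete.cases(merged), ]
print(unmatched)

# Records where the two sources disagree
conflicts <- merged[complete.cases(merged) &
                    merged$sex_ehr != merged$sex_trial, ]
print(conflicts)
```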

# Question 6: Descriptive Statistics and Summary Tables
---

Create a comprehensive summary table that provides insights into the
distribution of key variables in the dataset, stratified by exposure groups. This
exercise will help familiarize you with descriptive statistical techniques that
are essential for cohort studies.

## Step 0: Delete the ptnum column because it is not informative
---

In [277]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,age,gender_female,smoking_status,alcohol_assessment,bmi,chronic_obstructive_bronchitis,pneumonia,asthma,⋯,hawaiian,native,no_cancer,stage4,stage1a,stage2a,stage3a,stage1b,stage2b,stage3b
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,p9972,0,53,1,0,1,29.66,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5934,p9976,0,70,0,1,1,30.06,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5935,p9982,0,70,0,0,1,31.07,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5936,p9992,1,75,0,0,1,29.6,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5937,p9996,0,59,0,0,1,27.3,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5938,p9998,0,46,1,1,1,27.6,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0


In [278]:
Df_regression_unique <- Df_regression_unique %>% select(-ptnum)

## Step 1: Verify changes
---

In [279]:
tail(Df_regression_unique)

Unnamed: 0_level_0,label,age,gender_female,smoking_status,alcohol_assessment,bmi,chronic_obstructive_bronchitis,pneumonia,asthma,dvt,⋯,hawaiian,native,no_cancer,stage4,stage1a,stage2a,stage3a,stage1b,stage2b,stage3b
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,<int>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,0,53,1,0,1,29.66,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5934,0,70,0,1,1,30.06,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5935,0,70,0,0,1,31.07,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5936,1,75,0,0,1,29.6,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5937,0,59,0,0,1,27.3,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5938,0,46,1,1,1,27.6,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0


## Step 2: Create summary table
---

In [280]:
# Define the list of variables to summarize (exclude the stratifying variable, label)
variables <- setdiff(names(Df_regression_unique), "label")

# Create a table stratified by label
summary_table <- CreateTableOne(vars = variables, strata = "label", data = Df_regression_unique)

# Print the table
print(summary_table)

                                            Stratified by label
                                             0                 1               
  n                                             4239              1697         
  age (mean (SD))                              63.04 (7.61)      69.21 (8.96)  
  gender_female (mean (SD))                     0.40 (0.49)       0.10 (0.30)  
  smoking_status (mean (SD))                    0.39 (0.49)       0.49 (0.50)  
  alcohol_assessment (mean (SD))                0.87 (0.34)       0.73 (0.44)  
  bmi (mean (SD))                              28.85 (1.95)      28.57 (1.98)  
  chronic_obstructive_bronchitis (mean (SD))    0.03 (0.17)       0.00 (0.06)  
  pneumonia (mean (SD))                         0.04 (0.20)       0.00 (0.02)  
  asthma (mean (SD))                            0.00 (0.03)       0.00 (0.00)  
  dvt (mean (SD))                               0.01 (0.11)       0.00 (0.00)  
  Followup (mean (SD))                       1829.19 (10
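
In the output above, 0/1 indicators such as `gender_female` are summarized as mean (SD) because they are stored as numeric. A possible refinement, sketched here on simulated data rather than the notebook's dataset, is to pass those columns through the `factorVars` argument of `CreateTableOne()` so they are reported as counts and percentages. The `is_binary` helper is a hypothetical convenience for detecting 0/1 columns.

```r
library(tableone)

# Simulated data that mimics a few of the columns (values are invented)
set.seed(1)
toy <- data.frame(
  label          = rep(c(0, 1), each = 50),
  age            = rnorm(100, mean = 65, sd = 8),
  gender_female  = rbinom(100, 1, 0.4),
  smoking_status = rbinom(100, 1, 0.4)
)

# Detect columns that contain only 0, 1, or NA
is_binary   <- function(x) all(x %in% c(0, 1, NA))
candidates  <- setdiff(names(toy), "label")
binary_vars <- candidates[sapply(toy[candidates], is_binary)]

# Treat the binary columns as categorical in the summary table
toy_table <- CreateTableOne(
  vars       = candidates,
  strata     = "label",
  data       = toy,
  factorVars = binary_vars
)
print(toy_table)
```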

## Step 3: Export the summary table as a CSV
---

In [281]:
# Extract the table as a matrix
summary_matrix <- print(summary_table, quote = FALSE, noSpaces = TRUE)

# Convert the matrix to a data.frame
summary_df <- as.data.frame(summary_matrix)

# Write the data.frame to a CSV file
write.csv(summary_df, "summary_table_exercise_1.csv", row.names = TRUE)

                                            Stratified by label
                                             0                 1               
  n                                          4239              1697            
  age (mean (SD))                            63.04 (7.61)      69.21 (8.96)    
  gender_female (mean (SD))                  0.40 (0.49)       0.10 (0.30)     
  smoking_status (mean (SD))                 0.39 (0.49)       0.49 (0.50)     
  alcohol_assessment (mean (SD))             0.87 (0.34)       0.73 (0.44)     
  bmi (mean (SD))                            28.85 (1.95)      28.57 (1.98)    
  chronic_obstructive_bronchitis (mean (SD)) 0.03 (0.17)       0.00 (0.06)     
  pneumonia (mean (SD))                      0.04 (0.20)       0.00 (0.02)     
  asthma (mean (SD))                         0.00 (0.03)       0.00 (0.00)     
  dvt (mean (SD))                            0.01 (0.11)       0.00 (0.00)     
  Followup (mean (SD))                       1829.19 (10

## Step 4: Include p-values for comparisons between exposure groups
---

In [282]:
print(summary_table, showAllLevels = TRUE, test = TRUE)

                                            Stratified by label
                                             level 0                
  n                                                   4239          
  age (mean (SD))                                    63.04 (7.61)   
  gender_female (mean (SD))                           0.40 (0.49)   
  smoking_status (mean (SD))                          0.39 (0.49)   
  alcohol_assessment (mean (SD))                      0.87 (0.34)   
  bmi (mean (SD))                                    28.85 (1.95)   
  chronic_obstructive_bronchitis (mean (SD))          0.03 (0.17)   
  pneumonia (mean (SD))                               0.04 (0.20)   
  asthma (mean (SD))                                  0.00 (0.03)   
  dvt (mean (SD))                                     0.01 (0.11)   
  Followup (mean (SD))                             1829.19 (1055.41)
  statin (mean (SD))                                  0.18 (0.38)   
  asian (mean (SD))                    
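
The p-values depend on which test is applied to each variable. The print method of `tableone` accepts a `nonnormal` argument to request median [IQR] summaries and a rank-based test for selected variables, and `smd = TRUE` to add standardized mean differences. The sketch below runs on simulated data; treating `Followup` as skewed is an assumption for illustration, not a finding from the notebook's dataset.

```r
library(tableone)

# Simulated data: age roughly normal, Followup right-skewed (values are invented)
set.seed(2)
toy <- data.frame(
  label    = rep(c(0, 1), each = 60),
  age      = c(rnorm(60, mean = 63, sd = 8), rnorm(60, mean = 69, sd = 9)),
  Followup = rexp(120, rate = 1 / 1800)
)

toy_table <- CreateTableOne(vars = c("age", "Followup"),
                            strata = "label", data = toy)

# printToggle = FALSE returns the formatted matrix without printing it
out <- print(toy_table,
             nonnormal   = "Followup",  # median [IQR] and a nonparametric test
             smd         = TRUE,        # add standardized mean differences
             printToggle = FALSE)
print(out, quote = FALSE)
```

With large samples, small differences can produce very small p-values, so standardized mean differences are a useful complement when judging group balance.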

## Step 5: Save the preprocessed data as a .csv file (for Python users) and an .RData file (for R users)
---

In [283]:
write.csv(Df_regression_unique, "cleaned_data.csv", row.names = FALSE)

In [284]:
save(Df_regression_unique, file = "cleaned_data.RData")

## Step 6: Verify by loading the .RData file
---

In [285]:
load("cleaned_data.RData")

In [286]:
ls()

In [287]:
tail(Df_regression_unique)

Unnamed: 0_level_0,label,age,gender_female,smoking_status,alcohol_assessment,bmi,chronic_obstructive_bronchitis,pneumonia,asthma,dvt,⋯,hawaiian,native,no_cancer,stage4,stage1a,stage2a,stage3a,stage1b,stage2b,stage3b
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,<int>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,0,53,1,0,1,29.66,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5934,0,70,0,1,1,30.06,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5935,0,70,0,0,1,31.07,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5936,1,75,0,0,1,29.6,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5937,0,59,0,0,1,27.3,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5938,0,46,1,1,1,27.6,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
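
Because the object is still in the workspace when `load()` is called, inspecting it afterwards does not by itself prove that the file is intact. A stricter check, sketched here on a small stand-in data frame and temporary files (all names are illustrative), loads the file into a separate environment and compares the result with the in-memory object.

```r
# Small stand-in for the cleaned data
original <- data.frame(
  label = c(0, 1, 0),
  age   = c(53, 75, 59),
  bmi   = c(29.66, 29.6, 27.3)
)

# Save to a temporary .RData file and load it into a fresh environment
rdata_path <- file.path(tempdir(), "roundtrip_check.RData")
save(original, file = rdata_path)
check_env    <- new.env()
loaded_names <- load(rdata_path, envir = check_env)

# The .RData round trip should reproduce the object exactly
print(loaded_names)
print(identical(check_env$original, original))

# CSV does not store column types, so compare values with a tolerance instead
csv_path <- file.path(tempdir(), "roundtrip_check.csv")
write.csv(original, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
print(all.equal(from_csv, original))
```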
