# Question 1: Dataset Overview
---

- Begin by loading the dataset and using summary functions to gain an initial understanding of the variables, their types, and the presence of any missing data. Discuss the importance of assessing data structures in the context of statistical modeling.

# Step 0: Install the necessary packages and load them
---

In [110]:
install.packages("dplyr")
install.packages("fastDummies")


The downloaded binary packages are in
	/var/folders/nv/sjj_9gb52674c8ktybqghkm80000gq/T//Rtmp0rDFXY/downloaded_packages

The downloaded binary packages are in
	/var/folders/nv/sjj_9gb52674c8ktybqghkm80000gq/T//Rtmp0rDFXY/downloaded_packages


In [111]:
library(dplyr)
library(fastDummies)

## Step 1: Load the dataset
---

In [112]:
load("Df_regression.RData")

## Step 2: Check the content of the data
---

In [113]:
ls()

## Step 3: Check the class of the data
---

In [114]:
class(Df_regression_unique)
class(dataset)

### Explanation:
---
1. Df_regression_unique is the 'dataframe' while dataset contains the 'character'.
2. It is obvious then that Df_regression_unique contains the data.

## Step 4: Get a snapshot of the data
---
1. Snapshots are important because we can see how our data looks like without looking at the entire dataset.

2. In my practice I usually look at the tail in of the dataset, for me this is ideal because of two things:

    a. By using the tail we can check how many rows the dataset is.

    b. We can also check the datatype of each columns.


In [115]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,C-185086009,C-21908-9,C-233604007,C-195967001,C-132281000119108,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<lgl>,<chr>,<lgl>,<lgl>,<lgl>,<int>,<dbl>
5933,p9972,0,53,f,white,never,True,29.66,,,,,,822,1
5934,p9976,0,70,m,white,former,True,30.06,,,,,,3042,1
5935,p9982,0,70,m,black,never,True,31.07,,,,,,1233,1
5936,p9992,1,75,m,white,never,True,29.6,,,,,,1838,1
5937,p9996,0,59,m,white,never,True,27.3,,,,,,2374,2
5938,p9998,0,46,f,white,former,True,27.6,,,,,,356,1


## Step 5: Counting the number of nulls in the columns.
---
1. In the previous cell we have observed that columns: C-1850860095796, C-21908-95715C-2336040075754, C-1959670015932, and C-1322810001191085880 contains null.

2. Strategy: 

    a. We know that the total number of columns is 5,938, we need to compare how many nulls are in these columns.
    b. If the nulls are greater than 50% that is:
    
    $\text{% null} = \frac{\text{total number of nulls in a column}}{\text{total number of rows}}$
    
    then it is better to drop the columns with more than 50% missing than spending effort on testing if the nulls are missing by random.

In [116]:
colSums(is.na(Df_regression_unique))

## Step 6: Remove columns with more than 50% missing values.
---
Note: Make sure the dplyr library is installed here.

If the column names contain special characters (like -), the dplyr package is a more robust solution. Use backticks ("<name column here>") to reference such column names.

In [117]:
Df_regression_unique <- Df_regression_unique %>% select(-"C-185086009", -"C-21908-9", -"C-233604007", -"C-195967001", -"C-132281000119108")

## Step 7: Check is the columns are properly removed.
---

In [118]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<dbl>,<int>,<dbl>
5933,p9972,0,53,f,white,never,True,29.66,822,1
5934,p9976,0,70,m,white,former,True,30.06,3042,1
5935,p9982,0,70,m,black,never,True,31.07,1233,1
5936,p9992,1,75,m,white,never,True,29.6,1838,1
5937,p9996,0,59,m,white,never,True,27.3,2374,2
5938,p9998,0,46,f,white,former,True,27.6,356,1


## Discussion:
---
Assessing data structures is a crucial step in preparing for statistical modeling. It ensures that the data is correctly understood, processed, and interpreted, ultimately leading to valid and reliable results. Here’s why it’s important:

1. Understanding Data Types
Significance:
Each variable type (e.g., numeric, categorical, logical) dictates the kinds of analyses and transformations that can be performed.
For example, regression models treat numeric and categorical variables differently.
Potential Issues:
A categorical variable coded as numeric may lead to incorrect interpretation (e.g., treating regions as continuous numbers).
Logical variables not converted to binary (0/1) may confuse algorithms.
Solution:
Use functions like str() and summary() to ensure variables are properly typed.


2. Detecting and Addressing Missing Data
Significance:
Missing data can skew model results, reduce statistical power, and introduce biases.
Understanding patterns of missingness (e.g., Missing Completely at Random, Missing Not at Random) informs appropriate imputation methods.
Potential Issues:
Ignoring missing data can result in smaller, unrepresentative datasets if rows with missing values are removed.
Arbitrary imputation can distort relationships in the data.
Solution:
Visualize missing data patterns (naniar in R) and assess whether imputation, removal, or advanced techniques (e.g., multiple imputation) are needed.


3. Identifying Outliers and Inconsistent Values
Significance:
Outliers can have a disproportionate influence on statistical models, especially in regression and clustering.
Inconsistent values (e.g., negative ages, nonsensical categories) can indicate data entry errors.
Potential Issues:
Models may overfit or misinterpret trends due to extreme or incorrect values.
Solution:
Perform exploratory data analysis (EDA) to detect and decide how to handle outliers.


4. Ensuring Data Conforms to Model Requirements
Significance:
Many statistical models have assumptions (e.g., linear regression assumes linearity, normality, homoscedasticity).
Proper formatting (e.g., encoding categorical variables as factors) ensures compatibility with modeling functions.
Potential Issues:
Non-normal data in a model requiring normality can lead to invalid inferences.
Misformatted data might cause errors or incorrect results in statistical software.
Solution:
Check distributions, relationships, and data transformations (e.g., log transformations for skewed data).


5. Reducing Dimensionality and Improving Interpretability
Significance:
High-dimensional data can cause overfitting and make results harder to interpret.
Dimensionality reduction techniques (e.g., PCA, feature selection) require an understanding of variable roles.
Potential Issues:
Irrelevant variables add noise and reduce the efficiency of the model.
Solution:
Assess variable relevance and multicollinearity before modeling.


6. Recognizing Data Relationships
Significance:
Relationships between variables (e.g., collinearity, interactions) influence model complexity and performance.
Potential Issues:
Ignoring relationships can lead to misspecified models or incorrect interpretations.
Solution:
Explore relationships using scatterplots, correlation matrices, or advanced methods like variance inflation factors (VIF).


7. Ensuring Ethical and Accurate Representation
Significance:
Assessing data structures ensures that the dataset represents the population accurately.
Potential Issues:
Imbalanced datasets can bias models (e.g., under-representation of minority groups in healthcare studies).
Solution:
Stratify data, balance classes, and ensure fair sampling.

# Question 2: Handling Missing Data
---
Identify columns with logical (TRUE/FALSE) values and apply methods for addressing missing values in these columns, replacing them with 0 or 1 based on their logical value. Reflect on the implications of missing data and the
strategies used for handling it.

## Step 0: Direct Conversion in Base R
---

Logical to Integer: In R, logical values are internally represented as TRUE = 1 and FALSE = 0. The as.integer() function converts these directly.

In [119]:
Df_regression_unique$"C-763302001" <- as.integer(Df_regression_unique$"C-763302001")

## Step 1: Verify the Changes
----

In [120]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<int>,<dbl>,<int>,<dbl>
5933,p9972,0,53,f,white,never,1,29.66,822,1
5934,p9976,0,70,m,white,former,1,30.06,3042,1
5935,p9982,0,70,m,black,never,1,31.07,1233,1
5936,p9992,1,75,m,white,never,1,29.6,1838,1
5937,p9996,0,59,m,white,never,1,27.3,2374,2
5938,p9998,0,46,f,white,former,1,27.6,356,1


## Step2:  If There Are NA Values
---

If the column contains (NA) values, decide how to handle them. For instance, you might replace them with 0.

In [121]:
Df_regression_unique$"C-763302001"[is.na(Df_regression_unique$"C-763302001")] <- 0
Df_regression_unique$"C-763302001" <- as.integer(Df_regression_unique$"C-763302001")

In [122]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<int>,<dbl>,<int>,<dbl>
5933,p9972,0,53,f,white,never,1,29.66,822,1
5934,p9976,0,70,m,white,former,1,30.06,3042,1
5935,p9982,0,70,m,black,never,1,31.07,1233,1
5936,p9992,1,75,m,white,never,1,29.6,1838,1
5937,p9996,0,59,m,white,never,1,27.3,2374,2
5938,p9998,0,46,f,white,former,1,27.6,356,1


## Discussion
---

Missing data is a common challenge in statistical modeling and data analysis. Its presence can significantly impact the validity, reliability, and interpretability of results. Proper handling of missing data is crucial to ensure robust and unbiased conclusions.

Implications of Missing Data:

1. Reduced Statistical Power
Missing data reduces the sample size available for analysis, weakening the statistical power of tests.
Smaller datasets may lead to inconclusive results or increased Type II errors (failing to detect true effects).

2. Bias in Results
Missing data that is not random (e.g., people with certain traits systematically omit answers) can bias estimates.
For example, in medical studies, patients with severe symptoms may drop out, leading to underestimation of the disease severity.


3. Distorted Relationships
Missing data can change the relationships between variables if handled improperly.
For instance, imprecise imputation might introduce spurious correlations.


4. Loss of Representativeness
If a particular subgroup is more likely to have missing data, the remaining data may no longer represent the entire population.


5. Impact on Model Performance
Machine learning and statistical models often fail or perform poorly with missing data unless explicitly accounted for.


## Strategies for Handling Missing Data
---

1. Prevention

    Careful Data Collection: Design surveys and data collection processes to minimize missing values.
    
    Data Validation: Use real-time checks during data entry to catch missing or incorrect inputs.

<br>
2. Understanding the Missingness Mechanism

    Missing Completely at Random (MCAR): Data is missing with no systematic relationship to the observed or unobserved data. This is the least problematic type.

    Missing at Random (MAR): The likelihood of missingness depends on observed variables but not unobserved ones.

    Missing Not at Random (MNAR): Missingness depends on unobserved variables. This requires specialized techniques or assumptions.

<br>
3. Strategies to Handle Missing Data

    A. Deletion Methods
        Listwise Deletion: Remove rows with any missing values.
        
        Advantages: Simple and retains consistency across analyses.
        
        Disadvantages: Reduces sample size, can bias results if data is not MCAR.
        
        Pairwise Deletion: Use available data for each analysis without discarding entire rows.
        
        Advantages: Retains more data.
        
        Disadvantages: Results may vary across analyses.
        

    B. Imputation Methods
        Mean/Median/Mode Imputation: Replace missing values with the mean (for numeric) or mode (for categorical).
        Advantages: Simple and quick.
        Disadvantages: Reduces variability, can introduce bias.
        Regression Imputation: Use regression models to predict and fill missing values based on other variables.
        Advantages: Accounts for relationships between variables.
        Disadvantages: Assumes the imputation model is correct, underestimates variability.
        Multiple Imputation: Create multiple plausible datasets by imputing missing values with different estimates and combining results.
        Advantages: Accounts for uncertainty and variability in imputations.
        Disadvantages: Computationally intensive, complex to implement.
        Hot Deck Imputation:Replace missing values with observed values from similar cases.
        Advantages: Retains observed data distribution.
        Disadvantages: Depends on finding similar cases.

    C. Model-Based Approaches
        Maximum Likelihood Estimation (MLE): Estimate parameters directly while accounting for missing data.
        Advantages: Efficient and unbiased under MAR.
        Disadvantages: Requires complex algorithms.
        Bayesian Methods: Use prior distributions to estimate missing values.
        Advantages: Flexible and accounts for uncertainty.
        Disadvantages: Requires expertise and computational power.

    D. Advanced Techniques
        K-Nearest Neighbors (KNN): Impute missing values based on the closest neighbors in the dataset.
        Advantages: Considers the data's structure.
        Disadvantages: Computationally expensive for large datasets.
        Machine Learning: Use predictive models like Random Forest to impute missing values.
        Advantages: Handles complex relationships.
        Disadvantages: May overfit and requires validation.

# Question 3: Categorical Variable Conversion
---

Convert relevant variables into categorical (factor) variables as necessary, and consider the role of this transformation in regression analyses. Discuss when and why categorical conversion is crucial in epidemiological research.

## Step 0:
---

Explanation

ifelse Function:

ifelse(condition, value_if_true, value_if_false) checks the condition for each element.

If the value is "f", it assigns 1.

Otherwise (e.g., if the value is "m"), it assigns 0.

---

Replace the Original Column: The column C-263495000 is overwritten with the transformed values.

In [123]:
Df_regression_unique$"C-263495000" <- ifelse(Df_regression_unique$"C-263495000" == "f", 1, 0)

In [124]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>,<dbl>,<int>,<dbl>
5933,p9972,0,53,1,white,never,1,29.66,822,1
5934,p9976,0,70,0,white,former,1,30.06,3042,1
5935,p9982,0,70,0,black,never,1,31.07,1233,1
5936,p9992,1,75,0,white,never,1,29.6,1838,1
5937,p9996,0,59,0,white,never,1,27.3,2374,2
5938,p9998,0,46,1,white,former,1,27.6,356,1


## Step 1: Check the unique values of columns with categorical data
---

In [125]:
unique(Df_regression_unique$"C-72166-2")

In [126]:
unique(Df_regression_unique$"C-103579009")

## Step 2: C-72166-2 change into binary variable
---
Coding legend is:

'Former' = 1

'Never' = 0

In [127]:
Df_regression_unique$"C-72166-2" <- ifelse(Df_regression_unique$"C-72166-2" == "Former", 1, 0)

## Step 3: Verify the change
---

In [128]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-103579009,C-72166-2,C-763302001,C-39156-5,Followup,Exposure
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<int>,<dbl>,<int>,<dbl>
5933,p9972,0,53,1,white,0,1,29.66,822,1
5934,p9976,0,70,0,white,0,1,30.06,3042,1
5935,p9982,0,70,0,black,0,1,31.07,1233,1
5936,p9992,1,75,0,white,0,1,29.6,1838,1
5937,p9996,0,59,0,white,0,1,27.3,2374,2
5938,p9998,0,46,1,white,0,1,27.6,356,1


## Step 4: Create dummy variables for the C-103579009
---

In [129]:
Df_regression_unique <- Df_regression_unique %>%
  mutate(
    asian = ifelse(`C-103579009` == "asian", 1, 0),
    white = ifelse(`C-103579009` == "white", 1, 0),
    black = ifelse(`C-103579009` == "black", 1, 0),
    hawaiian = ifelse(`C-103579009` == "hawaiian", 1, 0),
    native = ifelse(`C-103579009` == "native", 1, 0)
  ) %>%
  select(-"C-103579009")  # Remove the original column if not needed

## Step 5: Verify the changes
---

In [130]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-72166-2,C-763302001,C-39156-5,Followup,Exposure,asian,white,black,hawaiian,native
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,p9972,0,53,1,0,1,29.66,822,1,0,1,0,0,0
5934,p9976,0,70,0,0,1,30.06,3042,1,0,1,0,0,0
5935,p9982,0,70,0,0,1,31.07,1233,1,0,0,1,0,0
5936,p9992,1,75,0,0,1,29.6,1838,1,0,1,0,0,0
5937,p9996,0,59,0,0,1,27.3,2374,2,0,1,0,0,0
5938,p9998,0,46,1,0,1,27.6,356,1,0,1,0,0,0


## Step 6: Remove the "Native"
---

Removing the variable to avoid the dummy variable trap.

In [131]:
Df_regression_unique <- Df_regression_unique %>% select(-native)

## Step 7: Verify changes
---

In [132]:
tail(Df_regression_unique)

Unnamed: 0_level_0,ptnum,label,C-424144002,C-263495000,C-72166-2,C-763302001,C-39156-5,Followup,Exposure,asian,white,black,hawaiian
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5933,p9972,0,53,1,0,1,29.66,822,1,0,1,0,0
5934,p9976,0,70,0,0,1,30.06,3042,1,0,1,0,0
5935,p9982,0,70,0,0,1,31.07,1233,1,0,0,1,0
5936,p9992,1,75,0,0,1,29.6,1838,1,0,1,0,0
5937,p9996,0,59,0,0,1,27.3,2374,2,0,1,0,0
5938,p9998,0,46,1,0,1,27.6,356,1,0,1,0,0


## Step 8: Check the unique values of the Exposure columns
---

In [133]:
unique(Df_regression_unique$"Exposure")

## Step 9: in the Exposure change into binary variable
---
Coding legend is:

1 = 0

2 = 1

In [134]:
Df_regression_unique$"Exposure" <- ifelse(Df_regression_unique$"Exposure" == 2, 1, 0)

## Step 10: Verify changes
---

In [135]:
unique(Df_regression_unique$"Exposure")

## Discussion
---
Epidemiological research often involves variables that naturally fall into discrete categories, such as demographic data (e.g., age groups, gender), exposure groups, or disease classifications. Converting such variables into categorical formats is a critical step in data preparation and analysis, as it ensures that these variables are correctly represented and interpreted in statistical models.

## When Categorical Conversion is Crucial
---

a. When Variables Are Nominal or Ordinal

    Nominal Variables: Variables with distinct categories that have no inherent order (e.g., blood type, ethnicity, or disease type).
    
    Ordinal Variables: Variables with categories that have a meaningful order but no consistent scale (e.g., disease severity: mild, moderate, severe).
    Example: Converting "mild, moderate, severe" into ordered factors to preserve their natural hierarchy.

b. When Applying Statistical Models

    Many statistical models, such as regression, ANOVA, and logistic regression, require categorical variables to be explicitly coded as factors. Without conversion, the software might misinterpret categorical variables as continuous, leading to incorrect model specification.

c. When Handling Group Comparisons

        Epidemiological studies often involve comparing groups (e.g., exposed vs. unexposed, treatment vs. control). Categorical conversion ensures clear group delineation for stratified analysis or interaction testing.

d. When Using Machine Learning or Modeling Algorithms

    Machine learning models often require variables to be encoded as numerical representations (dummy variables or one-hot encoding). Categorical conversion facilitates the creation of these encodings.
    
    
## Why Categorical Conversion Is Crucial
---

a. Preserves the Integrity of Categorical Data

    Without proper conversion, categorical variables might be treated as continuous, leading to meaningless operations (e.g., averaging categorical values like "male" and "female"). This can distort model outputs and make the findings invalid.
    
b. Enables Appropriate Statistical Interpretation

Categorical conversion allows models to calculate meaningful metrics:
Odds ratios in logistic regression.
Relative risks in survival analysis.
Proper conversion helps interpret the role of exposure, disease outcomes, or risk factors accurately.
Facilitates Group-Level Analysis

In epidemiology, researchers often analyze data stratified by categories like age, sex, or socioeconomic status.
Categorical conversion ensures these groupings are well-defined and allows for subgroup analysis, such as identifying differential risks across strata.
Prevents Misrepresentation of Ordinal Data

For ordinal variables, conversion to factors ensures that statistical methods respect the natural ordering of categories.
For example, in a disease severity scale, treating categories as unordered would ignore their hierarchical relationship.
Essential for Interaction Testing

Testing interactions between categorical variables (e.g., age group × treatment type) requires proper encoding to understand joint effects.
Improves Computational Efficiency

Converting categorical variables into factors reduces memory usage and speeds up computations in R and other statistical software.
Ensures Reproducibility

Explicitly converting and documenting categorical variables improves transparency and ensures consistent data handling across analyses.