# HW6 (20')

<font size='4'>
    
Please submit your assignment as an HTML or PDF file.

Print your name (First and Last) below.

Gahyun Lim

<font size='4'>
    
Import the `pandas`, `matplotlib.pyplot`, `numpy`, and `scipy` libraries and assign them their conventional aliases.

In [75]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from scipy import stats  # fn_marginal_continuous below calls stats.shapiro

In [76]:
# Do not delete this code section.
np.random.seed(100)

## 1. Write a function that outputs marginal summary statistics and missing values for a continuous variable (10')

<font size='4'>

**Write the function with `def`** (6')
- Given a vector of a continuous measure, write a function named `fn_marginal_continuous`.
- The function has one parameter, `input_vec`, and outputs a list of summary measures and the number of missing values.
- Check the missingness of `input_vec` using the `np.isnan()` or `pd.isna()` function.
    - In our first case, we will assume `input_vec` is a NumPy array with missing values marked as `np.nan`.
- The list of summary measures can be either **(mean, std)** or **(median, q1, q3)**, depending on the normality assumption.
    - To determine the normality assumption, you can rely on the p-value of the Shapiro-Wilk test.
    - Relevant functions can be found here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
    - If the p-value < 0.05, treat the data as not normally distributed; otherwise, treat it as normally distributed.
    - Think about which measures to report based on the normality assumption.
    - Some relevant functions can be found here: https://numpy.org/doc/2.0/reference/generated/numpy.nanmean.html
- The return statement should include two components: `missing_num` and `output_ls` (your summary measures).
    - The number of missing values should be of type `int` instead of `np.int64`, and summary values should be of type `float` instead of `np.float64`.
    - Round all your summary values so that they have no more than **3** digits after the decimal point.
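
As a rough illustration of the decision rule and the nan-aware helpers mentioned above, here is a minimal sketch on a made-up, right-skewed vector. The names `toy_vec`, `observed`, and `is_normal` are hypothetical and are not part of the assignment; the data are simulated, not taken from the homework.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: an exponential sample is strongly right-skewed.
rng = np.random.default_rng(0)
toy_vec = rng.exponential(scale=1.0, size=200)
toy_vec[0] = np.nan  # inject one missing value

# Count missing entries and cast np.int64 -> int.
n_missing = int(np.isnan(toy_vec).sum())

# Drop NaN before running the Shapiro-Wilk test.
observed = toy_vec[~np.isnan(toy_vec)]
p_value = stats.shapiro(observed).pvalue
is_normal = p_value >= 0.05  # normality is not rejected at the 5% level

# nan-aware reducers skip NaN, so they can take the raw vector directly.
median = float(round(np.nanmedian(toy_vec), 3))
q1, q3 = (float(round(q, 3)) for q in np.nanpercentile(toy_vec, [25, 75]))

print(n_missing, is_normal, [median, q1, q3])
```

Either approach to missing values (dropping NaN first, or using the `np.nan*` reducers) should give the same summaries; the test itself is run on the NaN-free values here.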

In [77]:
def fn_marginal_continuous(input_vec):
    # Count missing values, then drop them before testing and summarizing.
    missing_num = int(np.sum(np.isnan(input_vec)))
    clean_vec = input_vec[~np.isnan(input_vec)]
    # Shapiro-Wilk test: p >= 0.05 means normality is not rejected.
    p_value = stats.shapiro(clean_vec).pvalue
    if p_value >= 0.05:
        # Treated as normal: report (mean, sample std).
        output_ls = [float(round(np.mean(clean_vec), 3)),
                     float(round(np.std(clean_vec, ddof=1), 3))]
    else:
        # Not normal: report (median, q1, q3).
        output_ls = [float(round(np.median(clean_vec), 3)),
                     float(round(np.percentile(clean_vec, 25), 3)),
                     float(round(np.percentile(clean_vec, 75), 3))]
    return missing_num, output_ls

<font size='4'>

**Test your function with the following different arguments:** (4') <br>
For each scenario, please store the results as `missing_num_x` and `output_ls_x` and print them out separately.
1. A standard normal random vector with a sample size of 100, named `input_vec_1`.
2. A chi-squared random vector with 1 degree of freedom and a sample size of 100, named `input_vec_2`.
3. Copy `input_vec_1` into a new array named `input_vec_3` and change its first element to `np.nan`.
    - Note that to create a copy of a NumPy array, use `np.copy()` first.
4. Copy `input_vec_2` into a new array named `input_vec_4` and change its last element to `np.nan`.
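
The reason for `np.copy()` can be seen in a small sketch: plain assignment only creates a second name for the same array, so edits through it also change the original. The arrays below are hypothetical toy values, not the homework vectors.

```python
import numpy as np

original = np.array([1.0, 2.0, 3.0])  # hypothetical toy array

alias = original                 # no copy: both names refer to one array
independent = np.copy(original)  # separate array with the same values

independent[0] = np.nan  # changes only the copy
alias[-1] = 99.0         # also changes `original`

print(original)
print(independent)
```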

In [78]:
np.random.seed(100)
input_vec_1 = np.random.normal(loc=0, scale=1, size=100)
missing_num_1, output_ls_1 = fn_marginal_continuous(input_vec_1)
print(missing_num_1, output_ls_1)

0 [-0.104, 0.975]


In [79]:
input_vec_2 = np.random.chisquare(df=1, size=100)
missing_num_2, output_ls_2 = fn_marginal_continuous(input_vec_2)
print(missing_num_2, output_ls_2)

0 [0.43, 0.089, 1.184]


In [80]:
input_vec_3 = np.copy(input_vec_1)
input_vec_3[0] = np.nan
missing_num_3, output_ls_3 = fn_marginal_continuous(input_vec_3)
print(missing_num_3, output_ls_3)

1 [-0.088, 0.965]


In [81]:
input_vec_4 = np.copy(input_vec_2)
input_vec_4[-1] = np.nan
missing_num_4, output_ls_4 = fn_marginal_continuous(input_vec_4)
print(missing_num_4, output_ls_4)

1 [0.422, 0.088, 1.19]


## 2. Write a function that outputs marginal summary statistics and missing values for a categorical variable (5')

<font size='4'>

**Write the function with `def`**
- Given a column vector, write a function named `fn_marginal_categorical`.
- The function has one parameter, `input_vec`, and outputs a table of summary measures and the number of missing values.
- Check the missingness of `input_vec` using the `np.isnan()` or `pd.isna()` function.
    - In our second case, we will assume `input_vec` is a column from a pandas DataFrame with missing values marked as `np.nan`.
    - Either function can be used to identify missing values and obtain the number of missing entries.
    - Use the `pd.Series.value_counts()` function to obtain the frequency and proportion of `input_vec`, denoted as `tab_count` and `tab_percent`, respectively.
    - Details can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
    - For the proportion, please use a percentage (0-100%). You can omit the % sign when reporting.
- The return statement should include two components: `missing_num` and `output_tab` (your summary measures).
    - For your `output_tab`, combine the count and proportion using `pd.concat()`. Your `output_tab` should have three columns: variable name, count, and proportion.
    - Details can be found here: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
    - The number of missing values should be of type `int` instead of `np.int64`, and percentage values should be of type `float` instead of `np.float64`.
    - Round all your relevant summary values so that they have no more than **2** digits after the decimal point.
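
The steps above can be sketched on a small made-up column. This is only an illustration under assumptions: `toy_col` and `toy_tab` are hypothetical names, and resetting the index to get a three-column table is one possible reading of the requirement, not necessarily the intended one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the PTSD dataset.
toy_col = pd.Series(["a", "b", "a", np.nan, "a"], name="toy_group")

# Count missing entries and cast np.int64 -> int.
n_missing = int(pd.isna(toy_col).sum())

# value_counts() excludes NaN by default (dropna=True).
tab_count = toy_col.value_counts()
tab_percent = (toy_col.value_counts(normalize=True) * 100).round(2)

# Put counts and percentages side by side, then move the category
# labels out of the index so the table has three ordinary columns.
toy_tab = pd.concat([tab_count, tab_percent], axis=1).reset_index()
toy_tab.columns = [toy_col.name, "count", "proportion"]

print(n_missing)
print(toy_tab)
```

Setting the column names explicitly after `reset_index()` avoids relying on the default names, which differ between pandas versions.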

In [82]:
def fn_marginal_categorical(input_vec):
    # Count missing values as a plain int.
    missing_num = int(np.sum(pd.isna(input_vec)))
    # Frequencies and percentages (0-100) of the non-missing categories.
    tab_count = input_vec.value_counts(dropna=True)
    tab_percent = (input_vec.value_counts(normalize=True, dropna=True) * 100).round(2)
    # Combine side by side; the index holds the category labels.
    output_tab = pd.concat([tab_count, tab_percent], axis=1)
    output_tab.columns = ['count', 'proportion']
    return missing_num, output_tab

## 3. Test your written functions with a real dataset. (3')

<font size='4'>
    
**Test the summary function for the continuous measure** (1')
- Load `PTSD dataset.xlsx` and name it `ptsd_df`.
    - The dataset should be in the `data` folder after you sync changes and fetch origin from the GitHub repository.
- Use the column `pcl5month_score.baseline` as the input vector. This is a continuous measure.
- Extract the corresponding column and name it `pcl5month_base`.
- (*Optional*) You may convert the pandas Series to a NumPy array.

In [83]:
ptsd_df = pd.read_excel('data/PTSD dataset.xlsx', sheet_name='main_dataset') 

In [84]:
pcl5month_base = ptsd_df['pcl5month_score.baseline']
missing_num, output_ls = fn_marginal_continuous(pcl5month_base)
print(missing_num, output_ls)

9 [51.0, 42.0, 62.0]


<font size='4'>

**Test the summary function for the categorical measure** (2')

- Use the column `mdd_code` as the input vector. This is a binary vector.
1. Extract the corresponding column and name it `mdd_code_vec`. Output your results as `missing_num_1` and `tab_1`. Print each element out (you will write `print()` twice).
2. Create a copy of `mdd_code_vec` and name it `mdd_code_vec_2`. Change its first element to `NaN`.
    - Rerun `fn_marginal_categorical()` with the new vector. Output your results as `missing_num_2` and `tab_2`. Print each element out (you will write `print()` twice).

In [85]:
mdd_code_vec = ptsd_df['mdd_code']
missing_num_1, tab_1 = fn_marginal_categorical(mdd_code_vec)
print(missing_num_1)
print(tab_1)

0
          count  proportion
mdd_code                   
0           340       70.39
1           143       29.61


In [86]:
mdd_code_vec_2 = mdd_code_vec.copy()
mdd_code_vec_2.iloc[0] = np.nan
missing_num_2, tab_2 = fn_marginal_categorical(mdd_code_vec_2)
print(missing_num_2)
print(tab_2)

1
          count  proportion
mdd_code                   
0.0         339       70.33
1.0         143       29.67


## 4. Lambda functions and for loop (2')

<font size='4'>

- Create a `lambda` function, named `fn_pcl5_ptsd_check`, that checks whether a value of the column `pcl5_score_intake` is greater than 30.
- Using a for loop, create a new list `pcl5_score_intake_ls` that contains `True` if `pcl5_score_intake` > 30 and `False` otherwise.
    - You can also try the `map()` function to apply the function to each element of `pcl5_score_intake`.
    - The syntax is simple: `map(function_name, iterable, ...)`.
- Print out the number of patients with `pcl5_score_intake` over 30 and mark them as **Clinically Significant for PTSD**.
    - Your output can be something like `XXX out of YYY patients (ZZZ%) were marked as clinically significant for PTSD.`
    - Round your percentage to no more than **2** decimal places.
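
For comparison, a minimal sketch of the for-loop and `map()` variants on a few made-up scores; `toy_scores` and `is_over_30` are hypothetical names used only for illustration.

```python
# Hypothetical scores for illustration only.
toy_scores = [12, 30, 31, 58]

is_over_30 = lambda score: score > 30

# Variant 1: explicit for loop.
flags_loop = []
for score in toy_scores:
    flags_loop.append(is_over_30(score))

# Variant 2: map() returns a lazy iterator, so wrap it in list().
flags_map = list(map(is_over_30, toy_scores))

# True counts as 1 when summed.
n_flagged = sum(flags_map)
pct_flagged = round(n_flagged / len(flags_map) * 100, 2)
print(f"{n_flagged} out of {len(flags_map)} scores ({pct_flagged}%) are over 30.")
```

Both variants produce the same list. Note that a comparison such as `float('nan') > 30` evaluates to `False`, so with this approach a missing score is counted as not exceeding the threshold and still contributes to the denominator.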

In [87]:
fn_pcl5_ptsd_check = lambda x: x > 30

pcl5_score_intake_ls = []
for score in ptsd_df['pcl5_score_intake']:
    pcl5_score_intake_ls.append(fn_pcl5_ptsd_check(score))

number = sum(pcl5_score_intake_ls)
total = len(pcl5_score_intake_ls)
percent = round((number / total) * 100, 2)

print(f"{number} out of {total} patients ({percent}%) were marked as clinically significant for PTSD.")

451 out of 483 patients (93.37%) were marked as clinically significant for PTSD.
