# Data

## Data Understanding

* Group: This column gives the study group of each individual. The preview below shows "health person"; the power analysis later in this notebook also refers to the groups "Healthy implants", "Peri-implant mucositis", and "peri-implantitis".
* Name: This column includes the names of the individuals in the data set.
* Generative: This column appears to indicate the gender of each individual (male or female).
* Age: This column includes the ages of the individuals in the data set.
* PD depth: This column may refer to the depth of periodontal pockets, which can be a sign of periodontal disease.
* mSBI score: This column may refer to a score on the modified Sulcular Bleeding Index, which can be used to assess gingival inflammation.
* SBI score: This column may refer to a score on the Sulcular Bleeding Index, which can also be used to assess gingival inflammation.
* MBL count: This column most likely refers to marginal bone loss in millimeters, which can be used to assess the severity of bone loss around teeth or implants.
* PLI score: This column may refer to a score on the Plaque Index, which can be used to assess the amount of plaque on teeth.
* BOP probing: This column appears to indicate whether bleeding on probing was observed, which can be a sign of periodontal disease.
* CAL score: This column may refer to a score on the Clinical Attachment Level, which can be used to assess periodontal disease severity.
* FMBS score: This column may refer to a score on the Full Mouth Bleeding Score, which can be used to assess gingival inflammation.

In [1]:
# import relevant libraries
import pandas as pd
import numpy as np
import statsmodels.stats.power as smp

# load datasets
df = pd.read_csv('clinicData.csv')
# drop columns that contain no data (all values missing)
df.dropna(how='all', axis='columns', inplace=True)

# preview the dataframe
df.head(7)

Unnamed: 0,group,name,generative,age,PD depth,mSBI score,SBI score,MBL count,PLI score,BOP probing,CAL score,FMBS score
0,health person,Zhu Weijun,male,25,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
1,health person,Zhou Yi,male,26,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
2,health person,Liang Mengmeng,female,26,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
3,health person,Xie Mengying,female,30,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
4,health person,Zhang Qianmei,female,24,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
5,health person,Zhao Xueqian,female,29,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1
6,health person,Wang Jingjing,female,32,<1mm,0,0.0,<0.5mm,0,-,<1mm,<1


## Data processing

### FMBS score

In [2]:
# fill missing FMBS scores with the lowest category, '<1'
df['FMBS score'] = df['FMBS score'].fillna('<1')

# create a dictionary mapping the raw values to their corresponding FMBS categories
FMBS_level = {
    '<1' : 'low FMBS scores',
    '1' : 'moderate FMBS scores',
    '2' : 'high FMBS scores',
    '0' : 'low FMBS scores'
}

# apply the mapping to the 'FMBS score' column
df['FMBS score'] = df['FMBS score'].map(FMBS_level)
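
A dictionary-based recode turns every value without a matching key into a missing value, so it can help to list such values before applying the mapping. The following is a minimal sketch in plain Python; the helper name `find_unmapped` and the sample values are hypothetical and not part of the original notebook.

```python
def find_unmapped(values, mapping):
    """Return the distinct non-missing values that have no key in `mapping`."""
    unmapped = []
    for value in values:
        # skip None and NaN (NaN is the only value not equal to itself)
        if value is None or value != value:
            continue
        if value not in mapping and value not in unmapped:
            unmapped.append(value)
    return unmapped

# same categories as the FMBS_level dictionary above
FMBS_level = {
    '<1': 'low FMBS scores',
    '0': 'low FMBS scores',
    '1': 'moderate FMBS scores',
    '2': 'high FMBS scores',
}

# hypothetical raw values; '3' has no entry in the dictionary
raw_scores = ['<1', '1', '2', '0', '3', float('nan'), '3']
unmapped = find_unmapped(raw_scores, FMBS_level)
print(unmapped)
```

In the notebook, the same check could be run on the raw column before the mapping step, for example `find_unmapped(df['FMBS score'], FMBS_level)`; the same idea applies to any other column recoded with a dictionary.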


### PD depth

In [3]:
# define a function to bin the PD depth values into categories
def bin_PD_depth(depth):
    if depth == '<1mm':
        return 'shallow'
    elif depth in ['1.0mm', '1.2mm', '1.3mm', '1.4mm', '1.5mm']:
        return 'moderate'
    else:
        # every other value, including missing or unlisted ones, is labeled 'deep'
        return 'deep'

# apply the bin_PD_depth function to the 'PD depth' column
df['PD depth'] = df['PD depth'].apply(bin_PD_depth)
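
The function above matches exact strings, so a value such as '1.1mm' that is not in the list would be labeled 'deep'. A possible alternative, sketched below, parses the number first and bins on thresholds. The helper names and the 1.5 mm cut-off (taken from the largest value in the 'moderate' list) are assumptions for illustration, not the notebook's method.

```python
import re

def parse_mm(text):
    """Extract the first number from strings such as '<1mm' or '1.3mm'; None if absent."""
    match = re.search(r'\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else None

def bin_pd_depth_numeric(text, moderate_max=1.5):
    """Bin a raw PD depth string using numeric thresholds (assumed cut-offs)."""
    value = parse_mm(text)
    if value is None:
        return None  # leave missing or unparseable entries unbinned
    strictly_below = str(text).strip().startswith('<')
    if value < 1 or (strictly_below and value <= 1):
        return 'shallow'
    if value <= moderate_max:
        return 'moderate'
    return 'deep'

# hypothetical raw values, including one ('1.1mm') missing from the original list
for raw in ['<1mm', '1.0mm', '1.1mm', '1.5mm', '3mm']:
    print(raw, '->', bin_pd_depth_numeric(raw))
```

Returning `None` for unparseable entries keeps missing data visible instead of silently counting it as 'deep'.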

### MBL count

In [4]:
# create a dictionary mapping the values to their corresponding risk levels
risk_levels = {
    '<0.5mm': 'Low Risk',
    '0.7mm': 'Low Risk',
    '0.9mm': 'Low Risk',
    '1.0mm': 'Low Risk',
    '1.1mm': 'Low Risk',
    '1.2mm': 'Moderate Risk',
    '1.3mm': 'Moderate Risk',
    '1.6mm': 'Moderate Risk',
    '1.7mm': 'Moderate Risk',
    '1.8mm': 'Moderate Risk',
    '1.9mm': 'Moderate Risk',
    '2.2mm': 'Moderate Risk',
    '4.5mm': 'Very High Risk',
    '4~5mm': 'Very High Risk',
    '>4mm': 'Very High Risk',
    '5mm': 'Very High Risk'
}
# apply the mapping to the 'MBL count' column
df['MBL count'] = df['MBL count'].map(risk_levels)


### CAL score

In [5]:
# replace the range "1-2mm" with its midpoint; keep it a string so the
# .str accessor below does not turn it into a missing value
df.loc[df['CAL score'] == '1-2mm', 'CAL score'] = '1.5mm'

# strip the '<', '>' and 'mm' characters, then convert the column to numeric
df['CAL score'] = pd.to_numeric(df['CAL score'].str.strip('<>mm'))

# assign a category to each value; pd.cut intervals are right-inclusive, so the bins are
# (-inf, 1] no_cal, (1, 2] mid_cal, (2, 6] moderate_cal, (6, inf) severe_cal
df['CAL score'] = pd.cut(df['CAL score'], bins=[-float('inf'), 1, 2, 6, float('inf')],
                         labels=['no_cal', 'mid_cal', 'moderate_cal', 'severe_cal'])


### BOP probing

In [6]:
# recode the 'BOP probing' column into two labels
replacements = {'-': 'healthy', '+': 'not_healthy'}
df['BOP probing'] = df['BOP probing'].replace(replacements)


In [7]:
# preview the cleaned data
df.head(7)

Unnamed: 0,group,name,generative,age,PD depth,mSBI score,SBI score,MBL count,PLI score,BOP probing,CAL score,FMBS score
0,health person,Zhu Weijun,male,25,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
1,health person,Zhou Yi,male,26,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
2,health person,Liang Mengmeng,female,26,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
3,health person,Xie Mengying,female,30,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
4,health person,Zhang Qianmei,female,24,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
5,health person,Zhao Xueqian,female,29,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores
6,health person,Wang Jingjing,female,32,shallow,0,0.0,Low Risk,0,healthy,no_cal,low FMBS scores


## Sample size and Power

### Steps

1. Determine the effect size you expect to observe in your study. The effect size is the magnitude of the difference or relationship you expect to observe between groups or variables. The effect size is typically measured as the standardized mean difference (e.g., Cohen's d) for continuous outcomes or as the odds ratio or relative risk for binary outcomes. Without additional information about your study, it is difficult to determine the effect size.


2. Choose the level of significance (alpha) you want to use in your study. The level of significance is the probability of rejecting the null hypothesis when it is actually true (i.e., a type I error). A common choice for alpha is 0.05.


3. Choose the desired power (1-beta) for your study. The power is the probability of rejecting the null hypothesis when it is actually false, that is, of avoiding a type II error (beta is the probability of a type II error). A common choice for power is 0.80.


4. Determine the allocation ratio for your study. The allocation ratio is the ratio of the numbers of participants assigned to the groups being compared. For your study, the group sizes are as follows: Healthy implants (16), Peri-implant mucositis (15), peri-implantitis (15), and health person (14), so the allocation ratio for any pairwise comparison is close to 1:1.


5. Use a sample size calculator or statistical software to calculate the necessary sample size for your study based on the inputs above.
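
The steps above can be sketched with the usual normal-approximation formula for comparing two group means, n1 = (1 + 1/ratio) * ((z_(1-alpha/2) + z_power) / d)^2, where d is the standardized effect size and ratio = n2/n1. The function name `n_group1` and the example effect size of 0.5 are illustrative assumptions; the formula ignores the very small chance of rejecting in the wrong direction, so dedicated software may return slightly different numbers.

```python
import math
from statistics import NormalDist

def n_group1(effect_size, alpha=0.05, power=0.80, ratio=1.0):
    """Approximate size of group 1 for a two-sided, two-sample z-test.

    `ratio` is n2 / n1, so group 2 needs ratio * n1 participants.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return (1 + 1 / ratio) * ((z_alpha + z_power) / effect_size) ** 2

# hypothetical input: a medium effect (d = 0.5) with equal group sizes
needed = math.ceil(n_group1(0.5))
print(needed)  # 63
```

Because the required sample size scales with 1/d^2, halving the expected effect size roughly quadruples the number of participants needed.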

**Research question**
>What is the effect of peri-implantitis on clinical and microbiological parameters compared to healthy implants and peri-implant mucositis?

The **expected difference** between the groups could be the following:

* **Peri-implantitis group vs. Healthy implants group:** It is expected that the peri-implantitis group will have a higher mean value for PD depth, mSBI score, BOP probing, and CAL score compared to the healthy implants group.
* **Peri-implantitis group vs. Peri-implant mucositis group:** It is expected that the peri-implantitis group will have a higher mean value for PD depth, mSBI score, BOP probing, and CAL score compared to the peri-implant mucositis group.
* **Peri-implant mucositis group vs. Healthy implants group:** It is expected that the peri-implant mucositis group will have a higher mean value for PD depth, mSBI score, and BOP probing compared to the healthy implants group, but a lower mean value for CAL score.

### alpha and power values

In [8]:
# set alpha and power values
alpha = 0.05
power = 0.80

### sample sizes and expected difference

In [9]:
# define the sample sizes for each group
n1 = 16
n2 = 15

# define the expected raw difference between the group means
# (in years here, because the effect size below is computed from the age variable)
d = 5

### subset groups

In [10]:
# subset the data for the two groups we want to compare
group1 = df[df['group'] == 'Healthy implants']
group2 = df[df['group'] == 'peri-implantitis']


### mean and sd for the two groups

In [11]:
# mean and standard deviation of the age variable for each group
mean1, mean2 = group1['age'].mean(), group2['age'].mean()
std1, std2 = group1['age'].std(), group2['age'].std()


### effect size

In [12]:
# calculate the effect size
effect_size = d / np.sqrt(((n1-1)*std1**2 + (n2-1)*std2**2) / (n1+n2-2))

# print effect size
effect_size

0.4111250067235885
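
The effect size above is the assumed raw difference divided by the pooled standard deviation of the two groups (a Cohen's d style standardization). A minimal sketch of the same calculation in plain Python follows; the function names and the two small samples are hypothetical and only illustrate the arithmetic.

```python
import math
from statistics import stdev

def pooled_sd(sample1, sample2):
    """Pooled standard deviation of two samples, weighted by degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = stdev(sample1), stdev(sample2)
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def standardized_effect(raw_difference, sample1, sample2):
    """Raw difference between means expressed in pooled-SD units."""
    return raw_difference / pooled_sd(sample1, sample2)

# hypothetical ages; each sample has a standard deviation of 10
ages_a = [40, 50, 60]
ages_b = [45, 55, 65]
example_effect = standardized_effect(5, ages_a, ages_b)
print(example_effect)
```

Note that the notebook standardizes an assumed difference (`d = 5`) rather than the observed difference `mean2 - mean1`, so the result describes a planning scenario, not the effect seen in the data.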

### Using the z-test power function

In [13]:
# z-test power functions (normal approximation, intended for large samples)
# power achieved with the current group sizes
power_ztest = smp.zt_ind_solve_power(effect_size=effect_size, nobs1=n1, alpha=alpha,
                                     ratio=n2/n1, alternative='two-sided')
# size of group 1 needed to reach the target power (group 2 needs ratio * nobs1)
nobs_ztest = smp.zt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power,
                                    ratio=n2/n1)

# display
print(f"Z_test power: {power_ztest}")
print()
print(f"Z_test nobs: {nobs_ztest}")

Z_test power: 0.20819472570484907

Z_test nobs: 95.96862265361582
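
As a cross-check, the power figure above can be approximated by hand with the normal distribution from the standard library. This is a sketch of the textbook two-sided z-test power formula, not a copy of the statsmodels implementation; the function name `ztest_power` is an assumption.

```python
import math
from statistics import NormalDist

def ztest_power(effect_size, n1, n2, alpha=0.05):
    """Two-sided power of a two-sample z-test under the normal approximation."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    # effect size scaled by the effective sample size of the two groups
    shift = effect_size * math.sqrt(1 / (1 / n1 + 1 / n2))
    # probability of landing beyond either critical value
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

# effect size and group sizes taken from the cells above
power_check = ztest_power(0.4111250067235885, 16, 15)
print(round(power_check, 3))
```

Read together, the two outputs indicate that with 16 and 15 participants the comparison has roughly a 21% chance of detecting the assumed effect, well below the 0.80 target; reaching that target would take about 96 participants in the first group and, at the same ratio, about 90 in the second.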
