### Sensitivity, Specificity and Evaluation Metrics

Sensitivity and specificity are commonly used evaluation metrics in medical diagnosis and classification problems.

- Sensitivity is the proportion of actual positives that are correctly identified as positive by a model or test. It measures the ability of the model to correctly identify individuals with the condition or disease of interest, and is calculated as TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

- Specificity is the proportion of actual negatives that are correctly identified as negative by a model or test. It measures the ability of the model to correctly identify individuals without the condition or disease of interest, and is calculated as TN / (TN + FP), where TN is the number of true negatives and FP is the number of false positives.

Evaluation metrics are used to assess the performance of a model or test, and to compare different models or tests. In addition to sensitivity and specificity, other commonly used evaluation metrics include:

- Accuracy: the proportion of correctly classified instances, calculated as (TP + TN) / (TP + TN + FP + FN)

- Precision: the proportion of true positives among all positive predictions, calculated as TP / (TP + FP)

- Recall: the proportion of true positives among all actual positives, calculated as TP / (TP + FN); this is the same quantity as sensitivity

- F1 score: the harmonic mean of precision and recall, calculated as 2 * (precision * recall) / (precision + recall)

- ROC curve: a plot of sensitivity against 1 - specificity for different threshold values

The choice of evaluation metric depends on the specific problem and goals of the analysis. For example, in a medical diagnosis problem, high sensitivity may be more important than high specificity if the consequences of a false negative are severe, while high specificity may be more important if the consequences of a false positive are severe.
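As a minimal sketch, the formulas above can be collected into one helper. The function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


# Illustrative counts only (not taken from any real study)
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The helper assumes every denominator is non-zero; a production version would need to handle, for example, a model that makes no positive predictions.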

### Accuracy in Terms of Conditional Probability

In terms of probability, accuracy is the overall probability that a model or test correctly predicts the class or category of an instance. It can be written as a weighted average of the conditional probabilities of a correct prediction given each true class. In terms of counts, accuracy is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

TP, TN, FP, and FN are counts, but dividing each by the number of instances in its true class gives a conditional probability:

- True positive rate (sensitivity), TP / (TP + FN): the probability that a positive instance is correctly identified as positive, i.e., P(predicted positive | actual positive)
- True negative rate (specificity), TN / (TN + FP): the probability that a negative instance is correctly identified as negative, i.e., P(predicted negative | actual negative)
- False positive rate, FP / (FP + TN): the probability that a negative instance is incorrectly identified as positive, i.e., P(predicted positive | actual negative)
- False negative rate, FN / (FN + TP): the probability that a positive instance is incorrectly identified as negative, i.e., P(predicted negative | actual positive)

Then, accuracy can be written as:

Accuracy = P(predicted positive | actual positive) * P(actual positive) + P(predicted negative | actual negative) * P(actual negative)

where P(actual positive) is the prior probability of a positive instance, and P(actual negative) is the prior probability of a negative instance. In a balanced dataset, where the number of positive and negative instances is equal, these prior probabilities are both 0.5.

In practice, accuracy is a widely used evaluation metric because it is simple to calculate and easy to interpret. However, it may not be the best metric in all situations, especially when the classes are imbalanced or the costs of false positives and false negatives are different. In such cases, alternative metrics like precision, recall, and F1 score may be more appropriate.
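The decomposition of accuracy into sensitivity, specificity, and the class priors can be checked numerically. The sketch below uses a hypothetical helper, `accuracy_from_rates`, and made-up rates; the second call illustrates why accuracy can mislead on imbalanced data.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy as a prevalence-weighted average of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# A test with 90% sensitivity and 80% specificity at 10% prevalence
balanced_view = accuracy_from_rates(sensitivity=0.9, specificity=0.8, prevalence=0.1)

# A model that always predicts "negative" when only 1% of cases are positive:
# it finds no positives (sensitivity 0) yet looks excellent by accuracy alone.
always_negative = accuracy_from_rates(sensitivity=0.0, specificity=1.0, prevalence=0.01)

print(f"accuracy of the 90%/80% test:       {balanced_view:.2f}")
print(f"accuracy of always-negative model:  {always_negative:.2f}")
```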

### Sensitivity, Specificity and Prevalence

Sensitivity, specificity, and prevalence are all related concepts that are important in medical diagnosis and classification problems.

- Sensitivity is the proportion of actual positives that are correctly identified as positive by a model or test. It measures the ability of the model to correctly identify individuals with the condition or disease of interest, and is calculated as TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

- Specificity is the proportion of actual negatives that are correctly identified as negative by a model or test. It measures the ability of the model to correctly identify individuals without the condition or disease of interest, and is calculated as TN / (TN + FP), where TN is the number of true negatives and FP is the number of false positives.

- Prevalence is the proportion of individuals in the population who have the condition or disease of interest. It represents the prior probability of the condition or disease, and is calculated as the number of individuals with the condition or disease divided by the total population.

The relationship between sensitivity, specificity, and prevalence can be understood using Bayes' theorem, which relates the conditional probabilities of an event A given an event B and the probability of B:

P(A | B) = P(B | A) * P(A) / P(B)

In the context of medical diagnosis and classification, we can use this theorem to calculate the positive predictive value (PPV) and negative predictive value (NPV) of a model or test. The PPV is the probability that an individual who tests positive actually has the condition or disease of interest, and the NPV is the probability that an individual who tests negative actually does not have it. The PPV and NPV depend on the sensitivity and specificity of the model or test and on the prevalence of the condition in the population being tested. The sensitivity and specificity in turn depend on the probability threshold used to make the classification decision.

The PPV and NPV can be calculated as follows:

PPV = TP / (TP + FP) = P(A | B) = P(B | A) * P(A) / P(B)

NPV = TN / (TN + FN) = P(¬A | ¬B) = P(¬B | ¬A) * P(¬A) / P(¬B)

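where A denotes the presence of the condition or disease, B denotes a positive test result, ¬A denotes the absence of the condition or disease, and ¬B denotes a negative test result. Under these definitions, P(B | A) is the sensitivity, P(¬B | ¬A) is the specificity, and P(A) is the prevalence.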

Thus, the sensitivity and specificity of a model or test alone do not provide a complete picture of its diagnostic accuracy. The PPV and NPV depend not only on the sensitivity and specificity, but also on the prevalence of the condition or disease in the population being tested. A test with high sensitivity and specificity may have a low PPV if the prevalence of the condition or disease is low, and vice versa. Therefore, it is important to consider the prevalence of the condition or disease when interpreting the results of a diagnostic test.
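One step that makes the dependence on prevalence explicit is to expand the denominators with the law of total probability. The derivation below is a sketch using A for "has the condition" and B for "tests positive"; the abbreviations sens, spec, and prev are shorthand introduced here for sensitivity, specificity, and prevalence.

```latex
\begin{aligned}
P(B) &= P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
      = \text{sens}\cdot\text{prev} + (1-\text{spec})(1-\text{prev}) \\[4pt]
\text{PPV} &= P(A \mid B)
      = \frac{\text{sens}\cdot\text{prev}}
             {\text{sens}\cdot\text{prev} + (1-\text{spec})(1-\text{prev})} \\[4pt]
P(\neg B) &= P(\neg B \mid \neg A)\,P(\neg A) + P(\neg B \mid A)\,P(A)
      = \text{spec}\,(1-\text{prev}) + (1-\text{sens})\,\text{prev} \\[4pt]
\text{NPV} &= P(\neg A \mid \neg B)
      = \frac{\text{spec}\,(1-\text{prev})}
             {\text{spec}\,(1-\text{prev}) + (1-\text{sens})\,\text{prev}}
\end{aligned}
```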

### PPV and NPV

PPV and NPV are measures of the diagnostic accuracy of a model or test. They give the probability that an individual who tests positive actually has the condition or disease of interest, and the probability that an individual who tests negative actually does not have it.

Positive predictive value (PPV) is the proportion of individuals who test positive for a condition or disease and actually have that condition or disease. Mathematically, PPV is calculated as:

PPV = true positives / (true positives + false positives)

where true positives are the number of individuals who have the condition or disease and test positive, and false positives are the number of individuals who do not have the condition or disease but test positive.

Negative predictive value (NPV) is the proportion of individuals who test negative for a condition or disease and actually do not have that condition or disease. Mathematically, NPV is calculated as:

NPV = true negatives / (true negatives + false negatives)

where true negatives are the number of individuals who do not have the condition or disease and test negative, and false negatives are the number of individuals who have the condition or disease but test negative.

Both PPV and NPV depend not only on the sensitivity and specificity of the model or test, but also on the prevalence of the condition or disease in the population being tested. In general, the prevalence of the condition or disease affects the PPV and NPV in the following ways:

- As the prevalence of the condition or disease increases, the PPV of the test also increases, while the NPV decreases.

- As the prevalence of the condition or disease decreases, the NPV of the test increases, while the PPV decreases.

Therefore, it is important to take into account the prevalence of the condition or disease when interpreting the results of a diagnostic test. A high PPV indicates that a positive test result is likely to indicate the presence of the condition or disease, while a high NPV indicates that a negative test result is likely to indicate the absence of the condition or disease.
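The effect of prevalence can be seen by holding sensitivity and specificity fixed and sweeping the prevalence. This is a minimal sketch: the helper names `ppv` and `npv`, the 90%/80% test characteristics, and the prevalence values are all illustrative assumptions.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value via Bayes' theorem."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


prevalences = [0.01, 0.10, 0.30, 0.50]
ppv_values = [ppv(0.9, 0.8, p) for p in prevalences]
npv_values = [npv(0.9, 0.8, p) for p in prevalences]

# Same test, different populations: PPV rises and NPV falls as prevalence grows
for p, pos_value, neg_value in zip(prevalences, ppv_values, npv_values):
    print(f"prevalence={p:.2f}  PPV={pos_value:.3f}  NPV={neg_value:.3f}")
```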

### Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification model or test by comparing the predicted labels to the true labels of a set of data. For binary classification, the table contains four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

The matrix is organized as follows:

![image.png](attachment:image.png)

The four counts in the confusion matrix can be defined as follows:

- True positives (TP) are the number of cases in which the model or test correctly predicts a positive label when the true label is positive.

- False positives (FP) are the number of cases in which the model or test predicts a positive label when the true label is negative.

- True negatives (TN) are the number of cases in which the model or test correctly predicts a negative label when the true label is negative.

- False negatives (FN) are the number of cases in which the model or test predicts a negative label when the true label is positive.

From the confusion matrix, several performance metrics can be calculated to evaluate the classification model or test, such as:

- Accuracy: The proportion of correct predictions, calculated as (TP+TN)/(TP+TN+FP+FN).

- Precision: The proportion of true positives among all positive predictions, calculated as TP/(TP+FP).

- Recall (also known as sensitivity or true positive rate): The proportion of true positives among all actual positive cases, calculated as TP/(TP+FN).

- Specificity: The proportion of true negatives among all actual negative cases, calculated as TN/(TN+FP).

- F1 score: The harmonic mean of precision and recall, calculated as 2*precision*recall/(precision+recall).

The choice of which metric to use depends on the specific goals and requirements of the classification task. For example, in medical diagnosis, recall (sensitivity) may be a more important metric than precision, as it is more important to correctly identify all positive cases even if it means having some false positives. On the other hand, in spam email detection, precision may be more important than recall, as it is more important to avoid false positives even if it means missing some true positive cases.
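To make the four cells concrete, the sketch below tallies them from paired lists of true and predicted labels. The function name `confusion_counts` and the toy labels are hypothetical, and labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 1 and 0."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:  # predicted 0, actual 1
            fn += 1
    return tp, fp, tn, fn


# Toy labels, invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"recall={tp / (tp + fn):.2f} precision={tp / (tp + fp):.2f}")
```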

### Calculating PPV in Terms of Sensitivity, Specificity and Prevalence

PPV (Positive Predictive Value) is the proportion of individuals who test positive for a condition or disease and actually have that condition or disease. PPV can be calculated in terms of sensitivity, specificity, and prevalence using the following formula:

PPV = (sensitivity × prevalence) / [(sensitivity × prevalence) + ((1 - specificity) × (1 - prevalence))]

where sensitivity is the true positive rate (TPR), specificity is the true negative rate (TNR), and prevalence is the proportion of individuals in the population who have the condition or disease.

To understand how the formula works, consider a hypothetical example where a diagnostic test has a sensitivity of 90%, a specificity of 80%, and a prevalence of 10% in a population of 1,000 individuals.

- TP (true positives) = 90% of 100 (positive cases) = 90
- FP (false positives) = 20% of 900 (negative cases) = 180
- TN (true negatives) = 80% of 900 (negative cases) = 720
- FN (false negatives) = 10% of 100 (positive cases) = 10

Using these values, we can calculate the PPV as:

PPV = (sensitivity × prevalence) / [(sensitivity × prevalence) + ((1 - specificity) × (1 - prevalence))]

PPV = (0.9 × 0.1) / [(0.9 × 0.1) + ((1 - 0.8) × (1 - 0.1))] = 0.09 / 0.27 ≈ 0.33

This means that out of all individuals who test positive for the condition, only about 33% actually have it. The false positive rate is only 20%, but because the condition is rare, the 180 false positives outnumber the 90 true positives. A positive result from this test alone is therefore weak evidence of the condition in this population.
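The worked example can be checked both ways: once from the counts and once from the formula. This is a small sketch using the numbers given above; the variable names are chosen here for readability.

```python
population = 1000
prevalence = 0.10
sensitivity = 0.90
specificity = 0.80

# Counts implied by the rates
positives = population * prevalence   # individuals with the condition
negatives = population - positives    # individuals without the condition
tp = sensitivity * positives
fn = positives - tp
tn = specificity * negatives
fp = negatives - tn

# PPV from counts and from the sensitivity/specificity/prevalence formula
ppv_from_counts = tp / (tp + fp)
ppv_from_formula = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)

print(f"TP={tp:.0f} FP={fp:.0f} TN={tn:.0f} FN={fn:.0f}")
print(f"PPV from counts:  {ppv_from_counts:.3f}")
print(f"PPV from formula: {ppv_from_formula:.3f}")
```

Both routes agree because dividing every count by the population size turns TP into sensitivity × prevalence and FP into (1 - specificity) × (1 - prevalence).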

### ROC Curve and Threshold

A receiver operating characteristic (ROC) curve is a graphical representation of the performance of a binary classification model across a range of decision thresholds. It plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold for predicting the positive class is varied.

The TPR is also known as sensitivity, and is calculated as TP / (TP + FN), where TP is the number of true positive predictions and FN is the number of false negative predictions. The TPR represents the proportion of actual positive cases that are correctly identified as positive by the model.

The FPR is calculated as FP / (FP + TN), where FP is the number of false positive predictions and TN is the number of true negative predictions. The FPR represents the proportion of actual negative cases that are incorrectly identified as positive by the model.

To plot an ROC curve, the model is used to predict the class probabilities for a set of test examples, and the TPR and FPR are calculated for different decision thresholds by varying the probability threshold for predicting the positive class. The TPR and FPR are then plotted against each other on a graph, with the x-axis representing the FPR and the y-axis representing the TPR.

The ROC curve can help to visualize the trade-off between sensitivity and specificity for different decision thresholds. A perfect classifier would have an ROC curve that passes through the point (0,1), indicating 100% sensitivity and 0% FPR. A random classifier, on the other hand, would have an ROC curve that is a diagonal line from (0,0) to (1,1), indicating that it performs no better than chance.

The threshold for making a prediction can be adjusted to optimize the classification performance for a particular task. The optimal threshold depends on the relative importance of sensitivity and specificity for the task, and can be chosen based on the receiver operating characteristic curve. In general, a higher threshold will result in higher specificity and lower sensitivity, while a lower threshold will result in higher sensitivity and lower specificity.
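The procedure for tracing an ROC curve can be sketched in a few lines. The function `roc_points` and the toy scores are hypothetical; the sketch assumes labels encoded as 1 and 0, at least one example of each class, and the rule "predict positive when the score is at or above the threshold".

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy labels and predicted probabilities, invented for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each lowering of the threshold can only add positive predictions, so the points move up and to the right from (0, 0) to (1, 1).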

### Varying the Threshold

![image.png](attachment:image.png)

Now notice we can compute the sensitivity and specificity of the model. Here, the denominator for sensitivity is the total number of disease examples, which we can count as the total number of red circles, which is seven. The numerator is how many of those are classified positive, or in other words, lie on the right side of the threshold. This is all of them except one, which is six. So our sensitivity is six over seven, or about 0.86. Similarly, the denominator for specificity is the total number of normal examples, which is the total number of blue circles here, which is eight. The numerator is how many of those are classified negative, or in other words, lie on the left side of the threshold. This is all except two, so this is six. So our specificity is six over eight, or 0.75.

![image.png](attachment:image.png)

Let's say we now raise the threshold. We then classify fewer examples as positive and more examples as negative. We can now recompute the sensitivity and specificity. Note that the sensitivity has gone down because its numerator has fallen, and the specificity has gone up because its numerator has increased: we are now correctly classifying more normal patients and incorrectly classifying more disease patients.

![image.png](attachment:image.png)

We can take this to the extreme and set the threshold to be at one. In this case, sensitivity is going to be zero since no examples are classified positive, and specificity is going to be one, since all the examples are classified as negative.
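The walkthrough above can be reproduced in code. The scores below are invented so that a threshold of 0.5 gives the same counts as the first figure (seven disease examples, eight normal examples); the actual scores and threshold positions in the figures are not known, so the 0.75 threshold is only an illustration of "a higher threshold".

```python
# Hypothetical model outputs, made up to mirror the counts in the walkthrough
disease_scores = [0.35, 0.55, 0.62, 0.70, 0.78, 0.85, 0.93]        # 7 disease examples
normal_scores = [0.05, 0.12, 0.20, 0.28, 0.36, 0.44, 0.58, 0.66]   # 8 normal examples


def sensitivity_specificity(disease_scores, normal_scores, threshold):
    """Classify as positive when score >= threshold, then compute both rates."""
    true_positives = sum(score >= threshold for score in disease_scores)
    true_negatives = sum(score < threshold for score in normal_scores)
    sensitivity = true_positives / len(disease_scores)
    specificity = true_negatives / len(normal_scores)
    return sensitivity, specificity


for threshold in (0.5, 0.75, 1.0):
    sens, spec = sensitivity_specificity(disease_scores, normal_scores, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```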

### Sampling from the Total Population

Sampling from the total population refers to the process of selecting a subset of individuals from a larger group or population for the purpose of statistical analysis or inference. Sampling is used when it is not feasible or practical to measure or study the entire population, and instead, a smaller subset is selected to represent the larger population.

There are several different sampling methods that can be used to select a sample from the total population, including:

- Simple random sampling: This involves selecting individuals from the population at random, with each individual having an equal chance of being selected. This method is unbiased, but may not always be feasible or practical.

- Stratified sampling: This involves dividing the population into subgroups or strata based on certain characteristics, and then selecting a random sample from each stratum. This method can improve the representativeness of the sample by ensuring that each stratum is adequately represented.

- Cluster sampling: This involves selecting whole groups or clusters of individuals from the population, rather than selecting individuals one at a time. This method is often used when it is difficult or impractical to obtain a complete list of individuals in the population.

- Systematic sampling: This involves selecting individuals from the population at regular intervals, such as every 10th individual on a list. This method can be more efficient than simple random sampling, but may introduce bias if the list has a periodic pattern that coincides with the sampling interval.

Regardless of the sampling method used, it is important to ensure that the sample is representative of the larger population in order to make valid inferences or conclusions. The sample size and sampling method will depend on various factors, such as the research question, the characteristics of the population, and the available resources.

Let's say there were 50,000 patients in a hospital and we wanted to find out the accuracy of our chest x-ray model on everyone who gets a chest x-ray in the hospital. If we were able to run the model and get the ground truth for all patients, we would be able to get the performance of the model on the whole population. For example, say we're looking at accuracy, but this could be any other metric. We find that the accuracy of the model across all 50,000 patients is 0.78. This is called the population accuracy, denoted here by a lowercase p.

In reality, we can't test the model on the whole population because it's simply infeasible to do so. Therefore, the population accuracy p is unknown. The question is, can we get a sense of how well the model will perform on this population by using a small sample of patients? Let's say we sample a hundred patients from the hospital. Now we find that the model gets an accuracy of 0.8 on this sample. Can we say anything about the range in which the population accuracy p will lie?
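The setup can be simulated to see how a sample accuracy relates to the population accuracy. This is a sketch under stated assumptions: the population is a synthetic list in which each patient is marked 1 if the model was correct and 0 otherwise, with exactly 78% ones, and the sample is drawn by simple random sampling without replacement. The names and the seed are arbitrary.

```python
import random

POPULATION_SIZE = 50_000
POPULATION_ACCURACY = 0.78

# Synthetic population: 1 = model prediction correct, 0 = incorrect
n_correct = round(POPULATION_SIZE * POPULATION_ACCURACY)
population = [1] * n_correct + [0] * (POPULATION_SIZE - n_correct)


def sample_accuracy(population, sample_size, rng):
    """Accuracy on a simple random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


rng = random.Random(0)  # fixed seed so the run is repeatable
p = sum(population) / len(population)
p_hat = sample_accuracy(population, 100, rng)

print(f"population accuracy p = {p:.2f}")
print(f"sample accuracy (n=100) = {p_hat:.2f}")
```

Rerunning with a different seed gives a different sample accuracy, which is the variability that a confidence interval is meant to quantify.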

### Confidence Interval

Confidence intervals allow us to say that, using our sample, we're 95 percent confident that the population accuracy p is in the interval [0.72, 0.88]. Here 0.72 is called the lower bound and 0.88 the upper bound of the interval. The calculation of these confidence intervals is beyond the scope of this course, but it's important to understand their interpretation. When we report the accuracy of a model on a sample, we report the sample accuracy together with its confidence interval.

What we haven't seen is what it means to be 95 percent confident. 95 percent confidence does not say there is a 95 percent probability that p lies within the interval. It also does not say that 95 percent of the sample accuracies lie within this interval. The interpretation of 95 percent confidence is a little more nuanced and requires us to think about making repeated samples.



Let's dive into this. Let's say we were able to sample 100 patients from the population several times. Each time we get a different sample and hence a different sample accuracy. We can also compute the confidence interval associated with each sample. We can look at these samples on a plot. For each of these samples, we can plot the sample accuracy, here represented by the circle, and the lower and upper bounds of its confidence interval. On this plot, we also have the true population accuracy plotted as the dotted vertical line. This is unobserved. Note that most of these intervals contain the population accuracy. Here, six out of the seven contain it and one misses it. In fact, when we have 95 percent confidence intervals, about 95 percent of the intervals computed from repeated samples will contain the population accuracy. Ninety-five percent is what's called our confidence level. Thus, the interpretation of 95 percent confidence is that in repeated sampling, this method produces intervals that include the population accuracy in about 95 percent of samples.

![image.png](attachment:image.png)
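The repeated-sampling interpretation can be checked with a small simulation. The source does not say how its intervals were computed, so this sketch assumes the normal-approximation (Wald) interval for a proportion; the function names, the seed, and the number of repeats are arbitrary choices made here.

```python
import math
import random


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a proportion (an assumed method)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p, n, repeats, rng):
    """Fraction of repeated samples whose interval contains the true p."""
    hits = 0
    for _ in range(repeats):
        # Each patient is classified correctly with probability p
        correct = sum(rng.random() < p for _ in range(n))
        low, high = wald_interval(correct / n, n)
        if low <= p <= high:
            hits += 1
    return hits / repeats


rng = random.Random(0)
rate = coverage(p=0.78, n=100, repeats=2000, rng=rng)
print(f"fraction of intervals containing p: {rate:.3f}")
```

The observed fraction should land near the 95 percent confidence level, though the normal approximation is only approximate at this sample size, so it will not match exactly.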

In practice, we don't compute the confidence intervals for many samples. We only compute our model performance on one sample. For our sample, the computed confidence interval may or may not contain p. However, we can be 95% confident that it does. One of the factors that affects the width of the confidence interval, that is, how far apart its lower and upper bounds are, is the sample size. Let's say we drew another sample from the population, but this time with 500 patients. This is 5 times as large as our previous sample. We can expect that we'll have a better estimate of the population accuracy using the larger sample. Even though the model gets an accuracy of 0.8 on both samples, the confidence interval is tighter for the larger sample and wider for the smaller sample. Thus a larger sample gives us a better estimate of the population accuracy because its bounds are closer together. To summarize, confidence intervals are useful because even when we cannot run the model on a whole population, we can at least use a test result on a sample to express the range in which we're pretty sure our population accuracy lies.

![image-2.png](attachment:image-2.png)
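The effect of sample size on interval width can be illustrated numerically. The sketch below assumes the normal-approximation (Wald) interval for a proportion, which is one common choice and not necessarily the method behind the figures; `wald_interval` is a name introduced here.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a proportion (an assumed method)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Same sample accuracy, two sample sizes
low_100, high_100 = wald_interval(0.8, 100)
low_500, high_500 = wald_interval(0.8, 500)

print(f"n=100: [{low_100:.2f}, {high_100:.2f}]  width={high_100 - low_100:.3f}")
print(f"n=500: [{low_500:.2f}, {high_500:.2f}]  width={high_500 - low_500:.3f}")
```

Because the half-width scales with 1 / sqrt(n), a sample 5 times as large narrows the interval by a factor of about sqrt(5), not 5.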

A confidence interval is a range of values, computed from a sample, that is constructed to contain the true value of a population parameter at a specified confidence level. It is used to express the uncertainty associated with a sample statistic, such as the mean or proportion.

A confidence interval is typically expressed as an interval of values around the sample statistic, and is based on a specified level of confidence, which is often set at 95% or 99%. For example, a 95% confidence interval for the population mean is produced by a procedure that, over repeated sampling, yields intervals containing the true population mean about 95% of the time. It does not mean that there is a 95% probability that the true mean falls within one particular calculated interval.

The formula for calculating a confidence interval depends on the type of statistic being estimated and the distribution of the population. For example, if the sample size is large, or the population is normally distributed with known variance, the confidence interval for the population mean can be calculated using the standard error of the mean and the critical values from the normal distribution.

There are several factors that affect the width of a confidence interval, including the level of confidence, the sample size, and the variability of the population. Increasing the level of confidence will result in a wider interval, while increasing the sample size or decreasing the population variability will result in a narrower interval.

Confidence intervals are important in statistical inference, as they provide a way to estimate the range of possible values for a population parameter, based on a sample from the population. They allow researchers to make inferences about the population based on the information contained in the sample, while accounting for the uncertainty and variability associated with the estimation process.