# Statistics
## Table of Contents
- [Statistical Filter Methods](#statistical-filter-methods)
- [Mutual Information](#mutual-information)
- [Types of missingness](#types-of-missingness)
- [Shannon Entropy](#shannon-entropy)
- [Differential Entropy](#differential-entropy)
- [Bayes’ Rule](#bayes-rule)


## Statistical Filter Methods

Below is an overview of **statistical filter methods** for feature selection in machine learning, focusing on their concepts, common techniques, and practical usage.

## 1. What Is a Statistical Filter in Feature Selection?

In machine learning, a **filter method** (often called a “statistical filter”) is a **preprocessing step** for feature selection that uses **statistical criteria** to measure how relevant each feature is to the prediction target. Features are then **ranked** or **selected** according to those statistics—**without** training a specific predictive model in the loop. 

### Key Idea
- **Filter methods** separate the feature selection step from model building.
- They use measures of association like correlation, mutual information, chi-square, etc.
- They remove or select features **before** a learning algorithm sees the data.

## 2. Why Use Statistical Filters for Feature Selection?

1. **Efficiency**  
   - They are typically **fast** and computationally inexpensive.  
   - They don’t require training and evaluating multiple models for each feature subset.

2. **Reduction in Overfitting**  
   - By removing irrelevant or noisy features, you can often help the downstream model generalize better.

3. **Improve Model Interpretability**  
   - With fewer features, it’s easier to understand **which** input variables are most important and **why**.

4. **Dimensionality Reduction**  
   - In high-dimensional datasets, filter methods help reduce the dimensionality quickly without heavy computation.


## 3. Common Statistical Filter Techniques

1. **Correlation-Based Filtering**  
   - **Pearson correlation** (for continuous features) or other correlation measures (e.g., Spearman, Kendall) are computed between each feature and the target.  
   - Features that have a correlation magnitude below a certain threshold can be discarded.  
   - Often used for quick screening, but Pearson captures only **linear** relationships (Spearman and Kendall extend this to monotonic ones).

2. **Mutual Information (MI)**  
   - Measures **any** (linear or nonlinear) dependence between each feature and the target.  
   - A feature with higher MI to the target presumably carries more predictive signal.  
   - In practice, can be more robust than correlation if relationships are nonlinear.

3. **Chi-Square Test**  
   - Common for **categorical** data. Compares observed vs. expected frequencies of a feature’s values across different classes of the target.  
   - Features that show a strong dependency with the target (high chi-square score, low p-value) are selected.

4. **ANOVA F-test**  
   - Typically used for **continuous** features vs. **categorical** target (e.g., classification).  
   - The F-test checks if the mean of a feature differs significantly across target classes.

5. **Variance Threshold**  
   - Simplest approach: remove all features whose variance is below some threshold.  
   - Assumes features with low variance don’t carry much discriminative power.  
   - Often used as a quick **first pass** to eliminate near-constant or trivially varying features.

## 4. How These Methods Work in Practice

1. **Compute the Score**  
   - For each feature $X_i$, compute a statistical score with respect to the target $Y$. Examples:  
     - Correlation coefficient,  
     - Mutual Information,  
     - Chi-square statistic,  
     - F-statistic (ANOVA).

2. **Rank Features**  
   - Sort the features from **highest** to **lowest** in terms of the chosen score.

3. **Select the Features**  
   - **Keep** the top $k$ features or all features above some threshold.  
   - This selection **does not** directly depend on how a specific model (like a neural network or SVM) would handle these features; it’s purely statistical.
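The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `pearson` and `select_top_k` and the toy data are made up for this example, and absolute Pearson correlation is just one possible score.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_top_k(features, target, k, score=lambda x, y: abs(pearson(x, y))):
    """Step 1: score every feature. Step 2: rank. Step 3: keep the k best."""
    scores = {name: score(column, target) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data, invented for illustration only.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],  # nearly linear in the target
    "b": [2.0, 1.0, 2.5, 1.5, 3.0, 2.0],  # loosely related
    "c": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],  # constant, so its score is 0
}
selected, scores = select_top_k(features, target, k=2)
print(selected)
```

Passing a different `score` function (for example a mutual information estimate) changes the criterion without touching the ranking or selection logic.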

## 5. Advantages and Disadvantages of Statistical Filters

### Advantages
- **Speed and Scalability**: Easy to apply even with very large datasets (many rows or features).  
- **Model-Agnostic**: Works independently of any ML algorithm you might use afterward.  
- **Fewer Hyperparameters**: Usually just a threshold or a number $k$ of features to keep.

### Disadvantages
- **Ignores Feature Interactions**: Each feature is typically evaluated independently (univariate). Two features that individually have low relevance but combined have high predictive power might get discarded.  
- **Not Always Optimal for the Final Model**: Because they are agnostic to the final model’s inner workings, filter methods might select suboptimal sets for certain algorithms that rely on complex interactions.
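A classic way to see the interaction problem is XOR. In the sketch below (plain Python, with helper names chosen for this example), the target is `x1 XOR x2`. Each feature alone has zero mutual information with the target, computed here as $H(X) + H(Y) - H(X, Y)$, yet the pair determines it completely.

```python
import math
from collections import Counter
from itertools import product

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# All four combinations of two binary features, each equally likely.
rows = list(product([0, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]  # target is the XOR of the two features

mi_x1 = mutual_information(x1, y)                      # 0 bits
mi_x2 = mutual_information(x2, y)                      # 0 bits
mi_joint = mutual_information(list(zip(x1, x2)), y)    # 1 bit
print(mi_x1, mi_x2, mi_joint)
```

A univariate filter would score both features at zero and drop them, even though together they predict the target perfectly.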


## 6. Example Workflow

Suppose you have a dataset with 1000 features (columns) and a target variable $Y$ (for classification).

1. **Correlation Screening**  
   - Compute the absolute Pearson correlation coefficient `abs_corr` between each feature and $Y$.  
   - Discard any feature where `abs_corr < 0.1` (just an example threshold).

2. **Mutual Information**  
   - On the filtered set (say you’re down to 200 features), compute the mutual information $I(X_i; Y)$ for each remaining feature $X_i$.  
   - Select the top 50 features based on MI.

3. **Train a Model**  
   - Now train your favorite classifier (e.g., random forest, logistic regression, etc.) using those 50 features.  
   - Evaluate performance on a validation/test set.

By combining multiple statistical measures (correlation, MI) in a pipeline, you can refine your feature set before feeding it to a predictive model.
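A scaled-down sketch of this two-stage workflow, using only the standard library, might look as follows. Everything here is an assumption made for illustration: the synthetic data (3 informative and 17 noise features instead of 1000), the equal-width binning used to estimate MI for continuous features, and the function names. The model-training step is left out, since any classifier can consume the selected columns.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def binned(xs, bins=5):
    """Equal-width binning: a crude but simple way to discretize a feature."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    return [min(int((x - lo) / width), bins - 1) for x in xs]

def mutual_information(xs, ys, bins=5):
    bx = binned(xs, bins)
    return entropy(bx) + entropy(ys) - entropy(list(zip(bx, ys)))

def two_stage_filter(features, target, corr_threshold=0.1, top_k=50):
    # Stage 1: discard features whose absolute correlation is below the threshold.
    stage1 = {name: col for name, col in features.items()
              if abs(pearson(col, target)) >= corr_threshold}
    # Stage 2: rank the survivors by estimated MI and keep the top k.
    mi = {name: mutual_information(col, target) for name, col in stage1.items()}
    return sorted(mi, key=mi.get, reverse=True)[:top_k]

# Synthetic data: binary target, 3 informative features, 17 pure-noise features.
rng = random.Random(0)
n = 500
y = [rng.randint(0, 1) for _ in range(n)]
features = {}
for i, noise in enumerate([0.4, 0.6, 0.8]):
    features[f"signal_{i}"] = [label + rng.gauss(0, noise) for label in y]
for i in range(17):
    features[f"noise_{i}"] = [rng.gauss(0, 1) for _ in range(n)]

selected = two_stage_filter(features, y, corr_threshold=0.1, top_k=3)
print(selected)
```

In a real project the scores should be computed on the training split only, so that the validation/test evaluation in step 3 stays unbiased.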



## 7. Key Takeaways

- **Statistical filter** methods rely purely on measures like correlation, mutual information, chi-square, etc., to evaluate each feature’s relevance to the target.
- They are generally **fast** and **simple** to implement, making them a common **first step** in feature selection.
- They do **not** account for *interaction effects* among features, so sometimes advanced methods (wrapper or embedded) might be necessary.
- In practice, you often combine filter methods with other selection techniques or domain knowledge to create a robust final feature set.


### In Summary

A **statistical filter** is a straightforward and efficient way to select important features by ranking them according to a statistical criterion. Filters are excellent at quickly reducing dimensionality and lowering the risk of overfitting from too many features, but they cannot recognize complex dependencies among features. They are typically the **first step** in a more extensive feature selection workflow for many machine learning projects.

---


## Mutual Information
**Mutual Information (MI)** is a concept from **information theory** that measures the amount of information shared between two random variables. In other words, MI answers: *“How much does knowing one variable reduce the uncertainty about the other?”*

Put differently: if knowing the outcome of one variable helps you predict the other, the mutual information is greater than zero.

Zero MI ($I(X;Y) = 0$) means there is no dependence between $X$ and $Y$ (they are independent).
High MI means knowing $X$ tells you a lot about $Y$, and vice versa.

Here is a concise overview:

### Basic Definition (Using Entropy)

1. **Formula in Terms of Entropies**  
   For two random variables $X$ and $Y$, the mutual information $I(X; Y)$ can be written as:

   $$
   I(X; Y) \;=\; H(X) \;+\; H(Y) \;-\; H(X, Y),
   $$

   where:
   - $H(X)$ is the **entropy** of $X$ (a measure of the uncertainty in $X$).
   - $H(Y)$ is the **entropy** of $Y$.
   - $H(X, Y)$ is the **joint entropy** of $X$ and $Y$ together.

2. **Equivalent Expressions**  
   Mutual information can also be expressed in two other, but equivalent, ways:

   - $$I(X; Y) = H(X) - H(X \mid Y),$$
   - $$I(X; Y) = H(Y) - H(Y \mid X).$$

   Here, $H(X \mid Y)$ denotes the conditional entropy of $X$ given $Y$.  
   Intuitively, $I(X; Y)$ measures how much the uncertainty in $X$ is reduced by knowing $Y$ (and vice versa).

3. **Interpretation**  
   - If $X$ and $Y$ are independent, then $H(X, Y) = H(X) + H(Y)$, making $I(X;Y) = 0$.  
   - If $X$ can be fully determined by $Y$ (perfectly dependent), then $H(X \mid Y) = 0$, and thus $I(X; Y) = H(X)$, the maximum possible reduction in uncertainty for $X$.

Using these entropy-based definitions is often helpful when thinking about how mutual information relates to other information-theoretic quantities like **conditional entropy** and **joint entropy**.
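To make the entropy-based definitions concrete, here is a small sketch that computes $I(X;Y)$ from a joint probability table. The function names and the three example tables are chosen for this illustration; `joint[i][j]` is assumed to hold $P(X = i, Y = j)$.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    px, py = marginals(joint)
    pxy = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(pxy)

def conditional_entropy_x_given_y(joint):
    """H(X | Y) = H(X, Y) - H(Y)."""
    _, py = marginals(joint)
    pxy = [p for row in joint for p in row]
    return entropy(pxy) - entropy(py)

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y are independent fair bits
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X
partial = [[0.4, 0.1], [0.1, 0.4]]          # Y agrees with X 80% of the time

mi_independent = mutual_information(independent)  # 0 bits
mi_identical = mutual_information(identical)      # 1 bit, equal to H(X)
mi_partial = mutual_information(partial)          # about 0.278 bits
print(mi_independent, mi_identical, mi_partial)
```

For the `partial` table, $H(X) - H(X \mid Y)$ gives the same value as the three-entropy formula, which is a quick numerical check of the equivalent expressions.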

4. **Interpretation in Terms of Dependence**  
   - **Zero mutual information** means the variables do not share information about each other (i.e., they are statistically independent).  
   - **High mutual information** indicates that knowing one variable gives a lot of information about the other (i.e., a strong dependency, whether linear or non-linear).

5. **Comparison with Correlation**  
   - **Correlation** measures only linear dependence and can miss complex relationships.  
   - **Mutual information** captures *any* type of dependency (e.g., non-linear), which makes it a more general measure for determining whether two variables are related.

6. **Practical Use in Machine Learning**  
   - **Feature Selection**: In some methods, features that have high mutual information with the target (and potentially minimal redundancy with each other) are often chosen.  
   - **Dimensionality Reduction**: Mutual information can help identify which variables or transformations preserve the most “information” relevant to the prediction.  
   - **Distribution & Data Analysis**: Because MI deals with entire distributions (rather than just, say, means and variances), it can reveal relationships that simpler statistics overlook.

7. **Estimation Challenges**  
   - In practice, calculating MI directly from data requires estimating probability distributions (especially for continuous variables).  
   - Techniques like **binning**, **kernel density estimation (KDE)**, or **k-nearest neighbors (KNN)** approaches are used to approximate these distributions.

8. **Connection to Kullback–Leibler Divergence**  
   - Another useful perspective is seeing MI as the **Kullback–Leibler (KL) divergence** between the joint distribution of $X$ and $Y$ and the product of their marginal distributions:

     $$
     I(X; Y) = D_{\mathrm{KL}}\bigl(p(X, Y) \,\|\, p(X)\,p(Y)\bigr).
     $$

   - KL divergence measures how one distribution differs from another, so mutual information tells us *how different* the joint distribution is from what it would be if $X$ and $Y$ were independent.
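The KL form and the comparison with correlation can be illustrated together. In this sketch (names and example chosen here, not taken from any library), $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The covariance is exactly zero, so Pearson correlation sees nothing, while the MI computed as a KL divergence is clearly positive.

```python
import math

def mi_as_kl(joint):
    """I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

x_values = [-1, 0, 1]
y_values = [0, 1]
# joint[i][j] = P(X = x_values[i], Y = y_values[j]) with Y = X**2
joint = [
    [0.0, 1 / 3],  # X = -1 -> Y = 1
    [1 / 3, 0.0],  # X =  0 -> Y = 0
    [0.0, 1 / 3],  # X =  1 -> Y = 1
]

e_x = sum(x * sum(row) for x, row in zip(x_values, joint))
e_y = sum(y * sum(col) for y, col in zip(y_values, zip(*joint)))
e_xy = sum(x * y * joint[i][j]
           for i, x in enumerate(x_values)
           for j, y in enumerate(y_values))
covariance = e_xy - e_x * e_y  # 0: no linear relationship

mi = mi_as_kl(joint)           # about 0.918 bits, which equals H(Y) here
print(covariance, mi)
```

Because $Y$ is fully determined by $X$, the MI equals $H(Y)$, the entropy of a variable that is 0 with probability $1/3$ and 1 with probability $2/3$.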

---

**Key Takeaway**:  
- Mutual information is a flexible measure of dependency between variables, capturing *any* relationship (not just linear). It’s often used in feature selection and understanding data relationships in machine learning, even though it can be more complex to estimate than simpler metrics like correlation.

## Types of Missingness

When dealing with datasets that have missing values, it’s important to understand why those values are missing. We usually categorize missingness into three main types:

- **MCAR (Missing Completely At Random)**  
  The probability of a value being missing is independent of both the observed data and the unobserved data.  
  Example: A random glitch in a sensor that causes data to be lost sporadically, with no relation to the sensor’s actual reading.

- **MAR (Missing At Random)**  
  The probability of a value being missing depends on the observed data but **not** on the missing value itself.  
  Example: A study participant fails to fill out certain questions *because of* their other known characteristics (like age or income), but not because of the unknown answer itself.

- **MNAR (Missing Not At Random)**  
  The probability of a value being missing depends on the *unobserved* data—i.e., the missing value itself.  
  Example: Patients with more severe symptoms are more likely to skip follow-ups, so whether a severity value is missing is *linked* to the (unrecorded) severity level itself.

**Why it matters**:  
Different missingness mechanisms influence how you can safely impute or analyze data. Statistical methods often assume MCAR or MAR. MNAR requires specialized handling or modeling assumptions.
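A small simulation can make the difference tangible. The sketch below is built entirely on invented assumptions: a synthetic `income` variable that rises with `age`, arbitrary missingness probabilities, and a simple two-group stratification as the MAR correction. It compares the naive mean of the observed values under each mechanism.

```python
import random
import statistics

rng = random.Random(42)
n = 20_000

# Hypothetical data: income rises with age, plus noise. Age is always observed.
age = [rng.uniform(20, 70) for _ in range(n)]
income = [1000 + 50 * a + rng.gauss(0, 500) for a in age]
true_mean = statistics.fmean(income)
median_income = statistics.median(income)

# In each mask, True means "this income value is missing".
mcar = [rng.random() < 0.3 for _ in range(n)]                  # unrelated to anything
mar = [rng.random() < (0.6 if a > 45 else 0.1) for a in age]   # depends on observed age
mnar = [rng.random() < (0.6 if v > median_income else 0.1)     # depends on the value itself
        for v in income]

def observed_mean(values, missing):
    return statistics.fmean([v for v, m in zip(values, missing) if not m])

def age_stratified_mean(values, ages, missing, cutoff=45):
    """Average observed values within each age group, then weight by group size."""
    total = 0.0
    for in_group in (lambda a: a > cutoff, lambda a: a <= cutoff):
        idx = [i for i, a in enumerate(ages) if in_group(a)]
        seen = [values[i] for i in idx if not missing[i]]
        total += statistics.fmean(seen) * len(idx) / len(ages)
    return total

mcar_mean = observed_mean(income, mcar)   # close to the true mean
mar_mean = observed_mean(income, mar)     # biased low if age is ignored
mnar_mean = observed_mean(income, mnar)   # biased low
mar_corrected = age_stratified_mean(income, age, mar)  # close to the true mean again
print(true_mean, mcar_mean, mar_mean, mar_corrected, mnar_mean)
```

Under MAR the bias disappears once the analysis conditions on the observed variable that drives the missingness. Under MNAR no such observed variable exists, which is why it needs extra modeling assumptions.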
  
Key takeaways:
- **Types of Missingness** (MCAR, MAR, MNAR) let you classify why data is missing, which affects how you handle it.  

---


## Shannon Entropy

**Shannon entropy** is a measure of **uncertainty** or **information content** in a **discrete** random variable $X$. If $X$ takes values $\{x_1, x_2, \dots, x_n\}$ with probabilities $p(x_i)$, then the Shannon entropy $H(X)$ is:

$$
H(X) = - \sum_{i=1}^{n} p(x_i) \log_2 \bigl(p(x_i)\bigr).
$$

- **High entropy** means $X$ is very unpredictable (many equally likely outcomes).  
- **Low entropy** means $X$ is more predictable (one or few outcomes dominate).  

In information theory terms, $H(X)$ is a lower bound on the expected number of *bits* per outcome needed to encode the outcomes of $X$ losslessly.
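A direct translation of the formula into Python, as a minimal sketch (the function name and the example distributions are chosen for this illustration):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p(x) log2 p(x), in bits. Terms with p = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])       # 1 bit
biased_coin = shannon_entropy([0.9, 0.1])     # about 0.469 bits
four_sided = shannon_entropy([0.25] * 4)      # 2 bits
certain = shannon_entropy([1.0])              # 0 bits
print(fair_coin, biased_coin, four_sided, certain)
```

The biased coin has lower entropy than the fair one because one outcome dominates, matching the high/low entropy intuition above.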
  
Key Takeaways:
- **Shannon Entropy** measures uncertainty in discrete variables.

---

## Differential Entropy

For **continuous** random variables, the analogous concept is **differential entropy**. If $X$ has probability density function $f(x)$, the differential entropy $h(X)$ is:

$$
h(X) = - \int_{-\infty}^{\infty} f(x) \, \log \bigl(f(x)\bigr) \, dx.
$$

**Key differences from Shannon entropy**:  
- Differential entropy can be **negative**.  
- It doesn’t have the same direct “bits” interpretation as the discrete case.  
- It is not invariant under certain transformations (e.g., scaling the variable).
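
As a small worked example of the first and third points (a sketch using a uniform density, chosen here for illustration): for $X$ uniform on $[0, a]$, the density is $f(x) = 1/a$ on that interval, so

$$
h(X) = -\int_{0}^{a} \frac{1}{a} \, \log\!\left(\frac{1}{a}\right) dx = \log a.
$$

This is negative whenever $a < 1$, for example $h(X) = -\log 2$ for $a = 1/2$. More generally, rescaling by a constant $c \neq 0$ shifts the differential entropy:

$$
h(cX) = h(X) + \log |c|,
$$

whereas the Shannon entropy of a discrete variable is unchanged when its values are relabelled one-to-one.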
  
Key takeaways:
- **Differential Entropy** is the continuous analog (to Shannon Entropy) but behaves differently than discrete entropy.  

---

## Bayes’ Rule

**Bayes’ Rule** describes how to update the probability of an event $A$ after observing some evidence $B$. Formally:

$$
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.
$$

- $P(A)$: **prior** probability of $A$.  
- $P(A \mid B)$: **posterior** probability (after seeing $B$).  
- $P(B \mid A)$: likelihood of observing $B$ when $A$ is true.  
- $P(B)$: overall (or marginal) probability of $B$.

This rule is fundamental in **Bayesian** statistics, enabling you to incorporate prior knowledge and then update your beliefs as new data arrives.
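A short numerical sketch of the update, using a made-up diagnostic test (1% prevalence, 95% sensitivity, 5% false-positive rate; all three numbers are illustrative assumptions). Here $A$ is "has the condition" and $B$ is "test is positive", and $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) = P(B | A) P(A) / P(B), with
    P(B) = P(B | A) P(A) + P(B | not A) (1 - P(A))."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# One positive test: the posterior is only about 16%, because the condition is rare.
after_one = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)

# A second, independent positive test: yesterday's posterior is today's prior.
after_two = posterior(prior=after_one, likelihood=0.95, false_positive_rate=0.05)
print(round(after_one, 3), round(after_two, 3))
```

Feeding the first posterior back in as the prior is exactly the "update your beliefs as new data arrives" step described above.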
  
Key Takeaways:
- **Bayes’ Rule** provides a framework for updating probabilities given new evidence and is central to Bayesian inference.

---



My notes from the meeting with Jerrad, 14 February 2025:

A B C, mutual information captures only main effects, not interactions between variables.
Relief looks at multiple variables at the same time.
Relief has some disadvantages too:
 you cannot just look at the absolute value of a score; you have to look at multiple features to be able to compare scores.
Use MI for main effects and Relief for interactions.
Relief can also detect a main effect if it is simple / linear.
Orange: free to download, open source.
Relief does not tell you which interactions exist between the different features.
Relief can also be used for time-series data.

The MongoDB server can be used for heavy jobs.
`.bashrc` for adding the MongoDB username and password
install MongoDB Compass
Compass has a built-in mongosh
you can run queries directly in Compass
aggregations
you can view the data more or less like a table
pymongo
to add a df to MongoDB, first convert it to a list of dicts (`to_dict("records")` in pandas, `to_dicts()` in polars) and then use `insert_many`.
