# Types of missing data:

**MCAR: Missing Completely At Random**  
The missing values in the data set occur completely at random. They don't depend on any other data. Example: when a device such as a security camera stops working.  
*How to handle this kind of missing data? Apply data deletion or imputation (imputation is usually preferred)*  

**MAR: Missing At Random**  
The missing values depend on other observed values. Example: devices require periodic maintenance to ensure consistent operation, so data will be missing during those maintenance periods.  
*How to handle this kind of missing data? Single or multiple imputation (consider one or several columns during imputation)*  

**MNAR: Missing Not At Random**  
The missing values depend on the missing values themselves. They are very difficult to identify, and we may not even know that data is missing. Example: tools have limitations. When a measurement falls outside the instrument's measurement range, a missing value is generated, as with a scale that cannot detect very small or very large values.  
*How to handle this kind of missing data? This kind of missing data requires a sensitivity analysis. If that is not possible, imputation is preferable to deletion.*
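
The three mechanisms can be illustrated on synthetic data. The following is a minimal sketch, not a recipe: the column names (`hour`, `weight`), the 10% loss rate, the maintenance window, and the scale range are all assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic readings (hypothetical columns for illustration only)
df = pd.DataFrame({
    "hour": rng.integers(0, 24, size=n),
    "weight": rng.normal(50, 20, size=n),
})

# MCAR: every reading has the same 10% chance of being lost
mcar = df["weight"].mask(rng.random(n) < 0.10)

# MAR: readings are lost during a maintenance window,
# so missingness depends on the observed "hour" column
maintenance = df["hour"].between(2, 4)
mar = df["weight"].mask(maintenance)

# MNAR: the scale cannot register values outside its range,
# so missingness depends on the unobserved value itself
mnar = df["weight"].where(df["weight"].between(10, 90))

print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

In the MNAR case the surviving values are all inside the range, which is why the missingness cannot be diagnosed from the observed data alone.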

# Residuals
Residuals of a regression model or a time series model represent the difference between observed and predicted values.

*Residual = observed value - predicted value*

A model is considered valid if its residuals behave like white noise. Tests such as Durbin-Watson or Ljung-Box are applied to detect autocorrelation in the residuals, i.e., errors that are correlated over time.
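
As a small illustration of the formula, the sketch below fits a straight line to synthetic data with `np.polyfit` and computes the residuals. The data and the true coefficients (3.0 and 2.0) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic observations: a linear trend plus random noise
x = np.arange(50, dtype=float)
observed = 3.0 + 2.0 * x + rng.normal(0, 1, size=x.size)

# Fit a degree-1 polynomial; polyfit returns the highest degree first
slope, intercept = np.polyfit(x, observed, deg=1)
predicted = intercept + slope * x

# Residual = observed value - predicted value
residuals = observed - predicted
print("Mean of residuals:", residuals.mean())
```

For a least-squares fit with an intercept, the residuals average to zero up to floating-point error; whether they also look like white noise is what the autocorrelation tests check.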

# Difference between correlation and autocorrelation

## Correlation

- Correlation measures the relationship between two different time series over the same time period.
- It indicates whether two variables tend to increase or decrease together.
- It's expressed through the Pearson correlation coefficient, whose value ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation).

For example:

Daily temperature and energy consumption can be correlated

```python
df[["temperature", "energy_consumption"]].corr()
```

## Autocorrelation
- Autocorrelation measures the relationship of a series with itself at different time periods.
- It indicates whether past values influence future ones.
- It reveals patterns such as trends and seasonality.

For example:

Electricity consumption in one day can be influenced by the previous day's consumption.

```python
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(df["energy_consumption"])
```
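
To get a number instead of a plot, `Series.autocorr` returns the correlation between a series and a lagged copy of itself. The sketch below uses a synthetic series with a weekly cycle; the values are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily consumption with a 7-day cycle (hypothetical data)
days = np.arange(28)
consumption = pd.Series(100 + 20 * np.sin(2 * np.pi * days / 7))

# Correlation of the series with itself shifted by 1 and by 7 days
lag_1 = consumption.autocorr(lag=1)
lag_7 = consumption.autocorr(lag=7)
print("lag 1:", lag_1)
print("lag 7:", lag_7)
```

Because the series repeats every 7 days, the autocorrelation at lag 7 is essentially 1, which is how seasonality shows up in autocorrelation values.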

# T-test (Student's t-test) functions

The scipy.stats.ttest_* functions (like ttest_ind, ttest_rel, and ttest_1samp) perform a t-test, which is a statistical hypothesis test to compare means. These functions typically return two values:

* Statistic: The t-statistic is a measure of how far the sample mean is from the null hypothesis (e.g., the means are equal) in terms of standard error. A higher absolute value indicates a stronger deviation from the null hypothesis.
* p-value: This value represents the probability of observing the data (or something more extreme) under the null hypothesis. If the p-value is below your chosen significance level (e.g., 0.05), you can reject the null hypothesis.

To run ttest_ind() properly, you need two samples that:

* Are numerical (not boolean or categorical).
* Follow an approximately normal distribution (this matters most for small sample sizes).
* Do not contain NaN values: by default, NaN propagates through the calculation and the result is nan.
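
The effect of NaN values can be seen in a short sketch. It compares the default behavior with two ways of dropping the missing values: filtering them manually, or passing `nan_policy="omit"` to `ttest_ind`. The sample values are invented for the illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

a = np.array([2.3, 3.5, np.nan, 5.6, 6.8])
b = np.array([1.2, 2.8, 3.4, 4.7, 5.9])

# Default nan_policy="propagate": the NaN contaminates the result
propagated = ttest_ind(a, b)
print("propagate:", propagated.pvalue)

# Option 1: drop the NaN values before testing
cleaned = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

# Option 2: let SciPy ignore NaN values
omitted = ttest_ind(a, b, nan_policy="omit")
print("cleaned:", cleaned.pvalue, "omit:", omitted.pvalue)
```

Both options give the same test on the remaining values; whether dropping the values is appropriate depends on why they are missing.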

## Example Explanation

Suppose you run:

```python
from scipy.stats import ttest_ind

# Example data
sample1 = [2.3, 3.5, 4.2, 5.6, 6.8]
sample2 = [1.2, 2.8, 3.4, 4.7, 5.9]

# Perform a two-sample t-test
t_stat, p_val = ttest_ind(sample1, sample2)

print("t-statistic:", t_stat)
print("p-value:", p_val)
```

## "alternative" Function Parameter

Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
- two-sided: the means of the distributions underlying the samples are unequal.
- less: the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.
- greater: the mean of the distribution underlying the first sample is greater than the mean of the distribution underlying the second sample.
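
The sketch below runs the same two samples with each option. It assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` parameter.

```python
from scipy.stats import ttest_ind

sample1 = [2.3, 3.5, 4.2, 5.6, 6.8]
sample2 = [1.2, 2.8, 3.4, 4.7, 5.9]

two_sided = ttest_ind(sample1, sample2)  # default: alternative="two-sided"
greater = ttest_ind(sample1, sample2, alternative="greater")
less = ttest_ind(sample1, sample2, alternative="less")

print("two-sided:", two_sided.pvalue)
print("greater:  ", greater.pvalue)
print("less:     ", less.pvalue)
```

The t-statistic is the same in all three cases; only the p-value changes. Since the mean of sample1 is above the mean of sample2 here, the "greater" p-value is half of the two-sided one, and the "greater" and "less" p-values add up to 1.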

## Output Interpretation

- t-statistic:
    - A positive or negative value reflecting the direction of the difference between the means.
    - A positive value means the mean of sample1 is higher than the mean of sample2.
    - A negative value means the mean of sample1 is lower than the mean of sample2.
    - Larger absolute values indicate greater evidence against the null hypothesis.

- p-value:
    - A small p-value (e.g., < 0.05) suggests a significant difference between the two groups.
    - A large p-value means the evidence is insufficient to reject the null hypothesis.

## Common Use Cases

- ttest_ind:
    - Used for two independent samples.
    - Example: Comparing test scores between two different classes.

- ttest_rel:
    - Used for two related samples (paired samples).
    - Example: Comparing the same students' test scores before and after a course.

- ttest_1samp:
    - Used to compare a sample mean to a population mean.
    - Example: Checking if a class's average test score differs from the national average.
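
A minimal sketch of the paired and one-sample variants follows. The test scores and the national average of 70 are hypothetical values chosen for the illustration.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Scores of the same six students before and after a course (hypothetical)
before = [72, 65, 80, 58, 90, 77]
after = [75, 70, 82, 63, 91, 80]

# Paired test: are the after-scores different from the before-scores?
paired = ttest_rel(after, before)
print("paired:", paired.statistic, paired.pvalue)

# One-sample test: does the class mean differ from a national average of 70?
one_sample = ttest_1samp(after, popmean=70)
print("one-sample:", one_sample.statistic, one_sample.pvalue)

# A paired t-test is equivalent to a one-sample test on the differences
differences = np.array(after) - np.array(before)
diff_check = ttest_1samp(differences, popmean=0)
```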

## Assumptions of the t-test

The results depend on these assumptions:
- Normality of the data.
- Equal variances (for ttest_ind). Use equal_var=False if variances are unequal.
- Independence between the samples.

Check these assumptions before interpreting the result to ensure validity. Here are the common assumptions and how to test them:

### 1. Normality of the Data

The data in each group should follow a normal distribution.

#### Visual Inspection:

Use histograms or Q-Q plots to assess if the data roughly follows a normal curve.

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.hist(sample1, bins=10, alpha=0.7, label="Sample 1")
plt.hist(sample2, bins=10, alpha=0.7, label="Sample 2")
plt.legend()
plt.show()

stats.probplot(sample1, dist="norm", plot=plt)
plt.title("Sample 1 Q-Q Plot")
plt.show()
```

#### Shapiro-Wilk Test:

A formal test for normality. If the p-value is < 0.05, the data significantly deviates from normality.

```python
from scipy.stats import shapiro

stat, p = shapiro(sample1)
print("Shapiro-Wilk Test - Sample 1: stat =", stat, ", p-value =", p)

stat, p = shapiro(sample2)
print("Shapiro-Wilk Test - Sample 2: stat =", stat, ", p-value =", p)
```

#### Kolmogorov-Smirnov Test:

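Another test for normality (especially for larger samples). Because the mean and standard deviation are estimated from the same sample being tested, the resulting p-value is only approximate.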

```python
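import numpy as np
from scipy.stats import kstest

# sample1 is a plain list, so use NumPy for the mean and the sample standard deviation
stat, p = kstest(sample1, "norm", args=(np.mean(sample1), np.std(sample1, ddof=1)))
print("K-S Test - Sample 1: stat =", stat, ", p-value =", p)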
```

### 2. Equal Variances (Homogeneity of Variance)

The variance in the two groups is similar (only for independent t-tests).

#### Levene’s Test:

Tests the null hypothesis that the variances are equal. If the p-value is < 0.05, the variances are significantly different.

```python
from scipy.stats import levene

stat, p = levene(sample1, sample2)
print("Levene's Test: stat =", stat, ", p-value =", p)
```

#### F-Test (Ratio of Variances):

Compare the variances directly.

```python
import numpy as np

# Ratio of the sample variances (ddof=1 gives the unbiased estimate)
f_stat = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)
print("F-statistic (Variance Ratio):", f_stat)
```
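
The ratio alone is not a test. One way to turn it into a p-value is to compare it with the F distribution, as in the sketch below. This is a textbook two-sided F-test written out by hand with `scipy.stats.f`, not a dedicated SciPy function, and it is sensitive to departures from normality.

```python
import numpy as np
from scipy.stats import f

sample1 = [2.3, 3.5, 4.2, 5.6, 6.8]
sample2 = [1.2, 2.8, 3.4, 4.7, 5.9]

f_stat = np.var(sample1, ddof=1) / np.var(sample2, ddof=1)

# Degrees of freedom of the numerator and denominator variances
dfn, dfd = len(sample1) - 1, len(sample2) - 1

# Two-sided p-value: twice the smaller tail probability
p_value = 2 * min(f.cdf(f_stat, dfn, dfd), f.sf(f_stat, dfn, dfd))
print("F-statistic:", f_stat, "p-value:", p_value)
```

A small p-value (e.g., < 0.05) would suggest that the variances differ.
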
If variances are unequal, use the equal_var=False parameter in ttest_ind:

```python
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(sample1, sample2, equal_var=False)
```

### 3. Independence between the samples

The samples must be independent of each other. This depends mainly on the design of your experiment or data collection process. Verify that your data collection process ensures no overlap or dependence between groups.

#### Durbin-Watson Test (for detecting autocorrelation in residuals)

Detects first-order (lag 1) autocorrelation in the residuals of a model such as a linear regression.

```python
from statsmodels.stats.stattools import durbin_watson
dw_statistic = durbin_watson(residuals) #  regression model residuals
print(f"Durbin-Watson statistic: {dw_statistic}")
```

Output interpretation:

- Value close to 2: no autocorrelation.
- Value close to 0: positive autocorrelation.
- Value close to 4: negative autocorrelation.

#### Ljung-Box Test (for detecting autocorrelation in a time series)

Tests whether the autocorrelations of a series (here, model residuals) are jointly zero up to a given number of lags (previous values).

```python
from statsmodels.stats.diagnostic import acorr_ljungbox
ljung_box_result = acorr_ljungbox(residuals, lags=[10], return_df=True) # Model regression residuals
print(ljung_box_result)
```

If the p-value is lower than 0.05, there is evidence of autocorrelation.

#### Autocorrelation Function (ACF) plot (for time-series data)

An Autocorrelation Function (ACF) plot shows how a time series is correlated with its own past values (lags).

```python
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(sample1) # acf = Auto Correlation Function
plt.show()
```

Interpretation of the Plot:

* Y-Axis (Autocorrelation Coefficient, r_k)
    * Values range from -1 to 1.
    * A positive value means a direct relationship (past values are similar to future ones).
    * A negative value means an inverse relationship (past values are opposite to future ones).

* X-Axis (Lags)
    * The lags indicate how many steps back in time we are looking at.
    * The first bar at lag = 0 is always 1, since a time series is perfectly correlated with itself at lag 0.

* Bars and Confidence Interval (Shaded Area)
    * The bars represent autocorrelation coefficients at different lags.
    * The blue shaded area is the confidence interval. If a bar falls outside this region, the autocorrelation at that lag is statistically significant (unlikely to be due to chance).

#### Correlation Matrix (For numeric variables):

```python
# numeric_only=True skips non-numeric columns (available in pandas >= 1.5)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
```

If correlation values are high in absolute value (> 0.8 or < -0.8), the variables are strongly linearly related.

#### Chi-squared Test (For categorical variables):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Create a contingency table between 2 categorical variables
contingency_table = pd.crosstab(df["cat_1"], df["cat_2"])

# Apply the chi-squared test of independence
chi2_stat, p, dof, expected = chi2_contingency(contingency_table)
print(f"p-value: {p}")
```
If p < 0.05, there is evidence that the variables are not independent.

## What If Assumptions Are Violated?

- Normality:
    - Use a non-parametric test like the Mann-Whitney U Test (scipy.stats.mannwhitneyu) for independent samples or the Wilcoxon Signed-Rank Test for paired samples.

- Equal Variances:
    - Use equal_var=False in ttest_ind (Welch’s t-test).

- Independence:
    - Consider using statistical models (e.g., mixed-effects models) to handle dependencies explicitly.
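
A minimal sketch of the two non-parametric alternatives mentioned above, using `scipy.stats.mannwhitneyu` and `scipy.stats.wilcoxon`. The paired scores are hypothetical values chosen for the illustration.

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Independent samples: Mann-Whitney U test
sample1 = [2.3, 3.5, 4.2, 5.6, 6.8]
sample2 = [1.2, 2.8, 3.4, 4.7, 5.9]
u_stat, u_p = mannwhitneyu(sample1, sample2, alternative="two-sided")
print("Mann-Whitney U:", u_stat, "p-value:", u_p)

# Paired samples: Wilcoxon signed-rank test on the pairwise differences
before = [72, 65, 80, 58, 90, 77]
after = [75, 71, 82, 63, 91, 81]
w_stat, w_p = wilcoxon(before, after)
print("Wilcoxon:", w_stat, "p-value:", w_p)
```

Both tests work on ranks rather than raw values, so they do not require normality; the p-values are read the same way as for the t-test.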

# Ordinary Least Squares Regression

## Assumptions of OLS regression

Ordinary Least Squares (OLS) regression requires checking key assumptions to ensure valid and reliable results. Below are the main assumptions and how to test them:
- Ensure linearity by inspecting scatter plots.
- After fitting the model, validate assumptions with residual plots.
- Check for multicollinearity if multiple predictors are used.

### 1. Linearity

The relationship between independent variables and the dependent variable is linear.

#### Plot the residuals vs. fitted values

If there is no clear pattern, linearity is likely satisfied.

```python
import matplotlib.pyplot as plt

fitted_values = model.fittedvalues
residuals = model.resid

plt.scatter(fitted_values, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()
```

### 2. Independence of Errors

The residuals (errors) are independent.

#### Use the Durbin-Watson statistic

A value close to 2 indicates no autocorrelation.

```python
from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(residuals)
print("Durbin-Watson Statistic:", dw_stat)
```

### 3. Normality of Errors

The residuals should be normally distributed.

#### Use a Q-Q plot

Residuals should fall along the diagonal line.

```python
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()
```

Perform a Shapiro-Wilk test:

```python
from scipy.stats import shapiro

stat, p = shapiro(residuals)
print("Shapiro-Wilk Test: stat =", stat, ", p-value =", p)
```

### 4. Homoscedasticity (Equal Variance of Errors)

The variance of the residuals should be constant.

#### Use the Breusch-Pagan test

The null hypothesis is constant variance, so a p-value below 0.05 suggests heteroscedasticity.

```python
from statsmodels.stats.diagnostic import het_breuschpagan
_, pval, _, _ = het_breuschpagan(residuals, model.model.exog)
print("Breusch-Pagan Test p-value:", pval)
```

### 5. No Multicollinearity (for Multiple Regression)

Independent variables should not be highly correlated.

#### Compute the Variance Inflation Factor (VIF)

As a rule of thumb, a VIF > 5 indicates multicollinearity.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# exog includes the constant column if one was added; its VIF is not meaningful
X = model.model.exog
vif = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print("VIFs:", vif)
```

### Interpreting Model Output

- R-squared: The proportion of the variance in the dependent variable (temp) that is explained by the independent variable (ozone). Example: an R-squared value of 0.488 means that ozone explains about 48.8% of the variance in temp; other variables not included in this model may also affect temp.
- Adjusted R-squared: Similar to R-squared but adjusts for the number of predictors in the model. It penalizes the addition of irrelevant predictors.
- F-statistic: measures the overall significance of the model.
- Prob (F-statistic): the p-value of the F-statistic. A value very close to 0 indicates that the model is statistically significant, meaning that the independent variables jointly have a significant effect on the dependent variable.
- Log-Likelihood: A measure of the likelihood of the data under the fitted model.
- AIC (Akaike Information Criterion): A metric for model comparison, with lower values indicating a better model.
- BIC (Bayesian Information Criterion): Similar to AIC but includes a stronger penalty for adding more parameters to the model.
- Number of Observations: The total number of data points used in the analysis.
- Df Residuals: Degrees of freedom for residuals (number of observations minus the number of parameters).
- Df Model: The number of predictors (independent variables) in the model.
- Covariance Type: "nonrobust" indicates that the standard errors and covariance calculations are not adjusted for heteroskedasticity or autocorrelation.

# Data imputation methods

- If you are working with time series, you can apply:
    - Backward fill (or backfilling)
    - Forward fill (or forward filling)
    - Interpolation
- If you are not working with time series, you can apply:
    - Constant imputation
    - Mean imputation
    - Median imputation
    - Mode imputation
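
The non-time-series options can be sketched with pandas `fillna`. The DataFrame and its columns (`height`, `city`) are hypothetical, and the constant 0 is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "height": [160.0, np.nan, 175.0, 182.0, np.nan],
    "city": ["Lima", "Quito", None, "Lima", "Lima"],
})

# Constant imputation: replace missing values with a fixed value
constant_imp = df["height"].fillna(0)

# Mean and median imputation (numerical variables)
mean_imp = df["height"].fillna(df["height"].mean())
median_imp = df["height"].fillna(df["height"].median())

# Mode imputation (also works for categorical variables)
mode_imp = df["city"].fillna(df["city"].mode()[0])

print(pd.DataFrame({"mean": mean_imp, "median": median_imp, "mode": mode_imp}))
```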

#### What Are Donor-Based Imputations?

They fill in missing values for a given unit by copying observed values from another unit, the donor.

#### What Are Model-Based Imputations?

The goal is to find a predictive model for each target variable in the dataset that contains missing values.

### Mean, Median, and Mode Imputation
*Advantages:*

- Quick and easy
- The median can be useful in the presence of outliers
- Preserves the imputed statistic (e.g., the mean is unchanged after mean imputation) and the sample size

*Disadvantages:*

- Can bias results since it alters the distribution (it reduces the variance and changes the kurtosis)
- Loses correlations between variables; not very precise
- Cannot be used for categorical variables (except for mode)

### Forward and Backward Fill Imputation

*Advantages:*

- Quick and easy
- Imputed data are not constant
- There are tricks to avoid breaking relationships between variables

*Disadvantages:*

- Multivariable relationships may be distorted

*Functions to apply these types of imputation:*

- ffill (forward fill): Imputes forward.
- bfill (back fill): Imputes backward.

When the dataset contains categorical variables, it is recommended to group the DataFrame by them before filling (use groupby instead of sort_values), to maintain the relationship between the missing values and the values of other variables (e.g., filling in a woman's height using the height of another woman).
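
The difference between a plain forward fill and a grouped one can be seen in a small sketch. The DataFrame and its columns (`sex`, `height`) are hypothetical.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "height": [165.0, 180.0, np.nan, np.nan, 170.0],
})

# Plain forward fill: copies the previous row, whatever its group
plain = people["height"].ffill()

# Grouped forward fill: copies the previous value within the same group
grouped = people.groupby("sex")["height"].ffill()

print(pd.DataFrame({"sex": people["sex"], "plain": plain, "grouped": grouped}))
```

With the plain fill, the missing height of the woman in the third row is taken from the man above her (180.0); with the grouped fill it comes from the previous woman (165.0).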

### Interpolation Imputation

Different interpolation methods can be used, such as:

- Straight-line interpolation: Model-based interpolation
- Nearest neighbor interpolation: Donor-based interpolation

*Advantages:*

- Easy to implement
- Useful for time series
- Offers a variety of interpolation options

*Disadvantages:*

- May break relationships between variables
- May introduce out-of-range values
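
Both kinds of interpolation are available through pandas `Series.interpolate`. The sketch below uses made-up, equally spaced readings; note that `method="nearest"` relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Hypothetical equally spaced readings with gaps
readings = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, np.nan, 22.0])

# Linear interpolation: values on the straight line between known points
linear = readings.interpolate(method="linear")

# Nearest-neighbor interpolation: copy the closest known value (the donor)
nearest = readings.interpolate(method="nearest")

print(pd.DataFrame({"original": readings, "linear": linear, "nearest": nearest}))
```

Linear interpolation produces new intermediate values, while nearest-neighbor interpolation only reuses values that were actually observed.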

### k-NN (k-Nearest Neighbors) Imputation

*For each observation with missing values:*

- Find k most similar observations (donors, neighbors).
- Replace missing values with the aggregated values of these k neighbors.

*How to Determine the Most Similar Neighbors?*

By quantifying distances:

- Euclidean distance: Useful for numerical variables.
- Manhattan distance: Useful for factor-type variables (e.g., Monday, Tuesday; slow, fast).
- Hamming distance: Useful for categorical variables.
- Gower distance: Useful for datasets with mixed variable types (a combination of numerical, categorical, and factor-type variables).

*Advantages:*

- Easy to implement
- Performs well with small datasets
- Excellent for numerical data but also works with mixed data

*Disadvantages:*

- Scalability can be an issue
- Requires special transformations for categorical variables
- Sensitive to outliers
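
For numerical data, one readily available implementation is scikit-learn's `KNNImputer`, sketched below on a tiny made-up matrix. It is an assumption that scikit-learn is the tool of choice here; it measures similarity with a Euclidean distance that ignores missing coordinates and aggregates the k neighbors with their mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numerical data: rows are observations, columns are variables
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that column
# over the 2 most similar rows (donors) that have it observed
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Categorical columns would need to be encoded first, which is one of the disadvantages listed above.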
