


### Notebook Summary: Analyzing Missing Data Patterns in the Titanic Dataset

#### 1. Loading Libraries and Dataset
- **Libraries Used**: pandas, seaborn, matplotlib, numpy.
- **Dataset**: Titanic dataset loaded using `sns.load_dataset("titanic")`.

#### 2. Testing for MCAR (Missing Completely at Random)
- **Function**: `test_mcar(df, col)`
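  - **Purpose**: Tests whether missingness in a given column is unrelated to the other columns, which is what MCAR would imply.
  - **Method**: For each other column, builds a contingency table against a missing-value indicator, runs a chi-squared test, and combines the resulting p-values with `combine_pvalues`.
  - **Application**: Tested on the `age` column of the Titanic dataset.
  - **Finding**: The notebook treats `age` as MCAR. Note that a combined p-value above 0.05 only means the test found no evidence against MCAR; it does not prove randomness.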

#### 3. Testing for Missingness Type (MAR or MNAR)
- **Function**: `check_missingness(data, column)`
  - **Purpose**: Determines if missing data in a specified column is MAR (Missing at Random) or MNAR (Missing Not at Random).
  - **Method**: Encodes categorical variables, performs chi-squared tests for categorical variables, and t-tests for continuous variables.
  - **Application**: Applied to a custom dataset `Extended_MNAR_Missingness_DataFrame`.
  - **Finding**: The function returns `MAR` if any test gives p < 0.05 and `MNAR` otherwise. The `MNAR` fallback is a heuristic, since observed data alone cannot separate MNAR from MCAR.

#### 4. Visualizing Missing Data
- **Library Used**: `missingno`
  - **Visualizations**:
    - **Matrix Plot**: Shows the location and amount of missing data in the dataset.
    - **Heatmap**: Displays correlations between missing values in different columns.
    - **Bar Plot**: Shows how many values are present in each column, so shorter bars mean more missing data.
    - **Dendrogram**: Illustrates the hierarchical clustering of missing data patterns.
  - **Insights**:
    - **Deck Column**: High percentage of missing values (about 77%, 688 of 891 rows), unique missing pattern.
    - **Age Column**: Approximately 20% missing values, treated as MCAR based on the test above.
    - **Correlation Observations**:
      - **Embarked and Embark_town**: Strong correlation, missing values occur in the same rows.
      - **Deck and Other Columns**: Weak correlations; the notebook treats `deck` as MAR.
    - **Dendrogram Analysis**:
      - **Deck**: Unique missing pattern, different from other columns.
      - **Age and Alone**: Adjacent in the plot, but `alone` has no missing values, so the two are not missing together.
      - **High Similarity Group**: Columns like `pclass`, `embarked`, `embark_town`, `sex`, `survived`, `sibsp`, `parch`, and `fare` have nearly identical patterns because they are complete or almost complete.

### Practical Implications
- **Imputation Strategy**: Treat columns within the same cluster similarly for imputation purposes.
- **Data Quality Improvement**: Investigate columns with unique missing patterns, such as `deck`, to understand underlying issues in data collection.



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df=sns.load_dataset("titanic")

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency, combine_pvalues

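def test_mcar(df, col):
    """Return a combined p-value for whether missingness in `col` is
    independent of every other column (a heuristic check for MCAR)."""
    df1 = df.copy()
    # 1 where `col` is missing, 0 otherwise
    df1["missing_indicator"] = df1[col].isnull().astype(int)
    columns_to_test = df1.columns.drop([col, "missing_indicator"])

    p_values = []

    for i in columns_to_test:
        # Treat missing values in the comparison column as their own level
        if df1[i].isnull().any():
            if df1[i].dtype.name == 'category':
                df1[i] = df1[i].cat.add_categories('missing').fillna('missing')
            else:
                df1[i] = df1[i].fillna("missing")

        # Chi-squared test of independence between column i and the indicator.
        # Note: continuous columns such as `fare` yield sparse tables, so
        # binning them first would make the test more reliable.
        contingency_table = pd.crosstab(df1[i], df1["missing_indicator"])
        _, p, _, _ = chi2_contingency(contingency_table)
        p_values.append(p)

    # Fisher's method (the combine_pvalues default); it assumes the
    # individual tests are independent, which is only approximately true here.
    combined_p_value = combine_pvalues(p_values)[1]

    return combined_p_value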

In [None]:
test_mcar(df,"age")

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.preprocessing import LabelEncoder

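def check_missingness(data, column):
    """Label the missingness in `column` as 'MAR' or 'MNAR'.

    Returns 'MAR' as soon as any other column is significantly associated
    (p < 0.05) with whether `column` is missing, and 'MNAR' otherwise.
    Caveat: observed data alone cannot separate MNAR from MCAR, so the
    'MNAR' fallback is a heuristic label rather than a test result.
    """
    def is_categorical(series):
        # isinstance check replaces the deprecated pd.api.types.is_categorical_dtype
        return series.dtype == 'object' or isinstance(series.dtype, pd.CategoricalDtype)

    missing = data[column].isnull()
    # Encode categorical variables only if necessary
    data_encoded = data.copy()
    label_encoders = {}
    for col in data_encoded.columns:
        if is_categorical(data_encoded[col]):
            le = LabelEncoder()
            data_encoded[col] = le.fit_transform(data_encoded[col].astype(str))
            label_encoders[col] = le

    # Statistical tests
    result = 'Undetermined'

    for col in data.columns:
        if col != column and data[col].notnull().sum() > 0:
            if is_categorical(data[col]):
                # Chi-squared test for categorical variables
                contingency_table = pd.crosstab(missing, data[col])
                chi2, p, dof, expected = chi2_contingency(contingency_table)
                if p < 0.05:
                    result = 'MAR'
                    break
            else:
                # T-test for continuous variables: compare rows where
                # `column` is missing against rows where it is present
                t_stat, p = ttest_ind(data_encoded.loc[missing, col].dropna(), data_encoded.loc[~missing, col].dropna())
                if p < 0.05:
                    result = 'MAR'
                    break

    # No observed column explains the missingness
    if result == 'Undetermined':
        result = 'MNAR'

    return result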


In [None]:
df=pd.read_csv("/content/Extended_MNAR_Missingness_DataFrame.csv")

In [None]:
df.head()

In [None]:
check_missingness(df,"value")

In [None]:
import missingno as msno
df=sns.load_dataset("titanic")

In [None]:
msno.matrix(df)

Based on the earlier checks, we treat the missing values in the "age" column as MCAR (Missing Completely at Random) and those in the "deck" column as MAR (Missing at Random).
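To make the MCAR/MAR distinction concrete, here is a minimal sketch on synthetic data rather than the Titanic frame. The column names (`group`, `x_mcar`, `x_mar`), the missingness rates, and the helper `missingness_p_value` are all hypothetical and chosen only for illustration. One column is made missing with the same probability in every row, the other with a probability that depends on an observed column, and the same kind of chi-squared test used in `test_mcar` is applied to each.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: one observed group column and two numeric columns.
demo = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "x_mcar": rng.normal(size=n),
    "x_mar": rng.normal(size=n),
})

# MCAR: every row has the same 20% chance of being missing.
demo.loc[rng.random(n) < 0.2, "x_mcar"] = np.nan

# MAR: the chance of being missing depends on the observed "group" column.
p_missing = demo["group"].map({"a": 0.05, "b": 0.20, "c": 0.50}).to_numpy()
demo.loc[rng.random(n) < p_missing, "x_mar"] = np.nan


def missingness_p_value(frame, target, observed):
    """Chi-squared p-value for independence between the missingness
    of `target` and the values of an observed column."""
    table = pd.crosstab(frame[target].isnull(), frame[observed])
    _, p, _, _ = chi2_contingency(table)
    return p


p_mcar = missingness_p_value(demo, "x_mcar", "group")
p_mar = missingness_p_value(demo, "x_mar", "group")
print(f"MCAR column p-value: {p_mcar:.3f}")
print(f"MAR column p-value:  {p_mar:.3g}")
```

The MAR column should give a very small p-value because its missingness was built to depend on `group`. A large p-value for the MCAR column is consistent with MCAR but does not prove it.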

In [None]:
msno.heatmap(df)

The heatmap shows that missingness in "embarked" is strongly correlated with missingness in "embark_town": when one column has a missing value, the other is very likely to be missing in the same row.

The nullity correlation between "deck" and "age" is 0.1, which is weak. A weak correlation does not by itself mean that "deck" is MCAR (Missing Completely at Random). The correlation of "deck" with "embarked" and "embark_town" is also weak. Since this evidence does not point clearly to MCAR or to MNAR (Missing Not at Random), we treat "deck" as MAR (Missing at Random).

The "age" column has missing values, and apart from "deck" its missingness has no significant relationship with any other column, so we provisionally treat it as MCAR. We will not rely on this observation alone; we will also run Little's MCAR test to confirm.
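The numbers read off the heatmap can also be computed directly. The sketch below assumes the heatmap corresponds to the Pearson correlation between the missing-value indicators of each pair of columns. The helper name `nullity_correlation` and the toy frame are hypothetical.

```python
import pandas as pd


def nullity_correlation(frame):
    """Pearson correlation between the missing-value indicators of each column pair."""
    indicators = frame.isnull().astype(int)
    # Columns that are never (or always) missing have zero variance,
    # so their correlation is undefined; drop them first.
    varying = indicators.loc[:, indicators.nunique() > 1]
    return varying.corr()


# Toy frame: "port" and "town" are missing in the same rows,
# "cabin" is missing exactly where "port" is present, "fare" is complete.
toy = pd.DataFrame({
    "port": ["S", None, "C", "Q", None, "S"],
    "town": ["Southampton", None, "Cherbourg", "Queenstown", None, "Southampton"],
    "cabin": [None, "B", None, None, "C", None],
    "fare": [7.25, 80.0, 71.3, 8.05, 53.1, 8.46],
})

corr = nullity_correlation(toy)
print(corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, and a value near -1 means one tends to be missing where the other is present. Calling `nullity_correlation(df)` on the Titanic frame should give a table comparable to the heatmap.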

In [None]:
msno.bar(df)

The bar chart shows how many values are present in each column. The "age" column is missing approximately 20% of its values, while the "deck" column is missing about 77% (688 of 891 rows).
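The same percentages can be computed without a plot. This is a small sketch; the helper name `missing_summary` and the toy frame are hypothetical, and the toy numbers are not the Titanic values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values per column, highest percentage first."""
    summary = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame with two incomplete columns and one complete column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": [None, "C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

summary = missing_summary(toy)
print(summary)
```

Running `missing_summary(df)` on the Titanic frame gives the exact share of missing values for `age` and `deck`.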

In [None]:
msno.dendrogram(df)

### Interpretation of the Dendrogram

1. **Deck**:
   - The `deck` column joins the other columns at a very large distance, indicating a unique pattern of missing data: most of its values are missing in rows where the other columns are present.

2. **Age** and **Alone**:
   - `age` sits next to `alone` in the plot, but `alone` has no missing values in this dataset, so the two do not share missing rows. `age` joins the complete columns at a moderate height because roughly 20% of its values are missing.

3. **Adult_male, Alive, Class, Who**:
   - These columns are grouped together because none of them has missing values. Their nullity patterns are identical, so they merge at zero distance.

4. **Pclass, Embarked, Embark_town, Sex, Survived, SibSp, Parch, Fare**:
   - These columns join at a very low height. All are complete except `embarked` and `embark_town`, which are missing in the same two rows, so their nullity patterns are nearly identical.
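The sketch below shows how such a dendrogram can be built and why the merge heights look the way they do. It assumes average-linkage clustering on Euclidean distance between the 0/1 missing-value vectors of the columns, and it uses a hypothetical toy frame instead of the Titanic data.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Toy frame: "fare" and "sex" are complete, "age" and "deck" are not.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46],
    "sex": ["m", "f", "f", "f", "m", "m"],
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "deck": [None, "C", None, None, None, "E"],
})

# One row per column, one 0/1 entry per record (1 = missing).
nullity = toy.isnull().astype(int).to_numpy().T

# Average-linkage clustering on Euclidean distance between the 0/1 vectors.
linkage = hierarchy.linkage(nullity, method="average")

# Each linkage row holds: cluster i, cluster j, merge distance, cluster size.
first_merge = linkage[0]
merged = {toy.columns[int(first_merge[0])], toy.columns[int(first_merge[1])]}
print("first merge:", merged, "at distance", first_merge[2])
print("last merge distance:", linkage[-1][2])
```

Columns with identical nullity patterns, including columns that are simply complete, merge at distance zero. A column whose missing rows differ from the rest, such as `deck` here, joins last and at the greatest height.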

### Key Insights

- **Unique Missing Pattern**:
  - The `deck` column stands out with a unique missing pattern, different from all other columns, because of its high rate of missing values.

- **Related Missingness**:
  - `age` and `alone` appear close together in the plot, but `alone` is complete, so this does not mean the two are missing in the same rows.
  - The cluster of `adult_male`, `alive`, `class`, and `who` reflects that these columns are all complete, not that their missing values are correlated.

- **High Similarity Group**:
  - Columns such as `pclass`, `embarked`, `embark_town`, `sex`, `survived`, `sibsp`, `parch`, and `fare` have nearly identical nullity patterns. Only `embarked` and `embark_town` have missing values, and those occur in the same rows.

### Practical Implications

- **Imputation Strategy**:
  - Treat columns within the same cluster similarly for imputation purposes. For instance, `embarked` and `embark_town` are missing in the same rows, so they can be filled together.
  - Pay special attention to `deck` due to its unique missing pattern, which might require a different imputation strategy.

- **Data Quality Improvement**:
  - Investigate why `deck` has a different missing pattern and whether this indicates a systemic issue in data collection.
  - Understanding which columns tend to be missing together can help in diagnosing and fixing issues in the data collection process.
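As one possible way to act on these points, the sketch below fills a numeric column with a per-group median and keeps a heavily missing categorical column as an explicit "Unknown" level. It is an illustration under stated assumptions, not the notebook's method: grouping by `pclass` and the `Unknown` label are hypothetical choices, and the frame is a toy example.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the Titanic data.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
    "deck": ["C", None, "B", None, None, None],
})

imputed = toy.copy()

# Keep a record of which values were originally missing.
imputed["age_missing"] = toy["age"].isnull()
imputed["deck_missing"] = toy["deck"].isnull()

# Fill age with the median age of the passenger's class (a hypothetical choice).
class_median_age = toy.groupby("pclass")["age"].transform("median")
imputed["age"] = toy["age"].fillna(class_median_age)

# Treat a missing deck as its own category rather than guessing a deck letter.
imputed["deck"] = toy["deck"].fillna("Unknown")

print(imputed)
```

In the seaborn Titanic frame `deck` is a categorical column, so the new level has to be added with `.cat.add_categories(...)` before calling `fillna`, as `test_mcar` does. The indicator columns preserve the information that a value was missing, which matters when the missingness itself may be informative.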

This interpretation helps in understanding the structure and relationships of missing data in the dataset, guiding effective data cleaning and imputation strategies.