# **Missing Values**

A missing value is a data point that was not recorded for an observation. In a dataset it appears as an empty cell.

We can recognize a missing value because it is usually represented by `NaN` or `None`. Some raw datasets use a placeholder such as the symbol `?` instead.
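
A placeholder such as `?` is ordinary text to pandas, so it is not counted as missing until it is converted. The following is a minimal sketch of that conversion; the `raw` table and its `age` and `city` columns are made up for illustration and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which "?" stands for an unrecorded value.
raw = pd.DataFrame({
    "age": ["34", "?", "51"],
    "city": ["Lima", "Quito", "?"],
})

# Convert the placeholder to NaN so pandas treats it as missing.
cleaned = raw.replace("?", np.nan)

# "age" was stored as text because of the "?"; convert it back to numbers.
cleaned["age"] = pd.to_numeric(cleaned["age"])

missing_counts = cleaned.isna().sum()
print(missing_counts)
```

When the data comes from a file, passing `na_values=["?"]` to `pd.read_csv` should achieve the same conversion at load time.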

## **Types of Missing Values**

**Missing Completely at Random (MCAR)**

This occurs when the probability that a value is missing is unrelated to any data in the dataset, observed or not. It is the ideal type of missingness: the remaining rows are still a random sample, so dropping the gaps costs sample size but does not introduce bias.

* **Concept:** The missing data is just a random error.
* **Example:** A sensor randomly fails to record a temperature reading, regardless of the temperature itself.
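
The sensor example can be simulated to see why MCAR leaves summary statistics roughly intact. This is only a sketch: the temperature distribution, the 10% failure rate, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed, chosen only for reproducibility

# Hypothetical sensor readings (not from the dataset used in this notebook).
temps = pd.Series(rng.normal(loc=20.0, scale=5.0, size=1000), name="temperature")

# Each reading is lost with the same 10% probability, independent of its value.
drop_mask = rng.random(temps.size) < 0.10
temps_mcar = temps.mask(drop_mask)

print(temps_mcar.isna().mean())          # share of missing readings, roughly 0.10
print(temps.mean(), temps_mcar.mean())   # the two means should be close
```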


**Missing at Random (MAR)**

This occurs when the reason the data is missing is related to another variable that you *have* observed, but not to the value of the missing data itself.

* **Concept:** The missingness can be predicted by other data.
* **Example:** Men are less likely to fill out a survey question about their weight than women. The missing weight data is related to gender (which you've observed).
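
A small simulation of the survey example shows what MAR looks like in practice: the missing rate differs between groups of an observed column. The skip probabilities (40% for men, 10% for women) and the weight distribution are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 2000

# Hypothetical survey: gender is always observed, weight may be skipped.
survey = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "weight": rng.normal(loc=75.0, scale=12.0, size=n),
})

# Assumed skip probabilities: they depend only on the observed gender column,
# never on the weight itself.
skip_prob = survey["gender"].map({"male": 0.40, "female": 0.10}).to_numpy()
survey["weight"] = survey["weight"].mask(rng.random(n) < skip_prob)

# Missing rate of weight within each gender.
missing_by_gender = survey["weight"].isna().groupby(survey["gender"]).mean()
print(missing_by_gender)
```

Comparing missing rates across the groups of an observed column, as in the last step, is one simple way to look for evidence of MAR in real data.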


**Missing Not at Random (MNAR)**

This is the most problematic type. The reason the data is missing is directly related to the value that is missing.

* **Concept:** The missing data is a meaningful signal in itself.
* **Example:** People with very high or very low incomes are less likely to report their income. The missingness is tied directly to the value of the income.
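
The income example can be sketched the same way. Here the chance of a value being withheld depends on the value itself, which is what makes MNAR hard to correct. The income distribution and the withholding rule (60% for the lowest and highest 10% of incomes, 5% otherwise) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
n = 2000

# Hypothetical incomes, log-normally distributed.
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.6, size=n), name="income")

# Assumed rule: the bottom and top 10% of incomes are withheld 60% of the time,
# all other incomes 5% of the time.
low, high = income.quantile([0.10, 0.90])
extreme = (income < low) | (income > high)
skip_prob = np.where(extreme, 0.60, 0.05)
reported = income.mask(rng.random(n) < skip_prob)

# Missing rate among extreme incomes versus typical incomes.
print(reported[extreme].isna().mean(), reported[~extreme].isna().mean())
```

In real data the true incomes are not available, so this comparison cannot be made directly. That is why MNAR usually has to be argued from domain knowledge rather than detected from the data alone.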

In [3]:
import pandas as pd
import numpy as np

In [5]:
data = {
    'customer_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Webcam', 'Mouse', 'Keyboard', 'Laptop', 'Webcam', 'Mouse'],
    'rating': [5.0, 4.0, np.nan, 3.5, 5.0, np.nan, 4.0, 5.0, 4.5, 3.0],
    'comments': ['Excellent!', 'Good product.', None, 'Disappointed with performance.', 'Great quality.', None, 'Solid keyboard.', 'Perfect for my needs.', 'Good value.', 'Works ok.'],
    'satisfaction_level': ['Happy', 'Neutral', 'Happy', 'Happy', 'Happy', 'Neutral', 'Happy', None, 'Neutral', 'Unhappy'],
    'referral_source': ['Website', 'Social Media', 'Website', 'Social Media', 'Website', None, 'Website', 'Word of Mouth', 'Website', None]
}

df = pd.DataFrame(data)

In [6]:
df

|   | customer_id | product | rating | comments | satisfaction_level | referral_source |
|---|---|---|---|---|---|---|
| 0 | 101 | Laptop | 5.0 | Excellent! | Happy | Website |
| 1 | 102 | Mouse | 4.0 | Good product. | Neutral | Social Media |
| 2 | 103 | Keyboard | NaN | None | Happy | Website |
| 3 | 104 | Laptop | 3.5 | Disappointed with performance. | Happy | Social Media |
| 4 | 105 | Webcam | 5.0 | Great quality. | Happy | Website |
| 5 | 106 | Mouse | NaN | None | Neutral | None |
| 6 | 107 | Keyboard | 4.0 | Solid keyboard. | Happy | Website |
| 7 | 108 | Laptop | 5.0 | Perfect for my needs. | None | Word of Mouth |
| 8 | 109 | Webcam | 4.5 | Good value. | Neutral | Website |
| 9 | 110 | Mouse | 3.0 | Works ok. | Unhappy | None |


In [10]:
df.dtypes

customer_id             int64
product                object
rating                float64
comments               object
satisfaction_level     object
referral_source        object
dtype: object

In [8]:
# count the missing values in each column
df.isna().sum()

customer_id           0
product               0
rating                2
comments              2
satisfaction_level    1
referral_source       2
dtype: int64
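
Beyond raw counts, it often helps to see the share of missing values per column and the rows that contain them. The sketch below uses a trimmed copy of `df` (same values, three of the six columns) so that it runs on its own; the names `df_small`, `missing_pct`, and `rows_with_gaps` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the notebook's df: same values, three of the six columns.
df_small = pd.DataFrame({
    "rating": [5.0, 4.0, np.nan, 3.5, 5.0, np.nan, 4.0, 5.0, 4.5, 3.0],
    "satisfaction_level": ["Happy", "Neutral", "Happy", "Happy", "Happy",
                           "Neutral", "Happy", None, "Neutral", "Unhappy"],
    "referral_source": ["Website", "Social Media", "Website", "Social Media", "Website",
                        None, "Website", "Word of Mouth", "Website", None],
})

# Percentage of missing values per column.
missing_pct = df_small.isna().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value.
rows_with_gaps = df_small[df_small.isna().any(axis=1)]
print(rows_with_gaps)
```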

### **Imputing Missing Values**

In [12]:
# numerical column: fill the gaps with the median of the observed ratings
df["rating"] = df["rating"].fillna(df["rating"].median())

In [13]:
# text column: fill the gaps with a placeholder label
df["comments"] = df["comments"].fillna("Unspecified")

In [16]:
# replacing missing values by mode
df["referral_source"] = df["referral_source"].fillna(df["referral_source"].mode()[0])
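
Once a gap is filled, the information that the value was originally missing is lost. One optional pattern, sketched here on the `rating` values alone, is to record a flag before imputing. The column name `rating_was_missing` is hypothetical, and this step is an addition to the imputation shown above, not part of it.

```python
import numpy as np
import pandas as pd

# Ratings copied from the notebook's df.
ratings = pd.DataFrame({
    "rating": [5.0, 4.0, np.nan, 3.5, 5.0, np.nan, 4.0, 5.0, 4.5, 3.0],
})

# Flag the gaps first (rating_was_missing is a hypothetical column name) ...
ratings["rating_was_missing"] = ratings["rating"].isna()

# ... then fill them with the median of the observed ratings.
median_rating = ratings["rating"].median()
ratings["rating"] = ratings["rating"].fillna(median_rating)

print(median_rating)
print(ratings)
```

The flag is mainly useful when the missingness may itself carry information, as in the MNAR case.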

### **Dropping Missing Values**

In [19]:
df.dropna(subset=["satisfaction_level"], axis=0)  # drop rows where satisfaction_level is missing; returns a new DataFrame

|   | customer_id | product | rating | comments | satisfaction_level | referral_source |
|---|---|---|---|---|---|---|
| 0 | 101 | Laptop | 5.0 | Excellent! | Happy | Website |
| 1 | 102 | Mouse | 4.0 | Good product. | Neutral | Social Media |
| 2 | 103 | Keyboard | 4.25 | Unspecified | Happy | Website |
| 3 | 104 | Laptop | 3.5 | Disappointed with performance. | Happy | Social Media |
| 4 | 105 | Webcam | 5.0 | Great quality. | Happy | Website |
| 5 | 106 | Mouse | 4.25 | Unspecified | Neutral | Website |
| 6 | 107 | Keyboard | 4.0 | Solid keyboard. | Happy | Website |
| 8 | 109 | Webcam | 4.5 | Good value. | Neutral | Website |
| 9 | 110 | Mouse | 3.0 | Works ok. | Unhappy | Website |
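
`dropna` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below illustrates this, and the difference between dropping by one column and dropping on any gap, using two columns copied from `df` before imputation. The names `df_small`, `by_column`, and `any_missing` are introduced here for illustration.

```python
import pandas as pd

# Two columns copied from the notebook's df, before imputation.
df_small = pd.DataFrame({
    "satisfaction_level": ["Happy", "Neutral", "Happy", "Happy", "Happy",
                           "Neutral", "Happy", None, "Neutral", "Unhappy"],
    "referral_source": ["Website", "Social Media", "Website", "Social Media", "Website",
                        None, "Website", "Word of Mouth", "Website", None],
})

# Drop rows where satisfaction_level is missing; df_small itself is unchanged.
by_column = df_small.dropna(subset=["satisfaction_level"])

# Drop rows that have a missing value in any column.
any_missing = df_small.dropna(how="any")

print(len(df_small), len(by_column), len(any_missing))
```

To keep the reduced table, assign the result back, for example `df = df.dropna(subset=["satisfaction_level"])`.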
