# Data preprocessing
Data preprocessing is a crucial step in the data analysis process. It involves cleaning and transforming raw data into a format that can be easily understood and used for further analysis or modeling.

>Imagine you're a teacher, and you've collected test scores from your students. Before you analyze the scores, you need to ensure the data is clean and ready. This process is similar to washing and cutting vegetables before cooking.

### 1. Missing Values:
Missing values in a dataset can arise for various reasons, such as data entry errors, unrecorded observations, or data corruption. Handling them is essential because many calculations either fail on missing entries or silently skip them, which can lead to misleading analysis or results.

**Scenario:** While entering scores into your computer, you forgot to input scores for two students.

**Real-world analogy:** It's like having a puzzle with missing pieces. You can't see the full picture.

**Functions:**
- **isna() / isnull()**: `isnull()` is an alias of `isna()`, so the two behave identically; both detect missing values. They return a boolean DataFrame of the same shape as the original, with `True` where values are missing and `False` where they are not.
>It's like asking, "Which scores did I forget to input?" The computer will point out the missing scores.

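- **fillna()**: This function is used to fill missing values. You can fill them with a specific value (a constant, or a statistic such as the column mean). To carry neighboring values forward or backward, recent pandas versions provide the dedicated `ffill()` and `bfill()` methods; passing `method='ffill'` or `method='bfill'` to `fillna()` is deprecated as of pandas 2.1.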
>If you can't find the original test papers, you might decide to give those students the average score of the class. This is like filling in the puzzle with pieces from another similar puzzle.

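- **dropna()**: Used to drop rows or columns that contain missing values. By default it removes every row with at least one missing value. (The similarly named `drop()` removes rows or columns by label and does not look at missing values.)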
>Alternatively, you might decide not to include these students in your analysis. It's like removing the rows with missing pieces from the puzzle.
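The three functions above can be sketched on a small set of test scores. The scores themselves are illustrative values chosen to match the scenario (two forgotten entries), and the variable names are assumptions, not part of any original dataset.

```python
import numpy as np
import pandas as pd

# Illustrative test scores; np.nan marks the two scores that were never entered.
scores = pd.Series([85, 90, np.nan, 78, 88, 95, np.nan, 80], name="score")

# isna(): True wherever a score is missing.
missing_mask = scores.isna()
n_missing = int(missing_mask.sum())

# fillna(): replace missing scores with the class average (mean() skips NaN).
filled = scores.fillna(scores.mean())

# dropna(): leave the students with missing scores out of the analysis.
dropped = scores.dropna()

print(n_missing)       # 2
print(len(dropped))    # 6
```

Filling keeps all eight students but invents two values; dropping keeps only observed values but shrinks the sample. Which trade-off is acceptable depends on the analysis.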



### 2. Outliers:
Outliers are data points that differ significantly from other observations. They can arise from genuine variability in the data or from measurement and data entry errors. Outliers can skew statistical measures such as the mean and can be misleading when analyzing data.

**Scenario:** You notice one score is 250, but the maximum possible score is 100. This is clearly a mistake.

**Real-world analogy:** Imagine weighing fruits on a scale, and suddenly it shows the weight of an apple as 5 kilograms! That's clearly way off.

**Function:**
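- **quantile()**: This function returns the value at a given quantile. It's useful for identifying potential outliers using methods like the IQR (Interquartile Range), which is the distance between the 25th percentile (Q1) and the 75th percentile (Q3). A common rule of thumb flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
>To identify such unusual scores (or outliers), you find the weights that mark off the lightest 25% and the heaviest 25% of the apples, measure how spread out the middle half is, and check whether any apple's weight falls far outside that range.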

**Handling Outliers:** If you find any score that's way too high or too low compared to the others, you might decide to double-check the test paper or not include it in your analysis.
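As a minimal sketch of the IQR rule on the test-score scenario, the snippet below computes the two fences and lists the scores outside them. The score values are illustrative, and the 1.5 multiplier is the usual convention rather than a fixed requirement.

```python
import pandas as pd

# Illustrative test scores; 250 is the data-entry mistake from the scenario.
scores = pd.Series([85, 90, 78, 88, 250, 95, 80])

q1 = scores.quantile(0.25)   # 25th percentile
q3 = scores.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # spread of the middle half of the scores

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Scores outside [lower, upper] are flagged as potential outliers.
outliers = scores[(scores < lower) | (scores > upper)]
print(lower, upper)          # 67.5 107.5
print(outliers.tolist())     # [250]
```

A flagged value is only a candidate for review. Here the score is impossible (the maximum is 100), so correcting or removing it is justified.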

---

**Example with the Test Scores:**

You have scores like [85, 90, ?, 78, 88, 250, 95, ?, 80], where "?" means you forgot to input the score.

- Using **isna()**, you identify the two missing scores.
- Using the **quantile()** method, you realize 250 is an unusual score and decide to correct it or remove it.
- With **fillna()**, you replace each "?" with the average score of the class, computed without the 250 so the mistake doesn't inflate it: (85 + 90 + 78 + 88 + 95 + 80) / 6 = 86.

Now, with the 250 removed, your cleaned scores might look something like [85, 90, 86, 78, 88, 95, 86, 80], and you're ready to analyze them!
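The walkthrough above can be sketched end to end as follows. This is one possible ordering (flag the outlier first, then fill), assuming the 250 is removed rather than corrected; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# "?" entries from the example are represented as np.nan.
scores = pd.Series([85, 90, np.nan, 78, 88, 250, 95, np.nan, 80])

# 1. Locate the missing entries.
missing_positions = scores[scores.isna()].index.tolist()

# 2. Flag outliers with the 1.5 * IQR rule (quantile() ignores NaN).
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
in_range = scores.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Remove the outlier but keep the missing entries for now.
without_outlier = scores[in_range | scores.isna()]

# 4. Fill the missing entries with the mean of the remaining valid scores.
cleaned = without_outlier.fillna(without_outlier.mean())

print(missing_positions)   # [2, 7]
print(cleaned.tolist())    # [85.0, 90.0, 86.0, 78.0, 88.0, 95.0, 86.0, 80.0]
```

Filling before removing the outlier would use a mean of about 109.4, because the 250 would still be counted, so the order of the steps changes the result.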

```python
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, 28, np.nan, 24, 30, 31, 150, 23, np.nan, 27],
    'Salary': [50000, 55000, 52000, np.nan, 60000, 62000, 1000000, 48000, 49000, 53000]
}
df = pd.DataFrame(data)
print("Original Data:")

df
```

Original Data:

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 28.0 | 55000.0 |
| 2 | NaN | 52000.0 |
| 3 | 24.0 | NaN |
| 4 | 30.0 | 60000.0 |
| 5 | 31.0 | 62000.0 |
| 6 | 150.0 | 1000000.0 |
| 7 | 23.0 | 48000.0 |
| 8 | NaN | 49000.0 |
| 9 | 27.0 | 53000.0 |


```python
# Detecting missing values
print("\nDetecting missing values:")
df.isna()
```

Detecting missing values:

|   | Age | Salary |
|---|-----|--------|
| 0 | False | False |
| 1 | False | False |
| 2 | True | False |
| 3 | False | True |
| 4 | False | False |
| 5 | False | False |
| 6 | False | False |
| 7 | False | False |
| 8 | True | False |
| 9 | False | False |


```python
# Filling missing values
df_filled = df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].median()})
print("\nData after filling missing values:")
df_filled
```

Data after filling missing values:

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 28.0 | 55000.0 |
| 2 | 42.25 | 52000.0 |
| 3 | 24.0 | 53000.0 |
| 4 | 30.0 | 60000.0 |
| 5 | 31.0 | 62000.0 |
| 6 | 150.0 | 1000000.0 |
| 7 | 23.0 | 48000.0 |
| 8 | 42.25 | 49000.0 |
| 9 | 27.0 | 53000.0 |
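The filled Age of 42.25 is higher than every plausible age in the sample because the mean is pulled up by the 150 entry. The sketch below compares the mean with the median for the same column; using the median as the fill value is an assumption made here for illustration, not a choice made in the cell above.

```python
import numpy as np
import pandas as pd

# Same Age values as the sample dataset.
age = pd.Series([25, 28, np.nan, 24, 30, 31, 150, 23, np.nan, 27], name="Age")

age_mean = age.mean()       # skips NaN; inflated by the 150 entry
age_median = age.median()   # skips NaN; barely affected by the 150 entry

# Fill with the median instead of the mean.
age_filled_median = age.fillna(age_median)

print(age_mean)     # 42.25
print(age_median)   # 27.5
```

Another option is to handle the outlier first and fill afterwards, so the fill statistic is computed only from trusted values.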


```python
# Dropping rows with missing values
df_dropped = df.dropna()
print("\nData after dropping rows with missing values:")
df_dropped
```

Data after dropping rows with missing values:

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 28.0 | 55000.0 |
| 4 | 30.0 | 60000.0 |
| 5 | 31.0 | 62000.0 |
| 6 | 150.0 | 1000000.0 |
| 7 | 23.0 | 48000.0 |
| 9 | 27.0 | 53000.0 |


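```python
# Handling outliers using IQR
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
# Keep ages within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. Comparisons with NaN are
# False, so rows with a missing Age (2 and 8) are dropped by this mask as well.
age_in_range = (df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)
df_outliers_removed = df[age_in_range]
print("\nData after removing outliers:")
df_outliers_removed
```

Data after removing outliers:

|   | Age | Salary |
|---|-----|--------|
| 0 | 25.0 | 50000.0 |
| 1 | 28.0 | 55000.0 |
| 3 | 24.0 | NaN |
| 4 | 30.0 | 60000.0 |
| 5 | 31.0 | 62000.0 |
| 7 | 23.0 | 48000.0 |
| 9 | 27.0 | 53000.0 |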
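The cell above checks only the Age column and also discards rows where Age is missing. A possible variant, sketched below, applies the same IQR rule to every numeric column while keeping rows with missing values for a later filling step. The helper name `iqr_bounds` and the multiplier parameter `k` are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for one column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Same sample dataset as above.
df = pd.DataFrame({
    'Age': [25, 28, np.nan, 24, 30, 31, 150, 23, np.nan, 27],
    'Salary': [50000, 55000, 52000, np.nan, 60000, 62000, 1000000, 48000, 49000, 53000],
})

# Start by keeping every row, then rule out rows that are out of range in any column.
keep = pd.Series(True, index=df.index)
for col in ['Age', 'Salary']:
    low, high = iqr_bounds(df[col])
    # Missing values are not outliers, so they pass the check explicitly.
    keep &= df[col].between(low, high) | df[col].isna()

df_clean = df[keep]
print(df_clean.index.tolist())   # [0, 1, 2, 3, 4, 5, 7, 8, 9]
```

Only row 6 (Age 150, Salary 1,000,000) is removed; the three rows with missing values remain and can still be filled or dropped deliberately afterwards.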
