# **Summary**

Data preprocessing is used to improve the quality of the data and prepare it for analysis or modeling. Specifically, it includes several tasks that allow data to be cleaned, transformed, and organized so that it is suitable for use in analysis or machine learning algorithms.

Considerations:


1. Missing data (NA, nan, null)
2. Repeated values
3. Outliers
4. Irrelevant/Redundant columns
5. Single-level categorical column/Single-value numerical column
6. Typographical errors/Change the values of some elements
7. Change datatype



# **Missing data**

We have two options, but the decision requires an analysis of the data:


1. Remove missing data
2. Impute data



In [None]:
nans = df.isna().sum()
print(nans)

Remove missing data:

*   When the dataset is large enough that dropping the rows with missing values does not affect the analysis
*   When only a few values are missing





In [None]:
df.dropna(inplace=True)
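
The options of `dropna` give finer control than dropping every row that has any missing value. A minimal sketch on a made-up DataFrame (`df_demo` and its column names are illustrative assumptions, not part of the dataset above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, 6.0, 7.0],
    "C": ["x", "y", None, "w"],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = df_demo.isna().mean()

# Drop a row only when column 'A' is missing
drop_on_a = df_demo.dropna(subset=["A"])

# Keep rows that have at least 2 non-missing values
at_least_two = df_demo.dropna(thresh=2)
```

Checking the share of missing values per column first helps decide whether removing rows is affordable.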

Impute data:


*   Replace the missing values with the mean of each column.





In [None]:
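# numeric_only=True skips non-numeric columns, which would otherwise raise an error
df.fillna(df.mean(numeric_only=True), inplace=True)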
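
The mean only makes sense for numeric columns. One possible approach for mixed data, shown here as a sketch, is to use the mean for numeric columns and the most frequent value (mode) for the rest. The mode choice and the `df_demo` data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
df_demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Lima", None, "Lima"],
})

num_cols = df_demo.select_dtypes(include="number").columns
cat_cols = df_demo.select_dtypes(exclude="number").columns

# Numeric columns: fill with the column mean
df_demo[num_cols] = df_demo[num_cols].fillna(df_demo[num_cols].mean())

# Other columns: fill with the most frequent value
for col in cat_cols:
    df_demo[col] = df_demo[col].fillna(df_demo[col].mode().iloc[0])
```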

*   Simple interpolation




In [None]:
columns = ['A', 'B', 'C']
df[columns] = df[columns].interpolate(method='linear', limit_direction='forward', axis=0)
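
To see what linear interpolation does, here is a small sketch on a made-up Series (the values are illustrative). Gaps between known values are filled along a straight line, and with `limit_direction='forward'` a missing value at the start stays missing because there is nothing before it to interpolate from:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Fill gaps moving forward through the Series
filled = s.interpolate(method="linear", limit_direction="forward")

# Inner gap becomes 2.0 and 3.0; the leading NaN is left untouched
inner = filled.iloc[1:].tolist()
```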

# **Repeated values**

This consists of removing the repeated rows from the DataFrame.

In [None]:
df.drop_duplicates(inplace=True)
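
Before removing anything, it can help to count the duplicates, and sometimes rows should count as repeated based on only some columns. A sketch with made-up data (`df_demo` and the `id` column are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "value": ["a", "a", "b", "c"],
})

# Number of rows that are exact copies of an earlier row
n_dupes = df_demo.duplicated().sum()

# Remove exact duplicates (keeps the first occurrence by default)
full = df_demo.drop_duplicates()

# Treat rows with the same 'id' as duplicates and keep the last one
by_id = df_demo.drop_duplicates(subset=["id"], keep="last")
```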

# **Outliers**

Handling outliers in a DataFrame is an important task in data preprocessing, as they can significantly affect analyses and predictive models. Here are several common techniques for dealing with outliers, depending on the situation and the type of data.

1. Identifying Outliers

We can identify outliers using various methods, such as the interquartile range (IQR), z-score, or simply through visualization (e.g., boxplots).

The IQR is the range between the first quartile (Q1) and the third quartile (Q3), that is, IQR = Q3 - Q1. A value is considered an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

In [None]:
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'A': [10, 12, 14, 15, 100],  # The value 100 is a potential outlier
    'B': [1, 2, 2, 3, 100]       # The value 100 is a potential outlier
}
df = pd.DataFrame(data)

# Calculate the interquartile range (IQR) for column A
Q1 = df['A'].quantile(0.25)  # First quartile (25%)
Q3 = df['A'].quantile(0.75)  # Third quartile (75%)
IQR = Q3 - Q1

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['A'] < lower_bound) | (df['A'] > upper_bound)]
print("Outliers:\n", outliers)
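
The same rule can be wrapped in small helpers so it can be applied to several columns at once. This is only a sketch: `iqr_bounds` and `iqr_outlier_mask` are hypothetical helper names, and `k=1.5` is the conventional multiplier.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_mask(frame, columns, k=1.5):
    """Return a boolean DataFrame: True where a value is outside the bounds."""
    mask = pd.DataFrame(False, index=frame.index, columns=columns)
    for col in columns:
        low, high = iqr_bounds(frame[col], k)
        mask[col] = (frame[col] < low) | (frame[col] > high)
    return mask

df_demo = pd.DataFrame({'A': [10, 12, 14, 15, 100], 'B': [1, 2, 2, 3, 100]})

bounds_a = iqr_bounds(df_demo['A'])            # Q1 = 12, Q3 = 15, IQR = 3
mask = iqr_outlier_mask(df_demo, ['A', 'B'])
rows_with_outliers = df_demo[mask.any(axis=1)]  # rows with an outlier in any column
```

For column A the bounds are 7.5 and 19.5, so only the value 100 is flagged.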

2. Removing Outliers

Once identified, you can remove outliers by filtering the DataFrame to keep only the values within the defined bounds.

In [None]:
# Filter the DataFrame to remove outliers in column A
df_filtered = df[(df['A'] >= lower_bound) & (df['A'] <= upper_bound)]
print("DataFrame without outliers:\n", df_filtered)

3. Replacing Outliers

Another option is to replace outliers with more manageable values, such as the mean or median of the column.

Example: Replacing Outliers with the Median

In [None]:
# Replace outliers in column A with the median of the column
median = df['A'].median()
df['A'] = np.where((df['A'] < lower_bound) | (df['A'] > upper_bound), median, df['A'])
print("DataFrame with outliers replaced:\n", df)

4. Applying Transformations (Scaling or Normalization)

If you do not want to remove or modify outliers, you can apply transformations that reduce their impact, such as normalization or logarithmic scaling.

In [None]:
df['A'] = np.log1p(df['A'])  # log(1 + A): defined at 0, but values must be greater than -1

5. Winsorization

Winsorization is a technique where extreme values are replaced by the nearest value within a defined percentile limit, reducing the impact of outliers without removing them completely.

In [None]:
from scipy.stats import mstats

# Winsorize to limit outliers to the 5th and 95th percentiles
df['A'] = mstats.winsorize(df['A'], limits=[0.05, 0.05])
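
One detail worth checking: the `limits` are fractions of the number of values, so on a very small sample a 5% limit may not touch anything. A sketch on the five sample values of column A (the 20% limit is an assumption chosen only so that one value per tail is affected):

```python
import numpy as np
from scipy.stats import mstats

values = np.array([10, 12, 14, 15, 100])

# 5% of 5 values is less than one value per tail, so nothing is replaced
w_05 = np.asarray(mstats.winsorize(values, limits=[0.05, 0.05]))

# 20% of 5 values is one value per tail: the lowest and highest are replaced
w_20 = np.asarray(mstats.winsorize(values, limits=[0.2, 0.2]))
```

With the 20% limits, 10 is replaced by 12 and 100 is replaced by 15.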

6. Detecting Outliers with Z-Score

Another way is to use the z-score to calculate how far values are from the mean in terms of standard deviations.

In [None]:
from scipy import stats

# Calculate the z-score for each value in column A
z_scores = np.abs(stats.zscore(df['A']))

# Set a threshold for outliers (e.g., z > 3)
outliers = df[z_scores > 3]
print("Outliers with z-score:\n", outliers)
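
Keep in mind that the z-score threshold depends on the sample size: with the population standard deviation, the largest possible |z| among n values is sqrt(n - 1), so with only 5 rows no value can reach 3. A pandas-only sketch (`zscore_outliers` is a hypothetical helper, and the `large` Series is made-up data):

```python
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    # ddof=0 (population std) matches the default of scipy.stats.zscore
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]

small = pd.Series([10, 12, 14, 15, 100])          # 5 values
large = pd.Series(list(range(10, 30)) + [100])    # 21 values

flagged_small = zscore_outliers(small)  # nothing is flagged
flagged_large = zscore_outliers(large)  # only the value 100 is flagged
```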

7. Visualizing Outliers

You can use plots like boxplots or scatter plots to easily visualize outliers.

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot to visualize outliers
plt.boxplot(df['A'])
plt.title('Boxplot of Column A')
plt.show()

# **Irrelevant/Redundant columns**

This consists of removing the irrelevant or redundant columns from the DataFrame.

In [None]:
df.drop(columns=['A', 'C'], inplace=True)

For categorical data, it is important to review the number of unique values (sublevels) in each column.

In [None]:
cols_cat = ['A', 'B', 'C']

# Iterate over the list of column names (not an attribute of df)
for col in cols_cat:
    print(f'Column {col}: {df[col].nunique()} sublevels')

# **Single-level categorical column/Single-value numerical column**

Here we can delete rows by filtering on certain conditions.

In [None]:
# Delete rows where column B (categorical) is 'apple'
df_filtered = df[df['B'] != 'apple']

# Then delete rows where column A (numeric) is greater than 3
# (filter df_filtered, not df, so the first filter is not lost)
df_filtered = df_filtered[df_filtered['A'] <= 3]
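
A column with a single level or a single value carries no information for a model, so one option is to detect and drop such columns. A minimal sketch, assuming made-up data (`df_demo` and its column names are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "country": ["PE", "PE", "PE"],   # single-level categorical column
    "flag": [1, 1, 1],               # single-value numerical column
    "score": [0.1, 0.5, 0.9],
})

# Columns with at most one distinct value (missing values count as a value)
constant_cols = [c for c in df_demo.columns
                 if df_demo[c].nunique(dropna=False) <= 1]

df_reduced = df_demo.drop(columns=constant_cols)
```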

# **Typographical errors/Change the values of some elements**

We can change the values of selected elements in a DataFrame column.

In [None]:
#Change 'Blu' and 'Bleu' to 'Blue'
df['Color'] = df['Color'].replace({'Blu': 'Blue', 'Bleu': 'Blue'})

#Change the values of 'column1' that are greater than 30 to the value 100
df.loc[df['column1'] > 30, 'column1'] = 100
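
Many typographical differences are only extra spaces or inconsistent capitalization. Normalizing those first keeps the replacement dictionary short. A sketch on a made-up Series (the values are illustrative):

```python
import pandas as pd

colors = pd.Series([" blue", "Blue ", "BLU", "bleu", "Red"])

cleaned = (
    colors.str.strip()    # remove leading and trailing spaces
          .str.title()    # consistent capitalization: 'BLU' -> 'Blu'
          .replace({"Blu": "Blue", "Bleu": "Blue"})  # fix the remaining typos
)
```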

# **Change datatype**

*   A common case is a column stored as "object" type that should be numeric (integer or float)
*   We can also use it to convert a date column to the datetime type



In [None]:
df['A'] = df['A'].astype(int)

df['Fecha'] = pd.to_datetime(df['Fecha'])

df['Categoría'] = df['Categoría'].astype('category')
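
`astype(int)` fails if the column contains text that is not a number or missing values. A more tolerant option, sketched here on made-up data, is to convert with `errors='coerce'`, which turns unparseable entries into missing values that can then be handled as in the Missing data section:

```python
import pandas as pd

raw = pd.Series(["1", "2", "three", None])

# Unparseable entries become NaN instead of raising an error
as_number = pd.to_numeric(raw, errors="coerce")

# Nullable integer type that can hold missing values
as_int = as_number.astype("Int64")

# The same idea for dates: invalid entries become NaT
dates = pd.to_datetime(pd.Series(["2024-01-31", "not a date"]), errors="coerce")
```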