# Pre-processing

#### Data cleaning

<p>

    Missing Data:
        Removing the object if it has a large amount of missing data
        Imputing empirically determined values
        Assigning a constant value for missing values
        Hot-deck imputation, imputing values based on similar objects or close distances
        Sorting the dataset and using previous or subsequent values
        Using mean or mode values from the entire dataset
        Using mean or mode values from objects of the same class
        Using classification/clustering/regression models to impute a value

    Noisy Data:
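        Equal-width binning: Divide the value range into bins of equal width, then smooth each value using its bin's mean or nearest bin boundary
        Equal-frequency binning: Divide the sorted values into bins holding the same number of objects, then smooth each value using its bin's mean or nearest bin boundary
        Clustering: Group similar values into clusters and treat values that fall outside the clusters as candidate noise
        Approximation: Fit a function to the data (e.g., a regression line) and replace values with the fitted ones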

    Inconsistent Data:
        To address this type of problem, it is necessary to analyze and understand the dataset. A value outside the attribute's domain can be treated by examining the dataset and determining if it is a valid outlier or an error that needs to be corrected.
</p>
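
The binning items under Noisy Data can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names `smooth_by_bin_means` and `smooth_by_bin_boundaries` and the sample `prices` values are assumptions made for the example, and the bins are equal-frequency bins built with `np.array_split`.

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins,
    and replace each value with the mean of its bin."""
    ordered = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(ordered, n_bins)
    return np.concatenate([np.full(len(b), b.mean()) for b in bins])

def smooth_by_bin_boundaries(values, n_bins):
    """Sort the values, split them into equal-frequency bins,
    and replace each value with the nearest bin boundary (min or max)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(ordered, n_bins)
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary.
        smoothed.append(np.where(b - low <= high - b, low, high))
    return np.concatenate(smoothed)

# Hypothetical sample values, chosen only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
means = smooth_by_bin_means(prices, 3)
bounds = smooth_by_bin_boundaries(prices, 3)
print(means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(bounds)  # [ 4.  4. 15. 21. 21. 24. 25. 25. 34.]
```

Both helpers assume there are at least as many values as bins; an empty bin would produce a `nan` mean.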

#### Data integration

<p>

    In the process of integrating data from different databases, it is necessary to check for:
        Redundancy between objects and attributes (e.g., age and date of birth)
        Duplication between objects and attributes (same data in different databases)
        Conflicts between objects and attributes (e.g., attribute scales and different values for the same object)
</p>
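
A rough sketch of the duplication and conflict checks, using pandas. The two source frames (`source_a`, `source_b`) and their values are hypothetical; only the column names are borrowed from the example dataset further down.

```python
import pandas as pd

# Two hypothetical sources describing overlapping objects.
source_a = pd.DataFrame({"id": [1, 2, 3], "idade": [15, 38, 49], "uf": ["SP", "MG", "RJ"]})
source_b = pd.DataFrame({"id": [3, 4, 2], "idade": [49, 71, 40], "uf": ["RJ", "BA", "MG"]})

combined = pd.concat([source_a, source_b], ignore_index=True)

# Duplication: rows that repeat an earlier row exactly.
exact_duplicates = combined[combined.duplicated(keep="first")]
deduplicated = combined.drop_duplicates()

# Conflicts: after removing exact duplicates, an id that still appears
# more than once carries different attribute values in the two sources.
is_conflicting = deduplicated.duplicated(subset="id", keep=False)
conflict_ids = deduplicated.loc[is_conflicting, "id"].unique()

print(exact_duplicates["id"].tolist())  # [3]
print(conflict_ids.tolist())            # [2]
```

Resolving a conflict (here, which `idade` is right for id 2) still requires knowledge of the sources; the code only flags the candidates.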

#### Dimensionality reduction

<p>

    Feature Selection:
        Elimination of repetitive or unnecessary attributes

    Attribute Compression:
        Principal Component Analysis (PCA): exploits correlations between attributes to project them onto a smaller set of uncorrelated components

    Data Reduction:
        Sampling:
            Sampling without replacement
            Sampling with replacement
            Systematic Sampling
            Stratified Sampling
            Cluster Sampling
        Approximation Models:
            Approximation Function
            Clustering (k-means)
</p>
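
Several of the sampling strategies listed above map directly onto pandas calls. The sketch below assumes a hypothetical 20-row frame named `data`; the sample sizes, the step of 4, and the `random_state` values are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical dataset: 20 objects, 14 in class 0 and 6 in class 1.
data = pd.DataFrame({"id": range(1, 21), "vacina": [0] * 14 + [1] * 6})

# Sampling without replacement: each object is drawn at most once.
without_replacement = data.sample(n=5, replace=False, random_state=0)

# Sampling with replacement: the same object may be drawn several times.
with_replacement = data.sample(n=5, replace=True, random_state=0)

# Systematic sampling: take every 4th object, starting from the first.
systematic = data.iloc[::4]

# Stratified sampling: draw the same fraction from each class,
# so the class proportions of the original data are preserved.
stratified = data.groupby("vacina").sample(frac=0.5, random_state=0)

print(systematic["id"].tolist())  # [1, 5, 9, 13, 17]
print(stratified["vacina"].value_counts().sort_index().tolist())  # [7, 3]
```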

#### Data transformation

<p>

    Standardization:
        Standardizing uppercase and lowercase letters
        Standardizing special characters
        Standardizing formats (e.g., dates)
        Standardizing units and scales

    Normalization:
        Min-max: rescales values into the range [0, 1]
        Z-score: rescales values using the mean and standard deviation (useful when the maximum and minimum are unknown or when outliers are present)
        Decimal scaling
        Normalization by interquartile range
</p>
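
The four normalization methods can be written in a few lines of NumPy. This is a sketch under some assumptions: the function names and the sample `ages` are made up for the example, the z-score uses the population standard deviation (`ddof=0`), and the interquartile variant centers on the median, which is one common convention.

```python
import numpy as np

def min_max(x):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Center on the mean and divide by the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Divide by 10**j, where j is the smallest integer with max(|x|) / 10**j < 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10**j

def iqr_scaling(x):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

# Hypothetical ages, for illustration only.
ages = [14, 33, 48, 61, 87]
print(min_max(ages))
print(z_score(ages))
print(decimal_scaling(ages))
print(iqr_scaling(ages))
```

All four helpers assume non-constant input with a nonzero maximum absolute value; a constant column would cause a division by zero.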

#### Data discretization

<p>

    Transforming attribute values into categorical data:
        Predefined intervals
        Binning
        Histogram analysis
        Clustering
        Entropy
</p>
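
Predefined intervals and binning can be illustrated with `pd.cut` and `pd.qcut`. The sample `ages`, the interval edges, and the labels below are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([14, 22, 35, 48, 61, 87])

# Predefined intervals: (0, 17], (17, 59], (59, 120].
age_group = pd.cut(ages, bins=[0, 17, 59, 120], labels=["minor", "adult", "senior"])

# Equal-width binning: 3 bins spanning the observed range;
# labels=False returns the integer bin index instead of an interval.
equal_width = pd.cut(ages, bins=3, labels=False)

# Equal-frequency binning: 3 bins with (roughly) the same number of objects.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(age_group.tolist())    # ['minor', 'adult', 'adult', 'adult', 'senior', 'senior']
print(equal_width.tolist())  # [0, 0, 0, 1, 1, 2]
print(equal_freq.tolist())   # ['low', 'low', 'mid', 'mid', 'high', 'high']
```

Note how the two binning strategies disagree on the same data: equal-width bins leave the last bin with a single object, while equal-frequency bins hold two objects each.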

### Example

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('../pre-processing/data/covid.csv', sep=';', encoding='utf-8')
df.head()

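| | id | idade | uf | renda | vacina |
|---|---|---|---|---|---|
| 0 | 1 | 15.0 | SP | C | 1 |
| 1 | 2 | 38.0 | MG | NaN | 0 |
| 2 | 3 | 49.0 | RJ | C | 0 |
| 3 | 4 | 71.0 | BA | NaN | 0 |
| 4 | 5 | 58.0 | MG | D | 1 |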


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      50 non-null     int64  
 1   idade   30 non-null     float64
 2   uf      29 non-null     object 
 3   renda   31 non-null     object 
 4   vacina  50 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 2.1+ KB


In [5]:
df.describe()

| | id | idade | vacina |
|---|---|---|---|
| count | 50.0 | 30.0 | 50.0 |
| mean | 25.5 | 48.666667 | 0.3 |
| std | 14.57738 | 22.197131 | 0.46291 |
| min | 1.0 | 14.0 | 0.0 |
| 25% | 13.25 | 33.5 | 0.0 |
| 50% | 25.5 | 48.5 | 0.0 |
| 75% | 37.75 | 61.0 | 1.0 |
| max | 50.0 | 87.0 | 1.0 |


In [6]:
# check for redundant values (the same state written in different cases)
df['uf'].value_counts()

uf
SP    9
MG    5
BA    3
RS    3
RJ    2
PE    2
sp    1
ce    1
rj    1
CE    1
AM    1
Name: count, dtype: int64

In [7]:
df['uf'] = df['uf'].str.upper()
df['uf'].value_counts()

uf
SP    10
MG     5
RJ     3
BA     3
RS     3
CE     2
PE     2
AM     1
Name: count, dtype: int64

In [8]:
df['vacina'].value_counts()

vacina
0    35
1    15
Name: count, dtype: int64

In [9]:
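# fill missing idade values with the rounded mean
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['idade'] = df['idade'].fillna(round(df['idade'].mean()))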

In [10]:
# fill missing uf values with the previous record's value (forward fill)
# (ffill() replaces the deprecated fillna(method='ffill'))
df['uf'] = df['uf'].ffill()

In [11]:
# fill missing renda values with the next record's value (backward fill)
# (bfill() replaces the deprecated fillna(method='bfill'))
df['renda'] = df['renda'].bfill()
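
The cells above impute `idade` with the mean of the entire dataset. The class-based alternative from the Missing Data list (mean of objects of the same class) could look like the sketch below. It uses a small hypothetical frame named `sample` rather than `covid.csv`, and treating `vacina` as the class attribute is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two classes (vacina = 0 and 1), each with one missing age.
sample = pd.DataFrame({
    "idade": [20.0, np.nan, 30.0, 60.0, np.nan, 70.0],
    "vacina": [0, 0, 0, 1, 1, 1],
})

# Mean age of each class, broadcast back to every row of that class.
class_mean = sample.groupby("vacina")["idade"].transform("mean")

# Fill each missing age with the mean of its own class.
sample["idade"] = sample["idade"].fillna(class_mean)

print(sample["idade"].tolist())  # [20.0, 25.0, 30.0, 60.0, 65.0, 70.0]
```

With the global mean, both gaps would have received 45.0; the class-based version keeps the two groups apart.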