# **KNN Lib ® 2024**

 **Algoritmo de K-Nearest Neighbors - KNN**

![KNN](https://github.com/aluipio/ds_ada_santander_knn/blob/main/images/ds_ml_ada.png?raw=true)

***1. Do objetivo:***

O objetivo deste projeto é recriar o algoritmo do KNN para calcular e classificar um dataset específico.


***2. Das limitações:***

Ficando restrito a ementa do curso Santander Coders - ADA Tech, não foram utilizados:
* Recursos avançados;
* Orientação a Objetos;
* Lib estrangeiras: Pandas, Numpy ou Scikit Learning;


***3. Referências (ou materiais consultados):***

Artificial Intelligence. (2024, 3 de Janeiro). What are the most effective distance metrics for optimizing k-nearest neighbors algorithms? Linkedin.com; www.linkedin.com. https://www.linkedin.com/advice/3/what-most-effective-distance-metrics-optimizing-xndwc.

Bruce, P., & Bruce, A. (2019). Estatística prática para cientistas de dados: 50 conceitos essenciais. Alta Books.

de Maquina, A. [@aprendizagemdemaquina9452]. (2021, March 4). O que é o KNN e como implementar do zero. Youtube. Acesso em 20 Jan 2024 de https://www.youtube.com/watch?v=E7R6O4Aqw-M.

Comunidade Ada. (n.d.). Ada.Tech. Acesso em 18 Jan 2024 de https://lms.ada.tech/student.

Fávero, L. P., Lopes E, B., & Prado, P. (2017). Manual de análise de dados: estatística e modelagem multivariada com Excel, SPSS e Stata. Elsevier.

Kadiwal, A. (2021). Water Quality [Data set]. In Drinking Water Potability. Acesso em 12 Jan 2024 de  https://www.kaggle.com/datasets/adityakadiwal/water-potability/data.

Kaggle: Your machine learning and data science community. (n.d.). Kaggle.com. Acesso em 20 Jan 2024 de https://www.kaggle.com.

Kunumi. (2020, 10 Junho). Métricas de Avaliação em Machine Learning: Classificação. Kunumi Blog. Acesso em 17 Jan 2024 de https://medium.com/kunumi/m%C3%A9tricas-de-avalia%C3%A7%C3%A3o-em-machine-learning-classifica%C3%A7%C3%A3o-49340dcdb198.

Matos, G. (2023, December 5). K-Nearest Neighbors(KNN): Entendendo o seu funcionamento e o construindo do zero. Share! Por Ateliê de Software. Acesso em 18 Jan 2024 de https://share.atelie.software/k-nearest-neighbors-knn-entendo-o-seu-funcionamento-e-o-construindo-do-zero-a21b022acd6f.

PEP 257 – docstring conventions. (n.d.). Python.org. Acesso em 23 Jan 2024 de vhttps://peps.python.org/pep-0257.

Srivastava, T. (2018, 25 março). A complete guide to K-Nearest Neighbors (updated 2024). Analytics Vidhya. Acesso em 24 Jan 2024 de https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering.

Tavares, C. (2019, 26 Março). KNN sem caixa preta. Medium. Acesso em 22 Jan 2024 de https://medium.com/@caroli.agro/aplicando-knn-em-iris-dataset-d594b79652d1.

Yu, C., Ooi, B. C., Tan, K., & Jagadish, H. V. (2001). Indexing the Distance: An Efficient Method to KNN Processing. In Very Large Data Bases Conference.


***4. Integrantes do grupo:***
* Anderson Miranda - ID: 1116003
* André Kuster - ID: 1116029
* Arthur Steins - ID: 1116023
* João Souza - ID:
* Juliana Bertolucci Peixoto - ID: 1116030

### **Core Functions**

#### Functions Files

In [None]:
import csv

In [None]:
###########################################
# Criar arquivo CSV com dados
###########################################
def create_csv(data, name:str = 'data.csv', delimiter: str = ';'):
    '''
    Creates a CSV file from the input data.

    It takes a dictionary or a list of lists as input and writes it to a CSV file.

    The name of the file and the delimiter can be specified.

    Parameters
    -------------
        data: dict or list of lists
            The to-be-written-to-the-CSV-file input data.
            The keys will be used as column headers if it's a dictionary.
        name: str, optional
            The to-be-created CSV file's name
            The default is 'data.csv'.
        delimiter: str, optional
            The used character to separate values in the CSV file.
            The default to ';'.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary or a list of lists.
    '''
    # Cria o Arquivo CSV
    _name = str(name) if '.csv' in str(name) else str(name) + '.csv'
    _archive = open(_name, 'w')

    # Escreve no Arquivo CSV
    _escritor = csv.writer(_archive, delimiter=';', lineterminator='\n')
    _data_table = dict_to_list(data) if type(data) == dict else data
    _escritor.writerows(_data_table)

    # fecha e salva o arquivo
    _archive.close()

In [None]:
###########################################
# Ler arquivo CSV
###########################################
def read_csv(name, delimiter=',', tipo=None):
    '''
    Reads a CSV file and returns the data.

    It takes the name of a CSV file, a delimiter, and an optional type as input; reads the CSV file and returns the data in the specified type.

    Parameters
    -------------
        name: str
            The to-be-read CSV file name.
        delimiter: str, optional
            The used character to separate values in the CSV file.
            Defaults to ','.
        tipo: type, optional
            The type to which the data should be converted.
            If not specified, the function will infer the type based on the data.

    Returns
    -------------
        dict or list of lists
            The CSV file data converted to the specified type.

    Raises
    -------------
        FileNotFoundError
            If the specified CSV file does not exist.
    '''

    # Verifica nome do arquivo
    _name = str(name) if '.csv' in str(name) else str(name) + '.csv'

    try:
        with open(_name, 'r') as arquivo:
            data_reader = csv.reader(arquivo, delimiter=delimiter, lineterminator='\n')
            header = next(data_reader)
            data_dict = {col: [] for col in header}

            for row in data_reader:
                for col, value in zip(header, row):
                    data_dict[col].append(convert_value(value))

        return data_dict

    except FileNotFoundError as e:
        raise FileNotFoundError(f"Arquivo CSV '{_name}' não encontrado.") from e

    except csv.Error as e:
        raise ValueError(f"Erro ao ler o arquivo CSV '{_name}': {e}") from e

In [None]:
###########################################
# Converte dicionário para lista
###########################################
def dict_to_list(data):
    '''
    Converts a dictionary into a list of lists.

    If the input data is a dictionary, the function converts it into a list of lists where the first list is the dictionary's keys, and the subsequent lists are the values from each key.
    If the input data is already a list, the function returns the input list.

    Parameters
    -------------
        data: dict or list
            The input data (dictionary or a list).

    Returns
    -------------
        list
            A list of lists representing the input data (if a dictionary).
            The input list (if a list).

    Raises
    -------------
        TypeError
            If the input data is not a dictionary or a list.
    '''

    if type(data) == dict:
        _info = data_info(data)
        _num_rows = _info['num_rows']
        _data_table = [_info['columns']]
        _data_table.extend([data_row(data, i) for i in range(_info['num_rows'])])
        return _data_table
    elif type(data) == list:
        return data
    else:
        raise TypeError('Input data must be a dictionary or a list.')

In [None]:
###########################################
# Converte lista de listas para dicionário
###########################################
def list_to_dict(data):
    '''
    Converts a list into a dictionary.

    If the input data is a list, the function converts it into a dictionary where the keys are the first list's elements, and the values are the corresponding elements from the subsequent lists.
    If the input data is already a dictionary, the function returns the input dictionary.

    Parameters
    -------------
        data: list or dict
            The input data. It can be a list or a dictionary.

    Returns
    -------------
        dict
            A dictionary representing the input data.

    Raises
    -------------
        TypeError
            If the input data is not a list or a dictionary.
    '''
    if type(data) == list:
        return {name:[row[i] for row in data[1:]] for i, name in zip(range(len(data[0])), data[0])}
    elif type(data) == dict:
        return data
    else:
      raise TypeError('Input data must be a dictionary or a list.')

#### Functions Dataset

In [None]:
###########################################
# Apresenta uma leitura do Dataset
###########################################
def data_info(data):
    '''
    Return a dictionary with information about the input data.

    The returned dictionary includes the following keys:
        - 'columns': list of keys from the input dictionary.
        - 'num_columns': number of keys in the input dictionary.
        - 'num_rows': the list associated with the first key's length in the input dictionary.
        - 'dimension': a tuple with the columns' and rows' number.
        - 'info_col_{column_name}': a tuple with the null values and the type of values' number in the column.

    If the column contains varied types, the type will be returned as 'mixed'.

    Parameters
    -------------
    data: dict
        The input dictionary.

    Returns
    -------------
        dict
            A dictionary with the input data's information.
        '''

    # Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')

    _info = {}
    _data = data
    _info['columns'] = list(_data.keys())
    _info['num_columns'] = len(_data)
    _info['num_rows'] = len(_data[_info['columns'][0]])
    _info['dimension'] = (_info['num_columns'], _info['num_rows'])

    for _col in _data.keys():
        _var_null = _data[_col].count("")
        _var_type = str(type(_data[_col][0]))
        for _ in _data[_col]:
            if str(type(_)) != _var_type:
                _var_type = 'mixed'

        _info[f'info_col_{_col[:20]:_>20}'] = (f"null  {_var_null:>4}",f"type   {_var_type:>6}")

    return _info

In [None]:
###########################################
# Descreve os dados
###########################################
def data_describe(data):
    '''
    Describes the data in the input dictionary.

    It ierates over each dictionary's column, calculates various statistics, and returns a string that describes the data.

    Parameters
    -------------
        data: dict
        The input dictionary. Each key represents a column, and the corresponding value is a list of data for that column.

    Returns
    -------------
        str
        A string that describes the data, including the type, quantity, maximum, minimum, sum, range, mean, and median for each dictionary's column.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''

    # Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')

    _data = data

    # Verifica o tipo de cada variável
    _info = {}
    for _col in _data.keys():
        _type = type(_data[_col][0])
        for _ in _data[_col]:
            if type(_) != _type:
                _type = 'mixed'
        _info[_col] = _type

    _var = [f"{'Analysis':<12}"]
    _var += [_col for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Tipo':<12}"] + [str(_info[_col]) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Quant':<12}"] + [len(_data[_col]) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Máximo':<12}"] + [round(max(_data[_col]), 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Mínimo':<12}"] + [round(min(_data[_col]), 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Soma':<12}"] + [round(sum(_data[_col]), 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Amplitude':<12}"] + [round(max(_data[_col])-min(_data[_col]), 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Média':<12}"] + [round(sum(_data[_col])/len(_data[_col]), 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']
    _var += [f"{'Mediana':<12}"] + [round(sorted(_data[_col])[int(len(_data[_col])/2)], 5) for _col in _data.keys() if _info[_col] not in ['mixed', str]] + ['\n']

    return "".join([f'{str(_)[:20]:>20}' for _ in _var])

In [None]:
###########################################
# Retorna dados de uma linha
###########################################
def data_row(data, row: int = 0):
    '''
    Returns a list of values from the specified row in the input dictionary.

    The function retrieves the values from the specified row across all columns in the dictionary. If the selected row index exceeds the number of rows in the dictionary, the function retrieves the values from the last row.

    Parameters
    -------------
        data: dict
            The input dictionary.
            Each key represents a column, and the corresponding value is a data list for that column.
        row: int, optional
            The index of the row to retrieve the values from.
            The default is 0.

    Returns
    -------------
        list
            A list of values from the specified row in the dictionary.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''

    #Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError("Input data must be a dictionary.")

    _columns = list(data.keys())
    _num_rows = len(data[_columns[0]])
    _row = int(row) if int(row) < _num_rows else _num_rows - 1

    return [data[col][_row] for col in _columns]

In [None]:
###################################################
# Retorna coluna de dados em formato especificado
###################################################
def data_col(data, column:str, tipo = list):
    '''
    Returns the data of a specified column in the desired format.

    The function retrieves the specified column data in the dictionary and returns it in the format specified by the 'type' parameter.

    Parameters
    -------------
        data: dict
        The input dictionary. Each key represents a column, and the corresponding value is a list of data for that column.
        column: str
        The name of the column from which data will be retrieved from.
        tipo: type, optional
        The desired format for the returned data. It can be a list (default), tuple, or dict.

    Returns
    -------------
        list, tuple, or dict
        The data from the specified column is now in the desired format. If the specified column does not exist in the dictionary, the function returns None.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''
    # Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError("Input data must be a dictionary.")

    if type == tuple:
        return (column, data[column]) if column in data.keys() else None
    if type == dict:
        return {column : data[column]} if column in data.keys() else None
    else:
        return [column, data[column]] if column in data.keys() else None

In [None]:
###########################################
# Quantidade por tipo
###########################################
def data_value_counts(data, col:str = ""):
    '''
    Counts each unique value's occurrence in the specified column or list.

    If the input data is a dictionary, the function counts each unique value's occurrence in the specified column.
    If it is a list, the function calculates each unique occurrence value in the list.

    Parameters
    -------------
        data: dict or list
            The input data. It can be a list of data or a dictionary (each key represents a column, and the corresponding value is a list of data for that column).
        col: str, optional
            The column to count the unique values in.
            This parameter is ignored if the input data is a list.

    Returns
    -------------
        dict:
            A dictionary where each key is a unique value from the specified column or list, and each value is its count.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary or a list.
    '''
    # Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict) and not isinstance(data, list):
        raise TypeError('Input data must be a dictionary or list.')

    if type(data) == dict:
        if col in list(data.keys()):
            _set = {_ for _ in data[col]}
            return dict(sorted({name : data[col].count(name) for name in _set}.items(), key=lambda x: x[1]))
    elif type(data) == list:
        _set = {_ for _ in data}
        return dict(sorted({name : data.count(name) for name in _set}.items(), key=lambda x: x[1]))
    else:
        return None

In [None]:
###########################################
# Retorna conjunto em serie
###########################################
def data_serie(data, row: int = 0):
    '''
    Returns a dictionary that represents a row of data from the input dictionary.

    The function retrieves the values from the specified row across all columns in the dictionary. It returns a dictionary with column names as the keys and corresponding values from the specified row as the values.

    Parameters
    -------------
        data: dict
            The input dictionary. Each key represents a column, and the corresponding value is a list of data for that column.
        column: str
            The column's name from which data will be retrieved.
        row: int, optional
            The row's index to retrieve the values from. The default is 0.

    Returns
    -------------
        dict:
            A dictionary that represents a row of data from the input dictionary.
            The keys are the column names, and the values are the corresponding values from the specified row.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''
    #Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')

    _info = data_info(data)

    return dict(zip(_info['columns'], data_row(data, row)))

In [None]:
###########################################
# Visualiza o dataset
###########################################
def data_view(data, limit: int = 10):
    '''
    Prints a data's formatted view up to the specified limit.

    The function retrieves the input dictionary's values and prints a formatted data view up to the specified limit.
    The view includes the column names and the values from each row up to the limit.
    If the limit is less than the number of rows in the data, the function also indicates fewer rows than the function's limit.

    Parameters
    -------------
        data: dict
                The input dictionary. Each key represents a column, and the corresponding value is a data list for that column.
        limit: int, optional
            The maximum number of rows to display.
            The default is 10.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''

    # Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
        raise TypeError("Input data must be a dictionary.")

    _data = data.copy()
    info = data_info(_data)

    row_title = f"|{'n':^5}|"
    row_title += "".join([f"{col[0:6]:^11}|" for col in info['columns']])

    print('-'*len(row_title))
    print(f'|{"VIEW DATASET":^{len(row_title)-2}}|')
    print('-'*len(row_title))

    print(row_title)
    print('-'*len(row_title))

    for i in range(info['num_rows'] if limit > info['num_rows'] else limit):
        row_set = f"|{i:^5}|"
        row_set += "".join([f"{str(_data[col][i])[:9]:^11}|" for col in info['columns']])
        print(row_set)

    if limit < info['num_rows']:
        row_set = f"|{'..':^5}|"
        row_set += "".join([f"{'...':^11}|" for _ in info['columns']])
        print(row_set)

    print('-'*len(row_title))
    dimension = info['dimension']
    print(f'|{f"  col/row: {dimension} | limit view: {limit}":<{len(row_title)-2}}|')
    print('-'*len(row_title))

In [None]:
###########################################
# Identifica e converte valor
###########################################
def convert_value(value):
    '''
    Converts a value to a specific type.

    It takes a value as input and converts it to a specific type based on the value itself.

    Parameters
    -------------
        value: str
            The value to be converted.

    Returns
    -------------
        int, float, or str
            The converted value.

    Raises
    -------------
        ValueError
            If the value cannot be converted to int or float.
    '''

    if value.isalpha():
        return str(value)
    elif value.isnumeric():
        return int(value)
    elif '.' in value and value.replace('.', '').isnumeric():
        return float(value)
    else:
        return str(value)

In [None]:
###########################################
# Deletar Atributo do dicionário
###########################################
def data_drop_attribute(data, drop):
    '''
    Removes a key-value pair from the input dictionary.

    The function takes a dictionary and a string as input removes the key-value pair from the dictionary (the key matches the provided string) and returns the modified dictionary.

    Parameters
    -------------
        data: dict
            The input dictionary.
        drop: str
            The key to be removed from the dictionary.

    Returns
    -------------
        dict
            The modified dictionary (with the specified key-value pair removed).

    Raises
    -------------
        TypeError
            If the input data is not a dictionary or the drop is not a string.
    '''

    if not isinstance(data, dict):
      raise TypeError('Input data must be a dictionary.')
    if not isinstance(drop, str):
      raise TypeError('Drop must be a string.')

    _data = data.copy()
    if type(drop) == str and type(data) == dict:
        if drop in data.keys():
            _data.pop(drop)
    return _data

In [None]:
###########################################
# Deletar Linha NaN
###########################################
def data_drop_na(data):
    '''
    Removes rows with missing or null values from the input dictionary.

    It takes a dictionary as input and iterates over its rows; if a row contains a missing or null value (represented as " " or None), the function removes that row from the dictionary and returns the modified dictionary.

    Parameters
    -------------
        data: dict
            The input dictionary in which each key represents a column, and the corresponding value is a data list for that column.

    Returns
    -------------
        dict
            The modified dictionary with removed rows containing missing or null values.

    Raises
    ------------
        TypeError
            If the input data is not a dictionary.
    '''

    #Verificação de tipo de data inserido (type checking)
    if not isinstance(data, dict):
      raise TypeError('Input data must be a dictionary.')

    _data = data.copy()
    _columns = list(data.keys())
    _num_rows = len(data[_columns[0]])

    for _indice in range(_num_rows-1,-1,-1):
        if "" in data_row(data, _indice) or None in data_row(data, _indice):
            for _col in _columns:
                _data[_col].pop(_indice)

    return _data

In [None]:
###########################################
# Substituir valor NaN
###########################################
def data_fill_na(data, column:str, value):
    '''
    Replaces missing or null values in the input dictionary's specified column.

    It takes a dictionary, a column name, and a value as input, then replaces any missing or null values (represented as " " or None) in the specified column with the provided value and returns the modified dictionary.

    Parameters
    -------------
        data: dict
            The input dictionary in which each key represents a column, and the corresponding value is a data list for that column.
        column: str
            The name of the column in which to replace missing or null values.
        value:
            The value to replace missing or null values with.

    Returns
    -------------
        dict
            The modified dictionary with missing or null values in the specified column replaced.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary or the specified column does not exist in the dictionary.
    '''

    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')
    if column not in data:
        raise TypeError('Specified column does not exist in the dictionary.')

    _data = data.copy()
    _num_rows = len(data[column])

    for _indice in range(_num_rows):
        if data[column][_indice] in ("", None):
            data[column][_indice] = value

    return _data

#### Functions Statistics

In [None]:
###########################################
# Retorna media
###########################################
def data_mean(data_list) -> float:
    '''
    Calculates the values in the input list's mean.

    It takes a list as input and returns the values in the list's mean (average) (the mean is calculated as the sum divided by the number of values).

    Parameters
    -------------
        data_list: list
            The input list's numerical values.

    Returns
    -------------
        float
            The values in the input list's mean.

    Raises
    -------------
        TypeError
            If the input is not a list.
    '''
    if not type(data_list) is list:
        raise TypeError('Input must be a list.')

    _data = data_list
    _data = [_ for _ in data_list if _ != ""]
    return round(sum(_data)/len(_data), 3)

In [None]:
###########################################
# Retorna mediana
###########################################
def data_median(data_list) -> float:
    '''
    Calculates the values in the input list's median.

    It takes a list as input and returns the median of the values in the list (the median is calculated by sorting the list and selecting the middle value if the list length is odd, or the average of the two central values if the list length is even).

    Parameters
    -------------
        data_list: list
            The numerical values' input list.

    Returns
    -------------
        float
            The values in the input list's median.

    Raises
    -------------
        TypeError
            If the input is not a list.
    '''

    if not type(data_list) is list:
        raise TypeError("Deve passar uma lista como parametros.")

    _data = data_list
    _data = sorted([_ for _ in data_list if _ != ""])

    return _data[int(len(_data)/2)]

#### Functions Distances

In [None]:
###########################################
# Distancia Euclidiana entre dois pontos
###########################################
def euclidian_distance(point_1, point_2):
    '''
    Calculates the Euclidean distance between two points (in a space with arbitrary dimension).

    It takes two points (each represented as a list or tuple of coordinates) as input and calculates the Euclidean distance between them.
    The Euclidean distance is calculated as the square root of the sum of the squared differences between the corresponding coordinates of the two points.

    Parameters
    -------------
        point_1: list or tuple
            The first point's coordinates (each element can be an int or a float).
        point_2: list or tuple
            The second point's coordinates (each element can be an int or a float).

    Returns
    -------------
        float
            The Euclidean distance between the two points.

    Raises
    -------------
        ValueError
            If the two points do not have the same dimension number.
    '''
    # Verifica tipo de dados imputados
    if len(point_1) != len(point_2):
        raise ValueError('The points must have the same number of dimensions.')

    distance = 0
    for a, b in zip(point_1, point_2):
        distance += (a-b)**2

    return distance**(0.5)

In [None]:
###########################################
# Distancia Manhattan entre dois pontos
###########################################
def manhattan_distance(point_1, point_2):
    '''
    Calculates the Manhattan distance between two points (in a space with arbitrary dimension).

    It takes two points (each represented as a coordinates list or tuple) as input and calculates the Manhattan distance between them. The Manhattan distance is calculated as the sum of the absolute differences between corresponding coordinates of the two points.

    Parameters
    -------------
        point_1: list or tuple
            The first point's coordinates (each element can be an int or a float).
        point_2: list or tuple
            The second point's coordinates (each element can be an int or a float).

    Returns
    -------------
        float
            The Manhattan distance between the two points.

    Raises
    -------------
        ValueError
            If the two points do not have the same number of dimensions.
    '''
    # Verifica tipo de dados imputados
    if len(point_1) != len(point_2):
        raise ValueError('The points must have the same number of dimensions.')

    distance = 0
    for a, b in zip(point_1, point_2):
        distance += abs(a - b)

    return distance

In [None]:
###########################################
# Distancia de Minkowski entre dois pontos
###########################################
def minkowski_distance(point_1, point_2, p=3):
    '''
    Calculates the Minkowski distance between two points.

    It takes two points (each represented as a coordinates list or tuple) and a power parameter as input and calculates the Minkowski distance between them. The Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is calculated as the p-th root of the absolute differences between corresponding coordinates of the two points (each raised to the power p) sum.

    Parameters
    -------------
        point_1: list or tuple
            The first point's coordinates (each element can be an int or a float).
        point_2: list or tuple
              The second point's coordinates (each element can be an int or a float).
        p: int, optional
            The power parameter. The default is 3.

    Returns
    -------------
        float
            The Minkowski distance between the two points.

    Raises
    -------------
        ValueError
            If the two points do not have the same number of dimensions.
    '''
    # Verifica tipo de dados imputados
    if len(point_1) != len(point_2):
        raise ValueError('The points must have the same number of dimensions.')

    distance = sum(abs(a - b) ** p for a, b in zip(point_1, point_2))

    return distance ** (1 / p)

#### Functions KNN

In [None]:
###########################################
# Normaliza dados do dataset
###########################################
def ml_normalize(data):
    '''
    Normalizes the values in each column of the input dictionary.

    It takes a dictionary as input and normalizes the values in each column by subtracting the minimum value from each value and dividing by the range of the column; it then returns the modified dictionary.

    Parameters
    -------------
        data: dict
            The input dictionary in which each key represents a column, and the corresponding value is a list of data for that column.

    Returns
    -------------
        dict
            The modified dictionary with normalized values in each column.

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''
    # Verifica tipo de dados imputados
    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')

    _df = data.copy()
    if type(data) == dict:
        for attr in _df.keys():
            ls = _df[attr]
            maximo = max(ls)
            minimo = min(ls)
            f_normalize = lambda x: round((x - minimo) / (maximo - minimo), 4)
            _df[attr] = list(map(f_normalize, ls))

    return _df

In [None]:
###########################################
# Padroniza dados do dataset
###########################################
def ml_standardize(data):
    '''
    Standardizes the values in each column of the input dictionary.

    It takes a dictionary as input and standardizes the values in each column by subtracting the column mean from each value and then dividing by the column's standard deviation. The function then returns the modified dictionary.

    Parameters
    -------------
        data: dict
            The input dictionary in which each key represents a column, and the corresponding value is a list of data for that column.

    Returns
    -------------
        dict
            The modified dictionary (with standardized values in each column).

    Raises
    -------------
        TypeError
            If the input data is not a dictionary.
    '''
    # Verifica tipo de dados imputados
    if not isinstance(data, dict):
        raise TypeError('Input data must be a dictionary.')

    _df = data.copy()
    if type(data) == dict:
        for attr in _df.keys():
            ls = _df[attr]
            mean_val = sum(ls) / len(ls)
            std_dev = (sum((x - mean_val) ** 2 for x in ls) / len(ls)) ** 0.5
            f_standardize = lambda x: round((x - mean_val) / std_dev, 4)
            _df[attr] = list(map(f_standardize, ls))

    return _df

In [None]:
###########################################
# Separar dados de treino e dados de teste
###########################################
def ml_train_test_split(data_x, date_y, test_rate: float = 0.3):
    '''
    Splits the input data into training and testing sets.

    It takes as input a features dictionary, a targets list, and a test rate.
    It then splits the data into training and testing sets based on the test rate and returns the training features, the testing features, the training targets, and the testing targets.

    Parameters
    ------------
        data_x: dict
            The input features dictionary in which each key represents a feature, and the corresponding value is a list of data for that feature.
        date_y: list
            The input targets list.
        test_rate: float, optional
            The data's proportion to include in the test split.
            The default is 0.3.

    Returns
    ------------
        dict, dict, list, list
            The training features (a dictionary with the same structure as data_x but with a subset of the data), the testing features (a dictionary with the same structure as data_x but with a different subset of the data), the training targets (a subset of date_y), and the testing targets (a different subset of date_y).

    Raises
    ------------
        TypeError
            If the input data_x is not a dictionary, or if the test_rate is not a float or is greater than 1.
    '''
    # Verifica tipo de dados imputados
    if not type(data_x) is dict:
        raise TypeError('Input data_x must be a dictionary.')
    if not isinstance(test_rate, float) or test_rate > 1:
        raise TypeError('Test rate must be a float less than or equal to 1.')

    # Verifica taxa de teste (test_rate)
    test_rate = test_rate if type(test_rate) == float and test_rate <= 1  else 0.3

    # Quantifica amostra de teste e separa dados
    _num_rows = len(date_y)
    _num_test = int(_num_rows * test_rate)

    # Separa amostra Y
    y_test = date_y[:_num_test]
    y_train = date_y[_num_test:]

    # Separa amostra X
    X_test = {name:[val for val in data_x[name][:_num_test]] for name in data_x.keys()}
    X_train = {name:[val for val in data_x[name][_num_test:]] for name in data_x.keys()}

    return X_train, X_test, y_train, y_test

In [None]:
#########################################################################################################
# Grid Search - Testa diferentes valores K e distancias, e retorna os resultados.
#########################################################################################################
def ml_grid_search(X_train, y_train, k_values, distances=['euclidian']):
    '''
    Tests a k-nearest neighbors (KNN) model for each k-value specified in the input list.

    It takes training features, training targets, a list of k-values, and a list of distances as input; then, it fits a KNN model to the training data and tests it on the testing data for each k-value and distance. It returns a list of dictionaries, each representing the results of the KNN model for a specific k-value and distance.

    Parameters
    -------------
        X_train : dict
            The input training features dictionary.
        y_train: list
            The input training targets list.
        k_values: list
            The input list of k-values to test.
        distances: list, optional
            The distances' input list to test. Defaults to ['euclidian'].

    Returns
    -------------
        list
            A list of dictionaries, each representing the KNN model's results for each specified k-value and distance.

    Raises
    -------------
        TypeError
            If the input X_train is not a dictionary, the input y_train is not a list, the input k_values is not a list, or the input distances is not a list.
    '''
    # Verifica tipo de dados imputados
    if not isinstance(X_train, dict) or not isinstance(X_test, dict):
        raise TypeError('Input X_train and X_test must be dictionaries.')
    if not isinstance(y_train, list) or not isinstance(y_test, list):
        raise TypeError('Input y_train and y_test must be lists.')
    if not isinstance(k_values, list):
        raise TypeError('Input k_values must be a list.')
    if not isinstance(distances, list):
        raise TypeError('Input distances must be a list.')

    # Testa a combinação dos k-values e das distances.
    _return = []
    for distance in distances:
        _return += [{'k': k, 'score': ml_score_knn(knn), 'matrix': knn['confusion_matrix'], 'distance': knn['distance'], 'knn': knn}
                        for k in k_values for knn in [ml_fit(X_train, y_train, k, distance)]]

    return _return

In [None]:
###########################################
# Treinar KNN
###########################################
def ml_fit(X, y, n=3, distance_min="euclidian"):
    '''
    Fits a k-nearest neighbors (KNN) model to the data.

    It takes as input a features dictionary, a targets list, a specified number of neighbors, and a distance metric.
    It then fits a KNN model to the data using the set number of neighbors and distance metric and returns a dictionary representing the model's fit.

    Parameters
    ----------
        X: dict
            The input features dictionary in which each key represents a feature, and the corresponding value is that feature's data list.
        y: list
            The input targets list.
        n: int, optional
            The number of neighbors to use in the KNN model.
            The default is 3.
        distance_min: str, optional
            The metric distance to use in the KNN model. It can be 'euclidian,' 'minkowski,' or 'manhattan.'
            The default is 'euclidian'.

    Returns
    -------------
        dict
            A dictionary representing the KNN model's fit. The dictionary includes keys such as 'n_neighbors,' 'distance,' 'data_x_train,' 'data_y_train,' 'hit,' 'error,' 'confusion_matrix,' etc.
            The exact keys may vary based on the implementation.)

    Raises
    -------------
        TypeError
            If input X is not a dictionary, input y is not a list; input distance_min is not one of the allowed values, or the lengths of X and y are unequal.
    '''
    # Verifica tipo de dados imputados
    if not type(X) is dict: raise TypeError('Input X must be a dictionary.')
    if not type(y) is list: raise TypeError('Input y must be a list.')
    if not distance_min in ['euclidian', 'minkowski', 'manhattan']:
        raise TypeError('Input distance_min must be one of the allowed values: "euclidian", "minkowski", "manhattan".')
    if len(X[list(X.keys())[0]]) != len(y): raise TypeError('Input X and y must have the same length.')

    # Armazena teste
    _retrieval = {}
    _retrieval['n_neighbors'] = n
    _retrieval['distance'] = distance_min
    _retrieval['data_x_train'] = X.copy()
    _retrieval['data_y_train'] = y.copy()
    _retrieval['hit'] = 0
    _retrieval['error'] = 0
    _retrieval['confusion_matrix'] = {col:{col2:0 for col2 in set(y)} for col in set(y)}
    _retrieval['rounds'] = 0
    _retrieval['num_rows'] = len(y)

    # Número impar para desempate
    n = n if n%2 == 1 else n+1

    # Verifica qual será a distância será utilizada
    if distance_min == "minkowski":
        _fun_distance = minkowski_distance
    elif distance_min == "manhattan":
        _fun_distance = manhattan_distance
    else:
        _fun_distance = euclidian_distance

    # Cria uma matriz com todos as linhas
    _matrix_x = [data_row(X, i) for i in range(len(y))]

    for index, y_expected in zip(range(len(y)), y):
        line_a = _matrix_x[index]
        list_distances = {}

        # Calcula a distancia com todas as linhas
        for row in range(len(y)):
            # Conta rodadas
            _retrieval['rounds'] += 1

            if index == row:
                continue
            line_b = _matrix_x[row]

            # Calcula distancia
            distance = _fun_distance(line_a, line_b)
            list_distances[distance] = y[row]

        list_distances = dict(sorted(list_distances.items()))
        list_distances = {_val[0]:_val[1] for _i, _val in zip(range(len(list_distances)), list_distances.items()) if _i < n}

        _tally = {list(list_distances.values()).count(c) : c for c in set(list_distances.values())}
        _tally = dict(sorted(_tally.items(), reverse=True))
        _y_predict = list(_tally.values())[0]

        _retrieval['confusion_matrix'][y_expected][_y_predict] += 1

        if _y_predict == y_expected:
            _retrieval['hit'] += 1
        else:
            _retrieval['error'] += 1

    return _retrieval

In [None]:
###########################################
# Preditor KNN
###########################################
def ml_predict(knn_fit, X_test, y_test):
    '''
    Fits a k-nearest neighbors (KNN) model to the data.

    It takes a dictionary representing a fitted KNN model, a features dictionary for the test data, and a targets list for the test data as input.
    It then uses the fitted KNN model to make predictions on the test data and returns a dictionary representing the model's predictions.

    Parameters
    ----------
        knn_fit: dict
	          The input dictionary representing a fitted KNN model.
        X_test: dict
	          The input features dictionary in which each key represents a feature, and the corresponding value is that feature's data list.
        y_test: list
            The input targets list for the test data.

    Returns
    -------------
        dict
            A dictionary representing the KNN model's prediction.

    Raises
    -------------
        TypeError
            If the input knn_fit is not a dictionary, the input X_test is not a dictionary, the input y_test is not a list, or the lengths of X_test and y_test are unequal.
    '''
    # Verifica tipo de dados imputados
    if not type(knn_fit) is dict: raise TypeError('Input knn_fit must be a dictionary.')
    if not type(X_test) is dict: raise TypeError('Input X_test must be a dictionary.')
    if not type(y_test) is list: raise TypeError('Input y_test must be a list.')
    if len(X_test[list(X_test.keys())[0]]) != len(y_test): raise TypeError('Input X_test and y_test must have the same length.')

    # Dados usados no treino
    _n = knn_fit['n_neighbors']
    _distance_min = knn_fit['distance']
    _X_train = knn_fit['data_x_train']
    _y_train = knn_fit['data_y_train']

    # Dados usados no Test
    _num_rows = len(y_test)

    # Armazena teste
    _retrieval = {}
    _retrieval['n_neighbors'] = _n
    _retrieval['distance'] = _distance_min
    _retrieval['data_x_test'] = X_test.copy()
    _retrieval['data_y_test'] = y_test.copy()
    _retrieval['data_y_predict'] = []
    _retrieval['hit'] = 0
    _retrieval['error'] = 0
    _retrieval['confusion_matrix'] = {col:{col2:0 for col2 in set(_y_train)} for col in set(_y_train)}
    _retrieval['num_rows'] = _num_rows
    _retrieval['rounds'] = 0

    # Número impar para desempate
    _n = _n if _n%2 == 1 else _n+1

    # Verifica qual será a distância será utilizada
    if _distance_min == "minkowski":
        _fun_distance = minkowski_distance
    elif _distance_min == "manhattan":
        _fun_distance = manhattan_distance
    else:
        _fun_distance = euclidian_distance

    # Cria uma matriz com todos as linhas
    _matrix_x_train = [data_row(_X_train, i) for i in range(len(_y_train))]
    _matrix_x_test = [data_row(X_test, i) for i in range(len(y_test))]

    for index, y_expected in zip(range(_num_rows), y_test):
        line_a = _matrix_x_test[index]
        list_distances = {}

        # Calcula a distancia com todas as linhas
        for row in range(len(_y_train)):
            # Conta rodadas
            _retrieval['rounds'] += 1

            # Captura linha de Comparação
            line_b = _matrix_x_train[row]

            # Calcula a distância
            distance = _fun_distance(line_a, line_b)
            list_distances[distance] = _y_train[row]

        list_distances = dict(sorted(list_distances.items()))
        list_distances = {_val[0]:_val[1] for _i, _val in zip(range(len(list_distances)), list_distances.items()) if _i < _n}

        _tally = {list(list_distances.values()).count(c) : c for c in set(list_distances.values())}


        _tally = {list(list_distances.values()).count(c) : c for c in {_ for _ in list_distances.values()}}
        _tally = dict(sorted(_tally.items(), reverse=True))
        _y_predict = list(_tally.values())[0]

        _retrieval['data_y_predict'].append(_y_predict)
        _retrieval['confusion_matrix'][y_expected][_y_predict] += 1

        if _y_predict == y_expected:
            _retrieval['hit'] += 1
        else:
            _retrieval['error'] += 1

    return _retrieval

#### Functions Metrics

In [None]:
###########################################
# Calcular Score
###########################################
def ml_score_knn(data_fit):
    '''
    Calculates a k-nearest neighbors (KNN) model accuracy score.

    It takes a dictionary as input, representing a KNN model's fit results.
    The dictionary should contain the keys 'hit' and 'num_rows.
    'hit' represents the number of correct predictions made by the model, and 'num_rows' the predictions' total number.
    It calculates the score as the ratio of 'hit' to 'num_rows' and returns this score.

    Parameters
    -------------
        data_fit: dict
            The input dictionary representing a KNN model fit.
                It should contain the keys 'hit' and 'num_rows', where 'hit' is the number of correct predictions and 'num_rows' is the total number of predictions.

    Returns
    -------------
        float
            The KNN model's accuracy score (calculated as the ratio of 'hit' to 'num_rows').

    Raises
    -------------
        TypeError
            If the input data_fit is not a dictionary.
    '''
    # Verifica tipo de dados imputados
    if not isinstance(data_fit, dict):
        raise TypeError('Input data_fit must be a dictionary.')

    _retorno = data_fit
    return round(_retorno['hit']/_retorno['num_rows'], 4)

In [None]:
###########################################
# Metrics - Classification Report
###########################################
def metric_classification_report(data_fit, label={}):
    '''
    Generates the input dictionary's classification report.

    It takes a dictionary as input, representing a fitted model's results; calculates precision, recall, f1-score, and support for each data class; and prints a classification report.

    Parameters
    -------------
        data_fit: dict
            The input dictionary that represents a fitted model's results. It should contain a 'confusion_matrix' key (a dictionary representing the model confusion matrix).
        label: dict
            The input dictionary that represents a label of dataset.

    Raises
    -------------
        TypeError
            If the input data_fit is not a dictionary.
    '''
    # Verifica tipo de dados imputados
    if not type(data_fit) is dict: raise TypeError('data_fit must be a dictionary.')
    if not type(label) is dict: raise TypeError('label must be a dictionary.')

    _confusion_matrix  = data_fit['confusion_matrix']
    _columns = list(_confusion_matrix.keys())
    _dim = len(max(_columns, key=len)) if type(_columns[0]) == str else 13
    _dimension_col = _dim if _dim > 13 else 13

    _var_print = f"{' ':^{_dimension_col}}{'precision':>11}{'recall':>11}{'f1-score':>11}{'support':>11}\n\n"

    _support_list = []
    _precision_list = []
    _recall_list = []
    _f1_score_list = []

    for _col in _columns:
        _support = sum(_confusion_matrix[_col].values())
        _precision = _confusion_matrix[_col][_col] / _support
        _recall = _confusion_matrix[_col][_col] / sum([_confusion_matrix[_col2][_col] for _col2 in _columns])
        _f1_score = 2 * _precision * _recall / (_precision + _recall)

        _col = label[_col] if _col in list(label.keys()) else _col

        _var_print += f"{_col:>{_dimension_col}}"
        _var_print += f"{str(round(_precision, 5)):>11}"
        _var_print += f"{str(round(_recall, 5)):>11}"
        _var_print += f"{str(round(_f1_score, 5)):>11}"
        _var_print += f"{str(round(_support, 5)):>11}"
        _var_print += "\n"

        _support_list.append(_support)
        _precision_list.append(_precision)
        _recall_list.append(_recall)
        _f1_score_list.append(_f1_score)

    _mean_precision = round(sum(_precision_list)/len(_precision_list), 5)
    _mean_recall = round(sum(_recall_list)/len(_recall_list), 5)
    _mean_f1_score = round(sum(_f1_score_list)/len(_f1_score_list), 5)

    print(_var_print)
    print(f"{'avg':>{_dimension_col}}{_mean_precision:>11}{_mean_recall:>11}{_mean_f1_score:>11}{sum(_support_list):>11}\n")

In [None]:
###########################################
# Metrics - Accuracy
###########################################
def metric_accuracy(data_fit):
    '''
    Calculates a model's accuracy based on the input dictionary.

    It takes a dictionary that represents a fitted model's results as input and uses the 'confusion_matrix' key in the dictionary to calculate and return the model's accuracy.

    Parameters
    -------------
        data_fit: dict
            The input dictionary representing the fitted model's results; it should contain a 'confusion_matrix' key (a dictionary representing the model's confusion matrix).

    Returns
    -------------
        float
            The model's accuracy.

    Raises
    -------------
        TypeError
            If the input data_fit is not a dictionary.
    '''
    # Verifica tipo de dados imputados
    if not type(data_fit) is dict:
        raise TypeError('data_fit must be a dictionary.')

    _confusion_matrix  = data_fit['confusion_matrix']
    _columns = list(_confusion_matrix.keys())

    _true_predict = sum([_confusion_matrix[_col][_col] for _col in _columns])
    _total_support = sum([sum(_confusion_matrix[_col].values()) for _col in _columns])

    return _true_predict / _total_support

### **Execução KNN**

In [None]:
# Carrega dados
data_test = read_csv('water_potability.csv', delimiter=",")

In [None]:
# Visualizar tipo
type(data_test)

dict

In [None]:
# Visualização de Dados
data_view(data_test, 5)

-------------------------------------------------------------------------------------------------------------------------------
|                                                        VIEW DATASET                                                         |
-------------------------------------------------------------------------------------------------------------------------------
|  n  |    ph     |  Hardne   |  Solids   |  Chlora   |  Sulfat   |  Conduc   |  Organi   |  Trihal   |  Turbid   |  Potabi   |
-------------------------------------------------------------------------------------------------------------------------------
|  0  |           | 204.89045 | 20791.318 | 7.3002118 | 368.51644 | 564.30865 | 10.379783 | 86.990970 | 2.9631353 |     0     |
|  1  | 3.7160800 | 129.42292 | 18630.057 | 6.6352458 |           | 592.88535 | 15.180013 | 56.329076 | 4.5006562 |     0     |
|  2  | 8.0991241 | 224.23625 | 19909.541 | 9.2758836 |           | 418.60621 | 16.868636 | 66.420092 | 

In [None]:
# Informações do dataset
data_info(data_test)

{'columns': ['ph',
  'Hardness',
  'Solids',
  'Chloramines',
  'Sulfate',
  'Conductivity',
  'Organic_carbon',
  'Trihalomethanes',
  'Turbidity',
  'Potability'],
 'num_columns': 10,
 'num_rows': 3276,
 'dimension': (10, 3276),
 'info_col___________________ph': ('null   491', 'type    mixed'),
 'info_col_____________Hardness': ('null     0', "type   <class 'float'>"),
 'info_col_______________Solids': ('null     0', "type   <class 'float'>"),
 'info_col__________Chloramines': ('null     0', "type   <class 'float'>"),
 'info_col______________Sulfate': ('null   781', 'type    mixed'),
 'info_col_________Conductivity': ('null     0', "type   <class 'float'>"),
 'info_col_______Organic_carbon': ('null     0', "type   <class 'float'>"),
 'info_col______Trihalomethanes': ('null   162', 'type    mixed'),
 'info_col____________Turbidity': ('null     0', "type   <class 'float'>"),
 'info_col___________Potability': ('null     0', "type   <class 'int'>")}

In [None]:
# Substituir Valores pela média
data_fill = data_fill_na(data_test,'ph',data_mean(data_test['ph']))
data_fill = data_fill_na(data_test,'Sulfate',data_mean(data_test['Sulfate']))
data_fill = data_fill_na(data_test,'Trihalomethanes',data_mean(data_test['Trihalomethanes']))
data_view(data_fill, 5)

-------------------------------------------------------------------------------------------------------------------------------
|                                                        VIEW DATASET                                                         |
-------------------------------------------------------------------------------------------------------------------------------
|  n  |    ph     |  Hardne   |  Solids   |  Chlora   |  Sulfat   |  Conduc   |  Organi   |  Trihal   |  Turbid   |  Potabi   |
-------------------------------------------------------------------------------------------------------------------------------
|  0  |   7.081   | 204.89045 | 20791.318 | 7.3002118 | 368.51644 | 564.30865 | 10.379783 | 86.990970 | 2.9631353 |     0     |
|  1  | 3.7160800 | 129.42292 | 18630.057 | 6.6352458 |  333.776  | 592.88535 | 15.180013 | 56.329076 | 4.5006562 |     0     |
|  2  | 8.0991241 | 224.23625 | 19909.541 | 9.2758836 |  333.776  | 418.60621 | 16.868636 | 66.420092 | 

In [None]:
# Opção - Remover Valores ausentes (caso fosse necessário)
# _data_ex = data_drop_na(_data_ex)
# data_view(_data_ex, 5)

In [None]:
# Descreve os dados
print(data_describe(data_fill))

        Analysis                      ph            Hardness              Solids         Chloramines             Sulfate        Conductivity      Organic_carbon     Trihalomethanes           Turbidity          Potability                   
        Tipo             <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>     <class 'float'>       <class 'int'>                   
        Quant                       3276                3276                3276                3276                3276                3276                3276                3276                3276                3276                   
        Máximo                      14.0             323.124         61227.19601              13.127           481.03064           753.34262                28.3               124.0               6.739                   1                   
        Mínimo                       0.0

In [None]:
# Slice data_x and data_y
dataset_cp = data_fill.copy()
data_y = dataset_cp.pop('Potability')
data_x = dataset_cp.copy()

In [None]:
# Conta dados
data_value_counts(data_y)

{1: 1278, 0: 1998}

In [None]:
data_view(data_x,5)

-------------------------------------------------------------------------------------------------------------------
|                                                  VIEW DATASET                                                   |
-------------------------------------------------------------------------------------------------------------------
|  n  |    ph     |  Hardne   |  Solids   |  Chlora   |  Sulfat   |  Conduc   |  Organi   |  Trihal   |  Turbid   |
-------------------------------------------------------------------------------------------------------------------
|  0  |   7.081   | 204.89045 | 20791.318 | 7.3002118 | 368.51644 | 564.30865 | 10.379783 | 86.990970 | 2.9631353 |
|  1  | 3.7160800 | 129.42292 | 18630.057 | 6.6352458 |  333.776  | 592.88535 | 15.180013 | 56.329076 | 4.5006562 |
|  2  | 8.0991241 | 224.23625 | 19909.541 | 9.2758836 |  333.776  | 418.60621 | 16.868636 | 66.420092 | 3.0559337 |
|  3  | 8.3167658 | 214.37339 | 22018.417 | 8.0593323 | 356.88613 | 363.

In [None]:
# Normaliza os dados
data_x_normaliado = ml_normalize(data_x)
data_view(data_x_normaliado, 5)

-------------------------------------------------------------------------------------------------------------------
|                                                  VIEW DATASET                                                   |
-------------------------------------------------------------------------------------------------------------------
|  n  |    ph     |  Hardne   |  Solids   |  Chlora   |  Sulfat   |  Conduc   |  Organi   |  Trihal   |  Turbid   |
-------------------------------------------------------------------------------------------------------------------
|  0  |  0.5058   |  0.5711   |  0.3361   |  0.5439   |  0.6804   |  0.6694   |  0.3134   |  0.6998   |  0.2861   |
|  1  |  0.2654   |  0.2974   |  0.3006   |  0.4918   |  0.5817   |  0.7194   |  0.4973   |   0.451   |  0.5768   |
|  2  |  0.5785   |  0.6413   |  0.3216   |  0.6985   |  0.5817   |  0.4147   |   0.562   |  0.5329   |  0.3036   |
|  3  |  0.5941   |  0.6055   |  0.3562   |  0.6033   |  0.6473   |  0.3

In [None]:
# Separa dados com ml_train_test_split()
X_train, X_test, y_train, y_test = ml_train_test_split(data_x_normaliado, data_y, 0.25)
print(len(X_train[list(X_train.keys())[0]]), len(X_test[list(X_test.keys())[0]]), len(y_train), len(y_test))

2457 819 2457 819


In [None]:
# distance_min: 'euclidian', 'minkowski', 'manhattan'
knn_fit = ml_fit(X_train, y_train, n=50, distance_min='euclidian')
print('Score:', ml_score_knn(knn_fit)*100, "%")
print('Rounds:', knn_fit['rounds'])

Score: 65.36 %
Rounds: 6036849


In [None]:
# Semelhante ao Grid Search - testar diferentes valores K
k_values = [20, 30, 40, 50, 60]
distances = ['euclidian', 'minkowski']
grid_search = ml_grid_search(X_train, y_train, k_values, distances)

# Armazena o melhor score
best_score = sorted(grid_search, key=lambda x: x['score'], reverse=True)[0]

# Visualizar os resultados do Grid Search
for index, row in zip(range(1, len(grid_search)+1), sorted(grid_search, key=lambda x: x['score'], reverse=True)):
    print(index, f"K: {row['k']}, Score: {row['score']}, Distance: {row['distance']}", '<- Best Score' if index == 1 else '')

1 K: 50, Score: 0.6536, Distance: euclidian <- Best Score
2 K: 30, Score: 0.6524, Distance: euclidian 
3 K: 40, Score: 0.6524, Distance: euclidian 
4 K: 30, Score: 0.6508, Distance: minkowski 
5 K: 40, Score: 0.6479, Distance: minkowski 
6 K: 50, Score: 0.6455, Distance: minkowski 
7 K: 60, Score: 0.6398, Distance: euclidian 
8 K: 20, Score: 0.6374, Distance: minkowski 
9 K: 60, Score: 0.637, Distance: minkowski 
10 K: 20, Score: 0.6345, Distance: euclidian 


In [None]:
# Analisa a Matriz de Confusão do best_score
metric_classification_report(best_score['knn'], {0:"No Drinkable", 1:"Drinkable"})

               precision     recall   f1-score    support

 No Drinkable    0.95928    0.64526    0.77154       1498
    Drinkable    0.17623    0.73478    0.28427        959

          avg    0.56775    0.69002    0.52791       2457



In [None]:
# Prediz os valores no X_test
knn_predict = ml_predict(knn_fit, X_test, y_test)
print('Score:', ml_score_knn(knn_predict)*100, "%")
print('Rounds:', knn_predict['rounds'])

Score: 62.27 %
Rounds: 2012283


In [None]:
# Analisa a Matriz de Confusão
metric_classification_report(knn_predict, {0:"No Drinkable", 1:"Drinkable"})

               precision     recall   f1-score    support

 No Drinkable      0.938    0.62784    0.75221        500
    Drinkable    0.12853    0.56944    0.20972        319

          avg    0.53326    0.59864    0.48096        819



In [None]:
# Analisa a Acurácia
print("Accuracy: ", round(metric_accuracy(knn_predict)*100, 5), "%")

Accuracy:  62.27106 %


### Comparando com Scikit-learn

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

df_base = pd.read_csv('water_potability.csv')

In [None]:
df_base.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.99097,2.963135,0
1,3.71608,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.5466,310.135738,398.410813,11.558279,31.997993,4.075075,0


In [None]:
df_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


In [None]:
for col in df_base.columns:
    df_base[col] = df_base[col].fillna(df_base[col].mean())

df_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               3276 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          3276 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3276 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


In [None]:
df_y = df_base['Potability']
df_x = df_base.drop('Potability', axis=1)

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.25)

from sklearn.preprocessing import StandardScaler

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

(2457, 819, 2457, 819)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Accuracy:", round(accuracy*100, 5), "%")
print("Precision:", round(precision*100, 5), "%")
print("Recall:", round(recall*100, 5), "%")

Accuracy: 60.56166 %
Precision: 45.4902 %
Recall: 38.66667 %
