# üìú Projeto Final - Capacita√ß√£o IA (Ciclo 2)
# üéì Aluno: Filipe da Silva Rodrigues

## üíª Bibliotecas Necess√°rias

In [1]:
# Tratamento de Dataset e Gr√°ficos
from datasets import load_dataset
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error

# Modelos de Treinamento

# K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# Decision Tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Multi-layer Perceptron (MLP)
from sklearn.neural_network import MLPClassifier, MLPRegressor
# Support Vector Machine
from sklearn.svm import SVC, SVR
# Random Forest
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# XGBoost
from xgboost import XGBClassifier, XGBRegressor



---

üëæ **Dataset de Regress√£o - Hugging Face: Einstellung/demo-salaries**

Esse dataframe √© um conjunto de dados que cont√©m informa√ß√µes sobre sal√°rios e caracter√≠sticas de diferentes cargos na √°rea de ci√™ncia de dados. As vari√°veis s√£o:

- `work_year`: o ano em que o sal√°rio foi reportado (ex: 2023).
- `experience_level`: o n√≠vel de experi√™ncia do funcion√°rio (EN = J√∫nior, MI = Pleno, SE = S√™nior, EX = Executivo).
- `employment_type`: o tipo de emprego (PT = Meio per√≠odo, FT = Tempo integral, CT = Contrato, FL = Freelance).
- `job_title`: o t√≠tulo do cargo do funcion√°rio (ex: Data Scientist, Data Engineer).
- `salary`: o sal√°rio anual bruto reportado.
- `salary_currency`: a moeda na qual o sal√°rio foi pago (ex: USD, EUR).
- `salary_in_usd`: o sal√°rio anual bruto convertido para USD.
- `employee_residence`: o pa√≠s de resid√™ncia do funcion√°rio (ex: US, CA, GB).
- `remote_ratio`: a propor√ß√£o de trabalho remoto (0 = Presencial, 50 = H√≠brido, 100 = Totalmente remoto).
- `company_location`: o pa√≠s onde a empresa est√° localizada.
- `company_size`: o tamanho da empresa (S = Pequena, M = M√©dia, L = Grande).

‚úÖ **Objetivo:** Prever o sal√°rio anual bruto em USD de acordo com as caracter√≠sticas coletadas.

---


In [None]:
# https://huggingface.co/datasets/Einstellung/demo-salaries

dataset = load_dataset("Einstellung/demo-salaries")



dataset = dataset.dropna()




print('\nDataset Original:\n')




display(dataset)




# Criando uma c√≥pia do dataset para efetuar os devidos tratamentos
df = pd.DataFrame(dataset).copy()

# Normalizando os dados das features na escala (0..1)
columns_to_normalize = ['salary_in_usd', 'salary', 'work_year', 'remote_ratio,']
df[columns_to_normalize] = MinMaxScaler().fit_transform(df[columns_to_normalize])

# Convertendo features categ√≥ricas para n√∫meros com OneHotEncoder
categorical_columns = ['sex', 'smoker', 'day', 'time']
column_transform = make_column_transformer(
    (OneHotEncoder(drop='first'), categorical_columns), remainder='passthrough')
df = column_transform.fit_transform(df)
columns_names = column_transform.get_feature_names_out()

# Transformando o resultado em um DataFrame
df = pd.DataFrame(data=df, columns=columns_names)

# Renomenado as colunas para melhor entendimento
columns = df.columns

# Dicion√°rio para mapear as colunas a serem renomeadas
rename_mapping = {}

for column in columns:
    if column.startswith('onehotencoder__'):
        new_column_name = column.replace('onehotencoder__', '')
        rename_mapping[column] = new_column_name
    if column.startswith('remainder__'):
        new_column_name = column.replace('remainder__', '')
        rename_mapping[column] = new_column_name

# print(rename_mapping)

# Renomeando as colunas
df.rename(columns=rename_mapping, inplace=True)

# Exibindo o DataFrame tratado com as colunas renomeadas
print('\nDataset Tratado para Treinamento:\n')
display(df)

# Separando os dados para treinamento e teste
y = df['tip']  # Coluna 'tip'
x = df.drop('tip', axis=1)  # Todas as outras colunas

ü§ñ Resultados

In [3]:
# Inicializando os modelos de treinamento

# DT - Decision Tree
model_dt1 = DecisionTreeClassifier(criterion='gini', max_depth=5)
model_dt2 = DecisionTreeClassifier(criterion='entropy', max_depth=10)

# KNN - K-Nearest Neighbors
model_knn1 = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
model_knn2 = KNeighborsClassifier(n_neighbors=10, metric='manhattan')

# MLP - Multi-layer Perceptron
model_mlp1 = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000, activation='relu')
model_mlp2 = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1500, activation='tanh')

# SVM - Support Vector Machine
model_svm1 = SVC(kernel='linear', C=1, gamma='scale')
model_svm2 = SVC(kernel='rbf', C=0.1, gamma='scale')

# RF - Random Forest
model_rf = RandomForestClassifier(n_estimators=100, max_depth=10)

# XGB - XGBoost
model_xgb = XGBClassifier(objective='binary:logistic', max_depth=3, learning_rate=0.1)



# Treinamento dos modelos e avalia√ß√£o da acur√°cia

# Lista para armazenar os resultados
accuracies = []

# N√∫mero de repeti√ß√µes do treinamento
n = 10

for i in range(n):
    # Realizando o Train-Test-Split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=np.random.randint(1000))

    # Treinamento dos modelos
    model_dt1.fit(x_train, y_train)
    model_dt2.fit(x_train, y_train)

    model_knn1.fit(x_train, y_train)
    model_knn2.fit(x_train, y_train)

    model_mlp1.fit(x_train, y_train)
    model_mlp2.fit(x_train, y_train)

    model_svm1.fit(x_train, y_train)
    model_svm2.fit(x_train, y_train)

    model_rf.fit(x_train, y_train)

    model_xgb.fit(x_train, y_train)

    # Previs√µes para cada modelo
    predictions_dt1 = model_dt1.predict(x_test)
    predictions_dt2 = model_dt2.predict(x_test)

    predictions_knn1 = model_knn1.predict(x_test)
    predictions_knn2 = model_knn2.predict(x_test)

    predictions_mlp1 = model_mlp1.predict(x_test)
    predictions_mlp2 = model_mlp2.predict(x_test)

    predictions_svm1 = model_svm1.predict(x_test)
    predictions_svm2 = model_svm2.predict(x_test)

    predictions_rf = model_rf.predict(x_test)

    predictions_xgb = model_xgb.predict(x_test)

    # C√°lculo da acur√°cia para cada modelo
    acc_dt1 = accuracy_score(y_test, predictions_dt1)
    acc_dt2 = accuracy_score(y_test, predictions_dt2)

    acc_knn1 = accuracy_score(y_test, predictions_knn1)
    acc_knn2 = accuracy_score(y_test, predictions_knn2)

    acc_mlp1 = accuracy_score(y_test, predictions_mlp1)
    acc_mlp2 = accuracy_score(y_test, predictions_mlp2)

    acc_svm1 = accuracy_score(y_test, predictions_svm1)
    acc_svm2 = accuracy_score(y_test, predictions_svm2)

    acc_rf = accuracy_score(y_test, predictions_rf)

    acc_xgb = accuracy_score(y_test, predictions_xgb)

    # Armazamento das acur√°cias na lista
    accuracies.append([acc_dt1, acc_dt2, acc_knn1, acc_knn2, acc_mlp1, acc_mlp2, acc_svm1, acc_svm2, acc_rf, acc_xgb])


# Convertendo a lista para um array numpy para calcular a m√©dia
accuracies = np.array(accuracies)

# Calculando a m√©dia das acur√°cias para cada modelo
average_accuracies = np.mean(accuracies, axis=0)

# R√≥tulo com o nome dos modelos
model_names = ['DT1', 'DT2', 'KNN1', 'KNN2', 'MLP1', 'MLP2', 'SVM1', 'SVM2', 'RF', 'XGB']

# Apresentar a m√©dia das execu√ß√µes dos resultados de acur√°cia de todos os modelos
for model, acc in zip(model_names, average_accuracies):
    print(f'{model}: {acc*100:.3f} %')

DT1: 83.765 %
DT2: 82.941 %
KNN1: 80.588 %
KNN2: 80.294 %
MLP1: 85.176 %
MLP2: 83.235 %
SVM1: 78.118 %
SVM2: 75.529 %
RF: 79.824 %
XGB: 82.471 %


## üß™ Experimentos no MLFLOW