<a target="_blank" href="https://colab.research.google.com/github/paulotguerra/QXD0178/blob/main/01.E0-Excercicio-Limpeza-de-dados.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## QXD0178 - Data Mining
# Preparing the dataset with Scikit-Learn

### Loading the `food_coded.csv` dataset

In [111]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

df = pd.read_csv("https://raw.githubusercontent.com/She-Codes-Now/Intro-to-Data-Science-with-R/master/food_coded.csv")

### EXERCISE SOLUTION

First, I check which columns contain missing values so that they can be filled in.

In [114]:
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0])

GPA                            2
calories_day                  19
calories_scone                 1
comfort_food                   1
comfort_food_reasons           2
comfort_food_reasons_coded    19
cook                           3
cuisine                       17
diet_current                   1
drink                          2
eating_changes                 3
employment                     9
exercise                      13
father_education               1
father_profession              3
fav_cuisine                    2
fav_food                       2
food_childhood                 1
healthy_meal                   1
ideal_diet                     1
income                         1
life_rewarding                 1
marital_status                 1
meals_dinner_friend            3
mother_education               3
mother_profession              2
on_off_campus                  1
persian_food                   1
self_perception_weight         1
soup                           1
sports    
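
The raw counts above can also be read as a share of rows, which makes it easier to judge how much of each column would end up imputed. A minimal sketch, assuming a small hypothetical DataFrame (`toy`) in place of `food_coded.csv`; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for food_coded.csv
toy = pd.DataFrame({
    "GPA": [3.5, np.nan, 3.0, 2.8],
    "cook": [1.0, 2.0, np.nan, np.nan],
    "Gender": [1, 2, 1, 2],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = toy.isnull().mean() * 100

# Keep only the columns with gaps, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same two lines can be run against `df` instead of `toy`.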

Filtering the columns of type `object`:


In [115]:
object_columns = df.select_dtypes(include=['object']).columns
print(object_columns)

Index(['GPA', 'comfort_food', 'comfort_food_reasons', 'diet_current',
       'eating_changes', 'father_profession', 'fav_cuisine', 'food_childhood',
       'healthy_meal', 'ideal_diet', 'meals_dinner_friend',
       'mother_profession', 'type_sports', 'weight'],
      dtype='object')


After filtering, I noticed that the `GPA` and `weight` columns are stored as `object` when they should be numeric. I will convert them.

In [116]:
# Convert the columns to float; entries that cannot be parsed become NaN:

df['GPA'] = pd.to_numeric(df['GPA'], errors='coerce')
df['weight'] = pd.to_numeric(df['weight'], errors='coerce')
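
Because `errors='coerce'` silently replaces unparseable entries with `NaN`, the conversion can add missing values that did not show up in the earlier count. A minimal sketch of that effect, assuming hypothetical raw values (`"unknown"`, `"144 lbs"`) rather than the actual contents of the dataset:

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with free text and a missing entry
raw = pd.Series(["3.5", "2.8", "unknown", "144 lbs", None])

# Entries that cannot be parsed as numbers are turned into NaN
converted = pd.to_numeric(raw, errors="coerce")
print(converted)

# How many NaN values did the coercion itself introduce?
new_nans = converted.isna().sum() - raw.isna().sum()
print("NaN introduced by coercion:", new_nans)
```

Comparing the missing counts before and after the conversion, as done here, is one way to see how many entries were discarded as text.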

Filtering the numeric columns so that they can later be filled in using `SimpleImputer`:

In [117]:
numeric_columns = df.select_dtypes(include=['number']).columns
print(numeric_columns)

Index(['GPA', 'Gender', 'breakfast', 'calories_chicken', 'calories_day',
       'calories_scone', 'coffee', 'comfort_food_reasons_coded', 'cook',
       'comfort_food_reasons_coded.1', 'cuisine', 'diet_current_coded',
       'drink', 'eating_changes_coded', 'eating_changes_coded1', 'eating_out',
       'employment', 'ethnic_food', 'exercise', 'father_education',
       'fav_cuisine_coded', 'fav_food', 'fries', 'fruit_day', 'grade_level',
       'greek_food', 'healthy_feeling', 'ideal_diet_coded', 'income',
       'indian_food', 'italian_food', 'life_rewarding', 'marital_status',
       'mother_education', 'nutritional_check', 'on_off_campus',
       'parents_cook', 'pay_meal_out', 'persian_food',
       'self_perception_weight', 'soup', 'sports', 'thai_food',
       'tortilla_calories', 'turkey_calories', 'veggies_day', 'vitamins',
       'waffle_calories', 'weight'],
      dtype='object')


In [118]:
# Apply SimpleImputer (median strategy) only to the numeric columns
imputer = SimpleImputer(strategy="median")
df[numeric_columns] = imputer.fit_transform(df[numeric_columns])
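
To make the effect of the median strategy concrete, here is a minimal sketch on a hypothetical two-column DataFrame (`toy`, with made-up values). After fitting, the `statistics_` attribute of the imputer holds the median learned for each column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "GPA": [3.0, np.nan, 3.5, 4.0],
    "weight": [150.0, 180.0, np.nan, 120.0],
})

imputer = SimpleImputer(strategy="median")
toy[toy.columns] = imputer.fit_transform(toy)

# Per-column medians computed from the non-missing values
print(imputer.statistics_)
print(toy)
```

Each gap is replaced by the median of its own column, so the imputed rows do not shift the column's center.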

In [121]:
# Handle the non-numeric columns with SimpleImputer (most frequent value):
colunas_nao_num = [
    'diet_current', 'eating_changes', 'father_profession',
    'fav_cuisine', 'food_childhood', 'healthy_meal',
    'ideal_diet', 'meals_dinner_friend', 'mother_profession', 'type_sports'
]

imputer_cat = SimpleImputer(strategy='most_frequent')
df[colunas_nao_num] = imputer_cat.fit_transform(df[colunas_nao_num])
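
The `most_frequent` strategy fills each gap with the mode of its column. A minimal sketch on hypothetical text data (the column names are borrowed from the dataset, the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; dtype=object mirrors the text columns of the dataset
toy = pd.DataFrame({
    "type_sports": ["soccer", np.nan, "soccer", "none"],
    "fav_cuisine": ["italian", "thai", np.nan, "italian"],
}, dtype=object)

imputer_cat = SimpleImputer(strategy="most_frequent")
toy[toy.columns] = imputer_cat.fit_transform(toy)

# The mode learned for each column
print(imputer_cat.statistics_)
print(toy)
```

For free-text answers the mode may be an arbitrary response, so it can be worth inspecting `statistics_` before accepting the filled values.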

The data in the `comfort_food_reasons` and `comfort_food` columns is inconsistent because the entries are neither separated nor written in a standardized way. I will adjust them so that the items are separated by commas and written in lowercase, using `Pipeline`.

In [109]:
import re

# Function to standardize the text
def padronizar_texto(text):
    if pd.isna(text):  # Leave missing values untouched
        return text
    text = text.lower()
    # Turn the standalone word "and" and "/" into commas; the word boundaries
    # keep words that merely contain "and" (e.g. "candy") intact
    text = re.sub(r"\band\b|/", ",", text)
    # Drop whitespace around commas
    text = re.sub(r"\s*,\s*", ",", text)
    return text

# Building the pipeline: fill missing values with an empty string
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='')),
])

# Applying the pipeline to the columns, then standardizing the text
colunas_texto = ['comfort_food_reasons', 'comfort_food']
df[colunas_texto] = pipeline.fit_transform(df[colunas_texto])
for coluna in colunas_texto:
    df[coluna] = df[coluna].apply(padronizar_texto)
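
The imputation and the text standardization can also be combined into a single `Pipeline` by wrapping the cleaning function in a `FunctionTransformer`. The following is a minimal sketch, not the notebook's own solution: the names `standardize_text`, `standardize_array`, and `text_pipeline` and the toy values are assumptions introduced for illustration.

```python
import re

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def standardize_text(text):
    """Lowercase, turn standalone 'and' and '/' into commas, trim spaces around commas."""
    text = re.sub(r"\band\b|/", ",", text.lower())
    return re.sub(r"\s*,\s*", ",", text)


def standardize_array(X):
    # The imputer returns a 2D object array; apply the function element-wise
    return np.vectorize(standardize_text, otypes=[object])(X)


# Hypothetical pipeline: fill gaps with "" first, then standardize every cell
text_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="")),
    ("standardize", FunctionTransformer(standardize_array)),
])

# Hypothetical toy data standing in for the two free-text columns
toy = pd.DataFrame({
    "comfort_food": ["Mac and cheese, Candy", "Pizza/Wings", np.nan],
    "comfort_food_reasons": ["Stress and boredom", np.nan, "Sadness"],
}, dtype=object)

toy[toy.columns] = text_pipeline.fit_transform(toy)
print(toy)
```

Keeping both steps inside the pipeline means the same preparation can be reapplied to new data with a single `transform` call.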


In [110]:
df

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded,...,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
0,2.400,2.0,1.0,430.0,3.0,315.0,1.0,none,we dont have comfort,9.0,...,1.0,1.0,1.0,1165.0,345.0,car racing,5.0,1.0,1315.0,187.0
1,3.654,1.0,1.0,610.0,3.0,420.0,2.0,"chocolate,chips,ice cream","stress,bored,anger",1.0,...,1.0,1.0,2.0,725.0,690.0,Basketball,4.0,2.0,900.0,155.0
2,3.300,1.0,1.0,720.0,4.0,420.0,2.0,"frozen yogurt,pizza,fast food","stress,sadness",1.0,...,1.0,2.0,5.0,1165.0,500.0,none,5.0,1.0,900.0,155.0
3,3.200,1.0,1.0,430.0,3.0,420.0,2.0,"pizza,mac,cheese,ice cream",boredom,2.0,...,1.0,2.0,5.0,725.0,690.0,unknown,3.0,1.0,1315.0,155.0
4,3.500,1.0,1.0,720.0,2.0,420.0,2.0,"ice cream,chocolate,chips","stress,boredom,cravings",1.0,...,1.0,1.0,4.0,940.0,500.0,Softball,4.0,2.0,760.0,190.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,3.500,1.0,1.0,610.0,4.0,420.0,2.0,"wine. mac,cheese,pizza,ice cream","boredom,sadness",2.0,...,1.0,1.0,5.0,940.0,500.0,Softball,5.0,1.0,1315.0,156.0
121,3.000,1.0,1.0,265.0,2.0,315.0,2.0,"pizza,wings,cheesecake","loneliness,homesick,sadness",2.0,...,1.0,1.0,4.0,940.0,500.0,basketball,5.0,2.0,1315.0,180.0
122,3.882,1.0,1.0,720.0,3.0,420.0,1.0,"rice,potato,seaweed soup",sadness,2.0,...,1.0,2.0,5.0,580.0,690.0,none,4.0,2.0,1315.0,120.0
123,3.000,2.0,1.0,720.0,4.0,420.0,1.0,"mac n cheese,lasagna,pizza","happiness,they are some of my favorite foods",2.0,...,2.0,2.0,1.0,940.0,500.0,unknown,3.0,1.0,1315.0,135.0
