# Dataset notebook

Import all the chunks of the dataset, combine them into a single file, and remove the unneeded columns (the first two).

In [2]:
import pandas as pd

In [31]:
df = pd.read_csv('food.csv')
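The cell above reads a single, already combined file. As a minimal sketch of the combining step described in the introduction, the chunks could be stacked with `pd.concat` and the first two columns dropped by position. The in-memory chunks and the column names `id` and `code` are hypothetical stand-ins, not the real files.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the chunk files; real code would pass file paths instead.
chunk_a = io.StringIO("id,code,Description,Data.Protein\n1,a,BUTTER,0.85\n")
chunk_b = io.StringIO("id,code,Description,Data.Protein\n2,b,CHEESE,21.40\n")

# Read every chunk, then stack them into one frame with a fresh 0..n-1 index.
chunks = [pd.read_csv(source) for source in (chunk_a, chunk_b)]
combined = pd.concat(chunks, ignore_index=True)

# Drop the first two columns by position (assumed here to be the unneeded ones).
combined = combined.iloc[:, 2:]
```

Dropping by position only works if every chunk shares the same column order, so it may be safer to drop by name once the real column names are known.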

## Extract just the columns needed

- Description -> Product name
- Data.Fat.Total Lipid -> Fat
- Data.Carbohydrate -> Carbohydrates
- Data.Protein -> Proteins
- Data.Kilocalories -> Calories
- Data.Sugar Total -> Sugars
- Data.Fiber -> Fiber

In [32]:
COLUMNS = ["Description", "Data.Fat.Total Lipid", "Data.Carbohydrate", "Data.Protein", "Data.Kilocalories", "Data.Sugar Total", "Data.Fiber"]

In [33]:
df_projected = df[COLUMNS]
df_projected.columns = ['product name', 'fat', 'carbohydrates', 'proteins', 'calories', 'sugars', 'fiber']
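Assigning a list to `.columns` relies on the two lists staying in the same order. One possible alternative, sketched here on a one-row toy frame rather than the real `food.csv`, keeps each source name next to its new name in a single dictionary. The names `RENAMES`, `sample`, and `Extra` are illustrative only.

```python
import pandas as pd

# Source column name -> short name, mirroring the mapping listed above.
RENAMES = {
    "Description": "product name",
    "Data.Fat.Total Lipid": "fat",
    "Data.Carbohydrate": "carbohydrates",
    "Data.Protein": "proteins",
    "Data.Kilocalories": "calories",
    "Data.Sugar Total": "sugars",
    "Data.Fiber": "fiber",
}

# One-row toy frame standing in for food.csv, with an extra column that gets dropped.
sample = pd.DataFrame([{
    "Description": "BUTTER,WITH SALT",
    "Data.Fat.Total Lipid": 81.11,
    "Data.Carbohydrate": 0.06,
    "Data.Protein": 0.85,
    "Data.Kilocalories": 717,
    "Data.Sugar Total": 0.06,
    "Data.Fiber": 0.0,
    "Extra": 1,
}])

# Select only the mapped columns, then rename them in one step.
projected = sample[list(RENAMES)].rename(columns=RENAMES)
```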

## Sanitize dataset

Some rows do not appear to follow the per-100 g convention, since some foods have a protein, fat, or carbohydrate value above 100.
For this reason, those rows are pruned from the dataset. Rows with missing values are dropped first.

The dataset is then sorted alphabetically by product name.

In [41]:
df_without_nulls = df_projected.dropna(how='any', axis=0)

In [42]:
# Keep a row only if all three macronutrient values are at most 100 (per 100 g of product)
df_sanitized = df_without_nulls[(df_without_nulls["proteins"] <= 100) & (df_without_nulls["carbohydrates"] <= 100) & (df_without_nulls["fat"] <= 100)]
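The pruning rule described above (drop a row when any of protein, carbohydrates, or fat exceeds 100) can be illustrated on toy data. This is a sketch with made-up rows, not values from the dataset. It builds one boolean mask over the three columns and also keeps the rejected rows for inspection.

```python
import pandas as pd

# Made-up rows: one valid, two with a single out-of-range value.
toy = pd.DataFrame({
    "product name": ["ok", "bad protein", "bad fat"],
    "proteins": [20.0, 150.0, 5.0],
    "carbohydrates": [30.0, 10.0, 5.0],
    "fat": [10.0, 5.0, 120.0],
})

macros = ["proteins", "carbohydrates", "fat"]

# True only for rows where every macronutrient value is at most 100.
within_range = (toy[macros] <= 100).all(axis=1)

kept = toy[within_range]
dropped = toy[~within_range]  # handy for checking what was pruned
```

Combining the conditions with "all" matters here: combining them with "any" would keep a row as long as just one of the three values is in range.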

## Remove duplicates

In [15]:
df_no_duplicates = df_sanitized.drop_duplicates('product name')
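When two rows share a product name, `drop_duplicates` keeps the first one by default (`keep='first'`) and discards the rest, even if their other values differ. A small sketch on made-up rows shows the effect.

```python
import pandas as pd

# Made-up rows: "A" appears twice with different calorie values.
toy = pd.DataFrame({
    "product name": ["A", "A", "B"],
    "calories": [100, 200, 300],
})

# Only the first occurrence of each product name survives.
deduped = toy.drop_duplicates("product name")
```

If the later entry should win instead, `keep='last'` is the corresponding option.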

In [16]:
df_no_duplicates

| | product name | fat | carbohydrates | proteins | calories | sugars | fiber |
|---|---|---|---|---|---|---|---|
| 0 | BUTTER,WITH SALT | 81.11 | 0.06 | 0.85 | 717 | 0.06 | 0.0 |
| 1 | BUTTER,WHIPPED,WITH SALT | 81.11 | 0.06 | 0.85 | 717 | 0.06 | 0.0 |
| 2 | BUTTER OIL,ANHYDROUS | 99.48 | 0.00 | 0.28 | 876 | 0.00 | 0.0 |
| 3 | CHEESE,BLUE | 28.74 | 2.34 | 21.40 | 353 | 0.50 | 0.0 |
| 4 | CHEESE,BRICK | 29.68 | 2.79 | 23.24 | 371 | 0.51 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7408 | FROG LEGS,RAW | 0.30 | 0.00 | 16.40 | 73 | 0.00 | 0.0 |
| 7409 | MACKEREL,SALTED | 25.10 | 0.00 | 18.50 | 305 | 0.00 | 0.0 |
| 7410 | SCALLOP,(BAY&SEA),CKD,STMD | 1.40 | 0.00 | 23.20 | 112 | 0.00 | 0.0 |
| 7411 | SNAIL,RAW | 1.40 | 2.00 | 16.10 | 90 | 0.00 | 0.0 |


## Sort by name and save final dataset

In [45]:
final_dataset = df_no_duplicates.sort_values('product name')

In [46]:
final_dataset.to_csv("food_dataset.csv", index=False)
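To check that the saved file comes back as expected, the sort-and-save step can be round-tripped. The sketch below uses three rows from the output above and an in-memory buffer in place of `food_dataset.csv`, so it does not touch the disk.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "product name": ["SNAIL,RAW", "BUTTER,WITH SALT", "CHEESE,BLUE"],
    "calories": [90, 717, 353],
})

# Sort alphabetically by name, then write without the index column.
ordered = toy.sort_values("product name")
buffer = io.StringIO()  # stands in for the output file
ordered.to_csv(buffer, index=False)

# Read the CSV text back; names containing commas are quoted on write and parsed correctly.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Because `index=False` omits the old index, the reloaded frame gets a fresh 0..n-1 index in sorted order.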