# ***NUTRIWISE.io***
#### *Find the best ingredients for the dish you want to eat!*

## Problem
After much deliberation, I have finally decided: tonight it will be spaghetti bolognese.\
Standing in front of the shelf, the choice is hard: there is a wide range of pasta and sauces...\
But which ingredients are best for my health? For the environment?\
I don't want to scan every barcode; I need something that tells me instantly what to pick!\
The solution: **NUTRIWISE.io**

## Data
### Source
We use the `fr.openfoodfacts.org.products.csv` dataset provided by OpenFoodFacts.

### Variables used

After exploring the data, ***NUTRIWISE.io*** uses the following columns: `['code', 'product_name', 'main_category_fr', 'countries_tags', 'manufacturing_places_tags', 'nutriscore_score', 'nutriscore_grade', 'ecoscore_score', 'ecoscore_grade', 'carbon-footprint_100g', 'additives_n', 'sugars_100g', 'fat_100g', 'saturated-fat_100g', 'sodium_100g']`

The goal is to build a score that combines the Nutri-Score and the Eco-Score in order to find the optimal ingredient.
When the Nutri-Score is unavailable, we will try to estimate it from the sugar, fat, saturated fat and salt content.
When the Eco-Score is unavailable, we will try to estimate it from the carbon footprint.
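
One possible way to combine the two scores is sketched below. This is only an illustration, not the method retained by the project: the helper `combined_score`, the equal default weighting and the toy products are hypothetical, and the score ranges (Nutri-Score points from -15 to 40 where lower is better, Eco-Score points from 0 to 100 where higher is better) are assumptions about the `nutriscore_score` and `ecoscore_score` columns.

```python
import pandas as pd

# Assumed ranges (not stated in this notebook): Nutri-Score points run from
# -15 (best) to 40 (worst); Eco-Score points run from 0 (worst) to 100 (best).
NUTRI_MIN, NUTRI_MAX = -15, 40
ECO_MIN, ECO_MAX = 0, 100


def combined_score(df: pd.DataFrame, nutri_weight: float = 0.5) -> pd.Series:
    """Return a 0-1 score (higher is better) mixing both scores.

    Hypothetical helper: the equal default weighting is an arbitrary choice.
    Rows missing either score yield NaN.
    """
    # Flip the Nutri-Score so that 1.0 is the healthiest possible value
    nutri = (NUTRI_MAX - df["nutriscore_score"]) / (NUTRI_MAX - NUTRI_MIN)
    # Rescale the Eco-Score to the 0-1 range
    eco = (df["ecoscore_score"] - ECO_MIN) / (ECO_MAX - ECO_MIN)
    return nutri_weight * nutri + (1 - nutri_weight) * eco


# Toy products, made up for the example
toy = pd.DataFrame({
    "product_name": ["A", "B", "C"],
    "nutriscore_score": [-15, 40, 12.5],
    "ecoscore_score": [100, 0, None],
})
toy["combined"] = combined_score(toy)
print(toy)
```

Product C has no Eco-Score, so its combined score stays missing; this is where the fallback estimates described above would come in.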


## Data cleaning
### Approach
#### Geographic constraint

We first focus on France. The first step is therefore to keep only the ingredients sold in France.

In [63]:
# Imports
import pandas as pd
import time
import matplotlib.pyplot as plt

##### Reading the CSV

In [65]:
# Read the CSV file into a DataFrame
nrows = 500000  # Row limit used during testing to shorten the run time
cols = ['code', 'product_name', 'main_category_fr', 'countries_tags', 'manufacturing_places_tags',
        'nutriscore_score', 'nutriscore_grade', 'ecoscore_score', 'ecoscore_grade', 'carbon-footprint_100g',
        'additives_n', 'sugars_100g', 'fat_100g', 'saturated-fat_100g', 'sodium_100g']

start_time = time.time()
# To read the entire CSV, comment out the next line and uncomment the one below it
df = pd.read_csv('fr.openfoodfacts.org.products.csv', sep='\t', usecols=cols, low_memory=True, nrows=nrows)
# df = pd.read_csv('fr.openfoodfacts.org.products.csv', sep='\t', usecols=cols, low_memory=True)
end_time = time.time()
print(f"\nElapsed time: {end_time - start_time:.2f} seconds")

# Drop rows whose selling countries are unknown
df = df.dropna(subset=['countries_tags'])

# Keep only the rows whose countries_tags contain 'france'
france_df = df[df['countries_tags'].str.contains('france')]
france_df = france_df[cols]
france_df_length = len(france_df)  # Number of products sold in France




Elapsed time: 6.42 seconds


Unnamed: 0,code,product_name,main_category_fr,countries_tags,manufacturing_places_tags,nutriscore_score,nutriscore_grade,ecoscore_score,ecoscore_grade,carbon-footprint_100g,additives_n,sugars_100g,fat_100g,saturated-fat_100g,sodium_100g
448297,213402013916,Cuisse de poulet,Cuisses de poulet,en:france,,0.0,b,45.0,c,,,0.0,14.0,3.5,0.072
441028,209858019688,Gratin dauphinois,Gratins dauphinois,en:france,,,,71.0,b,,,,,,
414902,200177013784,Palmier au beurre,,en:france,,,,,unknown,,,24.0,19.0,12.0,0.452
477024,276397036346,Poulet cuit fumé,,en:france,,,,,unknown,,,0.9,9.6,2.8,0.52
438276,208961012272,Le Roulé Ail & Fines Herbes,,en:france,,,,,unknown,,,0.0,28.0,0.0,0.48
43447,178631,Dessicated coconut,Noix de coco râpée,"en:france,en:united-kingdom",,12.0,d,40.0,c,,1.0,6.1,62.0,53.4,0.04
488858,329450370014,Chair de crabe,,en:france,,,,,unknown,,,1.2,0.5,0.3,
332101,76808010442,PESTO,,en:france,,,,,unknown,,,2.0,23.0,3.0,2.4
419344,201517011484,Pains au lait,,en:france,,,,,unknown,,,,,,
223443,58449860075,Envirokidz peanut butter and chocolate leaping...,Cereales-au-beurre-de-cacahuetes,"en:france,en:united-states",usa,9.0,c,31.0,d,,1.0,27.5,5.0,0.0,0.375
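
The `str.contains('france')` filter used above is a substring test, so it would also keep any tag that merely contains the word. A stricter variant, sketched here under the assumption that `countries_tags` is a comma-separated list of tags such as `en:france`, splits the field and tests for exact membership. The helper `sold_in` and the toy tags (including the made-up `en:france-hypothetical-tag`) are hypothetical.

```python
import pandas as pd


def sold_in(tags: pd.Series, country_tag: str = "en:france") -> pd.Series:
    """Return a boolean mask: True where `country_tag` is one of the comma-separated tags."""
    # Split "en:france,en:united-kingdom" into a list, then test exact membership
    return tags.fillna("").str.split(",").apply(lambda parts: country_tag in parts)


toy_tags = pd.Series([
    "en:france",
    "en:france,en:united-kingdom",
    "en:spain",
    "en:france-hypothetical-tag",  # made-up tag that a substring test would match
    None,                          # unknown selling country
])
mask = sold_in(toy_tags)
print(mask.tolist())  # [True, True, False, False, False]
```

Missing values are treated as "not sold in France", which mirrors the `dropna` step of the cell above.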


In [44]:
cannot_be_na_cols = ['sugars_100g', 'fat_100g', 'saturated-fat_100g', 'sodium_100g', 'carbon-footprint_100g']
# Drop rows where all of the selected columns are missing (how='all')
cleaned_df = df.dropna(subset=cannot_be_na_cols, how='all')
cleaned_df

Unnamed: 0,code,product_name,manufacturing_places_tags,countries_tags,additives_n,nutriscore_score,nutriscore_grade,data_quality_errors_tags,main_category_fr,fat_100g,saturated-fat_100g,sugars_100g,sodium_100g,carbon-footprint_100g
1,0000000000000207025004,Andrè,,en:germany,,,,en:energy-value-in-kcal-does-not-match-value-c...,,2.00,2.00,12.60,,
2,00000000000003429145,L.casei,,en:spain,0.0,,,,,1.40,0.90,9.80,0.040,
3,00000000000026772226,Skyr,,en:france,,-5.0,a,,Fromages à la crème,0.20,0.10,3.90,0.036,
4,0000000000017,Vitória crackers,,en:france,,,,,,7.00,3.08,15.00,0.560,
6,000000000003327986,Filetes de pollo empanado,,en:spain,,,,,,1.90,1.00,,0.440,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,492860300329,Sliced apples with peanut butter,,en:united-states,0.0,-2.0,a,,Aliments à base de fruits et de légumes,13.81,1.93,9.94,0.127,
499996,492860300336,Celery & carrots,,en:united-states,4.0,2.0,b,,Aliments à base de fruits et de légumes,7.36,0.61,3.68,0.184,
499997,492860300381,Thai-style chicken wrap,,en:united-states,8.0,4.0,c,,Sandwichs,10.53,3.24,2.83,0.466,
499998,492860300541,Caesar chicken salad,,en:united-states,5.0,4.0,c,,en:salted-snacks,13.07,2.83,1.41,0.481,
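
The difference between `how='all'` (used in the cell above) and `how='any'` is easy to miss, so here is a small illustration on made-up data. The toy DataFrame and its values are hypothetical; only the `dropna` behaviour is the point.

```python
import numpy as np
import pandas as pd

# Toy data, made up for the example
toy = pd.DataFrame({
    "product_name": ["full", "partial", "empty"],
    "sugars_100g": [5.0, 2.0, np.nan],
    "fat_100g": [1.0, np.nan, np.nan],
})
subset = ["sugars_100g", "fat_100g"]

# how="all": a row is dropped only if every column of the subset is missing
kept_all = toy.dropna(subset=subset, how="all")
# how="any": a row is dropped as soon as one column of the subset is missing
kept_any = toy.dropna(subset=subset, how="any")

print(kept_all["product_name"].tolist())  # ['full', 'partial']
print(kept_any["product_name"].tolist())  # ['full']
```

With a column as sparse as `carbon-footprint_100g`, `how='any'` would likely remove almost every row, which may be why `how='all'` is used here.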


In [34]:
france_df.sample(n=20)

Unnamed: 0,code,product_name,main_category_fr,countries_tags,manufacturing_places_tags,sugars_100g,fat_100g,saturated-fat_100g,nutriscore_score,nutriscore_grade,sodium_100g,additives_n,carbon-footprint_100g,nutrition-score-fr_100g,data_quality_errors_tags
431005,206040019358,Grosse Boule pain 1kg,,en:france,,,,,,,,,,,
468613,253099017323,6 aiguillettes de poulet jaune,Aiguillettes de poulet,en:france,,0.0,1.3,0.4,-4.0,a,0.044,,,-4.0,
453342,217638022704,La Belle Escalope de dinde,Escalopes de dinde,en:france,france,,,,,,,,,,
448406,213485047952,Les paupiettes du chef,,en:france,,0.4,24.0,8.2,,,0.6,,,,
451402,216292026240,Mini viennoiserie,Viennoiseries,en:france,,,,,,,,,,,
434659,207481016609,Shaker fromage blc fraise,,en:france,,10.0,3.9,2.6,,,0.04,,,,
424095,202971024072,Tomates,,en:france,,,,,,,,,,,
453878,217823037919,Cuisses poulet,Cuisses de poulet,en:france,,0.0,5.9,1.7,3.0,c,0.06,,,3.0,en:energy-value-in-kcal-does-not-match-value-c...
335760,77544001275,"Osem Toasted Pasta Stars, 1.1 LB",,en:france,,2.0,1.0,0.0,,,0.0,1.0,,,
192655,49000054361,Powerade zero,Boissons énergisantes,en:france,,0.5,0.0,0.0,1.0,b,0.006,,,1.0,


In [24]:
filtered_df = df[df['countries_tags'].notnull()]
filtered_df = filtered_df[filtered_df['countries_tags'].str.contains('france')]
percent_nonnull = filtered_df.count() / len(filtered_df) * 100
print(percent_nonnull)



code                         100.000000
product_name                  96.212627
manufacturing_places_tags      2.849841
countries_tags               100.000000
additives_n                   14.434142
nutriscore_score              28.335303
nutriscore_grade              28.335303
data_quality_errors_tags       5.491626
main_category_fr              40.965016
fat_100g                      67.430391
saturated-fat_100g            67.427823
sugars_100g                   67.454793
sodium_100g                   65.590003
carbon-footprint_100g          0.006421
nutrition-score-fr_100g       28.335303
dtype: float64
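
The fill rates above can also be used to flag columns that are too sparse to rely on. The sketch below uses made-up data and a hypothetical 50 % threshold; `notna().mean()` gives the same ratio as `count() / len()`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for the example
toy = pd.DataFrame({
    "code": ["1", "2", "3", "4"],
    "sugars_100g": [5.0, np.nan, 2.0, np.nan],
    "carbon-footprint_100g": [np.nan] * 4,
})

# Share of non-missing values per column, in percent
percent_nonnull = toy.notna().mean() * 100

# Columns strictly below a hypothetical 50 % fill-rate threshold
sparse_cols = percent_nonnull[percent_nonnull < 50].index.tolist()

print(percent_nonnull)
print(sparse_cols)  # ['carbon-footprint_100g']
```

In the real output above, `carbon-footprint_100g` is filled for only about 0.006 % of the French products, so an Eco-Score estimate based on it would rarely be available.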


In [1]:
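# This cell needs the `df` DataFrame created in the CSV-reading cell above
# Keep rows with less than 100 g of sugars per 100 g (this also drops missing sugar values)
filtered_df = df[df['sugars_100g'] < 100]
# `nutriscore_score` is used because it is the score column loaded in `cols`
sugars_and_nutriscore = filtered_df[filtered_df['sugars_100g'].notnull() & filtered_df['nutriscore_score'].notnull()]

# Create a scatter plot of sugar content against the Nutri-Score
plt.scatter(sugars_and_nutriscore['sugars_100g'], sugars_and_nutriscore['nutriscore_score'])

# Set the x-axis and y-axis labels
plt.xlabel('Sugars (g per 100 g)')
plt.ylabel('Nutri-Score (nutriscore_score)')

# Show the plot
plt.show()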

NameError: name 'df' is not defined
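
Alongside the scatter plot, a correlation coefficient gives a single number for how closely sugar content and the Nutri-Score move together. The sketch below runs on made-up values, not on the OpenFoodFacts data, and uses the Pearson correlation that `Series.corr` computes by default.

```python
import numpy as np
import pandas as pd

# Toy values, made up for the example (the last row has no sugar value)
toy = pd.DataFrame({
    "sugars_100g": [0.0, 10.0, 20.0, 30.0, np.nan],
    "nutriscore_score": [-2.0, 3.0, 8.0, 13.0, 5.0],
})

# Pearson correlation; pairs with a missing value are ignored
r = toy["sugars_100g"].corr(toy["nutriscore_score"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would suggest that sugar alone carries much of the signal; on real data the relationship is expected to be noisier, since the score also depends on the other nutrients.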

### Bin

In [None]:
# import missingno as msno
# import matplotlib.pyplot as plt

# fig, ax = plt.subplots(figsize=(10, 6))
# msno.matrix(df.sample(100000), ax=ax)