# 🍊 Orange is the new black

* [Plotly visualisation](#📉-Visualisation-Plotly)
    * [Distribution of orange juices by Nutriscore](#Répartition-des-jus-d'orange-par-Nutriscore)
    * [Sugar / kcal ratio](#Ratio-sucre-/-kilocalorie)
    * [Sugar content per 100 g](#Taux-de-sucre-pour-100gr)
* [Exploration](#🔍-Exploration)
    * [Most represented brands](#Marques-les-plus-représentées)
    * [Orange juices with no added sugar, colourings or preservatives](#Les-jus-d'orange-sans-sucre-ajouté,-sans-colorants,-sans-conservateurs)
    * [Orange juices with Nutriscore A](#Les-jus-d'orange-avec-Nutriscore-A)
    * [Orange juices with the most vitamin C](#Les-jus-d'orange-contenant-le-plus-de-vitamine-C)

Importing the libraries

In [21]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

# 📉 Plotly visualisation

In [51]:
# import csv file from cleaning notebook
df = pd.read_csv("data/orangeisthenewblack.csv")

In [52]:
# convert NaN to 0
df = df.fillna(0)
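
Filling every NaN with 0 is what makes the `!= 0` filters used later in this notebook work, but it also writes an integer 0 into text columns such as `nutriscore_grade`. As a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset), the same rows can be selected by keeping NaN and filtering with `notna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "nutriscore_grade": pd.Series(["c", np.nan, "b"], dtype=object),
    "sugars_100g": [8.6, np.nan, 7.8],
})

# Approach used in this notebook: NaN becomes 0, then rows are filtered with != 0
filled = toy.fillna(0)
graded_filled = filled[filled["nutriscore_grade"] != 0]

# Alternative: leave NaN in place and keep the rows where a grade is present
graded_notna = toy[toy["nutriscore_grade"].notna()]

print(graded_filled.index.tolist(), graded_notna.index.tolist())
```

Both selections keep the same rows; the second one leaves the numeric columns free of artificial zeros.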

In [53]:
df.head()

Unnamed: 0,product_name,brands,countries,nutriscore_score,nutriscore_grade,pnns_groups_1,pnns_groups_2,energy-kcal_100g,energy_100g,fat_100g,saturated-fat_100g,carbohydrates_100g,sugars_100g,proteins_100g,salt_100g,sodium_100g,nutrition-score-fr_100g
0,Pago Ace - Orange - Carotte - Citron,pago,France,3.0,c,Beverages,Sweetened beverages,40.0,168.0,0.5,0.1,9.2,9.2,0.5,0.0,0.0,3.0
1,LE PUR JUS Orange sans pulpe,Joker,France,3.0,c,Beverages,Fruit juices,42.0,186.0,0.0,0.0,8.6,8.6,0.6,0.0125,0.005,3.0
2,Jus d'orange sans pulpe,innocent,France,2.0,c,Beverages,Fruit juices,36.0,155.0,0.0,0.0,7.8,7.8,0.7,0.0,0.0,2.0
3,Innocent jus d'orange avec pulpe 900ml,Innocent,France,2.0,c,Beverages,Fruit juices,36.0,156.0,0.0,0.0,7.8,7.8,0.7,0.0,0.0,2.0
4,LE PUR JUS Orange sans pulpe,joker,France,2.0,c,Beverages,Fruit juices,42.0,176.0,0.0,0.0,8.6,8.6,0.6,0.0125,0.005,2.0


# <b>Distribution of orange juices by Nutriscore</b>

In [25]:
# drop products with no Nutriscore grade (missing grades were replaced by 0 above)
df_nutriscore = df[df.nutriscore_grade != 0]

In [26]:
df_nutriscore.nutriscore_grade.value_counts()

c    287
b     37
e     18
d     13
a      2
Name: nutriscore_grade, dtype: int64

In [27]:
# visualisation
fig = px.bar(
    df_nutriscore,
    x="nutriscore_grade",
    color="nutriscore_grade",
    labels={"nutriscore_grade": "Nutriscore", "count": "Number of products"},
    title="Distribution of orange juices by Nutriscore",
    category_orders={"nutriscore_grade": ['a', 'b', 'c', 'd', 'e']},
)
#fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
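
The counts behind the bar chart can also be computed explicitly and put in grade order before plotting. A minimal sketch, assuming a made-up `grades` Series in place of `df_nutriscore["nutriscore_grade"]`:

```python
import pandas as pd

# Hypothetical grades standing in for df_nutriscore["nutriscore_grade"]
grades = pd.Series(["c", "b", "c", "e", "c", "a"])

# Count products per grade, then force the a-to-e order; absent grades get 0
order = ["a", "b", "c", "d", "e"]
counts = grades.value_counts().reindex(order, fill_value=0)

print(counts.to_dict())
```

Reindexing keeps a grade with no products visible as a zero instead of dropping it from the table.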

# <b>Sugar / kilocalorie ratio</b>

In [28]:
# new dataframe with our target columns
df_visu1 = df[['sugars_100g', 'energy-kcal_100g']]

In [29]:
# min-max scaling of sugars_100g and energy-kcal_100g so that both lie on the same [0, 1] scale
df_visu1 = MinMaxScaler().fit_transform(df_visu1)

In [30]:
# convert the numpy array back to a dataframe
df_visu1 = pd.DataFrame(df_visu1, columns=['Sucre', 'Kcal'])
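
With its default settings, `MinMaxScaler` rescales each column independently to the range [0, 1]. A minimal sketch of the same formula in plain NumPy, on made-up sugar values:

```python
import numpy as np

# Hypothetical sugar values (g per 100 g), for illustration only
sugars = np.array([7.8, 8.6, 9.2])

# Min-max scaling: (x - min) / (max - min)
# The smallest value maps to 0 and the largest to 1
scaled = (sugars - sugars.min()) / (sugars.max() - sugars.min())

print(scaled)
```

Because the scaling depends on the column's minimum and maximum, a single extreme value compresses all the others toward 0.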

In [31]:
# visualisation
sugar = go.Histogram(
    x=df_visu1.Sucre,
    opacity=0.75,
    name="Sugar",
    marker=dict(color='rgba(171, 50, 96, 0.6)'))
kcal = go.Histogram(
    x=df_visu1.Kcal,
    opacity=0.75,
    name="Kcal",
    marker=dict(color='rgba(12, 50, 196, 0.6)'))

data = [sugar, kcal]
layout = go.Layout(barmode='overlay',
                   title='Sugar / kilocalorie ratio',
                   xaxis=dict(title='Min-max scaled value'),
                   yaxis=dict(title='Number of products'))
fig = go.Figure(data=data, layout=layout)
fig.show()
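
The overlaid histograms compare two scaled distributions rather than a ratio computed per product. As a sketch of how such a ratio could be derived, the example below uses the first three rows of `df.head()` plus one hypothetical row whose energy value is 0; the column name `sugar_per_kcal` is an assumption introduced here:

```python
import numpy as np
import pandas as pd

# First three rows of df.head() above, plus one hypothetical row with energy = 0
sample = pd.DataFrame({
    "sugars_100g": [9.2, 8.6, 7.8, 5.0],
    "energy-kcal_100g": [40.0, 42.0, 36.0, 0.0],
})

# Treat 0 kcal as missing so that the division does not produce inf
kcal = sample["energy-kcal_100g"].replace(0, np.nan)

# Grams of sugar per kilocalorie, per product
sample["sugar_per_kcal"] = sample["sugars_100g"] / kcal

print(sample)
```

Rows whose energy was missing end up as NaN and can be dropped before plotting.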

# <b>Sugar content per 100 g</b>

In [32]:
# new df for sugars
df_sugars = df

In [33]:
# delete product with sugars_100g = 0
df_sugars = df_sugars[df_sugars.sugars_100g != 0]

In [34]:
df_sugars['sugars_100g'].value_counts().sort_values(ascending=False)

8.7     77
10.0    51
9.0     42
11.0    23
9.5     18
        ..
10.6     1
6.0      1
10.6     1
5.1      1
7.9      1
Name: sugars_100g, Length: 78, dtype: int64

In [35]:
# bin edges every 5 g from 0 to 25, which gives 5 bins
bins = np.arange(0, 30, 5)

# use pd.cut to create the bins in a separate Series, so the numeric df['sugars_100g'] is not overwritten;
# with include_lowest=True the first bin also holds the 0 values (NaN filled with 0 earlier)
sugar_bins = pd.cut(df['sugars_100g'], bins, include_lowest=True)

# count the values in each bin, then sort from the lowest bin to the highest (`sort_index`)
agg = sugar_bins.value_counts().sort_index()

# name the interval index "bins" and turn the counts into a data frame
agg = agg.rename_axis("bins").reset_index(name="sugars_100g")

# Plotly cannot work with interval categories, so we need to turn them into strings
agg["bins"] = agg["bins"].astype("str")

agg

Unnamed: 0,bins,sugars_100g
0,"(-0.001, 5.0]",83
1,"(5.0, 10.0]",372
2,"(10.0, 15.0]",55
3,"(15.0, 20.0]",3
4,"(20.0, 25.0]",2


In [20]:
# pie chart
fig = px.pie(agg, values='sugars_100g', names='bins', title="Distribution of products by sugar content (0 to 25 g per 100 g)")
fig.show()

- 83 orange juices (16%) contain between 0 and 5 g of sugar per 100 g; this bin also includes any product whose missing sugar value was filled with 0
- The majority (372 products, 72%) contain between 5 and 10 g of sugar per 100 g
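
The percentages quoted above follow directly from the bin counts. A short sketch that recomputes them, with the counts copied from the `agg` table:

```python
import pandas as pd

# Bin counts copied from the agg table above
agg_counts = pd.Series(
    [83, 372, 55, 3, 2],
    index=["(-0.001, 5.0]", "(5.0, 10.0]", "(10.0, 15.0]", "(15.0, 20.0]", "(20.0, 25.0]"],
)

# Share of each bin in percent, rounded to one decimal
share = (agg_counts / agg_counts.sum() * 100).round(1)

print(share.to_dict())
```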

# 🔍 Exploration

# <b>Most represented brands</b>

In [36]:
df_brands = df

In [37]:
# delete product with brands = 0
df_brands = df_brands[df_brands.brands != 0]

In [38]:
df_brands['brands'].value_counts(ascending=False)

U                               38
Jafaden,Marque Repère           22
Auchan                          19
Leader Price                    18
Franprix                        13
                                ..
Rik & Rok                        1
La vie claire                    1
Les Toques Blanches du Monde     1
je sais pad                      1
Fruima                           1
Name: brands, Length: 162, dtype: int64
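
`value_counts` is case-sensitive, and `df.head()` shows the same brand written as both "Joker" and "joker". A minimal sketch of normalising the spelling before counting, using the brand values from those first five rows:

```python
import pandas as pd

# Brand values as they appear in df.head() above
brands = pd.Series(["pago", "Joker", "innocent", "Innocent", "joker"])

# Normalise whitespace and case so that spelling variants are counted together
normalised = brands.str.strip().str.lower()
counts = normalised.value_counts()

print(counts.to_dict())
```

Composite values such as "Jafaden,Marque Repère" would still need separate handling, for example by splitting on the comma.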

<b>The 5 most represented orange juice brands in the sample are retailer own brands:</b>
- U (Super U)
- Jafaden / Marque Repère (Leclerc)
- Auchan
- Leader Price
- Paquito (Intermarché)

<img src="img/brands_logo.jpg">

# <b>Orange juices with no added sugar, no colourings and no preservatives</b>

In [39]:
df_juice = df

In [40]:
# delete product with labels = 0
df_juice = df_juice[df_juice.labels != 0]

In [41]:
# isolate all rows with labels "Sans sucre ajouté, Sans colorants, Sans conservateurs"
df_juice = df_juice[df_juice["labels"].str.contains("Sans sucre ajouté, Sans colorants, Sans conservateurs")]
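
Matching the full string only finds rows where the three labels appear together in exactly that order. As a sketch of a more tolerant check, the example below tests each label independently; the sample strings and their comma-separated format are assumptions, and `has_all_labels` is a hypothetical helper:

```python
import pandas as pd

# Hypothetical label strings; the comma-separated format is an assumption
labels = pd.Series([
    "Sans sucre ajouté, Sans colorants, Sans conservateurs",
    "Sans conservateurs, Sans colorants, Sans sucre ajouté",
    "Sans sucre ajouté",
])

required = {"Sans sucre ajouté", "Sans colorants", "Sans conservateurs"}

def has_all_labels(text):
    # Split on commas, trim each label, and check that every required label is present
    found = {part.strip() for part in text.split(",")}
    return required <= found

# Order-independent match
mask = labels.apply(has_all_labels)

# Exact substring match, as used in the cell above
exact = labels.str.contains(
    "Sans sucre ajouté, Sans colorants, Sans conservateurs", regex=False
)

print(mask.tolist(), exact.tolist())
```

The second row carries all three labels in a different order, so only the set-based check selects it.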

<b>5 orange juices with no preservatives, no colourings and no added sugar</b>
- Pur jus orange bio, by Les Fées Bio
- 100% pur jus d'oranges pressées, by Andros
- Pur jus d'orange sans pulpe, by U Bio
- Pressade pur jus orange, by Pressade
- Le pur jus orange du matin, by Joker

<img src="img/juice_without.jpg">

# Orange juices with Nutriscore A

In [42]:
# isolate all rows with nutriscore_grade = a
df_nutriscoreA = df_nutriscore[df_nutriscore['nutriscore_grade'] == 'a']

There is only one in our dataset. After checking on the retailer's website (Super U), this orange juice turns out to have a Nutriscore of B, not A as recorded in our dataset.

<img src="img/nutriA.jpg">

# Orange juices with the most vitamin C

In [43]:
df_vit = df

In [44]:
# delete product with vitamin-c_100g = 0
df_vit = df_vit[df_vit['vitamin-c_100g'] != 0]

In [45]:
# keep only products with vitamin-c_100g > 0.039
df_vit = df_vit[df_vit['vitamin-c_100g'] > 0.039]

In [46]:
df_vit['vitamin-c_100g'].value_counts(ascending=False)

0.0450     6
0.0400     2
0.0420     1
24.0000    1
0.0500     1
0.0448     1
Name: vitamin-c_100g, dtype: int64
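
One value in this output, 24.0, is several hundred times larger than the others and may be a data-entry or unit error. A minimal sketch of adding an upper bound to the filter; the threshold of 1 g per 100 g is an assumption chosen only for illustration:

```python
import pandas as pd

# Distinct values from the value_counts output above (g of vitamin C per 100 g)
vit_c = pd.Series([0.045, 0.04, 0.042, 24.0, 0.05, 0.0448])

# Hypothetical upper bound, chosen only to illustrate the filter
upper_bound = 1.0

# Keep values above the existing 0.039 threshold and below the upper bound
plausible = vit_c[(vit_c > 0.039) & (vit_c < upper_bound)]

print(plausible.tolist())
```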

<b>Sample of 5 orange juices containing between 0.0420 and 0.0450 g of vitamin C per 100 g</b>

- 100 % pur jus orange, by Franprix
- Premium 100% jus d'orange sanguine, by Monoprix
- Le Pur jus - Sans pulpe Jus d'orange, by Joker
- Pur jus de fruit pressé 100% orange, by Innocent
- Pur jus d'oranges sanguines pressées, by Super U

<img src="img/C.jpg">

# Orange juices with Nutriscore E

In [47]:
df_E = df

In [48]:
# delete product with nutriscore = 0
df_E = df[df.nutriscore_grade != 0]

In [49]:
# convert nutriscore_grade to string values
df_E = df_E.assign(nutriscore_grade=df_E['nutriscore_grade'].astype(str))

In [50]:
# isolate all rows with nutriscore_grade = e
df_E = df_E[df_E["nutriscore_grade"] == "e"]
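
The grade selection is repeated for several grades in this notebook. As a sketch, it could be wrapped in a small helper; `select_grade` is a hypothetical function and the toy frame below is made up:

```python
import pandas as pd

def select_grade(frame, grade):
    # Exact match on the grade; rows whose grade is 0 (a filled NaN) never match a letter
    return frame[frame["nutriscore_grade"] == grade]

# Hypothetical toy frame; 0 stands for a missing grade after fillna(0)
toy = pd.DataFrame({
    "product_name": ["juice 1", "juice 2", "juice 3", "juice 4"],
    "nutriscore_grade": pd.Series(["e", "c", 0, "e"], dtype=object),
})

grade_e = select_grade(toy, "e")

print(grade_e["product_name"].tolist())
```

An exact comparison avoids both the string conversion and the separate removal of rows with a grade of 0.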

<b>Orange juices with Nutriscore E are mostly made from concentrate</b>

- Nectar d'orange, by Casino
- Jus d'orange, by Kas
- Jus d'orange, by Fruima
- Nectar d'orange, by Tous les jours

<img src="img/E.jpg">