# Sampling notebook

This notebook assumes that you have downloaded the complete Open Food Facts CSV export (link in the README file) into ./datas.

In [44]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

***
# Application concept: Foodprint
## Description
Foodprint gives the carbon footprint and Nutri-Score of a product when you scan it.
The carbon footprint information is given through a new value, the Carbon-Score, or C-Score.
This C-Score is calculated from:
- the product's origin
- the country where the product was processed
- the origin of the packaging
- the packaging materials
- the place of purchase
- the ingredients (those that may be harmful to the environment because of how they are grown or processed)

These metrics will help people eat better and in a more environmentally friendly way.

## Making the difference
Foodprint will not only give you the C-Score and Nutri-Score of each product you scan, but also the same metrics for the product's brand.
More specifically, for both the C-Score and the Nutri-Score, you will get:
- the average score of all the brand's products
- the worst score among the brand's products
- the best score among the brand's products
- the number and percentage of products beyond the average score
- the total number of products

These metrics will help people consume better and push big companies to make more of an effort.
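
The brand-level metrics listed above could be computed with a pandas groupby. The following is only a sketch on made-up data: the numeric mapping of grades (a=1 to e=5) and the reading of "beyond the average" as "worse than the brand's average" are assumptions, not something this notebook defines. The C-Score is left out because no formula is given for it.

```python
import pandas as pd

# Hypothetical toy data; the column names match the Open Food Facts export.
products = pd.DataFrame({
    "brands": ["A", "A", "A", "B", "B"],
    "nutriscore_grade": ["a", "c", "e", "b", "b"],
})

# Assumption: map grades to numbers so they can be averaged (a=1 is best, e=5 is worst).
grade_to_num = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
products["score"] = products["nutriscore_grade"].map(grade_to_num)

# Average, best, worst and total per brand.
brand_stats = products.groupby("brands")["score"].agg(
    average="mean", best="min", worst="max", total="count"
)

# Count the products that score worse than their own brand's average.
brand_mean = products.groupby("brands")["score"].transform("mean")
worse = (products["score"] > brand_mean).groupby(products["brands"]).sum()
brand_stats["worse_than_average"] = worse
brand_stats["worse_than_average_pct"] = 100 * worse / brand_stats["total"]

print(brand_stats)
```

In the real dataset the brands column can hold several comma-separated brands per product, so it would probably need to be split before grouping.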

## Useful fields

In [45]:
usecols = [
    "code",
    "url",
    "product_name",
    "quantity",
    "packaging_tags",
    "brands",
    "categories_en",
    "origins_en",
    "manufacturing_places",
    "stores",
    "countries",
    "nutriscore_grade",
    "image_url",
    "image_small_url"
]

In [46]:
# Read these columns as text rather than numbers.
# Of these, only "code" is listed in usecols.
types = {
    "code": str,
    "emb_codes": str,
    "first_packaging_code_geo": str,
    "cities_tags": str,
}

# Read the export in chunks to limit memory use, keep only the rows
# that have a value in every selected column, and concatenate once at the end.
chunks = []
for chunk in pd.read_csv("./datas/en.openfoodfacts.org.products.csv", sep="\t", skipinitialspace=True, chunksize=100000, dtype=types, usecols=usecols):
    chunks.append(chunk.dropna())

data = pd.concat(chunks)

row_nb, col_nb = data.shape
(row_nb, col_nb)

(25063, 14)
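
To make the effect of the chunked read and of dropna easier to see, here is a small self-contained sketch. The three-row input and its column "extra" are invented for illustration. It shows that any row with a missing value in a selected column is dropped, and that the surviving rows keep their original index labels, which is why the index in the outputs is not contiguous.

```python
import io
import pandas as pd

# Hypothetical three-row, tab-separated stand-in for the real export.
raw = (
    "code\tproduct_name\tbrands\textra\n"
    "001\tSoup\tAcme\tx\n"
    "002\t\tAcme\ty\n"      # product_name is missing on this row
    "003\tBread\tBaker\tz\n"
)

chunks = []
for chunk in pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    dtype={"code": str},
    usecols=["code", "product_name", "brands"],
    chunksize=2,
):
    # Keep only the rows with a value in every selected column.
    chunks.append(chunk.dropna())

toy = pd.concat(chunks)
print(toy)
```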

In [47]:
data.head()

Unnamed: 0,code,url,product_name,quantity,packaging_tags,brands,categories_en,origins_en,manufacturing_places,stores,countries,nutriscore_grade,image_url,image_small_url
339,290616,http://world-en.openfoodfacts.org/product/0000...,Salade Cesar,0.980 kg,frais,Kirkland Signature,"Plant-based foods and beverages,Plant-based fo...",fr:quebec,Brossard Québec,Costco,Canada,c,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
380,1938067,http://world-en.openfoodfacts.org/product/0000...,Chaussons tressés aux pommes,1.200 kg,frais,Kirkland Signature,"Snacks,Sweet snacks,Biscuits and cakes,Viennoi...",fr:quebec,Brossard Québec,Costco,Canada,c,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
397,4302544,http://world-en.openfoodfacts.org/product/0000...,Pain Burger Artisan,1.008 kg / 12 pain,"frais,plastique",Kirkland Signature,fr:boulange,fr:quebec,"Brossard,Québec",Costco,Canada,b,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
425,8237798,http://world-en.openfoodfacts.org/product/0000...,Quiche Lorraine,1 400 kg,frai,Kirkland Signature,"Meals,Pizzas pies and quiches,Quiches,Lorraine...",fr:quebec,"Brossard,Québec",Costco,Canada,b,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
497,12167005,http://world-en.openfoodfacts.org/product/0000...,Brioches roulées avec raisins,0.900 kg,en-caissette,Kirkland Signature,"Snacks,Sweet snacks,Biscuits and cakes,Pastries","Canada,fr:quebec","Brossars,Québec",Costco,Canada,b,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
981,50,http://world-en.openfoodfacts.org/product/0000...,Financiers aux Amandes,660 g,"boite-carton,30-emballages-individuels",Bijou,"Biscuits and cakes,Cakes,Financiers,fr:patisse...",France,"France,87500","Bordeaux,Brive,Limoges,Saint-Yrieix",France,e,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
995,5012,http://world-en.openfoodfacts.org/product/0000...,Passata de tomates bio,700 g,"verre,sous-vide",Kazidomi,"Plant-based foods and beverages,Plant-based fo...",Italy,Italie,Kazidomi,"France,Belgique",a,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1620,12140,http://world-en.openfoodfacts.org/product/0001...,Salade végétarienne,"326,5 g",barquette,"Crous Languedoc Roussillon, Crous Resto'",fr:salade-vegetarienne,France,Montpellier,Crous Languedoc Roussillon,France,a,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1692,13000,http://world-en.openfoodfacts.org/product/0001...,Salade fermière,"331,5 g",barquette,Crous Languedoc Roussillon,"Meals,Salads,Prepared salads",France,Montpellier,Crous Languedoc Roussillon,France,b,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1771,14915,http://world-en.openfoodfacts.org/product/0001...,Ciabatta Rôti de porc BBQ,320 g,emballage,Crous Languedoc Roussillon,fr:sandwich-au-roti-de-porc,France,Montpellier,Crous Languedoc Roussillon,en:France,a,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...


***
# Create a sample of 5 000 rows
## Seeding
By fixing the seed, we ensure that the random number generator always returns the same sequence of numbers, so we always get the same sample.
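
As a quick check of this behavior, the sketch below reseeds the generator and draws twice. The range and size are arbitrary values chosen for illustration.

```python
import numpy as np

np.random.seed(294697)
first = np.random.randint(100, size=5)

# Reseeding with the same value restarts the same sequence.
np.random.seed(294697)
second = np.random.randint(100, size=5)

same = np.array_equal(first, second)
print(same)
```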
## Sampling
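We use the randint method to generate random integers corresponding to row positions in the dataset. Because randint draws with replacement, the same row can appear more than once in the sample.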

In [48]:
np.random.seed(294697)
sample_size = 5000
sample_index = np.random.randint(row_nb, size=sample_size)
sample_index

array([ 3650, 15739, 24013, ..., 23378,  3147,  7471])

In [49]:
sample = data.iloc[sample_index]
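
If duplicate rows are not wanted, sampling without replacement is an alternative. The sketch below shows two possible ways on a made-up DataFrame named toy. It is not what this notebook does, and neither option reproduces the sample obtained above with randint.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`.
toy = pd.DataFrame({"value": range(100)})

# Option 1: DataFrame.sample draws without replacement by default.
sample_a = toy.sample(n=10, random_state=294697)

# Option 2: a NumPy Generator with replace=False.
rng = np.random.default_rng(294697)
positions = rng.choice(len(toy), size=10, replace=False)
sample_b = toy.iloc[positions]

print(sample_a.index.is_unique, sample_b.index.is_unique)
```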

In [50]:
sample.head()

Unnamed: 0,code,url,product_name,quantity,packaging_tags,brands,categories_en,origins_en,manufacturing_places,stores,countries,nutriscore_grade,image_url,image_small_url
681972,3023290047507,http://world-en.openfoodfacts.org/product/3023...,Le VIENNOIS sur lit de poire en purée 4 x 100 g,400 g,"fr-pot-en-plastique,fr-opercule-en-metal,fr-et...",Nestlé,"Dairies,Desserts,Dairy desserts,Chocolate dess...",France,44,Auchan,"France,en:france",c,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1119975,4008452010031,http://world-en.openfoodfacts.org/product/4008...,"Alpenmilch 1, 5%Fett Weihenstephan",1 l,"tetra-pak,tetra-top",Weihenstephan,"Dairies,Milks,Homogenized milks,Fresh milks,Pa...","Germany,de:bavaria,de:bayern","Bavaria, Germany","Rewe,Edeka",en:Germany,a,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1796645,8480000341938,http://world-en.openfoodfacts.org/product/8480...,Pipa de girasol natural,200 g,"plastico,envasado-en-atmosfera-protectora,gree...",HACENDADO,"Plant-based foods and beverages,Plant-based fo...",Argentina,ARGENTINA,Mercadona,Spain,c,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1619676,8033753051984,http://world-en.openfoodfacts.org/product/8033...,Ricotta,250 g,plastica,Dolce natura,"Dairies,Fermented foods,Fermented milk product...",Italy,italia,Tuodí,en:Italy,b,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...
1892433,9310663800017,http://world-en.openfoodfacts.org/product/9310...,Famous Beef Pie,200 g,"plastic,wrapper",Mrs Macs,"Pies,Beef-pie",Australia,Australia,Liberty,Australia,d,https://images.openfoodfacts.org/images/produc...,https://images.openfoodfacts.org/images/produc...


***
# Writing to CSV
## Separator
We use the ';' separator because some fields contain lists of tags separated by ','.

In [51]:
sample.to_csv("./datas/sample.csv", sep=";")
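
To read the sample back, the same separator has to be passed to read_csv. The round trip below is a minimal sketch on invented rows and uses an in-memory buffer instead of ./datas/sample.csv. Reading code as text and using the first column as the index are assumptions about how the file would be consumed.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for `sample`.
toy = pd.DataFrame(
    {
        "code": ["0001", "0002"],
        "packaging_tags": ["verre,sous-vide", "plastic,wrapper"],
    },
    index=[995, 1892433],
)

buffer = io.StringIO()
toy.to_csv(buffer, sep=";")
buffer.seek(0)

# index_col=0 restores the original row labels written by to_csv.
restored = pd.read_csv(buffer, sep=";", index_col=0, dtype={"code": str})
print(restored)
```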