# Project Description

The objectives of this project are:

1.   Use the Open Food Facts dataset to show the characteristics of certain product groups and the similarities between products and product groups, to provide a global view of the dataset, and to highlight salient features that are of interest to an analyst or stakeholder in this sector.

2.   Use machine learning algorithms to:

        *   predict the **nutriscore_grade** of a product given nutritional values and possibly other fields (as few as possible),
        *   predict the **nova_group** of a product given nutritional values and possibly other fields (as few as possible),
        *   predict the **pnns_groups_1** of a product given nutritional values and possibly other fields (as few as possible),
        *   predict the **pnns_groups_2** of a product given nutritional values and possibly other fields (as few as possible),
        *   predict the **categories** (either atomic categories or lists of categories) of a product given nutritional values and possibly other fields (as few as possible),
        *   predict one or more **nutritional values** (e.g. sugars_100g) given the other nutritional values and possibly other fields (as few as possible).
        






# Importing the Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
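# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')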
from scipy.stats import norm, skew
import numpy as np
import seaborn as sns
from sklearn.preprocessing import  StandardScaler,  LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
import statsmodels.api as sm
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# plot_roc_curve was removed in scikit-learn 1.2 and is not used below, so it is not imported
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Importing the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/My Drive/off_complete.csv', sep = '\t', nrows = 10000)

In [None]:
data.shape

In [None]:
data.head()

# Data Preprocessing

In [None]:
# axis must be passed by keyword in recent pandas versions
data.drop(['salt_100g', 'nutrition-score-fr_100g'], axis=1, inplace=True)

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

**Note**

The product code is a manufacturer-specific barcode that encodes the country or company that markets the product, the manufacturer, the article code and a control key. Since the dataset already contains the country that markets the product, we can drop the product code.

In [None]:
data.drop(['code'], axis=1, inplace=True)

In [None]:
data['url'].values

**Note :**

Every product URL is built from the product name and the brand, which are already available as separate columns, so we can drop this feature.

In [None]:
data.drop(['url'], axis=1, inplace=True)

I will compute the percentage of null values in every feature and drop the features with more than 70% null values.

In [None]:
for i in data.columns:
    # share of missing values in column i, as a percentage of all rows
    h = data[i].isnull().sum() / len(data) * 100
    print('The percentage of null values in ' + i + ' is: ', h, '%')


In [None]:
# drop the mostly empty nutrient columns in a single call
data.drop(['trans-fat_100g', 'cholesterol_100g', 'vitamin-a_100g',
           'vitamin-c_100g', 'calcium_100g', 'iron_100g'], axis=1, inplace=True)
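
The 70% rule stated above can also be applied programmatically instead of listing the columns by hand. The following is a minimal sketch on a toy frame; the frame `toy`, its values and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "product_name": ["a", "b", "c", "d"],
    "sugars_100g": [1.0, np.nan, 3.0, 4.0],
    "iron_100g": [np.nan, np.nan, np.nan, 0.1],
})

threshold = 70  # percentage of nulls above which a column is dropped

# isnull().mean() is the fraction of missing values per column
null_pct = toy.isnull().mean() * 100
to_drop = null_pct[null_pct > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Here only `iron_100g` (75% missing) exceeds the threshold, so it is the only column removed.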

I will pick the countries_tags values with the highest number of products.


In [None]:
data['countries_tags'].value_counts().head(17)

In [None]:
data['countries_tags']=data['countries_tags'].fillna('Unknown')

In [None]:
countries_tags_values = data['countries_tags'].values
most_countries = ('en:france', 'en:germany', 'en:spain', 'en:mexico', 'en:united-kingdom', 'en:canada',
                  'en:united-states', 'en:belgium', 'en:switzerland', 'en:poland')
# one list of 0/1 flags per country, aligned with the rows of data
countrie = {country: [] for country in most_countries}
for i in countries_tags_values:
    for j in most_countries:
        # 1 if the country tag appears in the product's tag string, else 0
        countrie[j].append(1 if j in str(i) else 0)


In [None]:
for i in most_countries:
  # assign the list directly; this does not depend on index alignment
  data[i] = countrie[i]

In [None]:
data.drop(['countries_tags'], axis=1, inplace=True)
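
The indicator columns built above can also be produced without explicit Python loops. This is a sketch under the assumption that a plain substring match is what is wanted; the toy frame and the shortened tag list are illustrative.

```python
import pandas as pd

# Toy stand-in for the countries_tags column (illustrative values)
toy = pd.DataFrame({"countries_tags": ["en:france", "en:france,en:belgium", None]})
top_countries = ["en:france", "en:belgium"]  # shortened, hypothetical list

tags = toy["countries_tags"].fillna("Unknown")
for tag in top_countries:
    # regex=False treats the tag as a literal substring
    toy[tag] = tags.str.contains(tag, regex=False).astype(int)

print(toy[top_countries])
```

A product sold in several countries gets a 1 in each matching column, which is the same behaviour as the nested loops.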

Next, I will work on the product name feature and reduce each name to its first word.

In [None]:
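# missing names would break the string operations below, so replace them first
product_names_list = data['product_name'].fillna('Unknown').astype(str).values
list_products = []
for i in product_names_list:
  # remove the French articles and the digit 2 (this also removes them inside words)
  i = i.replace('Le', '')
  i = i.replace('La', '')
  i = i.replace('2', '')
  words = i.split()
  # guard against names that are empty after cleaning
  list_products.append(words[0] if words else 'Unknown')
data['product_name'] = list_products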

In [None]:
data['product_name'].value_counts().head(10)
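
For comparison, the first-word extraction can be written with pandas string methods. This sketch deliberately differs from the cell above in one respect: it strips only a leading article ("Le " or "La ") rather than every occurrence of those letters. The sample names are made up.

```python
import pandas as pd

# Illustrative names, including a missing value and a blank string
names = pd.Series(["Le Pain complet", "Organic Oats", None, "   "])

first_word = (
    names.fillna("")
         .str.replace(r"^(Le|La)\s+", "", regex=True)  # strip a leading article only
         .str.split()                                  # split on whitespace
         .str[0]                                       # NaN when there is no word
         .fillna("Unknown")
)

print(first_word.tolist())
```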

In [None]:
product_name__values = data['product_name'].values
most_products = ('Organic', 'Pain', 'Chocolate', 'Filet', 'Sauce', 'Salade', 'Crème', 'Jambon', 'Original', 'Bio', 'Jus')
# one list of 0/1 flags per frequent first word
products = {name: [] for name in most_products}
for i in product_name__values:
    for j in most_products:
        products[j].append(1 if j in str(i) else 0)
for i in most_products:
  data[i] = products[i]
data.drop(['product_name'], axis=1, inplace=True)


In [None]:
data.drop(['brands'], axis=1, inplace=True)

In [None]:
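# missing categories would break split(), so label them 'Other' first
categories_list = data['categories'].fillna('Other').astype(str).values
list_categories = []
for i in categories_list:
  words = i.split()
  # keep the first word of the category string, without commas
  first_word = words[0].replace(',', '') if words else 'Other'
  list_categories.append(first_word)
data['categories'] = list_categories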


In [None]:
data['categories'].unique().shape

In [None]:
data['categories'].value_counts().head(20)

In [None]:
categories__values = data['categories'].values
most_categories = ('Snacks', 'Aliments', 'Plant-based', 'Produits', 'Alimentos', 'Viandes', 'Boissons', 'Pflanzliche', 'Plats', 'Groceries', 'Dairies', 'Epicerie', 'Beverages', 'Botanas',
                   'Desserts', 'Milchprodukte', 'Imbiss', 'Meats', 'Sandwichs', 'Conserves')
list_categories = []
for i in categories__values:
    matched = 'Other'
    for j in most_categories:
        if j in str(i):
            matched = j
            break  # keep a single label per product so the list stays aligned with the rows
    list_categories.append(matched)

In [None]:
data['categories'] = pd.DataFrame(list_categories)

In [None]:
data.drop(['additives_tags', 'states'], axis=1, inplace=True)

In [None]:
data['nutriscore_grade'] = data['nutriscore_grade'].fillna('None')

In [None]:
def function_escalier(x):
  """Step function: round half up for x >= 0, truncate toward zero for x < 0."""
  if x >= 0:
    return int(x + 0.5)
  else:
    return int(x)

In [None]:
data['nutriscore_score'] = data['nutriscore_score'].astype(float)
data_grouped = data.groupby('pnns_groups_2')['nutriscore_score'].mean()
# iterate only over groups with a defined mean (skips missing groups and groups with no score)
for grade, group_mean in data_grouped.dropna().items():
  p = data['pnns_groups_2'] == grade
  data.loc[p, 'nutriscore_score'] = data.loc[p, 'nutriscore_score'].fillna(function_escalier(group_mean))
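
The group-wise imputation above (fill a missing score with the rounded mean of its `pnns_groups_2` group) can be sketched with `groupby(...).transform`. The toy values are made up, and `Series.round` is used here instead of `function_escalier`, so results may differ when a group mean falls exactly on a half.

```python
import numpy as np
import pandas as pd

# Toy data: two groups, each with one missing score (illustrative values)
toy = pd.DataFrame({
    "pnns_groups_2": ["Cheese", "Cheese", "Cheese", "Fruits", "Fruits"],
    "nutriscore_score": [14.0, 16.0, np.nan, -4.0, np.nan],
})

# transform("mean") returns the group mean aligned with every row
group_mean = toy.groupby("pnns_groups_2")["nutriscore_score"].transform("mean")
toy["nutriscore_score"] = toy["nutriscore_score"].fillna(group_mean.round())

print(toy)
```

Because the group means are aligned row by row, no explicit loop over the groups is needed.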

In [None]:
df =pd.get_dummies(data['nova_group'])
for i in df.columns[:-1] : 
         data['nova_group'+str(i)] = df[i]
data.drop(['nova_group'], axis=1, inplace=True)

In [None]:
dh = data.groupby(['pnns_groups_1'])['pnns_groups_2'].value_counts()

In [None]:
dh = list(dict(dh))
dh

In [None]:
data['pnns_groups_1'] = data['pnns_groups_1'].fillna('To Replace')
d = data[data['pnns_groups_1'] =='To Replace']['pnns_groups_2']
d.unique()

In [None]:
# set the parent group (pnns_groups_1) of two sub-groups (pnns_groups_2)
data.loc[data['pnns_groups_2'] == 'Alcoholic beverages', 'pnns_groups_1'] = 'Beverages'
data.loc[data['pnns_groups_2'] == 'Pizza pies and quiches', 'pnns_groups_1'] = 'Composite foods'

In [None]:
lists = ['energy-kcal_100g', 'fat_100g', 'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g', 'sodium_100g']
for feature in lists:
    data[feature] = data[feature].astype(float)
    data_grouped = data.groupby('pnns_groups_2')[feature].mean()
    # fill missing values with the mean of the product's pnns_groups_2 group
    for value_f, group_mean in data_grouped.dropna().items():
        p = data['pnns_groups_2'] == value_f
        data.loc[p, feature] = data.loc[p, feature].fillna(group_mean)

In [None]:
data['fat_100g'] = data['fat_100g'].astype(float)
data_grouped = data.groupby('pnns_groups_2')['fat_100g'].mean()
# fill any remaining missing fat values with the rounded group mean
for grade, group_mean in data_grouped.dropna().items():
  p = data['pnns_groups_2'] == grade
  data.loc[p, 'fat_100g'] = data.loc[p, 'fat_100g'].fillna(function_escalier(group_mean))

In [None]:
df =pd.get_dummies(data['nutriscore_grade'], drop_first=True)
for i in df.columns : 
         data['nutriscore_grade_'+i] = df[i]

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
data['pnns_groups_2'].unique()

In [None]:
df =pd.get_dummies(data[['pnns_groups_2']], drop_first=True)
for i in df.columns : 
         data[i] = df[i]
data.drop(['pnns_groups_2'], axis=1, inplace=True)

In [None]:
df =pd.get_dummies(data[['pnns_groups_1']], drop_first=True)
for i in df.columns : 
         data[i] = df[i]
data.drop(['pnns_groups_1'], axis=1, inplace=True)

In [None]:
data['nutriscore_score'] = data['nutriscore_score'].astype(int)


In [None]:
data.drop(['nutriscore_grade'], axis=1, inplace=True)

In [None]:
data.head(15)

In [None]:
features = data
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics2 = []
for i in features.columns:
    if features[i].dtype in numeric_dtypes: 
        numerics2.append(i)

skew_features = features[numerics2].apply(lambda x: skew(x)).sort_values(ascending=False)
skews = pd.DataFrame({'skew':skew_features})
skews

**Some Statistics : Skewness**
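
As a small illustration of what the skewness table measures, the sketch below computes the skew of a right-tailed sample before and after a `log1p` transform. The sample is made up and the transform is only one common option; the notebook itself does not apply it.

```python
import numpy as np
from scipy.stats import skew

# A geometric sequence has a long right tail (illustrative data)
values = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])

raw_skew = skew(values)            # clearly positive: long right tail
log_skew = skew(np.log1p(values))  # much closer to zero after the transform

print(raw_skew, log_skew)
```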

**Handling the Outliers**

In [None]:
le =  LabelEncoder()
data['categories'] = le.fit_transform(data['categories'])

In [None]:
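tr = data
factor = 4
# only numeric columns have a mean and a standard deviation
for y in data.select_dtypes(include='number').columns:
  upper_lim = data[y].mean() + data[y].std() * factor
  lower_lim = data[y].mean() - data[y].std() * factor
  # filter cumulatively so that the limits of every column are enforced
  tr = tr[(tr[y] < upper_lim) & (tr[y] > lower_lim)]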
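
The same "mean plus or minus 4 standard deviations" rule can be expressed as a single boolean mask over all columns. This is a sketch on made-up data, with `factor = 4` taken from the cell above; the column values are assumptions chosen so that one row is an obvious outlier.

```python
import pandas as pd

# 19 ordinary rows and one extreme sugar value (illustrative data)
toy = pd.DataFrame({
    "sugars_100g": [5.0] * 19 + [500.0],
    "fat_100g": [1.0, 2.0] * 10,
})

factor = 4
z = (toy - toy.mean()) / toy.std()       # z-score of every cell, column by column
keep = (z.abs() < factor).all(axis=1)    # a row is kept only if all its columns are within range
filtered = toy[keep]

print(len(toy), len(filtered))
```

Only the row with 500 g of sugar falls outside the limits, so 19 of the 20 rows remain.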


# Prediction of the categories

In [None]:
X = data.drop(['categories'], axis=1)
y = data['categories']

In [None]:
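overfit = []
for i in X.columns:
    counts = X[i].value_counts()
    # value_counts() sorts by frequency, so this is the count of the most frequent value
    zeros = counts.iloc[0]
    # flag quasi-constant columns: a single value covers more than 99.94% of the rows
    if zeros / len(X) * 100 > 99.94:
        overfit.append(i)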

In [None]:
overfit = list(overfit)
overfit

Let's drop these quasi-constant features (stored in `overfit`) from 'X'.

In [None]:
X.drop(overfit,axis=1,inplace=True)
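
The quasi-constant filter used above can be sketched in vectorised form with `value_counts(normalize=True)`. The toy frame and the names `dominant_share` and `quasi_constant` are illustrative assumptions; the 99.94 threshold is the one used in the notebook.

```python
import pandas as pd

# One almost-constant flag and one fully varying column (illustrative data)
X_toy = pd.DataFrame({"flag": [0] * 9999 + [1], "fat_100g": range(10000)})

threshold = 99.94  # percentage of rows covered by the most frequent value

# share (in %) of the most frequent value in each column
dominant_share = X_toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0] * 100
)
quasi_constant = dominant_share[dominant_share > threshold].index.tolist()

print(quasi_constant)
```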

Splitting the dataset into the training set and the test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size = 0.2)

Scaling the dataset

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
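
When scaling is combined with cross-validation, one common way to keep the scaler from seeing the validation folds is to wrap it in a pipeline. The sketch below uses synthetic data from `make_classification`; the data, `pipe` and `cv_scores` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler is re-fitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=5)

print(cv_scores.mean())
```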

In this step, we will use seven algorithms:
  - Logistic Regression (a linear model) and a Support Vector Classifier (which uses an RBF kernel by default in scikit-learn)
  - The $K$-Nearest Neighbors Classifier algorithm
  - Two tree-based algorithms (Decision Tree Classifier and its ensemble counterpart, Random Forest Classifier)
  - Two Gradient Boosting algorithms (LightGBM, XGBoost)
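
The comparison of several classifiers can be organised as a loop over a dictionary of models, measuring train and test accuracy directly on the features. This is a sketch on synthetic data with two of the scikit-learn models only; the data and the `results` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train": accuracy_score(y_tr, model.predict(X_tr)) * 100,  # accuracy in %
        "test": accuracy_score(y_te, model.predict(X_te)) * 100,
    }

print(results)
```

Adding another model then only requires one more entry in `models`.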

In [None]:
train_accuracies = {'Logistic Regression':0, 'Support Vector Classifier':0, 'K-Neighbors Classifier':0, 'Random Forest Classifier':0, 'Decision Tree Classifier' : 0, 
                    'XGBoost Classifier' : 0,'lightgbm Classifier' : 0}
test_accuracies = {'Logistic Regression':0, 'Support Vector Classifier':0, 'K-Neighbors Classifier':0, 'Random Forest Classifier':0, 'Decision Tree Classifier' : 0, 
                    'XGBoost Classifier' : 0,'lightgbm Classifier' : 0}

In [None]:
lgr = LogisticRegression()
lgr.fit(X_train , y_train)
train_preds = lgr.predict(X_train)
test_preds = lgr.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(lgr, X_train, y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("Logistic Regression results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['Logistic Regression'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['Logistic Regression'] = scores2.mean()*100

In [None]:
svc = SVC()
svc.fit(X_train , y_train)
train_preds = svc.predict(X_train)
test_preds = svc.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(svc, X_train, y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("Support Vector Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['Support Vector Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['Support Vector Classifier'] = scores2.mean()*100

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train , y_train)
train_preds = knn.predict(X_train)
test_preds = knn.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(knn, X_train, y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("K-Neighbors Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['K-Neighbors Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['K-Neighbors Classifier'] = scores2.mean()*100

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train , y_train)
train_preds = dt.predict(X_train)
test_preds = dt.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(dt, X_train, y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("Decision Tree Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['Decision Tree Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['Decision Tree Classifier'] = scores2.mean()*100

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train , y_train)
train_preds = rf.predict(X_train)
test_preds = rf.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(rf, X_train, y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("Random Forest Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['Random Forest Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['Random Forest Classifier'] = scores2.mean()*100

In [None]:
gb = XGBClassifier()
gb.fit(X_train , y_train)
train_preds = gb.predict(X_train)
test_preds = gb.predict(X_test)
# 10-fold cross-validated accuracy on the training features (not on the predictions)
scores1 = cross_val_score(gb, X = X_train, y = y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("XGBoost Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['XGBoost Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['XGBoost Classifier'] = scores2.mean()*100

In [None]:
gbm = lgb.LGBMClassifier()
gbm.fit(X_train , y_train)
train_preds = gbm.predict(X_train)
test_preds = gbm.predict(X_test)
# evaluate the LightGBM model (gbm), not the XGBoost model (gb)
scores1 = cross_val_score(gbm, X = X_train, y = y_train.ravel(), scoring= 'accuracy', cv=10)
# accuracy of the fitted model on the held-out test set
scores2 = np.array([accuracy_score(y_test, test_preds)])

In [None]:
print("LightGBM Classifier results :")
print("   -   Accuracy on the train set : {:.2f}%".format(scores1.mean()*100))
train_accuracies['lightgbm Classifier'] = scores1.mean()*100
print("   -   Accuracy on the test set : {:.2f}%".format(scores2.mean()*100))
test_accuracies['lightgbm Classifier'] = scores2.mean()*100

In [None]:
train_accuracies

In [None]:
test_accuracies

In [None]:
ind = np.arange(7)
width = 0.2

fig = plt.figure(figsize=(14,8))
ax = fig.add_subplot(111)

rects1 = ax.bar(ind, list(train_accuracies.values()), width, color='b')
rects2 = ax.bar(ind+width, list(test_accuracies.values()), width, color='g')

ax.set_ylabel('Accuracy Score (%)')
ax.set_xticks(ind+width)
ax.set_xticklabels(list(test_accuracies.keys()) )
ax.legend((rects1[0], rects2[0]), ('Train Accuracy', 'Test Accuracy'))

plt.ylim((20,120))
plt.show()


As the bar chart shows, the **boosting algorithms** are the most effective, even though the data are far from linearly separable. Moreover, **LightGBM** seems to be the most promising model in our case, since it is the one that overfits the data the least.
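
One simple way to quantify "overfits the least" is the gap between train and test accuracy for each model. The sketch below uses placeholder numbers for two hypothetical models; they are not results from this notebook, and in practice the `train_accuracies` and `test_accuracies` dictionaries would be used instead.

```python
# Placeholder accuracies (in %) for two hypothetical models, for illustration only
train_acc = {"Model A": 99.0, "Model B": 91.0}
test_acc = {"Model A": 80.0, "Model B": 88.0}

# A large positive gap suggests the model fits the training data much better than unseen data
gaps = {name: train_acc[name] - test_acc[name] for name in train_acc}
least_overfit = min(gaps, key=gaps.get)

print(gaps, least_overfit)
```

With these placeholder values, "Model B" has the smaller gap even though "Model A" has the higher train accuracy.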

In [None]:
y_pred = gbm.predict(X_test)

In [None]:
# the model was trained on the label-encoded 'categories' target, so name the column accordingly
output = pd.DataFrame({'Categories_Prediction': y_pred})

filename = 'Predictions.csv'

output.to_csv(filename,index=False)

print('Saved file: ' + filename)
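
Since the target was label-encoded before training, the saved predictions are integer codes. They can be mapped back to category names with `LabelEncoder.inverse_transform`; the sketch below uses a separate toy encoder (`le_demo`) and made-up codes, whereas in the notebook the fitted `le` would play this role.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy encoder fitted on a few category names (illustrative)
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["Snacks", "Other", "Dairies", "Snacks"])

# Hypothetical predicted codes; classes are sorted alphabetically when fitting
pred_codes = np.array([2, 0])
pred_labels = le_demo.inverse_transform(pred_codes)

print(list(pred_labels))
```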