<H1>Analysis and prediction of calories in food</H1>

In this project we use **The Nutritional Content of Food** dataset, a database of nutritional values for thousands of different foods. It includes information on calories, vitamins, minerals, and more. With this dataset, you can learn which nutrients your food contains and in what amounts.



  <H2>In this project we perform:</H2>
 <H3>1-Data Wrangling<p><p>
 2-Data Visualization and Summary<p><p>
 3-Classification (Low Medium High) of contents in every type of food<p><p>
 4-Correlation between food contents<p><p>
 5-Estimation of Calories in food type knowing its contents</H3>

<H1>1-Data Wrangling</H1>
The information in the columns GmWt_1, GmWt_Desc1, GmWt_2 and GmWt_Desc2 is not clear, and the column Refuse_Pct is not useful for our analysis. There are also many missing values, but after dropping the affected rows the sample is still large enough.
So we drop those columns and all rows with missing values.
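
The effect of this clean-up can be tried on a small scale first. The sketch below uses a made-up three-row frame (the values and the reduced column set are illustrative assumptions, not taken from ABBREV.csv) to count the gaps per column and to show how many rows survive once the unused columns and the incomplete rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ABBREV.csv (values are invented for illustration)
toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER", "CHEESE", "MILK"],
    "Water_(g)": [15.9, np.nan, 88.1],
    "Energ_Kcal": [717, 353, 61],
    "GmWt_1": [5.0, np.nan, 244.0],
    "Refuse_Pct": [0.0, 0.0, np.nan],
})

# Number of missing values per column, as a single Series
missing_per_column = toy.isnull().sum()

# Drop the columns we do not analyse, then every row that still has a gap
toy_clean = toy.drop(columns=["GmWt_1", "Refuse_Pct"]).dropna()

# Share of rows kept after cleaning
kept_fraction = len(toy_clean) / len(toy)
print(missing_per_column)
print(f"rows kept: {len(toy_clean)} of {len(toy)} ({kept_fraction:.0%})")
```

Dropping the unused columns before calling `dropna()` matters: a gap in a column we never use would otherwise remove a row that is complete for our purposes.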

In [1]:
import pandas as pd

nut = pd.read_csv('/kaggle/input/the-nutritional-content-of-food-a-comprehensive/ABBREV.csv')

nut

In [None]:
nut.columns

In [None]:
missing_data = nut.isnull()
missing_data.head()

In [None]:
for column in missing_data.columns.values.tolist():
    print(column,missing_data[column].value_counts(),"")
  

In [None]:
nut_1 = nut.drop(['GmWt_1','GmWt_Desc1','GmWt_2','GmWt_Desc2','Refuse_Pct'],axis = 1)
nut_1 = nut_1.dropna()
nut_1

<H3>For now, we exclude all foods with missing data as well as the features we do not use</H3>

In [None]:
missing_data_1 = nut_1.isnull()
missing_data_1.head()
for column in missing_data_1.columns.values.tolist():
    print(column,missing_data_1[column].value_counts(),"")


<H1>2-Data Visualization and Summary</H1>

We compute summary statistics and draw a Plotly figure with one box plot per nutrient, showing how each nutrient is distributed across food types.
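
To read such box plots it helps to know the numbers behind them. The sketch below is a minimal illustration on invented values, assuming the common 1.5 × IQR rule for the whiskers; it computes the quartiles, the interquartile range and the points that would be drawn as outliers.

```python
import pandas as pd

# Toy nutrient column (invented values) with one obvious outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="Sodium_(mg)")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, the usual box-plot rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones shown as individual markers
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers.tolist()}")
```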

In [None]:
columns = nut_1.drop(['index','NDB_No','Shrt_Desc','Energ_Kcal'],axis = 1)

In [None]:
contents_summary = columns.describe()
contents_summary

In [None]:
import pandas as pd
import plotly.graph_objects as go


# Create box plots using Plotly
fig = go.Figure()

for column in columns:
    fig.add_trace(go.Box(y=columns[column], name=column))

fig.update_layout(
    title="Distribution of food types across each Nutrient",
    yaxis_title="Values (each in its unit)",
    xaxis_title="Nutrients"
)

fig.show()

<H1>3-Classification (Low Medium High) of contents in every type of food</H1>

Sophisticated classification models exist, but a naive classification is sometimes useful.

For a general audience, we can classify the contents of each type of food into three categories: low, medium and high.

The idea is based on quartiles, i.e., we consider that a quantity:

1. Lower than the 1st quartile is Low

2. Between the 1st and 3rd quartile (inclusive) is Medium

3. Higher than the 3rd quartile is High
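
The same rule can also be written without a row-by-row function. The following is a minimal sketch on invented values, not the notebook's own implementation: it compares every column with its own quartiles in one step using `numpy.select`.

```python
import numpy as np
import pandas as pd

# Toy nutrient table (invented values)
data = pd.DataFrame({
    "Protein_(g)": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Lipid_Tot_(g)": [10.0, 0.0, 5.0, 20.0, 15.0],
})

q1 = data.quantile(0.25)  # first quartile of each column
q3 = data.quantile(0.75)  # third quartile of each column

# Boolean masks: below Q1 -> "Low", above Q3 -> "High", everything else "Medium"
is_low = data.lt(q1, axis=1).to_numpy()
is_high = data.gt(q3, axis=1).to_numpy()
labels = np.select([is_low, is_high], ["Low", "High"], default="Medium")

classified = pd.DataFrame(labels, index=data.index, columns=data.columns)
print(classified)
```

On a table with thousands of rows and dozens of columns, the column-wise comparison avoids calling a Python function once per row.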


In [None]:
data = columns


# Calculate the quartiles for each variable
quartiles = data.quantile([0.25, 0.5, 0.75])

# Function to classify row values based on quartiles
def classify_row(row, quartiles):
    labels = []
    for column in row.index:
        q1 = quartiles.loc[0.25, column]
        q3 = quartiles.loc[0.75, column]
        if row[column] < q1:
            labels.append("Low")
        elif q1 <= row[column] <= q3:
            labels.append("Medium")
        else:
            labels.append("High")
    return pd.Series(labels, index=row.index)

# Classify rows for each column
classifications = data.apply(classify_row, axis=1, args=(quartiles,))

# The labels are kept in the separate DataFrame `classifications`.
# It has the same column names as `data`, so no columns are added to `data` here.




In [None]:
quartiles

In [None]:
classifications.columns

In [None]:
classifications

In [None]:
classifications = pd.merge(nut_1['Shrt_Desc'],classifications,left_index=True, right_index=True)

In [None]:
classifications.tail()

<H4>Use this function to list all types of food that fall in one of the three categories for a given nutrient</H4>

In [None]:
def food_types_cat (nutrient,cat):
    z = pd.DataFrame(classifications[['Shrt_Desc',nutrient]][classifications[nutrient] == cat])
    return z

In [None]:
food_types_cat('Cholestrl_(mg)','High')

This concludes the classification section. Compare the results with general dietary guidelines for nutrients to see whether this classification fits well.

<H1>4-Correlation between food contents</H1>

There are many ways to show the correlation between nutrients. This simple, informative approach lists, for a selected nutrient, its correlation with every nutrient together with the p-values.

Note: Pearson correlation is used. Relying on the large sample size, we assume a normal distribution; you can check normality yourself if you like.
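
One way to check the normality assumption is sketched below on synthetic data (the variables `x` and `y` are invented, not columns of the dataset). It runs the D'Agostino-Pearson test from SciPy and, as a possible alternative when the data are clearly skewed, compares Pearson's coefficient with the rank-based Spearman coefficient.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, strongly right-skewed variable and a noisy monotone function of it
x = rng.exponential(scale=1.0, size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)

# D'Agostino-Pearson test: a small p-value speaks against normality
_, normality_p = stats.normaltest(x)

# Pearson measures linear association, Spearman rank (monotone) association
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)

print(f"normality p-value: {normality_p:.3g}")
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

Many nutrient columns are zero for most foods and large for a few, so a skewed distribution like the one simulated here is a plausible case to test for.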

In [None]:
from scipy import stats
correlation_list = []

columns = nut_1.drop(['index','NDB_No','Shrt_Desc','Energ_Kcal'],axis = 1)

n_cols = columns.shape[1]  # number of nutrient columns (45 here)
for i in range(n_cols):
    for j in range(n_cols):
        var1 = columns.iloc[:,i]
        var2 = columns.iloc[:,j]
        
        correlation, p_value = stats.pearsonr(var1,var2)
        correlation_list.append((var1.name, var2.name, correlation,p_value))

# Create a new DataFrame to store the correlation results
correlation_df = pd.DataFrame(correlation_list, columns=['Variable 1', 'Variable 2', 'Correlation','p_value'])

# Display the correlation DataFrame
correlation_df[correlation_df['Variable 1'] == 'Water_(g)']

To inspect another nutrient, change the nutrient name ('Water_(g)') in the last line of the code.
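
The lookup can also be wrapped in a small helper so the nutrient name becomes an argument. This is a sketch under assumptions: `correlations_with` is a hypothetical name, the table is invented, and it uses `DataFrame.corr`, which returns the coefficients only (no p-values).

```python
import pandas as pd

def correlations_with(frame: pd.DataFrame, nutrient: str) -> pd.Series:
    """Pearson correlation of every other column with `nutrient`, strongest first."""
    corr = frame.corr(method="pearson")[nutrient].drop(nutrient)
    # Sort by absolute value so strong negative correlations rank high too
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy table (invented values): water falls as fat rises, protein is weakly related
toy = pd.DataFrame({
    "Water_(g)": [90.0, 70.0, 50.0, 30.0, 10.0],
    "Lipid_Tot_(g)": [1.0, 10.0, 20.0, 30.0, 40.0],
    "Protein_(g)": [5.0, 1.0, 4.0, 2.0, 3.0],
})

result = correlations_with(toy, "Water_(g)")
print(result)
```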

<H1>5-Estimation of Calories in food type knowing its contents</H1>

In this section we estimate the calories (in kcal) of every type of food from its contents:

* Water
* Protein
* Lipid
* Ash
* Carbohydrate
* Fiber
* Sugar

Many models could be used; here the estimation is done with **XGBRegressor**.
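
A simple reference point for any fitted model is the classical Atwater general factors: roughly 4 kcal per gram of protein, 9 kcal per gram of fat and 4 kcal per gram of carbohydrate. The sketch below is not part of the original analysis; the function name `atwater_kcal` and the example rows are illustrative assumptions.

```python
import pandas as pd

# Atwater general factors in kcal per gram (approximate, food-independent)
KCAL_PER_G = {"Protein_(g)": 4.0, "Lipid_Tot_(g)": 9.0, "Carbohydrt_(g)": 4.0}

def atwater_kcal(frame: pd.DataFrame) -> pd.Series:
    """Rough energy estimate per row from protein, fat and carbohydrate."""
    return sum(frame[col] * factor for col, factor in KCAL_PER_G.items())

# Invented example rows, amounts per 100 g of food
toy = pd.DataFrame({
    "Protein_(g)": [1.0, 25.0],
    "Lipid_Tot_(g)": [81.0, 0.0],
    "Carbohydrt_(g)": [0.0, 10.0],
})

baseline = atwater_kcal(toy)
print(baseline.tolist())
```

If the model's mean absolute error is not clearly below that of such a fixed formula, the extra complexity is hard to justify.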

In [None]:
nut_measure = nut[['Shrt_Desc', 'Water_(g)', 'Energ_Kcal',
       'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)',
       'Fiber_TD_(g)', 'Sugar_Tot_(g)']]
nut_measure

In [None]:
missing_data_5 = nut_measure.isnull()
missing_data_5.head()
for column in missing_data_5.columns.values.tolist():
    print(column)
    print (missing_data_5[column].value_counts())
    print("")

In [None]:
nut_measure = nut_measure.dropna()

In [None]:
nut_measure

Here X holds the predictors and y the target.

In [None]:
y = nut_measure['Energ_Kcal']
X = nut_measure.drop(columns=['Energ_Kcal','Shrt_Desc'])

The required libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


Splitting the data

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

This function takes the number of trees in XGBRegressor, the learning rate and the number of early-stopping rounds as input, and returns the fitted model.

In [None]:
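def my_model(n,l,s):
    # n: number of trees, l: learning rate, s: early-stopping rounds
    z= XGBRegressor(n_estimators=n, learning_rate=l)
    # Passing early_stopping_rounds to fit() works on older XGBoost versions;
    # newer releases expect it in the XGBRegressor constructor instead
    z.fit(X_train, y_train, 
             early_stopping_rounds=s, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
    return z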


In [None]:
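m_1 = my_model(1000,0.05,5)
# cross_val_score clones m_1 and refits the clone on each fold,
# so the scores do not reuse the fit (or the early stopping) done in my_model
scores = -1 * cross_val_score(m_1, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')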

print("Average MAE score:", scores.mean())

You can change the parameters and check the resulting MAE yourself; maybe you will find a better model.


<H3>This concludes the project</H3>