## Libraries Used

- `pandas`: Used for data manipulation and analysis.
- `tensorflow`: A powerful library for building and training machine learning models, especially neural networks.
- `scikit-learn`: A popular library for traditional machine learning algorithms, useful for preprocessing, model selection, and more.
- `re`: Python's built-in regular expression module, used for pattern matching.


In [45]:
import pandas as pd
import tensorflow as tf
import sklearn
import re  

print("All packages are imported successfully!")


All packages are imported successfully!


# Note on Independent Code Execution in Jupyter Notebook
One reason I ran into an issue running `pd.read_csv()` in a cell below is that Jupyter Notebook runs cells one at a time. All cells share the same kernel session, but a variable, import, or function defined in one cell is only available to other cells after that cell has actually been executed in the current session.

To make sure a later cell recognizes `pd`, I need to run the cell containing `import pandas as pd` first (click the "Run" button for that cell). Jupyter then registers `pd` in the session, which makes it available to every cell run afterwards.
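
A minimal sketch of that behaviour, simulating the notebook session with a plain dictionary and `exec`. The names `kernel_namespace`, `cell_import`, and `cell_use` are made up for illustration, and `math` stands in for `pandas` so the snippet runs without extra packages. A real Jupyter kernel is more involved than this.

```python
# Hypothetical stand-in for the kernel's shared session state
kernel_namespace = {}

# Two "cells", stored as source strings
cell_import = "import math"
cell_use = "result = math.sqrt(16)"

# Running the second cell first fails: `math` has not been registered yet
try:
    exec(cell_use, kernel_namespace)
    first_attempt = "ok"
except NameError as error:
    first_attempt = f"NameError: {error}"

# Running the import cell first registers `math` in the shared namespace
exec(cell_import, kernel_namespace)
exec(cell_use, kernel_namespace)
second_attempt = kernel_namespace["result"]

print(first_attempt)
print(second_attempt)
```

The order in which cells are run matters, not the order in which they appear on the page.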

In [46]:
# Load the dataset into `data` so it can be used later
# pd.read_csv is a pandas function that reads a CSV file into a DataFrame
data = pd.read_csv('nutrition.csv')
# Print the first few rows to check what the data looks like
print(data.head())

   Unnamed: 0             name serving_size  calories total_fat saturated_fat  \
0           0       Cornstarch        100 g       381      0.1g           NaN   
1           1     Nuts, pecans        100 g       691       72g          6.2g   
2           2    Eggplant, raw        100 g        25      0.2g           NaN   
3           3   Teff, uncooked        100 g       367      2.4g          0.4g   
4           4  Sherbet, orange        100 g       144        2g          1.2g   

  cholesterol    sodium  choline     folate  ...      fat  \
0           0   9.00 mg   0.4 mg   0.00 mcg  ...   0.05 g   
1           0   0.00 mg  40.5 mg  22.00 mcg  ...  71.97 g   
2           0   2.00 mg   6.9 mg  22.00 mcg  ...   0.18 g   
3           0  12.00 mg  13.1 mg          0  ...   2.38 g   
4         1mg  46.00 mg   7.7 mg   4.00 mcg  ...   2.00 g   

  saturated_fatty_acids monounsaturated_fatty_acids  \
0               0.009 g                     0.016 g   
1               6.180 g             

In [47]:
# Count the missing values in each column
missing_values = data.isnull().sum()
# Display only the columns that have missing values
missing_values = missing_values[missing_values > 0]
print(missing_values)

saturated_fat    1590
dtype: int64


# What Each Line Does:
`data.isnull()`: Creates a DataFrame of the same size as `data`, where each cell contains `True` if the corresponding cell in `data` is missing, and `False` otherwise.

`.sum()`: Adds up all `True` values (counting them as 1) to get the total number of missing values per column.

`missing_values[missing_values > 0]`: Filters out columns with no missing values, so we only see the columns that require attention.
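
The three steps can be traced on a tiny frame. This is only an illustrative sketch: the `toy` DataFrame and its values are invented for the example and are not the real nutrition data.

```python
import numpy as np
import pandas as pd

# Invented three-row frame with two gaps in one column
toy = pd.DataFrame({
    "calories": [381, 691, 25],
    "saturated_fat": ["0.1g", np.nan, np.nan],
})

# Step 1: True wherever a cell is missing
is_missing = toy.isnull()

# Step 2: True counts as 1, so summing gives missing cells per column
missing_per_column = is_missing.sum()

# Step 3: keep only the columns that actually have gaps
only_missing = missing_per_column[missing_per_column > 0]

print(is_missing)
print(missing_per_column)
print(only_missing)
```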

In [48]:
# Make sure the column is read as a string so the 'g' unit can be stripped out later
data['saturated_fat'] = data['saturated_fat'].astype(str)
# Remove the "g" from each entry in 'saturated_fat' and convert the result to numeric
data['saturated_fat'] = data['saturated_fat'].str.replace('g', '', regex=True)
data['saturated_fat'] = pd.to_numeric(data['saturated_fat'], errors='coerce')
# Calculate the column mean (NaN values are ignored) to fill in the missing saturated fat values
mean_saturated_fat = data['saturated_fat'].mean()

# Fill missing values in 'saturated_fat' with the calculated mean
data['saturated_fat'] = data['saturated_fat'].fillna(mean_saturated_fat)  # Direct assignment

# Verify that no missing values are left in the 'saturated_fat' column
print(data['saturated_fat'].isnull().sum())

0


What I did above was clean the CSV so I can manipulate it later.

Specifically...

- I had to strip the `g` from the saturated fat values before I could handle the null values (first I cast the column to `str`)
- Then I replaced `g` with an empty string using regex
- Then I converted the column to a numeric type (`float64`)
- Then I calculated the average of all the filled-in saturated fat values
- Then I replaced all the blank values with that mean
- I did this because filling blanks with the mean leaves the column's average unchanged, so later reports are not skewed (it does shrink the column's spread, though)
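
The effect of mean imputation can be checked on a handful of values. This is a small sketch under stated assumptions: the five-entry `raw` Series is invented, and a literal (non-regex) replace is used here since the `g` suffix needs no pattern.

```python
import pandas as pd

# Invented sample: three real readings and two blanks
raw = pd.Series(["6.2g", None, "0.4g", "1.2g", None])

# Strip the unit and convert to numbers; blanks become NaN
numeric = pd.to_numeric(raw.str.replace("g", "", regex=False), errors="coerce")

# .mean() skips NaN, so it averages only the three real readings
column_mean = numeric.mean()
filled = numeric.fillna(column_mean)

print("mean before fill:", column_mean)
print("mean after fill: ", filled.mean())
print("std before fill: ", numeric.std())
print("std after fill:  ", filled.std())
```

The mean is the same before and after the fill, while the standard deviation drops, because every filled value sits exactly on the mean.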

Doing some EDA (Exploratory Data Analysis):

Identify trends, patterns, or correlations in nutritional values.
Determine which columns are the most significant for predicting or analyzing food nutrition.

In [None]:
#Ensuring we use the cleaned data set now
cleaned_data = pd.read_csv("nutrition_cleaned.csv")
#just seeing some initial columns and few rows to get a lay of the land
cleaned_data.head()



Unnamed: 0.1,Unnamed: 0,name,serving_size,calories,total_fat,saturated_fat,cholesterol,sodium,choline,folate,...,fat,saturated_fatty_acids,monounsaturated_fatty_acids,polyunsaturated_fatty_acids,fatty_acids_total_trans,alcohol,ash,caffeine,theobromine,water
0,0,Cornstarch,100 g,381,0.1g,4.192791,0,9.00 mg,0.4 mg,0.00 mcg,...,0.05 g,0.009 g,0.016 g,0.025 g,0.00 mg,0.0 g,0.09 g,0.00 mg,0.00 mg,8.32 g
1,1,"Nuts, pecans",100 g,691,72g,6.2,0,0.00 mg,40.5 mg,22.00 mcg,...,71.97 g,6.180 g,40.801 g,21.614 g,0.00 mg,0.0 g,1.49 g,0.00 mg,0.00 mg,3.52 g
2,2,"Eggplant, raw",100 g,25,0.2g,4.192791,0,2.00 mg,6.9 mg,22.00 mcg,...,0.18 g,0.034 g,0.016 g,0.076 g,0.00 mg,0.0 g,0.66 g,0.00 mg,0.00 mg,92.30 g
3,3,"Teff, uncooked",100 g,367,2.4g,0.4,0,12.00 mg,13.1 mg,0,...,2.38 g,0.449 g,0.589 g,1.071 g,0,0,2.37 g,0,0,8.82 g
4,4,"Sherbet, orange",100 g,144,2g,1.2,1mg,46.00 mg,7.7 mg,4.00 mcg,...,2.00 g,1.160 g,0.530 g,0.080 g,1.00 mg,0.0 g,0.40 g,0.00 mg,0.00 mg,66.10 g


In [None]:
#Tells me the dimensions of the csv (rows & columns)
cleaned_data.shape


(8789, 77)

In [None]:
# Shows the column names and data types
cleaned_data.info()  # (object Dtype means text!)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8789 entries, 0 to 8788
Data columns (total 77 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   8789 non-null   int64  
 1   name                         8789 non-null   object 
 2   serving_size                 8789 non-null   object 
 3   calories                     8789 non-null   int64  
 4   total_fat                    8789 non-null   object 
 5   saturated_fat                8789 non-null   float64
 6   cholesterol                  8789 non-null   object 
 7   sodium                       8789 non-null   object 
 8   choline                      8789 non-null   object 
 9   folate                       8789 non-null   object 
 10  folic_acid                   8789 non-null   object 
 11  niacin                       8789 non-null   object 
 12  pantothenic_acid             8789 non-null   object 
 13  riboflavin        

In [None]:
#gives me a summary of the statistics
cleaned_data.describe()

Unnamed: 0.1,Unnamed: 0,calories,saturated_fat,lucopene
count,8789.0,8789.0,8789.0,8789.0
mean,4394.0,226.283878,4.192791,0.0
std,2537.310091,169.862001,6.22387,0.0
min,0.0,0.0,0.1,0.0
25%,2197.0,91.0,1.0,0.0
50%,4394.0,191.0,3.2,0.0
75%,6591.0,337.0,4.2,0.0
max,8788.0,902.0,96.0,0.0


In [None]:
import re
import pandas as pd

def clean_units(dataframe):
    # Iterate through each column in the DataFrame
    for column in dataframe.columns:
        # Grab the first non-blank value in the current column as a sample.
        # This makes the unit check O(number of columns): we inspect one value per
        # column to see whether it ends in a unit, rather than testing every cell.
        sample_value = dataframe[column].dropna().values[0]
        # The regex has two capture groups, each wrapped in its own ():
        #   group 1, (\d*\.?\d*): the numeric part. \d* is zero or more digits,
        #       \.? is an optional decimal point, and the second \d* is any digits after it.
        #   group 2, ([a-zA-Z]+)$: the unit. It must be one unbroken run of letters
        #       at the very end of the string, with nothing (not even a space) after it.
        # \s* between the groups allows optional whitespace, as in "9.00 mg".
        # re.search returns None when group 2 cannot match. Group 1 may match an empty
        # string because every part of it is optional, so plain text also matches.
        # To make a whole group optional, add a ? right after its closing parenthesis.
        match = re.search(r"(\d*\.?\d*)\s*([a-zA-Z]+)$", str(sample_value))

        # If there is a match (not None), go on to clean the column's data
        if match:
            # Grab the unit
            unit = match.group(2)

            # Remove the unit from each value in the column and convert it to numeric.
            # .str.extract applies the regex to the whole column in one call (cell by cell
            # under the hood). With a single capture group it returns a DataFrame with one
            # column, so [0] selects that column: the numeric part of every value.
            dataframe[column] = dataframe[column].str.extract(r"(\d*\.?\d*)")[0]
            dataframe[column] = pd.to_numeric(dataframe[column], errors='coerce')

            # Rename the column with the unit in parentheses. This happens after the
            # values are cleaned, because the old column name no longer exists afterwards.
            dataframe.rename(columns={column: f"{column} ({unit})"}, inplace=True)

    # Save the modified dataset
    dataframe.to_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition_cleaned.csv", index=False)
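
To see what the two capture groups return, the pattern can be run on a few sample strings taken from the `data.head()` output. This is an illustrative sketch: `split_value`, `LOOSE`, and `STRICT` are hypothetical names, and the stricter pattern is the one used in the next cell, shown here for comparison.

```python
import re

# Pattern from the cell above: every part of group 1 is optional
LOOSE = r"(\d*\.?\d*)\s*([a-zA-Z]+)$"
# Pattern from the next cell: group 1 must end in at least one digit
STRICT = r"(\d*\.?\d+)\s*([a-zA-Zμ]+)$"

def split_value(pattern, text):
    """Return (number, unit) when the pattern matches, otherwise None."""
    match = re.search(pattern, text)
    return match.groups() if match else None

samples = ["9.00 mg", "0.1g", "22.00 mcg", "0", "Cornstarch"]
for text in samples:
    loose = split_value(LOOSE, text)
    strict = split_value(STRICT, text)
    print(f"{text!r:>14}  loose={loose}  strict={strict}")
```

With the loose pattern, a food name such as `Cornstarch` matches with an empty number and the whole name as its "unit". The strict pattern rejects it because it requires a digit.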




In [None]:
import re
import pandas as pd

def clean_units(dataframe):
    columns_to_rename = {}
    
    for column in dataframe.columns:
        sample_value = dataframe[column].dropna().values[0]
        # Skip columns without numeric values (like names or descriptions)
        if not re.search(r'\d', str(sample_value)):
            continue

        # Check for numeric part followed by unit
        match = re.search(r"(\d*\.?\d+)\s*([a-zA-Zμ]+)$", str(sample_value))
        
        if match:
            unit = match.group(2)
            columns_to_rename[column] = f"{column} ({unit})"
            
            # Extract numeric part and convert to numeric data type
            dataframe[column] = (
                dataframe[column]
                .astype(str)
                .str.extract(r"(\d*\.?\d+)", expand=False)
            )
            dataframe[column] = pd.to_numeric(dataframe[column], errors='coerce')
    
    # Rename columns with units appended
    dataframe.rename(columns=columns_to_rename, inplace=True)

    # Save the cleaned DataFrame
    dataframe.to_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition_cleaned.csv", index=False)

# Load the data
data = pd.read_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition.csv", index_col=0)

# Remove any leading/trailing spaces in column names
data.columns = data.columns.str.strip()

# Apply the cleaning function
clean_units(data)


In [None]:
cleaned_data.describe()

Unnamed: 0.1,Unnamed: 0,calories,saturated_fat,lucopene
count,8789.0,8789.0,8789.0,8789.0
mean,4394.0,226.283878,4.192791,0.0
std,2537.310091,169.862001,6.22387,0.0
min,0.0,0.0,0.1,0.0
25%,2197.0,91.0,1.0,0.0
50%,4394.0,191.0,3.2,0.0
75%,6591.0,337.0,4.2,0.0
max,8788.0,902.0,96.0,0.0


In [None]:
import re
import pandas as pd

def enforce_numeric(dataframe):
    columns_to_rename = {}

    # Iterate through each column
    for column in dataframe.columns:
        # Take the first non-null value as a sample
        sample_value = dataframe[column].dropna().values[0]
        
        # If no digits are found in the sample, likely it's non-numeric, so skip it
        if not re.search(r'\d', str(sample_value)):
            continue

        # Check if the column has a unit with regex and strip units
        match = re.search(r"(\d*\.?\d+)\s*([a-zA-Zμ]+)$", str(sample_value))
        if match:
            unit = match.group(2)
            columns_to_rename[column] = f"{column} ({unit})"
            
            # Remove units from values
            dataframe[column] = dataframe[column].astype(str).str.extract(r"(\d*\.?\d+)", expand=False)
            dataframe[column] = pd.to_numeric(dataframe[column], errors='coerce')

    # Rename columns to include units
    dataframe.rename(columns=columns_to_rename, inplace=True)

    # Return the cleaned DataFrame
    return dataframe

# Load and clean the data
data = pd.read_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition.csv", index_col=0)
data.columns = data.columns.str.strip()  # Strip whitespace from column names
cleaned_data = enforce_numeric(data)

# Verify non-numeric columns after enforcing numeric conversion
non_numeric_cols = cleaned_data.select_dtypes(exclude=['float64', 'int64']).columns
print("Non-numeric columns after cleaning:", non_numeric_cols)

# Save the cleaned data
cleaned_data.to_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition_cleaned.csv", index=False)


Non-numeric columns after cleaning: Index(['name', 'cholesterol', 'hydroxyproline', 'fructose', 'galactose',
       'glucose', 'lactose', 'maltose', 'sucrose'],
      dtype='object')


In [None]:
non_numeric_cols = cleaned_data.select_dtypes(exclude=['float64', 'int64']).columns
print("Non-numeric columns:", non_numeric_cols)


Non-numeric columns: Index(['name', 'cholesterol', 'hydroxyproline', 'fructose', 'galactose',
       'glucose', 'lactose', 'maltose', 'sucrose'],
      dtype='object')
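
A likely reason `cholesterol` stays as text: its first non-null value is `0` with no unit (see the `data.head()` output), so the single-sample check finds nothing to strip and skips the column. The sketch below illustrates that on an invented five-value Series. `detect_unit` is a hypothetical helper that scans every value instead of only the first; whether the other listed columns fail for the same reason is an assumption this example does not verify.

```python
import re
import pandas as pd

UNIT_PATTERN = re.compile(r"(\d*\.?\d+)\s*([a-zA-Zμ]+)$")

def detect_unit(series):
    """Return the first unit found in any non-null value, or None (hypothetical helper)."""
    for value in series.dropna().astype(str):
        match = UNIT_PATTERN.search(value)
        if match:
            return match.group(2)
    return None

# Mirrors the first five cholesterol values shown in data.head()
cholesterol = pd.Series(["0", "0", "0", "0", "1mg"])

# The single-sample check only looks at "0", which carries no unit
first_sample = str(cholesterol.dropna().values[0])
first_sample_match = UNIT_PATTERN.search(first_sample)

# Scanning the whole column finds the unit on the fifth value
unit = detect_unit(cholesterol)

print("first sample matched:", first_sample_match is not None)
print("unit found by scanning:", unit)
```

Scanning every value costs more than sampling one, which is the trade-off against the O(number of columns) check used above.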


In [None]:
import re
import pandas as pd

def clean_specific_columns(dataframe, columns):
    for column in columns:
        # Apply regex to extract only numeric parts, removing any trailing units or unexpected characters
        dataframe[column] = dataframe[column].astype(str).str.extract(r"(\d*\.?\d+)", expand=False)
        # Convert to numeric, coercing errors to NaN for consistency
        dataframe[column] = pd.to_numeric(dataframe[column], errors='coerce')
        print(f"Column '{column}' cleaned.")

    return dataframe

# Load the data
data = pd.read_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition_cleaned.csv")

# List of columns that are still non-numeric
non_numeric_columns = ['cholesterol', 'hydroxyproline', 'fructose', 'galactose', 
                       'glucose', 'lactose', 'maltose', 'sucrose']

# Apply targeted cleaning
cleaned_data = clean_specific_columns(data, non_numeric_columns)

# Verify if there are still any non-numeric columns left
non_numeric_cols_after = cleaned_data.select_dtypes(exclude=['float64', 'int64']).columns
print("Non-numeric columns after targeted cleaning:", non_numeric_cols_after)

# Save the cleaned data
cleaned_data.to_csv(r"C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\nutrition_cleaned.csv", index=False)


Column 'cholesterol' cleaned.
Column 'hydroxyproline' cleaned.
Column 'fructose' cleaned.
Column 'galactose' cleaned.
Column 'glucose' cleaned.
Column 'lactose' cleaned.
Column 'maltose' cleaned.
Column 'sucrose' cleaned.
Non-numeric columns after targeted cleaning: Index(['name'], dtype='object')


### DOING EDA NOW 11/2/2024

In [None]:
#basic stats on all columns
print(cleaned_data.describe())

       serving_size (g)     calories  total_fat (g)  saturated_fat (g)  \
count            8789.0  8789.000000    8789.000000        7199.000000   
mean              100.0   226.283878      10.556855           4.192791   
std                 0.0   169.862001      15.818247           6.877009   
min               100.0     0.000000       0.000000           0.100000   
25%               100.0    91.000000       1.000000           0.700000   
50%               100.0   191.000000       5.100000           2.200000   
75%               100.0   337.000000      14.000000           5.000000   
max               100.0   902.000000     100.000000          96.000000   

       cholesterol   sodium (mg)  choline (mg)  folate (mcg)  \
count  8789.000000   8789.000000   8789.000000   8789.000000   
mean     38.723063    306.353851     23.681249     44.085561   
std     117.358944    939.220468     51.332265    127.670410   
min       0.000000      0.000000      0.000000      0.000000   
25%       0.0

Making histograms to check the frequency of some important columns and boxplots to see any outliers

In [None]:
!pip install matplotlib


Collecting matplotlib
  Downloading matplotlib-3.9.2-cp310-cp310-win_amd64.whl (7.8 MB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.3.0-cp310-cp310-win_amd64.whl (216 kB)
Collecting pyparsing>=2.3.1
  Downloading pyparsing-3.2.0-py3-none-any.whl (106 kB)
Collecting cycler>=0.10
  Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Collecting pillow>=8
  Downloading pillow-11.0.0-cp310-cp310-win_amd64.whl (2.6 MB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.54.1-cp310-cp310-win_amd64.whl (2.2 MB)
Collecting kiwisolver>=1.3.1
  Downloading kiwisolver-1.4.7-cp310-cp310-win_amd64.whl (55 kB)
Installing collected packages: pyparsing, pillow, kiwisolver, fonttools, cycler, contourpy, matplotlib
Successfully installed contourpy-1.3.0 cycler-0.12.1 fonttools-4.54.1 kiwisolver-1.4.7 matplotlib-3.9.2 pillow-11.0.0 pyparsing-3.2.0


You should consider upgrading via the 'C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\.venv\Scripts\python.exe -m pip install --upgrade pip' command.


In [None]:
import matplotlib.pyplot as plt

# Columns to analyze
columns_to_analyze = ['calories', 'protein (g)', 'fat (g)', 'sodium (mg)', 'sugars (g)' ]
# Make a histogram to show the distribution of values in each column
for column in columns_to_analyze:
    plt.figure(figsize=(8,5))
    plt.hist(cleaned_data[column].dropna(), bins=30, edgecolor = 'black')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()
# Make a box plot to show the distribution, quartiles, and outliers
for column in columns_to_analyze:
    plt.figure(figsize=(8, 5))
    plt.boxplot(cleaned_data[column].dropna(), vert=False)
    plt.title(f'Box Plot of {column}')
    plt.xlabel(column)
    plt.show()
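
The points a box plot draws as outliers can also be listed numerically. With matplotlib's default settings, the whiskers reach out to 1.5 times the interquartile range (IQR) beyond the quartiles, and anything further is drawn as an individual point. Below is a rough sketch of that rule on invented numbers; `values` is not a real column from the dataset.

```python
import pandas as pd

# Invented sample with one obviously extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR rule: anything outside these bounds counts as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("bounds:", lower_bound, upper_bound)
print("outliers:", outliers.tolist())
```

Swapping `values` for something like `cleaned_data['sodium (mg)'].dropna()` would list the foods behind the points on the plot.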

In [None]:
!pip install seaborn

Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


You should consider upgrading via the 'C:\Users\liban\OneDrive\Documents\Bac\projects\foodApp\.venv\Scripts\python.exe -m pip install --upgrade pip' command.


In [None]:
import seaborn as sns

# Select only the numeric columns for the analysis (the first column is the food's name)
numeric_data = cleaned_data.select_dtypes(include=['float64', 'int64'])
# Build the correlation matrix: .corr() computes the pairwise Pearson correlation between columns
correlation_matrix = numeric_data.corr()
# Plot a heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='cool', fmt='.2f')
plt.title('Correlation matrix of nutrients')
plt.show()
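
Each cell of the heatmap is one Pearson correlation coefficient. As a sanity check, the sketch below computes one by hand and compares it with what `.corr()` returns. The five rows reuse the `fat` and `calories` values shown in the `data.head()` output; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# First five foods from the head() output: fat in grams and calories
toy = pd.DataFrame({
    "fat (g)": [0.05, 71.97, 0.18, 2.38, 2.00],
    "calories": [381, 691, 25, 367, 144],
})

x = toy["fat (g)"].to_numpy()
y = toy["calories"].to_numpy()

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
x_centered = x - x.mean()
y_centered = y - y.mean()
manual_r = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same number, read out of the pandas correlation matrix
pandas_r = toy.corr().loc["fat (g)", "calories"]

print("manual r:", manual_r)
print("pandas r:", pandas_r)
```

A value near 1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.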

In [None]:
# Top 10 foods by calories
top_calories = cleaned_data[['name', 'calories']].dropna().sort_values('calories', ascending=False).head(10)
print("Top 10 foods by calories:")
print(top_calories)


# Bottom 10 foods by calories
bottom_calories = cleaned_data[['name', 'calories']].dropna().sort_values('calories', ascending=True).head(10)
print("\nBottom 10 foods by calories:")
print(bottom_calories)

Top 10 foods by calories:
                                                   name  calories
2253             Fish oil, fully hydrogenated, menhaden       902
422                                    Fish oil, salmon       902
355                                   Fish oil, sardine       902
256                                 Fish oil, cod liver       902
293                                  Fish oil, menhaden       902
318                                  Fat, mutton tallow       902
430                                    Fat, beef tallow       902
356                                   Fish oil, herring       902
676                                                Lard       902
5621  Oil, all purpose salad or cooking, industrial ...       900

Bottom 10 foods by calories:
                                                   name  calories
8558  Beverages, mineral bottled water, naturally sp...         0
2435              Beverages, ready to drink, black, tea         0
8053  Beverages, for

About to start the machine learning

In [None]:
# Make sure no numeric column has missing data by filling gaps with that column's mean
# This logic only runs on columns with numeric values
numeric_columns = cleaned_data.select_dtypes(include=['float64', 'int64']).columns
cleaned_data[numeric_columns] = cleaned_data[numeric_columns].fillna(cleaned_data[numeric_columns].mean())

Doing label encoding so the computer can translate abstract values into digestible information: it assigns a number to each unique food name.

In [51]:
from sklearn.preprocessing import LabelEncoder

# What we're doing below is mapping each food name to an integer
label_encoder = LabelEncoder()
# fit_transform does two things in one call:
# fit finds all unique values in the column, transform replaces each value with its integer code
cleaned_data['name_encoded'] = label_encoder.fit_transform(cleaned_data['name'])
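
A small sketch of what the encoder produces, using four food names (three of them from this dataset, with one repeated on purpose). The list `names` is invented for the illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Four entries, with "Cornstarch" appearing twice
names = ["Nuts, pecans", "Cornstarch", "Lard", "Cornstarch"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the unique names in sorted order; a name's position is its code
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original names
print(list(encoder.inverse_transform(codes)))
```

The codes follow alphabetical order, so they are identifiers rather than quantities. The modelling cell below drops `name_encoded` from the features.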

Doing some machine learning to predict the calories based on the nutrition values

In [54]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

x_nutrients = cleaned_data.drop(columns=['name', 'name_encoded', 'calories'], errors='ignore')
y_calories = cleaned_data['calories']  # This is our target (what we predict), so it is not included as a feature

# Split the data into training and test sets

# The two train sets (X_train & y_train) are 80% of our dataset and are used to train the model
# The two test sets (X_test & y_test) are used to test the model's accuracy
# test_size=0.2 specifies that 20% of the data should be set aside for testing.
# random_state=42 ensures that the split is reproducible. Using 42 (or any fixed number) guarantees that the same split occurs each time we run the code.
X_train, X_test, y_train, y_test = train_test_split(x_nutrients, y_calories, test_size=0.2, random_state=42)

# Display the shapes of the training and test data
# .shape returns the number of rows and columns in each dataset
# This is key to verifying we get our 80/20 split
print("Training set size", X_train.shape)
print("Test set size", X_test.shape)
## Split checked!

# Initialize the linear regression model
# This will learn the relationship between the nutrients and the calories
calorie_model = LinearRegression()

# Train the model on the training data
# It looks at the two inputs (features and target) and learns the patterns that relate them
calorie_model.fit(X_train, y_train)

# Now that the model is trained, use it to make predictions on the test data
y_pred_calories = calorie_model.predict(X_test)

# Evaluate model performance (actual test values vs. what the model predicted)
# MAE: the average absolute difference between actual and predicted calories
mae = mean_absolute_error(y_test, y_pred_calories)
# R2: the share of the variance in calories that the model explains; the closer to 1, the better
r2 = r2_score(y_test, y_pred_calories)

print("Mean Absolute Error", mae)
print("R2 Score:", r2)

Training set size (7031, 74)
Test set size (1758, 74)
Mean Absolute Error 5.019330795293003
R2 Score: 0.9971588436612396
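
To make the two metrics concrete, the sketch below applies their standard definitions with plain NumPy: MAE is the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The four `y_true` and `y_pred` values are invented, not taken from the model above.

```python
import numpy as np

# Invented calorie values: actual vs. predicted
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# MAE: average size of the miss, in the same unit as the target (calories)
mae = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (unexplained variation / total variation around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("R2:", r2)
```

Read this way, the results above say the predictions miss by about 5 calories on average and the model explains roughly 99.7% of the variation in calories across the test set.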
