[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1xbq3-XLmREiBnZBQEweB7HCcW8OgstZu?usp=sharing)

# **SAAI** Overview | Background

The Summer School on Affordable AI (**SAAI**) is a project of the AGYA working group Innovation, in close collaboration with the AGYA working group Health and Society. [The Arab-German Young Academy of Sciences and Humanities (AGYA)](https://agya.info/) is funded by the [German Federal Ministry of Education and Research (BMBF)](https://www.bmbf.de/bmbf/en/home/home_node.html) and various Arab and German cooperation partners.
<img src="https://imgur.com/hMpk6HK.png" width="800">


This particular use case was developed by the Albarqouni Lab at the University of Bonn with our Lebanese partners, namely Dr. Nada El Darra from the Beirut Arab University (BAU). It was part of a project funded by the Arab-German Young Academy (AGYA) and the German Academic Exchange Service (DAAD).

Please treat the data and the exercise as strictly confidential.

# Affordable AI  *(~ 90 min)*


## Description

Welcome to the AffordableAI notebook, an interactive and educational exploration of affordable artificial intelligence. This exercise is designed to provide hands-on experience in working with a real-world dataset about pomegranate fruits (`'raw_dataset.xlsx'`). Throughout the challenges, you'll have the opportunity to manipulate features, apply various machine learning techniques, and gain practical insights into the application of AI concepts.

### Dataset Overview

The provided dataset (`'pomegranate_cleaned.csv'`) is a curated collection of statistics extracted from images and their corresponding features. Your journey in the AffordableAI notebook involves leveraging this dataset to address a series of challenges that cover key aspects of data exploration, feature manipulation, and application of machine learning models.

![img](https://github.com/albarqounilab/EEDA-Autumn-School/raw/main/images/i93Jbhz.png)

The cleaned and complete dataset consists of 564 entries and 25 columns. Each entry corresponds to a sample, and the columns represent different features and attributes associated with pomegranates. Here's a detailed description of the features:

- **no.:** A numerical identifier for each entry.
  
- **receiving_date:** The date when the pomegranates were received.

- **location:** The location where the pomegranates were harvested.

- **harvesting_date:** The date when the pomegranates were harvested.

- **mass_g:** The mass of the pomegranates in grams.

- **volume_ml:** The volume of the pomegranate juice in milliliters.

- **yield_of_juice_ml:** The yield of juice obtained from the pomegranates.

- **length:** The length of the pomegranates.

- **width:** The width of the pomegranates.

- **location_ElJahliye, location_Hasbaya, location_Rachiine:** Boolean flags indicating the location of ElJahliye, Hasbaya, and Rachiine, respectively.

- **day_time_diff:** The time difference associated with the harvesting process.

- **n_days:** The number of days between harvesting and receiving (computed as `receiving_date - harvesting_date`).

- **humidity:** The humidity level during the harvesting process.

- **solar_radiation:** The amount of solar radiation during harvesting.

- **air_temperature:** The air temperature during harvesting.

Consider as target labels:

- **polyphenols_content_mg:** The content of polyphenols in milligrams.

- **polyphenols_concentration_mggae/ml:** The concentration of polyphenols per milliliter.

- **degree_brix:** The degree Brix, a measure of sugar content.

- **ta_av:** Is the average measure associated with acidity.

- **maturity:** Boolean indicating the maturity of the pomegranates.

- **mi:** The maturity index (MI), a parameter associated with ripening.

- **Cluster:** An integer representing the cluster to which the entry belongs.

- **polyphenols_category:** A categorical variable representing the category of polyphenols.

The dataset contains a mix of numerical, boolean, and categorical features, providing a comprehensive set of attributes associated with pomegranates. These features will be explored and utilized in the subsequent challenges to gain insights and build predictive models.

### Challenges

The notebook is structured into sub-challenges, each focusing on a specific aspect of AI exploration and implementation. Here's a brief overview:

1. **Preprocessing and Extraction (Sub-challenge 1):**
   - Objective: Preprocess and choose interesting features from the dataset for further analysis.
   - Actions: Load the dataset, display initial information, curate and select relevant features for exploration.

2. **Feature Exploration and Dimensionality Reduction Visualization (Sub-challenge 2):**
   - Objective: Explore selected features, perform dimensionality reduction using PCA, and visualize the data.
   - Actions: Explore K-Means, visualize feature distributions, apply PCA for dimensionality reduction, and visualize reduced-dimensional data.

3. **Regression and Classification (Sub-challenge 3):**
   - Objective: Implement linear regression, logistic regression, and MLP for continuous and categorical variables.
   - Actions: Split the data, apply regression models for continuous variables, and classification models for categorical variables.

To facilitate your journey in the third sub-challenge, a training file (`'pomegranate_complete_cleaned.csv'`) containing the dataset is provided. This resource is your key to hands-on learning, enabling you to apply AI concepts to real-world scenarios.

Get ready to dive into the world of affordable AI, where you'll not only gain practical skills but also uncover the potential of applying artificial intelligence in a cost-effective manner. Happy exploring!

## Download dataset

In [1]:
!curl -O -L https://github.com/albarqounilab/EEDA-Autumn-School/raw/main/5.%20Affordable_AI/raw_dataset.xlsx
!curl -O -L https://github.com/albarqounilab/EEDA-Autumn-School/raw/main/5.%20Affordable_AI/pomegranate_cleaned.csv
!curl -O -L https://github.com/albarqounilab/EEDA-Autumn-School/raw/main/5.%20Affordable_AI/pomegranate_complete_cleaned.csv
!curl -O -L https://github.com/albarqounilab/SAAI-Summer-School/raw/main/5.%20Affordable_AI/updated_outlier_pomegranate_complete_cleaned.csv


## Sub-challenge 1 : Preprocessing and extraction (Choose interesting features)  *~ 30 min*

**Objective:**

In this exercise, the goal is to preprocess and extract relevant information from the provided dataset, '`raw_dataset.xlsx`'. This involves handling missing values, selectively removing unnecessary columns, and transforming specific columns for better analysis.

**Here you will:**

- **Clean Dataset from Missing Values:**
  - Identify and assess the presence of missing values in the dataset.
  - Implement appropriate strategies to handle missing values, such as replacement or removal.
  - Verify the dataset after handling missing values to ensure data integrity.
  - Use `dropna(axis=1, how='all')` to remove columns with all NaN values.

- **Selectively Remove Unnecessary Columns:**
  - Use the `df.drop(columns)` method to remove specified columns from the DataFrame.
  - Reset the index of the DataFrame using `df.reset_index(drop=True)` to maintain a clean index structure.

- **Handle Maturity Column:**
  - Utilize the code `df['Maturity'] = df['Maturity'] == 'mature'` to transform the 'Maturity' column.
  - Verify changes in the DataFrame to ensure the 'Maturity' column reflects the desired True/False values.

- **Extract information from unstructured data:**
  - Extract width and length values using regular expressions, crucial for information extraction from strings.
  - Unify date formats and extract additional features.

By completing these tasks, you will prepare the dataset for further exploration and analysis in AffordableAI.

In [2]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Load the dataset from an Excel file
file_path = ''  # Replace with your actual file path of the 'raw_dataset'

# read excel file
df = pd.read_excel(file_path)


Use `df.head()` and `df.info()` to understand the structure of the DataFrame. Identify the columns with `df.columns`.

In [None]:
#ToDo: your code goes here

In [None]:
#ToDo: your code goes here

In [None]:
#ToDo: your code goes here

### Exercise 1: Clean dataset from missing values

**Objective:**

In this exercise, the goal is to clean the dataset by handling missing values. This step is crucial in preparing data for analysis or machine learning models.

**Actions:**

- Identify and assess the presence of missing values in the dataset.
- Implement appropriate strategies to handle missing values, such as replacement or removal.
- Verify the dataset after handling missing values to ensure data integrity.

**Your Task**: Check missing values
> **Actions:**
1. Count the number of missing values per column with `df.isnull().sum()`

In [None]:
# Check for missing values
print("\nMissing Values:")
#ToDo: your code goes here

**Your Task**: Drop all rows with a missing value in `No. ` and remove empty columns

> **Actions:**
1. Use the `df.dropna(subset=['No. '])` from [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
2. Remember to reset the index once you update your dataframe `df.reset_index(drop=True)`
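
The two actions above can be tried on a small toy DataFrame first. This is only a sketch: the values are made up, and only the `'No. '` column name (including its trailing space) mirrors the exercise.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; 'No. ' keeps its trailing space on purpose
toy = pd.DataFrame({
    "No. ": [1.0, np.nan, 3.0, np.nan],
    "Mass (g)": [250.0, 300.0, np.nan, 280.0],
})

# Count missing values per column before cleaning
print(toy.isnull().sum())

# Keep only rows where 'No. ' is present; NaNs in other columns are left untouched
toy_clean = toy.dropna(subset=["No. "]).reset_index(drop=True)
print(toy_clean)
```

Note that `dropna(subset=...)` only looks at the listed columns, so the missing mass in the second remaining row survives this step.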

In [None]:
# Drop rows with missing values
# ToDo: your code goes here
# Reset index after dropping rows
# ToDo: your code goes here


**Your Task: Selectively Remove Unnecessary Columns**

In this task, your objective is to selectively remove columns that are deemed unnecessary for further analysis. Specifically, you are instructed to remove the following columns: `'No. ', 'Humidity data', 'Temperature', 'Digital camera', 'Degree Brix'`.

**Actions:**

1. Use the `df.drop(columns)` method to remove the specified columns from the DataFrame ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)).
2. Reset the index of the DataFrame using `df.reset_index(drop=True)` to maintain a clean index structure.

By completing this task, you will streamline the dataset, retaining only the relevant columns for subsequent analysis and ensuring a more focused exploration in AffordableAI.

In [None]:
columns_to_remove =  # Replace with the actual column names

# Remove the specified columns
# ToDo: your code goes here
# Drop rows where the 'column_name' has NaN values (if any)

In [None]:
# Check for missing values

**Your Task: Handle Maturity Column**

By examining the value counts using `df['Maturity'].value_counts()`, it appears that there are some values present. Your objective is to fill the '`Maturity`' column with `True` where 'mature' exists and `False` otherwise.

**Actions:**

1. Utilize the code `df['Maturity'] = df['Maturity'] == 'mature'` to transform the 'Maturity' column.
2. Verify the changes in the DataFrame to ensure the 'Maturity' column reflects the desired True/False values.
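
As a quick illustration of why the comparison also handles missing entries, here is a minimal sketch on made-up values (the toy column content is an assumption, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy column: some rows are labelled 'mature', the rest are missing
toy = pd.DataFrame({"Maturity": ["mature", np.nan, "mature", None]})

# Inspect the raw values, including the missing ones
print(toy["Maturity"].value_counts(dropna=False))

# Element-wise comparison: missing values compare as False
toy["Maturity"] = toy["Maturity"] == "mature"
print(toy)
```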

In [None]:
# Maturity value count
# ToDo: your code goes here

Let's fill the column with `True` where 'mature' exists and `False` otherwise:

In [None]:
# ToDo: your code goes here

We can drop all columns that contain only NaN values, such as `'Unnamed: 3'`, `'Unnamed: 4'` (verify the number automatically assigned to the column) and `'color intensity'`,
with `df.dropna(axis=1, how='all')`:

> - `axis=1`: columns
- `axis=0`: rows
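
A small sketch of the effect, using hypothetical values and column names that merely resemble the ones mentioned above:

```python
import numpy as np
import pandas as pd

# Toy frame: one useful column (with a single gap) and two completely empty ones
toy = pd.DataFrame({
    "Mass (g)": [250.0, np.nan, 280.0],
    "Unnamed: 3": [np.nan, np.nan, np.nan],
    "color intensity": [np.nan, np.nan, np.nan],
})

# how='all' removes a column only if every value in it is NaN
toy_clean = toy.dropna(axis=1, how="all")
print(toy_clean.shape)
```

The partially filled `'Mass (g)'` column is kept, because `how='all'` requires every entry to be missing.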

In [None]:
# ToDo: your code goes here

What is the new shape of the cleaned DataFrame for rows and columns?
>`df.shape`. You may have noticed this can also be seen at the bottom of the DataFrame `print(df)` or `df`.

In [None]:
# ToDo: your code goes here

### Exercise 2: Extract Length and Width Using Regular Expressions

**Objective:**

In this exercise, the goal is to extract width and length values from a specific column in a Pandas DataFrame using regular expressions. This skill is essential when dealing with unstructured data that requires pattern matching for information extraction.

**Actions:**

1. **Define Regex Patterns:**
   - `length_pattern = r'length\D*([\d.]+)'`: Regex pattern for extracting 'Length' values. It looks for the word "length" followed by optional non-digit characters `(\D*)` and captures one or more digits or dots `(([\d.]+))`.
   - `width_pattern = r'width\D*([\d.]+)'`: Regex pattern for extracting 'Width' values. Similar to the 'Length' pattern, it looks for the word "width" followed by optional non-digit characters and captures one or more digits or dots.

2. **Extract and Clean 'Length' Values:**
   - `df['Length'] = df['Dimensions (mm)'].str.extract(length_pattern, flags=re.IGNORECASE)`: Use `str.extract` to apply the 'Length' regex pattern to the 'Dimensions (mm)' column and extract matched values into a new 'Length' column.
   - `df['Length'] = pd.to_numeric(df['Length'], errors='coerce')`: Clean up 'Length' values by converting them to numeric format and handling errors.

3. **Extract and Clean 'Width' Values:**
   - `df['Width'] = df['Unnamed: 11'].str.extract(width_pattern, flags=re.IGNORECASE)`: Use `str.extract` to apply the 'Width' regex pattern to the 'Unnamed: 11' column and extract matched values into a new 'Width' column.
   - `df['Width'] = pd.to_numeric(df['Width'], errors='coerce')`: Clean up 'Width' values by converting them to numeric format and handling errors.

4. **Drop Original Columns:**
   - `df = df.drop(['Dimensions (mm)', 'Unnamed: 11'], axis=1)`: Drop the original 'Dimensions (mm)' and 'Unnamed: 11' columns from the DataFrame.
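
The four steps above can be sketched on a toy DataFrame. The cell contents below are invented to show different spellings; only the column names and regex patterns come from the text. The sketch passes `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame, which is a choice made here for clarity.

```python
import re
import pandas as pd

# Toy data with made-up entries in the two source columns
toy = pd.DataFrame({
    "Dimensions (mm)": ["Length: 82.5", "length = 90", "n/a"],
    "Unnamed: 11": ["Width: 75.1", "WIDTH 80.2 mm", None],
})

length_pattern = r'length\D*([\d.]+)'
width_pattern = r'width\D*([\d.]+)'

# Extract the captured group; rows without a match become NaN
toy["Length"] = toy["Dimensions (mm)"].str.extract(length_pattern, flags=re.IGNORECASE, expand=False)
toy["Width"] = toy["Unnamed: 11"].str.extract(width_pattern, flags=re.IGNORECASE, expand=False)

# Convert the extracted strings to numbers; unparseable entries become NaN
toy["Length"] = pd.to_numeric(toy["Length"], errors="coerce")
toy["Width"] = pd.to_numeric(toy["Width"], errors="coerce")

# Drop the original text columns
toy = toy.drop(["Dimensions (mm)", "Unnamed: 11"], axis=1)
print(toy)
```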

In [None]:
import pandas as pd
import re

# Define a regex pattern for extracting length
length_pattern = # ToDo: your code goes here

# Extract 'Length' values using str.extract
df['Length'] = # ToDo: your code goes here

# Clean up 'Length' values (remove non-numeric characters)
df['Length'] = # ToDo: your code goes here

# Define a regex pattern for extracting width
width_pattern = # ToDo: your code goes here

# Extract 'Width' values using str.extract
df['Width'] = # ToDo: your code goes here

# Clean up 'Width' values (remove non-numeric characters)
df['Width'] = # ToDo: your code goes here

# Drop the original 'Dimensions (mm)' and 'Unnamed: 11' columns
df = # ToDo: your code goes here

In [None]:
# Now, df contains the fixed and renamed columns
df[['Length','Width']]

### Exercise 3: Fix Noise and Blank Spaces in Column Names

**Objective:**

In this exercise, the goal is to clean and standardize column names by removing noise, extra spaces, and ensuring a consistent lowercase format. This step is crucial for maintaining data integrity and facilitating ease of use in subsequent analyses.

**Actions:**

Use a loop to iterate over all columns `[col.strip() for col in df.columns]`


1. Use the `str.strip()` function to remove leading and trailing spaces from column names.
2. Utilize `str.lower()` to convert column names to lowercase for consistency.
3. Apply `str.replace()` to eliminate any noise or unwanted characters in column names, e.g. `str.replace(' ', '_')`, `str.replace('(', '')`, `str.replace(')', '')`, `str.replace('__', '_')`, `str.replace('_____', '_')`.

These actions, when applied to `df.columns`, will result in cleaner, standardized column names, enhancing the overall data quality. Remember to adjust the functions as needed based on the specific noise or issues present in your dataset.

See [docs](https://docs.python.org/3/library/stdtypes.html)
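
To see how the three string methods combine, here is a minimal sketch on an empty toy DataFrame. The noisy column names are hypothetical and only imitate the kind of noise described above.

```python
import pandas as pd

# Empty toy frame; only the (hypothetical) noisy column names matter here
toy = pd.DataFrame(columns=[" Mass (g) ", "Yield of juice  (mL)", "Harvesting date "])

# 1. Remove leading and trailing spaces
toy.columns = [col.strip() for col in toy.columns]

# 2. Convert to lowercase
toy.columns = [col.lower() for col in toy.columns]

# 3. Replace spaces, drop parentheses, then collapse double underscores
toy.columns = [
    col.replace(" ", "_").replace("(", "").replace(")", "").replace("__", "_")
    for col in toy.columns
]

print(list(toy.columns))
```

The order matters: collapsing `'__'` only helps after spaces have been turned into underscores and the parentheses removed.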

In [None]:
# Fixing column names

# Apply str.strip()
df.columns = #ToDo: your code goes here

# Apply str.lower()
df.columns = #ToDo: your code goes here

# Apply str.replace() for specific character removal and space replacement
df.columns = #ToDo: your code goes here


In [None]:
df.columns

### Exercise 4: Encode the Location as a Numeric Value

**Objective:**

In this exercise, the goal is to encode location information as numeric values in a Pandas DataFrame. This encoding is essential for certain machine learning algorithms that require numerical input.

**Actions:**

1. **Show Value Counts and Histogram:**
   - Display the value counts using `df['location'].value_counts()`.
   - Plot a histogram for the 'location' column using `df['location'].hist()`.

2. **Fix Blank Spaces and Remove Blanks:**
   - Replace blank spaces in the 'Hasbaya' category with a corrected label.
   - Remove blank spaces from all location values using `df.location = df.location.str.replace(' ', '')`.

3. **One-Hot Encoding:**
   - Utilize one-hot encoding for the 'location' column.
   - Concatenate the one-hot encoded columns with the original DataFrame.
   - `pd.concat([df, pd.get_dummies(df['location'], prefix='location')], axis=1)`
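
The cleaning and encoding steps can be sketched on a toy column. The location names come from the dataset description; the stray spaces are invented for illustration.

```python
import pandas as pd

# Toy column with inconsistent spacing in two of the labels
toy = pd.DataFrame({"location": ["Hasbaya", "Has baya", "ElJahliye", "Rachiine "]})

# Remove all blank spaces so that the variants collapse into one category
toy["location"] = toy["location"].str.replace(" ", "")
print(toy["location"].value_counts())

# One-hot encode and append the indicator columns to the original frame
toy = pd.concat([toy, pd.get_dummies(toy["location"], prefix="location")], axis=1)
print(toy)
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; both work as model inputs.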

Show the `value_counts()` and `hist()` for the location column:

In [None]:
#ToDo: your code goes here

In [None]:
#ToDo: your code goes here

Fix the Hasbaya blank space

Choose an appropriate encoding method for the location column. This could involve using techniques like [label encoding](https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/) or [one-hot encoding](https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/), depending on the nature of the data.

In [None]:
# Remove blank spaces from location values
df.location = #ToDo: your code goes here

# One-hot encoding 'location' column
df = pd.concat([df, pd.get_dummies(df['location'], prefix='location')], axis=1)

In [None]:
# Display the DataFrame with fixed columns and one-hot encoding
df['location'].value_counts()

### Exercise 4: Extract and Manipulate Date Information

**Explanation:**

Inconsistent date formats can hinder operations like calculating day differences or mapping continuous values. By ensuring a uniform date format, we enable accurate analysis, making it easier to perform tasks such as computing time differences and extracting meaningful insights from the data. This process involves formatting dates, calculating day differences, and mapping continuous values for an improved analysis.

**Objective:**

In this exercise, the goal is to extract and manipulate date information from the `'harvesting_date'` and `'receiving_date'` columns in a Pandas DataFrame. This process involves formatting dates to a consistent format.

**Actions:**

1. **Inspection:**
   - Inspect the date formats in `df['harvesting_date'].values` and `df['receiving_date'].values`. Occasionally, individual malformed values are easily spotted, and you can change a particular entry with `formatted_dates[i] = '04/09/2023'`, with `i` being the index you want to change.

2. **Format Dates:**
   - Apply the `format_date` function to `'harvesting_date'` and `'receiving_date'` to standardize them to `'%d/%m/%Y'`. Some entries may initially be in the format `'%Y-%d-%m %H:%M:%S'` or `'%Y-%m-%d %H:%M:%S'` (*day `d` and month `m` change positions*).

3. **Map a Continuous Value for Day Difference:**
   - Extract min and max day values from the 'harvesting_date' column. `df['harvesting_date'].dt.day.min()` and `df['harvesting_date'].dt.day.max()` will do the trick.
   - Map day values to a continuous range between 0.0 and 1.0, storing the result in a new column `'day_time_diff'`. `df['harvesting_date'].apply(lambda x: (x.day - min_day) / (max_day - min_day))`


4. **Calculate Number of Days Since Harvesting:**
   - Calculate the difference in days and store it in a new column 'n_days'. Use simple subtraction between the two datetime columns: `(df['receiving_date'] - df['harvesting_date']).dt.days`

5. **Reset Index:**
   - Reset the DataFrame index to enhance data structure. `df = df.reset_index(drop=True)`
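
The date steps above can be sketched end to end on toy data. All dates are made up, and `to_day_first` is a hypothetical helper that plays the role of `format_date` for a single input format; it is not the notebook's function.

```python
from datetime import datetime
import pandas as pd

# Toy data: harvesting dates arrive in two different text formats
toy = pd.DataFrame({
    "harvesting_date": ["2023-09-04 00:00:00", "06/09/2023", "2023-09-10 00:00:00"],
    "receiving_date": ["05/09/2023", "08/09/2023", "11/09/2023"],
})

def to_day_first(value, from_format="%Y-%m-%d %H:%M:%S", to_format="%d/%m/%Y"):
    """Hypothetical helper: reformat if the value parses, otherwise return it unchanged."""
    try:
        return datetime.strptime(value, from_format).strftime(to_format)
    except ValueError:
        return value

# Unify the text format, then convert both columns to datetime
toy["harvesting_date"] = pd.to_datetime(toy["harvesting_date"].map(to_day_first), format="%d/%m/%Y")
toy["receiving_date"] = pd.to_datetime(toy["receiving_date"], format="%d/%m/%Y")

# Map the harvesting day of month to the range 0.0 to 1.0
min_day = toy["harvesting_date"].dt.day.min()
max_day = toy["harvesting_date"].dt.day.max()
toy["day_time_diff"] = toy["harvesting_date"].apply(lambda x: (x.day - min_day) / (max_day - min_day))

# Days elapsed between harvesting and receiving
toy["n_days"] = (toy["receiving_date"] - toy["harvesting_date"]).dt.days
toy = toy.reset_index(drop=True)
print(toy)
```

Because the mapping uses the day of the month only, it is meaningful when all harvesting dates fall within the same month, as they do in this toy example.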

Inspect the date formats

In [None]:
#ToDo: your code goes here

Use the format dates function

In [None]:
from datetime import datetime
# Function to format the date in both directions
def format_date(value, from_format, to_format):
    if isinstance(value, datetime):
        # If the value is already a datetime object, format it accordingly
        return value.strftime(to_format)
    try:
        # Attempt to parse the date using the specified format
        date_obj = datetime.strptime(value, from_format)
        return date_obj.strftime(to_format)
    except ValueError:
        # If parsing fails, return the original value
        return value

# Apply the function to the 'harvesting_date' column and create a new list
format_a = '' #ToDo: your code goes here
format_b = '' #ToDo: your code goes here
format_c = '' #ToDo: your code goes here
formatted_dates = [format_date(date, format_a, format_c) if not isinstance(date, datetime) else format_date(date, format_b, format_c) for date in df['harvesting_date'].values]
formatted_dates

You need to replace the third item, which appears to be a typing error (`'4/92023'`):

In [None]:
formatted_dates[2] =  #ToDo: your code goes here

# Conversion to datetime format
df['harvesting_date'] = formatted_dates
df['harvesting_date'] = pd.to_datetime(df['harvesting_date'], format='%d/%m/%Y')

df['receiving_date'] = pd.to_datetime(df['receiving_date'], format='%d/%m/%Y')
print(df['harvesting_date'].value_counts())

And also map a continuous value for the day difference

In [None]:
from datetime import datetime

# Extract min and max day values
min_day = #ToDo: your code goes here
max_day = #ToDo: your code goes here

# Map it to a value between 0.0 and 1.0
df['day_time_diff'] = df['harvesting_date'].apply(lambda x: (x.day - min_day) / (max_day - min_day))

In [None]:
#ToDo: your code goes here
df[['harvesting_date', 'day_time_diff']].value_counts()

It may be useful to know the number of days since harvesting. Let's add it to our feature columns:

In [None]:
# Calculate the difference in days and store it in a new column 'n_days'
df['n_days'] = #ToDo: your code goes here

Let's look at the value counts:

In [None]:
#ToDo: your code goes here

Now reset the DataFrame index:

In [None]:
#ToDo: your code goes here

### Exercise 5: Discretize Output

**Objective:**

In this exercise, the goal is to discretize the 'polyphenols_content_mg' variable into categories and visualize the distribution of polyphenols content in the dataset.

**Actions:**

1. **Display Descriptive Statistics:**
   - Print descriptive statistics for the 'polyphenols_content_mg' variable using `df['polyphenols_content_mg'].describe()`.

2. **Create Histogram:**
   - Generate a histogram to visualize the distribution of polyphenols content.
   - Use `plt.hist()` with specified bins and formatting options.

3. **Discretize Polyphenols Content:**
   - Create a new column '`polyphenols_category`' with discrete labels based on polyphenols content.
   - Utilize `pd.cut()` to discretize 'polyphenols_content_mg' into categories ('low', 'moderate', 'high') with specified bins.
   - After inspecting the histogram and value count, find experimental values that you consider can explain the categories better.

4. **Visualize Discretized Categories:**
   - Plot a histogram to visualize the distribution of polyphenols content categories using `plt.hist()`.

Explore the distribution of polyphenols content and understand how discretizing the variable into categories ('low', 'moderate', 'high') affects the overall distribution.
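
To make the binning behaviour concrete, here is a minimal sketch with `pd.cut`. The content values and both thresholds are hypothetical and chosen only to show how boundary values are assigned; they are not recommended thresholds for the real data.

```python
import pandas as pd

# Toy values, including two that sit exactly on a threshold
toy = pd.DataFrame({"polyphenols_content_mg": [50.0, 100.0, 150.0, 200.0, 250.0]})

# Hypothetical thresholds, for illustration only
lower_threshold = 100.0
upper_threshold = 200.0

# right=False makes each bin closed on the left and open on the right:
# [-inf, 100) -> low, [100, 200) -> moderate, [200, inf) -> high
toy["polyphenols_category"] = pd.cut(
    toy["polyphenols_content_mg"],
    bins=[-float("inf"), lower_threshold, upper_threshold, float("inf")],
    labels=["low", "moderate", "high"],
    right=False,
)
print(toy)
```

A value equal to a threshold therefore falls into the higher category, which is worth keeping in mind when choosing thresholds from the histogram.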

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Display descriptive statistics
#ToDo: your code goes here

# Create a histogram
plt.hist(df['polyphenols_content_mg'], bins=100, edgecolor='black')
plt.title('Distribution of Polyphenols Content')
plt.xlabel('Polyphenols Content (mg)')
plt.ylabel('Frequency')
plt.show()



Try different values for the lower and upper threshold, these will be important for the classification model.

In [None]:
# Create a new column with discrete labels
lower_threshold = '' #ToDo: your code goes here
upper_threshold = '' #ToDo: your code goes here
labels_ = ['low', 'moderate', 'high']
df['polyphenols_category'] = pd.cut(df['polyphenols_content_mg'],
                                    bins=[-float('inf'), lower_threshold, upper_threshold, float('inf')],
                                    labels= labels_,
                                    right=False)  # Include the left bin edge, exclude the right bin edge

print(df['polyphenols_category'].value_counts())
df['polyphenols_category'].hist()

As you can see, real-life datasets can be messy. Completing these exercises equips you with essential skills to tackle real-world datasets effectively.

Finally, when you are finished you will want to save the cleaned dataframe in a .csv format
- `index=False` helps you load the DataFrame without the previous index

In [None]:
df.to_csv('pomegranate_cleaned_.csv', index=False)

### Bonus feature extraction *(Do not re-run | Image dataset is not available)*

Here we have pre-computed some additional metrics from the pomegranate images, which you can compute yourself with the following scripts.

In [None]:
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np

from tqdm import tqdm  # Import tqdm for the progress bar


def crop_pomegranate(image):
    # Get the original image dimensions
    height, width, _ = image.shape

    # Row offset where the crop starts: the top 40% (1/2.5) of the image is discarded
    new_height = int(height * 1 / 2.5)
    # Width to keep: the central two-thirds (6/9) of the image
    new_width = int(width * 6 / 9)

    # Calculate the amount to remove from each side
    remove_from_each_side = (width - new_width) // 2

    # Crop the image: drop the top rows and remove about 1/6 of the width from each side
    cropped_image = image[new_height:, remove_from_each_side:(width - remove_from_each_side)]

    return cropped_image

def resize_image(image, target_height):
    # Calculate the aspect ratio of the original image
    aspect_ratio = image.shape[1] / image.shape[0]

    # Calculate the new width based on the target height and original aspect ratio
    new_width = int(target_height * aspect_ratio)

    # Resize the image to the calculated width and target height
    resized_image = cv2.resize(image, (new_width, target_height))
    return resized_image

def crop_center(image, target_width, target_height):
    # Get the dimensions of the original image
    original_height, original_width = image.shape[:2]

    # Calculate the center of the image
    center_x = original_width // 2
    center_y = original_height // 2

    # Calculate the crop box coordinates
    crop_x1 = max(0, center_x - target_width // 2)
    crop_y1 = max(0, center_y - target_height // 2)
    crop_x2 = min(original_width, center_x + target_width // 2)
    crop_y2 = min(original_height, center_y + target_height // 2)

    # Crop the image
    cropped_image = image[crop_y1:crop_y2, crop_x1:crop_x2]

    return cropped_image

# Function to convert an OpenCV image to Lab color space
# (cv2.imread returns channels in BGR order, so the BGR conversion code is used)
def rgb_to_lab(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2Lab)

# Specify the target resolution after cropping
target_height = 250  # Adjust as needed
target_width = 200

# Specify the directory containing your images
base_directory = "/home/Images_folder" # Use the actual path of the images

# Get a list of all subdirectories in the base directory
subdirectories = [subdir for subdir in os.listdir(base_directory) if os.path.isdir(os.path.join(base_directory, subdir))]

# Create an empty list to store the first image array from each subdirectory
side_images = []

# Loop through each subdirectory with tqdm for progress tracking
for subdir in tqdm(subdirectories, desc="Processing Subdirectories"):
    # Get the list of files in the subdirectory
    image_files = os.listdir(os.path.join(base_directory, subdir))

    # Find the first "Side 2" image file in the subdirectory
    image_path = None
    for image_file in image_files:
        if "Side 2" in image_file and image_file.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(base_directory, subdir, image_file)
            break

    # Skip subdirectories without a matching image
    if image_path is None:
        continue

    # Read the image using OpenCV (channels are in BGR order)
    image = cv2.imread(image_path)

    # Crop the image to keep the central two-thirds of the width and discard the top 40% of the height
    cropped_image = crop_pomegranate(image)

    # Resize the cropped image
    resized_image = resize_image(cropped_image, target_height)

    # Crop center to ensure same array dimensions resolution
    cropped_center_image = crop_center(resized_image, target_width, target_height)

    # Convert the image to Lab color space
    resized_image_lab = rgb_to_lab(cropped_center_image)

    # Append the images to the list
    side_images.append(resized_image_lab)

    print(f"\ncropped_image {cropped_image.shape}, resized_image {resized_image.shape}, cropped_center_image {cropped_center_image.shape}, resized_image_lab {resized_image_lab.shape}")

    # Display the original, cropped, and Lab color space images
    plt.figure(figsize=(15, 5))

    plt.subplot(1, 3, 1)
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.title("Original Image")

    plt.subplot(1, 3, 2)
    plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB))
    plt.title("Cropped Image")

    plt.subplot(1, 3, 3)
    plt.imshow(resized_image_lab)
    plt.title("Lab Color Space")

    plt.show()

# Convert the list of arrays to a NumPy array
side_images = np.array(side_images)


In [None]:
# @title Compute statistical moments from RGB color space

from PIL import Image
import numpy as np
import pandas as pd
from scipy.stats import moment

def extract_features_from_dataset(image_dataset):
    features_list = []

    for img_array in image_dataset:
        # Calculate mean pixel values for each channel
        mean_pixel_values = np.mean(img_array, axis=(0, 1))

        # Flatten the image array to calculate statistical moments
        flattened_img = img_array.flatten()

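        # Calculate central moments of order 1 to 4 with scipy.stats.moment
        # (the first central moment is always 0; orders 3 and 4 are unnormalized,
        # i.e. not the standardized skewness and kurtosis)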
        moments = [moment(flattened_img, moment=i) for i in range(1, 5)]

        # Calculate median and mode
        median_intensity = np.median(flattened_img)
        mode_intensity = float(pd.Series(flattened_img).mode().iloc[0])

        # Combine features for this image
        image_features = mean_pixel_values.tolist() + moments + [median_intensity, mode_intensity]
        features_list.append(image_features)

    # Create a DataFrame to store the features
    column_names = ['mean_R', 'mean_G', 'mean_B', 'mean_intensity', 'variance_intensity', 'skewness_intensity', 'kurtosis_intensity', 'median_intensity', 'mode_intensity']
    df = pd.DataFrame(features_list, columns=column_names)

    return df

# Load your dataset
#image_dataset = np.load('numpy_array.npy')

# Extract features from the dataset
df = extract_features_from_dataset(side_images)

# Now, you can use the DataFrame for regression or other analysis
df

In [None]:
# @title Compute statistical moments from LAB color space
def extract_features_from_dataset_lab(image_dataset):
    features_list = []

    for img_array in image_dataset:
        # Calculate mean pixel values for each channel
        mean_pixel_values = np.mean(img_array, axis=(0, 1))

        # Flatten the image array to calculate statistical moments
        flattened_img = img_array.flatten()

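        # Calculate statistical moments (mean, variance, skewness, excess kurtosis).
        # scipy's `moment` returns central moments, so the first one is always 0 and
        # the third and fourth must be standardized by the variance.
        flat = flattened_img.astype(float)
        variance = moment(flat, moment=2)
        moments = [flat.mean(), variance,
                   moment(flat, moment=3) / variance ** 1.5,
                   moment(flat, moment=4) / variance ** 2 - 3]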

        # Calculate median and mode
        median_intensity = np.median(flattened_img)
        mode_intensity = float(pd.Series(flattened_img).mode().iloc[0])

        # Combine features for this image
        image_features = mean_pixel_values.tolist() + moments + [median_intensity, mode_intensity]
        features_list.append(image_features)

    # Create a DataFrame to store the features
    # Channel means follow the array's channel order, assumed here to be (L, a, b)
    column_names = ['mean_L', 'mean_a', 'mean_b', 'mean_intensity_lab', 'variance_intensity_lab', 'skewness_intensity_lab', 'kurtosis_intensity_lab', 'median_intensity_lab', 'mode_intensity_lab']
    df = pd.DataFrame(features_list, columns=column_names)

    return df

# Load your dataset
#image_dataset = np.load('numpy_array.npy')

# Extract features from the dataset
df = extract_features_from_dataset_lab(side_images)

# Now, you can use the DataFrame for regression or other analysis
df

## Sub-challenge 2: Feature Exploration and Dimensionality Reduction Visualization (Data exploration)

Objective: In this exercise, you will explore various features within the dataset and apply a dimensionality reduction technique, PCA (Principal Component Analysis). The goal is to visualize the dataset in reduced dimensions and observe potential patterns and clusters.

**Here you will apply:**

- **Correlation matrix**
- **Clustering (K-means) & Dim. Reduction and Visualization (PCA)**

### Load dataset

Objective: Load the dataset with the additional features and inspect its summary statistics, data types, missing values, and histograms.

`df.info(), df.describe(), df.isnull().sum(), df.hist()`



In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('pomegranate_cleaned.csv')  # Replace '_.csv' with the actual file path
df.info()

In [None]:
print("Dataset Preview:")
df

In [None]:
print("Display descriptive statistics:")
df.describe()

In [None]:
# Check for missing values
print("\nMissing Values:")
df.isnull().sum()

In [None]:
#@title Visualize distributions of features in the DataFrame
import seaborn as sns
import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# Visualize the distributions of features
def visualize_distributions(df):
    # Plot histograms for each feature
    for column in df.columns:
        plt.figure(figsize=(12,10))
        sns.histplot(df[column], kde=True)
        plt.title(f'Distribution of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.show()

# Visualize distributions of features in the DataFrame
visualize_distributions(df)

### Exercise 1: Visualize Correlation Matrix

**Objective:**

In this exercise, the goal is to visually explore the correlation between variables in the dataset by creating and visualizing a correlation matrix. Understanding the relationships between variables is crucial for gaining insights into the dataset's structure.

**Actions:**

1. **Create Correlation Matrix:**
   - Use the `corr()` method in Pandas to compute the pairwise correlation of columns: `correlation_matrix = df.corr()`. First drop the columns that cannot be correlated numerically, i.e. the date columns (`harvesting_date`, `receiving_date`) and the categorical `location` column.

2. **Visualize Correlation Matrix:**
   - Utilize a visualization library, such as Seaborn and/or Matplotlib, to create a heatmap of the correlation matrix: `sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={'size': 10})`.
   To enlarge the heatmap, run `plt.figure(figsize=(16, 12))` beforehand.
   
3. **Analyze the Heatmap:**
   - Interpret the heatmap to identify patterns and relationships between variables.
   - Positive values indicate a positive correlation, while negative values indicate a negative correlation. The intensity of color represents the strength of the correlation.

Completing these actions will provide a visual representation of the relationships between variables in the dataset, aiding in the identification of patterns and insights.

In [None]:
# Drop the date and categorical columns, which cannot be correlated numerically
correlation_matrix = df.drop(columns=['harvesting_date', 'receiving_date', 'location']).corr()
# Enlarge the heatmap
plt.figure(figsize=(16, 12))

# Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={'size': 10})

# Set plot title and show the plot
plt.title('Correlation Matrix')
plt.show()

From the correlation matrix try to address the following questions:

- Strongest Correlations: Identify the pairs with the strongest positive and negative correlations.

- Interpretation of Coefficients: Explain the interpretation of correlation coefficients close to 1, 0, and -1.

- Correlation vs. Causation: [Does correlation mean causation](https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/correlation-and-causation#:~:text=A%20correlation%20between%20variables%2C%20however,relationship%20between%20the%20two%20events.)? Can you think of an example?

- Feature Selection: Which features do you think will have the highest impact?
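
One possible way to approach the first question is to rank the pairs programmatically instead of reading them off the heatmap. The sketch below is only an illustration: the helper `strongest_pairs` and the toy columns `a` to `d` are hypothetical and stand in for the real dataset.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame, top=3):
    """Return the `top` most positive and most negative correlation pairs."""
    corr = frame.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
    return pairs.head(top), pairs.tail(top)

# Toy data with invented values, standing in for the real dataset
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # strongly positive with a
    "c": -x + rng.normal(scale=0.1, size=200),     # strongly negative with a
    "d": rng.normal(size=200),                     # unrelated noise
})

most_positive, most_negative = strongest_pairs(toy, top=1)
print(most_positive)
print(most_negative)
```

On the real data, the same helper could be called on `df` after dropping the date and location columns.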




**Answer:**

- Strongest correlations:

  - The strongest positive correlation is between the `mass_g` (mass of pomegranates in grams) and `volume_ml` (volume of pomegranate juice in milliliters) features, with a correlation coefficient of `1.0`. A coefficient of `1.0` indicates that, in this dataset, mass and volume are perfectly linearly related.

  - Mass and Yield of Juice: There is a strong positive correlation between the mass (`mass_g`) of pomegranates and the yield of juice (`yield_of_juice`) obtained. This is expected, as larger pomegranates are likely to produce more juice.

  - Length and Width: The length and width of the pomegranates (`length` and `width`) also exhibit a strong positive correlation. This indicates that the shape of the pomegranates is relatively consistent, as increases in length tend to be associated with increases in width.

  - Location Flags: The location flags (`location_ElJahliye`, `location_Hasbaya`, `location_Rachiine`) show some correlation with other features. For example, `location_ElJahliye` has negative correlations with mass and volume, indicating potential differences in pomegranates from this location.

  - Temperature and Solar Radiation: The features related to environmental conditions during harvesting (`humidity`, `solar_radiation`, `air_temperature`) have moderate correlations with other features. These correlations can provide insights into how environmental factors might influence pomegranate characteristics.

  - Polyphenols Content and Concentration: The polyphenols-related features (`polyphenols_content_mg`, `polyphenols_concentration_mggae/ml`) show positive correlations with several other features, such as mass, volume, and yield of juice. This suggests a potential relationship between the size of the pomegranates and their polyphenol content.

  - Day Time Difference and Number of Days: The features related to time (`day_time_diff`, `n_days`) have relatively weak correlations with other features. However, they may still be valuable in understanding how the duration between receiving and harvesting affects pomegranate characteristics.

- Interpretation of Coefficients

  - Correlation coefficients close to 1 indicate a strong positive linear relationship between two variables. For example, a correlation coefficient of 1.0 between `mass_g` and `volume_ml` means that the points lie exactly on an increasing straight line: heavier pomegranates always have a larger volume. The coefficient does not give the slope of that line, so it does not say how many milliliters correspond to one additional gram.

  - Correlation coefficients close to 0 indicate no linear relationship between two variables. For example, a correlation coefficient of 0 between `n_days` (number of days between receiving and harvesting) and `maturity` (Boolean indicating the maturity of pomegranates) would mean that there is no linear association between these two factors, although a non-linear relationship could still exist.

  - Correlation coefficients close to `-1` indicate a strong negative linear relationship between two variables. For example, the coefficient of `-0.47` between `location_ElJahliye` and `mass_g` is a moderate negative correlation: pomegranates from El Jahliye tend to have a smaller mass than those from the other locations.

- Correlation vs. Causation

  - Correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. For example, ice cream sales are correlated with the number of shark attacks, but ice cream sales do not cause shark attacks: both rise in hot weather, when more people go to the beach.

- Feature Selection

  - The features that will have the highest impact on the target labels are those that are most strongly correlated with the target labels. For example, the `mass_g`, `volume_ml`, and `degree_brix` features are all strongly correlated with the `polyphenols_content_mg` target label, so they are likely to be important for predicting polyphenol content.

  - However, because correlation does not imply causation, features should be selected with care. The most useful features also depend on the specific application of the model and may vary from task to task.
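
To make the interpretation of the coefficients concrete, the following sketch uses invented numbers (not the pomegranate data) to show that the Pearson coefficient measures how closely points follow a straight line, not how steep that line is, and that it can be near zero for a perfect non-linear relationship.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 50)

# Two exact linear relationships with very different slopes
r_steep = np.corrcoef(x, 5.0 * x + 2.0)[0, 1]
r_shallow = np.corrcoef(x, 0.1 * x - 3.0)[0, 1]

# An exact decreasing linear relationship
r_negative = np.corrcoef(x, -0.5 * x)[0, 1]

# A perfect but non-linear (symmetric) relationship
centered = x - 5.0
r_parabola = np.corrcoef(centered, centered ** 2)[0, 1]

print(f"steep slope:   r = {r_steep:.3f}")
print(f"shallow slope: r = {r_shallow:.3f}")
print(f"negative:      r = {r_negative:.3f}")
print(f"parabola:      r = {abs(r_parabola):.3f}")
```

Both increasing lines give the same coefficient despite slopes of 5.0 and 0.1, while the parabola, although fully deterministic, shows essentially no linear correlation.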

### Exercise 2: KMeans Clustering and Visualization

**Objective:**

In this exercise, the goal is to apply KMeans clustering to a dataset with relevant features, visualize the resulting clusters in a two-dimensional space, and explore patterns and groupings within the data.

**Actions:**

1. **Feature Exploration:**
   - Select a set of features from the dataset that you find interesting or relevant. Features such as `mass_g`, `volume_ml`, `length`, `width`, `location_ElJahliye`, `location_Hasbaya`, `location_Rachiine`, `n_days`, and `day_time_diff` may be considered. Avoid using target labels.

2. **Evaluate Clustering Quality:**
   - Calculate and interpret the inertia, Davies-Bouldin index, and silhouette score to assess the quality of the clustering solution.

3. **Standardize the Features:**
   - Use `StandardScaler()` to standardize the selected features. Standardization ensures consistent scaling for both visualization and clustering.

4. **Apply PCA for Visualization:**
   - Utilize PCA to reduce the dimensionality of the standardized features to two components.
   - Visualize the dataset in a 2D space.

5. **Perform KMeans Clustering:**
   - Apply the KMeans clustering algorithm to the standardized features.
   - Choose an appropriate number of clusters (e.g., `n_clusters=3`).

6. **Visualize Clusters:**
   - Visualize the clusters in the 2D space obtained from PCA.
   - Color each point based on its assigned cluster.

Explore different features and visualize their behavior with varying numbers of clusters. Interpret the clustering results to gain insights into the underlying patterns within the data.
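
As a rough illustration of why the standardization step matters before PCA, the sketch below compares the explained variance ratio with and without `StandardScaler()`. The three synthetic columns (`mass`, `length`, `noise`) are assumptions made up for this example and are not taken from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features on very different scales (e.g. grams vs. centimetres)
mass = rng.normal(300.0, 50.0, size=100)
length = 0.02 * mass + rng.normal(0.0, 0.5, size=100)
noise = rng.normal(0.0, 1.0, size=100)
X = np.column_stack([mass, length, noise])

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)

print("Raw explained variance ratio:   ", pca_raw.explained_variance_ratio_)
print("Scaled explained variance ratio:", pca_scaled.explained_variance_ratio_)
```

Without scaling, the first component is dominated by the column with the largest numeric range. After scaling, the components reflect the correlation structure instead.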



*Optional:* Let's try to infer a reasonable/optimal number of clusters for our data. Here we will assess the quality of clusters generated by a clustering algorithm using three metrics: [inertia](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/), [Davies-Bouldin index](https://www.geeksforgeeks.org/davies-bouldin-index/), and [silhouette score](https://www.geeksforgeeks.org/silhouette-algorithm-to-determine-the-optimal-value-of-k/). Remember that these metrics help quantify how well the data is partitioned into distinct groups.
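
Before applying the metrics to the pomegranate features, it can help to see how they behave when the true number of groups is known. The sketch below uses synthetic blobs with four hand-placed centers (an assumption for illustration only): inertia keeps decreasing as clusters are added, whereas the silhouette score is best when highest and the Davies-Bouldin index is best when lowest.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Four well-separated synthetic groups; the centers are chosen by hand
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": km.inertia_,                                  # lower is tighter
        "davies_bouldin": davies_bouldin_score(X, km.labels_),   # lower is better
        "silhouette": silhouette_score(X, km.labels_),           # higher is better
    }

best_by_silhouette = max(results, key=lambda k: results[k]["silhouette"])
best_by_davies_bouldin = min(results, key=lambda k: results[k]["davies_bouldin"])
print("Best k by silhouette:    ", best_by_silhouette)
print("Best k by Davies-Bouldin:", best_by_davies_bouldin)
```

On real data the three metrics do not always agree, so they are better read as hints than as a definitive answer.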

In [None]:
# Select relevant features for clustering
features = ['mass_g', 'length', #'width', ... # In this example, the features 'mass_g' and 'length' are used
            ]

In [None]:
# @title Optimal number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

inertia_values = []
davies_bouldin_scores = []  # to store Davies-Bouldin scores
silhouette_scores = []  # to store Silhouette Scores
possible_clusters = range(2, 10)  # Try different numbers of clusters (computation time may vary)

# Standardize first so that features with large units do not dominate the distances
X_scaled = StandardScaler().fit_transform(df[features])

for n_clusters in possible_clusters:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(X_scaled)  # fit on the standardized features
    inertia_values.append(kmeans.inertia_)

    # Calculate Davies-Bouldin Index
    labels = kmeans.labels_
    davies_bouldin = davies_bouldin_score(X_scaled, labels)
    davies_bouldin_scores.append(davies_bouldin)

    # Calculate Silhouette Score
    silhouette = silhouette_score(X_scaled, labels)
    silhouette_scores.append(silhouette)

# Plot the Elbow Method graph
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(possible_clusters, inertia_values, marker='o', linestyle='-', color='b', label='Inertia values')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Within-cluster Sum of Squares)')
plt.legend()
plt.grid(True)

# Plot the Davies-Bouldin Index graph
plt.subplot(1, 3, 2)
plt.plot(possible_clusters, davies_bouldin_scores, marker='o', linestyle='-', color='r', label='Davies-Bouldin Score')
plt.title('Davies-Bouldin Score for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Davies-Bouldin Score')
plt.legend()
plt.grid(True)

# Plot the Silhouette Score graph
plt.subplot(1, 3, 3)
plt.plot(possible_clusters, silhouette_scores, marker='o', linestyle='-', color='g', label='Silhouette Score')
plt.title('Silhouette Score for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


Now it is your turn to explore different features and visualize them with the PCA-based visualization below.


In [None]:
#@title 2D PCA Visualization | Kmeans clustering | Continuous variable | Discrete variable

# @markdown Try to find the best number of clusters or components that explain the data in 2D for continuous and discrete labels.

# @markdown **Select the visualization mode:**
mode = "Discrete target" # @param ["Clusters", "Continuous target", "Discrete target"]

# @markdown **Setup:**

# @markdown Clusters:
n_clusters = 5 # @param {type:"integer"}

# @markdown PCA components (at most the number of selected features):
n_components = 2 # @param {type:"number"}

# @markdown Annotate Continuous or discrete labels

# Continuous labels:
continuous_output = "polyphenols_concentration_mggae/ml" # @param ["polyphenols_content_mg", "polyphenols_concentration_mggae/ml", "degree_brix", "ta_av", "mi"]
continuous_output = df[continuous_output]

# Discrete labels:
discrete_output = "polyphenols_category" # @param ["polyphenols_category", "maturity"]
discrete_output = df[discrete_output]

# Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[features])

# Perform KMeans clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Reduce dimensionality for visualization (using PCA)
pca = PCA(n_components=n_components)
reduced_data = pca.fit_transform(scaled_data)

# Visualize the first two principal components, colored according to the selected mode
plt.figure(figsize=(10, 6))

if mode == "Clusters":
    # One scatter call per cluster so that each cluster gets its own legend entry
    for cluster_label in range(n_clusters):
        cluster_points = reduced_data[(df['Cluster'] == cluster_label).to_numpy()]
        plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {cluster_label}')
    plt.legend()
    title_suffix = f'{n_clusters} KMeans clusters'
elif mode == "Continuous target":
    scatter = plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=continuous_output, cmap='coolwarm', alpha=0.8)
    plt.colorbar(scatter, label=f'{continuous_output.name}')  # Add colorbar with label
    title_suffix = continuous_output.name
elif mode == "Discrete target":
    # Boolean masks work regardless of how the DataFrame is indexed
    for label in discrete_output.unique():
        mask = (discrete_output == label).to_numpy()
        plt.scatter(reduced_data[mask, 0], reduced_data[mask, 1], alpha=0.8, label=f'{discrete_output.name}: {label}')
    plt.legend()
    title_suffix = discrete_output.name

# Customize plot
plt.title(f'2D PCA visualization with {title_suffix}')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
# @title 3D visualization Kmeans clustering

import plotly.graph_objects as go
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# @markdown Try to find the best number of clusters or components that explain the data in 3D for continuous and discrete labels.

# @markdown **Select the visualization mode:**
mode = "Discrete target" # @param ["Clusters", "Continuous target", "Discrete target"]

# @markdown **Setup:**

# @markdown Clusters:
n_clusters = 3 # @param {type:"integer"}

# @markdown PCA components:
n_components = 0.95 # @param {type:"number"}

# @markdown Annotate Continuous or discrete labels

# Continuous labels:
continuous_output = "polyphenols_content_mg" # @param ["polyphenols_content_mg", "polyphenols_concentration_mggae/ml", "degree_brix", "ta_av", "mi"]
continuous_output = df[continuous_output]

# Discrete labels:
discrete_output = "maturity" # @param ["polyphenols_category", "maturity"]
discrete_output = df[discrete_output]

# Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[features])

# Perform KMeans clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Reduce dimensionality for visualization (using PCA).
# A 3D plot needs exactly three components, so this cell requires at least three
# selected features and does not use the `n_components` setting above.
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(scaled_data)

# Visualize clusters in 3D using Plotly
fig = go.Figure()

if mode == "Clusters":
    for cluster_label in sorted(df['Cluster'].unique()):
        cluster_points = reduced_data[df['Cluster'] == cluster_label]
        fig.add_trace(go.Scatter3d(
            x=cluster_points[:, 0],
            y=cluster_points[:, 1],
            z=cluster_points[:, 2],
            mode='markers',
            marker=dict(size=4, opacity=0.8),
            name=f'Cluster {cluster_label}',
            customdata=df[df['Cluster'] == cluster_label]['no.'],
            hovertemplate='<b>Cluster</b>: %{text}<br><b>no.</b>: %{customdata}',
            text=df[df['Cluster'] == cluster_label]['Cluster']
        ))
    fig.update_layout(
        scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
        title='KMeans Clustering - 3D'
    )

elif mode == "Continuous target":
    scatter = go.Scatter3d(
        x=reduced_data[:, 0],
        y=reduced_data[:, 1],
        z=reduced_data[:, 2],
        mode='markers',
        marker=dict(size=4, opacity=0.8, color=continuous_output, colorscale='agsunset'),
        customdata=df['no.'],
        hovertemplate='<b>no.</b>: %{customdata}',
        text=df['Cluster']
    )

    # Add gradient colorbar to the trace
    scatter.marker.colorbar = dict(title=f'{continuous_output.name}')

    fig.add_trace(scatter)
    fig.update_layout(scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
                      title=f'PCA with {continuous_output.name}')

elif mode == "Discrete target":
    for label in discrete_output.unique():
        # A boolean mask works regardless of how the DataFrame is indexed
        indices = (discrete_output == label)
        cluster_points = reduced_data[indices.to_numpy()]
        fig.add_trace(go.Scatter3d(
            x=cluster_points[:, 0],
            y=cluster_points[:, 1],
            z=cluster_points[:, 2],
            mode='markers',
            marker=dict(size=4, opacity=0.8),
            name=f'{discrete_output.name}: {label}',
            customdata=df.loc[indices]['no.'],
            hovertemplate='<b>no.</b>: %{customdata}',
            text=df.loc[indices]['Cluster']
        ))
    fig.update_layout(
        scene=dict(xaxis_title='PC1', yaxis_title='PC2', zaxis_title='PC3'),
        title=f'PCA with {discrete_output.name}'
    )

fig.show()


## Sub-challenge 3: Regression and Classification

**Objective:**

In this exercise, the goal is to perform regression and classification tasks using the provided dataset `'pomegranate_complete_cleaned.csv'`. The objective is to implement and evaluate the performance of linear regression, logistic regression, and a Multi-Layer Perceptron (MLP) for both continuous and categorical variables.

**Here you will:**

- **Select the Features:**
  - Choose relevant features from the dataset for regression and classification tasks.

- **Regression:**
  - Explore linear regression to predict a continuous variable.
  - Explore logistic regression to predict categorical variables.

- **Multi-Layer Perceptron:**
  - **MLP Regression:**
    - Implement a Multi-Layer Perceptron Regressor for predicting continuous variables.
    - Assess the performance of the MLP Regressor compared with linear regression.

  - **MLP Classification:**
    - Implement a Multi-Layer Perceptron Classifier for categorical variables.
    - Evaluate the model's classification performance compared with logistic regression.

Adjust the code as needed based on the specifics of your dataset. Perform a thorough evaluation and analysis of the regression and classification results with the selected features.

In [None]:
# @title Import libraries for sub-challenge 3
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix, classification_report, mean_absolute_error, explained_variance_score, mean_squared_log_error
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.decomposition import PCA

import warnings
# Suppress warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Load dataset

Load the dataset and inspect column names and data types with `df.info()`.



In [None]:
# Load the dataset
df = pd.read_csv('pomegranate_complete_cleaned.csv')  # Replace with the actual file path if it differs
df.info()

### Exercise 1: Regression - Linear and Logistic with Feature Scaling

**Objective:**

In this exercise, the goal is to perform regression tasks using the provided dataset 'pomegranate_complete_cleaned.csv'. Specifically, we will implement linear regression for continuous variables and logistic regression for categorical variables while incorporating feature scaling.

**Actions:**

1. **Select the Features:**
   - Choose relevant features from the dataset for both linear and logistic regression.

2. **Feature Scaling:**
   - Utilize feature scaling, such as `StandardScaler()`, to standardize the selected features. This step ensures consistent scaling for both linear and logistic regression.

3. **Linear Regression:**
   - Implement linear regression to predict a continuous target variable.
   - Define a split ratio for the dataset into training and testing sets using `train_test_split`.
   - Train the linear regression model on the training set.
   - Make predictions on the test set and evaluate the model's performance using appropriate regression metrics (e.g., R², Mean Squared Error, Root Mean Squared Error, Mean Absolute Error, explained variance).

4. **Logistic Regression:**
   - Select a categorical target variable for logistic regression.
   - Make predictions on the test set and evaluate the model's classification performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
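
The scaling and training steps above can also be combined in a single scikit-learn pipeline, which keeps the scaler from seeing test data and makes cross-validation straightforward. This is only a sketch on synthetic data generated with `make_regression`; the sample size, noise level, and number of folds are arbitrary assumptions and not part of the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline refits the scaler inside each fold, so no held-out statistics leak into training
model = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Final fit on the full training set, evaluated once on the test set
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"Cross-validated R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
print(f"Test R2: {test_r2:.3f}")
```

With a small dataset, the spread of the cross-validated scores gives a more honest picture than a single train/test split.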

In [None]:
features = ['volume_ml', 'mass_g', 'length', 'width',
       'location_ElJahliye', 'location_Hasbaya', 'location_Rachiine',
       'day_time_diff', 'n_days','humidity', 'solar_radiation',
       'air_temperature']

In [None]:
# @title Linear regression
# Assuming 'df' is your DataFrame containing the data
# and 'polyphenols_content_mg' is the target variable

# @markdown **Setup:**
X = df[features]
# Continuous labels:
y = "polyphenols_content_mg" # @param ["polyphenols_content_mg", "polyphenols_concentration_mggae/ml", "degree_brix", "ta_av", "mi"]
y = df[y]

# @markdown Train-test split:
split = 0.2 # @param {type:"number"}

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=42)

# Initialize and train the regression model with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred_scaled = model.predict(X_test_scaled)

# Calculate metrics on the scaled predictions
mse_scaled = mean_squared_error(y_test, y_pred_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)
mae_scaled = mean_absolute_error(y_test, y_pred_scaled)
rmse_scaled = np.sqrt(mse_scaled)
explained_variance = explained_variance_score(y_test, y_pred_scaled)
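# scikit-learn's MSLE rejects negative values, and a regression model can predict them
if (y_test >= 0).all() and (y_pred_scaled >= 0).all():
    msle_scaled = mean_squared_log_error(y_test, y_pred_scaled)
else:
    msle_scaled = np.nan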


# Visualization
plt.scatter(y_test, y_pred_scaled)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], lw=2, color='red')  # Identity line
plt.xlabel(f'Actual {y.name}')
plt.ylabel(f'Predicted {y.name}')
plt.title('Linear Regression with Scaling - Actual vs. Predicted')
plt.show()

# Report the results with scaling
print(f'Mean Squared Error with Scaling: {mse_scaled}')
print(f'Mean Absolute Error with Scaling: {mae_scaled}')
print(f'Root Mean Squared Error with Scaling: {rmse_scaled}')
print(f'R-squared with Scaling: {r2_scaled}')
print(f'Explained Variance Score with Scaling: {explained_variance}')

In [None]:
# @title Logistic regression

# @markdown **Setup:**
X = df[features]
# Categorical labels:
y = "polyphenols_category" # @param ["polyphenols_category", "maturity"]
y = df[y]


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (optional but can be beneficial for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_pca, y_train)

# Make predictions on the test set features
y_pred = model.predict(X_test_pca)

# Visualize confusion matrix
classes = sorted(y.unique())
# Pass the labels explicitly so the matrix keeps one row and column per class,
# even if a class is missing from the test split
cm = confusion_matrix(y_test, y_pred, labels=classes)

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title(f'Confusion Matrix {y.name}')
plt.colorbar()

tick_marks = range(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# Add text annotations
for i in range(len(classes)):
    for j in range(len(classes)):
        plt.text(j, i, str(cm[i, j]), horizontalalignment='center', verticalalignment='center')

plt.show()

# Classification Report
print('Classification Report:')
print(classification_report(y_test, y_pred))


In [None]:
# @title Logistic regression with Custom Confusion Matrix Labels

# @markdown **Setup:**
X = df[features]
# Categorical labels:
y = "maturity" # @param ["polyphenols_category", "maturity"]
y = df[y]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (optional but can be beneficial for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set features
y_pred = model.predict(X_test_scaled)

# Labels for confusion matrix
classes = sorted(y.unique())  # class values as they appear in the data
class_labels = classes

if len(classes) == 2:
    # Custom display labels for a binary target
    class_labels = ["Negative", "Positive"]

# Visualize confusion matrix with custom labels; passing `labels` keeps one
# row and column per class even if a class is missing from the test split
cm = confusion_matrix(y_test, y_pred, labels=classes)

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title(f'Confusion Matrix: {y.name}')
plt.colorbar()

plt.xticks(range(len(class_labels)), class_labels)
plt.yticks(range(len(class_labels)), class_labels)

plt.xlabel(f'Predicted {y.name}')
plt.ylabel(f'True {y.name}')

# Add text annotations
for i in range(len(class_labels)):
    for j in range(len(class_labels)):
        plt.text(j, i, str(cm[i, j]), horizontalalignment='center', verticalalignment='center')

plt.show()

# Classification Report
print('Classification Report:')
print(classification_report(y_test, y_pred))



### Exercise 2: MLP - Regression and Classification

**Objective:**

In this exercise, the goal is to explore a Multi-Layer Perceptron (MLP) Regressor for continuous variables and an MLP Classifier for categorical variables, while incorporating feature scaling.

**Actions:**

1. **Select the Features:**
   - Choose relevant features from the dataset for both MLP regression and classification.

2. **Feature Scaling:**
   - Utilize feature scaling, such as `StandardScaler()`, to standardize the selected features. This step ensures consistent scaling for both MLP regression and classification.

3. **MLP Regressor:**
   - Implement a Multi-Layer Perceptron Regressor for predicting continuous target variables.
   - Split the dataset into training and testing sets using `train_test_split`, then fit the scaler on the training set only and apply it to both sets.
   - Train the MLP Regressor on the training set.
   - Make predictions on the test set and evaluate the model's performance using appropriate regression metrics.

4. **MLP Classifier:**
   - Select a categorical target variable for MLP classification.
   - Convert categorical labels into numerical format if needed.
   - Split the dataset into training and testing sets, then fit the scaler on the training set only and apply it to both sets.
   - Train the MLP Classifier on the training set.
   - Make predictions on the test set and evaluate the model's classification performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
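
The label-conversion step for the MLP Classifier can be sketched as follows. The string labels `low`, `medium`, and `high`, the synthetic features from `make_classification`, and the network size are all hypothetical choices for illustration. Note that scikit-learn classifiers generally accept string labels directly, so the explicit `LabelEncoder` is mainly useful when integer codes are needed elsewhere.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic three-class problem standing in for the real dataset
X, y_int = make_classification(n_samples=300, n_features=6, n_informative=4,
                               n_classes=3, random_state=0)
# Hypothetical string labels, as a categorical column might contain
y_str = np.array(["low", "medium", "high"])[y_int]

# Convert string labels to integer codes (classes are sorted alphabetically)
encoder = LabelEncoder()
y = encoder.fit_transform(y_str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling and the classifier in one pipeline; the scaler is fit on training data only
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred, target_names=encoder.classes_))
```

`encoder.inverse_transform(y_pred)` maps the predicted codes back to the original label names.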

In [None]:
features = ['volume_ml', 'mass_g', 'length', 'width',
       'location_ElJahliye', 'location_Hasbaya', 'location_Rachiine',
       'day_time_diff', 'n_days','humidity', 'solar_radiation',
       'air_temperature']

In [None]:
# @title Multi-Layer Perceptron Regressor
# Assuming 'df' is your DataFrame containing the data
# and 'polyphenols_content_mg' is the target variable

# @markdown **Setup:**
X = df[features]
# Continuous labels:
y = "polyphenols_content_mg" # @param ["polyphenols_content_mg", "polyphenols_concentration_mggae/ml", "degree_brix", "ta_av", "mi"]
y = df[y]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the MLP model with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# You can adjust the parameters of MLPRegressor as needed
model = MLPRegressor(hidden_layer_sizes=(500, 1000), max_iter=200, random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred_scaled = model.predict(X_test_scaled)

# Calculate metrics on the scaled predictions
mse_scaled = mean_squared_error(y_test, y_pred_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)
mae_scaled = mean_absolute_error(y_test, y_pred_scaled)
rmse_scaled = np.sqrt(mse_scaled)
explained_variance = explained_variance_score(y_test, y_pred_scaled)
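# MSLE is only defined for non-negative values, so compute it only when that holds
if (y_test >= 0).all() and (y_pred_scaled >= 0).all():
    msle_scaled = mean_squared_log_error(y_test, y_pred_scaled)
else:
    msle_scaled = None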


# Visualization
plt.scatter(y_test, y_pred_scaled)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], lw=2, color='red')  # Identity line
plt.xlabel(f'Actual {y.name}')
plt.ylabel(f'Predicted {y.name}')
plt.title('MLP Regression with Scaling - Actual vs. Predicted')
plt.show()

# Report the results with scaling
print(f'Mean Squared Error with Scaling: {mse_scaled}')
print(f'Mean Absolute Error with Scaling: {mae_scaled}')
print(f'Root Mean Squared Error with Scaling: {rmse_scaled}')
print(f'R-squared with Scaling: {r2_scaled}')
print(f'Explained Variance Score with Scaling: {explained_variance}')

In [None]:
# @title Multi-Layer Perceptron Classifier (Categorical variables)
# @markdown **Setup:**
X = df[features]
# Categorical labels:
y = "polyphenols_category" # @param ["polyphenols_category", "maturity"]
y = df[y]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (optional but can be beneficial for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the MLP model on the standardized features
model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions on the standardized test set
y_pred = model.predict(X_test_scaled)

# Visualize confusion matrix (rows and columns follow the order of model.classes_)
classes = list(model.classes_)
cm = confusion_matrix(y_test, y_pred, labels=classes)

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title(f'Confusion Matrix {y.name}')
plt.colorbar()

# Tick labels are taken from the classes seen during training, so the plot
# also works for targets with more than two categories

tick_marks = range(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)

plt.xlabel(f'Predicted {y.name}')
plt.ylabel(f'True {y.name}')

# Add text annotations
for i in range(len(classes)):
    for j in range(len(classes)):
        plt.text(j, i, str(cm[i, j]), horizontalalignment='center', verticalalignment='center')

plt.show()

# Classification Report
print('Classification Report:')
print(classification_report(y_test, y_pred))

### Affordable AI: Image Classification (NEW)

In real-world scenarios, it's imperative to choose models that balance performance with efficiency, especially when dealing with mobile and edge devices. Lightweight backbones such as MobileNet and the recent MobileViT from Apple are more suitable options when training time and inference time are sensitive parameters.

In this exercise, we will delve into the world of affordable AI by utilizing lightweight, efficient models designed for mobile and edge devices. Our focus will be on MobileViT, a hybrid model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This model is particularly suitable for mobile vision tasks due to its low latency and high performance.
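
To make "lightweight" measurable, two simple quantities are often compared across backbones: the number of parameters and the average inference latency. The sketch below shows one way to compute both for any `nn.Module`. The helpers `count_parameters` and `mean_latency_ms` and the toy network are hypothetical stand-ins introduced here for illustration; in practice you would pass the actual MobileViT or MobileNet model instead.

```python
import time

import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of parameters (trainable and frozen)."""
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def mean_latency_ms(model: nn.Module, example: torch.Tensor, n_runs: int = 10) -> float:
    """Average forward-pass time in milliseconds on the current device."""
    model.eval()
    model(example)  # warm-up run, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(example)
    return (time.perf_counter() - start) / n_runs * 1000.0


# Toy stand-in for a real backbone (assumption, not a MobileViT).
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 3),
)
example = torch.randn(1, 3, 64, 64)

n_params = count_parameters(toy_model)
latency = mean_latency_ms(toy_model, example)
print(f"Parameters: {n_params}, mean latency: {latency:.3f} ms")
```

Latency depends heavily on the hardware and input size, so comparisons are only meaningful when both models are timed on the same device with the same input shape.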

You will fine-tune a pre-trained MobileViT model on the Pomegranate dataset. Find the curated image dataset and corresponding tabular data of the pomegranate below:

- [CSV File](https://drive.google.com/file/d/1LcTAAA4HZjbiiAabU8pyHIKE9qr1LjWn/view?usp=sharing)
- [Data](https://drive.google.com/file/d/1-QRj_YThPDzvUhxVB4Aew647rdl3AGeq/view)


Use the following command to unzip the file quietly:
```
!unzip -q "side1.zip"
```





In [None]:
!unzip -q "side1.zip"

In [None]:
# Load the dataset
df = pd.read_csv('./updated_outlier_pomegranate_complete_cleaned.csv')  # Adjust the path if your CSV file is stored elsewhere
df.info()

#### Install Required Libraries
Ensure you have the timm library installed, which provides access to a variety of pre-trained models, including MobileViT.

In [None]:
!pip install timm -q

In [None]:
#@title import required libraries
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from PIL import Image
from tqdm import tqdm
import timm
from timm.optim import AdamP
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder


#### Load and preprocess the image dataset, and select features


In [None]:
# Combine the one-hot encoded location columns with the continuous features
meta_features = ['no.', 'mass_g', 'volume_ml', 'length', 'width', 'location_ElJahliye', 'location_Hasbaya', 'location_Rachiine',
                 'day_time_diff', 'n_days', 'humidity', 'solar_radiation', 'air_temperature']
metadata_features = df[meta_features].copy()  # Copy so the scaling below does not write into a view of df

# Scale numerical metadata features
# Note: 'no.' is not in this list, so it is passed to the model unscaled
numerical_features = ['mass_g', 'volume_ml', 'length', 'width', 'day_time_diff', 'n_days', 'humidity', 'solar_radiation', 'air_temperature']
scaler = StandardScaler()
metadata_features[numerical_features] = scaler.fit_transform(metadata_features[numerical_features])

# Encode the target variable
label_encoder = LabelEncoder()
df['polyphenols_category'] = label_encoder.fit_transform(df['polyphenols_category'])

Define `CustomDataset` and `Dataloaders` for the data

In [None]:
# Update the CustomDataset to include metadata
class CustomDataset(Dataset):
    def __init__(self, df, image_dir, metadata, transform=None):
        self.df = df
        self.image_dir = image_dir
        self.metadata = metadata
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, f"{int(self.df.iloc[idx, 0])}.jpg")
        img = Image.open(img_path).convert('RGB')
        target = torch.tensor(int(self.df['polyphenols_category'].iloc[idx]))

        # Extract metadata for the corresponding sample
        metadata_sample = self.metadata.iloc[idx, :].values.astype('float32')

        if self.transform:
            img = self.transform(img)

        return img, metadata_sample, target

# Split the data into training and validation sets
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)

# Define image preprocessing: resize, convert to tensor, normalize with ImageNet mean/std (no augmentation is applied)
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Create custom datasets and dataloaders
image_dir = 'Side 1'  # Update with your actual image directory path
train_dataset = CustomDataset(df=train_data, image_dir=image_dir, metadata=metadata_features.loc[train_data.index], transform=transform)
val_dataset = CustomDataset(df=val_data, image_dir=image_dir, metadata=metadata_features.loc[val_data.index], transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)



#### MobileViT

**Why MobileViT?**

![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*IiHPF1KDc5Qd57zyFzwIRw.png)

*Source: https://arxiv.org/pdf/2110.02178v2*

MobileViT stands out by merging the spatial inductive biases of CNNs with the global representation capabilities of ViTs. This hybrid approach allows MobileViT to achieve strong performance with few parameters, making it well suited to resource-constrained environments. For instance, the MobileViT paper reports a top-1 accuracy of 78.4% on the ImageNet-1k dataset with only about 6 million parameters, outperforming both MobileNetV3 (CNN-based) and DeiT (ViT-based) models with a similar number of parameters.

### Utilizing Features as Metadata Input for MobileViT

Incorporating metadata features into the MobileViT model can improve its performance and adaptability in image classification tasks. Metadata, which includes additional information such as image context, capture conditions, or external sensor data, can provide context that the raw image data alone may not capture.

#### Benefits of Using Metadata with MobileViT

1. **Enhanced Contextual Understanding**:
   Metadata can provide context that helps the model better understand the image. For instance, knowing the lighting conditions or the type of device used to capture the image can help the model adjust its processing accordingly.

2. **Improved Accuracy**:
   By integrating metadata, the model can make more informed predictions. For example, in medical imaging, patient information can be crucial for accurate diagnosis. Similarly, in environmental monitoring, sensor data like temperature and humidity can improve the interpretation of visual data.

3. **Reduced Ambiguity**:
   Metadata can help reduce ambiguities in image data. For example, two images that look similar might be differentiated by their metadata, such as geographic location or timestamp.

4. **Efficient Resource Utilization**:
   Metadata is cheap to process compared with images. A small fully connected branch adds little computation, so useful context can be added without moving to a larger image backbone.

#### How to Integrate Metadata into MobileViT

To integrate metadata with MobileViT, the metadata can be encoded into a format compatible with the model, such as numerical vectors. These vectors can then be concatenated with the feature maps extracted from the image data before being passed through the classification layers. This approach ensures that the model considers both the visual features and the contextual metadata during the inference process.

Incorporating metadata in this manner can make image classification more robust and context-aware.
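
The concatenation idea described above can be illustrated independently of MobileViT. The following is a minimal sketch, assuming a tiny convolutional image branch in place of the pretrained backbone; the class name `TinyFusionNet` and the dimensions `image_dim` and `meta_dim` are hypothetical and chosen only to keep the example small.

```python
import torch
import torch.nn as nn


class TinyFusionNet(nn.Module):
    """Late fusion: pooled image features and encoded metadata are concatenated."""

    def __init__(self, num_metadata_features, num_classes, image_dim=16, meta_dim=8):
        super().__init__()
        # Stand-in for a pretrained backbone (assumption): conv -> global pooling -> vector
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, image_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Small MLP that encodes the numerical metadata vector
        self.meta_branch = nn.Sequential(nn.Linear(num_metadata_features, meta_dim), nn.ReLU())
        # The classification head sees both representations side by side
        self.head = nn.Linear(image_dim + meta_dim, num_classes)

    def forward(self, image, meta):
        fused = torch.cat([self.image_branch(image), self.meta_branch(meta)], dim=1)
        return self.head(fused)


model = TinyFusionNet(num_metadata_features=12, num_classes=3)
images = torch.randn(4, 3, 64, 64)   # batch of 4 RGB images
metadata = torch.randn(4, 12)        # batch of 4 metadata vectors
logits = model(images, metadata)
print(logits.shape)
```

The only requirement for the fusion step is that both branches produce one vector per sample, so that `torch.cat(..., dim=1)` yields a single feature vector whose length matches the input size of the head.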



In [None]:
# Define the MobileViT model with metadata for classification
class MobileViTClassificationWithMetadata(nn.Module):
    def __init__(self, num_metadata_features, num_classes, freeze_backbone=True):
        super(MobileViTClassificationWithMetadata, self).__init__()
        mobilevit = timm.create_model('mobilevit_s.cvnets_in1k', pretrained=True)

        # Freeze the backbone layers
        if freeze_backbone:
            for param in mobilevit.parameters():
                param.requires_grad = False

        self.feature_extractor = nn.Sequential(*list(mobilevit.children())[:-1])  # Remove the last fully connected layer
        self.meta_model = nn.Sequential(
            nn.Linear(num_metadata_features, 512),
            nn.ReLU(),
            nn.Linear(512, 1000),
            nn.ReLU()
        )
        self.head = nn.Linear(640 + 1000, num_classes)  # Adjust the input size to match the concatenated feature dimensions

    def forward(self, x, meta):
        x = self.feature_extractor(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)  # Ensure the feature map is 1x1
        x = torch.flatten(x, 1)  # Flatten the feature map

        meta = self.meta_model(meta)

        z = torch.cat([x, meta], 1)

        # pass to head
        return self.head(z)

# In case you want to try another lightweight CNN such as MobileNetV2
class MobileNetV2ClassificationWithMetadata(nn.Module):
    def __init__(self, num_metadata_features, num_classes, freeze_backbone=False):
        super(MobileNetV2ClassificationWithMetadata, self).__init__()
        # Note: newer torchvision releases prefer the `weights=` argument over `pretrained=True`
        mobilenetv2 = models.mobilenet_v2(pretrained=True)
        # Freeze the backbone layers
        if freeze_backbone:
            for param in mobilenetv2.parameters():
                param.requires_grad = False

        self.features = mobilenetv2.features  # Convolutional backbone without the classifier
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.meta_model = nn.Sequential(
            nn.Linear(num_metadata_features, 512),
            nn.ReLU(),  # Non-linearity between the layers, as in the MobileViT variant
            nn.Linear(512, 1000),
            nn.ReLU()
        )
        self.head = nn.Linear(1280 + 1000, num_classes)  # 1280 image features + 1000 metadata features

    def forward(self, x, meta):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)

        meta = self.meta_model(meta)

        z = torch.cat([x, meta], 1)

        # pass to head
        return self.head(z)

# Instantiate the model for three classes (low, moderate, high)
num_metadata_features = metadata_features.shape[1]
num_classes = 3
model = MobileViTClassificationWithMetadata(num_metadata_features, num_classes)
num_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamP(model.parameters(), lr=0.0001, betas=(0.9, 0.999), weight_decay=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=0.00001)

# Training loop
model.to(device)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for inputs, metadata, targets in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs, metadata)

        # Debugging output
        if outputs is None:
            print(f'Outputs are None for batch {inputs.size(0)}')

        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, metadata, targets in tqdm(val_loader, desc=f'Validation'):
            inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)
            outputs = model(inputs, metadata)

            # Debugging output
            if outputs is None:
                print(f'Outputs are None for validation batch {inputs.size(0)}')

            loss = criterion(outputs, targets)
            val_loss += loss.item()

            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    accuracy = correct / total
    train_loss /= len(train_loader)
    val_loss /= len(val_loader)
    print(f'Epoch [{epoch + 1}/{num_epochs}], Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}, Accuracy: {accuracy:.4f}')

    # Step the scheduler
    scheduler.step()

# Save the model (the file name records the classification target)
torch.save(model.state_dict(), 'polyphenols_category_classification_model_with_metadata_mobilevit.pth')


#### Continuous variables

We can also recast the task as a regression problem, similar to our MLP regressor. The cell below exposes the candidate input features as notebook parameters and selects a continuous target.
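
Moving from classification to regression changes only the output layer and the loss, while the fused image and metadata features stay the same. The sketch below contrasts the two setups on random tensors; the feature size of 32 and the target values are illustrative assumptions, not values from the dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fused = torch.randn(4, 32)  # stand-in for concatenated image + metadata features (assumption)

# Classification: one logit per class, integer class indices as targets
cls_head = nn.Linear(32, 3)
cls_targets = torch.tensor([0, 2, 1, 0])
cls_loss = nn.CrossEntropyLoss()(cls_head(fused), cls_targets)

# Regression: a single output per sample, float targets
reg_head = nn.Linear(32, 1)
reg_targets = torch.tensor([1.5, 0.2, 3.1, 2.0])
reg_preds = reg_head(fused).squeeze(1)  # (batch, 1) -> (batch,) to match the targets
reg_loss = nn.MSELoss()(reg_preds, reg_targets)

print(f"Cross-entropy: {cls_loss.item():.4f}, MSE: {reg_loss.item():.4f}")
```

Squeezing only dimension 1 keeps the batch dimension intact even when a batch contains a single sample, which avoids shape mismatches between predictions and targets.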

In [None]:
# @markdown **Features:**
root_directory="./"
volume_ml = False  # @param {type:"boolean"}
mass_g = True  # @param {type:"boolean"}
length = True  # @param {type:"boolean"}
width = True  # @param {type:"boolean"}
location_ElJahliye = True  # @param {type:"boolean"}
location_Hasbaya = True  # @param {type:"boolean"}
location_Rachiine = True  # @param {type:"boolean"}
days_since_harvesting = False  # @param {type:"boolean"}
harvesting_mapping = False # @param {type:"boolean"}
humidity = True  # @param {type:"boolean"}
solar_radiation = True  # @param {type:"boolean"}
air_temperature = True  # @param {type:"boolean"}
precipitation = False  # @param {type:"boolean"}
dilution = False  # @param {type:"boolean"}

features = []

# Dictionary containing feature names and their corresponding boolean values
boolean_values = {
    'volume_ml': volume_ml,
    'mass_g': mass_g,
    'length': length,
    'width': width,
    'location_ElJahliye': location_ElJahliye,
    'location_Hasbaya': location_Hasbaya,
    'location_Rachiine': location_Rachiine,
    'normalized_days_since_harvesting': days_since_harvesting,
    'harvesting_mapping': harvesting_mapping,
    'humidity': humidity,
    'solar_radiation': solar_radiation,
    'air_temperature': air_temperature,
    'precipitation': precipitation,
    'dilution': dilution,
}

# Loop through the dictionary and append feature names to the features list if the boolean value is True
for feature, value in boolean_values.items():
    if value:
        features.append(feature)

print("Features used:\n",features)

# @markdown **Setup:**
metadata_features = df[features].copy()  # Copy so the scaled values can be assigned without touching df
# Continuous labels:
y_ = "polyphenols_content_mg" # @param ["polyphenols_content_mg", "polyphenols_concentration_mggae/ml", "degree_brix", "ta_av", "mi", "color_intensity"]
y = df[y_]

import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import timm
import timm.optim
import timm.scheduler
import torch
import torch.nn as nn
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
from PIL import Image
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from tqdm import tqdm

# Extract numerical features dynamically
numerical_features = metadata_features.select_dtypes(include=['int', 'float']).columns.tolist()

# Scale numerical metadata features
scaler = StandardScaler()
df_numerical = metadata_features[numerical_features].copy()
df_numerical[numerical_features] = scaler.fit_transform(df_numerical[numerical_features])

metadata_features[numerical_features]= df_numerical[numerical_features]

# Update the CustomDataset to include metadata
class CustomDataset(Dataset):
    def __init__(self, df, image_dir, metadata, transform=None):
        self.df = df
        self.image_dir = image_dir
        self.metadata = metadata
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, f"{int(self.df.iloc[idx, 0])}.jpg")
        img = Image.open(img_path).convert('RGB')
        target = torch.tensor(float(self.df[['no.',y_]].iloc[idx, 1]))

        # Extract metadata for the corresponding sample
        metadata_sample = self.metadata.iloc[idx, :].values.astype('float32')

        if self.transform:
            img = self.transform(img)

        return img, metadata_sample, target


# @markdown Train-test split:
split = 0.2 # @param {type:"number"}

# Optionally stratify the split by location
stratify= "location" # @param ["location", "None"]
if stratify=="None":
  # Plain random split into training and validation sets
  train_data, val_data = train_test_split(df, test_size=split, random_state=42)
else:
  # Stratified split so that each location keeps the same share in both sets
  strat_split = StratifiedShuffleSplit(n_splits=1, test_size=split, random_state=42)
  train_index, test_index = next(strat_split.split(df, df[stratify]))
  train_data, val_data = df.iloc[train_index], df.iloc[test_index]

# Define image preprocessing (resize and tensor conversion; no augmentation or normalization is applied here)
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# Create custom datasets and dataloaders
image_dir = "./Side 1"  # Update with your actual image directory path
train_dataset = CustomDataset(df=train_data, image_dir=image_dir, metadata=metadata_features.loc[train_data.index], transform=transform)
val_dataset = CustomDataset(df=val_data, image_dir=image_dir, metadata=metadata_features.loc[val_data.index], transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)


# Define the MobileViT model with metadata for regression (single continuous output; the class name is kept from the classification cell)
class MobileViTClassificationWithMetadata(nn.Module):
    def __init__(self, num_metadata_features, freeze_backbone=True):
        super(MobileViTClassificationWithMetadata, self).__init__()
        mobilevit = timm.create_model('mobilevit_s.cvnets_in1k', pretrained=True)

        # Freeze the backbone layers
        if freeze_backbone:
            for param in mobilevit.parameters():
                param.requires_grad = False

        self.feature_extractor = nn.Sequential(*list(mobilevit.children())[:-1])  # Remove the last fully connected layer
        self.meta_model = nn.Sequential(
            nn.Linear(num_metadata_features, 512),
            nn.ReLU(),
            nn.Linear(512, 1000),
            nn.ReLU()
        )
        self.head = nn.Linear(640 + 1000, 1)  # Adjust the input size to match the concatenated feature dimensions

    def forward(self, x, meta):
        x = self.feature_extractor(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)  # Ensure the feature map is 1x1
        x = torch.flatten(x, 1)  # Flatten the feature map

        meta = self.meta_model(meta)

        z = torch.cat([x, meta], 1)

        # pass to head
        return self.head(z)

# Instantiate the model
num_metadata_features = metadata_features.shape[1]
model = MobileViTClassificationWithMetadata(num_metadata_features)

# Loss function and optimizer
criterion = nn.MSELoss()
#optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = timm.optim.AdamP(model.parameters(), lr=0.001, weight_decay= 0.00005)

# Training loop
num_epochs = 50
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

best_val_loss = float('inf')  # Initialize with a large value
best_val_r2_score = float('-inf')  # Initialize with negative infinity
best_model_state = None

# Cosine annealing scheduler
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
# Alternative: timm's cosine scheduler with warmup
# scheduler = timm.scheduler.CosineLRScheduler(optimizer, t_initial=num_epochs, warmup_t=5, cycle_decay=0.9, cycle_limit=2, lr_min=0.00001)


from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, explained_variance_score

# Lists to store training and validation loss, R2 score, MSE, MAE, RMSE, and explained variance score
train_losses = []
val_losses = []
train_r2_scores = []
val_r2_scores = []
train_mse = []
val_mse = []
train_mae = []
val_mae = []
train_rmse = []
val_rmse = []
train_ev_scores = []
val_ev_scores = []

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    train_loss_sum = 0.0
    train_predictions = []
    train_targets = []
    for inputs, metadata, targets in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs, metadata)
        preds = outputs.squeeze(1)  # (batch, 1) -> (batch,), also safe for a batch of size 1
        loss = criterion(preds, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        batch_preds = preds.detach().cpu().numpy()
        batch_targets = targets.detach().cpu().numpy()
        train_loss_sum += np.sum((batch_preds - batch_targets) ** 2)

        # Collect predictions and targets for computing R2 score, MSE, MAE, and explained variance score
        train_predictions.extend(batch_preds)
        train_targets.extend(batch_targets)

    train_loss /= len(train_loader)
    train_losses.append(train_loss)
    train_mse.append(train_loss_sum / len(train_loader.dataset))
    train_r2_scores.append(r2_score(train_targets, train_predictions))
    train_mae.append(mean_absolute_error(train_targets, train_predictions))
    train_rmse.append(np.sqrt(train_mse[-1]))
    train_ev_scores.append(explained_variance_score(train_targets, train_predictions))

    # Validation
    model.eval()
    val_loss = 0.0
    val_loss_sum = 0.0
    val_predictions = []
    val_targets = []
    with torch.no_grad():
        for inputs, metadata, targets in tqdm(val_loader, desc=f'Validation'):
            inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)
            outputs = model(inputs, metadata)
            preds = outputs.squeeze(1)  # (batch, 1) -> (batch,), also safe for a batch of size 1
            loss = criterion(preds, targets)
            val_loss += loss.item()
            batch_preds = preds.detach().cpu().numpy()
            batch_targets = targets.detach().cpu().numpy()
            val_loss_sum += np.sum((batch_preds - batch_targets) ** 2)

            # Collect predictions and targets for computing R2 score, MSE, MAE, and explained variance score
            val_predictions.extend(batch_preds)
            val_targets.extend(batch_targets)

    val_loss /= len(val_loader)
    val_losses.append(val_loss)
    val_mse.append(val_loss_sum / len(val_loader.dataset))
    val_r2_score = r2_score(val_targets, val_predictions) # R2 metric
    val_r2_scores.append(val_r2_score)
    val_mae.append(mean_absolute_error(val_targets, val_predictions))
    val_rmse.append(np.sqrt(val_mse[-1]))
    val_ev_scores.append(explained_variance_score(val_targets, val_predictions))

    print(f'Epoch [{epoch + 1}/{num_epochs}], Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}, Training MSE: {train_mse[-1]:.4f}, Validation MSE: {val_mse[-1]:.4f}, Training R2 Score: {train_r2_scores[-1]:.4f}, Validation R2 Score: {val_r2_scores[-1]:.4f}, Training MAE: {train_mae[-1]:.4f}, Validation MAE: {val_mae[-1]:.4f}, Training RMSE: {train_rmse[-1]:.4f}, Validation RMSE: {val_rmse[-1]:.4f}, Training Explained Variance Score: {train_ev_scores[-1]:.4f}, Validation Explained Variance Score: {val_ev_scores[-1]:.4f}')
    # Save the model if the validation R2 score improves
    if val_r2_score > best_val_r2_score:
        best_val_r2_score = val_r2_score
        best_model_state = model.state_dict()
        # Save the best model state
        torch.save(best_model_state, f'{root_directory}/best_model.pth')

    # Step the learning-rate scheduler once per epoch
    scheduler.step()

# Plotting
# Plot training and validation loss curves
fig_loss = plt.figure(figsize=(6, 4))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.tight_layout()
fig_loss.savefig('training_validation_loss.png')

# Plot training and validation R2 scores
fig_r2 = plt.figure(figsize=(6, 4))
plt.plot(train_r2_scores, label='Training R2 Score')
plt.plot(val_r2_scores, label='Validation R2 Score')
plt.xlabel('Epoch')
plt.ylabel('R2 Score')
plt.title('Training and Validation R2 Score')
plt.legend()
plt.tight_layout()
fig_r2.savefig('training_validation_r2.png')

# Plot training and validation MSE
fig_mse = plt.figure(figsize=(6, 4))
plt.plot(train_mse, label='Training MSE')
plt.plot(val_mse, label='Validation MSE')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.title('Training and Validation MSE')
plt.legend()
plt.tight_layout()
fig_mse.savefig('training_validation_mse.png')

# Plot training and validation MAE
fig_mae = plt.figure(figsize=(6, 4))
plt.plot(train_mae, label='Training MAE')
plt.plot(val_mae, label='Validation MAE')
plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error')
plt.title('Training and Validation MAE')
plt.legend()
plt.tight_layout()
fig_mae.savefig('training_validation_mae.png')

# Plot training and validation RMSE
fig_rmse = plt.figure(figsize=(6, 4))
plt.plot(train_rmse, label='Training RMSE')
plt.plot(val_rmse, label='Validation RMSE')
plt.xlabel('Epoch')
plt.ylabel('Root Mean Squared Error')
plt.title('Training and Validation RMSE')
plt.legend()
plt.tight_layout()
fig_rmse.savefig('training_validation_rmse.png')

# Plot training and validation explained variance score
fig_ev = plt.figure(figsize=(6, 4))
plt.plot(train_ev_scores, label='Training Explained Variance Score')
plt.plot(val_ev_scores, label='Validation Explained Variance Score')
plt.xlabel('Epoch')
plt.ylabel('Explained Variance Score')
plt.title('Training and Validation Explained Variance Score')
plt.legend()
plt.tight_layout()
fig_ev.savefig('training_validation_ev.png')

import pickle

# Create a dictionary to store values
metrics_dict = {
    'train_losses': train_losses,
    'val_losses': val_losses,
    'train_r2_scores': train_r2_scores,
    'val_r2_scores': val_r2_scores,
    'train_mse': train_mse,
    'val_mse': val_mse,
    'train_mae': train_mae,
    'val_mae': val_mae,
    'train_rmse': train_rmse,
    'val_rmse': val_rmse,
    'train_ev_scores': train_ev_scores,
    'val_ev_scores': val_ev_scores
}

# Save the dictionary to a file using pickle
with open(f'{root_directory}/metrics_dict.pkl', 'wb') as f:
    pickle.dump(metrics_dict, f)

#### Bonus: Pomegranate Fruit Dataset

With the following code snippets you will be able to train a model on a similar pomegranate fruit dataset. The resulting model can then serve as a pretrained starting point that you fine-tune on our data of interest.

This procedure brings the pretrained weights closer to the pomegranate domain and may improve performance in some scenarios.
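
The pretrain-then-fine-tune procedure can be sketched as follows: save the weights after the first task, load only the backbone weights into a fresh model for the second task, and fine-tune the backbone with a smaller learning rate than the new head. This is a minimal sketch under stated assumptions: `TinyClassifier` is a hypothetical stand-in for the MobileViT model, the training steps themselves are omitted, and the learning rates are illustrative.

```python
import io

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Toy model with a separate backbone and head (stand-in for MobileViT)."""

    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


# Step 1: model for the grading task (G1/G2/G3); training is omitted here.
pretrained = TinyClassifier(num_classes=3)
buffer = io.BytesIO()  # in-memory stand-in for a .pth file on disk
torch.save(pretrained.state_dict(), buffer)
buffer.seek(0)

# Step 2: fresh model for the target task; reuse only the backbone weights.
finetune_model = TinyClassifier(num_classes=3)
state = torch.load(buffer)
backbone_state = {k: v for k, v in state.items() if k.startswith("backbone.")}
result = finetune_model.load_state_dict(backbone_state, strict=False)
print("Left at random initialization:", result.missing_keys)

# Step 3: smaller learning rate for the transferred backbone than for the new head.
optimizer = torch.optim.Adam([
    {"params": finetune_model.backbone.parameters(), "lr": 1e-5},
    {"params": finetune_model.head.parameters(), "lr": 1e-3},
])
```

Loading with `strict=False` lets the head start from a fresh initialization, which is useful when the two tasks have different label sets or when the second task is a regression.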

The dataset is available on Kaggle as [The Pomegranate Fruit dataset](https://www.kaggle.com/datasets/kumararun37/pomegranate-fruit-dataset); you can also download it [here](https://drive.google.com/file/d/1tmwglFG8SZ8U7iuwBsB68YmLwQB9pfwq/view).

Similarly, we will unzip the contents with the command:

```
!unzip "pomegranate_fruit.zip"
```









In [None]:
!unzip "pomegranate_fruit.zip"

The task here is simple and we'll summarize it even more:

There are three quality grades for the pomegranate:

*   `G1`: superior-quality look, fully ripe; attractive red or rose pink; 300-400; 85.39-106.12; smooth, slight superficial defects not affecting the look and quality; 10%.
*   `G2`: good look, fully ripe; attractive red or rose pink, improper coloring may be present; 200-300; 74.72-89.23; slight defects like scar, scratch, scrape, blemish etc. may be allowed, not affecting the look and quality; 10%.
*   `G3`: good look, fully ripe; attractive red or rose pink, improper coloring may be present; 100-200; 64.85-78.15; slight defects like scar, scratch, scrape, blemish etc. may be allowed, not affecting the look and quality; 10%.
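The image folders of this dataset carry the grade in their names (for example `G1_Q1`), and the classifier needs zero-based integer labels. The following minimal sketch shows how such a name can be split into a class label. The function name `parse_folder_name` is hypothetical, and ignoring the `Q1` to `Q4` suffix for the label is an assumption based on the folder list used in the code cells of this section.

```python
def parse_folder_name(folder):
    """Split a folder name such as 'G2_Q3' into (class_label, grade, suffix)."""
    grade, suffix = folder.split("_")
    if not (grade.startswith("G") and grade[1:].isdigit()):
        raise ValueError(f"Unexpected folder name: {folder!r}")
    class_label = int(grade[1:]) - 1  # G1, G2, G3 -> 0, 1, 2
    return class_label, grade, suffix


# The twelve folder names G1_Q1 ... G3_Q4
folders = [f"G{g}_Q{q}" for g in (1, 2, 3) for q in (1, 2, 3, 4)]
labels = {folder: parse_folder_name(folder)[0] for folder in folders}

print(labels["G1_Q4"], labels["G3_Q1"])  # 0 2
```

Validating the name before converting it makes a misnamed folder fail loudly instead of silently producing a wrong label.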

We'll train a classifier on this task and then fine-tune the model on our pomegranate dataset. Can you spot an improvement? Feel free to play with hyperparameters such as the learning rate, `batch_size`, etc.
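To get a feeling for how the learning rate evolves when you change these hyperparameters, the closed-form cosine annealing schedule can be evaluated by hand. This is a sketch, not the notebook's training code: the function name `cosine_annealing_lr` is hypothetical, and the default values (initial rate `1e-4`, minimum `1e-5`, 100 epochs) are taken from the training cell in this section. PyTorch's `CosineAnnealingLR` follows the same curve as long as nothing else modifies the learning rate.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, eta_min=1e-5, t_max=100):
    """Learning rate after `epoch` scheduler steps of cosine annealing."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# The rate starts at base_lr and decays smoothly to eta_min at epoch t_max.
for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_annealing_lr(epoch):.2e}")
```

Halfway through training the rate sits at the midpoint between the initial and the minimum value, so most of the decay happens in the middle epochs rather than at the start or the end.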


In [None]:
#@title Show images

# Define the paths
base_path = '/content/to upload'
folders = ['G1_Q1', 'G1_Q2', 'G1_Q3', 'G1_Q4', 'G2_Q1', 'G2_Q2', 'G2_Q3', 'G2_Q4', 'G3_Q1', 'G3_Q2', 'G3_Q3', 'G3_Q4']

# Prepare a list to hold image paths and labels
data = []

# Iterate through each folder and gather the images and their labels
for folder in folders:
    quality = folder.split('_')[0]  # Extract G1, G2, or G3
    quality_label = int(quality[1]) - 1  # Convert G1, G2, G3 to 0, 1, 2
    folder_path = os.path.join(base_path, folder)
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.jpg'):
            file_path = os.path.join(folder_path, file_name)
            data.append((file_path, quality_label))

# Create a DataFrame
df = pd.DataFrame(data, columns=['image_path', 'label'])

# Update the CustomDataset class
class CustomDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = self.df.iloc[idx]['image_path']
        img = Image.open(img_path).convert('RGB')
        label = torch.tensor(self.df.iloc[idx]['label'])

        if self.transform:
            img = self.transform(img)

        return img, label

# Stratified split into training and validation sets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, val_idx = next(skf.split(df, df['label']))

train_data = df.iloc[train_idx].reset_index(drop=True)
val_data = df.iloc[val_idx].reset_index(drop=True)

# Define preprocessing with ImageNet mean/std normalization (augmentations are commented out for visualization)
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    # transforms.RandomHorizontalFlip(),
    # transforms.RandomVerticalFlip(),
    # transforms.RandomRotation(20),
    # transforms.RandomCrop(224, padding=4),
    # transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Create custom datasets and dataloaders
train_dataset = CustomDataset(df=train_data, transform=transform)
val_dataset = CustomDataset(df=val_data, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Function to plot some samples from the DataLoader
def plot_samples(dataloader, num_samples=8):
    images, labels = next(iter(dataloader))
    images = images[:num_samples]
    labels = labels[:num_samples]

    # Unnormalize the images for visualization
    inv_normalize = transforms.Normalize(
        mean=[-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225],
        std=[1 / 0.229, 1 / 0.224, 1 / 0.225]
    )
    # Clamp to [0, 1] so that imshow does not warn about out-of-range float values
    unnormalized_images = [inv_normalize(img).clamp(0, 1).permute(1, 2, 0).numpy() for img in images]

    plt.figure(figsize=(16, 8))
    for i in range(num_samples):
        ax = plt.subplot(2, num_samples // 2, i + 1)
        plt.imshow(unnormalized_images[i])
        plt.title(f'Label: {labels[i].item()}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()

# Plot samples from the train_loader
plot_samples(train_loader)

In [None]:
#@title Train model
# Define the paths
base_path = '/content/to upload'
folders = ['G1_Q1', 'G1_Q2', 'G1_Q3', 'G1_Q4', 'G2_Q1', 'G2_Q2', 'G2_Q3', 'G2_Q4', 'G3_Q1', 'G3_Q2', 'G3_Q3', 'G3_Q4']

# Prepare a list to hold image paths and labels
data = []

# Iterate through each folder and gather the images and their labels
for folder in folders:
    quality = folder.split('_')[0]  # Extract G1, G2, or G3
    quality_label = int(quality[1]) - 1  # Convert G1, G2, G3 to 0, 1, 2
    folder_path = os.path.join(base_path, folder)
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.jpg'):
            file_path = os.path.join(folder_path, file_name)
            data.append((file_path, quality_label))

# Create a DataFrame
df = pd.DataFrame(data, columns=['image_path', 'label'])

# Update the CustomDataset class
class CustomDataset(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = self.df.iloc[idx]['image_path']
        img = Image.open(img_path).convert('RGB')
        label = torch.tensor(self.df.iloc[idx]['label'])

        if self.transform:
            img = self.transform(img)

        return img, label

# Stratified split into training and validation sets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, val_idx = next(skf.split(df, df['label']))

train_data = df.iloc[train_idx].reset_index(drop=True)
val_data = df.iloc[val_idx].reset_index(drop=True)

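# Training transformations: data augmentation followed by ImageNet mean/std normalization
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(20),
    transforms.RandomCrop(224, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Validation transformations: deterministic (no random augmentation), so that
# validation metrics are comparable across epochs
val_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Create custom datasets and dataloaders
train_dataset = CustomDataset(df=train_data, transform=transform)
val_dataset = CustomDataset(df=val_data, transform=val_transform)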

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Define the MobileViT model for classification
class MobileViTClassification(nn.Module):
    def __init__(self, num_classes, freeze_backbone=True):
        super(MobileViTClassification, self).__init__()
        mobilevit = timm.create_model('mobilevit_s.cvnets_in1k', pretrained=True)

        # Freeze the backbone layers
        if freeze_backbone:
            for param in mobilevit.parameters():
                param.requires_grad = False

        self.feature_extractor = nn.Sequential(*list(mobilevit.children())[:-1])  # Remove the last fully connected layer
        self.head = nn.Linear(mobilevit.num_features, num_classes)

    def forward(self, x):
        x = self.feature_extractor(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)  # Ensure the feature map is 1x1
        x = torch.flatten(x, 1)  # Flatten the feature map
        return self.head(x)

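# Instantiate the model for three classes (G1, G2, G3).
# Note: with the default freeze_backbone=True only the new linear head is trained;
# pass freeze_backbone=False to also update the backbone weights.
num_classes = 3
model = MobileViTClassification(num_classes)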
num_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamP(model.parameters(), lr=0.0001, betas=(0.9, 0.999), weight_decay=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=0.00001)

# Training loop
model.to(device)

# Lists to store training and validation loss and accuracy
train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    correct_train = 0
    total_train = 0
    for inputs, targets in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)

        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        _, predicted_train = outputs.max(1)
        total_train += targets.size(0)
        correct_train += predicted_train.eq(targets).sum().item()

    train_accuracy = correct_train / total_train
    train_losses.append(train_loss / len(train_loader))
    train_accuracies.append(train_accuracy)

    # Validation
    model.eval()
    val_loss = 0.0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for inputs, targets in tqdm(val_loader, desc=f'Validation'):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)

            loss = criterion(outputs, targets)
            val_loss += loss.item()

            _, predicted_val = outputs.max(1)
            total_val += targets.size(0)
            correct_val += predicted_val.eq(targets).sum().item()

    val_accuracy = correct_val / total_val
    val_losses.append(val_loss / len(val_loader))
    val_accuracies.append(val_accuracy)

    print(f'Epoch [{epoch + 1}/{num_epochs}], Training Loss: {train_losses[-1]:.4f}, Validation Loss: {val_losses[-1]:.4f}, Training Accuracy: {train_accuracies[-1]:.4f}, Validation Accuracy: {val_accuracies[-1]:.4f}')

    # Step the scheduler
    scheduler.step()

# Plot the loss and accuracy
def plot_loss_and_accuracy(epochs_range, train_losses, val_losses, train_accuracies, val_accuracies):
    plt.figure(figsize=(12, 4))

    # Plotting loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, train_losses, label='Train')
    plt.plot(epochs_range, val_losses, label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()

    # Plotting accuracy
    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, train_accuracies, label='Train')
    plt.plot(epochs_range, val_accuracies, label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.legend()

    plt.tight_layout()
    plt.show()

# Usage
plot_loss_and_accuracy(range(num_epochs), train_losses, val_losses, train_accuracies, val_accuracies)

# Save the model
torch.save(model.state_dict(), 'quality_classification_model_mobilevit.pth')


In [None]:
#@title Load our previous code snippet in a single cell and run the model
# NOTE: `df` must be the polyphenols DataFrame from the main exercise. The bonus
# cells above reuse the name `df` for the image-path table, so reload the
# original data before running this cell.
# Combine the encoded metadata with continuous features
meta_features = ['no.', 'mass_g', 'volume_ml', 'length', 'width', 'location_ElJahliye', 'location_Hasbaya', 'location_Rachiine',
                 'day_time_diff', 'n_days', 'humidity', 'solar_radiation', 'air_temperature']
metadata_features = df[meta_features].copy()  # Explicit copy, so scaling does not write into a view of df

# Scale numerical metadata features
numerical_features = ['mass_g', 'volume_ml', 'length', 'width', 'day_time_diff', 'n_days', 'humidity', 'solar_radiation', 'air_temperature']
scaler = StandardScaler()
metadata_features[numerical_features] = scaler.fit_transform(metadata_features[numerical_features])

# Encode the target variable
label_encoder = LabelEncoder()
df['polyphenols_category'] = label_encoder.fit_transform(df['polyphenols_category'])

# Update the CustomDataset to include metadata
class CustomDataset(Dataset):
    def __init__(self, df, image_dir, metadata, transform=None):
        self.df = df
        self.image_dir = image_dir
        self.metadata = metadata
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, f"{int(self.df.iloc[idx, 0])}.jpg")
        img = Image.open(img_path).convert('RGB')
        target = torch.tensor(int(self.df['polyphenols_category'].iloc[idx]))

        # Extract metadata for the corresponding sample
        metadata_sample = self.metadata.iloc[idx, :].values.astype('float32')

        if self.transform:
            img = self.transform(img)

        return img, metadata_sample, target

# Split the data into training and validation sets
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)

# Define preprocessing with ImageNet mean/std normalization (no augmentation)
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Create custom datasets and dataloaders
image_dir = 'Side 1'  # Update with your actual image directory path
train_dataset = CustomDataset(df=train_data, image_dir=image_dir, metadata=metadata_features.loc[train_data.index], transform=transform)
val_dataset = CustomDataset(df=val_data, image_dir=image_dir, metadata=metadata_features.loc[val_data.index], transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Define the MobileViT model with metadata for classification
class MobileViTClassificationWithMetadata(nn.Module):
    def __init__(self, num_metadata_features, num_classes, freeze_backbone=True):
        super(MobileViTClassificationWithMetadata, self).__init__()
        mobilevit = timm.create_model('mobilevit_s.cvnets_in1k', pretrained=True)

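        # Freeze the backbone layers
        if freeze_backbone:
            for param in mobilevit.parameters():
                param.requires_grad = False

        self.feature_extractor = nn.Sequential(*list(mobilevit.children())[:-1])  # Remove the classification head

        # Load the backbone weights pre-trained on the quality classification task.
        # The checkpoint was saved from MobileViTClassification, so its backbone keys
        # carry a 'feature_extractor.' prefix; loading it into `mobilevit` with
        # strict=False would match no keys and silently load nothing.
        state_dict = torch.load('quality_classification_model_mobilevit.pth', map_location='cpu')
        prefix = 'feature_extractor.'
        backbone_state = {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}
        self.feature_extractor.load_state_dict(backbone_state)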
        self.meta_model = nn.Sequential(
            nn.Linear(num_metadata_features, 512),
            nn.ReLU(),
            nn.Linear(512, 1000),
            nn.ReLU()
        )
        self.head = nn.Linear(mobilevit.num_features + 1000, num_classes)  # Adjust the input size to match the concatenated feature dimensions

    def forward(self, x, meta):
        x = self.feature_extractor(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)  # Ensure the feature map is 1x1
        x = torch.flatten(x, 1)  # Flatten the feature map

        meta = self.meta_model(meta)

        z = torch.cat([x, meta], 1)

        # Pass to head
        return self.head(z)

# Instantiate the model for three classes (low, moderate, high)
num_metadata_features = metadata_features.shape[1]
num_classes = 3
model = MobileViTClassificationWithMetadata(num_metadata_features, num_classes)
num_epochs = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamP(model.parameters(), lr=0.0001, betas=(0.9, 0.999), weight_decay=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=0.00001)

# Training loop
model.to(device)

train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    correct_train = 0
    total_train = 0
    for inputs, metadata, targets in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs}'):
        inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs, metadata)

        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        _, predicted = outputs.max(1)
        total_train += targets.size(0)
        correct_train += predicted.eq(targets).sum().item()

    train_losses.append(train_loss / len(train_loader))
    train_accuracies.append(correct_train / total_train)

    # Validation
    model.eval()
    val_loss = 0.0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for inputs, metadata, targets in tqdm(val_loader, desc=f'Validation'):
            inputs, metadata, targets = inputs.to(device), metadata.to(device), targets.to(device)
            outputs = model(inputs, metadata)

            loss = criterion(outputs, targets)
            val_loss += loss.item()

            _, predicted = outputs.max(1)
            total_val += targets.size(0)
            correct_val += predicted.eq(targets).sum().item()

    val_losses.append(val_loss / len(val_loader))
    val_accuracies.append(correct_val / total_val)

    print(f'Epoch [{epoch + 1}/{num_epochs}], Training Loss: {train_losses[-1]:.4f}, Validation Loss: {val_losses[-1]:.4f}, Training Accuracy: {train_accuracies[-1]:.4f}, Validation Accuracy: {val_accuracies[-1]:.4f}')

    # Step the scheduler
    scheduler.step()

# Save the model
torch.save(model.state_dict(), 'polyphenols_classification_model_with_metadata_mobilevit.pth')

# Plot the loss and accuracy
def plot_loss_and_accuracy(epochs_range, train_losses, val_losses, train_accuracies, val_accuracies):
    plt.figure(figsize=(12, 4))

    # Plotting loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, train_losses, label='Train')
    plt.plot(epochs_range, val_losses, label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss')
    plt.legend()

    # Plotting accuracy
    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, train_accuracies, label='Train')
    plt.plot(epochs_range, val_accuracies, label='Validation')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.legend()

    plt.tight_layout()
    plt.show()

# Usage
plot_loss_and_accuracy(range(num_epochs), train_losses, val_losses, train_accuracies, val_accuracies)


### Questionnaire
Hope you enjoyed the exercise! 
Before you leave, please fill in our questionnaire (link via the QR code below).
Thank you!

<img src="https://github.com/albarqounilab/SAAI-Summer-School/raw/main/questionnaires/AffAI_question_qr.png" width="200" height="100"/>

`Please remember to treat this notebook and the data confidentially.`