<a href="https://colab.research.google.com/github/ayushmorbar/Code-Directory/blob/main/ML_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Download Images and Extract Features
Use the `download_images` function from `utils.py` to fetch the training images into a local directory for further processing.

In [None]:
import pandas as pd
import numpy as np

In [None]:

from src.utils import download_images

# Load the train dataset
train_df = pd.read_csv('/content/dataset/train.csv')

# Get the list of image links
image_links = train_df['image_link'].tolist()

# Download images to a folder named 'images'
download_folder = 'images'
download_images(image_links, download_folder)


100%|██████████| 263859/263859 [26:35<00:00, 165.36it/s]
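After a long download it can help to confirm which files actually arrived. The sketch below assumes `download_images` stores each image under the last path component of its URL, which this notebook does not confirm. `local_path_for` and `missing_downloads` are hypothetical helper names, not part of `utils.py`.

```python
import os
from urllib.parse import urlparse


def local_path_for(link, folder):
    """Return the path where an image is assumed to be saved (URL basename)."""
    filename = os.path.basename(urlparse(link).path)
    return os.path.join(folder, filename)


def missing_downloads(links, folder):
    """List the links whose assumed local file does not exist."""
    return [link for link in links if not os.path.isfile(local_path_for(link, folder))]
```

Calling `missing_downloads(image_links, download_folder)` would then return the links that may need a retry, provided the naming assumption holds.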


# 2. Image Preprocessing and Feature Extraction
Process each image and extract the features needed to predict entity values (weight, volume, and so on). A pre-trained model such as ResNet or VGG, or a custom CNN, can serve as the feature extractor.

In [None]:
from PIL import Image
import os

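def preprocess_image(image_path):
    """Load an image, force 3-channel RGB, and resize it to 224x224."""
    # Grayscale or RGBA files would otherwise not fit a (224, 224, 3) input
    image = Image.open(image_path).convert('RGB')
    image = image.resize((224, 224))  # Example: Resize to 224x224
    return image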

# Preprocess images in the 'images' folder
for img_file in os.listdir(download_folder):
    img_path = os.path.join(download_folder, img_file)
    image = preprocess_image(img_path)
    # Further processing like converting to tensors, etc.
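The "further processing" step usually means turning decoded pixels into a float array the model can consume. The following is a minimal sketch, assuming images have already been resized to a common size. `to_model_input` and `make_batch` are hypothetical names, and scaling to the range [0, 1] is one common choice rather than a requirement.

```python
import numpy as np


def to_model_input(pixels):
    """Convert a decoded image array to float32 RGB in [0, 1]."""
    arr = np.asarray(pixels)
    if arr.ndim == 2:  # grayscale: repeat the single channel three times
        arr = np.stack([arr] * 3, axis=-1)
    elif arr.shape[-1] == 4:  # RGBA: drop the alpha channel
        arr = arr[..., :3]
    return arr.astype(np.float32) / 255.0


def make_batch(images):
    """Stack equally sized images into one (N, H, W, 3) array."""
    return np.stack([to_model_input(img) for img in images])
```

Because `np.asarray` accepts PIL images directly, the output of `preprocess_image` can be passed to `to_model_input` without an extra conversion.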


# 3. Entity Value Extraction
Extract entity values from the images. Because the dataset includes both the entity name and its value, one starting point is to train a deep learning model, as illustrated here, to predict these values from the images.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np

# Example CNN model for extracting features from images
def create_model():
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(1)  # For predicting a single continuous value like weight/volume
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

model = create_model()
model.summary()
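To see what `model.summary()` should report, the parameter counts can be worked out by hand. This sketch assumes the Keras defaults for the layers above (`'valid'` padding, stride 1 for the convolution, and a pooling stride equal to the pool size). The helper names are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


# Conv2D(32, (3, 3)) on a 224x224x3 input with 'valid' padding
conv = conv2d_params(3, 3, 3, 32)
conv_side = 224 - 3 + 1        # 222
pooled_side = conv_side // 2   # 111 after 2x2 max pooling
flat = pooled_side * pooled_side * 32

hidden = dense_params(flat, 128)
output = dense_params(128, 1)
total = conv + hidden + output

print(conv, flat, hidden, output, total)
# prints: 896 394272 50466944 129 50467969
```

Almost all of the roughly 50 million parameters sit in the first `Dense` layer, because `Flatten` hands it 394,272 inputs. Adding more convolution and pooling stages before flattening would likely shrink the model considerably.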


# 4. Handling Data (Entity Names and Values)
Map each entity name in `train.csv` to its allowed units using `entity_unit_map` from the `constants.py` file.

In [None]:
# Map entity names to units from constants.py
from src.constants import entity_unit_map

# For each entity in train data, find its allowed units
def map_entity_to_units(entity_name):
    return entity_unit_map.get(entity_name, None)

train_df['allowed_units'] = train_df['entity_name'].apply(map_entity_to_units)
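Once allowed units are known, a value string can be split into a number and a unit and checked against them. The sketch below assumes values are written as `<number> <unit>`. `EXAMPLE_UNIT_MAP` and its entity names are an illustrative stand-in, not the contents of `constants.py`, and `parse_entity_value` is a hypothetical helper.

```python
# Illustrative subset only; the real mapping lives in src/constants.py.
EXAMPLE_UNIT_MAP = {
    "item_weight": {"gram", "kilogram", "milligram"},
    "item_volume": {"litre", "millilitre"},
}


def parse_entity_value(text, entity_name, unit_map=EXAMPLE_UNIT_MAP):
    """Split '<number> <unit>' and validate the unit for the entity.

    Returns (value, unit), or None if the text cannot be parsed or the
    unit is not allowed for this entity.
    """
    parts = str(text).strip().split(maxsplit=1)
    if len(parts) != 2:
        return None
    number, unit = parts
    try:
        value = float(number)
    except ValueError:
        return None
    if unit not in unit_map.get(entity_name, set()):
        return None
    return value, unit
```

In the notebook, `entity_unit_map` could be passed as `unit_map`, assuming it maps each entity name to a collection of unit strings.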

# 5. Training the Model

Once the images and entity values are available, split the data into training and validation sets and train the model:

In [None]:
from sklearn.model_selection import train_test_split

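# Example: use the numeric part of 'entity_value' as the target.
# Assumptions: download_images saves each file under the basename of its URL,
# and values look like "<number> <unit>". Adjust if either differs.
# Note: this loads every image into memory, so try a subset first.
image_paths = [os.path.join(download_folder, os.path.basename(link)) for link in train_df['image_link']]
X = np.array([np.asarray(preprocess_image(p).convert('RGB'), dtype=np.float32) / 255.0 for p in image_paths])
y = train_df['entity_value'].astype(str).str.split().str[0].astype(float).values

# Split the dataset (X is already a NumPy array, so no further conversion is needed)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)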

# Train the model
model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))
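Holding every image in memory at once is unlikely to scale: a 224x224 RGB image takes 150,528 bytes as `uint8`, so the 263,859 downloaded images would need roughly 40 GB, and four times that as `float32`. One alternative is to load images batch by batch. The generator below is a minimal sketch. `iter_batches` and `load_fn` are hypothetical names, and `load_fn` stands for any callable that maps a path to an `(H, W, 3)` array.

```python
import numpy as np


def iter_batches(paths, targets, batch_size, load_fn):
    """Yield (X, y) batches, loading images only when they are needed."""
    for start in range(0, len(paths), batch_size):
        chunk = paths[start:start + batch_size]
        X = np.stack([load_fn(p) for p in chunk]).astype(np.float32)
        y = np.asarray(targets[start:start + batch_size], dtype=np.float32)
        yield X, y
```

With a Keras model, each yielded pair could be passed to `model.train_on_batch(X_batch, y_batch)` inside a loop over epochs, which keeps only one batch of images in memory at a time.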


# 6. Sanity Check for Output
Finally, use the `sanity.py` file in the `src` directory to run checks on the output file.

In [None]:
from src.sanity import sanity_check

# Assuming the model output has been saved as a CSV file with 'index' and 'prediction'
sanity_check('datasets/test.csv', 'output/predictions.csv')
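Before running the provided checker, a quick independent pre-check can catch obvious formatting problems. The sketch below is not a reimplementation of `sanity.py`. It assumes both files have an `index` column and that the predictions file has a `prediction` column, as the comment above suggests. `precheck_predictions` is a hypothetical helper.

```python
import csv


def precheck_predictions(test_csv, predictions_csv):
    """Return a list of problems found; an empty list means the files line up."""
    with open(test_csv, newline="") as f:
        test_ids = [row["index"] for row in csv.DictReader(f)]
    with open(predictions_csv, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        rows = list(reader)

    problems = []
    for required in ("index", "prediction"):
        if required not in columns:
            problems.append(f"missing column: {required}")
    if problems:
        return problems
    pred_ids = [row["index"] for row in rows]
    if len(pred_ids) != len(set(pred_ids)):
        problems.append("duplicate index values")
    if set(pred_ids) != set(test_ids):
        problems.append("index values differ from the test file")
    return problems
```

An empty list from `precheck_predictions` only means the basic layout matches. The checks in `sanity.py` remain the authority on whether the output is valid.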


In [None]:
# Compute an F1 score on the validation set

from sklearn.metrics import f1_score

# F1 is a classification metric, so the continuous targets and predictions
# must be turned into discrete labels before scoring.
y_true = np.asarray(y_val, dtype=float)  # True values from the validation set (assumed numeric)
y_pred = model.predict(X_val).ravel()  # Flatten (n, 1) predictions to shape (n,)

# Binarize BOTH arrays with the same threshold; passing continuous y_true
# to f1_score raises a ValueError.
# You might need to adjust the threshold based on your specific problem
threshold = 0.5
y_true_binary = (y_true > threshold).astype(int)
y_pred_binary = (y_pred > threshold).astype(int)

# Calculate the F1 score
f1 = f1_score(y_true_binary, y_pred_binary)

print("F1 Score:", f1)
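Thresholding a regression output gives only a rough signal. If the final predictions are strings such as `"500.0 gram"`, an exact-match F1 may be closer to what matters. The sketch below is one possible scoring scheme under that assumption, not a metric confirmed by this notebook. An empty string is taken to mean "no answer", and `exact_match_f1` is a hypothetical name.

```python
def exact_match_f1(truths, predictions):
    """F1 over string outputs, where an empty string means 'no answer'."""
    tp = fp = fn = 0
    for truth, pred in zip(truths, predictions):
        if pred != "" and pred == truth:
            tp += 1  # answered and correct
        elif pred != "":
            fp += 1  # answered but wrong, or nothing was there to find
        elif truth != "":
            fn += 1  # missed an existing value
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this scheme a wrong answer counts against precision while an empty answer counts against recall, so leaving a prediction blank costs less than guessing only when the guess would probably be wrong.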
