# Task 1 Full Evaluation

## Task Overview

# 1. Data Collection
To build this classifier, a raw labeled dataset of grocery items and their corresponding categories was required.

Data Needed: A collection of common grocery items (e.g., "apples", "bananas") paired with specific supermarket categories.

Category Schema: The project utilizes nine key categories:

Fruit and Veg

Eggs and Dairy

Bakery

Pantry

Snacks

Household

Meat and Fish

Drinks

Frozen Foods

Rationale: These categories were selected because they are standard in supermarket organization and align with existing consumer grocery applications.

Quantity: The current dataset consists of approximately 50 labeled examples to establish the initial logic.

# 2. Data Processing
Raw text data must be converted into a numerical format that a machine learning model can understand.

Cleaning and Tokenization: Using the NLTK library, the item names are tokenized (split into words) and lemmatized (converted to their base form, like "apples" to "apple") to ensure consistency.

Vectorization: A CountVectorizer is used to convert the cleaned text into a matrix of token counts, known as a Bag-of-Words (BoW) representation.

Label Encoding: The text categories (e.g., "Fruit and Veg") are converted into numerical IDs (e.g., category '4') using LabelEncoder to serve as targets for the model.
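The three processing steps can be illustrated with a minimal, library-free sketch. It is not the project's code: the `LEMMAS` lookup table is a hypothetical stand-in for NLTK's lemmatizer, and the helper functions only mimic what CountVectorizer and LabelEncoder produce (alphabetically ordered columns and class IDs). The sample items and labels are invented for illustration.

```python
import re

# Hypothetical stand-in for NLTK's WordNetLemmatizer: a tiny lookup table.
LEMMAS = {"apples": "apple", "bananas": "banana", "rolls": "roll"}

def clean(item_name):
    """Lowercase, split into word tokens, and map each token to its base form."""
    tokens = re.findall(r"[a-z]+", item_name.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens]

def build_vocabulary(documents):
    """Assign each distinct token a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Return the bag-of-words row (token counts) for one tokenized item name."""
    row = [0] * len(vocabulary)
    for tok in doc:
        if tok in vocabulary:
            row[vocabulary[tok]] += 1
    return row

def encode_labels(labels):
    """Map each category name to an integer ID, in alphabetical order."""
    classes = sorted(set(labels))
    ids = {name: idx for idx, name in enumerate(classes)}
    return [ids[name] for name in labels], classes

# Invented sample data, not taken from the project's dataset.
items = ["Apples", "bananas", "Bread Rolls", "oat milk"]
labels = ["Fruit and Veg", "Fruit and Veg", "Bakery", "Eggs and Dairy"]

docs = [clean(name) for name in items]
vocab = build_vocabulary(docs)
X = [count_vector(doc, vocab) for doc in docs]
y, classes = encode_labels(labels)
```

Here `X` is the count matrix fed to the model and `y` holds the integer targets; words outside the vocabulary are simply ignored, which is one reason unseen items are hard to classify with a small dataset.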

# 3. Machine Learning Technique
A Neural Network built with TensorFlow/Keras was chosen for the classification task.

Reason for Choice: While simpler models like Naive Bayes could work, a neural network was selected to allow for more complex feature extraction as the dataset grows. It provides a flexible framework for handling text classification through multi-class categorical cross-entropy.
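To make the loss function concrete, the following sketch computes categorical cross-entropy by hand for a single example. It assumes a softmax output layer (a common pairing with this loss, not confirmed by the project) and uses three hypothetical categories instead of nine to keep the numbers readable.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, probabilities)
        if t > 0
    )

logits = [2.0, 0.5, 0.1]  # hypothetical raw scores for three categories
target = [1, 0, 0]        # the true category is the first one
probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
```

The loss shrinks toward zero as the probability of the correct category approaches 1, and equals `log(number of classes)` when the model guesses uniformly.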

Improvements Considered: To improve accuracy, TF-IDF (Term Frequency-Inverse Document Frequency) was identified as a possible alternative to CountVectorizer. It down-weights words that appear in many item names and up-weights rarer ones, which may help distinguish between items like "milk" and "oat milk".
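The effect of TF-IDF weighting can be sketched without any library. The example below assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as its TfidfVectorizer defaults; the item names are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Compute L2-normalized TF-IDF weights (smoothed IDF) for tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    rows = []
    for doc in documents:
        counts = Counter(doc)
        weights = {tok: counts[tok] * idf[tok] for tok in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

# Invented item names: "milk" is common, "oat" is rarer.
docs = [["milk"], ["oat", "milk"], ["whole", "milk"], ["soy", "milk"], ["oat", "biscuit"]]
rows = tfidf_rows(docs)
oat_milk = rows[1]
```

In the row for "oat milk", the rarer token "oat" receives a larger weight than "milk", whereas plain counts would treat both words equally.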

# 4. Results and Analysis
Description of Results: Initial testing shows the model can predict categories for simple inputs, though it is limited by the small size of the training data (50 examples).

Explanation:

Data Constraints: With only 50 examples, the model may struggle with "unseen" words or variations it hasn't encountered in the training set.

Technique Limitations: The current use of CountVectorizer only counts word occurrences and doesn't account for word importance or context. Implementing n-grams was attempted to help the model recognize multi-word items, though initial results didn't meet expectations, indicating a need for more diverse data.
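As an illustration of what n-gram features add, the sketch below extracts unigrams and bigrams from a tokenized item name. The function name `word_ngrams` is hypothetical; it mirrors the idea behind a word-level n-gram range of 1 to 2, not the project's exact configuration.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with lengths from min_n to max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams(["oat", "milk"])
```

With bigrams enabled, "oat milk" becomes its own feature alongside "oat" and "milk". Each bigram needs several training examples to be useful, which is consistent with the observation that more diverse data is required.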

# Task 2 Full Evaluation


In this task, I developed a Market Basket Analysis (MBA) system designed to recommend complementary grocery items based on the contents of a customer's shopping basket. My approach focused on modular data handling, interpretable statistical modeling, and robust fallback logic to ensure utility even for unique inputs.

# 1. Data Collection
To train a model on realistic shopping behavior, a large-scale transactional dataset was required.

Dataset Used: The UCI "Groceries" dataset, which contains approximately 15,000 anonymized shopping transactions.

Structure: Each transaction represents a single shopping trip, recorded as a set of items purchased together.

Rationale: Unlike the hand-labeled data in Task 1, this public dataset provides the volume and variety needed to extract statistically significant purchasing patterns.

# 2. Data Processing
The raw transaction data must be transformed into a format suitable for association rule mining.

Basket Parsing: The raw CSV data (containing customer IDs and dates) is grouped and cleaned to create a basket_list. Each entry in this list is a sub-list of items belonging to one transaction.
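A minimal sketch of the basket-parsing step is shown below. The column names (`customer_id`, `date`, `item`) and the sample rows are assumptions made for illustration; the real file may use different headers. Rows are grouped by customer and date, and duplicate or untidy item names are cleaned up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; the column names are assumptions about the CSV layout.
RAW_CSV = """customer_id,date,item
1001,2015-01-05,whole milk
1001,2015-01-05,yogurt
1001,2015-02-10,rolls/buns
1002,2015-01-05,whole milk
1002,2015-01-05, Yogurt
1002,2015-01-05,whole milk
"""

def parse_baskets(csv_file):
    """Group rows into one item list per (customer, date) transaction."""
    baskets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        item = row["item"].strip().lower()  # normalize whitespace and case
        key = (row["customer_id"], row["date"])
        if item and item not in baskets[key]:  # skip blanks and duplicates
            baskets[key].append(item)
    return list(baskets.values())

basket_list = parse_baskets(io.StringIO(RAW_CSV))
```

Each entry of `basket_list` is one transaction, ready to be one-hot encoded.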

One-Hot Encoding: Using TransactionEncoder from the mlxtend library, the item lists are converted into a Boolean matrix where columns represent items and rows represent transactions. A "True" value indicates an item's presence in a specific basket.

Filtering: A minimum support threshold (e.g., min_support=0.005) is applied to focus on itemsets that appear frequently enough to provide reliable recommendations while still allowing for some rare item combinations.
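The encoding and filtering steps can be sketched in plain Python as follows. This is an illustration of the idea rather than the mlxtend implementation, and it filters single items only, whereas Apriori applies the same threshold to larger itemsets. The baskets and the threshold of 0.5 are chosen to suit the tiny example.

```python
def one_hot_encode(basket_list):
    """Build a Boolean matrix: one row per basket, one column per item."""
    columns = sorted({item for basket in basket_list for item in basket})
    matrix = [[item in basket for item in columns] for basket in basket_list]
    return columns, matrix

def frequent_items(columns, matrix, min_support):
    """Keep items whose share of baskets is at least min_support."""
    n_baskets = len(matrix)
    supports = {}
    for col_idx, item in enumerate(columns):
        support = sum(row[col_idx] for row in matrix) / n_baskets
        if support >= min_support:
            supports[item] = support
    return supports

# Invented baskets for illustration.
basket_list = [
    ["whole milk", "yogurt"],
    ["whole milk", "rolls/buns"],
    ["whole milk"],
    ["soda"],
]
columns, matrix = one_hot_encode(basket_list)
supports = frequent_items(columns, matrix, min_support=0.5)
```

For scale, a threshold of 0.005 on roughly 15,000 transactions means an itemset must appear in about 75 baskets to be kept.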

# 3. Machine Learning Technique
The project employs the Apriori Algorithm and Association Rule Mining.

Reason for Choice: Apriori is a widely used algorithm for Market Basket Analysis because it prunes the search efficiently: an itemset can only be frequent if all of its subsets are frequent. The frequent itemsets are then turned into "rules" (e.g., {pasta, olive oil} → {canned tomato}), which are evaluated using three key metrics:

Support: How often the itemset appears in the total dataset.

Confidence: How often the recommended item appears in baskets that already contain the initial items.

Lift: The strength of the association (a lift > 1 indicates the items are bought together more often than would be expected if they were purchased independently).
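The three metrics can be worked through on a toy example. The baskets below are invented, and the helper names are hypothetical; the formulas are the standard definitions of support, confidence, and lift.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(basket) for basket in baskets) / len(baskets)

def rule_metrics(baskets, antecedent, consequent):
    """Compute support, confidence, and lift for antecedent -> consequent."""
    both = support(baskets, set(antecedent) | set(consequent))
    ante = support(baskets, antecedent)
    cons = support(baskets, consequent)
    if ante == 0 or cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = both / ante   # P(consequent | antecedent)
    lift = confidence / cons   # ratio to the consequent's baseline frequency
    return {"support": both, "confidence": confidence, "lift": lift}

# Invented baskets for illustration.
baskets = [
    {"pasta", "olive oil", "canned tomato"},
    {"pasta", "olive oil", "canned tomato"},
    {"pasta", "olive oil"},
    {"canned tomato"},
    {"bread"},
]
metrics = rule_metrics(baskets, {"pasta", "olive oil"}, {"canned tomato"})
```

In this toy data the rule has support 2/5, confidence 2/3, and lift of about 1.11, so canned tomato is slightly more likely in baskets that contain pasta and olive oil than in baskets overall.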

Fallback Logic: For live recommendations, a three-step approach is used: exact matches of rules, partial matches for overlapping items, and a frequency-based fallback to ensure the user always receives a suggestion.
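One possible shape for this three-step logic is sketched below. It is an interpretation, not the project's code: "exact match" is assumed to mean a rule whose antecedent equals the basket, "partial match" a rule sharing at least one item with it, and the rule tuples, item counts, and `recommend` signature are all hypothetical.

```python
from collections import Counter

def recommend(basket, rules, item_counts, top_n=3):
    """Suggest items; rules are (antecedent, consequent, confidence) tuples."""
    basket = set(basket)

    def collect(matching_rules):
        # Highest-confidence rules first; skip items already in the basket.
        ranked = sorted(matching_rules, key=lambda rule: rule[2], reverse=True)
        suggestions = []
        for _, consequent, _ in ranked:
            for item in sorted(consequent):
                if item not in basket and item not in suggestions:
                    suggestions.append(item)
        return suggestions[:top_n]

    # Step 1: rules whose antecedent is exactly the basket.
    exact = collect([rule for rule in rules if rule[0] == basket])
    if exact:
        return exact
    # Step 2: rules whose antecedent overlaps the basket.
    partial = collect([rule for rule in rules if rule[0] & basket])
    if partial:
        return partial
    # Step 3: fall back to the most frequent items overall.
    popular = [item for item, _ in item_counts.most_common() if item not in basket]
    return popular[:top_n]

# Invented rules and counts for illustration.
rules = [
    (frozenset({"pasta", "olive oil"}), frozenset({"canned tomato"}), 0.6),
    (frozenset({"yogurt"}), frozenset({"whole milk"}), 0.4),
]
item_counts = Counter({"whole milk": 50, "rolls/buns": 30, "soda": 20})

exact_result = recommend({"pasta", "olive oil"}, rules, item_counts)
partial_result = recommend({"yogurt", "soda"}, rules, item_counts)
fallback_result = recommend({"dragon fruit"}, rules, item_counts)
```

The third call shows why the fallback matters: an item with no matching rule still yields suggestions, drawn from overall popularity.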

# 4. Results and Analysis
Description of Results: The model successfully generated 57 distinct association rules from the dataset. For common items like "whole milk" or "yogurt," the system provides high-confidence suggestions based on thousands of real-world transactions.

Explanation:

Data Influence: The specific recommendations are heavily influenced by the UCI dataset's demographics. For example, "whole milk" appears in many rules because it is the most frequent item in the dataset.

Integration with Task 1: By reusing the structured item names from Task 1, the system ensures that categorized items can be directly mapped to recommendation rules, creating a seamless user experience from item entry to suggestion.

# 5. Future Improvements
To move beyond simple co-occurrence, the system could be enhanced with Collaborative Filtering. This would allow the model to suggest items based on what similar customers bought, rather than just what is currently in the basket, providing a more personalized shopping experience.

# Task 3 Full Evaluation

Building on the classification and recommendation logic established in previous steps, Task 3 introduces an Image Recognition Model using a Convolutional Neural Network (CNN). This model identifies specific grocery items from images and feeds them into the Task 1 classifier for automatic categorization.

# 1. Data Collection
To train a model capable of visual recognition, a diverse set of images was required for each grocery item.

Source: Images were programmatically collected from Wikimedia Commons and Pixabay using a custom scraper.

Items Selected: The model focuses on four common items across two categories:

Fruit: Apples, Bananas

Vegetables: Carrots, Tomatoes

Quantity: Approximately 50 high-quality images were scraped per item to ensure a balanced dataset.

# 2. Data Preprocessing
Raw images vary in size, format, and quality, requiring standardization before they can be used for training.

Cleaning: A dedicated script was used to filter and remove any corrupted or unreadable image files from the dataset.

Resizing & Normalization: All images are resized to a uniform 128x128 pixels. Pixel values are normalized to a range of [0, 1] by dividing by 255, which helps the neural network converge faster during training.

Batching: Data is organized into batches of 32 for efficient processing during the training phase.
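The normalization and batching arithmetic can be checked with a short sketch. It assumes RGB images (three channels) and a dataset of exactly 4 × 50 images; both are illustrative assumptions, since the actual counts are approximate.

```python
import math

IMAGE_SIZE = 128
BATCH_SIZE = 32

def normalize(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0, 1]."""
    return [value / 255 for value in pixels]

def make_batches(samples, batch_size=BATCH_SIZE):
    """Split samples into consecutive batches; the last one may be smaller."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# Assumes RGB input: three values per pixel.
values_per_image = IMAGE_SIZE * IMAGE_SIZE * 3

# Hypothetical dataset size: 4 items x 50 images each (placeholders for images).
samples = list(range(4 * 50))
batches = make_batches(samples)
expected_batches = math.ceil(len(samples) / BATCH_SIZE)

scaled = normalize([0, 128, 255])
```

Under these assumptions each image contributes 49,152 input values, and 200 images form seven batches per epoch: six full batches of 32 and a final batch of 8.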

# 3. Machine Learning Technique
A Convolutional Neural Network (CNN) was designed specifically for this multi-class image classification task.

Model Architecture: The model consists of several layers designed to extract visual features:

Convolutional Layers: Three layers with increasing filters (32, 64, and 128) to detect shapes and patterns.

MaxPooling Layers: Used after each convolution to reduce spatial dimensions and focus on the most important features.

Dropout Layer: A 50% dropout rate is applied before the final layer to reduce overfitting and help the model generalize to new, unseen images.
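To show how the convolution and pooling layers shrink the image, the sketch below traces the feature-map size through the three stages. The 3×3 kernels, stride of 1, 2×2 pooling, and the padding mode are assumptions, as the report does not state them.

```python
def conv_output(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool

def trace_shapes(input_size=128, filters=(32, 64, 128), padding="valid"):
    """Return (height, width, channels) after each conv + pool stage."""
    shapes = []
    size = input_size
    for n_filters in filters:
        size = pool_output(conv_output(size, padding=padding))
        shapes.append((size, size, n_filters))
    return shapes

shapes = trace_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels  # inputs to the dropout/dense layers
```

With unpadded 3×3 convolutions, a 128×128 input becomes 63×63×32, then 30×30×64, then 14×14×128, giving 25,088 values after flattening. With "same" padding the sizes would be 64, 32, and 16 instead.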

Reason for Choice: CNNs are a well-established choice for image recognition because they automatically learn spatial hierarchies of features (edges, then shapes, then object parts), making them more effective than fully connected neural networks for visual data.

# 4. Results and Analysis
Performance: The model demonstrates high accuracy in identifying "Apples" and "Bananas" on simple backgrounds.

Integration with Task 1: When an image (e.g., an apple) is correctly identified, the string "apple" is passed to the Task 1 model, which then assigns it to the "Fruit and Veg" category.

Challenges:

Color Bias: Initial testing showed the model might rely too heavily on color; for instance, it once misclassified a multicolored apple as a banana.

Complex Backgrounds: Items photographed in messy environments or in varied lighting conditions are harder for the current model to identify accurately.

# 5. Future Improvements
Data Augmentation: Implementing techniques like rotation, flipping, and zooming on existing images would help the model recognize items from different angles and at different scales; brightness or contrast adjustments could similarly help with varied lighting.
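The idea behind augmentation can be shown on a tiny grid of pixel values. This is a deterministic, library-free illustration with hypothetical helper names; a real pipeline would typically apply small random rotations, flips, and zooms to full images during training.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]

# A 2x2 "image" with invented pixel values.
tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
```

Each source image yields several training examples that share one label, which effectively enlarges a dataset of about 50 images per item without further scraping.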

Transfer Learning: Utilizing pre-trained models like ResNet or EfficientNet could significantly boost accuracy by leveraging knowledge from millions of existing images.