![logo](https://lh3.googleusercontent.com/drive-viewer/AK7aPaD3ufeMCQTD1-doWtQSHK8snQjgqYdAscSL7mNuLmhVVAoDmdbuV5Z1eG_j5-vT4N64bUvfOrHjuw-3WrS532FsUSu9=s1600)

# Data Science Assessment

## Theoretical Part

### Question 1
#### You are given thousands of images like the one in the link below:
#### https://drive.google.com/file/d/1Q7ri0UcGmtsfYiAJ1hb8fI0bLdmsYMNi/view?usp=sharing
#### Describe the method you would use for getting key information from each product image.

To extract key information from product images, we could use the following pipeline of procedures: 

1 Image Preprocessing:
- Enhance image quality and adjust brightness/contrast if necessary.
- Check for alternative angles for the product, in order to process all images of the same product as a group.
    
2 Object Detection: 
- Use pre-trained object detection models (such as YOLO) to identify and locate objects within the image.
- Identify the main (target) object and ignore text boxes, banners and additional marketing elements.

3 Image Segmentation:
- Employ deep learning segmentation and/or edge detection techniques to divide the image into meaningful regions.
- Identify the product's boundaries in order to detect and record the packaging type (bottle, box, etc.).

4 Feature Extraction:
- Use pattern matching or train a classification model to identify specific markers (e.g. US GROWN) on the packaging.
- Use another model to identify markers that indicate applicable discounts.

5 Text Extraction:
- Use optical character recognition (OCR) to extract text information from the product image, such as product names, descriptions and other labels.
- Additionally detect information such as volume, weight and nutritional information.

6 Classification:
- Train a classification model to categorize products (e.g. beverage, rice, pasta, etc).
  
7 Quality Control:
- Implement checks to ensure the accuracy of the information extracted.
- Address false positives/negatives and refine the model if necessary.

8 Integration with database: 
- Store the extracted information in a structured format and integrate it into a database.
  
9 Maintenance:
- Update and retrain the models with new data to adapt to changes in the product images.
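
As an illustration of step 5, the sketch below shows one way to pull volume or weight figures out of text that an OCR engine has already returned. The function name `extract_quantities`, the unit list, and the sample string are assumptions made for this example; the OCR call itself is left out.

```python
import re

# Units assumed to appear on grocery packaging (illustrative, not exhaustive).
# Longer alternatives come first so "fl oz" and "ml" win over "oz" and "l".
_UNIT_PATTERN = r"fl\.?\s?oz|oz|ml|l|kg|g|lb"

_QUANTITY_RE = re.compile(
    rf"(\d+(?:\.\d+)?)\s*({_UNIT_PATTERN})\b",
    flags=re.IGNORECASE,
)


def extract_quantities(ocr_text: str) -> list[tuple[float, str]]:
    """Return (value, unit) pairs found in OCR output, with units lower-cased."""
    results = []
    for value, unit in _QUANTITY_RE.findall(ocr_text):
        # Drop dots and spaces so "FL OZ" and "fl. oz" both become "floz".
        normalized_unit = re.sub(r"[.\s]", "", unit.lower())
        results.append((float(value), normalized_unit))
    return results


# Hypothetical OCR output for a juice bottle label.
sample = "Ocean Spray Cranberry 64 FL OZ (1.89 L)"
print(extract_quantities(sample))
```

In practice the extracted pairs would still go through the quality-control step, since OCR noise (for example `O` read as `0`) can produce plausible but wrong numbers.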


### Question 2
#### One of our customers is interested in monitoring the grocery section of a direct competitor, Target, in the state of New York.
#### https://www.target.com/
#### Target is a huge website with tons of data publicly available. Suggest three valuable insights that we can be capturing by scraping the website on a daily basis.

1 Product pricing and promotions:
- Track daily fluctuations in product prices as continuous time series and identify any promotions, discounts, or special offers.

2 Product availability and stock levels:
- Track the availability and stock levels of popular grocery items over time.
- Identify out-of-stock or low-stock situations for specific products.

3 New product launches and featured products:
- Identify new items on the website.
- Identify featured products or those being heavily promoted on the website.
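
A minimal sketch of how two consecutive daily scrapes could be compared to surface price changes (insight 1) and newly listed items (insight 3). It assumes each scrape has been reduced to a dictionary mapping a product identifier to its price; the function names and sample data are hypothetical.

```python
from decimal import Decimal


def price_changes(
    yesterday: dict[str, Decimal], today: dict[str, Decimal]
) -> dict[str, Decimal]:
    """Map product id -> price delta for products whose price changed."""
    return {
        pid: today[pid] - yesterday[pid]
        for pid in yesterday.keys() & today.keys()  # products seen on both days
        if today[pid] != yesterday[pid]
    }


def new_products(
    yesterday: dict[str, Decimal], today: dict[str, Decimal]
) -> set[str]:
    """Return ids that appear today but were absent yesterday."""
    return today.keys() - yesterday.keys()


# Hypothetical snapshots keyed by product id.
day1 = {"a": Decimal("3.39"), "b": Decimal("2.00")}
day2 = {"a": Decimal("2.99"), "b": Decimal("2.00"), "c": Decimal("5.00")}

print(price_changes(day1, day2))  # negative delta = price drop
print(new_products(day1, day2))
```

Prices are held as `Decimal` rather than `float` so that deltas are exact to the cent.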

## Coding Part

### Question 3
#### Here’s an example item from Target:
#### https://www.target.com/p/ocean-spray-cranberry-juice-cocktail-64-fl-oz-bottle/-/A-12935714#lnk=sametab
#### Design an SQL schema for storing item information when scraping all items from Target website.

In [None]:
CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    name VARCHAR(255),
    brand VARCHAR(100),
    category VARCHAR(100),
    price DECIMAL(10, 2),
    currency VARCHAR(3),
    availability BOOLEAN,
    stock_quantity INT,
    url VARCHAR(255)
);

CREATE TABLE ProductDetails (
    product_id INT PRIMARY KEY,
    description TEXT,
    highlights TEXT,
    contains TEXT,
    features TEXT,
    image_url VARCHAR(255),
    rating FLOAT,
    num_reviews INT,
    on_sale BOOLEAN,
    net_weight FLOAT,
    package_quantity INT,
    TCIN INT,
    UPC VARCHAR(12),  -- text, not INT: 12-digit UPC codes can start with 0 and overflow a 32-bit INT
    origin VARCHAR(255),
    form VARCHAR(255),
    date_scraped TIMESTAMP WITH TIME ZONE,
    FOREIGN KEY (product_id) REFERENCES Product(product_id)
);

In more detail: 
- Product: This table stores basic information about each product.
    - product_id: A unique identifier for each product.
    - name: The name of the product (e.g., Ocean Spray Cranberry Juice Cocktail - 64 fl oz Bottle).
    - brand: The brand of the product (e.g., Ocean Spray).
    - category: The category to which the product belongs (e.g., "Beverages/Juice & Cider").
    - price: The price of the product (e.g., 3.39).
    - currency: The currency in which the price is specified, as a three-letter code (e.g., USD).
    - availability: A boolean indicating whether the product is available (e.g., True).
    - stock_quantity: The quantity of the product in stock (e.g., 10).
    - url: The URL of the product on the Target website.
        
- ProductDetails: This table stores additional details about each product.
    - product_id: A foreign key linking to the product_id in the Product table.
    - description: A text field containing a detailed description of the product.
    - highlights: A text field containing highlights of the product.
    - contains: A text field containing the allergen contents of the product.
    - features: A text field containing the features of the product.
    - on_sale: A boolean indicating whether the product is on sale.
    - net_weight: The net weight of the product in grams.
    - package_quantity: The package quantity of the product.
    - image_url: The URL of the product image.
    - rating: The average rating of the product (e.g., 4.7).
    - num_reviews: The number of reviews for the product (e.g., 481).
    - TCIN: The TCIN identifier (e.g., 12935714).
    - UPC: The UPC identifier (e.g., 031200200075).
    - origin: The origin of the product (e.g., Made in the USA or Imported).
    - form: The form of the product (e.g., Liquid).
    - date_scraped: The timestamp when the information was scraped.


A foreign key is set to align the two tables. SQL supports foreign keys, which permit cross-referencing related data across tables, and foreign key constraints, which help keep the related data consistent, in this case between Product and ProductDetails.
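
To illustrate the effect of the foreign key constraint, here is a small sketch using Python's built-in `sqlite3` module as a stand-in database (the schema above does not name a DBMS, so SQLite is an assumption). Both tables are trimmed to two columns for brevity, and the inserted values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE Product (
        product_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE ProductDetails (
        product_id INTEGER PRIMARY KEY,
        rating REAL,
        FOREIGN KEY (product_id) REFERENCES Product(product_id)
    );
    """
)

conn.execute(
    "INSERT INTO Product VALUES (?, ?)",
    (1, "Ocean Spray Cranberry Juice Cocktail"),
)
conn.execute("INSERT INTO ProductDetails VALUES (?, ?)", (1, 4.7))


def insert_orphan_details(connection: sqlite3.Connection) -> bool:
    """Try to add details for a product that does not exist.

    Returns True if the database rejected the row.
    """
    try:
        connection.execute("INSERT INTO ProductDetails VALUES (?, ?)", (999, 3.0))
    except sqlite3.IntegrityError:
        return True
    return False


orphan_rejected = insert_orphan_details(conn)

# The shared key lets the two tables be joined back into one record.
rows = conn.execute(
    "SELECT p.name, d.rating "
    "FROM Product p JOIN ProductDetails d USING (product_id)"
).fetchall()
print(orphan_rejected, rows)
```

The rejected insert shows the constraint at work: a details row cannot exist without a matching row in Product.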

### Question 4
#### We have scraped two online grocery storefronts, one of Whole Foods and one of Fresh Direct, both located in 10002. You are given the lists of Wine & Beer 
#### items of each storefront in the link below.
#### https://drive.google.com/drive/folders/17W1vjy4Vi0d32ThDlJZTneH4bNAjgFxV?usp=sharing
#### Please create a Python script that reads the two lists shared, processes the data to find common items, and outputs the list of common items to an output file. 
#### The script should be executable from the terminal using the command:  python3 {script_name} {input1} {input2} {output}
#### If your script relies on external libraries or packages, include a requirements.txt file specifying the dependencies.
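
The matching step at the heart of such a script could be sketched as follows. This is an illustrative sketch under stated assumptions, not the contents of the `compare_lists.py` script run below: the helper names, the exact-match-after-normalization rule, and the `name` column used as the default are all hypothetical choices.

```python
import csv
import re


def normalize_name(name: str) -> str:
    """Lower-case a product name and collapse punctuation and whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return cleaned.strip()


def read_names(path: str, column: str = "name") -> list[str]:
    """Read one column from a CSV file (the column name is an assumption)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle) if row.get(column)]


def common_items(names1: list[str], names2: list[str]) -> list[str]:
    """Return names from the first list whose normalized form is in the second."""
    seen = {normalize_name(name) for name in names2}
    matched, emitted = [], set()
    for name in names1:
        key = normalize_name(name)
        if key in seen and key not in emitted:  # skip duplicates within list 1
            matched.append(name)
            emitted.add(key)
    return matched


def write_names(path: str, names: list[str], column: str = "name") -> None:
    """Write the matched names to a one-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        writer.writerows([name] for name in names)
```

Exact matching on normalized names is deliberately strict; listings that differ in pack size or word order (for example a "6pk" suffix) would need fuzzier matching than this sketch provides.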

In [7]:
!python .\compare_lists.py .\fresh_direct.csv .\whole_foods.csv .\output.csv

DEBUG:compare_lists:Original List1:
              category  ...                                           item_url
0  Beer, Non-Alcoholic  ...  https://www.freshdirect.com/supergro/beer/sc/b...
1  Beer, Non-Alcoholic  ...  https://www.freshdirect.com/supergro/beer/sc/b...
2  Beer, Non-Alcoholic  ...  https://www.freshdirect.com/supergro/beer/sc/b...
3  Beer, Non-Alcoholic  ...  https://www.freshdirect.com/supergro/beer/sc/b...
4  Beer, Non-Alcoholic  ...  https://www.freshdirect.com/supergro/beer/sc/b...

[5 rows x 9 columns]

DEBUG:compare_lists:Original List2:
                                           name  ...  serving_size_uom
0                     Modelo Especial, 12 fl oz  ...               NaN
1          Run Wild Non-Alcoholic IPA, 12 fl oz  ...             fl oz
2  Blue Moon Belgian White Ale (6-pk), 12 fl oz  ...             fl oz
3                 Modelo Especial 6pk, 12 fl oz  ...               NaN
4                 Blood Orange Mint 6pk, 355 ml  ...               can

[5 r