# Aldi crawler

The Aldi server seems to require the client to make a handshake with it. Without any headers, the response status code is 403 Forbidden.
After the first request, repeat the request with the headers set and the second request will succeed. Using a `requests.Session` makes this easier, because the session keeps the headers and any cookies between requests.

In [1]:
import bs4, requests, json

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
sess = requests.Session()
sess.headers.update({'User-Agent': user_agent})
sess.get('https://groceries.aldi.ie/en-GB/Search?keywords=peanut%20butter')

response = sess.get('https://groceries.aldi.ie/en-GB/Search?keywords=peanut%20butter')
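
The two-step request above can be wrapped in a small helper that also checks the final status code. This is only a minimal sketch: `fetch_after_warmup` is a hypothetical name, and it assumes nothing more than a session object with a `get` method whose response exposes `status_code`, as `requests.Session` does.

```python
def fetch_after_warmup(session, url):
    """Request `url` twice on the same session and return the second response.

    Hypothetical helper: the first response is discarded because the notes
    above observe that the first request may be rejected with 403.
    """
    session.get(url)             # warm-up request; its result is ignored
    response = session.get(url)  # second request on the same session
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status {response.status_code} for {url}")
    return response
```

With the `sess` object from the cell above, this would be called as `fetch_after_warmup(sess, 'https://groceries.aldi.ie/en-GB/Search?keywords=peanut%20butter')`.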

## parse the soup

`soup.select(css_selector)` narrows the document down to the target elements. Selecting by the `id` attribute is preferable when one is available.

`select` returns a `ResultSet` collection.
Iterate through the items; each node is a `Tag` object, and `item['target_attribute']` returns the value of that attribute.

`json.loads()` then parses the JSON string into a dictionary.

In [2]:
soup = bs4.BeautifulSoup(response.content, 'html.parser')
items = soup.select('div[data-oc-controller="Product.SearchSummary"]')
if items:
    item = items[0]  # first item
    raw_json = item['data-context']
    data = json.loads(raw_json)
    
    # iterate through the data nodes, create the object item

In [3]:
attribute_dict = json.loads(items[0]['data-context'])
products_list = attribute_dict['SearchResults']
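
One possible way to turn each data node into an object is a small dataclass. This is a sketch under assumptions: the `Product` class, its field names, and `to_product` are hypothetical, and only four of the keys shown in the notebook output are kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    """Hypothetical container for a few fields of one search result."""
    name: str
    list_price: Optional[float]
    display_price: Optional[str]
    url: Optional[str]


def to_product(node: dict) -> Product:
    """Build a Product from one entry of `SearchResults`."""
    return Product(
        name=node["FullDisplayName"],            # treated as required here
        list_price=node.get("ListPrice"),        # None if the key is missing
        display_price=node.get("DisplayPrice"),
        url=node.get("Url"),
    )


# Sample entry copied from the notebook output, trimmed to four keys
sample = {
    "FullDisplayName": "Smooth Peanut Butter 340g Grandessa",
    "ListPrice": 1.19,
    "DisplayPrice": "£1.19",
    "Url": "/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261",
}
product = to_product(sample)
print(product.name, product.list_price)
```

In the notebook itself, `[to_product(node) for node in products_list]` would convert the whole result list.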

In [11]:
print(products_list[0]['FullDisplayName'])
print(products_list[0]['ListPrice'])
print(products_list[0]['DisplayPrice'])

Smooth Peanut Butter 340g Grandessa
1.19
£1.19


{'FullDisplayName': 'Smooth Peanut Butter 340g Grandessa',
 'DefinitionName': 'Food',
 'SearchTerm': None,
 'HasPriceRange': False,
 'PriceListId': 'StoreLevelPricesExcVAT',
 'ListPrice': 1.19,
 'Price': None,
 'ItemFormat': 1.0,
 'IsSeasonalProductAvailableForCollect': False,
 'QtyMaxReachMessage': None,
 'RemoveOnlyEditedProductMessage': None,
 'ProductId': '4088600294261',
 'VariantId': None,
 'HasVariants': False,
 'Sku': '4088600294261',
 'DisplayName': 'Smooth Peanut Butter 340g Grandessa',
 'Brand': None,
 'BrandId': None,
 'Description': None,
 'Url': '/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261',
 'ImageUrl': 'https://aldprdproductimages.azureedge.net/media/$Aldi_IE/11.05.22 Drinks and Jams/4088600294261_0.jpg',
 'FallbackImageUrl': 'https://aldprdproductimages.azureedge.net/media/image_not_found.jpg',
 'IsAvailableToSell': True,
 'CategoryId': 'C5304',
 'IsRecurringOrderEligible': False,
 'RecurringOrderProgramName': None,
 'DisplayPrice': '£1.19',
 'DisplaySpe

The search results may span multiple pages. Aldi displays only 40 products by default.

In [5]:
print(f"The search page found {len(products_list)} products (at least)")

The search page found 40 products (at least)
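
If the remaining pages can be requested, the results could be gathered with a loop like the one below. This is a sketch under stated assumptions: `fetch_page` is a hypothetical callable that you supply, taking a 1-based page number and returning that page's list of products. How a specific page is requested from Aldi is not covered here. The default `page_size` of 40 comes from the observation above, and `max_pages` is an arbitrary safety cap.

```python
def collect_all_pages(fetch_page, page_size=40, max_pages=50):
    """Call `fetch_page(n)` for n = 1, 2, ... until a short page is returned.

    A page with fewer than `page_size` products is treated as the last one.
    """
    products = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        products.extend(batch)
        if len(batch) < page_size:
            break  # short (or empty) page: no more results expected
    return products
```

The stopping rule costs one extra request when the total is an exact multiple of the page size, since the loop only ends once it sees an empty page.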


## product information

We get the product information as a dictionary, so we can create a DataFrame from it.

In [6]:
import pandas as pd
item_df = pd.DataFrame(products_list[0], index=[0])
item_df.columns

Index(['FullDisplayName', 'DefinitionName', 'SearchTerm', 'HasPriceRange',
       'PriceListId', 'ListPrice', 'Price', 'ItemFormat',
       'IsSeasonalProductAvailableForCollect', 'QtyMaxReachMessage',
       'RemoveOnlyEditedProductMessage', 'ProductId', 'VariantId',
       'HasVariants', 'Sku', 'DisplayName', 'Brand', 'BrandId', 'Description',
       'Url', 'ImageUrl', 'FallbackImageUrl', 'IsAvailableToSell',
       'CategoryId', 'IsRecurringOrderEligible', 'RecurringOrderProgramName',
       'DisplayPrice', 'DisplaySpecialPrice', 'IsOnSale', 'SellingMethod',
       'UnitOfMeasure', 'IsUnit', 'IsUnitMeasure', 'IsApproxUnit',
       'JsonContext', 'DefaultListPrice', 'CurrentListPrice', 'SizeVolume',
       'UnitPriceDeclaration', 'UnitPrice', 'HasUnitPrice', 'ImageRibbonText',
       'ImageBannerText', 'ImageBadges', 'ImageBadges_Facet',
       'AssortmentState', 'ImageRibbonColour', 'ImageBannerColour'],
      dtype='object')
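
The same idea extends to the whole result list: a list of dictionaries can be passed to `pd.DataFrame` directly, and the `columns` argument keeps only the fields of interest. The sketch below uses a tiny stand-in list instead of the real `products_list`; the second row is a placeholder for illustration, not real Aldi data.

```python
import pandas as pd

# Stand-in for `products_list`; the second row is a made-up placeholder
sample_products = [
    {
        "FullDisplayName": "Smooth Peanut Butter 340g Grandessa",
        "DefinitionName": "Food",
        "ListPrice": 1.19,
        "DisplayPrice": "£1.19",
        "Url": "/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261",
    },
    {
        "FullDisplayName": "Placeholder Product",
        "DefinitionName": "Food",
        "ListPrice": 2.0,
        "DisplayPrice": "£2.00",
        "Url": "/en-GB/p-placeholder/0000000000000",
    },
]

# Keep only these keys; the others (here `DefinitionName`) are dropped
columns = ["FullDisplayName", "ListPrice", "DisplayPrice", "Url"]
products_df = pd.DataFrame(sample_products, columns=columns)
print(products_df.shape)
```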

## Get the product page URL

The product page URL links to the detailed product page, which contains the nutrition facts, item categories, etc.

In [7]:
products_list[0]['Url']

'/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261'
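
The value is a relative path, so it has to be joined with the site's base URL before it can be requested. A minimal sketch using the standard library follows; it assumes that product pages live on the same host as the search page, and `BASE_URL` and `absolute_product_url` are names introduced here for illustration.

```python
from urllib.parse import urljoin

# Assumption: product pages are served from the same host as the search URL
BASE_URL = "https://groceries.aldi.ie"


def absolute_product_url(relative_url: str) -> str:
    """Join the relative `Url` field with the base URL."""
    return urljoin(BASE_URL, relative_url)


full_url = absolute_product_url(
    "/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261"
)
print(full_url)
# prints https://groceries.aldi.ie/en-GB/p-smooth-peanut-butter-340g-grandessa/4088600294261
```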