<a href="https://colab.research.google.com/github/amaraaabn/Web-Scraping-PomeloFashion/blob/main/Pomelo_Fashion_Web_Scraping_(by_Amara).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import & Prepare Data

I have a list of product IDs for Pomelo Fashion items. Each ID uniquely identifies a product SKU and serves as the key for accessing the landing page. From there, we can scrape the required data.

In [80]:
import pandas as pd
import numpy as np
from google.colab import files

# Upload the CSV file
uploaded = files.upload()

# Get the filename
filename = list(uploaded.keys())[0]

# Read the CSV file into a pandas DataFrame
data = pd.read_csv(filename)

Saving PORTO_dataset_fabric.csv to PORTO_dataset_fabric (1).csv


In [81]:
# Filter & select non-blank data rows
data_used = data[~pd.isna(data['prod_id'])].copy()

# Add a column to store newly scraped data
data_used[['Product Name', 'Image Link', 'Product Link', 'Fabric Composition']] = np.nan
data_used.head()

Unnamed: 0,prod_id,Product Name,Image Link,Product Link,Fabric Composition
0,102055,,,,
1,102059,,,,
2,101754,,,,
3,102050,,,,
4,101939,,,,


In [82]:
# Get data information
data_used.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   prod_id             150 non-null    int64  
 1   Product Name        0 non-null      float64
 2   Image Link          0 non-null      float64
 3   Product Link        0 non-null      float64
 4   Fabric Composition  0 non-null      float64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


# Initial Data Investigation (Identifying Pattern)

Pomelo Fashion has stores in multiple countries and operates an online shop platform at [pomelofashion.com](https://pomelofashion.com).

The website offers access to collections tailored to specific countries through countries & languages codes. However, not every collection is available in each country.

Therefore, it is necessary to identify which website hosts the collection so that its data can be scraped.

In [83]:
# Pomelo Fashion Website URL
pomelo = 'https://pomelofashion.com/'

# Available Country/Language Codes on the Website
countries_code = ['id/id/', 'th/en', 'th/th', 'sg/en', 'us/en', 'au/en', 'my/en', 'ph/en', 'hk/en', 'mo/en', 'global/en']

## Data selection

The `Product Link` is obtained from the constructed website URL.

Before iterating through each `prod_id`, it's essential to understand the website pattern for extracting key information such as `Product Name`, `Image Link`, and `Fabric Composition` from the HTML code.

To uncover this pattern, I examine the HTML code using the first prod_id from the dataset and 'id/id' Country/Language Code.

In [84]:
# Selecting Product ID and Country/Language Code
PID = data_used.at[0, 'prod_id']
country = countries_code[0]

# Constructing Web Page URL
URL = pomelo + country + str(PID) + '.html'
print('Product ID:', PID)
print('Website example:',URL)

Product ID: 102055
Website example: https://pomelofashion.com/id/id/102055.html


By manually extracting data from the webpage, we have gathered information about the product with the Product ID 102055, which includes:

- **Product Name:** Long Sleeve Stripes Knitted Top - Multi Color
- **Image Link:** https://cdn.pomelofashion.com/img/p/5/8/5/2/6/7/585267.jpg
- **Fabric Composition:** Viscose 50%, Poliester 28%, Nilon 22%

\

**) Note: We have extracted only the image address of the first photo of the product.*

## HTML Import

In [85]:
# Extracting html code using beautiful soup
import requests
from bs4 import BeautifulSoup

## Make HTTP request to the webpage
response = requests.get(url)

## Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

## Print the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html lang="id">
 <head>
  <meta charset="utf-8" class="next-head"/>
  <title class="next-head">
   Logo Hardware Stripes Cardigan - Navy/White - Pomelo Fashion
  </title>
  <meta class="next-head" content="Shop the latest fashion at POMELO fashion. Designed and manufactured in house for quality and style. Dresses, tops, knits, jackets, denim, accessories and more!" name="description"/>
  <meta class="next-head" content="pomelofashion://deeplink" property="al:ios:url"/>
  <meta class="next-head" content="pomelofashion://deeplink" property="al:android:url"/>
  <meta class="next-head" content="home" name="branch:deeplink:page"/>
  <meta class="next-head" content="101939" name="branch:deeplink:id"/>
  <meta class="next-head" content="pomelofashion://deeplink" name="branch:deeplink:$deeplink_path"/>
  <meta class="next-head" content="pomelofashion://deeplink" name="branch:deeplink:$ios_deeplink_path"/>
  <meta class="next-head" content="pomelofashion://deeplink" name="branc

## Pattern Recognition

The identified pattern is as follows



>**Product Name:**\
`... "manufacturer":{"@type":"Organization","name":"Pomelo"},"name":"Long Sleeve Stripes Knitted Top - Multi Color","productID":102055, ...`

\

>**Image Link:**\
`..."height":null,"image":"https://cdn.pomelofashion.com/img/p/5/8/5/2/6/7/585267.jpg","inProductGroupWithID":6,"manufacturer":{ ...`

\

>**Fabric Composition:**\
`... "is_new_measurement_added":false}],"materials":{"fabrics":{"fabrics":[{"name":"Viscose","percent":50},{"name":"Poliester","percent":28},{"name":"Nilon","percent":22}],"lining": ...`

From the pattern above, string manipulation techniques will be applied to extract the information.

# Data Retrieval

## Scraping

In [87]:
import re
import json
import requests
from bs4 import BeautifulSoup

df = data_used.copy()


# Website URL & Country/Language Codes
pomelo = 'https://pomelofashion.com/'
countries_code = ['id/id/', 'th/en', 'th/th', 'sg/en', 'us/en', 'au/en', 'my/en', 'ph/en', 'hk/en', 'mo/en', 'global/en']

for i, PID in enumerate(df['prod_id']):
  # Initialize a flag to track if a successful response is obtained
  success_flag = False

  # Try different web column options until a successful response is obtained
  for country in countries_code:
    url = pomelo + country + str(PID) + '.html'
    response = requests.get(url)

    # Check if the server responded positively to the request
    if response.status_code == 200:
      #Check is the product available in the current website url
      if "Product Unavailable" not in response.text:
        success_flag = True
        break
    continue

  # Extracting data
  ## If the server responded positively and the product available in the website
  if success_flag:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    a = str(soup)

    # Find Product Name
    beg = a.find('},"name":"', a.find('"manufacturer":{',0))
    end = a.find('",', beg)
    prod_name = a[(beg+len('},"name":"')):end]

    # Find Image Link
    beg = a.find("https://cdn.pomelofashion.com/img/p",0)
    end = a.find('.jpg"',beg)
    img_link = a[beg:end+4]

    # Fabric Composition
    beg = a.find('"materials":{"fabrics":{"fabrics":[',0)
    end = a.find(']', beg)
    json_txt = a[(beg + len('"materials":{"fabrics":{"fabrics":')):end+1]
    if json_txt != '':
      # Convert JSON text to list of dictionaries
      fabric_dict = json.loads(json_txt)
      # Construct the formatted string
      fabric_comp = ', '.join([f"{item['name']} {item['percent']}%" for item in fabric_dict])
    else:
      fabric_comp = 'None'

  else:
    prod_name, img_link, fabric_comp = 'Product Unavailable', 'Product Unavailable', 'Product Unavailable'

  df['Product Name'][i] = prod_name
  df['Image Link'][i] = img_link
  df['Product Link'][i] = url
  df['Fabric Composition'][i] = fabric_comp


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Product Name'][i] = prod_name
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Image Link'][i] = img_link
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Product Link'][i] = url
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Fabric Composition'][i] = fabric_comp


## Output

In [88]:
# Showing / Printing
df

Unnamed: 0,prod_id,Product Name,Image Link,Product Link,Fabric Composition
0,102055,Long Sleeve Stripes Knitted Top - Multi Color,https://cdn.pomelofashion.com/img/p/5/8/5/2/6/...,https://pomelofashion.com/id/id/102055.html,"Viscose 50%, Poliester 28%, Nilon 22%"
1,102059,Contrast Trim Knitted Top - Cream,https://cdn.pomelofashion.com/img/p/5/8/5/2/7/...,https://pomelofashion.com/id/id/102059.html,"Viscose 50%, Poliester 28%, Nilon 22%"
2,101754,Collar Puffed Sleeve Blouse - White,https://cdn.pomelofashion.com/img/p/5/8/4/4/6/...,https://pomelofashion.com/id/id/101754.html,"Poliester 85%, Rayon 15%"
3,102050,Knitted Round Neck Mini Dress - Black,https://cdn.pomelofashion.com/img/p/5/8/5/3/5/...,https://pomelofashion.com/id/id/102050.html,"Viscose 50%, Poliester 28%, Nilon 22%"
4,101939,Logo Hardware Stripes Cardigan - Navy/White,https://cdn.pomelofashion.com/img/p/5/8/4/8/0/...,https://pomelofashion.com/id/id/101939.html,Recycled polyester 100%
...,...,...,...,...,...
145,96297,Long Sleeve Knit Top - Black,https://cdn.pomelofashion.com/img/p/5/6/2/9/1/...,https://pomelofashion.com/id/id/96297.html,"Viscose 50%, Nilon 22%, Poliester 28%"
146,101691,Knitted Short Sleeves Top - Navy,https://cdn.pomelofashion.com/img/p/5/8/2/4/8/...,https://pomelofashion.com/id/id/101691.html,"Viscose 80%, Poliamida 20%"
147,102098,Regular Fit Knitted Top - Dark Blue,https://cdn.pomelofashion.com/img/p/5/8/4/5/5/...,https://pomelofashion.com/id/id/102098.html,"Viscose 49%, Poliester 28%, Nilon 23%"
148,101473,Chain Print Knitted Top - Navy,https://cdn.pomelofashion.com/img/p/5/8/2/6/3/...,https://pomelofashion.com/id/id/101473.html,"Viscose 85%, Spandeks 15%"


In [89]:
# Export to excel files
df_excel = df.to_excel('Product Information.xlsx', index=False)