## **Alibaba Web Scraping Using Python**

### **1. Importing Libraries**

In [27]:
from bs4 import BeautifulSoup 
import requests 
import time 
import datetime 
import pandas as pd

*In this section, we import the necessary libraries: `BeautifulSoup` for parsing HTML, `requests` for making HTTP requests, `datetime` for dating each collection run, and `pandas` for organizing and saving the scraped data. `time` is imported as well but is not used in the steps shown here.*

---

### **2. Set Up URL and Headers**


In [28]:
# URL of the Alibaba product page
URL = 'https://www.alibaba.com/product-detail/MereSports-Men-s-100-Merino-Wool_1601352569953.html?spm=a2700.galleryofferlist.p_offer.d_image.f0d313a0eX1fnF&s=p'

# Custom headers to mimic a real browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1"
}

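*In this part, we define the URL of the Alibaba product page and set custom headers so that the request resembles one sent by a real browser. Without them, the site may be more likely to reject the request as automated traffic; the headers reduce that risk but do not guarantee access.*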
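
*The query string on this URL (`spm`, `s`) looks like referral tracking rather than part of the product's address. As a minimal sketch, assuming the page resolves the same way without those parameters (not verified here), the standard library can strip them. `strip_query` is a hypothetical helper name, not part of the original notebook.*

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


URL = (
    "https://www.alibaba.com/product-detail/"
    "MereSports-Men-s-100-Merino-Wool_1601352569953.html"
    "?spm=a2700.galleryofferlist.p_offer.d_image.f0d313a0eX1fnF&s=p"
)

print(strip_query(URL))
# https://www.alibaba.com/product-detail/MereSports-Men-s-100-Merino-Wool_1601352569953.html
```

*A shorter URL is easier to log and to use as a stable key when the same product is scraped on different days.*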

---

### **3. Fetching the Web Page**



In [29]:
# Send a GET request to fetch the webpage content
page = requests.get(URL, headers=headers)

# Parse the content using BeautifulSoup
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

# Print the prettified HTML content (optional for debugging)
print(soup2)

<!-- tangram:5542 begin-->
<!DOCTYPE html>

<html class="rwd" dir="ltr" lang="en">
<head>
<meta content="a2700" name="data-spm"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
   Meresports Men's 100% Merino Wool T-shirts Premium Comfort Durability Casual Fit Short Sleeve Fitness Shirts - Buy Merino Wool Shirt
100% Merino Wool Tshirt
merino Wool Sport Cycling
merino Wool Men
merino Wool Clothing Manufacturers
merino Wool Pullover
100% Merino Wool Shirt
merino Wool Underwear
tshirt Merino Wool
merino Wool Tshirt
merino Wool Base Layer
merino Wool T Shirt
merino Wool Clothing
100% Merino Wool
100% Merino Wool Men
merino Wool Shirt
merino Wool T-shirts Product on Alibaba.com
  </title>
<meta content="Meresports Men's 100% Merino Wool T-shirts Premium Comfort Durability Casual Fit Short Sleeve Fitness Shirts - Buy Merino Wool Shirt
100% Merino Wool Tshirt
merino Wool Sport Cycling
merino Wool Men
merino Wool Clothing Manufacturers
merino Wool Pullover
100% M

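*Here, we make an HTTP GET request to the URL and parse the response's HTML with BeautifulSoup. `soup1` holds the first parse; `soup2` re-parses the prettified (indented) markup, which is easier to read when printed, and is the object the later cells query. The printed output above is truncated.*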
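
*The request above has no timeout and does not check the HTTP status, so a stalled connection or an error page would pass silently into the parser. The following is a minimal sketch of a more defensive fetch, not the notebook's own code. `fetch_html` and its `session` parameter are hypothetical names, and the 10-second timeout is an arbitrary assumption.*

```python
import requests


def fetch_html(url, headers, timeout=10.0, session=None):
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    # Allow a requests.Session (or a test double) to be passed in.
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html(URL, headers)
# soup = BeautifulSoup(html, "html.parser")
```

*Failing early here keeps a blocked or redirected response from surfacing later as a confusing `NoneType` error during extraction.*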

---

### **4. Extracting the Product Title**


In [31]:
# Find the product title using its HTML tag and attribute
title = soup2.find('h1')['title']
print(title)


MereSports Men's 100% Merino Wool T-Shirts Premium Comfort Durability Casual Fit Short Sleeve Fitness Shirts


*This section extracts the product title by finding the `<h1>` tag and accessing the `title` attribute.*
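
*If the page layout changes or the request is blocked, `soup2.find('h1')` returns `None` and the subscript raises a `TypeError`. A minimal sketch of a more tolerant lookup follows. `extract_title` is a hypothetical helper, and the sample HTML is made up for illustration.*

```python
from bs4 import BeautifulSoup


def extract_title(soup):
    """Return the product title, or None if no <h1> is present."""
    h1 = soup.find("h1")
    if h1 is None:
        return None
    # Prefer the title attribute; fall back to the visible text.
    return h1.get("title") or h1.get_text(strip=True)


# Made-up markup standing in for the real product page.
sample = BeautifulSoup('<h1 title="Merino Wool T-Shirt">Merino...</h1>', "html.parser")
blocked = BeautifulSoup("<p>Access denied</p>", "html.parser")

print(extract_title(sample))   # Merino Wool T-Shirt
print(extract_title(blocked))  # None
```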




---

### **5. Scraping Price and Quantity Information**


In [32]:
# Find all price items on the page
price_items = soup2.find_all('div', class_='price-item')

# Initialize an empty list to store price and quantity data
price_info = []

# Loop through each price item to extract quantity and price
for item in price_items:
    quantity = item.find('div', class_='quality').text.strip().replace('pieces', '')
    price = item.find('span').text.strip()
    price_info.append([quantity, price])

# Output the extracted data (optional)
print(price_info)

[['500 - 1999 ', '$16.50'], ['>= 2000 ', '$15.50']]


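*This part finds every `div` with the class `price-item` and extracts a quantity range and a unit price from each. The quantity is read from the nested `div` whose class is `quality` (the class name that matched on this page), and the word "pieces" is removed from it. Because `strip()` runs before `replace()`, each quantity keeps a trailing space, as the output shows; calling `replace()` first and `strip()` last would remove it.*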
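
*The values above are still strings, which limits sorting and arithmetic. As a minimal sketch, assuming quantities only ever take the two shapes seen in the output (`'500 - 1999'` and `'>= 2000'`) and prices use a `.` decimal separator, they can be converted to numbers. `parse_quantity` and `parse_price` are hypothetical helpers.*

```python
import re
from decimal import Decimal


def parse_quantity(text):
    """Turn '500 - 1999' into (500, 1999) and '>= 2000' into (2000, None)."""
    numbers = [int(n) for n in re.findall(r"\d+", text.replace(",", ""))]
    if not numbers:
        raise ValueError(f"no quantity found in {text!r}")
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else None  # None means open-ended
    return low, high


def parse_price(text):
    """Turn '$16.50' into Decimal('16.50')."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return Decimal(match.group())


price_info = [['500 - 1999 ', '$16.50'], ['>= 2000 ', '$15.50']]
parsed = [(parse_quantity(q), parse_price(p)) for q, p in price_info]
print(parsed)
# [((500, 1999), Decimal('16.50')), ((2000, None), Decimal('15.50'))]
```

*`Decimal` is used instead of `float` so that prices keep their exact two-decimal form.*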

---

### **6. Timestamp for Data Collection**


In [33]:
# Create a timestamp for the data collection date
today = datetime.date.today()
print(today)

2025-03-03


*This section generates the current date, which will be used to track when the data was collected.*
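
*A date is enough when the page is scraped once a day. As a small sketch of two alternatives, the date can be stored as an ISO 8601 string, and a timezone-aware UTC timestamp can distinguish several runs on the same day. The variable names `date_str` and `collected_at` are illustrative.*

```python
import datetime

today = datetime.date.today()

# ISO 8601 string such as '2025-03-03'; sorts correctly as plain text.
date_str = today.isoformat()

# Timezone-aware UTC timestamp, useful for more than one run per day.
collected_at = datetime.datetime.now(datetime.timezone.utc)

print(date_str)
print(collected_at.isoformat(timespec="seconds"))
```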

---

### **7. Organize Data for Output**


In [34]:
# Organize the scraped data into a structured format
data = []
for quantity, price in price_info:
    data.append([title, quantity, price, today])

# Output the organized data (optional)
print(data)

[["MereSports Men's 100% Merino Wool T-Shirts Premium Comfort Durability Casual Fit Short Sleeve Fitness Shirts", '500 - 1999 ', '$16.50', datetime.date(2025, 3, 3)], ["MereSports Men's 100% Merino Wool T-Shirts Premium Comfort Durability Casual Fit Short Sleeve Fitness Shirts", '>= 2000 ', '$15.50', datetime.date(2025, 3, 3)]]


*We organize the scraped price and quantity information, along with the product title and collection date, into a structured list.*
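
*An alternative layout, sketched here under the same variable names, is a list of dictionaries: each value is labelled by its column, and `pd.DataFrame` accepts such a list directly. The title below is shortened and the date is fixed purely for illustration.*

```python
import datetime

# Stand-ins for the values produced by the earlier cells.
title = "MereSports Men's 100% Merino Wool T-Shirts"  # shortened for the example
price_info = [['500 - 1999 ', '$16.50'], ['>= 2000 ', '$15.50']]
today = datetime.date(2025, 3, 3)

records = [
    {
        "Title": title,
        "Quantity": quantity.strip(),  # drop the trailing space
        "Price": price,
        "Date": today,
    }
    for quantity, price in price_info
]

for record in records:
    print(record["Quantity"], record["Price"])
```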

---

### **8. Saving Data to CSV**


In [35]:
# Define column headers for the DataFrame
header = ['Title', 'Quantity', 'Price', 'Date']

# Create a DataFrame using the scraped data
df = pd.DataFrame(data, columns=header)

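# Note: Jupyter renders a bare `df` only when it is the last expression
# in a cell, so nothing is displayed here; use print(df) to preview it
df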

# Save the DataFrame to a CSV file
df.to_csv(r"C:\Users\user\Desktop\studying\1- python\python projects\web scraping\alibaba_data_scraped.csv", index=False)

*Finally, we create a pandas DataFrame to hold the data in table form and save it to a CSV file for further analysis. `index=False` keeps the DataFrame's row index out of the file.*
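
*Since each row carries a collection date, one natural use is to append a new batch on every run rather than overwrite the file. The following is a minimal sketch of that idea, not the notebook's own method. `append_to_csv` is a hypothetical helper, and the relative `data/` folder is an assumed location chosen to avoid a machine-specific absolute path.*

```python
from pathlib import Path

import pandas as pd


def append_to_csv(df, csv_path):
    """Append rows to csv_path, writing the header only if the file is new."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not csv_path.exists()
    df.to_csv(csv_path, mode="a", header=write_header, index=False)


# Assumed output location, relative to the current working directory.
output_path = Path("data") / "alibaba_data_scraped.csv"

# Usage (writes to disk, so it is left commented out):
# append_to_csv(df, output_path)
```

*Running the notebook on different days then builds up a price history in a single file.*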
