# E-commerce data extraction and analysis for skin care categories (Colab version)
[Author: Elias Buitrago Bolivar](https://github.com/ebuitrago?tab=repositories)

This Jupyter notebook presents a Python-based web scraping algorithm designed to extract data from skin care e-commerce websites in order to analyze trends across product categories. Data on lotions, creams, serums, cleansers, and sunscreens is obtained from [Mercado Libre](www.mercadolibre.com.co). The code provided is fully functional and has been tested with real data. This version is optimized for compatibility with Google Colab, which makes it easy to use and to scale.

_Updated: Nov 24, 2024_


## Install required libraries

In [None]:
!pip install lxml
!pip install scrapy
!pip3 install requests-html
!pip3 install selenium

Collecting scrapy
  Downloading Scrapy-2.12.0-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting Twisted>=21.7.0 (from scrapy)
  Downloading twisted-24.10.0-py3-none-any.whl.metadata (20 kB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.3.2-py3-none-any.whl.metadata (3.9 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.9.1-py2.py3-none-any.whl.metadata (11 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.7.0-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-24.2.0-py3-none-any.whl.metadata (5.1 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting zope.interface>=5.1.0 (from scrapy)
  Downloading zope.interface-7.1.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_

In [None]:
%%shell
# Install chromedriver
# Credits: https://medium.com/@MinatoNamikaze02/running-selenium-on-google-colab-a118d10ca5f8
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.62/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver

pip install selenium chromedriver_autoinstaller

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,452 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [3,323 kB]
Get:13 https://d



### Web Scraping Skin Care Product Data
This section explains the web scraping process implemented to obtain product data from the e-commerce site [Mercado Libre](www.mercadolibre.com.co).

In [None]:
!pip install undetected_chromedriver

Collecting undetected_chromedriver
  Downloading undetected-chromedriver-3.5.5.tar.gz (65 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: undetected_chromedriver
  Building wheel for undetected_chromedriver (setup.py) ... done
  Created wheel for undetected_chromedriver: filename=undetected_chromedriver-3.5.5-py3-none-any.whl size=47048 sha256=2afe06a562856f63f67c195623e779c19c99cfb85e624483ccc8eff423134b84
  Stored in directory: /root/.cache/pip/wheels/cf/a1/db/e1275b6f7259aacd6b045f8bfcb1fcbc93827a3916ba55d5b7
Successfully built undetected_chromedriver
Installing collected packages: undetected_chromedriver
Successfully installed undetected_chromedriver-3.5.5


## Import required libraries


---

In [None]:
'''
credits:
https://github.com/googlecolab/colabtools/issues/3347
https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com
Sept 19, 2023
'''

#
!pip3 install chromedriver-autoinstaller



In [None]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import chromedriver_autoinstaller
import json

## Setup chrome and chrome driver


---



In [None]:
# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# download a chromedriver that matches the installed Chrome and add it to PATH
chromedriver_autoinstaller.install()

'/usr/local/lib/python3.10/dist-packages/chromedriver_autoinstaller/131/chromedriver'


## Section to declare functions

---



### Function scrapebyPages

In [None]:
# Beauty and personal care categories:
# Perfumes and lotions: https://listado.mercadolibre.com.co/belleza-cuidado-personal/perfumes/lociones-perfumes_Desde_49_NoIndex_True
# Facial cleansing: https://listado.mercadolibre.com.co/belleza-cuidado-personal/cuidado-piel/limpieza-facial/limpieza-facial_Desde_49_NoIndex_True



def scrapebyPages(first_page, last_page):
  #Range of pages from the total search to scrape in; last_page itself is excluded.
  #It is recommended to cover a range of one hundred pages in each call to this function.
  data = pd.DataFrame()
  for page in range(first_page, last_page):

      print('************************************')
      print(f'WEB SCRAPING FROM SEARCH PAGE #{page}')
      url = f'https://listado.mercadolibre.com.co/belleza-cuidado-personal/perfumes/lociones-perfumes_Desde_{49*page}_NoIndex_True'
      # url = f'https://listado.mercadolibre.com.co/belleza-cuidado-personal/cuidado-piel/limpieza-facial/limpieza-facial_Desde_{49*page}_NoIndex_True'
      print(url)

      # Open one browser per search page and always release it, even if loading fails
      driver = webdriver.Chrome(options=chrome_options)
      try:
          driver.get(url)
          driver.implicitly_wait(10)
          html = driver.page_source
      finally:
          driver.quit()
      soup = bs(html,'lxml')

      #Get href
      links = gethref(soup)

      p = []
      #Scraping (separate index name, so the page counter is not shadowed)
      for j, link in enumerate(links):
          print('Scraping', j, '/', len(links), '...')
          p.append(scrapper(link))
          print(f'This is the value of p[j]: {p[j]}')

      # append list to DataFrame
      temp_df = pd.DataFrame(p)
      data = pd.concat([data, temp_df], ignore_index=True)

  return data
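
The search URL for each results page follows a simple pattern: a category path, then `_Desde_` with an item offset, then `_NoIndex_True`. The sketch below isolates that pattern in a hypothetical helper, `build_search_url`, which is not part of the original notebook. The offset step of 49 is copied from the notebook's f-string and has not been verified against the site's actual pagination.

```python
# Hypothetical helper (not in the original notebook): builds the listing URL
# for one results page, mirroring the f-string used in scrapebyPages.
BASE_URL = ('https://listado.mercadolibre.com.co/'
            'belleza-cuidado-personal/perfumes/lociones-perfumes')
OFFSET_STEP = 49  # step used by the notebook; assumed, not verified


def build_search_url(page, base_url=BASE_URL, step=OFFSET_STEP):
    """Return the listing URL for results page `page` (1-based)."""
    if page < 1:
        raise ValueError('page must be >= 1')
    return f'{base_url}_Desde_{step * page}_NoIndex_True'


# Build the URLs for the first three results pages
urls = [build_search_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Keeping the URL construction in one place makes it easier to switch categories by passing a different `base_url`, for example the facial cleansing path listed in the comments above.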

### Function gethref

In [None]:
#Function to get 'href' from each article item
def gethref(soup):

    links = []
    for link in soup.find_all('a'):
      url_item = link.get('href')
      # Skip anchors without href; keep only links that contain both markers
      if url_item and 'MCO' in url_item and 'polycard_client' in url_item:
        print(url_item)          #Print each product url as a validity test
        links.append(url_item)

    # Alternative: select anchors by class instead of filtering every <a> tag
    # for link in soup.find_all('a', class_='poly-componente__link'):
    #   url_item = link.get('href')
    #   print(url_item)
    #   links.append(url_item)

    print("Href obtained: ", len(links))

    return links
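
A condition written as `'MCO' and 'polycard_client' in url_item` only tests the second substring, because the non-empty string `'MCO'` is always truthy. The minimal sketch below shows a filter that checks every required substring and tolerates missing `href` values. The function name `filter_product_links` and the sample URLs are hypothetical and used only for illustration.

```python
def filter_product_links(hrefs, required=('MCO', 'polycard_client')):
    """Keep only the hrefs that contain every substring in `required`."""
    return [h for h in hrefs if h and all(token in h for token in required)]


sample = [
    None,                                             # <a> tag without href
    'https://example.com/MCO-123#polycard_client=x',  # has both markers
    'https://example.com/help#polycard_client=x',     # lacks 'MCO'
    'https://example.com/MCO-456',                    # lacks 'polycard_client'
]

# Only the second entry satisfies both conditions
print(filter_product_links(sample))

# The pitfall: this expression ignores 'MCO' and is True for the third entry
print('MCO' and 'polycard_client' in sample[2])
```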

### Function scrapper

In [None]:
#Function to call the extract_product_features routine on each href
def scrapper(url_item):

    # set up the webdriver
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Scrape
        driver.get(url_item)
        driver.implicitly_wait(10)

        #Obtaining the html from the web page after applying Selenium
        html = driver.page_source
    finally:
        # quit the driver (closes the browser window and ends the session)
        driver.quit()

    soup = bs(html,'lxml')

    #Applying function to obtain the variables defined for one particular product
    features = extract_product_features(soup)

    return features

### Function to extract product features

In [None]:
# Version 1.0
def extract_product_features(soup):
  # Returns [product_name, price, discount, review_rating, review_amount, brand]

  features_list = []
  # product_name
  try:
    product_name = soup.find('h1',{'class': 'ui-pdp-title'}).text
  except AttributeError:   # element not found: find() returned None
    product_name = ' '
  features_list.append(product_name)

  # price
  try:
    price_div=soup.find('div',{'class': 'ui-pdp-price__second-line'})
    price = price_div.find('span',{'class': 'andes-money-amount__fraction'}).text
  except AttributeError:
    price = 0
  features_list.append(price)

  # discount
  try:
    price_div=soup.find('div',{'class': 'ui-pdp-price__second-line'})
    discount = price_div.find('span',{'class': 'andes-money-amount__discount'}).text.strip('% OFF')
  except AttributeError:
    discount = 0
  features_list.append(discount)

  # review_rating
  try:
    review_rating = soup.find('span',{'class': 'ui-pdp-review__rating'}).text
  except AttributeError:
    review_rating = 0
  features_list.append(review_rating)

  # review_amount
  try:
    review_amount = soup.find('span',{'class': 'ui-pdp-review__amount'}).text.strip('()')
  except AttributeError:
    review_amount = 0
  features_list.append(review_amount)

  # volume (disabled)
  # try:
  #   volume = soup.find('span', string="Volumen de la unidad:").find_next('span').text.strip()
  #   features_list.append(volume)
  #   print(f"Product's volume is: {volume}")
  # except:
  #   volume = 0
  #   features_list.append(volume)

  # brand (read from the JSON-LD metadata embedded in the page)
  brand = 0
  try:
    script = soup.find("script", {'type': 'application/ld+json'})
    if script:
      # script content
      script_text = json.loads(script.string)

      # json key for the brand
      brand = script_text.get('brand', 'Brand not found')
    else:
      print("JSON-LD script was not found on the page.")
  except json.JSONDecodeError as e:
      print("Error decoding JSON:", str(e))
  except Exception as e:
      print("An unexpected error occurred:", str(e))

  # Always append exactly one brand value so every row has six columns
  features_list.append(brand)

  # print(features_list)

  return features_list
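
The brand is read from the page's `application/ld+json` script. In schema.org markup the `brand` property may be plain text or a nested object with a `name` field, so a more defensive reader can handle both shapes. The sketch below is one way to do that with the standard library. The helper name `brand_from_json_ld` and the sample payloads are hypothetical, and the pages scraped here may only ever use the plain-text form.

```python
import json


def brand_from_json_ld(raw_json, default='Brand not found'):
    """Extract a brand name from a JSON-LD product description string."""
    try:
        payload = json.loads(raw_json)
    except (TypeError, json.JSONDecodeError):
        return default          # missing or malformed script content
    if not isinstance(payload, dict):
        return default          # e.g. a JSON array instead of an object
    brand = payload.get('brand', default)
    # `brand` can be plain text or a nested object carrying a `name` field
    if isinstance(brand, dict):
        return brand.get('name', default)
    return brand


# Hypothetical payloads covering both shapes and a malformed input
print(brand_from_json_ld('{"@type": "Product", "brand": "Clinique"}'))
print(brand_from_json_ld('{"@type": "Product", "brand": {"@type": "Brand", "name": "Ésika"}}'))
print(brand_from_json_ld('not json'))
```

Because the helper always returns a single value, each scraped row keeps the same number of columns regardless of how the page encodes the brand.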

## Start scraping

---

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

In [None]:
"""
 The input parameters for the 'scrapebyPages' function are: Brand name, Car model
 name. Be careful to write the brand and model names exactly as they are in tucarro.com.
 The third input parameter is the initial results page (always initialize to 1)
 and the fourth input parameter is the final results page you want to download data from;
 this parameter depends on the amount of results pages your car returns
 for the brand and model you want to get data from. So, it is recommended to search
 the web portal first to find out how many pages of results you can get
 for the car you want to get data from.
"""

data = scrapebyPages(1,30)

************************************
WEB SCRAPING FROM SEARCH PAGE #1
https://listado.mercadolibre.com.co/belleza-cuidado-personal/perfumes/lociones-perfumes_Desde_49_NoIndex_True
https://click1.mercadolibre.com.co/mclics/clicks/external/MCO/count?a=OX%2BfniAxYxwHE%2FnIPw75gTXpSzTp6pd60mBUgi8gcVhbHpb22xSO3wMX9Ib5ZAv8YN9skjmWG7exq2GjjYazQGmKj1Gej3HNqfqkbWPCz2sK%2B6KaY%2FvKo9WooLbxam5uWB25BWQ7WenFjGCqdcA8wJhffONeUq70LlB5926JE%2BBBz6WTHSV6DiR3XeYa5REjT%2BY4B2WEd5mec1WTAj%2Fe0Ivum1b%2BpvYFQYyDz2lEoBqHnQw9M66%2F0IsqkEDbcdiQCKZUgiaCVyqKg20NFEyLDfOztMvWrGp8QL6%2FiUmyHHmo%2FsLgiSfZIoVEaEvhiHkZsLaP3mCRRLBYZ2OP%2BBWG8H30w7OlbbiH5ms9tTuHsI5bP6rPVK1Xhn8TipYe6iemcIHG8xf%2BZ5%2Bd3keiGNMjK6K1xitMPtgDz9f6gUQG%2BAVMCrmZjK4s7nngwbRaHu%2BBrzZFzQ9%2FtBjAmESjLQUnfte1bAC9%2B6ghDn0woNYhd5ONLFnONKEORI2%2BhBXSPXzc%2BkS5VPfIX4Fn0rWMlkxPaM7uFMuNpNyJAumlef2rvoAbdPIMh13jKNMXK1XmgxvkZVFf0PSyObwtfdXlX6qzN7dyUTfiA26HouX%2FdvdwajoHIc%2FY%2BhYntDno6wmu3%2B%2FLpkPBXbPdQcu%2F82%2F0ds%2BBZzCovk%2BxVQ%3D%3D&rb=x#polycard_c
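
The comment in `scrapebyPages` recommends covering about one hundred pages per call. For larger categories, the page range can be split into batches and each batch scraped and saved separately. The sketch below only computes the batch boundaries. The helper `page_batches` is hypothetical and not part of the original notebook, and the batch size of 100 is taken from that recommendation.

```python
def page_batches(first_page, last_page, batch_size=100):
    """Yield (start, stop) pairs covering range(first_page, last_page)."""
    if batch_size < 1:
        raise ValueError('batch_size must be >= 1')
    for start in range(first_page, last_page, batch_size):
        # stop is exclusive, matching the range() used inside scrapebyPages
        yield start, min(start + batch_size, last_page)


# Show the calls that would cover pages 1 to 250
for start, stop in page_batches(1, 251):
    print(f'scrapebyPages({start}, {stop})')
```

Each pair can be passed directly to `scrapebyPages`, and the resulting DataFrames concatenated afterwards, so a failure in one batch does not discard the pages already scraped.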

In [None]:
cols = ['product_name','price','discount','review_rating','review_amount','brand']
data.columns = cols
print(data.shape)
data.tail(50)

(1398, 6)


| | product_name | price | discount | review_rating | review_amount | brand |
|---|---|---|---|---|---|---|
| 1348 | Locion Perfume Clinique Happy For Men. Original | 175.816 | 29 | 4.8 | 5 | Clinique |
| 1349 | Lociones Dorsay + Winner Sport +nitro+ You | 178.2 | 10 | 4.8 | 13 | Ésika |
| 1350 | Loción Vibranza X 45 Ml. | 69.9 | 0 | 0.0 | 0 | Ésika |
| 1351 | Locion Nitro Y Loción Vibranza | 90.0 | 0 | 0.0 | 0 | Ésika |
| 1352 | Loción Winner Sport Y Loción Vibranza | 135.0 | 0 | 0.0 | 0 | Ésika |
| 1353 | Loción Vanilla Y Loción Fantasía Azul Infinito | 95.0 | 0 | 0.0 | 0 | Ésika |
| 1354 | Loción Vanilla Y Loción Fiori. Promoción Insup... | 109.0 | 0 | 0.0 | 0 | Ésika |
| 1355 | Loción Vanilla Y Loción Limage | 105.9 | 0 | 0.0 | 0 | Ésika |
| 1356 | Loción Vanilla Y Loción Vibranza | 145.0 | 0 | 0.0 | 0 | Ésika |
| 1357 | Loción Bleu Intense Y Loción Fantasía Azul Inf... | 142.0 | 0 | 0.0 | 0 | Ésika |


In [None]:
saved_name = 'scraping_perfumes_241124.csv'
data.to_csv(saved_name, encoding='utf-8', index=False)
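
The scraped price is kept as the text found in the `andes-money-amount__fraction` element. Values such as `69.9` and `90.0` in the sample output are consistent with strings like `69.900` being read back as decimals, which suggests the dot is a thousands separator. Under that assumption, which has not been confirmed against the site, a small stdlib helper can convert the text to whole pesos before the CSV is written. The name `parse_cop_price` is hypothetical.

```python
def parse_cop_price(text):
    """Convert a price string such as '175.816' to an integer number of pesos.

    Assumes '.' is a thousands separator and that prices carry no decimals.
    Returns 0 when the value cannot be interpreted as a whole number.
    """
    digits = str(text).replace('.', '').strip()
    return int(digits) if digits.isdigit() else 0


print(parse_cop_price('175.816'))
print(parse_cop_price('69.900'))
print(parse_cop_price(0))  # fallback value used when no price was found
```

Applied to the `price` column before saving, for example with `data['price'].map(parse_cop_price)`, this keeps the column as integers and avoids the ambiguity when the file is loaded again.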

## References
---



https://github.com/kiteco/kite-python-blog-post-code/blob/master/Web%20Scraping%20Tutorial/script.py

https://medium.com/geekculture/scrappy-guide-to-web-scraping-with-python-475385364381

https://stackoverflow.com/questions/47730671/python-3-using-requests-does-not-get-the-full-content-of-a-web-page