# DigiKrawl: DigiKala Comment Crawler

## Prerequisites

The following code snippet import required third-party libraries:
+ `requests` for sending HTTP requests to DigiKala server
+ `bs4` for parsing and crawling HTTP responses
+ `pandas` for data manipulation
+ `datetime` for using time series data
+ `unicode` for decode Farsi characters
+ `jdatetime` for work with Jalali dates

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from unidecode import unidecode
import jdatetime 

DigiKala comments different features. One important one is the date of comment, its format is `'yyyy month_name dd'` so it is necessary to convert them into a proper shape. Following function will do this for you.

In [4]:
def change_date(date):
    """
    date: Jalali date in format '%YYYY %MonthName %dd' string
    return the corresponding Gregorian date
    """
    # Specify number of each month
    months = {"فروردین" : 1,
              "اردیبهشت" : 2,
              "خرداد" : 3,
              "تیر" : 4,
              "مرداد" : 5,
              "شهریور" : 6,
              "مهر" : 7,
              "آبان" : 8,
              "آذر" : 9,
              "دی" : 10,
              "بهمن" : 11,
              "اسفند" : 12,}
    # Extract day
    day = int(unidecode(u'{}'.format(date.split()[0])))
    
    # Extract month
    month = months[date.split()[1]]
    
    # Extract year
    year = int(unidecode(u'{}'.format(date.split()[2])))
    
    # Transform calendar format of the date
    date = jdatetime.date(year,month,day).togregorian()
        
    return date

Then we need to find the data in product's page. For this purpose, we will create next function.

In [5]:
def crawl_digikala_comments(url, pages = 1, mode = 1):
    """
    url: The product of DigiKala that must be crawled
    page: Number of pages of comments, by default it will crawl only first page
    mode: mode = 1 is for newest comment in webpage, mode = 2 is for most liked comments
    returns pandas.DataFrame containing all comments
    """
    # Configure mode of comments
    if (mode == 1):
        mode_str = 'newest_comment'
    elif (mode == 2):
        mode_str = 'most_liked'
    
    # Send request to the specified url
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find product_id and product_title from url and http response respectively
    product_id = url.split('/')[4][4:]
    product_title = soup.find('h1', {'class':'c-product__title'}).text.strip()
    
    # Initialize DataFrame for comments consists of 13 main features
    data = pd.DataFrame(columns = {'Product_id', 'Product_title', 'Title', 'Date', 'Username', 'Badge', 'Status', 'Content',
                                           'Advantages', 'Disadvantages', 'Color', 'Seller', 'Likes'})
    
    # Iterate each page of comments
    for page in range(pages):
        # Modify url for comments and send request
        comment_url = 'https://www.digikala.com/ajax/product/comments/list/{0}/?mode={1}&page={2}'.format(product_id, mode_str, page+1)
        response = requests.get(comment_url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find each comment for further process
        comment_section = soup.find('div', {'class' : 'c-comments__list'})
        comments = comment_section.findChildren('div', {'class' : 'c-comments__item c-comments__item--pdp'})
        
        # Iterate each comment in the page
        for comment in comments:
            # Find title of the commnet
            title = comment.find('span', {'class' : 'c-comments__title'}).text.strip()
            
            # Find string of date of the comment
            date = comment.find_all('span', {'class' : 'c-comments__detail'})[0].text
            # Transform string date into datetime.date instance
            date = change_date(date)
            
            # Find username of the comment's writer
            username = comment.find_all('span', {'class' : 'c-comments__detail'})[1].text.strip()
            
            # Find user badge if exists
            badge = comment.find('div', {'class' : 'c-comments__buyer-badge'})
            if badge is not None:
                badge = badge.text
                
            # Find status of the comment if exists
            status = comment.find('div', {'class' : 'c-comments__status'})
            if status is not None:
                status = status.text
                
            # Find content of the comment
            content = comment.find('div', {'class' : 'c-comments__content'}).text
            
            # Find mentioned advantages in the comment 
            advantages = comment.find_all('div', {'class' : 'c-comments__modal-evaluation-item--positive'})            
            advantages = [advantage.text for advantage in advantages]
                        
            # Find mentioned disadvantages in the comment
            disadvantages = comment.find_all('div', {'class' : 'c-comments__modal-evaluation-item--negative'})
            disadvantages = [disadvantage.text.strip() for disadvantage in disadvantages]
            
            # Find color of the products if exists
            color = comment.find('div', {'class' : 'c-comments__color'})
            if color is not None:
                color = color.text.strip()
                
            # Find the seller's name if exists
            seller = comment.find('a', {'class' : 'c-comments__seller'})
            if seller is not None:
                seller = seller.text.strip()
            
            # Find the number of likes and convert into English digits 
            likes = unidecode(u'{}'.format(comment.find('div', {'class' : 'c-comments__helpful-yes js-comment-like'}).text))
                        
            # Append new commnet into DataFrame
            data = data.append({'Product_id':product_id, 'Product_title': product_title, 'Title': title,
                                'Date': date, 'Username': username, 'Badge': badge, 'Status': status,
                                'Content': content, 'Advantages': advantages, 'Disadvantages': disadvantages,
                                'Color': color, 'Seller': seller, 'Likes':likes}, ignore_index = True)

            
    return data

Let's look at a little example.

In [6]:
url = "https://www.digikala.com/product/dkp-186008/%DA%A9%D8%AA%D8%A7%D8%A8-%D9%BE%D8%A7%D8%B3%D8%AA%DB%8C%D9%84-%D9%87%D8%A7%DB%8C-%D8%A8%D9%86%D9%81%D8%B4-%D8%A7%D8%AB%D8%B1-%DA%A9%D8%A7%D8%AA%D8%B1%DB%8C%D9%86-%D8%A7%D9%BE%D9%84%DA%AF%DB%8C%D8%AA"
data_digikala = crawl_digikala_comments(url,2,2)
data_digikala.head()

Unnamed: 0,Color,Seller,Username,Product_title,Content,Badge,Date,Title,Disadvantages,Advantages,Status,Likes,Product_id
0,,,کاربر دیجی‌کالا,کتاب پاستیل های بنفش اثر کاترین اپلگیت,خیلی داستان زیبایی داره خیلی دوسش دارم جذابیت ...,خریدار,2020-12-01,پاستیل های بنفش,[],[],خرید این محصول را توصیه می‌کنم,\n 124\n ...,186008
1,,,سید محمد موسوی,کتاب پاستیل های بنفش اثر کاترین اپلگیت,از محتواش اطلاعی ندارم چون برای خواهرم خریدم.\...,خریدار,2020-04-26,,[],[],در مورد خرید این محصول مطمئن نیستم,\n 64\n ...,186008
2,,,محمد حاجی رسولیها,کتاب پاستیل های بنفش اثر کاترین اپلگیت,این کتاب درباره ی پسر بچه ای به نام جکسون است ...,,2018-02-12,کتاب پاستیل های بنفش,[ندارد],[\n با کیفیت\n ...,,\n 61\n ...,186008
3,,,مهدیه تاجیک,کتاب پاستیل های بنفش اثر کاترین اپلگیت,اول از جلدش بگم که خیلی قشنگ تر از چیزی بود که...,خریدار,2020-09-30,پاستیل های بنفش,[],[\n داستان مناس...,خرید این محصول را توصیه می‌کنم,\n 54\n ...,186008
4,,,نگین سبحانی,کتاب پاستیل های بنفش اثر کاترین اپلگیت,خیلی عالیه,خریدار,2020-08-13,,[],[],خرید این محصول را توصیه می‌کنم,\n 53\n ...,186008
