# Parsing of the website of the Charter of the Agro-Industrial Complex of Russia 
## Data Collection
### Author: Vladislav Rubanov

This notebook collects data on the companies participating in the Charter of the Russian Agro-Industrial Complex (AIC) by parsing the Charter's website.

Site address: https://хартия-апк.радо.рус 

In [1]:
import requests
import pandas as pd
pd.set_option('display.max_columns', 50)

from bs4 import BeautifulSoup

### Generate links for parsing

The links to the Charter members' pages are simple: the page number is **embedded directly** in the URL. As of April 2024, the Charter includes 9366 companies, listed across 187 pages. Let's generate the links:

In [2]:
links = []
for i in range(1, 188):
    links.append(f'https://хартия-апк.радо.рус/uchastniki?page={i}&dp-50-per-page=50&dp-1-per-page=50')

In [3]:
# look at the first 5 links
links[:5]

['https://хартия-апк.радо.рус/uchastniki?page=1&dp-50-per-page=50&dp-1-per-page=50',
 'https://хартия-апк.радо.рус/uchastniki?page=2&dp-50-per-page=50&dp-1-per-page=50',
 'https://хартия-апк.радо.рус/uchastniki?page=3&dp-50-per-page=50&dp-1-per-page=50',
 'https://хартия-апк.радо.рус/uchastniki?page=4&dp-50-per-page=50&dp-1-per-page=50',
 'https://хартия-апк.радо.рус/uchastniki?page=5&dp-50-per-page=50&dp-1-per-page=50']
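
The same links can also be assembled from a dictionary of query parameters, which keeps the page size in one place. This is a minimal sketch: the names `BASE_URL` and `build_page_links` are hypothetical, and the parameter names are copied from the URLs above.

```python
from urllib.parse import urlencode

# base address of the members' list, taken from the links above
BASE_URL = 'https://хартия-апк.радо.рус/uchastniki'


def build_page_links(n_pages, per_page=50):
    """Return the list URLs for pages 1..n_pages (hypothetical helper)."""
    page_links = []
    for page in range(1, n_pages + 1):
        query = urlencode({
            'page': page,
            'dp-50-per-page': per_page,
            'dp-1-per-page': per_page,
        })
        page_links.append(f'{BASE_URL}?{query}')
    return page_links


demo_links = build_page_links(187)
print(len(demo_links))
print(demo_links[0])
```

Only the query string goes through `urlencode`, so the Cyrillic host name stays exactly as written in the notebook.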

### Logging

Let's create two loggers for the parsing process: one records errors and the other reports progress.

In [4]:
import logging

# define loggers
# set the logging level to the lowest one (DEBUG) so that every message is recorded
logger_good = logging.getLogger("logger_good")
logger_good.setLevel(logging.DEBUG)

# keep the default level: the logger inherits WARNING from the root logger,
# so warnings, errors and critical messages are recorded
logger_bad = logging.getLogger("logger_bad")

# define a separate handler for each logger to write logs to a file
# (the logs/ directory must already exist: FileHandler does not create it)
handler_for_good_logger = logging.FileHandler("logs/info.log", mode='w')
handler_for_bad_logger = logging.FileHandler("logs/errors.log", mode='w')

# define a formatter for each handler to standardize the output format
# based on docs: https://docs.python.org/3/library/logging.html#logrecord-attributes
# format: date and time, logger name, level name, message
str_logging_format = "%(asctime)s - %(name)-12s - %(levelname)-8s - %(message)s"

handler_for_good_logger.setFormatter(logging.Formatter(fmt=str_logging_format))
handler_for_bad_logger.setFormatter(logging.Formatter(fmt=str_logging_format))

# add handlers to loggers
logger_good.addHandler(handler_for_good_logger)
logger_bad.addHandler(handler_for_bad_logger)
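
To see how the two levels behave, here is a small self-contained sketch. The helper `make_file_logger`, the logger names `demo_good` and `demo_bad`, and the temporary directory are hypothetical and are used only so that the example does not touch the real log files. It assumes the root logger still has its default WARNING level.

```python
import logging
import os
import tempfile

LOG_FORMAT = "%(asctime)s - %(name)-12s - %(levelname)-8s - %(message)s"


def make_file_logger(name, path, level=None):
    """Create a logger that writes to `path` (hypothetical helper)."""
    # FileHandler does not create missing directories, so do it here
    os.makedirs(os.path.dirname(path), exist_ok=True)
    logger = logging.getLogger(name)
    if level is not None:
        logger.setLevel(level)
    handler = logging.FileHandler(path, mode='w', encoding='utf-8')
    handler.setFormatter(logging.Formatter(fmt=LOG_FORMAT))
    logger.addHandler(handler)
    return logger


log_dir = os.path.join(tempfile.mkdtemp(), 'logs')
info_path = os.path.join(log_dir, 'info.log')
errors_path = os.path.join(log_dir, 'errors.log')

demo_good = make_file_logger('demo_good', info_path, logging.DEBUG)
demo_bad = make_file_logger('demo_bad', errors_path)  # level inherited from root

demo_good.info('page 1 parsed')
demo_bad.info('dropped: INFO is below the inherited level')
demo_bad.error('page 2 failed')

with open(info_path, encoding='utf-8') as f:
    info_text = f.read()
with open(errors_path, encoding='utf-8') as f:
    errors_text = f.read()

print(info_text)
print(errors_text)
```

The INFO message sent to `demo_bad` never reaches its file, while the ERROR message does.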

### Parsing

Let's iterate over the links, send a GET request to each one with `requests`, and collect the company information in a `pd.DataFrame`:

In [2]:
# iterate over links
for link_no, link in enumerate(links):
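    # send a GET request to the next link
    # (the 30-second timeout is an arbitrary choice that keeps a stalled connection from blocking the loop)
    try:
        r = requests.get(link, timeout=30)
    except requests.RequestException:
        # the request itself failed (connection error, timeout, ...), log and skip it
        logger_bad.error(f'Link no. {link_no}. Failed to send a GET request to the page with URL: {link}')
        continue
    if r.status_code != 200:
        # received an unsuccessful status in response, log and skip it
        logger_bad.error(f'Link no. {link_no}. Failed to send a successful GET request to the page with URL: {link}')
        continue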
    logger_good.info(f'Link no. {link_no}. Successful response to GET request to the page with URL: {link}')
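    # parse the HTML source of the page (the parser is named explicitly so that bs4 does not have to guess one)
    soup = BeautifulSoup(r.text, 'html.parser')
    # look for the table with data on the Charter member companies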
    table = soup.find_all('table', attrs={'id': 'organizations-table'})
    if len(table) != 1:
        # the expected table was not found (or was found more than once), log and skip the page
        logger_bad.error(f'Link no. {link_no}. Failed to find the data table on the page with URL: {link}')
        continue
    # get the table source
    data = table[0]
    # extract rows from the table
    companies = data.find_all('tr', attrs={'class': 'collapse-toggle collapsed'})

    # lists for main attributes
    company_no_list = []
    company_name_list = []
    company_region_list = []
    company_inn_list = []
    company_okved_list = [] 
    company_profile_raw_list = [] 
    company_link_list = []
    # lists for company profiles
    intermediary_list = []
    exporter_list = []
    producer_list = []
    elevator_list = []
    importer_list = []
    trader_list = []
    exchange_list = []
    central_counterparty_list = []
    broker_list = []
    surveyor_list = []
    industry_union_list = []

    # iterate over the companies on the current page
    for company in companies:
        # walk through the row column by column
        attrs = company.find_all('td')
        for i, attr in enumerate(attrs):
            if i == 0:
                # it is the number of the company in the charter
                company_no = attr.text.strip()
                company_no_list.append(company_no)
            elif i == 1:
                # it is the name of the company
                company_name = attr.text.strip()
                company_name_list.append(company_name)
                # also get the link to the company
                if len(attr.find_all('a')) == 1:
                    rel_link = attr.find_all('a')[0].get('href')
                    company_link = 'https://хартия-апк.радо.рус' + rel_link
                    company_link_list.append(company_link)
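                else:
                    # couldn't find a link to the company: log it and store a placeholder
                    # so that all column lists keep the same length
                    logger_bad.error(f'Failed to find link to the company {company_name} from page with URL: {link}')
                    company_link_list.append(None)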

            elif i == 2:
                # it is the region of the company
                company_region = attr.text.strip()
                company_region_list.append(company_region)

            elif i == 3:
                # it is the Taxpayer Identification Number (INN) of the company
                company_inn = attr.text.strip()
                company_inn_list.append(company_inn)

            elif i == 4:
                # it is the company's code in the All-Russian Classifier of Types of Economic Activity (OKVED)
                company_okved = ' '.join(attr.text.strip().split())
                company_okved_list.append(company_okved)

            elif i == 5:
                # these are the profiles of the company
                # default values for the profile flags
                intermediary = False
                exporter = False
                producer = False
                elevator = False
                importer = False
                trader = False
                exchange = False
                central_counterparty = False
                broker = False
                surveyor = False
                industry_union = False
                
                # all profiles
                company_profile_raw = attr.text.strip()
                company_profile_raw_list.append(company_profile_raw)

                # separate flags for profiles
                if 'посредник' in company_profile_raw.lower():
                    intermediary = True
                if 'экспортер' in company_profile_raw.lower():
                    exporter = True
                if 'производитель' in company_profile_raw.lower():
                    producer = True
                if 'элеватор' in company_profile_raw.lower():
                    elevator = True
                if 'импортер' in company_profile_raw.lower():
                    importer = True
                if 'трейдер' in company_profile_raw.lower():
                    trader = True
                if 'биржа' in company_profile_raw.lower():
                    exchange = True
                if 'центральный контрагент' in company_profile_raw.lower():
                    central_counterparty = True
                if 'брокер' in company_profile_raw.lower():
                    broker = True
                if 'сюрвейер' in company_profile_raw.lower():
                    surveyor = True
                if 'отраслевой союз' in company_profile_raw.lower():
                    industry_union = True

                intermediary_list.append(intermediary)
                exporter_list.append(exporter)
                producer_list.append(producer)
                elevator_list.append(elevator)
                importer_list.append(importer)
                trader_list.append(trader)
                exchange_list.append(exchange)
                central_counterparty_list.append(central_counterparty)
                broker_list.append(broker)
                surveyor_list.append(surveyor)
                industry_union_list.append(industry_union)

    # build a dataframe from the table on this page and append it to the common one
    df = pd.DataFrame({
        'number': company_no_list,
        'name': company_name_list,
        'region': company_region_list,
        'inn': company_inn_list,
        'okved': company_okved_list,
        'profile_raw': company_profile_raw_list,
        'company_link': company_link_list,
        'page_link': [link] * len(company_no_list),
        'intermediary': intermediary_list,
        'exporter': exporter_list,
        'producer': producer_list,
        'elevator': elevator_list,
        'importer': importer_list,
        'trader': trader_list,
        'exchange': exchange_list,
        'central_counterparty': central_counterparty_list,
        'broker': broker_list,
        'surveyor': surveyor_list,
        'industry_union': industry_union_list
    })

    if link_no == 0:
        stacked_df = df.copy()
    else:
        stacked_df = pd.concat([stacked_df, df])
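
The eleven profile checks in the loop follow one pattern: lowercase the raw profile text and test whether a Russian keyword occurs in it. A compact sketch of the same idea is shown below. The names `PROFILE_KEYWORDS` and `profile_flags` are hypothetical, and the keywords are copied from the loop above.

```python
# flag name -> Russian keyword searched for in the raw profile text
PROFILE_KEYWORDS = {
    'intermediary': 'посредник',
    'exporter': 'экспортер',
    'producer': 'производитель',
    'elevator': 'элеватор',
    'importer': 'импортер',
    'trader': 'трейдер',
    'exchange': 'биржа',
    'central_counterparty': 'центральный контрагент',
    'broker': 'брокер',
    'surveyor': 'сюрвейер',
    'industry_union': 'отраслевой союз',
}


def profile_flags(profile_raw):
    """Return a {flag name: bool} dict for one raw profile string."""
    text = profile_raw.lower()  # lowercase once instead of once per check
    return {flag: keyword in text for flag, keyword in PROFILE_KEYWORDS.items()}


flags = profile_flags('Производитель Импортер Трейдер')
print([name for name, is_set in flags.items() if is_set])
```

With this shape, supporting a new profile means adding one dictionary entry rather than a new variable, list and `if` block.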

In [109]:
df = stacked_df.reset_index(drop=True)
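
The loop creates `stacked_df` only when `link_no == 0`, so if the very first page is skipped, the later `pd.concat` call fails because `stacked_df` does not exist yet. One common alternative, sketched here with made-up rows, is to collect the per-page frames in a list and concatenate them once at the end. The names `page_frames` and `fake_pages` are hypothetical.

```python
import pandas as pd

# made-up stand-ins for three parsed pages; the first one "failed" and has no rows
fake_pages = [
    [],
    [{'number': '2', 'name': 'B'}],
    [{'number': '1', 'name': 'A'}],
]

page_frames = []
for rows in fake_pages:
    if not rows:
        # nothing was parsed for this page, skip it
        continue
    page_frames.append(pd.DataFrame(rows))

# a single concat at the end; ignore_index=True renumbers the rows from 0,
# which replaces the separate reset_index(drop=True) step
result = pd.concat(page_frames, ignore_index=True)
print(result)
```

Concatenating once also avoids copying the growing dataframe on every iteration.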

In [110]:
df

Unnamed: 0,number,name,region,inn,okved,profile_raw,company_link,page_link,intermediary,exporter,producer,elevator,importer,trader,exchange,central_counterparty,broker,surveyor,industry_union
0,9344,"ООО ""ИННАГРО""",Москва,7722498765,74.90.4 Предоставление консультационных услуг ...,Производитель,https://хартия-апк.радо.рус/uchastniki/31463-o...,https://хартия-апк.радо.рус/uchastniki?page=1&...,False,False,True,False,False,False,False,False,False,False,False
1,9343,"ООО ""КАРАВАЙ""",Ростовская область,6163156397,"46.21 Торговля оптовая зерном, необработанным ...",Трейдер,https://хартия-апк.радо.рус/uchastniki/31462-o...,https://хартия-апк.радо.рус/uchastniki?page=1&...,False,False,False,False,False,True,False,False,False,False,False
2,9342,"ООО ""ЭЙ БИ ГРЕЙН""",Ростовская область,6164128071,"46.21 Торговля оптовая зерном, необработанным ...",Трейдер,https://хартия-апк.радо.рус/uchastniki/31461-o...,https://хартия-апк.радо.рус/uchastniki?page=1&...,False,False,False,False,False,True,False,False,False,False,False
3,9341,"ООО ""РЕВАДА-НЕВА""",Санкт-Петербург,7807054342,46.75.2 Торговля оптовая промышленными химикатами,Производитель Импортер Трейдер,https://хартия-апк.радо.рус/uchastniki/31460-o...,https://хартия-апк.радо.рус/uchastniki?page=1&...,False,False,True,False,True,True,False,False,False,False,False
4,9340,"ООО ТК ""ДТС""",Ростовская область,6168110237,"52.29 Деятельность вспомогательная прочая, свя...",Посредник,https://хартия-апк.радо.рус/uchastniki/31459-o...,https://хартия-апк.радо.рус/uchastniki?page=1&...,True,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9339,5,"ОАО ""Астон""",Ростовская область,6162015019,10.41.5 Производство рафинированных растительн...,Экспортер,https://хартия-апк.радо.рус/uchastniki/21323-5...,https://хартия-апк.радо.рус/uchastniki?page=18...,False,True,False,False,False,False,False,False,False,False,False
9340,4,"ЗАО ""ВИТАЛМАР АГРО""",Москва,7729391002,,Экспортер,https://хартия-апк.радо.рус/uchastniki/21322-4...,https://хартия-апк.радо.рус/uchastniki?page=18...,False,True,False,False,False,False,False,False,False,False,False
9341,3,"ЗАО ""Агропром-Импэкс""",Ростовская область,6154057939,"46.21 Торговля оптовая зерном, необработанным ...",Экспортер,https://хартия-апк.радо.рус/uchastniki/21321-3...,https://хартия-апк.радо.рус/uchastniki?page=18...,False,True,False,False,False,False,False,False,False,False,False
9342,2,"АО ""ОЗК""",Москва,7708632345,73.20.1 Исследование конъюнктуры рынка,Экспортер,https://хартия-апк.радо.рус/uchastniki/21320-2...,https://хартия-апк.радо.рус/uchastniki?page=18...,False,True,False,False,False,False,False,False,False,False,False


In [88]:
df.isna().sum()

number                  0
name                    0
region                  0
inn                     0
okved                   0
profile_raw             0
company_link            0
page_link               0
intermediary            0
exporter                0
producer                0
elevator                0
importer                0
trader                  0
exchange                0
central_counterparty    0
broker                  0
surveyor                0
industry_union          0
dtype: int64

`isna()` reports no missing values in any column.
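
One caveat: `isna()` counts only `NaN`/`None`, so an empty string is not treated as missing (the `okved` cell of row 9340 in the preview above is blank, for example). Below is a minimal sketch with made-up values showing how blank strings could be counted as well. The frame `demo` is hypothetical.

```python
import pandas as pd

# made-up column: one real code, one empty string, one true missing value
demo = pd.DataFrame({'okved': ['46.21', '', None]})

# counts only None/NaN
n_missing = int(demo['okved'].isna().sum())

# counts None/NaN and strings that are empty after stripping whitespace
n_blank = int((demo['okved'].fillna('').str.strip() == '').sum())

print(n_missing, n_blank)
```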

Save the result:

In [84]:
df.to_csv('result.csv')

In [86]:
df.to_excel('result.xlsx')
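
By default `to_csv` also writes the row index as an unnamed first column, and reading the file back with default settings parses `inn` as a number. A hedged sketch of a round trip that avoids both effects is shown below. It uses an in-memory buffer instead of a real file, and the INN value is made up.

```python
import io

import pandas as pd

# made-up row; the leading zero in the INN is what we want to preserve
demo = pd.DataFrame({'number': ['1'], 'inn': ['0105012345']})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)

# read identifiers back as strings so that leading zeros survive
restored = pd.read_csv(buffer, dtype={'number': str, 'inn': str})
print(restored)
```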