
# 1. Loading libraries.
In the first step, I imported the libraries used for scraping and processing the data. I used `pandas` to read HTML tables from the [Stooq website](https://stooq.pl/), which lists all components of the WIG index, and the `yfinance` library to obtain financial information on companies listed on the Warsaw Stock Exchange (WSE).


In [1]:
import pandas as pd
import yfinance as yf

Next, I mounted Google Drive in the Google Colab environment. All data collected in this notebook is saved to my Google Drive, which makes the datasets easy to access and reload later.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 2. Collecting Ticker Symbols from Stooq

This step scrapes the ticker symbols of the WIG index components, all of which are listed on the Warsaw Stock Exchange. Looping over the pages of Stooq’s HTML table, I extracted each ticker symbol and saved it to a list for later use.

In [4]:
ticker_list = []

for page in range(1, 9):
    page_url = f"https://stooq.pl/q/i/?s=wig&l={page}"
    wig_table = pd.read_html(page_url)[1]
    wig_selected_rows = wig_table.iloc[5:-1, 0].tolist()
    ticker_list.extend(wig_selected_rows)

print(ticker_list)
len(ticker_list)

['06N', '11B', '1AT', '3RG', 'AAT', 'ABE', 'ABS', 'ACG', 'ACP', 'ACT', 'AGO', 'AGT', 'ALE', 'ALI', 'ALL', 'ALR', 'AMB', 'AMC', 'ANR', 'APE', 'APN', 'APR', 'APT', 'ARH', 'ART', 'ASB', 'ASE', 'AST', 'ATC', 'ATD', 'ATG', 'ATP', 'ATR', 'ATS', 'ATT', 'AWM', 'B24', 'BBD', 'BBT', 'BCM', 'BCS', 'BCX', 'BDX', 'BDZ', 'BFT', 'BHW', 'BIO', 'BIP', 'BLO', 'BMC', 'BMX', 'BNP', 'BOS', 'BOW', 'BRS', 'CAP', 'CAR', 'CAV', 'CBF', 'CCC', 'CDL', 'CDR', 'CEZ', 'CIG', 'CLC', 'CLD', 'CLE', 'CLN', 'CMP', 'COG', 'CPL', 'CPR', 'CPS', 'CRI', 'CRJ', 'CRM', 'CSR', 'CTX', 'DAD', 'DAT', 'DBE', 'DCR', 'DEK', 'DEL', 'DGA', 'DGE', 'DIG', 'DNP', 'DOM', 'DVL', 'EAH', 'EAT', 'ECH', 'EHG', 'EKP', 'ELT', 'ENA', 'ENE', 'ENI', 'ENT', 'EQU', 'ERB', 'ERG', 'ETL', 'EUR', 'FAB', 'FEE', 'FMG', 'FON', 'FRO', 'FSG', 'FTE', 'GEA', 'GIF', 'GKI', 'GMT', 'GOP', 'GPP', 'GPW', 'GRN', 'GRX', 'GTC', 'GTN', 'HDR', 'HEL', 'HRP', 'HUG', 'ICE', 'IFI', 'IIA', 'IMC', 'IMS', 'INC', 'ING', 'INK', 'INL', 'INP', 'IPE', 'IPO', 'ITB', 'IZO', 'IZS', 'JRH'

321
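
Because the slice `iloc[5:-1, 0]` depends on the current layout of the Stooq page, a light sanity check on the scraped values can catch layout changes early. The sketch below is one possible approach and is not part of the original notebook: the helper `clean_tickers` and its pattern (two to five uppercase letters or digits) are assumptions based on the tickers printed above.

```python
import re

# Assumed ticker shape: 2-5 uppercase letters or digits (e.g. '06N', 'DNP').
TICKER_PATTERN = re.compile(r"[A-Z0-9]{2,5}")


def clean_tickers(raw_values):
    """Return valid-looking tickers, de-duplicated, in first-seen order."""
    seen = set()
    cleaned = []
    for value in raw_values:
        if not isinstance(value, str):
            continue  # skips NaN/None cells that pandas may produce
        ticker = value.strip()
        if TICKER_PATTERN.fullmatch(ticker) and ticker not in seen:
            seen.add(ticker)
            cleaned.append(ticker)
    return cleaned


# Toy input: a duplicate, a header-like cell, and a missing value.
sample = ["06N", " 11B ", "06N", "Symbol", None, "DNP"]
print(clean_tickers(sample))  # ['06N', '11B', 'DNP']
```

In the notebook, this could be applied as `ticker_list = clean_tickers(ticker_list)` before the exchange suffix is added.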

To query data through the `yfinance` library, each ticker needs a stock exchange suffix. I appended ".WA" to every symbol in `ticker_list`, marking it as listed on the Warsaw Stock Exchange, and stored the modified symbols in a new list called `tickers`.

In [5]:
tickers = [ticker + ".WA" for ticker in ticker_list]

I used `yfinance` to look up company details such as `sector`, `industry`, `companyOfficers`, `shortName`, and `longName`. The `info` dictionary for a single ticker looks like this:

In [2]:
yf.Ticker('DNP.WA').info

{'address1': 'ul. Ostrowska 122',
 'city': 'Krotoszyn',
 'zip': '63-700',
 'country': 'Poland',
 'phone': '48 62 725 5400',
 'fax': '48 62 725 5467',
 'website': 'https://www.grupadino.pl',
 'industry': 'Grocery Stores',
 'industryKey': 'grocery-stores',
 'industryDisp': 'Grocery Stores',
 'sector': 'Consumer Defensive',
 'sectorKey': 'consumer-defensive',
 'sectorDisp': 'Consumer Defensive',
 'longBusinessSummary': "Dino Polska S.A., together with its subsidiaries, operates a network of mid-sized grocery supermarkets under the Dino brand name in Poland. The company offers range of food products, including meat, poultry and cold cuts, fruit and vegetables, bread, and dairy products, as well as other food, chemical, and cosmetic products; grocery products, such as children's food, breakfast products, ready to eat meals, beverages, candies, snacks, frozen goods, processed goods, oils, grain and bulk products, condiments, and alcohol and cigarettes; and non-grocery products, which include

In [6]:
company_data = []

for ticker in tickers:
    stock = yf.Ticker(ticker)
    info = stock.info  # dictionary of company metadata and financial ratios

    # .get() returns None when a field is missing for a company
    sector = info.get('sector')
    industry = info.get('industry')
    name = info.get('longName')
    short_name = info.get('shortName')
    roe = info.get('returnOnEquity')
    roa = info.get('returnOnAssets')
    p_to_book = info.get('priceToBook')
    debt_to_eq = info.get('debtToEquity')
    beta = info.get('beta')

    # One row per company
    company_data.append({
        'longName': name,
        'shortName': short_name,
        'ticker': ticker,
        'sector': sector,
        'industry': industry,
        'returnOnEquity': roe,
        'returnOnAssets': roa,
        'p_to_book': p_to_book,
        'debt_to_eq': debt_to_eq,
        'beta': beta
    })

company_df = pd.DataFrame(company_data)

In [7]:
company_df.to_csv("/content/drive/MyDrive/Projects/network-analysis/company_financials_28102024", index=False)
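
The collection loop makes one remote lookup per ticker, so a single failed request may interrupt the whole run. The sketch below shows one way to make it more tolerant; it is an illustration under assumptions, not the notebook's method. The names `FIELDS`, `fetch_company_rows`, and the injected `fetch_info` callable are hypothetical, and the callable is passed in so that the logic can be tried without network access.

```python
# Output column -> key in the `info` dictionary (same fields as the loop above).
FIELDS = {
    "longName": "longName",
    "shortName": "shortName",
    "sector": "sector",
    "industry": "industry",
    "returnOnEquity": "returnOnEquity",
    "returnOnAssets": "returnOnAssets",
    "p_to_book": "priceToBook",
    "debt_to_eq": "debtToEquity",
    "beta": "beta",
}


def fetch_company_rows(tickers, fetch_info):
    """Collect one row per ticker; record tickers whose lookup raised."""
    rows, failed = [], []
    for ticker in tickers:
        try:
            info = fetch_info(ticker) or {}
        except Exception:
            failed.append(ticker)  # keep going, retry these later if needed
            continue
        row = {"ticker": ticker}
        row.update({column: info.get(key) for column, key in FIELDS.items()})
        rows.append(row)
    return rows, failed


# Stand-in for the real lookup, used only to demonstrate the behaviour.
def fake_fetch(ticker):
    if ticker == "BAD.WA":
        raise RuntimeError("simulated network error")
    return {"sector": "Consumer Defensive", "priceToBook": 5.0}


rows, failed = fetch_company_rows(["DNP.WA", "BAD.WA"], fake_fetch)
print(len(rows), failed)  # 1 ['BAD.WA']
```

With the real data source, the call could look like `rows, failed = fetch_company_rows(tickers, lambda t: yf.Ticker(t).info)`, followed by `pd.DataFrame(rows)`.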

# 3. Creating a DataFrame with Company Officers.
In this section, I retrieved data on the officers of each listed company. For each ticker, I extracted the officers' names and titles and stored them as one row per officer. This information will later be used to map connections among company officers in the network analysis.

In [None]:
officers_data = []

for ticker in tickers:
    stock = yf.Ticker(ticker)
    info = stock.info
    # Fall back to an empty list when no officer data is available
    company_officers = info.get('companyOfficers', [])

    # One row per officer, tagged with the company's ticker
    for officer in company_officers:
        officers_data.append({
            'officer_name': officer.get('name', 'No data'),
            'officer_title': officer.get('title', 'No data'),
            'ticker': ticker
        })

officer_df = pd.DataFrame(officers_data)
officer_df

| | officer_name | officer_title | ticker |
|---|---|---|---|
| 0 | Mr. Miroslaw Janisiewicz | President of the Management Board | 06N.WA |
| 1 | Mr. Przemyslaw Piotr Marszal | President of the Management Board | 11B.WA |
| 2 | Mr. Grzegorz Miechowski | Member of Management Board | 11B.WA |
| 3 | Mr. Michal Wojciech Drozdowski | Member of the Management Board | 11B.WA |
| 4 | Mr. Pawel Feldman | Member of the Management Board | 11B.WA |
| ... | ... | ... | ... |
| 1592 | Mr. Wieslaw Nowak | President of the Management Board & CEO | ZUE.WA |
| 1593 | Mr. Marcin Wisniewski | Vice President of the Management Board & Direc... | ZUE.WA |
| 1594 | Mr. Maciej Nowak | Vice President of Management Board and Legal &... | ZUE.WA |
| 1595 | Mr. Jerzy Czeremuga | Vice President of the Management Board & Direc... | ZUE.WA |


In [None]:
officer_df.to_csv("/content/drive/MyDrive/Projects/network-analysis/company_officers_28102024", index=False)
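
To illustrate how the officer table can feed a network analysis, the sketch below finds officers who appear at more than one company, which is one simple way to derive links between firms. It is only an assumption about how the data might be used later: the helper `shared_officers` is hypothetical, the sample records are made up, and matching people by name alone can wrongly merge different persons who share a name.

```python
from collections import defaultdict


def shared_officers(records):
    """Map each officer name to its tickers, keeping names seen at 2+ companies."""
    companies = defaultdict(set)
    for record in records:
        companies[record["officer_name"]].add(record["ticker"])
    return {
        name: sorted(tickers)
        for name, tickers in companies.items()
        if len(tickers) > 1
    }


# Made-up records with the same columns as officer_df.
sample_records = [
    {"officer_name": "Officer A", "officer_title": "President", "ticker": "AAA.WA"},
    {"officer_name": "Officer B", "officer_title": "Member", "ticker": "AAA.WA"},
    {"officer_name": "Officer A", "officer_title": "Member", "ticker": "BBB.WA"},
]

print(shared_officers(sample_records))  # {'Officer A': ['AAA.WA', 'BBB.WA']}
```

On the real data, the input could be produced with `officer_df.to_dict("records")`.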

# 4. Collecting Shareholder Data
Using the ticker list gathered in step 2, I retrieved shareholder data for each company from Stooq. Each table was tagged with its ticker symbol, and all tables were then combined into a single DataFrame.

In [None]:
urls = []
for company in ticker_list:
  url = f'https://stooq.pl/q/h/?s={company}'
  urls.append(url)

In [None]:
wig_tables = []
# Pair each URL with its ticker so every table can be tagged with its company
for ticker_symbol, url in zip(ticker_list, urls):
    table = pd.read_html(url)[1].copy()
    table['ticker'] = ticker_symbol
    wig_tables.append(table)

In [None]:
final_wig_table = pd.concat(wig_tables, ignore_index=True)

In [None]:
# 'Lp' is the ordinal number column; non-numeric values become NaN
final_wig_table['Lp'] = pd.to_numeric(final_wig_table['Lp'], errors='coerce')

In [None]:
# Keep only the numbered rows
final_wig_table = final_wig_table.dropna(subset=['Lp'])

In [None]:
final_wig_table.to_csv("/content/drive/MyDrive/Projects/network-analysis/shareholder_data_15102024", index=False)
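
The cleaning step relies on the `Lp` column: converting it with `errors='coerce'` turns every non-numeric entry into `NaN`, so that `dropna` removes those rows. The sketch below reproduces the idea on a toy table. The column `holder`, the helper `keep_numbered_rows`, and all values are made up for illustration, and it is only an assumption that the unwanted rows in the Stooq tables are header-like or empty rows.

```python
import pandas as pd

# Toy stand-in for the concatenated tables; the values are invented.
raw = pd.DataFrame({
    "Lp": ["1", "2", "Lp", None, "3"],
    "holder": ["Holder A", "Holder B", "header row", "empty row", "Holder C"],
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
})


def keep_numbered_rows(df, column="Lp"):
    """Coerce `column` to numbers and drop rows where that fails."""
    out = df.copy()  # leave the input table untouched
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out.dropna(subset=[column]).reset_index(drop=True)


clean = keep_numbered_rows(raw)
print(clean)
```

Only the three rows with a numeric `Lp` remain, and each keeps its `ticker` value.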

# 5. Conclusion

All gathered data has been saved in CSV format and will serve as a foundation for the next phase, where it will be transformed for network analysis.