## Imports and Functions

We begin by loading the necessary packages and defining the following functions:

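- `concat_excel(directory)`: Concatenates all Excel files pulled from Refinitiv in a given subdirectory of the [`tick`](<..\data\raw\tick>) file directory into one DataFrame.
- `process_summary(directory)`: Reads each Excel file pulled from Refinitiv in the [`summary`](<..\data\raw\summary>) file directory and computes the average bid, ask, and volume per ticker over the sample period.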

In [25]:
import os
import pandas as pd
from datetime import datetime

def concat_excel(directory):
    frames = []

    # loop through all files in the directory
    for file in os.listdir(directory):
        if file.endswith('.xlsx'):
            file_path = os.path.join(directory, file)
            # column names sit on row index 6 of each sheet; adjust if the layout changes
            frames.append(pd.read_excel(file_path, header=6))

    # concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

def process_summary(directory):
    ticker_data = []

    # loop through all files in the directory
    for file in os.listdir(directory):        
        if file.endswith('.xlsx'):
            file_path = os.path.join(directory, file)

            # read without headers to find the correct header row
            temp_df = pd.read_excel(file_path, header=None)
            temp_df = temp_df.astype(str)  # Convert all columns to strings

            header_row = temp_df[temp_df.apply(lambda x: x.str.contains('Exchange Date', na=False)).any(axis=1)].index[0]
            
            # re-read with correct header row
            df = pd.read_excel(file_path, header=header_row)

            # convert 'Exchange Date' to datetime and filter by date range
            df['Exchange Date'] = pd.to_datetime(df['Exchange Date'])
            start_date = datetime(2023, 12, 21)
            end_date = datetime(2024, 3, 18)
            df = df[df['Exchange Date'].between(start_date, end_date)]

            # average bid, ask, and volume over the sample period
            averages = df[['Bid', 'Ask', 'Volume']].mean()

            # use the file name (without extension) as the ticker
            ticker = file.split('.')[0]
            ticker_data.append([ticker, averages['Bid'], averages['Ask'], averages['Volume']])

    # convert list of data to a DataFrame
    main_df = pd.DataFrame(ticker_data, columns=['Ticker', 'Avg Bid', 'Avg Ask', 'Avg Volume'])

    return main_df
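
The header search inside `process_summary` can be tried out on a small in-memory frame. The sketch below is a simplified variant, not the code used above: `find_header_row` is a hypothetical helper, and the toy rows only imitate metadata lines that might precede the column names in an export.

```python
import pandas as pd


def find_header_row(raw: pd.DataFrame, marker: str = "Exchange Date") -> int:
    """Return the label of the first row in which any cell contains `marker`.

    `raw` is expected to be a frame read with header=None (default RangeIndex).
    """
    as_text = raw.astype(str)
    # True for every row in which at least one cell contains the marker text
    hits = as_text.apply(lambda col: col.str.contains(marker, regex=False)).any(axis=1)
    if not hits.any():
        raise ValueError(f"no row contains {marker!r}")
    return int(hits.idxmax())


# toy stand-in for a sheet with two metadata rows above the real header
raw = pd.DataFrame([
    ["Summary", None, None],
    ["Interval: 1D", None, None],
    ["Exchange Date", "Bid", "Ask"],
    ["2024-01-03", 10.0, 10.1],
])
header_row = find_header_row(raw)
print(header_row)
```

Raising an explicit error when the marker is missing gives a clearer message than the `IndexError` that `.index[0]` produces on an empty selection.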

## Merging intraday tick data

We merge the raw tick data (21.12.2023 to 18.03.2024) one cohort at a time. The 15 SWX equities are split into three cohorts by liquidity level (high, medium, low), stored in the [`tick`](<..\data\raw\tick>) file directory. Each cohort is first written to an interim file, and the interim files are then merged into one dataset. The merge is done per cohort because each one takes a few minutes.

In [2]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'high'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge1.csv'))

In [3]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'medium'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge2.csv'))

In [4]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'low'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge3.csv'))

In [5]:
# 'merge_dir' avoids shadowing the built-in dir()
merge_dir = os.path.join('..', 'data', 'interim', 'prelim merge')
merged_df = pd.DataFrame()
for file in os.listdir(merge_dir):
    if file.endswith('.csv'):
        file_path = os.path.join(merge_dir, file)
        df = pd.read_csv(file_path)
        merged_df = pd.concat([merged_df, df], ignore_index=True)
merged_df.to_csv(os.path.join('..', 'data', 'interim', 'merged.csv'), index=False)
print(merged_df.head(5))



   Unnamed: 0      RIC                 Timestamp    Tick  Last Trade  Volume  \
0           0   ABBN.S  21-Dec-2023 17:40:16.203  DOWN ↓        37.1     1.0   
1           1   ABBN.S  21-Dec-2023 17:40:02.423     NaN         NaN     NaN   
2           2   UBSG.S  21-Dec-2023 17:40:02.422     NaN         NaN     NaN   
3           3    KUD.S  21-Dec-2023 17:40:02.420     NaN         NaN     NaN   
4           4  SRENH.S  21-Dec-2023 17:40:02.419     NaN         NaN     NaN   

   VWAP  Bid  Ask          Flow  Calc VWAP Venue  Bid Size  Ask Size  Turnover  
0   NaN  NaN  NaN -2.273539e+08  36.886702   NaN       NaN       NaN       NaN  
1   NaN  NaN  NaN -2.273538e+08  36.886702   NaN       NaN       NaN       NaN  
2   NaN  NaN  NaN -8.667000e+09  26.254416   NaN       NaN       NaN       NaN  
3   NaN  NaN  NaN  6.720380e+03   1.202547   NaN       NaN       NaN       NaN  
4   NaN  NaN  NaN  1.430147e+09  94.215826   NaN       NaN       NaN       NaN  
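
The three per-cohort cells and the final merge follow one pattern, which could also be written as a single loop. The sketch below only illustrates that pattern under stated assumptions: small CSV files in a temporary directory stand in for the Excel exports, and `merge_cohort` is a hypothetical helper, not `concat_excel`.

```python
import os
import tempfile

import pandas as pd


def merge_cohort(directory: str) -> pd.DataFrame:
    """Concatenate every CSV file in `directory` (stand-in for the Excel exports)."""
    frames = [
        pd.read_csv(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.endswith(".csv")
    ]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


with tempfile.TemporaryDirectory() as root:
    # build a toy layout: one folder per cohort, one file per folder
    for cohort, ric in [("high", "ABBN.S"), ("medium", "CLN.S"), ("low", "KUD.S")]:
        os.makedirs(os.path.join(root, cohort))
        toy = pd.DataFrame({"RIC": [ric, ric], "Volume": [1.0, 2.0]})
        toy.to_csv(os.path.join(root, cohort, "part1.csv"), index=False)

    # merge each cohort, then combine the cohort-level frames in one concat
    cohort_frames = [merge_cohort(os.path.join(root, c)) for c in ("high", "medium", "low")]
    merged = pd.concat(cohort_frames, ignore_index=True)

print(merged.shape)
print(list(merged.columns))
```

Writing with `index=False` keeps the row index out of the files, so no `Unnamed: 0` column appears when they are read back.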


## Data cleaning

We keep only the columns needed for the analysis, strip the arrow symbol from the 'Tick' values, and drop rows that have NaN values in 'Last Trade' and/or 'Volume'.

In [6]:
# define and keep desired columns
columns = ['RIC', 'Timestamp', 'Tick', 'Last Trade', 'Volume']
final_df = merged_df[columns].copy()

# strip the trailing space and arrow from 'Tick' (e.g. 'DOWN ↓' -> 'DOWN'); NaN stays NaN
final_df['Tick'] = final_df['Tick'].str[:-2]
# drop rows without a trade price or volume
final_df.dropna(subset=['Last Trade', 'Volume'], inplace=True)
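
To see what this cleaning step does, here is a minimal sketch on a toy frame. The rows are made up for illustration, and the sketch assumes that every non-null 'Tick' value ends in a space followed by a one-character arrow.

```python
import numpy as np
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    "RIC": ["ABBN.S", "ABBN.S", "UBSG.S"],
    "Tick": ["DOWN ↓", np.nan, "UP ↑"],
    "Last Trade": [37.1, np.nan, 26.3],
    "Volume": [1.0, np.nan, 50.0],
})

cleaned = toy.copy()
# drop the last two characters (space + arrow); missing values stay missing
cleaned["Tick"] = cleaned["Tick"].str[:-2]
# keep only rows that carry an actual trade
cleaned = cleaned.dropna(subset=["Last Trade", "Volume"]).reset_index(drop=True)

print(cleaned)
```

Calling `.str[:-2]` directly leaves missing values as NaN, whereas casting to `str` first turns NaN into the text `'nan'` before slicing.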

### Check for duplicates

We check for duplicates on the columns 'RIC' and 'Timestamp', which indicate trades executed at the same timestamp for the same instrument, and keep the last occurrence of each.

In [7]:
duplicates = final_df.duplicated(subset=['RIC', 'Timestamp'])

# count duplicates
num_duplicates = duplicates.sum()
print(f"Number of duplicates found: {num_duplicates}")

final_df.drop_duplicates(subset=['RIC', 'Timestamp'], keep='last', inplace=True)
final_df.to_csv(os.path.join('..', 'data', 'processed', 'tick_data.csv'), index=False)

Number of duplicates found: 1662427
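
The effect of `keep='last'` is easiest to see on a few rows. The following is a minimal sketch with made-up values and placeholder timestamps (`t1`, `t2`), not an excerpt of the real data.

```python
import pandas as pd

# made-up rows; 't1' and 't2' are placeholder timestamps
toy = pd.DataFrame({
    "RIC": ["ABBN.S", "ABBN.S", "ABBN.S", "UBSG.S"],
    "Timestamp": ["t1", "t1", "t2", "t1"],
    "Last Trade": [37.10, 37.12, 37.15, 26.30],
})

keys = ["RIC", "Timestamp"]
# duplicated() flags every repeat after the first occurrence of a key
num_duplicates = int(toy.duplicated(subset=keys).sum())
# keep='last' retains the final row of each (RIC, Timestamp) group
deduped = toy.drop_duplicates(subset=keys, keep="last")

print(num_duplicates)
print(deduped)
```

The count equals the number of rows removed, whichever occurrence is kept; only the surviving row per key differs between `keep='first'` and `keep='last'`.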


## Merging issue-level time series data

We continue by merging summary statistics and market cap data for our modeling time period (21.12.2023 to 18.03.2024).

In [28]:
result_df = process_summary(os.path.join('..', 'data', 'raw', 'summary'))
print(result_df)

   Ticker        Avg Bid        Avg Ask    Avg Volume
0    ABBN      38.430678      38.443390  3.117397e+06
1     CLN      11.331864      11.343051  9.777089e+05
2    DOKA     444.076271     444.703390  4.409932e+03
3    GIVN    3612.220339    3613.694915  1.869014e+04
4     KUD       1.277119       1.304237  6.600285e+04
5    LISN  107949.152542  108413.559322  1.026271e+02
6    LONN     418.901695     419.050847  2.393532e+05
7    NESN      96.595254      96.609492  3.676267e+06
8    SCHN     210.284746     210.516949  2.383359e+04
9    SCMN     510.150847     510.376271  8.610905e+04
10   SIKA     250.783051     250.896610  2.944247e+05
11  SRENH     101.622034     101.660678  8.196646e+05
12   UBSG      25.525763      25.535932  6.549273e+06
13    UHR     211.226271     211.328814  1.652034e+05
14   VLRT      11.042241      11.981356  2.366591e+02


The market cap (MC) data sits in a single file, so we do not need a custom function for it. After reading the file, we strip the `.S` suffix from the identifiers and merge it with our `result_df` dataframe using the ticker as the key.
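
Before relying on a key-based merge, it can help to check that the keys on both sides line up. The sketch below uses made-up tickers and values, not the real files, to show one way of doing that with the `indicator` option of `pd.merge`.

```python
import pandas as pd

# made-up identifiers and values for illustration only
mc = pd.DataFrame({"Identifier": ["AAA.S", "BBB.S", "CCC.S"], "Avg MC": [1.0e9, 2.0e9, 3.0e9]})
summary = pd.DataFrame({"Ticker": ["AAA", "BBB", "DDD"], "Avg Bid": [10.0, 20.0, 30.0]})

# strip the literal '.S' suffix so both key columns use the same format
mc["Identifier"] = mc["Identifier"].str.replace(".S", "", regex=False)

# an outer merge with indicator=True labels each row by where its key was found
check = pd.merge(mc, summary, left_on="Identifier", right_on="Ticker",
                 how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(unmatched[["Identifier", "Ticker", "_merge"]])
```

`pd.merge` defaults to an inner join, so a ticker present in only one table would be dropped without any message. An empty `unmatched` frame confirms that nothing is lost.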

In [42]:
mc_df = pd.read_excel(os.path.join('..', 'data', 'raw', 'market_cap.xlsx'))
mc_df = mc_df[['Identifier', 'Company Name', 'Avg MC']]
mc_df['Identifier'] = mc_df['Identifier'].str.replace('.S', '', regex=False)

sum_df = pd.merge(mc_df, result_df, left_on='Identifier', right_on='Ticker')
print(sum_df)
sum_df.to_csv(os.path.join('..', 'data', 'processed', 'summary_data.csv'), index=False)

   Identifier                            Company Name        Avg MC Ticker  \
0        ABBN                                 Abb Ltd  7.224162e+10   ABBN   
1         CLN                             Clariant AG  3.790858e+09    CLN   
2        DOKA                    Dormakaba Holding AG  1.870050e+09   DOKA   
3        GIVN                             Givaudan SA  3.331868e+10   GIVN   
4         KUD                             Kudelski SA  6.584905e+07    KUD   
5        LISN  Chocoladefabriken Lindt & Spruengli AG  1.453010e+10   LISN   
6        LONN                          Lonza Group AG  3.091884e+10   LONN   
7        NESN                               Nestle SA  2.580040e+11   NESN   
8        SCHN                    Schindler Holding AG  1.407383e+10   SCHN   
9        SCMN                             Swisscom AG  2.644564e+10   SCMN   
10       SIKA                                 Sika AG  4.053589e+10   SIKA   
11      SRENH                             Swiss Re AG  3.213975e