## Imports and Functions

We begin by loading the necessary packages and defining a function to merge the raw data pulled from Refinitiv in [`\raw`](<..\data\raw>).

In [3]:
import os
import pandas as pd

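def concat_excel(directory):
    """Read every Excel file in `directory` and stack them into one DataFrame."""
    # collect the individual DataFrames before concatenating
    frames = []

    # loop through all files in the directory
    for file in os.listdir(directory):
        if file.endswith(('.xlsx', '.xls')):
            file_path = os.path.join(directory, file)
            # header=6: the seventh row of each sheet holds the column names
            frames.append(pd.read_excel(file_path, header=6))

    # concatenate once instead of growing a DataFrame inside the loop;
    # return an empty DataFrame if no Excel files were found
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()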
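
To illustrate the two pandas arguments the function relies on, here is a minimal sketch on made-up data. It uses `pd.read_csv` on an in-memory string as a stand-in for `pd.read_excel`, so that no Excel engine is needed; the `header` argument has the same meaning in both readers. The file layout (six metadata lines above the header row) and all values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical export: six metadata lines, then the header row, then data.
metadata = [f"meta,line,{i}" for i in range(6)]
rows = ["RIC,Last Trade,Volume", "NESN.S,99.39,35", "ABBN.S,36.95,397"]
raw = "\n".join(metadata + rows)

# header=6 uses the seventh line (zero-based index 6) as the column names
# and discards everything above it.
part_a = pd.read_csv(io.StringIO(raw), header=6)
part_b = pd.read_csv(io.StringIO(raw), header=6)

# ignore_index=True renumbers the rows of the combined frame from 0 upwards
# instead of repeating each file's own row labels.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the combined frame would carry the row labels 0, 1, 0, 1.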

## Merging datasets

We first merge the raw data of each of the three cohorts (made up of 15 SWX equities separated by liquidity level) separately, then combine the results into one dataset. The process is split across code cells because each merge takes a few minutes.

In [6]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'high'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge1.csv'))

In [7]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'medium'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge2.csv'))

In [8]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'low'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge3.csv'))

In [4]:
# use a descriptive name instead of `dir`, which shadows the Python built-in
prelim_dir = os.path.join('..', 'data', 'interim', 'prelim merge')
merged_df = pd.DataFrame()

# loop through the three preliminary merges and stack them
for file in os.listdir(prelim_dir):
    if file.endswith('.csv'):
        file_path = os.path.join(prelim_dir, file)
        df = pd.read_csv(file_path)
        merged_df = pd.concat([merged_df, df], ignore_index=True)
merged_df.to_csv(os.path.join('..', 'data', 'interim', 'merged.csv'), index=False)
print(merged_df.head(10))


   Unnamed: 0      RIC                 Timestamp    Tick  Last Trade  Volume  \
0           0   UBSG.S  29-Jan-2024 17:40:05.746     NaN         NaN     NaN   
1           1   NESN.S  29-Jan-2024 17:40:05.743     NaN         NaN     NaN   
2           2  SRENH.S  29-Jan-2024 17:40:05.742     NaN         NaN     NaN   
3           3   ABBN.S  29-Jan-2024 17:40:05.742     NaN         NaN     NaN   
4           4    KUD.S  29-Jan-2024 17:40:05.736     NaN         NaN     NaN   
5           5   NESN.S  29-Jan-2024 17:39:23.360    UP ↑       99.39    35.0   
6           6   ABBN.S  29-Jan-2024 17:39:07.293  DOWN ↓       36.95   397.0   
7           7   NESN.S  29-Jan-2024 17:39:02.740    UP ↑       99.39    10.0   
8           8   NESN.S  29-Jan-2024 17:34:35.421    UP ↑       99.39   120.0   
9           9  SRENH.S  29-Jan-2024 17:34:35.253  DOWN ↓       99.28   989.0   

      VWAP  Bid Size    Bid    Ask  Ask Size  Turnover          Flow  \
0      NaN       NaN    NaN    NaN       NaN   
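
The `Unnamed: 0` column in the printed output is the row index that `to_csv` writes by default and that `read_csv` then reads back as an ordinary, unlabeled column. The sketch below reproduces this on a made-up two-row frame and shows two common ways to avoid it. The toy values and variable names are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({"RIC": ["NESN.S", "ABBN.S"], "Volume": [35.0, 397.0]})

# Default behaviour: the index is written as an unlabeled first column,
# which comes back as 'Unnamed: 0' when the CSV is read again.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())

# Option 1: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(no_index.columns.tolist())

# Option 2: tell read_csv that the first column is the index.
as_index = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)
print(as_index.columns.tolist())
```

In this notebook the extra column does no harm, because the cleaning step keeps only a named subset of columns.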

## Data cleaning

We clean the data by keeping only the relevant columns, stripping the arrow symbol from 'Tick', and dropping rows that have NaN values in 'Last Trade' and/or 'Volume'.

In [13]:
# define and keep desired columns
columns = ['RIC', 'Timestamp', 'Tick', 'Last Trade', 'Volume']
final_df = merged_df[columns].copy()

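# remove the trailing space and arrow sign in 'Tick' (e.g. 'UP ↑' becomes 'UP');
# slicing through .str leaves missing ticks as NaN instead of turning them into text
final_df['Tick'] = final_df['Tick'].str[:-2]

# drop rows without a trade price or volume
final_df.dropna(subset=['Last Trade', 'Volume'], inplace=True)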
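
As a quick check of the cleaning logic, the following sketch applies the same two operations to a made-up three-row frame that mimics the layout of the merged data. It applies the slice through the `.str` accessor directly, so a missing tick stays missing. The toy values are assumptions for illustration, not real quotes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RIC": ["UBSG.S", "NESN.S", "ABBN.S"],
    "Tick": [np.nan, "UP ↑", "DOWN ↓"],
    "Last Trade": [np.nan, 99.39, 36.95],
    "Volume": [np.nan, 35.0, 397.0],
})

# Drop rows where the price or the volume is missing (quote-only rows).
cleaned = toy.dropna(subset=["Last Trade", "Volume"]).copy()

# Cut the last two characters (a space and the arrow) from each tick label.
cleaned["Tick"] = cleaned["Tick"].str[:-2]
print(cleaned)
```

The slice assumes every tick label ends in exactly one space followed by one arrow character, as in the sample output earlier in the notebook.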

### Check for duplicates

We check for duplicates in the columns 'RIC' and 'Timestamp', which indicate simultaneously executed trades, and keep the last occurrence in each series.

In [15]:
duplicates = final_df.duplicated(subset=['RIC', 'Timestamp'], keep='last')

# count duplicates
num_duplicates = duplicates.sum()
print(f"Number of duplicates found: {num_duplicates}")

final_df.drop_duplicates(subset=['RIC', 'Timestamp'], keep='last', inplace=True)
final_df.to_csv(os.path.join('..', 'data', 'processed', 'tick_data.csv'), index=False)

Number of duplicates found: 902738
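
To make the effect of `keep='last'` concrete, here is a small sketch on made-up rows. Two of the four rows share the same 'RIC' and 'Timestamp'; only the earlier of the two is flagged and dropped. All values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "RIC": ["NESN.S", "NESN.S", "ABBN.S", "NESN.S"],
    "Timestamp": [
        "29-Jan-2024 17:34:35.421",
        "29-Jan-2024 17:34:35.421",  # same RIC and timestamp as the row above
        "29-Jan-2024 17:34:35.421",  # same timestamp, different RIC
        "29-Jan-2024 17:39:02.740",
    ],
    "Volume": [10.0, 120.0, 397.0, 35.0],
})

# With keep='last', every occurrence except the final one is marked True.
mask = toy.duplicated(subset=["RIC", "Timestamp"], keep="last")
print(f"Number of duplicates found: {mask.sum()}")

# drop_duplicates removes exactly the rows that the mask flags.
deduped = toy.drop_duplicates(subset=["RIC", "Timestamp"], keep="last")
print(deduped)
```

Note that the volume of a dropped row is discarded rather than added to the row that is kept.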
