## Imports and Functions

We begin by loading the necessary packages and defining a function that merges the data pulled from Refinitiv, stored in the [`tick`](<..\data\raw\tick>) directory.

In [2]:
import os
import pandas as pd

def concat_excel(directory):
    """
    Concatenate all Excel files (.xlsx) in a directory into a single DataFrame.

    Args:
        directory (str): Path to the directory containing the Excel files.

    Returns:
        pandas.DataFrame: The concatenated data from all Excel files in the
            directory. Assumes the column names are on the 7th row (header=6)
            of each file, with the data starting on the row below.
    """
    frames = []

    # read every .xlsx file in the directory
    for file in os.listdir(directory):
        if file.endswith('.xlsx'):
            file_path = os.path.join(directory, file)
            # header=6: the 7th row holds the column names; adjust if the export layout differs
            frames.append(pd.read_excel(file_path, header=6))

    # concatenate once at the end; return an empty DataFrame if no files were found
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
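
As a small illustration of why `concat_excel` passes `ignore_index=True`, the sketch below concatenates two toy frames that stand in for two files. The frame names and values are made up for the example and are not taken from the Refinitiv data.

```python
import pandas as pd

# Toy stand-ins for two files read from disk (illustrative values only)
part_a = pd.DataFrame({"RIC": ["ABBN.S", "ABBN.S"], "Volume": [1.0, 4.0]})
part_b = pd.DataFrame({"RIC": ["UBSG.S", "UBSG.S"], "Volume": [3.0, 2.0]})

# Each part carries its own 0..n-1 index, so a plain concat repeats the labels
kept = pd.concat([part_a, part_b])

# ignore_index=True discards the per-file labels and renumbers the rows
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With repeated labels, a lookup such as `kept.loc[0]` would return one row per file, which is rarely what is wanted after a merge.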

## Merging intraday tick data

We first merge the raw tick data (21.12.2023 to 18.03.2024) for each of the three cohorts (15 SWX equities, separated by liquidity level) in the [`tick`](<..\data\raw\tick>) directory, then combine the three results into one dataset. The cohorts are processed separately because each merge takes a few minutes.

In [3]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'high'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge1.csv'))

In [4]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'medium'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge2.csv'))

In [5]:
df = concat_excel(os.path.join('..', 'data', 'raw', 'tick', 'low'))
df.to_csv(os.path.join('..', 'data', 'interim', 'prelim merge', 'merge3.csv'))

In [6]:
# named merge_dir to avoid shadowing the built-in dir()
merge_dir = os.path.join('..', 'data', 'interim', 'prelim merge')

# read the three preliminary merges and concatenate them once
frames = []
for file in os.listdir(merge_dir):
    if file.endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(merge_dir, file)))
merged_df = pd.concat(frames, ignore_index=True)
merged_df.to_csv(os.path.join('..', 'data', 'interim', 'merged.csv'), index=False)
display(merged_df.head(5))


| Unnamed: 0.1 | Unnamed: 0 | RIC | Timestamp | Tick | Last Trade | Volume | VWAP | Bid | Ask | Flow | Calc VWAP | Venue | Bid Size | Ask Size | Turnover |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | ABBN.S | 21-Dec-2023 17:40:16.203 | DOWN ↓ | 37.1 | 1.0 | | | | -227353900.0 | 36.886702 | | | | |
| 1 | 1 | ABBN.S | 21-Dec-2023 17:40:02.423 | | | | | | | -227353800.0 | 36.886702 | | | | |
| 2 | 2 | UBSG.S | 21-Dec-2023 17:40:02.422 | | | | | | | -8667000000.0 | 26.254416 | | | | |
| 3 | 3 | KUD.S | 21-Dec-2023 17:40:02.420 | | | | | | | 6720.38 | 1.202547 | | | | |
| 4 | 4 | SRENH.S | 21-Dec-2023 17:40:02.419 | | | | | | | 1430147000.0 | 94.215826 | | | | |
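
One likely source of the `Unnamed:` columns in the preview is that `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as a header-less column. The sketch below reproduces this on an in-memory toy frame (the values are illustrative, not from the dataset) and shows that `index=False` avoids the extra column. Whether this fully explains both `Unnamed:` columns here is an assumption, since the raw Excel layout is not shown.

```python
import io

import pandas as pd

# Toy frame standing in for one preliminary merge (illustrative values only)
toy = pd.DataFrame({"RIC": ["ABBN.S", "UBSG.S"], "Last Trade": [37.1, 26.25]})

# Default to_csv writes the index as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
round_trip = pd.read_csv(with_index)

# index=False omits the index, so the round trip keeps the original columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(round_trip.columns))  # ['Unnamed: 0', 'RIC', 'Last Trade']
print(list(clean.columns))       # ['RIC', 'Last Trade']
```

Because the cleaning step selects columns by name, the extra columns are harmless in this workflow; they only add to file size and read time.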


## Data cleaning

We clean the data by keeping only the columns needed for the analysis, stripping the arrow symbol from 'Tick', and dropping rows that have NaN values in 'Last Trade' and/or 'Volume'.

In [7]:
# define and keep desired columns
columns = ['RIC', 'Timestamp', 'Tick', 'Last Trade', 'Volume']
final_df = merged_df[columns].copy()

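# strip the trailing space and arrow from 'Tick' (e.g. 'DOWN ↓' -> 'DOWN');
# the .str accessor leaves missing values as NaN instead of turning them into text
final_df['Tick'] = final_df['Tick'].str[:-2]

# drop rows without a trade price or volume
final_df.dropna(subset=['Last Trade', 'Volume'], inplace=True)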
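
The effect of these two cleaning steps can be checked on a toy frame. This is a minimal sketch with made-up rows; the `'UP ↑'` label is an assumption, since only `'DOWN ↓'` appears in the preview. It slices with the `.str` accessor directly, which leaves missing 'Tick' values as NaN.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values; 'UP ↑' is an assumed label)
toy = pd.DataFrame({
    "Tick": ["DOWN ↓", "UP ↑", np.nan],
    "Last Trade": [37.1, np.nan, 26.25],
    "Volume": [1.0, 5.0, 10.0],
})

# Drop the last two characters (space + arrow); NaN entries stay NaN
toy["Tick"] = toy["Tick"].str[:-2]

# Keep only rows that have both a trade price and a volume
cleaned = toy.dropna(subset=["Last Trade", "Volume"])

print(cleaned)
```

The second row is removed because its 'Last Trade' is missing, while the third row survives with a missing 'Tick', since 'Tick' is not part of the `subset`.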

### Check for duplicates

We check for rows that share the same 'RIC' and 'Timestamp', which indicate simultaneously executed trades, and keep the last occurrence in each group.

In [8]:
duplicates = final_df.duplicated(subset=['RIC', 'Timestamp'])

# count duplicates
num_duplicates = duplicates.sum()
print(f"Number of duplicates found: {num_duplicates}")

final_df.drop_duplicates(subset=['RIC', 'Timestamp'], keep='last', inplace=True)
final_df.to_csv(os.path.join('..', 'data', 'processed', 'tick_data.csv'), index=False)

Number of duplicates found: 1662427
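
To make the behaviour of `duplicated` and `drop_duplicates(keep='last')` concrete, here is a minimal sketch on a toy frame. The prices, volumes and timestamps are invented for illustration.

```python
import pandas as pd

# Toy trades: the first two rows share the same RIC and Timestamp
toy = pd.DataFrame({
    "RIC": ["ABBN.S", "ABBN.S", "ABBN.S", "UBSG.S"],
    "Timestamp": [
        "21-Dec-2023 17:40:16.203",
        "21-Dec-2023 17:40:16.203",
        "21-Dec-2023 17:40:17.000",
        "21-Dec-2023 17:40:16.203",
    ],
    "Last Trade": [37.10, 37.12, 37.15, 26.25],
    "Volume": [1.0, 4.0, 2.0, 3.0],
})
key = ["RIC", "Timestamp"]

# duplicated() flags every repeat of a key after its first appearance
n_dupes = int(toy.duplicated(subset=key).sum())

# keep='last' retains the final row of each (RIC, Timestamp) group
deduped = toy.drop_duplicates(subset=key, keep="last")

print(n_dupes)                         # 1
print(deduped["Last Trade"].tolist())  # [37.12, 37.15, 26.25]
```

Note that `keep='last'` retains only the final row per key, so the volume of the dropped rows is not carried over; whether that matters depends on the downstream analysis.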
