In [1]:
%load_ext autoreload
%autoreload 2
import sys, os
from os.path import expanduser
# ACTION REQUIRED: change this to your local project folder
path = "~/Documents/G3/MA-prediction"
path = expanduser(path)
sys.path.append(path)

import pandas as pd
import numpy as np
import datetime
from tqdm import tqdm
import re
# import wrds

pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

In [11]:
from MA_prediction.preprocessing import *
from MA_prediction.mkt_calendar import *

# Data Processing Notebooks Outline

We downloaded raw M&A deal data from SDC Platinum and process it in several notebooks, in order:

- Notebook 0: basic cleaning.
- Notebook 1: match with CRSP database.
- Notebook 2: date correction.
- Notebook 3: pull market data from CRSP.
- Notebook 4: create new variables.
- Notebook 5: process market data.
- Notebook 6: create variables for prediction model.
- Notebook 7: apply filters.

General guidelines for these data processing notebooks:

- We create new columns (variables) on all rows first, before applying any filters.
- When filtering, we never drop rows directly, in case we need to retrieve them later. Instead, we add a column called `retain` that indicates whether each row survives the filters.
- These notebooks should be highly modular: almost every data operation is encapsulated in a function in the helper package. Each function is developed in its own notebook (hence tens of development notebooks), so the end user only needs to read the documentation rather than dig into the code.
- We periodically save intermediate results as an `hdf` file, because some code (especially queries against the CRSP database) takes tens of minutes to run; we want to run it once and store the results for later use. The advantage of `hdf` over `csv` is that it preserves data types such as `datetime.date`. We save as `csv` only when we need to inspect the dataset in `Excel` or `Numbers`.
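The `retain` convention from the guidelines above can be sketched as follows. This is only an illustration: the toy frame, the threshold of 10, and the two filters are assumptions made for the example, not the filters applied in Notebook 7. Only the column names `val` and `statc` are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the deals table (values are made up for illustration).
deals = pd.DataFrame(
    {"val": [120.0, 5.0, None], "statc": ["C", "W", "P"]},
    index=pd.Index([101, 102, 103], name="master_deal_no"),
)

# Start by retaining every row, then let each filter switch rows off.
deals["retain"] = True
deals["retain"] &= deals["val"].notna()   # hypothetical filter: deal value is known
deals["retain"] &= deals["val"] >= 10     # hypothetical filter: minimum deal value

# No row is dropped; downstream code selects the retained rows on demand.
kept = deals[deals["retain"]]
```

Because the rows are flagged rather than removed, a filter can be relaxed later by recomputing `retain` without reloading the raw data.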


## Data Processing 0: Basic Cleaning
Specifically, this notebook does the following:

- Load column names from the report file. Load raw data. Change column names. 
- Transform date-like columns to `datetime.date` dtype. Transform float-like columns to float.
- Correct `consid` for some deals manually.
- Fill missing:
    - `pr_initial` by `pr`. 
    - `one_day` by the trading day before `dao`.
    
## I/O    
- Input: 
    - `df.csv`
- Output: 
    - `df_basic_cleaning.h5`

# Load data
## Load column names
The full column names in the raw data are too long and unwieldy for Python operations, so we replace them with the acronyms that the database uses in the report file. The correspondence is saved as a `csv` file called `column_names.csv`. A separate, comprehensive file, `SDC_MA_guide.pdf`, gives the exact definition of every variable in the database.

In [4]:
filepath = f"{path}/data/reference/report.rpt"
# extract variable acronyms from the report file; the first entry names the index. The acronyms replace the column names later.
colnames = extract_colnames_from_report_file(filepath)
# show the last 10 column names
colnames[-10:]

['pricebook',
 'eqvalcf',
 'eqvalsales',
 'eqval',
 'tlia',
 'cass',
 'clia',
 'lockup',
 'dae',
 'vest']

## Load raw data
Load raw data from the `csv` file. 

In [5]:
# load raw data; treat 'nm' and 'np' as missing values
filepath = f"{path}/data/raw/df.csv"
df = pd.read_csv(filepath, index_col=0, na_values=['nm', 'np'], low_memory=False)

## Change column names
Replace the full column names with the acronyms, and save the correspondence between the two for reference.

In [6]:
# extract full column names, collapsing runs of whitespace into single spaces
colnames_full = [" ".join(name.split()) for name in [df.index.name] + list(df.columns)]

# save the correspondence between acronym and full name for convenience
filepath = f"{path}/data/reference/column_names.csv"
pd.Series(colnames_full, index=colnames, name='column name').to_csv(filepath)

In [10]:
# change column names
df.index.name = colnames[0]
df.columns = colnames[1:]

print_shape(df)
df.tail()

The dataset is of size (12082, 94).


master_deal_no,statc,one_day,aone_day,dao,da,dateannorig_days,de,dateeffexp,dw,definitive_agt,...,pricebook,eqvalcf,eqvalsales,eqval,tlia,cass,clia,lockup,dae,vest
3992461020,P,10/24/22,12/16/22,10/25/22,12/18/22,54,,12/31/23,,Yes,...,8.659,16.011,2.087,4547.2,1748.5,1001.7,802.9,No,No,No
4015877020,P,12/16/22,,12/19/22,12/19/22,0,,02/28/23,,Yes,...,4.839,,0.752,16.141,18.3,14.2,16.4,No,No,No
4016515020,P,12/14/22,12/19/22,12/20/22,12/20/22,0,,06/30/23,,Yes,...,,,,52.581,,,,No,No,No
4017224020,P,12/20/22,12/20/22,12/21/22,12/21/22,0,,03/31/23,,Yes,...,0.75,,2.912,55.152,61.3,97.6,11.2,No,No,No
4019588020,P,12/23/22,,12/27/22,12/27/22,0,,,,No,...,,,0.895,25.412,52.4,34.1,36.4,No,No,No


# Transform date-like and float-like columns
Transform date-like columns to `datetime.date` dtype. Transform float-like columns to float.

In [12]:
# date-like columns to transform
cols_dt = ['one_day', 'aone_day', 'dao', 'da', 'de', 'dateeffexp', 'dw', 'da_date', 'dateval', 'dcom', 'dcomeff']

# apply function to each column
df[cols_dt] = df[cols_dt].apply(convert_date_str_ser_to_datetime)

# numeric-like columns to transform
cols_float = ['val', 'mv', 'amv', 'pr', 'ppmday', 'ppmwk', 'ppm4wk', 'roe', 'tlia', 'cass', 'clia']

# apply function to each column
df[cols_float] = df[cols_float].apply(convert_num_str_ser_to_float)

NameError: name 'atof' is not defined
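The `NameError` above suggests that `atof` was not in scope when the float conversion ran; the standard library provides it as `locale.atof`. The helper's source is not shown here, so the following is only a sketch of what the two conversions could look like. The function names `to_date_series` and `to_float_series` are hypothetical, the date format `%m/%d/%y` is inferred from the table above, and the use of commas as thousands separators is an assumption.

```python
import datetime
import pandas as pd

def to_date_series(ser: pd.Series) -> pd.Series:
    """Parse strings such as '10/24/22' into datetime.date objects (hypothetical helper)."""
    parsed = pd.to_datetime(ser, format="%m/%d/%y", errors="coerce")
    # .dt.date yields an object-dtype Series; unparseable entries stay missing (NaT).
    return parsed.dt.date

def to_float_series(ser: pd.Series) -> pd.Series:
    """Parse strings such as '1,748.5' into floats (hypothetical helper)."""
    def strip_commas(value):
        # Assumption: commas are thousands separators; non-strings pass through unchanged.
        return value.replace(",", "") if isinstance(value, str) else value
    # Anything that still cannot be parsed becomes NaN instead of raising.
    return pd.to_numeric(ser.map(strip_commas), errors="coerce")

dates = to_date_series(pd.Series(["10/24/22", None]))
floats = to_float_series(pd.Series(["1,748.5", "16.011", None, "bad"]))
```

Like the notebook's helpers, each function takes one column, so it can be passed to `DataFrame.apply` over a list of columns.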

# Correct `consid` for some deals manually
Correct `consid` for some deals manually.

In [7]:
# correct data errors
cols = ['consid', 'consido']
df[cols] = correct_consid(df[cols])

# Fill missing 
## `pr_initial` by `pr`

In [8]:
# fill missing `pr_initial` with `pr` (avoids chained assignment)
df['pr_initial'] = df['pr_initial'].fillna(df['pr'])

## `one_day` by the trading day before `dao`

In [9]:
# fill missing `one_day` with the trading day before `dao` (avoids chained assignment)
mask = df['one_day'].isna()
df.loc[mask, 'one_day'] = get_trading_day_offset(df.loc[mask, 'dao'], -1)
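The implementation of `get_trading_day_offset` lives in the helper package and is not shown here. As a rough sketch of the idea, the previous trading day can be approximated with NumPy's business-day calendar. The function name `previous_business_day` is hypothetical, the holiday list must be supplied by the caller (a real market calendar would be needed in practice), and the sketch assumes the input contains no missing dates.

```python
import datetime
import numpy as np
import pandas as pd

def previous_business_day(dates: pd.Series, holidays=()) -> pd.Series:
    """Return the business day strictly before each date (hypothetical helper)."""
    as_days = pd.to_datetime(dates).to_numpy().astype("datetime64[D]")
    holiday_days = np.array(list(holidays), dtype="datetime64[D]")
    # roll="forward" first moves weekend or holiday dates to the next business day,
    # so that stepping back by one lands on the last business day before the input.
    prev = np.busday_offset(as_days, -1, roll="forward", holidays=holiday_days)
    # Converting day-resolution datetime64 values to object yields datetime.date.
    return pd.Series(prev.astype(object), index=dates.index)

dao = pd.Series([
    datetime.date(2022, 10, 25),  # Tuesday
    datetime.date(2022, 12, 19),  # Monday
    datetime.date(2022, 12, 27),  # Tuesday after the observed Christmas holiday
])
one_day = previous_business_day(dao, holidays=["2022-12-26"])
```

With 2022-12-26 passed as a holiday, the three results agree with the `one_day` values that the raw data reports for deals announced on those dates in the table above.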

# Save results

In [10]:
filepath = f"{path}/data/intermediate/df_basic_cleaning.h5"

df.to_hdf(filepath, key = 'df', mode='w')
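The reason for saving to `hdf` rather than `csv` can be seen in a small round trip. The sketch below uses a toy frame and an in-memory buffer instead of the real dataset and files; it shows only the `csv` side, where a `datetime.date` column comes back as plain strings unless it is parsed again on load.

```python
import datetime
import io
import pandas as pd

# Toy frame with a datetime.date column, mirroring the date columns in this notebook.
frame = pd.DataFrame({"dao": [datetime.date(2022, 12, 27)]})

# Write to CSV and read it back through an in-memory buffer.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

original_type = type(frame["dao"].iloc[0])     # datetime.date
reloaded_type = type(reloaded["dao"].iloc[0])  # str: the date type is lost
```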