In [1]:
import pandas as pd
import pickle
import numpy as np
import datetime 
from os.path import join as pjoin
import os
#import argparse
#import yamlb

In [2]:
tmp_data_path =  '../MA_data/data/tmp'
data_path = '../MA_data/data'
s_year = 1997
e_year = 2019

In [3]:
from Master1_data_prepare import MADataLoader

# import data

3 groups of data
- bridge 1: wrds bridge
- bridge 2: evans bridge
- SDC MA data

Before concatenating the data, manually convert every date column in the Excel files to the `YYYY-MM-DD` format.

Guide: select all date variables --> right-click --> Format Cells --> Date --> 2012-03-14 --> OK

![](https://cdn.mathpix.com/snip/images/58hcVJ3qFlC446Ns4SaJKtm-UroEqUKyqu4oCnTWhKY.original.fullsize.png)
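
After the manual conversion it may be worth checking that every date really parses as `YYYY-MM-DD`. A minimal sketch, assuming the dates are read as strings; the helper name `check_date_format` and the toy values are hypothetical and not part of `MADataLoader`:

```python
import pandas as pd

def check_date_format(series, fmt="%Y-%m-%d"):
    """Return the non-missing entries of `series` that do not parse with `fmt`."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # keep entries that were present but could not be parsed
    return series[parsed.isna() & series.notna()]

# toy column: one good date, one date in the wrong format, one missing value
dates = pd.Series(["2012-03-14", "03/14/2012", None])
bad = check_date_format(dates)
```

An empty result means the column is safe to convert to `Timestamp`.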


###  sdc data

1. drop rows where either `ACU` or `TCU` is NA
1. fill `DEAL_NO` NAs with -1
1. convert all identifiers to `str`, including `ACU`, `AUP`, `TCU`, `TUP`, `DEAL_NO`, `GVKEY`
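
The three steps above could look roughly like this. This is only a sketch: the function name `clean_sdc_ids` and the toy rows are hypothetical, and casting `DEAL_NO` through `int64` (to avoid a trailing `.0` in the string) is my assumption, not necessarily what `MADataLoader` does.

```python
import numpy as np
import pandas as pd

ID_COLS = ["ACU", "AUP", "TCU", "TUP", "DEAL_NO", "GVKEY"]

def clean_sdc_ids(df):
    """Apply the three cleaning steps to a copy of `df`."""
    out = df.dropna(subset=["ACU", "TCU"]).copy()                # step 1
    out["DEAL_NO"] = out["DEAL_NO"].fillna(-1).astype("int64")   # step 2
    for col in ID_COLS:                                          # step 3
        if col in out.columns:
            out[col] = out[col].astype(str)
    return out

# toy data (hypothetical): the second row has no ACU, the third no DEAL_NO
toy = pd.DataFrame({
    "ACU": ["879286", None, "59540G"],
    "TCU": ["397627", "879286", "397627"],
    "DEAL_NO": [608045020.0, 1751016020.0, np.nan],
})
clean = clean_sdc_ids(toy)
```

Note that `astype(str)` turns a remaining NA into the string `'nan'`, so NAs should be handled before the cast.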


### bridges data

1. match variable types for merging
    1. convert CUSIP/GVKEY/DEALNUM to `string`
        - do not keep leading 0s
            - e.g. `002030` is truncated to `2030`
    2. convert all time variables to pandas `Timestamp` instances
2. drop or fill NAs

#### evans_bridge

for evans_bridge, the identifiers are loaded as float:
1. fill NAs with -1
2. convert all variables to integer
3. convert all variables to string


**so the GVKEY has no leading 0s**
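
A minimal sketch of the float --> integer --> string conversion, using two rows from the sample printed by the loader further down. The helper name `float_ids_to_str` is hypothetical.

```python
import numpy as np
import pandas as pd

def float_ids_to_str(df, cols):
    """Fill NAs with -1, cast to integer, then to string (on a copy)."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(-1).astype("int64").astype(str)
    return out

evans_toy = pd.DataFrame({
    "DealNumber": [608045020.0, 1751016020.0],
    "agvkey": [22974.0, 8148.0],
    "tgvkey": [4503.0, np.nan],   # missing target GVKEY
})
evans_str = float_ids_to_str(evans_toy, ["DealNumber", "agvkey", "tgvkey"])
```

Going through the integer type is what removes both the decimal part and any leading zeros.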

#### wrds_bridge

- `GVKEY` and `CUSIP`, load as int; so just convert to str
- no need to worry about NA

In [4]:
sdc = MADataLoader(tmp_data_path, data_path, s_year, e_year, glaspe=True)

WRDS Linking Table looks like:
        CUSIP  GVKEY     LINKDT  LINKENDDT                          CONM
9348   879286  10400 1983-10-10 1988-10-28  TELECOMMUNICATIONS SPECIALST
4303   397627   5339 1969-09-11 1996-03-29       GREINER ENGINEERING INC
24514  59540G  66288 1997-12-04 2099-12-31          MID PENN BANCORP INC
ATTENTION, DealNumber, tgvkey, agvkey NAs in evans_bridge are interpolated as '-1'. 
 
       DealNumber  agvkey  tgvkey
8445    608045020   22974    4503
67319  1751016020    8148      -1
73255  1858313020  110250  121659
date variables loading ok 

1997 data shape: (13255, 35)
date variables loading ok 

1998 data shape: (15081, 35)
date variables loading ok 

1999 data shape: (13203, 35)
date variables loading ok 

2000 data shape: (12610, 35)
date variables loading ok 

2001 data shape: (8771, 35)
date variables loading ok 

2002 data shape: (7943, 35)
date variables loading ok 

2003 data shape: (8573, 35)
date variables loading ok 

2004 data shape: (9704, 35)
d

In [15]:
# run variable type check
sdc.var_type_checker()

variable type checking finished, No error Found. 



# Filtering "Majority" MA

For the meaning of the variables, refer to [appendix1.2](Appendix_1_2_variable_description.ipynb).

See p. 523 of Ahern, Kenneth R., and Jarrad Harford. 2014. “The Importance of Industry Links in Merger Waves.” The Journal of Finance 69 (2): 527–76. https://doi.org/10.1111/jofi.12122.


- the acquirer buys 20% or more of the target’s shares: `PCTACQ > 20.0`
- the acquirer owns 51% or more of the target’s shares after the deal: `PCTOWN > 51.0`
- the transaction value is at least 1 million: `VAL > 1`
- legal form of organization of the target or acquirer not restricted


Before applying the restrictions, we should look at the missing pattern of the related variables:
- if a large portion of a variable is missing, it is not a good variable to restrict on. [For the exploration, see Appendix1.1 Q4](./Appendix1_data_explore.ipynb)

| variable   | missing (%) |
|------------|-------------|
| VAL        | 58.802123   |
| PCTACQ     | 27.158338   |
| PSOUGHTOWN | 13.356764   |
| PSOUGHT    | 13.383123   |
| PHDA       | 97.952041   |
| PCTOWN     | 26.963939   |
| PSOUGHTT   | 99.122094   |

So `VAL`, `PHDA`, and `PSOUGHTT` are bad restriction variables: more than half of their values are missing.
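
A missing-value share per variable can be computed in one line with pandas. The sketch below uses a hypothetical helper `missing_pct` and made-up toy values; the real exploration is in the appendix.

```python
import numpy as np
import pandas as pd

def missing_pct(df, cols):
    """Percentage of missing values per column, largest first."""
    return (df[cols].isna().mean() * 100).sort_values(ascending=False)

# toy data (hypothetical values)
toy = pd.DataFrame({
    "VAL": [1.5, np.nan, np.nan, 20.0],
    "PCTACQ": [100.0, 51.0, np.nan, 30.0],
})
pct = missing_pct(toy, ["VAL", "PCTACQ"])
```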


In [5]:
sdc_df = sdc.sdc_df


In [10]:
from filter_helpers import majority_filter

In [11]:
sdc_majority = majority_filter(sdc_df)


original df shape:  (259778, 39) 

filtered df shape:  (241546, 39)


# STATC = C

In [13]:
from filter_helpers import complete_deal_filter

In [14]:
sdc_majority_complete = complete_deal_filter(sdc_majority)

original df shape:  (241546, 39) 

filtered df shape:  (198835, 39)


# Merge GVKEY

## obtain GVKEY from linkings

Rules:
1. use the WRDS linking table as the primary table and EVANS as the secondary one
2. first, use ACU and TCU to search for the GVKEY; if nothing is returned, use AUP and TUP; if there is still no result, the row has to be dropped.
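
Rule 2 can be sketched as a lookup with a fallback. This is not the code of `merge_gvkey`; the function name `lookup_gvkey` is hypothetical, the sketch only covers a CUSIP-keyed bridge, and it ignores the link dates, which are checked later in this notebook. The bridge rows are taken from the WRDS sample printed above; the deal rows are made up.

```python
import pandas as pd

def lookup_gvkey(deals, bridge, direct_col, parent_col):
    """Look up GVKEY by the direct CUSIP, falling back to the parent CUSIP."""
    cusip_to_gvkey = bridge.drop_duplicates("CUSIP").set_index("CUSIP")["GVKEY"]
    direct = deals[direct_col].map(cusip_to_gvkey)
    parent = deals[parent_col].map(cusip_to_gvkey)
    # rows that are still NaN afterwards cannot be matched at all
    return direct.fillna(parent)

bridge = pd.DataFrame({
    "CUSIP": ["879286", "397627", "59540G"],
    "GVKEY": ["10400", "5339", "66288"],
})
deals = pd.DataFrame({
    "ACU": ["879286", "000000", "111111"],   # only the first is in the bridge
    "AUP": ["879286", "397627", "222222"],   # the second matches via the parent
})
agvkey = lookup_gvkey(deals, bridge, "ACU", "AUP")
```

The same call with `"TCU"` and `"TUP"` would handle the target side.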


The reasons for these rules are given in [Appendix 1](./Appendix1_data_explore.ipynb)

+ save the merged data to pickle before checking out Appendix 1!

In [22]:
from merge_helper import merge_gvkey

In [24]:
# 1st, take care of Acquiror

sdc_merged = merge_gvkey(sdc_majority_complete, sdc.wrds_bridge, sdc.name_lst)

ValueError: Length mismatch: Expected axis has 59 elements, new values have 56 elements

## Remove self merge

Sometimes a firm is recorded as merging with itself (`ACU` equals `TCU`); these rows are removed.

In [53]:
merged_w_ut = merged_w_ut[merged_w_ut.ACU != merged_w_ut.TCU]

In [56]:
merged_w_ut = merged_w_ut.reset_index(drop=True)

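## Filter out rows whose GVKEY condition is not ok

### filtering:

A GVKEY is marked as merged successfully under the following condition:

`ok = (GVKEY Found in Bridge table) & (GVKEY in valid time period)`

For each side (acquirer and target), either both identifiers are ok ($C_2^2 = 1$ way) or exactly one of them is ($C_2^1 = 2$ ways), so the number of success conditions is $(C_2^2 + C_2^1) \times (C_2^2 + C_2^1) = 3 \times 3 = 9$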

| ACU ok | AUP  ok | TCU ok | TUP ok | mark as                                           |
|------------------|-------------------|------------------|-------------------|---------------------------------------------------|
| 1                | 1                 | 1                | 1                 | 1                                                 |
| 1                | 1                 | 1                | 0                 | 2                                                 |
| 1                | 1                 | 0                | 1                 | 3                                                 |
| 1                | 0                 | 1                | 1                 | 4                                                 |
| 1                | 0                 | 1                | 0                 | 5                                                 |
| 1                | 0                 | 0                | 1                 | 6                                                 |
| 0                | 1                 | 1                | 1                 | 7                                                 |
| 0                | 1                 | 1                | 0                 | 8                                                 |
| 0                | 1                 | 0                | 1                 | 9                                                 |
|                  |                   |                  |                   | all other combinations cannot be analysed, mark as 0 |
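
The nine usable combinations in the table can be enumerated directly, which also confirms the count. A small sketch; the names `usable` and `marks` are only for illustration.

```python
from itertools import product

# each tuple is (ACU ok, AUP ok, TCU ok, TUP ok)
usable = [
    flags for flags in product([1, 0], repeat=4)
    # at least one acquirer identifier and at least one target identifier is ok
    if (flags[0] or flags[1]) and (flags[2] or flags[3])
]

# number the combinations in the same order as the table
marks = {flags: i for i, flags in enumerate(usable, start=1)}
```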


1. Target and acquirer must each be matched successfully by at least one table
1. for records (acquirer or target) that the evans bridge did not match but the wrds bridge did, confirm that the `date announced` (`DA`) of the deal falls within the effective time period of the link.

    Why use `DA` instead of `DE`? Because `DA` has fewer NAs.
    
    
* I chose wrds as my primary linking table. If both linking tables match, I use the result from wrds.


In [None]:
def gvkey_checker(df):
    '''
    Add the indicator variables "XXX_OK" (XXX in ACU, AUP, TCU, TUP) and
    'GVKEY_OVERALL' to indicate the GVKEY merge conditions.
    Do not downsize the df! Note that df is modified in place.
    '''
    merged_raw = df

    # two inner helpers
    def mark_gvkey_ok(row, key):
        # ok = (GVKEY found in bridge table) and (DA inside [LINKDT, LINKENDDT])
        if (pd.notna(row['GVKEY_' + key])
                and row['LINKDT_' + key] <= row['DA']
                and row['LINKENDDT_' + key] >= row['DA']):
            return 1
        else:
            return 0

    def mark_gvkey_total_ok(row):
        # direct participants (CU) take priority over ultimate parents (UP)
        if row.ACU_OK and row.TCU_OK:
            return '1'
        elif row.AUP_OK and row.TCU_OK:
            return '2'
        elif row.ACU_OK and row.TUP_OK:
            return '3'
        elif row.AUP_OK and row.TUP_OK:
            return '4'
        else:
            return '0'

    for part in ['A', 'T']:
        for ent in ['CU', 'UP']:
            key = part + ent
            merged_raw[key + '_OK'] = merged_raw.apply(mark_gvkey_ok, key=key, axis=1)

    merged_raw['GVKEY_OVERALL'] = merged_raw.apply(mark_gvkey_total_ok, axis=1)

    # the "XXX_OK" columns are kept here; they disappear later when only
    # the variables in keep_lst are selected
    print('Number of Conditions: \n', merged_raw['GVKEY_OVERALL'].value_counts(), '\n')
    return merged_raw

Here I simplify the conditions.

If the GVKEY of a direct participant exists, we use it instead of the GVKEY of its ultimate parent.

So the simpler version of the GVKEY conditions is:

| ACU ok | AUP  ok | TCU ok | TUP ok | mark as                                           |
|------------------|-------------------|------------------|-------------------|---------------------------------------------------|
| 1                | 1                 | 1                | 1                 | 1                                                 |
| 1                | 1                 | 1                | 0                 | 1                                                 |
| 1                | 1                 | 0                | 1                 | 3                                                 |
| 1                | 0                 | 1                | 1                 | 1                                                 |
| 1                | 0                 | 1                | 0                 | 1                                                 |
| 1                | 0                 | 0                | 1                 | 3                                                 |
| 0                | 1                 | 1                | 1                 | 2                                                 |
| 0                | 1                 | 1                | 0                 | 2                                                 |
| 0                | 1                 | 0                | 1                 | 4                                                 |
|                  |                   |                  |                   | all other combinations cannot be analysed, mark as 0 |
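
The simplified table can also be expressed without a row-wise `apply`. The sketch below is an alternative, not the code used in this notebook; it assumes the four 0/1 columns `ACU_OK`, `AUP_OK`, `TCU_OK`, `TUP_OK` already exist, and the function name `mark_overall` is hypothetical.

```python
import numpy as np
import pandas as pd

def mark_overall(df):
    """Vectorised GVKEY_OVERALL from the four 0/1 indicator columns."""
    a_cu, a_up = df["ACU_OK"].astype(bool), df["AUP_OK"].astype(bool)
    t_cu, t_up = df["TCU_OK"].astype(bool), df["TUP_OK"].astype(bool)
    conditions = [a_cu & t_cu, a_up & t_cu, a_cu & t_up, a_up & t_up]
    # np.select picks the first condition that holds, so the order of the
    # list encodes the priority of direct participants over parents
    marks = np.select(conditions, ["1", "2", "3", "4"], default="0")
    return pd.Series(marks, index=df.index)

# the nine rows of the table plus one combination that cannot be analysed
rows = [(1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 1, 0),
        (1, 0, 0, 1), (0, 1, 1, 1), (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1)]
flags = pd.DataFrame(rows, columns=["ACU_OK", "AUP_OK", "TCU_OK", "TUP_OK"])
overall = mark_overall(flags)
```

On a table with several hundred thousand rows this is usually much faster than `apply(..., axis=1)`.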



In [53]:

merged_raw[merged_raw['GVKEY_OVERALL'] != '0' ].sample(3).T

NameError: name 'merged_raw' is not defined

0    273072
3      6784
1      4938
4      3415
2      1625
Name: GVKEY_OVERALL, dtype: int64

In [128]:
merged_raw.to_pickle(pjoin(tmp_data_path , f'max_master1_{s_year}_{e_year}.pickle'))

# only keep some variables

`merged_raw['GVKEY_OVERALL']` must be one of ['1', '2', '3', '4']; all other helper variables are dropped (the result only contains name_lst + AGVKEY + TGVKEY + GVKEY_OVERALL)


In [132]:
def gvkey_filter_a(row):
    if row['GVKEY_OVERALL'] in ['1', '3']:
        return row['GVKEY_'+'ACU']
    elif row['GVKEY_OVERALL'] in ['2', '4']:
        return row['GVKEY_'+'AUP']
    
    
def gvkey_filter_t(row):
    if row['GVKEY_OVERALL'] in ['1', '2']:
        return row['GVKEY_'+'TCU']
    elif row['GVKEY_OVERALL'] in ['3', '4']:
        return row['GVKEY_'+'TUP']

        
    

In [145]:
merged_raw_no0 = merged_raw[merged_raw['GVKEY_OVERALL'] != '0' ].copy()

In [146]:
merged_raw_no0['AGVKEY'] = merged_raw_no0.apply(gvkey_filter_a, axis=1)

In [147]:
merged_raw_no0['TGVKEY'] = merged_raw_no0.apply(gvkey_filter_t, axis=1)

In [144]:
keep_lst = [
    'ACU', 'ASIC2', 'ABL', 'ANL', 'APUBC', 'AUP', 'AUPSIC', 'AUPBL', 'AUPNAMES', 'AUPPUB',
    'BLOCK','CREEP','DA','DE','STATC','SYNOP','VAL','PCTACQ','PSOUGHTOWN','PSOUGHT','PHDA','PCTOWN','PSOUGHTT','PRIVATIZATION','DEAL_NO',
    'TCU', 'TSIC2', 'TBL', 'TNL', 'TPUBC', 'TUP', 'TUPSIC', 'TUPBL', 'TUPNAMES', 'TUPPUB',
    'AGVKEY', 'TGVKEY','GVKEY_OVERALL'
] 

In [148]:
merged = merged_raw_no0[keep_lst]

In [150]:
merged.shape

(16762, 38)

# Add 2 variables SIC and YEAR

In [152]:
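def get_sic(df):
    '''
    df: the sdc table containing the SIC variables `ASIC2` and `TSIC2`

    Adds `SIC_A` and `SIC_T`: the first code of each '/'-separated SIC string.
    '''
    # acquirer: keep the first SIC code; missing values (float NaN) stay NaN
    x = df.ASIC2.str.split('/')
    x = x.transform(lambda x: x[0] if not isinstance(x, float) else np.nan)
    df['SIC_A'] = x

    # target: same logic, applied to TSIC2
    x = df.TSIC2.str.split('/')
    x = x.transform(lambda x: x[0] if not isinstance(x, float) else np.nan)
    df['SIC_T'] = x

    return df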

In [154]:
merged["YEAR"] = merged.DA.dt.year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged["YEAR"] = merged.DA.dt.year


In [155]:
merged = get_sic(merged)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['SIC_A'] = x
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['SIC_T'] = x
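
The warnings above appear because `merged` was created by selecting columns from another DataFrame and then received new columns. One common way to avoid them is to take an explicit copy right after the selection. A minimal sketch on toy data; the names `raw` and `subset` are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "DA": pd.to_datetime(["1997-12-04", "2004-03-14"]),
    "VAL": [1.5, 20.0],
    "helper": [0, 1],
})

# .copy() makes the selection an independent DataFrame, so adding a
# column afterwards is unambiguous and does not touch `raw`
subset = raw[["DA", "VAL"]].copy()
subset["YEAR"] = subset["DA"].dt.year
```

In this notebook the equivalent change would be to add `.copy()` where `merged` is built from `merged_raw_no0[keep_lst]`.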


In [157]:
merged.to_pickle(pjoin(tmp_data_path , f'master1_{s_year}_{e_year}.pickle'))