In [2]:
import pandas as pd
import pickle
import numpy as np
import datetime 
from os.path import join as pjoin
import os
#import argparse
#import yamlb

In [3]:
tmp_data_path =  '../MA_data/data/tmp'
data_path = '../MA_data/data'
s_year = 1997
e_year = 2019

In [4]:
from Master1_data_prepare import MADataLoader

# import data

3 groups of data
- bridge 1: wrds bridge
- bridge 2: evans bridge
- SDC MA data

Before concatenating the data, manually convert every date column in the Excel files to the "YYYY-mm-dd" format.

Guide: select all date var --> right-click --> Cell format --> Date --> 2012-03-14 --> OK

![](https://cdn.mathpix.com/snip/images/58hcVJ3qFlC446Ns4SaJKtm-UroEqUKyqu4oCnTWhKY.original.fullsize.png)


###  sdc data

1. drop rows where either `ACU` or `TCU` is NA
1. fill NAs in `DEAL_NO` with -1
1. convert all identifiers to `str`, including `ACU`, `AUP`, `TCU`, `TUP`, `DEAL_NO`, `GVKEY`
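
The three steps above can be sketched on a toy frame as follows. This is only an illustration: `toy_sdc` and `clean_sdc_ids` are hypothetical names, the values are made up, `GVKEY` is left out of the toy data for brevity, and the integer cast before the string cast is an assumption (it keeps `DEAL_NO` from ending in `.0`). The actual cleaning happens inside `MADataLoader`, which is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy SDC-like frame; all values are invented for illustration.
toy_sdc = pd.DataFrame({
    "ACU": ["578592", None, "81763U"],
    "AUP": ["578592", "882387", "81763U"],
    "TCU": ["882387", "578592", "882387"],
    "TUP": ["882387", "578592", "882387"],
    "DEAL_NO": [904176020.0, 1787671020.0, np.nan],
})

ID_COLS = ["ACU", "AUP", "TCU", "TUP", "DEAL_NO"]


def clean_sdc_ids(df):
    """Hypothetical helper: apply the three cleaning steps to a copy of df."""
    # Step 1: drop rows where ACU or TCU is missing.
    out = df.dropna(subset=["ACU", "TCU"]).copy()
    # Step 2: fill missing DEAL_NO with -1 (integer cast is an assumption).
    out["DEAL_NO"] = out["DEAL_NO"].fillna(-1).astype("int64")
    # Step 3: store every identifier as a string.
    out[ID_COLS] = out[ID_COLS].astype(str)
    return out


cleaned = clean_sdc_ids(toy_sdc)
print(cleaned)
```

Storing the identifiers as strings on both sides of a merge avoids silent mismatches between, say, an integer `7139` and the string `'7139'`.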


### bridges data

1. match variable types for merging
    1. convert CUSIP/GVKEY/DEALNUM to `string`
        - leading 0s are not kept
            - e.g. `002030` is truncated to `2030`
    2. convert all time variables to pandas `Timestamp` instances
2. drop or fill NAs

#### evans_bridge

for evans_bridge, load as float, then:
1. fill NAs with -1
2. convert all variables to integer
3. convert all variables to string


**so the GVKEY has no leading 0s**
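
A minimal sketch of the float to integer to string chain described above, on made-up values. The column names follow the `evans_bridge` output shown further down; `toy_evans` and `floats_to_id_strings` are hypothetical names, not part of `MADataLoader`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for evans_bridge as loaded: float columns with NaN.
toy_evans = pd.DataFrame({
    "DealNumber": [904176020.0, 1787671020.0],
    "agvkey": [8431.0, 2030.0],
    "tgvkey": [np.nan, 15444.0],
})


def floats_to_id_strings(df):
    """Fill NaN with -1, cast to integer, then cast to string."""
    return df.fillna(-1).astype("int64").astype(str)


evans_ids = floats_to_id_strings(toy_evans)
print(evans_ids)
```

Because the values pass through an integer type, a GVKEY that is conventionally written `002030` ends up as `'2030'`, which is why the other tables also have to drop leading zeros before merging.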

#### wrds_bridge

- `GVKEY` and `CUSIP`, load as int; so just convert to str
- no need to worry about NA

In [5]:
sdc = MADataLoader(tmp_data_path, data_path, s_year, e_year, glaspe=True)

WRDS Linking Table looks like:          CUSIP   GVKEY     LINKDT  LINKENDDT                    CONM
6073   578592    7139 1962-01-31 2006-03-31             MAYTAG CORP
29571  81763U  186129 2011-03-25 2099-12-31  SERVICESOURCE INTL INC
9442   882387   10490 1966-01-01 1966-01-30      TEXAS EASTERN CORP
ATTENTION, DealNumber, tgvkey, agvkey NAs in evans_bridge are interpolated as '-1'. 
 
       DealNumber  agvkey tgvkey
30703   904176020    8431     -1
69516  1787671020  116004     -1
74904  1890908020   15444     -1
date variables loading ok 

1997 data shape: (13255, 35)
date variables loading ok 

1998 data shape: (15081, 35)
date variables loading ok 

1999 data shape: (13203, 35)
date variables loading ok 

2000 data shape: (12610, 35)
date variables loading ok 

2001 data shape: (8771, 35)
date variables loading ok 

2002 data shape: (7943, 35)
date variables loading ok 

2003 data shape: (8573, 35)
date variables loading ok 

2004 data shape: (9704, 35)
date variables loading ok

In [7]:
# run variable type check
sdc.var_type_checker()

variable type checking finished, No error Found. 



# Filtering "Majority" MA

For the meaning of the variables, refer to [appendix1.2](Appendix_1_2_variable_description.ipynb)

see P523 Ahern, Kenneth R., and Jarrad Harford. 2014. “The Importance of Industry Links in Merger Waves.” The Journal of Finance 69 (2): 527–76. https://doi.org/10.1111/jofi.12122.


- the acquirer buys 20% or more of the target’s shares: `PCTACQ > 20.0`
- the acquirer owns 51% or more of the target’s shares after the deal: `PCTOWN > 51.0`
- the merger is completed (applied separately in the `STATC = C` section)
- transaction value is at least 1 million: `VAL > 1`
- legal form of organization of the target or acquirer is not restricted
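
As a rough illustration of the two percentage conditions, the sketch below filters a toy frame. The real `majority_filter` lives in `filter_helpers` and is not shown in this notebook, so `majority_filter_sketch` is a hypothetical stand-in. In particular, keeping rows whose percentage is missing is an assumption made here, and the `VAL` condition is left out.

```python
import numpy as np
import pandas as pd

# Toy deals; the numbers are invented.
toy_deals = pd.DataFrame({
    "PCTACQ": [100.0, 15.0, np.nan, 60.0],
    "PCTOWN": [100.0, 15.0, 80.0, 40.0],
})


def majority_filter_sketch(df, min_acq=20.0, min_own=51.0):
    """Hypothetical filter using the thresholds listed above."""
    # Assumption: a missing percentage does not disqualify a deal.
    acq_ok = df["PCTACQ"].isna() | (df["PCTACQ"] > min_acq)
    own_ok = df["PCTOWN"].isna() | (df["PCTOWN"] > min_own)
    return df[acq_ok & own_ok]


kept = majority_filter_sketch(toy_deals)
print(kept)
```

How missing values are treated matters a lot for variables with many NAs, so this choice should be checked against the actual helper.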


Before applying these restrictions, look at the missing-value pattern of the related variables.
- If a large portion of a variable is missing, it is not a good variable to restrict on. [For the exploration, see Appendix1.1 Q4](./Appendix1_data_explore.ipynb)

| Variable   | Missing (%) |
|------------|-------------|
| VAL        | 58.802123   |
| PCTACQ     | 27.158338   |
| PSOUGHTOWN | 13.356764   |
| PSOUGHT    | 13.383123   |
| PHDA       | 97.952041   |
| PCTOWN     | 26.963939   |
| PSOUGHTT   | 99.122094   |

So `VAL`, `PHDA`, and `PSOUGHTT` are bad restrictors: more than half of their values are missing.
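
A missing-value share like the one reported above can be computed in one line with pandas. The sketch below uses a toy frame with invented values. `missing_share` is a hypothetical helper, and the actual exploration is in the appendix notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented.
toy = pd.DataFrame({
    "VAL": [1.5, np.nan, np.nan, np.nan],
    "PCTACQ": [100.0, 51.0, np.nan, 30.0],
})


def missing_share(df, cols):
    """Percentage of missing values per column."""
    return df[cols].isna().mean() * 100


share = missing_share(toy, ["VAL", "PCTACQ"])
print(share)
```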


In [8]:
sdc_df = sdc.sdc_df


In [9]:
from filter_helpers import majority_filter

In [10]:
sdc_majority = majority_filter(sdc_df)


original df shape:  (259778, 39) 

filtered df shape:  (241546, 39)


# STATC = C

In [11]:
from filter_helpers import complete_deal_filter

In [12]:
sdc_majority_complete = complete_deal_filter(sdc_majority)

original df shape:  (241546, 39) 

filtered df shape:  (198835, 39)


# Remove self-merge

In [14]:
sdc_majority_complete.columns

Index(['index', 'ACU', 'ASIC2', 'ABL', 'ANL', 'APUBC', 'AUP', 'AUPSIC',
       'AUPBL', 'AUPNAMES', 'AUPPUB', 'BLOCK', 'CREEP', 'DA', 'DE', 'STATC',
       'SYNOP', 'VAL', 'PCTACQ', 'PSOUGHTOWN', 'PSOUGHT', 'PHDA', 'PCTOWN',
       'PSOUGHTT', 'PRIVATIZATION', 'DEAL_NO', 'TCU', 'TSIC2', 'TBL', 'TNL',
       'TPUBC', 'TUP', 'TUPSIC', 'TUPBL', 'TUPNAMES', 'TUPPUB', 'SIC_A',
       'SIC_T', 'YEAR'],
      dtype='object')

In [16]:
sdc_majority_complete = sdc_majority_complete[sdc_majority_complete['ACU'] != sdc_majority_complete['TCU']]

In [17]:
print("after remove self-merge:", sdc_majority_complete.shape)

after remove self-merge: (197992, 39)


# Merge GVKEY

## obtain GVKEY from linkings

Rules:
1. use the WRDS linking table as the primary table and EVANS as the secondary
2. first, use ACU and TCU to search for the GVKEY; if nothing is returned, use AUP and TUP. If there is still no result, the row has to be dropped.
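
Rule 2 (try the direct CUSIP first, then the ultimate parent) can be sketched as below. This is not the code of `merge_gvkey_wrds`: `lookup_with_fallback` and the toy frames are hypothetical, and the sketch ignores the link validity dates and keeps only one link per CUSIP, which is a simplification.

```python
import pandas as pd

# Toy bridge and deals; CUSIP/GVKEY pairs are for illustration only.
toy_bridge = pd.DataFrame({
    "CUSIP": ["578592", "882387"],
    "GVKEY": ["7139", "10490"],
})
toy_deals = pd.DataFrame({
    "ACU": ["578592", "000001", "000002"],
    "AUP": ["578592", "882387", "000003"],
})


def lookup_with_fallback(deals, bridge, direct_col, parent_col):
    """Look up GVKEY by the direct CUSIP, falling back to the parent CUSIP."""
    # Simplification: keep a single link per CUSIP.
    cusip_to_gvkey = bridge.drop_duplicates("CUSIP").set_index("CUSIP")["GVKEY"]
    direct = deals[direct_col].map(cusip_to_gvkey)
    parent = deals[parent_col].map(cusip_to_gvkey)
    # Use the parent's GVKEY only where the direct lookup found nothing.
    return direct.fillna(parent)


toy_deals["AGVKEY"] = lookup_with_fallback(toy_deals, toy_bridge, "ACU", "AUP")
# Rows with no GVKEY from either identifier are dropped.
matched = toy_deals.dropna(subset=["AGVKEY"])
print(matched)
```

The same function would be called with `"TCU"` and `"TUP"` for the target side.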


For the reasoning behind these rules, refer to [Appendix 1](./Appendix1_data_explore.ipynb)

+ save merged data to pickle before checking out Appendix 1!

In [19]:
from merge_helpers import merge_gvkey_wrds

In [20]:
wrds_merged = merge_gvkey_wrds(sdc_majority_complete, sdc.wrds_bridge)

(197992, 39)
(207549, 44)
(209150, 49)
(240030, 54)
(257498, 59)


In [21]:
wrds_merged.shape

(257498, 59)

## Filter out rows whose GVKEY condition is not ok

### Step 1: WRDS merged result:

the following conditions are marked as GVKEY merged successfully:

`ok = (GVKEY Found in Bridge table) & (GVKEY in valid time period)`

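number of success conditions: each side (acquirer, target) is ok when both of its identifiers or exactly one of them is ok, so $\left(\binom{2}{2} + \binom{2}{1}\right) \times \left(\binom{2}{2} + \binom{2}{1}\right) = 3 \times 3 = 9$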

| ACU ok | AUP  ok | TCU ok | TUP ok | mark as                                           |
|------------------|-------------------|------------------|-------------------|---------------------------------------------------|
| 1                | 1                 | 1                | 1                 | 1                                                 |
| 1                | 1                 | 1                | 0                 | 2                                                 |
| 1                | 1                 | 0                | 1                 | 3                                                 |
| 1                | 0                 | 1                | 1                 | 4                                                 |
| 1                | 0                 | 1                | 0                 | 5                                                 |
| 1                | 0                 | 0                | 1                 | 6                                                 |
| 0                | 1                 | 1                | 1                 | 7                                                 |
| 0                | 1                 | 1                | 0                 | 8                                                 |
| 0                | 1                 | 0                | 1                 | 9                                                 |
|                  |                   |                  |                   | all other combinations are unanalyzable |


1. Target and acquirer must each match successfully with at least one table
1. for records (acquirer or target) that the Evans bridge did not match but the WRDS bridge did, confirm that the `date announced` (`DA`) of the deal falls within the effective time period of the link.

    Why use `DA` instead of `DE`? Because `DA` has fewer NAs.
    
    
* I chose WRDS as my primary linking table. If both linking tables match, I use the result from WRDS.
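
The `ok` condition (GVKEY found and `DA` inside the link period) could be checked along these lines. `LINKDT` and `LINKENDDT` are the column names shown in the WRDS linking table output, while `link_ok`, the toy rows, and the inclusive bounds are assumptions made for this sketch.

```python
import pandas as pd

# Toy merged rows; the deal dates are invented.
toy = pd.DataFrame({
    "GVKEY": ["7139", "7139", None],
    "DA": pd.to_datetime(["2005-08-22", "2007-01-15", "2005-08-22"]),
    "LINKDT": pd.to_datetime(["1962-01-31", "1962-01-31", None]),
    "LINKENDDT": pd.to_datetime(["2006-03-31", "2006-03-31", None]),
})


def link_ok(df):
    """True where a GVKEY was found and DA lies within [LINKDT, LINKENDDT]."""
    found = df["GVKEY"].notna()
    # Comparisons against NaT evaluate to False, so unmatched rows fail here too.
    in_period = (df["LINKDT"] <= df["DA"]) & (df["DA"] <= df["LINKENDDT"])
    return found & in_period


toy["ok"] = link_ok(toy)
print(toy)
```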

Here I simplify the conditions.

If the GVKEY of a direct participant exists, we use it instead of the GVKEY of its ultimate parent.

So the simpler version of the GVKEY condition is:

| ACU ok | AUP  ok | TCU ok | TUP ok | mark as |
|--------|---------|--------|--------|---------|
| 1      | 1       | 1      | 1      | 1       |
| 1      | 1       | 1      | 0      | 1       |
| 1      | 1       | 0      | 1      | 3       |
| 1      | 0       | 1      | 1      | 1       |
| 1      | 0       | 1      | 0      | 1       |
| 1      | 0       | 0      | 1      | 3       |
| 0      | 1       | 1      | 1      | 2       |
| 0      | 1       | 1      | 0      | 2       |
| 0      | 1       | 0      | 1      | 4       |
| 1      | 0       | 0      | 0      | -1      |
| 0      | 1       | 0      | 0      | -2      |
| 0      | 0       | 1      | 0      | -3      |
| 0      | 0       | 0      | 1      | -4      |
| 0      | 0       | 0      | 0      | 0       |
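
The simplified table can be read as a lookup from the four flags to a status code. The sketch below encodes only the rows listed above. `gvkey_overall` is a hypothetical function (the real logic is in `wrds_gvkey_checker`), it returns `None` for combinations the table does not list, and it returns integers even though the notebook appears to store the codes as strings.

```python
# (ACU ok, AUP ok, TCU ok, TUP ok) -> code, copied from the table above.
SIMPLIFIED_CODES = {
    (1, 1, 1, 1): 1,
    (1, 1, 1, 0): 1,
    (1, 1, 0, 1): 3,
    (1, 0, 1, 1): 1,
    (1, 0, 1, 0): 1,
    (1, 0, 0, 1): 3,
    (0, 1, 1, 1): 2,
    (0, 1, 1, 0): 2,
    (0, 1, 0, 1): 4,
    (1, 0, 0, 0): -1,
    (0, 1, 0, 0): -2,
    (0, 0, 1, 0): -3,
    (0, 0, 0, 1): -4,
    (0, 0, 0, 0): 0,
}


def gvkey_overall(acu_ok, aup_ok, tcu_ok, tup_ok):
    """Return the table's code, or None for a combination it does not list."""
    key = (int(acu_ok), int(aup_ok), int(tcu_ok), int(tup_ok))
    return SIMPLIFIED_CODES.get(key)


print(gvkey_overall(True, False, False, True))
```

Positive codes say which identifier supplies each GVKEY (direct participant or ultimate parent), while codes at or below zero mark rows where at least one side is still unmatched.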


In [22]:
from merge_helpers import wrds_gvkey_checker

In [24]:
merged = wrds_gvkey_checker(wrds_merged)

Number of Conditions: 
 0     156231
-1     45612
-2     18768
-4     16522
3       6083
-3      5677
1       4276
4       3040
2       1289
Name: GVKEY_OVERALL, dtype: int64 



In [25]:
wrds_merged_fail = merged[merged['GVKEY_OVERALL'].isin(['0','-1','-2','-3','-4']) ]

In [26]:
wrds_merged_ok = merged[merged['GVKEY_OVERALL'].isin(['1','2','3','4']) ]

### Step 2: help with EVANS
Since **WRDS fails to match the majority of the MA data**, we need the Evans bridge for the ambiguous GVKEY matches.

Notice that we don't necessarily need both agvkey and tgvkey to match successfully in evans_bridge.
Here are the linking rules:

| WRDS status | If EVANS AGVKEY OK | If EVANS TGVKEY OK | Final Status  | mark_as |
|-------------|--------------------|--------------------|---------------|---------|
| 0           | 1                  | 1                  | 1             | 1       |
| -1          |                    | 1                  | 1             | 2       |
| -2          |                    | 1                  | 1             | 3       |
| -3          | 1                  |                    | 1             | 4       |
| -4          | 1                  |                    | 1             | 5       |
|             |                    |                    | otherwise = 0 (unanalyzable) | 0       |
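
The table above says which Evans GVKEY each WRDS status still needs. A small sketch of that rule follows. `evans_status` and `EVANS_RULES` are hypothetical names (the real logic is in `evans_gvkey_checker`), and what counts as "ok" for an Evans GVKEY is left to the caller; presumably it means the value is not the `'-1'` placeholder.

```python
# WRDS status -> (needs Evans agvkey, needs Evans tgvkey, resulting mark)
EVANS_RULES = {
    0: (True, True, 1),
    -1: (False, True, 2),
    -2: (False, True, 3),
    -3: (True, False, 4),
    -4: (True, False, 5),
}


def evans_status(wrds_status, agvkey_ok, tgvkey_ok):
    """Return the mark from the table, or 0 when the row stays unanalyzable."""
    rule = EVANS_RULES.get(wrds_status)
    if rule is None:
        return 0
    need_a, need_t, mark = rule
    # Every GVKEY that WRDS could not supply must come from the Evans bridge.
    if (agvkey_ok or not need_a) and (tgvkey_ok or not need_t):
        return mark
    return 0


print(evans_status(-1, agvkey_ok=False, tgvkey_ok=True))
```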


In [27]:
from merge_helpers import merge_gvkey_evans

In [28]:
evans_merged = merge_gvkey_evans(wrds_merged_fail, sdc.evans_bridge)

In [29]:
from merge_helpers import evans_gvkey_checker

In [31]:
evans_merged = evans_gvkey_checker(evans_merged)

Number of conditions: 
 0    225007
1      4786
2      4469
5      3727
4      2553
3      2268
Name: GVKEY_EVANS_STATUS, dtype: int64 



# Obtain GVKEY var

and remove those helper variables

## Step 1, obtain GVKEY var from evans merge result

In [32]:
from merge_helpers import create_gvkey_var_evans

In [33]:
evans_result = create_gvkey_var_evans(evans_merged[evans_merged.GVKEY_EVANS_STATUS != '0'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['AGVKEY'] = df.apply(AGVKEY_BUILDER, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['TGVKEY'] = df.apply(TGVKEY_BUILDER, axis=1)


In [34]:
evans_result.shape

(17803, 41)

## Step 2, obtain GVKEY var from WRDS merge result 

`df['GVKEY_OVERALL']` must be one of `[1,2,3,4]`; the other helper variables are dropped (the result only contains name_lst + AGVKEY + TGVKEY)



In [35]:
from merge_helpers import create_gvkey_var_wrds

In [36]:
wrds_result = create_gvkey_var_wrds(wrds_merged_ok)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['AGVKEY'] = df.apply(gvkey_filter_a, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['TGVKEY'] = df.apply(gvkey_filter_t, axis=1)


## Concat the 2 dataframes

In [37]:
merge_result= pd.concat([evans_result, wrds_result], axis=0)

In [38]:
merge_result.shape

(32491, 41)

In [39]:
print(f"saving gvkey merged sdc data to {tmp_data_path} \n")
merge_result.to_pickle(pjoin(tmp_data_path , f'sdc_gvkey_{s_year}_{e_year}.pickle'))

saving gvkey merged sdc data to ../MA_data/data/tmp 

