## Final Report

This is the final report notebook for Team 12 in FIN 377. In this notebook we walk through our data process, our modeling, and our conclusions.

To start, these are all the packages we used throughout the project.

In [None]:
# Data download and cleaning packages.
import pandas as pd
import os
import zipfile
import numpy as np
# We tried to fuzzy-match firms by name, but ran into trouble and ended up matching the CUSIPs manually.
# !pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process

# Modeling packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, accuracy_score, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# Other Packages
from tqdm import tqdm
# We did not end up using wrds in our final process, but we thought it was a cool database and included it.
# !pip install wrds
import wrds

### On to the data!

We used a lot of data in this project. It was very hard to clean, but we got there in the end.

The goal was to end up with a merged dataset of 3 inputs:
- Jay Ritter's SPAC data
    - This data contains all SPAC mergers from 2016-2021.
- Jay Ritter's IPO data
    - This data contains all IPOs since 1975.
- CCM data
    - This data contains 950 columns of observations on firms from 2000-2018.

We tried a lot of methods to get this final dataset right.
1. We started by merging on the different CUSIPs provided in each of the datasets.
    - The merge itself ran without errors, but the resulting dataset contained 0 SPAC observations.
    - We learned that some SPAC CUSIPs matched the IPO data and some matched the CCM data, but no SPAC had a CUSIP that matched both datasets.
2. FuzzyWuzzy: We tried running a fuzzy match on the company names given in each of the datasets.
    - This worked somewhat well, but was prone to inaccurate matches.
    - For example, there was a company called "Acquisition", which the fuzzy match thought looked similar to any SPAC with a name like "Bob Joe Acquisition Corp".
    - After we deleted this company, we ran a fuzzy match that took 3 hours!
    - It found matches for about 60% of the data, but once we required a confidence level of 90%, the fuzzy match kept only around 20% of SPACs.
3. Manually adding merge keys
    - We realized that the most accurate way to gather all the data would be to merge on CUSIP. All of the datasets have CUSIPs, and although some differ between datasets, we could manually go through the SPAC dataset and match each SPAC to the IPO and CCM datasets.
    - This worked! We ended up creating CCM_Cusip.csv, which has CUSIP merge keys for both the IPO data and the CCM data.
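
The "Acquisition" problem from the fuzzy-matching attempt likely comes from generic words dominating the similarity score. Below is a minimal sketch, not the code used in the project, of one way to guard against it: strip generic tokens before scoring. It uses Python's standard-library `difflib` as a stand-in for fuzzywuzzy, and the `GENERIC_TOKENS` list, the helper names, and the default threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical list of words that carry little identifying information.
GENERIC_TOKENS = {"acquisition", "corp", "corporation", "inc", "holdings", "co"}


def normalize(name):
    """Lowercase a company name and drop generic tokens."""
    tokens = [t.strip(".,") for t in name.lower().split()]
    kept = [t for t in tokens if t and t not in GENERIC_TOKENS]
    return " ".join(kept)


def similarity(a, b):
    """Return a 0-100 similarity score between two normalized names."""
    a_norm, b_norm = normalize(a), normalize(b)
    if not a_norm or not b_norm:
        # A name made only of generic words cannot identify a firm.
        return 0
    return round(100 * SequenceMatcher(None, a_norm, b_norm).ratio())


def best_match(name, candidates, threshold=90):
    """Return (candidate, score) for the best match at or above threshold, else None."""
    scored = [(c, similarity(name, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

With this normalization, a firm named only "Acquisition" scores 0 against every SPAC instead of looking like a near match, so it no longer needs to be deleted by hand before matching.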

#### Below is the final merge we came up with

In [None]:
# Loading the data into DataFrames
ipo_age_df = pd.read_csv('inputs/IPO-age(9).csv')
cleaned_spacs = pd.read_csv('inputs/CCM_Cusip.csv')
ccm_df = pd.read_csv('inputs/all_ccm_data.csv')
# Narrowing the date range to be more memory-friendly ('offer date' is a YYYYMMDD integer)
ipo_age_df = ipo_age_df[ipo_age_df['offer date'] > 20000000]
ipo_age_df = ipo_age_df[ipo_age_df['offer date'] < 20190000]
# Dropping the last three columns
ipo_age_df = ipo_age_df.iloc[:, :-3]
# Merging the SPAC merge keys onto the IPO data
ziggymerge = pd.merge(ipo_age_df, cleaned_spacs, how='left', left_on='CUSIP', right_on='IPO_age_Cusip')
# Merging the previous merge onto the CCM data
ziggymerge2 = pd.merge(left=ccm_df, right=ziggymerge, how='left', left_on='cusip', right_on='CCM_Cusip')
ziggymerge2.to_csv('inputs/masterMerge.csv')
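
Since our first attempt at merging ended with 0 SPAC observations, it is worth checking how many rows actually match after a merge like this one. The sketch below is an illustration on made-up toy data, not the check we ran: it uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The toy values and the variable names `ipo`, `spacs`, and `n_spacs_matched` are hypothetical; only the key column names mirror the merge above.

```python
import pandas as pd

# Toy stand-ins for the IPO table and the SPAC merge-key table (values are made up).
ipo = pd.DataFrame({
    "CUSIP": ["A1", "B2", "C3"],
    "offer date": [20100101, 20150615, 20180301],
})
spacs = pd.DataFrame({
    "IPO_age_Cusip": ["B2", "Z9"],
    "CCM_Cusip": ["B2X", "Z9X"],
})

# Same left merge as above, with a _merge column recording where each row came from.
merged = pd.merge(
    ipo, spacs, how="left",
    left_on="CUSIP", right_on="IPO_age_Cusip",
    indicator=True,
)

# Rows marked "both" are IPO rows that found a SPAC merge key.
n_spacs_matched = int((merged["_merge"] == "both").sum())
print(merged["_merge"].value_counts())
print("SPAC rows matched:", n_spacs_matched)
```

If the count of `both` rows comes out as 0, the merge keys do not line up and the problem is caught before the second merge.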

We then filtered this data down to the columns that were Streamlit-friendly and the numeric values we thought would be best to model.
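
The exact filter is not reproduced here. As a rough sketch under stated assumptions, one way to narrow a wide table to modeling candidates is to keep only numeric columns and drop those that are mostly empty. The toy DataFrame, the `min_coverage` threshold of 0.5, and the variable names are hypothetical, and this says nothing about which columns were actually kept.

```python
import pandas as pd

# Toy data standing in for the merged dataset (values are made up).
df = pd.DataFrame({
    "name": ["w", "x", "y", "z"],
    "assets": [1.0, 2.0, 3.0, 4.0],
    "sparse": [1.0, None, None, None],
})

# Hypothetical cutoff: keep columns with at least half of their values present.
min_coverage = 0.5

# Keep numeric columns only, then drop the ones with too many missing values.
numeric = df.select_dtypes(include="number")
coverage = numeric.notna().mean()
model_df = numeric.loc[:, coverage >= min_coverage]
print(list(model_df.columns))
```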

## Next up: Modeling!