# Predicting Transaction Type (BUY or SELL) Via Stock Trades by Members of the US House of Representatives

**James Bentley, Haicheng Xu**

## Summary of Findings


### Introduction
Objective: Predict whether a trade is a BUY or a SELL.

Type: Binary classification, since each trade is either BUY (1) or SELL (0).

Response Variable: Type of transaction (BUY/SELL). We chose this as an indicator for detecting market trends.

Metric: F1-Score. We use it because it balances precision and recall, and we believe both matter when predicting whether a trade is a buy or a sell.
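To make the metric concrete, here is a minimal sketch that computes precision, recall, and F1 by hand, treating 'buy' as the positive class. The label lists and the helper name `f1_for_label` are made up for illustration and are not taken from our data or code.

```python
# Hypothetical labels for illustration only; 'buy' is the positive class.
y_true = ['buy', 'buy', 'buy', 'sell', 'sell', 'buy']
y_pred = ['buy', 'sell', 'buy', 'buy', 'sell', 'buy']

def f1_for_label(y_true, y_pred, positive='buy'):
    """Return (precision, recall, f1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

precision, recall, f1 = f1_for_label(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is the harmonic mean of precision and recall, it is only high when both are high.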

### Baseline Model
We built our first Pipeline using the KNeighborsClassifier model.

    Features: 
        Quantitative:
            1. Day of transaction 
            2. Month of transaction
            3. Year of transaction        
        Nominal:
            4. Party
        

How: For the quantitative features, we used helper functions to extract the respective parts of the date string. For 'Party', we used one-hot encoding to convert the categorical values into numerical values.

Why: For the quantitative columns, whether a trade is a buy or a sell can depend on the date, since historical events such as the 2008 recession can cause systematic changes in buying and selling behavior. For 'Party', buying and selling can depend on party affiliation, since the parties could have different trading patterns.

Model Performance: Using F1-Score as our evaluation metric, we achieved a score of 0.6602746825602488 on our test set. We consider this model decent, since its score is well above the baseline accuracy of 0.5252973381159146 (guessing all 'buy'), a difference of 0.13497734444433418.

Generalization Ability: Our baseline model's ability to generalize to unseen data is mediocre, since the pipeline improved on the baseline by only about 13.5 percentage points.

### Final Model

    New Features:
        Quantitative:
            1. Day of disclosure 
            2. Month of disclosure
            3. Year of disclosure
            4. est_amount
        Nominal:
            5. ticker

How: For the quantitative features (besides est_amount), we used helper functions to extract the respective parts of the date string. For est_amount, we converted the values to z-scores with StandardScaler. For 'ticker', we used one-hot encoding to convert the categorical values into numerical values.

Why: For 'ticker', whether a trade is a buy or a sell can depend on the performance of a specific stock. For example, when a company is at risk of being delisted, a higher proportion of individuals sell that stock than buy it. For the quantitative columns, buying and selling can depend on the disclosure date, since members of Congress would likely disclose their buys and sales on separate dates. Therefore, we added disclosure dates in addition to transaction dates.

Generalization Ability: The final model reaches a test F1-Score of 0.7225288509784245, compared with the baseline accuracy of 0.5252973381159146 (about 20 percentage points higher) and the baseline model's test F1-Score of 0.6602746825602488. The final model's ability to generalize to unseen data is therefore considerably better than the baseline model's.

### Fairness Analysis

Null Hypothesis: Our model is fair. Its F1-Scores for Republicans and Democrats are roughly the same, and any difference is due to random chance.

Alternative Hypothesis: Our model is unfair. Its F1-Scores for Republicans and Democrats are significantly different.

alpha = 0.05

p-value: 0.0

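Conclusion: We observed a p-value of 0.0 (none of the 200 permuted differences was at least as large as the observed absolute difference of 0.1488955689445054), which is less than our alpha of 0.05. We therefore have enough evidence to reject the null hypothesis, and we conclude that our model predicts more accurately for Republicans.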

## Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Baseline Model

Read in the cleaned stock transaction file, which we combined with party affiliation data in project 3

In [4]:
df = pd.read_csv('transactions_w_party.csv')

In [5]:
# Dropped the index column
df = df.drop(['index'], axis = 1)
# Dropped the three Libertarian Party member's rows to simplify our model
df = df[df['Party']!='Libertarian']

Simplified the four transaction types ('purchase', 'sale_partial', 'sale_full', 'exchange') to just buy and sell

In [323]:
# We dropped exchange
df = df[df['type']!='exchange']
df['type'] = df['type'].replace({'purchase': 'buy', 'sale_partial': 'sell', 'sale_full': 'sell'})

Created training and testing data for our model

In [324]:
# train test split
X = df[['transaction_date','Party']]
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
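The split above is not seeded, so the scores change every time the notebook is rerun. A minimal sketch of a reproducible split follows. It uses a small made-up frame `toy` in place of `df`, and the seed value 42 and the `stratify` argument are our assumptions rather than settings used for the reported results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'transaction_date': ['2020-01-0%d' % (i % 9 + 1) for i in range(20)],
    'Party': ['Democratic', 'Republican'] * 10,
    'type': ['buy'] * 12 + ['sell'] * 8,
})
X_toy = toy[['transaction_date', 'Party']]
y_toy = toy['type']

# random_state fixes the shuffle; stratify keeps the buy/sell ratio
# similar in the training and test sets. The seed 42 is arbitrary.
split_a = train_test_split(X_toy, y_toy, test_size=0.25,
                           random_state=42, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.25,
                           random_state=42, stratify=y_toy)

# Both calls select exactly the same test rows.
same_rows = split_a[1].index.equals(split_b[1].index)
print(same_rows, len(split_a[1]))
```

With a fixed seed, the F1-Scores quoted in the write-up would match the cell outputs on every rerun.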

Helper functions to be used in our function transformer

In [325]:
# Turns string date into number of days since year 0 for the transaction_date column
def num_tdays(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[0]) * 365 + int(ser.split('-')[1]) * 30 + int(ser.split('-')[-1])))
# Turns string date into number of days since year 0 for the disclosure_date column
def num_ddays(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[0]) * 365 + int(ser.split('-')[1]) * 30 + int(ser.split('-')[-1])))
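The two helpers above approximate a day count by treating every year as 365 days and every month as 30 days. As a hedged alternative, the sketch below uses the standard library to get an exact count, assuming the dates are 'YYYY-MM-DD' strings. The names `exact_days` and `approx_days` are ours, and the exact count starts from 0001-01-01 rather than year 0.

```python
from datetime import date

def exact_days(date_string):
    """Days since 0001-01-01 (proleptic Gregorian) for a 'YYYY-MM-DD' string."""
    return date.fromisoformat(date_string).toordinal()

def approx_days(date_string):
    """Same approximation as num_tdays: 365-day years and 30-day months."""
    year, month, day = (int(part) for part in date_string.split('-'))
    return year * 365 + month * 30 + day

# February 2020 had 29 days; the approximation counts it as 30.
exact_gap = exact_days('2020-03-01') - exact_days('2020-02-01')
approx_gap = approx_days('2020-03-01') - approx_days('2020-02-01')
print(exact_gap, approx_gap)
```

For a distance-based model such as KNN, the small errors in the approximation are unlikely to matter much, but the exact version avoids them at no extra cost.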

In [326]:
# Turns string date into day of the month for transaction_date
def day(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[-1])))
# Turns string date into month for transaction_date
def month(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[1])))
#Turns string date into year for transaction_date
def year(df):
    return pd.DataFrame(df['transaction_date'].transform(lambda ser: int(ser.split('-')[0])))
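The three helpers above each split the date string by hand. A possible alternative, sketched here under the assumption that dates are 'YYYY-MM-DD' strings, parses the column once with pandas and reads all three parts from the `.dt` accessor. The function name `date_parts` and the sample rows are hypothetical.

```python
import pandas as pd

def date_parts(frame, column):
    """Return day, month, and year of a 'YYYY-MM-DD' string column as integers."""
    parsed = pd.to_datetime(frame[column], format='%Y-%m-%d')
    return pd.DataFrame({
        column + '_day': parsed.dt.day,
        column + '_month': parsed.dt.month,
        column + '_year': parsed.dt.year,
    }, index=frame.index)

# Made-up rows for illustration.
sample = pd.DataFrame({'transaction_date': ['2020-03-15', '2019-11-02']})
parts = date_parts(sample, 'transaction_date')
print(parts)
```

A function like this could be wrapped as `FunctionTransformer(date_parts, kw_args={'column': 'transaction_date'})`, so one transformer would replace three.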

We built our first Pipeline using the KNeighborsClassifier model.

    Features: 
        Quantitative:
            1. Day of transaction 
            2. Month of transaction
            3. Year of transaction        
        Nominal:
            4. Party

How: For the quantitative features, we used helper functions to extract the respective parts of the date string. For 'Party', we used one-hot encoding to convert the categorical values into numerical values.

Why: For the quantitative columns, whether a trade is a buy or a sell can depend on the date, since historical events such as the 2008 recession can cause systematic changes in buying and selling behavior. For 'Party', buying and selling can depend on party affiliation, since the parties could have different trading patterns.

In [327]:
# Column Transformer to encode the four above features
feature_eng_pipeline = ColumnTransformer([
    ('day', FunctionTransformer(day), ['transaction_date']),
    ('month', FunctionTransformer(month), ['transaction_date']),
    ('year', FunctionTransformer(year), ['transaction_date']),
    ('nominal', OneHotEncoder(), ['Party'])]
)
# Pipeline that combines the column transformer with the KNN classifier
pl = Pipeline([
    # Performs feature engineering 
    ('features', feature_eng_pipeline),
    ('tree', KNeighborsClassifier(n_neighbors=3))
])
# Fits the training data
pl.fit(X_train, y_train)
# F1 Score for the training set
f1_score(pl.predict(X_train), np.array(y_train),pos_label='buy')

0.7021872265966754

In [328]:
# F1 Score for testing set
f1_score(pl.predict(X_test), np.array(y_test),pos_label='buy')

0.6363636363636364

In [329]:
# baseline accuracy
np.mean(y_train == 'buy')

0.5280347366433831

In [330]:
# improvement 
0.6602746825602488 - 0.5252973381159146

0.13497734444433418

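Model Performance: Using F1-Score as our evaluation metric, we achieved a score of 0.6602746825602488. We consider this model decent, since its score is well above the baseline accuracy of 0.5252973381159146 (guessing all 'buy'), a difference of 0.13497734444433418. The cell outputs above (a test F1-Score of 0.6363636363636364 and a baseline accuracy of 0.5280347366433831) come from a different run: train_test_split is not seeded, so the figures vary from run to run.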
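The 0.5252973381159146 figure is an accuracy, while the model's score is an F1-Score, so the two are not on the same scale. One way to get a like-for-like baseline, sketched below, is to compute the F1-Score of a classifier that always predicts 'buy'. This assumes 'buy' is the positive class, as with `pos_label='buy'`; the helper name `f1_always_positive` is ours.

```python
def f1_always_positive(positive_rate):
    """F1-score of a classifier that always predicts the positive class.

    Every prediction is positive, so precision equals the positive rate
    and recall equals 1, which gives F1 = 2p / (1 + p).
    """
    precision = positive_rate
    recall = 1.0
    return 2 * precision * recall / (precision + recall)

buy_rate = 0.5252973381159146  # share of 'buy' trades reported above
all_buy_f1 = f1_always_positive(buy_rate)
print(round(all_buy_f1, 4))
```

Comparing the pipeline's F1-Score against this value, rather than against the accuracy, gives a direct reading of how much the model adds over always guessing 'buy'.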

### Final Model

New Features: 
    
    Quantitative:
        1. Day of disclosure 
        2. Month of disclosure
        3. Year of disclosure
        4. est_amount
    Nominal:
        5. ticker
        
How: For the quantitative features (besides est_amount), we used helper functions to extract the respective parts of the date string. For est_amount, we converted the values to z-scores with StandardScaler. For 'ticker', we used one-hot encoding to convert the categorical values into numerical values.

Why: For 'ticker', whether a trade is a buy or a sell can depend on the performance of a specific stock. For example, when a company is at risk of being delisted, a higher proportion of individuals sell that stock than buy it. For the quantitative columns, buying and selling can depend on the disclosure date, since members of Congress would likely disclose their buys and sales on separate dates. Therefore, we added disclosure dates in addition to transaction dates.
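As a small worked example of the z-score conversion applied to est_amount, the sketch below standardizes a handful of made-up amounts with NumPy. The values are illustrative only. StandardScaler divides by the population standard deviation, which corresponds to `ddof=0` here.

```python
import numpy as np

# Made-up trade amounts for illustration only.
amounts = np.array([1000.0, 15000.0, 50000.0, 250000.0])

# Subtract the mean and divide by the population standard deviation,
# which is what StandardScaler does for each column.
mean = amounts.mean()
std = amounts.std(ddof=0)
z_scores = (amounts - mean) / std
print(z_scores)
```

Inside a Pipeline, the mean and standard deviation are learned from the training data during `fit` and then reused unchanged on the test data.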

In [331]:
# train test split
X = df[['transaction_date','est_amount','Party','disclosure_date','ticker']]
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [332]:
# Turns string date into day of the month (as an integer) for disclosure_date
def dday(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[-1])))
# Turns string date into month (as an integer) for disclosure_date
def dmonth(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[1])))
# Turns string date into year (as an integer) for disclosure_date
def dyear(df):
    return pd.DataFrame(df['disclosure_date'].transform(lambda ser: int(ser.split('-')[0])))

In [306]:
# KNN CLASSIFIER WITH NEW FEATURES

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StandardScaler(), ['est_amount'])
])
pl = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', KNeighborsClassifier(n_neighbors=3))
])
pl.fit(X_train, y_train)
f1_score(pl.predict(X_train), np.array(y_train),pos_label='buy')

0.8575771122038445

In [307]:
f1_score(pl.predict(X_test), np.array(y_test),pos_label='buy')

0.69800796812749

##### Model Selection:
We manually checked two other models: DecisionTreeClassifier and RandomForestClassifier. Using the same features in each model, we found that the test F1-score was highest with the DecisionTreeClassifier (0.7225288509784245), while KNeighborsClassifier (0.69800796812749) and RandomForestClassifier (0.6988399071925754) scored nearly the same.
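The manual comparison described above can also be written as a loop over candidate models that are all fitted and scored on the same split. The following is a minimal sketch on synthetic stand-in data, not on our transactions. The variable names and random seeds are assumptions. In the notebook, each candidate would be the last step of a Pipeline that starts with the ColumnTransformer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: labels loosely tied to the first feature.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
labels = np.where(features[:, 0] + noise > 0, 'buy', 'sell')

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

candidates = {
    'knn': KNeighborsClassifier(n_neighbors=3),
    'random_forest': RandomForestClassifier(max_depth=19, random_state=0),
    'decision_tree': DecisionTreeClassifier(max_depth=19, random_state=0),
}

# Fit every candidate on the same split and record its test F1-score.
test_f1 = {}
for name, model in candidates.items():
    model.fit(Xs_train, ys_train)
    test_f1[name] = f1_score(ys_test, model.predict(Xs_test), pos_label='buy')

# Print the candidates from best to worst.
for name, score in sorted(test_f1.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 3))
```

Scoring all candidates on one fixed split keeps the comparison from being affected by differences between random splits.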

In [312]:
#RANDOM FOREST CLASSIFIER

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StandardScaler(), ['est_amount'])
])
pl = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', RandomForestClassifier(max_depth=19))
])
pl.fit(X_train, y_train)
f1_score(pl.predict(X_train), np.array(y_train),pos_label='buy')

0.7848828185627813

In [313]:
f1_score(pl.predict(X_test), np.array(y_test),pos_label='buy')

0.6988399071925754

In [318]:
#DECISION TREE CLASSIFIER

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StandardScaler(), ['est_amount'])
])
pl = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', DecisionTreeClassifier(max_depth=19))
])
pl.fit(X_train, y_train)
f1_score(pl.predict(X_train), np.array(y_train),pos_label='buy')

0.8855039350088855

In [319]:
f1_score(pl.predict(X_test), np.array(y_test),pos_label='buy')

0.7225288509784245

##### GridSearch:
Using GridSearchCV over max_depth values from 1 to 19, we found that the best max_depth hyperparameter is 19.
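When no `scoring` argument is given, GridSearchCV ranks candidates by the estimator's default score, which is accuracy for classifiers. Since our metric is the F1-Score, one option is to pass an F1 scorer explicitly. The sketch below shows this on synthetic stand-in data. The names `toy_pl`, `toy_search`, and `buy_f1` are hypothetical, and this is not the search that produced the result above.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a nonlinear buy/sell boundary.
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 3))
labels = np.where(features[:, 0] * features[:, 1] > 0, 'buy', 'sell')

toy_pl = Pipeline([('tree', DecisionTreeClassifier(random_state=1))])

# Score each candidate depth by F1 on the 'buy' class
# instead of the default accuracy.
buy_f1 = make_scorer(f1_score, pos_label='buy')
grid = {'tree__max_depth': np.arange(1, 20)}
toy_search = GridSearchCV(toy_pl, grid, scoring=buy_f1, cv=5)
toy_search.fit(features, labels)

best_depth = toy_search.best_params_['tree__max_depth']
print(best_depth, round(toy_search.best_score_, 3))
```

If the best value lands on the edge of the grid, as 19 does for our search over 1 to 19, widening the range is a cheap way to check whether a deeper tree would score higher.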

In [216]:
hyperparameters = {'tree__max_depth': np.arange(1,20)}
searcher = GridSearchCV(pl, hyperparameters)
searcher.fit(X_train, y_train)
searcher.best_estimator_

Pipeline(steps=[('features',
                 ColumnTransformer(transformers=[('num_dday',
                                                  FunctionTransformer(func=<function num_ddays at 0x000002990C4F2D38>),
                                                  ['disclosure_date']),
                                                 ('dmonth',
                                                  FunctionTransformer(func=<function dmonth at 0x000002990C44B0D8>),
                                                  ['disclosure_date']),
                                                 ('dyear',
                                                  FunctionTransformer(func=<function dyear at 0x000002990C44B558>),
                                                  ['disclosure_date']),
                                                 ('day',
                                                  FunctionTransformer(func=<function day at 0x000002990C1051F8>),
                                                  

##### Final Model
    Classification Model: Decision Tree Classifier
    Features: Disclosure date, transaction date, party, ticker, and est_amount
    Hyperparameter: max_depth = 19
    F1-Score: 0.7225288509784245

In [318]:
# FINAL MODEL

feature_eng_pipeline = ColumnTransformer([
        ('dday', FunctionTransformer(dday), ['disclosure_date']),
        ('dmonth', FunctionTransformer(dmonth), ['disclosure_date']),
        ('dyear', FunctionTransformer(dyear), ['disclosure_date']),
        ('tday', FunctionTransformer(day), ['transaction_date']),
        ('month', FunctionTransformer(month), ['transaction_date']),
        ('year', FunctionTransformer(year), ['transaction_date']),
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['Party','ticker']),
        ('quant', StandardScaler(), ['est_amount'])
])
pl = Pipeline([
    ('features', feature_eng_pipeline),
    ('tree', DecisionTreeClassifier(max_depth=19))
])
pl.fit(X_train, y_train)
f1_score(pl.predict(X_train), np.array(y_train),pos_label='buy')

0.8855039350088855

In [319]:
# Final Model F1 Score
f1_score(pl.predict(X_test), np.array(y_test),pos_label='buy')

0.7225288509784245

### Fairness Analysis

Null Hypothesis: Our model is fair. Its F1-Scores for Republicans and Democrats are roughly the same, and any difference is due to random chance.

Alternative Hypothesis: Our model is unfair. Its F1-Scores for Republicans and Democrats are significantly different.

alpha = 0.05

p-value: 0.0

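Conclusion: We observed a p-value of 0.0 (none of the 200 permuted differences was at least as large as the observed absolute difference of 0.1488955689445054), which is less than our alpha of 0.05. We therefore have enough evidence to reject the null hypothesis, and we conclude that our model predicts more accurately for Republicans.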

##### Computing observed test statistic

In [523]:
Republican = df[df['Party'] == 'Republican']
Democratic = df[df['Party'] == 'Democratic']

# train test split REPUBLICAN

Xr = Republican[['transaction_date','est_amount','Party','disclosure_date','ticker']]
yr = Republican['type']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.25)
pl.fit(Xr_train, yr_train)

r_f1 = f1_score(pl.predict(Xr_test), np.array(yr_test),pos_label='buy')

# train test split DEMOCRATIC

Xd = Democratic[['transaction_date','est_amount','Party','disclosure_date','ticker']]
yd = Democratic['type']
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.25)
pl.fit(Xd_train, yd_train)

d_f1 = f1_score(pl.predict(Xd_test), np.array(yd_test),pos_label='buy')

# observed test statistics
observed_abs_diff = np.abs(r_f1-d_f1)
observed_diff = r_f1-d_f1

In [524]:
observed_abs_diff

0.1488955689445054

##### Permutation test

In [525]:
# Simulation
results = []
df_p = df.copy()

for _ in range(200):
    
    # Permutation
    
    df_p['Shuffled_Party'] = np.random.permutation(df_p['Party'])
    Republican = df_p[df_p['Shuffled_Party'] == 'Republican']
    Democratic = df_p[df_p['Shuffled_Party'] == 'Democratic']
    
    # Train test split Republican
    
    Xr = Republican[['transaction_date','est_amount','Party','disclosure_date','ticker']]
    yr = Republican['type']
    Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.25)
    pl.fit(Xr_train, yr_train)
    # Republican F1-Score
    r_f1 = f1_score(pl.predict(Xr_test), np.array(yr_test),pos_label='buy')


    # Train test split Democratic
    
    Xd = Democratic[['transaction_date','est_amount','Party','disclosure_date','ticker']]
    yd = Democratic['type']
    Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, test_size=0.25)
    pl.fit(Xd_train, yd_train)
    # Democratic F1-Score
    d_f1 = f1_score(pl.predict(Xd_test), np.array(yd_test),pos_label='buy')

    
    abs_diff = np.abs(r_f1-d_f1)
    
    # Test statistics
    results.append(abs_diff)

In [526]:
# Convert the list to an array so the comparison is explicitly elementwise
p_value = np.mean(np.array(results) >= observed_abs_diff)

In [527]:
p_value

0.0
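The cells above refit the pipeline on each party's rows in every iteration. A common alternative design, sketched below, keeps one fitted model fixed, compares its test-set F1-Score between the two parties, and shuffles only the party labels. This is a different procedure from the one we ran. The data are synthetic, and the helper names `f1_buy` and `f1_gap` are hypothetical.

```python
import numpy as np

def f1_buy(y_true, y_pred):
    """F1-score for the 'buy' class from two label arrays."""
    tp = np.sum((y_true == 'buy') & (y_pred == 'buy'))
    fp = np.sum((y_true != 'buy') & (y_pred == 'buy'))
    fn = np.sum((y_true == 'buy') & (y_pred != 'buy'))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_gap(y_true, y_pred, group):
    """Absolute difference in F1 between the two parties."""
    rep = group == 'Republican'
    dem = group == 'Democratic'
    return abs(f1_buy(y_true[rep], y_pred[rep]) - f1_buy(y_true[dem], y_pred[dem]))

# Synthetic test-set labels, predictions, and parties for illustration only.
rng = np.random.default_rng(0)
n = 400
party = rng.choice(np.array(['Republican', 'Democratic']), size=n)
truth = rng.choice(np.array(['buy', 'sell']), size=n)
# Simulated predictions that are right about 70% of the time for everyone.
correct = rng.random(n) < 0.7
flipped = np.where(truth == 'buy', 'sell', 'buy')
preds = np.where(correct, truth, flipped)

observed_gap = f1_gap(truth, preds, party)

# Shuffle party labels only; the predictions stay fixed.
simulated_gaps = np.array([
    f1_gap(truth, preds, rng.permutation(party)) for _ in range(500)
])
perm_p_value = np.mean(simulated_gaps >= observed_gap)
print(round(observed_gap, 3), perm_p_value)
```

With a finite number of shuffles, a p-value of 0.0 is best read as "smaller than one over the number of shuffles" rather than as exactly zero.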