## <center> FX Market Prediction using Machine Learning and Google Trend Data </center>

### Google Trends Data

Google Trends lets you look up a particular topic or a specific set of search terms and see search interest in it over time. Search interest is an index of the volume of Google queries by geographic location and category. For instance, we can search for 'Real Estate' within the 'United States' and get search interest dating back several years.

Importantly, the Trends data is not the raw level of queries but a 'query index'. The index starts with the query share: the total query volume for a search term in a given geographic region divided by the total number of queries in that region at a point in time.
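As a rough illustration of the query-share idea, the sketch below computes the share of a term per period and then rescales it so that the peak equals 100, which is how Google Trends is generally described as presenting its index. The weekly counts are made up, and `query_index` is a hypothetical helper, not part of any Google API.

```python
def query_index(term_counts, total_counts):
    """Return a 0-100 index from per-period term and total query counts."""
    # Query share: term volume divided by total volume in the same period.
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    # Assumption: rescale so the period with the highest share is 100.
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Made-up weekly counts for one search term in one region.
term_counts = [120, 300, 150, 90]
total_counts = [10_000, 12_000, 10_000, 9_000]

index = query_index(term_counts, total_counts)
print(index)
```

Because the index is relative to the peak inside the requested window, the same week can receive a different value when the window changes.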

### Predicting FX Market Changes

In this notebook we will explore Google Trends data that includes relative search volumes for country-level search terms broken down by specific categories. The hypothesis is that worldwide country-specific search categories such as 'Europe Business News', 'US Financial Markets', 'Canada(ian) Politics', etc., can be used to predict FX currency pair changes.

The idea is to see whether there exists a tradeable pattern that a machine learning classifier can learn, one that captures the public's 'mood', 'concerns', or 'fears' and leads to significant forecasts. Note that we will attempt to predict an FX pair movement of '+1' = positive return or '-1' = negative return.
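To make the '+1' / '-1' target concrete, here is a minimal sketch that turns a series of weekly closes into direction labels. The prices are invented, `direction_labels` is a hypothetical helper, and treating a flat week as '-1' is an assumption; the notebook's own preprocessing may handle that case differently.

```python
def direction_labels(closes):
    """Map consecutive weekly closes to +1 (up) or -1 (down or flat)."""
    labels = []
    for prev, curr in zip(closes, closes[1:]):
        weekly_return = curr / prev - 1.0
        # Assumption: a zero return is labelled -1.
        labels.append(1 if weekly_return > 0 else -1)
    return labels


# Invented weekly closes for a currency pair.
closes = [1.3050, 1.3120, 1.3080, 1.3080, 1.3200]
labels = direction_labels(closes)
print(labels)
```

Note that five closes yield only four labels, since the first week has no prior close to compare against.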

#### \*Search Term Bias

Given limited time, we take a biased search-term approach and choose search terms based on fundamental 'intuition' rather than other, more robust approaches. As such, we choose search keywords related to various News subcategories: financial markets news, economy news, fiscal policy news, and politics.

### FX Data

Besides the Google Trends API data, FX market data is obtained freely from [Forex Forum](http://www.global-view.com/forex-trading-tools/forex-history/index.html). From there we obtain weekly FX pair price levels from **4/2013 to 4/2018** for the currency pairs:

<center> **EUR/USD, EUR/GBP, USD/CAD** </center>

In [1]:
from src.fx_preprocessing import *

## Preprocessing

In [3]:
# Included 'News' subcategories
news_categories = {
                    784 : "Biz_News",   #Business News
                    1163: "FMrkt_News", #Financial Markets News
                    1164: "Econ_News",  #Economy News
                    1165: "Fis_News",   #Fiscal Policy News
                    396 : "Pol_News",   #Politics
                    112 : "BN_News",    #Broadcast & Network News
                }

travel_categories = {
                    1010: "TravelAgncs", #Travel Agencies
                    203 : "AirTravel",   #Air Travel
                    1004: "SpecialTrvl", #Specialty Travel
                    208 : "TourDest"     #Tourist Destinations
                    }

In [211]:
search_key1 = "US"
search_key2 = "Europe"
#search_key3 = "United Kingdom"
#search_key4 = "Canada"

us_GTNews = get_search_data(search_key1, news_categories)
eu_GTNews = get_search_data(search_key2, news_categories)
us_GTravel = get_search_data(search_key1, travel_categories)
eu_GTravel = get_search_data(search_key2, travel_categories)

### Load Google Trend Data & FX Data

In [180]:
search_key1 = "US"
search_key2 = "Europe"
fileName = 'exchange.csv'

data = run_preprocessing(fileName, search_key1, search_key2, news_categories, travel_categories)

## Building Naive Bayes/SVM Classifier

In [181]:
train_start = '2013-01-01'
test_start = '2017-05-01'

train_data, test_data = test_train_split(data, train_start, test_start)

In [182]:
print("Number of training observations: ", len(train_data))
print("Number of testing observations: ", len(test_data))

Number of training observations:  188
Number of testing observations:  50
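The implementation of `test_train_split` lives in `src` and is not shown here. As a sketch of what a date-based split typically looks like, the hypothetical `chronological_split` below keeps everything before `test_start` for training and everything from `test_start` onward for testing, so no future observations leak into the training set. The rows are invented.

```python
from datetime import date


def chronological_split(rows, train_start, test_start):
    """Split (date, label) rows into train [train_start, test_start) and test [test_start, ...)."""
    train = [r for r in rows if train_start <= r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, test


# Invented weekly observations around the split date.
rows = [
    (date(2017, 4, 17), 1),
    (date(2017, 4, 24), -1),
    (date(2017, 5, 1), 1),
    (date(2017, 5, 8), 1),
]

train_rows, test_rows = chronological_split(rows, date(2013, 1, 1), date(2017, 5, 1))
print(len(train_rows), len(test_rows))
```

A random shuffle split would be inappropriate for this kind of time series, because weeks from the test period would end up in the training data.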


In [197]:
from sklearn.svm import OneClassSVM, SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn import metrics 

#### Note the '_x' and '_y' are results of combining EUR and USD Google Trends data. Can clean up in future

In [184]:
train_data.columns

Index(['BN_News_x', 'Pol_News_x', 'Fis_News_x', 'Econ_News_x', 'FMrkt_News_x',
       'Biz_News_x', 'date_orig_y', 'TourDest_x', 'SpecialTrvl_x',
       'AirTravel_x', 'TravelAgncs_x', 'BN_News_y', 'Pol_News_y', 'Fis_News_y',
       'Econ_News_y', 'FMrkt_News_y', 'Biz_News_y', 'date_orig_y',
       'TourDest_y', 'SpecialTrvl_y', 'AirTravel_y', 'TravelAgncs_y',
       'date_orig', 'EUR/GBP Close', 'USD/CAD Close', 'EUR/USD Close'],
      dtype='object')

In [185]:
features = set(data.columns)-set(['index', 'USD/CAD Close', 'EUR/GBP Close', 'EUR/USD Close',
                                  'GBP/USD Close', 'date_orig_x', 'date_orig_y', 'date_orig'])
y_train = train_data['EUR/USD Close']
X_train = train_data[list(features)]

In [193]:
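# NOTE: OneClassSVM is an unsupervised novelty detector: fit() ignores y_train and
# predict() returns +1 (inlier) / -1 (outlier). gamma is unused by the linear kernel.
model = OneClassSVM(kernel='linear', nu=0.05, gamma=0.1)
# model = SVC(C=0.5, gamma=0.01)
model.fit(X_train, y_train)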

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=0.1, kernel='linear',
      max_iter=-1, nu=0.05, random_state=None, shrinking=True, tol=0.001,
      verbose=False)

### Result

I ran out of time and wanted to test other, more powerful classifiers such as decision trees, ensemble methods, or even a shallow neural net, but I see some promising initial results for the classifier. More analysis is needed: it appears Google Trends has some predictive power on the EUR/USD cross, but deeper and better out-of-sample testing is required.

#### OneClassSVM

In [196]:
# X_test and y_test are defined in cell In [189] below (the cells were run out of order)
preds = model.predict(X_test)
targs = y_test

print("accuracy: ", metrics.accuracy_score(targs, preds))  
print("precision: ", metrics.precision_score(targs, preds))  
print("recall: ", metrics.recall_score(targs, preds))  
print("f1: ", metrics.f1_score(targs, preds))  
print("area under curve (auc): ", metrics.roc_auc_score(targs, preds)) 

accuracy:  0.5
precision:  0.5625
recall:  0.620689655172
f1:  0.590163934426
area under curve (auc):  0.477011494253


#### Gaussian Naive Bayes

In [199]:
model = GaussianNB()
model.fit(X_train, y_train)

# In sample score
in_sample_score = model.score(X_train, y_train)
print("Model in sample train score: ", round(in_sample_score,2))
print()
# in-sample historical percent up weeks for pair
per_up_weeks = ((y_train + 1)/2).mean()
print("Historical % up weeks: ", round(per_up_weeks,2))

Model in sample train score:  0.67

Historical % up weeks:  0.51


In [189]:
y_test = test_data['EUR/USD Close']
X_test = test_data[list(features)]

# Test sample score
out_sample_score = model.score(X_test, y_test)
print("Model out of sample test score: ", out_sample_score)
print()
# test-sample historical percent up weeks for pair
per_up_weeks = ((y_test + 1)/2).mean()
print("Historical % up weeks: ", per_up_weeks)

Model out of sample test score:  0.46

Historical % up weeks:  0.58
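One way to read the two test-set numbers above is to compare the model with a naive rule that always predicts the more common direction. The sketch below does that arithmetic with the reported values (0.46 accuracy, 0.58 share of up weeks); `majority_baseline` is a hypothetical helper introduced only for illustration.

```python
def majority_baseline(share_up):
    """Accuracy of always predicting the more frequent class."""
    return max(share_up, 1.0 - share_up)


test_accuracy = 0.46   # test-set score reported above
share_up_test = 0.58   # share of up weeks in the test set reported above

baseline = majority_baseline(share_up_test)
edge = test_accuracy - baseline
print("baseline accuracy:", baseline)
print("model edge over baseline:", round(edge, 2))
```

A negative edge means that, on this test window, always predicting an up week would have scored higher than the model.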


In [194]:
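print("CONFUSION MATRIX")
# By execution order (In [193] then In [194]), `model` here is the OneClassSVM fit above,
# so this matrix matches the OneClassSVM metrics rather than the GaussianNB scores.
y_pred = model.predict(X_test)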
conf_mat = confusion_matrix(y_test, y_pred)
print(conf_mat)
#In the binary case, we can extract true positives, etc as follows:
tn, fp, fn, tp = conf_mat.ravel()
print()
print("true pos: ", tp)
print("true neg: ", tn)
print("false pos: ", fp)
print("false neg: ", fn)

CONFUSION MATRIX
[[ 7 14]
 [11 18]]

true pos:  18
true neg:  7
false pos:  14
false neg:  11
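The summary metrics can be recomputed directly from these four counts. The sketch below does so with plain arithmetic; `binary_metrics` is a hypothetical helper. For hard +1/-1 predictions, the ROC AUC reduces to the average of recall and specificity, which is what the last entry computes. The results reproduce the accuracy, precision, recall, F1, and AUC printed in the OneClassSVM section.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # With hard labels the ROC curve has a single interior point,
        # so the area under it is (recall + specificity) / 2.
        "auc": (recall + specificity) / 2,
    }


m = binary_metrics(tn=7, fp=14, fn=11, tp=18)
for name, value in m.items():
    print(name, round(value, 6))
```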


## Cointegration Test

In [120]:
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import coint


In [107]:
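def ADF(v, crit='5%', max_d=6, reg='nc', autolag='AIC'):
    """ 
    Augmented Dickey Fuller test

    Parameters
    ----------
    v: ndarray
        series to test (e.g. residuals)
    crit: str
        critical-value level to compare against: '1%', '5%' or '10%'
    max_d, reg, autolag:
        passed to adfuller as maxlag, regression and autolag
        ('nc' = no constant; newer statsmodels releases spell this 'n')

    Returns
    -------
    bool
        False if the test statistic is below the critical value
        (unit root rejected, series looks stationary); True otherwise
    """
    # Keyword arguments avoid relying on adfuller's positional order
    adf = adfuller(v, maxlag=max_d, regression=reg, autolag=autolag)
    print("adf: ", adf)
    # adf[0] is the test statistic, adf[4] the dict of critical values
    return not (adf[0] < adf[4][crit])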

Here I attempt to use the statsmodels Johansen test to test for cointegration. According to Johansen, there is a stationary linear combination of the factors (order of integration 0) with over 95% confidence. Thus there exists an opportunity for pairs trading. The Google Trends methodology we built is not directly investable, however.

I ran out of time to perfect the Johansen test for the multivariate case (the current implementation covers only the univariate case; where's R when you need it!). Running the univariate case on several factors shows that the Google Trends series and EUR/USD are cointegrated. From here, the final small step would be to go long EUR/USD when the model predicts a weekly increase (and short when it predicts a decrease), then check the profitability of the algorithm.
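The long/short rule described here can be sketched in a few lines. The predictions and weekly returns below are invented, both helpers are hypothetical, and the sketch assumes full position size each week with no spreads, financing, or transaction costs, all of which a real profitability check would need to include.

```python
def strategy_returns(predictions, weekly_returns):
    """Go long (+1) or short (-1) each week according to the prediction."""
    return [p * r for p, r in zip(predictions, weekly_returns)]


def cumulative_return(returns):
    """Compound a sequence of simple returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


# Invented model predictions and realised weekly EUR/USD returns.
predictions = [1, -1, 1, 1]
weekly_returns = [0.010, -0.005, -0.002, 0.004]

pnl = strategy_returns(predictions, weekly_returns)
total = cumulative_return(pnl)
hit_rate = sum(1 for x in pnl if x > 0) / len(pnl)
print(pnl, round(total, 6), hit_rate)
```

A correct short call turns a negative market return into a positive strategy return, which is why the second week contributes a gain here.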

Similarly, it's worthwhile to run the ADF test, since we did not clearly do that before.

In [148]:
print("\nADF Test Result: ", ADF(y_train))

adf:  (-13.421114550940748, 2.1427208940816457e-24, 0, 187, {'1%': -2.5777998701707228, '5%': -1.9425278169832012, '10%': -1.6154734119830811}, 515.65022407389051)

ADF Test Result:  False
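To interpret this output: the ADF null hypothesis is that the series has a unit root, and it is rejected when the test statistic falls below the critical value. The sketch below applies that rule to the numbers printed above; `rejects_unit_root` is a hypothetical helper. Note that the `ADF` function above returns `False` in exactly this case, so `False` here corresponds to a series that looks stationary.

```python
def rejects_unit_root(statistic, critical_values, level="5%"):
    """True when the ADF statistic is below the critical value at `level`."""
    return statistic < critical_values[level]


# Values copied from the adfuller output printed above.
statistic = -13.421114550940748
critical_values = {
    "1%": -2.5777998701707228,
    "5%": -1.9425278169832012,
    "10%": -1.6154734119830811,
}

decisions = {lvl: rejects_unit_root(statistic, critical_values, lvl)
             for lvl in critical_values}
print(decisions)
```

Since `y_train` holds the '+1' / '-1' direction labels rather than price levels, a strongly stationary result is what one would expect.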


In [146]:
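for series in features:
    # NOTE: the coint() call (statsmodels' Engle-Granger two-step test) is commented out,
    # so `_`, `pvalue` and `crit` appear to hold stale values from earlier cells and the
    # printed output repeats for every series. Uncomment it to test each series.
    #_, pvalue, crit = coint(y_train, train_data[series])
    print("EUR/USD vs {}".format(series))
    print("t-statistic of unit-root test on residuals: \n", _)
    print("pvalue: \n", pvalue)
    print("Critical values: \n", crit)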

EUR/USD vs FMrkt_News_x
t-statistic of unit-root test on residuals: 
 {'SpecialTrvl_y', 'TourDest_y', 'Fis_News_x', 'Biz_News_y', 'EUR/USD Close', 'TravelAgncs_y', 'Econ_News_y', 'Fis_News_y', 'TourDest_x', 'BN_News_y', 'FMrkt_News_y', 'Pol_News_y', 'FMrkt_News_x', 'TravelAgncs_x', 'BN_News_x', 'SpecialTrvl_x', 'AirTravel_x', 'AirTravel_y', 'Biz_News_x', 'Econ_News_x', 'Pol_News_x'}
pvalue: 
 0.296901157286
Critical values: 
 [-3.95596507 -3.36899945 -3.067208  ]
EUR/USD vs AirTravel_x
t-statistic of unit-root test on residuals: 
 {'SpecialTrvl_y', 'TourDest_y', 'Fis_News_x', 'Biz_News_y', 'EUR/USD Close', 'TravelAgncs_y', 'Econ_News_y', 'Fis_News_y', 'TourDest_x', 'BN_News_y', 'FMrkt_News_y', 'Pol_News_y', 'FMrkt_News_x', 'TravelAgncs_x', 'BN_News_x', 'SpecialTrvl_x', 'AirTravel_x', 'AirTravel_y', 'Biz_News_x', 'Econ_News_x', 'Pol_News_x'}
pvalue: 
 0.296901157286
Critical values: 
 [-3.95596507 -3.36899945 -3.067208  ]
EUR/USD vs AirTravel_y
t-statistic of unit-root test on residuals