In [1]:
# !pip install keras_bert
# !pip install -q keras-bert keras-rectified-adam

#Uncomment and run the two lines above on Colab before running the rest of the code

# **FINE-TUNING A LARGE LANGUAGE MODEL (LLM) FOR SENTIMENT ANALYSIS.**

## **Using BERT for sentiment analysis.**


<font size="3">In 2018 the Google Research team released a large language model called BERT. You can find it at the following link: https://github.com/google-research/bert<br><br>

<font size="3">BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep-learning-based unsupervised language representation model. It is the first deeply bidirectional unsupervised language model. Until BERT, language models learned from text sequences either left to right only, or by combining separate left-to-right and right-to-left contexts. Thus they were either not bidirectional or not bidirectional in all layers.<br><br>
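
To make the distinction concrete, the sketch below builds the two visibility patterns as plain Python lists. In a left-to-right model, the token at position i can only look at positions 0 to i, while in a bidirectional encoder every token can look at every position. The function names are illustrative and are not part of BERT or keras_bert.

```python
def left_to_right_mask(n):
    """Row i marks which positions token i may look at: itself and earlier tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may look at every position, to its left and to its right."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "movie", "was", "great"]
for name, mask in [("left-to-right", left_to_right_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6}: {row}")
```

In the left-to-right pattern the word "movie" cannot use "great" as context, whereas in the bidirectional pattern it can.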
    
<font size="3">You can use an LLM for many applications. People who work in finance can use it to easily figure out whether the words of a CEO or CFO during an earnings call carry a positive or negative sentiment. Sentiment analysis involves classifying text into sentiment categories (e.g. positive vs negative). It allows us to convert complex unstructured data into concise numerical ratings. This is a valuable tool for investors trying to avoid being drowned by the modern firehose of information. The explanations in this notebook are inspired by a Sparkline Capital document called: <a href="https://sparklinecapital.files.wordpress.com/2020/11/sparkline-deep-learning.pdf" target="_blank">Deep Learning in Investing:  Opportunity in Unstructured Data.</a> That document shows the following example of sentiment analysis in finance. <br><br>

> *“Yes. So we've never really disclosed beds per door,                  
anything like that. What I will say is, we actually just                      
completed a pretty big deep dive on this with cohort                    
views. And no matter how we cut it, we are continuing                      
to see same-store sales increase, which is terrific, and              
Q4 was no exception to that.                   
So our strength in the marketplace continues to grow.”    
          
    Joe Megibow, CEO, Purple Innovation Inc. (Mar 13, 2020). *
         
<div class="alert alert-block alert-success">
<font size="3">
<b>Sentiment: Positive</b> 
</font>
</div>

<font size="3">A tool that analyzes and summarizes this kind of information can be of great help. You can quickly find the overall sentiment of a company's C-level executives and track it over time.<br><br>

<font size="3">In this notebook we are going to take BERT and add a layer to teach it how to perform sentiment analysis. This task is known as **fine-tuning**. We will show how the model performs on a movie review dataset. The notebook should be run on Google Colab using a GPU. Most of the implementation ideas come from the tutorial credited below.<br><br>
    
<font size="3">Credits: https://pysnacks.com/machine-learning/bert-text-classification-with-fine-tuning/

In [3]:
import warnings
warnings.filterwarnings('ignore')

#Import libraries
import os
import sys
import numpy as np
import codecs
import tensorflow as tf
from tensorflow import keras
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from tqdm import tqdm

In [4]:
#Verify Colab is running on a GPU

print("TensorFlow version:", tf.__version__)
print(tf.config.list_physical_devices())

TensorFlow version: 2.9.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [5]:
#This is a necessary step to read data from your Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
#Define the path of the folder where you will put BERT files

path_folder = '/content/gdrive/MY_OWN_FOLDER'
sys.path.append(path_folder)

## **Downloading BERT.**

<font size="3">You should go to: https://github.com/google-research/bert . There, under the Pre-trained models section, you can find different models to use. In this notebook we used BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters. When you download it you get a zip file named uncased_L-12_H-768_A-12.zip. You should unzip the files and put them in a subfolder named BERT_model under MY_OWN_FOLDER, which is where the paths in the next cell point. The 3 files that we need here are: 1) bert_config.json, 2) bert_model.ckpt, 3) vocab.txt <br><br>
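
Before loading the model it can help to confirm that the files are where the notebook expects them. The sketch below is one possible check; the helper name `missing_bert_files` is hypothetical, and it treats bert_model.ckpt as a file-name prefix because TensorFlow stores a checkpoint as several files that share that prefix.

```python
import os
import tempfile


def missing_bert_files(folder):
    """Return the expected BERT files that are not present in `folder`."""
    present = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    missing = [name for name in ("bert_config.json", "vocab.txt") if name not in present]
    # "bert_model.ckpt" is a prefix: the checkpoint is stored as several files
    # such as bert_model.ckpt.index and bert_model.ckpt.data-00000-of-00001.
    if not any(name.startswith("bert_model.ckpt") for name in present):
        missing.append("bert_model.ckpt*")
    return missing


# Demonstration on a temporary folder that only contains vocab.txt.
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "vocab.txt"), "w").close()
    print(missing_bert_files(demo_folder))
```

On Colab you would call the helper with the folder that holds the unzipped files and expect an empty list.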

In [7]:
# Bert Model Constants
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LR = 2e-5

#Files we need to run the model
config_path = path_folder +'/BERT_model/bert_config.json'
checkpoint_path = path_folder +'/BERT_model/bert_model.ckpt'
vocab_path = path_folder +'/BERT_model/vocab.txt'

## **Reading the vocab.txt file**

<font size="3">The file vocab.txt that we downloaded from the BERT GitHub site contains more than 30,000 tokens (words and word pieces). Each token is on a separate line in the text file. We can open the vocab.txt file using the codecs module as follows:

In [8]:
#Here we use codecs to open the text file.
#You can also use import io; io.open(...)

with codecs.open(vocab_path, 'r', 'utf8') as reader:
  print(reader.read(35)) 

[PAD]
[unused0]
[unused1]
[unused2]


## **Creating the tokenizer**

<font size="3">After we open the vocab.txt file we can do the following:
1) Read each row in the file and remove the surrounding whitespace with the function strip. <br><br>
2) Fill the dictionary token_dict. Each token in vocab.txt is mapped to the number of entries already in the dictionary, which is its zero-based row number in the file. <br><br>
3) Use the Tokenizer class, which we imported from the keras_bert library earlier, to create our own tokenizer.  <br><br>

In [9]:
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()        
        token_dict[token] = len(token_dict)

tokenizer = Tokenizer(token_dict)

In [10]:
#Inspection of token_dict
token_dict

{'[PAD]': 0,
 '[unused0]': 1,
 '[unused1]': 2,
 '[unused2]': 3,
 '[unused3]': 4,
 '[unused4]': 5,
 '[unused5]': 6,
 '[unused6]': 7,
 '[unused7]': 8,
 '[unused8]': 9,
 '[unused9]': 10,
 '[unused10]': 11,
 '[unused11]': 12,
 '[unused12]': 13,
 '[unused13]': 14,
 '[unused14]': 15,
 '[unused15]': 16,
 '[unused16]': 17,
 '[unused17]': 18,
 '[unused18]': 19,
 '[unused19]': 20,
 '[unused20]': 21,
 '[unused21]': 22,
 '[unused22]': 23,
 '[unused23]': 24,
 '[unused24]': 25,
 '[unused25]': 26,
 '[unused26]': 27,
 '[unused27]': 28,
 '[unused28]': 29,
 '[unused29]': 30,
 '[unused30]': 31,
 '[unused31]': 32,
 '[unused32]': 33,
 '[unused33]': 34,
 '[unused34]': 35,
 '[unused35]': 36,
 '[unused36]': 37,
 '[unused37]': 38,
 '[unused38]': 39,
 '[unused39]': 40,
 '[unused40]': 41,
 '[unused41]': 42,
 '[unused42]': 43,
 '[unused43]': 44,
 '[unused44]': 45,
 '[unused45]': 46,
 '[unused46]': 47,
 '[unused47]': 48,
 '[unused48]': 49,
 '[unused49]': 50,
 '[unused50]': 51,
 '[unused51]': 52,
 '[unused52]': 53,

In [11]:
#This is what the tokenizer returns using an example of a short movie review. 
tokenizer.tokenize("Can't wait for it's next part!")

['[CLS]',
 'can',
 "'",
 't',
 'wait',
 'for',
 'it',
 "'",
 's',
 'next',
 'part',
 '!',
 '[SEP]']
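
The BERT vocabulary holds whole words and word pieces, and pieces that continue a word carry a ## prefix. A word that is not in the vocabulary is split by greedy longest-match-first search. The sketch below illustrates that idea on a made-up toy vocabulary; it is a simplified illustration, not the keras_bert implementation, and the function name `wordpiece` is hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into word pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"soy", "##lent", "green", "re", "##making"}  # made-up vocabulary
print(wordpiece("soylent", toy_vocab))
print(wordpiece("remaking", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```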

## **Downloading Large Movie Review Dataset.**

<font size="3">From Stanford University we will download the Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. They provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. Here we use the function tf.keras.utils.get_file . In the following link you can see what this function does https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file<br><br>

In [12]:
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz", 
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
    extract=True,
)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [13]:
dataset

'/root/.keras/datasets/aclImdb.tar.gz'

In [14]:
#Python method listdir() returns a list containing the names of the entries in the directory given by path. 
#The list is in arbitrary order
os.listdir('/root/.keras/datasets')

['aclImdb', 'aclImdb.tar.gz']

In [15]:
#We have train and test folders
os.listdir('/root/.keras/datasets/aclImdb')

['train', 'README', 'imdbEr.txt', 'test', 'imdb.vocab']

In [16]:
#In train and test folders we also have pos and neg folders.
#Pos stands for positive and neg for negative. There we have the film reviews that are positive or negative
os.listdir('/root/.keras/datasets/aclImdb/train')

['urls_neg.txt',
 'pos',
 'unsupBow.feat',
 'unsup',
 'neg',
 'labeledBow.feat',
 'urls_pos.txt',
 'urls_unsup.txt']

In [17]:
#This is an example of the reviews in each folder
os.listdir('/root/.keras/datasets/aclImdb/train/neg')[0:10]

['1528_1.txt',
 '6684_1.txt',
 '6173_1.txt',
 '6371_4.txt',
 '9009_1.txt',
 '11580_2.txt',
 '12161_3.txt',
 '10844_1.txt',
 '221_4.txt',
 '9018_3.txt']

## **Organizing the data from movie reviews**

<font size="3">Here we will create a function to get all the positive and negative reviews and tokenize them. With that information in hand we can feed our deep neural network and perform a sentiment analysis. <br><br>


In [18]:
#Define the train and test paths
train_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'train')
test_path = os.path.join(os.path.dirname(dataset), 'aclImdb', 'test')

#Define the negative and positive labels
tagset = [('neg', 0), ('pos', 1)]
id_to_labels = {0: 'negative', 1: 'positive'}

In [19]:
for folder, sentiment in tagset:
  folder = os.path.join(train_path, folder)
  print(folder)
  print("----------------------------------------")
  print(sentiment)

/root/.keras/datasets/aclImdb/train/neg
----------------------------------------
0
/root/.keras/datasets/aclImdb/train/pos
----------------------------------------
1


In [20]:
#Open a particular review and read its text
with open(os.path.join(folder, os.listdir(folder)[0]), 'r',encoding="utf-8") as reader:
  text = reader.read()
  
text

"Soylent Green is a classic. I have been waiting for someone to re-do it.They seem to be remaking sci-fi classics these days (i.e. War of the Worlds)and I am hoping some director/producer will re-do Soylent Green. With todays computer animation and technology, it would have the potential to be a great picture. Anti-Utopian films may not be that far-fetched. The human race breeds like roaches with no outside influence to curtail it. We, as humans, have the option of putting the kibosh on the procreation of lesser species if they get out of hand, but there's nothing to control human breeding except for ourselves. Despite all the diseases, wars, abortions, birth control, etc. the human race still multiplies like bacteria in a petri dish. Classic Malthusian economics states that any species, including humans, will multiply beyond their means of subsistence. 6 billion and growing....that's obscene."

In [21]:
#There are 12500 positive reviews in the training set
len(os.listdir(folder))

12500

In [22]:
#Example of how the tokenizer transforms the text into a list of word-piece tokens
tokenizer.tokenize(text)

['[CLS]',
 'soy',
 '##lent',
 'green',
 'is',
 'a',
 'classic',
 '.',
 'i',
 'have',
 'been',
 'waiting',
 'for',
 'someone',
 'to',
 're',
 '-',
 'do',
 'it',
 '.',
 'they',
 'seem',
 'to',
 'be',
 're',
 '##making',
 'sci',
 '-',
 'fi',
 'classics',
 'these',
 'days',
 '(',
 'i',
 '.',
 'e',
 '.',
 'war',
 'of',
 'the',
 'worlds',
 ')',
 'and',
 'i',
 'am',
 'hoping',
 'some',
 'director',
 '/',
 'producer',
 'will',
 're',
 '-',
 'do',
 'soy',
 '##lent',
 'green',
 '.',
 'with',
 'today',
 '##s',
 'computer',
 'animation',
 'and',
 'technology',
 ',',
 'it',
 'would',
 'have',
 'the',
 'potential',
 'to',
 'be',
 'a',
 'great',
 'picture',
 '.',
 'anti',
 '-',
 'utopia',
 '##n',
 'films',
 'may',
 'not',
 'be',
 'that',
 'far',
 '-',
 'fetch',
 '##ed',
 '.',
 'the',
 'human',
 'race',
 'breeds',
 'like',
 'roach',
 '##es',
 'with',
 'no',
 'outside',
 'influence',
 'to',
 'curt',
 '##ail',
 'it',
 '.',
 'we',
 ',',
 'as',
 'humans',
 ',',
 'have',
 'the',
 'option',
 'of',
 'putting

In [23]:
#Example of how the tokenizer converts the text into token ids, padded or truncated to SEQ_LEN
#Each id is the zero-based row of the token in the vocab.txt file
#encode returns a pair: the token ids and the segment ids
tokenizer.encode(text, max_len=SEQ_LEN)

([101,
  25176,
  16136,
  2665,
  2003,
  1037,
  4438,
  1012,
  1045,
  2031,
  2042,
  3403,
  2005,
  2619,
  2000,
  2128,
  1011,
  2079,
  2009,
  1012,
  2027,
  4025,
  2000,
  2022,
  2128,
  12614,
  16596,
  1011,
  10882,
  10002,
  2122,
  2420,
  1006,
  1045,
  1012,
  1041,
  1012,
  2162,
  1997,
  1996,
  8484,
  1007,
  1998,
  1045,
  2572,
  5327,
  2070,
  2472,
  1013,
  3135,
  2097,
  2128,
  1011,
  2079,
  25176,
  16136,
  2665,
  1012,
  2007,
  2651,
  2015,
  3274,
  7284,
  1998,
  2974,
  1010,
  2009,
  2052,
  2031,
  1996,
  4022,
  2000,
  2022,
  1037,
  2307,
  3861,
  1012,
  3424,
  1011,
  26425,
  2078,
  3152,
  2089,
  2025,
  2022,
  2008,
  2521,
  1011,
  18584,
  2098,
  1012,
  1996,
  2529,
  2679,
  15910,
  2066,
  20997,
  2229,
  2007,
  2053,
  2648,
  3747,
  2000,
  20099,
  12502,
  2009,
  1012,
  2057,
  1010,
  2004,
  4286,
  1010,
  2031,
  1996,
  5724,
  1997,
  5128,
  1996,
  11382,
  15853,
  2232,
  2006,
  1996,
 

In [24]:
#Example of word and location in the vocab file
print(list(token_dict)[1998:2001])
print("---------------------")
print(list(token_dict)[2013:2016 ])

['and', 'in', 'to']
---------------------
['from', 'her', '##s']


In [25]:
ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
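
The sketch below shows, under simplifying assumptions, what an encoding step of this kind does for a single text: it adds the special [CLS] and [SEP] tokens, truncates to the maximum length, looks up each token id, and pads with the [PAD] id. The function `encode_tokens` and the toy dictionary are hypothetical; the ids are made up and differ from the real vocab.txt.

```python
def encode_tokens(tokens, token_dict, max_len):
    """Turn word pieces into fixed-length id and segment lists (single-text case)."""
    # Reserve two slots for the special [CLS] and [SEP] tokens, then truncate.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    ids = [token_dict.get(token, token_dict["[UNK]"]) for token in tokens]
    # Pad with the [PAD] id so every text has exactly max_len entries.
    ids += [token_dict["[PAD]"]] * (max_len - len(ids))
    segments = [0] * max_len  # a single text uses segment 0 everywhere
    return ids, segments


# Made-up toy dictionary for illustration only.
toy_dict = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "just": 4, "wow": 5, "!": 6}
print(encode_tokens(["just", "wow", "!"], toy_dict, max_len=8))
print(encode_tokens(["just", "wow", "!", "wow", "wow"], toy_dict, max_len=5))
```

The first call pads the short text with zeros, and the second call truncates the long text while keeping [SEP] as the last id.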

In [26]:
#This is an example of how the function zip works. 
#This is useful to understand the function that processes the reviews.
#The zip function forms pairs from the elements in two lists.

a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica")

x = list(zip(a, b))

x

[('John', 'Jenny'), ('Charles', 'Christy'), ('Mike', 'Monica')]

In [27]:
#The function np.random.shuffle shuffles the order of the pairs

np.random.shuffle(x)

x

[('Mike', 'Monica'), ('John', 'Jenny'), ('Charles', 'Christy')]

In [28]:
#When we use * with zip, we reverse the zipping process

a,b = zip(*x)

In [29]:
a = np.array(a)
a

array(['Mike', 'John', 'Charles'], dtype='<U7')

In [30]:
#This is an example of how the % operator works. 
#This is useful to understand the function that processes the reviews.

#The % operator is mostly used to find the modulus of two integers.
#a % b returns the remainder after dividing a by b

9%2

1

In [31]:
#This is an example of how the function tqdm works. 
#This is useful to understand the function that processes the reviews.

for i in tqdm(range(0, int(1e7)), desc ="Processing"):
  pass

Processing: 100%|██████████| 10000000/10000000 [00:02<00:00, 4475509.19it/s]


In [32]:
#Final function to process all the reviews

def load_data(path, tagset):
    indices, sentiments = [], []
    for folder, sentiment in tagset:
        folder = os.path.join(path, folder)
        for name in tqdm(os.listdir(folder)):
            with open(os.path.join(folder, name), 'r', encoding="utf-8") as reader:
                text = reader.read()
            #Keep only the token ids; the segment ids are rebuilt as zeros below
            ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
            indices.append(ids)
            sentiments.append(sentiment)

    #Shuffle reviews and labels together so that each pair stays aligned
    items = list(zip(indices, sentiments))
    np.random.shuffle(items)
    indices, sentiments = zip(*items)
    indices = np.array(indices)
    #Drop the last reviews so that the total is a multiple of BATCH_SIZE
    mod = indices.shape[0] % BATCH_SIZE
    if mod > 0:
        indices, sentiments = indices[:-mod], sentiments[:-mod]
    #The model takes two inputs: the token ids and the segment ids (all zeros here)
    return [indices, np.zeros_like(indices)], np.array(sentiments)
  

train_x, train_y = load_data(train_path, tagset)
test_x, test_y = load_data(test_path, tagset)

100%|██████████| 12500/12500 [00:44<00:00, 279.93it/s]
100%|██████████| 12500/12500 [00:43<00:00, 288.16it/s]
100%|██████████| 12500/12500 [00:41<00:00, 304.08it/s]
100%|██████████| 12500/12500 [00:41<00:00, 298.26it/s]
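
The function load_data drops the trailing reviews that do not fill a complete batch. The sketch below reproduces only that step on a plain list so the arithmetic is easy to follow; the helper name `drop_incomplete_batch` is hypothetical.

```python
def drop_incomplete_batch(items, batch_size):
    """Drop trailing items so that len(items) is a multiple of batch_size."""
    mod = len(items) % batch_size
    # The guard matters: items[:-0] is items[:0], which would be an empty list.
    return items[:-mod] if mod > 0 else items


reviews = list(range(25000))  # stand-in for the 25,000 reviews of one split
kept = drop_incomplete_batch(reviews, 16)
print(len(reviews) % 16, len(kept), len(kept) // 16)
```

With 25,000 reviews and a batch size of 16, the remainder is 8, so 24,992 reviews are kept in 1,562 full batches. This applies to both the training and the test split.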


In [33]:
#This is the output of train_x and train_y
#train_x is a list of two arrays: the token ids of each review and the segment ids (all zeros).
#train_y is a vector of zeros and ones: 0 is a negative review and 1 is a positive review.
print(train_x)
print("------------------------------------------------------------------------")
print(train_y)

[array([[  101,  7078, 10392, ...,  3496,  2000,   102],
       [  101,  6583,  9905, ...,  7987,  1013,   102],
       [  101,  1045,  1005, ...,     0,     0,     0],
       ...,
       [  101,  1045,  2428, ...,  1010,  1000,   102],
       [  101,  2065,  2017, ...,  2007,  1996,   102],
       [  101,  1045,  2001, ...,  1999,  1996,   102]]), array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])]
------------------------------------------------------------------------
[1 0 0 ... 0 0 0]


## **Building the model**

<font size="3">Here we use the BERT model that we downloaded and put on our Google Drive. Recall that we imported the function load_trained_model_from_checkpoint from the keras_bert library. We load BERT with the following call<br><br>

In [34]:
#Load model using the function load_trained_model_from_checkpoint from the keras_bert library
model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,
    trainable=True,
    seq_len=SEQ_LEN,
)

## **Fine-tuning**

<font size="3">To fine-tune this model for a classification task, we take the output of the layer named NSP-Dense (Next Sentence Prediction-Dense) and feed it to a new fully connected dense layer. In this example the new layer is a binary classification layer, which is essentially a fully connected dense layer with 2 units. This is shown below<br><br>

In [35]:
inputs = model.inputs[:2]
dense = model.get_layer('NSP-Dense').output
outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
model = keras.models.Model(inputs, outputs)
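
As a rough illustration of what the new layer computes, the sketch below applies a dense layer with 2 units and a softmax to one 768-dimensional vector in plain Python. The feature vector and the weights are random stand-ins, not values from BERT, and the function name `dense_softmax` is hypothetical.

```python
import math
import random


def dense_softmax(features, weights, bias):
    """Apply a fully connected layer followed by softmax to one feature vector."""
    logits = [sum(w * x for w, x in zip(class_weights, features)) + b
              for class_weights, b in zip(weights, bias)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


hidden_size, n_classes = 768, 2
random.seed(0)
# Random stand-ins for the NSP-Dense output and for the new layer's weights.
features = [random.uniform(-1, 1) for _ in range(hidden_size)]
weights = [[random.uniform(-0.05, 0.05) for _ in range(hidden_size)]
           for _ in range(n_classes)]
bias = [0.0] * n_classes

probs = dense_softmax(features, weights, bias)
print(probs, sum(probs))
print("parameters in the new layer:", hidden_size * n_classes + n_classes)
```

The new layer adds only 768 × 2 weights plus 2 biases, that is 1,538 parameters, on top of the pre-trained network.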

## **Compile new model**

<font size="3">We need to compile the new model. We will use Rectified Adam (RAdam) as the optimizer. The size of the last fully connected dense layer is equal to the number of classification classes or labels.
For both binary and multiclass text classification, the activation function of that layer is softmax, which makes the probabilities of the output nodes sum to 1. Here we have two classes, negative and positive. The loss function is sparse categorical cross-entropy, which takes the integer labels 0 and 1 directly.<br><br>
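
To show what the loss measures, the sketch below computes sparse categorical cross-entropy by hand: for each example it takes the negative logarithm of the probability assigned to the true class and then averages over the batch. The probabilities are made up for illustration, and the function is a plain-Python stand-in, not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean of -log(p[true class]) over the batch; labels are integer class ids."""
    losses = [-math.log(probs[label]) for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# Two made-up predictions, each as [P(negative), P(positive)].
probabilities = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]  # 0 = negative, 1 = positive, as in this notebook
good = sparse_categorical_crossentropy(labels, probabilities)

# The same predictions scored against the opposite labels give a larger loss.
bad = sparse_categorical_crossentropy([1, 0], probabilities)
print(good, bad)
```

"Sparse" means that the labels are integer class ids rather than one-hot vectors, which matches the labels 0 and 1 that load_data produces.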

In [36]:
from tensorflow.python import keras
from keras_radam import RAdam

In [37]:
model.compile(
    RAdam(lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'],
)

In [38]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Input-Token (InputLayer)       [(None, 128)]        0           []                               
                                                                                                  
 Input-Segment (InputLayer)     [(None, 128)]        0           []                               
                                                                                                  
 Embedding-Token (TokenEmbeddin  [(None, 128, 768),  23440896    ['Input-Token[0][0]']            
 g)                              (30522, 768)]                                                    
                                                                                                  
 Embedding-Segment (Embedding)  (None, 128, 768)     1536        ['Input-Segment[0][0]']    

## **Fit the model**

<font size="3">The next step is to fit the model. We will train it for 3 epochs with a batch size of 16, holding out 20% of the training data for validation. It might take a little less than 1 hour using the Colab GPU.<br><br>

In [39]:
history = model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_split=0.20,
    shuffle=True,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## **Evaluate the results**

<font size="3">We can calculate the accuracy of the model using the test set we loaded earlier. In this example we got pretty good results, with an accuracy close to 86%.<br><br>

In [40]:
from sklearn.metrics import accuracy_score, f1_score

predicts = model.predict(test_x, verbose=True).argmax(axis=-1)
accuracy = accuracy_score(test_y, predicts)
macro_f1 = f1_score(test_y, predicts, average='macro')
print ("Accuracy: %s" % accuracy)
print ("macro_f1: %s" % macro_f1)

Accuracy: 0.8641165172855314
macro_f1: 0.8636751784892873
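
For readers who want to see what the two metrics measure, the sketch below recomputes accuracy and macro F1 by hand on a small made-up set of labels. Macro F1 is the unweighted mean of the per-class F1 scores. The function name is hypothetical and the labels are illustrative, not model output.

```python
def accuracy_and_macro_f1(y_true, y_pred, classes=(0, 1)):
    """Compute accuracy and the unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives for class c
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Made-up labels: four positive and two negative reviews.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
accuracy, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print("Accuracy: %s" % accuracy)
print("macro_f1: %s" % macro_f1)
```

In this toy case the F1 score is 0.75 for the positive class and 0.5 for the negative class, so the macro F1 is 0.625 while the accuracy is 4 out of 6.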


## **Perform sentiment analysis on new prompts**

<font size="3">Here we have a list of 7 short reviews. Using the function predict we can see whether the model labels each statement as positive or negative.<br><br>

In [41]:
texts = [
  "It's a must watch",
  "Can't wait for it's next part!",
  'It fell short of expectations.',
  'Wish there was more to it!',
  'Just wow!',
  'Colossial waste of time',
  'Save youself from this 90 mins trauma!'
]
for text in texts:
  ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
  inpu = np.array(ids).reshape([1, SEQ_LEN])
  predicted_id = model.predict([inpu,np.zeros_like(inpu)]).argmax(axis=-1)[0]
  print ("%s: %s"% (id_to_labels[predicted_id], text))

positive: It's a must watch
positive: Can't wait for it's next part!
negative: It fell short of expectations.
negative: Wish there was more to it!
positive: Just wow!
negative: Colossial waste of time
negative: Save youself from this 90 mins trauma!


## **Perform sentiment analysis on new prompts from another domain**

<font size="3">Now we will try the sentiment analysis task not on movie reviews but on statements made by company executives on their earnings calls. Here we selected only 2 of them. The second statement talks about losses and should be labeled as negative, but the model labels it as positive. This raises the question of whether we should consider running a domain adaptation process for the model, that is, training the model further on text that uses words frequent in financial jargon. We will explore that in another notebook.<br><br>

In [42]:
texts = [
  "Yes. So we've never really disclosed beds per door, anything like that.\
  What I will say is, we actually just completed a pretty big deep dive on this with cohort\
  views. And no matter how we cut it, we are continuing to see same-store sales increase,\
  which is terrific , and Q4 was no exception to that. So our strength in the marketplace continues to grow",  
  "Understood.I'd say that we probably lost $0.5 million to $0.75 million in the fourth quarter of the year due to\
  some of those headwinds as an approximation for the combination of outages , weathers and the like."
]
for text in texts:
  ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
  inpu = np.array(ids).reshape([1, SEQ_LEN])
  predicted_id = model.predict([inpu,np.zeros_like(inpu)]).argmax(axis=-1)[0]
  print ("%s: %s"% (id_to_labels[predicted_id], text))

positive: Yes. So we've never really disclosed beds per door, anything like that.  What I will say is, we actually just completed a pretty big deep dive on this with cohort  views. And no matter how we cut it, we are continuing to see same-store sales increase,  which is terrific , and Q4 was no exception to that. So our strength in the marketplace continues to grow
positive: Understood.I'd say that we probably lost $0.5 million to $0.75 million in the fourth quarter of the year due to  some of those headwinds as an approximation for the combination of outages , weathers and the like.


> *What we have to learn to do, we learn by doing*. *Aristotle*

<font size="3">
Follow me on <a href="https://co.linkedin.com/in/andres-gomez-hernandez" target="_blank">LinkedIn</a> for topics about quantitative finance, data science and emerging markets.
</font>