<center><h1><font color = "orange" >Financial Data Structures</font></h1></center>

<h4>Introduction:</h4>
<p>In this notebook we learn to work with unstructured financial data and to derive from it a structured dataset for machine learning algorithms. Generally, it is not advisable to consume someone else's preprocessed dataset, as the likely outcome is that you rediscover what they have already figured out. We want to take an unstructured dataset and process it so that we can find novel, informative features.</p>

<h4>Structured vs. Unstructured Data:</h4>
<p>Structured data is data that you would usually find in a relational database, for instance phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, which makes them simple to search. Unstructured data is data found in the wild that has no predefined structure, for instance social media text feeds, audio, and images.</p>

<h4>The 4 Types of Financial Data:</h4>
<p>There are four types of financial data:</p>
<ul>
    <li><b>Fundamental Data:</b> Data you can find in regulatory filings such as quarterly reports, for example assets, liabilities, sales, costs, and earnings. This data is extremely regularized and low frequency. Since it is so accessible to the marketplace, it is unlikely that there is much value left to be exploited. However, it can be useful in combination with other data types.</li>
    <li><b>Market Data:</b> Data from all trading activity that takes place in an exchange such as price, volatility, dividends, and volume.</li>
    <li><b>Analytics:</b> Data that has already been processed in a particular way, such as analyst recommendations, credit ratings, earnings expectations, and news sentiment. This data is usually purchased from a vendor.</li>
    <li><b>Alternative Data:</b> Primary information that has not yet made it to other sources, such as satellite images, Google searches, and Twitter posts. This data is usually more unique and harder to process.</li>
</ul>
<p><b>Note:</b> A dataset might be useful if it annoys the data infrastructure team. Perhaps your competitors did not try to use it for particular reasons or gave up midway.</p>

<h4>Bars:</h4>
<p>The data structures used to hold trading information are often referred to as bars. The data is arranged as a table, and each row of the table is a "bar". Bars can vary greatly in how they are constructed, but in general there are two categories of methods:</p>
<ul>
    <li>Standard Bar Methods</li>
    <li>Information-Driven Methods</li>
</ul>

<h4>Standard Bars:</h4>
<p>Standard bars aim to transform a series of observations that arrive at an irregular frequency into a homogeneous series derived from regular sampling. There are four main types of standard bars:</p>
<ul>
    <li>Time Bars</li>
    <li>Tick Bars</li>
    <li>Volume Bars</li>
    <li>Dollar Bars</li>
</ul>

<h5>Time Bars:</h5>
<p>Time Bars are obtained by sampling information at a fixed time interval e.g. once every minute. This information usually contains:</p>
<ul>
    <li>Timestamp</li>
    <li>Volume</li>
    <li>VWAP (volume-weighted average price) - calculated by adding up the dollars traded for every transaction in the bar (price multiplied by number of shares traded) and then dividing by the total shares traded in the bar. For a daily bar, this is the total for the day.</li>
    <li>Open</li>
    <li>Close</li>
    <li>High</li>
    <li>Low</li>
</ul>
<p>This is the typical CSV data that you will find on Yahoo Finance for a particular equity. This type of data should be avoided for two reasons:</p>
<ol>
    <li>Markets do not process information at a constant time interval, e.g. the hour after the open is more active than the hour around noon. As a result, time bars oversample information during low-activity periods and undersample information during high-activity periods.</li>
    <li>Time-sampled series often exhibit poor statistical properties, such as serial correlation, heteroscedasticity, and non-normality of returns.</li>
</ol>
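<p>For reference, the fields listed above can be assembled from raw trades as follows. This is a minimal sketch in plain Python rather than pandas so the logic stays visible. The <code>trades</code> list, its values, and the function name <code>time_bars</code> are assumptions made for illustration, not part of the dataset used in this notebook.</p>

```python
from datetime import datetime

# Hypothetical trades as (timestamp, price, volume); the values are made up.
trades = [
    (datetime(2013, 9, 1, 17, 0, 5), 100.0, 10),
    (datetime(2013, 9, 1, 17, 0, 30), 101.0, 20),
    (datetime(2013, 9, 1, 17, 0, 55), 102.0, 10),
    (datetime(2013, 9, 1, 17, 1, 10), 101.0, 30),
    (datetime(2013, 9, 1, 17, 1, 40), 103.0, 10),
]


def time_bars(trades):
    """Group time-ordered trades into one-minute bars (assumes volume > 0)."""
    bars = {}  # keyed by the start of each minute; dicts keep insertion order
    for ts, price, volume in trades:
        minute = ts.replace(second=0, microsecond=0)
        if minute not in bars:
            bars[minute] = {"timestamp": minute, "open": price, "high": price,
                            "low": price, "close": price,
                            "volume": 0, "dollars": 0.0}
        bar = bars[minute]
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["volume"] += volume
        bar["dollars"] += price * volume
    for bar in bars.values():
        # VWAP = dollars traded in the bar / units traded in the bar
        bar["vwap"] = bar.pop("dollars") / bar["volume"]
    return list(bars.values())


minute_bars = time_bars(trades)
```

<p>Every minute that contains at least one trade produces exactly one bar, no matter how many trades fall inside it. That is the source of the over- and undersampling problem described above.</p>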

<h5>Tick Bars:</h5>
<p>Sample variables such as Timestamp, VWAP, open price, etc. are extracted each time a pre-defined number of transactions takes place.</p>

<p>For instance, we could form a bar every 1,000 transactions. Mandelbrot and Taylor realized that sampling as a function of the number of transactions yields more desirable statistical properties: returns sampled by trading activity are closer to independent and identically distributed (IID) normal. Many statistical methods assume that observations are drawn from an IID Gaussian process, so tick bars let us apply those methods with fewer violated assumptions.</p>
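<p>A minimal sketch of tick-bar sampling in plain Python follows. The function name <code>tick_bars</code>, the parameter <code>ticks_per_bar</code>, and the sample trades are hypothetical. Dropping an incomplete trailing bar is one possible design choice, not something the text prescribes.</p>

```python
def tick_bars(trades, ticks_per_bar):
    """Form one bar per `ticks_per_bar` trades; trades are (price, volume) pairs.

    An incomplete bar at the end of the series is dropped.
    """
    bars = []
    for start in range(0, len(trades) - ticks_per_bar + 1, ticks_per_bar):
        chunk = trades[start:start + ticks_per_bar]
        prices = [price for price, _ in chunk]
        bars.append({
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
            "volume": sum(volume for _, volume in chunk),
        })
    return bars


# Hypothetical trades as (price, volume); the values are made up.
sample_trades = [
    (100.0, 5), (100.5, 1), (100.25, 2),
    (101.0, 10), (100.75, 3), (101.5, 4),
    (101.25, 1),
]
bars = tick_bars(sample_trades, ticks_per_bar=3)
```

<p>With seven trades and three ticks per bar, two complete bars are produced and the seventh trade is left for the next bar.</p>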

<h5>Volume Bars:</h5>
<p>Volume bars sample every time a pre-defined amount of the security's units (shares, futures contracts, etc.) has been exchanged. For example, we could sample prices every time a futures contract exchanges 1,000 units, regardless of the number of ticks involved. Volume bars circumvent the following problem that tick bars incur:</p>

<p>Suppose there is one order sitting on the offer for a size of 10. If we buy 10 lots, our order will be recorded as 1 tick. If instead there are 10 orders of size 1 on the offer, our one buy will be recorded as 10 separate transactions.</p>

<p>Volume bars are preferred over tick bars because sampling by volume gets us closer to an IID Gaussian distribution than sampling by ticks.</p>
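<p>The order-fragmentation problem can be illustrated with a small sketch. The function <code>volume_bars</code>, its parameter <code>volume_per_bar</code>, and the two trade lists are hypothetical. As a simplification, any volume that overshoots the threshold is assigned to the bar that crosses it rather than being split across bars.</p>

```python
def volume_bars(trades, volume_per_bar):
    """Close a bar each time cumulative volume reaches `volume_per_bar`.

    Trades are (price, volume) pairs. Overshoot stays in the closing bar.
    """
    bars, prices, traded = [], [], 0
    for price, volume in trades:
        prices.append(price)
        traded += volume
        if traded >= volume_per_bar:
            bars.append({"open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1],
                         "volume": traded})
            prices, traded = [], 0  # start accumulating the next bar
    return bars


# The same 30 lots bought two ways (hypothetical values):
one_fill_per_buy = [(100.0, 10)] * 3   # each buy hits a single order of size 10
ten_fills_per_buy = [(100.0, 1)] * 30  # each buy hits ten orders of size 1

bars_a = volume_bars(one_fill_per_buy, volume_per_bar=10)
bars_b = volume_bars(ten_fills_per_buy, volume_per_bar=10)
```

<p>Both sequences yield three volume bars, although the first contains 3 ticks and the second contains 30. A tick-based sampler would treat the two very differently.</p>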

<h5>Dollar Bars:</h5>
<p>Dollar bars are formed by sampling an observation every time a pre-defined market value is exchanged.</p>
<p>The number of shares traded is a function of the actual value exchanged. Thus, it makes sense to sample bars in terms of dollar value exchanged rather than ticks or volume, particularly when the analysis involves significant price fluctuations.</p>

<p>Dollar bars are also more interesting than time, tick, or volume bars because the number of outstanding shares often changes multiple times over the course of a security's life as a result of corporate actions. Even after adjustment for splits and reverse splits, there are other actions that will impact the number of ticks and the volume, like issuing new shares or buying back existing shares. Dollar bars tend to be robust in the face of those actions.</p>
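<p>A minimal sketch of dollar-bar sampling follows, using unadjusted prices around a hypothetical 2-for-1 split as the illustration. The function <code>dollar_bars</code>, the parameter <code>dollars_per_bar</code>, and the trade values are assumptions. As before, overshoot is left in the bar that crosses the threshold.</p>

```python
def dollar_bars(trades, dollars_per_bar):
    """Close a bar each time cumulative traded value reaches `dollars_per_bar`.

    Trades are (price, volume) pairs; traded value is price * volume.
    """
    bars, prices, shares, dollars = [], [], 0, 0.0
    for price, volume in trades:
        prices.append(price)
        shares += volume
        dollars += price * volume
        if dollars >= dollars_per_bar:
            bars.append({"open": prices[0], "high": max(prices),
                         "low": min(prices), "close": prices[-1],
                         "volume": shares, "dollar_value": dollars})
            prices, shares, dollars = [], 0, 0.0  # reset for the next bar
    return bars


# Hypothetical activity before and after a 2-for-1 split: the price halves
# and the share count doubles, so the dollar value per trade is unchanged.
pre_split = [(50.0, 100)] * 4
post_split = [(25.0, 200)] * 4

bars_pre = dollar_bars(pre_split, dollars_per_bar=10_000)
bars_post = dollar_bars(post_split, dollars_per_bar=10_000)
```

<p>Both periods produce two dollar bars. A volume bar with a fixed share threshold would fire twice as often after the split, because twice as many shares change hands for the same dollar value.</p>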

<h4>Creating Standard Bar Methods:</h4>

In [22]:
import pandas as pd


class standard_bar:
    def __init__(self, data_path):
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def format_data(self):
        # Combine the date and time columns into a single datetime string
        self.df['datetime'] = self.df['Date'].map(str) + ' ' + self.df['Time'].map(str)
        self.df = self.df.drop(['Date', 'Time'], axis=1)

        # Calculate transaction value (price multiplied by volume)
        self.df['transaction_value'] = self.df['Price'] * self.df['Volume']

        # Set index to the datetime; set_index returns a new DataFrame,
        # so the result must be assigned back
        self.df = self.df.set_index('datetime')

        # TODO: Drop duplicates

        # TODO: Remove outliers

In [13]:
data_path = 'data/ES_Trades.csv'
df = pd.read_csv(data_path)

In [14]:
df.head()

Unnamed: 0,Symbol,Date,Time,Price,Volume,Market Flag,Sales Condition,Exclude Record Flag,Unfiltered Price
0,ESU13,09/01/2013,17:00:00.083,1640.25,8,E,0,,1640.25
1,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25
2,ESU13,09/01/2013,17:00:00.083,1640.25,2,E,0,,1640.25
3,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25
4,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25


In [19]:
df['transaction_value'] = df['Price']*df['Volume']

In [21]:
df.set_index('datetime')

datetime,Symbol,Date,Time,Price,Volume,Market Flag,Sales Condition,Exclude Record Flag,Unfiltered Price,transaction_value
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,8,E,0,,1640.25,13122.00
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25,1640.25
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,2,E,0,,1640.25,3280.50
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25,1640.25
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25,1640.25
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,12,E,0,,1640.25,19683.00
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,4,E,0,,1640.25,6561.00
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,4,E,0,,1640.25,6561.00
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25,1640.25
09/01/2013 17:00:00.083,ESU13,09/01/2013,17:00:00.083,1640.25,1,E,0,,1640.25,1640.25


In [None]:
df.head()