## NER Display

In [1]:
import pandas as pd
import numpy as np
import nltk
import spacy
from spacy import displacy
from pprint import pprint
from pathlib import Path
from IPython.display import display, HTML

from data import clients  # our data
#from terms import RE_MAP,REA_MAP

nlp = spacy.load('en')  # English model via the spaCy 2.x 'en' shortcut

### Sample Investor Memo Data
- converted to JSON format to replicate an API call
- dummy data was added to complete the named entities
- Client A:
    - **company**: Argos Group
    - **contact**: Adam
- Client B:
    - **company**: Blue Industries
    - **contact**: None
- Client C:
    - **company**: Circle Inc
    - **contact**: Carol

In [2]:
pprint(clients)

{'client_A': {'memo_1': 'I recently caught up with Adam of Argos Group and he '
                        'indicated they are actively trying to grow their\n'
                        '        portfolio in the US. They have been focused '
                        'in NYC, but with his recent addition to the team, '
                        'are\n'
                        '        looking in Chicago and on the West Coast. '
                        'Their main focus continues to be high-rise office in '
                        'CBDs, but\n'
                        '        are also considering urban multi-housing, '
                        'preferably with a value-add component.',
              'memo_2': 'I will be based in New York, mainly tasked with '
                        'sourcing equity and debt investments in high-profile\n'
                        '        real estate assets in gateway markets with '
                        'equity ticket $30M and up. My team and I will also '
     

In [3]:
df = pd.DataFrame.from_dict(clients)  # load our dirty data in pandas

## Let's look at some of our objectives
>
> - Transform unstructured data into structured data via an information extraction system. For
> example, convert an investor memo (text format) into a well-defined structure (table format with
> pre-defined values), such as investor name and preference (geography, property type, investment
> size).


What we need to extract are these entities:
- name : Person
- preference
    - geography : location
    - property type : create (no built-in label)

- investment size : money?

## Named entities


| TYPE                               | DESCRIPTION                                          | 
|------------------------------------|------------------------------------------------------| 
| `PERSON`                             | People, including fictional.                         | 
| `NORP`                               | Nationalities or religious or political groups.      | 
| `FAC`                                | Buildings, airports, highways, bridges, etc.         | 
| `ORG`                                | Companies, agencies, institutions, etc.              | 
| `GPE`                                | Countries, cities, states.                           | 
| `LOC`                                | Non-GPE locations, mountain ranges, bodies of water. | 
| `PRODUCT`                            | Objects, vehicles, foods, etc. (Not services.)       | 
| `EVENT`                              | Named hurricanes, battles, wars, sports events, etc. | 
| `WORK_OF_ART`                        | Titles of books, songs, etc.                         | 
| `LAW`                                | Named documents made into laws.                      | 
| `LANGUAGE`                           | Any named language.                                  | 
| `DATE`                               | Absolute or relative dates or periods.               | 
| `TIME`                               | Times smaller than a day.                            | 
| `PERCENT`                            | Percentage, including "%".                           | 
| `MONEY`                              | Monetary values, including unit.                     | 
| `QUANTITY`                           | Measurements, as of weight or distance.              | 
| `ORDINAL`                            | "first", "second", etc.                              | 
| `CARDINAL`                           | Numerals that do not fall under another type.        | 

# Visualize entities

## Client A

In [4]:
def remove_whitespace_entities(doc):
    '''Simple hack: drop entities that consist only of whitespace.'''
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# register the filter so it runs right after the NER component
nlp.add_pipe(remove_whitespace_entities, after='ner')
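
The memos carry newlines and indentation from their source, which is why whitespace can end up inside entities. A complementary option is to collapse whitespace before the text reaches `nlp`. The sketch below is only an illustration: `normalize_whitespace` is a hypothetical helper that is not part of the original notebook, and its effect on the model's predictions has not been checked here.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with ' ' yields clean text.
    return ' '.join(text.split())


sample = 'portfolio in the US.\n        They have been focused in NYC'
print(normalize_whitespace(sample))
```

Note that character offsets reported on the normalized text no longer line up with the raw memo, so the raw text should be kept if offsets matter.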

In [5]:
# split each memo for better notebook display
memo_1A = df.client_A[0]
memo_2A = df.client_A[1]
memo_3A = df.client_A[2]
memo_4A = df.client_A[3]
memo_1B = df.client_B[0]
memo_2B = df.client_B[1]
memo_3B = df.client_B[2]
memo_4B = df.client_B[3]
memo_1C = df.client_C[0]
memo_2C = df.client_C[1]
memo_3C = df.client_C[2]
memo_4C = df.client_C[3]

# display entities
- text
- location
- label

In [6]:
print(memo_1A)

I recently caught up with Adam of Argos Group and he indicated they are actively trying to grow their
        portfolio in the US. They have been focused in NYC, but with his recent addition to the team, are
        looking in Chicago and on the West Coast. Their main focus continues to be high-rise office in CBDs, but
        are also considering urban multi-housing, preferably with a value-add component.


In [7]:
doc = nlp(memo_1A)
# doc.ents holds Span objects, one per recognized entity
ent_text = [ent.text for ent in doc.ents]
start_char = [ent.start_char for ent in doc.ents]
end_char = [ent.end_char for ent in doc.ents]
ent_label = [ent.label_ for ent in doc.ents]

pd.DataFrame(list(zip(ent_text, start_char, end_char, ent_label)),
             columns=['ent_text', 'start_char', 'end_char', 'ent_label'])

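|   | ent_text       | start_char | end_char | ent_label |
|---|----------------|------------|----------|-----------|
| 0 | Adam           | 26         | 30       | PERSON    |
| 1 | Argos Group    | 34         | 45       | ORG       |
| 2 | US             | 127        | 129      | GPE       |
| 3 | NYC            | 157        | 160      | LOC       |
| 4 | Chicago        | 227        | 234      | GPE       |
| 5 | the West Coast | 242        | 256      | LOC       |
| 6 | CBDs           | 311        | 315      | DATE      |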
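
The `start_char` and `end_char` columns are character offsets into the memo text, with `end_char` exclusive, so slicing the text recovers the entity. A small check on the first two rows of the table above, using only the opening of memo 1 so it runs without a spaCy model:

```python
# Opening of Client A's memo 1, copied from the notebook output.
text = 'I recently caught up with Adam of Argos Group and he indicated'

# (start_char, end_char, label) rows taken from the entity table.
spans = [(26, 30, 'PERSON'), (34, 45, 'ORG')]

# end_char is exclusive, so text[start:end] returns exactly the entity text.
recovered = [(text[start:end], label) for start, end, label in spans]
print(recovered)  # [('Adam', 'PERSON'), ('Argos Group', 'ORG')]
```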


Create a function to report entities grouped by label


In [8]:
def report(doc):
    '''Take a spaCy Doc and return a dictionary
    mapping each entity label to its list of entity texts.'''
    ent_label = [ent.label_ for ent in doc.ents]  # extract entity labels
    ent_text = [ent.text.strip() for ent in doc.ents]  # extract entity text
    #print(ent_text)  # sanity check
    ent_dict = {k: [] for k in ent_label}  # one empty list per label
    for k, v in zip(ent_label, ent_text):
        ent_dict[k].append(v)
    return ent_dict
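
`report` keeps every mention, so an entity that occurs twice in a memo appears twice in its list. If repeats are not wanted, one possible variant groups by label and skips texts already seen. The sketch works on plain `(label, text)` tuples so it runs without a spaCy model; `group_entities` is a hypothetical name, not part of the notebook.

```python
def group_entities(pairs):
    """Group (label, text) pairs by label, keeping first-seen order and dropping repeats."""
    grouped = {}
    for label, text in pairs:
        texts = grouped.setdefault(label, [])  # create the list on first use
        if text not in texts:  # skip a text already recorded for this label
            texts.append(text)
    return grouped


pairs = [('ORG', 'REIT'), ('GPE', 'Atlanta'), ('ORG', 'REIT'), ('GPE', 'Baltimore')]
print(group_entities(pairs))  # {'ORG': ['REIT'], 'GPE': ['Atlanta', 'Baltimore']}
```

With a real `Doc`, the pairs could be built as `[(ent.label_, ent.text.strip()) for ent in doc.ents]`.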

### Client A: memo 1

In [9]:
# raw text of memo
print(memo_1A)

I recently caught up with Adam of Argos Group and he indicated they are actively trying to grow their
        portfolio in the US. They have been focused in NYC, but with his recent addition to the team, are
        looking in Chicago and on the West Coast. Their main focus continues to be high-rise office in CBDs, but
        are also considering urban multi-housing, preferably with a value-add component.


In [10]:
# display entities
doc = nlp(memo_1A)  # run the spaCy pipeline on the memo
displacy.render(doc, jupyter=True, style='ent') # display entities

In [12]:

html = displacy.render(doc, style='ent')
display(HTML(html))

In [16]:
html = displacy.render(doc, style='ent',page=True)
display(HTML(html))

In [17]:
report(doc)  # function to extract entities from text

{'PERSON': ['Adam'],
 'ORG': ['Argos Group'],
 'GPE': ['US', 'Chicago'],
 'LOC': ['NYC', 'the West Coast'],
 'DATE': ['CBDs']}
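
The dictionary returned by `report` can be mapped onto the target fields from the objectives (name from `PERSON`, geography from the location labels, investment size from `MONEY`). A minimal sketch of that mapping, assuming the field names below; `to_record` and its keys are hypothetical and only illustrate the idea:

```python
def to_record(ent_dict):
    """Map a label -> texts dictionary onto the target investor fields."""
    persons = ent_dict.get('PERSON', [])
    return {
        # first PERSON mention, or None when the memo names no contact
        'name': persons[0] if persons else None,
        # GPE and LOC both describe geography, so they are merged
        'geography': ent_dict.get('GPE', []) + ent_dict.get('LOC', []),
        'investment_size': ent_dict.get('MONEY', []),
        # no built-in label covers property type; empty until a custom entity exists
        'property_type': [],
    }


# Entities reported for Client A, memo 1.
entities = {'PERSON': ['Adam'], 'ORG': ['Argos Group'],
            'GPE': ['US', 'Chicago'], 'LOC': ['NYC', 'the West Coast'],
            'DATE': ['CBDs']}
record = to_record(entities)
print(record)
```

Taking the first `PERSON` as the contact is a simplification; memos that mention several people would need a better rule.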

### Client A: memo 2

In [18]:
print(memo_2A)

I will be based in New York, mainly tasked with sourcing equity and debt investments in high-profile
        real estate assets in gateway markets with equity ticket $30M and up. My team and I will also look for JV
        and M&A opportunities of established real estate companies and platforms.


In [19]:
# display entities
doc = nlp(memo_2A)  # run the spaCy pipeline on the memo
displacy.render(doc, jupyter=True, style='ent') # display entities

In [20]:
print(report(doc))

{'GPE': ['New York'], 'MONEY': ['$30'], 'ORG': ['JV']}


### Client A: memo 3

In [21]:
print(memo_3A)

They intend to add a large subterranean retail complex as well as reposition the building after the
        major tenant moves out in 2 years. For the most part they are value-add to opportunistic driven. They
        are focused on the following markets: NYC, Boston, DC, Chicago, SF, and LA. They will look at office, MH,
        and retail. They are solving to mid-teen returns and they have no max or min on their investment size.


In [22]:
# display entities
doc = nlp(memo_3A)  # run the spaCy pipeline on the memo
displacy.render(doc, jupyter=True, style='ent') # display entities

In [23]:
report(doc)

{'DATE': ['2 years'],
 'GPE': ['NYC', 'Boston', 'DC', 'Chicago', 'SF', 'LA'],
 'ORG': ['MH']}

### Client A: memo 4

In [24]:
print(memo_4A)

They are seeking large office/residential/retail deals in the Tri-State region. Looking for low teen return
        profile and 100% ownership (no operating partners). Typically looking for long term value and can
        handle temporary non cash flowing periods to help generate value. Currently not using much, if any,
        leverage on their deals. Global portfolio is 70mm SF, US portfolio is two NYC assets (room to grow).


In [25]:
# display entities
doc = nlp(memo_4A)  # run the spaCy pipeline on the memo
displacy.render(doc, jupyter=True, style='ent') # display entities

In [26]:
report(doc)

{'PERCENT': ['100%'],
 'CARDINAL': ['70', 'two'],
 'GPE': ['SF', 'US'],
 'LOC': ['NYC']}

---
# <center> Results </center>
## Client A
### `memo 1`
- `CBDs` (Central Business Districts): mislabeled as `DATE`
- `multi-housing` *type* needs an entity
- domain-specific words need explanation
    - `value-add`
    - `high-rise`
    
### `memo 2`
- `JV`: Joint Venture needs an entity
- `M&A`: Mergers and Acquisitions, completely ignored

### `memo 3`
Needs clarification
- `value-add`
- `mid-teen` returns
- `MH`: Multi-Housing?

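### `memo 4`
- `70mm`: mislabeled as a number (`CARDINAL`), not as `MONEY`; note that `70mm SF` may instead mean 70 million square feet, which would make it a `QUANTITY`
- `Tri-State`: could be `LOC`
- `office/residential/retail`: needs expansion, then an entity label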
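
Abbreviated amounts are a recurring problem: in memo 2 the `$30M` ticket came back as `$30`. Until the model is retrained, a rule-based pass can recover such amounts. The sketch below is an assumption-laden illustration, not the notebook's method: `parse_amounts` is a hypothetical helper, and it assumes `M` and `mm` both mean million and `B` means billion, as the memos appear to use them.

```python
import re

# Suffix -> multiplier. Treating 'M' and 'mm' as million is an assumption
# based on how the sample memos use them.
MULTIPLIERS = {'k': 10**3, 'm': 10**6, 'mm': 10**6, 'million': 10**6,
               'b': 10**9, 'bn': 10**9, 'billion': 10**9}

# '$', a number, then an optional suffix that is not followed by another letter.
AMOUNT_RE = re.compile(
    r'\$\s?(\d+(?:\.\d+)?)\s?(million|billion|mm|bn|k|m|b)?(?![a-z])',
    re.IGNORECASE,
)


def parse_amounts(text):
    """Return every dollar amount found in text as a float number of dollars."""
    amounts = []
    for number, suffix in AMOUNT_RE.findall(text):
        # an unmatched suffix group comes back as '', which maps to multiplier 1
        amounts.append(float(number) * MULTIPLIERS.get(suffix.lower(), 1))
    return amounts


print(parse_amounts('equity ticket $30M and up'))  # [30000000.0]
print(parse_amounts('deals as small as $50mm.'))   # [50000000.0]
```

The regex only handles amounts that start with `$`, so figures such as `70mm SF` are deliberately left alone.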

---
# Client B
- memo 1
- memo 2
- memo 3
- memo 4

In [27]:
for row in df.client_B.index:
    mem = df.client_B[row]
    memo = mem.replace('\n',' ')  # remove newline
    doc = nlp(memo)
    print('TEXT:{}'.format(row))
    print('---------------------------------------------------')
    print(mem)
    print('---------------------------------------------------')
    # rendered entities for this memo
    print('NER:{}\n'.format(row))
    displacy.render(doc, jupyter=True, style='ent')
    print('---------------------------------------------------')
    print('RESULTS:{}\n'.format(row))
    pprint(report(doc))
    print('\n')

TEXT:memo_1
---------------------------------------------------
Retail Property Overview 
        * Target investments are dominant centres that need work / capital improvements
        * Really like outlet malls
        * Own and like Paris high street
---------------------------------------------------
NER:memo_1



---------------------------------------------------
RESULTS:memo_1

{'GPE': ['Paris'], 'ORG': ['Retail Property Overview', 'Target']}


TEXT:memo_2
---------------------------------------------------
Industrial Property Overview
        Blue Industries’s non-traded REIT is currently deploying $500 million/month into multifamily and industrial assets,
        with a focus on assets producing an immediate, steady cash flow with low cost basis.
        
        Typical parameters for the non-traded REIT are as follows:
        * B, B+ market (Atlanta, Baltimore, Phoenix)
        * Prefer to enter a secondary market and use higher leverage
        * 10-11% IRR with 60% leverage
        * 7% cash-on-cash returns
        * 5% net return
        
        Given Blue Industries’s current appetite for industrial product, they are open to a wide range of equity checks
        across primary and secondary markets, and programmatic ventures with proven operators and
        developers.
        
   

---------------------------------------------------
RESULTS:memo_2

{'DATE': ['/month'],
 'GPE': ['Atlanta', 'Baltimore', 'Phoenix'],
 'MONEY': ['$500 million'],
 'ORG': ['Industrial Property Overview         Blue Industries’s',
         'REIT',
         'REIT',
         'Blue Industries’s',
         'Blue Industries’s'],
 'PERCENT': ['10-11%', '60%', '7%', '5%'],
 'PERSON': ['B+', 'Return']}


TEXT:memo_3
---------------------------------------------------
Multi-Family Property Overview
        * Locations: looking everywhere in the US, but less active in the middle of the country. Avoid
        tertiary markets and low class property types
        * Very deep appetite for MF - $5.5B in MF last year. More stable and more downside protection
        * 3 Funds
            * Fund #1
                * Class B, suburban garden, 80-90’s vintage
                * High growth markets
                * Deal size: $250M+
            * Fund #2  Core-plus vehicle
                * Lower on the ri

---------------------------------------------------
RESULTS:memo_3

{'CARDINAL': ['3', '80', 'two'],
 'DATE': ['last year'],
 'GPE': ['US', 'Seattle'],
 'MONEY': ['250M+', '2', '$250', '#3', '$100M+ deals'],
 'ORDINAL': ['tertiary'],
 'ORG': ['Multi-Family Property Overview',
         'MF - $5.5B',
         'MF',
         'Core',
         'Completed',
         'MF'],
 'PERCENT': ['50%', '7-8%'],
 'PERSON': ['Avoid']}


TEXT:memo_4
---------------------------------------------------
They are open to geographies and can do deals as small as $50mm.
        * Industrial: hungry for more industrial and need size/quality. Have difficulty pricing cold storage.
        * Healthcare/MOB/AL/IL: Hugely bullish as they continue to grow in the space.
        * Hospitality: cautious on the national hotel front, but interested in NYC
        * West Coast: consensus seems that as they continue buying and selling in early 2017, they will be
        a net buyer on the west coast in 2017.
---------------

---------------------------------------------------
RESULTS:memo_4

{'DATE': ['early 2017', '2017'],
 'LOC': ['West Coast'],
 'MONEY': ['50'],
 'ORG': ['Healthcare/MOB/AL/IL: Hugely'],
 'PERSON': ['mm', 'NYC']}




---
# Client C
- memo 1
- memo 2
- memo 3
- memo 4

In [28]:
for row in df.client_C.index:
    mem = df.client_C[row]
    memo = mem.replace('\n',' ')  # remove newline
    doc = nlp(memo)
    print('TEXT:{}'.format(row))
    print('---------------------------------------------------')
    print(mem)
    print('---------------------------------------------------')
    # rendered entities for this memo
    print('NER:{}\n'.format(row))
    displacy.render(doc, jupyter=True, style='ent')
    print('---------------------------------------------------')
    print('RESULTS:{}\n'.format(row))
    pprint(report(doc))
    print('\n')

TEXT:memo_1
---------------------------------------------------
I had a call with Carol at Circle Inc to discuss their investment criteria for office acquisitions. As many of you
        know, they are an office owner that has been focused on coastal markets and Chicago. He indicated
        that they are now having strategic discussions about entry into other markets such as Austin, Denver,
        Charlotte, Nashville, etc. At this point it is only exploratory and they are not ready to pursue new deals
        yet. They are going to focus on Austin first and have asked for our help educating them on the market.  
        We are coordinating a meeting next month in Austin to give them an overview and market tour. I let
        him know that we can help in other markets and will be ready to assist at the appropriate time.
---------------------------------------------------
NER:memo_1



---------------------------------------------------
RESULTS:memo_1

{'DATE': ['next month'],
 'GPE': ['Chicago', 'Austin', 'Denver', 'Nashville', 'Austin'],
 'ORDINAL': ['first'],
 'ORG': ['Carol at Circle Inc'],
 'PERSON': ['Charlotte', 'Austin']}


TEXT:memo_2
---------------------------------------------------
They would like to see all core and value-add multifamily and office opportunities in their target
        markets as well as all development deals.
        * New York
        * Chicago
        * Boston
        * San Francisco
        * Los Angeles
        * Portland/Seattle
        * Washington DC
---------------------------------------------------
NER:memo_2



---------------------------------------------------
RESULTS:memo_2

{'GPE': ['New York',
         'Chicago',
         'Boston',
         'San Francisco',
         'Los Angeles',
         'Portland',
         'Washington DC']}


TEXT:memo_3
---------------------------------------------------
Circle Inc is aggressively looking to build their MH pipeline in LA via ground up development similar to what
        they are doing in the Bay Area and on the East Coast. Will look at apartments or condos, and ok with
        un-entitled sites as long as there is some income to limit downside. 150 units+ and product type
        agnostic (high/mid rise). Targeting a low-mid 5% un-trended ROC.
---------------------------------------------------
NER:memo_3



---------------------------------------------------
RESULTS:memo_3

{'CARDINAL': ['150'],
 'GPE': ['LA', 'ROC'],
 'LOC': ['the Bay Area', 'the East Coast'],
 'ORG': ['Circle Inc', 'MH'],
 'PERCENT': ['5%']}


TEXT:memo_4
---------------------------------------------------
Circle Inc is very focused on the following markets (read: 90% of what they will do) Boston, NYC, DC, Chicago,
        Seattle. They will do value add and new development of office and multi-family. He mentioned they are
        mainly focused on development as they do not like buying at 4.0-4.5% caps (they are building to 5.75%
        to 6.0% ROC).
        
        Multi-family-They really like the fundamentals of multi-family as they continue to see a migration into
        the cities across the U.S. and a movement away from home ownership. They believe transportation is
        key. They will continue to develop multi-family in urban locations.
        Office- They feel good about the fundamentals in all of their 

---------------------------------------------------
RESULTS:memo_4

{'CARDINAL': ['one'],
 'GPE': ['Boston', 'NYC, DC', 'Chicago', 'Seattle', 'ROC', 'U.S.'],
 'ORG': ['Circle Inc'],
 'PERCENT': ['90%', '4.0-4.5%', '5.75%', '6.0%']}




# Recommendations
- Update the Named Entity Recognizer
- More data is needed for correct NER training
- Train for domain-specific terms
- Real estate & financial abbreviations
    > `ROC,REIT,MOB,"value-add",...`

- Example: https://github.com/explosion/spacy/blob/master/examples/training/train_ner.py
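
While training data is being collected, a plain dictionary lookup can already flag the abbreviations listed above. The sketch below is a stand-in, not a trained recognizer: `ABBREVIATIONS` and `find_abbreviations` are hypothetical names, and the expansions are assumptions based on common real estate usage rather than anything stated in the memos.

```python
import re

# Expansions are assumptions based on common real estate usage.
ABBREVIATIONS = {
    'REIT': 'real estate investment trust',
    'ROC': 'return on cost',
    'MOB': 'medical office building',
    'CBD': 'central business district',
    'JV': 'joint venture',
    'M&A': 'mergers and acquisitions',
}

# Longest keys first so a longer abbreviation wins over a shorter one.
_keys = sorted(ABBREVIATIONS, key=len, reverse=True)
# Match a key as a whole word, allowing a plural 's' (e.g. 'CBDs').
_pattern = re.compile(
    r'(?<![A-Za-z&])(' + '|'.join(re.escape(k) for k in _keys) + r')s?(?![A-Za-z&])'
)


def find_abbreviations(text):
    """Return (matched_text, start, end, expansion) for each known abbreviation."""
    return [
        (m.group(0), m.start(), m.end(), ABBREVIATIONS[m.group(1)])
        for m in _pattern.finditer(text)
    ]


text = 'look for JV and M&A opportunities; office in CBDs'
hits = find_abbreviations(text)
for hit in hits:
    print(hit)
```

Matching is case-sensitive on purpose, so ordinary lowercase words are not picked up. The returned offsets follow the same start/exclusive-end convention as `start_char` and `end_char`.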