# III. SVM classification 4


### Goals

* Same as III. SVM_Sentencetraining3, but focused on M&A only
* Adapted to work with the latest version of spaCy (03/20)

### Comments



## Lexical features

* token n-gram features: unigrams, bigrams, trigrams
* character n-gram features: trigrams, fourgrams
* lemma n-gram features: unigrams, bigrams, trigrams
* disambiguated lemmas: lemma + POS tag
* numerals: yes, no
* symbols: yes, no
* time indicators: yes, no
* future: add semantic knowledge from structured resources:
    * e.g. takeover = acquire, acquisition
    * are word embeddings sufficient to capture semantic knowledge?
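
The n-gram features listed above can be illustrated with a minimal pure-Python sketch. The helper `ngrams` and the example sentence are hypothetical, and whitespace splitting stands in for a real tokenizer; the notebook itself builds these features with `TfidfVectorizer`.

```python
def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "Acme agreed to acquire Beta"   # made-up example sentence
tokens = sentence.lower().split()          # assumption: whitespace tokenization

# token uni-, bi- and trigrams
token_features = [" ".join(g) for n in (1, 2, 3) for g in ngrams(tokens, n)]

# character tri- and fourgrams (spaces are kept, as with analyzer='char')
char_features = ["".join(g) for n in (3, 4) for g in ngrams(sentence.lower(), n)]

print(token_features)
print(char_features[:5])
```

Lemma n-grams follow the same scheme once each token has been replaced by its lemma.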

## Syntactic features
* PoS categories:
    * for each category, a binary indicator (yes, no)
    * three classes: 0, 1, more
    * total number of occurrences
* named entity types, e.g. person, organization, location, product, event


| NE type | Examples |
|---|---|
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty a m, 1:30 p.m. |
| MONEY | 175 million Canadian Dollars, GBP 10.40 |
| PERCENT | twenty pct, 18.75 % |
| FACILITY | Washington Monument, Stonehenge |
| GPE | South East Asia, Midlothian |
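
As a rough illustration of the syntactic features described above, the sketch below encodes one sentence against a fixed tag vocabulary. The vocabularies `POS_VOCAB` and `NER_VOCAB`, the helper `encode`, and the hand-written tags are assumptions chosen for illustration; in the notebook the tags come from spaCy.

```python
from collections import Counter

# Assumption: small fixed vocabularies, for illustration only
POS_VOCAB = ["ADJ", "NOUN", "PROPN", "VERB"]
NER_VOCAB = ["ORG", "PERSON", "MONEY"]

def encode(tags, ents):
    """Encode PoS tags and entity labels of one sentence as a flat vector."""
    counts = Counter(tags)
    binary = [int(counts[t] > 0) for t in POS_VOCAB]      # tag present: yes/no
    three_class = [min(counts[t], 2) for t in POS_VOCAB]  # 0, 1, or more (=2)
    totals = [counts[t] for t in POS_VOCAB]               # total occurrences
    ner = [int(e in ents) for e in NER_VOCAB]             # entity type present
    return binary + three_class + totals + ner

# Hand-written annotations for a made-up sentence
tags = ["PROPN", "VERB", "PROPN", "ADP", "NUM", "NOUN", "NOUN", "NOUN"]
ents = ["ORG", "ORG", "MONEY"]

vector = encode(tags, ents)
print(vector)
```

Because the vocabularies are fixed, every sentence yields a vector of the same length (here 3 * 4 + 3 = 15), which is what allows the vectors to be stacked into a feature matrix.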

# Feature extraction on sentence level

In [1]:
import nltk,gensim, spacy

#nltk.data.path=[]
#nltk.data.path.append("C:\\Users\\rittchr\\nltk_data")
#nltk.data.path.append("\\esdfiles\INTERNAL\SpecialProjects\EconomicEventDetection\Analytics\nltk_data")

import re
import numpy as np
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

from sklearn.multiclass import OneVsRestClassifier
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import GridSearchCV

In [2]:
#nltk.download('averaged_perceptron_tagger')
#nltk.download('tagsets')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')

In [3]:
#from nltk.data import load
#all_pos_tags = list(load('help/tagsets/upenn_tagset.pickle').keys())
#all_pos_tags

In [4]:
class extract_other_lexical_features(BaseEstimator, TransformerMixin):
    '''
    other lexical features such as time, special chars
    '''
    
    def fit(self, x, y=None):
        return self    

    def transform(self, sentences):
    
        def extract_other_lexical_features_int(sentence):
            '''
            Simple indicator variables: are digits, symbols or time words mentioned in the sentence?
            '''
        
            tokentext = nltk.word_tokenize(sentence)

            ## contains a token made up of digits only; could also use POS tag 'NUM'
            digits = any(token.isdigit() for token in tokentext)
            #digits = any(char.isdigit() for token in tokentext for char in token) #any char is a digit

            ## contains symbols (True), i.e. tokens with non-alphanumeric characters
            symbols = any(not token.isalnum() for token in tokentext)

            ## contains time indicators ('yesterday','today')
            time_indicator_list = ['yesterday','today','tomorrow']
            
            # Note: TIME is already covered as a NER tag below; the explicit word list follows the paper.
            times = any(token in time_indicator_list for token in tokentext)
            
            return [digits,symbols,times] #{'digits':digits,'symbols':symbols,'times':times}
        
        return np.array([extract_other_lexical_features_int(sentence) for sentence in sentences])
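
To make the behaviour of the three indicator flags concrete, here is a minimal sketch that applies the same string tests to pre-split tokens. The helper `indicator_flags` and the example token lists are hypothetical; the transformer above tokenizes with `nltk.word_tokenize` instead.

```python
TIME_WORDS = {"yesterday", "today", "tomorrow"}

def indicator_flags(tokens):
    """Return [digits, symbols, times] for a list of tokens."""
    digits = any(tok.isdigit() for tok in tokens)       # token consists of digits only
    symbols = any(not tok.isalnum() for tok in tokens)  # token has a non-alphanumeric char
    times = any(tok in TIME_WORDS for tok in tokens)    # exact, case-sensitive match
    return [digits, symbols, times]

flags_a = indicator_flags(["Shares", "rose", "5", "%", "today"])
flags_b = indicator_flags(["Shares", "rose", "10.5", "pct", "Today"])
print(flags_a)
print(flags_b)
```

The second call shows two edge cases of these tests: a decimal such as `10.5` does not count as a digit token (it triggers the symbol flag instead), and a capitalised `Today` is not matched by the lowercase word list.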

In [5]:
#NER_types = ['ORGANIZATION','PERSON','LOCATION','DATE','TIME','MONEY','PERCENT','FACILITY','GPE']
NER_TYPES_spacy_all = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','LANGUAGE','DATE','TIME','PERCENT','MONEY','QUANTITY','ORDINAL','CARDINAL']
#NER_TYPES_spacy_subset = ['ORG','PERSON','LOC','GPE','DATE','TIME','MONEY','PERCENT']

In [6]:
# load spaCy's English language model
#en_nlp = spacy.load('en')
en_nlp = spacy.load('en_core_web_sm')

Tag map: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tag_map.py (meaning of the tags: https://www.clips.uantwerpen.be/pages/mbsp-tags)

In [7]:
len(en_nlp.tokenizer.vocab.morphology.tag_map)

50

spaCy distinguishes 18 named entity types (listed in `NER_TYPES_spacy_all` above):

https://spacy.io/api/annotation#named-entities

In [8]:
#! cat ~/opt/anaconda3/envs/py36/lib/python3.6/site-packages/spacy/lang/en/tag_map.py

In [9]:
from spacy.lang.en.tag_map import TAG_MAP
#TAG_MAP

In [10]:
class extract_syntactic_features(BaseEstimator, TransformerMixin):
    '''
    PoS and NER features. Each sub-feature vector is built against a fixed tag list,
    so the vector length is identical for training and test data.
    '''

    # Universal POS tags as returned by token.pos_ (plus spaCy's 'SPACE');
    # tags outside this list are ignored.
    all_pos_tags = ['ADJ','ADP','ADV','AUX','CCONJ','DET','INTJ','NOUN','NUM','PART',
                    'PRON','PROPN','PUNCT','SCONJ','SYM','VERB','X','SPACE']
    all_ner_types = NER_TYPES_spacy_all

    def fit(self, x, y=None):
        return self

    def transform(self, sentences):
        '''
        PoS tagging and NER of each sentence
        '''
        docs_features = []
        for doc in en_nlp.pipe(sentences): #, disable=["tagger", "parser"]):

            tags = [token.pos_ for token in doc]
            ners = [ent.label_ for ent in doc.ents]

            count_dict = Counter(tags)

            # number of occurrences per tag (a Counter returns 0 for missing tags)
            tag_counts = [count_dict[apt] for apt in self.all_pos_tags]

            # binary occurrence: tag present or not
            tag_occurrence = [int(tc > 0) for tc in tag_counts]

            # occurrence in three classes: 0, 1 or more
            tag_three_classes = [2 if tc > 1 else tc for tc in tag_counts]

            # named entity types present in the sentence
            ners_feature = [1 if ner in ners else 0 for ner in self.all_ner_types]

            # the original concatenated tag_three_classes twice and dropped the counts and NER features
            sent_features = tag_occurrence + tag_three_classes + tag_counts + ners_feature
            docs_features.append(sent_features)

        return np.array(docs_features)

In [11]:
def tokenize_lemmatize(sentence):
    
    #tokentext = nltk.word_tokenize(sentence)
    return [token.lemma_ for token in en_nlp(sentence)]

def tokenize_lemma_pos(sentence):
    '''
    Combine each token's lemma with its POS label, in the form lemma_POS
    '''
    return [token.lemma_+'_'+token.pos_ for token in en_nlp(sentence)]
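
The point of the disambiguated lemmas is that the same lemma used as different parts of speech becomes two distinct features. A minimal sketch, assuming hand-annotated `(lemma, POS)` pairs rather than real spaCy output; `lemma_pos_tokens` is a hypothetical stand-in for `tokenize_lemma_pos`:

```python
def lemma_pos_tokens(annotated):
    """Join each (lemma, POS) pair into a single 'lemma_POS' token."""
    return [lemma + "_" + pos for lemma, pos in annotated]

# Hand-written annotations for two made-up sentences
noun_use = [("the", "DET"), ("bid", "NOUN"), ("fail", "VERB")]
verb_use = [("they", "PRON"), ("bid", "VERB"), ("again", "ADV")]

tokens_noun = lemma_pos_tokens(noun_use)
tokens_verb = lemma_pos_tokens(verb_use)
print(tokens_noun)
print(tokens_verb)
```

With plain lemmas both sentences would share the feature `bid`; with the POS suffix they contribute `bid_NOUN` and `bid_VERB` respectively.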

In [12]:
class Debug(BaseEstimator, TransformerMixin):

    def transform(self, X):
        self.shape = X.shape
        # what other output you want
        return X

    def fit(self, X, y=None, **fit_params):
        return self

In [13]:
#test_sentence = 'The New York Times posted about people running marathons'

### Combining feature extraction

In [45]:
pipeline = Pipeline([
    
    # Use FeatureUnion to combine the syntactic, lexical and n-gram features
    ('union', FeatureUnion(
        transformer_list=[
            
            # Pipeline for getting syntactic features
            ('syntactic_features', Pipeline([
                ('extract_syntactic_features', extract_syntactic_features()),
                ("debug", Debug()),

                #('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),    
    

            # Pipeline for getting other lexical features
            ('other_lexical_features', Pipeline([
                ('extract_other_lexical_features', extract_other_lexical_features()),
                #('vect', DictVectorizer()),  # list of dicts -> feature matrix
                ("debug", Debug()),

            ])),               
    
            # word token ngrams
            ('word_ngrams', Pipeline([
                ('tfidf', TfidfVectorizer(ngram_range=(1,3),analyzer='word')),
                ("debug", Debug())
            ])),
    
    
            # character token ngrams
            ('char_ngrams', Pipeline([
                ('tfidf', TfidfVectorizer(ngram_range=(3,4),analyzer='char')),
                ("debug", Debug())
            ])),     
    
            
    
            ## lemma n-gram features, MODIFIED tokenizer=tokenize_lemmatize,
            ('lemma_ngrams', Pipeline([
                ('tfidf', TfidfVectorizer(ngram_range=(1,3),tokenizer=tokenize_lemmatize,analyzer='word')),
                ("debug", Debug())
            ])),              
    
            
            
            ## Get lemma + POS tags, MODIFIED, tokenizer=tokenize_lemma_pos,
            ('lemma_pos', Pipeline([
                ('tfidf', TfidfVectorizer(analyzer='word',tokenizer=tokenize_lemma_pos)),
                ("debug", Debug()),
            ]))     
            

        ]
        
        

        # weight components in FeatureUnion
        #transformer_weights={
        #    'subject': 1.0,
        #    'body_bow': 1.0,
        #    'body_stats': 1.0,
        #},
    )),
    
    ("debug_final", Debug()),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
    ])

In [46]:
path_corpus = './Data/jacobs_corpus.csv'

In [47]:
df=pd.read_csv(path_corpus)

In [48]:
df.head(2)

Unnamed: 0,sentence,label,datatype,title,publication_date,file_id,-1,Profit,Dividend,MergerAcquisition,SalesVolume,BuyRating,QuarterlyResults,TargetPrice,ShareRepurchase,Turnover,Debt
0,It will not say what it has spent on the proje...,-1,holdin,tesco,25-09-2013,833,1,0,0,0,0,0,0,0,0,0,0
1,"Sir John Bond , chairman , told the bank 's an...",-1,holdin,FT other HSBC,28-05-2005,393,1,0,0,0,0,0,0,0,0,0,0


In [49]:
multi_labels = ['-1',
       'Profit', 'Dividend', 'MergerAcquisition', 'SalesVolume', 'BuyRating',
       'QuarterlyResults', 'TargetPrice', 'ShareRepurchase', 'Turnover',
       'Debt']

In [50]:
one_label = 'MergerAcquisition'

In [51]:
X = df['sentence']
y = df[one_label]
X.shape,y.shape

((9937,), (9937,))

In [52]:
X_train, X_test, y_train, y_test,file_id_train,file_id_test = train_test_split(X, y, df['file_id'],test_size=0.33, random_state=42)

In [53]:
a = [[231,233],[123],[2,2,3]]

In [54]:
list(set(x for l in a for x in l))

[2, 3, 231, 233, 123]

In [None]:
%%time
pipeline.fit(X_train,y_train)

In [None]:
#pipeline.steps
#pipeline.get_params()['svc']

#### Feature dimensions

In [None]:
# shape recorded by the Debug step of each sub-pipeline during the last transform
union = pipeline.named_steps['union']
for name, sub_pipeline in union.transformer_list:
    print(name + ': ', sub_pipeline.named_steps['debug'].shape)
print('total dim: ', pipeline.named_steps['debug_final'].shape)

###### 400k features per data point!

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
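# y is binary (0/1 for one_label), so pass two class names rather than all of multi_labels
print(classification_report(y_test,y_pred,target_names=['not '+one_label,one_label]))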
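
For reference, the per-class numbers in the report can be reproduced by hand. The sketch below computes precision, recall and F1 for the positive class on made-up toy labels; `binary_prf` is a hypothetical helper, not part of scikit-learn.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels, not results from the corpus
y_true_toy = [1, 0, 0, 1, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_prf(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Since M&A sentences are likely a small share of the corpus, the positive-class row of the report is the one to watch; overall accuracy can look high even when few M&A sentences are found.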

## Write out results

In [107]:
#import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
#joblib.dump(pipeline,'../Models/TrainingJacobs/model.joblib',compress=True)