# Text Analytics

#### Charlie Marshall
#### Prof. Klabjan
#### IEMS 308
#### 2 March 2020

In [1]:
import pandas as pd
import numpy as np
import re
import glob
import os
from nltk.tokenize import word_tokenize,sent_tokenize,RegexpTokenizer
from nltk import pos_tag
import spacy

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

### Load in Data

In [2]:
percent = pd.read_csv("/Users/charlesmarshall/Desktop/IEMS 308/Project 3/all/percentage.csv", engine = "python", names = ['perc'])

In [3]:
percent.head()

Unnamed: 0,perc
0,66%
1,40%
2,90%
3,49%
4,100%


In [4]:
ceo = pd.read_csv("/Users/charlesmarshall/Desktop/IEMS 308/Project 3/all/ceo.csv", engine = "python", names = ['first', 'last'])

In [5]:
def ceo_name(df):
    # Build a single full-name column, tolerating a missing first or last name.
    # Operates on the DataFrame passed in rather than on the global `ceo`.
    for i in range(len(df)):
        if pd.isnull(df.loc[i,'last']):
            df.loc[i,'ceo_full'] = df.loc[i,'first']
        elif pd.isnull(df.loc[i,'first']):
            df.loc[i,'ceo_full'] = df.loc[i,'last']
        else:
            df.loc[i,'ceo_full'] = df.loc[i,'first'] + ' ' + df.loc[i,'last']

    return df

In [6]:
ceo = ceo_name(ceo)

In [7]:
ceo = ceo.drop(['first','last'], axis=1)

In [8]:
ceo.head()

Unnamed: 0,ceo_full
0,Tom Horton
1,Patti Hart
2,Jamie Dimon
3,Steve Cohen
4,Tim Cook


In [9]:
company = pd.read_csv("/Users/charlesmarshall/Desktop/IEMS 308/Project 3/all/companies.csv", engine = "python", names = ['company'])

In [10]:
company.head()

Unnamed: 0,company
0,Abaxis Inc
1,ACA Financial
2,Alibaba Group Holding Ltd
3,American Bell Telephone Co
4,American Express Co


In [11]:
file_list = glob.glob("/Users/charlesmarshall/Desktop/IEMS 308/Project 3/*/*.txt")

corpus = []

for file_path in file_list:
    with open(file_path,encoding='ISO-8859-1') as f_input:
        corpus.append(f_input.read())

In [12]:
len(corpus)

730

## Clean data

Remove all non-ASCII characters (encoding debris such as `Â` and `â`) and asterisks.

In [13]:
print(corpus[0])

ReutersChina's seven day repo rose to a record high of 10.77% in Shanghai, the highest since March 2003, according to Bloomberg*. Meanwhile, the one-day rate hit a record 12.85%. And Zerohedge reported that overnight repo hit 25%. The liquidity squeeze in China first began ahead of the Dragon Boat festival earlier this month. Spikes in interbank rates are common right before holidays.Â  But Diana Choyleva at Lombard Street Research said this is symptomatic of a bigger problem. She said capital flows had "become a more important driver of domestic liquidity conditions in China's managed exchange rate system." In a new note to clients Bank of America's Ting Lu wrote: "There are many factors behind the interbank liquidity squeeze that might be cited, but we believe that the ultimate reason is the central bankâs tough stance as the PBOC can practically provide unlimited liquidity to ease every squeeze if it wishes to."Â  Banks have been clamoring for a reserve requirement ratio cut. So w

In [14]:
for text in range(len(corpus)):
    corpus[text] = re.sub(r'[^\x00-\x7f]|[*]',r'', corpus[text])

In [15]:
print(corpus[0])

ReutersChina's seven day repo rose to a record high of 10.77% in Shanghai, the highest since March 2003, according to Bloomberg. Meanwhile, the one-day rate hit a record 12.85%. And Zerohedge reported that overnight repo hit 25%. The liquidity squeeze in China first began ahead of the Dragon Boat festival earlier this month. Spikes in interbank rates are common right before holidays. But Diana Choyleva at Lombard Street Research said this is symptomatic of a bigger problem. She said capital flows had "become a more important driver of domestic liquidity conditions in China's managed exchange rate system." In a new note to clients Bank of America's Ting Lu wrote: "There are many factors behind the interbank liquidity squeeze that might be cited, but we believe that the ultimate reason is the central banks tough stance as the PBOC can practically provide unlimited liquidity to ease every squeeze if it wishes to." Banks have been clamoring for a reserve requirement ratio cut. So why isn't
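The stray `Â` and `â` characters in the raw text look like UTF-8 bytes that were decoded as ISO-8859-1. Deleting every non-ASCII character removes them, but it also removes the apostrophe in "bank's" (the cleaned output reads "banks"). A possible alternative, sketched here under the assumption that the files are really UTF-8, is to undo the mis-decoding first and map typographic punctuation to ASCII before stripping. The helper name `repair_and_clean` and the sample string are made up for this illustration.

```python
import re

# Typographic punctuation and non-breaking space mapped to plain ASCII
PUNCT_MAP = str.maketrans({
    "\u2019": "'", "\u2018": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00a0": " ",
})

def repair_and_clean(text):
    """Undo UTF-8-read-as-Latin-1 mojibake if possible, then strip non-ASCII and '*'."""
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not simple mojibake: fall back to the text as read
    text = text.translate(PUNCT_MAP)
    # Same final filter as the notebook cell above
    return re.sub(r"[^\x00-\x7f]|[*]", "", text)

# "bank's" with a curly apostrophe, as it appears after the wrong decoding
sample = "the central bank\u00e2\u0080\u0099s tough stance*"
print(repair_and_clean(sample))
```

Whether this is worth the extra step depends on how much the downstream regular expressions rely on apostrophes and quotation marks.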

## Tokenizing the Sentences

In [16]:
sentences = []

for text in range(len(corpus)):
    s = sent_tokenize(corpus[text])
    sentences.append(s)

In [17]:
len(sentences)

730

In [18]:
sentences = [item for sublist in sentences for item in sublist]

In [19]:
len(sentences)

695841

In [20]:
sentences[6940]

'Crablike, Mr Hollande is trying to do just enough on Europe, without aggravating nationalism at home.'

### Removing stop words in sentences:

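The categories we are looking for (CEOs, percentages, and companies) should rarely include stop words, so removing them should eliminate few real candidates while simultaneously eliminating many candidates that do not deserve to be picked. One side effect is that names containing a stop word are shortened: "Bank of America" becomes "Bank America" in the candidate tables further down.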

In [21]:
stop_words = set(stopwords.words("english"))  # keep as a set so membership tests stay fast

In [22]:
def drop_stop_words(ls):
    for i in range(len(ls)):
        tokenized_sent = word_tokenize(ls[i])
        ls[i] = ' '.join([word for word in tokenized_sent if word.lower() not in stop_words])
        
    return ls;

In [23]:
sentences = drop_stop_words(sentences)

In [24]:
sentences[6940]

'Crablike , Mr Hollande trying enough Europe , without aggravating nationalism home .'
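To make the effect of this step concrete, here is a small sketch that uses a tiny hand-picked stop list and whitespace splitting instead of NLTK's tokenizer and full English list. Both `MINI_STOP_WORDS` and `drop_stop_words_simple` are hypothetical names introduced only for this example.

```python
# Hypothetical miniature stop list; the notebook uses NLTK's full English list
MINI_STOP_WORDS = {"of", "is", "to", "the", "at", "in", "a"}

def drop_stop_words_simple(sentence, stop_words=MINI_STOP_WORDS):
    # Keep every whitespace-separated token whose lowercase form is not a stop word
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

before = "In a new note to clients Bank of America 's Ting Lu wrote"
after = drop_stop_words_simple(before)
print(after)
```

The company name loses its "of", which is worth keeping in mind when extracted candidates are later compared with the labelled lists.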

## CEOs

1) Find all the names of people included in the corpus (potential CEOs). This is done by searching for any phrase that has two capitalized words in a row or just one capitalized word (the extraction code below uses the two-word pattern). It is not the most exact way to do this (for instance, many words at the beginning of sentences are included), but many of these words should be eliminated in feature selection.

2) Blocks of text (paragraphs, windows, sentences, etc.) will be inspected to come up with features.

- Potential features:
  1) "CEO" is in the same sentence (should correctly identify people who are obviously CEOs)
  2) The word or phrase is longer than 3 characters (many of the stop words included in the potential CEO list are just words that start sentences, and they can be eliminated because they have only a few characters)
  3) I'm not sure - this might be good

3) A df will then be created with one row per candidate name and one column per feature.

4) Train a logistic regression model on half of the data.

5) Test the model on the other half of the data.
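Step 1 can be illustrated on a single sentence. The sketch below applies the two-word pattern used later in `potential_ceo_df` to a stop-word-stripped sentence shaped like the one about Diana Choyleva. The helper name `find_name_candidates` and the exact sentence text are assumptions made for this example.

```python
import re

# Same pattern as in potential_ceo_df: two consecutive capitalized words
NAME_PATTERN = re.compile(r'[A-Z]\w+ [A-Z]\w+')

def find_name_candidates(sentence):
    # findall scans left to right and never reuses a word, so a run of
    # five capitalized words yields two pairs and drops the fifth word
    return NAME_PATTERN.findall(sentence)

sentence = 'Diana Choyleva Lombard Street Research said symptomatic bigger problem .'
name_candidates = find_name_candidates(sentence)
print(name_candidates)
```

Both a real person and part of an institution name come back as candidates, which is why the features and the classifier are needed to separate them.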

### Creating df for classification

In [26]:
def cap_letters(message):
    caps = sum(1 for c in message if c.isupper())
    return caps;

In [27]:
def cap_in_sent(ls):
    sent_caps = sum(1 for c in ls if c.isupper())
    return sent_caps;

In [28]:
def sentence_words(ls):
    ceos = 0
    sens = 0
    pres = 0
    inv = 0
    aut = 0
    represent = 0
    ambass = 0
    secr = 0
    exp = 0
    spok = 0
    gov = 0
    part = 0
    found = 0
    
    if re.findall(r'CEO|ceo', ls) != []: 
        ceos = 1
    if re.findall(r'Senator|Sen\.', ls) != []:  # escaped dot: match "Sen." only, not "Sent" or "Send"
        sens = 1
    if re.findall(r'President', ls) != []: 
        pres = 1
    if re.findall(r'investor|Investor', ls) != []: 
        inv = 1
    if re.findall(r'author|Author', ls) != []: 
        aut = 1
    if re.findall(r'Representative|Rep\.', ls) != []:  # escaped dot: match "Rep." only, not "Repo" or "Report"
        represent = 1
    if re.findall(r'Ambassador|ambassador', ls) != []: 
        ambass = 1
    if re.findall(r'Secretary|secretary', ls) != []: 
        secr = 1
    if re.findall(r'Expert|expert', ls) != []: 
        exp = 1
    if re.findall(r'spokesman|spokeswoman|Spokesman|Spokeswoman', ls) != []: 
        spok = 1
    if re.findall(r'Governor|Gov\.', ls) != []:  # escaped dot: match "Gov." only, not "Government"
        gov = 1
    if re.findall(r'partner|Partner', ls) != []: 
        part = 1
    if re.findall(r'founder|Founder', ls) != []: 
        found = 1
        
    return ceos, sens, pres, inv, aut, represent, ambass, secr, exp, spok, gov, part, found;

In [29]:
def person_two_before(sent,phrase_in_sent):
    try:
        who_two_before = 0
        ceo_two_before = 0
        sen_two_before = 0
        pres_two_before = 0
        inv_two_before = 0
        aut_two_before = 0
        rep_two_before = 0
        amb_two_before = 0
        sec_two_before = 0
        exp_two_before = 0
        spoke_two_before = 0
        gov_two_before = 0
        part_two_before = 0
        found_two_before = 0
        
        sec_word = ''

        sent_split = re.split(r'[ ,.]', sent)  # inside [...] a "|" is a literal pipe, not "or"
        last_word = re.split(r'[ ]', phrase_in_sent)[0]

        if last_word in sent_split:
            word_index = sent_split.index(last_word)
            sec_word = sent_split[word_index-2].lower()
            if word_index-2 >= 0:
                if sec_word == 'who':
                    who_two_before = 1;
                if sec_word == 'ceo':
                    ceo_two_before = 1;
                if sec_word == 'senator' or sec_word == 'sen':
                    sen_two_before = 1;
                if sec_word == 'president':
                    pres_two_before = 1;
                if sec_word == 'investor':
                    inv_two_before = 1;
                if sec_word == 'author':
                    aut_two_before = 1;
                if sec_word == 'representative' or sec_word == 'rep':
                    rep_two_before = 1;
                if sec_word == 'ambassador':
                    amb_two_before = 1;
                if sec_word == 'secretary':
                    sec_two_before = 1;
                if sec_word == 'expert':
                    exp_two_before = 1;
                if sec_word == 'spokesman' or sec_word == 'spokeswoman':
                    spoke_two_before = 1;
                if sec_word == 'governor':
                    gov_two_before = 1;
                if sec_word == 'partner':
                    part_two_before = 1;
                if sec_word == 'founder':
                    found_two_before = 1;
                return who_two_before,ceo_two_before,sen_two_before,pres_two_before,inv_two_before,aut_two_before,rep_two_before,amb_two_before,sec_two_before,exp_two_before,spoke_two_before,gov_two_before,part_two_before,found_two_before;
            else:
                return who_two_before,ceo_two_before,sen_two_before,pres_two_before,inv_two_before,aut_two_before,rep_two_before,amb_two_before,sec_two_before,exp_two_before,spoke_two_before,gov_two_before,part_two_before,found_two_before;
    except IndexError:  
        return who_two_before,ceo_two_before,sen_two_before,pres_two_before,inv_two_before,aut_two_before,rep_two_before,amb_two_before,sec_two_before,exp_two_before,spoke_two_before,gov_two_before,part_two_before,found_two_before;

In [30]:
def person_one_before(sent,phrase_in_sent):
    try:
        who_one_before = 0
        ceo_one_before = 0
        sen_one_before = 0
        pres_one_before = 0
        inv_one_before = 0
        aut_one_before = 0
        rep_one_before = 0
        amb_one_before = 0
        sec_one_before = 0
        exp_one_before = 0
        spoke_one_before = 0
        gov_one_before = 0
        part_one_before = 0
        found_one_before = 0
        
        sec_word = ''

        sent_split = re.split(r'[ ,.]', sent)  # inside [...] a "|" is a literal pipe, not "or"
        last_word = re.split(r'[ ]', phrase_in_sent)[0]

        if last_word in sent_split:
            word_index = sent_split.index(last_word)
            sec_word = sent_split[word_index - 1].lower()
            if word_index - 1 >= 0:
                if sec_word == 'who':
                    who_one_before = 1;
                if sec_word == 'ceo':
                    ceo_one_before = 1;
                if sec_word == 'senator' or sec_word == 'sen':
                    sen_one_before = 1;
                if sec_word == 'president':
                    pres_one_before = 1;
                if sec_word == 'investor':
                    inv_one_before = 1;
                if sec_word == 'author':
                    aut_one_before = 1;
                if sec_word == 'representative' or sec_word == 'rep':
                    rep_one_before = 1;
                if sec_word == 'ambassador':
                    amb_one_before = 1;
                if sec_word == 'secretary':
                    sec_one_before = 1;
                if sec_word == 'expert':
                    exp_one_before = 1;
                if sec_word == 'spokesman' or sec_word == 'spokeswoman':
                    spoke_one_before = 1;
                if sec_word == 'governor':
                    gov_one_before = 1;
                if sec_word == 'partner':
                    part_one_before = 1;
                if sec_word == 'founder':
                    found_one_before = 1;
                return who_one_before,ceo_one_before,sen_one_before,pres_one_before,inv_one_before,aut_one_before,rep_one_before,amb_one_before,sec_one_before,exp_one_before,spoke_one_before,gov_one_before,part_one_before,found_one_before;
            else:
                return who_one_before,ceo_one_before,sen_one_before,pres_one_before,inv_one_before,aut_one_before,rep_one_before,amb_one_before,sec_one_before,exp_one_before,spoke_one_before,gov_one_before,part_one_before,found_one_before;
    except IndexError:  
        return who_one_before,ceo_one_before,sen_one_before,pres_one_before,inv_one_before,aut_one_before,rep_one_before,amb_one_before,sec_one_before,exp_one_before,spoke_one_before,gov_one_before,part_one_before,found_one_before;

In [31]:
def person_one_after(sent,phrase_in_sent):
    try:
        who_one_after = 0
        ceo_one_after = 0
        sen_one_after = 0
        pres_one_after = 0
        inv_one_after = 0
        aut_one_after = 0
        rep_one_after = 0
        amb_one_after = 0
        sec_one_after = 0
        exp_one_after = 0
        spoke_one_after = 0
        gov_one_after = 0
        part_one_after = 0
        found_one_after = 0
        
        fst_word = ''

        sent_split = re.split(r'[ ,.]', sent)  # inside [...] a "|" is a literal pipe, not "or"
        last_word = re.split(r'[ ]', phrase_in_sent)[1]

        if last_word in sent_split:
            word_index = sent_split.index(last_word)
            fst_word = sent_split[word_index+1].lower()

            if fst_word == 'who':
                who_one_after = 1;
            if fst_word == 'ceo':
                ceo_one_after = 1;
            if fst_word == 'senator' or fst_word == 'sen':
                sen_one_after = 1;
            if fst_word == 'president':
                pres_one_after = 1;
            if fst_word == 'investor':
                inv_one_after = 1;
            if fst_word == 'author':
                aut_one_after = 1;
            if fst_word == 'representative'or fst_word == 'rep':
                rep_one_after = 1;
            if fst_word == 'ambassador':
                amb_one_after = 1;
            if fst_word == 'secretary':
                sec_one_after = 1;
            if fst_word == 'expert':
                exp_one_after = 1;
            if fst_word == 'spokesman' or fst_word == 'spokeswoman':
                spoke_one_after = 1;
            if fst_word == 'governor':
                gov_one_after = 1;
            if fst_word == 'partner':
                part_one_after = 1;
            if fst_word == 'founder':
                found_one_after = 1;
        return who_one_after,ceo_one_after,sen_one_after,pres_one_after,inv_one_after,aut_one_after,rep_one_after,amb_one_after,sec_one_after,exp_one_after,spoke_one_after,gov_one_after,part_one_after,found_one_after;
    except IndexError:  
        return who_one_after,ceo_one_after,sen_one_after,pres_one_after,inv_one_after,aut_one_after,rep_one_after,amb_one_after,sec_one_after,exp_one_after,spoke_one_after,gov_one_after,part_one_after,found_one_after;

In [32]:
def person_two_after(sent,phrase_in_sent):
    try:
        who_two_after = 0
        ceo_two_after = 0
        sen_two_after = 0
        pres_two_after = 0
        inv_two_after = 0
        aut_two_after = 0
        rep_two_after = 0
        amb_two_after = 0
        sec_two_after = 0
        exp_two_after = 0
        spoke_two_after = 0
        gov_two_after = 0
        part_two_after = 0
        found_two_after = 0
        
        sec_word = ''

        sent_split = re.split(r'[ ,.]', sent)  # inside [...] a "|" is a literal pipe, not "or"
        last_word = re.split(r'[ ]', phrase_in_sent)[1]

        if last_word in sent_split:
            word_index = sent_split.index(last_word)
            sec_word = sent_split[word_index+2].lower()

            if sec_word == 'who':
                who_two_after = 1;
            if sec_word == 'ceo':
                ceo_two_after = 1;
            if sec_word == 'senator' or sec_word == 'sen':
                sen_two_after = 1;
            if sec_word == 'president':
                pres_two_after = 1;
            if sec_word == 'investor':
                inv_two_after = 1;
            if sec_word == 'author':
                aut_two_after = 1;
            if sec_word == 'representative'or sec_word == 'rep':
                rep_two_after = 1;
            if sec_word == 'ambassador':
                amb_two_after = 1;
            if sec_word == 'secretary':
                sec_two_after = 1;
            if sec_word == 'expert':
                exp_two_after = 1;
            if sec_word == 'spokesman' or sec_word == 'spokeswoman':
                spoke_two_after = 1;
            if sec_word == 'governor':
                gov_two_after = 1;
            if sec_word == 'partner':
                part_two_after = 1;
            if sec_word == 'founder':
                found_two_after = 1;
        return who_two_after,ceo_two_after,sen_two_after,pres_two_after,inv_two_after,aut_two_after,rep_two_after,amb_two_after,sec_two_after,exp_two_after,spoke_two_after,gov_two_after,part_two_after,found_two_after;
    except IndexError:  
        return who_two_after,ceo_two_after,sen_two_after,pres_two_after,inv_two_after,aut_two_after,rep_two_after,amb_two_after,sec_two_after,exp_two_after,spoke_two_after,gov_two_after,part_two_after,found_two_after;    

In [33]:
def ceo_word_in_sent(sent,phrase):
    try:
        two_before = person_two_before(sent,phrase)
        one_before = person_one_before(sent,phrase)
        one_after = person_one_after(sent,phrase)
        two_after = person_two_after(sent,phrase)

        who = two_before[0] + one_before[0] + one_after[0] + two_after[0]
        ceo_in_sent = two_before[1] + one_before[1] + one_after[1] + two_after[1]
        senator = two_before[2] + one_before[2] + one_after[2] + two_after[2]
        president = two_before[3] + one_before[3] + one_after[3] + two_after[3]
        investor = two_before[4] + one_before[4] + one_after[4] + two_after[4]
        author = two_before[5] + one_before[5] + one_after[5] + two_after[5]
        rep = two_before[6] + one_before[6] + one_after[6] + two_after[6]
        ambassador = two_before[7] + one_before[7] + one_after[7] + two_after[7]
        secretary = two_before[8] + one_before[8] + one_after[8] + two_after[8]
        expert = two_before[9] + one_before[9] + one_after[9] + two_after[9]
        spokesman = two_before[10] + one_before[10] + one_after[10] + two_after[10]
        governor = two_before[11] + one_before[11] + one_after[11] + two_after[11]
        partner = two_before[12] + one_before[12] + one_after[12] + two_after[12]
        founder = two_before[13] + one_before[13] + one_after[13] + two_after[13]

        return who,ceo_in_sent,senator, president, investor, author, rep, ambassador, secretary, expert, spokesman, governor, partner,founder;
    except TypeError:
        return np.zeros(14)
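The four position-specific functions above repeat the same checks for each offset. As a compact sketch of the same idea (not a drop-in replacement: it splits on whitespace only and inspects just the first occurrence of the phrase), one function can count title words inside a window around the candidate. `TITLE_WORDS` and `titles_near` are hypothetical names introduced for this illustration.

```python
# Hypothetical lookup: lowercase token -> feature name
TITLE_WORDS = {
    'who': 'who', 'ceo': 'ceo', 'senator': 'senator', 'sen': 'senator',
    'president': 'president', 'investor': 'investor', 'author': 'author',
    'representative': 'rep', 'rep': 'rep', 'ambassador': 'ambassador',
    'secretary': 'secretary', 'expert': 'expert', 'spokesman': 'spokesman',
    'spokeswoman': 'spokesman', 'governor': 'governor', 'partner': 'partner',
    'founder': 'founder',
}

def titles_near(sent, phrase, window=2):
    """Count title words within `window` tokens before or after `phrase`."""
    tokens = sent.split()
    words = phrase.split()
    counts = dict.fromkeys(TITLE_WORDS.values(), 0)
    for start in range(len(tokens) - len(words) + 1):
        if tokens[start:start + len(words)] == words:
            end = start + len(words)
            neighbours = tokens[max(0, start - window):start] + tokens[end:end + window]
            for tok in neighbours:
                name = TITLE_WORDS.get(tok.lower().rstrip('.'))
                if name is not None:
                    counts[name] += 1
            break  # only the first occurrence is inspected
    return counts

features = titles_near('BATS CEO Joe Ratterman told Wall Street', 'Joe Ratterman')
print(features['ceo'])
```

Returning a dictionary keyed by feature name avoids relying on tuple positions, and a phrase that is not found simply yields all zeros.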

In [34]:
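def potential_ceo_df(ls):
    # One row per candidate: [candidate, length, caps, ceo_near, sentence, sentence index]
    ceo_df = []
    for i in range(len(ls)):
        # Candidates are pairs of consecutive capitalized words
        p = re.findall(r'[A-Z]\w+ [A-Z]\w+', ls[i])
        if p != []:

            sent_caps = cap_in_sent(ls[i])
            sent_len = len(ls[i])

            for j in p:
                # Number of positions within two words of the candidate where "CEO" appears
                ceo_word = ceo_word_in_sent(ls[i],j)
                ceos = ceo_word[1]

                # Whether "CEO" appears anywhere in the sentence
                in_sent = sentence_words(ls[i])
                ceo_in_sent = in_sent[0]

                length = len(j)
                caps = cap_letters(j)
                ceo_df.append([j,length,caps,ceos,ls[i],i])
                #ceo_df.append([j,length,sent_len,caps,sent_caps,ceos,ceo_in_sent,ls[i],i])

    return ceo_df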

In [35]:
ceo_df = pd.DataFrame(potential_ceo_df(sentences), columns = ['Candidate', 'length', 'caps', 'ceo_near', 'Sentence', 'index'])


In [36]:
#ceo_df = pd.DataFrame(potential_ceo_df(sentences), columns = ['Candidate', 'length', 'sent_len', 'caps', 'sent_caps', 'ceo_near','ceo_in_sent', 'Sentence', 'index'])

In [37]:
sentences[3]

'liquidity squeeze China first began ahead Dragon Boat festival earlier month .'

In [38]:
ceo_df

Unnamed: 0,Candidate,length,caps,ceo_near,Sentence,index
0,Dragon Boat,11,2,0.0,liquidity squeeze China first began ahead Drag...,3
1,Diana Choyleva,14,2,0.0,Diana Choyleva Lombard Street Research said sy...,5
2,Lombard Street,14,2,0.0,Diana Choyleva Lombard Street Research said sy...,5
3,Bank America,12,2,0.0,new note clients Bank America 's Ting Lu wrote...,7
4,Ting Lu,7,2,0.0,new note clients Bank America 's Ting Lu wrote...,7
5,Bank China,10,2,0.0,previously explained People 's Bank China seem...,10
6,China Banking,13,2,0.0,also comes time banks required meet loan-to-de...,12
7,Regulatory Commission,21,2,0.0,also comes time banks required meet loan-to-de...,12
8,Charlene Chu,12,2,0.0,Earlier week Fitch 's Charlene Chu warned Chin...,13
9,Lehman China,12,2,0.0,SHIBOR 25 % basically means functioning interb...,16


## CEO Logistic Regression

In [39]:
sum(ceo_df['ceo_near'])

2314.0

In [40]:
labels = []
values = ceo['ceo_full'].values

for i in range(len(ceo_df)):
    if ceo_df.loc[i,'Candidate'] in values:
        labels.append(1)
    else: 
        labels.append(0) 
#ceo_df['label'] = labels
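The labelling loop above is a membership test of each candidate against the known CEO names. A minimal sketch of the same logic with a set, on a few toy values taken from the tables shown earlier, follows. The function name `label_candidates` and the toy lists are assumptions for illustration.

```python
def label_candidates(candidates, known_names):
    # A set gives constant-time lookups, which matters with hundreds of
    # thousands of candidates
    known = set(known_names)
    return [1 if name in known else 0 for name in candidates]

known_ceos = ['Tom Horton', 'Patti Hart', 'Jamie Dimon', 'Steve Cohen', 'Tim Cook']
toy_candidates = ['Dragon Boat', 'Jamie Dimon', 'Bank America', 'Tim Cook']
toy_labels = label_candidates(toy_candidates, known_ceos)
print(toy_labels)
```

On the real data, pandas offers the same test in vectorized form through `Series.isin`, for example `ceo_df['Candidate'].isin(ceo['ceo_full']).astype(int)`.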

In [41]:
Xceo, yceo = ceo_df.drop(['Sentence','index','Candidate'], axis=1), range(len(ceo_df))

In [42]:
Xceo = StandardScaler().fit_transform(Xceo)
Xceo


array([[-0.7138046 , -0.35068002, -0.07121519],
       [ 0.13183124, -0.35068002, -0.07121519],
       [ 0.13183124, -0.35068002, -0.07121519],
       ...,
       [ 0.97746707, -0.35068002, -0.07121519],
       [ 0.69558846,  2.40041678, -0.07121519],
       [ 0.69558846,  2.40041678, -0.07121519]])

In [43]:
Xceo_train, Xceo_test, yceo_train, yceo_test = train_test_split(Xceo, yceo, test_size=0.5, random_state=42)

In [44]:
Xceo_train

array([[-0.43192599, -0.35068002, -0.07121519],
       [-1.27756183,  0.33709418, -0.07121519],
       [-0.43192599, -0.35068002, -0.07121519],
       ...,
       [ 0.13183124, -0.35068002, -0.07121519],
       [-0.99568321, -0.35068002, -0.07121519],
       [ 1.25934569,  0.33709418, -0.07121519]])

In [45]:
ceo_train_label = np.zeros(len(yceo_train))
j=0
for i in yceo_train:
    ceo_train_label[j] = labels[i]
    j = j+1

In [46]:
sum(ceo_train_label)

7710.0

In [47]:
yceo_train

[239876,
 377132,
 106993,
 194570,
 411388,
 ...]


In [48]:
Xceo_test

array([[ 2.10498152, -0.35068002, -0.07121519],
       [ 0.41370985, -0.35068002, -0.07121519],
       [-1.84131905,  2.40041678, -0.07121519],
       ...,
       [ 0.69558846, -0.35068002, -0.07121519],
       [ 0.41370985, -0.35068002, -0.07121519],
       [ 0.13183124, -0.35068002, -0.07121519]])

In [49]:
yceo_test

[93664,
 38482,
 320969,
 150811,
 371176,
 ...]


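# Earlier DataFrame-based version, kept for reference. It does not run as written
# because Xceo_train is a NumPy array after scaling (no .iloc or column names).
#ceo_features = Xceo_train.iloc[:,Xceo_train.columns != 'label']
#ceo_label = Xceo_train['label']

#ceo_log = LogisticRegression()
#ceo_log.fit(ceo_features, ceo_label)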

In [50]:
ceo_features = Xceo_train
ceo_label = ceo_train_label

ceo_log = LogisticRegression()
ceo_log.fit(ceo_features, ceo_label)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [51]:
ceo_pred_feat = Xceo_test
ceo_pred = ceo_log.predict(ceo_pred_feat)

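# Earlier DataFrame-based version, kept for reference (Xceo_test is a NumPy array, so .iloc is unavailable).
#ceo_pred_feat = Xceo_test.iloc[:,Xceo_test.columns != 'label']
#ceo_pred = ceo_log.predict(ceo_pred_feat)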

In [52]:
sum(ceo_pred)

828.0

In [53]:
#print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(ceo_log.score(ceo_pred_feat, Xceo_test['label'])))
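True CEOs appear to be a small minority of the candidates, so accuracy alone can look high even when few CEOs are found. Precision and recall are more informative here. The sketch below computes them by hand on toy labels so the arithmetic is visible. `precision_recall` and the toy lists are illustrative and not part of the notebook; on the real data, `metrics.precision_score` and `metrics.recall_score` from scikit-learn compute the same quantities.

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: 2 real positives among 8 candidates, 2 predicted positives
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy case accuracy is 0.75 while precision and recall are both 0.5, which shows how accuracy can overstate performance on rare classes.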

In [54]:
ceo_full = Xceo
ceo_pred_full = ceo_log.predict(ceo_full)
sum(ceo_pred_full)

1632.0

In [140]:
ceo_df['pred'] = ceo_pred_full
ceo_final = ceo_df[ceo_df['pred']==1]
ceo_final = ceo_final.reset_index(drop=True)
CEOs = list(ceo_final['Candidate'])
set(CEOs)

{'Aaron Levie',
 'Aaron Regent',
 'Abigail Johnson',
 'According Nanex',
 'Ackman Valeant',
 'Advisory Group',
 'Aer Lingus',
 'Afghan United',
 'Airbus Group',
 'Alan Breed',
 'Alan Joyce',
 'Alan Mulally',
 'Alan Mulallyis',
 'Alan Mullaly',
 'Aleksey Miller',
 'Alex Algard',
 'Alexei Miller',
 'Allen Questrom',
 'America Founding',
 'American Apparel',
 'American Eagle',
 'American Express',
 'Analyst Earnings',
 'Andersen Tax',
 'Anderson Real',
 'Andrei Bugrov',
 'Andrei Cherny',
 'Andrei Kostin',
 'Andy Grove',
 'Angela Ahrendts',
 'Angelo Mozilo',
 'Anglo Irish',
 'Antonio Horta',
 'Antony Jenkins',
 'Apple Inc',
 'Apple Pay',
 'Ari Reichental',
 'Armstrong Fired',
 'Art Levinson',
 'Asia Pacific',
 'Asset Management',
 'Auto Nation',
 'Avishai Abrahami',
 'Bank America',
 'Bank American',
 'Barry Norris',
 'Barry Ritholtz',
 'Barry Silbert',
 'Bay Capital',
 'Beach Business',
 'Bear Stearns',
 'Beats Music',
 'Beauty Wealth',
 'Belus Capital',
 'Ben Milne',
 'Ben Verwaayen',
 '

In [135]:
finalCEO = set(CEOs)
finalCEO = pd.DataFrame(finalCEO)
finalCEO.to_csv("ExtractedCEOs.csv",header=False,index=False)

In [56]:
ceo_final

Unnamed: 0,Candidate,length,caps,ceo_near,Sentence,index,pred
0,Joe Ratterman,13,2,1.0,BATS CEO Joe Ratterman told Wall Street Journa...,1157,1.0
1,Square Capital,14,2,1.0,"'s full release Pershing : William A. Ackman ,...",1354,1.0
2,Former Citigroup,16,2,1.0,Former Citigroup CEO Sanford Sandy Weill said ...,1980,1.0
3,Sandy Weill,11,2,1.0,Former Citigroup CEO Sanford Sandy Weill said ...,1980,1.0
4,Bank America,12,2,1.0,Bank America CEO Brian T. Moynihan said hes co...,1988,1.0
5,Warren Buffett,14,2,1.0,Sokol Berkshire Hathaway subsidiary executive ...,2144,1.0
6,Capital Advisors,16,2,1.0,"Brian Sozzi , CEO Belus Capital Advisors point...",3379,1.0
7,Mike Ullman,11,2,1.0,meddling JCP CEO Mike Ullman focused beginning...,3401,1.0
8,Capital Advisors,16,2,1.0,`` 's still core Baby Boomer customer change b...,3408,1.0
9,Ron Johnson,11,2,1.0,began Pershing Square Capital 's Bill Ackman m...,3600,1.0


## Companies

Candidate features: length, corp/corporation/group/holding/inc in the phrase or sentence, "company" or "stock" in the sentence, stop words, position at the beginning of the sentence, number of words, "profit", plural.

In [57]:
def company_in_sentence(sentence):
    ret = 0
    if re.search(r'company', sentence.lower()) != None:
        ret = 1
    return ret

In [58]:
def stock_in_sentence(sentence):
    ret = 0
    if re.search(r'stock', sentence.lower()) != None:
        ret = 1
    return ret

In [59]:
def shares_in_sentence(sentence):
    ret = 0
    if re.search(r'share', sentence.lower()) != None:
        ret = 1
    return ret

In [60]:
def trade_in_sentence(sentence):
    ret = 0
    if re.search(r'trad', sentence.lower()) != None:
        ret = 1
    return ret

### Company Specific 

In [61]:
def length_of_company(item):
    return len(item)

In [62]:
def plural_word(item):
    plural = 0
    if item[len(item) - 1] == 's':
        plural = 1
    return plural

In [63]:
def number_of_words(words):
    return len(words)

In [64]:
def location_at_start(sentence, item):
    # 1 if the candidate phrase opens the sentence, else 0.
    # A plain string test avoids treating the candidate as a regular expression.
    return 1 if sentence.startswith(item) else 0

In [65]:
def company_words(word_phrase):
    corp = 0
    corporation = 0
    group = 0
    holding = 0
    inc = 0
    company = 0
    association = 0
    foundation = 0

    for word in word_phrase:
        if word == "Corp" or word == 'Corp.' or word == 'Corporation':
            corp = 1;
        if word == "Group":
            group = 1;
        if word == "Holding":
            holding = 1;
        if word == "Inc" or word == "Inc.":
            inc = 1;
        if word == "Company":
            company = 1;
        if word == "Association":
            association = 1;
        if word == "Foundation":
            foundation = 1;

    return corp, group, holding, inc, company, association, foundation

In [66]:
def feature_creator_companies(sentences):
    candidates = []
    for i in range(len(sentences)):
        x = re.findall(r'(([A-Z][A-Za-z0-9]+[ -]?)+)', sentences[i])
        extract = [i[0] for i in x]
        if extract != []:
            comp_in_sent = company_in_sentence(sentences[i])
            #stock = stock_in_sentence(sentences[i])
            shares = shares_in_sentence(sentences[i])
            #trade = trade_in_sentence(sentences[i]) 
            for j in extract:
                
                new_j = j
                if new_j[-1] == ' ':
                    new_len = len(new_j)-1
                    new_j = new_j[0:new_len]
                
                words = re.split(r'[ ]', new_j)
                #length = length_of_company(item)
                plural = plural_word(new_j)
                number_words = number_of_words(words)
                location = location_at_start(sentences[i], new_j)
                comp = company_words(words)
                corp = comp[0]
                group = comp[1]
                holding = comp[2]
                inc = comp[3]
                company = comp[4]
                association = comp[5]
                foundation = comp[6]
                candidates.append([new_j, comp_in_sent, shares, plural, number_words,location, corp, group, holding, inc, company,association, foundation,sentences[i],i])
                #candidates.append([item,company,shares,length,plural,number_words,location,corp,group,holding,inc,company,association,foundation])
    return candidates
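
The candidate step above can be tried in isolation. What follows is a minimal sketch rather than a cell from the original run: `extract_candidates` is a hypothetical helper that applies the same regular expression to a single sentence and drops the trailing space, as the loop above does. The second sample sentence is made up for illustration.

```python
import re

# Same pattern as in feature_creator_companies: runs of capitalized tokens
# separated by a single space or hyphen.
CANDIDATE_RE = re.compile(r'(([A-Z][A-Za-z0-9]+[ -]?)+)')

def extract_candidates(sentence):
    """Hypothetical helper: return the capitalized runs found in one sentence."""
    # findall returns one tuple per match; element 0 is the full run
    runs = [m[0] for m in CANDIDATE_RE.findall(sentence)]
    # A run followed by a lowercase word ends with the separator space; drop it
    return [r[:-1] if r.endswith(' ') else r for r in runs]

print(extract_candidates("Meanwhile , one-day rate hit record 12.85 % ."))
print(extract_candidates("Shares of Altria Group Inc rose ."))
```

Sentence-initial words such as "Meanwhile" and "Shares" come back alongside real names, which is why the `location` feature and the classifier are still needed downstream.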

In [67]:
comp_df = pd.DataFrame(feature_creator_companies(sentences), columns = ['Candidate','comp_in_sent','shares', 'plural', 'number_words','location' , 'corp', 'group', 'holding', 'inc', 'company', 'association','foundation','sentence','index'])


In [68]:
#comp_df = pd.DataFrame(feature_creator_companies(sentences), columns = ['Candidate', 'company', 'stock', 'shares', 'trade', 'length', 'plural', 'number_words','location', 'corp', 'corporation', 'group', 'holding', 'inc', 'company', 'association','foundation'])

In [69]:
#ex1 = []
#for i in range(len(comp_df)):
#    if comp_df.iloc[i,0][-1] == ' ':
#        st = len(comp_df.iloc[0,0])-1
#        ex1.append(comp_df.iloc[0,0][0:st])
        
#    else:
#        ex1.append(comp_df.iloc[i,0])

In [70]:
comp_df

Unnamed: 0,Candidate,comp_in_sent,shares,plural,number_words,location,corp,group,holding,inc,company,association,foundation,sentence,index
0,ReutersChina,0,0,0,1,1,0,0,0,0,0,0,0,ReutersChina 's seven day repo rose record hig...,0
1,Shanghai,0,0,0,1,0,0,0,0,0,0,0,0,ReutersChina 's seven day repo rose record hig...,0
2,March,0,0,0,1,0,0,0,0,0,0,0,0,ReutersChina 's seven day repo rose record hig...,0
3,Bloomberg,0,0,0,1,0,0,0,0,0,0,0,0,ReutersChina 's seven day repo rose record hig...,0
4,Meanwhile,0,0,0,1,1,0,0,0,0,0,0,0,"Meanwhile , one-day rate hit record 12.85 % .",1
5,Zerohedge,0,0,0,1,1,0,0,0,0,0,0,0,Zerohedge reported overnight repo hit 25 % .,2
6,China,0,0,0,1,0,0,0,0,0,0,0,0,liquidity squeeze China first began ahead Drag...,3
7,Dragon Boat,0,0,0,2,0,0,0,0,0,0,0,0,liquidity squeeze China first began ahead Drag...,3
8,Spikes,0,0,1,1,1,0,0,0,0,0,0,0,Spikes interbank rates common right holidays .,4
9,Diana Choyleva Lombard Street Research,0,0,0,5,1,0,0,0,0,0,0,0,Diana Choyleva Lombard Street Research said sy...,5


### Logistic Regression for Companies

In [71]:
comp_labels = []
values = set(company['company'].values)
candidates = comp_df['Candidate'].tolist()

for i in range(len(comp_df)):
    if candidates[i] in values:
        comp_labels.append(1)
    else: 
        comp_labels.append(0)
#comp_df['label'] = comp_labels

In [72]:
Xcomp, ycomp = comp_df.drop(['Candidate','sentence','index'], axis=1), range(len(comp_df))

In [73]:
Xcomp.sum(axis=0)

comp_in_sent      50327
shares            38788
plural           151365
number_words    1771426
location         311224
corp               1680
group              2650
holding             223
inc                2792
company            1306
association         781
foundation          471
dtype: int64

In [74]:
Xcomp = StandardScaler().fit_transform(Xcomp)
Xcomp


array([[-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       ...,
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071]])

In [75]:
Xcomp_train, Xcomp_test, ycomp_train, ycomp_test = train_test_split(Xcomp, ycomp, test_size=0.5, random_state=42)
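
The split above passes row indices as `y` and recovers the labels afterwards. As a hedged alternative sketch, `train_test_split` can also take the label array directly, and its `stratify` argument keeps the share of positives the same in both halves. The arrays below are toy stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for Xcomp and comp_labels: 20 rows, 4 positives
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([1, 0, 0, 0, 0] * 4)

# Splitting features and labels together keeps them aligned;
# stratify=y preserves the share of positives in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(X_train.shape, y_train.sum(), y_test.sum())
```

Stratifying matters most when positives are rare, since an unstratified split can leave one half with noticeably fewer of them.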

In [76]:
Xcomp_train.shape

(574498, 12)

In [77]:
ycomp_train

[710412,
 532888,
 391930,
 49110,
 528853,
 463705,
 563265,
 1000199,
 233011,
 913752,
 143217,
 49576,
 193373,
 85759,
 392167,
 1091397,
 938638,
 831208,
 1146637,
 889347,
 876288,
 470689,
 827657,
 198707,
 16912,
 1108395,
 472964,
 1087258,
 792982,
 638458,
 78386,
 737830,
 55183,
 638606,
 966485,
 486176,
 539546,
 764629,
 169196,
 448023,
 913971,
 691058,
 530609,
 111408,
 325369,
 44625,
 681459,
 1094427,
 218190,
 874237,
 801300,
 90871,
 864732,
 1085869,
 771297,
 15392,
 743203,
 557999,
 961682,
 451829,
 1114421,
 85882,
 128204,
 80575,
 55010,
 320165,
 80918,
 910992,
 389115,
 108353,
 1073987,
 24425,
 518027,
 449690,
 743095,
 457700,
 1098803,
 526903,
 972714,
 913373,
 627476,
 792210,
 154663,
 278082,
 89925,
 643040,
 188963,
 208637,
 816051,
 533765,
 866785,
 1133070,
 1021273,
 846841,
 993970,
 167852,
 226374,
 73530,
 571642,
 866488,
 343906,
 781917,
 977440,
 268487,
 43215,
 728581,
 561188,
 579189,
 303283,
 392469,
 878122,
 67302

In [78]:
comp_train_label = np.zeros(len(ycomp_train))
j=0
for i in ycomp_train:
    comp_train_label[j] = comp_labels[i]
    j = j+1
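
The lookup loop above can also be written with NumPy integer-array indexing. This is a small sketch on toy values; the variable names mirror the notebook's, but the data is made up.

```python
import numpy as np

# Toy stand-ins: labels for five candidates and the row indices of a training split
comp_labels = [0, 1, 0, 1, 1]
ycomp_train = [4, 0, 3]

# Integer-array indexing picks the label for each training row in one step
comp_train_label = np.asarray(comp_labels, dtype=float)[ycomp_train]
print(comp_train_label)
```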

In [79]:
sum(comp_train_label)

51498.0

In [80]:
Xcomp_test

array([[-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       ...,
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071],
       [-0.21402627, -0.18691602, -0.38951821, ..., -0.03373334,
        -0.02608038, -0.02025071]])

In [81]:
ycomp_test

[821270,
 81095,
 1132602,
 89122,
 1052210,
 344903,
 703192,
 53185,
 849452,
 1097214,
 362447,
 863181,
 1094759,
 854001,
 185669,
 236283,
 794913,
 68882,
 695763,
 784485,
 577514,
 806829,
 1034519,
 291816,
 114465,
 77787,
 425988,
 1131100,
 839497,
 350542,
 401833,
 665157,
 989649,
 307405,
 59439,
 432670,
 893520,
 657366,
 270179,
 164827,
 56417,
 514314,
 56896,
 268310,
 449919,
 170238,
 1131900,
 422190,
 392199,
 379940,
 1017417,
 1028905,
 689408,
 468268,
 898958,
 231305,
 424378,
 876298,
 891146,
 94010,
 1145837,
 127353,
 284736,
 694662,
 276536,
 935629,
 40980,
 928229,
 45247,
 758726,
 162891,
 127721,
 1068710,
 774588,
 680722,
 285297,
 918220,
 1074253,
 515804,
 1018784,
 68513,
 870726,
 1070532,
 1096991,
 683104,
 700028,
 293527,
 418013,
 708996,
 825253,
 575830,
 116929,
 1074780,
 312584,
 586354,
 234045,
 22907,
 1067812,
 76855,
 727890,
 879189,
 512323,
 11696,
 907628,
 142988,
 720907,
 1145648,
 834355,
 415110,
 310974,
 285066

In [82]:
comp_features = Xcomp_train
comp_label = comp_train_label

comp_log = LogisticRegression()
comp_log.fit(comp_features, comp_label)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [83]:
#comp_features = Xcomp_train.iloc[:,Xcomp_train.columns != 'label']
#comp_label = Xcomp_train['label']

#comp_log = LogisticRegression()
#comp_log.fit(comp_features, comp_label)

In [84]:
comp_pred_feat = Xcomp_test
comp_pred = comp_log.predict(comp_pred_feat)

In [85]:
#comp_pred_feat = Xcomp_test.iloc[:,Xcomp_test.columns != 'label']
#comp_pred = comp_log.predict(comp_pred_feat)

In [86]:
sum(comp_pred)

1622.0

In [87]:
comp_pred_feat = Xcomp
comp_pred = comp_log.predict(comp_pred_feat)

In [88]:
comp_df.columns

Index(['Candidate', 'comp_in_sent', 'shares', 'plural', 'number_words',
       'location', 'corp', 'group', 'holding', 'inc', 'company', 'association',
       'foundation', 'sentence', 'index'],
      dtype='object')

In [89]:
comp_log.coef_

array([[ 1.33999109e-01,  1.38053127e-01,  6.09233786e-02,
        -5.89570648e-01, -1.87495505e-01,  1.28011338e-01,
         1.33584563e-01, -2.24028093e-04,  1.28348328e-01,
        -5.38849459e-02, -1.48072818e-01, -2.78348543e-02]])

In [90]:
#comp_pred_feat = Xcomp.iloc[:,Xcomp.columns != 'label']
#comp_pred = comp_log.predict(comp_pred_feat)

In [91]:
sum(comp_pred)

3174.0

In [92]:
#print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(comp_log.score(comp_pred_feat, Xcomp_test['label'])))
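
The commented-out accuracy line refers to a `label` column that `Xcomp_test` does not have. Positives are also rare here (51,498 of the 574,498 training rows), so accuracy alone says little. The sketch below shows how precision and recall could be computed with `sklearn.metrics`; the two arrays are toy stand-ins, not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# Toy stand-ins for the true test labels and the model's test predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall)
print(confusion)
```

In the notebook, the true labels for the test half would be `np.asarray(comp_labels)[ycomp_test]`, and the predictions would need to come from `comp_log.predict(Xcomp_test)`, because by this point `comp_pred` holds predictions for all rows.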

In [93]:
comp_df['pred'] = comp_pred
comp_final = comp_df[comp_df['pred']==1]
comp_final = comp_final.reset_index(drop=True)
comps = list(comp_final['Candidate'])
set(comps)

{'Gulfstream Aerospace Corp',
 'Bombardier Inc',
 'McCain Group',
 'Witkoff Group',
 'Espirito Santo Financial Group',
 'HCA Inc',
 'BNY Mellon Corp',
 'Spirit Group',
 'Chrysler Group',
 'Centennial Group',
 'NASDAQ OMX Group',
 'TNK-BP Alfa Group',
 'Clearing Corporation',
 'Hormel Foods Corp Kraft Foods Group Inc',
 'Delhaize Group',
 'Development Corp',
 'Exelon Corp',
 'Kraft Kraft Foods Group Inc Mondelez International Inc',
 'Proto Labs Inc',
 'Schork Group Inc',
 'Enron Corporation',
 'Autodata Corporation',
 'Webbmedia Group',
 'Altimeter Group',
 'Altria Group Inc',
 'Group CEO',
 'TABB Group',
 'Burroughs Corp',
 'Kennedy Group',
 'United Technologies Corp',
 'Mena Corp',
 'Toshiba Corp',
 'Sina Corp',
 'FileWeibo Corp',
 'Starbucks Corporation',
 'Howard Hughes Corporation',
 'AMC Networks Inc',
 'Nokia Corporation',
 'Zhuhai Zhenrong Corp',
 'Geo Group',
 'Illumina Corp Equinix',
 'Intesa Sanpaolo Group',
 'Rosemount Inc',
 'Bilderberg Group',
 'Lindsay Corp',
 'Hess Corpo

In [141]:
finalCompany = set(comps)
finalCompany = pd.DataFrame(finalCompany)
finalCompany.to_csv("ExtractedCompanies.csv",header=False,index=False)

In [94]:
#comp_final = pd.DataFrame(np.concatenate((Xcomp,np.array(comp_pred)[:,None]),axis=1))
#comp_final = comp_final[comp_final.iloc[:,13]==1]
#comp_final = comp_final.reset_index()

#Companies = []
#for i in comp_final['index']:
#    Companies.append(comp_df.loc[i,'Candidate'])
#Companies = list(comp_final['Candidate'])
#Companies

## predict_proba outputs the probabilities of each class for logistic regression

# take a set of the features so that you do not have repeats
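
The two notes above can be sketched together: `predict_proba` returns one probability per class, and a set removes repeated candidates. This is an illustration on toy data under assumptions; the candidate names and the 0.3 threshold are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one binary feature that is mostly, but not perfectly, predictive
X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
# Made-up candidate strings, some repeated
names = ['Alpha', 'Beta', 'Gamma', 'Delta Corp',
         'Delta Corp', 'Echo Inc', 'Echo Inc', 'Foxtrot']

clf = LogisticRegression().fit(X, y)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
proba = clf.predict_proba(X)

# A threshold other than the default 0.5 trades precision for recall
threshold = 0.3  # hypothetical value
picked = {name for name, p in zip(names, proba[:, 1]) if p >= threshold}
print(sorted(picked))
```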

## Percentages

In [95]:
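def percent_after(sent,num):
    # Returns 1 if num is followed by "percent", "percentage" or a % sign in sent, else 0
    try:
        split = re.split(r'[ ]', sent)
        if num in split:
            num_index = split.index(num)
            nxt = split[num_index+1].lower()
            if nxt == 'percentage' or nxt == "percent":
                return 1
        # re.escape keeps the '.' in numbers such as 10.77 from matching any character;
        # only the first occurrence of num in the sentence is checked
        char_index = re.search(re.escape(num), sent.lower()).start() + len(num)
        # Covers both "10%" and "10 %"
        if sent[char_index] == '%' or sent[char_index+1] == '%':
            return 1
    except IndexError:
        # num is the last token or sits at the very end of the sentence
        pass
    return 0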
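
For comparison, a single regular expression can pick up a digit-based percentage together with its unit. This is a hedged sketch of an alternative, not the approach used in this notebook: `find_percentages` is a hypothetical helper, it ignores number words such as "eight", and the second sample sentence is made up.

```python
import re

# A number followed by "%", "percent" or "percentage", with an optional space between
PERCENT_RE = re.compile(r'(\d*\.?\d+)\s?(%|percent(?:age)?\b)', re.IGNORECASE)

def find_percentages(sentence):
    """Hypothetical helper: return (number, unit) pairs found in one sentence."""
    return PERCENT_RE.findall(sentence)

print(find_percentages("Meanwhile , one-day rate hit record 12.85 % ."))
print(find_percentages("Growth was 5 percent in 2003 ."))
```

The year 2003 in the second sentence is skipped because no unit follows it.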

In [96]:
def greater_than_1800(num):
    try:
        year = 0
        num = int(num)
        
        if num > 1800: year = 1;
        else: year = 0;
    except ValueError: pass
    return year;

In [97]:
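def feature_creator_percent(ls):
    numbers = []
    for i in range(len(ls)):
        lowered = ls[i].lower()
        # Digits, with or without a decimal point (12, 0.5, .5)
        re1 = re.findall(r'\d*\.?\d+', ls[i])
        # Number words from zero to nineteen and "one hundred"; there are no word
        # boundaries, so "one" is also found inside words such as "money"
        re2 = re.findall(r'one[\s-]?hundred|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen', lowered)
        # Tens with an optional units word, e.g. "twenty" or "forty-two"
        re3 = re.findall(r'((twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)(\s|-)?(one|two|three|four|five|six|seven|eight|nine)?)', lowered)
        # Keep the full match and drop the space left by the optional separator
        re3 = [m[0].strip() for m in re3]
        extract = re1 + re2 + re3
        for item in extract:
            year = greater_than_1800(item)
            perc = percent_after(ls[i],item)
            numbers.append([item, year, perc,i])
    return numbers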

In [98]:
numbers = pd.DataFrame(feature_creator_percent(sentences), columns = ['numbers','year','perc','sentence'])

In [99]:
numbers.head()

Unnamed: 0,numbers,year,perc,sentence
0,10.77,0,1,0
1,2003,1,0,0
2,seven,0,0,0
3,12.85,0,1,1
4,one,0,0,1


### Logistic Regression for Percentages

In [100]:
#values = percent['perc'].values

In [101]:
#def remove_perc(ls):
#    nums = []
#    for i in range(len(ls)):
#        if ls[i][-1] == '%':
#            ls[i] = ls[i].replace("%", "")
#            nums.append(ls[i])
        
#        elif ls[i][(len(ls[i])-len('percent')):len(ls[i])] == 'percent':
#            ls[i] = ls[i].replace(" percent", "")
#            nums.append(ls[i])
        
#        elif values[i][(len(ls[i])-len('percentage')):len(ls[i])] == 'percentage':
#            ls[i] = ls[i].replace(" percentage", "")
#            nums.append(ls[i])
        
#        else:
#            nums.append(ls[i])
            
#    return nums;

In [102]:
#percents = set(remove_perc(values))

### Manually creating a test set of 200 data points

I had to do this because there was no logical way to compare my extracted numbers to the percentage csv file.

In [103]:
#labels=[]
#candidates = numbers['numbers'].tolist()

#for i in range(len(candidates)):
#    if candidates[i] in percents:
#        labels.append(1)
#    else: 
#        labels.append(0) 
#numbers['label'] = labels

In [104]:
train_perc = pd.read_csv("/Users/charlesmarshall/Desktop/IEMS 308/Project 3/all/train_label.csv", engine = "python", names = ['train'])

In [105]:
train_perc.head()

Unnamed: 0,train
0,1
1,0
2,0
3,1
4,0


In [106]:
# Copy the slice so that adding a column does not write to a view of numbers
Xperc_train = numbers[0:100].copy()
Xperc_train['label'] = train_perc['train']
Xperc_train = Xperc_train.drop(['numbers','sentence'], axis=1)


In [107]:
yperc_train = list(range(len(Xperc_train)))

In [108]:
#Xperc, yperc = numbers.drop(['numbers','sentence'], axis=1), range(len(numbers))

In [109]:
#Xperc

In [110]:
#Xperc_train, Xperc_test, yperc_train, yperc_test = train_test_split(Xperc, yperc, test_size=0.5, random_state=42)

In [111]:
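# index[0:101] drops rows 0-100, so row 100 is in neither the training set (rows 0-99) nor the test set
Xperc_test = numbers.drop(numbers.index[0:101],axis=0)
Xperc_test = Xperc_test.drop(['numbers','sentence'],axis=1)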

In [112]:
yperc_train = Xperc_train.index.tolist()

In [113]:
Xperc_train

Unnamed: 0,year,perc,label
0,0,1,1
1,1,0,0
2,0,0,0
3,0,1,1
4,0,0,0
5,0,1,1
6,0,0,0
7,0,1,1
8,0,0,0
9,0,0,0


In [114]:
yperc_train

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99]

In [115]:
Xperc_test

Unnamed: 0,year,perc
101,1,0
102,0,1
103,0,0
104,0,0
105,0,1
106,0,0
107,0,0
108,0,0
109,0,0
110,0,0


In [116]:
yperc_test = Xperc_test.index.tolist()

In [117]:
perc_features = Xperc_train.iloc[:,Xperc_train.columns != 'label']
perc_label = Xperc_train['label']

perc_log = LogisticRegression()
perc_log.fit(perc_features, perc_label)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [118]:
perc_pred_feat = Xperc_test
perc_pred = perc_log.predict(perc_pred_feat)

In [119]:
sum(perc_pred)

75511

In [120]:
Xperc = numbers.drop(['numbers','sentence'],axis=1)
perc_pred_full = perc_log.predict(Xperc)
sum(perc_pred_full)

75544

In [121]:
numbers['pred'] = perc_pred_full
perc_df = numbers[numbers['pred']==1]
perc_df = perc_df.reset_index(drop=True)
sentences[perc_df.iloc[0,3]]
perc_df.iloc[0,0]

'10.77'

In [122]:
def extract_percentages(sent,num):
    # Locate the first occurrence of the number; re.escape keeps '.' literal
    char_index = re.search(re.escape(num), sent.lower()).start() + len(num)
    # "10%" and "10 %": the sign directly after the number or one character later.
    # Slices are used so that a number at the end of the sentence does not raise IndexError
    if sent[char_index:char_index+1] == '%':
        return num + '%'
    if sent[char_index+1:char_index+2] == '%':
        return num + ' %'

    # "10 percent" / "10 percentage": the unit is the next space-separated token
    split = re.split(r'[ ]', sent)
    if num in split:
        num_index = split.index(num)
        if num_index + 1 < len(split):
            nxt = split[num_index+1].lower()
            if nxt == 'percentage' or nxt == 'percent':
                return num + ' ' + nxt
    # No unit found: return the bare number
    return num

In [123]:
percentages = []
for i in range(len(perc_df)):
    sent = sentences[perc_df.iloc[i,3]]
    num = perc_df.iloc[i,0]
    percentages.append(extract_percentages(sent,num))

In [142]:
set(percentages)

{'759 %',
 '0.6603 %',
 '14.47 percent',
 '62.7 %',
 '5.44 %',
 '3.90 %',
 '21.3 percent',
 '5.20 percent',
 '3050 %',
 '126 %',
 '19.0 %',
 '7.04 percent',
 '0.500 percent',
 '10.17 %',
 '18.8 percent',
 '1.1 percent',
 '46.5 percent',
 '0.28 percentage',
 '3.41 %',
 '14.6 percent',
 '6.584 %',
 '1.33 percentage',
 '.1 percent',
 '67.3 %',
 '4.10 percent',
 '3.16 percent',
 '72.0 %',
 '2.7 percentage',
 '117 percent',
 '8.4 percent',
 '1.38 %',
 '23.6 percent',
 '147 percent',
 '31.7 percent',
 '36.5 %',
 '.020 %',
 '2.59 percent',
 '251 %',
 '3.58 percent',
 '15.36 %',
 'eight percentage',
 '653 percent',
 '6.27 percent',
 '5.67 %',
 '10.7 %',
 '1.097 percent',
 '58.2 %',
 '42 %',
 '52.9 percent',
 '2.37 percent',
 '0.65 %',
 '0.515 percent',
 '69.1 %',
 '1.87 %',
 '7.9 percent',
 '13.2 %',
 '0.35 percent',
 '11.3 %',
 '2.5 percentage',
 '94 percent',
 '15.7 percent',
 '16.84 %',
 '16.87 percent',
 '6.93 percent',
 '75.8 percent',
 '23 percentage',
 '8.0 percent',
 '18.2 percent',
 '

In [143]:
finalPercentage = set(percentages)
finalPercentage = pd.DataFrame(finalPercentage)
finalPercentage.to_csv("ExtractedPercantages.csv",header=False,index=False)