## Objective of the Notebook
- Easily eliminate any articles that are not directly related to a firm receiving a bank loan
- Record every step of my thought process and iteration of the model

### Imports

In [1]:
import pandas as pd
import nltk

Reading files containing negative and positive words. Saving them into lists. 

In [2]:
file_object = open('project_initial_data/negative_words.txt', 'r')
negative_words = [i.strip() for i in file_object]
file_object.close()

file_object = open('project_initial_data/positive_words.txt', 'r')
positive_words = [i.strip() for i in file_object]
file_object.close()

### Initial Data Manipulation

Reading all consolidated data from years 1980-1984 into a dataframe. Dropped the 'ADVANCED-DATE' column for consistency's sake in order to concatenate the sheets. 

In [3]:
consolidated_data = pd.read_excel('project_initial_data/consolidated.xlsx', sheetname=['Year 1980', 'Year 1981', 'Year 1982', 'Year 1983', 'Year 1984'])

In [4]:
for i in consolidated_data:
    if 'ADVANCED-DATE' in consolidated_data[i].columns:
        consolidated_data[i] = consolidated_data[i].drop('ADVANCED-DATE', axis=1)

Now it's possible to divide the data into 'Training' and 'Test'.

In [14]:
training_data = pd.concat([consolidated_data['Year 1980'], consolidated_data['Year 1981'], consolidated_data['Year 1982'], consolidated_data['Year 1983']], ignore_index=True)
test_data = consolidated_data['Year 1984']

### METHOD 1: Flagging Articles Containing "Negative" Words
This method simply scans each article's FULLTEXT and determines whether or not it contains any words from the Negative Words list. The 'SUSPICIOUS?' column contains a boolean, and the 'REASON' column contains a list of the words, if any, that caused the article to be flagged.

Two functions were used: 
- A lambda function, negativeOverlapExist, that returns a boolean if there's any overlap
- findNegativeOverlap which returns a list of the actual words that caused the flag

A third function, findPositiveOverlap, returns a list of positive words found in the full text. Planning on using it later.

In [24]:
negativeOverlapExist = lambda a: any(i in a for i in negative_words)

def findNegativeOverlap(full_text):
    temp = []
    for i in negative_words:
        if i in full_text:
            temp.append(i)
    return temp

def findPositiveOverlap(full_text):
    temp = []
    for i in positive_words:
        if i in full_text:
            temp.append(i)
    return temp

In [30]:
training_data['SUSPICIOUS?'] = training_data['FULLTEXT'].apply(negativeOverlapExist)
training_data['REASON'] = training_data['FULLTEXT'].apply(findNegativeOverlap)
training_data['POSITIVE'] = training_data['FULLTEXT'].apply(findPositiveOverlap)

test_data['SUSPICIOUS?'] = test_data['FULLTEXT'].apply(negativeOverlapExist)
test_data['REASON'] = test_data['FULLTEXT'].apply(findNegativeOverlap)
test_data['POSITIVE'] = test_data['FULLTEXT'].apply(findPositiveOverlap)

Only keep the rows that have 'Y' or 'N' in the 'ANNOUNCEMENT' column. 

In [31]:
keep_indices = []
for index, rows in training_data.iterrows():
    announcement = training_data.iloc[index]['ANNOUNCEMENT']
    if announcement == 'N' or announcement == 'Y':
        keep_indices.append(index)
final_train = training_data.ix[keep_indices]
final_train

Unnamed: 0,ANNOUNCEMENT,BYLINE,CITY,COMPANY,COPYRIGHT,COUNTRY,DATELINE,DISTRIBUTION,FULLTEXT,GRAPHIC,...,REPORT_DATE,SECTION,SOURCE,STATE,SUBJECT,TICKER,TITLE,SUSPICIOUS?,REASON,POSITIVE
0,N,,,,Copyright 1980 Associated Press All Rights Res...,(88%);,"SACO, MAINE",,"Robert J. Tarte of Ashland, Mass., has come up...",,...,"January 12, 1980, Saturday, AM cycle",DOMESTIC NEWS,The Associated Press,"MAINE, USA (88%);",CITIES (90%); CREDIT CRISIS (90%); CITY GOVERN...,,Bright and Brief,True,[default],[]
1,Y,,,FIRST NATIONAL BANK OF BOSTON (93%); AM GENERA...,"COPYRIGHT 1980 PR Newswire Association, Inc.",NORTH AMERICA (94%); (94%);,"SOUTHFIELD, MICH., JAN. 29",TO FINANCIAL; COPY TO AUTO EDITOR,American Motors Corp. announced today that agr...,,...,"January 29, 1980, Tuesday",,PR Newswire,,PRESS RELEASES (91%); COMMERCIAL BANKING (90%)...,,,False,[],"[entered, reached, provide, secured, credit ag..."
3,N,"BY ROBERT L. SHAFFER, ASSOCIATED PRESS WRITER","CLEVELAND, OH, USA (94%); COLUMBUS, OH, USA (7...",DELOITTE LLP (56%);,Copyright 1980 Associated Press All Rights Res...,(94%);,CLEVELAND,,Fourteen months after Cleveland defaulted on $...,,...,"February 14, 1980, Thursday, PM cycle",DOMESTIC NEWS,The Associated Press,"OHIO, USA (94%); CALIFORNIA, USA (79%);",ACCOUNTING & AUDITING FIRMS (90%); BANKING & F...,,Private Firms Hired To Help Restore Financial ...,True,"[default, recovery]",[provide]
4,Y,,"RALEIGH, NC, USA (59%);",DELAWARE CORP (55%); CAMERON-BROWN INVESTMEN...,"COPYRIGHT 1980 PR Newswire Association, Inc.",(59%);,"RALEIGH, N.C., FEB. 20",TO FINANCIAL,"Cameron-Brown Investment Group (NYSE ""CB"") tod...",,...,"February 20, 1980, Wednesday",,PR Newswire,"NORTH CAROLINA, USA (59%);",PRESS RELEASES (90%); INTEREST RATES (89%); EC...,,,True,[recovery],"[negotiated, signed, reached, credit agreement]"
5,Y,,"NASHVILLE, TN, USA (59%);",GENESCO INC (96%);,"COPYRIGHT 1980 PR Newswire Association, Inc.",(59%);,"NASHVILLE, TENN., FEB. 21",TO FINANCIAL,"Genesco, Inc., announced that it has signed ne...",,...,"February 21, 1980, Thursday",,PR Newswire,"TENNESSEE, USA (59%);",PRESS RELEASES (91%); BANKING & FINANCE (87%);,GCO (NYSE) (96%);,,False,[],"[signed, credit agreement]"
6,Y,,"HOUSTON, TX, USA (90%); GALVESTON, TX, USA (54...",MITCHELL ENERGY & DEVELOPMENT CORP (95%);,"COPYRIGHT 1980 PR Newswire Association, Inc.",(90%);,"HOUSTON, FEB. 27",TO FINANCIAL; COPY TO REAL ESTATE EDITOR,Mitchell Energy & Development Corp.'s real est...,,...,"February 27, 1980, Wednesday",,PR Newswire,"TEXAS, USA (90%);",PRESS RELEASES (91%); BANKING & FINANCE (90%);...,CHCO (NASDAQ) (57%);,,False,[],"[received, provide, secured, credit line, line..."
7,Y,,"HOUSTON, TX, USA (91%);",WEATHERFORD INTERNATIONAL LTD (95%);,"COPYRIGHT 1980 PR Newswire Association, Inc.",(91%);,"HOUSTON, FEB. 29",TO FINANCIAL,Weatherford International Incorporated (AMEX) ...,,...,"February 29, 1980, Friday",,PR Newswire,"TEXAS, USA (91%);",PRESS RELEASES (91%); INTEREST RATES (90%); EC...,WH4 (FRA) (95%); WFT (SWX) (95%); WFT (PAR) (9...,,False,[],"[entered, credit agreement, line of credit]"
9,N,"BY CHARLES CAMPBELL, ASSOCIATED PRESS WRITER","ATLANTA, GA, USA (78%);",FIRST NATIONAL BANK OF CHEROKEE (90%); FIRST ...,Copyright 1980 Associated Press All Rights Res...,(91%);,ATLANTA,,"Bert Lance's mother-in-law and sister-in-law, ...",,...,"March 15, 1980, Saturday, AM cycle",DOMESTIC NEWS,The Associated Press,"GEORGIA, USA (91%);",WITNESSES (91%); TESTIMONY (90%); BUDGETS (90%...,,Prosecutors To Question Bert Lance's Relatives...,True,"[fraud, government]",[]
10,Y,,"ROCHESTER, NY, USA (79%);",VAN WYCK INTERNATIONAL CORP (86%); DISTRICT ...,"COPYRIGHT 1980 PR Newswire Association, Inc.",(92%);,"NEW YORK, MARCH 18",TO FINANCIAL,"Van Wyck International Corporation, (NASDAQ), ...",,...,"March 18, 1980, Tuesday",,PR Newswire,"NEW YORK, USA (91%);",PRESS RELEASES (90%); US CHAPTER 11 BANKRUPTCY...,,,True,[Chapter 11],[line of credit]
11,N,,"AURORA, OH, USA (50%);",HAMILTON INVESTMENT (65%);,"COPYRIGHT 1980 PR Newswire Association, Inc.",(59%);,"ELIZABETH, N.J., MARCH 18",TO FINANCIAL,"Hamilton Investment Trust (NASDAQ), a real est...",,...,"March 18, 1980, Tuesday",,PR Newswire,"NEW JERSEY, USA (59%); COLORADO, USA (55%); WA...",INVESTMENT TRUSTS (91%); EARNINGS PER SHARE (9...,,,True,[recovery],"[signed, provide, credit agreement]"


Do the same for the test data. 

In [32]:
keep_indices2 = []
for index, rows in test_data.iterrows():
    announcement = test_data.iloc[index]['ANNOUNCEMENT']
    if announcement == 'N' or announcement == 'Y':
        keep_indices2.append(index)
final_test = test_data.ix[keep_indices2]
final_test

Unnamed: 0,ANNOUNCEMENT,BYLINE,CITY,COMPANY,COPYRIGHT,COUNTRY,DATELINE,DISTRIBUTION,FULLTEXT,HTML,...,REPORT_DATE,SECTION,SOURCE,STATE,SUBJECT,TICKER,TITLE,SUSPICIOUS?,REASON,POSITIVE
0,N,"BY ANDREW KATELL, ASSOCIATED PRESS WRITER","CHARLESTON, WV, USA (79%); RICHMOND, VA, USA (...",NATIONAL STEEL CORP (83%); NATIONAL INTERGROUP...,Copyright 1984 Associated Press All Rights Res...,(93%);,"CHARLESTON, W.VA.",,Officials involved in the sale of Weirton Stee...,"\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 2, 1984, Monday, AM cycle",DOMESTIC NEWS,The Associated Press,"WEST VIRGINIA, USA (93%); PENNSYLVANIA, USA (9...",FACTORY WORKERS (90%); IRON & STEEL MILLS (90%...,NII (NYSE) (84%);,Officials Preparing For Employee Buyout Of Ste...,True,[lawsuit],[completed]
1,N,,,PALMER SQUARE DEVELOPMENT (57%);,"COPYRIGHT 1984 PR Newswire Association, Inc.",(90%);,"LAWRENCEVILLE, N.J., JAN. 3",TO FINANCIAL; COPY TO REAL ESTATE NEWS EDITOR,"LAWRENCEVILLE, N.J., Jan. 3 /PRN/ -- Commonwea...","\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 3, 1984, Tuesday",,PR Newswire,"NEW JERSEY, USA (90%);",REAL ESTATE (92%); PRESS RELEASES (91%); REAL ...,,,False,[],[]
2,Y,,,WESTERN FINANCIAL GROUP INC (58%); WELLS FARGO...,"Copyright 1984 Business Wire, Inc",(64%);,"LAGUNA NIGUEL, CALIF.",BUSINESS EDITORS,Digital Datacom Inc. (NASDAQ/DDII) Wednesday a...,"\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 4, 1984, Wednesday",,Business Wire,"CALIFORNIA, USA (64%);",PRESS RELEASES (91%); COMPUTING & INFORMATION ...,WFC (NYSE) (58%);,DIGITAL-DATACOM; Establishes initial line of c...,False,[],"[received, signed, credit line, line of credit]"
3,Y,,"CLEVELAND, OH, USA (90%);",LUXOTTICA GROUP SPA (93%); OPTICAL DEPARTMEN...,"COPYRIGHT 1984 PR Newswire Association, Inc.",NORTH AMERICA (79%); (79%);,"CLEVELAND, JAN. 5",TO FINANCIAL,"CLEVELAND, Jan. 5 /PRN/ -- Cole National Corpo...","\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 5, 1984, Thursday",,PR Newswire,,PRESS RELEASES (91%); OPTICAL GOODS STORES (90...,LUX (NYSE) (93%); LUX (BIT) (93%);,,True,[fiscal year],"[entered, credit agreement, credit line]"
4,N,,"DENVER, CO, USA (88%);",AMERICAN GYPSUM CO LLC (93%); REPUBLIC PAPERBO...,"COPYRIGHT 1984 PR Newswire Association, Inc.",(90%);,"DALLAS, JAN. 5",TO FINANCIAL,"DALLAS, Jan. 5 /PRN/ -- Republic Gypsum Compan...","\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 5, 1984, Thursday",,PR Newswire,"KANSAS, USA (90%); COLORADO, USA (88%); MIDWES...",PAPERBOARD (92%); PRESS RELEASES (91%); GYPSUM...,RGC (NYSE) (96%);,,True,"[Republic, common stock]","[obtained, secured]"
5,Y,,"SANTA ANA, CA, USA (51%);",SORIN SPA (96%);,"Copyright 1984 Business Wire, Inc",(90%);,"SANTA ANA, CALIF.",BUSINESS EDITORS,Gish Biomedical Inc. (NASDAQ/GISH) announced F...,"\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 6, 1984, Friday",,Business Wire,"CALIFORNIA, USA (90%);",PRESS RELEASES (91%); MEDICAL DEVICES (78%); E...,SRN (BIT) (96%);,GISH-BIOMEDICAL; Secures new credit line,False,[],"[negotiated, credit agreement, credit line, li..."
6,Y,,"CHICAGO, IL, USA (91%);",CHICAGO MILWAUKEE CORP (93%); NATIONAL BANK & ...,"COPYRIGHT 1984 PR Newswire Association, Inc.",(91%);,"CHICAGO, JAN. 9",TO FINANCIAL,"CHICAGO, Jan. 9 /PRN/ -- Chicago Milwaukee Cor...","\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 9, 1984, Monday",,PR Newswire,"ILLINOIS, USA (91%);",PRESS RELEASES (91%); HOLDING COMPANIES (90%);...,CHG (NYSE) (58%);,,True,"[common stock, common share]","[provide, credit line, line of credit]"
7,N,,,FOTOMAT CORP (91%);,"COPYRIGHT 1984 PR Newswire Association, Inc.",(90%);,"ST. PETERSBURG, FLA., JAN. 9",TO FINANCIAL,"ST. PETERSBURG, Fla., Jan. 9 /PRN/ -- Fotomat ...","\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 9, 1984, Monday",,PR Newswire,"FLORIDA, USA (90%);",PRESS RELEASES (91%); CORPORATE DEBT (88%); CO...,,,False,[],[established]
9,Y,,,ADVENTURE INC (93%);,"Copyright 1984 Business Wire, Inc",(94%);,"WOODINVILLE, WASH.",BUSINESS EDITORS,The Great Outdoor American Adventure Inc. (NAS...,"\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 10, 1984, Tuesday",,Business Wire,NORTHWEST USA (79%); MIDWEST USA (79%); CALIFO...,PRESS RELEASES (92%); RV PARKS & CAMPGROUNDS (...,,AMERICAN-ADVENTURE; Completes $14 million cred...,False,[],"[obtained, provide, credit line, line of credit]"
10,N,"BY KURT J. REPANSHEK, ASSOCIATED PRESS WRITER",,WEIRTON STEEL CORP (94%); NATIONAL INTERGROUP ...,Copyright 1984 Associated Press All Rights Res...,(79%);,"WEIRTON, W. VA.",,Weirton Steel Corp. becomes the nation's large...,"\n<br/><div class=""c0""><p class=""c1""><span cla...",...,"January 10, 1984, Tuesday, AM cycle",BUSINESS NEWS,The Associated Press,"WEST VIRGINIA, USA (79%);",FACTORY WORKERS (90%); IRON & STEEL MILLS (90%...,WRTL (NASDAQ) (94%);,Weirton Employees To Close Takeover Deal,True,[recovery],[]


<i>Need to include some analysis about if the suspicious flags are accurate or not. (Some sort of statistical analysis should be helpful)<i>

### METHOD 2: Naive Bayes Classifier

<i>Just focusing on features in FULLTEXT for now</i>

#### Feature Extractor
Every word in the positive and negative word lists is its own feature. If the word exists in the article, the feature is True. If not, then it's False. 
- e.g. if a word contains 'Attorney General', then the feature 'contains (Attorney General)' will be True. If not, it will be False. 

The function articleClass returns whether or not the article at the index given is an announcement or not.

In [34]:
# This method is a bit inefficient. Should find a faster way to do it. 
def articleFeatures(index, df):
    features = {}
    for word in negative_words:
        features['contains ({}) (-)'.format(word)] = word in df.ix[index]['FULLTEXT']
    for word in positive_words:
        features['contains ({}) (+)'.format(word)] = word in df.ix[index]['FULLTEXT']
    return features

def articleClass(index, df):
    return df.ix[index]['ANNOUNCEMENT']

nb_train and nb_test are lists of tuples in the form of (features, class)

In [39]:
nb_train = [(articleFeatures(index, final_train), articleClass(index, final_train)) for index,rows in final_train.iterrows()]
nb_test = [(articleFeatures(index, final_test), articleClass(index, final_test)) for index, rows in final_test.iterrows()]

Training and testing the classifier, then displaying the most informative features. Could there possibly be the problem of overfitting?

In [43]:
classifier = nltk.NaiveBayesClassifier.train(nb_train)

print("Classifier Accuracy on Test Data: " + str(nltk.classify.accuracy(classifier, nb_test)) + "\n")
classifier.show_most_informative_features(20)

Classifier Accuracy on Test Data: 0.776255707762557

Most Informative Features
contains (government) (-) = True                N : Y      =      8.6 : 1.0
 contains (Congress) (-) = True                N : Y      =      7.7 : 1.0
 contains (recovery) (-) = True                N : Y      =      5.3 : 1.0
    contains (fraud) (-) = True                N : Y      =      4.5 : 1.0
  contains (entered) (+) = True                Y : N      =      4.4 : 1.0
  contains (secured) (+) = True                Y : N      =      4.1 : 1.0
contains (credit agreement) (+) = True                Y : N      =      3.3 : 1.0
  contains (default) (-) = True                N : Y      =      3.2 : 1.0
 contains (impaired) (-) = True                Y : N      =      2.5 : 1.0
contains (renegotiate) (+) = True                N : Y      =      2.3 : 1.0
contains (common share) (-) = True                N : Y      =      2.2 : 1.0
 contains (Republic) (-) = True                N : Y      =      2.2 : 1.0
contains

Let's see what this classifier is actually guessing for each article and how accurate it is. 

In [58]:
guesses = [] 
for index, row in final_test.iterrows():
    guess = "Guess: " + str(classifier.classify(articleFeatures(index, final_test)))
    correct_tag = "Actual: " + str(articleClass(index, final_test))
    negative_words = "-: " + str(final_test.ix[index]['REASON'])
    positive_words = "+: " + str(final_test.ix[index]['POSITIVE'])
    guesses.append( (index, correct_tag, guess, negative_words, positive_words))

for i in guesses:
    print(i)

(0, 'Actual: N', 'Guess: N', "-: ['lawsuit']", "+: ['completed']")
(1, 'Actual: N', 'Guess: N', '-: []', '+: []')
(2, 'Actual: Y', 'Guess: N', '-: []', "+: ['received', 'signed', 'credit line', 'line of credit']")
(3, 'Actual: Y', 'Guess: N', "-: ['fiscal year']", "+: ['entered', 'credit agreement', 'credit line']")
(4, 'Actual: N', 'Guess: N', "-: ['Republic', 'common stock']", "+: ['obtained', 'secured']")
(5, 'Actual: Y', 'Guess: N', '-: []', "+: ['negotiated', 'credit agreement', 'credit line', 'line of credit']")
(6, 'Actual: Y', 'Guess: N', "-: ['common stock', 'common share']", "+: ['provide', 'credit line', 'line of credit']")
(7, 'Actual: N', 'Guess: N', '-: []', "+: ['established']")
(9, 'Actual: Y', 'Guess: N', '-: []', "+: ['obtained', 'provide', 'credit line', 'line of credit']")
(10, 'Actual: N', 'Guess: N', "-: ['recovery']", '+: []')
(12, 'Actual: N', 'Guess: N', "-: ['default', 'subordinated debentures']", "+: ['credit agreement']")
(13, 'Actual: N', 'Guess: N', '-: []

It actually turns out that this classifier guesses 'N' for every single article. So the 78% accuracy just comes from the fact that Perhaps it has something to do with the fact that there's so many negative examples and so few positive examples... 

Now I'll try out another version of the Naive Bayes classifier that, instead of treating each negative/positive word as its own feature, treats sets of negative/positive words that appear together as individual features. 
- e.g. Say an article contains both 'completed' and 'line of credit'. Rather than having two separate features, there's just one feature containing a set in the form of a tuple ('completed', 'line of credit')
    - It needs to be a tuple rather than a list because features fed into the nltk Naive Bayes Classifier need to be hashable objects. 

First, I need to create another feature extractor that encodes <i>sets</i> of positive and negative words rather than individual words.. 

In [47]:
def articleFeatures_v2(index, df):
    features = {}
    features['positive_words'] = tuple(df.ix[index]['POSITIVE'])
    features['negative_words'] = tuple(df.ix[index]['REASON'])
    return features

In [49]:
nb_train_v2 = [(articleFeatures_v2(index, final_train), articleClass(index, final_train)) for index, rows in final_train.iterrows()]
nb_test_v2 = [(articleFeatures_v2(index, final_test), articleClass(index, final_test)) for index, rows in final_test.iterrows()]

In [64]:
classifier_v2 = nltk.NaiveBayesClassifier.train(nb_train_v2)
print("ACCURACY: " + str(nltk.classify.accuracy(classifier_v2, nb_test_v2)))

ACCURACY: 0.730593607305936


Now let's see what kinds of guesses this version of the classifier makes on the test data. 

In [76]:
guesses = []
correct_guesses = []
incorrect_guesses = []

for index, row in final_test.iterrows():
    guess = "Guess: " + str(classifier_v2.classify(articleFeatures_v2(index, final_test)))
    correct_tag = "Actual: " + str(articleClass(index, final_test))
    negative_words = "-: " + str(final_test.ix[index]['REASON'])
    positive_words = "+: " + str(final_test.ix[index]['POSITIVE'])
    
    if classifier_v2.classify(articleFeatures_v2(index, final_test)) == articleClass(index, final_test):
        judgment = 'CORRECT'
        correct_guesses.append( (index, judgment, correct_tag, guess, negative_words, positive_words))
    else:
        judgment = 'INCORRECT'
        incorrect_guesses.append( (index, judgment, correct_tag, guess, negative_words, positive_words))
    guesses.append( (index, judgment, correct_tag, guess, negative_words, positive_words))

for i in guesses:
    print(i)

(0, 'CORRECT', 'Actual: N', 'Guess: N', "-: ['lawsuit']", "+: ['completed']")
(1, 'CORRECT', 'Actual: N', 'Guess: N', '-: []', '+: []')
(2, 'CORRECT', 'Actual: Y', 'Guess: Y', '-: []', "+: ['received', 'signed', 'credit line', 'line of credit']")
(3, 'CORRECT', 'Actual: Y', 'Guess: Y', "-: ['fiscal year']", "+: ['entered', 'credit agreement', 'credit line']")
(4, 'CORRECT', 'Actual: N', 'Guess: N', "-: ['Republic', 'common stock']", "+: ['obtained', 'secured']")
(5, 'CORRECT', 'Actual: Y', 'Guess: Y', '-: []', "+: ['negotiated', 'credit agreement', 'credit line', 'line of credit']")
(6, 'INCORRECT', 'Actual: Y', 'Guess: N', "-: ['common stock', 'common share']", "+: ['provide', 'credit line', 'line of credit']")
(7, 'CORRECT', 'Actual: N', 'Guess: N', '-: []', "+: ['established']")
(9, 'CORRECT', 'Actual: Y', 'Guess: Y', '-: []', "+: ['obtained', 'provide', 'credit line', 'line of credit']")
(10, 'CORRECT', 'Actual: N', 'Guess: N', "-: ['recovery']", '+: []')
(12, 'CORRECT', 'Actual: N

Now we can see the actual guesses the classifier is making. This version actually guesses 'Y' occasionally. One possible reason is that having groups of positive words together dramatically increases the probability of the article being classified as a 'Y', so much so that even with few positive examples to train on, the Naive Bayes Classifier guesses 'Y'. 

I guess the next question to ask is: how accurately does it predict a 'Y'? An 'N'? 

In [80]:
# Find number of correctly guessed Ys and divide by total Ys
correct_y = 0
for i in correct_guesses:
    if 'Y' in i[2]:
        correct_y += 1

all_y = 0
for i in guesses:
    if 'Y' in i[2]:
        all_y += 1

y_accuracy = correct_y/all_y  
print("CORRECT Y ACCURACY: " + str(y_accuracy))

CORRECT Y ACCURACY: 0.6206896551724138


In [81]:
# Now do the same with Ns 
correct_n = 0
for i in correct_guesses:
    if 'N' in i[2]:
        correct_n+=1
        
all_n = 0
for i in guesses:
    if 'N' in i[2]:
        all_n += 1
        
n_accuracy = correct_n/all_n
print("CORRECT N ACCURACY: " + str(n_accuracy))

CORRECT N ACCURACY: 0.7701863354037267


So this model predicts 'Y' correctly in 62% of cases and 'N' correctly in 77% of cases.

Let's see what the most helpful features are now.

In [61]:
classifier_v2.show_most_informative_features(20)

Most Informative Features
          positive_words = ('completed', 'provide', 'line of credit')      Y : N      =      5.7 : 1.0
          positive_words = ()                  N : Y      =      5.4 : 1.0
          positive_words = ('received',)       N : Y      =      4.4 : 1.0
          positive_words = ('obtained', 'line of credit')      Y : N      =      4.4 : 1.0
          positive_words = ('provide',)        N : Y      =      3.7 : 1.0
          negative_words = ('Republic',)       Y : N      =      3.2 : 1.0
          negative_words = ('fiscal year',)      Y : N      =      2.8 : 1.0
          positive_words = ('signed',)         N : Y      =      2.6 : 1.0
          positive_words = ('signed', 'provide', 'credit agreement')      Y : N      =      2.6 : 1.0
          negative_words = ('common share',)      N : Y      =      2.6 : 1.0
          positive_words = ('provide', 'line of credit')      N : Y      =      2.3 : 1.0
          negative_words = ('government',)      N : Y     