### Detecting descriptions of failure in text

The goal of this project is to build a system that can detect parts of text that describe instances of failure - such as failure of a project, a piece of equipment, or a company. This problem resembles sentiment analysis, and I will be using some approaches from sentiment analysis.

#### Table of contents

This notebook is organized into sections as follows:
    1. Assemble training data : assemble a collection of example sentences used to train the classifier. 
                                Positive sentences (those describing failure) are loaded directly from a file. 
                                Negative sentences (no failure) are extracted from Wikipedia articles 
                                after some text cleaning.
    2. Train the classifier   : load GloVe word embedding data. Train an LSTM network on the example sentences,
                                with words replaced by vectors according to the word embedding. Use cross-validation  
                                to obtain an estimate of how well the classifier can generalize on data drawn from 
                                the training distribution.
    3. Assemble test data     : use web scraping to gather a dataset of interviews with startup founders from 
                                www.failory.com. Each startup either succeeded or failed; we want to use the
                                pre-trained classifier to try to determine which. 
                                3a : scrape main pages and download all the relevant articles, save HTML to a SQLite 
                                     database. Parse HTML to extract text of interest, save processed text to .json.
                                3b : use googlemaps API to plot a map showing locations of all startups in dataset
    4. Classify test data     : run the classifier trained in 2. on the failory.com dataset 
    
All intermediate results are saved to file at the end of each section and re-loaded at the start of the next section,
so it is possible to start running the notebook at any section.
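
The save-and-reload pattern described above can be sketched as a small helper. This is only an illustration: `load_or_compute` and `expensive_step` are hypothetical names that do not appear elsewhere in this notebook, and the demonstration writes to a temporary directory rather than to the real data files.

```python
import json
import os
import tempfile

def load_or_compute(path, compute_fn):
    """Return the JSON result cached at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "r") as fp:
            return json.load(fp)
    result = compute_fn()
    with open(path, "w") as fp:
        json.dump(result, fp)
    return result

calls = []

def expensive_step():
    # Stand-in for a slow section of the notebook (hypothetical)
    calls.append(1)  # record that the computation actually ran
    return {"positives": ["it was a disaster"],
            "negatives": ["Many other open-source software projects contribute to Linux systems"]}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "training_data.json")
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # loads from file
```

The second call reads the file instead of recomputing, which is what makes it possible to start the notebook at a later section.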

In [None]:
#package installation for jupyter
#if any packages are missing, they can be installed and made available for jupyter 
#by running the code below directly in the notebook (only need to do this once)
#import sys
#!{sys.executable} -m pip install keras
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install requests
#!{sys.executable} -m pip install  ...

### 1. Assemble training data

The first step is to assemble a set of labeled text data for training the algorithm. I plan to use an algorithm that takes single sentences or parts of sentences as input, and returns an estimated probability that the sentence describes an instance of failure. Therefore, I need a training dataset consisting of sentences with binary labels. I refer to these sentences as positive cases if they describe failure, and negative cases if they do not.



To acquire the positive cases (sentences describing failure), I manually extracted sentences from a variety of texts. These included descriptions of failed construction projects, failed software engineering projects, failed charitable initiatives, failed startups, and other instances of failure. The websites included Medium, Quora, calleam.com and several others. Using multiple sources is crucial, because it helps to prevent the algorithm from learning any spurious associations between the language style of a sentence and its failure-related status. A full list of sources is given in the file of positive cases.

In [149]:
import os
import re
import time  #used for time.sleep between downloads
import sqlite3
import requests
import urllib.request
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import json

In [51]:
#load positive cases (sentences describing an instance of failure)
#these sentences are accumulated as keys in a python dict that is written to a file as raw code
#(so it is easy to modify directly)

from training_positive_cases import training_positive_cases
training_positive_cases = list(training_positive_cases.keys())

In [52]:
print("Number of positive training cases = %d\n\n"%len(training_positive_cases))
#print a few cases at random
print("Examples:\n\n")
_=[print(s + "\n") for s in np.array(training_positive_cases)[np.random.permutation(len(training_positive_cases))[0:5]]]

Obamacare Website Programmers Complained About Unrealistic Deadlines

Ignoring users is a tried and true way to fail

the President had to admit that the performance of the system was below what would be expected

Despite significant technical problems with the prototype

they never spend any money promoting it and it goes unused and is left to die



The negative cases are sentences or parts of sentences that do not describe failure, so they belong to a much larger and more diverse set. They need to resemble the positive cases in style, general language use, and non-failure-related vocabulary, because any systematic difference could lead the algorithm to learn spurious associations. To obtain the negative cases, I downloaded the text of multiple Wikipedia articles on specific software and other projects which are not known for failure, and used all sentences from the main body of text of these articles as negative cases.
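
As a rough sketch of how body text can be turned into candidate sentences, the snippet below splits a plain-text string on sentence boundaries and drops very short or very long fragments. The function name `split_sentences`, its thresholds, and the sample string are assumptions made for illustration; the notebook's own cleaning function also handles HTML and differs in detail.

```python
import re

def split_sentences(text, min_words=3, max_words=49):
    # Split on: period + space | period + newline | period + citation such as [12]
    parts = re.split(r'\. |\.\n|\.\[\d+\]', text)
    sentences = []
    for part in parts:
        tokens = part.split()
        # Drop citation markers that survive inside a sentence
        tokens = [t for t in tokens if "[" not in t and "]" not in t]
        # Keep only fragments of a plausible sentence length
        if min_words <= len(tokens) <= max_words:
            sentences.append(" ".join(tokens))
    return sentences

sample = ("Linux is an operating system kernel.[1] Many projects build on it. "
          "Short one.\nIt is widely used on servers and phones.")
result = split_sentences(sample)
print(result)
```

The two-word fragment "Short one" is filtered out, while the citation marker `[1]` is consumed by the split itself.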

In [29]:
from training_negative_urls import training_negative_urls

#download the full text from the specified URLs
#and save it to sqlite database
#reasons: 
#         1) avoid downloading multiple times 
#         2) now have a working snapshot, will not be affected by future wikipedia edits
conn = sqlite3.connect('negative_raw_html.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Negative (url TEXT, text TEXT) ')

def insert_negative_text(cur, url, text):
    cur.execute('SELECT text FROM Negative WHERE url = ?', (url,))
    #? -> avoid SQL injection
    row = cur.fetchone()
    if(row is None):
        cur.execute('''INSERT INTO Negative (url, text)
            VALUES (?, ?)''', (url, text))
        return True
    else:
        return False
    
for url in training_negative_urls.keys():
    print('\ndownloading '+url+'\n')
    article = requests.get(url)
    time.sleep(1)  
    insert_negative_text(cur, url, article.text)
    conn.commit()


downloading https://en.wikipedia.org/wiki/Linux


downloading https://en.wikipedia.org/wiki/Triborough_Bridge


downloading https://en.wikipedia.org/wiki/Database


downloading https://en.wikipedia.org/wiki/Lean_startup


downloading https://en.wikipedia.org/wiki/Business_model


downloading https://en.wikipedia.org/wiki/Pinterest


downloading https://en.wikipedia.org/wiki/Twitter


downloading https://en.wikipedia.org/wiki/Application_software


downloading https://en.wikipedia.org/wiki/Web_search_engine


downloading https://en.wikipedia.org/wiki/Software



In [53]:


def clean_wikipedia_text(article_text):

    soup = BeautifulSoup(article_text, "html.parser")
    text = ""
    tags = soup.findAll("p")
    for t in tags:
        text = text + t.text
    #only take contents of <p> tags
    #this ensures we only take the main text while discarding extraneous material 
    #(references etc.)
        
    sentences = re.split(r'\. |\.\n|\.\[\d+\]', text)
    #split on: period followed by space | period followed by line break | period followed by citation 

    sentences = [s.split() for s in sentences]
    #split on whitespace
    sentences = [s for s in sentences if len(s) > 2 and len(s) < 50]
    #remove unusually short or long sentences

    remove_citations = lambda s : [t for t in s if "[" not in t and "]" not in t]
    sentences = [remove_citations(s) for s in sentences]
    
    remove_listens = lambda s : [t for t in s if "/" not in t and not t=="(listen)"]
    sentences = [remove_listens(s) for s in sentences]
    
    #remove other extraneous punctuation?
    
    sentences = [" ".join(s) for s in sentences]
    
    return sentences


In [39]:
#load texts from database into list of dicts (database_dict_list)
database_dict_list = []
sqlstr = 'SELECT url, text FROM Negative'
for row in cur.execute(sqlstr):
    entry = {}
    entry['url'] = row[0]
    entry['text'] = row[1]
    database_dict_list.append(entry)
texts = [d['text'] for d in database_dict_list]
training_negative_cases = []
for t in texts: training_negative_cases = training_negative_cases + clean_wikipedia_text(t)

In [59]:
print("Number of negative training cases = %d\n\n"%len(training_negative_cases))
#print a few cases at random
print("Examples:\n\n")
_=[print(s + "\n") for s in np.array(training_negative_cases)[np.random.permutation(len(training_negative_cases))[0:5]]]

For E-ZPass users, sensors detect their transponders wirelessly

The Korg OASYS, the Korg KRONOS, the Yamaha Motif XF music Yamaha Yamaha synthesizers, Yamaha Motif-Rack XS tone generator module, and Roland RD-700GX digital piano also run Linux

Separately, the Board of Estimate voted to create an authority to impose toll charges on both crossings

Larry Ellison's Oracle Database (or more simply, Oracle) started from a different chain, based on IBM's papers on System R

Many other open-source software projects contribute to Linux systems



In [150]:
#save both positive and negative cases to .json
#so we can immediately load them and start at 2. if desired
training_dict = {"negatives":training_negative_cases, "positives":training_positive_cases}
with open('training_data.json', 'w') as fp:
    json.dump(training_dict, fp)

### 2. Train the classifier

I want to develop a classifier which takes sentences or parts of sentences as input and returns an estimated probability that this input describes an instance of failure.
My approach is as follows:
First, I use GloVe word embeddings to convert each word of the sentence to a d-dimensional vector (where d will be some value between 50 and 300). The idea behind this is that the embeddings should capture some aspects of the meaning of each word, which will enable the classifier to generalize to sentences with similar semantics, even if it has never seen the precise words before. For example, I would hope that once the classifier learns that the sentence "it was a disaster" is an instance of failure, it will subsequently classify "it was a catastrophe" as failure as well, even if the word "catastrophe" never appeared in the training dataset.
Using word embeddings is crucial because I can't realistically assemble a training dataset that includes all combinations of relevant English words, so I need to build a system that can perform semantic generalization.
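
To make the idea of semantic similarity concrete, here is a minimal sketch using cosine similarity. The 4-dimensional vectors are made-up values chosen for illustration, not real GloVe embeddings, and `toy_embeddings` is a hypothetical stand-in for the embedding table loaded later in this section.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, NOT real GloVe values
toy_embeddings = {
    "disaster":    np.array([0.9, 0.8, 0.1, 0.0]),
    "catastrophe": np.array([0.8, 0.9, 0.2, 0.1]),
    "bridge":      np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_close = cosine_similarity(toy_embeddings["disaster"], toy_embeddings["catastrophe"])
sim_far = cosine_similarity(toy_embeddings["disaster"], toy_embeddings["bridge"])
print("disaster vs catastrophe: %.3f" % sim_close)
print("disaster vs bridge     : %.3f" % sim_far)
```

If real embeddings behave like these toy vectors, a classifier that has learned a response to "disaster" receives a nearly identical input for "catastrophe". The same comparison could be run on real vectors once the embedding table is loaded, for example with `glove_df["disaster"].values`.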

In [115]:
import LSTM_functions as lstm
#separate file contains functions for defining and training lstm

import numpy as np
import pandas as pd
import json

import importlib


In [151]:
#load the data that we assembled in 1. from .json
with open('training_data.json', 'r') as fp:
    training_data = json.load(fp)


In [152]:
glove_pretrained_embeddings_path = '/users/cstoneki/Documents/analysis/general_resources/glove.6B/glove.6B.300d.txt'
#glove_pretrained_embeddings_path = '/mnt/glove.6B.50d.txt'

In [117]:
#load GloVe data
#this can take a bit of time, especially for the higher-dimensional datasets (such as 300d)
#so report progress

with open(glove_pretrained_embeddings_path) as f:
    n_entries = 0
    d = 0
    
    for k, line in enumerate(f.readlines()):
        n_entries = k + 1
        #the first entry is "the", it is well formatted
        if(k==0): d = len(line.split()) - 1
    glove_data = np.zeros([d, n_entries])
    words = []
    #store each entry (word) as column
    print('Found %d words in glove dataset'%n_entries)
    f.seek(0)
    for k, line in enumerate(f.readlines()):
        lst = line.split()
        words.append(lst[0])
        vals = np.array([float(s) for s in lst[1:]])
        glove_data[:,k] = vals
        if(k % 50000==0):
            print('Words loaded : %06d '%k)
    print('Finished loading data')
        
glove_df = pd.DataFrame(glove_data, columns=words)

Found 400000 words in glove dataset
Words loaded : 000000 
Words loaded : 050000 
Words loaded : 100000 
Words loaded : 150000 
Words loaded : 200000 
Words loaded : 250000 
Words loaded : 300000 
Words loaded : 350000 
Finished loading data


The model I use is an LSTM, implemented using Keras. The code is in a separate file (LSTM_functions.py)
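
Since LSTM_functions.py is not shown here, the following is only a sketch of how a sentence might be converted to a fixed-length sequence of word vectors before being passed to a recurrent network. The function `sentence_to_matrix`, the padding scheme, and the toy embedding dict are assumptions, not the contents of that file.

```python
import numpy as np

def sentence_to_matrix(sentence, embeddings, max_len=6):
    """Map a sentence to a (max_len, d) array of word vectors, zero-padded at the end.

    `embeddings` is a dict from word to 1-D numpy vector (hypothetical stand-in
    for the GloVe table). Unknown words are skipped.
    """
    d = len(next(iter(embeddings.values())))
    matrix = np.zeros((max_len, d))
    # Lowercase because the GloVe 6B vocabulary is lowercase
    known = [w for w in sentence.lower().split() if w in embeddings]
    for i, word in enumerate(known[:max_len]):
        matrix[i, :] = embeddings[word]
    return matrix

# Toy 2-dimensional embeddings with made-up values
toy = {
    "it": np.array([0.1, 0.2]),
    "was": np.array([0.3, 0.1]),
    "a": np.array([0.0, 0.5]),
    "disaster": np.array([0.9, 0.8]),
}
x = sentence_to_matrix("It was a total disaster", toy)
print(x.shape)
```

Here "total" is not in the toy vocabulary, so it is dropped and "disaster" lands in the fourth row; the remaining rows stay zero. Other choices (an out-of-vocabulary vector, padding at the front) are equally plausible.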

In [None]:
importlib.reload(lstm)
hp = lstm.get_default_hyperparameters()
#train_LSTM: 
# inputs are list of positive sentences, list of negative sentences, embedding mapping dataframe, and hyperparameter dictionary (optional)
# outputs are trained model, out-of-fold predictions from crossvalidation, true labels, and list of sentences actually used for training
# (depending on hyperparameters, may not use all the training data provided)

(model, out_of_fold_preds, labels, training_cases) = lstm.train_LSTM(training_data['positives'], training_data['negatives'], glove_df,hp=hp)
#pass default hyperparameters
#so in particular, negative_positive_ratio = 0.5 -> 
#            the model will use all positive cases, and an equal number of negative cases chosen at random


Training LSTM on fold 1 / 5 :

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

Training LSTM on fold 2 / 5 :

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50

In [148]:
model_accuracy = np.mean((1.0*(np.squeeze(out_of_fold_preds)>0.5))==labels)
baseline_guessing_accuracy = max(np.mean(labels==0), np.mean(labels==1))
print("Accuracy of model          = %3f"%model_accuracy)
print("Baseline guessing accuracy = %3f"%baseline_guessing_accuracy)

Accuracy of model          = 0.881115
Baseline guessing accuracy = 0.500000


The out-of-fold accuracy (0.88) is well above the guessing baseline of 0.50, so the model generalizes reasonably well to held-out data drawn from the training distribution. The next step is to try to use the model to solve an actual prediction problem, using data that are drawn from a different distribution.
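
For readers unfamiliar with out-of-fold predictions, the procedure can be sketched as follows: each example is predicted by a model that was trained without it. This sketch uses a nearest-centroid classifier on synthetic 2-D data as a stand-in for the LSTM; all function names and data here are hypothetical.

```python
import numpy as np

def out_of_fold_predictions(X, y, fit, predict, n_folds=5, seed=0):
    """Predict every example with a model trained on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.zeros(len(y))
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train_idx], y[train_idx])
        preds[test_idx] = predict(model, X[test_idx])
    return preds

def fit_centroids(X, y):
    # Stand-in model: the mean of each class
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict_centroids(model, X):
    c0, c1 = model
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return (d1 < d0).astype(float)  # 1.0 if closer to the class-1 centroid

# Synthetic, balanced two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

preds = out_of_fold_predictions(X, y, fit_centroids, predict_centroids)
accuracy = np.mean((preds > 0.5) == y)
baseline = max(np.mean(y == 0), np.mean(y == 1))
print("accuracy = %.3f, baseline = %.3f" % (accuracy, baseline))
```

Because no example is scored by a model that saw it during training, the resulting accuracy estimates performance on new data from the same distribution, but says nothing about data from a different source.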

### 3. Analyze startup founder interviews from failory.com 

As an interesting real-world problem, I want to take a set of interviews with startup founders and determine whether the startup failed or succeeded. The interviews are collected at www.failory.com. This is a potentially challenging problem because it requires the model to deal with the semantics of the text. There are no obvious shortcuts: the interview questions are similar or identical for both failure and success, and the overall language use and vocabulary are similar in both cases.

First, we have to gather the data from failory, using web scraping.

In [None]:
import json

#### 3a. Gather Failory data using web scraping

In [40]:
main_urls = {'failory failure':'https://www.failory.com/interview-failure',
             'failory success':'https://www.failory.com/interview-success'}

In [None]:
def insert_report(cur,  url, text, tags):
    cur.execute('SELECT text FROM Startups WHERE url = ?', (url,))
    #? -> avoid SQL injection
    row = cur.fetchone()
    if(row is None):
        cur.execute('''INSERT INTO Startups (url, text, tags)
            VALUES (?, ?, ?)''', (url, text, tags))
        return True
    else:
        return False

In [None]:
conn = sqlite3.connect('startups_03.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Startups (url TEXT, text TEXT, tags TEXT) ')

In [None]:
for tags, url in main_urls.items():
    print('\ncollecting articles from '+url+'\n')
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    a_tags = soup.findAll('a')
    sub_urls = []
    for a in a_tags:
        classes = a.get("class") or []
        #keep links whose first CSS class marks an interview card and that have a target
        if(classes and classes[0] == 'card-for-interviews-title' and a.has_attr("href")):
            sub_urls.append(a["href"])
    for sub_url in sub_urls:
        full_url = 'https://www.failory.com' + sub_url
        article  = requests.get(full_url)
        print('downloaded '+full_url)
        time.sleep(1)
        
        insert_report(cur, full_url, article.text, tags)
        conn.commit()

In [None]:
#check contents of database
#by retrieving small text fields, not full text
sqlstr = 'SELECT url, tags FROM Startups'
database_dict = {}
for row in cur.execute(sqlstr):
    database_dict[str(row[0])] = [row[1]]

In [None]:
#print text of first article
sqlstr = 'SELECT url, text, tags FROM Startups'
for k, row in enumerate(cur.execute(sqlstr)):
    if(k > 0): break
    soup = BeautifulSoup(row[1], "html.parser")
    print(soup)

Now we need to figure out how to extract the text of the article from the mess of HTML. We need to strip out all of the ads and repeated quotes. One key part will be extracting the interviewer's questions, and the response that follows.

In [None]:

#failory has tags at the start of each article
#these are: location, area, failure cause #1, failure cause #2
#these are obviously extremely useful, so we want to extract them
#try to find location
def get_failory_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    article_tags = []
    for div in soup.find_all('div'):
        classes = div.get("class") or []
        #keep non-empty divs whose first CSS class marks a Failory tag
        if(classes and classes[0] == "secondary-tag-interview" and div.text):
            article_tags.append(div.text)
    return article_tags

In [None]:
database_dict_list = []
sqlstr = 'SELECT url, text, tags FROM Startups'
for row in cur.execute(sqlstr):
    entry = {}
    entry['url'] = row[0]
    entry['text'] = row[1]
    entry['tags'] = row[2]
    database_dict_list.append(entry)



In [None]:
for entry in database_dict_list:
    if('failory' in entry['tags']):
        #the following will only work for failory articles
        #so check because we may have non-failory articles in database later
        entry['failory_tags'] = get_failory_tags(entry['text'])

In [None]:
#now try to extract text of interest
def get_questions_responses(text):

    soup = BeautifulSoup(text, "html.parser")
    tags = soup.findAll(['h4', 'p'])
    #keep only tags without a class attribute
    #(tags with a class are presumably ads, repeated quotes etc.)
    tags_clean = [t for t in tags if not t.get("class")]
        
    questions = []
    responses = []
    current_text = []
    current_question = ""
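    for i in range(len(tags_clean)):
        if(tags_clean[i].name=='h4'):
            #a new question starts: store the previous question and its response
            if(current_question):
                questions.append(current_question)
                responses.append(" ".join(current_text))
            current_question = tags_clean[i].text
            current_text = []
        else:
            current_text.append(tags_clean[i].text)
    #store the final question/response pair, which the loop above never stores
    if(current_question):
        questions.append(current_question)
        responses.append(" ".join(current_text))
            
    return (questions, responses)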
        



In [None]:
for entry in database_dict_list:
    q,r = get_questions_responses(entry['text'])
    entry['questions'] = q
    entry['responses'] = r

We've spent a bit of computational time developing database_dict_list, so save it as JSON. I use JSON rather than SQL (a) because it's much easier to handle fields that are lists of variable length and (b) because we are not growing the data entry-by-entry, but dumping a single finished dataset.
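
As a small illustration of point (a), entries whose list fields have different lengths survive a JSON round trip without any schema. The entries below are made-up placeholders that mimic the shape of database_dict_list, not real scraped data.

```python
import json

# Hypothetical entries with list fields of different lengths
entries = [
    {"url": "https://example.com/a", "questions": ["q1", "q2"], "responses": ["r1", "r2"]},
    {"url": "https://example.com/b", "questions": ["q1"], "responses": ["r1"]},
]

serialized = json.dumps(entries)   # one string holding the whole dataset
restored = json.loads(serialized)  # nested lists and dicts come back unchanged
print(len(restored[0]["questions"]), len(restored[1]["questions"]))
```

Storing the same data in a relational table would need either a separate table for questions and responses or some ad hoc encoding of the lists into a text column.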

In [None]:

with open('failory_data.json', 'w') as fp:
    json.dump(database_dict_list, fp)

#### 3b. Plot a world map showing the locations of all startups

This is a bit of data visualization that is not essential for the classification analysis.
www.failory.com has interviews with startup founders from a variety of countries. To visualize this better, let's show a world map.

In [None]:
googlemaps_api_key = "not a valid key"

In [None]:
country_list = [entry['failory_tags'][0] for entry in database_dict_list if 'failory_tags' in entry]

In [None]:
print(set(country_list))

In [None]:
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="startup-analysis")

#get coordinates for each country
locations = []
scales = []
for c in list(set(country_list)):
    
    location = geolocator.geocode(c)
    if location is None:
        #geocoder found no match: skip so that locations and scales stay aligned
        continue
    scales.append(int(np.sum(np.array(country_list)==c)))
    entry = {}
    entry['latitude'] = location.latitude
    entry['longitude'] = location.longitude
    locations.append(entry)
locations = pd.DataFrame(locations)

In [None]:
scales_to_plot = [int(np.floor(1.5*np.sqrt(s) + 0.5)) for s in scales]

In [None]:
import gmaps
gmaps.configure(api_key=googlemaps_api_key)
coordinates = (30, 0)
fig = gmaps.figure(center=coordinates, zoom_level=2, layout={'width': '1000px', 'height': '600px'})


startup_layer = gmaps.symbol_layer(
    locations, fill_color='blue', stroke_color='blue', scale = scales_to_plot
)
fig.add_layer(startup_layer)
fig

In [None]:
#display a previously saved image
#this can be useful if googlemaps has API key issues
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
plt.figure(figsize=(20,10))
img=mpimg.imread('map.png')
imgplot = plt.imshow(img)
plt.axis('off')
plt.show()

#### 3c. Load failory data from json, and perform classification analysis

In part 3a we scraped www.failory.com to gather a dataset of interviews with startup founders. We had to do some processing to convert raw html to usable text (also in part 3a). Now the text data are stored in a .json file and we can more-or-less directly input these to the LSTM classifier which we trained in part 2.

In [None]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

In [None]:
#load the file that was saved at the end of part 3a
with open('failory_data.json', 'r') as fp:
    data = json.load(fp)

In [None]:
import numpy as np
#determine which tags failory has used
failure_reason_tags = []
area_tags = []

for d in data:
    
    if('failure' in d['tags'].split()):
        failure_reason_tags = failure_reason_tags + d['failory_tags'][2:3]
        
failure_reason_tags = np.array(failure_reason_tags)

In [None]:
#failure_reason_tags

unique_failure_reasons = np.unique(failure_reason_tags)
counts = np.array([np.sum(failure_reason_tags==r) for r in unique_failure_reasons])
order = np.argsort(counts)
counts = counts[order]
labels = unique_failure_reasons[order]

In [None]:
vals = np.arange(len(counts))
plt.barh(vals,counts)
plt.yticks(vals, [lab + " " for lab in labels])
plt.title('Failure reasons, according to Failory\n')
plt.xlabel('Number of cases')
plt.tight_layout()
plt.show()