### Detecting descriptions of failure in text

The goal of this project is to build a system that detects passages of text describing instances of failure, such as the failure of a project, a piece of equipment, or a company. The problem resembles sentiment analysis, so I borrow some approaches from that field.

In [None]:
googlemaps_api_key = "not a valid key"

In [5]:
import os
import sqlite3
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 1. Assemble training data

The first step is to assemble a set of labeled text data for training the algorithm. I plan to use an algorithm that takes single sentences or parts of sentences as input, and returns an estimated probability that the sentence describes an instance of failure. Therefore, I need a training dataset consisting of sentences with binary labels. I refer to these sentences as positive cases if they describe failure, and negative cases if they do not.



To acquire the positive cases (sentences describing failure), I manually extracted sentences from a variety of texts. These included descriptions of failed construction projects, failed software engineering projects, failed charitable initiatives, failed startups, and other instances of failure. The websites included Medium, Quora, calleam.com and several others. Using multiple sources is crucial, because it helps to prevent the algorithm from learning any spurious associations between the language style of a sentence and its failure-related status. A full list of sources is given in the file of positive cases.

In [51]:
#load positive cases (sentences describing an instance of failure)
#these sentences are accumulated as keys in a python dict that is written to a file as raw code
#(so it is easy to modify directly)

from training_positive_cases import training_positive_cases
training_positive_cases = list(training_positive_cases.keys())
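
For orientation, here is a minimal sketch of how such a file might be laid out. The real file is not shown in this notebook, so the structure is an assumption: each key is a sentence and each value names its source. The two sentences come from the sample printed in this notebook; the source labels are placeholders.

```python
# Hypothetical layout of training_positive_cases.py (the real file is not shown here).
# Keys are the sentences; values are assumed to record where each sentence came from.
training_positive_cases = {
    "Ignoring users is a tried and true way to fail": "placeholder source A",
    "Despite significant technical problems with the prototype": "placeholder source B",
}

# dict keys are unique, so adding the same sentence again cannot create a duplicate
training_positive_cases["Ignoring users is a tried and true way to fail"] = "placeholder source A"

positive_sentences = list(training_positive_cases.keys())
print(len(positive_sentences))
```

Using the sentences as dict keys gives deduplication for free, which is handy when sentences are collected by hand from overlapping sources.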

In [52]:
#print a few cases at random
_=[print(s + "\n") for s in np.array(training_positive_cases)[np.random.permutation(len(training_positive_cases))[0:5]]]

Obamacare Website Programmers Complained About Unrealistic Deadlines

Ignoring users is a tried and true way to fail

the President had to admit that the performance of the system was below what would be expected

Despite significant technical problems with the prototype

they never spend any money promoting it and it goes unused and is left to die



The negative cases are sentences or parts of sentences that do not describe failure, so they belong to a much larger and more diverse set. They need to resemble the positive cases in style, general language use, and non-failure-related vocabulary, because any systematic difference could lead the algorithm to learn spurious associations. To obtain the negative cases, I downloaded the text of multiple Wikipedia articles on specific software and other projects that are not known for failure, and used all sentences from the main body of these articles as negative cases.

In [29]:
from training_negative_urls import training_negative_urls

#download the full text from the specified URLs
#and save it to sqlite database
#reasons: 
#         1) avoid downloading multiple times 
#         2) now have a working snapshot, will not be affected by future wikipedia edits
conn = sqlite3.connect('negative_raw_html.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Negative (url TEXT, text TEXT) ')

def insert_negative_text(cur, url, text):
    cur.execute('SELECT text FROM Negative WHERE url = ?', (url,))
    #? -> avoid SQL injection
    row = cur.fetchone()
    if(row is None):
        cur.execute('''INSERT INTO Negative (url, text)
            VALUES (?, ?)''', (url, text))
        return True
    else:
        return False
    
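for url in training_negative_urls.keys():
    cur.execute('SELECT 1 FROM Negative WHERE url = ?', (url,))
    if(cur.fetchone() is not None):
        #already stored: skip, so that no page is downloaded twice
        continue
    print('\ndownloading '+url+'\n')
    article = requests.get(url)
    article.raise_for_status()  #do not store an error page as training text
    time.sleep(1)  #pause between requests
    insert_negative_text(cur, url, article.text)
    conn.commit()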


downloading https://en.wikipedia.org/wiki/Linux


downloading https://en.wikipedia.org/wiki/Triborough_Bridge


downloading https://en.wikipedia.org/wiki/Database


downloading https://en.wikipedia.org/wiki/Lean_startup


downloading https://en.wikipedia.org/wiki/Business_model


downloading https://en.wikipedia.org/wiki/Pinterest


downloading https://en.wikipedia.org/wiki/Twitter


downloading https://en.wikipedia.org/wiki/Application_software


downloading https://en.wikipedia.org/wiki/Web_search_engine


downloading https://en.wikipedia.org/wiki/Software
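
An alternative to the explicit `SELECT` check in `insert_negative_text` is to let SQLite enforce uniqueness itself. The sketch below is not what the notebook does: it assumes a variant schema in which `url` is the primary key, and the function name `insert_negative_text_once` and the placeholder texts are made up for illustration. It uses an in-memory database so that it leaves no file behind.

```python
import sqlite3

# in-memory database so the sketch leaves no file behind
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
# url as PRIMARY KEY: SQLite itself rejects a second row for the same url
demo_cur.execute('CREATE TABLE IF NOT EXISTS Negative (url TEXT PRIMARY KEY, text TEXT)')

def insert_negative_text_once(cur, url, text):
    # OR IGNORE silently skips the insert when the url is already present
    cur.execute('INSERT OR IGNORE INTO Negative (url, text) VALUES (?, ?)', (url, text))

demo_url = 'https://en.wikipedia.org/wiki/Linux'
insert_negative_text_once(demo_cur, demo_url, 'first snapshot (placeholder text)')
insert_negative_text_once(demo_cur, demo_url, 'later download (placeholder text)')
demo_conn.commit()

demo_cur.execute('SELECT COUNT(*), MIN(text) FROM Negative')
n_rows, stored_text = demo_cur.fetchone()
print(n_rows, stored_text)
```

The first snapshot is kept and the later one is ignored, which matches the goal of working from a fixed snapshot that future Wikipedia edits cannot change.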



In [53]:
import re

def clean_wikipedia_text(article_text):

    soup = BeautifulSoup(article_text, "html.parser")
    text = ""
    tags = soup.findAll("p")
    for t in tags:
        text = text + t.text
    #only take contents of <p> tags
    #this ensures we only take the main text while discarding extraneous material 
    #(references etc.)
        
    sentences = re.split(r'\. |\.\n|\.\[\d+\]', text)
    #split on: period followed by space | period followed by line break | period followed by citation 

    sentences = [s.split() for s in sentences]
    #split on whitespace
    sentences = [s for s in sentences if len(s) > 2 and len(s) < 50]
    #remove unusually short or long sentences

    remove_citations = lambda s : [t for t in s if "[" not in t and "]" not in t]
    sentences = [remove_citations(s) for s in sentences]
    
    remove_listens = lambda s : [t for t in s if "/" not in t and not t=="(listen)"]
    sentences = [remove_listens(s) for s in sentences]
    
    #remove other extraneous punctuation?
    
    sentences = [" ".join(s) for s in sentences]
    
    return sentences
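
To see what the splitting and length filtering do without downloading anything, here is a small sketch that applies the same regular expression and the same token limits to a made-up string. The function name `split_into_sentences` and the sample text are illustrative only; the HTML parsing and citation-removal steps are left out.

```python
import re

def split_into_sentences(text, min_tokens=3, max_tokens=49):
    # same three boundaries as above: ". ", ".\n", and "." followed by a citation like [12]
    pieces = re.split(r'\. |\.\n|\.\[\d+\]', text)
    token_lists = [p.split() for p in pieces]
    # keep only sentences with a plausible number of tokens
    kept = [t for t in token_lists if min_tokens <= len(t) <= max_tokens]
    return [" ".join(t) for t in kept]

sample = ("Linux is a kernel.[12] It was released in 1991. "
          "Many projects contribute.\nThe end")
result = split_into_sentences(sample)
print(result)
# → ['Linux is a kernel', 'It was released in 1991', 'Many projects contribute']
```

The trailing fragment "The end" has only two tokens, so the length filter drops it.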


In [39]:
#load texts from database into list of dicts (database_dict_list)
database_dict_list = []
sqlstr = 'SELECT url, text FROM Negative'
for row in cur.execute(sqlstr):
    entry = {}
    entry['url'] = row[0]
    entry['text'] = row[1]
    database_dict_list.append(entry)

In [54]:
texts = [d['text'] for d in database_dict_list]
training_negative_cases = []
for t in texts:
    training_negative_cases.extend(clean_wikipedia_text(t))

In [55]:
len(training_negative_cases)

2216

In [59]:
#print a few cases at random
_=[print(s + "\n") for s in np.array(training_negative_cases)[np.random.permutation(len(training_negative_cases))[0:5]]]

For E-ZPass users, sensors detect their transponders wirelessly

The Korg OASYS, the Korg KRONOS, the Yamaha Motif XF music Yamaha Yamaha synthesizers, Yamaha Motif-Rack XS tone generator module, and Roland RD-700GX digital piano also run Linux

Separately, the Board of Estimate voted to create an authority to impose toll charges on both crossings

Larry Ellison's Oracle Database (or more simply, Oracle) started from a different chain, based on IBM's papers on System R

Many other open-source software projects contribute to Linux systems
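
With both lists in hand, the training set described at the start of this section (sentences with binary labels) could be assembled roughly as follows. This is a sketch, not the notebook's actual code: the function name `build_labeled_dataset`, the fixed seed, and the tiny stand-in lists are assumptions, with the example sentences taken from the samples printed above.

```python
import random

def build_labeled_dataset(positive_cases, negative_cases, seed=0):
    # label 1 = describes failure, label 0 = does not
    labeled = [(s, 1) for s in positive_cases] + [(s, 0) for s in negative_cases]
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    rng.shuffle(labeled)
    return labeled

# two tiny stand-in lists, using sentences printed earlier in this notebook
demo_positive = ["Ignoring users is a tried and true way to fail"]
demo_negative = ["Many other open-source software projects contribute to Linux systems",
                 "For E-ZPass users, sensors detect their transponders wirelessly"]
dataset = build_labeled_dataset(demo_positive, demo_negative)
n_positive = sum(label for _, label in dataset)
print(len(dataset), n_positive)
```

Shuffling matters because the two lists are concatenated: without it, any split into training and validation sets taken by position would contain mostly one class.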



### 2. Train the classifier

I want to develop a classifier which takes sentences or parts of sentences as input and returns an estimated probability that the input describes an instance of failure.

In [None]:
import numpy as np
import pandas as pd

In [None]:
glove_pretrained_embeddings_path = '/users/cstoneki/Documents/analysis/general_resources/glove.6B/glove.6B.300d.txt'
#glove_pretrained_embeddings_path = '/mnt/glove.6B.50d.txt'

In [None]:
#load GloVe data
#this can take a bit of time, especially for the higher-dimensional datasets (such as 300d)
#so report progress

import numpy as np
with open(glove_pretrained_embeddings_path, encoding='utf-8') as f:
    n_entries = 0
    d = 0
    
    for k, line in enumerate(f.readlines()):
        n_entries = k + 1
        #the first entry is "the", it is well formatted
        if(k==0): d = len(line.split()) - 1
    glove_data = np.zeros([d, n_entries])
    words = []
    #store each entry (word) as column
    print('Found %d words in glove dataset'%n_entries)
    f.seek(0)
    for k, line in enumerate(f.readlines()):
        lst = line.split()
        words.append(lst[0])
        vals = np.array([float(s) for s in lst[1:]])
        glove_data[:,k] = vals
        if(k % 10000==0):
            print('Words loaded : %06d '%k)
    print('Finished loading data')
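
The loader above reads the file twice in order to size the array up front. If that becomes a bottleneck, a single-pass variant is possible. The sketch below is one such alternative, not the notebook's method: it stores vectors in a plain dict rather than a NumPy array, and it runs on a tiny made-up file held in memory. The function name `load_glove_vectors` is hypothetical.

```python
import io

def load_glove_vectors(lines):
    """Parse GloVe text lines ("word v1 v2 ... vd") into a dict in a single pass."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# tiny made-up file with 3-dimensional vectors; real GloVe 6B vectors have 50 to 300 dimensions
fake_file = io.StringIO("the 0.1 0.2 0.3\nfailure -0.5 0.0 0.25\n")
vectors = load_glove_vectors(fake_file)
print(len(vectors), vectors["failure"])
```

Iterating over the file object directly, instead of calling `readlines()`, also avoids holding every line of the file in memory at once.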
        

In [None]:
glove_df = pd.DataFrame(glove_data, columns=words)

Group data into sentences, not single words

In [None]:
#compute word embeddings from a piece of text
#keep the grouping of words into sentences

#return: sentence_embeddings = list of embedded sentences
#           if output == "index":
#           each embedded sentence is a N x 1 vector of indices into embedding matrix
#           where N = number of tokens in the sentence
#           if output == "full_embedding"
#           each embedded sentence is a M x N matrix
#           where M is embedding space dimension, N is number of tokens in the sentence
#        valid_sentences = list of raw text (string) of embedded sentences
def embed_grouped_by_sentence(glove_df, text, output="index"):
    sentences = [s for s in text.split(".") if len(s.split()) > 1]
    sentence_embeddings = []
    valid_sentences = []
    for s in sentences:
        single_word_embeddings = []
        words = s.split()
        for w in words:
            if(w in glove_df.columns):
                if(output=="full_embedding"):
                    single_word_embeddings.append(glove_df[w].values[:,np.newaxis])
                elif(output=="index"):
                    single_word_embeddings.append(glove_df.columns.get_loc(w))
        if(len(single_word_embeddings) > 1):
            if(output=="full_embedding"):
                sentence_embeddings.append(np.concatenate(single_word_embeddings, axis=1))
            elif(output=="index"):
                sentence_embeddings.append(np.array(single_word_embeddings))
            valid_sentences.append(s)
    return (sentence_embeddings, valid_sentences)
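
One thing to keep in mind with the lookup above: the GloVe 6B vocabulary is lower-cased, and tokens produced by `split()` keep any attached punctuation, so tokens such as "The" or "failed," find no match and are silently dropped. The sketch below illustrates the effect with a toy vocabulary held in a plain dict. The names `sentence_to_indices` and `demo_words` are made up, and the normalization shown is one possible choice rather than part of the original function.

```python
import string

def sentence_to_indices(sentence, word_to_index):
    indices = []
    for token in sentence.split():
        # lower-case and strip surrounding punctuation before the lookup
        word = token.lower().strip(string.punctuation)
        if word in word_to_index:
            indices.append(word_to_index[word])
    return indices

# toy vocabulary; in the notebook this mapping would come from the GloVe word list
demo_words = ["the", "project", "failed", "excel"]
word_to_index = {w: i for i, w in enumerate(demo_words)}

# lookup without normalization: only exact matches survive
raw = [word_to_index[t] for t in "The project failed,".split() if t in word_to_index]
normalized = sentence_to_indices("The project failed,", word_to_index)
print(raw, normalized)
# → [1] [0, 1, 2]
```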

In [None]:
glove_df.head(10)

In [None]:
glove_df.columns.get_loc('the')

In [None]:
glove_df.values[2,0]

#### 1. Scrape data, store in SQL database

In [None]:
import sys
!{sys.executable}  -m pip install keras
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install requests

In [40]:
main_urls = {'failory failure':'https://www.failory.com/interview-failure',
             'failory success':'https://www.failory.com/interview-success'}

In [None]:
def insert_report(cur,  url, text, tags):
    cur.execute('SELECT text FROM Startups WHERE url = ?', (url,))
    #? -> avoid SQL injection
    row = cur.fetchone()
    if(row is None):
        cur.execute('''INSERT INTO Startups (url, text, tags)
            VALUES (?, ?, ?)''', (url, text, tags))
        return True
    else:
        return False

In [None]:
conn = sqlite3.connect('startups_03.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Startups (url TEXT, text TEXT, tags TEXT) ')

In [None]:
for tags, url in main_urls.items():
    print('\ncollecting articles from '+url+'\n')
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    sub_urls = []
    for a in soup.find_all('a'):
        #a.get returns None when the attribute is missing, so no try/except is needed
        classes = a.get("class") or []
        if(classes and classes[0] == 'card-for-interviews-title' and a.get("href")):
            sub_urls.append(a["href"])
    for sub_url in sub_urls:
        full_url = 'https://www.failory.com' + sub_url
        article  = requests.get(full_url)
        print('downloaded '+full_url)
        time.sleep(1)
        
        insert_report(cur, full_url, article.text, tags)
        conn.commit()
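
The loop above builds each article URL by string concatenation, which assumes every `href` is a site-relative path. A slightly more defensive option is `urllib.parse.urljoin`, sketched below. Both hrefs are made-up examples, not links taken from the site.

```python
from urllib.parse import urljoin

base_url = 'https://www.failory.com'
# both hrefs below are made-up examples, not links taken from the site
relative_href = '/interview/example-startup'
absolute_href = 'https://example.com/elsewhere'

joined_relative = urljoin(base_url, relative_href)
joined_absolute = urljoin(base_url, absolute_href)
print(joined_relative)
print(joined_absolute)
```

A relative path is resolved against the base, while an href that is already absolute is returned unchanged instead of being glued onto the base.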

In [None]:
#check contents of database
#by retrieving small text fields, not full text
sqlstr = 'SELECT url, tags FROM Startups'
database_dict = {}
for row in cur.execute(sqlstr):
    database_dict[str(row[0])] = [row[1]]

In [None]:
#print text of first article
sqlstr = 'SELECT url, text, tags FROM Startups'
for k, row in enumerate(cur.execute(sqlstr)):
    if(k > 0): break
    soup = BeautifulSoup(row[1], "html.parser")
    print(soup)

#### 2. Process HTML data

Now we need to figure out how to extract the text of the article from the mess of HTML. We need to strip out all of the ads and repeated quotes. One key part will be extracting the interviewer's questions, and the response that follows.

In [None]:

#failory has tags at the start of each article
#these are: location, area, failure cause #1, failure cause #2
#these are obviously extremely useful, so we want to extract them
#try to find location
def get_failory_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    article_tags = []
    for div in soup.find_all('div'):
        #div.get returns None when the attribute is missing, so no try/except is needed
        classes = div.get("class") or []
        if(classes and classes[0] == "secondary-tag-interview" and div.text):
            article_tags.append(div.text)
    return article_tags

In [None]:
database_dict_list = []
sqlstr = 'SELECT url, text, tags FROM Startups'
for row in cur.execute(sqlstr):
    entry = {}
    entry['url'] = row[0]
    entry['text'] = row[1]
    entry['tags'] = row[2]
    database_dict_list.append(entry)



In [None]:
for entry in database_dict_list:
    if('failory' in entry['tags']):
        #the following will only work for failory articles
        #so check because we may have non-failory articles in database later
        entry['failory_tags'] = get_failory_tags(entry['text'])

In [None]:
database_dict_list[3]

In [None]:
country_list = [entry['failory_tags'][0] for entry in database_dict_list if 'failory_tags' in entry]

In [None]:
print(set(country_list))

In [None]:
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="startup-analysis")

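#get coordinates for each country
locations = []
scales = []
for c in sorted(set(country_list)):
    location = geolocator.geocode(c)
    time.sleep(1)  #Nominatim's usage policy allows at most one request per second
    if(location is None):
        #geocoder found no match: skip, so that locations and scales stay aligned
        continue
    scales.append(country_list.count(c))
    entry = {}
    entry['latitude'] = location.latitude
    entry['longitude'] = location.longitude
    locations.append(entry)
locations = pd.DataFrame(locations)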

In [None]:
scales_to_plot = [int(np.floor(1.5*np.sqrt(s) + 0.5)) for s in scales]
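
The formula rounds 1.5 times the square root of the count to the nearest integer. Assuming the marker scale behaves like a linear size, the marker area then grows roughly in proportion to the number of startups. A quick worked example with made-up counts (the helper name `marker_scale` is illustrative):

```python
import math

def marker_scale(count):
    # same formula as above: round 1.5 * sqrt(count) to the nearest integer
    return int(math.floor(1.5 * math.sqrt(count) + 0.5))

demo_counts = [1, 4, 16, 100]
demo_scales = [marker_scale(c) for c in demo_counts]
print(demo_scales)
# → [2, 3, 6, 15]
```

A country with 100 entries therefore gets a marker about 7 times wider than a country with one entry, rather than 100 times wider.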

In [None]:
import gmaps
gmaps.configure(api_key=googlemaps_api_key)
coordinates = (30, 0)
fig = gmaps.figure(center=coordinates, zoom_level=2, layout={'width': '1000px', 'height': '600px'})


startup_layer = gmaps.symbol_layer(
    locations, fill_color='blue', stroke_color='blue', scale = scales_to_plot
)
fig.add_layer(startup_layer)
fig

In [None]:
#display a previously saved image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
plt.figure(figsize=(20,10))
img=mpimg.imread('map.png')
imgplot = plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
#now try to extract text of interest
def get_questions_responses(text):

    soup = BeautifulSoup(text, "html.parser")
    tags = soup.findAll(['h4', 'p'])
    #keep only tags that carry no class attribute
    tags_clean = [t for t in tags if not t.get("class")]
        
    questions = []
    responses = []
    current_text = []
    current_question = ""
    for i in range(len(tags_clean)):
        if(tags_clean[i].name=='h4'):
            if(current_question):
                questions.append(current_question)
                responses.append(" ".join(current_text))
            current_question = tags_clean[i].text
            current_text = []
        else:
            current_text.append(tags_clean[i].text)
            
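    #the loop only stores a question when the next <h4> appears,
    #so append the final question/response pair here
    if(current_question):
        questions.append(current_question)
        responses.append(" ".join(current_text))

    return (questions, responses)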
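
The pairing logic can be checked without any HTML by feeding it a list of `(tag_name, text)` tuples. The sketch below does that on a made-up interview fragment; the function name `pair_questions_with_responses` and the data are illustrative. Note the flush after the loop: the last question has no following `h4` to trigger the append, so without it the final pair would be lost.

```python
def pair_questions_with_responses(elements):
    """elements: list of (tag_name, text) tuples in document order."""
    questions, responses = [], []
    current_question, current_text = "", []
    for name, text in elements:
        if name == 'h4':
            if current_question:
                questions.append(current_question)
                responses.append(" ".join(current_text))
            current_question, current_text = text, []
        else:
            current_text.append(text)
    # flush the last pair, which has no following <h4> to trigger the append
    if current_question:
        questions.append(current_question)
        responses.append(" ".join(current_text))
    return questions, responses

# made-up interview fragment
demo_elements = [('p', 'Intro text before any question.'),
                 ('h4', 'What went wrong?'),
                 ('p', 'We ran out of money.'),
                 ('p', 'Nobody used the product.'),
                 ('h4', 'What did you learn?'),
                 ('p', 'Talk to users earlier.')]
demo_q, demo_r = pair_questions_with_responses(demo_elements)
print(demo_q)
print(demo_r)
```

Paragraphs that appear before the first question are discarded, since they belong to no question.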
        



In [None]:
for entry in database_dict_list:
    q,r = get_questions_responses(entry['text'])
    entry['questions'] = q
    entry['responses'] = r

In [None]:
database_dict_list[0]['questions']

In [None]:
database_dict_list[-1]['questions']

We've spent a bit of computational time building database_dict_list, so save it as JSON. JSON is preferable to SQL here (a) because it handles fields that are lists of variable length much more easily, and (b) because we are not growing the data entry by entry but dumping a single finished dataset.

In [None]:
import json

with open('data.json', 'w') as fp:
    json.dump(database_dict_list, fp)
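
A quick round-trip check shows why JSON suits this data. The entries below are made up, but they mirror the fields used in this notebook, and their lists differ in length, which would need an extra table or a serialized column in SQL.

```python
import json

# made-up entries mirroring the fields used above; the lists differ in length
demo_entries = [
    {'url': 'https://www.failory.com/interview/example-a', 'tags': 'failory failure',
     'questions': ['Q1', 'Q2', 'Q3'], 'responses': ['R1', 'R2', 'R3']},
    {'url': 'https://www.failory.com/interview/example-b', 'tags': 'failory success',
     'questions': ['Q1'], 'responses': ['R1']},
]

serialized = json.dumps(demo_entries)
restored = json.loads(serialized)
print(restored == demo_entries)
```

The round trip only preserves the data exactly when every value is a plain string, number, list, or dict, as is the case here.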

## Start here if loading from json

In [None]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import json
with open('data.json', 'r') as fp:
    data = json.load(fp)

In [None]:
import numpy as np
#determine which tags failory has used
failure_reason_tags = []
area_tags = []

for d in data:
    
    if('failure' in d['tags'].split()):
        failure_reason_tags = failure_reason_tags + d['failory_tags'][2:3]
        
failure_reason_tags = np.array(failure_reason_tags)

In [None]:
#failure_reason_tags

unique_failure_reasons = np.unique(failure_reason_tags)
counts = np.array([np.sum(failure_reason_tags==r) for r in unique_failure_reasons])
order = np.argsort(counts)
counts = counts[order]
labels = unique_failure_reasons[order]

In [None]:
vals = np.arange(len(counts))
plt.barh(vals,counts)
plt.yticks(vals, [lab + " " for lab in labels])
plt.title('Failure reasons, according to Failory\n')
plt.xlabel('Number of cases')
plt.tight_layout()
plt.show()
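
The counting and sorting steps above can also be done with `collections.Counter` from the standard library. This is an equivalent alternative, not what the notebook uses, and the tag values below are made up for illustration.

```python
from collections import Counter

# made-up tag values for illustration only
demo_tags = ['Bad Marketing', 'Lack of Funds', 'Bad Marketing', 'No Market Need', 'Bad Marketing']
tag_counts = Counter(demo_tags)
# most_common() returns (tag, count) pairs sorted by descending count
ordered = tag_counts.most_common()
print(ordered[0])
```

For a horizontal bar chart that shows the largest bar at the top, the list would be reversed before plotting, since `barh` draws the first entry at the bottom.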

### Word Embeddings

Now start working with word embeddings, using the pretrained GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (specifically, the 6B dataset).

In [None]:
contrast_text = """
In February 2015, the company wrote that around 10,000 new daily active users were signing up each week,
and had more than 135,000 paying customers spread across 60,000 teams.
Slack offers many IRC-like features, including persistent chat rooms (channels) organized by topic, private groups, and direct messaging.
Zulip was originally developed as proprietary software by a startup called Zulip, Inc., based in Cambridge, Massachusetts.
In 2014, while in private beta, the company was acquired by Dropbox.
In September 2015, Dropbox open-sourced it under the Apache License.
Today, it is a leading open source alternative to Slack or HipChat, with over 29,000 commits contributed by 450 people.
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS.
It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications.
It has been a very widely applied spreadsheet for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the industry standard for spreadsheets.
Excel forms part of the Microsoft Office suite of software.
Microsoft Excel has the basic features of all spreadsheets, using a grid of cells arranged in numbered rows and letter-named columns to organize data manipulations like arithmetic operations.
It has a battery of supplied functions to answer statistical, engineering and financial needs.
In addition, it can display data as line graphs, histograms and charts, and with a very limited three-dimensional graphical display.
It allows sectioning of data to view its dependencies on various factors for different perspectives (using pivot tables and the scenario manager).
It has a programming aspect, Visual Basic for Applications, allowing the user to employ a wide variety of numerical methods, for example, for solving differential equations of mathematical physics, and then reporting the results back to the spreadsheet.
It also has a variety of interactive features allowing user interfaces that can completely hide the spreadsheet from the user, so the spreadsheet presents itself as a so-called application,
or decision support system (DSS), via a custom-designed user interface, for example, a stock analyzer, or in general, as a design tool that asks the user questions and provides answers and reports.
In a more elaborate realization, an Excel application can automatically poll external databases and measuring instruments using an update schedule, analyze the results, make a Word report or PowerPoint slide show, and e-mail these presentations on a regular basis to a list of participants.
Excel was not designed to be used as a database.
Microsoft allows for a number of optional command-line switches to control the manner in which Excel starts.
The Windows version of Excel supports programming through Microsoft's Visual Basic for Applications (VBA), which is a dialect of Visual Basic. Programming with VBA allows spreadsheet manipulation that is awkward or impossible with standard spreadsheet techniques.
Programmers may write code directly using the Visual Basic Editor (VBE), which includes a window for writing code, debugging code, and code module organization environment. The user can implement numerical methods as well as automating tasks such as formatting or data organization in VBA and guide the calculation using any desired intermediate results reported back to the spreadsheet.

VBA was removed from Mac Excel 2008, as the developers did not believe that a timely release would allow porting the VBA engine natively to Mac OS X.
VBA was restored in the next version, Mac Excel 2011, although the build lacks support for ActiveX objects, impacting some high level developer tools.
A common and easy way to generate VBA code is by using the Macro Recorder.
The Macro Recorder records actions of the user and generates VBA code in the form of a macro.
These actions can then be repeated automatically by running the macro.
The macros can also be linked to different trigger types like keyboard shortcuts, a command button or a graphic.
The actions in the macro can be executed from these trigger types or from the generic toolbar options.
The VBA code of the macro can also be edited in the VBE.
Certain features such as loop functions and screen prompt by their own properties, and some graphical display items, cannot be recorded but must be entered into the VBA module directly by the programmer.
Advanced users can employ user prompts to create an interactive program, or react to events such as sheets being loaded or changed.

Macro Recorded code may not be compatible with Excel versions.
Some code that is used in Excel 2010 cannot be used in Excel 2003.
Making a Macro that changes the cell colours and making changes to other aspects of cells may not be backward compatible.

VBA code interacts with the spreadsheet through the Excel Object Model, a vocabulary identifying spreadsheet objects, and a set of supplied functions or methods that enable reading and writing to the spreadsheet and interaction with its users (for example, through custom toolbars or command bars and message boxes).
User-created VBA subroutines execute these actions and operate like macros generated using the macro recorder, but are more flexible and efficient.
From its first version Excel supported end user programming of macros (automation of repetitive tasks) and user defined functions (extension of Excel's built-in function library).
In early versions of Excel these programs were written in a macro language whose statements had formula syntax and resided in the cells of special purpose macro sheets (stored with file extension .XLM in Windows.)
XLM was the default macro language for Excel through Excel 4.0. Beginning with version 5.0 Excel recorded macros in VBA by default but with version 5.0 XLM recording was still allowed as an option.
After version 5.0 that option was discontinued.
All versions of Excel, including Excel 2010 are capable of running an XLM macro, though Microsoft discourages their use.
Excel supports charts, graphs, or histograms generated from specified groups of cells. The generated graphic component can either be embedded within the current sheet, or added as a separate object.
These displays are dynamically updated if the content of cells change. For example, suppose that the important design requirements are displayed visually; then, in response to a user's change in trial values for parameters, the curves describing the design change shape, and their points of intersection shift, assisting the selection of the best design.
Microsoft originally marketed a spreadsheet program called Multiplan in 1982. Multiplan became very popular on CP/M systems, but on MS-DOS systems it lost popularity to Lotus 1-2-3. Microsoft released the first version of Excel for the Macintosh on September 30, 1985, and the first Windows version was 2.05 (to synchronize with the Macintosh version 2.2) in November 1987. Lotus was slow to bring 1-2-3 to Windows and by the early 1990s Excel had started to outsell 1-2-3 and helped Microsoft achieve its position as a leading PC software developer. This accomplishment solidified Microsoft as a valid competitor and showed its future of developing GUI software. Microsoft maintained its advantage with regular new releases, every two years or so.
Instagram (also known as IG or Insta) is a photo and video-sharing social networking service owned by Facebook, Inc.
It was created by Kevin Systrom and Mike Krieger, and launched in October 2010 exclusively on iOS.
A version for Android devices was released a year and half later, in April 2012, followed by a feature-limited website interface in November 2012, and apps for Windows 10 Mobile and Windows 10 in April 2016 and October 2016 respectively.
The app allows users to upload photos and videos to the service, which can be edited with various filters, and organized with tags and location information.
An account's posts can be shared publicly or with pre-approved followers. Users can browse other users' content by tags and locations, and view trending content. Users can like photos, and follow other users to add their content to a feed.

The service was originally distinguished by only allowing content to be framed in a square (1:1) aspect ratio, but these restrictions were eased in 2015. The service also added messaging features, the ability to include multiple images or videos in a single post, as well as Stories—similar to its main competitor Snapchat—which allows users to post photos and videos to a sequential feed, with each post accessible by others for 24 hours each. As of January 2019, the Stories feature is being used by 500 million users daily.

After its launch in 2010, Instagram rapidly gained popularity, with one million registered users in two months, 10 million in a year, and 1 billion as of May 2019. In April 2012, Facebook acquired the service for approximately US$1 billion in cash and stock. As of October 2015, over 40 billion photos had been uploaded to the service. Although praised for its influence, Instagram has been the subject of criticism, most notably for policy and interface changes, allegations of censorship, and illegal or improper content uploaded by users.

As of January 14, 2019, the most liked photo on Instagram is a picture of an egg, posted by the account @world_record_egg, created with the sole purpose of surpassing the previous record of 18 million likes on a Kylie Jenner post. The picture currently has over 53 million likes.

Instagram began development in San Francisco, when Kevin Systrom and Mike Krieger chose to focus their multi-featured HTML5 check-in project, Burbn, on mobile photography. As Krieger reasoned, Burbn became too similar to Foursquare, and both realized that it had gone too far. Burbn was then pivoted to become more focused on photo-sharing. The word Instagram is a portmanteau of instant camera and telegram.

In December 2013, Instagram announced Instagram Direct, a feature that lets users interact through private messaging. Users who follow each other can send private messages with photos and videos, in contrast to the public-only requirement that was previously in place. When users receive a private message from someone they don't follow, the message is marked as pending and the user must accept to see it. Users can send a photo to a maximum of 15 people. The feature received a major update in September 2015, adding conversation threading and making it possible for users to share locations, hashtag pages, and profiles through private messages directly from the news feed. Additionally, users can now reply to private messages with text, emoji or by clicking on a heart icon. A camera inside Direct lets users take a photo and send it to the recipient without leaving the conversation. A new update in November 2016 let users make their private messages disappear after being viewed by the recipient, with the sender receiving a notification if the recipient takes a screenshot. In April 2017, Instagram redesigned Direct to combine all private messages, both permanent and ephemeral, into the same message threads. In May, Instagram made it possible to send website links in messages, and also added support for sending photos in their original portrait or landscape orientation without cropping.

Hudson Yards is a real estate development in the Chelsea and Hudson Yards neighborhoods of Manhattan, New York City. It is the largest private real estate development in the United States by area. Upon completion, 13 of the 16 planned structures on the West Side of Midtown South would sit on a platform built over the West Side Yard, a storage yard for Long Island Rail Road trains. The first of its two phases, opened in 2019, comprises a public green space and eight structures that contain residences, a hotel, office buildings, a mall, and a cultural facility. The second phase, on which construction has not started yet, will include residential space, an office building, and a school.

A suspension bridge is a type of bridge in which the deck (the load-bearing portion) is hung below suspension cables on vertical suspenders. The first modern examples of this type of bridge were built in the early 1800s. Simple suspension bridges, which lack vertical suspenders, have a long history in many mountainous parts of the world.

This type of bridge has cables suspended between towers, plus vertical suspender cables that carry the weight of the deck below, upon which traffic crosses. This arrangement allows the deck to be level or to arc upward for additional clearance. Like other suspension bridge types, this type often is constructed without falsework.

The suspension cables must be anchored at each end of the bridge, since any load applied to the bridge is transformed into a tension in these main cables. The main cables continue beyond the pillars to deck-level supports, and further continue to connections with anchors in the ground. The roadway is supported by vertical suspender cables or rods, called hangers. In some circumstances, the towers may sit on a bluff or canyon edge where the road may proceed directly to the main span, otherwise the bridge will usually have two smaller spans, running between either pair of pillars and the highway, which may be supported by suspender cables or may use a truss bridge to make this connection. In the latter case there will be very little arc in the outboard main cables.
The principles of suspension used on the large scale may also appear in contexts less dramatic than road or rail bridges. Light cable suspension may prove less expensive and seem more elegant for a cycle or footbridge than strong girder supports. An example of this is the Nescio Bridge in the Netherlands.

Where such a bridge spans a gap between two buildings, there is no need to construct special towers, as the buildings can anchor the cables. Cable suspension may also be augmented by the inherent stiffness of a structure that has much in common with a tubular bridge.

US 66 served as a primary route for those who migrated west, especially during the Dust Bowl of the 1930s, and the road supported the economies of the communities through which it passed. People doing business along the route became prosperous due to the growing popularity of the highway, and those same people later fought to keep the highway alive in the face of the growing threat of being bypassed by the new Interstate Highway System.

US 66 underwent many improvements and realignments over its lifetime, but was officially removed from the United States Highway System in 1985 after it had been replaced in its entirety by segments of the Interstate Highway System. Portions of the road that passed through Illinois, Missouri, New Mexico, and Arizona have been communally designated a National Scenic Byway of the name Historic Route 66, returning the name to some maps. Several states have adopted significant bypassed sections of the former US 66 into their state road networks as State Route 66. The corridor is also being redeveloped into U.S. Bicycle Route 66, a part of the United States Bicycle Route System that was developed in the 2010s.

Computer software, or simply software, is a collection of data or computer instructions that tell the computer how to work. This is in contrast to physical hardware, from which the system is built and actually performs the work. In computer science and software engineering, computer software is all information processed by computer systems, programs and data. Computer software includes computer programs, libraries and related non-executable data, such as online documentation or digital media. Computer hardware and software require each other and neither can be realistically used on its own.

At the lowest programming level, executable code consists of machine language instructions supported by an individual processor—typically a central processing unit (CPU) or a graphics processing unit (GPU). A machine language consists of groups of binary values signifying processor instructions that change the state of the computer from its preceding state. For example, an instruction may change the value stored in a particular storage location in the computer—an effect that is not directly observable to the user. An instruction may also invoke one of many input or output operations, for example displaying some text on a computer screen; causing state changes which should be visible to the user. The processor executes the instructions in the order they are provided, unless it is instructed to "jump" to a different instruction, or is interrupted by the operating system. As of 2015, most personal computers, smartphone devices and servers have processors with multiple execution units or multiple processors performing computation together, and computing has become a much more concurrent activity than in the past.

The majority of software is written in high-level programming languages. They are easier and more efficient for programmers because they are closer to natural languages than machine languages. High-level languages are translated into machine language using a compiler or an interpreter or a combination of the two. Software may also be written in a low-level assembly language, which has strong correspondence to the computer's machine language instructions and is translated into machine language using an assembler.

An outline (algorithm) for what would have been the first piece of software was written by Ada Lovelace in the 19th century, for the planned Analytical Engine. She created proofs to show how the engine would calculate Bernoulli Numbers. Because of the proofs and the algorithm, she is considered the first computer programmer.

The first theory about software—prior to creation of computers as we know them today—was proposed by Alan Turing in his 1935 essay On Computable Numbers, with an Application to the Entscheidungsproblem (decision problem).

This eventually led to the creation of the academic fields of computer science and software engineering; Both fields study software and its creation. Computer science is the theoretical study of computer and software (Turing's essay is an example of computer science), whereas software engineering is the application of engineering and development of software.

However, prior to 1946, software was not yet the programs stored in the memory of stored-program digital computers, as we now understand it. The first electronic computing devices were instead rewired in order to "reprogram" them.

In 2000, Fred Shapiro, a librarian at the Yale Law School, published a letter revealing that John Wilder Tukey's 1958 paper The Teaching of Concrete Mathematics contained the earliest known usage of the term "software" found in a search of JSTOR's electronic archives, predating the OED's citation by two years. This led many to credit Tukey with coining the term, particularly in obituaries published that same year, although Tukey never claimed credit for any such coinage. In 1995, Paul Niquette claimed he had originally coined the term in October 1953, although he could not find any documents supporting his claim. The earliest known publication of the term "software" in an engineering context was in August 1953 by Richard R. Carhart, in a Rand Corporation Research Memorandum.

Programming tools are also software in the form of programs or applications that software developers (also known as programmers, coders, hackers or software engineers) use to create, debug, maintain (i.e. improve or fix), or otherwise support software.

Software is written in one or more programming languages; there are many programming languages in existence, and each has at least one implementation, each of which consists of its own set of programming tools. These tools may be relatively self-contained programs such as compilers, debuggers, interpreters, linkers, and text editors, that can be combined together to accomplish a task; or they may form an integrated development environment (IDE), which combines much or all of the functionality of such self-contained tools. IDEs may do this by either invoking the relevant individual tools or by re-implementing their functionality in a new way. An IDE can make it easier to do specific tasks, such as searching in files in a particular project. Many programming language implementations provide the option of using both individual tools or an IDE. 

"""

In [None]:
from training_data_failure_sentences import failure_sentences


In [None]:
failure_sentences

In [None]:
embeddings = []
sentences = []
labels = []
for text in failure_sentences:
    #embed each positive text sentence by sentence; every resulting sentence gets label 1
    e, s = embed_grouped_by_sentence(glove_df, text)
    embeddings = embeddings + e
    sentences = sentences + s
    labels = labels + [1]*len(s)
    
e, s = embed_grouped_by_sentence(glove_df, contrast_text)
embeddings = embeddings + e
sentences = sentences + s
labels = labels + [0]*len(s)
labels = np.array(labels)

In [None]:
#number of negative (0) and positive (1) sentences
print(np.sum(labels == 0))
print(np.sum(labels == 1))

In [None]:
#right-pad each sentence's index vector with zeros up to the length of the longest sentence
max_len = max(len(e) for e in embeddings)
embedding_matrix = np.zeros([len(embeddings), max_len])
for k,e in enumerate(embeddings):
    embedding_matrix[k,0:len(e)] = e
labels = np.array(labels)

In [None]:
import numpy as np
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Embedding
#fix the NumPy seed for reproducibility
np.random.seed(1)

The approach here is to pass Keras the vectors of word indices (each sentence is one vector).
A pretrained embedding layer is used to convert word indices into full word embedding vectors.
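
As a rough illustration of what that lookup does, here is a NumPy-only sketch on a hypothetical three-word vocabulary. The names `toy_vocab`, `toy_embeddings`, `pad_index` and `sentence_to_indices` are made up for this example and are not part of the notebook. It assumes that padding positions point at an extra all-zero row appended after the last word.

```python
import numpy as np

# Hypothetical toy vocabulary; each row is one word's 4-dimensional embedding.
toy_vocab = {"bridge": 0, "collapsed": 1, "project": 2}
toy_embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])

# Append an all-zero row that padding positions will point to.
pad_index = len(toy_vocab)
lookup_table = np.vstack([toy_embeddings, np.zeros((1, toy_embeddings.shape[1]))])

def sentence_to_indices(sentence, vocab, max_len, pad_index):
    """Map known words to indices, skip unknown words, pad to max_len."""
    indices = [vocab[w] for w in sentence.lower().split() if w in vocab]
    indices = indices[:max_len]
    return np.array(indices + [pad_index] * (max_len - len(indices)), dtype="int32")

index_vector = sentence_to_indices("The bridge collapsed", toy_vocab,
                                   max_len=5, pad_index=pad_index)
# Row lookup: one embedding vector per position, shape (max_len, embedding_dim).
embedded = lookup_table[index_vector]
```

A frozen embedding layer performs the same row lookup inside the network, so the model only ever receives the integer index vectors as input.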

In [None]:
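def pretrained_embedding_layer(embedding_df):
    #embedding_df holds one column per word and one row per embedding dimension
    vocab_size, embedding_dim = embedding_df.shape[1], embedding_df.shape[0]
    #the layer gets one extra row, filled with zeros, after the last word
    embedding_layer = Embedding(vocab_size + 1, embedding_dim, trainable=False)
    embedding_layer.build((None,))
    embed_matrix = np.transpose(embedding_df.values)
    embed_matrix = np.concatenate([embed_matrix, np.zeros([1, embedding_dim])], axis=0)
    #load the pretrained vectors; trainable=False keeps them fixed during training
    embedding_layer.set_weights([embed_matrix])
    return embedding_layer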

In [None]:
def LSTM_graph(max_len, embedding_df):
    index_vectors = Input(shape=(max_len,) , dtype='int32')
    embedding_layer = pretrained_embedding_layer(embedding_df)
    embeddings = embedding_layer(index_vectors)
    # First LSTM layer with a 128-dimensional hidden state; returns the full sequence of hidden states
    X = LSTM(units=128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Propagate X through another LSTM layer with a 128-dimensional hidden state
    # Here the returned output is a single hidden state per sentence, not a sequence.
    X = LSTM(units=128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)
    # Single sigmoid output unit
    X = Dense(units=1, activation='sigmoid')(X)
    
    # Create Model instance which converts index_vectors into X.
    model = Model(inputs=index_vectors, outputs=X)
    
    return model
    

In [None]:
model = LSTM_graph(max_len, glove_df)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [31]:
from sklearn.model_selection import StratifiedKFold

n_folds = 5
n_epochs = 50

skf = StratifiedKFold(n_splits = n_folds, random_state = 33, shuffle=True)

#predictions made when the given data point was in holdout set
out_of_fold_preds = np.nan*np.ones([embedding_matrix.shape[0], 1])


#loop over k-fold splits
for k, (train_indices, test_indices) in enumerate(skf.split(embedding_matrix, labels)):
    print('\nTraining LSTM on fold %d / %d :\n'%(k+1, n_folds))
    #build and compile a fresh model for every fold so no weights carry over
    model = LSTM_graph(max_len, glove_df)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(embedding_matrix[train_indices,:], labels[train_indices], epochs=n_epochs, batch_size=32, shuffle=True)
    out_of_fold_preds[test_indices,:] = model.predict(embedding_matrix[test_indices,:])




Training LSTM on fold 1 / 5 :

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

Training LSTM on fold 2 / 5 :

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 

Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

Training LSTM on fold 5 / 5 :

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [33]:
out_of_fold_preds

array([[9.96589959e-01],
       [9.96329784e-01],
       [9.99440074e-01],
       [9.39141750e-01],
       [9.98056769e-01],
       [9.97541726e-01],
       [9.97922659e-01],
       [9.99444366e-01],
       [9.38325882e-01],
       [9.99361634e-01],
       [9.99371171e-01],
       [9.39458609e-01],
       [9.37644124e-01],
       [9.99439836e-01],
       [9.98959661e-01],
       [9.98264968e-01],
       [9.99373078e-01],
       [9.37231779e-01],
       [9.98103321e-01],
       [9.96634066e-01],
       [9.98261452e-01],
       [9.97969866e-01],
       [9.99384165e-01],
       [9.98260617e-01],
       [9.99387622e-01],
       [9.99435902e-01],
       [9.96303916e-01],
       [9.98099089e-01],
       [9.99459028e-01],
       [3.27229500e-04],
       [9.95953977e-01],
       [9.97938991e-01],
       [8.52109790e-01],
       [9.99406695e-01],
       [9.39127803e-01],
       [9.98246372e-01],
       [9.99444604e-01],
       [9.98222828e-01],
       [9.96633351e-01],
       [9.97218966e-01],


In [39]:
#print metrics on hold-out data
out_of_fold_preds = np.squeeze(out_of_fold_preds)
#threshold the predicted probabilities at 0.5 to get class predictions
out_of_fold_classes = out_of_fold_preds > 0.5
out_of_fold_acc = np.mean(out_of_fold_classes == labels)
print("Out of fold accuracy        = %3f"%out_of_fold_acc)
print("Base rate guessing accuracy = %3f"%max(np.mean(labels==0), np.mean(labels==1)))

Out of fold accuracy        = 0.904239
Base rate guessing accuracy = 0.759812
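
For reference, here is how these two numbers are computed, shown on a small made-up example. The arrays below are illustrative values, not results from this notebook. The base rate is the accuracy obtained by always guessing the more common class, so a model is only useful if it beats that number.

```python
import numpy as np

# Made-up labels and predicted probabilities for eight sentences.
toy_labels = np.array([1, 1, 1, 0, 1, 0, 1, 1])
toy_probs = np.array([0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.95, 0.99])

# Threshold probabilities at 0.5 to get class predictions.
toy_classes = (toy_probs > 0.5).astype(int)

# Fraction of sentences whose predicted class matches the label.
accuracy = np.mean(toy_classes == toy_labels)

# Accuracy of always predicting the majority class.
base_rate = max(np.mean(toy_labels == 0), np.mean(toy_labels == 1))
```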


In [None]:
#output of the model is a binary vector
model.fit(train_matrix, train_labels, epochs = 50, batch_size = 32, shuffle=True)

In [None]:
train_pred = model.predict(train_matrix)

In [None]:
print(np.sum(train_pred>0.5))
print(np.sum(train_pred<=0.5))

Now train with cross-validation

In [None]:
from sklearn.model_selection import StratifiedKFold

n_folds = 5
n_epochs = 10**3

skf = StratifiedKFold(n_splits = n_folds, random_state = 33, shuffle=True)

out_of_fold_preds = []
true_labels = []
words = np.array([])

model_W_list = []
model_b_list = []


#loop over k-fold splits
for k, (train_index, test_index) in enumerate(skf.split(embedding_matrix, labels)):
    print('\nTraining L2-regularized logistic regression on fold %d / %d :\n'%(k+1, n_folds))

Train a model on sentences describing failures of various kinds
(mainly engineering projects and software engineering).

### Single-word embeddings for logistic regression

In [None]:
#embed all words from a given piece of text
#return a matrix: rows are words, columns are embedding dimensions
#skips any words not in embedding
def get_embedding(glove_df, text):
    words = text.split()
    single_word_embeddings = []
    words_used = []
    for w in words:
        if(w in glove_df.columns):
            single_word_embeddings.append(glove_df[w].values)
            words_used.append(w)
    return (np.array(single_word_embeddings), words_used)

In [None]:
#embed the response to the second question
#for all interviews in the dataset
fail_list = []
success_list = []
fail_words = []
success_words = []
for doc in data:
    r2_embed, words = get_embedding(glove_df, doc['responses'][1])
    if('failure' in doc['tags']):
        fail_list.append(r2_embed)
        fail_words = fail_words + words
    elif('success' in doc['tags']):
        success_list.append(r2_embed)
        success_words = success_words + words
        
r2_embed_fail    = np.concatenate(fail_list, axis=0)
r2_embed_success = np.concatenate(success_list, axis=0)
r2_embed = np.concatenate([r2_embed_fail, r2_embed_success], axis=0)
r2_words = np.array(fail_words + success_words)
labels = np.array([0]*r2_embed_fail.shape[0] + [1]*r2_embed_success.shape[0])

In [None]:
r2_embed.shape

### L2-regularized Logistic Regression

In [None]:
import tensorflow as tf
def train_L2_logistic_regression(X_train, Y_train, X_test=None,\
                                L2_lambda=0.1, learning_rate=10**(-5), n_epochs=10**3, minibatch_size=128,\
                                print_progress=True):
    
    n_train_data = X_train.shape[0]
    n_features = X_train.shape[1]
    tf.reset_default_graph()
    tf.set_random_seed(50)
    
    #placeholders for data and parameters:
    X_ = tf.placeholder(tf.float32, shape = [n_features, None], name="X")
    Y_ = tf.placeholder(tf.float32, shape = [None], name="Y")
    L2_lambda_ = tf.placeholder(tf.float32, shape=(), name="L2_lambda")
    
    W_ = tf.Variable(tf.ones([1, n_features])*0.01, name="W")
    b_ = tf.Variable(tf.ones([1,1]), name="b")
    
    #compute linear predictor 
    #W_ is a row vector and each column of X_ is one example, so matmul gives one score per example
    Z_ = tf.matmul(W_, X_) + b_
    Z_ = tf.squeeze(Z_)
    
    #mean cross-entropy plus a penalty on the (unsquared) Euclidean norm of W_
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=tf.transpose(Z_),\
                                                                 labels=tf.transpose(Y_)))\
            + L2_lambda_*tf.norm(W_, axis=None, ord=2)
        
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    
    init = tf.global_variables_initializer()
    
    with tf.Session() as sess:
        sess.run(init)
        epoch_cost_mean = np.ones(n_epochs)*np.nan
        epoch_cost_sem  = np.ones(n_epochs)*np.nan
        
        for epoch in range(n_epochs):
            shuffled_indices = np.random.permutation(n_train_data)
            #split the shuffled indices into minibatches; the last one may be smaller
            n_minibatches = int(np.ceil(float(n_train_data)/minibatch_size))
            minibatch_indices =\
            [shuffled_indices[i*minibatch_size : (i+1)*minibatch_size] for i in range(n_minibatches)]
            minibatch_costs = np.ones(len(minibatch_indices))*np.nan
            for k, ind in enumerate(minibatch_indices):
                X_minibatch = X_train[ind,:]
                Y_minibatch = Y_train[ind]
                _, minibatch_cost = sess.run([optimizer, cost],\
                                         feed_dict = {X_:X_minibatch.T, Y_:Y_minibatch, L2_lambda_:np.array(L2_lambda)})
                minibatch_costs[k] = minibatch_cost

            epoch_cost_mean[epoch] = np.mean(minibatch_costs)
            epoch_cost_sem[epoch]  = np.std(minibatch_costs)/np.sqrt(len(minibatch_costs))

            #report the mean minibatch cost every 100 epochs, unless printing is disabled
            if(print_progress and epoch % 100 == 0):
                print("Cost after epoch %d = %f"%(epoch, epoch_cost_mean[epoch]))
        #end loop over epochs

        #now get predictions
        probs_train = 1.0/(1.0 + np.exp(-Z_.eval({X_:X_train.T})))

        if X_test is not None:
            probs_test  = 1.0/(1.0 + np.exp(-Z_.eval({X_:X_test.T})))
        else:
            probs_test = None

        #extract parameters of trained model
        model = {"W":W_.eval(), "b":b_.eval()}

    if(X_test is not None):
        return (probs_train, probs_test, model)
    else:
        return (probs_train, model)
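
To make the optimization problem easier to see, here is a NumPy-only sketch of regularized logistic regression. It is not a port of the function above: it swaps in plain full-batch gradient descent for Adam and penalizes the squared norm of the weights rather than the unsquared norm. The names `fit_l2_logistic_numpy`, `X_toy` and `y_toy` are hypothetical, and the data is synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic_numpy(X, y, l2_lambda=0.1, learning_rate=0.1, n_steps=500):
    """Gradient descent on mean cross-entropy + l2_lambda * ||w||^2."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)                       # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples + 2.0 * l2_lambda * w
        grad_b = np.mean(p - y)                      # bias is not regularized
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic two-class data: class 0 centred at (-1, -1), class 1 at (1, 1).
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(1.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w_toy, b_toy = fit_l2_logistic_numpy(X_toy, y_toy)
train_acc = np.mean((sigmoid(X_toy @ w_toy + b_toy) > 0.5) == y_toy)
```

The weight vector `w_toy` is the normal of the separating hyperplane, which is the quantity the later cells average across folds and project word vectors onto.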
            

In [None]:
#fit L2-regularized logistic regression via tensorflow
#this achieves two things: gives a baseline we can compare more complicated models to
# and yields a separating hyperplane in semantic space
import tensorflow as tf

from sklearn.model_selection import StratifiedKFold

n_folds = 5
n_epochs = 10**3

skf = StratifiedKFold(n_splits = n_folds, random_state = 33, shuffle=True)

out_of_fold_preds = []
true_labels = []
words = np.array([])

model_W_list = []
model_b_list = []


#loop over k-fold splits
for k, (train_index, test_index) in enumerate(skf.split(r2_embed, labels)):
    print('\nTraining L2-regularized logistic regression on fold %d / %d :\n'%(k+1, n_folds))
    X_train = r2_embed[train_index,:]
    Y_train = labels[train_index]
    X_test = r2_embed[test_index,:]
    Y_test = labels[test_index]
    probs_train, probs_test, model = train_L2_logistic_regression(X_train, Y_train, X_test, n_epochs=n_epochs)
    out_of_fold_preds.append(probs_test)
    true_labels.append(Y_test)
    words = np.concatenate([words, r2_words[test_index]])
    model_W_list.append(model['W'])
    model_b_list.append(model['b'])
#concatenate results into single vectors: out-of-fold prediction for each word
out_of_fold_preds = np.concatenate(out_of_fold_preds, axis=0)
true_labels       = np.concatenate(true_labels, axis=0)
#despite its name, `loss` is the probability assigned to the true label (higher is better)
loss = out_of_fold_preds*true_labels + (1 - out_of_fold_preds)*(1 - true_labels)

In [None]:
model_W = np.concatenate(model_W_list, axis=0)
model_b = np.concatenate(model_b_list, axis=0)

In [None]:
model_W.shape

In [None]:
mean_W = np.mean(model_W, axis=0)
model_scores = glove_df.apply(lambda x: np.dot(x, mean_W)/np.linalg.norm(x,2))

In [None]:
model_scores = model_scores.sort_values()

In [None]:
model_scores[0:50]
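
The scoring step above divides each word's dot product with the weight vector by the word vector's length, so long vectors are not favored. Here is a small sketch with made-up two-dimensional vectors. The names `toy_words`, `toy_vectors` and `toy_w` are hypothetical and the numbers are not taken from GloVe.

```python
import numpy as np

toy_words = ["collapse", "delay", "launch", "growth"]
toy_vectors = np.array([[-2.0, 0.5],
                        [-1.0, 0.0],
                        [1.0, 0.2],
                        [3.0, -0.5]])
toy_w = np.array([1.0, 0.0])  # stand-in for the fold-averaged weight vector

# Dot product with the weights, divided by each word vector's Euclidean norm.
scores = toy_vectors @ toy_w / np.linalg.norm(toy_vectors, axis=1)

# Ascending order: most negative scores first.
ranked = [toy_words[i] for i in np.argsort(scores)]
```

Because failure words carry label 0 in this section, the most negative scores (the head of the sorted series) are the words lying furthest toward the failure side of the hyperplane.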

In [None]:
np.mean(loss)

In [None]:
#plot results
import matplotlib.pyplot as plt
#convert out-of-fold probabilities back to linear scores (logits)
lin_pred = -np.log(1.0/out_of_fold_preds-1)
lin_pred_pos = lin_pred[true_labels == 1]
lin_pred_neg = lin_pred[true_labels == 0]

bin_width = 0.25
bins = np.arange(-8,8,bin_width)
hist_pos = np.histogram(lin_pred_pos, bins)
hist_neg = np.histogram(lin_pred_neg, bins)

centers = lambda b : (b[1:] + b[:-1])/2.0

plt.figure(figsize=[12,7])
plt.bar(centers(hist_neg[1]), hist_neg[0], width=bin_width, align='center', color=[1,0,0.7],alpha=0.5,label='words from failure articles')
plt.bar(centers(hist_pos[1]), hist_pos[0], width=bin_width, align='center', color=[0,0.8,0.8],alpha=1,label='words from success articles')

plt.gca().set_yticklabels([u.astype(int) for u in plt.gca().get_yticks()], fontsize=16)
plt.gca().set_xticklabels([u.astype(int) for u in plt.gca().get_xticks()], fontsize=16)

plt.legend(fontsize=20)
plt.xlabel('Linear Score', fontsize=20, labelpad=20)
plt.plot([0,0],[0,100],'--k',linewidth=1.5)
plt.ylim([0,80])
plt.tight_layout()
plt.show()
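
The "linear score" on the x-axis is the logit of the predicted probability, that is, the inverse of the sigmoid. A quick check on made-up probabilities (illustrative values only):

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.9])

# Same transformation as in the plotting cell: logit(p) = -log(1/p - 1).
linear_scores = -np.log(1.0 / probs - 1.0)

# Applying the sigmoid recovers the original probabilities.
recovered = 1.0 / (1.0 + np.exp(-linear_scores))
```

A probability of 0.5 maps to a linear score of 0, which is why the dashed vertical line at 0 marks the decision boundary in the histogram.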

In [None]:
#print best-classified words
#`loss` holds the probability assigned to the true label, so sort it in descending order
n_to_print = 50
order = np.argsort(-loss)

print('Best-classified failure words:')
failure_words = words[order][true_labels[order]==0][:n_to_print]
print(failure_words)
print('Best-classified success words:')
success_words = words[order][true_labels[order]==1][:n_to_print]
print(success_words)

In [None]:
#plot startup locations on a map

import gmplot



gmap = gmplot.GoogleMapPlotter(37.428, -50, 2)

if(googlemaps_api_key):
    gmap.apikey=googlemaps_api_key

#gmap.plot(latitudes, longitudes, 'cornflowerblue', edge_width=10)
#gmap.scatter(more_lats, more_lngs, '#3B0B39', size=40, marker=False)
#gmap.scatter(marker_lats, marker_lngs, 'k', marker=True)
#gmap.heatmap(heat_lats, heat_lngs)

gmap.draw("mymap.html")
from IPython.display import IFrame

IFrame(src='./mymap.html', width=700, height=600)

In [None]:
with open('output.txt', 'w') as output_f:
    output_f.write('stuff')