Some ideas that I'm interested in:
  - crawling through a home webpage to extract the mission/values of the company
  - see if Glassdoor has an API that can be leveraged
  - Main items: extract the values and mission statement, plus key items from recent press coverage of new innovations, to answer the question: "Why Company X?"

# Exploration of Cover Letter Idea
Looking into parsing out the most relevant sentences from Asana's company page as a starting example: https://asana.com/company

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import newspaper
from collections import OrderedDict

### Parsed Keywords

#### About Keywords
These keywords target pages that describe what the company is, along with its values and mission.

#### Press Keywords
These keywords target recent press coverage, to gather key information on the company's latest developments.

## Scraping home page for relevant subpages

In [3]:
about_keywords = ['about', 'jobs', 'career', 'value', 'culture', 'mission', 'company']

In [4]:
press_keywords = ['press', 'news', 'latest']

In [5]:
glassdoor = 'glassdoor'
blog = 'blog'

In [6]:
asana = "https://asana.com/"

In [7]:
r = requests.get(asana)

In [8]:
c = r.content
soup = BeautifulSoup(c, "lxml")

In [9]:
# Collect every link whose href contains one of the "about" keywords.
links = soup.find_all("a", href=True)  # href=True skips anchors without an href
about_urls = []
for link in links:
    url = link['href']
    for about in about_keywords:
        if about in url:
            about_urls.append(url)

In [10]:
about_urls

['/jobs', '/company', '/jobs']

In [11]:
filtered_about = list(OrderedDict.fromkeys(about_urls))

In [12]:
filtered_about

['/jobs', '/company']

In [13]:
full_about = []
for url in filtered_about:
    # Only site-relative links are kept; absolute URLs are skipped here.
    if url.startswith('/'):
        full_about.append(asana + url[1:])

In [127]:
# The scraped child pages that I am interested in if I want to learn about the company
full_about

['https://asana.com/jobs', 'https://asana.com/company']
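
The string concatenation above only handles links that start with `/`. A minimal sketch of the same filtering step using the standard library's `urllib.parse.urljoin`, which also passes absolute URLs through unchanged, might look like this. The function name `collect_links` and the sample `hrefs` list are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urljoin

def collect_links(base_url, hrefs, keywords):
    """Return de-duplicated absolute URLs whose href contains any keyword."""
    matched = [href for href in hrefs if any(key in href for key in keywords)]
    # dict.fromkeys keeps the first occurrence of each URL, preserving order.
    return list(dict.fromkeys(urljoin(base_url, href) for href in matched))

about_keywords = ['about', 'jobs', 'career', 'value', 'culture', 'mission', 'company']
# Hypothetical hrefs, standing in for the values scraped from the home page.
hrefs = ['/jobs', '/company', '/jobs', 'https://example.com/about-us', '/pricing']
result = collect_links('https://asana.com/', hrefs, about_keywords)
print(result)
```

The same helper could presumably be called with `press_keywords` to gather press pages.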

In [128]:
# from newspaper import Article

# article = Article(full_about[0])

# article.download()

# article.parse()

# article.html

## Exploration of Asana's company page
http://asana.com/company

Output of the page has been suppressed for readability. The next cells clean up the data.

In [129]:
request = requests.get(full_about[1])

In [130]:
s = BeautifulSoup(request.text, "lxml")

In [131]:
bod = s.find('body')

In [132]:
bod.text;

In [133]:
bod_sub = re.sub(r'\n\s*', '\n', bod.text)  # \s already covers tabs and newlines

In [134]:
bod_sub;

In [135]:
# nlp = spacy.load('en')

In [136]:
# import pytextrank
# import sys

In [137]:
regex = re.compile(r'\.\s+')  # a literal period followed by whitespace (unused below)

In [138]:
bod_stripped = bod_sub.replace('.\n', '. ')

In [139]:
stripped_newlines = bod_stripped.replace('\n', '. ')

In [140]:
stripped_newlines = stripped_newlines.strip()
stripped_newlines;

In [141]:
stripped = stripped_newlines.split('. ')

In [142]:
stripped = [(strip + '.') for strip in stripped]

In [143]:
stripped_arr = []

for strip in stripped:
    if "function" not in strip:
        stripped_arr.append(strip)
        

In [144]:
stripped_arr;

### Cleaning out sentences with 3 or fewer words.

In [145]:
cleaned = []
for sentence in stripped_arr:
    words = sentence.split(' ')  # avoid overwriting the `stripped` list from earlier cells
    if len(words) > 3:
        cleaned.append(sentence)
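
The cleaning cells above could be collected into one helper. This is a rough sketch rather than an exact replica: it splits with a single regular expression and strips trailing periods before re-adding one, so it avoids doubled periods. The name `split_sentences` and its parameters are assumptions made for illustration.

```python
import re

def split_sentences(text, min_words=4, banned=("function",)):
    """Collapse whitespace runs, split into sentences, and drop short or script-like pieces."""
    collapsed = re.sub(r"\n\s*", "\n", text).strip()
    # Treat both a period followed by whitespace and a line break as boundaries.
    pieces = re.split(r"\.\s+|\n", collapsed)
    sentences = []
    for piece in pieces:
        piece = piece.strip().rstrip(".")
        if not piece or any(word in piece for word in banned):
            continue
        if len(piece.split()) >= min_words:  # same cutoff as len(words) > 3
            sentences.append(piece + ".")
    return sentences

# Made-up sample text that mimics the scraped body: real sentences, a nav label, and script residue.
sample = ("Work shouldn’t be chaos.\n\n   With Asana, each person knows what "
          "they should be doing and why.\nSign up\nfunction init() { return 1 }")
print(split_sentences(sample))
```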

### Key question: How do I score the sentences for relevancy?

In [146]:
cleaned

['We use cookies to give you the best possible experience on our website.',
 'By continuing to browse this site, you give consent for cookies to be used.',
 'For more details, please read our Cookie Policy.',
 'We’re empowering teams to do great things together.',
 'Asana’s mission is to help humanity thrive by enabling all teams to work together effortlessly.',
 'We’re changing how teams work together.',
 'Think back to the last time you were deep in the zone—time flew by and the work flowed through you almost effortlessly.',
 'That’s how working together should be.',
 'Instead, information is scattered and responsibilities are unclear.',
 'We try to cut through the chaos with endless meetings and micromanagement, but we end up with less time and not much more clarity.',
 'Work shouldn’t be chaos.',
 'At Asana, we’re building a place where everything from the most immediate details to the big picture are organized.',
 'With Asana, each person knows what they should be doing and why.',
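
One simple answer to the scoring question is to weight sentences by how many mission-related keywords they contain. The sketch below is only a baseline under stated assumptions: the keyword weights are invented for illustration, and `keyword_score` is a hypothetical helper, not something the notebook defines.

```python
import re

# Hypothetical keyword weights; illustrative assumptions, not tuned values.
KEYWORD_WEIGHTS = {"mission": 3.0, "value": 2.0, "culture": 2.0, "team": 1.0}

def keyword_score(sentence, weights=KEYWORD_WEIGHTS):
    """Sum the weights of keywords that prefix any word, divided by the word count."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return 0.0
    total = sum(weight for word in words
                for key, weight in weights.items() if word.startswith(key))
    return total / len(words)

# Three sentences taken from the cleaned output above.
sentences = [
    "We use cookies to give you the best possible experience on our website.",
    "Asana’s mission is to help humanity thrive by enabling all teams to work together effortlessly.",
    "We’re changing how teams work together.",
]
ranked = sorted(sentences, key=keyword_score, reverse=True)
for sentence in ranked:
    print(round(keyword_score(sentence), 3), sentence)
```

Dividing by the word count keeps long sentences from winning purely on length, which is the same normalization the TF-IDF summarizer later in the notebook applies.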

### Following instructions from: http://nbviewer.jupyter.org/github/charlieg/A-Smattering-of-NLP-in-Python/blob/master/A%20Smattering%20of%20NLP%20in%20Python.ipynb

In [147]:
import nltk

In [148]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [154]:
tokens = [word for sent in nltk.sent_tokenize(bod.text) for word in nltk.word_tokenize(sent)]

for token in sorted(set(tokens))[:30]:
    print(token + ' [' + str(tokens.count(token)) + ']')

! [14]
# [2]
$ [86]
% [2]
& [17]
'' [102]
( [304]
) [304]
* [1]
+5 [1]
+parseInt [1]
, [184]
,0 [1]
,200 [1]
,500 [1]
-is-active [2]
-is-hidden [5]
-mobile-nav-active [1]
-no-scroll [1]
. [96]
.-is-hidden [1]
.accordion [3]
.accordion-body [5]
.accordion-header [3]
.accordion-row [8]
.accordion-tab [2]
.accordion-wrapper [3]
.accordion-wrapper.is-active [1]
.addClass [4]
.animate [1]
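
The counts above are dominated by JavaScript and CSS selectors, which suggests the body text still includes the contents of `<script>` tags. A dependency-free sketch of extracting only the visible text, using the standard library's `html.parser`, is shown below. The class and function names are assumptions made for illustration; with BeautifulSoup, a similar effect should be possible by calling `decompose()` on the script and style tags before reading the text.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def visible_text(html):
    """Return the text of an HTML string with script and style contents removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up HTML that mixes one real sentence with script residue.
sample_html = ("<body><p>Work shouldn't be chaos.</p>"
               "<script>$('.accordion').addClass('-is-active');</script></body>")
print(visible_text(sample_html))
```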


In [156]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(t) for t in tokens]

for token in sorted(set(stemmed_tokens))[50:75]:
    print(token + ' [' + str(stemmed_tokens.count(token)) + ']')

.outerheight [5]
.parent [1]
.prev [1]
.readi [1]
.reduc [1]
.removeclass [3]
.resiz [1]
.setupaccordion [2]
.signup-email-modal-buy [2]
.signup-email-modal-buy-not-valid [1]
.signup-email-modal-get-start [2]
.signup-email-modal-get-started-not-valid [1]
.signup-email-modal-signup [2]
.signup-email-modal-signup-not-valid [1]
.signup-email-modal-tri [4]
.signup-email-modal-trial-not-valid [1]
.signup-email-modal-try-not-valid [1]
.signup-email-page-build [1]
.signup-submit-modal-buy [1]
.signup-submit-modal-get-start [1]
.signup-submit-modal-signup [1]
.signup-submit-modal-tri [2]
.signupform [5]
.sitehead [1]
.supportcard-cont [1]


In [157]:
sorted(nltk.corpus.stopwords.words('english'))[:25]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both']
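
As a quick illustration of how a stop-word list is used, the sketch below drops stop words before counting tokens. To stay self-contained it hard-codes only the 25 words printed above rather than loading the full NLTK list, and `content_word_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Only the stop words shown in the output above; the full NLTK list is longer.
STOP_WORDS = {
    "a", "about", "above", "after", "again", "against", "ain", "all", "am",
    "an", "and", "any", "are", "aren", "aren't", "as", "at", "be", "because",
    "been", "before", "being", "below", "between", "both",
}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)

# A sentence from the cleaned company-page output.
sample = ("We try to cut through the chaos with endless meetings and "
          "micromanagement, but we end up with less time and not much more clarity.")
counts = content_word_counts(sample)
print(counts.most_common(3))
```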

In [163]:
from nltk.corpus import reuters

print('** BEGIN ARTICLE: ** \"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\"')

** BEGIN ARTICLE: ** "ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict wo [...]"


In [164]:
import datetime, re, sys
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

token_dict = {}
for article in reuters.fileids():
    token_dict[article] = reuters.raw(article)
        
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')
print('building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']')
sys.stdout.flush()

tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)
print('done! [process finished: ' + str(datetime.datetime.now()) + ']')

building term-document matrix... [process started: 2018-08-29 16:59:21.088874]
done! [process finished: 2018-08-29 16:59:54.348285]


In [167]:
from random import randint

# On scikit-learn >= 1.0, use list(tfidf.get_feature_names_out()) instead.
feature_names = tfidf.get_feature_names()
print('TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents')

print('first term: ' + feature_names[0])
print('last term: ' + feature_names[len(feature_names) - 1])

for i in range(0, 4):
    print('random term: ' + feature_names[randint(1,len(feature_names) - 2)])

TDM contains 25833 terms and 10788 documents
first term: 'd
last term: zzzz
random term: crss
random term: iridium
random term: mca
random term: redempt


In [172]:
from __future__ import division  # must precede other statements; a no-op on Python 3
import math

article_id = randint(0, tdm.shape[0] - 1)
article_text = reuters.raw(reuters.fileids()[article_id])

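vocab = tfidf.vocabulary_  # maps each term to its column index in tdm
sent_scores = []
for sentence in nltk.sent_tokenize(article_text):
    sent_tokens = tokenize_and_stem(sentence)
    if not sent_tokens:
        continue  # no alphabetic tokens; skipping avoids a division by zero
    # Sum the TF-IDF weights of the sentence's known terms, then normalize by length.
    score = sum(tdm[article_id, vocab[t]] for t in sent_tokens if t in vocab)
    sent_scores.append((score / len(sent_tokens), sentence))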

summary_length = int(math.ceil(len(sent_scores) / 5))
sent_scores.sort(key=lambda sent: sent[0], reverse=True)

print('*** SUMMARY ***')
for summary_sentence in sent_scores[:summary_length]:
    print(summary_sentence[1])

print('\n*** ORIGINAL ***')
print(article_text)

*** SUMMARY ***
Mechanically separated meat is a high-protein, low-cost
  product that has been approved for use since 1978, USDA said.
U.S. MEAT PROCESSORS ASK FOR LABELLING CHANGE
  Four U.S. meat processors have asked
  the federal government to relax a labelling requirement which
  they said discourages the use of mechanically separated meat,
  the U.S. Agriculture Department said.

*** ORIGINAL ***
U.S. MEAT PROCESSORS ASK FOR LABELLING CHANGE
  Four U.S. meat processors have asked
  the federal government to relax a labelling requirement which
  they said discourages the use of mechanically separated meat,
  the U.S. Agriculture Department said.
      The petition, filed by Bob Evans Farms, Odom Sausage Co,
  Sara Lee Corp and Owens Country Sausage, asks USDA to allow
  mechanically separated meat to be listed on product labels as
  the species from which it was derived.
      For example, "pork" would be listed on the ingredients
  statement instead of "mechanically separated po