<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Natural-Language" data-toc-modified-id="Introduction-to-Natural-Language-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Natural Language</a></span><ul class="toc-item"><li><span><a href="#Lesson-4:-Intro-to-NLP" data-toc-modified-id="Lesson-4:-Intro-to-NLP-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Lesson 4: Intro to NLP</a></span><ul class="toc-item"><li><span><a href="##6-Counting-Words" data-toc-modified-id="#6-Counting-Words-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>#6 Counting Words</a></span></li></ul></li><li><span><a href="#Lesson-5:-Text-Processing" data-toc-modified-id="Lesson-5:-Text-Processing-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Lesson 5: Text Processing</a></span><ul class="toc-item"><li><span><a href="##6-Capturing-Data" data-toc-modified-id="#6-Capturing-Data-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>#6 Capturing Data</a></span></li><li><span><a href="##7-Cleaning" data-toc-modified-id="#7-Cleaning-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>#7 Cleaning</a></span></li></ul></li></ul></li></ul></div>

# Introduction to Natural Language

## Lesson 4: Intro to NLP 

### #6 Counting Words

Let's implement a simple function that is often used in Natural Language Processing: counting word frequencies.

Consider this passage of text:

> As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.

— Excerpt from Treasure Island, by Robert Louis Stevenson.

In the following coding exercise, the provided code loads the text from a file, calls the function `count_words()` (which you need to implement) to obtain word counts, and prints the 10 most common and 10 least common unique words.

In [1]:
import string
import os

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [2]:
"""Count words."""
import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return

    # Convert to lowercase so that "The" and "the" count as the same word
    text = text.lower()

    # Split text into tokens (words) on runs of non-alphanumeric characters.
    # This drops punctuation, including characters such as the em dash that
    # are not part of string.punctuation.
    words = [word for word in re.split(r'\W+', text) if word]

    # Aggregate word counts using a dictionary
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    return counts



In [3]:
def test_run():
    with open(os.path.join("data","input.txt"), "r") as f:
        text = f.read()

        counts = count_words(text)
        sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
        
        print("10 most common words:\nWord\tCount")
        for word, count in sorted_counts[:10]:
            print("{}\t{}".format(word, count))
        
        print("\n10 least common words:\nWord\tCount")
        for word, count in sorted_counts[-10:]:
            print("{}\t{}".format(word, count))

In [4]:
test_run()

10 most common words:
Word	Count
a	9
he	6
the	6
and	5
as	4
was	4
with	3
i	2
of	2
his	2

10 least common words:
Word	Count
tables	1
merry	1
word	1
or	1
slap	1
on	1
for	1
more	1
favoured	1
guests	1
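
As an alternative to the hand-written dictionary loop, the standard library's `collections.Counter` can do the aggregation. The following is a minimal sketch, not the exercise's reference solution; the helper name `count_words_counter` and the sample sentence (taken from the passage above) are chosen here for illustration only.

```python
import re
from collections import Counter


def count_words_counter(text):
    """Count word occurrences with collections.Counter (hypothetical helper)."""
    # Lowercase first, then pull out runs of word characters as tokens.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


sample = "He was very tall and strong, with a face as big as a ham"
counts = count_words_counter(sample)

# most_common(n) returns the n highest (word, count) pairs.
print(counts.most_common(2))
```

A `Counter` is a `dict` subclass, so it can be passed to code that expects the plain dictionary returned by `count_words()`.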


## Lesson 5: Text Processing

### #6 Capturing Data

Text data can come from a variety of sources, including:
- Plain text, such as a .txt file stored locally
- Tabular data
- Online resources

**Plain Text**

In [5]:
# Plain Text

with open(os.path.join("data", "hieroglyph.txt"), "r") as f:
    text = f.read()
    print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.



**Tabular Data**

In [6]:
# Tabular Data

import pandas as pd

# load the csv
df = pd.read_csv(os.path.join("data", "news.csv"))
df.head()

|   | id | title | url | publisher | category | story | hostname | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 1 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 2 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 3 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |
| 4 | 6 | Plosser: Fed May Have to Accelerate Tapering Pace | http://www.nasdaq.com/article/plosser-fed-may-... | NASDAQ | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.nasdaq.com | 1394470372212 |


In [7]:
# Extract text column from a data frame
df.head()[['publisher', 'title']]

|   | publisher | title |
|---|---|---|
| 0 | Livemint | Fed's Charles Plosser sees high bar for change... |
| 1 | IFA Magazine | US open: Stocks fall after Fed official hints ... |
| 2 | IFA Magazine | Fed risks falling 'behind the curve', Charles ... |
| 3 | Moneynews | Fed's Plosser: Nasty Weather Has Curbed Job Gr... |
| 4 | NASDAQ | Plosser: Fed May Have to Accelerate Tapering Pace |


In [8]:
# Convert text column to lower case
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

|   | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
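
When pandas is not available, the standard library's `csv` module can pull out a text column in much the same way. This is a rough sketch under stated assumptions: the two sample rows are copied from the output above (with titles as displayed, including the `...` truncation), and the three-column layout is a simplification made up for the example rather than the real layout of `news.csv`.

```python
import csv
import io

# Two sample rows copied from the output above; the column subset is an
# assumption made for this illustration.
sample_csv = """id,title,publisher
4,"Fed risks falling 'behind the curve', Charles ...",IFA Magazine
6,Plosser: Fed May Have to Accelerate Tapering Pace,NASDAQ
"""

# DictReader maps each row to {column name: value} and handles quoted commas.
reader = csv.DictReader(io.StringIO(sample_csv))

# Extract the text column and convert it to lower case.
titles = [row["title"].lower() for row in reader]

for title in titles:
    print(title)
```

The quoted title contains a comma, which is why a CSV parser is preferable to splitting each line on `,` by hand.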


**Online Resource**

In [11]:
import requests
import json

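# Fetch some data from a REST API (the timeout avoids hanging indefinitely)
r = requests.get('https://quotes.rest/qod.json', timeout=10)
r.raise_for_status()  # fail early on an HTTP error status
res = r.json()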
print(json.dumps(res, indent=4))
print('\n\n')

# Extract the relevant object & field
q = res['contents']['quotes'][0]
print(q['quote'], '\n--', q['author'])

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Meaning is something you build into your life. You build it out of your own past, out of your affections and loyalties, out of the experience of humankind as it is passed on to you. ... You are the only one who can put them together into that unique pattern that will be your life.",
                "length": "281",
                "author": "John Gardner",
                "tags": {
                    "0": "existence",
                    "1": "inspire",
                    "2": "meaning",
                    "4": "tso-life"
                },
                "category": "inspire",
                "language": "en",
                "date": "2020-05-07",
                "permalink": "https://theysaidso.com/quote/john-gardner-meaning-is-something-you-build-into-your-life-you-build-it-out-of-y",
                "id": "0dnJB1FQHoDklJmvdpIseQeF",
                "backgro

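Indexing straight into `res['contents']['quotes'][0]` raises an exception if the response has a different shape, for example when the service returns an error. The sketch below shows one defensive way to walk the nested structure. It works offline on a trimmed copy of the response above; the variable name `sample_response` and the fallback strings are assumptions made for the example.

```python
import json

# A trimmed copy of the response shown above; only the fields used below are kept.
sample_response = """
{
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {"author": "John Gardner", "category": "inspire", "language": "en"}
        ]
    }
}
"""

res = json.loads(sample_response)

# .get() returns a default instead of raising KeyError when a key is missing.
quotes = res.get("contents", {}).get("quotes", [])

if quotes:
    q = quotes[0]
    print(q.get("author", "unknown"), "|", q.get("category", "n/a"))
else:
    print("No quotes in response")
```
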
### #7 Cleaning
This example shows how to scrape data by extracting the titles of news articles from [Hacker News](https://news.ycombinator.com). The `BeautifulSoup` library makes this task easier.

Fetching a web page on its own returns raw HTML, which makes it difficult to extract the text:

In [12]:
import requests

# fetch a web page
r = requests.get('https://news.ycombinator.com')
print(r.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?voatr1XnEz4pA5vgzzFL">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

We can try to remove the HTML tags with a regular expression, but the end result is still not great: stray whitespace and HTML entities such as `&nbsp;` are left behind.

In [13]:
import re

# Remove HTML tags with regex

# search for <...>
pattern = re.compile(r'<.*?>')

# replace each match with an empty string
print(pattern.sub('', r.text))


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Zoom Acquires Keybase (zoom.us)
        135 points by vikram7 30 minutes ago  | hide | 50&nbsp;comments              
      
                
      2.      GeckoView for Android (mozilla.github.io)
        141 points by selvan 2 hours ago  | hide | 57&nbsp;comments              
      
                
      3.      Norsk Data (wikipedia.org)
        90 points by scottlocklin 3 hours ago  | hide | 12&nbsp;comments              
      
                
      4.      Evolution of Emacs Lisp [pdf] (umontreal.ca)
        56 points by signa11 3 hours ago  | hide | 3&nbsp;comments              
      
                
      5.      A self-killing web site requested by a customer (2011) (rachelbythebay.com)
        98 points by cipr
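
The leftover entities and whitespace can be tidied with two more standard-library steps. This is a minimal sketch on a made-up fragment modelled on the output above (the `<td>` and `<span>` markup is an assumption for illustration, not copied from the page); it does not make regex-based tag stripping robust for HTML in general.

```python
import html
import re

# A made-up fragment modelled on the output above.
raw = "<td><span>135 points</span> by vikram7 | hide | 50&nbsp;comments</td>"

# Strip anything that looks like a tag.
no_tags = re.sub(r"<.*?>", "", raw)

# Decode HTML entities: &nbsp; becomes a non-breaking space, &amp; becomes &.
decoded = html.unescape(no_tags)

# Collapse runs of whitespace (including non-breaking spaces) into one space.
clean = re.sub(r"\s+", " ", decoded).strip()

print(clean)
```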

**Beautiful Soup** <br>
Beautiful Soup makes scraping information from webpages easier:

In [17]:
from bs4 import BeautifulSoup

# Remove the HTML tags with beautiful soup
soup = BeautifulSoup(r.text, 'html5lib')
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Zoom Acquires Keybase (zoom.us)
        135 points by vikram7 30 minutes ago  | hide | 50 comments              
      
                
      2.      GeckoView for Android (mozilla.github.io)
        141 points by selvan 2 hours ago  | hide | 57 comments              
      
                
      3.      Norsk Data (wikipedia.org)
        90 points by scottlocklin 3 hours ago  | hide | 12 comments              
      
                
      4.      Evolution of Emacs Lisp [pdf] (umontreal.ca)
        56 points by signa11 3 hours ago  | hide | 3 comments              
      
                
      5.      A self-killing web site requested by a customer (2011) (rachelbythebay.com)
        98 points by ciprian_craciun 2 hours 

In [18]:
# find all articles
summaries = soup.find_all('tr', class_='athing')
summaries[0]

<tr class="athing" id="23102430">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=23102430&amp;how=up&amp;goto=news" id="up_23102430"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://blog.zoom.us/wordpress/2020/05/07/zoom-acquires-keybase-and-announces-goal-of-developing-the-most-broadly-used-enterprise-end-to-end-encryption-offering/">Zoom Acquires Keybase</a><span class="sitebit comhead"> (<a href="from?site=zoom.us"><span class="sitestr">zoom.us</span></a>)</span></td></tr>

In [25]:
# extract one example title
summaries[0].find('a', class_='storylink').get_text().strip()

'Zoom Acquires Keybase'

In [28]:
# Compile all the articles
articles = []
summaries = soup.find_all('tr', class_='athing')

for summary in summaries:
    title = summary.find('a', class_='storylink').get_text().strip()
    articles.append(title)
    
print('Found ', len(articles), ' article summaries.')
print('Sample: ')
print(articles[0])

Found  30  article summaries.
Sample: 
Zoom Acquires Keybase
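
For comparison, a similar extraction can be sketched with only the standard library's `html.parser`, which shows roughly what Beautiful Soup saves us from writing. The class name `StoryLinkParser` is hypothetical. The sample row is a shortened version of the `athing` row shown above, with a placeholder `href`. The sketch assumes the title links carry `class="storylink"`, as in that output; the live site's markup may differ.

```python
from html.parser import HTMLParser


class StoryLinkParser(HTMLParser):
    """Collect the text of <a class="storylink"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a" and dict(attrs).get("class") == "storylink":
            self._in_link = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching link.
        if self._in_link:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


# Shortened version of the row shown above; the href is a placeholder.
row = (
    '<tr class="athing"><td class="title">'
    '<a class="storylink" href="https://example.com/">Zoom Acquires Keybase</a>'
    "</td></tr>"
)

parser = StoryLinkParser()
parser.feed(row)
print(parser.titles)  # prints ['Zoom Acquires Keybase']
```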
