# Extracting data - One potential solution

There are always many ways to solve a problem.
This is why I don't like providing a framework for answers: you might think about the problem entirely differently.
Whichever approach makes the most sense to you is the right way to do it.

Below are my solutions to the questions from `2_Python`.

***Questions***
  1. [x] Load the Great Gatsby text from above. [link](#(1)-Load-data)
  2. [x] Look at the first 5000 characters of the file. Notice it has a header? Let's get rid of that. [link](#(2)-Look-at-data)
     1. [x] Remove the header from the text, and start your new text variable at "Title:      The Great Gatsby"
  1. [x] If each tweet is 280 characters, how many tweets would it take to put the whole Great Gatsby on Twitter? [link](#(3)-Calculate-Tweets)
  1. [x] How many words are in the text? [link](#(4)-Count-words)
  1. [x] How many times does Gatsby show up? [link](#(5)-Count-Gatsby)
  1. [x] How many capitalized words are there? [link](#(6)-Count-capital-words)
  1. [x] What's the most common word in the file? [link](#(7)-Word-frequency)
  1. [x] Are there any numbers in the text? Which ones? [link](#(8)-Find-numbers)
  1. [x] When are some significant dates in the book? [link](#(9)-Find-dates)

In [1]:
import re
import math

# (1) Load data

In [2]:
with open("data/greatgatsby.txt") as fh:
    text = fh.read()
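
The text contains non-ASCII characters (`Hôtel` shows up later on), so it can be worth naming the encoding explicitly when loading. This is a minimal sketch, assuming the file is UTF-8 (an assumption, check your own copy). The helper name `load_text` is hypothetical, and the demo writes a throwaway file so it runs without `data/greatgatsby.txt`.

```python
from pathlib import Path
import tempfile


def load_text(path, encoding="utf-8"):
    """Hypothetical helper: read a whole text file with an explicit encoding."""
    # With the default strict error handling, a wrong encoding guess raises
    # UnicodeDecodeError instead of silently producing garbled characters.
    return Path(path).read_text(encoding=encoding)


# Demo on a temporary file (assumed UTF-8, like the real data might be).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("Hôtel de Ville\n", encoding="utf-8")
    demo_text = load_text(demo_path)

print(demo_text)
```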

# (2) Look at data

In [3]:
print(text[:1000])


Title:      The Great Gatsby
Author:     F. Scott Fitzgerald
* A Project Gutenberg of Australia eBook *
eBook No.:  0200041.txt
Language:   English
Date first posted: January 2002
Date most recently updated: July 2017

This eBook was produced by: Colin Choat

Project Gutenberg of Australia eBooks are created from printed editions
which are in the public domain in Australia, unless a copyright notice
is included. We do NOT keep any eBooks in compliance with a particular
paper edition.

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing this
file.

This eBook is made available at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg of Australia License which may be viewed online at
http://gutenberg.net.au/licence.html

To contact Project Gutenberg of Australia go to http://gutenberg.net.au


Title:      The Great Gatsby

In [4]:
# hard coding the start:
txt = text[972:].strip()

In [5]:
# finding the start:
start = text.find("Title: ", 100) # there are two titles; starting the search at index 100 skips the first one.
print(f"Start: {start}")
txt = text[start:].strip()

Start: 972


In [6]:
print(txt[:200])

Title:      The Great Gatsby
Author:     F. Scott Fitzgerald



Then wear the gold hat, if that will move her;
  If you can bounce high, bounce for her too,
Till she cry "Lover, gold-hatted, high-boun
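
Skipping the first title by starting the search at index 100 works for this file, but the offset is a guess. One possible alternative is to ask for the second occurrence directly. The helper `find_nth` and the sample string below are hypothetical, made up for illustration.

```python
def find_nth(haystack, needle, n):
    """Return the index of the n-th (1-based) occurrence of needle, or -1."""
    position = -1
    for _ in range(n):
        # Resume the search one character after the previous hit.
        position = haystack.find(needle, position + 1)
        if position == -1:
            return -1
    return position


# A tiny stand-in for the real file: a header with its own title, then the book.
sample = "Title: header copy\nsome licence text\nTitle: The Great Gatsby\nbody"

start = find_nth(sample, "Title: ", 2)
body = sample[start:].strip()
print(body)
```

If the marker appears fewer than `n` times the helper returns -1, so it is worth checking for that before slicing.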


# (3) Calculate Tweets

In [7]:
# without assuming spaces should be collapsed:
print(f"Number of tweets to tweet the Great Gatsby: {math.ceil(len(txt)/280)}")

Number of tweets to tweet the Great Gatsby: 960


In [8]:
# assuming runs of whitespace should be collapsed:
re_space_remover = re.compile(r'\s{2,}') # 2 or more whitespace characters in a row; a single one is left as-is
len_wo_spaces = len(re_space_remover.sub(' ', txt))
print(f"Number of tweets to tweet the Great Gatsby (no multiple-spaces/newlines): {math.ceil(len_wo_spaces/280)}")

Number of tweets to tweet the Great Gatsby (no multiple-spaces/newlines): 953
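
The two calculations above can be folded into one small function. This is only a sketch: `tweets_needed` is a hypothetical name, and it collapses with `\s+` and strips the ends, which is slightly different from the `\s{2,}` pattern used above.

```python
import math
import re

TWEET_LENGTH = 280  # characters per tweet, as given in the question


def tweets_needed(text, collapse_whitespace=False):
    """Number of tweets needed to post text, rounding the last partial tweet up."""
    if collapse_whitespace:
        # Turn every run of whitespace (spaces, tabs, newlines) into one space.
        text = re.sub(r"\s+", " ", text).strip()
    return math.ceil(len(text) / TWEET_LENGTH)


# Two words separated by 300 spaces: 308 characters raw, 9 once collapsed.
sample = "word" + " " * 300 + "word"
print(tweets_needed(sample), tweets_needed(sample, collapse_whitespace=True))
```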


# (4) Count words

In [9]:
# naive word-count:
num_words = len(txt.split())
print(f"Number of words: {num_words:,d}")

Number of words: 48,454


In [10]:
# Counting only words which are alphabetic, no numbers or special characters
re_word_finder = re.compile(
    r'\b'       # word boundary, so match beginnings or ends of words
    r'[a-z\-]+' # a-z, and dashes. + means 1 or more
    r'\b'       # ending word boundary
    , re.IGNORECASE
)
num_words = len(re_word_finder.findall(txt))
print(f"Number of words: {num_words:,d}")

Number of words: 49,477


In [11]:
# That's odd, our regular expression approach yielded MORE words, not fewer.
# Let's see why (it must mean some words count as 2 with the regex):
for word in txt.split():
    if len(re_word_finder.findall(word)) > 1:
        print(word)
        # we can iteratively look at results, stop after the first one
        break

D'INVILLIERS


In [12]:
# Ah, we didn't think about apostrophes in words: "you're" would count as two words.
re_word_finder.findall("you're")

['you', 're']

In [13]:
# So let's add apostrophes into our regex
re_word_finder = re.compile(
    r'\b'        # word boundary, so match beginnings or ends of words
    r"[a-z\-']+" # a-z, dashes, and now apostrophes too. + means 1 or more
    r'\b'        # ending word boundary
    , re.IGNORECASE
)
num_words = len(re_word_finder.findall(txt))
print(f"Number of words: {num_words:,d}")

Number of words: 48,187


In [14]:
# Let's see what doesn't count as a word.
# Checking your results is always a good idea
i = 0
for word in txt.split():
    if not re_word_finder.search(word):
        print(word)
        i += 1
        if i > 4:
            break

1
1915,
Hôtel
2
158th


In [15]:
# I'm being picky, but I don't like that Hôtel was removed.
# It was not matched because it had ô in it, which doesn't match a-z. 
# So the last thing is to add all alphabetical characters
re_word_finder = re.compile(
    r'\b'                 # word boundary, so match beginnings or ends of words
    r"(?:[^\W\d_]|[-'])+" # (?: is a non-capture group, it just means we can use | for or.
                          # [^\W\d_] means not (^) a non-word character, digit, or underscore, i.e. any letter
    r'\b'                 # ending word boundary
    , re.IGNORECASE
)
num_words = len(re_word_finder.findall(txt))
print(f"Number of words: {num_words:,d}")

Number of words: 48,201
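
To see the effect of the letter class on a small scale, here is a side-by-side sketch of the ASCII-only pattern and the Unicode-aware one. The sample sentence is made up for illustration.

```python
import re

# ASCII letters, dashes and apostrophes only.
ascii_words = re.compile(r"\b[a-z'-]+\b", re.IGNORECASE)
# Any letter in any alphabet: a word character that is not a digit or underscore.
unicode_words = re.compile(r"\b(?:[^\W\d_]|[-'])+\b")

sample = "The Hôtel de Ville on 158th Street isn't far"

print(ascii_words.findall(sample))
print(unicode_words.findall(sample))
```

Only the second pattern picks up `Hôtel`; both skip `158th`, because there is no word boundary between the digits and the letters.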


In [16]:
# Let's see if that worked
i = 0
for word in txt.split():
    if not re_word_finder.search(word):
        print(word)
        i += 1
        if i > 4:
            break

1
1915,
2
158th
"


In [17]:
# Note the r in front of the strings in the regular expressions above.
# It makes them raw strings, so \b reaches the regex engine as a word boundary instead of being read as a backspace character.
# It also turns off the usual escape sequences:
print(r'a\nb')

a\nb
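
A short sketch of why the raw-string prefix matters for `\b` specifically. The sample sentence is made up for illustration.

```python
import re

plain = "\bcat\b"   # here "\b" is the backspace character (a single character)
raw = r"\bcat\b"    # here "\b" is a backslash plus "b": a regex word boundary

print(len("\b"), len(r"\b"))

sample = "cat concat cat"
print(re.findall(plain, sample))  # looks for backspace + "cat" + backspace
print(re.findall(raw, sample))    # looks for "cat" as a whole word
```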


# (5) Count Gatsby

In [18]:
re_gatsby = re.compile('Gatsby', re.I)

In [19]:
# Let's use finditer. It returns an iterator.
num_gatsbys = re_gatsby.finditer(txt)
print(f"Number of Gatsbys: {len(num_gatsbys)}")

TypeError: object of type 'callable_iterator' has no len()

In [20]:
# Aww shucks. How can we get the length?
num_gatsbys = list(re_gatsby.finditer(txt))
print(f"Number of Gatsbys: {len(num_gatsbys)}")

Number of Gatsbys: 264


**I wonder if we can do it faster. Do we need to keep the whole list of "Gatsby"s?**

In [21]:
%timeit num_gatsbys = list(re_gatsby.finditer(txt))

2.45 ms ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%timeit num_gatsbys = [None for _ in re_gatsby.finditer(txt)]

2.48 ms ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**I guess it doesn't matter, unless it was the shortness of the list?**

In [23]:
%timeit num_gatsbys = list(re_word_finder.finditer(txt))

31.1 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [24]:
%timeit num_gatsbys = tuple(re_word_finder.finditer(txt))

30.5 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [25]:
%timeit num_gatsbys = [None for _ in re_word_finder.finditer(txt)]

29.8 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


**Probably doesn't matter. But this is a good way to optimize your code if things run slowly.**

And finditer isn't necessarily the best either:

In [26]:
%timeit num_gatsbys = re_word_finder.findall(txt)

27.6 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
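
Outside a notebook, the same kind of comparison can be sketched with the standard `timeit` module, including a variant that counts matches without keeping any list. The sample text and the function names are made up for illustration, and timings will vary by machine, so none are shown here.

```python
import re
import timeit

re_gatsby = re.compile("Gatsby", re.IGNORECASE)
# Three matches per sentence, repeated to get a text of reasonable size.
sample = "Gatsby smiled. GATSBY waved. Nobody else came to gatsby's party. " * 1000


def count_with_list():
    return len(list(re_gatsby.finditer(sample)))


def count_with_generator():
    # Counts as it goes; no list of match objects is kept around.
    return sum(1 for _ in re_gatsby.finditer(sample))


def count_with_findall():
    return len(re_gatsby.findall(sample))


RUNS = 20
for func in (count_with_list, count_with_generator, count_with_findall):
    seconds = timeit.timeit(func, number=RUNS)
    print(f"{func.__name__:22s} {func():,d} matches, {seconds / RUNS * 1e3:.2f} ms per call")
```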


# (6) Count capital words

In [27]:
# We can't use re.IGNORECASE anymore
re_capital = re.compile(
    r"\b"
    r"[A-Z]"              # One capital letter
    r"(?:[^\W\d_]|[-'])*" # The rest can be whatever alphabetical characters
    r"\b"                 # * means 0 or more, so a one-letter word like "I" will match
)

In [28]:
num_capitals = re_capital.findall(txt)
print(f"Number of Capitals: {len(num_capitals):,d}")

Number of Capitals: 6,426
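
One caveat: `[A-Z]` only covers ASCII capitals, so a word starting with an accented capital would be missed. A possible variation, sketched here on a made-up sentence, is to match any word and then check the first character with `str.isupper()`. The names `re_any_word` and `capitalized_words` are hypothetical.

```python
import re

# One letter (any alphabet), followed by letters, dashes or apostrophes.
re_any_word = re.compile(r"\b[^\W\d_](?:[^\W\d_]|[-'])*\b")


def capitalized_words(text):
    """Return the words whose first character is an uppercase letter."""
    return [word for word in re_any_word.findall(text) if word[0].isupper()]


sample = "École and Paris are Nice, I think"
print(capitalized_words(sample))
```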


# (7) Word frequency

In [29]:
# Let's start with our word-finder above:
print(re_word_finder.pattern)

\b(?:[^\W\d_]|[-'])+\b


In [30]:
# Let's use a dict to count the words. I use this approach constantly.
counts = {}

In [31]:
# Let's use finditer, so it starts running immediately, instead of finding them all, then starting the loop.
for word in re_word_finder.finditer(txt): 
    counts[word] = counts[word] + 1

KeyError: <re.Match object; span=(0, 5), match='Title'>

In [32]:
# Hmm, two problems here. First, our key is a re.Match object. That's not right.
for word in re_word_finder.finditer(txt):
    word_str = word.group(0)
    counts[word_str] = counts[word_str] + 1

KeyError: 'Title'

In [33]:
# Second, our word isn't in the dictionary yet. Let's use .get for that:
for word in re_word_finder.finditer(txt):
    word_str = word.group(0)
    counts[word_str] = counts.get(word_str, 0) + 1

In [34]:
# Wahoo! Now how do we read it? (note the parens in the for loop)
for i, (word, count) in enumerate(counts.items()):
    if i > 10:
        break
    print(f"{word: >10s} --> {count}")

     Title --> 1
       The --> 174
     Great --> 2
    Gatsby --> 189
    Author --> 1
         F --> 2
     Scott --> 1
Fitzgerald --> 1
      Then --> 50
      wear --> 3
       the --> 2205


In [35]:
# Let's get the most-used word:
most_used_word = None
most_used_word_count = -1

for word, count in counts.items():
    if count > most_used_word_count:
        most_used_word_count = count
        most_used_word = word

print(f"Most common word: '{most_used_word}' at {most_used_word_count:,d}")

Most common word: 'the' at 2,205


In [36]:
# Well that's boring, what's the second most common used word?

# Full disclosure, I'm not going to find it. You could. But I'm lazy.

# And hopefully, by now, you're starting to think "isn't there a better way?"

# This is Python. Of course there is.

In [37]:
from collections import Counter

In [38]:
counts_easy = Counter()

for word in re_word_finder.finditer(txt):
    counts_easy[word.group(0)] += 1 
    # x += 1 is the same as x = x + 1

counts_easy.most_common(10)

[('the', 2205),
 ('and', 1472),
 ('a', 1331),
 ('I', 1179),
 ('to', 1119),
 ('of', 1097),
 ('in', 767),
 ('was', 763),
 ('he', 563),
 ('that', 551)]
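
The counts above are case-sensitive, so `The` and `the` are tallied separately. If that is not what you want, one option is to lowercase each word before counting. `Counter` also accepts an iterable directly, so the explicit loop becomes optional. A minimal sketch on a made-up sentence:

```python
import re
from collections import Counter

re_word_finder = re.compile(r"\b(?:[^\W\d_]|[-'])+\b")

sample = "The cat saw the dog. The dog saw THE cat, and the cat ran."

# Lowercase every match, then let Counter consume the generator directly.
counts_folded = Counter(
    match.group(0).lower() for match in re_word_finder.finditer(sample)
)
print(counts_folded.most_common(2))
```

Note that lowercasing also turns `I` into `i`, which may or may not matter for your question.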

If writing code in Python doesn't bring a smile to your face at some point...
you're probably a grad student with too much work weighing you down.

# (8) Find numbers

In [39]:
re_num = re.compile(r"\d+")
print(f"Number of numbers: {len(re_num.findall(txt))}")

Number of numbers: 49


In [40]:
# Maybe we should allow for decimals and commas:
re_num = re.compile(r"\d[\d,.]*")
print(f"Number of numbers: {len(re_num.findall(txt))}")

Number of numbers: 34
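
The question also asks which numbers appear. Here is a sketch that lists the distinct matches, using a slightly stricter pattern (my variation, not the one above) that only accepts a comma or period when digits follow it, so trailing punctuation is left out. The sample sentence and the name `distinct_numbers` are made up for illustration.

```python
import re

# Digits, optionally followed by groups of "," or "." plus more digits.
re_number = re.compile(r"\d+(?:[.,]\d+)*")


def distinct_numbers(text):
    """Return each number once, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(re_number.findall(text)))


sample = "In 1915, I paid 1,250.50 dollars on 158th Street, back in 1915."
print(distinct_numbers(sample))
```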


# (9) Find dates

I'm going to write a date finder in steps, to show how I would think about generating the ultimate regex pattern.

I'm going to start with a nice way to see what we've found.

In [41]:
from IPython.display import display_html

In [42]:
def show_regex_results(regex, text, surrounding_len=40):
    """
    Display a list of regular expression matches,
    highlighting the actual match within its surrounding text.
    """
    
    html_template = "<p>{}<span style='font-weight:  bold;color: white;background-color: black;'>{}</span>{}</p>"
    
    html_list = []
    for match in regex.finditer(text):  # search the text argument, not the global txt
        # match objects are neato. They have a start and an end. Let's look around the matches
        # max(0, ...) keeps the slice start from going negative near the beginning of the text
        str_before = text[max(0, match.start() - surrounding_len): match.start()]
        str_match = match.group(0) # group(0) is the whole match
        str_after = text[match.end(): match.end() + surrounding_len]
        
        html_list.append(html_template.format(str_before, str_match, str_after))
        
    display_html(''.join(html_list), raw=True)
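
If you are working outside a notebook, the same idea can be sketched in plain text. The generator `iter_matches_with_context`, the pattern, and the sample sentence are all hypothetical. Clamping the start index at 0 matters because a negative slice start counts from the end of the string.

```python
import re


def iter_matches_with_context(regex, text, surrounding_len=40):
    """Yield (before, match, after) strings for every match of regex in text."""
    for match in regex.finditer(text):
        before_start = max(0, match.start() - surrounding_len)
        before = text[before_start:match.start()]
        after = text[match.end():match.end() + surrounding_len]
        yield before, match.group(0), after


sample = "In 1922 we moved east. By June the parties had begun."
re_year_or_month = re.compile(r"June\b|1[89]\d\d")

results = list(iter_matches_with_context(re_year_or_month, sample, surrounding_len=10))
for before, hit, after in results:
    # Square brackets stand in for the HTML highlighting.
    print(f"{before}[{hit}]{after}")
```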

In [43]:
# Start with just the obvious ones, months and 19XX/18XX
re_date_1 = re.compile(
    r"(?:January|February|March|April|May|June|July|August|September|October|November|December)\b|1[89]\d\d"
)

print(len(re_date_1.findall(txt)))

33


In [44]:
# That's not too many to look at, let's print it out
show_regex_results(re_date_1, txt)

Looking through the above, I saw a few patterns to add:
  * Month ..... \*\*\*teen-\*\*\*teen. Let's grab that.
  * Month 9th, 1905

In [45]:
re_date_2 = re.compile(
    r"(?:(?:January|February|March|April|May|June|July|August|September|October|November|December)\b|1[89]\d\d)"
    r"(?:"  # open one big or
        # First, let's do the teen-teen
        r".{0,10}" # let some random characters be in there. . matches everything
        r"[a-z]*teen-[a-z]*teen"
    r"|"    # or
        # Then let's do the 9th, 1906
        r"\s+\d{1,2}(?:st|nd|rd|th),\s+1[89]\d\d" # spell out the suffixes; [thrds]{2} can't match "nd"
    r")?"    # closes that big or, with a ? meaning find 0 or 1
)

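# Note: this still counts re_date_1 matches, hence the unchanged 33.
# Use re_date_2.findall(txt) to count matches of the new pattern.
print(len(re_date_1.findall(txt)))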

33


In [46]:
# That's not too many to look at, let's print it out
show_regex_results(re_date_2, txt)

In [47]:
# I think that'll do.
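
Before trusting a pattern like this on the whole book, it can help to check it against a few hand-written sentences. Below is a sketch of the same idea written with `re.VERBOSE`, which allows whitespace and comments inside the pattern. The sample sentence is made up, and this version spells out the ordinal suffixes as `(?:st|nd|rd|th)`.

```python
import re

re_date_verbose = re.compile(
    r"""
    (?:
        (?:January|February|March|April|May|June|July|August
           |September|October|November|December)\b   # a month name
      | 1[89]\d\d                                     # or a year, 18XX / 19XX
    )
    (?:                                               # optional tail
        .{0,10}[a-z]*teen-[a-z]*teen                  # e.g. ", nineteen-seventeen"
      | \s+\d{1,2}(?:st|nd|rd|th),\s+1[89]\d\d        # e.g. " 5th, 1922"
    )?
    """,
    re.VERBOSE,
)

sample = (
    "It was July 5th, 1922 when he arrived. "
    "Back in June, nineteen-seventeen, and again in 1919."
)
found_dates = re_date_verbose.findall(sample)
print(found_dates)
```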