## Reading Documents

In [1]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines load the tests.
from client.api.assignment import load_assignment 
tests = load_assignment('reading_documents.ok')

Data often come in formats that are less convenient than CSV.  In this exercise, we'll look at the text of some Reuters news reports from 1987.  Our dataset doesn't include *all* the news reports from that year, but it includes 1,000 of them.  Reuters doesn't say how the articles were selected.

We've put the text of all the articles in a file called `reuters.txt`.  The cell below loads that file into a single big string and prints a few thousand characters, which is just enough to see one full article and the start of the next.  (Don't try to print the whole thing, because it's very long.)
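The format specification `{:.5000}` in the next cell is what limits the printout: for a string, the precision (the number after the dot) caps how many characters are used. Here is a minimal sketch of that behavior, using a made-up string (`sample` is illustrative and not part of the dataset):

```python
# For strings, the precision in a format spec truncates the text.
sample = "abcdefghij"  # made-up stand-in for the big Reuters string
truncated = "{:.5}".format(sample)
print(truncated)  # abcde

# Slicing gives the same result.
same_as_slice = (truncated == sample[:5])
print(same_as_slice)  # True
```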

In [2]:
# Just run this cell to load the dataset as one big string of text.
with open('reuters.txt', 'r') as file:
    big_reuters_string = file.read()

print("{:.5000}\n[...]".format(big_reuters_string))

Each article is wrapped in markup: tags in angle brackets, such as `<title>`, surround the parts of the text.  Each article is separated from its neighbors by the string `"***ARTICLE***"`.

**Question 1.** Use the String method `split` to make an array of the text of all the articles.  That is, each entry of this array should be the text of one article.  Put that array in a new table called `reuters` as a column with the name "Raw text".

*Hint:* When you split the articles correctly, you should see that each article starts with `"<reuters..."` and ends with `"...</reuters>"`.  There should be 1,000 articles.

*Hint 2:* As an example, `"steamcleaner".split("ea")` returns the list `['st', 'mcl', 'ner']`, which holds the same items as `make_array('st', 'mcl', 'ner')`.  So you want to split the big string that contains all the data, splitting with the text `"***ARTICLE***"`.
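A few details of `split` are worth seeing on a toy input: it returns one piece per gap between separators, and the separator itself is dropped. The strings below are made up for illustration only.

```python
# Splitting on a multi-character separator.
pieces = "a--b--c".split("--")
print(pieces)       # ['a', 'b', 'c']
print(len(pieces))  # 3

# A separator at the very start (or end) produces an empty string there.
edge_case = "--a--b".split("--")
print(edge_case)    # ['', 'a', 'b']
```

Whether `reuters.txt` begins or ends with a separator isn't stated here, so comparing the number of pieces you get against the expected 1,000 is a useful sanity check.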

In [3]:
reuters = Table().with_column(
        ...
    )
reuters

In [4]:
_ = tests.grade('q1')

Each article has a line containing its title that looks something like this:

    <title>LYNG SETS TOUGH U.S. STANCE WITH JAPAN ON BEEF</title>

You could find that yourself for a few of the articles, but it would be very tedious to do it for all 1,000 articles.  So we'll write code to do it instead.

**Question 2.** Below, we've written a function called `get_text_in_markers` that will help you find the title text for an article.  Use it to write a function called `get_title`, which is also documented below.
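The helper in the next cell finds the marked text with two calls to `split`. As a rough illustration, here is the same idea traced by hand on the example from its documentation, using plain list indexing instead of arrays:

```python
text = "stuff <interesting>yay exciting</interesting> more stuff"

# Step 1: split on the start marker and keep what comes after it.
after_start = text.split("<interesting>")[1]
print(after_start)  # yay exciting</interesting> more stuff

# Step 2: split that on the end marker and keep what comes before it.
inside = after_start.split("</interesting>")[0]
print(inside)  # yay exciting
```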

In [5]:
# This function is provided for you to use.  Read at least
# its documentation (the stuff at the beginning in red).
# You can also type in get_text_in_markers? somewhere and
# run it to see the documentation in a slightly nicer form.
# We haven't used any tools you haven't seen yet, so it
# wouldn't hurt to read the code itself, too.
def get_text_in_markers(text, marker):
    """Finds the part of a piece of text that's between specified markers.
    
    Parameters
    ----------
    text : str
        The text in which you want to find something.
    marker : str
        The name of the marker that delimits the part of the
        text you want to grab.  In the text itself, this string
        will be surrounded by "<>" or "</>", but don't include
        those angle brackets in this argument.
    
    Returns
    -------
    str
        The part of the text between the first start marker and
        the end marker that follows it.
    
    Examples
    --------
    >>> get_text_in_markers("stuff <interesting>yay exciting</interesting> more stuff", "interesting")
    'yay exciting'
    """
    start_marker = "<{}>".format(marker)
    end_marker = "</{}>".format(marker)
    split_before = np.array(text.split(start_marker))
    marker_text_and_after = split_before.item(1)
    split_on_end_marker = np.array(marker_text_and_after.split(end_marker))
    return split_on_end_marker.item(0)

# Fill in this function.
def get_title(article_text):
    """Takes the text of an article and returns its title."""
    ...

# When you're done, this should produce 'LYNG SETS TOUGH U.S. STANCE WITH JAPAN ON BEEF'.
get_title(reuters.column("Raw text").item(0))

In [6]:
_ = tests.grade('q2')

**Question 3.** Now use your function to find the title of every article in `reuters`.  Create a new table called `with_titles` that's a copy of `reuters` with an extra column named "Title" that contains these titles.

*Note:* This might take a few seconds to run.
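Computing something for every article means calling a one-argument function once per item in a column. The sketch below shows that pattern in plain Python with a made-up function and made-up data (`shout` and `raw` are hypothetical names). In the `datascience` library, `Table.apply` plays a similar role for a column of a table.

```python
def shout(text):
    """Hypothetical one-argument function, used only for this illustration."""
    return text.upper() + "!"

raw = ["first item", "second item"]  # made-up stand-in for a column

# Call the function once per item and collect the results in order.
results = [shout(item) for item in raw]
print(results)  # ['FIRST ITEM!', 'SECOND ITEM!']
```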

In [7]:
with_titles = ...
with_titles

In [8]:
_ = tests.grade('q3')

Now we'll go through a similar process to get the date of each article.  In each article, the date is on its own line, separated from the rest of the article by `<date>` and `</date>` markers.  You can check one of the articles for an example.

**Question 4.** Write a function called `get_date`.  It should take as its argument the whole text of an article and return the date.  The date should be just the day of the year (so January 1 is day 1, and February 1 is day 32, since January has 31 days).  Note that all the articles are from the year 1987, so the year is irrelevant.
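To check the day-of-year arithmetic independently, here is a small sketch using only the standard library's `datetime` module (this is a sanity check, not the method you're asked to use):

```python
from datetime import date

# tm_yday counts days from the start of the year, beginning at 1.
jan_1 = date(1987, 1, 1).timetuple().tm_yday
feb_1 = date(1987, 2, 1).timetuple().tm_yday
print(jan_1)  # 1
print(feb_1)  # 32

# 1987 is not a leap year, so March 6 is day 31 + 28 + 6.
march_6 = date(1987, 3, 6).timetuple().tm_yday
print(march_6)  # 65
```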

We've written a function called `date_string_to_day` that will help you do this.
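Before parsing, that function strips extraneous trailing text with the regular expression `" [A-Z]*$"`: a space, then any run of capital letters, then the end of the string. The sketch below applies the same pattern to two made-up strings (the `XYZ` suffix is hypothetical):

```python
import re

# Made-up date strings; the trailing "XYZ" stands in for extraneous text.
with_suffix = "13-SEP-1994 15:02:20.00 XYZ"
without_suffix = "13-SEP-1994 15:02:20.00"

cleaned = re.sub(" [A-Z]*$", "", with_suffix)
print(cleaned)  # 13-SEP-1994 15:02:20.00

# With no trailing run of capital letters, the string is left alone.
unchanged = re.sub(" [A-Z]*$", "", without_suffix)
print(unchanged == without_suffix)  # True
```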

In [31]:
# This function is provided for you to use.  Read at least
# its documentation (the stuff at the beginning in red).
def date_string_to_day(date_string):
    """Converts a string that looks like a date into the day of the year.
    
    Parameters
    ----------
    date_string : str
        Text that contains a date in any reasonable format.
        For example, "September 13, 1994" or "9/13/94" or
        "13-SEP-1994 15:02:20.00" all work.
    
    Returns
    -------
    int
        The day of the year that the date represents.
    
    Examples
    --------
    >>> date_string_to_day("January 3, 2016")
    3
    
    >>> date_string_to_day("February 4, 2000")
    35
    """
    from dateutil import parser
    import re
    # Some of the Reuters dates have extraneous text at the end.
    # This removes that text.
    date_part = re.sub(" [A-Z]*$", "", date_string)
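    try:
        date = parser.parse(date_part)
    except Exception:
        # Report the offending string, then re-raise the error so we
        # never reach the lines below with `date` undefined.
        print("Failed on", date_string)
        raise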
    day_in_year = date.timetuple().tm_yday
    return day_in_year

# Fill in this function.
def get_date(article_text):
    date_text = ...
    day_in_year = ...
    ...

# When you're done, this should produce 92.
get_date(reuters.column("Raw text").item(0))

In [33]:
_ = tests.grade('q4')

**Question 5.** Use your function to find the date of every article in `with_titles`. Create a new table called `with_dates` that's a copy of `with_titles` with an extra column named "Date" that contains the dates.

In [30]:
with_dates = ...
with_dates

In [13]:
_ = tests.grade('q5')

**Question 6.** There was a series of earthquakes in Ecuador on March 6, 1987.  Most Reuters news stories about Ecuador from that period were related to the earthquakes or their political and economic consequences.  Find out when Reuters reported on the earthquakes by making a histogram of all the dates of the articles whose *titles* include the word `"ECUADOR"`.  Use bins of width 3.

*Hint:* The function `are.containing` creates a predicate that matches strings that contain a given string.  You can find its documentation by running `are.containing?`.
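A predicate is just a function that answers True or False for each value. The sketch below builds one in plain Python to show the idea; `containing` here is a hypothetical stand-in, not the library's implementation, and the headlines are made up. Note that substring matching with `in` is case-sensitive.

```python
def containing(substring):
    """Hypothetical stand-in: returns a predicate that tests for a substring."""
    def predicate(text):
        return substring in text
    return predicate

# Made-up headlines for illustration only.
titles = [
    "EXAMPLE HEADLINE ABOUT ECUADOR",
    "EXAMPLE HEADLINE ABOUT JAPAN",
    "Example headline about Ecuador",
]

has_ecuador = containing("ECUADOR")
matches = [title for title in titles if has_ecuador(title)]
print(matches)  # ['EXAMPLE HEADLINE ABOUT ECUADOR']
```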

In [14]:
# Use these bins:
bins = np.arange(0, 375, 3)

...

**Question 7.** Make another histogram of the same data, but this time using different bins.  The first bin should start at day 0, and each bin should have a width of 10 days. Then, **using only your own inspection of the histogram (and no other Python code)**, estimate the proportion of Ecuador articles that were reported between days 50 and 100 of the year (including day 50 but not day 100). (The proportion should be out of the total number of articles whose titles include "ECUADOR".) Give that number the name `proportion_50_to_100`.
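Reading a proportion off a histogram relies on the area principle: each bar's area (height times width) is the share of the data in that bin. The sketch below works through the arithmetic with made-up bar heights, assuming the vertical axis is in percent per unit (percent per day here), which is how `Table.hist` labels its density scale by default. Each bin includes its left endpoint but not its right one.

```python
bin_width = 10  # days

# Made-up bar heights, in percent per day, for the bins
# [50, 60), [60, 70), [70, 80), [80, 90), [90, 100).
heights = [1.0, 4.0, 2.0, 0.5, 0.5]

# Area of each bar = height * width; the areas add up to a percentage.
percent_in_range = sum(height * bin_width for height in heights)
proportion_in_range = percent_in_range / 100
print(proportion_in_range)  # 0.8
```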

In [15]:
# Make a histogram as described above.  (Be sure to use the
# bins described in the question.)
...

# By inspecting your histogram, estimate the proportion of
# Ecuador articles that were reported between days 50 and 100
# of the year.  (It's hard to get exactly the right answer
# from a histogram like this, so it's okay if your answer is
# off by a little bit.)
proportion_50_to_100 = ...

**Question 8.** Your histogram should show several long gaps in coverage about Ecuador during the year.  By exploring the dataset, try to explain this.  Use the code cell below for your explorations.

In [None]:
# Use this cell to answer the question.

*Write your answer here, replacing this text.*

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [tests.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
# Run this cell to submit your work *after* you have passed all of the test cells.
# It's ok to run this cell multiple times. Only your final submission will be scored.

!TZ=America/Los_Angeles jupyter nbconvert --to html --output=".reading_documents_$(date +%m%d_%H%M)_submission.html" reading_documents.ipynb