## Install striprtf to read .RTF documents

The `striprtf` package converts the contents of an .RTF file into a plain-text Python string: https://pypi.org/project/striprtf/

The cell below installs the `striprtf` package in your Colaboratory environment.

To install striprtf on your local computer, run this in your terminal/command line:
```
pip install striprtf
```


In [6]:
!pip install striprtf



## Convert .RTF to text

The code below reads the contents of an .RTF file and converts it to a Python string using the `rtf_to_text` function.

In [7]:
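from striprtf.striprtf import rtf_to_text

# path to the .RTF export
filepath = '01012014 to 01052014.rtf'
# read the raw RTF markup
with open(filepath) as f:
  rtf = f.read()
# convert the RTF markup to plain text and trim surrounding whitespace
text = rtf_to_text(rtf).strip()
print(text)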

REMINDER/ The Consumer Electronics Association to Advance Sustainability Efforts through Las Vegas Donations

670 words
5 January 2014
06:15
Business Wire
BWR
English
(c) 2014  Business Wire. All Rights Reserved.   


--(BUSINESS WIRE)--January 05, 2014-- 

Consumer Electronics Association (CEA):


 
 
WHAT:   The Consumer Electronics Association (CEA)(R) will hold a news 
        conference to announce donations to local Las Vegas organizations to 
        advance clean energy and sustainable living. CEA owns and produces the 
        International CES(R) , the world's gathering place for all who thrive 
        on the business of consumer technologies, which is held annually in 
        Las Vegas. The 2014 International CES opens on Tuesday, January 7 and 
        runs through Friday, January 10. This event is open to credentialed 
        media only. 
------ 
 
WHEN:   Monday, January 6, at 3:00 p.m. PT 
------ 
 
WHO:    Dr. Thomas Piechota, UNLV interim vice president for research

## Segmenting the text into individual documents

The converted text contains many articles in one long string. Each article is followed by an ID on its own line, in the format "Document \<alphanumeric id\>". If you can identify where each of these occurs, you can break the raw text into smaller units, each containing an individual article.

Regular expressions (regex) are a fast way to do this. A regex lets you describe a text pattern and search for it in a string. "Document \<alphanumeric id\>" can be written as the regex `Document \w+`. This notebook uses a stricter version, `^Document \w{25}$`, which only matches a whole line made up of "Document" and a 25-character ID, and splits the text into articles at each match.
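Before running this on the full text, it can help to see the pattern work on a tiny string. The sketch below uses made-up sample text and made-up 25-character IDs (`sample`, `id_patt`, `sample_ids` and `sample_parts` are illustrative names, not part of the notebook). It shows `findall` collecting the ID lines and `split` returning the text between them.

```python
import re

# Toy text standing in for the converted export (invented for illustration)
sample = (
    "First headline\nBody of the first article.\n"
    "Document AAA0000020140105ea1500001\n"
    "Second headline\nBody of the second article.\n"
    "Document BBB0000020140104ea1400002\n"
)

# re.M makes ^ and $ match at the start and end of every line,
# so only lines that consist entirely of an ID are matched
id_patt = re.compile(r'^Document \w{25}$', re.M)

# all ID lines, in order of appearance
sample_ids = id_patt.findall(sample)

# the text between the IDs; drop pieces that are only whitespace
sample_parts = [p.strip() for p in id_patt.split(sample) if p.strip()]

print(sample_ids)
print(sample_parts)
```

Because each ID comes after its article, the first piece returned by `split` belongs to the first ID, the second piece to the second ID, and so on.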

In [13]:
import re
# document IDs: a whole line made up of "Document " plus a 25-character ID
patt = re.compile(r'^Document \w{25}$', re.M)
doc_ids = patt.findall(text)
print(doc_ids)
# split the text at each document ID
articles = patt.split(text)
# strip surrounding whitespace and drop pieces that are empty after stripping
articles = [a.strip() for a in articles if a.strip()]

['Document BWR0000020140105ea1500003', 'Document J000000020140104ea140001p', 'Document PRN0000020140103ea130003n', 'Document BWR0000020140103ea1300079', 'Document J000000020140103ea130001q', 'Document J000000020140103ea130001k', 'Document WSJO000020140104ea1300001', 'Document WCWSJB0020140103ea12000b5', 'Document DJDN000020140103ea13001bh', 'Document DJDN000020140103ea13000ic', 'Document DJDN000020140103ea130006i', 'Document PRN0000020140102ea12000bd', 'Document PRN0000020140102ea120009d', 'Document PRN0000020140102ea1200070', 'Document BWR0000020140102ea120000u', 'Document WSJO000020140103ea1200003', 'Document WSJO000020140102ea120030d', 'Document WCWSJB0020140102ea12005k1', 'Document DJDN000020140102ea12001g9', 'Document DJDN000020140102ea12000y3', 'Document DJDN000020140102ea12000q5']


In [14]:
# Take a look at one of the articles
articles[0]

'REMINDER/ The Consumer Electronics Association to Advance Sustainability Efforts through Las Vegas Donations\n\n670 words\n5 January 2014\n06:15\nBusiness Wire\nBWR\nEnglish\n(c) 2014  Business Wire. All Rights Reserved.   \n\n\n--(BUSINESS WIRE)--January 05, 2014-- \n\nConsumer Electronics Association (CEA):\n\n\n \n \nWHAT:   The Consumer Electronics Association (CEA)(R) will hold a news \n        conference to announce donations to local Las Vegas organizations to \n        advance clean energy and sustainable living. CEA owns and produces the \n        International CES(R) , the world\'s gathering place for all who thrive \n        on the business of consumer technologies, which is held annually in \n        Las Vegas. The 2014 International CES opens on Tuesday, January 7 and \n        runs through Friday, January 10. This event is open to credentialed \n        media only. \n------ \n \nWHEN:   Monday, January 6, at 3:00 p.m. PT \n------ \n \nWHO:    Dr. Thomas Piechota, UNLV in

## Extract dates

In [15]:
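import calendar
# full month names, January through December
month_names = calendar.month_name[1:13]
# dates look like "5 January 2014"
date_patt = re.compile(r'(\d{1,2}) (%s) (\d{4})' % ('|'.join(month_names)), re.M)
article_date = []
for a in articles:
  # take the first date found in the article
  datematch = re.search(date_patt, a)
  if datematch:
    article_date.append(datematch.group(0))
  else:
    article_date.append(None)
    # print any article without a date so it can be inspected
    print(a)
article_date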

['5 January 2014',
 '4 January 2014',
 '3 January 2014',
 '3 January 2014',
 '3 January 2014',
 '3 January 2014',
 '3 January 2014',
 '2 January 2014',
 '3 January 2014',
 '3 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014',
 '2 January 2014']
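The extracted dates are still plain strings. If you want to sort or filter by date, one option is to parse them with the standard library's `datetime.strptime`. The sketch below assumes every non-missing entry follows the "day month-name year" layout shown above and that month names are in English (`%B` depends on the locale). `parse_article_date` is a hypothetical helper name, not part of the original notebook.

```python
from datetime import datetime

def parse_article_date(date_str):
    """Parse a string such as '5 January 2014' into a datetime.date, or return None."""
    if date_str is None:
        return None
    return datetime.strptime(date_str, '%d %B %Y').date()

# a few values in the same format as article_date, plus a missing entry
example_dates = ['5 January 2014', '4 January 2014', None]
parsed_dates = [parse_article_date(d) for d in example_dates]
print(parsed_dates)
```

Applied to the notebook's list, this would be `[parse_article_date(d) for d in article_date]`.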

## Extract time

In [16]:
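# times look like "06:15" (hours:minutes)
time_patt = re.compile(r'(\d{1,2}):(\d{2})', re.M)
article_time = []
for a in articles:
  # take the first time found anywhere in the article; None if there is none
  timematch = re.search(time_patt, a)
  if timematch:
    article_time.append(timematch.group(0))
  else:
    article_time.append(None)
article_time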

['06:15',
 None,
 '09:27',
 '16:21',
 None,
 None,
 '14:00',
 '20:38',
 '09:05',
 '03:44',
 '19:32',
 '16:30',
 '14:05',
 '11:00',
 '04:47',
 '14:02',
 '00:50',
 '08:57',
 '11:00',
 '08:15',
 '07:00']
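Note that `re.search` returns the first `HH:MM` it finds anywhere in the article. Some articles have no time in their header, and an article body can mention a time of its own (the first article contains "3:00 p.m."), so a body time could be picked up by mistake. One possible safeguard, sketched below, is to search only the first few lines of each article. The 10-line cutoff and the names `header_time` and `header_time_patt` are assumptions made for illustration; check them against your own export.

```python
import re

# same idea as time_patt, with word boundaries added
header_time_patt = re.compile(r'\b\d{1,2}:\d{2}\b')

def header_time(article, max_lines=10):
    """Return the first HH:MM in the first `max_lines` lines of the article, else None."""
    header = '\n'.join(article.splitlines()[:max_lines])
    match = header_time_patt.search(header)
    return match.group(0) if match else None

# invented examples: one with a time in the header, one with a time only in the body
with_time = "Headline\n\n670 words\n5 January 2014\n06:15\nBusiness Wire"
without_time = ("Headline\n\n900 words\n4 January 2014\n"
                + "filler line\n" * 10
                + "The event starts at 3:00 p.m.")

print(header_time(with_time))
print(header_time(without_time))
```

The trade-off is that a header longer than the cutoff would be missed, so it is worth comparing the results with the unrestricted search on a few articles.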

## Organize data in a pandas DataFrame

In [18]:
import pandas as pd
df = pd.DataFrame(zip(doc_ids, article_date, article_time, articles), 
                  columns=['document_ids', 'date', 'time', 'text'])
df

| | document_ids | date | time | text |
|---|---|---|---|---|
| 0 | Document BWR0000020140105ea1500003 | 5 January 2014 | 06:15 | REMINDER/ The Consumer Electronics Association... |
| 1 | Document J000000020140104ea140001p | 4 January 2014 | None | Cross Country: How the EPA Sticks Miners With ... |
| 2 | Document PRN0000020140103ea130003n | 3 January 2014 | 09:27 | Eight Finalists Selected for NTTC's Usher Awar... |
| 3 | Document BWR0000020140103ea1300079 | 3 January 2014 | 16:21 | ADVISORY/ The Consumer Electronics Association... |
| 4 | Document J000000020140103ea130001q | 3 January 2014 | None | Corporate Intelligence: Big Waste Hauler Rethi... |
| 5 | Document J000000020140103ea130001k | 3 January 2014 | None | Potomac Watch\nThe Year of the Washington Powe... |
| 6 | Document WSJO000020140104ea1300001 | 3 January 2014 | 14:00 | Opinion\nHow the EPA Sticks Miners With a Moth... |
| 7 | Document WCWSJB0020140103ea12000b5 | 2 January 2014 | 20:38 | WSJ Blogs, 20:38, 2 January 2014, 3233 words, ... |
| 8 | Document DJDN000020140103ea13001bh | 3 January 2014 | 09:05 | Press Release: GE to Hold 2014 Shareowners Mee... |
| 9 | Document DJDN000020140103ea13000ic | 3 January 2014 | 03:44 | Press Release: Lundin Petroleum Spuds Appraisa... |