## Install wget

Run the following command to install wget with Homebrew. If you get a permissions error, switch to an admin user and paste the command into a terminal window (without the leading exclamation point).

In [None]:
!brew install wget

# The following command reloads the system-wide bash configuration. Each "!" command runs in
# its own subshell, so if wget is still not found afterwards, restart the notebook.
!source /etc/bashrc

In [None]:
# The following downloads an EPUB file to the desktop and converts it to plain text. 
# Remove DRM first if needed. (See below.)

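# Change directory to the desktop. Use %cd rather than !cd: a "!" command runs in its own
# subshell, so a directory change made there is lost before the next command runs.
%cd ~/Desktop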

# Download file
!wget http://principalhand.org/workshop-data/Piper_2012_Book-Was-There.epub

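# Convert from EPUB to plain text. The full path is used because shell aliases are not
# available to commands run from a notebook.
!/Applications/calibre.app/Contents/MacOS/ebook-convert Piper_2012_Book-Was-There.epub Piper_2012_Book-Was-There.txt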

# Printing the first 25 lines of the new text file. Open in a text editor to see more.
!head -n 25 Piper_2012_Book-Was-There.txt

## .DOCX to Plain Text

In [None]:
!pip install --user -U docx

In [None]:
%cd ~/Desktop

!wget http://principalhand.org/workshop-data/Patteson_Player-piano.docx

In [None]:
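import os

from docx import opendocx
from docx import getdocumenttext

# expanduser resolves "~" to your home directory, so no username needs to be typed in.
document = opendocx(os.path.expanduser("~/Desktop/Patteson_Player-piano.docx"))

# getdocumenttext returns a list of paragraph strings; join them into one text.
docx_text = "\n\n".join(getdocumenttext(document))

print docx_text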
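
If the docx module is not available, the same extraction can be approximated with the Python standard library alone, because a .docx file is a ZIP archive whose main body is stored in `word/document.xml`. The following is a minimal sketch written for Python 3 (unlike the Python 2 cells in this notebook). The function name `docx_to_text` is hypothetical, and the sketch ignores headers, footnotes, and table layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one paragraph per line (hypothetical helper)."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text` with the path to the downloaded Patteson_Player-piano.docx file should give text comparable to the cell above, though paragraph spacing may differ.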

## Ebook to Plain Text

### Install ebook-convert command line tool (via Calibre)

First, download Calibre and add it to your Applications folder. 

    https://calibre-ebook.com/download_osx

Open a terminal window using an admin/sudoer account and enter the following command:

    sudo nano /etc/bashrc

This opens the bashrc file, a list of commands executed every time a shell session starts, in the Nano text editor. Use the arrow keys to move the cursor to an empty line, then paste in the following line. It creates a global alias pointing to the ebook-convert command-line tool bundled inside the Calibre application.

    alias ebook-convert='/Applications/calibre.app/Contents/MacOS/ebook-convert'

Hit control+X to close the file. When prompted, type 'y' and hit return to save. In any new terminal session, the command 'ebook-convert' will now launch the ebook-convert tool.
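
A shell alias is only visible to interactive shell sessions, not to programs launched from Python. A script that needs ebook-convert can instead look for the binary directly. The following is a minimal Python 3 sketch under that assumption: it checks the PATH first and then falls back to the Calibre location used in the alias above. The function name `find_ebook_convert` is hypothetical.

```python
import os
import shutil

# Location of the tool inside the Calibre application bundle (same path as the alias).
CALIBRE_BINARY = "/Applications/calibre.app/Contents/MacOS/ebook-convert"


def find_ebook_convert(fallback=CALIBRE_BINARY):
    """Return a path to ebook-convert, or None if it cannot be found."""
    on_path = shutil.which("ebook-convert")
    if on_path:
        return on_path
    # Not on the PATH: try the application bundle instead.
    if os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    return None


if __name__ == "__main__":
    binary = find_ebook_convert()
    if binary is None:
        print("ebook-convert not found; is Calibre installed?")
    else:
        print("Found ebook-convert at " + binary)
```

The returned path can then be passed as the first item of an argument list to `subprocess.call`, followed by the input and output file names.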


## Remove Ebook DRM

Download and install DeDRM Tools:

    https://apprenticealf.wordpress.com


[ The rest of this cell is copied from the
[DeDRM readme file](https://raw.githubusercontent.com/apprenticeharper/DeDRM_tools/master/DeDRM_Macintosh_Application/DeDRM%20ReadMe.rtf). ]

    


### ▷ About DeDRM

DeDRM is an application that packs all of the Python DRM removal software into one easy-to-use program that remembers preferences and settings. It works without manual configuration with Kindle for Mac ebooks, Adobe Digital Editions Adept ePub and PDF ebooks, and Barnes & Noble NOOK Study ebooks.

To remove the DRM of Kindle ebooks from eInk Kindles, other Barnes & Noble ePubs, eReader pdb ebooks, or Mobipocket ebooks, you must first run DeDRM application (by double-clicking it) and set some additional Preferences, depending on the origin of your ebook files. [...]

A final preference is the destination folder for the DRM-free copies of your ebooks that the application produces. This can be either the same folder as the original ebook, or a folder of your choice.

Once these preferences have been set, you can drag and drop ebooks (or folders of ebooks) onto the DeDRM droplet to remove the DRM.

This program uses notifications, so really needs Mac OS X 10.8 or above. It will not work on Mac OS X 10.4 or earlier. It might work on Mac OS X 10.5-10.7, but the latest Kindle for Mac does not support those System versions.


### ▷ Installation

Drag the DeDRM application from the DeDRM_Application_Macintosh folder (the location of this ReadMe) to your Applications folder, or anywhere else you find convenient.



### ▷ Use
1. To set the preferences, double-click the application and follow the instructions in the dialogs.

2. Drag & Drop DRMed ebooks or folders of DRMed ebooks onto the application icon when it is not running.



## PDF to Plain Text

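Converting a PDF to text is just as easy, though the results may be messy. Note that this only works for PDFs that contain text data, meaning files that were created digitally or have been OCR'd. A PDF that holds only scanned page images will produce little or no text.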

Enter the following command to install poppler, which provides the pdftotext command-line tool.

In [None]:
!brew install poppler

# Linux version:  apt-get install poppler-utils

In [None]:
# Downloads a fairly clean OCR'd PDF and converts it to plain text, then displays the
# first 50 lines.

%cd ~/Desktop

!wget --quiet http://principalhand.org/workshop-data/Star-and-Bowker_2002_How-to-Infrastructure.pdf
    
!pdftotext Star-and-Bowker_2002_How-to-Infrastructure.pdf Star-and-Bowker_2002_How-to-Infrastructure.txt

!head -n 50 Star-and-Bowker_2002_How-to-Infrastructure.txt

In [None]:
# Downloads a (supposedly) OCR'd PDF and converts it to plain text, then displays the first 
# 50 lines.

%cd ~/Desktop

!wget --quiet http://principalhand.org/workshop-data/Benjnamin_Unpacking-my-Library.pdf
    
!pdftotext Benjnamin_Unpacking-my-Library.pdf Benjnamin_Unpacking-my-Library.txt

!head -n 50 Benjnamin_Unpacking-my-Library.txt
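
Because pdftotext produces little or no usable output for PDFs without text data, a quick check of the converted file can flag documents that still need OCR. The following is a rough Python 3 sketch of such a check. The function name `looks_like_real_text` is hypothetical, and both thresholds are arbitrary starting points rather than tuned values.

```python
def looks_like_real_text(text, min_chars=200, min_letter_ratio=0.5):
    """Heuristically decide whether extracted text is usable.

    Both thresholds are assumptions chosen for illustration only.
    """
    # Drop all whitespace, including the form feeds pdftotext puts between pages.
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return False
    # Messy OCR output tends to contain few alphabetic characters.
    letters = sum(1 for ch in stripped if ch.isalpha())
    return letters / float(len(stripped)) >= min_letter_ratio
```

To use it, read one of the converted .txt files into a string and pass that string to the function. A result of `False` suggests the PDF should be run through OCR before conversion.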

## Web Text to Plain Text

### ▷ Install html2text

In [None]:
!pip install --user -U html2text

In [None]:
# Let's relaunch bash and Python so we can access our newly installed modules.

!source /etc/bashrc

quit()

In [None]:
# Downloads a news article and prints the start of its raw HTML. Uncomment the html2text
# line to convert the HTML to plain text before printing.

from html2text import html2text
import urllib2

url="http://www.chicagotribune.com/news/opinion/editorials/ct-texting-walking-pedestrian-edit-20160818-story.html"
temp_string=urllib2.urlopen(url).read()    # grabs page HTML
#temp_string=html2text(temp_string.decode("utf-8"))  # uncomment to convert (assumes a UTF-8 page)
print temp_string[:1000]
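
To make concrete what an HTML-to-text converter does, here is a much cruder stand-in built only on the Python 3 standard library. It is not how html2text works internally: it produces no Markdown formatting and discards links. The names `TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_html(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

On real news pages this keeps navigation menus and other page furniture along with the article, which is the problem the newspaper module in the next section is designed to address.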

### ▷ Install newspaper

Let's download the [newspaper](https://pypi.python.org/pypi/newspaper) module for Python, which helps parse article-formatted pages on the Web.

In [None]:
# Additional dependencies you may need:

!brew install libxml2 libxslt libtiff libjpeg webp little-cms2

# Dependencies on Linux: apt-get install libxml2-dev libxslt-dev python-dev

!pip install --user -U newspaper

In [None]:
# Let's relaunch Python so we can access our newly installed modules.

quit()

In [None]:
from newspaper import Article

url ='http://www.nytimes.com/2016/08/17/business/when-the-captain-is-mom-accommodating-new-motherhood-at-30000-feet.html'
article = Article(url)
article.download()
print article.html[:1000]

In [None]:
article.parse()
print article.authors
print "\n"
print article.publish_date
print "\n"
print article.text[:1000]

In [None]:
# Let's define a function that takes a URL and returns the article text as a TextBlob.
# Requires the textblob package (pip install --user -U textblob).

from newspaper import Article
from textblob import TextBlob

def article2blob(url):
    article = Article(url)
    article.download()
    article.parse()
    return TextBlob(article.text)

In [None]:
import random

url="http://www.latimes.com/projects/la-fi-manufacturing-boom-mexico/#nt=oft12aH-2la1"

article_blob=article2blob(url)

print random.sample(article_blob.sentences,1)

<a rel="license"
     href="http://creativecommons.org/publicdomain/zero/1.0/">
    <img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" />
  </a>