# Get Some Text

The first thing we want to do is get some text to work with. We might:
- get text from console input
- read text from a file
- use an NLTK text
- get text from the web

As in any programming language, there are usually several ways to accomplish the same thing. The goal here is to show a few ways to get some text; you will find your own favorites as you go.

## Get text from console input

The input() function displays the given prompt, then returns whatever the user types as a string.

In [3]:
raw_text = input("Enter some text: ")
print("You entered", raw_text)

Enter some text: hello world
You entered hello world
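
Console input often arrives with stray spaces. The sketch below shows one way to tidy it before further processing. The `read_text` function and its `reader` parameter are hypothetical names introduced here so the example can run without a keyboard; they are not part of the original notebook.

```python
def read_text(prompt, reader=input):
    """Prompt for text and return it with whitespace tidied up.

    `reader` is a hypothetical hook (it defaults to the built-in input)
    so the function can be exercised without a keyboard.
    """
    raw = reader(prompt)
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so joining with " " normalizes spacing.
    return " ".join(raw.split())


# Simulate a user typing "  hello   world  " instead of reading the console.
cleaned = read_text("Enter some text: ", reader=lambda prompt: "  hello   world  ")
print("You entered", cleaned)
```

In an interactive session you would simply call `read_text("Enter some text: ")` and type at the prompt.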


## Read a file

- The following assumes that sample1.txt is in the same directory as this notebook.
- First we open the file for reading with mode 'r'.
- The file has two lines: the first is a paragraph, and the second gives the source.
- We need to close the file when we are finished.

In [9]:
f = open('sample1.txt', 'r')
text1 = f.read()
print("You read:\n", text1)
f.close()

You read:
 Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
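
If the file is not where the code expects it, open() raises FileNotFoundError. The following is a minimal sketch of one way to handle that, using pathlib from the standard library. The helper name `read_text_file` is an assumption made for illustration, and the example writes its own temporary sample1.txt so it does not depend on the notebook's file.

```python
import tempfile
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the file's contents, or None if the file does not exist."""
    try:
        return Path(path).read_text(encoding=encoding)
    except FileNotFoundError:
        return None


# Build a throwaway directory with a two-line sample file for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample1.txt"
    sample.write_text("A paragraph of text.\nSource: example\n", encoding="utf-8")

    found = read_text_file(sample)
    missing = read_text_file(Path(tmp_dir) / "no_such_file.txt")

print(repr(found))
print(missing)
```

Path.read_text opens, reads, and closes the file in one call, so there is no separate close() step to remember.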


## Read a line at a time

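You can use a for loop to read and process one line at a time. Each line keeps its trailing newline and print() adds another, which is why a blank line appears between the lines in the output.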

In [11]:
f = open('sample1.txt', 'r')
for line in f:
    print(line)
f.close()

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.

Source: https://en.wikipedia.org/wiki/Natural_language_processing
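
The extra blank line in the output above comes from the newline that each line carries. The sketch below makes this visible. It uses io.StringIO as a stand-in for an open file (an assumption made so the example runs without sample1.txt); iterating over it behaves like iterating over a text file.

```python
import io

# StringIO behaves like a text file opened for reading.
f = io.StringIO("first line\nsecond line\n")

raw_lines = []
clean_lines = []
for line in f:
    raw_lines.append(line)                 # still ends with "\n"
    clean_lines.append(line.rstrip("\n"))  # newline removed

print(raw_lines)
print(clean_lines)

# Alternative: keep the line as-is and tell print() not to add a newline.
for line in raw_lines:
    print(line, end="")
```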


## Using "with"

The **with** statement starts a block of code. When the block ends, Python closes the file automatically, even if an error occurs inside the block.

In [10]:
with open('sample1.txt', 'r') as f:
    text1 = f.read()
print("You read:\n", text1)

You read:
 Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
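
To check that the file really is closed, you can inspect the file object's `closed` attribute after the block. The sketch below does this for a normal exit and for an exit caused by an exception. It creates its own temporary file, and the ValueError is raised on purpose purely for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample1.txt"
    path.write_text("some text\n", encoding="utf-8")

    # Normal case: the file is closed as soon as the block ends.
    with open(path, "r", encoding="utf-8") as f:
        text1 = f.read()
    closed_after_block = f.closed

    # Error case: the file is still closed, even though the block failed.
    try:
        with open(path, "r", encoding="utf-8") as g:
            raise ValueError("simulated failure while processing")
    except ValueError:
        closed_after_error = g.closed

print(closed_after_block, closed_after_error)
```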


### Encoding

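Encoding used to be a pain in Python 2 but is less of a problem in Python 3, where all strings are Unicode. Note that the default encoding used by open() depends on your platform and locale (it is often utf-8, but not always), so specify the encoding explicitly if you need to. The strip() method removes leading and trailing whitespace, including the newline.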

In [18]:
with open('sample1.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
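
The encoding only matters once the text contains non-ASCII characters. As a small illustration, the sketch below writes the word "café" as utf-8 and then reads it back twice, once with the matching encoding and once with latin-1. The file name accents.txt is made up for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "accents.txt"

    # Write one line containing a non-ASCII character, encoded as utf-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("café\n")

    # Reading with the same encoding recovers the original text.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read().strip()

    # Reading with a different encoding succeeds silently but garbles the text.
    with open(path, "r", encoding="latin-1") as f:
        wrong = f.read().strip()

print(right)
print(wrong)
```

The second read does not raise an error, which is why a wrong encoding can go unnoticed until you look at the text itself.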


## Read NLTK Corpus

If you want to use the NLTK corpora, see Chapters 1 and 2 of the NLTK book for how to download them.

Notice that we open and read the file in one statement below.

In [24]:
import nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
moby = open(path, 'r').read()
moby[0:100]  # the first 100 characters

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar'

## Get text from the web

urllib is a standard-library package for working with URLs. Below we read text from a web page.

In [31]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
crime = request.urlopen(url).read().decode('utf8')
crime[:1000]

"The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Crime and Punishment\r\n\r\nAuthor: Fyodor Dostoevsky\r\n\r\nRelease Date: March 28, 2006 [EBook #2554]\r\n[Last updated: November 15, 2011]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: ASCII\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***\r\n\r\n\r\n\r\n\r\nProduced by John Bickers; and Dagny\r\n\r\n\r\n\r\n\r\n\r\nCRIME AND PUNISHMENT\r\n\r\nBy Fyodor Dostoevsky\r\n\r\n\r\n\r\nTranslated By Constance Garnett\r\n\r\n\r\n\r\n\r\nTRANSLATOR'S PREFACE\r\n\r\nA few words about Dostoevsky himself may help the English reader to\r\nunderstand his work.\r\n\r\nDostoevsky was the son of a doctor. His paren

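The cell above assumes the page is encoded as utf8. A slightly more defensive sketch is to use the character set the server declares, if any, and fall back to utf-8 otherwise. The function name `fetch_text` is hypothetical. So that the example runs without a network connection, the demo points urlopen at a temporary local file through a file:// URL; with a real web address the call would look the same.

```python
import os
import tempfile
from pathlib import Path
from urllib import request


def fetch_text(url, default_encoding="utf-8"):
    """Read a URL and decode it, preferring the charset declared in the headers."""
    with request.urlopen(url) as response:
        raw = response.read()  # bytes
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset)


# Offline demo: write a small file and read it back through a file:// URL.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Title: Crime and Punishment\r\nAuthor: Fyodor Dostoevsky\r\n".encode("utf-8"))

demo_url = Path(tmp.name).as_uri()
demo_text = fetch_text(demo_url)
os.remove(tmp.name)

print(demo_text.splitlines())
```
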
## Split text into lines

In [35]:
lines = crime.splitlines()
lines[41:45]

["TRANSLATOR'S PREFACE",
 '',
 'A few words about Dostoevsky himself may help the English reader to',
 'understand his work.']
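
The downloaded text uses \r\n line endings, and splitlines() treats each \r\n as a single line break. The short comparison below, on a made-up three-line string, shows how this differs from split("\n") and how to drop the empty lines afterwards.

```python
text = "TRANSLATOR'S PREFACE\r\n\r\nA few words\r\n"

# splitlines() understands \r\n and does not produce a trailing empty item.
with_splitlines = text.splitlines()

# split("\n") leaves the \r characters behind and adds an empty final item.
with_split = text.split("\n")

# Keep only the lines that contain something other than whitespace.
non_blank = [line for line in with_splitlines if line.strip()]

print(with_splitlines)
print(with_split)
print(non_blank)
```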

There are also Python libraries for reading other formats, such as PDF files and RSS feeds, and for extracting text from HTML. You can explore these on your own as needed.
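
As one small example of extracting text from HTML, the sketch below uses html.parser from the standard library. The class `TextExtractor` and the function `html_to_text` are names made up for this illustration, and the approach is deliberately simple: it collects the text between tags and skips the contents of script and style elements. Third-party libraries handle messy real-world pages more robustly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>NLP</h1><p>Get <b>some</b> text.</p></body></html>"
)
plain = html_to_text(sample)
print(plain)
```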