## Get some text

A lot of the nitty-gritty work of NLP, particularly preprocessing, involves simple text processing, so it's essential to know how to manipulate text in Python.

The first step in text processing is to get some text! There are many ways to do this:

* from console input
* from a file
* from the web
* from an NLTK corpus

As with any programming language, there are always many ways to accomplish the same thing. The goal here is to show a few of them; you will develop your own favorite techniques as you go.

### Text from console input

The input() function displays a prompt, waits for the user to type a line of text, and returns that line as a string.

In [1]:
raw_text = input("Enter some text: ")
print("You entered: ", raw_text)

Enter some text: the
You entered:  the
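
Text typed at the console often carries stray whitespace. A minimal sketch of tidying it, assuming a hypothetical helper named `read_text` (not part of Python or NLTK); its `reader` parameter exists only so the function can be tried without a live prompt:

```python
def read_text(prompt, reader=input):
    """Prompt for a line of text and return it without surrounding whitespace.

    `read_text` and its `reader` parameter are hypothetical names for this sketch.
    """
    raw_text = reader(prompt)   # input() returns a str without the trailing newline
    return raw_text.strip()     # drop leading and trailing whitespace


# Stand-in for a user typing "  the quick fox  " at the prompt.
fake_reader = lambda prompt: "  the quick fox  "
cleaned = read_text("Enter some text: ", reader=fake_reader)
print(repr(cleaned))
print(cleaned.split())          # split the text on whitespace into words
```

In an interactive session, calling `read_text("Enter some text: ")` without the `reader` argument uses the real input() function.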


### Read a file

* open a file in the same directory for reading ('r')
* read with the read() function
* close the file

In [1]:
f = open('sample1.txt', 'r')   # open the file for reading
text = f.read()                # read the entire file into one string
print('You read:\n', text)
f.close()                      # release the file handle

You read:
 Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
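
If an error occurs between open() and close(), the file can be left open. One common guard is a `try`/`finally` block. The sketch below first writes its own small file so that it runs anywhere; the file name `demo_sample.txt` and its contents are made up for illustration:

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on sample1.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_sample.txt')
f = open(demo_path, 'w', encoding='utf-8')
f.write('Natural language processing (NLP) is a field of computer science.\n')
f.close()

f = open(demo_path, 'r', encoding='utf-8')
try:
    text = f.read()          # read the whole file into one string
finally:
    f.close()                # runs even if read() raises an exception

print('You read:\n', text)
print('Closed?', f.closed)
```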


### Read a line at a time

The following code uses a *for* loop to process the file one line at a time.

In [2]:
f = open('sample1.txt', 'r')
for line in f:
    print(line)
f.close()

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.

Source: https://en.wikipedia.org/wiki/Natural_language_processing
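
The blank line in the output above appears because each `line` keeps its trailing newline and print() adds another. A small sketch of two ways to avoid the doubled newline, using an in-memory `io.StringIO` object (an assumption made here so the example runs without a file on disk; it can be iterated line by line just like a file):

```python
import io

# io.StringIO behaves like an open text file for iteration purposes.
fake_file = io.StringIO('first line\nsecond line\n')

lines_raw = []
lines_clean = []
for line in fake_file:
    lines_raw.append(line)                 # still ends with '\n'
    lines_clean.append(line.rstrip('\n'))  # newline removed
    print(line, end='')                    # tell print() not to add its own newline

print(lines_raw)
print(lines_clean)
```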


### Using "with"

The *with* statement starts a block of code. When the block ends, even if it ends because of an error, Python closes the file automatically.

In [6]:
with open('sample1.txt', 'r') as f:
    text = f.read()
print("You read:\n", text)

You read:
 Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
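
To confirm that the file really is closed after the block, check the file object's `closed` attribute. A minimal sketch that writes a throwaway file (the name `with_demo.txt` is hypothetical) so it runs on its own:

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

# Write, then read, each inside its own with block.
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('Challenges in NLP frequently involve language understanding.\n')

with open(demo_path, 'r', encoding='utf-8') as f:
    text = f.read()
    inside = f.closed        # False: still inside the block

outside = f.closed           # True: Python closed the file on leaving the block
print(inside, outside)
```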


### Encoding

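Encoding used to be a pain in Python 2 but is less of a problem in Python 3, where strings are Unicode. When no encoding is given, open() uses a platform-dependent default, which is often but not always UTF-8, so it is safest to specify the encoding explicitly. The strip() function removes leading and trailing whitespace, including the newline at the end of each line.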

In [3]:
with open('sample1.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line.strip())

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.
Source: https://en.wikipedia.org/wiki/Natural_language_processing
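
Specifying the encoding matters most when the text contains non-ASCII characters. The sketch below writes a short UTF-8 file and reads it back twice, once with the matching encoding and once with `latin-1`; the file name `encoding_demo.txt` and the sample phrase are made up for illustration:

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'encoding_demo.txt')

# Write a line containing non-ASCII characters as UTF-8 bytes.
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('naïve café\n')

# Reading with the matching encoding recovers the text.
with open(demo_path, 'r', encoding='utf-8') as f:
    good = f.read().strip()

# Reading with a different encoding raises no error here but garbles the text.
with open(demo_path, 'r', encoding='latin-1') as f:
    bad = f.read().strip()

# ascii() shows non-ASCII characters as escape sequences.
print(ascii(good))
print(ascii(bad))
print(len(good), len(bad))
```

The mismatched read is longer because each accented character was stored as two bytes in UTF-8, and `latin-1` treats every byte as a separate character.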


### Get text from the web

The urllib package in the standard library contains modules for working with URLs. Below, we use urllib.request to read text from a web page.

In [4]:
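from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
# urlopen() returns a response object; read() gives bytes, decode() gives a str
with request.urlopen(url) as response:
    crime = response.read().decode('utf8')
crime[:1000]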

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Crime and Punishment\r\n\r\nAuthor: Fyodor Dostoevsky\r\n\r\nRelease Date: March 28, 2006 [EBook #2554]\r\nLast Updated: October 27, 2016\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***\r\n\r\n\r\n\r\n\r\nProduced by John Bickers; and Dagny\r\n\r\n\r\n\r\n\r\n\r\nCRIME AND PUNISHMENT\r\n\r\nBy Fyodor Dostoevsky\r\n\r\n\r\n\r\nTranslated By Constance Garnett\r\n\r\n\r\n\r\n\r\nTRANSLATOR’S PREFACE\r\n\r\nA few words about Dostoevsky himself may help the English reader to\r\nunderstand his work.\r\n\r\nDostoevsky was the son of a doctor. His pa
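
The output above begins with `\ufeff`, a byte order mark (BOM) that some UTF-8 files carry. Python's `utf-8-sig` codec drops it while decoding. A minimal sketch on a hard-coded byte string (a stand-in for what `urlopen(url).read()` returns, so no network access is needed):

```python
# Stand-in for the first bytes of a downloaded file: UTF-8 BOM followed by text.
raw_bytes = b'\xef\xbb\xbfThe Project Gutenberg EBook\r\n\r\nTitle: Crime and Punishment\r\n'

with_bom = raw_bytes.decode('utf-8')         # keeps '\ufeff' as the first character
without_bom = raw_bytes.decode('utf-8-sig')  # strips the BOM if present

print(repr(with_bom[:12]))
print(repr(without_bom[:11]))

# splitlines() handles the '\r\n' line endings seen in the download.
lines = [line for line in without_bom.splitlines() if line]
print(lines)
```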

### Get NLTK corpus text

The following assumes that NLTK and its corpora have been installed (for example, with nltk.download('gutenberg')).

In [5]:
import nltk
# Locate the corpus file on disk, then read it like any other text file.
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
with open(path, 'r') as f:
    moby = f.read()
moby[:100]

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar'
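
NLTK also provides corpus reader objects that hide the file path. A sketch using `nltk.corpus.gutenberg`, assuming the `gutenberg` corpus has been downloaded; the `try`/`except` guard and the `None` fallback are choices made for this example so that it fails gently when NLTK or the corpus is missing:

```python
try:
    from nltk.corpus import gutenberg
    # raw() returns the whole file as one string.
    moby_raw = gutenberg.raw('melville-moby_dick.txt')
except (ImportError, LookupError):
    # ImportError: nltk is not installed; LookupError: corpus not downloaded.
    moby_raw = None

if moby_raw is not None:
    print(moby_raw[:100])
else:
    print("NLTK or the gutenberg corpus is not available")
```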

There are also Python libraries for reading PDF files, RSS feeds, and more. You can explore these on your own as needed.

