In [11]:
! rm -f moby_dick.txt*
! wget https://raw.githubusercontent.com/brunomurino/ML_projects/master/datasets/moby_dick.txt

--2018-11-01 11:13:58--  https://raw.githubusercontent.com/brunomurino/ML_projects/master/datasets/moby_dick.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.36.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 207043 (202K) [text/plain]
Saving to: 'moby_dick.txt'


2018-11-01 11:13:58 (7.16 MB/s) - 'moby_dick.txt' saved [207043/207043]



# Types of text files
* Plain text files, such as .txt
* Text in the form of tables, such as .csv

In [12]:
# Reading a text file
filename = 'moby_dick.txt'

# Open a connection to the file. 'r' means read-only: we cannot write to or overwrite the file.
file = open(filename, mode='r')
text = file.read()
# Close the connection. Always do this once you are done with the file.
file.close()

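If instead we want to write something to the file, we set `mode = 'w'`. Note that `'w'` empties an existing file as soon as it is opened; to add to the end of a file instead, use `mode = 'a'`.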
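As a minimal sketch of write mode, the snippet below writes to a scratch file twice and reads it back. The scratch path and the sample sentences are hypothetical, chosen so that `moby_dick.txt` is left untouched.

```python
import os
import tempfile

# Hypothetical scratch file, so the real moby_dick.txt is not overwritten
scratch_path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

# 'w' creates the file, or empties it first if it already exists
file = open(scratch_path, mode='w')
file.write('Call me Ishmael.\n')
file.close()

# Opening with 'w' a second time discards what was written before
file = open(scratch_path, mode='w')
file.write('Some years ago...\n')
file.close()

# Read the file back to see what survived
file = open(scratch_path, mode='r')
contents = file.read()
file.close()

print(contents)
```

Only the second sentence remains in the file, which is why `'w'` should be used with care on files you want to keep.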

# Context Managers

When working with many files or doing complex operations, it is easy to forget to close a file. To avoid this problem we have [Context Managers](https://docs.python.org/3/reference/datamodel.html#context-managers), which we use with the `with` statement. The [compound statements reference](https://docs.python.org/3/reference/compound_stmts.html) and [this blog post](https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/) help us understand what is going on under the hood. An example would be:

In [None]:
with open(filename, mode = 'r') as file:
    print(file.read())
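To see what the context manager does for us, the sketch below checks `file.closed` inside the block, after the block, and after an error is raised inside the block. The sample file and its contents are hypothetical, created only so the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file so the example is self-contained
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, mode='w') as file:
    file.write('first line\nsecond line\n')

# Inside the block the file is open; once the block ends, it has been closed
with open(sample_path, mode='r') as file:
    closed_inside = file.closed
    text = file.read()
closed_after = file.closed

print(closed_inside)
print(closed_after)

# The file is also closed when an error interrupts the block
try:
    with open(sample_path, mode='r') as file:
        raise RuntimeError('something went wrong mid-read')
except RuntimeError:
    closed_after_error = file.closed

print(closed_after_error)
```

The last check is the main advantage over calling `close()` by hand: with a manual `close()`, an error raised before that line would leave the file open.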

# Importing entire text files

In this exercise, you'll be working with the file moby_dick.txt. It is a text file that contains the opening sentences of Moby Dick, one of the great American novels! Here you'll get experience opening a text file, printing its contents to the shell and, finally, closing it.

* Open the file moby_dick.txt as read-only and store it in the variable file. Make sure to pass the filename enclosed in quotation marks ''.
* Print the contents of the file to the shell using the print() function. As Hugo showed in the video, you'll need to apply the method read() to the object file.
* Check whether the file is closed by executing print(file.closed).
* Close the file using the close() method.
* Check again that the file is closed as you did above.

In [15]:
# Open a file: file
file = open('moby_dick.txt', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

MOBY DICK

CONTENTS 

CHAP. PAGE 

I. LOOMINGS . 1 

II. THE CARPET-BAG ...... 8 

III. THE SPOUTER-INN . . . . . . 13 

IV. THE COUNTERPANE . . . . . 31 
V. BREAKFAST ...... 36 

VI. THE STREET . . . . . 39 

VII. THE CHAPEL . . . . . . 42 

VIII. THE PULPIT ....... 46 

IX. THE SERMON ...... 49 

X. A BOSOM FRIEND ...... 60 

XI. NIGHTGOWN 65 

XII. BIOGRAPHICAL ...... 68 

XIII. WHEELBARROW . . . . . . 71 

XIV. NANTUCKET ....... 77 

XV. CHOWDER ....... 80 

XVI. THE SHIP . 84 

XVII. THE RAMADAN ...... 102 

XVHI. HIS MARK ....... 110 

XIX. THE PROPHET . . . . . .115 

XX. ALL ASTIR ....... 119 

XXI. GOING ABOARD ...... 122 

XXII. MERRY CHRISTMAS . . . . .126 

XXIII. THE LEE SHORE . . . . . .132 

XXIV. THE ADVOCATE . . . . . .134 

XXV. POSTSCRIPT . . . . . 140 

XXVI. KNIGHTS AND SQUIRES . . . .141 

XXVII. KNIGHTS AND SQUIRES .... 145 

XXVIII. AHAB ....... 151 

vii 



viii MOBY-DICK 

CHAP. PAGE 

XXIX. ENTER AHAB ; TO HIM, STUBB . . .156 

XXX. THE PIPE ...... 160 

XXX

# Importing text files line by line
For large files, you may not want to print all of the content to the shell; you may wish to print only the first few lines. Enter the readline() method, which allows you to do this. When a file called file is open, you can print out the first line by executing print(file.readline()). If you execute the same command again, the second line will print, and so on.

In the introductory video, Hugo also introduced the concept of a context manager. He showed that you can bind a variable file by using a context manager construct:

    with open('huck_finn.txt') as file:

While still within this construct, the variable file will be bound to open('huck_finn.txt'); thus, to print the file to the shell, all the code you need to execute is:

    with open('huck_finn.txt') as file:
        print(file.read())

You'll now use these tools to print the first few lines of moby_dick.txt!

* Open moby_dick.txt using the with context manager and the variable file.
* Print the first three lines of the file to the shell by using readline() three times within the context manager.

In [16]:
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

MOBY DICK



CONTENTS 
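The extra blank lines in the output above appear because readline() keeps each line's trailing newline and print() adds another one. As a sketch of an alternative, the snippet below takes the first three lines with `itertools.islice` and prints them with `end=''`. The sample file is a hypothetical stand-in for the start of moby_dick.txt.

```python
import os
import tempfile
from itertools import islice

# Hypothetical stand-in for the first lines of moby_dick.txt
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as file:
    file.write('MOBY DICK\n\nCONTENTS\n\nCHAP. PAGE\n')

# A file object yields its lines one at a time; islice stops after three
with open(sample_path) as file:
    first_three = list(islice(file, 3))

# Each line still ends with '\n', so tell print() not to add a second one
for line in first_three:
    print(line, end='')
```

Because only three lines are ever read, this approach works the same way on very large files.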



# Why we like flat files and the Zen of Python
In PythonLand, there are currently hundreds of Python Enhancement Proposals, commonly referred to as PEPs. PEP8, for example, is a standard style guide for Python, written by our sensei Guido van Rossum himself. It is the basis for how we here at DataCamp ask our instructors to style their code. Another one of my favorites is PEP20, commonly called the Zen of Python. Its abstract is as follows:

Long time Pythoneer Tim Peters succinctly channels the BDFL's guiding principles for Python's design into 20 aphorisms, only 19 of which have been written down.

If you don't know what the acronym BDFL stands for, I suggest that you look [here](https://docs.python.org/3.3/glossary.html#term-bdfl). You can print the Zen of Python in your shell by typing `import this` into it! You're going to do this now and the 5th aphorism (line) will say something of particular interest.

The question you need to answer is: what is the 5th aphorism of the Zen of Python?
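One way to check the answer programmatically is sketched below. It relies on a CPython implementation detail rather than a documented API: the `this` module stores the text in the attribute `this.s`, encoded with ROT13. Note that the printed text starts with a title line and a blank line, so the 5th aphorism is not the 5th printed line.

```python
import codecs
import contextlib
import io

# Importing `this` prints the Zen of Python; silence that output here
with contextlib.redirect_stdout(io.StringIO()):
    import this

# Assumption: CPython keeps the ROT13-encoded text in `this.s`
zen_lines = codecs.decode(this.s, 'rot13').splitlines()

# Drop the title line and any blank lines, keeping only the aphorisms
aphorisms = [line for line in zen_lines[1:] if line.strip()]
fifth = aphorisms[4]

print(fifth)
```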

# Using NumPy to import flat files
In this exercise, you're now going to load the MNIST digit recognition dataset using the numpy function `loadtxt()` and see just how easy it can be:

* The first argument will be the filename.
* The second will be the delimiter which, in this case, is a comma.

You can find more information about the MNIST dataset [here](http://yann.lecun.com/exdb/mnist/) on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.

* Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the delimiter.
* Fill in the argument of print() to print the type of the object digits. Use the function type().
* Execute the rest of the code to visualize one of the rows of the data.

In [None]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select a row, skipping its first column, and reshape the rest to 28x28
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

print(im.shape)

# Plot reshaped data
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
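If digits.csv is not at hand, the same steps can be tried on a tiny in-memory example. The data below is hypothetical: two rows, each assumed to hold a label in the first column followed by four "pixel" values, so the image is 2x2 rather than 28x28. `np.loadtxt()` accepts a file-like object such as `io.StringIO` in place of a filename.

```python
import io

import numpy as np

# Hypothetical stand-in for digits.csv: a label, then 4 pixel values per row
csv_text = "5,0,255,255,0\n3,255,0,0,255\n"

# Load the comma-delimited text into a 2D array of floats
digits = np.loadtxt(io.StringIO(csv_text), delimiter=',')
print(type(digits))
print(digits.shape)

# Separate the assumed label column from the pixel columns
labels = digits[:, 0]
pixels = digits[:, 1:]

# Reshape the first row of pixels into a square image
im_sq = np.reshape(pixels[0], (2, 2))
print(im_sq)
```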

# Customizing your NumPy import
What if there are rows, such as a header, that you don't want to import? What if your file has a delimiter other than a comma? What if you only wish to import particular columns?

There are a number of arguments that np.loadtxt() takes that you'll find useful:

* `delimiter` changes the delimiter that loadtxt() is expecting; for example, you can use ',' and '\t' for comma-delimited and tab-delimited files respectively.
* `skiprows` allows you to specify how many rows (not indices) you wish to skip.
* `usecols` takes a list of the indices of the columns you wish to keep.

The file that you'll be importing, digits_header.txt,

* has a header
* is tab-delimited.


* Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns.
* Complete the argument of the print() call in order to print the entire array that you just imported.

In [None]:
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data, keeping the first and third columns (indices 0 and 2): data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print data
print(data)
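The effect of `skiprows` and `usecols` is easier to see on a small example. The tab-delimited text below is hypothetical, with made-up column names and values standing in for digits_header.txt.

```python
import io

import numpy as np

# Hypothetical tab-delimited data with a one-line header
tsv_text = "label\tpx1\tpx2\tpx3\n1\t10\t20\t30\n7\t40\t50\t60\n"

# Skip the header row and keep the first and third columns (indices 0 and 2)
data = np.loadtxt(io.StringIO(tsv_text), delimiter='\t',
                  skiprows=1, usecols=[0, 2])

print(data)
```

Column indices start at 0, so the first and third columns are `[0, 2]`, while `skiprows=1` is a count of rows rather than an index.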


# Importing different datatypes
The file seaslug.txt

* has a text header, consisting of strings
* is tab-delimited.

These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period. Read more [here](http://www.stat.ucla.edu/projects/datasets/seaslug-explanation.html).

Due to the header, if you tried to import it as-is using np.loadtxt(), NumPy would raise a ValueError telling you that it could not convert a string to float. There are two ways to deal with this: firstly, you can set the data type argument dtype equal to str (for string), so that every value, including the header, is imported as text.

Alternatively, you can skip the first row as we have seen before, using the skiprows argument.

* Complete the first call to np.loadtxt() by passing file as the first argument.
* Execute print(data[0]) to print the first element of data.
* Complete the second call to np.loadtxt(). The file you're importing is tab-delimited, the datatype is float, and you want to skip the first row.
* Print the 10th element of data_float by completing the print() command. Be guided by the previous print() call.
* Execute the rest of the code to visualize the data.

In [None]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
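The three behaviours described above (the ValueError, importing as strings, and skipping the header) can be compared side by side on a small example. The text below is a hypothetical stand-in for seaslug.txt, with made-up column names and values.

```python
import io

import numpy as np

# Hypothetical stand-in for seaslug.txt: a text header, then numeric rows
tsv_text = "Time\tPercent\n99\t0.067\n99\t0.133\n"

# With default settings the header cannot be converted to float
try:
    np.loadtxt(io.StringIO(tsv_text), delimiter='\t')
    import_failed = False
except ValueError:
    import_failed = True
print(import_failed)

# Option 1: import everything as strings, header included
data = np.loadtxt(io.StringIO(tsv_text), delimiter='\t', dtype=str)
print(data[0])

# Option 2: skip the header row and import the rest as floats
data_float = np.loadtxt(io.StringIO(tsv_text), delimiter='\t',
                        dtype=float, skiprows=1)
print(data_float[1])
```

With `dtype=str` the numbers arrive as text and would need converting before any arithmetic, so skipping the header is usually the more convenient option for numeric work.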
