# Importing Flat Files

## Plain Text Files

- e.g. txt, csv

### 1. Importing the whole file

In [5]:
filename = "./Data/Txt_Tester.txt"
file = open(filename, mode="r") # "r" is to read, "w" is to write

text = file.read()
print(text)

# close the connection
file.close()
print(file.closed)

The quick brown fox jumps over the lazy dog. The lazy dog is unstirred. The fox tries again. And again.

And again.

Yet, the dog continues to snore gruffly, clear in deep slumber. 

As he exhales, the air whistled through his nose. This gave the clever fox an idea. The dog's next breath was met with a shrill sound. The dog jolted awake.

"What are you DOING?!" The dog exclaimed, clearly startled.

"I did nothing", the fox simply replied. "It was you."

The fox pointed at the dog's snout. Sure enough, at the end of his nose, fit snugly in his nostril, was a silver whistle.
True


You can avoid having to close the connection manually each time by using a context manager:

In [6]:
# the file is bound to this context and closes automatically when the context is left
with open("./Data/Txt_Tester.txt") as file:
    print(file.read())

The quick brown fox jumps over the lazy dog. The lazy dog is unstirred. The fox tries again. And again.

And again.

Yet, the dog continues to snore gruffly, clear in deep slumber. 

As he exhales, the air whistled through his nose. This gave the clever fox an idea. The dog's next breath was met with a shrill sound. The dog jolted awake.

"What are you DOING?!" The dog exclaimed, clearly startled.

"I did nothing", the fox simply replied. "It was you."

The fox pointed at the dog's snout. Sure enough, at the end of his nose, fit snugly in his nostril, was a silver whistle.
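To see that the context manager really closes the file, here is a minimal sketch. It writes a throwaway temporary file (an assumption, so the example does not depend on `./Data`) and checks `file.closed` inside and outside the `with` block.

```python
import os
import tempfile

# Create a small throwaway file (illustrative content, not the notebook's data)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The quick brown fox\njumps over the lazy dog.\n")
    path = tmp.name

with open(path, mode="r", encoding="utf-8") as file:
    text = file.read()
    closed_inside = file.closed   # still open inside the context

closed_outside = file.closed      # closed automatically once the block exits

print(closed_inside, closed_outside)
os.remove(path)                   # clean up the temporary file
```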


### 2. Importing Line by Line

In [11]:
with open("./Data/Txt_Tester.txt") as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

The quick brown fox jumps over the lazy dog. The lazy dog is unstirred. The fox tries again. And again.



And again.
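The extra blank lines in the output appear because `readline()` keeps each line's trailing newline and `print()` adds another. A small sketch of this, using `io.StringIO` as a stand-in for an open text file (the sample text is an assumption), also shows that a file object can be iterated over directly:

```python
import io

# io.StringIO behaves like an open text file; the content is illustrative
sample = io.StringIO("The fox tries again.\n\nAnd again.\n")

first = sample.readline()   # the trailing "\n" is kept
print(repr(first))

sample.seek(0)              # rewind to the start of the "file"
# Iterating over a file object yields one line at a time
stripped = [line.rstrip("\n") for line in sample]
print(stripped)
```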



## Flat Files

- text files containing records (i.e. table data) without structured relationships.
- fields are often separated by a delimiter (e.g. the fields in comma-separated values (.csv) files are separated by commas).
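As a rough illustration of what "records separated by a delimiter" means, the sketch below parses a small comma-delimited string with the standard library's `csv` module. The in-memory sample is an assumption that stands in for a file on disk.

```python
import csv
import io

# Two records plus a header row, comma-delimited (illustrative sample)
raw = "First Name,Last Name,Age\nTrisha,Wong,12\nJoe,Osman,30\n"

reader = csv.reader(io.StringIO(raw), delimiter=",")
header = next(reader)    # first row holds the field names
records = list(reader)   # remaining rows are the records, as lists of strings

print(header)
print(records)
```

Note that every field comes back as a string; converting types is left to the caller, which is one reason to reach for NumPy or pandas.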

### 1. Using Numpy

- best if all the data are of the same type
- especially useful for purely numeric data
- essential for other packages (e.g. scikit-learn)

In [40]:
import numpy as np

filename = "./Data/MNIST.txt"

# specify delimiter, \t is tab
# skip header
# select relevant columns
# set data type
data = np.loadtxt(filename, delimiter="\t", skiprows=1, usecols=np.arange(0,8,2), dtype=str)
data

array([['0', '0', '0', '0'],
       ['0', '200', '200', '0'],
       ['20', '250', '250', '20'],
       ['20', '250', '250', '20'],
       ['20', '250', '250', '20'],
       ['0', '200', '200', '0'],
       ['0', '0', '0', '0']], dtype='<U3')
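The cell above loads everything as strings because of `dtype=str`. If the file is purely numeric, a numeric `dtype` makes the array usable for arithmetic straight away. A minimal sketch with a made-up tab-delimited sample held in memory (the column names and values are assumptions):

```python
import io
import numpy as np

# Tab-delimited sample with one header row (hypothetical data)
raw = "a\tb\tc\td\n0\t1\t2\t3\n4\t5\t6\t7\n"

# Skip the header, keep columns 0 and 2, and parse the values as integers
data = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1, usecols=[0, 2], dtype=int)

print(data)
print(data.sum())  # arithmetic works because the values are numeric
```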

What about files whose columns have different types? Use **np.genfromtxt()** or **np.recfromcsv()** instead.

In [27]:
# "," is the delimiter
# names=True reads the field names from the header row
# dtype=None lets NumPy infer the type of each column separately
# each element of the resulting structured array is one row
data = np.genfromtxt("./Data/people.csv", delimiter=',', encoding="ASCII", names=True, dtype=None)
print(data)

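# the default delimiter is ","
# default names=True
# default dtype=None
# note: np.recfromcsv was removed from the main namespace in NumPy 2.0; use the genfromtxt call above there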
data2 = np.recfromcsv("./Data/people.csv", encoding="ASCII")
print(data2)

[('Trisha', 'Wong', 12, 'Female') ('Joe', 'Osman', 30, 'Male')
 ('Rebecca', 'Tan', 27, 'Female') ('Kai Wen', 'Ong', 18, 'Male')]
[('Trisha', 'Wong', 12, 'Female') ('Joe', 'Osman', 30, 'Male')
 ('Rebecca', 'Tan', 27, 'Female') ('Kai Wen', 'Ong', 18, 'Male')]
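The result of `np.genfromtxt` with `names=True` is a structured array, so columns can be selected by field name and rows by position. A short sketch, assuming a simplified in-memory sample with space-free column names (`first`, `last`, `age` are illustrative, not the headers of `people.csv`):

```python
import io
import numpy as np

# Simplified stand-in for a CSV with mixed column types
raw = "first,last,age\nTrisha,Wong,12\nJoe,Osman,30\n"

people = np.genfromtxt(io.StringIO(raw), delimiter=",", names=True,
                       dtype=None, encoding="utf-8")

print(people.dtype.names)    # field names taken from the header row
print(people["age"])         # one column, selected by field name
print(people[0])             # one row, selected by position
print(people["age"].mean())  # numeric columns support arithmetic
```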


### 2. Using Pandas

However, a NumPy array has no row or column labels, and it is not ideal when columns hold values of different types.

Pandas, on the other hand, allows data to be manipulated, sliced, reshaped, grouped, joined and merged. It also lets us compute statistics, work with time series, etc.

This is because pandas imports flat files as DataFrames, which are better suited to complex data handling.

In [36]:
import pandas as pd

filename = "./Data/people.csv"
data = pd.read_csv(filename)
data

| | First Name | Last Name | Age | Gender |
|---|---|---|---|---|
| 0 | Trisha | Wong | 12 | Female |
| 1 | Joe | Osman | 30 | Male |
| 2 | Rebecca | Tan | 27 | Female |
| 3 | Kai Wen | Ong | 18 | Male |
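To give a feel for the slicing and grouping mentioned above, here is a minimal sketch. It rebuilds the same four records from an in-memory string (an assumption, so the example runs without `./Data/people.csv`), then filters rows and computes a grouped mean.

```python
import io
import pandas as pd

# Same records as shown above, held in memory instead of on disk
raw = """First Name,Last Name,Age,Gender
Trisha,Wong,12,Female
Joe,Osman,30,Male
Rebecca,Tan,27,Female
Kai Wen,Ong,18,Male
"""
people = pd.read_csv(io.StringIO(raw))

# Slice: keep only rows where Age is at least 18
adults = people[people["Age"] >= 18]

# Group: mean age per gender
mean_age_by_gender = people.groupby("Gender")["Age"].mean()

print(adults)
print(mean_age_by_gender)
```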


In [38]:
# "," delimited, comments after "#", first 4 rows, no headers, reads "Nothing" as NA
filename = "./Data/MNIST2.csv"
data = pd.read_csv(filename, sep=",", comment="#", nrows=4, header=None, na_values="Nothing")
data

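| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 10.0 | 20 | NaN | 0.0 | 0 |
| 1 | 0 | 10 | NaN | 70 | 20.0 | 10.0 | 0 |
| 2 | 10 | 20 | 70.0 | 200 | 70.0 | NaN | 10 |
| 3 | 20 | 70 | 200.0 | 256 | 200.0 | 70.0 | 20 |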
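The effect of each `read_csv` option is easier to see on a tiny input. The sketch below uses a made-up in-memory sample (not the contents of `MNIST2.csv`) with one comment line, four data rows and the placeholder `Nothing` for missing values.

```python
import io
import pandas as pd

# Hypothetical sample: a comment line, then four headerless data rows
raw = """# pixel intensities (hypothetical sample)
0,10,Nothing
10,Nothing,20
20,70,200
30,80,210
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=",",              # fields are comma-delimited
    comment="#",          # ignore everything after "#"
    nrows=2,              # read only the first 2 data rows
    header=None,          # no header row; columns are numbered 0, 1, 2
    na_values="Nothing",  # treat "Nothing" as missing (NaN)
)

print(df)
print(df.isna().sum().sum())  # number of missing values
```

Columns that contain a missing value are read as floats, which is why some columns in the output above show `10.0` rather than `10`.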


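We can also convert the DataFrame to a NumPy array.

In [32]:
# re-read people.csv, since `data` was reassigned in the previous cell
data = pd.read_csv("./Data/people.csv")
data_array = data.to_numpy()  # recommended alternative to the older data.values
data_array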

array([['Trisha', 'Wong', 12, 'Female'],
       ['Joe', 'Osman', 30, 'Male'],
       ['Rebecca', 'Tan', 27, 'Female'],
       ['Kai Wen', 'Ong', 18, 'Male']], dtype=object)
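The `dtype=object` in the output above is a consequence of mixing strings and numbers in one array. A small sketch, assuming a two-row DataFrame built by hand, shows that converting only the numeric columns yields a numeric array:

```python
import pandas as pd

# Tiny hand-built frame with one text column and one integer column
df = pd.DataFrame({"First Name": ["Trisha", "Joe"], "Age": [12, 30]})

mixed = df.to_numpy()             # strings + integers -> generic object array
numeric = df[["Age"]].to_numpy()  # integers only -> integer array

print(mixed.dtype)
print(numeric.dtype)
```

Object arrays lose most of NumPy's speed advantages, so it is usually worth selecting the numeric columns before converting.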