## Reading Data

The most convenient way to work with data is to load it directly into memory.

When you load a file (of any type), the entire dataset is available at all times and the loading process is quite direct:

In [2]:
with open("files/SeaIce.txt", 'r') as input_file:
    print('File content:\n' + input_file.read())

File content:
year mo    data_type region extent   area
1979  1      Goddard      N  15.54  12.33
1980  1      Goddard      N  14.96  11.85
1981  1      Goddard      N  15.03  11.82
1982  1      Goddard      N  15.26  12.11
1983  1      Goddard      N  15.10  11.92
1984  1      Goddard      N  14.61  11.60
1985  1      Goddard      N  14.86  11.60
1986  1      Goddard      N  15.02  11.79
1987  1      Goddard      N  15.20  11.81
1988  1        -9999      N  -9999  -9999
1989  1      Goddard      N  15.12  13.11
1990  1      Goddard      N  14.95  12.72
1991  1      Goddard      N  14.46  12.49
1992  1      Goddard      N  14.72  12.54
1993  1      Goddard      N  15.08  12.85
1994  1      Goddard      N  14.82  12.80
1995  1      Goddard      N  14.62  12.72
1996  1      Goddard      N  14.21  12.07
1997  1      Goddard      N  14.47  12.30
1998  1      Goddard      N  14.81  12.73
1999  1      Goddard      N  14.47  12.54
2000  1      Goddard      N  14.41  12.22
2001  1      Goddard

The entire dataset is loaded from the file into memory. Of course, the loading process will fail if your system lacks sufficient memory to hold the dataset. When this problem occurs, you need to consider other techniques for working with the dataset, such as **streaming** it or **sampling** it.
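
One rough way to choose between these approaches is to check the file size before loading. The sketch below is only an illustration: the ``choose_strategy`` helper and the 100 MB threshold are assumptions, not part of the original notebook, and the size on disk is only an approximation of the memory the data needs once loaded.

```python
import os
import tempfile

# Hypothetical threshold: larger files are streamed instead of loaded at once.
MAX_BYTES_IN_MEMORY = 100 * 1024 * 1024  # 100 MB, an arbitrary choice

def choose_strategy(path, max_bytes=MAX_BYTES_IN_MEMORY):
    """Return 'load' if the file is small enough to read at once, else 'stream'."""
    size = os.path.getsize(path)
    return 'load' if size <= max_bytes else 'stream'

# Demonstration on a small temporary file so the example runs on its own
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("year mo extent\n1979 1 15.54\n")
    demo_path = tmp.name

strategy_default = choose_strategy(demo_path)
strategy_tiny_limit = choose_strategy(demo_path, max_bytes=5)
print(strategy_default, strategy_tiny_limit)

os.remove(demo_path)
```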

Here’s an example of how you can stream data using Python:

In [5]:
with open("files/SeaIce.txt", 'r') as input_file:
    for observation in input_file:
        print('Reading Data: ' + observation, end="")

Reading Data: year mo    data_type region extent   area
Reading Data: 1979  1      Goddard      N  15.54  12.33
Reading Data: 1980  1      Goddard      N  14.96  11.85
Reading Data: 1981  1      Goddard      N  15.03  11.82
Reading Data: 1982  1      Goddard      N  15.26  12.11
Reading Data: 1983  1      Goddard      N  15.10  11.92
Reading Data: 1984  1      Goddard      N  14.61  11.60
Reading Data: 1985  1      Goddard      N  14.86  11.60
Reading Data: 1986  1      Goddard      N  15.02  11.79
Reading Data: 1987  1      Goddard      N  15.20  11.81
Reading Data: 1988  1        -9999      N  -9999  -9999
Reading Data: 1989  1      Goddard      N  15.12  13.11
Reading Data: 1990  1      Goddard      N  14.95  12.72
Reading Data: 1991  1      Goddard      N  14.46  12.49
Reading Data: 1992  1      Goddard      N  14.72  12.54
Reading Data: 1993  1      Goddard      N  15.08  12.85
Reading Data: 1994  1      Goddard      N  14.82  12.80
Reading Data: 1995  1      Goddard      N  14.62

The ``input_file`` file object contains a pointer to the open file. As the code performs data reads in the for loop, the file pointer moves to the next record.
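
To watch the file pointer move, you can ask the file object for its position with ``tell()`` after each read. This is a minimal sketch on a small temporary file (the three lines of content are demo data). It uses ``readline()`` rather than a ``for`` loop because, in text mode, calling ``tell()`` while iterating over the file raises an ``OSError``.

```python
import os
import tempfile

# Write a small demo file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, newline='\n') as tmp:
    tmp.write("year mo extent\n1979 1 15.54\n1980 1 14.96\n")
    demo_path = tmp.name

positions = []
with open(demo_path, 'r') as input_file:
    positions.append(input_file.tell())      # position before any read
    while True:
        line = input_file.readline()
        if not line:                         # empty string means end of file
            break
        positions.append(input_file.tell())  # position after reading one record

print(positions)
os.remove(demo_path)
```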

Data streaming obtains all the records from a data source. You may find that
you don’t need all the records. You can save time and resources by simply
sampling the data.

In [6]:
n = 17
with open("files/SeaIce.txt", 'r') as input_file:
    for j, observation in enumerate(input_file):
        if j % n==0:
            print('Reading Line: ' + str(j) + ' Content: ' + observation, end="")

Reading Line: 0 Content: year mo    data_type region extent   area
Reading Line: 17 Content: 1995  1      Goddard      N  14.62  12.72
Reading Line: 34 Content: 2012  1      Goddard      N  13.77  11.87
Reading Line: 51 Content: 1993  2      Goddard      N  15.73  13.54
Reading Line: 68 Content: 2010  2      Goddard      N  14.59  12.60
Reading Line: 85 Content: 1991  3      Goddard      N  15.50  13.35
Reading Line: 102 Content: 2008  3      Goddard      N  15.22  13.20
Reading Line: 119 Content: 1990  4      Goddard      N  14.68  12.16
Reading Line: 136 Content: 2007  4      Goddard      N  13.87  11.75
Reading Line: 153 Content: 1989  5      Goddard      N  12.98  11.30
Reading Line: 170 Content: 2006  5      Goddard      N  12.62  10.39
Reading Line: 187 Content: 1988  6      Goddard      N  12.02   9.62
Reading Line: 204 Content: 2005  6      Goddard      N  11.29   8.74
Reading Line: 221 Content: 1987  7      Goddard      N   9.98   6.84
Reading Line: 238 Content: 2004  7      G

You can perform random sampling as well.

In [7]:
import random
sample_size = 0.01
with open("files/SeaIce.txt", 'r') as input_file:
    for j, observation in enumerate(input_file):
        if random.random() <= sample_size:
            print('Reading Line: ' + str(j) + ' Content: ' + observation, end="")

Reading Line: 13 Content: 1991  1      Goddard      N  14.46  12.49
Reading Line: 229 Content: 1995  7      Goddard      N   9.15   6.05
Reading Line: 255 Content: 1986  8      Goddard      N   8.01   4.92
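
Random sampling returns different lines on every run and may skip the header. If you need a repeatable sample, one option is to seed a private random number generator and always keep line 0. The sketch below is an assumption-laden variant of the cell above: ``sample_lines``, its parameters, and the synthetic in-memory data are all made up for illustration.

```python
import io
import random

def sample_lines(lines, probability, seed=None):
    """Yield (index, line) for the header and for a random subset of later lines."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    for j, line in enumerate(lines):
        if j == 0 or rng.random() <= probability:
            yield j, line

# Synthetic demo data: a header followed by 200 made-up rows
demo = io.StringIO(
    "year mo extent\n" + "".join(f"{1979 + i} 1 15.0\n" for i in range(200))
)

first = list(sample_lines(demo, 0.05, seed=42))
demo.seek(0)  # rewind to sample the same data again
second = list(sample_lines(demo, 0.05, seed=42))

for j, line in first:
    print('Reading Line: ' + str(j) + ' Content: ' + line, end="")
```

With the same seed, both passes select the same lines.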


A flat file is the easiest kind of file to work with.

A problem with using native Python techniques is that the input isn’t intelligent. For example, when a file contains a header, Python simply reads it as yet more data to process rather than as a header (pandas, by contrast, handles headers for you).
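
With plain Python you can separate the header yourself by consuming the first line before looping over the rest. This is a minimal sketch that uses an in-memory copy of the first lines of the sea ice file so that it runs on its own:

```python
import io

# In-memory stand-in for files/SeaIce.txt (first lines only)
demo = io.StringIO(
    "year mo    data_type region extent   area\n"
    "1979  1      Goddard      N  15.54  12.33\n"
    "1980  1      Goddard      N  14.96  11.85\n"
)

header = next(demo).split()              # consume the first line as column names
rows = [line.split() for line in demo]   # the remaining lines are data

print(header)
print(rows)
```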

The least formatted and therefore easiest‐to‐read flat‐file format is the text file. However, a text file also treats all data as strings, so you often have to convert numeric data into other forms.
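
As an illustration of that conversion step, the sketch below turns each whitespace-separated line into typed values. The ``parse_row`` helper is hypothetical, and treating ``-9999`` as a missing-value marker is an assumption based on the 1988 row shown earlier.

```python
import io

MISSING = -9999  # assumption: -9999 marks a missing measurement

def parse_row(line):
    """Convert one whitespace-separated line into a dict of typed values."""
    year, month, data_type, region, extent, area = line.split()
    extent, area = float(extent), float(area)
    return {
        'year': int(year),
        'mo': int(month),
        'data_type': data_type,
        'region': region,
        'extent': None if extent == MISSING else extent,
        'area': None if area == MISSING else area,
    }

# In-memory stand-in for two rows of files/SeaIce.txt
demo = io.StringIO(
    "year mo    data_type region extent   area\n"
    "1987  1      Goddard      N  15.20  11.81\n"
    "1988  1        -9999      N  -9999  -9999\n"
)
next(demo)  # skip the header line
records = [parse_row(line) for line in demo]
print(records)
```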

A comma‐separated value (CSV) file provides more formatting and more information, but it requires a little more effort to read.

At the high end of flat‐file formatting are custom data formats, such as an Excel file, which contains extensive formatting and could include multiple datasets in a single file.

### Reading from a CSV file

A CSV file provides more formatting than a simple text file. In fact, CSV files can become quite complicated. The common format is documented in RFC 4180, which you can see at https://tools.ietf.org/html/rfc4180.
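
One source of that complexity is quoting: a field may contain commas or double quotes as long as it is wrapped in double quotes. The sketch below uses a made-up two-line CSV string to contrast a naive ``split(',')`` with Python's ``csv`` module, which is introduced next.

```python
import csv
import io

# Made-up sample: the second line has a comma and escaped quotes inside fields
raw = 'name,comment\n"Smith, Jane","said ""hello"""\n'

# Naive splitting breaks the quoted field apart
naive = raw.splitlines()[1].split(',')

# csv.reader understands the quoting rules
parsed = list(csv.reader(io.StringIO(raw, newline='')))

print(naive)
print(parsed[1])
```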

The ``csv`` module is useful for working with data exported from spreadsheets and databases into text files formatted with fields and records, a format commonly referred to as comma-separated values (CSV).

In [121]:
import csv

f = open("files/Advertising.csv", 'r')
try:
    reader = csv.reader(f)
    for row in reader:
        print(row)
finally:
    f.close()

['', 'TV', 'Radio', 'Newspaper', 'Sales']
['1', '230.1', '37.8', '69.2', '22.1']
['2', '44.5', '39.3', '45.1', '10.4']
['3', '17.2', '45.9', '69.3', '9.3']
['4', '151.5', '41.3', '58.5', '18.5']
['5', '180.8', '10.8', '58.4', '12.9']
['6', '8.7', '48.9', '75', '7.2']
['7', '57.5', '32.8', '23.5', '11.8']
['8', '120.2', '19.6', '11.6', '13.2']
['9', '8.6', '2.1', '1', '4.8']
['10', '199.8', '2.6', '21.2', '10.6']
['11', '66.1', '5.8', '24.2', '8.6']
['12', '214.7', '24', '4', '17.4']
['13', '23.8', '35.1', '65.9', '9.2']
['14', '97.5', '7.6', '7.2', '9.7']
['15', '204.1', '32.9', '46', '19']
['16', '195.4', '47.7', '52.9', '22.4']
['17', '67.8', '36.6', '114', '12.5']
['18', '281.4', '39.6', '55.8', '24.4']
['19', '69.2', '20.5', '18.3', '11.3']
['20', '147.3', '23.9', '19.1', '14.6']
['21', '218.4', '27.7', '53.4', '18']
['22', '237.4', '5.1', '23.5', '12.5']
['23', '13.2', '15.9', '49.6', '5.6']
['24', '228.3', '16.9', '26.2', '15.5']
['25', '62.3', '12.6', '18.3', '9.7']
['26', '262.

In [122]:
with open("files/Advertising.csv", 'r') as input_file:
    reader = csv.reader(input_file)
    for row in reader:
        print(row)

['', 'TV', 'Radio', 'Newspaper', 'Sales']
['1', '230.1', '37.8', '69.2', '22.1']
['2', '44.5', '39.3', '45.1', '10.4']
['3', '17.2', '45.9', '69.3', '9.3']
['4', '151.5', '41.3', '58.5', '18.5']
['5', '180.8', '10.8', '58.4', '12.9']
['6', '8.7', '48.9', '75', '7.2']
['7', '57.5', '32.8', '23.5', '11.8']
['8', '120.2', '19.6', '11.6', '13.2']
['9', '8.6', '2.1', '1', '4.8']
['10', '199.8', '2.6', '21.2', '10.6']
['11', '66.1', '5.8', '24.2', '8.6']
['12', '214.7', '24', '4', '17.4']
['13', '23.8', '35.1', '65.9', '9.2']
['14', '97.5', '7.6', '7.2', '9.7']
['15', '204.1', '32.9', '46', '19']
['16', '195.4', '47.7', '52.9', '22.4']
['17', '67.8', '36.6', '114', '12.5']
['18', '281.4', '39.6', '55.8', '24.4']
['19', '69.2', '20.5', '18.3', '11.3']
['20', '147.3', '23.9', '19.1', '14.6']
['21', '218.4', '27.7', '53.4', '18']
['22', '237.4', '5.1', '23.5', '12.5']
['23', '13.2', '15.9', '49.6', '5.6']
['24', '228.3', '16.9', '26.2', '15.5']
['25', '62.3', '12.6', '18.3', '9.7']
['26', '262.

When you have data to be imported into some other application, writing ``csv`` files is just as easy as reading them.

Use ``writer()`` to create an object for writing, then iterate over the rows, using ``writerow()`` to write them.

In [123]:
import csv

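# newline='' lets the csv module handle line endings itself;
# the with statement closes both files even if an error occurs
with open("files/Advertising.csv", 'r', newline='') as ifile, \
        open('test.csv', 'w', newline='') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter=',', lineterminator='\n')

    # Copy every row from the input file to the output file
    for row in reader:
        writer.writerow(row)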

A more elegant way of reading a CSV file:
+  Wrap the CSV reader in a function that returns a generator
+  Use context managers ``with [callable] as [name]`` to ensure that the handle to the file is closed automatically.
+  Use the ``csv.DictReader`` class when headers are present (otherwise just use ``csv.reader``)

In [138]:
import csv

ADV = 'files/Advertising.csv'

def read_data(path):
    with open(path, 'r') as data:
        reader = csv.DictReader(data)
        for row in reader:
            yield row

for idx, row in enumerate(read_data(ADV)):
    if idx < 15: print(row)
    else: break

OrderedDict([('', '1'), ('TV', '230.1'), ('Radio', '37.8'), ('Newspaper', '69.2'), ('Sales', '22.1')])
OrderedDict([('', '2'), ('TV', '44.5'), ('Radio', '39.3'), ('Newspaper', '45.1'), ('Sales', '10.4')])
OrderedDict([('', '3'), ('TV', '17.2'), ('Radio', '45.9'), ('Newspaper', '69.3'), ('Sales', '9.3')])
OrderedDict([('', '4'), ('TV', '151.5'), ('Radio', '41.3'), ('Newspaper', '58.5'), ('Sales', '18.5')])
OrderedDict([('', '5'), ('TV', '180.8'), ('Radio', '10.8'), ('Newspaper', '58.4'), ('Sales', '12.9')])
OrderedDict([('', '6'), ('TV', '8.7'), ('Radio', '48.9'), ('Newspaper', '75'), ('Sales', '7.2')])
OrderedDict([('', '7'), ('TV', '57.5'), ('Radio', '32.8'), ('Newspaper', '23.5'), ('Sales', '11.8')])
OrderedDict([('', '8'), ('TV', '120.2'), ('Radio', '19.6'), ('Newspaper', '11.6'), ('Sales', '13.2')])
OrderedDict([('', '9'), ('TV', '8.6'), ('Radio', '2.1'), ('Newspaper', '1'), ('Sales', '4.8')])
OrderedDict([('', '10'), ('TV', '199.8'), ('Radio', '2.6'), ('Newspaper', '21.2'), ('Sale

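Because ``read_data`` is a generator, the file is not opened, read, or parsed until you iterate over it, and rows are produced one at a time. This is powerful because it keeps memory use low, so the same code stays efficient and portable even for much larger datasets.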

In [139]:
data = read_data(ADV)
print(data)
for item in data:
    print(item)

<generator object read_data at 0x7ff5b9f632b0>
OrderedDict([('', '1'), ('TV', '230.1'), ('Radio', '37.8'), ('Newspaper', '69.2'), ('Sales', '22.1')])
OrderedDict([('', '2'), ('TV', '44.5'), ('Radio', '39.3'), ('Newspaper', '45.1'), ('Sales', '10.4')])
OrderedDict([('', '3'), ('TV', '17.2'), ('Radio', '45.9'), ('Newspaper', '69.3'), ('Sales', '9.3')])
OrderedDict([('', '4'), ('TV', '151.5'), ('Radio', '41.3'), ('Newspaper', '58.5'), ('Sales', '18.5')])
OrderedDict([('', '5'), ('TV', '180.8'), ('Radio', '10.8'), ('Newspaper', '58.4'), ('Sales', '12.9')])
OrderedDict([('', '6'), ('TV', '8.7'), ('Radio', '48.9'), ('Newspaper', '75'), ('Sales', '7.2')])
OrderedDict([('', '7'), ('TV', '57.5'), ('Radio', '32.8'), ('Newspaper', '23.5'), ('Sales', '11.8')])
OrderedDict([('', '8'), ('TV', '120.2'), ('Radio', '19.6'), ('Newspaper', '11.6'), ('Sales', '13.2')])
OrderedDict([('', '9'), ('TV', '8.6'), ('Radio', '2.1'), ('Newspaper', '1'), ('Sales', '4.8')])
OrderedDict([('', '10'), ('TV', '199.8'), 
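
To see the lazy behaviour more directly, the sketch below repeats ``read_data`` and points it first at a small temporary file and then at a path that no longer exists. The temporary file and its three lines are demo data, not the real Advertising file. Creating the generator never fails; the error appears only when the first row is requested.

```python
import csv
import os
import tempfile

def read_data(path):
    with open(path, 'r', newline='') as data:
        reader = csv.DictReader(data)
        for row in reader:
            yield row

# Small demo file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write(",TV,Radio\n1,230.1,37.8\n2,44.5,39.3\n")
    demo_path = tmp.name

rows = read_data(demo_path)   # nothing is opened yet
first_row = next(rows)        # the file is opened and one row is parsed
rows.close()                  # stop early; the with block closes the file
os.remove(demo_path)

lazy = read_data(demo_path)   # the path is now missing, but no error is raised here
try:
    next(lazy)
    error_on_first_read = False
except FileNotFoundError:
    error_on_first_read = True

print(first_row['TV'], error_on_first_read)
```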

## Exercises

Store the first line of the file files/SeaIce.txt in a variable.
Then create a list containing the next 10 lines of the file (that is, excluding the first one).

In [16]:
with open("files/SeaIce.txt", 'r') as input_file:
    # Your code goes here
    print()




Read the file files/Advertising.csv and write lines 1, 10 and 20 to test_10.csv.

In [17]:
# Your code goes here