# Storing data. CSV formats
---------------

## 1. File IO Tools
-------------------
[Corey Schafer. File Objects - Reading and Writing to Files](https://www.youtube.com/watch?v=Uh2ebFW8OYM&t=515s)


[File object](https://docs.python.org/3/glossary.html#term-file-object)

#### `io` — core tools for working with streams

* 3 main categories of IO: 
    - text IO 
    - binary IO 
    - raw IO.  
* **file object** -- concrete object belonging to any of these categories

 

#### Text I/O
* expects and produces `str` objects
* the contents of the file are returned as strings, the bytes having first been decoded using a platform-dependent encoding or the specified `encoding`, if given

[Built-in ``open()``](https://docs.python.org/3/library/functions.html?highlight=open#open)
 -- the easiest way to create a text stream 
 
```ipython
open(file, mode='r', buffering=-1, encoding=None, errors=None, 
           newline=None, closefd=True, opener=None)
```

* opens a file and returns **a stream** (a file object)
* raises `OSError` upon failure (`IOError` is kept as an alias of `OSError`)

#### Opening modes:
* ```r``` -- open for reading (default)
* ```w``` -- open for writing, truncating the file first
* ```x``` -- create a new file and open it for writing (raises `FileExistsError` if the file already exists)
* ```+``` -- open a disk file for updating (reading and writing)
* ```a``` -- open for writing, appending to the end of the file if it exists
* ```b``` -- binary mode
* ```t``` -- text mode (default)
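
The modes listed above can be tried out on a throwaway file. This is a minimal sketch; the file name `modes_demo.txt` and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    with open(path, "w") as f:       # "w": create (or truncate) and write
        f.write("first\n")

    with open(path, "a") as f:       # "a": append to the end of the file
        f.write("second\n")

    with open(path, "r") as f:       # "r": read (the default mode)
        after_append = f.read()

    try:
        open(path, "x")              # "x": exclusive creation, the file exists
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    with open(path, "w") as f:       # "w" again: the old content is discarded
        f.write("replaced\n")

    with open(path, "r+") as f:      # "+": update, i.e. read and write
        f.write("R")                 # overwrite the first character in place
        f.seek(0)
        after_update = f.read()

print(repr(after_append))
print(exclusive_failed)
print(repr(after_update))
```

Note that `"r+"` overwrites characters in place, whereas `"w"` empties the file before anything is written.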

#### Binary I/O 
* **buffered I/O**, expects bytes-like objects and produces bytes objects
* files opened in binary mode return contents as bytes objects
* no encoding, decoding, or newline translation is performed  
* streams can be used for all kinds of non-text data, and also when manual control over the handling of text data is desired

#### Raw I/O 
* **unbuffered I/O**
* is generally used as a low-level building-block for binary and text streams
* it is rarely useful to directly manipulate a raw stream from user code
* a raw stream can be created by opening a file in binary mode with **buffering** disabled:

In [None]:
open("IMAG1225.jpg", "rb", buffering=0)
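
Which of the three categories a stream belongs to can be checked from the type that `open()` returns. This is a small sketch on a temporary file (the file name is an assumption); the concrete classes shown are the ones CPython's `io` module provides for files on disk.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "layers_demo.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"abc")

    with open(path, "r") as text_stream:               # text I/O
        text_type = type(text_stream)
    with open(path, "rb") as binary_stream:            # buffered binary I/O
        binary_type = type(binary_stream)
    with open(path, "rb", buffering=0) as raw_stream:  # raw (unbuffered) I/O
        raw_type = type(raw_stream)

print(text_type)
print(binary_type)
print(raw_type)
```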


ABC | Inherits | Stub Methods | Mixin Methods and Properties
:---------| :---------| :------------------|:-----------------------------------
`IOBase` | | `fileno`, `seek`, and `truncate` | `close`, `closed`, `__enter__`, `__exit__`, `flush`, `isatty`, `__iter__`, `__next__`, `readable`, `readline`, `readlines`, `seekable`, `tell`, `writable`, and `writelines`
`RawIOBase` | `IOBase` | `readinto` and `write` | Inherited `IOBase` methods, `read`, and `readall`
`BufferedIOBase` | `IOBase` | `detach`, `read`, `read1`, and `write` | Inherited `IOBase` methods, `readinto`, and `readinto1`
`TextIOBase` | `IOBase` | `detach`, `read`, `readline`, and `write` | Inherited `IOBase` methods, `encoding`, `errors`, and `newlines`

It is possible to use a string or bytearray as a file for both reading and writing:
* ```StringIO``` for string (in a text mode)
* ```BytesIO``` for bytes (in a binary mode)
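
A short sketch of both in-memory streams follows; the sample contents are made up for illustration.

```python
import io

# StringIO behaves like a file opened in text mode
text_buffer = io.StringIO()
text_buffer.write("first line\n")
text_buffer.write("second line\n")
text_buffer.seek(0)                   # rewind before reading
lines = text_buffer.readlines()
whole_text = text_buffer.getvalue()   # full contents, regardless of position

# BytesIO behaves like a file opened in binary mode
byte_buffer = io.BytesIO(b"\x89PNG")
header = byte_buffer.read(4)

print(lines)
print(header)
```

In-memory streams are handy for trying out file-handling code without creating files on disk.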

In [None]:
f = open("test.txt", "r")

In [None]:
print(f'{type(f)}\n {f.name}\n { f.mode}')
f.close()

## 2. Reading Files
---------------

 * #### ``f.read()`` -- reading Small Files:

In [None]:
with open("test.txt", "r") as f:
    f_contents = f.read()
    print(f'{type(f_contents)}\n{f_contents}')

  * ####    ``f.readlines()`` -- reading all lines into a list (an optional size hint limits how many lines are read):

In [None]:
with open("test.txt", "r") as f:
    f_contents = f.readlines()
    print(f'{type(f_contents)}\n{f_contents}')

In [None]:
f_contents[3] 

In [None]:
with open("test.txt", "r") as f:
    f_contents = f.readlines(30)
    print(f_contents)

 * #### ``f.readline()`` -- reading one line at a time (``print`` adds an extra blank line after each)

In [None]:
with open("test.txt", "r") as f:
    f_contents = f.readline()
    print(f_contents)
    f_contents = f.readline()
    print(f_contents)

In [None]:
f_contents

* #### Print out without the extra lines

In [None]:
with open("test.txt", "r") as f:
    f_contents = f.readline()
    print(f_contents, end = '')
    f_contents = f.readline()
    print(f_contents, end = '')

* #### Iterating through the file

In [None]:
with open("test.txt", "r") as f:
    for line in f:
        print(line, end = '')
    else: print('\n', type(line))

* #### Iterating through small chunks, with ```size_to_read``` characters:

In [None]:
with open("test.txt", "r") as f:
    size_to_read = 100
    
    f_contents = f.read(size_to_read)
    print(len(f_contents))
    print(f_contents, end = '')

    f_contents = f.read(size_to_read)
    print(len(f_contents))
    print(f_contents)

In [None]:
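with open("test.txt", "r") as f:
    size_to_read = 10
    f_contents = f.read(size_to_read)   # read the first 10 characters
    print(f_contents, end = '')

    f.seek(0)                           # jump back to the start of the file
    
    f_contents = f.read(size_to_read)   # the same 10 characters again
    print(f_contents)
    
    print(f'Current position={f.tell()}')
    # print the file in 10-character chunks separated by '*',
    # starting with the chunk that was just read
    while len(f_contents) > 0:
        print(f_contents, end = '*')
        f_contents = f.read(size_to_read)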

## 3. Writing Files
------------------------

In [None]:
with open("test123456.txt", "w") as f:
    pass  # "w" creates the file (or empties an existing one); the with-block closes it

In [None]:
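with open("test2.txt", "w") as f:
    f.write("Test")
    f.seek(1)          # move to position 1, right after the first "T"
    f.write("Test")    # overwrites from there: the file now contains "TTest"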

* #### Copying Files

In [None]:
with open("test.txt", "r") as rf:
    with open("test_copy.txt", "w") as wf:
        for line in rf:
            wf.write(line)

* #### Copying the image without chunks

In [None]:
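iter_counter=0
with open("IMAG1225.jpg", "rb") as rf:
    with open("spring_2017.jpg", "wb") as wf:
        # iterating a binary file still splits on b'\n', so the "lines" are
        # byte runs of uneven length
        for line in rf:
            wf.write(line)
            iter_counter+=1
            if iter_counter==1:
                print(type(line))   # <class 'bytes'>
        
iter_counter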

* #### Copying the image with chunks

In [None]:
iter_counter=0
print(type(iter_counter))
with open("IMAG1225.jpg", "rb") as rf:
    with open("spring_2018.jpg", "wb") as wf:
        chunk_size=1024
        rf_chunk=rf.read(chunk_size)
        while len(rf_chunk)>0:
            wf.write(rf_chunk)
            rf_chunk=rf.read(chunk_size)
            iter_counter+=1
iter_counter            
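
The chunked loop above can be wrapped in a reusable function. The sketch below is one possible formulation, not the only one: the helper name `copy_in_chunks` is hypothetical, and in-memory `BytesIO` streams stand in for the image files so that it runs without any files on disk.

```python
import io


def copy_in_chunks(source, target, chunk_size=1024):
    """Copy a binary stream chunk by chunk and return the number of chunks."""
    chunks = 0
    # iter(callable, sentinel) keeps calling source.read(chunk_size)
    # until it returns b"", which marks the end of the stream
    for chunk in iter(lambda: source.read(chunk_size), b""):
        target.write(chunk)
        chunks += 1
    return chunks


source = io.BytesIO(b"x" * 2500)   # stand-in for a file opened with "rb"
target = io.BytesIO()              # stand-in for a file opened with "wb"
n_chunks = copy_in_chunks(source, target)

print(n_chunks)
print(target.getvalue() == source.getvalue())
```

For real files the standard library also provides `shutil.copyfileobj`, which performs the same kind of chunked copy.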

## 4. Working with CSV Files
-----------------

In [None]:
import csv

* #### Description of CSV format 

1.  Each record is located on a separate line, delimited by a **line break** (CRLF):

       aaa,bbb,ccc CRLF
       
       zzz,yyy,xxx CRLF

2.  The last record in the file may or may not have an ending line break:

       aaa,bbb,ccc CRLF
       
       zzz,yyy,xxx

3.  In general, the **default** separator character (delimiter) is the comma. Other popular delimiters include the tab (`\t`), colon (`:`) and semicolon (`;`).

    Properly parsing a CSV file requires us to know which delimiter is being used.

4.  There may be an optional header line appearing as the first line of the file, with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file:

       field_name,field_name,field_name CRLF
       
       aaa,bbb,ccc CRLF
       
       zzz,yyy,xxx CRLF

5.  Each field may or may not be enclosed in double quotes:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

* This is the format (documented in RFC 4180) that most implementations seem to follow
* programmers can also define their own special-purpose CSV formats
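
Quoting matters as soon as a field contains the delimiter itself. The sketch below parses made-up records from an in-memory string with `csv.reader`: one field holds a comma, another holds doubled double quotes, and the last line uses a semicolon as the delimiter.

```python
import csv
import io

# Two fields per record; the quoted fields contain a comma and escaped quotes
raw = 'name,comment\r\n"Smith, John","said ""hi"""\r\nplain,text\r\n'
rows = list(csv.reader(io.StringIO(raw, newline="")))

# The same parser handles other delimiters when told which one is in use
semicolon_rows = list(csv.reader(["a;b;c"], delimiter=";"))

for row in rows:
    print(row)
print(semicolon_rows)
```

Splitting the lines with `str.split(',')` would break the first data record into three pieces, which is the main reason to use a CSV parser.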

* #### ```csv``` module
   * The ```csv``` module’s ```reader``` and ```writer``` objects read and write sequences 
   * Programmers can also read and write data in dictionary form using the ```DictReader``` and ```DictWriter``` classes

[CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

[ CSV Module - How to Read, Parse, and Write CSV Files](https://www.youtube.com/watch?v=q5uM4VKywbA)

* #### ``csv.reader`` reader object 

```ipython
csv.reader(csvfile, dialect='excel', **fmtparams)
```

   - is responsible for reading and parsing tabular data in CSV format
   - returns a reader object which will iterate over **lines** in the given ```csvfile```
   - ```csvfile``` can be any object which supports the iterator protocol and returns a string each time its ```__next__()``` method is called — file objects and list objects are both suitable
   - an optional dialect parameter can be given which is used to define a set of parameters specific to a particular CSV dialect.
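
Both points can be tried without a file: a plain list of strings serves as `csvfile`, and a dialect bundles formatting parameters under a name. In this sketch the sample rows and the dialect name `pipes` are assumptions made for illustration.

```python
import csv

# A list of strings is a valid csvfile: each item is treated as one line
lines = ["first_name,last_name,email", "John,Doe,john@example.com"]
rows = list(csv.reader(lines))

# Register a custom dialect once, then refer to it by name
csv.register_dialect("pipes", delimiter="|", quoting=csv.QUOTE_MINIMAL)
piped = list(csv.reader(["a|b|c"], dialect="pipes"))
csv.unregister_dialect("pipes")   # clean up the registry again

print(rows)
print(piped)
```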

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    print(f'type of csv_file: {type(csv_file)}, type of csv_reader: {type(csv_reader)}')
    
    for line in csv_reader:
        print(line)

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    
    next(csv_reader) #to skip over the field name headers
    for line in csv_reader:
        print(line[2])

* #### ```csv.writer``` writer object 

```ipython
csv.writer(csvfile, dialect='excel', **fmtparams)
```

- is responsible for writing tabular data in CSV format
- returns a writer object responsible for converting the user’s data into delimited strings on the given file-like object
- ```csvfile``` can be any object with a ```write()``` method 
- if ```csvfile``` is a file object, it should be opened with ```newline=''```
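
Because any object with a `write()` method will do, the writer can be inspected with an in-memory `StringIO` buffer. The rows below are made up; the sketch shows that the writer quotes fields when needed and ends each row with `\r\n` itself.

```python
import csv
import io

buffer = io.StringIO(newline="")   # newline="" keeps line endings untouched
writer = csv.writer(buffer)
writer.writerow(["first_name", "note"])
writer.writerow(["Ann", "likes commas, a lot"])   # this field needs quoting
output = buffer.getvalue()

print(repr(output))
```

Since the writer already emits `\r\n`, a file opened without `newline=''` on a platform that translates `\n` into `\r\n` would end up with `\r\r\n` line endings, which typically show up as blank rows.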

In [None]:
with open('names.csv',  'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    
    with open('names_copy.csv', 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter='-') 

        for line in csv_reader:
            csv_writer.writerow(line)

In [None]:
with open('names.csv',  'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    
    with open('new_names.csv', 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t') 

        for line in csv_reader:
            csv_writer.writerow(line)

In [None]:
with open('new_names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)  # default delimiter=',' is wrong here: the file is tab-delimited, so each row comes back as a single field
    for line in csv_reader:
        print(line)

In [None]:
with open('new_names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')  # pass the delimiter the file actually uses
    for line in csv_reader:
        print(line)

* #### DictReader & DictWriter

```ipython 
class csv.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
```
     
- creates an object that operates like a regular reader but maps the information in each row to a ```dict``` (an ```OrderedDict``` in Python 3.6 and 3.7) whose keys are given by the optional **fieldnames** parameter
-  ```fieldnames```  is a sequence. If ```fieldnames``` is omitted, the values in the first row of file ```f``` will be used as the fieldnames. Regardless of how the fieldnames are determined, the dictionary preserves their original ordering
- If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by ```restkey``` (default ```None```). If a non-blank row has fewer fields than fieldnames, the missing values are filled in with the value of ```restval``` (default ```None```)
- the values in each row are accessed by key
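
The effect of ```restkey``` and ```restval``` is easiest to see on rows of uneven length. This sketch uses made-up lines in a list instead of a file; the key name `rest` and the filler `N/A` are arbitrary choices.

```python
import csv

lines = [
    "first_name,last_name",       # header row, becomes the fieldnames
    "John,Doe,extra1,extra2",     # two fields too many
    "Jane",                       # one field too few
]
reader = csv.DictReader(lines, restkey="rest", restval="N/A")
rows = [dict(row) for row in reader]

print(reader.fieldnames)
for row in rows:
    print(row)
```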

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    # the header row is consumed as fieldnames, so it is not yielded as data

    for line in csv_reader:
        print(line)

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    for line in csv_reader:
        print(line['email'])

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    
    with open('new_names_1.csv', 'w', newline='') as new_file:
        # 'email' is still listed here, so the output keeps an empty email column
        fieldnames = ['first_name', 'last_name', 'email']

        # fieldnames is required: csv.DictWriter(new_file, delimiter='\t') raises
        # TypeError: __init__() missing 1 required positional argument: 'fieldnames'
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')

        csv_writer.writeheader()

        for line in csv_reader:
            del line['email']
            csv_writer.writerow(line)

In [None]:
with open('names.csv', 'r', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    
    with open('new_names_1.csv', 'w', newline='') as new_file:
        fieldnames = ['first_name', 'last_name']  # header now matches the columns that are written

        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')

        csv_writer.writeheader()

        for line in csv_reader:
            del line['email']
            csv_writer.writerow(line)
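
An alternative to deleting the unwanted key from every row is to let `DictWriter` drop it via `extrasaction='ignore'`. The sketch below assumes a one-record sample with the same three columns and uses in-memory buffers instead of `names.csv`.

```python
import csv
import io

# Stand-in for names.csv: a header and a single made-up record
source = io.StringIO(
    "first_name,last_name,email\r\nJohn,Doe,john@example.com\r\n", newline=""
)
target = io.StringIO(newline="")

reader = csv.DictReader(source)
writer = csv.DictWriter(
    target,
    fieldnames=["first_name", "last_name"],
    delimiter="\t",
    extrasaction="ignore",   # keys missing from fieldnames are skipped silently
)
writer.writeheader()
writer.writerows(reader)     # accepts any iterable of row dictionaries
result = target.getvalue()

print(repr(result))
```

With the default `extrasaction='raise'`, the extra `email` key would trigger a `ValueError` instead.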