# Storing data. CSV formats

## File IO
[Corey Schafer. File Objects - Reading and Writing to Files](https://www.youtube.com/watch?v=Uh2ebFW8OYM&t=515s)

### File Objects: The Basics
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Opens a file and returns **a stream** (a file object). Raises `OSError` on failure (`IOError` is an alias of `OSError` in Python 3).

#### Opening modes:
* ```r``` -- open for reading (default)
* ```w``` -- open for writing, truncating the file first
* ```x``` -- create a new file and open it for writing (raises `FileExistsError` if the file already exists)
* ```+``` -- open a disk file for updating (reading and writing)
* ```a``` -- open for writing, appending to the end of the file if it exists
* ```b``` -- binary mode
* ```t``` -- text mode (default)

Files opened in binary mode return contents as bytes objects without any decoding. 
In text mode the contents of the file are returned as strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
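As a rough illustration of the modes above, the following sketch creates a file with ```x```, then reads it back in text and in binary mode. The scratch file name `modes_demo.txt` and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

# Hypothetical scratch file; the name is only for illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes_demo.txt")

# "x" creates the file and fails if it already exists.
with open(path, "x", encoding="utf-8") as f:
    f.write("naïve\n")

try:
    open(path, "x")
    created_twice = True
except FileExistsError:
    created_twice = False

# Text mode decodes the bytes into str using the given encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode returns the raw bytes without any decoding.
with open(path, "rb") as f:
    raw = f.read()

print(type(text), repr(text))
print(type(raw), repr(raw))
```

Passing `encoding` explicitly avoids depending on the platform default mentioned above.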

It is possible to use an in-memory string or bytes buffer as a file for both reading and writing (both classes live in the ```io``` module):
* ```StringIO``` for strings (text mode)
* ```BytesIO``` for bytes (binary mode)
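A minimal sketch of both in-memory buffers, assuming nothing beyond the standard ```io``` module; the sample contents are made up for the example.

```python
import io

# StringIO behaves like a file opened in text mode.
text_buffer = io.StringIO()
text_buffer.write("first line\n")
text_buffer.write("second line\n")
text_buffer.seek(0)              # rewind before reading
lines = text_buffer.readlines()

# BytesIO behaves like a file opened in binary mode.
byte_buffer = io.BytesIO(b"\x89PNG")
header = byte_buffer.read(4)

print(lines)
print(header)
```

Such buffers are handy for testing code that expects a file object without touching the disk.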

In [45]:
f = open("test.txt", "r")
# f = open?
print('{}\n {}\n {}'.format(type(f), f.name, f.mode))
f.close()

<class '_io.TextIOWrapper'>
 test.txt
 r


#### Reading Small Files:

In [47]:
with open("test.txt", "r") as f:
    f_contents = f.read()
    print('{}\n{}'.format(type(f_contents), f_contents))

<class 'str'>
1) This is a test file
2) With multiple lines of data...
3) Third line
4) Fourth line
5) Fifth line
6) Sixth line
7) Seventh line
8) Eighth line
9) Ninth line
10) Tenth line


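#### Reading Big Files:

```readlines()``` returns all lines as a list, so it still loads the whole file into memory; an optional size hint limits how many lines are read. For truly large files, iterate over the file object or read fixed-size chunks.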

In [49]:
with open("test.txt", "r") as f:
    f_contents = f.readlines()
    print('{}\n{}'.format(type(f_contents), f_contents))

<class 'list'>
['1) This is a test file\n', '2) With multiple lines of data...\n', '3) Third line\n', '4) Fourth line\n', '5) Fifth line\n', '6) Sixth line\n', '7) Seventh line\n', '8) Eighth line\n', '9) Ninth line\n', '10) Tenth line']


In [25]:
f_contents[3] 

'4) Fourth line\n'

In [20]:
with open("test.txt", "r") as f:
    f_contents = f.readlines(30)
    print(f_contents)

['1) This is a test file\n', '2) With multiple lines of data...\n']


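#### With the extra lines

```readline()``` keeps the trailing ```\n``` of each line and ```print``` adds another one, so a blank line follows every printed line.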

In [28]:
with open("test.txt", "r") as f:
    f_contents = f.readline()
    print(f_contents)
    f_contents = f.readline()
    print(f_contents)

1) This is a test file

2) With multiple lines of data...



In [29]:
f_contents

'2) With multiple lines of data...\n'

#### Without the extra lines

Passing ```end = ''``` to ```print``` suppresses the newline that ```print``` would otherwise add.

In [30]:
with open("test.txt", "r") as f:
    f_contents = f.readline()
    print(f_contents, end = '')
    f_contents = f.readline()
    print(f_contents, end = '')

1) This is a test file
2) With multiple lines of data...


#### Iterating through the file

In [42]:
with open("test.txt", "r") as f:
    for line in f:
        print(line, end = '')
    else: print('\n', type(line))

1) This is a test file
2) With multiple lines of data...
3) Third line
4) Fourth line
5) Fifth line
6) Sixth line
7) Seventh line
8) Eighth line
9) Ninth line
10) Tenth line
 <class 'str'>


#### Iterating through small chunks, with ```size_to_read``` characters:

In [50]:
with open("test.txt", "r") as f:
    size_to_read = 10
    f_contents = f.read(size_to_read)
    print(f_contents, end = '')
    
    f.seek(0)
    
    f_contents = f.read(size_to_read)
    print(f_contents)
    
    print('Current position={}'.format(f.tell()))
    while len(f_contents) > 0:
        print(f_contents, end = '*')
        f_contents = f.read(size_to_read)

1) This is1) This is
Current position=10
1) This is* a test fi*le
2) With* multiple *lines of d*ata...
3) *Third line*
4) Fourth* line
5) F*ifth line
*6) Sixth l*ine
7) Sev*enth line
*8) Eighth *line
9) Ni*nth line
1*0) Tenth l*ine*

## Writing Files

In [None]:
with open("test2.txt", "w") as f:
    f.write("Test")
    f.seek(1)        # move back to the second character
    f.write("Test")  # overwrites from position 1; the file now contains "TTest"
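The sketch below repeats the write-and-seek steps from the cell above and then contrasts the ```a``` and ```w``` modes. The scratch file `write_demo.txt` is a hypothetical name, and the example assumes plain ASCII text, where seeking to a small character offset in text mode behaves predictably in CPython.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")

with open(path, "w") as f:
    f.write("Test")
    f.seek(1)          # move back to the second character
    f.write("Test")    # overwrites "est" and extends the file by one character

with open(path, "r") as f:
    after_seek = f.read()

# "a" keeps the existing contents and writes at the end.
with open(path, "a") as f:
    f.write("!")

with open(path, "r") as f:
    after_append = f.read()

# "w" truncates the file before writing.
with open(path, "w") as f:
    f.write("new")

with open(path, "r") as f:
    after_truncate = f.read()

print(after_seek, after_append, after_truncate)
```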

#### Copying Files

In [54]:
with open("test.txt", "r") as rf:
    with open("test_copy.txt", "w") as wf:
        for line in rf:
            wf.write(line)

#### Copying the image without chunks

In [56]:
with open("IMAG1225.jpg", "rb") as rf:
    with open("spring_2017.jpg", "wb") as wf:
        for line in rf:
            wf.write(line)
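For comparison, a binary file can also be copied in fixed-size chunks, using the same ```read(size)``` loop shown earlier for text. This is only a sketch: the helper `copy_in_chunks`, the 4096-byte chunk size, and the generated stand-in file are assumptions, not part of the original notebook.

```python
import os
import tempfile

CHUNK_SIZE = 4096  # assumed chunk size in bytes


def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy a binary file chunk by chunk (hypothetical helper)."""
    with open(src_path, "rb") as rf, open(dst_path, "wb") as wf:
        chunk = rf.read(chunk_size)
        while len(chunk) > 0:      # read() returns b"" at end of file
            wf.write(chunk)
            chunk = rf.read(chunk_size)


# Generated stand-in for an image so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "source.bin")
dst = os.path.join(tmp_dir, "copy.bin")
with open(src, "wb") as f:
    f.write(bytes(range(256)) * 100)

copy_in_chunks(src, dst)
print(os.path.getsize(dst))
```

Reading by chunk keeps memory use bounded regardless of where newline bytes happen to fall in the binary data.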

## Working with CSV Files

In [None]:
import csv

### Description of CSV format 

1.  Each record is located on a separate line, delimited by a line break (CRLF):

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF

2.  The last record in the file may or may not have an ending line break:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx

3.  In general, the default separator character (a delimiter) is the comma. Other popular delimiters include the tab (\t), colon (:) and semi-colon (;) characters. Properly parsing a CSV file requires us to know which delimiter is being used.

4.  There may be an optional header line appearing as the first line of the file, with the same format as normal record lines. This header contains names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file:

       field_name,field_name,field_name CRLF
       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF

5.  Each field may or may not be enclosed in double quotes:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

* These rules seem to be followed by most implementations
* Programmers can also define their own special-purpose CSV formats
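The rules above can be tried out on a small in-memory sample. In this sketch the sample records are invented; it mixes quoted and unquoted fields, leaves the last record without a line break, and shows what happens when the delimiter is guessed wrong.

```python
import csv
import io

# Invented sample: header line, a quoted field containing a comma,
# and a last record without a trailing line break.
sample = (
    'name,city\r\n'
    '"Doe, John",Boston\r\n'
    'Jane,Paris'
)
rows = list(csv.reader(io.StringIO(sample, newline='')))
print(rows)

# A semicolon-separated sample needs the matching delimiter...
semi = "a;b;c\n1;2;3\n"
parsed_ok = list(csv.reader(io.StringIO(semi), delimiter=';'))

# ...otherwise each line comes back as a single field.
parsed_wrong = list(csv.reader(io.StringIO(semi)))
print(parsed_ok, parsed_wrong)
```

The quotes around `Doe, John` are what keep the embedded comma from being treated as a field separator.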

### ```csv``` module
* The ```csv``` module’s ```reader``` and ```writer``` objects read and write sequences 
* Programmers can also read and write data in dictionary form using the ```DictReader``` and ```DictWriter``` classes

[CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

[CSV Module - How to Read, Parse, and Write CSV Files](https://www.youtube.com/watch?v=q5uM4VKywbA)

**Reader objects** are responsible for reading and parsing tabular data in CSV format

In [None]:
csv.reader(csvfile, dialect='excel', **fmtparams)

* Return a reader object which will iterate over **lines** in the given csvfile
* ```csvfile``` can be any object which supports the iterator protocol and returns a string each time its ```__next__()``` method is called — file objects and list objects are both suitable
* An optional dialect parameter can be given which is used to define a set of parameters specific to a particular CSV dialect.
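Since any iterable of strings works as ```csvfile```, a plain list is enough to try the reader. The sketch below uses invented rows, the built-in ```excel-tab``` dialect, and a ```delimiter``` keyword passed through ```fmtparams```.

```python
import csv

# A list of strings stands in for a file; the rows are invented.
lines = ["first_name\tlast_name", "John\tDoe", "Mary\tSmith-Robinson"]

# The built-in "excel-tab" dialect uses a tab as the delimiter.
reader = csv.reader(lines, dialect="excel-tab")
header = next(reader)      # first row holds the field names
records = list(reader)

# Individual format parameters can override the dialect defaults.
colon_rows = list(csv.reader(["a:b:c"], delimiter=":"))

print(header, records, colon_rows)
```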

In [74]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    print('"type of csv_file: "{}, "type of csv_reader: "{}'.format(type(csv_file), type(csv_reader)))
    
#     next(csv_reader) #to skip over the field name headers
    for line in csv_reader:
        print(line)

"type of csv_file: "<class '_io.TextIOWrapper'>, "type of csv_reader: "<class '_csv.reader'>
['first_name', 'last_name', 'email']
['John', 'Doe', 'john-doe@bogusemail.com']
['Mary', 'Smith-Robinson', 'maryjacobs@bogusemail.com']
['Dave', 'Smith', 'davesmith@bogusemail.com']
['Jane', 'Stuart', 'janestuart@bogusemail.com']
['Tom', 'Wright', 'tomwright@bogusemail.com']
['Steve', 'Robinson', 'steverobinson@bogusemail.com']
['Nicole', 'Jacobs', 'nicolejacobs@bogusemail.com']
['Jane', 'Wright', 'janewright@bogusemail.com']
['Jane', 'Doe', 'janedoe@bogusemail.com']
['Kurt', 'Wright', 'kurtwright@bogusemail.com']
['Kurt', 'Robinson', 'kurtrobinson@bogusemail.com']
['Jane', 'Jenkins', 'janejenkins@bogusemail.com']
['Neil', 'Robinson', 'neilrobinson@bogusemail.com']
['Tom', 'Patterson', 'tompatterson@bogusemail.com']
['Sam', 'Jenkins', 'samjenkins@bogusemail.com']
['Steve', 'Stuart', 'stevestuart@bogusemail.com']
['Maggie', 'Patterson', 'maggiepatterson@bogusemail.com']
['Maggie', 'Stuart', 'mag

In [86]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    
    next(csv_reader) #to skip over the field name headers
    for line in csv_reader:
        print(line[2])

john-doe@bogusemail.com
maryjacobs@bogusemail.com
davesmith@bogusemail.com
janestuart@bogusemail.com
tomwright@bogusemail.com
steverobinson@bogusemail.com
nicolejacobs@bogusemail.com
janewright@bogusemail.com
janedoe@bogusemail.com
kurtwright@bogusemail.com
kurtrobinson@bogusemail.com
janejenkins@bogusemail.com
neilrobinson@bogusemail.com
tompatterson@bogusemail.com
samjenkins@bogusemail.com
stevestuart@bogusemail.com
maggiepatterson@bogusemail.com
maggiestuart@bogusemail.com
janedoe@bogusemail.com
stevepatterson@bogusemail.com
davesmith@bogusemail.com
samwilks@bogusemail.com
kurtjefferson@bogusemail.com
samstuart@bogusemail.com
janestuart@bogusemail.com
davedavis@bogusemail.com
sampatterson@bogusemail.com
tomjefferson@bogusemail.com
janestuart@bogusemail.com
maggiejefferson@bogusemail.com
marywilks@bogusemail.com
neilpatterson@bogusemail.com
coreydavis@bogusemail.com
stevejacobs@bogusemail.com
janejenkins@bogusemail.com
johnjacobs@bogusemail.com
neilsmith@bogusemail.com
coreywilks@bog

### Writer objects

**Writer objects** are responsible for writing tabular data in CSV format

In [None]:
csv.writer(csvfile, dialect='excel', **fmtparams)

* Return a writer object responsible for converting the user’s data into delimited strings on the given file-like object
* ```csvfile``` can be any object with a ```write()``` method 
* If ```csvfile``` is a file object, it should be opened with ```newline=''```
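Because the writer only needs a ```write()``` method, an ```io.StringIO``` buffer can stand in for a file. This sketch, with invented rows, shows the default line terminator and how a field containing the delimiter gets quoted.

```python
import csv
import io

# newline="" keeps the writer's own line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)

writer.writerow(["first_name", "last_name"])                    # one row
writer.writerows([["John", "Doe"], ["Mary", "Smith, Jr."]])     # several rows

written = buffer.getvalue()
print(repr(written))
```

The writer ends each row with ```\r\n``` by default, which is why a real file should be opened with ```newline=''```: otherwise some platforms translate the ```\n``` again and produce extra blank lines.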

In [95]:
with open('names.csv',  'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    
    with open('new_names.csv', 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t')  # write tab-separated values

        for line in csv_reader:
            csv_writer.writerow(line)

In [97]:
with open('new_names.csv', 'r', newline='') as csv_file:
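    csv_reader = csv.reader(csv_file, delimiter='-')  # wrong delimiter on purpose: rows split on '-' instead of tabs
#     csv_reader = csv.reader(csv_file, delimiter='\t')  # the delimiter this file was written with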
    for line in csv_reader:
        print(line)

['first_name\tlast_name\temail']
['John\tDoe\tjohn', 'doe@bogusemail.com']
['Mary\tSmith', 'Robinson\tmaryjacobs@bogusemail.com']
['Dave\tSmith\tdavesmith@bogusemail.com']
['Jane\tStuart\tjanestuart@bogusemail.com']
['Tom\tWright\ttomwright@bogusemail.com']
['Steve\tRobinson\tsteverobinson@bogusemail.com']
['Nicole\tJacobs\tnicolejacobs@bogusemail.com']
['Jane\tWright\tjanewright@bogusemail.com']
['Jane\tDoe\tjanedoe@bogusemail.com']
['Kurt\tWright\tkurtwright@bogusemail.com']
['Kurt\tRobinson\tkurtrobinson@bogusemail.com']
['Jane\tJenkins\tjanejenkins@bogusemail.com']
['Neil\tRobinson\tneilrobinson@bogusemail.com']
['Tom\tPatterson\ttompatterson@bogusemail.com']
['Sam\tJenkins\tsamjenkins@bogusemail.com']
['Steve\tStuart\tstevestuart@bogusemail.com']
['Maggie\tPatterson\tmaggiepatterson@bogusemail.com']
['Maggie\tStuart\tmaggiestuart@bogusemail.com']
['Jane\tDoe\tjanedoe@bogusemail.com']
['Steve\tPatterson\tstevepatterson@bogusemail.com']
['Dave\tSmith\tdavesmith@bogusemail.com']
['Sa

### DictReader & DictWriter

In [None]:
class csv.DictReader(f, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)

* Create an object that operates like a regular reader but maps the information in each row to an ```OrderedDict``` (a plain ```dict``` since Python 3.8) whose keys are given by the optional ```fieldnames``` parameter
* The ```fieldnames``` parameter is a sequence. If ```fieldnames``` is omitted, the values in the first row of file ```f``` will be used as the fieldnames. Regardless of how the fieldnames are determined, the dictionary preserves their original ordering
* If a row has more fields than fieldnames, the remaining data is put in a list and stored under the key specified by ```restkey``` (default ```None```). If a non-blank row has fewer fields than fieldnames, the missing values are filled in with the value of ```restval``` (default ```None```)
* Key access to the values in the row  
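The effect of ```restkey``` and ```restval``` is easiest to see on rows that are too long or too short. In this sketch the data, the key name `extra`, and the filler `N/A` are all made up for illustration.

```python
import csv
import io

# Invented data: the second line has one field too many,
# the third line has one field too few.
data = io.StringIO(
    "first_name,last_name,email\n"
    "John,Doe,john-doe@bogusemail.com,unexpected\n"
    "Mary,Smith-Robinson\n"
)

reader = csv.DictReader(data, restkey="extra", restval="N/A")
rows = [dict(row) for row in reader]   # dict() works for OrderedDict rows too

for row in rows:
    print(row)
```

Surplus fields end up as a list under `extra`, and the missing `email` is filled with `N/A`.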

In [100]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    for line in csv_reader:
        print(line)

OrderedDict([('first_name', 'John'), ('last_name', 'Doe'), ('email', 'john-doe@bogusemail.com')])
OrderedDict([('first_name', 'Mary'), ('last_name', 'Smith-Robinson'), ('email', 'maryjacobs@bogusemail.com')])
OrderedDict([('first_name', 'Dave'), ('last_name', 'Smith'), ('email', 'davesmith@bogusemail.com')])
OrderedDict([('first_name', 'Jane'), ('last_name', 'Stuart'), ('email', 'janestuart@bogusemail.com')])
OrderedDict([('first_name', 'Tom'), ('last_name', 'Wright'), ('email', 'tomwright@bogusemail.com')])
OrderedDict([('first_name', 'Steve'), ('last_name', 'Robinson'), ('email', 'steverobinson@bogusemail.com')])
OrderedDict([('first_name', 'Nicole'), ('last_name', 'Jacobs'), ('email', 'nicolejacobs@bogusemail.com')])
OrderedDict([('first_name', 'Jane'), ('last_name', 'Wright'), ('email', 'janewright@bogusemail.com')])
OrderedDict([('first_name', 'Jane'), ('last_name', 'Doe'), ('email', 'janedoe@bogusemail.com')])
OrderedDict([('first_name', 'Kurt'), ('last_name', 'Wright'), ('email'

In [101]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    for line in csv_reader:
        print(line['email'])

john-doe@bogusemail.com
maryjacobs@bogusemail.com
davesmith@bogusemail.com
janestuart@bogusemail.com
tomwright@bogusemail.com
steverobinson@bogusemail.com
nicolejacobs@bogusemail.com
janewright@bogusemail.com
janedoe@bogusemail.com
kurtwright@bogusemail.com
kurtrobinson@bogusemail.com
janejenkins@bogusemail.com
neilrobinson@bogusemail.com
tompatterson@bogusemail.com
samjenkins@bogusemail.com
stevestuart@bogusemail.com
maggiepatterson@bogusemail.com
maggiestuart@bogusemail.com
janedoe@bogusemail.com
stevepatterson@bogusemail.com
davesmith@bogusemail.com
samwilks@bogusemail.com
kurtjefferson@bogusemail.com
samstuart@bogusemail.com
janestuart@bogusemail.com
davedavis@bogusemail.com
sampatterson@bogusemail.com
tomjefferson@bogusemail.com
janestuart@bogusemail.com
maggiejefferson@bogusemail.com
marywilks@bogusemail.com
neilpatterson@bogusemail.com
coreydavis@bogusemail.com
stevejacobs@bogusemail.com
janejenkins@bogusemail.com
johnjacobs@bogusemail.com
neilsmith@bogusemail.com
coreywilks@bog

In [102]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    
    with open('new_names.csv', 'w', newline='') as new_file:
        fieldnames = ['first_name', 'last_name', 'email']

        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')

        csv_writer.writeheader()

        for line in csv_reader:
            #del line['email']
            csv_writer.writerow(line)

In [103]:
with open('names.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    
    with open('new_names.csv', 'w', newline='') as new_file:
        fieldnames = ['first_name', 'last_name']

        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter='\t')

        csv_writer.writeheader()

        for line in csv_reader:
            del line['email']
            csv_writer.writerow(line)
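As an alternative to deleting the key from every row, ```DictWriter``` accepts ```extrasaction='ignore'```, which silently drops keys that are not listed in ```fieldnames``` (the default, ```'raise'```, raises a ```ValueError```). The sketch below uses in-memory buffers and a single invented record instead of `names.csv`.

```python
import csv
import io

# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "first_name,last_name,email\n"
    "John,Doe,john-doe@bogusemail.com\n"
)
target = io.StringIO(newline="")

reader = csv.DictReader(source)
writer = csv.DictWriter(
    target,
    fieldnames=["first_name", "last_name"],
    delimiter="\t",
    extrasaction="ignore",   # skip the 'email' key instead of raising ValueError
)

writer.writeheader()
for row in reader:
    writer.writerow(row)

result = target.getvalue()
print(repr(result))
```

This leaves the rows produced by the reader unmodified, which matters if they are reused later.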