# Reading and Writing CSV Files

CSV (comma-separated values) files are plain text files in which each line is a record and the fields within a record are separated by commas. They can also be opened with spreadsheet programs such as Microsoft Excel. Try opening `accounts.csv` in Excel.

The following command displays the contents of `accounts.csv` on a Linux or macOS system.

In [1]:
!cat accounts.csv

100,Jones,24.98
200,Doe,345.67
300,White,0.0
400,Stone,-42.16
500,Rich,224.62


Python's `csv` module can be used for reading and writing CSV files.

## Writing a CSV file

In [5]:
import csv

with open('accounts.csv', mode='w', newline='') as accounts:
    writer = csv.writer(accounts)
    writer.writerow([100, 'Jones', 24.98])
    writer.writerow([200, 'Doe', 345.67])
    writer.writerow([300, 'White', 0.00])
    writer.writerow([400, 'Stone', -42.16])
    writer.writerow([500, 'Rich', 224.61])

In [6]:
!cat accounts.csv

100,Jones,24.98
200,Doe,345.67
300,White,0.0
400,Stone,-42.16
500,Rich,224.61


* **`.csv` file extension** indicates a CSV-format file
* **`writer` function** returns an object that writes CSV data to the specified file object
* `writer`’s **`writerow` method** receives an iterable to store in the file
* By default, `writerow` delimits values with commas, but you can specify custom delimiters
* The `writerow` calls above can be replaced with one **`writerows`** call that receives an iterable of records (for example, a list of lists) and writes each one as a row
* If you write data that contains commas in a given string, `writerow` encloses that string in double quotes to indicate a _single_ value
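The last three points can be tried out without touching `accounts.csv`. The following is a minimal sketch that writes to an in-memory `io.StringIO` buffer instead of a file; the record `'Doe, Jr.'` and the semicolon delimiter are made up for illustration.

```python
import csv
import io

records = [
    [100, 'Jones', 24.98],
    [200, 'Doe, Jr.', 345.67],  # hypothetical name that contains a comma
]

# One writerows call writes every record (default comma delimiter).
comma_buffer = io.StringIO()
csv.writer(comma_buffer).writerows(records)
comma_text = comma_buffer.getvalue()

# The same records written with a semicolon as the delimiter.
semicolon_buffer = io.StringIO()
csv.writer(semicolon_buffer, delimiter=';').writerows(records)
semicolon_text = semicolon_buffer.getvalue()

print(comma_text)      # 'Doe, Jr.' is wrapped in double quotes here
print(semicolon_text)  # no quotes needed: the comma is no longer the delimiter
```

With the comma delimiter the name is quoted so that it still reads back as a single field. With the semicolon delimiter the comma is ordinary text and no quoting is required.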

## Reading a CSV file

In [8]:
with open('accounts.csv', 'r', newline='') as accounts:
    print(f'{"Account":<10}{"Name":<10}{"Balance":>10}')
    reader = csv.reader(accounts)
    for record in reader:  
        account, name, balance = record
        print(f'{account:<10}{name:<10}{balance:>10}')

Account   Name         Balance
100       Jones          24.98
200       Doe           345.67
300       White            0.0
400       Stone         -42.16
500       Rich          224.61


* `csv` module’s **`reader` function** returns an object that reads CSV-format data from the specified file object
* You can iterate through the `reader` object one record at a time; each record is a list of strings, so numeric fields such as the balance must be converted before doing arithmetic
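Because `csv.reader` returns every field as a string, numeric fields need an explicit conversion before they can be used in calculations. The sketch below assumes the same three-column layout as `accounts.csv` and reads from an in-memory buffer so that it runs on its own.

```python
import csv
import io

# In-memory stand-in for accounts.csv (first three records only).
sample = io.StringIO('100,Jones,24.98\n200,Doe,345.67\n300,White,0.0\n')

account_numbers = []
total = 0.0
for account, name, balance in csv.reader(sample):
    account_numbers.append(int(account))  # '100' -> 100
    total += float(balance)               # '24.98' -> 24.98

print(account_numbers)
print(f'Total balance: {total:.2f}')
```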

## Exercise 1

In this exercise, you will create a company's personnel file using the CSV file format. Each personnel record will contain firstname, lastname, position, age and salary fields. The firstname, lastname and position fields are strings, age is an integer and salary is a float.

Create a file named `personnel.csv`. Write the column names as the first row, then write 5 records to it. Display the contents with the `cat` command (or the `more` command on Windows).

Use Python code to read the file and display it in tabular format with proper column headers.

## Reading CSV Files to a Pandas DataFrame object

* Load a CSV dataset into a `DataFrame` with the pandas function **`read_csv`**
* `names` argument specifies the `DataFrame`’s column names
    * Without this argument, `read_csv` assumes that the CSV file’s first row is a comma-delimited list of column names

In [None]:
import pandas as pd
df = pd.read_csv('accounts.csv', 
                 names=['account', 'name', 'balance'])
print(df)
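To see the effect of the `names` argument, the following sketch reads the same two records twice: once from data without a header row (passing `names`) and once from data whose first row holds the column names. Both inputs are in-memory buffers invented for this illustration.

```python
import io
import pandas as pd

# Data without a header row: column names must be supplied via names=.
no_header = io.StringIO('100,Jones,24.98\n200,Doe,345.67\n')
df_named = pd.read_csv(no_header, names=['account', 'name', 'balance'])

# Data with a header row: read_csv takes the column names from the first row.
with_header = io.StringIO('account,name,balance\n100,Jones,24.98\n200,Doe,345.67\n')
df_header = pd.read_csv(with_header)

print(df_named)
print(df_header)
```

If `names` were omitted for the first buffer, the record for account 100 would be consumed as the header and only one data row would remain.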

## Selecting a column

A column can be selected as a `Series` object by indexing the `DataFrame` with the column name.

In [None]:
names = df['name']
for i in names:
    print(i)

## Creating a Calculated Column

A new column can be calculated from an existing column.

In [None]:
# creating a calculated column
df['new_balance'] = df['balance'] + 100

print(df)

## Traverse the DataFrame using `loc`

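The `loc` indexer selects rows by label, so with the default integer index `df.loc[i]` returns row `i` as a `Series`. The `count` method of a `Series` returns the number of non-missing values in that column, which equals the number of rows when no values are missing. `loc` can also select a subset of columns.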

In [None]:
for i in range(df['account'].count()):
    account, name, balance, new_balance = df.loc[i]
    print(account)
    
# create a subset containing only the name and balance columns
dfsub = df.loc[:, ['name', 'balance']]
print(dfsub)
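Looping with `range` and `loc` relies on the row labels being `0`, `1`, `2` and so on. As an alternative sketch, the `DataFrame` method `itertuples` visits every row regardless of how the index is labeled. The small `DataFrame` below is rebuilt inline so that the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'account': [100, 200, 300],
    'name': ['Jones', 'Doe', 'White'],
    'balance': [24.98, 345.67, 0.0],
})

# itertuples yields one named tuple per row; index=False leaves out the row label.
visited = []
for row in df.itertuples(index=False):
    visited.append((row.account, row.name, row.balance))
    print(row.account, row.name, row.balance)
```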

## Filtering Rows

You can select a subset of rows by indexing the `DataFrame` with a Boolean condition.

In [None]:
# keep only the rows where balance is greater than 200
dfbalance = df[df['balance']>200]

print(dfbalance)
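Conditions can also be combined. In pandas, `&` means "and" and `|` means "or", and each condition needs its own parentheses. The sketch below rebuilds the accounts data inline; the thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'account': [100, 200, 300, 400, 500],
    'name': ['Jones', 'Doe', 'White', 'Stone', 'Rich'],
    'balance': [24.98, 345.67, 0.0, -42.16, 224.61],
})

# Rows where the balance is positive AND below 300.
in_range = df[(df['balance'] > 0) & (df['balance'] < 300)]

# Rows where the balance is negative OR above 300.
extremes = df[(df['balance'] < 0) | (df['balance'] > 300)]

print(in_range)
print(extremes)
```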

## Writing DataFrame object to CSV file
* To save a `DataFrame` to a file using CSV format, call `DataFrame` method **`to_csv`**
* `index=False` indicates that the row names (`0`–`4` at the left of the `DataFrame`’s output above) are not written to the file
* Resulting file contains the column names as the first row

In [None]:
df.to_csv('accounts_from_dataframe.csv', index=False)
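To compare the output with and without `index=False`, it can help to look at the CSV text directly. When `to_csv` is called without a file name it returns the CSV data as a string, which the following sketch uses on a two-row `DataFrame` built inline.

```python
import pandas as pd

df = pd.DataFrame({
    'account': [100, 200],
    'name': ['Jones', 'Doe'],
    'balance': [24.98, 345.67],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # each row starts with its row name (0, 1)
print(without_index)  # only the column names and the data
```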

## Here Are Some Datasets

### Datasets
* Enormous variety of free datasets available online
* **Rdatasets repository** provides links to over 1100 free datasets in comma-separated values (CSV) format
> https://vincentarelbundock.github.io/Rdatasets/datasets.html
* **`pydataset` module** specifically for accessing Rdatasets
> https://github.com/iamaziz/PyDataset
* Another large source of datasets is
> https://github.com/awesomedata/awesome-public-datasets
* A commonly used machine-learning dataset for beginners is the **Titanic disaster dataset**

## Exercise 2

- Download the `TitanicSurvival.csv` dataset.
- Open it in a text editor such as Notepad and change the first column name to "name".
- Read the CSV file into a DataFrame.
- Print only the 'name' column.
- The Titanic sank in 1912. Create a calculated column 'birthyear' by subtracting the 'age' column from 1912.
- Create a DataFrame containing only the passengers who survived.
- Create a DataFrame containing only first-class passengers.
- Create a DataFrame containing only the name and sex columns. Write this DataFrame to a CSV file.


