## Reading CSV and TXT files
As we saw in previous courses, we can read data using nothing but plain Python.

When you want to work with a file, the first thing to do is to open it. This is done by invoking the open() built-in function.

open() has a single required argument, the path to the file, and returns a single value, the file object.

The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error.
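The closing behaviour described above can be checked with a small sketch. It writes a throwaway temporary file (the name example.txt and its contents are made up for illustration) so that it runs without any of the course files:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as fp:
    fp.write("hello\n")

# Normal exit: the file is closed as soon as the block ends.
with open(path, "r") as fp:
    first_line = fp.readline()
print(fp.closed)  # True

# Exit through an exception: the file is still closed.
try:
    with open(path, "r") as fp_err:
        raise ValueError("simulated failure while reading")
except ValueError:
    pass
print(fp_err.closed)  # True
```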

In [6]:
import csv

In [2]:
import pandas as pd

In [3]:
with open('btc-market-price.csv', 'r') as fp:
    print(fp)

<_io.TextIOWrapper name='btc-market-price.csv' mode='r' encoding='cp1252'>


Once the file is opened, we can read its content as follows:

In [4]:
with open('btc-market-price.csv', 'r') as fp:
    for index, line in enumerate(fp):
        # print just the first 10 lines; each line keeps its trailing
        # newline, which is why the output is double-spaced
        if index < 10:
            print(index, line)

0 2017-04-02 00:00:00,1099.169125

1 2017-04-03 00:00:00,1141.813

2 2017-04-04 00:00:00,1141.6003625

3 2017-04-05 00:00:00,1133.0793142857142

4 2017-04-06 00:00:00,1196.3079375

5 2017-04-07 00:00:00,1190.45425

6 2017-04-08 00:00:00,1181.1498375

7 2017-04-09 00:00:00,1208.8005

8 2017-04-10 00:00:00,1207.744875

9 2017-04-11 00:00:00,1226.6170375



How can we process the data read from the file using pure Python? It involves a lot of manual work, for example, splitting the values by the correct separator:

In [5]:
with open('btc-market-price.csv', 'r') as fp:
    for index, line in enumerate(fp):
        # process just the first 10 lines
        if index < 10:
            timestamp, price = line.split(',')
            print(f"{timestamp}: ${price}")

2017-04-02 00:00:00: $1099.169125

2017-04-03 00:00:00: $1141.813

2017-04-04 00:00:00: $1141.6003625

2017-04-05 00:00:00: $1133.0793142857142

2017-04-06 00:00:00: $1196.3079375

2017-04-07 00:00:00: $1190.45425

2017-04-08 00:00:00: $1181.1498375

2017-04-09 00:00:00: $1208.8005

2017-04-10 00:00:00: $1207.744875

2017-04-11 00:00:00: $1226.6170375
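Splitting is only part of the manual work: every value is still a string. As a rough sketch of the remaining steps, the snippet below converts two sample lines (copied from the output above, so it runs without the file) into datetime and float values. The date format string is an assumption based on how the timestamps look in this file.

```python
from datetime import datetime

# Two sample lines copied from the output above; in practice they would
# come from iterating over the open file.
lines = [
    "2017-04-02 00:00:00,1099.169125\n",
    "2017-04-03 00:00:00,1141.813\n",
]

records = []
for line in lines:
    # Drop the trailing newline before splitting on the separator.
    timestamp_text, price_text = line.rstrip("\n").split(",")
    # Assumed format: year-month-day hour:minute:second.
    timestamp = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    price = float(price_text)
    records.append((timestamp, price))

for timestamp, price in records:
    print(f"{timestamp:%Y-%m-%d}: ${price:.2f}")
```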



## The csv module
Python's standard library includes the csv module, which takes over some of that manual work when reading CSV files:

import csv

In [7]:
with open('btc-market-price.csv', 'r') as fp:
    reader = csv.reader(fp)
    for index, (timestamp, price) in enumerate(reader):
        # read just the first 10 lines
        if index < 10:
            print(f"{timestamp}: ${price}")

2017-04-02 00:00:00: $1099.169125
2017-04-03 00:00:00: $1141.813
2017-04-04 00:00:00: $1141.6003625
2017-04-05 00:00:00: $1133.0793142857142
2017-04-06 00:00:00: $1196.3079375
2017-04-07 00:00:00: $1190.45425
2017-04-08 00:00:00: $1181.1498375
2017-04-09 00:00:00: $1208.8005
2017-04-10 00:00:00: $1207.744875
2017-04-11 00:00:00: $1226.6170375


The csv module takes care of splitting each line of the file using a given separator (called the delimiter) and gives us an iterator over the resulting rows.

In [8]:
with open('exam_review.csv', 'r') as fp:
    reader = csv.reader(fp, delimiter='>')  # special delimiter
    next(reader)  # skipping header
    for index, values in enumerate(reader):
        if not values:
            continue  # skip empty lines
        fname, lname, age, math, french = values
        print(f"{fname} {lname} (age {age}) got {math} in Math and {french} in French")

Ray Morley (age 18) got 68,000 in Math and 75,000 in French
Melvin Scott (age 24) got 77 in Math and 83 in French
Amirah Haley (age 22) got 92 in Math and 67 in French
Gerard Mills (age 19) got 78,000 in Math and 72 in French
Amy Grimes (age 23) got 91 in Math and 81 in French
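Beyond custom delimiters, one practical reason to prefer csv.reader over str.split is quoting: a field may legitimately contain the delimiter if it is wrapped in quotes. The sketch below uses a made-up row (not taken from exam_review.csv) to compare both approaches:

```python
import csv
import io

# Hypothetical data: the first field of the data row contains the
# delimiter inside quotes, and the second contains escaped quotes.
raw = 'name,comment,score\n"Scott, Melvin","said ""hi""",77\n'

# Naive splitting breaks the quoted field into two pieces.
naive = raw.splitlines()[1].split(",")

# csv.reader understands the quoting rules and keeps the field whole.
parsed = list(csv.reader(io.StringIO(raw)))[1]

print(naive)   # ['"Scott', ' Melvin"', '"said ""hi"""', '77']
print(parsed)  # ['Scott, Melvin', 'said "hi"', '77']
```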


## The read_csv function
The first pandas function we'll learn is read_csv, which lets us read comma-separated values (CSV) files and raw text (TXT) files into a DataFrame.

The read_csv function is extremely powerful: it accepts a broad set of parameters that control how the data is read and parsed, including the file's structure, encoding and other details. The most common parameters are as follows:

>filepath_or_buffer: Path (or URL) of the file to be read, or an open file-like object.

>sep: Character(s) that are used as a field separator in the file.

>header: Index of the row containing the names of the columns (None if none).

>index_col: Column (given by position or name), or sequence of columns, to use as the row index of the data.

>names: Sequence containing the names of the columns (used together with header = None).

>skiprows: Number of rows or sequence of row indexes to ignore in the load.

>na_values: Sequence of values that, if found in the file, should be treated as NaN.

>dtype: Dictionary whose keys are column names and whose values are the NumPy types to which each column's content must be converted.

>parse_dates: Flag that indicates whether pandas should try to parse date-like data as dates. It can also be a list of the columns (names or positions) that should be parsed as dates.

>date_parser: Function to use to try to parse dates (deprecated since pandas 2.0 in favor of date_format).

>nrows: Number of rows to read from the beginning of the file.

>skipfooter: Number of rows to ignore at the end of the file (named skip_footer in old pandas versions).

>encoding: Encoding to be expected from the file read.

>squeeze: Flag that indicates that, if the data read contains only one column, the result is a Series instead of a DataFrame (deprecated in pandas 1.4 and removed in 2.0; call .squeeze("columns") on the result instead).

>thousands: Character to use to detect the thousands separator.

>decimal: Character to use to detect the decimal separator.

>skip_blank_lines: Flag that indicates whether blank lines should be ignored.

Full read_csv documentation can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
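To give a feel for how several of these parameters combine, here is a small sketch that parses made-up data held in memory. The semicolon-separated layout, the comment line and the ? marker are assumptions chosen for illustration, not the format of any course file:

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a file on disk.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "date;price\n"
    "2017-04-02;1.099,17\n"
    "2017-04-03;?\n"
    "2017-04-04;1.141,60\n"
)

df = pd.read_csv(
    raw,
    sep=";",            # fields are separated by semicolons
    skiprows=1,         # ignore the leading comment line
    header=0,           # first remaining row holds the column names
    index_col="date",   # use the date column as the row index
    parse_dates=True,   # parse the index as dates
    na_values=["?"],    # treat "?" as a missing value
    thousands=".",      # "1.099,17" uses "." for thousands...
    decimal=",",        # ...and "," for decimals
)

print(df)
print(df.dtypes)
```

Passing a file path instead of the io.StringIO object works the same way.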

In this case we'll try to read our btc-market-price.csv CSV file using different parameters to parse it correctly.

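This file contains records of the mean price of Bitcoin per date.

Note that read_csv also accepts a URL in place of a local path. The next cell loads a public GDP dataset that way: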



In [9]:
csv_url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

pd.read_csv(csv_url).head()

| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 0 | Arab World | ARB | 1968 | 25760680000.0 |
| 1 | Arab World | ARB | 1969 | 28434200000.0 |
| 2 | Arab World | ARB | 1970 | 31385500000.0 |
| 3 | Arab World | ARB | 1971 | 36426910000.0 |
| 4 | Arab World | ARB | 1972 | 43316060000.0 |


## Missing values with the na_values parameter
We can pass the na_values parameter to list the values that should be recognized as NA/NaN. In this case, empty strings (''), ? and - will be recognized as null values.

In [10]:
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'])
df.head()

| | 0 | 1 |
|---|---|---|
| 0 | 2017-04-02 00:00:00 | 1099.169125 |
| 1 | 2017-04-03 00:00:00 | 1141.813 |
| 2 | 2017-04-04 00:00:00 | 1141.600363 |
| 3 | 2017-04-05 00:00:00 | 1133.079314 |
| 4 | 2017-04-06 00:00:00 | 1196.307937 |
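The first rows shown above contain no missing markers, so the effect of na_values is not visible there. As a sketch of what the parameter changes, the snippet below parses a few made-up rows in the same two-column layout (these values are invented for illustration and are not taken from btc-market-price.csv), once with the defaults and once with na_values:

```python
import io

import pandas as pd

# Made-up rows in the same two-column layout, with three kinds of gaps.
raw = (
    "2017-04-02 00:00:00,1099.169125\n"
    "2017-04-03 00:00:00,?\n"
    "2017-04-04 00:00:00,-\n"
    "2017-04-05 00:00:00,\n"
)

# Defaults: only the empty field is treated as missing, so "?" and "-"
# survive as text and the column cannot be numeric.
default = pd.read_csv(io.StringIO(raw), header=None)

# With na_values, all three markers become NaN and the column is numeric.
marked = pd.read_csv(io.StringIO(raw), header=None, na_values=["", "?", "-"])

print(default[1].isna().sum())  # 1
print(marked[1].isna().sum())   # 3
print(marked[1].dtype)          # float64
```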
