<h2>Reading CSV and TXT files</h2>
Rather than creating <code>Series</code> or <code>DataFrame</code> structures from scratch (or even from core Python sequences or <code>ndarrays</code>), the most typical use of <b>pandas</b> is loading data from files or other sources for further exploration, transformation, and analysis. <br><br>In this lesson, we'll learn how to read CSV and raw text files into pandas <code>DataFrames</code>.

In [8]:
import numpy as np
import pandas as pd

<h4>Reading data with Python</h4>
<br>When you want to work with a file, the first thing you have to do is open it by calling the built-in <code>open()</code> function.<br><br><code>open()</code> has a single required argument, the path to the file, and a single return value: <b>the file object</b>.<br><br>The <code>with</code> statement automatically takes care of closing the file once execution leaves the <code>with</code> block, even in case of error.

In [4]:
# with open('''filepath&file''', 'r') as fp:
#     pass
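
A minimal sketch of the pattern described above. The file name <code>example.txt</code> and its contents are made up for illustration, and the file is written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical sample file, created only so there is something to read.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, 'example.txt')
with open(file_path, 'w') as fp:
    fp.write('Timestamp,Price\n2017-04-02,1099.17\n')

# open() returns a file object; the with block closes it on exit,
# even if an exception is raised inside the block.
with open(file_path, 'r') as fp:
    lines = fp.read().splitlines()

print(lines)
print(fp.closed)  # the file object is already closed at this point
```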

<h4>Reading data with Pandas</h4>

In [17]:
formats = pd.DataFrame({
    'Format Type': ['text', 'text', 'text', 'text',
                    'binary', 'binary', 'binary', 'binary',
                    'binary', 'binary', 'binary', 'binary',
                    'binary', 'SQL', 'SQL'],
    'Data Description': ['CSV', 'JSON', 'HTML', 'Local Clip',
                         'MS Excel', 'OpenDocument', 'HDF5', 'Feather Format',
                         'Parquet Format', 'Msgpack', 'Stata', 'SAS',
                         'Python Pickle Format', 'SQL', 'Google Big Query'],
    'Reader': ['read_csv', 'read_json', 'read_html', 'read_clipboard',
               'read_excel', 'read_excel', 'read_hdf', 'read_feather',
               'read_parquet', 'read_msgpack', 'read_stata', 'read_sas',
               'read_pickle', 'read_sql', 'read_gbq'],
    'Writer': ['to_csv', np.nan, np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan,
               np.nan, np.nan, np.nan]
})

In [19]:
formats.set_index('Format Type', inplace=True)

In [20]:
formats

Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,
text,HTML,read_html,
text,Local Clip,read_clipboard,
binary,MS Excel,read_excel,
binary,OpenDocument,read_excel,
binary,HDF5,read_hdf,
binary,Feather Format,read_feather,
binary,Parquet Format,read_parquet,
binary,Msgpack,read_msgpack,


<h2>The <code>read_csv</code> Function</h2>
The <code>read_csv</code> function is extremely powerful: it accepts a broad set of parameters at import time that let us configure precisely how the data is read and parsed, including its structure, encoding, and other details. The most common parameters are as follows:

<b><u>PARAMETERS</u>:</b><br>
<code>filepath_or_buffer</code>: Path (or URL, or file-like object) of the file to be read.<br>
<code>sep</code>: Character(s) used as the field separator in the file.<br>
<code>header</code>: Index of the row containing the column names (<code>None</code> if there is none).<br>
<code>index_col</code>: Index of the column, or sequence of indexes, that should be used as the row index of the data.<br>
<code>names</code>: Sequence containing the names of the columns (used together with <code>header=None</code>).<br>
<code>skiprows</code>: Number of rows to skip at the start of the file, or sequence of row indexes to skip.<br>
<code>na_values</code>: Sequence of values that, if found in the file, should be treated as NaN.<br>
<code>dtype</code>: Dictionary whose keys are column names and whose values are the NumPy types their content must be converted to.<br>
<code>parse_dates</code>: Flag that indicates whether pandas should try to parse values that look like dates as dates.<br>
<code>date_parser</code>: Function to use to parse dates (deprecated since pandas 2.0 in favor of <code>date_format</code>).<br>
<code>skipfooter</code>: Number of lines at the end of the file to skip (spelled <code>skip_footer</code> in old pandas releases).<br>
<code>encoding</code>: Encoding to use when reading the file (for example <code>'utf-8'</code>).<br>
<code>squeeze</code>: If the parsed data contains only one column, return a <code>Series</code> (removed in pandas 2.0).<br>
<code>thousands</code>: Character to use as the thousands separator.<br>
<code>decimal</code>: Character to use as the decimal separator.<br>
<code>skip_blank_lines</code>: Flag that indicates whether blank lines should be ignored.<br>
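
As a small illustration of how a couple of these parameters combine, the sketch below skips a leading comment line and a trailing summary line. The inline text is made up for this example, and <code>engine='python'</code> is passed explicitly because the default C parser does not support <code>skipfooter</code>.

```python
import io
import pandas as pd

# Made-up file contents: one junk line at the top, one at the bottom.
raw = io.StringIO(
    '# exported by a hypothetical tool\n'
    'Timestamp,Price\n'
    '2017-04-02,1099.17\n'
    '2017-04-03,1141.81\n'
    'total rows: 2\n'
)

# skiprows=1 drops the first line, skipfooter=1 drops the last one.
sample = pd.read_csv(raw, skiprows=1, skipfooter=1, engine='python')
print(sample)
```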


<h3>Reading our first CSV file</h3>


In [25]:
pd.read_csv('https://raw.githubusercontent.com/'+
            'ine-rmotr-curriculum/data-cleaning'+
            '-rmotr-freecodecamp/master/data/btc'+
            '-market-price.csv').head()

# In this case, we let pandas infer everything related to our data, but in most cases
# we'll need to explicitly tell pandas how we want our data to be loaded by using parameters.

Unnamed: 0,2017-04-02 00:00:00,1099.169125
0,2017-04-03 00:00:00,1141.813
1,2017-04-04 00:00:00,1141.600363
2,2017-04-05 00:00:00,1133.079314
3,2017-04-06 00:00:00,1196.307937
4,2017-04-07 00:00:00,1190.45425


<b><i>First row behavior with <code><b>header</b></code> parameter</i></b><br>
The CSV file we're reading has two columns: <code>Timestamp</code> and <code>Price</code>. It doesn't have a header row, so pandas automatically uses the first row of data as the header, which is incorrect. We can override this behavior with the <code>header</code> parameter.

In [34]:
df = pd.read_csv('https://raw.githubusercontent.com/'+
            'ine-rmotr-curriculum/data-cleaning'+
            '-rmotr-freecodecamp/master/data/btc'+
            '-market-price.csv', 
                 header=None, 
                 names=(['Timestamp', 'Price']))
df.head()

Unnamed: 0,Timestamp,Price
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937


<b><i>Missing values with <code>na_values</code> parameter</i></b><br>
We can pass the <code>na_values</code> parameter a list of the values we want to be recognized as NA/NaN.

In [37]:
df = pd.read_csv('https://raw.githubusercontent.com/'+
            'ine-rmotr-curriculum/data-cleaning'+
            '-rmotr-freecodecamp/master/data/btc'+
            '-market-price.csv', 
                 header=None, na_values=['','?','-'])
df.head()

Unnamed: 0,0,1
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937
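
The rows shown above contain no missing markers, so the effect of <code>na_values</code> is not visible there. The sketch below uses made-up rows in the same two-column layout, with <code>?</code> and <code>-</code> standing in for missing prices, to show what the parameter does.

```python
import io
import pandas as pd

# Made-up rows; '?' and '-' are placeholders for missing prices.
raw = io.StringIO(
    '2017-04-02 00:00:00,1099.169125\n'
    '2017-04-03 00:00:00,?\n'
    '2017-04-04 00:00:00,-\n'
    '2017-04-05 00:00:00,1133.079314\n'
)

sample = pd.read_csv(raw, header=None, names=['Timestamp', 'Price'],
                     na_values=['', '?', '-'])

# The placeholders become NaN, so the column can still be parsed as floats.
print(sample['Price'].isna().sum())
print(sample['Price'].dtype)
```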


<b>Column types using <code>dtype</code> parameter</b>
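<br>A minimal sketch of <code>dtype</code>, assuming the same two-column layout as the file above. The inline rows and the choice of <code>float32</code> are only for illustration.

```python
import io
import pandas as pd

raw = io.StringIO(
    '2017-04-02 00:00:00,1099.169125\n'
    '2017-04-03 00:00:00,1141.813\n'
)

# Keys are column names, values are the types to convert each column to.
sample = pd.read_csv(raw, header=None, names=['Timestamp', 'Price'],
                     dtype={'Price': 'float32'})
print(sample.dtypes)
```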

<b>Date parser using <code>parse_dates</code> parameter</b>
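<br>A minimal sketch of <code>parse_dates</code>, again on made-up rows in the same layout. Here the parameter is given a list of column names to parse, which is one of the forms it accepts.

```python
import io
import pandas as pd

raw = io.StringIO(
    '2017-04-02 00:00:00,1099.169125\n'
    '2017-04-03 00:00:00,1141.813\n'
)

# Ask pandas to parse the Timestamp column as datetimes instead of strings.
sample = pd.read_csv(raw, header=None, names=['Timestamp', 'Price'],
                     parse_dates=['Timestamp'])
print(sample.dtypes)

# Datetime columns expose the .dt accessor.
print(sample['Timestamp'].dt.day.tolist())
```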

<b>Adding index to our data using <code>index_col</code> parameter</b>
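<br>A minimal sketch of <code>index_col</code> on made-up rows in the same layout. Passing <code>0</code> tells pandas to use the first column as the row index instead of generating a numeric one.

```python
import io
import pandas as pd

raw = io.StringIO(
    '2017-04-02 00:00:00,1099.169125\n'
    '2017-04-03 00:00:00,1141.813\n'
)

# Column 0 (Timestamp) becomes the index; only Price remains as a column.
sample = pd.read_csv(raw, header=None, names=['Timestamp', 'Price'],
                     index_col=0)
print(sample)
```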

<b>Custom data delimiters using <code>sep</code> parameter</b>
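<br>A minimal sketch of <code>sep</code>. The pipe-delimited text below is made up for this example; any other single-character delimiter would be passed the same way.

```python
import io
import pandas as pd

# Made-up data that uses '|' instead of ',' between fields.
raw = io.StringIO(
    'first_name|last_name|age\n'
    'Melvin|Scott|24\n'
    'Gerard|Mills|19\n'
)

sample = pd.read_csv(raw, sep='|')
print(sample)
```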

<b>Custom data encoding</b>
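<br>A minimal sketch of the <code>encoding</code> parameter. The bytes below are made-up text encoded as Latin-1, wrapped in <code>io.BytesIO</code> to stand in for a file on disk that was not saved as UTF-8.

```python
import io
import pandas as pd

# Made-up text with non-ASCII characters, encoded as Latin-1 bytes.
raw_bytes = 'name,city\nJosé,Málaga\nZoë,Zürich\n'.encode('latin-1')

# Tell pandas which encoding to use when decoding the bytes.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin-1')
print(sample)
```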

<b>Custom numeric <code>decimal</code> and <code>thousands</code> characters</b>
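<br>A minimal sketch, assuming made-up data written in a European style where <code>.</code> groups thousands and <code>,</code> marks the decimal point (so the fields themselves are separated by <code>;</code>).

```python
import io
import pandas as pd

# Made-up amounts: '1.234,56' should be read as 1234.56.
raw = io.StringIO(
    'item;amount\n'
    'a;1.234,56\n'
    'b;987,10\n'
)

sample = pd.read_csv(raw, sep=';', thousands='.', decimal=',')
print(sample)
print(sample['amount'].dtype)
```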

<b>Get rid of blank lines</b>
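<br>A minimal sketch of <code>skip_blank_lines</code> on made-up text containing one empty line. Blank lines are skipped by default; passing <code>False</code> keeps them as rows of NaN values, which makes the difference visible.

```python
import io
import pandas as pd

# Made-up data with an empty line between the two data rows.
raw_text = 'a,b\n1,2\n\n3,4\n'

# Default behaviour (skip_blank_lines=True): the empty line is ignored.
with_skip = pd.read_csv(io.StringIO(raw_text))

# With skip_blank_lines=False the empty line becomes a row of NaN.
without_skip = pd.read_csv(io.StringIO(raw_text), skip_blank_lines=False)

print(len(with_skip), len(without_skip))
```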

<b>Save to CSV file</b><br>We can simply generate a CSV string from our <code>DataFrame</code>:

In [43]:
exam = pd.DataFrame({
    'first_name': ['Melvin', 'Gerard', 'Amy'],
    'last_name': ['Scott', 'Mills', 'Grimes'],
    'Age': [24, 19, 23],
    'math_score': [77.0, 78.0, 91.0],
    'french_score': [83, 72, 81],
})

exam

Unnamed: 0,first_name,last_name,Age,math_score,french_score
0,Melvin,Scott,24,77.0,83
1,Gerard,Mills,19,78.0,72
2,Amy,Grimes,23,91.0,81
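
A minimal sketch of the export step itself, using a reduced two-column frame so the block is self-contained. Calling <code>to_csv</code> without a path returns the CSV text as a string, and passing a path writes a file instead. The file name <code>exam.csv</code> and the temporary directory are assumptions made for this example.

```python
import os
import tempfile
import pandas as pd

# Reduced, illustrative version of the exam data.
scores = pd.DataFrame({
    'first_name': ['Melvin', 'Gerard'],
    'math_score': [77.0, 78.0],
})

# No path given: to_csv returns the CSV as a string.
# index=False leaves out the numeric row index.
csv_text = scores.to_csv(index=False)
print(csv_text)

# Path given: to_csv writes the file, which read_csv can load back.
csv_path = os.path.join(tempfile.mkdtemp(), 'exam.csv')
scores.to_csv(csv_path, index=False)
round_trip = pd.read_csv(csv_path)
print(round_trip)
```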
