# Introduction

In the previous chapter, you became familiar with the pandas library and the basic functionality it
provides for data analysis. You saw that DataFrame and Series are the heart of this library: they
are the structures on which all data manipulation, calculation, and analysis is performed.
In this chapter you will see the tools pandas provides for reading data stored in many types
of media (such as files and databases). In parallel, you will also see how to write data structures directly to
these formats, without worrying too much about the underlying technologies.
This chapter focuses on a series of I/O API functions that make it as easy as possible to read and
write DataFrame objects in the most commonly used formats. You start with text files, then move
gradually to more complex binary formats.
At the end of the chapter, you’ll also learn how to interface with common databases, both SQL and
NoSQL, with examples showing how to store the data of a DataFrame directly in them. You
will also see how to read the data contained in a database and retrieve it directly as a DataFrame.

---

# I/O API Tools

| Readers | Writers |
| --- | --- |
| read_csv | to_csv |
| read_excel | to_excel |
| read_hdf | to_hdf |
| read_sql | to_sql |
| read_json | to_json |
| read_html | to_html |
| read_stata | to_stata |
| read_clipboard | to_clipboard |
| read_pickle | to_pickle |
| read_msgpack | to_msgpack (experimental) |
| read_gbq | to_gbq (experimental) |

---

# CSV and Textual Files

Over the years, everyone has become accustomed to reading and writing files in text form. In particular, data
are generally reported in tabular form. If the values in a row are separated by commas, you have the CSV
(comma-separated values) format, which is perhaps the best-known and most popular format. <br>
Other tabular data, separated by spaces or tabs, are typically contained in text files of various
types (generally with the extension .txt). <br>
This type of file is therefore the most common source of data, and it is also easy to transcribe and
interpret. For this purpose, pandas provides a set of functions specific to this type of file: <br> <br>
• read_csv <br>
• read_table <br>
• to_csv

---

# Reading Data in CSV or Text Files

In [1]:
import numpy as np
import pandas as pd

In [22]:
csvframe = pd.read_csv('ch05_01.csv')

In [23]:
csvframe

Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse


In [24]:
# use read_table() with specified separator
pd.read_table('ch05_01.csv', sep=',')

  


Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse


In the example you just saw, the headers that identify the columns are in the first row of the CSV file. This is not always the case, however: the tabulated data often begin directly on the first line. If you read such a file with the default settings, the first data row is taken as the header; the header=None and names options avoid this.

In [25]:
pd.read_csv('ch05_02.csv')

Unnamed: 0,1,5,2,3,cat
0,2,7,8,5,dog
1,3,3,6,7,horse
2,2,2,8,3,duck
3,4,4,2,1,mouse


In [26]:
pd.read_csv('ch05_02.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse


In [27]:
pd.read_csv('ch05_02.csv', 
            names=['white', 'red', 'blue', 'green', 'animal'])

Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse
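
The three header behaviors can be reproduced without any file on disk. The following is a minimal sketch that assumes a small in-memory sample (the `raw` string is a hypothetical stand-in for a header-less file such as ch05_02.csv) wrapped in `io.StringIO`:

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file that has no header row
raw = "1,5,2,3,cat\n2,7,8,5,dog\n3,3,6,7,horse\n"

# Default settings: the first data row is consumed as the header
wrong = pd.read_csv(io.StringIO(raw))

# header=None: pandas assigns the integer column labels 0..4
numbered = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: supply your own labels; every line is then treated as data
named = pd.read_csv(io.StringIO(raw),
                    names=['white', 'red', 'blue', 'green', 'animal'])

print(len(wrong), len(numbered), len(named))
```

With the default settings one row of data is lost to the header, which is why `wrong` is one row shorter than the other two frames.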


In more complex cases, in which you want to create a DataFrame with a hierarchical structure by
reading a CSV file, you can extend the functionality of the read_csv() function by adding the index_col
option and assigning to it all the columns that should become index levels.

In [28]:
pd.read_csv('ch05_03.csv', index_col=['color', 'status'])

Unnamed: 0_level_0,Unnamed: 1_level_0,item1,item2,item3
color,status,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
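
As a self-contained sketch of the same idea, the snippet below builds a two-level index from an in-memory sample. The `raw` string is an assumption modeled on the output above, not the actual contents of ch05_03.csv:

```python
import io
import pandas as pd

# Hypothetical sample: two columns that should become index levels
raw = (
    "color,status,item1,item2,item3\n"
    "black,up,3,4,6\n"
    "black,down,2,6,7\n"
    "white,up,5,5,5\n"
)

# Each column listed in index_col becomes one level of a MultiIndex
hier = pd.read_csv(io.StringIO(raw), index_col=['color', 'status'])

print(hier.index.nlevels)                      # number of index levels
print(hier.loc[('black', 'down'), 'item2'])    # select by (color, status)
```

A row is then addressed by a tuple with one label per level, as in the last line.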


## Using RegExp for Parsing TXT Files

To better understand how a regexp can serve as the criterion for separating values, start from a simple case. Suppose that your file, such as a TXT file, has values separated by spaces or tabs in an unpredictable order. In this case you have to use a regexp, because only a regexp can treat both characters as separators. You can do that with the pattern \s*. Here \s stands for a whitespace character such as a space or a tab (if you wanted to match only the tab, you would use \t), while the asterisk indicates that the character may occur zero or more times (see Table 5-1 for other commonly used wildcards). That is, the values may be separated by several spaces or tabs. Because \s* also matches an empty string, \s+ (one or more) is the safer choice in practice.

In [29]:
pd.read_table('ch05_04.txt', sep=r'\s*')  # very bad way: \s* can match an empty string, so every character becomes a column; sep=r'\s+' avoids this



Unnamed: 0.1,Unnamed: 0,w,h,i,t,e,Unnamed: 6,r,e.1,d,...,l,u,e.2,Unnamed: 15,g,r.1,e.3,e.4,n,Unnamed: 21
0,,1,,5,,2,,3,,,...,,,,,,,,,,
1,,2,,7,,8,,5,,,...,,,,,,,,,,
2,,3,,3,,6,,7,,,...,,,,,,,,,,
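
For comparison, here is a minimal sketch of the whitespace case using `\s+` (one or more whitespace characters) as the separator. The `raw` string is a hypothetical sample with mixed spaces and tabs, not the actual contents of ch05_04.txt:

```python
import io
import pandas as pd

# Hypothetical sample: values separated by an unpredictable mix of spaces and tabs
raw = "white red blue green\n1 5\t2   3\n2\t7 8 5\n3 3 6\t 7\n"

# \s+ needs at least one whitespace character, so it never matches
# the empty string between two adjacent letters or digits
frame = pd.read_csv(io.StringIO(raw), sep=r'\s+')

print(frame)
```

Each run of spaces and tabs is treated as a single separator, so the frame has the four expected columns.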


In [30]:
pd.read_table('ch05_05.txt', header=None, sep=r'\D*')  # same pitfall: \D* also matches the empty string between digits; sep=r'\D+' splits only on runs of non-digits



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,,0,0,0,,1,2,3,,1,2,2,
1,,0,0,1,,1,2,4,,3,2,1,
2,,0,0,2,,1,2,5,,3,3,3,
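
The sketch below shows the effect of `\D+` (one or more non-digit characters) on data where numbers are glued together by letters. The sample lines are an assumption about what such a file might contain, and `engine='python'` is passed explicitly because regular-expression separators are handled by the Python parsing engine:

```python
import io
import pandas as pd

# Hypothetical sample: numeric fields separated by runs of letters
raw = "000END123AAA122\n001END124BBB321\n002END125CCC333\n"

# \D+ matches one or more non-digit characters, so each run of
# letters acts as a single separator between the numeric fields
digits = pd.read_csv(io.StringIO(raw), sep=r'\D+', header=None,
                     engine='python')

print(digits)
```

Each line now yields three numeric columns instead of one column per digit.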


Another fairly common need is to exclude lines from parsing. You do not always want to include
the headers or unnecessary comments contained in a file (see Listing 5-6). With the skiprows option you can
exclude any lines you want by assigning a list of the zero-based line numbers to skip during parsing.

In [32]:
pd.read_table('ch05_06.txt', sep=',', skiprows=[0, 1, 3, 6])



Unnamed: 0,white,red,blue,green,animal
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
3,2,2,8,3,duck
4,4,4,2,1,mouse
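
To make the line numbering concrete, here is a small sketch on an in-memory sample. The log-style lines in `raw` are hypothetical and only illustrate the kind of content you might want to skip:

```python
import io
import pandas as pd

# Hypothetical sample: comment lines mixed with the tabular data
raw = (
    "########### LOG FILE ############\n"                    # line 0: skip
    "This file has been generated by automatic system\n"     # line 1: skip
    "white,red,blue,green,animal\n"                          # line 2: header
    "12-Feb-2015: Counting of animals inside the house\n"    # line 3: skip
    "1,5,2,3,cat\n"                                          # line 4
    "2,7,8,5,dog\n"                                          # line 5
    "13-Feb-2015: Counting of animals outside the house\n"   # line 6: skip
    "3,3,6,7,horse\n"                                        # line 7
)

# The list holds zero-based line numbers counted from the top of the file
clean = pd.read_csv(io.StringIO(raw), skiprows=[0, 1, 3, 6])

print(clean)
```

The first line that is not skipped (line 2 here) is used as the header.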


## Reading TXT Files into Parts or Partially

When large files are processed, or when you’re only interested in portions of these files, you often need to
read the file in portions (chunks), either to iterate over the pieces or because you do not need to
parse the entire file.
So if, for example, you want to read only a portion of the file, you can explicitly specify which
lines to parse. With the skiprows option you exclude lines (an integer n skips the first n lines, while a list
skips the listed zero-based line numbers), and with the nrows option you set the number of rows to read (nrows = i).

In [35]:
pd.read_csv('ch05_02.csv', header=None, skiprows=[3], nrows=3)

Unnamed: 0,0,1,2,3,4
0,1,5,2,3,cat
1,2,7,8,5,dog
2,3,3,6,7,horse
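
The following sketch contrasts reading a window of rows with iterating over the file in chunks. The `raw` sample is a hypothetical stand-in for ch05_02.csv, and the `chunksize` option of read_csv() is not used elsewhere in this section; it is shown here only as one way to get the chunk-by-chunk behavior described above:

```python
import io
import pandas as pd

# Hypothetical stand-in for a header-less CSV file
raw = "1,5,2,3,cat\n2,7,8,5,dog\n3,3,6,7,horse\n2,2,8,3,duck\n4,4,2,1,mouse\n"

# skiprows=2 skips the first two lines; nrows=2 then reads two rows
part = pd.read_csv(io.StringIO(raw), header=None, skiprows=2, nrows=2)

# chunksize=2 returns an iterator that yields DataFrames of up to two rows
totals = []
for chunk in pd.read_csv(io.StringIO(raw), header=None, chunksize=2):
    totals.append(int(chunk[0].sum()))  # sum of the first column per chunk

print(part)
print(totals)
```

Processing one chunk at a time keeps only a small part of the file in memory, which is the main reason to read in portions.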


## Writing Data in CSV

In [39]:
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

In [40]:
frame2

Unnamed: 0,ball,pen,pencil,paper
red,1,4,3,6
blue,4,5,6,1
yellow,3,3,1,5
white,4,1,6,4


In [43]:
frame2.to_csv('ch05_07.csv')

In [44]:
frame2.to_csv('ch05_08.csv', header=False, index=False)
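
To see what these two calls write, you can call to_csv() without a path, in which case it returns the CSV text as a string instead of creating a file. This is a minimal sketch on a smaller, hypothetical frame:

```python
import pandas as pd

# Smaller hypothetical frame, used only to keep the output short
frame = pd.DataFrame([[1, 4], [4, 5]],
                     index=['red', 'blue'],
                     columns=['ball', 'pen'])

# Default: index labels and column headers are both written
with_labels = frame.to_csv()

# header=False and index=False: only the values are written
bare = frame.to_csv(header=False, index=False)

print(with_labels.splitlines())
print(bare.splitlines())
```

In the default output the header line starts with an empty field, which is the unnamed index column.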

In [48]:
frame3 = pd.read_csv('ch05_07.csv')

In [53]:
frame3.columns

Index(['Unnamed: 0', 'ball', 'pen', 'pencil', 'paper'], dtype='object')

In [55]:
frame3['Unnamed: 0']

0       red
1      blue
2    yellow
3     white
Name: Unnamed: 0, dtype: object

In [56]:
frame3 = frame3.set_index('Unnamed: 0')

In [58]:
frame3.index.name = None

In [59]:
frame3

Unnamed: 0,ball,pen,pencil,paper
red,1,4,3,6
blue,4,5,6,1
yellow,3,3,1,5
white,4,1,6,4
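
A shorter route to the same result is to tell read_csv() which column holds the index when reading the file back. The sketch below assumes a small hypothetical frame and an in-memory round trip rather than the ch05_07.csv file:

```python
import io
import pandas as pd

# Hypothetical frame written to CSV text with its index
frame = pd.DataFrame([[1, 4], [4, 5]],
                     index=['red', 'blue'],
                     columns=['ball', 'pen'])
text = frame.to_csv()

# index_col=0 uses the first column as the index while reading,
# so no separate set_index() step is needed afterwards
restored = pd.read_csv(io.StringIO(text), index_col=0)

print(restored)
```

The row labels come back as the index and the data columns keep their original names.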


To save a DataFrame that contains null values so that they appear in the CSV file as 'NaN', use the na_rep option: <br> <br>
frame3.to_csv('ch05_09.csv', na_rep='NaN')
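
The effect of na_rep is easiest to see side by side with the default. This is a minimal sketch on a hypothetical frame that contains missing values, again using to_csv() without a path so that the text is returned as a string:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values
frame = pd.DataFrame({'ball': [1.0, np.nan], 'pen': [np.nan, 5.0]},
                     index=['red', 'blue'])

# Default: missing values are written as empty fields
default_text = frame.to_csv()

# na_rep='NaN': missing values are written as the given string
marked_text = frame.to_csv(na_rep='NaN')

print(default_text.splitlines())
print(marked_text.splitlines())
```

Any string can be passed to na_rep, so the marker can be chosen to suit whatever program reads the file next.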

---

# Important Points

- read_csv():
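    - parameters: header (row to use for the column names; the first line by default, None if the file has no header row)
    - returns a DataFrame
- read_table(): same as read_csv() but with a tab as the default separator; it was deprecated in pandas 0.24 and reinstated in 0.25, so read_csv() with an explicit sep is the clearer choice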