## 1. Reading and Writing Files with Pandas

### Installing Pandas and Preparing Data

In [1]:
import pandas as pd

In [2]:
data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96,
            'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26,
            'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52,
            'GDP': 19_485.39, 'CONT': 'N.America',
            'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93,
            'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77,
            'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91,
            'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    'NGA': {'COUNTRY': 'Nigeria', 'POP': 200.96, 'AREA': 923.77,
            'GDP': 375.77, 'CONT': 'Africa', 'IND_DAY': '1960-10-01'},
    'BGD': {'COUNTRY': 'Bangladesh', 'POP': 167.09, 'AREA': 147.57,
            'GDP': 245.63, 'CONT': 'Asia', 'IND_DAY': '1971-03-26'},
    'RUS': {'COUNTRY': 'Russia', 'POP': 146.79, 'AREA': 17_098.25,
            'GDP': 1_530.75, 'IND_DAY': '1992-06-12'},
    'MEX': {'COUNTRY': 'Mexico', 'POP': 126.58, 'AREA': 1_964.38,
            'GDP': 1_158.23, 'CONT': 'N.America', 'IND_DAY': '1810-09-16'},
    'JPN': {'COUNTRY': 'Japan', 'POP': 126.22, 'AREA': 377.97,
            'GDP': 4_872.42, 'CONT': 'Asia'},
    'DEU': {'COUNTRY': 'Germany', 'POP': 83.02, 'AREA': 357.11,
            'GDP': 3_693.20, 'CONT': 'Europe'},
    'FRA': {'COUNTRY': 'France', 'POP': 67.02, 'AREA': 640.68,
            'GDP': 2_582.49, 'CONT': 'Europe', 'IND_DAY': '1789-07-14'},
    'GBR': {'COUNTRY': 'UK', 'POP': 66.44, 'AREA': 242.50,
            'GDP': 2_631.23, 'CONT': 'Europe'},
    'ITA': {'COUNTRY': 'Italy', 'POP': 60.36, 'AREA': 301.34,
            'GDP': 1_943.84, 'CONT': 'Europe'},
    'ARG': {'COUNTRY': 'Argentina', 'POP': 44.94, 'AREA': 2_780.40,
            'GDP': 637.49, 'CONT': 'S.America', 'IND_DAY': '1816-07-09'},
    'DZA': {'COUNTRY': 'Algeria', 'POP': 43.38, 'AREA': 2_381.74,
            'GDP': 167.56, 'CONT': 'Africa', 'IND_DAY': '1962-07-05'},
    'CAN': {'COUNTRY': 'Canada', 'POP': 37.59, 'AREA': 9_984.67,
            'GDP': 1_647.12, 'CONT': 'N.America', 'IND_DAY': '1867-07-01'},
    'AUS': {'COUNTRY': 'Australia', 'POP': 25.47, 'AREA': 7_692.02,
            'GDP': 1_408.68, 'CONT': 'Oceania'},
    'KAZ': {'COUNTRY': 'Kazakhstan', 'POP': 18.53, 'AREA': 2_724.90,
            'GDP': 159.41, 'CONT': 'Asia', 'IND_DAY': '1991-12-16'}
}

In [3]:
df = pd.DataFrame(data=data).T

In [4]:
df

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


### Reading and Writing CSV Files

Save the DataFrame to a CSV file with `to_csv()`.

In [5]:
df.to_csv('data/data.csv')

Pass the argument `index=False` if you don't want to keep the original index.

In [6]:
df.to_csv('data/nolabels.csv', index=False)

Read the file back with `read_csv()`.

In [7]:
csv_df = pd.read_csv('data/data.csv', index_col=0)

`index_col` specifies which column holds the row labels. It takes a zero-based column index. Use it when the file contains row labels, so that they are loaded as the index rather than as data.

In [8]:
csv_df

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
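To make the effect of `index_col` concrete, here is a minimal sketch that reads the same text with and without it. The two-row `csv_text` sample is hypothetical and only mimics the layout of `data/data.csv`; it is read from memory with `io.StringIO` so no file is needed.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as data/data.csv
csv_text = ",COUNTRY,POP\nCHN,China,1398.72\nIND,India,1351.16\n"

# Without index_col, the labels become an ordinary column named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())   # ['Unnamed: 0', 'COUNTRY', 'POP']
print(labeled.index.tolist())   # ['CHN', 'IND']
```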


### Reading and Writing Excel Files

In [9]:
import openpyxl

Use `to_excel()` to store the DataFrame as an `.xlsx` file.

In [10]:
df.to_excel('data/data.xlsx')

Likewise, you can read it with `read_excel()`.

In [11]:
df_xlsx = pd.read_excel('data/data.xlsx', index_col=0)

In [12]:
df_xlsx

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


## 2. Working with Different File Types

### Understanding the Pandas IO API

**Writing Files** - use the pattern `.to_<file-type>()`:

- `.to_csv()`
- `.to_excel()`
- `.to_json()`
- `.to_html()`
- `.to_pickle()`

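(non-exhaustive list)

**Omitting the path or buffer** - the first argument of `df.to_<file-type>(path_or_buf)` is the target. If you leave it out, methods such as `to_csv()`, `to_json()`, and `to_html()` return the content as a string instead of writing a file.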

**Reading Files** - use the pattern `pd.read_<file-type>()`:

- `.read_csv()`
- `.read_excel()`
- `.read_json()`
- `.read_html()`
- `.read_sql()`
- `.read_pickle()`

### Working with CSV Files

If you omit `path_or_buf` in `to_csv()`, you get the corresponding string instead of a `.csv` file:

In [13]:
csv_str = df.to_csv()
print(csv_str)

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16



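pandas offers several options for handling missing (`NaN`) values when writing and reading CSV files. In this dataset, for example, Russia has no continent assigned: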

In [14]:
df.loc['RUS', 'CONT']

nan

By default, `to_csv()` writes missing values as empty strings. To change this behavior, use the `na_rep` parameter.

In [15]:
df.to_csv('new-data.csv', na_rep='(missing)')

When reading, `pandas` treats empty fields and a set of default markers (such as `NaN`) as missing values. To have additional strings read as `NaN`, pass them with `na_values`.

In [16]:
new_data_nan = pd.read_csv('new-data.csv', index_col=0, na_values='(missing)')

In [17]:
new_data_nan

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
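The interplay between `na_rep` and `na_values` can be sketched on a tiny, hypothetical DataFrame (`small` is not part of the original data). The example works in memory and compares reading the placeholder with and without `na_values`.

```python
import io
import pandas as pd

# Hypothetical sample with one missing continent
small = pd.DataFrame(
    {'COUNTRY': ['China', 'Russia'], 'CONT': ['Asia', None]},
    index=['CHN', 'RUS'],
)

# na_rep controls how missing values are written
text = small.to_csv(na_rep='(missing)')

# Without na_values, the placeholder is read back as an ordinary string
as_text = pd.read_csv(io.StringIO(text), index_col=0)

# With na_values, the placeholder is converted back to NaN
as_nan = pd.read_csv(io.StringIO(text), index_col=0, na_values='(missing)')

print(as_text.loc['RUS', 'CONT'])
print(pd.isna(as_nan.loc['RUS', 'CONT']))
```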


In [18]:
new_data_nan.dtypes

COUNTRY     object
POP        float64
AREA       float64
GDP        float64
CONT        object
IND_DAY     object
dtype: object

In [19]:
dtypes={
    'POP': 'float32',
    'AREA': 'float32',
    'GDP': 'float32',
    }
data_types = pd.read_csv('data/data.csv', index_col=0, dtype=dtypes, parse_dates=['IND_DAY'])

In [20]:
data_types.dtypes

COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object

In [21]:
data_types['IND_DAY']

CHN          NaT
IND   1947-08-15
USA   1776-07-04
IDN   1945-08-17
BRA   1822-09-07
PAK   1947-08-14
NGA   1960-10-01
BGD   1971-03-26
RUS   1992-06-12
MEX   1810-09-16
JPN          NaT
DEU          NaT
FRA   1789-07-14
GBR          NaT
ITA          NaT
ARG   1816-07-09
DZA   1962-07-05
CAN   1867-07-01
AUS          NaT
KAZ   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]

In [22]:
data_types.to_csv('formatted-data.csv', date_format='%B %d %Y')

In [23]:
formatted_data = pd.read_csv('formatted-data.csv', index_col=0)
formatted_data

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,August 15 1947
USA,US,329.74,9833.52,19485.39,N.America,July 04 1776
IDN,Indonesia,268.07,1910.93,1015.54,Asia,August 17 1945
BRA,Brazil,210.32,8515.77,2055.51,S.America,September 07 1822
PAK,Pakistan,205.71,881.91,302.14,Asia,August 14 1947
NGA,Nigeria,200.96,923.77,375.77,Africa,October 01 1960
BGD,Bangladesh,167.09,147.57,245.63,Asia,March 26 1971
RUS,Russia,146.79,17098.25,1530.75,,June 12 1992
MEX,Mexico,126.58,1964.38,1158.23,N.America,September 16 1810


**Optional Parameters**:

- `sep` - Value separator
- `decimal` - Decimal separator
- `encoding` - File encoding
- `header` - Column labels (`True`/`False`)

Pass `sep` and `header` to change the value separator and to leave out the column labels:

In [24]:
sh = df.to_csv(sep=';', header=False)
print(sh)

CHN;China;1398.72;9596.96;12234.78;Asia;
IND;India;1351.16;3287.26;2575.67;Asia;1947-08-15
USA;US;329.74;9833.52;19485.39;N.America;1776-07-04
IDN;Indonesia;268.07;1910.93;1015.54;Asia;1945-08-17
BRA;Brazil;210.32;8515.77;2055.51;S.America;1822-09-07
PAK;Pakistan;205.71;881.91;302.14;Asia;1947-08-14
NGA;Nigeria;200.96;923.77;375.77;Africa;1960-10-01
BGD;Bangladesh;167.09;147.57;245.63;Asia;1971-03-26
RUS;Russia;146.79;17098.25;1530.75;;1992-06-12
MEX;Mexico;126.58;1964.38;1158.23;N.America;1810-09-16
JPN;Japan;126.22;377.97;4872.42;Asia;
DEU;Germany;83.02;357.11;3693.2;Europe;
FRA;France;67.02;640.68;2582.49;Europe;1789-07-14
GBR;UK;66.44;242.5;2631.23;Europe;
ITA;Italy;60.36;301.34;1943.84;Europe;
ARG;Argentina;44.94;2780.4;637.49;S.America;1816-07-09
DZA;Algeria;43.38;2381.74;167.56;Africa;1962-07-05
CAN;Canada;37.59;9984.67;1647.12;N.America;1867-07-01
AUS;Australia;25.47;7692.02;1408.68;Oceania;
KAZ;Kazakhstan;18.53;2724.9;159.41;Asia;1991-12-16
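The `decimal` parameter is not shown above, so here is a small sketch of it combined with `sep`. The `small` DataFrame is a hypothetical two-row sample; the point is that the same `sep` and `decimal` values are needed on both the writing and the reading side.

```python
import io
import pandas as pd

# Hypothetical two-row sample
small = pd.DataFrame({'POP': [1398.72, 329.74]}, index=['CHN', 'USA'])

# Write with ';' between values and ',' as the decimal mark
text = small.to_csv(sep=';', decimal=',')
print(text)

# The same two arguments are needed to read the numbers back as floats
restored = pd.read_csv(io.StringIO(text), sep=';', decimal=',', index_col=0)
print(restored.dtypes)
```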



### Working with JSON Files

JSON = **J**ava**S**cript **O**bject **N**otation

By default, `to_json()` saves the data with column labels as keys and inner dictionaries as values.

In [25]:
df.to_json('data-columns.json')

Get a different file structure with `orient`. With `orient='index'`, the row labels become the keys.

In [26]:
df.to_json('data-index.json', orient='index')

With `records`, you get a list with one dictionary for each row, while row labels are not written.

In [27]:
df.to_json('data-records.json', orient='records')

With `orient='split'`, you get one dictionary with separate entries for the column labels, the row labels, and the data.

In [28]:
df.to_json('data-split.json', orient='split')
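To compare the four layouts side by side, the sketch below prints the JSON text that each `orient` value produces for a hypothetical two-row DataFrame (`small`). Omitting the path makes `to_json()` return a string, which is then parsed with the standard `json` module.

```python
import json
import pandas as pd

# Hypothetical two-row sample
small = pd.DataFrame(
    {'COUNTRY': ['China', 'India'], 'POP': [1398.72, 1351.16]},
    index=['CHN', 'IND'],
)

# Collect the parsed JSON produced by each orientation
layouts = {}
for orient in ('columns', 'index', 'records', 'split'):
    json_text = small.to_json(orient=orient)
    layouts[orient] = json.loads(json_text)
    print(orient, json_text)
```

In this sketch, `layouts['records']` is a list without the row labels, while `layouts['split']` holds the keys `columns`, `index`, and `data`.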

`to_json()` **optional parameters**:
- `index=False` - row labels are not saved
- `double_precision` - number of decimal places
- `date_format` - controls date format (`epoch` or `iso`)
- `date_unit` - controls date resolution (**s**econds to **n**ano**s**econds)

Convert `IND_DAY` to datetimes with `pd.to_datetime()` to see how dates are written.

In [29]:
df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
df.dtypes

COUNTRY            object
POP                object
AREA               object
GDP                object
CONT               object
IND_DAY    datetime64[ns]
dtype: object

In [30]:
df.to_json('data-time.json')

Dates are represented as large integers because the default value of the optional parameter `date_format` is `epoch` (in milliseconds) whenever `orient` isn't `table`. If you pass `date_format='iso'`, you'll get the dates in the ISO 8601 format.

In [31]:
df = pd.DataFrame(data=data).T

In [32]:
df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
df.to_json('new_data-time.json', date_format='iso', date_unit='s')
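The difference between the two date formats is easier to see in the JSON text itself. The following sketch uses a hypothetical two-row DataFrame (`dates`) with one missing date and returns strings instead of writing files.

```python
import json
import pandas as pd

# Hypothetical sample: one date and one missing value
dates = pd.DataFrame({'IND_DAY': ['1947-08-15', None]}, index=['IND', 'CHN'])
dates['IND_DAY'] = pd.to_datetime(dates['IND_DAY'])

# 'epoch': milliseconds relative to 1970-01-01
epoch_json = dates.to_json(date_format='epoch')
# 'iso' with date_unit='s': ISO 8601 text with whole-second resolution
iso_json = dates.to_json(date_format='iso', date_unit='s')

print(epoch_json)
print(iso_json)

# Dates before 1970 become negative numbers; missing dates become null
epoch_value = json.loads(epoch_json)['IND_DAY']['IND']
print(epoch_value < 0)
```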

Load the data with `.read_json()`.

In [33]:
data_index = pd.read_json('data-index.json', orient='index', convert_dates=['IND_DAY'])

`orient` tells `read_json()` which layout the file uses, so it should match the `orient` used when the file was written.

In [34]:
data_index

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,NaT
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


`read_json()` **optional parameters**:
- `encoding` - set encoding
- `convert_dates` and `keep_default_dates` to manipulate dates
- `dtype` and `precise_float` to control precision
- `numpy=True` to decode directly to NumPy arrays (deprecated in pandas 1.0 and no longer available in pandas 2.x)

**Reminder** - using `json` to store data may not preserve **row** and **column** order!

### Working with HTML Files

- Extensions - `.html` and `.htm`
- You need to install an HTML parser library, such as `lxml` or `html5lib`


In [35]:
import html5lib

In [36]:
df = pd.DataFrame(data=data).T  # Rebuild the DataFrame from the original dictionary
df.to_html('data.html')

`to_html()` **optional parameters**:
- `header` - controls saving of column names
- `index` - controls saving of row labels
- `classes` - assigns CSS classes
- `render_links` - controls conversion of URLs to HTML links
- `table_id` - assigns CSS ID of table
- `escape` - controls conversion of `<`, `>`, and `&` to HTML-safe strings
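A few of these parameters can be tried without writing a file, because `to_html()` returns the markup as a string when no path or buffer is given. The sketch below uses a hypothetical one-row DataFrame, and the class name `countries` and ID `pop-table` are made up for illustration.

```python
import pandas as pd

# Hypothetical one-row sample
small = pd.DataFrame({'COUNTRY': ['China'], 'POP': [1398.72]}, index=['CHN'])

# Without a path or buffer, to_html() returns the markup as a string
html = small.to_html(index=False, classes='countries', table_id='pop-table')
print(html)
```

The opening `<table>` tag should carry the extra CSS class and the ID, and with `index=False` the row label `CHN` does not appear in the markup.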

In [37]:
df_html = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY'])
df_html

[        COUNTRY      POP      AREA       GDP       CONT    IND_DAY
 CHN       China  1398.72   9596.96  12234.78       Asia        NaT
 IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
 USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
 IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
 BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
 PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
 NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
 BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
 RUS      Russia   146.79  17098.25   1530.75        NaN 1992-06-12
 MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
 JPN       Japan   126.22    377.97   4872.42       Asia        NaT
 DEU     Germany    83.02    357.11   3693.20     Europe        NaT
 FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
 GBR          UK    66.44    242.50   2631.23   

`read_html()` returns a list of DataFrames, one for each table found in the document, which is why the output above starts with a square bracket. Use `df_html[0]` to get the DataFrame itself.

`read_html()` **optional parameters**:
- `parse_dates` - controls interpretation of columns containing dates
- `na_values` - custom NA values
- `encoding` - defines encoding used to decode web page
- `flavor` - parsing engine to use

### Working with Excel Files

You can specify the name of the target worksheet with `sheet_name`.

In [38]:
df.to_excel('data_sheet.xlsx', sheet_name='COUNTRIES')

The optional parameters `startrow` and `startcol` both default to zero and indicate the upper-left cell where the data should start being written in the Excel spreadsheet.

In [39]:
df.to_excel('data-shifted.xlsx', sheet_name='COUNTRIES', startrow=2, startcol=4)

`read_excel()` also has the optional parameter `sheet_name`, with which you can choose which sheets you want to read from a workbook.
- Zero-based index of worksheet - `sheet_name=0`
- Worksheet name - `sheet_name="COUNTRIES"`
- List of indices or names - `sheet_name=[0,1,3]`
- `None` to read all sheets - `sheet_name=None`

In [40]:
x1_index = pd.read_excel('data/data.xlsx', sheet_name=0, index_col=0, parse_dates=['IND_DAY'])
x1_index

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,NaT
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


`read_excel()` **optional parameters**:
- `engine` - `xlrd`, `openpyxl`, `odf`, `pyxlsb`
- `dtype` - specify data types
- `na_values` - specify strings to recognize as `NaN`
- `parse_dates` - control parsing of columns as dates

### Working with SQL Files

In [41]:
from sqlalchemy import create_engine

In [42]:
engine = create_engine('sqlite:///data.db', echo=False)

In [43]:
db_dtypes = {
             'POP': 'float64',
             'AREA': 'float64',
             'GDP': 'float64',
             'IND_DAY': 'datetime64[ns]'  # recent pandas versions require an explicit unit
            }

You can assign multiple data types at once with `astype()` by passing a dictionary that maps column names to types.

In [44]:
db_df = pd.DataFrame(data=data).T.astype(dtype=db_dtypes)
db_df.dtypes

COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

Save the DataFrame to a database with `to_sql()`.

In [49]:
db_df.to_sql('data.db', con=engine, index_label='ID')

20

Use the parameter `con=` to specify the database connection or engine you want to use. `index_label=` sets the name of the database column that holds the row labels; it is common to call it `ID`.

You can use `index=False` to omit row labels.

`to_sql()` **optional parameters**:
- `schema` - specify database schema
- `dtype` - specify data types
- `if_exists` - specify behavior when a table with the same name already exists. **Options**:
    - `fail` - raise a `ValueError` (default)
    - `replace` - drops the existing table and inserts new values
    - `append` - inserts new values to the existing table
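The three `if_exists` options can be tried on a throwaway database. This is a minimal sketch under a few assumptions: it uses an in-memory SQLite database through the standard `sqlite3` module instead of the SQLAlchemy engine above, and the table name `countries` and the helper `count_rows()` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row sample and a throwaway in-memory database
small = pd.DataFrame({'COUNTRY': ['China', 'India']}, index=['CHN', 'IND'])
con = sqlite3.connect(':memory:')

def count_rows():
    """Return the number of rows currently stored in the table."""
    result = pd.read_sql('SELECT COUNT(*) AS n FROM countries', con)
    return int(result['n'].iloc[0])

small.to_sql('countries', con=con, index_label='ID')  # creates the table
small.to_sql('countries', con=con, index_label='ID', if_exists='append')
rows_after_append = count_rows()   # 4: the two rows were added again

small.to_sql('countries', con=con, index_label='ID', if_exists='replace')
rows_after_replace = count_rows()  # 2: the table was dropped and rewritten

try:
    small.to_sql('countries', con=con, index_label='ID')  # default: 'fail'
    refused = False
except ValueError:
    refused = True

print(rows_after_append, rows_after_replace, refused)
con.close()
```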

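In [47]:
from sqlalchemy import text

# In this environment, passing the bare name 'data.db' failed with
# "ObjectNotExecutableError: Not an executable object: 'data.db'".
# Query the table explicitly instead; its name is quoted because it contains a dot.
df_from_db = pd.read_sql(text('SELECT * FROM "data.db"'), con=engine,
                         index_col='ID', parse_dates=['IND_DAY'])
df_from_db

The index of the loaded DataFrame is named `ID`, so the display shows an extra row after the header, starting with `'ID'`. We can fix that with another line of code.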

In [None]:
df_from_db.index.name=None
df_from_db

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,NaT
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


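To fill missing values with `NaN`, use `.fillna()`. Note that in the output below the datetime column `IND_DAY` still shows `NaT`, which is the missing-value marker for datetime data.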

In [None]:
df_from_db.fillna(value=float('nan'), inplace=True)
df_from_db

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,NaT
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


### Working with Pickle Files

- **Pickling** - conversion of a Python object to a byte stream
- **Unpickling** - does the opposite
- Binary files that keep the data and hierarchy of Python objects
- `.pickle` or `.pkl` extension

Save your dataframe with `to_pickle()`.

In [None]:
df.to_pickle('data.pickle')

Get the data from a pickle file with `read_pickle()`.

In [None]:
pkl_df = pd.read_pickle('data.pickle')

In [None]:
pkl_df.dtypes

COUNTRY    object
POP        object
AREA       object
GDP        object
CONT       object
IND_DAY    object
dtype: object
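All columns above are `object` because `df` itself was built without explicit types; pickling stores the DataFrame as it is. As a minimal sketch of that behavior, the example below sets types on a hypothetical one-row DataFrame first and checks that they survive the round trip. It writes to a temporary directory, and the file name `small.pickle` is made up. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import tempfile
import pandas as pd

# Hypothetical one-row sample with explicit dtypes
small = pd.DataFrame({'POP': [1398.72], 'IND_DAY': ['1947-08-15']})
small = small.astype({'POP': 'float32', 'IND_DAY': 'datetime64[ns]'})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'small.pickle')
    small.to_pickle(path)
    restored = pd.read_pickle(path)

# The dtypes come back exactly as they were saved
print(restored.dtypes)
```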

## 3. Working with Big Data

There are many ways to deal with large files:
- Compress the files
- Choose only columns you want
- Omit rows you don't need
- Force the use of less precise dtype
- Split data into chunks
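
Several of these strategies map directly to `read_csv()` parameters. The sketch below is one possible illustration using a hypothetical four-row sample read from memory: `usecols` selects columns, `nrows` limits rows, `dtype` forces a smaller float type, and `chunksize` processes the data in pieces.

```python
import io
import pandas as pd

# Hypothetical four-row sample in the same layout as the data above
csv_text = (
    ",COUNTRY,POP,AREA\n"
    "CHN,China,1398.72,9596.96\n"
    "IND,India,1351.16,3287.26\n"
    "USA,US,329.74,9833.52\n"
    "IDN,Indonesia,268.07,1910.93\n"
)

# Keep only some columns and the first two data rows, with a smaller float type
subset = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    usecols=[0, 1, 2],          # label column, COUNTRY, POP
    nrows=2,
    dtype={'POP': 'float32'},
)

# Process the data two rows at a time instead of loading it whole
total_pop = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2):
    total_pop += chunk['POP'].sum()

print(subset)
print(total_pop)
```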

### Compress and Decompress Files
- `.gz`
- `.bz2`
- `.zip`
- `.xz`

In [None]:
df.to_csv('data.csv.zip')

In [None]:
df_zip = pd.read_csv('data.csv.zip', index_col=0, parse_dates=['IND_DAY'])

In [None]:
df_zip

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,NaT
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


The optional `compression` parameter accepts these **values**:
- `infer` - pandas should deduce the compression type from the file extension
- `gzip`
- `bz2`
- `zip`
- `xz`
- `None`
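
When the file name has no recognizable extension, `infer` has nothing to go on, so the compression type has to be named explicitly. The following sketch shows this with a hypothetical file called `small.data` in a temporary directory; the check of the first two bytes is only there to confirm that the file really is gzip-compressed.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row sample
small = pd.DataFrame({'POP': [1398.72, 1351.16]}, index=['CHN', 'IND'])

with tempfile.TemporaryDirectory() as tmp:
    # No recognizable extension, so name the compression type explicitly
    path = os.path.join(tmp, 'small.data')
    small.to_csv(path, compression='gzip')
    with open(path, 'rb') as handle:
        magic = handle.read(2)  # gzip files start with the bytes 1f 8b
    restored = pd.read_csv(path, index_col=0, compression='gzip')

print(magic == b'\x1f\x8b')
print(restored)
```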

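Writing and reading `pickle` files with compression. The `.compress` extension is not one that pandas can infer a compression type from, so the type is passed explicitly.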

In [None]:
df.to_pickle('data.pickle.compress', compression='gzip')

In [None]:
df_pkl_comp = pd.read_pickle('data.pickle.compress', compression='gzip')

In [None]:
df_pkl_comp

Unnamed: 0,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
