# Ch 1, Ep. 6: Pandas
## I/O: Reading and Writing Data

The most common use of pandas is reading and writing data files. Pandas provides a set of [built-in I/O functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for reading and writing data in many file formats, which makes it a de facto standard tool for converting between them.

Pandas is often used to read data from common web and database export formats such as CSV and JSON files and transform it into **Big Data** storage formats such as **Parquet** and **ORC**, or to load it into systems such as **BigQuery**.

### Reading CSV

Among the pandas [built-in readers](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), you can use `read_csv` to import data from a delimited file.

Data files for this lesson are included under the `data/` folder. The `airports.csv` file contains all domestic (USA) airports:


In [1]:
import pandas as pd

data_dir = '../../../data/input/ch1/'
airports_file = data_dir + 'airports.csv'
airports = pd.read_csv(airports_file, header=0)
print(airports.head())

  iata               airport              city state country        lat  \
0  00M              Thigpen        Bay Springs    MS     USA  31.953765   
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861   
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749   
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347   
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012   

          lon  
0  -89.234505  
1  -95.017928  
2 -104.569893  
3  -78.052081  
4  -81.905944  


The `read_csv` function provides a series of options for parsing CSV files correctly. The `header` option tells pandas which row holds the column names: `header=0` marks the first row of the file (row 0) as the header row.
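To make the effect of `header` concrete, here is a minimal sketch that parses a small in-memory sample instead of `airports.csv`. The sample text and the replacement column names (`code`, `name`, `town`) are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for airports.csv
csv_text = (
    "iata,airport,city\n"
    "00M,Thigpen,Bay Springs\n"
    "00R,Livingston Municipal,Livingston\n"
)

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: every line is treated as data and columns are numbered 0..n-1
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# header=0 together with names= discards the file's names and uses ours
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=["code", "name", "town"])

print(list(with_header.columns))
print(no_header.shape)
print(list(renamed.columns))
```

Note that with `header=None` the header line is counted as a data row, so the resulting frame has one more row than the other two.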

Feel free to set other options:

In [2]:
import pandas as pd

# setting separator and line terminator characters
airports = pd.read_csv(airports_file, header=0, sep=',', lineterminator='\n')

print(airports.head())
# reading only 10 rows and selected columns
airports = pd.read_csv(airports_file, header=0, nrows=10, 
                      usecols=['airport', 'city', 'state'])

print(airports.head())

  iata               airport              city state country        lat  \
0  00M              Thigpen        Bay Springs    MS     USA  31.953765   
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861   
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749   
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347   
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012   

          lon  
0  -89.234505  
1  -95.017928  
2 -104.569893  
3  -78.052081  
4  -81.905944  
                airport              city state
0              Thigpen        Bay Springs    MS
1  Livingston Municipal        Livingston    TX
2           Meadow Lake  Colorado Springs    CO
3          Perry-Warsaw             Perry    NY
4      Hilliard Airpark          Hilliard    FL


For the full list of available `read_csv` options, refer to the online [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).


### Assigning data types

You can set column data types using the `dtype` option:

In [3]:
import pandas as pd
import numpy as np

# using `dtype` to assign particular column data types
# (the keys must match the column names in the file: `lat` and `lon`)
airports = pd.read_csv(airports_file, header=0,
                      dtype={
                          'lat': np.float64,
                          'lon': np.float64
                      })

# print
print(airports.head(10))

  iata               airport              city state country        lat  \
0  00M              Thigpen        Bay Springs    MS     USA  31.953765   
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861   
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749   
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347   
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012   
5  01M     Tishomingo County           Belmont    MS     USA  34.491667   
6  02A           Gragg-Wade            Clanton    AL     USA  32.850487   
7  02C               Capitol        Brookfield    WI     USA  43.087510   
8  02G     Columbiana County    East Liverpool    OH     USA  40.673313   
9  03D      Memphis Memorial           Memphis    MO     USA  40.447259   

          lon  
0  -89.234505  
1  -95.017928  
2 -104.569893  
3  -78.052081  
4  -81.905944  
5  -88.201111  
6  -86.611453  
7  -88.177869  
8  -80.641406  
9  -92.226961 

Data types are typically given as NumPy types. The `dtype` parameter is especially handy because it lets you set the types of specific columns and leave the rest for pandas to infer.
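One case where setting `dtype` for a single column matters is a code column made of digits with leading zeros. The sketch below uses a made-up in-memory sample (the `code` and `elevation` columns are hypothetical and not part of `airports.csv`) to compare inferred types with a partial `dtype` mapping:

```python
import io
import pandas as pd

# Hypothetical sample: `code` looks numeric but is really an identifier
csv_text = (
    "code,lat,elevation\n"
    "007,31.95,351\n"
    "012,30.69,83\n"
)

# Without dtype, pandas infers integers and the leading zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype for one column only, the other columns are still inferred
typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

print(inferred["code"].tolist())
print(typed["code"].tolist())
print(typed.dtypes)
```

Only `code` is pinned to a string type here; `lat` and `elevation` are left for pandas to infer as floating-point and integer columns.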

### Using Converters

The most convenient way to parse special columns and apply business rules to transform fields at ingest is using the `converters` option of `read_csv()`.

You can pass a dedicated function for each field that needs special parsing. In this case we use a function called `pad_iata` to pad the `iata` code to a uniform length. We also show that you can use `lambda` functions as converters:



In [4]:
import pandas as pd
import numpy as np

def pad_iata(value: str) -> str:
    # left-pad the code with zeros to a uniform length of 4 characters
    return value.rjust(4, "0")

# using `converters` to pass functions to parse fields
airports = pd.read_csv(airports_file, header=0,
                      converters={
                          'iata': pad_iata,
                          'lat': (lambda v: int(np.round(float(v)))),
                          'lon': (lambda v: int(np.round(float(v)))),
                      })

# print
print(airports.head(10))

   iata               airport              city state country  lat  lon
0  000M              Thigpen        Bay Springs    MS     USA   32  -89
1  000R  Livingston Municipal        Livingston    TX     USA   31  -95
2  000V           Meadow Lake  Colorado Springs    CO     USA   39 -105
3  001G          Perry-Warsaw             Perry    NY     USA   43  -78
4  001J      Hilliard Airpark          Hilliard    FL     USA   31  -82
5  001M     Tishomingo County           Belmont    MS     USA   34  -88
6  002A           Gragg-Wade            Clanton    AL     USA   33  -87
7  002C               Capitol        Brookfield    WI     USA   43  -88
8  002G     Columbiana County    East Liverpool    OH     USA   41  -81
9  003D      Memphis Memorial           Memphis    MO     USA   40  -92


We highly recommend using converter functions to parse fields and to apply business and cleansing rules at parse time with `read_csv()`.
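As a further illustration of a cleansing rule applied at parse time, the sketch below normalizes messy codes while reading. The sample text and the `clean_code` helper are hypothetical; the point is that a converter receives the raw field as a string and its return value becomes the cell value:

```python
import io
import pandas as pd

# Hypothetical sample with stray whitespace and lower-case codes
csv_text = (
    "iata,city\n"
    " 00m ,Bay Springs\n"
    "1g,Perry\n"
)

def clean_code(raw: str) -> str:
    """Trim whitespace, upper-case, and left-pad with zeros to 4 characters."""
    return raw.strip().upper().rjust(4, "0")

# The converter is called once per field in the `iata` column
sample = pd.read_csv(io.StringIO(csv_text), converters={"iata": clean_code})

print(sample["iata"].tolist())
```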

### Writing Data

Pandas provides a series of [I/O write functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). Refer to the documentation to pick the appropriate function for your use case.

The final cell of this episode writes our airports into both **JSON Row** (one JSON record per line) and **Parquet** formats.


### Renaming and dropping 
We often need to rename or drop columns when dealing with data. We might also want to remove specific rows. The examples below show how to tackle these tasks:

In [5]:
# rename two columns and create a new DataFrame with the results
updated_airports = airports.rename(columns={'airport': 'airport_full_name',
                        'iata': 'iata_code'})

# drop the lat and lon columns from the new DataFrame in place
updated_airports.drop(columns=['lat', 'lon'], inplace=True)

# remove rows - removing rows 0-2 by their label (index)
updated_airports.drop(labels=[0,1,2], inplace=True)

# Check our handiwork
updated_airports.head()

  iata_code  airport_full_name        city state country
3      001G       Perry-Warsaw       Perry    NY     USA
4      001J   Hilliard Airpark    Hilliard    FL     USA
5      001M  Tishomingo County     Belmont    MS     USA
6      002A         Gragg-Wade     Clanton    AL     USA
7      002C            Capitol  Brookfield    WI     USA
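The same renaming and dropping can also be written without `inplace=True`, since `rename` and `drop` return new DataFrames by default. A minimal sketch on a small made-up frame (the sample values are for illustration only):

```python
import pandas as pd

# Hypothetical three-row sample with a subset of the airports columns
sample = pd.DataFrame({
    "iata": ["00M", "00R", "00V"],
    "airport": ["Thigpen", "Livingston Municipal", "Meadow Lake"],
    "lat": [31.95, 30.69, 38.95],
})

# Each call returns a new DataFrame, so the steps can be chained
trimmed = (
    sample
    .rename(columns={"airport": "airport_full_name"})  # rename a column
    .drop(columns=["lat"])                             # drop a column
    .drop(index=[0])                                   # drop a row by label
)

print(trimmed)
print(list(sample.columns))  # the original frame is unchanged
```

Chaining keeps the original DataFrame intact, which can make it easier to rerun a notebook cell without side effects.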


### reset_index()
Another common operation is resetting the index of a DataFrame. This data already uses a simple `RangeIndex`, but we can still call `reset_index()` to create a new one. `reset_index()` can run in place, and the old index can either be kept as a column or dropped:

In [11]:
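updated_airports = airports.copy()
# reset in place; the old index is kept as a new 'index' column
updated_airports.reset_index(inplace=True)
print(updated_airports.head())
# remove the 'index' column added above
updated_airports.drop(columns=['index'], inplace=True)
# reset again, this time discarding the old index
updated_airports.reset_index(inplace=True, drop=True)
print(updated_airports.head())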

   index  iata               airport              city state country  lat  lon
0      0  000M              Thigpen        Bay Springs    MS     USA   32  -89
1      1  000R  Livingston Municipal        Livingston    TX     USA   31  -95
2      2  000V           Meadow Lake  Colorado Springs    CO     USA   39 -105
3      3  001G          Perry-Warsaw             Perry    NY     USA   43  -78
4      4  001J      Hilliard Airpark          Hilliard    FL     USA   31  -82
   iata               airport              city state country  lat  lon
0  000M              Thigpen        Bay Springs    MS     USA   32  -89
1  000R  Livingston Municipal        Livingston    TX     USA   31  -95
2  000V           Meadow Lake  Colorado Springs    CO     USA   39 -105
3  001G          Perry-Warsaw             Perry    NY     USA   43  -78
4  001J      Hilliard Airpark          Hilliard    FL     USA   31  -82
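Resetting the index is most useful after rows have been removed, because dropping rows leaves gaps in the labels. The sketch below shows this on a small made-up frame (the sample values are for illustration only):

```python
import pandas as pd

# Hypothetical four-row sample
sample = pd.DataFrame({"iata": ["00M", "00R", "00V", "01G"]})

# Dropping rows keeps the surviving labels, leaving gaps: 1, 3
remaining = sample.drop(index=[0, 2])

# drop=True discards the old labels and renumbers from 0
renumbered = remaining.reset_index(drop=True)

# Without drop=True the old labels are preserved in an 'index' column
kept = remaining.reset_index()

print(list(remaining.index))
print(list(renumbered.index))
print(kept)
```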


In [5]:
import pandas as pd

# read csv
airports = pd.read_csv(airports_file, header=0)

# write json row format (one JSON record per line)
airports.to_json(data_dir + 'airports.json', orient='records', lines=True)
# write compressed parquet format
airports.to_parquet(data_dir + 'airports.parquet', engine='pyarrow', 
                   compression='gzip', index=False)

print('done.')

done.
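To check what the JSON row output looks like, you can write to a string instead of a file and read it back. This is a minimal sketch on a made-up two-row frame; it stays in memory and skips Parquet so that it does not depend on `pyarrow` being installed:

```python
import io
import pandas as pd

# Hypothetical two-row sample with a subset of the airports columns
sample = pd.DataFrame({
    "iata": ["00M", "00R"],
    "city": ["Bay Springs", "Livingston"],
    "lat": [31.953765, 30.685861],
})

# With no path argument, to_json returns the serialized text
json_lines = sample.to_json(orient="records", lines=True)
print(json_lines)

# lines=True tells read_json to parse one JSON record per line
restored = pd.read_json(io.StringIO(json_lines), lines=True)
print(restored)
```

Reading the data back is a quick sanity check that the column names and values survived the conversion.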
