## Transformations

When working with data, you will often need to transform it to make it more consistent, more usable, or easier to understand. Pandas provides several methods that make this process easier.

### `map()`: Single-Column Transformations

Pandas offers several methods for applying transformation rules to columns. The simplest is `Series.map()`, which transforms each value within a single column.

In our airline data, each airline is represented by a short code such as 'AA'. There's nothing wrong with that, but the codes aren't very readable. What if we want to map those short codes to full airline names so that the data is easier to understand? `map()` makes this sort of thing easy.

In [1]:
import pandas as pd

# read data from csv
data_dir = '../../../data/input/ch1/'
routes_file = data_dir + 'routes.csv'
routes = pd.read_csv(routes_file, header=0)

def decode_airline(value: str) -> str:
    mapper = {
        'AA': 'American Airlines', 
        'AS': 'Alaska Airlines', 
        'DL': 'Delta Air Lines',
        'UA': 'United Airlines', 
        'WN': 'Southwest Airlines',
    }
    if value in mapper:
        return mapper[value]
    else:
        return 'Other'

# decode airline names and assign to a new column
routes['airline_name'] = routes.airline.map(decode_airline)

# print decoded flights
routes.loc[routes.airline_name != 'Other', ['airline', 'airline_name', 'src', 'dest', 'stops']]


Unnamed: 0,airline,airline_name,src,dest,stops
4654,AA,American Airlines,ABE,CLT,0
4655,AA,American Airlines,ABE,PHL,0
4656,AA,American Airlines,ABI,DFW,0
4657,AA,American Airlines,ABQ,DFW,0
4658,AA,American Airlines,ABQ,LAX,0
...,...,...,...,...,...
64556,WN,Southwest Airlines,TUS,DEN,0
64557,WN,Southwest Airlines,TUS,LAS,0
64558,WN,Southwest Airlines,TUS,LAX,0
64559,WN,Southwest Airlines,TUS,MDW,0


Notice that our mapper function simply uses a dictionary, with the airline short codes as keys and the full names as values. That makes it easy to get the full airline name from a short code by indexing the dictionary with it (`mapper[value]`). The last line above uses `.loc` to keep only the records that were mapped to an airline name, filtering out those labeled 'Other'.
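
As a possible alternative, `Series.map()` also accepts a dictionary directly, so the wrapper function is not strictly required. The sketch below uses a small made-up frame (the `sample` data is an assumption, not the real `routes.csv`). Codes missing from the dictionary come back as `NaN`, which `fillna()` then replaces with 'Other'.

```python
import pandas as pd

# hypothetical sample standing in for the airline column of routes.csv
sample = pd.DataFrame({'airline': ['AA', '2B', 'WN', 'ZM']})

airline_names = {
    'AA': 'American Airlines',
    'AS': 'Alaska Airlines',
    'DL': 'Delta Air Lines',
    'UA': 'United Airlines',
    'WN': 'Southwest Airlines',
}

# map() looks each value up in the dict; unmatched codes become NaN
sample['airline_name'] = sample['airline'].map(airline_names).fillna('Other')
print(sample)
```

The function-based version remains useful when the lookup needs extra logic beyond a plain dictionary match.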

Let's practice more to get familiar with using `map()` effectively. The following lambda converts each value in the `stops` column to a floating-point number:

In [2]:
# use a lambda function with map()
routes.stops = routes.stops.map(lambda v: float(v))

print(routes.head(5))

  airline  src dest codeshare  stops equipment airline_name
0      2B  ASF  KZN       NaN    0.0       CR2        Other
1      2B  ASF  MRV       NaN    0.0       CR2        Other
2      2B  CEK  KZN       NaN    0.0       CR2        Other
3      2B  CEK  OVB       NaN    0.0       CR2        Other
4      2B  DME  KZN       NaN    0.0       CR2        Other
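
For a plain type conversion like this one, a vectorized call is usually preferable to invoking a Python function once per element. A minimal sketch, using a made-up `stops` column rather than the real data:

```python
import pandas as pd

# hypothetical sample standing in for the stops column of routes.csv
sample = pd.DataFrame({'stops': [0, 1, 0]})

# element-wise conversion with map(), as in the cell above
via_map = sample['stops'].map(lambda v: float(v))

# vectorized conversion of the whole column at once
via_astype = sample['stops'].astype(float)

print(via_astype.dtype)            # float64
print(via_map.equals(via_astype))  # True
```

Both approaches give the same result here; `map()` earns its keep when the per-value logic cannot be expressed as a built-in conversion.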


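### `applymap()`: Element-Wise Transformations

The `map()` method we have been using belongs to `Series` objects. `DataFrame` has a similar method, `applymap()`, which applies a function to every element. Note that pandas 2.1 deprecated `DataFrame.applymap()` in favor of the equivalent `DataFrame.map()`. The behavior is similar to `Series.map()`, but the input and output are `DataFrame`s instead of `Series`: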

In [20]:
# read data from csv
data_dir = '../../../data/input/ch1/'
routes_file = data_dir + 'routes.csv'
routes = pd.read_csv(routes_file, header=0)

# use applymap with our decode_airline function
routes[['airline_name']] = routes[['airline']].applymap(decode_airline)
routes[['stops']] = routes[['stops']].applymap(lambda v: float(v))
# print decoded flights
routes.loc[routes.airline_name != 'Other', ['airline', 'airline_name', 'src', 'dest', 'stops']]


Unnamed: 0,airline,airline_name,src,dest,stops
4654,AA,American Airlines,ABE,CLT,0.0
4655,AA,American Airlines,ABE,PHL,0.0
4656,AA,American Airlines,ABI,DFW,0.0
4657,AA,American Airlines,ABQ,DFW,0.0
4658,AA,American Airlines,ABQ,LAX,0.0
...,...,...,...,...,...
64556,WN,Southwest Airlines,TUS,DEN,0.0
64557,WN,Southwest Airlines,TUS,LAS,0.0
64558,WN,Southwest Airlines,TUS,LAX,0.0
64559,WN,Southwest Airlines,TUS,MDW,0.0
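
Because `DataFrame.applymap()` was deprecated in pandas 2.1 in favor of `DataFrame.map()`, code that must run on both older and newer versions can pick whichever method is available. The sketch below shows one way to do that, with a tiny made-up frame standing in for `routes`. The `elementwise` helper is hypothetical and not part of pandas.

```python
import pandas as pd

def elementwise(frame, func):
    """Apply func to every element, using DataFrame.map() when it exists."""
    if hasattr(pd.DataFrame, 'map'):      # pandas 2.1 and later
        return frame.map(func)
    return frame.applymap(func)           # older pandas versions

# hypothetical sample standing in for two columns of routes.csv
sample = pd.DataFrame({'src': ['ABE', 'TUS'], 'dest': ['CLT', 'DEN']})

# lowercase every cell of the frame
lowered = elementwise(sample, str.lower)
print(lowered)
```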


### `apply()`: Multi-Column Transformations

While the `.map()` method transforms a single column, the pandas `DataFrame.apply()` method can transform values across multiple columns. Use `.apply()` when a transformation needs more than one column within a row.

In the following example, we create an `encode_route_key()` function that concatenates the `airline`, `src`, and `dest` fields to create a unique route key for each row:

In [3]:
import pandas as pd

# read data from csv
data_dir = '../../../data/input/ch1/'
routes_file = data_dir + 'routes.csv'
routes = pd.read_csv(routes_file, header=0)

def encode_route_key(row):
    # a dataframe row is passed. access columns with row.column_name
    route_key = f"{row.airline}-{row.src}-{row.dest}"

    return route_key

# apply a function over entire row values
# set axis=1 to apply function over rows. axis=0 would apply over columns
routes['route_key'] = routes.apply(encode_route_key, axis=1)
routes['route_key']

0        2B-ASF-KZN
1        2B-ASF-MRV
2        2B-CEK-KZN
3        2B-CEK-OVB
4        2B-DME-KZN
            ...    
67657    ZL-WYA-ADL
67658    ZM-DME-FRU
67659    ZM-FRU-DME
67660    ZM-FRU-OSS
67661    ZM-OSS-FRU
Name: route_key, Length: 67662, dtype: object

Pay attention to the `axis=1` argument to `apply()`, which directs pandas to call the function once per row, passing each row as a `Series`. The default, `axis=0`, calls the function once per column instead. Please refer to the [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) documentation for more information.
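
To make the two directions concrete, here is a small sketch on a made-up numeric frame (the `df` data is an assumption chosen only for illustration):

```python
import pandas as pd

# hypothetical numeric frame, used only to show the two directions
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_totals = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(column_totals.tolist())  # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
```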

Pandas passes the row as the first argument to the applied function. If your function requires additional positional arguments, supply them as a tuple through the `args` parameter. For example:

In [8]:
import pandas as pd

# read data from csv
data_dir = '../../../data/input/ch1/'
routes_file = data_dir + 'routes.csv'
routes = pd.read_csv(routes_file, header=0)

# passing more parameters to apply function by position

def encode_route_key(row, key_type):
    # a dataframe row is passed. access columns with row.column_name
    if key_type == "short":
        route_key = f"{row.airline}-{row.src}-{row.dest}"
    else:
        route_key = f"{row.airline}-{row.src}-{row.dest}-{row.stops}-{row.equipment}"
    return route_key

# apply a function over entire row values
# set axis=1 to apply function over rows. axis=0 would apply over columns
routes['route_key'] = routes.apply(encode_route_key, axis=1, args=("short",))
routes['route_key_long'] = routes.apply(encode_route_key, axis=1, args=("long",))
routes[['route_key', 'route_key_long']]

Unnamed: 0,route_key,route_key_long
0,2B-ASF-KZN,2B-ASF-KZN-0-CR2
1,2B-ASF-MRV,2B-ASF-MRV-0-CR2
2,2B-CEK-KZN,2B-CEK-KZN-0-CR2
3,2B-CEK-OVB,2B-CEK-OVB-0-CR2
4,2B-DME-KZN,2B-DME-KZN-0-CR2
...,...,...
67657,ZL-WYA-ADL,ZL-WYA-ADL-0-SF3
67658,ZM-DME-FRU,ZM-DME-FRU-0-734
67659,ZM-FRU-DME,ZM-FRU-DME-0-734
67660,ZM-FRU-OSS,ZM-FRU-OSS-0-734
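
Extra arguments can also be passed by keyword, since `DataFrame.apply()` forwards any additional keyword arguments to the function. A small sketch, with a made-up two-row frame in place of `routes.csv`:

```python
import pandas as pd

def encode_route_key(row, key_type):
    # same logic as above: short keys use three fields, long keys use five
    if key_type == "short":
        return f"{row.airline}-{row.src}-{row.dest}"
    return f"{row.airline}-{row.src}-{row.dest}-{row.stops}-{row.equipment}"

# hypothetical sample rows standing in for routes.csv
sample = pd.DataFrame({
    'airline': ['2B', 'ZM'],
    'src': ['ASF', 'OSS'],
    'dest': ['KZN', 'FRU'],
    'stops': [0, 0],
    'equipment': ['CR2', '734'],
})

# key_type is not a parameter of apply(), so it is forwarded to the function
sample['route_key'] = sample.apply(encode_route_key, axis=1, key_type="short")
sample['route_key_long'] = sample.apply(encode_route_key, axis=1, key_type="long")
print(sample[['route_key', 'route_key_long']])
```

Keyword arguments tend to read more clearly than a positional tuple when the function takes several extra parameters.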


### Complex Transformations: Multiple Output Columns

The example below applies a function over multiple columns and produces multiple new columns in a `DataFrame`.

We will produce two new boolean columns: `nonstop`, which is true when a route has no stops, and `loop`, which is true when `src` is also the `dest`. First, let's check whether any such loop routes exist:


In [11]:
routes[routes.src == routes.dest]

Unnamed: 0,airline,src,dest,codeshare,stops,equipment,route_key,route_key_long
33275,IL,PKN,PKN,,0,AT7,IL-PKN-PKN,IL-PKN-PKN-0-AT7


In [14]:
import pandas as pd

# read data from csv
data_dir = '../../../data/input/ch1/'
routes_file = data_dir + 'routes.csv'
routes = pd.read_csv(routes_file, header=0)

def encode_route_type(row):
    # nonstop: stops = 0
    # loop: src == dest
    nonstop = row.stops == 0
    loops = row.src == row.dest
    # return as a tuple
    return (nonstop, loops)


# apply a function over row values and
# unpack multiple return column values by using zip()
routes['nonstop'], routes['loop'] = zip(*routes.apply(
    encode_route_type, axis=1
))

# print the nonstop routes, then the loop routes
print(routes.loc[routes.nonstop])
print(routes.loc[routes.loop])


      airline  src dest codeshare  stops equipment  nonstop   loop
0          2B  ASF  KZN       NaN      0       CR2     True  False
1          2B  ASF  MRV       NaN      0       CR2     True  False
2          2B  CEK  KZN       NaN      0       CR2     True  False
3          2B  CEK  OVB       NaN      0       CR2     True  False
4          2B  DME  KZN       NaN      0       CR2     True  False
...       ...  ...  ...       ...    ...       ...      ...    ...
67657      ZL  WYA  ADL       NaN      0       SF3     True  False
67658      ZM  DME  FRU       NaN      0       734     True  False
67659      ZM  FRU  DME       NaN      0       734     True  False
67660      ZM  FRU  OSS       NaN      0       734     True  False
67661      ZM  OSS  FRU       NaN      0       734     True  False

[67651 rows x 8 columns]
      airline  src dest codeshare  stops equipment  nonstop  loop
33275      IL  PKN  PKN       NaN      0       AT7     True  True
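
As an alternative to unpacking with `zip()`, `apply()` can expand the returned tuples into columns itself via `result_type="expand"`. The sketch below assumes a small made-up frame instead of `routes.csv` (the row with one stop is invented for illustration). It also computes the same flags with vectorized column comparisons, which avoid calling a Python function per row.

```python
import pandas as pd

def encode_route_type(row):
    # nonstop: stops == 0; loop: src == dest
    return (row.stops == 0, row.src == row.dest)

# hypothetical sample rows standing in for routes.csv
sample = pd.DataFrame({
    'src': ['ASF', 'PKN', 'CEK'],
    'dest': ['KZN', 'PKN', 'OVB'],
    'stops': [0, 0, 1],
})

# result_type="expand" turns each returned tuple into columns 0 and 1
flags = sample.apply(encode_route_type, axis=1, result_type="expand")
flags.columns = ['nonstop', 'loop']
sample = sample.join(flags)

# the same flags computed with vectorized column comparisons
nonstop_vec = sample['stops'] == 0
loop_vec = sample['src'] == sample['dest']

print(sample)
```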


### Further Reading
- [Pandas map() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)
- [Pandas applymap() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html)
- [Pandas apply() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
- [Difference between map(), applymap(), and apply() methods](https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas)