# Chapter 4 Notes

### Database-style operations in Pandas
SQL-like operations can be performed in Pandas

#### Querying
- The `query()` method can be used to write filters instead of using a Boolean mask
  - The syntax is similar to the `WHERE` clause
  - The result is the same as with a Boolean mask, but the dataframe name isn't repeated for every condition, which is especially useful with long dataframe names
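
As a minimal sketch of the comparison above, the following filters the same dataframe with a Boolean mask and with `query()`. The dataframe, its column names, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
long_dataframe_name = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp": [12.0, 25.5, 31.0, 18.2],
    "rain": [0.0, 3.1, 0.0, 7.4],
})

# Boolean mask: the dataframe name is repeated for every condition
masked = long_dataframe_name[
    (long_dataframe_name["temp"] > 15) & (long_dataframe_name["rain"] > 0)
]

# query(): columns are referenced by name, much like a WHERE clause
queried = long_dataframe_name.query("temp > 15 and rain > 0")
```

Both approaches select the same rows; the choice is mostly about readability.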

#### Merging
- There are four types of joins: full (outer), left, right, and inner
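- The previously discussed `pd.concat()` function and `append()` method mimic the SQL `UNION ALL` statement; dropping duplicates afterwards mimics `UNION` (note that `append()` was removed in pandas 2.0 in favor of `pd.concat()`)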
- Inner joins:
  - Returns the rows where both dataframes match on the specified key column, with the columns from both dataframes
  - This can be performed with the `merge()` method
    - The dataframe that the `merge()` method is called on is the left dataframe and the right is the dataframe passed to the method
    - If the key columns have different names in the two dataframes, the columns to match on can be specified with `left_on` and `right_on`
- Left and right joins:
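  - Can also be performed with the `merge()` method using the `how` parameter (`how='left'` or `how='right'`)
  - A right join is the mirror image of a left join: swapping the two dataframes and using `how='left'` keeps the same rows
- Outer joins:
  - Can also be performed with the `merge()` method and `how='outer'`
- If joining on the index, the `join()` method is easier
  - By default, `join()` matches on the index of both dataframes; the `on` parameter selects a column of the left dataframe to match against the index of the right dataframe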
- The `intersection()` index method can be used to preview how many keys an inner join will match without performing the join, which helps when the dataframes are large and the join would consume a lot of memory
  - The `difference()` method reports the values in the first index that aren't in the second index
    - This tells you which records a left or right join would keep but an inner join would lose
  - The `symmetric_difference()` method reports what's lost on both sides
- The `union()` method can be used to report the values kept in a full outer join
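
The join types and index set operations above can be sketched as follows. The two tables and their column names are hypothetical, and the key counts only equal the joined row counts because the keys in this example are unique.

```python
import pandas as pd

# Hypothetical tables, invented for illustration
stations = pd.DataFrame({"id": ["s1", "s2", "s3"], "name": ["North", "South", "East"]})
readings = pd.DataFrame({"station": ["s2", "s3", "s4"], "temp": [20.1, 18.4, 22.0]})

# Preview the join sizes from the key values alone
left_keys = pd.Index(stations["id"])
right_keys = pd.Index(readings["station"])
n_inner = len(left_keys.intersection(right_keys))         # keys on both sides
n_left_only = len(left_keys.difference(right_keys))       # keys only on the left
n_lost = len(left_keys.symmetric_difference(right_keys))  # keys on exactly one side
n_outer = len(left_keys.union(right_keys))                # all distinct keys

# The key columns have different names, so left_on/right_on are needed
inner = stations.merge(readings, left_on="id", right_on="station")
left = stations.merge(readings, left_on="id", right_on="station", how="left")
outer = stations.merge(readings, left_on="id", right_on="station", how="outer")

# join() matches on the index of both dataframes by default
joined = stations.set_index("id").join(readings.set_index("station"), how="inner")
```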

### Enriching data

#### Arithmetic and statistics
- Pandas methods for calculating statistics and performing math operations on a dataframe work on columns by default, but can be applied to rows using the `axis` parameter
  - `sub()`, `div()`, and `mul()` are examples of math operator methods for subtraction, division, and multiplication respectively, while `std()` calculates the standard deviation
- Two other useful methods are `rank()` and `pct_change()`
  - `rank()` ranks the values of a column
    - `rank()` can be used to calculate a numerical rank, starting at 1, or a percentile rank (`pct=True`), topping out at 1.0
  - `pct_change()` calculates the percent change between periods
- The `any()` and `all()` methods can be called on a Boolean mask to report, for each column, whether any or all of its values satisfy the mask condition, respectively
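
A short sketch of the methods in this section, using a small invented dataframe. The column names and values are assumptions made only for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [10.0, 20.0, 40.0]})

# Standardize each column: subtract the column mean, divide by the column std
z_scores = df.sub(df.mean()).div(df.std())

# axis=0 aligns the Series with the row labels, so each row is divided by its own sum
row_share = df.div(df.sum(axis=1), axis=0)

ranks = df["a"].rank(ascending=False)  # rank 1 goes to the largest value
pct_ranks = df["a"].rank(pct=True)     # ranks scaled so the top rank is 1.0
changes = df["a"].pct_change()         # (current - previous) / previous

mask = df > 3
any_hits = mask.any()  # per column: is at least one value True?
all_hits = mask.all()  # per column: are all values True?
```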

#### Binning
- Binning helps break down continuous data into discrete groups
  - This is sometimes easier to study, but some information is lost due to the reduced granularity
- The `pd.cut()` function can be used to bin based on value
  - By default the bins are labeled with their value intervals. Optionally, custom labels can be applied to the bins.
  - When given a number of bins, `pd.cut()` creates equal-width bins over the range of the data
- The `pd.qcut()` function bins by quantile, so each bin holds roughly the same number of observations
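
To illustrate the difference between the two binning approaches, here is a minimal sketch on invented values with one large outlier. The labels are arbitrary names chosen for the example.

```python
import pandas as pd

# Hypothetical values with one outlier, invented for illustration
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): three equal-width bins spanning the range of the data,
# so the outlier leaves most observations in the first bin
by_value = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# qcut(): four quantile-based bins with roughly equal counts
by_quantile = pd.qcut(values, q=4, labels=["q1", "q2", "q3", "q4"])
```

Equal-width bins are sensitive to outliers, whereas quantile-based bins trade interpretable edges for balanced group sizes.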

#### Functions
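- The `apply()` method can be used to apply functions to columns in a dataframe
  - This method runs the function on entire columns (the default) or rows, so vectorized operations inside the function act on the whole column or row at once
  - `applymap()` applies a function that isn't vectorized to each element individually (it is deprecated since pandas 2.1 in favor of `DataFrame.map()`); alternatively, `np.vectorize()` can be used
  - Pandas has some methods that iterate over dataframes, such as `iterrows()` and `itertuples()`, but computation time grows with the row count, so they are not recommended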
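
A minimal sketch of `apply()` and `np.vectorize()`, assuming a small invented dataframe and a hypothetical scalar-only helper named `parity`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply() calls the function once per column (axis=0, the default) ...
col_range = df.apply(lambda col: col.max() - col.min())
# ... or once per row with axis=1
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

def parity(n):
    """Hypothetical scalar-only function: the `if` cannot take a whole Series."""
    return "even" if n % 2 == 0 else "odd"

# np.vectorize() wraps the scalar function so it accepts arrays;
# it loops internally, so it is a convenience rather than a speed-up
labels = np.vectorize(parity)(df["a"].to_numpy())
```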

#### Window calculations
- These are calculations applied over a window or range of rows/columns
- Rolling windows, or sliding windows, can be specified as a time period if the index is of a datetime type or if a datetime column is specified with the `on` parameter
  - The `rolling()` method returns a `Rolling` (or `Window`) object, to which aggregate functions can be applied
  - The `agg()` method seen before can be used to specify functions for individual columns
- Expanding windows, or growing windows, calculate cumulative values of the aggregate function
  - The `expanding()` method is used to generate an `Expanding` object
  - Like `rolling()`, column-specific aggregation can be applied with the `agg()` method
- Finally, exponentially weighted moving windows can be generated with the `ewm()` method
  - This window can be used to smooth data, placing higher importance on more recent observations
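
The three window types can be sketched on a small daily series. The data, the 3-day window, and the `span=3` setting are assumptions chosen only to keep the example readable.

```python
import pandas as pd

# Hypothetical daily data, invented for illustration
idx = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0, 5.0], "volume": [10, 20, 30, 40, 50]},
    index=idx,
)

# Time-based rolling window: each row looks back over the last 3 days
rolling = df.rolling("3D").agg({"price": "mean", "volume": "sum"})

# Expanding window: each row aggregates all rows up to and including itself
expanding = df.expanding().agg({"price": "max", "volume": "sum"})

# Exponentially weighted mean: recent rows receive larger weights
smoothed = df["price"].ewm(span=3).mean()
```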

#### Pipes
- Pipes facilitate chaining together operations that expect Pandas data structures as their first argument
- These are useful for building complex workflows with easier-to-read code
- Pipes are created using the `pipe()` method
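
As a sketch of why pipes help readability, the example below chains two hypothetical helper functions (`add_total` and `keep_above`, invented here) both as nested calls and with `pipe()`.

```python
import pandas as pd

def add_total(df, cols):
    """Hypothetical step: add a column holding the row-wise sum of `cols`."""
    return df.assign(total=df[cols].sum(axis=1))

def keep_above(df, col, threshold):
    """Hypothetical step: keep rows where `col` exceeds `threshold`."""
    return df[df[col] > threshold]

# Hypothetical data, invented for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Nested calls read inside-out ...
nested = keep_above(add_total(df, ["a", "b"]), "total", 6)

# ... while pipe() reads in order; the dataframe is passed as the first argument
piped = df.pipe(add_total, ["a", "b"]).pipe(keep_above, "total", 6)
```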

### Aggregating data
- Aggregation can be used to summarize dataframes, often through row reduction
- Numpy has many functions that work well with aggregation
- The `agg()` method called directly on a dataframe with one function per column returns a series with the results
- Multiple functions can be called on a column, returning a dataframe object
  - Nulls are returned for any combination of aggregation and column not explicitly asked for
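
A minimal sketch of the two return shapes described above, using an invented dataframe and string function names.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 9.0]})

# One function per column: the result is a Series indexed by column name
one_each = df.agg({"a": "sum", "b": "mean"})

# Lists of functions: the result is a dataframe with one row per function;
# combinations that were not requested (such as the mean of "a") are NaN
several = df.agg({"a": ["sum", "max"], "b": ["mean"]})
```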

#### Grouping
- Grouping can be used with aggregation to summarize per group
- The `groupby()` method is used to perform the grouping
  - Aggregation methods can be called directly on the object returned by `groupby()` or passed through the `agg()` method
  - Further refinement of how each column is aggregated can be done by passing a dictionary with the columns and aggregating functions into `agg()`
    - Passing multiple functions for a column results in columns with a hierarchical (multi-level) index
      - List comprehensions can be used to remove the hierarchy
    - The `level` parameter can be passed to `groupby()` to group on a specific level of a hierarchical index
- Grouping can be performed on multiple categories at once
  - A `Grouper` object can be passed if, for instance, grouping is performed on the date index by quarter, as described in the book
- The `transform()` method, introduced in this section, applies a function to the data and returns an object with the same dimensions as the input
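
The grouping steps above can be sketched as follows. The store data and the flattened column naming scheme (`sales_min` and so on) are assumptions for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "store": ["x", "x", "y", "y"],
    "sales": [10, 30, 5, 15],
    "units": [1, 3, 1, 2],
})

# Different functions per column; the result has two-level column labels
summary = df.groupby("store").agg({"sales": ["min", "max"], "units": ["sum"]})

# Flatten ("sales", "min") into "sales_min" with a list comprehension
summary.columns = ["_".join(col) for col in summary.columns]

# transform() returns one value per input row, so the shape is preserved
df["store_mean"] = df.groupby("store")["sales"].transform("mean")
```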

#### Pivoting and crosstabs
- A pivot table can be created using the `pivot_table()` method, specifying what to group on, the subset of columns to aggregate (optional), and the aggregating function(s) (optional)
  - Passing columns to the `columns` argument or the `index` argument dictates the orientation of the output
  - Multiple values can be passed to the `index` argument
- The `crosstab()` function can be used to create a frequency table
  - Syntax is similar to the pivot table, where the `index` and `columns` parameters must be specified
  - The output can be normalized to fractions of the total by passing the `normalize` parameter
  - An aggregating function can be specified using the `aggfunc` parameter together with `values`; by default the occurrences are counted
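
A brief sketch of `pivot_table()` and `pd.crosstab()` side by side, on invented sales data.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "product": ["p1", "p2", "p1", "p1", "p2"],
    "sales": [10, 20, 30, 40, 50],
})

# Rows come from `index`, columns from `columns`; aggfunc defaults to the mean
pivot = df.pivot_table(index="region", columns="product", values="sales", aggfunc="sum")

# Frequency table: how often each region/product combination occurs
counts = pd.crosstab(index=df["region"], columns=df["product"])

# normalize="index" turns each row into fractions that sum to 1
shares = pd.crosstab(index=df["region"], columns=df["product"], normalize="index")

# Passing values together with aggfunc aggregates instead of counting
totals = pd.crosstab(
    index=df["region"], columns=df["product"], values=df["sales"], aggfunc="sum"
)
```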

### Time Series
- Working with a time series opens up additional operations on top of those previously discussed

#### Time-based selection and filter
- Data can be isolated by specific time periods using the `loc[]` indexer, for example isolating by year
  - The `loc[]` indexer is optional when slicing by ranges; simple indexing can be used instead
    - Ranges are inclusive of end dates
  - Other time periods, like months and quarters, can be used
- The first period of dates can be selected with the `first()` method, and the last period of dates with `last()` (both are deprecated since pandas 2.1)
  - Similar to `first()` and `last()`, `first_valid_index()` and `last_valid_index()` can be used to find the indices of the first and last non-null entries
    - The `asof()` method gives the last non-null entry at or before the date specified
- Selection based on the time of day can be performed using the `at_time()` and `between_time()` methods
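
The selection methods above can be sketched on a small series with a datetime index. The timestamps and values are invented, and the gaps (`NaN`) are placed deliberately to show the non-null lookups.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two readings per day, invented for illustration
idx = pd.to_datetime([
    "2023-12-30 09:00", "2023-12-30 21:00",
    "2023-12-31 09:00", "2023-12-31 21:00",
    "2024-01-01 09:00", "2024-01-01 21:00",
])
s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4.0, 5.0], index=idx)

year_2024 = s.loc["2024"]                     # partial string selects the whole year
dec_range = s.loc["2023-12-30":"2023-12-31"]  # the end date is included
mornings = s.at_time("09:00")                 # rows at exactly 09:00
working = s.between_time("08:00", "17:00")    # rows within a time-of-day range

first_valid = s.first_valid_index()           # label of the first non-null value
# Last non-null value at or before the given timestamp
last_known = s.asof(pd.Timestamp("2023-12-31 23:00"))
```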

#### Shifting for lagged data
- Data can be shifted using the `shift()` method
  - The default is a shift by one period, but this can be changed

#### Differenced data
- Calculating how values change from one period to the next can be performed using the `diff()` method
  - The `diff()` method calculates the difference between the current time period and one time period back
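
A minimal sketch relating `shift()` and `diff()`, using invented closing prices.

```python
import pandas as pd

# Hypothetical daily closing prices, invented for illustration
idx = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.Series([100.0, 102.0, 101.0, 105.0], index=idx)

previous = close.shift()   # the prior day's value aligned with the current day
two_back = close.shift(2)  # shift by two periods instead of one

# diff() gives the same result as subtracting the shifted series
daily_change = close.diff()
manual_change = close - close.shift()
```

The first entries are `NaN` because there is no earlier period to compare against.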

#### Resampling
- Sometimes data isn't in the correct granularity and must be resampled
- The `resample()` method can be used to perform this resampling
  - This is used before an optional `agg()` call
  - The method itself returns a `Resampler` object
- Downsampling reduces the granularity of the data; upsampling does the opposite
  - Both can be performed with the `resample()` method
  - Calling the `asfreq()` method after the resampling changes the frequency without aggregating
  - Upsampling can lead to `NaN` values if data isn't available to fill the new time periods
    - There are a variety of ways to fill the `NaN` values, such as padding (forward filling) with `ffill()` or back filling with `bfill()`
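
Downsampling and upsampling can be sketched on a short daily series. The values and the two-day bucket size are assumptions for the example.

```python
import pandas as pd

# Hypothetical daily data, invented for illustration
idx = pd.date_range("2024-01-01", periods=4, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Downsampling: daily values into two-day buckets, then aggregate each bucket
two_day = daily.resample("2D").agg(["sum", "mean"])

# Upsampling: two-day totals back to daily rows
two_day_sum = daily.resample("2D").sum()
upsampled = two_day_sum.resample("D").asfreq()  # new rows appear as NaN
filled = two_day_sum.resample("D").ffill()      # new rows carry the last value forward
```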

#### Merging Time Series
- Merging is difficult if entries of the two groups to be merged don't have the same datetime
- The `pd.merge_asof()` function will merge observations that are close in time
  - This is similar to a left join
  - The tolerance of the proximity in time can be specified
  - This gives null values whenever a matching time in the time range can't be found
- The `pd.merge_ordered()` function will match up equal keys and interleave keys without matches
  - This is similar to an outer join
  - This gives null values whenever times don't match up exactly
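
The two functions can be compared on a pair of small tables whose timestamps never line up exactly. The trade and quote data and the 5-second tolerance are invented for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration; both are sorted by time
trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 09:00:05", "2024-01-01 09:00:20"]),
    "price": [100.0, 101.0],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 09:00:04", "2024-01-01 09:00:10"]),
    "bid": [99.5, 100.5],
})

# Each trade gets the latest quote at or before it, if within 5 seconds;
# the second trade is 10 seconds after the nearest quote, so its bid is NaN
nearest = pd.merge_asof(trades, quotes, on="time", tolerance=pd.Timedelta("5s"))

# Exact matches only; unmatched times from both sides are interleaved in order
ordered = pd.merge_ordered(trades, quotes, on="time")
```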