# Pandas Cheat Sheet
This notebook will provide a summary of all the commands in all of the notebooks in this repository. The most important and useful commands will be presented with as little code as possible. The point of this notebook is for you to quickly see all the commands that will make you a proficient pandas user.

## Pandas Intro
* Started by Wes McKinney at hedge fund AQR in 2008
* Built directly on numpy
* Two main data structures, Series - one dimension, DataFrame - two dimensions
* Name from panel + data

## Getting Help
* The documentation is excellent - http://pandas.pydata.org/pandas-docs/stable/
* [Python for Data Analysis](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662) - Book by Wes McKinney, mainly on learning pandas.
* [Pandas Cookbook](https://www.amazon.com/Learning-Pandas-Python-Discovery-Analysis/dp/1783985127) - Book by Ted Petrou. Detailed recipes that cover the entire pandas API.
* [Tagged pandas questions on stackoverflow](https://stackoverflow.com/questions/tagged/pandas)
* Shift + Tab + Tab - Opens up the docstring when cursor on attribute/method
* ? - enter at the end of your attribute/method to bring up docstring
* ?? - enter at the end of attribute/method to bring up the source code

# Python
* Everything in Python is an object except syntactical structures such as `for`, `if`, and operators.
* Every object has a type
* The **`type`** function returns an object's type.
* Always know the type of object you have
* All objects have attributes (descriptions) and methods (actions) which unleash all of their abilities
* Methods must be executed with `()`
* Use **`dir`** to uncover all attributes and methods
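
The points above can be tried on any object. The snippet below is a minimal sketch using a plain string; the variable names are illustrative only.

```python
# A small illustration of type() and dir() on a built-in object.
value = "pandas"

# type() returns the object's type
value_type = type(value)

# dir() returns the names of all attributes and methods as strings;
# here the names starting with an underscore are filtered out
public_names = [name for name in dir(value) if not name.startswith("_")]

# A method only runs when it is called with parentheses
upper_method = value.upper      # a bound method object, not yet executed
upper_result = value.upper()    # the method is executed here
```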

# Pandas
* Excellent at tabular data analysis.
* Relies heavily on NumPy to store data and do calculations.
* DataFrame is primary data structure - two dimensional
* Series is one column of data

## Series
**Anatomy**
![](images/series_anatomy.png)
* One dimensional object
* Two components to a Series - the **index** and the **data (values)**
* **`value_counts`**. Counts and orders all values. Returns a Series with the old values in the index and their counts as the new values. Use `normalize=True` to return proportions instead of counts.
* All arithmetic operators work as expected on each value of Series. `s + 5` adds 5 to each element. This operation is vectorized - very fast with no for loops
* Construct a Series by hand with `pd.Series` with a list, dict, or numpy array. If passed a dictionary, it puts the keys in the index.
* Operators like `+, -, *, /, **` are **vectorized** and operate on each element of the Series at once. `s + 4` adds 4 to each value in the Series.
* Summary statistics: `s.mean(), s.median(), s.sum(), s.max(), s.min()`
* Accumulation methods: `s.cumsum()`, `s.cumprod()`, `s.cummax()`, `s.cummin()`
* Get the values: `s.values` - returns a numpy array
* Get the index: `s.index` - returns Index type. `s.index.values` returns a numpy array of the index
* Get in the habit of using **`head, tail`** to shorten long output
* Sort by the values or the index: `s.sort_values()` or `s.sort_index()`. Use boolean parameter **`ascending=False`** to sort from high to low
* Get the number of elements in Series: `s.size` or `len(s)`
* Get the data type of the series with **`s.dtype`**
* Get several summary statistics with **`s.describe()`**
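
A short sketch of several of the Series basics listed above. The data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: passing a dict puts its keys in the index
s = pd.Series({"a": 3, "b": 1, "c": 3, "d": 7})

plus_five = s + 5                             # vectorized: adds 5 to every element
counts = s.value_counts()                     # unique values in the index, counts as values
shares = s.value_counts(normalize=True)       # proportions instead of counts
sorted_desc = s.sort_values(ascending=False)  # high to low
running_total = s.cumsum()                    # accumulation keeps the original index

n_elements = s.size                           # same number as len(s)
total = s.sum()
```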

### Selecting subsets of Series 

* There are two main indexers - **`.loc`** for labels and **`.iloc`** for integer location
* Both indexers can take either a **scalar**, a **list**, or **slice notation**
* **`s.loc['label1']`** - scalar that selects a single item in Series
* **`s.loc[['label1', 'label5']]`** - use a list to select disjoint items
* **`s.loc['start':'stop':step]`** - use a slice to select from start to stop inclusive
* **`s.iloc[integer1]`** - scalar that selects a single item in Series
* **`s.iloc[[integer1, integer2]]`** - use a list to select disjoint items
* **`s.iloc[start:stop:step]`** - use a slice to select from start up to, but not including, stop
*  use `s.at[label]` and `s.iat[integer_location]` for some speed-up of single (scalar) value selection
* **`.ix`** was deprecated and then removed in pandas 1.0. Do NOT use it.
* **`s[label]`** works for both integer and label location. It is ambiguous. Avoid if possible.
* **Automatic alignment of index** - be careful when operating on two Series at the same time. They align on the index first and then complete the calculation. Labels found in only one Series produce missing values, and duplicate labels produce a cartesian product of the matching rows.
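
The following sketch contrasts the two indexers and shows index alignment. The labels and values are hypothetical.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = s.loc["b"]              # scalar selects a single value
label_list = s.loc[["a", "e"]]     # list selects disjoint items
label_slice = s.loc["b":"d"]       # label slice includes the stop label "d"
int_slice = s.iloc[1:3]            # integer slice excludes position 3
fast_scalar = s.at["c"]            # single value lookup by label

# Alignment: values are matched on index labels, not on position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
aligned_sum = s1 + s2              # labels present in only one Series give NaN
```

Note that `label_slice` has three elements while `int_slice` has two, even though both slices look similar.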

#### Boolean Indexing

* Boolean indexing works by passing in a Series, or other sequence of booleans to the indexing operator. Only values that are True remain in the Series after.
* Must create a Series of booleans first. Sometimes save this to a variable **`criteria`**
*  **`s[criteria]`**  does the boolean selection
* Use `& | ~` to create multiple and complex boolean conditions. Wrap each condition in parentheses
* Criteria example: **`criteria = ((s > 5) | (s < -2)) & (s % 2 == 0)`**
* Use **`s.isin([1,2,3])`** instead of **`(s == 1) | (s == 2) | (s == 3)`**
* **`s.isnull()`** turns every value into a boolean indicating whether it's missing or not
* Series are mutable. Be careful when doing **`s1 = s; s.iloc[2] = 10`**. Both names refer to the same object, so the change shows up in `s1` as well. Use **`s1 = s.copy()`** to get an independent Series
* It's OK to use the indexing operator for boolean selection.
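
A minimal sketch of boolean selection and of the copy pitfall described above, using made-up values.

```python
import pandas as pd

s = pd.Series([8, -4, 3, 6, -3, 1])

# Each condition is wrapped in parentheses before combining
criteria = ((s > 5) | (s < -2)) & (s % 2 == 0)
selected = s[criteria]

# isin replaces a chain of == comparisons; ~ negates the booleans
in_set = s[s.isin([1, 3, 8])]
not_in_set = s[~s.isin([1, 3, 8])]

s1 = s           # a second name for the same object
s2 = s.copy()    # an independent copy
s.iloc[2] = 10   # visible through s1, not through s2
```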

### More Series methods

* **`s.drop(index_label)`** deletes a Series element
* **`s.idxmax()`** gets the index label of the maximum value
* **`s.where(condition, other_value)`** makes all values where condition is false some other value
* **`s.mask(condition, other_value)`** makes all values where condition is true some other value
* **`s.pct_change()`** finds the percentage change from the previous value to the current value
* **`s.diff(lag)`** - Subtract the series from itself by some lag
* The **`inplace`** argument appears in many methods. It defaults to `False`. Set it to `True` to mutate the underlying object, in which case the method returns `None`. This is NOT best practice
* **`s.unique()`** returns a numpy array of unique values
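
The methods above can be compared side by side. This is a sketch with hypothetical values; `where` and `mask` are mirror images of each other.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 18.0], index=["a", "b", "c", "d"])

dropped = s.drop("b")               # new Series without label "b"; s is unchanged
max_label = s.idxmax()              # index label of the largest value
kept_large = s.where(s > 9.5, 0)    # values where the condition is False become 0
hid_large = s.mask(s > 9.5, 0)      # values where the condition is True become 0
pct = s.pct_change()                # first element is NaN, there is no previous value
delta = s.diff(1)                   # each value minus the one before it
```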

## DataFrame
**Anatomy**
![](images/dataframe_anatomy.png)
* Two dimensional - rows and columns.
* Three main components - **index**, **columns** and **data (values)**
* The index labels the rows and the columns label the columns
* Uses an Index object for both the rows and the columns
* The row Index is simply called the index. The column index is called the columns.
* Access Index with `df.index` and the columns with `df.columns`
* Construct a DataFrame with a 2d numpy array, a dictionary of lists/Series, or reading in a file (csv,json)
* Create new column from other columns: **`df[new_col] = df[col1] - df[col2].mean()`**
* Insert a column somewhere in the middle: **`df.insert()`**
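
A small sketch of building a DataFrame from a dictionary of lists and adding columns. The column names and numbers are invented for illustration.

```python
import pandas as pd

# Dictionary keys become column names; index labels the rows
df = pd.DataFrame(
    {"price": [10.0, 12.5, 8.0], "cost": [6.0, 9.0, 6.0]},
    index=["x", "y", "z"],
)

row_labels = df.index.tolist()
column_labels = df.columns.tolist()

# New column from other columns: each price minus the average cost
df["price_vs_avg_cost"] = df["price"] - df["cost"].mean()

# Insert a column at integer position 1; a scalar is repeated for every row
df.insert(1, "currency", "USD")
```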

#### Data Types
<table>
<thead>
<td>Common Data Type Name</td>
<td>NumPy/pandas Object</td>
<td>Pandas String Name</td>
<td>Notes</td>
</thead>
<tr>
<td>Boolean</td>
<td>np.bool_</td> 
<td>bool</td> 
<td>Stored as a single byte</td>
</tr>
<tr>
<td>Integer</td> 
<td>np.int64</td> 
<td>int</td> 
<td>Defaults to 64 bits. Also available are unsigned ints - np.uint64</td>
</tr>
<tr>
<td>Float</td> 
<td>np.float64</td>
<td>float</td>
<td>Defaults to 64 bits</td>
</tr>
<tr>
<td>Object</td> 
<td>np.object_</td> 
<td>O, object</td> 
<td>Typically strings but is a catchall for columns with multiple different types or other Python objects (tuples, lists, dicts, etc...)</td> 
</tr>
<tr>
<td>Datetime</td> 
<td>np.datetime64, pd.Timestamp</td> 
<td>datetime64 </td> 
<td>A specific moment in time with nanosecond precision</td> 
</tr>
<tr>
<td>Timedelta </td> 
<td>np.timedelta64, pd.Timedelta</td> 
<td>timedelta64 </td> 
<td>Represents an amount of time from days to nanoseconds</td> 
</tr>
<tr>
<td>Categorical </td> 
<td>pd.Categorical </td> 
<td>category </td> 
<td>Specific only to pandas. Useful for object columns with relatively few unique values</td> 
</tr>
</table>

* **`df.info()`** to see the column data types, non-null counts and memory usage.
* **`df.describe()`** to get summary statistics
* Use `pd.options.display.<tab>` to change the display options of pandas in the notebook
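
The sketch below builds a small DataFrame with hypothetical columns and reads back the string names of their data types. How plain string columns are reported (`object` or a dedicated string type) depends on the pandas version, so that column is left out of the checks.

```python
import pandas as pd

df = pd.DataFrame({
    "flag": [True, False, True],
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
    "city": ["Houston", "Austin", "Houston"],
    "when": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
})

# Convert a column with few unique values to the pandas-only category type
df["city_cat"] = df["city"].astype("category")

# Map each column name to the string name of its data type
dtype_names = df.dtypes.astype(str).to_dict()
```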

### Selecting subsets of DataFrames

* Indexing operator is different for DataFrames than Series. It selects a column or columns.
* **`df[col1]`** selects a single column as a Series
* **`df[[col1, col2, col3]]`** selects multiple columns. Notice the inner list
*  **`df[[col]]`** selects a one column DataFrame
* **`.iloc`** uses integer location and **`.loc`** uses index labels.
* These indexers can make selections by both rows and columns
* **`.iloc`** and **`.loc`** accept scalars, lists and slice notation. 
* They first select by rows - **`df.loc[label]`** selects a single row of data as a Series
* **`df.loc[[label, label2]]`** selects multiple rows as a DataFrame
* **`df.loc[start:stop:step]`** - slice notation selects multiple rows as a DataFrame
* **`.iloc`** works the same way except with integers
* Select rows and columns at the same time by using a comma after the row selection
* **`df.loc[label1 : label2 : 3, ["col1", "col2"]]`** uses slice notation for the rows and a list for the columns
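
The selections above can be sketched on a small DataFrame. The labels `a` to `d` and the column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40], "col3": [5, 6, 7, 8]},
    index=["a", "b", "c", "d"],
)

one_col = df["col1"]                  # single column as a Series
one_col_frame = df[["col1"]]          # one-column DataFrame, note the inner list
two_cols = df[["col1", "col3"]]       # multiple columns as a DataFrame
one_row = df.loc["b"]                 # single row as a Series
rows_slice = df.loc["a":"c"]          # label slice includes "c"

# Rows by slice with a step of 3, columns by list
rows_and_cols = df.loc["a":"d":3, ["col1", "col2"]]

# Same idea with integer locations: rows 0 and 1, first and third columns
by_position = df.iloc[0:2, [0, 2]]
```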

#### Boolean Indexing for DataFrames

* Boolean indexing works the same way as it does for Series except the criteria is made up of columns
* Example Criteria: `criteria = ((df[col1] > 5) | (df[col2] < -2)) & (df[col3] % 2 == 0)`
* **`df[criteria]`** selects all the columns for the rows that are true
* Use **`df.loc[criteria, [col1, col2]]`** to do boolean selection on the rows and also select columns with a list.
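
A minimal sketch of the example criteria applied to made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [6, 1, 8, 2],
    "col2": [0, -5, 3, 4],
    "col3": [4, 2, 3, 6],
})

# One boolean per row, built from column comparisons
criteria = ((df["col1"] > 5) | (df["col2"] < -2)) & (df["col3"] % 2 == 0)

all_columns = df[criteria]                        # matching rows, every column
some_columns = df.loc[criteria, ["col1", "col2"]]  # matching rows, chosen columns
```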

### More DataFrame methods
* Most DataFrame methods have an `axis` parameter that controls whether the action will take place across the rows or down the columns.
* `axis` equal to 0 or `index` will perform the action down the columns. `axis` equal to 1 or `columns` will perform the action across the rows. Default is axis=0
* **`df.rename()`** renames an index label or a column name 
* **`df.drop()`** deletes rows or columns 
* Make use of the index! Place a column into the index: **`df.set_index('column name')`**
* Make the index a column: **`df.reset_index()`**
* Remove duplicate rows with `df.drop_duplicates()`. Use the **`subset`** parameter to find duplicates from a subset of columns
* Fill missing values with `df.fillna(value)`. Pass a dictionary, Series, or DataFrame to `.fillna` to fill different columns with different values.
* **`df.select_dtypes`** to choose columns based on their data type
* Style your DataFrame with **`df.style.`** Press tab to see style types.
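
The sketch below runs several of these methods on a small hypothetical DataFrame. Each call returns a new object and leaves `df` unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "bob", "cat"],
    "score": [1.0, np.nan, np.nan, 4.0],
    "age": [30, 25, 25, 41],
})

# axis=0 works down each column, axis=1 works across each row
col_means = df[["score", "age"]].mean(axis=0)   # missing values are skipped
row_sums = df[["score", "age"]].sum(axis=1)

renamed = df.rename(columns={"score": "points"})
without_age = df.drop(columns="age")
deduped = df.drop_duplicates(subset=["name"])   # duplicates judged on name only
filled = df.fillna({"score": 0})                # per-column fill value
indexed = df.set_index("name")                  # column moves into the index
restored = indexed.reset_index()                # index moves back to a column
numeric_only = df.select_dtypes(include="number")
```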


## Split - Apply - Combine

* Split - Splits your data into specific groups
* Apply - Apply a function to each of your groups
* Combine - Put the results of the apply function back together

The most common split-apply-combine pattern has three components - **grouping columns**, **aggregating columns**, and **aggregating functions**. Several syntaxes can be used; the most common is shown below:

```
>>> df.groupby(['grouping columns'])['aggregating columns'].agg(['aggregating functions'])
```

If you have just a single aggregating function you may do the following:

```
>>> df.groupby(['grouping columns'])['aggregating columns'].aggregating_function()
```

* **grouping columns** - Each unique combination forms a group
* **aggregating columns** - Values of these columns are going to be aggregated into a single number
* **aggregating functions** - Determines how the aggregation will happen - **`sum, mean, median`**, etc...

* `groupby` method does split-apply-combine in pandas. Can be done with a Series grouping the index but mainly done on DataFrames. Examples below will concentrate on DataFrames
* Create groupby object: `grouped = df.groupby(['grouping columns'])`
* Dictionary of groups: `grouped.groups`
* Number of groups: `grouped.ngroups`
* Get a specific group: `grouped.get_group('group name')`
* Iterate through groups: `for name, group in grouped: print(name); print(group);`
* When the word 'aggregation' is used in the groupby context, it means turning all the values of one group into one specific value.
* Aggregating groups: `df.groupby([grouping columns])[aggregating columns].aggregating_function()`
* Aggregating function from line above can be `min, max, sum, mean, count, size, std` and more.
* Use many functions to aggregate: `df.groupby([grouping columns])[aggregating columns].agg([list of aggregating functions])`
* You can put anonymous functions and custom functions in your list of aggregating functions
* To prevent the grouping columns going into the index use parameter `as_index=False` in the groupby method.
* For non-aggregating functions, use `.apply` instead of `.agg` and pass in your custom function.
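
A short sketch of the pattern with one grouping column, one aggregating column, and several aggregating functions. The `team` and `points` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 5, 15, 25],
})

grouped = df.groupby("team")
n_groups = grouped.ngroups
team_b = grouped.get_group("b")       # the rows belonging to one group

# Single aggregating function
totals = df.groupby("team")["points"].sum()

# Several aggregating functions at once
stats = df.groupby("team")["points"].agg(["min", "max", "mean"])

# A custom (anonymous) aggregating function: range of each group
spread = df.groupby("team")["points"].agg(lambda x: x.max() - x.min())

# Keep the grouping column as a regular column instead of the index
flat = df.groupby("team", as_index=False)["points"].sum()
```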

## Visualization

* For any numerical Series: `s.plot()` will produce a line plot by default with index as x-axis and values as y-axis.
* Use parameter `kind=` to change plot type to `hist, bar, barh, kde, pie, box, density, area`
* `linestyle` (ls) - Pass a string of one of the following ['--', '-.', '-', ':']
* `color` (c) - Can take a named color string, a hexadecimal string, or an RGB tuple with each number between 0 and 1. [Check out this really good stackoverflow post to see the colors](http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib)
* `linewidth` (lw) - controls thickness of line. Default is 1.5 in matplotlib 2.0 and later
* `alpha` - controls opacity with a number between 0 and 1
* `figsize` - a tuple used to control the size of the plot. (width, height) 
* `legend` - boolean to control legend
* Change plotting template with `plt.style.use('ggplot')`. See all templates with `plt.style.available`

## Time Series

* Original purpose of Pandas
* Date is month, day, year
* Time is Hour, minute, second, micro/nano
* Datetime is both date and time
* Python's datetime standard library allows creation of date, time and datetime objects with classes of the same name
* The timedelta class allows adding time to datetime objects
* pandas uses the Timestamp object, which is built on numpy's datetime64. It has nanosecond precision as opposed to datetime's microsecond precision
* create a date range with `pd.date_range`
* Become familiar with the date offsets
* Putting Timestamps in the index is very useful
* Slice a time series by date with `s['2014-01-01':'2014-02-28']`
* Shift an index with `s.shift(5, freq='D')` Again use date offsets as frequency
* Use `s.asfreq('W')` to sample your time series at a particular interval
* Use anchored offsets to further control how the interval is chosen: `s.asfreq('W-MON')`
* Use resample to aggregate a time series over an interval. `s.resample('5D').sum()`
* To use many different aggregation functions use .agg: `s.resample('3W').agg(['min', 'max', 'sum'])`
* Can do rolling aggregations: `s.rolling(10).mean()` This does a rolling average of last 10 time periods
* These functions work with both Series and DataFrames as long as the Timestamp is in the index
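
The sketch below applies several of these operations to ten days of made-up daily data. The dates and window sizes are arbitrary choices for illustration.

```python
import pandas as pd

# Ten consecutive days with the values 0 through 9
idx = pd.date_range("2014-01-01", periods=10, freq="D")
s = pd.Series(range(10), index=idx)

date_slice = s.loc["2014-01-02":"2014-01-04"]        # both end dates are included
shifted = s.shift(2, freq="D")                       # moves the index, not the values
every_other_day = s.asfreq("2D")                     # samples the series every two days
five_day_totals = s.resample("5D").sum()             # aggregates each 5-day bin
multi = s.resample("5D").agg(["min", "max", "sum"])  # several aggregations per bin
rolling_mean = s.rolling(3).mean()                   # average of the last 3 periods
```

The first two values of `rolling_mean` are missing because a full window of three periods is not yet available.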