# pandas Cheat Sheet

---

## I. Series:

### 1. Creation (by what):
    - array object (numpy array)
    - dictionary (key-value pairs): key -> index, value -> value
    - tuple
    - set: not accepted directly because a set is unordered (pandas raises a TypeError); convert it to a list first
### 2. Attributes:
    - values
    - index
        - Attributes: name
    - dtype
### 3. Parameters:
    - data
    - index
    - dtype
    - name
### 4. Methods:
    - isnull
    - notnull
    - reindex:
        - Parameters:
            - index
            - method: 'ffill' -> forward fill
            - fill_value: substitute value to use when introducing missing data by reindexing.
            - limit: when forward- or backfilling, maximum size gap (in number of elements) to fill.
            - tolerance: when forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
    - drop: 
        - Parameters:
            - labels: list-like
            - inplace: if True, modify the Series in place instead of returning a new object
    - add, sub, div, mul:
        - Parameters:
            - fill_value: fill nan by specified value
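    - apply:
        - Parameters:
            - func: function applied to each value
            - args: extra positional arguments passed on to func (a Series has no axis to choose)
    - map: element-wise transformation using a function, dict, or Series (applymap exists only on DataFrame)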
    - sort_index:
        - Parameters:
            - ascending: default -> True
    - sort_values: NaN values are placed at the end by default
    - rank:
        - Parameters:
            - ascending
            - method
    - Aggregation functions: mean, sum, median, etc
        - skipna: If False, the null values are not excluded
    - idxmax: return the index label of max value
    - idxmin: return the index label of min value
    - cumsum: cumulative sum
    - describe: returns summary statistics. On non-numeric data returns alternative summary statistics
    - argmax, argmin: return the integer position of the max/min value
    - corr: correlations
    - cov: covariances
    - unique: returns array
    - value_counts: returns Series
    - isin: returns boolean array
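    - match: compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations (a top-level function in early pandas releases, since removed; Index.get_indexer covers the same need)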
    - fillna: fill null values by specified value
    - sample: draw a random sample; n sets how many elements to take
### 5. Accessing: numpy style
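
A minimal sketch of a few of the Series items above (creation from a dict, `reindex`, arithmetic with `fill_value`, `idxmax` versus `argmax`). The data and variable names are made up for illustration.

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data.
s = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0}, name="demo")

# Reindexing to a label that is absent would introduce NaN;
# fill_value supplies a substitute instead.
r = s.reindex(["a", "b", "d"], fill_value=0.0)

# Arithmetic aligns on index labels; fill_value stands in for a label
# that is missing on one side only.
other = pd.Series({"a": 10.0, "d": 40.0})
total = s.add(other, fill_value=0)

# idxmax returns the label, argmax the integer position.
label, position = s.idxmax(), s.argmax()
```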

## II. DataFrame:

### 1. Creation (by what):
    - array object (numpy array)
    - dictionary containing key-list_of_values
    - dictionary of dictionary: outer -> column label, inner -> index label
    - dictionary of arrays, lists or tuples
    - dictionary of Series
    - list of dicts or Series
    - list of lists
### 2. Attributes:
    - index
        - Attributes: name
    - columns
        - Attributes: name
    - dtypes
    - values: return ndarray
### 3. Parameters:
    - data
    - index
    - columns
    - dtype
### 4. Methods:
    - head: return the first n rows (default 5)
    - tail: return the last n rows (default 5)
    - transpose (dataframe.T)
    - isnull
    - notnull
    - reindex:
        - Parameters:
            - index
            - columns
            - method: 'ffill' -> forward fill
            - fill_value: substitute value to use when introducing missing data by reindexing.
            - limit: when forward- or backfilling, maximum size gap (in number of elements) to fill.
            - tolerance: when forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
    - drop:
        - Parameters:
            - labels
            - axis: 0 -> along row (row labels), 1 -> along column (column labels)
            - inplace: if True, modify the data frame in place instead of returning a new object
    - add, sub, div, mul:
        - Parameters:
            - fill_value: fill nan by specified value
    - apply:
        - Parameters:
            - function
            - axis: 0 -> apply the function to each column, 1 -> apply it to each row
    - applymap: apply a function to every element (renamed to DataFrame.map in pandas 2.1)
    - sort_index:
        - Parameters:
            - axis: 0 -> index label, 1 -> columns label
            - ascending: default -> True
    - sort_values:
        - Parameters:
            - axis: 0 -> index label, 1 -> columns label
            - ascending: default -> True
            - by: the column label(s) to sort by
    - rank:
        - Parameters:
            - ascending
            - method
            - axis
    - Aggregation functions: mean, sum, median, etc
        - axis: 0 -> aggregate down the rows (one result per column), 1 -> aggregate across the columns (one result per row)
        - skipna: If False, null values are not excluded
    - idxmax: returns the index label of max value
    - idxmin: returns the index label of min value
    - cumsum: cumulative sum
    - describe: returns summary statistics. On non-numeric data returns alternative summary statistics
    - argmax, argmin: return the integer position; defined on Series and Index, not on DataFrame (use idxmax/idxmin there)
    - corr: correlations
    - cov: covariances
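    - corrwith: compute pairwise correlations between the columns (or rows) of a DataFrame and another Series or DataFrame
    - unique: returns array (a Series method; call it on a single column)
    - value_counts: returns Series
    - isin: returns a boolean DataFrame
    - match: compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations (a top-level function in early pandas releases, since removed; Index.get_indexer covers the same need)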
    - fillna: fill null values by specified value
### 5. Accessing: 
    - numpy style
    - loc, label based
    - iloc,  position based
    - at: scalar access by row label and column label
    - iat: scalar access by integer position
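
A short sketch of the accessors and of the `axis` argument described above. The frame and its labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "y"]     # label based
by_position = df.iloc[1, 1]     # position based
scalar = df.at["c", "x"]        # scalar access by label

# axis=0 hands each column to the function, axis=1 hands it each row.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
row_sum = df.apply(lambda row: row["x"] + row["y"], axis=1)

# drop returns a new object; the original frame keeps column "y".
without_y = df.drop("y", axis=1)
```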

## III. Index Objects:

### 1. Creation:
    - list, tuple
### 2. Methods:
    - append: concatenate additional Index object, return a new object
    - difference: compute set difference
    - intersection: compute set intersection
    - union: compute set union
    - isin: returns boolean array with parameter arraylike
    - is_monotonic_increasing (is_monotonic in older pandas): returns True if each element is greater than or equal to the previous element
    - is_unique
    - unique
    - get_indexer: returns the integer position of each given label within the Index (-1 for labels that are not found)
### 3. Attributes: 
    - name
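
The set-style Index methods can be sketched as follows; the two indexes are invented for the example.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

common = left.intersection(right)
either = left.union(right)
only_left = left.difference(right)

# get_indexer gives the position of each requested label, -1 when absent.
positions = left.get_indexer(["c", "a", "z"])
```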

## IV. Common functions:

- Most instance methods also have an equivalent top-level function
- null:
    - isnull
    - notnull
- pd.crosstab: returns cross-tabulation matrix

## V. Data loading, Storage, and File Formats:

### 1. Parsing functions:
    - read_csv: Load delimited data from a file, URL, or file-like object; use comma as default delimiter
        - Parameters:
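            - filepath_or_buffer: file path, URL, or file-like object
            - header: row number to use as the column labels (by default the first row); None if the file has no header row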
            - sep: separator
            - names: column labels
            - index_col: column(s) to use as the index label; a list produces a hierarchical index
            - skiprows: number of rows to skip at the start of the file, or a list of row numbers to skip
            - na_values: extra values to treat as null; a scalar, a list, or a dict (key as column label, value as list of markers)
            - nrows: read only the specified number of rows
            - chunksize: number of rows per chunk; returns a TextFileReader to iterate over
    - read_table: Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
    - read_fwf: Read data in fixed-width column format (i.e., no delimiters)
    - read_clipboard: Version of read_table that reads data from the clipboard; useful for converting tables from web pages
    - read_excel: Read tabular data from an Excel XLS or XLSX file
    - read_hdf: Read HDF5 files written by pandas
    - read_html: Read all tables found in the given HTML document
    - read_json: Read data from a JSON (JavaScript Object Notation) string representation
    - read_msgpack: Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0)
    - read_pickle: Read an arbitrary object stored in Python pickle format
    - read_sas: Read a SAS dataset stored in one of the SAS system’s custom storage formats
    - read_sql: Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
    - read_stata: Read a dataset from Stata file format
    - read_feather: Read the Feather binary file format
### 2. Write (instance method of dataframe):
    - to_csv:
        - Parameters:
            - path_or_buf: output file path or file-like object
            - sep: separator
            - na_rep: string used to represent missing values
            - index: if False, the row labels are not written
            - header: if False, the column labels are not written
            - columns: list of column labels to write
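
A minimal round trip through `read_csv` and `to_csv`, using an in-memory buffer in place of a real file. The CSV text and the `-1` sentinel are assumptions made for the example.

```python
import io
import pandas as pd

raw = "id,score\n1,3.5\n2,NA\n3,-1\n"

# na_values adds extra markers to treat as missing, on top of the defaults.
df = pd.read_csv(io.StringIO(raw), index_col="id", na_values=["-1"])

# chunksize yields the input in pieces of that many rows.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(raw), chunksize=2)]

# na_rep controls how missing values are written back out.
out = io.StringIO()
df.to_csv(out, na_rep="NULL")
lines = out.getvalue().splitlines()
```
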
### 3. JSON file:
    - json module
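    - json.loads: parse a JSON string into Python objects (json.load does the same from a file object)
    - json.dumps: convert Python objects back to a JSON string
    - pd.read_json: for JSON that already has a regular, table-like structure
    - df.to_json: serialize a data frame to JSON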
### 4. HTML:
    - pd.read_html
    - df.to_html
### 5. Pickle:
    - pd.read_pickle
    - df.to_pickle
### 6. HDF5:
    - pd.HDFStore: returns an HDFStore object (think of it as dict-like storage)
    - Methods for HDFStore object:
        - store.put('storage_name', obj, format='table' or 'fixed')
        - store.select('storage_name', where=[conditions involving the index])
    - pandas:
        - pd.read_hdf, ex: pd.read_hdf('hdf_file', 'storage_name', where)
        - frame.to_hdf('file_name', 'storage_name', format)
### 7. Excel:
    - pd.ExcelFile: for reading an Excel file
    - pd.ExcelWriter: for writing an Excel file
    - pd.read_excel: read
    - df.to_excel: write
### 8. Web APIs:
    - import requests: a third-party HTTP module
    - resp = requests.get(url): fetch the url
    - resp.json(): returns the parsed JSON (often a list of dicts), which can then be turned into a dataframe
### 9. Databases:
    - Module: sqlite3
    - Methods: con = sqlite3.connect(path), open a connection (the database file is created if it does not exist)
    - con.execute(query)
    - con.commit()
    - Simple:
        - import sqlalchemy as sqla
        - db = sqla.create_engine('sqlite:///mydata.sqlite')
        - pd.read_sql('SELECT * FROM test', db)
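
The database steps above can be sketched with an in-memory SQLite database. The table name `test` follows the outline; the column names and rows are made up, and the plain `sqlite3` connection is passed to `pd.read_sql` here instead of a SQLAlchemy engine to keep the example dependency-free.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")   # nothing is written to disk
con.execute("CREATE TABLE test (a TEXT, b INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?)", [("x", 1), ("y", 2)])
con.commit()

# read_sql runs the query and returns the result as a DataFrame.
frame = pd.read_sql("SELECT * FROM test", con)
con.close()
```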

## VI. Data Cleaning and Preparation Functionalities

### 1. Handling Missing Data
    - Methods:
        - isnull: returns boolean pandas object, True if null, False otherwise
        - notnull: negation of isnull method
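        - dropna: returns pandas object without null values
            - how: 'any' (default) drops a row or column if any value is null, 'all' only if every value is null
            - axis: 0 drops rows, 1 drops columns
            - thresh: minimum number of non-null values a row or column needs in order to be kept
        - fillna: fill null value with specified value
            - value: single value, or dict (key -> column label, value -> replacement for that column)
            - inplace: if True, modify the object in place instead of returning a new one
            - method: 'ffill' -> forward fill (recent pandas versions prefer calling ffill() or bfill() directly)
            - limit: maximum number of consecutive missing values filled by the method kwarg
            - axis: 0 fills down each column, 1 fills across each row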
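
A small sketch of `dropna` and `fillna` on a hypothetical two-column frame. Forward filling is written with `ffill()` rather than `fillna(method='ffill')`, which newer pandas releases discourage.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

complete_rows = df.dropna()               # default how='any'
non_empty_rows = df.dropna(how="all")     # drop only rows that are entirely null
at_least_one = df.dropna(thresh=1)        # keep rows with >= 1 non-null value

# A dict fills each column with its own replacement value.
filled = df.fillna({"a": 0.0, "b": -1.0})

# Forward fill, at most one consecutive missing value per gap.
forward = df.ffill(limit=1)
```
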
### 2. Data Transformation
    - Methods:
        - duplicated: returns boolean pandas object, True if the row (or value) repeats an earlier one
        - drop_duplicates: removing duplicate values
            - subset: subset of column labels
            - keep: 'first' keeps the first observed data, 'last' keeps the last observed data 
        - map: takes dict or function returns pandas object
        - replace: replace old values with new ones; accepts scalars, lists, or a dict of old -> new
        - rename: relabel the index or the columns (index=..., columns=...) using a dict or function
        - take: return the elements in the given *positional* indices along an axis.
        - sample: take a random sample
            - n: number of samples
            - replace: True with replacement, False otherwise
        - join: concatenate 
    - Functions:
        - cut: returns Categorical object
            - x: array
            - bins
            - labels: names for the resulting bins; if False, integer bin codes are returned instead
            - precision: number of decimal places used for the bin edges in the labels
            - Categorical object attributes:
                - codes
                - categories
        - qcut: same as cut, but bins by sample quantiles, so the bins are roughly equal in size
        - get_dummies: making dummy variables
            - data
            - prefix: string prepended to each dummy column label, giving labels of the form prefix_value
        - add_prefix (DataFrame method): adding prefix to column labels
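
A brief sketch of `drop_duplicates`, `map`, `cut`, and `get_dummies`. The frame, bin edges, and bin labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "c"], "age": [15, 30, 15, 70]})

# Rows 0 and 2 are identical, so the later one is dropped (keep='first').
deduped = df.drop_duplicates()

# map accepts a function (or a dict) and transforms element by element.
upper = df["key"].map(str.upper)

# cut assigns each value to a bin; bins are right-inclusive by default.
ages = pd.cut(df["age"], bins=[0, 18, 65, 100], labels=["minor", "adult", "senior"])

# One indicator column per distinct value, labelled prefix_value.
dummies = pd.get_dummies(df["key"], prefix="key")
```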

## VII. Data Wrangling:

### 1. Hierarchical Indexing
    - Creation:
        - Series: specifying the index as list of list (levels)
        - MultiIndex.from_arrays
    - Methods: 
        - unstack: Hierarchical Series to DataFrame
        - stack: reverse of unstack
        - swaplevel: swap the level in hierarchical pandas object
        - sort_index: argument level
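        - Summary Statistics:
            - level: older releases accepted level= directly (e.g. s.sum(level=0)); current releases use groupby(level=...) instead
            - axis: 0 aggregates over rows, 1 over columns
        - set_index: the specified columns will be index, if argument is a list it will be hierarchical index
            - drop: by default the specified columns are removed from the frame; pass False to keep them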
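
A minimal sketch of a two-level Series and the reshaping methods listed above. The level names `outer` and `inner` are assumptions; the per-level sum is written with `groupby(level=...)`.

```python
import pandas as pd

# A list of lists as the index creates one index level per inner list.
s = pd.Series(
    [1, 2, 3, 4],
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)
s.index.names = ["outer", "inner"]

wide = s.unstack()      # the inner level becomes the columns
back = wide.stack()     # and stack reverses it

swapped = s.swaplevel("outer", "inner").sort_index(level=0)
per_outer = s.groupby(level="outer").sum()
```
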
### 2. Combining and Merging
    - merge (function): joining dataset using key
        - left: data
        - right: data
        - how: inner is the intersection of keys, outer is the union; 'left' and 'right' keep every key from that side
        - on: key column name(s); the key must have the same name in both left and right
        - left_on: key on left
        - right_on: key on right
        - left_index: using index as key from left data
        - right_index: using index as key from right data
        - suffixes: strings appended to overlapping column names (default '_x' and '_y')
    - join (method): mostly the same as merge
    - concat (function): like np.concatenate
        - objs: objects
        - join: inner is intersection, outer is union
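        - join_axes: specifies the labels to include in the result (removed in pandas 1.0; reindex the result instead)
        - axis: 0 is vertical, 1 is horizontal
        - keys: if axis 0, keys become the outer index level; if axis 1, keys become the column labels
        - names: specifies the names of the levels in hierarchically indexed data
        - ignore_index: discard the original index and number the result from 0
        - verify_integrity: raise an error if the concatenated axis contains duplicates
    - combine_first (method): take values from the first object and, where they are NaN, fill in from the second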
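
The combining tools above, sketched on two hypothetical frames that share a `key` column.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "val": [20, 30, 40]})

# Overlapping non-key columns receive the suffixes.
inner = pd.merge(left, right, on="key", how="inner", suffixes=("_l", "_r"))
outer = pd.merge(left, right, on="key", how="outer")

# keys adds an outer index level that records which input a row came from.
stacked = pd.concat([left, right], keys=["L", "R"])
renumbered = pd.concat([left, right], ignore_index=True)

# combine_first patches the holes in a with values from b.
a = pd.Series([1.0, None, 3.0])
b = pd.Series([10.0, 20.0, 30.0])
patched = a.combine_first(b)
```
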
### 3. Reshaping and Pivoting
    - unstack (method): hierarchical -> common df
        - level
    - stack (method): df -> hierarchical (Series)
        - dropna: if True (default), missing values are dropped when stacking; if False they are kept
    - pivot (method): (long to wide), single column to multiple columns
        - index: which column becomes the index label
        - columns: what column to be the column label
        - values: what column to be the values
    - melt (function): (wide to long), multiple columns to single column
        - frame: data
        - id_vars: column(s) kept as identifiers (the group indicator)
        - value_vars: column(s) to unpivot; their labels go into the 'variable' column and their contents into the 'value' column
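
A small long-to-wide and wide-to-long round trip. The column names `date`, `item`, and `value` are assumptions chosen for the example.

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "item": ["x", "y", "x", "y"],
    "value": [1, 2, 3, 4],
})

# Long to wide: one row per date, one column per item.
wide = long.pivot(index="date", columns="item", values="value")

# Wide to long again: x and y collapse back into a single value column.
melted = pd.melt(
    wide.reset_index(),
    id_vars="date",
    value_vars=["x", "y"],
    var_name="item",
)
```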

## VIII. Data Aggregation and Group Operations

### 1. GroupBy Mechanics
    - series.groupby(series or list of series): returns GroupBy object of value with respect to key
        - df['value'].groupby(df['key']): one value-one key, Series with Series
        - df['value'].groupby([df['key1'], df['key2']]): two key one value, will be hierarchical indexed Series
    - dataframe.groupby(series or list of series)
        - df.groupby('colkey1'): df -> whole dataframe, colkey -> column of df as key
        - df.groupby(['colkey1', 'colkey2']): df -> whole data frame, list of column keys
    - dataframe.groupby(series of categorical type): by default it gets excluded, but we can specify it
    - for iteration purposes: iterating over a groupby yields tuples of (group name, chunk of data); name is the key label and chunk is a dataframe or series
        - we can turn it into a dictionary
        - Separate the data by key:
            - dict(list(df.groupby('key'))): returns a dict that we can access with key label
        - Separate the data by types:
            - df.groupby(df.dtypes, axis=1): groups the columns by dtype (axis=1 grouping is deprecated in recent pandas releases)
    - grouping by dicts and series: the key is a dict or series
        - provide dict that map column label to new column label (for grouping purposes). use axis=1 because we use the columns not rows
        - also works if provided by series
    - grouping by functions: the function is called once per index label and its return value is used as the group key
    - grouping by index levels: pass level=...; axis=0 groups on row labels, axis=1 on column labels. The index level serves as the key
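
A minimal sketch of the grouping styles above: by a Series, by a column name, by iteration, and by a function of the index. The frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

# Series grouped by another Series, and the column-name shorthand.
means = df["value"].groupby(df["key"]).mean()
sums = df.groupby("key")["value"].sum()

# Iterating yields (name, chunk) pairs, which fit naturally into a dict.
pieces = {name: chunk for name, chunk in df.groupby("key")}

# A function key is called on each index label (here 0, 1, 2, 3).
by_parity = df["value"].groupby(lambda i: i % 2).sum()
```
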
### 2. Data Aggregation:
    - agg: takes a function or a function name; custom functions are slower than the optimized built-ins
        - agg('mean'): same as calling mean
        - agg('std'): same as calling std
        - if we pass a list of functions, we get dataframe with functions as the column label
        - if we pass a list of 2-tuple name-function, the name will become the column label
        - if we pass a dict, key represents column name, value represents the functions
        - as_index parameter (passed to the groupby method): if False, the grouping keys are returned as regular columns instead of becoming the index
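
The `agg` variants above can be sketched as follows. The frame and the names `avg` and `spread` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})
g = df.groupby("key")["value"]

# A list of functions: one result column per function.
stats = g.agg(["mean", "max"])

# (name, function) tuples set the resulting column labels.
named = g.agg([("avg", "mean"), ("spread", lambda x: x.max() - x.min())])

# A dict maps column name -> function; as_index=False keeps "key" as a column.
flat = df.groupby("key", as_index=False).agg({"value": "sum"})
```
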
### 3. Apply: General split-apply-combine
    - apply: 
        - general idea: the data is split by the keys, the function passed to apply is called on each piece, then the results are glued together
        - if the function takes extra arguments, pass them after the function in apply
        - group_keys (passed to the groupby method): if False, the group keys are not added to the result index
        - if the function returns a dict: the data is split by the grouping keys, the function is applied to each piece, and the returned dicts are passed on to the constructor to build a dataframe or series
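
A short split-apply-combine sketch. `top` is a hypothetical helper, and the grouped object is narrowed to one column so the result does not depend on how a given pandas version treats the grouping column inside `apply`.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1, 2, 3, 4],
})

def top(values, n=1):
    """Return the n largest entries of one group."""
    return values.sort_values(ascending=False).head(n)

# Extra arguments for the function follow it in the apply call.
largest = df.groupby("key")["value"].apply(top, n=1)

# group_keys=False leaves the original row labels as the only index.
no_keys = df.groupby("key", group_keys=False)["value"].apply(top, n=1)
```
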
### 4. Pivot Tables and Cross-Tabulation
    - pivot_table (function or method):
        - values: which column we want to inspect
        - index: grouping key as index
        - columns: grouping key as column
        - aggfunc: aggregation function
        - fill_value: replace missing values 
        - dropna: if True, do not include columns whose entries are all NA
        - margins: add subtotals
    - crosstab: special case of pivot_table
        - index: the specified grouping key as index
        - columns: the specified grouping key as column
        - margins: add subtotals
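
A minimal sketch of `pivot_table` and `crosstab` on invented tipping data; the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "smoker": ["yes", "no", "yes", "yes"],
    "tip": [1.0, 2.0, 3.0, 5.0],
})

# Mean tip per (day, smoker); the empty tue/no cell is filled with 0.
table = df.pivot_table(
    values="tip", index="day", columns="smoker",
    aggfunc="mean", fill_value=0,
)

# crosstab counts occurrences; margins=True adds the "All" subtotals.
counts = pd.crosstab(df["day"], df["smoker"], margins=True)
```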

## IX. Time Series

### 1. Python Built-ins
    1. module: datetime
    - class: 
        - datetime: stores both date and time
            - construction: datetime(year, month, day, ...)
        - timedelta: represents the difference between two datetime values
            - construction: timedelta(days, seconds, ...)
        - date: store calendar date (y,m,d) using gregorian calendar
        - time: store time of day as hours, minutes, seconds, microseconds
        - tzinfo: base type for storing time zone information
    - methods:
        - now: the current time
            - Attributes:
                - year
                - month
                - day
                - ...
        - datetime.strftime: convert datetime to the specified string format using datetime format specification
        - datetime.strptime: parse a string into a datetime using the given format specification
    - datetime format specification:
        - %Y Four-digit year
        - %y Two-digit year
        - %m Two-digit month [01, 12]
        - %d Two-digit day [01, 31]
        - %H Hour (24-hour clock) [00, 23]
        - %I Hour (12-hour clock) [01, 12]
        - %M Two-digit minute [00, 59]
        - %S Second [00, 61] (seconds 60, 61 account for leap seconds)
        - %w Weekday as integer [0 (Sunday), 6]
        - %U Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0”
        - %W Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0”
        - %z UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
        - %F Shortcut for %Y-%m-%d (e.g., 2012-04-18)
        - %D Shortcut for %m/%d/%y (e.g., 04/18/12)
    2. module: dateutil.parser
        - functions:
            - parse: parse the data from string format to datetime format, using common string representation
                - dayfirst: if True, the day is read as preceding the month in ambiguous dates
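
A brief standard-library sketch of the datetime items above; the dates are arbitrary.

```python
from datetime import datetime, timedelta

stamp = datetime(2011, 1, 7, 14, 30)

# datetime -> string, and string -> datetime, using format codes.
text = stamp.strftime("%Y-%m-%d %H:%M")
parsed = datetime.strptime("2011-01-03", "%Y-%m-%d")

# Subtracting two datetimes gives a timedelta; adding one shifts a datetime.
delta = stamp - parsed
later = parsed + timedelta(days=12)
```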
    
### 2. pandas Time Series
    - functions:
        - to_datetime: parse common string representations; a list of strings yields a DatetimeIndex object
        - date_range: like np.arange, but for datetime object
            - start: the start, lower bound
            - end: the end, upper bound
            - periods: number of timestamps to generate
            - freq: frequency
                - using date offsets by importing from pandas.tseries.offsets import Hour, Minute, Day, ... or using aliases
        - period_range: creating range of period objects, returns PeriodIndex object
    - methods:
        - truncate: slice the data between two dates
        - resample: convert the time series to a fixed frequency, ex: 'D' is daily frequency
        - (Timestamp method) normalize: set the time component to midnight
        - to_period: convert to PeriodIndex or Period. For example: if there are three timestamps in the same month and we convert to monthly periods, each timestamp becomes the corresponding monthly period
        - to_timestamp: convert to Timestamp or DatetimeIndex
    - class:
        - Timestamp: a substitute for the datetime type
            - tz: time zone, by default None
            - Timestamp('7/2/2019') is read month-first (July 2, 2019), the American convention, so writing the year first is safer
        - DatetimeIndex: Index data of datetime type
        - Period: represents a time span such as a day, a month, or a year
        - PeriodIndex: Index of Period values
    - accessing:
        - ts[datetime(year, month, day)]
        - ts[string_format], ex: ts['21/09/1997']
        - ts['year-month-day'], ts['year-month'], or ts['year']: a partial date string selects the whole month or year
        - ts.truncate: slicing the ts
            - before: lower bound
            - after: upper bound
        - ts.loc[DatetimeIndex]
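
A minimal sketch of building and slicing a daily time series; the dates and values are arbitrary.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6), index=idx)

one_day = ts["2000-01-03"]                                    # full date: one value
window = ts.truncate(before="2000-01-02", after="2000-01-04")  # both ends included

# A list of strings parses into a DatetimeIndex.
parsed = pd.to_datetime(["2011-07-06", "2011-08-06"])

# Every timestamp in January 2000 maps to the same monthly period.
monthly = ts.to_period("M")
```
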
### 3. Time Zone Handling
    - modules: pytz
    - class: 
        - timezone: create timezone object
        - common_timezones: list of common time zones
    - methods:
        - tz_localize: setting the timezone, ex: series.tz_localize('UTC')
        - tz_convert: convert the timezone
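
Localizing and converting can be sketched like this, assuming the time zone database is available to pandas. The timestamps are arbitrary.

```python
import pandas as pd

naive = pd.Series(
    [1, 2],
    index=pd.date_range("2012-03-09 09:30", periods=2, freq="D"),
)

# tz_localize attaches a zone to naive timestamps without shifting them.
utc = naive.tz_localize("UTC")

# tz_convert expresses the same instants in another zone.
new_york = utc.tz_convert("America/New_York")
```
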
### 4. Periods and Period Arithmetic
    - asfreq: convert a Period or PeriodIndex to another frequency
    - ts.to_period: convert the DatetimeIndex to PeriodIndex
### 5. Resampling and Frequency Conversion
    1. Resampling with Timestamps
    - resample (method): think about groupby method!
        - rule: takes freq value
        - kind: specifies the index type, if period the index will be PeriodIndex, if timestamp the index will be DatetimeIndex
        - closed: defines which side of each bin is inclusive, ex: with 'left' and a 5min freq, the 00:00 value falls in the 00:00 to 00:05 bin
        - label: which bin edge is used as the label; 'right' means the right boundary is the label
        - loffset: shift the index by the specified offset (removed in pandas 2.0; add the offset to the result index instead)
        - convention: 'start' or 'end', used when resampling periods
    - aggregation:
        - ohlc: returns dataframe containing open (first value in a group), high (maximum value within a group), low (minimum value within a group), close (last value in a group)
    - using resample for upsampling:
        - ex (from weekly frequency): frame.resample('D').asfreq(); asfreq converts to the new frequency without any aggregation
        - frame.resample('D').ffill(): ffill fills the missing values with the previous non-null value
            - parameters: limit=how many consecutive missing values to fill
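
A sketch of downsampling and upsampling with `resample`. The one-minute and weekly series are invented; the frequency strings use the `min` spelling.

```python
import numpy as np
import pandas as pd

# Twelve one-minute observations with values 0..11.
ts = pd.Series(
    np.arange(12),
    index=pd.date_range("2000-01-01", periods=12, freq="min"),
)

# Downsampling: default bins are closed and labelled on the left.
five = ts.resample("5min").sum()
right = ts.resample("5min", closed="right", label="right").sum()
bars = ts.resample("5min").ohlc()

# Upsampling weekly data to daily, forward filling at most two days.
weekly = pd.Series(
    [1.0, 2.0],
    index=pd.date_range("2000-01-05", periods=2, freq="W-WED"),
)
daily = weekly.resample("D").ffill(limit=2)
```
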
    
    2. Resampling with Periods: same thing
### 6. Moving Window Functions
    - methods:
        - rolling: create a moving window that spans the specified number of samples, aggregates them, then moves forward one step. Works like groupby
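
A minimal rolling-window sketch. `min_periods` is not mentioned in the outline above and is shown only to illustrate how the leading NaN values can be avoided.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A full window of 3 is required, so the first two results are NaN.
rolled = s.rolling(window=3).mean()

# With min_periods=1 a partial window is accepted at the start.
relaxed = s.rolling(window=3, min_periods=1).mean()
```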

## X. Advanced pandas
    1. Categorical data
    - methods:
        - take (for series or dataframe in general): return the elements at the given positional indices; given a Series of categories and an array of integer codes, categories.take(codes) restores the original values
        - as_ordered (for categorical type of data): convert an unordered (nominal) categorical to an ordered (ordinal) one
        - set_categories (data.cat): replace the categories with the specified list of new categories; can be used to add or remove categories
        - remove_unused_categories: removing categories that are not present in data
        - add_categories: append new (unused) categories at end of existing categories
        - as_unordered: negation of as_ordered
        - rename_categories: replace categories with indicated set of new category names
    - functions:
        - pd.get_dummies: make dummy variable
    - class:
        - category: pd.Categorical (pandas.core.categorical.Categorical in older releases)
            - convert to category: data.astype('category')
            - attributes:
                - categories
                - codes
            - creation: 
                - pd.Categorical
                - pd.Categorical.from_codes:
                    - codes
                    - categories
                    - ordered: True if categorical type ordinal
    - Others:
        - accessing codes attribute in a Series or DataFrame, use data.cat.codes
        - accessing categories attribute in a Series or DataFrame, use data.cat.categories
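
A small sketch of the categorical items above; the fruit data and the `low`/`high` categories are made up.

```python
import pandas as pd

fruit = pd.Series(["apple", "orange", "apple", "apple"]).astype("category")

# The .cat accessor exposes the integer codes and the category labels.
codes = fruit.cat.codes
categories = fruit.cat.categories

# Building a categorical directly from codes plus categories.
built = pd.Categorical.from_codes([0, 1, 0], categories=["low", "high"], ordered=True)

# Adding an unused category, then trimming it away again.
extended = fruit.cat.add_categories(["pear"])
trimmed = extended.cat.remove_unused_categories()
```
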
    2. Groupby advanced
        - transform: takes a function, ex: g.transform(lambda x: x.mean()). The data is split into groups and the function is applied to each; in the combine step the result is broadcast back so that every group member receives its group's mean
        - in the time series case, we can group by time intervals the way resample does: create pd.Grouper(level=0, freq='5T') and pass it to the groupby method (newer releases spell the alias '5min')
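
A minimal sketch of `transform` and of time-based grouping with `pd.Grouper`. The frames are hypothetical, and the grouper here uses `key='time'` on a column rather than `level=0` on the index.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# transform returns a result aligned with the original rows:
# each row receives the mean of its own group.
group_mean = df.groupby("key")["value"].transform("mean")
demeaned = df["value"] - group_mean

# Grouping a time column into 5-minute buckets.
tdf = pd.DataFrame({
    "time": pd.date_range("2017-05-20 00:00", periods=6, freq="min"),
    "value": range(6),
})
per_5min = tdf.groupby(pd.Grouper(key="time", freq="5min"))["value"].sum()
```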