# Pandas Reference Notebook

## Table of Contents

* [Reading Colab Files](#scrollTo=hw6bhQdm_oCL&line=1&uniqifier=1)
* [Pandas DataFrame DataTypes](#scrollTo=Af0gdIh7J0bW&line=1&uniqifier=1)
* [DataFrame Attributes](#scrollTo=xSrwXTM5Mzn6&line=1&uniqifier=1)
* [Series Attributes](#scrollTo=52QdYIaWNLbe&line=1&uniqifier=1)
* [Indexing and Selecting Data](#scrollTo=nAKrgnkVNmgU&line=1&uniqifier=1)
* [Filtering Pandas DataFrames](#scrollTo=NfZe9ifCTT2i&line=1&uniqifier=1)
* [Merge, Join, Concatenate, and Compare](#scrollTo=pEeYgY-hcb_Y&line=1&uniqifier=1)
* [Reshaping and Pivot Tables](#scrollTo=wYZNwfGPqndX&line=1&uniqifier=1)
* [Handling Missing Data and Duplicates](#scrollTo=ct6VH7saVzg-&line=10&uniqifier=1)
* [Applying Functions](#scrollTo=LKz1N_qn7Jik&line=1&uniqifier=1)
* [Window Operations](#scrollTo=ZLlakPX-8k5b&line=7&uniqifier=1)
* Ranking and Ordering
* [String Operations](#scrollTo=Xr63tB1CS09T&line=38&uniqifier=1)
* Working with Time Series Data: See Obsidian Note: Pandas - Time Series Data
* DataFrame Options and Settings
* Sparse Data Structures



References:
[Linking to Sections in Colab](https://stackoverflow.com/questions/64027534/how-to-use-link-to-cell-in-colab)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

## Google Drive: Mounting Google Drive locally

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Reading Files from Google Drive

In [None]:
path = '/content/drive/MyDrive/Colab/Datasets/shopping_behavior_updated.csv'
# path = 'data/Ecommerce_Consumer_Behavior_Analysis_Data.csv' # USE THIS PATH ON LOCAL
df = pd.read_csv(path)
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,3/15/2024,True,True,Impulsive,No Preference,3
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31,6,Mixed,3,1,1.0,High,Somewhat Sensitive,0,1,,Smartphone,Other,10/4/2024,True,True,Need-based,Express,10
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,1/30/2024,False,False,Wants-based,No Preference,4


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Pandas DataFrame Data Types

| Method/Attribute     | Description                                    | Python Example                              | Parameters                                                                                                                                                            | Documentation                                                                                            |
| -------------------- | ---------------------------------------------- | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `df.dtypes`          | Returns the data type of each column           | `df.dtypes`                                 | _(property)_ – no parameters                                                                                                                                          |                                                                                                          |
| `df.select_dtypes()` | Select columns based on data type              | `df.select_dtypes(include='number')`        | `include`: scalar or list-like  <br>`exclude`: scalar or list-like                                                                                                    | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)        |
| `df.astype()`        | Cast a column (or columns) to a specific dtype | `df['col'] = df['col'].astype('int')`       | `dtype`: data type, or dict of column to dtype  <br>`copy`: bool, default `True`  <br>`errors`: {‘raise’, ‘ignore’}, default ‘raise’                                  | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)               |
| `pd.to_numeric()`    | Convert argument to a numeric type             | `pd.to_numeric(df['col'], errors='coerce')` | `arg`: scalar, list, Series  <br>`errors`: {‘raise’, ‘coerce’, ‘ignore’}, default ‘raise’ (‘ignore’ is deprecated since pandas 2.2)  <br>`downcast`: {‘integer’, ‘signed’, ‘unsigned’, ‘float’}, default None | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html#pandas.to_numeric)   |
| `pd.to_datetime()`   | Convert to datetime                            | `pd.to_datetime(df['date'])`                | `arg`: string, datetime, list, Series  <br>`format`: str, optional  <br>`errors`: {‘raise’, ‘coerce’, ‘ignore’}, default ‘raise’ (‘ignore’ is deprecated since pandas 2.2)  <br>`utc`: bool, default False  <br>`dayfirst`: bool, default False | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) |
| `pd.to_timedelta()`  | Convert to timedelta type                      | `pd.to_timedelta(df['duration'])`           | `arg`: str, timedelta, list, Series  <br>`unit`: str, optional  <br>`errors`: {‘raise’, ‘coerce’, ‘ignore’}, default ‘raise’ (‘ignore’ is deprecated since pandas 2.2) | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html)                   |

In [None]:
df.dtypes

Unnamed: 0,0
Customer_ID,object
Age,int64
Gender,object
Income_Level,object
Marital_Status,object
Education_Level,object
Occupation,object
Location,object
Purchase_Category,object
Purchase_Amount,object


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 28 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Customer_ID                            1000 non-null   object 
 1   Age                                    1000 non-null   int64  
 2   Gender                                 1000 non-null   object 
 3   Income_Level                           1000 non-null   object 
 4   Marital_Status                         1000 non-null   object 
 5   Education_Level                        1000 non-null   object 
 6   Occupation                             1000 non-null   object 
 7   Location                               1000 non-null   object 
 8   Purchase_Category                      1000 non-null   object 
 9   Purchase_Amount                        1000 non-null   object 
 10  Frequency_of_Purchase                  1000 non-null   int64  
 11  Purch

In [None]:
df.select_dtypes(include='object').columns

Index(['Customer_ID', 'Gender', 'Income_Level', 'Marital_Status',
       'Education_Level', 'Occupation', 'Location', 'Purchase_Category',
       'Purchase_Amount', 'Purchase_Channel', 'Social_Media_Influence',
       'Discount_Sensitivity', 'Engagement_with_Ads',
       'Device_Used_for_Shopping', 'Payment_Method', 'Time_of_Purchase',
       'Purchase_Intent', 'Shipping_Preference'],
      dtype='object')
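
The output above shows `Purchase_Amount` and `Time_of_Purchase` stored as `object`. A minimal sketch of converting them with the functions from the table, using a small hand-built frame (`sample` is a hypothetical stand-in whose values mirror the first rows shown earlier, and the month/day/year format is an assumption based on those values):

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the dataset.
sample = pd.DataFrame({
    "Purchase_Amount": ["$333.80", "$222.22", "$426.22"],
    "Time_of_Purchase": ["3/1/2024", "4/16/2024", "3/15/2024"],
})

# Strip the currency symbol, then convert to a numeric dtype.
# errors="coerce" turns anything unparseable into NaN instead of raising.
amount = pd.to_numeric(
    sample["Purchase_Amount"].str.replace("$", "", regex=False),
    errors="coerce",
)

# The dates look like month/day/year, so the format is stated explicitly.
purchased = pd.to_datetime(sample["Time_of_Purchase"], format="%m/%d/%Y")

# assign() returns a new frame and leaves `sample` untouched.
converted = sample.assign(Purchase_Amount=amount, Time_of_Purchase=purchased)
print(converted.dtypes)
```

Once the column is a real datetime, accessors such as `.dt.year` become available for filtering.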

[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# DataFrame Attributes

| Attribute    | Description                                    |
| ------------ | ---------------------------------------------- |
| `df.shape`   | Tuple representing (rows, columns)             |
| `df.columns` | Column labels as Index                         |
| `df.index`   | Index object of the DataFrame                  |
| `df.ndim`    | Number of dimensions (always 2 for DataFrames) |
| `df.size`    | Total number of elements                       |
| `df.values`  | NumPy array representation of the DataFrame    |
| `df.dtypes`  | Data types of columns                          |
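
A quick illustration of these attributes on a tiny frame (`small` is a made-up example, not the notebook's dataset). The pandas documentation recommends `to_numpy()` over `.values`, so the sketch uses that for the array conversion:

```python
import pandas as pd

# Made-up three-row frame for illustration.
small = pd.DataFrame({"Age": [22, 49, 24], "Gender": ["Female", "Male", "Female"]})

print(small.shape)          # (3, 2)
print(small.ndim)           # 2
print(small.size)           # 6
print(list(small.columns))  # ['Age', 'Gender']

# Array form of the data; mixed column types produce an object array.
values = small.to_numpy()
```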

[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Series Attributes

| Attribute  | Description           |
| ---------- | --------------------- |
| `s.index`  | Index of the Series   |
| `s.dtype`  | Data type of Series   |
| `s.shape`  | Tuple of dimensions, e.g. `(n,)` |
| `s.size`   | Number of elements    |
| `s.values` | NumPy array of values |
| `s.name`   | Name of the Series    |
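
The same attributes on a Series, sketched with a made-up `ages` Series:

```python
import pandas as pd

# Made-up Series for illustration.
ages = pd.Series([22, 49, 24], name="Age")

print(ages.shape)  # (3,)
print(ages.size)   # 3
print(ages.name)   # Age
print(ages.dtype)  # int64
```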

[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Indexing and Selecting Data

| Method/Attribute | Description                               | Python Example                | Parameters                                                                               | Documentation                                                                                                  |
| ---------------- | ----------------------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| `df.loc[]`       | Label-based indexing for rows and columns | `df.loc[2, 'col']`            | `row_labels`: single label or list  <br>`column_labels`: single label or list (optional) | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)                        |
| `df.iloc[]`      | Integer-location based indexing           | `df.iloc[2, 1]`               | `row_indices`: int or list  <br>`column_indices`: int or list (optional)                 | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc) |
| `df.at[]`        | Access a single value by label            | `df.at[2, 'col']`             | `row_label`: scalar  <br>`column_label`: scalar                                          | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html#pandas.DataFrame.at)     |
| `df.iat[]`       | Access a single value by integer position | `df.iat[2, 1]`                | `row_index`: int  <br>`column_index`: int                                                | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iat.html#pandas.DataFrame.iat)   |
| `df[]`           | Access single column or slice             | `df['col']`                   | `key`: str (column name) or slice                                                        |                                                                                                                |
| `df.get()`       | Get column with fallback                  | `df.get('col', default=None)` | `key`: str  <br>`default`: value to return if not found                                  | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.get.html)                        |
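
The scalar accessors and `get()` have no worked example in this section, so here is a minimal sketch on a two-row frame (`small` and the column name `Not_A_Column` are hypothetical):

```python
import pandas as pd

# Two rows copied from the dataset preview, for illustration only.
small = pd.DataFrame(
    {"Customer_ID": ["37-611-6911", "29-392-9296"], "Age": [22, 49]}
)

first_age = small.at[0, "Age"]  # single value by row label and column label
same_age = small.iat[0, 1]      # single value by row position and column position

# get() returns the default instead of raising KeyError for a missing column.
missing = small.get("Not_A_Column", default="n/a")
```

`at` and `iat` only accept scalar keys, which is what distinguishes them from `loc` and `iloc`.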

## Using .loc

In [None]:
df.loc[0, 'Customer_ID']

'37-611-6911'

In [None]:
df.loc[0:5, ['Customer_ID', 'Age', 'Gender']]

Unnamed: 0,Customer_ID,Age,Gender
0,37-611-6911,22,Female
1,29-392-9296,49,Male
2,84-649-5117,24,Female
3,48-980-6078,29,Female
4,91-170-9072,33,Female
5,82-561-4233,45,Male


In [None]:
df.loc[0:5, 'Customer_ID':'Purchase_Amount']

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70
5,82-561-4233,45,Male,Middle,Married,Master's,High,Boro Utara,Office Supplies,$487.95


In [None]:
df.loc[0:5]

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,3/15/2024,True,True,Impulsive,No Preference,3
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31,6,Mixed,3,1,1.0,High,Somewhat Sensitive,0,1,,Smartphone,Other,10/4/2024,True,True,Need-based,Express,10
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,1/30/2024,False,False,Wants-based,No Preference,4
5,82-561-4233,45,Male,Middle,Married,Master's,High,Boro Utara,Office Supplies,$487.95,8,Mixed,3,3,0.0,High,Not Sensitive,2,3,,Tablet,Debit Card,3/19/2024,False,False,Planned,No Preference,7


### .loc with Conditions

In [None]:
df.loc[df['Age'] > 30].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6


In [None]:
df.loc[df['Age'] > 30, ['Customer_ID']]

Unnamed: 0,Customer_ID
1,29-392-9296
4,91-170-9072
5,82-561-4233
7,88-661-4689
10,44-674-4037
...,...
989,54-238-5459
991,48-271-1908
994,08-185-6608
995,20-562-2569


In [None]:
df.loc[(df['Age'] > 30) & (df['Customer_Satisfaction'] > 8)].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,1/30/2024,False,False,Wants-based,No Preference,4


In [None]:
df.loc[lambda df: df['Age'] == 32].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
13,80-684-5072,32,Male,High,Married,High School,Middle,Rokytne,Animal Feed,$79.81,4,Mixed,5,5,0.0,High,Not Sensitive,0,9,High,Smartphone,Debit Card,7/16/2024,False,True,Wants-based,Standard,14


## Using .iloc

In [None]:
df.iloc[0]

Unnamed: 0,0
Customer_ID,37-611-6911
Age,22
Gender,Female
Income_Level,Middle
Marital_Status,Married
Education_Level,Bachelor's
Occupation,Middle
Location,Évry
Purchase_Category,Gardening & Outdoors
Purchase_Amount,$333.80


In [None]:
df.iloc[0, 0]

'37-611-6911'

In [None]:
df.iloc[[0, 1]]

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6


In [None]:
df.iloc[:3]

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,3/15/2024,True,True,Impulsive,No Preference,3


In [None]:
df.iloc[lambda x: x.index % 2 == 0].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2


In [None]:
df.iloc[[0, 2], [1, 3]]

Unnamed: 0,Age,Income_Level
0,22,Middle
2,24,Middle


In [None]:
df.iloc[0:3, 0:3]

Unnamed: 0,Customer_ID,Age,Gender
0,37-611-6911,22,Female
1,29-392-9296,49,Male
2,84-649-5117,24,Female
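
Note that `df.loc[0:5]` earlier returned six rows while `df.iloc[0:3]` returned three: label slices include the end label, position slices exclude the end position. A small sketch with a non-default index (the `scores` frame is hypothetical) makes the difference easier to see:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
scores = pd.DataFrame({"score": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = scores.loc[100:102]  # labels 100, 101, 102 (end label included)
by_position = scores.iloc[0:2]  # positions 0 and 1 (end position excluded)

print(len(by_label), len(by_position))  # 3 2
```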


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Filtering Pandas DataFrames

| Method/Attribute | Description                               | Python Example                | Parameters                                                                                               | Documentation                                                                                                                                                        |
| ---------------- | ----------------------------------------- | ----------------------------- | -------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Boolean Masking  | Filter rows based on condition            | `df[df['col'] > 10]`          | _(standard boolean expression)_                                                                          |                                                                                                                                                                      |
| `df.where()`     | Return matching values, set others to NaN | `df.where(df['col'] > 10)`    | `cond`: boolean condition  <br>`other`: replacement  <br>`inplace`: bool  <br>`axis`: int or str         | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html)<br>[Example](https://www.geeksforgeeks.org/python-pandas-dataframe-where/) |
| `df.query()`     | Query with a string expression            | `df.query('col > 10')`        | `expr`: str expression  <br>`inplace`: bool  <br>`engine`: {'python', 'numexpr'}  <br>`local_dict`: dict | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)                                                                            |
| `df.filter()`    | Subset columns/rows by labels             | `df.filter(items=['A', 'B'])` | `items`: list-like  <br>`like`: str  <br>`regex`: str  <br>`axis`: {0, 1}                                | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html)                                                                           |

In [None]:
mask = df['Age'] > 35
df[mask].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6


In [None]:
df[(df['Age'] > 35) & (df['Customer_Satisfaction'] > 8)].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
7,88-661-4689,39,Male,Middle,Single,High School,Middle,Taocheng,Books,$218.06,6,Online,5,4,1.0,Low,Somewhat Sensitive,2,9,,Desktop,Credit Card,3/17/2024,False,True,Impulsive,No Preference,13


In [None]:
mask = (df['Age'] > 35) & (df['Customer_Satisfaction'] > 8)
df[mask].head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
7,88-661-4689,39,Male,Middle,Single,High School,Middle,Taocheng,Books,$218.06,6,Online,5,4,1.0,Low,Somewhat Sensitive,2,9,,Desktop,Credit Card,3/17/2024,False,True,Impulsive,No Preference,13


In [None]:
# Parse the month/day/year strings first so the .dt accessor is available
purchase_dates = pd.to_datetime(df['Time_of_Purchase'], format='%m/%d/%Y')
mask_2024 = purchase_dates.dt.year == 2024
df[mask_2024].head(1)

`df.where()`
<br>
Selecting values from a Series with a boolean mask generally returns only the matching subset. To guarantee that the output has the same shape as the original data, use the `where` method of Series and DataFrame: entries that fail the condition are replaced with NaN (or with `other`, if given).
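
A minimal sketch contrasting boolean masking with `where` on a made-up `ages` Series:

```python
import pandas as pd

# Made-up Series for illustration.
ages = pd.Series([22, 49, 24, 29], name="Age")

subset = ages[ages > 25]                 # masking: only the matching rows remain
same_shape = ages.where(ages > 25)       # where: same length, non-matches become NaN
filled = ages.where(ages > 25, other=0)  # where: non-matches replaced with 0

print(len(subset), len(same_shape))  # 2 4
```

Because NaN is a float, `same_shape` is upcast to a float dtype, while `filled` can stay integer.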

In [None]:
df.query('Age > 40').head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6


In [None]:
df.query('Age == 49').head(1)

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
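
`query()` can also reference Python variables with the `@` prefix and combine conditions with `and`/`or`. A short sketch on a made-up frame (`people` and `min_age` are hypothetical names):

```python
import pandas as pd

# Made-up frame reusing two column names from the dataset.
people = pd.DataFrame(
    {"Age": [22, 49, 24, 39], "Customer_Satisfaction": [7, 5, 7, 9]}
)

min_age = 35  # hypothetical threshold, referenced inside the query with @

older = people.query("Age > @min_age")
older_and_happy = people.query("Age > @min_age and Customer_Satisfaction > 8")
```

This is equivalent to the boolean-mask version `people[(people['Age'] > min_age) & (people['Customer_Satisfaction'] > 8)]`, without the extra parentheses.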


In [None]:
df.filter(items=['Customer_ID', 'Age', 'Gender']).head(1)

Unnamed: 0,Customer_ID,Age,Gender
0,37-611-6911,22,Female


In [None]:
df.filter(like='Pur').head(1)

Unnamed: 0,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Time_of_Purchase,Purchase_Intent
0,Gardening & Outdoors,$333.80,4,Mixed,3/1/2024,Need-based


In [None]:
df.filter(regex='e$', axis=1).head(1)

Unnamed: 0,Age,Frequency_of_Purchase,Social_Media_Influence,Return_Rate,Time_of_Purchase,Shipping_Preference
0,22,4,,1,3/1/2024,No Preference


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Merge, Join, Concatenate, and Compare

| Method/Attribute | Description                                | Python Example                 | Parameters                                                                                                                    | Documentation                                                                                                  |
| ---------------- | ------------------------------------------ | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| `pd.merge()`     | Merge two DataFrames on key columns        | `pd.merge(df1, df2, on='key')` | `left`, `right`, `how`, `on`, `left_on`, `right_on`, `left_index`, `right_index`, `sort`, `suffixes`, `copy`, `indicator`, `validate` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)                      |
| `df.join()`      | Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.        | `df1.join(df2)`                | `other`, `on`, `how`, `lsuffix`, `rsuffix`, `sort`, `validate`                                                                | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) |
| `pd.concat()`    | Concatenate DataFrames along axis          | `pd.concat([df1, df2])`        | `objs`, `axis`, `join`, `ignore_index`, `keys`, `levels`, `names`, `verify_integrity`, `sort`, `copy`                         | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)                               |
| `df.compare()`   | Compare differences between two DataFrames | `df1.compare(df2)`             | `other`, `align_axis`, `keep_shape`, `keep_equal`                                                                             | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html)                    |

In [None]:
df_merge1 = df.iloc[:5, :5]
df_merge2 = df.iloc[:5, :5]

df_merge1.merge(df_merge2, on='Customer_ID', how='left', suffixes=('', '_right'))

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Age_right,Gender_right,Income_Level_right,Marital_Status_right
0,37-611-6911,22,Female,Middle,Married,22,Female,Middle,Married
1,29-392-9296,49,Male,High,Married,49,Male,High,Married
2,84-649-5117,24,Female,Middle,Single,24,Female,Middle,Single
3,48-980-6078,29,Female,Middle,Single,29,Female,Middle,Single
4,91-170-9072,33,Female,Middle,Widowed,33,Female,Middle,Widowed
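
The merge above joins a frame to a copy of itself, so every key matches. A sketch with partially overlapping keys shows what `how='outer'`, `indicator`, and `validate` add (the `left` and `right` frames and their IDs are made up):

```python
import pandas as pd

# Made-up frames that share only some Customer_ID values.
left = pd.DataFrame({"Customer_ID": ["A", "B", "C"], "Age": [22, 49, 24]})
right = pd.DataFrame(
    {"Customer_ID": ["B", "C", "D"], "Gender": ["Male", "Female", "Male"]}
)

merged = pd.merge(
    left,
    right,
    on="Customer_ID",
    how="outer",            # keep keys from both sides
    indicator=True,         # adds a _merge column: left_only / right_only / both
    validate="one_to_one",  # raises MergeError if keys repeat on either side
)
print(merged)
```

Filtering on `_merge` is a convenient way to find rows that failed to match.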


In [None]:
# joining on indices only
df_join1 = df.iloc[:5, :5]
df_join2 = df.iloc[:3, :5]

df_join1.join(df_join2, lsuffix='_left', rsuffix='_right')

Unnamed: 0,Customer_ID_left,Age_left,Gender_left,Income_Level_left,Marital_Status_left,Customer_ID_right,Age_right,Gender_right,Income_Level_right,Marital_Status_right
0,37-611-6911,22,Female,Middle,Married,37-611-6911,22.0,Female,Middle,Married
1,29-392-9296,49,Male,High,Married,29-392-9296,49.0,Male,High,Married
2,84-649-5117,24,Female,Middle,Single,84-649-5117,24.0,Female,Middle,Single
3,48-980-6078,29,Female,Middle,Single,,,,,
4,91-170-9072,33,Female,Middle,Widowed,,,,,


In [None]:
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2], ignore_index=True)

Unnamed: 0,0
0,a
1,b
2,c
3,d


In [None]:
pd.concat([s1, s2], keys=['s1', 's2'])

Unnamed: 0,Unnamed: 1,0
s1,0,a
s1,1,b
s2,0,c
s2,1,d


In [None]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
print(df1)
df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
print(df2)

  letter  number
0      a       1
1      b       2
  letter  number
0      c       3
1      d       4


In [None]:
pd.concat([df1, df2])

Unnamed: 0,letter,number
0,a,1
1,b,2
0,c,3
1,d,4


In [None]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)

pd.concat([df1, df4], axis=1)

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7
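
`df.compare()` is listed in the table but has no example here. A minimal sketch, assuming two frames with identical labels (`before` and `after` are made-up names):

```python
import pandas as pd

# Made-up frames: `after` differs from `before` in a single cell.
before = pd.DataFrame({"Age": [22, 49, 24], "Gender": ["Female", "Male", "Female"]})
after = before.copy()
after.loc[1, "Age"] = 50

# Only rows and columns containing differences are kept by default;
# the columns get a second level with 'self' and 'other'.
diff = before.compare(after)
print(diff)
```

Pass `keep_shape=True` to retain all rows and columns, or `keep_equal=True` to show equal values instead of NaN.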


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

# Reshaping and Pivot Tables

| Method/Attribute    | Description                                                      | Python Example                                                      | Parameters                                                                                                        | Documentation                                                                                                                                                                        | Notes                                                                                                                                                                                                                                                                                                                                         |
| ------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `df.pivot()`        | Reshape data by column values                                    | `df.pivot(index='id', columns='year', values='value')`              | `index`: str or object  <br>`columns`: str or object  <br>`values`: str, optional                                 | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html)                                                                                            | Does not support **aggregation**; passing multiple `values` columns results in a MultiIndex in the columns. Use `df.pivot_table()` when index/column pairs repeat.                                                                                                                                                                            |
| `df.pivot_table()`  | Create a pivot table with aggregation                            | `df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')` | `values`, `index`, `columns`, `aggfunc`, `fill_value`, `margins`, `dropna`, `margins_name`, `observed`, `sort`    | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table)                                                         | Supports aggregation                                                                                                                                                                                                                                                                                                                          |
| `df.stack()`        | Stack columns into multi-level index rows                        | `df.stack()`                                                        | `level`: int, str, list, default -1  <br>`dropna`: bool, default True                                             | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html)                                                                                            | Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame                                                                                                                                                                                                     |
| `df.unstack()`      | Unstack row index to columns                                     | `df.unstack()`                                                      | `level`: int, str, or list, optional  <br>`fill_value`: scalar, optional                                          | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html#pandas.DataFrame.unstack)                                                                 |                                                                                                                                                                                                                                                                                                                                               |
| `df.melt()`         | Unpivot from wide to long format                                 | `df.melt(id_vars=['A'], value_vars=['B', 'C'])`                     | `id_vars`, `value_vars`, `var_name`, `value_name`, `col_level`, `ignore_index`                                    | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt)                                                                       | Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.                                                                                                                                                                                                                                                             |
| `df.explode()`      | Transform list-like elements in column to separate rows          | `df.explode('col')`                                                 | `column`: str or tuple  <br>`ignore_index`: bool, default False                                                   | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html)<br>[Example](https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-explode) | This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets. |
| `pd.crosstab()`     | Compute a cross-tabulation of two (or more) factors              | `pd.crosstab(df['A'], df['B'])`                                     | `index`, `columns`, `values`, `aggfunc`, `rownames`, `colnames`, `margins`, `margins_name`, `dropna`, `normalize` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html)                                                                                                   |                                                                                                                                                                                                                                                                                                                                               |
| `pd.cut()`          | Bin values into discrete intervals                               | `pd.cut(df['col'], bins=3)`                                         | `x`, `bins`, `right`, `labels`, `retbins`, `precision`, `include_lowest`, `duplicates`, `ordered`                 | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.cut.html)                                                                                                        | Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.                                   |
| `pd.factorize()`    | Encode the object as an enumerated type or categorical variable  | `pd.factorize(df['col'])`                                           | `values`, `sort`, `use_na_sentinel`, `size_hint`                                                                  | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html)                                                                                                  | `codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype="O"), sort=True)`                                                                                                                                                                                                                                                    |
| `pd.get_dummies()`  | Convert categorical variable into dummy/indicator variables      | `pd.get_dummies(df, columns=['col'])`                               | `data`, `prefix`, `prefix_sep`, `dummy_na`, `columns`, `sparse`, `drop_first`, `dtype`                            | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)                                                                                                |                                                                                                                                                                                                                                                                                                                                               |
| `pd.from_dummies()` | Convert dummy indicator DataFrame back to categorical series     | `pd.from_dummies(df)`                                               | `data`, `prefix_sep`, `dtype`                                                                                     | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.from_dummies.html)                                                                                               |                                                                                                                                                                                                                                                                                                                                               |
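
The difference between `df.pivot()` and `df.pivot_table()` noted in the table can be sketched on a small frame. The `sales` data and its column names are hypothetical and not part of the shopping dataset; the point is only that `pivot()` refuses duplicate index/column pairs while `pivot_table()` aggregates them.

```python
import pandas as pd

# Hypothetical toy data: store 'A' has two rows for 2023 (a duplicate store/year pair)
sales = pd.DataFrame({
    "store": ["A", "A", "A", "B"],
    "year": [2023, 2023, 2024, 2023],
    "amount": [10, 5, 7, 3],
})

# pivot() only reshapes, so duplicate (store, year) pairs raise a ValueError
try:
    sales.pivot(index="store", columns="year", values="amount")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (sum here);
# fill_value replaces cells that have no data at all
table = sales.pivot_table(index="store", columns="year", values="amount",
                          aggfunc="sum", fill_value=0)
print(pivot_failed)
print(table)
```

If the data is known to have exactly one row per index/column pair, `pivot()` is the stricter choice because it fails loudly when that assumption breaks.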

In [None]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,3/1/2024,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,4/16/2024,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,3/15/2024,True,True,Impulsive,No Preference,3
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31,6,Mixed,3,1,1.0,High,Somewhat Sensitive,0,1,,Smartphone,Other,10/4/2024,True,True,Need-based,Express,10
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,1/30/2024,False,False,Wants-based,No Preference,4


### Pivot_Table

In [None]:
agg_func = ['mean', 'sum', 'min', 'max', 'count', 'std']
df.pivot_table(index=['Income_Level'], values='Frequency_of_Purchase', aggfunc=agg_func[0])

Unnamed: 0_level_0,Frequency_of_Purchase
Income_Level,Unnamed: 1_level_1
High,7.048544
Middle,6.835052


In [None]:
df.pivot_table(index=['Income_Level'], values='Frequency_of_Purchase', aggfunc=agg_func[0]).reset_index()

Unnamed: 0,Income_Level,Frequency_of_Purchase
0,High,7.048544
1,Middle,6.835052


In [None]:
agg_func = ['mean', 'sum', 'min', 'max', 'count', 'std']
df.pivot_table(index=['Gender'], columns=['Income_Level'], values='Customer_Satisfaction', aggfunc=agg_func[0])

Income_Level,High,Middle
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Agender,4.909091,6.25
Bigender,4.25,4.375
Female,5.109244,5.528037
Genderfluid,4.7,6.428571
Genderqueer,6.111111,7.333333
Male,5.392694,5.595652
Non-binary,5.75,7.25
Polygender,4.0,5.285714


In [None]:
df.pivot_table(index=['Gender', 'Marital_Status'], values='Frequency_of_Purchase', aggfunc=agg_func[0])

Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency_of_Purchase
Gender,Marital_Status,Unnamed: 2_level_1
Agender,Divorced,10.0
Agender,Married,6.0
Agender,Single,6.833333
Agender,Widowed,8.2
Bigender,Divorced,3.8
Bigender,Married,9.0
Bigender,Single,9.25
Bigender,Widowed,6.333333
Female,Divorced,6.921569
Female,Married,6.898305


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

In [None]:
df.pivot_table(index=['Gender', 'Marital_Status'], values='Frequency_of_Purchase', aggfunc=agg_func[0], margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency_of_Purchase
Gender,Marital_Status,Unnamed: 2_level_1
Agender,Divorced,10.0
Agender,Married,6.0
Agender,Single,6.833333
Agender,Widowed,8.2
Bigender,Divorced,3.8
Bigender,Married,9.0
Bigender,Single,9.25
Bigender,Widowed,6.333333
Female,Divorced,6.921569
Female,Married,6.898305


In [None]:
df.pivot_table(index=['Marital_Status'], values=['Frequency_of_Purchase', 'Customer_Satisfaction'], aggfunc=['mean'], margins=True)

Unnamed: 0_level_0,mean,mean
Unnamed: 0_level_1,Customer_Satisfaction,Frequency_of_Purchase
Marital_Status,Unnamed: 1_level_2,Unnamed: 2_level_2
Divorced,5.24898,6.955102
Married,5.644269,6.762846
Single,5.495868,7.136364
Widowed,5.211538,6.934615
All,5.399,6.945


In [None]:
df.pivot_table(index=['Marital_Status'], values=['Frequency_of_Purchase', 'Customer_Satisfaction'], aggfunc=['mean', 'sum'], margins=True)

Unnamed: 0_level_0,mean,mean,sum,sum
Unnamed: 0_level_1,Customer_Satisfaction,Frequency_of_Purchase,Customer_Satisfaction,Frequency_of_Purchase
Marital_Status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Divorced,5.24898,6.955102,1286,1704
Married,5.644269,6.762846,1428,1711
Single,5.495868,7.136364,1330,1727
Widowed,5.211538,6.934615,1355,1803
All,5.399,6.945,5399,6945


In [None]:
pivot_example = df.pivot_table(index=['Gender', 'Marital_Status'], values=['Frequency_of_Purchase', 'Customer_Satisfaction'], aggfunc='median')
pivot_example

Unnamed: 0_level_0,Unnamed: 1_level_0,Customer_Satisfaction,Frequency_of_Purchase
Gender,Marital_Status,Unnamed: 2_level_1,Unnamed: 3_level_1
Agender,Divorced,5.0,10.0
Agender,Married,6.0,6.0
Agender,Single,3.5,8.0
Agender,Widowed,4.0,8.0
Bigender,Divorced,4.0,4.0
Bigender,Married,5.0,9.0
Bigender,Single,2.5,10.0
Bigender,Widowed,4.0,6.0
Female,Divorced,5.0,7.0
Female,Married,6.0,7.0


In [None]:
pivot_example.index

MultiIndex([(    'Agender', 'Divorced'),
            (    'Agender',  'Married'),
            (    'Agender',   'Single'),
            (    'Agender',  'Widowed'),
            (   'Bigender', 'Divorced'),
            (   'Bigender',  'Married'),
            (   'Bigender',   'Single'),
            (   'Bigender',  'Widowed'),
            (     'Female', 'Divorced'),
            (     'Female',  'Married'),
            (     'Female',   'Single'),
            (     'Female',  'Widowed'),
            ('Genderfluid', 'Divorced'),
            ('Genderfluid',  'Married'),
            ('Genderfluid',   'Single'),
            ('Genderfluid',  'Widowed'),
            ('Genderqueer', 'Divorced'),
            ('Genderqueer',  'Married'),
            ('Genderqueer',   'Single'),
            ('Genderqueer',  'Widowed'),
            (       'Male', 'Divorced'),
            (       'Male',  'Married'),
            (       'Male',   'Single'),
            (       'Male',  'Widowed'),
            ( 'N

In [None]:
pivot_example.columns

Index(['Customer_Satisfaction', 'Frequency_of_Purchase'], dtype='object')

In [None]:
pivot_example['Customer_Satisfaction']

Unnamed: 0_level_0,Unnamed: 1_level_0,Customer_Satisfaction
Gender,Marital_Status,Unnamed: 2_level_1
Agender,Divorced,5.0
Agender,Married,6.0
Agender,Single,3.5
Agender,Widowed,4.0
Bigender,Divorced,4.0
Bigender,Married,5.0
Bigender,Single,2.5
Bigender,Widowed,4.0
Female,Divorced,5.0
Female,Married,6.0


In [None]:
pivot_example.loc['Male', 'Customer_Satisfaction']

Unnamed: 0_level_0,Customer_Satisfaction
Marital_Status,Unnamed: 1_level_1
Divorced,6.0
Married,6.0
Single,6.0
Widowed,5.0


In [None]:
pivot_example.loc[('Male', 'Married')]

Unnamed: 0_level_0,Male
Unnamed: 0_level_1,Married
Customer_Satisfaction,6.0
Frequency_of_Purchase,7.0


In [None]:
# Convert 'Time_of_Purchase' to datetime so the .dt accessor (year, month, day) is available
df['Time_of_Purchase'] = pd.to_datetime(df['Time_of_Purchase'])

# Mean satisfaction per marital status, with one column per purchase month
df.pivot_table(index=['Marital_Status'], columns=df['Time_of_Purchase'].dt.month, values=['Customer_Satisfaction'], aggfunc='mean')

Unnamed: 0_level_0,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction
Time_of_Purchase,1,2,3,4,5,6,7,8,9,10,11,12
Marital_Status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Divorced,5.666667,5.210526,4.809524,5.909091,5.375,5.809524,5.956522,5.0,5.259259,4.210526,4.25,5.8
Married,5.944444,6.0,5.206897,5.318182,5.583333,4.952381,6.384615,5.25,6.0,6.625,5.095238,5.588235
Single,5.190476,4.75,5.590909,5.222222,6.666667,5.15,5.647059,4.863636,5.642857,6.5,5.173913,5.230769
Widowed,5.619048,4.357143,5.333333,5.413793,4.888889,5.814815,5.035714,5.413793,4.6,5.055556,5.083333,5.210526


In [None]:
# Index by purchase year and day of month (both index levels keep the name 'Time_of_Purchase')
df.pivot_table(index=[df['Time_of_Purchase'].dt.year, df['Time_of_Purchase'].dt.day], columns='Marital_Status', values=['Customer_Satisfaction'], aggfunc='mean')

Unnamed: 0_level_0,Unnamed: 1_level_0,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction
Unnamed: 0_level_1,Marital_Status,Divorced,Married,Single,Widowed
Time_of_Purchase,Time_of_Purchase,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
2024,1,5.4,6.4,4.909091,5.545455
2024,2,5.5,6.142857,5.0,4.666667
2024,3,4.785714,4.875,5.25,4.8
2024,4,5.625,7.857143,4.0,5.75
2024,5,5.375,4.75,6.6,5.285714
2024,6,6.4,7.333333,8.0,4.454545
2024,7,4.571429,6.2,5.5,6.166667
2024,8,5.7,5.8,6.375,4.666667
2024,9,7.0,3.5,5.727273,4.0
2024,10,4.333333,6.555556,4.0,7.714286


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

### Stack
* `stack()` moves columns into multi-level index rows

In [None]:
df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
df_single_level_cols

Unnamed: 0,weight,height
cat,0,1
dog,2,3


In [None]:
df_single_level_cols.stack(future_stack=True)

Unnamed: 0,Unnamed: 1,0
cat,weight,0
cat,height,1
dog,weight,2
dog,height,3


### Unstack
* `unstack()` moves a row index level into the columns
* The `level` parameter lets you choose which index level to move (default: the innermost level)
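
A minimal sketch of how `level` and `fill_value` behave, using a hypothetical two-level Series rather than `pivot_example`. The level names `outer` and `inner` are made up for illustration, and one combination is deliberately missing to show where gaps appear.

```python
import pandas as pd

# Hypothetical two-level index; the ('b', 'y') combination is intentionally missing
idx = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x")], names=["outer", "inner"]
)
s = pd.Series([1, 2, 3], index=idx)

# Default: the innermost level ('inner') becomes the columns; the gap shows up as NaN
wide_inner = s.unstack()

# Select a level by name and fill the gap left by the missing combination
wide_outer = s.unstack(level="outer", fill_value=0)

print(wide_inner)
print(wide_outer)
```

Selecting the level by name instead of position tends to be more robust when the index order changes later.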

In [None]:
pivot_example.unstack()

Unnamed: 0_level_0,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase
Marital_Status,Divorced,Married,Single,Widowed,Divorced,Married,Single,Widowed
Gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Agender,5.0,6.0,3.5,4.0,10.0,6.0,8.0,8.0
Bigender,4.0,5.0,2.5,4.0,4.0,9.0,10.0,6.0
Female,5.0,6.0,6.0,6.0,7.0,7.0,7.0,7.0
Genderfluid,6.0,4.0,7.5,3.0,7.0,9.0,5.0,10.0
Genderqueer,9.5,8.0,6.0,6.0,4.0,5.5,7.0,5.0
Male,6.0,6.0,6.0,5.0,7.0,7.0,7.0,7.0
Non-binary,6.0,7.0,8.0,7.5,9.0,7.0,8.0,9.5
Polygender,5.5,4.0,3.0,6.0,6.5,5.0,8.0,11.0


In [None]:
pivot_example.unstack(level=0)

Unnamed: 0_level_0,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Customer_Satisfaction,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase,Frequency_of_Purchase
Gender,Agender,Bigender,Female,Genderfluid,Genderqueer,Male,Non-binary,Polygender,Agender,Bigender,Female,Genderfluid,Genderqueer,Male,Non-binary,Polygender
Marital_Status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
Divorced,5.0,4.0,5.0,6.0,9.5,6.0,6.0,5.5,10.0,4.0,7.0,7.0,4.0,7.0,9.0,6.5
Married,6.0,5.0,6.0,4.0,8.0,6.0,7.0,4.0,6.0,9.0,7.0,9.0,5.5,7.0,7.0,5.0
Single,3.5,2.5,6.0,7.5,6.0,6.0,8.0,3.0,8.0,10.0,7.0,5.0,7.0,7.0,8.0,8.0
Widowed,4.0,4.0,6.0,3.0,6.0,5.0,7.5,6.0,8.0,6.0,7.0,10.0,5.0,7.0,9.5,11.0


In [None]:
pivot_example.reset_index()

Unnamed: 0,Gender,Marital_Status,Customer_Satisfaction,Frequency_of_Purchase
0,Agender,Divorced,5.0,10.0
1,Agender,Married,6.0,6.0
2,Agender,Single,3.5,8.0
3,Agender,Widowed,4.0,8.0
4,Bigender,Divorced,4.0,4.0
5,Bigender,Married,5.0,9.0
6,Bigender,Single,2.5,10.0
7,Bigender,Widowed,4.0,6.0
8,Female,Divorced,5.0,7.0
9,Female,Married,6.0,7.0


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

### Melt
* Unpivot from wide to long format
* Essentially moves column labels into row values
* Examples: https://www.geeksforgeeks.org/python-pandas-melt/
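
As a rough illustration of the wide-to-long idea, the sketch below melts a tiny hypothetical table (`wide`, with made-up `id`, `height` and `weight` columns) and then reverses the operation with `pivot()`. It assumes each `id` appears only once, which is what makes the round trip possible.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per measurement
wide = pd.DataFrame({"id": [1, 2], "height": [170, 180], "weight": [65, 80]})

# Wide -> long: column labels move into 'variable', cell values into 'value'
long = wide.melt(id_vars="id", value_vars=["height", "weight"])

# Long -> wide again with pivot(); reset_index() turns 'id' back into a column
restored = long.pivot(index="id", columns="variable", values="value").reset_index()
restored.columns.name = None  # drop the leftover 'variable' axis label

print(long)
print(restored)
```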

In [None]:
# melted example 1
pd.melt(df, id_vars=['Customer_ID'], value_vars=['Gender'])

Unnamed: 0,Customer_ID,variable,value
0,37-611-6911,Gender,Female
1,29-392-9296,Gender,Male
2,84-649-5117,Gender,Female
3,48-980-6078,Gender,Female
4,91-170-9072,Gender,Female
...,...,...,...
995,20-562-2569,Gender,Female
996,41-366-4205,Gender,Female
997,77-241-7621,Gender,Male
998,53-091-2176,Gender,Female


In [None]:
# melted example 2
melted_2 = pd.melt(df, id_vars=['Customer_ID'], value_vars=['Gender', 'Age'])
melted_2

Unnamed: 0,Customer_ID,variable,value
0,37-611-6911,Gender,Female
1,29-392-9296,Gender,Male
2,84-649-5117,Gender,Female
3,48-980-6078,Gender,Female
4,91-170-9072,Gender,Female
...,...,...,...
1995,20-562-2569,Age,44
1996,41-366-4205,Age,50
1997,77-241-7621,Age,26
1998,53-091-2176,Age,21


In [None]:
melted_2.loc[melted_2['Customer_ID'] == '37-611-6911']

Unnamed: 0,Customer_ID,variable,value
0,37-611-6911,Gender,Female
1000,37-611-6911,Age,22


In [None]:
# melted example 3
# Names of ‘variable’ and ‘value’ columns can be customized
pd.melt(df, id_vars=['Customer_ID'], value_vars=['Gender'],
        var_name='ChangedVarname', value_name='ChangedValname')

Unnamed: 0,Customer_ID,ChangedVarname,ChangedValname
0,37-611-6911,Gender,Female
1,29-392-9296,Gender,Male
2,84-649-5117,Gender,Female
3,48-980-6078,Gender,Female
4,91-170-9072,Gender,Female
...,...,...,...
995,20-562-2569,Gender,Female
996,41-366-4205,Gender,Female
997,77-241-7621,Gender,Male
998,53-091-2176,Gender,Female


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

### Explode
* Transform each element of a list-like to a row, replicating index values
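
The sketch below shows how index values are replicated and what happens to empty lists and scalars. The `orders` frame is hypothetical; `ignore_index` is the documented parameter listed in the table above.

```python
import pandas as pd

# Hypothetical data: a list, an empty list, and a plain scalar in the same column
orders = pd.DataFrame({
    "customer": ["c1", "c2", "c3"],
    "items": [["pen", "ink"], [], "paper"],
})

# Default: the original index label is repeated for every exploded element;
# the empty list becomes NaN and the scalar is left unchanged
exploded = orders.explode("items")

# ignore_index=True renumbers the result 0..n-1 instead
renumbered = orders.explode("items", ignore_index=True)

print(exploded)
print(renumbered)
```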

In [None]:
data = {
  "Brand": ["Ford", "Ford", "Ford"],
  "Model": ["Sierra", "F-150", "Mustang"],
  "Typ" : ["2.0 GL", "Raptor", ["Mach-E", "Mach-1"]]
}

df_explode = pd.DataFrame(data)
df_explode

Unnamed: 0,Brand,Model,Typ
0,Ford,Sierra,2.0 GL
1,Ford,F-150,Raptor
2,Ford,Mustang,"[Mach-E, Mach-1]"


In [None]:
df_explode.explode('Typ')

Unnamed: 0,Brand,Model,Typ
0,Ford,Sierra,2.0 GL
1,Ford,F-150,Raptor
2,Ford,Mustang,Mach-E
2,Ford,Mustang,Mach-1


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

### Crosstab
* `pd.crosstab()` computes a cross-tabulation (contingency table) of two or more categorical variables
* By default it returns a frequency table of the factors; pass `values` together with `aggfunc` to aggregate another column instead
* Note: one advantage of crosstab is its `normalize` parameter, which converts counts to proportions (over all cells, per row with `'index'`, or per column with `'columns'`)
* Example : https://pbpython.com/pandas-crosstab.html
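
To make the default behaviour concrete, here is a small sketch on a hypothetical `people` frame. It compares `pd.crosstab()` with the same counts built by hand via `groupby`, then shows row-wise normalization.

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "role": ["dev", "ops", "dev", "dev", "ops"],
})

# Frequency table (the default behaviour)
counts = pd.crosstab(people["team"], people["role"])

# The same counts built by hand with groupby + unstack
by_hand = people.groupby(["team", "role"]).size().unstack(fill_value=0)

# normalize='index' turns each row into proportions that sum to 1
row_share = pd.crosstab(people["team"], people["role"], normalize="index")

print(counts)
print(row_share)
```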

In [None]:
pd.crosstab(index=df['Gender'], columns=df['Marital_Status'])

Marital_Status,Divorced,Married,Single,Widowed
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agender,3,5,6,5
Bigender,5,4,8,3
Female,102,118,109,123
Genderfluid,3,5,6,3
Genderqueer,2,4,3,3
Male,121,108,106,114
Non-binary,5,4,3,4
Polygender,4,5,1,5


In [None]:
pd.crosstab(df['Gender'], df['Marital_Status'], margins=True, normalize=True)

Marital_Status,Divorced,Married,Single,Widowed,All
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Agender,0.003,0.005,0.006,0.005,0.019
Bigender,0.005,0.004,0.008,0.003,0.02
Female,0.102,0.118,0.109,0.123,0.452
Genderfluid,0.003,0.005,0.006,0.003,0.017
Genderqueer,0.002,0.004,0.003,0.003,0.012
Male,0.121,0.108,0.106,0.114,0.449
Non-binary,0.005,0.004,0.003,0.004,0.016
Polygender,0.004,0.005,0.001,0.005,0.015
All,0.245,0.253,0.242,0.26,1.0


In [None]:
# normalize by columns
pd.crosstab(df['Gender'], df['Marital_Status'], margins=True, normalize='columns')

Marital_Status,Divorced,Married,Single,Widowed,All
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Agender,0.012245,0.019763,0.024793,0.019231,0.019
Bigender,0.020408,0.01581,0.033058,0.011538,0.02
Female,0.416327,0.466403,0.450413,0.473077,0.452
Genderfluid,0.012245,0.019763,0.024793,0.011538,0.017
Genderqueer,0.008163,0.01581,0.012397,0.011538,0.012
Male,0.493878,0.426877,0.438017,0.438462,0.449
Non-binary,0.020408,0.01581,0.012397,0.015385,0.016
Polygender,0.016327,0.019763,0.004132,0.019231,0.015


In [None]:
# normalize count by index
pd.crosstab(df['Gender'], df['Marital_Status'], margins=True, normalize='index')

Marital_Status,Divorced,Married,Single,Widowed
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agender,0.157895,0.263158,0.315789,0.263158
Bigender,0.25,0.2,0.4,0.15
Female,0.225664,0.261062,0.24115,0.272124
Genderfluid,0.176471,0.294118,0.352941,0.176471
Genderqueer,0.166667,0.333333,0.25,0.25
Male,0.269488,0.240535,0.23608,0.253898
Non-binary,0.3125,0.25,0.1875,0.25
Polygender,0.266667,0.333333,0.066667,0.333333
All,0.245,0.253,0.242,0.26


In [None]:
pd.crosstab(df['Gender'], df['Marital_Status'], values=df['Frequency_of_Purchase'], aggfunc='mean').round(1)

Marital_Status,Divorced,Married,Single,Widowed
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agender,10.0,6.0,6.8,8.2
Bigender,3.8,9.0,9.2,6.3
Female,6.9,6.9,7.0,6.7
Genderfluid,6.7,7.4,6.3,9.7
Genderqueer,4.0,5.2,6.7,4.3
Male,7.1,6.6,7.2,6.9
Non-binary,7.8,7.0,7.7,9.0
Polygender,6.8,6.6,8.0,9.2


In [None]:
# multiple columns and rows
pd.crosstab([df['Gender'], df['Income_Level']], [df['Marital_Status'], df['Occupation']])

Unnamed: 0_level_0,Marital_Status,Divorced,Divorced,Married,Married,Single,Single,Widowed,Widowed
Unnamed: 0_level_1,Occupation,High,Middle,High,Middle,High,Middle,High,Middle
Gender,Income_Level,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Agender,High,2,1,2,1,3,0,0,2
Agender,Middle,0,0,2,0,1,2,1,2
Bigender,High,2,0,3,0,4,2,0,1
Bigender,Middle,2,1,1,0,1,1,0,2
Female,High,32,29,28,25,32,32,36,24
Female,Middle,28,13,31,34,26,19,30,33
Genderfluid,High,2,0,0,2,2,2,1,1
Genderfluid,Middle,1,0,1,2,0,2,0,1
Genderqueer,High,1,1,2,0,2,1,1,1
Genderqueer,Middle,0,0,2,0,0,0,0,1


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

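### Cut and QCut
* `pd.cut()`: bin values into discrete intervals, using either a number of equal-width bins or explicit bin edges
* `pd.qcut()`: quantile-based discretization; bin edges are chosen so that each bin holds roughly the same number of values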

`pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)`
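
A minimal sketch contrasting the two functions on a hypothetical `ages` Series. The bin edges and label names are chosen for illustration only.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([18, 22, 25, 31, 40, 47, 50, 65])

# cut: fixed, hand-picked edges; right=True (the default) gives intervals (left, right]
# include_lowest=True makes the first interval also contain its left edge (18)
groups = pd.cut(ages, bins=[18, 30, 45, 65],
                labels=["18-30", "31-45", "46-65"], include_lowest=True)

# qcut: edges come from the data's quantiles, so each bin holds about the same number of rows
quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(groups.value_counts(sort=False))
print(quartiles.value_counts(sort=False))
```

With `cut` the bin sizes depend on how the data is distributed; with `qcut` the bin widths do.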

In [None]:
pd.cut(df['Age'], bins=5)

Unnamed: 0,Age
0,"(17.968, 24.4]"
1,"(43.6, 50.0]"
2,"(17.968, 24.4]"
3,"(24.4, 30.8]"
4,"(30.8, 37.2]"
...,...
995,"(43.6, 50.0]"
996,"(43.6, 50.0]"
997,"(24.4, 30.8]"
998,"(17.968, 24.4]"


In [None]:
pd.cut(df['Age'], bins=5).unique()

[(17.968, 24.4], (43.6, 50.0], (24.4, 30.8], (30.8, 37.2], (37.2, 43.6]]
Categories (5, interval[float64, right]): [(17.968, 24.4] < (24.4, 30.8] < (30.8, 37.2] <
                                           (37.2, 43.6] < (43.6, 50.0]]

In [None]:
pd.qcut(df['Age'], q=5).unique()

[(17.999, 25.0], (44.0, 50.0], (25.0, 31.0], (31.0, 38.0], (38.0, 44.0]]
Categories (5, interval[float64, right]): [(17.999, 25.0] < (25.0, 31.0] < (31.0, 38.0] <
                                           (38.0, 44.0] < (44.0, 50.0]]

In [None]:
pd.qcut(df['Age'], q=[0, .25, .5, .75, 1.],
        labels = ['very young', 'young', 'middle age', 'old'])

Unnamed: 0,Age
0,very young
1,old
2,very young
3,young
4,young
...,...
995,old
996,old
997,very young
998,very young


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

### pd.factorize()
* Encode the object as an enumerated type or categorical variable
* Useful for obtaining a numeric representation of an array when all that matters is identifying distinct values
* Available as a top-level function, `pd.factorize()`, and as the methods `Series.factorize()` and `Index.factorize()`
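
The sketch below illustrates how codes are assigned, using a short hypothetical Series that includes a missing value. By default codes follow the order of first appearance and missing values receive the code `-1`.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a missing entry
values = pd.Series(["b", "a", np.nan, "b", "c"])

# Default: codes follow order of first appearance; the missing value gets code -1
codes, uniques = pd.factorize(values)

# sort=True sorts the uniques first, so codes follow sorted order instead
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```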

In [None]:
codes, uniques = pd.factorize(df['Gender'])
print(codes)
print(uniques)

[0 1 0 0 0 1 0 1 0 2 1 0 1 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 1 0 0 1 3 1 1 0 1
 0 0 1 0 0 1 1 1 0 0 0 1 0 0 4 0 1 1 0 0 0 0 0 1 5 0 0 0 3 2 1 0 1 2 0 1 1
 1 1 1 1 0 1 0 1 0 1 0 1 1 0 1 0 1 6 0 0 0 0 0 0 4 1 0 1 1 0 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 7 1 1 1 1 0 1 0 0 1 1 1 1 7 1 0 1 1
 1 1 3 1 4 1 1 0 0 0 1 1 0 1 0 0 1 6 0 1 0 1 1 5 1 0 1 0 0 1 1 1 1 0 0 0 1
 0 0 4 0 1 1 1 0 0 6 1 1 7 1 1 0 0 0 5 0 0 1 0 3 0 1 1 1 0 0 1 1 1 0 1 1 1
 0 1 0 0 6 1 6 1 1 1 6 1 1 1 6 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 2 0
 0 0 0 0 1 0 0 1 1 2 0 4 3 0 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 7 3 1 1 0 1 0 0
 5 7 1 0 0 1 0 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 1 0 1 1 1 0 1 3 0 0 1 0 0 1
 1 3 1 1 1 0 3 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 7 0 0 1
 1 0 1 3 0 1 0 1 1 1 1 1 0 1 1 4 1 5 1 1 5 0 1 0 1 5 0 0 0 6 1 0 0 1 1 1 6
 0 0 2 0 0 0 0 0 0 1 6 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1 5 4 1 0 0 1 2 1
 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 2 1 1 1 0 0 0 1 0 1 1 0 0 1 0 0 1 5 1 0
 1 1 0 3 0 7 3 1 0 1 0 0 

In [None]:
df_dummy = df.filter(items=['Age', 'Gender'])
pd.get_dummies(df_dummy, columns=['Gender'])

Unnamed: 0,Age,Gender_Agender,Gender_Bigender,Gender_Female,Gender_Genderfluid,Gender_Genderqueer,Gender_Male,Gender_Non-binary,Gender_Polygender
0,22,False,False,True,False,False,False,False,False
1,49,False,False,False,False,False,True,False,False
2,24,False,False,True,False,False,False,False,False
3,29,False,False,True,False,False,False,False,False
4,33,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
995,44,False,False,True,False,False,False,False,False
996,50,False,False,True,False,False,False,False,False
997,26,False,False,False,False,False,True,False,False
998,21,False,False,True,False,False,False,False,False


In [None]:
pd.get_dummies(df_dummy, columns=['Gender'], dtype=int)

Unnamed: 0,Age,Gender_Agender,Gender_Bigender,Gender_Female,Gender_Genderfluid,Gender_Genderqueer,Gender_Male,Gender_Non-binary,Gender_Polygender
0,22,0,0,1,0,0,0,0,0
1,49,0,0,0,0,0,1,0,0
2,24,0,0,1,0,0,0,0,0
3,29,0,0,1,0,0,0,0,0
4,33,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
995,44,0,0,1,0,0,0,0,0
996,50,0,0,1,0,0,0,0,0
997,26,0,0,0,0,0,1,0,0
998,21,0,0,1,0,0,0,0,0


In [None]:
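# Round trip: one-hot encode the Series, then recover the original labels.
# (`columns=` is dropped here because the input is a Series, not a DataFrame.)
pd.from_dummies(pd.get_dummies(df_dummy['Gender']))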

Unnamed: 0,Unnamed: 1
0,Female
1,Male
2,Female
3,Female
4,Female
...,...
995,Female
996,Female
997,Male
998,Female
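
When the dummies come from a DataFrame, the column names carry a prefix such as `Gender_`. A minimal sketch, assuming pandas 1.5 or later (where `pd.from_dummies` is available) and a hypothetical toy frame, shows how the `sep` argument splits that prefix off again:

```python
import pandas as pd

# Hypothetical toy frame, not the shopping dataset.
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female', 'Non-binary']})

# Columns become Gender_Female, Gender_Male, Gender_Non-binary.
dummies = pd.get_dummies(toy, columns=['Gender'])

# sep='_' tells from_dummies where the prefix ends and the label begins.
restored = pd.from_dummies(dummies, sep='_')

print(restored)
```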


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

---

### DataFrame Group By

| Method/Attribute | Description | Python Example | Parameters | Documentation |
| ---------------- | ----------- | -------------- | ---------- | ------------- |
| `df.groupby()` | Group DataFrame by one or more columns | `df.groupby('category').sum()` | `by`: column name(s) or function  <br>`axis`: {0, 1} (deprecated since pandas 2.1)  <br>`level`: int, str or sequence of int/str  <br>`sort`: bool, default `True` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)<br>[Example](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) |
| `df.groupby().agg()` | Aggregate each group using one or more operations | `df.groupby('category').agg('mean')` | `func`: function, str, list, or dict  <br>`**kwargs`: named aggregations such as `new_col=('col', 'mean')` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) |
| `df.groupby().transform()` | Apply a function to each group and return a result with the same index as the original | `df.groupby('category').transform(lambda x: x * 2)` | `func`: function or str | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) |
| `df.groupby().filter()` | Keep only the groups for which a condition is true | `df.groupby('category').filter(lambda x: x['value'].sum() > 10)` | `func`: function returning `True`/`False` per group  <br>`dropna`: bool, default `True` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html) |
| `df.groupby().size()` | Count the number of rows in each group (including NA values) | `df.groupby('category').size()` | _(no parameters)_ | [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html) |
| `df.groupby().cumcount()` | Number each row within its group, starting at 0 | `df.groupby('category').cumcount()` | `ascending`: bool, default `True` | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cumcount.html) |
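
The methods in the table differ mainly in the shape of what they return. The sketch below uses a hypothetical toy frame (the `category` and `value` column names are assumptions taken from the table's examples) to put them side by side:

```python
import pandas as pd

# Hypothetical toy data matching the column names used in the table.
toy = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1, 2, 3, 4, 20]})
grouped = toy.groupby('category')

# Aggregation: one row per group.
totals = grouped['value'].sum()

# Transform: same length as the input, so it can be assigned back as a column.
centered = toy['value'] - grouped['value'].transform('mean')

# Filter: keeps or drops whole groups; only group 'b' sums to more than 10.
big_groups = grouped.filter(lambda g: g['value'].sum() > 10)

# size counts rows per group; cumcount numbers rows within each group from 0.
sizes = grouped.size()
position = grouped.cumcount()

print(totals, centered, big_groups, sizes, position, sep='\n\n')
```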

| Method                                                                                                                                                                                                                      | Description                                                        |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| [`any()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.any.html#pandas.core.groupby.DataFrameGroupBy.any "pandas.core.groupby.DataFrameGroupBy.any")                     | Compute whether any of the values in the groups are truthy         |
| [`all()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.all.html#pandas.core.groupby.DataFrameGroupBy.all "pandas.core.groupby.DataFrameGroupBy.all")                     | Compute whether all of the values in the groups are truthy         |
| [`count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html#pandas.core.groupby.DataFrameGroupBy.count "pandas.core.groupby.DataFrameGroupBy.count")             | Compute the number of non-NA values in the groups                  |
| [`cov()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html#pandas.core.groupby.DataFrameGroupBy.cov "pandas.core.groupby.DataFrameGroupBy.cov") *                   | Compute the covariance of the groups                               |
| [`first()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html#pandas.core.groupby.DataFrameGroupBy.first "pandas.core.groupby.DataFrameGroupBy.first")             | Compute the first occurring value in each group                    |
| [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.idxmax.html#pandas.core.groupby.DataFrameGroupBy.idxmax "pandas.core.groupby.DataFrameGroupBy.idxmax")         | Compute the index of the maximum value in each group               |
| [`idxmin()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.idxmin.html#pandas.core.groupby.DataFrameGroupBy.idxmin "pandas.core.groupby.DataFrameGroupBy.idxmin")         | Compute the index of the minimum value in each group               |
| [`last()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html#pandas.core.groupby.DataFrameGroupBy.last "pandas.core.groupby.DataFrameGroupBy.last")                 | Compute the last occurring value in each group                     |
| [`max()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html#pandas.core.groupby.DataFrameGroupBy.max "pandas.core.groupby.DataFrameGroupBy.max")                     | Compute the maximum value in each group                            |
| [`mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html#pandas.core.groupby.DataFrameGroupBy.mean "pandas.core.groupby.DataFrameGroupBy.mean")                 | Compute the mean of each group                                     |
| [`median()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.median.html#pandas.core.groupby.DataFrameGroupBy.median "pandas.core.groupby.DataFrameGroupBy.median")         | Compute the median of each group                                   |
| [`min()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html#pandas.core.groupby.DataFrameGroupBy.min "pandas.core.groupby.DataFrameGroupBy.min")                     | Compute the minimum value in each group                            |
| [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.nunique.html#pandas.core.groupby.DataFrameGroupBy.nunique "pandas.core.groupby.DataFrameGroupBy.nunique")     | Compute the number of unique values in each group                  |
| [`prod()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.prod.html#pandas.core.groupby.DataFrameGroupBy.prod "pandas.core.groupby.DataFrameGroupBy.prod")                 | Compute the product of the values in each group                    |
| [`quantile()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.quantile.html#pandas.core.groupby.DataFrameGroupBy.quantile "pandas.core.groupby.DataFrameGroupBy.quantile") | Compute a given quantile of the values in each group               |
| [`sem()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sem.html#pandas.core.groupby.DataFrameGroupBy.sem "pandas.core.groupby.DataFrameGroupBy.sem")                     | Compute the standard error of the mean of the values in each group |
| [`size()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html#pandas.core.groupby.DataFrameGroupBy.size "pandas.core.groupby.DataFrameGroupBy.size")                 | Compute the number of values in each group                         |
| [`skew()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.skew.html#pandas.core.groupby.DataFrameGroupBy.skew "pandas.core.groupby.DataFrameGroupBy.skew") *               | Compute the skew of the values in each group                       |
| [`std()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.std.html#pandas.core.groupby.DataFrameGroupBy.std "pandas.core.groupby.DataFrameGroupBy.std")                     | Compute the standard deviation of the values in each group         |
| [`sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html#pandas.core.groupby.DataFrameGroupBy.sum "pandas.core.groupby.DataFrameGroupBy.sum")                     | Compute the sum of the values in each group                        |
| [`var()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.var.html#pandas.core.groupby.DataFrameGroupBy.var "pandas.core.groupby.DataFrameGroupBy.var")                     | Compute the variance of the values in each group                   |
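
A few of these methods are easiest to understand on a small example. The following sketch uses hypothetical data (`team`, `player`, `score` are illustrative names) and shows `idxmax()` used to pull the full row of each group's top score, alongside `nunique()` and `median()`:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({'team': ['x', 'x', 'y', 'y', 'y'],
                    'player': ['p1', 'p2', 'p3', 'p4', 'p5'],
                    'score': [10, 30, 20, 20, 5]})
by_team = toy.groupby('team')['score']

# idxmax returns the index label of each group's maximum (first one on ties),
# which can be passed to .loc to retrieve the complete rows.
best_rows = toy.loc[by_team.idxmax()]

distinct_scores = by_team.nunique()
median_scores = by_team.median()

print(best_rows, distinct_scores, median_scores, sep='\n\n')
```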

In [None]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Income_Level,Marital_Status,Education_Level,Occupation,Location,Purchase_Category,Purchase_Amount,Frequency_of_Purchase,Purchase_Channel,Brand_Loyalty,Product_Rating,Time_Spent_on_Product_Research(hours),Social_Media_Influence,Discount_Sensitivity,Return_Rate,Customer_Satisfaction,Engagement_with_Ads,Device_Used_for_Shopping,Payment_Method,Time_of_Purchase,Discount_Used,Customer_Loyalty_Program_Member,Purchase_Intent,Shipping_Preference,Time_to_Decision
0,37-611-6911,22,Female,Middle,Married,Bachelor's,Middle,Évry,Gardening & Outdoors,$333.80,4,Mixed,5,5,2.0,,Somewhat Sensitive,1,7,,Tablet,Credit Card,2024-03-01,True,False,Need-based,No Preference,2
1,29-392-9296,49,Male,High,Married,High School,High,Huocheng,Food & Beverages,$222.22,11,In-Store,3,1,2.0,Medium,Not Sensitive,1,5,High,Tablet,PayPal,2024-04-16,True,False,Wants-based,Standard,6
2,84-649-5117,24,Female,Middle,Single,Master's,High,Huzhen,Office Supplies,$426.22,2,Mixed,5,5,0.3,Low,Not Sensitive,1,7,Low,Smartphone,Debit Card,2024-03-15,True,True,Impulsive,No Preference,3
3,48-980-6078,29,Female,Middle,Single,Master's,Middle,Wiwilí,Home Appliances,$101.31,6,Mixed,3,1,1.0,High,Somewhat Sensitive,0,1,,Smartphone,Other,2024-10-04,True,True,Need-based,Express,10
4,91-170-9072,33,Female,Middle,Widowed,High School,Middle,Nara,Furniture,$211.70,6,Mixed,3,4,0.0,Medium,Not Sensitive,2,10,,Smartphone,Debit Card,2024-01-30,False,False,Wants-based,No Preference,4


In [None]:
df.groupby(['Gender'])['Frequency_of_Purchase'].mean()

Unnamed: 0_level_0,Frequency_of_Purchase
Gender,Unnamed: 1_level_1
Agender,7.473684
Bigender,7.4
Female,6.884956
Genderfluid,7.294118
Genderqueer,5.166667
Male,6.942094
Non-binary,7.875
Polygender,7.6


In [None]:
df.groupby(['Gender'])[['Frequency_of_Purchase', 'Customer_Satisfaction']].mean()

Unnamed: 0_level_0,Frequency_of_Purchase,Customer_Satisfaction
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Agender,7.473684,5.473684
Bigender,7.4,4.3
Female,6.884956,5.307522
Genderfluid,7.294118,5.411765
Genderqueer,5.166667,6.416667
Male,6.942094,5.496659
Non-binary,7.875,6.5
Polygender,7.6,4.6


In [None]:
df.groupby(['Gender', 'Marital_Status'])[['Frequency_of_Purchase', 'Customer_Satisfaction']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency_of_Purchase,Customer_Satisfaction
Gender,Marital_Status,Unnamed: 2_level_1,Unnamed: 3_level_1
Agender,Divorced,10.0,4.666667
Agender,Married,6.0,6.8
Agender,Single,6.833333,5.333333
Agender,Widowed,8.2,4.8
Bigender,Divorced,3.8,5.6
Bigender,Married,9.0,5.25
Bigender,Single,9.25,3.0
Bigender,Widowed,6.333333,4.333333
Female,Divorced,6.921569,4.784314
Female,Married,6.898305,5.491525


In [None]:
df.groupby('Gender').agg(
    avg_satisfaction=('Customer_Satisfaction', 'mean'),
    min_satisfaction=('Customer_Satisfaction', 'min'),
    max_satisfaction=('Customer_Satisfaction', 'max'),
    )

Unnamed: 0_level_0,avg_satisfaction,min_satisfaction,max_satisfaction
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Agender,5.473684,1,10
Bigender,4.3,1,9
Female,5.307522,1,10
Genderfluid,5.411765,2,10
Genderqueer,6.416667,1,10
Male,5.496659,1,10
Non-binary,6.5,2,9
Polygender,4.6,1,9


In [None]:
df.groupby(['Gender', 'Income_Level']).agg(
    avg_satisfaction=('Customer_Satisfaction', 'mean'),
    min_satisfaction=('Customer_Satisfaction', 'min'),
    max_satisfaction=('Customer_Satisfaction', 'max'),
    )

Unnamed: 0_level_0,Unnamed: 1_level_0,avg_satisfaction,min_satisfaction,max_satisfaction
Gender,Income_Level,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agender,High,4.909091,1,9
Agender,Middle,6.25,3,10
Bigender,High,4.25,1,9
Bigender,Middle,4.375,2,9
Female,High,5.109244,1,10
Female,Middle,5.528037,1,10
Genderfluid,High,4.7,2,8
Genderfluid,Middle,6.428571,4,10
Genderqueer,High,6.111111,1,10
Genderqueer,Middle,7.333333,6,9
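
The results above carry the group keys in a (multi-level) index. If flat columns are easier to work with, one option is `as_index=False`. This is a sketch on a hypothetical toy frame rather than the shopping data:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative.
toy = pd.DataFrame({'gender': ['F', 'F', 'M'],
                    'income': ['High', 'Middle', 'High'],
                    'satisfaction': [4, 8, 6]})

# as_index=False keeps the group keys as ordinary columns.
summary = toy.groupby(['gender', 'income'], as_index=False).agg(
    avg_satisfaction=('satisfaction', 'mean'),
    n=('satisfaction', 'count'),
)

print(summary)
```

Calling `.reset_index()` on the indexed result should give an equivalent table.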


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

# Handling Missing Data & Duplicates

| Method/Attribute       | Description                                          | Python Example         | Parameters                                                                                                    | Documentation                                                                                       |
| ---------------------- | ---------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| `df.isnull()`          | Detect missing values in DataFrame                   | `df.isnull()`          | _(no parameters)_                                                                                             |                                                                                                     |
| `df.notnull()`         | Detect non-missing values in DataFrame               | `df.notnull()`         | _(no parameters)_                                                                                             |                                                                                                     |
| `df.dropna()`          | Remove missing values (rows or columns)              | `df.dropna()`          | `axis`: {0, 1}  <br>`how`: {'any', 'all'}  <br>`thresh`: int, `subset`: columns to check  <br>`inplace`: bool | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)          |
| `df.fillna()`          | Fill missing values with a specified value           | `df.fillna(0)`         | `value`: scalar, dict, Series, or DataFrame  <br>`method`: {'ffill', 'bfill'} (deprecated since pandas 2.1; use `df.ffill()` / `df.bfill()`)  <br>`axis`: {0, 1}, `inplace`: bool | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)          |
| `df.duplicated()`      | Flag duplicate rows in DataFrame (boolean Series)    | `df.duplicated()`      | `subset`: columns to consider  <br>`keep`: {'first', 'last', False}, default 'first'                          | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)      |
| `df.drop_duplicates()` | Remove duplicate rows from DataFrame                 | `df.drop_duplicates()` | `subset`: columns to consider  <br>`keep`: {'first', 'last', False}, default 'first'  <br>`inplace`: bool, `ignore_index`: bool | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) |
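
The table has no worked example, so here is a minimal sketch on a hypothetical toy frame (the `id`, `city`, and `score` columns are made up for illustration). It counts missing values, drops or fills them, and then handles duplicates:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are identical and contain missing values.
toy = pd.DataFrame({'id': [1, 2, 2, 3, 4],
                    'city': ['Nara', None, None, 'Lyon', 'Nara'],
                    'score': [5.0, np.nan, np.nan, 3.0, 5.0]})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Drop rows where 'city' is missing; other columns are not checked.
with_city = toy.dropna(subset=['city'])

# Fill each column with its own replacement value.
filled = toy.fillna({'city': 'Unknown', 'score': 0.0})

# duplicated() marks repeats of an earlier row; missing values compare as equal.
dup_mask = toy.duplicated()
deduped = toy.drop_duplicates()

# Restrict the comparison to some columns and keep the last occurrence.
last_per_city_score = toy.drop_duplicates(subset=['city', 'score'], keep='last')

print(missing_per_column, with_city, filled, dup_mask, deduped, sep='\n\n')
```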

# Applying Functions

## Examples

| Example | Description |
|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| `df['col'].map({'A': 'Apple', 'B': 'Banana'})` | Replace values in 'col' based on a dictionary; values without a matching key become NaN. |
| `df['col'].map(lambda x: x.lower())` | Convert strings in 'col' to lowercase using a lambda function for element-wise transformation. |
| `df['col'].map(str)` | Convert each element in 'col' to its string representation. |
| `df['col'].map('{:.2f}'.format)` | Format numerical values in 'col' to display with two decimal places. |
| `df['col'].map(pd.Series({'A': 1, 'B': 2}))` | Map values using a Series; similar to a dictionary but with Series alignment. |
| `df['col'].map(lambda x: x if x > 0 else 0)` | Replace negative values in 'col' with 0, leaving positive values unchanged. |
| `df['col'].apply(lambda x: x * 2)` | Multiply each value in 'col' by 2 (element-wise, because 'col' is a Series). |
| `df[['col1', 'col2']].apply(lambda row: row['col1'] + row['col2'], axis=1)` | Add 'col1' and 'col2' for each row (row-wise operation); for string columns this concatenates them. |
| `df[['col1', 'col2']].apply(sum, axis=0)` | Calculate the sum of values in 'col1' and 'col2' (column-wise summation). |
| `df.apply(pd.Series.nunique)` | Count the number of unique values in each column of the DataFrame. |
| `df.apply(lambda x: x.max() - x.min())` | Calculate the range (max - min) for each column in the DataFrame. |
| `df.apply(lambda row: row.fillna(row.mean()), axis=1)` | Fill missing values (NaN) in each row with the mean of that row's non-NaN values. |
| `df[['col1', 'col2']].apply(custom_func)` | Apply a user-defined function, here `custom_func(s)` returning `s.max() - s.min()`, to calculate the range of each specified column. |
| `df['col'].map(len)` | Get the length of each element in 'col' (typically used for strings). |
| `df.apply(lambda x: sorted(x), axis=0)` | Sort the values within each column and return the sorted Series. |
| `df.groupby('group_col')['value_col'].apply(lambda x: x / x.sum())` | Calculate each value's proportion within its group (after grouping). |
| `df.apply(pd.to_datetime, errors='coerce')` | Convert all columns to datetime, with errors resulting in NaT (Not a Time). |
| `df['col'].map('${:,.2f}'.format)` | Format numerical values in 'col' as currency strings with two decimal places and thousands separators. |
| `df.apply(lambda x: x.astype(str).str.contains('pattern').any())` | Check if any string in each column contains a given 'pattern'; returns a Series of booleans. |
| `df.apply(lambda x: x.ffill(), axis=0)` | Forward-fill missing values (NaN) within each column. |
| `df['col'].map(lambda x: x**2 if pd.notna(x) else x)` | Square the values in 'col', but leave NaN values unchanged. |
| `df.select_dtypes(include='number').apply(lambda x: (x - x.mean()) / x.std())` | Standardize numerical columns (subtract mean, divide by standard deviation). |
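
To make the difference between `map` (element-wise on a Series) and `apply` (per element, per row, or per column) concrete, here is a small sketch on a hypothetical toy frame; the column names are illustrative only:

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({'grade': ['A', 'B', 'C'],
                    'x': [1, -2, 3],
                    'y': [10, 20, 30]})

# map with a dict: 'C' has no matching key, so it becomes NaN.
labels = toy['grade'].map({'A': 'Apple', 'B': 'Banana'})

# map with a function: runs once per element.
clipped = toy['x'].map(lambda v: v if v > 0 else 0)

# apply with axis=1: the function receives one row at a time.
row_sum = toy[['x', 'y']].apply(lambda row: row['x'] + row['y'], axis=1)

# apply with the default axis=0: the function receives one column at a time.
col_range = toy[['x', 'y']].apply(lambda s: s.max() - s.min())

print(labels, clipped, row_sum, col_range, sep='\n\n')
```

For simple arithmetic like `row_sum`, the vectorized form `toy['x'] + toy['y']` gives the same result and is usually faster than a row-wise `apply`.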

[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

# Window Functions

* `df.rolling()`
* `df.expanding()`
* `df.shift()`

https://pandas.pydata.org/docs/dev/reference/window.html
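
Of the three methods listed, `shift()` is the only one without an example later in this section. A minimal sketch on a hypothetical Series shows it moving values along the index, which is handy for comparing each row with the previous one:

```python
import pandas as pd

# Hypothetical toy values for illustration.
prices = pd.Series([10, 12, 15, 11])

# shift(1) moves values down one row; the first row has no predecessor (NaN).
previous = prices.shift(1)
change = prices - previous

# A negative period shifts the other way, giving the next row's value.
following = prices.shift(-1)

print(pd.DataFrame({'price': prices, 'previous': previous,
                    'change': change, 'following': following}))
```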

# Rolling Window Functions

| Function | Description |
|---------------------------------------------------|-------------------------------------------------------------------|
| `Rolling.count([numeric_only])` | Calculate the rolling count of non-NaN observations. |
| `Rolling.sum([numeric_only, engine, ...])` | Calculate the rolling sum. |
| `Rolling.mean([numeric_only, engine, ...])` | Calculate the rolling mean. |
| `Rolling.median([numeric_only, engine, ...])` | Calculate the rolling median. |
| `Rolling.var([ddof, numeric_only, engine, ...])` | Calculate the rolling variance. |
| `Rolling.std([ddof, numeric_only, engine, ...])` | Calculate the rolling standard deviation. |
| `Rolling.min([numeric_only, engine, ...])` | Calculate the rolling minimum. |
| `Rolling.max([numeric_only, engine, ...])` | Calculate the rolling maximum. |
| `Rolling.first([numeric_only])` | Calculate the rolling first (left-most) element of the window. |
| `Rolling.last([numeric_only])` | Calculate the rolling last (right-most) element of the window. |
| `Rolling.corr([other, pairwise, ddof, ...])` | Calculate the rolling correlation. |
| `Rolling.cov([other, pairwise, ddof, ...])` | Calculate the rolling sample covariance. |
| `Rolling.skew([numeric_only])` | Calculate the rolling unbiased skewness. |
| `Rolling.kurt([numeric_only])` | Calculate the rolling Fisher's definition of kurtosis without bias. |
| `Rolling.apply(func[, raw, engine, ...])` | Calculate the rolling custom aggregation function. |
| `Rolling.pipe(func, *args, **kwargs)` | Apply a func with arguments to this Rolling object and return its result. |
| `Rolling.aggregate([func])` | Aggregate using one or more operations over the specified axis. |
| `Rolling.quantile(q[, interpolation, ...])` | Calculate the rolling quantile. |
| `Rolling.sem([ddof, numeric_only])` | Calculate the rolling standard error of the mean. |
| `Rolling.rank([method, ascending, pct, ...])` | Calculate the rolling rank. |
| `Rolling.nunique([numeric_only])` | Calculate the rolling nunique. |

## df.rolling + Aggregation + Min Period

In [None]:
# rolling mean with simple window function

# Create a sample DataFrame
data = {'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

# Calculate the rolling mean with a window of 3
df['RollingMean'] = df['Value'].rolling(window=3).mean()
print(df)

   Value  RollingMean
0      1          NaN
1      2          NaN
2      3          2.0
3      4          3.0
4      5          4.0
5      6          5.0
6      7          6.0
7      8          7.0
8      9          8.0
9     10          9.0


In [None]:
# Calculate the rolling mean with a window of 4 and min_periods=1
df['RollingMean'] = df['Value'].rolling(window=4, min_periods=1).mean()
print(df)

   Value  RollingMean
0      1          1.0
1      2          1.5
2      3          2.0
3      4          2.5
4      5          3.5
5      6          4.5
6      7          5.5
7      8          6.5
8      9          7.5
9     10          8.5


In [None]:
# Calculate the rolling sum with a window of 4 and min_periods=1
df['RollingSum'] = df['Value'].rolling(window=4, min_periods=1).sum()
print(df)

   Value  RollingMean  RollingSum
0      1          1.0         1.0
1      2          1.5         3.0
2      3          2.0         6.0
3      4          2.5        10.0
4      5          3.5        14.0
5      6          4.5        18.0
6      7          5.5        22.0
7      8          6.5        26.0
8      9          7.5        30.0
9     10          8.5        34.0


## Rolling with Time-Based Window

In [None]:
# Create a DataFrame with a datetime index
data = {'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
                               '2023-01-06', '2023-01-07', '2023-01-08', '2023-01-09', '2023-01-10'])}
df = pd.DataFrame(data).set_index('Date')

# Calculate the rolling mean with a 5-day window
df['RollingMean_5D'] = df['Value'].rolling(window='5D').mean()
print(df)

            Value  RollingMean_5D
Date                             
2023-01-01      1             1.0
2023-01-02      2             1.5
2023-01-03      3             2.0
2023-01-04      4             2.5
2023-01-05      5             3.0
2023-01-06      6             4.0
2023-01-07      7             5.0
2023-01-08      8             6.0
2023-01-09      9             7.0
2023-01-10     10             8.0
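
With evenly spaced daily data, a `'5D'` window behaves much like a 5-row window with `min_periods=1`. The difference shows up when timestamps are irregular. The sketch below uses hypothetical dates with gaps and assumes the default `closed='right'`, so each window covers the interval (t - 3 days, t]:

```python
import pandas as pd

# Hypothetical irregular timestamps: note the gaps after Jan 2 and Jan 6.
stamps = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05',
                         '2023-01-06', '2023-01-10'])
readings = pd.Series([1, 2, 3, 4, 5], index=stamps)

# A 3-day window holds however many observations fall inside it.
sum_3d = readings.rolling('3D').sum()
count_3d = readings.rolling('3D').count()

print(pd.DataFrame({'reading': readings, 'sum_3d': sum_3d, 'count_3d': count_3d}))
```

On Jan 5 the window no longer includes Jan 2, because the left edge of the interval is open.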


## Rolling with Custom Function (apply)

In [None]:
# Define a custom function to calculate the rolling range
def rolling_range(x):
    return x.max() - x.min()

# Calculate the rolling range with a window of 3
df['RollingRange'] = df['Value'].rolling(window=3).apply(rolling_range)
print(df)

            Value  RollingMean_5D  RollingRange
Date                                           
2023-01-01      1             1.0           NaN
2023-01-02      2             1.5           NaN
2023-01-03      3             2.0           2.0
2023-01-04      4             2.5           2.0
2023-01-05      5             3.0           2.0
2023-01-06      6             4.0           2.0
2023-01-07      7             5.0           2.0
2023-01-08      8             6.0           2.0
2023-01-09      9             7.0           2.0
2023-01-10     10             8.0           2.0
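
By default, `Rolling.apply` passes each window to the function as a Series. Passing `raw=True` hands over a NumPy array instead, which often runs faster for purely numeric functions. A sketch with a hypothetical helper `value_range` on toy data:

```python
import pandas as pd

# Hypothetical toy values for illustration.
values = pd.Series([1, 4, 2, 8])

def value_range(window):
    # With raw=True, `window` is a NumPy array rather than a Series.
    return window.max() - window.min()

rolling_range = values.rolling(window=3).apply(value_range, raw=True)
print(rolling_range)
```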


## Rolling + Multiple Aggregations

In [None]:
# Calculate multiple rolling statistics at once
df_agg = df['Value'].rolling(window=3).agg(['mean', 'sum', 'std'])
print(df_agg)

            mean   sum  std
Date                       
2023-01-01   NaN   NaN  NaN
2023-01-02   NaN   NaN  NaN
2023-01-03   2.0   6.0  1.0
2023-01-04   3.0   9.0  1.0
2023-01-05   4.0  12.0  1.0
2023-01-06   5.0  15.0  1.0
2023-01-07   6.0  18.0  1.0
2023-01-08   7.0  21.0  1.0
2023-01-09   8.0  24.0  1.0
2023-01-10   9.0  27.0  1.0
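
When rolling over a whole DataFrame, `agg` also accepts a dict that assigns a different aggregation to each column. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative.
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Rolling sum for column 'a', rolling mean for column 'b', both over 2 rows.
per_column = frame.rolling(window=2).agg({'a': 'sum', 'b': 'mean'})
print(per_column)
```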


In [None]:
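# Rank of the current value within its 3-row window
# (values increase steadily, so the newest value always ranks 3.0 here)
df['RollingRank'] = df['Value'].rolling(window=3).rank()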
print(df)

            Value  RollingMean_5D  RollingRange  RollingRank
Date                                                        
2023-01-01      1             1.0           NaN          NaN
2023-01-02      2             1.5           NaN          NaN
2023-01-03      3             2.0           2.0          3.0
2023-01-04      4             2.5           2.0          3.0
2023-01-05      5             3.0           2.0          3.0
2023-01-06      6             4.0           2.0          3.0
2023-01-07      7             5.0           2.0          3.0
2023-01-08      8             6.0           2.0          3.0
2023-01-09      9             7.0           2.0          3.0
2023-01-10     10             8.0           2.0          3.0


## Expanding: Cumulative Statistics

`df.expanding()` computes cumulative statistics: each window starts at the first row and grows to include every row up to the current one.

| Function | Description |
|---------------------------------------------------|-------------------------------------------------------------------|
| `Expanding.count([numeric_only])` | Calculate the expanding count of non-NaN observations. |
| `Expanding.sum([numeric_only, engine, ...])` | Calculate the expanding sum. |
| `Expanding.mean([numeric_only, engine, ...])` | Calculate the expanding mean. |
| `Expanding.median([numeric_only, engine, ...])` | Calculate the expanding median. |
| `Expanding.var([ddof, numeric_only, engine, ...])` | Calculate the expanding variance. |
| `Expanding.std([ddof, numeric_only, engine, ...])` | Calculate the expanding standard deviation. |
| `Expanding.min([numeric_only, engine, ...])` | Calculate the expanding minimum. |
| `Expanding.max([numeric_only, engine, ...])` | Calculate the expanding maximum. |
| `Expanding.first([numeric_only])` | Calculate the expanding first (left-most) element of the window. |
| `Expanding.last([numeric_only])` | Calculate the expanding last (right-most) element of the window. |
| `Expanding.corr([other, pairwise, ddof, ...])` | Calculate the expanding correlation. |
| `Expanding.cov([other, pairwise, ddof, ...])` | Calculate the expanding sample covariance. |
| `Expanding.skew([numeric_only])` | Calculate the expanding unbiased skewness. |
| `Expanding.kurt([numeric_only])` | Calculate the expanding Fisher's definition of kurtosis without bias. |
| `Expanding.apply(func[, raw, engine, ...])` | Calculate the expanding custom aggregation function. |
| `Expanding.pipe(func, *args, **kwargs)` | Apply a func with arguments to this Expanding object and return its result. |
| `Expanding.aggregate([func])` | Aggregate using one or more operations over the specified axis. |
| `Expanding.quantile(q[, interpolation, ...])` | Calculate the expanding quantile. |
| `Expanding.sem([ddof, numeric_only])` | Calculate the expanding standard error of the mean. |
| `Expanding.rank([method, ascending, pct, ...])` | Calculate the expanding rank. |
| `Expanding.nunique([numeric_only])` | Calculate the expanding nunique. |
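
One way to build intuition for expanding windows is to compare them with the familiar cumulative methods. The sketch below, on a hypothetical toy Series, checks that `expanding().sum()` matches `cumsum()` and shows the running mean, which has no direct `cum*` counterpart:

```python
import pandas as pd

# Hypothetical toy values for illustration.
values = pd.Series([1, 2, 3, 4])

# The expanding sum covers rows 0..i, which is exactly what cumsum computes.
expanding_sum = values.expanding().sum()   # returned as float
cumulative_sum = values.cumsum()           # keeps the integer dtype

# Running mean of everything seen so far.
expanding_mean = values.expanding().mean()

print(pd.DataFrame({'expanding_sum': expanding_sum,
                    'cumsum': cumulative_sum,
                    'expanding_mean': expanding_mean}))
```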

In [None]:
# Create a sample DataFrame
data = {'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

# Calculate the expanding sum
df['ExpandingSum'] = df['Value'].expanding().sum()
print(df)

   Value  ExpandingSum
0      1           1.0
1      2           3.0
2      3           6.0
3      4          10.0
4      5          15.0
5      6          21.0
6      7          28.0
7      8          36.0
8      9          45.0
9     10          55.0


In [None]:
# Calculate the expanding mean with min_periods=3
df['ExpandingMean_3'] = df['Value'].expanding(min_periods=3).mean()
print(df)

   Value  ExpandingSum  ExpandingMean_3
0      1           1.0              NaN
1      2           3.0              NaN
2      3           6.0              2.0
3      4          10.0              2.5
4      5          15.0              3.0
5      6          21.0              3.5
6      7          28.0              4.0
7      8          36.0              4.5
8      9          45.0              5.0
9     10          55.0              5.5


In [None]:
# Calculate the expanding maximum and minimum
df['ExpandingMax'] = df['Value'].expanding().max()
df['ExpandingMin'] = df['Value'].expanding().min()
print(df)

   Value  ExpandingSum  ExpandingMean_3  ExpandingMax  ExpandingMin
0      1           1.0              NaN           1.0           1.0
1      2           3.0              NaN           2.0           1.0
2      3           6.0              2.0           3.0           1.0
3      4          10.0              2.5           4.0           1.0
4      5          15.0              3.0           5.0           1.0
5      6          21.0              3.5           6.0           1.0
6      7          28.0              4.0           7.0           1.0
7      8          36.0              4.5           8.0           1.0
8      9          45.0              5.0           9.0           1.0
9     10          55.0              5.5          10.0           1.0


In [None]:
# Calculate the expanding count of non-NaN values
df['ExpandingCount'] = df['Value'].expanding().count()
print(df)

   Value  ExpandingSum  ExpandingMean_3  ExpandingMax  ExpandingMin  \
0      1           1.0              NaN           1.0           1.0   
1      2           3.0              NaN           2.0           1.0   
2      3           6.0              2.0           3.0           1.0   
3      4          10.0              2.5           4.0           1.0   
4      5          15.0              3.0           5.0           1.0   
5      6          21.0              3.5           6.0           1.0   
6      7          28.0              4.0           7.0           1.0   
7      8          36.0              4.5           8.0           1.0   
8      9          45.0              5.0           9.0           1.0   
9     10          55.0              5.5          10.0           1.0   

   ExpandingCount  
0             1.0  
1             2.0  
2             3.0  
3             4.0  
4             5.0  
5             6.0  
6             7.0  
7             8.0  
8             9.0  
9            10.0 

In [None]:
# Define a custom function (e.g., root mean square)
def rms(x):
    return np.sqrt(np.mean(x**2))

# Calculate the expanding RMS
df['ExpandingRMS'] = df['Value'].expanding().apply(rms)
print(df)

   Value  ExpandingSum  ExpandingMean_3  ExpandingMax  ExpandingMin  \
0      1           1.0              NaN           1.0           1.0   
1      2           3.0              NaN           2.0           1.0   
2      3           6.0              2.0           3.0           1.0   
3      4          10.0              2.5           4.0           1.0   
4      5          15.0              3.0           5.0           1.0   
5      6          21.0              3.5           6.0           1.0   
6      7          28.0              4.0           7.0           1.0   
7      8          36.0              4.5           8.0           1.0   
8      9          45.0              5.0           9.0           1.0   
9     10          55.0              5.5          10.0           1.0   

   ExpandingCount  ExpandingRMS  
0             1.0      1.000000  
1             2.0      1.581139  
2             3.0      2.160247  
3             4.0      2.738613  
4             5.0      3.316625  
5             

# Shift

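The `df.shift()` method moves values by a given number of periods while leaving the index unchanged, which is useful for building lagged or leading columns in time series data. When `freq` is passed, it shifts the index by that offset instead and leaves the values in place.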

In [None]:
# Create a sample time series DataFrame
data = {'Value': [1, 2, 3, 4, 5],
        'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])}
df = pd.DataFrame(data).set_index('Date')

# Lag the 'Value' column by one row (one day in this daily series)
df['Value_Lagged'] = df['Value'].shift(1)
print(df)

            Value  Value_Lagged
Date                           
2023-01-01      1           NaN
2023-01-02      2           1.0
2023-01-03      3           2.0
2023-01-04      4           3.0
2023-01-05      5           4.0


In [None]:
# Lead the 'Value' column by two rows (two days in this daily series)
df['Value_Lead'] = df['Value'].shift(-2)
print(df)

            Value  Value_Lagged  Value_Lead
Date                                       
2023-01-01      1           NaN         3.0
2023-01-02      2           1.0         4.0
2023-01-03      3           2.0         5.0
2023-01-04      4           3.0         NaN
2023-01-05      5           4.0         NaN


In [None]:
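# Shift the index (not the values) forward by one day using freq='D'.
# Assigning the result back aligns on the original index, so the value moved to
# 2023-01-06 is dropped and the column ends up matching Value_Lagged.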
df['Value_Shifted_1D'] = df['Value'].shift(periods=1, freq='D')
print(df)

            Value  Value_Lagged  Value_Lead  Value_Shifted_1D
Date                                                         
2023-01-01      1           NaN         3.0               NaN
2023-01-02      2           1.0         4.0               1.0
2023-01-03      3           2.0         5.0               2.0
2023-01-04      4           3.0         NaN               3.0
2023-01-05      5           4.0         NaN               4.0
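
Because the column assignment above realigns on the original index, the effect of `freq` is easier to see on a standalone Series. The sketch below (variable names are illustrative) compares a positional shift with a frequency-based shift on the same daily data.

```python
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
                        '2023-01-04', '2023-01-05'])
s = pd.Series([1, 2, 3, 4, 5], index=dates, name='Value')

# Positional shift: values move down one row, the index stays the same
by_position = s.shift(1)

# Frequency shift: the index moves forward one day, the values stay the same
by_freq = s.shift(periods=1, freq='D')

# Realigning the frequency-shifted Series onto the original dates
# reproduces the positional lag (the 2023-01-06 entry is dropped)
aligned = by_freq.reindex(s.index)

print(by_position)
print(by_freq)
print(aligned)
```

The frequency-based result runs from 2023-01-02 to 2023-01-06 and contains no missing values, which is why it only looks like a lag once it is aligned back to the original dates.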


In [None]:
# Calculate the difference between the current day's value and the previous day's value
df['Daily_Change'] = df['Value'] - df['Value'].shift(1)
print(df)

            Value  Value_Lagged  Value_Lead  Value_Shifted_1D  Daily_Change
Date                                                                       
2023-01-01      1           NaN         3.0               NaN           NaN
2023-01-02      2           1.0         4.0               1.0           1.0
2023-01-03      3           2.0         5.0               2.0           1.0
2023-01-04      4           3.0         NaN               3.0           1.0
2023-01-05      5           4.0         NaN               4.0           1.0
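
The subtraction above can also be written with `diff()`, which computes the difference from the previous row directly. A minimal sketch on a standalone Series (names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name='Value')

manual = s - s.shift(1)   # explicit lag, then subtract
builtin = s.diff()        # same calculation in one call

# diff(periods=2) compares each value with the one two rows earlier
two_step = s.diff(periods=2)

print(manual.equals(builtin))
print(two_step)
```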


In [None]:
# Shift the 'Value' column down by one day and fill the NaN with the first value of the time series
df['Value_Shifted_Filled'] = df['Value'].shift(1, fill_value=df['Value'].iloc[0])
print(df)

            Value  Value_Lagged  Value_Lead  Value_Shifted_1D  Daily_Change  \
Date                                                                          
2023-01-01      1           NaN         3.0               NaN           NaN   
2023-01-02      2           1.0         4.0               1.0           1.0   
2023-01-03      3           2.0         5.0               2.0           1.0   
2023-01-04      4           3.0         NaN               3.0           1.0   
2023-01-05      5           4.0         NaN               4.0           1.0   

            Value_Shifted_Filled  
Date                              
2023-01-01                     1  
2023-01-02                     1  
2023-01-03                     2  
2023-01-04                     3  
2023-01-05                     4  


[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)

# Ranking and Ordering

| Method/Attribute   | Description                                   | Python Example                           | Parameters                                                                                                                                   | Documentation                                                                                   |
| ------------------ | --------------------------------------------- | ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `df.rank()`        | Compute numerical data ranks for each element | `df['rank'] = df['col'].rank()`          | `axis`: {0, 1}  <br>`method`: {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}  <br>`na_option`: {‘keep’, ‘top’, ‘bottom’}  <br>`ascending`: bool | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html)        |
| `df.sort_values()` | Sort by the values along a particular axis    | `df.sort_values('col', ascending=False)` | `by`: str or list of str  <br>`axis`: {0, 1}  <br>`ascending`: bool or list of bool  <br>`inplace`: bool                                     | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) |
| `df.sort_index()`  | Sort the DataFrame by its index               | `df.sort_index(ascending=False)`         | `axis`: {0, 1}  <br>`level`: int or str, or sequence of ints/strs  <br>`ascending`: bool, default True  <br>`inplace`: bool                  | [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html)  |

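The sketch below exercises the three methods from the table on a small made-up DataFrame (the `name` and `score` columns are assumptions for illustration). It shows how the `method` argument of `rank()` changes the handling of ties.

```python
import pandas as pd

scores = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'score': [90, 85, 90, 70],
})

# Rank the highest score first; the two 90s are tied
scores['rank_avg'] = scores['score'].rank(ascending=False)                    # 1.5, 3.0, 1.5, 4.0
scores['rank_min'] = scores['score'].rank(ascending=False, method='min')      # 1.0, 3.0, 1.0, 4.0
scores['rank_dense'] = scores['score'].rank(ascending=False, method='dense')  # 1.0, 2.0, 1.0, 3.0

# Sort by score (descending), breaking ties by name (ascending)
ordered = scores.sort_values(['score', 'name'], ascending=[False, True])

# sort_index() restores the original row order
restored = ordered.sort_index()

print(ordered)
print(restored)
```

`method='dense'` leaves no gaps after a tie, whereas `'min'` skips the ranks the tied rows would otherwise have occupied.
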
# String Operations

| Method | Description | Python Example | Parameters |
|---|---|---|---|
| `str.lower()` | Convert strings in the Series to lowercase. | `df['col'] = df['col'].str.lower()` | None |
| `str.upper()` | Convert strings in the Series to uppercase. | `df['col'] = df['col'].str.upper()` | None |
| `str.len()` | Compute the length of each string in the Series. | `df['col_len'] = df['col'].str.len()` | None |
| `str.strip([to_strip])` | Remove leading and trailing whitespace (or the characters in `to_strip`) from strings in the Series. | `df['col_stripped'] = df['col'].str.strip()` | `to_strip`: Characters to remove (default whitespace) |
| `str.lstrip([to_strip])` | Remove leading whitespace from strings in the Series. | `df['col_lstripped'] = df['col'].str.lstrip()` | `to_strip`: Characters to remove (default whitespace) |
| `str.rstrip([to_strip])` | Remove trailing whitespace from strings in the Series. | `df['col_rstripped'] = df['col'].str.rstrip()` | `to_strip`: Characters to remove (default whitespace) |
| `str.replace(pat, repl[, n, case, flags, regex])` | Replace occurrences of a substring, or of a regular expression when `regex=True`, in the strings of the Series. | `df['col_replaced'] = df['col'].str.replace('old', 'new')` | `pat`: Pattern to replace, `repl`: Replacement string, `n`: Max replacements, `case`: Case sensitive, `regex`: Treat `pat` as a regular expression (default False since pandas 2.0) |
| `str.cat([others, sep, na_rep, join])` | Concatenate the strings in the Series with each other, or element-wise with `others`, using an optional separator. | `df['full_name'] = df['first_name'].str.cat(sep=' ', others=df['last_name'])` | `others`: Series or list-like to concatenate, `sep`: Separator, `na_rep`: Representation of missing values, `join`: Join type |
| `str.split([pat, n, expand])` | Split strings in the Series around occurrences of a separator/pattern. | `df[['col1', 'col2']] = df['col'].str.split(n=1, expand=True)` | `pat`: Separator/pattern (default whitespace), `n`: Max splits, `expand`: Expand to DataFrame |
| `str.rsplit([pat, n, expand])` | Split strings in the Series around occurrences of a separator/pattern, starting from the right. | `df[['col1', 'col2']] = df['col'].str.rsplit(pat='_', n=1, expand=True)` | `pat`: Separator/pattern, `n`: Max splits, `expand`: Expand to DataFrame |
| `str.contains(pat[, case, flags, ...])` | Check if strings in the Series contain a specific pattern or regular expression. | `df['contains_a'] = df['col'].str.contains('a')` | `pat`: Pattern to find, `case`: Case sensitive, `flags`: Regex flags |
| `str.startswith(pat[, na])` | Check if strings in the Series start with a specified prefix. | `df['starts_with_A'] = df['col'].str.startswith('A')` | `pat`: Prefix to check, `na`: Value for missing values |
| `str.endswith(pat[, na])` | Check if strings in the Series end with a specified suffix. | `df['ends_with_y'] = df['col'].str.endswith('y')` | `pat`: Suffix to check, `na`: Value for missing values |
| `str.get(i)` | Get the element at the specified index `i` from each string in the Series. | `df['first_letter'] = df['col'].str.get(0)` | `i`: Index position |
| `str.slice([start, stop, step])` | Slice substrings from strings in the Series. | `df['sliced_str'] = df['col'].str.slice(1, 4)` | `start`: Start position, `stop`: End position (exclusive), `step`: Slice step |
| `str.slice_replace([start, stop, repl])` | Replace a slice of each string in the Series with another string. | `df['replaced_slice'] = df['col'].str.slice_replace(1, 4, 'XYZ')` | `start`: Start position, `stop`: End position (exclusive), `repl`: Replacement string |
| `str.extract(pat[, flags, expand])` | Extract the capture groups of the first regular expression match in each string of the Series. | `df['extracted'] = df['col'].str.extract(r'([A-Za-z]+)', expand=False)` | `pat`: Regular expression pattern with capture groups, `flags`: Regex flags, `expand`: Return a DataFrame (default True) rather than a Series |
| `str.extractall(pat[, flags])` | Extract the capture groups of all regular expression matches in each string. Returns a DataFrame with a MultiIndex (original index, `match` number), so the result cannot be assigned back as a column directly. | `matches = df['col'].str.extractall(r'([A-Za-z]+)')` | `pat`: Regular expression pattern with capture groups, `flags`: Regex flags |
| `str.join(sep)` | Join the elements of each list-like entry in the Series with the given separator. | `df['joined_str'] = df['col'].str.join(sep='-')` | `sep`: Separator string |
| `str.get_dummies([sep])` | Split each string in the Series by a separator and return a DataFrame of dummy/indicator variables. | `df_dummies = df['col'].str.get_dummies(sep=',')` | `sep`: Separator to split on (defaults to the pipe character) |
| `str.contains(pat, case=False, na=False)` | Case-insensitive variant of `str.contains`; `na=False` makes missing values count as no match. | `df['col'].str.contains('hello', case=False, na=False)` | `pat`: str, `case`: bool, default True, `na`: value used for missing entries (they stay missing unless set) |
| `str.startswith(pat, na=False)` | Check whether each string starts with a prefix, treating missing values as False. | `df['col'].str.startswith('Start', na=False)` | `pat`: str (a literal prefix, not a regular expression), `na`: value used for missing entries |
| `str.endswith(pat, na=False)` | Check whether each string ends with a suffix, treating missing values as False. | `df['col'].str.endswith('End', na=False)` | `pat`: str (a literal suffix, not a regular expression), `na`: value used for missing entries |
| `str.isalnum()` | Check whether all characters in each text string are alphanumeric. | `df['col'].str.isalnum()` | None |
| `str.isalpha()` | Check whether all characters in each text string are alphabetic. | `df['col'].str.isalpha()` | None |
| `str.isdigit()` | Check whether all characters in each text string are digits. | `df['col'].str.isdigit()` | None |
| `str.isspace()` | Check whether all characters in each text string are whitespace. | `df['col'].str.isspace()` | None |
| `str.islower()` | Check whether all cased characters in each text string are lowercase. | `df['col'].str.islower()` | None |
| `str.isupper()` | Check whether all cased characters in each text string are uppercase. | `df['col'].str.isupper()` | None |
| `str.istitle()` | Check whether each text string is titlecased (every word starts with an uppercase character followed by lowercase characters). | `df['col'].str.istitle()` | None |
| `str.isnumeric()` | Check whether all characters in each text string are numeric. | `df['col'].str.isnumeric()` | None |
| `str.isdecimal()` | Check whether all characters in each text string are decimal. | `df['col'].str.isdecimal()` | None |
| `str.translate(table)` | Replace each character in the string using the given translation table. | `translation_table = str.maketrans('abc', 'xyz'); df['col'].str.translate(translation_table)` | `table`: dict mapping Unicode ordinals to ordinals, strings, or None (as built by `str.maketrans`) |
| `str.pad(width, side='left', fillchar=' ')` | Pad strings in the Series with a specified character to a given width. | `df['col'].str.pad(width=10, side='left', fillchar='0')` | `width`: int, `side`: {'left', 'right', 'both'}, default 'left', `fillchar`: str, default ' ' |
| `str.center(width, fillchar=' ')` | Pad both sides of each string; equivalent to `str.pad(side='both')`. | `df['col'].str.center(width=10, fillchar=' ')` | `width`: int, `fillchar`: str, default ' ' |
| `str.ljust(width, fillchar=' ')` | Left-justify each string by padding its right side; equivalent to `str.pad(side='right')`. | `df['col'].str.ljust(width=10, fillchar=' ')` | `width`: int, `fillchar`: str, default ' ' |
| `str.rjust(width, fillchar=' ')` | Right-justify each string by padding its left side; equivalent to `str.pad(side='left')`. | `df['col'].str.rjust(width=10, fillchar=' ')` | `width`: int, `fillchar`: str, default ' ' |
| `str.zfill(width)` | Pad strings in the Series by prepending '0' characters on the left. | `df['col'].str.zfill(width=5)` | `width`: int |
| `str.encode(encoding='utf-8', errors='strict')` | Encode the strings in the Series into bytes using the specified encoding. | `df['col'] = df['col'].str.encode(encoding='utf-8')` | `encoding`: str, `errors`: str, default 'strict' |
| `str.decode(encoding='utf-8', errors='strict')` | Decode bytes in the Series into strings using the specified encoding. | `df['col'] = df['col'].str.decode(encoding='utf-8')` | `encoding`: str, `errors`: str, default 'strict' |

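A few of the methods from the table can be chained on one Series. The sketch below uses a made-up Series of names with one missing value (the data and variable names are assumptions for illustration) to show cleaning, splitting, matching, extracting, and padding.

```python
import numpy as np
import pandas as pd

s = pd.Series(['  Alice_Smith ', 'bob_jones', np.nan, 'Carol_Ng'])

# Chain accessors: trim whitespace, then lowercase; the missing value stays missing
clean = s.str.strip().str.lower()

# Split on the first underscore into two columns (0 and 1)
parts = clean.str.split('_', n=1, expand=True)

# na=False turns the missing entry into False instead of a missing result
has_smith = clean.str.contains('smith', na=False)

# expand=False returns a Series for a single capture group
first = clean.str.extract(r'^([a-z]+)', expand=False)

# rjust pads on the left, ljust pads on the right
padded_left = clean.str.rjust(12, fillchar='.')
padded_right = clean.str.ljust(12, fillchar='.')

print(parts)
print(has_smith)
print(padded_left)
print(padded_right)
```

Passing `na=False` is what makes `has_smith` safe to use directly as a boolean mask, for example `s[has_smith]`.
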
[Back to Top](#scrollTo=lNNNn0xM-C5h&line=4&uniqifier=1)