# MCLabs Churn Analyzer - Data Pipeline

This Jupyter Notebook uses the data preparation and processing modules to build a master dataset for training and testing the models.

Note that this notebook cannot be run without raw data access. If you do not have access to raw individual timestamp data, then you should skip this notebook and use the provided master dataframe dataset in the [Model Creation Notebook](MCA_ModelCreation.ipynb).

## Module and Package Imports

This section imports the modules required by this notebook.

In [1]:
# System
from pathlib import Path

# Data
import pandas as pd

# Custom Modules
from mcalib import McaDataUtils, McaDataPrepare, McaFeaturePipeline, McaTargetPipeline

# Output/Display
from tqdm import tqdm

## Pre-Model Data Pipeline

This section will:
- Gather all available individual timestamp datasets
- Pass them through the data preparation pipeline
- Combine them using a sliding window of three consecutive timestamps: the first two are joined into a feature set, and the third supplies the target set
- Collect all final feature and target samples into a master dataset for training and testing

Note that if the above option for using the master dataframe is set to True, then this section will be skipped, and the notebook will load the master dataframe from file in the next section.
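The sliding-window step described above can be sketched in isolation. This is a minimal illustration, assuming the timestamps are numeric strings (as the `float(...)` casts in the pipeline cell suggest). `slidingWindows` and the example values are hypothetical and not part of `mcalib`.

```python
from typing import List, Tuple


def slidingWindows(timestamps: List[str], size: int = 3) -> List[Tuple[str, ...]]:
    """Return every run of `size` consecutive timestamps, in chronological order."""
    # Sort numerically so that neighbouring entries are adjacent in time
    ordered = sorted(timestamps, key=float)
    return [tuple(ordered[i:i + size]) for i in range(len(ordered) - size + 1)]


# Hypothetical timestamps, for illustration only
example = ["400.0", "100.0", "300.0", "200.0"]
for t1, t2, t3 in slidingWindows(example):
    # t1 and t2 feed the feature set; t3 supplies the target
    print(f"features: {t1} -> {t2}, target: {t3}")
```

With a window of three, `n` timestamps yield `n - 2` windows, which matches the 20 timestamps and 18 windows reported in the cell output below.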

In [2]:
# Utility function to build a single window of data for pipeline
def buildPipelineWindow(timestamp1: str, timestamp2: str, timestamp3: str) -> pd.DataFrame:
	"""Build one sample set: features from timestamps 1 and 2, target from timestamp 3."""

	# Load three data files for model training
	df_t1 = McaDataUtils.getDfForTimestamp(timestamp=timestamp1)
	df_t2 = McaDataUtils.getDfForTimestamp(timestamp=timestamp2)
	df_t3 = McaDataUtils.getDfForTimestamp(timestamp=timestamp3)

	# Prepare all datasets
	df_t1 = McaDataPrepare.prepareData(df=df_t1, dfTimestamp=float(timestamp1))
	df_t2 = McaDataPrepare.prepareData(df=df_t2, dfTimestamp=float(timestamp2))
	df_t3 = McaDataPrepare.prepareData(df=df_t3, dfTimestamp=float(timestamp3))

	# Perform feature engineering between the first two timestamps
	df = McaFeaturePipeline.combineData(currentDf=df_t2, previousDf=df_t1)

	# Perform target engineering between the last two timestamps
	df = McaTargetPipeline.buildTarget(currentDf=df, futureDf=df_t3, onlyReturnTarget=False)

	# Drop UUIDs before modeling
	df = McaDataUtils.clearUUIDs(df=df)

	# For now, drop rows where target is 0 (completely inactive)
	df = df[df["churn"] != 0].reset_index(drop=True)

	return df

# Get all of the timestamps available, sorted chronologically
# (iterdir() returns entries in arbitrary order, but each window must be consecutive)
dataDirectory = Path("../data/gatheringoutput/")
timestamps = sorted((path.name for path in dataDirectory.iterdir() if path.is_dir()), key=float)
print(f"Building master dataframe from {len(timestamps)} total timestamps!")

# Build each sliding window of three consecutive timestamps
windows = [timestamps[i:i+3] for i in range(len(timestamps) - 2)]
windowDfs = []
for window in tqdm(iterable=windows, desc="Processing Windows", unit="window"):
	windowDfs.append(buildPipelineWindow(timestamp1=window[0], timestamp2=window[1], timestamp3=window[2]))

# Combine all windows into the master dataframe with a single concat
masterDf = pd.concat(windowDfs, ignore_index=True) if windowDfs else pd.DataFrame()

# Save master dataframe to file for later use independent of raw data access
masterDf.to_csv("../data/master/master_dataframe.csv", index=False)
print("Saved master dataframe to `../data/master/master_dataframe.csv`!")

Building master dataframe from 20 total timestamps!


Processing Windows: 100%|██████████| 18/18 [00:04<00:00,  4.19window/s]

Saved master dataframe to `../data/master/master_dataframe.csv`!
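Before handing the saved file to the modeling notebook, a quick consistency check can catch an empty or malformed result. The following is a minimal sketch, assuming the target column is named `churn` and that rows with a target of 0 have been dropped, as in the pipeline cell above. `checkMasterDf`, the `feature_a` column, and the sample values are hypothetical.

```python
import pandas as pd


def checkMasterDf(df: pd.DataFrame, targetColumn: str = "churn") -> None:
    """Raise ValueError if the frame is empty, lacks the target, or keeps inactive rows."""
    if df.empty:
        raise ValueError("master dataframe is empty")
    if targetColumn not in df.columns:
        raise ValueError(f"missing target column: {targetColumn}")
    if (df[targetColumn] == 0).any():
        raise ValueError("found rows with target 0, which the pipeline should have dropped")


# Hypothetical stand-in for pd.read_csv("../data/master/master_dataframe.csv")
sample = pd.DataFrame({"feature_a": [0.1, 0.7], "churn": [1, 2]})
checkMasterDf(sample)
print("master dataframe looks consistent")
```

In the notebook itself, the same function could be called on `masterDf` right after the save step.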



