# MCLabs Churn Analyzer - Data Preparation
Author: @cmh02

This Jupyter notebook handles general data preparation for the model. Because our data comes from separate sources, we first anonymize it and combine it into a single dataset. We then record some values so that we can derive more features later, and finally pre-process the data to address a variety of concerns.

In [7]:
'''
MODULE/PACKAGE IMPORTS
'''

# System
import os
import hashlib
from glob import glob
from dotenv import load_dotenv

# Data
import numpy as np
import pandas as pd

# Output/Display
from tqdm import tqdm

In [8]:
'''
ENVIRONMENT VARIABLES
'''

# Load environment file using python-dotenv
load_dotenv(dotenv_path="../env/.env")

# Load environmental variables
MCA_PEPPERKEY = os.getenv("MCA_PEPPERKEY")

# Ensure environmental variables are set
if not MCA_PEPPERKEY:
    raise ValueError("Missing required environment variable: MCA_PEPPERKEY")

In [10]:
'''
DATA ANONYMIZATION

To protect player privacy, the first step of our data preparation is anonymizing the data.
We take the data files in `gatheringoutput` and replace the UUID field with a peppered
SHA-256 hash of itself. We then save a private version, which includes these hashes, for our
own use, along with a public version, with no UUIDs or hashes, for external analysis.
'''

# Define input and output folder paths
folderPath_gatheringoutput = "../data/gatheringoutput/"
folderPath_anonoutput_public = "../data/anonoutput/public/"
folderPath_anonoutput_private = "../data/anonoutput/private/"

# Create output folders if they don't exist (exist_ok=True makes a separate existence check unnecessary)
os.makedirs(folderPath_anonoutput_public, exist_ok=True)
os.makedirs(folderPath_anonoutput_private, exist_ok=True)

# Get the names of all gatheringoutput files using glob
gatheringOutputFiles = glob(os.path.join(folderPath_gatheringoutput, "**", "*.csv"), recursive=True)

# Iterate through the files
for filePath in tqdm(iterable=gatheringOutputFiles, desc="Anonymizing Data Files", unit="file"):
    # Read the CSV file
    df = pd.read_csv(filePath)

    # Anonymize the UUID column (UUID -> sha256("PEPPER:UUID") as a hex digest)
    df['UUID'] = [hashlib.sha256(f"{MCA_PEPPERKEY}:{uuid}".encode()).hexdigest() for uuid in df['UUID']]

    # Get relative path for gatheringoutput file location
    dataRelativeFilePath = os.path.relpath(filePath, folderPath_gatheringoutput)
    
    # Create private output path and save dataframe to path
    outputFilePath = os.path.join(folderPath_anonoutput_private, dataRelativeFilePath)
    os.makedirs(os.path.dirname(outputFilePath), exist_ok=True)
    df.to_csv(outputFilePath, index=False)

    # Drop the hashed UUID column so the public copy contains no identifiers
    df.drop(columns=['UUID'], inplace=True)

    # Create public output path and save dataframe to path
    outputFilePath = os.path.join(folderPath_anonoutput_public, dataRelativeFilePath)
    os.makedirs(os.path.dirname(outputFilePath), exist_ok=True)
    df.to_csv(outputFilePath, index=False)


Anonymizing Data Files: 100%|██████████| 7/7 [00:00<00:00, 54.48file/s]
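
The hashing step in the cell above can be isolated into a small function to make its properties easier to see. This is a minimal sketch, not part of the notebook itself: the helper name `anonymize_uuid` and the example pepper and UUID values are hypothetical, and only the `sha256("PEPPER:UUID")` scheme is taken from the cell above.

```python
import hashlib


def anonymize_uuid(uuid: str, pepper: str) -> str:
    """Return the SHA-256 hex digest of '<pepper>:<uuid>' (hypothetical helper)."""
    return hashlib.sha256(f"{pepper}:{uuid}".encode()).hexdigest()


# Hypothetical values for illustration only; the real pepper comes from the .env file
example_pepper = "not-a-real-pepper"
example_uuid = "123e4567-e89b-12d3-a456-426614174000"

digest_a = anonymize_uuid(example_uuid, example_pepper)
digest_b = anonymize_uuid(example_uuid, example_pepper)
digest_c = anonymize_uuid(example_uuid, "a-different-pepper")

print(len(digest_a))         # 64 hex characters
print(digest_a == digest_b)  # True: same pepper and UUID give the same hash
print(digest_a == digest_c)  # False: a different pepper gives a different hash
```

Because the hash is deterministic for a fixed pepper, the same player maps to the same value in every data file, which is what allows the private files to be joined on the hashed UUID later.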


In [12]:
'''
DATA COMBINING

The next step in the data preparation process is combining the data from our various
sources into a single dataset. We take all data files located in the private `anonoutput`
data directory and outer-join them on the UUID hash. The combined data is then saved to the
`combined` directory: a private version with the hashes and a public version without them.
'''

# Define input and output folder paths
folderPath_anonoutput_private = "../data/anonoutput/private/"
folderPath_combined_public = "../data/combined/public/"
folderPath_combined_private = "../data/combined/private/"

# Create output folders if they don't exist (exist_ok=True makes a separate existence check unnecessary)
os.makedirs(folderPath_combined_public, exist_ok=True)
os.makedirs(folderPath_combined_private, exist_ok=True)

# Get the names of all anonoutput files using glob
anonOutputFiles = glob(os.path.join(folderPath_anonoutput_private, "**", "*.csv"), recursive=True)

# Initialize an empty DataFrame to hold combined data
combinedDataFrame = pd.DataFrame(columns=["UUID"])

# Iterate through the files
for filePath in tqdm(iterable=anonOutputFiles, desc="Combining Data Files", unit="file"):
    # Read the CSV file
    df = pd.read_csv(filePath)

    # Outer-merge this file into the combined DataFrame on the hashed UUID
    combinedDataFrame = pd.merge(left=combinedDataFrame, right=df, on="UUID", how="outer")

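# Get relative path for output file, based on the last input file's directory
if not anonOutputFiles:
    raise FileNotFoundError(f"No CSV files found in {folderPath_anonoutput_private}")
dataRelativeFilePath = os.path.relpath(anonOutputFiles[-1], folderPath_anonoutput_private)
directoryPath = os.path.dirname(dataRelativeFilePath)
dataRelativeFilePath = os.path.join(directoryPath, "combined.csv")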

# Create private output path and save combined data
outputFilePath = os.path.join(folderPath_combined_private, dataRelativeFilePath)
os.makedirs(os.path.dirname(outputFilePath), exist_ok=True)
combinedDataFrame.to_csv(outputFilePath, index=False)

# Drop the UUID column
combinedDataFrame.drop(columns=["UUID"], inplace=True)

# Create public output path and save combined data
outputFilePath = os.path.join(folderPath_combined_public, dataRelativeFilePath)
os.makedirs(os.path.dirname(outputFilePath), exist_ok=True)
combinedDataFrame.to_csv(outputFilePath, index=False)

Combining Data Files: 100%|██████████| 7/7 [00:00<00:00, 111.39file/s]
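
To illustrate what the outer join on the UUID hash does, here is a small sketch with two toy tables. The column names `hours_played` and `purchase_count` and the short `UUID` values are made up for illustration. The sketch also uses `functools.reduce` rather than seeding the merge with an empty DataFrame as the cell above does; the join itself is the same.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-source tables keyed by the hashed UUID
playtime = pd.DataFrame({"UUID": ["a1", "b2"], "hours_played": [10.5, 3.0]})
purchases = pd.DataFrame({"UUID": ["b2", "c3"], "purchase_count": [2, 1]})

# Outer-merge every table on UUID so that no player is dropped
combined = reduce(
    lambda left, right: pd.merge(left=left, right=right, on="UUID", how="outer"),
    [playtime, purchases],
)

print(combined)
```

A player who appears in only one source still gets a row, with `NaN` in the columns that come from the other sources. These gaps are something the later pre-processing has to handle. If two sources shared a column name other than `UUID`, pandas would by default append `_x` and `_y` suffixes to keep them apart.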
