# MCLabs Churn Analyzer - Data Preparation

This Jupyter Notebook handles general data preparation for the model. Because our data comes from several separate sources, we first combine it into a single source. We then record some values so that we can derive additional features later. Finally, we pre-process the data to address a variety of concerns.

In [3]:
'''
MODULE/PACKAGE IMPORTS
'''

# System
import os
from glob import glob

# Data
import numpy as np
import pandas as pd

In [4]:
'''
DATA ANONYMIZATION

To protect player privacy, the first step of our data preparation is anonymizing the data.
We take the data files in gatheringoutput and replace the UUID field of each entry
with a placeholder name that represents that player.
'''

# Define input and output folder paths
folderPath_gatheringoutput = "../data/gatheringoutput/"
folderPath_anonoutput = "../data/anonoutput/"

# Create the output folder if it doesn't exist (exist_ok=True makes a separate existence check unnecessary)
os.makedirs(folderPath_anonoutput, exist_ok=True)

# Get the names of all gatheringoutput files using glob
gatheringOutputFiles = glob(os.path.join(folderPath_gatheringoutput, "*.csv"))

# Iterate through the files
for filePath in gatheringOutputFiles:
    # Read the CSV file
    df = pd.read_csv(filePath)

    # Anonymize the UUID column (UUID -> PlayerXXXXX)
    # Placeholders are assigned by row position, so numbering restarts at Player00001 in every file
    df['UUID'] = [f'Player{i + 1:05d}' for i in range(len(df))]

    # Save the anonymized data to a new CSV file
    outputFilePath = os.path.join(folderPath_anonoutput, os.path.basename(filePath))
    df.to_csv(outputFilePath, index=False)
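
The loop above numbers placeholders by row position within each file, so the same placeholder can end up referring to different players in different files. If the files are later combined by player, a single mapping shared across all files may be preferable. The following is only a minimal sketch of that idea, assuming the same UUID can appear in more than one file; the helper name `build_placeholder_map` and the sample UUID strings are hypothetical and do not come from the notebook.

```python
def build_placeholder_map(uuid_columns):
    """Map each distinct UUID to one placeholder, in order of first appearance.

    `uuid_columns` is an iterable of iterables of UUID strings, for example
    one UUID column per input file. (Hypothetical helper, not from the notebook.)
    """
    mapping = {}
    for column in uuid_columns:
        for uuid in column:
            if uuid not in mapping:
                # Next free number, zero-padded to five digits
                mapping[uuid] = f"Player{len(mapping) + 1:05d}"
    return mapping


# Made-up UUID values standing in for the UUID columns of two files
file_a_uuids = ["uuid-a", "uuid-b"]
file_b_uuids = ["uuid-b", "uuid-c"]

mapping = build_placeholder_map([file_a_uuids, file_b_uuids])

# The shared player "uuid-b" receives the same placeholder in both files
anon_a = [mapping[u] for u in file_a_uuids]
anon_b = [mapping[u] for u in file_b_uuids]
print(anon_a)
print(anon_b)
```

With pandas, the same mapping could be applied to each file with `df['UUID'] = df['UUID'].map(mapping)`. Because the mapping links placeholders back to real UUIDs, it should not be saved alongside the anonymized files.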
