The transformation pipeline prepares the data so that downstream processes, such as a machine learning model, a recommendation system, or other analytics, can consume it easily.

For the first stage, our primary goal is to develop a beta version of the recommendation system that recommends a job title based on job_qualifications, job_type, and job_location. We are not currently focusing on job_summary and job_description.

We'll focus on ensuring the job_qualifications, job_type, and job_location columns are appropriately processed and transformed for the recommendation system.

1. Datetime Conversion and Feature Extraction:

    a. Convert the date_of_job_post column into a datetime format.
    b. Extract the month, day, and year from date_of_job_post and save them as separate columns (the implementation further down also extracts the hour, minute, and second). This can help with any time-based analysis or features you might want to consider in the future.

2. Categorical Encoding:

    a. Convert the categorical columns (job_location, company_name, job_type, and especially title, since it is your target) to a numerical format.
    b. Use an appropriate encoding technique for each column. For title, which is the primary recommendation target, label encoding might be sufficient. For features like job_location or job_type, consider one-hot encoding if the number of unique categories is not too high, and label encoding otherwise.

3. Handling Missing Values:

    a. Ensure there are no missing values in your crucial columns like job_qualifications, job_type, and job_location. Decide on a strategy to handle them.
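
The three steps above can be sketched without any third-party dependency. The snippet below is a minimal illustration, not the pipeline's implementation: the records, the timestamp format, and the helper names (`drop_incomplete`, `add_date_parts`, `label_encode`) are assumptions made for the example. The `label_encode` helper assigns integer codes in sorted order of the distinct values, which is the same convention scikit-learn's LabelEncoder follows.

```python
from datetime import datetime

# Hypothetical records; the real dataset's values and date format may differ.
rows = [
    {"title": "Data Analyst", "job_type": "Full-time", "job_location": "Remote",
     "date_of_job_post": "2023-08-01 09:30:00"},
    {"title": "Data Engineer", "job_type": None, "job_location": "Remote",
     "date_of_job_post": "2023-08-03 14:05:10"},
    {"title": "ML Engineer", "job_type": "Contract", "job_location": "On-site",
     "date_of_job_post": "2023-08-05 08:00:00"},
]

REQUIRED = ("job_type", "job_location")


def drop_incomplete(records):
    """Step 3: keep only records where every required field is present."""
    return [r for r in records if all(r.get(col) is not None for col in REQUIRED)]


def add_date_parts(record):
    """Step 1: parse the timestamp and expose year, month, and day as fields."""
    ts = datetime.strptime(record["date_of_job_post"], "%Y-%m-%d %H:%M:%S")
    return {**record, "year_of_job_post": ts.year,
            "month_of_job_post": ts.month, "day_of_job_post": ts.day}


def label_encode(values):
    """Step 2: map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    codes = {value: index for index, value in enumerate(classes)}
    return [codes[v] for v in values], classes


# Missing values are dropped first so the encoder never sees them.
clean = [add_date_parts(r) for r in drop_incomplete(rows)]
title_codes, title_classes = label_encode([r["title"] for r in clean])

print(len(clean))       # 2 (the record without job_type is dropped)
print(title_classes)    # ['Data Analyst', 'ML Engineer']
print(title_codes)      # [0, 1]
```

Dropping incomplete records before encoding keeps missing values from ever becoming a category of their own.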




In [1]:
import os

In [2]:
%pwd

'/Users/macbookpro/Desktop/pixi_hr_project/pixi_hr/research'

In [3]:
os.chdir('../')

In [4]:
%pwd

'/Users/macbookpro/Desktop/pixi_hr_project/pixi_hr'

Update config.yaml entry...

In [7]:
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class DataTransformationConfig:
    """
    Configuration entity for the Data Transformation pipeline.
    
    This configuration entity provides necessary paths and directories 
    related to data transformation activities.
    
    Attributes:
    - root_dir (Path): The root directory where data transformation artifacts are stored.
    - data_path (Path): The path to the dataset (typically CSV) that needs to be transformed.
    """

    # Root directory for storing transformation-related artifacts
    root_dir: Path

    # Path to the validated dataset for transformation
    data_path: Path
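
As a quick illustration of what `frozen=True` provides here, the sketch below redefines the same two-field dataclass and shows that an instance cannot be modified after creation. The two paths are hypothetical placeholders, not the project's actual config values.

```python
from dataclasses import dataclass, FrozenInstanceError
from pathlib import Path


@dataclass(frozen=True)
class DataTransformationConfig:
    root_dir: Path
    data_path: Path


# Hypothetical paths, used only for this example
cfg = DataTransformationConfig(
    root_dir=Path("artifacts/data_transformation"),
    data_path=Path("artifacts/data.csv"),
)

try:
    cfg.root_dir = Path("elsewhere")   # assignment on a frozen dataclass
    mutated = True
except FrozenInstanceError:
    mutated = False                    # the config stays read-only

print(mutated)  # False
```

Keeping the config immutable means a pipeline stage cannot accidentally redirect the paths that a later stage depends on.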

Update the ConfigurationManager in configuration.py

In [5]:
from src.pixi_hr.constants import *
from src.pixi_hr.utils.common import read_yaml, create_directories

In [8]:
class ConfigurationManager:
    def __init__(
            self,
            config_filepath=CONFIG_FILE_PATH,    # Path to the main configuration YAML file
            params_filepath=PARAMS_FILE_PATH,    # Path to the parameters YAML file
            schema_filepath=SCHEMA_FILE_PATH):   # Path to the schema YAML file
        
        # Load configurations, parameters, and schema details from their respective YAML files
        self.config = read_yaml(config_filepath)
        self.params = read_yaml(params_filepath)
        self.schema = read_yaml(schema_filepath)

        # Create root directory for storing all artifacts, as specified in the configuration
        create_directories([self.config.artifacts_root])

    def get_data_transformation_config(self) -> DataTransformationConfig:
        """
        Extracts the data transformation configuration from the main config.

        The function performs the following steps:
        1. Reads the data transformation section of the configuration.
        2. Ensures the specified root directory for data transformation exists (creates it if not).
        3. Returns a DataTransformationConfig object initialized with the extracted configuration.

        Returns:
        - DataTransformationConfig: A dataclass object containing the data transformation configuration.
        """
        
        # Extract the data transformation configuration from the main config
        config = self.config.data_transformation

        # Ensure the specified root directory for data transformation exists (creates it if not)
        create_directories([config.root_dir])

        # Create an instance of the DataTransformationConfig dataclass using the extracted configuration
        # Wrap the values in Path so they match the dataclass annotations
        data_transformation_config = DataTransformationConfig(
            root_dir=Path(config.root_dir),
            data_path=Path(config.data_path)
        )

        return data_transformation_config
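
`get_data_transformation_config` relies on attribute access (`self.config.data_transformation.root_dir`), so `read_yaml` is assumed to return an object that exposes YAML keys as attributes. The sketch below imitates that with `types.SimpleNamespace`. The key names `root_dir` and `data_path` come from the code above, while the path values and the `build_config` helper are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from types import SimpleNamespace


@dataclass(frozen=True)
class DataTransformationConfig:
    root_dir: Path
    data_path: Path


# Stand-in for the parsed config.yaml; the path values are hypothetical
parsed = SimpleNamespace(
    artifacts_root="artifacts",
    data_transformation=SimpleNamespace(
        root_dir="artifacts/data_transformation",
        data_path="artifacts/data.csv",
    ),
)


def build_config(config) -> DataTransformationConfig:
    """Pick out the data_transformation section and wrap it in the dataclass."""
    section = config.data_transformation
    # Plain strings are converted to Path objects to match the annotations
    return DataTransformationConfig(
        root_dir=Path(section.root_dir),
        data_path=Path(section.data_path),
    )


cfg = build_config(parsed)
print(cfg.root_dir.name)   # data_transformation
```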


Create Components

In [12]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from pixi_hr import logger
from sklearn.preprocessing import LabelEncoder
import re
from sklearn.preprocessing import MultiLabelBinarizer

In [16]:
class DataTransformation:
    """
    DataTransformation class for transforming the raw data into a machine learning ready format.
    
    Attributes:
    - config (DataTransformationConfig): Configuration object containing paths.
    - df (DataFrame): Pandas DataFrame loaded from the specified data file.
    """
    
    def __init__(self, config: DataTransformationConfig):
        """
        Initializes the DataTransformation class.

        Args:
        - config (DataTransformationConfig): Configuration object containing paths.
        """
        self.config = config

        try:
            # Load the data into a DataFrame
            self.df = pd.read_csv(self.config.data_path)
        except FileNotFoundError:
            logger.error(f"File not found: {self.config.data_path}")
            raise
        except Exception as e:
            logger.error(f"Error reading data file: {e}")
            raise


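    def clean_skills(self, skill_list_str):
        """Clean the skills from the job_qualifications column."""
        # Local import so the method carries its own dependency
        from ast import literal_eval

        # Convert the string representation of a list to an actual list.
        # literal_eval only parses Python literals, so unlike eval() it cannot
        # execute arbitrary code embedded in the data.
        skill_list = literal_eval(skill_list_str)

        # Clean each skill
        cleaned_skills = []
        for skill in skill_list:
            cleaned_skill = skill.strip().lower()  # Convert to lowercase and remove leading/trailing whitespaces
            cleaned_skill = re.sub(r'[^a-z0-9]', '', cleaned_skill)  # Remove everything except lowercase letters and digits (inner spaces too)
            cleaned_skills.append(cleaned_skill)

        return cleaned_skills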
    
    def one_hot_encode_qualifications(self):
        """One-hot encodes the job qualifications."""
        # Clean the skills
        self.df['job_qualifications'] = self.df['job_qualifications'].apply(self.clean_skills)

        # Initialize the MultiLabelBinarizer
        mlb = MultiLabelBinarizer()

        # One-hot encode the cleaned skills
        encoded_qualifications = mlb.fit_transform(self.df['job_qualifications'])

        # Add a prefix to the encoded column names
        encoded_columns = [f"qual_{col}" for col in mlb.classes_]

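        # Convert the one-hot encoded skills into a DataFrame. Reuse self.df's index so the
        # concat below aligns rows correctly even after rows with missing values were dropped
        encoded_df = pd.DataFrame(encoded_qualifications, columns=encoded_columns, index=self.df.index)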

        # Drop the original job_qualifications column and concat the encoded DataFrame
        self.df = pd.concat([self.df.drop('job_qualifications', axis=1), encoded_df], axis=1)

        logger.info("One-hot encoding of job qualifications completed.")


    
    def datetime_conversion_and_extraction(self):
        """
        Convert the date_of_job_post column into a datetime format and extract
        year, month, day, hour, minute, and second as separate columns.
        """
        self.df['date_of_job_post'] = pd.to_datetime(self.df['date_of_job_post'])
        
        # Extracting different datetime features
        self.df['month_of_job_post'] = self.df['date_of_job_post'].dt.month
        self.df['day_of_job_post'] = self.df['date_of_job_post'].dt.day
        self.df['year_of_job_post'] = self.df['date_of_job_post'].dt.year
        self.df['hour_of_job_post'] = self.df['date_of_job_post'].dt.hour
        self.df['minute_of_job_post'] = self.df['date_of_job_post'].dt.minute
        self.df['second_of_job_post'] = self.df['date_of_job_post'].dt.second

        logger.info("Datetime conversion and feature extraction completed.")
    
   
    def categorical_encoding(self):
        """
        Convert categorical columns to numerical format using label encoding.
        """
        # Keep the fitted encoders on the instance so codes can be mapped back to labels later
        self.label_encoders = {}
        for col in ['title', 'job_location', 'company_name', 'job_type']:
            le = LabelEncoder()
            self.df[col] = le.fit_transform(self.df[col])
            self.label_encoders[col] = le
        logger.info("Categorical encoding completed.")

    
    def handle_missing_values(self):
        """
        Handle missing values in the specified columns by dropping the rows that contain them.
        """
        self.df.dropna(subset=['job_qualifications', 'job_type', 'job_location'], inplace=True)
        logger.info("Handling of missing values completed.")

    
    def split_data(self):
        """
        Split the data into train and test sets and save them to respective paths.
        """
        train, test = train_test_split(self.df, test_size=0.2, random_state=44)

        train.to_csv(os.path.join(self.config.root_dir, 'train_data.csv'), index=False)
        test.to_csv(os.path.join(self.config.root_dir, 'test_data.csv'), index=False)

        logger.info("Data split into train and test sets and saved to respective paths.")
        logger.info(f"Train shape: {train.shape}")
        logger.info(f"Test shape: {test.shape}")

        print(f"Train shape: {train.shape}")
        print(f"Test shape: {test.shape}")

    
    def main(self):
        """
        Orchestrates the sequence of data transformations.
        """
        logger.info("Starting Data Transformation...")
        
        # Step 1: Datetime Conversion and Feature Extraction
        logger.info("Datetime Conversion and Feature Extraction...")
        self.datetime_conversion_and_extraction()

        # Step 2: Handling Missing Values (run before encoding so that rows with
        # missing values are removed before LabelEncoder sees them)
        logger.info("Handling Missing Values...")
        self.handle_missing_values()

        # Step 3: Categorical Encoding
        logger.info("Categorical Encoding...")
        self.categorical_encoding()

        # Step 4: One-Hot Encoding of Job Qualifications
        logger.info("One-Hot Encoding of Job Qualifications...")
        self.one_hot_encode_qualifications()

        # Step 5: Split Data into Train and Test sets
        logger.info("Splitting Data...")
        self.split_data()

        logger.info("Data Transformation completed successfully.")
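
To make the qualification encoding concrete, here is a dependency-free sketch of what `clean_skills` followed by a multi-label binarizer computes. It is an illustration under assumptions, not the component itself: the input strings are invented, and `multi_hot` is a hypothetical stand-in for `MultiLabelBinarizer` that orders its columns by the sorted vocabulary.

```python
import ast
import re


def clean_skills(skill_list_str):
    """Parse a stringified list and normalise each skill to lowercase alphanumerics."""
    skills = ast.literal_eval(skill_list_str)
    return [re.sub(r"[^a-z0-9]", "", s.strip().lower()) for s in skills]


def multi_hot(rows):
    """Return the sorted vocabulary and one 0/1 row per input list."""
    vocab = sorted({skill for row in rows for skill in row})
    matrix = []
    for row in rows:
        present = set(row)
        matrix.append([1 if skill in present else 0 for skill in vocab])
    return vocab, matrix


# Hypothetical job_qualifications values
raw = ["['Python', ' SQL ']", "['sql', 'Power BI']"]

cleaned = [clean_skills(r) for r in raw]
vocab, matrix = multi_hot(cleaned)
columns = [f"qual_{skill}" for skill in vocab]

print(cleaned)   # [['python', 'sql'], ['sql', 'powerbi']]
print(columns)   # ['qual_powerbi', 'qual_python', 'qual_sql']
print(matrix)    # [[0, 1, 1], [1, 0, 1]]
```

One consequence of stripping every non-alphanumeric character is that distinct skills can collapse into the same token: `clean_skills("['C++', 'C#']")` yields `['c', 'c']`. If that distinction matters for the recommendations, the regular expression would need to keep characters such as `+` and `#`.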

Pipeline

In [17]:
from pixi_hr import logger
from pixi_hr.config.configuration import ConfigurationManager
# The pipeline instantiates the DataTransformation component, so import that class
from pixi_hr.components.data_transformation import DataTransformation

class DataTransformationPipeline:

    STAGE_NAME = "Data Transformation Stage"

    def __init__(self):
        pass


    def main(self):
        logger.info("Starting the Data Transformation Pipeline")

        # Step 1: Initialize Configuration Manager
        logger.info("Initializing configuration manager")
        config = ConfigurationManager()

        # Step 2: Fetch Data Transformation Configuration
        logger.info("Fetching Data Transformation Configuration..")
        data_transformation_config = config.get_data_transformation_config()

        # Step 3: Initialize Data Transformation Component
        logger.info("Initializing Data Transformation Component...")
        data_transformation = DataTransformation(config=data_transformation_config)

        # Step 4: Transform Data
        logger.info("Transforming data...")
        data_transformation.main()

        logger.info("Data Transformation Pipeline completed successfully.")


if __name__ == '__main__':
    try:
        logger.info(f">>>>>> Stage: {DataTransformationPipeline.STAGE_NAME} started <<<<<<")
        data_transformation_pipeline = DataTransformationPipeline()
        data_transformation_pipeline.main()
        logger.info(f">>>>>> Stage {DataTransformationPipeline.STAGE_NAME} completed <<<<<< \n\nx==========x")
    except Exception as e:
        logger.exception(f"Error encountered during the Data Transformation Pipeline: {e}")
        raise


[2023-08-26 13:30:57,397: 37: pixi_hr_project_logger: INFO: 231967474:  >>>>>> Stage: Data Transformation Stage started <<<<<<]
[2023-08-26 13:30:57,398: 14: pixi_hr_project_logger: INFO: 231967474:  Starting the Data Transformation Pipeline]
[2023-08-26 13:30:57,399: 17: pixi_hr_project_logger: INFO: 231967474:  Initializing configuration manager]
[2023-08-26 13:30:57,403: 41: pixi_hr_project_logger: INFO: common:  yaml file: config/config.yaml loaded successfully]
[2023-08-26 13:30:57,405: 41: pixi_hr_project_logger: INFO: common:  yaml file: params.yaml loaded successfully]
[2023-08-26 13:30:57,406: 41: pixi_hr_project_logger: INFO: common:  yaml file: schema.yaml loaded successfully]
[2023-08-26 13:30:57,407: 64: pixi_hr_project_logger: INFO: common:  Created directory at: artifacts]
[2023-08-26 13:30:57,407: 21: pixi_hr_project_logger: INFO: 231967474:  Fetching Data Transformation Configuration..]
[2023-08-26 13:30:57,408: 64: pixi_hr_project_logger: INFO: common:  Created direct