### Overview

This script clusters job postings from a dataset (`df_A`) by matching their titles and descriptions against predefined domains and sub-domains, each with a list of associated keywords, from a second dataset (`df_B`). It scores every domain:sub-domain cluster from the keyword matches, converts the scores to probabilities, and assigns each job posting a primary and a secondary cluster based on those probabilities.

### Steps and Key Functions

1. **Data Loading:**
   - The job postings data is loaded from a CSV file (`postings.csv`) into a pandas DataFrame `df_A`.
   - Domain and keyword mapping data is loaded from an Excel file (`Domain_TD.xlsx`) into another DataFrame `df_B`.

2. **`find_clusters_with_probabilities` function:**
   This function scores how well a job posting fits each domain:sub-domain cluster in `df_B`:
   - It lower-cases the title and description and compares them with the comma-separated keywords listed for each cluster in `df_B`.
   - Each cluster receives a score of up to 100 points:
     - If any word in the title exactly equals one of the cluster's keywords, 50 points are added.
     - If 4 or more of the cluster's keywords appear in the description (as substrings), 50 points are added.
     - If fewer than 4 keywords appear in the description, the 50 points are scaled down to `(matches / 4) * 50`.
   
3. **Cluster Assignment:**
   - `find_clusters_with_probabilities` is applied to each row (job posting) of `df_A`, and the resulting dictionary is stored in the new `clusters` column.
   
4. **Probability Calculation:**
   - The script converts each cluster's score into a probability: the cluster's score divided by the sum of the scores of all clusters for that posting, multiplied by 100.
   - If a cluster's score reaches 100 (a title match plus 4 or more description matches), it is assigned a probability of 100% directly, so the probabilities for one posting do not necessarily sum to 100.

5. **Primary and Secondary Clusters:**
   - After assigning clusters, the function `get_primary_secondary` sorts the clusters by their probabilities in descending order and assigns the primary and secondary clusters accordingly.
   - The primary and secondary clusters are stored in new columns: `primary_cluster` and `secondary_cluster` in `df_A`.

6. **Saving Results:**
   - After the `clusters` column is computed, `df_A` is saved to a new CSV file (`df_A_clustered.csv`). Once the primary and secondary clusters are added, it is saved again as `df_A_clustered_PS.csv`.
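
The scoring and normalization rules in steps 2 and 4 can be sketched for a single posting as follows. This is a minimal, pandas-free illustration rather than the script itself: `score_cluster`, `normalize`, and the keyword lists are hypothetical names and data chosen for the example, and the zero-total guard in `normalize` is a defensive assumption of this sketch.

```python
def score_cluster(title, description, keywords):
    """Score one cluster: up to 50 points from the title, up to 50 from the description."""
    title_words = title.lower().split()
    text = description.lower()
    # Title: any whole word equal to a keyword earns the full 50 points.
    title_score = 50 if any(word in keywords for word in title_words) else 0
    # Description: each keyword found as a substring counts once, capped at 4 hits.
    hits = sum(1 for kw in keywords if kw in text)
    description_score = min(hits, 4) / 4 * 50
    return title_score + description_score


def normalize(scores):
    """Convert raw scores to percentages; a raw score of 100 is pinned to 100%."""
    total = sum(scores.values())
    if total == 0:
        # Assumption of this sketch: nothing matched, so report a fallback label.
        return {"Uncategorized": 100}
    return {
        name: 100.0 if score >= 100 else score / total * 100.0
        for name, score in scores.items()
    }


# Hypothetical keyword lists, not taken from Domain_TD.xlsx.
clusters = {
    "nurse:healthcare": ["nurse", "patient", "clinical", "hospital", "care"],
    "Accountant:finance": ["accountant", "ledger", "audit", "tax"],
}
title = "Registered Nurse"
description = "Provide patient care in a hospital; support clinical audit work."

raw = {name: score_cluster(title, description, kws) for name, kws in clusters.items()}
print(raw)
print(normalize(raw))
```

In this example the nurse cluster reaches a raw score of 100 (a title match plus four description matches) and is pinned to 100%, while the accountant cluster scores 12.5 and normalizes to roughly 11.1%.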

### Example Output

The output DataFrame will have the following columns:
- `company_name`: The name of the company posting the job.
- `title`: The job title.
- `location`: The job location.
- `clusters`: A dictionary containing domain:sub-domain pairs with their corresponding probabilities.
- `primary_cluster`: The cluster with the highest probability.
- `secondary_cluster`: The cluster with the second-highest probability.
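
As a rough illustration of how the last two columns relate to the `clusters` dictionary, the sketch below ranks a dictionary by probability and takes the top two names. `pick_primary_secondary` is a hypothetical stand-alone helper that returns a tuple instead of a `pd.Series`; the sample values are a subset of the notebook's own output for `df_A.clusters[1]`.

```python
def pick_primary_secondary(clusters):
    """Return the two highest-probability cluster names (None where missing)."""
    ranked = sorted(clusters.items(), key=lambda item: item[1], reverse=True)
    names = [name for name, _ in ranked[:2]]
    # Pad with None when fewer than two clusters are available.
    names += [None] * (2 - len(names))
    return tuple(names)


sample = {
    "software engineer:healthcare": 14.0,
    "nurse:healthcare": 100.0,
    "sales representative:sales": 8.0,
    "Business Analyst:general": 8.0,
}
print(pick_primary_secondary(sample))
print(pick_primary_secondary({"Uncategorized": 100}))
```

Because Python's `sorted` is stable even with `reverse=True`, clusters with equal probabilities keep their dictionary order, so ties are effectively broken by the order in which the clusters were scored.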

### Final Result

The final DataFrame provides each job posting's primary and secondary cluster assignments, helping to categorize the job postings based on the relevance of the title and description to predefined domains and sub-domains.


In [None]:
import pandas as pd


In [2]:
# Load dataframes
df_A = pd.read_csv('../Job_Postings_Data_2023_24/postings.csv', encoding='latin-1')
# Optional: work on a reproducible 20,000-row sample instead of the full file
# df_A = df_A.sample(n=20000, random_state=42)
df_B = pd.read_excel('Domain_TD.xlsx')

# Initialize a new column in df_A for the clusters
df_A['clusters'] = None
# Fill missing values before casting; astype(str) first would turn NaN into the string 'nan'
df_A['title'] = df_A['title'].fillna('').astype(str)
df_A['description'] = df_A['description'].fillna('').astype(str)

In [4]:
def find_clusters_with_probabilities(title, description, df_B):
    cluster_matches = {}

    for index, row in df_B.iterrows():
        domain = row['Domain'].strip().lower()
        sub_domain = row['Sub-Domain'].strip().lower()
        # Keywords are comma-separated; a missing (non-string) cell yields an empty list
        keywords = [keyword.strip().lower() for keyword in row['keywords'].split(',')] if isinstance(row['keywords'], str) else []

        # Match keywords in description
        description_match_count = sum(1 for keyword in keywords if keyword in description.lower())
        # Match keywords in title (checking each word in title against the keywords)
        title_word_count = len([word for word in title.lower().split() if word in keywords])

        cluster = f"{row['Domain']}:{row['Sub-Domain']}"

        # Title: one or more title words matching a keyword contributes a flat 50 points
        title_score = 50 if title_word_count > 0 else 0

        # Description: 50 points for 4 or more keyword matches,
        # otherwise a share of the 50 points proportional to matches / 4
        description_score = (min(description_match_count, 4) / 4) * 50

        # Accumulate, so repeated Domain:Sub-Domain rows in df_B add up as before
        cluster_matches[cluster] = cluster_matches.get(cluster, 0) + title_score + description_score

    # Calculate probabilities
    total_matches = sum(cluster_matches.values())

    # No cluster scored any points (or df_B is empty): avoid dividing by zero
    if total_matches == 0:
        return {'Uncategorized': 100}
    probabilities = {}

    for cluster, count in cluster_matches.items():
        if count >= 100:
            probabilities[cluster] = 100.0
        else:
            probabilities[cluster] = (count / total_matches) * 100.0

    return probabilities
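
One caveat about the description check in `find_clusters_with_probabilities`: `keyword in description.lower()` is a substring test, so a short keyword can match inside an unrelated word. The sketch below shows one possible alternative using word boundaries from the standard `re` module. It is an illustration under assumptions, not part of the original notebook: `count_keyword_hits` and the sample text and keywords are hypothetical.

```python
import re


def count_keyword_hits(text, keywords):
    """Count keywords that occur in text as whole words or phrases."""
    lowered = text.lower()
    hits = 0
    for kw in keywords:
        # The \b anchors stop 'rn' from matching inside 'intern' or 'modern'.
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        if re.search(pattern, lowered):
            hits += 1
    return hits


# Hypothetical posting text and keywords.
text = "Summer intern supporting the modern data platform."
keywords = ["rn", "data platform"]

substring_hits = sum(1 for kw in keywords if kw in text.lower())
print(substring_hits, count_keyword_hits(text, keywords))
```

Here the substring test counts both keywords, while the word-boundary version counts only `data platform`. Note that `\b` behaves differently for keywords that begin or end with punctuation (for example `c++`), so such keywords would need separate handling.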

In [5]:
# Apply the function to each row in df_A
df_A['clusters'] = df_A.apply(lambda row: find_clusters_with_probabilities(row['title'], row['description'], df_B), axis=1)

In [6]:
# Save the updated dataframe to a new CSV file if needed
df_A.to_csv('df_A_clustered.csv', index=False)

In [8]:
df_A.clusters[1]

{'software engineer:general': 6.0,
 'software engineer:aerospace': 0.0,
 'software engineer:management': 4.0,
 'software engineer:marketing': 0.0,
 'software engineer:healthcare': 14.000000000000002,
 'software engineer:cybersecurity': 0.0,
 'software engineer:robotics': 2.0,
 'software engineer:finance': 2.0,
 'software engineer:automobile': 0.0,
 'Robotics Engineer:Robotics': 2.0,
 'software engineer:customer relationship management': 4.0,
 'Accountant:finance': 4.0,
 'nurse:healthcare': 100.0,
 'Admin:management': 4.0,
 'sales representative:sales': 8.0,
 'sales manager:sales': 4.0,
 'scrum master:management': 0.0,
 'Architect:Design': 2.0,
 'Assistant manager:management': 8.0,
 'Business Analyst:general': 8.0,
 'associate director:general': 2.0,
 'automation engineer:general': 2.0,
 'chief financial officer:finance': 4.0,
 'transportation specialist:transport': 0.0,
 'chef:food chain': 2.0,
 'clerk:general': 2.0}

In [9]:
# Function to determine primary and secondary clusters
def get_primary_secondary(clusters):
    sorted_clusters = sorted(clusters.items(), key=lambda item: item[1], reverse=True)
    primary = sorted_clusters[0][0] if len(sorted_clusters) > 0 else None
    secondary = sorted_clusters[1][0] if len(sorted_clusters) > 1 else None
    return pd.Series([primary, secondary])

# Apply the function to each row in the DataFrame
df_A[['primary_cluster', 'secondary_cluster']] = df_A['clusters'].apply(get_primary_secondary)


In [11]:
len(df_A)

123849

In [12]:
df_A.to_csv('df_A_clustered_PS.csv', index=False)