### Domain Clustering using K-means

This Python code performs text preprocessing and clustering on job posting data. The steps involved are:

1. **Text Preprocessing:**
   - It loads a dataset of job postings (`postings.csv`), extracts the `title` and `description` columns, and processes them using the `nltk` library.
   - The text is cleaned by replacing non-word characters with spaces, collapsing extra whitespace, converting to lowercase, removing stopwords, and lemmatizing the remaining words.

2. **TF-IDF Vectorization:**
   - A `TfidfVectorizer` capped at 100 features (`max_features=100`) is applied to the combined text (job title + description) to convert it into numerical features for clustering.

3. **Dimensionality Reduction:**
   - Principal Component Analysis (PCA) reduces the TF-IDF matrix to 2 components. Both the clustering and the scatter plot use this 2-D representation.

4. **K-Means Clustering:**
   - The Elbow Method is used to choose the number of clusters: the Within-Cluster Sum of Squares (WCSS) is plotted for k = 1 to 19, and the bend in the curve suggests a suitable k.
   - It performs K-Means clustering with 10 clusters and assigns cluster labels to the job postings.

5. **Visualization:**
   - The clusters are visualized using a scatter plot with the two principal components and marked cluster centers.

6. **Result:**
   - The cluster labels are stored in the DataFrame, and the first 20 job postings are displayed with their assigned cluster.
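
To make the WCSS quantity from step 4 concrete, here is a minimal sketch that computes it by hand on toy 2-D points. The helper `wcss` and the data are illustrative assumptions, not part of the notebook; in the notebook itself the same quantity is read from `kmeans.inertia_`.

```python
def wcss(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Toy 2-D data forming two tight groups (made-up values, not from postings.csv)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: every point is assigned to a single centroid at the overall mean
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k = 2: one centroid per group
wcss_k2 = wcss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(wcss_k1, wcss_k2)  # 104.0 4.0
```

The sharp drop from k = 1 to k = 2 is the kind of "elbow" the plot is meant to reveal. Adding more clusters beyond the natural grouping reduces WCSS only slightly.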

### Final Note

The K-means approach in the initial code did not yield meaningful clusters because this unsupervised method lacked sufficient domain knowledge, so we switched to a domain-specific clustering approach. That approach matches each job posting against predefined domains and sub-domains using associated keywords and calculates a probability for each match. Each job posting is then assigned a primary and a secondary cluster based on its highest-probability matches. This method produced more relevant and actionable categories for the job postings.
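
The domain-specific approach is not included in this notebook. The sketch below shows one way such keyword matching could work, under stated assumptions: the domain names, keyword sets, and the scoring rule (each domain's share of all keyword hits) are hypothetical, and sub-domains are omitted for brevity.

```python
# Hypothetical domains and keywords, for illustration only
DOMAIN_KEYWORDS = {
    "software": {"python", "developer", "software", "backend"},
    "healthcare": {"nurse", "patient", "clinical", "hospital"},
    "finance": {"accounting", "audit", "financial", "tax"},
}

def domain_probabilities(text):
    """Return each matched domain's share of all keyword hits in the text."""
    tokens = text.lower().split()
    counts = {
        domain: sum(1 for token in tokens if token in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        return {}  # no keyword matched any domain
    return {domain: n / total for domain, n in counts.items() if n > 0}

def assign_clusters(text):
    """Pick the two highest-probability domains as primary and secondary."""
    probs = domain_probabilities(text)
    ranked = sorted(probs, key=probs.get, reverse=True)
    primary = ranked[0] if ranked else None
    secondary = ranked[1] if len(ranked) > 1 else None
    return primary, secondary

posting = "Python developer for hospital patient records backend"
print(assign_clusters(posting))  # ('software', 'healthcare')
```

Here the posting hits three software keywords and two healthcare keywords, so the assumed scoring rule gives probabilities of 0.6 and 0.4 respectively.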


In [None]:
import pandas as pd
import re
import spacy
import logging
import concurrent.futures

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# logging.basicConfig(filename='preprocess_debug.log', level=logging.DEBUG, format='%(asctime)s %(message)s')

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

In [None]:
data = pd.read_csv("job postings 2023 24/postings.csv")

In [None]:
data.head()

In [None]:
# Copy the slice so later column assignments do not trigger SettingWithCopyWarning
req_data = data[['title', 'description']].copy()

In [None]:
req_data.head()

In [None]:
# def preprocess_text(text):
#     text = re.sub(r'\W', ' ', text)  # Remove non-word characters
#     text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
#     text = text.lower()  # Convert to lowercase
#     doc = nlp(text)
#     lemmatized_text = ' '.join([token.lemma_ for token in doc if not token.is_stop])  # Lemmatize and remove stop words
#     return lemmatized_text

In [None]:
# data['job_title'] = data['title'].apply(preprocess_text)
# data['job_description'] = data['description'].apply(preprocess_text)

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import multiprocessing as mp


# Download required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this resource for word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text_nltk(text):
    if text is None:
        return ''
    
    text = re.sub(r'\W', ' ', text)  # Remove non-word characters
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.lower()  # Convert to lowercase
    words = word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words if word not in stop_words])
    
    # Log the processed text (disabled)
    # logging.debug(f"Processed text: {lemmatized_text}")
    return lemmatized_text

# # Function to apply preprocessing to a batch of texts
# def preprocess_batch(batch):
#     return [preprocess_text_nltk(text) for text in batch]



In [None]:
# Fill missing values before casting: astype(str) first would turn NaN into the string 'nan'
req_data['title'] = req_data['title'].fillna('no data').astype(str)
req_data['description'] = req_data['description'].fillna('no data').astype(str)

In [None]:
req_data.head()

In [None]:
req_data['job_title'] = req_data['title'].apply(preprocess_text_nltk)
req_data['job_description'] = req_data['description'].apply(preprocess_text_nltk)

In [None]:
req_data['text'] = req_data['job_title'] + ' ' + req_data['job_description']


In [None]:
has_none_job_title = req_data['job_title'].isnull().any()
print(f"'job_title' column has None values: {has_none_job_title}")

# Check if the 'job_description' column has any None values
has_none_job_description = req_data['job_description'].isnull().any()
print(f"'job_description' column has None values: {has_none_job_description}")

In [None]:
# Fill missing values before casting so NaN does not become the string 'nan'
req_data['text'] = req_data['text'].fillna('no data').astype(str)
vectorizer = TfidfVectorizer(max_features=100)
X = vectorizer.fit_transform(req_data['text']).toarray()

In [None]:
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)


In [None]:
X_reduced

In [None]:
wcss = []
num_clusters = 20  # Exclusive upper bound: the loop tries k = 1 to 19
for i in range(1, num_clusters):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_reduced)
    wcss.append(kmeans.inertia_)

# Plot WCSS against k; the x range must match the loop range so both have the same length
plt.plot(range(1, num_clusters), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
# Perform K-Means clustering with k=10
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(X_reduced)

# Add cluster labels to the DataFrame
req_data['cluster'] = kmeans.labels_

# Print cluster centers
print("Cluster Centers:\n", kmeans.cluster_centers_)

# Visualize the clusters
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=kmeans.labels_, cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x')
plt.title('K-Means Clustering with k=10')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()


In [None]:
req_data[['title','description','cluster']][:20]