## Part A: Data Preprocessing, Model Training, and Storing the Pickled Model in an HDF5 File

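### Data Preprocessing
Combine the text columns (`title`, `description`, `skills_desc`) into one string per job posting, then lowercase it, strip non-word characters, remove stop words, and lemmatize the remaining words.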
### TF-IDF Vectorization
Perform TF-IDF vectorization on the preprocessed text data, keeping at most the 10,000 most frequent terms.
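
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults; the helper name `tfidf_vectors` and the toy documents are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])  # L2-normalize each row
    return vocab, vectors

# Toy corpus (hypothetical), already tokenized
docs = [
    "data scientist python".split(),
    "data analyst sql".split(),
    "registered nurse".split(),
]
vocab, vectors = tfidf_vectors(docs)
weights = dict(zip(vocab, vectors[0]))
```

In the first document, `data` appears in two of the three documents and therefore gets a lower weight than `python`, which appears in only one.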
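### Cosine Similarity Calculation
Fit a nearest-neighbors model that uses cosine distance on the TF-IDF matrix, then store the fitted vectorizer and model in an HDF5 file (via `h5py`) so they can be used later for making recommendations.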
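
The storage step can be sketched in isolation. The example below pickles an object, wraps the bytes in `np.void` so that HDF5 stores them as an opaque blob, and reads them back. The dictionary is a hypothetical stand-in for a fitted model, and the file name `demo_model.h5` is made up for the illustration.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for a fitted vectorizer or model
model = {"name": "stand-in for a fitted model", "n_neighbors": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    h5_path = os.path.join(tmp_dir, "demo_model.h5")

    # Write: pickle to bytes, then store the bytes as an opaque dataset
    with h5py.File(h5_path, "w") as h5f:
        h5f.create_dataset("model", data=np.void(pickle.dumps(model)))

    # Read: fetch the scalar dataset, convert to bytes, then unpickle
    with h5py.File(h5_path, "r") as h5f:
        restored = pickle.loads(h5f["model"][()].tobytes())
```

Because unpickling can execute arbitrary code, a file written this way should only be loaded from a trusted source.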

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import numpy as np
import nltk
import pickle
import h5py
import os

In [4]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Global TF-IDF vectorizer; it is fitted inside process_and_store_model()

tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=10000)  # Keep at most the 10,000 most frequent terms
cosine_sim_matrix = None  # Placeholder; not used by the nearest-neighbors approach below

In [8]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


In [9]:
def preprocess_text(text):
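    # Lowercase the text and replace non-word characters (anything except letters, digits, and underscore) with spaces
    text = re.sub(r'\W', ' ', text.lower())
    # Tokenize on whitespace, drop stop words, and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]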
    return ' '.join(words)

In [10]:
def process_and_store_model(data_path='C:/Users/DELL/Linkedin-Job-Market-Analysis-using-ML/LinkedIn Scraper/job postings 2023 24/postings.csv'):
    # Load job postings data
    data = pd.read_csv(data_path)
    
    # Combine the relevant text columns into a single string for each job
    data["combined_text"] = data["title"].fillna('') + ' ' + data["description"].fillna('') + ' ' + data["skills_desc"].fillna('')
    data["combined_text"] = data["combined_text"].apply(preprocess_text)
    
    # Perform TF-IDF vectorization
    tfidf_matrix = tfidf_vectorizer.fit_transform(data["combined_text"])
    
    # Fit an exact (brute-force) nearest-neighbors model that uses cosine distance
    nbrs = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute').fit(tfidf_matrix)

    # Save the model and TF-IDF vectorizer to an HDF5 file
    model_directory = "D:/ARJYAHI/Models"
    os.makedirs(model_directory, exist_ok=True)
    h5_file_path = os.path.join(model_directory, 'model_data.h5')
    
    with h5py.File(h5_file_path, 'w') as h5f:
        # Serialize the TF-IDF vectorizer with pickle and store in HDF5
        tfidf_vectorizer_pickle = pickle.dumps(tfidf_vectorizer)
        h5f.create_dataset('tfidf_vectorizer', data=np.void(tfidf_vectorizer_pickle))
        # Serialize the fitted NearestNeighbors model the same way (it keeps the full TF-IDF matrix, so the blob can be large)
        nbrs_pickle = pickle.dumps(nbrs)
        h5f.create_dataset('nbrs', data=np.void(nbrs_pickle))

In [12]:
process_and_store_model()

## Part B: Loading the Files and Using the Model for Recommendations
In Part B, we load the files and use them to make recommendations based on user input.

This code is designed to recommend similar job postings based on a user's input, either a job title, a list of skills, or both. It uses a pre-trained machine learning model and cosine similarity to find the most relevant job postings. Here's how it works:

1. **Preprocessing the Input**:
   - The input text (a job title, skills, or both) is lowercased and stripped of non-word characters.
   - The text is tokenized (split on whitespace into individual words) and lemmatized (each word reduced to its base form) to standardize the words for comparison.
   - Stop words (common but uninformative words like "the" and "is") are removed during preprocessing.
2. **Loading the Pre-trained Model**:
   - The code loads the fitted TF-IDF vectorizer and Nearest Neighbors model from an HDF5 file using the `h5py` library. The vectorizer turns the input text into a vector, and the Nearest Neighbors model finds the most similar jobs for that vector.
3. **Calculating Similarity**:
   - The TF-IDF vectorizer transforms the preprocessed input into a numerical vector, which is then compared to the vectors of the job postings in the dataset.
   - The Nearest Neighbors model computes the cosine distance between the input vector and the job posting vectors.
   - Cosine similarity measures how close two vectors are in angle. Because TF-IDF weights are non-negative, it ranges here from 0 (no shared terms) to 1 (same direction). A lower cosine distance corresponds to a higher similarity.
4. **Calculating and Displaying Similarity Scores**:
   - The distances returned by the Nearest Neighbors model are cosine distances, that is, dissimilarity values. They are converted to similarity scores by subtracting the distance from 1: `Similarity Score = 1 - Cosine Distance`.
   - The code then adds the similarity score to the result for each job, so we can see how similar each recommended job is to the input.
5. **Returning the Results**:
   - The top n most similar job postings are returned with the following columns:
     - Company Name
     - Job Title
     - Location
     - Similarity Score: a higher similarity score means the job posting is more relevant to the user's input.
   - The current column selection does not include the job ID, job description, or skills description. Add those columns to the selection in `load_model_and_recommend` if they are needed.
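
The steps above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions: the four postings and the query are made up, the vectorizer uses default settings rather than the notebook's configuration, and preprocessing is skipped because the toy text is already clean.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical, already-clean postings standing in for the real dataset
postings = [
    "data scientist python machine learning",
    "registered nurse patient care",
    "software engineer java backend",
    "data analyst sql reporting",
]

# Fit the vectorizer and an exact cosine-distance neighbor model
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(postings)
nbrs = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(matrix)

# Vectorize the query with the same fitted vectorizer and look up neighbors
query_vec = vectorizer.transform(["data scientist python"])
distances, indices = nbrs.kneighbors(query_vec, n_neighbors=2)

# Convert cosine distance to similarity: similarity = 1 - distance
scores = 1 - distances[0]
ranked = list(zip(indices[0].tolist(), scores.tolist()))
```

`ranked` pairs each row position in `postings` with its similarity score, nearest first. The first posting ranks highest because it shares three terms with the query, and the analyst posting follows because it shares only `data`.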

In [13]:
import os
import pandas as pd
import pickle
from sklearn.metrics.pairwise import cosine_similarity
import re
import h5py
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [14]:
# Initialize lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

In [15]:
# Define a function to preprocess and lemmatize text
def preprocess_text(text):
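    # Lowercase the text and replace non-word characters (anything except letters, digits, and underscore) with spaces
    text = re.sub(r'\W', ' ', text.lower())
    # Tokenize on whitespace, drop stop words, and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]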
    return ' '.join(words)

In [23]:
def load_model_and_recommend(title=None, skills=None, top_n=5):
    model_directory = "D:/ARJYAHI/Models"
    h5_file_path = os.path.join(model_directory, 'model_data.h5')
    
    with h5py.File(h5_file_path, 'r') as h5f:
        # Load the TF-IDF vectorizer and NearestNeighbors model from HDF5
        tfidf_vectorizer = pickle.loads(h5f['tfidf_vectorizer'][()].tobytes())
        nbrs = pickle.loads(h5f['nbrs'][()].tobytes())
    
    # Combine title and skills into input text and preprocess
    input_text = ' '.join(filter(None, [title, skills]))
    input_text = preprocess_text(input_text)
    
    # Transform the input text using the loaded TF-IDF vectorizer
    input_tfidf = tfidf_vectorizer.transform([input_text])
    
    # Find nearest neighbors (most similar jobs)
    distances, indices = nbrs.kneighbors(input_tfidf, n_neighbors=top_n)
    
    # Load the job postings dataset (must be the same file, in the same row order, that the model was fitted on)
    data = pd.read_csv("C:/Users/DELL/Linkedin-Job-Market-Analysis-using-ML/LinkedIn Scraper/job postings 2023 24/postings.csv")
    
    # Select the top_n similar jobs; copy so that adding a column does not trigger SettingWithCopyWarning
    similar_jobs = data.iloc[indices[0]][["company_name", "title", "location"]].copy()
    similar_jobs['similarity_score'] = 1 - distances[0]  # Cosine similarity score (higher is better)
    
    return similar_jobs

## Example Output

In [24]:
title_input = "Data Scientist"  # Or set it to None if only using skills
skills_input = None  # Or set it to a skills string to search by skills as well
similar_jobs = load_model_and_recommend(title=title_input, skills=skills_input)
print(similar_jobs)

            company_name                                              title  \
69503             Nebula                                     Data Scientist   
83764             Amazon  Language Data Scientist, Artificial General In...   
83835             Amazon  Language Data Scientist, Artificial General In...   
83406   Jobot Consulting                               Staff Data Scientist   
116889          hackajob                                     Data Scientist   

             location  similarity_score  
69503   United States          0.500107  
83764      Boston, MA          0.490506  
83835      Boston, MA          0.490506  
83406    New York, NY          0.477121  
116889     McLean, VA          0.470805  


## Final Conclusion and Explanation of Similarity Scores and Cosine Metrics

The final output provides a list of job postings that are most similar to the input text (either a job title, skills, or both). The key metric used to determine this similarity is the **cosine similarity score**. Here's a detailed explanation of the results and how the cosine similarity is calculated:

### 1. **Similarity Scores:**
   The **cosine similarity score** indicates how closely the input text (job title and/or skills) matches the combined text (title, description, and skills) of each job posting in the dataset. For TF-IDF vectors the score ranges from 0 to 1:
   - **1** indicates perfect similarity, meaning the input and the job posting use the same terms in the same proportions.
   - **0** means no similarity at all: the input and the job posting share no terms.

   The job postings in the result are sorted by their cosine similarity score, with higher scores indicating a higher degree of relevance to the input.

   For example:
   - **Data Scientist** at **Nebula** has a similarity score of **0.500107**, meaning it's the most similar job to the input.
   - **Language Data Scientist, Artificial General Intelligence** at **Amazon** appears twice (rows 83764 and 83835) with an identical score of **0.490506**, which suggests the same posting is listed twice in the dataset. It is relevant, but slightly less so than the Nebula job posting.
   - **Staff Data Scientist** at **Jobot Consulting** has a similarity score of **0.477121**, showing it is somewhat less similar.
   - **Data Scientist** at **hackajob** has the lowest similarity score of the five results, **0.470805**, but it's still relevant to the input.

### 2. **Cosine Similarity:**
   Cosine similarity is a metric used to measure how similar two vectors (in this case, job descriptions and the input text) are. It is calculated as the cosine of the angle between the vectors:
   $$
   \text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}
   $$
   - **A** and **B** represent two vectors, such as one for the input text and one for the job description.
   - The **dot product** (A ⋅ B) measures how aligned the two vectors are.
   - The **magnitudes** ‖A‖ and ‖B‖ of the vectors normalize the calculation, ensuring that the result is independent of the vector lengths.

   A higher cosine similarity score means that the job description shares more common features with the input text, indicating that it is a better match.
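
As a small worked example of the formula, the sketch below computes cosine similarity for three pairs of toy vectors. The function name `cosine_similarity_score` and the vectors are hypothetical, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity_score(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention for this sketch: an empty vector matches nothing
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is 1 (up to rounding)
same_direction = cosine_similarity_score([1, 2, 0], [2, 4, 0])
# No shared non-zero components: similarity is 0
no_overlap = cosine_similarity_score([1, 0, 0], [0, 3, 0])
# One of two components shared: similarity is 0.5 (up to rounding)
partial = cosine_similarity_score([1, 1, 0], [1, 0, 1])
```

The first pair shows why the magnitudes appear in the denominator: doubling every component of a vector leaves its score unchanged.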

### 3. **Interpretation of Results:**
   - The job postings with the highest similarity scores are more likely to match the user's input, based on the similarity between their job descriptions and the user's search query (title and/or skills).
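   - Cosine similarity over TF-IDF vectors captures lexical overlap rather than semantic similarity: a posting scores above 0 only if it shares at least one term with the input after preprocessing. Lemmatization helps with inflected forms (for example, "scientists" and "scientist" map to the same term), and job titles with slightly different wording still score well when most of their words overlap, but synonyms with no words in common are not matched.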
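
One way to check how wording affects the score is to compare a reworded title with a synonym that shares no words. The sketch below uses plain word counts instead of TF-IDF weights to stay self-contained, so the exact numbers differ from the notebook's. The helper name `bow_cosine` and the example titles are hypothetical.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a)  # Only shared terms contribute
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Reworded title: two of three words overlap, so the score stays high
reworded = bow_cosine("data scientist", "senior data scientist")
# Synonym with no shared words: the score is exactly 0
synonym = bow_cosine("software engineer", "programmer")
```

Under these assumptions, a term-overlap score rewards shared words and cannot relate two titles that share none.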

### 4. **Final Thoughts:**
   - The similarity scores provide an intuitive way to rank job postings based on how well they match the input text.
   - Because **cosine similarity** normalizes by vector length, long and short postings are compared on the proportion of shared terms rather than on raw word counts, which makes it a practical baseline for job recommendation tasks.
   - The top job recommendations are ranked by their similarity scores, helping the user quickly identify the most relevant job postings based on their interests or expertise.

In conclusion, cosine similarity over TF-IDF vectors is a simple and efficient way to find job postings whose wording closely matches the user's input, providing a data-driven way to recommend similar job postings.