## Regression

Model purpose: given the text of a job description, predict a numeric score for it.

### Downloading and loading the dataset

In [2]:
import kagglehub
import os
import pandas as pd
# Download the latest version of the dataset (uncomment on the first run)
# path = kagglehub.dataset_download("arshkon/linkedin-job-postings")

# Local kagglehub cache directory left by a previous download
path = "/home/leon/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13"

print(f"Path to dataset files: {path}")
print(f"List of files in the dataset: {os.listdir(path)}")

Path to dataset files: /home/leon/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13
List of files in the dataset: ['companies', 'postings.csv', 'jobs', 'mappings']
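
The hardcoded cache path above only exists on the machine that produced this notebook. A minimal sketch of a fallback follows, assuming a hypothetical helper named `resolve_dataset_path` (it is not part of kagglehub): it reuses the cached directory when it is present and otherwise calls a download function supplied by the caller, for example a lambda wrapping `kagglehub.dataset_download`.

```python
import os
from typing import Callable, Optional


def resolve_dataset_path(cached_path: str,
                         download: Optional[Callable[[], str]] = None) -> str:
    """Return cached_path if it is an existing directory, else the result of download().

    Hypothetical helper for illustration; `download` is any zero-argument
    callable that returns the directory containing the dataset files.
    """
    if os.path.isdir(cached_path):
        return cached_path
    if download is None:
        raise FileNotFoundError(
            f"{cached_path!r} not found and no download function was given"
        )
    return download()


# Possible usage in this notebook (needs kagglehub and network access):
# path = resolve_dataset_path(
#     path, lambda: kagglehub.dataset_download("arshkon/linkedin-job-postings")
# )
```

Passing the downloader as a callable keeps the helper free of network access, so it can be tried out without touching Kaggle.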


**Load the postings, keeping only the columns of interest**

In [20]:
postings_path = os.path.join(path, "postings.csv")
postings_df = pd.read_csv(
    postings_path,
    usecols=["job_id", "company_name", "company_id", "title",
             "location", "description", "max_salary", "views"],
)

### Data analysis and preprocessing

In [21]:
# Quick overview of the dataset
print(f"Number of rows: {postings_df.shape[0]}")
print(f"Number of columns: {postings_df.shape[1]}")

# Display the first few rows of the DataFrame
# print(postings_df.head())
# print(postings_df["description"][0])

# print number of unique values for views column
unique_views = postings_df["views"].nunique()
print(f"Number of unique values in the 'views' column: {unique_views}")

not_nan = postings_df["views"].notna()
print(f"Number of non-NaN values in the 'views' column: {not_nan.sum()}")

# Count rows where "description" or "views" is NaN
rows_with_any_nan = postings_df[["description", "views"]].isna().any(axis=1).sum()
print(f"Rows with at least one NaN value: {rows_with_any_nan}")

# drop rows with NaN values in specific columns
print(f"Number of rows before dropping NaN values: {postings_df.shape[0]}")
postings_df.dropna(subset=["description", "views"], inplace=True)
print(f"Number of rows after dropping NaN values: {postings_df.shape[0]}")

Number of rows: 123849
Number of columns: 8
Number of unique values in the 'views' column: 684
Number of non-NaN values in the 'views' column: 122160
Rows with at least one NaN value: 1696
Number of rows before dropping NaN values: 123849
Number of rows after dropping NaN values: 122153
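
The counts above are consistent with each other: 123849 - 1696 = 122153, so the rows flagged by `isna().any(axis=1)` are exactly the rows that `dropna(subset=...)` removes. The sketch below reproduces that relationship on a small made-up frame (the values are invented for illustration and do not come from the dataset). It also shows that a NaN in a column outside `subset`, here `max_salary`, does not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "description": ["a", None, "c", "d"],
    "views": [1.0, 2.0, np.nan, 4.0],
    "max_salary": [np.nan, 10.0, 20.0, 30.0],
})

subset = ["description", "views"]

# Rows with a NaN in at least one of the subset columns
rows_with_any_nan = int(toy[subset].isna().any(axis=1).sum())

# Non-inplace variant of the dropna call used in the notebook
cleaned = toy.dropna(subset=subset)

print(f"Rows flagged: {rows_with_any_nan}")
print(f"Rows dropped: {len(toy) - len(cleaned)}")
print(f"Remaining index labels: {list(cleaned.index)}")
```

Note that the surviving rows keep their original index labels, which is why the index of `postings_df` has gaps after the drop; `reset_index(drop=True)` would renumber it if a contiguous index is needed.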


**Clean the descriptions: remove URLs, e-mail addresses, long digit runs, and unwanted characters such as emojis**
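
Before running the cleaning over the whole column, it can help to see what each substitution does to a single string. The sketch below is an illustration only: `clean_text_trace` and `STEPS` are hypothetical names, the sample string is made up, and the regular expressions are copied from the `clean_text` function defined in the next cell.

```python
import re

# Same substitutions as clean_text, as (label, pattern, replacement)
STEPS = [
    ("urls", r"http\S+|www\S+|https\S+", ""),
    ("emails", r"\S+@\S+", ""),
    ("long digit runs", r"\d{10,}", ""),
    ("special characters", r"[^a-zA-Z0-9\s.,!?]", ""),
    ("whitespace", r"\s+", " "),
]


def clean_text_trace(text):
    """Return (label, text) pairs showing the string after each cleaning step."""
    trace = [("lowercase", text.lower())]
    for label, pattern, replacement in STEPS:
        trace.append((label, re.sub(pattern, replacement, trace[-1][1])))
    trace.append(("strip", trace[-1][1].strip()))
    return trace


# Made-up sample containing a URL, an e-mail address, an emoji and a long number
sample = ("Apply NOW at https://example.com or mail jobs@example.com! "
          "\U0001F680  Call 12345678901.")

for label, value in clean_text_trace(sample):
    print(f"{label:>18}: {value!r}")
```

The final string is `'apply now at or mail call .'`. The `!` after the e-mail address disappears as well, because `\S+@\S+` consumes any punctuation attached to the address.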

In [22]:
import re

# Basic cleaning
def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # remove e-mail addresses
    text = re.sub(r'\d{10,}', '', text)  # remove runs of 10 or more digits
    text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)  # drop emojis and other special characters
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
    return text

# Apply clean_text to every description and assign the result back
postings_df['description'] = postings_df['description'].apply(clean_text)
postings_df

Unnamed: 0,job_id,company_name,title,description,max_salary,location,company_id,views
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,job descriptiona leading real estate firm in n...,20.0,"Princeton, NJ",2774458.0,20.0
1,1829192,,Mental Health Therapist/Counselor,"at aspen therapy and wellness , we are committ...",50.0,"Fort Collins, CO",,1.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,the national exemplar is accepting application...,65000.0,"Cincinnati, OH",64896719.0,8.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,senior associate attorney elder law trusts and...,175000.0,"New Hyde Park, NY",766262.0,16.0
4,35982263,,Service Technician,looking for hvac service tech with experience ...,80000.0,"Burlington, IA",,3.0
...,...,...,...,...,...,...,...,...
123844,3906267117,Lozano Smith,Title IX/Investigations Attorney,our walnut creek office is currently seeking a...,195000.0,"Walnut Creek, CA",56120.0,1.0
123845,3906267126,Pinterest,"Staff Software Engineer, ML Serving Platform",about pinterest millions of people across the ...,,United States,1124131.0,3.0
123846,3906267131,EPS Learning,"Account Executive, Oregon/Washington",company overview eps learning is a leading k12...,,"Spokane, WA",90552133.0,3.0
123847,3906267195,Trelleborg Applied Technologies,Business Development Manager,the business development manager is a hunter t...,,"Texas, United States",2793699.0,4.0
