The dataset went through a series of cleaning and preprocessing steps to improve its usability and reduce noise. Below is an overview of the tasks performed, including the TF-IDF vectorization step:

### Data Filtering:

Invalid or illogical entries were removed: records flagged for false experience (exp_check == 1), records with a negative value in any time_duration column, and records whose total experience was 900 months or more.
Records whose job titles contained terms such as 'intern', 'student', 'owner', etc., were also filtered out to focus on meaningful job roles.
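
The notebook cells below do not show the job-title filter, so the following is only a minimal sketch of how such a filter could look. The term list contains just the three terms named above; the column (`title_1`), the whole-word matching, and the helper name `drop_excluded_titles` are assumptions, not the author's code.

```python
import re

import pandas as pd

# Hypothetical term list: only the three terms named in the text
EXCLUDED_TERMS = ["intern", "student", "owner"]


def drop_excluded_titles(df, title_col="title_1", terms=EXCLUDED_TERMS):
    """Return the rows whose title does not contain any excluded term as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    # na=False keeps rows with a missing title instead of dropping them
    mask = df[title_col].str.contains(pattern, case=False, na=False, regex=True)
    return df[~mask].copy()


toy = pd.DataFrame({"title_1": ["Software Engineer", "Marketing Intern", "Owner",
                                None, "International Sales Lead"]})
filtered = drop_excluded_titles(toy)
print(filtered)
```

Whole-word matching is used here so that a title like "International Sales Lead" is not removed just because it contains the substring "intern".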

### Text Preprocessing:

The 'Summary', 'Skills', and 'Job_Description_1' columns underwent comprehensive text preprocessing to standardize and enhance the quality of textual data.
Text was converted to lowercase, special characters and numbers were removed, and tokenization was performed to break down the text into individual words.
Stop words (commonly used words with little semantic value) were eliminated, and lemmatization was applied to reduce words to their base form.
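
As a rough illustration of the first steps, here is a small sketch that runs without any NLTK downloads. It swaps in scikit-learn's built-in English stop-word list and plain whitespace splitting in place of the NLTK stop words and tokenizer used later in the notebook, and it skips lemmatization entirely; `basic_clean` is a hypothetical helper name.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def basic_clean(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = re.sub(r"[^a-z]", " ", str(text).lower())
    return [tok for tok in text.split() if tok not in ENGLISH_STOP_WORDS]


tokens = basic_clean("The engineer built 3 APIs and 2 dashboards.")
print(tokens)
```

Digits and punctuation disappear in the regex step, and common words such as "the" and "and" are dropped by the stop-word filter.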

### TF-IDF Vectorization:

Term Frequency-Inverse Document Frequency (TF-IDF) vectorization was employed to convert the preprocessed text data into numerical feature vectors suitable for machine learning algorithms.
Separate TF-IDF vectorizers were created for 'Job_Description_1', 'Summary', and 'Skills' columns.
The resulting features were combined into a unified dataset, ensuring comprehensive coverage of relevant information.
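
The vectorizers in the cells further down are created with `use_idf=False`, which turns off the inverse-document-frequency part of the weighting. The sketch below, on an invented three-document corpus, shows what that flag changes: without IDF, two terms that occur once in a document get the same weight; with IDF, the rarer term is weighted higher.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "python" appears in every document, the other terms in one each
docs = ["python sql", "python java", "python linux"]

tf_only = TfidfVectorizer(use_idf=False).fit(docs)
tf_idf = TfidfVectorizer(use_idf=True).fit(docs)

# Both vectorizers learn the same vocabulary, so column positions match
col = tf_only.vocabulary_
row_tf = tf_only.transform(["python sql"]).toarray()[0]
row_tfidf = tf_idf.transform(["python sql"]).toarray()[0]

print("tf only:", row_tf[col["python"]], row_tf[col["sql"]])
print("tf-idf :", row_tfidf[col["python"]], row_tfidf[col["sql"]])
```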

### Feature Selection and Integration:

Each TF-IDF vectorizer was capped at a specified maximum number of features (max_features=10 per column in the code below), which keeps the terms that occur most often across the corpus.
Additional features, such as 'Word_Count', 'summary_is_null', and 'skills_is_null', were incorporated into the final dataset to provide contextual information for predictive modeling.

### Saving Processed Data:

The processed dataset, containing the selected features and relevant additional information, was saved as a CSV file ('text_data_t0.csv') for ease of access and future analysis.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import scipy.sparse
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_selection import VarianceThreshold

In [2]:
df = pd.read_csv('ready_train_w_regions.csv')


In [3]:
print(df.columns)

Index(['profile_url', 'First Name', 'Middle Name', 'Surname', 'Linkedin_url',
       'connections', 'Summary', 'Skills', 'education level', 'title_1',
       'company_1', 'time_duration_1', 'Job_Description_1', 'title_2',
       'company_2', 'time_duration_2', 'Job_Description_2', 'title_3',
       'company_3', 'time_duration_3', 'Job_Description_3', 'title_4',
       'company_4', 'time_duration_4', 'Job_Description_4', 'title_5',
       'company_5', 'time_duration_5', 'Job_Description_5', 'title_6',
       'company_6', 'time_duration_6', 'Job_Description_6', 'title_7',
       'company_7', 'time_duration_7', 'Job_Description_7', 'title_8',
       'company_8', 'time_duration_8', 'Job_Description_8', 'title_9',
       'company_9', 'time_duration_9', 'Job_Description_9', 'title_10',
       'company_10', 'time_duration_10', 'Job_Description_10', 'school_name_1',
       'ed_time_duration_1', 'degree_name_1', 'education_fos_1',
       'school_name_2', 'ed_time_duration_2', 'degree_name_2',
 

In [4]:
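# Drop the precomputed total so it can be recomputed after filtering
df = df.drop('total_exp', axis=1)
# NOTE: 'time_duration_7' is not in this list; add it if it should be counted
time_duration_cols = ['time_duration_10', 'time_duration_9', 'time_duration_8',
                      'time_duration_6', 'time_duration_5', 'time_duration_4',
                      'time_duration_3', 'time_duration_2', 'time_duration_1']

# Keep only rows that passed the experience check; copy so that the column
# assignments below do not act on a slice of the original frame
df = df[df['exp_check'] == 0].copy()
# Flag and drop rows with a negative duration in any listed column
df['has_negative_value'] = (df[time_duration_cols] < 0).any(axis=1).astype(int)
df = df[df['has_negative_value'] == 0].copy()
df['average_duration'] = df[time_duration_cols].mean(axis=1)
df['total_exp'] = df[time_duration_cols].sum(axis=1, skipna=True)
# Keep rows with less than 900 months of total experience
df = df[df['total_exp'] < 900].copy()
# Word count of the first job description (0 when the text is missing)
df['Word_Count'] = df['Job_Description_1'].apply(lambda text: len(str(text).split()) if pd.notnull(text) else 0)
# Replace zero counts with the median word count
median_word_count = df['Word_Count'].median()
df['Word_Count'] = df['Word_Count'].replace(0, median_word_count)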

In [5]:
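# Missing-value indicators; summary_is_null is derived from the Summary column
# (the original line read Job_Description_1 here, which does not match the name)
df['summary_is_null'] = df['Summary'].isnull().astype(int)
df['skills_is_null'] = df['Skills'].isnull().astype(int)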

In [6]:
null_counts_target = df[['profile_url','Job_Description_1', 'Summary', 'Skills', 'Word_Count']].isnull().sum()
print("Null counts")
print(null_counts_target)

Null counts
profile_url               0
Job_Description_1    124189
Summary               59201
Skills                88626
Word_Count                0
dtype: int64


### Data Pre-processing

In [7]:
import nltk
from nltk.corpus import stopwords

# Download the NLTK resources used below: stop words, WordNet (plus omw-1.4)
# for lemmatization, and the punkt models required by word_tokenize
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    """Lowercase, strip non-letters, tokenize, drop stop words, and lemmatize."""
    if pd.isnull(text):  # Missing text becomes an empty string
        return ''

    # Convert to lowercase (str() guards against non-string cells)
    text = str(text).lower()

    # Replace anything that is not a letter with a space
    text = re.sub(r"[^a-z]", " ", text)

    # Tokenize, remove stop words, and lemmatize each token
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(token) for token in tokens]

    # Join the tokens back into a single string
    return " ".join(tokens)

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Jainish\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jainish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jainish\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
def preprocess_skills(skills):
    """Normalize a comma-separated skills string; words of one skill are joined with '_'."""
    if pd.isnull(skills):  # Missing skills become an empty string
        return ''

    # Build the stop-word set and lemmatizer once per call, not once per skill
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    processed_skills = []
    # Lowercase, then handle each comma-separated skill on its own
    for skill in str(skills).lower().split(','):
        # Replace anything that is not a letter with a space. (The original also
        # replaced spaces with underscores first, but this step undid that.)
        skill = re.sub(r"[^a-z]", " ", skill)

        # Tokenize, remove stop words, and lemmatize
        tokens = word_tokenize(skill)
        tokens = [token for token in tokens if token not in stop_words]
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

        # Join the words of one skill with underscores so it stays a single token
        processed_skill = "_".join(tokens)
        if processed_skill:  # Skip skills that became empty after cleaning
            processed_skills.append(processed_skill)

    # Join the processed skills back into a single string
    return ", ".join(processed_skills)

In [9]:
# Take an explicit copy so the assignments below modify df_selected itself
# and do not raise SettingWithCopyWarning
df_selected = df[['profile_url','Job_Description_1', 'Summary', 'Skills', 'Word_Count', 'summary_is_null', 'skills_is_null', 'Quitter']].copy()

df_selected['Job_Description_1'] = df_selected['Job_Description_1'].apply(preprocess_text)
df_selected['Summary'] = df_selected['Summary'].apply(preprocess_text)

In [10]:
df_selected['Skills'] = df_selected['Skills'].apply(preprocess_skills)


### TF-IDF Vectorization:

In [11]:
X = df_selected[['profile_url','Job_Description_1', 'Summary', 'Skills', 'Word_Count', 'summary_is_null', 'skills_is_null']]
y = df_selected['Quitter']

# Note: use_idf=False disables inverse-document-frequency weighting, so each
# vectorizer yields L2-normalized term frequencies for its 10 most frequent terms

# Vectorize Job_Description_1 and prefix the feature names
tfidf_job_desc = TfidfVectorizer(max_features=10, use_idf=False)
X_job_desc_tfidf = tfidf_job_desc.fit_transform(X['Job_Description_1'])
job_desc_feature_names = ['job_desc_' + name for name in tfidf_job_desc.get_feature_names_out()]

# Vectorize Summary and prefix the feature names
tfidf_summary = TfidfVectorizer(max_features=10, use_idf=False)
X_summary_tfidf = tfidf_summary.fit_transform(X['Summary'])
summary_feature_names = ['summary_' + name for name in tfidf_summary.get_feature_names_out()]

# Vectorize Skills and prefix the feature names
tfidf_skills = TfidfVectorizer(max_features=10, use_idf=False)
X_skills_tfidf = tfidf_skills.fit_transform(X['Skills'])
skills_feature_names = ['skills_' + name for name in tfidf_skills.get_feature_names_out()]

# Combine the three feature blocks. Reuse X's index: the rows of X keep their
# labels from the filtered frame, and the column assignments below align on labels
X_combined = pd.DataFrame.sparse.from_spmatrix(
    scipy.sparse.hstack([X_job_desc_tfidf, X_summary_tfidf, X_skills_tfidf]),
    index=X.index,
    columns=job_desc_feature_names + summary_feature_names + skills_feature_names,
)

# Add the non-text features and the profile identifier
X_selected_df = X_combined.copy()
X_selected_df['Word_Count'] = X['Word_Count']
X_selected_df['summary_is_null'] = X['summary_is_null']
X_selected_df['skills_is_null'] = X['skills_is_null']
X_selected_df['profile_url'] = X['profile_url']

X_selected_df.to_csv('text_data.csv', index=False)
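
Attaching columns from X to a frame built from a sparse matrix relies on pandas label alignment, so both frames need to share the same index. The following is a minimal sketch on invented data (the names `meta`, `features`, `misaligned`, and `aligned` are illustrative) of what happens when the index is and is not carried over after rows have been filtered out.

```python
import pandas as pd
import scipy.sparse

# Metadata whose index has gaps, as it would after filtering rows
meta = pd.DataFrame({"Word_Count": [12, 30]}, index=[5, 9])
features = scipy.sparse.csr_matrix([[0.5, 0.0], [0.0, 1.0]])

# Without an index, the new frame is labeled 0, 1 and nothing lines up with 5, 9
misaligned = pd.DataFrame.sparse.from_spmatrix(features, columns=["a", "b"])
misaligned["Word_Count"] = meta["Word_Count"]

# Passing the original index keeps each row matched with its metadata
aligned = pd.DataFrame.sparse.from_spmatrix(features, index=meta.index, columns=["a", "b"])
aligned["Word_Count"] = meta["Word_Count"]

print(misaligned["Word_Count"].isna().sum(), "missing values without the index")
print(aligned["Word_Count"].tolist())
```

An alternative with the same effect is to call `reset_index(drop=True)` on the source frame before vectorizing, so that both sides use positions 0 to n-1.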

In [12]:
X_selected_df.columns.tolist()

['job_desc_application',
 'job_desc_data',
 'job_desc_design',
 'job_desc_development',
 'job_desc_project',
 'job_desc_service',
 'job_desc_software',
 'job_desc_system',
 'job_desc_team',
 'job_desc_using',
 'summary_application',
 'summary_data',
 'summary_development',
 'summary_experience',
 'summary_software',
 'summary_system',
 'summary_team',
 'summary_technology',
 'summary_web',
 'summary_year',
 'skills_agile_methodology',
 'skills_cs',
 'skills_html',
 'skills_java',
 'skills_javascript',
 'skills_linux',
 'skills_mysql',
 'skills_python',
 'skills_software_development',
 'skills_sql',
 'Word_Count',
 'summary_is_null',
 'skills_is_null',
 'profile_url']