# <u>Preprocessing</u>

This section covers the preprocessing of the requirements field of the Fake_Real_Job_Posting.csv file. More specifically, we focus on the rows where the requirements field is not missing (i.e., not equal to "Not Mentioned"). A missing value carries no information and can add noise to the data, so we prefer to exclude those rows.  
The preprocessing steps that will be implemented are the following:

- <u>Removal of URLs:</u>  
Removing URLs is important to ensure that web links do not affect the analysis of text. URLs typically do not contribute to the content or semantics of text, so their removal helps maintain focus on the meaningful information.   

- <u>Tokenization:</u>  
Once the URLs are removed, we tokenize the text, i.e. split it into tokens (words). Tokenizing early makes the remaining preprocessing steps easier to implement, since a list of tokens is easier to manipulate than a raw string.

- <u>Lower casing:</u>  
Next comes lowercasing, which standardizes the text. It also helps with any later text feature extraction, because it merges case variants of the same word (e.g. "Python" and "python") and thus reduces duplication.

- <u>Removal of punctuation:</u>  
Removing punctuation marks gets rid of unnecessary tokens and keeps them from causing problems in the following processing steps. Punctuation marks usually carry little meaning for this kind of analysis, so they can generally be removed safely.

- <u>Removal of stopwords:</u>  
Stopwords are common words (such as "the" or "and") that add little semantic value to the text analysis. Removing them reduces noise and the dimensionality of the data.  

- <u>Lemmatization:</u>  
Finally, we will perform lemmatization to ensure that words are reduced to their base or dictionary form, which helps in standardizing the text.
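
Two of the steps above can be tried in isolation with only the standard library. The sketch below is illustrative: the sample sentence and the token list are hypothetical (not taken from the dataset), and the hand-written tokens merely stand in for real tokenizer output. It shows what the URL placeholder pattern removes, and that a `str.isalnum()` filter drops any token containing a non-alphanumeric character, not only pure punctuation tokens.

```python
import re

# Same pattern as the notebook's remove_URL: placeholders like #URL_<hex digits>#
URL_PATTERN = r'#URL_[a-f0-9]+#'

# Hypothetical sample text, not taken from the dataset
sample = "Apply via #URL_1a2b3c# before Friday."
without_url = re.sub(URL_PATTERN, '', sample)
print(without_url)  # Apply via  before Friday.

# Hand-written tokens standing in for tokenizer output (an assumption)
tokens = ["Python", ",", "SQL", "5", "years", "e-mail", "."]

# Lowercase and keep only purely alphanumeric tokens
kept = [t.lower() for t in tokens if t.isalnum()]
print(kept)  # ['python', 'sql', '5', 'years']
```

The double space left behind by the removed placeholder disappears once the text is tokenized. Note that "e-mail" is dropped entirely because of its hyphen, which may or may not be desirable depending on the vocabulary of the postings.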

The order of the preprocessing steps matters. After the URLs are removed, tokenization serves as the foundation for all other steps by dividing the text into meaningful units (tokens). Lowercasing comes next to standardize the text and reduce duplication; it must precede stopword removal, because the entries of the stopword list are lowercase. Then punctuation removal, stopword removal and lemmatization clean the text, reducing both its dimensionality and its noise.
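
To make the ordering argument concrete, here is a minimal sketch comparing the two possible orders of lowercasing and stopword removal. The tiny stopword set and the token list are assumptions made for illustration (the notebook itself uses NLTK's English list), and the two function names are hypothetical.

```python
# Toy stopword set (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "is", "a", "of", "and"}

def stopwords_then_lowercase(tokens):
    """Wrong order: capitalized stopwords slip through the filter."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t.lower() for t in kept]

def lowercase_then_stopwords(tokens):
    """Order used in this section: lowercase first, then filter."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

# Hypothetical tokens, not taken from the dataset
tokens = ["The", "candidate", "is", "A", "team", "player"]

print(stopwords_then_lowercase(tokens))  # ['the', 'candidate', 'a', 'team', 'player']
print(lowercase_then_stopwords(tokens))  # ['candidate', 'team', 'player']
```

With the wrong order, "The" and "A" survive the filter because the comparison against the lowercase stopword set is case-sensitive.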

In [17]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

path = 'Data/Fake_Real_Job_Posting.csv'
data_full = pd.read_csv(path)

# Keep only the rows whose requirements are present; .copy() returns an
# independent DataFrame, so the assignment below does not trigger pandas'
# SettingWithCopyWarning
data_reduced = data_full[data_full['requirements'] != "Not Mentioned"].copy()

class PreprocessingClass:
    def __init__(self):
        # Build the lemmatizer and stopword set once rather than on every call
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def remove_URL(self, text):
        # Strip URL placeholders of the form #URL_<hex digits>#
        url_pattern = r'#URL_[a-f0-9]+#'
        return re.sub(url_pattern, '', text)

    def preprocessing(self, text):
        # URL removal
        cleaned_text = self.remove_URL(text)
        # Tokenization
        tokens = word_tokenize(cleaned_text)
        # Lowercasing + punctuation removal (keep only alphanumeric tokens)
        words = [word.lower() for word in tokens if word.isalnum()]
        # Stopword removal
        words = [word for word in words if word not in self.stop_words]
        # Lemmatization
        words = [self.lemmatizer.lemmatize(word) for word in words]

        return ' '.join(words)

# instantiate the class
preprocessor = PreprocessingClass()

# apply the preprocessing function to the "requirements" field
data_reduced['requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)


In [18]:
data_reduced['requirements']

0        year experience ux ui design portfolio contain...
1        food graduate similar disciplineadvanced haccp...
2        job duty responsibility analysisperform root c...
3        international broadcaster shall least five 5 y...
4        experience professional environmentsare net na...
                               ...                        
17875    essential relational database theory understan...
17876    need somebody really love adwords google servi...
17877    requires ability become forklift able effectiv...
17878    education experiencebachelor degree physic mat...
17879    fluency englishsimilar professional experience...
Name: requirements, Length: 15185, dtype: object