# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, punctuation signs. possessive pronouns and stop words removal and lemmatization. 
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [4]:
from imblearn.over_sampling import SMOTE

First of all we'll load the dataset:

In [47]:
path_df = "Contract_dataset.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [48]:
df.head()

Unnamed: 0,Tender Title,Label,Tender No and date,Plant/Unit,Tender issue date and time,Bid Submission Closing date and Time,Label_Code
0,Procurement of water treatment chemicals for 4...,raw_materials,000000001738 Dt. 17/01/2020,IISCO Steel Plant,Feb 18 2020 5:00:00:000PM,Mar 25 2020 12:00:00:000PM,5
1,SPARES FOR ROD MILL LINER,hardware,003/215/1902000911/500006763/01/00,Rourkela Steel Plant,Jan 8 2020 8:00:00:000PM,May 21 2020 4:00:00:000PM,2
2,"SPARES FOR OVEN INTERLOCKING SYSTEM OF COB-1,3...",hardware,003/340/1902000406/01/00/500006867 dated 21.01...,Rourkela Steel Plant,Feb 26 2020 7:00:00:000PM,Mar 26 2020 4:00:00:000PM,2
3,ORIFICE FLOW METER FOR RAW WATER RISING MAIN,hardware,003/530/1902002486/01/00/500006927 DTD.20.02.2020,Rourkela Steel Plant,Feb 20 2020 8:00:00:000PM,Apr 20 2020 4:00:00:000PM,2
4,"FIRE FIGHTG ENSEMBLE SET (JACKET,TROUSER,BOOT,...",none,004/007/1848000076/02/00/500006945 DATED:05.0...,Rourkela Steel Plant,Apr 29 2020 4:00:00:000PM,May 20 2020 4:00:00:000PM,4


And visualize one sample news content:

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [49]:
# \r and \n
df['Content_Parsed_1'] = df['Tender Title'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding 3rd and 4th bullet, although it seems there is a special character, it won't affect us since it is not a *real* character:

In [50]:
# text = "Mr Greenspan\'s"
# text

In [51]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [52]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predicting power, so we'll just get rid of them.

In [53]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '')

By doing this we are messing up with some numbers, but it's no problem since we aren't expecting any predicting power from them.

### 1.4. Possessive pronouns

We'll also remove possessive pronoun terminations:

In [54]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")

In [55]:
df['Content_Parsed_4']

0      procurement of water treatment chemicals for 4...
1                              spares for rod mill liner
2      spares for oven interlocking system of cob-135...
3          orifice  flow meter for raw water rising main
4      fire fightg ensemble set (jackettrouserbooth/g...
                             ...                        
734    major modification/renovation of cisf colony a...
735    face lifting of single storied buildings at cd...
736    face lifting of two storied buildings at cd to...
737    building maintenance for sail house guest hous...
738                                 allotment of canteen
Name: Content_Parsed_4, Length: 734, dtype: object

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [56]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ayushi.Goel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ayushi.Goel\AppData\Roaming\nltk_data...


------------------------------------------------------------


[nltk_data]   Package wordnet is already up-to-date!


True

In [57]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [58]:
df.iloc[600]

Tender Title                            REVAMPING OF QUENCHING CAR WAGON AND REPAIR OF...
Label                                                                    skilled_manpower
Tender No and date                      NIT No : GM ( CC- W) / Coke Oven-Mech Mnt / EP...
Plant/Unit                                                             Bhilai Steel Plant
Tender issue date and time                                     Mar 12 2020 12:00:00:000PM
Bid Submission Closing date and Time                           Mar 24 2020  2:00:00:000PM
Label_Code                                                                              6
Content_Parsed_1                        REVAMPING OF QUENCHING CAR WAGON AND REPAIR OF...
Content_Parsed_2                        revamping of quenching car wagon and repair of...
Content_Parsed_3                        revamping of quenching car wagon and repair of...
Content_Parsed_4                        revamping of quenching car wagon and repair of...
Name: 603,

In [59]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    #print (row)
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.iloc[row]['Content_Parsed_4']
    text_words = str(text).split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [60]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [61]:
# # Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ayushi.Goel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [62]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [63]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll handle a regular expression only detecting whole words, as seen in the following example:

In [64]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [65]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '')

We have some dobule/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.

As an example, we'll show an original news article and its modifications throughout the process:

In [66]:
df.loc[20]['Tender Title']

'OLFA FOR SALE OF COAL CHEMICALS'

1. Special character cleaning

In [67]:
df.loc[20]['Content_Parsed_1']

'OLFA FOR SALE OF COAL CHEMICALS'

2. Upcase/downcase

In [68]:
df.loc[20]['Content_Parsed_2']

'olfa for sale of coal chemicals'

3. Punctuation signs

In [69]:
df.loc[20]['Content_Parsed_3']

'olfa for sale of coal chemicals'

4. Possessive pronouns

In [70]:
df.loc[20]['Content_Parsed_4']

'olfa for sale of coal chemicals'

5. Stemming and Lemmatization

In [71]:
df.loc[20]['Content_Parsed_5']

'olfa for sale of coal chemicals'

6. Stop words

In [72]:
df.loc[20]['Content_Parsed_6']

'olfa  sale  coal chemicals'

Finally, we can delete the intermediate columns:

In [73]:
df.head(1)

Unnamed: 0,Tender Title,Label,Tender No and date,Plant/Unit,Tender issue date and time,Bid Submission Closing date and Time,Label_Code,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,Procurement of water treatment chemicals for 4...,raw_materials,000000001738 Dt. 17/01/2020,IISCO Steel Plant,Feb 18 2020 5:00:00:000PM,Mar 25 2020 12:00:00:000PM,5,Procurement of water treatment chemicals for 4...,procurement of water treatment chemicals for 4...,procurement of water treatment chemicals for 4...,procurement of water treatment chemicals for 4...,procurement of water treatment chemicals for 4...,procurement water treatment chemicals 4161 m...


In [74]:
list_columns = ["Label",  "Tender Title", "Content_Parsed_6", "Label_Code"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [75]:
df.head()

Unnamed: 0,Label,Tender Title,Content_Parsed,Label_Code
0,raw_materials,Procurement of water treatment chemicals for 4...,procurement water treatment chemicals 4161 m...,5
1,hardware,SPARES FOR ROD MILL LINER,spar rod mill liner,2
2,hardware,"SPARES FOR OVEN INTERLOCKING SYSTEM OF COB-1,3...",spar oven interlock system cob-135 6,2
3,hardware,ORIFICE FLOW METER FOR RAW WATER RISING MAIN,orifice flow meter raw water rise main,2
4,none,"FIRE FIGHTG ENSEMBLE SET (JACKET,TROUSER,BOOT,...",fire fightg ensemble set (jackettrouserbooth/g...,4


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers every time we want. For that reason, we not only need to take into account the peculiarities of the training set articles, but also possible ones that are present in the gathered news articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [76]:
df.Label.value_counts()

skilled_manpower            246
none                        160
unskilled_manpower           94
raw_materials                88
hardware                     46
vehicle/equipment_hiring     42
machine                      35
electronics                  23
Name: Label, dtype: int64

In [77]:
# category_codes = {
#     'skilled_manpower': 1,
#     'none': 2,
#     'unskilled_manpower': 3,
#     'raw_materials': 4,
#     'hardware': 5,
#     'vehicle/equipment_hiring': 6,
#     'machine': 7,
#     'electronics': 8
    
# }

In [78]:
# # Category mapping
# df['Label_Code'] = df['Label']
# df = df.replace({'Label_Code':category_codes})

In [79]:
df.Label_Code.value_counts()

6    246
4    160
7     94
5     88
2     46
8     42
3     35
1     23
Name: Label_Code, dtype: int64

In [80]:
df.head()

Unnamed: 0,Label,Tender Title,Content_Parsed,Label_Code
0,raw_materials,Procurement of water treatment chemicals for 4...,procurement water treatment chemicals 4161 m...,5
1,hardware,SPARES FOR ROD MILL LINER,spar rod mill liner,2
2,hardware,"SPARES FOR OVEN INTERLOCKING SYSTEM OF COB-1,3...",spar oven interlock system cob-135 6,2
3,hardware,ORIFICE FLOW METER FOR RAW WATER RISING MAIN,orifice flow meter raw water rise main,2
4,none,"FIRE FIGHTG ENSEMBLE SET (JACKET,TROUSER,BOOT,...",fire fightg ensemble set (jackettrouserbooth/g...,4


## 3. Train - test split

We'll set apart a test set to prove the quality of our models. We'll do Cross Validation in the train set in order to tune the hyperparameters and then test performance on the unseen data of the test set.

In [81]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Label_Code'], 
                                                    test_size=0.2, 
                                                    random_state=8)

In [82]:
y_train.value_counts()

6    195
4    126
7     84
5     67
8     35
2     32
3     29
1     19
Name: Label_Code, dtype: int64

In [83]:
y_test.value_counts()

6    51
4    34
5    21
2    14
7    10
8     7
3     6
1     4
Name: Label_Code, dtype: int64

Since we don't have much observations (only 2.225), we'll choose a test set size of 15% of the full dataset.

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary ignore terms that have a document
    frequency strictly higher than the given threshold
* `min_df`: When building the vocabulary ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.

In [84]:
# Parameter election
ngram_range = (1,3)
min_df = 10
max_df = 1.
max_features = 500

We have chosen these values as a first approximation. Since the models that we develop later have a very good predictive power, we'll stick to these values. But it has to be mentioned that different combinations could be tried in order to improve even more the accuracy of the models.

In [85]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(587, 116)
(147, 116)


In [86]:
smt = SMOTE(random_state=777, k_neighbors=1)
features_train, labels_train = smt.fit_sample(features_train, labels_train)

In [87]:
features_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [88]:
# features_test = tfidf.transform(X_test).toarray()
# labels_test = y_test
# print(features_test.shape)

Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.

We can use the Chi squared test in order to see what unigrams and bigrams are most correlated with each category:

In [89]:
from sklearn.feature_selection import chi2
import numpy as np

for Product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
#     trigrams = [v for v in feature_names if len(v.split(' ')) == 3]
    print("# '{}' category:".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
#     print("  . Most correlated trigrams:\n. {}".format('\n. '.join(trigrams[3:])))
    print("")


# 'electronics' category:
  . Most correlated unigrams:
. per
. year
. basis
. mine
. hire
  . Most correlated bigrams:
. sail ranchi
. supply installation

# 'hardware' category:
  . Most correlated unigrams:
. coal
. fa
. olfa
. mt
. chemicals
  . Most correlated bigrams:
. olfa sale
. coal chemicals

# 'machine' category:
  . Most correlated unigrams:
. township
. handle
. area
. house
. clean
  . Most correlated bigrams:
. supply installation
. rate contract

# 'none' category:
  . Most correlated unigrams:
. water
. sms
. procurement
. sale
. items
  . Most correlated bigrams:
. sail ranchi
. supply installation

# 'raw_materials' category:
  . Most correlated unigrams:
. 200520
. dt
. enable
. scrap
. waste
  . Most correlated bigrams:
. division sail
. work division

# 'skilled_manpower' category:
  . Most correlated unigrams:
. ote
. coke
. idle
. assets
. monitor
  . Most correlated bigrams:
. supply installation
. idle assets

# 'unskilled_manpower' category:
  . Most correla

As we can see, the unigrams correspond well to their category. However, bigrams do not. If we get the bigrams in our features:

In [226]:
bigrams

['bokaro steel',
 'sru ifico',
 'steel plant',
 'job contract',
 'sail bokaro',
 'ore mine',
 'iron ore',
 'contract ote',
 'work division',
 'division sail',
 'various job',
 'dt 250320',
 'olfa sale',
 'dt 200520',
 'idle assets',
 'coal chemicals',
 'sail ranchi',
 'supply installation',
 'sms ii',
 'rate contract']

We can see there is only one. This means the unigrams have more correlation with the category than the bigrams, and since we're restricting the number of features to the most representative 300, only a few bigrams are being considered.

Saving the files needed in the next steps:

In [227]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)