# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

* **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
* **Label coding**: creation of a dictionary to map each category to a code.
* **Train-test split**: to test the models on unseen data.
* **Text representation**: use of TF-IDF scores to represent text.

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

**First of all we'll load the dataset:**

In [2]:
path_df = "/home/andrewpap22/Desktop/DataScience-testDir/data.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [3]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length
0,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,1,2559
1,002.txt,Dollar gains on Greenspan speech\n\nThe dollar...,business,002.txt-business,1,2251
2,003.txt,Yukos unit buyer faces loan claim\n\nThe owner...,business,003.txt-business,1,1551
3,004.txt,High fuel prices hit BA's profits\n\nBritish A...,business,004.txt-business,1,2411
4,005.txt,Pernod takeover talk lifts Domecq\n\nShares in...,business,005.txt-business,1,1569


**Visualize one sample news content:**

In [4]:
df.loc[2]['Content']

'Yukos unit buyer faces loan claim\n\nThe owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.\n\nState-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos\' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft\'s purchase of Yugansk would include such obligations. "The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets," said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group\'s managing director Tim Osborne told the Reuters news agency: "If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit."\n\nRosneft official

# 1. Text cleaning and preparation

## 1.1. Special character cleaning

We can see the following special characters:

* \r
* \n
* \ before the apostrophe of a possessive (government's is shown as government\'s)
* \ before a trailing possessive apostrophe (Yukos' is shown as Yukos\')
* " when quoting text

In [5]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

In [6]:
# Regarding the 3rd and 4th bullets: the backslash is only an escape in the printed representation, not a real character in the string, so it won't affect us:

text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [7]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
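On the backslashes listed above: a quick check in plain Python suggests they exist only in the printed representation of the string. The sketch below uses two made-up sample strings (`escaped` and `quoted` are illustrative names, not part of the dataset) and only the standard library.

```python
# "\'" in source code is an escape sequence for a plain apostrophe.
escaped = "Mr Greenspan\'s"
plain = "Mr Greenspan's"

print(escaped == plain)   # both literals build the same string
print("\\" in escaped)    # no backslash character is actually stored

# When a string holds both quote types, repr() shows the apostrophe escaped,
# which is why the backslash appears in the notebook output.
quoted = 'Yukos\' owner said "ok"'
print(repr(quoted))
print("\\" in quoted)
```

So there is nothing to clean for those two bullets; only `\r`, `\n` and the double quotes need replacing.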

## 1.2. Upcase/Downcase

We'll downcase the texts because we want, for example, Football and football to be the same word.

In [8]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

## 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [9]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: treat "?" and "." as literal characters, not regex operators
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)

By doing this we mangle some numbers (for example, 0.7% becomes 07%), but that's not a problem since we don't expect any predictive power from them.
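As an illustration of the same step on a plain string, here is a minimal sketch that removes the six punctuation signs in one pass with `str.translate`. The helper name `strip_punctuation` and the sample sentence are assumptions made for this example; the notebook itself uses the loop above.

```python
# Build a translation table that deletes the six punctuation signs.
punctuation_signs = "?:!.,;"
table = str.maketrans("", "", punctuation_signs)

def strip_punctuation(text):
    """Delete every character listed in punctuation_signs."""
    return text.translate(table)

sample = "growth was 0.7-0.8%, ahead of the 0.6% forecast."
cleaned = strip_punctuation(sample)
print(cleaned)  # growth was 07-08% ahead of the 06% forecast
```

The output shows the effect on numbers directly: the decimal points disappear along with the sentence punctuation.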

## 1.4. Possessive pronouns

We'll also remove possessive pronoun terminations:

In [10]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
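The plain replacement above removes every occurrence of `'s`, wherever it appears. A slightly stricter variant, sketched here under the assumption that we only want to strip `'s` at the end of a word, anchors the pattern with a word boundary. `strip_possessive` is a hypothetical helper, not part of the notebook.

```python
import re

# Hypothetical helper: strip "'s" only when it ends a word.
POSSESSIVE = re.compile(r"'s\b")

def strip_possessive(text):
    """Remove possessive "'s" endings; other apostrophes are left alone."""
    return POSSESSIVE.sub("", text)

sample = "the government's hopes and france's output"
print(strip_possessive(sample))  # the government hopes and france output

# A bare trailing apostrophe (as in "yukos'") is not touched by either approach.
print(strip_possessive("yukos' owner"))
```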

## 1.5. Stemming and Lemmatization

*Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.*
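To make the difference concrete, here is a deliberately crude toy sketch. It is not NLTK's stemmer or lemmatizer: `toy_stem` just chops common suffixes, and `toy_lemmatize` looks words up in a tiny hand-written table (`TOY_LEMMAS`), standing in for the dictionary a real lemmatizer consults.

```python
def toy_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ies", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table, standing in for a real lemmatizer's dictionary.
TOY_LEMMAS = {"economies": "economy", "leaves": "leave", "fallen": "fall"}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["economies", "leaves", "fallen"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmed forms ("econom", "leav") are not real words, while the lookup always returns a real one. That is the property we rely on when choosing lemmatization.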

In [11]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     /home/andrewpap22/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/andrewpap22/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In [13]:
# In order to lemmatize, we have to iterate through every word:

nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [14]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (in the example below, "hopes" becomes "hop"), it is still useful.

## 1.6. Stop words

In [15]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/andrewpap22/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [17]:
# To remove the stop words, we'll use a regular expression that matches whole words only, as seen in the following example:

example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

In [18]:
# We can now loop through all the stop words:

df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True is required on pandas >= 2.0, where str.replace is literal by default
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)

*We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.*
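For illustration, the same whole-word removal can be sketched on a plain string with one compiled pattern instead of one pass per stop word. The short `stop_words` list here is a hand-picked assumption (the notebook uses NLTK's English list), and `remove_stop_words` is a hypothetical helper that also collapses the leftover spaces.

```python
import re

# Assumption: a tiny hand-picked list; the notebook uses NLTK's English stop words.
stop_words = ["me", "a", "the", "of", "you're"]

# One alternation of all stop words; re.escape keeps any special characters literal.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b")

def remove_stop_words(text):
    """Drop whole-word stop words and collapse the leftover whitespace."""
    without = pattern.sub("", text)
    return " ".join(without.split())

result = remove_stop_words("me eating a meal at the end of the day")
print(result)  # eating meal at end day
```

Note that "meal" and "at" survive: the `\b` anchors only match "me" and "a" when they stand alone as words.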

*As an example, we'll show an original news article and its modifications throughout the process:*

In [19]:
df.loc[22]['Content']

'Mixed signals from French economy\n\nThe French economy picked up speed at the end of 2004, official figures show - but still looks set to have fallen short of the government\'s hopes.\n\nAccording to state statistics body INSEE, growth for the three months to December was a seasonally-adjusted 0.7-0.8%, ahead of the 0.6% forecast. If confirmed, that would be the best quarterly showing since early 2002. It leaves GDP up 2.3% for the full year, but short of the 2.5% which the French government had predicted.\n\nDespite the apparent shortfall in annual economic growth, the good quarterly figures - a so-called "flash estimate" - mark a continuing trend of improving indicators for the health of the French economy. The government is reiterating a 2.5% target for 2005, while the European Central Bank is making positive noises for the 12-nation eurozone as a whole. Also on Friday, France\'s industrial output for December was released, showing 0.7% growth. "The numbers are good," said David N

**1. Special character cleaning**

In [20]:
df.loc[22]['Content_Parsed_1']

"Mixed signals from French economy  The French economy picked up speed at the end of 2004, official figures show - but still looks set to have fallen short of the government's hopes.  According to state statistics body INSEE, growth for the three months to December was a seasonally-adjusted 0.7-0.8%, ahead of the 0.6% forecast. If confirmed, that would be the best quarterly showing since early 2002. It leaves GDP up 2.3% for the full year, but short of the 2.5% which the French government had predicted.  Despite the apparent shortfall in annual economic growth, the good quarterly figures - a so-called flash estimate - mark a continuing trend of improving indicators for the health of the French economy. The government is reiterating a 2.5% target for 2005, while the European Central Bank is making positive noises for the 12-nation eurozone as a whole. Also on Friday, France's industrial output for December was released, showing 0.7% growth. The numbers are good, said David Naude, econom

**2. Uppercase/Downcase:**

In [21]:
df.loc[22]['Content_Parsed_2']

"mixed signals from french economy  the french economy picked up speed at the end of 2004, official figures show - but still looks set to have fallen short of the government's hopes.  according to state statistics body insee, growth for the three months to december was a seasonally-adjusted 0.7-0.8%, ahead of the 0.6% forecast. if confirmed, that would be the best quarterly showing since early 2002. it leaves gdp up 2.3% for the full year, but short of the 2.5% which the french government had predicted.  despite the apparent shortfall in annual economic growth, the good quarterly figures - a so-called flash estimate - mark a continuing trend of improving indicators for the health of the french economy. the government is reiterating a 2.5% target for 2005, while the european central bank is making positive noises for the 12-nation eurozone as a whole. also on friday, france's industrial output for december was released, showing 0.7% growth. the numbers are good, said david naude, econom

**3. Punctuation signs:**

In [23]:
df.loc[22]['Content_Parsed_3']

"mixed signals from french economy  the french economy picked up speed at the end of 2004 official figures show - but still looks set to have fallen short of the government's hopes  according to state statistics body insee growth for the three months to december was a seasonally-adjusted 07-08% ahead of the 06% forecast if confirmed that would be the best quarterly showing since early 2002 it leaves gdp up 23% for the full year but short of the 25% which the french government had predicted  despite the apparent shortfall in annual economic growth the good quarterly figures - a so-called flash estimate - mark a continuing trend of improving indicators for the health of the french economy the government is reiterating a 25% target for 2005 while the european central bank is making positive noises for the 12-nation eurozone as a whole also on friday france's industrial output for december was released showing 07% growth the numbers are good said david naude economist at deutsche bank they

**4. Possessive pronouns:**

In [27]:
df.loc[22]['Content_Parsed_4'] 

'mixed signals from french economy  the french economy picked up speed at the end of 2004 official figures show - but still looks set to have fallen short of the government hopes  according to state statistics body insee growth for the three months to december was a seasonally-adjusted 07-08% ahead of the 06% forecast if confirmed that would be the best quarterly showing since early 2002 it leaves gdp up 23% for the full year but short of the 25% which the french government had predicted  despite the apparent shortfall in annual economic growth the good quarterly figures - a so-called flash estimate - mark a continuing trend of improving indicators for the health of the french economy the government is reiterating a 25% target for 2005 while the european central bank is making positive noises for the 12-nation eurozone as a whole also on friday france industrial output for december was released showing 07% growth the numbers are good said david naude economist at deutsche bank they sen

**5. Stemming and Lemmatization:**

In [28]:
df.loc[22]['Content_Parsed_5'] 

'mix signal from french economy  the french economy pick up speed at the end of 2004 official figure show - but still look set to have fall short of the government hop  accord to state statistics body insee growth for the three months to december be a seasonally-adjusted 07-08% ahead of the 06% forecast if confirm that would be the best quarterly show since early 2002 it leave gdp up 23% for the full year but short of the 25% which the french government have predict  despite the apparent shortfall in annual economic growth the good quarterly figure - a so-called flash estimate - mark a continue trend of improve indicators for the health of the french economy the government be reiterate a 25% target for 2005 while the european central bank be make positive noise for the 12-nation eurozone as a whole also on friday france industrial output for december be release show 07% growth the number be good say david naude economist at deutsche bank they send a positive signal of a rebound in outp

**6. Stop Words:**

In [29]:
df.loc[22]['Content_Parsed_6'] 

'mix signal  french economy   french economy pick  speed   end  2004 official figure show -  still look set   fall short   government hop  accord  state statistics body insee growth   three months  december   seasonally-adjusted 07-08% ahead   06% forecast  confirm  would   best quarterly show since early 2002  leave gdp  23%   full year  short   25%   french government  predict  despite  apparent shortfall  annual economic growth  good quarterly figure -  -called flash estimate - mark  continue trend  improve indicators   health   french economy  government  reiterate  25% target  2005   european central bank  make positive noise   12-nation eurozone   whole also  friday france industrial output  december  release show 07% growth  number  good say david naude economist  deutsche bank  send  positive signal   rebound  output  open  way   continuation   trend   new year service sector activity improve  january hit  seven-month high  unemployment remain high   10%'

In [30]:
# Now we can delete the intermediate columns. First, a look at what we have:

df.head(1)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,001.txt,Ad sales boost Time Warner profit\n\nQuarterly...,business,001.txt-business,1,2559,Ad sales boost Time Warner profit Quarterly p...,ad sales boost time warner profit quarterly p...,ad sales boost time warner profit quarterly p...,ad sales boost time warner profit quarterly p...,ad sales boost time warner profit quarterly p...,ad sales boost time warner profit quarterly p...


In [31]:
list_columns = ["File_Name", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [32]:
df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed
0,001.txt,business,001.txt-business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly p...
1,002.txt,business,002.txt-business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gain greenspan speech dollar hit h...
2,003.txt,business,003.txt-business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer face loan claim owners emb...
3,004.txt,business,004.txt-business,High fuel prices hit BA's profits\n\nBritish A...,high fuel price hit ba profit british airways...
4,005.txt,business,005.txt-business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lift domecq share uk dr...


# 2. Label coding

*We'll create a dictionary with the label codification:*

In [33]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed,Category_Code
0,001.txt,business,001.txt-business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly p...,0
1,002.txt,business,002.txt-business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gain greenspan speech dollar hit h...,0
2,003.txt,business,003.txt-business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer face loan claim owners emb...,0
3,004.txt,business,004.txt-business,High fuel prices hit BA's profits\n\nBritish A...,high fuel price hit ba profit british airways...,0
4,005.txt,business,005.txt-business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lift domecq share uk dr...,0
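Later steps will produce numeric predictions, so it helps to be able to go back from a code to its category name. The sketch below shows the mapping in both directions on a plain list; `code_to_category` and the sample `labels` are names introduced only for this example.

```python
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

# Inverse dictionary: code -> category name (hypothetical helper for decoding).
code_to_category = {code: name for name, code in category_codes.items()}

labels = ['sport', 'tech', 'business']
codes = [category_codes[label] for label in labels]
decoded = [code_to_category[code] for code in codes]

print(codes)    # [3, 4, 0]
print(decoded)  # ['sport', 'tech', 'business']
```

On a DataFrame, `df['Category'].map(category_codes)` should give the same encoding in one step.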


# 3. Train - Test split

**We'll set apart a test set to assess the quality of our models. We'll do Cross Validation on the train set in order to tune the hyperparameters and then test performance on the unseen data of the test set.**

In [34]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.20, 
                                                    random_state=8)

Since we don't have many observations (only 2,225), we'll choose a test set size of 20% of the full dataset, as described in the project definition document.
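As a rough sketch of what a seeded 80/20 split does, the following standard-library function shuffles row indices and cuts off the first 20% as the test set. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; `shuffle_split` is a hypothetical name used only here.

```python
import random

def shuffle_split(n_samples, test_size=0.20, seed=8):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(2225)
print(len(train_idx), len(test_idx))  # 1780 445
```

With 2,225 articles this gives 1,780 training and 445 test observations, and fixing the seed makes the split reproducible between runs.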

# 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features, since the project requires the classification part to be done with that particular text representation.

We have to define the different parameters:

* ngram_range: We want to consider both unigrams and bigrams.
* max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
* min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
* max_features: If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

Run TfidfVectorizer? in a notebook cell for further detail.

Note that the norm argument implicitly scales our data: with norm='l2', each document's TF-IDF vector is normalized to unit length.
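To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF. It follows, to the best of our reading of the scikit-learn documentation, the formulas used with smoothed IDF (the default), `sublinear_tf=True` and `norm='l2'`. It is an illustration on a made-up toy corpus, not the library's implementation; `fit_idf` and `transform` are hypothetical helper names.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed IDF weights: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Turn one tokenized document into an L2-normalized TF-IDF dict."""
    counts = Counter(t for t in doc if t in idf)  # terms unseen in fit are dropped
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}  # sublinear tf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus (assumption): three tiny "training" documents.
train = [["bank", "growth", "bank"], ["film", "award"], ["bank", "film"]]
idf = fit_idf(train)

# A new document: "match" never appeared in training, so it gets no weight.
vec = transform(["bank", "bank", "growth", "match"], idf)
print(vec)
```

Rarer terms ("growth") receive a higher IDF than common ones ("bank"), and the squared weights of each vector sum to 1 because of the L2 normalization.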

In [35]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

We have chosen these values as a first approximation. Since the models we develop later have very good predictive power, we'll stick to these values, although other combinations could be tried to further improve the accuracy of the models.

In [36]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(1780, 300)
(445, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed the test set**.

We can use the Chi-squared test to see which unigrams and bigrams are most correlated with each category:

In [37]:
from sklearn.feature_selection import chi2
import numpy as np

for category, category_id in sorted(category_codes.items()):
    # Score every feature against a one-vs-rest indicator for this category
    features_chi2 = chi2(features_train, labels_train == category_id)
    # Sort feature indices by ascending chi2 score
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; on versions < 1.0 use it instead
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")

# 'business' category:
  . Most correlated unigrams:
. price
. market
. economy
. growth
. bank
  . Most correlated bigrams:
. last year
. year old

# 'entertainment' category:
  . Most correlated unigrams:
. music
. best
. star
. award
. film
  . Most correlated bigrams:
. mr blair
. prime minister

# 'politics' category:
  . Most correlated unigrams:
. minister
. blair
. election
. party
. labour
  . Most correlated bigrams:
. prime minister
. mr blair

# 'sport' category:
  . Most correlated unigrams:
. win
. side
. game
. team
. match
  . Most correlated bigrams:
. say mr
. year old

# 'tech' category:
  . Most correlated unigrams:
. digital
. software
. computer
. technology
. users
  . Most correlated bigrams:
. year old
. say mr



As we can see, the unigrams correspond well to their category. However, bigrams do not. If we get the bigrams in our features:

In [38]:
bigrams

['tell bbc', 'last year', 'mr blair', 'prime minister', 'year old', 'say mr']

We can see there are only six. Since max_features keeps only the 300 most frequent terms across the corpus, and bigrams occur far less often than unigrams, only a few bigrams make it into the vocabulary. Most of those that do are generic phrases such as 'last year' or 'say mr', which is why they don't match their categories well.
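For reference, the chi-squared score used in the ranking above can be sketched for a single feature in plain Python. This follows our understanding of how scikit-learn's `chi2` treats non-negative feature values: it compares the observed sum of the feature inside and outside the class with the sum expected if the feature were independent of the class. `chi2_score` and the toy numbers are assumptions made for this example.

```python
def chi2_score(feature_values, is_in_class):
    """Chi-squared statistic of one non-negative feature against a binary class."""
    n = len(feature_values)
    total = sum(feature_values)
    observed_in = sum(v for v, flag in zip(feature_values, is_in_class) if flag)
    observed_out = total - observed_in
    p_in = sum(is_in_class) / n
    expected_in = total * p_in
    expected_out = total * (1 - p_in)
    return ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)

# Toy data (assumption): four documents, the first two belong to the class.
in_class = [True, True, False, False]
bank = [0.9, 0.8, 0.0, 0.1]   # concentrated in the class
year = [0.5, 0.4, 0.5, 0.4]   # spread evenly across classes

print(chi2_score(bank, in_class))
print(chi2_score(year, in_class))
```

A term concentrated in one category gets a high score, while an evenly spread term scores close to zero. The sketch assumes the class is neither empty nor the whole dataset and that the feature is not all zeros, since otherwise an expected value would be zero.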

We'll save our files here and move on to the next notebook, the Classification one! 

In [41]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
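The ten near-identical blocks above could also be written as a loop over a dictionary of names and objects. The sketch below is one way to do it, demonstrated on placeholder objects in a temporary directory so it runs on its own. `save_pickles`, `load_pickle` and `artifacts` are hypothetical names; in the notebook the dictionary would hold `X_train`, `tfidf` and the rest, and the directory would be `Pickles`.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickles(objects, directory):
    """Pickle every value of `objects` to <directory>/<name>.pickle."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    for name, obj in objects.items():
        with open(directory / f"{name}.pickle", "wb") as output:
            pickle.dump(obj, output)

def load_pickle(name, directory):
    """Read back one object saved by save_pickles."""
    with open(Path(directory) / f"{name}.pickle", "rb") as data:
        return pickle.load(data)

# Placeholder objects standing in for the notebook's variables.
artifacts = {"labels_train": [0, 3, 4], "category_codes": {"business": 0}}

with tempfile.TemporaryDirectory() as tmp:
    save_pickles(artifacts, tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())
    restored = load_pickle("labels_train", tmp)

print(saved_files)
print(restored)  # [0, 3, 4]
```

Creating the directory up front avoids a `FileNotFoundError` when the `Pickles` folder does not exist yet.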