### 12.1. Feature engineering

Feature engineering involves creating new features and transforming existing ones to improve the performance of a machine learning model. In the context of natural language processing (NLP) and text classification, I'm going to consider two feature engineering strategies: more complex text preprocessing and the addition of more feature variables.

#### 12.1.1. More complex text preprocessing 

- Lowercasing: convert all text to lowercase to maintain consistency.
- Sentence segmentation: split each summary into sentences before tokenizing.
- Tokenization: split the text into individual words (tokens).
- Removing stop words: drop very common words that carry little information for classification.
- Lemmatization: reduce words to their base or root form to normalize variations.
- Removing special characters and numbers, but keeping hyphens.
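
To make the order of these steps concrete, here is a minimal, dependency-free sketch on a single made-up sentence. It is not the NLTK pipeline used in the cells that follow: `TOY_STOP_WORDS`, `TOY_LEMMAS` and `toy_preprocess` are hypothetical stand-ins for NLTK's stop-word list, the WordNet lemmatizer and the real preprocessing function, and the regex-based splitting is only a rough approximation of `sent_tokenize` and `word_tokenize`.

```python
import re

# Hypothetical, tiny stand-ins for NLTK's stop-word list and WordNet lemmatizer
TOY_STOP_WORDS = {"a", "the", "in", "is", "and", "of", "he"}
TOY_LEMMAS = {"lives": "life", "leads": "lead", "gangs": "gang"}

def toy_preprocess(text):
    """Apply the listed steps in order and return the cleaned tokens."""
    text = text.lower()                               # lowercasing
    sentences = re.split(r'(?<=[.!?])\s+', text)      # naive sentence segmentation
    tokens = []
    for sentence in sentences:
        # naive tokenization: runs of letters/digits, optionally joined by hyphens
        for token in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence):
            if token in TOY_STOP_WORDS:               # stop-word removal
                continue
            token = TOY_LEMMAS.get(token, token)      # toy lemmatization
            if token.replace('-', '').isalpha():      # drop numbers, keep hyphenated words
                tokens.append(token)
    return tokens

sample = "Alex lives in near-future England. He leads 2 gangs!"
print(toy_preprocess(sample))
# ['alex', 'life', 'near-future', 'england', 'lead', 'gang']
```

The toy mapping of "lives" to "life" mirrors what the real output further down shows: NLTK's `WordNetLemmatizer.lemmatize` treats words as nouns unless a different part of speech is passed, so verbs such as "lives" can end up as noun lemmas.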

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')
import numpy as np
import re
import os

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [3]:
five = pd.read_csv("/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data/five.csv")
five.head(3)

Unnamed: 0,title,summary,genre
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction
1,The Plague,The text of The Plague is divided into five p...,literary fiction
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction


In [4]:
preproc = five.copy()

In [5]:
# I create a function that returns the DataFrame with a new column containing the preprocessed text.

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load NLTK resources outside the function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing_fe(preproc):
    # Process each row in the DataFrame
    clean_sum = []

    for index, row in preproc.iterrows():
        # Preprocess the text
        text = row['summary'].lower()

        # Tokenize the text into sentences
        sentences = sent_tokenize(text)

        # Process each sentence
        clean_tokens = []
        for sentence in sentences:
            tokens = word_tokenize(sentence)

            # Drop stop words; keep a hyphenated token only if it is purely alphabetic apart from its hyphens
            filtered_tokens = [word for word in tokens if word.lower() not in stop_words and ('-' not in word or word.replace('-', '').isalpha())]

            lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

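            # Replace every character other than ASCII letters and hyphens with a space, then split on whitespace
            # (accented letters are replaced too, e.g. "bäumer" becomes "b umer")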
            clean_tokens.extend([re.sub(r'[^a-zA-Z-]', ' ', word).split() for word in lemmatized_tokens])

        cleaned_summary = ' '.join(' '.join(words) for words in clean_tokens)
        clean_sum.append(cleaned_summary)

    # Update the DataFrame column outside the loop
    preproc['clean_sum'] = clean_sum
    preproc['clean_sum'] = preproc['clean_sum'].astype(str)

    return preproc

[nltk_data] Downloading package punkt to /Users/usuari/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/usuari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/usuari/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
preprocessing_fe(preproc)

Unnamed: 0,title,summary,genre,clean_sum
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near-future england lea...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thou...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul b umer german soldier wh...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north ar...
4,Blade Runner 3: Replicant Night,"Living on Mars, Deckard is acting as a consul...",science fiction,living mar deckard acting consultant movie cr...
...,...,...,...,...
11008,Hounded,"Atticus O’Sullivan, last of the Druids, lives ...",fantasy,atticus sullivan last druid life peacefully...
11009,Charlie and the Chocolate Factory,Charlie Bucket's wonderful adventure begins wh...,fantasy,charlie bucket s wonderful adventure begin fin...
11010,Red Rising,"""I live for the dream that my children will be...",fantasy,live dream child born free say like land...
11011,Frostbite,"Rose loves Dimitri, Dimitri might love Tasha, ...",fantasy,rose love dimitri dimitri might love tasha m...
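
The `b umer` in row 2 of the output above comes from the ASCII-only pattern `[^a-zA-Z-]`, which treats `ä` as a special character. The sketch below reproduces that effect with the standard `re` module and compares it with one possible Unicode-aware pattern. The alternative pattern is an assumption of mine, not part of the notebook, and switching to it would change the outputs shown in this section.

```python
import re

ascii_only = r'[^a-zA-Z-]'        # pattern used in preprocessing_fe
unicode_aware = r'[^\w-]|[\d_]'   # assumed alternative: keep any Unicode letter and hyphens

def clean(pattern, word):
    """Replace every match of `pattern` with a space and split on whitespace."""
    return re.sub(pattern, ' ', word).split()

word = "b\u00e4umer"  # "bäumer", already lowercased as in the function

print(clean(ascii_only, word))              # ['b', 'umer']
print(clean(unicode_aware, word))           # ['bäumer']
print(clean(unicode_aware, "near-future"))  # ['near-future']
print(clean(unicode_aware, "room101"))      # ['room']
```

In Python 3, `\w` matches Unicode word characters for `str` patterns, so accented letters survive, while the extra `[\d_]` branch still removes digits and underscores.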


In [7]:
# I replace the "-" with a space to separate two tokens.
preproc['clean_sum'] = preproc['clean_sum'].str.replace('-', ' ')
preproc.head(4)

Unnamed: 0,title,summary,genre,clean_sum
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lea...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thou...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul b umer german soldier wh...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north ar...


In [8]:
# I drop the tokens that only have one letter. 

preproc['clean_sum'] = preproc['clean_sum'].apply(lambda text: ' '.join(word for word in word_tokenize(text.lower()) if len(word) > 1))
preproc

Unnamed: 0,title,summary,genre,clean_sum
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lead ...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thousa...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul umer german soldier who u...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north arc...
4,Blade Runner 3: Replicant Night,"Living on Mars, Deckard is acting as a consul...",science fiction,living mar deckard acting consultant movie cre...
...,...,...,...,...
11008,Hounded,"Atticus O’Sullivan, last of the Druids, lives ...",fantasy,atticus sullivan last druid life peacefully ar...
11009,Charlie and the Chocolate Factory,Charlie Bucket's wonderful adventure begins wh...,fantasy,charlie bucket wonderful adventure begin find ...
11010,Red Rising,"""I live for the dream that my children will be...",fantasy,live dream child born free say like land fathe...
11011,Frostbite,"Rose loves Dimitri, Dimitri might love Tasha, ...",fantasy,rose love dimitri dimitri might love tasha mas...


#### 12.1.2. Add another feature variable

- Count the number of unique words in each summary.
- Count the total number of words in each summary.
- Store both counts as new features that represent the length of each summary.
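
As a quick illustration of the difference between the two counts, here is a small sketch on a fragment of cleaned text taken from the Frostbite row above. The helper name `word_counts` is hypothetical; the cells below compute the same quantities directly with `apply`.

```python
def word_counts(clean_text):
    """Return (total words, unique words) for a whitespace-separated string."""
    tokens = clean_text.split()
    return len(tokens), len(set(tokens))

sample = "rose love dimitri dimitri might love tasha"
total, unique = word_counts(sample)
print(total, unique)  # 7 5
```

"love" and "dimitri" each appear twice, so the total is 7 while the unique count is 5.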

In [9]:
# Add a column with the count of unique words in each summary.

preproc['unique_word_count'] = preproc['clean_sum'].apply(lambda x: len(set(x.split())))
preproc.sample(5)

Unnamed: 0,title,summary,genre,clean_sum,unique_word_count
3009,Deep Wizardry,Nita's family goes on vacation with Kit and h...,fantasy,nita family go vacation kit dog ponch south sh...,319
7201,Wiseguy,Hill began his life of crime at age 12 in 195...,thriller,hill began life crime age working go fer paul ...,150
10267,Someone We Know,Maybe you don't know your neighbors as well as...,thriller,maybe know neighbor well thought difficult let...,68
5105,The House with the Green Shutters,"*Chapter I. On a weekday morning at eight, Go...",literary fiction,chapter weekday morning eight gourlay twelve c...,312
3154,Empire Star,"As the narrative opens, we meet Comet Jo at e...",science fiction,narrative open meet comet jo eighteen year age...,161


In [11]:
# Add a column with the total word count of each summary.

preproc['word_count'] = preproc['clean_sum'].apply(lambda x: len(x.split()))
preproc.head(4)

Unnamed: 0,title,summary,genre,clean_sum,unique_word_count,word_count
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lead ...,416,588
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thousa...,424,609
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul umer german soldier who u...,277,375
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north arc...,371,549


In [12]:
# I save the updated DataFrame, first to the current working directory:

preproc.to_csv("fe.csv", index=False)

# Specify the folder path and filename for the CSV file
folder_path = "/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data"
file_name = "fe.csv"

# Combine the folder path and filename to create the full file path
full_file_path = f"{folder_path}/{file_name}"

# Export the DataFrame to the specified folder
preproc.to_csv(full_file_path, index=False)
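
The path above is built with an f-string. As a hedged alternative, the sketch below builds the same kind of path with `pathlib` and makes sure the folder exists before writing. The temporary directory is a hypothetical stand-in for the project's data folder, and a one-row text file stands in for the DataFrame so the example runs without pandas.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the project's data folder
data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

full_file_path = data_dir / "fe.csv"         # joins folder and file name portably

# Stand-in for the export; with the DataFrame from this section one would call
# preproc.to_csv(full_file_path, index=False) instead.
full_file_path.write_text("title,word_count\nA Clockwork Orange,588\n", encoding="utf-8")

print(full_file_path.name)
```

`DataFrame.to_csv` accepts a `Path` object as well as a string, so the only change to the cell above would be how `full_file_path` is built.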