### 6. Text preprocessing

I'm going to use NLTK for the preprocessing of the summaries:

- Lowercasing: Convert all text to lowercase to maintain consistency.
- Tokenization: Split the text into individual words (tokens).
- Removing stop words: drop very common words (such as "the" or "and") that add little meaning.
- Lemmatization or stemming: reduce words to their base or root form to normalize variations.
- Removing special characters and numbers.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')
import numpy as np
import re
import os

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [3]:
five = pd.read_csv("/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data/five.csv")
five.head(3)

Unnamed: 0,title,summary,genre
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction
1,The Plague,The text of The Plague is divided into five p...,literary fiction
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction


In [4]:
five.shape

(11013, 3)

In [None]:
# Option 1: text preprocessing without punctuation removal

In [5]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.7.2-py3-none-any.whl (3.4 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.7.2


In [None]:
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker

def preprocessing_1(five):
    # Build the reusable helpers once, not once per row
    spell_checker = SpellChecker()
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    for index, row in five.iterrows():
        text = row['summary']

        # Handling Contractions (done before tokenizing, because
        # word_tokenize splits "won't" into "wo" and "n't")
        text = re.sub(r"won't", "will not", text, flags=re.IGNORECASE)
        text = re.sub(r"can't", "cannot", text, flags=re.IGNORECASE)
        # Add more contraction expansions as needed

        # Spell Correction; correction() can return None when it finds
        # no candidate, so fall back to the original token
        tokens = word_tokenize(text)
        corrected_tokens = [spell_checker.correction(word) or word for word in tokens]
        text = ' '.join(corrected_tokens)

        # Sentence Segmentation
        sentences = sent_tokenize(text)
        # Assuming you want to concatenate sentences with a space in between
        text = ' '.join(sentences)

        # Tokenization
        tokens = word_tokenize(text)

        # Lowercasing if it's not an abbreviation (checked token by token)
        tokens = [word if re.match('([A-Z]+[a-z]*){2,}', word) else word.lower()
                  for word in tokens]

        # Removing stop words
        filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

        # Lemmatization
        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

        # Removing special characters and numbers, then dropping empty tokens
        clean_tokens = [re.sub(r'[^a-zA-Z]', '', word) for word in lemmatized_tokens]
        clean_tokens = [word for word in clean_tokens if word]

        # Update the 'cleaned_summary' column with the preprocessed text
        five.at[index, 'cleaned_summary'] = ' '.join(clean_tokens)

    return five



1) Handling Contractions:
Expand contractions to ensure consistency. For example, convert "don't" to "do not."

2) Spell Correction:
Depending on the quality of your data, you might consider implementing a spell-checking mechanism to correct typos.

3) Sentence Segmentation (before tokenization):
If your summaries are long and contain multiple sentences, consider segmenting them into individual sentences.


4) Part-of-speech tagging and NER (outside the function, stored in another column):

a - Part-of-speech tagging: tag each token to understand the grammatical structure of sentences. This can be useful for certain types of analysis.
b - NER: named entity recognition.

5) Add a column with the length of each summary, so that I can then group by genre and plot a histogram for each one.
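
Going back to step 1, contractions could be expanded from a lookup table instead of one `re.sub` call per contraction. This is only a minimal sketch: the `CONTRACTIONS` table and the `expand_contractions` helper are illustrative names (not part of NLTK), the table is deliberately tiny, and it only handles straight apostrophes.

```python
import re

# Hypothetical, deliberately small lookup table; extend it as needed.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
}

# One alternation pattern, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace every known contraction in `text` with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("He won't say why they don't leave."))
```

The expansion is always lowercase, which is harmless here because the text is lowercased later in the pipeline anyway.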


In [None]:
def preprocessing_1(five):
    # Build the reusable helpers once, not once per row
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    for index, row in five.iterrows():
        text = row['summary']

        # Tokenization
        tokens = word_tokenize(text)

        # Lowercasing if it's not an abbreviation (checked token by token)
        tokens = [word if re.match('([A-Z]+[a-z]*){2,}', word) else word.lower()
                  for word in tokens]

        # Removing stop words
        filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

        # Lemmatization
        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

        # Removing special characters and numbers, then dropping empty tokens
        clean_tokens = [re.sub(r'[^a-zA-Z]', '', word) for word in lemmatized_tokens]
        clean_tokens = [word for word in clean_tokens if word]

        # Update the 'cleaned_summary' column with the preprocessed text
        five.at[index, 'cleaned_summary'] = ' '.join(clean_tokens)

    return five

In [None]:
# Option 2: text preprocessing with punctuation removal
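
One possible way to do the punctuation removal for Option 2 is a translation table built from `string.punctuation`. This is a sketch, not the final implementation: `remove_punctuation` is a made-up helper name, and it only strips ASCII punctuation, so curly quotes or long dashes would survive.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return `text` with ASCII punctuation stripped."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation("Alex, a teenager (aged 15), narrates."))
```

Letters with accents or umlauts (as in "Bäumer") are left untouched, unlike with the `[^a-zA-Z]` regex used in Option 1.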

In [None]:
# Option 3: text preprocessing with stemming instead of lemmatization
# SnowballStemmer
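
For Option 3, the lemmatization step could be swapped for NLTK's `SnowballStemmer`. A minimal sketch, assuming the tokens have already been lowercased and filtered; `stem_tokens` is an illustrative helper name.

```python
from nltk.stem import SnowballStemmer

# The Snowball stemmer is rule-based, so it needs no corpus download.
stemmer = SnowballStemmer("english")

def stem_tokens(tokens):
    """Stem each token; the results are stems, not necessarily dictionary words."""
    return [stemmer.stem(token) for token in tokens]

print(stem_tokens(["running", "stories", "generously"]))
```

Stemming is faster than lemmatization but more aggressive, so the output may contain truncated forms that are harder to read in plots.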

In [None]:
# Option 4: text preprocessing with handling rare words 
# (remove or replace rare words that might not contribute much to the model's understanding). 
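
A rough sketch of Option 4, assuming each summary is already a list of tokens. The function name `replace_rare_words`, the `min_count` threshold and the `<rare>` placeholder are all assumptions chosen for illustration.

```python
from collections import Counter

def replace_rare_words(token_lists, min_count=2, placeholder="<rare>"):
    """Replace tokens seen fewer than `min_count` times across the corpus."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return [
        [token if counts[token] >= min_count else placeholder for token in tokens]
        for tokens in token_lists
    ]

docs = [["war", "soldier", "front"], ["war", "plague", "city"], ["city", "soldier"]]
print(replace_rare_words(docs))
```

To remove rare words instead of replacing them, the inner list comprehension could simply filter on `counts[token] >= min_count`.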

In [None]:
# I want to find out the summary count of words (the average) for each genre.
# I also want to find out the summary count of unique words (average) for each genre. 
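
The two averages could be computed along these lines. The `toy` DataFrame is a made-up stand-in for `five`, the column names `word_count` and `unique_word_count` are assumptions, and splitting on whitespace is only an approximation of a proper tokenizer.

```python
import pandas as pd

# Toy stand-in for the `five` DataFrame; the real one is read from five.csv.
toy = pd.DataFrame({
    "summary": ["a war story about a war", "the plague hits the city", "robots rule"],
    "genre": ["literary fiction", "literary fiction", "science fiction"],
})

# Whitespace split is a rough word count; word_tokenize would be stricter.
words = toy["summary"].str.split()
toy["word_count"] = words.str.len()
toy["unique_word_count"] = words.apply(lambda w: len(set(w)))

# Average total and unique word counts per genre.
per_genre = toy.groupby("genre")[["word_count", "unique_word_count"]].mean()
print(per_genre)
```

With the `word_count` column in place, `toy.hist(column="word_count", by="genre")` should give one histogram per genre.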

Feature Importance Analysis is a process in machine learning where you assess the significance or contribution of each feature (in this case, words) to the model's predictions. It helps you understand which features have the most impact on the model's performance. In the context of text data, features are often individual words or tokens.

Here's how you might conduct a Feature Importance Analysis for your text classification task:

- Train a model: Train your machine learning model using your preprocessed text data.
- Use a model with inherent feature importance: Some models, like decision trees or random forests, have built-in mechanisms for calculating feature importance during training. These models can provide a direct measure of how much each word contributes to the model's decisions.
- Feature importance metrics: For models without built-in feature importance, you can use techniques like permutation importance or SHAP (SHapley Additive exPlanations) values. These methods provide insights into how much the inclusion or exclusion of a feature affects the model's predictions.
- Visualization: Visualize the feature importance scores. This could be in the form of a bar chart, where each bar represents the importance of a specific word.
- Identify influential words: Analyze the feature importance results to identify which words are the most influential in making predictions. This includes understanding whether rare words, in particular, play a significant role.

Why Feature Importance Analysis matters:

- Identifying key features: It helps you identify which words or features are crucial for the model's decision-making process. This insight is valuable for understanding the interpretability of your model.
- Optimizing preprocessing: You can use the results to optimize your preprocessing steps. If rare words turn out to be important, you might reconsider strategies for handling them.
- Model understanding: Feature importance analysis provides a way to interpret your model's behavior. It helps answer questions like: "What words contribute the most to predicting a certain genre?"
- Addressing overfitting: It can help identify if the model is overfitting to specific words, potentially leading to more robust models.

In [None]:
# Example with RandomForestClassifier

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Assuming X_train is your feature matrix and y_train is the target variable

# Train a RandomForestClassifier (fixed seed so the importances are reproducible)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Get the impurity-based feature importances (one score per column of X_train)
feature_importances = model.feature_importances_

# Get the words (features)
words = np.asarray(your_feature_names)  # Replace with your actual feature names (words)

# Keep only the 20 most important words so the chart stays readable
top = np.argsort(feature_importances)[-20:]

# Create a bar chart for visualization
plt.figure(figsize=(10, 6))
plt.barh(words[top], feature_importances[top])
plt.xlabel('Feature Importance')
plt.ylabel('Words')
plt.title('Feature Importance Analysis')
plt.show()