# Feature Engineering & Representation Implementation

In this notebook we implement a feature engineering pipeline that combines three types of features: TF-IDF for capturing important terms, Word2Vec for semantic relationships, and metadata features from review statistics. Combining them lets us capture different aspects of the reviews in a single representation.

### 1. Initial Setup and Data Loading:
#### Import Libraries
We begin by importing the libraries needed for text vectorization (TF-IDF), word embeddings (Word2Vec), and feature scaling. The cleaned dataset from the previous preprocessing steps serves as our starting point.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from sklearn.preprocessing import StandardScaler

#### Load Dataset
Next, we load the cleaned dataset and examine its structure: shape, column names, and data types.

In [3]:
# Load the cleaned dataset
filtered_reviews = pd.read_csv('cleaned_reviews.csv')
filtered_reviews.head(5)

Unnamed: 0,product_id,reviews,review_count,avg_rating,sentiment_details,total_helpful_votes,preprocessed_reviews,preprocessed_tokens,comb_preprocessed_reviews,comb_preprocessed_tokens,sentiment,sentiment_encoded
0,2734888454,['My dogs loves this chicken but its a product...,2,3.5,"{'positive': 1, 'neutral': 0, 'negative': 1}",1,['my dogs loves this chicken but its a product...,['dog love chicken product china buying anymor...,my dogs loves this chicken but its a product f...,dog love chicken product china buying anymore ...,neutral,1
1,7800648702,"[""This came in a HUGE tin, much bigger than I ...",2,4.0,"{'positive': 1, 'neutral': 1, 'negative': 0}",0,['this came in a huge tin much bigger than i e...,['came huge tin much bigger expected cooky swe...,this came in a huge tin much bigger than i exp...,came huge tin much bigger expected cooky sweet...,positive,2
2,B00002NCJC,['Why is this $[...] when the same product is ...,2,4.5,"{'positive': 2, 'neutral': 0, 'negative': 0}",0,['why is this when the same product is availab...,['product available http www amazon com victor...,why is this when the same product is available...,product available http www amazon com victor f...,positive,2
3,B00002Z754,"[""I just received my shipment and could hardly...",2,5.0,"{'positive': 2, 'neutral': 0, 'negative': 0}",17,['i just received my shipment and could hardly...,['received shipment could hardly wait try prod...,i just received my shipment and could hardly w...,received shipment could hardly wait try produc...,positive,2
4,B00004RAMV,"[""Large, therefore keeps its content for a whi...",9,2.111111,"{'positive': 3, 'neutral': 0, 'negative': 6}",17,['large therefore keeps its content for a whil...,['large therefore keep content easy fill use r...,large therefore keeps its content for a while ...,large therefore keep content easy fill use reu...,neutral,1


In [4]:
filtered_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40289 entries, 0 to 40288
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   product_id                 40289 non-null  object 
 1   reviews                    40289 non-null  object 
 2   review_count               40289 non-null  int64  
 3   avg_rating                 40289 non-null  float64
 4   sentiment_details          40289 non-null  object 
 5   total_helpful_votes        40289 non-null  int64  
 6   preprocessed_reviews       40289 non-null  object 
 7   preprocessed_tokens        40289 non-null  object 
 8   comb_preprocessed_reviews  40289 non-null  object 
 9   comb_preprocessed_tokens   40289 non-null  object 
 10  sentiment                  40289 non-null  object 
 11  sentiment_encoded          40289 non-null  int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 3.7+ MB


##### Observations:
- Successfully loaded dataset with all preprocessing columns intact
- Dataset maintains its structure with 40,289 rows and 12 columns (1 float, 3 integer, 8 object)
- No missing values in any columns
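
These observations can also be checked programmatically rather than read off the `info()` output. The sketch below is a minimal illustration on a small made-up frame; `toy_reviews` and its values are assumptions standing in for `filtered_reviews`.

```python
import pandas as pd

# Toy frame standing in for filtered_reviews (values are illustrative only)
toy_reviews = pd.DataFrame({
    "product_id": ["A1", "B2", "C3"],
    "review_count": [2, 2, 9],
    "avg_rating": [3.5, 4.0, 2.1],
})

# Row and column counts
n_rows, n_cols = toy_reviews.shape

# Number of missing values in each column
missing_per_column = toy_reviews.isna().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(f"Total missing values: {int(missing_per_column.sum())}")
```

Running the same two checks on `filtered_reviews` should agree with the `info()` output above.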

### 2. Data Preparation for Feature Engineering:
We will organize our features and target variables for the train-test split, selecting relevant columns for different feature types (text-based and metadata).

In [7]:
def prepare_data_for_split(filtered_reviews):
    """
    Prepare X and y before train-test split.
    
    Parameters:
    -----------
    filtered_reviews : pandas.DataFrame
        Preprocessed review data
        
    Returns:
    --------
    X : pandas.DataFrame
        Features to be used for splitting (text and metadata)
    y : numpy.ndarray
        Target variable (sentiment)
    """
    # Select relevant columns for X
    X = filtered_reviews[[
        'comb_preprocessed_reviews',  # for TF-IDF
        'comb_preprocessed_tokens',   # for Word2Vec
        'review_count',               # metadata
        'avg_rating',                 # metadata
        'total_helpful_votes'         # metadata
    ]].copy()
    
    # Target variable
    y = filtered_reviews['sentiment_encoded'].values
    
    return X, y

The function above creates a clear separation between the features (X) and the target variable (y). It selects a mix of text and metadata features and keeps both the combined preprocessed reviews and their tokenized versions, since the two text feature extraction methods use different inputs.

In [9]:
def perform_train_test_split(X, y, test_size=0.2, random_state=42):
    """
    Perform train-test split on the data.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=y)
    
    print("Data split sizes:")
    print(f"Training set: {len(X_train)} samples")
    print(f"Test set: {len(X_test)} samples")
    
    return X_train, X_test, y_train, y_test

This function splits the data into training (80%) and test (20%) sets. Stratifying on `y` keeps the proportion of each sentiment class approximately the same in both sets; it does not balance the classes.
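
To make the effect of `stratify` concrete, here is a minimal sketch on synthetic labels. The label counts are an assumption chosen for illustration and are not taken from the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only):
# 10% class 0, 20% class 1, 70% class 2
y_toy = np.array([0] * 100 + [1] * 200 + [2] * 700)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    return np.bincount(labels) / len(labels)

print("full :", class_shares(y_toy))
print("train:", class_shares(y_tr))
print("test :", class_shares(y_te))
```

All three lines should report nearly identical class shares, which is the property the split above relies on.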

### 3. Feature Creation:
Next we will implement three types of features to capture different aspects of the reviews:

1. TF-IDF features for capturing important terms (5000 features)
2. Word2Vec embeddings for semantic relationships (100 features)
3. Metadata features including review count, average rating and total_helpful_votes (3 features)

The function below converts the text into TF-IDF (Term Frequency-Inverse Document Frequency) features using `TfidfVectorizer`. TF-IDF turns each review into a numerical vector in which a term's weight reflects how important it is to that review relative to the entire collection. The vectorizer keeps at most 5000 unigram and bigram features, removes English stop words, and applies L2 normalization so that reviews of different lengths are comparable. It is fitted on the training set only and then applied to the test set.

In [12]:
def create_tfidf_features(X_train, X_test, max_features=5000):
    """
    Create TF-IDF features for train and test sets.
    
    Parameters:
    -----------
    X_train : pandas.DataFrame
        Training data
    X_test : pandas.DataFrame
        Test data
    max_features : int
        Maximum number of TF-IDF features
        
    Returns:
    --------
    dict containing:
        - X_train_tfidf: TF-IDF features for training data
        - X_test_tfidf: TF-IDF features for test data
        - tfidf: Fitted TfidfVectorizer
    """
    print("\nCreating TF-IDF features...")
    tfidf = TfidfVectorizer(
        max_features=max_features,
        ngram_range=(1, 2),
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{1,}',
        stop_words='english',
        norm='l2')
    
    X_train_tfidf = tfidf.fit_transform(X_train['comb_preprocessed_reviews'])
    X_test_tfidf = tfidf.transform(X_test['comb_preprocessed_reviews'])
    
    print(f"TF-IDF features shape: {X_train_tfidf.shape[1]} features")
    
    return {'X_train_tfidf': X_train_tfidf,
            'X_test_tfidf': X_test_tfidf,
            'tfidf': tfidf}
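
As a small illustration of two settings used above, `ngram_range=(1, 2)` and `norm='l2'`, the sketch below fits a vectorizer on a few made-up documents. The documents and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents (illustrative only)
train_docs = [
    "dog love chicken treat",
    "tea taste great love tea",
    "chicken soup taste bland",
]
test_docs = ["dog love tea"]

vec = TfidfVectorizer(ngram_range=(1, 2), norm='l2')
X_train_toy = vec.fit_transform(train_docs)   # learn vocabulary and IDF on train
X_test_toy = vec.transform(test_docs)         # reuse them on test

# With norm='l2', every non-empty row has Euclidean length 1
row_norms = np.sqrt(
    np.asarray(X_train_toy.multiply(X_train_toy).sum(axis=1)).ravel())

print("bigram in vocabulary:", "dog love" in vec.vocabulary_)
print("row norms:", row_norms)
print("train/test widths:", X_train_toy.shape[1], X_test_toy.shape[1])
```

Because the test set is transformed with the vocabulary learned from the training set, both matrices have the same number of columns.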

Next we will implement Word2Vec embeddings to capture semantic relationships between words in the reviews. Word2Vec learns distributed vector representations of words based on their context, where words appearing in similar contexts are mapped to nearby points in the vector space. 

In [14]:
def create_document_vectors(text_series, w2v_model):
    """
    Create document vectors from text using Word2Vec model.
    
    Parameters:
    -----------
    text_series : pandas.Series
        Series containing tokenized text
    w2v_model : Word2Vec
        Trained Word2Vec model
        
    Returns:
    --------
    numpy.ndarray
        Document vectors
    """
    doc_vectors = np.zeros((len(text_series), w2v_model.vector_size))
    
    for idx, text in enumerate(text_series):
        tokens = text.split()
        valid_tokens = [token for token in tokens if token in w2v_model.wv]
        if valid_tokens:
            doc_vectors[idx] = np.mean([w2v_model.wv[token] for token in valid_tokens], axis=0)
            
    return doc_vectors
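
The averaging step can be illustrated without a trained model. In the sketch below, `toy_vectors` is a hypothetical dictionary of 3-dimensional vectors standing in for `w2v_model.wv`, and `average_vector` is an illustrative helper rather than part of the pipeline.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for w2v_model.wv
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "tea": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "zzz" is out of vocabulary and is ignored
print(average_vector(["good", "tea", "zzz"], toy_vectors))  # [2. 1. 1.]

# No known tokens at all: the document vector stays at zero
print(average_vector(["zzz"], toy_vectors))                 # [0. 0. 0.]
```

Out-of-vocabulary tokens do not affect the average, and a document made up entirely of unknown tokens is represented by a zero vector, which matches the behaviour of `create_document_vectors`.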

We train the model only on the training data to prevent data leakage, with a context window of 5 words and a minimum word frequency of 2. Each word is represented by a 100-dimensional vector, and a document vector is the average of the vectors of all in-vocabulary words in a review; a review with no in-vocabulary words gets a zero vector. This captures semantic similarities between reviews that simple term-frequency approaches can miss.

In [16]:
def create_word2vec_features(X_train, X_test, vector_size=100):
    """
    Create Word2Vec features for train and test sets.
    
    Parameters:
    -----------
    X_train : pandas.DataFrame
        Training data
    X_test : pandas.DataFrame
        Test data
    vector_size : int
        Size of word vectors
        
    Returns:
    --------
    dict containing:
        - X_train_w2v: Word2Vec features for training data
        - X_test_w2v: Word2Vec features for test data
        - w2v_model: Trained Word2Vec model
    """
    print("\nCreating Word2Vec features...")
    
    # Train Word2Vec on training data only
    train_tokens = [text.split() for text in X_train['comb_preprocessed_tokens']]
    w2v_model = Word2Vec(sentences=train_tokens,
                         vector_size=vector_size,
                         window=5,
                         min_count=2,
                         workers=4)
    
    # Create document vectors for both sets
    X_train_w2v = create_document_vectors(X_train['comb_preprocessed_tokens'], w2v_model)
    X_test_w2v = create_document_vectors(X_test['comb_preprocessed_tokens'], w2v_model)
    
    print(f"Word2Vec features shape: {X_train_w2v.shape[1]} features")
    
    return {'X_train_w2v': X_train_w2v,
            'X_test_w2v': X_test_w2v,
            'w2v_model': w2v_model}

The function below adds three statistical features about the reviews (review count, average rating, and total helpful votes) that provide context beyond the text content. The values are kept numeric, with no encoding applied.

In [19]:
def create_metadata_features(X_train, X_test):
    """
    Create metadata features for train and test sets.
    
    Parameters:
    -----------
    X_train : pandas.DataFrame
        Training data
    X_test : pandas.DataFrame
        Test data
        
    Returns:
    --------
    dict containing:
        - X_train_meta: Metadata features for training data
        - X_test_meta: Metadata features for test data
    """
    print("\nCreating metadata features...")
    metadata_columns = ['review_count', 'avg_rating', 'total_helpful_votes']
    
    X_train_meta = X_train[metadata_columns].values
    X_test_meta = X_test[metadata_columns].values
    
    print(f"Metadata features shape: {X_train_meta.shape[1]} features")
    
    return {'X_train_meta': X_train_meta,
            'X_test_meta': X_test_meta}

### 4. Feature Engineering Pipeline:
Lastly, we run the entire feature engineering pipeline to combine all features into a unified representation. The TF-IDF matrix is converted to a dense array before stacking, and standardization (fitted on the training set only) puts all features on the same scale for model training.

In [21]:
def run_feature_engineering_pipeline(filtered_reviews, test_size=0.2, random_state=42):
    """
    Run the complete feature engineering pipeline including standardization.
    
    Parameters:
    -----------
    filtered_reviews : pandas.DataFrame
        Preprocessed review data
    test_size : float, default=0.2
        Size of test set
    random_state : int, default=42
        Random seed for reproducibility
        
    Returns:
    --------
    X_train : numpy.ndarray
        Training features
    X_test : numpy.ndarray
        Test features
    y_train : numpy.ndarray
        Training labels
    y_test : numpy.ndarray
        Test labels
    transformers : dict
        Dictionary containing fitted transformers
    """
    # 1. Prepare data for splitting
    print("Preparing data for split...")
    X, y = prepare_data_for_split(filtered_reviews)
    
    # 2. Perform train-test split
    print("\nPerforming train-test split...")
    X_train, X_test, y_train, y_test = perform_train_test_split(X, y, test_size=test_size, random_state=random_state)

    # 3. Create each type of features
    tfidf_features = create_tfidf_features(X_train, X_test)
    w2v_features = create_word2vec_features(X_train, X_test)
    meta_features = create_metadata_features(X_train, X_test)

    # 4. Combine features
    # Combine training features
    X_train_combined = np.hstack([tfidf_features['X_train_tfidf'].toarray(),
                                  w2v_features['X_train_w2v'],
                                  meta_features['X_train_meta']])
    
    # Combine test features
    X_test_combined = np.hstack([tfidf_features['X_test_tfidf'].toarray(),
                                 w2v_features['X_test_w2v'],
                                 meta_features['X_test_meta']])
    
    print("\nFinal feature shapes:")
    print(f"Training set: {X_train_combined.shape}")
    print(f"Test set: {X_test_combined.shape}")

    # 5. Standardize all features
    print("\nStandardizing all features...")
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_combined)
    X_test_scaled = scaler.transform(X_test_combined)
    
    # Collect transformers
    transformers = {'tfidf': tfidf_features['tfidf'],
                    'w2v': w2v_features['w2v_model'],
                    'scaler': scaler}
    
    return X_train_scaled, X_test_scaled, y_train, y_test, transformers
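
The pipeline above calls `.toarray()` on the TF-IDF matrix, so the combined training matrix is dense: 32,231 rows by 5,103 columns of `float64` values is roughly 1.3 GB. If memory becomes a concern, one possible alternative is to keep TF-IDF sparse and standardize only the dense blocks. The sketch below illustrates that idea on random toy data. It is not the method used in this notebook, it produces different feature values (the TF-IDF columns are left L2-normalized rather than standardized), and all shapes and names are assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-ins for the feature blocks (shapes are illustrative only)
tfidf_train = sparse.random(8, 6, density=0.3, format="csr", random_state=0)
tfidf_test = sparse.random(4, 6, density=0.3, format="csr", random_state=1)
dense_train = rng.normal(size=(8, 5))   # e.g. Word2Vec + metadata columns
dense_test = rng.normal(size=(4, 5))

# Scale only the dense block: fit on train, reuse the same statistics on test
dense_scaler = StandardScaler()
dense_train_scaled = dense_scaler.fit_transform(dense_train)
dense_test_scaled = dense_scaler.transform(dense_test)

# Stack the blocks without converting TF-IDF to a dense array
X_train_sparse = sparse.hstack(
    [tfidf_train, sparse.csr_matrix(dense_train_scaled)], format="csr")
X_test_sparse = sparse.hstack(
    [tfidf_test, sparse.csr_matrix(dense_test_scaled)], format="csr")

print(X_train_sparse.shape, X_test_sparse.shape)
```

Whether this trade-off is worthwhile depends on the downstream model, since not every estimator accepts sparse input.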

### 5. Final Pipeline Results:

In [23]:
# Run the complete pipeline
X_train, X_test, y_train, y_test, transformers = run_feature_engineering_pipeline(
    filtered_reviews,
    test_size=0.2)

# Print final shapes
print("\nFinal dataset shapes:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")

Preparing data for split...

Performing train-test split...
Data split sizes:
Training set: 32231 samples
Test set: 8058 samples

Creating TF-IDF features...
TF-IDF features shape: 5000 features

Creating Word2Vec features...
Word2Vec features shape: 100 features

Creating metadata features...
Metadata features shape: 3 features

Final feature shapes:
Training set: (32231, 5103)
Test set: (8058, 5103)

Standardizing all features...

Final dataset shapes:
X_train: (32231, 5103)
X_test: (8058, 5103)
y_train: (32231,)
y_test: (8058,)


### Final Observations:

1. Feature Engineering Results:
    - Total features created: 5,103 (5000 TF-IDF + 100 Word2Vec + 3 metadata)
    - Training set: 32,231 samples
    - Test set: 8,058 samples
    

2. Feature Engineering Pipeline Success:

    - Successfully combined multiple feature types and standardized them
    - No missing values in final feature matrices
    - Preserved train/test separation throughout

3. Next Steps:

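    - Features are ready for model training
    - Fitted transformers are saved with `joblib` for future inference
    - Data is saved as a NumPy `.npz` archive for efficient loading (`np.savez` stores the arrays uncompressed; `np.savez_compressed` would compress them)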

In [26]:
# Save numpy arrays
np.savez('data.npz', X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)

In [29]:
import joblib

# Save transformers
joblib.dump(transformers, 'transformers.joblib')

['transformers.joblib']
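
For completeness, here is a minimal sketch of how artifacts saved this way could be loaded back in a later notebook. It writes small placeholder objects to a temporary directory so that it runs on its own; the toy arrays and the placeholder dictionary are assumptions, not the real pipeline outputs.

```python
import os
import tempfile

import joblib
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, "data.npz")
    joblib_path = os.path.join(tmp_dir, "transformers.joblib")

    # Toy stand-ins for the real arrays and fitted transformers
    np.savez(npz_path, X_train=np.ones((4, 3)), y_train=np.array([0, 1, 2, 1]))
    joblib.dump({"scaler": "placeholder"}, joblib_path)

    # Arrays are retrieved by the keyword names used in np.savez
    with np.load(npz_path) as data:
        X_train_loaded = data["X_train"]
        y_train_loaded = data["y_train"]

    # joblib.load returns the same dictionary that was dumped
    transformers_loaded = joblib.load(joblib_path)

print(X_train_loaded.shape, y_train_loaded.shape, list(transformers_loaded))
```

With the real files, the same pattern applies using `'data.npz'` and `'transformers.joblib'` and the keys `X_train`, `X_test`, `y_train`, and `y_test`.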