# Project Milestone 2: Acquire and Understand the Data

Project Group Members:
- Geoffrey Humphreys
- Amir Koupaei
- Chris Moon
- Connor Poetzinger

Wednesday, March 27, 2024

## Overview

This notebook contains the code and analysis for the second milestone of our project. This milestone focuses on preparing the [dataset](https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset) for subsequent analysis. For submission, follow the Milestone Checklist below.

**Milestone Checklist:**
- Access: Download, collect, or scrape the dataset from the relevant source(s).
  - **Achieved by downloading the dataset from Kaggle.**
- Load: Start a new Jupyter Notebook, import necessary Python libraries and load your data set for inspection.
  - **Achieved by using the `pandas` library to load the dataset.**
- Understand: Examine the dataset. Ensure you understand the different features and their data types.
  - **Achieved by previewing the data and examining summary statistics, data types, and missing values.**
- Preprocessing: Document any cleaning or preprocessing setup that may be necessary/required. This portion only includes the preprocessing steps, not the actual execution of the steps.
  - **Achieved by identifying and documenting the preprocessing steps necessary for the dataset.**

## Data Set Information and Background

In our group project, we focus on distinguishing between genuine and counterfeit product reviews, using a curated dataset designed to mirror the complexities and nuances of real-world online review platforms. The dataset contains a balanced collection of approximately 40,000 product reviews (40,432 rows in the file we loaded), divided evenly into two categories:
- **Original Reviews**: Genuine product reviews written by real customers.
- **Computer-Generated Fake Reviews**: Counterfeit product reviews generated by an algorithm.

Each review in the dataset is annotated according to its source category (OR or CG), enabling us to train and evaluate machine learning models to classify reviews as genuine or counterfeit. The dataset is stored in a CSV file, with each row representing a single review and containing the following columns:
1. `category`: Product category
2. `rating`: Rating of the product
3. `label`: Label indicating whether the review is fake or real
4. `text_`: Review text

## Import the Data

The data is available in a CSV file named `fake reviews dataset.csv`. We will load the data into a pandas DataFrame and examine the first few rows to understand the structure of the data.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
import re

from bs4 import BeautifulSoup
from pandarallel import pandarallel
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

nltk.download("stopwords")
nltk.download("punkt")

In [None]:
df = pd.read_csv("../../data/raw/fake reviews dataset.csv")
df.shape

- **Rows:** 40,432
- **Columns:** 4

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.sample(5)

By previewing the data, we can see there are reviews for various product categories with a rating between 1 and 5. The `text_` column holds the raw review text, which is messy and will need cleaning before analysis. Cleaning unstructured text like this calls for natural language processing (NLP) techniques.

In [None]:
# Check for missing values
df.isna().mean()

In [None]:
df.info()

In [None]:
df["category"].value_counts()

In [None]:
df["label"].value_counts()

In [None]:
df.describe()
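
The checks in the cells above can also be condensed into values that are easy to assert on. The following is a minimal sketch on a toy DataFrame, not the real data: the rows, the `real`/`fake` labels, and the 5% balance tolerance are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame(
    {
        "rating": [5.0, 4.0, 1.0, 2.0],
        "label": ["real", "fake", "real", "fake"],
        "text_": ["Great", "Nice", None, "Bad"],
    }
)

# Fraction of missing cells per column (0.0 means the column is complete)
missing_fraction = toy.isna().mean()

# Share of rows belonging to each class
class_share = toy["label"].value_counts(normalize=True)

# Hypothetical tolerance: treat the classes as balanced if shares differ by < 5%
is_balanced = (class_share.max() - class_share.min()) < 0.05
```

Here `missing_fraction` flags the one missing review text in the toy data, which is the kind of gap `df.isna().mean()` would reveal on the full dataset.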

Fortunately, the dataset contains no missing values, its classes are balanced, and the data types are easy to work with. We can focus on cleaning the text data and preparing it for analysis. The structured feature columns (`category`, `rating`) will be useful for exploratory data analysis (EDA) and feature engineering. For modeling, we should consider one-hot encoding the `category` column and standardizing the `rating` column. The dependent variable `label` will be label encoded to prepare for classification modeling.
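
To make the three encodings mentioned above concrete, here is a minimal sketch on a toy DataFrame. The category names, ratings, and the `real`/`fake` labels are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame(
    {
        "category": ["Books", "Toys", "Books", "Toys"],
        "rating": [5.0, 1.0, 3.0, 3.0],
        "label": ["real", "fake", "fake", "real"],
    }
)

# One-hot encoding: one 0/1 indicator column per distinct category
one_hot = OneHotEncoder()
category_matrix = one_hot.fit_transform(toy[["category"]]).toarray()
category_names = list(one_hot.categories_[0])

# Standardization: subtract the mean and divide by the standard deviation
scaled_rating = StandardScaler().fit_transform(toy[["rating"]]).ravel()

# Label encoding: classes are sorted, then numbered 0, 1, ...
label_encoder = LabelEncoder()
encoded_label = label_encoder.fit_transform(toy["label"])
label_mapping = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))
```

After this runs, `category_matrix` has one column per category, `scaled_rating` has zero mean and unit variance, and `label_mapping` shows which integer each class received.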

## Preprocessing Steps

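Here we define the preprocessing steps necessary for the dataset. Each step is written as a custom scikit-learn transformer (a class extending `BaseEstimator` and `TransformerMixin`) so that it can be reused in the subsequent milestone. The preprocessing steps include:
1. **Text Cleaning**: Remove HTML tags, URLs, special characters, punctuation, and stopwords, and convert the text to lowercase. The package `nltk` will be used for tokenization and stopword removal. Lemmatization is also planned, but the `clean_text` method in this notebook does not yet perform it.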
2. **One-Hot Encoding**: Encode the `category` column using one-hot encoding.
3. **Standardization**: Standardize the `rating` column to ensure all features are on the same scale.
4. **Label Encoding**: Encode the `label` column to convert the target variable into numerical format.

In [None]:
# Initialization
pandarallel.initialize(progress_bar=True)

In [None]:
class TextCleanerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_copy = X.copy()
        # Using pandarallel for parallel processing
        X_copy[self.column_name] = X_copy[self.column_name].parallel_apply(
            self.clean_text
        )
        return X_copy

    @staticmethod
    def clean_text(text):
        # Load the stopword list once per review instead of once per token
        stop_words = set(stopwords.words("english"))
        text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags
        text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
        text = re.sub(r"[^a-zA-Z\s]", "", text)  # Keep only letters and whitespace
        text = text.lower()  # Convert to lowercase
        tokens = word_tokenize(text)  # Tokenization
        tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
        return " ".join(tokens)
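
To show what the regular-expression steps in `clean_text` do to a single review, here is a simplified, standard-library-only sketch. It is not the transformer itself: the sample sentence is invented, `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English list, and a plain whitespace split replaces `word_tokenize`.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
TOY_STOPWORDS = {"the", "a", "is", "at", "it", "this", "and"}


def clean_text_sketch(text):
    """Simplified version of the cleaning steps, for illustration only."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Keep only letters and whitespace
    tokens = text.lower().split()  # Whitespace split stands in for word_tokenize
    return " ".join(token for token in tokens if token not in TOY_STOPWORDS)


sample = "This blender is GREAT!!! See https://example.com/review at 5/5."
cleaned = clean_text_sketch(sample)
print(cleaned)  # blender great see
```

Note that stripping every non-letter character also removes digits, so the `5/5` in the sample disappears along with the punctuation.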

In [None]:
class OneHotEncoderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name
        self.encoder = OneHotEncoder()

    def fit(self, X, y=None):
        self.encoder.fit(X[[self.column_name]])
        return self

    def transform(self, X, y=None):
        X_copy = X.copy()
        encoded = self.encoder.transform(X[[self.column_name]]).toarray()
        for i, category in enumerate(self.encoder.categories_[0]):
            X_copy[category] = encoded[:, i]
        X_copy.drop(columns=[self.column_name], inplace=True)
        return X_copy

In [None]:
class StandardScalerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X[[self.column_name]])
        return self

    def transform(self, X, y=None):
        X_copy = X.copy()
        X_copy[self.column_name] = self.scaler.transform(X[[self.column_name]])
        return X_copy

In [None]:
class LabelEncoderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name
        self.label_encoder = LabelEncoder()

    def fit(self, X, y=None):
        self.label_encoder.fit(X[self.column_name])
        return self

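    def transform(self, X, y=None):
        X_copy = X.copy()
        # LabelEncoder numbers classes in sorted order, so the computer-generated
        # label maps to 0 and OR maps to 1; subtracting from 1 flips the encoding
        # so that fake reviews become the positive class (1).
        X_copy[self.column_name] = 1 - self.label_encoder.transform(X[self.column_name])
        return X_copy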

In [None]:
def apply_transformations(
    df,
    text_clean=True,
    one_hot_encode=True,
    standardize=True,
    encode_label=True,
    text_column=None,
    category_column=None,
    numerical_column=None,
    label_column=None,
):
    if text_clean and text_column:
        text_cleaner = TextCleanerTransformer(column_name=text_column)
        df = text_cleaner.transform(df)

    if one_hot_encode and category_column:
        one_hot_encoder = OneHotEncoderTransformer(column_name=category_column)
        one_hot_encoder.fit(df)
        df = one_hot_encoder.transform(df)

    if standardize and numerical_column:
        scaler = StandardScalerTransformer(column_name=numerical_column)
        scaler.fit(df)
        df = scaler.transform(df)

    if encode_label and label_column:
        label_encoder = LabelEncoderTransformer(column_name=label_column)
        label_encoder.fit(df)
        df = label_encoder.transform(df)

    return df
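
Because the transformers above follow the scikit-learn `fit`/`transform` interface, the same chaining could also be expressed with `sklearn.pipeline.Pipeline`. The following is a minimal sketch of that pattern, not the notebook's method: `LowercaseTransformer` and `CollapseWhitespaceTransformer` are hypothetical stand-ins that avoid the NLTK and pandarallel dependencies, and the toy rows are invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical step: lowercase one text column."""

    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self  # Stateless, nothing to learn

    def transform(self, X):
        X_copy = X.copy()
        X_copy[self.column_name] = X_copy[self.column_name].str.lower()
        return X_copy


class CollapseWhitespaceTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical step: squeeze runs of whitespace into single spaces."""

    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self  # Stateless, nothing to learn

    def transform(self, X):
        X_copy = X.copy()
        X_copy[self.column_name] = X_copy[self.column_name].str.split().str.join(" ")
        return X_copy


# Toy rows standing in for the real reviews
toy = pd.DataFrame({"text_": ["Great   PRODUCT", "  Would NOT buy "]})

pipeline = Pipeline(
    steps=[
        ("lowercase", LowercaseTransformer(column_name="text_")),
        ("whitespace", CollapseWhitespaceTransformer(column_name="text_")),
    ]
)
result = pipeline.fit_transform(toy)
```

One reason to consider this design is that a fitted pipeline can later call `transform` on held-out data, reusing the statistics learned during `fit` rather than refitting them.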

In [None]:
# Apply transformations
transformed_df = apply_transformations(
    df,
    text_column="text_",
    category_column="category",
    numerical_column="rating",
    label_column="label",
)

In [None]:
transformed_df

In [None]:
transformed_df.to_csv("../../data/processed/processed.csv", index=False)

## Executive Summary

In this milestone, our team acquired and prepared our dataset for subsequent analysis. The data was sourced from Kaggle and comprises over forty thousand product reviews divided into Original Reviews (OR) and Computer-Generated Fake Reviews (CG). Each review is accompanied by its product category, rating, label (indicating whether it is genuine or fake), and the review text itself. After loading the data into a pandas DataFrame, our first priority was to understand its structure and content.

Through examination, we confirmed that the dataset is intact, with no missing values, balanced classes, and straightforward data types. We then defined preprocessing steps to clean the text data, one-hot encode the `category` column, standardize the `rating` column, and label encode the `label` column. These steps are written as custom transformers built on `sklearn` base classes, which lets us reuse them in the subsequent milestone and fit them into a machine learning pipeline for model training and evaluation.

The next milestone will focus on exploratory data analysis (EDA) to gain insights into the data and feature engineering to create new features that may improve model performance.