<a href="https://colab.research.google.com/github/arifaygun/CustomerEye/blob/main/Trustpilot_Report_(Freedom_Dept_Relief).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Import Libraries & Dataset**

In [22]:
#!pip install transformers
#!pip install torch
!pip install pycountry



In [23]:
import re
import pandas as pd
import numpy as np
import pycountry
from datetime import datetime, timedelta
from transformers import pipeline
from google.colab import drive

In [24]:
drive.mount('/content/drive/')
%cd /content/drive/My Drive/Customereye Reports/

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/My Drive/Customereye Reports


In [25]:
df = pd.read_csv('freedom_debt_relief.csv')
df.head()

Unnamed: 0,Reviewer Name,Reviews Count,Country Code,Experience Date,Rating,Review Date,Review Title,Review Text,Reply Date,Reply Text
0,Chris,1review,US,"August 02, 2023",5,2 days ago,Extremely difficult process of paying…,Extremely difficult process of paying down deb...,Reply from Freedom Debt Relief2 days ago,"Thank you for the lovely review, Chris! It is ..."
1,Jeffrey McKinney,1review,US,"March 07, 2023",5,3 days ago,Working the plan,My debt reduction has eliminated almost 8000 d...,Reply from Freedom Debt Relief3 days ago,Thank you for giving our team the opportunity ...
2,Wayne Cordes,1review,US,"September 28, 2023",5,20 hours ago,Welcomed Relief,Everyone has been so thorough and up front wit...,,
3,Judith Kuhns,1review,US,"October 11, 2023",5,4 days ago,Freedom Debt Relief has made my life as whole ...,Freedom Debt Relief has made my life a whole l...,Reply from Freedom Debt Relief3 days ago,It is great to hear that you are happy with yo...
4,Latia Bellamy,1review,US,"September 13, 2023",5,4 days ago,Every customer service representative I…,Every customer service representative I have s...,Reply from Freedom Debt Relief3 days ago,We are happy to hear that you're pleased with ...


### **Dataset Preprocessing**

**Streamlining Data Preprocessing for Review Analysis**

The `preprocessing` function below cleans and reshapes the raw DataFrame `df` for review analysis. It applies the following steps, in this order:

1. **Numeric Conversion**: Extracts the digits from the 'Reviews Count' column (for example, `1review` becomes `1`) and converts the column to integers.

2. **Text Replacement**: Removes the prefix passed as `reply_text` (here, "Reply from Freedom Debt Relief") from the 'Reply Date' column and strips surrounding whitespace.

3. **Date Conversion**: Converts the date columns ('Experience Date', 'Review Date', 'Reply Date') to datetime with `errors='coerce'`, so values that cannot be parsed, such as the relative date "2 days ago", become `NaT`.

4. **Handling Missing Values**: Removes rows in which any of the three date columns is missing. Because of the coercion in the previous step, this also removes reviews with relative dates and reviews that have no reply.

5. **Feature Engineering**: Extracts the year from the 'Review Date' and stores it in a new 'Year' column, enabling analysis at yearly granularity.

6. **Text Concatenation**: Combines 'Review Title' and 'Review Text' into a single 'Reviews' column, consolidating review content for analysis.

7. **Column Renaming**: Renames 'Reply Text' to 'Replies' and 'Country Code' to 'Country'.

8. **Country Name Addition**: Adds a new 'Countries' column with the country name that `pycountry` returns for each two-letter code, or `None` when the code is not found.

9. **Response Time Calculation**: Computes the elapsed time in whole days from 'Experience Date' to 'Review Date' ('Exp to Review') and from 'Review Date' to 'Reply Date' ('Review to Reply').

10. **Column Dropping and Reordering**: Drops the columns that are no longer needed ('Reviewer Name', 'Reviews Count', 'Review Title', 'Review Text', and the renamed 'Country') and reorders the remaining columns for readability.

Keeping these steps in a single function means the same preparation can be applied consistently to any review export with the same columns.
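To make the effect of the text replacement, date conversion, and missing-value steps concrete, here is a minimal sketch on a three-row toy frame. The rows are made up for illustration; in particular, the reply string with an absolute date is an assumed format, since the raw rows shown earlier only contain relative reply dates.

```python
import pandas as pd

# Toy rows (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Review Date": ["October 04, 2023", "2 days ago", "October 03, 2023"],
    "Reply Date": [
        "Reply from Freedom Debt ReliefOctober 04, 2023",  # assumed format
        "Reply from Freedom Debt Relief2 days ago",
        None,  # review without a reply
    ],
})

# Strip the literal prefix, then trim whitespace
toy["Reply Date"] = (
    toy["Reply Date"]
    .str.replace("Reply from Freedom Debt Relief", "", regex=False)
    .str.strip()
)

# Unparseable values such as "2 days ago" become NaT instead of raising
for col in ["Review Date", "Reply Date"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Only rows in which every date was parsed survive
kept = toy.dropna(subset=["Review Date", "Reply Date"])
print(kept)
```

Only the first row is kept: the second has relative dates in both columns, and the third has no reply.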

In [26]:
# Function for preprocessing
def preprocessing(df, reply_text):
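    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()

    # Convert 'Reviews Count' to integers (e.g. '1review' -> 1)
    df['Reviews Count'] = df['Reviews Count'].str.extract(r'(\d+)').astype(int)

    # Remove the literal prefix from 'Reply Date' (regex=False: plain-text match)
    df['Reply Date'] = df['Reply Date'].str.replace(reply_text, '', regex=False).str.strip()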

    # Convert 'Experience Date', 'Review Date', and 'Reply Date' to datetime
    date_columns = ['Experience Date', 'Review Date', 'Reply Date']
    df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')

    # Drop rows with NaN values in 'Experience Date', 'Review Date', or 'Reply Date'
    df.dropna(subset=date_columns, inplace=True)

    # Extract 'Year' from 'Review Date' (already datetime, no missing values after dropna)
    df['Year'] = df['Review Date'].dt.year.astype(int)

    # Concatenate 'Review Title' and 'Review Text' into a new 'Reviews' column
    df['Reviews'] = df['Review Title'].astype(str) + ' ' + df['Review Text'].astype(str)

    # Rename 'Reply Text' to 'Replies' and 'Country Code' to 'Country'
    df.rename(columns={'Reply Text': 'Replies', 'Country Code': 'Country'}, inplace=True)

    # Add a new column with country names
    df['Countries'] = df['Country'].apply(lambda code: pycountry.countries.get(alpha_2=code).name if pycountry.countries.get(alpha_2=code) else None)

    # Calculate response time between 'Experience Date' and 'Review Date' in days
    df['Exp to Review'] = (df['Review Date'] - df['Experience Date']).dt.total_seconds() / 86400

    # Calculate response time between 'Review Date' and 'Reply Date' in days
    df['Review to Reply'] = (df['Reply Date'] - df['Review Date']).dt.total_seconds() / 86400

    # Round the values to the nearest integer and convert to int
    df[['Exp to Review', 'Review to Reply']] = df[['Exp to Review', 'Review to Reply']].round(0).astype(int)

    # Drop the unnecessary 'Reviewer Name', 'Reviews Count', 'Review Title' and 'Review Text','Country' columns
    df.drop(['Reviewer Name', 'Reviews Count', 'Review Title', 'Review Text','Country'], axis=1, inplace=True)

    # Rearrange the columns
    df = df[['Year', 'Experience Date', 'Review Date', 'Reply Date','Exp to Review',
             'Review to Reply','Rating', 'Countries', 'Reviews', 'Replies']]

    return df


# Preprocess the raw reviews, stripping the reply prefix from 'Reply Date'
df1 = preprocessing(df, 'Reply from Freedom Debt Relief')

# Print the updated DataFrame information
pd.set_option('display.max_columns', None)

df1.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 36713 entries, 17 to 39076
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Year             36713 non-null  int64         
 1   Experience Date  36713 non-null  datetime64[ns]
 2   Review Date      36713 non-null  datetime64[ns]
 3   Reply Date       36713 non-null  datetime64[ns]
 4   Exp to Review    36713 non-null  int64         
 5   Review to Reply  36713 non-null  int64         
 6   Rating           36713 non-null  int64         
 7   Countries        36713 non-null  object        
 8   Reviews          36713 non-null  object        
 9   Replies          36713 non-null  object        
dtypes: datetime64[ns](3), int64(4), object(3)
memory usage: 3.1+ MB


In [27]:
df1.head()

Unnamed: 0,Year,Experience Date,Review Date,Reply Date,Exp to Review,Review to Reply,Rating,Countries,Reviews,Replies
17,2023,2023-09-29,2023-10-04,2023-10-04,5,0,5,United States,Negotiations and payments are done in a… Negot...,"We love that, Deborah! We are happy to hear th..."
23,2023,2023-04-22,2023-10-03,2023-10-04,164,1,5,United States,Stress Relief! The Freedom Debt Relief team me...,"Thank you for the lovely review, Beckie! We ar..."
28,2023,2023-02-14,2023-10-02,2023-10-03,230,1,5,United States,For anyone who has a lot of monthly… For anyon...,"Thank you for the lovely review, Gino! It is g..."
33,2023,2023-01-15,2023-10-03,2023-10-03,261,0,5,United States,FDR is the answer to my struggles I was strugg...,"That's amazing, Patricia! We are so happy for ..."
36,2023,2023-10-04,2023-10-04,2023-10-04,0,0,5,United States,Gliding through a change Watching my depth jou...,"You are welcome, Candi! We are extremely happy..."
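
The two elapsed-day columns can be checked by hand against the rows above. The sketch below recomputes them for rows 23 and 28 using `.dt.days`, an alternative to dividing `total_seconds()` by 86400 that gives the same result here because the dates carry no time-of-day component. The toy frame simply copies the dates from the displayed output.

```python
import pandas as pd

# Dates copied from rows 23 and 28 of the output above
rows = pd.DataFrame({
    "Experience Date": pd.to_datetime(["2023-04-22", "2023-02-14"]),
    "Review Date": pd.to_datetime(["2023-10-03", "2023-10-02"]),
    "Reply Date": pd.to_datetime(["2023-10-04", "2023-10-03"]),
})

# Subtracting datetime columns yields timedeltas; .dt.days gives whole days
rows["Exp to Review"] = (rows["Review Date"] - rows["Experience Date"]).dt.days
rows["Review to Reply"] = (rows["Reply Date"] - rows["Review Date"]).dt.days

print(rows[["Exp to Review", "Review to Reply"]])
```

The results, 164 and 230 days to review and 1 day to reply in both cases, match the values shown in the table.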


### **Data Sampling**

**Implementing Random Sampling for Data Preprocessing**

The code below defines a function named `perform_sampling` that performs random sampling on a DataFrame `df`. It filters the data to the specified years (2020, 2021, 2022, 2023), counts the occurrences of each rating per year (the counts are computed but not used by the sampling itself), and then draws 1,000 rows per year with replacement. Finally, it concatenates the yearly samples and returns a single DataFrame of 4,000 rows.

The function is applied to `df1` using `perform_sampling(df1)`, and the result (`sampled_df1`) is then reduced to exactly 1,000 rows by a second round of sampling with replacement. Because both rounds sample with replacement, the same review can appear more than once. Finally, `info()` prints the column data types and memory usage of the sampled DataFrame.
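If repeated reviews are not wanted, one alternative (not what this notebook does) is to sample without replacement, cap each request at the number of rows available for that year, and fix a seed so the draw is reproducible. The sketch below shows this on toy data; `sample_per_year`, `n_per_year`, and `seed` are hypothetical names introduced for illustration.

```python
import pandas as pd

def sample_per_year(df, years, n_per_year, seed=0):
    """Draw up to n_per_year distinct rows for each year in `years`."""
    parts = []
    for year in years:
        year_data = df[df["Year"] == year]
        # Cap the request at the group size so sampling without replacement cannot fail
        n = min(n_per_year, len(year_data))
        parts.append(year_data.sample(n=n, replace=False, random_state=seed))
    return pd.concat(parts)

# Toy data (made up): 5 reviews in 2022 and 2 in 2023
toy = pd.DataFrame({
    "Year": [2022] * 5 + [2023] * 2,
    "Rating": [5, 4, 5, 1, 3, 5, 2],
})

sampled = sample_per_year(toy, years=[2022, 2023], n_per_year=3)
print(sampled["Year"].value_counts().sort_index())
```

With this approach a year that has fewer rows than requested contributes all of its rows once, so the yearly samples can differ in size.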

In [None]:
# Function for random sampling
def perform_sampling(df):
    years_to_keep = [2020,2021, 2022, 2023]

    # Filter data for the specified years
    filtered_df = df[df['Review Date'].dt.year.isin(years_to_keep)]

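    # Count the occurrences of each rating for each year
    # (computed for inspection only; the sampling below does not use it)
    yearly_rating_counts = filtered_df.groupby(['Year', 'Rating']).size().unstack(fill_value=0)

    # Draw 1000 rows per year, with replacement, so a review can appear more than once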
    sampled_df = pd.DataFrame()
    for year in years_to_keep:
        year_data = filtered_df[filtered_df['Review Date'].dt.year == year]
        sampled_data = year_data.sample(n=1000, replace=True)  # Adjust the sampling size as needed
        sampled_df = pd.concat([sampled_df, sampled_data])

    return sampled_df

# Apply sampling for each dataset
sampled_df1 = perform_sampling(df1)

# Downsample the combined 4000 rows to 1000, again with replacement
sampled_df1 = sampled_df1.sample(n=1000, replace=True)

# info() prints its summary directly and returns None, so it is not wrapped in print()
sampled_df1.info()

### **Apply Sentiment Analysis to Reviews**

**Enhancing Sentiment Analysis with Transformer Pipelines**

Model: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

Sentiment analysis estimates the emotional tone of a text. The `pipeline` helper from the Transformers library wraps a pre-trained model, its tokenizer, and the post-processing behind a single call, so a text classifier can be used without any fine-tuning.

The code below loads the pre-trained `nlptown/bert-base-multilingual-uncased-sentiment` model through a `text-classification` pipeline. It then iterates over the rows of the DataFrame, cuts each review to `max_position_embeddings - 2` characters, classifies the shortened text, and converts the returned label (for example, `5 stars`) to an integer that is stored in a new 'Sentiment' column. Note that the cut is made on characters rather than tokens, so it is a conservative approximation of the model's token limit and will typically discard text that would still have fit.
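The label handling can be separated from the model call. The sketch below assumes labels of the form `1 star` to `5 stars`, which is the format this model returns, and uses a hypothetical stand-in classifier (`fake_classifier`) so that it runs without downloading the model; in the notebook, the `pipe` object would take its place. `label_to_stars` and `score_reviews` are illustrative names, not part of the Transformers library. Passing `truncation=True` asks the pipeline's tokenizer to cut inputs at the model's token limit instead of slicing characters by hand.

```python
import re

def label_to_stars(label):
    """Convert a label such as '4 stars' to the integer 4."""
    match = re.fullmatch(r"([1-5]) stars?", label)
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return int(match.group(1))

def score_reviews(texts, classifier):
    """Classify a list of texts in one call and return integer star ratings."""
    # A text-classification pipeline accepts a list of strings and
    # returns one {'label': ..., 'score': ...} dict per input
    results = classifier(list(texts), truncation=True)
    return [label_to_stars(result["label"]) for result in results]

def fake_classifier(texts, **kwargs):
    """Hypothetical stand-in for the real pipeline, for illustration only."""
    return [
        {"label": "5 stars" if "great" in text.lower() else "1 star", "score": 1.0}
        for text in texts
    ]

print(score_reviews(["Great service", "Terrible wait times"], fake_classifier))
```

Under these assumptions, the row-by-row loop could be replaced by a single assignment such as `sampled_df1['Sentiment'] = score_reviews(sampled_df1['Reviews'], pipe)`.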

In [None]:
from transformers import pipeline

# Load the pipeline for text classification
pipe = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

# Create a new column to store the sentiment analysis result
sampled_df1['Sentiment'] = ""

# Define the maximum sequence length supported by the model
max_seq_length = pipe.model.config.max_position_embeddings

# Function to extract numeric sentiment from label
def extract_numeric_sentiment(label):
    return int(label.split()[0])  # Extracting numeric value and converting to integer

# Iterate over each row in the DataFrame
for index, row in sampled_df1.iterrows():
    review_text = row['Reviews']

    # Pre-trim by characters (this slice counts characters, not tokens);
    # truncation=True lets the tokenizer enforce the model's actual token limit
    truncated_review_text = review_text[:max_seq_length - 2]
    sentiment_label = pipe(truncated_review_text, truncation=True)[0]['label']
    numeric_sentiment = extract_numeric_sentiment(sentiment_label)
    sampled_df1.at[index, 'Sentiment'] = numeric_sentiment

# Convert Sentiment column dtype to integer
sampled_df1['Sentiment'] = sampled_df1['Sentiment'].astype(int)

# Display the updated DataFrame
sampled_df1.head(500)


In [None]:
# Load the pipeline for text classification
pipe = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

# Create a new column to store the sentiment analysis result
df1['Sentiment'] = ""

# Define the maximum sequence length supported by the model
max_seq_length = pipe.model.config.max_position_embeddings

# Function to extract numeric sentiment from label
def extract_numeric_sentiment(label):
    return int(label.split()[0])  # Extracting numeric value and converting to integer

# Iterate over each row in the DataFrame
for index, row in df1.iterrows():
    review_text = row['Reviews']

    # Pre-trim by characters (this slice counts characters, not tokens);
    # truncation=True lets the tokenizer enforce the model's actual token limit
    truncated_review_text = review_text[:max_seq_length - 2]
    sentiment_label = pipe(truncated_review_text, truncation=True)[0]['label']
    numeric_sentiment = extract_numeric_sentiment(sentiment_label)
    df1.at[index, 'Sentiment'] = numeric_sentiment

# Convert Sentiment column dtype to integer
df1['Sentiment'] = df1['Sentiment'].astype(int)

# Display the updated DataFrame
df1.head(1000)


In [None]:
# Get current date and time
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Update the CSV path with current date and time suffix
output_csv_path = f"output_sentiment_analysis_{now}.csv"

# Save the DataFrame to a CSV file
df1.to_csv(output_csv_path, index=False)

# Report where the file was written
print("Output DataFrame saved to:", output_csv_path)
