# **Order Reviews Data Transformation: Local CSV to Processed Data**

#### **Import necessary libraries**

In [29]:
import pandas as pd
import os

from IPython.display import display

#### **Paths to files**

In [30]:
# Project root: the parent of the current working directory
base_path = os.path.dirname(os.getcwd())

raw_data_path = os.path.join(base_path, 'raw-data', 'olist_order_reviews_dataset.csv')
processed_data_path = os.path.join(base_path, 'processed-data', 'olist_order_reviews_dataset_transformed.parquet')

#### **Load the raw dataset**

In [35]:
print("Loading the dataset...")
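df_raw = pd.read_csv(
    raw_data_path,
    quotechar='"',       # free-text fields are wrapped in double quotes
    escapechar='\\',     # a backslash escapes the character that follows it
    engine="python",     # the Python parser is slower but more feature-complete than the C parser
    on_bad_lines="skip"  # rows with too many fields are dropped without raising or warning
)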
print("Dataset loaded successfully!")

# Display the non-null count for each column
print("Non-null counts per column:")
display(df_raw.count())


Loading the dataset...
Dataset loaded successfully!
Non-null counts per column:


review_id                  99222
order_id                   99222
review_score               99222
review_comment_title       11568
review_comment_message     40975
review_creation_date       99222
review_answer_timestamp    99222
dtype: int64
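
Because `on_bad_lines="skip"` drops malformed rows silently, it can be useful to know how many rows were lost. The sketch below is one possible way to record them, assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine="python"`. The in-memory CSV, `skipped_lines`, and `record_bad_line` are hypothetical names used only for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV: the second data row has two extra fields.
sample_csv = io.StringIO(
    "review_id,order_id,review_score\n"
    "r1,o1,5\n"
    "r2,o2,4,unexpected,extra\n"
    "r3,o3,3\n"
)

skipped_lines = []


def record_bad_line(bad_line):
    """Store the malformed row (a list of strings) and tell pandas to skip it."""
    skipped_lines.append(bad_line)
    return None  # returning None drops the row


df_sample = pd.read_csv(
    sample_csv,
    engine="python",               # callables are only supported by the Python engine
    on_bad_lines=record_bad_line,
)

print(f"Rows loaded: {len(df_sample)}, rows skipped: {len(skipped_lines)}")
```

The same callable could be passed to the `pd.read_csv` call above in place of `"skip"`, which would make the number of discarded rows visible instead of implicit.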

#### **Check for null values and dataset information**

In [36]:
print("\nRaw dataset information:")
df_raw.info()

print("\nCount of null values per column:")
print(df_raw.isnull().sum())


Raw dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99222 entries, 0 to 99221
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99222 non-null  object
 1   order_id                 99222 non-null  object
 2   review_score             99222 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40975 non-null  object
 5   review_creation_date     99222 non-null  object
 6   review_answer_timestamp  99222 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB

Count of null values per column:
review_id                      0
order_id                       0
review_score                   0
review_comment_title       87654
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64
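
The raw counts are easier to compare as a share of all rows: in the output above, `review_comment_title` is missing in roughly 88% of rows (87654 of 99222) and `review_comment_message` in roughly 59% (58247 of 99222). A minimal sketch of that calculation follows, using a small hypothetical frame (`df_demo`) in place of `df_raw`.

```python
import pandas as pd

# Toy frame standing in for df_raw (hypothetical values).
df_demo = pd.DataFrame({
    "review_score": [5, 4, 1, 3],
    "review_comment_title": [None, "Great", None, None],
    "review_comment_message": ["ok", None, None, "late"],
})

# isnull() yields booleans; their mean is the fraction of missing values.
null_share = (df_demo.isnull().mean() * 100).round(2)
print(null_share)
```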


#### **Apply transformations to clean the dataset**

In [40]:
print("\nTransforming and cleaning the data...")

# Parse the two timestamp columns; values that do not match the format become NaT
for column in ["review_creation_date", "review_answer_timestamp"]:
    df_raw[column] = pd.to_datetime(df_raw[column], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Keep only the first occurrence of each review_id
df_cleaned = df_raw.drop_duplicates(subset=["review_id"])

# Display a sample of the cleaned dataset
print("Sample of the cleaned dataset:")
display(df_cleaned.head())

# Count the total rows in the cleaned dataset
cleaned_row_count = df_cleaned.shape[0]
print(f"Total rows in the cleaned dataset: {cleaned_row_count}")


Transforming and cleaning the data...
Sample of the cleaned dataset:


|   | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | | 2018-01-18 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | | 2018-03-10 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | | 2018-02-17 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | | Recebi bem antes do prazo estipulado. | 2017-04-21 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 | 2018-03-02 10:26:53 |


Total rows in the cleaned dataset: 98408
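
The run above goes from 99222 rows to 98408, so deduplicating on `review_id` removed 814 rows. Two quick sanity checks can make this step more transparent: counting values that `errors="coerce"` turned into `NaT`, and counting duplicated `review_id` values before they are dropped. The sketch below illustrates both on a small hypothetical frame (`df_demo`); it is an illustration, not the notebook's own code.

```python
import pandas as pd

# Toy frame standing in for df_raw (hypothetical values).
df_demo = pd.DataFrame({
    "review_id": ["a", "b", "a", "c"],
    "review_answer_timestamp": [
        "2018-01-18 21:46:59",
        "2018-03-11 03:05:13",
        "2018-01-18 21:46:59",
        "not a timestamp",
    ],
})

# Values that do not match the format are coerced to NaT instead of raising.
df_demo["review_answer_timestamp"] = pd.to_datetime(
    df_demo["review_answer_timestamp"],
    format="%Y-%m-%d %H:%M:%S",
    errors="coerce",
)
unparsed_count = int(df_demo["review_answer_timestamp"].isna().sum())

# duplicated() flags every repeat after the first occurrence of a review_id.
duplicate_count = int(df_demo.duplicated(subset=["review_id"]).sum())
df_demo_cleaned = df_demo.drop_duplicates(subset=["review_id"])
rows_removed = len(df_demo) - len(df_demo_cleaned)

print(f"Unparsed timestamps: {unparsed_count}")
print(f"Duplicated review_id rows: {duplicate_count}, rows removed: {rows_removed}")
```

Note that `drop_duplicates` keeps the first occurrence by default, so which copy of a repeated `review_id` survives depends on row order.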


#### **Save the cleaned dataset**

In [38]:
print("\nSaving the cleaned dataset...")
df_cleaned.to_parquet(processed_data_path, index=False, engine='pyarrow')
print(f"Cleaned dataset saved successfully at: {processed_data_path}")


Saving the cleaned dataset...
Cleaned dataset saved successfully at: c:\Users\Fernando Correia\Desktop\Olist Ecommerce Tese\processed-data\olist_order_reviews_dataset_transformed.parquet
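
The save step assumes that the `processed-data` folder already exists; if it does not, writing the file fails. A minimal sketch of a guard is shown below. The helper name `ensure_parent_dir` and the temporary demo path are hypothetical and only serve the illustration.

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it does not exist yet."""
    parent_dir = os.path.dirname(file_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)
    return parent_dir


# Demonstration in a throwaway temporary directory (hypothetical layout).
demo_root = tempfile.mkdtemp()
demo_path = os.path.join(demo_root, "processed-data", "reviews.parquet")
created_dir = ensure_parent_dir(demo_path)

print(os.path.isdir(created_dir))
```

In the notebook, the helper could be called as `ensure_parent_dir(processed_data_path)` just before `df_cleaned.to_parquet(...)`.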
