# Amazon-Product-Recommendation-System: Data Cleaning

In [3]:
import pandas as pd
import numpy as np
import os

data = pd.read_csv(r"C:\Users\giris\Documents\datasets for practice\Reviews.csv")
data.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


**Checking Null Values**

In [4]:
data.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               26
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

The dataset has roughly 568,000 rows, so dropping the rows with a missing ProfileName (26) or Summary (27), at most 53 in total, has a negligible effect on statistics or modeling. Dropping them is also cleaner than filling in placeholder text, so we drop them.

In [6]:
data = data.dropna(subset=['ProfileName', 'Summary'])
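
The per-column null counts and the number of rows that `dropna` removes only agree when no row is missing both fields. A minimal sketch of how to check this, using a made-up toy frame (the values and the `toy` name are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy frame standing in for the reviews data (values are made up).
toy = pd.DataFrame({
    "ProfileName": ["ann", None, "cy", None],
    "Summary": ["good", "bad", None, None],
    "Text": ["a", "b", "c", "d"],
})

cols = ["ProfileName", "Summary"]
per_column_nulls = toy[cols].isna().sum()                 # nulls counted per column
rows_with_any_null = toy[cols].isna().any(axis=1).sum()   # rows dropna would remove

cleaned = toy.dropna(subset=cols)
print(per_column_nulls.sum(), rows_with_any_null, len(toy) - len(cleaned))
```

Here the per-column counts sum to 4 while only 3 rows are dropped, because the last row is missing both fields.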

In [9]:
print("Shape after removing nulls:", data.shape)
print("Missing values after cleaning:", data.isna().sum())

Shape after removing nulls: (568401, 10)
Missing values after cleaning: Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64


**Feature Selection**

In [10]:
print(data.columns.tolist())

['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text']


We remove unnecessary columns and keep only the ones directly useful for building a recommendation system. This keeps the dataset lightweight while retaining the key information about user ratings and review content.

In [11]:
data = data[['UserId', 'ProductId', 'Score', 'Time', 'Summary', 'Text']]
print("Shape after selecting relevant columns:", data.shape )
print(data.head())

Shape after selecting relevant columns: (568401, 6)
           UserId   ProductId  Score        Time                Summary  \
0  A3SGXH7AUHU8GW  B001E4KFG0      5  1303862400  Good Quality Dog Food   
1  A1D87F6ZCVE5NK  B00813GRG4      1  1346976000      Not as Advertised   
2   ABXLMWJIXXAIN  B000LQOCH0      4  1219017600  "Delight" says it all   
3  A395BORC6FGVXV  B000UA0QIQ      2  1307923200         Cough Medicine   
4  A1UQRSCLF8GW1T  B006K2ZZ7K      5  1350777600            Great taffy   

                                                Text  
0  I have bought several of the Vitality canned d...  
1  Product arrived labeled as Jumbo Salted Peanut...  
2  This is a confection that has been around a fe...  
3  If you are looking for the secret ingredient i...  
4  Great taffy at a great price.  There was a wid...  
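
Later cells assign new values into columns of this selected frame, which can trigger a `SettingWithCopyWarning` depending on the pandas version. One common way to make the intent explicit is to take a `.copy()` right after selecting columns. The sketch below uses a hypothetical shortened column list and toy values:

```python
import pandas as pd

raw = pd.DataFrame({
    "Id": [1, 2],
    "UserId": ["u1", "u2"],
    "ProductId": ["p1", "p2"],
    "Score": [5, 1],
})

keep = ["UserId", "ProductId", "Score"]   # hypothetical shortened column list
subset = raw[keep].copy()                  # independent copy, safe to modify

subset["Score"] = subset["Score"] * 2      # changes `subset` only, not `raw`
print(raw["Score"].tolist(), subset["Score"].tolist())
```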


**Checking Duplicates**

In [12]:
data.duplicated().sum()

np.int64(831)

We remove the 831 exact duplicate rows so that the same user-product review is not counted more than once and the data stays clean.

In [13]:
data = data.drop_duplicates()
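
Note that `drop_duplicates()` with no arguments removes only rows that match in every column. A user who reviewed the same product twice with different text or scores still appears more than once. The sketch below, on a made-up toy frame, shows how the two notions of "duplicate" can be compared:

```python
import pandas as pd

toy = pd.DataFrame({
    "UserId":    ["u1", "u1", "u1", "u2"],
    "ProductId": ["p1", "p1", "p1", "p1"],
    "Score":     [5, 5, 4, 3],
    "Text":      ["great", "great", "changed my mind", "ok"],
})

exact_dupes = toy.duplicated().sum()                               # identical rows
pair_dupes = toy.duplicated(subset=["UserId", "ProductId"]).sum()  # repeated user-product pairs

deduped = toy.drop_duplicates()
still_repeated = deduped.duplicated(subset=["UserId", "ProductId"]).sum()
print(exact_dupes, pair_dupes, still_repeated)
```

If strictly one row per user-product pair is needed downstream, passing `subset=["UserId", "ProductId"]` to `drop_duplicates` is one option, at the cost of choosing which review to keep.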

**Feature Engineering and Cleaning Data in Text and Summary Column**

We filtered the dataset to keep only valid ratings between 1 and 5, ensuring consistency with Amazon’s rating scale.

In [14]:

data = data[data['Score'].between(1, 5)]
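
As a small illustration of what this filter does, `Series.between` is inclusive on both ends by default and treats missing values as out of range. The scores below are made up:

```python
import pandas as pd

scores = pd.Series([0, 1, 3, 5, 6, None])   # made-up values, including a missing one

mask = scores.between(1, 5)                  # inclusive on both ends by default
print(mask.tolist())
print("rows dropped:", int((~mask).sum()))
```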

We converted the Time column from UNIX timestamp to readable datetime format for easier temporal analysis.

In [15]:
data['Time'] = pd.to_datetime(data['Time'], unit='s')
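
To illustrate the conversion, the sketch below applies it to the first two Time values from the preview above and then pulls out the year through the `.dt` accessor, which is the kind of temporal feature the datetime format makes easy:

```python
import pandas as pd

ts = pd.Series([1303862400, 1346976000])     # first two Time values from the preview
dates = pd.to_datetime(ts, unit="s")         # seconds since the Unix epoch

print(dates.dt.strftime("%Y-%m-%d").tolist())
print(dates.dt.year.tolist())
```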

We cleaned the Summary and Text columns by stripping leading and trailing whitespace and converting all text to lowercase. This standardization makes later text processing more consistent.

In [16]:

data['Summary'] = data['Summary'].str.strip().str.lower()
data['Text'] = data['Text'].str.strip().str.lower()
data.head()

| | UserId | ProductId | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|
| 0 | A3SGXH7AUHU8GW | B001E4KFG0 | 5 | 2011-04-27 | good quality dog food | i have bought several of the vitality canned d... |
| 1 | A1D87F6ZCVE5NK | B00813GRG4 | 1 | 2012-09-07 | not as advertised | product arrived labeled as jumbo salted peanut... |
| 2 | ABXLMWJIXXAIN | B000LQOCH0 | 4 | 2008-08-18 | "delight" says it all | this is a confection that has been around a fe... |
| 3 | A395BORC6FGVXV | B000UA0QIQ | 2 | 2011-06-13 | cough medicine | if you are looking for the secret ingredient i... |
| 4 | A1UQRSCLF8GW1T | B006K2ZZ7K | 5 | 2012-10-21 | great taffy | great taffy at a great price. there was a wid... |



Next, we clean the Summary and Text columns by removing unwanted characters such as double quotes (") and HTML line-break tags (`<br />`). This leaves the text uniform and free of formatting residue, making it easier to read and analyze.

In [18]:
data['Summary'] = data['Summary'].str.replace('"', '', regex=False)
data['Summary'] = data['Summary'].str.replace('<br />', '', regex=False)
data['Summary'] = data['Summary'].str.strip().str.lower()

In [19]:
data['Text'] = data['Text'].str.replace('"', '', regex=False)
data['Text'] = data['Text'].str.replace('<br />', '', regex=False)
data['Text'] = data['Text'].str.strip().str.lower()
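
One side effect worth knowing: replacing `<br />` with an empty string glues together the words on either side of the tag. A possible alternative, sketched here on a made-up review, is to replace the tag with a space and then collapse repeated whitespace:

```python
import pandas as pd

s = pd.Series(['i love it.<br /><br />would "buy" again'])  # made-up review text

# Replacing the tag with '' joins the neighbouring words.
joined = s.str.replace('<br />', '', regex=False)

# Replacing with a space, then collapsing whitespace, keeps words apart.
spaced = (
    s.str.replace('<br />', ' ', regex=False)
     .str.replace('"', '', regex=False)
     .str.replace(r'\s+', ' ', regex=True)
     .str.strip()
)
print(joined[0])
print(spaced[0])
```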


We remove reviews longer than 2,000 characters to eliminate excessively long and potentially noisy entries. This keeps the more concise reviews and makes the dataset easier to analyze.

In [21]:
data['text_length'] = data['Text'].str.len()   # review length in characters
data = data[data['text_length'] <= 2000]       # keep reviews of at most 2,000 characters
data = data.drop(columns=['text_length'])      # helper column is no longer needed
data.head()

| | UserId | ProductId | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|
| 0 | A3SGXH7AUHU8GW | B001E4KFG0 | 5 | 2011-04-27 | good quality dog food | i have bought several of the vitality canned d... |
| 1 | A1D87F6ZCVE5NK | B00813GRG4 | 1 | 2012-09-07 | not as advertised | product arrived labeled as jumbo salted peanut... |
| 2 | ABXLMWJIXXAIN | B000LQOCH0 | 4 | 2008-08-18 | delight says it all | this is a confection that has been around a fe... |
| 3 | A395BORC6FGVXV | B000UA0QIQ | 2 | 2011-06-13 | cough medicine | if you are looking for the secret ingredient i... |
| 4 | A1UQRSCLF8GW1T | B006K2ZZ7K | 5 | 2012-10-21 | great taffy | great taffy at a great price. there was a wid... |
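
Before settling on a cutoff, it can help to look at the length distribution and at how many rows the cutoff would remove. The sketch below does this on synthetic strings; the `max_len` name is an assumption for illustration, and 2000 is the threshold used in this notebook:

```python
import pandas as pd

# Synthetic texts of known lengths: 5, 1500, 2000, 2001 and 5000 characters.
texts = pd.Series(["short", "a" * 1500, "b" * 2000, "c" * 2001, "d" * 5000])
lengths = texts.str.len()

max_len = 2000                                   # cutoff used in this notebook
share_removed = (lengths > max_len).mean()       # fraction of rows over the cutoff
kept = texts[lengths <= max_len]

print(lengths.describe())
print(f"removed {share_removed:.0%}, kept {len(kept)} rows")
```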


In [22]:

import re

def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)            # remove HTML tags
    text = re.sub(r'http\S+', ' ', text)          # remove URLs
    text = re.sub(r'[^a-zA-Z.,!?\'\s]', ' ', text) # keep letters and basic punctuation
    text = re.sub(r'\s+', ' ', text)              # normalize multiple spaces
    return text.strip().lower()


# Apply to both columns
data['Text'] = data['Text'].apply(clean_text)
data['Summary'] = data['Summary'].apply(clean_text)

# Text only: strip the punctuation that clean_text kept
data['Text'] = data['Text'].str.replace(r'[^\w\s]', '', regex=True)
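
To see what these steps do to a single string, the sketch below repeats the `clean_text` logic on a made-up sample and then applies the same punctuation-stripping pattern used for the Text column. The sample string and the final whitespace collapse are additions for illustration, not part of the notebook:

```python
import re

def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)                # remove HTML tags
    text = re.sub(r'http\S+', ' ', text)              # remove URLs
    text = re.sub(r"[^a-zA-Z.,!?'\s]", ' ', text)     # keep letters and basic punctuation
    text = re.sub(r'\s+', ' ', text)                  # normalize multiple spaces
    return text.strip().lower()

sample = "Great <b>taffy</b>! See http://example.com 100% yum"   # made-up review
cleaned = clean_text(sample)

# Same pattern as the extra step applied to the Text column.
no_punct = re.sub(r'[^\w\s]', '', cleaned)

# Optional extra step: removing punctuation can leave double spaces behind.
collapsed = re.sub(r'\s+', ' ', no_punct).strip()

print(repr(cleaned))
print(repr(no_punct))
print(repr(collapsed))
```

Two things stand out: digits are removed along with symbols, and stripping punctuation after whitespace has been normalized can reintroduce double spaces, so a final collapse may be worth adding. `clean_text` also assumes every value is a string, which holds here because rows with missing values were dropped earlier.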

Finally, we save the cleaned dataset.


In [28]:
cleaned_path = r"C:\Users\giris\Documents\datasets for practice\cleaned_amazon_reviews.csv"
data.to_csv(cleaned_path, index=False)

print(f"Cleaned dataset saved successfully at: {cleaned_path}")

Cleaned dataset saved successfully at: C:\Users\giris\Documents\datasets for practice\cleaned_amazon_reviews.csv
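
When the saved file is loaded again, two details are easy to miss: the Time column comes back as plain text unless it is parsed, and any Summary or Text value that cleaning reduced to an empty string is read back as missing. The sketch below demonstrates both with an in-memory buffer and toy rows instead of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "UserId": ["u1", "u2"],
    "Time": pd.to_datetime([1303862400, 1346976000], unit="s"),
    "Summary": ["great taffy", ""],          # second summary was cleaned down to nothing
})

buf = io.StringIO()                           # stands in for the CSV file on disk
toy.to_csv(buf, index=False)
buf.seek(0)

restored = pd.read_csv(buf, parse_dates=["Time"])   # parse Time back into datetimes
print(restored.dtypes)
print("empty summaries read back as NaN:", int(restored["Summary"].isna().sum()))
```

A quick `isna().sum()` after reloading is a cheap way to catch such rows before modeling.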


We started by loading the Amazon reviews dataset and dropping the few rows with missing values. We then removed columns that are not needed, such as Id and ProfileName, and dropped exact duplicate rows. We also made sure the ratings were between 1 and 5 and converted the Time column into a readable datetime format.

Next, we cleaned the Summary and Text columns by removing symbols, extra spaces, and HTML tags, converted everything to lowercase for consistency, and dropped reviews longer than 2,000 characters. Finally, we saved the cleaned dataset, which is now ready for further analysis.