## Anna Kaniowska - Cellphones & Accessories dataset analysis

The goal of this project is to extract as much information as possible from the dataset available at http://snap.stanford.edu/data/amazon/Cell_Phones_&_Accessories.txt.gz (source: http://snap.stanford.edu/data/web-Amazon-links.html).

In [None]:
# All imports needed to perform the analysis
import gzip
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re
import spacy
from sklearn.ensemble import RandomForestClassifier

#### Loading the dataset

In [None]:
# A modified version of the function available on the source page
def parse(filename):
    """
    Parses a gzipped text file and returns a list of dictionaries containing the parsed data.

    Parameters:
    filename (str): The path to the gzipped text file to be parsed.

    Returns:
    list: A list of dictionaries containing the parsed data.
    """
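    entry = {}
    data = []
    # The context manager closes the file handle once parsing is done
    with gzip.open(filename, 'rb') as f:
        for line in f:
            l = line.decode('utf-8').strip()
            colonPos = l.find(':')
            if colonPos == -1:
                # A line without a colon (the blank separator) ends the current record
                data.append(entry)
                entry = {}
                continue
            eName = l[:colonPos]
            rest = l[colonPos+2:]
            entry[eName] = rest
    # Append whatever is left after the last separator (this may be an empty entry)
    data.append(entry)
    return data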

# Loading the dataset
data = parse('Cell_Phones_&_Accessories.txt.gz')
df = pd.DataFrame(data)
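
The `parse` function above appends the current entry whenever it meets a line without a colon, and once more at the end of the file, so a trailing blank line or two consecutive separators would produce an empty record. The sketch below shows a hypothetical variant, `parse_records` (not part of the original notebook), that skips empty records. The sample records are made up and only mimic the "key: value" layout of the source file.

```python
import gzip
import os
import tempfile

def parse_records(filename):
    """Parse 'key: value' lines from a gzipped file; blank lines separate records."""
    records, entry = [], {}
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if entry:              # skip records that contain no fields
                    records.append(entry)
                entry = {}
                continue
            key, sep, value = line.partition(":")
            if sep:                    # non-blank lines without a colon are ignored here
                entry[key] = value.strip()
    if entry:
        records.append(entry)
    return records

# Two made-up records followed by a trailing blank line
sample = (
    "product/productId: B000TEST01\n"
    "review/score: 5.0\n"
    "\n"
    "product/productId: B000TEST02\n"
    "review/score: 1.0\n"
    "\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(sample)
    records = parse_records(path)

print(records)
```

With this variant no all-empty row reaches the DataFrame, at the cost of silently ignoring malformed lines, which may or may not be desirable.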


#### Getting to know the dataset

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print(f"The shape of the dataset is: {df.shape}")

A first look at the data shows that it contains customer reviews of products, mainly cellphones and their accessories. The dataset is fairly large: almost 79 000 rows. Its 10 columns describe the reviewed product, the customer, and the customer's opinion of the product. On the technical side, 'unknown' values need to be changed to NaN to prepare the data for further analysis. The dataset should also be checked for duplicated rows, and any that exist should be dropped, because duplicates cannot be spotted at first glance. Finally, some column types are not correct (e.g. product/price should be stored as float) and need to be fixed.

#### Checking for duplicated rows

In [None]:
print(f"There are {df.duplicated().sum()} duplicated rows in the dataset")

In [None]:
df.drop_duplicates(inplace=True)

#### Handling missing values

In [None]:
df.replace("unknown", np.nan, inplace=True)

# Analyzing the occurrences of missing values
print("Missing values occurrences:")
print(df.isna().sum())

# Checking anonymous reviews (those where userId and profileName are missing)
anon_reviews_perc = df['review/userId'].isna().sum()/df.shape[0] * 100
print(f"{anon_reviews_perc:.2f}% of the reviews are anonymously submitted.")


The first conclusion is that several columns have exactly one missing value, and it is very likely the same row in each of them, so that row can be safely deleted. \
The second conclusion concerns anonymous reviews. When missing values make up less than 5% of a given feature, the affected rows can usually be deleted without much impact on further analysis. For now, the data will be divided into two sets: the first with anonymous reviews and the second with named reviews. This may be useful for further analysis.
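
The 5% rule of thumb mentioned above can be checked per column. The following is a minimal sketch on a made-up frame; the column values and the `THRESHOLD` name are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the reviews data (100 rows, values are made up)
toy = pd.DataFrame({
    "review/userId": ["A1", np.nan, "A3", "A4"] * 25,   # 25% missing
    "review/score": [5.0] * 99 + [np.nan],               # 1% missing
    "product/price": [np.nan, 9.99] * 50,                # 50% missing
})

THRESHOLD = 5.0  # percent; the rule of thumb quoted above

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Columns with some, but less than THRESHOLD percent, missing values
droppable = missing_pct[(missing_pct > 0) & (missing_pct < THRESHOLD)].index.tolist()

print(missing_pct.round(2))
print("Columns whose missing rows could be dropped:", droppable)
```

On the real data the same two lines could be applied to `df` right after the `replace("unknown", np.nan)` step.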

In [None]:
# Extracting columns with 1 missing value
cols = ['product/productId', 'product/title', 'review/helpfulness', 'review/score', 'review/time', \
        'review/summary', 'review/text']
df = df.dropna(subset=cols)
print(f"Shape of the dataset after dropping NaNs: {df.shape}")

As expected, only one row of the data was deleted.

Before dividing the dataset into two separate datasets, it is useful to correct the column types.

#### Correcting column types

In [None]:
df['product/price'] = pd.to_numeric(df['product/price'], errors='coerce')
df['review/score'] = pd.to_numeric(df['review/score'], errors='coerce')
df['review/time'] = pd.to_datetime(df['review/time'].astype(float), unit='s')

def handle_helpfulness(x):
    """
    Converts the string representation of a helpfulness score to a float value between 0 and 1.

    Parameters:
        x (str): The string representation of the helpfulness score, in the format "helpful/total",
        where "helpful" is the number of users who found the review helpful and "total" is the total number of votes.

    Returns:
        float: The helpfulness score, calculated as "helpful / total". Returns 0 if "total" is 0 or the string is malformed.
    """
    try:
        nom, denom = x.split("/")
        return int(nom) / int(denom)
    except (ValueError, ZeroDivisionError):
        return 0

df['review/helpfulness'] = df['review/helpfulness'].apply(handle_helpfulness)

The price and score columns now store numeric values, and the review time, originally given in seconds, is stored as a datetime. The helpfulness column was transformed from its string form into the float ratio it represents; reviews without any votes are assigned 0.
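
As a worked example of the helpfulness conversion, the sketch below uses a hypothetical helper, `helpfulness_ratio`, that mirrors `handle_helpfulness` but makes the value for reviews without votes configurable. Using NaN instead of 0 is an assumption of this sketch, meant to show how "no votes" could be kept apart from "voted unhelpful".

```python
import numpy as np
import pandas as pd

def helpfulness_ratio(value, no_votes=np.nan):
    """Convert 'x/y' to x / y; return `no_votes` when y is 0 or the input is malformed."""
    try:
        helpful, total = value.split("/")
        return int(helpful) / int(total)
    except (ValueError, ZeroDivisionError, AttributeError):
        return no_votes

# Made-up helpfulness strings
raw = pd.Series(["3/4", "0/5", "0/0", "7/7"])

as_nan = raw.apply(helpfulness_ratio)               # "0/0" becomes NaN
as_zero = raw.apply(helpfulness_ratio, no_votes=0)  # "0/0" becomes 0, as in the notebook

print(as_nan.tolist())
print(as_zero.tolist())
```

Keeping NaN for unvoted reviews would let the histogram show only reviews that actually received votes.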

Returning to the missing values before the split, the last column to handle is the price. Because the missing prices make up a significant share of the dataset, dropping them would not make sense. Assigning the price of a small accessory to a brand new cellphone would distort the dataset, so the missing values in this column will not be replaced with the mean, median or mode. The products are stored more or less in order (similar products next to one another), so the kNN method was chosen to impute the missing values.

In [None]:
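# Note: product/price is the only feature passed to the imputer, so there are no other
# columns to measure distances on; in that case scikit-learn's KNNImputer appears to
# fall back to the column mean for the rows where the price is missing.
features_with_missing_values = df[['product/price']]
imputer = KNNImputer(n_neighbors=5)
imputed_features = imputer.fit_transform(features_with_missing_values)
df['product/price'] = imputed_features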
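
The reasoning about neighbouring products can be illustrated on toy data. The sketch below compares `KNNImputer` applied to the price column alone with a version that also receives the row position as an auxiliary feature. The `position` column is an assumption of this sketch (it is not in the original notebook) and only makes sense if similar products really are stored next to one another.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy prices: cheap accessories first, then phones, with two gaps (values are made up)
toy = pd.DataFrame({"price": [10.0, 12.0, np.nan, 11.0, 100.0, np.nan, 110.0, 105.0]})

# Price column alone: rows with a missing price have no feature to compare on
alone = KNNImputer(n_neighbors=2).fit_transform(toy[["price"]])[:, 0]

# Price plus row position: neighbours are now the closest rows in the file order
toy["position"] = np.arange(len(toy), dtype=float)
with_pos = KNNImputer(n_neighbors=2).fit_transform(toy[["position", "price"]])[:, 1]

print("price only:      ", alone[[2, 5]])
print("price + position:", with_pos[[2, 5]])
```

In this toy case the single-column run fills both gaps with the overall mean, while the positional variant fills them from the two adjacent rows. Whether row order is a good enough proxy for product similarity in the real file would need to be checked.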

#### Dividing the dataset

In [None]:
# Dividing the dataset into anonymous and named
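# .copy() makes both subsets independent frames, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df_anon = df[(df['review/userId'].isna()) & (df['review/profileName'].isna())].copy()
df_named = df[~(df['review/userId'].isna()) & ~(df['review/profileName'].isna())].copy()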

# Deleting unnecessary columns from the anonymous reviews
df_anon = df_anon.drop(columns=['review/userId', 'review/profileName'])

# Checking if everything went as expected
print(f"Missing values occurrences in anonymous reviews dataset:\n{df_anon.isna().sum()}")
print(f"Missing values occurrences in named reviews dataset:\n{df_named.isna().sum()}")

The analysis will focus on the named reviews, while the anonymous reviews remain available as a point of reference.
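
The two masks used for the split require both identity columns to be missing or both to be present, so a row with only one of them missing would land in neither subset. A minimal sketch on made-up data shows how such rows could be counted; whether any exist in the real dataset is not known here.

```python
import numpy as np
import pandas as pd

# Made-up rows covering all four combinations of missing identity fields
toy = pd.DataFrame({
    "review/userId": ["A1", np.nan, "A3", np.nan],
    "review/profileName": ["Ann", np.nan, np.nan, "Dan"],
})

id_missing = toy["review/userId"].isna()
name_missing = toy["review/profileName"].isna()

anon = toy[id_missing & name_missing]       # both fields missing
named = toy[~id_missing & ~name_missing]    # both fields present
partial = toy[id_missing ^ name_missing]    # exactly one field missing

print(len(anon), len(named), len(partial))
```

If `partial` turns out to be non-empty on the real data, those rows would need an explicit decision about which subset they belong to.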

#### Data Visualizations

In [None]:
# Extracting the columns with numerical and categorical variables
numerical = ['product/price', 'review/helpfulness', 'review/score', 'review/time']
categorical = ['product/productId', 'product/title', 'review/userId', 'review/profileName']

Summary and text would be difficult to visualize so those columns are not taken into consideration in this section.

#### a. Named Reviews

In [None]:
# Numerical variables
fig, axs = plt.subplots(2, 2, figsize=[15, 15])
fig.suptitle("Data distribution for Named Reviews")

for i, col in enumerate(numerical):
    sns.histplot(data=df_named, x=col, ax=axs[i//2, i%2], color='darkmagenta')
    axs[i//2, i%2].set(title=col, xlabel='Value')
plt.show()

Most reviews are rated either extremely helpful or extremely unhelpful; values in between are rare. Note that reviews without any votes were assigned a helpfulness of 0, so they also count towards the "unhelpful" peak. A similar pattern can be observed in the score histogram, but to a lesser extent (a score of 4.0 is observed more often than 1.0). The majority of reviews have a score of 5.0, which suggests that customers were mostly satisfied with the products. Interestingly, most of the reviews were registered in 2007-2008, the time of the Global Financial Crisis. The price plot is not very readable, so a boxplot was created to better understand the data.

In [None]:
fig, ax = plt.subplots(figsize=[8,6])
sns.boxplot(x=df_named['product/price'], ax=ax, color='darkmagenta')
ax.set(xlabel='Product Price', title="Boxplot for Product Price (Named Reviews)")
plt.show()

It is important to remember that more than half of the price values were imputed by the kNN algorithm, so the results might not be fully reliable. That is also the reason why outliers are not removed. The majority of products are in the cheaper range, while more expensive products appear less often and are treated as outliers in this dataset. The middle half of the data (between the first and third quartiles) falls approximately in the 15-35 range.
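
The quartiles and the outlier limits that a boxplot draws can also be computed directly, which avoids reading them off the figure. This is a minimal sketch on made-up prices; the 1.5 × IQR whisker rule is the usual boxplot convention and is assumed here.

```python
import numpy as np
import pandas as pd

# Made-up prices with one clearly expensive product
prices = pd.Series([5, 9, 15, 18, 22, 25, 30, 35, 60, 250], dtype=float)

# First and third quartiles and the interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print("Outliers:", outliers.tolist())
```

Applied to `df_named['product/price']`, the same calculation would give the exact bounds of the range quoted above.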

#### b. Anonymous Reviews

In [None]:
# Numerical variables
fig, axs = plt.subplots(2, 2, figsize=[15, 15])
fig.suptitle("Data distribution for Anonymous Reviews")

for i, col in enumerate(numerical):
    sns.histplot(data=df_anon, x=col, ax=axs[i//2, i%2], color='darkmagenta')
    axs[i//2, i%2].set(title=col, xlabel='Value')
plt.show()

The results are similar to those from the Named Reviews dataset. The biggest difference is visible in the last plot: anonymous reviews were most common in 2004. This might suggest that people at that time did not trust the internet as much as they did in 2007-2008 and did not want to provide personal data to websites.
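
The comparison of review dates between the two groups can be made numerically by counting reviews per year. The sketch below uses made-up dates and a hypothetical boolean column `anonymous`; on the real data the two frames `df_named` and `df_anon` would play that role.

```python
import pandas as pd

# Made-up review dates with a flag for anonymous reviews
toy = pd.DataFrame({
    "review/time": pd.to_datetime([
        "2004-03-01", "2004-07-15", "2005-01-10",
        "2007-05-20", "2008-02-02", "2008-11-30",
    ]),
    "anonymous": [True, True, False, False, False, True],
})

# Number of reviews per year, split by the anonymous flag
per_year = (
    toy.groupby([toy["review/time"].dt.year.rename("year"), "anonymous"])
       .size()
       .unstack(fill_value=0)
)

# Share of each group's reviews that falls in a given year
share = per_year.div(per_year.sum(axis=0), axis=1)

print(per_year)
print(share.round(2))
```

Comparing shares rather than raw counts keeps the two groups comparable even though they differ in size.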

#### Positive vs. Negative Reviews

In [None]:
# Label a review as positive (1) when its score is above 3, otherwise negative (0)
df_named['sentiment'] = df_named['review/score'].map(lambda x: 1 if x > 3 else 0)

nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    """Lemmatizes and lowercases the text, dropping stop words, punctuation and number-like tokens."""
    doc = nlp(text)
    tokens = [token.lemma_.lower().strip() for token in doc if not token.is_stop and not token.is_punct and not token.like_num]
    return " ".join(tokens)

df_named['review/text'] = df_named['review/text'].apply(preprocess_text)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df_named['review/text'])

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, df_named['sentiment'], test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42)  # fixed seed so the results are reproducible
clf.fit(X_train, y_train)

# evaluate the performance of the model on the test set
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
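
Because most reviews have a score of 5.0, the positive class is likely to dominate, and accuracy alone can then look better than the model really is. One way to put the metrics above in context is a majority-class baseline. The sketch below uses made-up labels with an assumed 80/20 imbalance, not the real class ratio.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels: 80 positive and 20 negative reviews
y_true = np.array([1] * 80 + [0] * 20)
X_dummy = np.zeros((len(y_true), 1))  # the baseline ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_f1_neg = f1_score(y_true, y_base, pos_label=0, zero_division=0)
cm = confusion_matrix(y_true, y_base)  # rows: true class, columns: predicted class

print("Baseline accuracy:", base_acc)
print("Baseline F1 for the negative class:", base_f1_neg)
print(cm)
```

A classifier is only informative to the extent that it beats this baseline, and the confusion matrix shows whether negative reviews are being detected at all.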


In [None]:
df_named['review/text']