<center>

# User Reviews ETL

</center>

In [42]:
# Imports.
import os
import pandas as pd
import ast
import numpy as np

<div style="text-align: justify">

### 1. Converting Data: From JSON Format to CSV Format

This is the final version of the code that transforms the JSON file into a CSV file. The process was challenging because the reviews column was nested: it held a list that contained one list per user, and inside each of those lists there were dictionaries with the user's reviews. Additionally, some reviews contained escape sequences such as \r (carriage return) or \n (new line). When I first attempted to normalize the file, the sequence and order of the records were lost. I had to clean the JSON file and write a new one before I could normalize it successfully.
</div>
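A minimal sketch of the cleaning and flattening idea, run on a single made-up record rather than the real file. The sample line and the `clean_text` helper are illustrative assumptions; only the field names (`user_id`, `user_url`, `reviews`, `review`) come from the dataset.

```python
import ast

# Hypothetical sample line written as a Python literal (single quotes, True).
# The doubled backslashes keep \r, \n and \t as escape sequences inside the
# line, so ast.literal_eval turns them into real control characters.
sample_line = (
    "{'user_id': 'demo_user', 'user_url': 'http://example.com/demo_user', "
    "'reviews': [{'item_id': '1250', 'recommend': True, "
    "'review': 'Great\\r\\ngame\\twith friends'}]}"
)


def clean_text(text):
    """Replace carriage returns, newlines, and tabs with single spaces."""
    for ch in ('\r', '\n', '\t'):
        text = text.replace(ch, ' ')
    return text


record = ast.literal_eval(sample_line)

# Flatten: one row per review, carrying the user fields along.
rows = []
for review in record.get('reviews', []):
    row = dict(review)
    row['review'] = clean_text(row.get('review', ''))
    row['user_id'] = record.get('user_id', '')
    row['user_url'] = record.get('user_url', '')
    rows.append(row)

print(rows)
```

Each control character is replaced by its own space, so `\r\n` leaves two spaces behind. Collapsing repeated spaces would be a separate step.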

In [43]:
# Transformation code: from JSON to CSV. 
# File paths.
user_reviews = 'PI MLOps - STEAM/user_reviews.json'
user_reviews_cleaned = 'PI MLOps - STEAM/user_reviews_cleaned.json'
user_reviews_csv = 'csv/user_reviews.csv'

# If the CSV file does not exist yet, create it.
if not os.path.exists(user_reviews_csv):
    # Read the raw file line by line and clean the review text format.
    with open(user_reviews, encoding='utf-8') as f, open(user_reviews_cleaned, 'w', encoding='utf-8') as f_cleaned:
        for line in f:
            # Parse the line as a Python literal.
            review_data = ast.literal_eval(line)

            # Replace carriage returns, newlines, and tabs in the review text with spaces.
            for review in review_data.get('reviews', []):
                review['review'] = review.get('review', '').replace('\r', ' ').replace('\n', ' ').replace('\t', ' ')

            # Write the cleaned line to the new file.
            f_cleaned.write(str(review_data) + '\n')

    # Both files are closed at this point, so the cleaned file is fully written before it is read back.
    reviews = []
    with open(user_reviews_cleaned, encoding='utf-8') as f:
        for line in f:
            review_data = ast.literal_eval(line)
            user_id = review_data.get('user_id', '')
            user_url = review_data.get('user_url', '')
            # Use a separate name so the 'user_reviews' file path is not overwritten.
            user_review_list = review_data.get('reviews', [])

            # Add 'user_id' and 'user_url' to each review.
            for review in user_review_list:
                review['user_id'] = user_id
                review['user_url'] = user_url

            reviews.extend(user_review_list)

    # Normalize the list of review dictionaries to a DataFrame.
    df_reviews = pd.json_normalize(reviews)

    # Save the DataFrame as a CSV file.
    df_reviews.to_csv(user_reviews_csv, index=False)
    print(f'The file {user_reviews_csv} was successfully created.')
else:
    print(f'The file {user_reviews_csv} already exists.')

The file csv/user_reviews.csv was successfully created.


<div style="text-align: justify">

### 2. Understanding How the Review Column Was Nested

After finding three errors, I checked the output and discovered that the normalization of the file had not worked. I reached this conclusion when I tested two item IDs from the user 76561198114558878: item 108600 did not show its full description, whereas item 239140 did.

</div>

I found the user 76561198114558878 while filtering the **posted** column in Power BI, so I used it for the test.

<div style="text-align: justify">

After comparing both outputs, I realized something was wrong. Looking more closely at the **review** column while filtering the information in Power BI, I noticed that it started with an empty list [], followed by many other lists of dictionaries. The column composition was something like this: [[], [{'funny': '', 'posted': 'Posted November 5, 2011.', 'last_edited': '', 'item_id': '1250', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'Simple yet with great replayability. In my opinion, it does 'zombie' hordes and teamwork better than Left 4 Dead, plus it has a global leveling system. A lot of down-to-earth 'zombie' splattering fun for the whole family. Amazed this sort of FPS is so rare.'}, {...}], [{'...'}]...]. This made me rebuild the code.

</div>
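A minimal sketch of how a list of lists like the one described above can be flattened into a single list of review dictionaries. The `nested_reviews` data is a made-up miniature, and `itertools.chain` is only one possible way to do this; it is not necessarily how the notebook's rebuilt code handles it.

```python
from itertools import chain

# Hypothetical miniature of the nested column: an empty list followed by
# per-user lists of review dictionaries.
nested_reviews = [
    [],
    [{'item_id': '1250', 'review': 'first'}, {'item_id': '22200', 'review': 'second'}],
    [{'item_id': '43110', 'review': 'third'}],
]

# chain.from_iterable concatenates the inner lists in order;
# empty lists simply contribute nothing.
flat_reviews = list(chain.from_iterable(nested_reviews))

print(len(flat_reviews))
print([review['item_id'] for review in flat_reviews])
```

Once the structure is flat, each dictionary can become one row of the resulting table.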

In [44]:
# Read the CSV file after the code rebuild.
df_reviews_final = pd.read_csv('csv/user_reviews.csv')

In [45]:
# User to check the new file.
user_id_value = 'AxeOfChaos'

# Filter the DataFrame for rows with the specified user_id.
results = df_reviews_final[df_reviews_final['user_id'] == user_id_value]

# Display the resulting DataFrame.
results

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
8824,,Posted August 31.,,475150,4 of 4 people (100%) found this review helpful,True,The original and the best just got even better...,AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos
8825,,"Posted August 21, 2013.",,24740,No ratings yet,True,"Most fun you can have on 4 wheels, or even 2! ...",AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos
8826,,"Posted July 30, 2012.",,105600,No ratings yet,True,Co-op is amazing :D,AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos


<div style="text-align: justify">

I checked this new version in Power BI, and this time there were no warnings. I chose AxeOfChaos as the test user because I had been unable to find any results for this user after the first attempt.

</div>

<div style="text-align: justify">

### 3. Finding Null and Duplicate Values

I started cleaning the final file by removing the null and duplicate values. First, I checked for and removed the null data in the **review** column.

</div>

In [46]:
df_reviews_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59276 entries, 0 to 59275
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   funny        8143 non-null   object
 1   posted       59276 non-null  object
 2   last_edited  6139 non-null   object
 3   item_id      59276 non-null  int64 
 4   helpful      59276 non-null  object
 5   recommend    59276 non-null  bool  
 6   review       59246 non-null  object
 7   user_id      59276 non-null  object
 8   user_url     59276 non-null  object
dtypes: bool(1), int64(1), object(7)
memory usage: 3.7+ MB


According to the info() method, there was null data.

In [47]:
# Check for null values in the DataFrame.
null_counts = df_reviews_final.isnull().sum()

# Display the count of null values per column.
null_counts

funny          51133
posted             0
last_edited    53137
item_id            0
helpful            0
recommend          0
review            30
user_id            0
user_url           0
dtype: int64

<div style="text-align: justify">

According to the result, there was null data in the **funny**, **last_edited**, and **review** columns. I focused on the **review** column because I considered the **funny** and **last_edited** columns unnecessary for the project.

</div>

In [48]:
# Check for rows where 'review' is NaN or null.
missing_reviews_rows = df_reviews_final[df_reviews_final['review'].isna()]
missing_reviews_rows.head(2)

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
3095,,Posted March 11.,,550,No ratings yet,True,,2ZESTY4ME,http://steamcommunity.com/id/2ZESTY4ME
4616,,"Posted September 19, 2014.",,550,No ratings yet,True,,76561198093337643,http://steamcommunity.com/profiles/76561198093...


In [49]:
# Remove rows where 'review' is null.
df_reviews_final = df_reviews_final.dropna(subset=['review'])

# Check for null values in the DataFrame.
null_counts = df_reviews_final.isnull().sum()

# Display the count of null values per column.
null_counts

funny          51104
posted             0
last_edited    53107
item_id            0
helpful            0
recommend          0
review             0
user_id            0
user_url           0
dtype: int64

I removed the NaN reviews because a review should give you an idea about the game you want to buy.

In [50]:
# Find fully duplicated rows.
duplicates = df_reviews_final.loc[df_reviews_final.duplicated()]
duplicates

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
1112,,"Posted September 24, 2015.",,346110,1 of 1 people (100%) found this review helpful,True,yep,bokkkbokkk,http://steamcommunity.com/id/bokkkbokkk
2891,,"Posted January 10, 2014.",,218620,1 of 3 people (33%) found this review helpful,True,"Good graphics, fun heists! A bit laggy",ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2892,,"Posted January 10, 2014.",,105600,0 of 2 people (0%) found this review helpful,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2893,,"Posted December 17, 2014.",,570,No ratings yet,True,bobo pinoy,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2894,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
...,...,...,...,...,...,...,...,...,...
44433,,Posted July 3.,,422400,No ratings yet,True,Muy entretenido y una coleccion de armas prome...,76561198092022514,http://steamcommunity.com/profiles/76561198092...
44434,,Posted June 1.,,218620,No ratings yet,True,"Tiene una jugabilidad y tematica muy buena :D,...",76561198092022514,http://steamcommunity.com/profiles/76561198092...
44435,,"Posted August 17, 2014.",,261820,No ratings yet,True,"Buen juego, no importa el desarrrollo que tien...",76561198092022514,http://steamcommunity.com/profiles/76561198092...
44436,,"Posted February 17, 2014.",,224260,No ratings yet,True,exelente aporte :D¡¡¡ es una buen mod basado e...,76561198092022514,http://steamcommunity.com/profiles/76561198092...


In [51]:
# Look up one duplicated review by user_id and item_id.
user_id_value = 'ImSeriouss'
item_id_value = 211820

# Filter the DataFrame for records with specific user_id and item_id
filtered_records = df_reviews_final[(df_reviews_final['user_id'] == user_id_value) & (df_reviews_final['item_id'] == item_id_value)]

# Display the resulting DataFrame
filtered_records

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
2888,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2894,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss


In [52]:
# Delete duplicate rows.
# Get the total number of rows before deduplication.
total_rows_before = len(df_reviews_final)

# Remove duplicate rows.
df_reviews_final = df_reviews_final.drop_duplicates(keep='first')

# Get the total number of rows after deduplication.
total_rows_after = len(df_reviews_final)

# Calculate the number of rows removed.
rows_removed = total_rows_before - total_rows_after

# Print the information.
print(f'Total rows before: {total_rows_before}')
print(f'Total rows after: {total_rows_after}')
print(f'Rows removed: {rows_removed}')

Total rows before: 59246
Total rows after: 58372
Rows removed: 874


I wanted to be sure that the drop_duplicates function deleted the right number of duplicate rows.
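One way to make that check explicit is to compare the number of rows flagged by `duplicated()` with the number of rows that `drop_duplicates()` actually removes. This is a minimal sketch on made-up data; the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy data: the second and fourth rows repeat the first row.
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'a'],
    'item_id': [10, 10, 20, 10],
    'review': ['fun', 'fun', 'ok', 'fun'],
})

# duplicated() flags every repeat after the first occurrence (keep='first').
expected_removed = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates(keep='first')
actual_removed = len(toy) - len(deduplicated)

print(expected_removed, actual_removed)
```

If the two numbers match and `deduplicated.duplicated().any()` is `False`, the deduplication removed exactly the repeated rows.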

<div style="text-align: justify">

### 4. Deleting Cells with Whitespace and Empty Strings

I found and removed the rows in which the **review** column is empty because, like the NaN reviews, they do not provide useful information for deciding whether to buy a game.

</div>
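As an alternative to a 'missing data' placeholder, blank reviews can also be located with a boolean mask. This is a minimal sketch on made-up data, not the method the notebook uses; the `toy` DataFrame and the `is_blank` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with spaces-only, empty, and newline/tab-only reviews.
toy = pd.DataFrame({'review': ['good game', '   ', '', '\n\t', 'ok']})

# str.strip() removes spaces, tabs, and newlines from both ends, so any
# whitespace-only review becomes an empty string.
is_blank = toy['review'].str.strip().eq('')

print(int(is_blank.sum()))

# Keep only the rows whose review has visible content.
kept = toy[~is_blank]
print(kept['review'].tolist())
```

A mask avoids the small risk that a genuine review happens to contain the placeholder text itself.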

In [60]:
# Clean the 'review' column by removing leading and trailing whitespace.
df_reviews_final['review'] = df_reviews_final['review'].str.strip()

# Replace reviews that are empty after stripping with 'missing data'.
df_reviews_final['review'] = df_reviews_final['review'].replace('', 'missing data')

# Replace reviews that consist only of newlines or tabs with 'missing data'.
# The pattern is anchored so that text inside a real review is never altered.
df_reviews_final['review'] = df_reviews_final['review'].replace(r'^[\n\t]+$', 'missing data', regex=True)

# Fill any remaining null values in the 'review' column with 'missing data'.
df_reviews_final['review'] = df_reviews_final['review'].fillna('missing data')


In [61]:
# Filter rows where the 'review' column is equal to 'missing data'
missing_data_rows = df_reviews_final[df_reviews_final['review'] == 'missing data']
missing_data_rows

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
614,,"Posted December 13, 2013.",,570,No ratings yet,True,missing data,76561198070263209,http://steamcommunity.com/profiles/76561198070...
914,,"Posted November 25, 2013.",,215530,0 of 3 people (0%) found this review helpful,True,missing data,Azrafael,http://steamcommunity.com/id/Azrafael
9347,,"Posted November 25, 2013.",,233840,1 of 4 people (25%) found this review helpful,True,missing data,BomberThink,http://steamcommunity.com/id/BomberThink
9348,,"Posted January 31, 2014.",,211820,No ratings yet,True,missing data,BomberThink,http://steamcommunity.com/id/BomberThink
22879,1 person found this review funny,"Posted February 3, 2014.",,208090,2 of 3 people (67%) found this review helpful,True,missing data,rpsntc,http://steamcommunity.com/id/rpsntc
28607,1 person found this review funny,"Posted November 29, 2013.",,8500,No ratings yet,True,missing data,76561198040016388,http://steamcommunity.com/profiles/76561198040...
28958,,"Posted January 1, 2014.",,40800,No ratings yet,True,missing data,SILENTLIGHT,http://steamcommunity.com/id/SILENTLIGHT
55339,,"Posted January 5, 2014.",,10,No ratings yet,True,missing data,inconi70,http://steamcommunity.com/id/inconi70


I found 8 reviews with missing data in the dataset.

In [62]:
# Delete the rows where the 'review' column is equal to 'missing data'.
df_reviews_final = df_reviews_final[df_reviews_final['review'] != 'missing data']

<div style="text-align: justify">

### 5. Deleting columns

I considered the columns **funny**, **last_edited**, **posted**, **helpful**, and **user_url** unnecessary for the project.

</div>

In [64]:
# Deleting columns
columns_to_drop = ['funny', 'last_edited', 'posted', 'helpful', 'user_url']

# Drop the specified columns
df_reviews_final = df_reviews_final.drop(columns=columns_to_drop, errors='ignore')

In [65]:
df_reviews_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58364 entries, 0 to 59275
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   item_id    58364 non-null  int64 
 1   recommend  58364 non-null  bool  
 2   review     58364 non-null  object
 3   user_id    58364 non-null  object
dtypes: bool(1), int64(1), object(2)
memory usage: 1.8+ MB


In [66]:
# Check the final version of the file before overwriting it.
df_reviews_final

Unnamed: 0,item_id,recommend,review,user_id
0,1250,True,Simple yet with great replayability. In my opi...,76561197970982479
1,22200,True,It's unique and worth a playthrough.,76561197970982479
2,43110,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479
3,251610,True,I know what you think when you see this title ...,js41637
4,227300,True,For a simple (it's actually not all that simpl...,js41637
...,...,...,...,...
59271,313160,True,"This is a great game, but I cant play one map ...",snarkcornwtt
59272,730,True,Counter Strike is like junk food.It's toxic as...,Fuckfhaisjnsnsjakaka
59273,240,True,잼꾸르잼,3214213216
59274,209120,True,"Great game, awkward to get running in windows 10",ChrisCoroner


<div style="text-align: justify">

### 6. Overwriting the CSV File

</div>

In [67]:
# Overwrite the original CSV file.
df_reviews_final.to_csv('csv/user_reviews.csv', index=False)