<center>

# User Reviews ETL

</center>

In [10]:
# Importations.
import os
import pandas as pd
import ast
import numpy as np

<div style="text-align: justify">

### 1. Converting Data: From JSON Format to CSV Format

This is the final version of the code that transforms the JSON file into a CSV file. The process was challenging because the reviews column held a list containing one list per user, and each of those lists held dictionaries with that user's reviews. Additionally, some reviews had escape sequences such as \r (carriage return) or \n (new line). When I first attempted to normalize the file, the sequence and order were lost. To fix this, I had to clean the JSON file and write a new one; only then could I normalize it successfully.
</div>

In [3]:
# Transformation code: from JSON to CSV.
# File paths.
user_reviews = 'PI MLOps - STEAM/user_reviews.json'
user_reviews_cleaned = 'PI MLOps - STEAM/user_reviews_cleaned.json'
user_reviews_csv = 'csv/user_reviews.csv'

# If the CSV file does not exist, create it.
if not os.path.exists(user_reviews_csv):
    # Read the raw file line by line and clean the review text format.
    with open(user_reviews, encoding='utf-8') as f, open(user_reviews_cleaned, 'w', encoding='utf-8') as f_cleaned:
        for line in f:
            review_data = ast.literal_eval(line)

            # Clean the format of the review text.
            for review in review_data.get('reviews', []):
                review['review'] = review.get('review', '').replace('\r', ' ').replace('\n', ' ').replace('\t', ' ')

            # Write the cleaned line to the new file.
            f_cleaned.write(str(review_data) + '\n')

    # Read the cleaned file only after the block above has closed (and flushed) it.
    reviews = []
    with open(user_reviews_cleaned, encoding='utf-8') as f:
        for line in f:
            review_data = ast.literal_eval(line)
            user_id = review_data.get('user_id', '')
            user_url = review_data.get('user_url', '')
            # Separate name, so the file path in user_reviews is not overwritten.
            user_review_list = review_data.get('reviews', [])

            # Add 'user_id' and 'user_url' to each review.
            for review in user_review_list:
                review['user_id'] = user_id
                review['user_url'] = user_url

            reviews.extend(user_review_list)

    # Normalize the list of reviews to a DataFrame.
    df_reviews = pd.json_normalize(reviews)

    # Save the DataFrame as a CSV file.
    df_reviews.to_csv(user_reviews_csv, index=False)
    print(f'The file {user_reviews_csv} was successfully created.')
else:
    print(f'The file {user_reviews_csv} already exists.')

The file csv/user_reviews.csv was successfully created.
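
As a side note, the same flattening can be sketched in memory with the `record_path` and `meta` arguments of `pd.json_normalize`. This is only a minimal sketch on a hypothetical two-user sample (the IDs, URLs, and review texts below are made up), not the code that produced the file above.

```python
import pandas as pd

# Hypothetical sample shaped like the parsed lines of the raw file.
records = [
    {
        'user_id': 'user_a',
        'user_url': 'http://example.com/id/user_a',
        'reviews': [
            {'item_id': '10', 'recommend': True, 'review': 'Good\r\ngame'},
            {'item_id': '20', 'recommend': False, 'review': 'Too\tshort'},
        ],
    },
    # A user without reviews contributes no rows.
    {'user_id': 'user_b', 'user_url': 'http://example.com/id/user_b', 'reviews': []},
]

# Replace carriage returns, new lines, and tabs with spaces.
for record in records:
    for review in record['reviews']:
        review['review'] = review['review'].replace('\r', ' ').replace('\n', ' ').replace('\t', ' ')

# One row per review; user_id and user_url are repeated on every row.
df_sample = pd.json_normalize(records, record_path='reviews', meta=['user_id', 'user_url'])
print(df_sample)
```

With this approach the loop that copies `user_id` and `user_url` into each review is not needed, because `meta` does it.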


In [5]:
# Reading the file user_reviews.csv produced by the first version of the transformation code.
df_reviews = pd.read_csv('csv/user_reviews.csv')

In [3]:
# checking the file information.
df_reviews.head(1)

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."


At the beginning, I thought this was nested like the Users Items file, so I updated the code and tried it again.

In [3]:
# Checking the result after the unnesting process.
df_reviews_unnested = pd.read_csv('csv/user_reviews.csv')

In [25]:
# checking the result after the first update.
df_reviews_unnested.head(3)

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,76561197970982479,http://steamcommunity.com/profiles/76561197970...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,76561197970982479,http://steamcommunity.com/profiles/76561197970...
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479,http://steamcommunity.com/profiles/76561197970...


<div style="text-align: justify">

I tried to unnest this file using the code for the User Reviews file inside 0. preparation.ipynb. Everything seemed fine, but when I checked the file in Power BI, it showed a warning indicating that 3 items had errors. I could see the item IDs, but I couldn't see the user IDs of the 3 rows with errors. The users were AxeOfChaos, cumtasteslikejelly, and 988988MePls.

</div>

<div style="text-align: justify">

### 2. Understanding How the Review Column Was Nested

After finding the 3 errors, I decided to investigate and discovered that the normalization of the file had not worked. I reached this conclusion when I tested 2 item IDs from the user 76561198114558878. For instance, the review for item 108600 does not show the full text, unlike the one for item 239140.

</div>

In [27]:
# User from error.
user_id_value = '76561198114558878'

# Filter the DataFrame for rows with the specified user_id.
user_results = df_reviews_unnested[df_reviews_unnested['user_id'] == user_id_value]

# Display the resulting DataFrame.
user_results


Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
58222,1 person found this review funny,"Posted December 19, 2015.",,349100,9 of 9 people (100%) found this review helpful,True,Notrium.I don't have words for this game. Prob...,76561198114558878,http://steamcommunity.com/profiles/76561198114...
58223,,"Posted May 19, 2014.",,108600,2 of 2 people (100%) found this review helpful,True,A fantastic zombie game! Fast paced; but also ...,76561198114558878,http://steamcommunity.com/profiles/76561198114...
58224,,"Posted August 12, 2014.",,227940,1 of 3 people (33%) found this review helpful,True,Heroes and Generals is a FPS based in the year...,76561198114558878,http://steamcommunity.com/profiles/76561198114...
58225,1 person found this review funny,Posted June 9.,,239140,2 of 10 people (20%) found this review helpful,False,Did you literally just make my base game redun...,76561198114558878,http://steamcommunity.com/profiles/76561198114...


I found the user 76561198114558878 when I filtered the column posted in Power BI, so I took it for the test.

In [35]:
# searching a result considering the user_id and item_id.
user_id_value = '76561198114558878'
item_id_value = 108600

# Filter the DataFrame for records with specific user_id and item_id.
filtered_records = df_reviews_unnested[(df_reviews_unnested['user_id'] == user_id_value) & (df_reviews_unnested['item_id'] == item_id_value)]

filtered_records

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
58223,,"Posted May 19, 2014.",,108600,2 of 2 people (100%) found this review helpful,True,A fantastic zombie game! Fast paced; but also ...,76561198114558878,http://steamcommunity.com/profiles/76561198114...


In [36]:
# Display the full content of the 'review' column for the specific user and item.
print("Review content:")
for index, row in filtered_records.iterrows():
    print(row['review'])

Review content:
 = It's a Zombie game. At least they aren't Nazis... /Senses incoming DLC./as holding my handgun upside down; AND backwards! D; What's up with that? You should also make sure that the if you get a certain style of melee weapon; they swing differently. Like; I don't think I'd swing an Axe like a Baseball bat. XD Just saying!


<div style="text-align: justify">

After comparing both outputs, I realized something was wrong. Taking a closer look at the **reviews** column in Power BI while filtering the information, I noticed that it started with an empty list [] followed by many other lists of dictionaries. The column composition was something like this: [[], [{'funny': '', 'posted': 'Posted November 5, 2011.', 'last_edited': '', 'item_id': '1250', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'Simple yet with great replayability. In my opinion, it does 'zombie' hordes and teamwork better than Left 4 Dead, plus it has a global leveling system. A lot of down-to-earth 'zombie' splattering fun for the whole family. Amazed this sort of FPS is so rare.'}, {...}], [{'...'}]...]. This made me rebuild the code.

</div>
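
A minimal sketch of how a structure like the one described above can be flattened, assuming a shortened, hypothetical version of it with only two keys per dictionary. The empty list at the start simply contributes nothing to the result.

```python
from itertools import chain

# Hypothetical, shortened version of the nested structure: an empty list
# followed by one list of review dictionaries per user.
nested_reviews = [
    [],
    [
        {'item_id': '1250', 'review': 'Simple yet with great replayability.'},
        {'item_id': '22200', 'review': "It's unique and worth a playthrough."},
    ],
    [{'item_id': '43110', 'review': 'Great atmosphere.'}],
]

# chain.from_iterable concatenates the inner lists in their original order.
flat_reviews = list(chain.from_iterable(nested_reviews))

for review in flat_reviews:
    print(review['item_id'], review['review'])
```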

In [2]:
# reading the csv file after the code rebuild.
df_reviews_final = pd.read_csv('csv/user_reviews.csv')

In [5]:
# User to check the new file.
user_id_value = 'AxeOfChaos'

# Filter the DataFrame for rows with the specified user_id.
results = df_reviews_final[df_reviews_final['user_id'] == user_id_value]

# Display the resulting DataFrame.
results

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
8824,,Posted August 31.,,475150,4 of 4 people (100%) found this review helpful,True,The original and the best just got even better...,AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos
8825,,"Posted August 21, 2013.",,24740,No ratings yet,True,"Most fun you can have on 4 wheels, or even 2! ...",AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos
8826,,"Posted July 30, 2012.",,105600,No ratings yet,True,Co-op is amazing :D,AxeOfChaos,http://steamcommunity.com/id/AxeOfChaos


<div style="text-align: justify">

I checked this new version in Power BI and this time there were no warnings. I took AxeOfChaos as the test user because I had been unable to find results for this user after the first attempt.

</div>

<div style="text-align: justify">

### 3. Finding Null and Duplicate Values

I started cleaning the final file, removing the null and duplicate values. I checked and removed the null data in the **review** column.

</div>

In [6]:
df_reviews_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59276 entries, 0 to 59275
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   funny        8143 non-null   object
 1   posted       59276 non-null  object
 2   last_edited  6139 non-null   object
 3   item_id      59276 non-null  int64 
 4   helpful      59276 non-null  object
 5   recommend    59276 non-null  bool  
 6   review       59246 non-null  object
 7   user_id      59276 non-null  object
 8   user_url     59276 non-null  object
dtypes: bool(1), int64(1), object(7)
memory usage: 3.7+ MB


According to the info() method, some columns had null values.

In [7]:
# Check for null values in the DataFrame.
null_counts = df_reviews_final.isnull().sum()

# Display the count of null values per column.
null_counts

funny          51133
posted             0
last_edited    53137
item_id            0
helpful            0
recommend          0
review            30
user_id            0
user_url           0
dtype: int64

<div style="text-align: justify">

According to the result, there were null values in the columns funny, last_edited, and review. I focused on the review column because I considered the funny and last_edited columns unnecessary for the project.

</div>

In [8]:
# checking for rows where 'review' is NaN or null.
missing_reviews_rows = df_reviews_final[df_reviews_final['review'].isna()]
missing_reviews_rows.head(2)

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
3095,,Posted March 11.,,550,No ratings yet,True,,2ZESTY4ME,http://steamcommunity.com/id/2ZESTY4ME
4616,,"Posted September 19, 2014.",,550,No ratings yet,True,,76561198093337643,http://steamcommunity.com/profiles/76561198093...


In [9]:
# Remove rows where 'review' is null.
df_reviews_final = df_reviews_final.dropna(subset=['review'])

# Check for null values in the DataFrame.
null_counts = df_reviews_final.isnull().sum()

# Display the count of null values per column.
null_counts

funny          51104
posted             0
last_edited    53107
item_id            0
helpful            0
recommend          0
review             0
user_id            0
user_url           0
dtype: int64

I removed the NaN reviews because I think a review should give you an idea about the game you want to buy.

In [10]:
# finding duplicates.
duplicates = df_reviews_final.loc[df_reviews_final.duplicated()]
duplicates

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
1112,,"Posted September 24, 2015.",,346110,1 of 1 people (100%) found this review helpful,True,yep,bokkkbokkk,http://steamcommunity.com/id/bokkkbokkk
2891,,"Posted January 10, 2014.",,218620,1 of 3 people (33%) found this review helpful,True,"Good graphics, fun heists! A bit laggy",ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2892,,"Posted January 10, 2014.",,105600,0 of 2 people (0%) found this review helpful,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2893,,"Posted December 17, 2014.",,570,No ratings yet,True,bobo pinoy,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2894,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
...,...,...,...,...,...,...,...,...,...
44433,,Posted July 3.,,422400,No ratings yet,True,Muy entretenido y una coleccion de armas prome...,76561198092022514,http://steamcommunity.com/profiles/76561198092...
44434,,Posted June 1.,,218620,No ratings yet,True,"Tiene una jugabilidad y tematica muy buena :D,...",76561198092022514,http://steamcommunity.com/profiles/76561198092...
44435,,"Posted August 17, 2014.",,261820,No ratings yet,True,"Buen juego, no importa el desarrrollo que tien...",76561198092022514,http://steamcommunity.com/profiles/76561198092...
44436,,"Posted February 17, 2014.",,224260,No ratings yet,True,exelente aporte :D¡¡¡ es una buen mod basado e...,76561198092022514,http://steamcommunity.com/profiles/76561198092...


In [11]:
# searching a duplicate result considering the user_id and item_id
user_id_value = 'ImSeriouss'
item_id_value = 211820

# Filter the DataFrame for records with specific user_id and item_id
filtered_records = df_reviews_final[(df_reviews_final['user_id'] == user_id_value) & (df_reviews_final['item_id'] == item_id_value)]

# Display the resulting DataFrame
filtered_records

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
2888,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss
2894,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...,ImSeriouss,http://steamcommunity.com/id/ImSeriouss


In [12]:
# deleting duplicate values.
# Get the total number of rows before deduplication.
total_rows_before = len(df_reviews_final)

# Remove duplicate rows.
df_reviews_final = df_reviews_final.drop_duplicates(keep='first')

# Get the total number of rows after deduplication.
total_rows_after = len(df_reviews_final)

# Calculate the number of rows removed.
rows_removed = total_rows_before - total_rows_after

# Print the information.
print(f'Total rows before: {total_rows_before}')
print(f'Total rows after: {total_rows_after}')
print(f'Rows removed: {rows_removed}')

Total rows before: 59246
Total rows after: 58372
Rows removed: 874


I wanted to be sure that the drop_duplicates function deleted the right amount of duplicate values.
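
This kind of check can be sketched on a toy DataFrame (the values below are hypothetical): the number of rows flagged by `duplicated()` should equal the number of rows that `drop_duplicates(keep='first')` removes.

```python
import pandas as pd

# Hypothetical toy data: the second row is an exact copy of the first.
df_toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2'],
    'item_id': [10, 10, 20, 30],
    'review': ['fun', 'fun', 'ok', 'ok'],
})

# duplicated() marks every repeat of an earlier, fully identical row.
expected_removed = int(df_toy.duplicated().sum())

# drop_duplicates(keep='first') keeps the first occurrence of each row.
df_dedup = df_toy.drop_duplicates(keep='first')
rows_removed = len(df_toy) - len(df_dedup)

print(f'Flagged as duplicates: {expected_removed}')
print(f'Rows removed: {rows_removed}')
```

Rows that share only some columns (such as the two 'ok' reviews from u2 on different items) are not treated as duplicates, because the comparison uses all columns by default.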

<div style="text-align: justify">

### 4. Deleting cells with Whitespace and Empty Strings

Here I find and remove the rows in which the review column is empty because, like the NaN reviews, they provide no useful information for deciding whether to buy a game.

</div>

In [3]:
# Display the data types of the 'review' column.
print("Data types in 'review' column:")
print(df_reviews_final['review'].apply(type).value_counts())

Data types in 'review' column:
review
<class 'str'>      58364
<class 'float'>        8
Name: count, dtype: int64


I found 2 different data types in the review column. This made me wonder what kind of information the float reviews contained.

In [5]:
# Identify rows where 'review' has a data type of float.
float_review_rows = df_reviews_final[df_reviews_final['review'].apply(lambda x: isinstance(x, float))]

# Display the rows where 'review' is of type float.
print(float_review_rows[['user_id', 'item_id', 'review']])

                 user_id  item_id review
614    76561198070263209      570    NaN
914             Azrafael   215530    NaN
9242         BomberThink   233840    NaN
9243         BomberThink   211820    NaN
22398             rpsntc   208090    NaN
28002  76561198040016388     8500    NaN
28347        SILENTLIGHT    40800    NaN
54437           inconi70       10    NaN


I filtered the float type, and those reviews were all NaN. This led me to think that these reviews might contain only spaces or empty strings, as they were not deleted when I applied dropna to remove the nulls.

In [6]:
# Identify rows where 'review' is NaN using isna().
nan_review_rows = df_reviews_final[df_reviews_final['review'].isna()]

# Display the rows where 'review' is NaN.
print("Rows where 'review' is NaN:")
print(nan_review_rows[['user_id', 'item_id', 'review']])

Rows where 'review' is NaN:
                 user_id  item_id review
614    76561198070263209      570    NaN
914             Azrafael   215530    NaN
9242         BomberThink   233840    NaN
9243         BomberThink   211820    NaN
22398             rpsntc   208090    NaN
28002  76561198040016388     8500    NaN
28347        SILENTLIGHT    40800    NaN
54437           inconi70       10    NaN


In [11]:
# Convert missing values, empty strings, and whitespace-only strings to NaN.
df_reviews_final['review'] = df_reviews_final['review'].apply(lambda x: np.nan if pd.isna(x) or (isinstance(x, str) and x.strip() == '') else x)

# Drop rows where 'review' is NaN.
df_reviews_final = df_reviews_final.dropna(subset=['review'])

Note that isnull is an alias of **isna** in pandas, so both detect exactly the same values. These rows had not been removed by my earlier dropna, so I converted the remaining empty values to NaN and dropped those rows again.
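
A minimal sketch on a hypothetical toy Series of why empty or whitespace-only reviews survive a plain dropna, and of one possible way to remove them. It also shows that `isnull` and `isna` flag the same entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real review, an empty string,
# a whitespace-only string, and two missing values.
reviews = pd.Series(['great game', '', '   ', None, np.nan])

# isnull is an alias of isna, so both masks are identical.
same_mask = reviews.isnull().equals(reviews.isna())
print(f'isnull and isna agree: {same_mask}')

# dropna only removes the missing values; '' and '   ' are still there.
after_dropna = reviews.dropna()
print(f'Rows left after dropna: {len(after_dropna)}')

# Strip whitespace, turn empty strings into NaN, then drop.
cleaned = reviews.str.strip().replace('', np.nan).dropna()
print(cleaned.tolist())
```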

<div style="text-align: justify">

### 5. Creating the column year from the posted column

For the project, it was necessary to determine the year to calculate the most played genre. Additionally, this information is crucial for identifying the user who has accumulated the most hours for a specific genre.

</div>

In [28]:
# Extract unique date patterns using regular expressions.
date_patterns = df_reviews_final['posted'].str.extract(r'(\b\w+ \d{1,2},? \d{4}\b|\b\w+ \d{1,2}\b)')[0].unique()

# Print the unique date patterns.
for pattern in date_patterns:
    print(f"{pattern}  ", end="--")

November 5, 2011  --July 15, 2011  --April 21, 2011  --June 24, 2014  --September 8, 2013  --November 29, 2013  --February 3  --December 4, 2015  --November 3, 2014  --October 15, 2014  --October 14, 2013  --July 28, 2012  --June 2, 2012  --June 29, 2014  --November 22, 2012  --February 23, 2012  --April 15, 2014  --December 23, 2013  --March 14, 2014  --July 11, 2013  --May 5, 2014  --December 24, 2012  --October 21, 2012  --March 20, 2012  --March 9, 2012  --May 20  --July 24  --February 1, 2015  --June 20, 2014  --June 16  --June 11  --August 25, 2014  --December 25, 2013  --June 23, 2012  --September 5, 2015  --March 30, 2015  --February 19, 2014  --July 14, 2014  --April 27, 2013  --July 20, 2015  --November 4, 2013  --July 12, 2013  --August 19, 2012  --June 19, 2015  --September 20, 2014  --September 7, 2014  --December 19, 2014  --February 17, 2015  --June 7, 2014  --February 12, 2014  --February 9, 2014  --October 31, 2015  --February 27  --February 4, 2015  --August 23  --Apr

I found 2 date formats in the **posted** column: one with the year (Month D, YYYY) and the other without it (Month D).
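
A quick way to tell the two formats apart is to test for a 4-digit number. The sketch below uses three sample strings taken from the outputs in this notebook; the variable names are only illustrative.

```python
import pandas as pd

# Sample values in both formats, copied from the posted column.
posted = pd.Series([
    'Posted November 5, 2011.',
    'Posted February 3.',
    'Posted December 4, 2015.',
])

# True when the string contains a standalone 4-digit number (the year).
has_year = posted.str.contains(r'\b\d{4}\b')

print(has_year.value_counts())
```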

In [53]:
# Extract the year from the posted column into a new year column using a regular expression.
df_reviews_final['year'] = df_reviews_final['posted'].str.extract(r'\b(\d{4})\b', expand=False)

# Handle cases where the year couldn't be extracted.
df_reviews_final['year'] = df_reviews_final['year'].fillna('Not Available')

# Checking the result of both conditions.
df_reviews_final.iloc[6:8]

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url,year
6,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,evcentric,http://steamcommunity.com/id/evcentric,Not Available
7,,"Posted December 4, 2015.","Last edited December 5, 2015.",370360,No ratings yet,True,"""Run for fun? What the hell kind of fun is that?""",evcentric,http://steamcommunity.com/id/evcentric,2015


In [49]:
# Count occurrences of each value in the 'year' column
year_counts = df_reviews_final['year'].value_counts()

# Display the count specifically for 'Not Available'
not_available_count = year_counts.get('Not Available', 0)
print(f"Count for 'Not Available': {not_available_count}")

Count for 'Not Available': 9902


I checked the results, and I found 9,902 dates with a missing year.

In [54]:
# Remove rows where 'year' is equal to 'Not Available' and reset the index
df_reviews_final = df_reviews_final[df_reviews_final['year'] != 'Not Available'].reset_index(drop=True)

In [55]:
# checking the result.
df_reviews_final.iloc[6:8]

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url,year
6,,"Posted December 4, 2015.","Last edited December 5, 2015.",370360,No ratings yet,True,"""Run for fun? What the hell kind of fun is that?""",evcentric,http://steamcommunity.com/id/evcentric,2015
7,,"Posted November 3, 2014.",,237930,No ratings yet,True,"Elegant integration of gameplay, story, world ...",evcentric,http://steamcommunity.com/id/evcentric,2014


I decided to delete the rows with a missing year because this parameter is crucial for the entire project.

<div style="text-align: justify">

### 6. Deleting columns

I considered that the columns **funny**, **last_edited**, **posted**, **reviews**, **helpful**, and **user_url** were not necessary for the project.

</div>

In [57]:
# Deleting columns
columns_to_drop = ['funny', 'last_edited', 'posted', 'reviews', 'helpful', 'user_url']

# Drop the specified columns
df_reviews_final = df_reviews_final.drop(columns=columns_to_drop, errors='ignore')
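
A small sketch of what `errors='ignore'` does in the cell above, using a hypothetical one-row DataFrame: names in the list that are not columns of the DataFrame are skipped instead of raising a KeyError.

```python
import pandas as pd

# Hypothetical one-row DataFrame without a 'reviews' column.
df_toy = pd.DataFrame({
    'funny': [''],
    'posted': ['Posted May 19, 2014.'],
    'review': ['ok'],
    'year': ['2014'],
})

# 'reviews' is not a column here; errors='ignore' skips it silently.
df_small = df_toy.drop(columns=['funny', 'posted', 'reviews'], errors='ignore')

print(list(df_small.columns))
```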

In [58]:
# Checking the final version of the file before overwriting it.
df_reviews_final

Unnamed: 0,item_id,recommend,review,user_id,year
0,1250,True,Simple yet with great replayability. In my opi...,76561197970982479,2011
1,22200,True,It's unique and worth a playthrough.,76561197970982479,2011
2,43110,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479,2011
3,251610,True,I know what you think when you see this title ...,js41637,2014
4,227300,True,For a simple (it's actually not all that simpl...,js41637,2013
...,...,...,...,...,...
48457,730,True,Neat,76561198239215706,2015
48458,730,True,its FUNNNNNNNN,wayfeng,2015
48459,253980,True,Awesome fantasy game if you don't mind the gra...,76561198251004808,2015
48460,730,True,Prettyy Mad Game,72947282842,2015


<div style="text-align: justify">

### 7. It is time to overwrite it

</div>

In [59]:
# Overwrite the original CSV file.
df_reviews_final.to_csv('csv/user_reviews.csv', index=False)