<center>

# Users Items ETL

</center>

In [2]:
# Imports.
import os
import ast
import pandas as pd

<div style="text-align: justify">

### 1. From JSON to CSV

I transformed the JSON file into a CSV file so I could read it as a DataFrame. On first inspection, I saw that the **items** column was nested, so I had to unnest it. I deleted the initial version, updated the code, and then created a new file to save space, because the file size was 560 MB and the free limit for GitHub LFS is 1 GB.

</div>

In [2]:
# For users_items.csv.
# File paths.
users_items = 'PI MLOps - STEAM/users_items.json'
users_items_csv = 'csv/users_items.csv'

# If the file does not exist, create the file.
if not os.path.exists(users_items_csv):
    # Read the JSON data file; each line holds one record.
    items = []
    with open(users_items, encoding='utf-8') as f:
        for line in f:
            # Parse the line as a Python literal.
            items.append(ast.literal_eval(line))

    # Flatten the records into a DataFrame, one row per entry in the 'items' column.
    df_items = pd.json_normalize(items, record_path=['items'], meta=['user_id', 'items_count', 'steam_id', 'user_url'])
            
    # Save the dataframe as a CSV file in the csv folder.
    df_items.to_csv(users_items_csv, index=False)
    print(f'The file {users_items_csv} was successfully created.')
else:
    print(f'The file {users_items_csv} already exists.')

The file csv/users_items.csv was successfully created.


In [8]:
# Reading the csv file.
df_items = pd.read_csv('csv/users_items.csv')

In [9]:
# First look at the file.
df_items.head(1)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."


The **items** column was nested: each cell held a list of dictionaries, one dictionary per game.

In [3]:
# Reading the csv after the update.
df_items_unnested = pd.read_csv('csv/users_items.csv')

In [4]:
# Checking the columns and file content after the update.
df_items_unnested.head(3)

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id,items_count,steam_id,user_url
0,10,Counter-Strike,6,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
1,20,Team Fortress Classic,0,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
2,30,Day of Defeat,7,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...


<div style="text-align: justify">

After normalization, the column **items** was unnested, and every game inside the **item_name** column was aligned with the correct user in the **user_id** column.

</div>
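
As a minimal sketch of what the normalization does, the toy records below (invented for illustration, not taken from the dataset) are flattened with the same `record_path` and `meta` arguments used in the conversion cell. The user names, URLs, and game names are assumptions.

```python
import pandas as pd

# Two made-up user records, shaped like the lines of users_items.json.
toy_records = [
    {
        'user_id': 'user_a',
        'items_count': 2,
        'steam_id': '1',
        'user_url': 'http://example.com/a',
        'items': [
            {'item_id': '10', 'item_name': 'Game X', 'playtime_forever': 6, 'playtime_2weeks': 0},
            {'item_id': '20', 'item_name': 'Game Y', 'playtime_forever': 0, 'playtime_2weeks': 0},
        ],
    },
    {
        'user_id': 'user_b',
        'items_count': 1,
        'steam_id': '2',
        'user_url': 'http://example.com/b',
        'items': [
            {'item_id': '10', 'item_name': 'Game X', 'playtime_forever': 42, 'playtime_2weeks': 5},
        ],
    },
]

# One output row per entry in 'items'; the meta fields are repeated on each row.
toy_df = pd.json_normalize(
    toy_records,
    record_path=['items'],
    meta=['user_id', 'items_count', 'steam_id', 'user_url'],
)
print(toy_df[['item_name', 'user_id']])
```

Each game ends up on its own row, and the user-level fields are copied alongside it, which is why the unnested file has many more rows than the original.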

<div style="text-align: justify">

### 2. Checking NaN, None, and Duplicate Items

I used two methods to find out whether the file had NaN/None data, and both came back negative. After applying the **loc** method together with **duplicated**, I found that the file had 59,104 duplicate rows.

</div>

In [5]:
# Check whether there is any null data.
null_data = df_items_unnested.isnull().sum()
null_data

item_id             0
item_name           0
playtime_forever    0
playtime_2weeks     0
user_id             0
items_count         0
steam_id            0
user_url            0
dtype: int64

I used the **isnull().sum()** method and could not find any null data.

In [7]:
# Second method to confirm if the file has any null data.
print(df_items_unnested.isna().any())

item_id             False
item_name           False
playtime_forever    False
playtime_2weeks     False
user_id             False
items_count         False
steam_id            False
user_url            False
dtype: bool


I ran a second variant, **isna().any()**, to confirm the initial result.
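
To illustrate how the two checks behave, here is a small sketch on a made-up frame (the column values are assumptions, not data from the file) that contains one NaN and one None.

```python
import pandas as pd

# Made-up frame with one NaN and one None, for illustration only.
toy = pd.DataFrame({
    'playtime_forever': [6.0, float('nan'), 7.0],
    'item_name': ['Game X', 'Game Y', None],
})

# isnull() and isna() are aliases, so both calls flag the same cells.
null_counts = toy.isnull().sum()   # number of missing values per column
has_null = toy.isna().any()        # True for each column with at least one missing value

print(null_counts)
print(has_null)
```

Both NaN and None count as missing, so either call is enough to detect them; the second one only presents the result as a yes/no per column.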

In [8]:
# Finding duplicates.
duplicates = df_items_unnested.loc[df_items_unnested.duplicated()]
duplicates

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id,items_count,steam_id,user_url
164294,20,Team Fortress Classic,5,0,Nikiad,109,76561198084006094,http://steamcommunity.com/id/Nikiad
164295,50,Half-Life: Opposing Force,0,0,Nikiad,109,76561198084006094,http://steamcommunity.com/id/Nikiad
164296,70,Half-Life,0,0,Nikiad,109,76561198084006094,http://steamcommunity.com/id/Nikiad
164297,130,Half-Life: Blue Shift,0,0,Nikiad,109,76561198084006094,http://steamcommunity.com/id/Nikiad
164298,220,Half-Life 2,198,0,Nikiad,109,76561198084006094,http://steamcommunity.com/id/Nikiad
...,...,...,...,...,...,...,...,...
4898223,213670,South Park™: The Stick of Truth™,725,0,76561198080057659,39,76561198080057659,http://steamcommunity.com/profiles/76561198080...
4898224,221910,The Stanley Parable,53,0,76561198080057659,39,76561198080057659,http://steamcommunity.com/profiles/76561198080...
4898225,261030,The Walking Dead: Season Two,253,0,76561198080057659,39,76561198080057659,http://steamcommunity.com/profiles/76561198080...
4898226,273110,Counter-Strike Nexon: Zombies,0,0,76561198080057659,39,76561198080057659,http://steamcommunity.com/profiles/76561198080...


After applying the **loc** method with the **duplicated** function, I found **59,104** duplicates.

In [9]:
# Total rows before deduplication.
total_rows_before = len(df_items_unnested)

# Remove duplicate rows.
df_items_unnested = df_items_unnested.drop_duplicates(keep='first')

# Total rows after deduplication.
total_rows_after = len(df_items_unnested)

# Total rows removed.
rows_removed = total_rows_before - total_rows_after

# Print the information.
print(f'Total rows before: {total_rows_before}')
print(f'Total rows after: {total_rows_after}')
print(f'Rows removed: {rows_removed}')

Total rows before: 5153209
Total rows after: 5094105
Rows removed: 59104


I wanted to verify that the **drop_duplicates** function removed exactly the number of duplicates found earlier, and it did: 59,104 rows.
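
The same verification can be sketched on a tiny made-up frame (the values are assumptions for illustration): the number of rows flagged by `duplicated()` should equal the number of rows that `drop_duplicates(keep='first')` removes.

```python
import pandas as pd

# Made-up rows; the third and fifth repeat the second.
toy = pd.DataFrame({
    'item_id': [10, 20, 20, 30, 20],
    'user_id': ['user_a', 'user_a', 'user_a', 'user_b', 'user_a'],
})

# duplicated() marks every repeat after the first occurrence (keep='first' is the default).
n_flagged = int(toy.duplicated().sum())

# drop_duplicates(keep='first') removes exactly those flagged rows.
deduped = toy.drop_duplicates(keep='first')
n_removed = len(toy) - len(deduped)

print(f'Flagged: {n_flagged}, removed: {n_removed}')
```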

<div style="text-align: justify">

### 3. Deleting Columns

I deleted some columns that I considered unnecessary for the project, which helped me reduce the size of the file. The columns I removed were **playtime_2weeks** (playtime over the last two weeks, which I considered redundant given **playtime_forever**), **steam_id** (which identifies the same user as **user_id**), **items_count** (the number of items/games per user), and **user_url** (the link to the user's profile).

</div>
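
Before dropping a column as redundant, one way to check how much it overlaps with another is sketched below. This is only an illustration on made-up values (the alias and the numeric IDs are assumptions), not a step from the original notebook.

```python
import pandas as pd

# Made-up rows: one user whose user_id equals the steam_id, one who uses an alias.
toy = pd.DataFrame({
    'user_id': ['76561197970982479', 'some_alias'],
    'steam_id': ['76561197970982479', '76561198000000000'],
    'playtime_forever': [6, 198],
    'playtime_2weeks': [0, 0],
})

# Share of rows where the two identifier columns hold the same text.
same_id_share = (toy['user_id'].astype(str) == toy['steam_id'].astype(str)).mean()

# True only if every user_id maps to exactly one steam_id.
one_to_one = bool((toy.groupby('user_id')['steam_id'].nunique() == 1).all())

print(f'Identical values: {same_id_share:.0%}, one steam_id per user_id: {one_to_one}')
```

If each **user_id** maps to a single **steam_id**, keeping only one of the two still identifies every user, even when the values themselves differ.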

In [10]:
# Deleting the columns.
df_items_unnested = df_items_unnested.drop(['playtime_2weeks', 'items_count', 'steam_id', 'user_url'], axis=1)

<div style="text-align: justify">

### 4. Checking the Playtime Forever Column

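For the project, it was not necessary to retain information about items/games with 0 minutes of playtime, since Steam does not allow you to write a review unless you have started/played the game. The cell below keeps only the rows with at least 300 minutes of playtime, a stricter cutoff that I adopted later to shrink the file further.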

</div>

In [15]:
# Keep only the rows where playtime_forever is at least 300 minutes.
df_items_unnested = df_items_unnested[df_items_unnested['playtime_forever'] >= 300]

In [16]:
# checking the result.
df_sorted = df_items_unnested.sort_values(by='playtime_forever')
df_sorted

Unnamed: 0,item_id,item_name,playtime_forever,user_id
2273913,218620,PAYDAY 2,300,76561198009403093
425878,304050,Trove,300,76561198093242240
2025969,202970,Call of Duty: Black Ops II,300,76561198049573978
162183,9450,"Warhammer 40,000: Dawn of War – Soulstorm",300,kaepora0gaebora
325591,99900,Spiral Knights,300,76561198067193543
...,...,...,...,...
1169053,72200,Universe Sandbox,600068,tsunamitad
959169,4000,Garry's Mod,613411,76561198039832932
2550730,42710,Call of Duty: Black Ops - Multiplayer,632452,76561198019826668
1581333,212200,Mabinogi,635295,Evilutional


After deleting the duplicates, the file had 5,094,105 rows. Removing the rows with 0 minutes of playtime left 3,246,375 rows, which means that 1,847,730 rows had 0 minutes.
Update: I had to reduce the size of this file further because the processing time for the charts was too high, so I raised the cutoff to 300 minutes, which leaves 1,374,010 rows.
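
The effect of a playtime cutoff on the row count can be sketched as follows. The frame is made up for illustration, and `MIN_PLAYTIME` is a hypothetical constant name; only the 300-minute value comes from the filter used in this section.

```python
import pandas as pd

# Made-up playtimes in minutes.
toy = pd.DataFrame({
    'item_name': ['Game X', 'Game Y', 'Game Z', 'Game W'],
    'playtime_forever': [0, 45, 300, 4733],
})

MIN_PLAYTIME = 300  # cutoff in minutes (hypothetical constant name)

rows_before = len(toy)
# Boolean mask: keep rows at or above the cutoff.
kept = toy[toy['playtime_forever'] >= MIN_PLAYTIME]
rows_removed = rows_before - len(kept)

print(f'Rows before: {rows_before}, kept: {len(kept)}, removed: {rows_removed}')
```

Keeping the cutoff in one named constant makes it easier to keep the comment, the filter, and the written explanation in agreement when the threshold changes.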

In [20]:
df_items_unnested.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1374010 entries, 0 to 1374009
Data columns (total 4 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   item_id           1374010 non-null  int64 
 1   item_name         1374010 non-null  object
 2   playtime_forever  1374010 non-null  int64 
 3   user_id           1374010 non-null  object
dtypes: int64(2), object(2)
memory usage: 41.9+ MB


In [18]:
# resetting the index of the file.
df_items_unnested.reset_index(drop=True, inplace=True)
df_items_unnested

Unnamed: 0,item_id,item_name,playtime_forever,user_id
0,300,Day of Defeat: Source,4733,76561197970982479
1,240,Counter-Strike: Source,1853,76561197970982479
2,3830,Psychonauts,333,76561197970982479
3,3900,Sid Meier's Civilization IV,338,76561197970982479
4,6910,Deus Ex: Game of the Year Edition,2685,76561197970982479
...,...,...,...,...
1374005,370240,NBA 2K16,1533,76561198319916652
1374006,346330,BrainBread 2,756,76561198320038728
1374007,730,Counter-Strike: Global Offensive,4557,ArkPlays7
1374008,346110,ARK: Survival Evolved,623,ArkPlays7


I reset the index so that the rows are numbered consecutively again after the filtering, which makes them easier to reference later.

<div style="text-align: justify">

### 5. Overwriting the CSV File

The file size was reduced to just 137 MB.

</div>

In [19]:
# Overwrite the original CSV file.
df_items_unnested.to_csv('csv/users_items.csv', index=False)
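
A quick way to confirm the size of a written CSV and that it reads back intact is sketched below. It writes a made-up frame to a temporary directory, so the path and file name are assumptions and the real `csv/users_items.csv` is not touched.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({'item_id': [10, 20], 'playtime_forever': [300, 4733]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_items.csv')  # hypothetical file name
    toy.to_csv(csv_path, index=False)

    # File size on disk, in decimal megabytes.
    size_mb = os.path.getsize(csv_path) / 1_000_000

    # Read the file back to confirm nothing was lost.
    round_trip = pd.read_csv(csv_path)

print(f'File size: {size_mb:.6f} MB')
```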