# Removing Duplicates

A superficial analysis of the Steam Games dataset suggests there are no duplicates in the data. On closer inspection, however, some records refer to the same title. In this notebook we'll show how we identified and addressed this issue.

In [1]:
#Import packages
import pandas as pd
import json

import requests
from bs4 import BeautifulSoup

In [2]:
#Load data
INPUT_PATH = "/kaggle/input/steam-games-dataset/"
df = pd.read_json(INPUT_PATH + "games.json", orient="index")
df.index.rename("app_id", inplace=True)

In [3]:
#Check for duplicate app_id values
df.index.value_counts()

app_id
20200      1
2016110    1
2193610    1
1997040    1
1965920    1
          ..
380        1
1475740    1
1021680    1
816250     1
3054200    1
Name: count, Length: 97410, dtype: int64

From the cell above, it's clear that no duplicates exist from the `app_id` standpoint. To identify potential duplicates, we'll build a combined key by concatenating the `name`, `release_date` and `detailed_description` fields. These columns were chosen because it seems very unlikely that two completely different titles would share the same values for all three fields.

In [4]:
#Create a combined key for duplicates identification
combined_key_fields = ["name", "release_date", "detailed_description"]
df["combined_key"] = [hash('-'.join(row)) for row in df[combined_key_fields].values]
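
One caveat about the cell above: Python salts the built-in `hash()` of strings per interpreter process unless `PYTHONHASHSEED` is fixed, so the integer keys change between sessions and a key value hard-coded from an earlier run may no longer match any row. The sketch below shows one possible alternative, a key built from a SHA-256 digest, which is stable across runs. The `stable_key` helper and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import hashlib

import pandas as pd


def stable_key(values, sep="-"):
    """Return a SHA-256 hex digest of the joined field values (hypothetical helper)."""
    joined = sep.join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {
        "name": ["Game A", "Game A", "Game B"],
        "release_date": ["Jan 1, 2020", "Jan 1, 2020", "Feb 2, 2021"],
        "detailed_description": ["desc", "desc", "other"],
    },
    index=pd.Index([10, 11, 12], name="app_id"),
)

fields = ["name", "release_date", "detailed_description"]
# Same construction as the cell above, with the digest replacing hash()
toy["combined_key"] = [stable_key(row) for row in toy[fields].values]
```

With a digest like this, the key is a hexadecimal string rather than an integer, so any later `query` on it would compare against a quoted value.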

In [5]:
#Identify potential duplicates
duplicates = df["combined_key"].value_counts()
duplicates = duplicates[duplicates > 1]
duplicates

combined_key
 4469120978521663240    20
 6682565971528231394     5
 3290368993772720387     5
-4601373082316208102     4
-685261494665709531      4
                        ..
-3600134180623906285     2
-783766612199636911      2
 8494472239269789341     2
-8619287524340747877     2
 898114009519406340      2
Name: count, Length: 71, dtype: int64
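
As a cross-check that does not rely on a hash at all, the same groups can be flagged with pandas' own `DataFrame.duplicated`. This is a minimal sketch on a made-up toy frame; only the three field names are taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {
        "name": ["Game A", "Game A", "Game B", "Game C"],
        "release_date": ["Jan 1, 2020", "Jan 1, 2020", "Feb 2, 2021", "Mar 3, 2022"],
        "detailed_description": ["desc", "desc", "other", "third"],
    },
    index=pd.Index([10, 11, 12, 13], name="app_id"),
)

fields = ["name", "release_date", "detailed_description"]

# keep=False marks every member of a duplicated group, not only the repeats
mask = toy.duplicated(subset=fields, keep=False)

# Number of records in each group of potential duplicates
group_sizes = toy[mask].groupby(fields).size()
```

On the real data, `len(group_sizes)` would be expected to agree with the number of keys found above, assuming no hash collisions.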

As we can see, there are 71 combined keys for which titles with different `app_id` values share the same name, release date and detailed description. This, however, is not enough for us to be sure about their duplicate status. The next step is to check some samples. We'll look at the records associated with the following `combined_key` values:
* `8933640421187601788`
* `-3183998450418249162`

In [6]:
#Check sample 1 (combined_key = 8933640421187601788)
df.query("combined_key == 8933640421187601788")

app_id,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,...,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,peak_ccu,tags,combined_key


In [7]:
#Check sample 2 (combined_key = -3183998450418249162)
df.query("combined_key == -3183998450418249162")

app_id,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,...,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,peak_ccu,tags,combined_key


A quick glance at the selected cases does not help much. While many fields hold the same values, some do not. The next step is to check, across all potential duplicates, which fields stay the same and which change.

In [8]:
#Check fields with same values across all records of potential duplicates
duplicates_df = df.join(on="combined_key", other=duplicates, how="inner").sort_values(by="header_image")

for col in duplicates_df.columns:
    duplicates_df[col] = [str(val) for val in duplicates_df[col]]

duplicates_agg = duplicates_df.groupby(by="header_image").agg(
    {col: "nunique" for col in duplicates_df.columns}
)

duplicate_fields = []
for col in duplicates_agg.columns:
    duplicate_fields.append([col, sum(duplicates_agg[col] > 1)])

duplicate_fields = pd.DataFrame(duplicate_fields, columns=["field_name", "instances_with_distinct_values"])
duplicate_fields.sort_values(by="instances_with_distinct_values", ascending=False)

,field_name,instances_with_distinct_values
33,negative,67
32,positive,67
40,tags,66
37,median_playtime_forever,66
35,average_playtime_forever,66
39,peak_ccu,62
34,estimated_owners,51
19,recommendations,30
38,median_playtime_2weeks,22
36,average_playtime_2weeks,22
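
The per-field comparison above can also be written more compactly with `groupby(...).nunique()`. The sketch below is only an illustration on a made-up toy frame, and it groups by `combined_key` rather than `header_image` (an assumption made here for simplicity). On the real data, list- or dict-valued columns would still need the string conversion used in the cell above.

```python
import pandas as pd

# Toy frame: two groups of potential duplicates (values are made up)
toy = pd.DataFrame(
    {
        "combined_key": ["k1", "k1", "k2", "k2", "k2"],
        "price": [9.99, 9.99, 4.99, 4.99, 4.99],
        "positive": [100, 7, 50, 50, 3],
        "tags": ["rpg", "rpg", "indie", "puzzle", "indie"],
    }
)

# Number of distinct values each field takes within each group
distinct_per_group = toy.groupby("combined_key").nunique()

# For each field, the number of groups in which it takes more than one value
groups_with_differences = (distinct_per_group > 1).sum().sort_values(ascending=False)
```

Here `positive` differs in both groups, `tags` in one, and `price` in none, which mirrors the shape of the table above.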


In a similar way to the analysis of the samples, we see that most fields hold the same values for the same `combined_key`. But, once again, some fields frequently have several different values for the same key. We need to go one step further.

Here, we are going to leave this notebook for a while and go to the source of the data: Steam. Browsing its website, we can see that store page URLs follow the same simple structure: `https://store.steampowered.com/app/{app_id}`. This means we can use the `app_id` values in our data to check whether these records are indeed duplicates. Let's do this for the samples we explored earlier:

In [9]:
#Create field to hold the URL that directs to the title
base_url = "<a target='_blank' href='https://store.steampowered.com/app/{0}'>https://store.steampowered.com/app/{0}</a>"
duplicates_df["app_url"] = [base_url.format(app_id) for app_id in duplicates_df.index]

In [10]:
#Check URLs for sample 1 (combined_key = 8933640421187601788)
duplicates_df.query("combined_key == '8933640421187601788'")["app_url"].values

array([], dtype=object)

In [11]:
#Check URLs for sample 2 (combined_key = -3183998450418249162)
duplicates_df.query("combined_key == '-3183998450418249162'")["app_url"].values

array([], dtype=object)

If you click on the links for both samples, you will probably see that, within each sample, the final URL is always the same. This is one more strong indicator of the duplicate nature of these records.
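
The manual check described above could also be scripted. The sketch below is one possible way to do it, under the assumption that the final URL contains `/app/<digits>` (optionally followed by more path segments). The helper names are hypothetical, and `fetch_final_url` needs network access, so it is defined here but not called.

```python
import re

APP_URL = "https://store.steampowered.com/app/{}"


def fetch_final_url(app_id, timeout=30):
    """Follow redirects for one app page and return the URL it ends on."""
    import requests  # imported lazily so the parsing helpers work offline

    response = requests.get(APP_URL.format(app_id), allow_redirects=True, timeout=timeout)
    return response.url


def app_id_from_url(url):
    """Extract the numeric app identifier from a store URL, or None if absent."""
    match = re.search(r"/app/(\d+)", url)
    return int(match.group(1)) if match else None


def resolves_to_single_app(final_urls):
    """True when every URL in the group points at the same app identifier."""
    ids = {app_id_from_url(url) for url in final_urls}
    return len(ids) == 1 and None not in ids
```

For one group of potential duplicates, `resolves_to_single_app([fetch_final_url(i) for i in group_ids])` would then report whether all its records land on the same store page, where `group_ids` stands for the `app_id` values of that group.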

Only one alternative explanation remains: these apparent duplicates could be different bundles or packages of the same game, listed as separate titles. While this does not look likely at this point, it could well be the case, so let's check it now.

In [12]:
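#Define function to count the "subs" entries in the first package of a "packages" value
def get_number_of_packages(val):
    #Strip backslashes and possessive apostrophes, then swap quotes so the string parses as JSON
    formatted_val = val.replace("\\", "").replace("'s ", "s ").replace("'", '"')
    packages = json.loads(formatted_val)
    try:
        packages = packages[0]
        return len(packages["subs"])
    except (IndexError, KeyError, TypeError):
        #No packages listed, or the first package has no "subs" entry
        return 0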
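
Since the `packages` values were converted with `str()` earlier in the notebook, they should be Python-literal strings rather than JSON. Under that assumption, `ast.literal_eval` can parse them without the quote-replacement step, which may mangle titles that contain apostrophes. The `count_subs` helper below is a hypothetical alternative sketch, not the notebook's own function.

```python
import ast


def count_subs(raw):
    """Count the "subs" entries of the first package, or 0 if none can be read."""
    try:
        # Parse the Python-literal string (assumed format: a list of dicts)
        packages = ast.literal_eval(raw)
        return len(packages[0]["subs"])
    except (ValueError, SyntaxError, IndexError, KeyError, TypeError):
        # Unparseable value, empty list, or first package without "subs"
        return 0
```

Note that, unlike the cell above, this sketch returns 0 for values that cannot be parsed instead of raising an error.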

In [13]:
#Compute the number of packages for each title and compare it against the number of records sharing the same combined_key
duplicates_df["number_of_packages"] = [get_number_of_packages(package) for package in duplicates_df["packages"]]

duplicates_df.groupby(by="combined_key").agg({
    "number_of_packages": ["max", "min", "count"],
})

,number_of_packages,number_of_packages,number_of_packages
combined_key,max,min,count
-110184952151705440,0,0,2
-2284640409385713502,1,1,3
-2945165861894451980,2,2,2
-3049741392856618877,1,1,2
-3089905008029630728,1,1,3
...,...,...,...
8494472239269789341,1,1,2
8762429487034468491,3,3,2
8912315394927462818,0,0,2
892950807402587546,5,5,2


The results above show no match between the number of packages available for a title and the number of duplicate records found: in the rows shown, the maximum and minimum package counts coincide within each key, and they generally differ from the record count. This was to be expected, since we are talking about the same app in the end. With this, we can conclude our assessment and confirm the existence of duplicate records in the dataset.

Now we can move on to the next step: removing those duplicates. To do this, we'll reuse the Steam URL logic presented earlier and keep only those records whose `app_id` matches the identifier in the URL the store redirects to.

In [14]:
#Remove duplicate records
import re

deduplicated_df = duplicates_df.copy()

#Follow redirects and record the URL each app_id finally resolves to
final_urls = [
    requests.get(url=f"https://store.steampowered.com/app/{app_id}", allow_redirects=True, timeout=30).url
    for app_id in deduplicated_df.index
]

#Extract the numeric identifier from each final URL, ignoring any trailing path;
#-1 marks URLs without an app identifier (assumed to mean the page is unavailable)
final_url_matches = [re.search(r"/app/(\d+)", url) for url in final_urls]
deduplicated_df["final_url_app_id"] = [int(m.group(1)) if m else -1 for m in final_url_matches]

deduplicated_df = deduplicated_df[deduplicated_df.index == deduplicated_df["final_url_app_id"]]

In [15]:
#Validate duplicates have been removed
deduplicated_df["combined_key"].value_counts()

combined_key
2553868700047025618     1
-8981574789490687166    1
-4078039156374915112    1
-6858067430236524913    1
-6965377987446051163    1
                       ..
6535279350457244099     1
-6797154105201435717    1
-3224544190165374964    1
6195112013312723679     1
-3568368896347377739    1
Name: count, Length: 71, dtype: int64

As we can see, the deduplicated dataset contains the same number of unique `combined_key` values, but each one now has a single record. With this, we can safely say the existing duplicates have been addressed.
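
Note that `deduplicated_df` only holds the records that were flagged as potential duplicates. To carry the result back to the full dataset, one option is to drop from `df` the flagged `app_id` values that did not survive the check. The sketch below illustrates this on toy stand-ins for the three frames. The values are made up, and `dropped_ids` and `clean_df` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins: `df` is the full dataset, `duplicates_df` the flagged records,
# `deduplicated_df` the flagged records that survived the redirect check
df = pd.DataFrame(
    {"name": ["A", "A", "B", "C"]},
    index=pd.Index([1, 2, 3, 4], name="app_id"),
)
duplicates_df = df.loc[[1, 2]]
deduplicated_df = df.loc[[1]]

# app_id values that were flagged as duplicates but not kept
dropped_ids = duplicates_df.index.difference(deduplicated_df.index)

# Full dataset without the redundant records
clean_df = df.drop(index=dropped_ids)
```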