# Merging scraped Search Results with scraped Listings

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

Import Python modules

In [2]:
import os
from glob import glob

import pandas as pd

Import any custom modules

In [3]:
%aimport src.utils
from src.utils import (
    show_df,  # Display first and last n rows of a DataFrame
    show_df_dtypes_nans,  # Show the missing values and column datatypes side-by-side
)

<a id="table-of-contents"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [Combine `selenium`-based files](#combine-`selenium`-based-files)
   - 2.1. [Load and Process all Listings files acquired using `selenium`](#load-and-process-all-listings-files-acquired-using-`selenium`)
   - 2.2. [Load and Process all Search Results files acquired using `selenium`](#load-and-process-all-search-results-files-acquired-using-`selenium`)
   - 2.3. [Merge with Price from Search Results dataset acquired using `selenium`](#merge-with-price-from-search-results-dataset-acquired-using-`selenium`)
3. [Combine `requests`-based files](#combine-`requests`-based-files)
   - 3.1. [Load and Process all Search Results files acquired using `requests`](#load-and-process-all-search-results-files-acquired-using-`requests`)
   - 3.2. [Load and Process all Listings files acquired using `requests`](#load-and-process-all-listings-files-acquired-using-`requests`)
   - 3.3. [Merge with Price from Search Results dataset acquired using `requests`](#merge-with-price-from-search-results-dataset-acquired-using-`requests`)
4. [Export combined dataset to disk](#export-combined-dataset-to-disk)

<a id="about"></a>

## 0. [About](#about)

In this notebook, we'll combine the search results and listings datasets scraped separately with `selenium` and `requests`. The combined dataset will then be processed (for example, to remove duplicate listings) and then exported to a single CSV file that can be used in further analysis.

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

Define variables that can be changed when running this notebook

In [4]:
PROJ_ROOT_DIR = os.getcwd()

In [5]:
proc_data_filename = "processed_data.csv"

In [6]:
# Path to data/raw
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")
# Path to data/processed
processed_data_dir = os.path.join(data_dir, "processed")

# Path to processed data file to be created as combination of datasets
# scraped with selenium and requests
processed_data_filepath = os.path.join(processed_data_dir, proc_data_filename)

# Path to data/raw/requests
requests_files_dir = os.path.join(raw_data_dir, "requests")

# Path to data/raw/selenium
selenium_files_dir = os.path.join(raw_data_dir, "selenium")

# List of listings CSV files created using requests
fpaths_requests = glob(os.path.join(requests_files_dir, "p*_*.csv"))

# List of listings CSV files created using selenium
fpaths_selenium = glob(os.path.join(selenium_files_dir, "p*_*.csv"))

# List of search results files created by Selenium
selenium_search_results_pages = glob(
    os.path.join(selenium_files_dir, "search_results_page_*_*.parquet.gzip")
)

# List of search results files created by requests
requests_search_results_pages = glob(
    os.path.join(requests_files_dir, "search_results_page_*_*.parquet.gzip")
)
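The glob patterns above can be sanity-checked without touching the disk. The sketch below is only an illustration: it relies on `glob` using `fnmatch`-style matching, and two of the three filenames are hypothetical (only `p50_l1_Blazing_Sails.csv` appears later in this notebook).

```python
from fnmatch import fnmatchcase

# Same patterns as passed to glob() above
LISTING_PATTERN = "p*_*.csv"
SEARCH_PATTERN = "search_results_page_*_*.parquet.gzip"

# The first name appears later in this notebook; the other two are hypothetical
names = [
    "p50_l1_Blazing_Sails.csv",
    "search_results_page_50_1.parquet.gzip",
    "notes.txt",
]

# Each pattern should pick up only its own kind of file
listing_files = [n for n in names if fnmatchcase(n, LISTING_PATTERN)]
search_files = [n for n in names if fnmatchcase(n, SEARCH_PATTERN)]

print(listing_files)
print(search_files)
```

Because the search results filenames start with `s` rather than `p`, the listings pattern does not accidentally pick them up even though both kinds of file live in the same directory.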

<a id="combine-`selenium`-based-files"></a>

## 2. [Combine `selenium`-based files](#combine-`selenium`-based-files)

<a id="load-and-process-all-listings-files-acquired-using-`selenium`"></a>

### 2.1. [Load and Process all Listings files acquired using `selenium`](#load-and-process-all-listings-files-acquired-using-`selenium`)

We'll start by loading the listings scraped using Selenium. As discussed in `4_filter_requests_listings.ipynb`, these listings can be merged with the search results acquired using Selenium on the `Title` column, so we'll drop any rows with a missing value in that column. Finally, we'll sort the dataset by page and listing number.

In [7]:
%%time
df_listings_sel = pd.concat(
    [
        pd.read_csv(fpath)
        for fpath in fpaths_selenium
    ],
    ignore_index=True,
).dropna(subset=["Title"]).sort_values(by=["page_num", "listing_num"]).reset_index(drop=True)
print(len(df_listings_sel))
show_df(df_listings_sel, 1)
show_df_dtypes_nans(df_listings_sel)

817


Unnamed: 0,review_type_all,overall_review_rating,pct_overall,pct_overall_threshold,pct_overall_lang,pct_overall_threshold_lang,platforms,user_defined_tags,num_steam_achievements,drm,rating,rating_descriptors,review_type_positive,review_type_negative,review_language_mine,Title,Genre,Release Date,Early Access Release Date,Developer,Publisher,Franchise,languages,num_languages,page_num,listing_num
0,1014.0,Very Positive,93.0,positive,92.0,positive,win,"Base Building, Strategy, Survival, Tower Defense, Hack and Slash, Action RPG, Crafting, Adventure, Simulation, RPG, Sci-fi, Resource Management, Exploration, Loot, Isometric, Building, Sandbox, Atmospheric, Action, Aliens",30.0,,,,938.0,76.0,412.0,The Riftbreaker,"Action, Adventure, Indie, RPG, Simulation, Strategy","14 Oct, 2021",,EXOR Studios,"EXOR Studios, Surefire.Games","EXOR Studios, surefiregames","English, French, German, Spanish - Spain, Japanese, Korean, Polish, Russian, Simplified Chinese, Portuguese - Brazil",10.0,2,1
816,3787.0,Very Positive,79.0,positive,76.0,positive,win,"Simulation, Casual, Indie, Singleplayer, Adventure, Life Sim, Nudity, Strategy, Open World, Sexual Content, First-Person, Action, Realistic, Funny, Hacking, Sandbox, Immersive Sim, Walking Simulator, Replay Value, Atmospheric",15.0,,,,2988.0,799.0,1010.0,Streamer Life Simulator,"Action, Adventure, Casual, Indie, Simulation, Strategy","21 Aug, 2020",,Cheesecake Dev,Cheesecake Dev,Cheesecake Dev,"English, French, German, Spanish - Spain, Simplified Chinese, Japanese, Polish, Portuguese, Russian, Turkish",10.0,50,23


Unnamed: 0,num_missing,dtype
review_type_all,13,float64
overall_review_rating,13,object
pct_overall,13,float64
pct_overall_threshold,13,object
pct_overall_lang,14,float64
pct_overall_threshold_lang,14,object
platforms,29,object
user_defined_tags,0,object
num_steam_achievements,183,float64
drm,551,object


CPU times: user 2.15 s, sys: 21.1 ms, total: 2.17 s
Wall time: 2.17 s


<a id="load-and-process-all-search-results-files-acquired-using-`selenium`"></a>

### 2.2. [Load and Process all Search Results files acquired using `selenium`](#load-and-process-all-search-results-files-acquired-using-`selenium`)

We'll now load the search results dataset and change the `page` (page number) column to an integer datatype. As with the listings dataset, we'll drop rows with a missing value in the title column (named `title` in this dataset) and sort the dataset.

In [8]:
%%time
df_search_results_sel = (
    pd.concat(
        [
            pd.read_parquet(
                selenium_search_results_page,
                engine="auto",
            )
            for selenium_search_results_page in selenium_search_results_pages
        ],
        ignore_index=True,
    )
    .astype({"page": int})
    .dropna(subset=["title"])
    .sort_values(by=["page", "listing_counter"])
    .reset_index(drop=True)
)
print(df_search_results_sel["title"].nunique(), len(df_search_results_sel))

883 1225
CPU times: user 203 ms, sys: 42.6 ms, total: 245 ms
Wall time: 225 ms


All listings with a duplicated `title` are shown below

In [9]:
display(
    df_search_results_sel[
        df_search_results_sel.duplicated(subset=["title"], keep=False)
    ].sort_values(by="title")
)

Unnamed: 0,page,listing_counter,title,url,platform_names,release_date,discount_pct,original_price,discount_price
351,16,2,A Dance of Fire and Ice,https://store.steampowered.com/app/977950/A Da...,"win,mac","24 Jan, 2019",,CDN$ 6.69,
329,15,5,A Dance of Fire and Ice,https://store.steampowered.com/app/977950/A Da...,"win,mac","24 Jan, 2019",,CDN$ 6.69,
586,25,12,A Hat in Time,https://store.steampowered.com/app/253230/A Ha...,"win,mac","5 Oct, 2017",,CDN$ 32.99,
718,30,19,A Hat in Time,https://store.steampowered.com/app/253230/A Ha...,"win,mac","5 Oct, 2017",,CDN$ 32.99,
149,7,25,A Total War Saga: TROY,https://store.steampowered.com/app/1099410/A T...,"win,mac","2 Sep, 2021",,CDN$ 44.99,
...,...,...,...,...,...,...,...,...,...
1097,45,23,eFootball PES 2021 SEASON UPDATE,https://store.steampowered.com/app/1259970/eFo...,win,"15 Sep, 2020",,CDN$ 42.99,
123,6,24,iRacing,https://store.steampowered.com/app/266410/iRac...,win,"12 Jan, 2015",,CDN$ 10.99,
125,7,1,iRacing,https://store.steampowered.com/app/266410/iRac...,win,"12 Jan, 2015",,CDN$ 10.99,
501,22,2,shapez.io,https://store.steampowered.com/app/1318690/sha...,"win,linux","7 Jun, 2020",-50%,CDN$ 11.99,CDN$ 5.99


**Observations**
1. As we can see, these duplicated `title`s are also duplicated in the `url` column, meaning each is the same listing that appeared on a different page number when the Selenium webdriver queried the Steam store search. We'll keep the first occurrence of each duplicated listing and drop the others.

Drop duplicates using the `title` and `url` columns

In [10]:
df_search_results_sel = df_search_results_sel.drop_duplicates(
    subset=["title", "url"]
).reset_index(drop=True)
print(df_search_results_sel["title"].nunique(), len(df_search_results_sel))

883 883
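The two duplicate-handling calls used above behave differently, which a toy example may make clearer. This is a minimal sketch on made-up rows (the shortened `url` values are placeholders, not real Steam URLs): `duplicated(keep=False)` flags every member of a duplicated group for inspection, while `drop_duplicates` with its default `keep="first"` retains one row per group.

```python
import pandas as pd

# Toy search results: the second row repeats the first on a different page
toy = pd.DataFrame(
    {
        "page": [15, 16, 22],
        "title": ["A Dance of Fire and Ice", "A Dance of Fire and Ice", "shapez.io"],
        "url": ["u/977950", "u/977950", "u/1318690"],  # placeholder URLs
    }
)

# keep=False marks all rows in a duplicated group, not just the repeats
flagged = toy[toy.duplicated(subset=["title"], keep=False)]

# Default keep="first" retains the first row of each (title, url) pair
deduped = toy.drop_duplicates(subset=["title", "url"]).reset_index(drop=True)

print(len(flagged), len(deduped))
```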


<a id="merge-with-price-from-search-results-dataset-acquired-using-`selenium`"></a>

### 2.3. [Merge with Price from Search Results dataset acquired using `selenium`](#merge-with-price-from-search-results-dataset-acquired-using-`selenium`)

We'll now proceed to merge the listings and search results scraped with Selenium.

We'll merge on the `title` column only, since the `url` was not captured when scraping the listing data, as seen from the list of columns in section [2.1 above](#load-and-process-all-listings-files-acquired-using-`selenium`).

In [11]:
%%time
dfm_sel = df_search_results_sel.merge(
    df_listings_sel,
    left_on=["title"],
    right_on=["Title"],
    how="left",
)
show_df(dfm_sel.drop(columns=["url", "user_defined_tags", "languages", "Genre"]), 1)
show_df_dtypes_nans(dfm_sel)

Unnamed: 0,page,listing_counter,title,platform_names,release_date,discount_pct,original_price,discount_price,review_type_all,overall_review_rating,pct_overall,pct_overall_threshold,pct_overall_lang,pct_overall_threshold_lang,platforms,num_steam_achievements,drm,rating,rating_descriptors,review_type_positive,review_type_negative,review_language_mine,Title,Release Date,Early Access Release Date,Developer,Publisher,Franchise,num_languages,page_num,listing_num
0,2,1,The Riftbreaker,win,"14 Oct, 2021",-10%,CDN$ 33.99,CDN$ 30.59,1014.0,Very Positive,93.0,positive,92.0,positive,win,30.0,,,,938.0,76.0,412.0,The Riftbreaker,"14 Oct, 2021",,EXOR Studios,"EXOR Studios, Surefire.Games","EXOR Studios, surefiregames",10.0,2.0,1.0
882,50,23,Streamer Life Simulator,win,"21 Aug, 2020",-45%,CDN$ 22.79,CDN$ 12.53,3787.0,Very Positive,79.0,positive,76.0,positive,win,15.0,,,,2988.0,799.0,1010.0,Streamer Life Simulator,"21 Aug, 2020",,Cheesecake Dev,Cheesecake Dev,Cheesecake Dev,10.0,50.0,23.0


Unnamed: 0,num_missing,dtype
page,0,int64
listing_counter,0,int64
title,0,object
url,0,object
platform_names,0,object
release_date,0,object
discount_pct,755,object
original_price,12,object
discount_price,755,object
review_type_all,82,float64


CPU times: user 20.2 ms, sys: 0 ns, total: 20.2 ms
Wall time: 18.1 ms
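A left merge keeps every search result, so rows without a matching listing show up only as missing values in the listing columns. One way to count them explicitly is the `indicator` option of `DataFrame.merge`. The sketch below uses made-up rows and is not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the search results (left) and listings (right)
search = pd.DataFrame(
    {
        "title": ["Game A", "Game B", "Game C"],
        "original_price": ["CDN$ 1.00", "CDN$ 2.00", None],
    }
)
listings = pd.DataFrame({"Title": ["Game A", "Game C"], "pct_overall": [93.0, 79.0]})

# indicator=True adds a _merge column recording where each row came from
merged = search.merge(
    listings, left_on="title", right_on="Title", how="left", indicator=True
)

# "left_only" rows are search results with no matching listing
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), n_unmatched)
```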


<a id="combine-`requests`-based-files"></a>

## 3. [Combine `requests`-based files](#combine-`requests`-based-files)

<a id="load-and-process-all-search-results-files-acquired-using-`requests`"></a>

### 3.1. [Load and Process all Search Results files acquired using `requests`](#load-and-process-all-search-results-files-acquired-using-`requests`)

We'll now load the search results dataset scraped with the `requests` library. We'll change the page number datatype and drop rows with a missing or blank value in the `title` column. We're also doing the following
- filtering out non-English search results (approximated by keeping only titles made up entirely of ASCII characters)
- adding an `app_id` column (which is extracted from the URL using a regular expression)
- (using similar logic to that for handling duplicates in the selenium dataset) dropping duplicates based on the `url`, `title` and `platform_names` columns
  - this choice was also discussed in more detail in section 3.1 of `4_filter_requests_listings.ipynb`

In [12]:
%%time
df_search_results_requests = (
    pd.concat(
        [
            pd.read_parquet(f, engine="auto")
            for f in requests_search_results_pages
        ],
        ignore_index=True,
    )
    .drop(columns=["request_status_code"])
    .astype({"page": int}).dropna(subset=["title"])
    .sort_values(by=["page", "listing_counter"])
    .reset_index(drop=True)
)
print(f"Total number of search results scraped with requests = {len(df_search_results_requests):,}")

# Keep only titles made up of ASCII characters (a proxy for English titles)
df_search_results_requests = df_search_results_requests[
    df_search_results_requests['title'].map(lambda x: x.isascii())
]

# Append an app_id column using a regex extract() from the URL
df_search_results_requests = df_search_results_requests.assign(
    app_id=df_search_results_requests["url"].str.extract(
        r"https://store.steampowered.com/app/(\d+)/*/"
    )
)
print(df_search_results_requests["title"].nunique())

# Remove listings with a blank string in the title column
df_search_results_requests = df_search_results_requests[
    df_search_results_requests["title"] != ""
]

# Use the url, platform_names and title columns to drop duplicated listings
df_search_results_requests = df_search_results_requests.drop_duplicates(
    subset=["url", "title", "platform_names"]
)
show_df(df_search_results_requests, 2)
show_df_dtypes_nans(df_search_results_requests)

Total number of search results scraped with requests = 54,566
45798


Unnamed: 0,page,listing_counter,title,url,platform_names,release_date,discount_pct,original_price,discount_price,app_id
0,50,1,Blazing_Sails,https://store.steampowered.com/app/1158940/Blazing_Sails/,win,"Nov 5, 2020",-30%,14.99,10.49,1158940
1,50,2,LEGO_Harry_Potter_Years_14,https://store.steampowered.com/app/21130/LEGO_Harry_Potter_Years_14/,win,"Jun 25, 2010",,19.99,,21130
54560,2232,11,Animal_Shelter_Prologue,https://store.steampowered.com/app/1661260/Animal_Shelter_Prologue/,win,2021- Add to Wishlist!,,Free,,1661260
54561,2232,12,Dynasty_of_the_Sands,https://store.steampowered.com/app/1143070/Dynasty_of_the_Sands/,win,TBA,,,,1143070


Unnamed: 0,num_missing,dtype
page,0,int64
listing_counter,0,int64
title,0,object
url,0,object
platform_names,0,object
release_date,0,object
discount_pct,44725,object
original_price,5295,object
discount_price,44725,object
app_id,162,object


CPU times: user 6.6 s, sys: 827 ms, total: 7.43 s
Wall time: 3.96 s
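The two string operations in the cell above can be tried on single values with the standard library alone. This is a hedged sketch: the pattern is a simplified variant of the notebook's regular expression with the dots escaped, and the bundle URL and the accented title are hypothetical inputs chosen to show the edge cases.

```python
import re
from typing import Optional

# Simplified variant of the notebook's pattern (an assumption, dots escaped)
APP_ID_RE = re.compile(r"https://store\.steampowered\.com/app/(\d+)/")


def extract_app_id(url: str) -> Optional[str]:
    """Return the numeric id following /app/, or None if the URL has none."""
    match = APP_ID_RE.match(url)
    return match.group(1) if match else None


# URL taken from the output above
print(extract_app_id("https://store.steampowered.com/app/1158940/Blazing_Sails/"))
# Hypothetical URL that does not follow the /app/<id>/ layout
print(extract_app_id("https://store.steampowered.com/bundle/123/Example/"))

# str.isascii() is only a proxy for "English": one accented character fails it
print("Blazing_Sails".isascii())
print("Café Simulator".isascii())  # hypothetical title
```

URLs that do not follow the `/app/<id>/` layout produce no match, which would surface as a missing `app_id`. The ASCII check can also misclassify in both directions: an English title with an accented character is dropped, while a non-English title written in plain ASCII is kept.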


Verify no duplicated URLs in this search results dataset

In [13]:
assert df_search_results_requests["url"].nunique() == len(df_search_results_requests)

<a id="load-and-process-all-listings-files-acquired-using-`requests`"></a>

### 3.2. [Load and Process all Listings files acquired using `requests`](#load-and-process-all-listings-files-acquired-using-`requests`)

We'll now load the listings scraped with `requests`. Here, we flag whether each listing supports English (the `languages` column is replaced by a boolean; no rows are removed in this step) and add an `app_id` column.

In [14]:
%%time
df_listings = pd.concat(
    [
        pd.read_csv(fpath).assign(filename=os.path.basename(fpath))
        for fpath in fpaths_requests
    ],
    ignore_index=True,
).sort_values(by=["page_num", "listing_num"])

# Replace the languages column with a boolean flag for English support
# (note that this flags rows; it does not filter any out)
df_listings["languages"] = df_listings["languages"].str.contains("English")

# Append an app_id column by splitting the URL on "app/" and then on "/"
df_listings = df_listings.assign(
    app_id=df_listings["url"].str.split(
        "app/", expand=True
    )[1].str.split("/", expand=True)[0]
)

df_listings = df_listings.reset_index(drop=True)

show_df(df_listings, 1)
show_df_dtypes_nans(df_listings)

Unnamed: 0,review_type_all,overall_review_rating,pct_overall,pct_overall_threshold,pct_overall_lang,pct_overall_threshold_lang,platforms,user_defined_tags,num_steam_achievements,drm,rating,rating_descriptors,review_type_positive,review_type_negative,review_language_mine,Title,Genre,Release Date,Early Access Release Date,Developer,Publisher,Franchise,languages,num_languages,page_num,listing_num,url,filename,app_id
0,,Very Positive,90.0,positive,,,win,"Early Access, Co-op, Naval Combat, Online Co-Op, Character Customization, FPS, Pirates, Historical, Open World, PvP, Fast-Paced, Battle Royale, First-Person, Survival, Team-Based, Competitive, Funny, Shooter, Action, Naval",,"Requires agreement to a 3rd-party EULA, Blazing Sails EULA",,,,,,Blazing Sails,"Action, Adventure, Casual, Indie, Early Access","Nov 5, 2020","Sep 9, 2020",Get Up Games,Iceberg Interactive,Iceberg Interactive,True,15.0,50,1,https://store.steampowered.com/app/1158940/Blazing_Sails/,p50_l1_Blazing_Sails.csv,1158940
10875,,Very Positive,91.0,positive,,,win,"Anime, Action, Beat 'em up, Indie, Female Protagonist, Cute, Platformer, Character Action Game, Fighting, Hack and Slash, JRPG, RPG, Singleplayer, Side Scroller",,,t,"Fantasy Violence, , Mild Suggestive Themes",,,,Fairy Bloom Freesia,"Action, Indie","Oct 17, 2012",,Edelweiss,Nyu Media,Nyu Media,True,2.0,505,25,https://store.steampowered.com/app/214590/Fairy Bloom Freesia/,p505_l25_Fairy_Bloom_Freesia.csv,214590


Unnamed: 0,num_missing,dtype
review_type_all,10876,float64
overall_review_rating,11,object
pct_overall,15,float64
pct_overall_threshold,15,object
pct_overall_lang,10876,float64
pct_overall_threshold_lang,10876,float64
platforms,37,object
user_defined_tags,1,object
num_steam_achievements,10876,float64
drm,9521,object


CPU times: user 25.4 s, sys: 0 ns, total: 25.4 s
Wall time: 25.4 s


Next, we'll use the added `app_id` column to append a column which indicates if a listing is actually a collection of games and not a single game
- collections have multiple `app_id`s in the URL, where each `app_id` corresponds to a single game, and we don't want to include these as we're only scraping single game listings

In [15]:
# Get index of all rows with multiple app_ids
is_collection_idx = df_listings[df_listings["app_id"].str.contains(",")].index
print(len(is_collection_idx))

60
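To see why a comma in `app_id` identifies a collection, the same two-step split can be applied to single URLs. This is a sketch only: the single-game URL is taken from the output above, while the collection URL and its ids are made up to mimic the multiple-`app_id` layout described here.

```python
def app_id_from_url(url: str) -> str:
    """Mirror the two-step split: text after "app/" and before the next "/"."""
    return url.split("app/")[1].split("/")[0]


# Single-game URL from the listings output above
single = "https://store.steampowered.com/app/214590/Fairy Bloom Freesia/"
# Hypothetical collection URL: the comma-separated ids are made up
collection = "https://store.steampowered.com/app/111,222/Example_Collection/"

for url in (single, collection):
    app_id = app_id_from_url(url)
    # A comma means more than one id was packed into the URL
    print(app_id, "," in app_id)
```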


Including each game in a collection will result in duplicating that game so we'll drop all rows that are collections

In [16]:
# Append a boolean column with a check of whether a listing is a collection
df_listings["is_collection"] = False
df_listings.loc[is_collection_idx, "is_collection"] = True

In [17]:
df_listings = df_listings.query("is_collection == False")

Count the number of listings with a blank string in the `Title` column

In [18]:
print(f"Missing title = {len(df_listings[df_listings['Title'] == ''])}")

Missing title = 0


Count the number of listings with a blank string in the `url`, `Release Date` or `platforms` columns (separately)

In [19]:
for c in ["url", "Release Date", "platforms"]:
    blank_values = (df_listings[c] == "").sum()
    print(f"Missing {c} = {blank_values}")

Missing url = 0
Missing Release Date = 0
Missing platforms = 0


Check how many unique and missing `Title`s are present in the listings

In [20]:
print(
    f"Number of unique titles = {df_listings['Title'].nunique():,}\n"
    f"Number of titles with missing values = {df_listings['Title'].isna().sum():,}\n"
    f"Number of listings = {len(df_listings):,}"
)

Number of unique titles = 10,072
Number of titles with missing values = 703
Number of listings = 10,816


Get listings with duplicate URLs

In [21]:
# Find duplicates - use keep=False to return all duplicated rows
dup_listings_requests = df_listings[
    df_listings.duplicated(subset=["url"], keep=False)
].sort_values(by=["url"])
print(f"Number of duplicate URLs = {len(dup_listings_requests)}")
display(
    dup_listings_requests[
        [
            "page_num",
            "listing_num",
            "Title",
            "Developer",
            "filename",
            "Publisher",
            "url",
            "user_defined_tags",
            "platforms",
        ]
    ]
)

Number of duplicate URLs = 2


Unnamed: 0,page_num,listing_num,Title,Developer,filename,Publisher,url,user_defined_tags,platforms
1125,95,20,,,p95_l20_Doom_Vfr.csv,,https://store.steampowered.com/app/650000/DOOM...,"Action, VR, Violent, Gore, FPS, Shooter, Horro...","win, vr_required"
1462,112,8,,,p112_l8_Doom_Vfr.csv,,https://store.steampowered.com/app/650000/DOOM...,"Action, VR, Violent, Gore, FPS, Shooter, Horro...","win, vr_required"


**Observations**
1. One URL in the `requests_listings_to_scrape.csv` file (produced by `4_filter_requests_listings.ipynb`, also shown in section 3.1 of this notebook) has likely resolved to the URL of another listing. This is the likely reason for the presence of this duplicated row, which we did not expect. We'll drop the duplicated row.

Drop rows with duplicated listing URLs

In [22]:
df_listings = df_listings.drop_duplicates(subset=["url"])

Explore listings with duplicates based on the `Title` and `url` columns

In [23]:
# Find duplicates - use keep=False to return all duplicated rows
dup_titles = df_listings.dropna(subset=["Title"])[
    df_listings.dropna(subset=["Title"]).duplicated(subset=["Title", "url"], keep=False)
]
cols_to_show = [
    "page_num",
    "listing_num",
    "Title",
    "url",
    "filename",
    "Developer",
    "Publisher",
    "user_defined_tags",
    "platforms",
]
display(dup_titles[cols_to_show].sort_values(by="Title"))
print(f"Number of duplicate Titles = {int(len(dup_titles) / 2)}")
for _, row in dup_titles.sort_values(by="Title")[["url", "is_collection"]].iterrows():
    print(f"is_collection = {row['is_collection']}, URL = {row['url']}")

Unnamed: 0,page_num,listing_num,Title,url,filename,Developer,Publisher,user_defined_tags,platforms


Number of duplicate Titles = 0


Now, explore any remaining listings with only a duplicate `Title`

In [24]:
# Find duplicates - use keep=False to return all duplicated rows
dup_titles = df_listings.dropna(subset=["Title"])[
    df_listings.dropna(subset=["Title"]).duplicated(subset=["Title"], keep=False)
]
cols_to_show = [
    "Title",
    "url",
    "Developer",
    "Publisher",
    "user_defined_tags",
    "platforms",
    "is_collection",
    "page_num",
]
display(dup_titles[cols_to_show].sort_values(by="Title"))
print(f"Number of duplicate Titles = {int(len(dup_titles) / 2)}")
for _, row in dup_titles.sort_values(by="Title")[
    ["url", "is_collection", "page_num", "listing_num"]
].iterrows():
    print(
        f"is_collection = {row['is_collection']}, page_num = {row['page_num']}, "
        f"listing_num = {row['listing_num']}, URL = {row['url']}"
    )

Unnamed: 0,Title,url,Developer,Publisher,user_defined_tags,platforms,is_collection,page_num
10648,Airport Madness: Time Machine,https://store.steampowered.com/app/402210/Airp...,Big Fat Simulations Inc.,Big Fat Simulations Inc.,"Adventure, Simulation, Strategy","win, mac",False,496
10737,Airport Madness: Time Machine,https://store.steampowered.com/app/402210/Airp...,Big Fat Simulations Inc.,Big Fat Simulations Inc.,"Adventure, Simulation, Strategy","win, mac",False,500
10779,Alone on Mars,https://store.steampowered.com/app/1637490/Alo...,Gnelf,Gnelf,"Adventure, Action, Arcade, Platformer, Action-...",win,False,502
10689,Alone on Mars,https://store.steampowered.com/app/1637490/Alo...,Gnelf,Gnelf,"Adventure, Action, Arcade, Platformer, Action-...",win,False,498
8311,Anime girl Or Bottle?,https://store.steampowered.com/app/785110/Anim...,"Nikita ""Ghost_RUS""",Ghost_RUS Games,"Indie, Casual, Anime, Psychological Horror",win,False,400
...,...,...,...,...,...,...,...,...
10710,UFO vs Bikini,https://store.steampowered.com/app/1660750/UFO...,Xenon Lu,"WELOVEBOT CO., LTD.","RPG, Sci-fi, Casual, Cute, Aliens, Action-Adve...",win,False,499
10000,Village Monsters,https://store.steampowered.com/app/679830/Vill...,Josh Bossie,Josh Bossie,"Early Access, Life Sim, Time Management, Relax...","win, mac, linux",False,469
10812,Village Monsters,https://store.steampowered.com/app/679830/Vill...,Josh Bossie,Josh Bossie,"Early Access, Life Sim, Time Management, Relax...","win, mac, linux",False,503
8477,War Trigger 3,https://store.steampowered.com/app/298240/War_...,"Rocketeer Games Studio, LLC","Rocketeer Games Studio, LLC","Shooter, Free to Play, FPS, Multiplayer, Actio...","win, mac",False,407


Number of duplicate Titles = 41
is_collection = False, page_num = 496, listing_num = 18, URL = https://store.steampowered.com/app/402210/Airport_Madness_Time_Machine/
is_collection = False, page_num = 500, listing_num = 8, URL = https://store.steampowered.com/app/402210/Airport Madness: Time Machine/
is_collection = False, page_num = 502, listing_num = 2, URL = https://store.steampowered.com/app/1637490/Alone on Mars/
is_collection = False, page_num = 498, listing_num = 10, URL = https://store.steampowered.com/app/1637490/Alone_on_Mars/
is_collection = False, page_num = 400, listing_num = 3, URL = https://store.steampowered.com/app/785110/Anime_girl_Or_Bottle/
is_collection = False, page_num = 503, listing_num = 15, URL = https://store.steampowered.com/app/785110/Anime girl Or Bottle?/
is_collection = False, page_num = 490, listing_num = 21, URL = https://store.steampowered.com/app/1449540/Atlas_Rogues/
is_collection = False, page_num = 500, listing_num = 4, URL = https://store.steampowe

Show the page number on which duplicate titles are found and the number of pages with such duplicates

In [25]:
print(dup_titles["page_num"].unique().tolist(), dup_titles["page_num"].nunique())

[72, 73, 74, 78, 137, 139, 143, 144, 184, 313, 351, 354, 357, 359, 364, 365, 375, 380, 399, 400, 401, 402, 407, 418, 421, 436, 469, 471, 472, 473, 475, 477, 478, 481, 482, 483, 485, 487, 490, 493, 496, 498, 499, 500, 501, 502, 503, 504, 505] 49


**Observations**
1. None of these URLs is an exact duplicate of another (exact duplicates were dropped above), but the titles are duplicated.
2. From manual inspection of this list and of some of the individual game listings on the Steam web store, in some of these pairs one listing corresponds to a single episode and the other to the multi-episode game that contains it. Each multi-episode listing is a valid listing, but it is like a collection since it refers to multiple individual games or episodes. Unlike a collection, a multi-episode listing does not have multiple `app_id`s in its URL, so the `is_collection` column does not catch it. So, from each of these pairs, we'll manually drop the multi-episode listing and keep only the single-episode one.
3. The remaining pairs appear to be genuine duplicates: in the printed output above, the two URLs in a pair share an `app_id` and differ only in how the title part of the URL is formatted. However, the source of these duplicates is not known, since we weren't expecting any. Without understanding where they are coming from, we'll drop both occurrences of each duplicated row from the data.
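The printed URLs above suggest that at least some pairs point at the same `app_id` with a differently formatted title slug. A minimal sketch of how such pairs could be grouped is given below, using four URLs copied from that printout. Grouping by `app_id` is an illustration only and is not a step this notebook performs.

```python
from collections import defaultdict

# URLs copied from the printed output above
urls = [
    "https://store.steampowered.com/app/402210/Airport_Madness_Time_Machine/",
    "https://store.steampowered.com/app/402210/Airport Madness: Time Machine/",
    "https://store.steampowered.com/app/1637490/Alone on Mars/",
    "https://store.steampowered.com/app/1637490/Alone_on_Mars/",
]

# Group URLs by the id that follows "app/"
by_app_id = defaultdict(list)
for url in urls:
    app_id = url.split("app/")[1].split("/")[0]
    by_app_id[app_id].append(url)

# Keep only ids that are shared by more than one distinct URL
shared = {app_id: group for app_id, group in by_app_id.items() if len(group) > 1}
print(sorted(shared))
```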

Define a list of multi-episode listings by their URL

In [26]:
multi_episode_listings = [
    "https://store.steampowered.com/app/31220/Sam__Max_The_Devils_Playhouse/",
    "https://store.steampowered.com/app/8260/Sam__Max_Season_Two/",
    "https://store.steampowered.com/app/40960/The_Stronghold_Collection/",
    "https://store.steampowered.com/app/33220/Tom_Clancys_Splinter_Cell_Conviction_Deluxe_Edition/",
]

Remove multi-episode listings

In [27]:
df_listings = df_listings.query("url not in @multi_episode_listings")

After this removal, we'll drop the leftover duplicated `Title`s (following point 3. of the **Observations** above, we're dropping both occurrences of each duplicate, so we'll set `keep=False`). Rows with a missing `Title` are also dropped in this step.

In [28]:
df_listings = df_listings.dropna(subset=["Title"]).drop_duplicates(
    subset=["Title"], keep=False
)

No duplicated `Title`s should remain after this step

In [29]:
dup_titles = df_listings.dropna(subset=["Title"])[
    df_listings.dropna(subset=["Title"]).duplicated(subset=["Title"], keep=False)
]
print(f"Number of duplicate Titles = {int(len(dup_titles) / 2)}")

Number of duplicate Titles = 0
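The effect of `keep=False` is easiest to see next to the default behaviour. The toy frame below uses placeholder titles and page numbers. It is only a sketch of the two options, not a re-run of the cell above.

```python
import pandas as pd

# Toy listings: "Game A" appears twice and one row has no title
toy = pd.DataFrame(
    {
        "Title": ["Game A", "Game A", "Game B", None],
        "page_num": [498, 502, 469, 95],
    }
)
titled = toy.dropna(subset=["Title"])

# Default keep="first": one copy of each duplicated title survives
keep_first = titled.drop_duplicates(subset=["Title"])

# keep=False: every row belonging to a duplicated title is removed
keep_none = titled.drop_duplicates(subset=["Title"], keep=False)

print(keep_first["Title"].tolist())
print(keep_none["Title"].tolist())
```

Dropping both copies is the more conservative choice when it is unclear which row of a pair is trustworthy, at the cost of losing those listings entirely.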


<a id="merge-with-price-from-search-results-dataset-acquired-using-`requests`"></a>

### 3.3. [Merge with Price from Search Results dataset acquired using `requests`](#merge-with-price-from-search-results-dataset-acquired-using-`requests`)

In [30]:
%%time
dfm = df_listings.merge(
    df_search_results_requests[
        ["url", "discount_pct", "original_price", "discount_price"]
    ],
    on="url",
    how="left",
)
show_df(dfm.drop(columns=["user_defined_tags", "drm"]), 1)
show_df_dtypes_nans(dfm)

Unnamed: 0,review_type_all,overall_review_rating,pct_overall,pct_overall_threshold,pct_overall_lang,pct_overall_threshold_lang,platforms,num_steam_achievements,rating,rating_descriptors,review_type_positive,review_type_negative,review_language_mine,Title,Genre,Release Date,Early Access Release Date,Developer,Publisher,Franchise,languages,num_languages,page_num,listing_num,url,filename,app_id,is_collection,discount_pct,original_price,discount_price
0,,Very Positive,90.0,positive,,,win,,,,,,,Blazing Sails,"Action, Adventure, Casual, Indie, Early Access","Nov 5, 2020","Sep 9, 2020",Get Up Games,Iceberg Interactive,Iceberg Interactive,True,15.0,50,1,https://store.steampowered.com/app/1158940/Blazing_Sails/,p50_l1_Blazing_Sails.csv,1158940,False,-30%,14.99,10.49
10034,,Mostly Positive,73.0,positive,,,"win, mac",,,,,,,Dead In Bermuda,"Adventure, Indie, RPG, Simulation, Strategy","Aug 27, 2015",,Ishtar Games,Dear Villagers,"Dead In Games, Dear Villagers",True,5.0,505,24,https://store.steampowered.com/app/384310/Dead In Bermuda/,p505_l24_Dead_In_Bermuda.csv,384310,False,,14.99,


Unnamed: 0,num_missing,dtype
review_type_all,10035,float64
overall_review_rating,10,object
pct_overall,14,float64
pct_overall_threshold,14,object
pct_overall_lang,10035,float64
pct_overall_threshold_lang,10035,float64
platforms,33,object
user_defined_tags,0,object
num_steam_achievements,10035,float64
drm,8854,object


CPU times: user 104 ms, sys: 0 ns, total: 104 ms
Wall time: 102 ms
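The merge above relies on `url` being unique on both sides, which was checked earlier with an `assert`. An alternative, sketched below on made-up rows with placeholder URLs, is to let `DataFrame.merge` enforce this through its `validate` argument, which raises `pandas.errors.MergeError` when the keys are not unique.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy listings and prices; the price table repeats one URL on purpose
listings = pd.DataFrame({"url": ["u/1", "u/2"], "Title": ["Game A", "Game B"]})
prices = pd.DataFrame({"url": ["u/1", "u/1"], "original_price": ["14.99", "14.99"]})

# validate="one_to_one" refuses to merge when a key is repeated on either side
try:
    listings.merge(prices, on="url", how="left", validate="one_to_one")
    merge_ok = True
except MergeError:
    merge_ok = False

# After de-duplicating the right-hand side the same merge goes through
clean = listings.merge(
    prices.drop_duplicates(subset=["url"]),
    on="url",
    how="left",
    validate="one_to_one",
)
print(merge_ok, len(clean))
```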


<a id="export-combined-dataset-to-disk"></a>

## 4. [Export combined dataset to disk](#export-combined-dataset-to-disk)

We are now ready to vertically concatenate the merged datasets collected with `requests` and `selenium` and export the combined dataset to disk for use in analysis. However, inspecting the `original_price` and `discount_price` columns (shown below), we see that the currency differs between the two datasets.

In [31]:
show_df(dfm_sel[["original_price", "discount_price"]], 5)
show_df(dfm[["original_price", "discount_price"]], 5)

Unnamed: 0,original_price,discount_price
0,CDN$ 33.99,CDN$ 30.59
1,Free to Play,
2,CDN$ 79.99,
3,Free to Play,
4,CDN$ 49.99,
878,CDN$ 22.79,CDN$ 17.09
879,CDN$ 22.79,
880,CDN$ 17.49,
881,CDN$ 8.99,
882,CDN$ 22.79,CDN$ 12.53


Unnamed: 0,original_price,discount_price
0,14.99,10.49
1,19.99,
2,16.99,
3,4.99,
4,17.99,
10030,5.99,
10031,0.99,
10032,9.99,
10033,1.99,
10034,14.99,


As mentioned in `3_requests_download.ipynb`, due to the slow speed of scraping with Selenium on a PC, a cloud-based virtual machine (VM) was used to scrape data using `requests`. However, the VM was created within a US-based zone and so it was assigned a US IP address. This is likely why prices in the larger dataset collected with `requests` are in USD, while prices in the smaller dataset scraped with `selenium` are in CAD. Prices cannot be compared directly across currencies, and search results are also likely to differ by geo-location: a search originating in Canada (as is the case with data scraped using Selenium) may not return the same results as one originating in the US (data scraped using `requests`).
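The difference between the two price formats can be made explicit with a small parser. The sketch below is an assumption-laden illustration: `parse_price` is a hypothetical helper, the inputs are values seen in the outputs above, and bare numbers are reported as `"unlabelled"` because the strings themselves carry no currency (the USD interpretation is the inference made in the text, not something the data states).

```python
import re
from typing import Optional, Tuple

# Optional "CDN$" prefix followed by a plain decimal amount
PRICE_RE = re.compile(r"^(?:(?P<cur>CDN\$)\s*)?(?P<amt>\d+(?:\.\d+)?)$")


def parse_price(raw: object) -> Tuple[Optional[str], Optional[float]]:
    """Split a scraped price string into (currency label, amount)."""
    if not isinstance(raw, str):  # missing values (None or NaN) carry no price
        return (None, None)
    text = raw.strip()
    if text.lower().startswith("free"):  # "Free" and "Free to Play"
        return (None, 0.0)
    match = PRICE_RE.match(text)
    if match is None:
        return (None, None)
    currency = "CAD" if match.group("cur") else "unlabelled"
    return (currency, float(match.group("amt")))


for value in ["CDN$ 33.99", "14.99", "Free to Play", "Free", None]:
    print(repr(value), parse_price(value))
```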

The number of missing values in each of the user-review columns scraped with `selenium` is shown below and compared to the number of missing values when scraping with `requests`

In [32]:
user_review_cols = [
    "review_type_all",
    "review_type_positive",
    "review_type_negative",
    "review_language_mine",
    "pct_overall_lang",
    "pct_overall_threshold_lang",
    "num_steam_achievements",
]

In [33]:
d_nans = {col: dfm[col].isna().sum() for col in user_review_cols}
d_nans.update({"num_rows": len(dfm)})
d_nans_sel = {col: dfm_sel[col].isna().sum() for col in user_review_cols}
d_nans_sel.update({"num_rows": len(dfm_sel)})

display(
    pd.DataFrame.from_dict(d_nans, orient="index")
    .rename(columns={0: "requests"})
    .merge(
        pd.DataFrame.from_dict(d_nans_sel, orient="index").rename(
            columns={0: "selenium"}
        ),
        left_index=True,
        right_index=True,
        how="left",
    )
    .rename_axis("column_name")
    .style.set_caption("Rows with missing values in the merged datasets")
)

Unnamed: 0_level_0,requests,selenium
column_name,Unnamed: 1_level_1,Unnamed: 2_level_1
review_type_all,10035,82
review_type_positive,10035,82
review_type_negative,10035,82
review_language_mine,10035,82
pct_overall_lang,10035,83
pct_overall_threshold_lang,10035,83
num_steam_achievements,10035,251
num_rows,10035,883


**Notes**
1. These counts are taken from the merged datasets `dfm` (scraped with `requests`) and `dfm_sel` (scraped with `selenium`), created in sections [3.3 (using `requests`)](#merge-with-price-from-search-results-dataset-acquired-using-`requests`) and [2.3 (using `selenium`)](#merge-with-price-from-search-results-dataset-acquired-using-`selenium`) of this notebook.
2. The last row `num_rows` is the total number of listings scraped.

Scraping the user-review statistics from individual game listings failed with `requests`: every row scraped with `requests` (100 percent) has missing values in these columns. This was not the case with Selenium, where only a relatively small fraction of listings have missing values, primarily because the listing had only recently been released at the time of scraping (the first two and a half weeks of October 2021). This difference may be attributed to the fact that the Selenium-based scraping approach scrolled down the listing page to bring the user-reviews section into focus (loading the necessary JavaScript on the page) and only then scraped the user-review statistics. This is discussed in `2_selenium.ipynb`. When scraping with `requests`, the page was not scrolled (`requests` retrieves the page's HTML without executing JavaScript), and this is likely the reason for the missing values in the user-review columns for every listing gathered with `requests`. The missing values in the `num_steam_achievements` column are likely due to differences in the position of this element on the Canadian and US versions of the Steam store webpage.
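The fractions referred to above follow directly from the table of counts. A short worked calculation, using three rows copied from that table:

```python
# Missing-value counts and row totals copied from the table above
missing = {
    "requests": {
        "review_type_all": 10035,
        "num_steam_achievements": 10035,
        "num_rows": 10035,
    },
    "selenium": {
        "review_type_all": 82,
        "num_steam_achievements": 251,
        "num_rows": 883,
    },
}

# Percentage of rows with a missing value, per source and column
pct_missing = {
    source: {
        col: round(100 * count / counts["num_rows"], 1)
        for col, count in counts.items()
        if col != "num_rows"
    }
    for source, counts in missing.items()
}
print(pct_missing)
```

For the Selenium dataset this gives roughly 9.3 percent of rows missing `review_type_all` and 28.4 percent missing `num_steam_achievements`, against 100 percent for both columns in the `requests` dataset.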

Unfortunately, neither of these problems (currency and missing values) was caught until approximately one third of the listings scraped with `requests` had already been gathered. Taking these problems into account, the smaller Selenium-based dataset will be ignored in further analysis. Only the dataset scraped with `requests` (which has more rows of listings) will be considered, and its user-review columns, which are filled with missing values, will be ignored. So we will export only the processed and merged version of that dataset (created in section [3.3](#merge-with-price-from-search-results-dataset-acquired-using-`requests`) of this notebook) to disk below, and this will be used in further analysis in `7_eda.ipynb`

In [34]:
dfm.to_csv(processed_data_filepath, index=False)
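One thing to keep in mind when reading the exported file back is that CSV does not store datatypes, so `read_csv` will re-infer them. The sketch below uses a two-row toy frame and an in-memory buffer instead of the real file. It shows an all-digit `app_id` column coming back as integers unless a `dtype` is given.

```python
import io

import pandas as pd

# Toy frame with app_id stored as text; the second title is a placeholder
toy = pd.DataFrame(
    {"app_id": ["1158940", "21130"], "Title": ["Blazing Sails", "Example"]}
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Without a dtype hint, an all-digit column is inferred as integers
inferred = pd.read_csv(io.StringIO(csv_text))
# Passing dtype keeps the ids as text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"app_id": str})

print(inferred["app_id"].tolist())
print(as_text["app_id"].tolist())
```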

---

<span style="float:left">
    <a href="./5_requests_process.ipynb"><< 5 - Requests process</a>
</span>

<span style="float:right">
    <a href="./7_eda.ipynb">7 - Exploratory Data Analysis >></a>
</span>