# Filtering `requests` Search Results

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import os
import time
from glob import glob

import pandas as pd

In [3]:
%aimport src.utils
from src.utils import show_df, show_df_dtypes_nans

<a id="table-of-contents"></a>

## [Table of Contents](#table-of-contents)
0. [About](#about)
1. [User Inputs](#user-inputs)
2. [`selenium`-based files](#`selenium`-based-files)
   - 2.1. [Load and Process all Listings files acquired using `selenium`](#load-and-process-all-listings-files-acquired-using-`selenium`)
   - 2.2. [Load and Process all Search Results files acquired using `selenium`](#load-and-process-all-search-results-files-acquired-using-`selenium`)
   - 2.3. [Merge listings with search results acquired using `selenium`](#merge-listings-with-search-results-acquired-using-`selenium`)
3. [`requests`-based files](#`requests`-based-files)
   - 3.1. [Load and Process all Search Results files acquired using `requests`](#load-and-process-all-search-results-files-acquired-using-`requests`)
   - 3.2. [Slice columns to use for scraping with `requests` and export to file](#slice-columns-to-use-for-scraping-with-using-`requests`-and-export-to-file)

<a id="about"></a>

## 0. [About](#about)

In this notebook, our primary objective is to create a list of game listings to be scraped using the `requests` library. We do not yet know whether the search results scraped with `requests` (in `3_requests_download.ipynb`) contain duplicated game listings. Since we don't want to scrape a listing more than once, we'll remove any duplicated listings from that search results dataset before scraping the listings themselves to get additional attributes beyond platform and price. This is discussed in [section 3.](#`requests`-based-files) of this notebook.

A secondary objective is to briefly explore the presence of duplicates in the search results scraped using the `selenium` library in `2_selenium.ipynb`. This will not create a new dataset, but it lets us compare the presence of duplicated search results in this dataset to that in the search results scraped using `requests`. This is discussed in [section 2.](#`selenium`-based-files) of this notebook. The comparison will also help us decide how to merge the search results (which contain information about supported platforms and price) with the listings (which do not include platform or price).

<a id="user-inputs"></a>

## 1. [User Inputs](#user-inputs)

Define variables that can be changed when running this notebook

In [4]:
PROJ_ROOT_DIR = os.getcwd()

In [5]:
# Columns to hide (for display purposes only)
cols_to_hide = ["Title", "Release Date", "Genre", "page_num", "listing_num"]

Define variables that depend on the variables defined above
- get paths to directories where single-row CSV files (produced after scraping) will be stored
- create lists of
  - search results files created using webscraping with Selenium and requests
  - listings files created using webscraping with Selenium

In [6]:
# Path to data/raw
data_dir = os.path.join(PROJ_ROOT_DIR, "data")
raw_data_dir = os.path.join(data_dir, "raw")

# Path to data/raw/requests
requests_files_dir = os.path.join(raw_data_dir, "requests")

# Path to data/raw/selenium
selenium_files_dir = os.path.join(raw_data_dir, "selenium")

# List of CSV files created using selenium
fpaths = glob(os.path.join(selenium_files_dir, "p*_*.csv"))

# List of search results files created by Selenium
selenium_search_results_pages = glob(
    os.path.join(selenium_files_dir, "search_results_page_*_*.parquet.gzip")
)

# List of search results files created by requests
requests_search_results_pages = glob(
    os.path.join(requests_files_dir, "search_results_page_*_*.parquet.gzip")
)
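As a rough illustration of how the two wildcard patterns above separate listings files from search results files, the sketch below applies the same shell-style matching rules with `fnmatch` to a handful of file names. The file names are hypothetical and only chosen to fit the patterns; the real names in `data/raw/selenium` may differ.

```python
from fnmatch import fnmatch

# Hypothetical file names (assumptions for illustration only)
filenames = [
    "p2_l1.csv",
    "p36_l24.csv",
    "search_results_page_2_20211015.parquet.gzip",
    "notes.txt",
]

listings_pattern = "p*_*.csv"
search_results_pattern = "search_results_page_*_*.parquet.gzip"

# Keep only the names that match each pattern
listing_files = [f for f in filenames if fnmatch(f, listings_pattern)]
search_files = [f for f in filenames if fnmatch(f, search_results_pattern)]

print(listing_files)
print(search_files)
```

Because the listings pattern requires the name to start with `p` and end with `.csv`, it does not pick up the search results files stored in the same directory.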

<a id="`selenium`-based-files"></a>

## 2. [`selenium`-based files](#`selenium`-based-files)

<a id="load-and-process-all-listings-files-acquired-using-`selenium`"></a>

### 2.1. [Load and Process all Listings files acquired using `selenium`](#load-and-process-all-listings-files-acquired-using-`selenium`)

In order to briefly explore the scraped listings acquired using `selenium`, we'll iterate over all the scraped listing files (from the above list of CSV files) and concatenate the files into a single `DataFrame` with all the listings

In [10]:
%%time
df_listings = pd.concat(
    [
        pd.read_csv(fpath)
        for fpath in fpaths
    ],
    ignore_index=True,
).dropna(subset=["Title"]).sort_values(by=["page_num", "listing_num"]).reset_index(drop=True)
show_df(df_listings, 1)
show_df_dtypes_nans(df_listings)

Unnamed: 0,review_type_all,overall_review_rating,pct_overall,pct_overall_threshold,pct_overall_lang,pct_overall_threshold_lang,platforms,user_defined_tags,num_steam_achievements,drm,rating,rating_descriptors,review_type_positive,review_type_negative,review_language_mine,Title,Genre,Release Date,Early Access Release Date,Developer,Publisher,Franchise,languages,num_languages,page_num,listing_num
0,1014.0,Very Positive,93.0,positive,92.0,positive,win,"Base Building, Strategy, Survival, Tower Defense, Hack and Slash, Action RPG, Crafting, Adventure, Simulation, RPG, Sci-fi, Resource Management, Exploration, Loot, Isometric, Building, Sandbox, Atmospheric, Action, Aliens",30.0,,,,938.0,76.0,412.0,The Riftbreaker,"Action, Adventure, Indie, RPG, Simulation, Strategy","14 Oct, 2021",,EXOR Studios,"EXOR Studios, Surefire.Games","EXOR Studios, surefiregames","English, French, German, Spanish - Spain, Japanese, Korean, Polish, Russian, Simplified Chinese, Portuguese - Brazil",10.0,2,1
608,329.0,Mixed,82.0,positive,79.0,positive,win,"Online Co-Op, Action, Early Access, Horror, Co-op Campaign, Psychological Horror, Co-op, Supernatural, First-Person, Stealth, Artificial Intelligence, Thriller, Mystery, Atmospheric, Adventure, Psychological, Cinematic, Survival Horror, Dark, PvP",20.0,,,,269.0,60.0,192.0,Haunt Chaser,"Action, Adventure, Indie","6 Oct, 2021","15 Jul, 2021",Clock Wizard Games,Clock Wizard Games,,"English, Turkish, French, German, Simplified Chinese, Portuguese - Brazil, Spanish - Latin America, Polish, Russian",9.0,36,24


Unnamed: 0,num_missing,dtype
review_type_all,13,float64
overall_review_rating,13,object
pct_overall,13,float64
pct_overall_threshold,13,object
pct_overall_lang,13,float64
pct_overall_threshold_lang,13,object
platforms,25,object
user_defined_tags,0,object
num_steam_achievements,132,float64
drm,411,object


CPU times: user 1.42 s, sys: 14.7 ms, total: 1.43 s
Wall time: 1.43 s


As we can see below, the `Title` column of this dataset is unique since there is one row for each scraped listing `Title`

In [11]:
print(df_listings["Title"].nunique(), len(df_listings))

609 609


<a id="load-and-process-all-search-results-files-acquired-using-`selenium`"></a>

### 2.2. [Load and Process all Search Results files acquired using `selenium`](#load-and-process-all-search-results-files-acquired-using-`selenium`)

We'll now get a list of all the search results files collected with `selenium`

In [14]:
selenium_search_results_individual_pages = glob(
    os.path.join(selenium_files_dir, "search_results_page_*_2021*.parquet.gzip")
)
print(
    f"Found {len(selenium_search_results_individual_pages)} "
    "batched search results *.parquet.gzip file(s)."
)

Found 35 batched search results *.parquet.gzip file(s).


Similar to concatenating listings, we'll concatenate all the search results into a single `DataFrame`

In [15]:
df_search_results = (
    pd.concat(
        [
            pd.read_parquet(
                selenium_search_results_individual_page,
                engine="auto",
            )
            for selenium_search_results_individual_page in selenium_search_results_individual_pages
        ],
        ignore_index=True,
    )
    .astype({"page": int})
    .dropna(subset=["title"])
    .sort_values(by=["page", "listing_counter"])
    .reset_index(drop=True)
)
print(df_search_results["title"].nunique(), len(df_search_results))

655 875
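The concatenate, clean and sort chain used here can be sketched on two small in-memory frames. This is only a toy version: the values are made up, and storing `page` as a string before the cast is an assumption made to show why `astype` is applied.

```python
import pandas as pd

# Toy stand-ins for two per-page search results files (values are made up)
page_3 = pd.DataFrame(
    {"page": ["3", "3"], "listing_counter": [2, 1], "title": ["Game C", None]}
)
page_2 = pd.DataFrame(
    {"page": ["2", "2"], "listing_counter": [2, 1], "title": ["Game B", "Game A"]}
)

df_pages = (
    pd.concat([page_3, page_2], ignore_index=True)
    .astype({"page": int})  # cast page so that sorting is numeric
    .dropna(subset=["title"])  # drop rows without a title
    .sort_values(by=["page", "listing_counter"])
    .reset_index(drop=True)
)
print(df_pages)
```

The row with a missing title is dropped, and the remaining rows are ordered by page and then by position within the page, regardless of the order in which the files were read.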


Next, we'll verify that there are no listings with multiple URLs for a single title

In [16]:
listings_with_multiple_urls = (
    df_search_results.groupby("title")["url"].nunique().sort_values(ascending=False)
)
print(
    "Number of listing titles with more than one URL = "
    f"{len(listings_with_multiple_urls[listings_with_multiple_urls > 1])}"
)

Number of listing titles with more than one URL = 0


**Observations**
1. We can see that there are no titles in the search results collected with `selenium` that have multiple listing URLs associated with them.

However, there are listings that have the same `title`

In [17]:
display(
    df_search_results[
        df_search_results.duplicated(subset=["title"], keep=False)
    ].sort_values(by="title")
)

Unnamed: 0,page,listing_counter,title,url,platform_names,release_date,discount_pct,original_price,discount_price
351,16,2,A Dance of Fire and Ice,https://store.steampowered.com/app/977950/A Da...,"win,mac","24 Jan, 2019",,CDN$ 6.69,
329,15,5,A Dance of Fire and Ice,https://store.steampowered.com/app/977950/A Da...,"win,mac","24 Jan, 2019",,CDN$ 6.69,
586,25,12,A Hat in Time,https://store.steampowered.com/app/253230/A Ha...,"win,mac","5 Oct, 2017",,CDN$ 32.99,
718,30,19,A Hat in Time,https://store.steampowered.com/app/253230/A Ha...,"win,mac","5 Oct, 2017",,CDN$ 32.99,
150,8,1,A Total War Saga: TROY,https://store.steampowered.com/app/1099410/A T...,"win,mac","2 Sep, 2021",,CDN$ 44.99,
...,...,...,...,...,...,...,...,...,...
393,17,19,Zombie Army 4: Dead War,https://store.steampowered.com/app/694280/Zomb...,win,"18 Feb, 2021",,CDN$ 56.99,
506,22,7,Zombie Army 4: Dead War,https://store.steampowered.com/app/694280/Zomb...,win,"18 Feb, 2021",,CDN$ 56.99,
761,32,12,Zombie Army 4: Dead War,https://store.steampowered.com/app/694280/Zomb...,win,"18 Feb, 2021",,CDN$ 56.99,
125,7,1,iRacing,https://store.steampowered.com/app/266410/iRac...,win,"12 Jan, 2015",,CDN$ 10.99,


**Observations**
1. These are valid duplicated listings and need to be dropped. They appeared on a different page number when the Selenium webdriver queried the Steam store search. We'll drop one occurrence of each duplicated listing and keep the other.
2. A unique listing in a search result will have a unique combination of listing title and URL. No other listing should have the same combination of these two columns. So, we can use these two columns to identify and remove duplicated entries from this dataset.

So, we will use the `title` and `url` columns to remove duplicates from this dataset of search results (collected with `selenium`)

In [22]:
df_search_results = df_search_results.drop_duplicates(
    subset=["title", "url"]
).reset_index(drop=True)
print(
    f"Number of unique titles = {df_search_results['title'].nunique()}, "
    f"Number of rows = {len(df_search_results)}"
)
show_df(df_search_results, 5)

Number of unique titles = 655, Number of rows = 655


Unnamed: 0,page,listing_counter,title,url,platform_names,release_date,discount_pct,original_price,discount_price
0,2,1,The Riftbreaker,https://store.steampowered.com/app/780310/The Riftbreaker/,win,"14 Oct, 2021",-10%,CDN$ 33.99,CDN$ 30.59
1,2,2,War Thunder,https://store.steampowered.com/app/236390/War Thunder/,"win,mac,linux,vr_supported","15 Aug, 2013",,Free to Play,
2,2,3,Red Dead Redemption 2,https://store.steampowered.com/app/1174180/Red Dead Redemption 2/,win,"5 Dec, 2019",,CDN$ 79.99,
3,2,4,Path of Exile,https://store.steampowered.com/app/238960/Path of Exile/,"win,mac","23 Oct, 2013",,Free to Play,
4,2,5,Sea of Thieves,https://store.steampowered.com/app/1172620/Sea of Thieves/,win,"3 Jun, 2020",,CDN$ 49.99,
650,36,19,METAL GEAR RISING: REVENGEANCE,https://store.steampowered.com/app/235460/METAL GEAR RISING: REVENGEANCE/,win,"9 Jan, 2014",,CDN$ 32.99,
651,36,20,Barony,https://store.steampowered.com/app/371970/Barony/,"win,mac,linux","23 Jun, 2015",,CDN$ 17.49,
652,36,21,Never Return,https://store.steampowered.com/app/1612620/Never Return/,win,"20 Aug, 2021",,CDN$ 15.49,
653,36,22,Roguebook,https://store.steampowered.com/app/1076200/Roguebook/,"win,mac,linux","17 Jun, 2021",,CDN$ 28.99,
654,36,24,Haunt Chaser,https://store.steampowered.com/app/1450180/Haunt Chaser/,win,"6 Oct, 2021",,CDN$ 17.49,
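The difference between displaying duplicates with `duplicated(keep=False)` and removing them with `drop_duplicates` can be sketched on toy data. The titles, URLs and page numbers below are made up for illustration.

```python
import pandas as pd

# Toy search results: "Game A" and "Game B" each appear on two pages
df_toy = pd.DataFrame(
    {
        "page": [15, 16, 25, 30, 7],
        "title": ["Game A", "Game A", "Game B", "Game B", "Game C"],
        "url": [
            "https://example.com/app/1/",
            "https://example.com/app/1/",
            "https://example.com/app/2/",
            "https://example.com/app/2/",
            "https://example.com/app/3/",
        ],
    }
)

# keep=False flags every copy of a duplicated (title, url) pair
all_copies = df_toy[df_toy.duplicated(subset=["title", "url"], keep=False)]

# drop_duplicates keeps the first occurrence of each pair by default
deduped = df_toy.drop_duplicates(subset=["title", "url"]).reset_index(drop=True)

print(len(all_copies), len(deduped))
```

Since `drop_duplicates` keeps the first occurrence, the surviving row for each listing is the one from the earliest position in the sorted `DataFrame`.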


All the columns scraped with Selenium (from each listing) are shown below

In [19]:
print("Columns scraped using selenium")
list(df_listings)

Columns scraped using selenium


['review_type_all',
 'overall_review_rating',
 'pct_overall',
 'pct_overall_threshold',
 'pct_overall_lang',
 'pct_overall_threshold_lang',
 'platforms',
 'user_defined_tags',
 'num_steam_achievements',
 'drm',
 'rating',
 'rating_descriptors',
 'review_type_positive',
 'review_type_negative',
 'review_language_mine',
 'Title',
 'Genre',
 'Release Date',
 'Early Access Release Date',
 'Developer',
 'Publisher',
 'Franchise',
 'languages',
 'num_languages',
 'page_num',
 'listing_num']

Since the listing URL was not scraped from the individual listings, it is only present in the search results dataset. The URL would be unique for each listing and could be an alternative to the `title` column when merging these two datasets. However, its absence from the listings dataset means we have to use the `title` column on its own instead.

<a id="merge-listings-with-search-results-acquired-using-`selenium`"></a>

### 2.3. [Merge listings with search results acquired using `selenium`](#merge-listings-with-search-results-acquired-using-`selenium`)

Brief considerations for direction of merge when combining the search results and listings datasets.

Recall that, when scraping with Selenium, we loaded a page of search results, clicked on a single listing, scraped it and returned (moved back) to the search results. This was done for a single page of search results at a time. As scraping with Selenium was slow, it was performed over multiple days. Over this time, Steam updated the search results. So, a listing appearing on page 4 (during a given day's scraping) could appear on page 5 during the next day's scraping. In this scenario, the webscraping code checks for previously scraped listings (by listing name, which contains the Title of the listing) and skips re-scraping them during later iterations through the search results pages.

Listings that appeared in multiple search results are therefore treated as duplicates, and they were not clicked on (or scraped) a second time. This means the search results dataset could have more rows than the listings dataset for the same page. For this reason, the search results `DataFrame` will be `LEFT JOIN`ed to the listings dataset. The `JOIN`ed dataset will have some rows whose columns scraped from the search results page are populated but whose columns scraped from the listing pages are missing.

The search results and listings datasets are merged below

In [None]:
dfm = df_search_results.merge(
    df_listings,
    left_on=["title"],
    right_on=["Title"],
    how="left",
)
show_df(dfm.drop(columns=["url", "user_defined_tags", "languages", "Genre"]), 1)
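To illustrate the direction of this merge on toy data, the sketch below left-joins a small search results frame to a smaller listings frame. The values are made up, and `indicator=True` is an addition for illustration (it is not used in the cell above); it adds a `_merge` column that marks rows found only in the search results.

```python
import pandas as pd

# Toy search results (left side) and listings (right side); values are made up
search = pd.DataFrame(
    {
        "title": ["Game A", "Game B", "Game C"],
        "original_price": ["CDN$ 6.69", "Free to Play", "CDN$ 10.99"],
    }
)
listings = pd.DataFrame(
    {"Title": ["Game A", "Game B"], "Developer": ["Studio A", "Studio B"]}
)

merged = search.merge(
    listings,
    left_on="title",
    right_on="Title",
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)
print(merged)
```

Every search result is kept. "Game C" has no matching listing, so its listing columns are missing and its `_merge` value is `left_only`.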

We won't be using this merged dataset further in this notebook, but the same logic will be used when we do need to combine these two datasets (collected with `selenium`) for exploratory data analysis in section 2.3 of a later notebook `6_merge_searches_listings.ipynb`.

<a id="`requests`-based-files"></a>

## 3. [`requests`-based files](#`requests`-based-files)

We'll now filter the search results scraped with `requests` in `3_requests_download.ipynb` to eliminate duplicate listings. This is the primary objective of this notebook.

<a id="load-and-process-all-search-results-files-acquired-using-`requests`"></a>

### 3.1. [Load and Process all Search Results files acquired using `requests`](#load-and-process-all-search-results-files-acquired-using-`requests`)

Combine (vertically concatenate) the datasets for all the scraped search results pages. We'll also remove any listings not supporting English and extract an `app_id` from the URL

In [None]:
%%time
df_search_results_requests = (
    pd.concat(
        [
            pd.read_parquet(f, engine="auto")
            for f in requests_search_results_pages
        ],
        ignore_index=True,
    )
    .drop(columns=["request_status_code"])
    .astype({"page": int})
    .dropna(subset=["title"])
    .sort_values(by=["page", "listing_counter"])
    .reset_index(drop=True)
)
print(
    "Total number of search results scraped with requests = "
    f"{len(df_search_results_requests):,}"
)

# Only select listing titles available in English
df_search_results_requests = df_search_results_requests[
    df_search_results_requests['title'].map(lambda x: x.isascii())
]

# Append an app_id column using a regex extract() from the URL
# (expand=False returns a Series; URLs without an /app/<digits>/ segment give NaN)
df_search_results_requests = df_search_results_requests.assign(
    app_id=df_search_results_requests["url"].str.extract(
        r"https://store\.steampowered\.com/app/(\d+)/", expand=False
    )
)

print(
    "Number of English language listings in the search results = "
    f"{len(df_search_results_requests):,}"
)
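The ASCII filter and the `app_id` extraction can be sketched on three toy rows. Only the first URL comes from the sample output shown earlier in this notebook; the other two rows are made up, and the `/bundle/` URL is an assumed example of a URL that does not follow the `/app/<digits>/` pattern.

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        "title": ["The Riftbreaker", "ゲーム", "Some Bundle"],
        "url": [
            "https://store.steampowered.com/app/780310/The Riftbreaker/",
            "https://store.steampowered.com/app/111/Game/",  # made up
            "https://store.steampowered.com/bundle/222/Some Bundle/",  # made up
        ],
    }
)

# Keep titles made up of ASCII characters only (a heuristic for "English")
ascii_only = df_toy[df_toy["title"].map(str.isascii)]

# Extract the digits after /app/; rows that do not match get NaN
app_ids = ascii_only["url"].str.extract(
    r"https://store\.steampowered\.com/app/(\d+)/", expand=False
)
ascii_only = ascii_only.assign(app_id=app_ids)
print(ascii_only[["title", "app_id"]])
```

Note that `str.isascii` is only a proxy for language: an English title that contains an accented letter or a symbol such as ™ would also be filtered out. The extracted `app_id` is a string, not an integer.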

Count how many listings are missing a title

In [None]:
print(
    f"Missing title = {len(df_search_results_requests[df_search_results_requests['title'] == ''])}"
)

Since there are only two such rows, we'll now remove these listings that are missing a value in the `title` column

In [None]:
df_search_results_requests = df_search_results_requests[
    df_search_results_requests["title"] != ""
]

We'll now count how many rows have an empty string in the `url`, `release_date` or `platform_names` columns (each counted separately)

In [None]:
for c in ["url", "release_date", "platform_names"]:
    blank_values = (df_search_results_requests[c] == "").sum()
    print(f"Missing {c} = {blank_values}")

We'll count how many listings are missing a value in the `app_id` column

In [None]:
missing_app_ids = df_search_results_requests["app_id"].isna().sum()
print(f"Missing app_ids = {missing_app_ids}")
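The checks above count two different kinds of missing value: empty strings (`== ""`) for the scraped text columns and `NaN` (`isna()`) for the extracted `app_id`. A minimal sketch that reports both per column is shown below. The helper `count_missing` and the toy values are hypothetical and are not part of `src.utils`.

```python
import pandas as pd

# Toy data: one blank release_date and one missing app_id (values are made up)
df_toy = pd.DataFrame(
    {
        "release_date": ["14 Oct, 2021", "", "5 Dec, 2019"],
        "app_id": ["780310", None, "1174180"],
    }
)


def count_missing(df):
    """Return per-column counts of empty strings and of NaN/None values."""
    return pd.DataFrame({"blank": (df == "").sum(), "nan": df.isna().sum()})


summary = count_missing(df_toy)
print(summary)
```

An empty string is not treated as missing by `isna()`, which is why the two counts have to be taken separately.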

Since the `release_date` column has several blank values and the `app_id` is missing several values as well, we'll use only `url`, `platform_names` and `title` to find duplicated listings. Duplicates in all three of these columns will be true duplicate listings and should be dropped from this dataset.

We'll show duplicated listings below

In [None]:
# Find duplicates - use keep=False to return all duplicated rows
dup_search_res_requests = df_search_results_requests[
    df_search_results_requests.duplicated(
        subset=["url", "title", "platform_names"], keep=False
    )
].sort_values(by=["url"])
display(dup_search_res_requests)

**Observations**
1. These are true duplicated listings. They have been listed multiple times in the search results, each time on a different page. They need to be dropped.

We'll use the `url`, `platform_names` and `title` columns to drop duplicated listings (keeping only the first such listing and dropping the others)

In [None]:
df_search_results_requests = df_search_results_requests.drop_duplicates(
    subset=["url", "title", "platform_names"]
)

If the listing `title` were unique, we would expect this dataset to have a single URL per `title`. Below are the listings whose `title` is associated with more than one URL

In [None]:
%%time
multiple_urls = (
    df_search_results_requests.groupby("title")["url"]
    .nunique()
    .sort_values(ascending=False)
)
multiple_urls_titles = multiple_urls[multiple_urls > 1].index.tolist()
show_df(
    df_search_results_requests[
        df_search_results_requests["title"].isin(multiple_urls_titles)
    ].sort_values(by=["title"]), 5
)

**Observations**
1. Since the listing `title` was cleaned to remove special characters, different listings whose names differed only by special characters were incorrectly given the same `title`. This means the `title` column is not, on its own, a valid indicator of a unique listing in this search results dataset (scraped with `requests`). The same cleaning was done when scraping search results with `selenium`, but that dataset is much smaller and no such collision occurred in it. So, treating `title` as unique was valid for the `selenium` search results, but it is invalid for search results scraped with the `requests` library.
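The collision described in this observation can be reproduced on toy data. The cleaning function below is a hypothetical stand-in (the actual cleaning used during scraping is not shown in this notebook), and the titles and URLs are made up.

```python
import re

import pandas as pd


def clean_title(raw_title):
    """Hypothetical cleaner: drop everything except letters, digits and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]", "", raw_title).strip()


# Two distinct listings whose names differ only by a special character
raw = pd.DataFrame(
    {
        "raw_title": ["Game X", "Game X!", "Game Y"],
        "url": [
            "https://example.com/app/1/",
            "https://example.com/app/2/",
            "https://example.com/app/3/",
        ],
    }
)
raw["title"] = raw["raw_title"].map(clean_title)

# Count distinct URLs per cleaned title and keep titles with more than one
urls_per_title = raw.groupby("title")["url"].nunique()
collisions = urls_per_title[urls_per_title > 1]
print(collisions)
```

After cleaning, "Game X" and "Game X!" share one `title` but still have two distinct URLs, so only the URL identifies each listing.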

Next, we'll check if the same problem persists for the URL column, by counting the number of URLs that are associated with more than one listing `title`

In [None]:
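%%time
# Number of rows (titles) per URL, after exact duplicates were dropped above
titles_per_url = (
    df_search_results_requests.groupby("url")["title"]
    .count()
    .sort_values(ascending=False)
)
num_urls_with_multiple_titles = len(titles_per_url[titles_per_url > 1])
print(f"Number of URLs with more than one Title = {num_urls_with_multiple_titles}")
assert num_urls_with_multiple_titles == 0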

We'll also check if the number of unique URLs and the number of rows in this search results dataset (scraped with `requests`) agree with each other

In [None]:
print(
    f"Number of unique URLs in search results dataset = {df_search_results_requests['url'].nunique():,}\n"
    f"Number of rows in search results dataset = {len(df_search_results_requests):,}"
)
assert df_search_results_requests["url"].nunique() == len(df_search_results_requests)

**Notes**
1. Preprocessing to remove special characters was not performed on the URL column of this dataset.

**Observations**
1. As we can see from the two checks above, the URL column of this dataset is unique. There are no listings which share the same URL but have different `title`s, and there is a unique URL for every row of the `DataFrame`.

We can now use the `requests` library to scrape listings from this dataset where each URL corresponds to a single listing. When that dataset is collected and we need to merge it with this dataset of search results, we will do so on the `url` column of both datasets since that is a unique column.
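One way such a merge on `url` could guard against accidental duplicates is the `validate` argument of `DataFrame.merge`. The sketch below is an assumption about how the later merge might be written, not the code used in the later notebook, and all values are made up.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy search results and scraped listings, both keyed on a unique url
search = pd.DataFrame(
    {
        "url": ["https://example.com/app/1/", "https://example.com/app/2/"],
        "title": ["Game A", "Game B"],
    }
)
listings = pd.DataFrame(
    {
        "url": ["https://example.com/app/1/", "https://example.com/app/2/"],
        "developer": ["Studio A", "Studio B"],
    }
)

# validate="one_to_one" checks that url is unique on both sides
merged = search.merge(listings, on="url", how="left", validate="one_to_one")

# A duplicated url makes the same call raise instead of silently adding rows
search_with_dup = pd.concat([search, search.iloc[[0]]], ignore_index=True)
try:
    search_with_dup.merge(listings, on="url", how="left", validate="one_to_one")
    raised = False
except MergeError:
    raised = True

print(len(merged), raised)
```

With this check in place, a duplicated URL in either dataset would surface as an error at merge time rather than as extra rows in the merged result.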

View the search results dataset, with only English listings (as is the case with the `selenium` dataset), and with no duplicates

In [None]:
show_df(df_search_results_requests, 2)
show_df_dtypes_nans(df_search_results_requests)

<a id="slice-columns-to-use-for-scraping-with-using-`requests`-and-export-to-file"></a>

### 3.2. [Slice columns to use for scraping with `requests` and export to file](#slice-columns-to-use-for-scraping-with-using-`requests`-and-export-to-file)

Finally, we'll export this filtered search results dataset to a CSV file

In [None]:
df_search_results_requests[["page", "listing_counter", "title", "url"]].to_csv(
    os.path.join(requests_files_dir, "requests_listings_to_scrape.csv"),
    index=False,
)

We can now iterate over the rows of this file and scrape the listings (using `requests`) from the (unique) URL column.
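A minimal sketch of that iteration is shown below. It reads a two-row stand-in for the exported CSV from memory (rows borrowed from the sample `selenium` output shown earlier, purely for illustration), and `scrape_listing` is a hypothetical placeholder that makes no network request.

```python
import io

import pandas as pd

# Toy stand-in for requests_listings_to_scrape.csv
csv_text = """page,listing_counter,title,url
2,1,The Riftbreaker,https://store.steampowered.com/app/780310/The Riftbreaker/
2,2,War Thunder,https://store.steampowered.com/app/236390/War Thunder/
"""
df_to_scrape = pd.read_csv(io.StringIO(csv_text))


def scrape_listing(url):
    """Hypothetical placeholder: a real scraper would issue an HTTP request here."""
    return {"url": url, "status": "not fetched"}


# Visit each row once; the url column is unique, so no listing is repeated
results = [
    scrape_listing(row.url) for row in df_to_scrape.itertuples(index=False)
]
print(len(results))
```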

---

<span style="float:left">
    <a href="./3_requests_download.ipynb"><< 3 - Scraping search results with the requests library</a>
</span>

<span style="float:right">
    <a href="./5_requests_listings_download.ipynb">5 - Scraping listings with the requests library >></a>
</span>