# Advanced Python
## `scrapy` practice with Steam

This notebook shows an example of parsing [**Steam**](https://store.steampowered.com/) (a digital distribution service for video games) with a `scrapy` spider. The collected data can later serve as a basis for more advanced analytics.

## Solution

You can read a detailed report of experiments in `README.md` in the repository.

Class for data collection:

In [1]:
%%writefile steam\steam\items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SteamItem(scrapy.Item):
    link = scrapy.Field()
    platforms = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
    categories = scrapy.Field()
    rev_number = scrapy.Field()
    overall_score = scrapy.Field()
    rel_date = scrapy.Field()
    developer = scrapy.Field()
    tags = scrapy.Field()


Overwriting steam\steam\items.py


Pipeline with result filtering:

In [2]:
%%writefile steam\steam\pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json


class SteamPipeline:

    def __init__(self):
        self.file = None

    def open_spider(self, spider):
        self.file = open("items.json", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        try:
            # in the normal case the date format will be as follows: '6 Oct, 2022'
            game_year = int(item["rel_date"].split()[-1])
            # check: if the year is later than or equal to 2000, record the information
            if game_year >= 2000:
                line = json.dumps(ItemAdapter(item).asdict()) + '\n'
                self.file.write(line)
        except (ValueError, IndexError):
            # ValueError: the last token is not a number, e.g. 'coming soon'
            # IndexError: the release date is an empty string
            # in both cases we skip the item
            pass

        return item


Overwriting steam\steam\pipelines.py
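The year check inside `process_item` can be tried out without running a crawl. Below is a minimal sketch of the same rule as two standalone functions; the names `release_year`, `should_record` and `min_year` are hypothetical and are not part of the pipeline above.

```python
def release_year(rel_date):
    """Return the release year as an int, or None when no year can be read."""
    parts = rel_date.split()
    if not parts:
        return None  # empty string: no tokens to inspect
    try:
        # the year is expected to be the last token, as in '6 Oct, 2022'
        return int(parts[-1])
    except ValueError:
        return None  # the last token is a word, as in 'Coming soon'


def should_record(rel_date, min_year=2000):
    """Mirror the pipeline rule: keep items released in min_year or later."""
    year = release_year(rel_date)
    return year is not None and year >= min_year


for sample in ['6 Oct, 2022', '12 Nov, 1998', 'Coming soon', '']:
    print(repr(sample), should_record(sample))
```

Keeping the rule in a pure function like this makes it easy to check edge cases such as an empty release date before wiring it into a pipeline.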


The parser itself:

In [3]:
%%writefile steam\steam\spiders\steam_spider.py
import scrapy
import requests
import re

from bs4 import BeautifulSoup
from lxml import etree
from steam.items import SteamItem

queries = ['shooter', 'racing', 'survival']  # search queries (what we are looking for)
API = ''  # API key for ScraperAPI (not needed)


def parse_links_on_games(url):
    """
    Collect valid links to all games returned by a search query
    (usually 25 per results page).
    :param url: str, a search URL of the form
        f'https://store.steampowered.com/search/?term={query}&ignore_preferences=1&page={page}'
    :return: list, links to all games found on that results page (usually 25)
    """
    # read the page input
    r = requests.get(url)
    page = r.content.decode("utf-8")
    soup = BeautifulSoup(page, 'html.parser')

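    # look for links to games
    # first take the container that holds all the candidate links
    all_games = soup.find('div', attrs={'id': 'search_resultsRows'})  # a single element wrapping all the games
    if all_games is None:
        # no results container on the page (e.g. no matches): nothing to collect
        return []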

    links = set()
    for game in all_games.find_all('a'):
        # run some tests
        if game.get('href') is not None:
            link = game.get('href')
            if link == '' or 'app' not in link or 'agecheck' in link:
                print('BAD', link)
                # skip empty links, non-app pages and age-check pages
                continue
            else:
                links.add(link)

    return list(links)


def make_start_urls(queries):
    """
    A function that searches for links to all games on the first two pages 
    of Steam search (without filters).
    :param queries: list[str], a list of natural language queries,
        that the user is interested in searching for
        e.g. ['shooter', 'racing', 'survival']
    :return: list[str], a list of links to games matching the search
        for the specified keywords
    """
    start_urls = []
    for query in queries:
        for page in range(1, 3):
            # link format:
            # https://store.steampowered.com/search/?term=shooter&ignore_preferences=1&page=1
            # term - the search query text
            # ignore_preferences - not to take into account any filters (geographical, linguistic)
            # page - page number (even if the default is infinite scrolling)
            url = f'https://store.steampowered.com/search/?term={query}&ignore_preferences=1&page={page}'
            print(url)
            # collect all links to games found for the query
            games_links = parse_links_on_games(url)
            # add to the overall list
            start_urls.extend(games_links)

    return start_urls


class SteamSpider(scrapy.Spider):
    name = 'Steam'
    allowed_domains = ['store.steampowered.com']

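    def __init__(self, *args, **kwargs):
        # let scrapy.Spider finish its own initialization first
        super().__init__(*args, **kwargs)
        # search for starting links, i.e., those that match the specified query
        self.start_urls = make_start_urls(queries)
        self.log(f'Found {len(self.start_urls)} links')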

    def parse(self, response):
        """
        Method that parses a specific game (product) link
        :param response: scrapy response for a URL taken from self.start_urls
        """
        
        # I will use both BeautifulSoup and XPath (querying the parsed tree via lxml)
        soup = BeautifulSoup(response.body, 'html.parser')
        dom = etree.HTML(str(soup))

        # Looking for a link to the game
        link_raw = soup.find('meta', attrs={'property': 'og:url'}).get('content')
        link = link_raw.strip()

        # Game title
        name_raw = dom.xpath('//div[@id="appHubAppName"][@class="apphub_AppName"]/text()')
        name = ''.join(name_raw).strip()

        # The categories to which the game belongs
        categories_raw = dom.xpath('//div[@class="blockbg"]/a/text()')
        categories = '/'.join([one_category.strip() for one_category in categories_raw[1:]])  # the game's own title is not included here

        # Number of reviews
        rev_number_raw = dom.xpath('//div[@itemprop="aggregateRating"]/div[@class="summary column"]/span[@class="responsive_hidden"]/text()')
        rev_number = ','.join([re.sub(r'\D', '', rev_num) for rev_num in rev_number_raw])  # there should always be exactly one number here, but just in case...

        # Overall score based on feedback
        overall_score_raw = dom.xpath('//div[@itemprop="aggregateRating"]/div[@class="summary column"]/span[@class="game_review_summary positive"]/text()')
        overall_score = ''.join(overall_score_raw).strip()

        # Release date
        rel_date_raw = dom.xpath('//div[@class="release_date"]/div[@class="date"]/text()')
        rel_date = ''.join(rel_date_raw).strip()

        # Developer (usually one)
        developer_raw = dom.xpath('//div[@class="dev_row"]/div[@id="developers_list"]/a/text()')
        developer = ','.join([one_developer.strip() for one_developer in developer_raw])

        # Tags (usually many)
        tags_raw = dom.xpath('//div[@class="glance_tags popular_tags"]/a/text()')
        tags = '/'.join([one_tags.strip() for one_tags in tags_raw])

        # Price
        price_raw = dom.xpath('//div[@class="discount_final_price"]/text()')
        if len(price_raw) == 0:
            # if there is no discount, this field will be empty
            price_raw = dom.xpath('//div[@class="game_purchase_price price"]/text()')

        price = '/'.join(price_raw).strip()

        # Available platforms
        platforms_raw = dom.xpath('//div[@class="sysreq_tabs"]/div/text()')
        platforms = '/'.join([one_platform.strip() for one_platform in platforms_raw])

        # Create an instance of the SteamItem class, pass information to it
        item = SteamItem()

        item["name"] = name
        item["categories"] = categories
        item["rev_number"] = rev_number
        item["overall_score"] = overall_score
        item["rel_date"] = rel_date
        item["developer"] = developer
        item["tags"] = tags
        item["price"] = price
        item["platforms"] = platforms
        item["link"] = link

        yield item


Overwriting steam\steam\spiders\steam_spider.py
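The URL format and the link checks used in the spider can be exercised offline. The sketch below is an illustration only: `SEARCH_URL`, `build_search_url` and `is_game_link` are hypothetical names, `urlencode` is swapped in for the f-string so that queries containing spaces are escaped, and the link check is a slightly stricter variant of the one in `parse_links_on_games` (it looks at the URL path instead of searching the whole string for `'app'`).

```python
from urllib.parse import urlencode, urlsplit

SEARCH_URL = 'https://store.steampowered.com/search/'


def build_search_url(query, page):
    """Build a search URL with the same three parameters the spider uses."""
    params = {'term': query, 'ignore_preferences': 1, 'page': page}
    return f'{SEARCH_URL}?{urlencode(params)}'


def is_game_link(link):
    """Accept only links whose path starts with /app/.

    Empty links, age-check pages (/agecheck/app/...) and other page types
    are rejected.
    """
    if not link:
        return False
    return urlsplit(link).path.startswith('/app/')


print(build_search_url('shooter', 1))
print(build_search_url('tower defense', 2))
print(is_game_link('https://store.steampowered.com/app/1620410/Federation77/'))
print(is_game_link('https://store.steampowered.com/agecheck/app/1620410/'))
```

Whether the stricter path check is preferable depends on which links Steam actually returns in the search results; the original substring test is more permissive.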


### Check results

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_json('examples/items.json')
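Note that `SteamPipeline.process_item` writes `json.dumps(...)` followed by `'\n'`, i.e. one JSON object per line rather than a single JSON document. The notebook does not show whether `examples/items.json` has that layout, so the following is a sketch under the assumption that it does; `read_json_lines` is a hypothetical helper, and the two sample records reuse names and dates from the collected data.

```python
import io
import json


def read_json_lines(stream):
    """Parse a stream holding one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


# an in-memory stand-in for a file written by the pipeline
sample = io.StringIO(
    '{"name": "Federation77", "rel_date": "6 Oct, 2022"}\n'
    '{"name": "RED EVIL", "rel_date": "7 Nov, 2019"}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]['name'])
```

pandas offers the same behavior through the `lines=True` argument of `pd.read_json`.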

In [7]:
df.head()

| Unnamed: 0 | name | categories | rev_number | overall_score | rel_date | developer | tags | price | platforms | link |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Federation77 | Action Games | 88 | Very Positive | 6 Oct, 2022 | MIROWIN | Arena Shooter/Violent/Cyberpunk/Shooter/Crime/... | 274,50 pуб./584 pуб. | | https://store.steampowered.com/app/1620410/Fed... |
| 1 | World of Tanks Blitz - The Plush Matilda | Action Games | 103 | Mostly Positive | 24 Aug, 2021 | Wargaming Group Limited | Action/Free to Play/Massively Multiplayer | 1800 pуб. | Windows/macOS | https://store.steampowered.com/app/1713900/Wor... |
| 2 | Fog Of War - Complete Edition | Violent Games/Fog Of War - Free Edition/Downlo... | 18 | Positive | 22 Feb, 2018 | Monkeys Lab. | Massively Multiplayer/Indie/Strategy/Action/RP... | 199 pуб. | | https://store.steampowered.com/app/791890/Fog_... |
| 3 | CHERNOBYL: The Untold Story | Indie Games | 271 | Mostly Positive | 24 Sep, 2019 | Mehsoft | Indie/Gore/Violent/Sexual Content/Open World/A... | 149 pуб. | | https://store.steampowered.com/app/1155830/CHE... |
| 4 | World of Tanks Blitz - Starter Pack | Action Games | 737 | | 18 Jun, 2019 | Wargaming Group Limited | Free to Play/Massively Multiplayer/Action | 120 pуб. | Windows/macOS | https://store.steampowered.com/app/1072561/Wor... |


In [8]:
df.tail()

| Unnamed: 0 | name | categories | rev_number | overall_score | rel_date | developer | tags | price | platforms | link |
|---|---|---|---|---|---|---|---|---|---|---|
| 134 | Ray of Light | Indie Games/Conglomerate 5 Franchise | 36 | | 12 Jul, 2018 | SAFING | Indie/Horror/Survival/FPS/Action/Adventure/Sto... | 1258,80 pуб. | | https://store.steampowered.com/app/891100/Ray_... |
| 135 | RUINS Survival | Massively Multiplayer Games | 54 | | 4 Jul, 2019 | ATD Game Studio | Indie/Action/Simulation/Adventure/Early Access... | 259 pуб. | | https://store.steampowered.com/app/985720/RUIN... |
| 136 | Drift King: Survival | Racing Games | 60 | | 24 Nov, 2016 | Destiny.Games | Racing/Automobile Sim/Massively Multiplayer/Si... | 129 pуб. | Windows/macOS | https://store.steampowered.com/app/553290/Drif... |
| 137 | DawnWander | Adventure Games | 15 | | 22 Jun, 2021 | DarkTree Development | Early Access/FPS/PvE/Shooter/First-Person/Alte... | 259 pуб. | | https://store.steampowered.com/app/1600670/Daw... |
| 138 | RED EVIL | Action Games | 15 | | 7 Nov, 2019 | meokigame | Action/Indie/Violent/Adventure/Survival/Horror... | 540,29 pуб. | | https://store.steampowered.com/app/1169070/RED... |


In [9]:
df['link']

0      https://store.steampowered.com/app/1620410/Fed...
1      https://store.steampowered.com/app/1713900/Wor...
2      https://store.steampowered.com/app/791890/Fog_...
3      https://store.steampowered.com/app/1155830/CHE...
4      https://store.steampowered.com/app/1072561/Wor...
                             ...                        
134    https://store.steampowered.com/app/891100/Ray_...
135    https://store.steampowered.com/app/985720/RUIN...
136    https://store.steampowered.com/app/553290/Drif...
137    https://store.steampowered.com/app/1600670/Daw...
138    https://store.steampowered.com/app/1169070/RED...
Name: link, Length: 139, dtype: object
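The `price` column is stored as text: the spider joins every matching price node with `'/'`, so one cell may hold several amounts, each with a decimal comma. As a possible first step toward numeric analysis, here is a minimal sketch that extracts the amounts as floats. `parse_prices` is a hypothetical helper, and the pattern assumes amounts look like the ones in the output above (digits with an optional `,` decimal part, no thousands separators).

```python
import re

# digits with an optional decimal part after a comma, e.g. '274,50' or '584'
_AMOUNT = re.compile(r'\d+(?:,\d+)?')


def parse_prices(price):
    """Return every amount found in a price cell as a float."""
    return [float(amount.replace(',', '.')) for amount in _AMOUNT.findall(price)]


for cell in ['274,50 pуб./584 pуб.', '1800 pуб.', '']:
    print(repr(cell), parse_prices(cell))
```

The function deliberately returns all amounts in a cell and leaves the choice of which one to analyze to the caller, since the scraped text does not say what each amount refers to.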