# Getting Game Franchises from Giantbomb

We can get the data for game franchises from the Giantbomb API. My API key is f2577eac7f52b72170fc1e8c5f0cebb778f38155

*Note that throughout the notebook we save backups of the data to csv so we don't have to redownload it. The code to reload each dataframe is commented out just after the line that saves the backup, so it's possible to skip the API download and load the relevant dataframe from the csv instead.*

In [1]:
import requests
import json
import pandas as pd
from pathlib import Path
import time
import os

## Scraping franchise IDs

Create the directory to store the raw info

In [2]:
franchises_dir = Path('raw_data\\raw_json\\game_franchises\\franchises')
franchises_dir.mkdir(parents=True, exist_ok=True)

Download the franchise list one page (100 results) at a time and save each page as a json file

In [None]:
user_agent = "Joshdharris"
headers = {'User-Agent': user_agent}

# The offset moves in steps of 100 (one page); 5550 is just past the 5524 franchises the API reports
for x in range(0, 5550, 100):
    filename = "page{}-{}.json".format(x, x+99)
    # Skip pages already saved, so the cell can be re-run after an interruption
    if not Path(franchises_dir, filename).is_file():
        url = "https://www.giantbomb.com/api/franchises/?api_key=f2577eac7f52b72170fc1e8c5f0cebb778f38155&format=json&field_list=guid,name,description&offset={}".format(x)
        response = requests.get(url, headers=headers).json()
        with open(Path(franchises_dir, filename), "w") as fileOut:
            json.dump(response, fileOut)
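The cell above hard-codes both the API key and the upper bound of the range. A minimal sketch of one alternative follows, assuming the key is stored in an environment variable (the name `GIANTBOMB_API_KEY` is hypothetical) and that the offsets are derived from the total number of results. The helpers `build_franchises_url` and `page_offsets` are illustrative names, not part of the notebook.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://www.giantbomb.com/api/franchises/"

def build_franchises_url(api_key, offset, fields=("guid", "name", "description")):
    """Build the paged franchises URL used above (hypothetical helper)."""
    params = {
        "api_key": api_key,
        "format": "json",
        "field_list": ",".join(fields),
        "offset": offset,
    }
    # safe="," keeps the commas in field_list unescaped, matching the URL above
    return BASE_URL + "?" + urlencode(params, safe=",")

def page_offsets(total_results, page_size=100):
    """Return the offsets needed to cover total_results items."""
    return list(range(0, total_results, page_size))

# Assumption: the key is read from an environment variable rather than the source
api_key = os.environ.get("GIANTBOMB_API_KEY", "YOUR_KEY_HERE")
offsets = page_offsets(5524)
first_url = build_franchises_url(api_key, offsets[0])
```

With 5524 franchises and 100 results per page, `page_offsets` gives the same offsets as the hard-coded range, without needing to update the bound by hand if the total changes.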

This function will be used multiple times for the raw json files. It creates an empty dataframe, reads every json file in the specified directory into it (optionally recording the file each row came from in a path column), and returns the combined dataframe

In [3]:
def create_dataframe(desired_name, raw_directory, add_path):
    # desired_name is kept for compatibility with existing calls; it is
    # overwritten immediately, so the value passed in has no effect.
    desired_name = pd.DataFrame(data=[], columns=None)

    # List all the json files in the directory
    files = list(Path(raw_directory).glob('*.json'))

    # Read each json file and append its rows to the dataframe
    for file in files:
        with open(file) as filein:
            tempDF = pd.read_json(filein)
            # Some dataframes need the source path so we can recover the id later
            if add_path:
                tempDF["path"] = str(file)
            desired_name = pd.concat([desired_name, tempDF])
    return desired_name
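Calling `pd.concat` inside the loop copies the growing dataframe on every iteration. A hedged sketch of a variant that collects the per-file frames in a list and concatenates once is shown below. The name `load_json_pages` and the two made-up pages are assumptions for illustration only; the real files contain more fields.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_json_pages(raw_directory, add_path=False):
    """Read every *.json file in raw_directory into one dataframe."""
    frames = []
    for file in sorted(Path(raw_directory).glob("*.json")):
        with open(file) as filein:
            frame = pd.read_json(filein)
        if add_path:
            frame["path"] = str(file)
        frames.append(frame)
    # Concatenate once at the end; an empty directory yields an empty dataframe
    return pd.concat(frames) if frames else pd.DataFrame()

# Tiny demonstration with two made-up pages of two results each
with tempfile.TemporaryDirectory() as tmp:
    for offset in (0, 100):
        page = {
            "error": "OK",
            "offset": offset,
            "results": [{"guid": f"3025-{offset + 1}"}, {"guid": f"3025-{offset + 2}"}],
        }
        Path(tmp, f"page{offset}-{offset + 99}.json").write_text(json.dumps(page))
    demo = load_json_pages(tmp, add_path=True)
```

Each page contributes one row per entry in its `results` list, which is why the full franchise dataframe ends up with one row per franchise.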

Collate all the downloaded files into a single dataframe

In [4]:
franchise_ids = create_dataframe("franchise_ids", franchises_dir, False)
franchise_ids.sample()

Unnamed: 0,error,limit,offset,number_of_page_results,number_of_total_results,status_code,results,version
54,OK,100,2900,100,5524,1,"{'description': None, 'guid': '3025-3090', 'na...",1


In [5]:
franchise_ids.shape

(5524, 8)

Remove unnecessary columns. Again, create a function as it's used multiple times

In [6]:
def drop_unnecessary_columns(dataframe):
    dataframe = dataframe.drop(["error","limit", "offset", "number_of_page_results", "number_of_total_results", "status_code", "version"], axis = 1)
    return dataframe

In [7]:
franchise_ids = drop_unnecessary_columns(franchise_ids)
franchise_ids.shape

(5524, 1)

Extract information from results

In [8]:
franchise_ids["guid"] = [d.get('guid') for d in franchise_ids.results]
franchise_ids["name"] = [d.get('name') for d in franchise_ids.results]
franchise_ids["description"] = [d.get('description') for d in franchise_ids.results]

Get list of all franchise ids.

In [9]:
franchise_id_list = franchise_ids["guid"].tolist()
print(len(franchise_id_list))

5524


Make sure we have no duplicates of the ids

In [10]:
franchise_ids.guid.duplicated().sum()

0

Check out the resultant dataframe

In [11]:
franchise_ids.sample()

Unnamed: 0,results,guid,name,description
34,"{'description': None, 'guid': '3025-1177', 'na...",3025-1177,UMS,


Now that we've extracted all the information from the results column, we can drop it.

In [12]:
franchise_ids = franchise_ids.drop("results", axis = 1)
franchise_ids.sample()

Unnamed: 0,guid,name,description
22,3025-2020,Robot Unicorn Attack,<p>The Robot Unicorn Attacks are <a data-ref-i...


Finally, we can save the data to a csv file. This way we can reload it if we encounter any issues later on rather than redownloading the data

In [14]:
giantbomb_dir = Path("raw_data\\giantbomb_data")
giantbomb_dir.mkdir(parents = True, exist_ok=True)

In [15]:
franchise_ids.to_csv("raw_data\\giantbomb_data\\franchise_ids.csv")

In [17]:
#If necessary, uncomment the following line of code to load the dataframe from the csv file
#franchise_ids = pd.read_csv("raw_data\\giantbomb_data\\franchise_ids.csv", index_col = 0)

## Scraping game IDs

Create directory for game franchises

In [60]:
games_dir = Path('raw_data\\raw_json\\game_franchises\\games')
games_dir.mkdir(parents=True, exist_ok=True)

Download a json file for each franchise. The API caps us at 200 requests an hour, so we sleep for 18 seconds after each request (3600 / 200 = 18) to keep the API from blocking us.

In [None]:
for franchise_id in franchise_id_list:
    filename = "{}.json".format(franchise_id)
    # Only request franchises that haven't been downloaded yet
    if not Path(games_dir, filename).is_file():
        url = "https://www.giantbomb.com/api/franchise/{}/?api_key=f2577eac7f52b72170fc1e8c5f0cebb778f38155&format=json&field_list=games".format(franchise_id)
        response = requests.get(url, headers=headers).json()
        with open(Path(games_dir, filename), "w") as fileOut:
            json.dump(response, fileOut)
        # 3600 seconds / 200 requests = 18 seconds between requests
        time.sleep(18)
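The 18 second pause follows directly from the hourly cap. The sketch below works the delay out from the cap and estimates how long a full run takes. The function names are hypothetical, and the `fetch` and `sleep` arguments are injected so the loop can be tried without touching the network.

```python
import time

def seconds_between_requests(requests_per_hour):
    """Smallest delay that keeps a steady stream under the hourly cap."""
    return 3600 / requests_per_hour

def estimated_hours(n_requests, requests_per_hour):
    """Rough total duration when every request is followed by the delay."""
    return n_requests * seconds_between_requests(requests_per_hour) / 3600

def throttled(items, fetch, delay, sleep=time.sleep):
    """Call fetch(item) for each item, pausing `delay` seconds after each call."""
    results = []
    for item in items:
        results.append(fetch(item))
        sleep(delay)
    return results

delay = seconds_between_requests(200)   # the cap quoted above
hours = estimated_hours(5524, 200)      # one request per franchise id

# Dry run with a stand-in fetch function and a no-op sleep
dry_run = throttled(["3025-1", "3025-2"], fetch=str.upper, delay=delay,
                    sleep=lambda seconds: None)
```

At this rate a full pass over the franchise ids takes more than a day, which is why the loop skips files that already exist and can be resumed.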

Create a games dataframe from all the games in the directory

In [61]:
games_ids = create_dataframe("games_ids", games_dir, True)
games_ids.sample()

Unnamed: 0,error,limit,offset,number_of_page_results,number_of_total_results,status_code,results,version,path
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\3025-4...


Edit the path column. We want to keep just the franchise id from each path, and then rename the column to franchise_id

In [62]:
# Note that the lambda function is dependent on the path name. We're removing the first 40 characters, which hold the
# directory information, and the last 5 (which hold .json).
games_ids["path"] = games_ids["path"].apply(lambda x : x[40:-5])
games_ids = games_ids.rename(columns = {"path":"franchise_id"})
games_ids

Unnamed: 0,error,limit,offset,number_of_page_results,number_of_total_results,status_code,results,version,franchise_id
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-1
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-10
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-100
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-101
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-102
...,...,...,...,...,...,...,...,...,...
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-94
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-95
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-97
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-98
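The fixed offsets in the slice break as soon as the directory name changes. One possible alternative, sketched here under the assumption that every file is named `<franchise id>.json`, is to take the file name without its extension. `franchise_id_from_path` is a hypothetical helper; `PureWindowsPath` is used so the backslash-separated paths parse the same way on any operating system.

```python
from pathlib import PureWindowsPath

def franchise_id_from_path(path):
    """Return the file name without its extension, e.g. '3025-100'."""
    # PureWindowsPath parses backslash-separated paths on any operating system
    return PureWindowsPath(path).stem

example = r"raw_data\raw_json\game_franchises\games\3025-100.json"
by_stem = franchise_id_from_path(example)
by_slice = example[40:-5]   # the fixed-offset approach used above

# The same helper works unchanged for a deeper directory
nested = franchise_id_from_path(
    r"raw_data\raw_json\game_franchises\games\missing\3025-874.json"
)
```

It could be applied to the column with `games_ids["path"].apply(franchise_id_from_path)` in place of the lambda.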


### Retrieving missing values

We have only 5346 results, but 5524 franchise ids. The cause, we discovered, is that the API is a little buggy: for some franchises, a request with the field_list restriction returns an empty result. The workaround is to download the full record for those franchises and extract the game information from that. We'll first get a list of all the franchises that are missing

In [63]:
# Lists all json files in directory
games_in_dir = os.listdir(games_dir)
# Removes .json extension from names
for i in range(len(games_in_dir)):
    games_in_dir[i] = games_in_dir[i][:-5]
# Creates a new list of all the games in the franchise list, but not in the directory
missing_apis = [x for x in franchise_id_list if x not in games_in_dir]
# Check number of missing files
print(len(missing_apis))

183
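As a sketch of the same check, the comparison can be done against a set of the downloaded ids, which avoids scanning a list for every franchise and ignores anything in the directory that isn't a json file. `missing_ids` and the sample names are assumptions for illustration.

```python
def missing_ids(expected_ids, filenames, suffix=".json"):
    """Return the expected ids that have no matching file, keeping their order."""
    present = {name[:-len(suffix)] for name in filenames if name.endswith(suffix)}
    return [x for x in expected_ids if x not in present]

demo_missing = missing_ids(
    ["3025-1", "3025-2", "3025-3"],
    ["3025-1.json", "3025-3.json", "missing"],  # a sub-directory name is ignored
)
```

In the notebook this would be called as `missing_ids(franchise_id_list, os.listdir(games_dir))`.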


We'll put the missing files into a separate directory and make a separate dataframe so we don't mess up anything with the current dataframe

In [64]:
missing_games_dir = Path('raw_data\\raw_json\\game_franchises\\games\\missing')
missing_games_dir.mkdir(parents=True, exist_ok=True)

In [28]:
for id in missing_apis:
    filename = "{}.json".format(id)
    if not Path(missing_games_dir, filename).is_file():
        url = "https://www.giantbomb.com/api/franchise/{}/?api_key=f2577eac7f52b72170fc1e8c5f0cebb778f38155&format=json".format(id)
        response = json.loads((requests.get(url, headers=headers)).content)
        with open(Path(missing_games_dir, filename), "w") as fileOut:
            json.dump(response, fileOut)

In [65]:
missing_games_ids = create_dataframe("missing_games_ids", missing_games_dir, True)
missing_games_ids

Unnamed: 0,error,limit,offset,number_of_page_results,number_of_total_results,status_code,results,version,path
aliases,OK,1,0,1,1,1,,1,raw_data\raw_json\game_franchises\games\missin...
api_detail_url,OK,1,0,1,1,1,https://www.giantbomb.com/api/franchise/3025-1...,1,raw_data\raw_json\game_franchises\games\missin...
characters,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\missin...
concepts,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\missin...
date_added,OK,1,0,1,1,1,2008-08-25 13:52:36,1,raw_data\raw_json\game_franchises\games\missin...
...,...,...,...,...,...,...,...,...,...
locations,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\missin...
name,OK,1,0,1,1,1,Shaman King,1,raw_data\raw_json\game_franchises\games\missin...
objects,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\missin...
people,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,raw_data\raw_json\game_franchises\games\missin...


We'll remove any rows that don't contain the game listings, rename the path column and extract the games ids

In [66]:
# Again, if the directory is changed, the relevant number in the lambda function must be changed as well
missing_games_ids = missing_games_ids.loc[missing_games_ids.index.isin(['games'])].copy()
missing_games_ids.loc[:, "franchise_id"] = missing_games_ids["path"].apply(lambda x: x[48:-5])
missing_games_ids = missing_games_ids.drop("path", axis = 1)

In [67]:
missing_games_ids.shape

(183, 9)

In [68]:
missing_games_ids.sample()

Unnamed: 0,error,limit,offset,number_of_page_results,number_of_total_results,status_code,results,version,franchise_id
games,OK,1,0,1,1,1,[{'api_detail_url': 'https://www.giantbomb.com...,1,3025-874


Finally, we'll combine our two dataframes to reintegrate the missing values into our original dataframe

In [69]:
games_ids = pd.concat([games_ids, missing_games_ids])
games_ids.franchise_id.count()

5524

And we can now see we have all the values we expected. We'll delete the missing_games_ids dataframe to save on memory

In [70]:
del missing_games_ids

### Working with the combined dataframe

Remove the unnecessary columns, so we're just left with results and franchise ids.

In [71]:
games_ids = drop_unnecessary_columns(games_ids)
games_ids.sample()

Unnamed: 0,results,franchise_id
games,[{'api_detail_url': 'https://www.giantbomb.com...,3025-5446


Explode the results so every game has its own row

In [72]:
games_ids = games_ids.explode("results")
games_ids.count()

results         35117
franchise_id    35139
dtype: int64
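The two counts differ because `explode` keeps a row for an empty list and fills it with a missing value, so a franchise with no games still contributes a franchise_id but no result. A small made-up example of that behaviour:

```python
import pandas as pd

demo = pd.DataFrame({
    "results": [[{"id": 1}, {"id": 2}], [], [{"id": 3}]],
    "franchise_id": ["3025-1", "3025-2", "3025-3"],
})

# Each list element gets its own row; the empty list becomes a single NaN row
exploded = demo.explode("results")
counts = exploded.count()
```

Here `counts` reports one fewer value for results than for franchise_id, mirroring the gap in the output above.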

We have some empty results so we'll drop those before moving on

In [73]:
games_ids = games_ids.dropna()
games_ids.count()

results         35117
franchise_id    35117
dtype: int64

We extract each game's id, name and API detail url. The url lets us download further information on the individual games later on

In [74]:
games_ids[["game_id", "name", "api_url"]] = games_ids["results"].apply(lambda x: pd.Series([x["id"], x["name"], x["api_detail_url"]]))

In [75]:
games_ids["game_id"].nunique()

29208

There are 29208 unique game ids among the 35117 rows, so around 5900 rows are repeat appearances of games that belong to more than one franchise. We can look at the ids of the games that appear in the most franchises

In [76]:
games_ids["game_id"].value_counts().head()

6949     7
569      6
33847    6
73766    6
39509    5
Name: game_id, dtype: int64

In [77]:
games_ids = games_ids.drop("results", axis = 1)
games_ids.sample()

Unnamed: 0,franchise_id,game_id,name,api_url
games,3025-630,8724,Shining Force: The Sword of Hajya,https://www.giantbomb.com/api/game/3030-8724/


## Scraping additional game metadata

We can also get additional metadata for the game which will allow us to do further analysis on the different games in the franchises if we desire to.

In [79]:
game_dir = Path('raw_data\\raw_json\\game_franchises\\game')
game_dir.mkdir(parents=True, exist_ok=True)

Get all the unique game ids and place in a dictionary, with their relevant api url

In [80]:
# Keep one row per game id so each id is paired with the url from its own row
game_api_dict = games_ids.drop_duplicates("game_id").set_index("game_id")["api_url"].to_dict()
print(len(game_api_dict))

29208


Download the metadata. Again, the 200 requests an hour restriction means we need to put in the time delay

In [None]:
for game_id, api_url in game_api_dict.items():
    filename = "{}.json".format(game_id)
    if not Path(game_dir, filename).is_file():
        url = "{}?api_key=f2577eac7f52b72170fc1e8c5f0cebb778f38155&format=json".format(api_url)
        response = json.loads((requests.get(url, headers=headers)).content)
        with open(Path(game_dir, filename), "w") as fileOut:
            json.dump(response, fileOut)
        time.sleep(18)

### Combining the game metadata

Due to the nature of the metadata files, it takes a very large amount of time and memory to create a new dataframe the way we created all the others. A much quicker way is to open each file, extract the relevant information and append it to lists, which we can then convert to a dataframe.

In [86]:
#Create empty lists to store the information for each column
ids = []
release_years = []
aliases = []
developers =[]
genres =[]
platforms =[]
publishers =[]
rating = []

#Iterate through the folder and retrieve the required information, appending to the relevant list
files = list(Path("raw_data/raw_json/game_franchises/game/").glob('*.json'))
for file in files:
    with open(file) as filein:
        tempDF = pd.read_json(filein)
        ids.append(tempDF["results"].get("id"))
        release_years.append(tempDF["results"].get("expected_release_year"))
        aliases.append(tempDF["results"].get("aliases"))
        developers.append(tempDF["results"].get("developers"))
        genres.append(tempDF["results"].get("genres")) 
        platforms.append(tempDF["results"].get("platforms"))
        publishers.append(tempDF["results"].get("publishers")) 
        rating.append(tempDF["results"].get("original_game_rating")) 
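For comparison, here is a minimal sketch of the same collect-then-convert idea using the standard `json` module, so no temporary dataframe is built per file. It assumes each file holds a `results` object with the fields listed, and it skips files where `results` is not an object; that skipping rule and the name `collect_rows` are assumptions, not something the notebook does.

```python
import json
import tempfile
from pathlib import Path

FIELDS = ["id", "expected_release_year", "aliases", "developers",
          "genres", "platforms", "publishers", "original_game_rating"]

def collect_rows(directory, fields=FIELDS):
    """Return one dict per JSON file, holding only the requested fields."""
    rows = []
    for file in sorted(Path(directory).glob("*.json")):
        with open(file) as filein:
            results = json.load(filein).get("results")
        # Assumption: a file without a usable "results" object is skipped
        if not isinstance(results, dict):
            continue
        rows.append({field: results.get(field) for field in fields})
    return rows

# Two made-up files: one with a game record, one with an empty result
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.json").write_text(json.dumps(
        {"results": {"id": 1, "aliases": "Example alias",
                     "genres": [{"name": "Action"}]}}))
    Path(tmp, "2.json").write_text(json.dumps({"results": []}))
    rows = collect_rows(tmp)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, with missing fields showing up as `None`.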

In [93]:
#Combine the lists into rows (one tuple per game), then create a dataframe with named columns
metadata_rows = list(zip(ids, release_years, aliases, developers, genres, platforms, publishers, rating))
metadataframe = pd.DataFrame(metadata_rows, columns=["id", "release_year", "aliases", "developers", "genres", "platforms", "publishers", "rating"])

#Sort the dataframe by game id
metadataframe = metadataframe.sort_values("id")
metadataframe

Unnamed: 0,id,release_year,aliases,developers,genres,platforms,publishers,rating
0,1,1992.0,Desert Strike Advance,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...
9677,3,,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,
14192,4,1986.0,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,
21240,6,,SVR 2007,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...
26238,8,,Battle Formula,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,
...,...,...,...,...,...,...,...,...
28707,88822,,,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...
28708,88824,,,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...
28709,88831,,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,
28710,88834,,,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,[{'api_detail_url': 'https://www.giantbomb.com...,


Convert the lists of dictionaries in each column into lists of names

In [94]:
# The same conversion applies to every column that holds a list of dictionaries
for column in ["developers", "genres", "platforms", "publishers", "rating"]:
    metadataframe[column] = metadataframe[column].apply(lambda x: [d["name"] for d in x] if x is not None else None)
metadataframe.sample()

Unnamed: 0,id,release_year,aliases,developers,genres,platforms,publishers,rating
11077,33113,1987.0,,[NCS Corporation],[Strategy],"[MSX, Sharp X68000, NEC PC-8801, NEC PC-9801, ...",[NCS Corporation],


Combine the dataframes and remove the redundant columns

In [96]:
games_ids = pd.merge(games_ids, metadataframe, how='left', left_on='game_id', right_on='id')
games_ids = games_ids.drop(["id","api_url"], axis = 1)
games_ids

Unnamed: 0,franchise_id,game_id,name,release_year,aliases,developers,genres,platforms,publishers,rating
0,3025-1,544,Super Mario All-Stars & Super Mario World,1994.0,Super Mario All-Stars and Super Mario World,"[Nintendo EAD, SRD Co. Ltd.]","[Compilation, Platformer]",[Super Nintendo Entertainment System],[Nintendo],[ESRB: K-A]
1,3025-1,6649,Super Mario Sunshine,,,[Nintendo EAD],"[Platformer, Action-Adventure]","[GameCube, Nintendo Switch]",[Nintendo],"[ESRB: E, OFLC: G, PEGI: 3+]"
2,3025-1,7314,Super Mario Bros. Deluxe,,Super Mario Bros. DX,[Nintendo R&D2],"[Action, Adventure, Compilation, Puzzle, Platf...","[Game Boy Color, Nintendo 3DS eShop]",[Nintendo],[ESRB: E]
3,3025-1,7358,Super Mario RPG: Legend of the Seven Stars,,Mario RPG\r\nSMRPG,"[Squaresoft, Square Enix]","[Adventure, Role-Playing]","[Super Nintendo Entertainment System, Wii Shop...",[Nintendo],"[ESRB: K-A, PEGI: 3+, ESRB: E, OFLC: G, CERO: A]"
4,3025-1,7406,Super Mario Bros. 2,,Super Mario USA\nSuper Mario Bros. USA\nSMB2,"[Nintendo EAD, SRD Co. Ltd.]",[Platformer],"[Nintendo Entertainment System, Wii Shop, Nint...",[Nintendo],"[ESRB: E, CERO: All Ages]"
...,...,...,...,...,...,...,...,...,...,...
35112,3025-999,26577,Shaman King: Funbari Spirits,,,[Dimps Corporation],[Fighting],[PlayStation 2],"[Bandai Co., Ltd.]",
35113,3025-999,26575,"Shaman King: Legacy of the Spirits, Soaring Hawk",,,,[Role-Playing],[Game Boy Advance],[Konami],
35114,3025-999,26581,Shaman King Chou Senjiryokketsu: Meramera/Funb...,,,[Studio Saizensen],[Action],[Game Boy Color],[King Records],
35115,3025-999,26580,Shaman King: Soul Fight!,,,"[Bandai Co., Ltd.]",[Fighting],[GameCube],"[Bandai Co., Ltd.]",
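Because a game can sit in several franchises, this is a many-to-one merge: many rows on the left share one metadata row on the right. A hedged sketch with made-up rows shows how `validate` and `indicator` can be used to confirm that shape and to count games with no metadata.

```python
import pandas as pd

games = pd.DataFrame({
    "franchise_id": ["3025-1", "3025-1", "3025-2"],
    "game_id": [544, 6649, 6649],
})
metadata = pd.DataFrame({"id": [544, 6649], "release_year": [1994.0, None]})

merged = pd.merge(
    games, metadata,
    how="left", left_on="game_id", right_on="id",
    validate="many_to_one",   # raises if an id appears twice in metadata
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
unmatched = int((merged["_merge"] == "left_only").sum())
```

A left merge keeps every row of the games dataframe, so the row count should be unchanged after merging.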


Finally we'll export all the collected data to a csv file. 

In [97]:
clean_data_dir = Path('clean_data')
clean_data_dir.mkdir(exist_ok=True)

In [98]:
games_ids.to_csv("clean_data\\clean_giantbomb_games_db.csv")
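One thing to keep in mind when reloading this file: columns that hold lists are written to csv as text, so they come back as strings. A minimal sketch of the round trip, using an in-memory buffer and a made-up row, with `ast.literal_eval` as one possible way to parse the text back into a list:

```python
import ast
import io

import pandas as pd

original = pd.DataFrame({"game_id": [544],
                         "genres": [["Compilation", "Platformer"]]})

# Write to an in-memory buffer instead of a file for the demonstration
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=0)
as_text = reloaded.loc[0, "genres"]      # a string, not a list
as_list = ast.literal_eval(as_text)      # parsed back into a list
```

Rows with no value are read back as missing values rather than strings, so they would need to be skipped before parsing.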

*Note that at this stage the data could be cleaned further. For example, we could do a little more to organise fields like developers and publishers if we want to use this data later.*