# Bulbapedia Scraper

### Importing the needed libraries

In [1]:
import pandas as pd
import requests
import re
import json
import datetime
import os
from bs4 import BeautifulSoup

### Loading the page
First, we fetch the HTML code of the Bulbapedia page. From there, we find the elements that contain our desired information.  
https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number

In [2]:
# Site used for scraping
URL = 'https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number'

In [3]:
# Getting HTML code of page
page = requests.get(URL)
pageHTML = BeautifulSoup(page.content, 'html.parser')
pageHTML

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of Pokémon by National Pokédex number - Bulbapedia, the community-driven Pokémon encyclopedia</title>
<script src="/cdn-cgi/apps/head/gBGjtYtSEMyRflZogcJJMbLZn7I.js"></script><script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4ac98438c3bacd7ecadecb23","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Pokémon_by_National_Pokédex_number","wgTitle":"List of Pokémon by National Pokédex number","wgCurRevisionId":3498912,"wgRevisionId":3498912,"wgArticleId":65356,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"]
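
The request above does not check whether the download succeeded. A minimal sketch of a more defensive fetch could look like the following; the helper name `fetch_soup`, the 10-second timeout, and the injectable `get` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10, get=requests.get):
    """Download `url` and parse it, failing loudly on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access (an assumption
    of this sketch).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.content, 'html.parser')
```

With this helper, the cell above would reduce to `pageHTML = fetch_soup(URL)`.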

### Inspecting the page
Upon inspection, the content of the page itself is stored in a *div* element with the id *mw-content-text*, and within that 
*div* element are *table* elements that contain the information for every generation of Pokemon.

We store that *div* element in a variable and use it to find the *table* elements that hold the information we need.

In [4]:
poke_content = pageHTML.find(id = 'mw-content-text')
poke_tables = poke_content.find_all('table')
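
If the page layout ever changes, `find` returns `None` and the next call fails with a less helpful error. The sketch below wraps the same two calls with an explicit check; the function name `find_tables` and the tiny hand-written HTML sample are assumptions for illustration.

```python
from bs4 import BeautifulSoup

def find_tables(soup, container_id='mw-content-text'):
    """Return every <table> inside the element with the given id."""
    container = soup.find(id=container_id)
    if container is None:
        # find() returns None when nothing matches
        raise ValueError(f'No element with id {container_id!r} found')
    return container.find_all('table')

# Hand-written stand-in for the real page
sample = BeautifulSoup(
    '<div id="mw-content-text"><table></table><p>text</p><table></table></div>',
    'html.parser',
)
tables = find_tables(sample)
print(len(tables))
```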

On further inspection, the first element (Index 0) in *poke_tables* contains the code for the Shortcut section of the page. Index 1 and above, however, contain the elements for each generation of Pokemon.

We inspect the first two tables below, then store the one we need in its own variable.

In [5]:
poke_tables[0]

<table align="right" cellspacing="2" style="border: 2px solid #80964B; border-radius: 20px; -moz-border-radius: 20px; -webkit-border-radius: 20px; -khtml-border-radius: 20px; -icab-border-radius: 20px; -o-border-radius: 20px; margin-left: 5px; margin-bottom: 5px;" width="10%">
<tbody><tr>
<td align="center"><b>Shortcuts</b><br/><a class="mw-redirect" href="/wiki/Ndex" title="Ndex">Ndex</a><br/><a class="mw-redirect" href="/wiki/Olddex" title="Olddex">Olddex</a><br/><a class="mw-redirect" href="/wiki/Natdex" title="Natdex">Natdex</a>
</td></tr></tbody></table>

poke_tables[1] corresponds to the table for Generation 1, poke_tables[2] for Generation 2, and so on.

In [6]:
poke_tables[1]

<table align="center" style="border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px;; border: 2px solid #FF1111; background: #FF1111;">
<tbody><tr>
<th style="border-top-left-radius: 5px; -moz-border-radius-topleft: 5px; -webkit-border-top-left-radius: 5px; -khtml-border-top-left-radius: 5px; -icab-border-top-left-radius: 5px; -o-border-top-left-radius: 5px; background: #64D364"><a href="/wiki/List_of_Pok%C3%A9mon_by_Kanto_Pok%C3%A9dex_number" title="List of Pokémon by Kanto Pokédex number"><span style="color:#000;">Kdex</span></a>
</th>
<th style="background: #64D364">Ndex
</th>
<th style="background: #64D364">MS
</th>
<th style="background: #64D364">Pokémon
</th>
<th colspan="2" style="border-top-right-radius: 5px; -moz-border-radius-topright: 5px; -webkit-border-top-right-radius: 5px; -khtml-border-top-right-radius: 5px; -icab-border-top-right-radius: 5px; -o-border-top-right-radius: 5

In [7]:
# Storing the table for Generation 2 in its own variable
poke_list = poke_tables[2]
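
Because the table index lines up with the generation number, the tables can also be keyed by generation in one step. The following is a small sketch on hand-written HTML that only mimics the page layout described above (a shortcuts table followed by one table per generation); the names `gen_tables` and `first_headers` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one shortcuts table, then one table per generation
sample = """
<div id="mw-content-text">
<table><tr><td>Shortcuts</td></tr></table>
<table><tr><th>Kdex</th><th>Ndex</th></tr></table>
<table><tr><th>Jdex</th><th>Ndex</th></tr></table>
</div>
"""
tables = BeautifulSoup(sample, 'html.parser').find(id='mw-content-text').find_all('table')

# Skip index 0 (shortcuts) and number the remaining tables from 1
gen_tables = {gen: table for gen, table in enumerate(tables[1:], start=1)}

# The first header cell names the regional Pokedex of each generation
first_headers = {gen: table.find('th').get_text(strip=True)
                 for gen, table in gen_tables.items()}
print(first_headers)
```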

### Viewing the contents of 'poke_list'

Index 0 of *poke_list.contents* is a newline string. Index 1 is the *tbody* element that holds the rows for the Pokemon in this generation.

In [8]:
poke_list.contents[0]

'\n'

In [9]:
poke_list.contents[1]

<tbody><tr>
<th style="border-top-left-radius: 5px; -moz-border-radius-topleft: 5px; -webkit-border-top-left-radius: 5px; -khtml-border-top-left-radius: 5px; -icab-border-top-left-radius: 5px; -o-border-top-left-radius: 5px; background: #D6D6D6"><a href="/wiki/List_of_Pok%C3%A9mon_by_Johto_Pok%C3%A9dex_number" title="List of Pokémon by Johto Pokédex number"><span style="color:#000;">Jdex</span></a>
</th>
<th style="background: #D6D6D6">Ndex
</th>
<th style="background: #D6D6D6">MS
</th>
<th style="background: #D6D6D6">Pokémon
</th>
<th colspan="2" style="border-top-right-radius: 5px; -moz-border-radius-topright: 5px; -webkit-border-top-right-radius: 5px; -khtml-border-top-right-radius: 5px; -icab-border-top-right-radius: 5px; -o-border-top-right-radius: 5px; background: #D6D6D6">Type
</th></tr>
<tr style="background:#FFF">
<td style="font-family:monospace">#001
</td>
<td style="font-family:monospace">#152
</td>
<th><a href="/wiki/Chikorita_(Pok%C3%A9mon)" title="Chikorita"><img alt="Ch

We store that element in another variable, *poke_info*, to single out the information for each Pokemon.

In [10]:
poke_info = poke_list.contents[1]

### Viewing the contents of 'poke_info'
Below are the first few elements of *poke_info.contents*.
The content is as follows:

Index 0 contains the table headers; <br>
Index 1 contains a newline element; <br>
Index 2 contains the row for Chikorita, the first on the list of Gen. 2 pokemon; <br>
Index 3 contains a newline element; <br>
Index 4 contains the row for Bayleef, the next entry in the Pokedex.

In [11]:
# Index 0 contains the table headers
poke_info.contents[0]

<tr>
<th style="border-top-left-radius: 5px; -moz-border-radius-topleft: 5px; -webkit-border-top-left-radius: 5px; -khtml-border-top-left-radius: 5px; -icab-border-top-left-radius: 5px; -o-border-top-left-radius: 5px; background: #D6D6D6"><a href="/wiki/List_of_Pok%C3%A9mon_by_Johto_Pok%C3%A9dex_number" title="List of Pokémon by Johto Pokédex number"><span style="color:#000;">Jdex</span></a>
</th>
<th style="background: #D6D6D6">Ndex
</th>
<th style="background: #D6D6D6">MS
</th>
<th style="background: #D6D6D6">Pokémon
</th>
<th colspan="2" style="border-top-right-radius: 5px; -moz-border-radius-topright: 5px; -webkit-border-top-right-radius: 5px; -khtml-border-top-right-radius: 5px; -icab-border-top-right-radius: 5px; -o-border-top-right-radius: 5px; background: #D6D6D6">Type
</th></tr>

In [12]:
# Index 1 contains a newline element
poke_info.contents[1]

'\n'

In [13]:
# Index 2 contains the row for Chikorita, the first on the list of Gen. 2 pokemon
poke_info.contents[2]

<tr style="background:#FFF">
<td style="font-family:monospace">#001
</td>
<td style="font-family:monospace">#152
</td>
<th><a href="/wiki/Chikorita_(Pok%C3%A9mon)" title="Chikorita"><img alt="Chikorita" decoding="async" height="40" src="//archives.bulbagarden.net/media/upload/4/41/152MS6.png" width="40"/></a>
</th>
<td><a href="/wiki/Chikorita_(Pok%C3%A9mon)" title="Chikorita (Pokémon)">Chikorita</a>
</td>
<td colspan="2" style="text-align:center; background:#78C850"><a href="/wiki/Grass_(type)" title="Grass (type)"><span style="color:#FFF">Grass</span></a>
</td></tr>

In [14]:
# Index 3 contains another newline element
poke_info.contents[3]

'\n'

In [15]:
# Index 4 contains the row for Bayleef, the next entry in the Pokedex
poke_info.contents[4]

<tr style="background:#FFF">
<td style="font-family:monospace">#002
</td>
<td style="font-family:monospace">#153
</td>
<th><a href="/wiki/Bayleef_(Pok%C3%A9mon)" title="Bayleef"><img alt="Bayleef" decoding="async" height="40" src="//archives.bulbagarden.net/media/upload/7/71/153MS6.png" width="40"/></a>
</th>
<td><a href="/wiki/Bayleef_(Pok%C3%A9mon)" title="Bayleef (Pokémon)">Bayleef</a>
</td>
<td colspan="2" style="text-align:center; background:#78C850"><a href="/wiki/Grass_(type)" title="Grass (type)"><span style="color:#FFF">Grass</span></a>
</td></tr>

There is a pattern in how the rows of Pokemon information are arranged in the HTML code.

All **even**-numbered indices (apart from Index 0) are rows that contain the information of a Pokemon. <br>
All **odd**-numbered indices are newline strings.
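
The even/odd pattern comes from the newline strings that sit between the `<tr>` elements. As a hedged alternative to index arithmetic, BeautifulSoup can return only the direct `<tr>` children. The sketch below uses a hand-written three-row sample (not the real page) to show that both approaches select the same rows.

```python
from bs4 import BeautifulSoup

# Hand-written sample: header row plus two data rows, separated by newlines
sample = (
    '<table><tbody><tr><th>Ndex</th><th>Pokémon</th></tr>\n'
    '<tr><td>#152</td><td>Chikorita</td></tr>\n'
    '<tr><td>#153</td><td>Bayleef</td></tr></tbody></table>'
)
tbody = BeautifulSoup(sample, 'html.parser').find('tbody')

# Index-based selection: even indices, skipping the header at index 0
by_index = [tbody.contents[j] for j in range(2, len(tbody.contents), 2)]

# Tag-based selection: direct <tr> children only, then drop the header row
by_tag = tbody.find_all('tr', recursive=False)[1:]

names = [row.find_all('td')[1].get_text(strip=True) for row in by_tag]
print(names)
```

The tag-based version keeps working even if the whitespace between rows changes.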

### Scraping the data
Now that we know the general pattern of how the data is arranged, we can scrape it from the HTML code.<br>
The final dataframe containing the scraped data will have the following fields:

*rdex*: Regional Pokedex entry number <br>
*ndex*: National Pokedex entry number <br>
*name*: Name of Pokemon <br>
*type1*: First Type of Pokemon <br>
*type2*: Second Type of Pokemon <br>
*wiki*: Link to Pokemon's wiki page <br>
*gen*: Generation number <br>
*scrape_time*: Time of retrieving data

*Note: On the website, the column name for the regional Pokedex number varies per region (Kdex for Kanto, Jdex for Johto, and so on). To generalize, we use rdex to represent these values.*
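
One way to make this field order explicit is a named tuple. This is only a sketch: the notebook itself stores plain tuples, and the type name `PokemonRecord` is an assumption introduced here. The values are taken from the Chikorita row shown earlier and from the scraped output later in the notebook.

```python
from collections import namedtuple

# Field order matches the list above
PokemonRecord = namedtuple(
    'PokemonRecord',
    ['rdex', 'ndex', 'name', 'type1', 'type2', 'wiki', 'gen', 'scrape_time'],
)

chikorita = PokemonRecord(
    rdex='#001', ndex='#152', name='Chikorita', type1='Grass', type2='',
    wiki='https://bulbapedia.bulbagarden.net/wiki/Chikorita_(Pok%C3%A9mon)',
    gen=2, scrape_time='05/24/2022 19:47:17',
)
print(chikorita.name, chikorita.type1)
```

A named tuple still behaves like a plain tuple, so it can be appended to the same list and indexed by position.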

The following lines contain the algorithm to scrape the data from the Bulbapedia website.

In [16]:
# Initialization
extracted_poke_info = [] # Final storage of Pokemon info
rootURL = 'https://bulbapedia.bulbagarden.net' # Used later in providing the link for each Pokemon's wiki

In [17]:
# Creating the function
def scrape_gen(gen_num, time_now):
    curr_gen_list = poke_tables[gen_num]
    poke_raw_info = curr_gen_list.contents[1]
    # Loop for each Pokemon
    for j in range(0, len(poke_raw_info.contents)):
        # Rows sit at even indices; index 0 is the header row
        if j % 2 == 0 and j != 0:
            wiki = rootURL + poke_raw_info.contents[j].find('a').get('href')
            gen = gen_num
            
            pokemon = poke_raw_info.contents[j].text.strip().split('\n')
            
            # Pokemon with different regional form and 1 type
            if (len(pokemon) == 7):
                rdex = ''
                ndex = pokemon[0]
                name = pokemon[4]
                type1 = pokemon[6]
                type2 = ''
            
            # Pokemon with different regional form and 2 types
            elif (len(pokemon) == 8):
                rdex = ''
                ndex = pokemon[0]
                name = pokemon[4]
                type1 = pokemon[6]
                type2 = pokemon[7]
            
            # Pokemon with 1 form and 1 type
            elif (len(pokemon) == 9):
                rdex = pokemon[0]
                ndex = pokemon[2]
                name = pokemon[6]
                type1 = pokemon[8]
                type2 = ''
            
            # Pokemon with 1 form and 2 types
            elif (len(pokemon) == 10):
                rdex = pokemon[0]
                ndex = pokemon[2]
                name = pokemon[6]
                type1 = pokemon[8]
                type2 = pokemon[9]
            
            # Use the timestamp passed in so every row of a run shares one value
            scrape_time = time_now
            extracted_poke_info.append((rdex, ndex, name, type1, type2, wiki, gen, scrape_time))
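
The function above tells the four row layouts apart by the length of the split text, which depends on the exact whitespace in the page. A hedged alternative is to read the non-empty text pieces of a row directly. The sketch below assumes that dex numbers always start with `#` and that a regional-form row carries only the National number; the helper name `parse_row` and the second sample row are hand-written for illustration.

```python
from bs4 import BeautifulSoup

def parse_row(row):
    """Return (rdex, ndex, name, type1, type2) for one <tr> element."""
    fields = list(row.stripped_strings)  # non-empty text pieces, in order
    numbers = [f for f in fields if f.startswith('#')]
    labels = [f for f in fields if not f.startswith('#')]
    # Assumption: rows for regional forms carry only the National number
    rdex, ndex = numbers if len(numbers) == 2 else ('', numbers[0])
    name, type1 = labels[0], labels[1]
    type2 = labels[2] if len(labels) > 2 else ''
    return rdex, ndex, name, type1, type2

# Row modeled on the Chikorita output shown earlier (image attributes shortened)
regular = BeautifulSoup(
    '<tr><td>#001\n</td>\n<td>#152\n</td>\n'
    '<th><a href="/wiki/Chikorita_(Pok%C3%A9mon)"><img alt="Chikorita"/></a>\n</th>\n'
    '<td><a href="/wiki/Chikorita_(Pok%C3%A9mon)">Chikorita</a>\n</td>\n'
    '<td colspan="2"><a href="/wiki/Grass_(type)"><span>Grass</span></a>\n</td></tr>',
    'html.parser',
).find('tr')

# Hypothetical regional-form row: no regional number, two types
regional = BeautifulSoup(
    '<tr><td>#157</td><th></th><td>Typhlosion</td><td>Fire</td><td>Ghost</td></tr>',
    'html.parser',
).find('tr')

print(parse_row(regular))
print(parse_row(regional))
```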

To test the function, let's try and scrape Gen 2 pokemon from the website.

In [18]:
now = datetime.datetime.now()
time_now = now.strftime('%m/%d/%Y %H:%M:%S')
scrape_gen(2, time_now)
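
The timestamp is stored as a formatted string, so reading it back as a `datetime` needs the same format string. A short sketch using a fixed example moment (the constant name `STAMP_FORMAT` is an assumption):

```python
import datetime

STAMP_FORMAT = '%m/%d/%Y %H:%M:%S'  # same format string as the cell above

# Format a fixed moment, then parse the string back into a datetime
stamp = datetime.datetime(2022, 5, 24, 19, 47, 17).strftime(STAMP_FORMAT)
parsed = datetime.datetime.strptime(stamp, STAMP_FORMAT)

print(stamp)               # 05/24/2022 19:47:17
print(parsed.isoformat())  # 2022-05-24T19:47:17
```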

In [19]:
# Viewing the scraped data
extracted_poke_info

[('#001',
  '#152',
  'Chikorita',
  'Grass',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Chikorita_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('#002',
  '#153',
  'Bayleef',
  'Grass',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Bayleef_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('#003',
  '#154',
  'Meganium',
  'Grass',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Meganium_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('#004',
  '#155',
  'Cyndaquil',
  'Fire',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Cyndaquil_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('#005',
  '#156',
  'Quilava',
  'Fire',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Quilava_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('#006',
  '#157',
  'Typhlosion',
  'Fire',
  '',
  'https://bulbapedia.bulbagarden.net/wiki/Typhlosion_(Pok%C3%A9mon)',
  2,
  '05/24/2022 19:47:17'),
 ('',
  '#157',
  'Typhlosion',
  'Fire',
  'Ghost',
  'https://bulbapedia.bulbagarden.net/wi

Next, we convert the scraped data into a dataframe and then save it as a .json file.

In [20]:
df_pokemon_list = pd.DataFrame(extracted_poke_info)

In [21]:
# Assigning columns for the dataframe
df_pokemon_list.columns = ['Rdex', 'Ndex', 'Name', 'Type 1', 'Type 2', 'Wiki Page', 'Generation', 'Scrapetime']

In [22]:
# Filename for the .json file; feel free to edit.
filename = 'Scraped Pokedex.json'

# Saving the dataframe to a .json file in the current working directory
df_pokemon_list.to_json(filename)
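
By default, `to_json` writes a dataframe column by column. If a list of one object per Pokemon is easier to consume elsewhere, `orient='records'` is an option. The sketch below compares the two layouts on a two-row sample taken from the scraped output; it is an illustration, not a change to the notebook's file format.

```python
import json
import pandas as pd

df = pd.DataFrame(
    [('#152', 'Chikorita', 'Grass'), ('#153', 'Bayleef', 'Grass')],
    columns=['Ndex', 'Name', 'Type 1'],
)

# With no path argument, to_json returns the JSON text instead of writing a file
by_column = json.loads(df.to_json())                  # default layout
by_record = json.loads(df.to_json(orient='records'))  # one object per row

print(by_column['Name'])  # {'0': 'Chikorita', '1': 'Bayleef'}
print(by_record[0])       # {'Ndex': '#152', 'Name': 'Chikorita', 'Type 1': 'Grass'}
```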

To check if the file was created, we can use *pandas* to open the .json file.

In [23]:
pokemon_json = pd.read_json(filename)
pokemon_json

Unnamed: 0,Rdex,Ndex,Name,Type 1,Type 2,Wiki Page,Generation,Scrapetime
0,#001,#152,Chikorita,Grass,,https://bulbapedia.bulbagarden.net/wiki/Chikor...,2,05/24/2022 19:47:17
1,#002,#153,Bayleef,Grass,,https://bulbapedia.bulbagarden.net/wiki/Baylee...,2,05/24/2022 19:47:17
2,#003,#154,Meganium,Grass,,https://bulbapedia.bulbagarden.net/wiki/Megani...,2,05/24/2022 19:47:17
3,#004,#155,Cyndaquil,Fire,,https://bulbapedia.bulbagarden.net/wiki/Cyndaq...,2,05/24/2022 19:47:17
4,#005,#156,Quilava,Fire,,https://bulbapedia.bulbagarden.net/wiki/Quilav...,2,05/24/2022 19:47:17
...,...,...,...,...,...,...,...,...
100,#250,#247,Pupitar,Rock,Ground,https://bulbapedia.bulbagarden.net/wiki/Pupita...,2,05/24/2022 19:47:17
101,#251,#248,Tyranitar,Rock,Dark,https://bulbapedia.bulbagarden.net/wiki/Tyrani...,2,05/24/2022 19:47:17
102,#252,#249,Lugia,Psychic,Flying,https://bulbapedia.bulbagarden.net/wiki/Lugia_...,2,05/24/2022 19:47:17
103,#253,#250,Ho-Oh,Fire,Flying,https://bulbapedia.bulbagarden.net/wiki/Ho-Oh_...,2,05/24/2022 19:47:17


### (Bonus) Appending to the existing .json file
Currently, the 'Scraped Pokedex.json' file contains a list of all Generation 2 Pokemon. To add another generation of Pokemon to this list, we can use the following:

In [24]:
# Initialization
df_existing_data = pd.read_json(filename) # pd.read_json returns a dataframe, which we append to later
new_data = [] # Storage for the data to append
rootURL = 'https://bulbapedia.bulbagarden.net' # The same root URL as before

Since we already have the scraping algorithm, we reuse it in a function that also takes the filename as input, scrapes the requested generation, and merges the new rows into the existing file.

In [25]:
# Creating the function
def append_gen(gen_num, time_now, filename):
    # Scraping algorithm
    curr_gen_list = poke_tables[gen_num]
    poke_raw_info = curr_gen_list.contents[1]
    for j in range(0, len(poke_raw_info.contents)):
        # Rows sit at even indices; index 0 is the header row
        if j % 2 == 0 and j != 0:
            wiki = rootURL + poke_raw_info.contents[j].find('a').get('href')
            gen = gen_num
            
            pokemon = poke_raw_info.contents[j].text.strip().split('\n')
            
            if (len(pokemon) == 7):
                rdex = ''
                ndex = pokemon[0]
                name = pokemon[4]
                type1 = pokemon[6]
                type2 = ''
            
            elif (len(pokemon) == 8):
                rdex = ''
                ndex = pokemon[0]
                name = pokemon[4]
                type1 = pokemon[6]
                type2 = pokemon[7]
            
            elif (len(pokemon) == 9):
                rdex = pokemon[0]
                ndex = pokemon[2]
                name = pokemon[6]
                type1 = pokemon[8]
                type2 = ''
            
            elif (len(pokemon) == 10):
                rdex = pokemon[0]
                ndex = pokemon[2]
                name = pokemon[6]
                type1 = pokemon[8]
                type2 = pokemon[9]
            
            scrape_time = time_now
            new_data.append((rdex, ndex, name, type1, type2, wiki, gen, scrape_time))
    
    # Converting scraped data into a dataframe
    df_new_data = pd.DataFrame(new_data)
    df_new_data.columns = ['Rdex', 'Ndex', 'Name', 'Type 1', 'Type 2', 'Wiki Page', 'Generation', 'Scrapetime']
    
    # Appending the new data to the existing dataframe    
    df_updated_data = pd.concat([df_existing_data, df_new_data])
    df_updated_data.columns = ['Rdex', 'Ndex', 'Name', 'Type 1', 'Type 2', 'Wiki Page', 'Generation', 'Scrapetime']
    df_updated_data.reset_index(inplace = True)
    
    # Deleting existing file
    os.remove(filename)
    
    # Creating new file with updated data
    df_updated_data.to_json(filename)
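
Note that `reset_index` without `drop=True` keeps the old index as an extra *index* column, which is visible in the output further down. A hedged sketch of the combining step that avoids this uses `pd.concat` with `ignore_index=True`; the helper name `combine` and the placeholder wiki values are assumptions for illustration.

```python
import pandas as pd

COLUMNS = ['Rdex', 'Ndex', 'Name', 'Type 1', 'Type 2',
           'Wiki Page', 'Generation', 'Scrapetime']

def combine(existing, new_rows):
    """Append scraped tuples to an existing dataframe with a fresh 0..n-1 index."""
    new = pd.DataFrame(new_rows, columns=COLUMNS)
    # ignore_index=True renumbers the rows and adds no extra column
    return pd.concat([existing, new], ignore_index=True)

existing = pd.DataFrame(
    [('#001', '#152', 'Chikorita', 'Grass', '', 'WIKI_URL_PLACEHOLDER', 2,
      '05/24/2022 19:47:17')],
    columns=COLUMNS,
)
new_rows = [('#201', '#385', 'Jirachi', 'Steel', 'Psychic', 'WIKI_URL_PLACEHOLDER', 3,
             '05/24/2022 19:47:17')]

combined = combine(existing, new_rows)
print(list(combined.columns) == COLUMNS, list(combined.index))
```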

To test the function, let's try to append Gen 3 to our existing data.

In [26]:
# Calling the function with the parameter 3, to append Generation 3 to the existing data
now = datetime.datetime.now()
time_now = now.strftime('%m/%d/%Y %H:%M:%S')
append_gen(3, time_now, filename)

To check if our data was actually updated, let's use *pandas* to read the .json again.

In [27]:
pokemon_updated = pd.read_json(filename)
pokemon_updated

Unnamed: 0,index,Rdex,Ndex,Name,Type 1,Type 2,Wiki Page,Generation,Scrapetime
0,0,#001,#152,Chikorita,Grass,,https://bulbapedia.bulbagarden.net/wiki/Chikor...,2,05/24/2022 19:47:17
1,1,#002,#153,Bayleef,Grass,,https://bulbapedia.bulbagarden.net/wiki/Baylee...,2,05/24/2022 19:47:17
2,2,#003,#154,Meganium,Grass,,https://bulbapedia.bulbagarden.net/wiki/Megani...,2,05/24/2022 19:47:17
3,3,#004,#155,Cyndaquil,Fire,,https://bulbapedia.bulbagarden.net/wiki/Cyndaq...,2,05/24/2022 19:47:17
4,4,#005,#156,Quilava,Fire,,https://bulbapedia.bulbagarden.net/wiki/Quilav...,2,05/24/2022 19:47:17
...,...,...,...,...,...,...,...,...,...
243,138,#201,#385,Jirachi,Steel,Psychic,https://bulbapedia.bulbagarden.net/wiki/Jirach...,3,05/24/2022 19:47:17
244,139,#202,#386,Deoxys,Psychic,,https://bulbapedia.bulbagarden.net/wiki/Deoxys...,3,05/24/2022 19:47:17
245,140,#202,#386,Deoxys,Psychic,,https://bulbapedia.bulbagarden.net/wiki/Deoxys...,3,05/24/2022 19:47:17
246,141,#202,#386,Deoxys,Psychic,,https://bulbapedia.bulbagarden.net/wiki/Deoxys...,3,05/24/2022 19:47:17


In [28]:
pokemon_updated['Generation'].value_counts()

3    143
2    105
Name: Generation, dtype: int64
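
These counts include one row per listed form, which is why Deoxys appears three times in the output above. If one row per distinct Pokemon is wanted, a possible follow-up check is sketched below on a small hand-built sample; the choice of columns used to detect duplicates is an assumption.

```python
import pandas as pd

sample = pd.DataFrame(
    [('#385', 'Jirachi', 'Steel', 'Psychic', 3),
     ('#386', 'Deoxys', 'Psychic', '', 3),
     ('#386', 'Deoxys', 'Psychic', '', 3),
     ('#386', 'Deoxys', 'Psychic', '', 3)],
    columns=['Ndex', 'Name', 'Type 1', 'Type 2', 'Generation'],
)

# Rows per generation before removing repeated forms
raw_count = int(sample['Generation'].value_counts().loc[3])

# Keep the first row of every identical number/name/type combination
unique_rows = sample.drop_duplicates(subset=['Ndex', 'Name', 'Type 1', 'Type 2'])
unique_count = int(unique_rows['Generation'].value_counts().loc[3])

print(raw_count, unique_count)
```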