# Getting data directly from a website

By: Deborah Rose P. Buhion DATASCI S12

This notebook is based on [WebScraping.ipynb](https://github.com/brianehenyo/datapre-notebooks/blob/main/notebooks/02%20-%20Web%20Scraping.ipynb)

Data is from [Bulbapedia's National Pokedex](https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number)

Note: This notebook is a template to extract data from one generation of Pokemon.

### Import `json` library

In [9]:
import json

### Import `requests` library

In [10]:
import requests

URL="https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"

### Load the page

In [11]:
page = requests.get(URL)
# Stop here on a 4xx/5xx response instead of parsing an error page
page.raise_for_status()

### Parse HTML data

In [12]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

### Find all tables that contain Pokemon details

In [13]:
# Get main content <div>
poke_content=soup.find(id='mw-content-text')

# Get all <table> elements
poke_tables=poke_content.find_all('table')

### Get the list of Generation **X** Pokemon

In [50]:
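# Index 8 picked out the Generation VIII table when this notebook was run; change it for another generation
genX_list = poke_tables[8]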

In [51]:
# Index in genX_list.contents of the first Pokemon row; earlier entries are skipped
info_start = 3
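
The start index of 3 and the step of 2 used in the following cells come from how BeautifulSoup exposes a tag's children: `.contents` includes the whitespace between rows as text nodes, so the rows sit at every other index. The sketch below illustrates this on a small hand-written table (the HTML and the variable names are made up for illustration, not taken from Bulbapedia) and shows `find_all('tr')` as an alternative that returns only the row tags.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the real page has more columns
sample_html = """<table>
<tr><th>Ndex</th><th>Name</th></tr>
<tr><td>#810</td><td>Grookey</td></tr>
<tr><td>#811</td><td>Thwackey</td></tr>
</table>"""

sample_table = BeautifulSoup(sample_html, "html.parser").table

# .contents mixes whitespace text nodes with the <tr> tags
child_types = [type(child).__name__ for child in sample_table.contents]
print(child_types)

# find_all('tr') returns only the rows, so no index arithmetic is needed
data_rows = sample_table.find_all("tr")[1:]  # skip the header row
names = [row.find_all("td")[1].get_text(strip=True) for row in data_rows]
print(names)
```

In this sample the first data row sits at index 3 of `.contents`, which mirrors `info_start=3`. The real table may be laid out differently, so inspect a few entries of `genX_list.contents` before relying on fixed positions.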

### Print all Generation **X** Pokemon

In [52]:
# Visit every other entry of .contents, starting at info_start
for i in range(info_start, len(genX_list.contents), 2):
    poke_info = genX_list.contents[i]

    # Read the cells by position: regional number, national number, name, first type
    kdex = poke_info.contents[1].text.strip()
    ndex = poke_info.contents[3].text.strip()
    name = poke_info.contents[7].text.strip()
    type1 = poke_info.contents[9].text.strip()

    # Slice the relative URL out of the name cell's HTML and prepend the site root
    href = str(poke_info.contents[7])
    start = href.find('href="') + len('href="')
    end = href.find('" ')
    substring = href[start:end]
    link_start = "https://bulbapedia.bulbagarden.net"
    link = link_start + substring

    # Rows with more than 10 child nodes carry a second type cell
    if len(poke_info.contents) > 10:
        type2 = poke_info.contents[11].text.strip()
        print(f'Pokemon {ndex} {name} is a {type1} & {type2} Pokemon. \nLearn more here: {link}')
    else:
        print(f'Pokemon {ndex} {name} is a {type1} Pokemon. \nLearn more here: {link}')

Pokemon #810 Grookey is a Grass Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Grookey_(Pok%C3%A9mon)
Pokemon #811 Thwackey is a Grass Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Thwackey_(Pok%C3%A9mon)
Pokemon #812 Rillaboom is a Grass Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Rillaboom_(Pok%C3%A9mon)
Pokemon #813 Scorbunny is a Fire Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Scorbunny_(Pok%C3%A9mon)
Pokemon #814 Raboot is a Fire Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Raboot_(Pok%C3%A9mon)
Pokemon #815 Cinderace is a Fire Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Cinderace_(Pok%C3%A9mon)
Pokemon #816 Sobble is a Water Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Sobble_(Pok%C3%A9mon)
Pokemon #817 Drizzile is a Water Pokemon. 
Learn more here: https://bulbapedia.bulbagarden.net/wiki/Drizzile_(Pok%C3%A9mon)
Pokemon #818 In
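
The loop above builds each link by slicing the cell's HTML string between `href="` and the next `" `, which depends on the exact attribute layout of the markup. As a hedged alternative, the sketch below reads the `href` attribute from the `<a>` tag directly and joins it to the site root with `urllib.parse.urljoin`. The sample cell HTML and the helper name `cell_link` are assumptions made for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://bulbapedia.bulbagarden.net"

def cell_link(cell, base_url=BASE_URL):
    """Return the absolute URL of the first link in a table cell, or None."""
    anchor = cell.find("a")
    if anchor is None or not anchor.has_attr("href"):
        return None
    return urljoin(base_url, anchor["href"])

# Hypothetical name cell, shaped like a Bulbapedia wiki link
sample_cell = BeautifulSoup(
    '<td><a href="/wiki/Grookey_(Pok%C3%A9mon)" title="Grookey">Grookey</a></td>',
    "html.parser",
).td

print(cell_link(sample_cell))
```

Inside the loop, `cell_link(poke_info.contents[7])` could take the place of the `start`/`end` slicing, and a cell without a link would yield `None` rather than a malformed URL.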

### Collect them in a list of JSON records

In [53]:
# One dictionary per Pokemon, ready to be dumped as JSON
genX_json = []

for i in range(info_start, len(genX_list.contents), 2):
    poke_info = genX_list.contents[i]

    # Slice the relative URL out of the name cell's HTML and prepend the site root
    href = str(poke_info.contents[7])
    start = href.find('href="') + len('href="')
    end = href.find('" ')
    substring = href[start:end]
    link_start = "https://bulbapedia.bulbagarden.net"
    link = link_start + substring

    # Read the cells by position: regional number, national number, name, first type
    kdex = poke_info.contents[1].text.strip()
    ndex = poke_info.contents[3].text.strip()
    name = poke_info.contents[7].text.strip()
    type1 = poke_info.contents[9].text.strip()

    # Include "type2" only for rows that carry a second type cell
    if len(poke_info.contents) > 10:
        type2 = poke_info.contents[11].text.strip()
        genX_json.append({
            "kdex": kdex,
            "ndex": ndex,
            "name": name,
            "type1": type1,
            "type2": type2,
            "link": link
        })
    else:
        genX_json.append({
            "kdex": kdex,
            "ndex": ndex,
            "name": name,
            "type1": type1,
            "link": link
        })

genX_json

[{'kdex': '#001',
  'ndex': '#810',
  'name': 'Grookey',
  'type1': 'Grass',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Grookey_(Pok%C3%A9mon)'},
 {'kdex': '#002',
  'ndex': '#811',
  'name': 'Thwackey',
  'type1': 'Grass',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Thwackey_(Pok%C3%A9mon)'},
 {'kdex': '#003',
  'ndex': '#812',
  'name': 'Rillaboom',
  'type1': 'Grass',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Rillaboom_(Pok%C3%A9mon)'},
 {'kdex': '#004',
  'ndex': '#813',
  'name': 'Scorbunny',
  'type1': 'Fire',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Scorbunny_(Pok%C3%A9mon)'},
 {'kdex': '#005',
  'ndex': '#814',
  'name': 'Raboot',
  'type1': 'Fire',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Raboot_(Pok%C3%A9mon)'},
 {'kdex': '#006',
  'ndex': '#815',
  'name': 'Cinderace',
  'type1': 'Fire',
  'link': 'https://bulbapedia.bulbagarden.net/wiki/Cinderace_(Pok%C3%A9mon)'},
 {'kdex': '#007',
  'ndex': '#816',
  'name': 'Sobble',
  'type1': 
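
The two `append` branches in the cell above differ only in whether `type2` is present. One possible way to avoid repeating the dictionary is a small helper that adds `type2` only when a second type exists, keeping the same key order. The function name `build_record` is hypothetical, and the values in the two-type example are placeholders rather than real data.

```python
def build_record(kdex, ndex, name, type1, link, type2=None):
    """Build one Pokemon record; 'type2' is included only when given."""
    record = {"kdex": kdex, "ndex": ndex, "name": name, "type1": type1}
    if type2:
        record["type2"] = type2
    record["link"] = link
    return record

# Values copied from the notebook output
single = build_record(
    "#001", "#810", "Grookey", "Grass",
    "https://bulbapedia.bulbagarden.net/wiki/Grookey_(Pok%C3%A9mon)",
)

# Placeholder values, only to show the two-type case
dual = build_record(
    "#000", "#000", "Example", "TypeA", "https://example.org/", type2="TypeB"
)

print(list(single))
print(list(dual))
```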

### Write to a JSON file

Change the number to match the generation being extracted.

In [54]:
filename='Pokemon Generation ' + '8'

In [55]:
with open(filename, 'w') as f:
    json.dump(genX_json, f)
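
As a sketch of one possible variant of the write step, the code below gives the file a `.json` extension, writes it as indented UTF-8, and loads it back to confirm that the records survive the round trip. The sample record, the temporary directory, and the exact file name are assumptions for illustration. `ensure_ascii=False` only matters if the data contains non-ASCII characters.

```python
import json
import tempfile
from pathlib import Path

# One record copied from the notebook output, standing in for genX_json
sample_records = [{
    "kdex": "#001",
    "ndex": "#810",
    "name": "Grookey",
    "type1": "Grass",
    "link": "https://bulbapedia.bulbagarden.net/wiki/Grookey_(Pok%C3%A9mon)",
}]

generation = "8"  # change to match the generation being extracted

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / f"Pokemon Generation {generation}.json"

    # Write the list as indented UTF-8 JSON
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(sample_records, f, indent=2, ensure_ascii=False)

    # Read it back to check that nothing was lost
    with open(out_path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample_records)
```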