# How to Web Scrape Using an API
*Web Scraping Tutorial by Data Scientist Brandon Navarrete*

# Project Description and Goals: 

<div class="alert alert-block alert-info">
<b>DESCRIPTION:</b>

* I will document my process of locating an API and explore its features and issues.
* I will write markdown so that others can follow along and see my thought process.
* I will create a reproducible deliverable and leave room for open communication and feedback.


<div class="alert alert-block alert-info">
<b>GOALS:</b>

* I want to gain a better understanding of web scraping and the challenges that come with it.
* I want to build an arsenal of solutions.
* I want others to find success through my mistakes, failures, and wins.


# Imports I Used

In [1]:
# Essential Imports
import pandas as pd
import numpy as np
import os

# Import to Help Us With the API
import json
from json.decoder import JSONDecodeError
import requests

# Visual Imports
import time
from tqdm import tqdm

# Nuisance Imports
# I do not recommend this method, because it hides every warning, including important ones.
import warnings
warnings.filterwarnings('ignore')

# The API I Am Using

https://pokeapi.co/api/v2/pokemon/1

 response.json()


<div class="alert alert-block alert-info">
<b>Note:</b>
    
* This API belongs to the website https://pokeapi.co/  
    
* The URL I am using is not the API root; it leads to a specific resource   
that contains information about one specific Pokémon

* There is a very friendly GUI to experiment with on the website
    


# Let's See What We Are Working With

In [2]:
# let's create a response for {pokemon/1}, which is Bulbasaur
# `requests.get` is a function that sends an HTTP GET request to a specific URL
response = requests.get('https://pokeapi.co/api/v2/pokemon/1')

In [3]:
# verify a response was "good"
response.ok, response.status_code


(True, 200)
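Checking `response.ok` by eye works for a single request, but a small guard becomes handy once many URLs are requested. The sketch below is an illustration only and not part of the original notebook: `parse_response` is a hypothetical helper, and it assumes nothing more than an object exposing `.ok` and `.json()` the way a `requests` response does.

```python
def parse_response(response):
    """Return the decoded JSON body, or None if the request failed or the body is not JSON."""
    if not response.ok:
        return None
    try:
        return response.json()
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return None
```

With the response from the cell above, this could be used as `data = parse_response(response)`, followed by a check for `None` before reading any keys.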

In [4]:
# Let's look at what this URL returns to us using 'response.text'
response.text

'{"abilities":[{"ability":{"name":"overgrow","url":"https://pokeapi.co/api/v2/ability/65/"},"is_hidden":false,"slot":1},{"ability":{"name":"chlorophyll","url":"https://pokeapi.co/api/v2/ability/34/"},"is_hidden":true,"slot":3}],"base_experience":64,"forms":[{"name":"bulbasaur","url":"https://pokeapi.co/api/v2/pokemon-form/1/"}],"game_indices":[{"game_index":153,"version":{"name":"red","url":"https://pokeapi.co/api/v2/version/1/"}},{"game_index":153,"version":{"name":"blue","url":"https://pokeapi.co/api/v2/version/2/"}},{"game_index":153,"version":{"name":"yellow","url":"https://pokeapi.co/api/v2/version/3/"}},{"game_index":1,"version":{"name":"gold","url":"https://pokeapi.co/api/v2/version/4/"}},{"game_index":1,"version":{"name":"silver","url":"https://pokeapi.co/api/v2/version/5/"}},{"game_index":1,"version":{"name":"crystal","url":"https://pokeapi.co/api/v2/version/6/"}},{"game_index":1,"version":{"name":"ruby","url":"https://pokeapi.co/api/v2/version/7/"}},{"game_index":1,"version":

In [5]:
# the response body is JSON, so we can parse it and access its keys
data = response.json()
# show that the parsed result is a dictionary
print(type(data))

<class 'dict'>


In [6]:
# showing the dictionary, which contains lists and dictionaries nested inside it.
# 'abilities' is a key whose value is a list of dictionaries; each of those holds
# an 'ability' dictionary with the key 'name'
data

{'abilities': [{'ability': {'name': 'overgrow',
    'url': 'https://pokeapi.co/api/v2/ability/65/'},
   'is_hidden': False,
   'slot': 1},
  {'ability': {'name': 'chlorophyll',
    'url': 'https://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'forms': [{'name': 'bulbasaur',
   'url': 'https://pokeapi.co/api/v2/pokemon-form/1/'}],
 'game_indices': [{'game_index': 153,
   'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}},
  {'game_index': 153,
   'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}},
  {'game_index': 153,
   'version': {'name': 'yellow',
    'url': 'https://pokeapi.co/api/v2/version/3/'}},
  {'game_index': 1,
   'version': {'name': 'gold', 'url': 'https://pokeapi.co/api/v2/version/4/'}},
  {'game_index': 1,
   'version': {'name': 'silver',
    'url': 'https://pokeapi.co/api/v2/version/5/'}},
  {'game_index': 1,
   'version': {'name': 'crystal',
    'url': 'https://pokeapi.co/

In [7]:
# That is a lot of text, which makes it hard to read, so we can use '.keys()'.
# It shows all the top-level keys, each of which holds other information.
data.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species', 'sprites', 'stats', 'types', 'weight'])
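The key names alone do not say which values are plain numbers and which are nested lists or dictionaries. As a rough sketch (the helper `summarize` and the trimmed-down `sample` are my own illustration, with values copied from the outputs in this notebook), each key can be paired with the type and size of its value:

```python
def summarize(data):
    """Map each top-level key to a short description of the value stored under it."""
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} (len {len(value)})"
        else:
            summary[key] = type(value).__name__
    return summary

# A trimmed-down stand-in for the real response
sample = {
    "abilities": [{"slot": 1}, {"slot": 3}],
    "base_experience": 64,
    "past_types": [],
    "species": {"name": "bulbasaur"},
}

for key, description in summarize(sample).items():
    print(f"{key}: {description}")
```

Running `summarize(data)` on the full response would show at a glance which keys need extra digging.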

# Let's Explore Pulling Data From This URL

In [8]:
temp_cols = ['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'id', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species','hp','attack','defense','special_attack','special_defense', 'type1','type2', 'weight']
temp_df = pd.DataFrame(columns=temp_cols)
temp_df

Unnamed: 0,abilities,base_experience,forms,game_indices,height,id,location_area_encounters,moves,name,order,past_types,species,hp,attack,defense,special_attack,special_defense,type1,type2,weight


# DATA

In [9]:
# We will use these keys to see what is stored
data.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species', 'sprites', 'stats', 'types', 'weight'])

# Dataframe object to store information

In [10]:
# The end goal is to have a csv and dataframe, so let's create one to see how we would implement it
temp_df = pd.DataFrame(columns=['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'id', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species', 'hp', 'attack', 'defense', 'special_attack', 'special_defense', 'type1', 'type2', 'weight'])
temp_df.loc[len(temp_df)] = 0
    

In [11]:
# one placeholder row, filled with zeros
temp_df

Unnamed: 0,abilities,base_experience,forms,game_indices,height,id,location_area_encounters,moves,name,order,past_types,species,hp,attack,defense,special_attack,special_defense,type1,type2,weight
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# ABILITIES

<div class="alert alert-block alert-info">
<b>Info:</b>
    
* Below, I will go through the keys that I want to pull and begin to see how I would create a function to do all of this programmatically.
    


In [12]:
# Let's look inside a key
data['abilities']

[{'ability': {'name': 'overgrow',
   'url': 'https://pokeapi.co/api/v2/ability/65/'},
  'is_hidden': False,
  'slot': 1},
 {'ability': {'name': 'chlorophyll',
   'url': 'https://pokeapi.co/api/v2/ability/34/'},
  'is_hidden': True,
  'slot': 3}]

<div class="alert alert-block alert-info">
<b>Abilities Takeaway:</b>
    
This is a list of dictionaries, where each dictionary represents an ability of a Pokémon. Each dictionary has three key-value pairs:

* 'ability': {'name': 'overgrow', 'url': 'https://pokeapi.co/api/v2/ability/65/'} - This key-value pair contains information about the name of the ability and the URL where more information can be found about the ability.
    
    
    

* 'is_hidden': False - This key-value pair indicates whether the ability is a hidden ability or not.  
    

* 'slot': 1 - This key-value pair indicates the slot number of the ability. Pokémon can have up to 3 abilities, and the slot number indicates the order in which the abilities are listed in the Pokémon's profile.


In [13]:
# I only want to pull the name of the abilities, and there are multiple values. 
# An empty list would help hold the values until I am ready to place them
ability = []

In [14]:
# this shows the value that I want to keep; see the takeaway below
data['abilities'][0]['ability']['name']

'overgrow'

<div class="alert alert-block alert-info">
<b>Dictionary Takeaway:</b>
      
Here `data` is a dictionary with a key 'abilities' that holds a list of dictionaries.  

The first element of this list is accessed with [0], the value associated with the key 'ability' in that dictionary is accessed with ['ability'], and finally the value associated with the key 'name' in that dictionary is accessed with ['name'].  

So the expression data['abilities'][0]['ability']['name'] gives you the name of the first ability of a Pokémon.  

In [15]:
# I am appending the values, strings, to the empty list ability
ability.append(data['abilities'][0]['ability']['name'])

In [16]:
ability.append(data['abilities'][1]['ability']['name'])

In [17]:
# This would be our new list WITH desired values
ability

['overgrow', 'chlorophyll']
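Appending one index at a time only works when the number of abilities is known in advance. A minimal sketch of a more general version, assuming every entry has the same nested shape as the output above (`names_from` is a hypothetical helper, not something the API or this notebook provides):

```python
def names_from(items, key):
    """Collect item[key]['name'] for every dictionary in a list."""
    return [item[key]["name"] for item in items]

# Same structure as data['abilities'] shown earlier
abilities = [
    {"ability": {"name": "overgrow", "url": "https://pokeapi.co/api/v2/ability/65/"},
     "is_hidden": False, "slot": 1},
    {"ability": {"name": "chlorophyll", "url": "https://pokeapi.co/api/v2/ability/34/"},
     "is_hidden": True, "slot": 3},
]

print(names_from(abilities, "ability"))  # ['overgrow', 'chlorophyll']
```

The same pattern should apply to other keys with this shape, for example `names_from(data['moves'], 'move')`.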

In [18]:
# let's save this to our dataframe; the [0] selects the exact row I want it saved to.
temp_df['abilities'][0] = ability

In [19]:
temp_df

Unnamed: 0,abilities,base_experience,forms,game_indices,height,id,location_area_encounters,moves,name,order,past_types,species,hp,attack,defense,special_attack,special_defense,type1,type2,weight
0,"[overgrow, chlorophyll]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Base_XP

<div class="alert alert-block alert-info">
<b>Note:</b>

The rest of this portion will mostly include comments on my thought process. (Not as much detail or markdown unless we encounter a new concept)

In [20]:
# a numerical value representing base experience, let's add that
data['base_experience']

64

In [21]:
# saving to dataframe
temp_df['base_experience'] = data['base_experience']

# Forms

In [22]:
# saving to dataframe
temp_df['forms'] = data['forms'][0]['name']

# height

In [23]:
# saving to dataframe
temp_df['height'] = data['height']

# location_area_encounters

In [24]:
# This key stores a url, tricky! 
# we just have to go to that url and pull the info we want
data['location_area_encounters']

'https://pokeapi.co/api/v2/pokemon/1/encounters'

In [25]:
# saving the url to a variable `loc_url`, short for location_url
loc_url = data['location_area_encounters']

In [26]:
# saving the response
response_url = requests.get(loc_url)

# the response contains JSON
data_url = response_url.json()


In [27]:
# Checking the length
len(data_url)

3

In [28]:
# creating empty list
temp_name = []

In [29]:
# saving the names to our list
temp_name.append(data_url[0]['location_area']['name'])

In [30]:
temp_name.append(data_url[1]['location_area']['name'])

In [31]:
temp_name.append(data_url[2]['location_area']['name'])


In [32]:
# We went into the dictionary, pulled our info, and saved it into our list; next we save it to the dataframe
temp_name

['cerulean-city-area', 'pallet-town-area', 'lumiose-city-area']

In [33]:
# List item added to dataframe
temp_df['location_area_encounters'][0] = temp_name

# Moves

In [34]:
# This was so much to go through. Too Much
data['moves']

[{'move': {'name': 'razor-wind', 'url': 'https://pokeapi.co/api/v2/move/13/'},
  'version_group_details': [{'level_learned_at': 0,
    'move_learn_method': {'name': 'egg',
     'url': 'https://pokeapi.co/api/v2/move-learn-method/2/'},
    'version_group': {'name': 'gold-silver',
     'url': 'https://pokeapi.co/api/v2/version-group/3/'}},
   {'level_learned_at': 0,
    'move_learn_method': {'name': 'egg',
     'url': 'https://pokeapi.co/api/v2/move-learn-method/2/'},
    'version_group': {'name': 'crystal',
     'url': 'https://pokeapi.co/api/v2/version-group/4/'}}]},
 {'move': {'name': 'swords-dance',
   'url': 'https://pokeapi.co/api/v2/move/14/'},
  'version_group_details': [{'level_learned_at': 0,
    'move_learn_method': {'name': 'machine',
     'url': 'https://pokeapi.co/api/v2/move-learn-method/4/'},
    'version_group': {'name': 'red-blue',
     'url': 'https://pokeapi.co/api/v2/version-group/1/'}},
   {'level_learned_at': 0,
    'move_learn_method': {'name': 'machine',
     'ur

In [35]:
# Creating a function to go in, take the info, and stop when the length is met
def iterate_dict(data):
    # creating an empty list
    temp_move = []
     
    for item in range(0,len(data['moves'])):
        
        temp_move.append(data['moves'][item]['move']['name'])
       
        
        
    return temp_move
    
        

In [36]:
# calling function
temp_move = iterate_dict(data)

In [37]:
# adding to dataframe
temp_df['moves'][0] = temp_move

# name

In [38]:
# adding to dataframe
temp_df['name'] = data['name']

# order

In [39]:
# adding to dataframe
temp_df['order'] = data['order']

# past_types

In [40]:
# sometimes values may be empty
data['past_types']

[]

# species

In [41]:
# adding to dataframe
temp_df['species'] = data['species']['name']

# stats

In [42]:
data['stats']

[{'base_stat': 45,
  'effort': 0,
  'stat': {'name': 'hp', 'url': 'https://pokeapi.co/api/v2/stat/1/'}},
 {'base_stat': 49,
  'effort': 0,
  'stat': {'name': 'attack', 'url': 'https://pokeapi.co/api/v2/stat/2/'}},
 {'base_stat': 49,
  'effort': 0,
  'stat': {'name': 'defense', 'url': 'https://pokeapi.co/api/v2/stat/3/'}},
 {'base_stat': 65,
  'effort': 1,
  'stat': {'name': 'special-attack',
   'url': 'https://pokeapi.co/api/v2/stat/4/'}},
 {'base_stat': 65,
  'effort': 0,
  'stat': {'name': 'special-defense',
   'url': 'https://pokeapi.co/api/v2/stat/5/'}},
 {'base_stat': 45,
  'effort': 0,
  'stat': {'name': 'speed', 'url': 'https://pokeapi.co/api/v2/stat/6/'}}]

In [43]:
# function to pull all stats
def iterate_dict(data):
    # creating an empty list
    temp_stat = []
     
    for item in range(0,len(data['stats'])):
        
        temp_stat.append(data['stats'][item]['base_stat'])
       
        
        
    return temp_stat
    
        

In [44]:
# saving values
hp,attk,df,spa,spd,s = iterate_dict(data)

In [45]:
# checking values, they will always be returned in the same order
hp,attk,df,spa,spd,s

(45, 49, 49, 65, 65, 45)
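Unpacking by position relies on the stats always arriving in the same order. As a hedged alternative, the stat names already present in the response can be used as keys, so the order no longer matters. `stats_by_name` is a hypothetical helper, and the sample below is a shortened copy of the `data['stats']` output above:

```python
def stats_by_name(data):
    """Map each stat name to its base value, turning 'special-attack' into 'special_attack'."""
    return {
        entry["stat"]["name"].replace("-", "_"): entry["base_stat"]
        for entry in data["stats"]
    }

sample = {"stats": [
    {"base_stat": 45, "effort": 0, "stat": {"name": "hp"}},
    {"base_stat": 49, "effort": 0, "stat": {"name": "attack"}},
    {"base_stat": 65, "effort": 1, "stat": {"name": "special-attack"}},
]}

stats = stats_by_name(sample)
print(stats["special_attack"])  # 65
```

The resulting keys line up with the dataframe column names used in this notebook, such as `special_attack` and `special_defense`.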

In [46]:
# adding to dataframe
temp_df[['hp','attack','defense','special_attack','special_defense','speed']] = hp,attk,df,spa,spd,s


In [47]:
# checking just to be safe
temp_df[['hp','attack','defense','special_attack','special_defense','speed']]

Unnamed: 0,hp,attack,defense,special_attack,special_defense,speed
0,45,49,49,65,65,45


# types

In [48]:
# function for types
def iterate_type(data):
    # creating an empty list
    temp_type = []
     
    for item in range(0,len(data['types'])):
        
        temp_type.append(data['types'][item]['type']['name'])
       
        
        
    return temp_type
    

In [49]:
# add to df
temp_df[['type1','type2']]=iterate_type(data)

# Weight

In [50]:
# add to df
temp_df['weight']=data['weight']

In [51]:
# We were able to pull the info that we wanted, time to do all of this like a programmer.
temp_df

Unnamed: 0,abilities,base_experience,forms,game_indices,height,id,location_area_encounters,moves,name,order,...,species,hp,attack,defense,special_attack,special_defense,type1,type2,weight,speed
0,"[overgrow, chlorophyll]",64,bulbasaur,0,7,0,"[cerulean-city-area, pallet-town-area, lumiose...","[razor-wind, swords-dance, cut, bind, vine-whi...",bulbasaur,1,...,bulbasaur,45,49,49,65,65,grass,poison,69,45


# creating a function to do all of this and cycle through the api

In [52]:
start_url = 'https://pokeapi.co/api/v2/pokemon/'

In [53]:
def cycle_api(start_url, api_page = 1):

    # build the url from the base url and the pokemon number
    url = f'{start_url}{api_page}'
    
    # getting response
    response = requests.get(url)
    
    # parse the JSON body into a dictionary
    data = response.json()
    
    return data

# Function Collection


<div class="alert alert-block alert-info">
<b>Note:</b> 
    
Here I started to write cleaned-up functions that will be called later. 
I chose to make several smaller functions so that if something messed up, I could locate the issue more easily.

THINGS DID AND WILL MESS UP!



### ability

In [54]:
def ability_scrape(data):
    # empty list
    ability_list = []
    
    for item in range(0,len(data['abilities'])):
        
        ability_list.append(data['abilities'][item]['ability']['name'])
       
        
        
    return ability_list
    
        

###  base exp

In [55]:
def base_xp_scrape(data):
    # empty list
    xp_list = []
    
    xp_list.append(data['base_experience'])
      
    return xp_list
    
        

###  forms

In [56]:
def forms_scrape(data):
    # empty list
    forms_list = []
    
    forms_list.append( data['forms'][0]['name'])
      
    return forms_list
    

###  height

In [57]:
def height_scrape(data):
    # empty list
    height_list = []
    
    height_list.append(data['height'])
      
    return height_list
    

###  id

In [58]:
def id_scrape(data):
    # empty list
    id_list = []
    
    id_list.append(data['id'])
      
    return id_list
    

###  location Area

In [59]:
def location_scrape(data):
    # empty list
    location_list = []
    
    # url
    loc_url = data['location_area_encounters']
    
    # grabbing json file
    response_url = requests.get(loc_url)

    # the response contains JSON
    data_url = response_url.json()
    
    #
    for item in range(0,(len(data_url))):
        
        location_list.append(data_url[item]['location_area']['name'])         
   
      
    return location_list
    

###  move

In [60]:
def move_scrape(data):
    # creating an empty list
    move_list = []
     
    for item in range(0,len(data['moves'])):
        
        move_list.append(data['moves'][item]['move']['name'])
       
        
        
    return move_list
    
        

###  name

In [61]:
def name_scrape(data):
    # empty list
    name_list = []
    
    
    name_list.append(data['name'])
      
    return name_list
    

### order

In [62]:
def order_scrape(data):
    
    # empty list
    order_list = []
    
    
    order_list.append(data['order'])
      
    return order_list

### past types

In [63]:
def pt_scrape(data):
    
    # empty list
    pt_list = []
    
    
    pt_list.append(data['past_types'])
      
    return pt_list

###  species

In [64]:
def species_scrape(data):
    
    # empty list
    species_list = []
    
    
    species_list.append(data['species']['name'])
      
    return species_list

###  stats

In [65]:
def stats_scrape(data):
    return [stat['base_stat'] for stat in data['stats']]

###  type

In [66]:
def type_scrape(data):
    types = data.get('types', [])
    if len(types) == 1:
        return types[0]['type']['name'], None
    elif len(types) == 2:
        return types[0]['type']['name'], types[1]['type']['name']
    else:
        return None, None

###  weight 

In [67]:
def weight_scrape(data):
    
    # empty list
    weight_list = []
    
    
    weight_list.append(data['weight'])
      
    return weight_list

###  create df

In [68]:
def create_df():
    
    df_cols = ['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'id', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species','hp','attack','defense','special_attack','special_defense','speed' ,'type1','type2', 'weight']
    # start with an empty dataframe; no placeholder row is added here,
    # so the final result does not begin with a row of zeros
    df = pd.DataFrame(columns=df_cols)
    
    return df
    

### putting it all together

In [69]:
def scrape_results(data, df):
    # Create temporary dataframe to hold scraped data
    temp_df = pd.DataFrame(columns=['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'id', 'location_area_encounters', 'moves', 'name', 'order', 'past_types', 'species', 'hp', 'attack', 'defense', 'special_attack', 'special_defense', 'type1', 'type2', 'weight'])
    temp_df.loc[len(temp_df)] = 0

    # ability
    ability_list = ability_scrape(data)
    temp_df['abilities'][0] = ability_list

    # base_experience
    xp_list = base_xp_scrape(data)
    temp_df['base_experience'] = xp_list

    # forms
    forms_list = forms_scrape(data)
    temp_df['forms'] = forms_list

    # height
    height_list = height_scrape(data)
    temp_df['height'] = height_list

    # id
    id_list = id_scrape(data)
    temp_df['id'] = id_list

    # location area
    # loc_list = location_scrape(data)
    # temp_df['location_area_encounters'][0] = loc_list
    loc_list = location_scrape(data)
    temp_df['location_area_encounters'][0] = json.dumps(loc_list)

    # moves
    move_list = move_scrape(data)
    temp_df['moves'][0] = move_list

    # name
    name_list = name_scrape(data)
    temp_df['name'] = name_list

    # order
    order_list = order_scrape(data)
    temp_df['order'] = order_list

    # past type
    pt_list = pt_scrape(data)
    temp_df['past_types'][0] = pt_list

    # species
    species_list = species_scrape(data)
    temp_df['species'] = species_list

    # weight
    weight_list = weight_scrape(data)
    temp_df['weight'] = weight_list

    # stats
    # this order is returned ['hp','attack','defense','special_attack','special_defense','speed']
    hp, attk, defense, spa, spd, s = stats_scrape(data)
    temp_df['hp'] = hp
    temp_df['attack'] = attk
    temp_df['defense'] = defense
    temp_df['special_attack'] = spa
    temp_df['special_defense'] = spd
    temp_df['speed'] = s

    # type
    type1, type2 = type_scrape(data)
    temp_df['type1'] = type1
    temp_df['type2'] = type2

    
    print(len(df))
    # Append temporary dataframe to input dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    df = pd.concat([df, temp_df], ignore_index=True)


    return df
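`scrape_results` grows the dataframe one row at a time. A common alternative is to collect plain dictionaries in a list and build the dataframe once at the end. The sketch below only covers a few columns and is not the notebook's own method: `build_row` is a hypothetical helper, and the sample is a cut-down version of the response explored earlier.

```python
def build_row(data):
    """Flatten one Pokémon response into a plain dictionary (one future dataframe row)."""
    types = [t["type"]["name"] for t in data.get("types", [])]
    row = {
        "id": data["id"],
        "name": data["name"],
        "abilities": [a["ability"]["name"] for a in data["abilities"]],
        "type1": types[0] if len(types) > 0 else None,
        "type2": types[1] if len(types) > 1 else None,
    }
    # One column per stat, named after the stat itself
    for entry in data["stats"]:
        row[entry["stat"]["name"].replace("-", "_")] = entry["base_stat"]
    return row

sample = {
    "id": 1,
    "name": "bulbasaur",
    "abilities": [{"ability": {"name": "overgrow"}}, {"ability": {"name": "chlorophyll"}}],
    "types": [{"type": {"name": "grass"}}, {"type": {"name": "poison"}}],
    "stats": [{"base_stat": 45, "stat": {"name": "hp"}}],
}

rows = [build_row(sample)]  # in the real loop: rows.append(build_row(data))
# After the loop, a single call builds the frame: df = pd.DataFrame(rows)
```

Because each row is an ordinary dictionary, a problem with one Pokémon can be inspected with `print(row)` before any dataframe is involved.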

# cycle through different url

In [70]:
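def cycle_api(start_url):

    response = requests.get(start_url)
    # stop with a clear HTTPError on a failed request (4xx/5xx) instead of
    # failing later on an undefined `data` variable
    response.raise_for_status()

    return response.json()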
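`cycle_api` makes a single attempt per URL. Over a thousand requests, an occasional temporary failure is plausible, so a retry with a short pause may help. This is a sketch under assumptions: `fetch_json` is a hypothetical helper, the `get` argument stands for any callable behaving like `requests.get`, and the retry count and delay are arbitrary values, not recommendations from the API.

```python
import time

def fetch_json(url, get, retries=3, delay=0.5):
    """Call get(url) up to `retries` times; return parsed JSON, or None if every attempt fails."""
    for attempt in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response.json()
        if attempt < retries - 1:
            # wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))
    return None
```

In this notebook it could be called as `data = fetch_json(url, requests.get)`. Note that it only retries on non-200 status codes; connection errors raised by `get` would still propagate.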
            
        

# Let's cycle through the API to grab all 1008 Pokémon

In [71]:
# define the starting URL for the API
start_url = 'https://pokeapi.co/api/v2/pokemon/1'

# define a function to run the whole process
def scrape_all_pokemons():
    # create an empty dataframe
    df = create_df()

    for i in tqdm(range(1, 1009)):
        url = f'https://pokeapi.co/api/v2/pokemon/{i}'
        
        data = cycle_api(url)
        
        df = scrape_results(data, df)
        
        # wait a bit to avoid overloading the API
        time.sleep(0.1)
        
        
   
    # return df
    return df
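The loop above hard-codes the range of ids. An alternative is to let the API list its own entries. The sketch below assumes the listing endpoint `https://pokeapi.co/api/v2/pokemon` returns pages with a `results` list and a `next` URL that is `None` on the last page; `iter_pokemon_urls` and the `fetch` argument are hypothetical names for illustration, and the listing may contain entries beyond the ids used in the loop above.

```python
def iter_pokemon_urls(first_url, fetch):
    """Yield each detail URL from a paginated listing, following 'next' until it is None."""
    url = first_url
    while url:
        page = fetch(url)  # fetch(url) must return the parsed JSON page as a dict
        for entry in page["results"]:
            yield entry["url"]
        url = page.get("next")
```

A possible call would be `iter_pokemon_urls('https://pokeapi.co/api/v2/pokemon', lambda u: requests.get(u).json())`, with each yielded URL passed on to `cycle_api`.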

# Last Step is To Run the Section Below to Begin Scraping

In [75]:
# calling the function, the function to run all functions.
# saving the returned df
df = scrape_all_pokemons()

# saving df to csv (we could put this in the function as well)
df.to_csv('all_pokemons.csv', index=False)


<div class="alert alert-block alert-info">
<b>tqdm:</b>

* An explanation of how to read the tqdm progress bar, which I found on Stack Overflow, written by 'tuxdna'

* A link to the thread: https://stackoverflow.com/questions/52777424/explanation-of-output-of-python-tqdm

17%|█▋        | 134/782 [00:19<01:21,  7.98it/s, loss=0.375 ]  

The fields in order are:  

* `17%`: Percentage complete.   
* `|█▋        |`: Progress bar.   
* `134/782`: Number of items iterated over, out of the total number of items.   
* `[00:19<01:21,  7.98it/s, loss=0.375 ]`: Let's break this down.   
    * `00:19<01:21`: The left is the elapsed `runtime` and the right is the estimated time left.
    * `7.98it/s`: Iterations per second.
    * `loss=0.375`: As the label says, it is the `loss`. This field comes from the Stack Overflow example and will not appear in the progress bar for this scrape.


In [74]:
# read from csv, so that we do not have to keep requesting data from the website
df = pd.read_csv('all_pokemons.csv')

In [None]:
# here is our product.
df
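One thing to watch for after reloading: columns that held Python lists (such as `abilities` or `moves`) come back from a CSV as plain text. A small sketch of turning that text back into lists, using example strings that mirror the two ways lists are written in this notebook (directly as a list, and through `json.dumps`):

```python
import ast
import json

# What list values look like after a round trip through CSV: plain strings
abilities_cell = "['overgrow', 'chlorophyll']"                 # a list written as text
locations_cell = '["cerulean-city-area", "pallet-town-area"]'  # written via json.dumps

abilities = ast.literal_eval(abilities_cell)  # parses Python literal syntax
locations = json.loads(locations_cell)        # parses JSON syntax

print(abilities[0], locations[1])
```

On the dataframe, this could look like `df['abilities'] = df['abilities'].apply(ast.literal_eval)`, assuming every cell in that column holds a valid literal; missing values would need to be handled first.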

<div class="alert alert-block alert-info">
<b>FINAL TAKEAWAYS:</b>

So we are here, at the end. I won't go over cleaning the data, but I would set the index, look for None/NaN values, and explore with visuals.

I hope this was helpful in some regard; I know I'll use this for reference from time to time.
Please let me know what you think and if there is anything you would have done differently. 

As Always, this was your local data scientist Brandon Navarrete, making data speak.
~Brandon.t.navarrete@gmail.com