# Reproducible code for creating a Pokémon Dataset  

This notebook downloads Pokémon data from the PokéAPI and stores it in a single CSV file containing the **ID**, **Name**, **Type(s)**, **Description/Document**, **Filename**, **Tokens**, 
**Lemmas**, **POS**, **Proper Nouns** and **Named Entities**. 

All steps in the process of building the dataset can be reproduced by running the following cells from top to bottom.
Let's start with our imports!

In [None]:
import requests
import pandas as pd
import os
import time
import unicodedata
import re
import spacy

Next, we set up our folders. 

In [None]:
data_folder = "pokedata"  # folder for the per-Pokémon description files
os.makedirs(data_folder, exist_ok=True)

Next, we request the Pokémon data from the API. A loop runs over the IDs 1 to 1025 (the official Pokémon count up to Generation 9) and collects the following for each Pokémon:
- Pokémon ID
- Pokémon name
- Pokémon type(s)
- English Pokémon description
- The name of the corresponding file <a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1)  

<blockquote> If a request fails, it is retried up to 3 times. If it still fails, that Pokémon is skipped and the loop moves on to the next one. </blockquote>

<a name="cite_note-1"></a>1. [^](#cite_ref-1) The code below also creates a folder of separate .txt files named 0001_bulbasaur.txt, 0002_ivysaur.txt, etc., each containing the description of one Pokémon. These files are stored in the 'pokedata' folder and are referred to by filename in the CSV file. 
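
The retry rule described above can also be factored into a small helper. The following is a minimal sketch, not the notebook's own code: `get_with_retries` and its parameters are hypothetical names, and `fetch` stands for any callable that returns an object with a `status_code` attribute, such as `requests.get`. As an added assumption about the desired behaviour, it also treats a raised exception as a failed attempt.

```python
import time


def get_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times.

    Returns the first response with status code 200, or None if every
    attempt fails. `fetch` is assumed to behave like requests.get.
    """
    for attempt in range(1, attempts + 1):
        try:
            resp = fetch(url)
            if resp.status_code == 200:
                return resp
            print(f"Attempt {attempt} for {url} failed (status {resp.status_code})")
        except Exception as exc:  # assumption: network errors count as failed attempts
            print(f"Attempt {attempt} for {url} failed ({exc})")
        if attempt < attempts:
            time.sleep(delay)  # short pause before the next attempt
    return None
```

With `requests` imported, a call such as `get_with_retries(requests.get, "https://pokeapi.co/api/v2/pokemon/1")` would return the response or `None`; a `None` result signals that the Pokémon should be skipped.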

In [None]:
pokemon_data = []
data_folder = "pokedata"
os.makedirs(data_folder, exist_ok=True) 

for i in range(1, 1026):
    print(f"Getting Pokémon {i}...")  # show the download progress

    # Try the request up to 3 times before giving up on this Pokémon
    for attempt in range(3):
        p_resp = requests.get(f"https://pokeapi.co/api/v2/pokemon/{i}")
        if p_resp.status_code == 200:
            break
        print(f"Retry {attempt+1} for Pokémon {i} (status {p_resp.status_code})")
        time.sleep(1)
    else:
        # The loop finished without a successful response: skip this Pokémon
        print(f"Skipping Pokémon {i} after 3 failed attempts")
        continue

    p = p_resp.json()

    for attempt in range(3):
        s_resp = requests.get(f"https://pokeapi.co/api/v2/pokemon-species/{i}")
        if s_resp.status_code == 200:
            break
        print(f"Retry {attempt+1} for species {i} (status {s_resp.status_code})")
        time.sleep(1)
    else:
        print(f"Skipping species {i} after 3 failed attempts")
        continue

    s = s_resp.json()

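    # Keep the English flavour texts and replace line and page breaks with spaces
    english_entries = [
        e['flavor_text'].replace("\n", " ").replace("\x0c", " ")
        for e in s['flavor_text_entries']
        if e['language']['name'] == 'en'
    ]
    # Use the first English entry as the description (empty string if there is none)
    description = english_entries[0] if english_entries else ""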

    filename = f"{i:04d}_{p['name']}.txt"
    with open(os.path.join(data_folder, filename), "w", encoding="utf-8") as f:
        f.write(description)

    pokemon_data.append({
        "id": i,
        "name": p['name'],
        "types": [t['type']['name'] for t in p['types']],
        "description": description,
        "filename": filename
    })

    time.sleep(0.3)  # delay to avoid overload

print(f"\nAll done! Successfully collected data for {len(pokemon_data)} Pokémon.")
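
The description step in the cell above can be tried out without calling the API. This sketch applies the same selection to a hand-made dictionary that mimics only the fields of a species response used in this notebook (`flavor_text_entries`, `flavor_text`, `language`). The sample texts are for illustration only, and `first_english_description` is a hypothetical helper name.

```python
def first_english_description(species):
    """Return the first English flavour text, with line and page breaks replaced by spaces."""
    english_entries = [
        e["flavor_text"].replace("\n", " ").replace("\x0c", " ")
        for e in species["flavor_text_entries"]
        if e["language"]["name"] == "en"
    ]
    return english_entries[0] if english_entries else ""


# Illustrative sample data, shaped like the fields used above
sample_species = {
    "flavor_text_entries": [
        {"flavor_text": "Ein Beispieltext.", "language": {"name": "de"}},
        {"flavor_text": "A strange seed was\nplanted on its\x0cback.", "language": {"name": "en"}},
    ]
}

description = first_english_description(sample_species)
filename = f"{1:04d}_bulbasaur.txt"  # same naming pattern as in the loop
print(description)
print(filename)
```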

Now we convert the list into a DataFrame using pandas and save it as a CSV file. 

In [None]:
df = pd.DataFrame(pokemon_data)
df.to_csv("pokedex.csv", index=False)
print("CSV saved as 'pokedex.csv'")
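
One consequence of saving to CSV is worth keeping in mind: list-valued columns such as `types` are written as text and come back as strings when the file is read again, which is why later cells call `ast.literal_eval`. The sketch below shows this mechanism on one illustrative row, using the standard-library `csv` module and an in-memory buffer instead of pandas and a file on disk.

```python
import ast
import csv
import io

row = {"id": 1, "name": "bulbasaur", "types": ["grass", "poison"]}

# Write one row to an in-memory CSV "file"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "types"])
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["types"]).__name__, loaded["types"])

# ast.literal_eval turns the string back into a Python list
types = ast.literal_eval(loaded["types"])
print(type(types).__name__, types)
```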

As a final step, we load a spaCy English model and clean the Pokémon descriptions so that they can be annotated with **tokens**, **lemmas**, **parts of speech**, **proper nouns**, and **named entities**. Each POS tag is stored together with its explanation in the same column, and the intermediate columns are removed before the fully annotated CSV is saved. 

In [None]:
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

df = pd.read_csv("pokedex.csv")

def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df["clean_text"] = df["description"].apply(preprocess_text)

# Run the spaCy pipeline on every cleaned description
df["doc"] = df["clean_text"].apply(nlp)

df["tokens"] = df["doc"].apply(lambda doc: [t.text for t in doc])
df["lemmas"] = df["doc"].apply(lambda doc: [t.lemma_ for t in doc])
df["pos_tags"] = df["doc"].apply(
    lambda doc: [(t.text, t.tag_, spacy.explain(t.tag_)) for t in doc]
)
df["proper_nouns"] = df["doc"].apply(
    lambda doc: [t.text for t in doc if t.pos_ == "PROPN"]
)
df["named_entities"] = df["doc"].apply(
    lambda doc: [(ent.text, ent.label_) for ent in doc.ents]
)

df = df.drop(columns=["doc", "clean_text"])
df.to_csv("pokedex_annotated.csv", index=False)

print(f"All done! Annotated data saved as 'pokedex_annotated.csv' ({len(df)} Pokémon).")
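
To see what the cleaning step does before spaCy processes the text, the same regular expressions can be applied to a single sentence. This is a sketch on an invented sample sentence; it leaves out the `pd.isna` check so that it runs without pandas.

```python
import re


def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'\S+@\S+', '', text)          # drop e-mail addresses
    text = re.sub(r'\d+', '', text)              # drop digits
    text = re.sub(r'[^\w\s]', '', text)          # drop punctuation
    text = re.sub(r'\s+', ' ', text).strip()     # collapse whitespace
    return text


sample = "It's about 0.7 m tall. Bulbasaur naps in BRIGHT sunlight!"
cleaned = preprocess_text(sample)
print(cleaned)
```

Because the text is lowercased and stripped of punctuation, spaCy receives each description as one unpunctuated lowercase string, which can affect which tokens it tags as proper nouns or named entities.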


# Bonus: some small initial analysis 

Within the dataset we have just created, there are many opportunities for further analysis. To illustrate this, the following section outlines several examples of analyses that can be performed using only the information contained in the dataset, without relying on any external data sources.

## Most common words per Pokémon type
This analysis lists the ten most frequent descriptive words for each Pokémon type. First, it loads the annotated CSV and converts the types and tokens columns from strings back to lists. It then explodes the types column so that each Pokémon-type pair becomes a separate row, and removes rows with a missing type. Custom stopwords, including 'pokémon' and 'body', are added to spaCy's default stopword list to avoid counting trivial words. Finally, it iterates over the rows, filtering the tokens and updating a counter per type.


In [None]:
from collections import Counter, defaultdict
import ast
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words
custom_stopwords = stopwords.union({"pokemon", "pokémon", "pokémons", "body"})

df = pd.read_csv("pokedex_annotated.csv")

df["types"] = df["types"].apply(ast.literal_eval)
df["tokens"] = df["tokens"].apply(ast.literal_eval)
df = df[df["tokens"].apply(len) > 0]

df = df.explode("types")

df = df[df["types"].notna()]

type_word_counts = defaultdict(Counter)

for _, row in df.iterrows():
    poke_type = row["types"]
    tokens = [
        t for t in row["tokens"]
        if t not in custom_stopwords and len(t) > 2
    ]
    type_word_counts[poke_type].update(tokens)

for poke_type, counter in type_word_counts.items():
    print(f"\n{poke_type} Pokémon:")
    for word, count in counter.most_common(10):
        print(f"  {word}: {count}")
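
The counting logic can be illustrated on a few invented rows, without the CSV or spaCy. In this sketch each row already holds a list of types and a list of tokens, the inner loop over types plays the role of `explode`, and the small stopword set is a stand-in for spaCy's list.

```python
from collections import Counter, defaultdict

# Invented rows: (types, tokens)
rows = [
    (["fire"], ["the", "flame", "on", "its", "tail", "burns"]),
    (["fire", "flying"], ["its", "flame", "melts", "rock"]),
    (["water"], ["it", "hides", "in", "its", "shell"]),
]
stopwords = {"the", "its", "on", "in", "it"}  # stand-in for the spaCy stopword list

type_word_counts = defaultdict(Counter)
for types, tokens in rows:
    kept = [t for t in tokens if t not in stopwords and len(t) > 2]
    for poke_type in types:  # a dual-type Pokémon counts towards both types
        type_word_counts[poke_type].update(kept)

print(type_word_counts["fire"].most_common(1))
```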

We can visualise this using the WordCloud library, assigning words a size based on their frequency and a color corresponding to their Pokémon type. 

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

type_colors = {
    "fire": "#E41A1C",      # red
    "water": "#377EB8",     # blue
    "grass": "#4DAF4A",     # green
    "electric": "#FFD92F",  # yellow
    "psychic": "#984EA3",   # purple
    "rock": "#A65628",      # brown
    "ground": "#D95F02",    # orange
    "ice": "#A6CEE3",       # light blue
    "dragon": "#1B9E77",    # teal
    "fairy": "#F781BF"      # pink
}

word_frequencies = {}
word_to_type = {}

for poke_type, counter in type_word_counts.items():
    if poke_type not in type_colors:
        continue

    for word, count in counter.most_common(3):
        word_frequencies[word] = count
        word_to_type[word] = poke_type

def color_func(word, *args, **kwargs):
    return type_colors.get(word_to_type.get(word, ""), "black")

wc = WordCloud(
    width=800,
    height=600,
    background_color="white",
    color_func=color_func
).generate_from_frequencies(word_frequencies)

plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Top descriptive words per Pokémon type")
plt.show()
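
In the cell above, a word that is among the top words of several types keeps the frequency and colour of whichever type is processed last. One possible alternative, sketched here on invented counts, assigns each word to the type in which it occurs most often. `assign_words` is a hypothetical helper and not part of the WordCloud library.

```python
from collections import Counter


def assign_words(type_word_counts, top_n=3):
    """Map each top word to its highest count and the type where that count occurs."""
    word_frequencies, word_to_type = {}, {}
    for poke_type, counter in type_word_counts.items():
        for word, count in counter.most_common(top_n):
            if count > word_frequencies.get(word, 0):
                word_frequencies[word] = count
                word_to_type[word] = poke_type
    return word_frequencies, word_to_type


# Invented counts for two types
counts = {
    "fire": Counter({"flame": 9, "hot": 4, "tail": 2}),
    "dragon": Counter({"wings": 5, "flame": 3, "sky": 1}),
}
freqs, owners = assign_words(counts)
print(owners["flame"], freqs["flame"])
```

The two returned dictionaries have the same shape as `word_frequencies` and `word_to_type` in the cell above, so they could be used in their place.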


## Sentiment analysis
Sentiment analysis evaluates the tone of each Pokémon’s description using a sentiment analyzer (in this case, VADER). We calculate a sentiment score per Pokémon, where positive values indicate a positive description and negative values indicate a negative description. The code can then rank Pokémon by their sentiment to identify the most positive or negative ones and aggregate sentiment by type.

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon once; the next cell creates the analyzer and scores the descriptions
nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
df = pd.read_csv("pokedex_annotated.csv")

# Initialize VADER
sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['description'].apply(lambda x: sia.polarity_scores(str(x))['compound'])

top_positive = df.sort_values('sentiment', ascending=False).head(10)
print("Top 10 most positive Pokémon:")
print(top_positive[['name', 'description', 'sentiment']])


top_negative = df.sort_values('sentiment', ascending=True).head(10)
print("\nTop 10 most negative Pokémon:")
print(top_negative[['name', 'description', 'sentiment']])
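
The compound score ranges from -1 (most negative) to +1 (most positive). A common convention from the VADER documentation treats scores of 0.05 and above as positive, scores of -0.05 and below as negative, and everything in between as neutral. The sketch below applies these thresholds to invented scores; `label_sentiment` is a hypothetical helper name.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a VADER compound score into a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Invented example scores, not taken from the dataset
for score in (0.62, 0.0, -0.34):
    print(score, label_sentiment(score))
```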

We can also make a bar chart showing which Pokémon types tend to have more positive or negative descriptions.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

df = pd.read_csv("pokedex_annotated.csv")
df['types'] = df['types'].apply(ast.literal_eval)

df['sentiment'] = df['description'].apply(lambda x: sia.polarity_scores(str(x))['compound'])

# Explode the types to handle dual-types
df_exploded = df.explode('types')

type_sentiment = df_exploded.groupby('types')['sentiment'].mean().sort_values()

# Color scale: red for negative, green for positive
norm = mcolors.Normalize(vmin=type_sentiment.min(), vmax=type_sentiment.max())
cmap = plt.cm.RdYlGn  # Red → Yellow → Green
colors = [cmap(norm(val)) for val in type_sentiment.values]

plt.figure(figsize=(10, 6))
plt.barh(type_sentiment.index, type_sentiment.values, color=colors)
plt.title("Average Sentiment per Pokémon Type")
plt.xlabel("Average Sentiment")
plt.ylabel("Pokémon Type")
plt.tight_layout()
plt.show()
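
What `explode` followed by `groupby(...).mean()` computes can be shown on a few invented rows in plain Python: a dual-type Pokémon contributes its score to both of its types, and the scores are then averaged per type.

```python
from collections import defaultdict
from statistics import mean

# Invented (types, sentiment) pairs
rows = [
    (["ghost", "poison"], -0.6),
    (["ghost"], -0.2),
    (["fairy"], 0.5),
    (["fairy", "poison"], 0.1),
]

scores_by_type = defaultdict(list)
for types, sentiment in rows:
    for poke_type in types:  # same effect as exploding the types column
        scores_by_type[poke_type].append(sentiment)

# Average the collected scores per type and print them from lowest to highest
type_sentiment = {t: mean(s) for t, s in scores_by_type.items()}
for poke_type, value in sorted(type_sentiment.items(), key=lambda kv: kv[1]):
    print(f"{poke_type}: {value:.2f}")
```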

## Heatmap of type co-occurrences for dual-type Pokémon
Some Pokémon have exactly two types (dual-type Pokémon). We can investigate which types most often co-occur by building a matrix whose rows and columns are all Pokémon types. Each cell holds the number of Pokémon with that type combination.

In [None]:
import ast
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("pokedex_annotated.csv")
df['types'] = df['types'].apply(ast.literal_eval)

# Keep only Pokémon with 2 types
dual_type_df = df[df['types'].apply(len) == 2]

# All types
all_types = sorted({t for sublist in df['types'] for t in sublist})

co_matrix = pd.DataFrame(0, index=all_types, columns=all_types)
for types in dual_type_df['types']:
    t1, t2 = types
    co_matrix.loc[t1, t2] += 1
    co_matrix.loc[t2, t1] += 1  # symmetric

fig, ax = plt.subplots(figsize=(12,10))
cax = ax.matshow(co_matrix, cmap='YlGnBu')  # color map

fig.colorbar(cax, label='Co-occurrence count')

ax.set_xticks(np.arange(len(all_types)))
ax.set_yticks(np.arange(len(all_types)))
ax.set_xticklabels(all_types, rotation=45, ha='right')
ax.set_yticklabels(all_types)

# Add counts inside cells
for i in range(len(all_types)):
    for j in range(len(all_types)):
        count = co_matrix.iloc[i, j]
        if count > 0:
            ax.text(j, i, str(count), va='center', ha='center', color='black')

plt.title("Co-occurrence of Dual Pokémon Types")
plt.tight_layout()
plt.show()
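
The same counts can also be collected without a matrix. This sketch, on invented type lists, stores each dual-type combination as a sorted tuple in a `Counter`, so that ('fire', 'flying') and ('flying', 'fire') count as the same pair. It is an alternative view of the data, not the method used for the heatmap.

```python
from collections import Counter

# Invented type lists; single-type entries are ignored
all_type_lists = [
    ["grass", "poison"],
    ["poison", "grass"],
    ["fire"],
    ["fire", "flying"],
    ["bug", "flying"],
]

# Sorting each pair makes the order of the two types irrelevant
pair_counts = Counter(
    tuple(sorted(types)) for types in all_type_lists if len(types) == 2
)

for pair, count in pair_counts.most_common():
    print(pair, count)
```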
