# Setting up Python for production

Here we put together a series of Python scripts, each with a specific role in the data pipeline. This structure mimics a real-world production environment, where the components of a data pipeline are kept separate for clarity, ease of maintenance, and reusability.

---

**About the data**

In this example we use the PokéAPI, as it is stable and easy to understand.


## The project file tree

```text
pokemon_analysis_project/
├── pokemon_analyzer/
│   ├── __init__.py
│   ├── get_data.py
│   ├── clean_data.py
│   ├── feature_engineering.py
│   ├── create_plots.py
│   └── main.py
│
├── data/
│   ├── raw_pokemon_data.csv
│   ├── cleaned_pokemon_data.csv
│   └── featured_pokemon_data.csv
│
├── logs/
│   └── data_fetch.log
│
├── plots/
│   ├── attack_vs_defense.html
│   ├── type_distribution.html
│   └── combat_total_by_speed.html
│
└── requirements.txt
```


---

We designed this pipeline with the principle of "Separation of Concerns". Each file has one clear responsibility. This makes the project easy to understand, debug, and extend.

## get_data.py

- **What it does:** This script is solely responsible for data acquisition. It connects to the external PokéAPI, fetches the raw data, and saves it.

- **Why it's separate:** Data fetching can be complex and unreliable. By isolating it, we can focus on API logic, error handling (like request exceptions), logging, and user experience (the tqdm progress bar) without cluttering the data cleaning or analysis code.
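
The logging concern mentioned here could also be handled with a named, module-level logger instead of the root logger, so that importing the module does not change logging for the rest of an application. A minimal sketch; the helper `build_logger` and the logger name `pokemon_analyzer.get_data` are illustrative assumptions, not part of the project code:

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Return a named logger that writes to `stream`, without duplicating handlers."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep messages out of the root logger's handlers

    # Attach a handler only once, so repeated calls stay idempotent.
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger


# An in-memory stream stands in for a log file in this example.
buffer = io.StringIO()
log = build_logger("pokemon_analyzer.get_data", buffer)
log = build_logger("pokemon_analyzer.get_data", buffer)  # second call adds no handler

log.info("Starting API fetch for %d Pokémon.", 151)
log.debug("This message is below INFO and is dropped.")
print(buffer.getvalue(), end="")
```

Checking `logger.handlers` before adding a handler is an alternative to clearing all handlers on every call; it leaves handlers configured elsewhere untouched.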

## clean_data.py

- **What it does:** Takes the raw, messy data and prepares it for analysis. This includes handling missing values (type2), converting units (height/weight), and standardizing formats.

- **Why it's separate:** Data cleaning is a distinct and often iterative process. Separating it allows us to create a reliable, clean dataset that can be used by multiple different analyses or models later. It ensures that any subsequent step starts from the same clean baseline.

## feature_engineering.py

- **What it does:** This is the creative part of the analysis. It takes the clean data and creates new, insightful columns (features) like combat_total and bmi.

- **Why it's separate:** Feature engineering is experimental. You might try creating ten different features and only keep two. By having this logic in its own file, we can easily add, remove, and test new features without any risk of breaking the initial data fetching or cleaning steps.

## create_plots.py

- **What it does:** This is the presentation layer. It takes the final, featured data and generates visualizations.

- **Why it's separate:** The way we present data can change frequently based on the audience. This file isolates all visualization logic (using plotly), making it easy to change a plot type, add a new graph, or switch to a different plotting library entirely, without affecting the underlying data. We also built in flexibility to save files or display in notebooks.

## main.py

- **What it does:** This is the orchestrator. It doesn't perform any analysis itself; its only job is to call the functions from the other scripts in the correct order to run the entire pipeline from start to finish.

- **Why it's separate:** It provides a single, clear entry point to the application. When someone wants to run the project, they only need to run `python main.py`. It also documents the workflow of the entire project at a high level.
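
The orchestration idea can be sketched independently of the real steps. The step functions below are hypothetical stand-ins, not the project's functions; the point is only the control flow: run the steps in order and stop at the first one that returns `None`.

```python
def run_steps(steps, initial=None):
    """Run (name, function) pairs in order, passing each result to the next step.

    Returns (last_result, completed_names). Stops early when a step returns None.
    """
    completed = []
    result = initial
    for name, func in steps:
        result = func(result)
        if result is None:
            print(f"Step '{name}' failed. Aborting pipeline.")
            return None, completed
        completed.append(name)
    return result, completed


# Hypothetical stand-ins for fetching and cleaning.
def fetch(_):
    return [{"name": "bulbasaur", "weight": 69, "height": 7}]


def clean(rows):
    return [
        {**row, "weight_kg": row["weight"] / 10, "height_m": row["height"] / 10}
        for row in rows
    ]


def broken_step(_):
    return None  # simulates a step that could not produce data


ok_result, ok_done = run_steps([("fetch", fetch), ("clean", clean)])
bad_result, bad_done = run_steps(
    [("fetch", fetch), ("broken", broken_step), ("clean", clean)]
)
print(ok_done)
print(bad_done)
```

Keeping the step list as data makes it easy to add, remove, or reorder stages without touching the loop that runs them.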


# 1. get_data.py


In [1]:
import requests
import pandas as pd
import os
import logging
from tqdm import tqdm


def setup_logging():
    """
    Configures a robust logger to write to a file and the console.
    This function is safe to call multiple times.
    """
    # Create the 'logs' directory if it doesn't exist
    os.makedirs('logs', exist_ok=True)

    log = logging.getLogger()  # Get the root logger
    log.setLevel(logging.INFO)  # Capture INFO and above; DEBUG messages are dropped

    # If handlers are already present, remove them.
    # This ensures we don't add duplicate handlers on subsequent calls.
    if log.hasHandlers():
        log.handlers.clear()

    # Create file handler which logs detailed INFO messages
    fh = logging.FileHandler('logs/data_fetch.log', mode='w')
    fh.setLevel(logging.INFO)
    fh_formatter = logging.Formatter(
        '%(asctime)s - %(levelname)s - %(message)s')
    fh.setFormatter(fh_formatter)
    log.addHandler(fh)

    # Create console handler with a higher log level (WARNING)
    # This keeps the console clean for the progress bar.
    ch = logging.StreamHandler()
    ch.setLevel(logging.WARNING)
    ch_formatter = logging.Formatter('%(levelname)s: %(message)s')
    ch.setFormatter(ch_formatter)
    log.addHandler(ch)


def get_pokemon_data(num_pokemon=151):
    """
    Fetches data for a specified number of Pokémon from the PokéAPI, showing a progress bar.

    Args:
        num_pokemon (int): The number of Pokémon to fetch (defaults to 151).

    Returns:
        pandas.DataFrame: A DataFrame containing the fetched Pokémon data.
                          Returns None if no Pokémon could be fetched.
    """

    setup_logging()
    logging.info(f"Starting API fetch for {num_pokemon} Pokémon.")

    print("Fetching data... (See 'logs/data_fetch.log' for detailed logs)")
    base_url = "https://pokeapi.co/api/v2/pokemon"
    pokemon_data = []

    for i in tqdm(range(1, num_pokemon + 1), desc="Fetching Pokémon"):
        try:
            # A timeout keeps a stalled connection from hanging the whole loop
            response = requests.get(f"{base_url}/{i}", timeout=10)
            response.raise_for_status()
            data = response.json()

            pokemon_info = {
                'id': data['id'],
                'name': data['name'],
                'height': data['height'],
                'weight': data['weight'],
                'base_experience': data['base_experience'],
                'type1': data['types'][0]['type']['name'],
                'type2': data['types'][1]['type']['name'] if len(data['types']) > 1 else None,
                'hp': data['stats'][0]['base_stat'],
                'attack': data['stats'][1]['base_stat'],
                'defense': data['stats'][2]['base_stat'],
                'special-attack': data['stats'][3]['base_stat'],
                'special-defense': data['stats'][4]['base_stat'],
                'speed': data['stats'][5]['base_stat'],
                'sprite_url': data['sprites']['front_default']
            }
            pokemon_data.append(pokemon_info)
            logging.info(
                f"Successfully fetched data for #{data['id']} - {data['name']}.")

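        except requests.exceptions.RequestException as e:
            logging.error(f"Error fetching data for Pokémon ID {i}: {e}")
            continue
        except (KeyError, IndexError) as e:
            # The response did not have the structure the parsing code expects
            logging.error(f"Unexpected response format for Pokémon ID {i}: {e!r}")
            continue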

    if not pokemon_data:
        logging.warning("No data was fetched. The final DataFrame is empty.")
        return None

    logging.info(f"Successfully fetched data for {len(pokemon_data)} Pokémon.")
    return pd.DataFrame(pokemon_data)


def save_data(df, folder="data", filename="raw_pokemon_data.csv"):
    """
    Saves a DataFrame to a CSV file.

    Args:
        df (pandas.DataFrame): The DataFrame to save.
        folder (str): The directory to save the file in.
        filename (str): The name of the file.
    """
    if df is None:
        logging.warning("DataFrame is None, skipping save.")
        return

    if not os.path.exists(folder):
        os.makedirs(folder)

    filepath = os.path.join(folder, filename)
    df.to_csv(filepath, index=False)
    print(f"\nData saved successfully to {filepath}")
    logging.info(f"DataFrame saved to {filepath}")


if __name__ == "__main__":
    print("Running get_data.py as a standalone script.")
    raw_data = get_pokemon_data(151)
    if raw_data is not None:
        save_data(raw_data)

Running get_data.py as a standalone script.
Fetching data... (See 'logs/data_fetch.log' for detailed logs)


Fetching Pokémon:   0%|          | 0/151 [00:00<?, ?it/s]ERROR: Error fetching data for Pokémon ID 1: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/1 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
ERROR: Error fetching data for Pokémon ID 2: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/2 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
Fetching Pokémon:   1%|▏         | 2/151 [00:00<00:09, 15.20it/s]ERROR: Error fetching data for Pokémon ID 3: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/3 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
ERROR: Error fetching data for Pokémon ID 4: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/4 (C
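
The positional lookups such as `data['stats'][0]` assume the API always returns stats in the same order. A slightly more defensive sketch looks stats up by name instead. The payload below is a hand-trimmed illustration of the response shape (the stat values match the Bulbasaur row shown later in this notebook), and `extract_stats` and `extract_types` are hypothetical helpers, not part of the project:

```python
def extract_stats(payload):
    """Map each stat name to its base value, independent of list order."""
    return {entry["stat"]["name"]: entry["base_stat"] for entry in payload["stats"]}


def extract_types(payload):
    """Return (type1, type2); type2 is None for single-type Pokémon."""
    ordered = sorted(payload["types"], key=lambda t: t["slot"])
    names = [t["type"]["name"] for t in ordered]
    return names[0], (names[1] if len(names) > 1 else None)


# Trimmed, illustrative payload; a real response carries many more fields.
sample = {
    "name": "bulbasaur",
    "types": [
        {"slot": 2, "type": {"name": "poison"}},
        {"slot": 1, "type": {"name": "grass"}},
    ],
    "stats": [
        {"base_stat": 45, "stat": {"name": "speed"}},
        {"base_stat": 45, "stat": {"name": "hp"}},
        {"base_stat": 49, "stat": {"name": "attack"}},
        {"base_stat": 49, "stat": {"name": "defense"}},
        {"base_stat": 65, "stat": {"name": "special-attack"}},
        {"base_stat": 65, "stat": {"name": "special-defense"}},
    ],
}

stats = extract_stats(sample)
type1, type2 = extract_types(sample)
print(stats["hp"], stats["speed"], type1, type2)
```

Because the lookups go by name, the deliberately shuffled order in `sample` does not affect the result.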

# 2. clean_data.py


In [2]:
import os

import pandas as pd


def clean_pokemon_data(input_path="data/raw_pokemon_data.csv", output_path="data/cleaned_pokemon_data.csv"):
    """
    Cleans the raw Pokémon data.
    - Fills missing 'type2' values.
    - Converts weight from hectograms to kilograms.
    - Converts height from decimetres to meters.

    Args:
        input_path (str): The path to the raw data CSV file.
        output_path (str): The path to save the cleaned data CSV file.

    Returns:
        pandas.DataFrame: The cleaned DataFrame.
    """

    print("Starting data cleaning process...")
    try:
        df = pd.read_csv(input_path)
        print("Raw data loaded successfully.")
    except FileNotFoundError:
        print(f"Error: The file {input_path} was not found.")
        return None

    # Data Cleaning Steps

    # 1. Handle missing data
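    # The 'type2' column has missing values for Pokémon with only one type.
    # We assign the result back to the column rather than using inplace=True,
    # which avoids pandas' FutureWarning.
    # Note: recent pandas versions treat the string 'None' as missing when a CSV
    # is read back with default settings, so later steps may see NaN here again.
    df['type2'] = df['type2'].fillna('None')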
    print("Filled missing 'type2' values.")

    # 2. Convert units for clarity
    # The API provides weight in hectograms and height in decimetres.
    # Let's convert them to more standard units (kg and meters).
    df['weight_kg'] = df['weight'] / 10.0
    df['height_m'] = df['height'] / 10.0
    print("Converted weight to kg and height to meters.")

    # Drop the original columns
    df.drop(['weight', 'height'], axis=1, inplace=True)

    # Save Cleaned Data
    # It's good practice to save intermediate results
    output_folder = os.path.dirname(output_path)
    if output_folder:  # dirname is '' when output_path has no directory part
        os.makedirs(output_folder, exist_ok=True)

    df.to_csv(output_path, index=False)
    print(f"Cleaned data saved successfully to {output_path}")

    return df


if __name__ == "__main__":
    print("Running clean_data.py as a standalone script.")
    cleaned_df = clean_pokemon_data()
    if cleaned_df is not None:
        print("\nCleaned Data Head:")
        print(cleaned_df.head())

Running clean_data.py as a standalone script.
Starting data cleaning process...
Raw data loaded successfully.
Filled missing 'type2' values.
Converted weight to kg and height to meters.
Cleaned data saved successfully to data/cleaned_pokemon_data.csv

Cleaned Data Head:
   id        name  base_experience  type1   type2  hp  attack  defense  \
0   1   bulbasaur               64  grass  poison  45      49       49   
1   2     ivysaur              142  grass  poison  60      62       63   
2   3    venusaur              236  grass  poison  80      82       83   
3   4  charmander               62   fire    None  39      52       43   
4   5  charmeleon              142   fire    None  58      64       58   

   special-attack  special-defense  speed  \
0              65               65     45   
1              80               80     60   
2             100              100     80   
3              60               50     65   
4              80               65     80   

             

# 3. feature_engineering.py


In [3]:
import os

import numpy as np
import pandas as pd


def create_features(input_path="data/cleaned_pokemon_data.csv", output_path="data/featured_pokemon_data.csv"):
    """
    Engineers new features from the cleaned Pokémon data.
    - Calculates a 'combat_total' stat.
    - Calculates BMI (Body Mass Index).
    - Categorizes Pokémon by speed.

    Args:
        input_path (str): Path to the cleaned data CSV.
        output_path (str): Path to save the featured data CSV.

    Returns:
        pandas.DataFrame: DataFrame with new features.
    """
    print("Starting feature engineering process...")
    try:
        df = pd.read_csv(input_path)
        print("Cleaned data loaded successfully.")
    except FileNotFoundError:
        print(f"Error: The file {input_path} was not found.")
        return None

    # Feature Engineering Steps

    # 1. Create a 'combat_total' stat
    # This gives a general idea of a Pokémon's overall strength in battle.
    stat_columns = ['hp', 'attack', 'defense',
                    'special-attack', 'special-defense', 'speed']
    df['combat_total'] = df[stat_columns].sum(axis=1)
    print("Created 'combat_total' feature.")

    # 2. Calculate Body Mass Index (BMI)
    # BMI = weight (kg) / (height (m))^2
    df['bmi'] = df.apply(
        lambda row: row['weight_kg'] /
        (row['height_m'] ** 2) if row['height_m'] > 0 else 0,
        axis=1
    )
    print("Created 'bmi' feature.")

    # 3. Categorize Pokémon by Speed
    # Create descriptive categories for speed stat.
    speed_bins = [0, 50, 80, 100, np.inf]
    speed_labels = ['Slow', 'Average', 'Fast', 'Very Fast']
    df['speed_category'] = pd.cut(
        df['speed'], bins=speed_bins, labels=speed_labels, right=False)
    print("Created 'speed_category' feature.")

    # Save Featured Data
    output_folder = os.path.dirname(output_path)
    if output_folder:  # dirname is '' when output_path has no directory part
        os.makedirs(output_folder, exist_ok=True)

    df.to_csv(output_path, index=False)
    print(f"Featured data saved successfully to {output_path}")

    return df


if __name__ == "__main__":
    print("Running feature_engineering.py as a standalone script.")
    featured_df = create_features()
    if featured_df is not None:
        print("\nFeatured Data Head (with new columns):")
        print(featured_df[['name', 'combat_total',
              'bmi', 'speed_category']].head())

Running feature_engineering.py as a standalone script.
Starting feature engineering process...
Cleaned data loaded successfully.
Created 'combat_total' feature.
Created 'bmi' feature.
Created 'speed_category' feature.
Featured data saved successfully to data/featured_pokemon_data.csv

Featured Data Head (with new columns):
         name  combat_total        bmi speed_category
0   bulbasaur           318  14.081633           Slow
1     ivysaur           405  13.000000        Average
2    venusaur           525  25.000000           Fast
3  charmander           309  23.611111        Average
4  charmeleon           405  15.702479           Fast
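
The two derived features can be checked by hand. The sketch below recomputes BMI and the speed category in plain Python; `bmi` and `speed_category` are illustrative helper names, and the inputs (6.9 kg, 0.7 m, speed 45 for Bulbasaur) follow from the outputs shown above. `bisect_right` reproduces the left-closed bins that `pd.cut(..., right=False)` produces, assuming non-negative speeds.

```python
from bisect import bisect_right

# Interior bin edges: [0, 50), [50, 80), [80, 100), [100, inf)
SPEED_EDGES = [50, 80, 100]
SPEED_LABELS = ["Slow", "Average", "Fast", "Very Fast"]


def bmi(weight_kg, height_m):
    """BMI = weight (kg) / height (m)^2; returns 0.0 for a non-positive height."""
    return weight_kg / height_m ** 2 if height_m > 0 else 0.0


def speed_category(speed):
    """Pick the label whose left-closed bin contains `speed`."""
    return SPEED_LABELS[bisect_right(SPEED_EDGES, speed)]


print(round(bmi(6.9, 0.7), 6))  # Bulbasaur
print(speed_category(45))       # below the first edge
print(speed_category(80))       # 80 falls in [80, 100)
```

Small pure functions like these are easy to unit test before the same logic is applied to a whole DataFrame.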


# 4. create_plots.py


In [4]:
import os

import pandas as pd
import plotly.express as px
import plotly.io as pio


def create_and_save_plots(input_path="data/featured_pokemon_data.csv",
                          output_folder="plots",
                          save_files=True,
                          display_in_notebook=False):
    """
    Creates various plots from the featured Pokémon data.
    - Saves them as HTML files if save_files is True.
    - Displays them inline in a notebook if display_in_notebook is True.

    Args:
        input_path (str): Path to the featured data CSV.
        output_folder (str): Directory to save the plot HTML files.
        save_files (bool): If True, saves plots as HTML files.
        display_in_notebook (bool): If True, displays plots directly using fig.show().
    """
    print("Starting plot generation...")
    try:
        df = pd.read_csv(input_path)
        print("Featured data loaded successfully.")
    except FileNotFoundError:
        print(f"Error: The file {input_path} was not found.")
        return

    # Plot 1: Attack vs. Defense Scatter Plot
    fig1 = px.scatter(
        df,
        x='attack',
        y='defense',
        color='type1',
        hover_data=['name', 'combat_total'],
        title='Attack vs. Defense of Generation 1 Pokémon',
        labels={'attack': 'Attack Stat',
                'defense': 'Defense Stat', 'type1': 'Primary Type'}
    )

    # Plot 2: Distribution of Primary Types
    type_counts = df['type1'].value_counts()
    fig2 = px.bar(
        x=type_counts.index,
        y=type_counts.values,
        title='Distribution of Primary Pokémon Types',
        labels={'x': 'Pokémon Type', 'y': 'Count'}
    )

    # Plot 3: Combat Total by Speed Category
    fig3 = px.box(
        df,
        x='speed_category',
        y='combat_total',
        color='speed_category',
        title='Combat Power by Speed Category',
        labels={'speed_category': 'Speed Category',
                'combat_total': 'Total Combat Stats'},
        category_orders={"speed_category": [
            "Slow", "Average", "Fast", "Very Fast"]}
    )

    # Display or Save the plots
    if display_in_notebook:
        print("Displaying plots in notebook...")
        fig1.show()
        fig2.show()
        fig3.show()

    if save_files:
        if not os.path.exists(output_folder):
            os.makedirs(output_folder)
        print(f"Saving plot files in '{output_folder}/'")

        plot1_path = os.path.join(output_folder, "attack_vs_defense.html")
        pio.write_html(fig1, file=plot1_path, auto_open=False)
        print(f"- Saved: {plot1_path}")

        plot2_path = os.path.join(output_folder, "type_distribution.html")
        pio.write_html(fig2, file=plot2_path, auto_open=False)
        print(f"- Saved: {plot2_path}")

        plot3_path = os.path.join(output_folder, "combat_total_by_speed.html")
        pio.write_html(fig3, file=plot3_path, auto_open=False)
        print(f"- Saved: {plot3_path}")

    print("\nPlot generation complete.")


if __name__ == "__main__":
    print("Running create_plots.py as a standalone script (saving files).")
    create_and_save_plots(display_in_notebook=False, save_files=True)

Running create_plots.py as a standalone script (saving files).
Starting plot generation...
Featured data loaded successfully.
Saving plot files in 'plots/'
- Saved: plots/attack_vs_defense.html
- Saved: plots/type_distribution.html
- Saved: plots/combat_total_by_speed.html

Plot generation complete.


In [5]:
# To display the plots inline in a notebook, run this cell

print("Generating and displaying plots for the notebook...")

create_and_save_plots(
    input_path="data/featured_pokemon_data.csv",
    display_in_notebook=True,
    save_files=False
)

Generating and displaying plots for the notebook...
Starting plot generation...
Featured data loaded successfully.
Displaying plots in notebook...



Plot generation complete.


# 5. `__init__.py`

The `__init__.py` file, even when empty, serves an important function in Python.

It marks a directory as a regular Python package. When Python's importer sees a directory containing an `__init__.py` file, it treats that directory as a single, importable package.

It enables package-style imports between modules. With `__init__.py` inside the `pokemon_analyzer/` directory, code can write `from pokemon_analyzer.get_data import get_pokemon_data` from the project root, or `from .get_data import get_pokemon_data` from another module in the package. The plain form `from get_data import get_pokemon_data` used in `main.py` works for a different reason: when you run `python main.py`, Python puts the script's own directory on the import path.

In short, it's the "glue" that lets our separate .py files work together as one cohesive, importable application. While it can hold package-level initialization code, it is often left empty for simpler projects like this one.
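
To see package-style imports in action without touching the real project, the sketch below builds a throwaway package in a temporary directory and imports from it. The names `demo_pkg` and `greet` are made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()

    # An empty __init__.py is enough to make demo_pkg a regular package.
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "get_data.py").write_text(
        "def greet():\n    return 'hello from demo_pkg.get_data'\n"
    )

    # The package's parent directory must be on the import path.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("demo_pkg.get_data")
        message = module.greet()
        package_file = Path(sys.modules["demo_pkg"].__file__).name
    finally:
        sys.path.remove(tmp)

print(message)
print(package_file)
```

The second printed value is the file Python loaded for the package itself, which shows where package-level initialization code would run.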


In [6]:
# This file can be empty.
# Its presence makes the directory a Python package,
# allowing for imports between the .py files.

print("Pokemon analysis package initialized.")

Pokemon analysis package initialized.


# 6. main.py


In [7]:
# In the notebook the functions below are already defined in earlier cells,
# so these imports stay commented out. In a standalone main.py, uncomment them.

# from get_data import get_pokemon_data, save_data
# from clean_data import clean_pokemon_data
# from feature_engineering import create_features
# from create_plots import create_and_save_plots

def run_pipeline():
    """
    Executes the entire Pokémon data analysis pipeline.
    """
    print("=============================================")
    print("=== STARTING POKEMON DATA ANALYSIS PIPELINE ===")
    print("=============================================\n")

    # Step 1: Get Data
    print("Step 1: Fetching Data from PokéAPI ")
    raw_df = get_pokemon_data(num_pokemon=151)
    if raw_df is not None:
        save_data(raw_df, folder="data", filename="raw_pokemon_data.csv")
        print("Step 1 Complete \n")
    else:
        print("Failed to fetch data. Aborting pipeline.")
        return

    # Step 2: Clean Data
    print("Step 2: Cleaning Raw Data ")
    cleaned_df = clean_pokemon_data(
        input_path="data/raw_pokemon_data.csv",
        output_path="data/cleaned_pokemon_data.csv"
    )
    if cleaned_df is not None:
        print("Step 2 Complete \n")
    else:
        print("Failed to clean data. Aborting pipeline.")
        return

    # Step 3: Feature Engineering
    print("Step 3: Engineering New Features ")
    featured_df = create_features(
        input_path="data/cleaned_pokemon_data.csv",
        output_path="data/featured_pokemon_data.csv"
    )
    if featured_df is not None:
        print("Step 3 Complete \n")
    else:
        print("Failed to engineer features. Aborting pipeline.")
        return

    # Step 4: Create Plots
    print("Step 4: Generating Visualizations ")
    create_and_save_plots(
        input_path="data/featured_pokemon_data.csv",
        output_folder="plots"
    )
    print("Step 4 Complete \n")

    print("======================================")
    print("=== PIPELINE EXECUTION FINISHED! ===")
    print("======================================")
    print("Check the 'data' folder for CSV files and the 'plots' folder for HTML graphs.")


if __name__ == "__main__":
    run_pipeline()

=== STARTING POKEMON DATA ANALYSIS PIPELINE ===

Step 1: Fetching Data from PokéAPI 
Fetching data... (See 'logs/data_fetch.log' for detailed logs)


Fetching Pokémon:   0%|          | 0/151 [00:00<?, ?it/s]ERROR: Error fetching data for Pokémon ID 1: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/1 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
ERROR: Error fetching data for Pokémon ID 2: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/2 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
ERROR: Error fetching data for Pokémon ID 3: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/3 (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1000)')))
Fetching Pokémon:   2%|▏         | 3/151 [00:00<00:07, 21.08it/s]ERROR: Error fetching data for Pokémon ID 4: HTTPSConnectionPool(host='pokeapi.co', port=443): Max retries exceeded with url: /api/v2/pokemon/4 (C

Failed to fetch data. Aborting pipeline.


# Just for fun!


In [9]:
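import pandas as pd
from IPython.display import display, Image


def display_pokemon_by_id(pokemon_id, data_path="data/featured_pokemon_data.csv"):
    """
    Displays the sprite and base stats of one Pokémon from the featured data.

    Args:
        pokemon_id (int): The ID of the Pokémon to look up.
        data_path (str): Path to the featured data CSV.
    """
    print(f"--- Looking for Pokémon with ID: {pokemon_id} ---")

    # Read the data
    df = pd.read_csv(data_path)

    # Find the Pokémon by its ID; stop early if the ID is not in the data
    matches = df[df['id'] == pokemon_id]
    if matches.empty:
        print(f"No Pokémon with ID {pokemon_id} found in {data_path}.")
        return
    pokemon = matches.iloc[0]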

    # Get the data for display
    name = pokemon['name'].capitalize()
    image_url = pokemon['sprite_url']

    # Display the image
    display(Image(url=image_url, width=250))

    # Print the stats
    print("\n" + "="*30)
    print(f"Stats for: {name} (#{pokemon['id']})")
    print("="*30)
    print(f"HP:              {pokemon['hp']}")
    print(f"Attack:          {pokemon['attack']}")
    print(f"Defense:         {pokemon['defense']}")
    print(f"Special Attack:  {pokemon['special-attack']}")
    print(f"Special Defense: {pokemon['special-defense']}")
    print(f"Speed:           {pokemon['speed']}")
    print("-"*30)
    print(f"Total Stats:     {pokemon['combat_total']}")
    print("="*30 + "\n")


if __name__ == "__main__":
    user_input = input("Enter a Pokémon ID (1-151) to display: ")
    try:
        pokemon_id_to_find = int(user_input)
    except ValueError:
        print(f"'{user_input}' is not a valid Pokémon ID.")
    else:
        display_pokemon_by_id(pokemon_id_to_find)

--- Looking for Pokémon with ID: 130 ---



Stats for: Gyarados (#130)
HP:              95
Attack:          125
Defense:         79
Special Attack:  60
Special Defense: 100
Speed:           81
------------------------------
Total Stats:     540
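
Because `input()` returns arbitrary text, a small validation helper can keep the lookup from failing on bad input. A sketch, with `parse_pokemon_id` as a hypothetical helper and the 1-151 range taken from the prompt above:

```python
def parse_pokemon_id(text, max_id=151):
    """Convert user input to a valid Pokémon ID, or return None if it is not usable."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if 1 <= value <= max_id else None


for raw in ["130", " 25 ", "0", "152", "pikachu"]:
    print(repr(raw), "->", parse_pokemon_id(raw))
```

Returning `None` for unusable input lets the caller decide whether to ask again or exit.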

