# WHONET Organisms Code Extraction and Processing

This notebook extracts organism codes, names, and types from the WHONET Organisms.txt file and saves them to a structured CSV file.

## Steps:
1. Load the Organisms.txt file from the local resources
2. Process and clean the data
3. Optionally supplement with web scraping from the WHONET Code Finder website
4. Save the complete data to CSV with additional metadata
5. Generate analysis and summary

In [1]:
import pandas as pd
import numpy as np
import re
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Optional imports for web scraping if needed
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import requests
from bs4 import BeautifulSoup
import time

## Load WHONET Organisms.txt File
First, we'll load the existing Organisms.txt file and process it

In [2]:
# File path to the Organisms.txt file
organisms_file = r'C:\Personal_Projects\Astro\project_resources\Organisms.txt'

# Check if file exists
if not os.path.exists(organisms_file):
    print(f"Error: File {organisms_file} not found!")
else:
    print(f"Loading organisms data from {organisms_file}...")
    
    # Load the file with tab delimiter
    try:
        # First attempt to read with pandas
        df_organisms = pd.read_csv(organisms_file, sep='\t')
        print(f"Successfully loaded {len(df_organisms)} organisms using pandas")
    except Exception as e:
        print(f"Error reading with pandas: {str(e)}")
        print("Trying alternative approach...")
        
        # Alternative approach: read line by line
        try:
            with open(organisms_file, 'r') as f:
                lines = f.readlines()
            
            # Get header from first line
            header = lines[0].strip().split('\t')
            
            # Process remaining lines
            data = []
            for line in lines[1:]:
                values = line.strip().split('\t')
                # Ensure values match header length
                while len(values) < len(header):
                    values.append('')
                data.append(values[:len(header)])  # Truncate if longer than header
            
            # Create DataFrame
            df_organisms = pd.DataFrame(data, columns=header)
            print(f"Successfully loaded {len(df_organisms)} organisms using manual approach")
        except Exception as e2:
            print(f"Error with alternative approach: {str(e2)}")
            df_organisms = pd.DataFrame()  # Create empty DataFrame as fallback
            
# Display the DataFrame structure
if not df_organisms.empty:
    print(f"\nDataFrame contains {len(df_organisms)} rows and {len(df_organisms.columns)} columns")
    print("\nColumns:")
    print(list(df_organisms.columns))
    print("\nFirst 5 records:")
    display(df_organisms.head())
else:
    print("Failed to load organisms data from file.")

Loading organisms data from C:\Personal_Projects\Astro\project_resources\Organisms.txt...
Successfully loaded 2946 organisms using pandas

DataFrame contains 2946 rows and 26 columns

Columns:
['WHONET_ORG_CODE', 'REPLACED_BY', 'ORGANISM', 'TAXONOMIC_STATUS', 'COMMON', 'ORGANISM_TYPE', 'ANAEROBE', 'MORPHOLOGY', 'SUBKINGDOM_CODE', 'FAMILY_CODE', 'GENUS_GROUP', 'GENUS_CODE', 'SPECIES_GROUP', 'SEROVAR_GROUP', 'MSF_GRP_CLIN', 'SCT_CODE', 'SCT_TEXT', 'GBIF_TAXON_ID', 'GBIF_DATASET_ID', 'GBIF_TAXONOMIC_STATUS', 'KINGDOM', 'PHYLUM', 'CLASS', 'ORDER', 'FAMILY', 'GENUS']

First 5 records:


Unnamed: 0,WHONET_ORG_CODE,REPLACED_BY,ORGANISM,TAXONOMIC_STATUS,COMMON,ORGANISM_TYPE,ANAEROBE,MORPHOLOGY,SUBKINGDOM_CODE,FAMILY_CODE,...,SCT_TEXT,GBIF_TAXON_ID,GBIF_DATASET_ID,GBIF_TAXONOMIC_STATUS,KINGDOM,PHYLUM,CLASS,ORDER,FAMILY,GENUS
0,saj,,Abiotrophia adiacens,O,,+,,,+,,...,Granulicatella adiacens,3227166,7ddf754f-d193-4cc9-b351-99906754a03b,accepted,Bacteria,Firmicutes,Bacilli,Lactobacillales,Carnobacteriaceae,Granulicatella
1,sdf,,Abiotrophia defectiva,C,,+,,,+,,...,Abiotrophia defectiva,3227009,7ddf754f-d193-4cc9-b351-99906754a03b,accepted,Bacteria,Firmicutes,Bacilli,Lactobacillales,Aerococcaceae,Abiotrophia
2,abi,,Abiotrophia sp.,C,,+,,,+,,...,Abiotrophia species,3227008,7ddf754f-d193-4cc9-b351-99906754a03b,accepted,Bacteria,Firmicutes,Bacilli,Lactobacillales,Aerococcaceae,Abiotrophia
3,acf,,Absidia corimbifera,O,,f,,,,,...,,2558326,7ddf754f-d193-4cc9-b351-99906754a03b,accepted,Fungi,Zygomycota,Mucoromycetes,Mucorales,Lichtheimiaceae,Lichtheimia
4,abs,,Absidia sp.,C,,f,,,,,...,Absidia species,2558224,7ddf754f-d193-4cc9-b351-99906754a03b,accepted,Fungi,Zygomycota,Mucoromycetes,Mucorales,Cunninghamellaceae,Absidia


In [3]:
# Process the organism data from file

# Create a clean DataFrame with the columns we need
if not df_organisms.empty:
    print("Cleaning and processing organism data from text file...")
    
    # Save original data count for reference
    original_count = len(df_organisms)
    
    # Function to map organism type codes to full descriptions
    def map_organism_type(code):
        type_map = {
            '+': 'Gram-positive',
            '-': 'Gram-negative',
            'a': 'Anaerobe',
            'f': 'Fungus',
            'm': 'Mycobacteria',
            'b': 'Bacteria',
            'w': 'Other'
        }
        return type_map.get(str(code), 'Unknown')
    
    # Create new DataFrame with selected columns
    organism_cols = ['WHONET_ORG_CODE', 'ORGANISM', 'ORGANISM_TYPE', 'COMMON']
    
    # Check if all required columns exist
    missing_cols = [col for col in organism_cols if col not in df_organisms.columns]
    
    if missing_cols:
        print(f"Warning: Missing columns in the file: {missing_cols}")
        # Create dictionary with available columns
        df_clean = pd.DataFrame()
        for col in organism_cols:
            if col in df_organisms.columns:
                df_clean[col] = df_organisms[col]
            else:
                df_clean[col] = ''
    else:
        df_clean = df_organisms[organism_cols].copy()
    
    # Rename columns to standardized names
    df_clean = df_clean.rename(columns={
        'WHONET_ORG_CODE': 'ORGANISM_CODE',
        'ORGANISM': 'ORGANISM_NAME'
    })
    
    # Add expanded organism type description
    if 'ORGANISM_TYPE' in df_clean.columns:
        df_clean['ORGANISM_TYPE_DESCRIPTION'] = df_clean['ORGANISM_TYPE'].apply(map_organism_type)
    else:
        df_clean['ORGANISM_TYPE_DESCRIPTION'] = 'Unknown'
    
    # Add common organism flag if available
    if 'COMMON' in df_clean.columns:
        df_clean['IS_COMMON'] = df_clean['COMMON'].apply(lambda x: 'Yes' if x == 'X' else 'No')
    else:
        df_clean['IS_COMMON'] = 'Unknown'
    
    # Clean up
    df_clean = df_clean.fillna('')
    
    # Ensure all codes are uppercase
    if 'ORGANISM_CODE' in df_clean.columns:
        df_clean['ORGANISM_CODE'] = df_clean['ORGANISM_CODE'].str.upper()
    
    # Store the original file data in a separate variable for safekeeping
    df_file_data = df_clean.copy()
    
    print(f"Processed {len(df_clean)} organisms successfully from text file")
    
    # Show sample of processed data
    print("\nSample of processed data from text file:")
    display(df_clean.head())
else:
    print("No data to process from text file.")
    df_clean = pd.DataFrame()
    df_file_data = pd.DataFrame()

Cleaning and processing organism data from text file...
Processed 2946 organisms successfully from text file

Sample of processed data from text file:


Unnamed: 0,ORGANISM_CODE,ORGANISM_NAME,ORGANISM_TYPE,COMMON,ORGANISM_TYPE_DESCRIPTION,IS_COMMON
0,SAJ,Abiotrophia adiacens,+,,Gram-positive,No
1,SDF,Abiotrophia defectiva,+,,Gram-positive,No
2,ABI,Abiotrophia sp.,+,,Gram-positive,No
3,ACF,Absidia corimbifera,f,,Fungus,No
4,ABS,Absidia sp.,f,,Fungus,No


## Supplement with Web Scraping (Optional)
We can supplement our data with web scraping from WHONET Code Finder if needed

In [4]:
# Optional: Web scraping for organism data as a supplement to file data

# Set this to True if you want to attempt web scraping
do_web_scraping = False  

# Initialize lists to store web-scraped data
organism_codes_web = []
organism_names_web = []

if do_web_scraping:
    print("Attempting to supplement data from WHONET Code Finder website...")
    
    # First, install required packages if needed
    !pip install selenium webdriver_manager pandas beautifulsoup4

    try:
        # Set up Chrome options
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # Run in headless mode
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')

        # Initialize the browser
        print("Setting up Chrome driver...")
        driver = webdriver.Chrome(
            service=Service(ChromeDriverManager().install()),
            options=chrome_options
        )

        # Navigate to the website
        url = 'https://qaapt.com/whonet/code/finder'
        print(f"Accessing {url}...")
        driver.get(url)

        # Wait for the page to load
        print("Waiting for page to load...")
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'select[name="code_type"]'))
        )

        # Select 'Organism' from dropdown
        print("Selecting 'Organism' from dropdown...")
        dropdown = driver.find_element(By.CSS_SELECTOR, 'select[name="code_type"]')
        for option in dropdown.find_elements(By.TAG_NAME, 'option'):
            if option.text.strip().lower() == 'organism':
                option.click()
                break

        # Wait for the search input to become available
        search_input = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'input[type="search"]'))
        )

        # Click the search button to get all results
        search_button = driver.find_element(By.CSS_SELECTOR, 'button.btn-danger')
        search_button.click()

        # Wait for results to load
        print("Waiting for results to load...")
        time.sleep(3)

        # Check if table is present
        try:
            table = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'table.datatable'))
            )
            print("Found results table.")
        except:
            print("Could not find results table.")
            raise Exception("Table not found")

        # Extract table content
        rows = driver.find_elements(By.CSS_SELECTOR, 'table.datatable tbody tr')
        print(f"Found {len(rows)} organism entries.")

        # Process each row
        for row in rows:
            cells = row.find_elements(By.TAG_NAME, 'td')
            if len(cells) >= 2:
                code = cells[0].text.strip()
                name = cells[1].text.strip()
                if code and name:  # Only add if both values are non-empty
                    organism_codes_web.append(code)
                    organism_names_web.append(name)

        print(f"Successfully extracted {len(organism_codes_web)} organisms from web.")

    except Exception as e:
        print(f"An error occurred during web scraping: {str(e)}")
        
        # Try fallback method with direct HTTP request
        try:
            print("\nAttempting fallback method with direct HTTP request...")
            
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get('https://qaapt.com/whonet/code/finder', headers=headers)
            print(f"HTTP Status Code: {response.status_code}")
            
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                
                # Look for any table that might contain organism data
                tables = soup.find_all('table')
                print(f"Found {len(tables)} tables on the page.")
                
                # Parse tables if any found
                if len(tables) > 0:
                    for table in tables:
                        rows = table.find_all('tr')
                        for row in rows[1:]:  # Skip header row
                            cells = row.find_all('td')
                            if len(cells) >= 2:
                                code = cells[0].get_text().strip()
                                name = cells[1].get_text().strip()
                                if code and name:
                                    organism_codes_web.append(code)
                                    organism_names_web.append(name)
        except Exception as e2:
            print(f"Fallback approach also failed: {str(e2)}")
            
            # Add some common organisms as fallback if all web approaches fail
            if len(organism_codes_web) == 0:
                print("Using predefined list of common WHONET organism codes as fallback.")
                
                common_organisms = [
                    ("aba", "Acinetobacter baumannii"),
                    ("eco", "Escherichia coli"),
                    ("kpn", "Klebsiella pneumoniae"),
                    ("pae", "Pseudomonas aeruginosa"),
                    ("sau", "Staphylococcus aureus")
                ]
                
                # Add these to our web data lists
                for code, name in common_organisms:
                    organism_codes_web.append(code.upper())
                    organism_names_web.append(name)
                    
                print(f"Added {len(common_organisms)} common organisms as fallback.")

    finally:
        try:
            if 'driver' in locals():
                driver.quit()
                print("Browser closed.")
        except:
            pass
    
    # Create DataFrame from web-scraped data
    if len(organism_codes_web) > 0:
        df_web = pd.DataFrame({
            'ORGANISM_CODE': [code.upper() for code in organism_codes_web],
            'ORGANISM_NAME': organism_names_web
        })
        
        # Add placeholder values for web data
        df_web['ORGANISM_TYPE'] = ''
        df_web['ORGANISM_TYPE_DESCRIPTION'] = 'Unknown'
        df_web['IS_COMMON'] = 'No'
        df_web['DATA_SOURCE'] = 'Web'
        
        # Display summary of web data
        print(f"\nCreated web DataFrame with {len(df_web)} organisms")
        display(df_web.head())
        
        # Now merge with the file data (if any) - but don't overwrite the original!
        if not df_file_data.empty:
            # Find web data not in file data
            file_codes = set(df_file_data['ORGANISM_CODE'].str.upper())
            new_web_data = df_web[~df_web['ORGANISM_CODE'].isin(file_codes)]
            
            if len(new_web_data) > 0:
                print(f"\nFound {len(new_web_data)} new organisms from web not in file data")
                
                # Combine with file data
                df_combined = pd.concat([df_file_data, new_web_data], ignore_index=True)
                print(f"Combined dataset now has {len(df_combined)} organisms")
                
                # Update df_clean to use this combined dataset
                df_clean = df_combined.copy()
            else:
                print("\nNo new organisms found from web - keeping only file data")
        else:
            # If no file data, just use web data
            print("\nNo file data available, using only web-scraped data")
            df_clean = df_web.copy()
    else:
        print("\nNo web data extracted. Using only file data.")
else:
    print("\nWeb scraping disabled. Using only file data.")


Web scraping disabled. Using only file data.


## Save Final Organism Data
Now we'll save the processed data to CSV and Excel files

In [5]:
# Save the processed data directly to CSV and Excel
if not df_clean.empty:
    print(f"\nPreparing to save {len(df_clean)} organisms to files...")
    
    # Define output paths
    output_csv = r'C:\Personal_Projects\Astro\project_resources\Organisms_Final.csv'
    output_excel = r'C:\Personal_Projects\Astro\project_resources\Organisms_Final.xlsx'
    
    # Add metadata
    df_clean['EXTRACTION_DATE'] = datetime.now().strftime('%Y-%m-%d')
    
    # Add data source if not already present
    if 'DATA_SOURCE' not in df_clean.columns:
        df_clean['DATA_SOURCE'] = 'WHONET Organisms.txt'
    
    # Ensure proper column order
    column_order = [
        'ORGANISM_CODE', 'ORGANISM_NAME', 'ORGANISM_TYPE', 
        'ORGANISM_TYPE_DESCRIPTION', 'IS_COMMON', 'EXTRACTION_DATE', 'DATA_SOURCE'
    ]
    
    # Keep only columns that exist in our DataFrame
    valid_cols = [col for col in column_order if col in df_clean.columns]
    
    # Create export DataFrame
    df_export = df_clean[valid_cols].copy()
    
    # Sort by organism code
    df_export = df_export.sort_values('ORGANISM_CODE').reset_index(drop=True)
    
    # Save to CSV and Excel
    df_export.to_csv(output_csv, index=False)
    df_export.to_excel(output_excel, index=False)
    
    print(f"Successfully saved {len(df_export)} organisms to:")
    print(f" - CSV: {output_csv}")
    print(f" - Excel: {output_excel}")
    
    # Display sample of final data
    print("\nSample of final dataset:")
    display(df_export.head())
    
    # Create a special extract for common organisms
    common_orgs = df_clean[df_clean['IS_COMMON'] == 'Yes'].copy() if 'IS_COMMON' in df_clean.columns else pd.DataFrame()
    
    if not common_orgs.empty:
        common_path = r'C:\Personal_Projects\Astro\project_resources\Common_Organisms.xlsx'
        common_orgs.sort_values('ORGANISM_CODE').reset_index(drop=True).to_excel(common_path, index=False)
        print(f"\nAlso saved {len(common_orgs)} common organisms to {common_path}")
else:
    print("No data to save.")


Preparing to save 2946 organisms to files...
Successfully saved 2946 organisms to:
 - CSV: C:\Personal_Projects\Astro\project_resources\Organisms_Final.csv
 - Excel: C:\Personal_Projects\Astro\project_resources\Organisms_Final.xlsx

Sample of final dataset:
Successfully saved 2946 organisms to:
 - CSV: C:\Personal_Projects\Astro\project_resources\Organisms_Final.csv
 - Excel: C:\Personal_Projects\Astro\project_resources\Organisms_Final.xlsx

Sample of final dataset:


Unnamed: 0,ORGANISM_CODE,ORGANISM_NAME,ORGANISM_TYPE,ORGANISM_TYPE_DESCRIPTION,IS_COMMON,EXTRACTION_DATE,DATA_SOURCE
0,,Nannizzia sp.,f,Fungus,No,2025-05-26,WHONET Organisms.txt
1,103.0,Escherichia coli O103,-,Gram-negative,No,2025-05-26,WHONET Organisms.txt
2,104.0,Salmonella Typhimurium DT 104,-,Gram-negative,No,2025-05-26,WHONET Organisms.txt
3,111.0,Escherichia coli O111,-,Gram-negative,No,2025-05-26,WHONET Organisms.txt
4,135.0,"Neisseria meningitidis, serogroup W135",-,Gram-negative,No,2025-05-26,WHONET Organisms.txt



Also saved 85 common organisms to C:\Personal_Projects\Astro\project_resources\Common_Organisms.xlsx


## Analyze Organism Data
Generate statistics and analysis of the organism data

In [6]:
# Analyze the organism data
if not df_clean.empty:
    print("\n=== Organism Data Analysis ===")
    
    # Organism type analysis
    if 'ORGANISM_TYPE_DESCRIPTION' in df_clean.columns:
        type_counts = df_clean['ORGANISM_TYPE_DESCRIPTION'].value_counts()
        print("\nOrganism Type Distribution:")
        for type_name, count in type_counts.items():
            if type_name:  # Skip empty values
                percentage = count / len(df_clean) * 100
                print(f"  {type_name}: {count} organisms ({percentage:.1f}%)")
    
    # Common organisms
    if 'IS_COMMON' in df_clean.columns:
        common_count = (df_clean['IS_COMMON'] == 'Yes').sum()
        print(f"\nCommon Organisms: {common_count} ({common_count/len(df_clean)*100:.1f}%)")
    
    # Code pattern analysis
    if 'ORGANISM_CODE' in df_clean.columns:
        code_lengths = df_clean['ORGANISM_CODE'].str.len()
        print("\nOrganism Code Analysis:")
        print(f"  Average code length: {code_lengths.mean():.1f} characters")
        print(f"  Most common lengths: {code_lengths.value_counts().head(3).to_dict()}")
        
        # First letter frequency
        first_letters = df_clean['ORGANISM_CODE'].str[0].value_counts()
        print("\nMost Common First Letters in Codes:")
        for letter, count in first_letters.head(5).items():
            print(f"  {letter}: {count} organisms")
    
    # Generate summary report
    summary_file = r'C:\Personal_Projects\Astro\project_resources\organism_data_summary.txt'
    with open(summary_file, 'w') as f:
        f.write(f"WHONET Organism Data Summary - {datetime.now().strftime('%Y-%m-%d')}\n")
        f.write("=" * 60 + "\n\n")
        f.write(f"Total organisms: {len(df_clean)}\n")
        
        if 'ORGANISM_TYPE_DESCRIPTION' in df_clean.columns:
            f.write("\nOrganism Type Distribution:\n")
            for type_name, count in type_counts.items():
                if type_name:  # Skip empty values
                    percentage = count / len(df_clean) * 100
                    f.write(f"  {type_name}: {count} organisms ({percentage:.1f}%)\n")
        
        if 'IS_COMMON' in df_clean.columns:
            common_count = (df_clean['IS_COMMON'] == 'Yes').sum()
            f.write(f"\nCommon Organisms: {common_count} ({common_count/len(df_clean)*100:.1f}%)\n")
            
            # List of common organisms
            f.write("\nList of Common Organisms:\n")
            common_orgs = df_clean[df_clean['IS_COMMON'] == 'Yes']
            for _, row in common_orgs.iterrows():
                f.write(f"  {row['ORGANISM_CODE']}: {row['ORGANISM_NAME']}\n")
    
    print(f"\nDetailed summary saved to {summary_file}")
else:
    print("No data to analyze.")

print("\n✓ Process completed.")


=== Organism Data Analysis ===

Organism Type Distribution:
  Gram-negative: 1114 organisms (37.8%)
  Fungus: 478 organisms (16.2%)
  Anaerobe: 432 organisms (14.7%)
  Gram-positive: 428 organisms (14.5%)
  Other: 213 organisms (7.2%)
  Unknown: 104 organisms (3.5%)
  Bacteria: 96 organisms (3.3%)
  Mycobacteria: 81 organisms (2.7%)

Common Organisms: 85 (2.9%)

Organism Code Analysis:
  Average code length: 3.0 characters
  Most common lengths: {3: 2945, 0: 1}

Most Common First Letters in Codes:
  C: 394 organisms
  S: 345 organisms
  P: 276 organisms
  A: 255 organisms
  M: 236 organisms

Detailed summary saved to C:\Personal_Projects\Astro\project_resources\organism_data_summary.txt

✓ Process completed.


## Instructions for Running This Notebook

To successfully process WHONET organism data:

1. Make sure the Organisms.txt file is available in the project_resources directory

2. Run each cell in order from top to bottom

3. The notebook will:
   - Load and parse the Organisms.txt file
   - Extract organism codes, names, and types
   - Optionally supplement with web scraping (disabled by default)
   - Save processed data as CSV and Excel files
   - Generate analysis and summary

4. The outputs will be saved as:
   - `Organisms_Final.csv` - Main CSV file with organism codes, names and types
   - `Organisms_Final.xlsx` - Excel version of the data
   - `organism_data_summary.txt` - Summary statistics
   - `Common_Organisms.xlsx` - List of common organisms

## Adjusting the Process

To enable web scraping supplementation, set `do_web_scraping = True` in the web scraping cell, but note that:

- Web scraping requires a working internet connection
- You need Chrome and appropriate ChromeDriver installed
- The process may take longer with web scraping enabled
- Web data will supplement, not replace, the file data