# Building the Core Dataset for Plant Recommendations (Table13)

Name: Zihan

## Step 1: Import required libraries and set file paths
First, we import all required Python libraries and define the directory containing the raw JSON files and the path for the final output CSV file.

In [None]:
# Import required libraries
import os
import json
import pandas as pd
from glob import glob

# Define input (raw data) and output (processed data) paths
DETAILS_DIR = "01_raw_data/01_species_details"
OUTPUT_PATH = "02_wrangled_data/Table13_GeneralPlantListforRecommendation.csv"

## Step 2: Define core data - Hardiness Zone to Temperature (°C) conversion table
This is the key to our temperature conversion. We create a dictionary (lookup table) to directly convert a plant's minimum hardiness zone to the absolute minimum temperature it can survive (in Celsius).

In [None]:
# This conversion table is based on the USDA Plant Hardiness Zone standards
HARDINESS_ZONE_TO_CELSIUS = {
    "1": -51.1, "2": -45.6, "3": -40.0, "4": -34.4, "5": -28.9,
    "6": -23.3, "7": -17.8, "8": -12.2, "9": -6.7, "10": -1.1,
    "11": 4.4,  "12": 10.0, "13": 15.6
}
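
As a sanity check on the table above, the values can be re-derived from the zone definitions. The sketch below assumes the standard USDA layout, in which zone 1 starts at -60 °F and each zone spans 10 °F. The helper name `zone_min_celsius` is illustrative and not part of the original notebook.

```python
def zone_min_celsius(zone: int) -> float:
    """Lower temperature bound of a USDA hardiness zone in °C, rounded to 0.1."""
    # Assumption: zone 1 starts at -60 °F and every zone covers a 10 °F band
    min_fahrenheit = -60 + 10 * (zone - 1)
    return round((min_fahrenheit - 32) * 5 / 9, 1)

# Rebuild the lookup table with string keys, matching the dictionary above
derived_table = {str(zone): zone_min_celsius(zone) for zone in range(1, 14)}
print(derived_table["7"])  # -17.8
```

If the derived dictionary and the hand-written one ever disagree, a typo in the hand-written table is the first thing to check.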

## Step 3: Define function to process individual JSON files
To keep the code clean, we create a function specifically responsible for processing individual plant JSON files. Its task is to read JSON content, extract the fields we need, and complete the conversion from hardiness zone to specific temperature.

In [None]:
def process_plant_details(details_json):
    """
    Extract and convert required information from individual plant JSON data.
    The sunlight list is serialized to a JSON string for storage in MySQL.
    """
    # Skip plants with ID greater than 3000 based on original logic
    plant_id = details_json.get("id")
    if not plant_id or plant_id > 3000:
        return None

    # --- Convert Hardiness Zone to absolute minimum temperature ---
    hardiness_data = details_json.get("hardiness", {})
    min_zone = hardiness_data.get("min")
    # Look up temperature from conversion table, default to None if not found
    absolute_min_temp = HARDINESS_ZONE_TO_CELSIUS.get(min_zone)

    # --- Robustly handle plant_type field ---
    plant_type_raw = details_json.get("type")
    plant_type_processed = plant_type_raw.lower() if plant_type_raw else ""

    # --- Build the record we need ---
    record = {
        "general_plant_id": plant_id,
        "plant_type": plant_type_processed,
        # Serialize the sunlight list as a JSON string for MySQL storage
        "sunlight": json.dumps(details_json.get("sunlight", []), ensure_ascii=False),
        "watering": details_json.get("watering"),
        "drought_tolerant": details_json.get("drought_tolerant", False),
        "absolute_min_temp_c": absolute_min_temp
    }
    return record
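
The lookup in the function above only succeeds when the zone arrives as a string such as `"7"`, because the conversion table uses string keys. The source does not show whether every raw JSON file stores zones that way, so the following is a minimal sketch of a more tolerant lookup, assuming zones may also appear as integers or be missing. The helper name `lookup_min_temp` is hypothetical.

```python
from typing import Optional

HARDINESS_ZONE_TO_CELSIUS = {
    "1": -51.1, "2": -45.6, "3": -40.0, "4": -34.4, "5": -28.9,
    "6": -23.3, "7": -17.8, "8": -12.2, "9": -6.7, "10": -1.1,
    "11": 4.4, "12": 10.0, "13": 15.6,
}

def lookup_min_temp(min_zone) -> Optional[float]:
    """Return the minimum temperature for a zone given as str or int, else None."""
    if min_zone is None:
        return None
    # Normalize to the string form used as dictionary key
    key = str(min_zone).strip()
    return HARDINESS_ZONE_TO_CELSIUS.get(key)

print(lookup_min_temp("7"))   # -17.8
print(lookup_min_temp(7))     # -17.8
print(lookup_min_temp(None))  # None
```

Unknown zones still map to `None`, so the behaviour for well-formed input is unchanged.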

## Step 4: Main process - Iterate through files, process data and create DataFrame
Now, we will execute the main data processing workflow. The code will iterate through all JSON files, use the function defined above to process them, and then aggregate all results into a pandas DataFrame.

In [None]:
print("Starting to process all plant JSON files...")

# Find all plant detail JSON files
detail_files = glob(os.path.join(DETAILS_DIR, "plant_species_details_*.json"))
print(f"Found {len(detail_files)} JSON files ready for processing.")

# Loop through all files and process them
all_plant_data = []
for file_path in detail_files:
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
            processed_record = process_plant_details(data)
            # Only successfully processed records will be added
            if processed_record:
                all_plant_data.append(processed_record)
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

print(f"Successfully processed {len(all_plant_data)} plant records.")

# Convert record list to pandas DataFrame for easier subsequent operations
df = pd.DataFrame(all_plant_data)
df = df.sort_values('general_plant_id')

print("DataFrame created successfully, ready for filtering.")
df.head() # Display first few rows for checking

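Starting to process all plant JSON files...
Found 1484 JSON files ready for processing.
Successfully processed 1484 plant records.
DataFrame created successfully, ready for filtering.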


| | general_plant_id | plant_type | sunlight | watering | drought_tolerant | absolute_min_temp_c |
|---|---|---|---|---|---|---|
| 0 | 1 | tree | ["full sun"] | Frequent | False | -17.8 |
| 596 | 2 | tree | ["full sun"] | Average | False | -34.4 |
| 707 | 3 | tree | ["Full sun", "part shade"] | Average | True | -40.0 |
| 818 | 4 | tree | ["full sun"] | Average | True | -34.4 |
| 929 | 5 | tree | ["full sun", "part shade", "filtered shade"] | Frequent | False | -23.3 |
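
The preview shows mixed capitalization in the sunlight labels ("Full sun" next to "full sun"). If downstream matching turns out to be case-sensitive, the labels could be normalized before serialization. This is a minimal sketch under that assumption. The helper name `normalize_sunlight` is hypothetical and not part of the original notebook.

```python
import json

def normalize_sunlight(values) -> str:
    """Lower-case and strip each sunlight label, then serialize the list as JSON."""
    # Treat a missing or None field as an empty list
    cleaned = [str(value).strip().lower() for value in (values or [])]
    return json.dumps(cleaned, ensure_ascii=False)

print(normalize_sunlight(["Full sun", "part shade"]))
# ["full sun", "part shade"]
```

It could replace the `json.dumps(...)` call when the record is built, at the cost of no longer preserving the raw spelling from the source files.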


## Step 5: Filter data, organize format and save as CSV file
In the final step, we filter the DataFrame to remove trees, select and order the output columns, enforce the column data types, and save the result as a CSV file.

In [None]:
# --- Filtering step: Remove all plants with plant_type 'tree' ---
initial_rows = len(df)
df_filtered = df[df['plant_type'] != 'tree'].copy() # Use .copy() to avoid SettingWithCopyWarning
rows_after_filter = len(df_filtered)
print(f"Before filtering: {initial_rows} rows of data.")
print(f"After removing {initial_rows - rows_after_filter} trees: {rows_after_filter} rows remaining.")


# --- Final processing and saving ---
# Define columns for final CSV output, note that 'plant_type' column is no longer needed after filtering
final_columns = [
    "general_plant_id",
    "sunlight",
    "watering",
    "drought_tolerant",
    "absolute_min_temp_c"
]
df_final = df_filtered[final_columns]

# Sort by ID to maintain data consistency
df_final = df_final.sort_values(by="general_plant_id").reset_index(drop=True)

# The sunlight column is already a JSON string (converted in process_plant_details),
# so no further conversion is needed here

# Ensure correct data types
df_final["general_plant_id"] = pd.to_numeric(df_final["general_plant_id"]).astype("Int64")
df_final["drought_tolerant"] = df_final["drought_tolerant"].astype(bool)


# Create output directory (if it doesn't exist)
os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

# Save final DataFrame as CSV file without index column
df_final.to_csv(OUTPUT_PATH, index=False)

print("\nProcessing completed!")
print(f"Table13 successfully saved to: \n{OUTPUT_PATH}")

# Display first few rows of final table
df_final.head()

Before filtering: 1484 rows of data.
After removing 476 trees: 1008 rows remaining.

Processing completed!
Table13 successfully saved to: 
02_wrangled_data/Table13_GeneralPlantListforRecommendation.csv


| | general_plant_id | sunlight | watering | drought_tolerant | absolute_min_temp_c |
|---|---|---|---|---|---|
| 0 | 398 | ["full sun", "part shade"] | Average | True | -17.8 |
| 1 | 399 | ["full sun", "part shade"] | Average | False | -23.3 |
| 2 | 400 | ["full sun", "part shade"] | Average | True | -28.9 |
| 3 | 401 | ["Full sun", "part shade"] | Average | False | -23.3 |
| 4 | 402 | ["full sun", "part shade"] | Average | True | -28.9 |


## Step 6: Create the Table13 structure in MySQL

In [None]:
import mysql.connector
from mysql.connector import Error

# Database connection configuration
# Note: do not hardcode credentials in a notebook. The password is read from an
# environment variable here; the name PLANTX_DB_PASSWORD is an assumed placeholder.
db_config = {
    'host': 'database-plantx.cqz06uycysiz.us-east-1.rds.amazonaws.com',
    'user': 'zihan',
    'password': os.environ.get('PLANTX_DB_PASSWORD', ''),
    'database': 'FIT5120_PlantX_Database',
    'allow_local_infile': True,
    'use_pure': True,
    'charset': 'utf8mb4'
}

# Try to connect to database and create Table13
try:
    connection = mysql.connector.connect(**db_config)
    if connection.is_connected():
        print("Successfully connected to MySQL server.")
        cursor = connection.cursor()

        # SQL statement: Create Table13_GeneralPlantListforRecommendation
        # Note that our final CSV does not contain plant_type column, so the table creation statement matches this
        create_table_13 = """
        CREATE TABLE IF NOT EXISTS Table13_GeneralPlantListforRecommendation (
            general_plant_id INT PRIMARY KEY,
            sunlight JSON,
            watering TEXT,
            drought_tolerant BOOLEAN,
            absolute_min_temp_c DOUBLE
        ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
        """

        cursor.execute(create_table_13)
        connection.commit()
        print("Table13 table structure created successfully.")

except Error as e:
    print(f"Error occurred while creating Table13 table structure: {e}")

finally:
    if 'connection' in locals() and connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection for creating Table13 table structure closed.")

Successfully connected to MySQL server.
Table13 table structure created successfully.
MySQL connection for creating Table13 table structure closed.


## Step 7: Import data into Table13

In [None]:
try:
    # Re-establish connection for data import
    connection = mysql.connector.connect(**db_config)
    if connection.is_connected():
        print("Successfully connected to MySQL server for Table13 data import.")
        cursor = connection.cursor()

        # Define query statement for loading data from CSV file
        # Note: LINES TERMINATED BY must match the CSV's line endings. pandas' to_csv
        # defaults to os.linesep ('\n' on Linux/macOS, '\r\n' on Windows), so the
        # '\n' used below assumes the file was written on a Unix-like system.
        load_data_query_13 = f"""
        LOAD DATA LOCAL INFILE '{OUTPUT_PATH}'
        INTO TABLE Table13_GeneralPlantListforRecommendation
        CHARACTER SET utf8mb4
        FIELDS TERMINATED BY ','
        OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        (
            general_plant_id,
            sunlight,
            watering,
            @drought_tolerant_var,
            absolute_min_temp_c
        )
        SET drought_tolerant = IF(@drought_tolerant_var = 'True', 1, 0);
        """

        cursor.execute(load_data_query_13)
        connection.commit()
        print(f"Table13 data import successful! Rows affected: {cursor.rowcount}")

except Error as e:
    print(f"Error occurred during Table13 data import: {e}")

finally:
    if 'connection' in locals() and connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection for Table13 import closed.")

Successfully connected to MySQL server for Table13 data import.
Table13 data import successful! Rows affected: 1008
MySQL connection for Table13 import closed.
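
To see why the import query combines `OPTIONALLY ENCLOSED BY '"'` with a user variable for `drought_tolerant`, it helps to look at the raw text of one CSV row. The sketch below uses the standard-library `csv` module rather than pandas, on the assumption that both apply the same minimal quoting by default. The sample values are copied from the first row of the final table.

```python
import csv
import io
import json

# One record shaped like a row of the final table
row = {
    "general_plant_id": 398,
    "sunlight": json.dumps(["full sun", "part shade"]),
    "watering": "Average",
    "drought_tolerant": True,
    "absolute_min_temp_c": -17.8,
}

# Write the record as CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
raw_csv = buffer.getvalue()
print(raw_csv)

# Read it back: the doubled quotes collapse to single quotes again,
# and the boolean comes back as the text 'True'
parsed = next(csv.DictReader(io.StringIO(raw_csv)))
print(json.loads(parsed["sunlight"]))
print(repr(parsed["drought_tolerant"]))
```

The sunlight field is wrapped in double quotes because it contains commas, and its inner quotes are doubled. The boolean is written as the text `True`, which is why the query maps it to 1 or 0 with `IF(@drought_tolerant_var = 'True', 1, 0)`.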


## Step 8: Verify imported rows and preview (Table13)

In [None]:
try:
    connection = mysql.connector.connect(**db_config)
    if connection.is_connected():
        cursor = connection.cursor()

        # Get total number of rows in the table
        cursor.execute("SELECT COUNT(*) FROM Table13_GeneralPlantListforRecommendation")
        row_count = cursor.fetchone()[0]
        print(f"Table13_GeneralPlantListforRecommendation currently contains {row_count} rows of data.")

        # Get and print first 5 rows for preview
        print("\n--- Table13 first 5 rows preview ---")
        cursor.execute("SELECT * FROM Table13_GeneralPlantListforRecommendation LIMIT 5")
        rows = cursor.fetchall()
        for row in rows:
            print(row)

except Error as e:
    print(f"Error occurred during Table13 verification: {e}")

finally:
    if 'connection' in locals() and connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection for Table13 verification closed.")

Table13_GeneralPlantListforRecommendation currently contains 1008 rows of data.

--- Table13 first 5 rows preview ---
(398, '["full sun", "part shade"]', 'Average', 1, -17.8)
(399, '["full sun", "part shade"]', 'Average', 0, -23.3)
(400, '["full sun", "part shade"]', 'Average', 1, -28.9)
(401, '["Full sun", "part shade"]', 'Average', 0, -23.3)
(402, '["full sun", "part shade"]', 'Average', 1, -28.9)
MySQL connection for Table13 verification closed.
