**Classification Report**

* **Precision (No): 0.76** 
    - Out of all the instances your model predicted as "No" (no road access), 76% of them were actually correct.
    - This indicates a moderate rate of false positives for the "No" class, meaning the model sometimes predicts no access when access actually exists.

* **Recall (No): 0.89** 
    - Out of all the instances that truly had "No" road access, your model was able to correctly identify 89% of them.
    - This suggests a relatively low rate of false negatives for the "No" class, meaning the model is good at catching instances without road access.

* **F1-score (No): 0.82** 
    - The F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance on the "No" class.
    - An F1-score of 0.82 is considered good, showing a decent balance between correctly identifying instances without access and minimizing false positives.

* **Precision (Yes): 0.89**
    - Out of all the instances your model predicted as "Yes" (having road access), 89% of them were actually correct.
    - This implies a low rate of false positives for the "Yes" class.

* **Recall (Yes): 0.77**
    - Out of all the instances that truly had road access, your model was able to correctly identify 77% of them.
    - This indicates a higher rate of false negatives for the "Yes" class than for the "No" class, meaning the model misses some properties that do have road access.

* **F1-score (Yes): 0.83**
    - The F1-score for the "Yes" class is 0.83, marginally higher than for the "No" class (0.82): the higher precision (0.89) offsets the lower recall (0.77).

* **Accuracy: 0.82**
    - Overall, your model correctly classified 82% of the properties in the dataset, regardless of their actual road access status.

* **Macro avg & Weighted avg:** 
    - These are average scores across both classes, calculated differently.
    - **Macro avg** gives equal weight to both classes, even if they have imbalanced support (different number of samples).
    - **Weighted avg** takes into account the number of samples in each class, giving more weight to the class with more samples.
    - In this case, both averages are around 0.82-0.83, suggesting a fairly balanced performance across both classes.
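
The difference between the two averages only shows up when the classes differ in size. A minimal sketch, using the two F1-scores from the report but hypothetical class sizes (the `supports` values and both helper names are assumptions made for illustration, not figures from this dataset):

```python
def macro_average(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by each class's support (number of true samples)."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


f1_scores = [0.82, 0.83]   # F1 for "No" and "Yes", taken from the report
supports = [100, 300]      # hypothetical class sizes, NOT from the report

macro = macro_average(f1_scores)
weighted = weighted_average(f1_scores, supports)
print(f"macro:    {macro:.4f}")
print(f"weighted: {weighted:.4f}")
```

With the larger class scoring higher, the weighted average is pulled toward 0.83 while the macro average stays at the midpoint. When per-class scores are this close, the two averages barely differ regardless of the class sizes.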

**Confusion Matrix**

* The confusion matrix provides a more detailed look at the model's predictions. "Yes" (has road access) is treated as the positive class.
* **True Positives (TP): 100** - The model correctly predicted 100 properties as having road access.
* **True Negatives (TN): 94** - The model correctly predicted 94 properties as having no road access.
* **False Positives (FP): 12** - The model incorrectly predicted 12 properties as having road access when they didn't (consistent with precision (Yes) = 100 / 112 ≈ 0.89).
* **False Negatives (FN): 30** - The model incorrectly predicted 30 properties as having no road access when they actually did (consistent with recall (Yes) = 100 / 130 ≈ 0.77).
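
The per-class figures in the report can be recomputed from the four counts as a cross-check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, both ordered No then Yes) with 12 true "No" properties predicted "Yes" and 30 true "Yes" properties predicted "No", which is the arrangement that reproduces the precision and recall values quoted in the report. `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix.

    cm[i][j] is the number of samples with true label i predicted as label j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        correct = cm[k][k]
        predicted_as_k = sum(row[k] for row in cm)  # column total
        truly_k = sum(cm[k])                        # row total
        precision = correct / predicted_as_k
        recall = correct / truly_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


# Assumed layout: rows = true [No, Yes], columns = predicted [No, Yes]
cm = [[94, 12],
      [30, 100]]
metrics = per_class_metrics(cm, ["No", "Yes"])
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for label, (p, r, f) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Rounded to two decimals, this gives 0.76 / 0.89 / 0.82 for "No", 0.89 / 0.77 / 0.83 for "Yes", and an accuracy of 0.82, matching the classification report.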

**Overall Interpretation:**

* Your model demonstrates a good overall performance in predicting road access, with an accuracy of 82%.
* It's particularly strong at catching properties **without** road access, as evidenced by the high recall (0.89) for the "No" class, although the lower precision (0.76) means roughly one in four of its "No" predictions is wrong.
* There is room for improvement in identifying properties **with** road access, as the recall for the "Yes" class is noticeably lower (0.77 versus 0.89 for "No").

**Potential Next Steps:**

* If the focus is on correctly identifying properties *without* road access, the model is already doing a good job. You could further fine-tune it to improve precision for the "No" class if minimizing false positives is crucial.
* If identifying properties *with* road access is more important, you could try techniques to improve the recall for the "Yes" class. This might involve adjusting the model's parameters, using a different algorithm, or gathering more data for properties with road access.
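
One parameter that can be adjusted without changing the algorithm is the decision threshold. The code in this notebook predicts "Yes" when the weighted score reaches one third of the total weight. The sketch below sweeps that cutoff over normalized scores and reports precision and recall for the "Yes" class. The scores and labels are made-up illustration data, and `yes_precision_recall` is a hypothetical helper, so treat this as a starting point rather than a tuned procedure.

```python
def yes_precision_recall(scores, actual, threshold):
    """Precision and recall for the 'Yes' class at a given score cutoff."""
    predicted = ["Yes" if s >= threshold else "No" for s in scores]
    pairs = list(zip(predicted, actual))
    tp = sum(p == "Yes" and a == "Yes" for p, a in pairs)
    fp = sum(p == "Yes" and a == "No" for p, a in pairs)
    fn = sum(p == "No" and a == "Yes" for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical normalized scores (0..1) and ground-truth labels
scores = [0.05, 0.20, 0.30, 0.35, 0.50, 0.70, 0.90, 0.15]
actual = ["No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"]

for threshold in (0.2, 1 / 3, 0.5):
    p, r = yes_precision_recall(scores, actual, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Lowering the cutoff raises recall for "Yes" at the cost of more false positives, so the cutoff should be chosen on a held-out labeled sample rather than on the data used for the final report.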


In [1]:
import pandas as pd
import numpy as np
import requests
from math import sqrt, pi
from sklearn.metrics import (
    classification_report, confusion_matrix, f1_score, precision_score, 
    recall_score, cohen_kappa_score, roc_auc_score
)
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import time

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def calculate_radius(acre, margin=1.5):
    """Calculate the radius of a circle with the same area as the property in acres."""
    return sqrt(acre * 4046.86 / pi) * margin

def check_road_access(lat, lon, radius, retries=3):
    """Check if the property has road access within the calculated radius."""
    overpass_url = "http://overpass-api.de/api/interpreter"
    overpass_query = f"""
    [out:json];
    way(around:{radius},{lat},{lon})["highway"];
    out body;
    """
    
    for attempt in range(retries):
        try:
            response = requests.get(overpass_url, params={'data': overpass_query}, timeout=15)
            response.raise_for_status()
            data = response.json()
            return bool(data.get('elements'))
        except (requests.RequestException, ValueError) as e:
            logging.warning(f"Request failed (attempt {attempt + 1}/{retries}): {e}")
            if attempt < retries - 1:
                time.sleep(1)  # Retry after a short delay
            else:
                return False

def check_road_access_weighted(lat, lon, acre, margins, weights):
    """Check road access using weighted margins."""
    score = 0
    for margin, weight in zip(margins, weights):
        radius = calculate_radius(acre, margin)
        if check_road_access(lat, lon, radius):
            score += weight
        # time.sleep(1)  # Short delay to avoid overwhelming the API
    
    # Determine the final access based on the accumulated score
    return 'Yes' if score >= sum(weights) / 3 else 'No'

def filter_properties_with_weighted_road_access(df):
    """Filter properties based on weighted road access prediction."""
    margins = np.arange(0.0, 3.1, 0.25)
    weights = np.linspace(1.0, 0.1, len(margins))
    return [check_road_access_weighted(row['LATITUDE'], row['LONGITUDE'], row['ACREAGE'], margins, weights) for _, row in df.iterrows()]

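def plot_confusion_matrix(cm, target_names, title='Confusion Matrix', cmap='Blues', normalize=True):
    """Plot the confusion matrix using seaborn, row-normalized when normalize=True."""
    cm = np.asarray(cm)
    accuracy = np.trace(cm) / np.sum(cm)  # Always computed from the raw counts
    # Row-normalize so each cell shows the fraction of that true class
    values = cm / cm.sum(axis=1, keepdims=True) if normalize else cm
    plt.figure(figsize=(8, 6))
    sns.heatmap(values, annot=True, fmt='.2f' if normalize else 'd', cmap=cmap, cbar=False, 
                xticklabels=target_names, yticklabels=target_names)
    plt.title(f'{title}\nAccuracy={accuracy:.4f}')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()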

def clean_final_sheet(df):
    """Keep the first column whose name contains 'Road access', drop the other matches, and rename it."""
    road_access_columns = [col for col in df.columns if 'Road access' in col]
    if not road_access_columns:
        return df  # No matching column, nothing to clean
    df = df.drop(columns=road_access_columns[1:])
    # Return a renamed copy instead of renaming in place on a filtered slice
    return df.rename(columns={road_access_columns[0]: 'Road access'})

def main():
    # Load the Excel file
    excel_file_path = 'Road access Testing Data.xlsx'
    df = pd.read_excel(excel_file_path)

    # Optionally use a random sample of 100 rows for testing
    # df = df.sample(n=100, random_state=42)

    # Ensure the ACREAGE column is numeric and drop rows with NaN values
    df['ACREAGE'] = pd.to_numeric(df['ACREAGE'], errors='coerce')
    df = df.dropna(subset=['ACREAGE'])

    # Get the actual and predicted road access values
    actual_access = df['Road access'].tolist()
    predicted_access = filter_properties_with_weighted_road_access(df)

    # Generate and log the classification report
    report = classification_report(actual_access, predicted_access, labels=['No', 'Yes'], target_names=['No', 'Yes'])
    logging.info(f"Classification Report:\n{report}")

    # Save the classification report to a text file
    report_file = f"{excel_file_path.split('.')[0]}_Classification_Report.txt"
    with open(report_file, "w") as file:
        file.write("Classification Report:\n")
        file.write(report)

    # Generate and log the confusion matrix
    cm = confusion_matrix(actual_access, predicted_access, labels=['No', 'Yes'])
    logging.info(f"Confusion Matrix:\n{cm}")

    # Plot the confusion matrix
    plot_confusion_matrix(cm, target_names=['No', 'Yes'])

    # Fit the LabelBinarizer on the actual labels
    lb = LabelBinarizer()
    actual_binarized = lb.fit_transform(actual_access)
    predicted_binarized = lb.transform(predicted_access)

    # Additional performance metrics
    precision = precision_score(actual_access, predicted_access, pos_label='Yes')
    recall = recall_score(actual_access, predicted_access, pos_label='Yes')
    f1 = f1_score(actual_access, predicted_access, pos_label='Yes')
    kappa = cohen_kappa_score(actual_access, predicted_access)
    auc = roc_auc_score(actual_binarized, predicted_binarized)

    logging.info(f"Precision (Yes): {precision:.4f}")
    logging.info(f"Recall (Yes): {recall:.4f}")
    logging.info(f"F1 Score (Yes): {f1:.4f}")
    logging.info(f"Cohen's Kappa: {kappa:.4f}")
    logging.info(f"AUC-ROC: {auc:.4f}")

    # Filter and save properties with road access
    df['Predicted Road Access'] = predicted_access
    cleaned_filtered_df = clean_final_sheet(df[df['Predicted Road Access'] == 'Yes'])
    output_file_path = f"{excel_file_path.split('.')[0]}_FILTERED_CLEANED.xlsx"
    cleaned_filtered_df.to_excel(output_file_path, index=False)
    logging.info(f"Filtered and cleaned properties saved to {output_file_path}")

if __name__ == "__main__":
    main()


FileNotFoundError: [Errno 2] No such file or directory: 'Road access Testing Data.xlsx'

In [None]:
import pandas as pd
import numpy as np
import requests
from math import sqrt, pi
import logging
from tqdm import tqdm
import time
import matplotlib.pyplot as plt  # Needed by plot_confusion_matrix below
import seaborn as sns

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def calculate_radius(acre, margin=1.5):
    """Calculate the radius of a circle with the same area as the property in acres."""
    return sqrt(acre * 4046.86 / pi) * margin

def check_road_access(lat, lon, radius, retries=3):
    """Check if the property has road access within the calculated radius."""
    overpass_url = "http://overpass-api.de/api/interpreter"
    overpass_query = f"""
    [out:json];
    way(around:{radius},{lat},{lon})["highway"];
    out body;
    """
    
    for attempt in range(retries):
        try:
            response = requests.get(overpass_url, params={'data': overpass_query}, timeout=15)
            response.raise_for_status()
            data = response.json()
            return bool(data.get('elements'))
        except (requests.RequestException, ValueError) as e:
            logging.warning(f"Request failed (attempt {attempt + 1}/{retries}): {e}")
            if attempt < retries - 1:
                time.sleep(1)  # Retry after a short delay
            else:
                return False

def check_road_access_weighted(lat, lon, acre, margins, weights):
    """Check road access using weighted margins."""
    score = 0
    for margin, weight in zip(margins, weights):
        radius = calculate_radius(acre, margin)
        if check_road_access(lat, lon, radius):
            score += weight
    
    # Calculate the percentage score
    percentage_score = round(score / sum(weights), 3)
    
    # Determine the final access based on the accumulated score
    road_access = 'Yes' if score >= sum(weights) / 3 else 'No'
    
    return road_access, percentage_score

def filter_properties_with_weighted_road_access(df):
    """Filter properties based on weighted road access prediction and return road access and score."""
    margins = np.arange(0.0, 3.1, 0.25)
    weights = np.linspace(1.0, 0.1, len(margins))
    
    results = [check_road_access_weighted(row['LATITUDE'], row['LONGITUDE'], row['ACREAGE'], margins, weights)
               for _, row in tqdm(df.iterrows(), total=len(df), desc="Filtering properties")]
    
    df['Road-Access'], df['Percentage'] = zip(*results)
    return df

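def plot_confusion_matrix(cm, target_names, title='Confusion Matrix', cmap='Blues', normalize=True):
    """Plot the confusion matrix using seaborn, row-normalized when normalize=True."""
    cm = np.asarray(cm)
    accuracy = np.trace(cm) / np.sum(cm)  # Always computed from the raw counts
    # Row-normalize so each cell shows the fraction of that true class
    values = cm / cm.sum(axis=1, keepdims=True) if normalize else cm
    plt.figure(figsize=(8, 6))
    sns.heatmap(values, annot=True, fmt='.2f' if normalize else 'd', cmap=cmap, cbar=False, 
                xticklabels=target_names, yticklabels=target_names)
    plt.title(f'{title}\nAccuracy={accuracy:.4f}')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()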

def clean_final_sheet(df):
    """Keep the first column whose name contains 'Road access', drop the other matches, and rename it."""
    road_access_columns = [col for col in df.columns if 'Road access' in col]
    if not road_access_columns:
        return df  # No matching column, nothing to clean
    df = df.drop(columns=road_access_columns[1:])
    # Return a renamed copy instead of renaming in place on a filtered slice
    return df.rename(columns={road_access_columns[0]: 'Road access'})

def main():
    # Load the Excel file
    excel_file_path = 'Export-Baldwin-MI-target-rocket.xlsx SCRUBBED.xlsx'
    df = pd.read_excel(excel_file_path)

    # Ensure the ACREAGE column is numeric and drop rows with NaN values
    df['ACREAGE'] = pd.to_numeric(df['ACREAGE'], errors='coerce')
    df = df.dropna(subset=['ACREAGE'])

    # Filter properties based on road access and calculate percentage score
    df = filter_properties_with_weighted_road_access(df)

    # Save the DataFrame with road access and percentage columns
    output_file_path = f"{excel_file_path.split('.')[0]}_WITH_ROAD_ACCESS.xlsx"
    df.to_excel(output_file_path, index=False)
    logging.info(f"Filtered properties saved to {output_file_path}")

if __name__ == "__main__":
    main()


Filtering properties: 100%|██████████| 83/83 [05:34<00:00,  4.03s/it]
2024-08-22 22:58:37,500 - INFO - Filtered Road-Access properties saved to Export-Baldwin-MI-target-rocket_ROAD-ACCESS_FILTERED.xlsx
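
Each property triggers one Overpass request per margin (13 with the `np.arange(0.0, 3.1, 0.25)` setting), which likely contributes to the roughly four seconds per row shown in the progress bar. Because the search circles are nested, a road found at one margin also lies inside every larger circle, so the remaining requests could be skipped. A minimal sketch of that idea, assuming the API answers consistently across calls; `has_road` is a hypothetical stand-in for `check_road_access` so that the example runs offline, and the radii and weights are illustration values:

```python
def weighted_score_short_circuit(radii, weights, has_road):
    """Normalized weighted score over nested radii with early exit.

    radii must be sorted in ascending order. has_road(radius) -> bool is a
    hypothetical callable standing in for an Overpass lookup.
    Returns (score / total weight, number of lookups performed).
    """
    score = 0.0
    lookups = 0
    for i, radius in enumerate(radii):
        lookups += 1
        if has_road(radius):
            # Every larger circle contains this one, so it also contains
            # the road: credit all remaining weights without querying.
            score += sum(weights[i:])
            break
    return score / sum(weights), lookups


# Offline demo: pretend the nearest road is 40 m from the property centroid
nearest_road_m = 40.0
radii = [0.0, 25.0, 50.0, 75.0, 100.0]
weights = [1.0, 0.8, 0.6, 0.4, 0.2]

score, lookups = weighted_score_short_circuit(
    radii, weights, lambda r: r >= nearest_road_m
)
print(f"score={score:.3f} lookups={lookups}")
```

The score is the same one the full loop would produce; only the number of lookups changes. Properties with no road inside the largest circle still need every request, so the saving depends on how many rows have a nearby road.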
