# **Project Name**    - Bird Species Observation Analysis



##### **Project Type**    - Observation Analysis
##### **Contribution**    - Individual
##### **Team Member 1 -** Aditya Anilkumar


# **1. Project Summary -**

The project has three main objectives:

* *Analyze Bird Distribution*: Examine the distribution of bird species across forest and grassland habitats to identify patterns of habitat preference. This involves studying how different species utilize these distinct ecosystems.

* *Assess Habitat Influence*: Evaluate how environmental factors, such as vegetation type, climate, and terrain, influence bird populations and their behavior. The study will assess the impact of these factors on the presence and abundance of specific bird species.

* *Evaluate Bird Diversity*: Compare the diversity of bird species in forests versus grasslands to understand how each ecosystem supports different levels of avian richness. This analysis will involve using observational data to quantify and compare species diversity metrics.
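
As a minimal sketch of how the diversity comparison in the third objective could be quantified, the snippet below computes species richness and the Shannon index per habitat. The toy DataFrame and the helper name `diversity_by_habitat` are illustrative assumptions; only the column names `Habitat` and `Scientific_Name` come from this notebook.

```python
import numpy as np
import pandas as pd

def diversity_by_habitat(data, habitat_col="Habitat", species_col="Scientific_Name"):
    """Return species richness and Shannon index (natural log) per habitat."""
    rows = {}
    for habitat, group in data.groupby(habitat_col):
        # Proportion of observations belonging to each species
        p = group[species_col].value_counts(normalize=True).to_numpy()
        rows[habitat] = {
            "richness": int(len(p)),
            "shannon": float(-(p * np.log(p)).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Hypothetical toy data, only to show the shape of the result
toy = pd.DataFrame({
    "Habitat": ["Forest"] * 4 + ["Grassland"] * 4,
    "Scientific_Name": ["A", "B", "C", "D", "A", "A", "A", "B"],
})
diversity = diversity_by_habitat(toy)
print(diversity)
```

Richness counts species only, while the Shannon index also rewards an even spread of observations across species, so the two metrics can rank habitats differently.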

# **GitHub Link -**

https://github.com/aditya18101999

# **2. Problem Statement**




The project aims to analyze the distribution and diversity of bird species in two distinct ecosystems: forests and grasslands. The objective is to understand how environmental factors such as vegetation type, climate, and terrain influence bird populations and their behavior.



#### **Define Your Business Objective?**

The study will use provided observational data of bird species from both ecosystems to identify patterns of habitat preference and assess the impact of these habitats on bird diversity. The project's findings are intended to provide valuable insights for habitat conservation, biodiversity management, and understanding how environmental changes affect avian communities. This analysis will support various business use cases, including wildlife conservation, land management, eco-tourism, and sustainable agriculture, while also providing data-driven insights for policy support and biodiversity monitoring.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***3. PROJECT SETUP & DATA LOADING***

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import glob
import os
import datetime # Import datetime module
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# --- Part 1: Load and Concatenate Data ---
print("--- Step 1: Loading and concatenating data ---")

# Specify the file paths for both Excel files
forest_file_path = "/content/drive/MyDrive/Raw Dataset/Bird_Monitoring_Data_FOREST.XLSX"
grassland_file_path = "/content/drive/MyDrive/Raw Dataset/Bird_Monitoring_Data_GRASSLAND.XLSX"

# Load data from the Forest Excel file
try:
    excel_data_forest = pd.ExcelFile(forest_file_path)
    sheet_names_forest = excel_data_forest.sheet_names
    sheets_dict_forest = {sheet: excel_data_forest.parse(sheet) for sheet in sheet_names_forest}
    df_forest = pd.concat(
        [df.assign(Sheet=sheet_name, Habitat='Forest') for sheet_name, df in sheets_dict_forest.items()],
        ignore_index=True
    )
except FileNotFoundError:
    print(f"Error: Forest data file not found at {forest_file_path}")
    df_forest = pd.DataFrame() # Create an empty DataFrame if file not found
except Exception as e:
    print(f"Error loading forest data: {e}")
    df_forest = pd.DataFrame() # Create an empty DataFrame in case of other errors


# Load data from the Grassland Excel file
try:
    excel_data_grassland = pd.ExcelFile(grassland_file_path)
    sheet_names_grassland = excel_data_grassland.sheet_names
    sheets_dict_grassland = {sheet: excel_data_grassland.parse(sheet) for sheet in sheet_names_grassland}
    df_grassland = pd.concat(
        [df.assign(Sheet=sheet_name, Habitat='Grassland') for sheet_name, df in sheets_dict_grassland.items()],
        ignore_index=True
    )
except FileNotFoundError:
    print(f"Error: Grassland data file not found at {grassland_file_path}")
    df_grassland = pd.DataFrame() # Create an empty DataFrame if file not found
except Exception as e:
    print(f"Error loading grassland data: {e}")
    df_grassland = pd.DataFrame() # Create an empty DataFrame in case of other errors


# Concatenate the dataframes
df_combined = pd.concat([df_forest, df_grassland], ignore_index=True)

if df_combined.empty:
    # Raise instead of calling exit(), which can terminate the notebook kernel
    raise ValueError("No data loaded from either file. The combined DataFrame is empty.")

# Drop the temporary 'Sheet' column
df_combined.drop(columns=['Sheet'], inplace=True)


print(f"Combined DataFrame shape: {df_combined.shape}")
print("\n--- Initial Combined DataFrame Info ---")
df_combined.info()
print("\n" + "="*50 + "\n")

### Dataset First View

In [None]:
# Dataset First Look
df_combined.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_combined.shape

### Data Cleaning and Preprocessing

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# --- Part 2: Data Cleaning and Preprocessing ---
print("--- Step 2: Data Cleaning and Preprocessing ---")

# Step 2a: Handling Null Values more strategically
# Instead of dropping all rows, let's drop columns that are mostly empty and fill others.
# First, identify columns with a high percentage of null values.
null_counts = df_combined.isnull().sum()
total_rows = len(df_combined)
null_percentage = (null_counts / total_rows) * 100
print("Percentage of null values per column:")
print(null_percentage)

In [None]:
# The previous output shows Sub_Unit_Code, Site_Name, NPSTaxonCode, TaxonCode, Previously_Obs
# have many nulls, but some might be useful, so they are handled case by case.

# Fill missing `Sub_Unit_Code`, `Site_Name`, `Sex`, and `Distance` with 'Unknown'.
# Assigning the result back avoids chained in-place fills, which newer pandas versions warn about.
for col in ['Sub_Unit_Code', 'Site_Name', 'Sex', 'Distance']:
    df_combined[col] = df_combined[col].fillna('Unknown')

# Drop rows where 'Common_Name' is missing, as this is a key field for analysis
df_combined.dropna(subset=['Common_Name'], inplace=True)

# Note: The output shows `AcceptedTSN` has some nulls. We can drop these rows as well,
# since the data won't be useful without this taxonomic information.
df_combined.dropna(subset=['AcceptedTSN'], inplace=True)


#### Dropping unnecessary columns

In [None]:
# Step 2b: Dropping unnecessary columns
# These identifier and taxonomy-code columns are redundant or not needed for this analysis.
columns_to_drop = ['AcceptedTSN', 'NPSTaxonCode', 'AOU_Code', 'TaxonCode', 'Previously_Obs']

# Check if columns exist before attempting to drop them
existing_columns_to_drop = [col for col in columns_to_drop if col in df_combined.columns]

if existing_columns_to_drop:
    df_combined.drop(columns=existing_columns_to_drop, axis=1, inplace=True)
    print(f"\nDropped unnecessary columns: {existing_columns_to_drop}")
else:
    print("\nNo specified unnecessary columns found or dropped.")

#### Duplicate Values

In [None]:
# Step 2c: Dropping duplicates
initial_rows_dup = df_combined.shape[0]
df_combined.drop_duplicates(inplace=True)
rows_after_dup_drop = df_combined.shape[0]
print(f"\n--- Dropped Duplicates ---")
print(f"Removed {initial_rows_dup - rows_after_dup_drop} duplicate rows.")
print(f"New shape after dropping duplicates: {df_combined.shape}")

#### Standardize Columns & Data Types



  * Ensure datetime columns are parsed

  * Strip spaces from categorical fields

  * Handle mixed formats (e.g., Interval_Length, Distance)

In [None]:
# Step 2d: Converting data types and adding new columns
print("\n--- Step 2d: Converting data types and adding new columns ---")
df_combined['Date'] = pd.to_datetime(df_combined['Date'])

# Convert 'Start_Time' to timedelta; values that cannot be parsed (including missing ones) become NaT.
# Casting to string first and trimming fractional seconds keeps the parsing robust.
df_combined['Start_Time_timedelta'] = pd.to_timedelta(df_combined['Start_Time'].astype(str).str.split('.').str[0], errors='coerce')

# Convert 'End_Time' to datetime.time objects, coercing errors
df_combined['End_Time'] = pd.to_datetime(df_combined['End_Time'].astype(str), format='%H:%M:%S', errors='coerce').dt.time

# Add new columns for easier analysis and visualization
df_combined['Observation_Month'] = df_combined['Date'].dt.month
df_combined['Observation_Day'] = df_combined['Date'].dt.day

# Extract hour from the timedelta (NaT yields NaN)
df_combined['Observation_Hour'] = df_combined['Start_Time_timedelta'].dt.components.hours

# Combine Date and Start_Time by adding the timedelta to the Date column;
# rows with a missing start time get NaT
df_combined['Full_Observation_DateTime'] = df_combined['Date'] + df_combined['Start_Time_timedelta']

# Drop the temporary timedelta column
df_combined.drop(columns=['Start_Time_timedelta'], inplace=True)

print("\n--- Final DataFrame Info after cleaning ---")
df_combined.info()
print("\n" + "="*50 + "\n")

print("\n--- Final DataFrame Head ---")
display(df_combined.head())
print(f"Final DataFrame shape: {df_combined.shape}")

### What did you know about your dataset?

Summary of the Dataset:

* The dataset consists of bird species observations recorded at forest and grassland sites across multiple administrative units.

* Data has been collected across multiple years, locations, and environmental conditions, and it includes detailed bird behavioral information.

* The data was originally split across multiple Excel sheets, each representing a unique administrative unit (e.g., ANTI, CATO, CHOH, etc.), and has now been consolidated into a single DataFrame with 15,344 clean records and 31 columns.

Key insights from the initial analysis:

* Total Records After Cleaning: 15,344

* Original Records Before Cleaning: 17,077

* Duplicates Removed: 1,700

* Unnecessary/Incomplete Columns Dropped: 5

* The dataset now holds enough quality and structure for robust exploratory, temporal, spatial, and conservation-oriented analyses.
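
The figures above can be reconciled with simple arithmetic. The sketch below uses only the counts reported in this summary; attributing the remainder to rows dropped for a missing `Common_Name` or `AcceptedTSN` is an assumption, not something the notebook logs.

```python
# Counts as reported in the summary above
original_rows = 17_077
final_rows = 15_344
duplicates_removed = 1_700

total_removed = original_rows - final_rows          # all rows removed during cleaning
other_removed = total_removed - duplicates_removed  # removed for reasons other than duplication
share_removed = total_removed / original_rows * 100

print(f"Rows removed in total: {total_removed}")
print(f"Rows removed besides duplicates (assumed: missing key fields): {other_removed}")
print(f"Share of original data removed: {share_removed:.1f}%")
```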

### Variables Description

| **Column Name**                 | **Description** |
|--------------------------------|-----------------|
| `Admin_Unit_Code`              | Administrative unit code (e.g., ANTI, CATO) |
| `Sub_Unit_Code`                | Sub-region within the administrative unit |
| `Site_Name`                    | Specific name of the observation site |
| `Plot_Name`                    | Identifier of the observation plot |
| `Location_Type`                | Habitat type: Forest or Grassland |
| `Year`, `Date`                 | Year and exact date of observation |
| `Start_Time`, `End_Time`       | Time window of bird survey |
| `Observer`                     | Name of the person conducting the observation |
| `Visit`                        | Number of site visits |
| `Interval_Length`             | Time interval of observation |
| `ID_Method`                   | How birds were identified (e.g., Singing, Calling) |
| `Distance`                    | Distance of the bird from the observer |
| `Flyover_Observed`            | Whether the bird was flying overhead |
| `Sex`                         | Sex of the bird (Male, Female, Undetermined) |
| `Common_Name`, `Scientific_Name` | Bird species identifiers |
| `PIF_Watchlist_Status`        | Whether species is of conservation concern |
| `Regional_Stewardship_Status` | Region-specific conservation importance |
| `Temperature`, `Humidity`     | Environmental conditions during observation |
| `Sky`, `Wind`, `Disturbance` | Other weather/environmental conditions |
| `Initial_Three_Min_Cnt`       | Count in the first 3 minutes of observation |
| `Habitat`                     | Habitat label (`Forest` or `Grassland`) assigned from the source file during loading |
| `Observation_Month`           | Extracted month from the observation date |
| `Observation_Day`             | Extracted day from the observation date |
| `Observation_Hour`            | Hour extracted from observation start time |
| `Full_Observation_DateTime`   | Combined timestamp of date and time |

### What all manipulations have you done and insights you found?

  Data Manipulations Performed

#### 1. **Multi-Sheet Data Consolidation**
- **What:** Loaded every sheet of both Excel workbooks with `pd.ExcelFile(...).parse(sheet)` and concatenated them into a single DataFrame.
- **Why:** Each sheet represented a unique administrative unit. Combining them allows unified, scalable analysis across all parks.
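
The consolidation step can be sketched as a small reusable helper. The function name `combine_sheets` and the in-memory sample sheets are hypothetical; the mapping it expects has the same shape as the dictionary returned by `pd.read_excel(path, sheet_name=None)`.

```python
import pandas as pd

def combine_sheets(sheets, habitat):
    """Stack a {sheet_name: DataFrame} mapping and tag every row with its habitat.

    `sheets` has the shape returned by pd.read_excel(path, sheet_name=None).
    """
    frames = [frame.assign(Habitat=habitat) for frame in sheets.values()]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Hypothetical in-memory stand-ins for two sheets of one workbook
forest_sheets = {
    "ANTI": pd.DataFrame({"Admin_Unit_Code": ["ANTI"], "Common_Name": ["Wood Thrush"]}),
    "CATO": pd.DataFrame({"Admin_Unit_Code": ["CATO", "CATO"],
                          "Common_Name": ["Ovenbird", "Wood Thrush"]}),
}
combined = combine_sheets(forest_sheets, habitat="Forest")
print(combined)
```

Calling one helper per workbook would avoid repeating the same loading logic for the forest and grassland files.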

#### 2. **Tracking Source Origin**
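- **What:** Tagged every row with a `Habitat` column (`Forest` or `Grassland`) during consolidation, alongside a temporary `Sheet` column that was dropped afterwards.
- **Why:** The `Habitat` tag records which source file each observation came from, while the existing `Admin_Unit_Code` column traces it back to its administrative unit and enables region-level analysis.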

#### 3. **Missing Value Handling**
- **What:** Filled missing values in `Sub_Unit_Code`, `Site_Name`, `Sex`, and `Distance` with `'Unknown'`, and dropped rows missing `Common_Name` or `AcceptedTSN`.
- **Why:** To ensure consistency for categorical analysis and avoid NaN-related issues during grouping or visualization.
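
A minimal sketch of this pattern on a made-up sample follows. The values (`near`, `far`, the species names) are placeholders rather than categories taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in a key field and two categorical columns
sample = pd.DataFrame({
    "Common_Name": ["Ovenbird", None, "Field Sparrow", "Wood Thrush"],
    "Distance": ["near", np.nan, "far", np.nan],
    "Sex": ["Male", np.nan, np.nan, "Female"],
})

# Share of missing values per column, in percent
null_pct = sample.isna().mean() * 100

# Drop rows lacking the key field, then label the remaining gaps explicitly
cleaned = sample.dropna(subset=["Common_Name"]).copy()
fill_cols = ["Distance", "Sex"]
cleaned[fill_cols] = cleaned[fill_cols].fillna("Unknown")

print(null_pct.round(1))
print(cleaned)
```

Keeping an explicit `Unknown` category preserves the row for counts and group-bys, at the cost of turning a numeric-looking column such as `Distance` into a purely categorical one.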

#### 4. **Dropped Sparse Columns**
- **What:** Removed columns judged redundant or too sparsely populated for this analysis:
  - `AcceptedTSN`
  - `NPSTaxonCode`
  - `AOU_Code`
  - `TaxonCode`
  - `Previously_Obs`
- **Why:** These columns had limited completeness and analytical relevance.

#### 5. **Duplicate Removal**
- **What:** Dropped **1,700 duplicate records**.
- **Why:** Prevents double-counting of bird sightings and improves the reliability of statistical summaries.
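
The sketch below shows how duplicates can be counted before they are dropped, using a hypothetical sample. Whether fully identical rows are true repeats or separate individuals recorded identically is a judgment the data alone may not settle, so it is worth inspecting the flagged rows first.

```python
import pandas as pd

# Hypothetical sample: the second row repeats the first exactly
sample = pd.DataFrame({
    "Plot_Name": ["P1", "P1", "P2", "P1"],
    "Common_Name": ["Ovenbird", "Ovenbird", "Ovenbird", "Ovenbird"],
    "Distance": ["near", "near", "near", "far"],
})

# duplicated() flags every repeat after the first occurrence of an identical row
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows: {n_duplicates}; rows kept: {len(deduplicated)}")
```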

#### 6. **Datetime Processing**
- **What:** Converted the `Date` column to `datetime` format.
- **Why:** Enables robust time-series analysis and feature extraction for temporal patterns.

#### 7. **Feature Engineering (Temporal)**
- **What:** Extracted:
  - `Observation_Month`
  - `Observation_Day`
  - `Observation_Hour`
- **Why:** Supports seasonal, daily, and hourly trend analysis of bird activity.

#### 8. **Created Unified Observation Timestamp**
- **What:** Created a new `Full_Observation_DateTime` column by combining `Date` and `Start_Time`.
- **Why:** Allows for high-resolution time analysis and chronological sorting.
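
The datetime steps above (6 to 8) can be sketched on a small hypothetical sample. The dates and times are invented; the column names follow the notebook.

```python
import pandas as pd

# Hypothetical sample; one start time is missing
sample = pd.DataFrame({
    "Date": ["2018-05-22", "2018-06-03", "2018-06-03"],
    "Start_Time": ["06:19:00", "07:45:00", None],
})

dates = pd.to_datetime(sample["Date"])
# Unparseable or missing times become NaT instead of raising
start_offset = pd.to_timedelta(sample["Start_Time"], errors="coerce")

sample["Observation_Month"] = dates.dt.month
sample["Observation_Hour"] = start_offset.dt.components["hours"]
sample["Full_Observation_DateTime"] = dates + start_offset
print(sample)
```

Rows without a start time keep their date-based features but get a missing hour and a missing combined timestamp, so hourly charts silently exclude them.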

---

  Initial Insights Gained from Cleaning

1. **Data Volume and Integrity**
   - Final cleaned dataset has **15,344 high-quality rows** across **31 columns**.
   - Around **10% of original data** was either duplicated or too sparse for analysis.

2. **Data Completeness**
   - Core fields like `Common_Name`, `Scientific_Name`, `Date`, `Location_Type`, and `Observer` are nearly 100% complete.
   - Non-critical metadata like taxon codes and previous sightings were mostly missing and safely removed.

3. **Seasonality & Time Features**
   - With `Observation_Month` and `Observation_Hour` extracted, the dataset is now suitable for:
     - Seasonal migration trend analysis
     - Time-of-day species behavior insights

4. **Habitat and Administrative Coverage**
   - Observations span multiple units and habitat types (Forest and Grassland), making this a rich dataset for comparing ecological patterns.

---

#### Saving the Cleaned and Preprocessed DataFrame to CSV

In [None]:
# Use a single path variable so the saved file and the file loaded for EDA always match
cleaned_csv_path = "cleaned_bird_data.csv"

print("\n--- Saving the Cleaned and Preprocessed DataFrame ---")
df_combined.to_csv(cleaned_csv_path, index=False)
print(f"Cleaned and preprocessed data successfully saved to {cleaned_csv_path}")

#### Loading Cleaned Data for EDA

In [None]:
df = pd.read_csv(cleaned_csv_path)

In [None]:
df.describe()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### 4.1 Temporal Analysis

#### 4.1A. Seasonal Trends: Year-wise and Month-wise Observations

Bird Observations by Year

In [None]:
# Chart - 1 Bird Observations by Year
import seaborn as sns
import matplotlib.pyplot as plt

# Ensure 'Year' is numeric
df['Year'] = df['Year'].astype(int)

# Yearly count
yearly_counts = df['Year'].value_counts().sort_index()

# Plot
plt.figure(figsize=(8, 4))
sns.lineplot(x=yearly_counts.index, y=yearly_counts.values, marker='o')
plt.title("Bird Observations by Year")
plt.xlabel("Year")
plt.ylabel("Number of Observations")
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line plot is ideal for visualizing how bird observations vary across years, revealing long-term trends and potential environmental impacts over time.

##### 2. What is/are the insight(s) found from the chart?

The number of bird observations fluctuates across years.

For example, certain years show significant spikes or drops—this may be due to:

* Seasonal migrations

* Changes in survey efforts

* Environmental disruptions or climate change

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:

* Helps biodiversity monitoring by detecting long-term population trends.

* Enables policy makers to correlate dips with habitat loss, climate issues, or human interference.

* Conservation efforts can be timed better during years when populations appear more vulnerable.

No direct negative impact, but inconsistent data collection years (if any) may reduce comparability and should be normalized for statistical modeling.
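
One way to make years comparable is to divide observation counts by survey effort. The sketch below assumes that a "survey" is a distinct `Plot_Name` and `Date` pair, which is an illustrative definition rather than one stated in the dataset; the sample rows are invented.

```python
import pandas as pd

# Hypothetical sample: each row is one bird observation
sample = pd.DataFrame({
    "Year": [2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "Plot_Name": ["P1", "P1", "P2", "P1", "P1", "P1", "P1"],
    "Date": ["2017-05-20", "2017-05-20", "2017-05-21",
             "2018-05-22", "2018-05-22", "2018-05-22", "2018-05-22"],
})

observations = sample.groupby("Year").size()
# Assumed effort measure: number of distinct plot visits (plot + date) per year
surveys = sample.drop_duplicates(["Year", "Plot_Name", "Date"]).groupby("Year").size()
obs_per_survey = (observations / surveys).rename("obs_per_survey")
print(obs_per_survey)
```

Plotting `obs_per_survey` instead of raw counts separates changes in bird numbers from changes in how often plots were visited.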

#### 4.1B. Monthly Observation Trends (Seasonal Analysis)

Bird Observation by Month

In [None]:
# Chart - 2 Bird Observation by Month
import calendar

# Monthly count, reindexed so all 12 months appear (months without data show 0)
month_counts = df['Observation_Month'].value_counts().reindex(range(1, 13), fill_value=0)

# Use month abbreviations as categories so each bar carries its own label;
# numeric ticks 1-12 on a categorical axis (positions start at 0) would mislabel the bars
month_labels = [calendar.month_abbr[m] for m in month_counts.index]

# Plot
plt.figure(figsize=(8, 4))
sns.barplot(x=month_labels, y=month_counts.values, palette="viridis")
plt.title("Bird Observations by Month (Seasonality)")
plt.xlabel("Month")
plt.ylabel("Observation Count")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal to identify which months record the most bird sightings and detect seasonal activity patterns in bird populations.

##### 2. What is/are the insight(s) found from the chart?

All bird observations are currently concentrated in January, February, and March.

* No data is available for the rest of the year, indicating either:
  * Partial-year data collection, or
  * A need to improve temporal data completeness.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Helps resource planning by focusing analysis on early-year data.

* Indicates best months for initiating early-year eco-tourism or surveys.

Negative:

* Severe temporal bias reduces the effectiveness of year-round biodiversity monitoring.

* Conservation planning cannot rely on this alone — you may miss migration patterns or nesting seasons occurring in later months.

#### 4.1C. Observation Time: Hourly Trends

Observation Trend (Hourly)

In [None]:
# Chart - 3 Hourly distribution

hour_counts = df['Observation_Hour'].value_counts().sort_index()

# Plot
plt.figure(figsize=(8, 4))
sns.lineplot(x=hour_counts.index, y=hour_counts.values, marker='o', color='orange')
plt.title("Bird Observations by Hour of Day")
plt.xlabel("Hour (24h format)")
plt.ylabel("Number of Observations")
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Bird activity varies throughout the day. A line plot captures hourly changes in observation frequency, helping pinpoint optimal survey windows.

##### 2. What is/are the insight(s) found from the chart?

Most bird observations occur between 6 AM and 9 AM, suggesting that:

* Birds are most active during early morning hours.

* Observers are also more likely to conduct surveys at that time.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:

* Helps guide observer scheduling and resource allocation for bird surveys.

* Supports land managers in understanding daily bird behavior patterns.

* Tourism operators can create morning tour packages for better sightings.

No negative impact, but it's important to standardize survey times for comparative research across regions and years.

### 4.2 Spatial Analysis

#### 4.2A. Habitat-Based Species Richness

Unique Species Count by Habitat Type

In [None]:
# Chart - 4 Count of unique species grouped by Location Type (Forest vs Grassland)

species_by_habitat = df.groupby('Location_Type')['Scientific_Name'].nunique().sort_values(ascending=False)

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(x=species_by_habitat.index, y=species_by_habitat.values, palette="Set2")
plt.title("Unique Bird Species Observed by Habitat Type")
plt.xlabel("Habitat Type")
plt.ylabel("Number of Unique Species")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is perfect for comparing the number of unique species across categorical variables like habitat types. This helps highlight which ecosystem supports greater avian diversity.

##### 2. What is/are the insight(s) found from the chart?

* Forest appears to host more unique species than Grassland.

* This highlights that forests may act as biodiversity hotspots, while grasslands might require restoration or protective measures.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Supports land-use planning — forests should be prioritized for protection, buffer zones, and controlled human access.

* Grasslands, if underperforming in diversity, may benefit from habitat restoration programs.

* Helps eco-tourism planners position high-species areas for bird-watching trails.

Potential Negative:

* Low diversity in grasslands could imply negative environmental pressure (e.g., overgrazing, farming).

* Ignoring such insight may lead to declining populations of grassland-dependent species.
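
Raw unique-species counts depend on how many observations each habitat contributed, so the comparison can be skewed when sample sizes differ. The sketch below illustrates individual-based rarefaction, a technique not used in this notebook: every habitat is repeatedly subsampled to the size of the smallest one before species are counted. The helper `rarefied_richness` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def rarefied_richness(data, group_col, species_col, n_draws=200, seed=0):
    """Mean species count per group when each group is subsampled to the smallest group's size."""
    rng = np.random.default_rng(seed)
    sample_size = int(data.groupby(group_col).size().min())
    result = {}
    for name, group in data.groupby(group_col):
        species = group[species_col].to_numpy()
        counts = [
            len(set(rng.choice(species, size=sample_size, replace=False)))
            for _ in range(n_draws)
        ]
        result[name] = float(np.mean(counts))
    return pd.Series(result, name="rarefied_richness")

# Hypothetical toy data: Forest has twice as many observations as Grassland
toy = pd.DataFrame({
    "Location_Type": ["Forest"] * 8 + ["Grassland"] * 4,
    "Scientific_Name": ["A", "B", "C", "D", "E", "A", "B", "C", "A", "B", "C", "C"],
})
rarefied = rarefied_richness(toy, "Location_Type", "Scientific_Name")
print(rarefied)
```

If a habitat still shows higher richness after rarefaction, the difference is less likely to be an artifact of unequal observation counts.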



#### 4.2B. Plot-Level Hotspots

Top 10 Plots with Highest Species Richness

In [None]:
# Chart - 5 Unique species per Plot

species_by_plot = df.groupby('Plot_Name')['Scientific_Name'].nunique().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=species_by_plot.values, y=species_by_plot.index, palette="crest")
plt.title("Top 10 Plots with Highest Bird Species Richness")
plt.xlabel("Number of Unique Species")
plt.ylabel("Plot Name")
plt.show()

##### 1. Why did you pick the specific chart?

This horizontal bar chart allows easy identification of individual plots that attract the most bird species. Plot names can be long, so displaying them vertically would be less readable.

##### 2. What is/are the insight(s) found from the chart?

* Certain plots stand out as bird species hotspots.

* These plots may have better vegetation, water sources, or microclimates that support diverse avifauna.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Enables targeted conservation: these plots can be monitored, protected, and studied further.

* Promotes data-driven tourism: high-diversity plots can be designated as visitor zones.

* Helps in prioritizing budget allocation to high-value ecological areas.

Negative Risk:

* If these plots are not adequately protected, they may suffer from over-visitation, leading to habitat degradation. This risk needs policy-based access regulation.

### 4.3 Species Analysis

#### 4.3A. Species Diversity by Habitat

Unique Species Count per Habitat Type

In [None]:
# Chart - 6 Count unique species by habitat
species_diversity = df.groupby('Location_Type')['Scientific_Name'].nunique().sort_values(ascending=False)

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(x=species_diversity.index, y=species_diversity.values, palette="coolwarm")
plt.title("Bird Species Diversity by Habitat Type")
plt.xlabel("Habitat")
plt.ylabel("Unique Species Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal to compare biodiversity between two habitat types (e.g., Forest vs Grassland), allowing stakeholders to see where species richness is concentrated.

##### 2. What is/are the insight(s) found from the chart?

* Forests appear to host significantly more diverse bird species than grasslands.

* This suggests higher ecological complexity in forest ecosystems.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Reinforces the importance of forest ecosystems in maintaining bird diversity.

* Helps in prioritizing habitat protection policies and eco-tourism development in forests.

Negative Signal:

* Low diversity in grasslands may require attention — could indicate human pressure or habitat degradation needing policy intervention.

#### 4.3B. Activity Patterns – Observation Method

Top Bird Identification Methods

In [None]:
# Chart - 7 Observation method frequency
method_counts = df['ID_Method'].value_counts().head(5)

# Plot
plt.figure(figsize=(8, 4))
sns.barplot(x=method_counts.index, y=method_counts.values, palette="pastel")
plt.title("Top Bird Identification Methods")
plt.xlabel("Method")
plt.ylabel("Observation Count")
plt.show()

##### 1. Why did you pick the specific chart?

To understand how birds are most frequently identified (e.g., Singing, Calling, Visualization). A bar chart quickly highlights dominant observation techniques.

##### 2. What is/are the insight(s) found from the chart?

* The majority of birds were identified via Singing, followed by Calling and Visual methods.

* This shows acoustic identification is the primary detection method.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Helps conservationists and analysts focus resources on audio-based surveys.

* Tools like automated sound recording can be prioritized in future research.

* Indicates the need for trained personnel who can identify birds by sound.

Risk:

* May undercount non-vocal species or those active in silent periods — suggests complementing with other techniques like motion-triggered cameras.
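
To check whether the reliance on acoustic identification differs between habitats, the method counts can be converted to within-habitat shares. This is a sketch on invented rows; the category labels follow the examples named above.

```python
import pandas as pd

# Hypothetical sample of identification methods per habitat
sample = pd.DataFrame({
    "Location_Type": ["Forest", "Forest", "Forest", "Forest", "Grassland", "Grassland"],
    "ID_Method": ["Singing", "Singing", "Calling", "Visualization", "Singing", "Visualization"],
})

# normalize="index" turns each row (habitat) into shares that sum to 1
method_share = pd.crosstab(sample["Location_Type"], sample["ID_Method"], normalize="index")
print(method_share.round(2))
```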

####  4.3C. Sex Ratio of Observed Birds

Sex Distribution of Observed Birds

In [None]:
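# Chart - 8 Plot sex ratio
plt.figure(figsize=(6, 4))
# Order bars by frequency so every recorded category (including 'Undetermined') is shown
sex_order = df['Sex'].value_counts().index
sns.countplot(x='Sex', data=df, order=sex_order, palette="Set1")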
plt.title("Sex Distribution of Observed Birds")
plt.xlabel("Sex")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Understanding the sex distribution helps evaluate potential biases in observation and breeding population trends. A count plot clearly shows categorical frequency.

##### 2. What is/are the insight(s) found from the chart?

* Male birds are observed significantly more often than females.

* A large portion is labeled as Undetermined — likely due to distant sightings or similar appearance between sexes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

* Highlights the need for more balanced data collection, especially during breeding seasons when sex-based roles vary.

* Can guide future training for field observers on identifying sex characteristics.

Potential Issues:

* Sex bias can affect estimates of population health and mislead conservation planning if not corrected.
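
A sex ratio is only meaningful among birds whose sex was actually recorded. The sketch below, on a hypothetical sample, reports the male share among sexed birds separately from the share of records with no usable sex label (`Undetermined` or the `Unknown` fill value).

```python
import pandas as pd

# Hypothetical sample of recorded sex labels
sex = pd.Series(["Male", "Male", "Male", "Female", "Undetermined", "Unknown"], name="Sex")

counts = sex.value_counts()
# Keep only birds with a determined sex; a missing category counts as 0
known = counts.reindex(["Male", "Female"], fill_value=0)

male_share_known = known["Male"] / known.sum()
undetermined_share = 1 - known.sum() / counts.sum()

print(f"Male share among sexed birds: {male_share_known:.2f}")
print(f"Share without a determined sex: {undetermined_share:.2f}")
```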



### 4.4 Environmental Conditions

Combined Plot: How Environmental Factors Influence Bird Observations

In [None]:
# Chart - 9 Combined Plot of Environmental Factors Influence Bird Observations


# Create a new DataFrame to hold summarized counts for plotting
env_summary = pd.DataFrame()

# Temperature bins
df['temp_bin'] = pd.cut(df['Temperature'], bins=[-10, 10, 15, 20, 25, 30, 40])
temp_summary = df['temp_bin'].value_counts().sort_index()
temp_df = pd.DataFrame({'Condition': temp_summary.index.astype(str), 'Count': temp_summary.values, 'Factor': 'Temperature'})

# Humidity bins
df['humidity_bin'] = pd.cut(df['Humidity'], bins=[0, 30, 50, 70, 90, 100])
humidity_summary = df['humidity_bin'].value_counts().sort_index()
humidity_df = pd.DataFrame({'Condition': humidity_summary.index.astype(str), 'Count': humidity_summary.values, 'Factor': 'Humidity'})

# Sky condition
sky_summary = df['Sky'].value_counts()
sky_df = pd.DataFrame({'Condition': sky_summary.index, 'Count': sky_summary.values, 'Factor': 'Sky'})

# Wind condition
wind_summary = df['Wind'].value_counts()
wind_df = pd.DataFrame({'Condition': wind_summary.index, 'Count': wind_summary.values, 'Factor': 'Wind'})

# Disturbance
disturb_summary = df['Disturbance'].value_counts()
disturb_df = pd.DataFrame({'Condition': disturb_summary.index, 'Count': disturb_summary.values, 'Factor': 'Disturbance'})

# Combine all
env_summary = pd.concat([temp_df, humidity_df, sky_df, wind_df, disturb_df], ignore_index=True)

# Plot all in one chart
plt.figure(figsize=(14, 8))
sns.barplot(data=env_summary, x='Factor', y='Count', hue='Condition')
plt.title("Bird Observations vs Environmental Conditions (Unified View)")
plt.xlabel("Environmental Factor")
plt.ylabel("Number of Observations")
plt.legend(title='Condition', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This grouped bar chart brings multiple environmental factors into a single visualization. It allows direct comparison of how observation counts are distributed across the condition categories within each factor.

##### 2. What is/are the insight(s) found from the chart?

* Temperature: Bird activity is highest in moderate temperature ranges (~15–25°C).

* Humidity: Most sightings occur between 50%–70% humidity.

* Sky: "Partly Cloudy" and "Clear" dominate sightings — possibly due to better visibility.

* Wind: Calm wind or light breeze results in more observations; strong wind reduces bird activity or visibility.

* Disturbance: "No effect on count" leads, but observations decline in disturbed settings.
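
When reading the temperature and humidity ranges above, it helps to know how `pd.cut` assigns values to bins. The sketch below uses the same temperature edges as the chart code with invented readings: intervals are closed on the right by default, and values outside the outer edges become missing and drop out of the counts.

```python
import pandas as pd

# Hypothetical readings chosen to sit on and around the bin edges
temperature = pd.Series([9.5, 10.0, 15.0, 15.1, 24.9, 41.0], name="Temperature")
bins = [-10, 10, 15, 20, 25, 30, 40]  # same edges as the temperature bins in the chart code

temp_bin = pd.cut(temperature, bins=bins)
print(pd.DataFrame({"Temperature": temperature, "temp_bin": temp_bin}))
print(temp_bin.value_counts().sort_index())
```

These counts also reflect how often each condition occurred during surveys, so a tall bar may indicate common weather rather than a preference by the birds.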

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

| **Environmental Factor** | **Positive Business Impact** | **Negative Signal / Risk** |
|---------------------------|-------------------------------|-----------------------------|
| **Temperature** | Moderate temperatures (~20–25°C) show peak bird activity. Field surveys and eco-tourism can be scheduled during these climate windows for maximum success. | Extreme temperatures reduce sightings, which may affect monitoring accuracy or signal climate-related stress. |
| **Humidity** | Bird activity is highest between 50–70% humidity. This helps plan fieldwork in seasons with optimal atmospheric moisture. | Very high or low humidity may interfere with visibility, sound, and observer comfort, reducing data quality. |
| **Sky Condition** | "Clear" and "Partly Cloudy" skies yield more sightings — excellent for planning bird-watching tours and visual surveys. | Foggy, overcast, or rainy skies limit visibility, potentially causing underreporting. |
| **Wind** | Calm or lightly breezy days are optimal for accurate observations and bird movement. Helps guide scheduling of survey teams. | Strong wind decreases bird flight and visibility, leading to lower counts or skewed data. |
| **Disturbance** | Areas with "No effect on count" show the most observations — supports maintaining **quiet zones** in high-activity habitats. | Human or environmental disturbances (slight to moderate effects) negatively impact sightings, suggesting the need for disturbance-free buffer zones. |

### 4.5 Distance and Behavior

#### 4.5A. Distance Analysis – How Far Are Birds Observed?

Observation Count by Distance Category

In [None]:
# Chart - 10 Plot: Bird observations grouped by distance from observer

plt.figure(figsize=(10, 5))
# value_counts() already sorts by count in descending order
distance_order = df['Distance'].value_counts().index

sns.countplot(data=df, y='Distance', order=distance_order, palette='viridis')
plt.title("Bird Observations by Distance from Observer")
plt.xlabel("Number of Observations")
plt.ylabel("Distance Category")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart works well here because distance categories are often long strings (e.g., “≤ 50 meters”) and are better viewed on the Y-axis. This visual helps identify how close or far birds are usually observed.

##### 2. What is/are the insight(s) found from the chart?

* The majority of birds are observed at short distances (e.g., ≤50 meters).

* Farther distance bands (e.g., >100 meters) and "Unknown" entries have significantly fewer records.

* This could reflect both bird behavior and detection limitations of observers.
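
To put numbers on "majority" and "significantly fewer", the counts can be converted to percentage shares. This is a minimal sketch on made-up data; the category labels and the choice to treat missing values as "Unknown" are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data: category labels are illustrative, not the dataset's actual values.
toy = pd.DataFrame({
    'Distance': ['<= 50 Meters'] * 6 + ['50 - 100 Meters'] * 3 + [None],
})

distance_share = (
    toy['Distance']
    .fillna('Unknown')             # keep missing distances visible as a category
    .value_counts(normalize=True)  # fraction of all observations
    .mul(100)
    .round(1)
)
print(distance_share)
```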

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from analyzing observation distances have direct practical value. Most birds are observed at close range, particularly within 50 meters, which suggests that birds are **actively using areas near human-accessible zones** like trails, forest edges, or observation stations. Part of this pattern likely reflects detectability, since nearby birds are easier to see and hear. Even so, it is a positive signal for eco-tourism development, since it suggests that birdwatchers and conservation staff can reliably observe bird activity without disturbing sensitive habitat interiors.

From a conservation standpoint, this finding also highlights **key areas for habitat preservation** and **buffer zone planning**, ensuring that human presence remains compatible with avian behavior. However, the presence of lower counts at greater distances — and some "Unknown" entries — may point to either **detection limitations** or gaps in survey standardization. Addressing this by improving survey training and methods can further enhance the quality of ecological monitoring.

#### 4.5B. Flyover Frequency – Are Birds Just Passing or Engaging?

Flyover vs Non-Flyover Observations

In [None]:
# Chart - 11 Plot flyover behavior

plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Flyover_Observed', palette='pastel')
plt.title("Bird Flyover Observations")
plt.xlabel("Flyover Observed")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A simple count plot helps quickly distinguish between birds that were flying overhead and those observed perching, feeding, or interacting with their environment — crucial for habitat engagement insights.

##### 2. What is/are the insight(s) found from the chart?

* Most observations are not flyovers, meaning birds are actively engaging with the environment (feeding, nesting, etc.).

* Flyovers represent a smaller portion, possibly migratory individuals or low-interaction events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The behavior analysis shows that most birds are not simply flying overhead but are being recorded while **actively engaging with the habitat** — feeding, nesting, singing, or perching. This is an encouraging insight for biodiversity management, as it indicates that the landscape is not merely a passageway but serves as an **ecological destination** for birds.

This directly supports decisions related to **habitat conservation**, **species monitoring**, and **land use management**. It suggests that the environment offers adequate food, shelter, or nesting conditions to support bird activity at ground or canopy level.

On the other hand, species observed **only in flyover mode** may be using the area as a **migration corridor**, but not finding enough habitat features to settle. This highlights potential zones for **habitat enrichment or restoration**, especially in regions critical to migratory species. Overall, this analysis guides both **conservation prioritization** and **eco-tourism development**, helping to target zones of highest bird activity and ecological value.
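
Species seen only as flyovers can be listed with a per-species flyover rate. The sketch below assumes `Flyover_Observed` is boolean and uses placeholder species names. The `MIN_OBS` threshold is a hypothetical guard against species with a single record, which would trivially show a 100% rate.

```python
import pandas as pd

# Toy data with placeholder species names (made-up values).
toy = pd.DataFrame({
    'Common_Name': ['Species A', 'Species A', 'Species B', 'Species B', 'Species C'],
    'Flyover_Observed': [True, True, False, True, True],
})

per_species = toy.groupby('Common_Name')['Flyover_Observed'].agg(
    observations='size',   # number of records for the species
    flyover_rate='mean',   # fraction of those records that were flyovers
)

MIN_OBS = 2  # hypothetical minimum sample size
flyover_only = per_species[
    (per_species['flyover_rate'] == 1.0) & (per_species['observations'] >= MIN_OBS)
]
print(flyover_only)
```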

### 4.6 Observer Trends

#### 4.6A. Observer Bias – Who Reports the Most Observations?

In [None]:
# Chart - 12 Top 10 observers by observation count

top_observers = df['Observer'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_observers.values, y=top_observers.index, palette="Spectral")
plt.title("Top 10 Observers by Observation Count")
plt.xlabel("Number of Observations")
plt.ylabel("Observer")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This horizontal bar chart easily displays long observer names and makes it clear who has contributed the most data. It's ideal for quickly identifying any disproportionate data contributions.

##### 2. What is/are the insight(s) found from the chart?

* A small number of observers (possibly even 1 or 2) contribute disproportionately high numbers of observations.

* This suggests a possible observer bias, where some individuals might influence data trends more than others.
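
How concentrated the data is among observers can be expressed as a cumulative share. This is a sketch on made-up observer labels, not the notebook's own calculation.

```python
import pandas as pd

# Toy data: three hypothetical observers with uneven contributions.
toy = pd.DataFrame({'Observer': ['A'] * 6 + ['B'] * 3 + ['C'] * 1})

observer_share = toy['Observer'].value_counts(normalize=True)
cumulative_share = observer_share.cumsum()

# Fraction of all observations contributed by the two most active observers.
top2_share = cumulative_share.iloc[1]
print(f"Top 2 observers contribute {top2_share:.0%} of observations")
```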

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes — this analysis directly contributes to improving data quality control. If a small group of observers dominate the data, their habits, methods, and expertise can significantly influence the trends — for better or worse.

* Positively, this may reflect well-trained observers who consistently contribute high-quality data. These individuals can be empowered further through funding, equipment, or involvement in long-term monitoring programs.

* On the other hand, over-reliance on a few observers poses a risk. If their methods are inconsistent, biased, or location-specific, it could skew population or behavioral insights. Thus, this insight can inform policy and protocol development to diversify participation, standardize training, and validate field reports for consistency across observers.

#### 4.6B. Visit Patterns – Does Repeated Site Visit Increase Diversity?

Species Count vs Number of Visits

In [None]:
# Chart - 13 Species Count vs Number of Visits
# Convert visit to numeric
df['Visit'] = pd.to_numeric(df['Visit'], errors='coerce')

# Unique species recorded at each visit number
species_by_visit = df.groupby('Visit')['Scientific_Name'].nunique()

# Plot
plt.figure(figsize=(8, 4))
sns.lineplot(x=species_by_visit.index, y=species_by_visit.values, marker='o')
plt.title("Unique Species Observed vs Visit Number")
plt.xlabel("Visit Number")
plt.ylabel("Number of Unique Species Observed")
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This line plot shows how the number of unique species recorded changes with visit number. It helps reveal whether later visits capture more species. Note that it counts species per visit rather than cumulatively across visits.

##### 2. What is/are the insight(s) found from the chart?

* There is generally a positive trend: more visits are associated with a higher count of unique species.

* The curve may level off eventually, suggesting saturation — where additional visits offer diminishing returns in species discovery.
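
Saturation is usually judged from a cumulative curve, which counts each species only at the visit where it first appears. The sketch below contrasts the per-visit count with such an accumulation curve on toy data. It is one possible way to check for diminishing returns under these assumptions, not the method used in the chart above.

```python
import pandas as pd

# Toy data with placeholder species labels (made-up values).
toy = pd.DataFrame({
    'Visit': [1, 1, 2, 2, 3, 3],
    'Scientific_Name': ['sp_a', 'sp_b', 'sp_a', 'sp_c', 'sp_b', 'sp_c'],
})

# Per-visit richness: unique species recorded at each visit number.
per_visit = toy.groupby('Visit')['Scientific_Name'].nunique()

# Accumulation: count each species at the first visit where it was recorded.
first_seen = toy.groupby('Scientific_Name')['Visit'].min()
new_species = first_seen.value_counts().reindex(per_visit.index, fill_value=0)
cumulative = new_species.cumsum()

print(pd.DataFrame({'per_visit': per_visit, 'cumulative': cumulative}))
```

In this toy case the per-visit count stays flat while the cumulative count flattens after the second visit, which is the pattern that would indicate saturation.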

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. This finding supports the idea that repeated site visits improve the reliability and completeness of biodiversity data. It justifies allocating budget for longitudinal studies and recurring surveys in conservation programs.

It also helps conservation managers design observation frequency protocols — e.g., how many visits are optimal to capture most species without wasting resources. Additionally, it supports citizen science models where volunteers return to the same plots regularly to contribute to data quality.

However, once the curve flattens, additional visits yield fewer new species — which indicates a point of diminishing returns. This insight allows for strategic planning to stop or shift efforts when biodiversity capture is saturated at a site.

### 4.7 Conservation Insights

#### 4.7A. Watchlist Species Frequency

Top 10 Most Frequently Observed Watchlist Species

In [None]:
# Chart 14 - Top 10 Most Frequently Observed Watchlist Species
# Filter only PIF watchlist birds
watchlist_df = df[df['PIF_Watchlist_Status'] == True]

# Get top 10 watchlist species by count
top_watchlist = watchlist_df['Common_Name'].value_counts().head(10)

# Plot
plt.figure(figsize=(10, 5))
sns.barplot(x=top_watchlist.values, y=top_watchlist.index, palette="Reds_r")
plt.title("Top 10 Watchlist Bird Species Observed")
plt.xlabel("Observation Count")
plt.ylabel("Watchlist Species")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly ranks at-risk species based on observation counts. This allows stakeholders to visually prioritize conservation efforts toward species that are both vulnerable and frequently seen in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* Certain at-risk bird species appear frequently in observations.

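* This may indicate that these species are doing relatively well in the surveyed areas, or that these zones are critical strongholds for their survival. Observation counts alone cannot distinguish between the two.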
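
How frequent watchlist species are overall can be checked with a simple share. The sketch below also normalizes the status column first, on the assumption that a spreadsheet import may have left a mix of booleans and strings such as "TRUE". The helper `to_bool` and the toy values are hypothetical.

```python
import pandas as pd

# Toy data with placeholder species names and mixed-type flags (made-up values).
toy = pd.DataFrame({
    'Common_Name': ['Species A', 'Species A', 'Species B', 'Species C', 'Species B'],
    'PIF_Watchlist_Status': ['TRUE', True, 'False', 'true', False],
})

def to_bool(series: pd.Series) -> pd.Series:
    """Map True/'TRUE'/'true' to True and everything else to False."""
    return series.astype(str).str.strip().str.lower().eq('true')

is_watchlist = to_bool(toy['PIF_Watchlist_Status'])
watchlist_share = is_watchlist.mean()  # fraction of all observations
watchlist_counts = toy.loc[is_watchlist, 'Common_Name'].value_counts()

print(f"{watchlist_share:.0%} of observations involve watchlist species")
print(watchlist_counts)
```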

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — this analysis is essential for data-driven conservation planning.

By identifying which watchlist species are commonly observed, we can:

* Prioritize funding and protection for habitats where these species thrive.

* Develop targeted management plans to monitor and support these populations.

* Use these species as flagship indicators for ecosystem health and public outreach.

Moreover, the visibility of watchlist species in tourism zones can promote eco-tourism awareness, but must be handled with care to avoid disturbing sensitive populations.

#### 4.7B. Regional Stewardship Species Trends

Frequency of Regionally Important Species

In [None]:
# Chart 15 - Frequency of Regionally Important Species
# Filter regionally important species
steward_df = df[df['Regional_Stewardship_Status'] == True]

# Count top 10 stewardship species
top_stewards = steward_df['Common_Name'].value_counts().head(10)

# Plot
plt.figure(figsize=(10, 5))
sns.barplot(x=top_stewards.values, y=top_stewards.index, palette="Blues_d")
plt.title("Top 10 Regionally Important Bird Species")
plt.xlabel("Observation Count")
plt.ylabel("Stewardship Species")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart provides a quick overview of the most frequently observed species with regional conservation importance, helping decision-makers focus regional policies and land-use regulations where they matter most.

##### 2. What is/are the insight(s) found from the chart?

* Some regionally significant species are commonly seen, implying their habitat needs are currently being met.

* These areas may be core breeding or nesting zones for priority species — valuable ecological real estate.
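
The watchlist and stewardship rankings follow the same filter-then-count pattern, so both could be produced by one helper. The function `top_flagged_species` below is a hypothetical refactoring shown on toy data with placeholder species names, and it assumes both status columns hold boolean values.

```python
import pandas as pd

def top_flagged_species(df: pd.DataFrame, flag_col: str, n: int = 10) -> pd.Series:
    """Return observation counts for the n most frequent species where flag_col is True."""
    flagged = df[df[flag_col].eq(True)]
    return flagged['Common_Name'].value_counts().head(n)

# Toy data (made-up values).
toy = pd.DataFrame({
    'Common_Name': ['Species A', 'Species A', 'Species A',
                    'Species B', 'Species C', 'Species C'],
    'PIF_Watchlist_Status': [True, True, True, False, False, False],
    'Regional_Stewardship_Status': [False, False, False, True, True, True],
})

top_watchlist_toy = top_flagged_species(toy, 'PIF_Watchlist_Status')
top_stewards_toy = top_flagged_species(toy, 'Regional_Stewardship_Status')
print(top_watchlist_toy)
print(top_stewards_toy)
```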

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. This analysis:

* Informs regional conservation planning, ensuring stewardship species receive adequate protection and monitoring.

* Enables collaborations with local landowners or park managers to preserve habitats critical to regional biodiversity.

* Provides a baseline to measure future success of habitat restoration or policy implementation.

Additionally, regionally important species can be featured in public awareness campaigns, acting as local ambassadors for biodiversity efforts — boosting community support and potential funding.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

To help stakeholders achieve the business objectives of biodiversity management and conservation planning, the following data-driven solutions are recommended:


* Targeted Conservation Planning:
Focus efforts on regions and habitats where both watchlist and stewardship species are frequently observed. These zones can act as ecological strongholds and require urgent protection and monitoring.

* Habitat-Specific Policies:
Promote forest and grassland-specific conservation strategies based on species richness, abundance, and habitat preference patterns discovered through EDA.

* Environment-Based Protection:
Use insights on climate, elevation, and vegetation type to predict vulnerable zones and allocate resources for restoration or conservation accordingly.

* Monitoring Programs:
Observer-based trends suggest high variability in effort and coverage. Establish consistent monitoring protocols to reduce data bias and improve quality.

* Community Engagement & Eco-Tourism:
Leverage high-visibility species and regional icons to develop public awareness and eco-tourism initiatives. Ensure such programs are sustainable and don't negatively impact sensitive species.

* Digital Dashboards & Alerts:
Convert key insights into interactive dashboards for real-time conservation tracking, stakeholder collaboration, and policy development.
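
For the targeted conservation planning recommendation, species that carry both the watchlist and the stewardship flag can be ranked by observation count. This is a minimal sketch on toy data with placeholder species names, assuming both status columns are boolean. The name `priority_species` is introduced here for illustration.

```python
import pandas as pd

# Toy data (made-up values): Species B is watchlist only, Species D stewardship only.
toy = pd.DataFrame({
    'Common_Name': ['Species A', 'Species A', 'Species B',
                    'Species C', 'Species C', 'Species C', 'Species D'],
    'PIF_Watchlist_Status': [True, True, True, True, True, True, False],
    'Regional_Stewardship_Status': [True, True, False, True, True, True, True],
})

# Observations of species that appear on both lists.
on_both = toy['PIF_Watchlist_Status'] & toy['Regional_Stewardship_Status']
priority_species = toy.loc[on_both, 'Common_Name'].value_counts()
print(priority_species)
```

Grouping the same filtered rows by a site or habitat column, if one is available, would point to the zones where these species concentrate.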

# **Conclusion**

The Bird Species Observation Analysis successfully revealed how environmental, behavioral, and spatial factors influence bird diversity and distribution across forest and grassland habitats. Temporal trends highlighted seasonality in species richness, while environmental variables like elevation and vegetation shaped species presence and abundance. Species-specific and conservation-focused insights underscored the importance of habitat specialization, observer bias, and regionally important species.

Overall, this project offers a comprehensive framework for informed conservation planning, habitat protection, and biodiversity monitoring — aiding both scientific understanding and practical decision-making. The actionable recommendations derived from EDA can empower stakeholders to make data-backed interventions in wildlife protection, land use, and ecological sustainability.