# 🐰 Global Rabbit Population Analysis

This notebook analyzes global rabbit population data through exploratory data analysis, visualization, statistical analysis, and predictive modeling.

## Objectives

1. Analyze global rabbit population trends over time
2. Explore species distribution across different regions
3. Examine habitat impact on population
4. Analyze conservation status and endangered populations
5. Build predictive models for future population projections

## 1. Setup and Data Loading

Let's start by importing the necessary libraries for our analysis:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
import os

# Set plotting styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")

# Configure plot size and resolution
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['figure.dpi'] = 100

# Display all dataframe columns
pd.set_option('display.max_columns', None)

### Generate Sample Rabbit Population Data

Since we don't have real data yet, we'll generate synthetic data for our analysis. In a real-world scenario, we would load data from an external source.
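
For comparison, loading real data could look roughly like the sketch below. It assumes a CSV with the same columns the synthetic generator produces; the `load_rabbit_population` helper and the `EXPECTED_COLUMNS` list are hypothetical names introduced here for illustration, and the in-memory sample stands in for an actual file.

```python
import io

import pandas as pd

# Columns produced by the synthetic generator in this notebook; a real data
# source is assumed (not known) to provide the same ones.
EXPECTED_COLUMNS = ['Year', 'Region', 'Species', 'Population',
                    'Habitat', 'Conservation_Status']


def load_rabbit_population(source):
    """Hypothetical loader: read a CSV and check it has the expected columns."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Tiny in-memory stand-in for a real file, so the sketch runs on its own
sample_csv = io.StringIO(
    "Year,Region,Species,Population,Habitat,Conservation_Status\n"
    "2000,Europe,Hare,12000,Grassland,Least Concern\n"
    "2001,Europe,Hare,12500,Grassland,Least Concern\n"
)
sample_df = load_rabbit_population(sample_csv)
print(sample_df.shape)
```

Checking the columns up front means a mismatched file fails immediately with a readable message instead of raising a `KeyError` somewhere later in the analysis.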

In [None]:
def generate_rabbit_population_data():
    """
    Generate synthetic rabbit population data for analysis.
    
    Returns:
        pandas.DataFrame: A dataframe with synthetic rabbit population data
    """
    # Set random seed for reproducibility
    np.random.seed(42)
    
    # Define parameters
    years = list(range(2000, 2026))  # 2000 to 2025
    regions = ['North America', 'Europe', 'Asia', 'Africa', 'Australia', 'South America']
    species = ['European Rabbit', 'Cottontail', 'Hare', 'Jackrabbit', 'Pygmy Rabbit']
    habitats = ['Forest', 'Grassland', 'Desert', 'Urban']
    conservation_statuses = ['Least Concern', 'Near Threatened', 'Vulnerable', 'Endangered']
    
    # Generate sample population data
    data = []
    
    for year in years:
        for region in regions:
            for specie in species:
                # Base population range depends on species (the regional factor is applied below)
                if specie == 'European Rabbit':
                    base_population = np.random.randint(20000, 60000)
                elif specie == 'Cottontail':
                    base_population = np.random.randint(15000, 45000)
                elif specie == 'Hare':
                    base_population = np.random.randint(10000, 30000)
                elif specie == 'Jackrabbit':
                    base_population = np.random.randint(5000, 20000)
                else:  # Pygmy Rabbit
                    base_population = np.random.randint(1000, 10000)
                
                # Add regional adjustments
                if region == 'North America':
                    regional_factor = 1.2
                elif region == 'Europe':
                    regional_factor = 1.1
                elif region == 'Asia':
                    regional_factor = 1.3
                elif region == 'Africa':
                    regional_factor = 0.9
                elif region == 'Australia':
                    regional_factor = 0.8
                else:  # South America
                    regional_factor = 0.7
                
                # Create population trend with various factors
                trend = (year - 2000) * 500  # Increasing trend over time
                seasonal = np.sin(year) * 2000  # Multi-year cycle, period 2*pi ≈ 6.3 years (not within-year seasonality)
                random_factor = np.random.normal(0, 5000)  # Random noise
                
                # Calculate final population
                population = max(100, int(base_population * regional_factor + trend + seasonal + random_factor))
                
                # Determine habitat based on region, with some randomness
                if region in ['North America', 'Europe']:
                    habitat_weights = [0.4, 0.3, 0.1, 0.2]  # Forest, Grassland, Desert, Urban
                elif region in ['Asia', 'Africa']:
                    habitat_weights = [0.2, 0.4, 0.3, 0.1]
                else:  # Australia, South America
                    habitat_weights = [0.3, 0.3, 0.3, 0.1]
                
                habitat = np.random.choice(habitats, p=habitat_weights)
                
                # Determine conservation status from the population level, with some randomness
                if population < 5000:
                    status_weights = [0.1, 0.2, 0.3, 0.4]  # Higher chance of being endangered
                elif population < 15000:
                    status_weights = [0.2, 0.3, 0.4, 0.1]
                elif population < 30000:
                    status_weights = [0.5, 0.3, 0.1, 0.1]
                else:
                    status_weights = [0.8, 0.1, 0.05, 0.05]  # Likely least concern
                
                conservation_status = np.random.choice(conservation_statuses, p=status_weights)
                
                # Add data point
                data.append({
                    'Year': year,
                    'Region': region,
                    'Species': specie,
                    'Population': population,
                    'Habitat': habitat,
                    'Conservation_Status': conservation_status
                })
    
    # Convert to DataFrame
    df = pd.DataFrame(data)
    
    # Add some derived features
    
    # 1. Calculate year-over-year growth rates in percent (NaN for each group's first year)
    df = df.sort_values(['Region', 'Species', 'Year'])
    df['YoY_Growth'] = df.groupby(['Region', 'Species'])['Population'].pct_change() * 100
    
    # 2. Add min-max normalized population (scaled to [0, 1] within each species)
    df['Normalized_Population'] = df.groupby(['Species'])['Population'].transform(
        lambda x: (x - x.min()) / (x.max() - x.min())
    )
    
    # 3. Add a binary feature for endangered/non-endangered
    df['Is_Endangered'] = (df['Conservation_Status'] == 'Endangered').astype(int)
    
    # 4. Calculate species dominance (fraction of the region's total population in that year)
    df['Species_Dominance'] = (
        df['Population'] / df.groupby(['Year', 'Region'])['Population'].transform('sum')
    )
    
    return df

# Generate the data
rabbit_df = generate_rabbit_population_data()

# Save to CSV in the data/processed directory
os.makedirs('../data/processed', exist_ok=True)
rabbit_df.to_csv('../data/processed/rabbit_population.csv', index=False)

# Display the first few rows
rabbit_df.head()
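
The derived features are easier to verify on a small hand-made frame than on the full dataset. The sketch below uses illustrative toy values (not taken from the generated data) to show the min-max normalization within each species, and a grouped `transform` that yields each row's share of its (Year, Region) total. Those shares should sum to 1 within every group.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only, not taken from the generated dataset)
toy = pd.DataFrame({
    'Year':       [2000, 2000, 2001, 2001],
    'Region':     ['Europe', 'Europe', 'Europe', 'Europe'],
    'Species':    ['Hare', 'Cottontail', 'Hare', 'Cottontail'],
    'Population': [1000, 3000, 2000, 6000],
})

# Min-max normalization within each species: (x - min) / (max - min)
toy['Normalized_Population'] = toy.groupby('Species')['Population'].transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)

# Share of the (Year, Region) total held by each species
group_total = toy.groupby(['Year', 'Region'])['Population'].transform('sum')
toy['Species_Dominance'] = toy['Population'] / group_total

# Shares within each (Year, Region) group should add up to 1
dominance_sums = toy.groupby(['Year', 'Region'])['Species_Dominance'].sum()
print(toy)
print(np.allclose(dominance_sums, 1.0))
```

The same `np.allclose` check could be run against `rabbit_df` as a quick sanity test after generation.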

## 2. Exploratory Data Analysis

Now that we have our data, let's explore it to understand its structure and characteristics.

In [None]:
# Check the dimensions of our dataset
print(f"Dataset dimensions: {rabbit_df.shape}")

# Display data types
print("\nData types:")
print(rabbit_df.dtypes)

# Check for missing values
print("\nMissing values:")
print(rabbit_df.isnull().sum())

# Basic statistics
print("\nBasic statistics:")
print(rabbit_df.describe())

# Count unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in ['Region', 'Species', 'Habitat', 'Conservation_Status']:
    print(f"\n{col} unique values: {rabbit_df[col].nunique()}")
    print(rabbit_df[col].value_counts())
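
One result to expect from the missing-value check: `YoY_Growth` comes from a grouped `pct_change`, so the first year of each (Region, Species) group has no previous value and is NaN. With 6 regions and 5 species, that should account for 30 missing values, and no other column should have any. The sketch below reproduces the effect on toy data (illustrative values only).

```python
import pandas as pd

# Toy data: two (Region, Species) groups with three years each
toy = pd.DataFrame({
    'Region':     ['Europe'] * 3 + ['Asia'] * 3,
    'Species':    ['Hare'] * 6,
    'Year':       [2000, 2001, 2002] * 2,
    'Population': [1000, 1100, 990, 2000, 2500, 2500],
})

# Same steps as in the data generator: sort, then percent change per group
toy = toy.sort_values(['Region', 'Species', 'Year'])
toy['YoY_Growth'] = toy.groupby(['Region', 'Species'])['Population'].pct_change() * 100

# One NaN per group: the first year has nothing to compare against
n_missing = toy['YoY_Growth'].isnull().sum()
n_groups = toy.groupby(['Region', 'Species']).ngroups
print(n_missing, n_groups)
```

If a later step cannot handle NaN, these first-year rows can be dropped or filled explicitly; which choice is appropriate depends on the analysis.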

### Population Distribution Analysis

Let's analyze the distribution of rabbit populations across different dimensions.