#### Introduction
Before the analysis can begin, it is important to make sure that sufficient data is 
available to meet the project goals. Additionally, the data must be reliably sourced 
and collected using approved standards. Here, the 'UKBMS site indices data 2023' is 
cleaned through a series of exploratory and filtering operations. 

#### Workflow
 1) Extra spaces are removed. 
 2) The data is filtered by year, country and index score.  
 3) Duplicate surveys are removed.
 4) Two species (difficult to distinguish in the field) are grouped together into a 
 single index.
 5) Migratory species and species with insufficient data for the analysis are removed. 

In [1]:
# Importing packages
import pandas as pd
import os
from pathlib import Path

# Importing localised file directory
project_root = Path(os.environ['butterfly_project'])

# Importing dataset
ukbms_site_indices = pd.read_csv(
    project_root/'Data'/'UKBMS'/'ukbms_site_indices'/'ukbmssiteindices2023.csv'
)

#### Removing Extra Spaces

In [3]:
# Single Spaces at start/end of string
ukbms_site_indices = (
    ukbms_site_indices
    # .apply() used to apply lambda function to each dataframe column
    # .strip() removes spaces at start and end of string
    # dtype=='object' ensures function is only applied to non-numeric values
    .apply(lambda x: x.str.strip() if x.dtype=='object' 
           else x) # if record is numeric no changes are made
)

# Removing 2 or more consecutive spaces
ukbms_site_indices = (
    ukbms_site_indices
    # finds records where 2 or more spaces exist and replaces with single space
    # regex=True treats ' {2,}' as a pattern (the default is regex=False from pandas 2.0)
    .apply(lambda x: x.str.replace(' {2,}', ' ', regex=True)
           if x.dtype=='object' # string type
           else x) # if record is numeric no changes are made
)
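
The two whitespace steps can be checked on a small example. The sketch below uses an invented toy dataframe (the values are hypothetical, only the column names mirror the dataset) and a hypothetical helper `clean_spaces`. It swaps the `dtype=='object'` comparison for `pd.api.types.is_string_dtype`, on the assumption that text columns may be stored either as `object` or as a dedicated string dtype depending on the pandas version.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical toy data: the values are invented for illustration
toy = pd.DataFrame({
    'COMMON_NAME': ['  Small  Skipper ', 'Essex Skipper'],
    'SITE_INDEX': [12, 7],
})

def clean_spaces(column):
    """Strip outer spaces and collapse runs of spaces in text columns."""
    if is_string_dtype(column):
        return (
            column.str.strip()                          # spaces at start/end
            .str.replace(r' {2,}', ' ', regex=True)     # 2 or more spaces -> 1
        )
    return column  # numeric columns are returned unchanged

toy_clean = toy.apply(clean_spaces)
print(toy_clean['COMMON_NAME'].tolist())
```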

#### Filtering
Removing entries dated before 1976, entries from outside the UK and sites with index 
scores of -2. 

#### Explanation 

Entries before 1976:
The transect method was still being developed before 1976. Results from 
these years could have been collected using sampling methods inconsistent with later 
years.

Entries Outside the UK:
The parameters of this project mean locations outside the UK are excluded.

Index scores of -2:
Scores of -2 are for sites where insufficient data was collected to record a reliable 
abundance estimate. 

In [None]:
ukbms_site_indices_filter = (
    ukbms_site_indices[(ukbms_site_indices['YEAR']>=1976) # rows pre 1976 removed
    & (ukbms_site_indices['SITE_INDEX']!=-2) # index scores of -2 removed
    # all records not from the 'channel islands' or 'isle of man' are retained.
    & (~ukbms_site_indices['COUNTRY'] # tilde (~) reverses boolean output
       .isin(['Channel Islands', 'Isle of Man']))])
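
As a minimal sketch of how the three conditions combine, the same boolean mask can be built on an invented toy dataframe. The rows are hypothetical and chosen so that each condition removes exactly one of them.

```python
import pandas as pd

# Hypothetical toy data: one row fails each condition, one row passes all three
toy = pd.DataFrame({
    'YEAR': [1975, 1990, 1990, 2005],
    'COUNTRY': ['England', 'Isle of Man', 'Wales', 'Scotland'],
    'SITE_INDEX': [10, 4, -2, 25],
})

keep = (
    (toy['YEAR'] >= 1976)                                         # drops the 1975 row
    & (toy['SITE_INDEX'] != -2)                                   # drops the -2 row
    & ~toy['COUNTRY'].isin(['Channel Islands', 'Isle of Man'])    # drops non-UK row
)

toy_filtered = toy[keep]
print(toy_filtered)
```

Building the mask as a separate variable makes it easy to inspect how many rows each condition removes before the filter is applied.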

#### Duplicate Surveys
Duplicate surveys exist for: 

'SPECIES_CODE' 98, 'SITE_CODE' 1613, 'YEAR' 2001. 

'SPECIES_CODE' 93, 'SITE_CODE' 1018, 'YEAR' 2008. 

In both instances, there is 1 duplicate row with a different 'SITE_INDEX' score. In 
the absence of further information, it is assumed that the latest site index score is 
the correct score.

In [8]:
# Duplicated rows are removed. 
ukbms_site_indices_filter = (
    ukbms_site_indices_filter
    .drop_duplicates(['SITE_CODE', 'SPECIES_CODE', 'YEAR'], keep='last') # later row kept
)
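
Conflicting surveys like these can be listed with `duplicated(keep=False)`, and the `keep` argument of `drop_duplicates` decides which row survives. The sketch below uses invented index values on a toy dataframe to show the difference between `keep='first'` (the pandas default) and `keep='last'`.

```python
import pandas as pd

# Hypothetical toy data: two rows share the same site/species/year key
toy = pd.DataFrame({
    'SITE_CODE': [1, 1, 2],
    'SPECIES_CODE': [98, 98, 98],
    'YEAR': [2001, 2001, 2001],
    'SITE_INDEX': [30, 45, 12],
})
key = ['SITE_CODE', 'SPECIES_CODE', 'YEAR']

# keep=False marks every row that belongs to a duplicated key
conflicts = toy[toy.duplicated(key, keep=False)]

first_kept = toy.drop_duplicates(key, keep='first')  # earlier row retained
last_kept = toy.drop_duplicates(key, keep='last')    # later row retained

print(conflicts)
print(first_kept['SITE_INDEX'].tolist(), last_kept['SITE_INDEX'].tolist())
```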

#### Grouping Species
'Thymelicus lineola' and 'Thymelicus sylvestris' are difficult to distinguish in the 
field.

For this reason, many of the abundance indices have been grouped as 'Thymelicus 
lineola/sylvestris'. However, for some years/survey sites, a grouping does not exist.

For consistency, this project will use the grouped abundance index. Hence, a combined 
index needs to be created at all sites where the species have not been grouped. 

In [None]:
# Storing old dataframe. Copying records to 'ukbms_species_combine'
ukbms_species_combine = ukbms_site_indices_filter.copy()

# If a 'SPECIES' record 'Thymelicus lineola/sylvestris' already exists, then the index
# should be left. If no combined record exists, the two constituent records 'Thymelicus 
# lineola' and 'Thymelicus sylvestris' are aggregated at each site for each year.

# First find all surveys where a combined index exists
tlts_combined = ukbms_species_combine[ # tlts short for 'Thymelicus lineola/sylvestris'
(ukbms_species_combine['SPECIES']=='Thymelicus lineola/sylvestris')
]
# extracting useful columns that will be used to merge with 'individual' df
tlts_combined = (
    tlts_combined[['SPECIES', 'SITE_CODE', 'YEAR']]
    .reset_index(drop=True)
)

# Find all surveys where individual indices exist
tlts_individual = ukbms_species_combine[
(ukbms_species_combine['SPECIES']=='Thymelicus lineola')
| (ukbms_species_combine['SPECIES']=='Thymelicus sylvestris')
]
# extracting useful columns that will be used to merge with 'combined' df
tlts_individual = (
    tlts_individual[['SPECIES', 'SITE_CODE', 'COUNTRY', 'YEAR', 'SITE_INDEX']]
    .reset_index(drop=True)
)

In [10]:
# Find all survey site/year groupings that have individual indices but not combined. 
tlts_indicator = (
    tlts_individual.merge(tlts_combined,
                          # Year/site combinations where a 'combined' survey does not 
                          # exist will show in the left table
                          on=['YEAR', 'SITE_CODE'],
                          how='left',
                          suffixes=['', '_combined'], # to distinguish duplicate columns
                          indicator=True) # identifies records appearing in left df only
)
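
The behaviour of `indicator=True` is easiest to see on a small case. In this sketch the two toy dataframes are invented: site 1 has only individual records, while site 2 also has a combined record, so only the site 1 rows are flagged as `left_only`.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'individual' and 'combined' tables
individual = pd.DataFrame({
    'SITE_CODE': [1, 1, 2],
    'YEAR': [2000, 2000, 2000],
    'SPECIES': ['Thymelicus lineola', 'Thymelicus sylvestris',
                'Thymelicus lineola'],
    'SITE_INDEX': [5, 8, 3],
})
combined = pd.DataFrame({
    'SITE_CODE': [2],
    'YEAR': [2000],
    'SPECIES': ['Thymelicus lineola/sylvestris'],
})

flagged = individual.merge(
    combined,
    on=['YEAR', 'SITE_CODE'],
    how='left',                     # every 'individual' row is retained
    suffixes=('', '_combined'),     # right-hand SPECIES becomes SPECIES_combined
    indicator=True,                 # adds the '_merge' column
)

# Rows with no matching combined record are labelled 'left_only'
not_combined = flagged[flagged['_merge'] == 'left_only']
print(flagged[['SITE_CODE', 'YEAR', 'SPECIES', '_merge']])
```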

In [11]:
# The indicator df is filtered to leave only survey site/years without combined index
tlts_not_combined = (
    # years/site_codes that did not have combined index
    tlts_indicator[tlts_indicator['_merge']=='left_only'] 
    .reset_index(drop=True)
    .drop(columns=['SPECIES_combined', '_merge']) # removing redundant columns
)

# species surveys taken from the same location and year are summed together. Where only
# one row exists in a 'group', the sum is just the single index value. 
tlts_group = (
    tlts_not_combined
    .groupby(['SITE_CODE', 'YEAR', 'COUNTRY'])
    .agg(agg_index=('SITE_INDEX', 'sum'))
    .reset_index() # index columns reintegrated into dataframe
)
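
A short illustration of the aggregation, using invented values: site 1 has both species recorded in 2000, so their indices are summed, whereas site 3 has a single record, which passes through unchanged.

```python
import pandas as pd

# Hypothetical toy data: indices for surveys that lack a combined record
not_combined = pd.DataFrame({
    'SITE_CODE': [1, 1, 3],
    'YEAR': [2000, 2000, 2001],
    'COUNTRY': ['England', 'England', 'Wales'],
    'SPECIES': ['Thymelicus lineola', 'Thymelicus sylvestris',
                'Thymelicus lineola'],
    'SITE_INDEX': [5, 8, 4],
})

grouped = (
    not_combined
    .groupby(['SITE_CODE', 'YEAR', 'COUNTRY'])
    .agg(SITE_INDEX=('SITE_INDEX', 'sum'))  # named aggregation: output=(column, func)
    .reset_index()
)
print(grouped)
```

Naming the output column `SITE_INDEX` directly in `.agg()` is one way to avoid a later rename; this is a stylistic alternative rather than a change in the result.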

In [None]:
# Renaming 'agg_index' column to 'SITE_INDEX' for consistency with main dataframe:
# 'ukbms_species_combine'
tlts_group = tlts_group.rename(columns={'agg_index':'SITE_INDEX'})

# Adding combined species records
tlts_group['SPECIES'] = 'Thymelicus lineola/sylvestris'
tlts_group['COMMON_NAME'] = 'Essex/Small Skipper'
tlts_group['SPECIES_CODE'] = 121

All surveys that previously lacked a combined index now have one. These can be added to
the main dataframe: 'ukbms_species_combine'. All individual species indices 
('Thymelicus lineola' and 'Thymelicus sylvestris') are no longer required and should 
be removed. 

In [13]:
# Adding dataframe with new species groupings (tlts_group) onto main dataframe 
# (ukbms_species_combine)
ukbms_species_clean = pd.concat([ukbms_species_combine, tlts_group], ignore_index=True)

# removing individual species indices
# 'SPECIES_CODE' 119 is 'Thymelicus lineola' and 120 is 'Thymelicus sylvestris'
ukbms_species_clean = ukbms_species_clean[~ # '~' (tilde) reverses the boolean output
ukbms_species_clean['SPECIES_CODE'].isin([119, 120]) 
].reset_index(drop=True)
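
One possible sanity check after this step is that no site/year ends up with more than one combined record. The sketch below reproduces the concat-and-filter logic on invented rows and counts any repeated site/year/species keys; the data and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: codes 119/120 are the individual species, 121 the combined one
main = pd.DataFrame({
    'SITE_CODE': [1, 1, 2],
    'YEAR': [2000, 2000, 2000],
    'SPECIES_CODE': [119, 120, 121],
    'SITE_INDEX': [5, 8, 9],
})
new_rows = pd.DataFrame({
    'SITE_CODE': [1],
    'YEAR': [2000],
    'SPECIES_CODE': [121],
    'SITE_INDEX': [13],   # 5 + 8 from the two individual records at site 1
})

cleaned = pd.concat([main, new_rows], ignore_index=True)
cleaned = cleaned[~cleaned['SPECIES_CODE'].isin([119, 120])].reset_index(drop=True)

# Zero means each site/year/species key appears only once
key = ['SITE_CODE', 'YEAR', 'SPECIES_CODE']
n_conflicts = int(cleaned.duplicated(key).sum())
print(cleaned)
print('repeated keys:', n_conflicts)
```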

#### Identifying Species with Insufficient Data
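Some species cannot be analysed because they have insufficient data. Here, a species is
treated as having insufficient data when its summed site index is zero in more than 15%
of survey years.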

In [14]:
# First, indices for each species are aggregated by year
annual_counts = (
    ukbms_species_clean.groupby(['SPECIES','YEAR']) # all species records per year
    .agg(SUM_SITE_INDEX=('SITE_INDEX','sum')) # summing all site indices in group
    .reset_index()
    .sort_values(['SUM_SITE_INDEX', 'SPECIES'])
)

# Filtering to leave groups where summed site indices per year per species was zero
zero_indices = (
    annual_counts[
    annual_counts['SUM_SITE_INDEX']==0 # summed index filter
    ].sort_values('SPECIES')
)

In [15]:
# Counting total 'zero years' per species
zero_index_counts = (
    annual_counts[annual_counts['SUM_SITE_INDEX']==0]
    .groupby('SPECIES') # zero counts are grouped by species
    .agg(NO_RECORD_YEARS=('SUM_SITE_INDEX','count'))
    .sort_values('NO_RECORD_YEARS', ascending=False) # descending order
    .reset_index()
)

In [16]:
# Computing the percentage of survey years missing for species with 'zero years'
zero_index_counts['%_YEARS_NO_RECORD'] = (
    zero_index_counts['NO_RECORD_YEARS']
    /ukbms_species_clean['YEAR'].nunique()*100 # number of survey years*100 for %
)

In [None]:
# Filtering to leave species where data is missing for more than 15% of survey years
insufficient_data = (
    list( # all species are added to list
        zero_index_counts[
        zero_index_counts['%_YEARS_NO_RECORD']>15 # filter
        ].iloc[:,0] # species names are found in the first column of the dataframe
    )
)
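
The threshold logic can be traced on a toy example. In this sketch the species names 'A' and 'B' and all index values are invented: over 8 survey years, 'A' has 2 zero years (25%) and 'B' has 1 (12.5%), so only 'A' exceeds the 15% threshold.

```python
import pandas as pd

# Hypothetical toy data: one site index per species per year
years = list(range(2000, 2008))
toy = pd.DataFrame({
    'SPECIES': ['A'] * 8 + ['B'] * 8,
    'YEAR': years * 2,
    'SITE_INDEX': [0, 0, 5, 4, 3, 2, 6, 1,
                   3, 0, 2, 1, 4, 2, 5, 1],
})

# Sum of site indices per species per year
annual = toy.groupby(['SPECIES', 'YEAR'], as_index=False)['SITE_INDEX'].sum()

# Number of years in which the summed index is zero, per species
zero_years = annual[annual['SITE_INDEX'] == 0].groupby('SPECIES').size()

# Share of all survey years that are 'zero years'
pct_zero = zero_years / toy['YEAR'].nunique() * 100

flagged = pct_zero[pct_zero > 15].index.tolist()
print(pct_zero.to_dict(), flagged)
```

Note that a species/year pair with no rows at all does not appear in the grouped result, so in this sketch it is not counted as a zero year.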

#### Identifying Migrant Species
A list of migrant species is created. These species have inconsistent population sizes
due to their migratory habits. For this reason, they will be excluded from the analysis.

In [None]:
migrant_species = ['Vanessa cardui', 'Colias croceus']

#### Removing Migrant Species and those with Insufficient Data

In [None]:
# Creating list of species to be removed from dataset
species_to_be_removed = list(insufficient_data + migrant_species)

In [18]:
# Creating new dataframe (ukbms_site_indices_clean)
ukbms_site_indices_clean = ukbms_species_clean.copy()

# Removing species listed in 'species_to_be_removed' list
ukbms_site_indices_clean = (
    ukbms_site_indices_clean[
    ~ukbms_site_indices_clean['SPECIES'].isin(species_to_be_removed)
    ]
)

#### Quality check

Two migrant species were removed: 'Vanessa cardui', 'Colias croceus'.

One species was removed due to insufficient data: 'Erebia epiphron'.

A combined record was used instead of two individual species: 'Thymelicus lineola' 
and 'Thymelicus sylvestris'.

In total, 5 different species were removed from the dataset. The cleaned dataframe 
should have 5 fewer unique species. 

The number of unique values for 'SPECIES_CODE', 'COMMON_NAME', 'SPECIES' should be
equal.
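
These expectations could also be expressed as assertions so that the check fails loudly rather than relying on reading printed output. The sketch below is one way to do that, using hypothetical helpers `unique_counts` and `counts_agree` on invented toy data (the species codes shown are made up).

```python
import pandas as pd

ID_COLUMNS = ('SPECIES', 'COMMON_NAME', 'SPECIES_CODE')

def unique_counts(df, columns=ID_COLUMNS):
    """Return the number of unique values in each identifier column."""
    return {column: df[column].nunique() for column in columns}

def counts_agree(df, columns=ID_COLUMNS):
    """True when every identifier column has the same number of unique values."""
    return len(set(unique_counts(df, columns).values())) == 1

# Hypothetical toy data: SPECIES_CODE values are invented for illustration
before = pd.DataFrame({
    'SPECIES': ['Vanessa cardui', 'Pieris rapae', 'Pieris rapae'],
    'COMMON_NAME': ['Painted Lady', 'Small White', 'Small White'],
    'SPECIES_CODE': [1, 2, 2],
})
after = before[~before['SPECIES'].isin(['Vanessa cardui'])]

n_removed = unique_counts(before)['SPECIES'] - unique_counts(after)['SPECIES']

assert counts_agree(before) and counts_agree(after)
assert n_removed == 1  # one species was removed from the toy data
print(unique_counts(before), unique_counts(after))
```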

In [19]:
# First the original dataframe is checked.
# Finding the number of unique butterfly species 
print(
    'number of unique species: ' 
    + str(ukbms_site_indices_filter['SPECIES'].nunique())
)

# Checking the number of unique 'common_name' records matches unique 'species' 
print(
    'number of unique common_name: ' 
    + str(ukbms_site_indices_filter['COMMON_NAME'].nunique())
)

# Checking the number of unique 'species_code' records matches unique 'species' 
print(
    'number of unique species_code: ' 
    + str(ukbms_site_indices_filter['SPECIES_CODE'].nunique())
)

number of unique species: 60
number of unique common_name: 60
number of unique species_code: 60


In [20]:
# The cleaned dataframe should have 5 fewer species

# Finding the number of unique butterfly species 
print(
    'number of unique species: ' 
    + str(ukbms_site_indices_clean['SPECIES'].nunique())
)

# Checking the number of unique 'common_name' records matches unique 'species' 
print(
    'number of unique common_name: ' 
    + str(ukbms_site_indices_clean['COMMON_NAME'].nunique())
)

# Checking the number of unique 'species_code' records matches unique 'species' 
print(
    'number of unique species_code: ' 
    + str(ukbms_site_indices_clean['SPECIES_CODE'].nunique())
)

number of unique species: 55
number of unique common_name: 55
number of unique species_code: 55


In [24]:
# Saving to csv file
ukbms_site_indices_clean.to_csv(
    project_root/'Data'/'UKBMS'/'ukbms_site_indices'/'ukbms_site_indices_cleaned_v1.csv'
)
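
When the saved file is read back, it is worth knowing that `to_csv` writes the dataframe index as an extra unnamed first column by default. The sketch below demonstrates this on an invented one-row dataframe written to a temporary directory; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'SPECIES': ['Pieris rapae'], 'SITE_INDEX': [12]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_cleaned.csv'
    toy.to_csv(path)  # index is written as the first column

    default_read = pd.read_csv(path)               # index appears as 'Unnamed: 0'
    indexed_read = pd.read_csv(path, index_col=0)  # first column restored as index

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading with `index_col=0`, or writing with `index=False`, are two common ways to avoid carrying the extra column into later notebooks.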