# ESIF (ERDF)

In [None]:
import json
import re

import pandas as pd

## Load and Clean

In [None]:
raw_esif = pd.read_excel('input/ESIF_2014-2020__List_of_beneficiaries_July_2018_Final.xlsx', skiprows=9)
raw_esif.shape

In [None]:
raw_esif.head()

In [None]:
esif = raw_esif.rename(columns={
    'Bénéficiaire': 'beneficiary',
    'Nom du projet': 'project',
    'Fonds': 'funds',
    'Unnamed: 3': 'priority_axis',
    'Résumé du projet': 'summary',
    'Date de commencement': 'start_date',
    'Date de fin': 'end_date',
    'Investissement FEDER/FSE £m': 'eu_investment',
    'Coût total du projet £m': 'project_cost',
    '% du projet cofinancé par l’UE': 'prop_eu_financed',
    'Localisation (code postal)': 'raw_postcode',
    'Zone de partenariat économique local': 'economic_zone',
    'Pays': 'country',
    'Type et axe du soutien (catégorie d’intervention)': 'category'
}).copy()
esif.head()

### Country

It is all England, so we can drop the column.

In [None]:
esif.country.unique()

In [None]:
esif.drop('country', axis=1, inplace=True)

### Beneficiary

In [None]:
esif.beneficiary[esif.beneficiary.str.strip() != esif.beneficiary]

In [None]:
esif.beneficiary = esif.beneficiary.str.strip()

### Project

In the absence of any IDs, do we have any duplicates? Sometimes the same project gets both ERDF and ESF funding. There do appear to be a couple of duplicates.

TODO: Probably worth writing up [Northern Powerhouse Investment Fund](https://www.npif.co.uk/) --- it's a big one with no description.

In [None]:
esif.project[esif.project.str.contains('\n')]

In [None]:
esif.project = esif.project.str.replace('\n', ' ')

In [None]:
esif.project[esif.project.str.strip() != esif.project] # lots
esif.project = esif.project.str.strip()

In [None]:
esif.project.unique().shape

In [None]:
esif[esif.duplicated(['beneficiary', 'project', 'funds'], keep=False)].sort_values('project')

In [None]:
duplicate_project = esif.duplicated([
    'beneficiary', 'project', 'funds', 'eu_investment', 'project_cost'
])
esif[duplicate_project].sort_values('project')

In [None]:
esif = esif[~duplicate_project].copy()
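
The two `duplicated` calls above behave differently, and a minimal sketch on made-up rows may help show why. The beneficiary and project names below are invented for illustration. With `keep=False` every member of a duplicate group is flagged, which suits inspection; the default `keep='first'` flags only the later repeats, so negating it keeps one copy of each.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    'beneficiary': ['A Ltd', 'A Ltd', 'B Council'],
    'project': ['Growth Hub', 'Growth Hub', 'Skills Fund'],
    'funds': ['ERDF', 'ERDF', 'ESF'],
})
keys = ['beneficiary', 'project', 'funds']

# keep=False marks every row in a duplicate group (useful for eyeballing).
flag_all = toy.duplicated(keys, keep=False)

# The default keep='first' marks only the repeats, so the first row survives.
flag_repeats = toy.duplicated(keys)

deduplicated = toy[~flag_repeats]
```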

### Summary

In [None]:
esif.summary.isna().sum()

In [None]:
(esif.summary != esif.summary.str.strip()).sum() # lots
esif.summary = esif.summary.str.strip()

### Funds

In [None]:
esif.funds.isna().sum()

In [None]:
esif.funds.unique()

In [None]:
esif.funds = esif.funds.str.strip().str.replace('ESF.+', 'ESF', regex=True)
esif.funds.unique()
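
As a rough illustration of the normalisation above, here is a sketch on made-up fund labels (the real values in the spreadsheet may differ). Since pandas 2.0, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so the flag is spelled out here.

```python
import pandas as pd

# Made-up labels; the actual spreadsheet values are not reproduced here.
funds = pd.Series(['ERDF', ' ESF ', 'ESF (including YEI)'])

# Strip whitespace, then collapse anything that starts with 'ESF' plus extra text.
cleaned = funds.str.strip().str.replace('ESF.+', 'ESF', regex=True)
```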

### Priority Axis

The ESF and ERDF priority axes are different.

- [ESF](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/461596/ESF_Operational_Programme_2014_-_2020_V.01.pdf) (p. 6) --- three of them
- [ERDF](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/706955/ESIF_Online_Publication_2018_FINAL_150518.pdf) (p. 3) --- nine of them

We could clean these up, if they're useful.

In [None]:
esif.priority_axis.isna().sum()

In [None]:
esif.priority_axis.unique()

### Project Cost

Unfortunately, there are some junk values, but they look salvageable.

In [None]:
esif.project_cost.isna().sum()

In [None]:
esif.project_cost = esif.project_cost.map(str).str.strip()
project_cost_bad = esif.project_cost.str.match(re.compile(r'.*[^0-9.].*'))
esif.project_cost[project_cost_bad]

In [None]:
# Drop a trailing '.00', then remove every remaining non-digit character.
project_cost_fixed = esif.project_cost[project_cost_bad].\
    str.replace(r'\.00$', '', regex=True).str.replace('[^0-9]', '', regex=True)
project_cost_fixed

In [None]:
esif.loc[project_cost_bad, 'project_cost'] = project_cost_fixed
esif.project_cost = esif.project_cost.astype('float64')
esif.project_cost.describe()
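
The salvage steps above can be sketched end to end on a few made-up amounts. The values are invented, and the sketch assumes, as the cells above appear to, that junk values are whole pounds with an optional trailing `.00`.

```python
import pandas as pd

# Made-up raw values: one clean, one with currency formatting, one padded.
raw = pd.Series(['1250000', '£2,500,000.00', ' 750000.5 '])

as_text = raw.map(str).str.strip()

# Anything containing a character other than a digit or '.' needs fixing.
bad = as_text.str.contains(r'[^0-9.]', regex=True)

fixed = (as_text[bad]
         .str.replace(r'\.00$', '', regex=True)   # drop trailing '.00'
         .str.replace('[^0-9]', '', regex=True))  # drop symbols and separators

as_text.loc[bad] = fixed
values = as_text.astype('float64')
```

One caveat with this approach: a junk value with pence other than `.00`, such as `'1,234.50'`, would lose its decimal point and become `123450`, so it is worth inspecting the flagged values before converting.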

### EU Investment

The same problem and the same fix apply to the EU investment column.

In [None]:
esif.eu_investment.isna().sum()

In [None]:
esif.eu_investment = esif.eu_investment.map(str).str.strip()
eu_investment_bad = esif.eu_investment.str.match(re.compile(r'.*[^0-9.].*'))
esif.eu_investment[eu_investment_bad]

In [None]:
# Drop a trailing '.00', then remove every remaining non-digit character.
eu_investment_fixed = esif.eu_investment[eu_investment_bad].\
    str.replace(r'\.00$', '', regex=True).str.replace('[^0-9]', '', regex=True)
eu_investment_fixed

In [None]:
esif.loc[eu_investment_bad, 'eu_investment'] = eu_investment_fixed
esif.eu_investment = esif.eu_investment.astype('float64')
esif.eu_investment.describe()

### Overfunding

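This is generally pretty good, but one project is apparently overfunded: its EU investment exceeds its total cost. Based on http://www.worksbetter.co.uk/funding, it looks like the total cost is the figure that is wrong. So, increase the total cost to the value implied by the EU investment and the stated EU proportion.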

In [None]:
overfunded = esif.eu_investment > esif.project_cost
esif[overfunded]

In [None]:
esif.loc[overfunded, 'project_cost'] = esif.eu_investment[overfunded] / esif.prop_eu_financed[overfunded]
esif.loc[esif.eu_investment > esif.project_cost].shape
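
To make the arithmetic of that correction explicit, here is a small sketch with made-up figures. If the EU investment and the stated EU proportion are trusted, the implied total cost is the investment divided by the proportion.

```python
import pandas as pd

# Made-up figures: the first row claims more EU money than its total cost.
toy = pd.DataFrame({
    'eu_investment': [600000.0, 400000.0],
    'project_cost': [500000.0, 800000.0],
    'prop_eu_financed': [0.5, 0.5],
})

overfunded = toy.eu_investment > toy.project_cost

# Implied total cost = EU investment / proportion financed by the EU.
toy.loc[overfunded, 'project_cost'] = (
    toy.eu_investment[overfunded] / toy.prop_eu_financed[overfunded]
)
```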

### Prop EU Financed

This provides a useful check. The [ESF guidance for 2014-2020](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/710305/ESF_Guidance_for_2014_2020_v2.pdf) says that contributions over 50% are unlikely in the UK.

#### North of Tyne Community Led Local delivery

Looks like an extra 0 in the project cost. [North of Tyne Community Led Local Development](https://www.newcastle.gov.uk/business/business-support-and-advice/north-tyne-community-led-local-development):

> The Strategy asks for £2.5m of European funds, split between £0.9m ERDF and £1.6m ESF. About 15% of the funds (+ the match-funding provided by the Accountable Body and partners) will be used to support project development, promotion, management and administration of the CLLD programme. In total, when combining European funds and match-funding, the programme amounts to nearly £4.7m.


#### Solent Community Grants Programme

It looks like they have about £2.5m, but it may be split between several programmes. [Solent LEP - European Social Fund Calls for Proposals](https://solentlep.org.uk/what-we-do/news/european-social-fund-calls-for-proposals/):

> £1,000,000 of ESF funding is available to develop and deliver a Solent Apprenticeship Hub. ...
> £640,000 of ESF funding is being made available under the 'Solent Jobs Programme' ...
> The Solent Community Grants Programme makes grants available to grass-roots and community-led organisations. A further £880,000 of ESF funding is being made available to continue activity which address exclusion, by engaging local people in improving their own lives and that of their local communities.

#### AEGIS in Communities

Can't find anything about this one. Going to reduce EU investment by half.

In [None]:
esif.prop_eu_financed.isna().sum()

In [None]:
esif.prop_eu_financed.describe()

In [None]:
esif['actual_prop'] = esif.eu_investment / esif.project_cost
esif.actual_prop.describe()

In [None]:
esif[(esif.actual_prop - esif.prop_eu_financed).abs() > 0.05]

In [None]:
esif.loc[(esif.index == 678) & (esif.project_cost == 32000000.0), 'project_cost'] = 32000000.0 / 10
esif.loc[(esif.index == 774) & (esif.eu_investment == 5000000.0), 'eu_investment'] = 5000000.0 / 2
esif.loc[(esif.index == 874) & (esif.eu_investment == 1900000.0), 'eu_investment'] = 1900000.0 / 2
esif['actual_prop'] = esif.eu_investment / esif.project_cost

In [None]:
esif[(esif.actual_prop - esif.prop_eu_financed).abs() > 0.05]

In [None]:
esif.drop('actual_prop', axis=1, inplace=True)
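
The consistency check used in this section could be wrapped in a small helper so that it can be rerun after each manual fix. This is only a sketch: `inconsistent_rows` is a hypothetical name, the 0.05 tolerance simply mirrors the cells above, and the figures are made up.

```python
import pandas as pd

def inconsistent_rows(data, tolerance=0.05):
    """Return rows whose stated EU share differs from investment / cost."""
    actual = data.eu_investment / data.project_cost
    return data[(actual - data.prop_eu_financed).abs() > tolerance]

# Made-up figures: the second row has a project cost that looks ten times too big.
toy = pd.DataFrame({
    'eu_investment': [500000.0, 1600000.0],
    'project_cost': [1000000.0, 32000000.0],
    'prop_eu_financed': [0.5, 0.5],
})

suspect = inconsistent_rows(toy)
```

Unlike the cells above, the helper never adds a temporary `actual_prop` column, so there is nothing to drop afterwards.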

### Postcode

- Only one missing; drop it.
- Mostly good. A few retired. A few typos (e.g. NN11D should be NN1 1DF).

In [None]:
[esif.shape, esif.raw_postcode.isna().sum()]

In [None]:
esif[esif.raw_postcode.isna()]

In [None]:
esif = esif[~esif.raw_postcode.isna()].copy()

In [None]:
ukpostcodes = pd.read_csv('../postcodes/input/ukpostcodes.csv.gz')
ukpostcodes.shape

In [None]:
esif.raw_postcode.isin(ukpostcodes.postcode).sum()

In [None]:
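# Upper-case, remove everything but letters and digits, then re-insert
# the space before the final digit-letter-letter group.
esif['postcode'] = esif.raw_postcode.\
    str.upper().\
    str.strip().\
    str.replace(r'[^A-Z0-9]', '', regex=True).\
    str.replace(r'^(\S+)([0-9][A-Z]{2})$', r'\1 \2', regex=True)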

In [None]:
esif.postcode.isin(ukpostcodes.postcode).sum()

In [None]:
esif.postcode[~esif.postcode.isin(ukpostcodes.postcode)].unique()

In [None]:
esif[~esif.postcode.isin(ukpostcodes.postcode)]

In [None]:
esif = esif[esif.postcode.isin(ukpostcodes.postcode)].copy()
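
For reference, here is how the normalisation above behaves on a few illustrative strings (chosen to show the formats, not taken from the data). Values that do not end in a digit followed by two letters, such as the `NN11D` typo, pass through without a space and so fail the lookup against `ukpostcodes`.

```python
import pandas as pd

# Illustrative inputs only; not rows from the spreadsheet.
raw = pd.Series(['nn1 1df', 'SW1A2AA ', 'M1-1AE', 'NN11D'])

normalised = (raw
              .str.upper()
              .str.strip()
              # Keep only letters and digits.
              .str.replace(r'[^A-Z0-9]', '', regex=True)
              # Re-insert the space before the final digit-letter-letter group.
              .str.replace(r'^(\S+)([0-9][A-Z]{2})$', r'\1 \2', regex=True))
```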

### Start and End Dates

Generally good. There is just one anomaly, a row whose start date is not before its end date. The [Marches Growth Hub appears to have got started in October 2015](https://www.marchesgrowthhub.co.uk/assets/marchesgrowthhubreviewevaluationreport.pdf), but for now I will just drop it.

In [None]:
[esif.start_date.isna().sum(), esif.start_date.dtype]

In [None]:
[esif.end_date.isna().sum(), esif.end_date.dtype]

In [None]:
esif[esif.start_date >= esif.end_date]

In [None]:
esif = esif[esif.start_date < esif.end_date].copy()
esif.shape

In [None]:
esif.start_date.describe()

In [None]:
esif.end_date.describe()

### Category

Needs some cleaning up, but we could get most of these out by number from [this table](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32014R0215).

In [None]:
esif.category.isna().sum()

In [None]:
esif.category.unique()
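
If the categories do turn out to be useful, one possible starting point is to pull out a leading numeric code and look it up in the table linked above. The sketch below assumes, without having checked every value, that each category string begins with a code of up to three digits; the labels are placeholders.

```python
import pandas as pd

# Placeholder strings; the real category text is messier than this.
category = pd.Series(['001 Example label', ' 067 - Another label', 'No code here'])

# Capture up to three leading digits; rows without a code become NaN.
code = category.str.extract(r'^\s*(\d{1,3})\b', expand=False)
```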

## Save Data

In [None]:
clean_esif = esif.drop([
    'priority_axis', 'prop_eu_financed', 'raw_postcode', 'economic_zone', 'category'
], axis=1)
clean_esif.head()

In [None]:
clean_esif['my_eu_id'] = clean_esif.funds.str.lower() + '_england_' + clean_esif.index.map(str)
clean_esif.my_eu_id.head()

In [None]:
clean_esif.to_pickle('output/esif_england_2014_2020.pkl.gz')

## Save Map Data

In [None]:
clean_esif_locations = pd.merge(clean_esif, ukpostcodes, validate='m:1')
clean_esif_locations.head()
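
The `validate='m:1'` argument in the merge above asks pandas to check that each postcode appears at most once in the lookup table. A minimal sketch of that behaviour, with made-up coordinates and an explicit `on='postcode'` for clarity:

```python
import pandas as pd

projects = pd.DataFrame({
    'postcode': ['NN1 1DF', 'NN1 1DF'],
    'project': ['A', 'B'],
})
# Made-up coordinates, for illustration only.
lookup = pd.DataFrame({
    'postcode': ['NN1 1DF'],
    'latitude': [52.24],
    'longitude': [-0.9],
})

# Many projects may share one postcode row: this passes validation.
located = pd.merge(projects, lookup, on='postcode', validate='m:1')

# A lookup with repeated postcodes is rejected instead of silently
# multiplying the project rows.
repeated_lookup = pd.concat([lookup, lookup])
try:
    pd.merge(projects, repeated_lookup, on='postcode', validate='m:1')
    rejected = False
except pd.errors.MergeError:
    rejected = True
```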

In [None]:
def make_esif_data_geo_json(data):
    def make_feature(row):
        properties = {
            property: row[property]
            for property in ['beneficiary', 'project', 'project_cost', 'eu_investment']
        }
        return {
            'type': 'Feature',
            'geometry': {
                "type": "Point",
                "coordinates": [row['longitude'], row['latitude']]
            },
            'properties': properties
        }
    features = list(data.apply(make_feature, axis=1))
    return { 'type': 'FeatureCollection', 'features': features }

with open('output/beneficiaries.geo.json', 'w') as file:
    json.dump(make_esif_data_geo_json(clean_esif_locations), file, sort_keys=True)
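
As a quick sanity check on the output shape, the sketch below builds the same kind of structure from a single made-up row and round-trips it through `json`. The helper name `to_feature_collection` and the coordinates are hypothetical. Note that GeoJSON positions are ordered `[longitude, latitude]`, which matches the function above.

```python
import json

import pandas as pd

def to_feature_collection(data, property_names):
    """Build a GeoJSON FeatureCollection of points from latitude/longitude columns."""
    features = [
        {
            'type': 'Feature',
            # GeoJSON positions are [longitude, latitude], in that order.
            'geometry': {
                'type': 'Point',
                'coordinates': [row['longitude'], row['latitude']],
            },
            'properties': {name: row[name] for name in property_names},
        }
        for row in data.to_dict('records')
    ]
    return {'type': 'FeatureCollection', 'features': features}

# One made-up row, for illustration only.
toy = pd.DataFrame({
    'beneficiary': ['Example Council'],
    'project': ['Example Project'],
    'latitude': [52.24],
    'longitude': [-0.9],
})

collection = to_feature_collection(toy, ['beneficiary', 'project'])
round_tripped = json.loads(json.dumps(collection, sort_keys=True))
```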