# Get a list of organisations to check for data

In [1]:
import pandas as pd

Load the three source spreadsheets

In [22]:
df1 = pd.read_excel('orgs-to-check/All apps with charity and company reg nos_20.11.18.xlsx')
df2 = pd.read_excel('orgs-to-check/Copy of Power to Change grants data 2017_FINAL 13.09.18.xlsx')
df3 = pd.read_excel('orgs-to-check/Power to Change grants data to December 2016.xlsx')

Extract registration numbers from the first file

In [75]:
df1_orgs = df1[['Company/Organisation Registration', 'Charity Number']].dropna(how='all')
df1_orgs.columns = ['Recipient Org:Company Number', 'Recipient Org:Charity Number']

Basic company number cleaning: left-pad with zeros to eight characters

In [76]:
df1_orgs.loc[:, 'Recipient Org:Company Number'] = df1_orgs['Recipient Org:Company Number'].str.pad(8, fillchar='0')
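
As a rough illustration of the padding step, the sketch below applies the same `str.pad` call to a few made-up company numbers. The values are invented for the example and are not taken from the spreadsheets.

```python
import pandas as pd

# Invented company numbers stored as strings; None stands in for a blank cell
company_numbers = pd.Series(["1234567", "01234567", "123456", None])

# str.pad pads on the left by default, so short numbers gain leading zeros
padded = company_numbers.str.pad(8, fillchar="0")

print(padded.tolist())
```

Values that are already eight characters long are left unchanged, and missing values stay missing.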

Clean charity and company numbers (blank out any charity numbers that contain letters and any company numbers that contain spaces)

In [77]:
df1_orgs.loc[
    df1_orgs['Recipient Org:Charity Number'].str.contains('[A-Za-z]').fillna(False),
    "Recipient Org:Charity Number"
] = None
df1_orgs.loc[
    df1_orgs['Recipient Org:Company Number'].str.contains(' ').fillna(False),
    "Recipient Org:Company Number"
] = None
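
The same masking pattern can be tried on a small invented table. This is only a sketch: the rows are made up, and it uses `na=False` inside `str.contains` as a shorthand for the `.fillna(False)` call above.

```python
import pandas as pd

# Invented rows, using the column names from the notebook
sample = pd.DataFrame({
    "Recipient Org:Charity Number": ["1234567", "SC012345", "n/a", None],
    "Recipient Org:Company Number": ["01234567", "0123 456", None, "07654321"],
})

# na=False treats missing values as "no match", so the masks are plain booleans
has_letters = sample["Recipient Org:Charity Number"].str.contains("[A-Za-z]", na=False)
has_space = sample["Recipient Org:Company Number"].str.contains(" ", na=False)

# Blank out the values that fail each check
sample.loc[has_letters, "Recipient Org:Charity Number"] = None
sample.loc[has_space, "Recipient Org:Company Number"] = None

print(sample)
```

Note that any charity number with a letter prefix (such as the invented `SC012345` here) is blanked by this rule as well.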

The other two files are in 360Giving format, so extract the registration number fields and combine them into one DataFrame.

In [99]:
orgs = pd.concat([
    df2[['Recipient Org:Charity Number', 'Recipient Org:Company Number']], 
    df3[['Recipient Org:Charity Number', 'Recipient Org:Company Number']]
])

The charity number field is auto-detected as numeric, so convert the non-null values to strings with no decimal places.

In [100]:
orgs.loc[
    orgs['Recipient Org:Charity Number'].notnull(), 
    "Recipient Org:Charity Number"
] = orgs.loc[orgs['Recipient Org:Charity Number'].notnull(), 'Recipient Org:Charity Number'].apply("{:.0f}".format)
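
To see why the format string matters, here is a minimal sketch on invented numbers. It assumes the column was read as floats (which is what pandas does when a numeric column has gaps), and it uses `Series.map` rather than the `.loc` assignment above, purely to keep the example short.

```python
import pandas as pd

# Invented charity numbers; the missing value makes pandas store the column as float64
charity_numbers = pd.Series([1234567.0, None, 207544.0])

# Plain str() keeps the trailing ".0", which is why a format string is used instead
naive = str(charity_numbers[0])

# Format only the values that are present and leave the gaps as None
as_text = charity_numbers.map(lambda v: "{:.0f}".format(v) if pd.notna(v) else None)

print(naive)
print(as_text.tolist())
```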

Merge the registration numbers from the three files, and drop any rows where neither field is present

In [101]:
orgs = pd.concat([
    df1_orgs[['Recipient Org:Charity Number', 'Recipient Org:Company Number']],
    orgs
]).dropna(how='all')
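
A small sketch of what `dropna(how='all')` keeps and discards, using two invented frames. The `ignore_index=True` argument is an addition for this example only (the cell above keeps the original row labels); it simply gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd

cols = ["Recipient Org:Charity Number", "Recipient Org:Company Number"]

# Invented rows: one with both numbers, one with neither
first = pd.DataFrame([["1234567", "01234567"], [None, None]], columns=cols)
# Invented rows: one with a company number only, one with a charity number only
second = pd.DataFrame([[None, "07654321"], ["7654321", None]], columns=cols)

# how="all" removes a row only when every column in it is missing
combined = pd.concat([first, second], ignore_index=True).dropna(how="all")

print(combined)
```

Rows with just one of the two numbers survive; only the row with neither is removed.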

Create an identifier in the org-id style. Prioritise the charity number over the company number (this is the opposite of the usual recommendation, but it matters here because more data is available on charities).

In [102]:
orgs.loc[:, 'Recipient Org:Identifier'] = None
orgs.loc[
    orgs['Recipient Org:Charity Number'].notnull(),
    "Recipient Org:Identifier"
] = 'GB-CHC-' + orgs.loc[orgs['Recipient Org:Charity Number'].notnull(), 'Recipient Org:Charity Number']
orgs.loc[
    orgs['Recipient Org:Identifier'].isnull(),
    "Recipient Org:Identifier"
] = 'GB-COH-' + orgs.loc[orgs['Recipient Org:Identifier'].isnull(), 'Recipient Org:Company Number']
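
The priority rule can be sketched more compactly with `fillna`. This is an alternative formulation on invented rows, not the approach used in the cell above: build both candidate identifiers, then fall back to the company one only where the charity one is missing.

```python
import pandas as pd

# Invented rows covering the three cases: both numbers, company only, charity only
sample = pd.DataFrame({
    "Recipient Org:Charity Number": ["1234567", None, "7654321"],
    "Recipient Org:Company Number": ["01234567", "07654321", None],
})

# Adding a prefix to a missing value leaves it missing
charity_id = "GB-CHC-" + sample["Recipient Org:Charity Number"]
company_id = "GB-COH-" + sample["Recipient Org:Company Number"]

# Prefer the charity identifier; use the company identifier only as a fallback
sample["Recipient Org:Identifier"] = charity_id.fillna(company_id)

print(sample["Recipient Org:Identifier"].tolist())
```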

Get a sorted list of the unique identifiers

In [110]:
org_ids = orgs['Recipient Org:Identifier'].drop_duplicates().sort_values()

Show the different types of IDs we have

In [115]:
org_ids.apply(lambda x: "-".join(x.split("-")[0:2])).value_counts()

GB-COH    565
GB-CHC    452
Name: Recipient Org:Identifier, dtype: int64
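
The same count can also be produced with vectorised string methods instead of `apply`. The sketch below is one possible alternative on invented identifiers: it splits off everything after the last hyphen and counts what remains.

```python
import pandas as pd

# Invented identifiers in the same org-id style
ids = pd.Series(["GB-CHC-1234567", "GB-COH-01234567", "GB-COH-07654321"])

# Split once from the right and keep the part before the last hyphen
prefixes = ids.str.rsplit("-", n=1).str[0]
counts = prefixes.value_counts()

print(counts)
```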

Save the resulting IDs to a CSV file

In [109]:
org_ids.to_csv('orgs-to-check/ptc.csv', index=False)