In [1]:
# !pip install xlrd  # recent pandas versions read .xlsx files with openpyxl instead

In [2]:
import pandas as pd
import numpy as np

Here we're going to use the `read_excel` function of Pandas. Passing `sheet_name=None` tells it to load every sheet in the workbook.

In [3]:
tkm_data = pd.read_excel('./KA_TKM_2019-08-10.xlsx',sheet_name=None)

If you check the type of the variable, you find something different from the usual dataframe, but it's not completely foreign to us. Look at the last part of the name:

In [4]:
type(tkm_data)

collections.OrderedDict

`read_excel` returned a kind of dictionary. We know what dictionaries have: keys and values. Let's look at the keys:

In [5]:
tkm_data.keys()

odict_keys(['WORKSPACE_PLAN', 'Key Activity Master', 'KA-Geography', 'KA-Tags', 'KA-Marker', 'KA-CBF-BREAKUP', 'KA-CBF-BY-GEOGRAPHY', 'KA-CBF-TOTAL'])

Since these keys are attached to values, I wonder what `type()` we see if we use the `.get(key)` method:

In [6]:
type(tkm_data.get('KA-Tags'))

pandas.core.frame.DataFrame

It appears that this new object is a _dictionary_ of _dataframes_! Since all of the values are dataframes, we can call the usual dataframe methods on them:

In [7]:
tkm_data.get('KA-Tags').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1601 entries, 0 to 1600
Data columns (total 8 columns):
ID               1601 non-null int64
Key Activity     1601 non-null object
TAG_ID           1601 non-null int64
Tag Dimension    1601 non-null object
TAG_ITEM_ID      1601 non-null int64
Tag Item         1594 non-null object
SDG              337 non-null object
SDG_GID          337 non-null object
dtypes: int64(3), object(5)
memory usage: 100.1+ KB
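Because the object is an ordinary dictionary, it can also be looped over to get a quick overview of every sheet at once. The sketch below uses a small hand-built dictionary as a stand-in for `tkm_data`; the `sheets` name, the `Region` column, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by
# pd.read_excel(..., sheet_name=None); the data is invented.
sheets = {
    "KA-Tags": pd.DataFrame({"ID": [7897, 7897], "Tag Item": ["UNFPA", None]}),
    "KA-Geography": pd.DataFrame({"ID": [7897], "Region": ["Example"]}),
}

# Build one summary row per sheet: its name, row count, and column count.
summary = pd.DataFrame(
    [
        {"sheet": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in sheets.items()
    ]
)
print(summary)

# .get() returns None for a missing key, whereas sheets["..."] raises KeyError.
missing = sheets.get("No Such Sheet")
print(missing is None)
```

The same loop should work unchanged on `tkm_data`, since it only relies on `.items()` and on each value being a dataframe.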


For ease of typing, let's just create a new variable and point it at the dataframe in the dictionary:

In [8]:
ka_tags = tkm_data.get('KA-Tags')

Let's start looking at what the dataframe contains:

In [9]:
ka_tags.head()

Unnamed: 0,ID,Key Activity,TAG_ID,Tag Dimension,TAG_ITEM_ID,Tag Item,SDG,SDG_GID
0,7897,1.1.1.1 - Seminar on automatization of collect...,3,Agency,753,UNFPA,,
1,7897,1.1.1.1 - Seminar on automatization of collect...,18,OECD DAC Sector,318,Statistical capacity building,,
2,7897,1.1.1.1 - Seminar on automatization of collect...,16,SDG Targets,220,"Target 17.18 By 2020, enhance capacity-buildin...",SDG 17,SDG_17
3,7897,1.1.1.1 - Seminar on automatization of collect...,5,Implementing Partners,4256,State Statistics Committee,,
4,7897,1.1.1.1 - Seminar on automatization of collect...,4,Source of funds,1414,Core funds,,


The 'Key Activity' field has some codes paired with descriptions. I wonder if those pairs are always consistent. First, let's look at how many unique values are in that field. We're going to use `.unique()` to return a numpy array of the distinct values and then look at its size.

In [10]:
ka_tags['Key Activity'].unique().size

248

Just to introduce another way to think about numpy arrays: they can themselves be passed as arguments to a function. `np.count_nonzero` counts the elements that evaluate as true, which for strings means every non-empty one, so here it agrees with the size:

In [11]:
np.count_nonzero(ka_tags['Key Activity'].unique())

248
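The two counts agree here only because none of the unique values is an empty string. A small sketch with made-up values shows where `.unique().size`, `np.count_nonzero`, and `.nunique()` can diverge; the series names and contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: one repeated label and one empty string.
values = pd.Series(["a", "", "a", "b"], dtype=object)

distinct = values.unique().size               # every distinct value, including ""
truthy = np.count_nonzero(values.unique())    # "" is falsy, so it is not counted

# Made-up values with a missing entry.
with_missing = pd.Series(["a", "b", None], dtype=object)

including_missing = with_missing.unique().size  # the missing value is a distinct entry
excluding_missing = with_missing.nunique()      # missing values are dropped by default

print(distinct, truthy, including_missing, excluding_missing)
```

For a plain "how many distinct labels" question, `.nunique()` is usually the more direct choice.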

Now let's investigate the codes and the descriptions separately from each other. First, let's look at a single example:

In [12]:
ka_tags.loc[225,'Key Activity']

'2.2.1.3 - Support in advancement of legislative and regulatory framework for provision of social services for most vulnerable families and children with disabilities, including introduction of social work.'

Looking at this, it seems that the code at the front is separated from the description at the end with a space-dash-space composite delimiter. I wonder if we could _split_ those two:

In [13]:
ka_tags['Key Activity'].str.split(' - ').head(10)

0    [1.1.1.1, Seminar on automatization of collect...
1    [1.1.1.1, Seminar on automatization of collect...
2    [1.1.1.1, Seminar on automatization of collect...
3    [1.1.1.1, Seminar on automatization of collect...
4    [1.1.1.1, Seminar on automatization of collect...
5    [1.1.1.1, Seminar on automatization of collect...
6    [1.1.1.2, Preparatory work for the 2019 MICS, ...
7    [1.1.1.2, Preparatory work for the 2019 MICS, ...
8    [1.1.1.2, Preparatory work for the 2019 MICS, ...
9    [1.1.1.2, Preparatory work for the 2019 MICS, ...
Name: Key Activity, dtype: object

Splitting on a multi-character delimiter works without an error, and each row is now a list. Let's expand those lists into columns to check the result:

In [14]:
ka_tags['Key Activity'].str.split(' - ').apply(pd.Series).head()

Unnamed: 0,0,1,2
0,1.1.1.1,Seminar on automatization of collecting and pr...,
1,1.1.1.1,Seminar on automatization of collecting and pr...,
2,1.1.1.1,Seminar on automatization of collecting and pr...,
3,1.1.1.1,Seminar on automatization of collecting and pr...,
4,1.1.1.1,Seminar on automatization of collecting and pr...,


That third column is a little weird. It probably means that there is an additional space-hyphen-space in the text on a few of these. No problem! Let's just split on the first occurrence, by passing an additional argument to the `.split()` method:

In [15]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series).head()

Unnamed: 0,0,1
0,1.1.1.1,Seminar on automatization of collecting and pr...
1,1.1.1.1,Seminar on automatization of collecting and pr...
2,1.1.1.1,Seminar on automatization of collecting and pr...
3,1.1.1.1,Seminar on automatization of collecting and pr...
4,1.1.1.1,Seminar on automatization of collecting and pr...
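As an alternative to `.apply(pd.Series)`, `.str.split()` accepts `expand=True` and returns a dataframe directly. The sketch below uses two made-up labels in the same "code - description" shape; the `labels` and `parts` names and the column names are chosen for illustration.

```python
import pandas as pd

# Made-up labels shaped like the 'Key Activity' column; the second one
# contains the delimiter a second time inside the description.
labels = pd.Series([
    "1.1.1.1 - Seminar on data collection",
    "2.3.1.8 - Training - follow-up session",
])

# n=1 splits on the first delimiter only; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists.
parts = labels.str.split(" - ", n=1, expand=True)
parts.columns = ["code", "description"]
print(parts)
```

Naming the columns up front also avoids having to remember that `0` means code and `1` means description in later cells.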


This looks better. Codes in column 0, descriptions in column 1 (hopefully). Now let's count the unique codes:

In [16]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series)[0].unique().size

174

That's a different number from above. I wonder how many unique descriptions there are:

In [17]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series)[1].unique().size

238

Maybe a single code is meant to cover several descriptions. In case it isn't, though, let's `.groupby()` the two columns and count the rows for each code and description pair, ordered by code:

In [18]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series).groupby([0,1]).size().reset_index(name='Count')

Unnamed: 0,0,1,Count
0,1.1.1.1,MICS data collection,10
1,1.1.1.1,Seminar on automatization of collecting and pr...,6
2,1.1.1.10,Situation analysis on the State of the Childre...,6
3,1.1.1.11,Organisation of pilot survey for estimation of...,8
4,1.1.1.12,STEPS Survey,8
5,1.1.1.13,Baseline Assessment of ICPD related SDG indica...,5
6,1.1.1.14,Analysis of planning and monitoring documents ...,10
7,1.1.1.16,Activities releated to raising awareness of ge...,9
8,1.1.1.17,Strengthening institutional and technical capa...,6
9,1.1.1.2,"Preparatory work for the 2019 MICS, including ...",7
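Rather than scanning the grouped table by eye, the codes that carry more than one description can be listed directly. This is a minimal sketch on a hand-built table; the `pairs` dataframe, its column names, and the shortened seminar description are assumptions for illustration.

```python
import pandas as pd

# Hand-built code/description pairs; code 1.1.1.1 appears with two
# different descriptions, mirroring the first two rows of the table above.
pairs = pd.DataFrame({
    "code": ["1.1.1.1", "1.1.1.1", "1.1.1.1", "1.1.1.12"],
    "description": [
        "MICS data collection",
        "Seminar on data collection",
        "MICS data collection",
        "STEPS Survey",
    ],
})

# Count the distinct descriptions per code, then keep codes with more than one.
descriptions_per_code = pairs.groupby("code")["description"].nunique()
conflicting_codes = descriptions_per_code[descriptions_per_code > 1]
print(conflicting_codes)
```

Swapping the roles of the two columns in the `groupby` gives the reverse check: descriptions that appear under more than one code.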


We can already see a code paired with more than one description: 1.1.1.1 appears with both 'MICS data collection' and a seminar description. What if we reverse the order, and sort by description?

In [26]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series).groupby([1,0]).size().reset_index(name='Count')

Unnamed: 0,1,0,Count
0,Activities related to the anti-tobacco law,2.3.1.31,4
1,Activities releated to raising awareness of ge...,1.1.1.16,9
2,Adaptation mainstreamed in agricultural and wa...,3.2.2.3,12
3,Adaptation of curricular materials and tools o...,3.2.4.2,6
4,"Adaptation of the ""Climate box"" concept",3.2.4.3,7
5,"Adaptation of the ""Climate box"" concept",3.2.4.7,6
6,Adaptation of the guidelines and instruments f...,2.3.1.15,8
7,Adherence support to M/XDR-TB patients on trea...,2.3.2.3,7
8,Analysis of planning and monitoring documents ...,1.1.1.14,10
9,Annual Green Light Committee operations fee,2.3.1.8,5


Ordering the data this way exposes two kinds of inconsistency: the same description paired with different codes (the "Climate box" rows above), and descriptions that differ only by a few words, such as these two:

In [28]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series).groupby([1,0]).size().reset_index(name='Count').iloc[222,0]

'To support of TB Grant Implementation Unit  and local staff of TB facilities on implementing, monitoring and evaluation.'

In [29]:
ka_tags['Key Activity'].str.split(' - ', n=1).apply(pd.Series).groupby([1,0]).size().reset_index(name='Count').iloc[223,0]

'To support of TB Grant Implementation Unit and NTP staff on implementing, monitoring and evaluation'
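Near-duplicates like these two can be scored rather than spotted by eye. The sketch below is one possible approach, not something the notebook itself does: it normalizes whitespace, case, and a trailing period, then compares the strings with `difflib.SequenceMatcher` from the standard library. The `normalize` helper is a hypothetical name introduced here.

```python
import difflib
import re

# The two descriptions shown in the outputs above.
a = ('To support of TB Grant Implementation Unit  and local staff of TB '
     'facilities on implementing, monitoring and evaluation.')
b = ('To support of TB Grant Implementation Unit and NTP staff on '
     'implementing, monitoring and evaluation')


def normalize(text):
    """Collapse runs of whitespace, drop a trailing period, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().rstrip(".").lower()


# ratio() returns a similarity score between 0.0 and 1.0.
similarity = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
print(round(similarity, 2))
```

A score close to 1 suggests the two descriptions are variants of the same activity; where to draw the line is a judgment call that depends on the data.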