# Code logic of Rule 8896

We don't want you to simply fill in the gaps and swap variable names when you write these rules. It is important to us that you also learn the concepts behind the rules as you go.

This file demonstrates the code logic of Rule 8896, a type 3 rule. Type 3 rules define scenarios that should not occur within a group. A value might be valid on its own, but the same value can trigger the error depending on what else is present in the group.

In this case, the rule says that "Within one CINdetails group, there must not be more than one Assessments group that has no AssessmentAuthorisationDate (N00160) recorded". 

So if an AssessmentAuthorisationDate is missing (equal to pd.NA), this rule won't flag it just because of that. What this rule will do is check if there is another AssessmentAuthorisationDate that is missing in the group. If there is, then all the positions where AssessmentAuthorisationDate is missing will be flagged.

In [1]:
import pandas as pd

The CIN validator tool contains a number of background checks and guidelines on how things should be done. As helpful as this is, it might not allow you to experiment much or run your code on the fly. So sometimes I get the data out into a clean file and write my Python logic against it.

Create some data and take note of the positions that you expect to be flagged.

Notice that the column names now have to be written as strings, as in ordinary Python code.

In [13]:
sample_assessments = pd.DataFrame(
    [   #child1
        {   # fail
            "LAchildID": "child1",
            "CINdetailsID": "cinID1",
            "AssessmentAuthorisationDate": pd.NA, # 0 first nan date in group
        },
        {   # fail
            "LAchildID": "child1",
            "CINdetailsID": "cinID1",
            "AssessmentAuthorisationDate": pd.NA, # 1 second nan date in group
        },
        {   # won't be flagged because there is not more than one nan authorisation date in this group.
            "LAchildID": "child1",
            "CINdetailsID": "cinID2",
            "AssessmentAuthorisationDate": pd.NA, # 2
        }, 
        # child2 
        {
            "LAchildID": "child2",
            "CINdetailsID": "cinID1", 
            "AssessmentAuthorisationDate": "26/05/2021", # 3 ignored. not nan
        },
        {   # fail
            "LAchildID": "child2",
            "CINdetailsID": "cinID2",
            "AssessmentAuthorisationDate": pd.NA, # 4 first nan date in group
        },  
        {   # fail
            "LAchildID": "child2",
            "CINdetailsID": "cinID2",
            "AssessmentAuthorisationDate": pd.NA, # 5 second nan date in group
        },
    ]
)

# if rule requires columns containing date values, convert those columns to datetime objects first. Do it here in the test_validate function, not above.
sample_assessments["AssessmentAuthorisationDate"] = pd.to_datetime(
    sample_assessments["AssessmentAuthorisationDate"], format="%d/%m/%Y", errors="coerce"
)


# See what your data looks like
print(sample_assessments)


  LAchildID CINdetailsID AssessmentAuthorisationDate
0    child1       cinID1                         NaT
1    child1       cinID1                         NaT
2    child1       cinID2                         NaT
3    child2       cinID1                  2021-05-26
4    child2       cinID2                         NaT
5    child2       cinID2                         NaT
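
One side effect of `errors="coerce"` is worth keeping in mind: any value that does not match the format is also turned into NaT, so it becomes indistinguishable from a date that was never recorded. The sketch below only illustrates this pandas behaviour on made-up values (`raw_dates` is a hypothetical series, not validator data); whether badly formatted dates should count as missing for this rule is not stated here.

```python
import pandas as pd

# Hypothetical input: one valid date, one malformed string, one missing value.
raw_dates = pd.Series(["26/05/2021", "not a date", None])

# errors="coerce" converts anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

print(parsed)
print(parsed.isna().tolist())  # [False, True, True]
```

Both the malformed string and the genuinely missing value end up as NaT, so an `isna()` filter will pick up both.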


A good first question to ask yourself is "How do I get all the conditions that could fail into one place?".
In this case, that means collecting the locations where AssessmentAuthorisationDate is missing. Later on, we can check how many of them exist per group.

In [8]:
df = sample_assessments
df.index.name = "ROW_ID"
df.reset_index(inplace=True)
df2 = df[df["AssessmentAuthorisationDate"].isna()]

# See what the data looks like now
print(df2)

   ROW_ID LAchildID CINdetailsID AssessmentAuthorisationDate
0       0    child1       cinID1                         NaT
1       1    child1       cinID1                         NaT
2       2    child1       cinID2                         NaT
4       4    child2       cinID2                         NaT
5       5    child2       cinID2                         NaT


Generally, we would group by the columns that define our group and count the number of items in each group. However, the count method ignores NaNs, and in this case NaNs are exactly what we want to count. Since we have already filtered down to only the NaN values, we can replace them with a value that can be counted. I chose to use the integer 1.

In [16]:
# Take an explicit copy of the filtered rows so that the assignment below does not raise a SettingWithCopyWarning.
df2 = df2.copy()

# Every date in df2 is missing at this point, so overwrite the whole column with a countable placeholder.
df2["AssessmentAuthorisationDate"] = 1
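
The difference between counting values and counting rows can be seen in a small sketch. It uses made-up data (the `dates` frame is hypothetical) and also shows `size()`, which counts rows regardless of missing values and is one possible alternative to using a placeholder.

```python
import pandas as pd

# Hypothetical data: cinID1 has two missing dates, cinID2 has one recorded date.
dates = pd.DataFrame(
    {
        "CINdetailsID": ["cinID1", "cinID1", "cinID2"],
        "AssessmentAuthorisationDate": pd.to_datetime([None, None, "2021-05-26"]),
    }
)

grouped = dates.groupby("CINdetailsID")["AssessmentAuthorisationDate"]

count_result = grouped.count()  # counts non-missing values only
size_result = grouped.size()    # counts rows, missing or not

print(count_result.to_dict())  # {'cinID1': 0, 'cinID2': 1}
print(size_result.to_dict())   # {'cinID1': 2, 'cinID2': 1}
```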

Now we can count.
We need to group by both LAchildID and CINdetailsID because, even though we are concerned with CINdetails groups, CINdetails groups are subgroups of each child. For example, two different children can each contain a CINdetails group whose ID is "abc". If we do not group by child first, our code will treat the groups from these separate children as if they were the same group, because LAchildID has not been included to distinguish them.

In [12]:
# count how many occurrences of missing "AssessmentAuthorisationDate" there are per CINdetails group in each child.
group_result = df2.groupby(["LAchildID", "CINdetailsID"])["AssessmentAuthorisationDate"].count()

print(group_result)

LAchildID  CINdetailsID
child1     cinID1          2
           cinID2          1
child2     cinID2          2
Name: AssessmentAuthorisationDate, dtype: int64
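
To see why LAchildID has to be part of the group key, here is a minimal sketch of the "abc" scenario described above. The `missing_rows` frame is hypothetical: it stands for two rows with a missing date, one per child, that happen to share a CINdetailsID.

```python
import pandas as pd

# Hypothetical rows with a missing date: two different children, same CINdetailsID.
missing_rows = pd.DataFrame(
    {
        "LAchildID": ["child1", "child2"],
        "CINdetailsID": ["abc", "abc"],
    }
)

# Grouping by CINdetailsID alone merges the two children into one group of 2.
by_cin_only = missing_rows.groupby("CINdetailsID").size()

# Grouping by child and CINdetailsID keeps them apart: two groups of 1.
by_child_and_cin = missing_rows.groupby(["LAchildID", "CINdetailsID"]).size()

print(by_cin_only)
print(by_child_and_cin)
```

With the first grouping, both rows would wrongly look like a group with more than one missing date; with the second, neither group exceeds one.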


The line below does the same thing as the line above, but adds reset_index. The columns that we grouped by become the index of the result, and reset_index pushes them back into column form. Notice also that the count result is now stored under the name of the column we put in the square brackets of the groupby statement.

In [17]:
df2 = df2.groupby(["LAchildID", "CINdetailsID"])["AssessmentAuthorisationDate"].count().reset_index()

# See what the data looks like now
print(df2)

  LAchildID CINdetailsID  AssessmentAuthorisationDate
0    child1       cinID1                            2
1    child1       cinID2                            1
2    child2       cinID2                            2


We get the count result by selecting the column we put in the square brackets of the groupby statement. The failing groups are only those in which AssessmentAuthorisationDate is missing more than once.

In [18]:
# keep only the groups where "AssessmentAuthorisationDate" is missing more than once in a CINdetails group.
df2 = df2[df2["AssessmentAuthorisationDate"]>1]

# See what the data looks like now
print(df2)

  LAchildID CINdetailsID  AssessmentAuthorisationDate
0    child1       cinID1                            2
2    child2       cinID2                            2


In type 3 rules, a group is our unit of testing, so the column combination that defines the group also defines the error IDs. That is, in the end, the locations that fail within a group should be linked to each other.

We start by generating the IDs of all the failing positions which we have identified. 

In [19]:
issue_ids = tuple(
    zip(df2["LAchildID"], df2["CINdetailsID"],)
)
df["ERROR_ID"] = tuple(
    zip(df["LAchildID"], df["CINdetailsID"],)
)

# See what the data looks like now, including ERROR_ID that has been created.
print(df)

   ROW_ID LAchildID CINdetailsID AssessmentAuthorisationDate          ERROR_ID
0       0    child1       cinID1                         NaT  (child1, cinID1)
1       1    child1       cinID1                         NaT  (child1, cinID1)
2       2    child1       cinID2                         NaT  (child1, cinID2)
3       3    child2       cinID1                  2021-05-26  (child2, cinID1)
4       4    child2       cinID2                         NaT  (child2, cinID2)
5       5    child2       cinID2                         NaT  (child2, cinID2)


In the cell above, we also generated an ERROR_ID column on the initial dataset using that same column combination. We now select the rows whose ERROR_ID appears among the IDs of the failing groups.

In [20]:
df_issues = df[df.ERROR_ID.isin(issue_ids)]

# See what the data looks like now
print(df_issues)

   ROW_ID LAchildID CINdetailsID AssessmentAuthorisationDate          ERROR_ID
0       0    child1       cinID1                         NaT  (child1, cinID1)
1       1    child1       cinID1                         NaT  (child1, cinID1)
4       4    child2       cinID2                         NaT  (child2, cinID2)
5       5    child2       cinID2                         NaT  (child2, cinID2)
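
Note that the `isin` selection returns every row of a failing group, including any row in that group whose date is recorded. The sample data has no such row, so the difference does not show up here. As a hedged alternative, the sketch below flags only the rows whose date is missing, using `groupby(...).transform`. The `rows` frame and the variable names are made up for illustration, and this is not necessarily how the validator does it.

```python
import pandas as pd

# Hypothetical data: (child1, cinID1) has two missing dates and one recorded date.
rows = pd.DataFrame(
    {
        "LAchildID": ["child1", "child1", "child1", "child1"],
        "CINdetailsID": ["cinID1", "cinID1", "cinID1", "cinID2"],
        "AssessmentAuthorisationDate": pd.to_datetime(
            [None, None, "2021-05-26", None]
        ),
    }
)

missing = rows["AssessmentAuthorisationDate"].isna()

# Number of missing dates in each row's group, broadcast back to row level.
missing_per_group = (
    missing.astype(int)
    .groupby([rows["LAchildID"], rows["CINdetailsID"]])
    .transform("sum")
)

# Flag a row only if its own date is missing and its group has more than one missing date.
flagged = rows[missing & (missing_per_group > 1)]
print(flagged.index.tolist())  # [0, 1]
```

Row 2 belongs to the failing group but has a date, so it is left out; a selection by group ID alone would include it.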


In real life, this df_issues will contain all the other columns that came with the table. That data is heavy to move around so we only select the columns which we need.

We would like to know all the rows that failed for the same reason. That is, let's group together all the ROW_IDs that have the same ERROR_ID.

In [21]:
group_result = df_issues.groupby("ERROR_ID")["ROW_ID"].apply(list)

# See what the data looks like now
print(group_result)

ERROR_ID
(child1, cinID1)    [0, 1]
(child2, cinID2)    [4, 5]
Name: ROW_ID, dtype: object


The line below does the same thing as the code above, but adds reset_index so that ERROR_ID goes from being the index back to being a regular column.

In [22]:
df_issues = df_issues.groupby("ERROR_ID")["ROW_ID"].apply(list).reset_index()

# See what the data looks like now
print(df_issues)

           ERROR_ID  ROW_ID
0  (child1, cinID1)  [0, 1]
1  (child2, cinID2)  [4, 5]


Now, we can push this to the issue location accumulator that prepares the data which will be sent to the frontend.
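
The steps of this walkthrough can be condensed into a single function. The sketch below is one way to do that, assuming the same column names as above. The function name `find_failing_groups` is hypothetical and not part of the CIN validator tool, and it uses `size()` to count rows instead of the placeholder-and-count approach.

```python
import pandas as pd


def find_failing_groups(assessments: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return one row per failing group with its ROW_IDs."""
    df = assessments.copy()
    df.index.name = "ROW_ID"
    df = df.reset_index()

    group_cols = ["LAchildID", "CINdetailsID"]

    # Step 1: gather the rows where the date is missing.
    missing = df[df["AssessmentAuthorisationDate"].isna()]

    # Step 2: count missing rows per CINdetails group within each child.
    counts = missing.groupby(group_cols).size().reset_index(name="missing_count")

    # Step 3: keep only the groups with more than one missing date.
    failing = counts[counts["missing_count"] > 1]

    # Step 4: build group IDs and select the rows that belong to failing groups.
    issue_ids = list(zip(failing["LAchildID"], failing["CINdetailsID"]))
    df["ERROR_ID"] = list(zip(df["LAchildID"], df["CINdetailsID"]))
    df_issues = df[df["ERROR_ID"].isin(issue_ids)]

    # Step 5: link together the rows that failed for the same reason.
    return df_issues.groupby("ERROR_ID")["ROW_ID"].apply(list).reset_index()


# Usage with the same sample data as the walkthrough.
sample = pd.DataFrame(
    {
        "LAchildID": ["child1", "child1", "child1", "child2", "child2", "child2"],
        "CINdetailsID": ["cinID1", "cinID1", "cinID2", "cinID1", "cinID2", "cinID2"],
        "AssessmentAuthorisationDate": pd.to_datetime(
            [None, None, None, "26/05/2021", None, None],
            format="%d/%m/%Y",
            errors="coerce",
        ),
    }
)

result = find_failing_groups(sample)
print(result)
```

On the sample data this should reproduce the two failing groups found above, with rows 0 and 1 linked under (child1, cinID1) and rows 4 and 5 under (child2, cinID2).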