# Common audit analyses and tests: an approach in Python


The use of Computer Assisted Auditing Techniques (CAATs) is gaining popularity with audit departments as well as the clients they serve. Although some may argue that the cost and time involved in acquiring these techniques is not justifiable, it still remains true that these provide added value to customers because they present a complete picture of an organization's system and/or transactions. As a result, a data analysis project must prove the benefits of CAATs to both the audit team and clients (Internal or External).

This paper reproduces the work of Marcos F Silva, published at <https://sites.google.com/site/marcosfs2006/>, where he uses the R language to demonstrate the application of the most common audit techniques.

In my case, I will use the Python language, which, like R, is open source.

Python is a high-level, interpreted scripting language that is imperative, object-oriented, functional, and dynamically and strongly typed. It was released by Guido van Rossum in 1991. It currently has an open, community-based development model managed by the non-profit Python Software Foundation.

Source: https://en.wikipedia.org/wiki/python

# Basic data analysis techniques

Below I will show how to implement in Python some of the most common computer-assisted audit techniques. These are basic techniques that do not rely on sophisticated statistical methods, yet they are extremely useful in practice.
This work is not intended to be a Python tutorial; for that, there is vast material available on the internet. I will explain the functions used only where I deem it necessary for a better understanding of what is presented.

## Let's import all the libraries we will use in this work.

In [1]:
import pandas as pd
import numpy as np

## 1. Calculation of new fields

Creating new fields from fields that already exist in the data set is a very simple task. Let's illustrate this procedure using what is perhaps the most widely used Python package, *pandas*. pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. We will use the "Invoices.csv" data set to illustrate the creation of new fields.

In [2]:
# Importing the dataset
invoices = pd.read_csv('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Invoices.csv', sep =";")
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
0,09/07/2003,20000,10220,8,8,920,41,37720
1,21/08/2003,20001,10491,4,48,1400,30,42000
2,27/08/2003,20002,10704,3,43,1500,25,37500
3,28/05/2003,20003,10430,5,54,2400,22,52800
4,06/12/2003,20004,10841,17,11,1500,21,31500


In [3]:
invoices.dtypes # Shows the type of each column of the data set

Date           object
InvoiceNo       int64
CustomerNo      int64
SalesPerson     int64
ProductNo       int64
UnitPrice      object
Quantity        int64
Amount         object
dtype: object

Here we see that the 'UnitPrice' and 'Amount' columns have the pandas object type, which means pandas is reading these fields as strings (text). It would therefore not be possible to perform the mathematical operations that are part of this demonstration, so we will convert these two columns.

In [4]:
invoices['UnitPrice'] = invoices['UnitPrice'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
invoices['Amount'] = invoices['Amount'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
invoices.dtypes

Date            object
InvoiceNo        int64
CustomerNo       int64
SalesPerson      int64
ProductNo        int64
UnitPrice      float64
Quantity         int64
Amount         float64
dtype: object

As we can see, the two fields have been converted.
Let's create a new field in the invoices data set called VlrAudit, which will be the product of the Quantity and UnitPrice fields.

In [5]:
invoices['VlrAudit'] = invoices['Quantity'] * invoices['UnitPrice']
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,VlrAudit
0,09/07/2003,20000,10220,8,8,920.0,41,37720.0,37720.0
1,21/08/2003,20001,10491,4,48,1400.0,30,42000.0,42000.0
2,27/08/2003,20002,10704,3,43,1500.0,25,37500.0,37500.0
3,28/05/2003,20003,10430,5,54,2400.0,22,52800.0,52800.0
4,06/12/2003,20004,10841,17,11,1500.0,21,31500.0,31500.0


Now let's create another field, which we'll call 'Dif', holding the difference between the Amount and VlrAudit fields.

In [6]:
invoices['Dif'] = invoices['Amount'] - invoices['VlrAudit']
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,VlrAudit,Dif
0,09/07/2003,20000,10220,8,8,920.0,41,37720.0,37720.0,0.0
1,21/08/2003,20001,10491,4,48,1400.0,30,42000.0,42000.0,0.0
2,27/08/2003,20002,10704,3,43,1500.0,25,37500.0,37500.0,0.0
3,28/05/2003,20003,10430,5,54,2400.0,22,52800.0,52800.0,0.0
4,06/12/2003,20004,10841,17,11,1500.0,21,31500.0,31500.0,0.0


Now let's create a new field called 'VlrAcum' containing the accumulated value of the Amount field.

In [7]:
invoices['VlrAcum'] = invoices['Amount'].cumsum()
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,VlrAudit,Dif,VlrAcum
0,09/07/2003,20000,10220,8,8,920.0,41,37720.0,37720.0,0.0,37720.0
1,21/08/2003,20001,10491,4,48,1400.0,30,42000.0,42000.0,0.0,79720.0
2,27/08/2003,20002,10704,3,43,1500.0,25,37500.0,37500.0,0.0,117220.0
3,28/05/2003,20003,10430,5,54,2400.0,22,52800.0,52800.0,0.0,170020.0
4,06/12/2003,20004,10841,17,11,1500.0,21,31500.0,31500.0,0.0,201520.0


If we want to delete the columns that were added and return to the original data set, we can do the following:

In [8]:
# axis=1 indicates that whole columns (rather than rows) are to be dropped.
invoices = invoices.drop(['VlrAudit', 'Dif', 'VlrAcum'], axis=1)

In [9]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
0,09/07/2003,20000,10220,8,8,920.0,41,37720.0
1,21/08/2003,20001,10491,4,48,1400.0,30,42000.0
2,27/08/2003,20002,10704,3,43,1500.0,25,37500.0
3,28/05/2003,20003,10430,5,54,2400.0,22,52800.0
4,06/12/2003,20004,10841,17,11,1500.0,21,31500.0
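
In practice, the recalculated field is usually turned into a test: any record whose recorded amount differs from quantity times unit price deserves a closer look. The sketch below shows the idea on a small hypothetical data set. The sample values, the `tolerance` threshold, and the `exceptions` name are assumptions made for illustration and are not taken from Invoices.csv.

```python
import pandas as pd

# Hypothetical sample; the third record has a deliberately wrong Amount.
sample = pd.DataFrame({
    'InvoiceNo': [1, 2, 3],
    'UnitPrice': ['$9.20', '$14.00', '$15.00'],
    'Quantity': [41, 30, 25],
    'Amount': ['$377.20', '$420.00', '$380.00'],
})

# Strip the currency symbol and convert the text columns to numbers.
for col in ['UnitPrice', 'Amount']:
    sample[col] = pd.to_numeric(sample[col].str.replace('$', '', regex=False))

# Recalculate the amount and keep only the records that differ
# from the recorded value by more than a small tolerance.
tolerance = 0.01  # assumed threshold to absorb floating-point rounding
sample['VlrAudit'] = sample['Quantity'] * sample['UnitPrice']
sample['Dif'] = sample['Amount'] - sample['VlrAudit']
exceptions = sample[sample['Dif'].abs() > tolerance]
print(exceptions)
```

Comparing against a tolerance rather than testing for exact equality avoids flagging records that differ only by floating-point rounding.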


## 2. Adding new records

Sometimes we need to add new records (new rows) to a data set. This situation arises, for example, when we have monthly files containing a company's billing data, or files containing the transactions posted to a given ledger account, and we want to combine these files to analyze the entire financial year. This example will be illustrated with the Trans_Abril.xls and Trans_Maio.xls data sets available in the repository.


In [10]:
trans_abril = pd.read_excel('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Trans_Abril.xls')
trans_abril.head()

Unnamed: 0,Númcartão,Valor,Data_Trans,Códigos,Númclien,Descrição
0,8590120032047834,270.63,2003-04-02,1731,1000,Contratos de eletricidade
1,8590120092563655,899.76,2003-04-02,1731,2000,Contratos de eletricidade
2,8590120233319873,730.46,2003-04-04,1750,250402,Contratos de carpintaria
3,8590120534914664,106.01,2003-04-08,1750,3000,Contratos de carpintaria
4,8590120674263418,309.37,2003-04-08,2741,1000,Publicações e impressões diversas


In [11]:
# Trans_maio has two tabs, so we have to specify in the sheet_name parameter the name of the tab to be imported.
trans_maio1 = pd.read_excel('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Trans_Maio.xls',sheet_name ='Trans1_Maio') 
trans_maio1.head()

Unnamed: 0,Númcartão,Códigos,Data_Trans,Númclien,Descrição,Valor
0,8590 1252 7244 7003,4131,2003-05-27,925007,"Linhas de ônibus, incluindo charters, ônibus d...",$108.01
1,8590128346463420,4214,2003-05-28,51593,Serviços de entrega - Local,$71.57
2,8590128263176714,4784,2003-05-29,503458,Tarifas telefônicas e pedágios,$5.83
3,8590128006917664,5992,2003-05-30,925007,Floristas,$152.97
4,8590 1294 0066 5510,4131,2003-03-31,51593,"Linhas de ônibus, incluindo charters, ônibus d...",$390.33


In [12]:
trans_Maio2 = pd.read_excel('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Trans_Maio.xls',sheet_name ='Trans2_Maio')
trans_Maio2 = trans_Maio2.drop(columns='Unnamed: 5') # The spreadsheet contains a column without information
trans_Maio2.head()


Unnamed: 0,Númcartão,Códigos,Data_Trans,Númclien,Descrição,Valor
0,8590-1224-9766-3807,2741,2003-05-04,962353,Publicações e impressões diversas,$510.43
1,8590122281964011,5021,2003-05-01,812465,Móveis de escritório e comerciais,$178.96
2,8590120784984566,3066,2003-05-02,51593,Southwest,$270.25
3,8590-1242-5362-1744,7922,2003-05-03,250402,"Produtores teatrais (exceto filmes), agências ...",$182.62
4,8590125999743363,3007,2003-05-05,778088,Air France,$1031.70


In [13]:
trans_abril.shape # shape shows how many rows and columns the data set has; in this case, 281 rows and 6 columns.

(281, 6)

In [14]:
trans_maio1.shape

(86, 6)

In [15]:
trans_Maio2.shape

(114, 6)

Examining the column names in the three data sets, we see that the April data set has its columns in a different order from the two May data sets. This is not a problem for the pandas concat() function, which aligns columns by name. To make the result easier to follow, we will first combine the two May files and then consolidate April and May.

In [16]:
trans_maio_total = pd.concat([trans_maio1,trans_Maio2])
trans_maio_total.shape

(200, 6)

In [17]:
trans_maio_total.head()

Unnamed: 0,Númcartão,Códigos,Data_Trans,Númclien,Descrição,Valor
0,8590 1252 7244 7003,4131,2003-05-27,925007,"Linhas de ônibus, incluindo charters, ônibus d...",$108.01
1,8590128346463420,4214,2003-05-28,51593,Serviços de entrega - Local,$71.57
2,8590128263176714,4784,2003-05-29,503458,Tarifas telefônicas e pedágios,$5.83
3,8590128006917664,5992,2003-05-30,925007,Floristas,$152.97
4,8590 1294 0066 5510,4131,2003-03-31,51593,"Linhas de ônibus, incluindo charters, ônibus d...",$390.33


In [18]:
trans_maio_total = trans_maio_total.reset_index(drop=True)

In [19]:
consolidado = pd.concat([trans_maio_total,trans_abril],ignore_index=True, sort=False)
consolidado.head()

Unnamed: 0,Númcartão,Códigos,Data_Trans,Númclien,Descrição,Valor
0,8590 1252 7244 7003,4131,2003-05-27,925007,"Linhas de ônibus, incluindo charters, ônibus d...",$108.01
1,8590128346463420,4214,2003-05-28,51593,Serviços de entrega - Local,$71.57
2,8590128263176714,4784,2003-05-29,503458,Tarifas telefônicas e pedágios,$5.83
3,8590128006917664,5992,2003-05-30,925007,Floristas,$152.97
4,8590 1294 0066 5510,4131,2003-03-31,51593,"Linhas de ônibus, incluindo charters, ônibus d...",$390.33


In [20]:
consolidado.shape

(481, 6)

In [21]:
len(trans_abril) + len(trans_maio1) + len(trans_Maio2)

481

As we can see, the sum of the number of rows in the three tables matches the consolidated total.
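
The two checks that matter when appending files are that columns are matched by name and that nothing is lost along the way. The following is a minimal sketch with hypothetical data (the `april` and `may` frames and their values are invented for illustration) that reconciles both the record count and a control total.

```python
import pandas as pd

# Hypothetical monthly files with the same columns in a different order.
april = pd.DataFrame({'Card': ['A1', 'A2'], 'Valor': [10.0, 20.0], 'Code': [1731, 1750]})
may = pd.DataFrame({'Card': ['B1'], 'Code': [4131], 'Valor': [30.0]})

# concat matches columns by name; ignore_index renumbers the rows from 0.
combined = pd.concat([may, april], ignore_index=True)

# Reconcile the record count and a control total between the parts and the whole.
count_ok = len(combined) == len(april) + len(may)
total_ok = combined['Valor'].sum() == april['Valor'].sum() + may['Valor'].sum()
print(combined)
print(count_ok, total_ok)
```

Checking a control total in addition to the row count helps catch cases where values end up in the wrong column even though the number of records is right.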

## 3. Adding new fields

To perform this task, Python has the *NumPy* library. To illustrate the procedure, we will first create the data to be used: three vectors will be created and then combined to form a single data set.

In [22]:
column1 = np.repeat(np.array(['Marcos','Maria','Carla']),[6]) # 18 elements
column1

array(['Marcos', 'Marcos', 'Marcos', 'Marcos', 'Marcos', 'Marcos',
       'Maria', 'Maria', 'Maria', 'Maria', 'Maria', 'Maria', 'Carla',
       'Carla', 'Carla', 'Carla', 'Carla', 'Carla'], dtype='<U6')

In [23]:
column2 = np.random.randn(18) # 18 random numbers from a normal distribution
column2

array([-0.26045664,  1.12305944,  0.12348862, -0.74011846,  0.41566135,
        1.18349993, -0.08698198, -0.74100071, -0.12017011, -0.8835765 ,
        0.93735188,  0.28634122, -0.5917236 , -0.48932849, -1.15150779,
       -0.63376311,  1.26790071, -0.08709002])

In [24]:
column3 = np.random.choice(["Jan", "Fev", "Mar", "Abr", "Mai", "Jun", "Jul", "Ago", "Set", "Out", "Nov", "Dez"], replace= True, 
                           size= 18)
column3                           

array(['Ago', 'Set', 'Mai', 'Abr', 'Jun', 'Set', 'Fev', 'Set', 'Abr',
       'Jul', 'Mar', 'Jan', 'Dez', 'Set', 'Out', 'Out', 'Jan', 'Mai'],
      dtype='<U3')

In [25]:
# Putting everything together in a DataFrame
new_df = pd.DataFrame({'Name': column1,'Value':column2,'Month':column3})
new_df.head(10)

Unnamed: 0,Name,Value,Month
0,Marcos,-0.260457,Ago
1,Marcos,1.123059,Set
2,Marcos,0.123489,Mai
3,Marcos,-0.740118,Abr
4,Marcos,0.415661,Jun
5,Marcos,1.1835,Set
6,Maria,-0.086982,Fev
7,Maria,-0.741001,Set
8,Maria,-0.12017,Abr
9,Maria,-0.883576,Jul
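
Because the values above are random, rerunning the cells produces different numbers each time. When results need to be reproducible, for example in working papers, one option is a seeded NumPy generator. This is only a sketch: the seed value `42`, the shortened month list, and the `repro_df` name are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary assumption

names = np.repeat(['Marcos', 'Maria', 'Carla'], 6)          # 18 elements
values = rng.standard_normal(18)                            # 18 draws from N(0, 1)
months = rng.choice(['Jan', 'Fev', 'Mar', 'Abr'], size=18)  # sampled with replacement

repro_df = pd.DataFrame({'Name': names, 'Value': values, 'Month': months})
print(repro_df.head())
```

Running the block again with the same seed yields the same `Value` and `Month` columns.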


## 4. Cross tab

Cross tabulation usually consists of determining the frequency distribution of one, two or more categorical variables. pandas has the crosstab() and groupby() functions, which allow us to implement this technique.

In [26]:
# importing file
rh_data = pd.read_excel('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/RH.xlsx')
rh_data.head()

Unnamed: 0,Sexo,Estado Civil,Anos de estudo,Formação,Tempo de empresa,Unidade,Departamento,Cargo,Salário,Bônus
0,Masculino,Casado,14,Sócio-econômicas,19,Curitiba,Produção,Assistente,16.67,28.02
1,Masculino,Viúvo,19,Sócio-econômicas,31,São Paulo,Vendas,Assistente,29.13,41.24
2,Feminino,Casado,18,Sócio-econômicas,28,Rio de Janeiro,Financeiro,Assi,21.8,16.88
3,Feminino,Casado,16,Sócio-econômicas,20,Rio de Janeiro,Vendas,Assistente,22.61,13.5
4,Masculino,Solteiro,15,Sócio-econômicas,15,Curitiba,Vendas,Auxiliar,16.67,8.44


In [27]:
rh_data['Salário'] = pd.to_numeric(rh_data['Salário'],errors='coerce')

In [28]:
rh_data.dtypes

Sexo                 object
Estado Civil         object
Anos de estudo       object
Formação             object
Tempo de empresa     object
Unidade              object
Departamento         object
Cargo                object
Salário             float64
Bônus                object
dtype: object

In [29]:
rh_data.Sexo.value_counts()

Sexo
Masculino    2940
Feminino     2060
Name: count, dtype: int64

With the command above, we obtained the frequency distribution of a single variable: gender. If the joint distribution of the gender and marital status variables, also known as a contingency table, is desired, we can proceed as follows:

In [30]:
pd.crosstab(index= rh_data['Sexo'],columns= rh_data['Estado Civil'], margins= True,margins_name= 'Total')

Estado Civil,Casado,Divorciado,Solteiro,Viúvo,Total
Sexo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Feminino,1369,181,375,135,2060
Masculino,2080,238,424,198,2940
Total,3449,419,799,333,5000


If our interest were to total salaries by gender, we could proceed as follows:

In [31]:
print(round(rh_data.groupby('Sexo', as_index = False)['Salário'].sum(),2))

        Sexo   Salário
0   Feminino  17871.30
1  Masculino  46206.43
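
Two small variations are often handy here: expressing the contingency table as row percentages, and computing several summary measures per group at once. The sketch below uses a hypothetical five-row `hr` frame (not the RH.xlsx file) to show both.

```python
import pandas as pd

# Hypothetical HR records.
hr = pd.DataFrame({
    'Sexo': ['Feminino', 'Masculino', 'Masculino', 'Feminino', 'Masculino'],
    'Estado Civil': ['Casado', 'Casado', 'Solteiro', 'Solteiro', 'Casado'],
    'Salário': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Row percentages: share of each marital status within each gender.
shares = pd.crosstab(hr['Sexo'], hr['Estado Civil'], normalize='index')

# Several summary measures per gender in a single pass.
summary = hr.groupby('Sexo')['Salário'].agg(['count', 'sum', 'mean'])
print(shares.round(2))
print(summary)
```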


## 5. Duplicities

Identifying duplicate values in a data set is a task routinely performed by auditors to detect improper repetitions of values. For this, pandas has the **duplicated** method.

In [32]:
invoices = pd.read_csv('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Invoices.csv',sep =";")
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
0,09/07/2003,20000,10220,8,8,920,41,37720
1,21/08/2003,20001,10491,4,48,1400,30,42000
2,27/08/2003,20002,10704,3,43,1500,25,37500
3,28/05/2003,20003,10430,5,54,2400,22,52800
4,06/12/2003,20004,10841,17,11,1500,21,31500


In [33]:
# The first argument takes one or more columns to use as the basis of comparison when searching for duplicates.
# In our example I used the "InvoiceNo" column (invoice number).
invoices[invoices.duplicated('InvoiceNo', keep=False)]

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
10,14/01/2003,20010,10439,19,38,745,28,20860
11,14/01/2003,20010,10439,99,38,745,28,20860


As we can see, rows 10 and 11 contain a duplicated invoice number (20010); the two records differ only in the SalesPerson field.
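
The same method also accepts a list of columns, which is useful when a duplicate is defined by a combination of fields rather than by a single one. A minimal sketch on hypothetical data follows; the `inv` frame and the choice of key columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical invoices: the first two rows share number, customer and amount.
inv = pd.DataFrame({
    'InvoiceNo': [20010, 20010, 20011, 20012],
    'CustomerNo': [10439, 10439, 10500, 10439],
    'SalesPerson': [19, 99, 4, 19],
    'Amount': [208.60, 208.60, 420.00, 208.60],
})

# keep=False marks every member of a duplicated group, not just the repeats.
key = ['InvoiceNo', 'CustomerNo', 'Amount']
dups = inv[inv.duplicated(subset=key, keep=False)]

# Count how many times each duplicated key occurs.
dup_counts = dups.groupby(key).size()
print(dups)
print(dup_counts)
```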

## 6. Filters

When examining a data set, the auditor is often interested in selecting only a subset of it, i.e., the records that meet certain criteria that he or she defines.

Thus, the auditor may want to select for analysis only the records where the invoice amount exceeds a certain value, invoices issued in a given period, or records of employees in a particular department of a company. For this, we will use the HR data set again.

In [34]:

rh_data.head()

Unnamed: 0,Sexo,Estado Civil,Anos de estudo,Formação,Tempo de empresa,Unidade,Departamento,Cargo,Salário,Bônus
0,Masculino,Casado,14,Sócio-econômicas,19,Curitiba,Produção,Assistente,16.67,28.02
1,Masculino,Viúvo,19,Sócio-econômicas,31,São Paulo,Vendas,Assistente,29.13,41.24
2,Feminino,Casado,18,Sócio-econômicas,28,Rio de Janeiro,Financeiro,Assi,21.8,16.88
3,Feminino,Casado,16,Sócio-econômicas,20,Rio de Janeiro,Vendas,Assistente,22.61,13.5
4,Masculino,Solteiro,15,Sócio-econômicas,15,Curitiba,Vendas,Auxiliar,16.67,8.44


Assuming you want to select only the records related to female employees, you can do as follows:

In [35]:
rh_feminino = rh_data[rh_data['Sexo'] == 'Feminino' ]
rh_data['Salário'] = pd.to_numeric(rh_data['Salário'],errors='coerce')
rh_feminino.head()

Unnamed: 0,Sexo,Estado Civil,Anos de estudo,Formação,Tempo de empresa,Unidade,Departamento,Cargo,Salário,Bônus
2,Feminino,Casado,18,Sócio-econômicas,28,Rio de Janeiro,Financeiro,Assi,21.8,16.88
3,Feminino,Casado,16,Sócio-econômicas,20,Rio de Janeiro,Vendas,Assistente,22.61,13.5
9,Feminino,Casado,12,Humanas,16,Rio de Janeiro,Produção,Assistente,5.78,16.68
10,Feminino,Solteiro,13,Humanas,20,Florianópolis,Produção,Auxiliar,1.08,11.72
12,Feminino,Casado,13,Exatas,23,Rio de Janeiro,Pessoal,Assistente,5.19,16.48


Well, what if we want to filter the records of male employees from the Rio de Janeiro unit who work in the Sales Department and have a salary of $70.00 or more?

In [36]:
rh_masculino = rh_data[(rh_data.Sexo  == 'Masculino') 
                       & (rh_data.Unidade  == 'Rio de Janeiro') 
                       & (rh_data.Departamento == 'Vendas')
                       & (rh_data.Salário >= 70.00)]
rh_masculino

Unnamed: 0,Sexo,Estado Civil,Anos de estudo,Formação,Tempo de empresa,Unidade,Departamento,Cargo,Salário,Bônus
367,Masculino,Viúvo,22,Sócio-econômicas,50,Rio de Janeiro,Vendas,Vice-Presidente,77.47,49.48
387,Masculino,Casado,23,Sócio-econômicas,42,Rio de Janeiro,Vendas,Gerente,88.93,148.6
397,Masculino,Casado,19,Sócio-econômicas,45,Rio de Janeiro,Vendas,Vice-Presidente,88.93,110.26
408,Masculino,Casado,28,Sócio-econômicas,37,Rio de Janeiro,Vendas,Vice-Presidente,141.42,341.18
607,Masculino,Casado,22,Sócio-econômicas,36,Rio de Janeiro,Vendas,Gerente,92.97,127.32
760,Masculino,Casado,19,Sócio-econômicas,41,Rio de Janeiro,Vendas,Gerente,75.2,84.36
798,Masculino,Casado,21,Sócio-econômicas,39,Rio de Janeiro,Vendas,Gerente,81.65,103.02
1316,Masculino,Casado,23,Sócio-econômicas,39,Rio de Janeiro,Vendas,Vice-Presidente,79.31,192.14
1435,Masculino,Casado,22,Sócio-econômicas,37,Rio de Janeiro,Vendas,Vice-Presidente,99.77,176.88
1567,Masculino,Casado,22,Sócio-econômicas,35,Rio de Janeiro,Vendas,Gerente,75.2,90.74
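
Compound filters like this one can also be written with `query`, `isin` and `between`, which some readers find easier to scan. The sketch below uses a hypothetical four-row frame; the column name `Salario` is written without the accent here purely to keep the query string simple, which is an assumption of this example and not how the RH.xlsx column is named.

```python
import pandas as pd

# Hypothetical HR records.
hr = pd.DataFrame({
    'Sexo': ['Masculino', 'Masculino', 'Feminino', 'Masculino'],
    'Unidade': ['Rio de Janeiro', 'Curitiba', 'Rio de Janeiro', 'Rio de Janeiro'],
    'Departamento': ['Vendas', 'Vendas', 'Vendas', 'Produção'],
    'Salario': [88.93, 75.20, 92.97, 141.42],
})

# A compound filter written as a query string.
by_query = hr.query("Sexo == 'Masculino' and Unidade == 'Rio de Janeiro' and Salario >= 70")

# isin() covers "one of several values"; between() covers an inclusive range.
by_mask = hr[hr['Unidade'].isin(['Rio de Janeiro', 'Curitiba'])
             & hr['Salario'].between(70, 90)]
print(by_query)
print(by_mask)
```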


## 7. Sorting

Sorting a data set by the values of a particular field is a common task in data analysis. With this procedure, the auditor usually seeks to identify the highest amounts posted, the largest expenses, etc.

We will use the Invoices.csv data set to illustrate sorting by the value of the invoices issued, which makes it possible to identify the highest-value invoices.

In [37]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
0,09/07/2003,20000,10220,8,8,920,41,37720
1,21/08/2003,20001,10491,4,48,1400,30,42000
2,27/08/2003,20002,10704,3,43,1500,25,37500
3,28/05/2003,20003,10430,5,54,2400,22,52800
4,06/12/2003,20004,10841,17,11,1500,21,31500


Sorting this data set by invoice amount (Amount field) can be done in ascending or descending order, as illustrated below:

In [38]:
invoices.sort_values('Amount').head(10) # By default, sort_values sorts in ascending order.

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
1342,13/05/2003,21343,10224,10,34,250,4,1000
3617,16/09/2003,23618,10265,9,39,1000,10,10000
4823,14/10/2003,24824,10454,1,68,1250,8,10000
2638,21/09/2003,22639,10998,12,5,1250,8,10000
551,11/12/2003,20552,10042,12,5,1250,8,10000
4436,25/09/2003,24437,10526,12,5,1250,8,10000
4565,29/10/2003,24566,10586,14,5,1250,8,10000
975,27/01/2003,20976,10712,22,50,1000,10,10000
4244,10/08/2003,24245,10058,17,72,775,13,10075
363,28/10/2003,20364,10954,7,72,775,13,10075


If the intention is to sort in descending order, we must proceed as follows:

In [39]:
invoices.sort_values('Amount', ascending= False).head(10)

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
4298,16/09/2003,24299,10937,2,24,2325,43,99975
1239,30/12/2003,21240,10989,21,24,2325,43,99975
2170,13/02/2003,22171,10771,7,15,12379,8,99032
1624,19/12/2003,21625,10153,15,15,12379,8,99032
2431,14/05/2003,22432,10830,22,15,12379,8,99032
3737,10/03/2003,23738,10894,11,15,12379,8,99032
806,02/12/2003,20807,10408,13,56,3000,33,99000
3450,29/04/2003,23451,10640,13,59,5500,18,99000
244,08/09/2003,20245,10607,15,56,3000,33,99000
643,17/04/2003,20644,10802,3,19,3000,33,99000


**Sorting based on two columns**

If our interest were to sort the data set by the SalesPerson column (ascending order) and the Amount column (descending order), we could do as follows:

In [40]:
invoices.sort_values(['SalesPerson','Amount'],axis =0,
                     ascending= [True,False], inplace = True)
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
3988,27/10/2003,23989,10559,1,73,4930,20,98600
676,26/01/2003,20677,10866,1,32,1200,8,9600
782,04/04/2003,20783,10335,1,76,950,10,9500
4250,19/07/2003,24251,10580,1,70,1900,5,9500
2566,07/01/2003,22567,10371,1,61,3100,3,9300


As we can see, the SalesPerson column is in ascending order, while the Amount column is in descending order.
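
One caveat worth checking before relying on a sort: if the Amount column is still stored as text when `sort_values` is called (the data set was reloaded from the CSV in section 5, and the numeric conversion is only repeated in section 9), pandas compares the values character by character rather than numerically. The sketch below illustrates the difference on hypothetical amounts written with a decimal comma; the data and the `AmountNum` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical amounts stored as text with a decimal comma.
inv = pd.DataFrame({'InvoiceNo': [1, 2, 3, 4],
                    'Amount': ['10,00', '999,75', '5,00', '13438,50']})

# Sorting the text column compares the strings character by character.
text_order = inv.sort_values('Amount')['Amount'].tolist()

# Convert first, then sort or pick the largest values directly.
inv['AmountNum'] = pd.to_numeric(inv['Amount'].str.replace(',', '.', regex=False))
numeric_order = inv.sort_values('AmountNum')['AmountNum'].tolist()
top2 = inv.nlargest(2, 'AmountNum')
print(text_order)
print(numeric_order)
print(top2)
```

In the text ordering, '13438,50' sorts before '5,00', so the largest invoice would not appear at the end of an ascending sort. Checking `dtypes` before sorting is a cheap safeguard.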

## 8. *Gaps* 

Many documents are issued with sequential numbering, which allows greater control over them. Examples are the sequential numbering of invoices issued by a company, checks issued, commitment notes, contracts signed, etc.

A very useful feature, usually available in general-purpose audit software, is the ability to identify missing items in a data set, i.e., to check whether there are any gaps in the sequential numbering.

Let's illustrate how to do this using the Invoices.csv data set.

The approach used will be described below:

In [41]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
3988,27/10/2003,23989,10559,1,73,4930,20,98600
676,26/01/2003,20677,10866,1,32,1200,8,9600
782,04/04/2003,20783,10335,1,76,950,10,9500
4250,19/07/2003,24251,10580,1,70,1900,5,9500
2566,07/01/2003,22567,10371,1,61,3100,3,9300


The data set will be sorted by the sequential invoice number, and each number will have its predecessor subtracted from it. If there are no gaps in the sequential numbering, the result is a vector containing only 1s; wherever the result differs from 1, there is a gap (or, when the result is 0, a repeated number). Let's see how to do this in practice.

1. **Sorting the data set by invoice number**

In [42]:
faturas = invoices.sort_values('InvoiceNo')
faturas.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
0,09/07/2003,20000,10220,8,8,920,41,37720
1,21/08/2003,20001,10491,4,48,1400,30,42000
2,27/08/2003,20002,10704,3,43,1500,25,37500
3,28/05/2003,20003,10430,5,54,2400,22,52800
4,06/12/2003,20004,10841,17,11,1500,21,31500


2. **Calculation of differences**

In [43]:
faturas['differ'] = faturas.InvoiceNo.diff()
faturas.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,differ
0,09/07/2003,20000,10220,8,8,920,41,37720,
1,21/08/2003,20001,10491,4,48,1400,30,42000,1.0
2,27/08/2003,20002,10704,3,43,1500,25,37500,1.0
3,28/05/2003,20003,10430,5,54,2400,22,52800,1.0
4,06/12/2003,20004,10841,17,11,1500,21,31500,1.0


3. **Identifying records with gaps.**

In [44]:
# iloc[1:] drops the first row, whose difference is NaN because it has no predecessor.
differences = faturas[faturas['differ'] != 1.0].iloc[1:]
differences

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,differ
11,14/01/2003,20010,10439,99,38,745,28,20860,0.0
12,29/08/2003,20012,10919,11,31,2105,28,58940,2.0
23,17/01/2003,20024,10459,24,61,3100,42,130200,2.0


In [45]:
faturas.iloc[[10,11,12,22,23],:]

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,differ
10,14/01/2003,20010,10439,19,38,745,28,20860,1.0
11,14/01/2003,20010,10439,99,38,745,28,20860,0.0
12,29/08/2003,20012,10919,11,31,2105,28,58940,2.0
22,29/11/2003,20022,10355,15,12,4600,12,55200,1.0
23,17/01/2003,20024,10459,24,61,3100,42,130200,2.0


As we can see, invoice No. 20010 is duplicated, and invoices No. 20011 and 20023 are missing.
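
The difference-based approach shows where the sequence breaks; when a gap spans several numbers, it can also be useful to list every missing number explicitly. One possible sketch, using a hypothetical series of invoice numbers, compares the numbers found with the full expected range.

```python
import pandas as pd

# Hypothetical invoice numbers with one duplicate (20010) and two gaps.
numbers = pd.Series([20009, 20010, 20010, 20012, 20013, 20015])

# Every number that should exist between the first and the last one issued.
expected = set(range(int(numbers.min()), int(numbers.max()) + 1))

# Numbers in the expected range that never appear in the data.
missing = sorted(expected - set(numbers.tolist()))
print(missing)
```

This assumes the first and last numbers in the file are the true limits of the sequence; numbers missing before the first or after the last record cannot be detected this way.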

## 9. Stratification

Stratification consists of classifying the records of a data set into mutually exclusive strata. In auditing, this classification is usually based on monetary values. The procedure is to establish value ranges and indicate, for each record, the range to which it belongs.

To illustrate this procedure, the Invoices.csv data set will be used.

The first step in defining strata is to establish the value ranges for the variable used to stratify the records. For this, it is necessary to know the minimum and maximum values of the data set, which can be obtained easily with the min() and max() methods.

In [46]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount
3988,27/10/2003,23989,10559,1,73,4930,20,98600
676,26/01/2003,20677,10866,1,32,1200,8,9600
782,04/04/2003,20783,10335,1,76,950,10,9500
4250,19/07/2003,24251,10580,1,70,1900,5,9500
2566,07/01/2003,22567,10371,1,61,3100,3,9300


In [47]:
# Transforming "Amount" and "UnitPrice" fields to numeric.
invoices['UnitPrice'] = invoices['UnitPrice'].apply(lambda x: x.replace('$', '').replace(',', '.')).astype('float')
invoices['Amount'] = invoices['Amount'].apply(lambda x: x.replace('$', '').replace(',', '.')).astype('float')

In [48]:
print('The lowest total invoice value is: ', invoices.Amount.min())

The lowest total invoice value is:  5.0


In [49]:
print('The greatest total invoice value is: ', invoices.Amount.max())

The greatest total invoice value is:  13438.5


Considering that invoice values range from 5.00 to 13,438.50, and that we wish to establish 3 strata (low, medium and high values), we could define the following strata:

**Stratum 1 -** 5.00 to 1,000.00

**Stratum 2 -** 1,000.01 to 10,000.00

**Stratum 3 -** 10,000.01 to 13,438.50

Defining the number of strata (3 in this example) and the width of each range is a subjective choice of the auditor. Note that the value ranges may or may not have the same width.

To classify the records into the defined strata, we use the pandas cut() function, as shown below:

In [50]:
invoices['Stratas'] = pd.cut(invoices.Amount,[0.00, 1000.00, 10000.00, 13500.00],
                              labels = ['Low','Median','High'])

invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,Stratas
3988,27/10/2003,23989,10559,1,73,49.3,20,986.0,Low
676,26/01/2003,20677,10866,1,32,12.0,8,96.0,Low
782,04/04/2003,20783,10335,1,76,9.5,10,95.0,Low
4250,19/07/2003,24251,10580,1,70,19.0,5,95.0,Low
2566,07/01/2003,22567,10371,1,61,31.0,3,93.0,Low
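
Once each record has a stratum, a natural next step is to profile the strata: how many records fall into each one and how much value they represent. The sketch below does this on a handful of hypothetical amounts, using the same bin edges and labels as above; the `amounts` frame and the `Stratum` and `profile` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical invoice amounts, including values on and near a bin edge.
amounts = pd.DataFrame({'Amount': [5.0, 986.0, 1000.0, 1000.01, 9500.0, 13438.5]})

# Bins are right-inclusive by default: (0, 1000], (1000, 10000], (10000, 13500].
amounts['Stratum'] = pd.cut(amounts['Amount'], [0.0, 1000.0, 10000.0, 13500.0],
                            labels=['Low', 'Median', 'High'])

# Number of records and total value per stratum.
profile = amounts.groupby('Stratum', observed=False)['Amount'].agg(['count', 'sum'])
print(profile)
```

Note that a value of exactly 1000.00 lands in the first stratum because the upper edge of each bin is inclusive, which matches the ranges defined in the text.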


## 10. Summarization

It is relatively common for the auditor to want a summary of the data (minimum, maximum, mean, median, total, etc.) computed over subsets of the records under examination. These subsets are usually defined by the values of categorical variables.

Let's look at an example.

The Invoices.csv data set contains information on a company's revenue throughout 2003. Among its fields is the Date field, which contains the date each invoice was issued and from which we can easily create a new field containing the invoice's month of issue.

Note that this field will have twelve distinct values ("January", "February", ..., "December") and each month defines a subset of the data, for which we may be interested in calculating the mean, total, minimum, maximum, median, etc. of the invoiced values (Amount column).

In the specific case of the Invoices.csv data set, the auditor may be interested in checking the average revenue in each of the twelve months of the financial year, or any other summary measure for each month.

In [51]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,Stratas
3988,27/10/2003,23989,10559,1,73,49.3,20,986.0,Low
676,26/01/2003,20677,10866,1,32,12.0,8,96.0,Low
782,04/04/2003,20783,10335,1,76,9.5,10,95.0,Low
4250,19/07/2003,24251,10580,1,70,19.0,5,95.0,Low
2566,07/01/2003,22567,10371,1,61,31.0,3,93.0,Low


In [52]:
invoices.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4999 entries, 3988 to 11
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Date         4999 non-null   object  
 1   InvoiceNo    4999 non-null   int64   
 2   CustomerNo   4999 non-null   int64   
 3   SalesPerson  4999 non-null   int64   
 4   ProductNo    4999 non-null   int64   
 5   UnitPrice    4999 non-null   float64 
 6   Quantity     4999 non-null   int64   
 7   Amount       4999 non-null   float64 
 8   Stratas      4999 non-null   category
dtypes: category(1), float64(2), int64(5), object(1)
memory usage: 356.5+ KB


In [53]:
invoices['Date'] = pd.to_datetime(invoices['Date'], format='%d/%m/%Y') # Convert the Date field to datetime (dates are day/month/year)


In [54]:
invoices['Month'] = pd.DatetimeIndex(invoices['Date']).month_name()


In [55]:
invoices.head()

Unnamed: 0,Date,InvoiceNo,CustomerNo,SalesPerson,ProductNo,UnitPrice,Quantity,Amount,Stratas,Month
3988,2003-10-27,23989,10559,1,73,49.3,20,986.0,Low,October
676,2003-01-26,20677,10866,1,32,12.0,8,96.0,Low,January
782,2003-04-04,20783,10335,1,76,9.5,10,95.0,Low,April
4250,2003-07-19,24251,10580,1,70,19.0,5,95.0,Low,July
2566,2003-01-07,22567,10371,1,61,31.0,3,93.0,Low,January


## Calculation of average monthly revenues using the groupby() function

In [56]:
aggregation = invoices.groupby('Month')['Amount'].mean()

In [57]:
# Dictionary to map the names of the months to numbers
month_to_number = {
    "January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6,
    "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12
}

# Convert the index to an ordered categorical so that months sort in calendar order, not alphabetically
aggregation.index = pd.Categorical(aggregation.index, categories=month_to_number.keys(), ordered=True)

# Sorting
aggregation = aggregation.sort_index()


In [58]:
display(aggregation)

January      771.873062
February     850.285771
March        656.205215
April        853.586765
May          895.156182
June         833.371707
July         826.808354
August       743.664201
September    724.803614
October      735.279212
November     764.670777
December     783.237300
Name: Amount, dtype: float64
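
The ordering trick can be tried in isolation. The sketch below uses made-up averages (not the values above) and builds the list of month names from a date range instead of a dictionary, which is only one possible choice.

```python
import pandas as pd

# Hypothetical averages, indexed alphabetically as groupby would return them
avg = pd.Series([853.6, 743.7, 771.9], index=["April", "August", "January"])

# English month names in calendar order
months = pd.date_range("2003-01-01", periods=12, freq="MS").month_name()

# The categorical is built from the object's own index, so every value keeps its label
avg.index = pd.CategoricalIndex(avg.index, categories=months, ordered=True)
avg = avg.sort_index()

ordered_labels = list(avg.index)
print(ordered_labels)
```

Because the categories carry the calendar order, `sort_index()` sorts by position in that list rather than alphabetically.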

## Monthly statistics

In [59]:
statistics = invoices.groupby('Month')['Amount'].describe()
# Build the categorical from the table's own index so each row keeps its month label
statistics.index = pd.Categorical(statistics.index, categories=month_to_number.keys(), ordered=True)
statistics = statistics.sort_index()
statistics

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
January,405.0,771.873062,1260.023759,5.0,207.12,418.0,798.0,10803.5
February,376.0,850.285771,1349.590444,9.2,226.1375,487.5,887.825,13438.5
March,418.0,656.205215,860.635647,5.0,231.0,460.5,778.0,11330.5
April,374.0,853.586765,1476.541674,15.5,211.875,456.0,945.3375,13438.5
May,406.0,895.156182,1695.605446,10.0,234.875,451.2,798.675,12384.5
June,410.0,833.371707,1390.876187,5.0,275.125,489.75,886.575,12911.5
July,474.0,826.808354,1461.318891,12.0,246.5,464.05,879.5,12911.5
August,438.0,743.664201,1022.554443,7.5,203.55,429.75,858.0,9222.5
November,412.0,764.670777,1219.161431,9.5,230.95,456.0,836.0,12121.0
December,463.0,783.2373,1322.256684,7.0,220.25,462.5,820.95,11594.0


## 11. Merging files

In certain situations the data needed for a task is spread across different files, and it is necessary to combine them into a single data set.

Consider the case where the auditor needs to confirm with third parties the values that make up the customer balance.

Registration data, such as address, customer identification code, zip code, etc., are contained in one file, and the amounts owed by customers in another. The auditor needs both the address and the amounts due to prepare a confirmation (circularization) letter. For this, both files must contain a field that uniquely identifies the customer (in this case we will use the "account" column). For this procedure pandas offers the pandas.merge() function, whose use is shown later.

These two data sets are fixed-width text files, so importing them depends on knowing the layout of the files, which is described in the document data_v3.doc that accompanies the data sets.

Based on the information contained in that document, the data can be imported as follows:

In [60]:
# Import the Accounts Receivable file
account_receivable = pd.read_fwf('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Arfile.ASC',
                                 sep='\t', widths=[11, 4, 4, 15, 8],
                                 header=None,
                                 names=['account', 'division', 'store', 'balance', 'duedate'])

In [61]:
account_receivable.head()

Unnamed: 0,account,division,store,balance,duedate
0,S0000309077,246,20,13192.42,20010101
1,S0000041943,87,3,260.97,20010103
2,S0000143191,87,20,9541.28,20010106
3,S0000459709,9045,20,2254.19,20010110
4,S0000030187,139,4,2286.84,20010110


In [62]:
# Load the customer address file
address = pd.read_fwf('/workspaces/Auditoria/Análises_e_Testes_Comuns_da_auditoria_utilizando_Python/Data/Address.ASC',
                      sep='\t', widths=[11, 33, 33, 30, 25, 5],
                      header=None,
                      names=['account', 'name1', 'name2', 'street', 'cityst', 'zip'])

In [63]:
address.head()

Unnamed: 0,account,name1,name2,street,cityst,zip
0,S0000031637,LYNDA RANSEGNOLA,,1 CHARLOTTE LANE,"DENVILLE, AZ",72134
1,S0000249225,SOPHIE F. NATHAN,,1 COOLIDGE ST.,"NORWALK, CT",6850
2,S0000032500,MERLE DEL POLITO,,1 EAST SHAWNEE TRAIL,"WHARTON, AZ",72185
3,S0000800468,KERRI STRACCO,,1 FRANCIS TERRACE P. O. BOX 68,"STANHOPE, AZ",72174
4,S0000001037,JULIE ANN LAMPE,,1 NEWHAMPSHIRE STREET,"NEWTON, AZ",72160
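
Note that the zip code of the Norwalk, CT record appears as 6850: with default type inference the column is read as a number and the leading zero is lost. The sketch below reproduces this on a small in-memory sample and shows one possible remedy, passing `dtype` to `read_fwf`. The three-field layout and the two records are hypothetical and much simpler than the real Address.ASC layout.

```python
import io
import pandas as pd

# Two made-up records in a hypothetical layout: account (11), city (10), zip (5)
raw = (
    "S0000249225NORWALK   06850\n"
    "S0000031637DENVILLE  72134\n"
)
widths = [11, 10, 5]
names = ["account", "city", "zip"]

# Default inference treats zip as an integer and drops the leading zero
inferred = pd.read_fwf(io.StringIO(raw), widths=widths, header=None, names=names)

# Declaring zip as text keeps the code as it is written in the file
as_text = pd.read_fwf(io.StringIO(raw), widths=widths, header=None, names=names,
                      dtype={"zip": str})

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())
```

Whether the zip code needs to be kept as text depends on how it will be used; for printing addresses on confirmation letters the text form is the safer choice.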


In [64]:
new_set = pd.merge(account_receivable, address[['account','name1', 'name2', 'street', 'cityst','zip']],
                         on= 'account', how='inner')

In [65]:
new_set.head()

Unnamed: 0,account,division,store,balance,duedate,name1,name2,street,cityst,zip
0,S0000309077,246,20,13192.42,20010101,HYACINTHE H. NUBER,,66 CLEARMONT AVENUE,"DENVILLE, AZ",72134
1,S0000041943,87,3,260.97,20010103,ITF JENNIFER ZANIEWSKI,PAULA ZANIEWSKI,6 COBB STREET,"ROCKAWAY, AZ",72166
2,S0000143191,87,20,9541.28,20010106,BETTY KEMMERER,,30 BELLOWS LANE,"TOWACO, AZ",72082
3,S0000459709,9045,20,2254.19,20010110,PETER BAYSA,,18 CRESTWOOD RD.,"ROCKAWAY, AZ",72166
4,S0000030187,139,4,2286.84,20010110,JOSEPHINE PORFIDO,,16 LAKEWOOD DRIVE,"DENVILLE, AZ",72134


In the procedure performed above, known as an inner join, the resulting data set (new_set in our case) contains only the accounts that exist in both data sets. Accounts that are in account_receivable but not in address will not be part of the result. The reverse is also true: accounts that are in address but not in account_receivable will not appear either. An inner join returns the intersection of the two data sets.
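
Since an inner join silently drops the accounts that have no match, it can be useful to list them before sending the letters. The following is a minimal sketch on made-up data (the account codes, balances and streets are hypothetical) that compares an inner join with an outer join using the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the two data sets
receivable = pd.DataFrame({"account": ["A1", "A2", "A3"],
                           "balance": [100.0, 250.5, 75.0]})
address = pd.DataFrame({"account": ["A2", "A3", "A4"],
                        "street": ["2 FIRST ST", "3 SECOND ST", "4 THIRD ST"]})

# Inner join: only accounts present in both data sets
inner = pd.merge(receivable, address, on="account", how="inner")

# Outer join with indicator=True adds a _merge column that says
# where each row came from: left_only, right_only or both
outer = pd.merge(receivable, address, on="account", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

inner_accounts = inner["account"].tolist()
unmatched_accounts = sorted(unmatched["account"].tolist())
print(inner_accounts)
print(unmatched[["account", "_merge"]])
```

Here A1 has a balance but no address and A4 has an address but no balance, so both are left out of the inner join and show up in `unmatched`.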