# Recognized patterns in the data (read the FINAL CONCLUSION at the end of the notebook)

The goal of this case is to find the structure of the provided .xlsx file and to work out which data is the real one, so that the customers can be analyzed properly. However, I found inconsistencies in many of the fields and, moreover, I discovered several **patterns in how the data was generated** to create this exercise.

## Loading the raw data

The Excel file provided contains 2 different sheets that ideally should hold the same customer information. I am going to load the first sheet "Tony Stark" as `rawData_TonyStark` and the second sheet "João" as `rawData_Joao`.



In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/Kaizen

import pandas as pd

rawData_TonyStark = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Kaizen/Teste PS de analista.xlsx', 'Tony Stark')
rawData_Joao = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Kaizen/Teste PS de analista.xlsx', 'João')

/content/drive/MyDrive/Colab Notebooks/Kaizen


All data has been loaded correctly, although `rawData_TonyStark` has a column of null values (`Unnamed: 6`):

In [None]:
rawData_TonyStark.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,E-mail do lead,Status,Objeção,Unnamed: 6,Valor
0,Thiago,Bruna,(88) 3395-1695,Bruna1@Hotmail,Concluído,não tem dinheiro,,1997.0
1,Romulo,Ana,(68) 2446-3056,Ana2@Gmail,Pago,falta tempo,,97.0
2,Rafael,Adriana,(63) 2992-5510,Adriana3@Yahoo,Pendente,prioridade,,597.0
3,Carlos,Dina,(95) 2781-4745,Dina4@Terra,Contatado,sem grana,,297.65
4,João,LETICIA,(49) 3293-2372,LETICIA5@Uou,concluido,curuiso,,197.0
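
As a minimal sketch of how such an empty column could be detected and dropped, the snippet below uses a small toy frame that only mimics the layout of the sheet (its values are illustrative, not the full data). `DataFrame.dropna(axis=1, how='all')` removes only the columns in which every entry is missing.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sheet layout; the values are illustrative only
toy = pd.DataFrame({
    'Vendedor': ['Thiago', 'Romulo'],
    'Unnamed: 6': [np.nan, np.nan],
    'Valor': [1997.0, 97.0],
})

# Columns in which every single entry is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop them without touching partially filled columns
cleaned = toy.dropna(axis=1, how='all')
print(cleaned.columns.tolist())
```

Applied to the real sheet, the same call would leave "Valor" in place even though it contains a missing entry, because `how='all'` requires the whole column to be empty.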


In [None]:
rawData_Joao.head()

Unnamed: 0,lead,Telefone do lead,E-mail do lead,Status,Objeção,Valor
0,Thiago,(88) 33950102,Thiago56@hotmail,Concluído,não tem dinheiro,1997.0
1,Romulo,(68) 24460428,Romulo1@Outlook,Pago,falta tempo,97.0
2,Rafael,(63) 29920451,Rafael512@Gmail,Pendente,prioridade,597.0
3,Carlos,(95) 27810753,"Carlos645,666666666667@Terra",Contatado,sem grana,297.65
4,João,(49) 32931323,"João 873,666666666667@hotmail",concluido,curuiso,197.0


They both have the same number of entries, which is great:

In [None]:
print(rawData_TonyStark.shape[0])
print(rawData_Joao.shape[0])

44
44


## "Vendedor"

There is more than one column with possible names for the agents:

In [None]:
print(rawData_TonyStark['Vendedor'].unique())
print(rawData_TonyStark['Nome do lead'].unique())

['Thiago' 'Romulo' 'Rafael' 'Carlos' 'João ' 'Pedro' 'Manoel' 'João luis'
 'Pitter' 'Tony Stark' 'Jhon']
['Bruna' 'Ana' 'Adriana' 'Dina' 'LETICIA' 'Joelma' 'Lavinia' 'JESSICA'
 'Fernanda' 'Silvia' 'Priscila' 'Lourdes' 'Margarida' 'Renata' 'Emily'
 'Débora' 'Aline' 'Alana' 'FERNANDA' 'Jaqueline' 'Jully' 'Pollyane'
 'Taiane' 'Michelle' 'MARIA' 'Eduarda' 'Claudete' 'Andrelina' 'Vanessa'
 'Elen' 'Cristiane' 'Nailza' 'Cleide' 'Julie' 'Eliene' 'Camila'
 'Gracielle' 'Caroline' 'Tatiana' 'Denise' 'Luziana' 'EDMARA']


The first column in `rawData_TonyStark` is the correct one, with a total of 11 agents. It looks like the names have been generated by repeating the same list 4 times:

In [None]:
agents = ['Thiago', 'Romulo', 'Rafael', 'Carlos', 'João ', 'Pedro', 'Manoel', 'João luis', 'Pitter', 'Tony Stark', 'Jhon']
rawData_TonyStark['Vendedor'].values.tolist() == agents * 4

True
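
The same kind of check comes up for several columns in this notebook, so here is a small sketch of a helper that finds the shortest repeating block of a list without guessing it first. The name `smallest_period` is hypothetical (it is not part of pandas), and the helper also accepts a truncated last repetition.

```python
def smallest_period(values):
    """Return the length of the shortest block whose repetition reproduces `values`."""
    n = len(values)
    for period in range(1, n + 1):
        # Every element must equal the element one or more periods before it
        if all(values[i] == values[i % period] for i in range(n)):
            return period
    return n


agents = ['Thiago', 'Romulo', 'Rafael', 'Carlos', 'João ', 'Pedro',
          'Manoel', 'João luis', 'Pitter', 'Tony Stark', 'Jhon']

print(smallest_period(agents * 4))                 # period of the repeated agent list
print(smallest_period(['a', 'b', 'a', 'b', 'a']))  # truncated last repetition
```

A period equal to the length of the list means that no shorter repeating block exists.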

To start creating the organized final dataset in `data`, I am going to get rid of the blank space in `'João '` -> `'João'` and use capital letters for the second names `'João luis'` -> `'João Luis'`:

In [None]:
# Vectorized string cleanup: strip surrounding spaces, then capitalize every word
data = pd.DataFrame({'Vendedor': rawData_TonyStark['Vendedor'].str.strip().str.title()})
data.head()

Unnamed: 0,Vendedor
0,Thiago
1,Romulo
2,Rafael
3,Carlos
4,João


## "Nome do lead"

There are columns with customer names in both sheets, but again `rawData_TonyStark['Nome do lead']` looks like the correct one. The "lead" column in the "João sheet" contains the agent names again:

In [None]:
(rawData_Joao['lead'] == rawData_TonyStark['Vendedor']).sum() # All entries coincide

44

These are the customers:

In [None]:
print(rawData_TonyStark['Nome do lead'].values)

['Bruna' 'Ana' 'Adriana' 'Dina' 'LETICIA' 'Joelma' 'Lavinia' 'JESSICA'
 'Fernanda' 'Silvia' 'Priscila' 'Lourdes' 'Margarida' 'Renata' 'Emily'
 'Débora' 'Aline' 'Alana' 'FERNANDA' 'Jaqueline' 'Jully' 'Pollyane'
 'Taiane' 'Michelle' 'MARIA' 'Eduarda' 'Claudete' 'Andrelina' 'Ana'
 'Vanessa' 'Elen' 'Cristiane' 'Nailza' 'Renata' 'Cleide' 'Julie' 'Eliene'
 'Camila' 'Gracielle' 'Caroline' 'Tatiana' 'Denise' 'Luziana' 'EDMARA']


I am going to change the format of the names such that they all look the same, e.g. `'LETICIA'` -> `'Leticia'`:

In [None]:
data['Nome do lead'] = rawData_TonyStark['Nome do lead'].str.title()
data.head()

Unnamed: 0,Vendedor,Nome do lead
0,Thiago,Bruna
1,Romulo,Ana
2,Rafael,Adriana
3,Carlos,Dina
4,João,Leticia


## "Telefone do lead"

Clearly the phone numbers in both sheets do not coincide:

In [None]:
(rawData_TonyStark['Telefone do lead'] == rawData_Joao['Telefone do lead']).sum() # All entries are different

0

However, it is easy to see that the "João sheet" has been manipulated, which ruined its data. Using the syntax "(ddd) ph1-ph2", we can check that `rawData_Joao['Telefone do lead']` is just a funny reordering of `rawData_TonyStark['Telefone do lead']`:

In [None]:
ddd_TonyStark = [num.split()[0].strip('()') for num in rawData_TonyStark['Telefone do lead']]
ddd_Joao = [num.split()[0].strip('()') for num in rawData_Joao['Telefone do lead']]
print(ddd_TonyStark == ddd_Joao) # Same "ddd"

ph1_TonyStark = [num.split()[1].split('-')[0] for num in rawData_TonyStark['Telefone do lead']]
ph1_Joao = [num.split()[1][:4] for num in rawData_Joao['Telefone do lead']]
print(ph1_TonyStark == ph1_Joao) # Same "ph1"

ph2_TonyStark = [num.split()[1].split('-')[1] for num in rawData_TonyStark['Telefone do lead']]
ph2_Joao = [num.split()[1][4:] for num in rawData_Joao['Telefone do lead']]
print( sorted(ph2_TonyStark) == ph2_Joao ) # "ph2" has been ordered in ascending numbers

print(ph2_Joao)

True
True
True
['0102', '0428', '0451', '0753', '1323', '1333', '1427', '1460', '1627', '1683', '1695', '2372', '2531', '2953', '3056', '3330', '3374', '3450', '3655', '4084', '4397', '4537', '4616', '4745', '5143', '5510', '5778', '5782', '6529', '6621', '6641', '6778', '7305', '7422', '7603', '7736', '7761', '7913', '8077', '8566', '8651', '8906', '9460', '9460']
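
The three splits above rely on the position of the space and the hyphen. As an alternative sketch, a single regular expression can parse both formats, "(ddd) ph1-ph2" and "(ddd) ph1ph2". The helper name `split_phone` and the assumption of a 2-digit area code followed by two 4-digit groups are mine; they match the samples shown in this notebook.

```python
import re

# Assumed layout: "(dd) dddd-dddd" with an optional hyphen
PHONE_RE = re.compile(r'\((\d{2})\)\s*(\d{4})-?(\d{4})')


def split_phone(number):
    """Split a phone number into the tuple (ddd, ph1, ph2)."""
    match = PHONE_RE.fullmatch(number.strip())
    if match is None:
        raise ValueError(f'Unexpected phone format: {number!r}')
    return match.groups()


print(split_phone('(88) 3395-1695'))  # format used in the "Tony Stark sheet"
print(split_phone('(88) 33950102'))   # format used in the "João sheet"
```

Raising on an unexpected format makes malformed entries visible instead of letting them slip through as wrongly split strings.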


Therefore again, `rawData_TonyStark['Telefone do lead']` is the correct one and we can include it in the final dataset:

In [None]:
data['Telefone do lead'] = rawData_TonyStark['Telefone do lead']
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead
0,Thiago,Bruna,(88) 3395-1695
1,Romulo,Ana,(68) 2446-3056
2,Rafael,Adriana,(63) 2992-5510
3,Carlos,Dina,(95) 2781-4745
4,João,Leticia,(49) 3293-2372


As a curiosity, there is a duplicated phone number which, for the moment, I am not going to discard...

In [None]:
rawData_TonyStark[rawData_TonyStark['Telefone do lead'].duplicated(keep=False)]

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,E-mail do lead,Status,Objeção,Unnamed: 6,Valor
12,Romulo,Margarida,(48) 3706-9460,Margarida13@Yahoo,Concluído,não atendeu,,597.0
13,Rafael,Renata,(48) 3706-9460,Renata14@Terra,Pago,não responde,,297.0


## "Email do lead"

Here we find completely different emails in the two sheets, with some misleading information and entries that appear to have been generated by mixing columns. Let us explore the sheets separately...

#### Emails: "Tony Stark sheet"

I am going to explore first the "usernames" and then the "servers" to see what is going on.

The usernames look like the customer names in `rawData_TonyStark['Nome do lead']` followed by some kind of `id` number, i.e. "users"="name"+"id":

In [None]:
ids = range(1, len(rawData_TonyStark) + 1)  # Named `ids` to avoid shadowing the built-in `id`
users = [ rawData_TonyStark.loc[i,'Nome do lead']+str(ids[i]) for i in range(len(rawData_TonyStark)) ] # Generated usernames

usernames_TonyStark = [ x.split('@')[0] for x in rawData_TonyStark['E-mail do lead'] ] # Usernames in the dataset

users == usernames_TonyStark

True

I think this was done on purpose instead of choosing more varied usernames when preparing the dataset, so I am ok with that.

At the same time, the different "servers" have been generated by repeating the list `['Hotmail', 'Gmail', 'Yahoo', 'Terra', 'Uou']` until all the entries were filled:

In [None]:
companies = ['Hotmail', 'Gmail', 'Yahoo', 'Terra', 'Uou'] # Building block to generate the servers

servers_TonyStark = [ x.split('@')[1] for x in rawData_TonyStark['E-mail do lead'] ] # Servers in the dataset

(companies * 9)[:len(rawData_TonyStark)] == servers_TonyStark

True

This was possibly done on purpose as well. In that case, all the emails in the "Tony Stark sheet" would be valid once a server extension is added, e.g. "***@server.com".

#### Emails: "João sheet"

This time we start exploring the servers, which have been generated in an analogous way:

In [None]:
companies2 = ['hotmail', 'Outlook', 'Gmail', 'Terra'] # Building block to generate the servers

servers_Joao = [ x.split('@')[1] for x in rawData_Joao['E-mail do lead'] ] # Servers in the dataset

(companies2 * 11)[:len(rawData_Joao)] == servers_Joao

True

For the usernames, this sheet shows a more complex structure. They clearly match the agent names rather than the customer names, so we can discard them as customer emails with near certainty. They could be the agents' emails, but in that case each agent should have a single server, which is not what we see.

In conclusion, this data does not allow us to tell whose email each entry is (neither for the customers nor for the agents). Just for completeness of the exercise, I am going to take the emails in the "Tony Stark sheet" as the correct ones for the customers:

In [None]:
servers_TonyStark = [ 'Uol' if x=='Uou' else x for x in servers_TonyStark ]

In [None]:
data['Email do lead'] = [ usernames_TonyStark[i].lower()+'@'+servers_TonyStark[i].lower()+'.com' for i in range(len(rawData_TonyStark)) ]
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com
4,João,Leticia,(49) 3293-2372,leticia5@uol.com


Although we have discarded them, let us split the usernames in the "João sheet" and save their numeric part in a new column `data['aux']`. They may come from a different numeric column and we do not want to lose them.

In [None]:
data['aux'] = [ float(rawData_Joao.loc[i,'E-mail do lead'].split('@')[0].strip(rawData_Joao.loc[i,'lead']).replace(',','.')) for i in range(len(rawData_Joao)) ]
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,aux
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,56.0
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,1.0
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,512.0
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,645.666667
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,873.666667
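
One caveat about the cell above: `str.strip(chars)` removes any of the given characters from both ends of the string, not a literal prefix. It gives the right result here only because the agent names contain no digits or commas. A more explicit sketch, assuming every username ends with a number that may use a comma as decimal separator, is shown below. The helper name `trailing_number` is hypothetical.

```python
import re

# A run of digits, optionally followed by a decimal part, at the end of the string
TRAILING_NUMBER = re.compile(r'(\d+(?:[.,]\d+)?)$')


def trailing_number(username):
    """Return the number at the end of `username` as a float."""
    match = TRAILING_NUMBER.search(username)
    if match is None:
        raise ValueError(f'No trailing number in {username!r}')
    # Normalize the decimal comma before converting
    return float(match.group(1).replace(',', '.'))


for email in ['Thiago56@hotmail', 'Carlos645,666666666667@Terra', 'João 873,666666666667@hotmail']:
    print(trailing_number(email.split('@')[0]))
```

This version does not need the agent name at all, so it does not depend on the "lead" column matching the username exactly.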


#### 'aux' numeric values

These values could be crucial to restore the original information of a different numeric column. In fact, they could be related to the last column, i.e. "Valor", of the dataset. Let us explore them:

In [None]:
data['aux'].head(10)

0      56.000000
1       1.000000
2     512.000000
3     645.666667
4     873.666667
5    1101.666667
6    1329.666667
7    1557.666667
8    1785.666667
9    2013.666667
Name: aux, dtype: float64

It is just an array of mostly ascending values. It could be related to the ordered list of arbitrary phone numbers in `ph2_Joao`, but here the values are far from arbitrary: from the fourth entry onwards they are equally spaced with step `228`:

In [None]:
print([data.loc[i+1,'aux'] - data.loc[i,'aux'] for i in range(len(data)-1)])

[-55.0, 511.0, 133.66666666666697, 228.0, 228.00000000000296, 228.0, 228.0, 228.0, 228.0, 228.00000000000023, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 227.99999999999955, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 228.0]


There is a curious behavior at the beginning of the array `[56, 1, 512, ...]`, where `228` shows up again as the step size between the numbers, i.e. `56 + 2*228 = 512`.
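
The printed differences contain floating-point noise such as `228.00000000000296`, so an exact comparison with `228` would miss some of them. A minimal sketch of the same check with a tolerance, using only the first 10 values of `data['aux']` as printed above (rounded to 6 decimals, which is why `atol` is set explicitly):

```python
import numpy as np

# First 10 values of data['aux'], copied from the printed output above
aux_head = np.array([56.0, 1.0, 512.0, 645.666667, 873.666667,
                     1101.666667, 1329.666667, 1557.666667,
                     1785.666667, 2013.666667])

steps = np.diff(aux_head)                       # consecutive differences
regular = np.isclose(steps, 228.0, atol=1e-6)   # True where the step is 228 up to rounding
print(regular)

# The relation between the first and the third value noted in the text
print(56 + 2 * 228 == 512)
```

Only the first three differences fall outside the regular step, which agrees with the list printed by the previous cell.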

For the moment, it is impossible to extract more information without exploring the rest of the fields in the dataset.

## "Status"

Both sheets show the same entries:

In [None]:
(rawData_TonyStark['Status'] == rawData_Joao['Status']).sum()

44

Again, they have been generated by concatenating a single list `['Concluído', 'Pago', 'Pendente', 'Contatado', 'concluido', 'pagou']` until completing the 44 entries:

In [None]:
block = ['Concluído', 'Pago', 'Pendente', 'Contatado', 'concluido', 'pagou']

( block * 8 )[:len(rawData_TonyStark)] == rawData_TonyStark['Status'].tolist()

True

Updating the final dataset:

In [None]:
data['Status'] = [ category.lower() for category in rawData_TonyStark['Status']]
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,aux,Status
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,56.0,concluído
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,1.0,pago
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,512.0,pendente
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,645.666667,contatado
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,873.666667,concluido


Nevertheless, some of the values are equivalent, so we have to merge them:

In [None]:
# Restrict the replacement to the "Status" column so that no other column is touched
data['Status'] = data['Status'].replace({'concluido': 'concluído', 'pago': 'pagou'})
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,aux,Status
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,56.0,concluído
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,1.0,pagou
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,512.0,pendente
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,645.666667,contatado
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,873.666667,concluído


There is a total of 4 different categories:

In [None]:
print(data['Status'].unique())

['concluído' 'pagou' 'pendente' 'contatado']
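
To see how the 44 entries are distributed over these 4 categories, the sketch below rebuilds the column from the repeating block verified earlier (rather than reading the file) and counts the merged values. The number 44 is the sheet length reported at the beginning of the notebook.

```python
import pandas as pd

block = ['Concluído', 'Pago', 'Pendente', 'Contatado', 'concluido', 'pagou']

# Rebuild the 44 entries from the repeating block, then normalize and merge
status = pd.Series((block * 8)[:44]).str.lower()
merged = status.replace({'concluido': 'concluído', 'pago': 'pagou'})

counts = merged.value_counts()
print(counts)
```

Because the block has 6 elements and 44 is not a multiple of 6, the first two elements of the block appear once more than the others.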


## "Objeção"

Both sheets show the same entries:

In [None]:
(rawData_TonyStark['Objeção'] == rawData_Joao['Objeção']).sum()

44

This is the pattern that I found...

In [None]:
objections = ['não tem dinheiro', 'falta tempo', 'prioridade', 'sem grana', 'curuiso', 'não atendeu', 'não responde']

(objections * 7)[:len(rawData_TonyStark)] == rawData_TonyStark['Objeção'].tolist()

True

I include them in the final dataset...

In [None]:
data['Objeção'] = [ obj.lower() for obj in rawData_TonyStark['Objeção']]
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,aux,Status,Objeção
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,56.0,concluído,não tem dinheiro
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,1.0,pagou,falta tempo
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,512.0,pendente,prioridade
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,645.666667,contatado,sem grana
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,873.666667,concluído,curuiso


Joining equivalent categories...

In [None]:
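# Restrict the replacement to the "Objeção" column so that no other column is touched
data['Objeção'] = data['Objeção'].replace({
    'não tem dinheiro': 'sem dinheiro',
    'sem grana': 'sem dinheiro',
    'curuiso': 'curioso',
    'não atendeu': 'não responde',
})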
data.head()

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,aux,Status,Objeção
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,56.0,concluído,sem dinheiro
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,1.0,pagou,falta tempo
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,512.0,pendente,prioridade
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,645.666667,contatado,sem dinheiro
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,873.666667,concluído,curioso


These are the categories...

In [None]:
print(data['Objeção'].unique())

['sem dinheiro' 'falta tempo' 'prioridade' 'curioso' 'não responde']


## "Valor"

Finally, we find again the same entries on both sheets except for a single `Null` value in the "Tony Stark sheet":

In [None]:
print(rawData_TonyStark['Valor'].isnull().sum()) # There is a single Null value
print((rawData_TonyStark['Valor'] == rawData_Joao['Valor']).sum()) # The rest of the entries coincide

1
43
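
The count of 43 is explained by the missing entry alone: a missing value never compares equal to anything, so that row is counted as a mismatch. A small sketch of how the mismatching rows could be located, on toy values (the position of the missing entry here is hypothetical, not the one in the real sheet):

```python
import numpy as np
import pandas as pd

# Illustrative values only; the position of the missing entry is hypothetical
a = pd.Series([1997.0, np.nan, 597.0, 297.65])
b = pd.Series([1997.0, 97.0, 597.0, 297.65])

mismatch = a.ne(b)                         # NaN is unequal to every value
print(mismatch[mismatch].index.tolist())   # rows where the two columns differ
print(a[a.isna()].index.tolist())          # rows where the first column is missing
```

If both lists coincide, as they do here, the only difference between the columns is the missing value.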


Taking the "João sheet": ideally all the values would be different, but they have been generated with a pattern similar to the rest of the columns, by repeating the list `[1997, 97, 597, 297, 197]` and apparently introducing some decimal values by hand (inconsistently using points "." and commas "," as the decimal separator):

In [None]:
import numpy as np

nums = [1997, 97, 597, 297, 197]

# Comparing just the first 15 entries
frame = {
    'generated': nums*3,
    'actual column': rawData_Joao['Valor'][:3*len(nums)].values
}
print(pd.DataFrame(frame))

print('')

# Checking that they are equal except for the decimal values introduced by hand
# Subtracting the two arrays leaves only the deviations from the pattern
print(np.array((nums*9)[:len(rawData_Joao)])-rawData_Joao['Valor'].values)

    generated  actual column
0        1997        1997.00
1          97          97.00
2         597         597.00
3         297         297.65
4         197         197.00
5        1997        1997.00
6          97          97.00
7         597         597.00
8         297         297.00
9         197         197.00
10       1997        1997.55
11         97          97.00
12        597         597.00
13        297         297.00
14        197         197.00

[ 0.00000e+00  0.00000e+00  0.00000e+00 -6.50000e-01  0.00000e+00
  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00
 -5.50000e-01  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00
  0.00000e+00 -9.90000e-01  0.00000e+00  0.00000e+00  0.00000e+00
  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00
  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00 -1.97357e+05
  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00  0.00000e+00
  0.00000e+00 -6.50000e-01  0.00000e+00  0.00000e+00  0.00000e+00
  0.000
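
Instead of reading the deviations off the printed array, the rows that depart from the pattern can be listed directly. The sketch below uses only the first 15 values shown in the comparison table above. `np.resize` repeats the block until the requested length is reached.

```python
import numpy as np

nums = [1997, 97, 597, 297, 197]

# First 15 entries of the "Valor" column, copied from the table above
actual = np.array([1997.00, 97.00, 597.00, 297.65, 197.00,
                   1997.00, 97.00, 597.00, 297.00, 197.00,
                   1997.55, 97.00, 597.00, 297.00, 197.00])

expected = np.resize(nums, len(actual))                     # block repeated to the same length
deviating = np.flatnonzero(~np.isclose(actual, expected))   # rows that differ from the pattern

print(deviating)
print(actual[deviating] - expected[deviating])
```

On the full column the same approach would also flag the single large deviation visible in the printed array, which is worth inspecting separately from the small decimal edits.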

We have the following information to complete the final dataset:
- `data['aux']` does not show a clear periodic structure. It is just an array of quasi-equally-spaced ascending values.
- `rawData_Joao['Valor']` exhibits a recognizable periodic pattern, slightly modified with decimal values.

Hence, even taking the numbers in `data['aux']` into account, without additional information it is impossible to say whether these two columns are related to each other in any way. With all this in mind, I cannot identify the values that you (the recruiter) expect me to find as the correct ones for `data['Valor']`.

# FINAL CONCLUSION

After exploring the data, I found that the columns were generated following different *independent* patterns. For most of them, this does not cause any problem for continuing the exercise. However, for some of them ("Email do lead" or "Valor"), since they do not come from a realistic dataset, it is hard to identify the correct values that would correspond to your original organized dataset.

Here is the final dataset to be analyzed later on (for the missing column, i.e. 'Valor', I am going to generate some random numbers in the next notebook "*2_Gerando_o_dataset.ipynb*"):

In [None]:
data.drop(labels = 'aux', axis = 1, inplace = True)
data

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,Email do lead,Status,Objeção
0,Thiago,Bruna,(88) 3395-1695,bruna1@hotmail.com,concluído,sem dinheiro
1,Romulo,Ana,(68) 2446-3056,ana2@gmail.com,pagou,falta tempo
2,Rafael,Adriana,(63) 2992-5510,adriana3@yahoo.com,pendente,prioridade
3,Carlos,Dina,(95) 2781-4745,dina4@terra.com,contatado,sem dinheiro
4,João,Leticia,(49) 3293-2372,leticia5@uol.com,concluído,curioso
5,Pedro,Joelma,(79) 2611-1323,joelma6@hotmail.com,pagou,não responde
6,Manoel,Lavinia,(16) 3292-5782,lavinia7@gmail.com,concluído,não responde
7,João Luis,Jessica,(63) 2412-6529,jessica8@yahoo.com,pagou,sem dinheiro
8,Pitter,Fernanda,(67) 2198-4616,fernanda9@terra.com,pendente,falta tempo
9,Tony Stark,Silvia,(67) 2783-0451,silvia10@uol.com,contatado,prioridade
