# Generating the organized dataset



In the previous notebook, "*1_Analisando_os_dados.ipynb*", I found patterns in the raw data that make it difficult to build an organized, trustworthy final dataset covering every field. For that reason, this notebook only builds the columns I managed to organize, plus some new columns that I consider relevant for the dashboard:

- `"Vendedor"`, `"Nome do lead"`, `"Telefone do lead"`, `"Email do lead"`, `"Status"`, `"Objeção"`: the details are in the previous notebook.
- `"DDD"`, `"Estado"`: the area code and state of each lead, derived from the phone number; both fit well in the dashboard.
- `"Valor"`: random numbers generated with the `NumPy` library.

The idea is to export a clean dataset and still be able to present a complete dashboard.

## Loading the raw data

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/Kaizen

import pandas as pd

# Both sheets live in the same workbook; only the 'Tony Stark' sheet is used in this notebook
xlsx_path = '/content/drive/MyDrive/Colab Notebooks/Kaizen/Teste PS de analista.xlsx'
rawData_TonyStark = pd.read_excel(xlsx_path, sheet_name='Tony Stark')
rawData_Joao = pd.read_excel(xlsx_path, sheet_name='João')

/content/drive/MyDrive/Colab Notebooks/Kaizen


## "Vendedor", "Nome do lead", "Telefone do lead"

In [None]:
# "Vendedor"
data = pd.DataFrame([name.strip().title() for name in rawData_TonyStark['Vendedor']], columns=['Vendedor'])

In [None]:
# "Nome do lead"
data['Nome do lead'] = [lead.title() for lead in rawData_TonyStark['Nome do lead']]

In [None]:
# "Telefone do lead"
data['Telefone do lead'] = rawData_TonyStark['Telefone do lead']

## "DDD", "Estado"

Since each phone number starts with its area code (DDD), I can also derive the state of each customer:

In [None]:
# "DDD"
data['DDD'] = [int(num.split()[0].strip('()')) for num in rawData_TonyStark['Telefone do lead']]
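
The split-based parsing above assumes every number looks like `(88) 3395-1695`. As a minimal sketch, assuming that same parenthesized format, a regular expression can extract the area code and leave a missing value when a number does not match. The `extract_ddd` helper and the sample `phones` series are hypothetical and not part of the raw data:

```python
import pandas as pd

def extract_ddd(phones: pd.Series) -> pd.Series:
    """Return the two-digit area code found in parentheses, or <NA> if absent."""
    # expand=False makes str.extract return a Series for a single capture group
    ddd = phones.str.extract(r'\((\d{2})\)', expand=False)
    # Nullable integer dtype keeps unmatched rows as <NA> instead of raising
    return pd.to_numeric(ddd).astype('Int64')

# Hypothetical sample: two well-formed numbers and one without an area code
phones = pd.Series(['(88) 3395-1695', '(68) 2446-3056', '3395-1695'])
ddd = extract_ddd(phones)
print(ddd.tolist())
```

Rows that come back as `<NA>` can then be inspected by hand instead of stopping the whole cell with an exception.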

In [None]:
# Area codes (DDD) grouped by state, keyed by ISO 3166-2 code
ddd_dict = {
    'BR-SP': [11,12,13,14,15,16,17,18,19],
    'BR-RJ': [21,22,24],
    'BR-ES': [27,28],
    'BR-MG': [31,32,33,34,35,37,38],
    'BR-PR': [41,42,43,44,45,46],
    'BR-SC': [47,48,49],
    'BR-RS': [51,53,54,55],
    'BR-DF': [61],
    'BR-GO': [62,64],
    'BR-TO': [63],
    'BR-MT': [65,66],
    'BR-MS': [67],
    'BR-AC': [68],
    'BR-RO': [69],
    'BR-BA': [71,73,74,75,77],
    'BR-SE': [79],
    'BR-PE': [81,87],
    'BR-AL': [82],
    'BR-PB': [83],
    'BR-RN': [84],
    'BR-CE': [85,88],
    'BR-PI': [86,89],
    'BR-PA': [91,93,94],
    'BR-AM': [92,97],
    'BR-RR': [95],
    'BR-AP': [96],
    'BR-MA': [98,99]
}

# Reverse lookup: area code -> state
ddd_to_state = {ddd: state for state, ddds in ddd_dict.items() for ddd in ddds}

# "Estado" (an area code missing from the table becomes NaN instead of misaligning the column)
data['Estado'] = data['DDD'].map(ddd_to_state)
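
Before trusting the new column, it can help to check that every area code in the data actually has a state in the lookup table. The following is only a sketch: `ddd_subset` is a hypothetical three-state excerpt of the table and `sample` is made-up data, both used for illustration.

```python
import pandas as pd

# Hypothetical three-state excerpt of the DDD table, for illustration only
ddd_subset = {
    'BR-TO': [63],
    'BR-MS': [67],
    'BR-AC': [68],
}

# Invert the table: one entry per area code pointing at its state
ddd_to_state = {ddd: state for state, ddds in ddd_subset.items() for ddd in ddds}

# Hypothetical DDD column; 10 is not a Brazilian area code
sample = pd.DataFrame({'DDD': [63, 68, 10, 67]})
sample['Estado'] = sample['DDD'].map(ddd_to_state)

# Area codes that found no state in the table
unmapped = sample.loc[sample['Estado'].isna(), 'DDD'].tolist()
print(unmapped)
```

An empty `unmapped` list means every row received a state; anything else points at either a typo in the phone number or a gap in the table.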

## "Email do lead", "Status", "Objeção"

In [None]:
# Split each address into the part before and after the '@'
usernames_TonyStark = [x.split('@')[0] for x in rawData_TonyStark['E-mail do lead']]
servers_TonyStark = [x.split('@')[1] for x in rawData_TonyStark['E-mail do lead']]
# Fix the misspelled server name
servers_TonyStark = ['Uol' if x == 'Uou' else x for x in servers_TonyStark]

# "Email do lead"
data['Email do lead'] = [f'{user.lower()}@{server.lower()}.com' for user, server in zip(usernames_TonyStark, servers_TonyStark)]

In [None]:
# "Status"
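data['Status'] = [category.lower() for category in rawData_TonyStark['Status']]
# Unify spelling variants; applied to this column only so the other columns are untouched
data['Status'] = data['Status'].replace(['concluido', 'pago'], ['concluído', 'pagou'])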

In [None]:
# "Objeção"
data['Objeção'] = [obj.lower() for obj in rawData_TonyStark['Objeção']]
# Merge equivalent answers and fix the 'curuiso' typo; applied to this column only
data['Objeção'] = data['Objeção'].replace(
    ['não tem dinheiro', 'sem grana', 'curuiso', 'não atendeu'],
    ['sem dinheiro', 'sem dinheiro', 'curioso', 'não responde'])
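
To confirm that the replacements left only the intended categories, one option is to compare the cleaned column against an explicit set of allowed labels. This is a sketch under assumptions: the `raw` sample is made up, and the `allowed` set is inferred from the labels that appear in this notebook rather than taken from an authoritative list.

```python
import pandas as pd

# Hypothetical raw labels, including variants handled in the cell above
raw = pd.Series(['Sem grana', 'curuiso', 'Não atendeu', 'Prioridade', 'outro motivo'])

# Variant -> canonical label, mirroring the replacements in the cell above
canonical = {
    'não tem dinheiro': 'sem dinheiro',
    'sem grana': 'sem dinheiro',
    'curuiso': 'curioso',
    'não atendeu': 'não responde',
}
clean = raw.str.lower().replace(canonical)

# Assumed set of valid labels; anything else is flagged for manual review
allowed = {'sem dinheiro', 'curioso', 'não responde', 'prioridade', 'falta tempo'}
unexpected = sorted(set(clean) - allowed)
print(clean.tolist())
print(unexpected)
```

Keeping the variants in a dictionary also makes it harder for the two parallel lists of a list-based `replace` to drift out of sync.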

## "Valor"

This column is the trickiest, because it is hard to know which values you expect me to treat as the correct ones (I explained this in detail in the other notebook). Instead, I decided to generate random numbers representing, for example, the profit each customer brings to the company.

In [None]:
import numpy as np

mu = 2500
sigma = 1000
Znums = np.random.randn(len(data)) # Standard normal samples (mean 0, standard deviation 1)

# "Valor": samples rescaled to mean mu and standard deviation sigma
data['Valor'] = mu + sigma * Znums
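
Because the numbers above are unseeded, every run of the notebook produces a different `"Valor"` column. A possible variant, sketched here with assumptions of my own (the seed `42`, the fixed `n_rows`, and the decision to clip negative amounts are illustrative choices, not part of the original analysis), uses NumPy's `Generator` API to make the column repeatable:

```python
import numpy as np

mu, sigma = 2500, 1000
n_rows = 10  # hypothetical; in the notebook this would be len(data)

# A fixed seed (42 is arbitrary) makes the generated column repeatable
rng = np.random.default_rng(42)
valor = rng.normal(loc=mu, scale=sigma, size=n_rows)

# Assumption: negative amounts are not meaningful here, so clip at zero
# and round to cents
valor = np.round(np.clip(valor, 0, None), 2)
print(valor)
```

With a mean of 2500 and a standard deviation of 1000, negative draws are rare but possible, which is why the clipping step is included.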

## Exporting the organized dataset

In [None]:
data.to_excel('final_dataset.xlsx')
data.to_csv('final_dataset.csv')
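
With the default settings, both exports also write the row index as the first column. When such a file is read back, pandas exposes that column as `Unnamed: 0`. The sketch below shows the difference on a hypothetical two-row frame, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the final dataset
df = pd.DataFrame({'Vendedor': ['Thiago', 'Romulo'], 'DDD': [88, 68]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False keeps only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Whether to keep the index depends on the dashboard tool: if it needs a row identifier, the default export is fine; otherwise `index=False` avoids the extra column.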

In [None]:
# Final dataset
data

Unnamed: 0,Vendedor,Nome do lead,Telefone do lead,DDD,Estado,Email do lead,Status,Objeção,Valor
0,Thiago,Bruna,(88) 3395-1695,88,BR-CE,bruna1@hotmail.com,concluído,sem dinheiro,2487.482333
1,Romulo,Ana,(68) 2446-3056,68,BR-AC,ana2@gmail.com,pagou,falta tempo,2803.470928
2,Rafael,Adriana,(63) 2992-5510,63,BR-TO,adriana3@yahoo.com,pendente,prioridade,2613.806511
3,Carlos,Dina,(95) 2781-4745,95,BR-RR,dina4@terra.com,contatado,sem dinheiro,1972.515901
4,João,Leticia,(49) 3293-2372,49,BR-SC,leticia5@uol.com,concluído,curioso,2466.706672
5,Pedro,Joelma,(79) 2611-1323,79,BR-SE,joelma6@hotmail.com,pagou,não responde,2773.596973
6,Manoel,Lavinia,(16) 3292-5782,16,BR-SP,lavinia7@gmail.com,concluído,não responde,3782.545196
7,João Luis,Jessica,(63) 2412-6529,63,BR-TO,jessica8@yahoo.com,pagou,sem dinheiro,3528.711104
8,Pitter,Fernanda,(67) 2198-4616,67,BR-MS,fernanda9@terra.com,pendente,falta tempo,3545.114943
9,Tony Stark,Silvia,(67) 2783-0451,67,BR-MS,silvia10@uol.com,contatado,prioridade,3316.104857
