# Data Analyst - Take-Home Assignment 

## 1. Know the Data

First, we need to load the Python libraries and the dataset. For this exercise, I am using the data provided by Capchase.

Import Libraries

In [1]:
import pandas as pd

Import Data

In [2]:
historical_invoices_df = pd.read_excel('../data/test_invoices_20210305.xlsx', index_col=0)

In [3]:
historical_invoices_df.head()

Unnamed: 0,id,payment_type,type,account,account_expiration,time,amount,status,response,responsetext,avs_results,csc_results,batch_id,first_name,last_name,company
0,ID_13732,cc,sale,Account_4620,5,20181207000604,1.0,complete,success,Approved,Z,M,,First_Name_987,Last_Name_918,Company_1
1,ID_13732,cc,settle,Account_4620,5,20181207023427,1.0,complete,success,ACCEPTED,Z,M,1.0,First_Name_987,Last_Name_918,Company_1
2,ID_14747,cc,sale,Account_4620,5,20181207001202,2.0,complete,success,APPROVAL,Z,M,,First_Name_987,Last_Name_918,Company_1
3,ID_14747,cc,settle,Account_4620,5,20181207012743,2.0,complete,success,APPROVED,Z,M,337231373.0,First_Name_987,Last_Name_918,Company_1
4,ID_19256,cc,sale,Account_1518,74,20181208022222,0.01,complete,success,Approved,0,M,,,,


Before zooming into each field, let’s first take a bird’s eye view of the overall dataset characteristics.

In [4]:
# count of non-null values for each column and its data type.
historical_invoices_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41201 entries, 0 to 41200
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  41201 non-null  object 
 1   payment_type        41201 non-null  object 
 2   type                41201 non-null  object 
 3   account             41201 non-null  object 
 4   account_expiration  41201 non-null  int64  
 5   time                41201 non-null  int64  
 6   amount              41201 non-null  float64
 7   status              41201 non-null  object 
 8   response            41201 non-null  object 
 9   responsetext        41161 non-null  object 
 10  avs_results         39389 non-null  object 
 11  csc_results         40215 non-null  object 
 12  batch_id            19253 non-null  float64
 13  first_name          41186 non-null  object 
 14  last_name           41186 non-null  object 
 15  company             13 non-null     object 
dtypes: f

In [5]:
# Basic statistics for each column. With include='all', describe() reports
# count, unique count, most frequent value and its frequency for categorical
# columns, and count, mean, standard deviation, min, max and percentiles for
# numeric columns.
historical_invoices_df.describe(include = 'all')

Unnamed: 0,id,payment_type,type,account,account_expiration,time,amount,status,response,responsetext,avs_results,csc_results,batch_id,first_name,last_name,company
count,41201,41201,41201,41201,41201.0,41201.0,41201.0,41201,41201,41161,39389,40215,19253.0,41186,41186,13
unique,21898,1,4,5474,,,,4,2,27,11,3,,1843,3707,3
top,ID_12736,cc,sale,Account_675,,,,complete,success,APPROVED,Z,M,,First_Name_1631,Last_Name_3354,Company_2
freq,3,41201,21439,64,,,,38390,38494,19170,26537,39554,,471,223,8
mean,,,,,48.556661,20200940000000.0,252.330717,,,,,,433279100.0,,,
std,,,,,27.304527,4626674000.0,548.650148,,,,,,22499600.0,,,
min,,,,,0.0,20181210000000.0,-2550.99,,,,,,1.0,,,
25%,,,,,23.0,20200300000000.0,50.0,,,,,,414672400.0,,,
50%,,,,,49.0,20200710000000.0,129.99,,,,,,432644100.0,,,
75%,,,,,72.0,20201110000000.0,199.99,,,,,,451689600.0,,,


Next, we focus on identifying missing values. `isnull().sum()` returns the number of missing values in each column.

In [6]:
historical_invoices_df.isnull().sum()

id                        0
payment_type              0
type                      0
account                   0
account_expiration        0
time                      0
amount                    0
status                    0
response                  0
responsetext             40
avs_results            1812
csc_results             986
batch_id              21948
first_name               15
last_name                15
company               41188
dtype: int64

In [7]:
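# Number of missing values per column
missing_count = historical_invoices_df.isnull().sum()
# Total number of rows (the same for every column)
value_count = len(historical_invoices_df)
# Share of missing values per column, as a percentage
missing_percentage = round(missing_count / value_count * 100, 2)
missing_percentage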

id                     0.00
payment_type           0.00
type                   0.00
account                0.00
account_expiration     0.00
time                   0.00
amount                 0.00
status                 0.00
response               0.00
responsetext           0.10
avs_results            4.40
csc_results            2.39
batch_id              53.27
first_name             0.04
last_name              0.04
company               99.97
dtype: float64

In [8]:
# Same percentage, restricted to columns with at least one missing value
missing_percentage = round(missing_count[missing_count > 0] / len(historical_invoices_df) * 100, 2)
missing_percentage

responsetext     0.10
avs_results      4.40
csc_results      2.39
batch_id        53.27
first_name       0.04
last_name        0.04
company         99.97
dtype: float64
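The same percentages can also be obtained in a single step, because the mean of a boolean mask is the fraction of `True` values. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for historical_invoices_df (values are made up).
toy_df = pd.DataFrame({
    "id": ["ID_1", "ID_2", "ID_3", "ID_4"],
    "batch_id": [1.0, np.nan, np.nan, 2.0],
    "company": [None, None, None, "Company_1"],
})

# isna() yields booleans; their column-wise mean is the fraction of missing values.
missing_pct = (toy_df.isna().mean() * 100).round(2)

# Keep only the columns that actually have gaps, worst first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Applied to `historical_invoices_df`, this should give the same figures as cell 8, only sorted in descending order.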

In [9]:
historical_invoices_df.columns

Index(['id', 'payment_type', 'type', 'account', 'account_expiration', 'time',
       'amount', 'status', 'response', 'responsetext', 'avs_results',
       'csc_results', 'batch_id', 'first_name', 'last_name', 'company'],
      dtype='object')

After reviewing the data, I have reached the following conclusions:
1. The `company` column is almost empty: 99.97% of its values are null, so I have decided to drop it.
2. I do not think the `batch_id` column is needed to build the subscriptions, since I understand it to be an identifier for a batch of invoices. In addition, more than 50% of its values are null.
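The notebook does not show the drop itself at this point. A minimal sketch of how the two columns could be removed, assuming `DataFrame.drop` is used and demonstrated on a made-up toy frame (`toy_df`, `columns_to_drop` and `cleaned_df` are hypothetical names):

```python
import pandas as pd

# Toy frame with the two columns flagged above (values are made up).
toy_df = pd.DataFrame({
    "id": ["ID_1", "ID_2"],
    "amount": [1.0, 2.0],
    "batch_id": [None, 337231373.0],
    "company": ["Company_1", None],
})

# Columns judged unnecessary for building the subscriptions.
columns_to_drop = ["company", "batch_id"]

# drop() returns a new frame; the original is left untouched.
cleaned_df = toy_df.drop(columns=columns_to_drop)
print(cleaned_df.columns.tolist())
```

Assigning the result to a new variable keeps the raw data available in case either column turns out to be useful later.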

## 2. Transform the Date Column

In [10]:
historical_invoices_df['time'] = pd.to_datetime(historical_invoices_df['time'], format='%Y%m%d%H%M%S') 
historical_invoices_df['time']

0       2018-12-07 00:06:04
1       2018-12-07 02:34:27
2       2018-12-07 00:12:02
3       2018-12-07 01:27:43
4       2018-12-08 02:22:22
                ...        
41196   2021-03-05 01:23:29
41197   2021-03-04 23:41:09
41198   2021-03-05 01:23:29
41199   2021-03-05 00:21:50
41200   2021-03-05 01:23:29
Name: time, Length: 41201, dtype: datetime64[ns]
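Once the column is a proper datetime, time arithmetic becomes possible. As an illustrative sketch, the snippet below parses the raw timestamps of the first two rows shown earlier (the `sale` and `settle` records of `ID_13732`) and computes the gap between them. Casting to `str` first is a defensive assumption so that the format string is applied to text; `raw_times`, `parsed` and `elapsed` are hypothetical names.

```python
import pandas as pd

# Raw integer timestamps of the first two rows (sale, then settle, of ID_13732).
raw_times = pd.Series([20181207000604, 20181207023427])

# Cast to str so the format string is matched character by character.
parsed = pd.to_datetime(raw_times.astype(str), format="%Y%m%d%H%M%S")

# Subtracting two timestamps yields a Timedelta.
elapsed = parsed.iloc[1] - parsed.iloc[0]
print(parsed)
print(elapsed)
```

The same subtraction, done per `id` over the whole dataset, would give the delay between each sale and its settlement.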