## Data Exploration: Client Lookup Data
In this notebook, we'll do some basic exploration of the client lookup file to better understand the domain and to assess whether any client features can be extracted. A client in this context is the retailer or wholesaler who buys bakery products from Grupo Bimbo.

This notebook requires the cliente_tabla.csv file to be in '../data'.

In [1]:
# Imports go here
import re
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# specify a data file path
data_location = "../data/"

### Getting Started
We'll load the file, produce some basic statistics, and look at the data.

In [2]:
# let's load the client table into a dataframe and produce a basic count of items and show the first 5 lines
client_lookup = "cliente_tabla.csv"
client_columns = ['ClientId','ClientName']
df_client = pd.read_csv(data_location + client_lookup,names=client_columns,skiprows=1)

# basic count of entries...
print("there are {} clients in the lookup table".format(len(df_client.index)))

# first 5 lines
print(df_client.head(5))

there are 935362 clients in the lookup table
   ClientId                               ClientName
0         0                               SIN NOMBRE
1         1                         OXXO XINANTECATL
2         2                               SIN NOMBRE
3         3                                EL MORENO
4         4  SDN SER  DE ALIM  CUERPO SA CIA  DE INT


### First Impressions
We can see that there is an Id identifying each client; whether it is unique per row still needs checking. There are a LOT of clients.

There is a Name field for the client which could be used to engineer additional client features, so that the model can find patterns and so that clients can be aggregated or clustered.

Let's look for duplicate entries, parse the name to try to create additional features, and save the output to a new file.

In [3]:
# check for duplicate client ids or client names
list_of_dupes_id = df_client.duplicated('ClientId')
list_of_dupes_name = df_client.duplicated('ClientName')

# if there are any duplicate Ids, remove all but one as linking to the training data will be flawed otherwise
# (summing a boolean series counts its True values)
number_duplicate_ids = list_of_dupes_id.sum()
number_duplicate_names = list_of_dupes_name.sum()
print("there are {} duplicate Client Ids and {} duplicate names.".format(number_duplicate_ids,number_duplicate_names))
df_client = df_client.drop_duplicates('ClientId')

there are 4862 duplicate Client Ids and 624207 duplicate names.
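
Before dropping duplicates it can help to see what the duplicated Ids look like. The sketch below uses a small made-up frame (the rows are illustrative, not taken from the real file) to show how `duplicated(keep=False)` flags every member of a duplicate group, while the default only flags the second and later occurrences, and how `drop_duplicates` keeps the first row seen for each Id.

```python
import pandas as pd

# toy data for illustration only; these rows are not from cliente_tabla.csv
toy = pd.DataFrame({
    'ClientId': [1, 1, 2, 3, 3],
    'ClientName': ['OXXO A', 'OXXO  A', 'SIN NOMBRE', 'EL MORENO', 'EL MORENO'],
})

# keep=False marks every row whose ClientId appears more than once
dupe_rows = toy[toy.duplicated('ClientId', keep=False)]

# the default keep='first' marks only the second and later occurrences
n_extra = int(toy.duplicated('ClientId').sum())

# drop_duplicates keeps the first row seen for each ClientId
deduped = toy.drop_duplicates('ClientId')

print(dupe_rows)
print("extra rows: {}, rows after de-duplication: {}".format(n_extra, len(deduped)))
```

Looking at `dupe_rows` on the real data would show whether repeated Ids carry identical names or only near-identical ones (for example, differing in whitespace), which affects which row is worth keeping.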


In [4]:
# let's re-engineer Client
data_file = "cliente_tabla.csv"
# data_location already ends with "/", so no extra separator is needed
file_in = data_location + data_file
file_out = data_location + "engineered_" + data_file

# open the input and output files; the with block closes them for us
with open(file_in, 'r') as input_file, open(file_out, 'w') as output_file:
    # iterate through the input file line by line
    for i, line in enumerate(input_file):
        if i == 0:
            # replace the original header line with the new one
            output_file.write("ID,ORIGINAL,CLIENT_NAME\n")
        else:
            # drop the line ending so one-word names don't carry a newline
            fields = line.rstrip('\r\n').split(',')

            # capture the id
            client_id = fields[0]

            # take the first text of the name before a space as the aggregated name
            client_name = fields[1].split(' ')[0]

            # the ORIGINAL column is a placeholder for the original client name
            # and is left blank for now
            output_file.write(client_id + "," + " ," + client_name + "\n")

# load the engineered client table
# (this re-reads every row of the raw file, so the duplicate Ids dropped earlier are back)
client_lookup = "engineered_cliente_tabla.csv"
client_columns = ['ClientId','Original','ClientName']
df_client = pd.read_csv(data_location + client_lookup,names=client_columns,skiprows=1)

# show the first few lines
print(df_client.head())

   ClientId Original ClientName
0         0                 SIN
1         1                OXXO
2         2                 SIN
3         3                  EL
4         4                 SDN
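
The same first-word feature can also be derived directly in pandas, without writing an intermediate file. This is a minimal sketch on made-up rows, not the approach used above; the `FirstWord` column name is an assumption chosen for illustration. Because it adds a column to the dataframe already in memory, earlier clean-up such as `drop_duplicates` stays in effect.

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    'ClientId': [0, 1, 3, 4],
    'ClientName': ['SIN NOMBRE', 'OXXO XINANTECATL', 'EL MORENO', 'OXXO'],
})

# split on a single space and keep the first piece, mirroring split(' ')[0]
toy['FirstWord'] = toy['ClientName'].str.split(' ').str[0]

print(toy)
```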


### Aggregation
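Now that we have a re-engineered client file, let's look at some aggregates for the new feature.

Most of the extracted names are rare: half of them occur only once and 75% occur at most twice. The distribution is heavily skewed, though: the top 1% of names each cover 246 or more clients, and the most common one covers 281,710.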

In [5]:
# value count stats
df_client['ClientName'].value_counts().describe(percentiles=[0.25,0.5,0.75,0.8,0.9,0.95,0.975,0.99])

count     36694.000000
mean         25.490816
std        1518.212939
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
80%           3.000000
90%          10.000000
95%          29.000000
97.5%        79.675000
99%         246.000000
max      281710.000000
Name: ClientName, dtype: float64
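
One way to see how concentrated the counts are is to look at the cumulative share of rows covered by the most common names. The following is a minimal sketch on made-up values (the series is illustrative, not the real data):

```python
import pandas as pd

# toy first-word tokens for illustration only
tokens = pd.Series(['NO'] * 6 + ['LA'] * 2 + ['OXXO', 'SUPER'])

# value_counts sorts from most to least frequent
counts = tokens.value_counts()

# fraction of all rows covered by each token, then the running total
share = counts / float(counts.sum())
cumulative_share = share.cumsum()

print(cumulative_share)
```

Applied to `df_client['ClientName']`, the first few entries of `cumulative_share` would show how much of the client base a handful of names accounts for.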

In [6]:
# finally let's look at some of the larger count values to see if they're useful...
print('\n Name has some good aggregates (OXXO, ESCUELA) but also quite a few bad ones (LA, NO)...')
print(df_client.groupby(by='ClientName', 
                        as_index=False)['ClientId'].count().sort_values(by='ClientId',ascending=False).head(10))


 Name has some good aggregates (OXXO, ESCUELA) but also quite a few bad ones (LA, NO)...
       ClientName  ClientId
24918          NO    281710
19286          LA     48268
255     ABARROTES     31544
11147          EL     26282
21666       MARIA     16042
23408  MISCELANEA     15542
25880        OXXO      8973
31397       SUPER      8403
20506         LOS      8156
23213        MINI      6130
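
Names such as LA, EL and LOS are articles, so the first word alone says little about the client. One possible refinement, sketched here under assumptions, is to skip such words and use the next one. The `STOPWORDS` list and the `first_meaningful_word` helper are hypothetical and would need tuning against the real names; a token like NO would need a look at the full names before deciding how to treat it.

```python
# hypothetical list of words that carry little meaning on their own
STOPWORDS = {'LA', 'EL', 'LOS', 'LAS'}

def first_meaningful_word(name):
    """Return the first word of name that is not in STOPWORDS."""
    words = name.split()
    for word in words:
        if word not in STOPWORDS:
            return word
    # every word was a stopword (or the name was empty): fall back to the first word
    return words[0] if words else ''

for example in ['LA ESPERANZA', 'EL MORENO', 'OXXO XINANTECATL']:
    print("{} -> {}".format(example, first_meaningful_word(example)))
```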


### Summary
The client lookup has a large number of entries (more than 900k), and the name attribute contains metadata that allows for some feature engineering.