# Exploring the Walmart Triptype Dataset


Walmart has defined a list of trip types and wants to classify every transaction into one of them. The trip types were developed
by Walmart data scientists over many years, and Walmart would now like an ML algorithm that automatically assigns a trip type
to sales transaction data.

### Source

This data source originally came from a [Kaggle Competition](https://www.kaggle.com/c/walmart-recruiting-trip-type-classification).

### Transaction

Note that this is transactional data.  Before running our analysis, we may want to pivot or roll up the
data so that all items for a given transaction end up in a single row.
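One way such a rollup could look is sketched below. It uses a tiny made-up frame (`toy`) rather than the real file, and the choice to sum `ScanCount` per department is an assumption for illustration, not necessarily the rollup used later.

```python
import pandas as pd

# Toy rows with the same column names as the training file (values are made up).
toy = pd.DataFrame({
    "VisitNumber": [5, 5, 5, 7, 7],
    "DepartmentDescription": ["DAIRY", "DAIRY", "PRODUCE", "SHOES", "SHOES"],
    "ScanCount": [1, 2, 1, 1, -1],
})

# One row per visit, one column per department, summing ScanCount.
# Departments a visit never touched are filled with 0.
visit_matrix = toy.pivot_table(
    index="VisitNumber",
    columns="DepartmentDescription",
    values="ScanCount",
    aggfunc="sum",
    fill_value=0,
)
print(visit_matrix)
```

Note that a purchase followed by a return of the same item nets out to zero under this aggregation, which may or may not be what we want.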


In [None]:
import pandas as pd
import numpy as np
import re 
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

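# `sc` (the SparkContext) is only defined on a PySpark kernel; skip the message otherwise.
if 'sc' in globals():
    print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])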

### Load the data into Pandas

In [None]:
data = pd.read_csv("/data/walmart-triptype/walmart-triptype-train.csv.gz")
w_test_data = pd.read_csv("/data/walmart-triptype/walmart-triptype-test.csv.gz")

## Exploring the Data

Here's a brief guide to the trip types.  Note that this isn't "official" from Walmart, so it's mostly a guide that we can use to help describe the merchandise.

In [None]:
triptypes = pd.read_csv("/data/walmart-triptype/triptypes.csv")
triptypes

### Examining the data

Note that the data here is transactional.  That means that each row in the table is a single line item within a transaction rather than a whole shopping trip.

Rows are listed on a per-item-bought basis, not on a per-sale basis as one might expect, so one customer trip spans a number of rows.
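To see how many rows a single trip occupies, the rows can be grouped by `VisitNumber`. This is a small sketch on made-up data; `toy` and its values are hypothetical stand-ins for the real frame.

```python
import pandas as pd

# Made-up line items: visit 5 has three rows, visit 7 one, visit 8 two.
toy = pd.DataFrame({
    "VisitNumber": [5, 5, 5, 7, 8, 8],
    "Upc": [111, 222, 333, 444, 555, 666],
})

# Number of rows (line items) recorded for each visit.
rows_per_visit = toy.groupby("VisitNumber").size()
print(rows_per_visit.describe())
```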

In [None]:
data.head()

In [None]:
len(data[data.TripType == 8].VisitNumber.unique())

### Columns

 * TripType - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category. 

 * VisitNumber - an id corresponding to a single trip by a single customer
 
 * Weekday - the weekday of the trip

 * Upc - the UPC number of the product purchased
 
 * ScanCount - the number of the given item that was purchased. A negative value indicates a product return.
 
 * DepartmentDescription - a high-level description of the item's department

 * FinelineNumber - a more refined category for each of the products, created by Walmart

### How many rows are there of the data?  

Are there any missing / NA values?

In [None]:
data.count()

The data has 647,054 rows. The only columns with missing data are: <br>
Upc (~4,000 missing values)<br>
FinelineNumber (same number of missing values as Upc)<br>
DepartmentDescription (~1,500 missing values)<br>
<br>
Preliminary thoughts: <br>
4,000 rows is a very small portion of the training data (about 0.6%), so I think it will be safe to simply remove any rows with missing data from our dataframe.
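The missing-value counts and the share of affected rows can also be computed directly instead of being read off `count()`. The sketch below runs on a made-up four-row frame (`toy` is hypothetical); on the real data the same two expressions would apply to `data`.

```python
import numpy as np
import pandas as pd

# Made-up frame in which one row is missing Upc, FinelineNumber and DepartmentDescription.
toy = pd.DataFrame({
    "Upc": [111.0, np.nan, 333.0, 444.0],
    "FinelineNumber": [10.0, np.nan, 30.0, 40.0],
    "DepartmentDescription": ["DAIRY", None, "SHOES", "DAIRY"],
    "ScanCount": [1, 1, -1, 2],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Fraction of rows that dropna() would remove (rows with at least one missing value).
share_rows_with_missing = toy.isna().any(axis=1).mean()

print(missing_per_column)
print(share_rows_with_missing)
```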


### Get some info about the triptypes.

In [None]:
data.TripType.unique()

In [None]:
len(data.TripType.unique())

So, there are 38 unique trip types. We will need to understand what 999 represents (it could be missing information). It would be interesting to do some preliminary visual exploration of this data.

### Get some info about the visit numbers.

In [None]:
len(data.VisitNumber.unique())

In [None]:
data.VisitNumber.max()

In [None]:
data.VisitNumber.min()

The data contains 94,247 unique store trips, as each visit number is the ID for a trip, and will be repeated for every item that is purchased on that trip.
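Since the visit number repeats on every row of a trip, it is worth confirming that each visit carries exactly one trip type. A minimal check on made-up data (the `toy` frame is hypothetical) might look like this:

```python
import pandas as pd

# Made-up rows: two visits, each repeated once per item.
toy = pd.DataFrame({
    "VisitNumber": [5, 5, 7, 7, 7],
    "TripType": [8, 8, 999, 999, 999],
})

# Count the distinct trip types seen within each visit.
types_per_visit = toy.groupby("VisitNumber")["TripType"].nunique()

# True when every visit maps to exactly one trip type.
all_consistent = bool((types_per_visit == 1).all())
print(all_consistent)
```

If this holds on the real data, keeping one row per `VisitNumber` is a safe way to count trips per trip type.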

What do visit numbers represent?

### Days of the Week

In [None]:
data.Weekday.unique()

Nothing weird for days of the week, but we should probably change them to numerals. We can number Monday to Sunday as 1 to 7.

### UPCs

In [None]:
data.Upc.unique()

In [None]:
data.Upc.min()

In [None]:
data.Upc.max()

Good, no negative Upc numbers.

### Look at Scan Counts

In [None]:
data.ScanCount.unique()

Not a lot of variation in scan counts.  There are a few large outliers, though, such as 51, 71, and -12.  These outliers might skew analysis.
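`unique()` shows which values occur but not how often. A frequency table makes it easier to judge how rare the outliers are. The sketch below uses made-up scan counts (`toy` is hypothetical):

```python
import pandas as pd

# Made-up scan counts with two extreme values.
toy = pd.DataFrame({"ScanCount": [1, 1, 1, 2, -1, 1, 51, -12]})

# Frequency of each scan count, ordered from most negative to most positive.
counts = toy["ScanCount"].value_counts().sort_index()
print(counts)
```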

### Look at Department Descriptions

In [None]:
data.DepartmentDescription.unique()

In [None]:
len(data.DepartmentDescription.unique())

Department Descriptions look fairly clean.

### Look at Fineline numbers

Finelines are product categories defined by Walmart.  No "key" to finelines is included in the dataset, and unlike trip types there are a very large number of them.

It remains to be seen whether this will prove to be a useful feature for classification.

In [None]:
len(data.FinelineNumber.unique())

In [None]:
data.FinelineNumber.max()

In [None]:
data.FinelineNumber.min()

The number zero for fineline is probably an "unknown" value of some sort.

In [None]:
data[data.FinelineNumber == 0].count()

In [None]:
fineline_is_zero = data[data.FinelineNumber == 0]

In [None]:
fineline_is_zero[fineline_is_zero.ScanCount == 1].count()  

In [None]:
fineline_is_zero[fineline_is_zero.ScanCount == -1].count()  

Almost all the fineline = 0 occurrences are when either 1 item was purchased or 1 item was returned. Not sure if this means anything, because this could simply be consistent with the share of 1 or -1 scan counts in the data overall.
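To tell whether the fineline = 0 rows are unusual, the share of single-item rows among them can be compared with the share in the whole dataset. A sketch on made-up numbers (`toy` is hypothetical, and its shares are not the real ones):

```python
import pandas as pd

# Made-up rows: four with fineline 0, four with other finelines.
toy = pd.DataFrame({
    "FinelineNumber": [0, 0, 0, 0, 100, 200, 300, 400],
    "ScanCount": [1, -1, 1, 2, 1, 1, 3, -1],
})

# True for rows where exactly one item was bought or returned.
is_single = toy["ScanCount"].abs() == 1

# Share of single-item rows overall and within the fineline = 0 subset.
share_overall = is_single.mean()
share_fineline_zero = is_single[toy["FinelineNumber"] == 0].mean()

print(share_overall, share_fineline_zero)
```

If the two shares are close on the real data, the pattern says nothing special about fineline 0.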

### Understanding the Fineline Numbers 

In [None]:
data_fineline_department = data[["DepartmentDescription", "FinelineNumber"]]

In [None]:
fineline_financial = data_fineline_department[data_fineline_department.DepartmentDescription == "FINANCIAL SERVICES"].FinelineNumber.value_counts()

In [None]:
fineline_financial.plot(kind="bar", rot=45, title="Fineline Numbers in Financial Services", color="midnightblue")

## Cleaning Data

Before loading the data for analysis, let's do a full cleanup: get rid of NAs, change days of the week to numeric, etc.

In [None]:
# Dropping rows with missing values

data = data.dropna()

In [None]:
data.count()

In [None]:
# Enumerate days of the week (Monday = 1 ... Sunday = 7), touching only the Weekday column

weekday_numbers = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
                   "Friday": 5, "Saturday": 6, "Sunday": 7}
data = data.assign(Weekday=data["Weekday"].map(weekday_numbers))


In [None]:
data.head()

## Data Analysis

Let's look at all the trip types one by one and see what kinds of insights we can get.
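The per-type cells in this section all follow the same pattern, so they could be wrapped in a small helper. The function names `top_departments` and `plot_top_departments` are hypothetical, and the example runs on a made-up frame rather than the real `data`.

```python
import pandas as pd

def top_departments(df, trip_type, n=5):
    """Return the n most frequent departments (by row count) for one trip type."""
    rows = df[df["TripType"] == trip_type]
    return rows["DepartmentDescription"].value_counts().head(n)

def plot_top_departments(df, trip_type, n=5):
    """Bar chart of the top departments, styled like the plots in this notebook."""
    counts = top_departments(df, trip_type, n)
    ax = counts.plot(kind="bar", rot=45,
                     title=f"Type {trip_type} Trips", color="midnightblue")
    ax.set_ylabel("Rows (line items)")
    return ax

# Made-up rows for two trip types.
toy = pd.DataFrame({
    "TripType": [7, 7, 7, 8, 8],
    "DepartmentDescription": ["PRODUCE", "PRODUCE", "DAIRY", "PERSONAL CARE", "PRODUCE"],
})

print(top_departments(toy, 7))
# In the notebook, plot_top_departments(data, 7) would draw the chart for type 7.
```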


In [None]:
x = data.TripType.unique()
np.sort(x)

In [None]:
data_triptypes = data.drop_duplicates("VisitNumber")

In [None]:
x = data_triptypes["TripType"]
x = x.value_counts()

In [None]:
graph = x.plot(kind="bar", figsize=(10, 5), color="midnightblue")
graph.set_title("Number of Occurrences by Trip Type")


Interesting: occurrences of different trip types are not evenly distributed. In fact, most trips fall into just a handful of trip types.

Thinking about types of trips:
- Types of items purchased
- Weekday vs. weekend, or day of week
- Returns vs. purchasing

Takeaways from types of trips:
- Many are focused on product category
- The only confusing trips revolved around groceries or included groceries

In [None]:
type_3 = data[data.TripType == 3]
type_3_items = type_3[["TripType","DepartmentDescription"]]
type_3_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                        title="Type 3 - Financial Services", color="midnightblue")
plt.xticks(fontsize=18)
plt.ylabel('Items Purchased', fontsize=16)


In [None]:
type_3_fineline = type_3.FinelineNumber.value_counts()
type_3_fineline_finance = type_3[type_3.DepartmentDescription == "FINANCIAL SERVICES"].FinelineNumber.value_counts()

In [None]:
type_3_fineline_finance.plot(kind="bar", rot=45, title="Financial Services Fineline Numbers for Type 3", color="midnightblue")

In [None]:
type_3_fineline.head(13).plot(kind="bar", rot=45, title="Top 13 Fineline Numbers for Type 3 (All Departments)", color="midnightblue")

It correlates as I suspected! The most frequent financial services fineline numbers - 

In [None]:
type_4 = data[data.TripType == 4]
type_4_items = type_4[["TripType","DepartmentDescription"]]
type_4_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, title="Type 4 Trips", color="midnightblue")

In [None]:
type_5 = data[data.TripType == 5]
type_5_items = type_5[["TripType","DepartmentDescription"]]
type_5_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 5 Trips", color="midnightblue")

Type 5 trips are also dominated by pharmacy over-the-counter items.<br>
They must differ from type 4 trips by another metric, like day of week or number of purchases.
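One way to test that idea is to roll the rows up to one record per visit and compare basket size and weekday mix between the two trip types. The sketch below uses made-up rows (`toy` is hypothetical). Weekday is kept as names here, but the same code works on the numeric encoding.

```python
import pandas as pd

# Made-up line items for two type 4 visits and two type 5 visits.
toy = pd.DataFrame({
    "TripType": [4, 4, 4, 5, 5, 5, 5, 5],
    "VisitNumber": [1, 1, 2, 3, 3, 3, 4, 4],
    "Weekday": ["Monday", "Monday", "Friday", "Saturday",
                "Saturday", "Saturday", "Sunday", "Sunday"],
    "ScanCount": [1, 1, 1, 2, 1, 1, 1, 1],
})

subset = toy[toy["TripType"].isin([4, 5])]

# One record per visit: its weekday and the net number of items scanned.
visits = (
    subset.groupby(["TripType", "VisitNumber"])
    .agg(Weekday=("Weekday", "first"), items=("ScanCount", "sum"))
    .reset_index()
)

# Average basket size per trip type.
mean_items = visits.groupby("TripType")["items"].mean()

# Share of each trip type's visits falling on each weekday (rows sum to 1).
weekday_share = pd.crosstab(visits["TripType"], visits["Weekday"], normalize="index")

print(mean_items)
print(weekday_share)
```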

In [None]:
type_6 = data[data.TripType == 6]
type_6_items = type_6[["TripType","DepartmentDescription"]]
type_6_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 6 Trips", color="midnightblue")

Type 6 trips are about the booze: alcohol + candy/tobacco/cookies + grocery/impulse merchandise.

In [None]:
type_7 = data[data.TripType == 7]
type_7_items = type_7[["TripType","DepartmentDescription"]]
type_7_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 7 Trips", color="midnightblue")

Type 7 trips are clearly grocery runs

In [None]:
type_8 = data[data.TripType == 8]
type_8_items = type_8[["TripType","DepartmentDescription"]]
type_8_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 8 Trips", color="midnightblue")

Type 8 trips, the most frequent, seem like all-purpose trips focused on grocery but also strong in personal care and impulse merchandise. They are not limited to grocery like type 7 trips and are twice as frequent. I wonder if the day, or number of items purchased, differs.

In [None]:
type_9 = data[data.TripType == 9]
type_9_items = type_9[["TripType","DepartmentDescription"]]
type_9_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 9 Trips", color="midnightblue")

All over the place here! Seems like this is a man shopping: men's clothing + automotive.

In [None]:
type_12 = data[data.TripType == 12]
type_12_items = type_12[["TripType","DepartmentDescription"]]
type_12_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 12 Trips", color="midnightblue")

All over the place again 

In [None]:
type_14 = data[data.TripType == 14]
type_14_items = type_14[["TripType","DepartmentDescription"]]
type_14_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 14 Trips", color="midnightblue")

fabrics and crafts trip - very infrequent - I call this the "Michael's" trip

In [None]:
type_15 = data[data.TripType == 15]
type_15_items = type_15[["TripType","DepartmentDescription"]]
x = type_15_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 15 Trips", color="midnightblue")
#x.text(3,2000,"lalalala", size=15)

PARTY trips!

In [None]:
type_18 = data[data.TripType == 18]
type_18_items = type_18[["TripType","DepartmentDescription"]]
type_18_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 18 Trips", color="midnightblue")
plt.ylabel('Items Purchased')

TOYS

In [None]:
type_19 = data[data.TripType == 19]
type_19_items = type_19[["TripType","DepartmentDescription"]]
x = type_19_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 19 Trips", color="midnightblue")

Electronics

In [None]:
type_20 = data[data.TripType == 20]
type_20_items = type_20[["TripType","DepartmentDescription"]]
x = type_20_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 20 Trips", color="midnightblue")


Automotive!

In [None]:
type_21 = data[data.TripType == 21]
type_21_items = type_21[["TripType","DepartmentDescription"]]
x = type_21_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 21 Trips", color="midnightblue")

Office supplies and fabrics/crafts - probably important which fabrics/crafts are being purchased

In [None]:
type_22 = data[data.TripType == 22]
type_22_items = type_22[["TripType","DepartmentDescription"]]
x = type_22_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 22 Trips", color="midnightblue")

Electronics + media and gaming -- probably different electronics than trip type 19, but that's the closest comparable

In [None]:
type_23 = data[data.TripType == 23]
type_23_items = type_23[["TripType","DepartmentDescription"]]
x = type_23_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 23 Trips", color="midnightblue")

players and electronics + media/gaming - how are players/electronics different than just electronics?

In [None]:
type_24 = data[data.TripType == 24]
type_24_items = type_24[["TripType","DepartmentDescription"]]
x = type_24_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 24 Trips", color="midnightblue")

Cook and dine - are these kitchen items? Looks like a Best Buy type trip

In [None]:
type_25 = data[data.TripType == 25]
type_25_items = type_25[["TripType","DepartmentDescription"]]
x = type_25_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 25 Trips", color="midnightblue")

Clothes trip, with more men's clothing being purchased

In [None]:
type_26 = data[data.TripType == 26]
type_26_items = type_26[["TripType","DepartmentDescription"]]
x = type_26_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 26 Trips", color="midnightblue")

hardware trip - home depot type trip

In [None]:
type_27 = data[data.TripType == 27]
type_27_items = type_27[["TripType","DepartmentDescription"]]
x = type_27_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 27 Trips", color="midnightblue")

lawn and garden + horticulture - home depot lawn and garden trip

In [None]:
type_28 = data[data.TripType == 28]
type_28_items = type_28[["TripType","DepartmentDescription"]]
x = type_28_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 28 Trips", color="midnightblue")

The sporting goods trip!

In [None]:
type_29 = data[data.TripType == 29]
type_29_items = type_29[["TripType","DepartmentDescription"]]
x = type_29_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 29 Trips", color="midnightblue")

Toys and sporting goods! Probably a kids trip, maybe focused on boys or a different age group?

In [None]:
type_30 = data[data.TripType == 30]
type_30_items = type_30[["TripType","DepartmentDescription"]]
x = type_30_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 30 Trips", color="midnightblue")

Shoe and jewelry

In [None]:
type_31 = data[data.TripType == 31]
type_31_items = type_31[["TripType","DepartmentDescription"]]
x = type_31_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 31 Trips", color="midnightblue")

Wireless technology (cellphones?)

In [None]:
type_32 = data[data.TripType == 32]
type_32_items = type_32[["TripType","DepartmentDescription"]]
x = type_32_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 32 Trips", color="midnightblue")

Baby products 

In [None]:
type_33 = data[data.TripType == 33]
type_33_items = type_33[["TripType","DepartmentDescription"]]
x = type_33_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 33 Trips", color="midnightblue")

household chemical supplies + paper goods

In [None]:
type_34 = data[data.TripType == 34]
type_34_items = type_34[["TripType","DepartmentDescription"]]
x = type_34_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 34 Trips", color="midnightblue")

Pet trip!

In [None]:
type_35 = data[data.TripType == 35]
type_35_items = type_35[["TripType","DepartmentDescription"]]
x = type_35_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 35 Trips", color="midnightblue")

DSD Groceries (Direct store delivery) - focus on brands?

In [None]:
type_36 = data[data.TripType == 36]
type_36_items = type_36[["TripType","DepartmentDescription"]]
x = type_36_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 36 Trips", color="midnightblue")

personal care + beauty

In [None]:
type_37 = data[data.TripType == 37]
type_37_items = type_37[["TripType","DepartmentDescription"]]
x = type_37_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 37 Trips", color="midnightblue")

Produce trips - another type of grocery trip

In [None]:
type_38 = data[data.TripType == 38]
type_38_items = type_38[["TripType","DepartmentDescription"]]
x = type_38_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 38 Trips", color="midnightblue")

Another grocery trip, with a focus on dairy (probably milk)

In [None]:
type_39 = data[data.TripType == 39]
type_39_items = type_39[["TripType","DepartmentDescription"]]
x = type_39_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 39 Trips", color="midnightblue")

Another grocery trip 

In [None]:
type_40 = data[data.TripType == 40]
type_40_items = type_40[["TripType","DepartmentDescription"]]
x = type_40_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 40 Trips", color="midnightblue")

Another grocery trip 

In [None]:
type_41 = data[data.TripType == 41]
type_41_items = type_41[["TripType","DepartmentDescription"]]
x = type_41_items.DepartmentDescription.value_counts().head(10).plot(kind="bar", rot=45, 
                                                              title="Type 41 Trips", color="midnightblue")

A mix - could this be a return trip? Also, not very frequent
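Whether a trip type is return-heavy can be checked from `ScanCount`, since a negative value indicates a return. A sketch on made-up rows (`toy` is hypothetical, and its shares say nothing about the real type 41):

```python
import pandas as pd

# Made-up line items for three trip types.
toy = pd.DataFrame({
    "TripType": [41, 41, 41, 999, 999, 7, 7],
    "ScanCount": [1, -1, 2, -1, -2, 1, 3],
})

# Share of rows per trip type that are returns (negative ScanCount), highest first.
return_share = (
    (toy["ScanCount"] < 0)
    .groupby(toy["TripType"])
    .mean()
    .sort_values(ascending=False)
)
print(return_share)
```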

In [None]:
type_42 = data[data.TripType == 42]
type_42_items = type_42[["TripType","DepartmentDescription"]]
x = type_42_items.DepartmentDescription.value_counts().head(10).plot(kind="bar", rot=45, 
                                                              title="Type 42 Trips", color="midnightblue")

Another mix

In [None]:
type_43 = data[data.TripType == 43]
type_43_items = type_43[["TripType","DepartmentDescription"]]
x = type_43_items.DepartmentDescription.value_counts().head(10).plot(kind="bar", rot=45, 
                                                              title="Type 43 Trips", color="midnightblue")

Another mix

In [None]:
type_44 = data[data.TripType == 44]
type_44_items = type_44[["TripType","DepartmentDescription"]]
x = type_44_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 44 Trips", color="midnightblue")

mix of typical stuff

In [None]:
type_999 = data[data.TripType == 999]
type_999_items = type_999[["TripType","DepartmentDescription"]]
x = type_999_items.DepartmentDescription.value_counts().head().plot(kind="bar", rot=45, 
                                                              title="Type 999 Trips", color="midnightblue")

"Others" are often financial services related