# Data Wrangling

## 2.1 Introduction 

The goal of this project is to build a recommendation model that suggests items frequently bought together.



## 2.2 Import

All libraries required for data wrangling are imported here.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

import os
#from library.sb_utils import save_file

#### Loading the file

In [3]:
retail_data_09_10 = pd.read_excel('../data/raw/online_retail_II.xlsx',sheet_name= 'Year 2009-2010')
retail_data_09_10.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


In [4]:
retail_data_09_10.shape

(525461, 8)

In [5]:
retail_data_10_11 = pd.read_excel('../data/raw/online_retail_II.xlsx',sheet_name= 'Year 2010-2011')
retail_data_10_11.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [6]:
retail_data_10_11.shape

(541910, 8)

#### Combine Data

The data is distributed across two different sheets and can be combined into a single dataframe for further analysis. 

In [7]:
retail_data = pd.concat([retail_data_09_10,retail_data_10_11]).reset_index()
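
Note that `reset_index()` without `drop=True` keeps each sheet's original row labels as a new `index` column, so those values repeat across the two sheets. As a minimal sketch of an alternative, using two tiny stand-in frames with made-up values rather than the real sheets, `ignore_index=True` builds a fresh index without the extra column:

```python
import pandas as pd

# Tiny stand-ins for the two sheets (hypothetical values, not the real data).
sheet_a = pd.DataFrame({"Invoice": [489434, 489435], "Quantity": [12, 48]})
sheet_b = pd.DataFrame({"Invoice": [536365, 536366], "Quantity": [6, 8]})

# ignore_index=True builds a fresh 0..n-1 index and adds no extra column.
combined = pd.concat([sheet_a, sheet_b], ignore_index=True)

# Same result via reset_index: drop=True discards the old row labels.
combined_alt = pd.concat([sheet_a, sheet_b]).reset_index(drop=True)
```

The rest of this notebook keeps the `index` column, so the outputs that follow still show it.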

In [8]:
retail_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 9 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   index        1067371 non-null  int64         
 1   Invoice      1067371 non-null  object        
 2   StockCode    1067371 non-null  object        
 3   Description  1062989 non-null  object        
 4   Quantity     1067371 non-null  int64         
 5   InvoiceDate  1067371 non-null  datetime64[ns]
 6   Price        1067371 non-null  float64       
 7   Customer ID  824364 non-null   float64       
 8   Country      1067371 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 73.3+ MB


In [9]:
retail_data.describe()

Unnamed: 0,index,Quantity,Price,Customer ID
count,1067371.0,1067371.0,1067371.0,824364.0
mean,266905.6,9.938898,4.649388,15324.638504
std,154171.5,172.7058,123.5531,1697.46445
min,0.0,-80995.0,-53594.36,12346.0
25%,133421.0,1.0,1.25,13975.0
50%,266842.0,3.0,2.1,15255.0
75%,400263.5,10.0,4.15,16797.0
max,541909.0,80995.0,38970.0,18287.0


## 2.3 Missing values by Column


In [10]:
retail_data.isna().sum()

index               0
Invoice             0
StockCode           0
Description      4382
Quantity            0
InvoiceDate         0
Price               0
Customer ID    243007
Country             0
dtype: int64

In [11]:
# Count and percentage of missing values per column, largest first.
missing = pd.concat([retail_data.isna().sum(), 100 * retail_data.isna().mean()], axis=1)
missing.columns = ['count', '%']
missing.sort_values(by='count', ascending=False)

Unnamed: 0,count,%
Customer ID,243007,22.766873
Description,4382,0.410541
index,0,0.0
Invoice,0,0.0
StockCode,0,0.0
Quantity,0,0.0
InvoiceDate,0,0.0
Price,0,0.0
Country,0,0.0


`Customer ID` has the most missing values, at about 23%. In addition, about 0.41% of rows are missing `Description`. The description is a required column because it tells us which item was sold to the customer, whereas `Customer ID` only uniquely identifies a particular customer.

In [12]:
retail_data.Description.unique()

array(['15CM CHRISTMAS GLASS BALL 20 LIGHTS', 'PINK CHERRY LIGHTS',
       ' WHITE CHERRY LIGHTS', ..., 'mixed up',
       'CREAM HANGING HEART T-LIGHT HOLDER',
       'PAPER CRAFT , LITTLE BIRDIE'], dtype=object)

In [13]:
nan_values = retail_data[retail_data.Description.isna()]
nan_values.head()

Unnamed: 0,index,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
470,470,489521,21646,,-50,2009-12-01 11:44:00,0.0,,United Kingdom
3114,3114,489655,20683,,-44,2009-12-01 17:26:00,0.0,,United Kingdom
3161,3161,489659,21350,,230,2009-12-01 17:39:00,0.0,,United Kingdom
3731,3731,489781,84292,,17,2009-12-02 11:45:00,0.0,,United Kingdom
4296,4296,489806,18010,,-770,2009-12-02 12:42:00,0.0,,United Kingdom


In [14]:
retail_data.StockCode.unique()

array([85048, '79323P', '79323W', ..., 23609, 23617, 23843], dtype=object)

In [15]:
stockCode = retail_data[retail_data.StockCode == 18010]
stockCode

Unnamed: 0,index,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
976,976,489545,18010,ASSORTED FRAGRANCE BATH CONFETTI,6,2009-12-01 12:22:00,0.42,17804.0,United Kingdom
4296,4296,489806,18010,,-770,2009-12-02 12:42:00,0.0,,United Kingdom
5341,5341,489856,18010,ASSORTED FRAGRANCE BATH CONFETTI,1,2009-12-02 14:36:00,0.83,,United Kingdom
5618,5618,489857,18010,ASSORTED FRAGRANCE BATH CONFETTI,1,2009-12-02 14:43:00,0.83,,United Kingdom
8385,8385,490074,18010,ASSORTED FRAGRANCE BATH CONFETTI,6,2009-12-03 14:39:00,0.83,,United Kingdom
26509,26509,491662,18010,ASSORTED FRAGRANCE BATH CONFETTI,24,2009-12-13 11:16:00,0.42,15169.0,United Kingdom
32790,32790,492075,18010,ASSORTED FRAGRANCE BATH CONFETTI,5,2009-12-15 13:21:00,0.42,17596.0,United Kingdom
41701,41701,492811,18010,ASSORTED FRAGRANCE BATH CONFETTI,10,2009-12-20 13:01:00,0.42,17085.0,United Kingdom
42423,42423,492993,18010,ASSORTED FRAGRANCE BATH CONFETTI,2,2009-12-21 15:54:00,0.42,14819.0,United Kingdom
73431,73431,496017,18010,ASSORTED FRAGRANCE BATH CONFETTI,9,2010-01-28 13:23:00,0.42,16326.0,United Kingdom


In [16]:
stockCode = retail_data[retail_data.StockCode == 84292]
stockCode

Unnamed: 0,index,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
3731,3731,489781,84292,,17,2009-12-02 11:45:00,0.0,,United Kingdom
8231,8231,490074,84292,ROSE PINK METAL FOLDING CHAIR,1,2009-12-03 14:39:00,26.04,,United Kingdom
22707,22707,491155,84292,ROSE PINK METAL FOLDING CHAIR,1,2009-12-10 09:49:00,26.04,,United Kingdom
94320,94320,498261,84292,ROSE PINK METAL FOLDING CHAIR,2,2010-02-17 14:47:00,7.95,16938.0,United Kingdom
114259,114259,500322,84292,ROSE PINK METAL FOLDING CHAIR,2,2010-03-07 12:06:00,7.95,16798.0,United Kingdom
131030,131030,501850,84292,ROSE PINK METAL FOLDING CHAIR,4,2010-03-21 10:50:00,7.95,13214.0,United Kingdom
135130,135130,502280,84292,ROSE PINK METAL FOLDING CHAIR,4,2010-03-23 16:45:00,7.95,14156.0,EIRE
138988,138988,502739,84292,ROSE PINK METAL FOLDING CHAIR,4,2010-03-26 12:19:00,7.95,16170.0,United Kingdom
148231,148231,503515,84292,ROSE PINK METAL FOLDING CHAIR,4,2010-04-01 12:54:00,7.95,13196.0,United Kingdom
156012,156012,C504185,84292,ROSE PINK METAL FOLDING CHAIR,-1,2010-04-12 09:37:00,7.95,13196.0,United Kingdom


Since only 0.41% of the rows are missing a description, we can safely drop them. The sampled rows above also have a price of 0.0 and no `Customer ID`, so they carry little usable information.

In [17]:
clean_retail_data = retail_data[~(retail_data.Description.isna())]
clean_retail_data.head()

Unnamed: 0,index,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
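
The `StockCode` lookups above suggest that a missing `Description` can sometimes be recovered from other rows with the same code. Dropping the rows is simpler, but as an alternative, here is a minimal sketch on toy data. The helper name `fill_description_from_stock_code` is hypothetical, and the sketch assumes the most frequent description for a code is the right one. It fills each gap from that lookup and drops only what remains:

```python
import pandas as pd

def fill_description_from_stock_code(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Description values with the most common description
    seen for the same StockCode (hypothetical helper, not from the notebook)."""
    known = df.dropna(subset=["Description"])
    # Most frequent description per stock code; sort=False avoids
    # ordering keys that mix integers and strings.
    lookup = known.groupby("StockCode", sort=False)["Description"].agg(
        lambda s: s.mode().iloc[0]
    )
    out = df.copy()
    out["Description"] = out["Description"].fillna(out["StockCode"].map(lookup))
    # Codes that never appear with a description are still dropped.
    return out.dropna(subset=["Description"])

# Toy rows: one recoverable gap (18010) and one unrecoverable gap (99999).
toy = pd.DataFrame({
    "StockCode": [18010, 18010, "79323P", 99999],
    "Description": ["ASSORTED FRAGRANCE BATH CONFETTI", None,
                    "PINK CHERRY LIGHTS", None],
})
filled = fill_description_from_stock_code(toy)
```

On the real data the same code may carry several different descriptions, so the recovered values would need a manual check.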


In [24]:
# Investigate what a negative Quantity means by inspecting one such invoice.
invoice_data = clean_retail_data[clean_retail_data.Invoice == 'C504185']
invoice_data.head()

Unnamed: 0,index,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
156012,156012,C504185,84292,ROSE PINK METAL FOLDING CHAIR,-1,2010-04-12 09:37:00,7.95,13196.0,United Kingdom
156013,156013,C504185,82607A,GLASS BONBON JAR PARFUME AUX FLEURE,-1,2010-04-12 09:37:00,1.25,13196.0,United Kingdom
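
The negative quantities on this invoice coincide with the `C` prefix in `Invoice`, which appears to mark a cancelled transaction. Here is a minimal sketch for separating such rows. It uses toy data and assumes the `C` prefix is the only cancellation marker:

```python
import pandas as pd

# Toy rows modelled on the invoices shown above (values are illustrative).
toy = pd.DataFrame({
    "Invoice": [489434, "C504185", "C504185", 536365],
    "Quantity": [12, -1, -1, 6],
})

# Invoice mixes integers and strings, so cast to str before matching.
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")

cancellations = toy[is_cancelled]
purchases = toy[~is_cancelled]
```

Negative quantities on invoices without the prefix, such as the missing-description rows seen earlier, are not caught by this check.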


In [75]:
clean_retail_data.shape

(1062989, 9)

In [76]:
clean_retail_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1062989 entries, 0 to 1067370
Data columns (total 9 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   index        1062989 non-null  int64         
 1   Invoice      1062989 non-null  object        
 2   StockCode    1062989 non-null  object        
 3   Description  1062989 non-null  object        
 4   Quantity     1062989 non-null  int64         
 5   InvoiceDate  1062989 non-null  datetime64[ns]
 6   Price        1062989 non-null  float64       
 7   Customer ID  824364 non-null   float64       
 8   Country      1062989 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 81.1+ MB


## 2.4 Explore Data

In [78]:
clean_retail_data.Country.value_counts()

United Kingdom          976948
EIRE                     17866
Germany                  17624
France                   14330
Netherlands               5140
Spain                     3811
Switzerland               3189
Belgium                   3123
Portugal                  2620
Australia                 1913
Channel Islands           1664
Italy                     1534
Norway                    1455
Sweden                    1364
Cyprus                    1176
Finland                   1049
Austria                    938
Denmark                    817
Unspecified                756
Greece                     663
Japan                      582
Poland                     535
USA                        535
United Arab Emirates       500
Israel                     371
Hong Kong                  364
Singapore                  346
Malta                      299
Iceland                    253
Canada                     228
Lithuania                  189
RSA                        169
Bahrain 

In [86]:
clean_retail_data.Description.nunique()

5698

In [34]:
# Records with a Price of 0.
zero_price = clean_retail_data[clean_retail_data.Price == 0.0]

zero_price.groupby(by='Price').size().reset_index(name='counts')

Unnamed: 0,Price,counts
0,0.0,1820


#### Summary from the observed data.

* The dataset contains 5698 unique product descriptions, and `0.41%` of the original rows were missing a description.
* The `Country` column has 43 unique values, and 756 records are associated with the `Unspecified` country.
* 1820 records have `Price` set to 0.0.
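
The summary figures above can be collected in one place. Here is a minimal sketch on toy data. The helper name `summarize` and its dictionary keys are hypothetical:

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the counts reported in the summary (hypothetical helper)."""
    return {
        "unique_descriptions": int(df["Description"].nunique()),
        "unique_countries": int(df["Country"].nunique()),
        "unspecified_country_rows": int((df["Country"] == "Unspecified").sum()),
        "zero_price_rows": int((df["Price"] == 0.0).sum()),
    }

# Toy rows with made-up prices, only to exercise the helper.
toy = pd.DataFrame({
    "Description": ["PINK CHERRY LIGHTS", "WHITE CHERRY LIGHTS", "PINK CHERRY LIGHTS"],
    "Country": ["United Kingdom", "Unspecified", "EIRE"],
    "Price": [6.75, 0.0, 6.75],
})
summary = summarize(toy)
```

Calling `summarize(clean_retail_data)` would be the corresponding check on the cleaned frame.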
 