#### Information
This is a transnational data set containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

#### Features Descriptions
1. **InvoiceNo**: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
2. **StockCode**: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
3. **Description**: Product (item) name. Nominal.
4. **Quantity**: The quantities of each product (item) per transaction. Numeric.	
5. **InvoiceDate**: Invoice date and time. Numeric, the day and time when each transaction was generated.
6. **UnitPrice**: Unit price. Numeric, Product price per unit in sterling.
7. **CustomerID**: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
8. **Country**: Country name. Nominal, the name of the country where each customer resides.

## Preparing Environment

In [1]:
import pandas as pd 
import datetime as dt 
from IPython.display import display


pd.set_option('display.max_columns', None)

data_uncleaned=pd.read_excel("../Data/Online Retail.xlsx")
data=data_uncleaned.copy()

## Exploring Data

### General Exploration

In [2]:
#exploring the shape of dataset and viewing first 10 rows of dataset

print("Shape of data: ", data.shape,"\n")
display(data.head(10))

Shape of data:  (541909, 8) 



Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
5,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,2010-12-01 08:26:00,7.65,17850.0,United Kingdom
6,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
7,536366,22633,HAND WARMER UNION JACK,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
8,536366,22632,HAND WARMER RED POLKA DOT,6,2010-12-01 08:28:00,1.85,17850.0,United Kingdom
9,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32,2010-12-01 08:34:00,1.69,13047.0,United Kingdom


In [3]:
# Checking data type of each feature and presence of any null values
# (info() prints its report and returns None, so it is not wrapped in display())
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

Columns 'Description' and 'CustomerID' contain null values

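Also, 'CustomerID' is detected as 'float64' when it should be an object (string) type: it is an identifier rather than a quantity, and pandas falls back to float64 because the column contains missing values.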
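One possible way to fix the 'CustomerID' type is sketched below on a small toy frame (the values are illustrative, not taken from the full data set). It assumes that going through the nullable `Int64` dtype first is acceptable, so that the trailing `.0` is dropped and missing IDs stay missing instead of becoming the text "nan".

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({"CustomerID": [17850.0, 13047.0, None]})

# float64 -> nullable Int64 removes the ".0"; Int64 -> string keeps <NA> as missing.
toy["CustomerID"] = toy["CustomerID"].astype("Int64").astype("string")

print(toy["CustomerID"].tolist())
print(toy["CustomerID"].isna().sum())
```

The same two-step cast could be applied to `data['CustomerID']`, provided every non-missing ID is a whole number.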

In [4]:
# checking the number of null values
data.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [5]:
# Checking unique values
data.nunique()

InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64

According to the data description, 'StockCode' is the code assigned to each distinct product and 'Description' is the product's name, so in theory the two columns should have the same number of unique values. However, 'Description' has more unique values (4223) than 'StockCode' (4070), which suggests inconsistent or erroneous entries that are worth investigating.
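A minimal sketch of how such a mismatch could be investigated: group by stock code and count the distinct descriptions attached to each code. The toy frame and its values are made up for illustration; the stock codes are cast to `str` because the real column mixes integers and strings.

```python
import pandas as pd

# Toy frame; codes and descriptions are hypothetical.
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "B2", "C3"],
    "Description": ["RED MUG", "RED MUG ", "BLUE BOWL", "damaged", "GREEN CUP"],
})

# Number of distinct descriptions recorded for each stock code.
desc_per_code = toy.groupby(toy["StockCode"].astype(str))["Description"].nunique()

# Codes that map to more than one description are candidates for cleaning.
ambiguous = desc_per_code[desc_per_code > 1]
print(ambiguous)
```

In this toy example a trailing space and a free-text note ("damaged") are enough to make one code look like two products.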

In [6]:
# Exploring numerical columns in dataset
display(data.describe())

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


**NOTE**: Ignore the 'CustomerID' column here; it is an identifier that was wrongly detected as float64 instead of object, so its summary statistics are not meaningful.

We can observe that the minimum values of 'Quantity' and 'UnitPrice' are negative. This could be the result of cancelled or returned orders, but either way it is worth investigating further.
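Before looking at individual rows, it can help to count how many values are negative in each numeric column. The sketch below does this on a toy frame with illustrative values.

```python
import pandas as pd

# Toy rows standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Quantity": [6, -1, -10, 1],
    "UnitPrice": [2.55, 27.5, 0.0, -11062.06],
})

# Comparing with 0 gives a boolean frame; summing counts the True cells per column.
negative_counts = (toy < 0).sum()
print(negative_counts)
```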

### Exploring 'Quantity' column

In [7]:
# Exploring negative quantity
display(data[data['Quantity']<0].head(10))
display(data[data['Quantity']<0].tail(10))

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
238,C536391,21980,PACK OF 12 RED RETROSPOT TISSUES,-24,2010-12-01 10:24:00,0.29,17548.0,United Kingdom
239,C536391,21484,CHICK GREY HOT WATER BOTTLE,-12,2010-12-01 10:24:00,3.45,17548.0,United Kingdom
240,C536391,22557,PLASTERS IN TIN VINTAGE PAISLEY,-12,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
241,C536391,22553,PLASTERS IN TIN SKULLS,-24,2010-12-01 10:24:00,1.65,17548.0,United Kingdom
939,C536506,22960,JAM MAKING SET WITH JARS,-6,2010-12-01 12:38:00,4.25,17897.0,United Kingdom


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
540141,C581468,21314,SMALL GLASS HEART TRINKET POT,-10,2011-12-08 19:26:00,2.1,13599.0,United Kingdom
540142,C581468,22098,BOUDOIR SQUARE TISSUE BOX,-12,2011-12-08 19:26:00,0.39,13599.0,United Kingdom
540176,C581470,23084,RABBIT NIGHT LIGHT,-4,2011-12-08 19:28:00,2.08,17924.0,United Kingdom
540422,C581484,23843,"PAPER CRAFT , LITTLE BIRDIE",-80995,2011-12-09 09:27:00,2.08,16446.0,United Kingdom
540448,C581490,22178,VICTORIAN GLASS HANGING T-LIGHT,-12,2011-12-09 09:57:00,1.95,14397.0,United Kingdom
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,2011-12-09 09:57:00,0.83,14397.0,United Kingdom
541541,C581499,M,Manual,-1,2011-12-09 10:28:00,224.69,15498.0,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,2011-12-09 11:57:00,10.95,15311.0,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,2011-12-09 11:58:00,1.25,17315.0,United Kingdom
541717,C581569,20979,36 PENCILS TUBE RED RETROSPOT,-5,2011-12-09 11:58:00,1.25,17315.0,United Kingdom


In [8]:
# InvoiceNo mixes integers and strings, so cast to str before testing the 'C' prefix
print("Number of entries of canceled orders: ", data['InvoiceNo'].astype(str).str.startswith('C').sum())
print("Number of entries of orders with negative quantity: ", (data['Quantity']<0).sum())

Number of entries of canceled orders:  9288
Number of entries of orders with negative quantity:  10624


As expected, most of the negative values in 'Quantity' come from cancelled orders, discounts and similar entries. It might be worth keeping a separate record of these orders to study cancellations or to understand total profit and loss. However, there are more entries with negative quantity (10624) than entries belonging to cancelled orders (9288).
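One way to see exactly where the two counts diverge is a cross-tabulation of "is cancelled" against "has negative quantity". The following is a sketch on a toy frame; the label strings are hypothetical names chosen for readability.

```python
import pandas as pd

# Toy rows; invoice numbers mix integers and strings as in the real column.
toy = pd.DataFrame({
    "InvoiceNo": ["C536379", 536365, 536589, "C536391"],
    "Quantity": [-1, 6, -10, -12],
})

# Readable labels instead of True/False (label names are hypothetical).
cancelled = (
    toy["InvoiceNo"].astype(str).str.startswith("C")
    .map({True: "cancelled", False: "not cancelled"})
)
negative = (toy["Quantity"] < 0).map({True: "negative", False: "non-negative"})

# Rows: cancellation status, columns: sign of the quantity.
overlap = pd.crosstab(cancelled, negative)
print(overlap)
```

The "not cancelled" / "negative" cell holds the rows that explain the gap between the two counts.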

In [9]:
display(data[(data['Quantity']<0) & (data['InvoiceNo'].str.contains('C', na=False)==False)].head(10))

display(data[(data['Quantity']<0) & (data['InvoiceNo'].str.contains('C', na=False)==False)].tail(10))

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
2406,536589,21777,,-10,2010-12-01 16:50:00,0.0,,United Kingdom
4347,536764,84952C,,-38,2010-12-02 14:42:00,0.0,,United Kingdom
7188,536996,22712,,-20,2010-12-03 15:30:00,0.0,,United Kingdom
7189,536997,22028,,-20,2010-12-03 15:30:00,0.0,,United Kingdom
7190,536998,85067,,-6,2010-12-03 15:30:00,0.0,,United Kingdom
7192,537000,21414,,-22,2010-12-03 15:32:00,0.0,,United Kingdom
7193,537001,21653,,-6,2010-12-03 15:33:00,0.0,,United Kingdom
7195,537003,85126,,-2,2010-12-03 15:33:00,0.0,,United Kingdom
7196,537004,21814,,-30,2010-12-03 15:34:00,0.0,,United Kingdom
7197,537005,21692,,-70,2010-12-03 15:35:00,0.0,,United Kingdom


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
535327,581204,85104,????damages????,-355,2011-12-07 18:32:00,0.0,,United Kingdom
535328,581205,20893,damages,-55,2011-12-07 18:32:00,0.0,,United Kingdom
535329,581206,21693,mixed up,-87,2011-12-07 18:34:00,0.0,,United Kingdom
535330,581207,21688,mixed up,-337,2011-12-07 18:34:00,0.0,,United Kingdom
535331,581208,72801C,check,-10,2011-12-07 18:35:00,0.0,,United Kingdom
535333,581210,23395,check,-26,2011-12-07 18:36:00,0.0,,United Kingdom
535335,581212,22578,lost,-1050,2011-12-07 18:38:00,0.0,,United Kingdom
535336,581213,22576,check,-30,2011-12-07 18:38:00,0.0,,United Kingdom
536908,581226,23090,missing,-338,2011-12-08 09:56:00,0.0,,United Kingdom
538919,581422,23169,smashed,-235,2011-12-08 15:24:00,0.0,,United Kingdom


On checking the orders that were not cancelled but still had negative values for 'Quantity', we can observe that the items were lost, mixed up, damaged, etc. In other words, they represent a loss to the company.

It might be worth keeping a separate record of all such orders, since it could help to analyse profit and loss based on these transactions.
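The separate records suggested above could be built with boolean masks, as in this sketch. The frame names (`cancelled_orders`, `stock_losses`, `sales`) are hypothetical, and the toy rows are illustrative.

```python
import pandas as pd

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": ["C536379", 536365, 536589, "C536391"],
    "Quantity": [-1, 6, -10, -12],
})

is_cancel = toy["InvoiceNo"].astype(str).str.startswith("C")
is_negative = toy["Quantity"] < 0

# Three non-overlapping subsets of the data.
cancelled_orders = toy[is_cancel]
stock_losses = toy[is_negative & ~is_cancel]   # lost, damaged, mixed up, ...
sales = toy[~is_negative & ~is_cancel]

print(len(cancelled_orders), len(stock_losses), len(sales))
```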

### Exploring 'UnitPrice' Column

In [10]:
# Exploring negative unit price
display(data[data['UnitPrice']<0].head())

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
299983,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom
299984,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom


The negative unit prices seem to be bad-debt adjustments recorded with stock code 'B'.

In [11]:
display(data[data['StockCode']=='B'].head())

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
299982,A563185,B,Adjust bad debt,1,2011-08-12 14:50:00,11062.06,,United Kingdom
299983,A563186,B,Adjust bad debt,1,2011-08-12 14:51:00,-11062.06,,United Kingdom
299984,A563187,B,Adjust bad debt,1,2011-08-12 14:52:00,-11062.06,,United Kingdom


We can observe that there was another transaction with the same stock code and the same amount, but with a positive sign. This could be a data-entry error.
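To check whether these adjustments cancel one another out, the net amount of the three 'B' rows shown in the output above can be computed. This is only a sketch of the arithmetic; it does not settle whether the positive entry is an error.

```python
import pandas as pd

# The three rows with StockCode 'B' from the output above.
bad_debt = pd.DataFrame({
    "InvoiceNo": ["A563185", "A563186", "A563187"],
    "Quantity": [1, 1, 1],
    "UnitPrice": [11062.06, -11062.06, -11062.06],
})

# Net amount = sum of quantity * unit price over the three rows.
net = (bad_debt["Quantity"] * bad_debt["UnitPrice"]).sum()
print(round(net, 2))
```

One positive and two negative entries of the same size leave a negative net amount, so the rows do not simply offset each other.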



### Exploring 'Description' Column

In [12]:
display(len(data['Description'].unique()))

display(data['Description'].unique())

4224

array(['WHITE HANGING HEART T-LIGHT HOLDER', 'WHITE METAL LANTERN',
       'CREAM CUPID HEARTS COAT HANGER', ..., 'lost',
       'CREAM HANGING HEART T-LIGHT HOLDER',
       'PAPER CRAFT , LITTLE BIRDIE'], dtype=object)

We can notice that there are 4224 unique values in the 'Description' column (one more than the 4223 reported by `nunique()` earlier, because `unique()` also counts the missing value). As noted before, some of these values are not product names but additional information, such as whether the order went missing, was damaged, got wet, etc.
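The one-value difference between `len(data['Description'].unique())` and the earlier `data.nunique()` result can be reproduced on a tiny series: `nunique()` skips missing values by default, while `unique()` returns them.

```python
import pandas as pd

# Tiny series with one missing value; contents are illustrative.
s = pd.Series(["A", "B", None, "A"])

n_without_missing = s.nunique()             # missing value ignored
n_with_missing = len(s.unique())            # missing value counted
n_dropna_false = s.nunique(dropna=False)    # same as counting it explicitly

print(n_without_missing, n_with_missing, n_dropna_false)
```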

The true product names are mostly stored in upper case, while the other information is stored in either title case or lower case.

In [13]:
#creating a new dataframe to hold the list of all different unique values in 'Description' column
desc_unique=pd.DataFrame(data['Description'].unique().astype(str), columns=['Desc_unique'])


In [14]:
#viewing all uppercase descriptions (majority are product names)
display(desc_unique[desc_unique['Desc_unique'].str.isupper()==True])

#viewing all lowercase description
display(desc_unique[desc_unique['Desc_unique'].str.islower()==True])

#viewing all titlecase description
display(desc_unique[desc_unique['Desc_unique'].str.istitle()==True])

#viewing all other types of case in description
display(desc_unique[(desc_unique['Desc_unique'].str.isupper()==False)&(desc_unique['Desc_unique'].str.islower()==False)&(desc_unique['Desc_unique'].str.istitle()==False)])

Unnamed: 0,Desc_unique
0,WHITE HANGING HEART T-LIGHT HOLDER
1,WHITE METAL LANTERN
2,CREAM CUPID HEARTS COAT HANGER
3,KNITTED UNION FLAG HOT WATER BOTTLE
4,RED WOOLLY HOTTIE WHITE HEART.
...,...
4210,SET 10 CARDS SNOWY ROBIN 17099
4212,SET 10 CARDS SWIRLY XMAS TREE 17104
4216,"LETTER ""U"" BLING KEY RING"
4222,CREAM HANGING HEART T-LIGHT HOLDER


Unnamed: 0,Desc_unique
395,
1740,amazon
2155,check
2157,damages
2394,faulty
...,...
4217,wet
4218,wet boxes
4219,????damages????
4220,mixed up


Unnamed: 0,Desc_unique
111,Discount
1108,Manual
1537,Bank Charges
2373,*Boombox Ipod Classic
2722,Dotcomgiftshop Gift Voucher £40.00
2727,Found
2754,Dotcomgiftshop Gift Voucher £50.00
2781,Dotcomgiftshop Gift Voucher £30.00
2782,Dotcomgiftshop Gift Voucher £20.00
2910,Dotcom


Unnamed: 0,Desc_unique
326,BAG 500g SWIRLY MARBLES
550,POLYESTER FILLER PAD 45x45cm
1037,BAG 125g SWIRLY MARBLES
1038,BAG 250g SWIRLY MARBLES
1052,POLYESTER FILLER PAD 45x30cm
1053,POLYESTER FILLER PAD 40x40cm
1061,FRENCH BLUE METAL DOOR SIGN No
1187,Dr. Jam's Arouzer Stress Ball
1367,3 TRADITIONAl BISCUIT CUTTERS SET
1582,NUMBER TILE COTTAGE GARDEN No


In [15]:
data[data['StockCode']=='PADS']

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
157195,550193,PADS,PADS TO MATCH ALL CUSHIONS,1,2011-04-15 09:27:00,0.001,13952.0,United Kingdom
279045,561226,PADS,PADS TO MATCH ALL CUSHIONS,1,2011-07-26 10:13:00,0.001,15618.0,United Kingdom
358655,568158,PADS,PADS TO MATCH ALL CUSHIONS,1,2011-09-25 12:22:00,0.0,16133.0,United Kingdom
359871,568200,PADS,PADS TO MATCH ALL CUSHIONS,1,2011-09-25 14:58:00,0.001,16198.0,United Kingdom


We can observe that there are a lot of non-product-name entries in the 'Description' column, and there also seem to be some typing errors.
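A possible way to act on this observation is to tag every description with its letter case, so that the non-upper-case entries can be reviewed in one place. The helper `case_kind` and its labels are hypothetical names introduced for this sketch; the sample strings come from the outputs above.

```python
def case_kind(text: str) -> str:
    """Label a description by its letter case (labels are hypothetical)."""
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    if text.istitle():
        return "title"
    return "mixed"


# Sample descriptions taken from the tables above.
samples = ["WHITE METAL LANTERN", "damages", "Discount", "BAG 500g SWIRLY MARBLES"]
kinds = {text: case_kind(text) for text in samples}

for text, kind in kinds.items():
    print(f"{kind:>5}  {text}")
```

On the real data the helper could be applied with `data['Description'].dropna().map(case_kind)`. Note that "mixed" entries such as 'BAG 500g SWIRLY MARBLES' are genuine product names, so case alone is only a first filter.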

### Exploring 'StockCode' Column

In [16]:
#total number of different products based on StockCode
display(len(data['StockCode'].unique()))

display(data['StockCode'].unique())

4070

array(['85123A', 71053, '84406B', ..., '90214U', '47591b', 23843],
      dtype=object)

We can observe that there are a total of 4070 unique entries in the 'StockCode' column. Judging from this preview, they are either purely numeric or alphanumeric values.

In [17]:
#creating a dataframe of all unique values in 'StockCode' column
stockcode_unique=pd.DataFrame(data['StockCode'].unique().astype(str), columns=['StockCode_unique'])

In [47]:
#viewing all purely numeric stock codes
display(data[data['StockCode'].isin(stockcode_unique[stockcode_unique['StockCode_unique'].str.isnumeric()==True].values.tolist())])

#viewing all purely alpha numeric stock codes
display(stockcode_unique[(stockcode_unique['StockCode_unique'].str.isalnum()==True)&(stockcode_unique['StockCode_unique'].str.isalpha()==False)&(stockcode_unique['StockCode_unique'].str.isnumeric()==False)])

#viewing all purely alphabetical stock codes
display(stockcode_unique[stockcode_unique['StockCode_unique'].str.isalpha()==True])

#viewing all other stock codes
display(stockcode_unique[(stockcode_unique['StockCode_unique'].str.isalnum()==False)&(stockcode_unique['StockCode_unique'].str.isalpha()==False)&(stockcode_unique['StockCode_unique'].str.isnumeric()==False)])

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country


Unnamed: 0,StockCode_unique
0,85123A
2,84406B
3,84029G
4,84029E
55,82494L
...,...
4055,84971l
4056,85034b
4065,85179a
4067,90214U


Unnamed: 0,StockCode_unique
45,POST
111,D
952,DOT
1115,M
2242,S
2245,AMAZONFEE
2791,m
3078,DCGSSBOY
3079,DCGSSGIRL
3328,PADS


Unnamed: 0,StockCode_unique
1542,BANK CHARGES
2774,gift_0001_40
2815,gift_0001_50
2842,gift_0001_30
2843,gift_0001_20
3197,gift_0001_10
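The first table above (the purely numeric stock codes) is empty even though numeric codes such as 71053 clearly exist. Two things in that cell probably contribute: `.values.tolist()` on a one-column DataFrame returns a nested list (`[['71053'], ...]`), and the unique codes were cast to strings while the original column still holds integers, so `isin` finds no match. A sketch of one alternative, on a toy frame with illustrative codes, is to build the mask directly from the column:

```python
import pandas as pd

# Toy frame mixing integer and string codes, like the real column.
toy = pd.DataFrame({"StockCode": ["85123A", 71053, "POST", "BANK CHARGES", 22752]})

# Cast once to str so the string methods work on every row.
codes = toy["StockCode"].astype(str)

# The boolean mask shares the frame's index, so it can filter the rows directly.
numeric_rows = toy[codes.str.isnumeric()]
other_rows = toy[~codes.str.isalnum()]   # e.g. codes with spaces or underscores

print(numeric_rows)
print(other_rows)
```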


### Exploring 'Country' column

In [18]:
#checking 'Country' column to look for any errors
data['Country'].value_counts().sort_index()

Australia                 1259
Austria                    401
Bahrain                     19
Belgium                   2069
Brazil                      32
Canada                     151
Channel Islands            758
Cyprus                     622
Czech Republic              30
Denmark                    389
EIRE                      8196
European Community          61
Finland                    695
France                    8557
Germany                   9495
Greece                     146
Hong Kong                  288
Iceland                    182
Israel                     297
Italy                      803
Japan                      358
Lebanon                     45
Lithuania                   35
Malta                      127
Netherlands               2371
Norway                    1086
Poland                     341
Portugal                  1519
RSA                         58
Saudi Arabia                10
Singapore                  229
Spain                     2533
Sweden  

There don't seem to be any errors in the values of the 'Country' column. However, there is a value named 'Unspecified'. This means that some rows are missing a country and store the value 'Unspecified' instead of NULL.
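If these placeholders should be treated as real missing values, one option is to blank them out so that `isnull()` picks them up. The sketch below uses a toy frame and assumes the placeholder is spelled exactly 'Unspecified'.

```python
import pandas as pd

# Toy frame; the placeholder spelling 'Unspecified' is an assumption.
toy = pd.DataFrame({"Country": ["United Kingdom", "Unspecified", "France"]})

# mask() replaces the entries where the condition is True with a missing value.
toy["Country"] = toy["Country"].mask(toy["Country"].eq("Unspecified"))

print(toy["Country"].isna().sum())
```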