# <center><u>Online Retail</u></center>

# <b>Introduction</b>

LINK : https://archive.ics.uci.edu/ml/datasets/Online+Retail

This is a transnational data set containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

# <b>Attribute Information</b>

<b>InvoiceNo</b>: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.

<b>StockCode</b>: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

<b>Description</b>: Product (item) name. Nominal.

<b>Quantity</b>: The quantities of each product (item) per transaction. Numeric.

<b>InvoiceDate</b>: Invoice date and time. Numeric, the day and time when each transaction was generated.

<b>UnitPrice</b>: Unit price. Numeric, product price per unit in sterling.

<b>CustomerID</b>: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

<b>Country</b>: Country name. Nominal, the name of the country where each customer resides.



# Sample Dataset

Import the pandas library to convert the xlsx file into CSV format.
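The conversion step is not shown in this notebook, so the following is only a sketch of how it might look. The helper name `xlsx_to_csv` and its path arguments are hypothetical, and `pd.read_excel` needs an Excel engine such as openpyxl to be installed. The toy round trip underneath illustrates one plausible reason (an assumption, not something the notebook confirms) for a CSV having one column more than the attribute list: `to_csv` writes the row index as an extra unnamed column by default.

```python
import io

import pandas as pd


def xlsx_to_csv(xlsx_path, csv_path):
    """Hypothetical helper: read the first sheet of a workbook and save it as CSV."""
    frame = pd.read_excel(xlsx_path)  # requires an Excel engine such as openpyxl
    frame.to_csv(csv_path)            # default index=True writes an extra index column
    return frame


# Toy frame with made-up values, written to an in-memory buffer instead of a file
sample = pd.DataFrame({"InvoiceNo": ["100001", "C100002"], "Quantity": [6, -1]})
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # the first column is the saved index
print(reloaded.shape)
```

Passing `index=False` to `to_csv` would avoid the extra column.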

In [None]:
df.shape

In [None]:
df.head(20)

There are 541909 rows and 9 columns (variables).

In [None]:
df.describe()

The maximum and minimum of <i>Quantity</i> in the summary above trace back to a transaction that was cancelled (invoice C581484).
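Because cancellations carry the 'C' prefix on <i>InvoiceNo</i>, they can be separated from completed transactions before looking at extremes. This is a minimal sketch on toy rows; the invoice numbers and quantities are made up for illustration.

```python
import pandas as pd

# Toy rows with made-up invoice numbers and quantities (not taken from the data set)
toy = pd.DataFrame({
    "InvoiceNo": ["100001", "C100002", "100003"],
    "Quantity": [500, -500, 6],
})

# Cast to str first: a column mixing numbers and text may load with mixed types
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

cancelled = toy[is_cancelled]
completed = toy[~is_cancelled]

print(cancelled)
print(completed["Quantity"].describe())
```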

Below is the data type of each variable.

In [None]:
df.dtypes

# Data Cleaning & Transformation

View the variables using the info function.

In [None]:
df.info()

The variables <i>Description</i> and <i>CustomerID</i> have missing values. <i>CustomerID</i> has more than 100000 missing values, so the rows without a customer number are candidates for removal.
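The missing values can also be counted directly per column rather than read off the `info()` output. A minimal sketch on a toy frame with made-up rows:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; None/NaN mark the missing entries
toy = pd.DataFrame({
    "Description": ["WHITE LANTERN", None, "HAND WARMER"],
    "CustomerID": [17850.0, np.nan, np.nan],
    "Quantity": [6, 2, 6],
})

# Number of missing values in each column
missing = toy.isnull().sum()
print(missing)

# Keep only rows where both columns are present
complete = toy.dropna(subset=["Description", "CustomerID"])
print(len(complete))
```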

In [None]:
print(df['Country'].unique())

It's interesting that so many countries are involved in these transactions, although that is expected for an international data set. Maybe we can view the distribution of customers across countries?

In [None]:
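import matplotlib.pyplot as plt

plt.rc('figure', figsize=(10, 5))
figsize_with_subplots = (30, 20)
fig = plt.figure(figsize=figsize_with_subplots)
fig_dims = (3,2)
# One colour per bar, cycled; a list is used because recent matplotlib
# versions may reject the multi-letter string 'rgbymck' as a colour
colors = list('rgbymck')

# Number of transaction rows per country
plt.subplot2grid(fig_dims, (0, 0))
df['Country'].value_counts().plot(kind='bar',title='Frequency of transactions by country', color=colors)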

The variable <i>Country</i> should also be considered for removal, since the number of customers is not balanced across countries and most of them are from the UK.
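To put a number on this imbalance, the counts can be turned into proportions. A minimal sketch on toy rows (the proportions below are made up and say nothing about the real data):

```python
import pandas as pd

# Toy country column with made-up rows
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 4 + ["France"],
})

# normalize=True returns each country's share of rows instead of raw counts
share = toy["Country"].value_counts(normalize=True)
print(share)
```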

<b>Create new variables <i>Time, Date and TotalPrice</i></b>

In [None]:
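# df_uk is the UK-only subset without missing values; it is built in the
# filtering cell at the start of the Data Visualization section, so run that cell first
df_uk['Time'] = df_uk.InvoiceDate.str[11:]   # characters after the date part
df_uk['Date'] = df_uk.InvoiceDate.str[:10]   # first 10 characters hold the date
df_uk['TotalPrice'] = df_uk['Quantity'] * df_uk['UnitPrice']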

# http://stackoverflow.com/questions/30780742/get-substring-from-pandas-dataframe-while-filtering

In [None]:
df_uk.head()
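An alternative to slicing the string by position is to parse <i>InvoiceDate</i> once and use the `.dt` accessor. This is a sketch under the assumption that the column holds text such as `2010-12-01 08:26:00`, which is what the 10-character date slice above suggests; the toy values are made up.

```python
import pandas as pd

# Toy rows with made-up values; the timestamp format is an assumption
toy = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2011-12-09 12:50:00"],
    "Quantity": [6, 12],
    "UnitPrice": [2.55, 0.85],
})

stamp = pd.to_datetime(toy["InvoiceDate"])  # parse once, reuse below
toy["Date"] = stamp.dt.date                 # calendar date part
toy["Time"] = stamp.dt.time                 # time-of-day part
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

print(toy[["Date", "Time", "TotalPrice"]])
```

Parsing has the advantage of failing loudly on malformed timestamps, whereas positional slicing silently returns the wrong substring.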

<b>Remove the <i>InvoiceDate</i>, <i>UnitPrice</i> and <i>StockCode</i> variables</b>


In [None]:
df_uk = df_uk.drop(['InvoiceDate','StockCode','UnitPrice'],axis=1)

# http://chrisalbon.com/python/pandas_dropping_column_and_rows.html

# Data Visualization

<b>Top 10 Products by Quantity Bought</b>

In [None]:
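# Keep UK transactions only; .copy() avoids modifying a view of df
df_uk = df[df['Country'] == 'United Kingdom'].copy()
# Drop rows with a missing Description or CustomerID
df_uk = df_uk[pd.notnull(df_uk['Description'])]
df_uk = df_uk[pd.notnull(df_uk['CustomerID'])]
# Drop the first column by position
df_uk = df_uk.drop(df_uk.columns[[0]],axis=1)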

#http://stackoverflow.com/questions/32675861/copy-all-values-in-a-column-to-a-new-column-in-a-pandas-dataframe

In [None]:
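# numeric_only=True keeps text columns out of the sum (newer pandas no longer drops them silently)
graph_prod = df_uk.groupby(['Description']).sum(numeric_only=True)
graph_prod = graph_prod.sort_values('Quantity',ascending=False).head(10)
graph_prod['Quantity'].plot(kind='bar',color=colors)
graph_prod['Quantity']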

# http://queirozf.com/entries/pandas-dataframe-by-example
# http://chrisalbon.com/python/pandas_sorting_rows_dataframe.html
#http://stackoverflow.com/questions/29219055/plot-top-10-verse-all-other-values

Above is the graph of the 10 products bought in the largest quantities by customers from the UK. The most-bought product is <i>WORLD WAR 2 GLIDERS ASSTD DESIGNS</i> with 47982 units.
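The same kind of ranking can be written more compactly by summing only the column of interest and calling `nlargest`. A minimal sketch on toy rows with made-up product names and quantities:

```python
import pandas as pd

# Toy rows with made-up product names and quantities
toy = pd.DataFrame({
    "Description": ["GLIDER", "GLIDER", "CAKESTAND", "LANTERN", "LANTERN"],
    "Quantity": [48, 12, 5, 10, 3],
})

# Sum Quantity per product, then keep the two largest totals
top = toy.groupby("Description")["Quantity"].sum().nlargest(2)
print(top)
```

Selecting the column before `sum()` also sidesteps any non-numeric columns in the frame.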

<b>Top 10 Products by Total Price</b>

In [None]:
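# numeric_only=True keeps text columns out of the sum (newer pandas no longer drops them silently)
graph_prod = df_uk.groupby(['Description']).sum(numeric_only=True)
graph_prod = graph_prod.sort_values('TotalPrice',ascending=False).head(10)
graph_prod['TotalPrice'].plot(kind='bar',color=colors)
graph_prod['TotalPrice']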

# http://queirozf.com/entries/pandas-dataframe-by-example
# http://chrisalbon.com/python/pandas_sorting_rows_dataframe.html
#http://stackoverflow.com/questions/29219055/plot-top-10-verse-all-other-values

<i>REGENCY CAKESTAND 3 TIER</i> has the highest total price of all products.

<b>Top 10 customers based on total price spent</b>

In [None]:
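# numeric_only=True keeps text columns out of the sum (newer pandas no longer drops them silently)
graph_cust = df_uk.groupby(['CustomerID']).sum(numeric_only=True)
graph_cust = graph_cust.sort_values('TotalPrice', ascending=False).head(10)
graph_cust['TotalPrice'].plot(kind='bar',color=colors)
graph_cust['TotalPrice']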

Customer 18102 has spent £256438.49 on online purchases, the highest total among all customers. Maybe we should consider rewarding such customers for their spending?


In [None]:
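# Parse the date strings, then derive year and month for grouping
df_uk['Date'] = pd.to_datetime(df_uk['Date'])
df_uk['year'] = df_uk['Date'].dt.year
df_uk['month'] = df_uk['Date'].dt.month
# numeric_only=True keeps the datetime and text columns out of the sum
time_series_bought = df_uk.groupby(['year','month']).sum(numeric_only=True)

# Draw the monthly totals as a dashed line, then overlay bars on the same axes
ax = time_series_bought['TotalPrice'].plot(kind='line', linestyle='--', marker='o')
time_series_bought['TotalPrice'].plot(kind='bar',ax=ax, color=colors)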

#http://stackoverflow.com/questions/23482201/plot-pandas-dataframe-as-bar-and-line-on-the-same-one-chart

In [None]:
print(time_series_bought['TotalPrice'])
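Monthly totals can also be obtained without separate year and month columns by grouping on a monthly period. The sketch below uses toy rows with made-up dates and amounts, and additionally reports the last date seen in each month, which makes a partially covered month easy to spot.

```python
import pandas as pd

# Toy rows with made-up dates and amounts
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2011-11-03", "2011-11-28", "2011-12-01", "2011-12-09"]),
    "TotalPrice": [100.0, 50.0, 20.0, 5.0],
})

# Map each date to its calendar month and use that as the grouping key
period = toy["Date"].dt.to_period("M")

monthly = toy.groupby(period)["TotalPrice"].sum()  # sales per month
last_day = toy.groupby(period)["Date"].max()       # latest transaction date per month

print(monthly)
print(last_day)
```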

It is interesting to see that November 2011 has the highest sales. Sales exceed 500000 in May, September, October and November 2011. In December 2011 sales fall suddenly, probably because the month is incomplete (transactions are recorded only until 9 Dec 2011). We need to dig further into the top 3 products that customers bought in each month.
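One possible way to approach the per-month product ranking is sketched here. The helper `top_products_per_month` is hypothetical and not part of the notebook, ranking by <i>Quantity</i> rather than <i>TotalPrice</i> is an assumption, and the toy rows are made up.

```python
import pandas as pd


def top_products_per_month(frame, n=3):
    """Hypothetical helper: the n products with the largest Quantity in each (year, month)."""
    totals = frame.groupby(
        ["year", "month", "Description"], as_index=False
    )["Quantity"].sum()
    # Within each month, put the largest totals first
    totals = totals.sort_values(
        ["year", "month", "Quantity"], ascending=[True, True, False]
    )
    # head(n) on the grouped frame keeps the first n rows of every month
    return totals.groupby(["year", "month"]).head(n).reset_index(drop=True)


# Toy rows with made-up products and quantities
toy = pd.DataFrame({
    "year": [2011, 2011, 2011, 2011, 2011, 2011],
    "month": [10, 10, 10, 10, 11, 11],
    "Description": ["MUG", "MUG", "LANTERN", "CLOCK", "MUG", "CLOCK"],
    "Quantity": [5, 7, 20, 1, 3, 9],
})

top = top_products_per_month(toy, n=2)
print(top)
```

With the notebook's data, the call would be `top_products_per_month(df_uk, n=3)` once the `year` and `month` columns exist.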