#### This is my first Kaggle kernel, and it contains some very basic exploratory analysis. I have reviewed some of the other kernels and I don't believe any of the conclusions here differ from those presented by others.

Import all relevant libraries

In [None]:
import numpy as Numpy
import pandas as Pandas
import matplotlib.pyplot as Plt
import seaborn as Sns
import os as Os

Provide generic input

In [None]:
inpDir = '../input/'
outDir = ''
fileName = 'BlackFriday.csv'
fileTitle = fileName.split('.')[0]

Read in the dataset and gather some basic info about the dataframe (column names, types, missing data etc.)

In [None]:
bf = Pandas.read_csv(Os.path.join(inpDir, fileName), index_col=False)
bf.info()

Fields 'Product_Category_2' and 'Product_Category_3' have missing values. 
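
A quick way to confirm which columns have missing values, and how many, is to count the nulls per column. The sketch below uses a small made-up frame (`demo` and its values are hypothetical, not the actual dataset) that borrows the column names.

```python
import numpy as Numpy
import pandas as Pandas

# Hypothetical toy frame standing in for the real data
demo = Pandas.DataFrame({
    'Product_Category_1': [1, 5, 8, 1],
    'Product_Category_2': [2.0, Numpy.nan, 14.0, Numpy.nan],
    'Product_Category_3': [Numpy.nan, Numpy.nan, 17.0, Numpy.nan],
})

# Number of missing entries per column, keeping only columns that have any
missing = demo.isna().sum()
missing = missing[missing > 0]
print(missing)
```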

The following option stops pandas from wrapping a wide data frame across several lines, so that the columns appear side by side when you print the result of '.describe()'.

In [None]:
Pandas.set_option('display.expand_frame_repr', False)
bf.describe()

It looks like all numeric fields hold integer values, even though some of them (the two columns with missing values) are stored as float.

Determine the number of levels for all columns.  

In [None]:
print('Number of unique values (levels) for each column')
for cols in list(bf):
    print(f'{bf[cols].nunique()}: {cols}')
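
The same counts can also be obtained in a single call, since `DataFrame.nunique()` returns a Series indexed by column name. A minimal sketch on a hypothetical toy frame (`demo` and its values are made up for illustration):

```python
import pandas as Pandas

# Hypothetical toy frame, not the real dataset
demo = Pandas.DataFrame({
    'Gender': ['F', 'M', 'M', 'F'],
    'Age': ['0-17', '26-35', '26-35', '55+'],
    'Purchase': [8370, 15200, 1422, 1057],
})

# Number of distinct values (levels) per column, smallest first
levels = demo.nunique().sort_values()
print(levels)
```

Sorting the result makes it easy to see which columns are low-cardinality categories and which are closer to identifiers.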

Different analyses could be performed depending on the end-goal.  In this analysis, I am going to assume that the end-goal is to determine the factors that affect 'revenue', which is the total of the 'Purchase' amounts. Here, I am going to refer to the different fields/columns as 'categories' and the different values (levels) within these categories as 'factors'.  With this goal in mind, we will take a preliminary look at the effects of the factors in each of the remaining 11 categories on the revenue.

Before we do that, though, we need to fill in the missing values in 'Product_Category_2' and 'Product_Category_3'. After reviewing the list of existing values for these, I chose to use the value of 1 for the missing values.

In [None]:
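for cols in ['Product_Category_2', 'Product_Category_3']:
    # Distinct values before filling; NaN is dropped because it does not sort reliably
    print(f'{cols}: {sorted(bf[cols].dropna().unique())} (+ {bf[cols].isna().sum()} missing)')
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which may not modify bf in recent pandas versions
    bf[cols] = bf[cols].fillna(1).astype(int)
    print(f'{cols}: {sorted(bf[cols].unique())}')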

Draw a pie chart for each category showing how the sum of the purchases splits across its factors.  For now, I am going to ignore User_ID and Product_ID as there are too many factors in these categories.

In [None]:
fig1 = Plt.figure(1,figsize=(10,7.5))
fig1.clf()
k = 0
for cols in bf.columns[2:-1]:
    k = k + 1
    ax = fig1.add_subplot(3,3,k)
    
    # Total revenue and number of purchases per factor, largest revenue first
    bf_summ = (
        bf[[cols, 'Purchase']]
        .groupby([cols])['Purchase']
        .agg(['sum', 'count'])
        .reset_index()
        .rename(columns={'sum': 'totRev', 'count': 'totPurch'})
        .sort_values(['totRev'], ascending=[False])
    )
    ax.pie(bf_summ['totRev'], labels = bf_summ[cols], autopct='%1.1f%%', startangle=90)
    ax.axis('equal')
    ax.set_title(cols)

fig1.subplots_adjust(hspace=0.5, wspace = 0.5)
fig1.savefig(Os.path.join(outDir, fileTitle + '_1.png'), dpi=300)
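
The summary table built inside the loop can be written more compactly with named aggregation, which names the output columns directly instead of renaming them afterwards. This is a sketch on a hypothetical toy frame (`demo` and its values are made up), not a replacement tested against the real data.

```python
import pandas as Pandas

# Hypothetical toy frame, not the real dataset
demo = Pandas.DataFrame({
    'Gender': ['F', 'M', 'M', 'F', 'M'],
    'Purchase': [100, 300, 200, 50, 350],
})

# Named aggregation: output column name = aggregation function
summ = (
    demo.groupby('Gender')['Purchase']
        .agg(totRev='sum', totPurch='count')
        .reset_index()
        .sort_values('totRev', ascending=False)
)
print(summ)
```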

Similar data has been presented by other kernels for this dataset and hence the conclusions are similar too.  Next, I take a cumulative sum of the percentage contributions of each of the factors within each category and identify the (top) factors within each category that contribute most to the revenue.

In [None]:
fig2 = Plt.figure(2,figsize=(7.5,10.0))
fig2.clf()
j = 0

fig3 = Plt.figure(3,figsize=(7.5,26.0))
fig3.clf()
k = 0

totRevenue = bf['Purchase'].sum()
for cols in bf.columns[:-1]:
    # The following two statements could probably be combined and simplified using lambda functions in the '.agg' method.  Will do later.
    bf_summ = bf[[cols,'Purchase']].groupby([cols])['Purchase'].agg(['sum']).reset_index().rename(columns={'sum':'percRev'})

    bf_summ['percRev'] = 100.0*bf_summ['percRev']/totRevenue
    bf_summ = bf_summ.sort_values(['percRev'], ascending=[False]).copy()

    noFactors = bf_summ.shape[0]
    bf_summ['cumPercFactor'] = 100.0*(bf_summ.reset_index().index + 1)/noFactors
    
    bf_summ['cumPercRev'] = round(Numpy.cumsum(bf_summ['percRev']),2)
    bf_summ['percRev'] = round(bf_summ['percRev'],2)
    
    if (cols in ['User_ID', 'Product_ID']):
        j = j + 1
        ax = fig2.add_subplot(2,1,j)
        ax.plot(bf_summ['cumPercFactor'], bf_summ['cumPercRev'])
        ax.set_xlabel('Cumulative percentage of factors')
        ax.set_ylabel('Cumulative percentage of revenue')
        ax.set_title(cols)
        # Last position (in sorted order) at which cumulative revenue is still below 80%
        idx80 = Numpy.where(bf_summ['cumPercRev'] < 80.0)[0].max()
        print(f'Cumulative percentage of top {cols} that accounted for 80% of revenue: {round(bf_summ.iloc[idx80]["cumPercFactor"], 1)}\n')
    else:
        k = k + 1
        ax = fig3.add_subplot(9,1,k)
        ax.bar(bf_summ[cols].apply(lambda x: str(x)), bf_summ['cumPercRev'])
        if (cols=='Age'):
            # Rotate the existing tick labels rather than re-setting them
            Plt.setp(ax.get_xticklabels(), rotation='vertical', horizontalalignment="right")
        #ax.set_xlabel('Factors')
        ax.set_ylabel('Cumulative\n percentage\n of revenue')
        ax.set_title(cols)
        print(bf_summ.head(10))
        print('\n')
        
fig2.subplots_adjust(hspace=0.5, wspace = 0.0)
fig2.savefig(Os.path.join(outDir, fileTitle + '_2.png'), dpi=300)

fig3.subplots_adjust(hspace=0.5, wspace = 0.0)
fig3.savefig(Os.path.join(outDir, fileTitle + '_3.png'), dpi=300)
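
As the comment in the loop suggests, the percentage and cumulative-percentage columns can be produced in one short chain. The following is a sketch that assumes the same three columns are wanted; `cumulative_share` is a hypothetical helper name and the toy data is made up.

```python
import numpy as Numpy
import pandas as Pandas

def cumulative_share(df, col, value_col='Purchase'):
    """Per-level revenue share, sorted descending, with cumulative columns."""
    out = (
        df.groupby(col)[value_col].sum()
          .sort_values(ascending=False)
          .rename('percRev')
          .reset_index()
    )
    # Convert totals to percentages of overall revenue
    out['percRev'] = 100.0 * out['percRev'] / df[value_col].sum()
    out['cumPercRev'] = out['percRev'].cumsum()
    # Share of levels covered after each row (1 of n, 2 of n, ...)
    out['cumPercFactor'] = 100.0 * Numpy.arange(1, len(out) + 1) / len(out)
    return out

# Hypothetical toy frame, not the real dataset
demo = Pandas.DataFrame({
    'City_Category': ['A', 'B', 'B', 'C', 'C', 'C'],
    'Purchase': [100, 150, 150, 200, 200, 200],
})
res = cumulative_share(demo, 'City_Category')
print(res)
```

Rounding is left out here so that the cumulative sum is taken on unrounded values; it can be applied at print time.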

The first two graphs show that a large share of the revenue comes from a small share of the factors for both User_ID and Product_ID.  
For instance, 80% of the revenue comes from the top 43% of the users (User_IDs) or the top 26% of the products (Product_IDs).  A future analysis could therefore focus on the purchase patterns of these specific users and/or on the sales of these specific products.
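
The 80% figure can also be read off programmatically. The sketch below finds the smallest share of levels whose combined revenue reaches a target. `share_needed` is a hypothetical helper and the data is invented. It uses a "first level at or above the target" rule rather than the "last level below the target" rule in the loop above, so the two can differ by one level.

```python
import numpy as Numpy
import pandas as Pandas

def share_needed(df, col, target=80.0, value_col='Purchase'):
    """Percentage of levels in `col` (largest first) needed to reach `target`% of revenue."""
    rev = df.groupby(col)[value_col].sum().sort_values(ascending=False)
    cum_perc = 100.0 * rev.cumsum() / rev.sum()
    # Position of the first level at which the cumulative share reaches the target
    pos = int(Numpy.searchsorted(cum_perc.to_numpy(), target, side='left'))
    # Clamp in case floating-point error leaves the last value just under the target
    n_levels = min(pos + 1, len(rev))
    return 100.0 * n_levels / len(rev)

# Hypothetical toy frame: five users, the first two dominate revenue
demo = Pandas.DataFrame({
    'User_ID': [1, 1, 2, 3, 4, 5],
    'Purchase': [500, 100, 250, 100, 30, 20],
})
result = share_needed(demo, 'User_ID')
print(result)
```

In the toy data, two of the five users account for 85% of the revenue, so 40% of the users are needed to reach the 80% target.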

Similarly, for some of the other categories, we can conclude:
* Age groups of 18-50 account for roughly 86% of the revenue.  
* Factors 1, 5, 6, and 8 in Product_Category_1 account for roughly 80% of the revenue.
* The top ten (out of 21) occupations account for roughly 79% of the revenue.

This is a preliminary analysis and I plan to make some of these plots and data more presentable and provide additional analysis and plots in later versions of this kernel.