### Graded Lab 2

Hello! Welcome to the Graded Lab for Module 2.
Here we will work on a business problem using three related datasets.
Let's look at the problem statement.

*Client: ABC Retail, Incorporated, rest-of-the-world division* 

***Project name: Online retail sales analysis*** 

An online retailer, ABC, Inc., operates in nearly 100 countries worldwide, selling furniture, office supplies and technology products to customers in three segments: consumer, corporate and home office. ABC, Inc. is a US-based company, and it has two major divisions: US and rest of the world. We are working with the rest of the world division of the company. 

They have provided us with online sales transaction data from 2011 to 2014.

We are given three datasets:

1. Data on each sale; 51290 records; all data in US dollars.
It contains fields like
**order_id** (identifier), order_date, ship_date, ship_mode, **customer_id** (identifier), product_id, category, sub_category, product_name, sales, quantity, discount, profit, shipping_cost, order_priority, **vendor_code** (identifier)

2. Data on the customers; 1590 records.
It contains fields like
**customer_id** (identifier), customer_name, city, state, country, postal_code, segment, market, region

3. Data on vendors who supply the retailer; 65 records.
It contains fields like
vendor, **vendor_code** (identifier)

We need to analyze the data and answer the questions asked by company officials.

In [None]:
# importing libraries
import pandas as pd
import numpy as np

### Reading sales data
sales = pd.read_csv('sales_data.csv')

### Reading customer data
cust = pd.read_csv(r'customers.csv',encoding='iso-8859-1')

### Reading vendor data
vend = pd.read_csv(r'vendors.csv')

sales.head()

In [None]:
pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.4f}'.format)

In [None]:
sales.shape, cust.shape, vend.shape

In [None]:
cust.head()

In [None]:
vend.head()

To answer the questions that follow, we need to combine all three datasets into a single dataframe while keeping every row of the sales dataframe intact. We do this in a data-processing function that performs two tasks:
1. Merge (join) the three datasets into a single dataframe so that every row of the sales dataframe is preserved. (Work out the joining key and the type of join; see the pandas .merge() function.)
2. Convert 'order_date' into a datetime column.
**Return the output as a dataframe.**
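
A minimal sketch of the two tasks on toy data, assuming the column names listed above. The frames `toy_sales`, `toy_cust` and `toy_vend` are hypothetical stand-ins for the real files. A left join keeps every sales row even when a key has no match, `validate="many_to_one"` raises an error if a key is duplicated on the lookup side, and comparing row counts shows what an inner join would have dropped.

```python
import pandas as pd

# Hypothetical toy frames standing in for the sales, customer and vendor files
toy_sales = pd.DataFrame({
    "order_id": ["A1", "A2", "A3"],
    "order_date": ["01/02/2011", "15/03/2012", "31/12/2014"],
    "customer_id": ["C1", "C2", "C9"],  # C9 has no match in toy_cust
    "vendor_code": ["V1", "V1", "V2"],
    "sales": [100.0, 250.0, 75.0],
})
toy_cust = pd.DataFrame({"customer_id": ["C1", "C2"],
                         "customer_name": ["Ann", "Bob"]})
toy_vend = pd.DataFrame({"vendor_code": ["V1", "V2"],
                         "vendor": ["Acme", "Globex"]})

# Left joins preserve all rows of the left (sales) frame
merged = (
    toy_sales
    .merge(toy_cust, how="left", on="customer_id", validate="many_to_one")
    .merge(toy_vend, how="left", on="vendor_code", validate="many_to_one")
)

# Parse day/month/year strings into datetime64 values
merged["order_date"] = pd.to_datetime(merged["order_date"], format="%d/%m/%Y")

# An inner join would silently drop the unmatched C9 order
inner_rows = len(toy_sales.merge(toy_cust, how="inner", on="customer_id"))
print(len(merged), inner_rows)  # prints: 3 2
```

Checking that the merged row count equals the sales row count is a quick way to confirm that the join neither dropped nor duplicated orders.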

In [None]:
# A left join keeps every sales row, even if a customer_id has no match in cust
df = sales.merge(right=cust, how="left", on="customer_id")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# A left join keeps every sales row, even if a vendor_code has no match in vend
df2 = df.merge(right=vend, how="left", on="vendor_code")

In [None]:
df2

In [None]:
df2.shape

In [None]:
df2["order_date"] = pd.to_datetime(df2['order_date'], format='%d/%m/%Y')

In [None]:
df2.info()

# ---------------------------------------------------------------------------------

In [None]:
#### Data merging and order_date processing: data1 is the sales dataset, data2 the customer dataset and data3 the vendor dataset.

def data_process(data1,data2,data3):
    # Left joins keep every row of the sales data intact
    data = data1.merge(right=data2, how="left", on="customer_id")
    data = data.merge(right=data3, how="left", on="vendor_code")
    # Dates are stored as day/month/year strings
    data["order_date"] = pd.to_datetime(data['order_date'], format='%d/%m/%Y')
    return data

In [None]:
sales= data_process(data1=sales.copy(),data2=cust.copy(),data3=vend.copy())

In [None]:
assert sales['order_date'].dtypes=='<M8[ns]' ,'Make sure that you have converted order_date into a datetime format correctly.'
assert sales.shape== (51290,26) ,'Checking size and shape of dataframe after merging is a very important check.'

In [None]:
sales.columns

In [None]:
#sales.to_csv("salescombine.csv", index=False)

### Q1. Which three sub-categories yield the best percentage profit? Return the output as a list of sub-categories and a list of percentage values (rounded to 2 decimals).

In [None]:
def top_3_sales(data):
    
    # Group by sub_category and sum the profits for each sub_category
    sub_category_profit = data.groupby("sub_category")["profit"].sum().reset_index()
    
    # Sort the subcategories by profit in descending order and select the top 3
    top3 = sub_category_profit.nlargest(3, "profit")
    
    # Calculate the percentage profit for each subcategory
    total_profit = sub_category_profit["profit"].sum()
    top3["percent"] = (top3["profit"] / total_profit * 100).round(2)
    
    # Extract the subcategories and their corresponding percentage values
    sub_categories = top3["sub_category"].tolist()
    perc_values = top3["percent"].tolist()
    
    return sub_categories,perc_values

In [None]:
assert len(top_3_sales(data=sales)[0])==3,"Please include list of top 3 sub-categories only"
assert type(top_3_sales(data=sales)[0])==list,"Output type should be list only."

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q2. Which city has the highest sales?

In [None]:
def top_city_sales(data):
    
    df1 = data.groupby("city", as_index=False)["sales"].sum().sort_values(by="sales", ascending=False)
    
    return df1['city'].values[0]

In [None]:
assert type(top_city_sales(data=sales))==str,"Please make sure that output is in string format"

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q3. In 2013, which country reported the lowest profit?
**(Calculate the order year from the order_date column.)**
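
The question can be read in two ways: the country of the single least profitable transaction, or the country with the lowest total profit for the year. A small sketch on hypothetical toy data (the frame `toy` and its values are invented for illustration) shows that the two readings can give different answers, so it is worth deciding which one is meant before coding.

```python
import pandas as pd

# Hypothetical toy data: country X has the worst single order,
# but country Y has the lowest total profit in 2013
toy = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2013-01-05", "2013-06-10", "2013-07-01", "2013-09-09", "2012-03-03"]),
    "country": ["X", "X", "Y", "Y", "Y"],
    "profit": [-50.0, 200.0, -20.0, -10.0, -500.0],
})

# Keep only 2013 orders (the 2012 row is excluded)
in_2013 = toy[toy["order_date"].dt.year == 2013]

# Reading 1: country of the single lowest-profit transaction
worst_row_country = in_2013.loc[in_2013["profit"].idxmin(), "country"]

# Reading 2: country with the lowest total profit over the year
lowest_total_country = in_2013.groupby("country")["profit"].sum().idxmin()

print(worst_row_country, lowest_total_country)  # prints: X Y
```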

In [None]:
def lowest_profit_country(data):
    
    # Filter the data to include only records from the year 2013
    year2013_data = data[data["order_date"].dt.year == 2013]
    
    # Sum the profit per country so that countries are compared on their
    # 2013 totals rather than on a single transaction
    country_profit = year2013_data.groupby("country")["profit"].sum()
    
    # idxmin returns the index label (the country) with the lowest total profit
    country_with_lowest_profit = country_profit.idxmin()
    
    return country_with_lowest_profit

In [None]:
assert type(lowest_profit_country(data=sales))==str,"Please make sure that output is in string format"

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q4. For which market do we observe the second-highest discount?

In [None]:
sales.discount.value_counts()

In [None]:
market_segment_discount = sales[sales.discount == 0.80]
market_segment_discount

In [None]:
market_segment_discount.market.value_counts()

In [None]:
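def second_highest_discount_market(data):
    
    # First attempt: look only at orders discounted at 0.80 and take the second row.
    # Every row here has the same discount, so the sort does not rank markets;
    # the definition in the next cell (mean discount per market) overrides this one.
    market_segment_discount = data[data["discount"] == 0.80]
    sorted_segments = market_segment_discount.sort_values(by="discount", ascending=False)
    second_highest_discount_segment = sorted_segments.iloc[1]["market"]
    
    return second_highest_discount_segment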

In [None]:
def second_highest_discount_market(data):
    
    market_segment_discount = data.groupby("market")["discount"].mean().reset_index()
    sorted_segments = market_segment_discount.sort_values(by="discount", ascending=False)
    second_highest_discount_segment = sorted_segments.iloc[1]["market"]
    
    return second_highest_discount_segment

In [None]:
assert type(second_highest_discount_market(data=sales))==str,"Please make sure that output is in string format"

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

In [None]:
second_highest_discount_market(data=sales)

### Q5. Which product sold the most (in terms of quantity) within the sub-category ‘Copiers’, and how many units were sold? Return the output as a tuple (product name, quantity sold).

In [None]:
copier = sales[sales["sub_category"] == "Copiers"]
copier

In [None]:
# Group by product_name and sum the quantity
total_quantity_by_product = copier.groupby('product_name')['quantity'].sum()

In [None]:
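def copier_sales(data):
    # Total quantity per product within the 'Copiers' sub-category
    qty = data[data["sub_category"] == "Copiers"].groupby("product_name")["quantity"].sum()
    return qty.idxmax(), qty.max()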

In [None]:
assert type(copier_sales(data=sales)[0])==str,"Please check the data type of answer , product name should be string."
assert type(copier_sales(data=sales)[1])==np.int64,"Please check the data type of answer , quantity sold should be an integer."

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q6. In 2014, which customer (identified by name) contributed the highest total profit, and how much was it (rounded to 4 decimal places)? Return the output as a tuple (customer name, profit).

In [None]:
def cust_prof(data):
    ### Extract the order year from order_date using the dt.year accessor.
    # Work on a copy so the caller's dataframe is not modified.
    data = data.copy()
    data['order_year'] = data['order_date'].dt.year
    
    ### Filter data for the year 2014
    sales_yr = data[data['order_year'] == 2014]
    
    #### Aggregate profit per customer name using .groupby()
    cust_profit = sales_yr.groupby('customer_name', as_index=False)['profit'].sum()
     
    ### Round the profit column to 4 decimal places
    cust_profit['profit'] = cust_profit['profit'].round(4)
    
    ## Sort the dataframe in decreasing order of profit.
    cust_profit = cust_profit.sort_values(by='profit', ascending=False)
    
    ### Store the customer name with the highest profit, and that profit, in the variables below.
    customer_name = cust_profit['customer_name'].iloc[0]
    customer_profit = cust_profit['profit'].iloc[0]
    
    return customer_name,customer_profit

In [None]:
assert type(cust_prof(data=sales)[0])==str,"Please check the data type of answer , customer name should be string."
assert type(cust_prof(data=sales)[1])==np.float64,"Please check the data type of answer , profit recorded should be float."

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q7. How much do the different categories of items contribute to total sales?
**Return an output dataframe with 2 columns: 'category' (product category) and 'sales_perc' (sales % values).**

**Make sure to round off sales percentage values to 2 decimals.**

In [None]:
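def cat_sales_contri(data):
    # Share of total sales per category, as a percentage rounded to 2 decimals
    perc = (data.groupby("category")["sales"].sum() / data["sales"].sum() * 100).round(2)
    return perc.rename("sales_perc").reset_index()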

In [None]:
assert type(cat_sales_contri(data=sales))==pd.DataFrame,"Please check the data type of answer , it should be a dataframe."
assert cat_sales_contri(data=sales).shape==(3,2),"Please check the data shape, total row count should be equal to number of unique categories."

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.

### Q8. Can we identify the customers who have not made any purchases in the last 12 months, so that we can send them promotional material to encourage them to come back and shop with us?

**Return the output as a dataframe with three columns: 'customer_name', 'customer_id' and 'Total_sales' (the total sales amount that the customer has accrued). Make sure the column names are returned in the same order and with the same spelling.**

**Hint:**

You might work along these lines:

1. Calculate the latest order date for every customer.
2. Calculate the year-back order date by offsetting the latest date by 365 days (you can use the pd.DateOffset(days=) function).
3. Check whether each order date falls between the year-back order date and the latest order date, and create a binary flag.
4. Each customer can have multiple order dates that fall before the year-back order date. Aggregate the flag at the customer level.
5. If all of a customer's flags indicate order dates more than one year old, mark that customer as one who has not made any transactions in the last year.
6. Get the customer's name by joining with the customer data, and aggregate to find total sales.
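
The steps above can be sketched on a tiny hypothetical frame (`toy`, with invented customers and dates). This is only an illustration of the flag-and-aggregate idea, not the graded solution. It assumes that the latest order itself is excluded from the window (done here with a strict `<`), and it skips step 6 because the toy frame has no customer names.

```python
import pandas as pd

# Hypothetical orders: C1 has a gap of over a year before the latest order,
# C2 ordered twice within the 365 days before the latest order
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2"],
    "order_date": pd.to_datetime(
        ["2012-01-10", "2014-06-01", "2013-12-01", "2014-03-15", "2014-06-20"]),
    "sales": [100.0, 40.0, 10.0, 20.0, 30.0],
})

# Step 1: latest order date per customer, broadcast back to every row
latest = toy.groupby("customer_id")["order_date"].transform("max")

# Step 2: start of the 365-day window before the latest order
year_back = latest - pd.DateOffset(days=365)

# Step 3: flag orders inside the window, excluding the latest order itself
toy["in_window"] = ((toy["order_date"] >= year_back)
                    & (toy["order_date"] < latest)).astype(int)

# Steps 4-5: sum the flags per customer; zero means no purchase in the window
per_customer = toy.groupby("customer_id").agg(
    purchases_in_window=("in_window", "sum"),
    Total_sales=("sales", "sum"),
)
lapsed = per_customer[per_customer["purchases_in_window"] == 0].reset_index()

print(lapsed[["customer_id", "Total_sales"]])
```

Here only C1 is selected, with total sales of 140.0, because the earlier C1 order falls outside the 365-day window.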

In [None]:
def cust_purc(data):
    ### Group by customer_id and take the latest order_date; name this dataframe purchase_date.
    purchase_date = data.groupby('customer_id', as_index=False)['order_date'].max()
    purchase_date.rename(columns={'order_date':'latest_purchase_date'},inplace=True)
    
    #### Calculate 'yr_back_order_date' by offsetting latest_purchase_date by 365 days, using pd.DateOffset().
    purchase_date['yr_back_order_date'] = purchase_date['latest_purchase_date'] - pd.DateOffset(days=365)
    
    ### Move latest_purchase_date back by 1 more day so the window excludes the latest order date itself.
    purchase_date['latest_purchase_date'] = purchase_date['latest_purchase_date'] - pd.DateOffset(days=1)
    
    ### The purchase-date data is ready, so merge it with the original sales data.
    data = data.merge(purchase_date, on='customer_id', how='left')
    
    #### Flag an order as 1 if order_date falls between 'yr_back_order_date' and 'latest_purchase_date', else 0.
    #### Assumption: both ends of the window are treated as inclusive.
    data['purchase_flag'] = data['order_date'].between(
        data['yr_back_order_date'], data['latest_purchase_date']).astype(int)

    #### Sum the flags per customer_id using groupby(); call this total '# purchases_in_last_yr'.
    purchase_in_lst_yr = data.groupby('customer_id', as_index=False)['purchase_flag'].sum()
    purchase_in_lst_yr.rename(columns={'purchase_flag':'# purchases_in_last_yr'},inplace=True)
    
    #### Merge purchase_in_lst_yr with the sales data.
    data = data.merge(purchase_in_lst_yr, on='customer_id', how='left')
    
    #### Select customers with 0 in '# purchases_in_last_yr': they have not purchased anything in the last year.
    sales_df = data[data['# purchases_in_last_yr'] == 0]
    
    ### Final aggregation: for the selected customers, group by name and id and sum the sales column.
    sales_df_1 = sales_df.groupby(['customer_name','customer_id'], as_index=False)['sales'].sum()
    sales_df_1.rename(columns={'sales':'Total_sales'}, inplace=True)
    
    return sales_df_1[['customer_name','customer_id','Total_sales']]


In [None]:
assert type(cust_purc(data=sales))==pd.DataFrame,"Please check the data type of answer , it should be dataframe."
assert cust_purc(data=sales).shape==(109,3),"Please check the data shape."

In [None]:
# autograder cells , please do not alter/ delete /edit this cell,Kindly ignore this cell.