# Understanding the data and gathering requirements

- Dataset:
    - Say we have a dataset with the "columns" below:
    1. items:-        The actual product that is bought by the customer.
    2. group_cat:-    The key that groups similar items together.
        e.g. red apple and green apple belong to the same group. These items cannot be part of any other group.
    3. order_count:-  The number of orders a particular item has.
- Requirement:
    - We are asked to split the above "dataset" into "n batches" that meet the requirements below.
    1. All items belonging to a group_cat (group of items) must go into exactly one batch, so no two batches can have any item in common.
    2. All batches should have an approximately equal total of order_count.
    

### Let us write a "function" for the above requirement

In [2]:
def fn_name(arguments):
    # process the arguments
    # meet the requirements and return
    return 

## Step-1: Create dataset which looks similar to input

In [4]:
group_cat = ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'GH', 'HI', 'IJ']
print(len(group_cat))

9


In [10]:
# The output 9 is the number of "keys" (group categories).
# Under each key (e.g. AB) there is a "set of items".
# I want to "randomize" these sets of items, so I will use
# Python's built-in "random" module.

# random.randint(0, 10) takes (start number, end number).
# Running random.randint(0, 10) several times returns random integers between the start and end numbers, both inclusive.

import random
random.randint(0, 10)

9

In [9]:
random.randint(0, 10)

8
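The two cells above return different values on every run. A minimal sketch, assuming a fixed seed is acceptable for this demo (the seed value 42 is arbitrary and not part of the original notebook), shows that `random.randint` includes both endpoints and that seeding makes the draws repeatable:

```python
import random

random.seed(42)  # assumption: any fixed seed works; 42 is an arbitrary choice

# Draw many values so that both endpoints show up.
draws = [random.randint(1, 4) for _ in range(1000)]

# randint(a, b) returns an integer N with a <= N <= b,
# so the smallest draw is 1 and the largest is 4.
print(min(draws), max(draws))  # 1 4

# Re-seeding with the same value reproduces the same sequence.
random.seed(42)
repeat = [random.randint(1, 4) for _ in range(1000)]
print(draws == repeat)  # True
```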

In [22]:
# Now I want to create the data so that the random numbers generated for 'AB' are not repeated in 'BC', 'CD', 'DE', etc.
# To do that, I change the range so that every new group category gets its own "start of range" and "end of range".

# [1] AB as the "first group"
# I changed the range from (0, 10) to (1000, 1500).
# I want a random number between 1000 and 1500.

random.randint(1000, 1500)

# I want to repeat this a random number of times,
# so I use range(random.randint(1, 4)),
# i.e. give me at least 1 number and at most 4 numbers (randint includes both ends).

[random.randint(1000, 1500) for x in range(random.randint(1, 4))]

[1001, 1220, 1188, 1059]

In [23]:
# [2] BC as the "second group"
# Numbers up to 1500 are already used, so this range starts from 1501.

[ random.randint(1501, 3000) for x in range(random.randint(1, 4))]

[1816, 2892, 2092, 1684]

In [24]:
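import random
items = []
# Group i gets its own block of 1000 ids: 1000-1999 for 'AB', 2000-2999 for 'BC', and so on,
# so an item id can never appear in two different groups.
for i in range(len(group_cat)):
    items.append([random.randint((i+1)*1000, (i+1)*1000+999) for x in range(random.randint(1, 4))])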

In [25]:
group_cat[0]

'AB'

In [26]:
# The 4 items below are grouped as one group, i.e. AB
items[0]

[1176, 1273, 1072, 1872]
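To confirm that the per-group ranges really keep item ids apart, a small check can compare every pair of groups. This is only a sketch: it rebuilds a shortened `group_cat` list for illustration, and `overlaps` is a name introduced here, not one from the notebook.

```python
import random
from itertools import combinations

group_cat = ['AB', 'BC', 'CD']  # shortened list, for illustration only

items = []
for i in range(len(group_cat)):
    low = (i + 1) * 1000   # 1000, 2000, 3000, ...
    high = low + 999       # 1999, 2999, 3999, ...
    items.append([random.randint(low, high) for _ in range(random.randint(1, 4))])

# Intersect the item ids of every pair of groups; all intersections must be empty.
overlaps = [set(a) & set(b) for a, b in combinations(items, 2)]
print(any(overlaps))  # False
```

Note that `random.randint` can still return the same id twice inside one group. If ids must be unique within a group as well, `random.sample(range(low, high + 1), k)` is one way to draw `k` distinct values.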

#### Creating a dataframe

In [27]:
import pandas as pd
data = pd.DataFrame({'items': items,
                    'group_cat':group_cat})

In [29]:
data.head()

|   | items | group_cat |
|---|-------|-----------|
| 0 | [1176, 1273, 1072, 1872] | AB |
| 1 | [2292] | BC |
| 2 | [3263, 3924] | CD |
| 3 | [4107, 4055, 4758, 4604] | DE |
| 4 | [5495, 5917, 5672] | EF |


In [30]:
# We do not want the items stored as lists, as seen in the output above,
# i.e. not 4 items [1176, 1273, 1072, 1872] in one AB row, 2 items [3263, 3924] in one CD row, and so on.
# We need one row per item, so we "explode" the lists as shown below.
data = data.explode('items')

In [31]:
data.head()

|   | items | group_cat |
|---|-------|-----------|
| 0 | 1176 | AB |
| 0 | 1273 | AB |
| 0 | 1072 | AB |
| 0 | 1872 | AB |
| 1 | 2292 | BC |
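The repeated index values (0, 0, 0, 0, 1) in the output above come from `explode`, which copies the original row label to every new row. A minimal sketch on a two-row frame (the `demo` frame is made up for illustration) shows this, plus the `ignore_index=True` option, which is available in pandas 1.1 and later, for getting a fresh 0..n-1 index:

```python
import pandas as pd

demo = pd.DataFrame({
    'items': [[1176, 1273], [2292]],
    'group_cat': ['AB', 'BC'],
})

# Default behaviour: each exploded row keeps the label of its source row.
exploded = demo.explode('items')
print(exploded.index.tolist())  # [0, 0, 1]

# ignore_index=True renumbers the rows instead.
flat = demo.explode('items', ignore_index=True)
print(flat.index.tolist())      # [0, 1, 2]
```

The batching logic in this notebook works with either index, because it groups by `group_cat` and never looks rows up by label.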


#### We have to know the "order_count" of each item in every "group_cat"

In [32]:
data['order_count'] = [random.randint(1, 20) for x in range(data.shape[0])]

In [33]:
data.head()

|   | items | group_cat | order_count |
|---|-------|-----------|-------------|
| 0 | 1176 | AB | 15 |
| 0 | 1273 | AB | 10 |
| 0 | 1072 | AB | 7 |
| 0 | 1872 | AB | 20 |
| 1 | 2292 | BC | 8 |


#### Finally, the "dataset" is created.

## Step-2: Build the batching logic
- The requirement now is to make n batches.
- The cells below work out the batching logic step by step.

In [34]:
data.shape

(22, 3)

In [36]:
# Now we have "22 records".
# It is meaningless to split these "22 records" into "5 batches",
# so I will split them into "2 batches".

nbatches = 2
batches = [[] for x in range(nbatches)]

In [37]:
batches

[[], []]

In [40]:
# Now we have created the empty batches.
# We will put the "keys" (group_cat values) into these "empty lists" so that the next person can pick up from here and continue the process.

In [42]:
# Requirement-2: All batches should have an approximately equal total of order_count
# We have 22 records, each with its own order_count,
# so I initialize "batch_count" to hold the running order_count total of each batch.

batch_count = [0 for x in range(nbatches)]

In [43]:
batch_count

[0, 0]

In [44]:
data_grp = data.groupby(['group_cat']).agg({'order_count':'sum', 'items':list}).reset_index()
data_grp

|   | group_cat | order_count | items |
|---|-----------|-------------|-------|
| 0 | AB | 52 | [1176, 1273, 1072, 1872] |
| 1 | BC | 8 | [2292] |
| 2 | CD | 9 | [3263, 3924] |
| 3 | DE | 20 | [4107, 4055, 4758, 4604] |
| 4 | EF | 36 | [5495, 5917, 5672] |
| 5 | FG | 17 | [6157] |
| 6 | GH | 20 | [7704] |
| 7 | HI | 26 | [8215, 8997] |
| 8 | IJ | 46 | [9026, 9084, 9474, 9260] |
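The `groupby(...).agg(...)` call above does two things per group: it sums `order_count` and collects `items` back into a list. A small sketch on three hand-written rows (the `demo` frame is illustrative, with values taken from the first rows shown earlier) makes both effects easy to verify by hand:

```python
import pandas as pd

demo = pd.DataFrame({
    'items': [1176, 1273, 2292],
    'group_cat': ['AB', 'AB', 'BC'],
    'order_count': [15, 10, 8],
})

demo_grp = (
    demo.groupby('group_cat')
        .agg({'order_count': 'sum', 'items': list})  # one output column per dict key
        .reset_index()                               # turn group_cat back into a column
)

print(demo_grp['group_cat'].tolist())    # ['AB', 'BC']
print(demo_grp['order_count'].tolist())  # [25, 8]
```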


In [45]:
# Now I want to split these into "2 batches".
# argmin returns the index of the lowest value in the list (the first such index when there are ties).
import numpy as np
np.argmin([1, 10, -5, -5, 0, 500])

2
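In the batching loop, `np.argmin` is applied to the running totals of the batches, so it answers the question "which batch is lightest right now?". A minimal sketch with made-up totals (`first_pick` and `later_pick` are names introduced for this example):

```python
import numpy as np

# At the start every batch is empty, so all totals tie at 0.
# argmin returns the first index among equal minima, i.e. batch 0.
first_pick = int(np.argmin([0, 0]))
print(first_pick)  # 0

# Once batch 0 holds 52 orders and batch 1 holds 8, batch 1 is the lightest.
later_pick = int(np.argmin([52, 8]))
print(later_pick)  # 1
```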

In [46]:
# This shows the number of records (rows) in data_grp
data_grp.shape[0]

9

In [47]:
data_grp['group_cat'].iloc[0]

'AB'

In [48]:
batches[0]

[]

In [49]:
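for i in range(data_grp.shape[0]):
    idx = np.argmin(batch_count)  # batch with the smallest running total so far
    batches[idx].append(data_grp['group_cat'].iloc[i])  # the whole group goes into that batch
    batch_count[idx] += data_grp['order_count'].iloc[i]  # add the group's orders to that batch's total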

In [50]:
batches

[['AB', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'DE', 'EF', 'HI']]

In [51]:
batch_count

[135, 99]
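The result can be checked against both requirements. The sketch below uses a hypothetical helper, `check_batches`, which is not part of the original notebook; the group totals are copied from the `data_grp` output above.

```python
def check_batches(batches, batch_count, group_totals):
    """Hypothetical helper: verify the two requirements and return the imbalance."""
    assigned = [g for batch in batches for g in batch]

    # Requirement 1: every group appears in exactly one batch.
    assert len(assigned) == len(set(assigned)), "a group is in more than one batch"
    assert set(assigned) == set(group_totals), "some group was not assigned"

    # The reported totals must match the totals recomputed from the groups.
    recomputed = [sum(group_totals[g] for g in batch) for batch in batches]
    assert recomputed == [int(c) for c in batch_count], "batch totals do not match"

    # Requirement 2: report the gap between the heaviest and lightest batch.
    return max(recomputed) - min(recomputed)


group_totals = {'AB': 52, 'BC': 8, 'CD': 9, 'DE': 20, 'EF': 36,
                'FG': 17, 'GH': 20, 'HI': 26, 'IJ': 46}
batches = [['AB', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'DE', 'EF', 'HI']]
batch_count = [135, 99]

gap = check_batches(batches, batch_count, group_totals)
print(gap)  # 36
```

A gap of 36 out of 234 total orders satisfies requirement 1 fully. Whether it counts as "approximately equal" for requirement 2 depends on the tolerance the requester has in mind.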

### We met all the requirements above
### Based on all the steps above, I now want to write a function

# Write the function only after you have worked out the logic.

In [52]:
def batching(df: pd.DataFrame, n: int):
    """Split the group_cat values of df into n batches with similar order_count totals."""
    df_grp = df.groupby(['group_cat']).agg({'order_count': 'sum', 'items': list}).reset_index()
    batches = [[] for x in range(n)]  # n empty batches; n is the function input
    batch_count = [0 for x in range(n)]  # running order_count total per batch
    for i in range(df_grp.shape[0]):
        idx = np.argmin(batch_count)  # batch with the smallest total so far
        batches[idx].append(df_grp['group_cat'].iloc[i])
        batch_count[idx] += df_grp['order_count'].iloc[i]
    return batches, batch_count

In [53]:
new_batches, new_batch_count = batching(data, 2)

In [54]:
batches

[['AB', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'DE', 'EF', 'HI']]

In [58]:
# Result returned by the batching() function above; it matches the step-by-step "batches"
new_batches

[['AB', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'DE', 'EF', 'HI']]
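The function returns group keys per batch. To hand the batches to the next person as rows, the keys can be mapped back onto the item-level frame. This is a sketch under assumptions: the `rows` frame is a made-up subset, the column name `batch` is introduced here, and `new_batches` is written by hand in the same shape that `batching()` returns.

```python
import pandas as pd

rows = pd.DataFrame({
    'items': [1176, 1273, 2292, 6157],
    'group_cat': ['AB', 'AB', 'BC', 'FG'],
    'order_count': [15, 10, 8, 17],
})
new_batches = [['AB'], ['BC', 'FG']]  # same shape as the first value returned by batching()

# Invert the list of lists into a lookup: group key -> batch number.
batch_of = {group: b for b, groups in enumerate(new_batches) for group in groups}

# Label every item row with the batch of its group.
rows['batch'] = rows['group_cat'].map(batch_of)

print(rows['batch'].tolist())                               # [0, 0, 1, 1]
print(rows.groupby('batch')['order_count'].sum().tolist())  # [25, 25]
```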

In [56]:
batch_count

[135, 99]

In [59]:
# Totals returned by the batching() function above; they match the step-by-step "batch_count"
new_batch_count

[135, 99]
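The totals [135, 99] depend on the order in which groups are visited. One possible refinement, not used in the notebook above, is to visit the largest groups first (often called the "longest processing time first" heuristic). The sketch below uses a hypothetical variant named `batching_sorted`, and the input frame holds the group totals copied from the `data_grp` output:

```python
import numpy as np
import pandas as pd


def batching_sorted(df: pd.DataFrame, n: int):
    """Hypothetical variant of batching(): assign the largest groups first."""
    totals = (
        df.groupby('group_cat')['order_count'].sum()
          .sort_values(ascending=False)  # biggest group first
    )
    batches = [[] for _ in range(n)]
    batch_count = [0] * n
    for group, count in totals.items():
        idx = int(np.argmin(batch_count))  # lightest batch so far
        batches[idx].append(group)
        batch_count[idx] += int(count)
    return batches, batch_count


# Group totals copied from the data_grp table; one row per group for brevity.
totals_df = pd.DataFrame({
    'group_cat': ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'GH', 'HI', 'IJ'],
    'order_count': [52, 8, 9, 20, 36, 17, 20, 26, 46],
})

sorted_batches, sorted_count = batching_sorted(totals_df, 2)
print(sorted_count)  # [115, 119]
```

On this data the gap between the two batches drops from 36 to 4. This is still a greedy heuristic, so it does not guarantee the best possible split for every dataset.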