# Weekly Challenge 11

*Original URL* https://community.alteryx.com/t5/Weekly-Challenge/Challenge-11-Identify-Logical-Groups/td-p/36739 and [**My Alteryx Approach**](https://github.com/dsmdavid/Alteryx-Weekly-Challenge/tree/master/submitted/sub_Challenge%2311)

## Brief

### Basic Text Mining:

For this exercise, let's look at some simple text mining that can be performed with Alteryx. There are several ways to do this challenge; I will provide one solution that uses a batch macro and one that does not. It is a great example of how batch macros can simplify a workflow.

The use case:

A manufacturing company receives customer complaint data on a daily basis from their call centers about the medical parts they distribute to their customers. The company monitors these comments to understand which parts and part groups have the highest complaint rate. This helps the company prioritize which parts to focus on from a development standpoint.

In this exercise, take the customer complaint data (Field_6 in the Test2 data) and identify which bucket each complaint falls within. A complaint can fall into multiple buckets; such complaints need to be flagged because they take the highest priority. Create an aggregate view of which buckets or bucket pairings have the highest number of complaints.

This is only a subset of the data, so not all records will be assigned to a bucket; unassigned records can be ignored.

In [1]:
import pandas as pd

## Approach I want to follow:
1. Read the data.
1. Search each of the buckets in the text field.
1. Return a "sorted bucket". 
1. Summarize the results.
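
The four steps above can be sketched on a tiny in-memory example before touching the real files. The rows below are made up for illustration (they are not taken from the challenge data), and `sorted_bucket` is a hypothetical helper name; only the three search terms and bucket names match the buckets file loaded later.

```python
import pandas as pd

# 1. "Read" the data: hypothetical complaint rows standing in for input.csv
complaints = pd.DataFrame({"Field_6": [
    "THE DEVICE WOULD NOT BEEP",
    "SCREEN WAS BLANK AND THE TRIGGER STUCK",
    "NO ISSUE REPORTED",
]})
buckets = pd.DataFrame({
    "Search": ["beep", "screen", "trigger"],
    "Bucket": ["Tones", "Screen", "Trigger"],
})

# Sort once by bucket name so matches come out in alphabetical order
lookup = buckets.sort_values("Bucket").set_index("Search")["Bucket"]

def sorted_bucket(text):
    # 2. Search each term in the lower-cased text
    lowered = text.lower()
    # 3. Return the "sorted bucket" as a comma-separated string
    return ",".join(bucket for term, bucket in lookup.items() if term in lowered)

complaints["Bucket"] = complaints["Field_6"].apply(sorted_bucket)

# 4. Summarize: number of complaints per bucket combination
summary = complaints["Bucket"].value_counts()
print(complaints["Bucket"].tolist())  # ['Tones', 'Screen,Trigger', '']
```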

In [2]:
# Load the complaint text (a single column, Field_6)
df = pd.read_csv("./11_files/input.csv", encoding="latin")     
df.head()

| | Field_6 |
|---|---|
| 0 | REVIEW OF MANUFACTURING RECORDS FOR BOTH THE P... |
| 1 | ADDITIONAL INFORMATION WAS RECEIVED STATING TH... |
| 2 | DEVICE DIAGNOSTIC TESTING AT OFFICE VISIT RESU... |
| 3 | H6-ANALYSIS RESULTS WERE NOT AVAILABLE AT THE ... |
| 4 | HCP REPORTED THE PUMP WAS IMPLANTED IN 2004. I... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178889 entries, 0 to 178888
Data columns (total 1 columns):
Field_6    178883 non-null object
dtypes: object(1)
memory usage: 1.4+ MB


In [4]:
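# Field_6 has a few NaN values (178883 non-null out of 178889 rows) - convert them to the empty string.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the original DataFrame in recent pandas versions.
df['Field_6'] = df['Field_6'].fillna("")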

In [5]:
# What does data look like?
df.iloc[0]['Field_6']

'REVIEW OF MANUFACTURING RECORDS FOR BOTH THE PULSE GENERATOR AND THE BIPOLAR LEAD REVEALED NO ANOMALIES THAT WOULD ADVERSELY AFFECT DEVICE PERFORMANCE. TREATING NEUROLOGIST IS REPORTEDLY AWAITING MANUFACTURER ASSESSMENT OF X-RAYS BEFORE DETERMINING COURSE OF ACTION. X-RAYS HAVE NOT YET BEEN RECEIVED FOR REVIEW BY MANUFACTURER. H.6. THE VNS THERAPY SYSTEM IS INDICATED FOR USE AS AN ADJUNCTIVE THERAPY IN REDUCING THE FREQUENCY OF SEIZURES IN ADULTS AND ADOLESCENTS WITH PARTIAL ONSET SEIZURES THAT ARE REFRACTORY TO ANTIEPILEPTIC MEDICATIONS. IN THE UNITED STATES, THE VNS THERAPY SYSTEM IS APPROVED FOR USE IN ADULTS AND ADOLESCENTS WITH PARTIAL ONSET SEIZURES THAT ARE REFRACTORY TO ANTIEPILEPTIC MEDICATIONS.'

In [6]:
# What are the terms to search / buckets?
df_buckets = pd.read_csv("./11_files/buckets.csv", encoding="latin" )
df_buckets.head()

| | Search | Bucket |
|---|---|---|
| 0 | beep | Tones |
| 1 | screen | Screen |
| 2 | trigger | Trigger |


In [7]:
# Better to sort them once here:
df_bucketsOrdered = df_buckets.sort_values(by = 'Bucket')
df_bucketsOrdered.set_index('Search', inplace=True)
df_bucketsOrdered

| Search | Bucket |
|---|---|
| screen | Screen |
| beep | Tones |
| trigger | Trigger |


In [8]:
def find_buckets(field_6):
    '''
    Receives a text field and looks for each "Search" term from the buckets list.
    Returns a comma-separated string with all the buckets found, in alphabetical order
    (an empty string if nothing matches).
    '''
    try:
        text_to_find = field_6.lower()
    except AttributeError:
        # Non-string input (e.g. NaN): report it and treat it as having no matches
        print("Couldn't convert to lowercase:", field_6)
        return ""
    matched = []
    # df_bucketsOrdered is sorted by Bucket, so matches are appended alphabetically
    for term in df_bucketsOrdered.index:
        if term in text_to_find:
            matched.append(df_bucketsOrdered.loc[term, 'Bucket'])
    return ",".join(matched)

In [9]:
df['Bucket'] = df['Field_6'].apply(find_buckets)

In [10]:
#Values present in the bucket:
df['Bucket'].unique()

array(['', 'Tones', 'Trigger', 'Screen', 'Screen,Tones', 'Tones,Trigger',
       'Screen,Trigger', 'Screen,Tones,Trigger'], dtype=object)
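
The brief asks for complaints that fall into several buckets to be flagged. One possible way to do that, sketched here on a stand-alone Series rather than on the real `df`, is to count the comma-separated labels. The names `n_buckets` and `high_priority` are assumptions introduced for this illustration.

```python
import pandas as pd

# A few of the label strings produced above, used as sample input
labels = pd.Series(["", "Tones", "Screen,Tones", "Screen,Tones,Trigger"], name="Bucket")

# "".split(",") yields [""] (length 1), so force empty labels to a count of 0
n_buckets = labels.str.split(",").str.len().where(labels.ne(""), 0)

# Complaints matching more than one bucket take the highest priority
high_priority = n_buckets > 1

print(n_buckets.tolist())      # [0, 1, 2, 3]
print(high_priority.tolist())  # [False, False, True, True]
```

On the notebook's DataFrame, the same expressions could be applied to `df['Bucket']` and stored as new columns.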

In [11]:
#Summarize
df.groupby(by='Bucket').count().sort_values(by='Field_6', ascending=False)

| Bucket | Field_6 |
|---|---|
| `''` | 175251 |
| Screen | 2250 |
| Trigger | 1086 |
| Tones | 213 |
| Screen,Tones | 64 |
| Screen,Trigger | 14 |
| Tones,Trigger | 9 |
| Screen,Tones,Trigger | 2 |


## Condensed approach:

In [12]:
import time
import pandas as pd

t1 = time.time()

# Input data
df = pd.read_csv("./11_files/input.csv", encoding="latin")
df.fillna(value="", inplace=True)
df_buckets = pd.read_csv("./11_files/buckets.csv", encoding="latin")
# Sort by bucket name once so matches come out in alphabetical order
df_buckets.sort_values(by='Bucket', inplace=True)
df_buckets.set_index('Search', inplace=True)

# Create function
def find_buckets(field_6):
    '''
    Receives a text field and looks for each "Search" term from the buckets list.
    Returns a comma-separated string with all the buckets found, in alphabetical order
    (an empty string if nothing matches).
    '''
    try:
        text_to_find = field_6.lower()
    except AttributeError:
        # Non-string input: report it and treat it as having no matches
        print("Couldn't convert to lowercase:", field_6)
        return ""
    matched = []
    # Use df_buckets here: df_bucketsOrdered only exists in the step-by-step cells above
    for term in df_buckets.index:
        if term in text_to_find:
            matched.append(df_buckets.loc[term, 'Bucket'])
    return ",".join(matched)

# Assign buckets
df['Bucket'] = df['Field_6'].apply(find_buckets)

# Summarize
print(df.groupby(by='Bucket').count().sort_values(by='Field_6', ascending=False))
t2 = time.time()
t2 - t1  # elapsed time in seconds

                      Field_6
Bucket                       
                       175251
Screen                   2250
Trigger                  1086
Tones                     213
Screen,Tones               64
Screen,Trigger             14
Tones,Trigger               9
Screen,Tones,Trigger        2


8.651309967041016
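
Most of the work above is done by calling a Python function once per row. A possible alternative, sketched below on made-up rows, is to loop over the three search terms instead and let pandas test the whole column at once with `Series.str.contains`. This is an assumption about where time could be saved, not something measured in this notebook, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for input.csv (including one missing value)
complaints = pd.DataFrame({"Field_6": [
    "THE DEVICE WOULD NOT BEEP",
    "SCREEN WAS BLANK AND THE TRIGGER STUCK",
    "NO ISSUE REPORTED",
    None,
]})
buckets = pd.DataFrame({
    "Search": ["beep", "screen", "trigger"],
    "Bucket": ["Tones", "Screen", "Trigger"],
}).sort_values("Bucket")

text = complaints["Field_6"].fillna("").str.lower()
labels = pd.Series("", index=complaints.index, dtype=object)

# One pass per search term (3 passes) instead of one function call per row
for term, bucket in zip(buckets["Search"], buckets["Bucket"]):
    hit = text.str.contains(term, regex=False)
    # Add a comma only where a label has already been written
    prefix = labels.where(labels.eq(""), labels + ",")
    labels = labels.where(~hit, prefix + bucket)

complaints["Bucket"] = labels
print(complaints["Bucket"].tolist())  # ['Tones', 'Screen,Trigger', '', '']
```

Because the buckets are still visited in alphabetical order, the resulting labels have the same form as those returned by `find_buckets`.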