Name: **Arkaprabha Majumdar**

Assignment submitted to: Cogno AI, AllinCall

17-Nov-2020 to 19-Nov-2020

Let's unzip the given zip file:

In [1]:
!unzip "/content/MachineLearningContest.zip"

Archive:  /content/MachineLearningContest.zip
   creating: MachineLearningContest/
  inflating: MachineLearningContest/TestingData.xlsx  
  inflating: MachineLearningContest/intents.json  


In [2]:
%cd MachineLearningContest/
import pandas as pd
import json

/content/MachineLearningContest


## Looking at the intents data, given as JSON:

In [3]:
intents_data = pd.read_json("intents.json")
intents_data.head()

| | id | variations | intent |
|---|---|---|---|
| 0 | 995 | {'0': 'Is there a procedure to open AllinCall ... | Can I open AllinCall bank online? |
| 1 | 996 | {'0': 'Explain the features of AllinCall bank ... | What are the features of AllinCall bank account? |
| 2 | 997 | {'0': 'What are the features of AllinCall bank... | What are the features of AllinCall bank Video ... |
| 3 | 998 | {'0': 'Am i allowed to open a joint AllinCall ... | Can I open a joint AllinCall bank account? |
| 4 | 999 | {'0': 'I don’t have a PAN', '1': 'how to open ... | I don’t have a PAN card |


## And the test data:

In [174]:
test_data = pd.read_excel("TestingData.xlsx")
test_data.head(2)

| | Test user queries |
|---|---|
| 0 | My money deduct my account but not credit |
| 1 | passbook delivery courier |


## Note:
“Specific Intent” class is denoted by the ID of the intent that is matched to the test user query.

“Suggestion” class is denoted by the integer “1”

“Failure message” class is denoted by the integer “2”

## Splitting the variations column into individual rows:

In [9]:
id_rows = []
keys = []
values = []
intent = []
for row in range(intents_data.shape[0]):
  for key in intents_data['variations'][row].keys():
    id_rows.append(intents_data["id"][row])
    keys.append(key)
    values.append(intents_data['variations'][row][key])
    intent.append(intents_data['intent'][row])

In [152]:
df = pd.DataFrame({"id":id_rows,"query_key":keys,"query_val":values,"intent":intent})
df_indexed = df.groupby('id').first()

In [162]:
df

| | id | query_key | query_val | intent |
|---|---|---|---|---|
| 0 | 995 | 0 | Is there a procedure to open AllinCall bank on... | Can I open AllinCall bank online? |
| 1 | 995 | 1 | How to open AllinCall bank online | Can I open AllinCall bank online? |
| 2 | 995 | 2 | Can I open AllinCall bank online? | Can I open AllinCall bank online? |
| 3 | 996 | 0 | Explain the features of AllinCall bank account | What are the features of AllinCall bank account? |
| 4 | 996 | 1 | Tellme about the features of AllinCall bank ac... | What are the features of AllinCall bank account? |
| ... | ... | ... | ... | ... |
| 1558 | 1002 | 9 | minimum balance update my ac | What is the average monthly balance (AMB)? |
| 1559 | 1010 | 0 | What are the Documents required to open AllinC... | Documents for AllinCall bank |
| 1560 | 1010 | 1 | How to open AllinCall? What is needed along? | Documents for AllinCall bank |
| 1561 | 1010 | 2 | Tell me about the Documents needed to open All... | Documents for AllinCall bank |
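The same flattening can also be written as a single comprehension over the records. The following is only a minimal sketch on two abridged toy records, not the full `intents.json`; the `intents` list and the `flatten_variations` helper are illustrative names that do not appear in the notebook.

```python
# Toy records mirroring the structure of intents.json (values abridged by hand).
intents = [
    {
        "id": 995,
        "intent": "Can I open AllinCall bank online?",
        "variations": {
            "0": "How to open AllinCall bank online",
            "1": "Can I open AllinCall bank online?",
        },
    },
    {
        "id": 999,
        "intent": "I don't have a PAN card",
        "variations": {"0": "I don't have a PAN"},
    },
]


def flatten_variations(records):
    """Return one dict per (intent, variation) pair."""
    return [
        {
            "id": rec["id"],
            "query_key": key,
            "query_val": text,
            "intent": rec["intent"],
        }
        for rec in records
        for key, text in rec["variations"].items()
    ]


rows = flatten_variations(intents)
for row in rows:
    print(row["id"], row["query_key"], row["query_val"])
```

Passing `rows` to `pd.DataFrame` would give the same four columns as `df` above.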


## Let's label encode the "intent" column

In [185]:
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()

In [12]:
df["intent_num"] = label_encoder.fit_transform(df['intent'])

I used a simple TF-IDF vectorizer to convert the query text into numeric feature vectors.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

Tfd = TfidfVectorizer(stop_words="english",max_df=0.7)
Tfd_train=Tfd.fit_transform(df['query_val'])

# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print(Tfd.get_feature_names())

['aadhaar', 'aadhar', 'able', 'aboout', 'ac', 'accident', 'account', 'activate', 'activated', 'active', 'activity', 'add', 'address', 'advantages', 'allincall', 'allowed', 'allows', 'alowed', 'amazing', 'amb', 'annual', 'answer', 'app', 'application', 'applied', 'apply', 'applying', 'appointment', 'approved', 'approver', 'asking', 'atm', 'auto', 'autopay', 'autosweep', 'avail', 'available', 'average', 'away', 'awesome', 'bad', 'balance', 'bank', 'banking', 'benefits', 'billers', 'billpay', 'bills', 'birth', 'block', 'blocked', 'bond', 'bonds', 'book', 'booking', 'bot', 'bound', 'branch', 'browser', 'bye', 'byee', 'byeee', 'byeeee', 'came', 'cancel', 'card', 'carry', 'case', 'cash', 'cd', 'change', 'charge', 'charged', 'charges', 'cheat', 'check', 'checkboook', 'checker', 'checking', 'cheque', 'chequebook', 'clarify', 'close', 'closed', 'cnr', 'code', 'collateral', 'collect', 'coming', 'communication', 'complete', 'completed', 'completing', 'compulsory', 'confused', 'connection', 'conse

In [17]:
print(Tfd.transform([test_data['Test user queries'][5]]))

  (0, 237)	0.22244933447431223
  (0, 103)	0.3591019276529166
  (0, 7)	0.906400628391163


In [18]:
from sklearn.metrics.pairwise import cosine_similarity
sorted(cosine_similarity(Tfd.transform([test_data['Test user queries'][5]]),Tfd_train)[0])[-5:]

[0.3591019276529166,
 0.36922412526051407,
 0.5479593454072793,
 0.5735595046801804,
 0.8471627976586104]
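To make the two outputs above easier to interpret, here is a minimal sketch of the same vectorize-then-compare step on a three-sentence toy corpus. The sentences and the `similarities` helper are made up for illustration, and `max_df` is left at its default because the corpus is so small. A query identical to a training sentence scores approximately 1, and a query that shares no vocabulary with the corpus scores 0.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training queries (not taken from intents.json).
train_queries = [
    "how to open a bank account",
    "apply for debit card",
    "block my debit card",
]

vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(train_queries)


def similarities(query):
    """Cosine similarity of one query against every training query."""
    return cosine_similarity(vectorizer.transform([query]), train_matrix)[0]


exact = similarities("apply for debit card")        # identical to training row 1
unseen = similarities("weather forecast tomorrow")  # no shared vocabulary

print(exact.round(3))
print(unseen)
# Floating-point rounding means an "exact" match is only guaranteed to be close to 1.
print(math.isclose(exact[1], 1.0, abs_tol=1e-9))
```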

# Procedure
Next, I predict on the test queries. The procedure I followed is:
- first, vectorize the test query
- compute its cosine similarity against every training query and keep the three best matches
  - if the best similarity is 0, it's a failure, so we add "2"
  - if the best similarity is 1, it's a specific intent, so we add the index of the matching row
  - otherwise it's a suggestion: we add "1", followed by the row indices of those top-three matches whose similarity is greater than 0.5
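The decision rule in the list above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the notebook's code: the `classify` name and its `threshold` and `top_k` parameters are hypothetical, and it uses `math.isclose` for the "similarity is 1" test instead of an exact float comparison.

```python
import math


def classify(scored_rows, threshold=0.5, top_k=3):
    """Turn (similarity, row_index) pairs into the class string described above.

    Returns "2" for a failure, the row index for a specific intent, or
    "1" followed by the row indices of the suggestions.
    """
    top = sorted(scored_rows, reverse=True)[:top_k]
    best_score, best_row = top[0]
    if best_score == 0.0:
        return "2"
    # Assumption: treat scores within 1e-9 of 1 as an exact match.
    if math.isclose(best_score, 1.0, abs_tol=1e-9):
        return str(best_row)
    kept = [str(row) for score, row in top if score > threshold]
    return ",".join(["1"] + kept)


print(classify([(0.0, 0), (0.0, 1)]))            # failure
print(classify([(1.0, 4), (0.3, 2)]))            # specific intent
print(classify([(0.8, 7), (0.6, 3), (0.2, 1)]))  # suggestions above the threshold
print(classify([(0.4, 7), (0.1, 3)]))            # suggestion class with nothing above 0.5
```

The last call shows why some output rows contain only the class "1": the best match is neither 0 nor 1, but no candidate clears the 0.5 threshold.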

In [184]:
cosine_val = []  # per query: top-3 (similarity, row index) pairs
result = []
for i,query in enumerate(test_data['Test user queries']):
  sug = str(i)+","
  sim_arr = cosine_similarity(Tfd.transform([query]),Tfd_train)[0] #similarity array
  # pair each similarity with its row index in df and keep the three best matches
  cosine_val.append(sorted(zip(sim_arr, range(len(sim_arr))), reverse=True)[:3])
  best_sim, best_row = cosine_val[i][0]
  if best_sim == 0.0:
    sug+='2'              # failure: no term overlap with any training query
  elif best_sim == 1.0:
    sug+=str(best_row)    # specific intent: row index of the exact match
  else:
    sug+="1,"             # suggestion class
    for sim, row_ix in cosine_val[i]:
      if sim>.5:
        sug+=str(row_ix)+','
    sug = sug[:-1]        # drop the trailing comma
  print(sug)
  result.append(sug)

0,1,1161
1,1,1495,1497,1493
2,2
3,2
4,1,1331,1358,1357
5,1,413,411,412
6,1,1460,1459,1454
7,1,1484,1373
8,1,507,95,501
9,1,1242,491,927
10,1,88,0,372
11,1,173,140,211
12,1
13,1,286,279,719
14,1,1488,1487,1485
15,1,223
16,1268
17,1,400,399,409
18,1,222,1217
19,496
20,1,1011,1004,1008
21,1
22,1,496,489,37
23,1,1176,491,552
24,1,211,177,176
25,2
26,1,1443,1442,1437
27,1,1372,1374,1373
28,1,1443,1442,1437
29,1,1481,1477,483
30,564
31,1,1490,490,547
32,1,47,42,45
33,1,1210,1208,136
34,1,220,219,217
35,1,196
36,1
37,1,551,1135,1128
38,1
39,1,1460,1459,1454
40,1
41,1,593
42,2
43,1,1287,1486,1488
44,1,895
45,1,784,777,902
46,1
47,1,1372,1323
48,1,1268,1267,1254
49,1416
50,2
51,1,661,663
52,1,256,1313,1312
53,1,167,175,254
54,1
55,1,1460,1459,1454
56,1
57,1
58,1,413,411,412
59,1,1293,1490,167
60,2
61,1,1242,491,927
62,1
63,1,604
64,1,1011,1004,1008
65,1,173,140,211
66,1,88,0,372
67,1,1373,1484
68,1,496,489,37
69,1,400,399,409
70,1,222,221
71,1,1176,491,552
72,1,211,177,176
73,1,286,279,719
74,1

However, these numbers are row indices into `df`, not IDs from the original dataset, because we split the variations column into one row per variation.

So we need to fetch the actual IDs based on these:
- keep the other entries unchanged (so exact-match entries still carry a row index)
- if the class is "1" (i.e., suggestions), then we fetch the real intent IDs.

In [167]:
res_final = []
for each in result:
  tmp = each.split(",")
  if tmp[1] == '1':  # suggestion class: map row indices of df to intent IDs
    temp_list = []
    for suggestion in tmp[2:]:
      if df["id"][int(suggestion)] not in temp_list:
        print(df["intent"][int(suggestion)])
        temp_list.append(df["id"][int(suggestion)])
    an_list = list(set(temp_list))  # note: set() does not preserve the ranking order
    print(tmp[:2]+an_list)
    res_final.append(",".join(str(x) for x in tmp[:2]+an_list))
  else:
    res_final.append(each)  # failure and specific-intent entries are kept unchanged

Why money is getting deducted from account?
['0', '1', 1150]
How to apply for passbook?
['1', '1', 1218]
My account got blocked
My debit card was stolen
['4', '1', 1180, 1182]
I want to activate my account when will it be activated?
['5', '1', 1068]
How to Apply for Debit Card?
['6', '1', 1203]
Where can I get my account statement?
Fraud
['7', '1', 1208, 1183]
what do I do after opening AllinCall?
How to open AllinCall?
['8', '1', 1009, 1082]
Open AllinCall
How can I see my account?
Can I open investment account and how can I open an investment account through my AllinCall bank bank account?
['9', '1', 1169, 1126, 1081]
How to open AllinCall?
Can I open AllinCall bank online?
How to open Video KYC account?
['10', '1', 1009, 1161, 995]
I haven't yet received a debit card with my AllinCall bank account? When will I receive a debit card?
Where will I receive my physical debit card?
I received my debit card. How do I know if it is active or not?
['11', '1', 1038, 1030, 1022]
['12', '1']
Wh
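The mapping above deduplicates intent IDs with `list(set(...))`, which does not keep the suggestions in similarity order. If ranking order matters, an order-preserving variant could look like the sketch below. The `row_to_intent_id` table holds hypothetical toy values and `rows_to_intent_ids` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical row-index -> intent-ID table (in the notebook this is df["id"]).
row_to_intent_id = {0: 995, 1: 995, 2: 995, 3: 996, 4: 996, 5: 997}


def rows_to_intent_ids(row_indices, lookup):
    """Map ranked row indices to intent IDs, dropping repeats but keeping rank order."""
    ids = (lookup[row] for row in row_indices)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(ids))


ranked_rows = [4, 0, 3, 1]  # best match first
print(rows_to_intent_ids(ranked_rows, row_to_intent_id))
```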

In [186]:
res_final[50:105]

['50,2',
 '51,1,1162',
 '52,1,1178,1047',
 '53,1,1028,1030,1047',
 '54,1',
 '55,1,1203',
 '56,1',
 '57,1',
 '58,1,1068',
 '59,1,1209,1028,1175',
 '60,2',
 '61,1,1169,1126,1081',
 '62,1',
 '63,1,1165',
 '64,1,1134',
 '65,1,1038,1030,1022',
 '66,1,1009,1161,995',
 '67,1,1208,1183',
 '68,1,1081,999',
 '69,1,1064,1065',
 '70,1,1039',
 '71,1,1153,1086,1081',
 '72,1,1038,1030',
 '73,1,1105,1051',
 '74,1,1180',
 '75,1,1201',
 '76,1,1209',
 '77,496',
 '78,496',
 '79,1',
 '80,2',
 '81,2',
 '82,2',
 '83,1,1172',
 '84,1,1068',
 '85,642',
 '86,1,1059',
 '87,1,1171',
 '88,1,1207',
 '89,1,1154',
 '90,1,1118',
 '91,1,1059',
 '92,1,1169,1188,996',
 '93,1,1021',
 '94,1,1018,1203,1086',
 '95,1,1139,1205',
 '96,1,1169,1126',
 '97,1,1039',
 '98,1',
 '99,1,1218',
 '100,1,1075',
 '101,1,1043',
 '102,1,1059,1157',
 '103,1,1139,1109,1175',
 '104,1,1028,1030,1047']

We can spot-check some of these:

In [170]:
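print(test_data['Test user queries'][59])
# Note: df has a default row index, so df["intent"][n] returns row n, not intent ID n;
# df_indexed.loc[n, "intent"] would look up by intent ID instead.
print(df["intent"][1209])
print(df["intent"][1028])
print(df["intent"][1175])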

Sir mera account opne hogaya hai per main janana chati ho mera account no kya hai or mera check book or dibet cart and credit card or ATM card mujhe kab milega
Where can I See my account details?
What is the premium amount of Secure One?
How can I complete my Full KYC


In [183]:
print(test_data['Test user queries'][104])
print(df["intent"][1028])
print(df["intent"][1030])
print(df["intent"][1047])

I want ATM debit card
What is the premium amount of Secure One?
Can I open RD account in my AllinCall bank digital account?
How can I open Recurring Deposit?
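One caveat when reading these spot checks: `df` has a default 0-based row index, so `df["intent"][1028]` returns the intent of row 1028, not of intent ID 1028. Looking up by intent ID needs a frame indexed by `id`, such as the `df_indexed` built earlier with `groupby('id').first()`. A minimal sketch on toy rows follows; the three rows are copied from the table above, and the `intent_for_id` helper is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the flattened df, with the default 0..n-1 row index.
df = pd.DataFrame({
    "id": [995, 995, 996],
    "query_val": [
        "How to open AllinCall bank online",
        "Can I open AllinCall bank online?",
        "Explain the features of AllinCall bank account",
    ],
    "intent": [
        "Can I open AllinCall bank online?",
        "Can I open AllinCall bank online?",
        "What are the features of AllinCall bank account?",
    ],
})

# One row per intent ID, indexed by that ID.
df_indexed = df.groupby("id").first()


def intent_for_id(intent_id):
    """Return the intent text for an original intent ID."""
    return df_indexed.loc[intent_id, "intent"]


print(intent_for_id(996))  # lookup by intent ID
print(df["intent"][1])     # lookup by row label 1, which belongs to intent ID 995
```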


In [173]:
pd.DataFrame({"submission":res_final}).to_csv("submission.csv", index=False)