<a href="https://colab.research.google.com/github/divya-gh/ConvoSense-AI-Banking-Chatbot-Analytics/blob/main/notebooks/Linguistic_%26_NLP_POS_Frequency_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Objective

### POS Frequency Analysis
- This quantifies which linguistic structures appear most often in failed (misclassified) utterances.

### Manually Observe For:
- Was the utterance multi-intent?
- Was context missing?
- Was the entity vague?

In [6]:
!pip install -U spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")


In [6]:
!git clone https://github.com/divya-gh/ConvoSense-AI-Banking-Chatbot-Analytics.git

Cloning into 'ConvoSense-AI-Banking-Chatbot-Analytics'...
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 49 (delta 23), reused 23 (delta 8), pack-reused 0 (from 0)
Receiving objects: 100% (49/49), 23.43 KiB | 1.67 MiB/s, done.
Resolving deltas: 100% (23/23), done.


In [2]:
import os
os.getcwd()


'/content'

In [3]:
import json
import pandas as pd
from pathlib import Path
from collections import Counter


In [7]:
data_path = Path("./ConvoSense-AI-Banking-Chatbot-Analytics/data/raw_conversations/banking_conversations.json")

with open(data_path, "r") as f:
    conversations = json.load(f)

df = pd.DataFrame(conversations)
df.head()

Unnamed: 0,conversation_id,timestamp,channel,user_utterance,true_intent,predicted_intent,confidence_score,entities,fallback_triggered,escalated_to_agent,resolved
0,conv_001,2025-01-05T10:15:00,chat,What is my checking account balance?,Check_Account_Balance,Check_Account_Balance,0.92,{'account_type': 'checking'},False,False,True
1,conv_002,2025-01-05T10:18:00,chat,I see a charge I don't recognize from Amazon,Dispute_Transaction,Transaction_History,0.61,{'merchant_name': 'Amazon'},False,True,False
2,conv_003,2025-01-05T10:22:00,chat,My debit card was stolen,Card_Lost_Or_Stolen,Card_Lost_Or_Stolen,0.95,{},False,False,True
3,conv_004,2025-01-05T10:30:00,voice,Can you tell me when my last five transactions...,Transaction_History,Transaction_History,0.88,{'transaction_count': 5},False,False,True
4,conv_005,2025-01-05T10:35:00,chat,I need help updating my phone number,Update_Personal_Details,Default_Fallback,0.42,{'field': 'phone_number'},True,True,False


### Step 1: Isolate Misclassified Utterances


In [8]:
misclassified = df[df['true_intent'] != df['predicted_intent']]
misclassified[['user_utterance', 'true_intent', 'predicted_intent']].head()



Unnamed: 0,user_utterance,true_intent,predicted_intent
1,I see a charge I don't recognize from Amazon,Dispute_Transaction,Transaction_History
4,I need help updating my phone number,Update_Personal_Details,Default_Fallback
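
As a rough companion to the filter above, the sketch below tallies which (true, predicted) intent pairs are confused and how often. It uses plain Python and only the five rows visible in the `df.head()` output, so the resulting rate describes that preview rather than any larger dataset. The names `rows`, `confusions`, and `error_rate` are illustrative and not part of the notebook.

```python
from collections import Counter

# (true_intent, predicted_intent) for the five rows shown in df.head().
rows = [
    ("Check_Account_Balance", "Check_Account_Balance"),
    ("Dispute_Transaction", "Transaction_History"),
    ("Card_Lost_Or_Stolen", "Card_Lost_Or_Stolen"),
    ("Transaction_History", "Transaction_History"),
    ("Update_Personal_Details", "Default_Fallback"),
]

# Count each (true, predicted) pair where the prediction was wrong.
confusions = Counter((true, pred) for true, pred in rows if true != pred)
error_rate = sum(confusions.values()) / len(rows)

print(f"error rate: {error_rate:.0%}")
for (true_intent, predicted_intent), n in confusions.most_common():
    print(f"{true_intent} -> {predicted_intent}: {n}")
```

With a DataFrame at hand, `pd.crosstab(df['true_intent'], df['predicted_intent'])` would give the same information as a full confusion table.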


### Step 3: POS Tagging

Part-of-speech (POS) tagging labels each token with its grammatical category, which lets us see which language structures show up in the utterances that failed.

In [9]:
def pos_tags(text):
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

# Work on an explicit copy so adding a column does not raise pandas' SettingWithCopyWarning.
misclassified = misclassified.copy()
misclassified['pos_tags'] = misclassified['user_utterance'].apply(pos_tags)
misclassified[['user_utterance', 'pos_tags']].head()


Unnamed: 0,user_utterance,pos_tags
1,I see a charge I don't recognize from Amazon,"[(I, PRON), (see, VERB), (a, DET), (charge, NO..."
4,I need help updating my phone number,"[(I, PRON), (need, VERB), (help, VERB), (updat..."


In [10]:
from collections import Counter

pos_counter = Counter()

for tags in misclassified['pos_tags']:
    for _, pos in tags:
        pos_counter[pos] += 1

pos_counter.most_common(10)


[('VERB', 5),
 ('PRON', 4),
 ('NOUN', 3),
 ('DET', 1),
 ('AUX', 1),
 ('PART', 1),
 ('ADP', 1),
 ('PROPN', 1)]
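
Raw counts are hard to compare across samples of different sizes, so one option is to convert them to shares of all tokens. The sketch below does that with the counts printed above, copied in by hand; `pos_share` is an illustrative name, not something the notebook defines.

```python
from collections import Counter

# Counts copied from the pos_counter output above.
pos_counter = Counter({
    "VERB": 5, "PRON": 4, "NOUN": 3, "DET": 1,
    "AUX": 1, "PART": 1, "ADP": 1, "PROPN": 1,
})

total = sum(pos_counter.values())  # all tagged tokens in the misclassified utterances

# Share of each tag among all tokens, most frequent first.
pos_share = {pos: count / total for pos, count in pos_counter.most_common()}

for pos, share in pos_share.items():
    print(f"{pos:<6}{share:6.1%}")
```

The same calculation on the correctly classified utterances would give a baseline, which is needed before reading any tag as a failure signal.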

In [12]:
misclassified.sample(2)[['user_utterance', 'true_intent', 'predicted_intent']]


Unnamed: 0,user_utterance,true_intent,predicted_intent
4,I need help updating my phone number,Update_Personal_Details,Default_Fallback
1,I see a charge I don't recognize from Amazon,Dispute_Transaction,Transaction_History


### Part-of-Speech Failure Patterns

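POS frequency analysis of the misclassified utterances shows verbs (5), pronouns (4), and nouns (3) as the most frequent tags, with prepositions (ADP) and auxiliary verbs (AUX) appearing once each. Only two utterances were misclassified here, so these counts are too small to support firm conclusions. They are consistent with the hypothesis that compound actions and ambiguous entities contribute to intent misclassification, and that prepositions and auxiliary verbs add contextual complexity that makes intent resolution less certain.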

#### Meaning:

Among the misclassified queries, verbs and nouns are some of the most frequent tags.

Many verbs → users may be describing multiple actions in one query (“transfer and cancel”, “reverse and dispute”), which makes intent harder to detect.

Many nouns → users mention ambiguous entities (“payment”, “charge”, “transfer”), which the model struggles to interpret precisely.

Prepositions (ADP) and auxiliary verbs (AUX) also appear, though only once each in this sample.

Prepositions like to, from, for, with introduce context switching, such as moving between accounts or describing relationships between actions.

Auxiliary verbs like can, should, would introduce conditional or modal phrasing, which makes the user's intent less direct and harder for the model to resolve.
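
The "many verbs" reading can be turned into a simple check. The sketch below is a heuristic under stated assumptions, not the notebook's method: it flags an utterance as possibly multi-action when it has at least two VERB tokens and a coordinating conjunction (CCONJ). The function name `looks_multi_action`, both example utterances, and their tags are hypothetical and hand-assigned for illustration; real spaCy output may differ.

```python
def looks_multi_action(tags):
    """Heuristic: at least two VERB tokens plus a coordinating conjunction."""
    verb_count = sum(1 for _, pos in tags if pos == "VERB")
    has_conjunction = any(pos == "CCONJ" for _, pos in tags)
    return verb_count >= 2 and has_conjunction

# Hand-tagged, hypothetical utterances in the same (text, pos) format
# that pos_tags() returns.
multi = [
    ("Cancel", "VERB"), ("the", "DET"), ("transfer", "NOUN"),
    ("and", "CCONJ"),
    ("dispute", "VERB"), ("the", "DET"), ("charge", "NOUN"),
]
single = [
    ("Transfer", "VERB"), ("money", "NOUN"), ("from", "ADP"),
    ("savings", "NOUN"), ("to", "ADP"), ("checking", "NOUN"),
]

print(looks_multi_action(multi))   # True
print(looks_multi_action(single))  # False
```

Requiring the conjunction is a deliberate choice: a verb count alone would also flag single-intent phrasings that happen to chain several verbs.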


## Business Impact:
“Linguistic ambiguity directly correlates with escalation risk and increased customer support cost.”

### Meaning:
When a customer's message is linguistically ambiguous (unclear wording, vague phrasing, or multiple possible interpretations), the system is more likely to:

- fail to understand the intent,
- escalate the query to a human agent, and
- increase the cost of handling the request.
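
One way to check this claim against the data is to compare escalation rates for misclassified and correctly classified conversations. The sketch below does so for the five rows visible in the `df.head()` output, using plain Python; the names `rows`, `counts`, and `escalation_rate` are illustrative. Five rows are far too few to establish a correlation, so this only shows the shape of the calculation.

```python
from collections import defaultdict

# (conversation_id, true_intent, predicted_intent, escalated_to_agent)
# for the five rows shown in df.head().
rows = [
    ("conv_001", "Check_Account_Balance", "Check_Account_Balance", False),
    ("conv_002", "Dispute_Transaction", "Transaction_History", True),
    ("conv_003", "Card_Lost_Or_Stolen", "Card_Lost_Or_Stolen", False),
    ("conv_004", "Transaction_History", "Transaction_History", False),
    ("conv_005", "Update_Personal_Details", "Default_Fallback", True),
]

# group -> [number escalated, number of conversations]
counts = defaultdict(lambda: [0, 0])
for _, true_intent, predicted_intent, escalated in rows:
    group = "misclassified" if true_intent != predicted_intent else "correct"
    counts[group][0] += int(escalated)
    counts[group][1] += 1

escalation_rate = {group: esc / total for group, (esc, total) in counts.items()}
print(escalation_rate)
```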

----------------------------------------

### Additional Details:

In part-of-speech (POS) tagging, and in spaCy's `token.pos_` labels in particular, ADP and AUX are two specific grammatical categories.
They're short labels, but they carry very different meanings.

## ADP
### ADP = Adposition  
This category includes prepositions and postpositions: words that show relationships of place, time, direction, or method.

#### Examples of ADP:
- in
- on
- at
- for
- with
- to
- from
- about

Why it matters for this analysis:
ADPs often signal context switching in user queries.
### Example:
- “Transfer money from savings to checking”

The ADPs “from” and “to” introduce multiple contextual relationships.

## AUX
### AUX = Auxiliary verb  
These are helping verbs that modify the main verb to express tense, mood, or modality.

#### Examples of AUX:
- can
- should
- would
- is (when used as a helper: “is going”)
- have (in “have done”)
- will

Why it matters for this analysis:
AUX verbs often introduce conditional or modal phrasing, which increases ambiguity in intent.
### Example:
- “Can I reverse this payment”
- “Should I dispute this transaction”
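
To inspect these two categories in practice, it can help to pull out just the ADP and AUX tokens from a tagged utterance. The sketch below works on (text, pos) pairs in the format `pos_tags()` returns. The helper `tokens_with_pos` is hypothetical, and the tags are hand-assigned to the example sentences for illustration rather than taken from spaCy output.

```python
def tokens_with_pos(tags, wanted):
    """Return the token texts whose POS label is in `wanted`."""
    return [text for text, pos in tags if pos in wanted]

# Hand-tagged versions of the example sentences (assumed tags).
transfer = [
    ("Transfer", "VERB"), ("money", "NOUN"), ("from", "ADP"),
    ("savings", "NOUN"), ("to", "ADP"), ("checking", "NOUN"),
]
reverse = [
    ("Can", "AUX"), ("I", "PRON"), ("reverse", "VERB"),
    ("this", "DET"), ("payment", "NOUN"),
]

print(tokens_with_pos(transfer, {"ADP"}))  # ['from', 'to']
print(tokens_with_pos(reverse, {"AUX"}))   # ['Can']
```

Applied to a column of tagged utterances, the same helper would show which specific prepositions and auxiliaries recur in failures, which is more informative than the tag totals alone.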