# **AutoTagger: Intelligent Question Tagging with Difficulty & Intent Detection**

### Project Overview  
**AutoTagger is an AI-powered NLP system that enhances how technical questions are classified on platforms like Stack Overflow or Quora. It automatically predicts the relevant topic tags, estimates the difficulty level, and detects the user's intent behind the question.**

### Key Features: 
- Tag Prediction: Multi-label classification of questions (e.g., ["Python", "NLP"]) 
- Difficulty Estimation: Predicts whether a question is Easy, Medium, or Hard 
- Intent Detection: Classifies the question’s intent (e.g., How-to, Debugging, Concept Explanation) 
- Similar Questions Retrieval: Uses embeddings to show similar previously answered questions 
- Streamlit Web App: Simple UI with real-time prediction 
- Confidence Scores: Reports the model's confidence for each prediction 
- Feedback Integration: Lets users correct predictions so the corrections can be used to improve the model

## Import Libraries

In [None]:
import ast
import re
import warnings

import numpy as np
import pandas as pd
import spacy
from bs4 import BeautifulSoup

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [130]:
# questions = pd.read_csv('././Dataset/Questions.csv', encoding='ISO-8859-1')
# answers = pd.read_csv('././Dataset/Answers.csv', encoding='ISO-8859-1')
# tags = pd.read_csv('././Dataset/Tags.csv', encoding='ISO-8859-1')

In [131]:
# answer_counts = answers.groupby("ParentId").size().reset_index(name="AnswerCount")

# # Group tags per question
# tag_groups = tags.groupby("Id")["Tag"].apply(list).reset_index(name="Tags")

In [132]:
# # Merge answer count and tags into questions
# questions_merged = questions.merge(answer_counts, left_on="Id", right_on="ParentId", how="left")
# questions_merged = questions_merged.merge(tag_groups, on="Id", how="left")

In [133]:
# # Fill missing answer counts with 0
# questions_merged["AnswerCount"] = questions_merged["AnswerCount"].fillna(0).astype(int)

# # Drop unnecessary columns like 'ParentId'
# questions_merged = questions_merged.drop(columns=['ParentId','OwnerUserId','ClosedDate'], errors="ignore")

In [134]:
# # Now sample 10k clean records
# df = questions_merged.sample(n=10000, random_state=42).reset_index(drop=True)

In [135]:
# df.to_csv('stack_overflow_dataset.csv', index=False, encoding='utf-8')
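
The commented-out cells above build the sampled dataset by counting answers per question, grouping tags per question, and left-joining both onto the questions table. The sketch below replays those steps on three made-up questions (the toy frames are hypothetical stand-ins for `Questions.csv`, `Answers.csv`, and `Tags.csv`) to show why the left joins and the `fillna(0)` are needed.

```python
import pandas as pd

# Toy stand-ins for Questions.csv, Answers.csv and Tags.csv (hypothetical data).
questions = pd.DataFrame({"Id": [1, 2, 3], "Title": ["q1", "q2", "q3"]})
answers = pd.DataFrame({"Id": [10, 11, 12], "ParentId": [1, 1, 3]})
tags = pd.DataFrame({"Id": [1, 1, 2], "Tag": ["python", "pandas", "java"]})

# One row per question that has at least one answer.
answer_counts = answers.groupby("ParentId").size().reset_index(name="AnswerCount")
# One row per question, with its tags collected into a list.
tag_groups = tags.groupby("Id")["Tag"].apply(list).reset_index(name="Tags")

# Left joins keep every question, even those with no answers or no tags.
merged = questions.merge(answer_counts, left_on="Id", right_on="ParentId", how="left")
merged = merged.merge(tag_groups, on="Id", how="left")

# Unanswered questions come out of the join as NaN, so fill before casting to int.
merged["AnswerCount"] = merged["AnswerCount"].fillna(0).astype(int)
merged = merged.drop(columns=["ParentId"], errors="ignore")

print(merged)
```

Question 2 has no answers and ends up with `AnswerCount` 0, while question 3 has no tags and keeps a missing value in `Tags`. Rows like the latter are what the later empty-tag filter removes.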

## Load Dataset

In [136]:
df = pd.read_csv('./stack_overflow_dataset.csv')
df

,Id,CreationDate,Score,Title,Body,AnswerCount,Tags
0,17016800,2013-06-10T04:15:05Z,0,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,1,"['android', 'events', 'android-edittext', 'send']"
1,7685280,2011-10-07T09:20:41Z,7,EditText: how to enable/disable input?,<p>I have a 7x6 grid of EditText views. I want...,7,['android']
2,24178500,2014-06-12T07:13:00Z,1,Mobile web - Displaying a fixed div below a re...,<p>I want to have a relative div at the top of...,0,"['jquery', 'html', 'css', 'iphone', 'mobile']"
3,38820760,2016-08-08T03:10:28Z,0,How to create tabbed view in HTML?,<p>I'm trying to create a tabbed view in HTML ...,4,"['html', 'google-sites']"
4,3674120,2010-09-09T05:53:46Z,0,Problems decrypting HTTP Live Stream,<p>I have a single key encrypted HTTP Live Str...,2,"['http', 'stream', 'openssl', 'live', 'encrypt..."
...,...,...,...,...,...,...,...
9995,3063960,2010-06-17T17:20:01Z,1,Cocoa memory management,<p>At various points during my application's w...,2,"['iphone', 'cocoa', 'cocoa-touch', 'memory-man..."
9996,3528990,2010-08-20T07:36:03Z,0,WP theme not displaying properly in FF,<p>I've been playing around with themes on a s...,1,"['wordpress', 'wordpress-theming']"
9997,28974670,2015-03-10T21:44:13Z,0,Get records where join table ids matches or is...,"<p>I have a table called <em>variants</em>, wh...",1,"['mysql', 'sql', 'ruby-on-rails', 'ruby-on-rai..."
9998,2952060,2010-06-01T17:59:05Z,2,Silverlight File download from client,<p>I have a Silverlight application that imple...,1,"['c#', 'silverlight', 'silverlight-4.0', 'down..."


In [137]:
df.info()
df.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            10000 non-null  int64 
 1   CreationDate  10000 non-null  object
 2   Score         10000 non-null  int64 
 3   Title         10000 non-null  object
 4   Body          10000 non-null  object
 5   AnswerCount   10000 non-null  int64 
 6   Tags          10000 non-null  object
dtypes: int64(3), object(4)
memory usage: 547.0+ KB


,count,mean,std,min,25%,50%,75%,max
Id,10000.0,21336560.0,11511030.0,2120.0,11362385.0,21784640.0,31571720.0,40141970.0
Score,10000.0,1.6096,7.283783,-8.0,0.0,0.0,1.0,299.0
AnswerCount,10000.0,1.5838,1.364978,0.0,1.0,1.0,2.0,23.0


## Data Preprocessing

In [138]:
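# Parse timestamps; unparseable values become NaT instead of being left as strings
# (errors='ignore' is deprecated and would make the .dt accessor fail on bad input)
df['CreationDate'] = pd.to_datetime(df['CreationDate'], errors='coerce')
df['CreationDate'] = df['CreationDate'].dt.date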

In [139]:
df[df['Score'] < -3].head(5)

,Id,CreationDate,Score,Title,Body,AnswerCount,Tags
128,16508800,2013-05-12,-4,Conditionally display a ViewController Xcode -...,<p>I'm very new with Xcode and I would like to...,2,"['ios', 'objective-c', 'authentication']"
146,32821550,2015-09-28,-6,Google places api's in android,<p><strong>Android Google Places APIs</strong>...,1,"['android', 'google-maps', 'google-places-api']"
177,37325750,2016-05-19,-4,App rejected from app store,<p>I have submitted an app on Appstore.</p>\n\...,1,"['ios', 'app-store']"
270,33007250,2015-10-08,-8,How to add two cursor values and UPDATE the table,<p>I want to add two cursor values where <code...,1,"['mysql', 'sql', 'asp.net', 'mysql-workbench',..."
314,29147900,2015-03-19,-5,How to generate a star Triangle pattern using ...,<p>How to print a star triangle pattern using ...,1,['java']


In [140]:
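# Keep questions scoring above -3 (note: rows with Score == -3 are dropped too);
# .copy() avoids SettingWithCopyWarning on the column assignments that follow
df = df[df['Score'] > -3].copy()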

In [141]:
# Concatenate title and body into a single text field
df['full_text'] = df['Title'].astype(str) + ' ' + df['Body'].astype(str)

In [None]:
# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Function to clean and preprocess full_text
def clean_text(text):
    """Strip HTML, collapse whitespace, drop most punctuation, and lowercase."""
    # 1. Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()

    # 2. Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # 3. Drop every character except letters, digits, whitespace, '?' and '.'
    text = re.sub(r"[^a-zA-Z0-9\s?.]", "", text)

    # 4. (Optional) Lowercase ONLY if using uncased BERT
    #    If using 'bert-base-cased', COMMENT OUT the next line
    text = text.lower()

    return text

# Apply to full_text column
df['clean_text'] = df['full_text'].apply(clean_text)

# View the result
df[['full_text', 'clean_text']].head()


,full_text,clean_text
0,Handling the EditText send keyboard event for ...,handling the edittext send keyboard event for ...
1,EditText: how to enable/disable input? <p>I ha...,edittext: how to enable/disable input? i have ...
2,Mobile web - Displaying a fixed div below a re...,mobile web - displaying a fixed div below a re...
3,How to create tabbed view in HTML? <p>I'm tryi...,how to create tabbed view in html? i'm trying ...
4,Problems decrypting HTTP Live Stream <p>I have...,problems decrypting http live stream i have a ...
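
To make the effect of the two regular expressions in `clean_text` concrete, here is a minimal sketch that applies only the whitespace, punctuation, and lowercasing steps to a hand-written string. The HTML-stripping step is left out so the example needs no third-party packages; `normalize` and the sample string are hypothetical names used only for illustration.

```python
import re

def normalize(text):
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Keep only letters, digits, whitespace, '?' and '.'.
    text = re.sub(r"[^a-zA-Z0-9\s?.]", "", text)
    return text.lower()

sample = "EditText:  how to enable/disable   input? I'm using API v2.1"
result = normalize(sample)
print(result)
# edittext how to enabledisable input? im using api v2.1
```

Because removed characters are replaced with nothing rather than a space, `enable/disable` collapses into the single token `enabledisable`. Substituting a space instead would keep such words apart, at the cost of splitting contractions like `I'm`.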


In [144]:
df.drop(columns=['Id','CreationDate', 'Title', 'Body', 'full_text'], inplace=True)

In [145]:
def estimate_difficulty(text):
    """Heuristically label a question Easy, Medium, or Hard from its length."""
    text = str(text)
    word_count = len(text.split())
    # Approximate the sentence count by counting periods (at least 1)
    sentence_count = max(text.count("."), 1)
    avg_sentence_length = word_count / sentence_count

    # Softened thresholds
    if word_count < 40 or (sentence_count <= 2 and avg_sentence_length < 15):
        return "Easy"
    elif word_count < 120:
        return "Medium"
    return "Hard"

# Apply labeling
df["difficulty"] = df["clean_text"].apply(estimate_difficulty)
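
As a quick sanity check of the thresholds, the sketch below restates the heuristic in self-contained form and runs it on three synthetic inputs (the sample strings are made up for illustration).

```python
def estimate_difficulty(text):
    text = str(text)
    word_count = len(text.split())
    # Every '.' counts as a sentence boundary, including those in "v2.1" or "main.py".
    sentence_count = max(text.count("."), 1)
    avg_sentence_length = word_count / sentence_count

    if word_count < 40 or (sentence_count <= 2 and avg_sentence_length < 15):
        return "Easy"
    elif word_count < 120:
        return "Medium"
    return "Hard"

short_q = "how to enable edittext input?"       # 5 words
medium_q = " ".join(["word"] * 60) + "."        # 60 words, one period
long_q = " ".join(["word"] * 150) + "."         # 150 words, one period

labels = [estimate_difficulty(q) for q in (short_q, medium_q, long_q)]
print(labels)
# ['Easy', 'Medium', 'Hard']
```

The label depends almost entirely on word count, so a long question that pastes a large code listing is rated Hard regardless of how simple the underlying problem is.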

In [147]:
nlp = spacy.load("en_core_web_sm")

def detect_intent_spacy(text):
    """Assign a coarse intent label using keyword matches on the lemmatized text."""
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc]

    text_lemma = " ".join(tokens)

    # Checks run in priority order and use substring matching; the first hit wins
    if any(kw in text_lemma for kw in ["how do i", "how to", "how can i", "step by step", "ways to", "implement"]):
        return "how-to"
    elif any(kw in text_lemma for kw in ["error", "exception", "debug", "issue", "problem", "crash", "bug"]):
        return "debugging"
    elif any(kw in text_lemma for kw in ["difference", "compare", "versus", "vs", "comparison", "better than"]):
        return "comparison"
    elif any(kw in text_lemma for kw in ["define", "explain", "what", "why", "describe", "concept"]):
        return "concept"
    elif any(kw in text_lemma for kw in ["optimize", "improve", "reduce", "faster", "efficient", "increase"]):
        return "optimization"

    # Fallback on the first token; the length check avoids an IndexError on empty text
    if len(doc) > 0 and doc[0].lemma_ in ["what", "how", "why", "can", "is", "should"]:
        return "concept"

    return "other"

df["intent"] = df["clean_text"].apply(detect_intent_spacy)
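
The keyword checks in `detect_intent_spacy` use Python's `in` operator on the joined lemma string, which matches substrings rather than whole words. The sketch below contrasts that with a word-boundary regex. This is a possible alternative, not what the notebook does, and `substring_hit` and `whole_word_hit` are hypothetical helper names.

```python
import re

KEYWORDS = ["what", "why", "vs"]

def substring_hit(text, keywords):
    # Same style of test as the notebook: true if any keyword occurs anywhere.
    return any(kw in text for kw in keywords)

def whole_word_hit(text, keywords):
    # Only match keywords that stand alone as words (or whole phrases).
    pattern = r"\b(?:" + "|".join(re.escape(kw) for kw in keywords) + r")\b"
    return re.search(pattern, text) is not None

noisy = "somewhat slow canvas rendering on tvs"
clear = "what is the difference vs lists"

print(substring_hit(noisy, KEYWORDS), whole_word_hit(noisy, KEYWORDS))
# True False
print(substring_hit(clear, KEYWORDS), whole_word_hit(clear, KEYWORDS))
# True True
```

In the first string, "what" is found inside "somewhat" and "vs" inside "tvs", so substring matching reports a hit even though neither keyword appears as a word.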

In [98]:
def score_bucket(score):
    if score < 0:
        return "low"
    elif score == 0:
        return "neutral"
    elif score <= 5:
        return "medium"
    else:
        return "high"

df["score_label"] = df["Score"].apply(score_bucket)

In [148]:
df['difficulty'].value_counts()

difficulty
Hard      4929
Medium    4495
Easy       452
Name: count, dtype: int64

In [149]:
df['intent'].value_counts()

intent
debugging       3141
how-to          2858
other           2003
concept         1547
comparison       251
optimization      76
Name: count, dtype: int64

In [85]:
from sklearn.preprocessing import MultiLabelBinarizer

def safe_parse_tags(x):
    """Return a list of lowercase tag strings, or [] if x cannot be parsed."""
    # Case 1: Already a list (checked first: pd.isna on a list returns an array,
    # which cannot be used directly in an `if`)
    if isinstance(x, list):
        return [tag.strip().lower() for tag in x if isinstance(tag, str)]

    # Case 2: NaN or None
    if pd.isna(x):
        return []

    # Case 3: Try parsing stringified list
    try:
        parsed = ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return []
    if isinstance(parsed, list):
        return [tag.strip().lower() for tag in parsed if isinstance(tag, str)]
    return []
    
df["Tags"] = df["Tags"].apply(safe_parse_tags)
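
The `Tags` column is read back from CSV as strings such as `"['android', 'events']"`, which is why `ast.literal_eval` is needed. A minimal sketch of both the successful and the failing case, using made-up input strings:

```python
import ast

# A well-formed stringified list parses into a real Python list.
raw = "['Android', ' events ', 'android-edittext']"
parsed = ast.literal_eval(raw)
tags = [t.strip().lower() for t in parsed if isinstance(t, str)]
print(tags)
# ['android', 'events', 'android-edittext']

# A malformed string (missing closing bracket) raises instead of returning a list.
try:
    ast.literal_eval("['android', 'events'")
    failed = False
except (ValueError, SyntaxError):
    failed = True
print(failed)
# True
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it does not execute arbitrary code embedded in the CSV.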


In [87]:
# Drop rows where Tags list is empty
df = df[df["Tags"].apply(lambda tags: len(tags) > 0)].reset_index(drop=True)
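
`MultiLabelBinarizer` is imported above but not used in this part of the notebook. As a hedged sketch of how parsed tag lists could be turned into a multi-label target matrix, assuming scikit-learn is installed and using made-up tag lists rather than the real dataset:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical parsed tag lists, one per question.
tag_lists = [["android", "events"], ["android"], ["html", "css"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_lists)  # one row per question, one column per tag

print(list(mlb.classes_))
# ['android', 'css', 'events', 'html']
print(y.tolist())
# [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
```

Each column corresponds to one tag in `mlb.classes_` (sorted alphabetically), and a 1 marks that the question carries that tag. Rows with an empty tag list would be all zeros, which is one reason to drop them beforehand.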

In [90]:
df["intent"] = df["intent"].str.strip().str.lower()

In [151]:
df.to_csv('cleaned_dataset_with_preprocessing.csv', index=False)