# **AutoTagger: Intelligent Question Tagging with Difficulty & Intent Detection**

### Project Overview  
**AutoTagger is an AI-powered NLP system that improves how technical questions are classified on platforms such as Stack Overflow or Quora. It automatically predicts the relevant topic tags, estimates the difficulty level, and detects the user's intent behind the question.**

### Key Features: 
- Tag Prediction: Multi-label classification of questions (e.g., ["Python", "NLP"]) 
- Difficulty Estimation: Predicts whether a question is Easy, Medium, or Hard 
- Intent Detection: Classifies the question’s intent (e.g., How-to, Debugging, Concept Explanation) 
- Similar Questions Retrieval: Uses embeddings to show similar previously answered questions 
- Streamlit Web App: Simple UI with real-time prediction 
- Confidence Scores: Shows how certain the model is 
- Feedback Integration: Lets users correct predictions so the corrections can be used to improve the model
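
Multi-label tag prediction needs each question's tag list turned into a fixed-length 0/1 target vector. The following is a minimal, dependency-free sketch of that encoding, not the project's actual pipeline; the helper names `build_tag_index` and `multi_hot` are hypothetical, and the sample tags are illustrative.

```python
from typing import Dict, List


def build_tag_index(tag_lists: List[List[str]]) -> Dict[str, int]:
    """Map each distinct tag to a column index (sorted for reproducibility)."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(vocab)}


def multi_hot(tags: List[str], tag_index: Dict[str, int]) -> List[int]:
    """Encode one question's tags as a 0/1 vector; unknown tags are skipped."""
    row = [0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:
            row[tag_index[tag]] = 1
    return row


# Illustrative tag lists, one per question
sample_tags = [["android", "events"], ["android"], ["html", "css"]]
tag_index = build_tag_index(sample_tags)
targets = [multi_hot(tags, tag_index) for tags in sample_tags]
```

Each row of `targets` has one column per known tag, so a question can carry any number of labels at once. That property is what separates multi-label from multi-class classification.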

## Import Libraries

In [88]:
import pandas as pd
import numpy as np
import ast

import warnings
warnings.filterwarnings('ignore')

In [48]:
# questions = pd.read_csv('././Dataset/Questions.csv', encoding='ISO-8859-1')
# answers = pd.read_csv('././Dataset/Answers.csv', encoding='ISO-8859-1')
# tags = pd.read_csv('././Dataset/Tags.csv', encoding='ISO-8859-1')

In [49]:
# answer_counts = answers.groupby("ParentId").size().reset_index(name="AnswerCount")

# # Group tags per question
# tag_groups = tags.groupby("Id")["Tag"].apply(list).reset_index(name="Tags")

In [50]:
# # Merge answer count and tags into questions
# questions_merged = questions.merge(answer_counts, left_on="Id", right_on="ParentId", how="left")
# questions_merged = questions_merged.merge(tag_groups, on="Id", how="left")

In [51]:
# # Fill missing answer counts with 0
# questions_merged["AnswerCount"] = questions_merged["AnswerCount"].fillna(0).astype(int)

# # Drop unnecessary columns like 'ParentId'
# questions_merged = questions_merged.drop(columns=['ParentId','OwnerUserId','ClosedDate'], errors="ignore")
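
To make the merge logic above concrete, here is a small standard-library sketch of what the group-by and left-merge steps compute: an answer count per question (defaulting to 0) and a list of tags per question. The three tiny record lists are made-up placeholders, not rows from the real dataset.

```python
from collections import Counter, defaultdict

# Hypothetical miniature versions of Questions, Answers, and Tags
questions = [{"Id": 1, "Title": "q1"}, {"Id": 2, "Title": "q2"}]
answers = [{"Id": 10, "ParentId": 1}, {"Id": 11, "ParentId": 1}]
tags = [{"Id": 1, "Tag": "python"}, {"Id": 1, "Tag": "nlp"}, {"Id": 2, "Tag": "java"}]

# Equivalent of answers.groupby("ParentId").size()
answer_counts = Counter(a["ParentId"] for a in answers)

# Equivalent of tags.groupby("Id")["Tag"].apply(list)
tag_groups = defaultdict(list)
for t in tags:
    tag_groups[t["Id"]].append(t["Tag"])

# Left join: every question is kept; a missing answer count becomes 0.
# (pandas would leave NaN for a question without tags; this sketch uses [].)
merged = [
    {**q, "AnswerCount": answer_counts.get(q["Id"], 0), "Tags": tag_groups.get(q["Id"], [])}
    for q in questions
]
```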

In [52]:
# # Now sample 10k clean records
# df = questions_merged.sample(n=10000, random_state=42).reset_index(drop=True)

In [53]:
# df.to_csv('stack_overflow_dataset.csv', index=False, encoding='utf-8')

In [54]:
df = pd.read_csv('./stack_overflow_dataset.csv')
df

Unnamed: 0,Id,CreationDate,Score,Title,Body,AnswerCount,Tags
0,17016800,2013-06-10T04:15:05Z,0,Handling the EditText send keyboard event for ...,<pre><code>import com.example.methanegaszonege...,1,"['android', 'events', 'android-edittext', 'send']"
1,7685280,2011-10-07T09:20:41Z,7,EditText: how to enable/disable input?,<p>I have a 7x6 grid of EditText views. I want...,7,['android']
2,24178500,2014-06-12T07:13:00Z,1,Mobile web - Displaying a fixed div below a re...,<p>I want to have a relative div at the top of...,0,"['jquery', 'html', 'css', 'iphone', 'mobile']"
3,38820760,2016-08-08T03:10:28Z,0,How to create tabbed view in HTML?,<p>I'm trying to create a tabbed view in HTML ...,4,"['html', 'google-sites']"
4,3674120,2010-09-09T05:53:46Z,0,Problems decrypting HTTP Live Stream,<p>I have a single key encrypted HTTP Live Str...,2,"['http', 'stream', 'openssl', 'live', 'encrypt..."
5,31672540,2015-07-28T09:34:59Z,1,Google Analytics picking up traffic and code n...,<p>Not sure if this is within the realms of st...,0,['google-analytics']
6,21945610,2014-02-21T21:33:06Z,1,Inserting correct x and y values in DB,<p>I have a website with images that you can d...,1,"['javascript', 'php', 'jquery', 'mysql', 'ajax']"
7,38913550,2016-08-12T08:38:34Z,0,Cannot start Webdyn Pro Java application after...,<p>I build and deploy the application to serve...,0,"['java', 'sap', 'webdynpro']"
8,22817700,2014-04-02T16:35:47Z,0,Visual Studio: automatically build multiple so...,<p>I have a number of webapi solutions that ar...,0,['visual-studio-2013']
...,...,...,...,...,...,...,...


In [89]:
df.info()
df.describe().T

<class 'pandas.core.frame.DataFrame'>
Index: 9876 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            9876 non-null   int64 
 1   CreationDate  9876 non-null   object
 2   Score         9876 non-null   int64 
 3   Title         9876 non-null   object
 4   Body          9876 non-null   object
 5   AnswerCount   9876 non-null   int64 
 6   Tags          9876 non-null   object
 7   cleaned_text  9876 non-null   object
dtypes: int64(3), object(5)
memory usage: 694.4+ KB


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,9876.0,21275710.0,11526120.0,2120.0,11297337.5,21712175.0,31532615.0,40141970.0
Score,9876.0,1.678514,7.301772,-2.0,0.0,0.0,1.0,299.0
AnswerCount,9876.0,1.581814,1.367207,0.0,1.0,1.0,2.0,23.0


In [68]:
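# Parse ISO 8601 timestamps; unparseable values become NaT instead of staying as strings
df['CreationDate'] = pd.to_datetime(df['CreationDate'], errors='coerce')
# Keep only the calendar date
df['CreationDate'] = df['CreationDate'].dt.date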
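
For a single value, the date conversion above amounts to the following. This is a standard-library sketch that assumes every timestamp uses the `YYYY-MM-DDTHH:MM:SSZ` layout seen in the sample rows; `parse_creation_date` is a hypothetical helper name.

```python
from datetime import date, datetime
from typing import Optional


def parse_creation_date(value: str) -> Optional[date]:
    """Return the calendar date of an ISO 8601 'Z' timestamp, or None if it is malformed."""
    try:
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").date()
    except (TypeError, ValueError):
        # Malformed or non-string input: signal "missing" instead of raising
        return None


parsed = parse_creation_date("2013-06-10T04:15:05Z")
```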

In [57]:
df[df['Score'] < -3].head(5)

Unnamed: 0,Id,CreationDate,Score,Title,Body,AnswerCount,Tags
128,16508800,2013-05-12,-4,Conditionally display a ViewController Xcode -...,<p>I'm very new with Xcode and I would like to...,2,"['ios', 'objective-c', 'authentication']"
146,32821550,2015-09-28,-6,Google places api's in android,<p><strong>Android Google Places APIs</strong>...,1,"['android', 'google-maps', 'google-places-api']"
177,37325750,2016-05-19,-4,App rejected from app store,<p>I have submitted an app on Appstore.</p>\n\...,1,"['ios', 'app-store']"
270,33007250,2015-10-08,-8,How to add two cursor values and UPDATE the table,<p>I want to add two cursor values where <code...,1,"['mysql', 'sql', 'asp.net', 'mysql-workbench',..."
314,29147900,2015-03-19,-5,How to generate a star Triangle pattern using ...,<p>How to print a star triangle pattern using ...,1,['java']


In [58]:
# Keep questions scoring -2 or higher; copy to avoid chained-assignment issues later
df = df[df['Score'] > -3].copy()

In [59]:
pd.set_option('display.max_rows', None)  # None disables row truncation

In [60]:
df['Body'].value_counts()

Body
<pre><code>import com.example.methanegaszonegeolocater.R;\nimport com.google.android.gms.maps.CameraUpdateFactory;\nimport com.google.android.gms.maps.GoogleMap;\nimport com.google.android.gms.maps.SupportMapFragment;\nimport com.google.android.gms.maps.model.LatLng;\n//import com.google.android.gms.maps.model.MarkerOptions;\nimport android.os.Bundle;\nimport android.support.v4.app.FragmentActivity;\nimport android.graphics.Color;\nimport android.view.Menu;\n\nimport android.widget.EditText;\nimport android.widget.TextView;\nimport android.view.inputmethod.EditorInfo;\nimport android.widget.TextView;\nimport android.view.KeyEvent;                       \n\n@Override\nprotected void onCreate(Bundle savedInstanceState) {\nsuper.onCreate(savedInstanceState);\nsetContentView(R.layout.activity_main);\n\nsetUpMapIfNeeded();\nEditText editText = (EditText) findViewById(R.id.editText1);\n\neditText.setOnEditorActionListener(new OnEditorActionListener() {\n\n@Override\npublic boolean onEdi

In [63]:
from bs4 import BeautifulSoup
import re

def clean_for_bert(text):
    text = str(text)
    
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    
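    # Remove code wrapped in backticks on a single line (the <code> tags were
    # already stripped above, so their contents remain in the text)
    text = re.sub(r'`{1,3}.*?`{1,3}', '', text)
    
    # Remove lines indented by four or more spaces (typically stack traces/logs)
    text = re.sub(r'(?m)^ {4,}.*\n?', '', text)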
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

In [69]:
df['cleaned_text'] = df['Body'].apply(clean_for_bert)
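
Note that `get_text()` keeps the text inside `<pre>` and `<code>` elements, so code snippets still reach the model input. If dropping them is desired, one possible approach is sketched below with the standard-library `html.parser` in place of BeautifulSoup. `ProseExtractor` and `strip_code_blocks` are hypothetical names, and the sample HTML is illustrative.

```python
import re
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Collect text that sits outside <pre> and <code> elements."""

    SKIP = {"pre", "code"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # number of skipped elements currently open
        self.parts = []  # prose fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_code_blocks(html: str) -> str:
    """Return the prose of an HTML fragment with code removed and whitespace normalized."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


sample = "<p>I want to disable <code>EditText</code> input.</p><pre><code>int x = 1;</code></pre>"
prose = strip_code_blocks(sample)
```

Whether to drop code entirely is a modelling choice: identifiers inside snippets can be strong tag signals, so it may be worth comparing both variants.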

In [None]:
def eval_and_clean(val):
    # Convert string to list if needed
    if isinstance(val, str) and val.strip().startswith('['):
        try:
            parsed = ast.literal_eval(val)
        except Exception as e:
            print(f"Error parsing: {val} → {e}")
            return []
    elif isinstance(val, list):
        parsed = val
    else:
        return []

    # Remove nan values from list
    cleaned = [str(tag).strip() for tag in parsed if pd.notna(tag) and str(tag).lower() != 'nan']
    return cleaned

# Parse the Tags column in place
df['Tags'] = df['Tags'].apply(eval_and_clean)
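
The parsing step is needed because list-valued columns come back from CSV as strings. A short sketch of the core idea follows, using three tag strings copied from the sample rows shown earlier. `ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this job.

```python
import ast
from collections import Counter

# Tag values as they appear after pd.read_csv: strings, not lists
raw_tags = [
    "['android', 'events', 'android-edittext', 'send']",
    "['android']",
    "['html', 'google-sites']",
]

# Turn each string back into a real Python list
parsed = [ast.literal_eval(value) for value in raw_tags]

# Count how often each tag occurs across questions
tag_counts = Counter(tag for tags in parsed for tag in tags)
```

A frequency table like `tag_counts` is a quick way to see how skewed the label distribution is before training a multi-label classifier.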

In [93]:
df['full_text'] = df['Title'].astype(str) + ' ' + df['cleaned_text']

In [96]:
df.drop(columns=['Id','CreationDate', 'Title', 'Body', 'cleaned_text'], inplace=True)
