# Read me

# Ticket Classification using Machine Learning

This project builds machine learning models that automatically classify support tickets by issue type and urgency level. The goal is to help companies route and manage customer support more efficiently.

---

## 🔧 Key Design Choices

### 1. **Data Preprocessing**

* **Text Cleaning**: We lowercased the text, removed digits and punctuation, and dropped stopwords (such as "and" and "the").
* **Tokenization**: The text was split into individual words with spaCy, and each word was reduced to its lemma.
* **Vectorization**: We used TF-IDF (Term Frequency - Inverse Document Frequency), limited to the 300 most frequent terms, to convert the text into a numerical format that a machine learning model can use. Ticket length and sentiment polarity were added as extra features.

### 2. **Model Selection**

* We tested a few models and found **Logistic Regression** worked best for this type of text classification.
* We chose it because it's fast, easy to interpret, and performs well on text data.

### 3. **Train-Test Split**

* We split the data into training and testing sets (80% for training, 20% for testing) to check how well the model works on unseen data.
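
A minimal sketch of an 80/20 split is shown below. The labels are invented, and the `stratify` argument is an addition that the notebook itself does not use; it keeps the class proportions the same in both parts, which may help when some categories are rare.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 tickets with an uneven label distribution.
texts = [f"ticket {i}" for i in range(100)]
labels = ["Medium"] * 60 + ["High"] * 30 + ["Low"] * 10

# test_size=0.2 holds out 20% of the rows; stratify preserves label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
print(Counter(y_test))
```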



##  Model Evaluation

We used the following metrics:

* **Accuracy**: Tells us how many tickets were classified correctly. E.g., 85% accuracy means 85 out of 100 tickets were correct.
* **Precision**: Measures how many of the predicted tickets in a category were actually right.
* **Recall**: Measures how many of the actual tickets in a category were correctly found.
* **F1-Score**: The harmonic mean of precision and recall. Useful when categories are imbalanced.

We used `classification_report` from `sklearn` to print all these metrics.
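
To make the definitions concrete, the sketch below recomputes the three per-class metrics from raw counts. The counts for General Inquiry are inferred from the issue-type report later in this document (support 29 with recall 1.00, and 14 misclassified test tickets that can only have been predicted as General Inquiry because every other class has precision 1.00); they are not printed by the notebook itself. The helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Inferred counts for "General Inquiry": 29 true positives,
# 14 false positives, 0 false negatives.
p, r, f1 = precision_recall_f1(tp=29, fp=14, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=1.00 f1=0.81
```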



##  Limitations

* **Small Dataset**: If the dataset is too small, the model might not learn well.
* **Imbalanced Categories**: If some ticket types appear more than others, the model may become biased.
* **Generalization**: The model may not work well on data that is very different from the training data.

##  Folder Structure (if applicable)

```
project/
├── ticket_classifier.py
├── model.pkl
├── vectorizer.pkl
├── test_data.csv
└── README.md
```
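
The notebook does not show how `model.pkl` and `vectorizer.pkl` are produced. One plausible approach, sketched below with `pickle` and a toy model, is to save the fitted vectorizer and classifier separately and load both at prediction time. The training texts are invented and the files are written to a temporary directory. The models trained in the notebook also use ticket length and sentiment features, so a saved vectorizer alone would not rebuild their full input.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical); the notebook trains on the ticket dataset.
texts = ["payment charged twice", "refund my payment",
         "box arrived late", "delivery is late"]
labels = ["Billing Problem", "Billing Problem", "Late Delivery", "Late Delivery"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    vectorizer_path = Path(tmp) / "vectorizer.pkl"
    # Save both fitted objects.
    model_path.write_bytes(pickle.dumps(model))
    vectorizer_path.write_bytes(pickle.dumps(vectorizer))
    # Load them back, as a separate prediction script would.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_vectorizer = pickle.loads(vectorizer_path.read_bytes())

new_ticket = ["my delivery is late"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_ticket))
print(prediction[0])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.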

---

##  Future Improvements

* Try deep learning models like BERT for better performance.
* Collect more data from real-world tickets.
* Add more preprocessing steps like spell correction or named entity recognition.



## Step 1: Loading the Data

In [5]:
import pandas as pd

file_path = r"C:\Users\Dikshant\Downloads\ai_dev_assignment_tickets_complex_1000.xls"
df = pd.read_excel(file_path)

print("Shape of dataset:", df.shape)
df.head()



Shape of dataset: (1000, 5)


|   | ticket_id | ticket_text | issue_type | urgency_level | product |
|---|---|---|---|---|---|
| 0 | 1 | Payment issue for my SmartWatch V2. I was unde... | Billing Problem | Medium | SmartWatch V2 |
| 1 | 2 | Can you tell me more about the UltraClean Vacu... | General Inquiry | NaN | UltraClean Vacuum |
| 2 | 3 | I ordered SoundWave 300 but got EcoBreeze AC i... | Wrong Item | Medium | SoundWave 300 |
| 3 | 4 | Facing installation issue with PhotoSnap Cam. ... | Installation Issue | Low | PhotoSnap Cam |
| 4 | 5 | Order #30903 for Vision LED TV is 13 days late... | Late Delivery | NaN | Vision LED TV |


##  Step 2: Clean and Preprocess the Data

In [6]:
import numpy as np
import string
import re


print("Missing values: \n", df.isnull().sum())


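# Assign the result back; fillna(inplace=True) on a selected column triggers a pandas FutureWarning
df["urgency_level"] = df["urgency_level"].fillna("Medium")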

def clean_text(text):
    if isinstance(text, str): 
        text = text.lower()
        text = re.sub(r'\d+', '', text)
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = text.strip()
        return text
    else:
        return ""  

df["clean_text"] = df["ticket_text"].apply(clean_text)


df[["ticket_text", "clean_text"]].head()


Missing values:
 ticket_id         0
ticket_text      55
issue_type       76
urgency_level    52
product           0
dtype: int64



|   | ticket_text | clean_text |
|---|---|---|
| 0 | Payment issue for my SmartWatch V2. I was unde... | payment issue for my smartwatch v i was underb... |
| 1 | Can you tell me more about the UltraClean Vacu... | can you tell me more about the ultraclean vacu... |
| 2 | I ordered SoundWave 300 but got EcoBreeze AC i... | i ordered soundwave but got ecobreeze ac inst... |
| 3 | Facing installation issue with PhotoSnap Cam. ... | facing installation issue with photosnap cam s... |
| 4 | Order #30903 for Vision LED TV is 13 days late... | order for vision led tv is days late ordered... |
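
The output shows that removing all digits turns "SmartWatch V2" into "smartwatch v". If model numbers matter for classification, one possible variant (a sketch only, not the notebook's method) drops only numbers that stand alone and keeps digits attached to letters. The function name `clean_text_keep_alnum` is hypothetical. The `token.is_alpha` filter in Step 3 would still discard tokens such as "v2" unless it is relaxed as well.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_keep_alnum(text):
    """Lowercase, strip punctuation, and drop only standalone numbers.

    Digits attached to letters (e.g. "v2") are kept.
    """
    if not isinstance(text, str):
        return ""  # missing tickets become empty strings
    text = text.lower().translate(_PUNCT_TABLE)
    text = re.sub(r"\b\d+\b", " ", text)      # remove numbers that stand alone
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

cleaned = clean_text_keep_alnum("Payment issue for my SmartWatch V2. Order #30903!")
print(cleaned)
# prints: payment issue for my smartwatch v2 order
```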


## Step 3: Tokenization and Lemmatization

In [7]:
import spacy

nlp = spacy.load("en_core_web_sm")


def spacy_preprocess(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return " ".join(tokens)


df["processed_text"] = df["clean_text"].apply(spacy_preprocess)


df[["clean_text", "processed_text"]].head()


|   | clean_text | processed_text |
|---|---|---|
| 0 | payment issue for my smartwatch v i was underb... | payment issue smartwatch v underbilled order |
| 1 | can you tell me more about the ultraclean vacu... | tell ultraclean vacuum warranty available white |
| 2 | i ordered soundwave but got ecobreeze ac inst... | order soundwave get ecobreeze ac instead order... |
| 3 | facing installation issue with photosnap cam s... | face installation issue photosnap cam setup fa... |
| 4 | order for vision led tv is days late ordered... | order vision lead tv day late order march cont... |


## Step 4: Feature Engineering


In [8]:
import sys
!{sys.executable} -m pip install scikit-learn




In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 300 most frequent terms and weight them by TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=300)
tfidf_features = tfidf_vectorizer.fit_transform(df["processed_text"])

# Convert the sparse matrix to a DataFrame with one column per term
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())


In [11]:
# Length of original text (in words)
df["ticket_length"] = df["ticket_text"].apply(lambda x: len(str(x).split()))


In [12]:
import sys
!{sys.executable} -m pip install textblob




In [13]:
from textblob import TextBlob


df["processed_text"] = df["processed_text"].astype(str)


df["polarity"] = df["processed_text"].apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df["processed_text"].apply(lambda x: TextBlob(x).sentiment.subjectivity)


In [14]:
print(df.columns)


Index(['ticket_id', 'ticket_text', 'issue_type', 'urgency_level', 'product',
       'clean_text', 'processed_text', 'ticket_length', 'polarity',
       'subjectivity'],
      dtype='object')


In [15]:
# "sentiment" holds the same values as the "polarity" column computed above, so reuse it
df["sentiment"] = df["polarity"]


In [17]:

final_features = pd.concat([tfidf_df, df[["ticket_length", "sentiment"]].reset_index(drop=True)], axis=1)


issue_type = df["issue_type"]
urgency_level = df["urgency_level"]


## Step 5: Machine Learning Models

We will build two models, one for each target:

* `issue_type`
* `urgency_level`

In [18]:
from sklearn.model_selection import train_test_split


X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(final_features, issue_type, test_size=0.2, random_state=42)


X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(final_features, urgency_level, test_size=0.2, random_state=42)


In [19]:
print(df["issue_type"].isna().sum())


76


In [20]:
print(df["urgency_level"].isna().sum())


0


In [21]:
df = df.dropna(subset=["issue_type", "urgency_level"])


In [22]:

df_cleaned = df.dropna(subset=["issue_type", "urgency_level"])


In [23]:

tfidf_features = tfidf_vectorizer.fit_transform(df_cleaned["processed_text"])
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

final_features = pd.concat(
    [tfidf_df, df_cleaned[["ticket_length", "sentiment"]].reset_index(drop=True)],
    axis=1
)


In [24]:
y_issue = df_cleaned["issue_type"].reset_index(drop=True)
y_urgency = df_cleaned["urgency_level"].reset_index(drop=True)


In [25]:
from sklearn.model_selection import train_test_split

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(final_features, y_issue, test_size=0.2, random_state=42)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(final_features, y_urgency, test_size=0.2, random_state=42)


In [26]:
from sklearn.linear_model import LogisticRegression

# issue type
model_issue = LogisticRegression(max_iter=1000)
model_issue.fit(X_train_1, y_train_1)

# urgency_level
model_urgency = LogisticRegression(max_iter=1000)
model_urgency.fit(X_train_2, y_train_2)


In [27]:
from sklearn.metrics import classification_report, accuracy_score


In [28]:
# Predict issue_type
y_pred_issue = model_issue.predict(X_test_1)

# Predict urgency_level
y_pred_urgency = model_urgency.predict(X_test_2)


In [29]:
print("=== Issue Type Classification Report ===")
print(classification_report(y_test_1, y_pred_issue))
print("Accuracy:", accuracy_score(y_test_1, y_pred_issue))

print("\n=== Urgency Level Classification Report ===")
print(classification_report(y_test_2, y_pred_urgency))
print("Accuracy:", accuracy_score(y_test_2, y_pred_urgency))


=== Issue Type Classification Report ===
                    precision    recall  f1-score   support

    Account Access       1.00      0.92      0.96        24
   Billing Problem       1.00      0.92      0.96        26
   General Inquiry       0.67      1.00      0.81        29
Installation Issue       1.00      0.94      0.97        31
     Late Delivery       1.00      0.95      0.98        22
    Product Defect       1.00      0.90      0.95        31
        Wrong Item       1.00      0.82      0.90        22

          accuracy                           0.92       185
         macro avg       0.95      0.92      0.93       185
      weighted avg       0.95      0.92      0.93       185

Accuracy: 0.9243243243243243

=== Urgency Level Classification Report ===
              precision    recall  f1-score   support

        High       0.37      0.27      0.31        59
         Low       0.32      0.25      0.28        51
      Medium       0.44      0.59      0.50        75

    

In [30]:
print(df.columns)


Index(['ticket_id', 'ticket_text', 'issue_type', 'urgency_level', 'product',
       'clean_text', 'processed_text', 'ticket_length', 'polarity',
       'subjectivity', 'sentiment'],
      dtype='object')


In [31]:
y_urgency = df["urgency_level"]


In [33]:
print(y_urgency.value_counts())

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score


model_urgency = RandomForestClassifier(n_estimators=100, random_state=42)
model_urgency.fit(X_train_2, y_train_2)

# Predicting on test set
y_pred_urgency = model_urgency.predict(X_test_2)

# Checking how the model performs now
print("=== Urgency Level Classification Report ===")
print(classification_report(y_test_2, y_pred_urgency))
print("Accuracy:", accuracy_score(y_test_2, y_pred_urgency))


urgency_level
Medium    345
High      303
Low       276
Name: count, dtype: int64
=== Urgency Level Classification Report ===
              precision    recall  f1-score   support

        High       0.29      0.20      0.24        59
         Low       0.33      0.27      0.30        51
      Medium       0.44      0.59      0.50        75

    accuracy                           0.38       185
   macro avg       0.35      0.35      0.35       185
weighted avg       0.36      0.38      0.36       185

Accuracy: 0.3783783783783784
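
A quick sanity check for a score like this is to compare it with a majority-class baseline. The sketch below uses only the per-class support values from the report above; the variable names are illustrative.

```python
# Test-set support per class, read from the classification report above.
support = {"High": 59, "Low": 51, "Medium": 75}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# Random Forest accuracy as reported above.
reported_accuracy = 0.3783783783783784

print(f"Always predicting {majority_class!r}: {baseline_accuracy:.3f}")
print(f"Gap to baseline: {reported_accuracy - baseline_accuracy:+.3f}")
# prints:
# Always predicting 'Medium': 0.405
# Gap to baseline: -0.027
```

On this test set the Random Forest scores slightly below the trivial baseline, which suggests the current features carry little signal about urgency.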


## Final Summary & Conclusion

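In this project, we developed two classification models to predict:

* **Issue Type** (multi-class classification, seven categories)
* **Urgency Level** (High, Medium, Low)

The **issue type** model performed well, with about **92% accuracy** on the 185-ticket test set, which suggests that issue categories are easy to separate using the text features and sentiment. Its only weak spot is **General Inquiry**, where precision falls to 0.67.

The **urgency level** model performed poorly (~38% accuracy with Random Forest). That is below the roughly 41% that always predicting Medium would score on the same test set (75 of 185 tickets). This is likely because:

* Urgency is less explicit in the ticket text than the issue type is.
* The classes are mildly imbalanced (345 Medium, 303 High, 276 Low), and the 52 missing urgency labels were filled with "Medium", which may add label noise.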

### Improvements (Future Work)

* Use **SMOTE or class weights** to handle imbalance
* Try **XGBoost or SVM** for better generalization
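
As a sketch of the class-weight idea, the snippet below computes weights with the same heuristic scikit-learn applies for `class_weight="balanced"`, that is `n_samples / (n_classes * count)`. The counts come from the `value_counts()` output in Step 5. In practice the weights would be computed on the training labels only, so the figures here are illustrative.

```python
# Label counts from the notebook's value_counts() output.
counts = {"Medium": 345, "High": 303, "Low": 276}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: rarer classes get proportionally larger weights.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weight.items():
    print(f"{label}: {weight:.3f}")
# prints:
# Medium: 0.893
# High: 1.017
# Low: 1.116
```

A dictionary like this can be passed as `class_weight=` to `LogisticRegression` or `RandomForestClassifier`, or the string `"balanced"` can be used directly. The weights all sit close to 1, which hints that reweighting alone may not change the urgency results much.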
