<a href="https://colab.research.google.com/github/dhanu902/FoodieChat-Bot/blob/main/BOT_MODEL_IntentClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Load

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

df_train = pd.read_csv('/content/drive/MyDrive/ChatBot/Preprocessed/intent_train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/ChatBot/Preprocessed/intent_test.csv')
df_val = pd.read_csv('/content/drive/MyDrive/ChatBot/Preprocessed/intent_val.csv')

## Text Vectorization

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# ---- check missing value count and safe drop ----
print("NaN in train:", df_train["text"].isna().sum())
print("NaN in test:", df_test["text"].isna().sum())
print("NaN in val:", df_val["text"].isna().sum())

print(df_train[df_train['text'].isna()])
print(df_test[df_test['text'].isna()])
print(df_val[df_val['text'].isna()])

df_train.dropna(subset=['text'], inplace=True)
df_test.dropna(subset=['text'], inplace=True)
df_val.dropna(subset=['text'], inplace=True)

NaN in train: 63
NaN in test: 15
NaN in val: 7
      text  label
625    NaN      8
1854   NaN      3
3077   NaN      1
3516   NaN      9
4855   NaN      8
...    ...    ...
37263  NaN      1
37389  NaN      1
37455  NaN      1
37995  NaN      3
38166  NaN      9

[63 rows x 2 columns]
     text  label
639   NaN      1
2608  NaN      6
2884  NaN      1
2919  NaN      1
3180  NaN      1
3197  NaN      8
3373  NaN      1
4041  NaN      1
4773  NaN      8
4827  NaN      1
5287  NaN      1
5768  NaN      1
6259  NaN      6
6573  NaN      1
6979  NaN      1
     text  label
584   NaN      1
1260  NaN      1
3083  NaN      9
3781  NaN      9
5596  NaN      1
5938  NaN     10
8218  NaN      0


In [4]:
X_train = vectorizer.fit_transform(df_train['text'])
X_test = vectorizer.transform(df_test['text'])
X_val = vectorizer.transform(df_val['text'])

---

#### ➡️ Why Vectorize Text Data?

We **convert text data into numeric feature vectors** so that machine learning models can understand and learn from them.

---

#### 🧠 Example: Raw vs. Vectorized

| **Stage**                                            | **Representation**                                 |
| ---------------------------------------------------- | -------------------------------------------------- |
| Original Sentence                                    | `"I love this chatbot"`                            |
| Before Vectorization (Raw)                           | `"I love this chatbot"` *(String/Text)*            |
| After Vectorization (e.g. CountVectorizer or TF-IDF) | Counts such as `[0, 1, 0, 2, 1, 0, ...]`, or TF-IDF weights such as `[0.0, 0.41, 0.0, 0.82, ...]` *(one row of a sparse numeric matrix)* |

* Each number in the vector:

  * For **CountVectorizer**: Represents how often a word appears in the sentence.
  * For **TF-IDF**: Represents how **important** a word is based on frequency and rarity across documents.

---



#### 🔄 Difference Between `fit_transform()` and `transform()`

##### ✅ Summary Table

| Method            | Description                                                             |
| ----------------- | ----------------------------------------------------------------------- |
| `fit_transform()` | Learns from the data (e.g., vocabulary, IDF) **and then transforms** it |
| `transform()`     | **Only transforms** the data using the information learned earlier      |

---

##### 📘 Example: TF-IDF Vectorization

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Training data
train_texts = ["I love chatbots", "Chatbots are helpful"]

# Fit and transform on training data
X_train = vectorizer.fit_transform(train_texts)
```

* `fit_transform()`:

  * Learns the vocabulary (here `["are", "chatbots", "helpful", "love"]`; the default tokenizer lowercases text and drops one-character tokens such as "I")
  * Computes the IDF values
  * Converts the text to a vector representation

```python
# Test data
test_texts = ["chatbots are awesome"]

# Transform using the existing vocabulary
X_test = vectorizer.transform(test_texts)
```

* `transform()`:

  * **Does not re-learn** anything
  * Uses the **existing vocabulary** from training
  * Ignores new words not seen during fitting (e.g., "awesome")

---

##### ❗ Why Not Use `fit_transform()` on Test Data?

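Using `fit_transform()` on test data will:

* Learn a **new vocabulary** and new IDF weights from the test set
* Produce feature columns that no longer line up with the columns the model was trained on
* Let test-set statistics shape the features (**data leakage**), which makes the evaluation unreliable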

---

##### 📌 Best Practice

| Dataset Type   | Use `fit_transform()`? | Use `transform()`? |
| -------------- | ---------------------- | ------------------ |
| **Training**   | ✅ Yes                  | ✅ Yes (if reusing) |
| **Validation** | ❌ No                   | ✅ Yes              |
| **Test**       | ❌ No                   | ✅ Yes              |

---


## Encode Label

In [5]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(df_train['label'])
y_test = label_encoder.transform(df_test['label'])
y_val = label_encoder.transform(df_val['label'])

---

#### 🎯 Why Use `LabelEncoder` Instead of a Vectorizer for Labels?

##### ✅ Short Answer:

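* **`LabelEncoder`** is used for **target labels** (like `"positive"`, `"negative"`).
* A **vectorizer** (`CountVectorizer`, `TfidfVectorizer`) is used for **text features** (like the review content).
* We use `LabelEncoder` to convert **categorical labels into integer class codes** (`0` to `n_classes - 1`), a compact target format that classifiers and metrics handle consistently.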

---

#### 🔄 What Does `LabelEncoder` Do?

Given labels like:

```python
df_train['label'] = ['positive', 'negative', 'positive', 'neutral']
```

Running:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(df_train['label'])
```

You get:

```
['positive', 'negative', 'positive', 'neutral'] → [2, 0, 2, 1]
```

It assigns a unique integer to each category:

* `'negative'` → 0
* `'neutral'` → 1
* `'positive'` → 2
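The mapping can be read back from the fitted encoder. Here is a minimal sketch using the same four example labels; the unseen label `"angry"` is a hypothetical value added only to show the failure mode.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "positive", "neutral"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoded)            # integer code per sample
print(encoder.classes_)   # sorted unique labels; position in this array is the code

# Map predictions back to the original label names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))

# transform() rejects labels that were not seen during fit
try:
    encoder.transform(["angry"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)
```

This is why the encoder is fitted on the training labels and only applied to the validation and test labels: any label missing from the training split surfaces as an error instead of silently receiving a new code.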

---

#### 🧠 Why Not Use `Vectorizer` for Labels?

* Vectorizers such as `CountVectorizer` or `TfidfVectorizer` are designed for **text input features**, not labels.
* They produce a **sparse matrix** of word counts or weights, whereas a classifier expects one class value per sample as its target.

---

#### ✅ Correct Workflow:

| Component     | Tool/Method Used                      | Purpose                                 |
| ------------- | ------------------------------------- | --------------------------------------- |
| **Text Data** | `CountVectorizer` / `TfidfVectorizer` | Convert text → numeric features         |
| **Labels**    | `LabelEncoder`                        | Convert category → numeric class labels |

---


## Model parameters

In [65]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from itertools import product
from sklearn.exceptions import ConvergenceWarning
import warnings

# Ignore convergence warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# === Define hyperparameter grid ===
param_grid = {
    'C': [0.01, 0.1, 1.0, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
    'class_weight': [None, 'balanced'],
    'max_iter': [500, 1000],
    'multi_class': ['ovr', 'multinomial']
}

# === Generate combinations of parameters ===
grid = list(product(
    param_grid['C'],
    param_grid['penalty'],
    param_grid['solver'],
    param_grid['class_weight'],
    param_grid['max_iter'],
    param_grid['multi_class']
))

# === Output file ===
results_file = "logistic_trials_summary.csv"
results_all = []



Skipping config due to error: Solver liblinear does not support a multinomial backend.
[the message above is printed 16 times in total]
Saved all results to logistic_trials_summary.csv




## Train Model, Validate & Test

In [None]:
# === Main loop ===
for C, penalty, solver, class_weight, max_iter, multi_class in grid:
    # Check for valid solver-penalty combinations
    if penalty == 'l1' and solver not in ['liblinear', 'saga']:
        continue
    if penalty == 'elasticnet' and solver != 'saga':
        continue
    if penalty == 'none' and solver not in ['lbfgs', 'saga', 'sag', 'newton-cg']:
        continue

    try:
        model = LogisticRegression(
            C=C,
            penalty=penalty,
            solver=solver,
            class_weight=class_weight,
            max_iter=max_iter,
            multi_class=multi_class,
            random_state=42
        )

        model.fit(X_train, y_train)

        # === Validation ===
        y_val_pred = model.predict(X_val)
        val_acc = accuracy_score(y_val, y_val_pred)
        val_precision, val_recall, val_f1, _ = precision_recall_fscore_support(
            y_val, y_val_pred, average='weighted', zero_division=0
        )

        # === Test ===
        y_test_pred = model.predict(X_test)
        test_acc = accuracy_score(y_test, y_test_pred)

        # === Log results ===
        result_row = {
            'C': C,
            'penalty': penalty,
            'solver': solver,
            'class_weight': class_weight,
            'max_iter': max_iter,
            'multi_class': multi_class,
            'val_accuracy': round(val_acc, 4),
            'val_precision': round(val_precision, 4),
            'val_recall': round(val_recall, 4),
            'val_f1_score': round(val_f1, 4),
            'test_accuracy': round(test_acc, 4)
        }

        results_all.append(result_row)

    except Exception as e:
        print(f"Skipping config due to error: {e}")
        continue
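
The "Skipping config" messages come from `liblinear` being paired with `multi_class='multinomial'`, a combination the checks at the top of the loop do not cover. One option, sketched below under the assumption that the same `param_grid` is used, is to filter those combinations out before training. The helper name `is_supported` is hypothetical.

```python
from itertools import product

param_grid = {
    'C': [0.01, 0.1, 1.0, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
    'class_weight': [None, 'balanced'],
    'max_iter': [500, 1000],
    'multi_class': ['ovr', 'multinomial'],
}

def is_supported(config):
    """Hypothetical helper: reject the combination reported in the error output."""
    return not (config['solver'] == 'liblinear'
                and config['multi_class'] == 'multinomial')

# Build every combination as a dict so parameters are addressed by name
keys = list(param_grid)
configs = [dict(zip(keys, values)) for values in product(*param_grid.values())]
valid_configs = [config for config in configs if is_supported(config)]

print(len(configs), len(valid_configs))
```

Filtering up front keeps the `try`/`except` for genuinely unexpected failures and makes the number of trained models explicit.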

---

#### 🔍 Parameters of `LogisticRegression` Used Here

##### ✅ Parameters Explained

| Parameter      | Value        | Description                                                                                                                        |
| -------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------- |
| `max_iter`     | `1000`       | Maximum number of iterations for the solver to converge. Useful for larger datasets or sparse input.                               |
| `class_weight` | `'balanced'` | Automatically adjusts weights **inversely proportional to class frequencies** in the input data. Helps handle **class imbalance**. |

---

##### 📌 Optional Common Parameters You Can Also Use

| **Parameter**       | **Possible Values**                                        | **Description**                                                                |
| ------------------- | ---------------------------------------------------------- | ------------------------------------------------------------------------------ |
| `penalty`           | `'l1'`, `'l2'`, `'elasticnet'`, `'none'`                   | Regularization type. `'l2'` is default. Use `'elasticnet'` with `saga` solver. |
| `dual`              | `True`, `False`                                            | Use dual formulation (only for `'l2'` penalty with `liblinear` solver).        |
| `tol`               | `float` (e.g., `1e-4`)                                     | Tolerance for stopping criteria. Smaller = more precise.                       |
| `C`                 | `float` (e.g., `1.0`, `0.1`, `10`)                         | Inverse of regularization strength. Smaller = stronger regularization.         |
| `fit_intercept`     | `True`, `False`                                            | Whether to add an intercept (bias) to the model.                               |
| `intercept_scaling` | `float` (e.g., `1`)                                        | Only used when `solver='liblinear'` and `fit_intercept=True`.                  |
| `class_weight`      | `None`, `'balanced'`, or `dict`                            | Handles class imbalance. `'balanced'` adjusts weights automatically.           |
| `random_state`      | `int`, `None`                                              | Seed for reproducibility.                                                      |
| `solver`            | `'liblinear'`, `'lbfgs'`, `'newton-cg'`, `'sag'`, `'saga'` | Optimization algorithm. Must match `penalty` type.                             |
| `max_iter`          | `int` (e.g., `100`, `1000`)                                | Maximum number of iterations for solver to converge.                           |
| `multi_class`       | `'auto'`, `'ovr'`, `'multinomial'`                         | Strategy for multi-class classification. `'auto'` uses `'ovr'` for binary data or `solver='liblinear'`, otherwise `'multinomial'`. Deprecated since scikit-learn 1.5. |
| `verbose`           | `int` (e.g., `0`, `1`, `2`)                                | For printing progress during training. Mostly for debugging.                   |
| `warm_start`        | `True`, `False`                                            | Reuse previous fit's solution to speed up convergence.                         |
| `n_jobs`            | `int` (e.g., `-1` for all cores)                           | CPU cores used to parallelize over classes when `multi_class='ovr'`; ignored by `liblinear`. |
| `l1_ratio`          | `float` (0 to 1)                                           | ElasticNet mixing (only used if `penalty='elasticnet'`).                       |

---

##### 🧠 Example for Imbalanced Sentiment Data

If you have many `"positive"` but few `"negative"` samples, setting:

```python
class_weight='balanced'
```

automatically calculates weights like:

```python
weight = total_samples / (n_classes * count_of_class)
```

This helps the model not be biased toward the majority class.
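
The formula can be checked against scikit-learn's own helper. This is a minimal sketch with a made-up label array of 8 samples in class `0` and 2 in class `1`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 samples of class 0, 2 samples of class 1
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out: total_samples / (n_classes * count_of_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(np.allclose(weights, manual))
```

The minority class ends up with a weight four times larger than the majority class, matching the 8:2 ratio of the counts.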

---

##### 🔎 View Model Parameters After Training

To inspect the learned weights:

```python
model.coef_     # weights for each feature
model.intercept_  # bias term
```

---

## Save to File

In [None]:
# === Save results to CSV ===
try:
    df_existing = pd.read_csv(results_file)
    df_results = pd.concat([df_existing, pd.DataFrame(results_all)], ignore_index=True)
except FileNotFoundError:
    df_results = pd.DataFrame(results_all)

# Remove duplicates based on key params
df_results.drop_duplicates(subset=['C', 'penalty', 'solver', 'class_weight', 'max_iter', 'multi_class'], inplace=True)

df_results.to_csv(results_file, index=False)
print("Saved all results to", results_file)

In [66]:
df_results = pd.read_csv(results_file)
df_results

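| Unnamed: 0 | C | penalty | solver | class_weight | max_iter | val_accuracy | val_precision | val_recall | val_f1_score | test_accuracy | multi_class |
| ---------- | ---- | ------- | ------ | ------------ | -------- | ------------ | ------------- | ---------- | ------------ | ------------- | ----------- |
| 0 | 1.0 | l2 | lbfgs | balanced | 900 | 0.6991 | 0.7285 | 0.6991 | 0.7081 | 0.6933 | |
| 1 | 1.0 | l2 | lbfgs | balanced | 1000 | 0.6991 | 0.7285 | 0.6991 | 0.7081 | 0.6933 | |
| 2 | 1.0 | l2 | lbfgs | balanced | 800 | 0.6991 | 0.7285 | 0.6991 | 0.7081 | 0.6933 | |
| 3 | 1.0 | l2 | lbfgs | balanced | 700 | 0.6991 | 0.7285 | 0.6991 | 0.7081 | 0.6933 | |
| 4 | 0.1 | l2 | lbfgs | balanced | 1000 | 0.6975 | 0.7296 | 0.6975 | 0.7071 | 0.6889 | |
| 5 | 0.2 | l2 | lbfgs | balanced | 1000 | 0.7024 | 0.7335 | 0.7024 | 0.7118 | 0.6921 | |
| 6 | 0.3 | l2 | lbfgs | balanced | 1000 | 0.7045 | 0.7339 | 0.7045 | 0.7134 | 0.6941 | |
| 7 | 0.4 | l2 | lbfgs | balanced | 1000 | 0.7039 | 0.7337 | 0.7039 | 0.713 | 0.6948 | |
| 8 | 0.5 | l2 | lbfgs | balanced | 1000 | 0.7031 | 0.7327 | 0.7031 | 0.7121 | 0.6954 | |
| 9 | 0.01 | l2 | lbfgs | | 500 | 0.6358 | 0.6427 | 0.6358 | 0.6147 | 0.6408 | ovr |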
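
Once the trials are logged, a configuration can be chosen from them. The sketch below assumes selection by validation F1 (the notebook does not state a selection rule) and uses three rows copied from the results above rather than the full CSV.

```python
import pandas as pd

# Three rows copied from the logged results (rows 1, 6 and 9)
trials = pd.DataFrame({
    "C": [1.0, 0.3, 0.01],
    "max_iter": [1000, 1000, 500],
    "val_f1_score": [0.7081, 0.7134, 0.6147],
    "test_accuracy": [0.6933, 0.6941, 0.6408],
})

# Rank by the validation metric only; test accuracy is reported, not optimized
ranked = trials.sort_values("val_f1_score", ascending=False)
best = ranked.iloc[0]

print(best[["C", "max_iter", "val_f1_score"]])
```

Choosing on `val_f1_score` and leaving `test_accuracy` out of the decision keeps the test split as an unbiased final check.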
