## Introduction
This notebook builds a text classification model that predicts the 'Cluster' label from 'Response' texts. It compares four scikit-learn classifiers (logistic regression, support vector machine, multinomial naive Bayes, and decision tree) on bag-of-words features, and uses grid search with cross-validation to select the model and its hyperparameters.

## Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from joblib import dump

## Load and Prepare Data
Load the data from a JSON file, drop rows with a missing 'Response' or 'Cluster' value, cast the responses to strings, and separate the features (X) from the labels (y).

In [2]:
data = pd.read_json('/content/updated_training_dataset.json')
data.dropna(subset=['Response', 'Cluster'], inplace=True)
data['Response'] = data['Response'].astype(str)

X = data['Response']
y = data['Cluster']
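
To make the cleaning step concrete, here is a minimal sketch on a hypothetical four-row DataFrame (the rows are invented for illustration and are not taken from the training file). It uses the non-inplace form of dropna, which behaves the same way as the cell above, and then counts the remaining labels as a quick check on class balance.

```python
import pandas as pd

# Hypothetical rows standing in for the real JSON file.
toy = pd.DataFrame({
    'Response': ['good service', None, 'slow delivery', 42],
    'Cluster': ['praise', 'praise', None, 'other'],
})

# Keep only rows where both the text and the label are present.
clean = toy.dropna(subset=['Response', 'Cluster']).copy()

# Non-string responses (here the integer 42) become strings.
clean['Response'] = clean['Response'].astype(str)

print(clean['Response'].tolist())                 # ['good service', '42']
print(clean['Cluster'].value_counts().to_dict())  # one row per remaining label
```

Only the first and last rows survive, because each of the other two is missing one of the required columns.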

## Split the Dataset
Hold out 25% of the data as a test set; the remaining 75% is used for training and cross-validation. A fixed random_state makes the split reproducible.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
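
The split above is purely random. If the clusters are imbalanced, one optional variant (not used in this notebook) is to pass stratify so that each class keeps the same proportion in both sets. The sketch below shows the effect on hypothetical toy labels with an 80/20 imbalance.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 samples of class 'a', 20 of class 'b'.
X_toy = [f"text {i}" for i in range(100)]
y_toy = ['a'] * 80 + ['b'] * 20

# stratify=y_toy keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(Counter(y_tr))  # 60 'a' and 15 'b'
print(Counter(y_te))  # 20 'a' and 5 'b'
```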

## Setup Pipeline and GridSearchCV
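Configure a pipeline with a CountVectorizer and a placeholder classifier. Each dictionary in the parameter grid swaps in one classifier through the 'clf' key and lists that classifier's hyperparameters together with the vectorizer settings. GridSearchCV then evaluates every combination with 5-fold cross-validation and refits the best one on the full training set.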

In [4]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())  # Placeholder, will be replaced in grid search
])

param_grid = [
    {'clf': [LogisticRegression()], 'clf__penalty': ['l2'], 'clf__C': [1, 10, 100],
     'vect__max_features': [100, 500], 'vect__ngram_range': [(1, 1), (1, 2)]},
    {'clf': [SVC()], 'clf__kernel': ['linear', 'rbf'], 'clf__C': [0.1, 1, 10],
     'vect__max_features': [100, 500], 'vect__ngram_range': [(1, 1), (1, 2)]},
    {'clf': [MultinomialNB()], 'clf__alpha': [0.1, 1.0, 10.0],
     'vect__max_features': [100, 500], 'vect__ngram_range': [(1, 1), (1, 2)]},
    {'clf': [DecisionTreeClassifier()], 'clf__max_depth': [None, 10, 20], 'clf__min_samples_split': [2, 5],
     'vect__max_features': [100, 500], 'vect__ngram_range': [(1, 1), (1, 2)]}
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
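
The figure of 72 candidates in the log can be verified by counting the combinations in each sub-grid. The sketch below reuses the same value lists; the classifier objects are replaced by name strings used as dictionary keys (an illustrative simplification), since only the number of combinations matters here.

```python
from sklearn.model_selection import ParameterGrid

# Vectorizer settings shared by all four sub-grids (2 x 2 = 4 combinations).
vect = {'vect__max_features': [100, 500], 'vect__ngram_range': [(1, 1), (1, 2)]}

grids = {
    'LogisticRegression': {'clf__penalty': ['l2'], 'clf__C': [1, 10, 100], **vect},
    'SVC': {'clf__kernel': ['linear', 'rbf'], 'clf__C': [0.1, 1, 10], **vect},
    'MultinomialNB': {'clf__alpha': [0.1, 1.0, 10.0], **vect},
    'DecisionTreeClassifier': {'clf__max_depth': [None, 10, 20],
                               'clf__min_samples_split': [2, 5], **vect},
}

# len(ParameterGrid(...)) is the number of parameter combinations in a grid.
counts = {name: len(ParameterGrid(g)) for name, g in grids.items()}
total = sum(counts.values())

print(counts)            # 12, 24, 12 and 24 candidates respectively
print(total, total * 5)  # 72 360
```

With cv=5, each candidate is fitted once per fold, which gives the 360 fits shown in the log.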

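Beyond the single best candidate, the cv_results_ attribute of a fitted GridSearchCV holds the score of every combination, which makes it possible to compare classifier families. The following is a minimal sketch on hypothetical toy texts with a reduced two-classifier grid and cv=2, so the names and data are assumptions rather than the notebook's actual search.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for 'Response' and 'Cluster'.
texts = ["late order", "order delayed", "package late", "delivery delayed",
         "great app", "love the app", "nice design", "great design"]
labels = ["shipping"] * 4 + ["praise"] * 4

pipe = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
grid = [
    {'clf': [LogisticRegression()], 'clf__C': [1, 10]},
    {'clf': [MultinomialNB()], 'clf__alpha': [0.1, 1.0]},
]
search = GridSearchCV(pipe, grid, cv=2).fit(texts, labels)

# One row per candidate; 'param_clf' holds the estimator object that was tried.
results = pd.DataFrame(search.cv_results_)
results['model'] = results['param_clf'].map(lambda est: type(est).__name__)

# Best mean cross-validation score reached by each classifier family.
summary = results.groupby('model')['mean_test_score'].max()
print(summary)
```

Applied to the grid_search object from this notebook, the same grouping would show how close the other three classifiers came to the selected one.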

## Model Evaluation
Display the best parameters and cross-validation score found by GridSearchCV, then evaluate the best model on the held-out test set.

In [5]:
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Score: {:.3f}".format(grid_search.best_score_))
test_accuracy = grid_search.best_estimator_.score(X_test, y_test)
print("Test Set Accuracy: {:.3f}".format(test_accuracy))

Best Parameters: {'clf': SVC(C=1), 'clf__C': 1, 'clf__kernel': 'rbf', 'vect__max_features': 500, 'vect__ngram_range': (1, 1)}
Best Cross-validation Score: 0.986
Test Set Accuracy: 0.977
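
Accuracy is a single number and can hide weak performance on individual clusters. A per-class view is available through confusion_matrix and classification_report from sklearn.metrics. The sketch below runs on hypothetical toy texts and a small linear SVC pipeline, not on the notebook's data; with the real objects, the inputs would be y_test and grid_search.best_estimator_.predict(X_test).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for the 'Response' / 'Cluster' columns.
train_texts = ["refund my order", "order arrived late",
               "love the app", "great app design"]
train_labels = ["shipping", "shipping", "praise", "praise"]
test_texts = ["late order", "great design"]
test_labels = ["shipping", "praise"]

model = Pipeline([('vect', CountVectorizer()), ('clf', SVC(kernel='linear'))])
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(test_labels, pred, labels=["praise", "shipping"])
print(cm)

# Precision, recall and F1 per class; zero_division=0 avoids warnings
# when a class receives no predictions.
print(classification_report(test_labels, pred, zero_division=0))
```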


## Save the Model
Serialize the best-performing pipeline (vectorizer and classifier together) to a joblib file so that it can later be applied directly to raw text for predictions or further analysis.

In [6]:
dump(grid_search.best_estimator_, 'best_text_classifier.joblib')
print("Model saved as 'best_text_classifier.joblib'")

Model saved as 'best_text_classifier.joblib'
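
Because the saved object is the whole pipeline, it can be reloaded with joblib.load and called on raw strings without repeating the vectorization step. The round trip is sketched below with a hypothetical toy pipeline and a temporary file (the file name and texts are assumptions); for the model from this notebook, the path would be 'best_text_classifier.joblib'.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for grid_search.best_estimator_.
texts = ["late order", "order delayed", "great app", "love the app"]
labels = ["shipping", "shipping", "praise", "praise"]
model = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
model.fit(texts, labels)

# Save to a temporary directory and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_text_classifier.joblib')
    dump(model, path)
    restored = load(path)

# The restored pipeline accepts raw text and matches the original model.
new_texts = ["my order is late"]
same = list(restored.predict(new_texts)) == list(model.predict(new_texts))
print(same)  # True
```

A joblib file should only be loaded from a trusted source and, ideally, with the same scikit-learn version that produced it.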
