In [1]:
from modules.constants import SEED, TEST_DIR, TRAIN_DIR
from modules.data_loader import DataLoader
from modules.preprocessor import Preprocessor
from modules.vectorizer import Vectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from typing import Callable, List, Tuple

import numpy as np

## Machine Learning Techniques

This notebook analyzes several machine learning techniques applied to text classification. Specifically, it uses the IMDB dataset, which is split into training data (25k reviews) and test data (25k reviews). The exploratory notebook contains the exploratory analysis of the dataset. The techniques applied are:
- **Linear SVM**: Support Vector Machine with a linear kernel, with tuning of the hyperparameter `C` and the parameter `ngram_range`
- **Naive Bayes**: Naive Bayes with tuning of the hyperparameter `alpha` and the parameter `ngram_range`
- **Logistic Regression**: Logistic Regression with tuning of the hyperparameter `C` and the parameter `ngram_range`

These techniques are applied to the balanced training dataset, preprocessed with the `perform_strong_preprocessing` function of the `Preprocessor` class. This preprocessing removes HTML tags, punctuation, and English stop words. Subsequently, the effect of preprocessing itself is analyzed, since preprocessing does not always yield the best results.

Finally, the obtained results are presented.

In [2]:
data_loader = DataLoader(seed = SEED)
preprocessor = Preprocessor(stopwords_language = 'english')
vectorizer = Vectorizer()

In [3]:
X_train, y_train = data_loader.load_test_data(TRAIN_DIR)
X_test, y_test = data_loader.load_test_data(TEST_DIR)

# The reviews are decoded to str; the labels are left unchanged
X_train = [review.decode() for review in X_train]
X_test = [review.decode() for review in X_test]

In [4]:
def create_model(
		X_train: List[str],
		y_train: List[int],
		X_test: List[str],
		y_test: List[int],
		model_name: str = 'LinearSVC',
		output_mode: str = 'tf-idf',
		max_features: int = None,
		preprocessor_fnc: Callable = None,
		ngram_range: Tuple[int, int] = (1, 1),
		**kwargs
	):

	assert model_name in ['LinearSVC', 'LogisticRegression', 'MultinomialNB'], 'Invalid model name'

	vectorize, train_features = vectorizer.vectorize_data(
		X_train,
		output_mode,
		max_features,
		ngram_range,
		preprocessor_fnc,
	)
	test_features = vectorize.transform(X_test)

	if model_name == 'LinearSVC':
		model = LinearSVC(C = kwargs['C'], dual = True, max_iter = 5000)
	elif model_name == 'MultinomialNB':
		model = MultinomialNB(alpha = kwargs['alpha'])
	else:
		model = LogisticRegression(penalty = kwargs['penalty'], C = kwargs['C'])

	model.fit(train_features, y_train)
	predictions = model.predict(test_features)

	return predictions
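
For readers without access to the `modules` package, the vectorize-then-fit flow of `create_model` can be approximated with scikit-learn alone. The sketch below is an assumption about what `Vectorizer.vectorize_data` does with `output_mode = 'tf-idf'` (its source is not shown in this notebook); `build_tfidf_svc` and the toy reviews are hypothetical names and data used only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_tfidf_svc(ngram_range=(1, 1), C=1.0, preprocessor_fnc=None):
    """Hypothetical stand-in for create_model with model_name = 'LinearSVC'."""
    # Assumption: the notebook's Vectorizer behaves like a TF-IDF vectorizer
    return make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range, preprocessor=preprocessor_fnc),
        LinearSVC(C=C, dual=True, max_iter=5000),
    )


# Toy data, not taken from the IMDB dataset
toy_train = ["a great movie", "a terrible movie", "great acting", "terrible plot"]
toy_labels = [1, 0, 1, 0]
toy_test = ["great plot", "terrible acting"]

pipeline = build_tfidf_svc(ngram_range=(1, 2), C=10)
pipeline.fit(toy_train, toy_labels)  # the vocabulary is learned on the training texts only
toy_predictions = pipeline.predict(toy_test)  # test texts are only transformed, never fitted
```

As in `create_model`, the vectorizer is fitted on the training texts and merely applied to the test texts, so no test vocabulary leaks into the features.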

### Support Vector Machine

In [5]:
params = {
	'ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
	'C': [0.1, 1, 10]
}

grid = list(ParameterGrid(params))

accuracies = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'LinearSVC',
		preprocessor_fnc = preprocessor.perform_strong_preprocessing,
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')

Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 1)} - Accuracy: 0.8838
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 2)} - Accuracy: 0.88316
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 3)} - Accuracy: 0.87464
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (2, 2)} - Accuracy: 0.84048
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (2, 3)} - Accuracy: 0.83676
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (3, 3)} - Accuracy: 0.7402
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 1)} - Accuracy: 0.8702
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 2)} - Accuracy: 0.892
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 3)} - Accuracy: 0.89096
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (2, 2)} - Accuracy: 0.85072
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range'

The best model reaches an accuracy of `0.8926` with the parameters `C = 10`, `ngram_range = (1, 3)`. The results suggest that n-gram ranges `(1, 2)` and `(1, 3)` generally allow more accurate models to be trained (as hypothesized during the exploration phase): at `C = 1` they reach `0.892` and `0.89096`, against `0.8702` for unigrams alone. The exception is `C = 0.1`, where unigrams remain marginally ahead (`0.8838` against `0.88316`).
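
To make the meaning of `ngram_range` concrete, here is a minimal pure-Python sketch of how a `(min_n, max_n)` range expands a token sequence, following the inclusive convention used by scikit-learn's text vectorizers. The helper `word_ngrams` and the sample sentence are hypothetical and are not part of the notebook's `Vectorizer`.

```python
from typing import List, Tuple


def word_ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """Return all word n-grams with min_n <= n <= max_n (both bounds inclusive)."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sequence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "not a good movie".split()
unigrams = word_ngrams(tokens, (1, 1))         # single words only
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # single words plus adjacent pairs
bigrams_only = word_ngrams(tokens, (2, 2))     # adjacent pairs only
```

A range such as `(1, 2)` keeps every unigram and adds the bigrams, whereas `(2, 2)` or `(3, 3)` discards the single words entirely.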

### Naive Bayes Classifier

In [6]:
params = {
	'ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
	'alpha': [0.1, 1, 10]
}

grid = list(ParameterGrid(params))

accuracies = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'MultinomialNB',
		preprocessor_fnc = preprocessor.perform_strong_preprocessing,
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')

Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (1, 1)} - Accuracy: 0.81352
Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (1, 2)} - Accuracy: 0.858
Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (1, 3)} - Accuracy: 0.86188
Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (2, 2)} - Accuracy: 0.84932
Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (2, 3)} - Accuracy: 0.8558
Fine tuning model with hyper/parameters: {'alpha': 0.1, 'ngram_range': (3, 3)} - Accuracy: 0.7422
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 1)} - Accuracy: 0.83536
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 2)} - Accuracy: 0.86412
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 3)} - Accuracy: 0.86588
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (2, 2)} - Accuracy: 0.85636
Fine tuning model with

The model achieving the best accuracy of `0.86588` utilizes the parameters: `alpha = 1`, `ngram_range = (1, 3)`.

### Logistic Regression

In [7]:
params = {
	'ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
	'penalty': ['l2'],
	'C': [0.1, 1, 10]
}

grid = list(ParameterGrid(params))

accuracies = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'LogisticRegression',
		preprocessor_fnc = preprocessor.perform_strong_preprocessing,
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')

Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 1), 'penalty': 'l2'} - Accuracy: 0.86188
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 2), 'penalty': 'l2'} - Accuracy: 0.85232
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (1, 3), 'penalty': 'l2'} - Accuracy: 0.84644
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (2, 2), 'penalty': 'l2'} - Accuracy: 0.82248
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (2, 3), 'penalty': 'l2'} - Accuracy: 0.82504
Fine tuning model with hyper/parameters: {'C': 0.1, 'ngram_range': (3, 3), 'penalty': 'l2'} - Accuracy: 0.74024
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 1), 'penalty': 'l2'} - Accuracy: 0.88304
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 2), 'penalty': 'l2'} - Accuracy: 0.88164
Fine tuning model with hyper/parameters: {'C': 1, 'ngram_range': (1, 3), 'penalty': 'l2'} - Accuracy: 0.8744

The model achieving the best accuracy of `0.8908` utilizes the parameters: `C = 10`, `ngram_range = (1, 2)`.
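
The three searches above each evaluate the Cartesian product of the listed values. As a rough illustration of what `ParameterGrid` enumerates, the sketch below rebuilds the same grid with `itertools.product`; `expand_grid` is a hypothetical helper, and the alphabetical key ordering is chosen to match the order visible in the printed output.

```python
from itertools import product

params = {
    "ngram_range": [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    "C": [0.1, 1, 10],
}


def expand_grid(params):
    """Enumerate every combination of parameter values as a list of dicts."""
    keys = sorted(params)  # alphabetical order: 'C' varies slowest, 'ngram_range' fastest
    value_lists = [params[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


grid = expand_grid(params)
n_models = len(grid)  # 3 values of C x 6 n-gram ranges
```

Each search therefore trains 18 models, one per combination, and keeps the one with the highest accuracy.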

### Does preprocessing have a positive impact?

Below, we analyze the influence of preprocessing on the best models obtained previously:
- **Linear SVC**: C = 10, ngram_range = (1, 3)
- **Naive Bayes**: alpha = 1, ngram_range = (1, 3)
- **Logistic Regression**: C = 10, ngram_range = (1, 2)

In text classification problems, preprocessing techniques are often used to remove data that is not useful for training the model. However, there is no guarantee that removing this data results in more accurate models. Therefore, we compare three scenarios on the models with the highest accuracy obtained previously:
- **perform_strong_preprocessing**: removes HTML tags, punctuation, and English stop words
- **perform_soft_preprocessing**: removes HTML tags and punctuation
- **no preprocessing**: the text is passed to the vectorizer as is
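
The two preprocessing levels listed above can be sketched roughly as follows. This is not the `Preprocessor` implementation, which is not shown in this notebook: the function names, the regular expression, the lowercasing step, and the tiny stop-word set are all assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stop-word set; a real English list is much longer
STOP_WORDS = {"a", "an", "the", "is", "was", "this", "and", "of", "not"}


def soft_preprocessing(text: str) -> str:
    """Remove HTML tags and punctuation (lowercasing is an added assumption)."""
    text = re.sub(r"<[^>]+>", " ", text)  # replace tags such as <br /> with a space
    text = text.translate(str.maketrans("", "", string.punctuation))  # delete punctuation
    return " ".join(text.lower().split())  # lowercase and collapse whitespace


def strong_preprocessing(text: str) -> str:
    """Soft preprocessing followed by stop-word removal."""
    words = soft_preprocessing(text).split()
    return " ".join(word for word in words if word not in STOP_WORDS)


review = "This movie was <br /> NOT good, and the plot is thin!"
soft = soft_preprocessing(review)
strong = strong_preprocessing(review)
```

On this sample review the strong variant drops `not`, a word that common English stop-word lists such as NLTK's include. This is one plausible, unverified reason why removing stop words could hurt a sentiment classifier.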

##### Support Vector Machine

In [18]:
params = {
	'preprocessor_fnc': [preprocessor.perform_strong_preprocessing, preprocessor.perform_soft_preprocessing, None],
	'ngram_range': [(1, 3)],
	'C': [10]
}

grid = list(ParameterGrid(params))

accuracies = []
predictions_list = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'LinearSVC',
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)
	predictions_list.append(predictions)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')
print(f'Accuracy: {accuracy_score(y_test, predictions_list[index])}, F1: {f1_score(y_test, predictions_list[index])}, Precision: {precision_score(y_test, predictions_list[index])}, Recall: {recall_score(y_test, predictions_list[index])}')
print(classification_report(y_test, predictions_list[index]))

Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 3), 'preprocessor_fnc': <bound method Preprocessor.perform_strong_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.8926
Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 3), 'preprocessor_fnc': <bound method Preprocessor.perform_soft_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.90428
Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 3), 'preprocessor_fnc': None} - Accuracy: 0.90492
Best parameters: {'C': 10, 'ngram_range': (1, 3), 'preprocessor_fnc': None} with accuracy: 0.90492
Accuracy: 0.90492, F1: 0.9053629016204164, Precision: 0.9011650947134818, Recall: 0.9096
              precision    recall  f1-score   support

           0       0.91      0.90      0.90     12500
           1       0.90      0.91      0.91     12500

    accuracy                           0.90     25000
   macro av

From the obtained results, it can be observed that the accuracy reaches `0.9049` without any preprocessing technique.
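
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below recomputes it from the precision and recall values printed above; the variable names are illustrative.

```python
# Precision and recall of the best Linear SVC run, copied from the output above
precision = 0.9011650947134818
recall = 0.9096

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Value printed by f1_score in the same output, for comparison
reported_f1 = 0.9053629016204164
difference = abs(f1 - reported_f1)
```

The recomputed value agrees with the one reported by `f1_score` up to floating-point error.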

##### Naive Bayes

In [19]:
params = {
	'preprocessor_fnc': [preprocessor.perform_strong_preprocessing, preprocessor.perform_soft_preprocessing, None],
	'ngram_range': [(1, 3)],
	'alpha': [1]
}

grid = list(ParameterGrid(params))

accuracies = []
predictions_list = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'MultinomialNB',
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)
	predictions_list.append(predictions)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')
print(f'Accuracy: {accuracy_score(y_test, predictions_list[index])}, F1: {f1_score(y_test, predictions_list[index])}, Precision: {precision_score(y_test, predictions_list[index])}, Recall: {recall_score(y_test, predictions_list[index])}')
print(classification_report(y_test, predictions_list[index]))

Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 3), 'preprocessor_fnc': <bound method Preprocessor.perform_strong_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.86588
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 3), 'preprocessor_fnc': <bound method Preprocessor.perform_soft_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.87664
Fine tuning model with hyper/parameters: {'alpha': 1, 'ngram_range': (1, 3), 'preprocessor_fnc': None} - Accuracy: 0.87684
Best parameters: {'alpha': 1, 'ngram_range': (1, 3), 'preprocessor_fnc': None} with accuracy: 0.87684
Accuracy: 0.87684, F1: 0.871980375036381, Precision: 0.9078001904597005, Recall: 0.83888
              precision    recall  f1-score   support

           0       0.85      0.91      0.88     12500
           1       0.91      0.84      0.87     12500

    accuracy                           0.88     2500

From the obtained results, it can be observed that the accuracy reaches `0.8768` without any preprocessing technique.

##### Logistic Regression

In [20]:
params = {
	'preprocessor_fnc': [preprocessor.perform_strong_preprocessing, preprocessor.perform_soft_preprocessing, None],
	'ngram_range': [(1, 2)],
	'penalty': ['l2'],
	'C': [10]
}

grid = list(ParameterGrid(params))

accuracies = []
predictions_list = []

for param in grid:
	predictions = create_model(
		X_train,
		y_train,
		X_test,
		y_test,
		model_name = 'LogisticRegression',
		**param
	)
	accuracy = accuracy_score(y_test, predictions)
	print(f'Fine tuning model with hyper/parameters: {param} - Accuracy: {accuracy}')
	accuracies.append(accuracy)
	predictions_list.append(predictions)

index = np.argmax(np.array(accuracies))
best_params = grid[index]
print(f'Best parameters: {best_params} with accuracy: {accuracies[index]}')
print(f'Accuracy: {accuracy_score(y_test, predictions_list[index])}, F1: {f1_score(y_test, predictions_list[index])}, Precision: {precision_score(y_test, predictions_list[index])}, Recall: {recall_score(y_test, predictions_list[index])}')
print(classification_report(y_test, predictions_list[index]))

Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 2), 'penalty': 'l2', 'preprocessor_fnc': <bound method Preprocessor.perform_strong_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.8908
Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 2), 'penalty': 'l2', 'preprocessor_fnc': <bound method Preprocessor.perform_soft_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} - Accuracy: 0.90112
Fine tuning model with hyper/parameters: {'C': 10, 'ngram_range': (1, 2), 'penalty': 'l2', 'preprocessor_fnc': None} - Accuracy: 0.90088
Best parameters: {'C': 10, 'ngram_range': (1, 2), 'penalty': 'l2', 'preprocessor_fnc': <bound method Preprocessor.perform_soft_preprocessing of <modules.preprocessor.Preprocessor object at 0x291f80b50>>} with accuracy: 0.90112
Accuracy: 0.90112, F1: 0.9014825442372071, Precision: 0.8981893265565438, Recall: 0.9048
              precision    recall  f1-score   supp

From the obtained results, it can be observed that the accuracy reaches `0.9011` with soft preprocessing, marginally ahead of the `0.9008` obtained without any preprocessing technique.

## Conclusion

Below are the tables containing the metrics of the models with the highest accuracy obtained from the various methodologies.

| Model  | Accuracy | F1-Score | Precision | Recall |
| ------------- | ------------- | -------------  | -------------  | -------------  |
| Support Vector Classifier  | <u>**0.9049**</u> | <u>**0.9053**</u> | 0.9011 | <u>**0.9096**</u> |
| Naive Bayes  | 0.8768 | 0.8719 | <u>**0.9078**</u> | 0.8388 |
| Logistic Regression  | 0.9011 | 0.9014 | 0.8981 | 0.9048 |

In addition, the table containing the different results obtained using different preprocessing techniques is reported.

| Model  | Strong Preprocessing | Soft Preprocessing | No Preprocessing |
| ------------- | ------------- | -------------  | -------------  |
| Support Vector Classifier  | 0.8926 | 0.9042 | <u>**0.9049**</u> |
| Naive Bayes  | 0.8658 | 0.8766 | <u>**0.8768**</u> |
| Logistic Regression  | 0.8908 | <u>**0.9011**</u> | 0.9008 |

Overall, performing only `soft preprocessing`, or no preprocessing at all, improves accuracy by roughly `0.01` over strong preprocessing. The highest accuracy, `0.9049`, is achieved by the `Support Vector Classifier` trained without any preprocessing.
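
The size of the improvement can be checked directly from the preprocessing table. The sketch below copies the table values into a dictionary (the key names are illustrative) and computes, for each model, the gain of the best lighter setting over strong preprocessing.

```python
# Accuracies copied from the preprocessing table above
accuracy = {
    "Support Vector Classifier": {"strong": 0.8926, "soft": 0.9042, "none": 0.9049},
    "Naive Bayes": {"strong": 0.8658, "soft": 0.8766, "none": 0.8768},
    "Logistic Regression": {"strong": 0.8908, "soft": 0.9011, "none": 0.9008},
}

# Gain of the better of soft / no preprocessing over strong preprocessing
gains = {
    model: round(max(scores["soft"], scores["none"]) - scores["strong"], 4)
    for model, scores in accuracy.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.4f}")
```

The gains range from about `0.0103` (Logistic Regression) to about `0.0123` (Support Vector Classifier).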