Only the second modelling approach is implemented here: the one that treats every document as a single instance, with each document represented by all of its sentences.

Load the data.

In [None]:
import io
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MaxAbsScaler

warnings.filterwarnings("ignore")

# Load data: upload the four .dat files through the Colab file picker
from google.colab import files
uploaded = files.upload()

X_test = pd.read_csv(io.BytesIO(uploaded['test-data.dat']))
X_train = pd.read_csv(io.BytesIO(uploaded['train-data.dat']))
Y_test = pd.read_csv(io.BytesIO(uploaded['test-label.dat']))
Y_train = pd.read_csv(io.BytesIO(uploaded['train-label.dat']))

# Raw label arrays; the next cell re-parses them into 0/1 vectors
Y_test_new = Y_test.values
Y_train_new = Y_train.values


Saving test-data.dat to test-data (2).dat
Saving test-label.dat to test-label (2).dat
Saving train-data.dat to train-data (2).dat
Saving train-label.dat to train-label (2).dat


The next few steps are similar to the preprocessing in the previous notebook.

In [None]:

def parse_label_rows(label_frame):
    """Turn each row's raw label string into an array of 0/1 integers."""
    rows = []
    for _, row in label_frame.iterrows():
        raw = row.iloc[0]  # raw label string for this document
        # Iterating over the string yields single characters; keep only the 0/1 flags
        flags = [ch for ch in raw if ch in ('0', '1')]
        rows.append(np.array([int(ch) for ch in flags]))
    return np.array(rows)

Y_test_new = parse_label_rows(Y_test)
Y_train_new = parse_label_rows(Y_train)
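
To make the label parsing above concrete, here is a minimal sketch on a single made-up label line. The string "1 0 0 1" is only an assumption about how a row of the label files looks (0/1 flags separated by spaces), inferred from the filter on the characters '0' and '1'.

```python
import numpy as np

# Hypothetical label line; the real file format is assumed, not confirmed.
raw_line = "1 0 0 1"

# Iterating over a string yields single characters, so the filter
# keeps the 0/1 flags and drops the separators.
flags = [ch for ch in raw_line if ch in ("0", "1")]
label_vector = np.array([int(ch) for ch in flags])

print(label_vector)  # [1 0 0 1]
```

Note that this character-level filter only works because every label is a single digit; it would not handle multi-digit values.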

In [None]:

#remove the <> tags, they are not needed
X_train = X_train.replace(r'<.*?>', '', regex=True)
X_test = X_test.replace(r'<.*?>', '', regex=True)
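
As a rough illustration of what the tag removal does, the sketch below applies the same pattern to one made-up row. The "<2>"-style markers and the column name "text" are assumptions for the example, not taken from the actual data.

```python
import pandas as pd

# Hypothetical document row; the tag contents are invented for illustration.
docs = pd.DataFrame({"text": ["<2> <3> 10 20 30 <2> 40 50"]})

# The non-greedy pattern <.*?> removes each tag separately instead of
# swallowing everything between the first '<' and the last '>'.
cleaned = docs.replace(r"<.*?>", "", regex=True)

tokens = cleaned.iloc[0, 0].split()
print(tokens)
```

The surrounding whitespace is left in place, which is harmless here because the vectorizer tokenizes on word boundaries anyway.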


Focus only on the most popular label:

In [None]:

popular_label = np.argmax(np.sum(Y_train_new, axis=0))

#keep only the column with the highest sum
y_train = Y_train_new[:, popular_label]
y_test = Y_test_new[:, popular_label]
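
The selection of the most popular label can be checked on a small toy matrix. The values below are made up; only the argmax-over-column-sums logic mirrors the cell above.

```python
import numpy as np

# Toy label matrix (invented): 4 documents x 3 labels.
Y = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])

label_counts = Y.sum(axis=0)            # documents per label -> [2 1 3]
popular = int(np.argmax(label_counts))  # index of the most frequent label -> 2
y = Y[:, popular]                       # binary target for that label -> [1 1 0 1]

print(label_counts, popular, y)
```

Reducing the multi-label problem to this single column turns it into ordinary binary classification, which is what the classifiers below are trained on.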

Vectorize the data. With min_df=0.1, terms that appear in fewer than 10% of the documents are ignored (adjust this parameter to include more or fewer words).


In [None]:

vectorizer = TfidfVectorizer(min_df=0.1)
X_train_array = X_train.values
X_test_array = X_test.values

#vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_array.ravel())
X_test = vectorizer.transform(X_test_array.ravel())

print(X_train.shape)
print(X_test.shape) #test shape to see how many words the min_df parameter cut off

(8250, 146)
(3982, 146)
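
To see how min_df prunes the vocabulary, the following sketch fits the vectorizer on a tiny invented corpus with three different thresholds. The corpus and the threshold values are assumptions chosen for illustration; they are not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented): "data" occurs in 4 of 4 documents,
# "model" in 2 of 4, every other term in 1 of 4.
corpus = [
    "data model training",
    "data model evaluation",
    "data cleaning",
    "data labels",
]

vocab_by_min_df = {}
for min_df in (0.1, 0.5, 0.75):
    vec = TfidfVectorizer(min_df=min_df)
    vec.fit(corpus)
    # A float min_df is a proportion: terms below that document
    # frequency are dropped from the vocabulary.
    vocab_by_min_df[min_df] = sorted(vec.vocabulary_)
    print(min_df, vocab_by_min_df[min_df])
```

With a higher threshold only terms shared by many documents survive, which is why min_df=0.1 leaves just 146 features in the real data.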


Scale data.

In [None]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()  # or StandardScaler(with_mean=False); default centering fails on sparse input
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
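
The sketch below shows, on an invented sparse matrix, why MaxAbsScaler suits TF-IDF output: it divides each column by its largest absolute value and does not shift the data, so zero entries stay zero and the matrix stays sparse. The matrix values are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse feature matrix (invented): 3 documents x 2 terms.
X = csr_matrix(np.array([
    [0.0, 2.0],
    [0.5, 0.0],
    [1.0, 4.0],
]))

scaler = MaxAbsScaler().fit(X)
X_scaled = scaler.transform(X)

print(X_scaled.toarray())      # each column divided by its max absolute value
print(X.nnz == X_scaled.nnz)   # number of stored non-zeros is unchanged
```

StandardScaler with its default settings subtracts the column mean, which would destroy sparsity; scikit-learn therefore rejects sparse input unless with_mean=False is passed.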

Run the classifier models.

In [None]:
from sklearn.naive_bayes import GaussianNB

classifiers = [

    {
        'name': 'Random Forest',
        'classifier': RandomForestClassifier(random_state=0),
        'parameters': {'n_estimators': [100, 200, 400]}
    },
    {
        'name': 'Logistic Regression',
        'classifier': LogisticRegression(random_state=0),
        'parameters': {'C': [0.01, 0.1, 1.0], 'solver': ['sag', 'saga', 'lbfgs']}
    }
]

for clf_info in classifiers:
    name = clf_info['name']
    clf = clf_info['classifier']
    parameters = clf_info['parameters']

    print('Classifier:', name)

    grid = GridSearchCV(clf, param_grid=parameters, cv=5, scoring='f1')
    model_grid = grid.fit(X_train, y_train)
    y_pred = model_grid.predict(X_test)

    print('Classification Report:')
    print(classification_report(y_test, y_pred))


Classifier: Random Forest
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.88      0.75      2425
           1       0.60      0.27      0.37      1557

    accuracy                           0.64      3982
   macro avg       0.63      0.58      0.56      3982
weighted avg       0.63      0.64      0.60      3982

Classifier: Logistic Regression
Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.85      0.74      2425
           1       0.58      0.33      0.42      1557

    accuracy                           0.64      3982
   macro avg       0.62      0.59      0.58      3982
weighted avg       0.63      0.64      0.62      3982
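
The loop above prints test-set metrics but not which hyperparameters the grid search selected. A minimal sketch of how that could be inspected follows; it uses synthetic stand-in data from make_classification and a reduced parameter grid, both of which are assumptions for the example rather than the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the notebook's dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(random_state=0, max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X, y)

# best_params_ holds the winning combination; best_score_ is its
# mean cross-validated F1 on the training folds.
print(grid.best_params_)
print(round(grid.best_score_, 3))
```

In the notebook itself the same attributes are available on model_grid after the fit, so printing model_grid.best_params_ inside the loop would show the chosen settings for each classifier.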



The two models perform very similarly: both reach an accuracy of 0.64, with a positive-class F1 of 0.37 for Random Forest and 0.42 for Logistic Regression. Interestingly, the results did not change significantly across different min_df values. Overall the performance is satisfactory. I would estimate that the other modelling approach would not significantly outperform this one, even though it is inevitably more expensive in running time (it has to run multiple iterations of a clustering algorithm, for example).