# Lab 9: Document Analysis

In this assignment, we will learn how to do document classification and clustering.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF representation for each document; the vocabulary is learned from the training set only
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use 5-fold cross-validation to select the number of neighbors $K$.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


X_train = data_train_vectors
y_train = data_train.target

X_test = data_test_vectors
y_test = data_test.target

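# Candidate values of K: 1, 2, 3, 4
k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

# n_neighbors is set by the grid search, so the default constructor is enough
clf_knn = KNeighborsClassifier()

# 5-fold cross-validation, selecting K by accuracy
grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')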
grid.fit(X_train, y_train)

print(grid.best_score_)
print(grid.best_params_)

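### 1.4 Use Logistic Regression to do document classification
Here, we also use cross-validation to select the regularization coefficient $C$. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.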

In [None]:
from sklearn.linear_model import LogisticRegression

# ===== training with cross-validation =====
# Candidate values for C: 1, 2, ..., 9
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)

# ===== testing =====
# Refit on the full training set with the selected C, then evaluate on the test set
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(X_train, y_train)

y_pred = clf_lr.predict(X_test)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

## 2. Task: Document Classification and Clustering

In this task, we use the [BBCNews](BBC_News_Train.csv) dataset. It contains 1490 articles on 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

### 2.1 Load data and represent it with TF-IDF representation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Reading the dataset
df = pd.read_csv('BBC_News_Train.csv')

# Splitting the dataset into training (85%) and testing (15%) sets.
# No random_state is set, so the split and the scores below change from run to run.
df_train, df_test = train_test_split(df, test_size=0.15)

# Printing the unique target names in the training data
print("Train data target names: {}".format(df_train['Category'].unique()))

# Printing the number of samples in the training and testing sets
print('Training samples: {}'.format(len(df_train)))
print('Testing samples: {}'.format(len(df_test)))

# Creating a TfidfVectorizer object to convert text data to numerical vectors
tfidf = TfidfVectorizer(stop_words='english')

# Fitting the TfidfVectorizer object on the training data and transforming the training and testing data
train_vectors = tfidf.fit_transform(df_train['Text'])
test_vectors = tfidf.transform(df_test['Text']) 

# Printing the shape of the training and testing vectors
print(train_vectors.shape, test_vectors.shape)

Train data target names: ['tech' 'entertainment' 'politics' 'business' 'sport']
Training samples: 1266
Testing samples: 224
(1266, 22800) (224, 22800)


### 2.2 Use KNN to do document classification

In [3]:
# Assign the training vectors and their corresponding labels to X_train and y_train
X_train = train_vectors
y_train = df_train['Category']

# Assign the test vectors and their corresponding labels to X_test and y_test
X_test = test_vectors
y_test = df_test['Category']

# Define a range of k values to test
k_range = range(1, 5)

# Define the parameter grid to search over
param_grid = dict(n_neighbors=k_range)

# Create a KNeighborsClassifier object
clf_knn = KNeighborsClassifier()

# Create a GridSearchCV object to find the best K (cv defaults to 5-fold cross-validation)
grid = GridSearchCV(clf_knn, param_grid, scoring='accuracy')

# Fit the GridSearchCV object to the training data
grid.fit(X_train, y_train)

# Print the best mean cross-validated accuracy and the K that achieved it
print(grid.best_score_)
print(grid.best_params_)

0.9257446080109553
{'n_neighbors': 4}
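
The score above is a cross-validated accuracy on the training split, while the logistic regression in the next section is scored on the held-out test split. For a like-for-like comparison in Task 1, the tuned KNN could be scored on the test split with the same metrics. The sketch below is one possible way to do that. The helper `evaluate_best_estimator` and the toy corpus are hypothetical names introduced for illustration, and the toy data only stands in for the BBC articles so that the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def evaluate_best_estimator(fitted_grid, X_test, y_test):
    """Score the refit best estimator of a fitted GridSearchCV on held-out data."""
    # GridSearchCV.predict delegates to best_estimator_, which is refit on all training data
    y_pred = fitted_grid.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "micro_f1": f1_score(y_test, y_pred, average="micro"),
    }


# Hypothetical toy corpus (assumption: stands in for the BBC articles)
toy_train_texts = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "the club signed a new player",
    "fans cheered the team at the cup final",
    "the bank raised interest rates",
    "shares fell as the market dropped",
    "the company reported strong sales growth",
    "oil prices pushed the economy lower",
]
toy_train_labels = ["sport"] * 4 + ["business"] * 4
toy_test_texts = [
    "the team scored in the cup match",
    "the market and the bank reported growth",
]
toy_test_labels = ["sport", "business"]

toy_tfidf = TfidfVectorizer()
X_toy_train = toy_tfidf.fit_transform(toy_train_texts)
X_toy_test = toy_tfidf.transform(toy_test_texts)

# Small grid and cv=2 because the toy corpus has only eight documents
toy_grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 2]}, cv=2, scoring="accuracy")
toy_grid.fit(X_toy_train, toy_train_labels)

print(evaluate_best_estimator(toy_grid, X_toy_test, toy_test_labels))
```

In the notebook, the same helper could be called as `evaluate_best_estimator(grid, X_test, y_test)` directly after the KNN cell, before `grid` is reassigned in section 2.3.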


### 2.3 Use Logistic Regression to do document classification

In [4]:
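# Candidate values for C, the inverse of the regularization strength
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

# Select C by cross-validated accuracy (cv defaults to 5 folds)
grid = GridSearchCV(clf_lr, param_grid, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)

# Refit with the selected C and evaluate on the held-out test set
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(X_train, y_train)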

y_pred = clf_lr.predict(X_test)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

{'C': 7}
0.9821428571428571 0.9809102258540461 0.9821428571428571
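
Note that the accuracy and the micro-averaged F1 printed above are identical. This is expected: when every document has exactly one label, micro-averaging pools true positives, false positives, and false negatives over all classes, and the result equals the fraction of correct predictions. Macro-averaging instead takes the unweighted mean of the per-class F1 scores, so it can differ. The small sketch below illustrates this with hypothetical labels (`y_true_toy` and `y_pred_toy` are made up for the example).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels, used only for illustration
y_true_toy = ["sport", "sport", "tech", "tech", "tech", "business"]
y_pred_toy = ["sport", "tech", "tech", "tech", "business", "sport"]

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
micro_toy = f1_score(y_true_toy, y_pred_toy, average="micro")
macro_toy = f1_score(y_true_toy, y_pred_toy, average="macro")

# Accuracy and micro F1 coincide; macro F1 is pulled down by the class that is never predicted correctly
print(acc_toy, micro_toy, macro_toy)
```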


### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster

In [6]:
# Create a KMeans model with 5 clusters, one per topic
clf_kmeans = KMeans(n_clusters=5)

# Fit the KMeans model to the training data (the labels are not used)
clf_kmeans.fit(X_train)

# For each cluster, get the term indices sorted by centroid weight in descending order
order_centroids = clf_kmeans.cluster_centers_.argsort()[:, ::-1]

# Get the list of feature names (terms) used in the TfidfVectorizer
terms = tfidf.get_feature_names_out()

# For each cluster, print the 10 terms with the largest weights in the cluster centroid
for i in range(5):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Cluster 0:
 mr
 labour
 election
 blair
 said
 party
 government
 brown
 howard
 minister
Cluster 1:
 game
 england
 said
 cup
 win
 players
 team
 club
 chelsea
 match
Cluster 2:
 film
 best
 awards
 award
 actor
 oscar
 won
 films
 director
 comedy
Cluster 3:
 mobile
 people
 said
 software
 microsoft
 users
 technology
 digital
 phone
 music
Cluster 4:
 said
 year
 sales
 growth
 mr
 company
 economy
 bank
 market
 oil
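
The top terms suggest that the clusters line up with the five topics. One way to check this, sketched below under stated assumptions, is to cross-tabulate the cluster assignments against the known categories. The toy DataFrame `toy_df` and the other `toy_` names are hypothetical and exist only so that the block runs on its own; `n_init` and `random_state` are set explicitly here just to make the toy run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with the same column names as the BBC CSV (assumption)
toy_df = pd.DataFrame({
    "Text": [
        "the team won the football match",
        "the football team lost the cup match",
        "the club team played a football match",
        "the bank raised interest rates on the market",
        "the market fell as the bank cut rates",
        "bank shares rose on the market",
    ],
    "Category": ["sport", "sport", "sport", "business", "business", "business"],
})

toy_vectors = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Text"])
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_vectors)

# Rows are cluster ids, columns are true categories, cells are document counts
table = pd.crosstab(toy_kmeans.labels_, toy_df["Category"],
                    rownames=["cluster"], colnames=["category"])
print(table)
```

In the notebook, the analogous call would use `clf_kmeans.labels_` together with `df_train['Category'].to_numpy()`, since the model was fitted on the training vectors. Cluster ids are arbitrary, so only the pattern of counts matters, not which id maps to which topic.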
