# Lab 9: Document Analysis

In this assignment, we will learn how to classify and cluster documents.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF representation

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF representation for each document: the vocabulary and IDF weights are
# learned on the training set only and then reused to transform the test set
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use the cross-validation method to select $K$.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

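# NOTE: range(1, 2) contains only K=1, so this search evaluates a single value;
# widen the range (e.g. range(1, 11)) to actually compare different K
k_range = range(1, 2)
param_grid = dict(n_neighbors=k_range)

# n_neighbors is overridden by each value in param_grid during the search
clf_knn = KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=2, scoring='accuracy')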
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.13620293441753578
{'n_neighbors': 1}
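
Because `range(1, 2)` contains only `K=1`, the search above does not really choose between alternatives. The following is a minimal sketch of how a search over several values of `K` might look. The toy corpus, the `toy_*` names and the candidate values `[1, 3, 5]` are assumptions made for illustration. Wrapping the vectorizer and the classifier in a `Pipeline` is a design choice here, not something the original cell does: it refits TF-IDF inside every fold, so the validation documents do not influence the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (two topics, six documents each), for illustration only
toy_docs = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal for the team",
    "football fans cheered the team",
    "the coach praised the striker",
    "the match ended with a goal",
    "the bank raised interest rates",
    "the market fell as rates rose",
    "economy growth slowed this year",
    "the bank reported strong growth",
    "investors watched the market and the economy",
    "interest rates affect the economy",
]
toy_labels = ["sport"] * 6 + ["business"] * 6

# TF-IDF is refit on the training part of every fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
toy_param_grid = {"knn__n_neighbors": [1, 3, 5]}

toy_grid = GridSearchCV(pipe, toy_param_grid, cv=3, scoring="accuracy")
toy_grid.fit(toy_docs, toy_labels)

print(toy_grid.best_params_, toy_grid.best_score_)
```

Since the default TF-IDF vectors have unit length, ranking neighbors by Euclidean distance gives the same order as ranking them by cosine similarity, so the default KNN metric is a reasonable starting point here.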


### 1.4 Use Logistic Regression to do document classification
Here, we also use the cross-validation method to select the regularization coefficient. 
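
In scikit-learn's `LogisticRegression`, the parameter `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. As a rough sketch, a log-spaced grid is one common way to search it. The toy corpus, the `toy_*` names, the grid bounds and `max_iter=1000` below are assumptions chosen for illustration, not settings taken from this lab.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus (two topics, four documents each), for illustration only
toy_docs = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal for the team",
    "football fans cheered the team",
    "the bank raised interest rates",
    "the market fell as rates rose",
    "economy growth slowed this year",
    "the bank reported strong growth",
]
toy_labels = ["sport"] * 4 + ["business"] * 4

toy_X = TfidfVectorizer().fit_transform(toy_docs)

# Five candidates spread over four orders of magnitude: 0.01, 0.1, 1, 10, 100
c_grid = {"C": np.logspace(-2, 2, 5)}

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # higher iteration cap to help the solver converge
    c_grid,
    cv=2,
    scoring="accuracy",
)
toy_search.fit(toy_X, toy_labels)

print(toy_search.best_params_)
print(toy_search.cv_results_["mean_test_score"])  # one mean accuracy per candidate C
```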

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

# ===== training with cross-validation =====
# NOTE: range(1, 2) contains only C=1, so a single value is evaluated here
coeff = range(1, 2)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=2, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

{'C': 1}
0.6736590546999469 0.6585894332744863 0.6736590546999469
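
The first and third numbers printed above are identical. That is expected: when every sample has exactly one label, micro-averaged F1 equals accuracy, while macro-averaged F1 is the unweighted mean of the per-class F1 scores. A small sketch with hypothetical labels (the `*_toy` names are made up for this example) illustrates the relationship:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for a three-class, single-label problem
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

acc_toy = accuracy_score(y_true_toy, y_pred_toy)

# Micro averaging pools true/false positives over all classes before computing F1
micro_toy = f1_score(y_true_toy, y_pred_toy, average="micro")

# Macro averaging computes F1 per class and then takes the plain mean,
# so rare classes count as much as frequent ones
macro_toy = f1_score(y_true_toy, y_pred_toy, average="macro")

print(acc_toy, micro_toy, macro_toy)
```

Macro F1 is therefore the more informative of the two when the classes are imbalanced.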


## 2. Task: Document Classification and Clustering

In this task, we use the [BBCNews](BBC_News_Train.csv) dataset. It contains 1490 articles on 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

### 2.1 Load data and represent it with TF-IDF representation

In [7]:
import pandas as pd

bbcnews_df = pd.read_csv('BBC_News_Train.csv')

bbcnews_text = bbcnews_df['Text']
bbcnews_target = bbcnews_df['Category']

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(bbcnews_text,
                                                    bbcnews_target.values,
                                                    test_size=0.2,
                                                    random_state=42)


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english')

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

### 2.2 Use KNN to do document classification

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

# n_neighbors is overridden by each value in param_grid during the search
clf_knn = KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_vec, y_train)

print(grid.best_score_)
print(grid.best_params_)

0.9270173341303047
{'n_neighbors': 4}
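
The score above is a cross-validation accuracy on the training split, whereas Section 2.3 reports metrics on the held-out test split. To compare the two classifiers fairly, both can be scored on the same test set. The sketch below shows one way to do that. It runs on a hypothetical toy corpus so that it is self-contained, and all names in it (`toy_docs`, `knn_search`, `results`, ...) are illustrative. With the BBC data, the same pattern would apply to `X_train_vec`, `X_test_vec`, `y_train` and `y_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus (two topics, six documents each), for illustration only
toy_docs = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal for the team",
    "football fans cheered the team",
    "the coach praised the striker",
    "the match ended with a goal",
    "the bank raised interest rates",
    "the market fell as rates rose",
    "economy growth slowed this year",
    "the bank reported strong growth",
    "investors watched the market and the economy",
    "interest rates affect the economy",
]
toy_labels = ["sport"] * 6 + ["business"] * 6

# stratify keeps the class proportions similar in both splits
docs_tr, docs_te, lab_tr, lab_te = train_test_split(
    toy_docs, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

vec = TfidfVectorizer(stop_words="english")
A_tr = vec.fit_transform(docs_tr)
A_te = vec.transform(docs_te)

# GridSearchCV refits the best model on the whole training split by default,
# so the fitted search object can be used directly for prediction
knn_search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3]}, cv=2, scoring="accuracy"
)
knn_search.fit(A_tr, lab_tr)

models = {
    "KNN": knn_search,
    "LogReg": LogisticRegression(max_iter=1000).fit(A_tr, lab_tr),
}

# Score every model on the same held-out documents
results = {}
for name, model in models.items():
    pred = model.predict(A_te)
    results[name] = (
        accuracy_score(lab_te, pred),
        f1_score(lab_te, pred, average="macro"),
    )
    print(name, results[name])
```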


### 2.3 Use Logistic Regression to do document classification

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=2, scoring='accuracy')
grid.fit(X_train_vec, y_train)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(X_train_vec, y_train)

y_pred = clf_lr.predict(X_test_vec)

acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

{'C': 8}
0.9697986577181208 0.9708405690370728 0.9697986577181208


### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster

In [14]:
set(bbcnews_target.values)

{'business', 'entertainment', 'politics', 'sport', 'tech'}

In [15]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=5, random_state=0).fit(X_train_vec)



In [16]:
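# Average TF-IDF weight of every term within each cluster (one row per cluster);
# toarray() returns a plain ndarray, whereas todense() returns a np.matrix
term_representation_df = pd.DataFrame(X_train_vec.toarray()).groupby(cluster.labels_).mean()
term_representation_df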

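| cluster | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22292 | 22293 | 22294 | 22295 | 22296 | 22297 | 22298 | 22299 | 22300 | 22301 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000302 | 0.010411 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000292 | ... | 0.0002 | 0.000324 | 0.000603 | 0.000583 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000379 | 0.01369 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000801 | 0.0004 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.009528 | 0.000261 | 0.0 | 0.0 | 0.000845 | 0.0 | 0.0 | 0.000469 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000257 | 0.00056 | 0.0 | 0.0 | 0.000486 | 0.0 |
| 3 | 0.0 | 0.011289 | 0.0 | 0.000322 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000736 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000262 | 0.0 | 0.0 |
| 4 | 0.0 | 0.003176 | 0.0 | 0.0 | 0.004879 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000135 | 0.00143 | 0.0 | 0.000964 |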


In [24]:
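# Average TF-IDF weight of every term within each cluster (one row per cluster)
term_representation_df = pd.DataFrame(X_train_vec.toarray()).groupby(cluster.labels_).mean()
terms = vectorizer.get_feature_names_out()
for i, r in term_representation_df.iterrows():
    # argsort is ascending, so the last 10 positions hold the highest-weighted
    # terms; the most representative term is therefore printed last
    top_10_idx = np.argsort(r.to_numpy())[-10:]
    top_10 = [terms[t] for t in top_10_idx]
    print(f'Cluster {i}: {top_10}')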

Cluster 0: ['microsoft', 'broadband', 'technology', 'net', 'software', 'users', 'phone', 'said', 'people', 'mobile']
Cluster 1: ['howard', 'minister', 'government', 'brown', 'party', 'said', 'blair', 'election', 'labour', 'mr']
Cluster 2: ['festival', 'star', 'award', 'actor', 'band', 'music', 'said', 'awards', 'best', 'film']
Cluster 3: ['oil', 'market', 'firm', 'government', 'bank', 'economy', 'year', 'mr', 'growth', 'said']
Cluster 4: ['players', 'chelsea', 'season', 'match', 'world', 'cup', 'win', 'said', 'game', 'england']
