# Lab 9: Document Analysis

In this assignment, we will learn how to do document classification and clustering



## 1. Example

In this example, we use [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document and there are totally 20 classes. 

### 1.1 Load data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF represention

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


#TF-IDF representation for each document
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use the cross-validation method to select $K$.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.16855203045338205
{'n_neighbors': 1}


### 1.3 Use Logistic Regression to do document classification
Here, we also use the cross-validation method to select the regularization coefficient. 

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

KeyboardInterrupt: 

## 2. Task: Document Classification and Clustering

In this task, we are going to use [BBCNews](BBC_News_Train.csv) dataset. There are 1490 articles from 5 topics, including tech, business, sport, entertainment, politics. 

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

### 2.1 Load data and represent it with TF-IDF representation

In [39]:
#2.1: Load the data in and represent it with TF_IDF Representation:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

#Loading the Data:
df = pd.read_csv("BBC_News_Train.csv")

df.drop(columns = ['ArticleId'], inplace = True)

#Splitting the dataset into x and y:
x = df['Text'].values
y = df['Category'].values

#Preprocessing the y labels into numerical targets:
labenc = LabelEncoder()
y = labenc.fit_transform(y)


#Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size = 0.2, random_state = 42)

#Splitting the dataset:
train_data, test_data, ytrain, ytest = train_test_split(x,y, test_size = 0.2, random_state = 42)

print("Train data target labels: {}".format(ytrain))
#print("Train data target names: {}".format(train_data.target_names))

print('#training samples: {}'.format(len(train_data)))
print('#testing samples: {}'.format(len(test_data)))

#print("Length of Xtrain is: ", len(Xtrain))
#print("Length of Xtest is: ", len(Xtest))
#print("Ytrain is: ", Ytrain)



Train data target labels: [3 0 2 ... 2 1 3]
#training samples: 1192
#testing samples: 298


In [40]:
#Representing with TF_IDF Representation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


#TF-IDF representation for each document
vectorizer = TfidfVectorizer(stop_words = 'english')
train_data_vectors = vectorizer.fit_transform(train_data)
test_data_vectors = vectorizer.fit_transform(test_data) 

print(train_data_vectors.shape, test_data_vectors.shape)

(1192, 22302) (298, 11403)


### 2.2 Use KNN to do document classification

In [51]:
#Using KNN to do Document Classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtra = train_data_vectors
Ytra = ytrain #train_data['Category']

Xtes = test_data_vectors
Ytes = ytest #test_data['Category']

k_range = range(1, 5)
param_grid = dict(n_neighbors = k_range)

clif_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clif_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtra, Ytra)

print(grid.best_score_)
print(grid.best_params_)

0.9270173341303047
{'n_neighbors': 4}


### 2.3 Use Logistic Regression to do document classification

In [54]:
#Using Logistic Regression to do Document Classification:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clif_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clif_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtra, Ytra)

print(grid.best_params_)

#=====testing======
clif_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clif_lr.fit(Xtra, Ytra)

y_pred = clif_lr.predict(Xtes)

acc = accuracy_score(Ytes, y_pred)
macro_f1 = f1_score(Ytes, y_pred, average='macro')
micro_f1 = f1_score(Ytes, y_pred, average='micro')

print(acc, macro_f1, micro_f1)

{'C': 7}


ValueError: X has 11403 features per sample; expecting 22302

### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster. 

In [50]:
# your code
#Check slides 17
cluster = KMeans(n_clusters = 5, random_state = 0).fit(train_data)

centroids = cluster.clusters_centers_

centroid_1 = centroids[1,:]
idx = (-abs(centroid_1)).argsort()

for it in range(10):
    print(df[idx[it]])

ValueError: could not convert string to float: 'fuming robinson blasts officials england coach andy robinson said he was  livid  after his side were denied two tries in sunday s 19-13 six nations loss to ireland in dublin.  mark cueto s first-half effort was ruled out for offside before the referee spurned tv replays when england crashed over in the dying minutes.  [i m] absolutely spitting. i m livid. there s two tries we ve been cost   robinson told bbc sport.  we ve got to go back to technology. i don t know why we didn t.  south african referee jonathan kaplan ruled that cueto was ahead of charlie hodgson when the fly-half hoisted his cross-field kick for the sale wing to gather.  kaplan then declined the chance to consult the fourth official when josh lewsey took the ball over the irish line under a pile of bodies for what could have been the game-winning try.  i think mark cueto scored a perfectly legal try and i think he should have gone to the video referee on josh lewsey   said robinson.  it is how we use the technology. it is there  and it should be used.   i am still trying to work out the cueto try. i have looked at both  and they both looked tries.  we are very disappointed  and this will hurt  there is no doubt about that.  we are upset now  but the referee is in charge and he has called it his way and we have got to be able to cope with that.   we did everything we could have done to win the game. i am very proud of my players and  with a couple of decisions  this could have been a very famous victory.  i thought we dominated. matt stevens had an awesome game at tighthead prop  while the likes of charlie hodgson  martin corry and lewis moody all came through well.  josh lewsey was awesome  and every one of the forwards stood up out there. given the pressure we were under  credit must go to all the players.  we have done everything but win a game of rugby  but ireland are a good side. they defended magnificently and they ve got every chance of winning this six nations.  england have lost their first three matches in this year s six nations and four out of their six games since robinson took over from sir clive woodward in september.'