# Assignment A2: Topic Modeling and Text ML

Covering material from Notebooks 5 and 6

In [6]:
#Import the AG news dataset (same as hw01)
#Download them from here 
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df.head()


import spacy
dfs = df.sample(200)
nlp = spacy.load('en_core_web_md')

# A. Dimension Reduction

## PCA

In [9]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3,svd_solver='randomized')

##TODO reduce the vectorized data using PCA
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
hv = CountVectorizer()
X = hv.fit_transform(dfs["text"]) 
X = np.array(X.todense())  # type: ignore
X_pca = pca.fit_transform(X)
##TODO compute again cosine similarity with the reduced version for the first 200 snippets
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(X_pca, X_pca[0,:].reshape(1,-1))
##TODO for the first snippet, show again its three most similar snippets
# ind = np.argpartition(similarities.reshape(1, -1), -3)[-3:]
ind = (-similarities.reshape(1, -1)).argsort()

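# keep the top 4: the first snippet itself (similarity 1.0) plus its three most similar snippets
ind = ind[0][:4]

# similarity scores of the snippet itself and of its top three matches
print(similarities[ind])

# print the original snippet and the three most similar snippets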
for i in ind:
    print(str(dfs.iloc[i,3]) + "\n")

[[1.        ]
 [0.99110823]
 [0.98403761]
 [0.95963776]]
Former star bowler Lillee ends coaching role at Australian academy Fast bowling great Dennis Lillee has cut ties with Australia #39;s Cricket Academy after failing to come to terms on a new coaching contract.

Israel Turns Up Heat on Palestinian Hunger Strike  JERUSALEM (Reuters) - Israel declared psychological war on  hunger-striking Palestinian prisoners on Monday, saying it  would barbecue meat outside their cells to try to break their  will.

EU fast-tracks action to end institutional limbo Incoming European Commission head Jose Manuel Barroso acted Friday to end an unprecedented EU stalemate, by arranging fast-track confirmation hearings for his re-arranged Brussels executive team.

Steroid shock waves ASSOCIATED PRESS ASSOCIATED PRESS ASSOCIATED &lt;b&gt;...&lt;/b&gt; San Francisco Giants left fielder Barry Bonds, third on Major League Baseball #39;s career home run list, testified to a grand jury that he used a clear subst

Compare the cosine similarity between docs before and after PCA reduction. Did the results change? 
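
One way to approach this comparison is to compute both similarity vectors side by side and check how far the neighbour rankings overlap. The sketch below is only illustrative: `toy_docs` is a hypothetical six-document corpus standing in for `dfs["text"]`, and it uses two components instead of three because the corpus is so small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for dfs["text"]
toy_docs = [
    "cricket bowler ends coaching role",
    "cricket academy hires new bowling coach",
    "baseball player testifies about steroids",
    "commission head arranges confirmation hearings",
    "commission stalemate ends after hearings",
    "stock markets rally as oil prices fall",
]

X_counts = CountVectorizer().fit_transform(toy_docs).toarray()
X_reduced = PCA(n_components=2).fit_transform(X_counts)

# Similarity of every document to the first one, before and after PCA
sim_full = cosine_similarity(X_counts, X_counts[:1]).ravel()
sim_pca = cosine_similarity(X_reduced, X_reduced[:1]).ravel()

def top_neighbours(sims, query_index=0, k=3):
    """Indices of the k most similar documents, excluding the query itself."""
    order = np.argsort(-sims)
    return order[order != query_index][:k]

top_full = top_neighbours(sim_full)
top_pca = top_neighbours(sim_pca)

print("before PCA:", top_full, sim_full[top_full])
print("after PCA: ", top_pca, sim_pca[top_pca])
print("shared neighbours:", set(top_full.tolist()) & set(top_pca.tolist()))
```

With only a handful of components, many documents can end up pointing in nearly the same direction in the reduced space, so very high post-PCA similarities do not necessarily indicate topical closeness.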

## Topic Modeling with LDA

For this part you will need to use LDA Mallet. If you cannot get Mallet to run, you can fall back on gensim's standard LDA implementation (LdaModel).

In [18]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

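##TODO create a dictionary with the pre-processed tokenized text and filter it according to frequencies, keeping a vocabulary of 1000 tokens
import spacy
nlp = spacy.load('en_core_web_md')

docs = list(nlp.pipe(dfs["text"]))
# keep lower-cased tokens that are not punctuation, whitespace or stop words
tokenized = [[token.text.lower() for token in doc if not
              (token.is_punct or token.is_space or token.is_stop)] for doc in docs]
dfs["text_clean"] = [" ".join(tokens) for tokens in tokenized]

# Dictionary and doc2bow expect lists of tokens, not joined strings;
# the name `dictionary` avoids shadowing the built-in dict
dictionary = Dictionary(tokenized)
# frequency filter; the no_below/no_above thresholds are illustrative choices
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=1000)

##TODO create the doc_term_matrix
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized]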

In [None]:
##TODO train a LDA Mallet model with 5, 10 and 15 topics
##TODO compute the coherence score for each of these model and print the topics from the model with highest coherence score

In [None]:
import pyLDAvis.gensim
##TODO using LDAvis visualize the topics using the optimal number of topics

# B. Supervised Learning

## Load and Pre-process Text
We do sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). If you would like to know more about the data, have a look at [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (but no need to do so).

In [19]:
# In this tutorial, we do sentiment analysis
# download the data
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

--2022-11-13 13:20:40--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4029756 (3.8M) [application/x-gzip]
Saving to: 'scale_data.tar.gz'


2022-11-13 13:20:43 (3.21 MB/s) - 'scale_data.tar.gz' saved [4029756/4029756]

--2022-11-13 13:20:43--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: 'scale_whole_review.tar.gz'


2022-11-13 13:20:49 (1.54 MB/s) - 'scale_whole_review.tar.gz' saved [8853204/8853204]



First, we have to load the data for which we provide the function below. Note how we also preprocess the text using gensim's simple_preprocess() function and how we already split the data into a train and test split.

In [108]:
import os
from gensim.utils import simple_preprocess
def load_data():
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        path = os.listdir(os.path.join("scale_whole_review", author, "txt.parag"))
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels
                  
X,y  = load_data()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print ("text:", X_train[0], "\nlabel:", y_train[0])

text: bloody child the director writer cinematographer nina menkes screenwriter tinka menkes editors nina and tina menkes cast tinka menkes captain sherry sibley murdered wife robert mueller murderer russ little sergeant jack hara enlisted man runtime mirage reviewed by dennis schwartz an amazingly strange film confusing and not thoroughly enjoyable but film found more interesting than thought possible at first viewing this experimental film in minimalist story telling film consisting of disturbing visualizations and almost no dialogue had concept that was greater than how the film turned out it felt at times like was watching paint dry on the wall but the reward for sitting through those excruciatingly redundant scenes was in seeing something different something that cast spell of sorcery over terrible incident as believe the film in its unique and sometimes shrill voice does justice in commenting on the violence in american society especially against women the film uses its impressio

## Vectorize the data

In [110]:
# train a TF_IDF Vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.01, # at min 1% of docs
                        max_df=.5,  
                        stop_words='english',
                        ngram_range=(1,2))

##TODO train vectorizer
vec.fit(X_train)

##TODO transform X_train to TF-IDF values
X_train_tfidf = vec.transform(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
feature_names = vec.get_feature_names_out()
X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray(), columns=feature_names)
print(X_train_tfidf.shape)

##TODO transform X_test to TF-IDF values
X_test_tfidf = vec.transform(X_test)
X_test_tfidf = pd.DataFrame(X_test_tfidf.toarray(), columns=feature_names)
# .info without parentheses only prints the bound method, so print the shape instead
print(X_test_tfidf.shape)

(3354, 5593)
(1652, 5593)

In [113]:
##TODO scale both training and test data with the standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train_tfidf_scaled = scaler.fit_transform(X_train_tfidf)
X_test_tfidf_scaled = scaler.transform(X_test_tfidf)

## ElasticNet

In [119]:
##TODO train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

##TODO train the ElasticNet
en.fit(X_train_tfidf_scaled, y_train)
##TODO predict the testset
preds = en.predict(X_test_tfidf_scaled)

from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score
##TODO print mean squared error and r2 score on the test set
r2 = r2_score(y_test, preds)
# accur_score = accuracy_score(y_test, preds) doesn't work, we are not classifying
mse = mean_squared_error(y_test, preds)
# balanc_accur_score = balanced_accuracy_score(y_test, preds)
print(r2, mse)

0.49772890489643706 0.016535782264671527


## Logistic Regression

Next, we train a logistic regression model for binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes: one class contains all the negative ratings (value < 0.5), the other all the positive ratings (value >= 0.5).

In [120]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [121]:
##TODO train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

##TODO train a logistic regression
logistic_regression.fit(X_train_tfidf_scaled, y_train)

##TODO predict the testset 
preds = logistic_regression.predict(X_test_tfidf_scaled)

##LogisticRegression.predict already returns class labels (0 or 1), so thresholding at 0.5 leaves them unchanged; this step only matters for models with continuous output
def map_predictions(predicted):
    predicted = [1 if i > 0.5 else 0 for i in predicted]
    return predicted

##TODO print the accuracy of our classifier on the testset
print(accuracy_score(y_test, map_predictions(preds)))

## TODO print the 10 most informative words of the regression (the 10 words having the highest coefficients)
idx = (-logistic_regression.coef_).argsort()
idx = idx[0][:10]
print(X_train_tfidf.columns[idx])  # type: ignore

0.8075060532687651
Index(['great', 'effective', 'surprisingly', 'punches', 'success',
       'fascinating', 'influenced', 'investigating', 'brilliant', 'best'],
      dtype='object')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
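
The warning above means the solver stopped before converging; raising `max_iter` is the usual first remedy. The sketch below illustrates that setting on hypothetical one-dimensional toy data, and also shows that `predict` already returns hard class labels, while `predict_proba` gives the continuous scores one could threshold manually. The value `max_iter=1000` is an arbitrary assumption, not a tuned choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two well separated classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000)  # default is max_iter=100
clf.fit(X_toy, y_toy)

hard_labels = clf.predict(X_toy)                 # class labels, already 0 or 1
prob_positive = clf.predict_proba(X_toy)[:, 1]   # continuous probability of class 1
thresholded = (prob_positive > 0.5).astype(int)  # manual threshold at 0.5

print(hard_labels)
print(np.round(prob_positive, 3))
print(bool((hard_labels == thresholded).all()))
```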


## XGBoost

Lastly, we train an XGBoost classifier to do topic prediction on the AG news dataset, which is a multi-class prediction problem (4 classes). We again have to vectorize the data, train the classifier, predict the testset and output an evaluation metric (we go for accuracy).

In [122]:

# vectorize the data
from sklearn.feature_extraction.text import TfidfVectorizer

# only consider 10% of the data
dfs = df.sample(frac=0.1)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(dfs["text"], dfs["label"], test_size=0.33, random_state=42)

vec = TfidfVectorizer(min_df=5, # term must occur in at least 5 docs (an integer is a document count, not a share)
                        max_df=.5,  
                        stop_words='english',
                        max_features=2000,
                        ngram_range=(1,2))

# transform into TF-IDF values (toarray() returns a plain ndarray; todense() returns the discouraged np.matrix type)
X_train_tfidf = vec.fit_transform(X_train).toarray()
X_test_tfidf = vec.transform(X_test).toarray()
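
The `min_df` argument of `TfidfVectorizer` is interpreted differently depending on its type: an integer is an absolute document count, a float is a share of the documents. The small sketch below makes the difference visible on a hypothetical four-document corpus (`toy_docs` is made up for this illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = [
    "markets rally on oil",
    "oil prices fall",
    "markets fall again",
    "cricket coach resigns",
]

# Integer: keep terms that occur in at least 2 documents
vocab_count = sorted(TfidfVectorizer(min_df=2).fit(toy_docs).vocabulary_)

# Float: keep terms that occur in at least 50% of the documents (2 of 4 here)
vocab_share = sorted(TfidfVectorizer(min_df=0.5).fit(toy_docs).vocabulary_)

# Integer 1: every term that occurs at all is kept
vocab_all = sorted(TfidfVectorizer(min_df=1).fit(toy_docs).vocabulary_)

print(vocab_count)     # ['fall', 'markets', 'oil']
print(vocab_share)     # ['fall', 'markets', 'oil']
print(len(vocab_all))  # 10
```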


XGBoost provides a scikit-learn-compatible interface, i.e. its models implement the same fit and predict methods as a scikit-learn estimator. If you are interested in a more detailed overview, have a look at the [official documentation](https://xgboost.readthedocs.io/en/latest/python/index.html).

In [123]:
param_dist = {'objective':'multi:softmax', 'num_class': 5, 'n_estimators':25}
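# note: there are only 4 labels. The error "label must be in [0, num_class)." appears when labels
# outside 0..3 (e.g. the raw codes 1..4) are passed together with "num_class": 4.
# After label encoding the labels are 0..3, so "num_class": 4 should also work; 5 is kept from the original run.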
import xgboost as xgb
from sklearn import preprocessing

clf = xgb.XGBModel(**param_dist)

##TODO train the XGBModel 
le = preprocessing.LabelEncoder()
y_train = le.fit_transform(y_train)
clf.fit(X_train_tfidf, y_train)

##TODO predict the testset 
preds = clf.predict(X_test_tfidf)

##TODO evaluate the predictions using accuracy as a metric
# reuse the encoder fitted on the training labels so both splits share one label-to-integer mapping
y_test = le.transform(y_test)
print(accuracy_score(y_test, preds))

0.806060606060606
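
A `LabelEncoder` assigns integer codes from the sorted set of labels it sees during fitting, so an encoder fitted on a different split is only guaranteed to produce the same codes if both splits contain exactly the same classes. The sketch below shows what can go wrong otherwise; the label lists are made up, with one class deliberately missing from the test split.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label lists; "business" is deliberately absent from the test split
train_labels = ["world", "sport", "business", "sci/tech", "sport"]
test_labels = ["world", "sport", "sci/tech"]

# Encoder fitted on the training labels and reused for the test labels
shared_encoder = LabelEncoder().fit(train_labels)
shared_codes = shared_encoder.transform(test_labels)

# Second encoder fitted on the test labels alone
separate_codes = LabelEncoder().fit_transform(test_labels)

print(list(shared_encoder.classes_))  # ['business', 'sci/tech', 'sport', 'world']
print(list(shared_codes))             # [3, 2, 1]
print(list(separate_codes))           # [2, 1, 0]
```

In the second case every code is shifted by one, so predictions from a model trained on the first mapping would be scored against the wrong classes.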
