# HW04: ML and DL

Remember that these homework assignments are graded on completion. **You can skip one section without losing credit.**

## Load and Pre-process Text
We do sentiment analysis on the [Movie Review Data](https://www.cs.cornell.edu/people/pabo/movie-review-data/). If you would like to know more about the data, have a look at [the paper](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf) (optional reading).

In [1]:
# In this tutorial, we do sentiment analysis
# download the data
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar xf aclImdb_v1.tar.gz

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
 
!tar xf scale_data.tar.gz 
!tar xf scale_whole_review.tar.gz

--2023-03-23 10:50:26--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4029756 (3.8M) [application/x-gzip]
Saving to: 'scale_data.tar.gz'


2023-03-23 10:50:28 (3.90 MB/s) - 'scale_data.tar.gz' saved [4029756/4029756]

--2023-03-23 10:50:28--  https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_whole_review.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8853204 (8.4M) [application/x-gzip]
Saving to: 'scale_whole_review.tar.gz'


2023-03-23 10:50:32 (2.35 MB/s) - 'scale_whole_review.tar.gz' saved [8853204/8853204]



First, we load the data with the function provided below. Note that it also preprocesses the text with gensim's simple_preprocess() function, and that we then split the data into a training set and a test set.

In [2]:
import os
from gensim.utils import simple_preprocess
def load_data():
    """Read every author's reviews and ratings and return (texts, ratings)."""
    examples, labels = [], []
    authors = os.listdir("scale_whole_review")
    for author in authors:
        fn_ids = os.path.join("scaledata", author, "id." + author)
        fn_ratings = os.path.join("scaledata", author, "rating." + author)
        with open(fn_ids) as ids, open(fn_ratings) as ratings:
            for idx, rating in zip(ids, ratings):
                labels.append(float(rating.strip()))
                filename_text = os.path.join("scale_whole_review", author, "txt.parag", idx.strip() + ".txt")
                with open(filename_text, encoding='latin-1') as f:
                    examples.append(" ".join(simple_preprocess(f.read())))
    return examples, labels
                  
X,y  = load_data()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print ("text:", X_train[0], "\nlabel:", y_train[0])

text: bloody child the director writer cinematographer nina menkes screenwriter tinka menkes editors nina and tina menkes cast tinka menkes captain sherry sibley murdered wife robert mueller murderer russ little sergeant jack hara enlisted man runtime mirage reviewed by dennis schwartz an amazingly strange film confusing and not thoroughly enjoyable but film found more interesting than thought possible at first viewing this experimental film in minimalist story telling film consisting of disturbing visualizations and almost no dialogue had concept that was greater than how the film turned out it felt at times like was watching paint dry on the wall but the reward for sitting through those excruciatingly redundant scenes was in seeing something different something that cast spell of sorcery over terrible incident as believe the film in its unique and sometimes shrill voice does justice in commenting on the violence in american society especially against women the film uses its impressio

In [3]:
import numpy as np
np.mean(y_train)

0.5836076326774001

## Vectorize the data

In [4]:
# train a TF_IDF Vectorizer on X_train and vectorize X_train and X_test
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.01, # keep terms that occur in at least 1% of the documents
                        max_df=.5,  # drop terms that occur in more than 50% of the documents
                        stop_words='english',
                        ngram_range=(1,2))

##TODO train vectorizer

##TODO transform X_train to TF-IDF values
X_train_tfidf = vec.fit_transform(X_train)
X_train_feature_names = vec.get_feature_names_out()
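##TODO transform X_test to TF-IDF values
# note: fit_transform refits the vocabulary and IDF weights on X_test, which is why the
# feature sets differ below; vec.transform(X_test) would reuse the training vocabulary
X_test_tfidf = vec.fit_transform(X_test)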
X_test_feature_names = vec.get_feature_names_out()
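
A vectorizer is usually fitted on the training texts only and then reused, unchanged, on the test texts, so that both matrices share one vocabulary and one set of IDF weights. Here is a minimal sketch on a toy corpus. The documents and the names `toy_vec`, `train_docs` and `test_docs` are made up for illustration and are not part of the homework data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
train_docs = ["a great film with a great cast", "a dull film", "great acting and a dull plot"]
test_docs = ["a great plot", "an unseen word appears"]

toy_vec = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training texts
train_matrix = toy_vec.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are ignored
test_matrix = toy_vec.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
```

With this pattern the train and test matrices have the same columns in the same order, so no index alignment between two vocabularies is needed afterwards.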

In [5]:
# get the indices of the features shared by the train and test vocabularies
X_train_index = np.nonzero(np.in1d(X_train_feature_names, X_test_feature_names))[0]
X_test_index = np.nonzero(np.in1d(X_test_feature_names, X_train_feature_names))[0]

print(len(X_train_index) == len(X_test_index))

True
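
The alignment above keeps only the features that occur in both vocabularies. The sketch below shows the same idea on two tiny, made-up vocabularies. It uses `np.isin`, which the NumPy documentation recommends over `np.in1d` for new code. The arrays `train_names` and `test_names` are hypothetical.

```python
import numpy as np

# Hypothetical sorted vocabularies, as returned by get_feature_names_out()
train_names = np.array(["bad", "film", "great", "plot"])
test_names = np.array(["film", "great", "story"])

# Positions of the shared features within each vocabulary
train_idx = np.nonzero(np.isin(train_names, test_names))[0]
test_idx = np.nonzero(np.isin(test_names, train_names))[0]

# Both vocabularies are sorted, so the selected columns line up one-to-one
print(train_names[train_idx], test_names[test_idx])
```

Note that equal lengths alone do not prove that the columns match. Comparing the selected names, as in the last line, is the stricter check.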


In [6]:
##TODO scale both training and test data with the standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train_tfidf_scaled = scaler.fit_transform(X_train_tfidf)
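# note: fit_transform re-estimates the scale on X_test; once both matrices share one
# vocabulary, scaler.transform(X_test_tfidf) would reuse the training scale instead
X_test_tfidf_scaled = scaler.fit_transform(X_test_tfidf)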
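
For sparse TF-IDF matrices, `StandardScaler` is used with `with_mean=False` so that the data stays sparse, and the scale is normally estimated on the training data only. Here is a small sketch under those assumptions. The matrices `toy_train` and `toy_test` are made-up examples.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse feature matrices with the same two columns
toy_train = csr_matrix(np.array([[0.0, 2.0], [0.0, 4.0], [3.0, 0.0]]))
toy_test = csr_matrix(np.array([[1.0, 1.0]]))

# with_mean=False skips centering, which would turn the sparse matrix dense
toy_scaler = StandardScaler(with_mean=False)

# Estimate the per-column scale on the training data ...
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# ... and apply exactly that scale to the test data
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.scale_)
print(toy_test_scaled.toarray())
```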

In [7]:
print(X_train_tfidf_scaled.shape, X_test_tfidf_scaled.shape)
print(X_train_feature_names)
print(X_test_feature_names)
X_train_tfidf_scaled.toarray()
#matrix of all names

(3354, 5593) (1652, 5687)
['aaron' 'abandon' 'abandoned' ... 'youngster' 'youngsters' 'youth']
['aaron' 'abandon' 'abandoned' ... 'youthful' 'zero' 'zone']


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## ElasticNet

In [8]:
##TODO train an elastic net on the transformed output of the scaler
from sklearn.linear_model import ElasticNet

en = ElasticNet(alpha=0.01)

# keep only the feature columns shared by the train and test vocabularies
X_train_d = X_train_tfidf_scaled[:, X_train_index]
X_test_d = X_test_tfidf_scaled[:, X_test_index]

##TODO train the ElasticNet
en.fit(X_train_d ,y_train)

##TODO predict the testset
y_pred = en.predict(X_test_d)
from sklearn.metrics import r2_score, accuracy_score, mean_squared_error, balanced_accuracy_score


In [9]:
##TODO print mean squared error and r2 score on the test set
print('R^2: ', r2_score(y_test,y_pred))
#print('Accuracy Score: ',accuracy_score(y_test,y_pred))
print('Mean Squared error: ', mean_squared_error(y_test,y_pred))
#print('Balanced Accuracy score: ', balanced_accuracy_score(y_test,y_pred))
      

R^2:  0.4934699420124996
Mean Squared error:  0.016675995953272474
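
To see what the two numbers above measure, the metrics can be computed by hand. This is a minimal NumPy sketch on made-up values (the arrays `toy_true` and `toy_pred` are hypothetical): the mean squared error averages the squared residuals, and R^2 compares the residual sum of squares with the variation of the targets around their mean.

```python
import numpy as np

# Hypothetical ratings and predictions, only for illustration
toy_true = np.array([0.2, 0.4, 0.6, 0.8])
toy_pred = np.array([0.3, 0.4, 0.5, 0.9])

residuals = toy_true - toy_pred

# Mean squared error: average squared residual
mse = np.mean(residuals ** 2)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "R^2:", r2)
```

An R^2 of 0 corresponds to always predicting the mean rating, and 1 corresponds to perfect predictions.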


## Logistic Regression

Next, we train a logistic regression model for binary prediction on these movie reviews. To get two bins, we transform the continuous ratings into two classes: one class contains all the negative ratings (value < 0.5), the other all the positive ratings (value >= 0.5).

In [10]:
y_train = [1 if i >= 0.5 else 0 for i in y_train]
y_test = [1 if i >= 0.5 else 0 for i in y_test]


In [11]:
##TODO train logistic regression on X_train
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

##TODO train a logistic regression
clf = logistic_regression.fit(X_train_d, y_train)
##TODO predict the testset 
y_pred_2 = clf.predict(X_test_d)
##predict() already returns the class labels 0 and 1, so the 0.5 threshold below leaves them unchanged; it is only needed for continuous output such as predict_proba()
def map_predictions(predicted):
    predicted = [1 if i > 0.5 else 0 for i in predicted]
    return predicted

##TODO print the accuracy of our classifier on the testset
print('Accuracy Score: ',accuracy_score(y_test,map_predictions(y_pred_2)))
print('Balanced Accuracy score: ', balanced_accuracy_score(y_test, map_predictions(y_pred_2)))



Accuracy Score:  0.799636803874092
Balanced Accuracy score:  0.716512901290129


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
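
The sketch below illustrates two points from the cell above on synthetic data: `predict()` returns hard class labels while `predict_proba()` returns continuous probabilities, and a larger `max_iter` is one of the remedies the warning suggests. The data comes from `make_classification`, and all `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, not the movie reviews
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

# A larger max_iter gives the solver more room to converge
toy_clf = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# predict() returns class labels directly
toy_labels = toy_clf.predict(toy_X)

# predict_proba() returns probabilities; column 1 belongs to class 1
toy_proba = toy_clf.predict_proba(toy_X)[:, 1]
toy_thresholded = (toy_proba > 0.5).astype(int)

print(np.array_equal(toy_labels, toy_thresholded))
```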


In [12]:
## TODO print the 10 most informative words of the regression (the 10 words having the highest coefficients)
coeff1 = clf.coef_[0]
coeff = clf.coef_[0]
coeff.sort()
imp_coeff = coeff[-9:]
word_index = np.nonzero(np.in1d(coeff1, imp_coeff))[0]
X_train_feature_names[word_index]

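# weird result: coeff and coeff1 refer to the same array, so the in-place sort() reorders both
# and word_index just picks the last positions. The positions also index the reduced feature
# set (X_train_index), not X_train_feature_names, and [-9:] selects 9 coefficients, not 10.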

array(['thrillers', 'thrilling', 'thrills', 'throw', 'throwing', 'thrown',
       'throws', 'thumbs', 'ticket'], dtype=object)
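
One way to pick the words with the largest coefficients without disturbing the coefficient array is `np.argsort`, which returns sorted positions and leaves the original untouched. Here is a minimal sketch with made-up names and coefficients (`toy_names`, `toy_coef` and `top_k` are hypothetical):

```python
import numpy as np

# Hypothetical feature names and coefficients, in matching order
toy_names = np.array(["awful", "boring", "fine", "great", "superb"])
toy_coef = np.array([-1.5, -0.7, 0.1, 1.2, 2.0])
top_k = 2

# argsort gives positions in ascending order; reverse and keep the first top_k
top_idx = np.argsort(toy_coef)[::-1][:top_k]
top_words = toy_names[top_idx]

print(top_words)
```

In the notebook, the names would have to be the ones belonging to the columns the model was trained on, for example `X_train_feature_names[X_train_index]`.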

# Deep Learning

## MLP

In [13]:
#Import the AG news dataset (same as hw01)
#Download them from here 
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
    return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # only use 10K datapoints
df.head()

--2023-03-23 10:51:05--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: 'train.csv'


2023-03-23 10:51:13 (3.96 MB/s) - 'train.csv' saved [29470338/29470338]



| | label | title | lead | text |
|---|---|---|---|---|
| 17379 | world | Australian high court limits unions' right to ... | AFP - Australian Prime Minister John Howard ha... | Australian high court limits unions' right to ... |
| 25976 | sport | Echoes Across Forty Years | On a day when the Browns last championship tea... | Echoes Across Forty Years On a day when the Br... |
| 71699 | business | Mittal Family Forges \$17.8 Bln Steel Deal | AMSTERDAM (Reuters) - Lakshmi Mittal, one of ... | Mittal Family Forges \$17.8 Bln Steel Deal AM... |
| 48920 | sci/tech | PalmOne unveils 256MB Flash drive T5 PDA | PalmOne launched the Tungsten T5, its first PD... | PalmOne unveils 256MB Flash drive T5 PDA PalmO... |
| 51450 | business | China to 'phase out' measures aimed at cooling... | AFP - China will phase out measures aimed at r... | China to 'phase out' measures aimed at cooling... |


In [14]:
# create a new variable "business" that takes value 1 if the label is business and 0 otherwise
df['business'] = df['label'].apply(lambda x: int(x=='business'))
y = df['business'].values
df['business'].head()

17379    0
25976    0
71699    1
48920    0
51450    1
Name: business, dtype: int64

In [15]:
import spacy
nlp = spacy.load('en_core_web_sm')
from sklearn.feature_extraction.text import CountVectorizer

# pre-process text as you did in HW02
def tokenize(x):
    return [w.lemma_.lower() for w in nlp(x) if not w.is_stop and not w.is_punct and not w.is_digit]
df["tokens"] = df["text"].apply(lambda x: tokenize(x))
df["preprocessed"] = df['tokens'].apply(lambda x: ' '.join(x))
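# note: df["preprocessed"] is already a string, so this second join puts a space between every character
df["preprocessed_text"] = df["preprocessed"].apply(lambda x: " ".join(x))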

df

| | label | title | lead | text | business | tokens | preprocessed | preprocessed_text |
|---|---|---|---|---|---|---|---|---|
| 17379 | world | Australian high court limits unions' right to ... | AFP - Australian Prime Minister John Howard ha... | Australian high court limits unions' right to ... | 0 | [australian, high, court, limit, union, right,... | australian high court limit union right strike... | a u s t r a l i a n h i g h c o u r t l ... |
| 25976 | sport | Echoes Across Forty Years | On a day when the Browns last championship tea... | Echoes Across Forty Years On a day when the Br... | 0 | [echo, year, day, browns, championship, team, ... | echo year day browns championship team salute ... | e c h o y e a r d a y b r o w n s c h ... |
| 71699 | business | Mittal Family Forges \$17.8 Bln Steel Deal | AMSTERDAM (Reuters) - Lakshmi Mittal, one of ... | Mittal Family Forges \$17.8 Bln Steel Deal AM... | 1 | [mittal, family, forges, \$17.8, bln, steel, d... | mittal family forges \$17.8 bln steel deal a... | m i t t a l f a m i l y f o r g e s \ $ ... |
| 48920 | sci/tech | PalmOne unveils 256MB Flash drive T5 PDA | PalmOne launched the Tungsten T5, its first PD... | PalmOne unveils 256MB Flash drive T5 PDA PalmO... | 0 | [palmone, unveil, mb, flash, drive, t5, pda, p... | palmone unveil mb flash drive t5 pda palmone l... | p a l m o n e u n v e i l m b f l a s h ... |
| 51450 | business | China to 'phase out' measures aimed at cooling... | AFP - China will phase out measures aimed at r... | China to 'phase out' measures aimed at cooling... | 1 | [china, phase, measure, aim, cool, economy, af... | china phase measure aim cool economy afp afp c... | c h i n a p h a s e m e a s u r e a i m ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36338 | sci/tech | Immersion Wins Patent Case Against Sony (AP) | AP - Immersion Corp., a small firm that develo... | Immersion Wins Patent Case Against Sony (AP) A... | 0 | [immersion, win, patent, case, sony, ap, ap, i... | immersion win patent case sony ap ap immersion... | i m m e r s i o n w i n p a t e n t c a ... |
| 26396 | sport | Play-off earns Singh seventh victory of year | NOW that he is No1 in the world, Vijay Singh c... | Play-off earns Singh seventh victory of year N... | 0 | [play, earn, singh, seventh, victory, year, no... | play earn singh seventh victory year no1 world... | p l a y e a r n s i n g h s e v e n t h ... |
| 6454 | sport | Hamm Goes for More Gold Amid Controversy | Despite the controversy surrounding his gold m... | Hamm Goes for More Gold Amid Controversy Despi... | 0 | [hamm, go, gold, amid, controversy, despite, c... | hamm go gold amid controversy despite controve... | h a m m g o g o l d a m i d c o n t r ... |
| 30223 | business | Symantec Buys Security Consulting Pioneer stake | Symantec Corp. on Thursday announced that is a... | Symantec Buys Security Consulting Pioneer stak... | 1 | [symantec, buy, security, consulting, pioneer,... | symantec buy security consulting pioneer stake... | s y m a n t e c b u y s e c u r i t y c ... |
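
The spaced-out letters in the `preprocessed_text` column come from how `str.join` treats its argument: given a list it joins the list items, but given a string it iterates over the individual characters. A tiny sketch with made-up tokens:

```python
# Hypothetical token list, only for illustration
tokens = ["high", "court"]

# Joining a list of tokens gives the expected text
joined_once = " ".join(tokens)

# Joining the resulting string again iterates over its characters
joined_twice = " ".join(joined_once)

print(repr(joined_once))
print(repr(joined_twice))
```

So one join over the token list is enough to get the space-separated text that a word-level vectorizer expects.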


In [16]:
##TODO vectorize the pre-processed text using CountVectorizer
corpus = df['preprocessed_text'].values
vectorizer = CountVectorizer(analyzer='char')
X_CV = vectorizer.fit_transform(corpus)
X_CV = X_CV.toarray()
# toarray() converts the sparse count matrix into a dense NumPy array
vectorizer.get_feature_names_out()
X_CV.shape

(10000, 55)

In [17]:
# get the class labels
# we only predict the business label, so this is binary classification
y_lables = df['business'].values
#y_lables = y_lables.reshape(10000,1, 1)
y_lables

array([0, 0, 1, ..., 0, 1, 0])

In [19]:
#reshaping
#X_CV_train = X_CV.reshape(10000, 1, 54, 1)
X_CV_train = np.expand_dims(X_CV, -1)
#x_test = np.expand_dims(x_test, -1)

In [20]:
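# one sigmoid output unit for the binary business label
num_classes = 1
input_shape = (55,)  # one input per vectorizer feature; the trailing comma makes this a tuple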

In [21]:
print(y_lables.shape)
print(X_CV_train.shape)

(10000,)
(10000, 55, 1)
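
The shapes above show the extra trailing axis added by `np.expand_dims`. Fully connected layers typically take a 2-D array of shape (samples, features), while a trailing channel axis is mostly used for convolutional layers. A small NumPy sketch of adding and removing that axis, using a made-up array `toy_counts`:

```python
import numpy as np

# Hypothetical count matrix: 4 samples, 55 features
toy_counts = np.zeros((4, 55))

# expand_dims appends an axis of length 1 at the end
with_channel = np.expand_dims(toy_counts, -1)

# squeeze removes it again and restores the 2-D layout
restored = np.squeeze(with_channel, axis=-1)

print(with_channel.shape, restored.shape)
```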


Your goal here is to use features from the vectorized text to predict whether the snippet is from a business article.

In [22]:
from keras.models import Sequential, Input
from keras.layers import Dense
from keras.layers import Dropout
from keras.callbacks import EarlyStopping

## TODO build a MLP model with at least 2 hidden layers with ReLU activation, followed 
#by dropout and an output layer with sigmoid activation
model = Sequential([
    Input(shape=input_shape),
    Dense(32, activation='relu'),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='sigmoid'),
])
## TODO compile the model
print(model.summary())

## TODO fit the model using early stopping to predict the business label
callback = EarlyStopping(monitor='loss', patience=3)
batch_size = 100
epochs = 10

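# note: "categorical_crossentropy" expects one-hot targets over several classes; for a single
# sigmoid output, "binary_crossentropy" is the matching loss
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])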

model.fit(X_CV_train, y_lables, batch_size=batch_size, epochs=epochs, validation_split=0.1,
         callbacks=[callback])

2023-03-23 10:54:06.945321: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                1792      
                                                                 
 dense_1 (Dense)             (None, 64)                2112      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 3,969
Trainable params: 3,969
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<keras.callbacks.History at 0x7f97f6100400>

In [23]:
#not doing any predictions on other parts of the dataset, but that didn't seem to be part of the task
#accuracy is relatively low but could be improved by changing the network
#also, skipping the last part since this took a little while and we're allowed to skip one part
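
One likely reason for the short training run and the low accuracy is the loss function. The NumPy sketch below re-implements both losses by hand to show the difference for a single sigmoid output. It assumes that categorical cross-entropy first rescales the predictions so that they sum to 1 across the last axis, which is how the Keras backend is understood to behave but is worth checking against your installed version. All function and variable names here are hypothetical.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    # Penalizes both kinds of error: low p for label 1 and high p for label 0
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def categorical_crossentropy(y_true, p, eps=1e-7):
    # Assumed behaviour: predictions are rescaled to sum to 1 per sample
    p = p / np.sum(p, axis=-1, keepdims=True)
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-np.sum(y_true * np.log(p), axis=-1)))

# Made-up labels and sigmoid outputs with shape (samples, 1); most predictions are poor
toy_labels = np.array([[0.0], [1.0], [1.0], [0.0]])
toy_probs = np.array([[0.9], [0.2], [0.6], [0.1]])

bce = binary_crossentropy(toy_labels, toy_probs)
cce = categorical_crossentropy(toy_labels, toy_probs)

print("binary:", bce, "categorical:", cce)
```

With only one output unit, the rescaled prediction is always 1, so the categorical loss stays near zero no matter how wrong the predictions are. A loss that never changes gives the optimizer nothing to learn from and lets early stopping on `loss` trigger after a few epochs.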

## Autoencoders

In [None]:
from keras import backend as K

def r2(y_true, y_pred):
    SS_res =  K.sum(K.square( y_true-y_pred )) 
    SS_tot = K.sum(K.square( y_true - K.mean(y_true) ) ) 
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

##TODO build a simple autoencoder with two compression layers and two reconstruction layers using ReLU
##TODO compile and fit the model minimizing "mean_squared_error"
##report r_squared during training (the function r2 defined above)


In [None]:
import keras

##TODO compress the vectorized text (X.todense())