In this assignment, you will train and test linear models using sklearn for text classification. You will use the dataset given here: http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz

The README file is available here: http://www.cs.cornell.edu/people/pabo/movie-review-data/scaledata.README.1.0.txt

In [1]:
!wget -nv "http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz"

2020-11-06 06:14:25 URL:http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz [4029756/4029756] -> "scale_data.tar.gz" [1]


In [2]:
!tar -xvf scale_data.tar.gz

scaledata.README.1.0.txt
scaledata/Dennis+Schwartz/
scaledata/Dennis+Schwartz/subj.Dennis+Schwartz
scaledata/Dennis+Schwartz/id.Dennis+Schwartz
scaledata/Dennis+Schwartz/rating.Dennis+Schwartz
scaledata/Dennis+Schwartz/label.3class.Dennis+Schwartz
scaledata/Dennis+Schwartz/label.4class.Dennis+Schwartz
scaledata/James+Berardinelli/
scaledata/James+Berardinelli/subj.James+Berardinelli
scaledata/James+Berardinelli/id.James+Berardinelli
scaledata/James+Berardinelli/rating.James+Berardinelli
scaledata/James+Berardinelli/label.3class.James+Berardinelli
scaledata/James+Berardinelli/label.4class.James+Berardinelli
scaledata/Scott+Renshaw/
scaledata/Scott+Renshaw/subj.Scott+Renshaw
scaledata/Scott+Renshaw/id.Scott+Renshaw
scaledata/Scott+Renshaw/rating.Scott+Renshaw
scaledata/Scott+Renshaw/label.3class.Scott+Renshaw
scaledata/Scott+Renshaw/label.4class.Scott+Renshaw
scaledata/Steve+Rhodes/
scaledata/Steve+Rhodes/subj.Steve+Rhodes
scaledata/Steve+Rhodes/id.Steve+Rhodes
scaledata/Steve+Rhodes/rat

As in the other assignments, create a single dataset for classification based on the IDs available. For example, for one author the ID file is "id.Dennis+Schwartz", the data file is "subj.Dennis+Schwartz", and the ratings file is "rating.Dennis+Schwartz". The ratings are converted to 3 classes as described in the README file.

In [3]:
import pandas as pd
import numpy as np
import os

In [4]:
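def read_data_for_author(author_name, base_dir='/content/scaledata'):
  # Returns (labels, reviews) for one author. The code assumes the two files
  # are line-aligned: line i of each file refers to the same review.
  author_dir = os.path.join(base_dir, author_name)

  # 3-class labels, one per line (stored as floats, matching the outputs shown later)
  with open(os.path.join(author_dir, 'label.3class.' + author_name), 'r') as fp:
    au_labels = np.array([int(line.strip()) for line in fp], dtype=float)

  # Review text, one review per line
  with open(os.path.join(author_dir, 'subj.' + author_name), 'r') as fp:
    au_subj = np.array([line.strip() for line in fp])

  return au_labels, au_subj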

In [5]:
authors = sorted(os.listdir('/content/scaledata'))

frames = []
for author in authors:
  au_labels, au_subj = read_data_for_author(author)

  # Note: 'id' holds the author name here, not the per-review IDs from the id.<author> file
  frames.append(pd.DataFrame({
    'id': author,
    'label_3': au_labels,
    'subj': au_subj
  }))

# DataFrame.append was removed in pandas 2.0; concatenate the per-author frames once instead
dataset = pd.concat(frames, ignore_index=True)


In [6]:
type(dataset['subj'])

pandas.core.series.Series
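
The loading cells above store the author name in the 'id' column rather than the per-review IDs. The following is a minimal sketch of one way to also read the "id.<author>" file, assuming it has one ID per line and is line-aligned with the label and review files. The helper names `load_author` and `load_all`, the extra `author` column, and the tiny synthetic directory tree are hypothetical and used only for illustration.

```python
import os
import tempfile

import pandas as pd


def load_author(base_dir, author):
    """Read the id, 3-class label, and review text files for one author."""
    def read_lines(prefix):
        path = os.path.join(base_dir, author, f"{prefix}.{author}")
        with open(path, "r") as fp:
            return [line.strip() for line in fp]

    return pd.DataFrame({
        "author": author,                       # scalar, broadcast to every row
        "id": read_lines("id"),
        "label_3": [int(x) for x in read_lines("label.3class")],
        "subj": read_lines("subj"),
    })


def load_all(base_dir):
    """Concatenate the per-author frames into a single dataset."""
    authors = sorted(
        name for name in os.listdir(base_dir)
        if os.path.isdir(os.path.join(base_dir, name))
    )
    return pd.concat([load_author(base_dir, a) for a in authors], ignore_index=True)


# Synthetic stand-in for the scaledata/ directory (not the real data)
fake_rows = {
    "A+B": [("1", "0", "bad film"), ("2", "2", "great film")],
    "C+D": [("7", "1", "average film")],
}
with tempfile.TemporaryDirectory() as tmp:
    for author, rows in fake_rows.items():
        os.makedirs(os.path.join(tmp, author))
        for prefix, col in (("id", 0), ("label.3class", 1), ("subj", 2)):
            with open(os.path.join(tmp, author, f"{prefix}.{author}"), "w") as fp:
                fp.write("\n".join(r[col] for r in rows) + "\n")
    demo = load_all(tmp)

print(demo)
```

On the real data, `load_all('/content/scaledata')` would be the corresponding call, provided the directory contains only the author folders.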

Split the data into training and test datasets (80/20 ratio). Are the labels balanced between training and testing datasets?

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train,X_test,y_train,y_test = train_test_split(dataset['subj'], dataset['label_3'], test_size=0.20)
print(type(X_train))

X_train, X_test = list(X_train), list(X_test)
print(X_train[:5])

<class 'pandas.core.series.Series'>


In [10]:
y_train

4574    2.0
1558    1.0
2715    1.0
4971    2.0
4831    2.0
       ... 
2261    2.0
1264    1.0
1122    0.0
988     2.0
1233    1.0
Name: label_3, Length: 4004, dtype: float64
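
The split above is purely random, so class proportions in the two parts can drift apart. One way to check the balance, and to enforce it if desired, is sketched below. The toy DataFrame is a hypothetical stand-in for `dataset`, and `stratify` and `random_state` are options that the cell above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for dataset[['subj', 'label_3']]
toy = pd.DataFrame({
    "subj": [f"review {i}" for i in range(100)],
    "label_3": [0] * 25 + [1] * 35 + [2] * 40,
})

# stratify keeps the class proportions (nearly) equal in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["subj"], toy["label_3"],
    test_size=0.20,
    stratify=toy["label_3"],
    random_state=0,
)

# Share of each class in the training and test parts, side by side
balance = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(balance)
```

Running the same `value_counts(normalize=True)` comparison on the unstratified `y_train` and `y_test` would answer the balance question for the split actually used here.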

You can vectorize the documents using a vectorizer of your choice. CountVectorizer is the simplest one that can be used. Set the feature values to binary so that only presence or absence matters.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
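# Note: with a callable analyzer (str.split), scikit-learn skips its own preprocessing,
# so stop_words='english' has no effect and tokens are the raw whitespace-separated strings.
cv = CountVectorizer(analyzer=str.split,stop_words='english',max_features=10000, binary=True)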

In [14]:
#X_train
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
#cv.fit_transform(dataset['subj'])
X_test

<1002x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 206265 stored elements in Compressed Sparse Row format>
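
To make the effect of `binary=True` concrete, here is a small sketch on two made-up documents (the documents are illustrative only and use the default analyzer rather than `str.split`). With counts, a repeated word contributes its frequency; with binary features, only presence or absence is recorded.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good movie", "bad movie"]

counts = CountVectorizer().fit(docs)
binary = CountVectorizer(binary=True).fit(docs)

# Column order follows the learned vocabulary indices
vocab = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
count_matrix = counts.transform(docs).toarray()
binary_matrix = binary.transform(docs).toarray()

print(vocab)          # ['bad', 'good', 'movie']
print(count_matrix)   # "good" is counted twice in the first document
print(binary_matrix)  # every non-zero count is clipped to 1
```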

In [16]:
y_train = [int(x) for x in y_train]
y_train

[2, 1, 1, 2, 2, 1, 2, 0, 0, 2, ...]


In [None]:
#vocab = cv.get_feature_names()  # in scikit-learn >= 1.0 use cv.get_feature_names_out(); the old name was removed in 1.2

In [None]:
# def get_feature_list(doc):
#   '''
#   Returns a list of features with 0/1 for a document if the word present in the vocab
#   '''
#   feature_list = []
#   for word in vocab:
#     if word in doc:
#       feature_list.append(1)
#     else:
#       feature_list.append(0)

#   return feature_list

In [None]:
#X_train_features = X_train.apply(get_feature_list)

In [None]:
#X_test_features = X_test.apply(get_feature_list)

In [None]:
#X_train_features

4341    [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
4399    [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
335     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...
4538    [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, ...
1392    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, ...
                              ...                        
4861    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
4968    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, ...
753     [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, ...
4362    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, ...
3950    [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ...
Name: subj, Length: 4004, dtype: object

In [None]:
#y_train

4341    2.0
4399    2.0
335     0.0
4538    2.0
1392    1.0
       ... 
4861    2.0
4968    2.0
753     1.0
4362    2.0
3950    1.0
Name: label_3, Length: 4004, dtype: float64

What is the performance of the trained classifier on the test dataset?
Which regularization performs best on the test dataset: L1 or L2? Use the default settings.

In [17]:
from sklearn.linear_model import SGDClassifier

In [35]:
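classifier_1 = SGDClassifier(penalty='l1')
classifier_1.fit(X_train,y_train)

#classifier_1.fit(np.array(X_train_features),np.array(y_train))

# Accuracy on the training data; the value shown below equals 4001/4004 for the 4004 training documents
classifier_1.score(X_train,y_train)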

0.9992507492507493

In [19]:
y_pred_1 = classifier_1.predict(X_test)

In [20]:
from sklearn.metrics import classification_report

In [22]:
print(classification_report(y_test,y_pred_1))

              precision    recall  f1-score   support

         0.0       0.64      0.59      0.61       254
         1.0       0.55      0.58      0.56       356
         2.0       0.75      0.76      0.75       392

    accuracy                           0.65      1002
   macro avg       0.65      0.64      0.64      1002
weighted avg       0.65      0.65      0.65      1002



In [23]:
classifier_2 = SGDClassifier(penalty='l2')
classifier_2.fit(X_train,y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [24]:
y_pred_2 = classifier_2.predict(X_test)

In [25]:
print(classification_report(y_test,y_pred_2))

              precision    recall  f1-score   support

         0.0       0.66      0.52      0.58       254
         1.0       0.57      0.67      0.62       356
         2.0       0.76      0.74      0.75       392

    accuracy                           0.66      1002
   macro avg       0.66      0.64      0.65      1002
weighted avg       0.67      0.66      0.66      1002
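
The two reports above differ by about one point of accuracy (0.65 with L1, 0.66 with L2), and both classifiers were created without a fixed `random_state`, so a gap that small may be within run-to-run noise. The sketch below shows one way to compare the penalties more carefully by repeating the fit over several seeds. It uses synthetic data from `make_classification` as a stand-in for the vectorized reviews, so the numbers it prints say nothing about the movie-review data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (illustrative only)
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

results = {}
for penalty in ("l1", "l2"):
    scores, nonzero = [], []
    for seed in range(5):
        clf = SGDClassifier(penalty=penalty, random_state=seed).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
        nonzero.append(np.count_nonzero(clf.coef_))  # L1 tends to zero out weights
    results[penalty] = {
        "mean_acc": float(np.mean(scores)),
        "std_acc": float(np.std(scores)),
        "mean_nonzero_weights": float(np.mean(nonzero)),
    }

for penalty, stats in results.items():
    print(penalty, stats)
```

If the standard deviation across seeds is comparable to the difference between the two means, the data do not clearly favor either penalty.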



We will use cross-validation to find the best value of the regularization parameter α. What is the best value of α? Is regularization important? Use 5-fold cross-validation, and note that it should be run ONLY on the training dataset. You can read more here: https://scikit-learn.org/stable/modules/grid_search.html

In [38]:
classifier_1.get_params  # without parentheses this displays the bound method; call get_params() to get the parameter dict

<bound method BaseEstimator.get_params of SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l1',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)>

In [26]:
from sklearn.model_selection import cross_val_score

In [36]:
cross_val_score(classifier_1,X_train,y_train,cv=5)

array([0.62047441, 0.64419476, 0.62047441, 0.64419476, 0.64      ])

In [37]:
cross_val_score(classifier_2,X_train,y_train,cv=5)

array([0.65168539, 0.65168539, 0.62546816, 0.61922597, 0.63875   ])
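
The `cross_val_score` calls above evaluate only the default α (0.0001); they do not search over it. A minimal sketch of such a search with `GridSearchCV` follows. The grid of α values is a hypothetical choice, and the data are synthetic stand-ins for `X_train` and `y_train`, so the selected value here is not the answer for the review data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the vectorized reviews
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Hypothetical grid, spaced logarithmically around the default alpha=1e-4
param_grid = {"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}

# 5-fold cross-validation on the training part only
search = GridSearchCV(SGDClassifier(penalty="l2", random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best alpha:", search.best_params_["alpha"])
print("mean CV accuracy:", search.best_score_)

# The held-out test set is touched once, after alpha has been chosen
test_accuracy = search.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Comparing `search.cv_results_["mean_test_score"]` across the grid gives a sense of how much the choice of α matters, which speaks to whether regularization is important.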

Choose a different loss function; the default is "hinge". Train and test a classifier with the new loss and compare it against the default one. Is it better or worse?

In [29]:
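# 'log' is the logistic-regression loss; scikit-learn 1.1 renamed it to 'log_loss' and 1.3 removed the old name
classifier_1 = SGDClassifier(loss = 'log',penalty='l1')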
classifier_1.fit(X_train,y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l1', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [30]:
y_pred_1 = classifier_1.predict(X_test)

In [31]:
print(classification_report(y_test,y_pred_1))

              precision    recall  f1-score   support

         0.0       0.62      0.59      0.61       254
         1.0       0.57      0.57      0.57       356
         2.0       0.74      0.77      0.75       392

    accuracy                           0.65      1002
   macro avg       0.64      0.64      0.64      1002
weighted avg       0.65      0.65      0.65      1002



In [32]:
# As above: on scikit-learn >= 1.3 pass loss='log_loss' instead of 'log'
classifier_2 = SGDClassifier(loss = 'log',penalty='l2')
classifier_2.fit(X_train,y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [33]:
y_pred_2 = classifier_2.predict(X_test)

In [34]:
print(classification_report(y_test,y_pred_2))

              precision    recall  f1-score   support

         0.0       0.63      0.54      0.58       254
         1.0       0.54      0.59      0.57       356
         2.0       0.74      0.75      0.75       392

    accuracy                           0.64      1002
   macro avg       0.64      0.63      0.63      1002
weighted avg       0.64      0.64      0.64      1002

