# **Project Description**

Classification is one of the most common tasks in applied machine learning.
Text in the form of blogs, posts, and articles is written every second, and predicting information about a writer from the text alone is a challenge.
We are going to create a classifier that predicts multiple attributes of the author of a given text.
We have framed this as a multi-label classification problem.

# **Dataset**

**Blog Authorship Corpus**

Over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)


All bloggers included in the corpus fall into one of three age groups: **8240 "10s" blogs (ages 13-17), 8086 "20s" blogs (ages 23-27), and 2994 "30s" blogs (ages 33-47)**.

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

https://www.kaggle.com/rtatman/blog-authorship-corpus

In [1]:
# Imports
import numpy as np 
import pandas as pd
import os
import re

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

#Import sklearn Library
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,f1_score, accuracy_score, recall_score, precision_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# 1. Load the dataset

In [2]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [3]:
corpus_df = pd.read_csv("/gdrive/My Drive/Colab Notebooks/R8/Lab/Project_Statistical_NLP/blog-authorship-corpus/blogtext.csv",nrows=10000)
corpus_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [4]:
corpus_df.shape

(10000, 7)

# 2. Preprocess rows of the “text” column

a. Remove unwanted characters

b. Convert text to lowercase

c. Remove unwanted spaces

d. Remove stopwords
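The four steps above can be sketched as a single helper. This is a minimal illustration, not the notebook's exact code: the function name `clean_text` and the tiny `STOPWORDS` set are assumptions made here so the example runs without downloading NLTK data, whereas the notebook itself uses `stopwords.words('english')`.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook uses NLTK's full English stopword list instead.
STOPWORDS = {"has", "been", "and", "of", "to", "the", "it", "i", "have", "our", "now"}

def clean_text(text, stopwords=STOPWORDS):
    """Apply steps a-d to one string and return the cleaned string."""
    text = re.sub(r"[^a-zA-Z]", " ", text)       # a. drop non-letter characters
    text = text.lower()                          # b. lowercase
    text = re.sub(r"\s+", " ", text).strip()     # c. collapse and trim whitespace
    kept = [word for word in text.split() if word not in stopwords]  # d. stopwords
    return " ".join(kept)

sample = "   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)  "
cleaned = clean_text(sample)
print(cleaned)
```

Keeping the stopwords in a `set` makes each membership check constant time, which matters when the function is applied to every post in the corpus.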

In [5]:
corpus_df['text'][0]

'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

**Removing unwanted character**

In [6]:
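#corpus_df['text'] = corpus_df['text'].str.replace('[^a-zA-Z0-9]', ' ', regex=True)
# Replace every non-letter character with a space (regex=True is required in pandas >= 2.0)
corpus_df['text'] = corpus_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)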
corpus_df['text'][0]

'           Info has been found          pages  and     MB of  pdf files  Now i have to wait untill our team leader has processed it and learns html          '

**Converting to lower case**

In [7]:
corpus_df['text'] = corpus_df['text'].str.lower()
corpus_df["text"].loc[0]

'           info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html          '

**Remove unwanted space character**

In [8]:
corpus_df['text'] = corpus_df['text'].str.strip()
corpus_df['text'] = corpus_df['text'].str.replace(r'\s+', ' ', regex=True)
corpus_df['text'][0]

'info has been found pages and mb of pdf files now i have to wait untill our team leader has processed it and learns html'

**Remove stop words**

In [0]:
#Split 'text' column for removing stopword
corpus_df["text"] = corpus_df["text"].str.split()

In [10]:
print(corpus_df['text'][0])

['info', 'has', 'been', 'found', 'pages', 'and', 'mb', 'of', 'pdf', 'files', 'now', 'i', 'have', 'to', 'wait', 'untill', 'our', 'team', 'leader', 'has', 'processed', 'it', 'and', 'learns', 'html']


In [0]:
stopword = stopwords.words('english')

def removestopwords(sentence):
  stopwordremoved = [word for word in sentence if word not in stopword]
  return(" ".join(stopwordremoved))

In [0]:
# Loop over each text
corpus_text_len = len(corpus_df['text'])
cleaner_corpus_df = []

for i in range( 0, corpus_text_len):
    cleaner_corpus_df.append(removestopwords(corpus_df["text"][i]))

In [13]:
print(cleaner_corpus_df[10])

ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea no

In [0]:
corpus_df['text'] = cleaner_corpus_df

**Lemmatization**

In [0]:
wordTokenizer = nltk.tokenize.WhitespaceTokenizer()
wordLemmatizer = nltk.stem.WordNetLemmatizer()

def text_lemmatizer(text):
  lemm = [wordLemmatizer.lemmatize(word) for word in wordTokenizer.tokenize(text)]
  return(" ".join(lemm))

corpus_df['text'] = corpus_df['text'].apply(text_lemmatizer)

# 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence.

a. Label columns to merge: “gender”, “age”, “topic”, “sign”

b. After completing the previous step, there should be only two columns in your data frame, i.e. “text” and “labels”.
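As a rough sketch of the merge described above, the label columns can also be combined into a list per row rather than a comma-joined string, which avoids having to split the string again later. The two-row frame below is made-up sample data, and storing lists in the `labels` column is an alternative shown for illustration, not what the notebook cells do.

```python
import pandas as pd

# Made-up sample rows that mimic the corpus columns (illustration only).
df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 25],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Pisces"],
    "text": ["testing testing", "hello world"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast every label column to str, then collect each row's values into a list.
rows_as_lists = df[label_cols].astype(str).values.tolist()
df["labels"] = pd.Series(rows_as_lists, index=df.index)

# Keep only the two columns the task asks for.
df = df[["text", "labels"]]
print(df)
```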

In [16]:
corpus_df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [17]:
corpus_df['age'] = corpus_df['age'].astype(str)
corpus_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
id        10000 non-null int64
gender    10000 non-null object
age       10000 non-null object
topic     10000 non-null object
sign      10000 non-null object
date      10000 non-null object
text      10000 non-null object
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


In [18]:
corpus_df['labels'] = corpus_df[['gender','age','topic','sign']].apply(lambda x: ', '.join(x), axis = 1)
corpus_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
0,2059027,male,15,Student,Leo,"14,May,2004",info found page mb pdf file wait untill team l...,"male, 15, Student, Leo"
1,2059027,male,15,Student,Leo,"13,May,2004",team member drewes van der laag urllink mail r...,"male, 15, Student, Leo"
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...,"male, 15, Student, Leo"
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing,"male, 15, Student, Leo"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture url popups mean s...,"male, 33, InvestmentBanking, Aquarius"


In [19]:
corpus_df.drop(['gender','age','topic','sign', 'id', 'date'], axis=1,inplace=True)
corpus_df.head()

Unnamed: 0,text,labels
0,info found page mb pdf file wait untill team l...,"male, 15, Student, Leo"
1,team member drewes van der laag urllink mail r...,"male, 15, Student, Leo"
2,het kader van kernfusie op aarde maak je eigen...,"male, 15, Student, Leo"
3,testing testing,"male, 15, Student, Leo"
4,thanks yahoo toolbar capture url popups mean s...,"male, 33, InvestmentBanking, Aquarius"


In [20]:
print("Shape after merging columns together: ",corpus_df.shape)

Shape after merging columns together:  (10000, 2)


# 4. Separate features and labels, and split the data into training and testing

In [0]:
Features = corpus_df['text']
Labels = corpus_df['labels']

X_train, X_test, Y_train, Y_test = train_test_split(Features, Labels, test_size = 0.30, random_state = 42)

In [22]:
print("X_train shape: ", X_train.shape, "Y_train shape: ", Y_train.shape)
print("X_test shape: ", X_test.shape, "Y_test shape: ", Y_test.shape)

X_train shape:  (7000,) Y_train shape:  (7000,)
X_test shape:  (3000,) Y_test shape:  (3000,)


# 5. Vectorize the features

**a. Create a Bag of Words using count vectorizer**

> i. Use ngram_range=(1, 2)


> ii. Vectorize training and testing features


In [0]:
Vectorizer = CountVectorizer(min_df=2, ngram_range=(1,2), stop_words="english", lowercase=True)
X_train = Vectorizer.fit_transform(X_train)
X_test = Vectorizer.transform(X_test)
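To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch on two made-up sentences. It only illustrates which features a `CountVectorizer` produces when both unigrams and bigrams are requested; the documents and variable names are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (made up for illustration).
docs = ["team leader learns html", "team leader waits"]

# Unigrams and bigrams, as in the notebook, but without min_df or stop words
# so that every n-gram from the toy documents is kept.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index in X.
features = sorted(vec.vocabulary_)
print(features)
print(X.toarray())
```

The vocabulary contains single words such as `team` as well as word pairs such as `team leader`, which is why the feature count grows quickly on the real corpus.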

**b. Print the term-document matrix**

In [24]:
X_train

<7000x60158 sparse matrix of type '<class 'numpy.int64'>'
	with 501472 stored elements in Compressed Sparse Row format>

In [25]:
X_test

<3000x60158 sparse matrix of type '<class 'numpy.int64'>'
	with 186682 stored elements in Compressed Sparse Row format>

# 6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label.

In [26]:
Vectorizer.vocabulary_

{'finished': 16161,
 'buddha': 5461,
 'little': 29027,
 'finger': 16127,
 'viking': 55854,
 'jan': 24877,
 'want': 56364,
 'know': 26028,
 'men': 32033,
 'white': 57962,
 'hat': 21356,
 'writing': 59250,
 'book': 4686,
 'american': 1260,
 'fiction': 15863,
 'january': 24890,
 'man': 31085,
 'novel': 35482,
 'psycho': 40408,
 'taken': 50085,
 'yesterday': 59918,
 'came': 5976,
 'rusty': 43715,
 'winter': 58197,
 'colored': 8117,
 'pouring': 39537,
 'floor': 16383,
 'counted': 9356,
 'spot': 48006,
 'set': 45541,
 'plant': 38730,
 'leaf': 27106,
 'match': 31433,
 'green': 20043,
 'dream': 12613,
 'villager': 55858,
 'spoke': 47964,
 'water': 56920,
 'face': 14833,
 'scratched': 44915,
 'demon': 11326,
 'handed': 20926,
 'crow': 9835,
 'beak': 3194,
 'outside': 36595,
 'wind': 58135,
 'fierce': 15885,
 'snow': 47068,
 'dusted': 13016,
 'street': 48954,
 'fairy': 14994,
 'dust': 13012,
 'blocking': 4303,
 'view': 55828,
 'town': 53667,
 'hall': 20824,
 'palace': 36880,
 'night': 35044,
 'l

In [27]:
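# NOTE: these are the vectorizer's vocabulary terms (unigrams and bigrams from the text),
# not the label values from the "labels" column.
class_labels = list(Vectorizer.vocabulary_.keys())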

print("Number of Classes: ",len(class_labels))

Number of Classes:  60158
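The heading of this step asks for a dictionary that maps each label name to its total count. A minimal sketch of that, assuming the labels are available as lists of strings (the `labels` sample below is made up for illustration and is not taken from the notebook's run):

```python
from collections import Counter

# Made-up label lists in the same shape as the notebook's per-row labels.
labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["female", "25", "indUnk", "Pisces"],
]

# Flatten the per-row lists and count how often each label occurs.
label_counts = dict(Counter(label for row in labels for label in row))
print(label_counts)
```

The keys of such a dictionary are the label values themselves, so they could also serve as the class list for the label binarizer in the next step.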


# 7. Transform the labels
As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that the prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
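A small sketch of the binary mask on made-up labels may help here. In this illustration the binarizer learns its classes from the label lists themselves (no `classes` argument is passed), which is one possible setup and an assumption of this example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists for two posts (illustration only).
train_labels = [
    ["male", "15", "Student", "Leo"],
    ["female", "25", "indUnk", "Pisces"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)   # one column per distinct label

classes = list(mlb.classes_)          # column order of the mask
recovered = mlb.inverse_transform(Y)  # back from mask to label tuples

print(classes)
print(Y)
print(recovered)
```

Each row of `Y` has a 1 in the columns of the labels that apply to that post, and `inverse_transform` maps a mask back to the label names.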

**a. Convert your train and test labels using MultiLabelBinarizer**

In [28]:
Labels

0                      male, 15, Student, Leo
1                      male, 15, Student, Leo
2                      male, 15, Student, Leo
3                      male, 15, Student, Leo
4       male, 33, InvestmentBanking, Aquarius
                        ...                  
9995               female, 25, indUnk, Pisces
9996               female, 25, indUnk, Pisces
9997               female, 25, indUnk, Pisces
9998               female, 25, indUnk, Pisces
9999               female, 25, indUnk, Pisces
Name: labels, Length: 10000, dtype: object

In [0]:
Labels = [["".join(re.findall(r'\w', word)) for word in lst] for lst in [label.split(',') for label in Labels]]

In [30]:
Labels[:5]

[['male', '15', 'Student', 'Leo'],
 ['male', '15', 'Student', 'Leo'],
 ['male', '15', 'Student', 'Leo'],
 ['male', '15', 'Student', 'Leo'],
 ['male', '33', 'InvestmentBanking', 'Aquarius']]

In [31]:
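# NOTE: class_labels holds vocabulary terms, so any label value that is not also a
# vocabulary term is ignored by transform() with an "unknown class" warning.
multiLabelBinarizer = MultiLabelBinarizer(classes=class_labels)
Label_trans = multiLabelBinarizer.fit(Labels)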
Label_trans

MultiLabelBinarizer(classes=['finished', 'buddha', 'little', 'finger', 'viking',
                             'jan', 'want', 'know', 'men', 'white', 'hat',
                             'writing', 'book', 'american', 'fiction',
                             'january', 'man', 'novel', 'psycho', 'taken',
                             'yesterday', 'came', 'rusty', 'winter', 'colored',
                             'pouring', 'floor', 'counted', 'spot', 'set', ...],
                    sparse_output=False)

In [32]:
# Transform Y_train
Y_train = [["".join(re.findall(r'\w', word)) for word in lst] for lst in [label.split(',') for label in Y_train]]
Y_train_trans = multiLabelBinarizer.transform(Y_train)

  .format(sorted(unknown, key=str)))


In [33]:
print(Y_train[10])
print(Y_train_trans[10])

['female', '27', 'indUnk', 'Taurus']
[0 0 0 ... 0 0 0]


In [34]:
Y_train_trans.shape

(7000, 60158)

In [35]:
#Transform Y_test
Y_test = [["".join(re.findall(r'\w', word)) for word in lst] for lst in [label.split(',') for label in Y_test]]
Y_test_trans = multiLabelBinarizer.transform(Y_test)

  .format(sorted(unknown, key=str)))


In [36]:
print(Y_test[30])
print(Y_test_trans[30])

['female', '24', 'indUnk', 'Scorpio']
[0 0 0 ... 0 0 0]


In [37]:
multiLabelBinarizer.classes_

array(['finished', 'buddha', 'little', ..., 'david ortiz', 'like prince',
       'strumming'], dtype=object)

In [38]:
print("Total number of classes: ", len(multiLabelBinarizer.classes_))

Total number of classes:  60158


# 8. Choose a classifier
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (one per tag) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

**a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label**

In [0]:
classifier = LogisticRegression(solver = 'lbfgs',max_iter = 1000)
classifier = OneVsRestClassifier(classifier)
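To show what the wrapper does, here is a sketch on a tiny made-up dataset with two binary labels. The feature values and labels are assumptions chosen so that each label depends on exactly one feature. `OneVsRestClassifier` fits one `LogisticRegression` per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: label 0 is on when feature 0 is positive,
# label 1 is on when feature 1 is positive (made up for illustration).
X = np.array([[0, 0], [0, 3], [3, 0], [3, 3]] * 2)
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X, Y)

# One fitted binary classifier per label column.
n_models = len(ovr.estimators_)
pred = ovr.predict(X)

print(n_models)
print(pred)
```

Because one model is trained per label column, training time grows with the number of columns in the label matrix.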

# 9. Fit the classifier, make predictions and get the accuracy

In [40]:
#Fit
classifier.fit(X_train,Y_train_trans)

Streaming output truncated to the last 5000 lines.
  str(classes[c]))

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

**a. Print the following**


> i. Accuracy score


> ii. F1 score


> iii. Average precision score


> iv. Average recall score
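The scores listed above behave differently on a multi-label indicator matrix, so a small sketch on made-up predictions may be useful. For multi-label input, `accuracy_score` is the share of rows where every label matches. Micro averaging pools all label decisions together, and macro averaging takes the unweighted mean over label columns. The arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Three label columns; the third never occurs in truth or prediction.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]])

subset_acc = accuracy_score(y_true, y_pred)  # rows predicted exactly right
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(subset_acc, micro_f1, macro_f1)
```

A column that never occurs contributes an F1 of 0 to the macro average, so a label matrix with many unused columns pulls the macro scores toward zero while leaving the micro scores unchanged.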

In [41]:
print("Train Accuracy:",classifier.score(X_train,Y_train_trans))

Train Accuracy: 0.9691428571428572


In [0]:
#Prediction
Y_pred = classifier.predict(X_test)

In [43]:
print("Test Accuracy:" + str(accuracy_score(Y_test_trans, Y_pred)))

Test Accuracy:0.7433333333333333


In [44]:
print("F1 Score: " + str(f1_score(Y_test_trans, Y_pred, average='micro')))
print("F1_macro Score: " + str(f1_score(Y_test_trans, Y_pred, average='macro')))

F1 Score: 0.7871514295799505
F1_macro Score: 2.5716170299975882e-05


  average, "true nor predicted", 'F-score is', len(true_sum)


In [45]:
print("Precision: " + str(precision_score(Y_test_trans, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(Y_test_trans, Y_pred, average='macro')))

Precision: 0.8364591147786947
Precision_macro: 2.771152937855678e-05


  _warn_prf(average, modifier, msg_start, len(result))


In [46]:
print("Recall: " + str(recall_score(Y_test_trans, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(Y_test_trans, Y_pred, average='macro')))

Recall: 0.7433333333333333
Recall_macro: 2.4142613967945156e-05


  _warn_prf(average, modifier, msg_start, len(result))


# 10. Print true label and predicted label for any five examples

In [0]:
# Apply inverse_transform to get Predicted label
Y_pred_inv = multiLabelBinarizer.inverse_transform(Y_pred)

In [0]:
#Apply inverse_transform to get Actual label
Y_test_trans_inv = multiLabelBinarizer.inverse_transform(Y_test_trans)

In [49]:
for i in range(5):
  print("Example: ",i)
  print("Predicted Label:",Y_pred_inv[i])
  print("Actual Label:",Y_test_trans_inv[i])
  print("Actual label before applying multiLabelBinarizer transformation :",Y_test[i])
  print("-----------------------------------------------------------------------------\n")

Example:  0
Predicted Label: ('male',)
Actual Label: ('male',)
Actual label before applying multiLabelBinarizer transformation : ['male', '23', 'Consulting', 'Taurus']
-----------------------------------------------------------------------------

Example:  1
Predicted Label: ('male',)
Actual Label: ('male',)
Actual label before applying multiLabelBinarizer transformation : ['male', '17', 'indUnk', 'Aquarius']
-----------------------------------------------------------------------------

Example:  2
Predicted Label: ('male',)
Actual Label: ('male',)
Actual label before applying multiLabelBinarizer transformation : ['male', '35', 'Technology', 'Aries']
-----------------------------------------------------------------------------

Example:  3
Predicted Label: ('female',)
Actual Label: ('female',)
Actual label before applying multiLabelBinarizer transformation : ['female', '23', 'Automotive', 'Aquarius']
-----------------------------------------------------------------------------

Example

# Conclusion

1. I used only 10,000 data points for model building because of computational limitations. I also tried 60k, 50k, 30k, and 20k rows, but Google Colab crashed every time.

2. I also applied lemmatization, but it had no noticeable impact on model generalisation.

3. While preprocessing (step 2) I removed all numbers, which helped model generalisation. If numbers are not removed, training and test accuracy are both lower. **Training accuracy was around 88% and test accuracy was around 34%.**

4. With numbers removed, training and test accuracy increase significantly. **Training accuracy was around 96% and test accuracy was around 74%. Model generalisation is better.**