# DOMAIN: Digital content management

# CONTEXT: Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

# DATA DESCRIPTION: Over 600,000 posts from more than 19 thousand bloggers The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
• 8240 "10s" blogs (ages 13-17),
• 8086 "20s" blogs(ages 23-27) and
• 2994 "30s" blogs (ages 33-47)
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label url link. Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

# PROJECT OBJECTIVE: The need is to build a NLP classifier which can use input text parameters to determine the label/s of of the blog.

# Steps and tasks: [ Total Score: 40 points]
1. Import and analyse the data set.
2. Perform data pre-processing on the data:
• Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.
• Target/label merger and transformation
• Train and test split
• Vectorisation, etc.
3. Design, train, tune and test the best text classifier.
4. Display and explain detail the classification report
5. Print the true vs predicted labels for any 5 entries from the dataset.
Hint: The aim here Is to import the text, process it such a way that it can be taken as an inout to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.

# Import and Analyse the dataset

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
import pandas as pd
import numpy as np
import warnings
import re
warnings.filterwarnings('ignore')

In [None]:
blog_df = pd.read_csv('/gdrive/MyDrive/Colab Notebooks/Project-NLP/Dataset - blogtext.csv')

In [None]:
blog_df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [None]:
blog_df.tail(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
681279,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, I could write some really ..."
681280,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, 'I have the second yeast i..."
681281,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, Your 'boyfriend' is fuckin..."
681282,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan: Just to clarify, I am as..."
681283,1713845,male,23,Student,Taurus,"01,July,2004","Hey everybody...and Susan, You might a..."


In [None]:
blog_df.shape

(681284, 7)

In [None]:
# drop id and date columns
blog_df.drop(['id','date'], axis =1, inplace = True)

# Perform data pre-processing on the data: • Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase. • Target/label merger and transformation • Train and test split • Vectorisation, etc.

In [None]:
df_new = blog_df[:100000]

In [None]:
df_new.shape

(340000, 5)

In [None]:
df_new['Clean_text'] = df_new['text'].apply(lambda x: re.sub("[^\w ]","",x))

In [None]:
df_new.head()

Unnamed: 0,gender,age,topic,sign,text,Clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",Info has been found 100 pages and ...
1,male,15,Student,Leo,These are the team members: Drewe...,These are the team members Drewes...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,In het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,Thanks to Yahoos Toolbar I can no...


In [None]:
df_new['Clean_text'] = df_new['Clean_text'].apply(lambda x: x.lower())

In [None]:
df_new.head()

Unnamed: 0,gender,age,topic,sign,text,Clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info has been found 100 pages and ...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members drewes...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoos toolbar i can no...


In [None]:
df_new['Clean_text'] = df_new['Clean_text'].apply(lambda x: x.strip())

In [None]:
df_new.head()

Unnamed: 0,gender,age,topic,sign,text,Clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info has been found 100 pages and 45 mb of pd...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members drewes van der la...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde maak je e...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoos toolbar i can now capture the...


In [None]:
print("Actual data=======> {}".format(df_new['text'][1]))



In [None]:
print("Cleaned data=======> {}".format(df_new['Clean_text'][1]))



In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))
df_new['Clean_text'] = df_new['Clean_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df_new.iloc[2][['text', 'Clean_text']].to_dict()

{'Clean_text': 'het kader van kernfusie op aarde maak je eigen waterstofbom build hbomb ascotttartarusuwaeduau andrew scott newsgroups rechumor subject build hbomb humorous date 7 feb 1994 074114 gmt organization university western australia original file dated 12th november 1990 seemed transcript seven days article poorly formatted corrupted added text examine microscope malleable like gold missing anyone full text please distribute responsible accuracy information converted html dionisioinfinetcom 111398 little spellchecking minor edits stolen urllink httpmyohiovoyagernetdionisiofunmownhbombhtml reformatted html validates xhtml 10 strict build hbomb making owning hbomb kind challenge real americans seek wants passive victim nuclear war little effort active participant bomb shelters losers wants huddle together underground eating canned spam winners want push button making hbomb big step nuclear assertiveness training called taking charge sure youll enjoy risks heady thrill playing nu

In [None]:
df_new.head()

Unnamed: 0,gender,age,topic,sign,text,Clean_text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found 100 pages 45 mb pdf files wait unti...
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoos toolbar capture urls popupswhich...


In [None]:
df_new['labels'] = df_new.apply(lambda col : [col['gender'],col['age'],col['topic'],col['sign']], axis=1)

In [None]:
df_new.head()

Unnamed: 0,gender,age,topic,sign,text,Clean_text,labels
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found 100 pages 45 mb pdf files wait unti...,"[male, 15, Student, Leo]"
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, InvestmentBanking, Aquarius]"


In [None]:
#drop  gender,age,topic & sign as they are already merged to labels column
df_model = df_new.drop(columns=['gender','age','topic','sign','text'], axis=1)

In [None]:
df_model.head(5)

Unnamed: 0,Clean_text,labels
0,info found 100 pages 45 mb pdf files wait unti...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, InvestmentBanking, Aquarius]"


In [None]:
X= df_model['Clean_text']
y = df_model['labels']

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=2,test_size = 0.2)

In [None]:
print(X_train.shape)
print(y_train.shape)

(272000,)
(272000,)


In [None]:
print(X_test.shape)
print(y_test.shape)

(68000,)
(68000,)


In [None]:
X_test

31026     worked details scene one today yesterday im ho...
186714    valley doom dear blog riding home tyler today ...
1584      youre depressed already urllink httpwwwacluorg...
339991                                 hahayou dum bass lol
198704    wanted remind guys im going gone thursday dont...
                                ...                        
313781    well im feeling much better today went home la...
267530    oh yeah think informed tell u sign guestbook w...
131897    congratulations kim getting tight ass hired ur...
17387     urllink td1 well 18 months ago favourite photo...
115318    everyone read everything carefully anything of...
Name: Clean_text, Length: 68000, dtype: object

In [None]:
# import Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
Cvect = CountVectorizer(ngram_range=(1,2))

In [None]:
Cvect.fit(X_train)

#Check the vocablury size
len(Cvect.vocabulary_)

12124762

In [None]:
Cvect.get_feature_names()[:100]

['00',
 '00 00',
 '00 01',
 '00 04',
 '00 0b',
 '00 100',
 '00 11',
 '00 120',
 '00 2822',
 '00 2b10',
 '00 46',
 '00 48',
 '00 69',
 '00 77',
 '00 7f',
 '00 a0',
 '00 became',
 '00 bloodalcohol',
 '00 c0',
 '00 chance',
 '00 commonality',
 '00 couple',
 '00 create',
 '00 credits',
 '00 crowd',
 '00 d0',
 '00 dad',
 '00 damarcus',
 '00 doesnt',
 '00 dont',
 '00 draw',
 '00 duke',
 '00 fleming',
 '00 floor',
 '00 flowto_serverestablishedclasstypeattemptedadmin',
 '00 following',
 '00 game',
 '00 games',
 '00 good',
 '00 hyped',
 '00 iraqis',
 '00 know',
 '00 laughed',
 '00 lil',
 '00 lon9',
 '00 mark',
 '00 match',
 '00 michael',
 '00 morning',
 '00 new',
 '00 number',
 '00 oh',
 '00 one',
 '00 pm',
 '00 rating',
 '00 records',
 '00 refused',
 '00 sad',
 '00 say',
 '00 seconds',
 '00 silver',
 '00 sixth',
 '00 sometimes',
 '00 soooooo',
 '00 sq',
 '00 steph',
 '00 tcr1',
 '00 tenth',
 '00 texashtgametable',
 '00 theres',
 '00 tie',
 '00 times',
 '00 tired',
 '00 trip',
 '00 type',
 '00 

In [None]:
X_train_ct = Cvect.transform(X_train)

In [None]:
X_test_ct = Cvect.transform(X_test)

In [None]:
label_counts = dict()
for labels in df_model.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[str(label)]+=1
        else:
            label_counts[str(label)]=1

In [None]:
label_counts

{'13': 1,
 '14': 1,
 '15': 1,
 '16': 1,
 '17': 1,
 '23': 1,
 '24': 1,
 '25': 1,
 '26': 1,
 '27': 1,
 '33': 1,
 '34': 1,
 '35': 1,
 '36': 1,
 '37': 1,
 '38': 1,
 '39': 1,
 '40': 1,
 '41': 1,
 '42': 1,
 '43': 1,
 '44': 1,
 '45': 1,
 '46': 1,
 '47': 1,
 '48': 1,
 'Accounting': 2784,
 'Advertising': 2730,
 'Agriculture': 880,
 'Aquarius': 24411,
 'Architecture': 746,
 'Aries': 31244,
 'Arts': 15650,
 'Automotive': 450,
 'Banking': 1124,
 'Biotech': 1703,
 'BusinessServices': 2122,
 'Cancer': 31368,
 'Capricorn': 25284,
 'Chemicals': 1051,
 'Communications-Media': 10423,
 'Construction': 641,
 'Consulting': 3327,
 'Education': 17599,
 'Engineering': 7354,
 'Environment': 451,
 'Fashion': 3391,
 'Gemini': 28280,
 'Government': 3904,
 'HumanResources': 1187,
 'Internet': 7898,
 'InvestmentBanking': 745,
 'Law': 4877,
 'LawEnforcement-Security': 914,
 'Leo': 30173,
 'Libra': 27660,
 'Manufacturing': 1284,
 'Maritime': 186,
 'Marketing': 3037,
 'Military': 1675,
 'Museums-Libraries': 1369,
 'No

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

In [None]:
y_train =binarizer.fit_transform(y_train)

In [None]:
y_test =binarizer.fit_transform(y_test)

In [None]:
y_test

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0]])

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
model=LogisticRegression(solver='lbfgs', max_iter=100)

In [None]:
model=OneVsRestClassifier(model)

In [None]:
model.fit(X_train_ct,y_train)

In [None]:
y_pred=model.predict(X_test_ct)

In [None]:
y_pred_inversed = binarizer.inverse_transform(y_pred)
y_test_inversed = binarizer.inverse_transform(y_test)

In [None]:
for i in range(5):
    print('Text:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_pred_inversed[i])
    ))

In [None]:
#Size of Document Term Matrix
X_train_ct.shape

In [None]:
X_train_ct[0]

In [None]:
#Let's check the first record
X_train_ct[0]

In [None]:
X_test_ct = cvect.transform(X_test)

In [None]:
X_test_ct.shape

In [None]:
from sklearn.svm import SVC

In [None]:
#Train an SVM with default parameters
svc = SVC()
svc.fit(X_train_ct, y_train)

In [None]:
#Calculate accuracy on Test Dataset
from sklearn.metrics import accuracy_score
accuracy_score(y_test, svc.predict(X_test_ct))

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
y_pred = svc.predict(X_test_ct)
print("Test Accuracy:" + str(accuracy_score(y_test,y_pred)))
print("F1: " + str(f1_score(y_test,y_pred, average='macro')))
print("F1_macro: " + str(f1_score(y_test,y_pred, average='macro')))
print("Precision: " + str(precision_score(y_test,y_pred, average='macro')))

In [None]:
#Start building a Keras Sequential Model
import tensorflow as tf
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [None]:
#Add hidden layers
model.add(tf.keras.layers.Dense(100, activation='relu', input_shape=(len(cvect.vocabulary_),)))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))

In [None]:
#Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
X_train_ct.todense()[0]

In [None]:
print(X_train_ct[0])

In [None]:
X_test_ct.todense()[0]

In [None]:
print(X_test_ct[0])

In [None]:
model.fit(X_train_ct.todense(), y_train, validation_data=(X_test_ct.todense(), y_test), epochs=10, batch_size=32)

In [None]:
import matplotlib.pyplot as plt # visualization
from wordcloud import WordCloud

In [None]:
# Define wordcloud function from wordcloud library.
wc = WordCloud()
wc.generate(str(df['clean_text']))
# declare our figure 
plt.figure(figsize=(20,10), facecolor='k')
# add title to the graph
plt.title("Most frequent words in blog dataset", fontsize=40, color='white')
plt.imshow(wc)
plt.show()