### Project_Summary

**Domain: Digital content management**


The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004: over 600,000 posts from more than 19 thousand bloggers. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign.

The goal is to build an NLP classifier that takes the text of a blog post as input and predicts several attributes of its author: gender, age, topic (industry) and astrological sign.

Tasks performed:

1. Importing and cleaning the data
2. Text preprocessing
3. Designing, training, tuning and testing the best NLP text classifier

**Loading the libraries and dataset**

In [1]:
#Loading the libraries
import numpy as np 
import pandas as pd 
import re 
import nltk
import os
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import opendatasets as od

In [3]:
od.download("https://www.kaggle.com/rtatman/blog-authorship-corpus")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: vemulaganesh
Your Kaggle Key: ········
Downloading blog-authorship-corpus.zip to .\blog-authorship-corpus


100%|███████████████████████████████████████████████████████████████████████████████| 290M/290M [03:32<00:00, 1.43MB/s]





**Loading the dataset file**

In [6]:
data_set=pd.read_csv(r".\blog-authorship-corpus\blogtext.csv")

## Analysing the data set

In [7]:
#Checking the first 10 rows
data_set.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [8]:
#Checking any null values present
data_set.isna().any()

id        False
gender    False
age       False
topic     False
sign      False
date      False
text      False
dtype: bool

There are no null values in the dataset

In [9]:
data_set.shape

(681284, 7)

The dataset has 681,284 rows (blog posts), which is a very large number, so we first work on a subset made of the first 10,000 rows.

In [10]:
data=data_set.head(10000)
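Note that `head(10000)` takes the first 10,000 rows rather than a random sample, and the first ten rows shown earlier suggest that posts by the same blogger sit next to each other in the file. The sketch below contrasts the two approaches on a toy frame. It assumes a random subset would be acceptable for the experiment; the toy data and `random_state=42` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame: posts from the same blogger are stored contiguously
toy = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "text": [f"post {i}" for i in range(10)],
})

# First rows only: every selected post comes from blogger 1
first_rows = toy.head(4).copy()

# Random rows: drawn from the whole frame, reproducible via random_state
random_rows = toy.sample(n=4, random_state=42).copy()

print(first_rows["id"].nunique())
print(sorted(random_rows.index))
```

Both variants end with `.copy()` so that later column assignments act on an independent frame.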

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   gender  10000 non-null  object
 2   age     10000 non-null  int64 
 3   topic   10000 non-null  object
 4   sign    10000 non-null  object
 5   date    10000 non-null  object
 6   text    10000 non-null  object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


## Preprocessing the data

**Dropping the columns which are not needed**

The prediction targets are gender, age, topic and sign.
The id and date columns are used neither as inputs nor as targets, so they are dropped.

In [12]:
data.drop(['id','date'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
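This warning appears because `data` was created with `data_set.head(10000)`, which pandas treats as a slice that may still be linked to `data_set`, so in-place changes to it are ambiguous. One common way to avoid it is to take an explicit `.copy()` before modifying the subset. A minimal sketch on toy data (the names `full` and `subset` are illustrative):

```python
import pandas as pd

full = pd.DataFrame({"id": [1, 2, 3], "text": ["A", "B", "C"]})

# An explicit copy makes `subset` independent of `full`
subset = full.head(2).copy()

# Reassign instead of using inplace=True, then add a derived column
subset = subset.drop(columns=["id"])
subset["text_L"] = subset["text"].str.lower()

print(list(subset.columns))
print(list(full.columns))
```

The original frame keeps its columns, and the subset can be modified freely.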


The id and date columns are removed from the dataset because they add little value for this task.

In [13]:
data.head()

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,..."
1,male,15,Student,Leo,These are the team members: Drewe...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...


**Text Cleaning**

In [14]:
data['text_L']=data['text'].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Stop words are removed before the punctuation marks because some stop words contain punctuation themselves (for example, contractions with an apostrophe such as don't).

In [15]:
stop_words = set(stopwords.words('english'))
def drop_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

In [16]:
data["text_no_stop"] = data["text_L"].apply(lambda text : drop_stopwords(text))



Replacing every run of characters other than the lowercase letters a-z (digits, punctuation and other symbols) with a single space

In [17]:
data['only_text']=data['text_no_stop'].apply(lambda x: re.sub(r'[^a-z]+',' ',x))



Stop words are removed a second time because replacing special characters with ' ' splits some words into fragments that are themselves stop words. For example, i'm becomes i m, and both i and m are stop words.

In [18]:
data["text"] = data["only_text"].apply(lambda text : drop_stopwords(text))



Removing leading and trailing spaces

In [19]:
data["text"]=data['text'].apply(lambda x: x.strip())



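Age is stored as an integer, but the label lists built below mix it with string labels (gender, topic and sign). It is therefore cast to 'str' so that MultiLabelBinarizer can treat every label as a discrete value of the same type.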

In [20]:
data['age']=data['age'].astype('str')



In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   gender        10000 non-null  object
 1   age           10000 non-null  object
 2   topic         10000 non-null  object
 3   sign          10000 non-null  object
 4   text          10000 non-null  object
 5   text_L        10000 non-null  object
 6   text_no_stop  10000 non-null  object
 7   only_text     10000 non-null  object
dtypes: object(8)
memory usage: 625.1+ KB
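The text-cleaning steps above can be condensed into a single function, which also makes the reason for the second stop-word pass easy to see. This is a minimal sketch: `STOP_WORDS` is a small stand-in list so the example runs without downloading NLTK data (the notebook uses `set(stopwords.words('english'))`), and `clean_text` is a hypothetical helper name.

```python
import re

# Stand-in stop-word list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"i", "m", "the", "a", "is", "of", "to", "and"}

def drop_stopwords(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(w for w in str(text).split() if w not in stop_words)

def clean_text(text, stop_words=STOP_WORDS):
    text = text.lower()                       # lowercase
    text = drop_stopwords(text, stop_words)   # first stop-word pass
    text = re.sub(r"[^a-z]+", " ", text)      # non-letters become one space
    text = drop_stopwords(text, stop_words)   # second pass removes fragments like "i m"
    return text.strip()                       # trim leading/trailing spaces

print(clean_text("I'm testing!!! The 100 pages of Info"))
```

In this example "i'm" survives the first pass because it is not in the stand-in list, becomes "i m" after the substitution, and is removed by the second pass.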


**Creating multiple label column**

In [22]:
#Output labels column is created by adding gender, age, topic and sign (These are to be predicted based on the text)
data["labels"] = data.apply(lambda row : [row["gender"],row["age"],row["topic"],row["sign"]],axis =1)



In [23]:
#Dropping gender, age, topic ,sign columns and also additionally created columns

data.drop(columns=["gender","age","sign","topic","text_no_stop","only_text","text_L"],axis =1, inplace = True)



In [24]:
data.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


Both remaining columns, text and labels, are of object data type.

## Splitting the data into X and Y

In [25]:
X=data['text']

**Using count vectorizer to get the count vectors of the text**

In [26]:
vectorizer=CountVectorizer(binary=True, ngram_range=(1,2))

In [27]:
X=vectorizer.fit_transform(X)

In [28]:
X[1]

<1x643302 sparse matrix of type '<class 'numpy.int64'>'
	with 25 stored elements in Compressed Sparse Row format>

#### Checking some feature names

In [29]:
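# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vectorizer.get_feature_names()[:5]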

['aa', 'aa amazing', 'aa anger', 'aa compared', 'aa keeps']
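To make the two vectorizer settings concrete: `binary=True` records only whether a term occurs in a post (not how often), and `ngram_range=(1,2)` adds every pair of adjacent words as an extra feature, which is why the matrix above has 643,302 columns. A minimal sketch on two toy documents (the documents are illustrative; `vocabulary_` is used so the example does not depend on the scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["testing testing", "team members testing"]

vec = CountVectorizer(binary=True, ngram_range=(1, 2))
X_toy = vec.fit_transform(docs)

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X_toy.toarray())
```

The repeated word in the first document still yields a 1 rather than a 2, and the bigram "testing testing" becomes its own column.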

**Getting the different unique label values present in the labels column**

In [30]:
# Collect each distinct label once, in order of first appearance
label_list=[]

for labels in data.labels.values:
    for label in labels:
        if label not in label_list:
            label_list.append(label)

In [31]:
sorted(label_list)

['13',
 '14',
 '15',
 '16',
 '17',
 '23',
 '24',
 '25',
 '26',
 '27',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 'Accounting',
 'Aquarius',
 'Aries',
 'Arts',
 'Automotive',
 'Banking',
 'BusinessServices',
 'Cancer',
 'Capricorn',
 'Communications-Media',
 'Consulting',
 'Education',
 'Engineering',
 'Fashion',
 'Gemini',
 'HumanResources',
 'Internet',
 'InvestmentBanking',
 'Law',
 'LawEnforcement-Security',
 'Leo',
 'Libra',
 'Marketing',
 'Museums-Libraries',
 'Non-Profit',
 'Pisces',
 'Publishing',
 'Religion',
 'Sagittarius',
 'Science',
 'Scorpio',
 'Sports-Recreation',
 'Student',
 'Taurus',
 'Technology',
 'Telecommunications',
 'Virgo',
 'female',
 'indUnk',
 'male']
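The same sorted list of distinct labels can also be obtained with a set, which avoids the repeated membership checks on a growing list. A minimal sketch on two toy label rows (the rows mirror the format of the `labels` column; `unique_labels` is an illustrative name):

```python
from itertools import chain

rows = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

# Flatten the rows, deduplicate with a set, then sort
unique_labels = sorted(set(chain.from_iterable(rows)))
print(unique_labels)
```

As in the output above, digits sort before uppercase letters, which sort before lowercase letters.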

**Pre-processing the labels**

In [32]:
from sklearn.preprocessing import MultiLabelBinarizer
binarizer=MultiLabelBinarizer(classes=sorted(label_list))

In [33]:
binarizer

MultiLabelBinarizer(classes=['13', '14', '15', '16', '17', '23', '24', '25',
                             '26', '27', '33', '34', '35', '36', '37', '38',
                             '39', '40', '41', '42', '43', '44', '45', '46',
                             'Accounting', 'Aquarius', 'Aries', 'Arts',
                             'Automotive', 'Banking', ...])

In [34]:
Y=binarizer.fit_transform(data.labels)
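As a small illustration of what this step produces, the sketch below binarizes two toy label rows: each row becomes a 0/1 vector with one column per class and exactly four ones (gender, age, topic, sign). It also shows why age is kept as a string: sorting a list that mixes integers and strings fails in Python 3. The toy rows and variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

rows = [
    ["male", "15", "Student", "Leo"],
    ["female", "33", "Arts", "Leo"],
]
classes = sorted({label for row in rows for label in row})

mlb = MultiLabelBinarizer(classes=classes)
Y_toy = mlb.fit_transform(rows)
print(classes)
print(Y_toy)

# inverse_transform returns the labels of each row in class order
print(mlb.inverse_transform(Y_toy))

# Mixed int/str labels cannot be sorted in Python 3
try:
    sorted(["male", 15])
    mixed_sort_fails = False
except TypeError:
    mixed_sort_fails = True
```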

In [35]:
Y[:5]

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

**Splitting the data into a 70% train set and a 30% test set**

In [36]:
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3)
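Because no `random_state` is passed, the split above differs on every run, so the scores reported later will vary slightly between runs. A minimal sketch of a reproducible split on toy arrays (the arrays and the seed value 42 are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
Y_toy = np.arange(10)

# The same seed gives the same partition every time
split_a = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
print(np.array_equal(X_test_a, split_b[1]))
```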

## Fit the model

In [37]:
model=OneVsRestClassifier(LogisticRegression(solver='lbfgs'))

In [38]:
model.fit(X_train,y_train)

OneVsRestClassifier(estimator=LogisticRegression())
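`OneVsRestClassifier` handles the multi-label target by fitting one independent binary logistic regression per column of the label matrix, so each label (for example 'male' or 'Leo') gets its own yes/no model. A minimal sketch on toy data, where the features and labels are made up so that every label column contains both classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
# Two label columns, each with both 0 and 1 present
Y_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])

toy_model = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
toy_model.fit(X_toy, Y_toy)

# One fitted binary estimator per label column
print(len(toy_model.estimators_))

# Predictions come back as a 0/1 indicator matrix of the same width as Y_toy
toy_pred = toy_model.predict(X_toy)
print(toy_pred.shape)
```

Since the per-label models are independent, nothing forces a prediction to contain exactly one gender, one age, one topic and one sign.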

In [39]:
y_pred=model.predict(X_test)

In [40]:
y_pred_inversed = binarizer.inverse_transform(y_pred)
y_test_inversed = binarizer.inverse_transform(y_test)

**Checking predicted values against the actual values**

In [41]:
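# X_test[i] is a sparse count-vector row, so the 'Text' field prints (row, feature index) pairs rather than the original words
for i in range(5):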
    print('Text:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_pred_inversed[i])
    ))

Text:	  (0, 450189)	1
  (0, 597026)	1
  (0, 229515)	1
  (0, 218079)	1
  (0, 222683)	1
  (0, 194964)	1
  (0, 528086)	1
  (0, 326192)	1
  (0, 215891)	1
  (0, 170072)	1
  (0, 442891)	1
  (0, 256867)	1
  (0, 937)	1
  (0, 558082)	1
  (0, 535731)	1
  (0, 419632)	1
  (0, 115987)	1
  (0, 172531)	1
  (0, 169199)	1
  (0, 613430)	1
  (0, 337957)	1
  (0, 616967)	1
  (0, 347567)	1
  (0, 212267)	1
  (0, 235189)	1
  :	:
  (0, 201371)	1
  (0, 294836)	1
  (0, 123994)	1
  (0, 625771)	1
  (0, 126559)	1
  (0, 285769)	1
  (0, 447290)	1
  (0, 511158)	1
  (0, 589047)	1
  (0, 347711)	1
  (0, 588064)	1
  (0, 201376)	1
  (0, 73080)	1
  (0, 169242)	1
  (0, 126556)	1
  (0, 250233)	1
  (0, 126551)	1
  (0, 138383)	1
  (0, 326275)	1
  (0, 427282)	1
  (0, 642358)	1
  (0, 339316)	1
  (0, 498419)	1
  (0, 428798)	1
  (0, 368286)	1
True labels:	15,Student,Virgo,male
Predicted labels:	15,Student,Virgo,male


Text:	  (0, 13357)	1
  (0, 229515)	1
  (0, 246950)	1
  (0, 556303)	1
  (0, 559031)	1
  (0, 426777)	1
  (0, 143432)	
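The "Text" field above shows feature indices because `X_test` holds the vectorized rows, not the original strings. One way to print readable text is to pass the raw texts to `train_test_split` as an extra array so that they are shuffled and split in step with the features. A minimal sketch with toy texts and labels (all data and the `text_train` / `text_test` names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = ["testing testing", "team members", "yahoo toolbar",
         "news article", "english news", "interesting conversation"]
labels = ["Student", "Student", "Banking", "Banking", "Banking", "Banking"]

X_all = CountVectorizer(binary=True).fit_transform(texts)

# Any number of equally long arrays can be split together
X_train, X_test, y_train, y_test, text_train, text_test = train_test_split(
    X_all, labels, texts, test_size=0.3, random_state=0
)

for text, label in zip(text_test, y_test):
    print("Text:\t{}\nTrue label:\t{}\n".format(text, label))
```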

## Evaluation metrics

In [42]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

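def print_evaluation_scores(Ytest, Ypred):
    # For multi-label targets, accuracy_score is the exact-match (subset) accuracy:
    # a sample counts as correct only if every one of its labels is predicted correctly
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: ', f1_score(Ytest, Ypred, average='micro'))
    # average_precision_score summarises a precision-recall curve and is normally given
    # scores or probabilities; here it receives hard 0/1 predictions
    print('Average precision score: ', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: ', recall_score(Ytest, Ypred, average='micro'))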

In [43]:
print_evaluation_scores(y_test, y_pred)

Accuracy score:  0.306
F1 score:  0.6392772422147075
Average precision score:  0.45729506311596063
Average recall score:  0.52775
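When reading these numbers, it helps to keep the definitions apart. The accuracy is an exact-match score over all four labels of a post, which is why it is much lower than the micro-averaged F1. The "average precision" is not the micro-averaged precision (scikit-learn's `precision_score` would give that). The sketch below computes the micro-averaged quantities by hand on toy data, and then derives the micro precision implied by the reported F1 and recall from F1 = 2PR / (P + R). The toy matrices and the `micro_scores` helper are illustrative.

```python
def micro_scores(y_true, y_pred):
    """Subset accuracy and micro-averaged precision/recall/F1 for 0/1 matrices."""
    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        exact += int(list(true_row) == list(pred_row))
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"subset_accuracy": exact / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

y_true_toy = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred_toy = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
scores = micro_scores(y_true_toy, y_pred_toy)
print(scores)

# Micro precision implied by the F1 and recall printed above:
# F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
f1_reported, recall_reported = 0.6392772422147075, 0.52775
implied_precision = f1_reported * recall_reported / (2 * recall_reported - f1_reported)
print(round(implied_precision, 2))
```

Under these definitions the reported F1 and recall correspond to a micro-averaged precision of roughly 0.81, so the classifier is fairly precise on the labels it does predict but misses close to half of the true labels.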
