# Author Features Prediction

## Description

Classification is one of the most common machine learning tasks you will deal with in practice.
Text in the form of blogs, posts, and articles is written every second. Predicting information about
a writer from the text alone, without knowing anything else about them, is a challenge.
We are going to build a classifier that predicts several attributes of the author of a given text.
We have framed it as a multi-label classification problem.

## Dataset
### Blog Authorship Corpus

Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger's file are separated by the date of the following post, and links within a post are denoted by the label urllink.
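The three age groups listed above leave gaps between them (18-22 and 28-32). As a minimal sketch, the mapping from a raw `age` value to its group can be written as below; the helper name `age_group` is hypothetical and is not part of the corpus or of this notebook.

```python
def age_group(age):
    """Return the corpus age-group label for an age, or None if it falls in a gap."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    return None  # ages outside the three documented ranges


for sample_age in (17, 25, 41, 20):
    print(sample_age, age_group(sample_age))
```

The notebook itself keeps the raw age as a label; grouping like this would be one way to shrink the label space.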

Link to dataset: [Blog Authorship Corpus](https://www.kaggle.com/rtatman/blog-authorship-corpus)

### Acknowledgements

The corpus may be freely used for non-commercial research purposes. Any resulting publications should cite the following:

J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. URL: http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import pandas as pd
import numpy as np

In [9]:
#df = pd.read_csv('/Users/debajyotidas/Google Drive/PGP-AIML/Statistical NLP/Project/blogtext.csv.zip')
df = pd.read_csv('/content/drive/My Drive/PGP-AIML/Statistical NLP/Project/blogtext.csv.zip')

In [10]:
df.sample(n=5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 255030 | 2004217 | female | 17 | Education | Leo | 23,October,2003 | Wahey! Have found a way to immedi... |
| 650534 | 1103171 | female | 25 | indUnk | Aries | 05,June,2004 | I saw Troy last night with Cynt... |
| 444143 | 1901727 | female | 41 | Arts | Scorpio | 01,July,2004 | My friend Sara and I ... |
| 308089 | 3634769 | female | 25 | Education | Taurus | 01,July,2004 | Hey, I think we should try to get Khwaj... |
| 315666 | 4127142 | female | 25 | Student | Scorpio | 03,August,2004 | Hey! One of My Poems Took On a Life... |


In [11]:
df.shape

(681284, 7)

In [12]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [13]:
# Let's consider only 100,000 records for this demo
small_df = df.sample(n=100000)
small_df.reset_index(inplace=True, drop=True)
#small_df=df.copy()

In [14]:
small_df.shape

(100000, 7)

In [15]:
small_df.sample(n=5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 3117 | 3494091 | male | 33 | Manufacturing | Libra | 04,August,2004 | that was said by my good friend roger o... |
| 84443 | 3698513 | female | 16 | indUnk | Virgo | 12,July,2004 | woot woot! i'm so excited. so i'm n... |
| 3277 | 1761174 | female | 43 | indUnk | Taurus | 10,August,2003 | I got some bad news today at work... |
| 51099 | 3479692 | male | 25 | Advertising | Aries | 13,July,2004 | OMG - check out the photos urlLink... |
| 61562 | 3581495 | male | 17 | indUnk | Aries | 09,June,2004 | urlLink Sometimes I real... |


In [16]:
import nltk
import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [17]:
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [18]:
STOP_WORDS = set(stopwords.words('english'))  # build the stopword set once, not once per token

def pre_processor(doc):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', doc)  # remove all special characters
    text = text.lower()  # convert all text to lowercase
    text = text.strip()  # trim leading and trailing whitespace
    word_tokens = word_tokenize(text)
    word_list = [w for w in word_tokens if w not in STOP_WORDS]  # remove English stopwords
    return word_list
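To see what the cleaning steps above do to a single string, here is a small dependency-free sketch. It is only an approximation of the cell above: `TOY_STOPWORDS` is a hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's full English list), and a whitespace split stands in for `word_tokenize`.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
TOY_STOPWORDS = {"i", "a", "the", "to", "and", "was"}


def toy_pre_processor(doc):
    text = re.sub(r"[^a-zA-Z0-9\s]", "", doc)  # drop punctuation and symbols
    text = text.lower().strip()                 # lowercase and trim
    tokens = text.split()                       # stand-in for word_tokenize
    return [t for t in tokens if t not in TOY_STOPWORDS]


sample = "  I went to the Mall, and it was GREAT!  "
cleaned = " ".join(toy_pre_processor(sample))
print(cleaned)  # went mall it great
```

Because apostrophes are stripped before the stop-word filter runs, a contraction such as "don't" becomes "dont" and is no longer matched by a stop-word entry, which is why tokens like "dont" survive in the cleaned text shown later.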

In [19]:
small_df['text'] = small_df['text'].apply(lambda s: ' '.join(pre_processor(s)))

In [20]:
small_df.sample(n=5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 6795 | 1152012 | female | 16 | indUnk | Leo | 07,April,2004 | know dont wear bright colors makes look sick f... |
| 13433 | 1115374 | male | 17 | indUnk | Scorpio | 13,August,2004 | missing someone theyre gone natural thing love... |
| 99609 | 3962236 | female | 23 | Military | Scorpio | 27,July,2004 | today lunch luke went chow hall galley still t... |
| 73352 | 2802245 | male | 15 | Student | Virgo | 11,May,2004 | meh havent posted ages sue meuh dont actually ... |
| 9941 | 3603600 | male | 25 | Automotive | Cancer | 26,June,2004 | friday night much cambridge went walmart weekl... |


### 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
#### a. Label columns to merge: “gender”, “age”, “topic”, “sign”
#### b. After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels”

In [21]:
small_df['labels']=small_df[['gender','age','topic','sign']].apply(lambda x: [str(y) for y in x],axis=1)
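Step (b) asks for a data frame with only the "text" and "labels" columns, while the cell above keeps the original columns as well. A minimal sketch of the merge followed by the column selection, on a hypothetical two-row frame named `toy_df` (the rows are made up for illustration), might look like this:

```python
import pandas as pd

# Hypothetical toy frame with the same label columns as the notebook's data.
toy_df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [26, 35],
    "topic": ["Technology", "indUnk"],
    "sign": ["Virgo", "Scorpio"],
    "text": ["video fun saturday night", "kind bad weekend food"],
})

label_cols = ["gender", "age", "topic", "sign"]
# Merge the label columns into one list of strings per row, as in the cell above.
toy_df["labels"] = toy_df[label_cols].apply(lambda row: [str(v) for v in row], axis=1)
# Keep only the two columns the task asks for.
toy_df = toy_df[["text", "labels"]]
print(toy_df)
```

The same selection applied to `small_df` would leave exactly the two columns; the later cells only read `small_df.text` and `small_df.labels`, so they work either way.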

In [22]:
small_df.sample(n=5)

| | id | gender | age | topic | sign | date | text | labels |
|---|---|---|---|---|---|---|---|---|
| 31055 | 2892926 | male | 26 | Communications-Media | Virgo | 20,July,2004 | urllink remind certain redhead know nbsp maybe... | [male, 26, Communications-Media, Virgo] |
| 23785 | 2384693 | male | 23 | indUnk | Cancer | 08,July,2004 | world stage one loony performer set unbroken r... | [male, 23, indUnk, Cancer] |
| 20872 | 894945 | male | 27 | Technology | Cancer | 09,August,2004 | video fun saturday night urllink shots warm b ... | [male, 27, Technology, Cancer] |
| 97172 | 3359058 | female | 35 | indUnk | Scorpio | 20,June,2004 | kind bad weekend food well friday good anyway ... | [female, 35, indUnk, Scorpio] |
| 66218 | 3208138 | male | 13 | indUnk | Pisces | 20,May,2004 | travel philadelphia pa urllink association ent... | [male, 13, indUnk, Pisces] |


### 4. Separate features and labels, and split the data into training and testing (5 points)

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
x_train,x_test,y_train,y_test=train_test_split(small_df.text,small_df.labels,test_size=0.3,random_state=42)

In [25]:
x_train.sample(n=5)

79624    form 0 phoenix eternal phoenixs cycle reached ...
5906     urllink making american president mayor bloomb...
89221    congratulations ken jennings buzzer going sepa...
96082    okay well past 2 days awesome monday pool jake...
69925        kerrang radio station live 1052fm secks radio
Name: text, dtype: object

In [26]:
y_train.sample(n=5)

42187      [male, 25, Internet, Scorpio]
85224         [male, 16, indUnk, Pisces]
26909    [female, 24, Education, Pisces]
95058      [female, 15, Arts, Capricorn]
925          [female, 24, indUnk, Aries]
Name: labels, dtype: object

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english',ngram_range=(1, 2),max_df=0.2,min_df=2,max_features=50000)

In [28]:
X_train_ct=cvect.fit_transform(x_train)

In [29]:
X_test_ct=cvect.transform(x_test)
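To make the effect of `ngram_range` and `min_df` concrete, here is a small sketch on a hypothetical three-document corpus. `max_df=0.2` is left out on purpose: with only three documents it would discard every term, so this toy setup is an assumption for illustration rather than a copy of the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, already lowercased and cleaned.
toy_docs = [
    "went walmart friday night",
    "friday night movie",
    "walmart trip friday night",
]

# Unigrams and bigrams, keeping only terms that occur in at least 2 documents.
toy_vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_vect.fit_transform(toy_docs)

vocab = sorted(toy_vect.vocabulary_)
print(vocab)                 # ['friday', 'friday night', 'night', 'walmart']
print(toy_matrix.toarray())  # one row per document, one column per kept term
```

Terms such as "movie" or "went walmart" appear in a single document and are dropped by `min_df=2`, while the bigram "friday night" survives because all three documents contain it.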

In [30]:
print(X_train_ct)

  (0, 458)	1
  (0, 2292)	1
  (0, 382)	1
  (0, 23067)	2
  (0, 23031)	2
  (0, 25174)	1
  (0, 37533)	1
  (0, 6707)	1
  (0, 22137)	1
  (0, 19852)	1
  (0, 5140)	1
  (0, 1536)	1
  (0, 15639)	1
  (0, 4364)	1
  (0, 2298)	1
  (0, 15643)	1
  (1, 5140)	1
  (1, 21424)	3
  (1, 12172)	1
  (1, 12224)	1
  (1, 16903)	1
  (1, 41531)	1
  (1, 40407)	2
  (1, 24986)	1
  (1, 38896)	2
  :	:
  (69998, 38002)	1
  (69998, 35426)	1
  (69998, 37926)	1
  (69998, 20119)	1
  (69998, 18234)	1
  (69998, 34847)	1
  (69998, 33080)	1
  (69998, 13480)	1
  (69998, 9699)	1
  (69998, 37907)	1
  (69998, 43823)	1
  (69998, 22960)	1
  (69998, 38245)	1
  (69998, 36476)	1
  (69998, 15576)	1
  (69998, 37833)	1
  (69998, 9934)	1
  (69998, 37610)	1
  (69998, 34616)	1
  (69998, 22964)	1
  (69998, 43092)	1
  (69998, 48926)	1
  (69998, 12478)	1
  (69998, 24378)	1
  (69999, 32260)	1


#### As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that each prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

In [33]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

In [34]:
ytrain=mlb.fit_transform(y_train)

In [35]:
ytest=mlb.transform(y_test)
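As a small illustration of the binary mask described above, the sketch below fits a separate `MultiLabelBinarizer` on two hypothetical label lists (named `toy_labels` here, not taken from the data) and shows the resulting class order and indicator rows.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two hypothetical label lists in the same [gender, age, topic, sign] layout.
toy_labels = [
    ["male", "25", "Internet", "Scorpio"],
    ["female", "25", "indUnk", "Aries"],
]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

classes = list(toy_mlb.classes_)
print(classes)  # ['25', 'Aries', 'Internet', 'Scorpio', 'female', 'indUnk', 'male']
print(toy_y)    # one 0/1 column per class, one row per example
print(toy_mlb.inverse_transform(toy_y))
```

The classes are stored in sorted order, so `inverse_transform` returns each tuple in that order rather than in the original gender, age, topic, sign order. This is why the tuples printed at the end of the notebook start with the age.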

#### In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (one per tag) are trained. As the base classifier, we use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

In [36]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs',multi_class='multinomial',max_iter=1000)
clf = OneVsRestClassifier(clf)

In [37]:
clf.fit(X_train_ct,ytrain)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='multinomial',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)
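The One-vs-Rest idea of training one binary classifier per tag can be checked on a tiny example. The sketch below uses made-up arrays `X_toy` and `Y_toy` (assumptions for illustration only) with three label columns and confirms that three underlying estimators are fitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: 6 samples, 2 features, 3 binary label columns.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
Y_toy = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
])

toy_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_toy, Y_toy)

print(len(toy_clf.estimators_))  # one fitted LogisticRegression per label column
toy_pred = toy_clf.predict(X_toy)
print(toy_pred.shape)            # same layout as Y_toy: (n_samples, n_labels)
```

Each estimator decides its own label independently, so nothing forces a prediction to contain exactly one gender, one age, one topic, and one sign.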

In [39]:
from sklearn import metrics  # missing import: the print calls below rely on it

y_pred = clf.predict(X_test_ct)  # predict once and reuse the result for every metric
print('Testing F1-Score of Model with Weighted averaging is:',metrics.f1_score(ytest, y_pred,average='weighted'))
print('Testing Average Precision Score of model with Weighted averaging is:',metrics.average_precision_score(ytest, y_pred,average='weighted'))
print('Testing Average Recall Score of model with Weighted averaging is:',metrics.recall_score(ytest, y_pred,average='weighted'))

Testing F1-Score of Model with Weighted averaging is: 0.2967905471427612
Testing Average Precision Score of model with Weighted averaging is: 0.2499368574341174
Testing Average Recall Score of model with Weighted averaging is: 0.2734666666666667
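For readers unfamiliar with weighted averaging, the sketch below computes a support-weighted F1 by hand on a hypothetical 4x3 indicator matrix. The function `weighted_f1` is an illustrative helper, not a scikit-learn API; it follows the usual definition in which each label's F1 is weighted by its number of true instances.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1 for rows of 0/1 indicators."""
    n_labels = len(y_true[0])
    total_support = 0
    weighted_sum = 0.0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # treat an undefined F1 as 0
        support = tp + fn                       # number of true instances of label j
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Hypothetical toy data: per-label F1 is 1.0, 0.5, 0.0 with supports 3, 2, 1.
toy_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
toy_pred = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 0]]
score = weighted_f1(toy_true, toy_pred)
print(round(score, 4))  # 0.6667
```

One caveat about the cell above: `metrics.average_precision_score` summarizes a precision-recall curve and is normally given continuous scores (for example from `predict_proba`). For precision on hard 0/1 predictions, `metrics.precision_score` is the direct counterpart of `recall_score`.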


### 10. Print true label and predicted label for any five examples (7.5 points)

In [40]:
mlb.inverse_transform(ytest[:5])

[('25', 'Virgo', 'female', 'indUnk'),
 ('26', 'Consulting', 'Virgo', 'female'),
 ('25', 'Aries', 'female', 'indUnk'),
 ('24', 'Aquarius', 'female', 'indUnk'),
 ('24', 'Capricorn', 'Technology', 'male')]

In [41]:
mlb.inverse_transform(clf.predict(X_test_ct[:5]))

[('24', '33', 'indUnk', 'male'),
 ('16', 'Cancer', 'Student', 'male'),
 ('Capricorn', 'female', 'indUnk'),
 ('Taurus', 'female'),
 ('Technology', 'male')]
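To compare the two outputs above more easily, the sketch below copies the five true and five predicted tuples from those cells and prints, for each example, which labels were predicted correctly. The variable names are chosen here for illustration.

```python
# Tuples copied from the two cell outputs above.
true_labels = [
    ('25', 'Virgo', 'female', 'indUnk'),
    ('26', 'Consulting', 'Virgo', 'female'),
    ('25', 'Aries', 'female', 'indUnk'),
    ('24', 'Aquarius', 'female', 'indUnk'),
    ('24', 'Capricorn', 'Technology', 'male'),
]
pred_labels = [
    ('24', '33', 'indUnk', 'male'),
    ('16', 'Cancer', 'Student', 'male'),
    ('Capricorn', 'female', 'indUnk'),
    ('Taurus', 'female'),
    ('Technology', 'male'),
]

# Labels that appear in both the true and the predicted tuple of each example.
overlaps = [sorted(set(t) & set(p)) for t, p in zip(true_labels, pred_labels)]

for t, p, hit in zip(true_labels, pred_labels, overlaps):
    print('true:', t, '| predicted:', p, '| correct:', hit)
```

The predictions do not always contain four labels, and the first one contains two ages. Both follow from each label being decided by its own independent binary classifier.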