# NLP Project

# Author Features Prediction

## Description
Classification is one of the most common machine learning tasks in practice.

Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer 
from the text alone, without knowing anything else about them, is a challenge. 

We are going to create a classifier that predicts multiple features of the author of a given text.

We have framed it as a multilabel classification problem: each post is tagged with several labels at once (gender, age, topic, and sign).

## Dataset

Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in 
August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 
35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age, but for many, industry 
and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
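
The three age brackets above can be recovered from a numeric age. A minimal sketch, assuming a hypothetical helper `age_group` that is not part of the original notebook and that treats ages outside the listed ranges as unknown:

```python
def age_group(age):
    """Map a blogger's age to the corpus age bracket ("10s", "20s", "30s")."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    # Ages outside the three brackets are not described for this corpus.
    return None


print(age_group(15), age_group(25), age_group(33))
```

Grouping ages this way would reduce the number of distinct age labels, at the cost of a coarser prediction.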

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been 
stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following 
post and links within a post are denoted by the label urllink.


## 1. Load the dataset 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from warnings import filterwarnings
filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [2]:
df = pd.read_csv('blogtext.csv')
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [3]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

#### Checking Dimension

In [4]:
df.shape

(681284, 7)

#### Checking for null values

In [5]:
Total = df.isnull().sum().sort_values(ascending=False)  
Percent = (df.isnull().sum()*100/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([Total, Percent], axis = 1, keys = ['Total', 'Percentage of Missing Values'])    
missing_data

Unnamed: 0,Total,Percentage of Missing Values
id,0,0.0
gender,0,0.0
age,0,0.0
topic,0,0.0
sign,0,0.0
date,0,0.0
text,0,0.0


There are no missing values in the data.

##### The problem statement says: "As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly". I have decided to take the first 7000 rows for further analysis.

In [6]:
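# Take an explicit copy so later column assignments do not act on a slice of the original frame
df = df.head(7000).copy()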

## 2. Preprocess rows of the “text” column

- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [7]:
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))
df.text = df.text.apply(lambda x: x.lower())
df.text = df.text.apply(lambda x: x.strip())

In [8]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # separate name so the nltk `stopwords` module is not shadowed
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Pooja\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
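
The four preprocessing steps can be checked on a single string before applying them to the whole column. A minimal sketch, assuming a hypothetical helper `clean_text` and a tiny hand-written stop-word set standing in for the NLTK English list:

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption, for illustration only).
SAMPLE_STOPWORDS = {"the", "are", "these", "has", "been", "to", "s", "i", "can"}

NON_LETTERS = re.compile(r"[^A-Za-z]+")


def clean_text(text, stop_words=SAMPLE_STOPWORDS):
    """Apply the four preprocessing steps to one string."""
    text = NON_LETTERS.sub(" ", text)   # remove unwanted characters
    text = text.lower()                 # convert to lowercase
    text = text.strip()                 # remove leading/trailing spaces
    # Drop stop words; split() also collapses repeated spaces.
    return " ".join(w for w in text.split() if w not in stop_words)


print(clean_text("testing!!! testing!!!"))
print(clean_text("Thanks to Yahoo!'s Toolbar I can"))
```

Note that the regular expression also removes digits and splits contractions, so fragments such as `s` are left for the stop-word filter to drop.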


In [9]:
df.text[7]

'anything korea country extremes everything seems fad based think may come korea history invaded reported times years time got independence imagine move quickly get next level next war occupation lately well really lately japanese occupation ended korean war occurred turmoil park chung hee took dictator president elections everyone encouraged vote still dictator assassination next leaders basically ilk president park amazing things time however took incredibly backward country set road industrialization japan stripped korea resources people even language culture many buildings palaces razed japanese official language president park determined change orchestrated han river miracle han river hangang main river seoul korea korea made terrific strides expense civil liberties fastforward present point see korea world wired nation canada finland way beyond u craze pc pc bangs rooms everywhere country well instead playstation like games players go computer one two people korean gamers always 

## 3. Merge the label columns

#### a. Label columns to merge: “gender”, “age”, “topic”, “sign”

In [10]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

#### b. After completing the previous step, the data frame should be reduced to only two columns, “text” and “labels” (the remaining columns are dropped in step 4)

In [11]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


## 4. Separate features and labels, and split the data into training and testing 

In [12]:
df = df[['text','labels']]

In [13]:
df.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


### Train_test_split

In [14]:
X = df.text.values
y = df.labels.values

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [16]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (5600,)
X_test shape: (1400,)
y_train shape: (5600,)
y_test shape: (1400,)


## 5. Vectorize the features

### a. Create Bag of Words
- Use CountVectorizer
- Transform the training and testing data

In [17]:
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)

In [18]:
X_train_cv

<5600x345819 sparse matrix of type '<class 'numpy.int64'>'
	with 702412 stored elements in Compressed Sparse Row format>

In [19]:
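# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out()[:5].tolist()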

['aa', 'aa anger', 'aa compared', 'aa nice', 'aaa']
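
With `binary=True` and `ngram_range=(1, 2)`, each feature records only whether a word or a pair of adjacent words occurs in a post, not how often. A minimal pure-Python sketch of that idea, assuming plain whitespace tokenization (scikit-learn's default tokenizer additionally drops single-character tokens); `binary_ngrams` is a hypothetical helper:

```python
def binary_ngrams(text):
    """Return the set of unigrams and bigrams present in a text."""
    tokens = text.split()
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return unigrams | bigrams


docs = ["testing testing", "urllink fire vegas"]
vocabulary = sorted(set().union(*(binary_ngrams(d) for d in docs)))

# One row per document, one 0/1 column per vocabulary entry.
matrix = [[int(term in binary_ngrams(d)) for term in vocabulary] for d in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

The repeated word in the first document still yields a single 1 in the `testing` column, which is the effect of the binary setting.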

### b. Print the term-document matrix

In [20]:
X_train_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
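
Calling `toarray()` materializes every zero of the matrix. A rough size estimate from the shape and the stored-element count reported for `X_train_cv`, assuming 8 bytes per `int64` entry:

```python
rows, cols = 5600, 345819      # shape of X_train_cv reported above
stored = 702412                # stored (non-zero) elements reported above
bytes_per_entry = 8            # int64

dense_bytes = rows * cols * bytes_per_entry
density = stored / (rows * cols)

print(f"dense size: {dense_bytes / 1e9:.1f} GB")
print(f"non-zero fraction: {density:.4%}")
```

On machines with limited memory, printing a slice such as `X_train_cv[:5].toarray()` shows the same structure at a fraction of the cost.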

## 6. Create a dictionary to get the count of every label 

In [21]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [22]:
label_counts

{'male': 4021,
 '15': 354,
 'Student': 577,
 'Leo': 208,
 '33': 101,
 'InvestmentBanking': 70,
 'Aquarius': 351,
 'female': 2979,
 '14': 170,
 'indUnk': 2616,
 'Aries': 3222,
 '25': 268,
 'Capricorn': 88,
 '17': 914,
 'Gemini': 88,
 '23': 142,
 'Non-Profit': 47,
 'Cancer': 238,
 'Banking': 16,
 '37': 19,
 'Sagittarius': 709,
 '26': 112,
 '24': 378,
 'Scorpio': 854,
 '27': 691,
 'Education': 121,
 '45': 14,
 'Engineering': 119,
 'Libra': 425,
 'Science': 33,
 '34': 540,
 '41': 14,
 'Communications-Media': 61,
 'BusinessServices': 87,
 'Sports-Recreation': 77,
 'Virgo': 41,
 'Taurus': 709,
 'Arts': 31,
 'Pisces': 67,
 '44': 3,
 '16': 73,
 'Internet': 47,
 'Museums-Libraries': 2,
 'Accounting': 2,
 '39': 79,
 '35': 2311,
 'Technology': 2350,
 '36': 787,
 'Law': 3,
 '46': 7,
 'Consulting': 18,
 'Automotive': 14,
 '42': 14,
 'Religion': 9,
 '13': 9,
 'Fashion': 700}
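
The same tally can be written with `collections.Counter`. A minimal sketch on the five label lists shown in the earlier `df.head()` output; on the full frame the input would be `df.labels`:

```python
from collections import Counter
from itertools import chain

# Label lists of the first five rows shown earlier.
sample_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

# Flatten the per-row lists and count each label once per occurrence.
sample_counts = Counter(chain.from_iterable(sample_labels))
print(sample_counts.most_common(3))
```

A `Counter` is a `dict` subclass, so it can be passed wherever the plain dictionary is used, for example `sorted(sample_counts.keys())`.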

## 7. Convert your train and test labels using MultiLabelBinarizer

In [23]:
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

In [24]:
y_test

array([[0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]])
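
Each row of the binarized matrix has one 0/1 column per class, in the order given by `classes`. Because the classes are sorted as strings, digits come first, then capitalized names, then lowercase names, so the last three columns here are `female`, `indUnk`, and `male`. A pure-Python sketch of the same encoding on two label tuples from this dataset; `binarize` is a hypothetical helper, not the scikit-learn implementation:

```python
labels = [
    ("34", "Sagittarius", "female", "indUnk"),
    ("25", "Cancer", "Non-Profit", "male"),
]

# Python sorts strings by code point: digits < uppercase < lowercase.
classes = sorted(set(label for row in labels for label in row))


def binarize(row, classes):
    """Return a 0/1 indicator list aligned with `classes`."""
    present = set(row)
    return [int(c in present) for c in classes]


encoded = [binarize(row, classes) for row in labels]
print(classes)
print(encoded)
```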

## 8. Choose a classifier 

Use a linear classifier and wrap it in OneVsRestClassifier, which trains one binary classifier per label.

In [25]:
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

## 9. Fit the classifier, make predictions and get the accuracy

In [26]:
clf.fit(X_train_cv, y_train)

OneVsRestClassifier(estimator=LogisticRegression())

### Predictions
- Get predicted labels and scores

In [27]:
predicted_labels = clf.predict(X_test_cv)
predicted_scores = clf.decision_function(X_test_cv)

In [28]:
predicted_labels

array([[0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [29]:
predicted_scores

array([[-9.32050338, -7.60584363, -6.1873672 , ...,  5.06615342,
         3.28580851, -5.06615342],
       [-8.80162597, -6.04990957, -5.14632347, ..., -0.13964018,
        -0.21544985,  0.13964018],
       [-9.8266968 , -5.82470983, -6.03096067, ..., -4.5584783 ,
        -4.71848738,  4.5584783 ],
       ...,
       [-8.81617819, -6.17224765, -5.209461  , ...,  0.51464154,
         0.32527844, -0.51464154],
       [-9.15751292, -6.27864861, -5.61974567, ...,  0.03687363,
         0.05972972, -0.03687363],
       [-8.61459193, -4.8877882 , -5.21195358, ..., -1.43377336,
        -1.50334669,  1.43377336]])
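
For a logistic-regression base estimator, each decision score is a log-odds value: a positive score corresponds to a probability above 0.5, and those are the labels marked as 1 in `predicted_labels`. A minimal sketch using the last three scores of the first two rows printed above (rounded); `sigmoid` is a helper defined here for illustration:

```python
import math


def sigmoid(score):
    """Convert a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


score_rows = [
    [5.066, 3.286, -5.066],
    [-0.140, -0.215, 0.140],
]

for scores in score_rows:
    predicted = [int(s > 0) for s in scores]          # threshold at 0
    probabilities = [round(sigmoid(s), 3) for s in scores]
    print(predicted, probabilities)
```

The second row shows why scores are worth inspecting: its labels are decided by margins close to zero, so the prediction is far less certain than in the first row.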

#### Get inverse transform for predicted labels and test labels

In [30]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

In [31]:
(pred_inversed[:5])

[('34', 'Sagittarius', 'female', 'indUnk'),
 ('Aries', 'male'),
 ('Aries', 'male'),
 ('male',),
 ('17', 'Scorpio', 'female', 'indUnk')]

In [32]:
y_test_inversed[:5]

[('34', 'Sagittarius', 'female', 'indUnk'),
 ('27', 'Taurus', 'female', 'indUnk'),
 ('25', 'Cancer', 'Non-Profit', 'male'),
 ('33', 'Aquarius', 'InvestmentBanking', 'male'),
 ('17', 'Scorpio', 'female', 'indUnk')]

### a. Print the following
- i. Accuracy score
- ii. F1 score
- iii. Average precision score
- iv. Average recall score

In [33]:
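def print_evaluation_scores(y_val, predicted):
    # For multilabel targets, accuracy_score is subset accuracy (exact match on all labels).
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # average_precision_score summarizes a precision-recall curve and is normally
    # given continuous scores (e.g. predicted_scores); here it receives hard 0/1 labels.
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))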

In [34]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.3892857142857143
F1 score:  0.6608695652173913
Average precision score:  0.47885221674876843
Average recall score:  0.57
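
For multilabel targets, `accuracy_score` counts a sample as correct only when every label matches, while the micro-averaged scores pool true positives, false positives, and false negatives over all labels. A minimal sketch that recomputes these quantities by hand on the five true/predicted label sets printed earlier (not on the full test set, so the numbers differ from the scores above):

```python
true_sets = [
    {"34", "Sagittarius", "female", "indUnk"},
    {"27", "Taurus", "female", "indUnk"},
    {"25", "Cancer", "Non-Profit", "male"},
    {"33", "Aquarius", "InvestmentBanking", "male"},
    {"17", "Scorpio", "female", "indUnk"},
]
pred_sets = [
    {"34", "Sagittarius", "female", "indUnk"},
    {"Aries", "male"},
    {"Aries", "male"},
    {"male"},
    {"17", "Scorpio", "female", "indUnk"},
]

pairs = list(zip(true_sets, pred_sets))

# Pool the counts over all samples and labels (micro averaging).
tp = sum(len(t & p) for t, p in pairs)   # predicted and correct
fp = sum(len(p - t) for t, p in pairs)   # predicted but wrong
fn = sum(len(t - p) for t, p in pairs)   # missed labels

subset_accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"subset accuracy: {subset_accuracy:.3f}")
print(f"micro precision: {precision:.3f}")
print(f"micro recall:    {recall:.3f}")
print(f"micro F1:        {f1:.3f}")
```

Exact-match accuracy is the strictest of these measures, which is one reason it sits well below the micro-averaged F1 in the results above.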


## 10. Print true label and predicted label for any five examples 

In [35]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	gotta love south beach everyone knows knows work south beach used live south beach law live willing piss life away night club taking large amounts alcohol ecstasy soooo leave allowed work hotel receives glamorous shiny magazine called lincoln road businesses restaurants clubs people around lincoln road glorified strip mall beach full pretty people fabulous things dancing drinking eating loving life month one hot articles listed cover touted follows club lust urge merge sex patrons highlighted quote article walked dj booths dance floors seen girls control panels making dj happy manny hernandez photographer gotta love south beach either love leave say grin diva
True labels:	34,Sagittarius,female,indUnk
Predicted labels:	34,Sagittarius,female,indUnk


Title:	urllink fire vegas
True labels:	27,Taurus,female,indUnk
Predicted labels:	Aries,male


True labels:	25,Cancer,Non-Profit,male
Predicted labels:	Aries,male


Title:	surf english news sites lot looking tidbits korea foreigners li