## Deep Learning model to predict zodiac sign and gender based on blog posts.

In [0]:
import pandas as pd
import numpy as np

In [0]:
import requests 
import xml.etree.ElementTree as et
import os
import csv

Creating an empty pandas dataframe; each row will hold the metadata encoded in one XML file name

In [0]:
df = pd.DataFrame(columns=['ID', 'Gender', 'Age','Occ','Star','fileType'])

In [0]:
import xml.dom.minidom

Directory of files

In [0]:
#os.listdir('blogs_train')

Each file name has the form `ID.Gender.Age.Occupation.Star.xml`, so splitting it on `.` yields one row of metadata per file, which is added to the dataframe above

In [0]:
i=0
for filename in os.listdir(r'Drive/ML/blogs_train'):
    if filename.endswith('.xml'):
        a=filename.split('.')
        df.loc[i]=a
        i=i+1
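
The split in the loop above only works when a file name has exactly six dot-separated parts. Below is a minimal sketch of the same idea as a standalone function with an explicit check. The function name `parse_blog_filename` and the constant `FIELDS` are illustrative and not part of the original notebook.

```python
# Hypothetical helper; the field names mirror the dataframe columns above.
FIELDS = ['ID', 'Gender', 'Age', 'Occ', 'Star', 'fileType']

def parse_blog_filename(filename):
    """Split 'ID.Gender.Age.Occ.Star.xml' into a dict of metadata fields."""
    parts = filename.split('.')
    if len(parts) != len(FIELDS):
        raise ValueError('unexpected file name: %r' % filename)
    return dict(zip(FIELDS, parts))

row = parse_blog_filename('3342873.male.15.Student.Aquarius.xml')
print(row['Gender'], row['Star'])
```

Raising on an unexpected name makes a stray file in the directory fail loudly instead of producing a misaligned row.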

In [7]:
df.head()

Unnamed: 0,ID,Gender,Age,Occ,Star,fileType
0,3342873,male,15,Student,Aquarius,xml
1,3220764,female,24,Government,Taurus,xml
2,3114279,male,24,Student,Taurus,xml
3,3214684,male,24,Engineering,Cancer,xml
4,3227750,male,23,Technology,Scorpio,xml


Creating a placeholder column `PostCol` (initialised as a copy of `fileType`) that will later hold the post text

In [0]:
df['PostCol']=df['fileType']

In [9]:
df.head()

Unnamed: 0,ID,Gender,Age,Occ,Star,fileType,PostCol
0,3342873,male,15,Student,Aquarius,xml,xml
1,3220764,female,24,Government,Taurus,xml,xml
2,3114279,male,24,Student,Taurus,xml,xml
3,3214684,male,24,Engineering,Cancer,xml,xml
4,3227750,male,23,Technology,Scorpio,xml,xml


Parsing the XML files to get the blog post text for each user, then storing it in the new column.
The post dates are ignored because the model does not use them.

In [0]:
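from lxml import html

blog_dir = r'Drive/ML/blogs_train'
i=0
for filename in os.listdir(blog_dir):
    # apply the same '.xml' filter as the metadata loop so that row i
    # refers to the same file in both loops
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(blog_dir, filename)
    # lxml's HTML parser recovers from malformed markup
    tree=html.parse(fullname)
    root=tree.getroot()

    for post in root.iter('post'):
        # .loc avoids pandas chained assignment; each <post> overwrites the
        # previous one, so only the last post of a file is kept
        df.loc[i, 'PostCol']=post.text
    i=i+1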
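
The inner loop assigns to the same cell for every `<post>`, so only the last post of each file survives. If all posts of a blogger are wanted, they can be collected and joined instead. The sketch below runs on an inline document and assumes a `<Blog>` root with alternating `<date>` and `<post>` children, which is an assumption about the file layout. It uses the standard-library parser rather than lxml.

```python
import xml.etree.ElementTree as ET

# Inline sample; the <Blog>/<date>/<post> layout is an assumption
sample = """<Blog>
<date>01,January,2004</date>
<post>  First post.  </post>
<date>02,January,2004</date>
<post>
  Second post.
</post>
</Blog>"""

root = ET.fromstring(sample)
# Collect every post, strip surrounding whitespace, ignore the dates
posts = [(p.text or '').strip() for p in root.iter('post')]
all_text = ' '.join(posts)
print(all_text)
```

The strict standard-library parser rejects malformed markup, so real files may still need a lenient parser such as the one used in the notebook.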

In [12]:
df.head()

Unnamed: 0,ID,Gender,Age,Occ,Star,fileType,PostCol
0,3342873,male,15,Student,Aquarius,xml,\r\n\r\n\t \r\n The Eagle Has Landed ...
1,3220764,female,24,Government,Taurus,xml,\r\n\r\n\t \r\n It is hot and it is raini...
2,3114279,male,24,Student,Taurus,xml,\r\n\r\n\t \r\nNick : \r\n It's a littl...
3,3214684,male,24,Engineering,Cancer,xml,"\r\n\r\n \r\n Hey guys, been long tim..."
4,3227750,male,23,Technology,Scorpio,xml,\r\n\r\n\t \r\n Ugh. This is my second at...


Creating a function that cleans the text of a blog post using BeautifulSoup, a letters-only filter, and stop-word removal

In [16]:
import re
from bs4 import BeautifulSoup 
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stop-word set once: searching a set is much faster
# than searching a list
stops = set(stopwords.words("english"))

def post_to_words( bp ):
    # Function to convert a blog post to a string of lower-case words
    # 1. Remove HTML (the parser is named explicitly to avoid a bs4 warning)
    p_text = BeautifulSoup(bp, "lxml").get_text()
    # 2. Replace non-letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", p_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 5. Join the words back into one string separated by spaces
    # and return the result
    return " ".join(meaningful_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
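
To make the cleaning steps concrete, here is a small standard-library sketch of steps 2 to 5 on a single sentence. The function name `clean_text` and the five-word stop list are assumptions for illustration; NLTK's English list is much longer, and the HTML-stripping step is left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer)
SAMPLE_STOPS = {'it', 'is', 'and', 'the', 'a'}

def clean_text(text, stops=SAMPLE_STOPS):
    """Keep letters only, lower-case the text, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in stops)

cleaned = clean_text("It is hot, and it is raining: 3 fat summer raindrops!")
contraction = clean_text("It's a little")
print(cleaned)
print(contraction)
```

The second call shows a side effect of the letters-only filter: the apostrophe becomes a space, so a contraction such as "It's" leaves a stray one-letter token behind.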


Applying said function to the data

In [0]:
# clean every post; apply() avoids chained-assignment writes into the dataframe
df['words']=df['PostCol'].astype(str).apply(post_to_words)

In [18]:
df.head()

Unnamed: 0,ID,Gender,Age,Occ,Star,fileType,PostCol,words
0,3342873,male,15,Student,Aquarius,xml,\r\n\r\n\t \r\n The Eagle Has Landed ...,eagle landed apollo touches lunar surface july...
1,3220764,female,24,Government,Taurus,xml,\r\n\r\n\t \r\n It is hot and it is raini...,hot raining fat summer raindrops size eyeballs...
2,3114279,male,24,Student,Taurus,xml,\r\n\r\n\t \r\nNick : \r\n It's a littl...,nick little got home work short ago pulling mi...
3,3214684,male,24,Engineering,Cancer,xml,"\r\n\r\n \r\n Hey guys, been long tim...",hey guys long time guess every one kicking goo...
4,3227750,male,23,Technology,Scorpio,xml,\r\n\r\n\t \r\n Ugh. This is my second at...,ugh second attempt third post blog last night ...


Dropping the columns that the models do not use, as part of pre-processing.
The result goes into a new dataframe so that the original stays intact, which makes it easier to recover from errors.

In [0]:
dff=df.drop(['ID','Age','Occ','fileType','PostCol'],axis=1)

Defining stop words (for vectorizer)

In [20]:
stp = nltk.corpus.stopwords.words('english')
with open('./stopwords_eng.txt', 'w') as outfile:
    outfile.write('\n'.join(stp))
    
    
with open('./stopwords_eng.txt', 'r') as infile:
    stop_words = infile.read().splitlines()
print('stop words %s ...' %stop_words[:5])

stop words ['i', 'me', 'my', 'myself', 'we'] ...


Defining the vectorizer.
A TF-IDF vectorizer is used instead of a plain count vectorizer; it down-weights terms that appear in many posts.
Vectorization is needed because the network takes numeric input, not raw text.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import string
import re

porter_stemmer = nltk.stem.porter.PorterStemmer()

def porter_tokenizer(text, stemmer=porter_stemmer):
    """
    A Porter-Stemmer-Tokenizer hybrid that splits a text into words (tokens)
    and applies the Porter stemming algorithm to each token.
    Tokens that contain anything other than ASCII letters (punctuation,
    digits) are removed.
    
    Parameters
    ----------
        
    text : `str`. 
      The text that is to be split into words.
    stemmer : object with a `stem` method.
      Stemmer applied to each token (default: NLTK Porter stemmer).
        
    Returns
    ----------
    
    no_punct : `list` of `str`. 
      A list of tokens after stemming and removing non-letter tokens.
    
    """
    lower_txt = text.lower()
    tokens = nltk.wordpunct_tokenize(lower_txt)
    stems = [stemmer.stem(t) for t in tokens]
    no_punct = [s for s in stems if re.match('^[a-zA-Z]+$', s) is not None]
    return no_punct

tfidf = TfidfVectorizer(encoding='utf-8',decode_error='replace',strip_accents='unicode',
            analyzer='word',
            binary=False,
            stop_words=stop_words,
            tokenizer=porter_tokenizer
    )

#vec = CountVectorizer(
#            encoding='utf-8',
#            decode_error='replace',
#            strip_accents='unicode',
#            analyzer='word',
#            binary=False,
#            stop_words=stop_words,
#            tokenizer=porter_tokenizer,
#            ngram_range=(2,2)
#    )
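
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF vectors by hand for three tiny, already tokenised documents. It assumes the smoothed inverse document frequency idf = ln((1 + n) / (1 + df)) + 1 and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function name `tfidf_vectors` and the sample documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed idf and L2 normalisation for tokenised docs."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # document frequency: number of documents that contain each term
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, idf, vectors

docs = [['hot', 'rain', 'summer'],
        ['rain', 'rain', 'work'],
        ['work', 'home', 'rain']]
vocab, idf, vectors = tfidf_vectors(docs)
print(vocab)
print(round(idf['rain'], 3), round(idf['home'], 3))
```

A term that occurs in every document ("rain") gets the minimum idf of 1, while a term that occurs in a single document ("home") gets a larger weight.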

Label Encoder for outputs

In [22]:
dff.head()

Unnamed: 0,Gender,Star,words
0,male,Aquarius,eagle landed apollo touches lunar surface july...
1,female,Taurus,hot raining fat summer raindrops size eyeballs...
2,male,Taurus,nick little got home work short ago pulling mi...
3,male,Cancer,hey guys long time guess every one kicking goo...
4,male,Scorpio,ugh second attempt third post blog last night ...


In [0]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
dff['Gender']=le.fit_transform(dff['Gender'])
le=preprocessing.LabelEncoder()
dff['Star']=le.fit_transform(dff['Star'])


In [24]:
dff.head()

Unnamed: 0,Gender,Star,words
0,1,0,eagle landed apollo touches lunar surface july...
1,0,10,hot raining fat summer raindrops size eyeballs...
2,1,10,nick little got home work short ago pulling mi...
3,1,2,hey guys long time guess every one kicking goo...
4,1,9,ugh second attempt third post blog last night ...
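
The integer codes in the table follow from the way `LabelEncoder` numbers the sorted unique labels. A minimal sketch that reproduces the mapping without scikit-learn; the names `SIGNS`, `to_code` and `to_label` are illustrative:

```python
SIGNS = ['Aries', 'Taurus', 'Gemini', 'Cancer', 'Leo', 'Virgo', 'Libra',
         'Scorpio', 'Sagittarius', 'Capricorn', 'Aquarius', 'Pisces']

# Sorted unique labels are mapped to consecutive integers
classes = sorted(set(SIGNS))
to_code = {label: code for code, label in enumerate(classes)}
to_label = {code: label for label, code in to_code.items()}

print(to_code['Aquarius'], to_code['Taurus'], to_code['Cancer'], to_code['Scorpio'])
```

Keeping the reverse mapping (`to_label` here) is useful for reading the classification reports, because the variable `le` is reassigned for `Star` and the gender encoder is no longer available. With sorted labels the gender codes are female = 0 and male = 1.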


I chose to create two neural networks, one for each output (i.e. gender and star).

### Model 1: predicting gender

In [0]:
#cols = dff.columns.values
X= dff['words']
y = dff['Gender']

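Splitting the data into training and test sets with K-Fold. The loop overwrites the split on every iteration, so only the last of the 10 folds is used below (a single hold-out split rather than full cross-validation).

In [0]:
from sklearn.model_selection import KFold
#random_state is omitted: it only has an effect when shuffle=True
kf = KFold(n_splits=10, shuffle=False)
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]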
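
To see which rows end up in the final split, the sketch below reproduces contiguous, unshuffled fold boundaries in plain Python. The sample count of 5110 is inferred from the 4599 training rows and 511 test rows reported later in this notebook, and `kfold_indices` is a hypothetical stand-in for `KFold.split`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        # the first `extra` folds receive one additional sample
        size = base + (1 if k < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train_idx, test_idx in kfold_indices(5110, 10):
    pass  # as in the loop above, each iteration overwrites the previous split

print(len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

After the loop only the last fold remains: the final 511 rows form the test set and all earlier rows form the training set.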

Fitting the vectorizer

In [27]:
tfidf = tfidf.fit(dff['words'])



In [28]:
#vocabulary_ maps each term to its column index; its length is the number of features
len(tfidf.vocabulary_)

37305

In [0]:
X1=tfidf.transform(X_train)
X1=X1.toarray()

In [0]:
X_tr=X1

In [0]:
X_ts=tfidf.transform(X_test)
X_ts=X_ts.toarray()

Next, building the deep learning model.
I use the Keras library.

In [32]:
X1.shape

(4599, 37305)

In [33]:
X1.shape[1]

37305

In [34]:
from keras.models import Sequential
from keras.layers import Dense

#create model
model = Sequential()
#get number of columns in training data
datan_cols = X1.shape[1]
#add model layers (two hidden layers of 10 nodes each and an output layer)
model.add(Dense(10, activation='relu', input_dim=datan_cols))
model.add(Dense(10, activation='relu'))
model.add(Dense(1,activation='sigmoid')) #sigmoid activation because we have binary output

Using TensorFlow backend.
W0705 13:20:44.574489 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0705 13:20:44.594970 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0705 13:20:44.597851 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
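
For a sense of scale, the number of trainable parameters in this network can be worked out by hand: each dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic; the helper name `dense_params` is illustrative:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [37305, 10, 10, 1]  # input features, two hidden layers, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

Almost all parameters sit in the first layer, and the total is roughly 80 times the 4599 training samples, so the network has ample capacity to memorise the training set.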



In [35]:
#compile the model; accuracy is tracked as the measure of model performance
#since gender is a binary class, I chose the binary cross-entropy loss function
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

W0705 13:20:44.682618 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0705 13:20:44.709288 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0705 13:20:44.716001 140294004193152 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [36]:
from keras.callbacks import EarlyStopping
#set early stopping monitor so the model stops training when it won't improve anymore
early_stopping_monitor = EarlyStopping(monitor='acc',patience=3)
#train model
model.fit(X_tr, y_train, epochs=100, batch_size=5,verbose=0,callbacks=[early_stopping_monitor])

W0705 13:20:46.852483 140294004193152 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.



<keras.callbacks.History at 0x7f9876ec25f8>
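
The patience rule behind early stopping can be sketched in a few lines: training stops once the monitored value has failed to improve for `patience` consecutive epochs. The function below is a simplified model of that rule for a quantity that should increase, not the Keras implementation, and the accuracy values are made up.

```python
def early_stop_epoch(history, patience):
    """Return the 0-based epoch at which training stops, or None."""
    best = float('-inf')
    wait = 0
    for epoch, value in enumerate(history):
        if value > best:          # a strict improvement resets the counter
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

acc = [0.60, 0.70, 0.70, 0.69, 0.70, 0.80]  # made-up training accuracies
stop_at = early_stop_epoch(acc, patience=3)
print(stop_at)
```

Because the monitor here is the training accuracy (`'acc'`), the rule reacts to a plateau on the training data only. Stopping on held-out performance would require passing validation data to `fit` and monitoring the corresponding validation metric.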

In [37]:
# evaluate the model
train_acc = model.evaluate(X_tr, y_train, verbose=0)
test_acc = model.evaluate(X_ts, y_test, verbose=0)
print('Train:',train_acc,'Test:', test_acc)

Train: [0.020193101918239725, 0.9884757555990432] Test: [1.4721486092546914, 0.6360078267388615]
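
In each pair printed above, the first number is the binary cross-entropy loss and the second is the accuracy. The sketch below computes the loss by hand for a few made-up predictions to give the values some context. The function name and the clipping constant `eps` are assumptions for illustration.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

guess = binary_crossentropy([1, 0], [0.5, 0.5])   # uninformative prediction
right = binary_crossentropy([1, 0], [0.9, 0.1])   # confident and correct
wrong = binary_crossentropy([1, 0], [0.1, 0.9])   # confident and incorrect
print(round(guess, 4), round(right, 4), round(wrong, 4))
```

Always predicting 0.5 gives a loss of ln 2, about 0.693. The test loss of about 1.47 is well above that, which suggests the model is confidently wrong on a number of test samples, a common sign of overfitting.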


In [0]:
#predictions: threshold the sigmoid output at 0.5 to get class labels
y_pred=(model.predict(X_ts) > 0.5).astype('int32')

In [39]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.58      0.56      0.57       220
           1       0.68      0.69      0.68       291

    accuracy                           0.64       511
   macro avg       0.63      0.63      0.63       511
weighted avg       0.63      0.64      0.64       511
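
The columns of the report are derived from counts of true positives, false positives and false negatives for each class. A minimal sketch of those definitions on made-up labels (not taken from the notebook's data); the helper name `prf` is illustrative:

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels for illustration
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = prf(y_true_demo, y_pred_demo, positive=1)
print(precision, recall, f1)
```

"Support" in the report is simply the number of test samples whose true label is that class.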



### Model 2: predicting star

In [0]:
X= dff['words']
y1 = dff['Star']

In [0]:
from sklearn.model_selection import KFold
#random_state is omitted: it only has an effect when shuffle=True
kf = KFold(n_splits=10, shuffle=False)
#the loop overwrites the split each time, so only the last fold is kept
for train_index, test_index in kf.split(X):
    X_trn, X_tst, y_tr, y_ts = X.iloc[train_index], X.iloc[test_index], y1.iloc[train_index], y1.iloc[test_index]

In [0]:
tr_X=tfidf.transform(X_trn)
tr_X=tr_X.toarray()

In [0]:
ts_X=tfidf.transform(X_tst)
ts_X=ts_X.toarray()

In [0]:
#create model
model_s = Sequential()
#get number of columns in training data
datan_cols = tr_X.shape[1]
#add model layers (one hidden layer of 50 nodes and an output layer)
model_s.add(Dense(50, activation='relu', input_dim=datan_cols))
#'softmax' activation in order to predict the probability for each class;
#12 nodes to account for the 12 classes, i.e. stars
model_s.add(Dense(12,activation='softmax'))
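
For reference, softmax turns the 12 raw outputs of the last layer into probabilities that sum to 1, and the predicted class is the index of the largest one. A small standard-library sketch with made-up input values:

```python
import math

def softmax(logits):
    """Exponentiate shifted logits and normalise so they sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1] + [0.0] * 9)   # 12 made-up logits
predicted_class = probs.index(max(probs))
print(round(sum(probs), 6), predicted_class)
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large inputs.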

In [0]:
#compile the model with SGD (momentum 0.9); accuracy is tracked as the measure of model performance
from keras.optimizers import SGD
opt = SGD(lr=0.01, momentum=0.9)
#sparse categorical cross-entropy: a multi-class loss that takes integer class labels
model_s.compile(optimizer=opt, loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [0]:
from keras.callbacks import EarlyStopping
#set early stopping monitor so the model stops training when it won't improve anymore
es_monitor = EarlyStopping(monitor='acc',patience=3)
#train model
history = model_s.fit(tr_X, y_tr, epochs=100, batch_size=5,verbose=0,callbacks=[es_monitor])

In [140]:
# evaluate the model
train_acc1 = model_s.evaluate(tr_X, y_tr, verbose=0)
test_acc1 = model_s.evaluate(ts_X, y_ts, verbose=0)
print('Train:',train_acc1,'Test:',test_acc1)

Train: [0.12224617259664804, 0.9710806697108066] Test: [4.335233559580465, 0.078277886438743]
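
The loss values above are easier to read against a reference point. Sparse categorical cross-entropy is the mean of minus the log of the probability assigned to the true class, so a model that spreads its probability evenly over the 12 signs scores ln 12. A short sketch under that definition; the function name and the clipping constant `eps` are illustrative:

```python
import math

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(max(p[y], eps)) for y, p in zip(y_true, probs)]
    return sum(losses) / len(losses)

uniform = [1 / 12] * 12
# Loss of guessing every sign with equal probability (true class 3 is arbitrary)
baseline = sparse_categorical_crossentropy([3], [uniform])
print(round(baseline, 3))
```

The uniform baseline is about 2.48, while the reported test loss is about 4.34. The model therefore does worse than an even guess in terms of loss: it assigns very little probability to the true sign on many test samples, which fits a network that has memorised its training data.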


In [0]:
#predictions: pick the class with the highest softmax probability
y_prd=np.argmax(model_s.predict(ts_X), axis=1)

In [143]:
#evaluating the model
from sklearn.metrics import classification_report
print(classification_report(y_ts, y_prd))

              precision    recall  f1-score   support

           0       0.05      0.02      0.03        46
           1       0.08      0.04      0.05        50
           2       0.10      0.02      0.04        43
           3       0.10      0.11      0.10        38
           4       0.02      0.02      0.02        46
           5       0.22      0.09      0.13        44
           6       0.09      0.06      0.07        47
           7       0.12      0.26      0.17        47
           8       0.00      0.00      0.00        36
           9       0.06      0.24      0.10        34
          10       0.06      0.12      0.08        33
          11       0.00      0.00      0.00        47

    accuracy                           0.08       511
   macro avg       0.08      0.08      0.07       511
weighted avg       0.08      0.08      0.07       511



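In conclusion, the metrics for astrological sign make sense: there is no obvious correlation between blog posts and zodiac sign, so a prediction has roughly a 1/12 (about 8%) chance of being correct, which matches the 8% test accuracy. The 97% training accuracy shows that the network memorised the training data without learning anything that generalises. I attribute these scores to the lack of correlation in the data and not to the model itself.

The model for gender, however, showed some correlation: its test accuracy was about 64%, compared with about 57% for always predicting the majority class (291 of the 511 test samples) and the 50% expected from a coin flip. The gap between 99% training accuracy and 64% test accuracy indicates substantial overfitting.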
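
As a sanity check on the chance levels, the baselines can be computed directly from the support counts in the two classification reports. A minimal sketch; the helper name `majority_baseline` is illustrative:

```python
# Per-class test-set sizes copied from the "support" columns of the reports
gender_support = [220, 291]
star_support = [46, 50, 43, 38, 46, 44, 47, 47, 36, 34, 33, 47]

def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support) / sum(support)

gender_base = majority_baseline(gender_support)
star_base = majority_baseline(star_support)
print(round(gender_base, 3), round(star_base, 3), round(1 / 12, 3))
```

Comparing a model against the majority-class baseline is stricter than comparing it against a uniform guess whenever the classes are unbalanced, as they are for gender in this test fold.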