# Blog Authorship Corpus

Classification is probably the most popular task that you would deal with in real life.  Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict the  information about the writer without knowing about him/her.     We are going to create a classifier that predicts multiple features of the author of a given text.  We have designed it as a Multilabel classification problem. 

Over 600,000 posts from more than 19 thousand bloggers    

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from  blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million  words - or approximately 35 posts and 7250 words per person.    

Each blog is presented as a separate file, the name of which indicates a blogger id# and the  blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and  age but for many, industry and/or sign is marked as unknown.)    

All bloggers included in the corpus fall into one of three age groups:  
8240 "10s" blogs (ages 13-17)

8086 "20s" blogs (ages 23-27)  

2994 "30s" blogs (ages 33-47) 

For each age group, there is an equal number of male and female bloggers.  

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting  has been stripped with two exceptions. Individual posts within a single blogger are separated by the  date of the following post and links within a post are denoted by the label urllink. 

Link to dataset:  https://www.kaggle.com/rtatman/blog-authorship-corpus/

1. Load the dataset (5 points)  
a. Tip: As the dataset is large, use fewer rows. Check what is working well on your  machine and decide accordingly. 

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [0]:
import pandas as pd
df = pd.read_csv("/content/drive/My Drive/PGP AIML/Project Assignment/R8 SNLP Project 1/blog-authorship-corpus/blogtext.csv") # Code to load from Google Drive
#df = pd.read_csv("blog-authorship-corpus/blogtext.csv") # Code for local system load

In [4]:
df.shape

(681284, 7)

In [5]:
train = df.iloc[0:50000,:]
train.shape

(50000, 7)

In [6]:
train.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


2. Preprocess rows of the “text” column (7.5 points)

    a. Remove unwanted characters
    
    b. Convert text to lowercase 
    
    c. Remove unwanted spaces 
    
    d. Remove stopwords  

In [0]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))                  
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words )) 

In [0]:
# Get the number of reviews based on the dataframe column size
num_text = train["text"].size

# Initialize an empty list to hold the clean reviews
clean_train_text = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list 
for i in range( 0, num_text ):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_text.append( review_to_words( train["text"][i] ) )

In [11]:
clean_train_text[0]

'info found pages mb pdf files wait untill team leader processed learns html'

In [12]:
clean_train_text[3682]

'declared war terrorits even noun good luck defeat im sure well take bastard ennui jon stewart'

3. As we want to make this into a multi-label classification problem, you are required to merge  all the label columns together, so that we have all the labels together for a particular sentence  (7.5 points)  

    a. Label columns to merge: “gender”, “age”, “topic”, “sign”  

    b. After completing the previous step, there should be only two columns in your data  frame i.e. “text” and “labels” as shown in the below image
    ![Point3%20image.JPG](attachment:Point3%20image.JPG)


In [13]:
train['gender'] = train['gender'].map(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [14]:
train['label'] = train['gender']+","+train['age'].astype(str)+","+train['topic']+","+train['sign']
train['label']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


0                       male,15,Student,Leo
1                       male,15,Student,Leo
2                       male,15,Student,Leo
3                       male,15,Student,Leo
4        male,33,InvestmentBanking,Aquarius
                        ...                
49995            male,23,Advertising,Taurus
49996            male,23,Advertising,Taurus
49997            male,23,Advertising,Taurus
49998            male,23,Advertising,Taurus
49999            male,23,Advertising,Taurus
Name: label, Length: 50000, dtype: object

In [15]:
train['clean_text'] = clean_train_text
train['clean_text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


0        info found pages mb pdf files wait untill team...
1        team members drewes van der laag urllink mail ...
2        het kader van kernfusie op aarde maak je eigen...
3                                          testing testing
4        thanks yahoo toolbar capture urls popups means...
                               ...                        
49995      aug th thur bought mua chee vcds send home work
49996    aug th wed st day work back dw sent work sent ...
49997    aug th mon zing bd went place cooked dinner to...
49998    aug rd sun went place b goin get zing bd prese...
49999    aug st fri met go shoppin wisma b meetin luver...
Name: clean_text, Length: 50000, dtype: object

In [16]:
mod_train = pd.DataFrame()
mod_train['text'] = train['clean_text']
mod_train['label'] = train['label']
mod_train.head()

Unnamed: 0,text,label
0,info found pages mb pdf files wait untill team...,"male,15,Student,Leo"
1,team members drewes van der laag urllink mail ...,"male,15,Student,Leo"
2,het kader van kernfusie op aarde maak je eigen...,"male,15,Student,Leo"
3,testing testing,"male,15,Student,Leo"
4,thanks yahoo toolbar capture urls popups means...,"male,33,InvestmentBanking,Aquarius"


4. Separate features and labels, and split the data into training and testing (5 points)  

In [0]:
features = mod_train['text']
labels = mod_train['label']


In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)

5. Vectorize the features (5 points)  

    a. Create a Bag of Words using count vectorizer  
        i. Use ngram_range=(1, 2) 
        ii. Vectorize training and testing features  
    
    b. Print the term-document matrix
        

In [19]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 8000) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(X_train)
test_data_features = vectorizer.fit_transform(X_test)
# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()
test_data_features = test_data_features.toarray()
print (test_data_features)

Creating the bag of words...

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [20]:
print (train_data_features.shape)
print (test_data_features.shape)

(35000, 8000)
(15000, 8000)


In [21]:
print(train_data_features)
from sklearn.preprocessing import MaxAbsScaler
scale = MaxAbsScaler()
train_data_features_scale = scale.fit_transform(train_data_features)
print(train_data_features_scale)
test_data_features_scale = scale.fit_transform(test_data_features)
print(test_data_features_scale.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(15000, 8000)


In [22]:
# Take a look at the words in the vocabulary
vocab = vectorizer.get_feature_names()
print (vocab)



In [23]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (tag, count)

aa 188
aaron 62
abandoned 48
abby 79
abc 184
aber 89
abilities 123
ability 84
abit 427
able 57
abortion 2430
abraham 136
abroad 52
absence 69
absent 143
absolute 56
absolutely 156
absorbed 693
abstract 45
absurd 94
abt 56
abuse 285
abused 47
ac 217
academic 70
academy 49
accent 269
accept 81
acceptable 163
acceptance 64
accepted 403
accepting 125
access 83
accessible 255
accident 65
accidentally 389
accompanied 262
accomplish 135
accomplished 50
accomplishment 67
accomplishments 48
according 109
account 102
accounting 53
accounts 546
accurate 506
accused 57
ace 120
ache 100
achieve 67
achieved 62
achievement 82
aching 143
acid 56
ack 108
acknowledge 55
acoustic 89
acquired 103
across 65
act 48
acted 53
acting 1090
action 868
actions 102
active 380
actively 711
activities 298
activity 263
actor 56
actors 350
actress 217
acts 173
actual 145
actually 60
ad 182
ada 462
adam 5116
adams 278
adapt 110
add 366
added 88
addict 866
addicted 511
addiction 52
addictive 138
adding 83
addison 173
ad

6. Create a dictionary to get the count of every label i.e. the key will be label name and value will  be the total count of the label. Check below image for reference (5 points)
![Point6_image.JPG](attachment:Point6_image.JPG)

In [24]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)
dictionary = dict(zip(vocab, dist))

print (dictionary)



7. Transform the labels - (7.5 points)  As we have noticed before, in this task each example can have multiple tags. To deal with  such kind of prediction, we need to transform labels in a binary form and the prediction will be  a mask of 0s and 1s. For this purpose, it is convenient to use ​MultiLabelBinarizer​ from sklearn  
    a. Convert your train and test labels using MultiLabelBinarizer  

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [26]:
#print(X_test)
#print(y_test)
lb = MultiLabelBinarizer()
y_train_label = lb.fit_transform(y_train)
print (y_train_label.shape)
y_test_label = lb.fit_transform(y_test)
print (y_test_label.shape)
y_test_label_inv = lb.inverse_transform(y_test_label)
print (y_test_label)

(35000, 52)
(15000, 52)
[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]


8. Choose a classifier - (5 points)  
    In this task, we suggest using the One-vs-Rest approach, which is implemented in  OneVsRestClassifier​ class. In this approach k classifiers (= number of tags) are trained. As a  basic classifier, use ​LogisticRegression​. It is one of the simplest methods, but often it  performs good enough in text classification tasks. It might take some time because the  number of classifiers to train is large.  
        a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on  every label  
        b. As One-vs-Rest approach might not have been discussed in the sessions, we are  providing you the code for that
![Point8_image.JPG](attachment:Point8_image.JPG)

In [28]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver ='liblinear', max_iter=1000) #liblinear
clas = OneVsRestClassifier(clf)
model1 = clas.fit(train_data_features, y_train_label)

  str(classes[c]))
  str(classes[c]))
  str(classes[c]))
  str(classes[c]))
  str(classes[c]))


In [29]:

print (train_data_features.shape)
print (test_data_features.shape)

(35000, 8000)
(15000, 8000)


In [0]:
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

pred_data = model1.predict(test_data_features)

pred_data_inv = lb.inverse_transform(pred_data)


In [31]:
print (pred_data)
print (pred_data_inv)

[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 1 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]
[(',', '3', '4', 'a', 'd', 'e', 'i', 'l', 'm', 'n', 's', 'u'), (',', '5', 'G', 'S', 'a', 'c', 'd', 'e', 'i', 'l', 'm', 'n', 'r', 't'), (',', '-', '1', '7', 'A', 'M', 'U', 'a', 'e', 'g', 'i', 'k', 'l', 'm', 'n', 'q', 'r', 's', 'u'), (',', 'A', 'a', 'e', 'i', 'l', 'm', 'n', 'r', 's'), (',', '5', 'E', 'a', 'e', 'g', 'i', 'k', 'l', 'm', 'n'), (',', '2', '6', '7', 'a', 'd', 'e', 'f', 'g', 'i', 'l', 'm', 'n', 't'), (',', '1', '7', 'C', 'a', 'c', 'd', 'e', 'f', 'i', 'l', 'm', 'n', 'r', 's'), (',', '3', 'A', 'a', 'c', 'e', 'f', 'i', 'l', 'm', 'r', 'u'), (',', 'a', 'e', 'f', 'i', 'l', 'm', 'n', 'o', 'r'), (',', '2', '3', 'E', 'L', 'a', 'b', 'e', 'f', 'g', 'i', 'l', 'm', 'n', 'r'), (',', '2', '3', 'U', 'a', 'e', 'f', 'i', 'k', 'l', 'm', 'n', 's'), (',', '1', 'A', 'a', 'd', 'e', 'f', 'i', 'l', 'm', 'n', 'q', 'r', 's', 't', 'u'), (',', '6', 'a', 'd', 'e', 'i', 'l', 'm', 'n', 'r',

9. Fit the classifier, make predictions and get the accuracy (5 points)  
    a. Print the following  
        i. Accuracy score  
        ii. F1 score  
        iii. Average precision score  
        iv. Average recall score 
        v. Tip: Make sure you are familiar with all of them. How would you expect the  things to work for the multi-label scenario ? Read about micro/macro/weighted  averaging  

In [32]:
print("Accuracy:" + str(accuracy_score(y_test_label, pred_data)))
print("F1: " + str(f1_score(y_test_label, pred_data, average='micro')))
print("F1_macro: " + str(f1_score(y_test_label, pred_data, average='macro')))
print("Precision: " + str(precision_score(y_test_label, pred_data, average='micro')))
print("Precision_macro: " + str(precision_score(y_test_label, pred_data, average='macro')))
print("Recall: " + str(recall_score(y_test_label, pred_data, average='micro')))
print("Recall_macro: " + str(recall_score(y_test_label, pred_data, average='macro')))


Accuracy:0.0002666666666666667
F1: 0.6580389080506326
F1_macro: 0.3043841469227087
Precision: 0.7057965447544837
Precision_macro: 0.3319169337025707
Recall: 0.616334696354982
Recall_macro: 0.2883351212836844


10. Print true label and predicted label for any five examples (7.5 points) 

In [41]:
X_test.index

Int64Index([33553,  9427,   199, 12447, 39489, 42724, 10822, 49498,  4144,
            36958,
            ...
            24884, 26210, 27736, 38444,  4880, 15168, 49241, 39317, 42191,
            15109],
           dtype='int64', length=15000)

In [34]:
print ("Sample Text: \n",X_test[233])
print ("\nActual Label: \n",y_test[233])
print ("\nTransformed Actual Label: \n",y_test_label[233])
print ("\nPredicted Label: \n",pred_data[233])
print ("\nInverse Transformed Actual Label: \n",y_test_label_inv[233])
print ("\nInverse Transformed Predicted Label: \n",pred_data_inv[233])

Sample Text: 
 power love say angels birth hell devils warp heaven one knew truth time fled space space time none shelter concern love seeks blind heart lonely one none forsee path blinding love unbind self legends knights shinning armor reincarnated soul saints rebirth blessing gods shone awe splendour hidden seed love

Actual Label: 
 male,15,Student,Aquarius

Transformed Actual Label: 
 [1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0
 1 0 1 1 0 1 0 0 1 0 1 0 0 0 1]

Predicted Label: 
 [1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 1 0
 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0]

Inverse Transformed Actual Label: 
 (',', '1', '7', 'M', 'V', 'a', 'e', 'g', 'i', 'l', 'm', 'o', 'r', 't', 'y')

Inverse Transformed Predicted Label: 
 (',', '1', '7', 'S', 'U', 'a', 'd', 'e', 'g', 'i', 'k', 'l', 'm', 'n', 'r', 's', 't', 'u')


In [36]:
print ("Sample Text: \n",X_test[199])
print ("\nActual Label: \n",y_test[199])
print ("\nTransformed Actual Label: \n",y_test_label[199])
print ("\nPredicted Label: \n",pred_data[199])
print ("\nInverse Transformed Actual Label: \n",y_test_label_inv[199])
print ("\nInverse Transformed Predicted Label: \n",pred_data_inv[199])

Sample Text: 
 one best gothic poetry sites web opinion one urllink http www ravensrants com index html raven poetry brilliant breaking gothic poetry stereotype everyone thinks gothic poetry poems death crystal night walked one crystal night air cold black every star held bright true midnight back moon hung low trees silent ghost sat fields frigid post night gorgeous like movie scene even trees kept still keep view pristine mind began wonder thoughts place wished share someone nature purest face around soul heart make sound knew alone began look around heart felt empty like space stars knew night alone heal scars began walking back night knew would need someone help win fight must leave scene time share song without rhyme raven poetry interests must check site even goth open mind experience something new

Actual Label: 
 female,37,indUnk,Aquarius

Transformed Actual Label: 
 [1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0
 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0]

Pred

In [38]:
print ("Sample Text: \n",X_test[14168])
print ("\nActual Label: \n",y_test[14168])
print ("\nTransformed Actual Label: \n",y_test_label[14168])
print ("\nPredicted Label: \n",pred_data[14168])
print ("\nInverse Transformed Actual Label: \n",y_test_label_inv[14168])
print ("\nInverse Transformed Predicted Label: \n",pred_data_inv[14168])

Sample Text: 
 sigh happy know trying bestest look bright side things think positively moment working stopped taking herbal happy pills anything makes wonder like taking bear thinking think stress work busy kwik save road closing customers extra moany co op extortionatley expensive well could job fairly soon find new one fairly quickly unemployed able survive shut kate depressing mad dreams lately one last night one really old fashioned school late class really nasty teacher reason always get lost end later ever want talk nice ah well day tomorrow really need week suppose make one day things grateful idea story actually makes enthusiastic fact dog barking window cats garden cheese cheesy rock time cheese

Actual Label: 
 female,25,indUnk,Virgo

Transformed Actual Label: 
 [1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 0
 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0]

Predicted Label: 
 [1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1
 1 0 1 1 1 1 0 

In [42]:
print ("Sample Text: \n",X_test[12447])
print ("\nActual Label: \n",y_test[12447])
print ("\nTransformed Actual Label: \n",y_test_label[12447])
print ("\nPredicted Label: \n",pred_data[12447])
print ("\nInverse Transformed Actual Label: \n",y_test_label_inv[12447])
print ("\nInverse Transformed Predicted Label: \n",pred_data_inv[12447])

Sample Text: 
 taking break urllink urllink

Actual Label: 
 female,24,indUnk,Sagittarius

Transformed Actual Label: 
 [1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0
 1 1 1 1 1 0 0 1 1 1 0 1 0 0 0]

Predicted Label: 
 [1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0
 1 0 1 1 1 1 0 0 1 1 1 1 0 0 0]

Inverse Transformed Actual Label: 
 (',', '1', '7', 'A', 'U', 'a', 'd', 'e', 'f', 'i', 'k', 'l', 'm', 'n', 'q', 'r', 's', 'u')

Inverse Transformed Predicted Label: 
 (',', '2', 'E', 'a', 'c', 'd', 'e', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'u')


In [43]:
print ("Sample Text: \n",X_test[9427])
print ("\nActual Label: \n",y_test[9427])
print ("\nTransformed Actual Label: \n",y_test_label[9427])
print ("\nPredicted Label: \n",pred_data[9427])
print ("\nInverse Transformed Actual Label: \n",y_test_label_inv[9427])
print ("\nInverse Transformed Predicted Label: \n",pred_data_inv[9427])

Sample Text: 
 hey im great mood lately yes reasons yet desire disclose later lets say things going really really well im happy except hello dolly hate feel pointless like seriously died one would like oh cant show w e im musical next year want stage manage cuz think thatll fun least something new interesting ya thatll fun well thats ill bbl

Actual Label: 
 male,16,indUnk,Cancer

Transformed Actual Label: 
 [1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0
 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]

Predicted Label: 
 [1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0
 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0]

Inverse Transformed Actual Label: 
 (',', '1', '7', 'G', 'U', 'a', 'd', 'e', 'i', 'k', 'l', 'm', 'n')

Inverse Transformed Predicted Label: 
 (',', '1', 'C', 'U', 'a', 'c', 'e', 'i', 'k', 'l', 'm', 'n', 'r')
