<a href="https://colab.research.google.com/github/cakennedy/266-mbti-project/blob/main/notebooks/PredictPersonality_Type_1018.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Predicting Personality Type Through Machine Learning - W266 Fall '22 Final Project**

By *John Clark, Courtney Kennedy and Shrinivas Joshi*



# **Introduction**


##Problem Background


##Our Dataset

We have chosen to utilize two files from the TypologyCentral forum. These files were downloaded on September 25th, 2022. 

* **typology_xenforo-9-25-22.csv** - a csv file containing information on users of the Typology Central Internet forum
* **typology_xenforo-9-25-22-posts.csv** - a csv file containing posts of those users

###typology_xenforo-9-25-22.csv
*   Username - Unique key identifying the user's identity
*   Age - Age rounded to number of years
*   Posts - Number of posts made by the user
*   MBTI Type - Myers Briggs type https://www.myersbriggs.org/my-mbti-personality-type/mbti-basics/the-16-mbti-types.htm
*   Enneagram - Enneagram type - https://www.typologycentral.com/wiki/index.php/Typology_Central_Wiki_Main_Page#The_Enneagram_Types_-_Profile_Descriptions
*   Instinctual Variant - Instinctual variant associated with the enneagram type https://enneagramgift.com/enneagram-subtypes/
*   Gender - Optional user supplied gender
*   Occupation - Optional user supplied occupation

###typology_xenforo-9-25-22-posts.csv

After stripping out unnecessary fields, the following fields are in the post database:
*   post_id - Sequentially assigned number uniquely identifying each post
*   thread_id - Sequentially assigned number that identifies the thread a post is associated with
*   Username - Unique key identifying the user's identity
*   post_date - Date that post was made
*   message - Content of the post

An inner join of these two files was performed on Username to produce the file used in this project, where MBTI Type or Enneagram are the labels.
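
As a minimal sketch of the join described above, the following uses two tiny made-up tables (the usernames, types and messages are illustrative and are not taken from the dataset). It shows that an inner join on `Username` keeps only posts whose author appears in the users file; users without posts and posts without a matching user both drop out.

```python
import pandas as pd

# Hypothetical miniature versions of the two files (all values are made up)
users = pd.DataFrame({
    "Username": ["alice", "bob", "carol"],
    "MBTI Type": ["INTJ", "ENFP", "ISTP"],
})
posts = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "Username": ["alice", "alice", "bob", "dave"],
    "message": ["first post", "second post", "hello", "no profile"],
})

# Inner join: keep only rows whose Username exists in both tables.
# validate="one_to_many" raises if a Username is duplicated in `users`.
joined = users.merge(posts, on="Username", how="inner", validate="one_to_many")

print(joined[["Username", "MBTI Type", "post_id"]])
print("rows:", len(joined))
```

Here `carol` (no posts) and `dave` (no user record) are both absent from the result, which is why the merged file can be much smaller than the raw posts file.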

# **Setup, Data Load, Exploratory Data Analysis and Data Cleaning**

##Import Libraries

In [None]:
# Import libraries
from google.colab import files
from google.cloud import storage
import pandas as pd
import io
from io import BytesIO
import matplotlib.pyplot as plt
import altair as alt
import numpy as np
import textwrap
import nltk
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk import word_tokenize
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split






[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
#Set colors
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

In [None]:
#Upload Google Cloud service account key to enable authentication ( json file )
# Go to https://console.cloud.google.com/:
# Under the Navigation Menu ( upper left 3 horizontal lines) 
# 1. choose IAM & Admin>
# 2. choose Service Accounts>
# 3. Select a Service Account>
# 4. Under the Actions menu ( 3 dots to the right of the service account )>Manage Keys to create your own json credentials file

uploaded = files.upload()

Saving pacific-castle-360400-a3ca89f64de6.json to pacific-castle-360400-a3ca89f64de6 (1).json


## Data survey/analysis and cleanup

We began our efforts by loading the two datasets, conducting preliminary exploratory data analysis and cleaning the data. After taking steps to balance the dataset, we created train, dev and test data/label datasets. 

Specific steps that we took include the following: 
*   Conduct post data cleansing, including removing mentions of MBTI and Enneagram types and removing quotes from other posters
*   Print statistics on the data
*   Remove rows with null or incorrect values
*   Join the two datasets 
*   Perform further preprocessing including removal of newline, bold, italics, underline and strikethrough markup (stop-word removal and lemmatization are currently commented out in the preprocessing routine)
*   Shuffle data; then create labels; create train, dev and test datasets
*   Test for class imbalance and over/undersample to correct for imbalance
*   Create train, dev and test CSV files for input into Base Case Notebook
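
The class-imbalance step in the list above can be sketched as follows. This is only an illustration under stated assumptions: it uses simple random oversampling (one common option; the notebook does not say which resampling method was used), the helper name `oversample_to_majority` is hypothetical, and the rows are made up. Resampling of this kind would normally be applied to the training split only.

```python
import random
from collections import Counter

def oversample_to_majority(rows, label_of, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        balanced.extend(group)
        # Draw extra samples with replacement to reach the target size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Made-up training rows: (message, MBTI type)
train_rows = [("a", "INFP")] * 6 + [("b", "INTJ")] * 3 + [("c", "ESFP")] * 1

print("before:", Counter(label for _, label in train_rows))
balanced_rows = oversample_to_majority(train_rows, label_of=lambda r: r[1])
print("after: ", Counter(label for _, label in balanced_rows))
```

Oversampling keeps every original row but repeats minority-class posts, so it can encourage overfitting on rare types; undersampling the majority classes is the usual alternative when the dataset is large.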

###Download Datasets 

In this step, we download our input datasets from a Google Cloud Storage bucket using a JSON service account key for authentication.

In [None]:

#Load Google Cloud storage client using service key
storage_client = storage.Client.from_service_account_json('pacific-castle-360400-a3ca89f64de6.json')

#Print buckets available
for bucket in storage_client.list_buckets():
    print(bucket)

#Assign bucket name being used
bucket_name = '266csffile'

#Get bucket
bucket = storage_client.get_bucket(bucket_name)

#Show list of files in bucket and list the files
filename = list(bucket.list_blobs(prefix=''))
for name in filename:
  print(name.name)




<Bucket: 266csffile>
dev_mbti_data.csv
dev_mbti_labels.csv
test_mbti_data.csv
test_mbti_labels.csv
train_mbti_data.csv
train_mbti_labels.csv
typology_merged.parquet
typology_users_clean.csv
typology_xenforo-9-25-22-posts.csv
typology_xenforo-9-25-22.csv
typology_xenforo-9-25-22_clean.parquet


In [None]:
#Increase field size to allow reading in of files
import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 as long as overflow error occurs 
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

In [None]:
#Grab the cleaned header file and download as string
header_filename = 'typology_users_clean.csv'
blobby = bucket.blob(header_filename)
blobby_string = blobby.download_as_string()

#Convert to pandas
mbti_header_data = pd.read_csv(io.BytesIO(blobby_string) )


In [None]:
#Grab the cleaned posts file and download as string

posts = bucket.blob('typology_xenforo-9-25-22_clean.parquet')
string_data_posts = posts.download_as_string()

#Convert to pandas
post_data = pd.read_parquet(io.BytesIO(string_data_posts))

In [None]:
# what did we get?

print( "users/header df shape:", mbti_header_data.shape )
print( "users/header df columns:", mbti_header_data.columns )

print( "posts df shape:", post_data.shape )
print( "posts df columns:", post_data.columns )


users/header df shape: (15414, 13)
users/header df columns: Index(['Unnamed: 0', 'Username', 'Age', 'Posts', 'MBTI Type', 'Enneagram',
       'Instinctual Variant', 'Gender', 'Occupation', 'is_I', 'is_S', 'is_T',
       'is_J'],
      dtype='object')
posts df shape: (3328436, 5)
posts df columns: Index(['post_id', 'thread_id', 'Username', 'post_date', 'message'], dtype='object')


In [None]:
# utility for writing a csv to our shared GCloud storage

def write_csv_file( dataframe, 
                    filename,
                    index,
                    header,
                    mode ):
    '''This function writes a pandas DataFrame to a CSV file in the Google Cloud 
    Storage bucket. Input parameters include the dataframe, the name of the file 
    to write, index (write row names), header (write column names) and 
    mode (write mode) '''

    #Assign bucket name being used
    bucket_name = '266csffile'

    #Get bucket
    bucket = storage_client.get_bucket(bucket_name)

    blob = bucket.blob( filename )
    blob.upload_from_string(dataframe.to_csv(index=index, header=header, mode=mode), 'text/csv')


# code to test writing to a file
# alpha_list = ['a', 'b', 'c', 'd']
# alphabet_df = pd.DataFrame(alpha_list)

#write_csv_file( alphabet_df, 'test.csv', False, True, 'w')



In [None]:
# utility for reading csv file from shared Google Cloud storage

def read_csv_file( filename,
                    encoding='utf-8',
                    separator=',' ):
    '''This function reads a file from the google cloud storage bucket. Input
    parameters include the filename, encoding and CSV file separators.'''

    #Grab blob posts file and download as string
    blobby = bucket.blob(filename)
    blobby_string = blobby.download_as_string()

    #Convert to pandas
    df = pd.read_csv(io.BytesIO(blobby_string), encoding=encoding, sep=separator, engine='python')
   
    return df


# code to test reading a file
# type_df = read_csv_file( 'typology_xenforo-9-25-22.csv')
# print( type_df.head() )



###User/Header File Preprocessing

In this step, we provide some overall statistics and information on the header database. After this, null or missing values are identified. Next, we remove header database records that contain no MBTI type. Finally, we check the values in the header database to determine that information has been cleaned correctly, and provide some statistics on the distribution of personality type in the database. 

In [None]:
#Print general statistics on dataset before cleanup
print(color.BOLD + 'Initial data survey/analysis and basic cleanup' + color.END)
print(color.BOLD + 'Pre-split pre-cleaning data survey' + color.END)

# Print column names
print(' This data set has following columns: ',list(mbti_header_data.columns))
# Print shape of imported raw data before any cleanup.
data_rows = mbti_header_data.shape[0]
print(' There are', '{:,}'.format(data_rows), 'rows in the dataset')

Initial data survey/analysis and basic cleanup
Pre-split pre-cleaning data survey
 This data set has following columns:  ['Unnamed: 0', 'Username', 'Age', 'Posts', 'MBTI Type', 'Enneagram', 'Instinctual Variant', 'Gender', 'Occupation', 'is_I', 'is_S', 'is_T', 'is_J']
 There are 15,414 rows in the dataset


In [None]:

#Identify the presence of missing values
null_data = mbti_header_data.isnull().sum()

print('<b>Missing values in each column:</b>')

#Print info on missing values, looking up each count by column name
#(a positional index such as null_data[3] points at a different column)
print(' There are', '{:,}'.format(null_data['MBTI Type']), 'missing MBTI Types in the dataset \
which represents', "{:.1%}".format(null_data['MBTI Type']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Enneagram']), 'missing Enneagram values in the dataset \
which represents', "{:.1%}".format(null_data['Enneagram']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Instinctual Variant']), 'missing Instinctual Variant values in the dataset \
which represents', "{:.1%}".format(null_data['Instinctual Variant']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Gender']), 'missing Gender values in the dataset \
which represents', "{:.1%}".format(null_data['Gender']/data_rows), 'of the dataset' )


<b>Missing values in each column:</b>
 There are 0 missing MBTI Types in the dataset which represents 0.0% of the dataset
 There are 8,019 missing Enneagram values in the dataset which represents 52.0% of the dataset
 There are 13,625 missing Instinctual Variant values in the dataset which represents 88.4% of the dataset
 There are 9,388 missing Gender values in the dataset which represents 60.9% of the dataset


In [None]:

#Remove rows with blank MBTI type 
mbti_header_data = mbti_header_data.dropna(subset=['MBTI Type'])
mbti_header_data.shape

valid_MBTI = ['ISTJ', 
              'INTJ', 
              'ESTJ', 
              'ENTJ', 
              'ENTP', 
              'INTP', 
              'ISTP', 
              'ESTP', 
              'ISFJ', 
              'INFJ', 
              'ESFJ', 
              'ENFJ', 
              'ENFP', 
              'INFP', 
              'ISFP', 
              'ESFP']


#Remove data with invalid MBTI type in MBTI field
mbti_header_data = mbti_header_data[mbti_header_data['MBTI Type'].isin(valid_MBTI)]
mbti_header_data.shape


(15414, 13)

In [None]:
mbti_header_data.head(10)

Unnamed: 0.1,Unnamed: 0,Username,Age,Posts,MBTI Type,Enneagram,Instinctual Variant,Gender,Occupation,is_I,is_S,is_T,is_J
0,3,!sfj,74.0,0,ISFJ,,,,,True,True,False,True
1,5,"""jake""",27.0,0,ESTJ,,,,,False,True,True,True
2,10,&chet,30.0,0,ENTP,7,,,,False,False,True,False
3,11,'Veracity',36.0,0,INTJ,,,,,True,False,True,True
4,12,(FL)Cross,31.0,3,ENTP,,,,,False,False,True,False
5,15,*avariel*,37.0,21,INFJ,145,so_sp,female,,True,False,False,True
6,21,*UserName*,35.0,0,INFJ,4w5,,,,True,False,False,True
7,22,+ patch,59.0,71,INTP,5w4,sp_so,male,IT,True,False,True,False
8,23,- kim -,48.0,0,INFP,,,female,,True,False,False,False
9,24,-Andy-,32.0,0,INTP,1,,,,True,False,True,False


In [None]:
#Double check values
print(mbti_header_data.shape)
data_rows = mbti_header_data.shape[0]

#Identify the presence of missing values
null_data = mbti_header_data.isnull().sum()

print(' There are', '{:,}'.format(null_data['MBTI Type']), 'missing MBTI Types in the dataset \
which represents', "{:.1%}".format(null_data['MBTI Type']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Enneagram']), 'missing Enneagram values in the dataset \
which represents', "{:.1%}".format(null_data['Enneagram']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Instinctual Variant']), 'missing Instinctual Variant values in the dataset \
which represents', "{:.1%}".format(null_data['Instinctual Variant']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Gender']), 'missing Gender values in the dataset \
which represents', "{:.1%}".format(null_data['Gender']/data_rows), 'of the dataset' )
print(' There are', '{:,}'.format(null_data['Username']), 'missing Username values in the dataset \
which represents', "{:.1%}".format(null_data['Username']/data_rows), 'of the dataset' )



(15414, 13)
 There are 0 missing MBTI Types in the dataset which represents 0.0% of the dataset
 There are 8,019 missing Enneagram values in the dataset which represents 52.0% of the dataset
 There are 13,625 missing Instinctual Variant values in the dataset which represents 88.4% of the dataset
 There are 9,388 missing Gender values in the dataset which represents 60.9% of the dataset
 There are 0 missing Username values in the dataset which represents 0.0% of the dataset


In [None]:
#Disable max rows for Altair visualization
alt.data_transformers.disable_max_rows()


DataTransformerRegistry.enable('default')

In [None]:
#Print statistics on number of users for each type

bars = alt.Chart(mbti_header_data).mark_bar().encode(
    x='count(MBTI Type):Q',
    y=alt.Y('MBTI Type:N', sort='-x')
).properties(height=400)

text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='count(MBTI Type):Q'
)

(bars + text).properties(height=900)


####Post Database Preprocessing (Continued)

In these steps, we print some general statistics on the MBTI post data including a sample of the post data messages. We delete unnecessary columns and perform an inner join on the cleaned header file with the posts, and further clean up the data, removing joined header/post data with missing or incorrect MBTI type and missing post/message data. We conduct some further preprocessing of the post database using a preprocessing routine, create train, dev and test datasets and then write that data out to respective CSV files. 
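
As a rough sketch of the shuffle-and-split step, the helper below builds non-overlapping train, dev and test index arrays from a seeded random permutation so that the split can be reproduced between runs. The function name `split_indices`, the fractions and the seed are assumptions made for illustration; the notebook itself slices the shuffled data at fixed row counts.

```python
import numpy as np

def split_indices(n_rows, train_frac=0.75, dev_frac=0.125, seed=42):
    """Return shuffled, non-overlapping train/dev/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(n_rows * train_frac)
    n_dev = int(n_rows * dev_frac)
    train = order[:n_train]
    dev = order[n_train:n_train + n_dev]
    test = order[n_train + n_dev:]  # everything left over
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(16)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With a DataFrame, the resulting arrays could be passed to `.iloc` (for example `merged.iloc[train_idx]`). Because the test split takes all remaining rows, no rows are silently dropped when the total is not a round number.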

In [None]:
#Print general statistics on dataset before cleanup
print(color.BOLD + 'Initial data survey/analysis and basic cleanup' + color.END)
print(color.BOLD + 'Pre-split pre-cleaning data survey' + color.END)

# Print column names
print(' This data set has following columns: \n',list(post_data.columns))
# Print shape of imported raw data before any cleanup.
data_rows = post_data.shape[0]
print(' There are', '{:,}'.format(data_rows), 'rows in the dataset')

Initial data survey/analysis and basic cleanup
Pre-split pre-cleaning data survey
 This data set has following columns: 
 ['post_id', 'thread_id', 'Username', 'post_date', 'message']
 There are 3,328,436 rows in the dataset


In [None]:
#Merge the post database and header database
merged = mbti_header_data.merge(post_data, on='Username')


print("merged data shape:", merged.shape)
print( "merged data columns:", merged.columns )
print(merged.head(2))


merged data shape: (1611083, 17)
merged data columns: Index(['Unnamed: 0', 'Username', 'Age', 'Posts', 'MBTI Type', 'Enneagram',
       'Instinctual Variant', 'Gender', 'Occupation', 'is_I', 'is_S', 'is_T',
       'is_J', 'post_id', 'thread_id', 'post_date', 'message'],
      dtype='object')
   Unnamed: 0   Username   Age  Posts MBTI Type Enneagram Instinctual Variant Gender Occupation   is_I   is_S  is_T   is_J  post_id  thread_id   post_date

In [None]:
# write merged data out to a parquet file

# write to file
merged_filename = 'typology_merged.parquet'

print( merged_filename)
blob = bucket.blob( merged_filename )
blob.upload_from_string(merged.to_parquet(), 'application/octet-stream')



typology_merged.parquet


In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
#Print sample of messages to identify patterns for cleaning

for i in range(10):
  post = merged.loc[i].at["message"]
  my_wrap = textwrap.TextWrapper(width = 80)
  wrap_list = my_wrap.wrap(text=post)
  for line in wrap_list:
    print(line)
  print('\n')


  ooh ooh ooh, how about listing bad pick up lines, or stories of bad dates??
inquiring minds want to know ;


how can you tell if an XXXX likes you?  they just are available and around a
lot.  in romantic situations they become "aggressively available" as synarch
puts it, just conveniently being where you may bump into them, yet not appearing
to be initiating any kind of contact.  they are very receptive though.  you
really know that you are in with an XXXX if they share much of who they are with
you means they trust you not to stomp all over that and if they bring up
negative feelings or have conflict with you means you are worth the effort to
get things back to the way they should be and they feel you care enough to work
through the conflict.  if they hold you to higher standards than other people,
that is also a good sign that you are very important to them.


why do you keep letting these girls use you?  i could understand it if you were
just in it for the sex, but it sounds like 

In [None]:
#Identify the presence of missing values
null_data = merged.isnull().sum()

print(' There are', '{:,}'.format(null_data['MBTI Type']), 'missing MBTI Types in the dataset \
which represents', "{:.1%}".format(null_data['MBTI Type']/merged.shape[0]), 'of the dataset' )

 There are 0 missing MBTI Types in the dataset which represents 0.0% of the dataset


##Model Pre-processing

In [None]:

#Set up stop words
stop_words = nltk.corpus.stopwords.words('english') 
stop_words.extend([ "\xa0"]) # non-breaking space

def preprocess(text):
    """This function takes in a text string and performs a variety of 
    pre-processing activities against that text including converting to 
    lowercase, and removing vBulletin BB codes and apostrophes"""
    text=str(text)
    text = text.replace('\\n', '')
    text = text.lower()
    text = text.replace('[b]', '')
    text = text.replace('[/b]', '')
    text = text.replace('[i]', '')
    text = text.replace('[/i]', '')
    text = text.replace('[u]', '')
    text = text.replace('[/u]', '')
    text = text.replace('[s]', '')
    text = text.replace('[/s]', '')
#    text = re.sub('[^a-zA-Z0-9 \.]', '', text)
    text = re.sub('[\']', '', text)
    
    #Tokenize (convert from string to list)
    text = text.split()

##COMMENTED OUT FOR TIME BEING. WE CAN RESTORE AND SEE IF IT IMPROVES ACCURACY
    #Remove Stopwords 
#    if stop_words is not None:
#        words = [word for word in text if word not in 
#                  stop_words]

    #Conduct lemmatization
#    lem = nltk.stem.wordnet.WordNetLemmatizer()
#    words = [lem.lemmatize(word) for word in text]

    #Join the tokens back into one string
    new_text = " ".join(text)

    return new_text



In [None]:


#Determine average and longest text lengths
string_lengths = merged['message'].str.len()
print('Length analysis:')
print(' Average length of the text string in characters is', '{:,}'.format(round(string_lengths.mean(),0)), 'characters')
print(' The longest statement string in characters is', '{:,}'.format(round(string_lengths.max(),0)), 'characters\n')

#Plot the text lengths
MEDIUM_SIZE = 14
fig=plt.figure(figsize=(7,5))
plt.style.use('seaborn')
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
plt.xlabel('Statement String Length')
plt.ylabel('Number Of Occurrences')
plt.title('Histogram of String Length in Merged Dataset')
plt.xlim([0, 4000])
plt.hist(string_lengths, bins = 200, edgecolor = "black")
plt.show()

#Drop rows with missing message before preprocessing (required for BERT to work).
#preprocess() casts its input to str, so a NaN left in place would become the text 'nan'
null_data = merged.isnull().sum()
print(' There are', '{:,}'.format(null_data['message']), 'NaN message rows in the merged dataset') 
merged = merged.dropna(subset=['message'])

#Preprocess the message text (done once, before the train/dev/test split)
clean_texts = list(map(preprocess, merged['message']))

#Drop old column in MBTI file
merged.drop(['message'], axis = 1, inplace = True)

#Create new revised column based on additional preprocessing
merged['message'] = clean_texts

#Create labels 
labels = merged['MBTI Type']

#Shuffle the data
shuffle = np.random.permutation(np.arange(merged.shape[0]))
data, labels = merged.iloc[shuffle], labels.iloc[shuffle]

#Print sample of messages to confirm cleaning has worked
for i in range(50):
  post = clean_texts[i]
  my_wrap = textwrap.TextWrapper(width = 80)
  wrap_list = my_wrap.wrap(text=post)
  for line in wrap_list:
    print(line)
  print('\n')

# Set some variables to hold test, dev, and training data.
train_mbti_data, train_mbti_labels = data[:1200000], labels[:1200000]
dev_mbti_data, dev_mbti_labels = data[1200000:1400000], labels[1200000:1400000]
test_mbti_data, test_mbti_labels = data[1400000:1600000], labels[1400000:1600000]

#Determine and print number of rows in each dataset
train_mbti_rows = train_mbti_data.shape[0]
dev_mbti_rows = dev_mbti_data.shape[0]
test_mbti_rows = test_mbti_data.shape[0]
total_mbti_records = train_mbti_rows+dev_mbti_rows+test_mbti_rows
print('After Data Split')
print(' TRAIN dataset row count: ', '{:,}'.format(train_mbti_rows))
print(' DEV dataset row count  : ', '{:,}'.format(dev_mbti_rows))
print(' TEST dataset row count : ', '{:,}'.format(test_mbti_rows))
print(' Total rows             : ', '{:,}'.format(total_mbti_records))

null_data = train_mbti_data.isnull().sum()
print(' There are', '{:,}'.format(null_data['message']), 'missing message rows in the Train dataset')
null_data = dev_mbti_data.isnull().sum()
print(' There are', '{:,}'.format(null_data['message']), 'missing message rows in the Dev dataset')
null_data = test_mbti_data.isnull().sum()
print(' There are', '{:,}'.format(null_data['message']), 'missing message rows in the Test dataset')

#Set option to print dataframe in one line
pd.set_option('expand_frame_repr', False)

#Check that data looks good
print('\nExample of data after cleaning and splitting:')
print('\nTraining data')
print(train_mbti_data.head(10))
print('\nTraining labels')
print(train_mbti_labels.head(10))
print('')



In [None]:
#diagnostic code
mylist = [184720, 184721, 324223, 324224, 410614, 410615, 427590, 427591, 552552, 552553, 600776, 600777, 619523, 619524, 1029069, 1029070]


In [None]:
#diagnostic code
for i in mylist:
  post = train_mbti_data.iloc[i].at["message"]
  my_wrap = textwrap.TextWrapper(width = 80)
  wrap_list = my_wrap.wrap(text=post)
  for line in wrap_list:
    print(line)
  print('\n')

for i in mylist:
  print(train_mbti_data.iloc[i].at["post_id"])


###Write Final Preprocessed CSV Files

As a final step, we write the final preprocessed train, dev and test data and label files to the Google Storage Bucket. 

In [None]:
train_mbti_data.dtypes

Unnamed: 0               int64
Username                object
Age                    float64
Posts                    int64
MBTI Type               object
Enneagram               object
Instinctual Variant     object
Gender                  object
Occupation              object
is_I                      bool
is_S                      bool
is_T                      bool
is_J                      bool
post_id                  int64
thread_id                int64
post_date                int64
message                 object
dtype: object

In [None]:

#Write train, test and dev data to CSV
write_csv_file(train_mbti_data, 'train_mbti_data.csv', True, True, 'w')
write_csv_file(test_mbti_data, 'test_mbti_data.csv', True, True, 'w')
write_csv_file(dev_mbti_data, 'dev_mbti_data.csv', True, True, 'w')
write_csv_file(train_mbti_labels, 'train_mbti_labels.csv', True, True, 'w')
write_csv_file(test_mbti_labels, 'test_mbti_labels.csv', True, True, 'w')
write_csv_file(dev_mbti_labels, 'dev_mbti_labels.csv', True, True, 'w')



##Pull in resampled data


In [None]:
# we only need to do the under/over resampling on training data.



#INFORMATION BELOW THIS IS OLD CODE FROM A PREVIOUS PROJECT AND IS ONLY TO BE USED FOR EXAMPLE PURPOSES

##Run Logistic Regression Against Non PreProcessed and Preprocessed Data

In this step, we conduct a grid search using the logistic regression classifier, searching for the best C value, while using the vectorized dataset that *does not include* custom preprocessing. After that, we run logistic regression against the vectorized dataset that *has included* custom preprocessing. 

We note that the best value for C among the values evaluated is 10. We attempted both lemmatisation and Porter stemming separately in our preprocessor, and they appeared to reduce the F1 score of the classifier by approximately 0.005. In general, our custom preprocessor does not perform as well as the default with the addition of English stopwords.
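
The grid search over C described above might look roughly like the sketch below. It is not the code used for the reported results: the toy corpus, the labels, the candidate values of C and the pipeline step names (`tfidf`, `lr`) are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus; 1 and 0 are arbitrary class labels
texts = [
    "great movie loved it", "wonderful acting and story", "loved the great cast",
    "a wonderful great film", "really loved this story", "great story great acting",
    "terrible movie hated it", "awful acting and plot", "hated the boring cast",
    "a boring awful film", "really hated this plot", "awful plot boring acting",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("lr", LogisticRegression(solver="liblinear", penalty="l2")),
])

# Candidate values of C (inverse regularization strength); illustrative only
param_grid = {"lr__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=3)
search.fit(texts, labels)

print("best C:", search.best_params_["lr__C"])
print("best cross-validated F1:", round(search.best_score_, 4))
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the vocabulary and IDF weights are refit on each training fold only, so the held-out fold does not leak into feature extraction.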

In [None]:
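#Visual Confusion Matrix
import seaborn as sns
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#scikit-learn metrics take (y_true, y_pred) in that order
LR_Count = accuracy_score(dev_labels, predicted_labels)
LR_Count_CM = classification_report(dev_labels, predicted_labels) 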

ax= plt.subplot()
LR_Count_Confusion = confusion_matrix(dev_labels, predicted_labels)

sns.heatmap(LR_Count_Confusion, cmap="Blues", annot=True, fmt='d', ) # for decimal

# labels, title and ticks
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Fake', 'True']); ax.yaxis.set_ticklabels(['Fake', 'True']);
plt.show()


Reviewing the Confusion Matrix, the key items are as follows:

* True Positive (TP): We predicted True, and they are True.
* True Negative (TN): We predicted Fake, and they are Fake.
* False Positive (FP): We predicted True, but they are Fake. (Type I error)
* False Negative (FN): We predicted Fake, but they are True. (Type II error)

This means that the classifier's errors are the False Positives and False Negatives.
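
To make the four cells concrete, here is a small sketch that counts them from paired true and predicted labels and derives precision and recall. The helper name `confusion_counts` and the labels are made up for illustration, and "True" is treated as the positive class.

```python
def confusion_counts(true_labels, predicted_labels, positive="True"):
    """Count TP, TN, FP and FN for a binary task, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred != positive and truth != positive:
            tn += 1
        elif pred == positive and truth != positive:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn

# Made-up labels for illustration
truth = ["True", "True", "Fake", "Fake", "True", "Fake"]
preds = ["True", "Fake", "Fake", "True", "True", "Fake"]

tp, tn, fp, fn = confusion_counts(truth, preds)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```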

In [None]:
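#Run MultinomialNB classifier and try different values of alpha 
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

Naive_Bayes = MultinomialNB()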

#Search for best value of alpha
alphas = {'alpha': [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
print('\nAttempting values of alpha: ', alphas)
nb_model = GridSearchCV(Naive_Bayes, alphas, scoring = 'f1_weighted', verbose = 0)
nb_model.fit(train_cv, train_labels)

print('\nBest model estimator is:', nb_model.best_estimator_, ' with an f1 score of:', \
      "{0:.4}".format(nb_model.best_score_))
  
#Run predictions with dev data
predicted_labels = nb_model.predict(dev_cv)
  
#Print results for predictions with dev dataset
print('\033[1m\nResults for Multinomial Naive Bayes using dev data \033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))


It appears that Logistic Regression performs much more effectively on our dataset than Multinomial Naive Bayes, so we will continue to use logistic regression for the time being.

# **Feature Engineering**

Next, we explore feature engineering in further detail. We conduct a deeper analysis of the text field and also examine the title field. Because the number of authors equals the number of rows in the dataset, we will not explore this feature further, as it would be non-differentiating. We begin with an analysis of the coefficients with the highest values, as well as analyzing errors, using the text field.


##Top Features

In [None]:
#Set number of top features to show
no_features = 50

def top_features(lr_model, train_features, no_features): 
  """This function takes in a logistic regression model, the array of feature 
  names and the number of top features to print. It then sorts the 
  coefficients for each classification output in descending order and prints 
  a table of the top features and their weights."""

  #Create array to store sorted indexes
  sorted_idx = np.zeros(lr_model.coef_.shape, dtype = int)

  #Sort the coefficients of each row in descending order
  for i, row in enumerate(lr_model.coef_):
    sorted_idx [i] = (-row).argsort()

  #Select the top no_features indexes for each row
  top_idx = sorted_idx [:,:no_features]

  #Initialize list to contain the top features
  features = []
  #Initialize list to contain classes
  classes = [0,1]
  #Initialize array to contain weights
  weights = np.zeros((no_features, 2), dtype = float)
  
  #Set easier to interpret variables for subsequent loop
  rows = top_idx.shape[0]
  cols = top_idx.shape[1]

  #Set index for weights
  idx = 0

  #Pull features, classes and weights information 
  for row in range(rows):
    for col in range(cols):
      features.append(train_features[top_idx [row, col]])
      weights [idx] = lr_model.coef_ [:, top_idx [row,col]]
      idx += 1

  #Put information in dataframe to print
  df = pd.DataFrame(data=weights, index=[features], columns=[classes])

  #Print the dataframe
  print(df)

top_features(lr_model, train_features, no_features)

We can see that common words that allow us to determine whether news is fake include words like "clinton", "hillary", "2016", "fbi", "election", "october", and "emails". These results do not seem surprising given what we know in retrospect about the 2016 election.

##Experiment With Bigram and Trigram Features

Our next step was to experiment with bigram and trigram features on the text field. We vectorize the text using unigram/bigram and unigram/bigram/trigram features. Then we run a logistic regression to determine whether the accuracy of our classifier improves with the addition of these features.
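
For intuition about what these feature sets contain, the sketch below lists the word n-grams produced for a short sentence. It is a simplified illustration: the helper `word_ngrams` is hypothetical and splits on whitespace, whereas `TfidfVectorizer` applies its own tokenization and stop-word handling.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for n in the inclusive range `ngram_range`."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

sentence = "the quick brown fox"
print(word_ngrams(sentence, (1, 2)))
print(word_ngrams(sentence, (1, 3)))
```

Each increase in the upper bound adds features on top of the previous set, so the vocabulary (and memory use) grows quickly on a large corpus.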

In [None]:
##Attempt Unigram + Bigram Features
#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 2

#Transform data
vectorizer_t_nopreprocess = TfidfVectorizer(stop_words= stop_words , ngram_range=(num_words_a, num_words_b)) 
train_cv = vectorizer_t_nopreprocess.fit_transform(train_data["text"]) 
train_features = vectorizer_t_nopreprocess.get_feature_names_out()
vectorizer_d_nopreprocess = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), \
                                             vocabulary = train_features)
dev_cv = vectorizer_d_nopreprocess.fit_transform(dev_data["text"]) 

c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(train_cv, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(dev_cv)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using unigram and bigram features\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))

##Attempt Unigram + Bigram Features + Trigram Features
#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 3

#Transform training data to matrices of word unigram, bigram and trigram features
vectorizer_t_nopreprocess = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
train_cv = vectorizer_t_nopreprocess.fit_transform(train_data["text"]) 
train_features = vectorizer_t_nopreprocess.get_feature_names_out()
vectorizer_d_nopreprocess = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), \
                                            vocabulary = train_features)
dev_cv = vectorizer_d_nopreprocess.fit_transform(dev_data["text"]) 

c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(train_cv, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(dev_cv)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using unigram, bigram and trigram features\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))

We determined that adding trigram features appears to degrade the F1 score, so they will no longer be considered. Bigram features in most cases (though not all) appear to degrade performance as well. Given these results, and because unigram features are simpler, we'll use unigram features moving forward for the text field. 




##Error Analysis

In [None]:
# Expand output panel height to eliminate scrollbar on output panel
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 5000})'''))

##Revert back to unigrams
#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 1

#Transform training data to matrices of word unigram features
vectorizer_t = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
train_cv = vectorizer_t.fit_transform(train_data["text"]) 
train_features = vectorizer_t.get_feature_names_out()
vectorizer_d = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), \
                                            vocabulary = train_features)
dev_cv = vectorizer_d.fit_transform(dev_data["text"]) 

c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(train_cv, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(dev_cv)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using unigram features\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))

#Create a list to store results of prediction error analysis
#[row in dataset, predicted class, correct class, r ratio]
error_analysis = []

#Move labels into array

dev_labels_a = dev_labels.to_numpy()

#Obtain probabilities for each class
predicted_probs = lr_model.predict_proba(dev_cv)

#Initialize counter that represents label index in dev dataset
count = 0

#Initialize counter for printing
i = 0
print_no = 10

#Loop through all rows in the dataset, checking for predictions that aren't correct
for row_num in range(dev_cv.shape[0]):

  #Determine index of highest value for current row
  max_predict_index = int(np.argmax(predicted_probs[row_num]))
    
  #Determine probability of index with highest value 
  max_prob = predicted_probs[row_num, max_predict_index]
    
  #Determine probability of index with the correct classification
  correct_class_prob = predicted_probs [row_num, dev_labels_a [count]]
    
  #Calculate r ratio
  r_ratio = max_prob / correct_class_prob

  #If prediction isn't correct, add relevant values to the list of errors
  # 0 - Row number
  # 1 - Predicted class
  # 2 - Correct class 
  # 3 - R Ratio 
  if max_predict_index != dev_labels_a [count]: 
    error_analysis.append([row_num, predicted_labels[count], dev_labels_a[count], r_ratio])
    
  #Increment dev dataset index counter
  count += 1 

#Sort the errors in descending order by r_ratio
sorted_errors = sorted(error_analysis, key=lambda x: x[3], reverse = True)

#Create array for fitted vectorizer
cv_d_array = sparse.lil_matrix(dev_cv).toarray()

#Loop through the errors and produce list with index, feature name, and probability
for x in range(0, len(sorted_errors)):

  #Create index of nonzero feature values for example
  row = cv_d_array[sorted_errors[x][0]]

  nz1 = np.nonzero(row)
  nz = nz1[0]

  #For every nonzero feature, create table entry with index, feature and value
  #0 - index into dev dataset
  #1 - feature name
  #2 - TF-IDF value of the feature in this example
  f_and_v = []
  for value in nz:
    f_and_v.append([sorted_errors[x][0], train_features[value], row [value] ])

  #Sort the features in descending order by TF-IDF value
  sorted_f_and_v = sorted(f_and_v, key=lambda item: item[2], reverse = True)

  if i < print_no: 
    #Print r ratio, predicted label, correct label and text body
    print ('\033[1m\nR Ratio of ', sorted_errors[x][3], '\033[0m')
    print('Predicted label is ', sorted_errors [x][1])
    print('Correct label is', sorted_errors[x][2])
#    print ('\nText body:', dev_data [sorted_errors [x][0]])
    print('\nListing features with the largest non-zero TF-IDF values:')

    i += 1
    #Print errors in descending order
#    for row in range(len(sorted_f_and_v)):
    no_features = len(sorted_f_and_v)
    if no_features > 10:
      no_features = 10
    for row in range(no_features):
        print('\nIndex =', sorted_f_and_v[row][0], ', Feature =', sorted_f_and_v[row][1], \
          ', TF-IDF value =', sorted_f_and_v[row][2])
    
print('\nTotal of all r ratios > 1 =', round(sum(row[3] for row in sorted_errors), 0))

When we compared the text of the mislabeled data points, we did not find a clear pattern among the errors with a high R ratio. We may consider removing some of the words with high coefficients at a later point, or consider additional features that can improve the classification. 
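
For reference, the R ratio used above is the probability of the predicted class divided by the probability of the correct class, so it equals 1 for a correct prediction and grows as the model becomes more confidently wrong. A small worked sketch, using hypothetical probabilities rather than our model's output:

```python
import numpy as np

# Hypothetical predicted probabilities (rows = examples, columns = classes 0 and 1)
predicted_probs = np.array([[0.90, 0.10],
                            [0.55, 0.45],
                            [0.20, 0.80]])
true_labels = np.array([1, 1, 1])

# Probability of the predicted class and of the correct class for each row
max_prob = predicted_probs.max(axis=1)
correct_prob = predicted_probs[np.arange(len(true_labels)), true_labels]
r_ratio = max_prob / correct_prob

# Keep only misclassified rows and sort them by descending R ratio
wrong = predicted_probs.argmax(axis=1) != true_labels
errors = sorted(zip(np.flatnonzero(wrong).tolist(), r_ratio[wrong].tolist()),
                key=lambda e: e[1], reverse=True)
print(errors)
```

Here the first row is the most confident mistake (ratio near 9), the second is a near miss (ratio near 1.2), and the third is correct and therefore excluded.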

##Consider Additional Features

We next considered several additional features. 

*   word count
*   character count
*   sentence count
*   average sentence length
*   sentiment (using TextBlob)
*   named entity recognition (NER)

We have created a new dataset called train_data_counts, which contains all of the numerical data and complements the two datasets that have already been created. 

In summary, we now have three datasets:

*   train(dev)_data which contains textual information 
*   train(dev)_cv, which contains the TF-IDF weights from vectorization  
*   train(dev)_data_counts, which contains word counts, character counts, sentence counts, average sentence length and sentiment scores
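
As a rough sketch of the length-based features, the hypothetical helper `count_features` below computes them for a single string. It is a variant, not the notebook's exact code: it splits sentences on `.`, `!` and `?` and drops empty pieces, whereas splitting on `"."` alone yields an extra empty piece for text that ends with a period.

```python
import re

def count_features(text):
    """Compute simple length-based features for one document."""
    words = text.split()
    # Split on sentence-ending punctuation and drop empty pieces
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sentence_count = max(len(sentences), 1)   # avoid division by zero
    return {
        "text_characters": len(text),
        "word_count": len(words),
        "sentence_count": sentence_count,
        "avg_sentence_length": len(words) / sentence_count,
    }

example = "The FBI reopened the case. Voters reacted quickly!"
features = count_features(example)
print(features)
```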



In [None]:
train_data_counts = pd.DataFrame(columns=['text_characters','word_count','sentence_count','avg_sentence_length', 'sentiment'])
dev_data_counts = pd.DataFrame(columns=['text_characters','word_count','sentence_count','avg_sentence_length', 'sentiment'])


train_data_counts['text_characters'] = train_data['text'].str.len()
train_data_counts['word_count'] = train_data["text"].apply(lambda x: len(str(x).split(" ")))
train_data_counts['sentence_count'] = train_data["text"].apply(lambda x: len(str(x).split(".")))
train_data_counts['avg_sentence_length'] = train_data_counts['word_count'] / train_data_counts['sentence_count']
train_data_counts['sentiment'] = train_data['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
dev_data_counts['text_characters'] = dev_data['text'].str.len()
dev_data_counts['word_count'] = dev_data["text"].apply(lambda x: len(str(x).split(" ")))
dev_data_counts['sentence_count'] = dev_data["text"].apply(lambda x: len(str(x).split(".")))
dev_data_counts['avg_sentence_length'] = dev_data_counts['word_count'] / dev_data_counts['sentence_count']
dev_data_counts['sentiment'] = dev_data['text'].apply(lambda x: TextBlob(x).sentiment.polarity)

#Run logistic regression using only these new features
c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(train_data_counts, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(dev_data_counts)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using character, word and sentence counts + sentiment\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))






These new features, used on their own with the logistic regression classifier, result in an F1 score of ~70%. We will determine at a later point whether they can improve the overall results. Next, we graph the sentiment values to see what the distribution looks like. 

In [None]:
#Plot sentiment
fig=plt.figure(figsize=(12,5))
plt.xlabel('Sentiment Values')
plt.ylabel('Number of Occurrences in Dev')
plt.title('Histogram of Sentiment Values')
#plt.xlim([0, 5000])
plt.hist(dev_data_counts['sentiment'], bins = 200, edgecolor = "black")
plt.axvline(dev_data_counts['sentiment'].mean(), color='y', linestyle='dashed', linewidth=3)
min_ylim, max_ylim = plt.ylim()
plt.text(dev_data_counts['sentiment'].mean()*1.2, max_ylim*0.9, 'Mean: {:.2f}'.format(dev_data_counts['sentiment'].mean()))
plt.show()

Sentiment values appear largely normally distributed with a mean of 0.07. 

## Named-Entity Recognition

Now we begin to analyze the title field and extract features from it. 

We start with Named-Entity Recognition. In this step, we extract tags representing named entities in the title field in pre-defined categories and determine counts. The named entities include names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. 

Counts for each category become a new feature in our dataset. We add these values to the train_data_counts dataset. 
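
To make the per-category counting concrete, here is a minimal sketch that assumes the entities have already been extracted as (text, label) pairs. The pairs, the category list and the helper name `label_counts` are hypothetical; the notebook's own implementation follows in the next cell.

```python
from collections import Counter

# Hypothetical (entity text, entity label) pairs for one title
entities = [("Hillary Clinton", "PERSON"), ("FBI", "ORG"),
            ("2016", "DATE"), ("Clinton", "PERSON")]

def label_counts(entities, categories):
    """Count entities per label, returning 0 for categories that do not occur."""
    counts = Counter(label for _, label in entities)
    return {"tags_" + cat: counts[cat] for cat in categories}

# Using one fixed category list for every row keeps the columns aligned
categories = ["PERSON", "ORG", "DATE", "GPE"]
row = label_counts(entities, categories)
print(row)
```

Using the same category list for the train and dev data guarantees that both end up with identical feature columns.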

In [None]:
#Ignore chained assignment warnings
pd.options.mode.chained_assignment = None

#Call model
ner = spacy.load('en_core_web_sm')

#Extract tags and insert new column in dataframe
train_data['tags'] = train_data['title'].apply(lambda x: [(tag.text, tag.label_) 
                                for tag in ner(x).ents] )
dev_data['tags'] = dev_data['title'].apply(lambda x: [(tag.text, tag.label_) 
                                for tag in ner(x).ents] )

#Create function to count the number of items in a list
def list_count(lst):
  """This function is used to count the number of elements in a list"""
  dict_counter = collections.Counter()
  for x in lst:
    dict_counter[x] += 1
  dict_counter = collections.OrderedDict( 
          sorted(dict_counter.items(), 
          key=lambda x: x[1], reverse=True))
  list_count = [ {key:value} for key,value in dict_counter.items() ]
  return list_count

#Count the tags for each row and add count to the tags column
train_data["tags"] = train_data["tags"].apply(lambda x: list_count(x))
dev_data["tags"] = dev_data["tags"].apply(lambda x: list_count(x))

print('\nPrinting example of added tags')
printmd('\n<b>Printing examples of added tags:</b>')     
print(train_data.iloc[0]['tags'])
print(train_data.iloc[1]['tags'])
print(train_data.iloc[3]['tags'])

#Create column for each tag category
def ner_features(list_dics_tuples, tag):
  """This function counts how many entities in a row carry the given NER tag"""
  #Each element is a one-item dict of {(entity text, entity label): count}
  tag_counter = collections.Counter()
  for dict_tuples in list_dics_tuples:
    for (_, label), n in dict_tuples.items():
      tag_counter[label] += n
  #A Counter returns 0 for tags that never occur, including for empty lists
  return tag_counter[tag]


#Extract the features for train data
tags = []
for lst in train_data["tags"].tolist():
  for dict in lst:
    for k in dict.keys():
      tags.append(k[1])

tags = list(set(tags))

for feature in tags:
  train_data_counts["tags_"+feature] = train_data["tags"].apply(lambda x: 
                    ner_features(x, feature))

#Extract the features for dev data, reusing the tag categories found in train
#so that both datasets end up with the same columns in the same order
for feature in tags:
  dev_data_counts["tags_"+feature] = dev_data["tags"].apply(lambda x: 
                  ner_features(x, feature))

printmd('\n<b>Printing all features in train_data_counts:</b>')     
print(train_data_counts.columns)

printmd('\n<b>Printing head of train_data_counts:</b>')     
print(train_data_counts.head(5))
     

##Calculating Logistic Regression After Adding NER

In [None]:

#Run logistic regression using the count, sentiment and NER features
c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(train_data_counts, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(dev_data_counts)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using character, word and sentence counts, sentiment and NER\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))



Adding NER improves the F1 score slightly (by about 2%). 

##Apply scaling to numerical data

Next, we apply scaling to the numeric data in train_data_counts. We rerun logistic regression to determine if we are making any improvements in the performance of the model. 
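
Standard scaling subtracts each column's mean and divides by its standard deviation. The sketch below shows the arithmetic with NumPy on hypothetical count values. It learns the statistics from the training split and reuses them for the dev split, which is a common convention and an assumption of this sketch; the cell below fits a separate scaler on the dev data.

```python
import numpy as np

# Hypothetical count features: columns are character count and word count
train_counts = np.array([[1200.0, 200.0], [3000.0, 520.0], [600.0, 95.0]])
dev_counts = np.array([[1800.0, 310.0]])

# Learn mean and (population) standard deviation from the training split only
mean = train_counts.mean(axis=0)
std = train_counts.std(axis=0)

scaled_train = (train_counts - mean) / std
scaled_dev = (dev_counts - mean) / std   # dev reuses the training statistics

print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
print(scaled_dev)
```

After scaling, each training column has mean 0 and standard deviation 1, so features measured in thousands of characters no longer dwarf the others.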

In [None]:
#Apply standard scaler to numerical values
standard_scaler = StandardScaler().fit(train_data_counts)
scaled_values_t = standard_scaler.transform(train_data_counts)
pd.Series(scaled_values_t.flatten()).describe()

standard_scaler = StandardScaler().fit(dev_data_counts)
scaled_values_d = standard_scaler.transform(dev_data_counts)

#Run logistic regression using the scaled features
c = 10
lr_model = LogisticRegression(solver = 'liblinear', penalty = 'l2', C = c)
lr_model.fit(scaled_values_t, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(scaled_values_d)
  
#Print results for predictions with dev dataset
print('\033[1m\nLogistic Regression using character, word sentence counts, NER and sentiment with scaled values \033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))
results = confusion_matrix(predicted_labels, dev_labels)
print('\n', results)



Standard scaling introduces a slight improvement in the performance of the model (an increase of about .003 in the F1 score). 

# **Attempt Additional Classifiers**

Next, we try additional classifiers to determine if they appear to provide any improvement over Logistic Regression. A GridSearch was conducted on Random Forest but the code is not run in this Notebook due to long execution time. 

##Run SVC Classifier

##Run Random Forest Classifier

We next run the random forest classifier. Prior to this, we conducted a grid search to seek the best parameter for n_estimators and max_features. This code is commented out in the notebook due to its long running time. 
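
For readers who want to try the search quickly, here is a scaled-down sketch on a small synthetic dataset. The dataset, the grid values, `cv=3` and the `f1_weighted` scoring are assumptions chosen so that it finishes in seconds; they are not the settings of our original search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (hypothetical stand-in for train_cv and train_labels)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Candidate values for the two parameters discussed above
param_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", "log2"]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1_weighted")
search.fit(X, y)
print(search.best_params_)
```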

In [None]:
### GRIDSEARCH ON RANDOM FOREST CLASSIFIER - COMMENTED OUT DUE TO LONG RUNNING TIME
# Build a classification task using 100 informative features
#train_cv, train_labels = make_classification(n_samples=1000,
#                           n_features=131028,
#                           n_informative=100,
#                           n_redundant=0,
#                           n_repeated=0,
#                           n_classes=2,
#                           random_state=0,
#                           shuffle=False)

#rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True) 

#param_grid = { 
#    'n_estimators': [20, 10000],
#    'max_features': ['auto', 'sqrt', 'log2']
#}

#rf_model = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
#rf_model.fit(train_cv, train_labels)
#print('Best model parameters for random forest classifier:', rf_model.best_params_)

#predicted_labels = rf_model.predict(dev_cv)
#print(accuracy_score(dev_labels,predicted_labels))

In [None]:
# Instantiate model 
rf = RandomForestClassifier(n_estimators = 200, random_state = 42, max_features = 'sqrt')
# Train the model on training data
rf.fit(train_cv, train_labels)
#rf_model.fit(train_cv, train_labels)
predicted_labels = rf.predict(dev_cv)
#print(accuracy_score(dev_labels,predicted_labels))
#Print results for predictions with dev dataset
print('\033[1m\nRandom Forest using Text word counts\033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))

Linear SVC appears to outperform Logistic Regression. Random Forest appears to perform much more poorly and thus will no longer be considered. Now we will attempt to combine the datasets to evaluate performance with combined features using Linear SVC. 

##Run Linear SVC On Vectorized Text and Title Fields

Next, we combine the vectorized text and title fields. Based on the meaningful labels provided by NER, our speculation is that bigrams or trigrams may offer some improvements in the title field due to the inherently different nature of that field (where words grouped together may have significant meaning). 

This experiment consists of several steps: 
*   Vectorize the title fields, varying ngram_range
*   Print statistics from the vectorization
*   Concatenate the vectorized data from the text and title fields
*   Run the Linear SVC classifier against the combined data for each ngram_range
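
The concatenation step can be sketched with two tiny sparse matrices. The values are hypothetical; the point is only that `hstack` places the title columns and the text columns side by side while keeping one row per document.

```python
from scipy.sparse import csr_matrix, hstack

# Hypothetical TF-IDF matrices: 2 documents, 3 title features and 2 text features
title_features = csr_matrix([[0.0, 0.7, 0.0], [0.5, 0.0, 0.5]])
text_features = csr_matrix([[0.3, 0.0], [0.0, 0.9]])

# Stack column-wise so each document keeps one row with all features
combined = hstack((title_features, text_features)).tocsr()
print(combined.shape)
print(combined.toarray())
```

Both inputs must have the same number of rows in the same document order, otherwise the combined rows would mix features from different documents.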

In [None]:

def vectorize_and_test(num_words_a, num_words_b): 
  """This function takes in the ngram_range as input, vectorizes the title field,
combines vectorized text and title data, and runs the linear SVC classifier 
against the concatenated file"""
  #Transform training and dev data to matrices of word n-gram features
  vectorizer_t_title = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
  train_cv_title = vectorizer_t_title.fit_transform(train_data["title"]) 
  train_features = vectorizer_t_title.get_feature_names_out()
  vectorizer_d_title = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), vocabulary = train_features)
  dev_cv_title = vectorizer_d_title.fit_transform(dev_data["title"]) 

  #Print some statistics on the vectorization
  print('\n\033[1mNGram range: {', num_words_a, ',', num_words_b, '}\033[0m')
  print('\nPrinting statistics for title vectorization:')
  print('\nVocabulary size = ', "{:,}".format(train_cv_title.shape[1]))
  print('Number of non zero features = ', "{:,}".format(train_cv_title.nnz))
  print('Number of samples in training dataset = ', "{:,}".format(train_cv_title.shape[0]))
  print('Average non zero features per example = ', "{:,}".format(train_cv_title.nnz/train_cv_title.shape[0]))
  total_entries = train_cv_title.shape[0] * train_cv_title.shape[1]
  print('Fraction of non zero entries in the matrix = ', round((train_cv_title.nnz/total_entries), 6))
  feature_names = vectorizer_t_title.get_feature_names_out()
  print('First feature = ', feature_names[0], '  Last feature = ', feature_names[len(feature_names) - 1])

  concat_t_array = hstack((train_cv_title, train_cv))
  concat_d_array = hstack((dev_cv_title, dev_cv))

  #Run LinearSVC using combined features from text and title fields
  c = 1
  lr_model = LinearSVC(C = c)
  lr_model.fit(concat_t_array, train_labels)

  #Run predictions with dev data
  predicted_labels = lr_model.predict(concat_d_array)
  
  #Print results for predictions with dev dataset
  print('\033[1m\nLinear SVC on Vectorized Text and Title Fields \033[0m\n')
  print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))

  print('\n', classification_report(dev_labels, predicted_labels, target_names=['true', 'fake']))
  results = confusion_matrix(predicted_labels, dev_labels)
  print('\n', results)

  #Return vectorized concatenated arrays
  return concat_t_array, concat_d_array


#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 1

concat_t_array, concat_d_array = vectorize_and_test(num_words_a, num_words_b)

#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 2

concat_t_array, concat_d_array = vectorize_and_test(num_words_a, num_words_b)

#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 3

#Keep the {1,3} arrays, which gave the best results, for the next experiment
concat_t_array_best, concat_d_array_best = vectorize_and_test(num_words_a, num_words_b)

#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 4


concat_t_array, concat_d_array = vectorize_and_test(num_words_a, num_words_b)


It appears that an ngram_range of {1,3} for the title provides the best results. Our speculation that this might provide an improvement turned out to be correct.  

##Combine Vectorized Text, Vectorized Title and Additional Numerical Features

Next, we attempt to combine the vectorized title and text data with the train_data_counts data that we created earlier, which includes word counts, character counts, sentence counts, average sentence length, sentiment scores, and NER category counts. 

In [None]:
#Put counts info into array
counts_t_array = train_data_counts.to_numpy()
counts_d_array = dev_data_counts.to_numpy()
      
counts_t_matrix = sparse.csr_matrix(counts_t_array)
counts_d_matrix = sparse.csr_matrix(counts_d_array)

concat_t_array2 = hstack((concat_t_array_best, counts_t_matrix))
concat_d_array2 = hstack((concat_d_array_best, counts_d_matrix))

#Run LinearSVC using combined features
c = 1
lr_model = LinearSVC(C = c)
lr_model.fit(concat_t_array2, train_labels)

#Run predictions with dev data
predicted_labels = lr_model.predict(concat_d_array2)
  
#Print results for predictions with dev dataset
print('\033[1m\nLinear SVC on vectorized text and title fields with additional numerical data \033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(dev_labels,predicted_labels, average='weighted')))
results = confusion_matrix(predicted_labels, dev_labels)
print('\n', results)



Surprisingly, adding the data from train_data_counts severely degrades the effectiveness of our classifier. This is remarkable given that only 23 features are being added to the ~370,000 features in the concatenated dataset. 
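
One possible contributor worth checking is feature scale. The cell above builds counts_t_array from train_data_counts rather than from scaled_values_t, so the count columns may enter the model unscaled. The sketch below uses hypothetical values to show how a single raw character count can dominate a row of TF-IDF weights; it illustrates the effect and is not a diagnosis of our results.

```python
import numpy as np

# Hypothetical row: five TF-IDF weights plus one raw character count
tfidf_part = np.array([0.41, 0.33, 0.52, 0.12, 0.66])
raw_count = np.array([4200.0])
row = np.concatenate([tfidf_part, raw_count])

# Fraction of the row's squared norm that comes from the count column
share = raw_count[0] ** 2 / np.sum(row ** 2)
print(round(float(share), 6))
```

When one column carries nearly all of a row's magnitude, a margin-based linear model such as LinearSVC can be driven mostly by that column, so rerunning the experiment with scaled counts would be a reasonable follow-up.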

##Unsupervised Topic Modeling

In this section, we attempt to use an unsupervised learning algorithm to develop topics and classify title information into two different categories. We use the Latent Dirichlet Allocation method to identify the topics and associated features. 

In [None]:
#Number of words range start for Vectorizer 
num_words_a = 1
#Number of words range end for Vectorizer 
num_words_b = 1

#Transform train Title data to matrices of word unigram features
vectorizer_t = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
train_cv = vectorizer_t.fit_transform(train_data["title"]) 

#Define the number of top words to print for each topic
n_top_words=20

#Define model object
model=LatentDirichletAllocation(n_components=2, max_iter = 20, random_state = 20)

#Fit and transform data
lda_matrix_t = model.fit_transform(train_cv)

#Obtain components
topic_words = model.components_

#Get feature names
feature_names =  vectorizer_t.get_feature_names_out()

print("\nTopics and Topic Words")
#Print feature names for each topic 
for i, topic_dist in enumerate(topic_words):
  #Sort feature indexes by descending weight and keep the top n_top_words
  sorted_topic_dist = np.argsort(topic_dist)[::-1][:n_top_words]
  top_words = np.array(feature_names)[sorted_topic_dist]
  print('Topic', str(i+1), top_words)

#Determine the topic for each document
topic_assignments_t = []
doc_topic = model.transform(train_cv)
for n in range(doc_topic.shape[0]):
  topic_doc = doc_topic[n].argmax()
  topic_assignments_t.append([n+1, topic_doc, train_labels.iloc[n]])

for n in range(10):
  print('Document', topic_assignments_t[n][0], ' -- Topic:', topic_assignments_t[n][1])

df_t = pd.DataFrame(topic_assignments_t, columns =['Document', 'Topic', 'Label'], dtype = int)

temp_df = (df_t.groupby('Topic').size().sort_values(ascending=False) / df_t.groupby('Topic').size().sort_values(ascending=False).sum())*100

print('\n', temp_df)

plt.bar(temp_df.index, temp_df.values, color ='blue', tick_label = temp_df.index)
plt.xlabel("Topic")
plt.ylabel("Percent")
plt.title("Topic Distribution")
plt.show()

print('\nAccuracy Score of Topic Assignments : ' + str(metrics.accuracy_score(train_labels,df_t['Topic'])))

We compared the assigned topic for each row in the training dataset against the label for that row. The Latent Dirichlet Allocation method was not effective in predicting the label. As these results are significantly below previous attempts, we will no longer consider this method. Other unsupervised methods, or different parameters for this method, may be more useful.  
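
One caveat when scoring topics against labels is that topic numbers are arbitrary: topic 0 has no built-in link to label 0. A hedged sketch of a fairer comparison for the two-topic case is to take the better of the two possible topic-to-label mappings. The helper name and the toy assignments below are hypothetical.

```python
def best_alignment_accuracy(topics, labels):
    """Accuracy of two-topic assignments under the better topic-to-label mapping."""
    direct = sum(t == l for t, l in zip(topics, labels)) / len(labels)
    # Flipping the topic numbering turns every match into a mismatch and vice versa
    return max(direct, 1.0 - direct)

# Toy assignments (hypothetical)
topics = [0, 0, 1, 1, 0, 1]
labels = [1, 1, 0, 0, 1, 1]
accuracy = best_alignment_accuracy(topics, labels)
print(accuracy)
```

In this toy case only one of six rows matches under the direct numbering, but five of six match once the topics are flipped, so the raw accuracy alone would understate how well the topics separate the labels.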

#**Conduct Final Test**

Finally, we invoke our test data to evaluate the performance of our refined model overall. We conduct the following steps: 
* Vectorize and concatenate the test data text and title fields
* Run the linear SVC classifier against the vectorized data using the best options noted above

In [None]:
#Number of words range start for Text Vectorizer 
num_words_a = 1
#Number of words range end for Text Vectorizer 
num_words_b = 1

#Transform train TEXT data to matrices of word unigram features
vectorizer_t = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
train_cv = vectorizer_t.fit_transform(train_data["text"]) 
train_features = vectorizer_t.get_feature_names_out()

#Transform test TEXT data to matrices of word unigram features
vectorizer_test = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), vocabulary = train_features)
test_cv = vectorizer_test.fit_transform(test_data["text"]) 

#Number of words range start for Title Vectorizer 
num_words_a = 1
#Number of words range end for Title Vectorizer 
num_words_b = 3

#Transform training and test TITLE data to matrices of word unigram, bigram and trigram features
vectorizer_t_title = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b))
train_cv_title = vectorizer_t_title.fit_transform(train_data["title"]) 
train_features2 = vectorizer_t_title.get_feature_names_out()

vectorizer_test_title = TfidfVectorizer( stop_words= stop_words , ngram_range=(num_words_a, num_words_b), vocabulary = train_features2)
test_cv_title = vectorizer_test_title.fit_transform(test_data["title"]) 

#Concatenate TITLE and TEXT data
concat_t_array = hstack((train_cv_title, train_cv))
concat_test_array = hstack((test_cv_title, test_cv))

#Run LinearSVC using combined features from text and title fields
c = 1
lr_model = LinearSVC(C = c)
lr_model.fit(concat_t_array, train_labels)

#Run predictions with test data
predicted_labels = lr_model.predict(concat_test_array)
  
#Print results for predictions with test dataset
print('\033[1m\nLinear SVC on Vectorized Text and Title Fields \033[0m\n')
print('F1 Score : ' + str(metrics.f1_score(test_labels,predicted_labels, average='weighted')))
results = confusion_matrix(predicted_labels, test_labels)
print('\n', results)

Our classifier, using LinearSVC against vectorized text and title data performs well in our final test. 

#Conclusion

In our efforts to develop a novel classifier, we analyzed several different datasets - running exploratory data analysis and running preliminary classification models on each. One dataset suffered from significant data leakage problems. A second dataset contained very reliable labeling data but the dataset was too small. A third dataset had a more complex structure and would have required us to make many Twitter API calls for each row, making the analysis process much more difficult. We chose the fake-news dataset because it contained a sufficient amount of data, it was straightforward to analyze, and didn't suffer from data leakage problems. 

Our initial analysis consisted of several steps: 
*   Data load
*   Exploratory data analysis and data cleaning 
*   Development of a custom preprocessing routine
*   Conducting TF-IDF vectorization on the 'Text' field, which contained the article itself; first, using the default vectorizer options with the addition of English stopwords and second, using the custom preprocessor 
*   Running sklearn classifiers against the dataset that had applied the default as well as custom preprocessor

Our dataset needed cleaning. There were approximately 2500 rows with null values or text fields with improperly placed commas. We also determined that in our clean dataset, the author field contained unique values in each row, making this column undifferentiated and thus not useful for classification. The two primary fields to center our classification efforts on were the 'Title' and 'Text' fields. Our dataset was reasonably well balanced with 56.7% of the articles deemed as 'True' and 43.3% of the articles deemed as 'False'. 

After the initial analysis, we conducted a feature engineering exercise, which included the following steps: 
*   An analysis of the top 50 features
*   An experiment with bigram and trigram features on the 'Text' field
*   Error analysis, using R-ratio
*   Development of new features including word count, character count, sentence count, average sentence length, sentiment (using TextBlob), and named entity recognition (NER) of the 'Title' field
*   Applying standard scaling to the new features
*   Attempting various combinations of features and running classifiers after the addition of each group of new features above
*   Using an unsupervised learning algorithm (Latent Dirichlet Allocation) to develop topics and classify rows into those topics

We found that unigram features worked best for the 'Text' field and that unigram/bigram/trigram features worked best for the 'Title' field. The new features that we created (word count, character count, NER, sentiment, etc.) appeared to detract from successful classification. We were struck that adding just 23 newly created numerical features made our classifier perform much more poorly, given that there were 130,000 features in the original dataset. Standard scaling provided only a slight improvement (.003) in the F1 score. The best results were obtained by combining the vectorized 'Text' and 'Title' fields. 

Analysis of the top features clearly indicated a pattern of misinformation related to the 2016 election, with words such as "clinton", "hillary", "2016", "fbi", "election", "october", and "emails". The top features, in large part, seemed intuitively reasonable based on what the team perceived as news stories that had historically proven to be untrustworthy.  

An analysis of R-ratio errors did not turn out to be particularly fruitful, as we could not identify any clear patterns of errors that we could correct for, though through vectorization of the 'Title' field we achieved quite good results from the classifier. Latent Dirichlet Allocation similarly did not provide us with a great deal of insight into the data and subsequent classification. The algorithm seemed particularly unsuited for the 'Text' field, though results seemed a bit more balanced when used with the 'Title' field. In one run with multiple components, we observed that the vast majority of rows (98%) fell into one topic. In another case, the algorithm appeared to identify non-English words as a separate topic. 

By the end of the project, we had attempted four different classifiers: logistic regression, multinomial naive Bayes, random forest and linear SVC. Our efforts included a grid search to obtain the best parameters for the classifiers. Linear SVC and logistic regression performed far better than the other two, with linear SVC consistently edging out logistic regression. Using vectorized 'Text' and 'Title' fields with linear SVC resulted in an F1 score of around 98.3% - 98.6% in our final testing step. Based on the experiments that we conducted, we believe that our classifier performs reasonably well compared to the alternatives that were evaluated. 

