<a href="https://colab.research.google.com/github/Zihooo/Text-selection-codes-pub/blob/main/The_TFIDF_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Feature Extraction Approach for Personality Score Prediction
This Colab notebook is written in **Python** to illustrate the *feature extraction approach*, which uses TF-IDF scores and a random forest regressor to predict personality scores from text.

### **Step 1 Text Preprocessing** 
In the text preprocessing phase, we (1) remove special characters and punctuation, (2) lowercase the texts, (3) tokenize the texts, and (4) remove stop words.
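
As a rough illustration of these preprocessing steps on a single sentence, the sketch below uses only the Python standard library. It is a simplification and not the pipeline used in this notebook: whitespace splitting stands in for NLTK's `word_tokenize`, and `TOY_STOP_WORDS` is a small hypothetical stop word list rather than NLTK's English list.

```python
import string

# Hypothetical, tiny stop word list used for illustration only
TOY_STOP_WORDS = {"i", "to", "the", "a", "and", "would", "dont"}

def preprocess(text):
    # remove ASCII punctuation characters
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # lowercase the text
    lowered = no_punct.lower()
    # tokenize (whitespace split as a stand-in for word_tokenize)
    tokens = lowered.split()
    # drop stop words and join the remaining tokens back into one string
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    return " ".join(kept)

example = preprocess("I don't like to talk, and I would rather listen!")
print(example)  # like talk rather listen
```

Because punctuation is stripped before the stop word check, "don't" becomes "dont", which is why the stop word list in this notebook is extended with forms such as "dont" and "id".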

In [None]:
# Mount Google drive to get access to the data
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# import required packages
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
string.punctuation

In [None]:
# import raw data
csv_file = '/content/drive/MyDrive/Files/Text Selection Paper Codes/data/All_data.csv' # path to data file
df = pd.read_csv(csv_file, encoding= 'unicode_escape')


In [None]:
# define a function that removes special characters and punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree

In [None]:
# store the punctuation-free text in a new column
df['clean_text'] = df['All_response_raw'].apply(remove_punctuation)

In [None]:
# lowercase the texts
df['msg_lower']= df['clean_text'].apply(lambda x: x.lower())

In [None]:
# tokenize the lowercased texts
df['msg_tokenied']= df['msg_lower'].apply(word_tokenize)

In [None]:
# Load the pre-defined stop words dictionary
stop_words = stopwords.words('english')
# Extend the stop words dictionary with some high-frequency words in the current data
# (entries must be lowercase because tokens are compared in lowercase)
stop_words.extend(["would","dont","could","id","x"])

In [None]:
# define remove stop words function
def remove_english_stopwords_func(text):
    # check in lowercase 
    t = [token for token in text if token.lower() not in stop_words]
    text = ' '.join(t)    
    return text

In [None]:
# remove stop words
df['No_Stop_Words'] = df['msg_tokenied'].apply(remove_english_stopwords_func)

### **Step 2 Feature Extraction** 
In the feature extraction phase, we generate the TF-IDF vectors. 
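
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. It follows what is, to our understanding, the default weighting of scikit-learn's `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each document vector). The helper `tfidf_vectors` and the toy documents are hypothetical. The sketch splits on whitespace and does not implement `max_features`, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so that every document vector has unit length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

docs = ["like talk people", "like quiet evening", "talk talk talk"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

In the first toy document, "people" receives a larger weight than "like" because it occurs in fewer documents, which is the behavior that lets TF-IDF emphasize distinctive words.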

In [None]:
document = df.No_Stop_Words

In [None]:
# generate TF-IDF vectors with 2000 features
vectorizer = TfidfVectorizer(max_features=2000)
vectors = vectorizer.fit_transform(document)
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
tfidf = pd.DataFrame(denselist, columns=feature_names)

In [None]:
# add labels to TF-IDF vectors
tfidf['ascore'] = df['ascore']
tfidf['cscore'] = df['cscore']
tfidf['nscore'] = df['nscore']
tfidf['escore'] = df['escore']
tfidf['oscore'] = df['oscore']
tfidf['split'] = df['split_set']          # the original data were already split into training and testing sets

In [None]:
# save the TF-IDF scores
# tfidf.to_csv('/content/drive/MyDrive/personality prediction/tfidf/tfidf.csv')  # after the file was saved, it was further split into training, evaluation, and testing sets.

### **Step 3 Score Prediction** 
In the score prediction phase, we use a random forest model to predict personality scores from the TF-IDF vectors. This code sample uses the prediction of Extraversion scores as an example. The other traits can be predicted by changing the label column.

In [None]:
# import required packages
!pip install scipy
from scipy.stats import pearsonr
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor  # needed by the random search below
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# load the previously split training and testing sets
train_set = pd.read_csv('/content/drive/MyDrive/Text Selection Paper Codes/data/train_tfidf.csv', encoding= 'unicode_escape')
test_set = pd.read_csv('/content/drive/MyDrive/Text Selection Paper Codes/data/test_tfidf.csv', encoding= 'unicode_escape')

In [None]:
# specify the first 2000 columns as features, and the personality scores as labels
train_features = train_set.iloc[:,0:2000]
test_features = test_set.iloc[:,0:2000]
train_labels = train_set.escore
test_labels = test_set.escore
feature_list = list(train_features.columns)

In [None]:
# set the parameter grid for the random search
param_grid = {'max_depth': [10, 50, 100],
 'n_estimators': [200, 600, 1000]}

In [None]:
# Random search of parameters, using 5-fold cross validation.
# The grid has 3 x 3 = 9 combinations, so n_iter = 9 covers all of them
# (45 fits in total); use all available cores
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = param_grid, n_iter = 9, cv = 5, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(train_features, train_labels)
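
The size of this search can be checked with a few lines of standard-library Python. The sketch below only enumerates the grid; it does not fit any model. The names `combinations` and `cv_folds` are introduced here for illustration.

```python
from itertools import product

param_grid = {"max_depth": [10, 50, 100], "n_estimators": [200, 600, 1000]}
cv_folds = 5  # number of cross-validation folds used in the search

# every (max_depth, n_estimators) pair in the grid
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)  # 9 45
```

Because every parameter is given as a list, the randomized search samples without replacement, so with only nine candidates it ends up evaluating the whole grid.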

In [None]:
# get best estimators
best_grid = rf_random.best_estimator_

In [None]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate the model with the best parameters from the random search
# (rf_random.best_params_ holds the selected values; n_estimators is hard-coded here)
rf = RandomForestRegressor(n_estimators = 1000, random_state = 100)
# Train the model on training data
rf.fit(train_features, train_labels);

In [None]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# get the correlation between predicted scores and labels
pearsonr(predictions,test_labels)
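
For reference, `pearsonr` returns the correlation coefficient together with a p-value. The coefficient itself can be computed by hand, as in the minimal sketch below. The helper `pearson_r` and the toy numbers are hypothetical and are not results from this notebook.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # sum of cross-products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # sums of squared deviations
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# toy values for illustration only
toy_predictions = [3.1, 3.4, 2.9, 3.8]
toy_labels = [3.0, 3.6, 2.5, 4.0]
r = pearson_r(toy_predictions, toy_labels)
print(round(r, 3))
```

A value close to 1 means the predicted scores rank and scale the test cases much like the observed scores. It does not by itself show that the predictions are on the same numeric scale as the labels.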

In [None]:
# save the predicted scores (adjust the file name to the trait being predicted)
dfpred = pd.DataFrame(predictions)
dfpred.to_csv('/content/drive/MyDrive/personality prediction/final-saved outputs/TFIDF/test_O.csv')