# Importing packages

In [210]:
# Importing packages * CLEAN NON USED MODULES
import pandas as pd
import numpy as np
import matplotlib
import sklearn
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold, train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, RocCurveDisplay 
import matplotlib.pyplot as plt

%matplotlib inline


# Setting up paths
First step is to setup files paths for the linguistics database and word2vec model made by Yang et al. 2020.

In [211]:
### Setting up relative file paths

# get current working directory
current_dir = Path.cwd()


# relative file path for linguistics database
data_file_rel_path = "../data/paper_data_streamlined.csv"

# make absolute path for linguistics database by combinig current working directory and relative path
data_path = (current_dir / data_file_rel_path).resolve()


# relative file path for word to vec model
model_rel_path = "../data/mag_200d_psy_eco_word2vec"

# make absolute path for word2vec model by combinig current working directory and relative path
model_path  = (current_dir / model_rel_path).resolve()

# Loading in linguistics database and word2vec model
The linguistics database is loaded in as the object "data", and the word2vec model is loaded in as "model".

In [212]:
#load in linguistics database
data = pd.read_csv(data_path, sep=',', names=['ID', 'bib', 'abstract', 'rep_score'])

# Removes some rows from the dataframe with NAs (remember that python starts counting from 0 . . .):
### CONSIDER MOVING TO R *
data.drop([0, 43, 92], axis = 0, inplace = True)


# Importing the word2vec model as a dataframe:
model = pd.read_csv(model_path, sep=' ', skiprows = 1, header=None)

# Calculate TF-IDF vectors for each paper
A TF-IDF vector is calculated for each documents in the collection. Each document is a abstract in the linguistics database, and the collection refers to the entire database of abstracts. 

This is done using the tfidfVectorizer function from scikit-learn. The vectors for each paper are saved as rows in a matrix. Each entry is the TF-IDF for a term in entire collection. Since many terms are not present in all documents, the resulting TF-IDFs will often be zero. The matrix is therefore saved as a sparse matrix, which is computationally more efficient. 


In [213]:
# Save the TF-IDF vectorizer function from sci-kit learn
tfidf_vectorizer = TfidfVectorizer()

# Make a sparse matrix containing TF-IDFs for each abstract. 
# This is done by passing the column containg Abstracts from the linguistics database to the vectorizer function. 
# Note that thext has be to converted to unicode strings. the text needs to be converted to unicode strings (see https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document)
matrix = tfidf_vectorizer.fit_transform(data['abstract'].values.astype('U')) 


# Working with the dictionary file from the TF-IDF
The TF-IDF matrix comes with .vocabulary_ method which returns an dictionary. The dictionary is structured where the keys are terms, and the values are the column index of the term in TF-IDF vector. We reverse the dictionary so that keys are index-locations in tf-idf matrix and values are the corresponding term.

In [214]:
# Get a list (techically dict_item) of tuples that are (term, index) from the vocabulary dictionary 
items = dict.items(tfidf_vectorizer.vocabulary_) 

# Turn this list of tuples into a data frame with two columns. One is term, the other is index.
dict_df = pd.DataFrame(items)

# Rename columns with sensible names
dict_df.columns = ['keys', 'values']

#Create a list with the order we want for the dataframe
column_titles = ['values', 'keys']

# Flip the order of columns in df using the list
dict_df = dict_df.reindex(columns = column_titles)


In [215]:
dict_df.shape
model.shape

model.iloc[:,0]

0              the
1               of
2              and
3               in
4               to
            ...   
275556    workover
275557    condotel
275558      kuntey
275559       houga
275560      gp-stn
Name: 0, Length: 275561, dtype: object

# Make the TF-IDF and word2vec model commensurable
In order to be able to multiply the TF-IDF vecotors with the word2vec model, we have make sure that they contain excactly the same terms. This is not the case now, which also results in the two matricises not being commnesurable. As a consequence, we have to remove all terms from the word2vec model that are not present in the TF-IDF. 

### Remove terms from word2vec model that are not present in the TF-IDF matrix
The following code iterates through the reversed dataframe of TF-IDF (term, index) pairs. For each term, the loop checks if the same term is present in the word2vec model. This results in a dataframe, where each row is a word vector for a term in the TF-IDF. If the match isn't found, the term is stored in another dataframe of missing matches. The rows of this filtered word2vec-model is also ordered as the columns of TF-IDF matrix.  * Are the matricies still aligned when TF-IDF is changed later? * Im unsure of the how the code works

In [216]:
# Trying out a for loop to extract terms from the model: 

# data 1 = dict_df 
# data 2 = model

# First we create an empty data frame to store the matched rows from the model
matched_df = pd.DataFrame(columns = model.columns)

# Then we create an empty dataframe to store the missing matches
missing_matches = pd.DataFrame(columns=['Term', 'Value'])

# We iterate through each term in the model:
for i, term in enumerate(dict_df['keys']):
    # Checking if the terms exists in the dictionary
    if term in model.iloc[:,0].values:
        # If there is a match, we extract the entire row from the model and add it to the matched_df
        row = model.loc[model.iloc[:,0] == term] # is this extracting the row?
        row.index = [i]
        matched_df = pd.concat([matched_df, row]) # how does this line work?
    else:
        # If no match is found, add the term to missing_matches
        missing_matches.loc[len(missing_matches)] = [term, dict_df.iloc[i, 0]]

matched_df = matched_df.sort_index()    


### Remove missing words from TF-IDF matrix 

In [217]:
# Converting the TF-IDF matrix from sparse to pandas dataframe
tf_idf_df = pd.DataFrame.sparse.from_spmatrix(matrix)

# make a list of the indicies of all columns in the TF-IDF matrix that didn't have a match in word2vec model.
missing_matches_list = list(missing_matches['Value'])

# Save only columns in TF-IDF matrix that are not present in the list of missing matches 
tf_idf_matched_df = tf_idf_df[tf_idf_df.columns[~tf_idf_df.columns.isin(missing_matches_list)]]





# Fixing and aligning indecies in TF-IDF matrix and in Word2vec matrix
Since we have removed rows and columns from the matricies, the indicies of the rows and columns doesn't line up with their actual placement. For example row 2000 in the word2vec model doesn't have id 1999. This is fixed in the following code.

In [218]:
#Select all but the first column in the matched df and save it to new matrix. This is done since the first row of the matrix has the actual terms.
word2vec_rindx = matched_df.iloc[:, 1:] 

#Fix row index so that no values are skipped
word2vec_rindx = word2vec_rindx.reset_index(drop = True) # Reindexing

#Fix column index so that each value corresponds to a dimension. Dimensions are 0-199
word2vec_rindx.columns = range(len(word2vec_rindx.columns)) 

# Doing the same for the tf_idf.matched dataframe: *?
tf_idf_rindx = tf_idf_matched_df.reset_index(drop = True) # Reindexing

tf_idf_rindx.reset_index(drop = True, inplace = True) # and changing the index range to 0-199

# Checking type and dimensions of our two dataframes:

print("Dimensions of the TF-IDF dataframe: ", tf_idf_rindx.shape)
print("Type of TF-IDF dataframe: ", type(tf_idf_rindx))
print(" ")
print("Dimensions of the word2vec dataframe: ", word2vec_rindx.shape)
print("Type of word2vec dataframe: ", type(word2vec_rindx))


Dimensions of the TF-IDF dataframe:  (95, 2501)
Type of TF-IDF dataframe:  <class 'pandas.core.frame.DataFrame'>
 
Dimensions of the word2vec dataframe:  (2501, 200)
Type of word2vec dataframe:  <class 'pandas.core.frame.DataFrame'>


# Multiplying the matrices: TF-IDF x Word2Vec 
We now multiply the TF-IDF matrix and the word2vec matrix. The TF-IDF is a (95,2501) matrix and the word2vec is a (2501, 200) matrix. This makes them comensurable or "multiplyable". This results in a (95,200) matrix. This resulting matrix is corresponds to a vector for each paper that has reweighted the word vectors with the papers TF-IDF values. 

In [219]:
#Turn the both matricies into a numpy arrays in order to do numpy operations
tf_idf_rindx_np = tf_idf_rindx.to_numpy() 
word2vec_rindx_np = word2vec_rindx.to_numpy()

#Multiply the the two 
tf_idf_w2v_product = np.matmul(tf_idf_rindx_np, word2vec_rindx_np)

# Append ground truths to each paper's vector
We code each paper's 200d vector with 1 if it is judged replicated or partially replicated. 0 if not.

In [220]:
#Take the column of yes/partial/no replication encodings from the linguistics database. Outputs a pandas Series
rep_column = (data['rep_score'])

#Turn it into a numpy array
rep_column = rep_column.to_numpy()

# Loop through each entry and turn yes/partial into 1 and 0 if not. 
for i, val in enumerate(rep_column):
    if val == "yes" or i == "partial":
        rep_column[i] = 1
    else:
        rep_column[i] = 0

In [221]:
# Append outcomes to tf-idf * w2v product matrix:

tf_idf_w2v_encoded = np.c_[ tf_idf_w2v_product, rep_column ]

tf_idf_w2v_encoded.shape

(95, 201)

Final matrices:

- tf_w2v_encoded

- tf_idf_w2v_encoded 

# ====== Training model =======

In [None]:
#remove last column and add to y vector
# if len(tf_idf_w2v_encoded.columns) == 201:
#     y = tf_idf_w2v_encoded.pop(200)

# #Predictors
# X = tf_idf_w2v_encoded

# # Train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
# # Stratify = y means that the training set will be balanced between the classes
# # Random_state = 42 is a random seed, to ensure that the otherwise randomized train-test-set split is the same each time we run the kernel

# #Fit Randomforest
# clf = RandomForestClassifier(max_depth=3, random_state=0)
# #clf.fit(X_train, y_train.astype('int'))

# #cross validation with grid search. Tries a bunch of hyperparameters, whatever they are, and picks the best performing model
# space = dict()
# space['n_estimators'] = [10,50,100]
# space['max_features'] = [2,4,6]
# search = GridSearchCV(clf, space, cv = KFold(), refit=True)
# result = search.fit(X_train, y_train.astype('int'))
# best_model = result.best_estimator_

# print(best_model.get_params)

In [None]:
# Split the dataset into features and target
y = tf_idf_w2v_encoded.iloc[:, 200] # Target
y = pd.get_dummies(y)
X = tf_idf_w2v_encoded.iloc[:, :200] # Features

# Check frequencies for the different outcomes in the dataset
print(y.value_counts())

0  1
0  1    59
1  0    36
dtype: int64


In [None]:
# Splitting the data-set into a train and test set



# Strangely, just increasing the size of the test set increases accuracy. Clearly, something super random is going on.

# Stratify = y means that the training set will be balanced between the classes
# Random_state = 42 is a random seed, to ensure that the otherwise randomized train-test-set split is the same each time we run the kernel

In [None]:
# Making the model print its parameters
sklearn.set_config(print_changed_only=False)

# Instantiating the Random Forest Classifier
forest = RandomForestClassifier(max_depth=3)

#empty list of observed accuraices
accuracies = []

#make a new train-test split and test accuracy. Becomes expensive pretty fast. 100 iterations are 9secs on M1 Macbook Air
for i in range(30):

    # Make train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # train model for each iteration
    forest.fit(X_train, y_train)

    # generate predictions
    y_pred_test = forest.predict(X_test)

    # score predicitons
    accuracy = accuracy_score(y_test, y_pred_test)

    # save accurcy score to list
    accuracies.append(accuracy)

#turn list of accuracies into dataframe
accuracies_df = pd.DataFrame(accuracies)
#rename column
accuracies_df.columns = ['accuracy']
#save to .csv for visualization in r
accuracies_df.to_csv("../data/accuracies.csv")




While the accuracy is not very impressive, it is not the only measure of how well the model actually performs. According to [Kreiger](https://medium.com/analytics-vidhya/evaluating-a-random-forest-model-9d165595ad56), we can use a confusion matrix to get more information about our model – how well does it classify the different classes (in our case 2)?

In [None]:
# Converting the y_pred_test to dataframe
y_pred_test = pd.DataFrame(y_pred_test)

# Checking the values out as binary vectors
print(y_test.values.argmax(axis=1))
print(y_pred_test.values.argmax(axis=1))

[0 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 0 0 0]
[0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1]


In [None]:
# Making confusion matrix
confusion_matrix(y_test.values.argmax(axis=1), y_pred_test.values.argmax(axis=1))
# Notice how the inputs are converted to binary strings using the argmax along axis 1

array([[ 2,  9],
       [ 3, 18]])

In [None]:
# Printing the classification report
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.40      0.18      0.25        11
           1       0.67      0.86      0.75        21

   micro avg       0.62      0.62      0.62        32
   macro avg       0.53      0.52      0.50        32
weighted avg       0.57      0.62      0.58        32
 samples avg       0.62      0.62      0.62        32



It would be awesome to include the tf vectors as well into the feature set . . .
But how is this done?