# Adding new nodes
Get the SSP texts that are not yet in the network and convert them to records. If the intersection between a record's citations and the records already in the network is greater than the threshold, add the record to the network.
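
The selection rule can be sketched with plain Python sets. This is only an illustration under stated assumptions, not the notebook's implementation: the function name `select_new_nodes`, the toy IDs and citation strings, and the `threshold` parameter are all hypothetical. The default of 0 mirrors the notebook's behaviour of keeping any record that cites at least one work already in the network.

```python
def select_new_nodes(candidates, network, threshold=0):
    """Return the candidates whose citations overlap the network by more than threshold.

    candidates: dict mapping a candidate ID to the list of works it cites
    network:    iterable of works already in the network
    """
    network = set(network)
    selected = {}
    for candidate_id, citations in candidates.items():
        overlap = network.intersection(citations)
        if len(overlap) > threshold:
            # Keep the overlapping works so they can be inspected later
            selected[candidate_id] = sorted(overlap)
    return selected


# Hypothetical toy data: these IDs and citation strings are made up.
network = ["A 2014", "B 2015", "C 2016"]
candidates = {
    "W1": ["A 2014", "B 2015", "X 2001"],
    "W2": ["Y 1999"],
    "W3": [],
}
selected = select_new_nodes(candidates, network)
print(selected)
```

Raising `threshold` makes the rule stricter, for example requiring more than two shared citations before a record is added.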

### <u>Running Order for the Cells</u>

- Import the desired Pandas and Metaknowledge packages
- Define the global variables for the Column Order of the Database, the Path of the file directories and the Database list of Record Citation Objects
- Run the cell containing all of the functions
- Run the following cells up until the matrix is created and then check the articles for SSP contents
    - If an article does not reference the SSPs, remove it from the Database and mark it in red in the matrix file
- Run the last cell to create the text records after having gone through the new entries in the database

In [2]:
import pandas as pd
import metaknowledge as mk

In [3]:
# column_order is the order in which the columns appear in the SSP database
column_order = ["Continent", "Region", "SSPs", "RCPs", "Combinations", "Title", "Authors", "Year", "Abstract", "Primary Sector", "Secondary Sectors", "Type", "URL", "DOI", "WOS"]

# the path for the file directories; edit as required
# the path will differ between machines; this one is for macOS
path = '/Users/aidanpower/Desktop/OneDrive - University of Waterloo/Research/'

# the record collection for existing files in the database
DatabaseRC = mk.RecordCollection(path + "SSP citations/")
Database = [R.createCitation() for R in DatabaseRC] 

In [57]:
# create_dict consumes a Record Collection and creates a dictionary of key and value pairs for each new entry in the database. 
# Each Key is a new Record Citation Object and each Value is a list of all of the Record Citation Objects from the Key's references.
# The Web of Science ID strings are used to make comparisons between articles that are already in the database and those that are not. 
# Records that don't have any citations in the database are removed before the dictionary is returned. 
# create_dict: RecordCollection --> dictof()
# Effects: Writes a .txt file to disk containing the WOSIDs

def create_dict(RecordCollection):
    dictionary = {}
    WOS_list = []

    previous_entries = list(df["WOS"])
    for Record in RecordCollection:
        Citations = Record.get("CR")
        threshold = []
        if Record.get("UT") not in previous_entries:
            if Citations is not None:
                for cite in Citations:
                    if cite in Database:
                        threshold.append(cite)
                dictionary[Record.createCitation()] = threshold
            if len(threshold) > 0:
                WOS_list.append(Record.get("UT"))
  
    # filter the dictionary
    dictionary = {i:dictionary[i] for i in dictionary if len(dictionary[i])>0}
    
    # Write WOS_list to file
    try:
        with open(path + "WOS_list.txt", 'w') as file:
            for WOSID in WOS_list:
                file.write("{}\n".format(WOSID))
        print("File successfully saved!")
    except OSError: 
        print("The file WOS_list.txt could not be saved\n")

    # Return the dictionary even if the ID list could not be written
    return dictionary
    

# create_matrix consumes the desired name of the file, the existing dictionary and the existing database. It uses the Record Citation Objects
# from the existing Database as the column titles and the Record Citation Objects from the dictionary of new entries as the row titles to 
# determine what each new entry cites from the existing database. Each such intersection is given the value 1; all other cells are given the value 0.
# create_matrix: Str, Dictionary, listof(Str) --> PandasDataframe
# Effects: Writes a .csv file to disk containing the SSP Matrix for the given year

def create_matrix(Name, Dictionary, Database):
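    # object dtype so that cells can hold the Citation objects used as titles as well as the 0/1 values
    matrix = pd.DataFrame(0, index = range(len(Dictionary) + 1), columns = range(len(Database) + 1), dtype = object)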
    rows, columns = matrix.shape
    
    for col in range(columns):
        if col == 0:
            continue
        else:
            matrix.loc[0, col] = Database[col - 1]
            
    for row in range(rows):
        if row == 0:
            continue
        else:
            matrix.loc[row, 0] = list(Dictionary.keys())[row - 1]
    rows, columns = matrix.shape
    
    for row in range(rows):
        for col in range(columns):
            if row == 0 or col == 0:
                continue
            else:
                article = matrix.loc[0, col]
                if article in list(Dictionary.values())[row - 1]:
                    #print("{}, {}".format(i,j))
                    matrix.loc[row, col] = 1
                    continue                     
    try:               
        matrix.to_csv("Matrices/" + Name)
        print("File successfully saved!")
    except OSError: 
        print("The CSV file for {} could not be saved\n".format(Name))    

    # Return the matrix even if the CSV file could not be written
    return matrix
 

# new_entries creates a new dataframe by consuming the given RecordCollection and entering the new Records into the dataframe.
# The information from the Records is as follows: Full name of Authors, Abstract, DOI, Publication Year, Data Type and the WOS ID number.
# new_entries: RecordCollection, Dictionary --> PandasDataframe

def new_entries(RecordCollection, Dictionary):
    new_df = pd.DataFrame(columns = column_order)
    # Citation strings of the new entries, used for the membership test below
    citation_objs = {str(x) for x in Dictionary}
    
    for Record in RecordCollection:
        if str(Record.createCitation()) in citation_objs:
            entry = pd.DataFrame(columns = column_order)
            labels = {"Authors":"AF", "Title":"TI", "Abstract":"AB", "DOI":"DI", "Year":"PY", "Type":"DT", "WOS":"UT"}
            
            for label in labels:
                try:
                    if label == "Authors":
                        authors = Record.get(labels[label])
                        # Join the full author names with "; " without truncating the last name
                        entry["Authors"] = ["; ".join(authors)]
                    elif label == "DOI":
                        doi = Record.get(labels[label]).lower()
                        entry["DOI"] = doi
                    else:
                        entry[label] = Record.get(labels[label])
                except: 
                    print("The {} for '{}' could not be found\n".format(label, Record.createCitation()))
            new_df = pd.concat([entry, new_df], ignore_index = True)
    return new_df
    

# writefile consumes the desired name of the file and saves the database as an Excel file. It takes the existing database and 
# adds the new entries to it before saving it as an Excel file.  
# writefile: Str, Dictionary --> None
# Effects: Writes a .xlsx file to disk containing the SSP Database

def writefile(Name, Dictionary):
    try:
        writer = pd.ExcelWriter(path + Name, engine="xlsxwriter")

        newdf = pd.concat([df, new_entries(newRC, Dictionary)], ignore_index=True)
        newdf = newdf[column_order]
        newdf.to_excel(writer, sheet_name= "DB 3.0", na_rep="", freeze_panes=(1,0), index=False)

        workbook = writer.book
        worksheet = writer.sheets["DB 3.0"]

        header_format = workbook.add_format({'bold':True, 'fg_color':'#FABF8F', 'border':0, 'font_size':11})
        for col_num, value in enumerate(newdf.columns.values):
            worksheet.write(0, col_num, value, header_format)

        # close() writes the workbook to disk, so a separate save() call is not needed
        writer.close()        
        print("File successfully saved!")
    except: 
        print("The Excel file for {} could not be saved\n".format(Name))
    
    
# add_records adds a new record file to the existing folder of record collections (path/SSP Citations/) that contains all of the
# plain text records for the most recent record collection added to the database, after it has been screened for articles that were
# incorrectly included in the Excel file database. This incorporates the list of Web of Science IDs that was saved previously.
# add_records: Str --> None
# Effects: Writes a .txt file to disk containing the plain text records of the screened record collection

def add_records(Name):
    NewRC = mk.RecordCollection()    
    previous_entries = list(df["WOS"])
    
    # Read back the IDs from the same location create_dict wrote them to
    with open(path + "WOS_list.txt", 'r') as file:
        WOS_list = [line.strip() for line in file]
    
    for WOSID in WOS_list:
        # df is re-read after screening, so keep only the new entries that are still in the database
        if WOSID in previous_entries:
            try: 
                Record = newRC.getID(WOSID)
                NewRC.add(Record)
            except:
                print("The record for {} could not be found\n".format(WOSID))
    try:
        NewRC.writeFile(path + Name)
        print("File successfully saved!")
    except: 
        print("The RecordCollection file for {} could not be saved\n".format(Name))

In [58]:
# Updating the excel database

# DO NOT RUN ME

if 1 == 0:
    path = '/Users/aidanpower/Desktop/OneDrive - University of Waterloo/Research/'

    df = pd.read_excel(path + "Database.xlsx", sheet_name = "DB 3.0")
    """
    WOS_dict = {}
    AU = {}
    l = list(df["DOI"])
    for i in l:
        i = str(i).lower()
    for R in DatabaseRC:
        doi = R.get("DI").lower()
        if doi in l:
            AU[doi] = R.get("AF")
            WOS_dict[doi] = R.get("UT")

    for i in AU:
        b = ""
        for a in AU[i]:
            b = b + a + "; "
            AU[i] = b
        AU[i] = AU[i][:-2]
    
    
    for w in WOS_dict:
        row = df.loc[df["DOI"] == w].index
        df["WOS"][row] = WOS_dict[w]
        df["Authors"][row] = AU[w]
    """
    

In [59]:
# newRC is the record collection for files to be added
newRC = mk.RecordCollection(path + "Testing info/SSP 2018.txt")

# Create a pandas dataframe from the Database excel file, from before new entries are added
df = pd.read_excel(path + "Database.xlsx", sheet_name = "DB 3.0")

Dictionary = create_dict(newRC)

File successfully saved!


In [60]:
writefile("Database.xlsx", Dictionary)

File successfully saved!


In [61]:
matrix = create_matrix("SSP citation matrix.csv", Dictionary, Database)

File successfully saved!
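
To make the layout of the citation matrix concrete, here is a small sketch that builds the same kind of 0/1 grid from toy data using plain lists instead of pandas. The function name `build_matrix` and the IDs and citation strings are hypothetical and serve only as an illustration.

```python
def build_matrix(new_entries, database):
    """Build a 0/1 grid: one row per new entry, one column per database record.

    new_entries: dict mapping a new entry to the database records it cites
    database:    list of records already in the database
    """
    # The first row holds the column titles; the first cell is left empty
    rows = [[""] + list(database)]
    for entry, cited in new_entries.items():
        cited = set(cited)
        # The first cell of each row is the row title, then 1 where the entry cites the record
        rows.append([entry] + [1 if record in cited else 0 for record in database])
    return rows


# Hypothetical toy data
database = ["A 2014", "B 2015", "C 2016"]
new_entries = {"W1": ["A 2014", "C 2016"], "W2": ["B 2015"]}
matrix_rows = build_matrix(new_entries, database)
for row in matrix_rows:
    print(row)
```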


## STOP here and check the entries

This is the point at which to look through the new entries in the Excel database. If an entry does not make use of the SSPs, delete it from the database by removing its row, and mark the entry in red in the citation matrix file. Once that is done, you can add the records.
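
One way to double-check the screening before adding records is to compare the saved Web of Science IDs with the IDs still present in the database. The sketch below assumes that an entry counts as accepted when its ID remains in the database after the manual check. The function name `split_screened` and the ID strings are hypothetical.

```python
def split_screened(saved_ids, database_ids):
    """Split the saved IDs into those kept in the database and those removed by hand."""
    database_ids = set(database_ids)
    kept = [wos_id for wos_id in saved_ids if wos_id in database_ids]
    removed = [wos_id for wos_id in saved_ids if wos_id not in database_ids]
    return kept, removed


# Hypothetical IDs: "WOS:0002" was deleted from the database during screening
saved_ids = ["WOS:0001", "WOS:0002", "WOS:0003"]
database_ids = ["WOS:0000", "WOS:0001", "WOS:0003"]
kept, removed = split_screened(saved_ids, database_ids)
print("kept:", kept)
print("removed:", removed)
```

The `removed` list should match the entries marked in red in the matrix file.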

In [None]:
# Create a pandas dataframe from the Database excel file, from after new entries are added
df = pd.read_excel(path + "Database.xlsx", sheet_name = "DB 3.0")

add_records("SSP Citations/SSP 2018.txt")

Count the citations for each new entry. For any entry with more than one, compare against the 107 features of the matrix database to see which papers specifically are being cited. Create a new record collection in which each object is one of the records that has at least one citation, then retrieve that record collection and feed it to the matrix.
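
The counting step in this note could look roughly like the sketch below, which works on a plain dictionary of the same shape as the one returned by `create_dict`. The function name `citation_counts` and the toy data are hypothetical.

```python
from collections import Counter


def citation_counts(new_entries):
    """Count database citations per new entry and how often each database record is cited."""
    per_entry = {entry: len(set(cited)) for entry, cited in new_entries.items()}
    per_record = Counter(
        record for cited in new_entries.values() for record in set(cited)
    )
    return per_entry, per_record


# Hypothetical toy data: each new entry maps to the database records it cites
new_entries = {"W1": ["A 2014", "C 2016"], "W2": ["A 2014"], "W3": []}
per_entry, per_record = citation_counts(new_entries)

# Entries with more than one citation into the database deserve a closer look
multi = [entry for entry, n in per_entry.items() if n > 1]
print(per_entry)
print(per_record.most_common())
print(multi)
```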

Current search string: (TS = (Shared AND (Socioeconomic OR Socio-economic) AND Pathway*) AND PY = (2011-2018)) AND LANGUAGE: (English)