# Pattern matching


We will start in the same way as the last notebook, by downloading/importing the packages needed and importing the .csv file(s) needed. In this case, we only need the .csv file that has the matched abstracts, as we are specifically looking at person-first and identity-first patterns that are "about" autism (or ASD, Asperger's syndrome, etc.). 

We could use the same basic approach to look at person-first and identity-first language for other conditions for which there are good noun and adjective forms of the words (diabetes? obesity? cancer? something else?). Doing that would mean using the .csv file with all of the abstracts or potentially creating an entirely new file of abstracts matched to another condition of interest. However, that lies outside the scope of this research, so I will not address it further here. 

## Get ready 

As always, we start with code that:
* loads up and nicknames some useful packages, 
* checks file locations,
* imports files, and 
* checks them. 


In [None]:
%%capture

!pip install nltk
!pip install spacy -q
!python -m spacy download en_core_web_lg -q

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for Natural Language Toolkit and is useful for text-mining. 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize  
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
nltk.download('punkt')
from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
nltk.download('averaged_perceptron_tagger')        # a part-of-speech (POS) tagger
nltk.download('wordnet')
nltk.download('webtext')
from nltk.corpus import webtext

import pandas as pd
pd.set_option('display.max_colwidth', 200)
import numpy as np
import statistics

import csv                        # csv is for importing and working with csv files

from collections import Counter

import re                         # things we need for RegEx corrections
import string 
import spacy 
from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 
nlp = spacy.load('en_core_web_lg')
nlp.max_length = 1500000 #or any large value, as long as you don't run out of RAM

import math
import matplotlib.pyplot as plt
print(os.listdir("..\\output"))

# The '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking),
# including the folder listing printed by the line above.
# You can remove or comment it out if you prefer to see the output.
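
The file paths in this notebook are written with Windows-style separators (`..\\output`). A minimal sketch, assuming the same folder layout, of building the same paths with `pathlib` so that the cells also run on macOS or Linux. `OUTPUT_DIR` and `output_path` are illustrative names and are not part of the original notebook.

```python
from pathlib import Path

# Assumption: the notebook sits in a folder next to the 'output' folder.
OUTPUT_DIR = Path("..") / "output"

def output_path(filename):
    """Return the path of a file inside the output folder (hypothetical helper)."""
    return OUTPUT_DIR / filename

matched_csv = output_path("matched_abstracts_no_null_texts.csv")
print(matched_csv.name)      # the file name only
print(matched_csv.parent)    # the folder part, using this machine's separator
```

A path built this way can be passed to `pd.read_csv` in place of the string version.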

## Import

Having checked the contents of the output folder and seen the files we expected to see, we can now import the specific file of interest for this step of the analysis.

In [None]:
matched_texts = pd.read_csv('..\\output\\matched_abstracts_no_null_texts.csv')    # one for just those that match the keyword
len(matched_texts)                                                                # check the length 

## Cleaning phase

Cleaning begins by turning any runs of extra whitespace (two or more in a row) into a single whitespace. It then identifies run-on sentences (where a lowercase letter, a full stop, and an uppercase letter are clustered without a whitespace) and inserts a whitespace between the full stop and the uppercase letter. Both of these steps improve the sentence tokenisation that happens next. 

Then, we proceed to sentence tokenising the text. Like word tokens, sentence tokens become the unit of analysis. As a trivial example, sentence tokenisation would turn a short text such as 


''' The cat named Cat is one of five cats. Honestly, I wonder why I have so many cats.
''' 

into a list of sentence tokens like

''' [[The cat named Cat is one of five cats.]

[Honestly, I wonder why I have so many cats.]]

''' 

An important difference is that the punctuation within the sentences that contributes to their structure and meaning (e.g. the comma and the full stops) is retained. This punctuation, like the capitalisation at the start of the sentences or of the proper nouns, is kept because it helps later processing identify the parts of speech of the words within the sentence correctly (e.g. which of the words are nouns, verbs, etc.). 



The sentence tokens are then put on individual rows, filtered to retain only those that contain one or more of the keywords of interest, and then filtered to ensure that there are no empty rows or duplicates. 
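
The split, filter and de-duplicate steps just described can be sketched with plain Python lists. This is only an illustration under stated assumptions: the sample text is made up, and a naive regular-expression splitter stands in for nltk's `sent_tokenize`, which copes with abbreviations and other edge cases far better.

```python
import re

text = ("Autism was assessed in 40 children. The control group had no diagnosis. "
        "Children with ASD were older. Children with ASD were older.")

# Assumption: break after ., ! or ? when followed by whitespace and a capital letter.
# This is a rough stand-in for sent_tokenize, not a replacement for it.
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

keyword = re.compile(r'[Aa]utis|ASD|AS|[Aa]sperger')     # the keywords of interest
matching = [s for s in sentences if keyword.search(s)]   # keep sentences with a keyword

unique_matching = list(dict.fromkeys(matching))          # drop duplicates, keep order

print(len(sentences), len(matching), len(unique_matching))
```

In the notebook itself the same three steps are done on a data frame with `explode`, `str.contains` and `drop_duplicates`.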

In [None]:
def remove_errors (input):
    no_extra_spaces = re.sub(r'(\s)(\s+)', r'\1', input)                  # turn 2+ sequential whitespaces into 1
    no_run_ons1 = re.sub(r'([a-z]\.)([A-Z])', r'\1 \2', no_extra_spaces)  # splits run-ons (e.g. "word.New sentence")
    no_run_ons2 = re.sub(r'([A-Z]\.)([A-Z])', r'\1 \2', no_run_ons1)      # splits run-ons (e.g. "ACRONYM.New sentence")
                                                                          # '\.' is a literal full stop; a bare '.' matches any character
    return(no_run_ons2)
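
A small, self-contained sketch of the two cleaning ideas (collapsing whitespace and splitting run-ons) on a made-up string. The helper name `split_run_ons` and the sample text are assumptions for illustration. The last lines show why the full stop in the pattern needs a backslash: an unescaped `.` matches any character, including an ordinary space.

```python
import re

def split_run_ons(text):
    """Collapse repeated whitespace, then add a space after a full stop that is
    directly followed by a capital letter (hypothetical helper)."""
    single_spaced = re.sub(r'\s{2,}', ' ', text)
    return re.sub(r'([A-Za-z]\.)([A-Z])', r'\1 \2', single_spaced)

sample = "Children were assessed for ASD.Results   are shown.No differences emerged."
cleaned = split_run_ons(sample)
print(cleaned)

# With a bare '.', the pattern also fires on "h A" in "with ASD" and adds a second space
unescaped = re.sub(r'([a-z].)([A-Z])', r'\1 \2', "children with ASD")
print(repr(unescaped))
```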

In [None]:
no_run_ons = [remove_errors(abstract) for abstract in matched_texts['Text'] ] 
                                             # create abstract list without extra spaces/run-ons 
                                             # this is to improve sentence tokenisation later 
matched_texts['Sentence'] = no_run_ons       # copy the no extra space/run-on abstract list back into df as a new column

In [None]:
sentences  = [sent_tokenize(abstract) for abstract in matched_texts['Sentence'] ] # tokenise the cleaned abstracts (not the raw 'Text')
matched_texts['Sentence'] = sentences                                   # copy that list back into df as a new column
sentence_per_row = matched_texts.explode('Sentence')                    # explode column in new df with 1 row/sentence token
print("How many sentences in total: " + str(len(sentence_per_row)))     # check the length of new df


In [None]:
print(sentence_per_row[['Text','Sentence']])                            # have a look. The selected rows should have 
                                                                        # 'Text' the same, but 'Sentence' different 

In [None]:
matched_sentences = sentence_per_row[sentence_per_row['Sentence'].str.contains('[Aa]utis|ASD|AS|[Aa]sperger')]
                                                     # create a new data frame with only the sentences that contain keywords
print("How many matching sentences: " + str(len(matched_sentences)))            # check the length

In [None]:
matched_sentences = matched_sentences[~matched_sentences['Sentence'].isnull()]  # remove any rows with empty 'Sentence' column
matched_sentences = matched_sentences.drop_duplicates()                         # drop any duplicates
print("Now how many matching sentences: " + str(len(matched_sentences)))        # check length of remaining data frame

In working with the matching sentences, it became clear there were several common errors, variations on how things were written and other annoying minor differences in the texts that made the manual checking more time-consuming than it needed to be.

Further, the minor differences meant that the counting steps later on were counting "child with ASD" separately from "child with autism" when perhaps the more interesting distinction there is whether "child with autism/ASD" is more or less common than "patient with autism/ASD" or "proband with autism/ASD" or any other common person-nouns. 

Thus, this tidy_up_terminology function corrects several importing errors, spelling and style differences, and consolidates on terminology. 

In [None]:
def tidy_up_terminology (input):
    space1 = re.sub(r'([A-Z])\.([A-Z])', r'\1. \2', input)              # adds a space after a full stop between two capitals
    space2 = re.sub(r'([a-z])(disorder|disability|spectrum)', r'\1 \2', space1) # adds a space in select run-ons
    space3 = re.sub(r'([a-z])(disorder|spectrum)', r'\1 \2', space2)    # a second go at adding a space in select run-ons      
    space4 = re.sub(r'(spec) (trum)', r'\1\2', space3)                  # removes a space between 'spec' and 'trum'
    no_apost = re.sub(r'([Aa]sperger[\S*?]s)', r'asperger', space4)     # lowercases, removes ' and S from '[Aa]sperger's' 
    lower1 = re.sub(r'Autis', r'autis', no_apost)                       # lowercases 'Autism' and 'Autistic'
    lower2 = re.sub(r'[Aa]spergers|[Aa]sperger', r'asperger', lower1)   # lowercases/removes S from '[Aa]spergers' & '[Aa]sperger'
    lower3 = re.sub(r'[Ss]pectrums|[Ss]pectra', r'spectrum', lower2)    # lowercases and removes various plurals for spectrum
    lower4 = re.sub(r'[Ss]yndromes|[Ss]yndrome', r'syndrome', lower3)   # lowercases and removes plurals for syndrome
    lower5 = re.sub(r'[Dd]isorders|Disorder', r'disorder', lower4)      # lowercases and removes plurals for disorder
    lower6 = re.sub(r'[Dd]iseases|Disease', r'disease', lower5)         # lowercases and removes plurals for disease
    plur = re.sub(r'ASDs', r'ASD', lower6)                              # removes plural from instances of more than one ASD
    stan0 = re.sub(r'(autism|autistic|asperger) syndrome', r'autism spectrum', plur ) # turns select 'syndrome' to 'spectrum'
    stan1 = re.sub(r'spectrum disease', r'spectrum disorder', stan0 )   # turns select 'disease' to 'disorder'
    stan2 = re.sub(r'(autism|autistic|asperger) spectrum disorder \(ASD\)', r'ASD', stan1) # abbreviates various ASD definitions
    stan3 = re.sub(r'(autism|autistic|asperger) spectrum disorder', r'ASD', stan2) # abbreviates various options to ASD
    stan4 = re.sub(r'(autism|autistic|asperger) spectrum \(AS\)', r'ASD', stan3)  # standardises more options to ASD
    stan5 = re.sub(r'(autism|autistic|asperger) spectrum', r'ASD', stan4)         # standardises more options to ASD
    stan6 = re.sub(r'\bAS ', r'ASD ', stan5)                            # standardises 'AS ' to 'ASD ' - note trailing space; \b skips e.g. 'GAS '
    stan7 = re.sub(r'(autism|autistic|asperger) disorder', r'ASD', stan6) # abbreviates various ASD definitions
    aut0 = re.sub(r'asperger autism', r'autism', stan7)                  # standardises 'asperger autism' to 'autism'
    ID1 = re.sub(r'[Ii]ntellectual [Dd]isability \(ID\)', r'ID', aut0)
    ID2 = re.sub(r'[Ii]ntellectual [Dd]isability', r'ID', ID1)

    return(ID2)
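
One thing to watch when standardising a short abbreviation such as "AS" is that the same letters can sit at the end of a longer uppercase token. The sketch below, on a made-up sentence, compares a pattern without a word boundary to one with `\b`. It is an illustration of the regular-expression behaviour only, not a claim about what occurs in the abstracts.

```python
import re

sentence = "BIAS was low in the AS group and the GAS scores."

# Without a boundary, 'AS ' is also found at the end of 'BIAS ' and 'GAS '
loose = re.sub(r'AS ', 'ASD ', sentence)

# '\b' requires 'AS' to stand alone as a word, so only that token changes
bounded = re.sub(r'\bAS\b', 'ASD', sentence)

print(loose)
print(bounded)
```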

In [None]:
# Optional cell to test or understand what the tidy_up_terminology function does
    
tidy_test = "Autism spectrum intellectual disability and autism ID, ASD \
            Autisticspectrum autisticspectrumdisorder ASD \
            Asperger's syndrome asperger's syndrome \
            intellectual disability Intellectual Disability (ID)\
            aspergers syndrome autism spectrum  ASDs ASD ID, and autism "

tidy_up_terminology(tidy_test)

In [None]:
tidy_text = [tidy_up_terminology(sentence) for sentence in matched_sentences['Sentence'] ] 
                                             # create sentence list with tidied, consolidated terminology 
                                             # this is to make the later checking and counting easier 
matched_sentences['Sentence'] = tidy_text    # copy the tidied sentence list back into df, overwriting the 'Sentence' column

In [None]:
backup = matched_sentences.copy()             # A backup is useful at this step because the next may not go the way you expect
                                              # .copy() keeps the backup independent of later changes to matched_sentences

In [None]:
matched_sentences = backup.copy()             # If you need the backup, re-run this step. 

## Extraction

Following the cleaning phase, we move on to the extraction phase. This has two parts, first for the person-first extraction and then for the identity-first extraction. 

The results of both extractions are saved in their own column to make it easy to read and also to allow for a single sentence-token to contain both kinds of patterns. 

### Person-first pattern

In [None]:
pattern_1 = [{"POS": "NOUN"},                                        # define the person-first pattern - start with a noun
             {'DEP':'amod', 'OP':"?"},                               # followed by an optional modifier
             {"TEXT": {"REGEX": "(with|by|from)"}},                  # followed by some words that set up the p-f pattern
             {'DEP':'amod', 'OP':"?"},                               # then space for up to three optional modifiers
             {'DEP':'amod', 'OP':"?"},
             {'DEP':'amod', 'OP':"?"},
             {"TEXT": {"REGEX": "(^[Aa]utis|^[Aa]sperger|^ASD|^AS$)"}}] # finally, the keywords (original format, just in case)

# Matcher class object 
matcher = Matcher(nlp.vocab)                                         # define a matcher class object
matcher.add("matching_1", [pattern_1])                               # add the person-first pattern to it


In [None]:
def find_pattern_match(input):                     # define a function that applies the current matcher object to a string
    thingy = nlp(input)                            # parse the string with spaCy
    match = matcher(thingy)                        # find any matches to the pattern(s)
    if match == []:
        out_value = ''                             # no match: return an empty string
    else:
        hold_multi_spans = []
        for match_id, start, end in match:
            span = thingy[start:end]               # the matched span
            hold_multi_spans.append(span)
        out_value = hold_multi_spans               # one or more matches: return them as a list
    return out_value

In [None]:
matched_sentences['Person-first'] = matched_sentences.apply(lambda row: find_pattern_match(row.Sentence), axis = 1)
                                                                        # apply the newly defined person-first matcher function
                                                                        # and store the returned output in a new column
len(matched_sentences)                                                  # double check length remains same

### Identity-first pattern

In [None]:
pattern_a = [{'DEP':'amod', 'OP':"?"},                                 # same for identity-first patterns,
             {'DEP':'amod', 'OP':"?"},                                 # starting with two optional modifiers
             {"TEXT": {"REGEX": "(^[Aa]utis|^[Aa]sperger|^ASD|^AS$)"}}, # the keywords (original format, just in case)
             {'DEP':'amod', 'OP':"?"},                                 # then up to three more optional modifiers
             {'DEP':'amod', 'OP':"?"},
             {'DEP':'amod', 'OP':"?"},
             {"POS": "NOUN"}]                                          # and then a noun

# Matcher class object                                         
matcher = Matcher(nlp.vocab) 
matcher.add("matching_2", [pattern_a])            # this overwrites the matcher object to identity-first

In [None]:
matched_sentences['Identity-first'] = matched_sentences.apply(lambda row: find_pattern_match(row.Sentence), axis = 1)
                                                                        # apply the newly overwritten matcher function
                                                                        # and store the returned output in a new column
len(matched_sentences)                                                  # check the length - why not?

## Consolidation

Following the cleaning and extraction phases, the last phase is consolidation. This phase further refines the data by removing all the rows that do not contain a match for one or both of the patterns. For example, there would be a row for "The child was tested for autism." because it contains a keyword of interest. However, this sentence would be eliminated in the consolidation phase as the keyword does not fit into either the person-first or identity-first patterns. 

Further, this phase goes on to lemmatise the extracted patterns so that they can be counted more easily. The lowercasing of "Autistic", "Autism", and "Asperger's", along with the removal of the apostrophe, the 's' and any non-whitespace character that might intrude between the 'r' and the 's' of "Asperger's", was already handled by tidy_up_terminology in the cleaning phase. This phase also removes any square brackets, quotes and extra commas introduced when the lemmatised words are turned back into a string. 

This phase ends by writing out the consolidated data frame to a .csv for manual inspection. I could not find a feasible way of automatically identifying whether the nouns matched in the extraction phase are person-nouns. As the list is not a totally unreasonable length (in the hundreds), I found it workable to 
* open the file in Excel, 
* save the file under another name (e.g. pattern_matches_reviewed), 
* order the entire data set alphabetically by 'Person-first', 
* scan through the ordered results to check whether each result in the 'Person-first' column is about a person, 
* remove entire rows if the 'Person-first' match is not about a person (checking the 'Sentence' or 'Text' column if needed),
* re-order the entire data set alphabetically by 'Identity-first', 
* scan through the ordered results to check whether each result in the 'Identity-first' column is about a person, 
* remove entire rows if the 'Identity-first' match is not about a person, 
* save the file again. 

For example, 'association with autism' matches the person-first pattern but is not about a person, so this row was removed. Many more rows were removed in the 'Identity-first' matches as things like 'autistic behaviours' and 'autism testing' were removed for not being about people. 

NOTE: There were several instances of "ASD dataset" for which it is not easy to determine whether they are about people or not. Do they mean a dataset composed from blood tests taken as part of ASD testing? If so, each row in the data set would be a blood test, with the possibility that more than one test comes from the same person. Or do they mean a pool of case records, each of which represents a single person? The former would not be "about people" but the latter would. I did not remove these rows as we cannot be certain. Leaving them out would also have been a valid option, as long as the choice was made clear. 

Coincidentally, during this manual checking part of the consolidation phase I learned that, in the context of human genetics research, "proband" is a person-noun. 

In [None]:
matched_patterns = matched_sentences[(matched_sentences['Person-first'] != '') | (matched_sentences['Identity-first'] != '')]
                                                     # keep only rows w/ non-null 'Person-first' and/or 'Identity-first' columns
len(matched_patterns)                                # check length

In [None]:
matched_patterns = matched_patterns.explode('Person-first')    # explode 'Person-first' column to create 1 row per match
                                                               # if there were two matches within the same sentence
len(matched_patterns)                                          # check the length

In [None]:
matched_patterns = matched_patterns.explode('Identity-first')  # Do the same for 'Identity-first' column
len(matched_patterns)                                          # check the length

In [None]:
matched_patterns.head(5)                                               # have a look at them

In [None]:
Lem = WordNetLemmatizer()                         # Define a short way to call the WordNetLemmatizer

def consolidate_matched_patterns (input):         # lemmatise every phrase in a column of matched patterns
    final_lemma_list = []
    temp_lemma_list = []
    for phrase in input:                           # look at each pattern in the column (person-first or identity-first)
        phrase_as_string = str(phrase)             # hold the current pattern as a string
        words_in_phrase = phrase_as_string.split() # split the current pattern into words
        for word in words_in_phrase:               # for each word in the split up words
            lemma = Lem.lemmatize(word)            # turn that word into a lemma
            temp_lemma_list.append(lemma)          # append that lemma to a temporary list
        string_lem = str(temp_lemma_list)          # turn that temporary list into a string
        stripped_lem = re.sub(r"\[|\]|\'|\,",'', string_lem)  # remove square brackets, commas and ' marks from the string
        final_lemma_list.append(stripped_lem)      # append the string version of the list to the output list
        temp_lemma_list = []                       # ensure the temp variable is empty

    return(final_lemma_list)
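
The square brackets, quotes and commas mentioned above come from turning a Python list into a string with `str()`. A short illustration with a made-up list of lemmas, alongside `' '.join(...)` as one possible alternative that produces the phrase directly. This is a sketch of the string behaviour only and does not call the lemmatiser.

```python
lemmas = ['child', 'with', 'ASD']    # assumption: an already-lemmatised phrase

as_list_string = str(lemmas)         # the list's printed form, punctuation included
as_phrase = ' '.join(lemmas)         # the words joined by single spaces

print(as_list_string)
print(as_phrase)
```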


In [None]:
person_lemma_list = consolidate_matched_patterns(matched_patterns['Person-first'])
identity_lemma_list = consolidate_matched_patterns(matched_patterns['Identity-first'])

In [None]:
matched_patterns['Person-first'] = person_lemma_list    # copy the lemmatised person-first output back, overwriting the column 
matched_patterns['Identity-first'] = identity_lemma_list  # copy the lemmatised identity-first output back, overwriting the column 
matched_patterns = matched_patterns.drop_duplicates()                         # drop any duplicates
matched_patterns                                                   # have a look at the data frame with its new columns

In [None]:
matched_patterns.to_csv('..\\output\\pattern_matches_to_review.csv')        
                                                            # Write the data frame to a .csv for manual processing in excel

At this point, I open the file in Excel (for example), remove any remaining brackets, quotation marks and commas in the lemmatised 'Person-first' and 'Identity-first' columns, then sort by one of these columns. I then scan through the results, removing any rows that are obviously not about people (e.g. "autistic testing") and checking the 'Text' column on any that are unclear (e.g. "autistic quartets"). I then sort by the other column and repeat the step of reviewing and deleting non-person rows. Finally, I save the file as "pattern_matches_reviewed.csv" for the next step. 

## Exploring statistics for PFL and IFL

In [None]:
reviewed_matches = pd.read_csv('..\\output\\pattern_matches_reviewed.csv')    # import the manually reviewed pattern matches
reviewed_matches.head(3)

In [None]:
print("There are " + 
      str(len(reviewed_matches)) + " rows in the post-manual review data frame coming from " +
      str(reviewed_matches['Title'].nunique()) +
      " unique titles.")

In [None]:
def find_PF_nouns(input):
    output = []
    for thingy in input:
        if isinstance(thingy,str):
            word_list = thingy.split()
            noun = word_list[0]
            output.append(noun)
        else:
            output.append("")
    return output

def find_IF_nouns(input):
    output = []
    for thingy in input:
        if isinstance(thingy,str):
            word_list = thingy.split()
            noun = word_list[-1]
            output.append(noun)
        else:
            output.append("")
    return output
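
The two functions above rely on word position: in a person-first match the noun is the first word, and in an identity-first match it is the last. A compact sketch of the same idea on made-up phrases, where `edge_word` is a hypothetical helper that also returns an empty string for empty or non-string cells.

```python
def edge_word(phrase, position):
    """Return the first (position=0) or last (position=-1) word of a phrase,
    or '' when the cell is empty or not a string (hypothetical helper)."""
    if isinstance(phrase, str) and phrase.split():
        return phrase.split()[position]
    return ""

pf_noun = edge_word("child with severe ASD", 0)     # person-first: noun comes first
if_noun = edge_word("autistic young adult", -1)     # identity-first: noun comes last
empty = edge_word(float("nan"), 0)                  # empty .csv cells arrive as NaN

print(pf_noun, if_noun, repr(empty))
```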

In [None]:
reviewed_matches['PF_nouns'] = find_PF_nouns(reviewed_matches['Person-first'])      # applies the new functions to find
reviewed_matches['IF_nouns'] = find_IF_nouns(reviewed_matches['Identity-first'])    # pf and if nouns and put them in columns

In [None]:
reviewed_matches.to_csv('..\\output\\nouns_to_review.csv')        
                                                            # Write the data frame to a .csv for manual processing in Excel
                                                            # This second manual processing revealed one row that was not about 
                                                            # 'people' and which should have been removed earlier, and also a row 
                                                            # in which a preceding adjective had been counted as a noun

In [None]:
reviewed_matches = pd.read_csv('..\\output\\nouns_reviewed.csv')    # import the manually reviewed nouns file
reviewed_matches.head(3)

In [None]:
pf_instances = reviewed_matches['Person-first'].tolist()                    # copies the Person-first column to a list,   
print(len(pf_instances))                                                    # check length of list before filtering
filtered_pf_instances = [instance for instance in pf_instances              # filters the list to remove non-string items
                         if isinstance(instance, str)] 
                                                                            
print("Total count of PFL instances: ",                                     # Print length of list after filtering
      len(filtered_pf_instances))                                           # This represents total count of PF instances

pf_instance_dict = dict((instance, filtered_pf_instances.count(instance))   # creates a dict from the 
                        for instance in filtered_pf_instances)              # filtered list that counts instances
                                                                 
pf_instance_dict = dict(sorted(pf_instance_dict.items(),                    # sorts that dict according to count 
                               key = lambda item: item[1], reverse=True))   # in descending order
                                                                 
print("Count of unique PFL instances: ", len(pf_instance_dict))              # This represents count of unique PF instances
pf_instance_dict                                                             # Prints the sorted, filtered dict
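
The filter, count and sort steps in the cell above can also be expressed with `Counter`, which the notebook already imports from `collections`. A minimal sketch on a made-up list (the phrases and the `NaN` entry are assumptions for illustration); `most_common()` returns the items ordered by count, highest first.

```python
from collections import Counter

instances = ["child with ASD", "patient with ASD", "child with ASD",
             float("nan"), "child with ASD"]

strings_only = [item for item in instances if isinstance(item, str)]   # drop NaN cells
counts = dict(Counter(strings_only).most_common())                      # sorted by count, descending

print(len(strings_only))   # total count of instances
print(len(counts))         # count of unique instances
print(counts)
```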

In [None]:
if_instances = reviewed_matches['Identity-first'].tolist()                  # copies the Identity-first column to a list,   
print(len(if_instances))                                                    # check length of list before filtering
filtered_if_instances = [instance for instance in if_instances              # filters the list to remove non-string items
                         if isinstance(instance, str)] 
                                                                            
print("Total count of IFL instances: ",                                     # check length of list after filtering
      len(filtered_if_instances))                                           # This represents total count of IF instances

if_instance_dict = dict((instance, filtered_if_instances.count(instance))   # creates a dict from the 
                        for instance in filtered_if_instances)              # filtered list that counts instances
                                                                 
if_instance_dict = dict(sorted(if_instance_dict.items(),                    # sorts that dict according to count 
                               key = lambda item: item[1], reverse=True))   # in descending order
                                                                 
print("Count of unique IFL instances: ", len(if_instance_dict))              # This represents count of unique IF instances
if_instance_dict                                                             # Prints the sorted, filtered dict


In [None]:
instances_output = pd.DataFrame.from_dict([pf_instance_dict,               # copy both instances dicts to dataframe
                                           if_instance_dict]).transpose()  # and transpose it to be long
                                                                           
print(list(instances_output.columns))                                      # check column names in new dataframe
instances_output = instances_output.rename(columns={0: 'PF_count',         # rename column names according to source dict
                                                    1: 'IF_count'}) 

instances_output = instances_output[(instances_output['PF_count'] > 4)     # keep only the rows with more than 4 instances for at
                            | (instances_output['IF_count'] > 4)]          # least one of the sources
instances_output.to_csv('..\\output\\instances_count.csv')                 # write popular instances data frame to .csv 
                                                                           # to make it easy to compare or use 
instances_output                                                           # print it to check

In [None]:
pf_nouns = reviewed_matches['PF_nouns'].tolist()                           # Same again, but for just the Person-first nouns
filtered_pf_nouns = [noun for noun in pf_nouns if isinstance(noun, str)] 

print("Total count of PFL nouns: ",                                        # check length of list after filtering
      len(filtered_pf_nouns))                                          # Should be the same as total PF instances

pf_noun_dict = dict((noun, filtered_pf_nouns.count(noun)) for noun in filtered_pf_nouns)
pf_noun_dict = dict(sorted(pf_noun_dict.items(), key = lambda item: item[1], reverse=True))                                                                 
print("Count of unique PFL nouns: ", len(pf_noun_dict))                    # This represents count of unique PF nouns
pf_noun_dict                                                               # Prints the sorted, filtered dict




In [None]:
if_nouns = reviewed_matches['IF_nouns'].tolist()                       # One more time, but for Identity-first nouns
filtered_if_nouns = [noun for noun in if_nouns if isinstance(noun, str)] 

print("Total count of IFL nouns: ",                                        # check length of list after filtering
      len(filtered_if_nouns))                                              # Should be the same as total IF instances


if_noun_dict = dict((noun, filtered_if_nouns.count(noun)) for noun in filtered_if_nouns)
if_noun_dict = dict(sorted(if_noun_dict.items(), key = lambda item: item[1], reverse=True))

print("Count of unique IFL nouns: ", len(if_noun_dict))                    # This represents count of unique IF nouns
if_noun_dict

In [None]:
nouns_output = pd.DataFrame.from_dict([pf_noun_dict, if_noun_dict]).transpose() # copy & transpose both noun dicts to dataframe 
print(list(nouns_output.columns))                                               # check column names in new dataframe
nouns_output = nouns_output.rename(columns={0: 'PF_count', 1: 'IF_count'})      # rename column names according to source dict
nouns_output = nouns_output[(nouns_output['PF_count'] > 0)                      # keep the rows with at least one instance for at
                            | (nouns_output['IF_count'] > 0)]                   # least one of the sources
nouns_output.to_csv('..\\output\\nouns_count.csv')                              # write popular noun data frame to .csv 
                                                                                # to make it easy to compare or use 
nouns_output                                                                    # print it to check

In [None]:
# back in the big data frame, we want to track how many instances of each occur by year
PFL_by_year = reviewed_matches.groupby(['Year'])['Person-first'].count()      # group by year and Person-first, then count    
IFL_by_year = reviewed_matches.groupby(['Year'])['Identity-first'].count()    # group by year and Identity-first, then count    
person_identity_count = pd.concat([PFL_by_year,IFL_by_year],axis=1)           # concatenate the groups into new data frame
person_identity_count = person_identity_count.rename(                         # Rename the columns in new data frame
    columns={"Person-first": "PFL", "Identity-first": "IFL"})    

person_identity_count = person_identity_count.rename_axis('Year').reset_index()         # rename axis, reset index
person_identity_count = person_identity_count.sort_values(by=['Year']).astype('Int64')  # retype and sort by value of year

print(person_identity_count)

In [None]:
person_identity_count['Year'] = pd.to_datetime(person_identity_count['Year'].astype(str), format="%Y") 
                                                                     # convert Year to a datetime, parsed from its 4-digit string form 
person_identity_count = person_identity_count.set_index('Year')      # set newly reformatted Year column to be the index

print(person_identity_count)                                         # Print it to make sure nothing has gone terribly wrong


In [None]:
person_identity_count.plot()                        # Plot the counts of PFL and IFL by year
plt.ylabel("Count of total pattern matches")        # Set y label
plt.xlabel("Year")                                  # Set x label
plt.title("Instances of PFL and IFL by year")       # Set plot title
plt.legend(loc="upper left", frameon=False)         # Set position for legend and set legend frame to be false
plt.savefig('..\\output\\matches_count.jpg')        # save the plot via command; do this before plt.show(),
                                                    # as saving afterwards can produce an empty image
plt.show()                                          # Show that plot! (we can also right click on it to save it)

## Chart person-first or identity-first by year

## Count abstracts by the structures they use

In [None]:
# back in the big data frame, we want to track how many instances of each occur by abstract title 
# to see if authors mix their use or if they use both more or less evenly

person_by_title = reviewed_matches.groupby(['Title'])['Person-first'].count()     # group and count by Title & Person-first
identity_by_title = reviewed_matches.groupby(['Title'])['Identity-first'].count() # group and count by Title & Identity-first
title = pd.concat([person_by_title,identity_by_title],axis=1)                     # concatenate the groups into new data frame
print(list(title.columns))                                                        # check column names in new dataframe
title = title.rename(columns={'Person-first': 'PF_count', 
                              'Identity-first': 'IF_count'})                 # rename column names according to source dict

#title = title[(title['PF_count'] > 0)                                  # OPTIONAL - uncomment these 2 lines to 
#                            & (title['IF_count'] > 0)]                 # select only the abstracts that contain both patterns

print(len(title))                                                            # Print how many titles remain (those with both patterns if the optional filter is used)

In [None]:
title.sort_values(by=['PF_count'], ascending=False)        # sort by PFL 

In [None]:
title.sort_values(by=['IF_count'], ascending=False)        # Sort by IFL


In [None]:
co_occurence_counts = title.groupby(by=['PF_count', 'IF_count']).size().to_frame('size').reset_index()
co_occurence_counts
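
What the `groupby(...).size()` call above produces can be seen on a tiny made-up example: for each combination of person-first and identity-first counts, it reports how many abstracts share that combination. The sketch below does the same with a `Counter`; the titles and counts are invented for illustration.

```python
from collections import Counter

# Assumption: made-up (PF_count, IF_count) pairs, one per abstract title
per_title = {
    "Title A": (2, 0),
    "Title B": (1, 1),
    "Title C": (2, 0),
    "Title D": (0, 3),
}

# Count how many titles share each (PF_count, IF_count) combination
co_occurrence = Counter(per_title.values())

for (pf_count, if_count), size in sorted(co_occurrence.items()):
    print(pf_count, if_count, size)
```

Each printed row corresponds to one point in the scatter plot that follows, with `size` driving the colour.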

In [None]:
output = plt.scatter(x=co_occurence_counts['PF_count'], 
                     y=co_occurence_counts['IF_count'], 
                     c=co_occurence_counts['size'],
                     s=200)
plt.ylabel("Instance of IFL")
plt.xlabel("Instance of PFL")
plt.title("Abstracts using both PFL and IFL")
plt.ylim([-1, 10])
plt.xlim([-1, 10])

handles, labels = output.legend_elements(prop="colors", alpha=1)
plt.legend(handles, labels, loc="upper right", title="Abstract count at this point", frameon=False)
plt.show()

# right click on the plot to save it