# Pattern matching


We will start in the same way as the last notebook started  - by downloading/importing the packages needed and importing the .csv file(s) needed. In this case, we only need the .csv file that has the matched abstracts as we are specifically looking at person-first and identity-first patterns that are "about" autism (or ASD, Asperger's syndrome, etc.). 

We could use the same basic approach to look at person-first and identity-first language for other conditions for which there are good noun and adjective forms of the words (diabetes? obesity? cancer? something else?). Doing that would mean using the .csv file with all of the abstracts or potentially creating and entirely new file of abstracts matched to another condition of interest. However, that lies outside the scope of this research, so I will not address it further here. 

## Get ready 

As always, we start with code that:
* loads up and nicknames some useful packages, 
* checks file locations,
* imports files, and 
* checks them. 


In [1]:
%%capture

!pip install nltk
!pip install spacy -q
!python -m spacy download en_core_web_lg -q

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize  
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
nltk.download('averaged_perceptron_tagger')        # Like a POS-tagger...
nltk.download('wordnet')
nltk.download('webtext')
from nltk.corpus import webtext

import pandas as pd
pd.set_option('display.max_colwidth', 200)
import numpy as np
import statistics

import csv                        # csv is for importing and working with csv files

from collections import Counter

import statistics
import re                         # things we need for RegEx corrections
import string 
import spacy 
from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 
nlp = spacy.load('en_core_web_lg')
nlp.max_length = 1500000 #or any large value, as long as you don't run out of RAM

import math
import matplotlib.pyplot as plt
print(os.listdir("..\\output")  )    

# the '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking). 
# You can remove or comment it out if you prefer to see the output.

## Import

Having checked the contents of the output folder and seen the files we expected to see, we can now import the specific file of interest for this step of the analysis.

In [108]:
matched_texts = pd.read_csv('..\\output\\matched_abstracts_no_null_texts.csv')    # one for just those that match the keyword
len(matched_texts)                                                                # check the length 

3794

## Cleaning phase

Cleaning begins by turning any instances of extra whitespaces (two or more in a row) into a single whitespace. Then, identifying any run-on sentences (where a lowercase letter, a full stop, and an uppercase letter are clustered without a whitespace) and inserting a whitespace between the full stop and the uppercase letter. Both of these steps will improve the sentence tokenisation that happens next. 

Then, we proceed to sentence tokenising the text. Like word tokens, sentence tokens become the unit for analysisis. As a trivial example, sentence tokenisation would turn a short text such as 


''' The cat named Cat is one of five cats. Honestly, I wonder why I have so many cats.
''' 

into a list of sentence tokens like

''' [[The cat named Cat is one of five cats.]

[Honestly, I wonder why I have so many cats.]]

''' 

An important difference is that the punctuation within the sentences that contributes to its structured and meaning (e.g. the comma and the full stops) are retained. This punctuation, like the capitalisation at the start of the sentences or for the poper nouns, is also retained as it helps the sentence-tokenisation process identify the words within the sentence correctly for their parts of speech (e.g. which of the words are nouns, verbs, etc. ). 



The sentence tokens are then put on individual rows, filtered to retain only those that contain one or more of the keywords of interest, and then filtered to ensure that there are no empty rows or duplicates. 

In [109]:
def remove_errors (input):
    no_extra_spaces = re.sub(r'(\s)(\s+)', r'\1', input)               # turn 2+ sequential whitespaces into 1
    no_run_ons1 = re.sub(r'([a-z].)([A-Z])', r'\1 \2', no_extra_spaces) # identifies run-ons (e.g. "word.New sentence ")
    no_run_ons2 = re.sub(r'([A-Z].)([A-Z])', r'\1. \2', no_run_ons1) # identifies run-ons (e.g. "word.New sentence ")

    return(no_run_ons2)

In [110]:
no_run_ons = [remove_errors(abstract) for abstract in matched_texts['Text'] ] 
                                             # create abstract list without extra spaces/run-ons 
                                             # this is to improve sentence tokenisation later 
matched_texts['Sentence'] = no_run_ons       # copy the no extra space/run-on abstract list back into df as a new column

In [111]:
sentences  = [sent_tokenize(abstract) for abstract in matched_texts['Sentence'] ] # create tokenised list of cleaner abstracts
matched_texts['Sentence'] = sentences                                   # copy that list back into df as a new column
sentence_per_row = matched_texts.explode('Sentence')                    # explode column in new df with 1 row/sentence token
print("How many sentences in total: " + str(len(sentence_per_row)))     # check the length of new df


How many sentences in total: 122020


In [112]:
print(sentence_per_row[['Text','Sentence']])                            # have a look. The selected rows should have 
                                                                        # 'Text' the same, but 'Sentence' different 

                                                                                                                                                                                                         Text  \
0     Despite the availability of most of the human genome sequence  the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on ...   
0     Despite the availability of most of the human genome sequence  the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on ...   
0     Despite the availability of most of the human genome sequence  the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on ...   
0     Despite the availability of most of the human genome sequence  the accu rate identification of genes on the DNA sequence remains to be continu ously improved 

In [113]:
matched_sentences = sentence_per_row[sentence_per_row['Sentence'].str.contains('[Aa]utis|ASD|AS|[Aa]sperger')]
                                                     # create a new data frame with only the sentences that contain keywords
print("How many matching sentences: " + str(len(matched_sentences)))            # check the length

How many matching sentences: 7242


In [114]:
matched_sentences = matched_sentences[~matched_sentences['Sentence'].isnull()]  # remove any rows with empty 'Sentence' column
matched_sentences = matched_sentences.drop_duplicates()                         # drop any duplicates
print("Now how many matching sentences: " + str(len(matched_sentences)))        # check length of remaining data frame

Now how many matching sentences: 6941


In working with the matching sentences, it became clear there were several common errors, variations on how things were written and other annoying minor differences in the texts that made the manual checking more time-consuming than it needed to be.

Further, the minor differences meant that the counting steps later on were counting "child with ASD" separately from "child with autism" when perhaps the more interesting distinction there is whether "child with autism/ASD" is more or less common than "patient with autism/ASD" or "proband with autism/ASD" or any other common person-nouns. 

Thus, this tidy_up_terminology function corrects several importing errors, spelling and style differences, and consolidates on terminology. 

In [119]:
def tidy_up_terminology (input):
    space1 = re.sub(r'([A-Z]).(A-Z)', r'\1. \2', input)                 # removes multiple white spaces between words
    space2 = re.sub(r'([a-z])(disorder|disability|spectrum)', r'\1 \2', space1) # adds a space in select run-ons
    space3 = re.sub(r'([a-z])(disorder|spectrum)', r'\1 \2', space2)    # a second go at adding a space in select run-ons      
    space4 = re.sub(r'(spec) (trum)', r'\1\2', space3)                  # removes a space between 'spec' and 'trum'
    no_apost = re.sub(r'([Aa]sperger[\S*?]s)', r'asperger', space4)     # lowercases, removes ' and S from '[Aa]sperger's' 
    lower1 = re.sub(r'Autis', r'autis', no_apost)                       # lowercases 'Autism' and 'Autistic'
    lower2 = re.sub(r'[Aa]spergers|[Aa]sperger', r'asperger', lower1)   # lowercases/removes S from '[Aa]spergers' & '[Aa]sperger'
    lower3 = re.sub(r'[Ss]pectrums|[Ss]pectra', r'spectrum', lower2)    # lowercases and removes various plurals for spectrum
    lower4 = re.sub(r'[Ss]yndromes|[Ss]yndrome', r'syndrome', lower3)   # lowercases and removes plurals for syndrome
    lower5 = re.sub(r'[Dd]isorders|Disorder', r'disorder', lower4)      # lowercases and removes plurals for disorder
    lower6 = re.sub(r'[Dd]iseases|Disease', r'disease', lower5)         # lowercases and removes plurals for disease
    plur = re.sub(r'ASDs', r'ASD', lower6)                              # removes plural from instances of more than one ASD
    stan0 = re.sub(r'(autism|autistic|asperger) syndrome', r'autism spectrum', plur ) # turns select 'syndrome' to 'spectrum'
    stan1 = re.sub(r'spectrum disease', r'spectrum disorder', stan0 )   # turns select 'disease' to 'disorder'
    stan2 = re.sub(r'(autism|autistic|asperger) spectrum disorder \(ASD\)', r'ASD', stan1) # abbreviates various ASD definitions
    stan3 = re.sub(r'(autism|autistic|asperger) spectrum disorder', r'ASD', stan2) # abbreviates various options to ASD
    stan4 = re.sub(r'(autism|autistic|asperger) spectrum \(AS\)', r'ASD', stan3)  # standardises more options to ASD
    stan5 = re.sub(r'(autism|autistic|asperger) spectrum', r'ASD', stan4)         # standardises more options to ASD
    stan6 = re.sub(r'AS ', r'ASD ', stan5)                              # standardises 'AS ' to 'ASD ' - note trailing space
    stan7 = re.sub(r'(autism|autistic|asperger) disorder)', r'ASD', stan6) # abbreviates various ASD definitions
    aut0 = re.sub(r'asperger autism', r'autism', stan7)                  # standardises 'asperger autism' to 'autism'
    ID1 = re.sub(r'[Ii]ntellectual [Dd]isability \(ID\)', r'ID', aut0)
    ID2 = re.sub(r'[Ii]ntellectual [Dd]isability', r'ID', ID1)

    return(ID2)

In [8]:
### old version  - keep until final tidy? 

def tidy_up_terminology (input):                         
    space1 = re.sub(r'([A-Z]).(A-Z)', r'\1. \2', input)                         # removes multiple white spaces between words
    space2 = re.sub(r'([a-z])(disorder|disability|spectrum)', r'\1 \2', space1) # adds a spac 
    space3 = re.sub(r'([a-z])(disorder|spectrum)', r'\1 \2', space2)         
    space4 = re.sub(r'(spec) (trum)', r'\1\2', space3)               
    no_apost = re.sub(r'([Aa]sperger[\S*?]s)', r'asperger', space4)  
    lower1 = re.sub(r'Autis', r'autis', no_apost)
    lower2 = re.sub(r'[Aa]spergers|[Aa]sperger', r'asperger', lower1)
    lower3 = re.sub(r'[Ss]pectrums|[Ss]pectra', r'spectrum', lower2)
    lower4 = re.sub(r'[Ss]yndromes|[Ss]yndrome', r'syndrome', lower3)
    lower5 = re.sub(r'[Dd]isorders|Disorder', r'disorder', lower4)
    lower6 = re.sub(r'[Dd]iseases|Disease', r'disease', lower5)
    plur1 = re.sub(r'ASDs', r'autism', lower6)    
    AS0 = re.sub(r'(autism|asperger) (syndrome|spectrum) \(AS\)', r'autism', plur1)
    AS1 = re.sub(r'(autism|asperger) (syndrome|spectrum)', r'autism', AS0)
    AS2 = re.sub(r'asperger autism', r'autism', AS1)
    AS3 = re.sub(r'AS ', r'autism ', AS2)
    ASD0 = re.sub(r'autism spectrum disorder \(ASD\)', r'autism', AS3)   
    ASD1 = re.sub(r'autism spectrum disorder', r'autism', ASD0)   
    ASD2 = re.sub(r'(autistic|autism|asperger) disorder', r'autism', ASD1)
    ASD3 = re.sub(r'(autistic|autism|asperger) spectrum disorder', r'autism', ASD2)
    ASD4 = re.sub(r'(autistic|autism|asperger) spectrum', r'autism', ASD3)
    ASD5 = re.sub(r'(autism|autistic) disease', r'autism', ASD4)
    ASD6 = re.sub(r'ASD', r'autism', ASD5)
    ID1 = re.sub(r'[Ii]ntellectual [Dd]isability \(ID\)', r'ID', ASD6)
    ID2 = re.sub(r'[Ii]ntellectual [Dd]isability', r'ID', ID1)
    ID3 = re.sub(r'(ID and )(autism|ASD)', r'autism', ID2)
    ID4 = re.sub(r'(ID, )(autism|ASD)', r'autism', ID3)

    return(ID4)

In [120]:
 # Optional cell code block to test or understand what the tidy_up_terminology function does
    
tidy_test = "Autism spectrum intellectual disability and autism ID, ASD \
            Autisticspectrum autisticspectrumdisorder ASD \
            Asperger's syndrome asperger's syndrome \
            intellectual disability Intellectual Disability (ID)\
            aspergers syndrome autism spectrum  ASDs ASD ID, and autism "

tidy_up_terminology(tidy_test)

'ASD ID and autism ID, ASD            ASD ASD ASD            ASD ASD            ID ID           ASD ASD  ASD ASD ID, and autism '

In [121]:
tidy_text = [tidy_up_terminology(sentence) for sentence in matched_sentences['Sentence'] ] 
                                             # create abstract list without extra spaces/run-ons 
                                             # this is to improve sentence tokenisation later 
matched_sentences['Sentence'] = tidy_text    # copy the no extra space/run-on abstract list back into df as a new column

In [122]:
backup = matched_sentences                    # A backup is useful at this step because the next may not go the way you expect

In [123]:
matched_sentences = backup                    # If you need the backup, re-run this step. 

## Extraction

Following the cleaning phase, we move on to the extraction phase. This has two parts, first for the person-first extraction and then for the identity-first extraction. 

The results of both extractions are saved in their own column to make it easy to read and also to allow for a single sentence-token to contain both kinds of patterns. 

### Person-first pattern

In [124]:
pattern_1 = [{"POS": "NOUN"},                                        # define the person-first pattern - start with a noun
             {'DEP':'amod', 'OP':"?"},                               # followed by an optional modifier
             {"TEXT": {"REGEX": "(with|by|from)"}},                  # followed by some words that set up the p-f pattern
             {'DEP':'amod', 'OP':"?"},                               # then space for up to three optional modifiers
             {'DEP':'amod', 'OP':"?"},
             {'DEP':'amod', 'OP':"?"},
             {"TEXT": {"REGEX": "(^[Aa]utis|^[Aa]sperger|^ASD|^AS$)"}}] # finally, the keywords (original format, just in case)

# Matcher class object 
matcher = Matcher(nlp.vocab)                                         # define a matcher class object
matcher.add("matching_1", [pattern_1])                               # add my three person-first patterns to it


In [125]:
def find_pattern_match(input):                                               # define a function that applies the person-first
    thingy = nlp(input)                                                      # matcher class object to strings
    match = matcher(thingy)                                                  # and returns any matches to the pattern(s)
    if match == []:
        out_value = ''
    else:
        hold_multi_spans = []
        for match_id, start, end in match:
                string_id = nlp.vocab.strings[match_id]  # Get string representation
                span = thingy[start:end]  # The matched span
                hold_multi_spans.append(span)
        out_value = hold_multi_spans
    return out_value

In [127]:
matched_sentences['Person-first'] = matched_sentences.apply(lambda row: find_pattern_match(row.Sentence), axis = 1)
                                                                        # apply the newly defined person-first matcher function
                                                                        # and store the returned output in a new column
len(matched_sentences)                                                  # double check length remains same

6941

### Identity-first pattern

In [128]:
pattern_a = [{'DEP':'amod', 'OP':"?"},                                 # same for identity-first patterns,
             {'DEP':'amod', 'OP':"?"},                                 # starting with two optional modifiers
             {"TEXT": {"REGEX": "(^[Aa]utis|^[Aa]sperger|^ASD|^AS$)"}}, # the keywords (original format, just in case)
             {'DEP':'amod', 'OP':"?"},                                 # then upt to three more optional modifiers
             {'DEP':'amod', 'OP':"?"},
             {'DEP':'amod', 'OP':"?"},
             {"POS": "NOUN"}]                                          # and then a noun

# Matcher class object                                         
matcher = Matcher(nlp.vocab) 
matcher.add("matching_2", [pattern_a])            # this overwrites the matcher object to identity-first

In [129]:
matched_sentences['Identity-first'] = matched_sentences.apply(lambda row: find_pattern_match(row.Sentence), axis = 1)
                                                                        # apply the newly overwritten matcher function
                                                                        # and store the returned output in a new column
len(matched_sentences)                                                  # check the length - why not?

6941

### Consolidation

Following the cleaning and extraction phases, the last phase is consolidation. This phase further refines the data by removing all the rows that do not contain a match for one or both of the patterns. For example, there would be a row for "The child was tested for autism." because it contains a keyword of interest. However, this sentence would be eliminated in the consolidation phase as the keyword does not fit into either the person-first or identity-first patterns. 

Further, this phase goes on to lemmatise the extracted patterns so that they can be counted more easily. This phase also lowercases all occurrences of "Autistic", "Autism", and "Asperger's" as well as removing the apostrophe, the 's' and any non-white characters that might intrude between the 'r' and the 's' of "Asperger's". This phase also removes any square brackets, quotes and extra commas introduced by the lemmatisation process. 

This phase ends by writing out the consolidated data frame to a .csv for manual inspection. I could not find a feasible way of identifying whether or not the nouns matched in the extraction phase are person-nouns or not. As the list is not a totally unreasonable length (in the hundreds) I found it workable to 
* open in excel, 
* save the file under another name (e.g. pattern_matches_reviewed), 
* order the entire data set alphabetically by 'Person-first', 
* scan through the ordered results check whether each result in the 'Person-first' column is about a person, 
* removing entire rows if the 'Person-first' match is not about a person (checking the 'Sentence' or 'Text' column if needed)
* re-order the entire data set alphabetically by 'Identity-first', 
* scan through the ordered results check whether each result in the 'Identity-first' column is about a person, 
* removing entire rows if the 'Identity-first' match is not about a person, 
* save file again. 

For example, 'association with autism' matches the person-first pattern but is not about a person, so this row was removed. Many more rows were removed in the 'Identity-first' matches as things like 'autistic behaviours' and 'autism testing' were removed for not being about people. 

NOTE: There were several instances of "ASD dataset" which are not easy to determine if they are about people or not. Do they mean dataset composed from blood tests taken as part of ASD testing? If so, each row in the data set would be a blood test with the possibility that more than one test comes from the same person. Or do they mean a pool of case records, each of which represents a single person? The former would not be "about people" but the second would. I did not remove these rows as we cannot be certain. Leaving them out would also have been a valid option, as long as the choice was clear. 

Coincidentally, during this manual checking part of the consolidation phase I learned that, in the context of human genetics research "proband" is a person-noun. 

In [130]:
matched_patterns = matched_sentences[(matched_sentences['Person-first'] != '') | (matched_sentences['Identity-first'] != '')]
                                                     # keep only rows w/ non-null 'Person-first' and/or 'Identity-first' columns
len(matched_patterns)                                # check length

1409

In [131]:
matched_patterns = matched_patterns.explode('Person-first')    # explode 'Person-first' column to create 1 row per match
                                                               # if there were two matches within the same sentence
len(matched_patterns)                                          # check the length

1410

In [132]:
matched_patterns = matched_patterns.explode('Identity-first')  # Do the same for 'Identity-first' column
len(matched_patterns)                                          # check the length

1473

In [133]:
matched_patterns                                               # have a look at them

Unnamed: 0.1,Unnamed: 0,Title,Session_Code,Authors_and_Affiliations,Text,Year,Email,Author,Affiliations,Sentence,Person-first,Identity-first
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,ASD was first described in 1944 by a Viennese physi cian Hans asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with some modifications.,"(boys, with, autistic)","(autistic, psychopa)"
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,We report the analysis of13 candidate gene loci associated with autism and schizophrenia in 17Finnish ASD families with autosomal dominant mode of inheritance.,,"(ASD, families)"
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,Linkageto the previously reported predisposing loci for autism could not be repli cated with Finnish ASD families.,,"(ASD, families)"
21,400,Prader Willi and Angelman Syndromes in Chilean Patients. Clinical and Molecular Diagnosis.,P0295.,Curotto L. Santa Mar a A. Alliende F. Cort s INTA University of Chile SANTIAGO C,Prader Willi (PWS) and Angelman AS) syndromes are multigenic disorderscharacterized by developmental and neurobehavioral abnormalities. Dif ferent underlying genetic defects cause loss of expressi...,2001.0,malliend@uec.inta.uchile.cl,,,Sand ASD patients have a deletion in 15q11 q13 whereas uniparental disomy( UP.,,"(ASD, patients)"
21,400,Prader Willi and Angelman Syndromes in Chilean Patients. Clinical and Molecular Diagnosis.,P0295.,Curotto L. Santa Mar a A. Alliende F. Cort s INTA University of Chile SANTIAGO C,Prader Willi (PWS) and Angelman AS) syndromes are multigenic disorderscharacterized by developmental and neurobehavioral abnormalities. Dif ferent underlying genetic defects cause loss of expressi...,2001.0,malliend@uec.inta.uchile.cl,,,ASD patients were evaluated with the consensus cri teria.,,"(ASD, patients)"
...,...,...,...,...,...,...,...,...,...,...,...,...
3748,675,,,,P0280Two cases of Cri Du Chat syndrome in the same family Spatial association of oppositely imprinted regions in late S-phase without a familial translocation or inversion. but not at other stag...,2004.0,,,,"S/ASD locus but no ribosomal genes, the AR.",,"(ASD, locus)"
3765,1100,,,,"P0473DNA-polymorphisms of NAT2 and MnSOD and and carcinomas through a signiÜcant increase of somatic G > T predisposition to breast cancer. transversion in the APC and KRAS genes. In particular,...",2004.0,,,,ASD genes.,,"(ASD, genes)"
3776,1710,,,,P0697How to prepare stable reference materials for genetic establish that treatment of cultured human dermal Übroblasts with testing several recombinant Übrillin-1 fragments induces up-regulatio...,2004.0,,,,N-ASD neuroblastoma cell lines A number of genetic CR.,,"(ASD, neuroblastoma)"
3783,1932,,,,"P0760Cellular mislocalization of mutant ALADIN causes triple neuromuscular disorder characterised by distal muscular weakness A syndrome and atrophies, gait abnormalities and sensory deÜcit resu...",2004.0,,,,ASD gene encoding a novel WD-repeat protein GJ.,,"(ASD, gene)"


In [134]:
Lem = WordNetLemmatizer()                         # Define a short way to call the WordNetLemmatizer

def consolidate_matched_patterns (input):         # 
    final_lemma_list = []
    temp_lemma_list = []
    for phrase in input:                       # start for loop looking at each pattern in the person-first pattern column
        phrase_as_string = str(phrase)                               # hold the current pattern
        words_in_phrase = phrase_as_string.split() # split the current pattern into words
        for word in words_in_phrase :                            # for each word in the split up words
            lemma = Lem.lemmatize(word)             # turn that word into a lemma
            temp_lemma_list.append(lemma)                # append that lemma to a temporary list
        string_lem = str(temp_lemma_list)              # turn that temporary list into a string
        stripped_lem = re.sub(r"\[|\]|\'|\,",'', string_lem)  # remove  square brackets, commas and '' marks from the string
        final_lemma_list.append(stripped_lem)        # append the string version of the list to the output list
        temp_lemma_list = []                               # ensure the temp variable is empty

    return(final_lemma_list)


In [135]:
person_lemma_list = consolidate_matched_patterns(matched_patterns['Person-first'])
identity_lemma_list = consolidate_matched_patterns(matched_patterns['Identity-first'])

In [136]:
matched_patterns['Person-first'] = person_lemma_list    # copy the person-first output to new column in data frame 
matched_patterns['Identity-first'] = identity_lemma_list  # copy the identity-first output to new column in data frame 
matched_patterns = matched_patterns.drop_duplicates()                         # drop any duplicates
matched_patterns                                                   # have a look at the data frame with its new columns

Unnamed: 0.1,Unnamed: 0,Title,Session_Code,Authors_and_Affiliations,Text,Year,Email,Author,Affiliations,Sentence,Person-first,Identity-first
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,ASD was first described in 1944 by a Viennese physi cian Hans asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with some modifications.,boy with autistic,autistic psychopa
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,We report the analysis of13 candidate gene loci associated with autism and schizophrenia in 17Finnish ASD families with autosomal dominant mode of inheritance.,,ASD family
8,150,Location of the first predisposing gene locus for Asperger syndrome on chromosome 1q21 22,C102.,E. Jarvela1 T. Ylisaukko oja2 T. Nieminen3 E. Kempas1 M. Auranen1 L. Peltonen1 1National Public Health Institute Helsinki Finland 2National Public Health Insitute Helsinki Finland 3Uni...,Asperger syndrome (AS) was first described in 1944 by a Viennese physi cian Hans Asperger who reported a group of boys with autistic psychopa thy whose clinical features resembled autism with som...,2001.0,irma.jarvela@hus.fi,,,Linkageto the previously reported predisposing loci for autism could not be repli cated with Finnish ASD families.,,ASD family
21,400,Prader Willi and Angelman Syndromes in Chilean Patients. Clinical and Molecular Diagnosis.,P0295.,Curotto L. Santa Mar a A. Alliende F. Cort s INTA University of Chile SANTIAGO C,Prader Willi (PWS) and Angelman AS) syndromes are multigenic disorderscharacterized by developmental and neurobehavioral abnormalities. Dif ferent underlying genetic defects cause loss of expressi...,2001.0,malliend@uec.inta.uchile.cl,,,Sand ASD patients have a deletion in 15q11 q13 whereas uniparental disomy( UP.,,ASD patient
21,400,Prader Willi and Angelman Syndromes in Chilean Patients. Clinical and Molecular Diagnosis.,P0295.,Curotto L. Santa Mar a A. Alliende F. Cort s INTA University of Chile SANTIAGO C,Prader Willi (PWS) and Angelman AS) syndromes are multigenic disorderscharacterized by developmental and neurobehavioral abnormalities. Dif ferent underlying genetic defects cause loss of expressi...,2001.0,malliend@uec.inta.uchile.cl,,,ASD patients were evaluated with the consensus cri teria.,,ASD patient
...,...,...,...,...,...,...,...,...,...,...,...,...
3748,675,,,,P0280Two cases of Cri Du Chat syndrome in the same family Spatial association of oppositely imprinted regions in late S-phase without a familial translocation or inversion. but not at other stag...,2004.0,,,,"S/ASD locus but no ribosomal genes, the AR.",,ASD locus
3765,1100,,,,"P0473DNA-polymorphisms of NAT2 and MnSOD and and carcinomas through a signiÜcant increase of somatic G > T predisposition to breast cancer. transversion in the APC and KRAS genes. In particular,...",2004.0,,,,ASD genes.,,ASD gene
3776,1710,,,,P0697How to prepare stable reference materials for genetic establish that treatment of cultured human dermal Übroblasts with testing several recombinant Übrillin-1 fragments induces up-regulatio...,2004.0,,,,N-ASD neuroblastoma cell lines A number of genetic CR.,,ASD neuroblastoma
3783,1932,,,,"P0760Cellular mislocalization of mutant ALADIN causes triple neuromuscular disorder characterised by distal muscular weakness A syndrome and atrophies, gait abnormalities and sensory deÜcit resu...",2004.0,,,,ASD gene encoding a novel WD-repeat protein GJ.,,ASD gene


In [137]:
matched_patterns.to_csv('..\\output\\pattern_matches_to_review.csv')        
                                                            # Write the data frame to a .csv for manual processing in excel

At this point, I open the file in Excel (for example), removed the brackets, quotation marks and commas in the Person-first lemmatised and Identity-first lemmatised columns, then sort by each of one of these columns. I then scan through the results, removing any rows that are obviously not about people (e.g. "autistic testing") and checking the 'Text' column on any that are unclear 'autistic quartets'). I then sort by the other column and repeat the step of reviewing and deleting non-person rows. Save under "pattern_matches_reviewed.csv" for the next step. 

## Chart person-first or identity-first by year

In [32]:
reviewed_matches = pd.read_csv('..\\output\\pattern_matches_reviewed.csv')    # one for just those that match the keyword
reviewed_matches.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Title,Session_Code,Authors_and_Affiliations,Text,Year,Email,Author,Affiliations,Sentence,Person-first,Identity-first
0,3791,2157,"Presence of elevated lactate, lactate/pyruvate ratio and Analysis of the patientsÓclinical and molecular data demonstrated that acylcarnitine proÜle in patients with autism. all Üve patients wit...",P0835,"S. E. Carlo1, N. Arciniegas1, J. Acevedo2, N. Ramirez3, A. Reis1, A. S. Cornier1","No association between the onset of pulmonary symptoms and genotype was observed. Finally, the presence of 1Ponce School of Medicine, Ponce, PR, United States, 2University of Puerto W32X, the ...",2004,,"S. E. Carlo1, N. Arciniegas1, J. Acevedo2, N. Ramirez3, A. Reis1, A. S. Cornier1",,"P0835Presence of elevated lactate, lactate/pyruvate ratio and Analysis of the patientsÓclinical and molecular data demonstrated that acylcarnitine proÜle in patients with autism.",patient with autism,
1,3738,409,"S424 , at Xq23-24 in one family with cryptogenic epilepsy",,,. affected male members presented with non-speciÜc MR and Objective: practically healthy boy at 10 years stopped in psychomotor verbal disability. We report three generations of single family wi...,2004,,,,"T reduce in size, cortex is 1p and monosomy 12q was found in autistic brother.",,autistic brother
2,3733,376,"The Results of Cytogenetic Analysis in 150 Autistic Patients BOF syndrome consists of branchial defect, dysmorphic face",P0100,"M. Havlovicov\x861, D. Novotn\x861, E. Koc\x86rek1, M. Hrdlicka2, Z. Sedl\x86cek1 1Institute of Biology and Medical Genetics, 2nd Medical Faculty, University Hospital Motol, Prague, Czech Republ...","M. Havlovicov\x861, D. Novotn\x861, E. Koc\x86rek1, M. Hrdlicka2, Z. Sedl\x86cek1 (dolichocephaly, sparse hair, high forehead, malar hypoplasia, small 1Institute of Biology and Medical Genetics,...",2004,,"M. Havlovicov\x861, D. Novotn\x861, E. Koc\x86rek1, M. Hrdlicka2, Z. Sedl\x86cek1","1Institute of Biology and Medical Genetics, 2nd Medical Faculty, University Hospital Motol, Prague, Czech Republic, 2Department of Child PsychiatryULB, Brussels, Belgium, 9Hopital Erasme, Bruss...","Patient 1 presented with pseudocleft, right branchial mass and We performed a detailed genetic analysis of 150 autistic patients an unilateral pyelo-ureteral junction stenosis.",,autistic patient
3,3718,257,Synostosis of metacarpals IV and V: a diagnostic be affected by segmental UPD. Further analyses on SRS candidate challenge genes will be focused on this area.,P0021,"E. Prott1, G. Gillessen-Kaesbach1, F. Majewski2, P. Meinecke3 1Institut f\x9er Humangenetik, Essen, Germany, 2Institut f\x9er Humangenetik, D\x9es- seldorf, Germany, 3Medizinische Genetik, Hambur...","Metacarpal synostosis (MS) IV-V may represent one anomaly among Clinical genetics 86 others in several malformation syndromes, e.g. Apert syndrome, TAR methylation analysis to uncover the change...",2004,,"E. Prott1, G. Gillessen-Kaesbach1, F. Majewski2, P. Meinecke3 1Institut f\x9er Humangenetik, Essen, Germany, 2Institut f\x9er Humangenetik, D\x9es- seldorf, Germany, 3Medizinische Genetik, Hambur...","1Institut f\x9er Humangenetik, Essen, Germany, 2Institut f\x9er Humangenetik, D\x9es- seldorf, Germany, 3Medizinische Genetik, Hamburg, Germany.",Out of 10 autism suspected patients 1 showed inheritance of index cases show the characteristic MS .,,autism suspected patient
4,3711,187,"Loss of desmoplakin isoform I causes severe in neurons may be involved in the pathogenesis in a subset of arrhythmogenic left and right ventricular cardiomyopathy, individuals with autism. The ...",C54,"1, B. Wollnik2","Loss of desmoplakin isoform I causes severe in neurons may be involved in the pathogenesis in a subset of arrhythmogenic left and right ventricular cardiomyopathy, individuals with autism. The ...",2004,,"E. E. Norgett1, A. Uz\x9emc\x9e2, O. Uyguner2, A. Dindar3, H. Kayserili2, K. Nisli3, E. RNAi-mediated gene knock-down. Dupont4, N. Severs4, M. Yuksel-Apak2, D. P. Kelsell","1Institute of Cell and Molecular Science, Barts and The London School","C54Loss of desmoplakin isoform I causes severe in neurons may be involved in the pathogenesis in a subset of arrhythmogenic left and right ventricular cardiomyopathy, individuals with autism.",individual with autism,


In [None]:
if person-first not empty then find noun in person first pattern


In [33]:
print("There are " + 
      str(len(reviewed_matches)) + " rows in the post-manual review data frame coming from " +
      str(reviewed_matches['Title'].nunique()) +
      " unique titles.")

There are 360 rows in the post-manual review data frame coming from 209 unique titles.


In [34]:
print("In total, there are " + str(reviewed_matches['Person-first'].count()) +
      " examples of PFL, coming from " +
      str(len(reviewed_matches[reviewed_matches["Person-first"].notnull() == True].groupby(['Title']).nunique())) +
      " unique titles and with " +
      str(reviewed_matches['Person-first'].nunique()) + " unique patterns. They are distributed as follows: ")
reviewed_matches.groupby(['Person-first'])['Title'].nunique().sort_values(ascending=False)

In total, there are 128 examples of PFL, coming from 97 unique titles and with 33 uinique patterns. They are distributed as follows: 


Person-first
patient with autism                     36
child with autism                       17
individual with autism                  15
boy with autism                          5
girl with autism                         4
family with autism                       3
people with autism                       2
subject with autism                      2
child with autistic                      2
patient with autistic                    2
male with autism                         2
boy with autistic                        2
patient with idiopathic autism           2
patient with severe autistic             1
patient with severe autism               1
patient with mild autism                 1
patient with high functioning autism     1
uncle with autism                        1
individual with idiopathic autism        1
adolescent with autism                   1
girl with atypical autism                1
child with simplex autism                1
child with severe autistic               

In [35]:
print("In total, there are " + str(reviewed_matches['Identity-first'].count()) +
      " examples of IFL, coming from " +
      str(len(reviewed_matches[reviewed_matches["Identity-first"].notnull() == True].groupby(['Title']).nunique())) +
      " unique titles and with " +
      str(reviewed_matches['Identity-first'].nunique()) + " unique patterns. They are distributed as follows: ")
reviewed_matches.groupby(['Identity-first'])['Title'].nunique().sort_values(ascending=False)

In total, there are 233 examples of IFL, coming from 132 unique titles and with 48 uinique patterns. They are distributed as follows: 


Identity-first
autism patient                    48
autistic patient                  20
autism case                       13
autistic child                    13
autism dataset                    11
autistic individual                8
autism family                      7
autistic group                     6
autism cohort                      5
autistic population                5
autism group                       3
autism proband                     3
autistic proband                   3
autism trio                        3
African autistic population        2
autism child                       2
autism individual                  2
autism brother                     2
autistic case                      2
autistic subject                   1
unrelated autistic child           1
typical autism case                1
old autistic girl                  1
definite autistic case             1
homogeneous autism subgroup        1
large autism dataset               1
old autistic female    

In [31]:
reviewed_matches['Identity-first']


0                              NaN
1                              NaN
2                              NaN
3                              NaN
4                              NaN
                  ...             
355        Spanish autistic family
356           total autistic group
357            typical autism case
358       unrelated autistic child
359    various autistic population
Name: Identity-first, Length: 360, dtype: object

In [49]:
reviewed_matches['Noun'] = "" 

In [71]:
def find_PF_nouns(input):
    output = []
    for thingy in input:
        if isinstance(thingy,str):
            word_list = thingy.split()
            noun = word_list[0]
            output.append(noun)
        else:
            output.append("")
    return output

def find_IF_nouns(input):
    output = []
    for thingy in input:
        if isinstance(thingy,str):
            word_list = thingy.split()
            noun = word_list[-1]
            output.append(noun)
        else:
            output.append("")
    return output

In [85]:
test = find_PF_nouns(reviewed_matches['Person-first'])
print(len(reviewed_matches['Person-first']))
print(len(test))
print(test)


360
360
['patient', '', '', '', 'individual', 'child', '', '', 'patient', 'patient', '', '', 'patient', '', '', '', '', '', 'child', 'people', 'individual', '', '', '', '', 'male', '', '', '', '', 'adolescent', '', '', 'patient', 'patient', 'patient', 'patient', '', '', 'child', 'child', 'patient', 'individual', '', '', 'patient', 'center', '', 'child', '', 'patient', 'uncle', '', '', 'woman', 'people', '', '', 'child', 'boy', 'child', 'child', 'patient', 'patient', '', 'individual', 'child', '', 'child', 'child', 'individual', 'individual', '', 'child', '', 'individual', 'male', '', '', '', '', 'individual', 'individual', '', '', '', 'girl', '', '', '', '', '', 'subject', '', '', 'patient', '', '', 'patient', 'individual', 'boy', 'child', 'subject', '', '', 'patient', 'child', 'patient', 'patient', '', 'case', 'child', '', '', 'individual', '', '', '', '', '', '', 'boy', '', 'patient', 'patient', '', '', '', '', 'boy', 'boy', 'boy', 'patient', '', 'case', '', '', '', '', '', '', '', '

In [86]:
test2 = find_PF_nouns(reviewed_matches['Identity-first'])
print(len(reviewed_matches['Identity-first']))
print(len(test2))
print(test2)

360
360
['', 'autistic', 'autistic', 'autism', '', '', 'autism', 'autism', '', '', 'autism', 'autism', '', 'autism', 'autism', 'autism', 'autism', 'autism', '', '', '', 'autism', 'autism', 'autism', 'autism', '', 'autism', 'autistic', 'old', 'typical', '', 'autistic', 'autistic', '', '', '', '', 'autistic', 'autistic', '', '', '', '', 'autism', 'autism', '', '', 'autism', '', 'autistic', '', '', 'autistic', 'autism', '', '', 'autistic', 'autistic', '', '', '', '', '', '', 'autism', '', '', 'autism', '', '', '', '', 'autism', '', 'autist', '', '', 'autism', 'autism', 'autism', 'large', '', '', 'autism', 'autism', 'autism', '', 'African', 'autistic', 'autistic', 'unrelated', 'autism', '', 'autistic', 'Italian', '', 'autism', 'autistic', '', '', '', '', '', 'autism', 'autism', '', '', '', '', 'autism', '', '', 'autistic', 'autistic', '', 'autism', 'autistic', 'autistic', 'homogeneous', 'autism', 'autism', '', 'autistic', '', '', 'autism', 'autistic', 'African', 'autistic', '', '', '', '',

In [13]:
m
word_list = my_str.split()  # list of words

# first word  v              v last word
>>> word_list[0], word_list[-1]                            # add my three person-first patterns to it


In [None]:
matched_sentences['Person-first'] = matched_sentences.apply(lambda row: find_pattern_match(row.Sentence), axis = 1)
                                                                        # apply the newly defined person-first matcher function
                                                                        # and store the returned output in a new column
len(matched_sentences)                                                  # double check length remains same

In [None]:
person_identity_count.plot()
plt.show()
plt.savefig('..\\output\\matches_count.jpg')    # we can right click on the plot above to save it, or save it via command

In [None]:
person_examples = reviewed_matches.groupby(['Person-first'])['Person-first'].count()
identity_examples = reviewed_matches.groupby(['Identity-first'])['Identity-first'].count()
print(len(person_examples))
print(len(identity_examples))

In [None]:
person_identity_examples=pd.concat([person_examples,identity_examples],axis=1)


In [None]:
person_identity_examples.sort_values(by=['Person-first'], ascending=False).head(10)

In [None]:
person_identity_examples.sort_values(by=['Identity-first'], ascending=False).head(10)

In [None]:
person_identity_examples.notnull().sum()

## Count abstracts by the structures they use

In [None]:
test = pd.merge(person_by_title,identity_by_title,on='Title',how='outer')
print(test)
print(type(test))

In [None]:
person_by_title = reviewed_matches.groupby(['Title'])['Person-first'].count()
identity_by_title = reviewed_matches.groupby(['Title'])['Identity-first'].count()
title = pd.concat([person_by_title,identity_by_title],axis=1)
print(title)

In [None]:
title.sort_values(by=['Identity-first'], ascending=False)

In [None]:
title.sort_values(by=['Person-first'], ascending=False)

In [None]:
columns = ['Person-first','Identity-first']
filter_ = (title[columns] > 0).all(axis=1)
title[filter_]
len(title[filter_])


In [None]:
title[filter_].sort_values(by=['Person-first'], ascending=False)

In [None]:
title[filter_].sort_values(by=['Identity-first'], ascending=False)

In [None]:
has_pf = title[title['Person-first'] > 0]
has_both = has_pf[has_pf['Identity-first'] > 0]
print(len(has_both))
has_both

In [None]:
plt.scatter(title['Person-first'], title['Identity-first'])
plt.show()

#plt.savefig('..\\output\title_count.jpg')    # we can right click on the plot above to save it, or save it via command

In [None]:
title['Person-first']


In [None]:
ax = plt.axes()
ax.scatter(x, y, c='g', marker='x')
ax.set_title("Anscombe's First Data Set")
ax.set_ylabel('Y-Values')
ax.set_ylim(4, 11)
ax.set_xlabel('X-Values')
ax.set_xlim(3, 15)

plt.show()