# Step 1 - Prepare

Admittedly, quite a bit of work has already taken place. The .pdf files were sent off to a research support team that scraped the text out of them and stored the contents of those files in .csv files which have useful headers. Each .pdf file was processed to produce a single .csv file. This processing was not universally successful because the original .pdf files were not all encoded in the same format and most had a totally unique layout. In effect, this means that each .pdf to .csv process was not equally able to automatically capture and segment the text in the .pdf into .csv columns such as author, affiliation, session code, etc. This means that some of the .csv files have empty columns while other .csv files have contents written in the same columns. This also means that some of the rows are probably faulty. Obviously, a detailed manual inspection would correct some of these errors but total accuracy is not the point of this research effort. 

The file for 2004 was particularly tricky and needs separate attention because the output from the research support team was not structured in a way that matches the others very well. 

Nevertheless, we have all the files in a somewhat useful form that allows us to use natural language processing methods to investigate person-first and identity-first language. The next step is to import the various .csv files, consolidate them in one data frame, tidy up some of the erroneous rows and columns and then save the output in a new .csv.

## Get ready

All of my Jupyter notebooks begin with some code cells focussed on downloading/importing necessary packages, loading useful short names, and so forth.

I also like to check the relevant file locations before importing the .csv files to work on. 

In [1]:
%%capture                         
                                  # The above capture statement is optional. 
                                  # You can remove this to see the chatter normally produced during import steps. 

import os                         # os is a module for navigating your machine (e.g., file directories).

import pandas as pd               # pandas is necessary for working with data frames - shortening it to pd just saves time. 
pd.set_option('display.max_colwidth', 200)   # some of the files are big so set a big column width. 
import numpy as np                # like pandas, numpy is useful, and a short name for it saves time
import statistics                 # gotsta have stats

import csv                        # csv is for importing and working with csv files

import re                         # things we need for RegEx corrections
import string 
import math 

In [2]:
print(os.listdir("..\\results")  )                                # check 'results' folder is not empty/has correct stuff



['ESHG2001abstractICHG.csv', 'ESHG2002Abstracts.csv', 'ESHG2003Abstracts.csv', 'ESHG2004.csv', 'ESHG2005Abstracts.csv', 'ESHG2006Abstracts.csv', 'ESHG2007Abstracts.csv', 'ESHG2008Abstracts.csv', 'ESHG2009Abstracts.csv', 'ESHG2010Abstracts.csv', 'ESHG2011Abstracts.csv', 'ESHG2012Abstracts.csv', 'ESHG2013Abstracts.csv', 'ESHG2014Abstracts.csv', 'ESHG2015Abstracts.csv', 'ESHG2016Abstracts.csv', 'ESHG2017 electronic posters.csv', 'ESHG2017 oral presentation.csv', 'ESHG2018 electronic posters.csv', 'ESHG2018 oral presentation.csv', 'ESHG2019 electronic posters.csv', 'ESHG2019 oral presentation.csv', 'ESHG2020 electronic posters.csv', 'ESHG2020 oral presentation.csv', 'ESHG2021 electronic posters.csv', 'ESHG2021 oral presentation.csv']


## Import

Having got all the packages we need and having checked the files, let's import them. This requires:
* defining a function to import multiple files from a known location (better than one-by-one importing!)
* checking the output of the mass-import for length and contents (since I suspect 2004 may not have worked correctly)
* having found an error, investigating it a bit
* correcting the error by removing problem rows, manually incorporating better rows, adding correct rows back in and checking

In [3]:
def import_results(folder):                                       # create a function to import the contents of the
    files = []                                                    # folder named in the function input; start with an empty list
    for f in os.listdir(folder):                                  # then go through the folder, file by file,
        df = pd.read_csv(os.path.join(folder, f), encoding='latin1')  # reading each one in as a csv file
        files.append(df)                                          # and appending the new data frame to the list
    output = pd.concat(files)                                     # then concatenate the list into one data frame
    return output                                                 # and return it
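
One optional refinement, sketched here under assumptions rather than taken from the workflow above: record which file each row came from, so that a problem like rows with a missing year can be traced back to its source. The function name `import_folder` and the `source_file` column are hypothetical additions. Using `ignore_index=True` is also a choice made for this sketch; it gives every row a unique index label instead of restarting at 0 for each file.

```python
from pathlib import Path

import pandas as pd


def import_folder(folder, encoding="latin1"):
    """Read every .csv in `folder` into one data frame (hypothetical helper)."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path, encoding=encoding)
        frame["source_file"] = path.name      # remember where each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Calling `import_folder("../results")` would then return one data frame in which rows with a missing 'Year' can be grouped by `source_file` to see which file produced them.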

In [4]:
all_results = import_results("..\\results")      # run the newly defined function on the 'results' folder
how_many_total = len(all_results)                # check the length 
how_many_total                                   # 

34630

In [5]:
print(all_results['Year'].drop_duplicates())     # quick check shows that 2004 (a known problem file) has not imported properly

0    2001.0
0    2002.0
0    2003.0
0       NaN
0    2005.0
0    2006.0
0    2007.0
0    2008.0
0    2009.0
0    2010.0
0    2011.0
0    2012.0
0    2013.0
0    2014.0
0    2015.0
0    2016.0
0    2017.0
0    2018.0
0    2019.0
0    2020.0
0    2021.0
Name: Year, dtype: float64


In [6]:
how_many_no_year = all_results['Year'].isna().sum() # Let's just count how many rows have NaN instead of a year
how_many_no_year

2205
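
A per-year tally can make the same check at a glance. This is a minimal sketch on a toy data frame, assuming only that the column is called 'Year' as above; `rows_per_year` is a hypothetical helper name.

```python
import pandas as pd


def rows_per_year(frame):
    """Count rows for each value of 'Year', keeping NaN as its own group."""
    return frame["Year"].value_counts(dropna=False).sort_index()


toy = pd.DataFrame({"Year": [2001.0, 2001.0, None, 2002.0, None]})
print(rows_per_year(toy))
```

Because `dropna=False` keeps the missing values as a group of their own, the counts always sum to the length of the data frame.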

In [7]:
no_Nan_in_Year = all_results[~all_results['Year'].isnull()]          # remove the 'Year' = NaN rows
how_many_without_Nan_year = len(no_Nan_in_Year)                      # check length again now that NaN rows are removed
how_many_without_Nan_year

32425

In [8]:
print (how_many_total - how_many_no_year )                   # print the total number minus the number that are missing a year
print (how_many_without_Nan_year)                            # compare to the new total just to be sure

32425
32425


In [9]:
year_04 = pd.read_csv('..\\results\\ESHG2004.csv')      # specifically read in year 2004 (it needed a bit of extra work)
year_04 = year_04.iloc[:, [0,1]]                        # cut a two-column slice out of it with only the year and text
year_04                                                 # check how it looks (which also shows us how many rows are in it!)

Unnamed: 0,Year,Text
0,2004,"L01Multiple Sulfatase Defi ciency: Molecular defect and properties of the autosomal forms of epigenetic mosaicism can be caused by missing enzyme. retrotransposon activity. K. von Figura, M. Maria..."
1,2004,L04Regional differences in genetic testing and counselling in Europe - An overview
2,2004,"L02Biogenesis of mitochondria: Human diseases linked to S. Aymé protein transport, folding and degradation INSERM, Paris, France. W. Neupert"
3,2004,L05Hereditary Breast/Ovarian Cancer risk: international energy present in oxidizable substrates is transduced into energy comparison of the acceptability of Preventive strategies stored in ATP. Mi...
4,2004,L06Variation in prenatal counselling in Europe: the example of highly motile within the cell. Quite a number of genes are involved Klinefelter in these processes which are closely linked to the in...
...,...,...
2200,2004,"C7 A10 in the aetiology of cystinuria, we could not identify any Affected children may have only one episode of illness or multiple mutation in SL"
2201,2004,"C7 A10 in the two families. Nevertheless, there remains recurrences. A common mutation (985A >G) has been identiÜed the possibility that other genes are involved in cystinuria. Further among pa..."
2202,2004,"P0845Inactivation of the spasmolytic trefoil peptide (Tff2) leads In this study, two unrelated MCAD patients, compound heterozygous to increased expression of additional gastroprotective factors..."
2203,2004,"P0843MCDR1 Locus - Screening for candidate genes. functional disturbance in stomach and gut, Tff2-/- constructs do N. Udar1, M. Chalukya1, R. Silva-Garcia1, J. Yeh1, P. Wong1,2, K. Small1 not di..."


In [10]:
all_results_corrected = pd.concat([no_Nan_in_Year, year_04])     # add those specially imported 2004 rows back into the output
                                                                 # which had the weird no-year rows removed

In [11]:
how_many_total_new = len(all_results_corrected)                  # check length again - are we back up to where we started?

if how_many_total == how_many_total_new :                        # write a quick check to be sure the totals before and after
    print('The numbers add up!')                                 # removing/replacing the 2004 rows are the same. 

The numbers add up!


## Remove rows that imported incorrectly for reasons unrelated to 2004

Having imported all the various .csv files and stored them in one data frame (even that tricksy 2004 .csv), I do a bit of clean up. Turning .pdf files into .csv files is not a straightforward or foolproof process, so I want to remove any rows that have nothing in the 'Text' column and check the length again to see how many we have lost. Then I want to remove any columns that are entirely empty (probably the result of badly imported rows that are shifted over) and have a quick look at the remaining columns and what might be in them.
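
To see what these two steps do before running them on the real data, here is a small sketch on a made-up data frame. The column names mirror the ones used in this notebook, but the values are invented for illustration, and `dropna(subset=...)` is used as an equivalent alternative to a boolean mask.

```python
import numpy as np
import pandas as pd

# A toy data frame: one row has no text, and the 'Email' column is entirely empty.
toy = pd.DataFrame({
    "Text": ["abstract one", None, "abstract three"],
    "Year": [2001, 2001, 2004],
    "Email": [np.nan, np.nan, np.nan],
})

with_text = toy.dropna(subset=["Text"])        # drop rows with nothing in 'Text'
tidy = with_text.dropna(axis=1, how="all")     # drop columns that contain only NaN

print(len(toy) - len(with_text), "row(s) removed")
print(list(tidy.columns))
```

The order matters a little: a column can become entirely empty only after the faulty rows are gone, so the rows are dropped first.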

In [12]:
no_null_texts = all_results_corrected[~all_results_corrected['Text'].isnull()] 
                                                                    # remove any rows where the 'text' column is empty
how_many_no_null_texts = len(no_null_texts)                         # check length again - still making sense?
print (how_many_no_null_texts)
print (how_many_total_new - how_many_no_null_texts)

33979
651


In [13]:
no_null_texts = no_null_texts.dropna(axis=1, how="all")   # remove all columns which contain only NaNs
print(len(no_null_texts))                                 # just check the length has not changed
no_null_texts                                             # have a nosy at which columns remain, what is in them, etc. 

33979


Unnamed: 0,Title,Session_Code,Authors_and_Affiliations,Text,Year,Email,Author,Affiliations
0,Progress in sequencing and annotating the human genome,PS01.,Weissenbach Genoscope Evry F,Despite the availability of most of the human genome sequence the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on ...,2001.0,jsbach@genoscope.cns.fr,,
1,Protein Structures Inherited Mutations and Diseases,PS02.,M. ThorntonUniversity College Biochemistry and Molecular Biology Department andBirkbeck College London United K,The mechanisms whereby inherited DNA mutations cause disease areonly beginning to be understood. These are best understood in the contextof knowledge of the three dimensional structure of the rele...,2001.0,thornton@biochem.ucl.ac.uk,,
2,Mutations and Their Effect on Protein Structure,PS03.,R. FershtCentre for Protein Engineering University of Cambridge Cambridge Unit ed K,Missense mutations can knock out residues in proteins that are importantfor binding catalysis or conformational change or have subtle effects onstability or conformation. Because buried residues...,2001.0,hjt1001@cam.ac.uk,,
3,Pharmacogenetics A Disruptive Technology For The Next Decade,PS04.,Roses Senior Vice President Genetics Research GlaxoSmithKline London United K,The Genome Project has accelerated the availability of tools to apply phar macogenetics to the development of personalized medicines. This differsfrom the long term [10 15 year] strategy that is r...,2001.0,mdb4138@glaxowellcome.co.uk,,
4,Predictive Testing for Huntington Disease 15 Years Later,PS05.,R. Hayden Centre for Molecular medicine & Therapeutics Department of MolecularGenetics University of British Columbia Vancouver BC C,Predictive genetic testing (PT) for Huntington Disease has now beenoffered for longer than for any other late onset genetic illness. The signifi cant number of persons at risk who have participat...,2001.0,mrh@cmmt.ubc.ca,,
...,...,...,...,...,...,...,...,...
2200,,,,"C7 A10 in the aetiology of cystinuria, we could not identify any Affected children may have only one episode of illness or multiple mutation in SL",2004.0,,,
2201,,,,"C7 A10 in the two families. Nevertheless, there remains recurrences. A common mutation (985A >G) has been identiÜed the possibility that other genes are involved in cystinuria. Further among pa...",2004.0,,,
2202,,,,"P0845Inactivation of the spasmolytic trefoil peptide (Tff2) leads In this study, two unrelated MCAD patients, compound heterozygous to increased expression of additional gastroprotective factors...",2004.0,,,
2203,,,,"P0843MCDR1 Locus - Screening for candidate genes. functional disturbance in stomach and gut, Tff2-/- constructs do N. Udar1, M. Chalukya1, R. Silva-Garcia1, J. Yeh1, P. Wong1,2, K. Small1 not di...",2004.0,,,


## Tidy up the 'Text' column

Since the main thrust of this research is to use natural language processing on the 'Text' column, it makes sense to do a bit of cleaning at this early stage. I had a look at what was in the 'Text' column and noticed a few problems: encoding/decoding errors, extra whitespace, run-on sentences, extra punctuation, spelling, etc. The problem is that this is not a normal, conversational English set of texts. There are, quite rightly, a lot of characters from other languages, words that are unlikely to be in typical language dictionaries, etc.

I set up a few processes to loop over the text to clean these up. We need a few more import/download functions here, so let's start with that. 


In [123]:
# importing the nltk suite  
import nltk 
from nltk import word_tokenize                     # a useful function from nltk that helps identify individual words
  
# importing jaccard distance 
# and ngrams from nltk.util 
from nltk.metrics.distance import jaccard_distance 
from nltk.util import ngrams

# importing edit distance   
from nltk.metrics.distance  import edit_distance 

# Downloading and importing 
# package 'words' from nltk corpus 
nltk.download('words') 
from nltk.corpus import words 

!pip install unidecode
from unidecode import unidecode
correct_words = words.words()

from spellchecker import SpellChecker
spell = SpellChecker()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Collecting unidecode
  Downloading Unidecode-1.3.7-py3-none-any.whl (235 kB)
                                              0.0/235.5 kB ? eta -:--:--
     ------------------------------------   225.3/235.5 kB 6.9 MB/s eta 0:00:01
     -------------------------------------- 235.5/235.5 kB 4.8 MB/s eta 0:00:00
Installing collected packages: unidecode
Successfully installed unidecode-1.3.7


In [15]:
with open('words_alpha.txt', 'r') as file:       # read in a plain-text word list, one word per line
    words_alpha = file.readlines()
dictionary = []
for entry in words_alpha:
    item = entry.rstrip('\n')                    # strip the trailing newline from each entry
    dictionary.append(item)
    
print(dictionary[:100])

['a', 'aa', 'aaa', 'aah', 'aahed', 'aahing', 'aahs', 'aal', 'aalii', 'aaliis', 'aals', 'aam', 'aani', 'aardvark', 'aardvarks', 'aardwolf', 'aardwolves', 'aargh', 'aaron', 'aaronic', 'aaronical', 'aaronite', 'aaronitic', 'aarrgh', 'aarrghh', 'aaru', 'aas', 'aasvogel', 'aasvogels', 'ab', 'aba', 'ababdeh', 'ababua', 'abac', 'abaca', 'abacay', 'abacas', 'abacate', 'abacaxi', 'abaci', 'abacinate', 'abacination', 'abacisci', 'abaciscus', 'abacist', 'aback', 'abacli', 'abacot', 'abacterial', 'abactinal', 'abactinally', 'abaction', 'abactor', 'abaculi', 'abaculus', 'abacus', 'abacuses', 'abada', 'abaddon', 'abadejo', 'abadengo', 'abadia', 'abadite', 'abaff', 'abaft', 'abay', 'abayah', 'abaisance', 'abaised', 'abaiser', 'abaisse', 'abaissed', 'abaka', 'abakas', 'abalation', 'abalienate', 'abalienated', 'abalienating', 'abalienation', 'abalone', 'abalones', 'abama', 'abamp', 'abampere', 'abamperes', 'abamps', 'aband', 'abandon', 'abandonable', 'abandoned', 'abandonedly', 'abandonee', 'abandoner'

In [16]:
no_null_texts['Text'][1]                                        # First, let's have a look at the 'Text' column to spot some
                                                                # issues. Right away, I can see "areonly", "earthÃ¢ÂÂs", 
                                                                # "sequen cing", etc. Work to be done!

1    The mechanisms whereby inherited DNA mutations cause disease areonly beginning to be understood. These are best understood in the contextof knowledge of the three dimensional structure of the rele...
1    Gene therapy is an attractive option for a number of genetic  disorders. Genetic supplementation could in theory lead to long  lasting disease phenotype correction. However, efficient targeting,  ...
1                                                                                                                                                                                     No abstract available.
1    Four decades ago homocystinuria due to cystathionine beta synthase (CBS) has been described as a typical inborn error of metabolism partially resembling the Marfan syndrome. As extremely high conc...
1    The earthÃ¢ÂÂs rotation causes 24 hour cycles in many aspects of the  physical environment, while the earthÃ¢ÂÂs revolution around the sun causes seasonal changes . Most l
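
The odd character runs in the last text above look like text that was saved as UTF-8 and then decoded as latin1, possibly more than once. That is an assumption about these files, not something confirmed here. If it holds, the damage can sometimes be reversed by encoding and decoding again. `fix_mojibake` below is a hypothetical helper that sketches the idea and leaves the text alone when the round trip does not work.

```python
def fix_mojibake(text, max_rounds=3):
    """Undo UTF-8 text that was wrongly decoded as latin1 (hypothetical helper)."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("latin1").decode("utf-8")
        except UnicodeError:
            break              # not reversible this way, so keep what we have
        if repaired == text:
            break              # nothing left to repair
        text = repaired
    return text


# Build a doubly garbled string on purpose, then check it can be recovered.
original = "the earth\u2019s rotation"
garbled = original.encode("utf-8").decode("latin1").encode("utf-8").decode("latin1")
print(fix_mojibake(garbled) == original)
```

Text that is already correct, such as a properly decoded accented name, fails the round trip and is returned unchanged, which is the behaviour wanted for a data set with many legitimate non-English characters.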

In [129]:
test_string = "This is a test.String. It has   problems that areonly going to get better."

def remove_errors (input):
    no_extra_spaces = re.sub(r'(\s)(\s+)', r'\1', input)                 # identifies 2 or more sequential whitespaces and cuts them to 1
    no_run_ons = re.sub(r'([a-z]\.)([A-Z])', r'\1 \2', no_extra_spaces)  # identifies run-ons (e.g. "word.New sentence") and adds a space
    normalised = unidecode(no_run_ons)                                   # transliterates non-ASCII characters to their closest ASCII match
    tokens = word_tokenize(normalised)                                   # splits the text into word and punctuation tokens
    output = []
    for token in tokens:
        if token.lower() in dictionary :                                 # known words are kept as they are
            output.append(token)
        else:
            if token in "-!\"#$%&()'*-–+,./:;<=>?@[\\]^_`{|}~''“”":      # punctuation is kept as it is
                output.append(token)
            else:
                segmented = get_segments(token)                          # anything else is split into known words
                output.append(segmented)                                 # (get_segments is defined in a later cell)
    return(output)

remove_errors(test_string)

['This',
 'is',
 'a',
 'test',
 '.',
 'String',
 '.',
 'It',
 'has',
 'problems',
 'that',
 ['are', 'only'],
 'going',
 'to',
 'get',
 'better',
 '.']
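
The two regular-expression steps are easier to follow in isolation. The sketch below uses hypothetical helper names and a slightly simpler whitespace pattern than the function above. Note the escaped full stop (`\.`) in the run-on pattern: an unescaped `.` matches any character, so `[a-z].` followed by a capital letter would also fire on ordinary text such as "the DNA".

```python
import re


def collapse_whitespace(text):
    """Replace any run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def split_run_ons(text):
    """Insert a space after a full stop that is jammed against a capital letter."""
    return re.sub(r"([a-z]\.)([A-Z])", r"\1 \2", text)


sample = "This is a test.String. It has   problems."
print(split_run_ons(collapse_whitespace(sample)))
```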

In [128]:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

remove_accents("Ã¢ÂÂclock")        # the argument needs quotes; without them Python raises a SyntaxError

In [None]:
for thingy in no_null_texts['Text'][1]:
    remove_errors(thingy)

In [114]:
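import functools                                                 # needed for functools.partial below

def get_segments(input):                                         # OneGramDist, onegram_log and segment are word-segmentation helpers
    sentence = input.lower()                                     # that are not defined in this notebook and must be imported first
    onegrams = OneGramDist(filename='count_1w.txt')              # load the one-gram counts from count_1w.txt
    onegram_fitness = functools.partial(onegram_log, onegrams)   # build the fitness function used to score candidate splits
    return(segment(sentence, word_seq_fitness=onegram_fitness))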
    
get_segments('areonly')

['are', 'only']
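
For readers without the one-gram helpers to hand, here is a self-contained sketch of the same idea using a different, simpler technique: dynamic programming over a plain vocabulary, preferring the split with the fewest words. It is not the frequency-based method used by `get_segments`. The function name and the toy vocabulary are hypothetical.

```python
def segment_with_vocabulary(token, vocabulary):
    """Split `token` into the fewest known words, or return it whole."""
    token = token.lower()
    # best[i] holds the shortest list of known words that spells token[:i]
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


toy_vocabulary = {"a", "are", "context", "of", "on", "only", "re"}
print(segment_with_vocabulary("areonly", toy_vocabulary))
print(segment_with_vocabulary("contextof", toy_vocabulary))
```

Tokens that cannot be built from the vocabulary, such as gene names, come back unchanged as a one-item list, which matters for a corpus full of technical terms.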

## Save the consolidated output as .csv

Having imported, consolidated, tidied and checked everything, I want to save the output in a new .csv file. It is important to use a good name for the file, because bad file names are the bane of my existence. 

For simplicity's sake, I will also create a new data frame containing only those rows for which the 'Text' column contains one of the keywords of interest, check its length and save it as a new .csv file with a good name. 

In [14]:
type(no_null_texts)                          # Let's just double check what kind of a thing 'no_null_texts' is
                                             # This lets us know what kind of write-out-to-csv function we need.

pandas.core.frame.DataFrame

In [15]:
no_null_texts.to_csv('..\\output\\all_abstracts_no_null_texts.csv')  # write out the data frame to a .csv, with a useful name
                                                                     # which clarifies this is ALL abstracts with non-null texts

In [16]:
no_nans_matched_texts = no_null_texts[no_null_texts['Text'].str.contains('autis|Autis|ASD|Asperger|asperger')]
                                                         # keep only rows where text contains a keyword of interest
len(no_nans_matched_texts)                               # check the length

906
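
One thing worth knowing about a plain substring pattern is that a short acronym such as 'ASD' also matches inside longer words. The sketch below shows a stricter variant on invented example texts: the word stems are matched case-insensitively, and the acronym is matched only as a whole word. `mentions_keywords` is a hypothetical helper, and whether the stricter match is preferable for this research is a judgment call.

```python
import pandas as pd


def mentions_keywords(texts):
    """Flag texts that mention the keywords of interest (sketch, see assumptions)."""
    word_stems = texts.str.contains(r"autis|asperger", case=False, na=False)
    acronym = texts.str.contains(r"\bASD\b", na=False)   # whole-word, case-sensitive
    return word_stems | acronym


toy = pd.Series(["Autism spectrum", "children with ASD.", "the NASDAQ index",
                 "Asperger syndrome", None, "unrelated abstract"])
print(toy[mentions_keywords(toy)])
```

Passing `na=False` treats missing texts as non-matches, so the result can be used directly as a boolean mask.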

In [17]:
no_nans_matched_texts.to_csv('..\\output\\matched_abstracts_no_null_texts.csv')  # write out the matched texts df to a .csv too
                                                                                 # again, with a clear and useful name

## Manually check the saved .csv files

You may want to go and check that the two files you have created here have been created and saved correctly. You may even want to open them up and have a nosy through them to see what they look like. 
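
If you would rather do a first check in code, something along these lines could work. It is a minimal sketch: `check_saved_csv` is a hypothetical helper, and it only confirms that the file exists and reads back with the expected number of rows, not that the contents are sensible.

```python
from pathlib import Path

import pandas as pd


def check_saved_csv(path, expected_rows):
    """Return True if `path` exists and reads back with `expected_rows` rows."""
    path = Path(path)
    if not path.is_file():
        return False
    return len(pd.read_csv(path)) == expected_rows
```

For the files saved above, the expected row counts would be `len(no_null_texts)` and `len(no_nans_matched_texts)` respectively.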

The next notebook picks up where this leaves off, by importing those files and working with them to produce some stats that help explore the research question. 