# Step 1 - Prepare

Admittedly, quite a bit of work has already taken place. The original .pdf files were sent to a research support team, who scraped the text out of them and stored it in .csv files with useful headers. Each .pdf file produced a single .csv file, which you will find in 'root/2023_Second_analysis/results/'. The original .pdf files can be found in 'root/2022_First_analysis/input_pdfs/ESHG/', but since no code in this second attempt uses the original .pdfs directly, they are not duplicated here.

This processing was not universally successful; the original .pdf files were not all encoded in the same format and most had a totally unique layout. In effect, this means that each .pdf-to-.csv process was not equally able to automatically capture and segment the text in the .pdf to turn it into .csv columns such as author, affiliation, session code, etc. This means that some of the .csv files have empty columns while other .csv files have contents in those columns. This also means that some of the rows are probably faulty. Obviously, a detailed manual inspection would correct some of these errors but total accuracy is not the point of this research effort. 

The file for 2004 was particularly tricky and needs separate attention because the output from the research support team was not structured in a way that matches the others very well. 

Nevertheless, the output from the research support team is a set of files that are useful for natural language processing methods to investigate person-first and identity-first language. The next step is to import the various .csv files, consolidate them in one data frame, tidy up some of the erroneous rows and columns and then save the output in a new .csv. 

## Get ready

All of my Jupyter notebooks begin with some code cells focussed on downloading/importing necessary packages, loading useful short names, and so forth.

I also like to check the relevant file locations before importing the .csv files to work on. 

In [None]:
%%capture                         
                                  # The above capture statement is optional. 
                                  # You can remove this to see the chatter normally produced during import steps. 

import os                         # os is a module for navigating your machine (e.g., file directories).

import pandas as pd               # pandas is necessary for working with data frames - shortening it to pd just saves time. 
pd.set_option('display.max_colwidth', 200)   # some of the files are big so set a big column width. 
import numpy as np                # like pandas, numpy is useful and useful to have a short name for
import statistics                 # gotsta have stats

import csv                        # csv is for importing and working with csv files

import re                         # things we need for RegEx corrections
import string 
import math 




## Import

Having got all the packages we need and having checked the files, let's import them. This requires:
* defining a function to import multiple files from a known location (better than one-by-one importing!)
* checking the output of the mass-import for length and contents (since I suspect 2004 may not have worked correctly)
* having found an error, investigating it a bit
* correcting the error by removing problem rows, manually incorporating better rows, adding correct rows back in and checking

In [None]:
print(os.listdir("..\\results")  )                                # check 'results' folder is not empty/has correct stuff

In [None]:
def import_results(folder):                                       # create a function to import the contents of the
    files = []                                                    # folder named in the function input (fresh list on every call)
    for f in os.listdir(folder):                                  # going through the folder one file at a time
        df = pd.read_csv(os.path.join(folder, f), encoding='latin1')  # reading each file in as a csv
        files.append(df)                                          # appending the newly read data frame to the list
    output = pd.concat(files)                                     # then concatenating that list into one data frame
    return output                                                 # returning the output
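
The same import pattern can be tried on throwaway files before pointing it at the real 'results' folder. The sketch below is an illustration only: the function name `import_folder`, the temporary folder and the two tiny .csv files are all made up for the demonstration. It keeps the list of data frames inside the function, so running it twice should not double the rows, and it assumes a fresh row index (`ignore_index=True`) is acceptable, which the cell above does not do.

```python
import os
import tempfile

import pandas as pd


def import_folder(folder):
    """Read every .csv in `folder` and stack them into one data frame (hypothetical helper)."""
    frames = []                                   # local list: starts empty on every call
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith('.csv'):         # skip anything that is not a .csv
            frames.append(pd.read_csv(os.path.join(folder, name), encoding='latin1'))
    return pd.concat(frames, ignore_index=True)


# Build two made-up .csv files in a temporary folder, then import them twice.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Year': [2001, 2001], 'Text': ['a', 'b']}).to_csv(os.path.join(tmp, 'one.csv'), index=False)
    pd.DataFrame({'Year': [2002], 'Text': ['c']}).to_csv(os.path.join(tmp, 'two.csv'), index=False)
    first_run = import_folder(tmp)
    second_run = import_folder(tmp)

print(len(first_run), len(second_run))            # both runs should report the same length
```

If the list lived outside the function instead, the second call would append to the data frames left over from the first call and return twice as many rows.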

In [None]:
all_results = import_results("..\\results")      # run the newly defined function on the 'results' folder
how_many_total = len(all_results)                # check the length 
how_many_total                                   # 

In [None]:
print(all_results['Year'].drop_duplicates())     # quick check shows that 2004 (a known problem file) has not imported properly

In [None]:
how_many_no_year = all_results['Year'].isna().sum() # Let's just count how many rows have NaN instead of the year
how_many_no_year

In [None]:
no_Nan_in_Year = all_results[~all_results['Year'].isnull()]          # remove the 'Year' = NaN rows
how_many_without_Nan_year = len(no_Nan_in_Year)                      # check length again now that NaN rows are removed
how_many_without_Nan_year

In [None]:
print (how_many_total - how_many_no_year )                   # print the total number minus the number that are missing a year
print (how_many_without_Nan_year)                            # compare to the new total just to be sure

In [None]:
year_04 = pd.read_csv('..\\results\\ESHG2004.csv')      # specifically read in year 2004 (it needed a bit of extra work)
year_04 = year_04.iloc[:, [0,1]]                        # cut a two-column slice out of it with only the year and text
year_04                                                 # check how it looks (which also shows us how many rows are in it!)

In [None]:
all_results_corrected = pd.concat([no_Nan_in_Year, year_04])     # add those specially imported 2004 rows back into the output
                                                                 # which had the weird no-year rows removed
    
print(len(all_results_corrected))

In [None]:
how_many_total_new = len(all_results_corrected)                  # check length again - are we back up to where we started?

if how_many_total == how_many_total_new :                        # write a quick check to be sure the totals before and after
    print('The numbers add up!')                                 # removing/replacing the 2004 rows are the same. 
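
The bookkeeping in the cells above (count, remove, replace, count again) can be rehearsed on a toy data frame. This is a sketch under assumptions: the data and the names `combined`, `replacement` and `corrected` are invented here. It uses `assert` rather than `print`, so a mismatch would stop the notebook instead of passing silently.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the combined import: two rows lost their year.
combined = pd.DataFrame({
    'Year': [2003, 2003, np.nan, np.nan, 2005],
    'Text': ['a', 'b', 'c', 'd', 'e'],
})
# Made-up stand-in for the separately imported problem file.
replacement = pd.DataFrame({'Year': [2004, 2004], 'Text': ['c', 'd']})

total_before = len(combined)
missing_year = combined['Year'].isna().sum()      # rows with no year
kept = combined[combined['Year'].notna()]         # rows that do have a year

# Rows kept plus rows dropped must equal the starting total.
assert len(kept) + missing_year == total_before

corrected = pd.concat([kept, replacement])        # put the repaired rows back in

# The replacement should restore exactly as many rows as were dropped.
assert len(corrected) == total_before, 'row counts differ after replacing the problem rows'
print(len(corrected))
```

Equal totals only show that the number of rows matches. They do not show that the replacement rows are the right ones, so a quick look at the contents is still worthwhile.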

## Remove rows that imported incorrectly for reasons unrelated to 2004

Having imported all the various .csv files and stored them in one data frame (even that tricksy 2004 .csv), I do a bit of clean-up. Turning .pdf files into .csv is not a straightforward or foolproof process, so I want to remove any rows that have nothing in the 'Text' column and check the length again to see how many we have lost. Then I want to remove any columns that are entirely empty (probably the result of badly imported rows that were shifted over) and have a quick look at the remaining columns and what might be in them.

In [None]:
no_null_texts = all_results_corrected[~all_results_corrected['Text'].isnull()] 
                                                                    # remove any rows where the 'text' column is empty
how_many_no_null_texts = len(no_null_texts)                         # check length again - still making sense?
print (how_many_no_null_texts)
print (how_many_total_new - how_many_no_null_texts)

In [None]:
no_null_texts = no_null_texts.dropna(axis=1, how="all")   # remove all columns which contain only NaNs
print(len(no_null_texts))                                 # just check the length has not changed
no_null_texts                                             # have a nosy at which columns remain, what is in them, etc. 
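
To see what these two clean-up steps do in isolation, here is a small sketch on invented data (the column name 'Unnamed: 5' and the three rows are assumptions for illustration, not taken from the real files).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Year': [2001, 2002, 2003],
    'Text': ['some abstract', np.nan, 'another abstract'],
    'Unnamed: 5': [np.nan, np.nan, np.nan],      # a column with nothing in it
})

with_text = toy[toy['Text'].notna()]             # drop rows with an empty 'Text'
trimmed = with_text.dropna(axis=1, how='all')    # drop columns that are entirely NaN

print(len(toy) - len(with_text))                 # number of rows lost
print(list(trimmed.columns))                     # columns that remain
```

The order matters a little: a column whose only values sat in rows that were just removed is entirely empty afterwards, so it is dropped as well.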

## Clean up the 'Text' column a little bit

It was not obvious what text cleaning steps were most valuable when we first started working with this text data. However, having run through all of the steps several times, I have concluded that a few cleaning steps at this early stage are very useful. The first of these does basic things like removing multiple sequential whitespaces and inserting spaces when two sentences have been jammed together. 

Following that, another cleaning step standardises the keywords of interest. For example, this would turn 'Asperger's syndrome', 'Asperger syndrome disorder', 'autistic spectrum disorder' and 'autism spectrum disorder' into 'autism', and a standalone 'AS' into 'ASD'.

In [None]:
def remove_spacing_errors (input):
    no_extra_spaces = re.sub(r'(\s)(\s+)', r'\1', input)                        # turn 2+ sequential whitespaces into 1
    no_run_ons1 = re.sub(r'([a-z][.?;])([A-Z])', r'\1 \2', no_extra_spaces)     # splits run-ons (e.g. "word.New sentence ")
    no_run_ons2 = re.sub(r'([A-Z][.?;])([A-Z])', r'\1 \2', no_run_ons1)         # splits run-ons (e.g. "ACRONYM.New sentence ")
    space1 = re.sub(r'([a-z]+)(disorder|disability|spectrum|disease)', r'\1 \2', no_run_ons2) # adds a space in select run-ons
    space2 = re.sub(r'(spec) (trum)', r'\1\2', space1)                          # removes a space between 'spec' and 'trum'
    space3 = re.sub(r'(psychopa) (thy)', r'\1\2', space2)                       # removes a space between 'psychopa' and 'thy'

    return(space3)

In [None]:
 # Optional cell code block to test or understand what the remove_spacing_errors function does
    
spacing_error_test = "word.New sentence   extra spaces    test. \
            ACRONYM.New sentence    \
            Asperger'sdisorder sdisorder Autisticdisability autismspectrum \
            autismdisease spec trum "

remove_spacing_errors(spacing_error_test)
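
For a smaller worked example of the first two ideas (collapsing whitespace and splitting run-on sentences), the sketch below uses a made-up sentence. It is a simplified variant, not the function above: it replaces any run of whitespace with a single space rather than keeping the first whitespace character. The last line shows that `|` inside square brackets is a literal character, not an "or".

```python
import re

sample = "First sentence.Second one   has gaps;Third"

collapsed = re.sub(r'\s+', ' ', sample)                          # any run of whitespace becomes one space
split_up = re.sub(r'([a-z][.?;])([A-Z])', r'\1 \2', collapsed)   # add a space after . ? ; when a capital follows

print(split_up)    # First sentence. Second one has gaps; Third

# Inside [...] the pipe is just another character to match:
print(re.sub(r'([a-z][.|?|;])([A-Z])', r'\1 \2', 'a|B'))         # a| B
```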

In [None]:
no_run_ons = [remove_spacing_errors(abstract) for abstract in no_null_texts['Text'] ] 
                                             # create list of texts without extra spaces/run-on errors 
                                             # this is to improve sentence tokenisation later 
no_null_texts['Text'] = no_run_ons           # Overwrite the 'Text' column with the new no extra space/run-on abstract list

In [None]:
def tidy_up_terminology (input):
    no_apost = re.sub(r'([Aa]sperger)(\'?)(s?)', r'asperger', input)   # Standardises case and apostrophe in '[Aa]sperger's'
    lower1 = re.sub(r'[Ss]pectrums|[Ss]pectra|[Ss]pectrum', r'spectrum', no_apost) # lowercases / removes plurals for spectrum
    lower2 = re.sub(r'[Ss]yndromes|[Ss]yndrome', r'syndrome', lower1)              # lowercases / removes plurals for syndrome
    lower3 = re.sub(r'[Dd]isorders|Disorder', r'disorder', lower2)                 # lowercases / removes plurals for disorder
    lower4 = re.sub(r'[Dd]iseases|Disease', r'disease', lower3)                    # lowercases / removes plurals for disease
    lower5 = re.sub(r'[Aa]utis', r'autis', lower4)                                  # lowercases 'Autism' and 'Autistic'
    plur = re.sub(r"ASD'?s\b", r'ASD', lower5)                                   # removes the plural 's' from "ASDs" and "ASD's"
    aut0 = re.sub(r'(asperger )(autis)', r'\2', plur)                  # removes '[Aa]sperger' from '[Aa]sperger' autis'
    aut1 = re.sub(r'(autistic )(psychopathy)', r'autism', aut0)           # standardises 'autistic psychopathy' to 'autism'
    stan0 = re.sub(r'(autism|autistic|asperger) syndrome', r'autism', aut1 )       # turns select 'syndrome' to 'autism'
    stan1 = re.sub(r'(autism|autistic|asperger) spectrum', r'autism', stan0)       # turns select 'spectrum' to 'autism'
    stan2 = re.sub(r'(autism|autistic|asperger) disease', r'autism', stan1)        # turns select 'disease' to 'autism'
    stan3 = re.sub(r'(autism|autistic|asperger) disability', r'autism', stan2)     # turns select 'disability' to 'autism'
    stan4 = re.sub(r'(autism|autistic|asperger) disorder', r'autism', stan3)       # turns select 'disorder' to 'autism'
    stan5 = re.sub(r'(autism|autistic|asperger) \(ASD\)', r'autism', stan4)     # turns select former abbreviations to 'autism'
    stan6 = re.sub(r'(autism|autistic|asperger) \(AS\)', r'autism', stan5)      # turns select former abbreviations to 'autism'
    stan7 = re.sub(r'\bAS\b', r'ASD', stan6)     # standardises standalone 'AS' to 'ASD'; \b leaves 'ASQ', 'GWAS', etc. untouched

    return stan7

In [None]:
 # Optional cell code block to test or understand what the tidy_up_terminology function does
    
tidy_test = "1 Aspergers Asperger's Asperger Spectrum \
            2 Spectra Syndrome Syndrome \
            3 Disorders disorders Diseases disease ASDs ASD's \
            4 asperger's autism autistic psychopathy\
            5 autistic syndrome \
            6 autism disease \
            7 autistic disability  autism spectrum (AS)\
            8 asperger disability autistic disease \
            9 autism spectrum disorder (ASD) \
            10 autism spectrum (ASD) \
            11 autistic spectrum disease (ASD) \
            12 autism (ASD) \
            13 Asperger syndrome (AS)\
            14 autistic spectrum \
            15 AS. AS! ASQ "

tidy_up_terminology(tidy_test)
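
One detail worth testing separately is how a pattern for the abbreviation 'AS' behaves next to longer acronyms. The sketch below compares a pattern with no boundary in front of 'AS' against one that uses `\b` word boundaries on both sides. The sample sentence is invented, and 'GWAS' is my own example of a longer acronym ending in 'AS', not something taken from the abstracts.

```python
import re

sample = "AS. GWAS results, AS and ASQ scores"

loose = re.sub(r'AS([^\w])', r'ASD\1', sample)   # 'AS' followed by a non-word character, anywhere
strict = re.sub(r'\bAS\b', 'ASD', sample)        # 'AS' only as a whole word

print(loose)     # ASD. GWASD results, ASD and ASQ scores
print(strict)    # ASD. GWAS results, ASD and ASQ scores
```

Both versions leave 'ASQ' alone, but only the word-boundary version leaves the longer acronym intact.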

In [None]:
tidy_text = [tidy_up_terminology(abstract) for abstract in no_null_texts['Text'] ] 
                                             # create abstract list that has the terminology tidied and standardised
                                             # this is to improve pattern recognition later
no_null_texts['Text'] = tidy_text            # Over-write the 'Text' column with the tidied/standardised version

In [None]:
backup = no_null_texts.copy()  # A backup may be useful at this step if you want to adjust/test the tidy functions more
                               # .copy() makes it independent, so later changes to no_null_texts do not alter it

In [None]:
no_null_texts = backup.copy()   # If you need the backup, re-run this step. 
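
Whether a backup protects you depends on whether it is a copy or just a second name for the same data frame. A minimal sketch with an invented two-row data frame (the names `frame`, `alias` and `snapshot` are illustrative only):

```python
import pandas as pd

frame = pd.DataFrame({'Text': ['Autism', 'Asperger']})

alias = frame              # a second name for the same object
snapshot = frame.copy()    # an independent copy

frame['Text'] = ['changed', 'changed']   # overwrite the column, as the tidy cells do

print(alias['Text'].tolist())      # ['changed', 'changed']
print(snapshot['Text'].tolist())   # ['Autism', 'Asperger']
```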

## Save the consolidated output as .csv

Having imported, consolidated, tidied and checked everything, I want to save the output in a new .csv file. It is important to use a good name for the file, because bad file names are the bane of my existence. 

For simplicity's sake, I will also create a new data frame containing only those rows for which the 'Text' column contains one of the keywords of interest, check its length and save it as a new .csv file with a good name. 

In [None]:
type(no_null_texts)                          # Let's just double check what kind of a thing 'no_null_texts' is
                                             # This lets us know what kind of write-out-to-csv function we need.

In [None]:
no_null_texts                                # OPTIONAL - have a quick look to see if it looks the way you expect


In [None]:
no_null_texts.to_csv('..\\output\\all_abstracts_no_null_texts.csv')  # write out the data frame to a .csv, with a useful name
                                                                     # which clarifies this is ALL abstracts with non-null texts

In [None]:
no_nans_matched_texts = no_null_texts[no_null_texts['Text'].str.contains('[Aa]utis|ASD|AS|[Aa]sperger')]
                                                         # keep only rows where text contains one or more original keywords
len(no_nans_matched_texts)                               # check the length

In [None]:
no_nans_matched_texts.to_csv('..\\output\\matched_abstracts_no_null_texts.csv')  # write out the matched texts df to a .csv too
                                                                                 # again, with a clear and useful name

## Manually check the saved .csv files

You may want to go and check that the two files you have created here have been created and saved correctly. You may even want to open them up and have a nosy through them to see what they look like. 
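
Part of that check can also be done in code. The sketch below writes a small invented data frame to a temporary folder, confirms the file exists, and reads it back to compare row counts and columns. The file name 'check_me.csv' and the data are assumptions for illustration; for the real files you would point `path` at the files saved above.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'Year': [2001, 2002], 'Text': ['first abstract', 'second abstract']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'check_me.csv')
    frame.to_csv(path)                              # default settings write the index as the first column
    exists = os.path.isfile(path)                   # was the file actually created?
    reloaded = pd.read_csv(path, index_col=0)       # read that first column back in as the index

print(exists, len(reloaded) == len(frame), list(reloaded.columns))
```

Reading the file back without `index_col=0` would show the saved index as an extra unnamed column, which is worth knowing before the next notebook imports these files.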

The next notebook picks up where this leaves off, by importing those files and working with them to produce some stats that help explore the research question. 