# Step 1 - Prepare

As this work builds on previous work (in the Person-first/Identity-first repo), the preparation phase is relatively straightforward. Mostly, it involves importing a large .csv file with lots of columns, of which only 'Text' and 'Year' are needed here.

## Get ready

All of my Jupyter notebooks begin with some code cells focussed on downloading/importing necessary packages, loading useful short names, and so forth.

I also like to check the relevant file locations before importing the .csv files to work on. 

In [1]:
%%capture                         
                                  # The above capture statement is optional. 
                                  # You can remove this to see the chatter normally produced during import steps. 

import os                         # os is a module for navigating your machine (e.g., file directories).

import pandas as pd               # pandas is necessary for working with data frames - shortening it to pd just saves time. 
pd.set_option('display.max_colwidth', 200)   # some of the files are big so set a big column width. 
import numpy as np                # like pandas, numpy is useful, and a short name for it saves time
import statistics                 # gotsta have stats

import csv                        # csv is for importing and working with csv files

import re                         # things we need for RegEx corrections
import string 
import math 




## Import, check and clean up

Having got all the packages we need and having checked the files, let's import them. This requires:
* reading in and checking the .csv file
* dropping all the columns except for 'Text' and 'Year'
* checking again and noticing that 'Year' is stored as a float (e.g. 2004.0) when it should be an integer
* converting 'Year' to an integer type

In [2]:
print(os.listdir("..\\results"))                                  # check 'results' folder is not empty/has correct stuff

['.ipynb_checkpoints', 'texts_by_year.csv']
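
The paths in this notebook are written with Windows-style separators (`"..\\results"`), which will not resolve on macOS or Linux. Below is a minimal sketch, assuming the same folder layout, of building the same relative path with `pathlib` so that it works on any operating system. The names `results_dir` and `csv_path` are introduced here for illustration only and are not used elsewhere in the notebook.

```python
from pathlib import Path

# Hypothetical variable names; the folder and file names are the ones used in this notebook.
results_dir = Path("..") / "results"
csv_path = results_dir / "texts_by_year.csv"

# Only list the folder if it exists, so the cell does not raise on a machine without it.
if results_dir.is_dir():
    print(sorted(p.name for p in results_dir.iterdir()))
else:
    print(f"Folder not found: {results_dir}")
```

`pd.read_csv` accepts a `Path` object, so `csv_path` could be passed to it directly.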


In [3]:
texts_by_year = pd.read_csv('..\\results\\texts_by_year.csv')    # read in the file
texts_by_year

Unnamed: 0.1,Unnamed: 0,Title,Session_Code,Author,Affiliations,Text,Year,Authors_and_Affiliations,Email
0,0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",EUR10000,,,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018.0,,
1,1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,EBPL1.1,,,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018.0,,
2,2,implications.Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We s...,EBPL1.2,,,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018.0,,
3,3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,EBPL1.4,,,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018.0,,
4,4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",EBPL1.3,,,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018.0,,
...,...,...,...,...,...,...,...,...,...
38223,2200,,,,,"C7 A10 in the aetiology of cystinuria, we could not identify any Affected children may have only one episode of illness or multiple mutation in SL",2004.0,,
38224,2201,,,,,"C7 A10 in the two families. Nevertheless, there remains recurrences. A common mutation (985A >G) has been identiÜed the possibility that other genes are involved in cystinuria. Further among patie...",2004.0,,
38225,2202,,,,,"P0845Inactivation of the spasmolytic trefoil peptide (Tff2) leads In this study, two unrelated MCAD patients, compound heterozygous to increased expression of additional gastroprotective factors f...",2004.0,,
38226,2203,,,,,"P0843MCDR1 Locus - Screening for candidate genes. functional disturbance in stomach and gut, Tff2-/- constructs do N. Udar1, M. Chalukya1, R. Silva-Garcia1, J. Yeh1, P. Wong1,2, K. Small1 not disp...",2004.0,,


In [5]:
list(texts_by_year.columns)                             # Get the column names in a list to make it easier to remove the 
                                                        # ones that we don't need to keep

['Unnamed: 0',
 'Title',
 'Session_Code',
 'Author',
 'Affiliations',
 'Text',
 'Year',
 'Authors_and_Affiliations',
 'Email']

In [7]:
texts_year_only = texts_by_year.drop(['Unnamed: 0',     # Copy/paste the output from the previous step here, making sure to
                    'Title',                            # remove the columns that we want to KEEP
                    'Session_Code', 
                    'Author', 
                    'Affiliations', 
                    'Authors_and_Affiliations', 
                    'Email'], axis=1)
texts_year_only                                         # check it again

Unnamed: 0,Text,Year
0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018.0
1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018.0
2,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018.0
3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018.0
4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018.0
...,...,...
38223,"C7 A10 in the aetiology of cystinuria, we could not identify any Affected children may have only one episode of illness or multiple mutation in SL",2004.0
38224,"C7 A10 in the two families. Nevertheless, there remains recurrences. A common mutation (985A >G) has been identiÜed the possibility that other genes are involved in cystinuria. Further among patie...",2004.0
38225,"P0845Inactivation of the spasmolytic trefoil peptide (Tff2) leads In this study, two unrelated MCAD patients, compound heterozygous to increased expression of additional gastroprotective factors f...",2004.0
38226,"P0843MCDR1 Locus - Screening for candidate genes. functional disturbance in stomach and gut, Tff2-/- constructs do N. Udar1, M. Chalukya1, R. Silva-Garcia1, J. Yeh1, P. Wong1,2, K. Small1 not disp...",2004.0
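
When only a couple of columns are wanted, an alternative to listing everything to drop is to select the columns to keep. The sketch below shows this on a small invented data frame (`toy` is a stand-in for `texts_by_year`, not the real data).

```python
import pandas as pd

# Toy stand-in for texts_by_year; the values are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Title": ["a", "b"],
    "Text": ["first abstract", "second abstract"],
    "Year": [2018.0, 2004.0],
})

# Select the columns to keep; .copy() gives an independent frame,
# which avoids warnings when a column is overwritten later.
toy_year_only = toy[["Text", "Year"]].copy()
print(list(toy_year_only.columns))
```

Either approach gives the same result; selecting is shorter when the keep-list is the smaller of the two.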


In [12]:
texts_year_only['Year'] = texts_year_only['Year'].astype(int)   # I noticed years are appearing as floats, e.g. "2004.0"
                                                                # so save the 'Year' column over itself, but as an integer
texts_year_only                                                 # check again

Unnamed: 0,Text,Year
0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018
1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018
2,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018
3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018
4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018
...,...,...
38223,"C7 A10 in the aetiology of cystinuria, we could not identify any Affected children may have only one episode of illness or multiple mutation in SL",2004
38224,"C7 A10 in the two families. Nevertheless, there remains recurrences. A common mutation (985A >G) has been identiÜed the possibility that other genes are involved in cystinuria. Further among patie...",2004
38225,"P0845Inactivation of the spasmolytic trefoil peptide (Tff2) leads In this study, two unrelated MCAD patients, compound heterozygous to increased expression of additional gastroprotective factors f...",2004
38226,"P0843MCDR1 Locus - Screening for candidate genes. functional disturbance in stomach and gut, Tff2-/- constructs do N. Udar1, M. Chalukya1, R. Silva-Garcia1, J. Yeh1, P. Wong1,2, K. Small1 not disp...",2004
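
The cast to `int` above succeeded, which suggests there are no missing years in this file. If a 'Year' column did contain missing values, `.astype(int)` would raise an error, because a plain integer column cannot hold NaN. A minimal sketch of that situation on an invented column, using pandas' nullable `'Int64'` dtype as one possible workaround:

```python
import numpy as np
import pandas as pd

# Toy column with one missing year; values invented for illustration.
year_with_gap = pd.Series([2018.0, np.nan, 2004.0])

# astype(int) cannot represent NaN, so pandas raises an error here.
try:
    year_with_gap.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable 'Int64' dtype (capital I) keeps the gap as <NA> instead.
year_nullable = year_with_gap.astype("Int64")
print(cast_failed)
print(year_nullable)
```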


## Divide the texts up into batches by year

Since we want to track the changes in word use over time, we need to divide the texts up into batches, one per year.

In [30]:
years = texts_year_only['Year'].drop_duplicates().sort_values()
years = years.tolist()
years

[2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021]

In [81]:

def batch_by_year(input):
    years = input['Year'].drop_duplicates().sort_values()
    print(years)
    for year in years:
        batch_name = 'batch_' + str(year)       # label for this year's batch, e.g. 'batch_2001'
        batch = input.query('Year == @year')    # the rows belonging to this year
        print(batch_name)
        print(batch)

In [82]:
batch_by_year(texts_year_only)



Unnamed: 0,Text,Year
100,Despite the availability of most of the human genome sequence the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on t...,2001
101,The mechanisms whereby inherited DNA mutations cause disease areonly beginning to be understood. These are best understood in the contextof knowledge of the three dimensional structure of the rele...,2001
102,Missense mutations can knock out residues in proteins that are importantfor binding catalysis or conformational change or have subtle effects onstability or conformation. Because buried residues i...,2001
103,The Genome Project has accelerated the availability of tools to apply phar macogenetics to the development of personalized medicines. This differsfrom the long term [10 15 year] strategy that is r...,2001
104,Predictive genetic testing (PT) for Huntington Disease has now beenoffered for longer than for any other late onset genetic illness. The signifi cant number of persons at risk who have participate...,2001
...,...,...
2431,together 21 alternations. One of the intronic alternations was frame shift Human Chromosome 21 is conserved in mouse Chromosomes 10 16 and alternation due to T insertion. M. nemestrina is the mos...,2001
2432,ulm.de ed in brain only. The corresponding cDNAhas been isolated containing a Sequence analysis has revealed pronounced conservation among the 1683 bp ORF and deriving from at least 13 exons. For...,2001
2433,ing genes 2 Mb and 500 kb either side of SOX9 in human and 68 kb and X linked retinitis pigmentosa 3 (XLRP3) a progressive retinal degenera 94 kb either side of SOX9 in Fugu. Comparative sequence...,2001
2434,growing zebrafish ESTs define sets of evolutionarily conserved genes Human retinitis pigmentosa (RP) is the phenotypic equivalent to canine showing strong similarities throughout entire proteins ...,2001
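
Printing the batches is useful for checking, but later steps will probably need to get at each batch by name. One possible way to do that, sketched here on an invented data frame, is to let `groupby` do the splitting and collect the pieces in a dictionary. The `batches` dictionary and the `toy` frame are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for texts_year_only; values invented for illustration.
toy = pd.DataFrame({
    "Text": ["alpha", "beta", "gamma"],
    "Year": [2001, 2002, 2001],
})

# groupby yields (year, sub-frame) pairs, sorted by year by default.
batches = {f"batch_{year}": group for year, group in toy.groupby("Year")}

print(list(batches))
print(batches["batch_2001"])
```

Each value in `batches` is an ordinary data frame, so it can be passed to any function that expects a 'Text' column.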


In [31]:
from collections import Counter                 # NOTE: these imports and the three definitions below are assumptions;
from nltk.tokenize import word_tokenize         # the function relies on these names but they are not defined
from nltk.corpus import stopwords               # earlier in this notebook (NLTK's tokenizer and stop-word
from nltk.stem import PorterStemmer             # data must already be downloaded)

table_punctuation = str.maketrans('', '', string.punctuation)   # assumed: translation table that deletes punctuation
stop_words = set(stopwords.words('english'))                    # assumed: NLTK's English stop-word list
porter = PorterStemmer()                                        # assumed: NLTK's Porter stemmer

def bag_of_words_analysis(input, how_many):     # define a 'bag of words' function with 2 arguments, an input and a quantity
    holding_string = " ".join(input['Text'])                                   # joins the 'Text' column into one string, with spaces
                                                                               # so words from adjacent texts do not run together
    holding_string = word_tokenize(holding_string)                             # word tokenises that text
    holding_string = [word.lower() for word in holding_string]                 # removes uppercase letters
    holding_string = [w.translate(table_punctuation) for w in holding_string]  # removes punctuation
    holding_string = [token for token in holding_string if token]              # removes any empty strings
    holding_string = [token for token in holding_string if not token.isdigit()]      # removes digits
    holding_string = [token for token in holding_string if token not in stop_words]  # removes stopwords
    holding_string = [porter.stem(token) for token in holding_string]                # stems the word-tokens
    counts = Counter(holding_string)                                     # counts how often each token occurs
    return counts.most_common(how_many)                                  # and returns the tokens with highest counts
                                                                         # up to the quantity specified as an argument
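
To see the shape of what a bag-of-words count returns without needing any extra packages, here is a deliberately simplified sketch that uses only the standard library. It is not the function above: there is no stemming, the tokeniser is a plain regular expression, and `TOY_STOP_WORDS` is a tiny invented list rather than a real stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NOT the list used in this notebook.
TOY_STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def toy_bag_of_words(texts, how_many):
    """Count the most common lower-cased, non-stop-word tokens (no stemming)."""
    joined = " ".join(texts)                         # a space keeps words from adjacent texts apart
    tokens = re.findall(r"[a-z]+", joined.lower())   # letters only: drops digits and punctuation
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    return Counter(tokens).most_common(how_many)     # list of (token, count) pairs

sample = ["The gene is mutated.", "Mutation of the gene", "A gene panel"]
result = toy_bag_of_words(sample, 2)
print(result)
```

Note that without stemming, 'mutated' and 'mutation' are counted as different tokens, which is the gap the stemming step in the full function is there to close.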

## Save the consolidated output as .csv

Having imported, consolidated, tidied and checked everything, I want to save the output in a new .csv file. It is important to use a good name for the file, because bad file names are the bane of my existence.

For simplicity's sake, I will also create a new data frame containing only those rows for which the 'Text' column contains one of the keywords of interest, check its length and save it as a new .csv file with a good name.

In [None]:
type(no_null_texts)                          # Let's just double check what kind of a thing 'no_null_texts' is
                                             # This lets us know what kind of write-out-to-csv function we need.

In [None]:
no_null_texts


In [None]:
no_null_texts.to_csv('..\\output\\all_abstracts_no_null_texts.csv')  # write out the data frame to a .csv, with a useful name
                                                                     # which clarifies this is ALL abstracts with non-null texts
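
A side note on the 'Unnamed: 0' columns seen in the imported file at the start of this notebook: columns with that name are what pandas typically produces when a data frame's row index is written out to .csv and then read back in. The sketch below, on an invented data frame and a temporary file, shows how passing `index=False` to `to_csv` avoids this, and how reading the file straight back in gives a quick check that it saved correctly. The file name `toy_texts.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the real data; values invented for illustration.
toy = pd.DataFrame({"Text": ["alpha", "beta"], "Year": [2001, 2002]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_texts.csv")   # hypothetical file name
    toy.to_csv(out_path, index=False)                   # index=False: do not write the row index
    reloaded = pd.read_csv(out_path)                    # read it straight back in as a check

print(list(reloaded.columns))
print(reloaded.shape == toy.shape)
```

Whether to drop the index this way is a judgement call: if the row numbers are meaningful for later steps, keeping them (and reading the file back with `index_col=0`) is the other option.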

## Manually check the saved .csv files

You may want to go and check that the two files you have created here have been created and saved correctly. You may even want to open them up and have a nosy through them to see what they look like. 

The next notebook picks up where this leaves off, by importing those files and working with them to produce some stats that help explore the research question. 