# Step 1 - Prepare

As this work builds on previous work (in the Person-first/Identity-first repo), the preparation phase is relatively straightforward. Mostly, it involves importing a large .csv file with many columns, of which only two, 'Text' and 'Year', are needed here.

## Get ready

All of my Jupyter notebooks begin with some code cells focussed on downloading/importing necessary packages, loading useful short names, and so forth.

I also like to check the relevant file locations before importing the .csv files to work on. 

In [1]:
%%capture            
                                  # The above capture statement is optional. 
                                  # You can remove this to see the chatter normally produced during import steps. 

import os                         # os is a module for navigating your machine (e.g., file directories).

import pandas as pd               # pandas is necessary for working with data frames - shortening it to pd just saves time. 
pd.set_option('display.max_colwidth', 200)   # some of the files are big so set a big column width. 
import numpy as np                # like pandas, numpy is useful, and np is its usual short name
import statistics                 # gotsta have stats
import csv                        # csv is for importing and working with csv files
import re                         # things we need for RegEx corrections
import string 
import math 

In [2]:
import codecs
from collections import Counter

from autocorrect import Speller   # things we need for spell checking
check = Speller(lang='en')

import nltk                       # get nltk
nltk.download('averaged_perceptron_tagger')        # the POS tagger
nltk.download('stopwords')        # download the corpora first, so that the lookups
nltk.download('punkt')            # below cannot fail on a machine that lacks them
nltk.download('wordnet')
nltk.download('webtext')

from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
                                  # os, pandas, statistics, csv, re, string and math
                                  # were already imported in the first cell

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!


## Import the data

Having got all the packages we need, let's check that the file is where we expect it to be and then import the data by reading it in from .csv.

In [3]:
print(os.listdir("..\\results"))                                  # check 'results' folder is not empty/has correct stuff

['.ipynb_checkpoints', 'texts_by_year.csv']
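
The double-backslash paths used in this notebook only work on Windows. As a minimal sketch, assuming the same folder layout (a `results` folder one level above the notebook), `pathlib` builds the same paths in a way that also works on macOS and Linux. The helper name `describe_folder` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Assumed layout: a 'results' folder one level above the notebook folder.
results_dir = Path("..") / "results"
csv_path = results_dir / "texts_by_year.csv"

def describe_folder(folder):
    """Return the sorted names inside `folder`, or an empty list if it is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir())

print(describe_folder(results_dir))   # what is in the folder, if anything
print(csv_path.exists())              # True only if the .csv is where we expect it
```

Because `pd.read_csv` accepts `Path` objects, `csv_path` could be passed to it directly.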


In [5]:
texts_by_year = pd.read_csv('..\\results\\texts_by_year.csv')    # read in the file
texts_by_year.head(5)                                            # have a look at the top 5 rows

,Unnamed: 0,Title,Session_Code,Author,Affiliations,Text,Year,Authors_and_Affiliations,Email
0,0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",EUR10000,,,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018.0,,
1,1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,EBPL1.1,,,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018.0,,
2,2,implications.Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We s...,EBPL1.2,,,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018.0,,
3,3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,EBPL1.4,,,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018.0,,
4,4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",EBPL1.3,,,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018.0,,


## Clean 

The previous import step shows that there are a lot of columns that are not really relevant to the task. Thus, we want to drop all the columns except for 'Text' and 'Year'. If you scroll sideways in the above step, you can also see that 'Year' is appearing as a float (it has a '.0' at the end) when we want it to appear as an integer (without any decimal or trailing zero).

In [6]:
list(texts_by_year.columns)                             # Get the column names in a list to make it easier to remove the 
                                                        # ones that we don't need to keep

['Unnamed: 0',
 'Title',
 'Session_Code',
 'Author',
 'Affiliations',
 'Text',
 'Year',
 'Authors_and_Affiliations',
 'Email']

In [7]:
texts_year_only = texts_by_year.drop(['Unnamed: 0',     # Copy/paste the list from the previous step here, then delete
                    'Title',                            # from it the columns that we want to KEEP ('Text' and 'Year')
                    'Session_Code', 
                    'Author', 
                    'Affiliations', 
                    'Authors_and_Affiliations', 
                    'Email'], axis=1)
texts_year_only.head(5)                                         # check it again

,Text,Year
0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018.0
1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018.0
2,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018.0
3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018.0
4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018.0
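
An alternative to listing every column to drop is to name only the columns to keep. The sketch below shows that approach on a small made-up data frame (`toy` and its values are illustrative, not the real data); it is an option, not what the notebook itself does.

```python
import pandas as pd

# Toy stand-in for texts_by_year: a few of the real column names, made-up values.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Title": ["a title", "another title"],
    "Text": ["first text", "second text"],
    "Year": [2018.0, 2019.0],
    "Email": [None, None],
})

keep = ["Text", "Year"]             # the only columns this analysis needs
toy_year_only = toy[keep].copy()    # .copy() gives an independent frame to edit later

print(list(toy_year_only.columns))  # ['Text', 'Year']
```

Selecting by name keeps working if extra columns appear in the source file, whereas a drop list would need updating.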


In [8]:
texts_year_only['Year'] = texts_year_only['Year'].astype(int)   # I noticed years are appearing as floats, e.g. "2004.0"
                                                                # so save the 'Year' column over itself, but as an integer
texts_year_only.head(5)                                                 # check again

,Text,Year
0,".-(TenthousandEuros,orequivalentvalueinkind)peryearpercompanyareconsidered â\x80\x9cModestâ\x80\x9d. Contributions above EUR 10 000.- per year are considered â\x80\x9cSignificantâ\x80\x9d. Oral pr...",2018
1,mentioned more than 60/169 reasons. The reasons could The ethics of clinical applications of germline genome becategorisedinto:(i)qualityoflifeofaffectedindividuals; modification:a systematic rev...,2018
2,implications. Asystematicoverviewofreasonsforbeingin Enablinginformed opinionsabout germline editingamong favouroragainstgermlinegenomemodificationismissing. the general public: a pilot study We ...,2018
3,measurement this ratio flipped. The PRISM-IMPACT study: What are the hopes and Discussion/conclusion: Our pilot study demonstrates a expectations of families and health care professionals signifi...,2018
4,"Cancer Centre, Sydney Childrenâ\x80\x99s Hospital, Randwick, NSW, Informedconsentinahumangermlinegeneeditingstudy- Australia, 3Hereditary Cancer Centre, Prince of Wales ethical issues Hospital, S...",2018
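
Note that `.astype(int)` raises an error if any 'Year' value is missing, so it only works here because every row has a year. As a hedged sketch for data where that may not hold, pandas' nullable `Int64` dtype keeps whole-number years while still allowing gaps. The series below is made up for illustration.

```python
import pandas as pd

# Made-up 'Year' values, including a missing one and a non-numeric one.
toy_years = pd.Series([2018.0, 2019.0, None, "n/a"], name="Year")

# Turn anything non-numeric into NaN, then convert to the nullable integer
# dtype, which can hold missing values alongside whole numbers.
clean_years = pd.to_numeric(toy_years, errors="coerce").astype("Int64")

print(clean_years.dtype)          # Int64
print(clean_years.isna().sum())   # how many years could not be read
```

Rows with an unreadable year could then be inspected or removed with `dropna` before batching.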


## Divide the data into batches, one for each year 

The data frame now has only the columns we need and the 'Year' column is correctly typed as integer. However, we want to track how the popularity of certain words changes from one year to the next, meaning we need to make a "bag of words" for each year. Thus, we need to divide our big data frame up into smaller ones that only hold the texts for each year and save them as .csv to make importing easier for later analysis. 

In [12]:
def batch_by_year(df):                                                        # create a function that
    years = df['Year'].drop_duplicates().sort_values().to_list()              # gets a sorted list of all unique year values
    for year in years:                                                        # then iterates over that list to create a temp
        temp_df = df[df.Year == year]                                         # data frame with only the rows matching each year
        temp_df.to_csv('..\\output\\batch_' + str(year) + '.csv')            # saving the year-filtered data as .csv in '..\output'


In [13]:
batch_by_year(texts_year_only)                 # run that newly defined function on our texts
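
The same batching can also be written with `groupby`, which splits the data frame by year in a single pass. This is a sketch under stated assumptions rather than the notebook's method: the function name `batch_by_year_grouped` and the temporary output folder are hypothetical, the toy data is made up, and `index=False` is used so that the saved files do not gain an extra 'Unnamed: 0' column when read back in (the function above does save the index).

```python
import tempfile
from pathlib import Path

import pandas as pd

def batch_by_year_grouped(df, out_dir):
    """Write one .csv per value of df['Year'] into out_dir; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)      # create the folder if needed
    written = []
    for year, group in df.groupby("Year", sort=True):
        path = out_dir / f"batch_{year}.csv"
        group.to_csv(path, index=False)             # leave the row index out of the file
        written.append(path)
    return written

# Toy data and a throwaway folder, so the real '..\output' folder is untouched.
toy = pd.DataFrame({"Text": ["a", "b", "c"], "Year": [2019, 2018, 2019]})
demo_dir = Path(tempfile.mkdtemp())
paths = batch_by_year_grouped(toy, demo_dir)
print([p.name for p in paths])   # ['batch_2018.csv', 'batch_2019.csv']
```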

## Manually check the saved .csv files

You may want to check that the batched files have been created and saved correctly. You may even want to open them up and have a nosy through them to see what they look like.
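
Part of this check can be done in code. The sketch below reads every `batch_*.csv` file in a folder and reports how many rows it holds and which years it contains, so a file holding more than one year stands out. The helper name `check_batches` is hypothetical, and the demo runs on a throwaway folder with made-up data rather than on the real output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def check_batches(folder):
    """Return {file name: (row count, years found)} for each batch_*.csv in folder."""
    report = {}
    for path in sorted(Path(folder).glob("batch_*.csv")):
        batch = pd.read_csv(path)
        years_found = sorted(batch["Year"].unique().tolist())
        report[path.name] = (len(batch), years_found)
    return report

# Demo: write one made-up batch file, then check it.
check_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"Text": ["a", "b"], "Year": [2018, 2018]}).to_csv(check_dir / "batch_2018.csv")
report = check_batches(check_dir)
print(report)   # {'batch_2018.csv': (2, [2018])}
```

To run it on the real files, pass the folder that `batch_by_year` wrote to.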

The next notebook picks up where this leaves off, by importing those saved batched files and working with them to produce some stats that help explore the research question. 