<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pre-processing" data-toc-modified-id="Pre-processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pre-processing</a></span></li></ul></div>

## Import required modules

In [1]:
# !pip install nltk
# !pip install xlrd

In [2]:
import os
# provides functions for interacting with underlying operating system
# e.g. change working directory, locate files

from nltk import word_tokenize
# nltk stands for Natural Language Toolkit and is useful for text mining

import re
# re is for regular expressions, which we use later 

import pandas as pd
# includes useful functions for manipulating data 

import xlrd
# we also need xlrd because pandas relies on it to read legacy .xls Excel files

# Pre-processing

Data preprocessing is the process of converting a raw data set into a clean one. Data collected from different sources usually arrives in a raw format that is not suitable for analysis.

Hence, we follow a series of steps to turn the raw data into a smaller, cleaner data set.

## Read-in data

In [3]:
# Read-in the csv we created in the previous notebook
# We create a variable 'df' and use pd.read_csv(filepath) to convert the csv file into a DataFrame
df = pd.read_csv('Data/text.csv')

In [4]:
# Let's view the first 10 rows of the dataset
df.head(10)
# if no number is given, head() returns the first 5 rows

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...


## Removing + renaming some columns

In [5]:
# We don't need the first column 'Unnamed: 0', as our rows already have a numbered index
df = df.drop(columns = ['Unnamed: 0'])


# Let's rename our remaining columns with something more intelligible 

df.columns = ["Filename", "Text"]

In [6]:
# Let's take a quick look at our dataset to see if the above has worked...

df.head()

Unnamed: 0,Filename,Text
0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
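The extra 'Unnamed: 0' column usually appears when a DataFrame is saved with its row index and then read back in. The sketch below shows one way to avoid it at read time, using a small made-up frame in place of `Data/text.csv` (the toy filenames and texts are assumptions for illustration only). It also uses `rename()` with an explicit mapping, which does not depend on the order of the columns.

```python
import io
import pandas as pd

# Toy stand-in for the real file (hypothetical contents, not the actual dataset)
original = pd.DataFrame({"0": ["a.rtf", "b.rtf"], "1": ["first text", "second text"]})

# to_csv() writes the row index as an unnamed first column by default
csv_text = original.to_csv()

# Reading it back without any options produces an extra 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

# rename() maps old names to new names explicitly
clean = clean.rename(columns={"0": "Filename", "1": "Text"})

print(list(naive.columns))
print(list(clean.columns))
```

Either approach works for this notebook. Dropping the column, as we did above, is fine when you have already loaded the data.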


## Splitting DataFrame

We can clean up our dataset even further by making some smart decisions about how to break it up. By printing the first 50 rows below, we can see where the file type changes. Across the whole dataset there are 3 different types of files:

* diary files - rows 0-39
* focus group files - rows 40-45
* interview files - rows 46-86

Let's go ahead and split this big DataFrame into 2 smaller ones, to make it easier to analyse the diary files separately from the focus group + interview files.
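Splitting by row number works as long as the row order never changes. As an alternative, here is a minimal sketch that splits on the filename instead, assuming every diary file has 'diary' in its name (as the filenames in the output suggest). The toy frame and its texts are made up for illustration.

```python
import pandas as pd

# Toy frame using the filename patterns visible in the notebook output
toy = pd.DataFrame({
    "Filename": ["5407diary02.rtf", "5407fg01.rtf", "5407int02.rtf"],
    "Text": ["diary text", "focus group text", "interview text"],
})

# Boolean mask: True where the filename contains 'diary'
is_diary = toy["Filename"].str.contains("diary")

# .copy() gives each subset its own data, independent of the original frame
diary_subset = toy[is_diary].copy()
other_subset = toy[~is_diary].copy()

print(diary_subset["Filename"].tolist())
print(other_subset["Filename"].tolist())
```

This version keeps working if files are added or the rows are re-sorted, whereas fixed row numbers would need updating.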

In [15]:
df.head(50)

Unnamed: 0,Filename,Text
0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...
6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...
7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...
8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...
9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...


In [25]:
# Here we create a variable 'diary_files' which will contain rows 0-39 of our original DataFrame
diary_files = df.loc[:39]

# .loc - is used to access rows or columns in a DataFrame by their labels
# BEFORE the colon - indicates the label of the first row we want to access
# AFTER the colon - indicates the label of the last row we want to access
# NOTE: if there is no number BEFORE the colon this means access everything up to and including the end value!
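One detail worth checking for yourself: slicing with `.loc` includes the end label, unlike ordinary Python slicing and unlike `.iloc`, which works on positions and excludes the end. The short sketch below compares the two on a made-up six-row frame.

```python
import pandas as pd

# Toy frame with the default index 0..5
toy = pd.DataFrame({"Text": list("abcdef")})

# .loc slices by label and INCLUDES the end label: rows 0, 1, 2, 3
first_rows_loc = toy.loc[:3]

# .iloc slices by position and EXCLUDES the end position: rows 0, 1, 2
first_rows_iloc = toy.iloc[:3]

# Leaving out the end label takes everything from the start label onwards
rest = toy.loc[4:]

print(len(first_rows_loc), len(first_rows_iloc), list(rest.index))
```

This is why `df.loc[:39]` gives 40 rows (0 to 39), and `df.loc[40:]` picks up exactly where it left off.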

In [28]:
diary_files

Unnamed: 0,Filename,Text
0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...
6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...
7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...
8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...
9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...


In [29]:
# Let's go ahead and do the same for our rows containing group + interview files

interview_files = df.loc[40:]

# NOTE: if there is no number AFTER the colon this means access everything after (and including) the start position

In [27]:
interview_files

Unnamed: 0,Filename,Text
40,5407fg01.rtf,\nGroups Discussion with Members of Farmers F...
41,5407fg02.rtf,Groups Discussion with Members of Small Busine...
42,5407fg03.rtf,\n\nGroups Discussion with Members of Agricul...
43,5407fg04.rtf,\nNO AUDIO RECORDING\n\nGroups Discussion with...
44,5407fg05.rtf,\n\nGroups Discussion with Community Group of ...
45,5407fg06.rtf,"\n\nGroup Discussion Panel Members, Group 6 – ..."
46,5407int02.rtf,\nDate of Interview: 14/03/02\n\nInformation a...
47,5407int03.rtf,\nDate of Interview: 08/03/02\n\nInformation a...
48,5407int07.rtf,\nDate of Interview: 14/03/02\n\nInformation a...
49,5407int08.rtf,\nDate of Interview: 06/03/02\n\nInformation a...


## Create columns for Date, Gender, and Occupation

For our research project, it may be important to extract information about the date of the diary entry or interview, and the gender and occupation of the participants. This matters particularly for health data: if you're scraping text to build a dataset on a particular disease (e.g. long covid) and the symptoms people are experiencing, you're going to want some socio-demographic variables alongside it.
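When a field is labelled in the text, a regular expression anchored on that label is often the simplest way to pull it out. The sketch below does this for 'Date of birth:', which appears in the diary text above. The 'Gender:' and 'Occupation:' labels, and the toy texts themselves, are assumptions for illustration, so check how these fields are actually written in the files before relying on the patterns.

```python
import pandas as pd

# Toy texts: 'Date of birth:' mirrors the real files,
# 'Gender:' and 'Occupation:' are hypothetical labels
toy = pd.DataFrame({
    "Text": [
        "Information about diarist\nDate of birth: 1966\nGender: Female\nOccupation: Farmer",
        "Information about diarist\nDate of birth: 1963\nGender: Male",
    ]
})

# The capture group (...) is the part that ends up in the new column
# expand=False returns a Series when there is a single capture group
toy["BirthYear"] = toy["Text"].str.extract(r"Date of birth:\s*(\d{4})", expand=False)
toy["Gender"] = toy["Text"].str.extract(r"Gender:\s*(\w+)", expand=False)

# [^\n]+ captures everything up to the end of the line
toy["Occupation"] = toy["Text"].str.extract(r"Occupation:\s*([^\n]+)", expand=False)

print(toy[["BirthYear", "Gender", "Occupation"]])
```

Rows where the label is missing get a missing value (NaN) rather than raising an error, which makes it easy to spot files that need a closer look.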

In [34]:
diary_files['Date'] = diary_files['Text'].str.extract(r'(\d{1,2}\w+\s+\w+\s+\d{4})')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  diary_files['Date'] = diary_files['Text'].str.extract(r'(\d{1,2}\w+\s+\w+\s+\d{4})')
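The warning above appears because `diary_files` was created as a slice of `df`, so pandas cannot be sure whether we mean to change the slice or the original. One common way to avoid it is to take an explicit copy when creating the subset. Here is a minimal sketch on made-up data (the toy filenames and texts are assumptions), using the same date pattern as the cell above.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical contents)
toy = pd.DataFrame({
    "Filename": ["d1.rtf", "d2.rtf", "i1.rtf"],
    "Text": ["Written on 9th March 2002", "no date here", "interview"],
})

# .copy() makes the subset an independent DataFrame
diary_subset = toy.loc[:1].copy()

# Adding a column to the copy leaves the original frame untouched
diary_subset["Date"] = diary_subset["Text"].str.extract(
    r"(\d{1,2}\w+\s+\w+\s+\d{4})", expand=False
)

print(diary_subset)
```

In this notebook, that would mean writing `diary_files = df.loc[:39].copy()` when the subset is first created.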


In [35]:
diary_files

Unnamed: 0,Filename,Text,Date
0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,
1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,
2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,6th January 2003
3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,9th March 2002
4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,25th February 2002
5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,11th March 2002
6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,18th March 2002
7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,
8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,
9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,2000 for 2001
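Notice that row 9 picked up '2000 for 2001', which is not a date. The pattern `\d{1,2}\w+` accepts any word characters after the first digits, so '2000' passes as a day. A stricter pattern is sketched below. It assumes dates are written as a day (with an optional st/nd/rd/th suffix), a capitalised month name, and a four-digit year, which may not cover every format in the diaries. The sample sentences are made up for illustration.

```python
import pandas as pd

# Made-up sentences covering a real date, a false positive, and a date without a suffix
samples = pd.Series([
    "Diary started 6th January 2003",
    "Accounts for 2000 for 2001 were late",
    "Written on 1 March 2002",
])

# The pattern used in the notebook
loose = r"(\d{1,2}\w+\s+\w+\s+\d{4})"

# Stricter: optional ordinal suffix, then a capitalised word, then a 4-digit year
strict = r"(\d{1,2}(?:st|nd|rd|th)?\s+[A-Z][a-z]+\s+\d{4})"

loose_dates = samples.str.extract(loose, expand=False)
strict_dates = samples.str.extract(strict, expand=False)

print(pd.DataFrame({"loose": loose_dates, "strict": strict_dates}))
```

The stricter pattern still does not check that the capitalised word is really a month, so it is worth eyeballing the results. The rows that came back empty may also hold dates in a different format, such as 14/03/02.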
