
# Pre-processing

Data preprocessing is the process of converting a raw data set into a clean one. Data collected from different sources usually arrives in a raw format that is not suitable for analysis.

Hence, certain steps are followed in order to convert the data into a smaller, cleaner data set.

## Import required modules

In [1]:
# !pip install nltk
# !pip install xlrd

In [2]:
import os
# provides functions for interacting with underlying operating system
# e.g. change working directory, locate files

from nltk import word_tokenize
# nltk stands for Natural Language Toolkit and is useful for text-mining

import re
# re is for regular expressions, which we use later

import pandas as pd
# includes useful functions for manipulating data

import xlrd
# we also need xlrd because pandas relies on it to read the older .xls format

## Retrieval 

The first step in text-mining, or any form of data-mining, is to retrieve the data set that you will be working with. Within text-mining or any language analysis context, one dataset is usually referred to as a 'corpus', whilst multiple datasets are referred to as 'corpora'.

For text-mining a corpus could be:

* a set of tweets
* the full text of an 18th century novel
* the contents of a page in the dictionary
* minutes of local council meetings
* random gibberish letters and numbers, or
* just about anything else in text format

For the purposes of this notebook, we will be retrieving the .csv file that we created in the 'Read_in_data.ipynb' notebook.

### Locate the text.csv file

That is, check that it exists and is where you think it is.

In [37]:
# List all the files in the "Data" folder

for file in os.listdir("Data"):
    # We use an if statement to exclude '.DS_Store' files
    if file != '.DS_Store':
        print("One of the files in Data is...", file)

One of the files in Data is... text.csv
One of the files in Data is... foot_mouth_info.xls
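
If you only care about one specific file, a direct existence check is an alternative to listing the whole folder. The sketch below is an illustration rather than part of the original workflow: it creates a throwaway folder with `tempfile` so that it runs anywhere, and `data_dir` and `not_here.csv` are hypothetical stand-ins.

```python
import os
import tempfile

# Create a temporary folder that stands in for "Data" (hypothetical setup;
# in the notebook you would point at the real "Data" folder instead).
with tempfile.TemporaryDirectory() as data_dir:
    csv_path = os.path.join(data_dir, "text.csv")
    with open(csv_path, "w") as handle:
        handle.write("0,1\n")

    # os.path.isfile is True only if the path exists and is a regular file
    csv_exists = os.path.isfile(csv_path)
    missing_exists = os.path.isfile(os.path.join(data_dir, "not_here.csv"))

print("text.csv found:", csv_exists)
print("not_here.csv found:", missing_exists)
```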


Great! The file we want to use is available, so now we need to load that .csv file as a Python object.

### Load in text.csv file

We can do this by using the pandas library to convert our .csv file into a DataFrame, which is a structure that holds 2D data together with its row and column labels. Basically, we convert our text.csv into an Excel-like object that can be manipulated with pandas functions.

In [38]:
# Read-in the csv we created in the previous notebook
# We create a variable 'df' and use pd.read_csv(filepath) to convert the csv file into a DataFrame
df = pd.read_csv('Data/text.csv')

In [39]:
# Let's view the first 10 rows of the dataset
df.head(10)
# the default of head() is to print the first 5 rows

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...


We can see above that we have 3 columns:

* file number
* file name
* file contents - i.e., the text contained in each file
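
The file number column is named 'Unnamed: 0', which usually means a row index was written into the .csv when it was saved. That is an assumption about how text.csv was created. If it holds, `index_col=0` is one way to read that column back as the index instead of as data. The sketch uses a made-up two-row CSV held in memory, not the real file.

```python
import io
import pandas as pd

# A tiny stand-in for text.csv (made-up contents, assumed layout: a saved
# index column with an empty header, then columns "0" and "1").
csv_text = ",0,1\n0,a.rtf,first text\n1,b.rtf,second text\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# index_col=0 tells pandas to use the first column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```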

Before we go any further let's use the built-in 'type' function to check that our df variable is a pandas DataFrame.

In [40]:
type(df)

pandas.core.frame.DataFrame

Congratulations! We are done with the retrieval portion of this process. The rest won't be quite so straightforward because next up... Accessing! This allows us to get individual rows, columns and/or cells to inspect, change, label, or split them.

## Accessing rows, columns and cells

In order to work with the contents of a DataFrame, we need to be able to access only some of it at a time. To do that, we call the DataFrame and tell it which parts to return to us. We can save what is returned as a variable, print it, or write it to a .csv file. For now, let's just see it on the screen.

### Accessing rows by index

Let's start by accessing just rows.

This approach uses the "index", "index location", or "index position", and works on the principle of counting. If you reorder your rows, it will affect what is returned.

In [42]:
# This is how you access a single row in a DataFrame
df[:1]
# NOTE: there is no comma anywhere inside the square brackets
# The importance of that becomes clear later...

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...


In [46]:
df[:50]
# We can see that it selects rows 0-49
# In the weird world of computer science we count from 0, so we stop at 49

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...


Have a play around with the cell above and see what happens if you enter different values inside the square brackets. Below are some examples of how you can use the index to access one or more rows.

In [53]:
df.iloc[2]
# Select row by index - returned as a Series, so not very easy to read...
# iloc gets rows (and/or columns) at integer locations.

Unnamed: 0                                                    2
0                                               5407diary07.rtf
1             \n\nInformation about diarist\nDate of birth: ...
Name: 2, dtype: object

In [55]:
df.iloc[[2]]
# Select row by index in READABLE format - note the double brackets!

df.iloc[[2,3,6]]
# Select rows by index list - to do this you NEED the double brackets

Unnamed: 0.1,Unnamed: 0,0,1
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...


In [59]:
# For the following, you can only use single brackets 
# However, they will all have readable format!

df.iloc[1:5]      # Select Rows by Integer Index Range
# Here we are slicing our DataFrame
# df.iloc[start:end]
# A slice returns everything from our start value UP TO, but not including, our end value
# So you'll notice we don't have row 5 here

Unnamed: 0.1,Unnamed: 0,0,1
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...


I'm not going to go through the rest of the examples below fully, but if you want to test out what each line of code does then you can do so in your own time. Remember that if you're working from Jupyter notebook like me, then you'll need a separate code cell for each line of code! The notebook only prints the last line of code in each cell, which is why we only see the results of df.iloc[::2].
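
If separate cells feel cumbersome, one alternative is to wrap each expression in `print()` so that every result is shown. A minimal sketch, using a made-up four-row DataFrame called `demo` rather than the notebook's `df`:

```python
import pandas as pd

# Small made-up DataFrame standing in for df
demo = pd.DataFrame({"Filename": ["a.rtf", "b.rtf", "c.rtf", "d.rtf"],
                     "Text": ["one", "two", "three", "four"]})

# A notebook cell only displays its last expression automatically.
# Storing each result and printing it shows all of them.
first_row = demo.iloc[:1]
last_two = demo.iloc[-2:]
alternate = demo.iloc[::2]

print(first_row)
print(last_two)
print(alternate)
```

The printed tables are plain text, so they look less polished than the notebook's usual rendering of a DataFrame.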

In [57]:
df.iloc[:1]       # Select First Row
df.iloc[:3]       # Select First 3 Rows
df.iloc[-1:]      # Select Last Row
df.iloc[-3:]      # Select Last 3 Rows
df.iloc[::2]      # Selects alternate rows

Unnamed: 0.1,Unnamed: 0,0,1
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...
10,10,5407diary17.rtf,Information about diarist\nDate of birth: 1936...
12,12,5407diary19.rtf,\nInformation about diarist\nDate of birth: 19...
14,14,5407diary22.rtf,\nInformation about diarist\nDate of birth: 19...
16,16,5407diary24.rtf,\nInformation about diarist\nDate of birth: 19...
18,18,5407diary27.rtf,\nInformation about diarist\nDate of birth: 19...
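
To see why reordering rows changes what an index position returns, here is a small sketch with a made-up three-row DataFrame (`demo` is hypothetical, not the notebook's data). It also uses label-based `.loc`, introduced in the next section, for comparison.

```python
import pandas as pd

demo = pd.DataFrame({"Filename": ["c.rtf", "a.rtf", "b.rtf"]})

# Sort by filename; the original index labels (0, 1, 2) travel with their rows
sorted_demo = demo.sort_values("Filename")

# Position 0 is now whichever row sorts first
by_position = sorted_demo.iloc[0]["Filename"]

# Label 0 still points at the row that was originally first
by_label = sorted_demo.loc[0]["Filename"]

print(by_position)  # a.rtf
print(by_label)     # c.rtf
```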


### Accessing rows by label

As well as by index position, you can also access rows by their label. The rows in our data do not have names (in the sense that they are not labelled "Julia" or "row 1" or anything else that is a string), so their labels are simply the numbers 0, 1, 2 and so on. If you work with a DataFrame that has strings as labels, just put the row label inside quotes (like 'Julia').

Again, here are some examples of how to access one or more rows in multiple ways using row labels instead of index. 

NOTE: the index uses ".iloc" and label uses ".loc". This is very easy to forget. 

In [73]:
# Select Rows by Label Index Range
df.loc[1:5]
# Here you'll notice a difference between .iloc and .loc
# When we executed this code before using .iloc we only got rows 1-4

Unnamed: 0.1,Unnamed: 0,0,1
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...
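
The difference in where the slice ends can be checked side by side. This is a minimal sketch on a made-up eight-row DataFrame (`demo` is hypothetical) with the default 0, 1, 2... row labels:

```python
import pandas as pd

demo = pd.DataFrame({"Filename": [f"file{i}.rtf" for i in range(8)]})

by_position = demo.iloc[1:5]   # positions 1, 2, 3, 4 (end excluded)
by_label = demo.loc[1:5]       # labels 1, 2, 3, 4, 5 (end included)

print(list(by_position.index))
print(list(by_label.index))
```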


Because our rows in our DataFrame don't have string index labels, we're not going to see much difference here between using .iloc and .loc. So, to demonstrate its usefulness I'll create a small toy DataFrame below for you to examine in your own time.

In [72]:
toy = pd.DataFrame(list("abcdef"), index=[49, 'Julia', 47, 0, 1, 2]) 
toy

Unnamed: 0,0
49,a
Julia,b
47,c
0,d
1,e
2,f


Remember, if you want to see what each line of code does in Jupyter notebook, you'll need to enter each line of the code below in a separate cell!

In [87]:
# Select row by index label
toy.loc[2]
# Select row by index label as a string
toy.loc['Julia']
# Select rows by index label list
toy.loc[[49, 'Julia', 0]]
# Select rows by label index that uses numbers and strings
toy.loc[49:'Julia']
# Select alternate rows with index labels
toy.loc[49:2:2]

Unnamed: 0,0
49,a
47,c
1,e


**To sum up the differences:**

* `.loc` selects by label, which can be a number or a string, whereas `.iloc` selects by integer position only
* `.loc` slices include the end label, whereas `.iloc` slices stop just before the end position
* `.loc` will give an error if you ask for a label that isn't there, but an `.iloc` slice that runs past the end will just return everything up until the last row
> e.g. if you try `df.iloc[:100]` but only have 50 rows, it will just give you all 50 rows
* When using `.loc` you can select index ranges and lists via combinations of numbers and strings!
* With `.loc` you need to know the labels of rows further down, but with `.iloc` you can just use negative indexing - great if you don't know how many rows your DataFrame has!

Aside from this, the syntax is exactly the same!
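
A few of these differences can be checked directly on the toy DataFrame from earlier. The label 'Romeo' is a made-up example of a label that is not in the index.

```python
import pandas as pd

toy = pd.DataFrame(list("abcdef"), index=[49, "Julia", 47, 0, 1, 2])

# .loc raises KeyError for a label that is not in the index
try:
    toy.loc["Romeo"]
    missing_label_error = False
except KeyError:
    missing_label_error = True

# An .iloc slice that runs past the end just stops at the last row
oversized_slice = toy.iloc[:100]

# A single out-of-range position is different: it raises IndexError
try:
    toy.iloc[100]
    out_of_range_error = False
except IndexError:
    out_of_range_error = True

# Negative positions count from the end, so -1 is the last row
last_row_label = toy.iloc[[-1]].index[0]

print(missing_label_error, len(oversized_slice), out_of_range_error, last_row_label)
```

Note that the forgiving behaviour of `.iloc` applies to slices only. Asking for one position that does not exist is still an error.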

### Accessing columns by index and label

How about accessing columns in a dataframe?

It works a lot like accessing rows, with .iloc (the index position) and .loc (the label). 

BUT! Now you need a comma inside the square brackets:

* BEFORE the comma (or everything if there is no comma) determines what rows to return, and
* AFTER the comma determines the columns to return.


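ALSO! With `.loc`, the column label you ask for has to match the type of the actual label. Our column labels came from the header row of a .csv file, so they are the strings '0' and '1' rather than the numbers 0 and 1: `df.loc[:, '0']` works, but `df.loc[:, 0]` raises a KeyError.

*So `.loc` is most useful when your rows/columns have meaningful labels - for purely positional access, `.iloc` is simpler.*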
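
Whether `.loc` accepts a number as a column label depends on the type of the labels themselves. The sketch below uses a made-up one-row CSV held in memory, not the notebook's data. It shows that labels read from a CSV header are strings, and that integer labels do work when the columns really are labelled with integers.

```python
import io
import pandas as pd

# Column labels read from a CSV header are strings, even if they look numeric
from_csv = pd.read_csv(io.StringIO("0,1\na.rtf,some text\n"))
print(from_csv.columns.tolist())       # ['0', '1']

names = from_csv.loc[:, "0"]           # works: '0' is an existing label

try:
    from_csv.loc[:, 0]                 # integer 0 is not a label here
    int_label_failed = False
except KeyError:
    int_label_failed = True

# When the column labels really are integers, .loc accepts integers
int_cols = pd.DataFrame([["a.rtf", "some text"]])
first_col = int_cols.loc[:, 0]

print(int_label_failed, first_col.tolist())
```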

In [95]:
# Selecting Columns by Index
# df.iloc[rows to return, columns to return]

df.iloc[:,1]
# So here, we're returning all rows from column 1 which contains our file names

0     5407diary02.rtf
1     5407diary03.rtf
2     5407diary07.rtf
3     5407diary08.rtf
4     5407diary09.rtf
           ...       
82      5407int49.rtf
83      5407int52.rtf
84      5407int53.rtf
85      5407int54.rtf
86      5407int55.rtf
Name: 0, Length: 87, dtype: object
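
Combining a row selector and a column selector narrows things down to a block of cells, or to a single cell. A minimal sketch, assuming a made-up three-row DataFrame (`demo`) whose columns are named "Filename" and "Text" for readability:

```python
import pandas as pd

demo = pd.DataFrame({"Filename": ["a.rtf", "b.rtf", "c.rtf"],
                     "Text": ["one", "two", "three"]})

# Rows at positions 0-1, column at position 1 (end of the row slice is excluded)
subset = demo.iloc[0:2, 1]

# A single cell by position: row 2, column 0
cell_by_position = demo.iloc[2, 0]

# The same cell by label: row label 2, column label "Filename"
cell_by_label = demo.loc[2, "Filename"]

print(subset.tolist(), cell_by_position, cell_by_label)
```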

### Removing + renaming some columns

In [11]:
# We don't need the first column 'Unnamed: 0', as our rows already have a numbered index
df = df.drop(columns=['Unnamed: 0'])


# Let's rename our remaining columns with something more intelligible 

df.columns = ["Filename", "Text"]

In [14]:
# Let's take a quick look at our dataset to see if the above has worked...

df.head()

Unnamed: 0,Filename,Text
0,5407int26.rtf,\nDate of Interview: 01/03/02\n\nInformation a...
1,5407int32.rtf,\nDate of Interview: 04/02/02\n\nInformation a...
2,5407diary43.rtf,\nInformation about diarist\nDate of birth: 19...
3,5407diary42.rtf,\nInformation about diarist\nDate of birth: 19...
4,5407int27.rtf,\nDate of Interview: 04/02/02\n\nInformation a...
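
Assigning a list to `df.columns` relies on the columns being in the order you expect. As an alternative sketch, `rename` with a mapping ties each new name to a specific old label. The three-column `demo` DataFrame below is made up to mirror the layout seen earlier.

```python
import pandas as pd

demo = pd.DataFrame({"Unnamed: 0": [0, 1],
                     "0": ["a.rtf", "b.rtf"],
                     "1": ["one", "two"]})

# Drop the redundant column, then rename by mapping old label -> new label.
# Unlike assigning to df.columns, a mapping does not depend on column order.
renamed = (
    demo.drop(columns=["Unnamed: 0"])
        .rename(columns={"0": "Filename", "1": "Text"})
)

print(list(renamed.columns))
```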


### Splitting DataFrame


In [15]:
df.head(50)

Unnamed: 0,Filename,Text
0,5407int26.rtf,\nDate of Interview: 01/03/02\n\nInformation a...
1,5407int32.rtf,\nDate of Interview: 04/02/02\n\nInformation a...
2,5407diary43.rtf,\nInformation about diarist\nDate of birth: 19...
3,5407diary42.rtf,\nInformation about diarist\nDate of birth: 19...
4,5407int27.rtf,\nDate of Interview: 04/02/02\n\nInformation a...
5,5407int19.rtf,\nDate of Interview: 23/02/02\n\nInformation a...
6,5407int31.rtf,\nDate of Interview: 04/02/02\n\nInformation a...
7,5407diary40.rtf,\nInformation about diarist\nDate of birth: 19...
8,5407diary54.rtf,\nInformation about diarist\nDate of birth: 19...
9,5407diary55.rtf,\nInformation about diarist\nDate of birth: 19...
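
The notebook does not say here how the DataFrame is to be split. One plausible approach, assuming the goal is to separate diaries from interviews based on the filename, is a boolean mask. This is only a sketch under that assumption: `demo` uses three filenames from the output above, with placeholder text.

```python
import pandas as pd

demo = pd.DataFrame({"Filename": ["5407int26.rtf", "5407diary43.rtf", "5407int27.rtf"],
                     "Text": ["placeholder", "placeholder", "placeholder"]})

# Boolean mask: True where the filename contains "diary"
is_diary = demo["Filename"].str.contains("diary")

# Rows where the mask is True, and rows where it is False (~ inverts the mask)
diaries = demo[is_diary]
interviews = demo[~is_diary]

print(len(diaries), len(interviews))
```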
