![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics in R

Welcome to the UK Data Service training series on New Forms of Data for Social Science Research. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

To access training materials for the entire series: [Training Materials]

To keep up to date with upcoming and past training events: [Events]

To get in contact with feedback, ideas or to seek assistance: [Help]

Dr Julia Kasmire
UK Data Service
University of Manchester
November 2020

Table of Contents
1  Introduction
2  Retrieval
3  Processing
3.1  Tokenisation
3.2  Standardising
3.2.1  Remove uppercase letters
3.2.2  Spelling correction
3.2.3  RegEx replacements
3.3  Removing irrelevancies
3.3.1  Remove punctuation
3.3.2  Stopwords
3.4  Consolidation
3.4.1  Stemming words
3.4.2  Lemmatisation
4  Conclusions
4.1  Further reading

A table of contents is provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars; if unsure, hover your mouse over the buttons).

## Introduction

This is the first in a series of Jupyter notebooks on text-mining that cover basic preparation processes, common natural language processing tasks, and some more advanced natural language tasks. These interactive code-along notebooks use R as a programming language, and introduce various packages related to text-mining and text processing. Most of those tasks could be done with other packages, so please be aware that the options demonstrated here are not the only way, or even the best way, to accomplish a text-mining task.

For more information on what Jupyter notebooks are or how to interact with them, follow [THESE LINKS].

## Retrieval

The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora'. 'Corpus' is a Latin-root word and therefore has a funny plural.

For text-mining, a corpus can be:
- a set of tweets, 
- the full text of an 18th century novel,
- the contents of a page in the dictionary, 
- minutes of local council meetings, 
- random gibberish letters and numbers, or
- just about anything else in text format. 


Retrieval is a very important step, but it is not the focus of this particular training series. If you are interested in creating a corpus from internet data, then you may want to check out the previous NFoD training series that cover web-scraping (available as recordings of webinars or as a code-along Jupyter notebook like this one) and APIs (also as a recording or Jupyter notebook). Both of these demonstrate and discuss ways to get data from the internet that you could use to build a corpus.

Instead, for the purposes of this session, we will assume that you already have a corpus to analyse. This is easy for us to assume, because we have provided a sample text file that we can use as a corpus for these exercises. 

First, let's check that it is there. To do that, click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key.

For the rest of this notebook, I will use 'Run/Shift+Enter' as shorthand for 'click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key while hitting the 'Enter' key'.


In [1]:
# It is good practice to always start by loading the packages you will need.

library(dplyr)

print("1. Successfully loaded the necessary packages")    # The print statement is just a bit of encouragement!


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

[1] "1. Successfully loaded the necessary packages"


In [2]:
print(list.files(path = "./data"))      # list every file in the "data" folder

print(list.files(path = "./"))          # list every file in the current working directory

character(0)
 [1] "$RECYCLE.BIN" "Desktop"      "desktop.ini"  "Documents"    "Downloads"   
 [6] "Favorites"    "My Music"     "My Pictures"  "My Videos"    "R"           


_______________________________________________________________________________________________________________________________

Great! We have loaded a useful package and checked that we have access to the sample_text file.
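If you would rather check for one specific file than read through a folder listing, base R's `file.exists()` returns `TRUE` or `FALSE` for a given path. The sketch below assumes the sample file is saved as `./data/sample_text.txt`; the names `sample_path` and `sample_found` are purely illustrative.

```r
# Assumption: the sample file is stored at this path, relative to the working directory.
sample_path <- "./data/sample_text.txt"

# file.exists() returns TRUE if the path points at an existing file, FALSE otherwise.
sample_found <- file.exists(sample_path)

if (sample_found) {
  print("Found the sample text file")
} else {
  # getwd() shows where R is currently looking, which helps when a path is wrong.
  print(paste("Could not find", sample_path, "- the working directory is", getwd()))
}
```

If the file is not found, the message reports the current working directory, which is usually the quickest clue as to why a relative path does not resolve.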

Now we need to load that sample_text file into a variable that we can work with in R. Time to Run/Shift+Enter again!

In [None]:
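# Read the "sample_text" file and store its contents in a variable called "corpus"
# readLines() returns one string per line, so paste() joins them back into a single string
corpus <- paste(readLines("./data/sample_text.txt", warn = FALSE), collapse = "\n")

cat(corpus)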

_______________________________________________________________________________________________________________________________
Hmm. Not excellent literature, but it will do for our purposes. 

A quick look tells us that there are capital letters, contractions, punctuation, numbers as digits, numbers written out, abbreviations, and other variations that we, as humans, know are equivalent but that computers treat as different.
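To see why this matters, here is a small sketch. The example strings are made up for illustration and are not taken from the sample text. R compares strings character by character, so versions of a word that a human reader would treat as the same are different values until we standardise them.

```r
# Made-up examples, not taken from the sample text

# A capital letter is enough to make two strings different
same_raw <- "The" == "the"

# Converting to lowercase removes that difference
same_lower <- tolower("The") == "the"

# A digit and a written-out number do not match either
same_number <- "2" == "two"

print(c(same_raw, same_lower, same_number))
```

Only the second comparison is `TRUE`, because it is the only one where the text was standardised before comparing.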

Before we go further, it helps to know what kind of variable corpus is. Run/Shift+Enter the next code block to find out!

In [None]:
class(corpus)     # the kind of variable that "corpus" is
length(corpus)    # how many separate strings it holds

_______________________________________________________________________________________________________________________________
This tells us that 'corpus' is one very long string of text characters.  

Congratulations! We are done with the retrieval portion of this process. The rest won't be quite so straightforward because next up... Processing.

Processing is about cleaning, correcting, standardising and formatting the raw data returned from the retrieval process.
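As a rough preview of what that involves, the sketch below applies three common processing steps (tokenising, lowercasing and removing punctuation) to a made-up sentence using only base R functions. The sentence and the variable names are invented for illustration, and this is just one simple way to do each step, not necessarily the approach this notebook goes on to use.

```r
# A made-up sentence standing in for a raw corpus
raw_text <- "Hello, World! Mining text is FUN."

# Tokenise: split the string into pieces wherever there is whitespace
tokens <- strsplit(raw_text, "\\s+")[[1]]

# Standardise: convert every letter to lowercase
tokens <- tolower(tokens)

# Remove irrelevancies: delete punctuation characters from each token
tokens <- gsub("[[:punct:]]", "", tokens)

print(tokens)
```

Each step takes the output of the previous one, so the order matters: splitting on whitespace first leaves punctuation attached to the words, which the final step then strips out.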

## Processing

### Tokenisation

### Standardising

#### Remove uppercase letters

#### Spelling correction

#### RegEx replacements

### Removing irrelevancies

#### Remove punctuation

#### Stopwords

### Consolidation

#### Stemming words

#### Lemmatisation

## Conclusions

### Further Reading