<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-required-modules" data-toc-modified-id="Import-required-modules-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import required modules</a></span></li><li><span><a href="#Read-in-data" data-toc-modified-id="Read-in-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read-in data</a></span></li><li><span><a href="#Pre-processing" data-toc-modified-id="Pre-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pre-processing</a></span></li><li><span><a href="#Removing-+-renaming-some-columns" data-toc-modified-id="Removing-+-renaming-some-columns-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Removing + renaming some columns</a></span></li><li><span><a href="#Processing" data-toc-modified-id="Processing-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Processing</a></span><ul class="toc-item"><li><span><a href="#Tokenisation" data-toc-modified-id="Tokenisation-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Tokenisation</a></span></li></ul></li></ul></div>

## Import required modules

In [1]:
# !pip install nltk
# !pip install xlrd
# !pip install autocorrect

In [2]:
import os
# provides functions for interacting with underlying operating system
# e.g. change working directory, locate files

import nltk
from nltk import word_tokenize
# nltk stands for Natural Language Toolkit and is useful for text mining

import re
# re is for regular expressions, which we use later 

import pandas as pd
# includes useful functions for manipulating data 

import xlrd
# pandas relies on xlrd to read old-style .xls files

import autocorrect
# provides functions for spell check

In [3]:
os.getcwd()

'/Users/loucap/Documents/GitWork/Text-Mining-Health/Python_code'

## Read-in data

In [4]:
# Read-in the csv we created in the previous notebook
# We create a variable 'df' and use pd.read_csv(filepath) to convert the csv file into a DataFrame
df = pd.read_csv('Data/Womens_dataset.csv')


In [5]:
# Let's view the first 5 rows of the dataset
df.head(5)
# the default of head() is to print the first 5 rows

Unnamed: 0.1,Unnamed: 0,Headlines,Source,Author,Date,Link,Content,Quotes
0,0,Women’s Super League: talking points from the ...,The Guardian,"Suzanne Wrack, Sophie Downey and Sarah Rendell",2022-11-21,https://www.theguardian.com/football/2022/nov/...,Arsenal’s winning run was ended in dramatic st...,[]
1,1,Moving the Goalposts | ‘We want to keep dreami...,The Guardian,Sophie Downey,2022-12-07,https://www.theguardian.com/football/2022/dec/...,Club has soared since acquiring the Serie A li...,[]
2,2,Leicester’s Ashleigh Plumptre: ‘I love everyth...,The Guardian,Ella Braidwood,2022-11-17,https://www.theguardian.com/football/2022/nov/...,The defender on playing for the club in her he...,[]
3,3,Man City win keeps pressure on WSL top three,BBC News,,2022-12-04,https://www.bbc.co.uk/sport/football/63771047,Last updated on 4 December 20224 December 2022...,[]
4,4,Shaw stars for Man City in WSL win at Everton,BBC News,,2022-11-19,https://www.bbc.co.uk/sport/football/63606404,Last updated on 19 November 202219 November 20...,"['focal point', 'She does so much more than sc..."


## Pre-processing

Data pre-processing converts a raw data set into a clean one. Data collected from different sources usually arrives in a raw format that is not suitable for analysis.

Hence, we follow a series of steps to convert the raw data into a smaller, cleaner data set.

## Removing + renaming some columns

In [6]:
# We don't need the 'Unnamed: 0' column, as our rows already have a numbered index
df = df.drop(columns=['Unnamed: 0'])

df.head()

Unnamed: 0,Headlines,Source,Author,Date,Link,Content,Quotes
0,Women’s Super League: talking points from the ...,The Guardian,"Suzanne Wrack, Sophie Downey and Sarah Rendell",2022-11-21,https://www.theguardian.com/football/2022/nov/...,Arsenal’s winning run was ended in dramatic st...,[]
1,Moving the Goalposts | ‘We want to keep dreami...,The Guardian,Sophie Downey,2022-12-07,https://www.theguardian.com/football/2022/dec/...,Club has soared since acquiring the Serie A li...,[]
2,Leicester’s Ashleigh Plumptre: ‘I love everyth...,The Guardian,Ella Braidwood,2022-11-17,https://www.theguardian.com/football/2022/nov/...,The defender on playing for the club in her he...,[]
3,Man City win keeps pressure on WSL top three,BBC News,,2022-12-04,https://www.bbc.co.uk/sport/football/63771047,Last updated on 4 December 20224 December 2022...,[]
4,Shaw stars for Man City in WSL win at Everton,BBC News,,2022-11-19,https://www.bbc.co.uk/sport/football/63606404,Last updated on 19 November 202219 November 20...,"['focal point', 'She does so much more than sc..."


## Processing 

This includes the following steps:

* Tokenisation: splitting raw data into various kinds of "short things" that can be statistically analysed
* Standardising: includes converting case, correcting spelling, find-and-replace operations to remove abbreviations, RegEx, etc.
* Removing irrelevancies: includes anything from punctuation to stopwords like 'the' or 'to' that are unhelpful for many kinds of analysis
* Consolidation: includes stemming and/or lemmatisation that strip words back to their 'root'
* Basic NLP: includes tagging, named entity recognition, and chunking.

NOTE: In practice, most text-mining work requires that a corpus undergo multiple steps, but the exact steps and their order depend on the analysis you want to do.

Also, it is good practice to create new variables whenever you manipulate an existing variable rather than write over the original. This means that you keep the original and can go back to it whenever you want to try a different manipulation or correct an error. You will see how this works as we progress through the processing steps.
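To make the first few steps concrete, here is a minimal sketch using only Python's built-in `re` module. The example sentence, the tiny stopword list and the variable names are all made up for illustration, and the regular expression is only a rough stand-in for a proper tokeniser. Note that each step is stored in a new variable, so the earlier versions stay available.

```python
import re

# Hypothetical example sentence and stopword list, for illustration only.
raw_text = "The defender LOVES playing for the club in her hometown!"
stopwords = {"the", "for", "in", "her", "to"}

# Tokenisation: pull out runs of letters, digits and apostrophes.
tokens = re.findall(r"[A-Za-z0-9']+", raw_text)

# Standardising: convert every token to lower case (kept in a new variable).
tokens_lower = [token.lower() for token in tokens]

# Removing irrelevancies: drop the stopwords (again in a new variable).
tokens_no_stop = [token for token in tokens_lower if token not in stopwords]

print(tokens)          # the original tokens are still available
print(tokens_no_stop)  # the cleaned tokens
```

Because `tokens` and `tokens_lower` are never overwritten, we could go back and try a different stopword list without re-running the earlier steps.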




### Tokenisation

Our first step is to cut our 'one big thing' into tokens, or 'lots of little things'. As an example, one project I worked on involved downloading a file with hundreds of recorded chess games, which I then divided into individual text files with one game each. The games had a very standard format, with every game ending with either '1-0', '0-1' or '1/2-1/2'. Thus, I was able to use regular expressions (covered in more detail later) to iterate over the file, selecting everything until it found an instance of '1-0', '0-1' or '1/2-1/2', at which point it would cut what it had selected, write it to a blank file, save it, and start iterating over the original file again.
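The chess example could be sketched roughly as follows. This is not the original project code: the miniature `big_text` string, the pattern and the output file names are assumptions made for illustration, and real game records would need more care (for instance, if a result marker also appears somewhere other than the end of a game).

```python
import re
import tempfile
from pathlib import Path

# Hypothetical miniature of a multi-game file; real game records are longer.
big_text = "1. e4 e5 2. Nf3 Nc6 1-0 1. d4 d5 2. c4 e6 1/2-1/2 1. c4 e5 0-1"

# Lazily match everything up to and including the next result marker.
# '1/2-1/2' is listed first so that a draw is matched as a whole.
game_pattern = re.compile(r".*?(?:1/2-1/2|1-0|0-1)", re.DOTALL)

# Each match is one game; strip the whitespace left between games.
games = [match.group().strip() for match in game_pattern.finditer(big_text)]

# Write each game to its own file in a temporary directory.
out_dir = Path(tempfile.mkdtemp())
for number, game in enumerate(games, start=1):
    (out_dir / f"game_{number}.txt").write_text(game, encoding="utf-8")

print(len(games), "games written to", out_dir)
```

The lazy quantifier `.*?` is what makes the pattern stop at the first result marker it meets rather than running on to the last one in the file.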

Other options that might make more sense with other kinds of files would be to cut and write from the large file to new files after a specified number of lines or characters.
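A minimal sketch of those two options is shown below. The helper names `chunk_lines` and `chunk_chars` are hypothetical, and the sample strings are made up; in practice each chunk would then be written to its own file.

```python
def chunk_lines(text, lines_per_chunk):
    """Split text into pieces of at most lines_per_chunk lines each."""
    lines = text.splitlines()
    return [
        "\n".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]


def chunk_chars(text, chars_per_chunk):
    """Split text into pieces of at most chars_per_chunk characters each."""
    return [
        text[start:start + chars_per_chunk]
        for start in range(0, len(text), chars_per_chunk)
    ]


sample = "line 1\nline 2\nline 3\nline 4\nline 5"
line_chunks = chunk_lines(sample, 2)      # three chunks: 2 + 2 + 1 lines
char_chunks = chunk_chars("abcdefg", 3)   # three chunks: 3 + 3 + 1 characters

print(line_chunks)
print(char_chunks)
```

Cutting on a fixed count is simple but blind to content, so a chunk may end in the middle of a sentence or word.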

Whether you have one big file or many smaller ones, most text-mining work will also want to divide the corpus into what are known as 'tokens'. These 'tokens' are the unit of analysis, which might be chapters, sections, paragraphs, sentences, words, or something else. 

Since our file already has one article on each row, we can skip straight to tokenising that text into sentences and words. Both options are functions available through the nltk package that we imported earlier. These are both useful tokens in their own way, so we will see how to produce both kinds.

We start by dividing the text in each row into words, splitting the string into substrings whenever 'word_tokenize' detects a word.

Let's try that. But this time, let's just have a look at the first 100 things it finds instead of the entire text.
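Before running this on the real data, it can help to see the idea on a single string. The sketch below deliberately uses plain regular expressions rather than NLTK, so it is only a rough approximation: NLTK's tokenisers apply many more rules and will generally give different results. The example text is made up.

```python
import re

# Hypothetical example text, for illustration only.
text = ("Arsenal's winning run was ended. "
        "Man City kept up the pressure! "
        "Who will win the league?")

# Rough sentence split: break after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Rough word split: words (with optional internal apostrophes)
# or single punctuation marks.
words = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(sentences)
print(words[:5])  # slicing lets us peek at the first few tokens only
```

The same slicing trick, for example `words[:100]`, is how we can look at the first 100 tokens instead of printing the entire text.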

In [7]:
df.head()

Unnamed: 0,Headlines,Source,Author,Date,Link,Content,Quotes
0,Women’s Super League: talking points from the ...,The Guardian,"Suzanne Wrack, Sophie Downey and Sarah Rendell",2022-11-21,https://www.theguardian.com/football/2022/nov/...,Arsenal’s winning run was ended in dramatic st...,[]
1,Moving the Goalposts | ‘We want to keep dreami...,The Guardian,Sophie Downey,2022-12-07,https://www.theguardian.com/football/2022/dec/...,Club has soared since acquiring the Serie A li...,[]
2,Leicester’s Ashleigh Plumptre: ‘I love everyth...,The Guardian,Ella Braidwood,2022-11-17,https://www.theguardian.com/football/2022/nov/...,The defender on playing for the club in her he...,[]
3,Man City win keeps pressure on WSL top three,BBC News,,2022-12-04,https://www.bbc.co.uk/sport/football/63771047,Last updated on 4 December 20224 December 2022...,[]
4,Shaw stars for Man City in WSL win at Everton,BBC News,,2022-11-19,https://www.bbc.co.uk/sport/football/63606404,Last updated on 19 November 202219 November 20...,"['focal point', 'She does so much more than sc..."


In [20]:
df.Content.isna().sum()

6

In [23]:
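# fillna returns a new Series, so assign the result back to the column;
# an empty string means missing articles will yield no tokens later on
df['Content'] = df['Content'].fillna('')
df['Content']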

0     Arsenal’s winning run was ended in dramatic st...
1     Club has soared since acquiring the Serie A li...
2     The defender on playing for the club in her he...
3     Last updated on 4 December 20224 December 2022...
4     Last updated on 19 November 202219 November 20...
                            ...                        
76    We use cookies and other tracking technologies...
77    We use cookies and other tracking technologies...
78    We use cookies and other tracking technologies...
79    We use cookies and other tracking technologies...
80    We use cookies and other tracking technologies...
Name: Content, Length: 81, dtype: object

In [25]:
# Create a new column called 'tokenised_words' rather than overwriting 'Content'.
# word_tokenize raises "TypeError: expected string or bytes-like object" when it is
# given a missing value (NaN), so any non-string entry is mapped to an empty list.

df['tokenised_words'] = df.apply(
    lambda row: word_tokenize(row['Content']) if isinstance(row['Content'], str) else [],
    axis=1,
)

# apply - applies a function along an axis of the DataFrame; axis=1 means once per row
# lambda - anonymous function (no name) that can take any number of arguments
# here the lambda receives each row and tokenises the text in its 'Content' column
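A tokeniser that expects a string will fail on missing values, because pandas stores them as NaN, which is a float. One way to guard against this is to wrap the tokeniser in a small function that checks the type first. The sketch below is an assumption-laden illustration: `safe_tokenise` is a hypothetical helper, and its regular expression is a simple stand-in for `nltk.word_tokenize` so that the example runs without NLTK.

```python
import re


def safe_tokenise(value):
    """Return a list of word tokens, or an empty list for non-string input."""
    if not isinstance(value, str):
        # Missing values (NaN, None) and other non-strings produce no tokens.
        return []
    # Stand-in tokeniser: words or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", value)


# Hypothetical column contents: one real string and two missing values.
contents = ["Man City win at Everton.", float("nan"), None]
tokenised = [safe_tokenise(item) for item in contents]

print(tokenised)
```

With a DataFrame, a function like this could be passed to `df['Content'].apply(...)`, which avoids having to fill the missing values beforehand.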