# Lab 1: Tokenization, Stemming and Lemmatization

This lab focuses on tokenizing the input text and then applying the stemming and lemmatization techniques covered in this session.

## Step 1

Read the file(s) from the data folder and load them into memory using plain Python, pandas, or any other Python approach you are comfortable with.

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import string
import unicodedata

import nltk

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.util import ngrams
from nltk import pos_tag
from nltk import RegexpParser

%matplotlib inline

In [5]:
df = pd.read_csv('../data/Corporate-messaging-DFE.csv', sep=',', encoding='latin-1')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3118 entries, 0 to 3117
Data columns (total 11 columns):
_unit_id               3118 non-null int64
_golden                3118 non-null bool
_unit_state            3118 non-null object
_trusted_judgments     3118 non-null int64
_last_judgment_at      2811 non-null object
category               3118 non-null object
category:confidence    3118 non-null float64
category_gold          307 non-null object
id                     3118 non-null float64
screenname             3118 non-null object
text                   3118 non-null object
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 246.7+ KB


In [17]:
df['category'].unique()

array(['Information', 'Action', 'Dialogue', 'Exclude'], dtype=object)

The dataset consists of four categories:
1. Information
2. Action
3. Dialogue
4. Exclude

In [23]:
df['_golden'].unique()

array([False,  True])

In [25]:
df['_unit_state'].unique()

array(['finalized', 'golden'], dtype=object)

## Step 2

At this step, we need to:
- Tokenize the text into individual tokens
- Eliminate tokens we don't care about, such as stopwords, punctuation, and numbers
- Create a clean list of tokens for each document
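
The three bullets above can be sketched as one small function. This is a minimal illustration using only the standard library, not the lab's reference solution: the regex tokenizer, the tiny `STOPWORDS_DEMO` set, and the name `clean_tokens` are assumptions made for this example. In the lab itself, NLTK's `word_tokenize` and `stopwords.words('english')` would take their place.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the lab uses nltk.corpus.stopwords.words('english') instead.
STOPWORDS_DEMO = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "for"}
PUNCTUATION = set(string.punctuation)


def clean_tokens(text, stopwords=STOPWORDS_DEMO):
    """Lowercase, tokenize, then drop stopwords, punctuation and numbers."""
    # Simple regex tokenizer (an assumption): runs of word characters,
    # or single non-space symbols such as '#' or ','.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    cleaned = []
    for token in tokens:
        if token in stopwords:
            continue  # stopword
        if token in PUNCTUATION:
            continue  # single punctuation character
        if token.isdigit():
            continue  # purely numeric token
        cleaned.append(token)
    return cleaned


example = "Barclays announces the result of Rights Issue in 2013, see #results"
print(clean_tokens(example))
```

Passing the stopword collection as a parameter keeps the function easy to reuse with the full NLTK stopword set.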

In [152]:
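# function to remove the trailing http link from a document
def remove_link(sentence):
    '''
    Takes in a sentence.
    
    Returns the lowercased words of the sentence with the trailing link removed.
    '''
    
    parts = sentence.strip().split('http')
    # drop the trailing link only when one is present; otherwise keep the whole text
    kept = parts[:-1] if len(parts) > 1 else parts
    doc1 = [item.lower() for item in kept]
    doc2 = ''.join(doc1).split()
    
    return doc2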


In [153]:
# removing the http from all the sentences 
df['text1'] = df['text'].map(remove_link)
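
Because splitting on `'http'` targets a link at the end of the text, a regular expression is one alternative that removes URLs wherever they appear. The sketch below assumes a "link" is any run of non-space characters starting with `http://` or `https://`; `strip_urls` and `URL_PATTERN` are hypothetical names, not part of the lab.

```python
import re

# Assumed definition of a link: http:// or https:// followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove http(s) URLs and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


print(strip_urls("Read our statement http://t.co/abc123 today"))
```

Unlike the split-based approach, this returns a string rather than a list of words, so it can be applied to the raw text before tokenization.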

In [155]:
stopwords_ = set(stopwords.words('english'))

In [166]:
def filter_stopwords(tokens):
    # keep only the tokens that are not English stopwords
    return [w for w in tokens if w not in stopwords_]

In [167]:
# set of punctuation characters to remove from the tokens
import string
punctuation_ = set(string.punctuation)

In [171]:
# function to tokenize
def cleaner(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens]
    # use the precomputed stopwords_ set instead of reloading the stopword list for every token
    tokens = [token for token in tokens if token not in stopwords_]
    return tokens

In [173]:
df1 = df.copy()

In [174]:
# getting the tokens
df1['tokens'] = df1['text'].map(cleaner)

## Step 3

- Stemming using NLTK : http://www.nltk.org/howto/stem.html
- Lemmatization using NLTK lemmatizer : https://pythonprogramming.net/lemmatizing-nltk-tutorial/

In [177]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [184]:
stemmer_porter = PorterStemmer()

In [204]:
stemmer_porter.stem('drunking')

'drunk'
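
`SnowballStemmer` is imported above but not used. As a hedged illustration of how the two stemmers can differ, the sketch below runs both on a few words chosen for this example (they are not taken from the dataset). Neither stemmer needs an NLTK data download.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Example words picked for illustration only
words = ["running", "generously", "studies"]
for word in words:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
```

On `generously`, for example, Porter gives `gener` while Snowball gives `generous`, which shows that Snowball can be the less aggressive of the two.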

In [199]:
# stemming: apply the Porter stemmer to every token in a document
def stemming(tokens):
    return [stemmer_porter.stem(token) for token in tokens]

df1['tokens_stem'] = df1['tokens'].map(stemming)

In [200]:
# lemmatization

0       [barclay, ceo, stress, import, regulatori, cul...
1       [barclay, announc, result, right, issu, http, ...
2       [barclay, publish, prospectu, å£5.8bn, right, ...
3       [barclay, group, financ, director, chri, luca,...
4       [barclay, announc, iren, mcdermott, brown, app...
5       [barclay, respons, pra, capit, shortfal, exerc...
6       [barclay, sponsor, #, zamynforum, bbc, world, ...
7       [barclay, today, publish, respons, salz, revie...
8       [read, statement, #, barclay, ceo, bonu, award...
9       [59, %, worker, either, look, chang, job, appl...
10      [longer, one, workforc, ., five, ., want, empl...
11      [uk, entrepreneuri, activ, 2013, glanc, -, bar...
12      [emma, turner, ,, head, client, philanthropi, ...
13      [..., visit, us, @, napfnew, 5-7, march, workf...
14      [chill, #, emergingmarket, last, littl, longer...
15      [sinc, 2004, 've, invest, å£37.5m, #, spacesfo...
16      [barclay, servic, execut, :, help, feel, secur...
17      [jaim,

In [None]:
stemmer_porter = PorterStemmer()
# assumes the filtered token lists are the ones stored in df1['tokens'] from Step 2
tokens_stemporter = [list(map(stemmer_porter.stem, sent)) for sent in df1['tokens']]

## Step 4
Stemming and lemmatization using spaCy: https://spacy.io/usage/spacy-101#annotations

Note that spaCy itself provides lemmatization but no built-in stemmer.