### Setup

It may be helpful to run the following commands, downloading nltk and the stopword corpus:

pip install -U nltk
python
import nltk
nltk.download('stopwords')

### Extract

Given the following data:

text = ['Fury said to a mouse, That he met in the house, “Let us both go to law: I will prosecute you.—Come, I’ll take no denial; We must have a trial: For really this morning I’ve nothing to do.” Said the mouse to the cur, “Such a trial, dear Sir, With no jury or judge, would be wasting our breath.” “I’ll be judge, I’ll be jury,” Said cunning old Fury: “I’ll try the whole cause, and condemn you to death.”’', 
	'I was gonna tell a time travelling joke but you guys didn\'t like it', 
	'Did you know that Balluta Buildings (detail pictured), one of the finest Art Nouveau buildings in Malta, was built on the grounds of Villa St Ignatius, one of the island\'s earliest Gothic Revival buildings?']

   * Create a DataFrame, with the column raw_text

### Transform

   * Write a method, preprocess(input_text), which accepts a single string, and returns a list of strings. It should:
       - Lowercase all characters
       - Replace all non-alphabetical characters with one space (e.g. )
       - Create a list of tokens
       - Filter out stop words. An example method for filtering out stopwords is available here
   * Create a new column, tokens, by applying this method to the raw_text column
   * Create a new column num_words, with the number of words in the raw_text

### Load

   * Save the whole dataframe to a file called observations.csv


In [22]:
import re
from nltk.corpus import stopwords
import pandas as pd

In [19]:
def preprocess(input_text):
    # Lowercase, then replace each run of non-alphabetical (non a-z) characters with one space
    text1 = input_text.lower()
    text2 = re.sub(r'[^a-z]+', ' ', text1)
    # split() with no argument drops the empty strings that split(' ') would leave behind
    text_tokens = text2.split()

    # Filter out English stop words
    stopWords = set(stopwords.words('english'))
    text_filtered = []
    for w in text_tokens:
        if w not in stopWords:
            text_filtered.append(w)
    return text_filtered

In [15]:
preprocess('Hi There!  Nice car.')

['hi', 'nice', 'car']

In [16]:
text = ['Fury said to a mouse, That he met in the house, “Let us both go to law: I will prosecute you.—Come, I’ll take no denial; We must have a trial: For really this morning I’ve nothing to do.” Said the mouse to the cur, “Such a trial, dear Sir, With no jury or judge, would be wasting our breath.” “I’ll be judge, I’ll be jury,” Said cunning old Fury: “I’ll try the whole cause, and condemn you to death.”’', 'I was gonna tell a time travelling joke but you guys didn\'t like it', 'Did you know that Balluta Buildings (detail pictured), one of the finest Art Nouveau buildings in Malta, was built on the grounds of Villa St Ignatius, one of the island\'s earliest Gothic Revival buildings?']

In [20]:
preprocess(text[0])

['fury',
 'said',
 'mouse',
 'met',
 'house',
 'let',
 'us',
 'go',
 'law',
 'prosecute',
 'come',
 'take',
 'denial',
 'must',
 'trial',
 'really',
 'morning',
 'nothing',
 'said',
 'mouse',
 'cur',
 'trial',
 'dear',
 'sir',
 'jury',
 'judge',
 'would',
 'wasting',
 'breath',
 'judge',
 'jury',
 'said',
 'cunning',
 'old',
 'fury',
 'try',
 'whole',
 'cause',
 'condemn',
 'death']
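
As a possible variant, the stop word set can be built once and passed in, rather than rebuilt on every call. The sketch below is not the notebook's own function: `preprocess_with` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop word set only stands in for `set(stopwords.words('english'))` so that the snippet runs without the NLTK data. It extracts tokens with `re.findall`, which returns only non-empty matches, so no empty-string tokens arise.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')); assumed for illustration only
DEMO_STOP_WORDS = frozenset({'a', 'i', 'to', 'the', 'there', 'was'})

def preprocess_with(input_text, stop_words=DEMO_STOP_WORDS):
    """Lowercase the text, take runs of ASCII letters as tokens, drop stop words."""
    tokens = re.findall(r'[a-z]+', input_text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess_with('Hi There!  Nice car.'))  # ['hi', 'nice', 'car']
```

With the real corpus, the call would presumably look like `preprocess_with(s, set(stopwords.words('english')))`, with the set created once outside `apply`.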

In [31]:
df_story = pd.DataFrame(text,columns=['raw_text'])
df_story['tokens']=df_story['raw_text'].apply(preprocess)

In [32]:
df_story.head()

|   | raw_text | tokens |
|---|----------|--------|
| 0 | Fury said to a mouse, That he met in the house... | [fury, said, mouse, met, house, let, us, go, l... |
| 1 | I was gonna tell a time travelling joke but yo... | [gonna, tell, time, travelling, joke, guys, like] |
| 2 | Did you know that Balluta Buildings (detail pi... | [know, balluta, buildings, detail, pictured, o... |


In [33]:
df_story.loc[0,'raw_text']

'Fury said to a mouse, That he met in the house, “Let us both go to law: I will prosecute you.—Come, I’ll take no denial; We must have a trial: For really this morning I’ve nothing to do.” Said the mouse to the cur, “Such a trial, dear Sir, With no jury or judge, would be wasting our breath.” “I’ll be judge, I’ll be jury,” Said cunning old Fury: “I’ll try the whole cause, and condemn you to death.”’'

In [34]:
re.findall("[a-zA-Z_]+", df_story.loc[0,'raw_text'])

['Fury',
 'said',
 'to',
 'a',
 'mouse',
 'That',
 'he',
 'met',
 'in',
 'the',
 'house',
 'Let',
 'us',
 'both',
 'go',
 'to',
 'law',
 'I',
 'will',
 'prosecute',
 'you',
 'Come',
 'I',
 'll',
 'take',
 'no',
 'denial',
 'We',
 'must',
 'have',
 'a',
 'trial',
 'For',
 'really',
 'this',
 'morning',
 'I',
 've',
 'nothing',
 'to',
 'do',
 'Said',
 'the',
 'mouse',
 'to',
 'the',
 'cur',
 'Such',
 'a',
 'trial',
 'dear',
 'Sir',
 'With',
 'no',
 'jury',
 'or',
 'judge',
 'would',
 'be',
 'wasting',
 'our',
 'breath',
 'I',
 'll',
 'be',
 'judge',
 'I',
 'll',
 'be',
 'jury',
 'Said',
 'cunning',
 'old',
 'Fury',
 'I',
 'll',
 'try',
 'the',
 'whole',
 'cause',
 'and',
 'condemn',
 'you',
 'to',
 'death']
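
Note that this pattern splits contractions: `I’ll` becomes `I` and `ll` in the list above, so each contraction counts as two words. Whether that is the intended definition of a "word" is a judgment call. The short sketch below compares the notebook's pattern with a whitespace split (an alternative assumption, not part of the original task) on the second sample sentence.

```python
import re

sentence = "I was gonna tell a time travelling joke but you guys didn't like it"

# Pattern used in this notebook: runs of letters or underscores
by_letters = re.findall(r"[a-zA-Z_]+", sentence)
# Alternative assumption: a word is anything separated by whitespace
by_whitespace = sentence.split()

print(len(by_letters))     # 15 ("didn't" counts as "didn" and "t")
print(len(by_whitespace))  # 14
```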

In [37]:
df_story['num_words'] = df_story['raw_text'].apply(lambda x: len(re.findall("[a-zA-Z_]+", x)))

In [38]:
df_story.head()

|   | raw_text | tokens | num_words |
|---|----------|--------|-----------|
| 0 | Fury said to a mouse, That he met in the house... | [fury, said, mouse, met, house, let, us, go, l... | 85 |
| 1 | I was gonna tell a time travelling joke but yo... | [gonna, tell, time, travelling, joke, guys, like] | 15 |
| 2 | Did you know that Balluta Buildings (detail pi... | [know, balluta, buildings, detail, pictured, o... | 35 |
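
The Load step is not shown in the cells above. A minimal sketch of it follows, assuming a DataFrame shaped like `df_story`; `df_demo` and `df_loaded` are hypothetical names, and the one-row frame is a stand-in so the snippet runs on its own. In the notebook itself the call would presumably be `df_story.to_csv('observations.csv', index=False)`.

```python
import ast
import pandas as pd

# Small stand-in for df_story (assumed shape: raw_text, tokens, num_words)
df_demo = pd.DataFrame({
    'raw_text': ["I was gonna tell a time travelling joke but you guys didn't like it"],
    'tokens': [['gonna', 'tell', 'time', 'travelling', 'joke', 'guys', 'like']],
    'num_words': [15],
})

# index=False leaves out the row index, so no "Unnamed: 0" column appears on reload
df_demo.to_csv('observations.csv', index=False)

# Lists are written as their string form; literal_eval turns them back into lists
df_loaded = pd.read_csv('observations.csv', converters={'tokens': ast.literal_eval})
print(df_loaded.loc[0, 'tokens'])
```

CSV has no list type, so the `tokens` column comes back as text unless it is converted on load, as in the last step.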
