I eventually want to do text analysis with the Kickstarter data, but I'll need to do some data cleaning and text preprocessing before I can do so.

In [1]:
import psycopg2
import pandas as pd

import nltk
import re

## Load data
Load the data from the database. The list of columns was found on day44.

In [2]:
dbname = "kick"
tblname = "info"

# Connect to database
conn = psycopg2.connect(dbname=dbname)
cur = conn.cursor()

In [3]:
colnames = ["id", "name", "blurb"]

cur.execute("SELECT {col} FROM {tbl}".format(col=', '.join(colnames), tbl=tblname))
rows = cur.fetchall()

pd.DataFrame(rows, columns=colnames).head()

| | id | name | blurb |
|---|---|---|---|
| 0 | 1312331512 | Otherkin The Animated Series | We have a fully developed 2D animated series t... |
| 1 | 80827270 | Paradigm Spiral - The Animated Series | A sci-fi fantasy 2.5D anime styled series abou... |
| 2 | 737219121 | I'm Sticking With You. | A film created entirely out of paper, visual e... |
| 3 | 1946566454 | A Tale of Faith - An Animated Short Film | A Tale of Faith is an animated short film base... |
| 4 | 591797827 | Honeybee: The Animated Series Trailer | Honeybee is a cartoon about a girl who can tal... |


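I want to combine `name` and `blurb` into a single document. Postgres provides `concat_ws` for this, but its first argument is the separator, so `concat_ws(name, blurb)` treats `name` as the separator and returns only `blurb`, which is what the output below shows. The call that keeps both fields would be `concat_ws(' ', name, blurb)`.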

In [4]:
# Treat name + blurb as 1 document
cur.execute("SELECT id, concat_ws(name, blurb) FROM info")
rows = cur.fetchall()

df = pd.DataFrame(rows, columns=["id", "document"])
df.head()

| | id | document |
|---|---|---|
| 0 | 1312331512 | We have a fully developed 2D animated series t... |
| 1 | 80827270 | A sci-fi fantasy 2.5D anime styled series abou... |
| 2 | 737219121 | A film created entirely out of paper, visual e... |
| 3 | 1946566454 | A Tale of Faith is an animated short film base... |
| 4 | 591797827 | Honeybee is a cartoon about a girl who can tal... |
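
The documents above contain only the blurb because `concat_ws` uses its first argument as the separator. The sketch below imitates that behavior in plain Python to make the effect visible. The `concat_ws` function here is a hypothetical stand-in written for illustration, not the database function, and it assumes SQL NULLs are represented as `None`. The sample `name` and `blurb` strings are shortened from row 1.

```python
def concat_ws(sep, *values):
    """Hypothetical stand-in for Postgres concat_ws: join non-NULL values with sep."""
    return sep.join(str(v) for v in values if v is not None)


name = "Paradigm Spiral - The Animated Series"
blurb = "A sci-fi fantasy 2.5D anime styled series"

# With two arguments, name is consumed as the separator and only blurb survives
only_blurb = concat_ws(name, blurb)

# With an explicit separator, both fields are kept
combined = concat_ws(" ", name, blurb)

print(only_blurb)
print(combined)
```

If the intent is one document per project containing both fields, the explicit-separator form is the one to use in the query.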


In [5]:
# close communication
cur.close()
conn.close()

In [6]:
# Number of documents
df.shape

(177140, 2)

## Text processing for 1 document

In [7]:
text = df["document"][1]
text

'A sci-fi fantasy 2.5D anime styled series about some guys trying to save the world, probably...'

### To lower case

In [8]:
text = text.lower()
text

'a sci-fi fantasy 2.5d anime styled series about some guys trying to save the world, probably...'

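### Bag of words & tokenization
The regex deletes every character that is not a letter, underscore, or space, so digits and punctuation are removed: `sci-fi` becomes `scifi` and `2.5d` becomes `d`.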

In [9]:
words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))
words

['a',
 'scifi',
 'fantasy',
 'd',
 'anime',
 'styled',
 'series',
 'about',
 'some',
 'guys',
 'trying',
 'to',
 'save',
 'the',
 'world',
 'probably']
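
Deleting non-letters fuses hyphenated words, as `scifi` shows. A possible alternative is to replace each run of non-letters with a space instead. The sketch below compares the two on a shortened version of the sample text. It assumes the text is already lowercased and uses `str.split` in place of `nltk.wordpunct_tokenize` so it runs without NLTK.

```python
import re

text = "a sci-fi fantasy 2.5d anime"

# Deleting non-letters (the approach used in the cell above) fuses "sci-fi"
deleted = re.sub(r"[^a-zA-Z_ ]", "", text).split()

# Replacing each run of non-letters with one space keeps the parts separate
spaced = re.sub(r"[^a-z]+", " ", text).split()

print(deleted)  # ['a', 'scifi', 'fantasy', 'd', 'anime']
print(spaced)   # ['a', 'sci', 'fi', 'fantasy', 'd', 'anime']
```

Which variant is preferable depends on whether `scifi` or the pair `sci`, `fi` is the more useful token for the later analysis.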

### Remove stopwords

Reference: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

In [10]:
from nltk.corpus import stopwords

english_stopwords = stopwords.words("english")

print(len(english_stopwords))
print(english_stopwords)

153
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should',

We have a list of 153 English stopwords.

In [11]:
# Remove stopwords from document
words = [w for w in words if w not in english_stopwords]
words

['scifi',
 'fantasy',
 'anime',
 'styled',
 'series',
 'guys',
 'trying',
 'save',
 'world',
 'probably']
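
Membership tests against a list scan it element by element, which adds up over 177,140 documents. A minimal sketch of the same filter using a set is shown below. The `stop_set` here is a small hypothetical subset of the stopword list, hard-coded so the example runs without the NLTK corpus; in the notebook it would be built from `english_stopwords`.

```python
# Hypothetical subset of the stopword list, hard-coded for illustration
stop_set = frozenset(["a", "about", "some", "to", "the", "d"])

words = ["a", "scifi", "fantasy", "d", "anime", "styled", "series", "about",
         "some", "guys", "trying", "to", "save", "the", "world", "probably"]

# Set membership is a hash lookup, so each check is constant time on average
filtered = [w for w in words if w not in stop_set]

print(filtered)
```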

### Stemming vs Lemmatization
Reference: http://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization

In [12]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

port = PorterStemmer()
wnl = WordNetLemmatizer()

In [13]:
## Stemming
[port.stem(w) for w in words]

['scifi',
 'fantasi',
 'anim',
 'style',
 'seri',
 'guy',
 'tri',
 'save',
 'world',
 'probabl']

In [14]:
## Lemmatizing
[wnl.lemmatize(w) for w in words]

['scifi',
 'fantasy',
 'anime',
 'styled',
 'series',
 'guy',
 'trying',
 'save',
 'world',
 'probably']

### Putting it all together

In [15]:
def text_processing(text, method=None):
    # Lower case
    text = text.lower()
    
    # Remove non-letters &
    # Tokenize    
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))
    
    # Remove stop words (build the set once instead of reloading the list per word)
    stop_set = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_set]
    
    # Stemming vs Lemmatizing vs do nothing
    if method == "stem":
        port = PorterStemmer()
        words = [port.stem(w) for w in words]
    elif method == "lemm":
        wnl = WordNetLemmatizer()
        words = [wnl.lemmatize(w) for w in words]

    return words

In [16]:
text = df["document"][1]

compare = {
    "raw" : text_processing(text),
    "stemming": text_processing(text, method="stem"),
    "lemmatizing": text_processing(text, method="lemm")    
}
pd.DataFrame.from_dict(compare)[["raw", "stemming", "lemmatizing"]]

| | raw | stemming | lemmatizing |
|---|---|---|---|
| 0 | scifi | scifi | scifi |
| 1 | fantasy | fantasi | fantasy |
| 2 | anime | anim | anime |
| 3 | styled | style | styled |
| 4 | series | seri | series |
| 5 | guys | guy | guy |
| 6 | trying | tri | trying |
| 7 | save | save | save |
| 8 | world | world | world |
| 9 | probably | probabl | probably |


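- Some words are untouched by both methods:
    - scifi
    - save
    - world
- Some words are changed only by stemming:
    - fantasy->fantasi
    - anime->anim
    - styled->style
    - series->seri
    - trying->tri
    - probably->probabl
- Stemming and lemmatizing agree on:
    - guys->guy

`WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is given, which is why verb forms such as `trying` and `styled` pass through unchanged.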
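
The grouping above can also be derived programmatically rather than by eye. The sketch below copies the three columns from the comparison table as plain lists and sorts each token into a category. The variable names are illustrative and not part of the notebook.

```python
# Columns copied from the comparison table
raw = ["scifi", "fantasy", "anime", "styled", "series",
       "guys", "trying", "save", "world", "probably"]
stemmed = ["scifi", "fantasi", "anim", "style", "seri",
           "guy", "tri", "save", "world", "probabl"]
lemmatized = ["scifi", "fantasy", "anime", "styled", "series",
              "guy", "trying", "save", "world", "probably"]

untouched, stem_only, lemma_only, both = [], [], [], []
for r, s, l in zip(raw, stemmed, lemmatized):
    if r == s and r == l:
        untouched.append(r)          # neither method changed the token
    elif r != s and r == l:
        stem_only.append((r, s))     # only the stemmer changed it
    elif r == s and r != l:
        lemma_only.append((r, l))    # only the lemmatizer changed it
    else:
        both.append((r, s, l))       # both methods changed it

print(untouched)
print(stem_only)
print(lemma_only)
print(both)
```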

---

(Aside) How does stemming compare for other words?

In [17]:
[port.stem(w) for w in ["trying", "triangle", "triple"]]

['tri', 'triangl', 'tripl']

In [18]:
[port.stem(w) for w in ["series", "serious"]]

['seri', 'seriou']