# Python Text Processing

This notebook contains the code for a standard Python text pre-processing pipeline, followed by a simple LDA topic model fit on the cleaned text.

## Project Setup

This section imports the required libraries and downloads the NLTK data they depend on (`stopwords`, `punkt`, `wordnet`, and `omw-1.4`). `nltk.download` only fetches a package again if the local copy is not up to date.

In [1]:
import pandas as pd
import nltk
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Install nltk data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cheetah/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/cheetah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/cheetah/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/cheetah/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Loading the Text

This section loads the data.

First, the data is read into a pandas data frame from a JSON Lines file (one JSON object per line).

The reviews are read into a column of lists called `reviews`. This column is exploded so that each review in the list becomes its own row. The `num_reviews` column is then dropped.

In [2]:
reviews = pd.read_json('../../data/raw/appsearch_reviews/appsearch_reviews.txt', lines=True)
reviews = reviews.explode("reviews").reset_index(drop=True).drop("num_reviews", axis=1)
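
To make the explode step concrete, here is a small sketch on made-up data. The app ids and review strings are hypothetical; only the column names `reviews`, `num_reviews`, and `app_id` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the raw file: one row per app, reviews stored as a list.
raw = pd.DataFrame({
    "num_reviews": [2, 1],
    "reviews": [["Great game", "Too many ads"], ["Works fine"]],
    "app_id": ["com.example.one", "com.example.two"],  # hypothetical ids
})

# explode() emits one row per list element and repeats the other columns.
flat = raw.explode("reviews").reset_index(drop=True).drop("num_reviews", axis=1)
print(flat)
```

Two input rows become three output rows, and `app_id` is repeated for every review that belonged to the same app.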

## Exploring the Data

This section does a quick exploration of the data for major issues. Things to check:

* Missing records
* Very short and very long records

First, start with a preview of the dataframe records and a count of total records.

In [3]:
display(reviews.head())

print(f'--------------------------\nNumber of records: {len(reviews)}')

Unnamed: 0,reviews,app_id
0,This game is very good! Kudos to the developer...,com.nut.man
1,Terrific just get rid of ads and it will be th...,com.nut.man
2,I CAN'T STOP TAPPING. This game is too addict...,com.nut.man
3,. The game itself is really fun but the way t...,com.nut.man
4,ADS GALORE!. I was bored 1 day then came acro...,com.nut.man


--------------------------
Number of records: 1388668
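
The missing-record and length checks listed at the start of this section are not run in the notebook. A minimal sketch of how they might look is below. It uses a toy data frame named `sample` in place of `reviews`, and the length thresholds are arbitrary assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the exploded `reviews` data frame.
sample = pd.DataFrame({
    "reviews": ["Great game", None, ". Engaging", "x" * 5000],
    "app_id": ["a", "a", "b", "b"],
})

# Missing records: count nulls per column.
missing = sample.isna().sum()

# Record lengths in characters; missing reviews are treated as length 0.
lengths = sample["reviews"].fillna("").str.len()

# Hypothetical thresholds, not taken from the notebook.
too_short = sample[lengths < 5]
too_long = sample[lengths > 2000]

print(missing)
print(lengths.describe())
print(f"short: {len(too_short)}, long: {len(too_long)}")
```

On the real data, `lengths.describe()` would be a reasonable starting point for choosing thresholds.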


## Pre-process the Text

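This section goes through basic text cleaning steps, in the order the code applies them.

* Normalize the text (make all text lowercase)
* Remove punctuation
* Remove leading and trailing whitespace
* Tokenize the words
* Remove common stop words
* Lemmatize the words
* Join the tokens back into strings and vectorize them

Numbers are not removed, and the words are lemmatized rather than stemmed.

The basic stop words come from the `nltk` package's English `stopwords` list. `CountVectorizer(stop_words='english')` then applies scikit-learn's own English stop word list on top of that.

Special characters are removed using `string.punctuation`. Tab and newline characters are not part of that set; they are discarded when the text is tokenized.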

In [4]:
%%time
# Lowercase text
reviews['clean_text'] = reviews['reviews'].str.lower()

# Remove punctuation
reviews['clean_text'] = reviews['clean_text'].str.translate(str.maketrans('','', string.punctuation))

# Remove extra leading/trailing whitespaces
reviews['clean_text'] = reviews['clean_text'].str.strip()

# Tokenize the text
reviews['clean_text'] = reviews['clean_text'].apply(lambda x: word_tokenize(str(x)))

# Remove stop words
clean_text_temp = []
for text in reviews['clean_text']:
    no_stop_words = []
    for word in text:
        if word not in stopwords.words('english'):
            no_stop_words.append(word)
    clean_text_temp.append(no_stop_words)
reviews['clean_text'] = clean_text_temp

# Lemmatize
lemmatizer = WordNetLemmatizer()
clean_text_temp = []
for text in reviews['clean_text']:
    lemmatized_words = []
    for word in text:
        lemmatized_words.append(lemmatizer.lemmatize(word))
    clean_text_temp.append(lemmatized_words)
reviews['clean_text'] = clean_text_temp

# Join the tokens back into a single string per review
temp_text = []
for word_list in reviews['clean_text']:
    temp_text.append(" ".join(word_list))
reviews['clean_text'] = temp_text

# Vectorize the Text
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer.fit(reviews['clean_text'].values)

count_data = count_vectorizer.transform(reviews['clean_text'].values)

CPU times: user 28min 46s, sys: 2min 4s, total: 30min 50s
Wall time: 30min 53s
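
The stop-word loop in this cell calls `stopwords.words('english')` once per token. That call returns a list, so every membership test is a linear scan over a freshly built list. This is likely a large share of the runtime, although the notebook does not profile it. A sketch of the usual alternative, building a set once and reusing it, is below. The tiny stop list and the helper name `remove_stop_words` are stand-ins so the example runs without NLTK data.

```python
# Tiny stand-in stop list so the sketch runs without NLTK data.
# In the notebook this would be: stop_words = set(stopwords.words('english'))
stop_words = {"this", "is", "the", "to", "very"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [word for word in tokens if word not in stop_words]


# Two already-tokenized, lowercased reviews.
tokenized = [
    ["this", "game", "is", "very", "good"],
    ["kudos", "to", "the", "developer"],
]

cleaned = [remove_stop_words(tokens, stop_words) for tokens in tokenized]
print(cleaned)
```

Set membership is a constant-time lookup on average, so the cost per token no longer depends on the length of the stop list.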


In [5]:
display(reviews)
print(count_data)

Unnamed: 0,reviews,app_id,clean_text
0,This game is very good! Kudos to the developer...,com.nut.man,game good kudos developer wow absolutely stunn...
1,Terrific just get rid of ads and it will be th...,com.nut.man,terrific get rid ad best game ever even though...
2,I CAN'T STOP TAPPING. This game is too addict...,com.nut.man,cant stop tapping game addicting challenging a...
3,. The game itself is really fun but the way t...,com.nut.man,game really fun way ad get slipped irritating ...
4,ADS GALORE!. I was bored 1 day then came acro...,com.nut.man,ad galore bored 1 day came across game love im...
...,...,...,...
1388663,Awesome. Unlike other apps this does not requ...,air.com.jcward.speedwords,awesome unlike apps require people buy app pla...
1388664,Simple concept well executed. The game is sim...,air.com.jcward.speedwords,simple concept well executed game simple play ...
1388665,. Engaging,air.com.jcward.speedwords_demo,engaging
1388666,Just playyyy. So cool,air.com.jcward.speedwords_demo,playyyy cool


  (0, 17967)	1
  (0, 19827)	1
  (0, 20225)	1
  (0, 60556)	1
  (0, 75845)	1
  (0, 89202)	1
  (0, 93377)	1
  (0, 118069)	1
  (0, 119608)	1
  (0, 128266)	2
  (0, 135480)	1
  (0, 155396)	1
  (0, 176597)	1
  (0, 276675)	1
  (0, 291213)	1
  (0, 320399)	1
  (0, 339110)	1
  (1, 19827)	3
  (1, 47717)	1
  (1, 62049)	1
  (1, 114697)	1
  (1, 128266)	2
  (1, 130993)	1
  (1, 135480)	2
  (1, 175157)	1
  :	:
  (1388663, 257134)	1
  (1388663, 318405)	1
  (1388664, 20225)	1
  (1388664, 74018)	1
  (1388664, 110658)	1
  (1388664, 128266)	1
  (1388664, 147994)	1
  (1388664, 182528)	2
  (1388664, 206505)	1
  (1388664, 222081)	1
  (1388664, 233399)	1
  (1388664, 236542)	3
  (1388664, 269735)	1
  (1388664, 276675)	2
  (1388665, 105174)	1
  (1388666, 76393)	1
  (1388666, 237558)	1
  (1388667, 20225)	1
  (1388667, 117673)	1
  (1388667, 125864)	1
  (1388667, 128266)	1
  (1388667, 138133)	1
  (1388667, 153989)	1
  (1388667, 199911)	1
  (1388667, 306009)	1


## Compute the LDA Model

Now compute the LDA model. `LatentDirichletAllocation` is fit on the count matrix with `number_of_topics` topics, and the `number_of_words` highest-weighted words of each topic are printed.

In [6]:
%%time
number_of_topics = 2
number_of_words = 5
lda = LDA(n_components=number_of_topics)
lda.fit(count_data)

# Output the top words for each topic
words = count_vectorizer.get_feature_names_out()

# argsort() is ascending, so the reversed slice picks the highest-weighted words
topics = [[words[i] for i in topic.argsort()[:-number_of_words - 1:-1]] for topic in lda.components_]

print(topics)

[['game', 'love', 'good', 'like', 'great'], ['app', 'work', 'great', 'good', 'use']]
CPU times: user 38min 10s, sys: 285 ms, total: 38min 11s
Wall time: 38min 11s
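
The slice `[:-number_of_words - 1:-1]` used to pick the top words is compact but easy to misread. The sketch below applies it to a single made-up weight vector; the vocabulary and weights are invented for illustration.

```python
import numpy as np

words = np.array(["ad", "app", "game", "good", "love"])
topic = np.array([0.5, 0.1, 9.0, 3.0, 4.0])  # made-up weights for one topic
number_of_words = 3

# argsort() returns indices in ascending order of weight.
# [:-n - 1:-1] walks backwards from the end and takes n indices,
# so the result is the n largest weights, largest first.
top_idx = topic.argsort()[:-number_of_words - 1:-1]
top_words = [str(w) for w in words[top_idx]]
print(top_words)

# Equivalent, and arguably easier to read:
same_idx = topic.argsort()[::-1][:number_of_words]
```

Both forms select the same indices; the second reverses the full ordering first and then truncates.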
