# Rapids Text Processing

This notebook contains all the code associated with the NVIDIA RAPIDS text pre-processing.

## Project Setup

This section imports all required libraries and downloads the data that those packages need. The NLTK data download is commented out because it only needs to be completed once per machine.

In [1]:
import cudf
import nltk
import pandas as pd

# Install nltk data
#nltk.download('stopwords')

## Loading the Text

This section loads the data.

First, the data is read into a GPU dataframe, `reviews`. Due to limitations in how RAPIDS reads JSON files, the file is read with pandas first, flattened so that each review has its own row, and then sent to the GPU.

In [2]:
reviews_pd = pd.read_json('../../data/raw/appsearch_reviews/appsearch_reviews.txt', lines=True)
reviews_pd = reviews_pd.explode("reviews").reset_index(drop=True).drop("num_reviews", axis=1)
reviews = cudf.from_pandas(reviews_pd)

reviews.head()

Unnamed: 0,reviews,app_id
0,This game is very good! Kudos to the developer...,com.nut.man
1,Terrific just get rid of ads and it will be th...,com.nut.man
2,I CAN'T STOP TAPPING. This game is too addict...,com.nut.man
3,. The game itself is really fun but the way t...,com.nut.man
4,ADS GALORE!. I was bored 1 day then came acro...,com.nut.man
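
The flattening step in the cell above can be illustrated on a small in-memory sample. This is a minimal sketch that assumes each line of the raw file holds one app with a list-valued `reviews` field, which is inferred from the `explode` and `drop` calls rather than confirmed from the data. The sample records and the `com.example.*` ids are hypothetical.

```python
import io

import pandas as pd

# Hypothetical JSON-lines sample: one app per line, reviews stored as a list.
sample = io.StringIO(
    '{"reviews": ["Great game", "Too many ads"], "app_id": "com.example.one", "num_reviews": 2}\n'
    '{"reviews": ["Fun"], "app_id": "com.example.two", "num_reviews": 1}\n'
    '{"reviews": [], "app_id": "com.example.three", "num_reviews": 0}\n'
)

nested = pd.read_json(sample, lines=True)

# Same chain as the notebook cell: one row per review, fresh index,
# and the per-app review count dropped because it no longer applies.
flat = nested.explode("reviews").reset_index(drop=True).drop("num_reviews", axis=1)

print(flat)
```

In this sketch, the app with an empty review list comes out of `explode` as a single row with a missing review, which is one reason to check for missing records during exploration.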


## Exploring the Data

This section does a quick exploration of the data to check for major issues. Checks to include:

* Missing records
* Very short and very long records

First, start with a preview of the dataframe records and a count of total records.

In [3]:
display(reviews.head())

print(f'--------------------------\nNumber of records: {len(reviews)}')

Unnamed: 0,reviews,app_id
0,This game is very good! Kudos to the developer...,com.nut.man
1,Terrific just get rid of ads and it will be th...,com.nut.man
2,I CAN'T STOP TAPPING. This game is too addict...,com.nut.man
3,. The game itself is really fun but the way t...,com.nut.man
4,ADS GALORE!. I was bored 1 day then came acro...,com.nut.man


--------------------------
Number of records: 1388668
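
The preview above does not cover the missing-record and length checks listed at the start of this section. A minimal sketch of those checks follows, written with pandas on a hypothetical sample so that it runs without a GPU. The `MIN_CHARS` and `MAX_CHARS` thresholds are arbitrary assumptions, and the same `isna` and `str.len` calls are expected to work on the cudf `reviews` dataframe, although that has not been verified here.

```python
import pandas as pd

# Hypothetical sample standing in for the `reviews` dataframe.
sample = pd.DataFrame({
    "reviews": ["Great game", None, "ok", "x" * 500],
    "app_id": ["com.example.one"] * 4,
})

# Missing records: reviews that are null.
missing_count = int(sample["reviews"].isna().sum())

# Length checks are run on the non-missing reviews only.
present = sample.dropna(subset=["reviews"])
lengths = present["reviews"].str.len()

# Assumed thresholds, chosen only for illustration.
MIN_CHARS, MAX_CHARS = 3, 200
too_short = present[lengths < MIN_CHARS]
too_long = present[lengths > MAX_CHARS]

print(f"Missing reviews: {missing_count}")
print(f"Shorter than {MIN_CHARS} characters: {len(too_short)}")
print(f"Longer than {MAX_CHARS} characters: {len(too_long)}")
```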


## Pre-process the Text

This section will go through basic text cleaning steps.

* Language detection (check that only English is included; see the topic modeling Twitter data paper for the method they used to complete this task)
* Normalize the text (make all text lowercase)
* Remove special characters
* Remove common stop words
* Remove numbers
* Remove extra whitespaces
* Tokenize the words
* Stem/Lemmatize the words

The text pre-processing was built as a single function. The function is based on one provided in the NVIDIA Developer blog to pre-process text data (https://developer.nvidia.com/blog/nlp-and-text-precessing-with-rapids-now-simpler-and-faster/). It covers punctuation filtering, lowercasing, stop word removal, and whitespace cleanup; the other steps in the list above are not implemented in this notebook.

The basic stopwords come from the `nltk` package's English stopword list.

The `filters` variable contains special characters to remove from the text, including tab and newline characters.

In [4]:
STOPWORDS = nltk.corpus.stopwords.words('english')

filters = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\t','\n',"'",",",'~' , '—']

def preprocess_text(input_strs, filters=None, stopwords=STOPWORDS):
    """
    Clean a cudf string Series:
        * replace filtered punctuation with spaces
        * convert to lowercase
        * remove stop words (from nltk)
        * replace multiple spaces with one
        * strip leading and trailing spaces
    """
    
    # filter punctuation (skipped when no filters are given) and case conversions
    if filters:
        translation_table = {ord(char): ord(" ") for char in filters}
        input_strs = input_strs.str.translate(translation_table)
    input_strs = input_strs.str.lower()
    
    # Remove stopwords
    stopwords_gpu = cudf.Series(stopwords)
    input_strs = input_strs.str.replace_tokens(stopwords_gpu, ' ')
    
    # Replace multiple spaces with single one and strip leading/trailing spaces
    input_strs = input_strs.str.normalize_spaces()
    input_strs = input_strs.str.strip(' ')
    
    return input_strs
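
As a CPU-side illustration of what the function does to a single string, the same steps can be sketched in plain Python. This is not the RAPIDS implementation: `preprocess_text_cpu` is a hypothetical helper, the small filter list and stopword set below are assumptions chosen for the example, and splitting on whitespace only approximates the token matching done by `replace_tokens`.

```python
def preprocess_text_cpu(text, filters, stopwords):
    """Plain-Python approximation of preprocess_text for one string."""
    # Map every filtered character to a space, then lowercase.
    table = {ord(ch): ord(" ") for ch in filters}
    lowered = text.translate(table).lower()

    # Splitting on whitespace drops repeated, leading, and trailing spaces;
    # stopwords are removed as whole tokens.
    kept = [tok for tok in lowered.split() if tok not in stopwords]
    return " ".join(kept)


# Hypothetical, much smaller inputs than the notebook's `filters` and STOPWORDS.
demo_filters = ["'", ".", "!"]
demo_stopwords = {"i", "can", "t", "this", "is", "too"}

cleaned = preprocess_text_cpu(
    "I CAN'T STOP TAPPING. This game is too addicting",
    demo_filters,
    demo_stopwords,
)
print(cleaned)  # stop tapping game addicting
```

Replacing the apostrophe with a space splits `CAN'T` into `can` and `t`, which is why both fragments need to be in the stopword list for the contraction to disappear.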

In [5]:
%%time
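# Note: these three lines repeat the first step of preprocess_text;
# their result (input_strs) is not used by the call below.
translation_table = {ord(char): ord(" ") for char in filters}
input_strs = reviews['reviews'].str.translate(translation_table)
input_strs = input_strs.str.lower()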

reviews['clean_review'] = preprocess_text(reviews['reviews'], filters=filters, stopwords=STOPWORDS)

display(reviews)

Unnamed: 0,reviews,app_id,clean_review
0,This game is very good! Kudos to the developer...,com.nut.man,game good kudos developer wow absolutely stunn...
1,Terrific just get rid of ads and it will be th...,com.nut.man,terrific get rid ads best game ever even thoug...
2,I CAN'T STOP TAPPING. This game is too addict...,com.nut.man,stop tapping game addicting challenging always...
3,. The game itself is really fun but the way t...,com.nut.man,game really fun way ads get slipped irritating...
4,ADS GALORE!. I was bored 1 day then came acro...,com.nut.man,ads galore bored 1 day came across game love n...
...,...,...,...
1388663,Awesome. Unlike other apps this does not requ...,air.com.jcward.speedwords,awesome unlike apps require people buy app pla...
1388664,Simple concept well executed. The game is sim...,air.com.jcward.speedwords,simple concept well executed game simple play ...
1388665,. Engaging,air.com.jcward.speedwords_demo,engaging
1388666,Just playyyy. So cool,air.com.jcward.speedwords_demo,playyyy cool


CPU times: user 260 ms, sys: 15.9 ms, total: 276 ms
Wall time: 265 ms
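
The cleaned output above still contains digits (for example `1 day`), and the number-removal and tokenization steps from the list are not part of `preprocess_text`. A minimal plain-Python sketch of those two steps follows. The helper name `strip_numbers_and_tokenize` is hypothetical, and a GPU version would need the equivalent cudf string operations, which are not shown here.

```python
import re


def strip_numbers_and_tokenize(text):
    """Remove digit runs from an already cleaned review and split it into tokens."""
    # Replace each run of digits with a space so neighbouring words stay separate.
    no_digits = re.sub(r"\d+", " ", text)
    # Whitespace split also discards the extra spaces left by the substitution.
    return no_digits.split()


tokens = strip_numbers_and_tokenize("ads galore bored 1 day came across game")
print(tokens)
# ['ads', 'galore', 'bored', 'day', 'came', 'across', 'game']
```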


## Output the data to file

In [6]:
%%time
reviews.to_parquet('../../data/cleaned/appsearch_reviews_clean_rapids', index=False)

CPU times: user 205 ms, sys: 156 ms, total: 361 ms
Wall time: 429 ms


## Loading the Yelp Reviews Data

In [7]:
#yelp_reviews_path = '../../data/raw/yelp_archive/yelp_academic_dataset_review.json'
#yelp_reviews_df = cudf.read_json(yelp_reviews_path, engine='cudf', lines=True, nrows=100)