# Full pipeline for Text Data Exploration

For a data scientist working in Natural Language Processing (NLP), a thorough data exploration phase is crucial for understanding the text data, identifying patterns, and informing subsequent preprocessing and modeling steps. This guide walks through a pipeline of common tasks, tips, code, libraries, and suggested charts, step by step in Python. The data used by this guide can be downloaded from https://zenodo.org/records/10157504.

# 1. Data Loading and Initial Inspection

**Common Task**: Load your text data and get a first glance at its structure and content.

**Tips**:
- Start with a sample if your dataset is massive.
- Understand the format: Is it a CSV, JSON, database, etc.?
- Check for missing values immediately.
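
The first and third tips can be sketched with the standard library alone: read only the first few records of a file and count empty text fields. The helper name `peek_csv` and the in-memory sample are assumptions made for illustration; `pandas.read_csv` accepts an `nrows` argument that serves the same purpose.

```python
import csv
import io
from itertools import islice

def peek_csv(handle, n_rows=5):
    """Return the header and the first n_rows records from an open CSV file."""
    reader = csv.DictReader(handle)
    rows = list(islice(reader, n_rows))
    return reader.fieldnames, rows

# Hypothetical sample standing in for a large file on disk.
sample = io.StringIO(
    "ReviewBody,ReviewStar\n"
    "Good sound,4\n"
    ",5\n"
    "Stopped working,1\n"
)
header, rows = peek_csv(sample, n_rows=2)

# Count records whose text field is empty or whitespace-only.
missing = sum(1 for row in rows if not row["ReviewBody"].strip())
print(header, len(rows), missing)
```

Only the requested number of records is read, so the same call stays cheap on a very large file.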

In [None]:
#! pip install nltk



In [None]:
#!pip install pandas




In [51]:
import pandas as pd
import numpy as np
import glob
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from collections import Counter


In [30]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lupi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Lupi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lupi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [31]:
file_input = 'AllProductReviews.csv'

In [32]:
reviews = pd.read_csv(file_input, encoding='utf-8')

In [33]:
reviews.tail()

Unnamed: 0,ReviewTitle,ReviewBody,ReviewStar,Product
14332,Good\r\n,Good\r\n,4,JBL T110BT
14333,Amazing Product\r\n,An amazing product but a bit costly.\r\n,5,JBL T110BT
14334,Not bad\r\n,Sound\r\n,1,JBL T110BT
14335,a good product\r\n,the sound is good battery life is good but the...,5,JBL T110BT
14336,"Average headphones , n overrated name\r\n",M writing this review after using for almost 7...,1,JBL T110BT


In [34]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14337 entries, 0 to 14336
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ReviewTitle  14337 non-null  object
 1   ReviewBody   14337 non-null  object
 2   ReviewStar   14337 non-null  int64 
 3   Product      14337 non-null  object
dtypes: int64(1), object(3)
memory usage: 448.2+ KB


There are no null values in any column.

# 2. Basic Text Statistics

**Common Tasks**: Calculate fundamental statistics about your text data to understand its overall characteristics.

**Tips**:
- Character count can indicate brevity or verbosity.
- Word count and sentence count provide insights into text length and complexity.
- Average word length can hint at the formality or simplicity of the language.

In [35]:
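# Clean the titles: remove newline characters ("\n").
# Note: this leaves the carriage return ("\r") of Windows-style "\r\n" line endings in place.
reviews['ReviewTitle'] = reviews['ReviewTitle'].str.replace('\n', '', regex=False)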


In [36]:
# reviews['char_count_Title'] = reviews['ReviewTitle'].str.len()
# All of the analysis below is done on the review body.
reviews['char_count'] = reviews['ReviewBody'].str.len()
reviews['word_count'] = reviews['ReviewBody'].str.split().str.len()

# Length of each word, then the mean word length per review (0 for reviews without words).
reviews['word_len'] = reviews['ReviewBody'].str.split().apply(lambda word_list: [len(word) for word in word_list])
reviews['average_word_len'] = reviews['word_len'].apply(lambda lengths: sum(lengths) / len(lengths) if lengths else 0)

In [37]:
# Number of sentences per review, using NLTK's sentence tokenizer.
reviews['sentence_count'] = reviews['ReviewBody'].apply(lambda text: len(sent_tokenize(text)))



In [38]:
reviews.head()

Unnamed: 0,ReviewTitle,ReviewBody,ReviewStar,Product,char_count,word_count,word_len,average_word_len,sentence_count
0,Honest review of an edm music lover\r,No doubt it has a great bass and to a great ex...,3,boAt Rockerz 255,444,77,"[2, 5, 2, 3, 1, 5, 4, 3, 2, 1, 5, 6, 5, 12, 3,...",4.753247,6
1,Unreliable earphones with high cost\r,"This earphones are unreliable, i bought it be...",1,boAt Rockerz 255,372,64,"[4, 9, 3, 11, 1, 6, 2, 6, 2, 4, 9, 5, 4, 3, 4,...",4.78125,1
2,Really good and durable.\r,"i bought itfor 999,I purchased it second time,...",4,boAt Rockerz 255,485,86,"[1, 6, 5, 5, 9, 2, 6, 5, 6, 5, 3, 2, 8, 4, 2, ...",4.627907,3
3,stopped working in just 14 days\r,Its sound quality is adorable. overall it was ...,1,boAt Rockerz 255,200,37,"[3, 5, 7, 2, 9, 7, 2, 3, 4, 3, 4, 3, 1, 5, 5, ...",4.378378,3
4,Just Awesome Wireless Headphone under 1000...😉\r,Its Awesome... Good sound quality & 8-9 hrs ba...,5,boAt Rockerz 255,236,36,"[3, 10, 4, 5, 7, 1, 3, 3, 7, 7, 4, 4, 7, 1, 1,...",5.527778,2


In [39]:
reviews.describe()

Unnamed: 0,ReviewStar,char_count,word_count,average_word_len,sentence_count
count,14337.0,14337.0,14337.0,14337.0,14337.0
mean,3.675874,127.584362,22.320709,4.836041,1.950338
std,1.503409,154.807798,27.702611,1.010389,1.742263
min,1.0,2.0,0.0,0.0,0.0
25%,3.0,37.0,6.0,4.24,1.0
50%,4.0,89.0,15.0,4.666667,1.0
75%,5.0,161.0,28.0,5.222222,2.0
max,5.0,5047.0,864.0,31.0,43.0
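
The summary shows a minimum `word_count` of 0 and a minimum `char_count` of 2, even though `info()` reported no nulls. One plausible explanation is that some bodies hold only whitespace, such as a bare `\r\n`. A minimal sketch of a check for such values follows; the helper `is_blank` and the sample list are hypothetical.

```python
def is_blank(text):
    """True for values that are not null but carry no visible characters."""
    return isinstance(text, str) and text.strip() == ""

# Hypothetical bodies; "\r\n" mimics a review that holds only a line ending.
bodies = ["Good\r\n", "\r\n", "   ", "Sound is fine\r\n"]
blank_count = sum(is_blank(body) for body in bodies)
print(blank_count)
```

On the DataFrame, the same helper could be applied as `reviews['ReviewBody'].apply(is_blank).sum()`.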


# 3. Text Preprocessing (for Exploration)

**Common Tasks**: Clean and normalize text to prepare it for frequency analysis and other exploratory tasks. This is a lighter preprocessing step compared to what you might do for modeling.

**Tips**:
- Lowercasing prevents treating "The" and "the" as different words.
- Punctuation removal reduces noise.
- Stopword removal focuses on meaningful content words.
- Stemming and lemmatization reduce words to their root forms, consolidating variations.

In [40]:
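import re

reviews['ReviewBody'] = reviews['ReviewBody'].str.lower()
# Escape the punctuation so that characters such as "\", "]", "^" and "-" are treated literally inside the character class.
reviews['ReviewBody'] = reviews['ReviewBody'].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)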

In [41]:

# Function to remove stopwords (superseded by the more robust version in the next cell)
# def remove_stopwords(text):
#     if not isinstance(text, str):  # avoid errors on NaN or other non-string values
#         return ""
#     words = word_tokenize(text.lower())
#     filtered = [word for word in words if word.isalpha() and word not in stopwords.words('english')]
#     return " ".join(filtered)
# # Apply to each row
# reviews['CleanedReview'] = reviews['ReviewBody'].apply(remove_stopwords)


In [42]:
# Download NLTK resources only if they are not already available
def ensure_nltk_resource(resource_name, resource_path):
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name)

ensure_nltk_resource('punkt', 'tokenizers/punkt')
ensure_nltk_resource('stopwords', 'corpora/stopwords')

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

# Robust cleaning function
def clean_text_remove_stopwords(text):
    try:
        # Make sure the input is a string
        if not isinstance(text, str):
            return ""

        # Tokenize
        words = word_tokenize(text.lower())

        # Keep alphabetic tokens that are not stopwords
        clean_words = [
            word for word in words
            if word.isalpha() and word not in STOP_WORDS
        ]

        return " ".join(clean_words)
    except Exception as e:
        # If anything fails, report the error and return an empty string
        print(f"Error while processing: {text} → {e}")
        return ""


In [44]:
reviews['CleanedReview'] = reviews['ReviewBody'].apply(clean_text_remove_stopwords)


In [46]:
# Tokenize the cleaned text into a list of words per review
reviews['words'] = reviews['CleanedReview'].apply(lambda text: word_tokenize(text.lower()))

In [52]:
# Word analysis

list_word = reviews['words'].dropna().apply(
    lambda x: x if isinstance(x, list) else []
)
# Flatten into a single list
all_words = [word for tokens in list_word for word in tokens]

word_frequency = Counter(all_words)

In [53]:
word_frequency

Counter({'good': 6735,
         'quality': 5835,
         'sound': 5831,
         'product': 4635,
         'bass': 2668,
         'battery': 1865,
         'one': 1846,
         'price': 1539,
         'earphones': 1502,
         'best': 1353,
         'working': 1334,
         'great': 1212,
         'awesome': 1162,
         'also': 1158,
         'ear': 1072,
         'noise': 1043,
         'earphone': 1033,
         'buy': 1023,
         'nice': 1013,
         'use': 949,
         'music': 931,
         'life': 914,
         'using': 898,
         'like': 894,
         'better': 836,
         'bluetooth': 794,
         'really': 788,
         'dont': 768,
         'cancellation': 754,
         'go': 752,
         'money': 751,
         'got': 748,
         'worth': 705,
         'range': 704,
         'even': 686,
         'time': 677,
         'months': 672,
         'jbl': 629,
         'mic': 619,
         'headphones': 613,
         'excellent': 612,
         'much': 598,
   

In [49]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14337 entries, 0 to 14336
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ReviewTitle       14337 non-null  object 
 1   ReviewBody        14337 non-null  object 
 2   ReviewStar        14337 non-null  int64  
 3   Product           14337 non-null  object 
 4   char_count        14337 non-null  int64  
 5   word_count        14337 non-null  int64  
 6   word_len          14337 non-null  object 
 7   average_word_len  14337 non-null  float64
 8   sentence_count    14337 non-null  int64  
 9   CleanedReview     14337 non-null  object 
 10  words             14337 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 1.2+ MB
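
The frequency output above includes `dont` (768 occurrences). A likely cause is the order of operations: punctuation is stripped before stopwords are removed, so a contraction like "don't" becomes "dont" and no longer matches the stopword list. The sketch below contrasts the two orders using plain whitespace splitting and a tiny stopword set; both are assumptions made for illustration and do not reproduce NLTK's tokenizer or its full English list.

```python
import string

# Tiny illustrative stopword set (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"i", "it", "the", "don't", "is"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_then_filter(text):
    """Order used above: delete punctuation first, then drop stopwords."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    return [w for w in no_punct.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stopwords first, then delete punctuation."""
    kept = [
        w for w in text.lower().split()
        if w.strip(string.punctuation) not in STOP_WORDS
    ]
    cleaned = [w.translate(PUNCT_TABLE) for w in kept]
    return [w for w in cleaned if w]

text = "I don't like it, the bass is weak."
print(strip_then_filter(text))   # "dont" survives the stopword filter
print(filter_then_strip(text))   # the contraction is removed as a stopword
```

Whether contractions should be dropped depends on the analysis; negations such as "don't" can matter for sentiment, so keeping them may be the better choice there.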


# 4. Vocabulary Analysis

**Common Tasks**: Understand the unique words, their frequencies, and patterns.

**Tips**:

- Word clouds provide a quick visual summary of frequent terms.
- Bar charts of top N words show exact frequencies.
- Analyzing n-grams (bigrams, trigrams) reveals common phrases.
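
As a minimal sketch of the last two tips, top-N words and n-grams can both be counted with `collections.Counter`. The `docs` list is a hypothetical stand-in for the tokenized `words` column, and the text bar chart is only a quick substitute for a plotted one.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical tokenized reviews standing in for the `words` column.
docs = [
    ["good", "sound", "quality"],
    ["sound", "quality", "good"],
    ["battery", "life", "good"],
]

unigram_counts = Counter(token for doc in docs for token in doc)
# Bigrams are built per document so they never span two reviews.
bigram_counts = Counter(bigram for doc in docs for bigram in ngrams(doc, 2))

# Quick text bar chart of the top words.
for word, count in unigram_counts.most_common(3):
    print(f"{word:<10}{'#' * count}")

print(bigram_counts.most_common(1))
```

Passing `n=3` to `ngrams` gives trigrams in the same way.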

# 5. Part-of-Speech (POS) Tagging

**Common Task**: Analyze the distribution of grammatical categories (nouns, verbs, adjectives, etc.) in your text.

**Tips**:

- Provides insights into the linguistic structure of your corpus.
- Can highlight if your text is descriptive (many adjectives), action-oriented (many verbs), or topic-focused (many nouns).
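
Once tokens are tagged, the distribution itself is a simple count. The sketch below assumes a list of `(token, tag)` pairs, which is the shape returned by `nltk.pos_tag`; the sample is hand-tagged for illustration, and running the real tagger requires downloading its model first (the resource name depends on the NLTK version).

```python
from collections import Counter

def pos_distribution(tagged_tokens):
    """Return the share of each POS tag in a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}

# Hand-tagged sample (an assumption) using Penn Treebank tags.
tagged = [
    ("sound", "NN"), ("quality", "NN"), ("is", "VBZ"),
    ("really", "RB"), ("good", "JJ"),
]
shares = pos_distribution(tagged)
print(shares)
```

Tagging works best on the original sentences, before stopwords are removed, because taggers use the surrounding words as context.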

# 6. Named Entity Recognition (NER)

**Common Task**: Identify and categorize named entities (people, organizations, locations, dates, etc.) in your text.

**Tips**:

- Reveals key subjects and concepts in your data.
- Useful for extracting structured information from unstructured text.
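
A minimal sketch of the exploration step that follows entity extraction: grouping entities by label and counting their surface forms. The `(text, label)` pairs are hand-labelled assumptions; with spaCy they could be collected as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter, defaultdict

def summarize_entities(entities):
    """Group (text, label) entity pairs by label and count each surface form."""
    by_label = defaultdict(Counter)
    for text, label in entities:
        # Lowercase so that "JBL" and "jbl" are counted together.
        by_label[label][text.lower()] += 1
    return by_label

# Hand-labelled sample (an assumption for illustration).
entities = [("JBL", "ORG"), ("jbl", "ORG"), ("boAt", "ORG"), ("14 days", "DATE")]
summary = summarize_entities(entities)
print(summary["ORG"].most_common(1))
```

Entity recognizers often rely on capitalization and punctuation, so it is usually safer to run them on the raw text than on the lowercased, punctuation-free version.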


# 7. Sentiment Analysis (if applicable)

**Common Task**: Determine the emotional tone (positive, negative, neutral) of your text data.

**Tips**:

- Provides a high-level understanding of the sentiment distribution.
- Can be done with simple lexicon-based models or more complex pre-trained models.
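
A lexicon-based model can be sketched in a few lines: count positive and negative words and normalize by length. The two word sets below are toy assumptions, not a real lexicon; NLTK ships a fuller lexicon-based analyzer (VADER, in `nltk.sentiment.vader`).

```python
# Toy lexicon (an assumption for illustration); real lexicons are far larger.
POSITIVE = {"good", "great", "awesome", "excellent", "best", "nice"}
NEGATIVE = {"bad", "poor", "worst", "unreliable", "stopped"}

def lexicon_score(tokens):
    """Return (positive hits - negative hits) / number of tokens, in [-1, 1]."""
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)

def label(score):
    """Map a score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical tokenized reviews.
reviews_tokens = [
    ["good", "sound", "great", "bass"],
    ["stopped", "working", "bad", "product"],
    ["sound"],
]
labels = [label(lexicon_score(tokens)) for tokens in reviews_tokens]
print(labels)
```

For this dataset, comparing such labels against `ReviewStar` would give a quick sanity check on how well the lexicon matches the ratings.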

# 8. Topic Modeling (High-level exploration)

**Common Task**: Discover abstract "topics" that occur in a collection of documents.

**Tips**:

- LDA (Latent Dirichlet Allocation) is a common algorithm.
- Requires a document-term matrix.
- Provides a sense of the main themes present in your corpus.
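
The document-term matrix mentioned above can be sketched without extra libraries: one row per document, one column per vocabulary term, and each cell holding a count. This is only an illustration of the data structure, with hypothetical input; in practice scikit-learn's `CountVectorizer` would typically build the matrix and `LatentDirichletAllocation` would fit the topics.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocabulary = sorted({token for doc in docs for token in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical tokenized reviews.
docs = [
    ["good", "sound", "good", "bass"],
    ["battery", "life", "good"],
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
print(matrix)
```

A dense list of lists is fine for a sketch, but a real corpus of this size calls for a sparse representation, since most cells are zero.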