In [2]:
# ============================================
# NLP PREPROCESSING PIPELINE (CONCEPTUAL DEMO)
# ============================================
# Note:
# The actual project dataset contains only numerical values.
# Therefore, this NLP pipeline is demonstrated using sample text
# to show understanding of tokenization, stopword removal, and TF-IDF.
# -----------------------------
# Sample Text Data
# -----------------------------
texts = [
    "Stock prices increased significantly today",
    "Market volatility affects investment decisions",
    "Investors analyze market trends and stock performance"
]
# -----------------------------
# Import TF-IDF Vectorizer
# -----------------------------
from sklearn.feature_extraction.text import TfidfVectorizer
# -----------------------------
# Apply NLP Preprocessing
# 1) Tokenization
# 2) Stopword Removal
# 3) TF-IDF Vectorization
# -----------------------------
vectorizer = TfidfVectorizer(
    stop_words='english',   # removes common English stopwords
    lowercase=True          # converts text to lowercase
)
tfidf_matrix = vectorizer.fit_transform(texts)
# -----------------------------
# Display Results
# -----------------------------
# Feature names (tokens after preprocessing)
print("📌 Extracted Tokens (Vocabulary):")
print(vectorizer.get_feature_names_out())
print("\n📌 TF-IDF Matrix:")
print(tfidf_matrix.toarray())


📌 Extracted Tokens (Vocabulary):
['affects' 'analyze' 'decisions' 'increased' 'investment' 'investors'
 'market' 'performance' 'prices' 'significantly' 'stock' 'today' 'trends'
 'volatility']

📌 TF-IDF Matrix:
[[0.         0.         0.         0.46735098 0.         0.
  0.         0.         0.46735098 0.46735098 0.35543247 0.46735098
  0.         0.        ]
 [0.46735098 0.         0.46735098 0.         0.46735098 0.
  0.35543247 0.         0.         0.         0.         0.
  0.         0.46735098]
 [0.         0.44036207 0.         0.         0.         0.44036207
  0.3349067  0.44036207 0.         0.         0.3349067  0.
  0.44036207 0.        ]]
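
The first row of the matrix can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (smoothed IDF, L2 normalization, raw term counts); the helper name smooth_idf and the doc_freq dictionary are introduced here for illustration only and are not part of the original cell.

```python
import math

n_docs = 3  # number of sample texts

# Document frequency of each token in the first sentence after stopword
# removal: "stock" occurs in 2 of the 3 texts, the other tokens in 1.
doc_freq = {"increased": 1, "prices": 1, "significantly": 1, "stock": 2, "today": 1}

def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Every token occurs once in the sentence, so term frequency is 1.
raw = {term: 1 * smooth_idf(df, n_docs) for term, df in doc_freq.items()}

# L2-normalize the row so that its weights have unit length.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}

for term, w in weights.items():
    print(f"{term:>13}: {w:.8f}")
```

"stock" receives a lower weight than the other tokens because it also appears in the third text, which lowers its inverse document frequency.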


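1) Tokenization:
   TfidfVectorizer lowercases each text and splits it into word tokens of two or more characters

2) Stopword Removal:
   Common English words such as "is", "and", and "the" are removed (here, "and" in the third sentence)

3) TF-IDF Vectorization:
   Each text becomes a row of TF-IDF weights, one column per vocabulary token, normalized to unit length

4) Conceptual NLP Pipeline:
   The steps are demonstrated on sample text because the actual dataset is numeric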
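
Steps 1 and 2 can also be sketched without scikit-learn to make them visible. The regular expression below is TfidfVectorizer's default token_pattern. The stopword set is a deliberately tiny, hypothetical subset chosen for this demo, not scikit-learn's full English list, and the names TOKEN_PATTERN, DEMO_STOPWORDS and preprocess are illustrative.

```python
import re

texts = [
    "Stock prices increased significantly today",
    "Market volatility affects investment decisions",
    "Investors analyze market trends and stock performance",
]

# Default token pattern of TfidfVectorizer: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: a tiny illustrative stopword subset, NOT the full English list.
DEMO_STOPWORDS = {"is", "and", "the"}

def preprocess(text):
    """Lowercase, tokenize, and drop stopwords from a single text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]

processed = [preprocess(t) for t in texts]
for tokens in processed:
    print(tokens)

# The sorted set of remaining tokens forms the vocabulary.
vocabulary = sorted({tok for tokens in processed for tok in tokens})
print(vocabulary)
```

For these three sentences the resulting vocabulary matches the 14 tokens printed by get_feature_names_out() above; with other texts the tiny stopword subset would keep words that the full English list removes.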