## Chapters 6 and 7

Chapters 6 and 7 of the O'Reilly book focus on pipelines and on using supervised classification algorithms with text data. These chapters are the heart of natural language processing and basic machine learning. The examples here introduce basic preprocessing steps: the Count Vectorizer, TF-IDF scaling (which assigns a higher weight to rare words and a lower weight to common words), and n-grams, which capture some context by treating adjacent words as a single feature.
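
As a minimal sketch of how n-grams capture context, the block below fits a `CountVectorizer` with and without bigrams. The two sentences are hypothetical and exist only for illustration; the point is that a bigram such as "not good" becomes its own feature, whereas unigrams alone lose word order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews that share most words but differ in meaning.
docs = ["not good at all", "good, not bad at all"]

# Unigrams only: each word is a feature and word order is lost.
unigrams = CountVectorizer(ngram_range=(1, 1))
unigrams.fit(docs)
print(sorted(unigrams.vocabulary_))

# Unigrams plus bigrams: adjacent pairs such as "not good" become features too.
bigrams = CountVectorizer(ngram_range=(1, 2))
bigrams.fit(docs)
print(sorted(bigrams.vocabulary_))
```

The larger `ngram_range` also enlarges the vocabulary, which is one reason it is usually treated as a hyperparameter to tune rather than always switched on.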

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from scipy.stats import uniform, randint



This example uses a binary classification algorithm to predict if a movie review is positive or negative. A label of 1 indicates a positive review and a label of 0 is a negative review.

In [None]:

# Sample data
texts = [
    "I love this movie, it's amazing",
    "This movie is terrible and boring",
    "Great film, wonderful acting",
    "Poor direction, bad acting, terrible movie",
    "Excellent cinematography and great story",
    "Waste of time, horrible movie"
]

labels = [1, 0, 1, 0, 1, 0]  # 1 for positive, 0 for negative

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)


A pipeline lets us handle the preprocessing and the model in a single object. This is useful because a grid search can then tune the preprocessing parameters and the model hyperparameters together to find the best combination.
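
The sketch below illustrates how a pipeline exposes the parameters of every step under one naming scheme, `<step name>__<parameter name>`. The step names `vectorizer` and `classifier` are arbitrary labels chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Every hyperparameter of every step is addressable by a double-underscore name.
params = demo_pipeline.get_params()
print(params["vectorizer__ngram_range"])  # CountVectorizer default
print(params["classifier__C"])            # LogisticRegression default

# The same names are accepted by set_params, so preprocessing and model
# settings can be changed together on the one object.
demo_pipeline.set_params(vectorizer__ngram_range=(1, 2), classifier__C=0.5)
print(demo_pipeline.named_steps["classifier"].C)
```

These double-underscore names are the keys used in a search's parameter grid.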

## Count Vectorizer

### Short Text Classification


+ Business Case: Customer service ticket categorization or email routing
+ Justification: When dealing with short support tickets or emails, simple word presence/absence is often sufficient for categorization. The computational simplicity means faster processing of high-volume customer inquiries.
+ Example: Automatically routing customer emails to different departments based on keywords


### Keyword-Based Systems


+ Business Case: Content tagging or basic document categorization
+ Justification: When specific keywords strongly indicate category membership, counting occurrences is more interpretable for business stakeholders
+ Example: Tagging product reviews with relevant product categories based on mentioned features


### Resource-Constrained Environments


+ Business Case: Real-time classification systems with limited computing resources
+ Justification: Lower computational overhead compared to TF-IDF, making it more suitable for edge computing or mobile applications
+ Example: Mobile app features that need to classify text with minimal battery impact

In [None]:
# Example 1: Using CountVectorizer with Logistic Regression
print("Using CountVectorizer with Logistic Regression:")
count_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words='english'
    )),
    ('classifier', LogisticRegression(
        max_iter=1000,
        random_state=42
    ))
])

# Train and evaluate
count_pipeline.fit(X_train, y_train)
count_predictions = count_pipeline.predict(X_test)
print("\nCountVectorizer Results:")
print(classification_report(y_test, count_predictions))


Suppose we want to trim the noise by assigning a higher weight to rare words and a lower weight to common words instead of using stop words. Term Frequency-Inverse Document Frequency (TF-IDF) does this by scaling each word count by how rare the word is across the corpus, so rare words receive a higher weight and common words a lower one.

Because of how the weights are computed, it is generally not necessary to remove stop words (unless there is a business justification): common English stop words already receive a lower weight.
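
As a rough illustration of the weighting, the sketch below compares the inverse document frequency that scikit-learn learns with a manual calculation. It assumes the library's default `smooth_idf=True`, where idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing term t. The three sentences are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "the" is in every document, "wooden" in only one.
docs = [
    "the movie was great",
    "the movie was terrible",
    "the acting was wooden",
]

tfidf = TfidfVectorizer()  # stop words are kept on purpose
tfidf.fit(docs)

# Document frequencies counted by hand for three terms.
n_docs = len(docs)
doc_freq = {"the": 3, "movie": 2, "wooden": 1}

for term, df in doc_freq.items():
    learned = tfidf.idf_[tfidf.vocabulary_[term]]
    manual = np.log((1 + n_docs) / (1 + df)) + 1
    print(f"{term}: learned={learned:.3f} manual={manual:.3f}")
```

The word that appears in every document ends up with the smallest idf, which is why explicit stop-word removal matters less here than with raw counts.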

## TF-IDF Vectorizer

### Document Search and Retrieval


+ Business Case: Enterprise search systems or content recommendation engines
+ Justification: Better at identifying distinctive terms in documents, leading to more relevant search results
+ Example: Internal document search system where finding the most relevant documents is crucial

### Long-Form Content Analysis


+ Business Case: Article categorization or research paper classification
+ Justification: Accounts for term importance across the document corpus, reducing noise from commonly used words
+ Example: Automatically categorizing news articles or research papers into topics


### Competitive Intelligence


+ Business Case: Analysis of competitor content or market research
+ Justification: Better at identifying distinctive features in documents, helpful for understanding unique selling propositions
+ Example: Analyzing competitor websites to identify key differentiating themes


### Content Recommendation


+ Business Case: Product description matching or content similarity analysis
+ Justification: Better at capturing the importance of terms in context, leading to more nuanced recommendations
+ Example: Suggesting similar products based on description similarity
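
The description-similarity idea can be sketched with TF-IDF vectors and cosine similarity. The product descriptions below are hypothetical, and `cosine_similarity` from `sklearn.metrics.pairwise` is one reasonable choice of similarity measure, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions, for illustration only.
descriptions = [
    "wireless bluetooth headphones with noise cancelling",
    "bluetooth wireless earbuds with noise cancelling microphone",
    "stainless steel kitchen knife set",
]

# Each description becomes an L2-normalised TF-IDF vector.
matrix = TfidfVectorizer().fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions.
similarity = cosine_similarity(matrix)

# Find the product most similar to product 0, excluding itself.
scores = similarity[0].copy()
scores[0] = -1.0
best_match = int(scores.argmax())
print(best_match, descriptions[best_match])
```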

In [None]:
# Example 2: Using TfidfVectorizer with Logistic Regression
print("\nUsing TfidfVectorizer with Logistic Regression:")
tfidf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(
        stop_words='english',
        min_df=1,
        max_df=0.8,
        norm='l2'
    )),
    ('classifier', LogisticRegression(
        max_iter=1000,
        random_state=42
    ))
])

# Train and evaluate
tfidf_pipeline.fit(X_train, y_train)
tfidf_predictions = tfidf_pipeline.predict(X_test)
print("\nTfidfVectorizer Results:")
print(classification_report(y_test, tfidf_predictions))

## Grid Search 

### Pros:

+ Systematically evaluates every possible parameter combination
+ Guaranteed to find the best-scoring combination within the defined grid (by cross-validation score), though not necessarily the best values outside it
+ More reliable for smaller parameter spaces
+ Results are reproducible and deterministic
+ Easy to interpret results due to systematic exploration

### Cons:

+ Computationally expensive, especially with large parameter spaces
+ The number of combinations is the product of the number of values per parameter, so it grows exponentially with the number of parameters
+ May waste resources exploring unproductive parameter combinations
+ Not practical for high-dimensional parameter spaces
+ Can be inefficient when some parameters are more important than others
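
To get a feel for the cost, the sketch below counts the candidates in a grid with `ParameterGrid`. The grid mirrors the one used later in this notebook; `example_grid` is a name chosen for this example and the fold count of 5 is only an assumption.

```python
from sklearn.model_selection import ParameterGrid

example_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.5, 0.75],
    "tfidf__min_df": [1, 2],
    "tfidf__use_idf": [True, False],
    "clf__C": [0.1, 1.0],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear"],
}

# Number of candidates = product of the number of values per parameter.
n_combinations = len(ParameterGrid(example_grid))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)
```

Six parameters with two values each already give 64 candidates; adding a seventh two-valued parameter would double that again.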

## Randomized Grid Search

### Pros:

+ More efficient use of computational resources
+ Can handle larger parameter spaces effectively
+ Better at finding good parameters when some are more important than others
+ Can explore continuous parameter distributions
+ Usually finds good solutions much faster than grid search
+ Allows more control over computational budget through n_iter parameter

### Cons:

+ May miss optimal parameter combinations due to random sampling
+ Results can vary between runs due to randomness unless random_state is fixed
+ Less systematic, making it harder to ensure complete coverage of parameter space
+ May require multiple runs to ensure stability of results
+ Less suitable for small parameter spaces where exhaustive search is feasible
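
The sampling behaviour can be inspected directly with `ParameterSampler`, which draws candidates from the search distributions. This is a small sketch with a hypothetical two-parameter space. It shows that `n_iter` fixes the budget and that a fixed `random_state` makes the draws repeatable.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

distributions = {
    "clf__C": uniform(0.1, 10.0),   # continuous, samples from [0.1, 10.1]
    "clf__penalty": ["l1", "l2"],   # lists are sampled uniformly
}

# n_iter sets how many candidates are drawn, regardless of the space size.
draws_a = list(ParameterSampler(distributions, n_iter=5, random_state=42))
draws_b = list(ParameterSampler(distributions, n_iter=5, random_state=42))

for candidate in draws_a:
    print(candidate)

# The same seed yields the same candidates.
print(draws_a == draws_b)
```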

In [None]:
## Optimizing Machine Learning Models

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(random_state=42))
])

# Grid Search parameters
grid_params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.5, 0.75],
    'tfidf__min_df': [1, 2],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.1, 1.0],
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear'],
}

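# Randomized Search parameters
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low up to high - 1.
random_params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': uniform(0.5, 0.5),   # samples from [0.5, 1.0]
    'tfidf__min_df': randint(1, 3),       # samples 1 or 2
    'tfidf__use_idf': [True, False],
    'clf__C': uniform(0.1, 10.0),         # samples from [0.1, 10.1]
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear'],         # liblinear supports both l1 and l2
}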

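# Perform Grid Search
# The toy training set holds only four reviews, so cv=5 raises a ValueError.
# cv=2 is used here instead, assuming the split left both classes in the
# training set; use cv=5 (or more) on a realistic dataset.
# On folds this small, candidates with min_df=2 fail to fit and are scored
# as NaN with a warning.
grid_search = GridSearchCV(pipeline, grid_params, cv=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Perform Randomized Search
random_search = RandomizedSearchCV(pipeline, random_params, n_iter=100, cv=2, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)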
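
Once a search has been fitted, the usual next step is to look at what it found. The sketch below is self-contained and uses a hypothetical toy dataset of twelve short reviews plus a deliberately small grid. The names `toy_texts`, `toy_pipeline`, and `toy_search` are invented for this example, and the scores it prints say nothing about real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews: six positive, then six negative.
toy_texts = [
    "great movie", "wonderful acting", "excellent story",
    "loved this film", "amazing cinematography", "brilliant and fun",
    "terrible movie", "boring plot", "horrible acting",
    "waste of time", "bad direction", "dull and slow",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

toy_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=3)
toy_search.fit(toy_texts, toy_labels)

# Best combination and its mean cross-validated accuracy.
print(toy_search.best_params_)
print(toy_search.best_score_)

# With the default refit=True, the best pipeline is retrained on all the
# data passed to fit, so the search object can predict directly.
print(toy_search.predict(["wonderful story"]))
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on a fitted `RandomizedSearchCV`.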