In [None]:
!pip install pyLDAvis

The code performs several tasks related to text processing, natural language processing (NLP), and topic modeling. It primarily uses libraries such as NumPy, pandas, scikit-learn, Gensim, NLTK, spaCy, and pyLDAvis. Let's break down the code step by step:

1. **Importing Libraries:** This section imports various Python libraries that will be used throughout the script. These libraries include NumPy for numerical operations, pandas for data manipulation, and several NLP-related libraries such as Gensim, NLTK, spaCy, and scikit-learn for text processing and topic modeling. It also imports visualization libraries like matplotlib and pyLDAvis.

2. **Fetching the 20 Newsgroups Dataset:**
   - The code uses scikit-learn's `fetch_20newsgroups` function to load the 20 Newsgroups dataset, which is a collection of newsgroup documents grouped into categories.
   - It loads both the training and testing subsets of the dataset.
   - The `remove` parameter is used to remove specific parts of the text data, such as email headers, footers, and quotes, to prepare the data for further analysis.

3. **Creating DataFrames:**
   - After loading the dataset, the code creates two Pandas DataFrames, `news_train` and `news_test`, to organize the data. These DataFrames have two columns: 'news' (containing the text of the news articles) and 'class' (containing the class labels).

4. **Merging DataFrames:**
   - The code concatenates the training and testing DataFrames, `news_train` and `news_test`, into a single DataFrame called `df`. This concatenation combines both the training and testing data into one dataset.
   - The resulting `df` DataFrame is reset so that it has a continuous index.

5. **Displaying DataFrame Head:**
   - The script concludes by displaying the first few rows of the merged DataFrame `df` using the `df.head()` method. This is done to provide a preview of the loaded and merged dataset.

Overall, this code prepares the 20 Newsgroups dataset for further text analysis and topic modeling by loading, preprocessing, and merging the data into a convenient Pandas DataFrame structure. The dataset is now ready for additional text processing, topic modeling, and analysis tasks.

In [4]:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel


# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
#import pyLDAvis.sklearn
import matplotlib.pyplot as plt
# dataset
from sklearn.datasets import fetch_20newsgroups
# Import Dataset
# loading train dataset
news_group_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'))
news_group_data_train = news_group_train.data
news_group_target_names_train = news_group_train.target_names
news_group_target_train = news_group_train.target

# Creating a dataframe from the loaded data
news_train = pd.DataFrame({'news': news_group_data_train,
                        'class': news_group_target_train})

#Loading test data
news_group_test = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'))
news_group_data_test = news_group_test.data
news_group_target_names_test = news_group_test.target_names
news_group_target_test = news_group_test.target

# Creating a dataframe from the loaded data
news_test = pd.DataFrame({'news': news_group_data_test,
                        'class': news_group_target_test})
#Merging both dataset
frames = [news_train,news_test]
df = pd.concat(frames).reset_index(drop=True)
df.head()



     news                                                 class
0    I was wondering if anyone out there could enli...    7
1    A fair number of brave souls who upgraded thei...    4
2    well folks, my mac plus finally gave up the gh...    4
3    \nDo you have Weitek's address/phone number? ...     1
4    From article <C5owCB.n3p@world.std.com>, by to...    14


The code defines a custom transformer class called `CleanTransformer`, which cleans and lemmatizes the raw text before topic modeling. This class inherits from scikit-learn's `BaseEstimator` and `TransformerMixin` classes, so it can be used as a step in a scikit-learn pipeline.

Here's an explanation of the code:

1. **`fit` Method:**
   - The cleaning step has nothing to learn from the data, so `fit` simply returns the transformer itself.

2. **`transform` Method:**
   - The `transform` method takes the input text data `X` and performs the following steps:
     - Converts the input to a list of strings.
     - Uses regular expressions to remove email addresses, collapse runs of whitespace (including newlines) into single spaces, and remove single quotes.
     - Tokenizes each document with Gensim's `simple_preprocess`, which lowercases the text and drops punctuation.
     - Lemmatizes the tokens with spaCy's `en_core_web_sm` model, keeping only nouns, adjectives, verbs, and adverbs.
   - The method returns a list of token lists, one per document.



In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class CleanTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("Inside cleaning init")

    def fit(self,X,y=None):
        print("Inside cleaning pipe fit")
        return self



    def transform(self,X,y=None):
        print("Inside cleaning pipe transform")
        # Convert to list
        data = X.tolist()

        # Remove email addresses (raw strings avoid invalid escape sequence warnings)
        data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

        # Collapse newlines and other whitespace runs into single spaces
        data = [re.sub(r'\s+', ' ', sent) for sent in data]

        # Remove distracting single quotes
        data = [re.sub("'", "", sent) for sent in data]


        def sent_to_words(sentences):
            for sentence in sentences:
                # simple_preprocess lowercases, tokenizes, and drops punctuation; deacc=True also strips accent marks
                yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

        data_words = list(sent_to_words(data))

        def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
            """https://spacy.io/api/annotation"""
            texts_out = []
            for sent in texts:
                doc = nlp(" ".join(sent))
                texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
            return texts_out

        # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
        nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

        # Do lemmatization keeping only Noun, Adj, Verb, Adverb
        data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

        return(data_lemmatized)
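
The three regular-expression steps in `transform` can be tried on a single string to see what each one removes. The following is a minimal sketch; the helper name `clean_text` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

def clean_text(text):
    """Apply the three regex clean-up steps to one document."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@' (email addresses)
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace runs, including newlines
    text = re.sub("'", "", text)            # drop single quotes
    return text

# Made-up sample text, not taken from the dataset
sample = "Contact me at jane@example.com\nif you can't   find it."
cleaned = clean_text(sample)
print(cleaned)
```

Note that the email pattern also consumes one whitespace character after the address, so no double space is left behind.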



The code defines a custom transformer class called `LDATransformer`, which is designed to fit and transform text data using Latent Dirichlet Allocation (LDA) for topic modeling. This class inherits from scikit-learn's `BaseEstimator` and `TransformerMixin` classes and follows the scikit-learn estimator and transformer API conventions. Overall, this custom transformer (`LDATransformer`) encapsulates the process of fitting an LDA model to text data and transforming the data into topic distributions. It can be used within a scikit-learn pipeline for text analysis and feature engineering.

Here's an explanation of the code:

1. **Import Libraries:** The code imports scikit-learn's `CountVectorizer`, `BaseEstimator`, and `TransformerMixin`. `CountVectorizer` is not used in this cell, and the Gensim modules used for topic modeling were already imported earlier in the notebook.

2. **Initialize a scikit-learn LDA Model:** A scikit-learn `LatentDirichletAllocation` object is created with specific parameters, such as the number of topics (20), maximum iterations, learning method, random state, and batch size. The transformer does not use this object; it trains its own Gensim `LdaModel` instead.

3. **Define `LDATransformer` Class:**
   - The `LDATransformer` class is defined as a custom transformer.
   - It has `fit` and `transform` methods, which are required by scikit-learn transformers.

4. **`fit` Method:**
   - The `fit` method takes the input text data `X` and performs the following steps:
     - Creates a Gensim dictionary from the input text data.
     - Converts each document into a bag-of-words representation (raw term counts, not TF-IDF weights) using the dictionary's `doc2bow` method.
     - Builds an LDA model using Gensim, which learns the topics from the data.
   - The number of topics, alpha, eta, and other LDA model parameters are specified within this method.
   - The learned LDA model is stored as an attribute of the transformer.

5. **`transform` Method:**
   - The `transform` method takes the input text data `X` and performs the following steps:
     - Converts each input document into a bag-of-words representation (raw term counts, not TF-IDF weights) using the dictionary built in `fit`.
     - Extracts the topics for each document in the input data using the pre-trained LDA model.
     - Converts the extracted topic distributions into a Pandas DataFrame where each row represents a document and each column represents a topic. The values in the DataFrame indicate the weight of each topic for each document.
     - Finally, the method returns the topic distributions as a NumPy array.
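
For each document, the Gensim model returns a list of `(topic_id, weight)` pairs and leaves out topics with very small weight, which is why the result has to be expanded into a fixed-width row. The following is a minimal sketch of that expansion using plain NumPy; the helper name `topics_to_dense` and the toy input are hypothetical, and the notebook itself builds the rows with pandas instead.

```python
import numpy as np

def topics_to_dense(topics_per_doc, num_topics):
    """Turn per-document lists of (topic_id, weight) pairs into a dense matrix."""
    features = np.zeros((len(topics_per_doc), num_topics))
    for row, doc_topics in enumerate(topics_per_doc):
        for topic_id, weight in doc_topics:
            features[row, topic_id] = weight  # topics missing from the list stay at 0
    return features

# Made-up output for two documents and three topics
toy_topics = [[(0, 0.7), (2, 0.3)], [(1, 1.0)]]
dense = topics_to_dense(toy_topics, num_topics=3)
print(dense)
```

Each row has one column per topic, so the matrix can be passed directly to a classifier.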

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
count=0
lda_model=LatentDirichletAllocation(n_components=20,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
class LDATransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("Inside lda pipe init")

    def fit(self,X,y=None):
        print("Inside lda pipe fit")
        # Create Dictionary
        self.dictionary_LDA = corpora.Dictionary(X)

        # Term Document Frequency
        corpus = [self.dictionary_LDA.doc2bow(data_lemmatized) for data_lemmatized in X]

        # Build LDA model
        self.num_topics = 20
        self.lda_model = gensim.models.LdaModel(corpus, num_topics=self.num_topics, id2word=self.dictionary_LDA, passes=4, alpha=[0.01]*self.num_topics, \
                                           eta=[0.01]*len(self.dictionary_LDA.keys()))
#         for i,topic in self.lda_model.show_topics(formatted=True, num_topics=self.num_topics, num_words=20):
#             print(str(i)+": "+ topic)
#             print()
        return self

    def transform(self,X):
        print("Inside lda pipe transform")
        # Term Document Frequency
        corpus = [self.dictionary_LDA.doc2bow(data_lemmatized) for data_lemmatized in X]

        topics = [self.lda_model[corpus[i]] for i in range(len(X))]
        def topics_document_to_dataframe(topics_document, num_topics):
            res = pd.DataFrame(columns=range(num_topics))
            for topic_weight in topics_document:
                res.loc[0, topic_weight[0]] = topic_weight[1]
            return res
        features=pd.concat([topics_document_to_dataframe(topics_document, num_topics=self.num_topics) for topics_document in topics]) \
            .reset_index(drop=True).fillna(0)
        #print(features)
        return features.to_numpy()






The code snippet uses scikit-learn's `train_test_split` function to split a dataset into training and testing sets for a machine learning task. Here's an explanation of what the code does:

1. **Import the Necessary Library:**
   - The code begins by importing the `train_test_split` function from scikit-learn's `model_selection` module.

2. **Splitting the Dataset:**
   - The `train_test_split` function is called with the following arguments:
     - `df['news']`: This represents the feature data, which is the content of the news articles.
     - `df['class']`: This represents the target variable or labels, which indicate the class or category of each news article.
     - `test_size=0.33`: This specifies the proportion of the dataset that should be allocated to the testing set. In this case, 33% of the data is allocated to the testing set, while the remaining 67% is used for training.
     - `random_state=0`: This sets the random seed for reproducibility. Setting a specific random seed ensures that the split is the same every time the code is run, making the results reproducible.
     - `stratify=df['class']`: This parameter ensures that the class distribution in the original dataset is preserved in both the training and testing sets. It's particularly useful when dealing with imbalanced datasets.

3. **Output Variables:**
   - The `train_test_split` function returns four sets of data:
     - `X_train`: This is the training data, which contains the news article content for the training set.
     - `X_test`: This is the testing data, which contains the news article content for the testing set.
     - `y_train`: These are the labels or target values for the training set, indicating the class of each news article in the training data.
     - `y_test`: These are the labels or target values for the testing set, indicating the class of each news article in the testing data.

The purpose of splitting the dataset into training and testing sets is to enable the training of a machine learning model on one subset and the evaluation of its performance on another. This helps assess the model's generalization ability and provides a way to estimate how well it will perform on unseen data.
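
The effect of `stratify` is easiest to see on a small imbalanced example. The following is a sketch on made-up labels (80 documents of one class, 20 of another), not on the newsgroups data; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 documents of class 0 and 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
docs = np.array([f"doc {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# Count how many documents of each class ended up in each subset
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both subsets keep the original 4:1 class ratio, which a purely random split would only match approximately.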

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['news'],df['class'],test_size=0.33,random_state=0,stratify=df['class'])



The code constructs a scikit-learn pipeline named `pipe`. A scikit-learn pipeline is a way to streamline and automate the process of transforming and modeling data, making it easier to work with complex workflows. The pipeline allows you to apply a sequence of transformations to your data and then fit a classifier on the transformed data with a single call. It simplifies the process of defining, fitting, and evaluating machine learning models, especially in situations where data preprocessing is involved. Here's an explanation of the code:

1. **Import Necessary Libraries:**
   - The code imports the required libraries:
     - `Pipeline` from `sklearn.pipeline`: This class is used to create a machine learning pipeline that consists of multiple steps, such as data preprocessing, feature engineering, and modeling.
     - `SVC` (Support Vector Classifier) from `sklearn.svm`: This is a support vector machine classifier that will be used as the final classification model in the pipeline.

2. **Pipeline Initialization:**
   - The `Pipeline` class is initialized with a list of named steps, where each step is a tuple consisting of a name (a string) and an estimator (an object that implements the scikit-learn estimator interface).

3. **Pipeline Steps:**
   - The pipeline consists of the following steps:
     - `"CleanTransformer"`: This step is the custom `CleanTransformer` class defined earlier in the notebook. It cleans the raw text with regular expressions, tokenizes it, and lemmatizes it with spaCy.
     - `"LDATransformer"`: This step is the custom `LDATransformer` class defined earlier in the notebook. It fits an LDA model to the text data and transforms it into topic distributions.
     - `'svc'`: This step is the final classification model, which is an instance of the Support Vector Classifier (SVC) from scikit-learn. It is used for classifying the data after the text data has been preprocessed and transformed.

Please note that both `CleanTransformer` and `LDATransformer` must already be defined, by running the earlier cells, before the pipeline is constructed.
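
The same pattern (custom transformer, feature extraction step, classifier) can be shown on a toy problem that runs in a fraction of a second. This is a sketch under stated assumptions: `LowercaseTransformer` is a hypothetical stand-in for the cleaning step, `CountVectorizer` replaces the LDA step, and the four documents are made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a cleaning step: lowercases each document."""
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [doc.lower() for doc in X]

toy_pipe = Pipeline([
    ("lower", LowercaseTransformer()),
    ("counts", CountVectorizer()),      # word counts instead of topic weights
    ("svc", SVC(kernel="linear")),
])

# Made-up documents: class 0 is about space, class 1 is about hockey
toy_docs = ["Rockets and ORBIT", "orbit of the Moon", "Goalie saves PUCK", "puck hits the net"]
toy_labels = [0, 0, 1, 1]

toy_pipe.fit(toy_docs, toy_labels)
prediction = toy_pipe.predict(["The ORBIT of a rocket"])
print(prediction)
```

Calling `fit` on the pipeline runs `fit` and `transform` on every step except the last, then fits the classifier on the final features; `predict` reuses the fitted steps in the same order.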

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([("CleanTransformer", CleanTransformer()),("LDATransformer", LDATransformer()),('svc', SVC())])

Inside cleaning init
Inside lda pipe init




The code `pipe.fit(X_train, y_train)` fits the scikit-learn pipeline (`pipe`) to the training data. Here's what this code does:

1. `X_train`: This is the training data, a pandas Series of raw news article text. Cleaning and feature extraction happen inside the pipeline, so the text does not need to be preprocessed beforehand.

2. `y_train`: These are the labels or target values corresponding to the training data. They indicate the correct categories or classes for each data point in `X_train`.

When you call `pipe.fit(X_train, y_train)`, the following steps occur:

- The data in `X_train` is passed through the pipeline, where each step (transformer or estimator) is applied sequentially. This includes any data preprocessing, feature engineering, and modeling defined in the pipeline.

- The "CleanTransformer" step is applied first: it removes email addresses, extra whitespace, and single quotes, then tokenizes and lemmatizes each document.

- Next, the "LDATransformer" step is applied, where the LDA model is fitted to the text data in `X_train`, and the text is transformed into topic distributions.

- Finally, the Support Vector Classifier (SVC) model, defined as 'svc' in the pipeline, is trained on the transformed data.

- The labels in `y_train` are used to train the SVC model to learn the relationship between the text data and the corresponding class labels.

After calling `pipe.fit(X_train, y_train)`, the pipeline is now fully trained and ready to make predictions on new data or be evaluated on the test dataset. It encapsulates all the necessary preprocessing and modeling steps, making it convenient to work with machine learning workflows.

In [9]:
pipe.fit(X_train,y_train)



Inside cleaning pipe fit
Inside cleaning pipe transform
Inside lda pipe fit
Inside lda pipe transform


The code `pipe.score(X_test, y_test)` is used to calculate the accuracy score of the machine learning pipeline (`pipe`) on the test dataset (`X_test` and `y_test`). Specifically, it computes how well the pipeline's classifier (in this case, the Support Vector Classifier, SVC) performs in making predictions on the test data and comparing those predictions to the true labels.

Here's what the code does:

- `X_test`: This is the test data, a pandas Series of raw news article text. The pipeline applies the same cleaning and topic transformation that it learned during training before the classifier sees it.

- `y_test`: These are the true labels or target values corresponding to the test data. They indicate the correct categories or classes for each data point in `X_test`.

When you call `pipe.score(X_test, y_test)`, the following steps occur:

1. The pipeline (`pipe`) takes the test data (`X_test`) and performs the same sequence of transformations and predictions as during training.

2. The text data in `X_test` is preprocessed, including any data cleaning, topic modeling, or other transformations defined in the pipeline.

3. The trained Support Vector Classifier (SVC) model in the pipeline makes predictions on the preprocessed test data.

4. These predicted labels are then compared to the true labels (`y_test`) to compute the accuracy score.

5. The accuracy score represents the fraction of correctly classified data points in the test dataset. It's a common metric for classification tasks and provides an estimate of the model's performance.

The `pipe.score(X_test, y_test)` call returns a single accuracy score, indicating how accurately the pipeline's classifier predicts the class labels on the test data. The higher the accuracy score, the better the model's performance in classifying the test data.
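
Accuracy is simply the share of predictions that match the true labels. The following sketch computes it by hand on made-up labels; the arrays are illustrative and are not the notebook's predictions.

```python
import numpy as np

# Made-up true labels and predictions for five documents
y_true = np.array([3, 7, 7, 1, 0])
y_pred = np.array([3, 7, 2, 1, 5])

# Fraction of positions where the prediction equals the true label
accuracy = np.mean(y_pred == y_true)
print(accuracy)

# With 20 classes of roughly equal size, random guessing would score about 1/20
chance_level = 1 / 20
print(chance_level)
```

Comparing a score against the chance level gives a rough sense of how much the features help, since raw accuracy on a 20-class problem is hard to judge on its own.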

In [10]:
pipe.score(X_test, y_test)



Inside cleaning pipe transform
Inside lda pipe transform


0.4289389067524116