## Pre-Setup
Install the required libraries and the pre-trained spaCy model.

In [3]:
!pip install nltk spacy gensim scikit-learn numpy matplotlib




In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 33.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Step 1: Import Libraries and Setup
First, import the libraries that the rest of the NLP pipeline relies on:

In [5]:
# Import libraries
import nltk
import spacy
import gensim
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score

## Step 2: Download Necessary Resources
Download the NLTK data packages and the spaCy model used in the later steps.

In [6]:
# Download NLTK resources
nltk.download('punkt')  # Tokenizer
nltk.download('stopwords')  # Stopwords for text preprocessing
nltk.download('averaged_perceptron_tagger')  # POS tagging
nltk.download('maxent_ne_chunker')  # NER
nltk.download('wordnet')  # Corpus required by WordNetLemmatizer (used in Step 4)

# Download spaCy model (spacy is already imported in Step 1)
spacy.cli.download("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Step 3: Initialize spaCy Model
Load the pre-trained spaCy model for Named Entity Recognition (NER) and other NLP tasks.

In [7]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

## Step 4: Define Helper Functions
Let's create helper functions for three tasks: tokenization, word normalization (stemming or lemmatization), and named entity recognition.

In [8]:
# Tokenization (NLTK)

def tokenize_text(text):
    # Tokenize the text into words
    return word_tokenize(text)

# Word Normalization (Stemming and Lemmatization)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize_words(tokens, method='stemming'):
    # Normalize words using stemming or lemmatization
    if method == 'stemming':
        return [stemmer.stem(word) for word in tokens]
    elif method == 'lemmatization':
        return [lemmatizer.lemmatize(word) for word in tokens]
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown normalization method: {method!r}")

# Named Entity Recognition (NER) using SpaCy

def named_entity_recognition(text):
    # Perform NER using spaCy
    doc = nlp(text)
    entities = [(entity.text, entity.label_) for entity in doc.ents]
    return entities
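
To make the `'stemming'` branch of `normalize_words` concrete, here is a minimal sketch that runs NLTK's `PorterStemmer` on a few tokens. The token list is hypothetical and chosen only for illustration. The stemmer is rule-based, so this snippet does not depend on any downloaded NLTK data.

```python
from nltk.stem import PorterStemmer

# PorterStemmer applies suffix-stripping rules, so no nltk.download() is needed.
stemmer = PorterStemmer()

# Hypothetical tokens chosen for illustration.
tokens = ["Einstein", "developing", "theories", "running"]
stems = [stemmer.stem(token) for token in tokens]

# Print each token next to its stem to compare them side by side.
for token, stem in zip(tokens, stems):
    print(f"{token} -> {stem}")
```

By default the stemmer lowercases its input, and a stem is not always a dictionary word. This is one reason the pipeline in the next step passes the original text, not the normalized tokens, to spaCy for entity extraction.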

## Step 5: Text Preprocessing Pipeline
Now, let's create a preprocessing pipeline that tokenizes the text, normalizes the tokens, and extracts named entities from the original text.

In [9]:
def preprocess_text(text, normalization_method='stemming'):
    # Tokenize, normalize, and extract NER
    tokens = tokenize_text(text)
    normalized_tokens = normalize_words(tokens, method=normalization_method)
    entities = named_entity_recognition(text)
    return normalized_tokens, entities

## Step 6: Feature Extraction with TF-IDF
We'll use TF-IDF to convert sentences into numerical features for classification.

In [10]:
def extract_features(texts):
    # Convert a list of texts (sentences) into TF-IDF features
    tfidf_vectorizer = TfidfVectorizer()
    return tfidf_vectorizer.fit_transform(texts)
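
Note that `extract_features` fits a new `TfidfVectorizer` on every call and returns only the matrix, so the fitted vocabulary cannot be reused on sentences outside the original list. A minimal sketch of one way to keep it, assuming a hypothetical helper named `fit_tfidf` (not part of the original notebook) and two made-up training sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_tfidf(texts):
    # Hypothetical helper: return the fitted vectorizer together with the features
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)
    return vectorizer, features

# Made-up sentences for illustration
train_texts = [
    "Einstein developed the theory of relativity.",
    "He enjoyed playing the violin.",
]
vectorizer, X_train = fit_tfidf(train_texts)

# Reuse the fitted vocabulary on a sentence that was not seen during fitting
X_new = vectorizer.transform(["Einstein played the violin."])

print("Training matrix shape:", X_train.shape)
print("New sentence matrix shape:", X_new.shape)
```

Both matrices have the same number of columns because `transform` maps new text onto the vocabulary learned by `fit_transform`. Words absent from that vocabulary, such as "played" here, are ignored.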

## Step 7: Text Classification (Naive Bayes)
For simplicity, we use a Multinomial Naive Bayes classifier to label each sentence as summary-worthy or not.

In [11]:
def train_classifier(X_train, y_train):
    # Train a Naive Bayes classifier
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)
    return classifier
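
As a rough illustration of what `MultinomialNB` learns, the sketch below fits it on a tiny, hypothetical term-count matrix (two vocabulary terms, four documents) and inspects the class priors. The data and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term-count matrix: 4 documents, 2 vocabulary terms
X_toy = np.array([[2, 0], [0, 1], [0, 2], [1, 1]])
# One document in class 0, three in class 1
y_toy = np.array([0, 1, 1, 1])

toy_classifier = MultinomialNB()
toy_classifier.fit(X_toy, y_toy)

# The learned priors are the label frequencies in the training data
class_priors = np.exp(toy_classifier.class_log_prior_)
print("Class priors:", class_priors)

# Classify two new count vectors
toy_predictions = toy_classifier.predict(np.array([[3, 0], [0, 1]]))
print("Predictions:", toy_predictions)
```

Because the priors come straight from the label frequencies, a heavily imbalanced training set pushes the classifier toward the majority class unless the features provide strong evidence for the minority class.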

## Step 8: Evaluation Metrics
After classification, we evaluate the model's performance using precision, recall, and F1 score, computed for the positive (summary-worthy) class.

In [12]:
def evaluate_model(y_true, y_pred):
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("F1 Score:", f1_score(y_true, y_pred))
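
The three scores printed by `evaluate_model` follow from counts of true positives, false positives, and false negatives. As a minimal sketch of those formulas in plain Python, the hypothetical helper `precision_recall_f1` below computes them by hand on made-up labels. Returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count the outcomes for the positive class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true_example = [1, 0, 1, 1, 0]
y_pred_example = [1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true_example, y_pred_example)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```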


## Step 9: Example Workflow
Let’s bring it all together with a small example.

In [19]:
# Small synthetic dataset
sentences = [
    "Albert Einstein was born in Germany in 1879.",  # Summary-worthy
    "He is known for developing the theory of relativity.",  # Summary-worthy
    "Einstein was awarded the Nobel Prize in Physics in 1921.",  # Summary-worthy
    "He enjoyed playing the violin in his free time.",  # Not summary-worthy
    "Einstein made significant contributions to the understanding of quantum mechanics.",  # Summary-worthy
    "The theory of relativity revolutionized modern physics.",  # Summary-worthy
    "Einstein's work influenced the development of nuclear energy.",  # Summary-worthy
    "He was a pacifist and advocated for peace.",  # Summary-worthy
    "Einstein moved to the United States in the 1930s.",  # Summary-worthy
    "Albert Einstein had three children."  # Not summary-worthy
]

# Labels: 1 means summary-worthy, 0 means not
labels = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

# Extract TF-IDF features
X = extract_features(sentences)

# Train the Naive Bayes classifier
classifier = train_classifier(X, labels)

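# Make predictions on the same sentences used for training
# (there is no held-out data, so the scores below do not measure generalization)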
y_pred = classifier.predict(X)

# Print predictions and actual labels
print("Predictions:", y_pred)
print("Actual labels:", labels)

# Evaluate the model
evaluate_model(labels, y_pred)

Predictions: [1 1 1 1 1 1 1 1 1 1]
Actual labels: [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
Precision: 0.8
Recall: 1.0
F1 Score: 0.8888888888888888
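
In the output above, the classifier predicts 1 for every sentence, including the two labeled 0. One way to put the scores in context is to compare them with a majority-class baseline. The sketch below assumes scikit-learn's `DummyClassifier` as that baseline; it is not part of the original workflow, and the placeholder feature matrix is only there because `fit` expects one.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# Same labels as in the example workflow
labels = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

# The baseline ignores features, so a placeholder matrix is enough
placeholder_X = [[0] for _ in labels]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_X, labels)
baseline_pred = baseline.predict(placeholder_X)

baseline_precision = precision_score(labels, baseline_pred)
baseline_recall = recall_score(labels, baseline_pred)
baseline_f1 = f1_score(labels, baseline_pred)

print("Baseline predictions:", baseline_pred)
print("Baseline precision:", baseline_precision)
print("Baseline recall:", baseline_recall)
print("Baseline F1 score:", baseline_f1)
```

The baseline reaches the same precision (0.8), recall (1.0), and F1 score as the Naive Bayes model. On this small, imbalanced dataset, then, the reported scores do not show that the classifier learned to separate the two classes.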
