### Install spaCy model

## Vanessa Williams
## Week 7

In [None]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 53.2 MB/s eta 0:00:00

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Import Required Libraries

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

### Define Helper Functions

In [None]:
def text_summarizer(text):
    # Load the text into spaCy NLP pipeline
    doc = nlp(text)

    # Count how often each content word occurs, skipping stop words,
    # punctuation, and whitespace tokens (such as the newlines in the corpus)
    word_frequencies = {}
    for token in doc:
        word = token.text.lower()
        if word in STOP_WORDS or token.is_punct or token.is_space:
            continue
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Normalize frequencies by the max frequency
    if not word_frequencies:
        return ''  # nothing to score (empty or stop-word-only text)
    max_freq = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_freq

    # Score each sentence as the sum of the normalized frequencies of its words
    sentence_scores = {}
    for sent in doc.sents:
        for word in sent:
            freq = word_frequencies.get(word.text.lower())
            if freq is not None:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + freq

    # Keep the highest-scoring 30% of sentences, but always at least one
    from heapq import nlargest
    select_length = max(1, int(len(list(doc.sents)) * 0.3))
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

    final_summary = ' '.join([str(sentence) for sentence in summary])
    return final_summary

### Create Corpus

In [None]:
# Example corpus
corpus = """
Sherlock Holmes took his bottle from the corner of the mantelpiece and his hypodermic syringe from its neat morocco case.
With his long, white, nervous fingers, he adjusted the delicate needle, and rolled back his left shirt cuff.
For some little time his eyes rested thoughtfully upon the sinewy forearm and wrist, all dotted and scarred with innumerable puncture-marks.
Finally, he thrust the sharp point home, pressed down the tiny piston, and sank back into the velvet-lined arm-chair with a long sigh of satisfaction.
"""

### Summarize the Text

In [None]:
# Summarize the corpus
summary = text_summarizer(corpus)

# Print summary
print("Original Text:\n", corpus)
print("\nSummarized Text:\n", summary)

Original Text:
 
Sherlock Holmes took his bottle from the corner of the mantelpiece and his hypodermic syringe from its neat morocco case. 
With his long, white, nervous fingers, he adjusted the delicate needle, and rolled back his left shirt cuff. 
For some little time his eyes rested thoughtfully upon the sinewy forearm and wrist, all dotted and scarred with innumerable puncture-marks. 
Finally, he thrust the sharp point home, pressed down the tiny piston, and sank back into the velvet-lined arm-chair with a long sigh of satisfaction.


Summarized Text:
 Finally, he thrust the sharp point home, pressed down the tiny piston, and sank back into the velvet-lined arm-chair with a long sigh of satisfaction.



## Text Summarization Using spaCy in Jupyter Notebook

### Step 1: Install Required spaCy Model
To perform text summarization, I first installed the spaCy English model `en_core_web_sm` with the following notebook command:

```bash
!python3 -m spacy download en_core_web_sm
```

### Step 2: Load and Clean Corpus
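I loaded a sample text corpus (a passage about Sherlock Holmes) as a multi-line string. The code above does not run a separate regular-expression cleaning pass; instead, stop words and punctuation are filtered out token by token, and words are lowercased when their frequencies are counted.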
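
For reference, a regex-based cleaning pass could be sketched as follows. `clean_text` is a hypothetical helper that does not appear in the cells above. Note that stripping punctuation before sentence splitting would also remove the sentence boundaries the summarizer relies on, so a pass like this suits word-level counting only.

```python
import re

def clean_text(text):
    """Hypothetical helper: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean_text("With his long, white, nervous fingers, he adjusted the needle."))
# with his long white nervous fingers he adjusted the needle
```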

### Step 3: Process the Text Using spaCy
I processed the text with the spaCy model. The resulting `Doc` object exposes tokens, lemmas, part-of-speech tags, and dependencies; the summarizer above uses only the tokens and the sentence boundaries (`doc.sents`).

### Step 4: Summarization Logic
Lastly, I applied a simple frequency-based extractive approach: each sentence is scored by summing the normalized frequencies of the words it contains, and the highest-scoring 30% of sentences are kept as the summary.
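
The scoring inside `text_summarizer` can be illustrated without spaCy on toy data. The sketch below assumes the sentences are already tokenized with stop words removed; the sentence labels and word lists are made up for illustration and are not taken from the notebook output.

```python
from collections import Counter
from heapq import nlargest

# Toy, pre-tokenized sentences (hypothetical data; stop words already removed)
sentences = {
    "s1": ["holmes", "bottle", "syringe"],
    "s2": ["long", "fingers", "needle"],
    "s3": ["long", "sigh", "needle", "piston"],
}

# Word frequencies, normalized by the most frequent word
counts = Counter(word for words in sentences.values() for word in words)
max_freq = max(counts.values())
weights = {word: n / max_freq for word, n in counts.items()}

# A sentence's score is the sum of the weights of its words
scores = {name: sum(weights[w] for w in words) for name, words in sentences.items()}
top = nlargest(1, scores, key=scores.get)

print(scores)  # {'s1': 1.5, 's2': 2.5, 's3': 3.0}
print(top)     # ['s3']
```

Because the score is a sum, longer sentences tend to rank higher; dividing by sentence length would be one way to reduce that bias.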

#### Final Output:
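The summarized text provided a condensed version of the original corpus using the spaCy model. For the four-sentence sample, the summary is the single highest-scoring sentence, which is the last one in the passage.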
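
The one-sentence result follows from the 30% selection rule applied to four sentences. The sketch below compares the bare `int(n * 0.3)` calculation with a variant floored at one sentence; both function names are hypothetical and only illustrate the arithmetic.

```python
def bare_length(n_sentences, ratio=0.3):
    # Truncates toward zero, so very short texts can yield 0 sentences
    return int(n_sentences * ratio)

def floored_length(n_sentences, ratio=0.3):
    # Same rule, but never fewer than one sentence
    return max(1, int(n_sentences * ratio))

for n in (3, 4, 10):
    print(n, bare_length(n), floored_length(n))
# 3 0 1
# 4 1 1
# 10 3 3
```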