<a href="https://colab.research.google.com/github/fadedeeplearning/ML_projects/blob/main/wiki_extract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Here I've extracted (crawled) text from a Wikipedia page [wikipedia_text.txt]

In [1]:
import requests
from bs4 import BeautifulSoup

def scrape_wikipedia(url):
    # Fetch the content of the Wikipedia page (time out rather than hang)
    response = requests.get(url, timeout=30)
    # Fail loudly on an error response instead of silently writing nothing
    response.raise_for_status()
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the main content of the page
    content = soup.find(id='mw-content-text')
    if content is None:
        raise ValueError(f'No main content found at {url}')
    # Extract text from paragraphs
    paragraphs = content.find_all('p')
    # Write extracted text to a .txt file (overwrites any previous file)
    with open('wikipedia_text.txt', 'w', encoding='utf-8') as file:
        for p in paragraphs:
            file.write(p.get_text() + '\n')

# URL of the Wikipedia page you want to scrape
url = 'https://en.wikipedia.org/wiki/Annus_mirabilis_papers'
scrape_wikipedia(url)


### Tokenized it into sentences and saved them to a text file [tokenized_sentences.txt]

In [6]:
import nltk
nltk.download('punkt')  # Ensure punkt tokenizer is downloaded (newer NLTK releases need 'punkt_tab' instead)

from nltk.tokenize import sent_tokenize

def tokenize_sentences(input_file, output_file):
    # Read input text from file
    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()

    # Tokenize text into sentences
    sentences = sent_tokenize(text)

    # Write tokenized sentences into output file
    with open(output_file, 'w', encoding='utf-8') as f:
        for sentence in sentences:
            f.write(sentence + '\n')

# Example usage:
input_file = 'wikipedia_text.txt'  # Specify the input file name
output_file = 'tokenized_sentences.txt'  # Specify the output file name

tokenize_sentences(input_file, output_file)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Code to count the number of sentences in the tokenized file
### We need a minimum of 5,000 sentences from multiple Wikipedia pages.
### We also need a minimum of 10,000 lemmatized words, without duplicate meanings.

In [7]:
def count_sentences(tokenized_file):
    # Read tokenized sentences from file
    with open(tokenized_file, 'r', encoding='utf-8') as f:
        tokenized_sentences = f.readlines()

    # Count the number of sentences
    num_sentences = len(tokenized_sentences)

    return num_sentences

# Example usage:
tokenized_file = 'tokenized_sentences.txt'  # Specify the tokenized file name

num_sentences = count_sentences(tokenized_file)
print("Number of sentences:", num_sentences)


Number of sentences: 141
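
The single page above yields 141 lines, well short of the 5,000-sentence target, so text from several pages has to be collected into one file. Below is a minimal sketch of one way to do that. It assumes a hypothetical `fetch_paragraphs(url)` callable (not part of this notebook) that returns a page's paragraph strings, for example by wrapping the requests/BeautifulSoup logic of `scrape_wikipedia`. The output file is opened once for all pages, because `scrape_wikipedia` opens `wikipedia_text.txt` in `'w'` mode and would overwrite it on every call.

```python
def scrape_many(urls, fetch_paragraphs, out_path='wikipedia_text.txt'):
    """Write paragraph text from every URL into one file.

    fetch_paragraphs is a hypothetical helper: it takes a URL and
    returns a list of paragraph strings for that page.
    Returns the number of non-empty paragraphs written.
    """
    written = 0
    # Open the file once so later pages do not overwrite earlier ones
    with open(out_path, 'w', encoding='utf-8') as f:
        for url in urls:
            for text in fetch_paragraphs(url):
                text = text.strip()
                if text:  # skip empty paragraphs
                    f.write(text + '\n')
                    written += 1
    return written
```

The rest of the pipeline (`tokenize_sentences`, `count_sentences`) can then run unchanged on the combined file.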


### Cleaning and removing stopwords from the tokenized file

In [8]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def remove_stopwords(tokenized_file, cleaned_file):
    stop_words = set(stopwords.words('english'))

    with open(tokenized_file, 'r', encoding='utf-8') as f:
        tokenized_sentences = f.readlines()

    cleaned_sentences = []
    for sentence in tokenized_sentences:
        words = sentence.split()
        cleaned_words = [word for word in words if word.lower() not in stop_words]
        cleaned_sentence = ' '.join(cleaned_words)
        cleaned_sentences.append(cleaned_sentence)

    with open(cleaned_file, 'w', encoding='utf-8') as f:
        for sentence in cleaned_sentences:
            f.write(sentence + '\n')

# Example usage:
tokenized_file = 'tokenized_sentences.txt'  # Specify the tokenized file name
cleaned_file = 'cleaned_sentences.txt'  # Specify the cleaned file name

remove_stopwords(tokenized_file, cleaned_file)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
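
`remove_stopwords` compares whole whitespace-separated tokens, so a token with attached punctuation such as `all,` or `(the` is not recognized as a stop word. Below is a minimal sketch of a punctuation-aware comparison, assuming it is acceptable to strip leading and trailing ASCII punctuation before the lookup. The function name `filter_stopwords` and the tiny `stop_words` set are illustrative stand-ins; in the notebook the set would come from `stopwords.words('english')`.

```python
import string

def filter_stopwords(sentence, stop_words):
    """Drop tokens whose punctuation-stripped, lowercased form is a stop word."""
    kept = []
    for token in sentence.split():
        core = token.strip(string.punctuation).lower()
        if core not in stop_words:
            kept.append(token)  # keep the original token, punctuation included
    return ' '.join(kept)

# Stand-in for set(stopwords.words('english'))
stop_words = {'after', 'all', 'the'}
print(filter_stopwords('After all, the papers appeared.', stop_words))
# prints: papers appeared.
```

One trade-off: if the last token of a sentence is a stop word, its final period is dropped with it, and the next step would then treat that sentence as unfinished.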


### Removing the sentences that are unfinished

In [9]:
import string

def remove_unfinished_sentences(cleaned_file, finished_file):
    with open(cleaned_file, 'r', encoding='utf-8') as f:
        cleaned_sentences = f.readlines()

    finished_sentences = []
    for sentence in cleaned_sentences:
        # Keep the sentence only if its last character is punctuation
        # (any character in string.punctuation, not just . ! ?)
        if sentence.strip() and sentence.strip()[-1] in string.punctuation:
            finished_sentences.append(sentence)

    with open(finished_file, 'w', encoding='utf-8') as f:
        for sentence in finished_sentences:
            f.write(sentence)

# Example usage:
cleaned_file = 'cleaned_sentences.txt'  # Specify the cleaned file name
finished_file = 'finished_sentences.txt'  # Specify the finished file name

remove_unfinished_sentences(cleaned_file, finished_file)


### Rechecking the number of sentences in the finished file


In [10]:
# Example usage:
tokenized_file = 'finished_sentences.txt'  # This time, count the finished file

num_sentences = count_sentences(tokenized_file)
print("Number of sentences:", num_sentences)

Number of sentences: 111
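
The filter above accepts any character in `string.punctuation`, so a line ending in `,`, `:` or `]` also counts as finished. Below is a sketch of a stricter test. It assumes that only `.`, `!` and `?` end a sentence, optionally followed by closing quotes or brackets or by Wikipedia-style citation markers such as `[1]`. This rule is an assumption rather than what the notebook ran, so it may give a count different from 111.

```python
import re

# Trailing citation markers such as [1] or [2][3] (assumed Wikipedia style)
CITATION_TAIL = re.compile(r'(\[\d+\])+$')

def is_finished(sentence):
    """True if the sentence ends in . ! or ?, ignoring citations and closers."""
    text = CITATION_TAIL.sub('', sentence.strip())
    # Ignore spaces, closing quotes and closing brackets after the terminator
    text = text.rstrip(' "\')]')
    return bool(text) and text[-1] in '.!?'
```

Inside `remove_unfinished_sentences`, the condition could then read `if is_finished(sentence):`.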


### Next, I'll use a text file that contains lemmatized words [base grammar words] and compare the words in each sentence against those words.
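
A minimal sketch of that comparison, assuming the lemma list is a plain-text file with one base word per line. The function names `load_lemmas` and `split_by_vocabulary` are hypothetical. The step of reducing each sentence word to its base form (for example with a lemmatizer) is left out, so only words that already appear in base form will match.

```python
import string

def load_lemmas(path):
    """Read one base word per line into a set for fast membership tests."""
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

def split_by_vocabulary(sentence, lemmas):
    """Return (known, unknown) words of the sentence relative to the lemma set."""
    known, unknown = [], []
    for token in sentence.split():
        word = token.strip(string.punctuation).lower()
        if not word:
            continue  # token was punctuation only
        if word in lemmas:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown
```

Running `split_by_vocabulary` over each line of `finished_sentences.txt` would show which words are already covered by the lemma file and which are not.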