# Import necessary libraries

In [1]:
# Install third-party libraries (uncomment to run)
# !pip install nltk
# !pip install beautifulsoup4
# !pip install requests
# `re` ships with Python, so the separate `regex` package is not needed here

In [2]:
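import re
import nltk
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time download of the NLTK data used by word_tokenize and stopwords
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')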

# Fetch and Parse HTML

Use the requests library to fetch the HTML content of a webpage and then use BeautifulSoup to parse it.

In [3]:
# Replace with the URL of the webpage you want to scrape
url = "https://en.wikipedia.org/wiki/Duolingo"

In [4]:
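# Fail fast on network stalls and HTTP errors instead of parsing an error page
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.content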

In [5]:
soup = BeautifulSoup(html_content, 'html.parser')

# Extract Text Data

Once you have the parsed HTML, extract the relevant text data using various methods such as .find(), .find_all(), and .get_text().

In [6]:
# Example: Extracting all paragraphs
paragraphs = soup.find_all('p')

In [7]:
# Extracting text from each paragraph
paragraph_texts = [paragraph.get_text() for paragraph in paragraphs]
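
To make the difference between the three methods concrete, here is a minimal sketch on a small inline HTML string rather than a live page. The HTML and the `sample_*` names are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used here instead of a fetched page
sample_html = """
<html><body>
  <h1>Sample Title</h1>
  <p>First <b>paragraph</b>.</p>
  <p></p>
  <p>Second paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns the first matching tag, or None when nothing matches
title_tag = sample_soup.find("h1")
title_text = title_tag.get_text() if title_tag is not None else ""

# .find_all() returns a list of every matching tag
sample_paragraphs = sample_soup.find_all("p")

# .get_text() flattens a tag and its children into a single string
sample_texts = [p.get_text().strip() for p in sample_paragraphs]

print(title_text)
print(sample_texts)
```

Note that the empty `<p>` tag still produces an entry (an empty string), so real pages can yield empty paragraphs that need filtering later.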

# Text Preprocessing

Text preprocessing cleans and normalizes the extracted text in five steps: lowercasing, removing special characters, tokenizing, removing stopwords, and stemming.

In [8]:
# Convert to lowercase
lowercase_text = [text.lower() for text in paragraph_texts]

In [9]:
# Remove special characters using regex
cleaned_text = [re.sub(r'[^a-zA-Z0-9\s]', '', text) for text in lowercase_text]
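
The printed output at the end of this notebook contains tokens such as `music5` and `languages7`. These appear because the character filter deletes the brackets of footnote markers like `[5]` but keeps the digits, which then fuse with the preceding word. One possible fix, sketched below under the assumption that footnote markers are purely numeric, is to strip them before the filter runs. `strip_citations` and the example string are hypothetical and not part of the original pipeline.

```python
import re

# Matches bracketed numeric footnote markers such as [5] or [12]
CITATION_PATTERN = re.compile(r'\[\d+\]')

def strip_citations(text):
    """Remove bracketed numeric citation markers from a string."""
    return CITATION_PATTERN.sub('', text)

example = "courses in music,[5] math,[6] and 43 languages[7]"
without_citations = strip_citations(example)

# Same character filter as the cell above, applied after the markers are gone
cleaned_example = re.sub(r'[^a-zA-Z0-9\s]', '', without_citations)
print(cleaned_example)  # courses in music math and 43 languages
```

Non-numeric markers such as `[update]` are not matched by this pattern and would need their own rule.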

In [10]:
# Tokenization
tokenized_text = [word_tokenize(text) for text in cleaned_text]

In [11]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_text = [[word for word in tokens if word not in stop_words] for tokens in tokenized_text]

In [12]:
# Stemming
stemmer = PorterStemmer()
stemmed_text = [[stemmer.stem(word) for word in tokens] for tokens in filtered_text]
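
Stems produced by the Porter algorithm are not always dictionary words, which explains forms like `languag` and `compani` in the printed output. A small sketch with a hypothetical word list (`demo_words`) shows the effect in isolation:

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Hypothetical sample words chosen to show truncated stems
demo_words = ['language', 'company', 'learning']
demo_stems = [demo_stemmer.stem(word) for word in demo_words]
print(demo_stems)  # ['languag', 'compani', 'learn']
```

If readable output matters more than aggressive normalization, a lemmatizer is a common alternative, at the cost of extra NLTK data and slower processing.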

# Further Processing

You can perform additional steps such as removing empty tokens, converting the processed text back to sentences or paragraphs, and so on, based on your requirements.

In [13]:
# Remove empty tokens
final_text = [[word for word in tokens if word.strip()] for tokens in stemmed_text]

In [14]:
# Join each paragraph's tokens back into a single string
sentences = [' '.join(tokens) for tokens in final_text]

In [15]:
# Join the paragraph strings into one text, separated by blank lines
processed_paragraphs = '\n\n'.join(sentences)
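
Paragraphs that contain no text, or only stopwords, end up as empty strings in `sentences`, and joining them adds extra blank lines. This may be the source of the blank lines at the top of the printed output. A minimal sketch of skipping them while joining follows; `join_nonempty` and `sample_sentences` are hypothetical names used only for illustration.

```python
def join_nonempty(paragraph_strings, separator='\n\n'):
    """Join paragraph strings, skipping any that are empty or whitespace-only."""
    return separator.join(p for p in paragraph_strings if p.strip())

# Hypothetical stand-in for the `sentences` list built above
sample_sentences = ['', '', 'duolingo incb american educ', '', 'januari 2024updat duolingo']
joined = join_nonempty(sample_sentences)
print(repr(joined))
```

In the notebook this would replace the plain `'\n\n'.join(sentences)` call, for example `processed_paragraphs = join_nonempty(sentences)`.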

# Save Processed Text

Finally, you can save the processed text to a file for further analysis.

In [16]:
with open('processed_text.txt', 'w', encoding='utf-8') as file:
    file.write(processed_paragraphs)

In [17]:
# Print the processed text that was saved
print(processed_paragraphs)





duolingo incb american educ technolog compani produc learn app provid languag certif duolingo offer cours music5 math6 43 languages7 rang english french spanish less commonli studi languag welsh irish8 servic includ duolingo english test onlin certif program duolingo abc literaci app children compani use freemium model option premium servic super duolingo adfre offer featur

januari 2024updat duolingo world popular languag learn app base monthli download around 162 million user download month9 systemat review research duolingo 2012 2020 found compar studi platform efficaci languag learn review identifi sever studi report rel high user satisfact enjoy posit percept app effectiveness10 compani often recogn success market tactics1112

idea duolingo formul 2009 carnegi mellon univers professor lui von ahn swissborn postgradu student severin hacker1314 von ahn sold second compani recaptcha googl hacker want work educationrel project15 von ahn state saw expens peopl commun guatemala lear