In [None]:
# Natural Language Processing (text mining): text is unstructured data.
# Unstructured means there are no predefined rows, columns, or variables, so traditional
# models cannot be applied to it directly.

# Text must first be processed and converted into a structured form (features/variables)
# with a proper layout, after which traditional models will work.

# Text data comes from many kinds of documents: PDFs, Word docs, OCR files (scanned PDF images),
# web pages, social media posts/updates, databases (XML format), text from images,
# IPO documents (Red Herring Prospectus), etc.

Optical Character Recognition, or OCR, is a technology that allows you to convert different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data.

In [None]:
# Web scraping: extracting text from web pages. Web pages are typically HTML documents
# that contain much more than text: headings, page details, styling, fonts,
# etc.

# We scrape the text content only. Libraries used: requests, bs4, nltk.
# requests downloads the page from its URL
# bs4 parses the HTML and extracts the text content

In [None]:
import requests
from bs4 import BeautifulSoup

BeautifulSoup is a popular Python library used for web scraping, i.e., extracting data from HTML and XML pages.

What it does

Parses messy or poorly formatted HTML.

Lets you easily search, navigate, and extract parts of a webpage (like titles, tables, links, text, etc.).

Makes web scraping much simpler than manually using regex or raw string methods.

Why is it called BeautifulSoup?

The name is inspired by:

The idea of turning "messy" HTML into "beautiful" structured data that you can easily work with, like turning a messy soup into something clear.

A reference to the poem "Beautiful Soup" from Alice in Wonderland, where the Mock Turtle sings about soup.
The library's creator liked the whimsical reference, and it fit the concept of dealing with a "soup" of markup.

So the name is partly metaphorical (cleaning messy markup) and partly literary (Alice in Wonderland).

✅ 1. Requests Library

Requests is a Python library used to send HTTP requests to websites.

You use it to:

Download a webpage

Submit forms

Call APIs

Send GET/POST requests

BeautifulSoup is the class you actually use; bs4 is the package that contains it.

bs4 = the module/package

BeautifulSoup = the class inside that package

BeautifulSoup (part of the bs4 package) is used to parse HTML or XML and extract data easily.

It helps you:

Find tags (<h1>, <p>, <a>)

Extract text

Extract attributes (like href)

Navigate the DOM
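
The four operations listed above can be tried on a small inline page without any network access. This is a minimal sketch: the HTML snippet and the `demo_` variable names are made up for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
demo_html = """
<html>
  <head><title>Climate Notes</title></head>
  <body>
    <h1>Climate change</h1>
    <p>First paragraph with a <a href="https://example.com/a">link</a>.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_title = demo_soup.title.string                               # text inside <title>
demo_heading = demo_soup.find("h1").get_text()                    # find a tag
demo_paragraphs = [p.get_text() for p in demo_soup.find_all("p")] # extract text
demo_links = [a["href"] for a in demo_soup.find_all("a")]         # extract attributes
demo_parent = demo_soup.find("a").parent.name                     # navigate the tree

print(demo_title)
print(demo_heading)
print(demo_paragraphs)
print(demo_links)
print(demo_parent)
```

`find()` returns the first matching tag and `find_all()` returns a list of all matches, which is why the paragraph and link lines use a list comprehension.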

In [None]:
response = requests.get("https://www.britannica.com/science/climate-change/Climate-change-since-the-emergence-of-civilization", timeout=30)
response.raise_for_status() # Stop here if the server returned an error status (4xx/5xx) instead of the page
soup = BeautifulSoup(response.text, 'lxml') # Create a BeautifulSoup object from the webpage's HTML using the fast lxml parser

# BeautifulSoup(response.text, ...)
#   Creates a BeautifulSoup object from the HTML text you downloaded.
#   response.text = the HTML content returned by the website
#   BeautifulSoup(...) parses that HTML and turns it into a structured object you can search,
#   for example to find tags, extract text, or get links.

We send **HTTP requests to websites** because HTTP is the *standard way* to ask a web server for its data or page content.

Here's the simplest explanation:

---

# ✅ Why do we send HTTP requests to websites?

### **1. Websites store data on a server**

A website's content (HTML, images, text, etc.) is stored on a remote server.

Your computer **cannot access that data directly**.

### **2. HTTP is the "language" we use to communicate with the server**

When you open any website in your browser, the browser sends something like this behind the scenes:

```
GET /page HTTP/1.1
Host: example.com
```

This is an **HTTP GET request**, asking:

> "Please send me this webpage."

### **3. The server responds with the webpage**

The server replies with:

* HTML
* CSS
* Images
* JSON data
* Anything the website contains

Your browser displays it.
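
The request/response exchange described above can be observed end to end on your own machine. The sketch below is an illustration, not part of the scraping workflow: it starts a throwaway server on a free local port and sends it a GET request using only the standard library (`http.server` and `http.client`) rather than `requests`, so that it runs without internet access. The handler class, the function name, and the page content are all hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello from the server</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: send "GET /page" and read the response.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/page")
        resp = conn.getresponse()
        result = (resp.status, resp.getheader("Content-Type"), resp.read().decode("utf-8"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, page_html = fetch_from_local_server()
print(status)
print(content_type)
print(page_html)
```

The status code, the `Content-Type` header, and the body are the same three pieces of information that `requests` exposes as `response.status_code`, `response.headers`, and `response.text`.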

---

# ✅ Why do *we* (in Python) need to send HTTP requests?

Because we want to:

### ✔️ Download a webpage

(using `requests.get()`)

### ✔️ Scrape data from a webpage

(using BeautifulSoup)

### ✔️ Access an API (weather, maps, stock prices)

(using GET/POST requests)

### ✔️ Submit data

(logging in, sending a form, uploading info)

---

# 🔄 Summary (super simple)

* A website is stored on a server.
* To get anything from that server, we must **ask** using **HTTP requests**.
* The Requests library lets Python do what your browser normally does.



In [None]:
paragraphs = soup.find_all('p')
paragraphs_txt=[p.text for p in paragraphs]
# soup.find_all('p') searches the HTML for every <p> (paragraph) tag.
# The second line is a list comprehension: it loops over each <p> tag,
# extracts only its text (no HTML),
# and stores the results in a clean Python list.

In [None]:
# Text preprocessing - Cleaning up text using the re library. The re (regular expressions)
# library is used to identify text patterns and clean them.

# Patterns for emails, digits, words, spaces, word boundaries (start/end), etc. can be
# written with predefined regex syntax and used to clean text data.

# Text preprocessing involves removing punctuation, special characters, digits, extra spaces,
# emojis, hyperlinks, specific characters, etc.

![image.png](attachment:ed226b74-2beb-4118-9d91-67e4891c49e6.png)

In [None]:
import re
# The re library in Python is the built-in module for regular expressions.
#  Regular expressions are powerful patterns used for matching character combinations in strings.

The re.sub(pattern, "", ...) call in the cells below cleans the text using regular expressions.

Here's a breakdown:

re.sub(): A function from Python's re (regular expression) module. The name stands for "substitute": it finds every occurrence of a pattern in a string and replaces it with the given replacement string.

pattern: r'[^a-zA-Z0-9\s.]' is a negated character class. It matches any single character that is not a letter, a digit, whitespace, or a period.

So, the overall purpose of this line is to remove special characters, punctuation (excluding periods), and symbols from the text, leaving behind only letters, numbers, spaces, and periods, making the text cleaner for further analysis.
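
To see what the pattern removes, it helps to run it on one short string. This is a small sketch; the sample sentence is invented for illustration and the pattern is the one used in this notebook.

```python
import re

pattern = r"[^a-zA-Z0-9\s.]"  # anything that is NOT a letter, digit, whitespace, or period

# Hypothetical sample text, used only for illustration.
sample = "Email me at a.b@x.com! Cost: $5.50 (approx)"

cleaned_sample = re.sub(pattern, "", sample)
print(cleaned_sample)  # Email me at a.bx.com Cost 5.50 approx
```

Note that matched characters are deleted rather than replaced with a space, so `a.b@x.com` collapses into `a.bx.com`. Passing `" "` as the replacement instead would keep the pieces apart.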

In [None]:
pattern=r'[^a-zA-Z0-9\s.]' # inside [...], a leading ^ means NOT: match any character that is not a letter, digit, whitespace, or period

In [None]:
paragraphs_txt=re.sub(pattern,""," ".join(paragraphs_txt)) # join the paragraphs into one string, then delete every character that matches the pattern

# re.sub("pattern to replace", "replacement", data)

In [None]:
paragraphs_txt=re.sub(r'[0-9]+',"",paragraphs_txt) # Remove All Digits/numbers

In [None]:
paragraphs_txt=paragraphs_txt.lower()

# Convert text to lower case. String comparisons are case-sensitive, and predefined word lists
# (lexicons) are stored in lower case: "ind" != "Ind", but "ind" == "ind".
# A lexicon is simply a collection of words, usually with their meanings or uses.
# Think of a lexicon as a specialized dictionary.
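
The effect of case on lexicon lookups can be shown in a few lines. This is a sketch with a made-up three-word lexicon and made-up tokens; real lexicons are much larger.

```python
# Hypothetical lower-case lexicon and tokens, used only for illustration.
positive_lexicon = {"good", "great", "excellent"}
tokens = ["Great", "match", "GOOD", "innings"]

# Without lower-casing, nothing matches because "Great" != "great".
raw_hits = [t for t in tokens if t in positive_lexicon]

# After lower-casing, both words are found.
lower_hits = [t.lower() for t in tokens if t.lower() in positive_lexicon]

print(raw_hits)    # []
print(lower_hits)  # ['great', 'good']
```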

In [None]:
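# Tokenization - Break text into tokens (words) or sentences.
# Sentence tokenization - Breaking text into sentences. NLTK's sent_tokenize uses a trained
# model (Punkt) to find sentence boundaries, which usually fall at full stops.
# Word tokenization - Breaking text into words or tokens. word_tokenize splits on spaces
# and also separates punctuation into its own tokens.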
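
The idea behind both kinds of tokenization can be sketched with plain regular expressions. This is a deliberately naive stand-in, not NLTK's algorithm: it splits sentences after `.`, `!`, or `?` and treats runs of letters and digits as words. The sample text is invented.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Climate is changing. Sea levels are rising! What can we do?"

# Naive sentence split: break on whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word split: keep runs of letters and digits, drop punctuation.
naive_words = re.findall(r"[A-Za-z0-9]+", text)

print(naive_sentences)
print(naive_words)

# The naive rule breaks on abbreviations, which is one reason to prefer a trained tokenizer.
print(re.split(r"(?<=[.!?])\s+", "Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```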

In [None]:
# !pip install nltk
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

In [None]:
climatesentences=sent_tokenize(paragraphs_txt) # split the cleaned text into sentences for sentiment analysis

In [None]:
# Sentiment analysis is done on sentences: each sentence is labeled positive, negative, or
# neutral.
# Several sentiment models are available, such as VADER, the TextBlob polarity model, the
# Stanford sentiment model, NLTK's sentiment tools, etc.

# This notebook uses the TextBlob sentiment model, a popular and easy-to-use option. TextBlob
# provides 2 scores - a polarity score and a subjectivity score.
# The polarity score lies between -1 and 1 and is used for sentiment classification:
# values close to -1 indicate negative sentiment, values close to 1 indicate positive sentiment,
# and values around 0 indicate neutral sentiment.
# The subjectivity score lies between 0 and 1. Close to 1 means highly personal
# opinion (often involving adverbs and superlatives); close to 0 means little personal opinion.
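
Turning a polarity score into a label is a simple thresholding step. The sketch below is one possible version: `label_polarity` and its `neutral_band` parameter are hypothetical names, not part of TextBlob. With the default band of 0.0 only an exact 0 counts as neutral; a wider band treats scores "around 0" as neutral too.

```python
def label_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores within
    [-neutral_band, +neutral_band] are labeled Neutral.
    """
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"


print(label_polarity(0.8))                      # Positive
print(label_polarity(0.0))                      # Neutral
print(label_polarity(-0.3))                     # Negative
print(label_polarity(0.03, neutral_band=0.05))  # Neutral
```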

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It's often used for text from sources like Twitter, product reviews, and other informal texts.

Here are some key characteristics of VADER:

Lexicon-based: It uses a dictionary of words (lexicon) where each word is pre-scored for its sentiment (positive, negative, or neutral).
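
The lexicon idea can be illustrated with a toy scorer. This is not VADER: the four-word lexicon and the `toy_score` function are invented for illustration, and the score is just the average of the matched word scores. VADER's actual lexicon is far larger and it adds rules for things like negation, intensifiers, capitalization, and punctuation.

```python
# Hypothetical mini-lexicon: each word carries a pre-assigned sentiment score.
toy_lexicon = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}


def toy_score(sentence):
    """Average the lexicon scores of the words found in the sentence."""
    words = sentence.lower().split()
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_score("a good and great day"))                              # 0.75
print(toy_score("bad weather"))                                       # -0.5
print(toy_score("the food was great but the service was terrible"))   # 0.0
print(toy_score("sea levels are rising"))                             # 0.0 (no lexicon words)
```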

In [None]:
# !pip install textblob
from textblob import TextBlob

The TextBlob library is a user-friendly Python library designed for processing textual data, offering a simplified API for various Natural Language Processing (NLP) tasks. Its primary capabilities include sentiment analysis, which provides polarity (positive/negative) and subjectivity scores for text, as well as part-of-speech tagging, noun phrase extraction, tokenization, word inflection, and spelling correction, making it a versatile tool for quick text analysis and understanding.

In [None]:
TextBlob("Tendulkar is greatest batsman in Cricket").sentiment # Sentiment of sentence

Sentiment(polarity=1.0, subjectivity=1.0)

In [None]:
TextBlob("Tendulkar is great batsman").sentiment

Sentiment(polarity=0.8, subjectivity=0.75)

In [None]:
TextBlob("Tendulkar is most reputed cricketer").sentiment

Sentiment(polarity=0.5, subjectivity=0.5)

In [None]:
def analyze_sentiment(text):
    analysis=TextBlob(text)
    if analysis.sentiment.polarity>0:
        return "Positive"
    elif analysis.sentiment.polarity==0:
        return "Neutral"
    else:
        return "Negative"

In [None]:
import pandas as pd
climatesentences=pd.DataFrame(climatesentences,columns=['sentence'])

In [None]:
climatesentences['sentiment']=[str(analyze_sentiment(x)) for x in climatesentences['sentence']]

In [None]:
climatesentences['sentiment'].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1


In [None]:
climatesentences.head()

Unnamed: 0,sentence,sentiment


In [None]:
# NLP uses words or tokens as fundamental point of analysis
climatewords=word_tokenize(paragraphs_txt)

In [None]:
# isalnum() keeps only tokens made up entirely of letters and digits. Punctuation tokens are dropped
climatewords=[w for w in climatewords if w.isalnum()]

In [None]:
# Remove Stopwords. Stopwords are list of words like is, a, an, the, then, to, etc. that
# are not required for analysis
from nltk.corpus import stopwords

In [None]:
english_stopwords=set(stopwords.words("english"))

In [None]:
climatewords=[w for w in climatewords if w not in english_stopwords]

In [None]:
climatewords=[w for w in climatewords if len(w)>2] # Select words more than 2 characters
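
The three filtering steps above (alphanumeric check, stopword removal, minimum length) can be traced on a short token list. This is a sketch with an invented token list and a tiny stand-in stopword set; the notebook itself uses NLTK's English stopword list.

```python
# Hypothetical tokens and a tiny stand-in stopword set, used only for illustration.
toy_stopwords = {"the", "is", "a", "of", "and", "to", "in"}
toy_tokens = ["the", "climate", "of", "europe", "is", "changing", ",",
              "co", "and", "sea-level", "rise"]

# Step 1: keep tokens made only of letters/digits (drops "," and "sea-level").
step1 = [w for w in toy_tokens if w.isalnum()]

# Step 2: drop stopwords.
step2 = [w for w in step1 if w not in toy_stopwords]

# Step 3: keep words longer than 2 characters (drops "co").
step3 = [w for w in step2 if len(w) > 2]

print(step1)
print(step2)
print(step3)  # ['climate', 'europe', 'changing', 'rise']
```

Note that `isalnum()` also discards hyphenated tokens such as `sea-level`, which may or may not be what you want.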

In [None]:
from nltk.probability import FreqDist

In [None]:
wordfreq=FreqDist(climatewords) # most common or most frequent word

In [None]:
wordfreq.most_common(20)

[]
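
The empty list above means the word list was empty when this cell ran. To show what the output normally looks like, here is a small sketch that uses `collections.Counter` from the standard library as a stand-in for `FreqDist` (both offer a `most_common()` method). The word list is invented for illustration.

```python
from collections import Counter

# Hypothetical word list, used only for illustration.
toy_words = ["climate", "ice", "climate", "warming", "ice", "climate"]

toy_freq = Counter(toy_words)
top_two = toy_freq.most_common(2)

print(top_two)  # [('climate', 3), ('ice', 2)]
```

Each entry is a `(word, count)` pair, sorted from most to least frequent.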

In [None]:
# A word cloud is a visual representation of the most frequent words: a large font size means
# more frequent, a small font size means less frequent.

from wordcloud import WordCloud

In [None]:
wordcloud=WordCloud(width=1000,height=500,stopwords=english_stopwords,
                    colormap="plasma",max_words=200).generate(" ".join(climatewords)) # generate() expects one string of space-separated words

ValueError: We need at least 1 word to plot a word cloud, got 0.

In [None]:
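import matplotlib.pyplot as plt

plt.imshow(wordcloud, interpolation="bilinear") # draw the word cloud image
plt.axis("off") # hide the axes, which carry no meaning for a word cloud
plt.show()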