## Aim: Perform the steps involved in Text Analytics in Python & R

In [None]:
with open('text.txt', 'r') as file:
    text = file.read()

In [None]:
len(text.split(" "))  # rough word count: splits on single spaces only, not on newlines

169893
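
The count above depends on how the text is split. As a minimal sketch on a made-up sample (not the contents of `text.txt`), `split(" ")` breaks only on single spaces, so words separated by a newline stay joined, whereas `split()` with no argument breaks on any run of whitespace:

```python
# Hypothetical sample with newlines between speaker labels and lines.
sample = "Speak, speak.\nFirst Citizen:\n\nAll:"

by_space = sample.split(" ")   # splits on the space character only
by_whitespace = sample.split() # splits on spaces, tabs and newlines

print(by_space)
print(by_whitespace)
print(len(by_space), len(by_whitespace))
```

On a file with many line breaks the two counts can differ noticeably, so the figure reported above is best read as an approximation.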

In [None]:
# Preview the first 1000 characters (slicing a string yields characters, not lines)
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

**Tokenization**

In [None]:
sentences = sent_tokenize(text)
print("Sentences:", sentences)



In [None]:
words = word_tokenize(text)
print("Words:", words)



**Frequency Distribution**

In [None]:
fdist = FreqDist(words)
print("Frequency Distribution:", fdist)
print("\nMost common:", fdist.most_common(10))

Frequency Distribution: <FreqDist with 14310 samples and 254509 outcomes>

Most common: [(',', 19846), (':', 10316), ('.', 7858), ('the', 5441), ('I', 5013), ('to', 3974), ('and', 3761), (';', 3628), ('of', 3314), ('you', 2849)]


**Removing Stop Words**

In [None]:
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]

print("Filtered Words (stop words & punctuation removed):", filtered_words)



In [None]:
fdist = FreqDist(filtered_words)
print("Frequency Distribution:", fdist)
print("\nMost common:", fdist.most_common(10))

Frequency Distribution: <FreqDist with 11008 samples and 103492 outcomes>

Most common: [('thou', 1403), ('thy', 1059), ('king', 923), ('shall', 842), ('thee', 762), ('lord', 709), ('good', 662), ('come', 622), ('sir', 595), ('would', 534)]
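
`FreqDist` is a subclass of `collections.Counter`, so the filter-then-count step above can be sketched with the standard library alone. The stop-word set and token list below are small made-up stand-ins, not NLTK's English list or the tokens from `text.txt`:

```python
from collections import Counter

# Hypothetical miniature stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "to", "and", "of", "you", "i", "we", "it"}

def top_words(tokens, n=3):
    """Count lower-cased alphanumeric tokens that are not stop words."""
    kept = [t.lower() for t in tokens
            if t.isalnum() and t.lower() not in STOP_WORDS]
    return Counter(kept).most_common(n)

# Made-up token list that mixes words, punctuation and stop words.
tokens = ["Speak", ",", "speak", ".", "We", "know", "it", ",",
          "we", "know", "the", "king", "speak"]
print(top_words(tokens))
```

Punctuation tokens fail `isalnum()` and stop words are dropped after lower-casing, which is why the second frequency table above is led by content words rather than `,` and `the`.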


**Stemming**

In [None]:
porter_stemmer = PorterStemmer()
stemmed_words = [porter_stemmer.stem(word) for word in filtered_words]

print("Stemmed Words:", stemmed_words)

Stemmed Words: ['first', 'citizen', 'proceed', 'hear', 'speak', 'speak', 'speak', 'first', 'citizen', 'resolv', 'rather', 'die', 'famish', 'resolv', 'resolv', 'first', 'citizen', 'first', 'know', 'caiu', 'marciu', 'chief', 'enemi', 'peopl', 'first', 'citizen', 'let', 'us', 'kill', 'corn', 'price', 'verdict', 'talk', 'let', 'done', 'away', 'away', 'second', 'citizen', 'one', 'word', 'good', 'citizen', 'first', 'citizen', 'account', 'poor', 'citizen', 'patrician', 'good', 'author', 'surfeit', 'would', 'reliev', 'us', 'would', 'yield', 'us', 'superflu', 'wholesom', 'might', 'guess', 'reliev', 'us', 'human', 'think', 'dear', 'lean', 'afflict', 'us', 'object', 'miseri', 'inventori', 'particularis', 'abund', 'suffer', 'gain', 'let', 'us', 'reveng', 'pike', 'ere', 'becom', 'rake', 'god', 'know', 'speak', 'hunger', 'bread', 'thirst', 'reveng', 'second', 'citizen', 'would', 'proceed', 'especi', 'caiu', 'marciu', 'first', 'dog', 'commonalti', 'second', 'citizen', 'consid', 'servic', 'done', 'cou
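
Stems such as `resolv` are not dictionary words because a stemmer removes suffixes by rule rather than looking words up. The following is a deliberately crude suffix-stripper, far simpler than the Porter algorithm used above and not a substitute for it; the suffix list and the `crude_stem` name are illustrative assumptions:

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["resolved", "citizens", "speaking", "speak"]:
    print(w, "->", crude_stem(w))
```

A real stemmer applies many more rules and conditions, but the principle of truncating to a shared stem is the same.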

**Lemmatization**

In [None]:
lemmatizer = WordNetLemmatizer()
lemma = [lemmatizer.lemmatize(word) for word in filtered_words]

print("Lemmatized Words:", lemma)



**Parts Of Speech**

In [None]:
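# Tag the original tokens: the tagger and the NE chunker below rely on
# sentence context and capitalisation, both of which filtered_words has lost.
pos_tags = pos_tag(words)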

print("Part of Speech Tagging:", pos_tags)



**Named Entity Recognition**

In [None]:
ne_tree = ne_chunk(pos_tags)

print("Named Entity Recognition:", ne_tree)

In [None]:
len(set(lemma))

9603

**Web Scraping**

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
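url = 'https://www.nytimes.com/international/'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.text, 'html.parser')
website_text = soup.get_text()  # all text nodes, joined without separators
website_text = sent_tokenize(website_text)  # split the page text into sentences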

In [None]:
print(website_text)

['\n\n\n\nThe New York Times International - Breaking News, US News, World News, Videos\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to contentSkip to site indexSKIP ADVERTISEMENTSkip to contentSkip to site indexU.S.InternationalCanadaEspañol中文\xa0Today’s PaperU.S.SectionsU.S.PoliticsNew YorkCaliforniaEducationHealthObituariesScienceClimateSportsBusinessTechThe UpshotThe MagazineU.S.', 'Politics2024 ElectionsSupreme CourtCongressBiden AdministrationTop StoriesTrump InvestigationsImmigrationAbortionThe Eric Adams AdministrationNewslettersThe MorningMake sense of the day’s news and ideas.The UpshotAnalysis that explains politics, policy and everyday life.See all newslettersPodcastsThe DailyThe biggest stories of our time, in 20 minutes a day.The Run-UpOn the campaign trail with Astead Herndon.See all podcastsWorldSectionsWorldAfricaAmericasAsiaAustraliaCanadaEuropeMiddle EastScienceClimateHealthObituariesTop StoriesIsrael-Hamas WarRussia-Ukraine WarNewslettersMorning Briefing: Europ
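
In the output above, adjacent link labels run together ("Skip to contentSkip to site index") because the text nodes are joined without a separator. A minimal sketch of one way to get cleaner text, using only the standard library's `html.parser` on a made-up HTML snippet (the `VisibleTextParser` class and `visible_text` helper are hypothetical names, not part of BeautifulSoup):

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def visible_text(html):
    """Return the text nodes of `html`, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up page: a title, an embedded JSON script, two links and a paragraph.
page = ('<html><head><title>Demo</title>'
        '<script>{"@type": "WebPage"}</script></head>'
        '<body><a>Skip to content</a><a>Skip to site index</a>'
        '<p>Top stories.</p></body></html>')
print(visible_text(page))
```

Joining the chunks with a space keeps neighbouring labels apart, which should also give the sentence tokenizer cleaner input.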

### R

In [23]:
library(tokenizers)
library(tm)
library(stringr)
library(SnowballC)
library(udpipe)
library(rvest)
library(spacyr)

In [22]:
system("python3 -m spacy download en_core_web_sm")

In [10]:
text <- tolower(readLines("text.txt", warn = FALSE))
text


In [13]:
word_tokens <- unlist(tokenize_words(text))
word_tokens


In [16]:
word_freq <- table(word_tokens)
word_freq

word_tokens
               3                a        abandon'd            abase 
              27             3018                2                1 
           abate           abated            abbey            abbot 
               3                1                1                4 
            abed           abel's             abet            abhor 
               1                1                1                3 
        abhorr'd         abhorred        abhorring           abhors 
               5                2                1                1 
        abhorson            abide           abides        abilities 
              16                9                3                1 
         ability        ability's           abject          abjects 
               2                1                1                1 
         abjured             able           aboard            abode 
               1               11               10                2 
      abodements      

In [17]:
stop_words <- stopwords("en")  # separate name so tm's stopwords() is not masked
clean_tokens <- word_tokens[!(word_tokens %in% stop_words) & str_detect(word_tokens, "[A-Za-z]")]
clean_tokens


In [26]:
stem_tokens <- wordStem(clean_tokens)
print(stem_tokens)


In [30]:
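# spaCy must already be installed in the environment spacyr uses (e.g. via spacy_install());
# otherwise spacy_initialize() stops with "spaCy was not found in your environment".
spacy_initialize(model = "en_core_web_sm")
spacy_annotate <- spacy_parse(text)
named_entities <- entity_extract(spacy_annotate)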

“Note that we have deprecated a number of parameters to simplify this function”
“running command ''/root/.virtualenvs/r-spacyr/bin/python' -m pip freeze' had status 1”


ERROR: Error in spacy_initialize(model_path = "/usr/local/lib/python3.7/dist-packages/en_core_web_sm/en_core_web_sm-3.0.0"): spaCy was not found in your environment. Use `spacy_install()`to get started.


In [31]:
url <- "https://www.nytimes.com/international/"
webpage <- read_html(url)
text_from_website <- html_text(webpage)
print(text_from_website)

[1] "The New York Times International - Breaking News, US News, World News, Videos{\"@context\":\"https://schema.org\",\"@type\":\"WebPage\",\"image\":[{\"@context\":\"https://schema.org\",\"@type\":\"ImageObject\",\"url\":\"https://static01.nyt.com/vi-assets/images/share/1200x675_nameplate.png\",\"height\":675,\"width\":1200,\"contentUrl\":\"https://static01.nyt.com/vi-assets/images/share/1200x675_nameplate.png\",\"creditText\":\"The New York Times\"},{\"@context\":\"https://schema.org\",\"@type\":\"ImageObject\",\"url\":\"https://static01.nyt.com/vi-assets/images/share/1200x900_t.png\",\"height\":900,\"width\":1200,\"contentUrl\":\"https://static01.nyt.com/vi-assets/images/share/1200x900_t.png\",\"creditText\":\"The New York Times\"},{\"@context\":\"https://schema.org\",\"@type\":\"ImageObject\",\"url\":\"https://static01.nyt.com/vi-assets/images/share/1200x1200_t.png\",\"height\":1200,\"width\":1200,\"contentUrl\":\"https://static01.nyt.com/vi-assets/images/share/1200x1200_t.png\",\