
In [37]:
paragraph = """Natural Language Processing (NLP) is a rapidly advancing field at the intersection of computer science, artificial intelligence, and linguistics, focusing on the interaction between computers and human languages. The main objective of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful. This complex challenge involves several sub-tasks, including speech recognition, natural language understanding, and natural language generation.

One of the foundational aspects of NLP is text preprocessing, which involves cleaning and preparing text data for analysis. This step typically includes tasks such as tokenization, where text is broken down into individual words or phrases; stemming and lemmatization, which reduce words to their base or root form; and the removal of stop words, which are common words like "and", "the", and "in" that usually do not carry significant meaning. These preprocessing steps are crucial for reducing noise and ensuring that the text data is in a suitable format for subsequent analysis.

Another critical area in NLP is syntactic and semantic analysis. Syntactic analysis, or parsing, involves examining the grammatical structure of sentences. This helps in understanding how words are related to each other and the roles they play in a sentence. Semantic analysis, on the other hand, focuses on the meaning of the words and how they combine to form meaningful sentences. Techniques such as named entity recognition (NER), which identifies and classifies key elements in text into predefined categories like names of people, organizations, and locations, are part of this process.

Machine learning, particularly deep learning, has revolutionized NLP in recent years. Models like Word2Vec and GloVe have enabled the creation of dense vector representations of words, known as word embeddings, which capture semantic relationships between words. These embeddings have improved the performance of various NLP tasks by allowing algorithms to understand context and similarity in a more nuanced way. More recently, transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) have pushed the boundaries even further. These models, trained on vast amounts of text data, can perform a wide range of NLP tasks with remarkable accuracy. They utilize self-attention mechanisms to weigh the importance of different words in a sentence, enabling them to grasp context and nuances more effectively.

Applications of NLP are vast and varied, ranging from practical implementations like chatbots and virtual assistants to more sophisticated uses in sentiment analysis, machine translation, and text summarization. For instance, in customer service, chatbots powered by NLP can handle a large volume of queries, providing quick and accurate responses. In social media analysis, sentiment analysis tools can gauge public opinion by analyzing text data from platforms like Twitter and Facebook. Machine translation systems like Google Translate use NLP to break down language barriers by providing accurate translations between numerous languages.

Despite the significant advancements, NLP still faces several challenges. Language is inherently complex and ambiguous, with context, tone, and cultural nuances playing a significant role in communication. Additionally, models often require large datasets and significant computational resources to train effectively. Addressing these challenges requires ongoing research and innovation.

In summary, NLP is a dynamic and evolving field that seeks to bridge the gap between human language and computer understanding. Through advancements in machine learning and deep learning, it continues to make strides in enabling machines to process and generate human language with increasing sophistication and accuracy, opening up new possibilities and applications across various domains."""


In [1]:
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [12]:
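import nltk
# 'punkt' backs sent_tokenize/word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')
nltk.download('stopwords')  # stop-word lists, including 'english'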

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
stemer = PorterStemmer()  # Porter stemmer instance, reused by the stemming loop

In [39]:
sen=sent_tokenize(paragraph)
sen

['Natural Language Processing (NLP) is a rapidly advancing field at the intersection of computer science, artificial intelligence, and linguistics, focusing on the interaction between computers and human languages.',
 'The main objective of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful.',
 'This complex challenge involves several sub-tasks, including speech recognition, natural language understanding, and natural language generation.',
 'One of the foundational aspects of NLP is text preprocessing, which involves cleaning and preparing text data for analysis.',
 'This step typically includes tasks such as tokenization, where text is broken down into individual words or phrases; stemming and lemmatization, which reduce words to their base or root form; and the removal of stop words, which are common words like "and", "the", and "in" that usually do not carry significant meaning.',
 'These preprocessing steps a

In [10]:
len(sen)

27

In [11]:
type(sen)

list

In [13]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
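
Every entry in the list above is lowercase, and `word not in stopwords.words('english')` is an exact string comparison against a list. A minimal sketch of the two consequences follows. It uses a small hand-copied sample of the entries shown above rather than the full NLTK list (an assumption made so the block runs without the corpus), and the token list is invented for illustration.

```python
# A small sample copied from the output above; the real list is much longer.
sample_stop_words = ["i", "me", "my", "this", "that", "the", "and", "in", "is"]

# A set gives constant-time membership tests; a list is scanned linearly.
stop_word_set = set(sample_stop_words)

# Illustrative tokens (not taken from the notebook's paragraph).
tokens = ["This", "step", "is", "crucial", "in", "The", "pipeline"]

# Exact comparison misses capitalized stop words such as "This" and "The".
case_sensitive = [t for t in tokens if t not in stop_word_set]

# Lowercasing only for the lookup filters them while keeping the token text.
case_insensitive = [t for t in tokens if t.lower() not in stop_word_set]

print(case_sensitive)    # ['This', 'step', 'crucial', 'The', 'pipeline']
print(case_insensitive)  # ['step', 'crucial', 'pipeline']
```

With the real corpus, the same idea would read `stop_word_set = set(stopwords.words('english'))`.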

In [43]:
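# For each sentence: tokenize into words, drop stop words, stem what is left,
# then join the stems back into a single string.
# Note: the stop-word test is case-sensitive, so capitalized tokens such as
# "This" are not filtered. The loop also overwrites `sen` in place, so
# re-running it processes text that has already been stemmed.
stop_words = set(stopwords.words('english'))  # build once instead of once per word
for i in range(len(sen)):
    words = nltk.word_tokenize(sen[i])
    words = [stemer.stem(word) for word in words if word not in stop_words]
    sen[i] = ' '.join(words)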

In [44]:
print(sen)
len(sen)

['natur languag process ( nlp ) rapid advanc field intersect comput scienc , artifici intellig , linguist , focu interact comput human languag .', 'main object nlp enabl machin understand , interpret , gener human languag way mean use .', 'thi complex challeng involv sever sub-task , includ speech recognit , natur languag understand , natur languag gener .', 'one foundat aspect nlp text preprocess , involv clean prepar text data analysi .', 'thi step typic includ task token , text broken individu word phrase ; stem lemmat , reduc word base root form ; remov stop word , common word like `` `` , `` `` , `` `` usual carri signif mean .', 'preprocess step crucial reduc noi ensur text data suitabl format subsequ analysi .', 'anoth critic area nlp syntact semant analysi .', 'syntact analysi , par , involv examin grammat structur sentenc .', 'thi help understand word relat role play sentenc .', 'semant analysi , hand , focu mean word combin form mean sentenc .', 'techniqu name entiti recognit

27
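
Because the loop above writes its result back into `sen`, every re-run (and every later cell) works on text that has already been stemmed. A minimal sketch of a side-effect-free variant follows. The helper name `normalize_sentences`, the regex tokenizer, and the toy suffix-stripping stemmer are illustrative assumptions so that the block runs without NLTK data. In the notebook one would likely pass `nltk.word_tokenize`, `stemer.stem`, and `set(stopwords.words('english'))` instead.

```python
import re
from typing import Callable, Iterable, List, Set


def regex_tokenize(sentence: str) -> List[str]:
    """Stand-in tokenizer (assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def toy_stem(word: str) -> str:
    """Stand-in stemmer (assumption): lowercases and strips one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def normalize_sentences(
    sentences: Iterable[str],
    stop_words: Set[str],
    tokenize: Callable[[str], List[str]] = regex_tokenize,
    stem: Callable[[str], str] = toy_stem,
) -> List[str]:
    """Return new, normalized sentences; the input is left untouched."""
    result = []
    for sentence in sentences:
        kept = [
            stem(token)
            for token in tokenize(sentence)
            if token.lower() not in stop_words  # case-insensitive filter
        ]
        result.append(" ".join(kept))
    return result


raw = ["This step includes tasks.", "The models are large."]
stops = {"this", "the", "are"}
processed = normalize_sentences(raw, stops)
print(processed)  # ['step include task .', 'model large .']
print(raw)        # the original sentences, unchanged
```

Keeping the raw sentences intact means each stemmer or lemmatizer can be applied to the same input and the results compared directly.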

In [26]:
# Snowball ("Porter2") stemmer for English
from nltk.stem.snowball import SnowballStemmer
snowball = SnowballStemmer('english')

In [41]:
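# Same pipeline with the Snowball stemmer: tokenize, drop stop words, stem, re-join.
# Note: when the cells run top to bottom, `sen` already holds the output of the
# Porter loop, so this pass stems stemmed text rather than the original sentences.
# Re-run sent_tokenize(paragraph) first to compare the two stemmers on raw text.
stop_words = set(stopwords.words('english'))  # build once instead of once per word
for i in range(len(sen)):
    words = nltk.word_tokenize(sen[i])
    words = [snowball.stem(word) for word in words if word not in stop_words]
    sen[i] = ' '.join(words)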

In [30]:
print(sen)
len(sen)

['natur languag process ( nlp ) rapid advanc field intersect comput scienc , artifici intellig , linguist , focu interact comput human languag .', 'main object nlp enabl machin understand , interpret , gener human languag way mean use .', 'thi complex challeng involv sever sub-task , includ speech recognit , natur languag understand , natur languag gener .', 'one foundat aspect nlp text preprocess , involv clean prepar text data analysi .', 'thi step typic includ task token , text broken individu word phrase ; stem lemmat , reduc word base root form ; remov stop word , common word like `` `` , `` `` , `` `` usual carri signif mean .', 'preprocess step crucial reduc noi ensur text data suitabl format subsequ analysi .', 'anoth critic area nlp syntact semant analysi .', 'syntact analysi , par , involv examin grammat structur sentenc .', 'thi help understand word relat role play sentenc .', 'semant analysi , hand , focu mean word combin form mean sentenc .', 'techniqu name entiti recognit

27

In [34]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer
lem = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [45]:
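# Same pipeline with the WordNet lemmatizer: tokenize, drop stop words, lemmatize, re-join.
# pos='v' treats every token as a verb; the lemmatizer's default is pos='n'.
# Note: when the cells run top to bottom, `sen` holds stemmed tokens at this point,
# so only tokens that are still real word forms change (e.g. 'broken' -> 'break').
stop_words = set(stopwords.words('english'))  # build once instead of once per word
for i in range(len(sen)):
    words = nltk.word_tokenize(sen[i])
    words = [lem.lemmatize(word, pos='v') for word in words if word not in stop_words]
    sen[i] = ' '.join(words)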

In [46]:
sen

['natur languag process ( nlp ) rapid advanc field intersect comput scienc , artifici intellig , linguist , focu interact comput human languag .',
 'main object nlp enabl machin understand , interpret , gener human languag way mean use .',
 'thi complex challeng involv sever sub-task , includ speech recognit , natur languag understand , natur languag gener .',
 'one foundat aspect nlp text preprocess , involv clean prepar text data analysi .',
 'thi step typic includ task token , text break individu word phrase ; stem lemmat , reduc word base root form ; remov stop word , common word like `` `` , `` `` , `` `` usual carri signif mean .',
 'preprocess step crucial reduc noi ensur text data suitabl format subsequ analysi .',
 'anoth critic area nlp syntact semant analysi .',
 'syntact analysi , par , involv examin grammat structur sentenc .',
 'thi help understand word relat role play sentenc .',
 'semant analysi , hand , focu mean word combin form mean sentenc .',
 'techniqu name entiti