## Removing stopwords

In [3]:
paragraph = """Natural Language Processing (NLP) is a fascinating field of artificial intelligence that focuses on the interaction between humans and computers using natural language. 
               As the world becomes increasingly digital, the demand for systems that can understand and interpret human language continues to grow. 
               Whether it's chatbots answering customer queries or smart assistants helping users with daily tasks, NLP technologies are transforming the way we communicate with machines. 
               However, teaching computers to understand language is not easy. 
               Human language is full of ambiguity, context, and emotion. 
               Words can mean different things depending on how they’re used, and people often say one thing but mean another. 
               This complexity makes NLP a challenging yet rewarding area of research. 
               One of the fundamental steps in processing text is to clean and simplify it. 
               This often involves removing stopwords — common words like "the", "is", "in", and "at" — which may not add significant meaning to the analysis. 
               By filtering out these words, we can focus more on the important terms that carry meaning and insight. 
               With proper preprocessing and tools, NLP can unlock patterns and structures in text that are otherwise difficult to detect"""

In [2]:
from nltk.stem import PorterStemmer

In [4]:
from nltk.corpus import stopwords

In [5]:
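import nltk
nltk.download('stopwords')
# The tokenizers and the WordNet lemmatizer used in later cells need their own
# data, e.g. nltk.download('punkt') (or 'punkt_tab' on recent NLTK releases)
# and nltk.download('wordnet').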

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\upama\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [6]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on
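The English list contains the negation words 'no', 'nor' and 'not'. Dropping them can invert the meaning of a sentence, so it is sometimes worth removing them from the stopword set before filtering. The following is a minimal sketch, assuming a small hand-picked subset of NLTK's English list so that it runs without the corpus; `STOP_WORDS`, `NEGATIONS` and `remove_stopwords` are illustrative names, not part of NLTK.

```python
# Hand-picked subset of NLTK's English stopwords (the real list is longer).
STOP_WORDS = {"a", "an", "and", "is", "it", "no", "nor", "not", "of", "to"}

# Assumption: negations should be kept because they change the meaning.
NEGATIONS = {"no", "nor", "not"}


def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words (exact string match)."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["teaching", "computers", "to", "understand",
          "language", "is", "not", "easy"]

default_result = remove_stopwords(tokens, STOP_WORDS)
custom_result = remove_stopwords(tokens, STOP_WORDS - NEGATIONS)

print(default_result)  # 'not' is removed along with 'to' and 'is'
print(custom_result)   # 'not' is kept
```

With the full NLTK list the same idea would be `set(stopwords.words('english')) - NEGATIONS`.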

In [8]:
stopwords.words('russian')

['и',
 'в',
 'во',
 'не',
 'что',
 'он',
 'на',
 'я',
 'с',
 'со',
 'как',
 'а',
 'то',
 'все',
 'она',
 'так',
 'его',
 'но',
 'да',
 'ты',
 'к',
 'у',
 'же',
 'вы',
 'за',
 'бы',
 'по',
 'только',
 'ее',
 'мне',
 'было',
 'вот',
 'от',
 'меня',
 'еще',
 'нет',
 'о',
 'из',
 'ему',
 'теперь',
 'когда',
 'даже',
 'ну',
 'вдруг',
 'ли',
 'если',
 'уже',
 'или',
 'ни',
 'быть',
 'был',
 'него',
 'до',
 'вас',
 'нибудь',
 'опять',
 'уж',
 'вам',
 'ведь',
 'там',
 'потом',
 'себя',
 'ничего',
 'ей',
 'может',
 'они',
 'тут',
 'где',
 'есть',
 'надо',
 'ней',
 'для',
 'мы',
 'тебя',
 'их',
 'чем',
 'была',
 'сам',
 'чтоб',
 'без',
 'будто',
 'чего',
 'раз',
 'тоже',
 'себе',
 'под',
 'будет',
 'ж',
 'тогда',
 'кто',
 'этот',
 'того',
 'потому',
 'этого',
 'какой',
 'совсем',
 'ним',
 'здесь',
 'этом',
 'один',
 'почти',
 'мой',
 'тем',
 'чтобы',
 'нее',
 'сейчас',
 'были',
 'куда',
 'зачем',
 'всех',
 'никогда',
 'можно',
 'при',
 'наконец',
 'два',
 'об',
 'другой',
 'хоть',
 'после',
 'на

In [9]:
stemmer = PorterStemmer()

In [24]:
sentences = nltk.sent_tokenize(paragraph)

In [12]:
type(sentences)

list

In [14]:
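# Drop English stopwords from each sentence, then stem the remaining tokens.
# Build the stopword set once instead of once per token.
# The check is case-sensitive, so capitalized stopwords ("As", "By") are kept.
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the tokens back into one string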

In [15]:
sentences

['natur languag process ( nlp ) fascin field artifici intellig focus interact human comput use natur languag .',
 'as world becom increasingli digit , demand system understand interpret human languag continu grow .',
 "whether 's chatbot answer custom queri smart assist help user daili task , nlp technolog transform way commun machin .",
 'howev , teach comput understand languag easi .',
 'human languag full ambigu , context , emot .',
 'word mean differ thing depend ’ use , peopl often say one thing mean anoth .',
 'thi complex make nlp challeng yet reward area research .',
 'one fundament step process text clean simplifi .',
 "thi often involv remov stopword — common word like `` '' , `` '' , `` '' , `` '' — may add signific mean analysi .",
 'by filter word , focu import term carri mean insight .',
 'with proper preprocess tool , nlp unlock pattern structur text otherwis difficult detect']
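In the output above, 'as', 'by' and 'with' survive even though they are on the English stopword list. The original tokens were 'As', 'By' and 'With', and the membership test compares strings exactly. A minimal sketch of a case-insensitive check follows, assuming a small hand-picked stopword subset and a pre-tokenized sentence so that it runs without NLTK data:

```python
# Hand-picked subset of NLTK's English stopwords, enough for this sentence.
STOP_WORDS = {"by", "out", "these", "we", "can", "more",
              "on", "the", "that", "and"}

# Pre-tokenized version of one sentence from the paragraph.
tokens = ["By", "filtering", "out", "these", "words", ",", "we", "can",
          "focus", "more", "on", "the", "important", "terms", "that",
          "carry", "meaning", "and", "insight", "."]

# Exact match: "By" is not equal to "by", so it is kept.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercase each token only for the comparison; the kept tokens are unchanged.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive[0])    # By
print(case_insensitive[0])  # filtering
```

Lowercasing only inside the comparison leaves the surviving tokens in their original case, which matters if a later step (for example lemmatization) should see the original text.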

In [21]:
from nltk.stem import SnowballStemmer
snowballstemmer = SnowballStemmer('english') 

In [22]:
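# Start again from the raw paragraph; the Porter loop overwrote `sentences`.
sentences = nltk.sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowballstemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the tokens back into one string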

In [23]:
sentences

['natur languag process ( nlp ) fascin field artifici intellig focus interact human comput use natur languag .',
 'as world becom increas digit , demand system understand interpret human languag continu grow .',
 "whether 's chatbot answer custom queri smart assist help user daili task , nlp technolog transform way communic machin .",
 'howev , teach comput understand languag easi .',
 'human languag full ambigu , context , emot .',
 'word mean differ thing depend ’ use , peopl often say one thing mean anoth .',
 'this complex make nlp challeng yet reward area research .',
 'one fundament step process text clean simplifi .',
 "this often involv remov stopword — common word like `` '' , `` '' , `` '' , `` '' — may add signific mean analysi .",
 'by filter word , focus import term carri mean insight .',
 'with proper preprocess tool , nlp unlock pattern structur text otherwis difficult detect']
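The Porter and Snowball outputs above differ in only a few tokens. A short sketch that compares the two stemmers side by side on those words; it assumes only that NLTK is installed (neither stemmer needs a data download), and the word list is taken from the paragraph:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words whose stems differ between the two outputs shown above.
words = ["increasingly", "communicate", "focus", "this"]

# Map each word to its (Porter stem, Snowball stem) pair.
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in words}

for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:<14}{porter_stem:<14}{snowball_stem}")
```

On these words Snowball tends to produce the more readable stem ('focus' rather than 'focu', 'this' rather than 'thi').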

In [25]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [27]:
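# Start again from the raw paragraph so the lemmatizer sees unstemmed words.
sentences = nltk.sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the tokens back into one string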

In [28]:
sentences

['Natural Language Processing ( NLP ) fascinating field artificial intelligence focus interaction human computer using natural language .',
 'As world becomes increasingly digital , demand system understand interpret human language continues grow .',
 "Whether 's chatbots answering customer query smart assistant helping user daily task , NLP technology transforming way communicate machine .",
 'However , teaching computer understand language easy .',
 'Human language full ambiguity , context , emotion .',
 'Words mean different thing depending ’ used , people often say one thing mean another .',
 'This complexity make NLP challenging yet rewarding area research .',
 'One fundamental step processing text clean simplify .',
 "This often involves removing stopwords — common word like `` '' , `` '' , `` '' , `` '' — may add significant meaning analysis .",
 'By filtering word , focus important term carry meaning insight .',
 'With proper preprocessing tool , NLP unlock pattern structure te