# 1. Understanding NLP with Python

## 1. Pandas basics

In [7]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [8]:
data=pd.read_csv("flight_data.csv")
data.head()

Unnamed: 0,YEAR,MONTH,DAY,CARRIER,ORIGIN,DEST,SCHED_DEP_TIME,ACT_DEP_TIME,DEP_DELAY,SCHED_ARR_TIME,ACT_ARR_TIME,ARR_DELAY
0,2019,7,24,G4,PIE,AVL,1511,1533.0,22.0,1644,1659.0,15.0
1,2019,7,29,G4,AUS,SFB,2002,2010.0,8.0,2335,2344.0,9.0
2,2019,7,7,G4,GRI,LAS,1118,1118.0,0.0,1144,1139.0,-5.0
3,2019,7,7,G4,AUS,MEM,1643,1726.0,43.0,1827,1922.0,55.0
4,2019,7,8,G4,IND,PIE,858,905.0,7.0,1107,1119.0,12.0


#### Q1. Which airport has the longest average departure delay?

In [9]:
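# Select the numeric column before averaging; averaging text columns such as CARRIER fails in recent pandas
data.groupby("ORIGIN")["DEP_DELAY"].mean().idxmax()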

'PPG'

So it turns out that a remote airport, Pago Pago International Airport in American Samoa, had the longest average departure delay recorded in July 2019.
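A single `idxmax()` hides how many flights stand behind each average, and a small airport with a handful of late flights can easily top the list. The sketch below uses a made-up miniature table (the airport codes and delays are hypothetical, not taken from `flight_data.csv`) to show the same group-then-average pattern together with a ranking and a per-airport flight count.

```python
import pandas as pd

# Hypothetical miniature of the flight table; all values are invented for illustration.
flights = pd.DataFrame({
    "ORIGIN": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "CARRIER": ["G4", "G4", "G4", "G4", "G4"],
    "DEP_DELAY": [10.0, 20.0, 5.0, None, 40.0],
})

# Pick the numeric column first, then aggregate, so text columns are never averaged.
mean_delay = flights.groupby("ORIGIN")["DEP_DELAY"].mean()

worst_origin = mean_delay.idxmax()                  # index label of the largest mean
ranking = mean_delay.sort_values(ascending=False)   # every airport, largest mean first
# count() ignores missing delays, so it shows how many flights each mean rests on.
flights_per_origin = flights.groupby("ORIGIN")["DEP_DELAY"].count()

print(worst_origin)
print(ranking)
print(flights_per_origin)
```

Here the airport with the highest mean is also one with a single recorded flight, which is worth checking before reading much into the result.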

## 2. Using the scikit-learn library to perform vectorization

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
sentence = ["How to change payment method and payment frequency"]
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(sentence).todense()

matrix([[1, 1, 1, 2]], dtype=int64)

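This is an example of how a sentence comprehension task can be turned into a linear algebra problem: after stop words are dropped, the sentence becomes a vector of word counts, one column per remaining word.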
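To see where `[1, 1, 1, 2]` comes from, the counting can be reproduced by hand. This is a plain-Python sketch of the idea, not scikit-learn's implementation; the three-word stop list is a hypothetical stand-in that covers only this sentence, whereas the library's English list is much longer.

```python
from collections import Counter

sentence = "How to change payment method and payment frequency"

# Hypothetical stop list covering only the words needed for this sentence.
stop_words = {"how", "to", "and"}

# Lowercase, split on whitespace, and drop the stop words.
tokens = [word for word in sentence.lower().split() if word not in stop_words]

counts = Counter(tokens)
vocabulary = sorted(counts)                       # columns in alphabetical order
vector = [counts[term] for term in vocabulary]    # one count per column

print(vocabulary)
print(vector)
```

The last column is `payment`, which occurs twice in the sentence, hence the 2.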

## 3. Using NLTK library for NLP

In [18]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [19]:
from nltk.tokenize import word_tokenize

text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens = word_tokenize(text)
print(tokens)

['Who', 'would', 'have', 'thought', 'that', 'computer', 'programs', 'would', 'be', 'analyzing', 'human', 'sentiments']


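We have tokenized the preceding sentence using NLTK's word_tokenize() function. For this sentence the result is the same as splitting on white space, but word_tokenize() also separates punctuation and contractions into their own tokens. The output is a list of tokens, which is the first step toward vectorization.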
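The difference from plain whitespace splitting shows up as soon as a sentence contains punctuation. The sketch below uses a simple regular expression as a stand-in tokenizer; it is not the algorithm NLTK uses, and the example sentence is made up.

```python
import re

text = "Who knew? Programs analyze sentiments, apparently."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `knew?` and `knew` would count as different words in any later vectorization step.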

In [21]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [22]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

The preceding output is a partial list of the English stop words in NLTK. Stop words are mostly connector words that do not contribute much to the meaning of a sentence.

In [23]:
newtokens=[word for word in tokens if word not in stopwords]

In [24]:
newtokens

['Who',
 'would',
 'thought',
 'computer',
 'programs',
 'would',
 'analyzing',
 'human',
 'sentiments']

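Since NLTK provides us with a list of stop words, we can simply look up this list and filter the stop words out of our token list, as done above. Note that 'Who' survives the filter: the stop word list is lowercase and the comparison is case-sensitive.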
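A case-insensitive variant of the filter is a one-word change. The sketch below hard-codes a four-word excerpt of the NLTK list printed earlier so that it runs without any download; with the full list, `stop_words` would simply be `set(stopwords)`.

```python
tokens = ['Who', 'would', 'have', 'thought', 'that', 'computer', 'programs',
          'would', 'be', 'analyzing', 'human', 'sentiments']

# Small excerpt of the NLTK English stop word list; a set gives fast membership tests.
stop_words = {"who", "have", "that", "be"}

# Case-sensitive comparison: 'Who' is kept because only 'who' is in the set.
case_sensitive = [word for word in tokens if word not in stop_words]

# Lowercasing only for the comparison removes 'Who' and keeps the original casing elsewhere.
case_insensitive = [word for word in tokens if word.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```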

We can further normalize our token list by using lemmatization and stemming, two techniques that reduce words to their root form.

The following code snippet shows an example of performing lemmatization using the NLTK library's WordNetLemmatizer class:

In [27]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [31]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [32]:
from nltk.stem import WordNetLemmatizer
text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
tokens=[lemmatizer.lemmatize(word) for word in tokens]
print(tokens)

['Who', 'would', 'have', 'thought', 'that', 'computer', 'program', 'would', 'be', 'analyzing', 'human', 'sentiment']


The lemmatizer looks each word up in WordNet and returns the word unchanged when it finds no root form. By default, lemmatize() treats every word as a noun (pos='n'), which is why the plurals "programs" and "sentiments" are reduced here while the verb form "analyzing" is left as it is. Its results therefore depend both on WordNet's coverage and on being told the right part of speech.

Stemming is similar to lemmatization but instead of looking up root words in a pre-built dictionary, it defines some rules based on which words are reduced to their root form. For example, it has a rule that states that any word with ing as a suffix will be reduced by removing the suffix.
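The rule-based idea can be illustrated with a toy suffix stripper. This is only a sketch of the principle with two hypothetical rules and a minimum stem length; the Porter algorithm applies many more rules in several ordered steps and will give different results on other words.

```python
def strip_suffix(word, suffixes=("ing", "s"), min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied


words = ["analyzing", "programs", "sentiments", "thing", "is"]
stems = [strip_suffix(word) for word in words]
print(stems)
```

The minimum stem length keeps short words such as "thing" and "is" from being cut down to fragments, which is the kind of guard a purely rule-based approach needs because it has no dictionary to consult.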

The following code performs stemming using the NLTK library's PorterStemmer class:

In [35]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens=word_tokenize(text.lower())
ps = PorterStemmer()
tokens=[ps.stem(word) for word in tokens]
print(tokens)

['who', 'would', 'have', 'thought', 'that', 'comput', 'program', 'would', 'be', 'analyz', 'human', 'sentiment']


Here "analyzing" is reduced to "analyz" and "computer" to "comput". Unlike lemmas, stems do not have to be dictionary words.

## 4. Part of Speech tagging

POS tagging assigns a word type (noun, verb, adverb, etc.) to each word in a sentence. It matters for NLP because the same word can play different roles depending on its context.

In [37]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [39]:
nltk.pos_tag(["your"])

[('your', 'PRP$')]

In [40]:
nltk.pos_tag(["beautiful"])

[('beautiful', 'NN')]

In [41]:
nltk.pos_tag(["eat"])

[('eat', 'NN')]

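The following code tags the tokens of a sentence one at a time. Because the tagger then sees each word without its neighbors, isolated words tend to fall back to NN, as "beautiful" and "eat" did above; passing the whole token list in a single call, nltk.pos_tag(tokens), lets the tagger use the surrounding context.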

In [42]:
text = "Usain Bolt is the fastest runner in the world"
tokens = word_tokenize(text)
[nltk.pos_tag([word]) for word in tokens]

[[('Usain', 'NN')],
 [('Bolt', 'NN')],
 [('is', 'VBZ')],
 [('the', 'DT')],
 [('fastest', 'JJS')],
 [('runner', 'NN')],
 [('in', 'IN')],
 [('the', 'DT')],
 [('world', 'NN')]]

The full list of Penn Treebank tags used by NLTK can be printed with the nltk.help.upenn_tagset() function:

In [43]:
nltk.download('tagsets') # need to download first time
nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data]   Unzipping help\tagsets.zip.


## 5. Sentiment Analysis with the TextBlob library

In [48]:
pip install textblob

Collecting textblob
  Obtaining dependency information for textblob from https://files.pythonhosted.org/packages/02/07/5fd2945356dd839974d3a25de8a142dc37293c21315729a41e775b5f3569/textblob-0.18.0.post0-py3-none-any.whl.metadata
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Collecting nltk>=3.8 (from textblob)
  Obtaining dependency information for nltk>=3.8 from https://files.pythonhosted.org/packages/a6/0a/0d20d2c0f16be91b9fa32a77b76c60f9baf6eba419e5ef5deca17af9c582/nltk-3.8.1-py3-none-any.whl.metadata
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: C:\Users\User\anaconda3\python.exe -m pip install --upgrade pip


In [51]:
from textblob import TextBlob
TextBlob("I love pizza").sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

The polarity score ranges from -1 to 1, with -1 being the most negative sentiment and 1 being the most positive. The subjectivity score ranges from 0 to 1, where 0 implies that the statement is factual and 1 implies a highly subjective statement.

In [52]:
TextBlob("The weather is excellent").sentiment

Sentiment(polarity=1.0, subjectivity=1.0)

In these two examples, higher polarity comes with higher subjectivity, although two sentences are far too few to establish a correlation.
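One way to picture where such scores come from is a lexicon lookup: each known word carries its own polarity and subjectivity, and the sentence score is an average over the words that were found. The sketch below is a simplified assumption about how lexicon-based scoring works, not TextBlob's actual implementation (which also deals with negation and intensifiers). The two lexicon entries are hypothetical and simply reuse the scores reported for the two sentences above.

```python
# Hypothetical two-word lexicon: word -> (polarity, subjectivity).
# The values reuse the scores TextBlob reported for the example sentences.
lexicon = {"love": (0.5, 0.6), "excellent": (1.0, 1.0)}


def score(text):
    """Average the polarity and subjectivity of the words found in the lexicon."""
    hits = [lexicon[word] for word in text.lower().split() if word in lexicon]
    if not hits:
        return (0.0, 0.0)  # no known words: treat the text as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return (polarity, subjectivity)


print(score("I love pizza"))
print(score("The weather is excellent"))
print(score("I love this excellent pizza"))
```

Under this view, words such as "pizza" and "weather" contribute nothing because they are not in the lexicon, so the score of a short sentence is driven by one or two opinion words.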

## 6. Machine Translation using Google Translate

Older versions of TextBlob wrapped the Google Translate API behind a very simple interface, the translate() method:

In [58]:
from textblob import TextBlob
languages = ['fr','zh-CN','hi']
for language in languages:
    print(TextBlob("Who knew translation could be fun").translate(to=language))

AttributeError: 'TextBlob' object has no attribute 'translate'

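This call fails because the translate() method has been removed from TextBlob as of version 0.18.0, which is the version installed above. Translation now requires a separate library or an older TextBlob release.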

### Part of speech tagging

The tags property performs POS tagging like so:

In [74]:
TextBlob("The global economy is expected to grow this year").tags

[('The', 'DT'),
 ('global', 'JJ'),
 ('economy', 'NN'),
 ('is', 'VBZ'),
 ('expected', 'VBN'),
 ('to', 'TO'),
 ('grow', 'VB'),
 ('this', 'DT'),
 ('year', 'NN')]


## 7. VADER

Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon- and rule-based sentiment analysis tool that is reported to be considerably more accurate than earlier lexicon-based sentiment analyzers.

In [80]:
pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)




In [81]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
analyser.polarity_scores("This book is very good")

{'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.4927}

Here, we can see that VADER reports the proportions of the text that it rates as negative, neutral, and positive, along with a compound score, which is a single normalized value between -1 and 1. The compound score is what we are interested in: a score greater than 0.05 is considered positive, a score less than -0.05 is considered negative, and anything in between is treated as neutral.

In [83]:
analyser.polarity_scores("OMG! The book is so cool")

{'neg': 0.0, 'neu': 0.604, 'pos': 0.396, 'compound': 0.5079}
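The thresholds described above can be wrapped in a small helper that turns a compound score into a label. This is a sketch: the function name is made up, and counting a score of exactly 0.05 or -0.05 as positive or negative is an assumption about how the boundary is handled.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the two example sentences above, plus two made-up values.
for value in (0.4927, 0.5079, -0.3, 0.0):
    print(value, label_from_compound(value))
```

In the notebook, the helper would be called as `label_from_compound(analyser.polarity_scores(text)['compound'])`.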

# 2. Web Scraping with Beautiful Soup

So far, we have relied on corpora, or large repositories of text, for NLP work. However, these do not contain every kind of text we might need; for instance, finance-related text may not be included. Web scraping is therefore an extremely useful tool for collecting such data.

In [85]:
pip install requests

Note: you may need to restart the kernel to use updated packages.





In [86]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.





In [87]:
import requests
from bs4 import BeautifulSoup

In [90]:
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
request = requests.get(url)
request.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<!-- Google Tag Manager -->\n<script>(function (w, d, s, l, i) {\n\t\tw[l] = w[l] || [];\n\t\tw[l].push({\n\t\t\t\'gtm.start\':\n\t\t\t\tnew Date().getTime(), event: \'gtm.js\'\n\t\t});\n\t\tvar f = d.getElementsByTagName(s)[0],\n\t\t\tj = d.createElement(s), dl = l != \'dataLayer\' ? \'&l=\' + l : \'\';\n\t\tj.async = true;\n\t\tj.src =\n\t\t\t\'https://www.googletagmanager.com/gtm.js?id=\' + i + dl;\n\t\tf.parentNode.insertBefore(j, f);\n\t})(window, document, \'script\', \'dataLayer\', \'GTM-NVFPDWB\');</script>\n<!-- End Google Tag Manager -->\n\t<title>Allinone | Web Scraper Test Sites</title>\n\t<meta charset="utf-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n\t<meta name="keywords"\n\t\t  content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper"/>\n\t<meta name="description"\n\t\t  content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with

In [107]:
titles = []
prices = []
ratings = []
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
request = requests.get(url)
soup = BeautifulSoup(request.text, "html.parser")
for product in soup.find_all('div', {'class': 'col-md-4 col-xl-4 col-lg-4'}):
    for pr in product.find_all('div', {'class': 'caption'}):
        for p in pr.find_all('h4', {'class': 'float-end price card-title pull-right'}):
            prices.append(p.text)
        for title in pr.find_all('a', {'class': 'title'}):
            titles.append(title.get('title'))
    for rt in product.find_all('div', {'class': 'ratings'}):
        ratings.append(len(rt.find_all('span', {'class': 'ws-icon ws-icon-star'})))


In [108]:
product_df = pd.DataFrame(zip(titles,prices,ratings), columns = ['Titles','Prices', 'Ratings'])
# product_df.to_csv("ecommerce.csv",index=False)


In [111]:
product_df

Unnamed: 0,Titles,Prices,Ratings
0,Asus VivoBook X441NA-GA190,$295.99,3
1,Prestigio SmartBook 133S Dark Grey,$299,2
2,Prestigio SmartBook 133S Gold,$299,4
3,Aspire E1-510,$306.99,3
4,Lenovo V110-15IAP,$321.94,3
...,...,...,...
112,Lenovo Legion Y720,$1399,3
113,Asus ROG Strix GL702VM-GC146T,$1399,3
114,Asus ROG Strix GL702ZC-GC154T,$1769,4
115,Asus ROG Strix GL702ZC-GC209T,$1769,1
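The Prices column holds strings such as "$295.99", so as scraped it cannot be averaged, and sorting it would compare text rather than amounts. The sketch below shows one way to convert such strings to numbers; the helper name is made up, and stripping thousands separators is an assumption, since none appear in the rows shown above.

```python
def parse_price(text):
    """Turn a price string like '$1,399.50' into a float."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# A few price strings copied from the table above.
prices = ["$295.99", "$299", "$1399", "$1769"]
values = [parse_price(price) for price in prices]

print(values)
print(sorted(prices))   # text order: "$1399" sorts before "$295.99"
print(sorted(values))   # numeric order
```

On the DataFrame itself, the same helper could be applied with `product_df['Prices'].map(parse_price)`.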


In [110]:
product_df.to_csv("ecommerce.csv",index=False)

Finally, we save the DataFrame to a CSV file.