# Add Virtual Environment to Jupyter Notebook
More at (https://janakiev.com/blog/jupyter-virtual-envs/)

First, activate your virtual environment (named handsonnlp here).

In [2]:
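# Run this in a terminal, not in a notebook cell: every `!` command runs in its own
# subshell, so `!cd` and `!source` do not persist to later commands or cells.
#   source handsonnlp/bin/activate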

Next, install ipykernel, which provides the IPython kernel for Jupyter:

In [None]:
!pip install -U ipykernel

Then, you can add your virtual environment to Jupyter by typing:

In [4]:
!python3 -m ipykernel install --user --name=handsonnlp

Installed kernelspec handsonnlp in /home/ahmedkashkoush/.local/share/jupyter/kernels/handsonnlp
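
After selecting the new kernel in Jupyter, one quick way to check that the notebook really runs inside the virtual environment is to look at the interpreter path. This is a minimal sketch using only the standard library; the variable name `in_venv` is just illustrative.

```python
import sys

# Path of the Python interpreter that this kernel is running.
print(sys.executable)

# Inside a virtual environment, sys.prefix points at the environment,
# while sys.base_prefix points at the Python installation it was created from.
in_venv = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_venv)
```

If the kernel was registered from the handsonnlp environment, the printed interpreter path should sit under that environment's directory.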


# Accessing Text from the Web
# AK: Access web text: Four ways
Four common ways to access text from the web:
* Using urllib: a standard-library package for opening URLs. (https://docs.python.org/3/library/urllib.request.html)
* Using requests: allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3. (https://docs.python-requests.org/en/master/)
* Using BeautifulSoup: a Python library for pulling data out of HTML and XML files. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* Using scrapy: a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler. (https://scrapy.org/)

Using urllib

In [None]:
# AK: Access web text: Urllib
from urllib.request import urlopen
html = urlopen("https://github.com/ahmad-kashkoush").read().decode('utf-8')
print(html)



<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  
  >



  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  

  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-74231a1f3bbb.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-8a995f0bacd4.css" /><link data-color-theme="light_high_contrast" crossorigin="anonymous

What if you can't assume it's UTF-8? Either don't decode it, or check the headers.

urlopen()
This function always returns an object that has the properties url, headers, and status.

In [10]:
# AK: Urllib: header response
resp = urlopen("https://github.com/ahmad-kashkoush")
str(resp.headers)

'Date: Sun, 20 Apr 2025 15:52:38 GMT\nContent-Type: text/html; charset=utf-8\nVary: X-Requested-With, X-PJAX-Container, Turbo-Frame, Turbo-Visit,Accept-Encoding, Accept, X-Requested-With\nETag: W/"0daae7da8803d92570f56833971e4fa9"\nCache-Control: max-age=0, private, must-revalidate\nStrict-Transport-Security: max-age=31536000; includeSubdomains; preload\nX-Frame-Options: deny\nX-Content-Type-Options: nosniff\nX-XSS-Protection: 0\nReferrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin\nContent-Security-Policy: default-src \'none\'; base-uri \'self\'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src \'self\' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-62

In [11]:
resp.headers.items()

[('Date', 'Sun, 20 Apr 2025 15:52:38 GMT'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('Vary',
  'X-Requested-With, X-PJAX-Container, Turbo-Frame, Turbo-Visit,Accept-Encoding, Accept, X-Requested-With'),
 ('ETag', 'W/"0daae7da8803d92570f56833971e4fa9"'),
 ('Cache-Control', 'max-age=0, private, must-revalidate'),
 ('Strict-Transport-Security', 'max-age=31536000; includeSubdomains; preload'),
 ('X-Frame-Options', 'deny'),
 ('X-Content-Type-Options', 'nosniff'),
 ('X-XSS-Protection', '0'),
 ('Referrer-Policy',
  'origin-when-cross-origin, strict-origin-when-cross-origin'),
 ('Content-Security-Policy',
  "default-src 'none'; base-uri 'self'; child-src github.githubassets.com github.com/assets-cdn/worker/ github.com/assets/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-up

In [13]:
# AK: Charset
# The above cells should run first
content_type_header = resp.headers.get('content-type')
content_type_header

'text/html; charset=utf-8'

In [17]:
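# parse the charset parameter; leave charset as None if the header has none
charset = None
if 'charset=' in content_type_header:
    charset = content_type_header.split('charset=')[1].split(';')[0].strip()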

In [15]:
charset

'utf-8'

In [16]:
# Or using get_content_charset()
resp.headers.get_content_charset()

'utf-8'
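
The charset lookup above can be folded into a small helper that decodes a response body with a fallback when the header names no charset. This is only a sketch: the function names `charset_from_content_type` and `decode_body` and the `utf-8` default are assumptions made for illustration, and the header is parsed with the standard library's `email.message.Message` so the example runs without a network connection.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw bytes using the charset from the header (hypothetical helper)."""
    charset = charset_from_content_type(content_type, default)
    # errors="replace" avoids an exception on bytes that do not fit the charset.
    return raw.decode(charset, errors="replace")


# Offline demonstration with Latin-1 encoded bytes.
body = "café".encode("latin-1")
print(decode_body(body, "text/html; charset=ISO-8859-1"))
```

With a real response, the same helper could be called as `decode_body(resp.read(), resp.headers.get('content-type'))`. A charset name that Python does not recognise would still raise a `LookupError`.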

# Remove Stopwords using NLTK

In [None]:
# AK: stopwords removal
# !pip install nltk
import nltk
from nltk import word_tokenize
sentence= "Hi there, this is Muhammad Elgendi and this an example of a sentence with a couple of stopwords!"

# tokenization
tokens = word_tokenize(sentence)
print(tokens)

# stopwords in english
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [None]:
# if you get a LookupError, then download stopwords package
nltk.download('stopwords')

In [23]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
new_tokens = [w for w in tokens if w not in stop_words]

In [24]:
# after removal
new_tokens

['Hi',
 ',',
 'Muhammad',
 'Elgendi',
 'example',
 'sentence',
 'couple',
 'stopwords',
 '!']
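
The result above still contains punctuation, and the comparison is case-sensitive, so a capitalised stopword such as "This" would slip through. A possible refinement is sketched below. To keep it runnable without downloading NLTK data, it uses a tiny hand-written stopword set (`STOP_WORDS`, an assumption for illustration only); in the notebook you would pass the NLTK `stop_words` set instead.

```python
import string

# Hypothetical miniature stopword list, standing in for NLTK's English list.
STOP_WORDS = {"there", "this", "is", "and", "an", "of", "a", "with"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop stopwords (case-insensitively) and tokens made only of punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


tokens = ['Hi', 'there', ',', 'This', 'is', 'an', 'example', 'of', 'a',
          'sentence', 'with', 'a', 'couple', 'of', 'stopwords', '!']
print(remove_stopwords(tokens))
```

Lower-casing only for the comparison keeps the original capitalisation of the tokens that survive.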

# TF-IDF metric
# AK: TF-IDF: Definition
* A statistical measure of the relevance of a word to a document
* Computed by multiplying two values: (occurrences of the word, inverse document frequency)

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF is an abbreviation of two terms: term frequency and inverse document frequency.

Term frequency (TF): the ratio of the number of times a specific word occurs in a document to the total number of words in that document.

Inverse document frequency (IDF): the logarithm of the ratio of the total number of documents to the number of documents containing a particular word.

Example:
Suppose that a document contains 100 words, wherein the word happy appears five times. The term frequency (i.e., tf) for happy is then 5/100 = 0.05.

Suppose we have 10 million documents, and the word happy appears in 1,000 of them. The inverse document frequency (i.e., idf), using a base-10 logarithm, would then be log10(10,000,000/1,000) = log10(10,000) = 4.

TF-IDF of "happy" is 0.05 * 4 = 0.20
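
The arithmetic in this example can be checked with a few lines of Python. This is a sketch of the textbook definition given above, using a base-10 logarithm; the function names are illustrative and not part of any library.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's words that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(n_documents, n_documents_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(n_documents / n_documents_with_term)


tf = term_frequency(5, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tf_idf = tf * idf
print(f"tf={tf:.2f}  idf={idf:.2f}  tf-idf={tf_idf:.2f}")
```

Libraries often use a different logarithm base or add smoothing, so their numbers will generally not match this hand calculation.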

# TF-IDF using the scikit-learn library

In [None]:
# install scikit-learn
!python3 -m pip install -U scikit-learn

### Read more about TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer)

In [None]:
# AK: TF-IDF: Text Vectorization: TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["Ramiess sings classic songs", "he listens to old pop",
         "and rock music", ' and also listens to classical songs']
vect = TfidfVectorizer()
X = vect.fit_transform(texts)
print(X.todense())

[[0.         0.         0.52547275 0.         0.         0.
  0.         0.         0.         0.52547275 0.         0.52547275
  0.41428875 0.        ]
 [0.         0.         0.         0.         0.48546061 0.38274272
  0.         0.48546061 0.48546061 0.         0.         0.
  0.         0.38274272]
 [0.         0.48693426 0.         0.         0.         0.
  0.61761437 0.         0.         0.         0.61761437 0.
  0.         0.        ]
 [0.47212003 0.37222485 0.         0.47212003 0.         0.37222485
  0.         0.         0.         0.         0.         0.
  0.37222485 0.37222485]]
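
These values differ from the hand calculation earlier because TfidfVectorizer, with its default settings, uses a smoothed idf with a natural logarithm, idf = ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below re-derives the matrix in plain Python under those assumptions (lower-casing, tokens of two or more word characters, raw counts as tf). It is an illustration of the formula, not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

texts = ["Ramiess sings classic songs", "he listens to old pop",
         "and rock music", " and also listens to classical songs"]

# Assumed to mirror the vectorizer's defaults: lowercase, tokens of 2+ word characters.
docs = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
vocab = sorted({tok for doc in docs for tok in doc})  # columns in alphabetical order
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency with a natural logarithm.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """tf * idf for every vocabulary term, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(doc) for doc in docs]

# Show the non-zero weights of the first document next to their terms.
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term:10s} {weight:.8f}")
```

Pairing each column with its term this way also makes the matrix easier to read: in the first row, "songs" gets a lower weight than "classic" because it also appears in the fourth text.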


In [27]:
print(type(X))

<class 'scipy.sparse._csr.csr_matrix'>


### Read more about scipy.sparse.csr_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)