## Web Scrapping (Data Extraction)

For each of the articles, given in the input.xlsx file, extract the article text and save the extracted article in a text file with URL_ID as its file name.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
input_data = pd.read_excel('Input.xlsx')
input_data.head(5)

Unnamed: 0,URL_ID,URL
0,1,https://insights.blackcoffer.com/how-is-login-...
1,2,https://insights.blackcoffer.com/how-does-ai-h...
2,3,https://insights.blackcoffer.com/ai-and-its-im...
3,4,https://insights.blackcoffer.com/how-do-deep-l...
4,5,https://insights.blackcoffer.com/how-artificia...


In [3]:
input_data.tail(5)

Unnamed: 0,URL_ID,URL
165,166,https://insights.blackcoffer.com/role-big-data...
166,167,https://insights.blackcoffer.com/sales-forecas...
167,168,https://insights.blackcoffer.com/detect-data-e...
168,169,https://insights.blackcoffer.com/data-exfiltra...
169,170,https://insights.blackcoffer.com/impacts-of-co...


In [4]:
url = input_data['URL'][0]
url

'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'

In [5]:
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'burnwalbittu@gmail.com'  # This is another valid field
}
page = requests.get(url, headers=headers)

In [6]:
soup = BeautifulSoup(page.text, "html.parser")

In [7]:
print(soup.prettify())

<!DOCTYPE doctype html >
<!--[if IE 8]>    <html class="ie8" lang="en"> <![endif]-->
<!--[if IE 9]>    <html class="ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en-US">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="https://insights.blackcoffer.com/xmlrpc.php" rel="pingback"/>
  <script type="text/javascript">
   !function(){var e={};e.g=function(){if("object"==typeof globalThis)return globalThis;try{return this||new Function("return this")()}catch(e){if("object"==typeof window)return window}}(),function(n){let{ampUrl:t,isCustomizePreview:r,isAmpDevMode:o,noampQueryVarName:s,noampQueryVarValue:i,disabledStorageKey:a,mobileUserAgents:c,regexRegex:u}=n;if("undefined"==typeof sessionStorage)return;const d=new RegExp(u);if(!c.some((e=>{const n=e.match(d);return!(!n||!new RegExp(n[1],n[2]).test(navigator.userAgent))||navigator.userAgent.includes(e)})))return;e.g.addEventListener("

In [8]:
title = soup.find(name="h1", attrs={"class":"entry-title"})
print(title.get_text())

content = soup.find(name="div", attrs={"class":"td-post-content"})
print(content.get_text())

How is Login Logout Time Tracking for Employees in Office done by AI?

When people hear AI they often think about sentient robots and magic boxes. AI today is much more mundane and simple—but that doesn’t mean it’s not powerful. Another misconception is that high-profile research projects can be applied directly to any business situation. AI done right can create an extreme return on investments (ROIs)—for instance through automation or precise prediction. But it does take thought, time, and proper implementation. We have seen that success and value generated by AI projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the C-suite down.
“Artificial Intelligence (AI) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason and take action.”3 Lately there has been a big rise in the day-to-

**Automating the entire process and from each of the articles given in the input.xlsx file, extracting only the article title and the article text. Later on saving the extracted article in a text file with URL_ID as its file name.**
 


In [9]:
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'burnwalbittu@gmail.com' 
}
for i in range(len(input_data)):
  file = open(f"{input_data['URL_ID'][i]}.txt", "w")
  url = input_data['URL'][i]
  page = requests.get(url, headers = headers)
  soup = BeautifulSoup(page.text, "html.parser")
  title = soup.find(name="h1", attrs={"class":"entry-title"})
  title_c = title.get_text()
  content = soup.find(name="div", attrs={"class":"td-post-content"})
  content_c = content.get_text()
  file.write(title_c)
  file.write(content_c)
  file.flush()
  file.close

## Data Analysis

**For each of the extracted texts from the article, perform textual analysis and compute variables**

1.	POSITIVE SCORE
2.	NEGATIVE SCORE
3.	POLARITY SCORE
4.	SUBJECTIVITY SCORE
5.	AVG SENTENCE LENGTH
6.	PERCENTAGE OF COMPLEX WORDS
7.	FOG INDEX
8.	AVG NUMBER OF WORDS PER SENTENCE
9.	COMPLEX WORD COUNT
10.	WORD COUNT
11.	SYLLABLE PER WORD
12.	PERSONAL PRONOUNS
13.	AVG WORD LENGTH





In [10]:
# Importing Required Packages

import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [11]:
with open("StopWords_Generic.txt", "r") as sw:
  stop_words = sw.read().lower()
  stopwords_list = stop_words.split('\n')

In [12]:
with open("PositiveWords.txt", "r") as pos:
  pos_words = pos.read().lower()
  pos_list = pos_words.split('\n')

In [13]:
with open("NegativeWords.txt", "r") as neg:
  neg_words = neg.read().lower()
  neg_list = neg_words.split('\n')

**Data Analysis**

In [14]:
with open("1.txt", "r") as text:
  text = text.read().lower()
  text = text.split('\n')
text = str(text)
text

"['how is login logout time tracking for employees in office done by ai?', 'when people hear ai they often think about sentient robots and magic boxes. ai today is much more mundane and simple—but that doesn’t mean it’s not powerful. another misconception is that high-profile research projects can be applied directly to any business situation. ai done right can create an extreme return on investments (rois)—for instance through automation or precise prediction. but it does take thought, time, and proper implementation. we have seen that success and value generated by ai projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the c-suite down.', '“artificial intelligence (ai) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason and take action.”3 lately there has been a big rise in the

In [15]:
# Removing all character except string

corp = re.sub(r'[^a-zA-Z]',' ', text).strip()
corp

'how is login logout time tracking for employees in office done by ai     when people hear ai they often think about sentient robots and magic boxes  ai today is much more mundane and simple but that doesn t mean it s not powerful  another misconception is that high profile research projects can be applied directly to any business situation  ai done right can create an extreme return on investments  rois  for instance through automation or precise prediction  but it does take thought  time  and proper implementation  we have seen that success and value generated by ai projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the c suite down      artificial intelligence  ai  is a science and a set of computational technologies that are inspired by but typically operate quite differently from the ways people use their nervous systems and bodies to sense  learn  reason and take action    lately there has been a big rise in the d

In [16]:
# Performing tokenization and converting words into tokens

tokens = word_tokenize(corp)
tokens

['how',
 'is',
 'login',
 'logout',
 'time',
 'tracking',
 'for',
 'employees',
 'in',
 'office',
 'done',
 'by',
 'ai',
 'when',
 'people',
 'hear',
 'ai',
 'they',
 'often',
 'think',
 'about',
 'sentient',
 'robots',
 'and',
 'magic',
 'boxes',
 'ai',
 'today',
 'is',
 'much',
 'more',
 'mundane',
 'and',
 'simple',
 'but',
 'that',
 'doesn',
 't',
 'mean',
 'it',
 's',
 'not',
 'powerful',
 'another',
 'misconception',
 'is',
 'that',
 'high',
 'profile',
 'research',
 'projects',
 'can',
 'be',
 'applied',
 'directly',
 'to',
 'any',
 'business',
 'situation',
 'ai',
 'done',
 'right',
 'can',
 'create',
 'an',
 'extreme',
 'return',
 'on',
 'investments',
 'rois',
 'for',
 'instance',
 'through',
 'automation',
 'or',
 'precise',
 'prediction',
 'but',
 'it',
 'does',
 'take',
 'thought',
 'time',
 'and',
 'proper',
 'implementation',
 'we',
 'have',
 'seen',
 'that',
 'success',
 'and',
 'value',
 'generated',
 'by',
 'ai',
 'projects',
 'are',
 'increased',
 'when',
 'there',
 

In [17]:
# Removing stopwords from the list of tokens

words = [t for t in tokens if t not in stopwords_list]
words

['login',
 'logout',
 'time',
 'tracking',
 'employees',
 'office',
 'done',
 'ai',
 'people',
 'hear',
 'ai',
 'often',
 'think',
 'sentient',
 'robots',
 'magic',
 'boxes',
 'ai',
 'today',
 'much',
 'mundane',
 'simple',
 'doesn',
 't',
 'mean',
 's',
 'powerful',
 'another',
 'misconception',
 'high',
 'profile',
 'research',
 'projects',
 'applied',
 'directly',
 'business',
 'situation',
 'ai',
 'done',
 'right',
 'create',
 'extreme',
 'return',
 'investments',
 'rois',
 'instance',
 'automation',
 'precise',
 'prediction',
 'take',
 'thought',
 'time',
 'proper',
 'implementation',
 'seen',
 'success',
 'value',
 'generated',
 'ai',
 'projects',
 'increased',
 'a',
 'grounded',
 'understanding',
 'expectation',
 'technology',
 'deliver',
 'c',
 'suite',
 'artificial',
 'intelligence',
 'ai',
 'a',
 'science',
 'a',
 'set',
 'computational',
 'technologies',
 'inspired',
 'typically',
 'operate',
 'quite',
 'differently',
 'ways',
 'people',
 'use',
 'nervous',
 'systems',
 'bod

In [18]:
# Performing Lemmatization on each word

lemmatize_text = [lemma.lemmatize(w) for w in words]
lemmatize_text

['login',
 'logout',
 'time',
 'tracking',
 'employee',
 'office',
 'done',
 'ai',
 'people',
 'hear',
 'ai',
 'often',
 'think',
 'sentient',
 'robot',
 'magic',
 'box',
 'ai',
 'today',
 'much',
 'mundane',
 'simple',
 'doesn',
 't',
 'mean',
 's',
 'powerful',
 'another',
 'misconception',
 'high',
 'profile',
 'research',
 'project',
 'applied',
 'directly',
 'business',
 'situation',
 'ai',
 'done',
 'right',
 'create',
 'extreme',
 'return',
 'investment',
 'roi',
 'instance',
 'automation',
 'precise',
 'prediction',
 'take',
 'thought',
 'time',
 'proper',
 'implementation',
 'seen',
 'success',
 'value',
 'generated',
 'ai',
 'project',
 'increased',
 'a',
 'grounded',
 'understanding',
 'expectation',
 'technology',
 'deliver',
 'c',
 'suite',
 'artificial',
 'intelligence',
 'ai',
 'a',
 'science',
 'a',
 'set',
 'computational',
 'technology',
 'inspired',
 'typically',
 'operate',
 'quite',
 'differently',
 'way',
 'people',
 'use',
 'nervous',
 'system',
 'body',
 'sense'

**Computing Variables**

In [19]:
# Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
def positivescore(text):
  score = 0
  for word in text:
    if word in pos_list:
      score = score + 1
  return score

In [20]:
# Negative Score: This score is calculated by assigning the value of -1 for each word if found in the Negative Dictionary and then adding up all the values. 
# We multiply the score with -1 so that the score is a positive number.

def negativescore(text):
  score = 0
  for word in text:
    if word in neg_list:
      score = score - 1
      score = score*(-1)
  return score

In [21]:
# Polarity Score: This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula: 
# Polarity Score = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001)

def polarityscore(positive_score, negative_score):
  ps = (positive_score - negative_score)/((positive_score + negative_score) + 0.000001)
  return round(ps, 4)

In [22]:
# Subjectivity Score: This is the score that determines if a given text is objective or subjective. It is calculated by using the formula: 
# Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

def subjectivescore(positive_score, negative_score, text):
  ss = (positive_score - negative_score)/(len(text) + 0.000001)
  return round(ss, 4)

In [23]:
# Analysis of Readability is calculated using the Gunning Fox index formula described below.
# Average Sentence Length = the number of words / the number of sentences

def avg_sen_len(words, sentences):
  asl = len(words)/len(sentences)
  return round(asl)

In [24]:
# Calculating Vowels, Complex Word and Percentage of Complex words

def percent_complex_word(tokens):
  complexword = 0
  vowels=0
  for word in tokens:
      
      if word.endswith(('es','ed')):
          pass
      else:
          for w in word:
              if(w=='a' or w=='e' or w=='i' or w=='o' or w=='u'):
                  vowels += 1
          if(vowels > 2):
              complexword += 1

  complex_word_percentage = complexword/len(tokens)
  
  return vowels, complexword, round(complex_word_percentage,2)

In [25]:
# Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)

def fog_index(avg_sent_length, percentage_complex_word):
  fi = 0.4*(avg_sent_length+percentage_complex_word)
  return round(fi,3)

In [26]:
# Calculate Personal Pronouns

def calculate_personal_pronoun(text):
  pp = 0
  for word in text:
    if(word=='i' or word=='we' or word=='my' or word=='ours' or word=='us'):
      pp += 1

  return pp

In [27]:
# Average Word Length

def avg_word_len(text):
  count = 0
  for word in text:
    for w in word:
      count += 1
  avg  = count/len(text)
  return round(avg)

In [28]:
# Automating the entire process and computing variables for each article and storing into empty lists

Positive_Score = []
Negative_Score = []
Polarity_Score = []
Subjective_Score = []
Average_Sentence_Length = []
Percentage_of_Complex_words = []
Fog_Index = []
Average_Number_of_Words_Per_Sentence = [] 
Complex_Words = []
Word_Count = []
Syllable_Count_Per_Word = []
Personal_Pronouns = []
Average_Word_Length = []


for i in range(len(input_data)):
  with open(f"{i+1}.txt", "r") as text:
    text = text.read().lower()
    text = text.split('\n')

  text = str(text)
  corp = re.sub(r'[^a-zA-Z]',' ', text).strip()
  tokens = word_tokenize(corp)
  sent_token = sent_tokenize(text)
  words = [t for t in tokens if t not in stopwords_list]
  lemmatize_text = [lemma.lemmatize(w) for w in words]

  # Extracting Derived variables

  positive_score = positivescore(lemmatize_text)
  Positive_Score.append(positive_score)

  negative_score = negativescore(lemmatize_text)
  Negative_Score.append(negative_score)

  Polarity_Score.append(polarityscore(positive_score,negative_score))

  Subjective_Score.append(subjectivescore(positive_score, negative_score, lemmatize_text))

  # Analysis of Readability

  avg_sent_length = avg_sen_len(lemmatize_text, sent_token)
  Average_Sentence_Length.append(avg_sent_length)

  vowels, complex_word_count, percentage_complex_word = percent_complex_word(lemmatize_text)
  Percentage_of_Complex_words.append(percentage_complex_word)
  Syllable_Count_Per_Word.append(vowels)
  Complex_Words.append(complex_word_count)
  

  Fog_Index.append(fog_index(avg_sent_length,percentage_complex_word))
  Average_Number_of_Words_Per_Sentence.append(avg_sent_length)

  Word_Count.append(len(lemmatize_text))
  Personal_Pronouns.append(calculate_personal_pronoun(lemmatize_text))
  Average_Word_Length.append(avg_word_len(lemmatize_text))

In [29]:
# Reading the output file

output_data = pd.read_excel('Output_Data_Structure.xlsx')

In [30]:
output_data.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,1.0,https://insights.blackcoffer.com/how-is-login-...,,,,,,,,,,,,,
1,2.0,https://insights.blackcoffer.com/how-does-ai-h...,,,,,,,,,,,,,
2,3.0,https://insights.blackcoffer.com/ai-and-its-im...,,,,,,,,,,,,,
3,4.0,https://insights.blackcoffer.com/how-do-deep-l...,,,,,,,,,,,,,
4,5.0,https://insights.blackcoffer.com/how-artificia...,,,,,,,,,,,,,


In [31]:
output_data.columns

Index(['URL_ID', 'URL', 'POSITIVE SCORE', 'NEGATIVE SCORE', 'POLARITY SCORE',
       'SUBJECTIVITY SCORE', 'AVG SENTENCE LENGTH',
       'PERCENTAGE OF COMPLEX WORDS', 'FOG INDEX',
       'AVG NUMBER OF WORDS PER SENTENCE', 'COMPLEX WORD COUNT', 'WORD COUNT',
       'SYLLABLE PER WORD', 'PERSONAL PRONOUNS', 'AVG WORD LENGTH'],
      dtype='object')

In [32]:
output_data['POSITIVE SCORE'] = Positive_Score
output_data['NEGATIVE SCORE'] = Negative_Score
output_data['POLARITY SCORE'] = Polarity_Score
output_data['SUBJECTIVITY SCORE'] = Subjective_Score
output_data['AVG SENTENCE LENGTH'] = Average_Sentence_Length
output_data['PERCENTAGE OF COMPLEX WORDS'] = Percentage_of_Complex_words
output_data['FOG INDEX'] = Fog_Index
output_data['AVG NUMBER OF WORDS PER SENTENCE'] = Average_Number_of_Words_Per_Sentence
output_data['COMPLEX WORD COUNT'] = Complex_Words
output_data['WORD COUNT'] = Word_Count
output_data['SYLLABLE PER WORD'] = Syllable_Count_Per_Word
output_data['PERSONAL PRONOUNS'] = Personal_Pronouns
output_data['AVG WORD LENGTH'] = Average_Word_Length

In [33]:
output_data

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,1.0,https://insights.blackcoffer.com/how-is-login-...,5,1,0.6667,0.0088,18,0.93,7.572,18,425,455,1011,0,6
1,2.0,https://insights.blackcoffer.com/how-does-ai-h...,10,0,1.0000,0.0246,14,0.95,5.980,14,386,407,946,1,6
2,3.0,https://insights.blackcoffer.com/ai-and-its-im...,38,1,0.9487,0.0315,15,0.93,6.372,15,1096,1173,2679,2,6
3,4.0,https://insights.blackcoffer.com/how-do-deep-l...,6,1,0.7143,0.0179,21,0.92,8.768,21,257,279,672,1,7
4,5.0,https://insights.blackcoffer.com/how-artificia...,21,0,1.0000,0.0263,14,0.96,5.984,14,770,798,1834,2,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,166.0,https://insights.blackcoffer.com/role-big-data...,18,1,0.8947,0.0189,12,0.94,5.176,12,848,900,2159,0,6
166,167.0,https://insights.blackcoffer.com/sales-forecas...,21,1,0.9091,0.0312,16,0.94,6.776,16,599,640,1470,0,7
167,168.0,https://insights.blackcoffer.com/detect-data-e...,4,1,0.6000,0.0046,11,0.95,4.780,11,625,657,1588,0,7
168,169.0,https://insights.blackcoffer.com/data-exfiltra...,4,0,1.0000,0.0113,13,0.96,5.584,13,338,353,817,0,6


In [34]:
# Exporting File

output_data.to_excel('Output_Data_Structure_File.xlsx', index = False, header = True)