<h1 style='text-align: center'>Blackcoffer Web Sentiment Analysis</h1>

![img_url](https://media.licdn.com/dms/image/C511BAQERjQuN7KqSxw/company-background_10000/0/1584478803942/blackcoffer_cover?e=2147483647&v=beta&t=8DoTmZGtmx_34EtMm9i-OFbKGo_wpmGMfVQEjkdcu5E)

## Dataset Description

The input dataset (`Input.xlsx`) is a table with two columns:

- `URL_ID`: A unique string identifier for each entry (e.g., "blackassign0001", "blackassign0002").

- `URL`: The address of the web page or article to analyze.

## Sample Data

| URL_ID         | URL                                                                 |
|----------------|---------------------------------------------------------------------|
| blackassign0001 | https://insights.blackcoffer.com/rising-it-cities-the-new-smart-cities/   |
| blackassign0002 | https://insights.blackcoffer.com/rising-it-cities-the-new-smart-cities/   |
| blackassign0003 | https://insights.blackcoffer.com/internet-deman-from-students-up/ |
| blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber-physical-systems/ |
| blackassign0005 | https://insights.blackcoffer.com/ott-platform-the-future/   |

Each row corresponds to a specific article or web page identified by a unique ID. The URLs are the input to the web scraping and content analysis steps described below.

## Objective

The objective of this project is to perform web scraping on a list of URLs to analyze the textual content of each webpage. This analysis will involve several linguistic and sentiment measures to gain insights into the nature and characteristics of the content. Specifically, the following measures will be calculated for each webpage:

- `Total Length (total_len)`: The number of tokens that remain after preprocessing (lowercasing, removal of non-alphabetic characters and stop words, lemmatization).

- `Positive Word Count (pos_count)`: The count of words that are classified as positive based on a predefined lexicon.

- `Negative Word Count (neg_count)`: The count of words that are classified as negative based on a predefined lexicon.

- `Sentiment (sentiment)`: The overall sentiment of the text, calculated as the difference between the positive and negative word counts divided by `total_len`.

- `Polarity Score (polar_score)`: The difference between the positive and negative word counts divided by their sum, ranging from -1 (very negative) to 1 (very positive).

- `Average Sentence Length (Average_Sentence_Length)`: The average number of words per sentence in the text.

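- `Percentage of Complex Words (Percentage_Of_Complex_Words)`: The percentage of words in the text that are considered complex. Complex words are conventionally defined as words with three or more syllables; the `Per_Complex_words` function in this notebook counts every token that has an entry in the CMU Pronouncing Dictionary.

- `Fog Index (Fog_Index)`: A readability test that estimates the years of formal education needed to understand the text on a first reading, calculated as 0.4 × (average sentence length + percentage of complex words).

- `Average Words Per Sentence (Average_Words_Per_Sentence)`: The average number of words in each sentence. In this notebook it yields the same values as the average sentence length.

- `Complex Words Count (Complex_Words_Count)`: The total number of complex words in the text. The `complex_word_count` function in this notebook counts non-stop-word tokens for which WordNet lists more than two synsets.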

- `Word Count (Word_Count)`: The total number of words in the text.

- `Syllables Count (Syllables_Count)`: The total number of syllables in the text, approximated by counting vowel letters (a, e, i, o, u).

- `Syllables Per Word (Syllables_Per_Word)`: The average number of syllables per word in the text.

- `Personal Pronouns Count (Personal_Pronouns_Count)`: The count of personal pronouns (e.g., I, we, you, he, she, it, they) in the text.

- `Average Word Length (Average_Word_Length)`: The average length of words in the text, typically measured in characters.

## Workflow

- `Web Scraping`: Extract the text content from each URL listed in the dataset. This involves fetching the web page and parsing its content to isolate the main body of text.

- `Text Processing`: Clean and preprocess the extracted text to remove any non-text elements (e.g., HTML tags, scripts, advertisements).

- `Linguistic Analysis`: Calculate the various measures listed above using appropriate computational linguistics techniques and tools.

- `Sentiment Analysis`: Apply sentiment analysis methods to determine the sentiment and polarity of the text.

- `Result Compilation`: Compile the results into a structured format, such as a CSV file, where each row corresponds to a URL and each column corresponds to one of the calculated measures.

This analysis aims to provide a comprehensive understanding of the content characteristics of each webpage, facilitating insights into the textual patterns, readability, sentiment, and overall linguistic properties.

## Importing the Necessary Modules and Libraries

In [1]:
import math
import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.corpus import cmudict, stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the NLTK resources used below (skipped when already up to date).
for resource in ('cmudict', 'wordnet', 'punkt', 'stopwords', 'averaged_perceptron_tagger'):
    nltk.download(resource)

[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package cmudict is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Importing the Dataset

In [2]:
data = pd.read_excel("Input.xlsx")
data

|     | URL_ID | URL |
|-----|--------|-----|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... |
| ... | ... | ... |
| 95 | blackassign0096 | https://insights.blackcoffer.com/what-is-the-r... |
| 96 | blackassign0097 | https://insights.blackcoffer.com/impact-of-cov... |
| 97 | blackassign0098 | https://insights.blackcoffer.com/contribution-... |
| 98 | blackassign0099 | https://insights.blackcoffer.com/how-covid-19-... |


## Creating Lists for Storing Data

In [3]:
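a = data.URL       # article URLs
b = data.URL_ID    # matching identifiers
ID = []            # IDs of the pages that were scraped successfully
text = []
cleaned_text = []  # article text with newlines removed
URL_Link = []      # URLs of the pages that were scraped successfully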

## Web Scraping

In [4]:
for i in range(len(a)):
    url = a[i]
    response = requests.get(url)
    # Skip pages that do not return HTTP 200.
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        # The article body sits in <div class="td-post-content tagdiv-type">.
        codes = soup.find_all("div", {"class": "td-post-content tagdiv-type"})
        text = []
        for t in codes:
            f = t.get_text()
            text.append(f)
        for j in text:
            # Remove newlines so each article is stored as a single line.
            item = re.sub('\n', '', j)
            cleaned_text.append(item)
            ID.append(b[i])
            URL_Link.append(a[i])
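
The extraction step can also be approximated with only the standard library. The following is a minimal sketch, not the notebook's method: it swaps BeautifulSoup for `html.parser`, and it assumes the article body sits in a `div` whose class list includes `td-post-content`. The class `PostTextExtractor`, the function `extract_post_text`, and `sample_html` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect text inside <div> elements whose class list contains a target class."""

    def __init__(self, target_class="td-post-content"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # 0 = outside the article body, >0 = div nesting depth inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1   # nested div inside the article body
        elif self.target_class in classes:
            self.depth = 1    # entering the article body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_post_text(html):
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    # Mirror the notebook: remove newlines from the extracted text.
    return "".join(parser.chunks).replace("\n", "")


# Hypothetical page used only to exercise the sketch.
sample_html = """
<html><body>
<div class="header">Menu</div>
<div class="td-post-content tagdiv-type"><p>First paragraph.</p>
<div class="inner"><p>Second paragraph.</p></div></div>
<div class="footer">Footer</div>
</body></html>
"""
print(extract_post_text(sample_html))  # First paragraph.Second paragraph.
```

Because newlines are removed rather than replaced with a space, text from adjacent blocks is joined without a separator.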

## Writing the Extracted Text to One File per ID

In [5]:
for i in range(len(ID)):
    # The context manager closes each file as soon as it has been written.
    with open(ID[i] + ".txt", "w", encoding="utf-8") as f:
        f.write(cleaned_text[i])

In [6]:
zip_data = list(zip(ID, URL_Link, cleaned_text))

In [7]:
df = pd.DataFrame(zip_data, columns=['ID', 'URL_Link' , 'Text'])
Text = df.Text
df

|     | ID | URL_Link | Text |
|-----|----|----------|------|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | We have seen a huge development and dependence... |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | Throughout history, from the industrial revolu... |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | IntroductionIn the span of just a few decades,... |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | The way we live, work, and communicate has unq... |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | The year 2040 is poised to witness a continued... |
| ... | ... | ... | ... |
| 84 | blackassign0094 | https://insights.blackcoffer.com/gaming-disord... | Perhaps the virtual illusion has become today’... |
| 85 | blackassign0095 | https://insights.blackcoffer.com/what-is-the-r... | What is COVID 19 pandemic?On 31st December 201... |
| 86 | blackassign0096 | https://insights.blackcoffer.com/what-is-the-r... | Epidemics, in general, have both direct and in... |
| 87 | blackassign0097 | https://insights.blackcoffer.com/impact-of-cov... | COVID 19 has bought the world to its knees. Wi... |


## Text Processing

In [8]:
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [9]:
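def text_prep(x):
    corp = str(x).lower()
    # Replace every run of non-letter characters with a single space.
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    # Drop English stop words, then lemmatize the remaining tokens.
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize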
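
To see what the cleaning and stop-word steps of `text_prep` do to a sentence, here is a dependency-free stand-in. It is only a sketch: `simple_prep` and `DEMO_STOP_WORDS` are hypothetical, the stop-word set is a tiny illustrative subset rather than NLTK's English list, whitespace splitting replaces `word_tokenize`, and lemmatization is omitted.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English stop-word list.
DEMO_STOP_WORDS = {"the", "is", "to", "a", "in", "of", "and"}


def simple_prep(text):
    corp = str(text).lower()
    # Replace every run of non-letter characters with a single space.
    corp = re.sub(r"[^a-z]+", " ", corp).strip()
    tokens = corp.split()
    return [t for t in tokens if t not in DEMO_STOP_WORDS]


print(simple_prep("The year 2040 is poised to witness a continued revolution."))
# ['year', 'poised', 'witness', 'continued', 'revolution']
```

Digits and punctuation disappear in the regex step, which is why "2040" does not appear among the tokens.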

In [10]:
preprocess_tag = [text_prep(i) for i in df["Text"]]
df["preprocess_txt"] = preprocess_tag

## Calculating the length of the Processed Text

In [11]:
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

## Loading the Positive and Negative Word Lists

In [12]:
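# Load each lexicon into a set so that membership tests are fast,
# and let the context manager close the files.
with open('Negative_Texts.txt', 'r', encoding="utf8") as file:
    neg_words = set(file.read().split())
with open('Positive_Texts.txt', 'r', encoding="utf8") as file:
    pos_words = set(file.read().split())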

## Calculating the Positive and Negative word Count

In [13]:
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
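
The counting logic can be tried on a toy example without pandas. This is a minimal sketch; `count_lexicon_hits` and the three demo lists are hypothetical and stand in for the preprocessed tokens and the two lexicon files.

```python
def count_lexicon_hits(tokens, lexicon):
    """Count tokens (repeats included) that appear in the lexicon."""
    lexicon = set(lexicon)  # set lookup avoids scanning a list for every token
    return sum(1 for token in tokens if token in lexicon)


# Hypothetical tokens and lexicons, for illustration only.
demo_tokens = ["huge", "development", "problem", "good", "good"]
demo_pos = ["good", "huge"]
demo_neg = ["problem"]

pos_count = count_lexicon_hits(demo_tokens, demo_pos)
neg_count = count_lexicon_hits(demo_tokens, demo_neg)
print(pos_count, neg_count)  # 3 1
```

Repeated tokens are counted every time they occur, which matches the list comprehension used in the cell above.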

## Calculating the Sentiment Score

In [14]:
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)

## Calculating the Polarity Score

In [15]:
df["polar_score"] = round((df['pos_count'] - df['neg_count']) / ((df['pos_count'] + df['neg_count']) + 0.000001), 2)

In [16]:
df.head()

|     | ID | URL_Link | Text | preprocess_txt | total_len | pos_count | neg_count | sentiment | polar_score |
|-----|----|----------|------|----------------|-----------|-----------|-----------|-----------|-------------|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | We have seen a huge development and dependence... | [seen, huge, development, dependence, people, ...] | 623 | 214 | 9 | 0.33 | 0.92 |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | Throughout history, from the industrial revolu... | [throughout, history, industrial, revolution, ...] | 874 | 249 | 32 | 0.25 | 0.77 |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | IntroductionIn the span of just a few decades,... | [introductionin, span, decade, internet, under...] | 682 | 180 | 28 | 0.22 | 0.73 |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | The way we live, work, and communicate has unq... | [way, live, work, communicate, unquestionably,...] | 667 | 182 | 84 | 0.15 | 0.37 |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | The year 2040 is poised to witness a continued... | [year, poised, witness, continued, revolution,...] | 420 | 129 | 9 | 0.29 | 0.87 |
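
As a check on the two formulas, the scores in the first row of the table above can be reproduced from its counts alone. The helper names `sentiment_score` and `polarity_score` are hypothetical; the arithmetic is the same as in cells 14 and 15.

```python
def sentiment_score(pos_count, neg_count, total_len):
    # Net positive words as a share of all preprocessed tokens.
    return round((pos_count - neg_count) / total_len, 2)


def polarity_score(pos_count, neg_count):
    # The small constant keeps the denominator non-zero when no lexicon word occurs.
    return round((pos_count - neg_count) / ((pos_count + neg_count) + 0.000001), 2)


# Counts from the first row: pos_count=214, neg_count=9, total_len=623.
print(sentiment_score(214, 9, 623))  # 0.33
print(polarity_score(214, 9))        # 0.92
```

The polarity score ignores neutral words entirely, so it is usually larger in magnitude than the sentiment score for the same text.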


## Constructing Functions for Calculating Different Measures

### Average Sentence Length

In [17]:
def Avg_sen_len(text):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    num_words = len(words)
    num_sentences = len(sentences)
    if num_sentences == 0:
        return 0
    else:
        return (num_words / num_sentences)

### Percentage of Complex Words

In [18]:
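# Load the pronouncing dictionary once instead of on every call.
CMU_DICT = cmudict.dict()

def Per_Complex_words(text):
    words = word_tokenize(text)
    # Counts every token that has an entry in the CMU Pronouncing Dictionary;
    # no syllable threshold is applied.
    num_complex_words = sum(1 for word in words if word.lower() in CMU_DICT)
    num_words = len(words)
    if num_words == 0:
        return 0
    return num_complex_words / num_words * 100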

### Fog Index

In [19]:
def fog_index(text):
    avg_sentence_length = Avg_sen_len(text)
    percentage_complex_words = Per_Complex_words(text)
    # Gunning fog formula: 0.4 * (average sentence length + percentage of complex words).
    return 0.4 * (avg_sentence_length + percentage_complex_words)
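
As a worked check of this formula, take the first article in the output table further down, which reports an average sentence length of 22.457627 and a complex-word percentage of 85.132075:

$$
\text{Fog Index} = 0.4 \times (22.457627 + 85.132075) = 0.4 \times 107.589702 \approx 43.035881
$$

This matches the `Fog_Index` value reported for that row. The result is dominated by the complex-word percentage, which is high because `Per_Complex_words` counts every token found in the CMU dictionary.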

### Average Words per Sentence

In [20]:
def avg_words_per_sentence(text):
    sentences = sent_tokenize(text)
    if not sentences:
        return 0  # avoid dividing by zero on empty text
    # Tokenize sentence by sentence and total the word counts.
    total_words = sum(len(word_tokenize(sentence)) for sentence in sentences)
    return total_words / len(sentences)

### Complex Word Count 

In [21]:
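def complex_word_count(text):
    words = word_tokenize(text)
    stopwords_set = set(stopwords.words("english"))
    # A token counts as complex here when it is not a stop word and
    # WordNet lists more than two synsets (senses) for it.
    complex_words = [word for word in words if len(wn.synsets(word)) > 2 and word not in stopwords_set]
    return len(complex_words)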
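
For comparison, the conventional definition of a complex word (three or more syllables) can be sketched without any external data. This is an alternative heuristic, not the notebook's implementation: `estimate_syllables`, `is_complex`, and `percentage_complex` are hypothetical helpers, and the vowel-group rule is only a rough estimate that will misjudge some English words.

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (y included)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent 'e' (as in "huge") as not adding a syllable,
    # but keep the syllable in "-le" endings such as "people".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def is_complex(word):
    return estimate_syllables(word) >= 3


def percentage_complex(words):
    if not words:
        return 0.0
    return 100 * sum(is_complex(w) for w in words) / len(words)


# Hypothetical word list for illustration.
demo_words = ["development", "people", "huge", "technology", "city"]
print([estimate_syllables(w) for w in demo_words])  # [4, 2, 1, 4, 2]
print(percentage_complex(demo_words))               # 40.0
```

With a syllable threshold, only "development" and "technology" count as complex here, so the percentage is far lower than a dictionary-membership test would give.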

### Word Count

In [22]:
def word_count(text):
    words = word_tokenize(text)
    return (len(words)) 

### Syllable Count

In [23]:
def syllables_count(word):
    # Approximates syllables by counting vowel letters (a, e, i, o, u).
    return sum(1 for vowel in word if vowel.lower() in 'aeiou')

### Syllables Per word

In [24]:
def syllables_per_word(text):
    words = word_tokenize(text)
    if not words:
        return 0  # avoid dividing by zero on empty text
    syllables = sum(syllables_count(word) for word in words)
    return syllables / len(words)

### Personal Pronoun Counts

In [25]:
def personal_pronouns_count(text):
    words = word_tokenize(text)
    tagged_words = nltk.pos_tag(words)
    personal_pronouns = [word for word, tag in tagged_words if tag == 'PRP']
    return (len(personal_pronouns))
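
A regular expression offers a lighter alternative to part-of-speech tagging when only a fixed set of pronouns matters. The sketch below is an assumption-laden illustration, not the notebook's method: the pronoun list in `PRONOUN_PATTERN` is hypothetical and incomplete, and `count_pronouns` skips the all-caps token "US" on the assumption that it names the country.

```python
import re

# Hypothetical pronoun list for illustration; extend it as needed.
PRONOUN_PATTERN = re.compile(r"\b(I|we|my|ours|us|you|he|she|it|they)\b", re.IGNORECASE)


def count_pronouns(text):
    matches = PRONOUN_PATTERN.findall(text)
    # Skip the all-caps "US", assumed here to mean the country rather than the pronoun.
    return sum(1 for m in matches if m != "US")


print(count_pronouns("We think it helps us, but the US disagrees with you."))  # 4
```

Unlike the `PRP` tag filter, this approach cannot use context, so it counts only the listed words and nothing else.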

### Average Word Length

In [26]:
def avg_word_length(text):
    words = word_tokenize(text)
    if not words:
        return 0  # avoid dividing by zero on empty text
    total_length = sum(len(word) for word in words)
    return total_length / len(words)

In [27]:
text_list = df["Text"]

### Constructing Lists for storing the calculated values

In [28]:
Average_Sentence_Length = []
Percentage_Of_Complex_Words =[]
Fog_Index = []
Average_Words_Per_Sentence = []
Complex_Words_Count = []
Word_Count = []
Syllables_Count = []
Syllables_Per_Word = []
Personal_Pronouns_Count = []
Average_Word_Length = []

### Appending Calculated Values in the Lists

In [29]:
for i in range(len(text_list)):
    text = text_list[i]
    Average_Sentence_Length.append(Avg_sen_len(text))
    Percentage_Of_Complex_Words.append(Per_Complex_words(text))
    Fog_Index.append(fog_index(text))
    Average_Words_Per_Sentence.append(avg_words_per_sentence(text))
    Complex_Words_Count.append(complex_word_count(text))
    Word_Count.append(word_count(text))
    Syllables_Count.append(syllables_count(text))
    Syllables_Per_Word.append(syllables_per_word(text))
    Personal_Pronouns_Count.append(personal_pronouns_count(text))
    Average_Word_Length.append(avg_word_length(text))
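
The ten parallel lists can also be replaced by a mapping from column name to function, which keeps each name and its calculation in one place. This is a sketch of that design, not the notebook's code: `demo_word_count` and `demo_avg_word_length` are simplified stand-ins that split on whitespace, and `METRICS`, `compute_metrics`, and `demo_texts` are hypothetical names.

```python
# Simplified stand-ins for the notebook's metric functions (whitespace tokenization).
def demo_word_count(text):
    return len(text.split())


def demo_avg_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0


# Map each output column name to the function that computes it.
METRICS = {
    "Word_Count": demo_word_count,
    "Average_Word_Length": demo_avg_word_length,
}


def compute_metrics(text):
    return {name: func(text) for name, func in METRICS.items()}


# Hypothetical texts for illustration.
demo_texts = ["smart cities grow", "data helps"]
rows = [compute_metrics(t) for t in demo_texts]
print(rows)
# [{'Word_Count': 3, 'Average_Word_Length': 5.0}, {'Word_Count': 2, 'Average_Word_Length': 4.5}]
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which uses the keys as column names.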

### Constructing a DataFrame from the Lists with the Required Column Names

In [31]:
df_new = pd.DataFrame(list(zip(Average_Sentence_Length, Percentage_Of_Complex_Words, Fog_Index, 
                           Average_Words_Per_Sentence, Complex_Words_Count, Word_Count,
                           Syllables_Count, Syllables_Per_Word, Personal_Pronouns_Count, Average_Word_Length)),
               columns =['Average_Sentence_Length', 'Percentage_Of_Complex_Words', 'Fog_Index', 
                         'Average_Words_Per_Sentence', 'Complex_Words_Count', 'Word_Count',
                         'Syllables_Count', 'Syllables_Per_Word', 'Personal_Pronouns_Count', 'Average_Word_Length'])

### Concatenating the Main DataFrame and the Calculated DataFrame

In [32]:
dataframe = pd.concat([df, df_new], axis=1, join='inner')

In [33]:
output = dataframe.copy()
output

|     | ID | URL_Link | Text | preprocess_txt | total_len | pos_count | neg_count | sentiment | polar_score | Average_Sentence_Length | Percentage_Of_Complex_Words | Fog_Index | Average_Words_Per_Sentence | Complex_Words_Count | Word_Count | Syllables_Count | Syllables_Per_Word | Personal_Pronouns_Count | Average_Word_Length |
|-----|----|----------|------|----------------|-----------|-----------|-----------|-----------|-------------|-------------------------|-----------------------------|-----------|----------------------------|---------------------|------------|-----------------|--------------------|-------------------------|---------------------|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | We have seen a huge development and dependence... | [seen, huge, development, dependence, people, ...] | 623 | 214 | 9 | 0.33 | 0.92 | 22.457627 | 85.132075 | 43.035881 | 22.457627 | 463 | 1325 | 2179 | 1.644528 | 18 | 4.338868 |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | Throughout history, from the industrial revolu... | [throughout, history, industrial, revolution, ...] | 874 | 249 | 32 | 0.25 | 0.77 | 25.890625 | 84.007242 | 43.959147 | 25.890625 | 601 | 1657 | 3119 | 1.882317 | 22 | 4.968618 |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | IntroductionIn the span of just a few decades,... | [introductionin, span, decade, internet, under...] | 682 | 180 | 28 | 0.22 | 0.73 | 25.847826 | 83.179142 | 43.610787 | 25.847826 | 413 | 1189 | 2496 | 2.099243 | 17 | 5.625736 |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | The way we live, work, and communicate has unq... | [way, live, work, communicate, unquestionably,...] | 667 | 182 | 84 | 0.15 | 0.37 | 31.368421 | 81.208054 | 45.030590 | 31.368421 | 438 | 1192 | 2359 | 1.979027 | 12 | 5.421980 |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | The year 2040 is poised to witness a continued... | [year, poised, witness, continued, revolution,...] | 420 | 129 | 9 | 0.29 | 0.87 | 27.333333 | 86.043360 | 45.350678 | 27.333333 | 254 | 738 | 1425 | 1.930894 | 12 | 5.245257 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84 | blackassign0094 | https://insights.blackcoffer.com/gaming-disord... | Perhaps the virtual illusion has become today’... | [perhaps, virtual, illusion, become, today, ne...] | 625 | 169 | 50 | 0.19 | 0.54 | 24.037736 | 87.048666 | 44.434561 | 24.037736 | 447 | 1274 | 2135 | 1.675824 | 41 | 4.434066 |
| 85 | blackassign0095 | https://insights.blackcoffer.com/what-is-the-r... | What is COVID 19 pandemic?On 31st December 201... | [covid, pandemic, st, december, novel, coronav...] | 355 | 66 | 29 | 0.10 | 0.39 | 41.117647 | 77.968526 | 47.634469 | 41.117647 | 237 | 699 | 1184 | 1.693848 | 11 | 4.652361 |
| 86 | blackassign0096 | https://insights.blackcoffer.com/what-is-the-r... | Epidemics, in general, have both direct and in... | [epidemic, general, direct, indirect, cost, as...] | 658 | 171 | 68 | 0.16 | 0.43 | 26.755556 | 88.787375 | 46.217172 | 26.755556 | 479 | 1204 | 2232 | 1.853821 | 5 | 4.985880 |
| 87 | blackassign0097 | https://insights.blackcoffer.com/impact-of-cov... | COVID 19 has bought the world to its knees. Wi... | [covid, bought, world, knee, business, shut, t...] | 525 | 156 | 38 | 0.22 | 0.61 | 45.461538 | 87.648054 | 53.243837 | 45.461538 | 373 | 1182 | 1956 | 1.654822 | 27 | 4.400169 |


In [34]:
final_output = output.drop(['Text', 'preprocess_txt'], axis=1)
final_output

|     | ID | URL_Link | total_len | pos_count | neg_count | sentiment | polar_score | Average_Sentence_Length | Percentage_Of_Complex_Words | Fog_Index | Average_Words_Per_Sentence | Complex_Words_Count | Word_Count | Syllables_Count | Syllables_Per_Word | Personal_Pronouns_Count | Average_Word_Length |
|-----|----|----------|-----------|-----------|-----------|-----------|-------------|-------------------------|-----------------------------|-----------|----------------------------|---------------------|------------|-----------------|--------------------|-------------------------|---------------------|
| 0 | blackassign0001 | https://insights.blackcoffer.com/rising-it-cit... | 623 | 214 | 9 | 0.33 | 0.92 | 22.457627 | 85.132075 | 43.035881 | 22.457627 | 463 | 1325 | 2179 | 1.644528 | 18 | 4.338868 |
| 1 | blackassign0002 | https://insights.blackcoffer.com/rising-it-cit... | 874 | 249 | 32 | 0.25 | 0.77 | 25.890625 | 84.007242 | 43.959147 | 25.890625 | 601 | 1657 | 3119 | 1.882317 | 22 | 4.968618 |
| 2 | blackassign0003 | https://insights.blackcoffer.com/internet-dema... | 682 | 180 | 28 | 0.22 | 0.73 | 25.847826 | 83.179142 | 43.610787 | 25.847826 | 413 | 1189 | 2496 | 2.099243 | 17 | 5.625736 |
| 3 | blackassign0004 | https://insights.blackcoffer.com/rise-of-cyber... | 667 | 182 | 84 | 0.15 | 0.37 | 31.368421 | 81.208054 | 45.030590 | 31.368421 | 438 | 1192 | 2359 | 1.979027 | 12 | 5.421980 |
| 4 | blackassign0005 | https://insights.blackcoffer.com/ott-platform-... | 420 | 129 | 9 | 0.29 | 0.87 | 27.333333 | 86.043360 | 45.350678 | 27.333333 | 254 | 738 | 1425 | 1.930894 | 12 | 5.245257 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84 | blackassign0094 | https://insights.blackcoffer.com/gaming-disord... | 625 | 169 | 50 | 0.19 | 0.54 | 24.037736 | 87.048666 | 44.434561 | 24.037736 | 447 | 1274 | 2135 | 1.675824 | 41 | 4.434066 |
| 85 | blackassign0095 | https://insights.blackcoffer.com/what-is-the-r... | 355 | 66 | 29 | 0.10 | 0.39 | 41.117647 | 77.968526 | 47.634469 | 41.117647 | 237 | 699 | 1184 | 1.693848 | 11 | 4.652361 |
| 86 | blackassign0096 | https://insights.blackcoffer.com/what-is-the-r... | 658 | 171 | 68 | 0.16 | 0.43 | 26.755556 | 88.787375 | 46.217172 | 26.755556 | 479 | 1204 | 2232 | 1.853821 | 5 | 4.985880 |
| 87 | blackassign0097 | https://insights.blackcoffer.com/impact-of-cov... | 525 | 156 | 38 | 0.22 | 0.61 | 45.461538 | 87.648054 | 53.243837 | 45.461538 | 373 | 1182 | 1956 | 1.654822 | 27 | 4.400169 |


### Exporting the DataFrame to an Excel Worksheet

In [35]:
final_output.to_excel("Output Data Structure.xlsx", index = False)