# Part 2: EDA, Feature Engineering, and Model Development

### Future Optimizations

1. Skip site navigation text when scraping websites such as Wikipedia and Stack Overflow.
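
One possible way to skip navigation text is sketched below. It uses only the standard library and assumes the scraper has access to the raw HTML, which this notebook does not show. The tag list `SKIP_TAGS` and the names `MainTextExtractor` and `extract_main_text` are hypothetical, and real pages (Wikipedia in particular) may mark navigation with `div` classes rather than semantic tags, so treat this as a starting point rather than a finished filter.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold navigation or page chrome, not article text.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside every skipped tag.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html):
    """Return the page text with navigation-like sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Note that an unclosed `nav` or `header` tag would cause everything after it to be skipped, so malformed pages need extra care.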


### Tasks

1. Research transformers, and the BART model in particular, to better understand the model setup and how to get the best performance
2. Experiment with BART input preprocessing (lemmatized or not, capitalization, etc.) to find what performs best
3. Try to implement the BART model with recursion, as per the drawn diagram
4. Research parallelization of transformer models with HuggingFace

### Notes

1. To make the project work, we need a few metrics that a DL model can be trained on, for example linguistic complexity (formulaic), sentiment analysis, and text length.

In [1]:
#import import_ipynb
import WebScrape

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:992)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:992)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:992)>


In [2]:
df = WebScrape.get_df()

Enter your search query:  what is machine learning


Scraping...: 100%|████████████████████████████████| 1/1 [00:04<00:00,  4.50s/it]


Grabbed search results
45


In [3]:
df

Unnamed: 0,Domain,Title,URL,Text
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati..."
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...


In [7]:
len(df["Text"][0])
# df["URL"][6]

11393

In [8]:
df["Text"][0]

'machine learning. One of its own, Arthur Samuel, is credited for coining the term, “machine learning” with his (PDF, 481 KB) (link resides outside IBM) around the game of checkers. Robert Nealey, the self-proclaimed checkers master, played the game on an IBM 7094 computer in 1962, and he lost to the computer. Compared to what can be done today, this feat seems trivial, but it’s considered a major milestone in the field of artificial intelligence. Over the last couple of decades, the technological advances in storage and processing power have enabled some innovative products based on machine learning, such as Netflix’s recommendation engine and self-driving cars. Machine learning is an important component of the growing field of data science. Through the use of statistical methods, algorithms are trained to make classifications or predictions, and to uncover key insights in data mining projects. These insights subsequently drive decision making within applications and businesses, ideal


# To Be Implemented

_____________________________

In [4]:
import spacy
from textstat.textstat import textstatistics
from textstat import textstat
legacy_round = textstat._legacy_round


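# Load the spaCy model once; reloading it on every call is slow.
nlp = spacy.load('en_core_web_sm')

# Returns the list of sentences in the text
def break_sentences(text):
    doc = nlp(text)
    return list(doc.sents)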
 
# Returns Number of Words in the text
def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words
 
# Returns the number of sentences in the text
def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)

# Returns the number of non-space characters in the text
def character_count(text):
    return len(text) - text.count(" ")

# Returns (character count, Coleman-Liau index rounded to 2 places)
def coleman_liau(text):
    words = word_count(text)
    chars = character_count(text)
    L = (chars / words) * 100                  # characters per 100 words
    S = (sentence_count(text) / words) * 100   # sentences per 100 words
    coleman = 0.0588 * L - 0.296 * S - 15.8
    return chars, legacy_round(coleman, 2)


# print(avg_syllables_per_word(df["Text"][17]))
print(coleman_liau(df["Text"][8]))
print(len(df["Text"][8]))


(9451, 11.86)
11162
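
For quick checks without spaCy or textstat, the same Coleman-Liau formula can be sketched with the standard library alone. This is not the implementation used in the cell above: the function name `coleman_liau_index` is hypothetical, it counts letters only (the cell above counts every non-space character), and it splits words and sentences with crude regular expressions, so its values will differ somewhat from the notebook's.

```python
import re


def coleman_liau_index(text):
    """Approximate Coleman-Liau index using regex-based word and sentence splitting."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        raise ValueError("text contains no words")

    # Assumption: sentences end at '.', '!' or '?'. Abbreviations and decimals
    # will be miscounted by this simple rule.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    letters = sum(ch.isalpha() for ch in text)
    L = letters / len(words) * 100          # letters per 100 words
    S = len(sentences) / len(words) * 100   # sentences per 100 words
    return round(0.0588 * L - 0.296 * S - 15.8, 2)
```

Short, simple sentences produce low or even negative values, which is expected for this index.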


In [5]:
import spacy
from textstat.textstat import textstatistics
from textstat import textstat
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, pipeline

legacy_round = textstat._legacy_round

# Load the spaCy model once; reloading it on every call is slow.
nlp = spacy.load('en_core_web_sm')

def break_sentences(text):
    doc = nlp(text)
    return list(doc.sents)

def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words

def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)

def coleman_liau(text):
    words = word_count(text)
    L = ((len(text) - text.count(" ")) / words) * 100
    S = (sentence_count(text) / words) * 100
    coleman = 0.0588 * L - 0.296 * S - 15.8
    return (len(text) - text.count(" ")), legacy_round(coleman, 2)

def lexical_diversity(text):
    words = text.split()
    unique_words = set(words)
    return len(unique_words) / len(words)

# Load the sentiment analysis model, tokenizer, and pipeline once;
# rebuilding them on every call is slow.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

def sentiment_score(text):
    # Split the input text into chunks of up to 512 characters (not tokens).
    # In practice this keeps each chunk well under the model's 512-token limit.
    max_chunk_chars = 512
    chunks = [text[i:i + max_chunk_chars] for i in range(0, len(text), max_chunk_chars)]

    # 'score' is the model's confidence in its predicted label (POSITIVE or
    # NEGATIVE), not a signed polarity.
    sentiment_scores = [sentiment_analyzer(chunk)[0]['score'] for chunk in chunks]

    # Calculate the average score for the entire text
    average_sentiment_score = sum(sentiment_scores) / len(sentiment_scores)
    return average_sentiment_score

def complexity_score(text):
    char_count, coleman_liau_index = coleman_liau(text)
    lex_diversity = lexical_diversity(text)
    sentiment = sentiment_score(text)

    # Design the formula based on the weights assigned to each factor
    # You can adjust the weights based on the importance of each factor
    sentence_structure_weight = 0.35
    lexical_diversity_weight = 0.25
    word_length_syllable_weight = 0.2
    uncertainty_weight = 0.1

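    # Combine the raw (not yet normalized) features using the assigned weights.
    # Note: char_count is orders of magnitude larger than the other terms,
    # so it dominates the combined score until the features are normalized.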
    combined_score = (
        sentence_structure_weight * coleman_liau_index +
        lexical_diversity_weight * lex_diversity +
        word_length_syllable_weight * char_count +
        uncertainty_weight * (abs(sentiment - 0.5))
    )

    return combined_score

    
#Example usage
#df['Raw Complexity Score'] = df['Text'].apply(lambda text: complexity_score(text))

#print("Complexity Score:", complexity_score_example)
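
A rough way to see how much each term contributes to `combined_score` is sketched below. The weights are the ones from `complexity_score`; the feature values passed in are purely illustrative assumptions (not measured from the scraped pages), and `term_contributions` is a hypothetical helper name.

```python
# Weights copied from complexity_score above.
WEIGHTS = {
    "coleman_liau": 0.35,
    "lexical_diversity": 0.25,
    "char_count": 0.2,
    "sentiment": 0.1,
}


def term_contributions(coleman_liau_index, lex_diversity, char_count, sentiment):
    """Return each weighted term of the combined score separately."""
    return {
        "coleman_liau": WEIGHTS["coleman_liau"] * coleman_liau_index,
        "lexical_diversity": WEIGHTS["lexical_diversity"] * lex_diversity,
        "char_count": WEIGHTS["char_count"] * char_count,
        "sentiment": WEIGHTS["sentiment"] * abs(sentiment - 0.5),
    }


# Illustrative (assumed) values for one 400-word component.
terms = term_contributions(
    coleman_liau_index=12.0, lex_diversity=0.6, char_count=2300, sentiment=0.95
)
total = sum(terms.values())
shares = {name: value / total for name, value in terms.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

With values of this magnitude the character-count term accounts for nearly the whole score, which suggests scaling each feature to a comparable range before applying the weights.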


$\text{Webpage Complexity Score} = \frac{1}{n} (\sum_{i=1}^{n} \left( x_{1,i} \times w_{1} + x_{2,i} \times w_{2} + x_{3,i} \times w_{3} \right)) + y \times w_{y}$


Where:

$n$ is the number of text components

$x_{1, i}, x_{2, i}, x_{3, i}$ are the features associated with the i-th text component

$w_{1}, w_{2}, w_{3}$ are the weights associated per feature

$y$ is a continuous variable ranging from 0 to 1, representing the complexity score from ChatGPT

$w_{y}$ is the weight for $y$
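
The formula above can be sketched directly in Python. The function name `webpage_complexity_score` and its argument layout are assumptions made for illustration; the notebook does not yet define which three features play the roles of $x_1, x_2, x_3$.

```python
def webpage_complexity_score(component_features, feature_weights, y, w_y):
    """Mean weighted feature sum over text components, plus the weighted ChatGPT score.

    component_features: list of (x1, x2, x3) tuples, one per text component
    feature_weights:    (w1, w2, w3)
    y:                  ChatGPT complexity score in [0, 1]
    w_y:                weight applied to y
    """
    n = len(component_features)
    if n == 0:
        raise ValueError("at least one text component is required")

    # Sum of x1*w1 + x2*w2 + x3*w3 over all components.
    weighted_sum = sum(
        sum(x * w for x, w in zip(features, feature_weights))
        for features in component_features
    )
    return weighted_sum / n + y * w_y
```

For example, `webpage_complexity_score([(1, 2, 3), (3, 2, 1)], (0.5, 0.25, 0.25), y=0.8, w_y=0.5)` averages the two component sums and then adds the weighted ChatGPT term.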

### Edits made below this line are for dealing with text snippets

In [6]:
#Carlos - Testing Text Snippets 8/3

df["Text"][0]

"Machine Learning is an AI technique that teaches computers to learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases. Deep learning is a specialized form of machine learning. Table of Contents How Machine Learning Works Machine learning uses two types of techniques: , which trains a model on known input and output data so that it can predict future outputs, and , which finds hidden patterns or intrinsic structures in input data. Figure 1. Machine learning techniques include both unsupervised and supervised learning. Supervised Learning Supervised machine learning builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and tr

In [7]:
#perform the split, preserve all text data
def split_text_into_components(text, component_length=400):
    words = text.split()
    num_words = len(words)
    components = []

    for i in range(0, num_words, component_length):
        component = ' '.join(words[i:i + component_length])
        components.append(component)

    return components

# Example usage for one entry in df["Text"]
first_entry_text = df["Text"][0]
text_components = split_text_into_components(first_entry_text)
text_components

['Machine Learning is an AI technique that teaches computers to learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases. Deep learning is a specialized form of machine learning. Table of Contents How Machine Learning Works Machine learning uses two types of techniques: , which trains a model on known input and output data so that it can predict future outputs, and , which finds hidden patterns or intrinsic structures in input data. Figure 1. Machine learning techniques include both unsupervised and supervised learning. Supervised Learning Supervised machine learning builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and t

In [8]:
#The above cell appears to be working, now apply to the entire dataframe and create a new column:

df["Snippet Text"] = None
# Iterate through each entry in df["Text"]
for i, entry_text in enumerate(df["Text"]):
    text_components = split_text_into_components(entry_text)
    df.at[i, "Snippet Text"] = text_components

df

Unnamed: 0,Domain,Title,URL,Text,Snippet Text
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...,[Machine Learning is an AI technique that teac...
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...,[Machine Learning(ML) can be explained as auto...
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...,[machine learning (ML) to help machines unders...
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...","[Machine Learning? Defination, Types, Applicat..."
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...,[machine learning? Machine learning is a branc...
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...,[machine learning? Machine learning (ML) is a ...
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...,[Machine learning applications learn from the ...
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...,[Machine learning approaches have been applied...


In [9]:
'''Now apply the complexity score formula (raw, unnormalized, score) to each of the text components for each entry in
df["Snippet Text"], and store each score in a 1D list'''

# Apply the complexity_score function to each text component
df["Raw Component Scores"] = df["Snippet Text"].apply(lambda components: [complexity_score(component) for component in components])
df

Unnamed: 0,Domain,Title,URL,Text,Snippet Text,Raw Component Scores
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...,[Machine Learning is an AI technique that teac...,"[471.06845877770587, 425.1950356500149, 428.77..."
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...,[Machine Learning(ML) can be explained as auto...,"[475.50509493867554, 469.92738025907676, 430.6..."
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...,[machine learning (ML) to help machines unders...,"[439.47750360039873, 468.8855853036245, 429.73..."
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76..."
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...","[Machine Learning? Defination, Types, Applicat...","[425.330586741209, 408.8674197986126, 420.5684..."
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...,[machine learning? Machine learning is a branc...,"[448.17652460459874, 415.42240747261053, 459.3..."
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76..."
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...,[machine learning? Machine learning (ML) is a ...,"[441.6114619029761, 459.74303074292345, 466.37..."
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...,[Machine learning applications learn from the ...,"[434.5384219884873, 414.38391798281674, 484.09..."
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...,[Machine learning approaches have been applied...,"[426.15112552976615, 414.53034198927884, 459.0..."


In [12]:
# Filter out rows whose domain contains "youtube"
# and drop any rows with empty text
df = df[~df['Domain'].str.contains('youtube')]
df = df[df['Text'] != '']
df

Unnamed: 0,Domain,Title,URL,Text,Snippet Text,Raw Component Scores
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...,[Machine Learning is an AI technique that teac...,"[471.06845877770587, 425.1950356500149, 428.77..."
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...,[Machine Learning(ML) can be explained as auto...,"[475.50509493867554, 469.92738025907676, 430.6..."
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...,[machine learning (ML) to help machines unders...,"[439.47750360039873, 468.8855853036245, 429.73..."
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76..."
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...","[Machine Learning? Defination, Types, Applicat...","[425.330586741209, 408.8674197986126, 420.5684..."
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...,[machine learning? Machine learning is a branc...,"[448.17652460459874, 415.42240747261053, 459.3..."
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76..."
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...,[machine learning? Machine learning (ML) is a ...,"[441.6114619029761, 459.74303074292345, 466.37..."
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...,[Machine learning applications learn from the ...,"[434.5384219884873, 414.38391798281674, 484.09..."
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...,[Machine learning approaches have been applied...,"[426.15112552976615, 414.53034198927884, 459.0..."


In [16]:
# Calculate the average scores for each entry in "Raw Component Scores"
df["Average Raw Score"] = df["Raw Component Scores"].apply(lambda scores: sum(scores) / len(scores))

# Normalize the average scores from 0 to 1
max_score = df["Average Raw Score"].max()
min_score = df["Average Raw Score"].min()
df["Normalized Raw Scores"] = (df["Average Raw Score"] - min_score) / (max_score - min_score)
df

Unnamed: 0,Domain,Title,URL,Text,Snippet Text,Raw Component Scores,Average Raw Score,Normalized Raw Scores
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...,[Machine Learning is an AI technique that teac...,"[471.06845877770587, 425.1950356500149, 428.77...",397.292981,0.848827
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...,[Machine Learning(ML) can be explained as auto...,"[475.50509493867554, 469.92738025907676, 430.6...",456.026911,0.989956
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...,[machine learning (ML) to help machines unders...,"[439.47750360039873, 468.8855853036245, 429.73...",358.483001,0.755573
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76...",409.2843,0.877641
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...","[Machine Learning? Defination, Types, Applicat...","[425.330586741209, 408.8674197986126, 420.5684...",415.273629,0.892032
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...,[machine learning? Machine learning is a branc...,"[448.17652460459874, 415.42240747261053, 459.3...",402.88802,0.862271
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76...",409.2843,0.877641
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...,[machine learning? Machine learning (ML) is a ...,"[441.6114619029761, 459.74303074292345, 466.37...",382.306133,0.812816
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...,[Machine learning applications learn from the ...,"[434.5384219884873, 414.38391798281674, 484.09...",382.420503,0.813091
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...,[Machine learning approaches have been applied...,"[426.15112552976615, 414.53034198927884, 459.0...",443.596187,0.960087


In [27]:
from sklearn.preprocessing import StandardScaler


# Standardize the average scores using StandardScaler
scaler = StandardScaler()
df['Standardized Raw Scores'] = scaler.fit_transform(df[['Average Raw Score']])

df

Unnamed: 0,Domain,Title,URL,Text,Snippet Text,Raw Component Scores,Average Raw Score,Normalized Raw Scores,Standardized Raw Scores
0,mathworks,What are Machine Learning Algorithms for AI?,https://www.mathworks.com/discovery/machine-le...,Machine Learning is an AI technique that teach...,[Machine Learning is an AI technique that teac...,"[471.06845877770587, 425.1950356500149, 428.77...",397.292981,0.848827,0.415392
1,geeksforgeeks,What are Machine Learning Algorithms for AI?,https://www.geeksforgeeks.org/ml-machine-learn...,Machine Learning(ML) can be explained as autom...,[Machine Learning(ML) can be explained as auto...,"[475.50509493867554, 469.92738025907676, 430.6...",456.026911,0.989956,0.988752
2,expert,What are Machine Learning Algorithms for AI?,https://www.expert.ai/blog/machine-learning-de...,machine learning (ML) to help machines underst...,[machine learning (ML) to help machines unders...,"[439.47750360039873, 468.8855853036245, 429.73...",358.483001,0.755573,0.03653
3,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76...",409.2843,0.877641,0.532451
4,mygreatlearning,What are Machine Learning Algorithms for AI?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...","[Machine Learning? Defination, Types, Applicat...","[425.330586741209, 408.8674197986126, 420.5684...",415.273629,0.892032,0.590919
5,ibm,What are Machine Learning Algorithms for AI?,https://www.ibm.com/topics/machine-learning,machine learning? Machine learning is a branch...,[machine learning? Machine learning is a branc...,"[448.17652460459874, 415.42240747261053, 459.3...",402.88802,0.862271,0.470011
6,oracle,What are Machine Learning Algorithms for AI?,https://www.oracle.com/artificial-intelligence...,Machine Learning Defined Machine learning (ML)...,[Machine Learning Defined Machine learning (ML...,"[460.8130749673446, 472.59217594965304, 484.76...",409.2843,0.877641,0.532451
7,techtarget,What are Machine Learning Algorithms for AI?,https://www.techtarget.com/searchenterpriseai/...,machine learning? Machine learning (ML) is a t...,[machine learning? Machine learning (ML) is a ...,"[441.6114619029761, 459.74303074292345, 466.37...",382.306133,0.812816,0.269091
8,microfocus,What are Machine Learning Algorithms for AI?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning applications learn from the i...,[Machine learning applications learn from the ...,"[434.5384219884873, 414.38391798281674, 484.09...",382.420503,0.813091,0.270207
9,en.wikipedia,What are Machine Learning Algorithms for AI?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning approaches have been applied ...,[Machine learning approaches have been applied...,"[426.15112552976615, 414.53034198927884, 459.0...",443.596187,0.960087,0.867403
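
The two rescalings used above (min-max normalization and standardization) can be reproduced on a plain list with the standard library, which is handy for sanity-checking the DataFrame columns. This is a sketch, not the notebook's code: `min_max_normalize` and `standardize` are hypothetical names, and returning zeros when every value is identical is an assumed convention that avoids dividing by zero.

```python
from statistics import fmean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, map every value to 0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score using the population standard deviation, as StandardScaler does."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

Min-max scores depend entirely on the smallest and largest page in the current search, so a single outlier compresses every other value; standardized scores are less sensitive to that but are not bounded to the 0 to 1 range.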


In [25]:
df["URL"][32]

'https://www.britannica.com/technology/machine-learning'

#### end

In [None]:
# Find the row index with the maximum complexity score
max_score_index = df['score'].idxmax()

# Find the row index with the minimum complexity score
min_score_index = df['score'].idxmin()

# Access the entire row with the maximum complexity score
row_with_max_score = df.loc[max_score_index]

# Access the entire row with the minimum complexity score
row_with_min_score = df.loc[min_score_index]

# Print the rows with the maximum and minimum complexity scores
print("Row with Maximum Complexity Score:")
print(row_with_max_score)

print("\nRow with Minimum Complexity Score:")
print(row_with_min_score)

In [52]:
# Get the minimum and maximum complexity scores in the dataframe
min_score = df['score'].min()
max_score = df['score'].max()

# Perform min-max normalization on the 'score' column
df['Normalized_Score'] = (df['score'] - min_score) / (max_score - min_score)
df

Unnamed: 0,Domain,Title,URL,Text,score,Normalized_Score,Standardized_Score
0,mathworks,What is Machine Learning?,https://www.mathworks.com/discovery/machine-le...,"Machine Learning? How it works, why it matters...",1587.715427,0.133911,-0.304059
1,oracle,What is Machine Learning?,https://www.oracle.com/artificial-intelligence...,Machine Learning? Machine learning defined Mac...,2029.436482,0.171532,-0.083091
2,ibm,What is Machine Learning?,https://www.ibm.com/topics/machine-learning,machine learning along with important definiti...,3218.942733,0.272843,0.511952
3,en.wikipedia,What is Machine Learning?,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is an umbrella term for ...,11756.60496,1.0,4.782858
4,mygreatlearning,What is Machine Learning?,https://www.mygreatlearning.com/blog/what-is-m...,"Machine Learning? Defination, Types, Applicati...",5784.709243,0.491371,1.795459
5,geeksforgeeks,What is Machine Learning?,https://www.geeksforgeeks.org/ml-machine-learn...,machine learning as – a “Field of study that g...,2739.854929,0.232039,0.272291
6,techtarget,What is Machine Learning?,https://www.techtarget.com/searchenterpriseai/...,machine learning Share this item with your net...,2255.442144,0.190781,0.029967
7,expert,What is Machine Learning?,https://www.expert.ai/blog/machine-learning-de...,Machine Learning? A Definition. - 14 March 202...,1473.788449,0.124207,-0.36105
8,microfocus,What is Machine Learning?,https://www.microfocus.com/en-us/what-is/machi...,Machine learning is a subset of focused on bui...,1958.52209,0.165492,-0.118565
9,zdnet,What is Machine Learning?,https://www.zdnet.com/article/what-is-machine-...,"machine learning is, how it is related to arti...",5361.103249,0.455292,1.583553


In [53]:
df["score"]

0      1587.715427
1      2029.436482
2      3218.942733
3     11756.604960
4      5784.709243
5      2739.854929
6      2255.442144
7      1473.788449
8      1958.522090
9      5361.103249
10     2211.926008
11     3526.128287
12     2319.804583
13      649.037578
14     2978.689684
15      609.917371
16     3386.981269
17      134.520025
18       15.448827
19     2126.561831
20     1598.470207
21     2631.754108
22     2783.864984
23      638.195219
24     1966.177028
25     1339.506465
26      838.038988
27     3461.326060
28     5013.693653
29     3030.218144
30     1249.071004
31      254.692162
32      797.857363
33      797.857363
34      851.072658
35     1524.875672
36     1375.277379
37      833.654665
38     2145.624468
39     3764.116082
40     1047.316750
41     2208.670203
42      233.889568
43     2163.989245
44      124.819649
Name: score, dtype: float64

In [None]:


df_complexity = df[['Text','URL']].copy()
df_complexity['character_count'] = df['Text'].apply(character_count)
df_complexity['complex_score'] = ''

df_complexity.head()
df_complexity["URL"][0]


'https://www.mathworks.com/discovery/machine-learning.html'

In [None]:
import os
import openai

# Read the API key from the environment instead of hard-coding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

def get_complexity_score(text, domain, model="gpt-3.5-turbo"):
    
    char_count, coleman_liau_index = coleman_liau(text)
    
    # NOTE: the current prompt is arbitrary and still needs tuning.
    prompt = f"""
    Given the webpage content,
    Lexical complexity (Coleman-Liau Index): {coleman_liau_index}
    Character Count: {char_count}
    Domain: {domain}

    Read the content of the webpage delimited by triple backticks, ```{text}```, to gain an understanding of its technical difficulty.
    Generate a complexity score as a float from 0-1 rounding off to the hundredth place through reading the website, lexical complexity, character count, and information about its domain. 
    Ensure and place focus on the accuracy of the generated complexity score by potentially using sophisticated formulas that can help with accuracy.
    Explain how you generated this score before presenting the score.

    """
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]


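# Renamed so it no longer shadows get_complexity_score(text, domain) above.
def score_all_pages():
    for i in df_complexity.index:
        # df_complexity only holds 'Text' and 'URL', so the domain is read from df.
        response_text = get_complexity_score(df_complexity.at[i, 'Text'], df.at[i, 'Domain'])
        # TODO: replace with a formula using content, domain, lexical score, and char count.
        # Until that formula exists, the raw model response is stored.
        df_complexity.at[i, 'complex_score'] = response_text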
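
Because the prompt asks the model to explain its reasoning before giving the score, the response is free text rather than a bare number. One possible way to pull the score out is sketched below. `parse_complexity_score` is a hypothetical helper, and taking the last number between 0 and 1 is an assumption about how the model orders its answer; a stricter output format in the prompt would be more reliable.

```python
import re


def parse_complexity_score(response_text):
    """Return the last number in [0, 1] found in the response, or None if there is none."""
    # Find every integer or decimal number in the text.
    numbers = [float(match) for match in re.findall(r"\d+(?:\.\d+)?", response_text)]

    # Keep only values that could be a complexity score.
    in_range = [value for value in numbers if 0.0 <= value <= 1.0]

    # Assumption: the score comes after the explanation, so the last match wins.
    return in_range[-1] if in_range else None
```

A response whose explanation ends with another small number (for example a list item "1") would be misread, so the result should be spot-checked.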


In [None]:
import torch
from sklearn.model_selection import train_test_split

# Assuming you have a DataFrame called 'df' containing your data
# Split the data into train and test sets (80% train, 20% test)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the train data further into train and validation sets (80% train, 20% validation)
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

# Print the sizes of each dataset
print("Train set size:", len(train_data))
print("Validation set size:", len(val_data))
print("Test set size:", len(test_data))

In [None]:
# reset_index returns a new DataFrame, so the result must be assigned back.
test_data = test_data.reset_index(drop=True)
val_data = val_data.reset_index(drop=True)
train_data = train_data.reset_index(drop=True)

____________________________________________________________

In [None]:
import torch
from transformers import pipeline

summarizer = pipeline("summarization", 
                      model="pszemraj/led-base-book-summary", 
                      device=0 if torch.cuda.is_available() else -1,
                      )

result = summarizer(
    df["Text"][0],
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
print(df["Text"][0])
print()
print(result[0]["summary_text"])

### Notes About Summarization

1. We need to filter out navigation content for BART to do a good job. We can probably do that by briefly studying the HTML of some of the web pages.
2. Some websites include LaTeX for mathematical and similar expressions, which BART cannot parse or comprehend. We need to either integrate a LaTeX parser or filter those expressions out.
3. Research why LED keeps cutting off part of the summary, and how to fix it.
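
For the LaTeX issue in note 2, the filtering option could start from something like the sketch below. It assumes math appears between standard delimiters (`$...$`, `$$...$$`, `\(...\)`, `\[...\]`); `strip_latex` and `LATEX_PATTERNS` are hypothetical names, and other forms that show up in scraped text, such as Wikipedia's `{\displaystyle ...}` fragments, are not handled.

```python
import re

# Assumption: math is wrapped in one of these standard delimiter pairs.
LATEX_PATTERNS = [
    r"\$\$.*?\$\$",   # display math: $$ ... $$
    r"\\\[.*?\\\]",   # display math: \[ ... \]
    r"\\\(.*?\\\)",   # inline math: \( ... \)
    r"\$[^$\n]+\$",   # inline math: $ ... $ (also matches text between two currency amounts)
]


def strip_latex(text):
    """Remove delimited LaTeX math from text and collapse the leftover whitespace."""
    for pattern in LATEX_PATTERNS:
        text = re.sub(pattern, " ", text, flags=re.DOTALL)
    return re.sub(r"\s+", " ", text).strip()
```

The single-dollar pattern is the riskiest one: a sentence such as "between $5 and $10" would lose the text between the two dollar signs, so it may be worth disabling for pages that discuss prices.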

https://www.geeksforgeeks.org/readability-index-pythonnlp/#