In [1]:
!pip install -U -q transformers



In [2]:
#@markdown utils
from transformers.utils.logging import set_verbosity

set_verbosity(40)  # 40 == logging.ERROR: show errors only

import warnings
# ignore hf pipeline complaints
warnings.filterwarnings("ignore", category=UserWarning, module='transformers')
warnings.filterwarnings("ignore", category=FutureWarning, module='transformers')

In [3]:
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)



In [5]:
params = {
    "max_length": 256,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # generation parameters forwarded by the pipeline


In [6]:
input_text = """
Recommender systems are widely used these days in e-commerce, for the purpose of personalized
recommendation. Based on each user’s profile, previous purchase history, and online behavior, they
suggest products which they are likely to prefer. For example, Amazon.com is using recommender
systems for books. When a user logs-in to the system, it suggests books similar to previously bought
ones by the user.
Personalized recommendation can be applied to outside of commercial applications. These days,
many academic papers are coming out from a lot of conferences and journals. Academic researchers
should go through all the conferences and journals which are related to their field of research and
find out if there is any new articles that may relate to their current works. Sometimes they search
the articles from Google scholars or Citeseer with the key words that might show interesting articles
to them. However, these two methods require users to commit their time to search articles, which is
labor-intensive, and also do not guarantee that they will find the exact articles related to their field
of research.
In order to reduce their workload, we suggest developing the scholarly paper recommendation system for academic researchers, which will automatically detect their research topics they are interested in and recommend the related articles they may be interested in based on similarity of the
works. We believe this system will save the researchers’ time to search the articles and increase the
accuracy of finding the articles they are interested in.
In this section we briefly present some of the research literature related to recommender systems in
general, academic paper recommendation system, and evaluation of recommender systems.
Recommender systems are broadly classified into three categories[7]: collaborative filtering,
contents-based methods, and hybrid methods. First, collaborative filtering uses only user-item rating
matrix for predicting unseen preference[21, 1]. It can be categorized into memory-based CF, which
contains the whole matrix on memory, and model-based CF, building a model for estimation[2].
The most effective memory-based algorithms known so far is item-based CF[19]. Recently, making use of matrix factorization, a kind of model-based approach[14, 16, 18, 3, 24], is known as
the most efficient and accurate, especially after those approaches won the Netflix prize in 2009.
Content-based methods, on the other hand, recommend items based on their characteristics as well
as specific preferences of a user[7]. Pazzani[15] studied this approach in depth, including how to
build user and item profiles. Last category, hybrid approach, tries to combine both collaborative and
content-based recommendation. Koren[8] suggested effectively combining rating information and
user, item profiles for more accurate recommendation.
Recommender systems have concentrated on recommending media items such as movies, but recently they are extending to academy. Most popular application is citation recommendation[5, 12,
23, 20]. Recently, Matsatsinis [11] introduced scientific paper recommendation using decision theory. Sugiyama[22] extended scholarly paper recommendation with citation and reference information.
Although recommender systems are very popular in commercial applications these days, it is still
difficult to evaluate them due to the lack of standard methods. Traditional recommender systems [4, 17, 6] were usually introduced in Human-Computer Interaction community, so they have
been evaluated by user study. This approach is still used, especially for verifying improvement in
terms of user experience.
3.2 Data Model
3.2.1 Bag-of-word model
With the gathered data, we modeled them by a bag-of-word model. In this model, each word appeared in the whole document corpora becomes an attribute. Then, each document is represented by
a bit vector, indicating whether each word appears or not. This model is based on two assumptions;
1) word probabilities for one text position are independent of the words that occur in other positions
(Naive Bayes Assumption) and 2) the probability of encountering a specific word is independent of
its position. (Independent Identical Distribution Assumption) This assumption is incorrect, but it is
known that this does not seriously affect classification or learning task. [13] We combined title, key
words, and abstract to construct a set of words representing a paper.
3.2.2 Heuristics
For more efficient processing, we applied some heuristics. First, we removed stop words such as
”the” or ”of”. These words appear in almost every document in English, so they are not useful for
classifying or filtering some specific documents, but just slow down computation speed by increasing
the text length. We removed about 140 words which were selected manually. This process reduces
the length of dictionary, resulting in reduced dimension of the clustering work, so we expect speed
improvement.
The other heuristic applied is stemming. In English, same word can be used as different parts, usually in a slightly different form. For example, ”clear”, ”clearly”, and ”cleared” have same meaning,
but used in different forms for its position or role in the sentence. It is much better to deal with these
minor changes of forms as same words, as it can dramatically reduce the dimension. However, this
work is not straightforward. As a first step, we just removed last ”ed”, ”ly”, and ”ing” from the
word, whenever encountered.
3.3 Learner (Recommender)
Using the crawled documents and data model discussed so far, we are ready to proceed to our main
goal: personalized recommendation of academic papers. As a perspective of recommendation system, we can consider authors as users and papers as items. We will use these terms interchangeably
henceforth. We can think of recommendation system as a task to fill out missing preference data on
a user-item matrix, based on observed values. There can be lots of schemes to decide proper values
for missing preference. Filling with the user’s average or item’s average can be a simple baseline. In
this section, we discuss fundamental characteristics of our problem, and then describe our algorithm.
3.3.1 Inherent Characteristics of Problem
The information we gather contains each paper’s title, list of authors, key words, and abstract. In
order to build a user-item matrix with this data, we assume that users are interested in their own
papers. Thus, we set high score (in this paper, 5) to every <author, paper> pair that the paper is
written by the author. We use 1-5 scale as it is widely used in recommendation systems in literature.
We claim that this user-item matrix we use is extremely sparse, which means most of values are
missing while only small portion of them are observed. This situation is common in recommendation, though. According to Netflix Prize data, only 1% of cells of the user-item matrix are observed
3
values. Nonetheless, it has been shown that it is possible to accurately estimate missing data only
using small amount of observed data. In our situation, however, the sparsity can be worse. Regularly, one author writes only one or two papers in one conference proceeding. There are only at
most two or three top-level conferences in each field, the maximum number of papers one author
can publish a year is about 10. This is an ideal case, and most researchers may have only one or two
papers. Thus, our matrix have only a few number of preference data.
More serious problem is that we do not have ”dislike” information. When we request users to
explicitly rate items in a common recommendation system, we can get both positive and negative
feedback from the user. For example, we can get ”very like” feedback for the movie ”Titanic” as
well as ”very hate” one for the ”Shrek 2.” Based on this variety, we can infer that the user may
prefer romantic movies to animations. In our data, however, we do not have negative feedback. This
problem makes difficult us to use widely-used collaborative filtering algorithms.
3.3.2 Naive Recommender
We basically assume that authors will like papers similar to ones they wrote before. In this context,
we note that similar papers mean ones dealing with similar topic. In our Naive Recommender,
we just apply this assumption. When we try to recommend a set of papers to a specific user, we
first calculate similarity between every paper and the user’s own papers. Then, we take the highest
similarity as the score of that paper. This process is similar to k-Nearest Neighbors (kNN) algorithm.
That is, we can easily select and recommend most similar n papers to the target user’s previous paper.
We used vector cosine of our data model (bit vector) as the similarity measure.
However, the real situation is a little bit more complicated, as the user may have written more than
one paper. It is still kNN, but we can have more than one queried point. Thus, we applied clustering
first. All candidate papers are assigned to only one of the most similar paper written by the target
user. This process is similar to K-means, but the centroids are also papers, so their geometric location
in the space cannot change. Thus, we do not need to iterate in our case. After assigned to a cluster,
the score is calculated based on the distance between the candidate paper and its centroid. For
example, as shown in Figure 2, each big circle represents a centroid of a cluster and small circles
connected to the centroid are members of its cluster. Using the calculated score as a distance metric
for kNN, we select k papers for recommendation to the target user.
interested in more accurately. Also, through the focus group interview we discovered the interesting
fact that even though the topics are not as much as relevant to their research topic, they showed great
interest to the papers that their peer researchers, i.e., their former students or the researchers they
have done research together before, wrote. In this way, it will be important to include the information about relevant researchers to users and recommend papers that they found interesting or they
have wrote. Also, the subjects replied, if we provide information about which researcher liked this
papers, it would also give them great reason and motivation to read that paper.
For the perspective of machine learning, we may need to consider about scalability. Although our
current system runs within a few minutes, it may take more time when we crawl more data. First, we
can improve accuracy of similarity measure by allowing counting the frequency of each word in a
document, instead of bit vector model. TF-IDF model [10] can be a great candidate to implement. In
this model, we give more weight for frequently used words in a specific document, but not in other
ones. Also, we may need to speed up the calculation. For this, dimension reduction will be helpful.
Specifically, it would be better to add more stemming logic because this can deal with more words
as same ones, so we can successfully reduce dimension. We may use L-Distance algorithm [9] for
calculating similarity of each word pair, and decide whether they are same or not.
 Conclusion
In this paper, we have presented a Personalized Academic Research Paper Recommendation System, which recommends related articles for each researcher. Thanks to our system, researchers can
get their related papers without searching keywords on Google or browsing top conferences’ proceedings. Our system makes three contributions. First, we have developed a web crawler to retrieve
a huge number of research papers from the web. Second, we define a similarity measure for research
papers. Third, we have developed our recommender system using collaboration filtering methods.
Evaluation results show the usefulness of our system.
"""


In [9]:
%%time
result = summarizer(input_text, **params)

print(result)


[{'summary_text': 'This paper introduces a new recommendation system that uses machine learning to predict which articles are most likely to be useful for academics. It uses a combination of crowd-mapping, clustering, and sentimental analysis to develop a recommendation system. The main goal of the proposed system is to automatically find papers that relate to the interests of each individual.'}]
CPU times: total: 5min 48s
Wall time: 1min 42s


In [16]:
# Candidate max_length as a proportion of the input length.
# Note: len() counts characters, whereas max_length is measured in tokens,
# so this value is not directly usable as max_length and is not passed to the model below.
input_length = len(input_text)
proportion = 0.6  # adjustable
dynamic_max_length = int(input_length * proportion)

params = {
    "max_length": 512,
    "min_length": 8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 3.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
}  # generation parameters forwarded by the pipeline


# Summarize again with max_length raised from 256 to 512
result = summarizer(input_text, **params)

result

[{'summary_text': 'This paper introduces a new recommendation system that uses machine learning to predict which articles are most likely to be useful for academics. It uses a combination of crowd-mapping, clustering, and sentimental analysis to develop a recommendation system. The main goal of the proposed system is to automatically find papers that relate to the interests of each individual.'}]
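
The cell above derives a length from len(input_text), which counts characters, while max_length is measured in tokens. A minimal sketch of a token-based version follows. The helper name proportional_max_length, the tokenize argument, and the floor and cap values are assumptions made for illustration; none of them appear in the notebook.

```python
from typing import Callable, Sequence


def proportional_max_length(
    text: str,
    tokenize: Callable[[str], Sequence],
    proportion: float = 0.6,
    floor: int = 8,
    cap: int = 512,
) -> int:
    """Return a max_length that is a fraction of the input's token count.

    `tokenize` is any callable returning a sequence of tokens. `floor` and
    `cap` are illustrative bounds (assumptions, not tuned for any model).
    """
    n_tokens = len(tokenize(text))
    return max(floor, min(cap, int(n_tokens * proportion)))


# Whitespace splitting stands in for a real tokenizer so the sketch runs offline.
demo_text = "recommender systems suggest products to users " * 20
print(proportional_max_length(demo_text, str.split))
```

With the pipeline from this notebook, the call would presumably look like proportional_max_length(input_text, summarizer.tokenizer.tokenize), with the result assigned to params["max_length"].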

In [10]:

# Split the input text into chunks of a manageable length
chunk_size = 1000
chunks = [input_text[i:i+chunk_size] for i in range(0, len(input_text), chunk_size)]
    
# Generate summaries for each chunk and combine them
combined_summary = ""
for chunk in chunks:
    summary = summarizer(chunk, **params)
    combined_summary += summary[0]['summary_text'] + " "
print(combined_summary)
len(combined_summary)

In this paper, we discuss the use of recommendation systems to help customers find the right books and articles for their needs. In this paper, we present a recommendation system that will help academics find the most relevant articles for their fields of study. In this paper, we discuss the different types of recommendation approaches that can be used to help us make better decisions about what to buy and how to purchase it. Some of the most effective approaches are described in this paper. The first is a "memory-based" approach, which uses memory to build a prediction model. The second is an "item-based," or "collaborative," approach. These approaches work best because they combine both information about the item and its characteristics. In this paper, we present a new recommendation system that uses machine learning to predict the likelihood of each word appearing in a scientific paper. In this paper, we explore the use of a combination of key words and abstracts to build a dictiona
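
Fixed-width character slices can cut a word or sentence in half at every chunk boundary. One possible alternative is to pack whole lines into chunks up to a character budget. The sketch below assumes a hypothetical helper named chunk_by_lines; it is not part of the notebook and has not been run against the model.

```python
from typing import List


def chunk_by_lines(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole lines into chunks of at most `max_chars` characters.

    A single line longer than `max_chars` is kept whole as its own chunk
    rather than being split mid-word.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # +1 accounts for the space used to join lines within a chunk.
        added = len(line) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "first line of text\nsecond line of text\nthird line"
for piece in chunk_by_lines(sample, max_chars=40):
    print(repr(piece))
```

The returned list can be used in place of the sliced chunks in the loop above.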

In [13]:
print("1.", len(combined_summary))  # characters in the combined chunk summaries
print("2.", len(result))  # items in the pipeline's result list, not characters

1. 2919
2. 1
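
Because len(result) counts the items in the list returned by the pipeline, it prints 1 however long the summary is. To compare character counts, the text has to be read from each item's 'summary_text' field. A small sketch, assuming a hypothetical helper summary_char_lengths and a placeholder result that only mimics the pipeline's output shape:

```python
from typing import Dict, List


def summary_char_lengths(result: List[Dict[str, str]]) -> List[int]:
    """Return the character length of each 'summary_text' in a pipeline result."""
    return [len(item["summary_text"]) for item in result]


# Placeholder in the same shape as the pipeline's output (not real model output).
fake_result = [{"summary_text": "A short placeholder summary."}]
print(summary_char_lengths(fake_result))
```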


In [14]:

# Split the input text into chunks of chunk_size characters.
# With chunk_size = 4 this yields one model call per four characters.
chunk_size = 4
chunks = [input_text[i:i+chunk_size] for i in range(0, len(input_text), chunk_size)]
    
# Generate summaries for each chunk and combine them
combined_summary = ""
for chunk in chunks:
    summary = summarizer(chunk, **params)
    combined_summary += summary[0]['summary_text'] + " "
print(combined_summary)
len(combined_summary)

Exception ignored in: <function tqdm.__del__ at 0x000002063BE19510>
Traceback (most recent call last):
  File "C:\Users\soulo\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\std.py", line 1145, in __del__
    self.close()
  File "C:\Users\soulo\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\notebook.py", line 276, in close
    def close(self):
KeyboardInterrupt: 


KeyboardInterrupt: 
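
With chunk_size = 4, every four characters of the input become a separate summarizer call, which is presumably why the run above ended with a KeyboardInterrupt. A quick way to see the cost before starting the loop is to count the chunks first. The helper count_chunks and the example length of 10,000 characters are assumptions for illustration only.

```python
import math


def count_chunks(text_length: int, chunk_size: int) -> int:
    """Number of slices produced by range(0, text_length, chunk_size)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(text_length / chunk_size)


# Illustrative length only; substitute len(input_text) in the notebook.
for size in (4, 1000, 3000):
    print(size, count_chunks(10_000, size))
```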

In [15]:

# Split the input text into chunks of a manageable length
chunk_size = 3000
chunks = [input_text[i:i+chunk_size] for i in range(0, len(input_text), chunk_size)]
    
# Generate summaries for each chunk and combine them
combined_summary = ""
for chunk in chunks:
    summary = summarizer(chunk, **params)
    combined_summary += summary[0]['summary_text'] + " "
print(combined_summary)
len(combined_summary)

In this paper, Wuthering Heights discusses the use of recommendation systems for recommending books to customers. He uses an example of an e commerce system that recommends books based upon past purchases and previous behavior. This paper also discusses how academic journals and other sources of information can be used to predict which articles will be most useful for students. In this paper, Wuthering Heights uses a combination of machine learning and natural language understanding to develop a recommendation system for academic papers In this chapter, we describe the problem of predicting whether a given paper will be useful for a particular audience. The goal of our model is to find out how many papers each author has written in one year. This information can be used to predict which papers will be most popular with the target audience. We then use a similarity-separation technique to find the most similar papers to each other. This paper presents a novel recommendation system that 

1581
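
The concatenated chunk summaries repeat openers such as "In this paper" and are longer than the single-pass summary. One common follow-up is to summarize the combined text a second time. The sketch below takes the summarizer as an argument so it can run without a model; summarize_in_two_passes and first_words_summarizer are hypothetical names, and the stand-in summarizer is only there to make the example executable.

```python
from typing import Callable, Dict, List

Summarizer = Callable[..., List[Dict[str, str]]]


def summarize_in_two_passes(chunks: List[str], summarizer: Summarizer, **params) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    partial = [summarizer(chunk, **params)[0]["summary_text"] for chunk in chunks]
    combined = " ".join(partial)
    if len(partial) <= 1:
        return combined  # nothing to merge
    return summarizer(combined, **params)[0]["summary_text"]


def first_words_summarizer(text: str, **params) -> List[Dict[str, str]]:
    """Stand-in that keeps the first three words; not a real model."""
    return [{"summary_text": " ".join(text.split()[:3])}]


print(summarize_in_two_passes(
    ["one two three four", "five six seven eight"],
    first_words_summarizer,
))
```

In the notebook, the call would presumably be summarize_in_two_passes(chunks, summarizer, **params); whether the second pass improves the result for this model is untested here.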