# Optimal Cost-Effective Document Summarization

This notebook is a step-by-step guide to a cost-effective approach to document summarization. The approach splits a document into sections, embeds each section as a vector, clusters the vectors to identify key topics, and then summarizes only the section closest to each cluster center using the OpenAI API. Because only a few representative sections are sent to the model, far fewer tokens are processed than when summarizing the full text.

## Step 1: Document Splitting

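Split the document into smaller sections or chunks. Simple string manipulation is enough for this: in this step, the `format_text` helper groups every five sentences into one chunk, and `extract_text_from_pdf` applies it to the text extracted from the PDF with PyPDF2. Splitting on newline characters or other delimiters, such as paragraph breaks, works as well.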
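
As an alternative to grouping sentences, the sketch below splits on blank lines and merges short paragraphs so that no chunk is tiny. The function name `split_into_paragraphs` and the `min_chars` threshold are illustrative assumptions, not part of the original notebook.

```python
import re

def split_into_paragraphs(text, min_chars=200):
    """Split text on blank lines, merging short paragraphs into the next one.

    `min_chars` is a hypothetical threshold chosen for illustration.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, buffer = [], ""
    for paragraph in paragraphs:
        # Accumulate paragraphs until the buffer is long enough to stand alone
        buffer = f"{buffer}\n\n{paragraph}" if buffer else paragraph
        if len(buffer) >= min_chars:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # Keep any trailing short paragraph as its own chunk
        chunks.append(buffer)
    return chunks

sample = "A short line.\n\nAnother short line.\n\n" + "x" * 50
paragraph_chunks = split_into_paragraphs(sample, min_chars=30)
print(len(paragraph_chunks))
```

This only helps when the extracted text actually contains blank lines between paragraphs; PDF extraction often does not preserve them.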


In [1]:
#pip install "openai<1.0" langchain pypdf PyPDF2 chromadb scikit-learn sentence-transformers numpy

In [2]:
import os
import openai
import numpy as np
import PyPDF2
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [3]:
#os.environ['OPENAI_API_KEY'] = ''

In [4]:
openai.api_key = os.getenv('OPENAI_API_KEY')

In [5]:
file_path = './docs/Houghton (2010) Does Max Webers Notion of Authority Still Hold in the Twenty-First Century.pdf'

In [6]:
def format_text(text, chunk_size=5):
    """Split the text into chunks of `chunk_size` sentences."""
    # Naive sentence split on periods; abbreviations such as "et al." also split here
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    # Group consecutive sentences into chunks, restoring the periods
    chunks = ['. '.join(sentences[i:i+chunk_size]) + '.'
              for i in range(0, len(sentences), chunk_size)]

    return chunks

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file and format it into chunks."""
    with open(pdf_path, 'rb') as file:
        # Initialize PDF reader
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Check if the PDF is encrypted
        if pdf_reader.is_encrypted:
            pdf_reader.decrypt('')
        
        # Extract text from each page
        text = ""
        for page in pdf_reader.pages:
            # Guard against pages with no extractable text and keep a page separator
            text += (page.extract_text() or "") + "\n"
    
    # Format the extracted text into chunks
    return format_text(text)

In [None]:
chunks = extract_text_from_pdf(file_path)

In [16]:
pages

[Document(page_content='Does Max Weber’s notion\nof authority still hold\nin the twenty-ﬁrst century?\nJeffery D. Houghton\nCollege of Business and Economics, West Virginia University,\nMorgantown, West Virginia, USA\nAbstract\nPurpose – The purpose of this brief commentary is to provide a brief overview of Max Weber’s life,\nwork, and contributions to management thought before addressing the question of whether his notionof authority still holds in the twenty-ﬁrst century.\nDesign/methodology/approach – The commentary begins with a brief biographical sketch followed\nby an examination of Weber’s conceptualization of authority, its inﬂuence on the ﬁeld of management andits relevancy in the twenty-ﬁrst century.\nFindings – Weber’s writings on charismatic authority have been and continue to be instrumental\nin shaping modern leadership theory, that the charismatic form of authority may be particularly applicableand effective in today’s chaotic and rapidly changing environments, and that 

In [12]:
page = pages[0]

In [14]:
page.page_content

'Does Max Weber’s notion\nof authority still hold\nin the twenty-ﬁrst century?\nJeffery D. Houghton\nCollege of Business and Economics, West Virginia University,\nMorgantown, West Virginia, USA\nAbstract\nPurpose – The purpose of this brief commentary is to provide a brief overview of Max Weber’s life,\nwork, and contributions to management thought before addressing the question of whether his notionof authority still holds in the twenty-ﬁrst century.\nDesign/methodology/approach – The commentary begins with a brief biographical sketch followed\nby an examination of Weber’s conceptualization of authority, its inﬂuence on the ﬁeld of management andits relevancy in the twenty-ﬁrst century.\nFindings – Weber’s writings on charismatic authority have been and continue to be instrumental\nin shaping modern leadership theory, that the charismatic form of authority may be particularly applicableand effective in today’s chaotic and rapidly changing environments, and that the empowered andself-m

## Step 2: Vectorization

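Convert each section or chunk into a vector representation using embeddings. The cells in this step use OpenAI embeddings through LangChain's `OpenAIEmbeddings`; the `SentenceTransformer` class imported earlier is a local alternative that avoids API calls.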
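
K-means in Step 3 compares vectors by Euclidean distance, while embedding similarity is commonly measured by cosine similarity. One optional preprocessing step, which is an assumption here and not part of the original notebook, is to scale every vector to unit length so that both measures rank neighbors the same way. A minimal sketch with a hypothetical helper `l2_normalize`:

```python
import math

def l2_normalize(vectors):
    """Scale each vector to unit Euclidean length; zero vectors are left unchanged."""
    normalized = []
    for vector in vectors:
        norm = math.sqrt(sum(x * x for x in vector))
        normalized.append([x / norm for x in vector] if norm > 0 else list(vector))
    return normalized

# Toy 2-D vectors stand in for real embeddings, which have hundreds of dimensions
unit_vectors = l2_normalize([[3.0, 4.0], [0.0, 0.0]])
print(unit_vectors)
```

For unit vectors, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, which is why the normalization makes the two measures agree on ordering.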


In [None]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
text = """The writings of Max Weber have had a profound and perhaps even unrivaled
influence on management thought and organizational theory over the past century
(Greenwood and Lawrence, 2005). Recently, however, some organizational theorists
have questioned the current relevancy of Weber’s theories in today’s late-modern
knowledge-based information age characterized by a very different set of economic,
social, and technological realities from the time in which Weber’s ideas were born
(Greenwood and Lawrence, 2005; Lounsbury and Carberry, 2005). A complete
examination of the enduring influence of the entire breadth of Weber’s writings in the
current context is well beyond the scope of this brief commentary. However, after a brief
biographical sketch and overview of Weber’s work, I will focus specifically on Weber’s
conceptualization of authority and its relevancy in the twenty-first century. In short, I will
suggest that Weber’s writings on authority are still material in modern organizations and
are still helping to shape the thinking of today’s management scholars.
Max Weber was born in 1864 in Erfurt, Germany, the oldest of eight children. Weber
studied law at the University of Heidelberg, but his educational experiences and
subsequent academic career would span a remarkably broad number of disciplines
including law, history, economics, philosophy, political science, and sociology. After
completing his doctoral dissertation and habilitation (the highest level of academic
qualification in certain European countries), Weber obtained his first university appointment in 1892 at the University of Freiburg. He would also hold professorships at
the University of Heidelberg, the University of Vienna, and the University of
Munich during an academic career marked by periods of intense writing productivity
but punctuated with bouts of neurosis resulting in periods of scholarly inactivity
and long leaves of absence from any teaching responsibilities. As Greenwood and
Lawrence (2005) have noted, it is doubtful that Weber could have followed such a career
path in today’s intense “publish or perish” academic culture. In 1904, after a five-year
period during which he published virtually nothing, Weber began publishing some of
his most influential essays. These essays were later collected to comprise his most
influential book, The Protestant Ethic and the Spirit of Capitalism, which included
Weber’s famous metaphor of the iron cage (Weber, 1930). Subsequently, Weber
presented his fully developed ideas on bureaucracy, political leadership, domination,
and authority in his three-volume masterpiece, Economy and Society (Weber, 1968),
which was first published posthumously in 1922. Weber had died of pneumonia in 1920
after contracting Spanish influenza. The influence of Weber’s writings on management
thought following his death was slowed somewhat by a lack of English translations of
his work. Although, The Protestant Ethic was first translated into English in 1930,
Weber’s essays on bureaucracy and authority were not widely available until the late
1940s. Yet despite the slow start, Weber’s influence on management and organizational
theory in the middle decades of the twentieth century was tremendous (Greenwood and
Lawrence, 2005). Interested readers may refer to Käsler (1979) for a more detailed
discussion of Weber’s life and work.
Without question, Max Weber’s ideas have had a broad and far-reaching influence on
the development of the fields of management and organizational theory and his writings
on authority certainly rank among his most influential. Weber forwarded three basic
types of authority: traditional, rational-legal, and charismatic (Weber, 1968). Traditional
authority, such as that of tribal chiefs, feudal lords, and monarchs, is based on customs
and traditions that are passed down from one generation to the next. Rational-legal
authority, in contrast, is founded upon laws, rules, and the power stemming from a
legitimate position or office. Weber felt that the bureaucracy was a primary example of
rational-legal authority. Finally, charismatic authority results from extraordinary
personal characteristics of a leader that have the capacity to inspire others.
Weber saw these three types of authority essentially as forces that would bring either
stability and order (traditional/legal rational) or change and disorder (charismatic) to
institutions and society (Conger, 1993). Thus, for Weber, traditional authority was
viewed as stable, impersonal, and nonrational; rational/legal authority was seen as
stable, impersonal, and rational; and charismatic authority was considered unstable,
personal, and nonrational (Blau, 1963; Conger, 1993). Weber painted a distinct contrast
between the act of following a personal yet transitory charismatic leader as opposed to
submitting to the more stable and impersonal traditional and rational/legal forms
of authority. He further saw both legal/rational and charismatic authority as forms of
rebellion against the stagnant status quo of traditional authority, the former through
principles and procedures based on consensus and rationality and the latter through an
emotional reaction to a heroic leader. Weber also suggested that charismatic authority is
inherently transitory and unstable, and is therefore most effective in times of crisis and
change, serving primarily to facilitate the transition from one order to another (Conger,
1993). Having accomplished this transition, charismatic authority is either “routinized” or simply fades away as charismatic leadership is replaced by the rules, tradition,
and institutionalized bureaucracy. Hence, the irony of charismatic authority is that it is
often replaced by the very forms of authority that it originally sought to overturn
(Conger, 1993).
Weber developed his theory of authority in the early twentieth century and in that
context his ideas make a great deal of sense. The industrial age was creating larger and
more complex bureaucratic organizations and Weber observed that traditional authority
structures were being replaced by legal/rational-based authority systems often aided by
larger-than-life charismatic leaders whose entrepreneurial vision and energy were being
transformed into the great corporations of the twentieth century. Weber, however, lived and
wrote in a simpler and arguably less dynamic context in which pace of change was slower
and less frenetic than today (Greenwood and Lawrence, 2005). Modern organizations are
increasingly characterized by new, decentralized network-based structures that are quite
different from the large, complex bureaucracies of Weber’s day. Today, knowledge-based
work and cutting-edge technologies are creating new organizational realities centered on
concepts such as telecommuting, empowerment, and self-managing teams that may
potentially undermine the traditional and legal/rational forms of authority. Given these new
organizational forms and practices, does Weber’s notion of authority still hold in
twenty-first century? In the remainder of this commentary, I will suggest that Weber’s
ideas on authority are just as relevant today as they were 100 years ago, but in different
ways and for different reasons. In short, I will propose that Weber’s writings on charismatic
authority have been and continue to be instrumental in shaping modern leadership theory,
that the charismatic form of authority may be particularly applicable and effective in
today’s chaotic and rapidly changing environments, and that the empowered and
self-managing organizational forms of the twenty-first century may simply represent a
different embodiment of Weber’s iron cage of legal/rational authority.
When Weber redefined the term “charisma” from its original ecclesiastical meaning
of “divinely bestowed power or talent” to mean “a special quality of an individual
capable of inspiring and influencing others,” he laid the foundation for the concept of
charismatic leadership. As Conger (1993, p. 277) points out, Weber “is essentially the
“father of the field” – responsible for the introduction of the concept as both a lay and
scientific term”. Thus, beginning in the 1970s and continuing to the present, Weber’s
writings on charismatic authority have served as the conceptual basis for the
development of theoretical models of charismatic leadership as well as for empirical
research on the subject (Conger, 1988, 1993). Moreover, Weber’s writings have done
more than serve as the seminal works for one of the most popular concepts in modern
leadership theory; they have also continued to move the field forward as various nuances
of charismatic leadership have been identified and explored. For instance, nearly two
decades ago following the initial development of the charismatic leadership theory,
Conger (1993) called for researchers to examine two previously neglected aspects of
Weber’s theory: the routinization of charismatic leadership and the role of context in
charismatic leadership. In response, researchers have carefully examined the extent to
which the emergence and maintenance of charismatic leadership depends on the
presence of a dynamic context and/or crisis situation (Bligh et al., 2004; de Hoogh et al.,
2005; Pillai, 1996; Shamir and Howell, 1999). Indeed, these concepts along with other
aspects of Weber’s theory of charismatic authority recently prompted a lively debate
among leadership scholars within the pages of Leadership Quarterly (Bass, 1999; Beyer, 1999; House, 1999; Shamir, 1999). The point is that Max Weber’s ideas are still
informing and inspiring research, debate, and theory building in one of the most popular
areas of organizational research today.
Leadership theorists have generally suggested that times of stress, turbulence, and
rapid change are more conducive to a charismatic leadership approach because the
transforming vision of a charismatic leader is more appealing in times of uncertainly
(Bryman, 1993; Conger, 1999). Weber (1968) himself focused specifically on times of crisis
as a primary facilitating environment for charismatic authority. It seems reasonable then
to suggest that charismatic leadership may have even greater applicability and relevance
in today’s turbulent and rapidly changing organizational environments than the concept
of charismatic authority did in Weber’s day.
Finally, as Barker (1993) suggested in his highly influential ethnographic study, new
participatory organizational forms and practices such as self-managing teams and
employee empowerment may not represent an escape from the iron cage of legal/rational
authority and bureaucratic control. Quite to the contrary, these practices may simply
represent a shift in the locus of control from managers in traditional bureaucratic
structures to the workers themselves (Barker, 1993). Workers become accountable to
their teammates and to themselves rather than to a relatively distant manager. Ironically
then, empowerment and self-management practices may serve to tighten Weber’s iron
cage of rational control as organizational members effective police themselves and their
co-workers more closely than would be possible in the strictest bureaucracy. Once again,
Weber’s notion of authority informs our modern understanding complex organizational
phenomena.
In conclusion, Hunt (1999, p. 129) has suggested that the development of charismatic
leadership was in part responsible for rejuvenating the study of leadership by creating
“a paradigm shift that has attracted numerous new scholars and moved the field as a
whole out of its doldrums”. If this is true, then the field of leadership was to a large extent
rejuvenated by the influence of Max Weber as leadership scholars looked to the ideas of
the past to create the leadership theory and practice of the present and future. Clearly,
Max Weber’s writings have been shaping the thinking of management scholars for more
than a century and his influence will likely continue into the foreseeable future."""

In [None]:
# The text has no blank lines, so reuse format_text (Step 1) rather than split("\n\n")
sections = [s.strip() for s in format_text(text) if s.strip()]

In [None]:
sections

In [None]:
# Embed every section separately: one vector per section, stacked into a 2-D array
doc_embeddings = np.array(embeddings.embed_documents(sections))

## Step 3: Clustering

Apply a clustering algorithm, such as K-means, to the generated embeddings. This will group the vectors (and hence the sections or chunks) that are semantically similar into clusters.
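
To choose the number of clusters, the Elbow method compares the within-cluster sum of squared distances (inertia) across several values of k and looks for the point where adding clusters stops helping much. The sketch below is a small pure-Python K-means written only to illustrate that quantity; the function name `kmeans_inertia` and the toy data are assumptions. With scikit-learn, the same value is available as `KMeans(...).fit(X).inertia_`.

```python
import math
import random

def kmeans_inertia(points, k, iterations=20, seed=0):
    """Run a basic Lloyd's K-means and return the final inertia."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    # Inertia: squared distance from each point to its closest centroid
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Two well-separated toy groups, so the curve should flatten after k = 2
toy_points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
inertias = {k: kmeans_inertia(toy_points, k) for k in (1, 2, 3)}
print(inertias)
```

On real embeddings the curve is rarely this clean, so the elbow is a guide rather than a rule.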


In [None]:
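from sklearn.cluster import KMeans

# 5 is an arbitrary starting value, capped at the number of embedded sections;
# you can determine the optimal number using methods like the Elbow method
num_clusters = min(5, len(doc_embeddings))
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(doc_embeddings)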


## Step 4: Extract Representative Vectors

For each cluster, identify the center point or the representative vector. This can be the centroid of the cluster in K-means. This representative vector captures the "average meaning" of that topic cluster.
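
A centroid is an average, so it usually does not coincide with any actual section. To get readable text, each centroid can be mapped back to the section whose vector lies nearest to it. The following is a minimal sketch with a hypothetical helper `closest_section_indices` and toy 2-D vectors standing in for real embeddings.

```python
import math

def closest_section_indices(vectors, centroids):
    """Return, for each centroid, the index of the nearest vector (Euclidean)."""
    return [
        min(range(len(vectors)), key=lambda i: math.dist(vectors[i], centroid))
        for centroid in centroids
    ]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
toy_centroids = [[0.9, 0.1], [10.2, 10.0]]
nearest = closest_section_indices(toy_vectors, toy_centroids)
print(nearest)
```

The returned indices can then be used to look up the corresponding text in the list of sections.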


In [None]:
centroids = kmeans.cluster_centers_

## Step 5: Generate Summary using OpenAI API

For each cluster, we pick the section whose embedding is closest to the centroid and use the OpenAI API to generate a concise summary of it. The per-cluster summaries are then joined to form the final summary.
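
K-means returns centroids in no particular order, so summaries joined in centroid order may not follow the flow of the document. The sketch below, which makes no API calls, sorts the selected sections into reading order and wraps each one in an explicit summarization instruction. The helper name `build_summary_prompts`, the prompt wording, and `max_words` are illustrative assumptions.

```python
def build_summary_prompts(sections, selected_indices, max_words=60):
    """Return one summarization prompt per selected section, in document order."""
    prompts = []
    # Deduplicate (two centroids may share a nearest section) and restore reading order
    for index in sorted(set(selected_indices)):
        passage = " ".join(sections[index].split())  # collapse hard line breaks
        prompts.append(
            f"Summarize the following passage in at most {max_words} words.\n\n"
            f"Passage: {passage}\n\nSummary:"
        )
    return prompts

toy_sections = ["first\nsection", "second section", "third section"]
prompts = build_summary_prompts(toy_sections, [2, 0, 2])
print(len(prompts))
```

Each prompt can then be passed to whichever completion call the notebook uses.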


In [None]:
pip install numpy

In [None]:
import numpy as np


In [None]:
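import openai

def generate_summary(text):
    """Summarize `text` with the legacy Completions endpoint (requires openai<1.0)."""
    response = openai.Completion.create(
      model="davinci-002",
      # A base model only continues text, so the instruction must be in the prompt
      prompt=f"Summarize the following text in a few sentences.\n\n{text}\n\nSummary:",
      max_tokens=150  # Adjust based on your needs
    )
    return response.choices[0].text.strip()

# Use the embedding vectors (not the OpenAIEmbeddings object) for distance computations
section_vectors = np.asarray(doc_embeddings)

summary_sections = []
for centroid in centroids:
    # Pick the section whose embedding is nearest (Euclidean) to this centroid
    closest_section = sections[np.argmin(np.linalg.norm(section_vectors - centroid, axis=1))]
    summarized_text = generate_summary(closest_section)
    summary_sections.append(summarized_text)

summary = "\n\n".join(summary_sections)
print(summary)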
