# Extracting text with pdfplumber

In [2]:
import pdfplumber
import re

# URL to file: https://libcom.org/files/A%20Thousand%20Plateaus.pdf
path = 'AThousandPlateaus.pdf'

## Extract single pages

In [3]:
def extract_text(path, page_no):
    '''Returns the content of one page as a string.'''
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_no-1]
        content = page.extract_text()
    return content

## Clean page

In [4]:
def clean_page(txt):
    '''Removes unwanted characters from a string.'''
    
    # Remove hyphen and merge word
    txt = txt.replace('-\n', '')
    
    # Convert string to list without removing \n
    txt = [line+'\n' for line in txt.split('\n')]
    
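    # Remove lines which seem to be headlines etc.:
    # a line is dropped if it and the previous line are both short.
    # The loop starts at index 1, so the first line of a page is always dropped.
    # (May remove the page number at the bottom as well.)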
    txt_ = []
    
    for i in range(1, len(txt)):
        len_line = len(txt[i])
        len_prev_line = len(txt[i-1])
        
        if len_line < 55 and len_prev_line < 55:
            pass
        
        else:
            txt_.append(txt[i])
    txt = txt_
    
    
    # Convert list to string
    txt = ' '.join(txt)
    
    # Remove footnote markers
    # Look like: 'end of sentence.2 Next sentence'
    txt = re.sub(r'\.+?[0-9\-]', '.', txt)

    
    # Remove page numbers at the bottom
    # Look like: '75 \n'
    txt = re.sub(r'[0-9\-]+ \n', '', txt)
    
    # Remove newline characters except for (guessed) the end of a paragraph
    # First remove all
    txt = txt.split('\n')
    
    # Then add if a line is shorter than 60 characters
    txt_ = []
    for line in txt:
        if len(line) < 60:
            txt_.append(line + '\n')
        else:
            txt_.append(line)
    txt = ''.join(txt_)

    
    # Remove multiple spaces
    txt = re.sub(' +', ' ', txt)
    
    return txt

### 24

In [5]:
content = extract_text(path, 24)
print(content)

 
 
SYLVANO BUSSOTI
The two of us wrote Anti-Oedipus together. Since each of us was several, 
there was already quite a crowd. Here we have made use of everything that 
came within range, what was closest as well as farthest away. We have 
assigned clever pseudonyms to prevent recognition. Why have we kept our 
own names? Out of habit, purely out of habit. To make ourselves unrecog-
nizable in turn. To render imperceptible, not ourselves, but what makes us 
act, feel, and think. Also because it's nice to talk like everybody else, to say 
the sun rises, when everybody knows it's only a manner of speaking. To 
reach, not the point where one no longer says I, but the point where it is no 
longer of any importance whether one says I. We are no longer ourselves. 
Each will know his own. We have been aided, inspired, multiplied. 
A book has neither object nor subject; it is made of variously formed 
matters, and very different dates and speeds. To attribute the book to a 
subject is to overl

<br>
The first 3 lines could be removed.<br>
The last line contains only a number, so it could be removed as well.

In [6]:
print(clean_page(content))

The two of us wrote Anti-Oedipus together. Since each of us was several, there was already quite a crowd. Here we have made use of everything that came within range, what was closest as well as farthest away. We have assigned clever pseudonyms to prevent recognition. Why have we kept our own names? Out of habit, purely out of habit. To make ourselves unrecognizable in turn. To render imperceptible, not ourselves, but what makes us act, feel, and think. Also because it's nice to talk like everybody else, to say the sun rises, when everybody knows it's only a manner of speaking. To reach, not the point where one no longer says I, but the point where it is no longer of any importance whether one says I. We are no longer ourselves. Each will know his own. We have been aided, inspired, multiplied. A book has neither object nor subject; it is made of variously formed matters, and very different dates and speeds. To attribute the book to a subject is to overlook this working of matters, and t

### 26

In [7]:
content = extract_text(path, 26)
print(content)

I
NTRODUCTION: RHIZOME □ 5 
signifying. It has to do with surveying, mapping, even realms that are yet to 
come. 
A first type of book is the root-book. The tree is already the image of the 
world, or the root the image of the world-tree. This is the classical book, as 
noble, signifying, and subjective organic interiority (the strata of the book). 
The book imitates the world, as art imitates nature: by procedures specific 
to it that accomplish what nature cannot or can no longer do. The law of the 
book is the law of reflection, the One that becomes two. How could the law 
of the book reside in nature, when it is what presides over the very division 
between world and book, nature and art? One becomes two: whenever we 
encounter this formula, even stated strategically by Mao or understood in 
the most "dialectical" way possible, what we have before us is the most clas-
sical and well reflected, oldest, and weariest kind of thought. Nature 
doesn't work that way: in nature, roots are

<br>
First 2 lines should be removed.

In [8]:
print(clean_page(content))

signifying. It has to do with surveying, mapping, even realms that are yet to come. 
 A first type of book is the root-book. The tree is already the image of the world, or the root the image of the world-tree. This is the classical book, as noble, signifying, and subjective organic interiority (the strata of the book). The book imitates the world, as art imitates nature: by procedures specific to it that accomplish what nature cannot or can no longer do. The law of the book is the law of reflection, the One that becomes two. How could the law of the book reside in nature, when it is what presides over the very division between world and book, nature and art? One becomes two: whenever we encounter this formula, even stated strategically by Mao or understood in the most "dialectical" way possible, what we have before us is the most classical and well reflected, oldest, and weariest kind of thought. Nature doesn't work that way: in nature, roots are taproots with a more multiple, lateral,

### 28

In [9]:
content = extract_text(path, 28)
print(content)

0 
INTRODUCTION: RHIZOME □ 7 
tions of shelter, supply, movement, evasion, and breakout. The rhizome 
itself assumes very diverse forms, from ramified surface extension in all 
directions to concretion into bulbs and tubers. When rats swarm over each 
other. The rhizome includes the best and the worst: potato and couchgrass, 
or the weed. Animal and plant, couchgrass is crabgrass. We get the distinct 
feeling that we will convince no one unless we enumerate certain approxi-
mate characteristics of the rhizome. 
1 and 2. Principles of connection and heterogeneity: any point of a rhi-
zome can be connected to anything other, and must be. This is very differ-
ent from the tree or root, which plots a point, fixes an order. The linguistic 
tree on the Chomsky model still begins at a point S and proceeds by dichot-
omy. On the contrary, not every trait in a rhizome is necessarily linked to a 
linguistic feature: semiotic chains of every nature are connected to very 
diverse modes of coding (

<br>
Footnote (last line): Remove numbers following a dot.

In [10]:
print(clean_page(content))

tions of shelter, supply, movement, evasion, and breakout. The rhizome itself assumes very diverse forms, from ramified surface extension in all directions to concretion into bulbs and tubers. When rats swarm over each other. The rhizome includes the best and the worst: potato and couchgrass, or the weed. Animal and plant, couchgrass is crabgrass. We get the distinct feeling that we will convince no one unless we enumerate certain approximate characteristics of the rhizome. 1 and 2. Principles of connection and heterogeneity: any point of a rhizome can be connected to anything other, and must be. This is very different from the tree or root, which plots a point, fixes an order. The linguistic tree on the Chomsky model still begins at a point S and proceeds by dichotomy. On the contrary, not every trait in a rhizome is necessarily linked to a linguistic feature: semiotic chains of every nature are connected to very diverse modes of coding (biological, political, economic, etc.) that bri

Inserting a newline character between `rhizome.` and `1 and 2.` does not work here: removing the hyphen in `approxi-` merges the short line `mate characteristics of the rhizome.` into the previous line, so the combined line is longer than 60 characters and is no longer recognized as the end of a paragraph.
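
One possible way around this, sketched below, is to decide on paragraph breaks line by line before the divided words are merged. The function name `join_lines` and the 60-character threshold are illustrative assumptions, not part of `clean_page`, and the sketch still assumes that a short line marks the end of a paragraph.

```python
def join_lines(raw_lines, short=60):
    """Join the raw lines of a page into paragraphs.

    Assumption: a line shorter than `short` characters that does not end
    in a hyphen is the last line of a paragraph.
    """
    parts = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith('-'):
            # Word divided at the line end: drop the hyphen, add no space.
            parts.append(line[:-1])
        elif len(line) < short:
            # Short line: assume the paragraph ends here.
            parts.append(line + '\n')
        else:
            parts.append(line + ' ')
    return ''.join(parts)


page_lines = [
    'feeling that we will convince no one unless we enumerate certain approxi-',
    'mate characteristics of the rhizome. ',
    '1 and 2. Principles of connection and heterogeneity: any point of a rhi-',
    'zome can be connected to anything other, and must be. This is very differ-',
]
result = join_lines(page_lines)
print(result)
```

Like the original `replace('-\n', '')`, this also removes hyphens from genuine compounds that happen to break at a line end.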

### 47

In [11]:
content = extract_text(path, 47)
print(content)

 
2. 1914: One or Several Wolves?
 
 
Field of Tracks, or Wolf Line 
That day, the Wolf-Man rose from the couch particularly tired. He knew 
that Freud had a genius for brushing up against the truth and passing it by, 
then filling the void with associations. He knew that Freud knew nothing 
about wolves, or anuses for that matter. The only thing Freud understood 
was what a dog is, and a dog's tail. It wasn't enough. It wouldn't be enough. 
The Wolf-Man knew that Freud would soon declare him cured, but that it 
was not at all the case and his treatment would continue for all eternity 
under Brunswick, Lacan, Leclaire. Finally, he knew that he was in the pro-
cess of acquiring a veritable proper name, the Wolf-Man, a name more 
properly his than his own, since it attained the highest degree of singularity 
26 


In [12]:
print(clean_page(content))

That day, the Wolf-Man rose from the couch particularly tired. He knew that Freud had a genius for brushing up against the truth and passing it by, then filling the void with associations. He knew that Freud knew nothing about wolves, or anuses for that matter. The only thing Freud understood was what a dog is, and a dog's tail. It wasn't enough. It wouldn't be enough. The Wolf-Man knew that Freud would soon declare him cured, but that it was not at all the case and his treatment would continue for all eternity under Brunswick, Lacan, Leclaire. Finally, he knew that he was in the process of acquiring a veritable proper name, the Wolf-Man, a name more properly his than his own, since it attained the highest degree of singularity 



### 60

In [13]:
content = extract_text(path, 60)
print(content)

3. 10,000 B.C: The Geology of Morals   
(Who Does the Earth Think It Is?)
 
 
Double Articulation 
39


In [14]:
len(content)

101

<br>
Option: Remove the complete page if it contains fewer than x characters.<br>
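
A minimal sketch of this option follows. The helper name `is_sparse_page` and the default of 200 characters are assumptions chosen for illustration; the threshold x would need to be tuned against the book.

```python
def is_sparse_page(content, min_chars=200):
    """Return True if a page holds too little text to be worth keeping."""
    # Treat a missing result like an empty string (defensive assumption).
    return len((content or '').strip()) < min_chars


title_page = (
    '3. 10,000 B.C: The Geology of Morals   \n'
    '(Who Does the Earth Think It Is?)\n \n \n'
    'Double Articulation \n39\n'
)
print(is_sparse_page(title_page))
```

In the page loop, such a check could be used to skip a page before `clean_page` is called.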

In [16]:
# Every line is shorter than 55 characters, so the headline filter removes the whole page.
print(clean_page(content))





### 96

In [17]:
content = extract_text(path, 96)
print(content)

 
4. November 20, 1923—Postulates of 
Linguistics
 
 
The Order-word Assemblage 
I. "Language Is Informational and Communicationai" 
When the schoolmistress instructs her students on a rule of grammar or 
arithmetic, she is not informing them, any more than she is informing her-
self when she questions a student. She does not so much instruct as 
"insign," give orders or commands. A teacher's commands are not external 
or additional to what he or she teaches us. They do not flow from primary 
significations or result from information: an order always and already con-
cerns prior orders, which is why ordering is redundancy. The compulsory 
education machine does not communicate information; it imposes upon 
the child semiotic coordinates possessing all of the dual foundations of 
75 


In [18]:
print(clean_page(content))

When the schoolmistress instructs her students on a rule of grammar or arithmetic, she is not informing them, any more than she is informing herself when she questions a student. She does not so much instruct as "insign," give orders or commands. A teacher's commands are not external or additional to what he or she teaches us. They do not flow from primary significations or result from information: an order always and already concerns prior orders, which is why ordering is redundancy. The compulsory education machine does not communicate information; it imposes upon the child semiotic coordinates possessing all of the dual foundations of 



## Loop through pages 24 to 535

In [190]:
import pyprind

In [191]:
txt = '' # empty variable for the extracted text

for i in pyprind.prog_percent(range(24, 536)):
# for i in range(24, 27):
    
    # Extract content
    content = extract_text(path, i)
    # Clean content
    content = clean_page(content)
    # Add content
    txt += content
    


[100 %] Time elapsed: 00:08:36 | ETA: 00:00:00
Total time elapsed: 00:08:36
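
The loop above reopens the PDF for every page, because `extract_text` calls `pdfplumber.open` each time. A possible alternative, sketched here and not timed, opens the file once and walks over a slice of `pdf.pages`. The helper name `extract_pages` is hypothetical; it accepts any object that exposes a `pages` list whose items offer `extract_text()`, which is how an opened pdfplumber document behaves.

```python
def extract_pages(pdf, first, last):
    """Yield the text of pages first..last (1-based, inclusive)."""
    for page in pdf.pages[first - 1:last]:
        # Fall back to an empty string if a page yields no text.
        yield page.extract_text() or ''


# Usage sketch (commented out because it needs the PDF file):
# with pdfplumber.open(path) as pdf:
#     parts = [clean_page(c) for c in extract_pages(pdf, 24, 535)]
# txt = ''.join(parts)
```

Collecting the pages in a list and joining them once also avoids growing one large string step by step.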


In [192]:
# Save file
with open('dataset.txt', 'w', encoding='utf-8') as f:
    f.write(txt)

## Notes on `clean_page`

#### Remove footnotes

In [120]:
s = 'Some 35 63 7 ds like a patch of oil.2 It is always possible to break a language' 
res = re.sub(r'\.+?[0-9\-]', '.', s)
print(res)

Some 35 63 7 ds like a patch of oil. It is always possible to break a language
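
The pattern used here matches any dot followed by a digit or hyphen, so it would also turn a decimal such as `3.14` into `3.4`. A narrower variant is sketched below. It assumes that a footnote marker is one to three digits attached to sentence-ending punctuation which itself follows a letter or closing bracket; the names `FOOTNOTE` and `strip_footnote_markers` are illustrative.

```python
import re

# Assumption: footnote markers look like 'oil.2 ' (letter, punctuation,
# 1-3 digits, then whitespace or end of text).
FOOTNOTE = re.compile(r'(?<=[A-Za-z)\]])([.!?])\d{1,3}(?=\s|$)')


def strip_footnote_markers(text):
    """Remove footnote digits that directly follow sentence punctuation."""
    return FOOTNOTE.sub(r'\1', text)


print(strip_footnote_markers('like a patch of oil.2 It is always possible'))
print(strip_footnote_markers('a ratio of 3.14 or a range of 0.5-1.5'))
```

Whether this covers every footnote style in the book has not been checked.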


#### Remove page numbers

In [129]:
l = ['education machine does not communicate information; it imposes upon \n', 'the child semiotic coordinates possessing all of the dual foundations of \n', '75 \n', 'some more']
print(l)

['education machine does not communicate information; it imposes upon \n', 'the child semiotic coordinates possessing all of the dual foundations of \n', '75 \n', 'some more']


In [131]:
for line in l:
    print(re.sub(r'[0-9\-]+ \n', 'REMOVED', line))

education machine does not communicate information; it imposes upon 

the child semiotic coordinates possessing all of the dual foundations of 

REMOVED
some more
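
Because the pattern is not anchored to the start of a line, it also matches a number that merely ends a line of running text, for example `in 1914 \n`. The sketch below anchors the match so that only lines consisting of digits are removed. `PAGE_NUMBER_LINE` and `drop_page_number_lines` are hypothetical names, and the sketch assumes it runs while the original line breaks are still present.

```python
import re

# A line holding nothing but digits, with optional surrounding spaces.
PAGE_NUMBER_LINE = re.compile(r'^[ \t]*\d+[ \t]*$\n?', re.MULTILINE)


def drop_page_number_lines(text):
    """Remove lines that contain only a number."""
    return PAGE_NUMBER_LINE.sub('', text)


sample = 'the dual foundations of \n75 \nsome more in 1914 \n'
print(drop_page_number_lines(sample))
```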


#### Paragraph

In [164]:
txt = '0 \nINTRODUCTION: RHIZOME □ 7 \ntions of shelter, supply, movement, evasion, and breakout. The rhizome \nitself assumes very diverse forms, from ramified surface extension in all \ndirections to concretion into bulbs and tubers. When rats swarm over each \nother. The rhizome includes the best and the worst: potato and couchgrass, \nor the weed. Animal and plant, couchgrass is crabgrass. We get the distinct \nfeeling that we will convince no one unless we enumerate certain approxi-\nmate characteristics of the rhizome. \n1 and 2. Principles of connection and heterogeneity: any point of a rhi-\nzome can be connected to anything other, and must be. This is very differ-\nent from the tree or root, which plots a point, fixes an order. The linguistic \n'
print(txt)
print('\n\n')
txt = txt.split('\n')
txt_ = []
for line in txt:
    # Note: the threshold tried here is 65, while clean_page uses 60
    if len(line) < 65:
        txt_.append(line + '\n')
    else:
        txt_.append(line)
txt = ''.join(txt_)
print(txt)

0 
INTRODUCTION: RHIZOME □ 7 
tions of shelter, supply, movement, evasion, and breakout. The rhizome 
itself assumes very diverse forms, from ramified surface extension in all 
directions to concretion into bulbs and tubers. When rats swarm over each 
other. The rhizome includes the best and the worst: potato and couchgrass, 
or the weed. Animal and plant, couchgrass is crabgrass. We get the distinct 
feeling that we will convince no one unless we enumerate certain approxi-
mate characteristics of the rhizome. 
1 and 2. Principles of connection and heterogeneity: any point of a rhi-
zome can be connected to anything other, and must be. This is very differ-
ent from the tree or root, which plots a point, fixes an order. The linguistic 




0 
INTRODUCTION: RHIZOME □ 7 
tions of shelter, supply, movement, evasion, and breakout. The rhizome itself assumes very diverse forms, from ramified surface extension in all directions to concretion into bulbs and tubers. When rats swarm over each ot