# Data Cleaning and Parsing

Data cleaning and parsing are crucial parts of working with text. This section covers some basic methods for cleaning text and parsing it into data.
Our goal is to start with a long string or a .txt file, tokenize it, and end up with a list of words.

## String as data

A string (`str`) is Python's data type for a sequence of characters.

In [1]:
text = '''
Rose is a rose is a rose is a rose
'''

print(text)


Rose is a rose is a rose is a rose
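
Because a string is a sequence of characters, it supports the usual sequence operations such as `len`, indexing, slicing, and iteration. The following is a minimal sketch; it uses a fresh one-line `text` without the surrounding newlines of the cell above, and the variable names are only illustrative.

```python
text = 'Rose is a rose is a rose is a rose'

# number of characters, including spaces
n_chars = len(text)

# indexing returns a single character; slicing returns a substring
first_char = text[0]
first_word = text[:4]

# iterating over a string yields one character at a time
n_spaces = sum(1 for ch in text if ch == ' ')

print(n_chars, first_char, first_word, n_spaces)
```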



## .txt file as data

As we learned previously, it is also possible to read a .txt file and turn its contents into a string for further use.

In [2]:
with open('data/rose.txt', 'r') as f:
    text = f.read()
    
print(text)

Rose is a rose is a rose is a rose
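
When a file contains non-ASCII characters such as German umlauts, the encoding that `open` uses by default can depend on the platform, so it may help to pass `encoding` explicitly. The sketch below assumes the file is UTF-8 encoded; it writes a small temporary sample file (the name `rose_sample.txt` is hypothetical) so that it runs on its own.

```python
import tempfile
from pathlib import Path

# create a small sample file so the sketch runs on its own;
# in the notebook you would point to 'data/rose.txt' instead
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / 'rose_sample.txt'  # hypothetical file name
sample_path.write_text('Röse is a rose is a rose is a rose\n', encoding='utf-8')

# read the whole file into one string, decoding it as UTF-8
with open(sample_path, 'r', encoding='utf-8') as f:
    sample_text = f.read()

# strip() removes leading and trailing whitespace, including the final newline
sample_text = sample_text.strip()
print(sample_text)
```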


## .pdf file as data

Reading a PDF as text requires an external library.

We are using pdfplumber:
https://pypi.org/project/pdfplumber/

In [3]:
!pip install pdfplumber



In [4]:
# importing required modules
import pdfplumber
 
with pdfplumber.open("data/vorlesungsverzeichnis.pdf") as pdf:
    print(len(pdf.pages))
    
    first_page = pdf.pages[0]
    print(first_page.extract_text())


81
Lehrveranstaltungen im  
Sommersemester 2022


In [5]:
# collect the text of every page

full_text = []

with pdfplumber.open("data/vorlesungsverzeichnis.pdf") as pdf:
    for page in pdf.pages:
        # fall back to an empty string in case a page yields no text
        txt = page.extract_text() or ''
        full_text.append(txt)

In [6]:
# transform full_text from list to string with space between each page
full_text = ' '.join(full_text)
print(full_text[:500])

Lehrveranstaltungen im  
Sommersemester 2022 Grundlagenseminar
Prof.PeterFriedrichStephan,JacquelineHen
Beginnlosigkeit
GrundlagenseminarMultimedialeGestaltung
Kompaktseminar
Semester SoSe22
Zielgruppe Grundstudium
Ort&Termine
Mo,30.05.2022-Fr,03.06.2022
Filzengraben8-10,exMediaLab4.03
WirerkundengrundlegendePhänomenederanalogenunddigitalenGestaltung.DazugehörenFarbe
undLicht,FormundRaum,RhythmusundStruktur,DynamikundInteraktion.
Diese Bereiche werden aus den Perspektiven von Kunst, Wissenschaft


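## Data Cleaning

The cell below normalizes a string in a few steps: it replaces newlines and German umlauts, lowercases the text, and then removes digits and punctuation.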


In [7]:
import string

text = 'Röse is a Rose, is 1 rose!?'

# replace newlines with spaces and transliterate German umlauts
text = (text.replace("\n", " ")
            .replace("ä", "ae").replace("Ä", "Ae")
            .replace("ö", "oe").replace("Ö", "Oe")
            .replace("ü", "ue").replace("Ü", "Ue"))

text = text.lower()

# remove digits
remove_digits = str.maketrans('', '', string.digits)
text = text.translate(remove_digits)

# remove punctuation such as , ! ?
text = text.translate(str.maketrans('', '', string.punctuation))


print(text)

roese is a rose is  rose
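
The output above still contains a double space where the digit was removed. One way to handle this is to wrap the cleaning steps in a reusable function and collapse whitespace at the end. This is a minimal sketch: the names `UMLAUTS` and `clean_text` are made up for illustration, and `ß` is deliberately left untouched.

```python
import string

# umlaut transliterations (ß is not handled in this sketch)
UMLAUTS = {'ä': 'ae', 'Ä': 'Ae', 'ö': 'oe', 'Ö': 'Oe', 'ü': 'ue', 'Ü': 'Ue'}

def clean_text(text):
    """Transliterate umlauts, lowercase, drop digits and punctuation."""
    for umlaut, replacement in UMLAUTS.items():
        text = text.replace(umlaut, replacement)
    text = text.lower()
    # one translation table that deletes digits and ASCII punctuation
    drop = str.maketrans('', '', string.digits + string.punctuation)
    text = text.translate(drop)
    # split() without arguments splits on any run of whitespace, so joining
    # the pieces with one space collapses newlines and double spaces
    return ' '.join(text.split())

cleaned = clean_text('Röse is a Rose, is 1 rose!?')
print(cleaned)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes in real documents would need to be handled separately.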


# Merge and split text

## Splitting text

In [8]:
text = 'rose is  , a rose is a rose'

# split by comma
token = text.split(',')
print(token)

# split by space; consecutive spaces produce empty strings
token = text.split(' ')
print(token)

# drop the empty strings
corpus = [x for x in token if x]
print(corpus)

['rose is  ', ' a rose is a rose']
['rose', 'is', '', ',', 'a', 'rose', 'is', 'a', 'rose']
['rose', 'is', ',', 'a', 'rose', 'is', 'a', 'rose']
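
Calling `split()` with no argument splits on any run of whitespace and never returns empty strings, which makes the extra filtering step unnecessary. The sketch below also shows one possible way to drop the leftover `','` token by stripping punctuation from each token; the variable names `tokens` and `words` are only illustrative.

```python
import string

text = 'rose is  , a rose is a rose'

# no argument: split on runs of whitespace, never yielding empty strings
tokens = text.split()
print(tokens)

# strip punctuation from both ends of each token,
# then drop tokens that end up empty (here: the lone comma)
words = [t.strip(string.punctuation) for t in tokens]
words = [w for w in words if w]
print(words)
```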


## Merging text

`join` is the string method that combines a `list` of words into a single `string`. It is called on the separator, with the list as its argument.

In [9]:
words = ['rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

# join with no separator
print(''.join(words))

# join with comma and space
print(', '.join(words))

# join with space
print(' '.join(words))


roseisaroseisarose
rose, is, a, rose, is, a, rose
rose is a rose is a rose
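
Splitting and joining are inverse operations, so together they cover the path from a string to a list of words and back. One pitfall is that `join` only accepts strings, so other values have to be converted first. A minimal sketch, with illustrative variable names:

```python
text = 'Rose is a rose is a rose is a rose'

# string -> list of lowercase words
words = text.lower().split()

# list of words -> string again
rejoined = ' '.join(words)
print(words)
print(rejoined)

# join raises a TypeError for non-string items, so convert them first
numbers = [1, 2, 3]
joined_numbers = ', '.join(str(n) for n in numbers)
print(joined_numbers)
```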
