In [None]:
# Before we begin, run this cell if you are using Colab
!git clone https://github.com/danielinux7/StemLab.git

# Shell Scripting

#### Content
1. Shell commands (shuf, wc, grep, sed, sort, uniq, cat, head, tail)
2. Regular Expressions (regex)

#### What you will be able to do after the tutorial
* Extract text from pdf (data dump).
* Text clean up, remove undesired output.
* Learn about regex.
* Fix common issues and split text into sentences.
* Build a parallel corpus.


#### **Extract text from pdf (data dump)**

In [None]:
!pip install PyPDF2

In [None]:
# importing required modules
import PyPDF2

# source PDFs and the text files to extract them into
pdf_dir = '/content/StemLab/4-Shell-Scripting/'
jobs = [
    (pdf_dir + 'last-of-the-departed_ ab.pdf', 'ab.txt'),
    (pdf_dir + 'last-of-the-departed_ ru.pdf', 'ru.txt'),
]

for pdf_path, txt_path in jobs:
    # open the pdf in binary mode and the output text file as UTF-8;
    # both files are closed automatically when the block ends
    with open(pdf_path, 'rb') as pdf_file, open(txt_path, 'w', encoding='utf-8') as out:
        # creating a pdf reader object
        # (PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0)
        reader = PyPDF2.PdfReader(pdf_file)
        # extracting text from pdf, one page at a time
        for page in reader.pages:
            print(page.extract_text(), file=out)



#### **Text clean up, remove undesired output**

The text extracted from the pdf files needs a lot of clean up. The first thing to do is to look at the txt files to figure out similar noise patterns.

Eventually we will use shell commands and scripting to accomplish our goal. Shell commands execute quickly, which is very important when working with big data.

**File stats:** understanding some details about the files that we are dealing with.

In [None]:
# Show the number of lines, words and bytes of our files (use wc -m to count characters instead of bytes)
!wc ab.txt ru.txt

In [None]:
# Show the number of lines of our files
!wc -l ab.txt
!wc -l ru.txt
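To make the three columns printed by `wc` concrete, here is a minimal Python sketch of the same counts. The function name `wc` and the sample strings are only illustrative. The third column counts bytes, so for Cyrillic text stored as UTF-8 it is larger than the number of characters.

```python
def wc(text):
    """Return (lines, words, bytes) for a string, mirroring the default wc columns."""
    lines = text.count("\n")            # wc counts newline characters
    words = len(text.split())           # whitespace-separated tokens
    size = len(text.encode("utf-8"))    # bytes, not characters
    return lines, words, size


# Hypothetical sample: two short lines, one Latin and one Cyrillic
sample = "a b\nя\n"
print(wc(sample))
print(len(sample))  # number of characters, for comparison with the byte count
```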

**Peek at the files:**
The files we are dealing with are usually big, so we can take a look at small parts with the commands head, tail or sed.

In [None]:
# Show first 20 lines with head in a text file
!head -20 ab.txt 

In [None]:
# Show the last 20 lines with tail in a text file
!tail -20 ru.txt

In [None]:
# Show the lines from 4 to 8 in a text file
!sed -n '4,8p' ab.txt
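For comparison, the same three ways of peeking at a file can be sketched in Python. The helper names `head`, `tail` and `line_range` are hypothetical, and the sample lines stand in for a real file. All three helpers work on any iterable of lines, including an open file object.

```python
from collections import deque
from itertools import islice


def head(lines, n):
    """First n lines, like `head -n`."""
    return list(islice(lines, n))


def tail(lines, n):
    """Last n lines, like `tail -n`; the deque keeps only n lines in memory."""
    return list(deque(lines, maxlen=n))


def line_range(lines, first, last):
    """Lines first..last (1-indexed, inclusive), like `sed -n 'first,lastp'`."""
    return list(islice(lines, first - 1, last))


# Hypothetical stand-in for a text file with ten lines
lines = [f"line {i}" for i in range(1, 11)]
print(head(lines, 3))
print(tail(lines, 2))
print(line_range(lines, 4, 8))
```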

**Remove extra lines:** we need to remove empty lines, lines with only spaces, and lines with symbols that won't be useful for our translation task.

In [None]:
# Remove empty lines from the text file, we use piping "|" to chain the inputs and outputs of our commands
!head -20 ab.txt | sed -r '/^$/d'

*Question:* why shouldn't we do this?
```!sed -r '/^$/d' ab.txt | head -20```



In [None]:
# Let's also remove lines that contain only spaces
!head -20 ab.txt | sed -r '/^$/d' | sed -r '/^[ ]+$/d'

In [None]:
# In our case for the machine translation task, we only care about the lines 
# that have alphabetical characters for Russian and Abkhazian.
!head -20 ab.txt | sed -n '/[[:alpha:]]/p'
print()
!head -20 ru.txt | sed -n '/[[:alpha:]]/p'
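The filter `sed -n '/[[:alpha:]]/p'` prints only the lines that contain at least one letter. A rough Python sketch of the same idea follows. The function name and the sample lines are made up for illustration, and `str.isalpha` follows Unicode rules, whereas what `[[:alpha:]]` matches in sed depends on the locale.

```python
def keep_lines_with_letters(lines):
    """Keep only the lines that contain at least one alphabetic character."""
    return [line for line in lines if any(ch.isalpha() for ch in line)]


# Hypothetical noisy lines, similar to what PDF extraction can produce
noisy = ["", "   ", "12", "* * *", "Аҧсны", "page 3"]
print(keep_lines_with_letters(noisy))
```

Empty lines, lines with only spaces, bare page numbers and symbol-only lines are all dropped by this single test, which is why the tutorial can replace the two earlier sed filters with it.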

In [None]:
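# We process the files with sed and save the results in the same files (-i edits in place), then we check out their stats.
# Note that the bracket expression here also keeps lines that contain a hyphen.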
!sed -i -n '/[[:alpha:]-]/p' ab.txt
!sed -i -n '/[[:alpha:]-]/p' ru.txt
!wc ab.txt
!wc ru.txt

#### **Learn about regex**

*   What is regex?
 
 A **Reg**ular **Ex**pression (regex) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

*   Where would we use regex?
 
 Regular expressions are used in search engines, in search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK, and in lexical analysis. Most general-purpose programming languages support regex capabilities either natively or via libraries, including for example Python, C, C++, Java, and JavaScript.

* What is regex syntax?

1.   [POSIX basic](https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended)
2.   [POSIX extended](https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended)
3.   [Character classes](https://en.wikipedia.org/wiki/Regular_expression#Character_classes)
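
The three uses mentioned above ("find", "find and replace", and input validation) can be tried out with Python's built-in `re` module. This is only a small sketch with a made-up sample string. Python's `re` syntax is close to POSIX extended, but it does not support POSIX bracket classes such as `[[:alpha:]]`, so explicit ranges are used here.

```python
import re

# Hypothetical sample text
text = "Page 12: the cat sat.  The   dog ran!"

# Find: the first run of digits
match = re.search(r"[0-9]+", text)
print(match.group())

# Find and replace: squeeze runs of spaces into a single space
squeezed = re.sub(r" +", " ", text)
print(squeezed)

# Input validation: the whole string must consist of digits
print(bool(re.fullmatch(r"[0-9]+", "2024")))
print(bool(re.fullmatch(r"[0-9]+", "20a4")))
```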






#### **Fix common issues and split text into sentences**

In [None]:
# Show all distinct printable characters (lowercased) in the Abkhazian text
!sed -r 's/(.)/\L\1/g' ab.txt | grep -o '[[:print:]]' | sort -u

In [None]:
# Show all distinct printable characters (lowercased) in the Russian text
!sed -r 's/(.)/\L\1/g' ru.txt | grep -o '[[:print:]]' | sort -u
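The character inventory above lowercases the text, emits one printable character per line, and keeps the unique ones in sorted order. A minimal Python sketch of the same idea is shown below. The function name is hypothetical, and Python's notion of "printable" is Unicode-based, so the result may differ slightly from grep's locale-dependent `[[:print:]]`.

```python
def distinct_printable(text):
    """Sorted list of the distinct printable characters in text, lowercased."""
    return sorted({ch for ch in text.lower() if ch.isprintable()})


# Hypothetical sample: the newline is not printable, the space is
print(distinct_printable("Ab\nBa "))
```

Scanning this list is a quick way to spot unexpected characters (odd dashes, quotation marks, stray symbols) that the clean-up steps need to handle.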

In [None]:
# Replace non-breaking spaces (UTF-8 bytes C2 A0) with ordinary spaces
# Collapse runs of spaces into one
# Remove page numbers stuck to a word at the beginning of a line
# Rejoin words hyphenated across a line break
# Rejoin sentences that continue on the next line
# Split into sentences, one per line
!head -150 ab.txt | sed -r 's/\xC2\xA0/ /g' | \
 sed -r 's/[ ]+/ /g' | \
 sed -r 's/^[0-9]+([[:alpha:]–])/\1/g' | \
 sed -z -r 's/(\w)\s?\n?\s?-\s?\n\s?(\w)/\1\2/g' | \
 sed -z -r 's/([^!\?\.\s])\s?\n/\1 /g' | \
 sed -r 's/([[:alpha:]][[:alpha:]][[:alpha:]][!\?\.]+)\s+/\1\n/g'
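
To see what each of the six sed steps does, here is a hedged Python re-creation of the pipeline applied to a tiny invented sample. It is a sketch for experimentation, not a replacement for the sed commands: `LETTER` is an approximation of sed's `[[:alpha:]]`, the function name `clean` is hypothetical, and the English sample text is made up.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter; stands in for sed's [[:alpha:]]


def clean(text):
    # 1. Replace non-breaking spaces with ordinary spaces
    text = text.replace("\u00a0", " ")
    # 2. Collapse runs of spaces into one
    text = re.sub(r" +", " ", text)
    # 3. Remove page numbers stuck to a word at the beginning of a line
    text = re.sub(rf"^[0-9]+({LETTER}|–)", r"\1", text, flags=re.MULTILINE)
    # 4. Rejoin words hyphenated across a line break
    text = re.sub(r"(\w)\s?\n?\s?-\s?\n\s?(\w)", r"\1\2", text)
    # 5. Rejoin sentences that continue on the next line
    text = re.sub(r"([^!?.\s])\s?\n", r"\1 ", text)
    # 6. Split into sentences: three letters plus end punctuation, then whitespace
    text = re.sub(rf"({LETTER}{{3}}[!?.]+)\s+", r"\1\n", text)
    return text


# Invented sample with a page number, a non-breaking space,
# a hyphenated word and a sentence broken across lines
raw = "12The old man\u00a0 was walk-\ning home. He was\ntired but happy! Yes.\n"
print(clean(raw))
```

Step 6 requires three letters before the punctuation, which presumably avoids splitting after short abbreviations and initials, at the cost of missing sentences that end in a one- or two-letter word.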

In [None]:
# Replace non-breaking spaces (UTF-8 bytes C2 A0) with ordinary spaces
# Collapse runs of spaces into one
# Remove page numbers stuck to a word at the beginning of a line
# Rejoin words hyphenated across a line break
# Rejoin sentences that continue on the next line
# Split into sentences, one per line
!head -150 ru.txt | sed -r 's/\xC2\xA0/ /g' | \
 sed -r 's/[ ]+/ /g' | \
 sed -r 's/^[0-9]+([[:alpha:]–])/\1/g' | \
 sed -z -r 's/(\w)\s?\n?\s?-\s?\n\s?(\w)/\1\2/g' | \
 sed -z -r 's/([^!\?\.\s])\s?\n/\1 /g' | \
 sed -r 's/([[:alpha:]][[:alpha:]][[:alpha:]][!\?\.]+)\s+/\1\n/g'

In [None]:
# Save the changes
!cat ab.txt | sed -r 's/\xC2\xA0/ /g' | \
 sed -r 's/[ ]+/ /g' | \
 sed -r 's/^[0-9]+([[:alpha:]–])/\1/g' | \
 sed -z -r 's/(\w)\s?\n?\s?-\s?\n\s?(\w)/\1\2/g' | \
 sed -z -r 's/([^!\?\.\s])\s?\n/\1 /g' | \
 sed -r 's/([[:alpha:]][[:alpha:]][[:alpha:]][!\?\.]+)\s+/\1\n/g' > ab2.txt

!cat ru.txt | sed -r 's/\xC2\xA0/ /g' | \
 sed -r 's/[ ]+/ /g' | \
 sed -r 's/^[0-9]+([[:alpha:]–])/\1/g' | \
 sed -z -r 's/(\w)\s?\n?\s?-\s?\n\s?(\w)/\1\2/g' | \
 sed -z -r 's/([^!\?\.\s])\s?\n/\1 /g' | \
 sed -r 's/([[:alpha:]][[:alpha:]][[:alpha:]][!\?\.]+)\s+/\1\n/g' > ru2.txt

!mv ab2.txt ab.txt
!mv ru2.txt ru.txt
!wc -l ab.txt
!wc -l ru.txt

#### **Build a parallel corpus**

For alignment we will use:

1.   [hunalign](https://github.com/danielvarga/hunalign)
2.   neural-bifixer TODO



In [None]:
# We will use hunalign for alignment.
!git clone https://github.com/danielvarga/hunalign.git
!cd hunalign/src/hunalign && make

In [None]:
!hunalign/src/hunalign/hunalign StemLab/4-Shell-Scripting/ru-ab.dic \
ab.txt ru.txt -realign -utf -text -bisent > ab-ru.tsv

In [None]:
!head -50 ab-ru.tsv
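
Once the alignment file exists, it can be loaded into Python for inspection or filtering. The sketch below assumes that each line of the output has three tab-separated fields (source sentence, target sentence, confidence score); this layout is an assumption about hunalign's text output and should be checked against the actual `ab-ru.tsv`. The function name `load_pairs`, the `min_score` threshold and the sample rows are hypothetical.

```python
import csv
import io


def load_pairs(handle, min_score=0.0):
    """Read (source, target, score) rows and keep complete pairs above min_score."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed lines
        source, target = row[0].strip(), row[1].strip()
        try:
            score = float(row[2])
        except ValueError:
            continue  # skip lines whose score is not a number
        if source and target and score >= min_score:
            pairs.append((source, target, score))
    return pairs


# Invented rows in the assumed layout; with the real file use
# open("ab-ru.tsv", encoding="utf-8", newline="") instead of io.StringIO
sample_tsv = (
    "source one\ttarget one\t0.85\n"
    "source two\t\t0.40\n"
    "source three\ttarget three\t-0.30\n"
)
pairs = load_pairs(io.StringIO(sample_tsv), min_score=0.0)
print(pairs)
```

Dropping rows with an empty side or a low score is one simple way to keep only the more reliable sentence pairs in the parallel corpus.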