# Python from ground zero workshop (SOLUTIONS)

Welcome to the Python from ground zero workshop. In this notebook, we'll go step by step to build a solid foundation in Python.

# Part 1 - Learning Colab! 

In this part we'll learn the Colab notebook interface.

Colab, or "Colaboratory", allows you to write and execute Python in your browser. We write code in snippets called "cells", execute them, and see their results just below.

Let's get a hang of it, shall we?

Type in the following box: `42` 
Then press the ▶️ button on the left

In [2]:
42

42

We entered an expression, the simplest expression there can be: a number. The Python interpreter evaluated it and gave us a result.

Since a number has nothing left to evaluate, Python just gives back whatever we gave it: `42`

*(Secret to programming: Programs do whatever we humans tell them to do)*

Now type: `42 + 90` (or any other mathematical expression)

In [3]:
42 + 90

132

Now the Python interpreter evaluated the mathematical expression and gave us its result.

Try entering some text now: `'I love Python!'`

*Hint: You can also use the shortcut Shift+Enter to execute a cell quickly!*

In [4]:
"I love Python!"

'I love Python!'

Now let's get fancy. Enter the command: `print("Hello World!")`

In [152]:
print("Hello World!")

Hello World!


You can write as many lines as you want. However, unless you explicitly call `print`, only the value of the last expression will show up in the result. Let's try it below.

In [153]:
42
"I love python!"
print("Hello World!")
"555"
print("Hola!")
"hey"

Hello World!
Hola!


'hey'

Now let's give ourselves a high five for writing our first script! 

# Part 2 - Variables

Variables are where we keep information in a program. Creating one is a way of asking the computer to set aside some memory for us to store things in.

To create a variable, all we need to do is give it a name and a value. It is as simple as writing an equation:

`mybrandnewvariable = 5`

Once it is defined like this, the variable name is our key for reaching the value.

The value of a variable is what's kept INSIDE the memory. You can think of the variable name as the key to a box and the value as what's inside the box.

A variable's name stays the same, but we can change its value whenever we want. If we modify it at some point, it keeps the updated value.

`mybrandnewvariable = 9`
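
As a minimal sketch of reassignment, the cell below reuses the example name from above. The name stays the same while the value stored under it changes; the final increment step is an extra illustration not taken from the text.

```python
# Create the variable with an initial value
mybrandnewvariable = 5
print(mybrandnewvariable)

# Assign a new value to the same name; the old value is replaced
mybrandnewvariable = 9
print(mybrandnewvariable)

# A variable can also be updated based on its own current value
mybrandnewvariable = mybrandnewvariable + 1
print(mybrandnewvariable)
```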

Let's do a print using a variable. 

In [154]:
my_string_variable = "hello world!"
print(my_string_variable)

hello world!


Here, we stored the text `hello world!` inside a variable called `my_string_variable`. 

Then, we printed it by referencing that variable. 

Note that we put the text within quotation marks `"..."` but used no quotation marks when we referenced the variable. 

Let's see the difference:

In [155]:
print("my_string_variable")

my_string_variable


You see, Python interprets whatever is inside quotation marks as text. Since we used quotation marks here, Python didn't look for a variable named `my_string_variable`. Instead, it just interpreted it as regular text. 

Although we are free to write text any way we want, there are some rules for naming variables:
- The variable name must begin with a letter or an underscore.
- The variable name can ONLY consist of letters, numbers and underscores (NO SPACES).
- Variable names ARE case sensitive (`my_variable` and `MY_Variable` would be two different variables).


In [156]:
this_is_a_good_variable_name = 5
ThisIsAVaribleName = "hello"

## Variable types

There are various types of variables. For example:

- number variable: `a = 5`
- text/string variable: `v = "some text"`
- lists: `l = [1,2,3,4]`
- boolean: `b = True`

We can see what type a variable has by using `type`. 

In [16]:
a = [1,2,3,4,5]
type(a)

list

Here `type` tells us that `a` is a `list`. For a text variable it would report `str`, which stands for string. That's what programmers call a sequence of characters, that is, any textual data. 

Also, have you noticed that earlier we referenced a variable that we defined in another cell? 

The Python session keeps running while we're working, so it remembers every variable we define. 

## Numbers

A whole number is stored as an `int`, short for integer. 

A decimal number is stored as a `float`, short for floating-point value.

Execute the following cell to see the different types of numbers:

In [17]:
num1 = 42
num2 = 5.111
print("Type of num1 is", type(num1))
print("Type of num2 is", type(num2))

Type of num1 is <class 'int'>
Type of num2 is <class 'float'>


In [19]:
product = num1 * num2
print(product)
print(type(product))

214.66199999999998
<class 'float'>
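
The product above prints as `214.66199999999998` rather than `214.662` because floats are stored in binary, and most decimal fractions cannot be represented exactly. The sketch below shows one way to tidy such a value for display with the built-in `round`, plus conversion between the two number types. The variables are redefined so the cell runs on its own.

```python
num1 = 42
num2 = 5.111

product = num1 * num2
print(product)             # the raw float, with a tiny representation error

# round() returns a new float rounded to the given number of decimals
print(round(product, 3))

# Converting between the two number types
print(float(num1))         # int -> float
print(int(num2))           # float -> int (drops the decimal part)
```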


# Challenge 1 - Variable arithmetics

Define two number variables with values `9` and `12` and then get their sum, difference, division and product using the following operators:

| Symbol | Task Performed |
|----|---|
| +  | Addition |
| -  | Subtraction |
| /  | Division |
| %  | Modulo (remainder) |
| *  | Multiplication |
| //  | Floor division |
| **  | Exponentiation (to the power of) |

In [20]:
a = 9
b = 12

# Named `total` so we don't overwrite Python's built-in `sum` function
total = a + b
print("sum", total)

sum 21


In [21]:
subtraction = a - b
print("sub", subtraction)

division = a / b
print("div", division)

multiplication = a*b
print("mul", multiplication)

sub -3
div 0.75
mul 108
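
The solution above does not use the remaining three operators from the table. Here is a small sketch of them with the same two values; the result variable names are made up for this example.

```python
a = 9
b = 12

# Modulo: the remainder after dividing b by a
remainder = b % a
print("mod", remainder)

# Floor division: divide and round down to a whole number
floor_div = b // a
print("floor div", floor_div)

# Exponentiation: a to the power of 2
power = a ** 2
print("pow", power)
```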


# Part 3 - String operations

Many times we need to manipulate the text we're working with. There are plenty of operations in Python we can use for our tasks.

## String concatenation 

You can concatenate (merge) two strings with the `+` operator. 

Try executing the following expression: `"Hello" + "World"`

In [22]:
"Hello" + "World"

'HelloWorld'

Oops, forgot a space there. Add it yourself in the cell below:

In [157]:
"Hello" + " " + "World"

'Hello World'

## Indexing

We said earlier that strings are a sequence of characters. Python allows us to get the character in the position we want using indices (`[x]`).

Try executing the code below: 

In [32]:
mystr = "localization"
mystr[5]

'i'

`mystr[5]` gave us the fifth character in the string `localization` which is `l`... wait a second

We didn't get `l` but instead `i`. Why?

Well, it's because the indices in Python (and many other programming languages) start from 0. 

So... `0`-`l`, `1`-`o`, `2`-`c`, `3`-`a`, `4`-`l`, `5`-`i`... we got `i`!

## String length

You can get the length of a string using the `len(str)` function.

Try getting the length of `mystr` below:

In [29]:
len(mystr)

12

# Challenge 2 - String arithmetics #1

Obtain the character in the middle of `localization` using the following tools:
- `len(str)` for getting the length of a string
- `/` division operator
- `[x]` index operator
- `int(x)` casting for converting a float to integer

You can define new variables if needed.

In [38]:
middle = len(mystr)/2
print(middle)
print(int(middle))
mystr[int(middle)]

6.0
6


'z'
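
As an alternative sketch, the floor division operator `//` from Challenge 1 gives a whole number directly, so the `int(x)` conversion is not needed. This is just another way to reach the same result, not the route the challenge asks for.

```python
mystr = "localization"

# Floor division returns an int, which can be used as an index right away
middle = len(mystr) // 2
print(middle)
print(mystr[middle])
```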

## `str.replace`

Let's see what `str.replace` does by using `help` 

In [39]:
help(str.replace)

Help on method_descriptor:

replace(self, old, new, count=-1, /)
    Return a copy with all occurrences of substring old replaced by new.
    
      count
        Maximum number of occurrences to replace.
        -1 (the default value) means replace all occurrences.
    
    If the optional argument count is given, only the first count occurrences are
    replaced.



Now let's use `str.replace` to replace `Mr.` with `Dr.` in the following string

In [40]:
myname = "Mr. Jekyll"
myname.replace("Mr.", "Dr.")

'Dr. Jekyll'

Let's print `myname` now...

In [41]:
print(myname)

Mr. Jekyll


It didn't change. Why? 

It's because the change is not made in place. Strings in Python are immutable, and if you check the help above, you'll see that `replace` "returns a copy" of the string rather than modifying the original. 

If we want to keep the modified string, we need to assign the result to a variable.

In [42]:
#Defining new variable
mynewname = myname.replace("Mr.", "Dr.")
print(myname)
print(mynewname)

Mr. Jekyll
Dr. Jekyll


By default, it also replaces all occurrences of the given string.

In [43]:
"Mr.Jekyll and Mr.Hyde".replace("Mr.", "Dr.")

'Dr.Jekyll and Dr.Hyde'
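
The help text above also mentions an optional `count` argument. The sketch below passes it as the third positional argument to limit the replacement to the first occurrence; the variable names are made up for this example.

```python
title = "Mr.Jekyll and Mr.Hyde"

# With a count of 1, only the first occurrence is replaced
first_only = title.replace("Mr.", "Dr.", 1)
print(first_only)

# The original string is unchanged, as before
print(title)
```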

## `str.format()`

This is used to place values inside a string. Each `{}` placeholder is filled with the corresponding argument, in order. 

In [45]:
secret = '{} {} is cool!'.format('Python', 'Programming')
print(secret)

Python Programming is cool!


In [158]:
#Introduce yourself using format
print('Hello! My name is {} {}. Nice to meet you!'.format("Alp", "Öktem"))

Hello! My name is Alp Öktem. Nice to meet you!
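
Besides empty `{}` placeholders, `str.format` also accepts numbered and named placeholders. Python 3.6 and later also offer f-strings as a shorter spelling. The sketch below shows all three; the example values are made up.

```python
first_name = "Ada"      # hypothetical example values
language = "Python"

# Numbered placeholders let us reuse or reorder the arguments
print('{0} likes {1}. {1} likes {0} back!'.format(first_name, language))

# Named placeholders make long templates easier to read
print('{name} is learning {lang}.'.format(name=first_name, lang=language))

# f-strings (Python 3.6+) place the variables directly inside the string
message = f'{first_name} is learning {language}.'
print(message)
```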


## `str.join()`

This joins a list of strings into a single string, placing the separator string between the elements.

In [160]:
str1 = 'pandas'
str2 = 'numpy'
str3 = 'requests'
cool_python_libs = ', '.join([str1, str2, str3])

In [161]:
print('Some cool python libraries: {}'.format(cool_python_libs))

Some cool python libraries: pandas, numpy, requests


## `str.upper(), str.lower(), str.title()`

These methods, together with `str.capitalize()` shown last, are very useful when cleaning text

In [53]:
mystr = 'pyTHoN hackER'

In [54]:
mystr.upper()

'PYTHON HACKER'

In [55]:
mystr.lower()

'python hacker'

In [56]:
mystr.title()

'Python Hacker'

In [57]:
mystr.capitalize()

'Python hacker'

## `str.strip`

Strips out sneaky whitespace at the beginning or end of the string. 

Note that `\t` stands for a tab and `\n` stands for a new line in Python (for more info see [Escape characters](http://python-reference.readthedocs.io/en/latest/docs/str/escapes.html#escape-characters))

In [58]:
my_crooked_str = "   \t THis Is a \tTeRRIble sTRing \n"

In [59]:
print(my_crooked_str)

   	 THis Is a 	TeRRIble sTRing 



In [60]:
slightly_better = my_crooked_str.strip()
print(slightly_better)

THis Is a 	TeRRIble sTRing


We can chain operations by adding them one after the other

In [61]:
slightly_better = my_crooked_str.strip().lower()
print(slightly_better)

this is a 	terrible string


## `str.split()`

This is used to split a string at the delimiter we choose. Let's see what `help` says about it. 

In [62]:
help(str.split)

Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1)
    Return a list of the words in the string, using sep as the delimiter string.
    
    sep
      The delimiter according which to split the string.
      None (the default value) means split according to any whitespace,
      and discard empty strings from the result.
    maxsplit
      Maximum number of splits to do.
      -1 (the default value) means no limit.



In [63]:
sentence = 'three different words'
words = sentence.split()
print(words)

['three', 'different', 'words']


As the description says, it returns a list. We'll get into lists later. 

We could use other delimiters too.

In [68]:
another_sentence = 'three, different, words'
words = another_sentence.split(", ")
print(words)

['three', 'different', 'words']
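
The help text also describes `maxsplit`. The sketch below uses it to split only once, and then shows that `join` can put split pieces back together. The `entry` string is a made-up example.

```python
entry = "name: Guido van Rossum: Python"

# A maxsplit of 1 stops after the first split and keeps the rest intact
parts = entry.split(": ", 1)
print(parts)

# Splitting and joining with the same delimiter undo each other
words = "three, different, words".split(", ")
rebuilt = ", ".join(words)
print(rebuilt)
```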


# Challenge 3 - String arithmetics #2

Let's use what we learned to make this horrible string truly beautiful. 

Let's process `my_crooked_str` so that it looks like this:

`"This is a beautiful string."`

*(Don't forget the period at the end)*

In [76]:
my_crooked_str

'   \t THis Is a \tTeRRIble sTRing \n'

In [162]:
my_beautiful_string = my_crooked_str.strip().lower().replace("\t", "").replace("terrible", "beautiful").capitalize()
print(my_beautiful_string)

my_beautiful_string = my_beautiful_string + "." 
print(my_beautiful_string)

This is a beautiful string
This is a beautiful string.


# Part 3 - Conditionals

Welcome to conditionals, the most fundamental decision mechanism in coding. They let us steer our program down different paths based on a condition.

But before we get into that, I'll introduce another variable type called boolean (named after [George Boole](https://en.wikipedia.org/wiki/George_Boole)).

A boolean is a binary variable; that is, it can be either "on" or "off", 1 or 0, `True` or `False`. 

In [85]:
my_boolean = True
print(my_boolean)
print(type(my_boolean))

True
<class 'bool'>


We can negate a boolean variable by using the `not` operator

In [86]:
my_boolean = True
another_boolean = not my_boolean
print(another_boolean)

False


Now let's build our first conditional using `if`. 

In [163]:
statement = False
if statement:
  print('statement is True')
    
if not statement:
  print('statement is not True')
  print("Statement needs to be fixed")

statement is not True
Statement needs to be fixed


Note the indentation under the `if` statement: the indented lines form the block that runs only when the condition is true.

`if...else` also comes in handy

In [90]:
statement = False
if statement:
    print('statement is True')
else:
    print('statement is not True')

statement is not True


Let's say we want to decide what to wear depending on the temperature given in degrees Celsius.

To decide based on a number, we'll need the following comparison operators:

| Symbol | Task Performed |
|----|---|
| == | Equal to |
| >  | Greater than |
| <  | Less than |
| >=  | Greater than or equal to |
| <=  | Less than or equal to |

In [91]:
temperature = 15

#if
if temperature == 15:
  print("Wear your leather jacket")

Wear your leather jacket


In [94]:
temperature = 25

#if...else
if temperature > 15:
  print("Weather is nice. No need for jacket")
else:
  print("Wear a jacket!")

Weather is nice. No need for jacket


If we have more than one condition then we use `if...elif...else`

In [96]:
temperature = -18

#if...elif...else
if temperature >= 15:
  print("no jacket")
elif temperature > 10:
  print("Wear a light jacket")
else:
  print("It's pretty cold, layer up!")

It's pretty cold, layer up!


We can combine conditions with `and` and `or`

In [97]:
temperature = 25
sunny = True

#combined if
if temperature > 20 and sunny:
  print("Wear your sunglasses")

Wear your sunglasses


In [101]:
temperature = 25
rainy = True

#combined if with OR
if temperature < 15 or rainy:
  print("Wear a jacket")

Wear a jacket
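
When `and`, `or` and `not` are mixed, parentheses make the grouping explicit. Here is a small sketch; the `windy` variable and the values are made up for this example.

```python
temperature = 12
rainy = False
windy = True

# 'and' needs both sides to be True
print(temperature < 15 and rainy)

# 'or' needs at least one side to be True
print(temperature < 15 or rainy)

# Parentheses make the grouping explicit when mixing operators
needs_jacket = temperature < 15 and (rainy or windy)
print(needs_jacket)

# 'not' flips the result
print(not needs_jacket)
```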


`==` and `!=` are used to check whether two expressions are equal or not equal

In [106]:
language = 'bn'

if language == 'fr':
  print("Bonjour!")
elif language == 'en':
  print("Good morning!")
elif language == 'tr':
  print("Günaydın")
else:
  print("Sorry, I don't speak {}".format(language))

Sorry, I don't speak bn


In [108]:
password = '12345'
user_input = '1234'

#Warn if password is wrong using !=
if user_input != password:
  print("Wrong password!")

Wrong password!


# Challenge 4 - Conditionals

Fill in where it's marked with `____` with the proper conditional

In [109]:
myname = "Alp öktem"
len(myname)

9

In [164]:
name = 'George Boole'

if len(name) > 20:
    print('Name "{}" is more than 20 chars long'.format(name))
    length_description = 'long'
elif len(name) > 15:
    print('Name "{}" is more than 15 chars long'.format(name))
    length_description = 'semi long'
elif len(name) > 10:
    print('Name "{}" is more than 10 chars long'.format(name))
    length_description = 'semi long'
elif len(name) >= 8:
    print('Name "{}" is 8, 9 or 10 chars long'.format(name))
    length_description = 'semi short'
else:
    print('Name "{}" is a short name'.format(name))
    length_description = 'short'

print("Length description:", length_description)

Name "George Boole" is more than 10 chars long
Length description: semi long


# Part 4 - Lists and dictionaries

Lists and dictionaries make it super easy to keep a set of values tidy in one single variable. 

## Lists

A list is defined as follows:

`my_list = [1,2,3,4,5,6]`

If we want to read the third element, we just need to write:

`my_list[2]`

*(Note that I put 2 there because, remember, the first element is at index 0)*

Try creating a list and printing one of its elements

In [118]:
my_list = [1,2,3,4,5,6]
my_list[2]

3

A list can be empty...

In [165]:
new_list = []

if not new_list:
  print("My list is empty :(")

My list is empty :(


If we want to add elements to or remove elements from a list we already defined, we can use the `append` and `remove` methods

In [166]:
#append and remove
new_list.append(4)
new_list.append(3)
new_list.append(99)
new_list.remove(3)
new_list.append(100)

print(new_list)

[4, 99, 100]


The length of a list can be obtained using the `len` function

In [167]:
print('list: {}, size: {}'.format(new_list, len(new_list)))

list: [4, 99, 100], size: 3


We can modify or remove an element too

In [168]:
my_list = [0, 1, 2, 3, 4, 5]
print(my_list)

print("\nModify value at index 0 to 99")
my_list[0] = 99
print(my_list)

print("\nRemove value at index 4 using del")
del my_list[4]
print(my_list)

[0, 1, 2, 3, 4, 5]

Modify value at index 0 to 99
[99, 1, 2, 3, 4, 5]

Remove value at index 4 using del
[99, 1, 2, 3, 5]


It's very useful to check whether a value we want is in the list. 

For that we can use `in`

In [127]:
languages = ['en', 'fr', 'tr', 'ar', 'sw']

if 'tr' in languages:
  print("Turkish is in the list")

if 'de' in languages:
  print("German is in the list")
else:
  print("German is not in the list")

if 'ro' not in languages:
  print("Romanian is not in the list")

Turkish is in the list
German is not in the list
Romanian is not in the list


And finally, we can sort a list using `sort`

In [128]:
numbers = [8, 1, 6, 5, 10]
numbers.sort()
print('sorted numbers:', numbers)

numbers.sort(reverse=True)
print('numbers reverse sorted:', numbers)

words = ['this', 'is', 'a', 'list', 'of', 'words']
words.sort()
print('words:',words)

sorted numbers: [1, 5, 6, 8, 10]
numbers reverse sorted: [10, 8, 6, 5, 1]
words: ['a', 'is', 'list', 'of', 'this', 'words']
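
Like `str.replace`, sorting comes in a variant that returns a copy: the built-in `sorted` function, which the text above does not cover. The sketch below contrasts it with `list.sort`, which changes the list in place.

```python
numbers = [8, 1, 6, 5, 10]

# sorted() returns a new sorted list and leaves the original untouched
ordered = sorted(numbers)
print(ordered)
print(numbers)

# list.sort() changes the list itself and returns None
result = numbers.sort()
print(result)
print(numbers)
```

A common slip is writing `numbers = numbers.sort()`, which replaces the list with `None`.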


Merging lists is as easy as 1, 2, 3

In [129]:
old_star_wars = [4,5,6]
new_star_wars = [1,2,3]

all_star_wars = old_star_wars + new_star_wars
print(all_star_wars)

[4, 5, 6, 1, 2, 3]


# Challenge 5 - Lists

Define an empty list called `challenge`. Add the elements:
- `I`
- `am`
- `learning`
- `Python`
- `today`

Replace the element `learning` with `loving`

Remove the element `today`

Sort the list

In [135]:
challenge = []

challenge.append("I")
challenge.append("am")
challenge.append("learning")
challenge.append("Python")
challenge.append("today")

print(challenge)

challenge[2] = "loving"

print(challenge)

#del challenge[4]
challenge.remove("today")

print(challenge)

challenge.sort()

print(challenge)

['I', 'am', 'learning', 'Python', 'today']
['I', 'am', 'loving', 'Python', 'today']
['I', 'am', 'loving', 'Python']
['I', 'Python', 'am', 'loving']


In [136]:
#Execute this cell to see if you did everything correctly
assert challenge == ['I', 'Python', 'am', 'loving']
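
The sorted result may look odd, with `Python` placed before `am`. That is because the default sort compares character codes, and uppercase letters come before lowercase ones. As a sketch going slightly beyond the challenge, the `key` argument lets us sort while ignoring case.

```python
challenge = ['I', 'am', 'loving', 'Python']

# Default sort compares character codes, so uppercase 'P' sorts before 'a'
print(sorted(challenge))

# Passing key=str.lower compares the lowercase form of each word instead
alphabetical = sorted(challenge, key=str.lower)
print(alphabetical)
```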

## Dictionaries

Dictionaries store key-value pairs. 

A dictionary is defined with curly brackets:

In [137]:
my_dict = {'a':'apple', 'b':'banana', 'c':'cherries'}
print(my_dict)
print(type(my_dict))

{'a': 'apple', 'b': 'banana', 'c': 'cherries'}
<class 'dict'>


We can read the elements using the keys that we defined

In [139]:
my_dict['a']

'apple'

Or modify them

In [140]:
my_dict['b'] = 'berries'
print(my_dict)

{'a': 'apple', 'b': 'berries', 'c': 'cherries'}
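
A few more everyday dictionary operations, sketched with the same example dictionary: adding a pair, checking for a key, reading with a fallback, and deleting. The new key `'d'` and its value are made up for illustration.

```python
my_dict = {'a': 'apple', 'b': 'banana', 'c': 'cherries'}

# Assigning to a key that does not exist yet adds a new pair
my_dict['d'] = 'dates'

# 'in' checks whether a key is present
print('d' in my_dict)
print('z' in my_dict)

# get() returns a fallback value instead of raising an error for a missing key
print(my_dict.get('z', 'no fruit for this letter'))

# del removes a pair by its key
del my_dict['a']
print(my_dict)
print(len(my_dict))
```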


# Part 5 - Loops

Sometimes there are too many values to handle one by one, so we need to automate their processing. Loops come in handy when we want to perform a repeated task. 

We usually iterate through a list's items. 

The keyword conventionally used in programming is `for`. 

When we say `for item in my_list`, we get the elements of `my_list` one by one in a variable called `item`, which holds the current element during each step of the loop.

Let's do a simple printing example

In [141]:
my_list = [1, 2, 3, 4, 'Python', 'is', 'neat']
#Print items in list

for item in my_list:
  print(item)

1
2
3
4
Python
is
neat
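
Loops become more useful once we do something with each element. The sketch below adds up a running total, repeats a task a fixed number of times with the built-in `range`, and loops over the characters of a string. The values are made up.

```python
prices = [4, 10, 6]   # hypothetical example values
total = 0

# Add each element to a running total
for price in prices:
    total = total + price

print("total:", total)

# range(3) produces 0, 1, 2, which is handy for repeating a task
for i in range(3):
    print("step", i)

# Strings are sequences too, so we can loop over their characters
letters = []
for char in "abc":
    letters.append(char.upper())
print(letters)
```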


# Challenge 6 - loops and dictionaries

Given our list of fruits, transfer them into a dictionary using their first letters as key 

For example:

Given list `['apple', 'pear', 'cherry']`

Make dictionary `{'a':'apple', 'p':'pear', 'c':'cherry'}`

In [142]:
#Fill in the blanks
fruits = ['cantaloop', 'cherries', 'mango', 'papaya', 'pomegranade']
fruit_dictionary = {}

for item in fruits:
  first_letter = item[0]
  fruit_dictionary[first_letter] = item

print(fruit_dictionary)

{'c': 'cherries', 'm': 'mango', 'p': 'pomegranade'}


In [None]:
#Execute this to see if you get it right
assert fruit_dictionary == {'c': 'cherries', 'p': 'pomegranade', 'm': 'mango'}

# Challenge 6b - List inside dictionary (Home exercise)

As you have noticed, because the keys clashed in the previous exercise, we lost some fruits that were in the original list. 

Edit the code so that we keep all fruits under the same key. 

The final dictionary should look like this:

`fruit_dictionary = {'c': ['cantaloop', 'cherries'], 'p': ['papaya', 'pomegranade'], 'm': ['mango']}`

You can check if a key is inside dictionary with:
`if k in my_dict`

In [173]:
fruit_dictionary = {}

for item in fruits:
  first_letter = item[0]
  if first_letter in fruit_dictionary:
    fruit_dictionary[first_letter].append(item)
  else:
    fruit_dictionary[first_letter] = []
    fruit_dictionary[first_letter].append(item)

print(fruit_dictionary)

{'c': ['cantaloop', 'cherries'], 'm': ['mango'], 'p': ['papaya', 'pomegranade']}
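
The `if...else` above can also be written more compactly with the dictionary method `setdefault`. This is a sketch of an alternative, not the approach the exercise asks for.

```python
fruits = ['cantaloop', 'cherries', 'mango', 'papaya', 'pomegranade']
fruit_dictionary = {}

for item in fruits:
    first_letter = item[0]
    # setdefault returns the existing list for this key,
    # or stores and returns a new empty list if the key is missing
    fruit_dictionary.setdefault(first_letter, []).append(item)

print(fruit_dictionary)
```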


# Part 6 - Functions

A function is a reusable procedure that performs a specific task. It's very useful when we're going to perform one task repeatedly or in different parts of our code. 

Functions are declared with the keyword `def` followed by the function name, the function parameters in parentheses, and a colon

```
def my_function(parameter1, parameter2):
  #Stuff that my function will do
  pass
```

Let's write a function that we can use to greet 

In [143]:
#Greet function
def greet():
  print("Hello!")

Now, let's call that function...

In [144]:
greet()

Hello!


Now let's use our name with it by passing it as a parameter

In [145]:
def greet(username):
  print("Hello", username)

In [146]:
greet("Alp")

Hello Alp


Functions can return values too.

Let's make a function that gives out the square of the number we give to it.

In [147]:
def square(n):
  n_squared = n**2
  return n_squared

In [149]:
square(5)

25
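
A returned value can be stored in a variable or used inside a larger expression. The sketch below also shows a default parameter value, which the text above does not cover. The `power` function is a made-up variation on `square`.

```python
def power(n, exponent=2):
    # exponent defaults to 2 when the caller does not provide it
    return n ** exponent

print(power(5))       # uses the default exponent
print(power(2, 3))    # overrides the default

# A returned value can be stored or used in a larger expression
total = power(3) + power(4)
print(total)
```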

When we have a long list and want to apply our function to all of its elements, we can just create a loop and call our function to do its magic

In [151]:
my_numbers = [1,2,3,4,5,6,7]
my_numbers_squared = []

for n in my_numbers:
  print(square(n))
  my_numbers_squared.append(square(n))

print(my_numbers_squared)

1
4
9
16
25
36
49
[1, 4, 9, 16, 25, 36, 49]


# Challenge 7 - Searching for wanted people

Implement `find_wanted_people` function which takes a list of names (strings) as argument. The function should return a list of names which are present both in `WANTED_PEOPLE` and in the name list given as argument to the function.

In [175]:
WANTED_PEOPLE = ['John Doe', 'Clint Eastwood', 'Chuck Norris']

In [174]:
def find_wanted_people(people_list):
  wanted = []
  for person in people_list:
    if person in WANTED_PEOPLE:
      wanted.append(person)
  return wanted

In [176]:
#Execute this cell to see if you did it right
people_to_check1 = ['Donald Duck', 'Clint Eastwood', 'John Doe', 'Barack Obama']
wanted1 = find_wanted_people(people_to_check1)
assert len(wanted1) == 2
assert 'John Doe' in wanted1
assert 'Clint Eastwood' in wanted1

people_to_check2 = ['Donald Duck', 'Mickey Mouse', 'Zorro', 'Superman', 'Robin Hood']
wanted2 = find_wanted_people(people_to_check2)
assert wanted2 == []

# Part 7 - Packages

What we presented above are the core basics of Python. You might be wondering how these simple tools can help with our complicated tasks. 

There seems to be a huge gap, right? 

Well, yes, there is, but you don't have to worry that much. Another miracle of Python is that it is modular. 

What does modular mean?

It means that if I want to perform task X, it is highly probable that someone out there has already written code for it that I can just plug into my own code. 

In Python, this is standardized through packages. 

Ever heard of the saying *"There's an app for it"?*

Well, in Python **"There's a package for it!"**

There are thousands of packages that can help you solve tasks like:

- Reading and manipulating Google spreadsheets
- Processing TMX, XML etc.
- Cleaning text of emoji etc.
- Building a machine translation model (more on this on Friday!)
- Whatever you might imagine...

Some of these packages come bundled with our Python installation. 

The rest can be found on sites like https://pypi.org/ and installed with a one-line command. 

One very useful package is the regular expressions package, `re`.

We're not going to go deep into regex. You can find a reference here --> https://www.tutorialspoint.com/python/python_reg_expressions.htm

To be able to use a package, all we need to do is import it

In [180]:
import re

Then we can use its functions. 

Three of the most important of these are:

- `re.match` for checking if a string matches a pattern
- `re.search` for searching for a pattern in string
- `re.sub` for replacing a matched pattern in a string with some other string

## `re.search(pattern, string)`

Let's search for a telephone number in a string

In [181]:
telephone_entry = "Here's the phone number of the pizza shop: 555-4351. Order a napolitana for me. "

searchObj = re.search("[0-9][0-9][0-9]-*[0-9][0-9][0-9][0-9]", telephone_entry)
print(searchObj.group(0))

555-4351


## `re.match(pattern, string)`

Let's check if `555-4351` is a properly formatted telephone number

In [183]:
telephone_number = "555-4351"

matchObj = re.match("[0-9][0-9][0-9]-*[0-9][0-9][0-9][0-9]", telephone_number)
print(matchObj.group(0))

555-4351
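
Two details are worth knowing here. Both `re.search` and `re.match` return `None` when nothing matches, so calling `.group(0)` on the result would raise an error. And `re.match` only checks the beginning of the string. The sketch below illustrates both and uses `re.fullmatch`, another function from the same module, for a stricter check. The test strings are made up.

```python
import re

pattern = "[0-9][0-9][0-9]-*[0-9][0-9][0-9][0-9]"

# re.match returns None when the start of the string does not fit the pattern
no_match = re.match(pattern, "call me maybe")
if no_match is None:
    print("Not a telephone number")

# re.match only anchors at the beginning, so trailing text is accepted
loose = re.match(pattern, "555-4351 and then some")
print(loose.group(0))

# re.fullmatch requires the whole string to fit the pattern
strict = re.fullmatch(pattern, "555-4351 and then some")
print(strict)
```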


## `re.sub(pattern, repl, string, count=0)`

Imagine our client wants personal data like telephone numbers to be obfuscated. 

In [186]:
my_personal_string = "My personal number is 555-1234."

re.sub("[0-9][0-9][0-9]-*[0-9][0-9][0-9][0-9]", "<hidden>", my_personal_string)

'My personal number is <hidden>.'
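
The pattern can be written more compactly with `\d` and repetition counts. The sketch below is a slightly stricter variant of the pattern used above (it allows at most one hyphen), and it also shows the `count` argument of `re.sub` and the `re.findall` function. The example text is made up.

```python
import re

# \d matches a digit, {3} repeats it three times, -? allows one optional hyphen.
# This is slightly stricter than -* above, which allows any number of hyphens.
phone_pattern = r"\d{3}-?\d{4}"

text = "Home: 555-1234, work: 5559876."

# findall returns every non-overlapping match as a list of strings
found = re.findall(phone_pattern, text)
print(found)

# count=1 limits the replacement to the first match
first_hidden = re.sub(phone_pattern, "<hidden>", text, count=1)
print(first_hidden)
```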

# Final challenge

Now that we're Python experts, we can take on some real challenges, right??

In this final challenge, we're going to fix some highly challenging documents which a client has asked us to translate. 

You can download them from these links:
- https://github.com/alpoktem/python-workshop/raw/main/challenge_data/Programming.docx
- https://github.com/alpoktem/python-workshop/raw/main/challenge_data/Python.docx

Next, we're going to put them in our Colab workspace using the Files menu on the left. 

It is always good to think about the task before embarking on any coding. 

A programmer always thinks in terms of input and output. Let's define them first:

INPUT: 2 docx Word documents with badly formatted text

OUTPUT: 2 docx Word documents with format corrected text

Once we know the beginning and ending of our procedure, we can start thinking about the intermediate steps.

1. Read the documents 
2. Clean the text
3. Output cleaned documents
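
The three steps above can be sketched with plain strings before any extra packages are involved. Here `simple_clean` is a hypothetical stand-in built on the standard library's `unicodedata` module. It is not the cleaning this workshop uses, and the paragraphs are made-up sample text.

```python
import unicodedata

def simple_clean(text):
    # Hypothetical stand-in for the real cleaning step:
    # split accented letters into base letter + accent mark, then drop
    # everything that is not plain ASCII (accent marks, emoji, ...)
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip()

# 1. "Read" the document (here: a list of paragraph strings)
paragraphs = ["  Python ïs àn ïnterpreted lànguàge. ", "ït ïs dynàmïcàlly typed."]

# 2. Clean the text
cleaned = []
for p in paragraphs:
    cleaned.append(simple_clean(p))

# 3. Output the cleaned paragraphs
for p in cleaned:
    print(p)
```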

To accomplish these tasks, two Python packages come to our help:

1. `clean-text` for normalizing text https://pypi.org/project/clean-text/
2. `python-docx` (imported as `docx`) for reading and writing Word documents https://python-docx.readthedocs.io/en/latest/index.html

We're going to get help from this tutorial when using `docx`: https://tech-cookbook.com/2019/10/21/how-to-work-with-docx-in-python/

These two packages are not included in the standard Python distribution. So we have to install them. We can install a package using this command:

`!pip install <package-name>`

Let's install these two packages: 

In [187]:
!pip install clean-text

Collecting clean-text
  Downloading clean_text-0.5.0-py3-none-any.whl (9.8 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.0.3.tar.gz (64 kB)
Collecting emoji
  Downloading emoji-1.6.3.tar.gz (174 kB)
Building wheels for collected packages: ftfy, emoji
  Building wheel for ftfy (setup.py) ... done
  Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41933 sha256=4f224727db0071cefec4f444d1a5bf7fcb382f2cf3125a97920796b2f57afd5e
  Stored in directory: /root/.cache/pip/wheels/19/f5/38/273eb3b5e76dfd850619312f693716ac4518b498f5ffb6f56d
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.6.3-py3-none-any.whl size=170298 sha256=2d975d1998f113703dafab3cb779f164ef80d242d2c7f385e1a94075f61e7434
  Stored in directory: /root/.cache/pip/wheels/03/8b/d7/ad579fbef83c287215c0caab60fb0ae0f30c4d7ce5f

In [188]:
!pip install python-docx

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184507 sha256=08337c371bac17fe65823c8770ac2e3939d2395a924d47c539e078198fd99aa2
  Stored in directory: /root/.cache/pip/wheels/f6/6f/b9/d798122a8b55b74ad30b5f52b01482169b445fbb84a11797a6
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


Let's play around with the `clean-text` package. We can see its basic usage on its PyPI page. 

In [189]:
from cleantext import clean
clean("A bunch of \\u2018new\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).")

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


"a bunch of 'new' references, including [moana](https://en.wikipedia.org/wiki/moana_%282016_film%29)."

In the same way, let's familiarize ourselves with the `python-docx` package. 

In [192]:
from docx import Document
document = Document('Python.docx')

In [194]:
for paragraph in document.paragraphs:
  print(paragraph.text)

⭐️Python⭐️ ïs àn ïnterpreted hïgh\u002dlevel ğeneràl\u002dpurpose progràmmïng lànguàge. ïts desïgn phïlosophy emphàsïzes code reàdàbïlïty Wïth ïts use of sïgnïfïcànt ïndentàtïon. ïts lànguàge constructs às Well às ïts object\u002dorïented àpproàch àïm to help progràmmers Wrïte cleàr, logïcàl code for smàll ànd làrge\u002dscàle projects.\u005b30\u005d ☹︎
⭐️Python⭐️ ïs dynàmïcàlly\u002dtyped ànd gàrbàge\u002dcollected. ït supports multïple progràmmïng pàràdïgms, ïncludïng 🃏structured (pàrtïculàrly, proceduràl), object\u002dorïented ànd functïonàl progràmmïng. ït ïs often descrïbed às à \u2018bàtterïes ïncluded\u2019 lànguàge due to ïts comprehensïve stàndàrd lïbràry.\u005b31\u005d\u005b32\u005d
Guïdo vàn Rossum begàn Workïng on Python ïn the làte 1980s, às à successor to the àBC progràmmïng lànguàge, ànd fïrst releàsed ☹︎ ☹︎ ☹︎ ☹︎ ït ïn 1991 às ⭐️Python⭐️ 0.9.0.\u005b33\u005d Python 2.0 Wàs releàsed ïn 2000 ànd ïntroduced neW feàtures, such às lïst comprehensïons ànd à cycle\u002ddetectï

Let's start by writing a function that reads paragraphs of a document into a list

In [198]:
def doc_to_list(doc_path):
  document = Document(doc_path)
  paragraphs = []
  for p in document.paragraphs:
    paragraphs.append(p.text)
  return paragraphs

In [199]:
doc1_content = doc_to_list("Python.docx")
print(doc1_content)

['⭐️Python⭐️ ïs àn ïnterpreted hïgh\\u002dlevel ğeneràl\\u002dpurpose progràmmïng lànguàge. ïts desïgn phïlosophy emphàsïzes code reàdàbïlïty Wïth ïts use of sïgnïfïcànt ïndentàtïon. ïts lànguàge constructs às Well às ïts object\\u002dorïented àpproàch àïm to help progràmmers Wrïte cleàr, logïcàl code for smàll ànd làrge\\u002dscàle projects.\\u005b30\\u005d ☹︎', '⭐️Python⭐️ ïs dynàmïcàlly\\u002dtyped ànd gàrbàge\\u002dcollected. ït supports multïple progràmmïng pàràdïgms, ïncludïng 🃏structured (pàrtïculàrly, proceduràl), object\\u002dorïented ànd functïonàl progràmmïng. ït ïs often descrïbed às à \\u2018bàtterïes ïncluded\\u2019 lànguàge due to ïts comprehensïve stàndàrd lïbràry.\\u005b31\\u005d\\u005b32\\u005d', 'Guïdo vàn Rossum begàn Workïng on Python ïn the làte 1980s, às à successor to the àBC progràmmïng lànguàge, ànd fïrst releàsed ☹︎ ☹︎ ☹︎ ☹︎ ït ïn 1991 às ⭐️Python⭐️ 0.9.0.\\u005b33\\u005d Python 2.0 Wàs releàsed ïn 2000 ànd ïntroduced neW feàtures, such às lïst comprehensïons

Now let's create another function that processes the strings in a list and outputs another list

In [205]:
def clean_string_list(str_list):
  clean_list = []
  for l in str_list:
    clean_list.append(clean(l.strip()))
  return clean_list

In [202]:
doc1_content_cleaned = clean_string_list(doc1_content)
print(doc1_content_cleaned)

['⭐python⭐ is an interpreted high-level general-purpose programming language. its design philosophy emphasizes code readability with its use of significant indentation. its language constructs as well as its object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.[30] ☹️', "⭐python⭐ is dynamically-typed and garbage-collected. it supports multiple programming paradigms, including 🃏structured (particularly, procedural), object-oriented and functional programming. it is often described as a 'batteries included' language due to its comprehensive standard library.[31][32]", 'guido van rossum began working on python in the late 1980s, as a successor to the abc programming language, and first released ☹️ ☹️ ☹️ ☹️ it in 1991 as ⭐python⭐ 0.9.0.[33] python 2.0 was released in 2000 and introduced new features, such as list comprehensions and a cycle-detecting garbage collection system (in addition to reference counting). ⭐python 3.0 ⭐was relea

And finally, let's write a function that creates a new document from our list of strings and saves it

In [203]:
def str_list_to_doc(str_list, new_doc_name):
  document = Document()
  for s in str_list:
    document.add_paragraph(s)
  document.save(new_doc_name)

In [204]:
str_list_to_doc(doc1_content_cleaned, "Python_clean.docx")

OK! So far so good, let's see the result for one document. Download and look at `Python_clean.docx`

And now, we can extend the same code to process ALL documents! 

In [206]:
all_documents = ["Python.docx", "Programming.docx"]

for doc in all_documents:
  doc_content = doc_to_list(doc)
  doc_content_cleaned = clean_string_list(doc_content)

  # Build e.g. "Python_clean.docx" instead of "Python.docx_clean.docx"
  new_doc_name = doc.replace(".docx", "_clean.docx")
  str_list_to_doc(doc_content_cleaned, new_doc_name)

Voila! Our documents are now ready to be translated! 