# Tutorial to Koalas - First steps

First, we import Koalas under the alias `kl`:

In [1]:
import koalas as kl

## Creation of WordLists and simple manipulation of terms
We can create a list of words to be used with Koalas by passing a list of strings to the `WordList` constructor.

In [2]:
terms = kl.WordList(['Red automobile', 'Citroën', 'Car industry in Europe',
                     'blue vehicle', 'piston (engine part)', 'gasoline'], name='car terms')
print(terms)

0            Red automobile
1                   Citroën
2    Car industry in Europe
3              blue vehicle
4      piston (engine part)
5                  gasoline
Name: car terms, dtype: object


Now we can use standard Pandas methods like `str.lower()` or operators to manipulate the terms.

In [3]:
print(terms.str.lower())

0            red automobile
1                   citroën
2    car industry in europe
3              blue vehicle
4      piston (engine part)
5                  gasoline
Name: car terms, dtype: object


In [4]:
print(terms + ' (car related term)')

0            Red automobile (car related term)
1                   Citroën (car related term)
2    Car industry in Europe (car related term)
3              blue vehicle (car related term)
4      piston (engine part) (car related term)
5                  gasoline (car related term)
Name: car terms, dtype: object


Additionally, Koalas contains more advanced methods that operate on lists of terms.

In [5]:
print(terms.remove_qualifiers())

0            Red automobile
1                   Citroën
2    Car industry in Europe
3              blue vehicle
4                   piston 
5                  gasoline
Name: car terms, dtype: object


In [6]:
print(terms.stem())

0            red automobil
1                  citroën
2    car industri in europ
3              blue vehicl
4      piston (engin part)
5                  gasolin
Name: car terms, dtype: object
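
To make the effect of `remove_qualifiers()` more concrete, here is a minimal sketch in plain Python of how a trailing parenthesised qualifier could be stripped with a regular expression. It is not Koalas' implementation: the helper name `strip_qualifier` and the pattern are assumptions made for illustration. Unlike the Koalas output above, which keeps a trailing space after "piston", this sketch also drops the whitespace before the parenthesis.

```python
import re

# Hypothetical helper, not part of Koalas: matches a parenthesised
# qualifier at the very end of a term, including the preceding whitespace.
QUALIFIER = re.compile(r"\s*\([^()]*\)\s*$")


def strip_qualifier(term):
    """Return the term without a trailing "(...)" qualifier."""
    return QUALIFIER.sub("", term)


terms = ["Red automobile", "Citroën", "Car industry in Europe",
         "blue vehicle", "piston (engine part)", "gasoline"]
cleaned = [strip_qualifier(term) for term in terms]
print(cleaned)
```

Terms without a trailing qualifier pass through unchanged, and parentheses in the middle of a term are left alone because the pattern is anchored to the end of the string.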


## WordFrames for more elaborate workflows
To group the results of different operations and analyses, Koalas offers the class `WordFrame`. You can initialize a `WordFrame` by passing it a `WordList`; the list of words becomes a column in a table. Here the `WordFrame` contains only a single column so far, but note that the name of the `WordList` is used as the column header.

In [7]:
ctflow = kl.WordFrame(terms)
print(ctflow)

                car terms
0          Red automobile
1                 Citroën
2  Car industry in Europe
3            blue vehicle
4    piston (engine part)
5                gasoline


In contrast to Pandas, any method that can be called on a `WordList` can also be called on a `WordFrame`. By default, the method is applied to the column that was last modified (or to the last column in the `WordFrame` if none has been modified so far).

In [8]:
print(ctflow.str.lower())

                car terms
0          red automobile
1                 citroën
2  car industry in europe
3            blue vehicle
4    piston (engine part)
5                gasoline


Additionally, the argument `to` can be passed to any method to specify another column in which to store the results. The source column stays unmodified.

In [9]:
print(ctflow.str.lower(to='lower case'))

                car terms              lower case
0          Red automobile          red automobile
1                 Citroën                 citroën
2  Car industry in Europe  car industry in europe
3            blue vehicle            blue vehicle
4    piston (engine part)    piston (engine part)
5                gasoline                gasoline


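Similarly, the parameter `on` specifies the source column. Without `on`, `str.contains` in the first cell below operates on the most recently modified column, 'lower case', and finds 'red' in the first term. With `on='car terms'`, as in the second cell, it searches the original terms, where the case-sensitive match fails for 'Red automobile'.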

In [10]:
print(ctflow.str.lower(to='lower case').str.contains('red', to='contains "red"'))

                car terms              lower case  contains "red"
0          Red automobile          red automobile            True
1                 Citroën                 citroën           False
2  Car industry in Europe  car industry in europe           False
3            blue vehicle            blue vehicle           False
4    piston (engine part)    piston (engine part)           False
5                gasoline                gasoline           False


In [11]:
print(ctflow.str.lower(to='lower case').str.contains('red', on='car terms', to='contains "red"'))

                car terms              lower case  contains "red"
0          Red automobile          red automobile           False
1                 Citroën                 citroën           False
2  Car industry in Europe  car industry in europe           False
3            blue vehicle            blue vehicle           False
4    piston (engine part)    piston (engine part)           False
5                gasoline                gasoline           False
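
The `on`/`to` behaviour described above can be sketched with a toy class in plain Python. This is an illustration under stated assumptions, not how Koalas is implemented: `MiniFrame` and its `apply` method are hypothetical names, and the only rules modelled are "read from `on`, or from the last modified column" and "write to `to`, or back to the source column".

```python
class MiniFrame:
    """Toy stand-in for a table of named columns (illustration only)."""

    def __init__(self, columns, last_modified=None):
        self.columns = dict(columns)  # dicts preserve insertion order
        # Fall back to the last column if nothing has been modified yet.
        self.last_modified = last_modified or next(reversed(self.columns))

    def apply(self, func, on=None, to=None):
        """Apply func to each value of the source column; return a new frame."""
        source = on if on is not None else self.last_modified
        target = to if to is not None else source
        new_columns = dict(self.columns)
        new_columns[target] = [func(value) for value in self.columns[source]]
        return MiniFrame(new_columns, last_modified=target)


frame = MiniFrame({"car terms": ["Red automobile", "Citroën", "blue vehicle"]})
lowered = frame.apply(str.lower, to="lower case")

# No `on`: the check runs on 'lower case', the last modified column.
default_source = lowered.apply(lambda s: "red" in s, to='contains "red"')
# Explicit `on`: the check runs on the original, capitalized terms.
explicit_source = lowered.apply(lambda s: "red" in s, on="car terms",
                                to='contains "red"')

print(default_source.columns['contains "red"'])   # [True, False, False]
print(explicit_source.columns['contains "red"'])  # [False, False, False]
```

Each call returns a new object instead of modifying the existing one, which mirrors the way `ctflow` itself stays unchanged between the cells above.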


This allows entire workflows to be chained together and formatted in a readable way.

In [12]:
print(ctflow.str.len(to='string length')
            .word_number(on='car terms', to='number of words')
            .str.lower(on='car terms', to='lower case')
            .str.contains('red', to='contains "red"'))

                car terms  string length  number of words  \
0          Red automobile             14                2   
1                 Citroën              7                1   
2  Car industry in Europe             22                4   
3            blue vehicle             12                2   
4    piston (engine part)             20                3   
5                gasoline              8                1   

               lower case  contains "red"  
0          red automobile            True  
1                 citroën           False  
2  car industry in europe           False  
3            blue vehicle           False  
4    piston (engine part)           False  
5                gasoline           False  
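
As a cross-check of the 'string length' and 'number of words' columns, the same numbers can be computed in plain Python. The sketch assumes that words are separated by whitespace, which is consistent with the output above but is not confirmed for `word_number`. The helper name `describe` is hypothetical, and the NFC normalization step is an addition that keeps 'ë' counted as a single character.

```python
import unicodedata

terms = ["Red automobile", "Citroën", "Car industry in Europe",
         "blue vehicle", "piston (engine part)", "gasoline"]


def describe(term):
    """Return (character count, whitespace-separated word count)."""
    # NFC makes sure "ë" is one code point rather than "e" plus a
    # combining diaeresis, which would inflate the length by one.
    normalized = unicodedata.normalize("NFC", term)
    return len(normalized), len(normalized.split())


lengths = [describe(term)[0] for term in terms]
word_counts = [describe(term)[1] for term in terms]
print(lengths)      # [14, 7, 22, 12, 20, 8]
print(word_counts)  # [2, 1, 4, 2, 3, 1]
```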


## Advanced processing and filtering
So far we have looked at relatively simple string methods. Koalas also supports a range of word-related methods, and more can easily be added. Let's assume we want to process our initial example a bit further. We can define some background knowledge in the form of lists and dictionaries.

In [13]:
colors = ['blue', 'green', 'yellow', 'red']
continents = ['Europe', 'Africa', 'Asia', 'America', 'Australia']
synonyms = {'car': 'automobile', 'vehicle': 'automobile', 'motor': 'engine'}

In [14]:
print(ctflow.str.lower(to='processed term')
            .remove_words(colors)
            .word_isin(continents, on='car terms', to='contains geoname'))

                car terms          processed term  contains geoname
0          Red automobile              automobile             False
1                 Citroën                 citroën             False
2  Car industry in Europe  car industry in europe              True
3            blue vehicle                 vehicle             False
4    piston (engine part)    piston (engine part)             False
5                gasoline                gasoline             False
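
The word-level operations used above can be approximated in plain Python. The two functions below are a minimal sketch that reuses the method names for readability; they are not Koalas' code, and splitting on whitespace and matching case-sensitively are assumptions.

```python
colors = ["blue", "green", "yellow", "red"]
continents = ["Europe", "Africa", "Asia", "America", "Australia"]


def remove_words(term, words):
    """Drop every whitespace-separated word of term that appears in words."""
    unwanted = set(words)
    return " ".join(word for word in term.split() if word not in unwanted)


def word_isin(term, words):
    """Return True if at least one word of term appears in words."""
    wanted = set(words)
    return any(word in wanted for word in term.split())


print(remove_words("red automobile", colors))           # automobile
print(word_isin("Car industry in Europe", continents))  # True
print(word_isin("car industry in europe", continents))  # False
```

In this sketch the lookup is case-sensitive, so the capitalized continent names only match the original terms. That would be one reason to run the check with `on='car terms'` instead of on the lower-cased column.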


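We can continue in this way and, moreover, deduplicate and filter the result. In the output below, 'blue vehicle' disappears because its processed term becomes 'automobile', which duplicates the first row, and 'Car industry in Europe' is removed by the filter on 'contains geoname'.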

In [15]:
filtered_candidates = (ctflow.word_isin(continents, to='contains geoname')
                             .str.lower(on='car terms', to='processed term')
                             .remove_words(colors)
                             .replace_words(synonyms)
                             .deduplicate()
                             .pos_tag(to='POS tag')
                             .filter(on='contains geoname', where='==False')
                      )
print(filtered_candidates)

              car terms  contains geoname        processed term       POS tag
0        Red automobile             False            automobile          [NN]
1               Citroën             False               citroën          [NN]
4  piston (engine part)             False  piston (engine part)  [NN, NN, NN]
5              gasoline             False              gasoline          [NN]
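
To see why exactly these four rows survive, the last steps of the workflow can be traced in plain Python. The sketch starts from the lower-cased, colour-free terms and skips POS tagging. It assumes that synonyms are replaced word by word and that deduplication keeps the first occurrence of each processed term. The helper `replace_words` is a stand-in written for this example, not the Koalas method.

```python
synonyms = {"car": "automobile", "vehicle": "automobile", "motor": "engine"}

# (index, processed term, contains geoname) before synonym replacement
rows = [
    (0, "automobile", False),
    (1, "citroën", False),
    (2, "car industry in europe", True),
    (3, "vehicle", False),
    (4, "piston (engine part)", False),
    (5, "gasoline", False),
]


def replace_words(term, mapping):
    """Replace each whitespace-separated word that has an entry in mapping."""
    return " ".join(mapping.get(word, word) for word in term.split())


replaced = [(idx, replace_words(term, synonyms), geo) for idx, term, geo in rows]

# Deduplicate on the processed term, keeping the first occurrence.
seen = set()
unique = []
for idx, term, geo in replaced:
    if term not in seen:
        seen.add(term)
        unique.append((idx, term, geo))

# Keep only rows that do not contain a geographic name.
kept = [(idx, term) for idx, term, geo in unique if not geo]
print(kept)
```

Row 3 is dropped as a duplicate of row 0 once 'vehicle' becomes 'automobile', and row 2 is dropped by the filter, which leaves the indices 0, 1, 4 and 5.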


Finally, we can save the result as a CSV file.

In [16]:
filtered_candidates.to_csv('data/filtered_candidates.csv')
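
Note that pandas' `to_csv` does not create missing directories, so if `WordFrame` inherits that behaviour (an assumption), the `data/` folder has to exist before the call above. The sketch below shows the idea with the standard library only: it creates the parent directory, writes a few rows as UTF-8 and reads them back. The function name `write_rows` is hypothetical, and a temporary directory is used so that the example leaves no files behind.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, header, rows):
    """Write header and rows to a CSV file, creating parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = write_rows(Path(tmp) / "data" / "filtered_candidates.csv",
                        ["car terms", "processed term"],
                        [["Red automobile", "automobile"],
                         ["Citroën", "citroën"]])
    with target.open(newline="", encoding="utf-8") as handle:
        read_back = list(csv.reader(handle))

print(read_back)
```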