# Class notes 23-11-2017

## A) NLTK and Lemmatization

**E1: Let's play with lemmatization of various conjugated words.**

In [None]:
some_conjugated_words = {'blew', 'shorty', 'shortest', 'grandmothers', 'appealing', 'appeals'}

import nltk
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
for word in some_conjugated_words:
    lemma=lmtzr.lemmatize(word, 'a') # lemmatize each word as an Adjective
    print(word, lemma)

# What is happening in this code?

**E2: Let's look into lemmatizing text now.**

In [None]:
weird_text = 'Kids blew candles stronger than parents.'

import nltk
# Tokenization
tokens = nltk.word_tokenize(weird_text)

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
for word in tokens:
    lemma=lmtzr.lemmatize(word) # no POS given: same as lemmatize(word,'n')
    print(word, lemma)
    
# What is happening here and how to fix it?

We can add POS tags now and see if that improves lemmatization.

In [None]:


weird_text = 'Kids blew candles stronger than parents.'

import nltk
# Tokenization
tokens = nltk.word_tokenize(weird_text)

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
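# POS tagging (yields Penn Treebank tags such as 'NNS' or 'VBD')
tagged_tokens = nltk.pos_tag(tokens)
for word, tag in tagged_tokens:
    lemma=lmtzr.lemmatize(word, tag) # the Penn tag is passed straight to the lemmatizer
    print(word, lemma)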
    
# What is happening here and how to fix it?

In [None]:
# Lemmatizing (the proper way, accounting for different POS tags)
from nltk.corpus import wordnet as wn

def penn_to_wn(penn_tag):
    """
    Returns the corresponding WordNet POS tag for a Penn TreeBank POS tag.
    """
    if penn_tag in ['NN', 'NNS', 'NNP', 'NNPS']:
        wn_tag = wn.NOUN
    elif penn_tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
        wn_tag = wn.VERB
    elif penn_tag in ['RB', 'RBR', 'RBS']:
        wn_tag = wn.ADV
    elif penn_tag in ['JJ', 'JJR', 'JJS']:
        wn_tag = wn.ADJ
    else:
        wn_tag = None
    return wn_tag


weird_text = 'Kids blew candles stronger than parents.'

import nltk
# Tokenization
tokens = nltk.word_tokenize(weird_text)

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
tagged_tokens = nltk.pos_tag(tokens) # POS tagging
for word, tag in tagged_tokens:
    mapped_tag=penn_to_wn(tag)
    if mapped_tag is not None: # some tags, like the one for 'than', have no WordNet equivalent
        lemma=lmtzr.lemmatize(word, mapped_tag)
    else:
        lemma=lmtzr.lemmatize(word) # fall back to the default POS (noun)
    print(word, lemma, mapped_tag)
    
# What is happening here and how to fix it?
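The mapping in `penn_to_wn` can also be written more compactly, as a rough sketch: every Penn Treebank tag that the function handles starts with `NN`, `VB`, `RB` or `JJ`, so a lookup on the first two characters gives the same result. The names `PENN_PREFIX_TO_WN` and `penn_to_wn_short` are made up for this illustration, and the one-letter strings are the values that `wn.NOUN`, `wn.VERB`, `wn.ADV` and `wn.ADJ` stand for, so the sketch runs without importing NLTK.

```python
# Hypothetical compact variant of penn_to_wn (not part of the class code).
# 'n', 'v', 'r', 'a' are the values of wn.NOUN, wn.VERB, wn.ADV, wn.ADJ.
PENN_PREFIX_TO_WN = {'NN': 'n', 'VB': 'v', 'RB': 'r', 'JJ': 'a'}

def penn_to_wn_short(penn_tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None."""
    # Only the first two characters matter: 'NNS' -> 'NN', 'VBD' -> 'VB', ...
    return PENN_PREFIX_TO_WN.get(penn_tag[:2])

for tag in ['NNS', 'VBD', 'JJR', 'IN', '.']:
    print(tag, penn_to_wn_short(tag))
```

Tags without a WordNet equivalent (such as `IN` for 'than') still come back as `None`, so the calling loop needs the same fallback as before.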

## B) Functions

**E3: PRINT vs RETURN**

**Print is for people.** Remember that slogan. Printing has no effect on the ongoing execution of a program. It doesn’t assign a value to a variable. It doesn’t return a value from a function call.

A function that returns a value is producing a value **for use by the program**, in particular for use in the part of the code where the function was invoked. Remember that when a function is invoked, control passes to the function, meaning that the function’s code block is executed. But when the function returns, control goes back to the calling location, and a return value may come back with it.

If your function **returns** a value, you can either save that value in a variable or print it. If you do not return a value, then it is lost and you can only print it for humans **inside that function**.

If this distinction is still unclear, read [this very nice explanation](http://interactivepython.org/runestone/static/pip2/Functions/Printvs.return.html). And now, let's practice a bit!

In [None]:
def say_hello(user):
    return "Hello %s" % user

def say_bye(user):
    return "Bye %s" % user

me="Filip"

message = ""
message += say_hello(me)
message += ','
message += say_bye(me)

print(message)

In [None]:
def say_hello(user):
    print("Hello %s" % user) # this should be a return, like above

def say_bye(user):
    print("Bye %s" % user) # this should be a return, like above

me="Filip"

message = ""
message += say_hello(me) # this line concatenates None to a string -> error
message += ','
message += say_bye(me)
print(message)
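To see why the second cell fails, it may help to look at what each kind of function hands back to the caller. The sketch below uses two made-up function names (`greet_with_print` and `greet_with_return`) and stores both results in variables.

```python
def greet_with_print(user):   # hypothetical name, for illustration only
    print("Hello %s" % user)  # shows the text to a human, hands nothing back

def greet_with_return(user):  # hypothetical name, for illustration only
    return "Hello %s" % user  # hands the string back to the caller

printed = greet_with_print("Filip")
returned = greet_with_return("Filip")

print(repr(printed))   # the value the caller received from the printing function
print(repr(returned))  # the value the caller received from the returning function
```

A function without a `return` statement gives back `None`, and `None` cannot be concatenated to a string, which is what the `+=` line in the cell above attempts.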

**E4: Helper functions** Let's write a function that checks whether the user data entered is valid. User data consists of four inputs: username, password, email and work email.

Username and password should be between 7 and 13 characters, and contain only alphanumeric characters.

Email and work email should contain the character '@' and should end with '.com'.

All test cases should be inserted by user input.

In [None]:
def is_email_ok(email): # we can reuse this function for both emails
    if '@' in email and email.endswith('.com'):
        return True
    else:
        return False

def is_string_ok(s): # we can reuse this function for both username and password
    if len(s)>7 and len(s)<13 and s.isalnum():
        return True
    else:
        return False

def check_validity(username, password, email, work_email): # main function for validation
    username_ok=is_string_ok(username)
    password_ok=is_string_ok(password)
    email_ok=is_email_ok(email)
    work_email_ok=is_email_ok(work_email)
    if username_ok and password_ok and email_ok and work_email_ok: # booleans need no '==True'
        return 'yes'
    else:
        return 'no'

validity=check_validity('filip', 'filipfilip', 'filip@gmail.com', 'filip@gmail.com')
print("Are the inputs valid?", validity)

**E5: Keyword parameters** Make the minimum and maximum length of the username and password flexible, so that the caller of the function can specify them. The default for the minimum length is 7, for the maximum it is 13.

In [None]:
# Your code here
def is_email_ok(email): # we can reuse this function for both emails
    if '@' in email and email.endswith('.com'):
        return True
    else:
        return False

def is_string_ok(s, min_length=7, max_length=13): # we can reuse this function for both username and password
    if len(s)>min_length and len(s)<max_length and s.isalnum():
        return True
    else:
        return False

def check_validity(username, password, email, work_email): # main function for validation
    username_ok=is_string_ok(username)
    password_ok=is_string_ok(password, min_length=10, max_length=17)
    email_ok=is_email_ok(email)
    work_email_ok=is_email_ok(work_email)
    if username_ok and password_ok and email_ok and work_email_ok: # booleans need no '==True'
        return 'yes'
    else:
        return 'no'

validity=check_validity('filip', 'filipfilip', 'filip@gmail.com', 'filip@gmail.com')
print("Are the inputs valid?", validity)
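As a small sketch of how keyword parameters behave at the call site: a caller can rely on the defaults, override one of them, or override both by name. The helper `is_length_ok` is made up for this illustration. It checks only the length, and it assumes that the bounds themselves are allowed (`<=`), which is stricter to read but differs from the `>` and `<` comparisons used in the cell above.

```python
def is_length_ok(s, min_length=7, max_length=13):  # hypothetical helper
    """True if min_length <= len(s) <= max_length (bounds included, an assumption)."""
    return min_length <= len(s) <= max_length

# 'filipfilip' has 10 characters, 'filip' has 5.
default_check = is_length_ok('filipfilip')                        # both defaults used
custom_max = is_length_ok('filipfilip', max_length=8)             # only the maximum overridden
custom_both = is_length_ok('filip', min_length=3, max_length=5)   # both overridden by name

print(default_check, custom_max, custom_both)
```

Passing the limits by name keeps the call readable and means the order of the optional arguments does not matter.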

**E6: Nesting functions** What will the following code output?

In [None]:
def square(x):
    return x*x

def g(y):
    a = y + 3
#    print(a)
    
def h(y):
    return square(y) + 3

In [None]:
print(h(2))
# First square 2, then add 3.

In [None]:
print(h(g(3)))
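One way to investigate the nested call is to take it apart and inspect the intermediate value. The sketch below repeats the three functions so that it runs on its own. The names `g_result` and `error_name` are introduced only for this illustration.

```python
def square(x):
    return x * x

def g(y):
    a = y + 3  # computed, but never returned

def h(y):
    return square(y) + 3

g_result = g(3)  # what does g hand back to the caller?

try:
    h(g_result)
    error_name = None
except TypeError as err:  # raised when square() tries to multiply its argument
    error_name = type(err).__name__

print(repr(g_result), error_name)
```

Because `g` has no `return` statement, the inner call evaluates to `None`, and the outer call then fails inside `square`.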

## C) Paths and directories

**E7: Let's play a bit with paths and directories**

**Print the current path and its contents.**

In [None]:
# your code here
import os
print(os.getcwd()) # the current working directory
print(os.listdir(os.getcwd())) # its contents

**Then print the contents of the directory "../Data".**

In [None]:
# your code here
#print(os.listdir("../Data")) # one way (os was imported in the previous cell)

import glob

# another way
for f in glob.glob("../Data/*"): 
    print(f)

**Create a directory "TEST" in "../Data".**

In [None]:
# your code here
os.makedirs("../Data/TEST", exist_ok=True)

## D) Files

**E8: Reading file contents multiple times** Let's open and read the file "../Data/Debate/debate.csv".

In [None]:
with open("../Data/Debate/debate.csv", 'r') as r:
    some_data=r.readlines()
    print(len(some_data)) # Q: what is this? 
    
    # Q: What will this print?
#    some_data=some_data[:100]
#    print(len(some_data))

    # Q: What will this print?
#    some_data=r.readlines()
#    print(len(some_data))
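The questions above can be tried out without the debate file. The sketch below writes a throwaway three-line file (a stand-in for `debate.csv`, with made-up contents) and reads it several times through the same file handle. `seek(0)` moves the read position back to the start of the file.

```python
import os
import tempfile

# A temporary file with three lines stands in for debate.csv.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    tmp_path = tmp.name

with open(tmp_path, 'r') as r:
    first_read = r.readlines()   # reads every line, position is now at the end
    second_read = r.readlines()  # nothing left to read from here
    r.seek(0)                    # jump back to the beginning of the file
    third_read = r.readlines()   # reads every line again

os.remove(tmp_path)  # clean up the temporary file
print(len(first_read), len(second_read), len(third_read))
```

Slicing the list (as in `some_data[:100]`) only changes the list in memory. It does not move the read position in the file.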

**E9: CSV->TSV and data manipulation** Let's read the CSV file, then save it to a TSV file, but without the second column and without the header.

In [None]:
# Some code here
filename="../Data/baby_names/names_by_state/AK.TXT"
new_filename="../Data/baby_names/names_by_state/AK.TSV"
with open(filename, 'r') as r:
    with open(new_filename, 'w') as w:
        some_data=r.readline() # read the first line (the header), so the loop below starts at the data
        some_data=some_data.strip() # this removes the newline at the end of a line
        header=some_data.split(',') 
        header.pop(1) # the header is parsed here but deliberately never written to the TSV
        for line in r:
            line=line.strip()
            row=line.split(',') # this is how to create a list from a CSV line
            row.pop(1) # remove the value for column 2
            tsv_line='\t'.join(row) # this is how to create a TSV line from a list
            w.write(tsv_line + '\n') # write to TSV
        
#    for line2 in r: 
#        print("LINE") # This would iterate over an empty set - we can only iterate a file once!
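The same conversion could also be sketched with the standard `csv` module, which takes care of splitting and joining the columns. To keep the example self-contained it reads from and writes to in-memory text buffers (`io.StringIO`) instead of real files, and the header and sample rows are made up for illustration. With real files, the two `open(...)` handles would take the place of `source` and `target`.

```python
import csv
import io

# Made-up sample input: a header line followed by two data rows.
csv_text = "state,sex,year,name,count\nAK,F,1910,Mary,14\nAK,F,1910,Annie,12\n"

source = io.StringIO(csv_text)  # stands in for the opened CSV file
target = io.StringIO()          # stands in for the opened TSV file

reader = csv.reader(source)  # the default delimiter is a comma
writer = csv.writer(target, delimiter='\t', lineterminator='\n')

next(reader)  # skip the header line
for row in reader:
    del row[1]            # drop the second column
    writer.writerow(row)  # write the remaining values, tab-separated

tsv_text = target.getvalue()
print(tsv_text)
```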

**E10: JSON load and store** Store the following two dictionaries to the "../Data/TEST" directory with the current datetimes as filenames. Then open and load all JSON files. Print the loaded dictionaries.

In [None]:
dict1={'positive': 0.4, 'negative': 0.6}
dict2={'positive': 0.7, 'negative': 0.3}

import json

with open('../Data/TEST/dict1.json', 'w') as f:
    json.dump(dict1, f)
    
with open('../Data/TEST/dict2.json', 'w') as f:
    json.dump(dict2, f)
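The cell above stores the dictionaries under fixed names and does not load them again. One possible way to cover the rest of the exercise is sketched below. It assumes a timestamp format of our own choosing, adds a counter to each filename so that two files written within the same second do not overwrite each other, and writes to a temporary directory that stands in for "../Data/TEST".

```python
import glob
import json
import os
import tempfile
from datetime import datetime

dict1 = {'positive': 0.4, 'negative': 0.6}
dict2 = {'positive': 0.7, 'negative': 0.3}

out_dir = tempfile.mkdtemp()  # stand-in for "../Data/TEST"

# Store: one file per dictionary, named after the current date and time.
for i, d in enumerate((dict1, dict2)):
    stamp = datetime.now().strftime('%Y-%m-%dT%H-%M-%S')
    path = os.path.join(out_dir, '%s_%d.json' % (stamp, i))  # counter avoids name clashes
    with open(path, 'w') as f:
        json.dump(d, f)

# Load: find every JSON file in the directory and read it back.
loaded = []
for path in sorted(glob.glob(os.path.join(out_dir, '*.json'))):
    with open(path, 'r') as f:
        loaded.append(json.load(f))

print(loaded)
```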

## E) Q&A

**Q: I downloaded a module, why do I also have to import it?**

A: In order to import and use a module, you need to first have it installed. After you install it, your Python program can "activate" it by importing it into memory. Real-life analogy: You can think of "download" as notes that you have written down over time, and of "import" as you reading and remembering some of them. Your memory, like the computer memory, is limited and you can only keep a small amount of things loaded/imported.

**Q: And how many times do I have to import a module?**

A: Basically, once per notebook (more precisely, once per kernel session), before you use it for the first time.

**Q: Can we use the CSV module in the assignment?**

A: Yes.

**Q: How do we distinguish between TSV, CSV, and TXT files?**

A: The distinction between CSV, TSV, and a plain TXT file is a matter of agreement on how to structure the text. You can see TSV and CSV as specific cases of a text file in which people have agreed to lay out the text as a table: one row per line, with the columns separated by commas (CSV) or tabs (TSV). Because of this, we *open* TXT, CSV and TSV files in the same way, but we *read (or write)* their contents differently.

**Q: How should Python know to open a file as TSV? Would it be enough to add .tsv to the end of the filename?**

A: How does Python know whether a file is TSV, CSV, or something else? There is no bulletproof answer here. In many cases we can rely on the extension, which is often correct (.tsv for TSV files, .csv for CSV files, .txt for unstructured text files), and using the right extension is good practice.
But people do not always use the right extension. A more reliable way is to know what kind of data you are dealing with before you write the program, so print a line or two from the file to work out what format it is in.
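Looking at a line by eye can be backed up with a rough check in code. The helper below, `guess_delimiter`, is a made-up name and a simple heuristic, not a reliable detector: it counts tabs and commas in one line and reports whichever occurs at least as often.

```python
def guess_delimiter(first_line):  # hypothetical helper, heuristic only
    """Guess whether a line is tab- or comma-separated; None if neither appears."""
    tabs = first_line.count('\t')
    commas = first_line.count(',')
    if tabs == 0 and commas == 0:
        return None  # looks like plain, unstructured text
    return '\t' if tabs >= commas else ','

print(repr(guess_delimiter("AK\tF\t1910\tMary\t14")))
print(repr(guess_delimiter("AK,F,1910,Mary,14")))
print(repr(guess_delimiter("just some plain text")))
```

A guess like this can be wrong, for example when a TSV file contains commas inside its text fields, so it should support and not replace knowing your data.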


## F) Assignment: some notes
* Testing functions: don't worry if a specific file is missing or corrupted. Testing with 9 files is as good as testing with 10. It is more important to cover all kinds of weird (edge) cases in your functions.
* The assignment is somewhat story-like, so the order of the exercises does not follow the order of the chapters.
* Edge cases
* Hints suggest one way to solve the exercise - feel free to ignore them if you have a different idea - at this point it is important that you can find a way to solve challenges.
* Today's class notes (especially exercises 5 and 9) are **very helpful** for the assignment.
* Example of TSV: https://github.com/cltl/python-for-text-analysis/blob/master/Data/baby_names/names_by_state/AK.tsv
* Example of CSV: many, e.g. https://github.com/cltl/python-for-text-analysis/blob/master/Data/baby_names/names_by_state/AK.TXT
* Sentiment analysis - you don't have to look online or in the paper - just look at the example code
* Downloading is done to python, not to disk
* 7a: text_to_lemmas() - lemmatization code from chapter 15
* Extension to Saturday 23:59h