# Text I-O and JSON

## *'Anything goes in/ Anything goes out/ Fish, bananas, old pyjamas/ Mutton, beef and Trout!'*
*– User manual for first PC*

# Input/Output (IO)

**Reading** from input and **writing** to output allows us to retrieve or store information.

## Input

### Reading from files

Python makes it very easy to read from files. Calling `open()` on a file name will give you an iterator over the lines in a file.

In [None]:
num_lines = 0
for line in open('../data/tweets.txt', 'r'): # this is BAD coding, don't repeat
    num_lines += 1
print("The file has {} lines".format(num_lines))

However, this does not close the file, and it is better to make sure we do. We use the `with` construct for that. The outer level simply opens the file, gives it a name, and closes it when the block is done. The inner loop goes over the lines of the file.

In [None]:
num_lines = 0
with open('../data/tweets.txt', encoding='utf-8') as input_file:
    for line in input_file:
        num_lines += 1
print("The file has {} lines".format(num_lines))
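
To see that `with` really closes the file for us, we can check the `closed` attribute of the file object inside and outside the block. This is only a small sketch: it writes its own throwaway file (the name `example.txt` is made up for the illustration) so that it does not depend on `tweets.txt`.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on tweets.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w', encoding='utf-8') as output_file:
        output_file.write('first line\nsecond line\n')

    with open(path, encoding='utf-8') as input_file:
        num_lines = sum(1 for line in input_file)
        print(input_file.closed)  # still inside the block: the file is open

    print(input_file.closed)      # outside the block: the file has been closed
    print("The file has {} lines".format(num_lines))
```

The variable `input_file` still exists after the block, but the file behind it is closed, so we cannot read from it any more.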

## Activity
* count the number of words in `tweets.txt`

In [None]:
# your code here


### Reading user input
Sometimes, we want to give our user the chance to input something (hint: *very* useful to label data). We can simply do this via `input()`. If we give it an argument, we can write a message to the prompt:

In [None]:
user_says = input('what ')
print(user_says, type(user_says))

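NB: the return value of `input()` is always a `str`. If you want something else, you need to either cast it (e.g., with `int()`), or use `eval()`, which interprets the input as Python code. Be careful: `eval()` executes whatever the user types, so only use it on input you trust.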

In [None]:
user_says = input('Try typing an int or list: ')
print(type(eval(user_says)))
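
If we only expect plain values (numbers, strings, lists, and so on), a more cautious option is `ast.literal_eval()` from the standard library, which accepts Python literals but refuses to run arbitrary code. The sketch below uses fixed strings instead of `input()` so that it runs without a user; the helper name `parse_literal` is made up for this example.

```python
import ast

def parse_literal(text):
    """Turn a string such as '42' or '[1, 2]' into the Python value it spells out."""
    return ast.literal_eval(text)

for typed in ['42', '[1, 2, 3]', "'hello'"]:
    value = parse_literal(typed)
    print(value, type(value))

# Anything that is not a plain literal is rejected instead of being executed.
try:
    parse_literal("__import__('os').getcwd()")
except ValueError as error:
    print('rejected:', error)
```

In a real session, we would pass the result of `input()` to `parse_literal()` and catch `ValueError` and `SyntaxError` for input that is not a valid literal.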

To prevent empty input errors, or to check the user makes a valid choice, use a `while` loop:

In [None]:
must_be_int = None
while not must_be_int or must_be_int not in {'1', '2'}:
    must_be_int = input('Type 1 or 2: ')
must_be_int = int(must_be_int)  # a plain cast is enough: we know it is '1' or '2'
print(must_be_int, type(must_be_int))
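
If we need this kind of check in several places, we can wrap the loop in a small function. The following is one possible sketch, not the only way to do it: the name `ask_choice` and its `read` parameter are made up for this example. Passing the reading function as an argument lets us simulate a user with a list of canned answers, so the cell runs without anyone typing.

```python
def ask_choice(prompt, valid_choices, read=input):
    """Keep asking until the answer is one of valid_choices, then return it."""
    answer = None
    while answer not in valid_choices:
        answer = read(prompt)
    return answer

# Simulate a user who first types nothing, then '7', then '2'.
fake_answers = iter(['', '7', '2'])
choice = int(ask_choice('Type 1 or 2: ', {'1', '2'},
                        read=lambda prompt: next(fake_answers)))
print(choice, type(choice))
```

With a real user, we would simply leave out the `read` argument, and the function falls back to `input()`.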

## Output

### User Output

Output to the user is our good old friend `print()`. 

### File Output

File output allows us to use the same Python objects in different programs/sessions/computers. It works almost like file input, with three differences:
1. we need to specify write mode by giving `open()` the string argument `'w'`
2. we use the `write()` command to write to the file
3. we need to end every output line with a newline character `\n`

In [None]:
with open('../data/silly_test_file.txt', 'w', encoding='utf-8') as output_file:
    output_file.write('This is the first line\n')
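
The third point is easy to forget, so here is a small sketch that writes several lines in a loop and then looks at the raw content of the file. It uses a temporary directory and a made-up file name (`silly_example.txt`) so that it does not touch the data folder.

```python
import os
import tempfile

lines = ['This is the first line', 'This is the second line', 'This is the third line']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'silly_example.txt')

    # write() adds nothing on its own, so we append '\n' to each line ourselves
    with open(path, 'w', encoding='utf-8') as output_file:
        for line in lines:
            output_file.write(line + '\n')

    # read() returns the whole file as one string, newlines included
    with open(path, encoding='utf-8') as input_file:
        content = input_file.read()

print(repr(content))
```

Without the `+ '\n'`, all three sentences would end up glued together on a single line.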

## Activity

* Open the file `silly_test_file.txt` and print all the lines in it

In [None]:
# your code here


# JSON

JSON is a file format that allows us to write Python objects (rather than just strings) to files and read them back. This is a great way to save your progress or to store a model.

However, note that dictionary keys become strings, and that it cannot store "special" data types (`defaultdict`, `DataFrame`, etc.).

We need to import the `json` library first

In [None]:
import json
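
The caveat about dictionary keys and data types is easy to check without any file. The sketch below uses `json.dumps()` and `json.loads()`, which work like `dump()` and `load()` but on strings; the dictionary `original` is made up for this example.

```python
import json

original = {1: 'one', 2: ('a', 'b')}

# dumps()/loads() do the same as dump()/load(), but with a string instead of a file
restored = json.loads(json.dumps(original))
print(restored)   # integer keys are now strings, the tuple is now a list

# Types without a JSON counterpart, such as sets, are rejected with a TypeError
try:
    json.dumps({'words': {'fish', 'bananas'}})
except TypeError as error:
    print('cannot store this:', error)
```

So if the keys need to be integers again after loading, we have to convert them back ourselves.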

# JSON output

In order to save a Python object to a file, we only need the function `dump()` from `json`. It takes two arguments:
1. the Python object to write to file
2. a **file handle**, i.e., the object returned by `open(<FILENAME>, 'w')`

You can call JSON files whatever you want, but it is common to use the ending `.json`
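
As a small sketch of the two arguments in action, the following cell stores a toy dictionary and then prints the raw text that ended up in the file. The dictionary `word_counts` and the file name `word_counts.json` are made up for this example, and the file lives in a temporary directory.

```python
import json
import os
import tempfile

word_counts = {'fish': 3, 'bananas': 1, 'pyjamas': 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'word_counts.json')

    # first argument: the object to store; second argument: the open file handle
    with open(path, 'w', encoding='utf-8') as json_out:
        json.dump(word_counts, json_out)

    # a JSON file is plain text, so we can look at it like any other text file
    with open(path, encoding='utf-8') as text_in:
        raw_text = text_in.read()

print(raw_text)
```

The stored text looks very much like the Python dictionary we started with, which is what makes JSON files easy to inspect by hand.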

## Activity

* create a dictionary `line_length`
* open the file `tweets.txt`
* use `line_length` to map each line in the file from its line number to its length in characters
* save `line_length` to a file "`lineinfo.json`"

In [None]:
# your code here


# JSON input

To retrieve a Python object from a file, we use the function `load()` from `json`. It only takes a **file handle**, i.e., the object returned by `open(<FILENAME>)`.

In [None]:
with open('lineinfo.json') as json_in:
    info_from_file = json.load(json_in)
print(list(info_from_file.items())[:5])

# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations (Bag of words)

The easiest way to represent features is as counts of all the words in the text. It takes two steps:
1. collect the counts for each word
2. transform the individual counts into one big matrix

The result is a matrix $X$ with one row for each instance, and one column for each word in the vocabulary.

![Bag of words procedure](bow.png)

We can use the `CountVectorizer` object to get the frequency of each word:

In [None]:
import pandas as pd
df = pd.read_csv('../data/reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:2])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
small_vectorizer = CountVectorizer()

sentences_2 = documents[:1]

X1 = small_vectorizer.fit_transform(sentences_2)

Let's implement this ourselves:

In [None]:
import numpy as np # to deal with linear algebra

num_docs = 1

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split()
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary)

# create a data matrix with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary)))

# fill that matrix with sweet counts
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

In [None]:
vocabulary_ = {word: position for position, word in enumerate(vocabulary)}
vocabulary_

The result is a *sparse count matrix*:

In [None]:
# indexed representation
import numpy as np
print(X1)

# dense representation
print(X1.todense())

We can access the mapping from vector position to feature names via `get_feature_names_out()` (called `get_feature_names()` in older versions of scikit-learn):

In [None]:
print(small_vectorizer.get_feature_names_out())

The inverse (the mapping from feature names to vector positions) is encoded as a dictionary in `vocabulary_`:

In [None]:
print(small_vectorizer.vocabulary_)

## Terminology 

![](matrix.pdf)

Let's redo this for the entire corpus:

In [None]:
vectorizer = CountVectorizer(analyzer='word', 
                             ngram_range=(1, 2), 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

X = vectorizer.fit_transform(documents[:10000])

print(X.shape)

Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.

In [None]:
vectorizer.transform([documents[-1]])

In [None]:
documents[-1]
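
To make the "unseen words are ignored" behavior concrete, here is a plain-Python sketch of the idea. It is a simplification for illustration, not how scikit-learn implements it, and the names `fit_vocabulary`, `toy_transform`, and `toy_vocabulary` are made up for this example.

```python
def fit_vocabulary(training_documents):
    """Map every word seen during fitting to a column position."""
    words = sorted({word for document in training_documents
                    for word in document.lower().split()})
    return {word: position for position, word in enumerate(words)}

def toy_transform(document, vocabulary):
    """Count only the words that are in the vocabulary; skip everything else."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

toy_vocabulary = fit_vocabulary(['the food was good', 'the service was slow'])
print(toy_vocabulary)
print(toy_transform('the pizza was good good', toy_vocabulary))
```

The word `pizza` never occurred during fitting, so it has no column and simply does not show up in the resulting vector.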

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [None]:
# your code here


## Character $n$-grams

We can also use characters to analyze text:

In [None]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

In [None]:
print(char_vectorizer.vocabulary_)
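
To get a feeling for what a character $n$-gram is, we can extract them by hand with string slicing. This is only a bare-bones sketch: the function name `char_ngrams` is made up, and it skips the filtering by `min_df` and `max_df` that the vectorizer above applies.

```python
from collections import Counter

def char_ngrams(text, min_n, max_n):
    """Return all character n-grams of length min_n to max_n (inclusive)."""
    return [text[start:start + n]
            for n in range(min_n, max_n + 1)
            for start in range(len(text) - n + 1)]

print(char_ngrams('banana', 2, 3))

toy_counts = Counter(char_ngrams('banana', 2, 3))
print(toy_counts.most_common(3))
```

Even a six-letter word yields nine $n$-grams for $n$ between 2 and 3, which explains why character vocabularies grow quickly with larger ranges.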

## Exercise
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10.

In [None]:
# your code here

## TF-IDF

Let's extract the most important words from Moby Dick

In [None]:
with open('../data/moby_dick.txt', encoding='utf8') as input_file:
    documents = [line.strip() for line in input_file]
print(documents[1])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english', sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)
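
To see what kind of number ends up in such a matrix, here is a plain-Python sketch of one common TF-IDF variant on three toy documents. We assume a sublinear term frequency, $1 + \ln(\text{count})$, a smoothed inverse document frequency, $\ln\frac{1 + n}{1 + \text{df}} + 1$, and rows scaled to unit length. To our understanding this matches the settings used above, but treat that as an assumption and check the scikit-learn documentation for the details. All names (`toy_documents`, `doc_freq`, `idf`, `tfidf`) are made up for this example.

```python
import math
from collections import Counter

toy_documents = ['the whale the sea', 'the ship', 'the whale']
tokenized = [document.split() for document in toy_documents]
n_docs = len(tokenized)

# document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency (assumed variant, see text)."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """Weight each word of one document and scale the result to unit length."""
    counts = Counter(tokens)
    weights = {word: (1 + math.log(count)) * idf(word)
               for word, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {word: weight / norm for word, weight in weights.items()}

for tokens in tokenized:
    print({word: round(weight, 3) for word, weight in tfidf(tokens).items()})
```

The word `the` occurs in every toy document and therefore gets the lowest possible IDF, while the rarer `ship` is weighted up.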

Now, let's get the same information as raw counts:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')

X2 = vectorizer.fit_transform(documents)

In [None]:
X.shape, X2.shape

In [None]:
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [None]:
df = df.sort_values(['tfidf', 'tf', 'idf'], ascending=False)  # most important words first
df

## PMI
Extracting PMI from text is relatively straightforward, and `nltk` offers several functions to do so flexibly.

In [None]:
import nltk
nltk.download('stopwords')  # the only resource the next cell needs

In [None]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

stopwords_ = set(stopwords.words('english'))

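# compare the lowercased word, so that capitalized stopwords are removed as well
words = [word.lower() for document in documents for word in document.split() 
         if len(word) > 2 
         and word.lower() not in stopwords_]
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like  # a variant of mutual information; use bgm.pmi for standard PMI
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}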


In [None]:
Counter(collocations).most_common(20)
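
To see what such a score measures, here is a plain-Python sketch of the textbook definition, $\text{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$, on a toy sentence. It estimates the probabilities from simple relative frequencies, which may differ in detail from what `nltk` does internally, and all names (`toy_words`, `word_counts`, `bigram_counts`, `toy_pmi`) are made up for this example.

```python
import math
from collections import Counter

toy_words = 'moby dick saw the white whale and the white whale saw moby dick'.split()

word_counts = Counter(toy_words)
bigram_counts = Counter(zip(toy_words, toy_words[1:]))
num_words = len(toy_words)
num_bigrams = num_words - 1

def toy_pmi(first, second):
    """log2 of how much more often the pair occurs than expected by chance."""
    p_pair = bigram_counts[(first, second)] / num_bigrams
    p_first = word_counts[first] / num_words
    p_second = word_counts[second] / num_words
    return math.log2(p_pair / (p_first * p_second))

print(round(toy_pmi('moby', 'dick'), 2))
print(round(toy_pmi('saw', 'the'), 2))
```

The pair `moby dick` scores higher than `saw the`, because the two words occur together more often than their individual frequencies would suggest.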

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here