We'll be covering text classification and regression methods over the next month; in preparation for this topic, your assignment is to gather labeled data to use for your analysis.

* Find at least 300 documents for some topic that interests you, along with a single binary label for each document.  Aim high if you can; the more data in your collection, the better your classification models will tend to perform on it.

* Split your data into three non-overlapping files (train.tsv, dev.tsv and test.tsv), with train.tsv containing 80% of the documents, dev.tsv 10% and test.tsv 10%.

* All of the data must be in a common format; we'll use a tab-separated format with the label in the first column and the full text in the second column. Replace all newlines in the text with \_NEWLINE\_ and tab characters with \_TAB\_.

See data/text_classification_sample/ for an example.  Execute the Jupyter notebook 4.classification/CheckData_TODO.ipynb to verify that your format is correct.
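
The formatting and splitting steps above can be sketched as follows. This is a minimal illustration, not a required implementation: it assumes your documents are already in memory as `(label, text)` pairs, and that the placeholder tokens are written literally as `_NEWLINE_` and `_TAB_`. The helper names `escape` and `write_split`, the fixed random seed, and the UTF-8 encoding are all assumptions made for the example.

```python
import os
import random

def escape(text):
    # Normalize Windows/old-Mac line endings, then swap in the placeholder tokens
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\t", "_TAB_").replace("\n", "_NEWLINE_")

def write_split(docs, directory, seed=0):
    # docs: list of (label, text) pairs. Shuffle a copy so the split is
    # random but reproducible for a given seed.
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_train = len(docs) * 8 // 10   # 80% train
    n_dev = len(docs) // 10         # 10% dev; the remainder goes to test
    splits = {
        "train": docs[:n_train],
        "dev": docs[n_train:n_train + n_dev],
        "test": docs[n_train + n_dev:],
    }
    os.makedirs(directory, exist_ok=True)
    for name, rows in splits.items():
        path = os.path.join(directory, "%s.tsv" % name)
        with open(path, "w", encoding="utf-8") as out:
            for label, text in rows:
                # Label in the first column, escaped text in the second
                out.write("%s\t%s\n" % (label, escape(text)))
    return {name: len(rows) for name, rows in splits.items()}
```

Because the slices are taken from one shuffled list, no document can appear in more than one file. Shuffling before slicing also keeps any ordering in the original collection (for example, all documents of one label first) from leaking into the split.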

Your choice of documents and labels is completely up to you (except for any data already used in class in the data/ folder).  Possible sources of data:

* Project Gutenberg.  Metadata is available at this [GitHub repo](https://github.com/hugovk/gutenberg-metadata) along with URLs for the texts.  Labels here can be author, subject, author gender, etc.

* Crawl news articles from different domains (e.g., CNN, FoxNews); the label for each article is the domain.

* [Movie summary data](http://www.cs.cmu.edu/~ark/personas/).  Labels here can be any categorical metadata aspect (genre, release date); note real-valued metadata (like box office, runtime) can be binarized by selecting some threshold.

* [Download your own tweets](https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive).  Labels here can be any categorical metadata included in the tweet, or labels you add by hand (e.g., sarcasm).
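
Binarizing real-valued metadata by a threshold, as mentioned for the movie summary data, might look like the sketch below. The function name `binarize`, the median as the default threshold, and the runtimes are all hypothetical choices made for illustration; any threshold that gives you a meaningful label is fine.

```python
import statistics

def binarize(values, threshold=None):
    # Map each real value to 1 if it is strictly above the threshold, else 0.
    # With no threshold given, fall back to the median so that the two
    # classes come out roughly balanced (an assumption, not a requirement).
    if threshold is None:
        threshold = statistics.median(values)
    return [1 if v > threshold else 0 for v in values]

# Hypothetical runtimes in minutes, for illustration only
runtimes = [88, 95, 102, 121, 140, 175]
labels = binarize(runtimes)
```

A median split is only one option; a threshold with a natural interpretation (say, films longer than two hours) may produce a more interesting classification problem at the cost of less balanced classes.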


In [None]:
import sys
from collections import Counter

In [None]:
def test(directory):
    for split in ["train", "dev", "test"]:
        filename="%s/%s.tsv" % (directory, split)
        with open(filename) as file:
            labelCounts=Counter()
            zeroLength=0
            total=0
            for line in file:
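                # Strip only the line terminator: a bare rstrip() would also remove
                # a trailing tab, so an empty text column would raise an IndexError
                cols=line.rstrip("\n").split("\t")
                label=cols[0]
                # A line with no tab has no text column; treat it as zero length
                text=cols[1] if len(cols) > 1 else ""
                if len(text.strip()) == 0:
                    zeroLength+=1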
                total+=1

                labelCounts[label]+=1

            print ("File: %s, Total docs: %s, Total zero length: %s" % (filename, total, zeroLength))
            for label in sorted(labelCounts):
                print ("\t%s %s" % (label, labelCounts[label]))
            print()

Q1: Describe your data.  What is the source of the documents, and what do the labels mean?

Q2: Change the directory name below to the directory containing your data and execute the `test()` function above to verify that the data is in the correct format:

In [None]:
directory="../data/text_classification_sample_data"
test(directory)