# Data Collection

This tutorial gives some examples of how to fetch and handle different types of data.

## Import required packages

In [None]:
import time
import datetime

## Files

### Simple text files

In [None]:
sample_file_name = 'data/sample-text-files/sample-text-file-1000kb.txt'

documents = []
with open(sample_file_name) as f:
    for line in f:
        line = line.strip()
        if line != '': # Ignore empty lines
            documents.append(line)
            
print("The file {} contains {} documents.".format(sample_file_name, len(documents)))
print()
print("This is the last document:")
print(documents[-1])

### CSV/TSV files

In principle, CSV/TSV (comma-separated/tab-separated values) files are also just text files. As such, one can use the approach from above to read them. However, the structured nature of CSV/TSV files quickly leads to annoying issues:

In [None]:
reviews_file_name = 'data/reviews/yelp-reviews-mon-ami-gabi.csv'

with open(reviews_file_name) as f:
    for idx, line in enumerate(f):
        if idx == 0: # We want to ignore the header
            continue
        line = line.strip()
        if line != '':
            review_nr, review_text = line.split(',') # Uh-oh... can you spot the problem? This will most likely raise a ValueError
            print(review_text)
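
To make the problem concrete, here is a minimal sketch using a made-up two-row sample (not taken from the actual reviews file). It assumes the review text is quoted whenever it contains a comma, which is the usual CSV convention. Python's built-in `csv` module is used here only to show what a quote-aware parser does differently from a plain `split(',')`.

```python
import csv
import io

# Made-up sample data (an assumption for illustration only):
# the first review is quoted because its text contains a comma.
sample_csv = 'review_nr,review\n1,"Great food, great service!"\n2,Nice view\n'

# Naive splitting cuts the quoted review text into two pieces
naive_fields = sample_csv.splitlines()[1].split(',')
print(len(naive_fields))  # more fields than the two columns we expect

# The csv module respects the quotes and keeps the review text in one field
rows = list(csv.reader(io.StringIO(sample_csv)))
print(rows[1])
```

With the naive split, unpacking into `review_nr, review_text` fails as soon as a review contains a comma, whereas the quote-aware reader always yields one value per column.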

Since handling CSV/TSV files is a very common task, there are already powerful Python packages available that make life so much easier. `pandas` is a very popular package for handling structured files such as CSV/TSV files.

In [None]:
import pandas as pd

`pandas` uses the notion of data frames (commonly abbreviated as `df`) to represent data objects. The `read_csv()` function is documented here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

In [None]:
# The same call with the default separator and quote character made explicit:
# df = pd.read_csv(reviews_file_name, sep=',', quotechar='"', encoding='ISO-8859-1')
df = pd.read_csv(reviews_file_name, encoding='ISO-8859-1')

df.head(n = 10)

In [None]:
# Extract list of reviews from data frame
reviews = df['review'].tolist()

print("The file {} contains {} reviews.".format(reviews_file_name, len(reviews)))
print()
print("This is the last review:")
print(reviews[-1])

## Online news article

This example addresses online content. Handling "raw" websites is usually a bit annoying since the content is not plain text but HTML. For simplicity, we use the package `newspaper`, which helps to fetch the content of online news articles.
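
To illustrate why raw HTML is awkward to work with, here is a minimal sketch that strips the tags from a made-up HTML snippet using only the standard library. The `TextExtractor` class and the sample markup are hypothetical and only serve as an illustration; this is not how `newspaper` works internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text between tags and drops the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; skip whitespace-only pieces
        text = data.strip()
        if text:
            self.chunks.append(text)


# Made-up HTML snippet standing in for a downloaded article page
sample_html = ('<html><body><h1>Storm nears</h1>'
               '<p>Heavy rain is <b>expected</b> tonight</p></body></html>')

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()

plain_text = ' '.join(extractor.chunks)
print(plain_text)
```

Real pages also contain menus, scripts, and advertisements, so simply dropping the tags is rarely enough. Identifying the actual article text is the part that a dedicated package takes care of.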

In [None]:
from newspaper import Article

Feel free to copy and paste different news article URLs. Note that the package does not work with all news websites; however, it works just fine with straitstimes.com.

In [None]:
url = 'http://www.straitstimes.com/asia/east-asia/now-its-japans-turn-to-brace-for-a-monster-storm-as-typhoon-lan-nears'
article = Article(url)

The methods `download()` and `parse()` do the actual fetching and processing of the news article.

In [None]:
article.download()
article.parse()

In [None]:
print("Authors:", article.authors, "\n")
print("Publication date:", article.publish_date, "\n")
print("Title:", article.title, "\n")
print("Main text:", article.text, "\n")
print("Top image link:", article.top_image, "\n")
print("Video links:", article.movies, "\n")

The `newspaper` package comes with some additional functions to extract keywords and generate a summary for a news article.

In [None]:
article.nlp()
print("Keywords:", article.keywords, "\n")
print("Summary:", article.summary, "\n")

## Tweets

Twitter provides an API (Application Programming Interface) that allows fetching public tweets. The `twython` package is a wrapper for this API that simplifies this task in Python.

In [None]:
from twython import Twython, TwythonError

Accessing the API requires credentials. This in turn requires a Twitter account and some further configuration. If you don't have or don't want a Twitter account, that's no problem. This tutorial is only supposed to show how simple the task of fetching tweets is. It won't be required for the other tutorials.

In [None]:
APP_KEY = '' 
APP_SECRET = '' 
OAUTH_TOKEN = '' 
OAUTH_TOKEN_SECRET = ''

In [None]:
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

Among other calls, the Twitter API allows searching for tweets using keywords. You can also specify that you're not interested in retweets.

In [None]:
try:
    search_results = twitter.search(q='"orchard road" -filter:retweets', count=20)
except TwythonError as e:
    print(e)

Each tweet comes with a plethora of information. In the following, we are only interested in the user name, the date and time the tweet was posted, and the tweet text itself.

In [None]:
for tweet in search_results['statuses']:
    # Ignore non-English tweets
    language = tweet['lang']
    if language != 'en':
        continue
    # Extract the basic information about the tweet
    screen_name = tweet['user']['screen_name']
    created_at = tweet['created_at']
    tweet_text = tweet['text']
    # Simple way to remove line breaks and tabs: string to list and back to string again
    tweet_text = ' '.join(tweet_text.split())
    # Twitter returns the time as string of the form "Wed Jan 24 10:37:57 +0000 2018"; let's simplify this
    created_at = time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(created_at,'%a %b %d %H:%M:%S +0000 %Y'))
    # Print each tweet with publication date, the screen name of the user, and the actual text of the tweet
    print('[{}] @{} wrote: {}'.format(created_at, screen_name, tweet_text))
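
The timestamp conversion in the loop above hard-codes the `+0000` offset in the format string. As an alternative, here is a minimal sketch using the `datetime` module, whose `%z` directive parses the UTC offset instead. The sample string is the one quoted in the code comment above; the variable names are made up for this illustration.

```python
from datetime import datetime, timezone

# Example timestamp in the format quoted in the comment above
raw_created_at = 'Wed Jan 24 10:37:57 +0000 2018'

# %z parses the UTC offset, so the resulting datetime is timezone-aware
parsed = datetime.strptime(raw_created_at, '%a %b %d %H:%M:%S %z %Y')

# Normalize to UTC and format in the same simplified style as in the loop
formatted = parsed.astimezone(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
print(formatted)
```

Keeping the offset in the parsed value makes it possible to convert the timestamp to another time zone later, which the plain `time.strptime` approach does not directly support.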