## Analysis of Cryptocurrency News Articles

### Imports

In [20]:
import pandas as pd
import numpy as np
import re
import requests
import csv

from bs4 import BeautifulSoup
from newspaper import Article

### Data set

Using a keyword search for "cryptocurrency", I collected links to 80+ articles published in the New York Times in the last 12 months. The search result pages could not be scraped automatically, so I saved each page as an HTML file named "nytimes-cryptocurrency-1.html", "nytimes-cryptocurrency-2.html", and so on. I then used BeautifulSoup to parse these files and extract the links to the individual articles.

In [32]:
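urls = []
for num in range(1, 10):  # saved result pages 1 through 9
    file = "nytimes-cryptocurrency-" + str(num) + ".html"
    with open(file) as fh:  # close each file once it has been parsed
        soup = BeautifulSoup(fh, "html.parser")

    # keep every absolute https link on the page, including site navigation links
    for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
        urls.append(link.get('href'))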

In [34]:
with open('nytimes-cryptocurrency-urls.csv', 'w', newline='') as file:
    wr = csv.writer(file, quoting=csv.QUOTE_ALL)
    wr.writerow(urls)  # all URLs are written as one quoted row

### Corpus

In [40]:
articles_info = []

for link in urls:
    article_dict = {}
    article_dict["link"] = link
    article = Article(link)
    article.download()

    try:
        article.parse()  # raises if the download did not succeed
        article_dict["text"] = article.text
        article_dict["title"] = article.title
        article_dict["date"] = article.publish_date
        article.nlp()  # computes the keywords and summary
        article_dict["keywords"] = article.keywords
        article_dict["summary"] = article.summary
    except Exception:  # on any failure, record every field as missing
        article_dict["text"] = np.nan
        article_dict["title"] = np.nan
        article_dict["date"] = np.nan
        article_dict["keywords"] = np.nan
        article_dict["summary"] = np.nan
        
    articles_info.append(article_dict)

corpus = pd.DataFrame(articles_info)
corpus.head()

You must `download()` an article first!
(message printed 18 times in total)


| | date | keywords | link | summary | text | title |
|---|---|---|---|---|---|---|
| 0 | NaT | [world, sent, tip, state, states, board, lawye... | https://www.nytimes.com/ | The Virginia State Board of Elections said it ... | The Virginia State Board of Elections said it ... | Breaking News, World News & Multimedia |
| 1 | NaT | [world, sent, tip, state, states, board, lawye... | https://query.nytimes.com/ | The Virginia State Board of Elections said it ... | The Virginia State Board of Elections said it ... | Breaking News, World News & Multimedia |
| 2 | 2017-12-14 | [value, financials, reading, york, trading, bl... | https://www.nytimes.com/reuters/2017/12/14/tec... | (Reuters) - Shares of Siebert Financial Corp, ... | (Reuters) - Shares of Siebert Financial Corp, ... | Brokerage Siebert Financial's Shares Double on... |
| 3 | 2017-08-03 | [digital, market, internet, works, pension, vi... | https://www.nytimes.com/2017/08/03/style/what-... | It’s weird to say that owning cryptocurrency s... | Unlike previous generations, many of these gre... | Grandpa Had a Pension. This Generation Has Cry... |
| 4 | 2017-08-03 | [digital, market, internet, works, pension, vi... | https://www.nytimes.com/2017/08/03/style/what-... | It’s weird to say that owning cryptocurrency s... | Unlike previous generations, many of these gre... | Grandpa Had a Pension. This Generation Has Cry... |


In [57]:
print(corpus.shape)
corpus_df = corpus.drop_duplicates(['link'])
print(corpus_df.shape)

(295, 6)
(90, 6)
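
The preview above includes rows for site pages such as `https://www.nytimes.com/`, which are not articles. One possible way to drop such links before downloading would be to keep only URLs whose path contains a publication date. The following is only a sketch and not part of the original analysis: the `ARTICLE_PATH` pattern, the helper names `looks_like_article` and `unique_article_urls`, and the sample URLs are all illustrative, and the date-in-path convention is inferred solely from the article links shown in the preview.

```python
import re

# Assumption: article URLs carry a /YYYY/MM/DD/ date in their path,
# optionally after one section segment such as "reuters/".
ARTICLE_PATH = re.compile(
    r"^https://www\.nytimes\.com/(?:[a-z]+/)?\d{4}/\d{2}/\d{2}/"
)


def looks_like_article(url):
    """Return True if the URL matches the assumed dated-article pattern."""
    return ARTICLE_PATH.match(url) is not None


def unique_article_urls(urls):
    """Keep dated article URLs only, dropping repeats but preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if looks_like_article(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Placeholder URLs for illustration only; they are not real article links.
sample_urls = [
    "https://www.nytimes.com/",
    "https://query.nytimes.com/",
    "https://www.nytimes.com/2017/08/03/style/placeholder-article.html",
    "https://www.nytimes.com/2017/08/03/style/placeholder-article.html",
    "https://www.nytimes.com/reuters/2017/12/14/technology/placeholder-wire-story.html",
]
print(unique_article_urls(sample_urls))
```

Filtering before the download loop would also avoid fetching the same page several times, since duplicates are removed up front rather than with `drop_duplicates` afterwards.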


In [83]:
# TfidfVectorizer's fit_transform method requires str inputs;
# articles that failed to download (NaN) become the string "nan"
corpus_text = [str(text) for text in corpus_df['text']]

### Feature Extraction

In [94]:
from sklearn.feature_extraction.text import TfidfVectorizer
n_features = 50

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=n_features, min_df=4, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(corpus_text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
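
To make the weighting step less opaque, here is a minimal pure-Python sketch of tf-idf with document-frequency filtering. It follows the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, but it is a simplification: the function name `tfidf`, the whitespace tokenizer, and the toy documents are assumptions made for illustration, and stop-word removal and `max_features` are left out.

```python
import math
from collections import Counter


def tfidf(docs, min_df=1, max_df=1.0):
    """Return (vocabulary, rows) with L2-normalized tf-idf weights.

    min_df is an absolute document count and max_df a proportion of
    documents, matching how the notebook passes them.
    """
    tokenized = [doc.lower().split() for doc in docs]  # simplistic tokenizer
    n_docs = len(tokenized)

    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(
        term for term, count in df.items()
        if min_df <= count <= max_df * n_docs
    )

    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid 0 division
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy documents, invented for illustration.
docs = [
    "bitcoin price rises",
    "bitcoin futures trading",
    "bitcoin tax rules",
    "tax rules change",
]
vocab, rows = tfidf(docs, min_df=2, max_df=0.95)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

With `min_df=2`, terms that occur in a single toy document are dropped from the vocabulary, which is the same effect `min_df=4` has on rare terms in the real corpus.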

### Topic Modeling

I will use [Non-Negative Matrix Factorization](http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py) (NMF) to discover the main topics. Each topic is then represented by a list of its highest-weighted terms.

In [98]:
from sklearn.decomposition import NMF

n_topics = 10
print("Fitting the NMF model with tf-idf features for {} samples and {} features".format(len(corpus_text), len(tfidf_feature_names)))
# `alpha` was removed in scikit-learn 1.2; newer versions use `alpha_W` / `alpha_H` instead
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)


Fitting the NMF model with tf-idf features for 90 samples and 50 features
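
As a rough illustration of what the factorization does, the sketch below approximates a small non-negative document-term matrix V by the product of two non-negative matrices W (documents by topics) and H (topics by terms). It uses the classic multiplicative update rules rather than the coordinate descent solver that scikit-learn's `NMF` uses by default, and it has no regularization, so it is not a reimplementation of the call above. The helper names, the toy matrix, and the iteration count are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize V ~ WH with multiplicative updates (Frobenius loss)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # strictly positive random start so the updates are well defined
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(n_components)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_components)] for i in range(n)]
    return w, h


# Toy matrix: rows are documents, columns are terms; two clear term groups.
V = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
]
W, H = nmf(V, n_components=2)
print("reconstruction error:", frobenius_error(V, W, H))
for topic in H:
    print([round(x, 3) for x in topic])
```

Each row of H plays the role of one row of `nmf.components_`: the terms with the largest weights in that row are the ones reported as the topic's top words.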


In [99]:
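def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort is ascending, so the reversed slice takes the largest weights first
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()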

print_top_words(nmf, tfidf_feature_names, 10)

Topic #0: bitcoin futures exchange price week currency investors trading digital cryptocurrency
Topic #1: new times york sign newsletter story people like email reading
Topic #2: blockchain technology companies company cryptocurrency continue bitcoin main advertisement digital
Topic #3: said currency bank company newsletter digital sign view continue trading
Topic #4: information use nyt services personal email digital including tax time
Topic #5: mr tax money photo nyt main like story continue advertisement
Topic #6: percent futures trading price exchange main continue company sign newsletter
Topic #7: market investors year cryptocurrency price time said just com percent
Topic #8: north said cryptocurrency year mr just including information main advertisement
Topic #9: nytimes com including email time york times digital information new
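
The slice `[:-n_top_words - 1:-1]` used in `print_top_words` can be hard to read at first. The sketch below reproduces the same idea with plain Python lists instead of NumPy arrays: sort the indices by ascending weight, then walk backwards from the end to pick the largest ones. The function name `top_terms` and the toy weights are made up for illustration, and tie-breaking may differ from `numpy.argsort`.

```python
def top_terms(weights, feature_names, n_top_words):
    """Return the names of the n_top_words largest weights, largest first."""
    # indices sorted by ascending weight, analogous to numpy's argsort
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # step backwards from the last index, stopping after n_top_words items
    top = order[:-n_top_words - 1:-1]
    return [feature_names[i] for i in top]


# Toy topic row, invented for illustration.
weights = [0.1, 0.7, 0.0, 0.4]
names = ["bank", "bitcoin", "email", "futures"]
print(top_terms(weights, names, 2))  # ['bitcoin', 'futures']
```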



### Results

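The topics identified above reflect the content of the downloaded articles. Topics 0, 2, 6, and 7 are the most closely related to cryptocurrency, with terms such as "bitcoin", "futures", "blockchain", "trading", and "investors". The remaining topics mostly consist of general terms connected to the NYTimes site itself, such as "newsletter", "sign", "email", and "advertisement".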

### References
[Topic Modeling using Python](https://opendatascience.com/blog/how-to-analyze-articles-about-data-science-using-data-science)

[A tutorial on scraping news articles](https://opendatascience.com/blog/using-the-newspaper-library-to-scrape-news-articles/)

[Feature Extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)