# Star Malaysia Project

In [None]:
# Download the data. Comment out the following line after running it once.
!wget 139.59.226.45:8000/full.csv

In [None]:
# Load the data into a pandas DataFrame
import csv
import pandas as pd
import numpy as np

df = pd.read_csv('full.csv')

## Step 1 - Keyword extraction

Try to extract some relevant words for each article. Count the number of occurrences of each word in each document. Read more on scikit-learn about how to extract features from text: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

1. Use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
1. Use [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

See how you can handle stop words:
- with external resources
- automatically based on the corpus frequency
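
The two options above can be sketched as follows, again on an invented toy corpus. The short `external_stop_words` list is a hypothetical stand-in for a real resource such as the NLTK English list, and the `max_df=0.8` threshold is an arbitrary choice for illustration that would need tuning on the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from full.csv)
toy_docs = [
    "the minister opened the new bridge",
    "the football team won the match",
    "the minister watched the football match",
    "the bridge was closed after the match",
]

# Option 1: external resource. Here a tiny hand-written list stands in
# for a real stop word list such as the one shipped with NLTK.
external_stop_words = ["the", "was", "after"]
vect_external = CountVectorizer(stop_words=external_stop_words)
vect_external.fit(toy_docs)

# Option 2: corpus frequency. Terms found in more than 80% of the
# documents are dropped automatically (max_df is a document-frequency cap).
vect_auto = CountVectorizer(max_df=0.8)
vect_auto.fit(toy_docs)

print(sorted(vect_external.vocabulary_))
print(sorted(vect_auto.vocabulary_))
```

Note that the frequency-based option only removes words that are common in this particular corpus: "the" disappears, but a rare function word such as "was" stays.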

In [None]:
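# Keyword extraction: fit a bag-of-words model on the article text
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords', quiet=True)  # the stop word list is an NLTK corpus; download it once
countVect = CountVectorizer(stop_words=nltk.corpus.stopwords.words('english'))
# Blank articles are read as NaN, so replace them with empty strings before fitting
countfit = countVect.fit(df["text"].fillna('').values)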

In [None]:
counttransform = countVect.transform(df["text"].replace(np.nan, '', regex=True).values)

In [None]:
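# vocabulary_ maps each term to its column index, not to its frequency,
# so sum the count matrix column-wise to rank terms by total occurrences.
term_counts = np.asarray(counttransform.sum(axis=0)).ravel()
terms = countVect.get_feature_names_out()
sorted_x = sorted(zip(terms, term_counts), key=lambda pair: pair[1], reverse=True)
sorted_x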

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

df["text"] = df["text"].replace(np.nan, '', regex=True)
tfidvect = TfidfVectorizer(stop_words=nltk.corpus.stopwords.words('english'))
fittfidvect = tfidvect.fit(df["text"])
transform_text_df = tfidvect.transform(df["text"].values)

In [None]:
fittfidvect?

## Step 2 - Find similar articles

The goal of this section is to propose similar articles to the reader. It might be something similar to the section "You May Be Interested" when you're reading one publication on http://www.thestar.com.my/.

Let's do this by finding the closest articles to each article.

We have to define a distance or similarity between two documents.
1. Try with Euclidean distance
1. Try with [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

Hint: [Nearest Neighbors from scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) 

Going further: this is an unsupervised algorithm. Which data would you need in order to compare the two approaches (Euclidean vs. cosine)?
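
As a minimal sketch of the two metrics, the example below runs `NearestNeighbors` twice on an invented toy corpus, once with `metric="euclidean"` and once with `metric="cosine"`. One thing worth knowing before comparing them: `TfidfVectorizer` L2-normalizes each row by default, and for unit vectors the squared Euclidean distance equals 2 - 2·cos, so the two metrics should produce the same neighbor ranking on TF-IDF features. They can differ on unnormalized features such as raw counts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus (not taken from full.csv)
toy_docs = [
    "minister opens new bridge in penang",
    "new bridge in penang opened by minister",
    "football team wins cup final",
    "cup final won by local football team",
]

# Rows are L2-normalized by default (norm="l2")
X = TfidfVectorizer().fit_transform(toy_docs)

# n_neighbors=2 returns the query article itself plus its closest other article
nn_euclid = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
nn_cosine = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

dist_euclid, idx_euclid = nn_euclid.kneighbors(X)
dist_cosine, idx_cosine = nn_cosine.kneighbors(X)

print(idx_euclid)
print(idx_cosine)
```

When querying with the training data itself, each article comes back as its own nearest neighbor, so the recommendations are in the remaining columns of the index array.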

In [None]:
transform_text_df

In [None]:
# Nearest neighbors
from sklearn.neighbors import NearestNeighbors

nneighbors = NearestNeighbors()
nneighbors.fit(transform_text_df)
distances, indices = nneighbors.kneighbors(transform_text_df)

In [None]:
indices

## Step 3 - Clustering

Assume that cosine similarity works better on text documents. Choose a clustering algorithm in order to group documents that are about the same subject.
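
One possible approach, sketched here on an invented toy corpus, is to run `KMeans` on L2-normalized TF-IDF vectors. `KMeans` itself only supports Euclidean distance, but on unit-length rows Euclidean distance orders pairs of documents the same way as cosine similarity, so this is a common approximation rather than a true cosine-based clustering. The choice of `n_clusters=2` is an assumption that fits the toy data only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two obvious subjects (not taken from full.csv)
toy_docs = [
    "minister opens new bridge in penang",
    "new bridge in penang opened by minister",
    "football team wins cup final",
    "cup final won by local football team",
]

# TfidfVectorizer L2-normalizes rows by default, which is what makes
# Euclidean KMeans behave like a cosine-based grouping here.
X = TfidfVectorizer().fit_transform(toy_docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

for doc, label in zip(toy_docs, labels):
    print(label, doc)
```

The cluster numbers themselves are arbitrary; only the grouping of documents is meaningful.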

In [None]:
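from sklearn.cluster import KMeans

# KMeans uses Euclidean distance. TfidfVectorizer L2-normalizes rows by default,
# so on these vectors Euclidean distance ranks documents like cosine similarity.
kmeans = KMeans(random_state=1)  # default n_clusters=8; fixed seed for reproducible labels

# Now we perform the clustering
kmeans.fit(transform_text_df)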

In [None]:
kmeans.labels_

## Step 4 - NMF

Let's use the NMF technique to mine the dataset further.

1. You will first need to transform the text into a matrix using `TfidfVectorizer()` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)). Be sure to use the stopwords. Some of the articles are blank and so their content comes up as NaN. Use the function `DataFrame.notnull()` to eliminate these from your sample.
2. Use the `NMF` estimator to divide the dataset into 10 topics.
3. Print the top 20 words corresponding to each category.

In [None]:
# NMF

from sklearn.decomposition import NMF
n_topics = 10
n_top_words = 20

nmf = NMF(n_components=n_topics, random_state=1).fit(transform_text_df)

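# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidvect.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    # argsort is ascending, so the reversed slice yields the n_top_words largest weights
    print(" ".join(feature_names[i]
                   for i in topic.argsort()[:-n_top_words - 1:-1]))
    print()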