# News Classification 

In this project, we used 6500 news articles from [Mashable](http://mashable.com/).
We preprocessed the dataset by tokenizing and vectorizing it, then trained several machine learning models. For each model, we recorded the running time and the accuracy and compared the results.

## Loading the news data
This dataset contains 6500 news articles, each with a URL, a label, and the article text.

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('articles.csv', header=None)
df = df.dropna(axis=0,how='any')
df.head()

Unnamed: 0,0,1,2
0,http://mashable.com/2013/01/07/ap-samsung-spon...,business,The Associated Press is the latest news organi...
1,http://mashable.com/2013/01/07/apple-40-billio...,business,It looks like 2012 was a pretty good year for ...
2,http://mashable.com/2013/01/07/astronaut-notre...,entertainment,"When it comes to college football, NASA astron..."
3,http://mashable.com/2013/01/07/att-u-verse-apps/,tech,LAS VEGAS — Sharing photos and videos on your ...
4,http://mashable.com/2013/01/07/beewi-smart-toys/,tech,LAS VEGAS — RC toys have traded in their bulky...


In [5]:
dataset = np.array(df)
dataset.shape

(6500, 3)

## Preprocessing data
In our project, we used TF-IDF features to train the models. However, punctuation, numbers, stop words, and verb tense (for example, "do" and "did") would distort the TF-IDF features. To address this, we preprocessed the articles.

In [6]:
labels = dataset[:, 1]
raw_articles = dataset[:, 2]

Before preprocessing, we print the first article as an example of what the raw articles look like.

In [9]:
print(raw_articles[0])

The Associated Press is the latest news organization to experiment with trying to make money from Twitter by using its feed to advertise for other companies. The AP announced Monday that it will share sponsored tweets from Samsung throughout this week for the International CES taking place in Las Vegas. The news service will let Samsung post two tweets per day to the AP's Twitter account, which has more than 1.5 million users, and each of these tweets will be labeled "SPONSORED TWEETS."This marks the first time that the AP has sold advertising on its Twitter feed, and the company says it spent months developing guidelines to pave the way for this and other new media business models. For this particular promotion, Samsung will provide the sponsored tweets and non-editorial staff at the AP will handle the publishing side. In this way, the company hopes to maintain a clear dividing line between its editorial and advertising operations on Twitter."We are thrilled to be taking this next ste

We use the nltk package to preprocess the data.

In [7]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# build the stop-word set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
porter = PorterStemmer()

clean_articles = []
for text in raw_articles:
    # replace runs of non-ASCII characters with a space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # keep only alphabetic tokens (drops punctuation and numbers)
    words = [word for word in tokens if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # reduce each word to its stem
    stemmed = [porter.stem(word) for word in words]
    clean_articles.append(" ".join(stemmed))

The articles are preprocessed into the format below. Numbers, punctuation, case, and tense are removed from the articles. This makes the text harder for a human to read, but it makes the data cleaner.
Below is the same first article shown above, after preprocessing.

In [8]:
print(clean_articles[0])

associ press latest news organ experi tri make money twitter use feed advertis compani ap announc monday share sponsor tweet samsung throughout week intern ce take place la vega news servic let samsung post two tweet per day ap twitter account million user tweet label sponsor tweet mark first time ap sold advertis twitter feed compani say spent month develop guidelin pave way new media busi model particular promot samsung provid sponsor tweet staff ap handl publish side way compani hope maintain clear divid line editori advertis oper twitter thrill take next step social media said lou ferrara ap manag editor overse social media effort statement industri must look new way develop revenu provid good experi advertis consum time advertis audienc expect ap without compromis core mission break news publish dabbl twitter ad includ atlant nation journal courtesi flickr nan palmero


## Tokenization and Vectorizing
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. This gives us a way to classify articles based on how frequently important words occur in them.

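In this step, we tokenize the whole dataset and keep the top 12000 terms, ranked by term frequency across the corpus. Terms that appear in more than half of the articles or in fewer than two articles are discarded. The dataset is then vectorized into TF-IDF weights over the remaining terms.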

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, max_features=12000,
                            min_df=2, use_idf=True, lowercase=True)
X = vectorizer.fit_transform(clean_articles)
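
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a hypothetical four-document corpus. The documents and the names `toy_docs`, `toy_vectorizer`, and `toy_X` are made up for illustration and are not part of the dataset; `max_features` is left out because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-stemmed toy documents (not taken from the dataset).
toy_docs = [
    "appl stock price news",
    "appl phone launch news",
    "movi star award news",
    "movi award phone market",
]

# Same pruning settings as the cell above, without max_features.
toy_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# "news" occurs in 3 of 4 documents (more than 50%), so max_df drops it.
# Words that occur in only one document are dropped by min_df.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)   # ['appl', 'award', 'movi', 'phone']
print(toy_X.shape)  # (4, 4)
```

Only terms that are neither too common nor too rare survive, which is the same filtering applied to the real articles before the 12000-term cap.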

Next, we use y to hold the labels and convert them to integer codes, which can be used in a neural network.

In [11]:
y = labels
print(y[:10])
numy = pd.factorize(y)[0]
print(numy[:10])

['business' 'business' 'entertainment' 'tech' 'tech' 'lifestyle' 'tech'
 'tech' 'world' 'world']
[0 0 1 2 2 3 2 2 4 4]


## Multi-logistic Regression with Cross Validation
Multi-logistic (multinomial logistic) regression is a basic method for classifying a dataset. Here we trained it on our data and used cross validation to evaluate it.
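
The notebook's own training cell is not shown here, so the following is only a sketch of how such a run might look, not the original code. It assumes scikit-learn's `LogisticRegression` and `cross_val_score`, uses a hypothetical toy corpus (`class_words`, `docs`, `targets`) in place of the Mashable articles, and records running time and accuracy as described in the introduction.

```python
import random
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: each class draws words from its own small vocabulary.
class_words = {
    "business": ["stock", "market", "profit", "bank"],
    "tech": ["phone", "app", "chip", "robot"],
    "world": ["elect", "border", "treati", "minist"],
}
rng = random.Random(0)
docs, targets = [], []
for label, words in enumerate(class_words.values()):
    for _ in range(20):
        docs.append(" ".join(rng.choices(words, k=6)))
        targets.append(label)

# The pipeline refits TF-IDF inside each fold, so held-out text never
# influences the vocabulary or the IDF weights.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

start = time.perf_counter()
scores = cross_val_score(model, docs, targets, cv=5, scoring="accuracy")
elapsed = time.perf_counter() - start

print("fold accuracies:", scores)
print("mean accuracy: %.3f, running time: %.2fs" % (scores.mean(), elapsed))
```

Wrapping the vectorizer and the classifier in one pipeline is a design choice of this sketch. The notebook fits the vectorizer on the full dataset first, which is simpler but lets the held-out folds influence the features.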

## Support Vector Machine with Cross Validation

A support vector machine is another useful model for classifying a dataset. We used K-fold cross validation to evaluate it.
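
As a rough illustration of K-fold evaluation with an SVM, the sketch below loops over the folds explicitly. The choice of `LinearSVC`, the five folds, and the toy corpus are all assumptions made for this example; the source does not say which SVM variant, kernel, or value of K was used.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the preprocessed articles.
class_words = {
    "business": ["stock", "market", "profit", "bank"],
    "tech": ["phone", "app", "chip", "robot"],
    "world": ["elect", "border", "treati", "minist"],
}
rng = random.Random(0)
docs, targets = [], []
for label, words in enumerate(class_words.values()):
    for _ in range(20):
        docs.append(" ".join(rng.choices(words, k=6)))
        targets.append(label)

# Shuffle before splitting because the toy documents are ordered by class.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(docs):
    train_docs = [docs[i] for i in train_idx]
    train_y = [targets[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    test_y = [targets[i] for i in test_idx]

    # Fit TF-IDF and the linear SVM on the training folds only.
    svm = make_pipeline(TfidfVectorizer(), LinearSVC())
    svm.fit(train_docs, train_y)
    fold_scores.append(svm.score(test_docs, test_y))

print("fold accuracies:", fold_scores)
print("mean accuracy: %.3f" % (sum(fold_scores) / len(fold_scores)))
```

Each document is used for testing exactly once, and the mean of the per-fold accuracies is the figure to compare against the other models.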