## Read data
1. Keep four columns: 'submission_time', 'upvotes', 'url', and 'headline'. The goal is to use 'headline' to predict the number of 'upvotes' each article received, and so find out which headlines are popular.
2. Train a linear regression model to make the predictions.

In [1]:
# Name the four columns and drop rows with missing values
import pandas as pd
submissions = pd.read_csv("stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

## Tokenization
1. The overall goal is to convert each headline to a numerical representation. The first step is to split each headline into individual words (tokens).

In [2]:
# Split each headline on spaces
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split(" "))

## Preprocessing
1. To make the predictions more accurate, normalize the tokens so that variants of the same word are grouped together, for example words that differ only in upper/lower case or in attached punctuation.

In [3]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    # convert tokens to lower case
    for token in item:
        token = token.lower()
        # remove punctuation
        for punc in punctuation:
            token = token.replace(punc, "")
        # append the cleaned token
        tokens.append(token)
    clean_tokenized.append(tokens)
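
To see what the tokenization and cleaning steps do to a single headline, here is a minimal sketch. The helper name `clean_headline` and the sample headline are made up for illustration and are not part of the original notebook; the logic mirrors the loops above.

```python
# Same punctuation list as in the notebook cell above.
PUNCTUATION = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

def clean_headline(headline):
    """Hypothetical helper: split on single spaces, lower-case, strip punctuation."""
    tokens = []
    for token in headline.split(" "):
        token = token.lower()
        for punc in PUNCTUATION:
            token = token.replace(punc, "")
        tokens.append(token)
    return tokens

# Invented sample headline, not taken from stories.csv.
print(clean_headline("Show HN: A (Tiny) Key-Value Store"))
# ['show', 'hn', 'a', 'tiny', 'keyvalue', 'store']
```

Note that a token made up only of punctuation, such as a standalone "-", becomes an empty string rather than being dropped, so empty tokens can end up in the vocabulary unless they are filtered out.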

## Convert to numerical representation
1. Collect the tokens that occur at least twice across all headlines and make them the column headers of a matrix. Initialize all the values in the matrix to 0.
2. Use a pandas DataFrame filled with zeros. It has one row per entry in 'clean_tokenized' and one column per token in 'unique_tokens'.

In [4]:
import numpy as np
# Find the tokens that occur at least twice
unique_tokens = []
single_tokens = []
for tokens in clean_tokenized:
    for token in tokens:
        # first occurrence: record that the token has been seen
        if token not in single_tokens:
            single_tokens.append(token)
        # later occurrence: the token appears at least twice, so keep it
        elif token not in unique_tokens:
            unique_tokens.append(token)
# DataFrame of zeros with one column per token in 'unique_tokens'
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

## Enumerate tokens
1. Now that every row is filled with 0 values, count how many times each token occurred in each headline. To do this, loop through each list of tokens in 'clean_tokenized' and increment the matching column in that headline's row.

In [14]:
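# Increment the matching column for each token in each headline
for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # a single .loc call avoids chained indexing, which may update a copy
            counts.loc[i, token] += 1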
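
The counting step can be checked on a toy example. The sketch below uses `collections.Counter` instead of the element-by-element loop; this is a swapped-in approach for illustration, and the data and the names `toy_tokenized`, `toy_vocab`, and `toy_counts` are invented.

```python
from collections import Counter

import pandas as pd

# Toy tokenized headlines, invented for illustration only.
toy_tokenized = [
    ["show", "hn", "my", "app"],
    ["ask", "hn", "which", "app"],
    ["hn", "hn", "app"],
]

# Keep tokens that appear at least twice across all headlines.
totals = Counter(token for tokens in toy_tokenized for token in tokens)
toy_vocab = [token for token, n in totals.items() if n >= 2]

# One row per headline, one column per vocabulary token.
rows = [Counter(tokens) for tokens in toy_tokenized]
toy_counts = pd.DataFrame(
    [[row[token] for token in toy_vocab] for row in rows],
    columns=toy_vocab,
)
print(toy_counts)
```

Here only "hn" and "app" occur at least twice, so the matrix has two columns, and the third row holds a 2 under "hn" because that token appears twice in the third headline.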

## Remove extra
1. With 2,000+ columns, a linear regression model can easily overfit. To improve prediction accuracy, remove words that occur fewer than 5 times or more than 100 times. This drops very common words such as 'and' and 'to', as well as words that appear only a few times.

In [15]:
# Generate a vector of column sums for counts
word_counts = counts.sum(axis=0)
# Use loc to keep only columns whose total is between 5 and 100 inclusive
counts = counts.loc[:, (word_counts >= 5) & (word_counts <= 100)]
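
The column filter relies on a boolean mask indexed by column name. A minimal sketch on invented data, with smaller hypothetical thresholds (2 and 3) so the effect is visible on three columns:

```python
import pandas as pd

# Toy count matrix, invented for illustration only.
toy_counts = pd.DataFrame({
    "the": [1, 2, 1],      # total 4: too common for the toy thresholds
    "python": [1, 0, 1],   # total 2: kept
    "zig": [0, 0, 1],      # total 1: too rare
})

col_totals = toy_counts.sum(axis=0)
# Hypothetical thresholds for the toy data; the notebook uses 5 and 100.
mask = (col_totals >= 2) & (col_totals <= 3)
filtered = toy_counts.loc[:, mask]
print(list(filtered.columns))
# ['python']
```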

## Split data
1. Split the data into a training set and a test set. The model is fit on the training set and evaluated on the held-out test set, which shows how well it generalizes to headlines it has not seen.

In [16]:
# Use the train_test_split function
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
# 20% of rows for testing and 80% for training
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

## Make predictions
1. Use the linear regression algorithm from scikit-learn to make predictions on the test set. Call the 'fit' method on the training set with 'X_train' and 'y_train', then make predictions using 'X_test'. Fitting assigns a coefficient to each token column that reflects how that token relates to upvotes, and these coefficients are then used to predict which headlines will receive many upvotes.

In [17]:
# Use linear regression
from sklearn.linear_model import LinearRegression
# Train clf using the fit method
clf = LinearRegression()
clf.fit(X_train, y_train)
# Predictions on X_test
predictions = clf.predict(X_test)

## Error
1. Calculate the error of the predictions using mean squared error (MSE). MSE penalizes predictions that are far from the actual values more heavily than those that are close.

In [18]:
# Calculate the mean squared error (MSE)
mse = sum((y_test - predictions) ** 2) / len(predictions)
print(mse)

2323.66357839
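
MSE is in squared upvotes, which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE) in upvotes; for the value printed above that is about 48.2. A useful sanity check, sketched below on invented numbers, is to compare the model's MSE with that of a baseline that always predicts the mean. The arrays `y_true` and `y_pred` are placeholders, and in the notebook the baseline would more properly use the mean of `y_train`.

```python
import numpy as np

# Toy values invented for illustration; substitute y_test and predictions.
y_true = np.array([1.0, 3.0, 10.0, 50.0])
y_pred = np.array([4.0, 4.0, 12.0, 30.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)  # same units as the target (upvotes)

# Baseline: always predict the mean of the target.
baseline = np.full_like(y_true, y_true.mean())
baseline_mse = np.mean((y_true - baseline) ** 2)

print(mse, rmse, baseline_mse)
```

If the model's MSE is not clearly below the baseline's, the headline tokens are adding little predictive value.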
