# Baseline Submission for the Challenge OLNWP
### Author - Pulkit Gera

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayushshivani/aicrowd_educational_baselines/blob/master/OLNWP_baseline.ipynb)


In [None]:
!pip install numpy
!pip install pandas
!pip install scikit-learn

## Import necessary packages

In [6]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

## Download data
The first step is to download our train and test data. We will train a regression model on the train data, make predictions on the test data, and submit those predictions.

In [None]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/test.zip
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_olnwp/data/public/train.zip
!unzip train.zip
!unzip test.zip

## Load Data
We use the pandas library to load our data. Pandas loads it into dataframes, which make the data easy to analyze. Learn more about it [here](https://www.tutorialspoint.com/python_pandas/index.html)

In [2]:
train_data = pd.read_csv('train.csv')

## Clean and analyse the data

In [5]:
train_data = train_data.drop(columns='url')
train_data.head()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,525.0,10.0,238.0,0.65812,1.0,0.821918,7.0,5.0,1.0,0.0,...,0.1,0.4,-0.133333,-0.166667,-0.1,0.25,0.0,0.25,0.0,782
1,273.0,11.0,545.0,0.47417,1.0,0.587719,21.0,2.0,21.0,1.0,...,0.1,0.9,-0.248214,-0.3,-0.05,0.0,0.0,0.5,0.0,6200
2,423.0,10.0,453.0,0.518265,1.0,0.669173,21.0,5.0,15.0,0.0,...,0.1,0.5,-0.38,-0.7,-0.2,0.3,0.2,0.2,0.2,723
3,80.0,11.0,814.0,0.456885,1.0,0.608787,2.0,2.0,1.0,0.0,...,0.033333,1.0,-0.195312,-0.6,-0.05,0.277273,0.218182,0.222727,0.218182,809
4,653.0,11.0,113.0,0.711712,1.0,0.878788,5.0,4.0,0.0,0.0,...,0.136364,0.8,0.0,0.0,0.0,0.375,-0.125,0.125,0.125,1600


Here we use the `describe` function to get an understanding of the data. It shows summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for every numeric column. You can also use functions like `info()` to see column types and non-null counts.

In [7]:
train_data.describe()
#train_data.info()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
count,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,...,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0,26561.0
mean,354.110802,10.403449,552.377282,0.555933,1.009337,0.696678,10.898648,3.304733,4.588344,1.259177,...,0.094825,0.757686,-0.259757,-0.522776,-0.107678,0.282236,0.071113,0.342243,0.156345,3369.156094
std,213.485655,2.122533,472.605248,4.300199,6.389915,3.987187,11.254509,3.85556,8.377796,4.21286,...,0.070493,0.247909,0.128229,0.290208,0.096784,0.324309,0.266373,0.188296,0.227084,10971.259269
min,8.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,1.0
25%,165.0,9.0,248.0,0.47,1.0,0.62543,4.0,1.0,1.0,0.0,...,0.05,0.6,-0.327976,-0.7,-0.125,0.0,0.0,0.166667,0.0,948.0
50%,339.0,10.0,415.0,0.538251,1.0,0.690323,8.0,3.0,1.0,0.0,...,0.1,0.8,-0.253385,-0.5,-0.1,0.142857,0.0,0.5,0.0,1400.0
75%,540.0,12.0,724.0,0.607735,1.0,0.754011,14.0,4.0,4.0,1.0,...,0.1,1.0,-0.1875,-0.3,-0.05,0.5,0.146667,0.5,0.25,2800.0
max,731.0,23.0,7185.0,701.0,1042.0,650.0,304.0,116.0,128.0,91.0,...,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.5,1.0,690400.0
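As a small illustration of the extra checks mentioned above, the sketch below runs `info()`, a missing-value count and a skewness check. It uses a tiny made-up frame called `toy` (a hypothetical stand-in, not the challenge data), so the numbers it prints say nothing about `train_data` itself.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the feature values are invented.
toy = pd.DataFrame({
    "n_tokens_title": [10.0, 11.0, np.nan, 9.0],
    "num_imgs": [1.0, 21.0, 15.0, 0.0],
    " shares": [782, 6200, 723, 809],
})

# Column dtypes and non-null counts.
toy.info()

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Skewness of the target: a positive value indicates a long right tail.
target_skew = toy[" shares"].skew()
print(target_skew)
```

The same three calls can be applied to `train_data` directly. A heavily skewed target is one possible reason for a large gap between MAE and RMSE later on.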


## Split Data into Train and Validation
Now we want to see how well our model is performing, but we don't have the test data labels to check against. So we split our dataset into train and validation sets. The idea is to evaluate the model on the validation set, which it has not seen during training, to get an estimate of how well it works on new data. This also helps us notice when we [overfit](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) the train dataset. There are many other ways to do validation, such as [k-fold](https://machinelearningmastery.com/k-fold-cross-validation/) and [leave one out](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).

In [12]:
# Note: the target column name in the CSV has a leading space
X = train_data.drop(columns=' shares')
y = train_data[' shares']
# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
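For comparison with the single hold-out split, here is a minimal sketch of k-fold cross-validation using scikit-learn's `KFold` and `cross_val_score`. The arrays `X_demo` and `y_demo` are synthetic stand-ins for `X` and `y`, and the choice of 5 folds is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y (not the challenge data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_weights + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation: each fold serves exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv, scoring="neg_mean_absolute_error",
)

# scikit-learn negates error metrics so that "higher is better" holds for all scorers.
mae_per_fold = -scores
print(mae_per_fold, mae_per_fold.mean())
```

Averaging over folds gives a less noisy estimate than one split, at the cost of fitting the model several times.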

## Define the Model and Train
Now we come to the juicy part. We have prepared our data and now we train a regression model. The model learns a function that maps the inputs to the corresponding outputs. There are many models to choose from, such as [Linear Regression](https://machinelearningmastery.com/linear-regression-for-machine-learning/), [Support Vector Machines](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47), [Decision Trees](https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052), etc.   
Tip: A good result doesn't depend solely on the model but also on the features (columns) you choose. So make sure to play with your data and keep only what's important. 

In [13]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

# Alternative: a decision tree (use the regressor, since the target is numeric)
# from sklearn import tree
# reg = tree.DecisionTreeRegressor()
# reg = reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We have used [Linear Regression](https://machinelearningmastery.com/linear-regression-for-machine-learning/) as the model here, with its default parameters. You can set other parameters and check whether the performance improves. To see the list of parameters visit [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).  
A Decision Tree alternative is given in the comments above. Check out the Decision Tree parameters [here](https://scikit-learn.org/stable/modules/tree.html)
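A minimal sketch of the decision tree alternative, written for a regression target. The data is synthetic, and the values of `max_depth` and `min_samples_leaf` are arbitrary choices for illustration rather than tuned settings for this challenge.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data standing in for the challenge features and target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] ** 2

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# max_depth and min_samples_leaf limit tree growth, which reduces overfitting.
tree_reg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=42)
tree_reg.fit(X_tr, y_tr)

tree_mae = mean_absolute_error(y_va, tree_reg.predict(X_va))
print(tree_mae)
```

Swapping `X_demo` and `y_demo` for `X_train` and `y_train` would fit the same kind of model on the challenge data.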

### Check which variables have the most impact
We now take the time to identify the columns that have the most impact on the predictions. This lets us remove columns with negligible impact and may improve our model.

In [20]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
coeff_df.head()

Unnamed: 0,Coefficient
timedelta,1.371829
n_tokens_title,134.279025
n_tokens_content,0.321616
n_unique_tokens,4477.371557
n_non_stop_words,-2579.368312
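One caveat when reading the table above: raw linear regression coefficients depend on the scale of each feature, so a large coefficient does not by itself mean a large impact. The sketch below, on made-up data with hypothetical column names `small_scale` and `large_scale`, compares raw coefficients with coefficients fitted on standardized features using `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "small_scale": rng.normal(scale=0.01, size=500),
    "large_scale": rng.normal(scale=100.0, size=500),
})
# Each feature contributes roughly the same amount of variation to the target.
target = 100.0 * demo["small_scale"] + 0.01 * demo["large_scale"]

raw = LinearRegression().fit(demo, target)
scaled = LinearRegression().fit(StandardScaler().fit_transform(demo), target)

coeffs = pd.DataFrame(
    {"raw": raw.coef_, "standardized": scaled.coef_},
    index=demo.columns,
)

# Rank features by the absolute size of the standardized coefficient.
order = coeffs["standardized"].abs().sort_values(ascending=False).index
ranked = coeffs.reindex(order)
print(ranked)
```

The raw coefficients differ by several orders of magnitude, while the standardized ones come out close to each other, which matches how the target was constructed.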


## Predict on Validation
Now we use our trained model to predict on the validation set and evaluate it.

In [16]:
y_pred = regressor.predict(X_val)

In [17]:
df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
df1 = df.head(25)

## Evaluate the Performance
We use the same metrics that will be used for the test set.  
[MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) and [RMSE](https://www.statisticshowto.com/rmse/) are the metrics for this challenge

In [19]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))

Mean Absolute Error: 3174.901687993749
Mean Squared Error: 168520453.62599948
Root Mean Squared Error: 12981.542806076613
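To make the two metrics concrete, the sketch below computes MAE and RMSE by hand with NumPy and compares them with the scikit-learn functions. `y_true` reuses the five `shares` values from the `head()` output earlier, while `y_hat` holds made-up predictions.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([782, 6200, 723, 809, 1600], dtype=float)
y_hat = np.array([1000, 3000, 900, 1200, 1500], dtype=float)  # invented predictions

errors = y_true - y_hat

# MAE: average of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; large errors weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, metrics.mean_absolute_error(y_true, y_hat))
print(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
```

Because RMSE squares the errors before averaging, a few badly missed articles with very high share counts can push it well above the MAE.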


## Load Test Set
Load the test data now

In [21]:
test_data = pd.read_csv('test.csv')

In [23]:
test_data = test_data.drop(columns='url')
test_data.head()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity
0,121.0,12.0,1015.0,0.422018,1.0,0.545031,10.0,6.0,33.0,1.0,...,0.333534,0.1,0.8,-0.160714,-0.5,-0.071429,0.0,0.0,0.5,0.0
1,532.0,9.0,503.0,0.569697,1.0,0.737542,9.0,0.0,1.0,1.0,...,0.419786,0.136364,1.0,-0.1575,-0.25,-0.1,0.0,0.0,0.5,0.0
2,435.0,9.0,232.0,0.646018,1.0,0.748428,12.0,3.0,4.0,1.0,...,0.46875,0.375,0.5,-0.4275,-1.0,-0.1875,0.0,0.0,0.5,0.0
3,134.0,12.0,171.0,0.722892,1.0,0.867925,9.0,5.0,0.0,1.0,...,0.5,0.5,0.5,-0.216667,-0.25,-0.166667,0.4,-0.25,0.1,0.25
4,728.0,11.0,286.0,0.652632,1.0,0.8,5.0,2.0,0.0,0.0,...,0.303429,0.1,0.6,-0.251786,-0.5,-0.1,0.2,-0.1,0.3,0.1


## Predict on test set
Time for the moment of truth! We predict on the test set and prepare the submission.

In [27]:
y_test = regressor.predict(test_data)

Since the target is an integer count, we convert the predictions to integers.

In [36]:
# astype(int) truncates toward zero, the same as calling int() on each value
y_inttest = np.asarray(y_test).astype(int)
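Linear regression can output negative or fractional values, and integer conversion truncates rather than rounds. The sketch below shows one possible alternative on made-up predictions: clip to a lower bound of 1 (the minimum of `shares` in the `describe` output) and round to the nearest integer. Whether this helps the leaderboard score is an assumption to verify, not something the baseline establishes.

```python
import numpy as np

raw_preds = np.array([-120.7, 0.4, 1399.6, 2800.2])  # made-up model outputs

# Plain integer conversion: truncates toward zero and keeps negatives.
truncated = raw_preds.astype(int)

# Alternative: clip to at least 1 share, then round to the nearest integer.
clipped_rounded = np.rint(np.clip(raw_preds, 1, None)).astype(int)

print(truncated)        # [-120    0 1399 2800]
print(clipped_rounded)  # [   1    1 1400 2800]
```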

## Save it in correct format

In [None]:
df = pd.DataFrame(y_inttest,columns=[' shares'])
df.to_csv('submission.csv',index=False)
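Before uploading, it can be worth reading the file back to confirm its shape and header. The sketch below does this with a few made-up predictions written to a temporary directory. The expected header `' shares'` follows the column name used in this notebook, and the challenge's exact format requirements are assumed rather than checked here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

preds = np.array([782, 6200, 723])  # made-up predictions

# Write to a temporary location so the real submission.csv is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
pd.DataFrame(preds, columns=[" shares"]).to_csv(path, index=False)

# Read the file back and inspect header, row count and missing values.
check = pd.read_csv(path)
print(list(check.columns), len(check), check.isna().any().any())
```

For the real file, the row count should match the number of rows in `test_data`.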

## To download the generated CSV in Colab, run the command below

In [None]:
from google.colab import files
files.download('submission.csv') 

To participate in the challenge click [here](https://www.aicrowd.com/challenges/olnwp-online-news-prediction)