### For the homeworks we are going to use the "[Online News Popularity Data Set](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity#)"

The dataset can be used both for regression and classification tasks.

#### Source:

* Kelwin Fernandes, INESC TEC, Porto, Portugal / Universidade do Porto, Portugal
* Pedro Vinagre, ALGORITMI Research Centre, Universidade do Minho, Portugal
* Paulo Cortez, ALGORITMI Research Centre, Universidade do Minho, Portugal
* Pedro Sernadela, Universidade de Aveiro

#### Data Set Information:

* The articles were published by Mashable (www.mashable.com), and the content and the rights to reproduce it belong to them. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
* Acquisition date: January 8, 2015
* The estimated relative performance values were computed by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

#### Attribute Information:

Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

0. url: URL of the article (non-predictive)
1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)


The first two columns (url and timedelta) are non-predictive and should be ignored.

The last column **shares** contains the value to predict.

### Regression
In the case of regression we want to predict the value of the **shares** column.

### Classification
In the case of classification we want to predict one of two classes:

* *low* -- shares < 1,400
* *high* -- shares >= 1,400
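
The class definition above can be sketched as a small helper. The names `SHARES_THRESHOLD` and `to_binary_label` are hypothetical and the input values are made up for illustration:

```python
import numpy as np

SHARES_THRESHOLD = 1400  # threshold taken from the class definition above


def to_binary_label(shares):
    """Return True for the *high* class (shares >= 1,400), False for *low*."""
    return np.asarray(shares) >= SHARES_THRESHOLD


# Made-up share counts, including both sides of the boundary
print(to_binary_label([593, 1399, 1400, 25000]).tolist())
# [False, False, True, True]
```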

### Metrics

#### Regression
To evaluate how well we are doing on the **regression** task we will use the Root Mean Squared Error (RMSE). RMSE is given by

$$
\sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\Big(d_i -f_i\Big)^2}}
$$


where:

* $n$ is the number of test samples
* $d_i$ is the ground truth value of the i-th sample
* $f_i$ is the predicted value of the i-th sample
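
As a quick check of the formula, here is a minimal NumPy sketch. The function name `rmse` and the toy values are assumptions made for illustration:

```python
import numpy as np


def rmse(d, f):
    """Root mean squared error between ground truth d and predictions f."""
    d = np.asarray(d, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(np.sqrt(np.mean((d - f) ** 2)))


# Toy example: a single error of 4 over n = 4 samples -> sqrt(16 / 4) = 2
print(rmse([1, 2, 3, 4], [1, 2, 3, 8]))  # 2.0
```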


#### Classification
To evaluate how well we are doing on the **classification** task we will use the accuracy metric. Accuracy is given by

$$
\frac{TP+TN}{TP+TN+FP+FN}
$$

where:

* TP is the number of positive samples *correctly* classified as positive
* TN is the number of negative samples *correctly* classified as negative
* FP is the number of negative samples *incorrectly* classified as positive
* FN is the number of positive samples *incorrectly* classified as negative
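
The four counts and the resulting accuracy can be sketched as follows. This assumes *high* is treated as the positive class, which the text does not fix; `confusion_counts` and `accuracy` are hypothetical helper names and the labels are made up:

```python
import numpy as np


def confusion_counts(truth, pred):
    """Return (TP, TN, FP, FN) for boolean labels, with True as the positive class."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = int(np.sum(truth & pred))    # positive, predicted positive
    tn = int(np.sum(~truth & ~pred))  # negative, predicted negative
    fp = int(np.sum(~truth & pred))   # negative, predicted positive
    fn = int(np.sum(truth & ~pred))   # positive, predicted negative
    return tp, tn, fp, fn


def accuracy(truth, pred):
    tp, tn, fp, fn = confusion_counts(truth, pred)
    return (tp + tn) / (tp + tn + fp + fn)


truth = [True, True, False, False, False]
pred = [True, False, False, True, False]
print(confusion_counts(truth, pred))  # (1, 2, 1, 1)
print(accuracy(truth, pred))          # 0.6
```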

In [1]:
from __future__ import annotations

import math
import random
import time

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip
# !unzip OnlineNewsPopularity.zip

Strip the whitespace from the column names and remove the first two (non-predictive) columns

In [31]:
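# VARIABLES
BINARY_LABEL = True  # True: boolean high/low target; False: keep the numeric target
NORMALIZE = True  # True: z-score every column of the data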

In [4]:
df = pd.read_csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.iloc[:, 2:]

## Let's plot some of the columns

In [5]:
import matplotlib.pyplot as plt


columns_to_plot = [
    'n_tokens_title',
    'num_videos',
    'num_imgs',
    'num_keywords',
    'data_channel_is_world',
    'rate_negative_words',
    'self_reference_avg_sharess',
]
#
# fig, ax = plt.subplots(len(columns_to_plot), 1, figsize=(20, 20))
#
# for i, column in enumerate(columns_to_plot, 0):
#     ax[i].hist(df[column])
#     ax[i].title.set_text(column)

# plt.show()

In [6]:
# compute the mean and the median of each attribute
avg = df.mean(axis=0)
medians = df.median(axis=0)
print(avg)
print(medians)

# discretize each attribute to 0 or 1 based on the median
# for column in df.columns:
#     df[column] = (df[column] >= medians[column]).astype(int)

n_tokens_title                       10.398749
n_tokens_content                    546.514731
n_unique_tokens                       0.548216
n_non_stop_words                      0.996469
n_non_stop_unique_tokens              0.689175
num_hrefs                            10.883690
num_self_hrefs                        3.293638
num_imgs                              4.544143
num_videos                            1.249874
average_token_length                  4.548239
num_keywords                          7.223767
data_channel_is_lifestyle             0.052946
data_channel_is_entertainment         0.178009
data_channel_is_bus                   0.157855
data_channel_is_socmed                0.058597
data_channel_is_tech                  0.185299
data_channel_is_world                 0.212567
kw_min_min                           26.106801
kw_max_min                         1153.951682
kw_avg_min                          312.366967
kw_min_max                        13612.354102
kw_max_max   

In [32]:
data = np.array(df)
if NORMALIZE:
    # z-score every column, including the target column `shares`
    data = (data - data.mean(axis=0)) / data.std(axis=0)
x = data[:, :-1]
# convert the last column to a boolean label
if BINARY_LABEL:
    if not NORMALIZE:
        y = data[:, -1] >= 1400
    else:
        # TODO check: after z-scoring, 0 is the mean of `shares`, so this
        # splits at the mean rather than at the 1,400 threshold
        y = data[:, -1] >= 0
else:
    y = data[:, -1]
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=1)
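
One point worth checking in the cell above: with `NORMALIZE` on, the target is z-scored together with the features, so `>= 0` cuts at the mean of `shares`, not at 1,400. A minimal sketch on made-up numbers of two ways to keep the 1,400 cut (label from the raw column, or z-score the threshold too), next to the cut at 0:

```python
import numpy as np

shares = np.array([500.0, 1200.0, 1500.0, 3000.0, 90000.0])  # made-up values

# Option 1: label from the raw shares, before any normalization
y_raw = shares >= 1400

# Option 2: z-score the column and apply the same transform to the threshold
mean, std = shares.mean(), shares.std()
z = (shares - mean) / std
y_z = z >= (1400 - mean) / std

# For comparison: thresholding the z-scored column at 0 splits at the mean
y_mean = z >= 0

print(y_raw.tolist())   # [False, False, True, True, True]
print(y_z.tolist())     # [False, False, True, True, True]
print(y_mean.tolist())  # [False, False, False, False, True]
```

On skewed data such as share counts, the mean sits far above the median, so the cut at 0 labels far fewer samples as *high*.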

In [8]:
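from sklearn.metrics import mean_squared_error, r2_score


def analyze_pred(pred, truth):
    # Print the mean squared error and R-squared score
    print('Mean Squared Error:', mean_squared_error(truth, pred))
    print('R-squared Score:', r2_score(truth, pred))
    # Mean of the predictions, then mean of the ground truth
    print(np.mean(pred))
    print(np.mean(truth))
    # Binarize both at 0 (the mean of a z-scored target) to report an accuracy
    bin_pred = pred >= 0
    bin_truth = truth >= 0
    print('Accuracy:', accuracy_score(bin_truth, bin_pred))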

In [11]:
from sklearn.metrics import log_loss


def analyze_pred_bin(pred, truth):
    # Print the binary cross entropy of the hard 0/1 predictions
    print('Binary cross entropy:', log_loss(truth, pred))
    # Fraction of samples predicted positive, then fraction actually positive
    print(np.mean(pred))
    print(np.mean(truth))
    print('Accuracy:', accuracy_score(truth, pred))

In [12]:
def test_model(model, train_x, train_y, test_x, test_y, classification=False):
    time_start = time.time()
    model.fit(train_x, train_y)
    print("Time taken to train the model: ", time.time() - time_start)
    pred = model.predict(test_x)
    if classification:
        analyze_pred_bin(pred, test_y)
    else:
        analyze_pred(pred, test_y)

In [13]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

regressor = KNeighborsRegressor(n_neighbors=5)
test_model(regressor, train_x, train_y, test_x, test_y)



Time taken to train the model:  0.001749277114868164
Mean Squared Error: 0.6476286041930105
R-squared Score: -0.24005196319638644
-0.029363834499985204
-0.01342905790679993
Accuracy: 0.7037457434733257


In [16]:

classifier = KNeighborsClassifier(n_neighbors=5)
test_model(classifier, train_x, train_y, test_x, test_y, classification=True)

Time taken to train the model:  0.002966642379760742
Binary cross entropy: 8.068796161644968
0.07592382393744482
0.20267372934796318
Accuracy: 0.7761382267625173


k-Nearest Neighbors From Scratch

In [19]:

from scipy.spatial.distance import cdist

class ScratchKNeighborsRegressor:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        distances = cdist(X_test, self.X_train)
        nearest_indices = np.argsort(distances, axis=1)[:, :self.n_neighbors]
        nearest_targets = self.y_train[nearest_indices]
        predictions = np.mean(nearest_targets, axis=1)
        return predictions
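
To make the neighbor-averaging step concrete, here is a small self-contained sketch on made-up 1-D data. It uses plain NumPy broadcasting in place of `cdist` (an illustrative substitution so the snippet needs only NumPy), and the function name `knn_predict` is hypothetical:

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    # Pairwise differences via broadcasting, shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))    # shape (n_test, n_train)
    nearest = np.argsort(distances, axis=1)[:, :k]   # indices of the k closest
    return y_train[nearest].mean(axis=1)


X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
X_test = np.array([[0.9], [9.0]])

# 0.9 is closest to 1.0 and 0.0; 9.0 is closest to 10.0 and 2.0
print(knn_predict(X_train, y_train, X_test, k=2).tolist())  # [0.5, 6.0]
```

The full distance matrix costs memory proportional to `n_test * n_train`, which is the same trade-off the `cdist` version makes.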

In [20]:
regressor = ScratchKNeighborsRegressor()
test_model(regressor, train_x, train_y, test_x, test_y)


Time taken to train the model:  2.1457672119140625e-06
Mean Squared Error: 0.6476286041930105
R-squared Score: -0.24005196319638644
-0.029363834499985204
-0.01342905790679993
Accuracy: 0.7037457434733257


In [33]:
class ScratchKNeighborsClassifier:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        distances = cdist(X_test, self.X_train)
        nearest_indices = np.argsort(distances, axis=1)[:, :self.n_neighbors]
        nearest_targets = self.y_train[nearest_indices]
        predictions = np.mean(nearest_targets, axis=1)
        return predictions>=0.5

In [34]:
classifier = ScratchKNeighborsClassifier()
test_model(classifier, train_x, train_y, test_x, test_y, classification=True)


Time taken to train the model:  1.430511474609375e-06
Binary cross entropy: 8.068796161644968
0.07592382393744482
0.20267372934796318
Accuracy: 0.7761382267625173


In [30]:
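class ScratchLocallyWeightedLinearRegression:
    # Locally weighted linear regression is non-parametric: there is no single
    # weight vector. Each query point gets its own weighted least-squares fit,
    # solved here in closed form (weighted normal equations) rather than by
    # gradient descent.
    def __init__(self, kernel_func, kernel_func_params, ridge=1e-3):
        self.kernel_func = kernel_func
        self.kernel_func_params = kernel_func_params
        self.ridge = ridge  # assumed small ridge term; keeps the system solvable
        self.X = None
        self.y = None

    @staticmethod
    def _add_bias(X):
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        # "Training" only stores the data (with a bias column)
        self.X = self._add_bias(X)
        self.y = np.asarray(y, dtype=float)

    def predict(self, X):
        # Cost is O(n_train * d^2) per query, so this is slow on large test sets
        X = self._add_bias(np.atleast_2d(X))
        predictions = np.empty(X.shape[0])
        for i, x in enumerate(X):
            k = self.kernel_func(x, self.X, self.kernel_func_params)  # shape (n_train,)
            # Weighted normal equations: (X^T W X + ridge * I) theta = X^T W y
            A = self.X.T @ (self.X * k[:, None]) + self.ridge * np.eye(self.X.shape[1])
            b = self.X.T @ (k * self.y)
            predictions[i] = x @ np.linalg.solve(A, b)
        return predictions


def gaussian_kernel(x, X, params):
    sigma = params['sigma']
    distances = np.linalg.norm(x - X, axis=1)
    return np.exp(-distances ** 2 / (2 * sigma ** 2))


regressor = ScratchLocallyWeightedLinearRegression(kernel_func=gaussian_kernel, kernel_func_params={'sigma': 1})

test_model(regressor, train_x, train_y, test_x, test_y)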
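
For reference, a sketch of how locally weighted linear regression is usually stated for a single query point $x_q$, using the Gaussian kernel from `gaussian_kernel`. The symbols $\theta$, $W$ and $k_i$ are notational assumptions, not names from the notebook. Each training sample $x_i$ gets a weight, and the fit minimizes a weighted squared error:

$$
k_i = \exp\left(-\frac{\lVert x_q - x_i \rVert^2}{2\sigma^2}\right), \qquad
J(\theta) = \sum\limits_{i=1}^{n} k_i \Big(y_i - \theta^\top x_i\Big)^2
$$

With $W = \mathrm{diag}(k_1, \dots, k_n)$, setting the gradient to zero gives the closed-form solution and the prediction for $x_q$:

$$
\nabla_\theta J = -2\, X^\top W (y - X\theta) = 0
\quad\Rightarrow\quad
\theta = (X^\top W X)^{-1} X^\top W y, \qquad \hat{y}_q = \theta^\top x_q
$$

Note that the gradient $X^\top W (y - X\theta)$ has one entry per feature, the same shape as $\theta$, whereas the kernel vector $k$ has one entry per training sample. Any gradient-style update of $\theta$ therefore has to pass through $X^\top$ before it is added to the weights.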