### Please complete the following sections sequentially to complete this assignment.

##### <span style="color:red">Note: You can create as many code or markdown cells as you deem necessary to answer each question. However, please leave the problems unchanged. We will evaluate your solutions by executing your code sequentially.</span> 
---

**With the expansion of the Internet and the Web, there has been growing interest in online articles and reviews, which allow information to spread quickly and easily worldwide. As a result, predicting the popularity of online news has become a trend. Popularity is often measured by the number of interactions on the Web and social networks (e.g., number of shares, likes, and comments). Predicting such popularity is valuable for advertisers, authors, content providers, and even activists/politicians (e.g., to understand or influence public opinion). In this assignment, we use a news popularity dataset utilized by Fernandes et al. (2015) based on the articles published by [Mashable](https://mashable.com/) from January 7, 2013, to January 7, 2015.**

**<span style="color:red">The objective of this assignment is to predict the number of times a news article is shared. </span> The assignment's dataset is included in the homework's zipped folder. The table below describes each variable in the dataset.**

| Variable                      | Description                                                                       |
|-------------------------------|-----------------------------------------------------------------------------------|
| url                           | URL of the article (non-predictive)                                               |
| timedelta                     | Days between the article publication and the dataset acquisition (non-predictive) |
| n_tokens_title                | Number of words in the title                                                      |
| n_tokens_content              | Number of words in the content                                                    |
| n_unique_tokens               | Rate of unique words in the content                                               |
| n_non_stop_words              | Rate of non-stop words in the content                                             |
| n_non_stop_unique_tokens      | Rate of unique non-stop words in the content                                      |
| num_hrefs                     | Number of links                                                                   |
| num_self_hrefs                | Number of links to other articles published by Mashable                           |
| num_imgs                      | Number of images                                                                  |
| num_videos                    | Number of videos                                                                  |
| average_token_length          | Average length of the words in the content                                        |
| num_keywords                  | Number of keywords in the metadata                                                |
| data_channel_is_lifestyle     | Is data channel 'Lifestyle'?                                                      |
| data_channel_is_entertainment | Is data channel 'Entertainment'?                                                  |
| data_channel_is_bus           | Is data channel 'Business'?                                                       |
| data_channel_is_socmed        | Is data channel 'Social Media'?                                                   |
| data_channel_is_tech          | Is data channel 'Tech'?                                                           |
| data_channel_is_world         | Is data channel 'World'?                                                          |
| kw_min_min                    | Min. shares of the worst keyword in the article                                   |
| kw_max_min                    | Max. shares of the worst keyword in the article                                   |
| kw_avg_min                    | Avg. shares of the worst keyword in the article                                   |
| kw_min_max                    | Min. shares of the best keyword in the article                                    |
| kw_max_max                    | Max. shares of the best keyword in the article                                    |
| kw_avg_max                    | Avg. shares of the best keyword in the article                                    |
| kw_min_avg                    | Min. shares of the average keyword in the article                                 |
| kw_max_avg                    | Max. shares of the average keyword in the article                                 |
| kw_avg_avg                    | Avg. shares of the average keyword in the article                                 |
| self_reference_min_shares     | Min. shares of referenced articles in Mashable                                    |
| self_reference_max_shares     | Max. shares of referenced articles in Mashable                                    |
| self_reference_avg_sharess    | Avg. shares of referenced articles in Mashable                                    |
| weekday_is_monday             | Was the article published on a Monday?                                            |
| weekday_is_tuesday            | Was the article published on a Tuesday?                                           |
| weekday_is_wednesday          | Was the article published on a Wednesday?                                         |
| weekday_is_thursday           | Was the article published on a Thursday?                                          |
| weekday_is_friday             | Was the article published on a Friday?                                            |
| weekday_is_saturday           | Was the article published on a Saturday?                                          |
| weekday_is_sunday             | Was the article published on a Sunday?                                            |
| is_weekend                    | Was the article published on the weekend?                                         |
| LDA_00                        | Closeness to LDA topic 0                                                          |
| LDA_01                        | Closeness to LDA topic 1                                                          |
| LDA_02                        | Closeness to LDA topic 2                                                          |
| LDA_03                        | Closeness to LDA topic 3                                                          |
| LDA_04                        | Closeness to LDA topic 4                                                          |
| global_subjectivity           | Text subjectivity                                                                 |
| global_sentiment_polarity     | Text sentiment polarity                                                           |
| global_rate_positive_words    | Rate of positive words in the content                                             |
| global_rate_negative_words    | Rate of negative words in the content                                             |
| rate_positive_words           | Rate of positive words among non-neutral tokens                                   |
| rate_negative_words           | Rate of negative words among non-neutral tokens                                   |
| avg_positive_polarity         | Avg. polarity of positive words                                                   |
| min_positive_polarity         | Min. polarity of positive words                                                   |
| max_positive_polarity         | Max. polarity of positive words                                                   |
| avg_negative_polarity         | Avg. polarity of negative words                                                   |
| min_negative_polarity         | Min. polarity of negative words                                                   |
| max_negative_polarity         | Max. polarity of negative words                                                   |
| title_subjectivity            | Title subjectivity                                                                |
| title_sentiment_polarity      | Title polarity                                                                    |
| abs_title_subjectivity        | Absolute subjectivity level                                                       |
| abs_title_sentiment_polarity  | Absolute polarity level                                                           |
| **shares (Target)**           | **Number of shares**                                                              |
| popular (DO NOT USE)          | whether the article is popular (yes/no)                                           |

Reference:

Fernandes, K., Vinagre, P., & Cortez, P. (2015, September). A proactive intelligent decision support system for predicting the popularity of online news. In Portuguese Conference on Artificial Intelligence (pp. 535-546). Springer, Cham.

---
### Import Packages and Read the Data

**Before starting the assignment, import all necessary libraries and read the dataset into the Python environment.**

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from pydotplus import graph_from_dot_data

In [3]:
df = pd.read_csv("online_news_popularity.csv")
df.head(2)

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,popular
0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219,0.663594,1.0,0.815385,4,2,1,...,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,no
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255,0.604743,1.0,0.791946,3,1,1,...,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711,no


---
### Introduction to Regression Trees

**1- Watch this [video](https://ohiouniversity.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=403295c8-1da1-4c46-a3ed-acd9002069dd) for an introduction to regression trees.**

**2- Briefly describe how regression trees work. (10 pts)**

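Regression trees work by recursively partitioning the data into smaller subsets, at each step splitting on the feature and threshold that provide the best split. This process continues until a stopping criterion is met, such as a maximum depth or a minimum number of samples per node. Each leaf then predicts the mean target value of the training observations that fall into it.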
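As a minimal sketch of the partitioning idea, the toy example below fits a one-split tree with scikit-learn. The data (`X_toy`, `y_toy`) are made up for illustration and are not part of the assignment's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: the target jumps between x = 3 and x = 4
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])

# max_depth=1 allows a single split, which yields two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Text view of the split and the value predicted in each leaf
print(export_text(stump, feature_names=["x"]))

# Each prediction is the mean target of the leaf the observation falls into
low_pred, high_pred = stump.predict([[2.5], [5.5]])
print(low_pred, high_pred)
```

The two predictions are the means of the first three and last three targets, which shows that a leaf predicts the average of the training observations it contains.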

**3- What are the similarities of classification and regression tree models? (10 pts)**

Both classification and regression trees are built by recursive partitioning: each internal node splits the data on one input feature, and each leaf holds a prediction. Both make predictions by routing a new observation down the tree according to its feature values and the associated splits.

**4- What are the differences of classification and regression tree models? (10 pts)**

Some differences include the objective and the output. Classification trees predict categorical variables, so each leaf outputs a class label, while regression trees predict numerical variables, so each leaf outputs a continuous value. The splitting criterion differs as well: classification trees use an impurity measure such as Gini or entropy, while regression trees typically use MSE.
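The contrast can be sketched by fitting both kinds of tree to the same inputs. The word counts, share counts, and the 1400-share cutoff for "popular" are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical data: one feature (word count) and a numeric target (shares)
X_toy = np.array([[100], [200], [300], [400], [500], [600]])
shares = np.array([500, 700, 900, 1500, 1900, 2500])

# Categorical version of the target, using an assumed cutoff of 1400 shares
popular = np.where(shares >= 1400, "yes", "no")

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, popular)
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, shares)

label = clf.predict([[550]])[0]  # a class label
value = reg.predict([[550]])[0]  # a number: the mean target of the matching leaf
print(label, value)
```

The classifier returns a label, while the regressor returns the average number of shares in the leaf that the new observation reaches.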

**5- How is MSE used in regression trees? (10 pts)**

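MSE, or mean squared error, is the criterion used to choose splits. For each candidate split, the tree calculates the average squared difference between the actual target values in each child node and that node's prediction (the mean of those values), and it picks the split with the lowest weighted MSE across the two children. Because a node's MSE around its mean equals its variance, this amounts to minimizing the variance within the nodes.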
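The split search for a single feature can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual implementation; the helper names `node_mse` and `best_split` and the toy data are assumptions.

```python
import numpy as np

def node_mse(y):
    """MSE of a node that predicts the mean of its targets (equals the variance)."""
    return float(np.mean((y - y.mean()) ** 2))

def best_split(x, y):
    """Return (threshold, weighted_child_mse) for the best split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold separates identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Weighted average of the child MSEs: the quantity the split minimizes
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best_score:
            best_threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
            best_score = score
    return best_threshold, best_score

# Hypothetical toy data: the target jumps between x = 3 and x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 51.0])
threshold, score = best_split(x, y)
print("threshold:", threshold, "weighted child MSE:", score)
print("parent MSE:", node_mse(y))
```

The chosen threshold falls midway between 3 and 4, where the weighted child MSE is far below the MSE of the unsplit parent node.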

**6- Why does overfitting happen in regression trees, and how can it be avoided? (10 pts)**

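Overfitting occurs when the tree grows too deep and has too many nodes, so that each leaf contains only a few training observations and the tree fits noise rather than the underlying pattern. This results in poor generalization to unseen data. It can be avoided by pruning the tree, which removes unnecessary nodes, or by limiting its growth with a maximum depth or a minimum number of samples per leaf.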
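A small sketch on synthetic data illustrates the effect. The data-generating process and the specific limits (`max_depth=3`, `min_samples_leaf=10`) are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure
deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A constrained tree stops early
shallow = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train R2:", round(model.score(X_tr, y_tr), 3),
          "test R2:", round(model.score(X_te, y_te), 3))
```

The unconstrained tree memorizes the training set, so its train R-squared is essentially perfect; a large gap between its train and test scores is the usual sign of overfitting.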

---
### Regression Trees in Python

**7- Watch this [video](https://ohiouniversity.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=bd5b0d61-6837-4d54-b2b4-acd9002071b8) to learn about implementing regression trees in Python. The video's dataset is included in the assignment zipped folder, in case you want to replicate the code.**

**8- Check if there are any missing values and take care of them if needed. (5 pts)**

In [5]:
df.isna().sum()

url                             0
timedelta                       0
n_tokens_title                  0
n_tokens_content                0
n_unique_tokens                 0
n_non_stop_words                0
n_non_stop_unique_tokens        0
num_hrefs                       0
num_self_hrefs                  0
num_imgs                        0
num_videos                      0
average_token_length            0
num_keywords                    0
channel                         0
kw_min_min                      0
kw_max_min                      0
kw_avg_min                      0
kw_min_max                      0
kw_max_max                      0
kw_avg_max                      0
kw_min_avg                      0
kw_max_avg                      0
kw_avg_avg                      0
self_reference_min_shares       0
self_reference_max_shares       0
self_reference_avg_sharess      0
weekday                         0
is_weekend                      0
LDA_00                          0
LDA_01        

**9- Detect and eliminate the outliers of these variables: ```['LDA_02', 'LDA_03', 'LDA_04']``` (10 pts)**

In [6]:
df_clean = df.copy()
var_list = ['LDA_02', 'LDA_03', 'LDA_04']
for var in var_list:
    # Quartiles are recomputed on the rows that survived the previous filters
    q1, q3 = df_clean[var].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep only rows within 1.5 * IQR of the quartiles
    ub = q3 + 1.5 * iqr
    lb = q1 - 1.5 * iqr
    df_clean = df_clean[(df_clean[var] <= ub) & (df_clean[var] >= lb)]
df_clean

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,popular
0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219,0.663594,1.0,0.815385,4,2,1,...,0.70,-0.350000,-0.600,-0.200000,0.500000,-0.187500,0.000000,0.187500,593,no
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255,0.604743,1.0,0.791946,3,1,1,...,0.70,-0.118750,-0.125,-0.100000,0.000000,0.000000,0.500000,0.000000,711,no
2,http://mashable.com/2013/01/07/apple-40-billio...,731,9,211,0.575130,1.0,0.663866,3,1,1,...,1.00,-0.466667,-0.800,-0.133333,0.000000,0.000000,0.500000,0.000000,1500,yes
3,http://mashable.com/2013/01/07/astronaut-notre...,731,9,531,0.503788,1.0,0.665635,9,0,1,...,0.80,-0.369697,-0.600,-0.166667,0.000000,0.000000,0.500000,0.000000,1200,no
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731,13,1072,0.415646,1.0,0.540890,19,19,20,...,1.00,-0.220192,-0.500,-0.050000,0.454545,0.136364,0.045455,0.136364,505,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39638,http://mashable.com/2014/12/27/protests-contin...,8,11,223,0.653153,1.0,0.825758,5,3,1,...,0.80,-0.250000,-0.250,-0.250000,0.000000,0.000000,0.500000,0.000000,1200,no
39639,http://mashable.com/2014/12/27/samsung-app-aut...,8,11,346,0.529052,1.0,0.684783,9,7,1,...,0.75,-0.260000,-0.500,-0.125000,0.100000,0.000000,0.400000,0.000000,1800,yes
39640,http://mashable.com/2014/12/27/seth-rogen-jame...,8,12,328,0.696296,1.0,0.885057,9,7,3,...,0.70,-0.211111,-0.400,-0.100000,0.300000,1.000000,0.200000,1.000000,1900,yes
39641,http://mashable.com/2014/12/27/son-pays-off-mo...,8,10,442,0.516355,1.0,0.644128,24,1,12,...,0.50,-0.356439,-0.800,-0.166667,0.454545,0.136364,0.045455,0.136364,1900,yes


**10- Dummy encode all categorical variables. (5 pts)**

In [7]:
df1 = pd.get_dummies(df_clean, columns=["channel", "weekday"], drop_first=True)
df1.head(2)

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,channel_other,channel_social_media,channel_tech,channel_world,weekday_monday,weekday_saturday,weekday_sunday,weekday_thursday,weekday_tuesday,weekday_wednesday
0,http://mashable.com/2013/01/07/amazon-instant-...,731,12,219,0.663594,1.0,0.815385,4,2,1,...,False,False,False,False,True,False,False,False,False,False
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731,9,255,0.604743,1.0,0.791946,3,1,1,...,False,False,False,False,True,False,False,False,False,False


**11- Partition the data (Consider 80% of the data as train). (5 pts)**

In [8]:
var_list = ['timedelta', 'n_tokens_title', 'n_tokens_content',
       'n_unique_tokens', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'kw_min_min', 'kw_max_min',
       'kw_avg_min', 'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg',
       'kw_max_avg', 'kw_avg_avg', 'self_reference_min_shares',
       'self_reference_max_shares', 'self_reference_avg_sharess', 'is_weekend',
       'LDA_00', 'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'global_rate_negative_words', 'rate_positive_words',
       'rate_negative_words', 'avg_positive_polarity', 'min_positive_polarity',
       'max_positive_polarity', 'avg_negative_polarity',
       'min_negative_polarity', 'max_negative_polarity', 'title_subjectivity',
       'title_sentiment_polarity', 'abs_title_subjectivity',
       'abs_title_sentiment_polarity',
       'channel_entertainment', 'channel_lifestyle', 'channel_other',
       'channel_social_media', 'channel_tech', 'channel_world',
       'weekday_monday', 'weekday_saturday', 'weekday_sunday',
       'weekday_thursday', 'weekday_tuesday', 'weekday_wednesday']
X = df1[var_list]
y = df1["shares"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
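Without a fixed seed, `train_test_split` draws a different random partition on every run, so the metrics reported later change each time the notebook is executed. The sketch below shows how `random_state` makes the partition repeatable; the demo arrays and the seed value 42 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# The same random_state yields the same 80/20 partition on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical partitions:", same_split)
```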

**12- Using proper input variables, build a regression tree that predicts the number of times a news article is shared. After building your model, do the following: (30 pts)**
* __Calculate the $ r^2 $ and MSE of the model on the train data,__
* __Visualize the tree,__
* __Set the parameters of the regression tree such that it does not overfit the data.__

In [9]:
# ccp_alpha enables cost-complexity pruning; larger values prune more of the tree
dec_tree = tree.DecisionTreeRegressor(ccp_alpha=1000000)
dec_tree.fit(X_train, y_train)
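The train-set metrics and a view of the tree could be produced along the following lines. This is a sketch on synthetic stand-in data so that it runs on its own; the helper `report_fit`, the demo arrays, and the feature names are assumptions, and in the notebook the same calls would be applied to `dec_tree`, `X_train`, and `y_train`.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import mean_squared_error, r2_score

def report_fit(model, X, y, label):
    """Print and return (r2, mse) of a fitted model on the given data."""
    predictions = model.predict(X)
    r2 = r2_score(y, predictions)
    mse = mean_squared_error(y, predictions)
    print(f"{label} R-squared: {r2:.4f}  MSE: {mse:.2f}")
    return r2, mse

# Synthetic stand-in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(size=200)

demo_tree = tree.DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
train_r2, train_mse = report_fit(demo_tree, X_demo, y_demo, "Train")

# Text rendering of the fitted tree: one line per split or leaf
print(tree.export_text(demo_tree, feature_names=["f0", "f1", "f2"]))
```

For a graphical view, `tree.plot_tree` from scikit-learn draws the same structure with matplotlib.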

**13- Test the tree you built on the test data by calculating the $ r^2 $ and MSE of the model on the test data: (10 pts)**

In [13]:
r_squared = dec_tree.score(X_test, y_test)
predictions = dec_tree.predict(X_test)
mse = mean_squared_error(y_test, predictions)

print("R-squared:", r_squared)
print("MSE:", mse)

R-squared: -0.44937521948804715
MSE: 97039351.13539259


**14- Comparing your train and test results, do you see any evidence of overfitting? Explain. (10 pts)**

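There is evidence of overfitting. The model performs significantly better on the training set than on the test set, as shown by the r-squared score and MSE on the training data compared to the test data. The negative test r-squared (about -0.45) means the tree predicts unseen articles worse than a constant prediction of the mean number of shares would.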

**15- Which variables are the most important ones? Sort and show the input variables based on their importance. (5 pts)**

In [11]:
df2 = pd.DataFrame({"Variable": var_list, "Importance": dec_tree.feature_importances_}).sort_values(by=["Importance"], ascending=False)
df2

Unnamed: 0,Variable,Importance
10,average_token_length,0.12255
13,kw_max_min,0.118941
42,title_subjectivity,0.114426
2,n_tokens_content,0.111485
36,avg_positive_polarity,0.099425
3,n_unique_tokens,0.082193
1,n_tokens_title,0.07954
22,self_reference_max_shares,0.07351
0,timedelta,0.070081
20,kw_avg_avg,0.026897


**16- Why do you think the results of variable importances might not be reliable? (10 pts)**

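These results might not be reliable because ten or more variables show nonzero importance, with the top several nearly tied (roughly 0.07 to 0.12 each). If a few variables truly drove the number of shares, the importance would concentrate on them instead of being spread this evenly. In addition, the tree generalizes poorly (negative test r-squared), so the importances describe splits that largely fit noise in the training data.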
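One way to cross-check the tree's built-in importances is permutation importance measured on held-out data. This swaps in a different technique from the one used above. The sketch runs on synthetic data in which only the first feature carries signal; the data and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only feature 0 is related to the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 4))
y_syn = 3 * X_syn[:, 0] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)
model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in R-squared
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: "
          f"impurity={model.feature_importances_[i]:.3f} "
          f"permutation={result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

The impurity-based values in `feature_importances_` are computed from the training data alone, so an overfit tree can assign importance to noise features; the permutation scores show how much each feature matters on data the tree has not seen.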

---
### Bonus Question

**17- When the classification counterpart of the problem was analyzed, the results were decent. However, the regression problem yielded poor results. What do you think is the reason? (20 pts)**

In [None]:
# There could be a number of different reasons for this. In this
# specific case, my guess is that the regression model is sensitive
# to outliers, which would hurt the prediction accuracy. Another
# cause could be inadequate feature selection for the regression.
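The outlier sensitivity can be illustrated with made-up numbers. The share counts below are hypothetical, with one viral article, and the log transform is shown only as one common way to compress such a tail, not as something applied in this assignment.

```python
import numpy as np

# Hypothetical share counts: five ordinary articles and one viral one
shares_demo = np.array([500, 700, 900, 1200, 1500, 800000], dtype=float)

# Squared errors of a constant prediction equal to the mean
mean_pred = shares_demo.mean()
sq_err = (shares_demo - mean_pred) ** 2

# Fraction of the total squared error contributed by the single largest error
share_of_error = sq_err.max() / sq_err.sum()
print("share of squared error from one article:", round(share_of_error, 3))

# log1p compresses the tail, so the viral article no longer dwarfs the rest
log_shares = np.log1p(shares_demo)
print("raw max/median ratio:", shares_demo.max() / np.median(shares_demo))
print("log max/median ratio:", log_shares.max() / np.median(log_shares))
```

Because MSE squares each error, a single extreme value can account for most of the loss, which a regression model then chases at the expense of typical articles. A yes/no popularity label is unaffected by how extreme the largest values are.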