**Module 4 PCA**

**Nosson Weissman**

**DAV 6150 - Data Science**

**Professor James Topor**

**Summer, 2022**

**DAV 6150 Module 4 Assignment** *Feature Selection & Dimensionality Reduction* 

\*\*\* **You may work in small groups of no more than three (3) people for this Assignment  \*\*\***

When the number of explanatory variables is relatively large with respect to the number of observations contained within a data set, data science practitioners need to know how to effectively reduce the number of explanatory variables required for the intended model. For this assignment your primary task is to apply feature selection and/or dimensionality reduction techniques to identify the explanatory variables to be included within a linear regression model that predicts **the number of times an online news article will be shared**. The data set you will be using is sourced from the UC Irvine machine learning archive:  

·  https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The data set is comprised of 39,797 observations and 61 attributes. Please refer to the UCI web page for further details on these variables. The **shares** variable will serve as the response variable for your regression model. As such, you are to apply your feature selection / dimensionality reduction expertise to the remaining 60 attributes for purposes of identifying the explanatory variables that you believe will be most useful when included in a linear regression model that estimates **shares**. 

Once you are comfortable in your understanding of the various data attributes, get started on the assignment as follows: 

1) Load the provided M4\_Data.csv file to your DAV 6150 Github Repository.  

2) Then, using a Jupyter Notebook, read the data set from your Github repository and load it into a Pandas dataframe. 

3) Using your Python skills, perform some basic exploratory data analysis (EDA) to ensure you understand the nature of each of the variables (including the response variable).  Your EDA writeup should include any insights you are able to derive from your statistical analysis of the attributes and the accompanying exploratory graphics you create (e.g., bar plots, box plots, histograms, line plots, etc.). You should also try to identify some preliminary predictive inferences, e.g., do any of the explanatory variables appear to be relatively more “predictive” of the response variable? There are a variety of ways you can potentially identify such relationships between the explanatory variables and the response variable. It is up to you as the data science practitioner to decide how you go about your EDA, including selecting appropriate statistical metrics to be calculated + which types of exploratory graphics to make use of. Your goal should be to provide an EDA that is thorough and succinct without it being so detailed that a reader will lose interest in it. 

4) Using your Python skills, apply your knowledge of feature selection and dimensionality reduction to the 60 candidate explanatory variables to identify variables that you believe will prove to be relatively useful within the required linear regression model. Your work here should reflect some of the knowledge you have gained via your EDA work. While selecting your features, be sure to consider the tradeoff between model performance and model simplification, e.g., if you are reducing the complexity of your model, are you sacrificing too much in the way of Adjusted R^2 (or some other performance measure)? The ways in which you implement your feature selection and/or dimensionality reduction decisions are up to you as a data science practitioner to determine: will you use filtering methods? PCA? Stepwise search? etc. It is up to you to decide upon your own preferred approach. Be sure to include an explanatory narrative that justifies your decision making process. 

5) Train/cross validate your model and report on its performance.  
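As a rough illustration of steps 4 and 5, the sketch below shows one possible way to weigh model simplification against performance: it standardizes the predictors, projects them onto a varying number of principal components, and cross validates a linear regression on each projection. It runs on synthetic data rather than M4\_Data.csv, so the array shapes, the coefficients used to build the response, the candidate component counts, and the `adjusted_r2` helper are all assumptions made for the example, not part of the assignment. PCA is only one of the options named above; filtering or stepwise search could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def adjusted_r2(r2, n_obs, n_features):
    """Hypothetical helper: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Synthetic stand-in for the 60 candidate predictors (NOT the real data).
rng = np.random.default_rng(0)
n_obs, n_features = 500, 60
X = rng.normal(size=(n_obs, n_features))
# Assumed response: driven by the first three columns plus noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=n_obs)

# Compare cross-validated R^2 for several numbers of retained components.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for k in (5, 20, 60):
    # Scaling and PCA sit inside the pipeline so they are refit on each fold.
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[k] = scores.mean()
    print(f"{k:>2} components: mean CV R^2 = {results[k]:.3f}")

# In-sample adjusted R^2 for one of the reduced models.
reduced = make_pipeline(StandardScaler(), PCA(n_components=20), LinearRegression())
reduced.fit(X, y)
print("Adjusted R^2 (20 components):", adjusted_r2(reduced.score(X, y), n_obs, 20))
```

Keeping the scaler and PCA inside the pipeline means each fold learns its transformation from the training portion only, which avoids leaking information from the held-out fold into the performance estimate.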

**Your deliverable for this assignment** is your Jupyter Notebook. It should contain a combination of Python code cells and explanatory narratives contained within properly formatted Markdown cells. The Notebook should contain (at a minimum) the following sections (including the relevant Python code for each section): 

1) **Introduction (5 Points)**:  Summarize the problem + explain the steps you plan to take to address the problem 
2) **Exploratory Data Analysis (30 Points)**: Explain + present your EDA work including any conclusions you draw from your analysis including any preliminary predictive inferences. This section should include any Python code used for the EDA. 
3) **Feature Selection / Dimensionality Reduction (45 Points)**: Explain + present your feature selection / dimensionality work, including any Python code used as part of that process. 
4) **Regression Model Evaluation (15 Points)**: Explain + present your linear regression model and discuss its accuracy. This section should include any Python code used to construct + evaluate your regression model. 
5) **Conclusions (5 Points)** 

**Your Jupyter Notebook deliverable should be similar to that of a publication-quality / professional caliber document and should include clearly labeled graphics, high-quality formatting, clearly defined section and sub-section headers, and be free of spelling and grammar errors. Furthermore, your Python code should include succinct explanatory comments.**  

Upload your Jupyter Notebook within the provided M4 Assignment Canvas submission portal. Be sure to save your Notebook using the following nomenclature: **first initial\_last name\_M4\_assn** (e.g., J\_Smith\_M4\_assn). ***Small groups should identify all group members at the start of the Jupyter Notebook, and each team member should submit their own copy of the team’s work within Canvas.*** 

In [2]:
# Load the libraries used throughout the notebook
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import urllib

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/codepharmer/AI-6150/main/M4%20PCA/M4_Data.csv')
df

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.100000,0.70,-0.350000,-0.600,-0.200000,0.500000,-0.187500,0.000000,0.187500,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.70,-0.118750,-0.125,-0.100000,0.000000,0.000000,0.500000,0.000000,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.575130,1.0,0.663866,3.0,1.0,1.0,...,0.100000,1.00,-0.466667,-0.800,-0.133333,0.000000,0.000000,0.500000,0.000000,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.80,-0.369697,-0.600,-0.166667,0.000000,0.000000,0.500000,0.000000,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.540890,19.0,19.0,20.0,...,0.033333,1.00,-0.220192,-0.500,-0.050000,0.454545,0.136364,0.045455,0.136364,505
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39639,http://mashable.com/2014/12/27/samsung-app-aut...,8.0,11.0,346.0,0.529052,1.0,0.684783,9.0,7.0,1.0,...,0.100000,0.75,-0.260000,-0.500,-0.125000,0.100000,0.000000,0.400000,0.000000,1800
39640,http://mashable.com/2014/12/27/seth-rogen-jame...,8.0,12.0,328.0,0.696296,1.0,0.885057,9.0,7.0,3.0,...,0.136364,0.70,-0.211111,-0.400,-0.100000,0.300000,1.000000,0.200000,1.000000,1900
39641,http://mashable.com/2014/12/27/son-pays-off-mo...,8.0,10.0,442.0,0.516355,1.0,0.644128,24.0,1.0,12.0,...,0.136364,0.50,-0.356439,-0.800,-0.166667,0.454545,0.136364,0.045455,0.136364,1900
39642,http://mashable.com/2014/12/27/ukraine-blasts/,8.0,6.0,682.0,0.539493,1.0,0.692661,10.0,1.0,1.0,...,0.062500,0.50,-0.205246,-0.500,-0.012500,0.000000,0.000000,0.500000,0.000000,1100
