<div style="text-align: center;" >
<h1 style="margin-top: 0.2em; margin-bottom: 0.1em;">Assignment 2</h1>
</div>


## Social Impact Theory with Reddit Data

In [1]:
# install requirements
! pip install pandas
! pip install numpy
! pip install matplotlib
! pip install scikit-learn
! pip install praw
! pip install tqdm



### Import requirements
The cell below imports all necessary dependencies. Make sure they are installed (see the cell above).

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import json
import praw
import os
#from dotenv import load_dotenv
from praw.models import Submission, Comment
from tqdm import tqdm

### Exercise 1: Load Reddit data

#### Sign up for the Reddit API
* In this part of the assignment we will collect data using the Reddit API.
* First, you need to sign up for the Reddit API. For this, follow the steps outlined in [this guide](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c). You will need to create an app at [this link](https://old.reddit.com/prefs/apps/).
* The [PRAW package](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) has been installed; it provides a convenient wrapper for the Reddit API.

#### Collect the data
* After you have created the Reddit instance with the `praw` package, extract the 200 most popular subreddits and store them in a list.
* Extract the top 20 `hottest` submissions from each of your selected subreddits, ignoring `pinned` submissions.
* Store the number of subscribers (`subscribers`) and the name (`display_name`) of each subreddit.
* For each of the submissions, extract the `score` (number of upvotes). Afterwards, calculate the mean score, which will serve as the social impact, and store it with the number of subscribers and the name of each subreddit.
* Hint: You can store your results in a JSON file and turn it into a dataframe.

In [3]:
# Plug in your ID, secret and user agent (the user agent can be a short description)
REDDIT_ID = "YOUR_CLIENT_ID"            # your Reddit client ID
REDDIT_SECRET = "YOUR_CLIENT_SECRET"    # your Reddit client secret
USER_AGENT = "myApp for data analysis"  # this is an example, you can change it

reddit = praw.Reddit(
    client_id=REDDIT_ID,
    client_secret=REDDIT_SECRET,
    user_agent=USER_AGENT
)
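
With the `reddit` instance in place, the collection steps listed above could be sketched as follows. This is one possible approach, not the reference solution: the helper name `collect_subreddit_stats`, the record keys and the margin of five extra posts are assumptions. The check covers both `stickied` and `pinned` because which flag marks a pinned post is assumed here, not confirmed by the assignment text.

```python
from statistics import mean


def collect_subreddit_stats(reddit, n_subreddits=200, n_posts=20):
    """Return one record per subreddit: name, subscribers, mean hot-post score.

    `reddit` is expected to be a praw.Reddit instance (hypothetical helper).
    """
    records = []
    for subreddit in reddit.subreddits.popular(limit=n_subreddits):
        scores = []
        # Request a few extra posts so that skipping pinned ones
        # can still leave n_posts submissions (the margin is an assumption).
        for submission in subreddit.hot(limit=n_posts + 5):
            is_pinned = getattr(submission, "stickied", False) or getattr(
                submission, "pinned", False
            )
            if is_pinned:
                continue
            scores.append(submission.score)
            if len(scores) == n_posts:
                break
        if scores:  # skip subreddits without any usable submission
            records.append(
                {
                    "name": subreddit.display_name,
                    "subscribers": subreddit.subscribers,
                    "mean_score": mean(scores),
                }
            )
    return records
```

Calling `pd.DataFrame(collect_subreddit_stats(reddit))` would then give a dataframe with one row per subreddit. Collecting 200 subreddits issues many API requests, so expect the call to take a while.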

### Exercise 2: Visualize distributions and scatter plots

#### 2.1 Distribution of the number of subscribers
Plot the histogram of the number of subscribers of each subreddit in your dataset. Repeat this with a logarithmic `y` scale. Which one is more skewed?  

You can use pandas' [`hist`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) method with the keyword argument `log` for a logarithmic scale, or matplotlib's [`hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) method (don't forget to create a figure first), again with the keyword argument `log`.
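
As a rough illustration of the matplotlib route, the sketch below draws both histograms side by side. The `subscribers` array is synthetic stand-in data (a log-normal sample chosen only for illustration), so replace it with the column from your own dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the `subscribers` column; NOT real Reddit data.
rng = np.random.default_rng(0)
subscribers = rng.lognormal(mean=13, sigma=1.5, size=200)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.hist(subscribers, bins=30)
ax_lin.set_title("Linear y scale")

ax_log.hist(subscribers, bins=30, log=True)  # log=True puts the counts on a log axis
ax_log.set_title("Logarithmic y scale")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("Number of subscribers")
    ax.set_ylabel("Number of subreddits")
fig.tight_layout()
```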

#### 2.2 Distribution of social impact

Repeat the above task for the social impact of your subreddits, again with both a linear and a logarithmic `y` scale. Again, which one is more skewed?

#### 2.3 Number of subscribers vs social impact
Create a scatter plot with the number of subscribers of each subreddit on the x axis and the social impact of each subreddit on the y axis. Both axes should be in logarithmic scale. Is there a relationship?  

Again, you can use pandas' [`scatter`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html) method with `logx` and `logy` set to `True`, or you can use matplotlib's [`scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) method. With matplotlib, use the `set_xscale` and `set_yscale` methods of the axes to set both scales to `'log'`.
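
A minimal matplotlib sketch of such a log-log scatter plot is shown below. Both arrays are synthetic stand-ins, and the exponent of 0.5 linking them is an arbitrary choice made only so the plot shows some trend; it is not a result.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data; the 0.5 exponent is arbitrary, NOT an empirical finding.
rng = np.random.default_rng(1)
subscribers = rng.lognormal(mean=13, sigma=1.5, size=200)
mean_score = subscribers ** 0.5 * rng.lognormal(mean=0, sigma=0.8, size=200)

fig, ax = plt.subplots()
ax.scatter(subscribers, mean_score, alpha=0.6)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Number of subscribers")
ax.set_ylabel("Social impact (mean score)")
```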

### Exercise 3: Fit and visualize a regression model *(2 points)*

#### 3.1 Fit a linear model

First, create two new columns: `SI`, which stores the logarithm of the mean score, and `FC`, which stores the logarithm of the number of subscribers. For this you can use numpy's log function `np.log(...)`.  

Now fit a linear regression model with sklearn. For this, use the class [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to create a linear regression instance and then call its `fit` method. `SI` is used as the dependent variable (target) and `FC` as the independent variable (feature).  

Print the model intercept and coefficient. For this you can use the model's attributes `coef_` and `intercept_`.
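
The sketch below walks through these steps on toy data. The column names `subscribers` and `mean_score` are assumptions about how you stored your data, and the values are constructed so that the log-log relationship is exactly linear, which makes it easy to check that the fit recovers the chosen slope and intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data built so that log(mean_score) = 0.5 * log(subscribers) + 1 exactly.
subscribers = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
df = pd.DataFrame(
    {
        "subscribers": subscribers,
        "mean_score": np.exp(0.5 * np.log(subscribers) + 1.0),
    }
)

df["FC"] = np.log(df["subscribers"])
df["SI"] = np.log(df["mean_score"])

model = LinearRegression()
model.fit(df[["FC"]], df["SI"])  # features must be 2-D, hence the double brackets

slope = model.coef_[0]
intercept = model.intercept_
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

On real data the points will not lie on a line, so the fitted values will be estimates with some uncertainty, which Exercise 4 explores.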

#### 3.2 Plot the results
Now recreate the scatter plot from 2.3 and add a line plot that shows the fitted regression line of the model. For this, use the intercept and the coefficient (slope). Because the model was fitted on the logarithms, plot `FC` against `SI` so that the points and the line share the same scale. Does the line fit the data as you expected?  

It is easier to use matplotlib here to add the line plot to the scatter plot. For the line plot you can use matplotlib's [`plot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) method. For the x values you can use numpy's [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html#numpy.linspace) function to generate evenly spaced x values in a given range. The y values can be calculated from the intercept and the slope as follows:  
$$
y = \text{slope} \cdot x + \text{intercept}
$$
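
One way to draw the line is sketched below on synthetic log-scale data. To keep the example self-contained, `np.polyfit` stands in for the sklearn model from 3.1; with your own model you would take the slope from `coef_` and the intercept from `intercept_` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the FC and SI columns; NOT real Reddit data.
rng = np.random.default_rng(2)
fc = rng.normal(13, 1.5, size=200)
si = 0.5 * fc + 1.0 + rng.normal(0, 0.8, size=200)

# Stand-in for the fitted model: polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(fc, si, deg=1)

x_line = np.linspace(fc.min(), fc.max(), 100)
y_line = slope * x_line + intercept

fig, ax = plt.subplots()
ax.scatter(fc, si, alpha=0.6, label="subreddits")
ax.plot(x_line, y_line, color="red", label="fitted line")
ax.set_xlabel("FC = log(subscribers)")
ax.set_ylabel("SI = log(mean score)")
ax.legend()
```

Since `FC` and `SI` are already logarithms, the axes stay linear here; the picture matches a log-log plot of the raw values up to the tick labels.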

#### 3.3 Calculate quality of the fit
Calculate the residuals of the model and save them in a vector. This can be done with the following formula:
$$
\text{residual} = y_{true} - y_{pred}
$$
where $y_{true}$ are the true values of the dependent variable (in our case `SI`) and $y_{pred}$ are the predicted values with the model. To get the predicted values of the model you can use the [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) method of the model.  

Afterwards calculate the variance of the residuals and the variance of the social impact variable. For this you can use numpy's [`var`](https://numpy.org/doc/stable/reference/generated/numpy.var.html) function. Is the variance of the residuals lower than the variance of the dependent variable? Calculate the proportion of variance explained ([R-squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)) using the previously calculated variances.
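
The calculation could look roughly like the sketch below, again on synthetic stand-in data. It uses the relation $R^2 = 1 - \mathrm{Var}(\text{residual}) / \mathrm{Var}(y_{true})$, which holds for a least-squares fit with an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the FC and SI columns; NOT real Reddit data.
rng = np.random.default_rng(3)
fc = rng.normal(13, 1.5, size=200)
si = 0.5 * fc + 1.0 + rng.normal(0, 0.8, size=200)

X = fc.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
model = LinearRegression().fit(X, si)

residuals = si - model.predict(X)  # y_true - y_pred
var_residuals = np.var(residuals)
var_si = np.var(si)
r_squared = 1 - var_residuals / var_si

print(f"Var(residuals) = {var_residuals:.3f}")
print(f"Var(SI)        = {var_si:.3f}")
print(f"R-squared      = {r_squared:.3f}")
```

As a cross-check, `model.score(X, si)` returns the same R-squared value.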

#### 3.4 Distribution of residuals
Plot the histogram of the residuals. Do they look normally distributed?  

Again you can use matplotlib as before to plot the histogram.

### Exercise 4: Bootstrapping *(2 points)*

#### 4.1 One sample
For bootstrapping, we first look at creating one sample. For this, use the subscriber and social impact dataframe from before and sample random rows with replacement. This can be done with pandas' [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method and the keyword argument `replace` set to `True`.  

Fit a new linear regression model with this new dataset. What is the value of the coefficient and the intercept now?
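
A single bootstrap sample and refit might be sketched as follows. The dataframe is synthetic stand-in data, and `frac=1` is one way (an assumption about the intended sample size) to draw as many rows as the original dataframe has.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the dataframe with FC and SI columns.
rng = np.random.default_rng(4)
df = pd.DataFrame({"FC": rng.normal(13, 1.5, size=200)})
df["SI"] = 0.5 * df["FC"] + 1.0 + rng.normal(0, 0.8, size=200)

# frac=1 draws as many rows as the original; replace=True allows repeated rows.
boot = df.sample(frac=1, replace=True, random_state=0)

boot_model = LinearRegression().fit(boot[["FC"]], boot["SI"])
print(f"slope = {boot_model.coef_[0]:.3f}, intercept = {boot_model.intercept_:.3f}")
```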

#### 4.2 Many bootstrap samples
Now repeat this 10000 times, each time saving the resulting coefficient in a vector.
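
The loop could be organised as in the sketch below. To keep it fast and self-contained it resamples row indices with numpy and fits with `np.polyfit` as a swapped-in replacement for `LinearRegression`; both give the least-squares slope. The function name `bootstrap_slopes` is hypothetical, and the demo uses 1000 repetitions on synthetic data only to keep the run short.

```python
import numpy as np


def bootstrap_slopes(fc, si, n_boot=10_000, seed=0):
    """Return the slope fitted on each of n_boot resamples of (fc, si)."""
    rng = np.random.default_rng(seed)
    n = len(fc)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
        slopes[i] = np.polyfit(fc[idx], si[idx], deg=1)[0]
    return slopes


# Synthetic stand-ins for the FC and SI columns; NOT real Reddit data.
rng = np.random.default_rng(5)
fc = rng.normal(13, 1.5, size=200)
si = 0.5 * fc + 1.0 + rng.normal(0, 0.8, size=200)

slopes = bootstrap_slopes(fc, si, n_boot=1_000)
print(f"mean = {slopes.mean():.3f}, std = {slopes.std():.3f}")
```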

#### 4.3 Bootstrap histogram

Plot a histogram of the bootstrapped coefficients and add a vertical line at the value of the coefficient of the original model (from exercise 3.1). To add a vertical line to the histogram in matplotlib you can use the [`axvline`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html) method.  

How far is the line from the center of the histogram?
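
A possible layout for this plot is sketched below. Both `boot_slopes` and `original_slope` are placeholder values invented for the illustration, so substitute your own bootstrap vector and fitted coefficient.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values; NOT results. Use your bootstrap vector and original slope.
rng = np.random.default_rng(6)
boot_slopes = rng.normal(0.5, 0.04, size=10_000)
original_slope = 0.5

fig, ax = plt.subplots()
ax.hist(boot_slopes, bins=50)
ax.axvline(original_slope, color="red", linestyle="--", label="original model")
ax.axvline(boot_slopes.mean(), color="black", label="bootstrap mean")
ax.set_xlabel("Bootstrapped coefficient")
ax.set_ylabel("Count")
ax.legend()

# Distance between the original coefficient and the center of the histogram.
offset = original_slope - boot_slopes.mean()
print(f"offset = {offset:.4f}")
```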

#### 4.4 Bootstrap scatterplot
* Repeat the plot from exercise 3.2
* Generate 500 bootstrap samples and save the resulting intercepts and coefficients in an array.
* Add a line for each of these 500 fitted models to your plot. Make sure to set the `alpha` parameter low, so that the plot remains readable.
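
The three steps above could be combined as in this sketch, which uses synthetic stand-in data and `np.polyfit` in place of the sklearn model to stay self-contained. The `alpha` value of 0.02 is just a starting point to tune by eye.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the FC and SI columns; NOT real Reddit data.
rng = np.random.default_rng(7)
fc = rng.normal(13, 1.5, size=200)
si = 0.5 * fc + 1.0 + rng.normal(0, 0.8, size=200)

n_boot = 500
params = np.empty((n_boot, 2))  # columns: slope, intercept
for i in range(n_boot):
    idx = rng.integers(0, len(fc), size=len(fc))  # resample rows with replacement
    params[i] = np.polyfit(fc[idx], si[idx], deg=1)

x_line = np.linspace(fc.min(), fc.max(), 100)

fig, ax = plt.subplots()
ax.scatter(fc, si, alpha=0.6)
for slope, intercept in params:
    # A low alpha keeps 500 overlapping lines readable as a band.
    ax.plot(x_line, slope * x_line + intercept, color="red", alpha=0.02)
ax.set_xlabel("FC = log(subscribers)")
ax.set_ylabel("SI = log(mean score)")
```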

### Exercise 5: Interpretation
* Do you find any relationship between social impact and the number of subscribers?
* How sure are you that it is larger than zero? How sure are you that it is lower than 1?
* Is the value of the relationship within the ranges predicted by Social Impact Theory?
* Under that relationship, if I have 1000 subscribers, how many more subscribers do I need to double my social impact?
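
For the last question, note that a linear fit in log-log space corresponds to a power law: social impact is proportional to subscribers raised to the power of the slope. Doubling the impact therefore means multiplying the number of subscribers by $2^{1/\text{slope}}$. The sketch below encodes that arithmetic; the function name is hypothetical and the slope of 0.5 is a made-up example value, not a result from the data.

```python
def extra_subscribers_to_double(current, slope):
    """Extra subscribers needed to double impact if impact ~ subscribers**slope."""
    if slope <= 0:
        raise ValueError("doubling is only reachable for a positive slope")
    return current * (2 ** (1 / slope) - 1)


# Hypothetical slope of 0.5; plug in your own fitted coefficient instead.
print(extra_subscribers_to_double(1000, 0.5))  # 3000.0
```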