#### Brief Description of this dataset

> `Stars`, `Issues`, and `Language` are the explanatory variables, and `Forks` is the response variable of this dataset.
>* `Stars`: The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest
>* `Forks`: The number of times the repository has been forked by other GitHub users
>* `Issues`: The total number of open issues (indicating bugs, feature requests, or discussions)
>* `Language`: The primary programming language
>
> I cleaned and wrangled the data to keep only the variables of interest and excluded outliers.
>
> I then used pairwise scatterplots, a heatmap, scatterplots, boxplots, and bar charts to visualize the data.

**Question**: This project explores the relationship between these variables through inference. Specifically, I investigate whether the number of stars and the number of issues, within each programming language, are associated with the number of times a repository is forked.
 
**Methods and Plan**: I will fit a Multiple Linear Regression (MLR) model with backward selection to explore how the number of stars, the number of issues, and the programming language are related to GitHub repository forks. The plan follows five steps: "Data Preparation", "Model Building", "Model Selection", "Interpretation", and "Assessment & Discussion". The "Interpretation" and "Assessment & Discussion" parts are described below the code snippets.

## Implementation of a proposed model

In [191]:
library(tidyverse)
library(gridExtra)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
library(dplyr)
library(MASS)
library(leaps)

### Data Preparation


In [192]:
github_data <- read_csv("repositories.csv")
github_sample <- github_data[sample(nrow(github_data), 5000),]
head(github_sample)

Rows: 215029 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): Name, Description, URL, Homepage, Language, License, Topics, Defau...
dbl  (5): Size, Stars, Forks, Issues, Watchers
lgl  (9): Has Issues, Has Projects, Has Downloads, Has Wiki, Has Pages, Has ...
dttm (2): Created At, Updated At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Name,Description,URL,Created At,Updated At,Homepage,Size,Stars,Forks,Issues,⋯,Has Issues,Has Projects,Has Downloads,Has Wiki,Has Pages,Has Discussions,Is Fork,Is Archived,Is Template,Default Branch
<chr>,<chr>,<chr>,<dttm>,<dttm>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
PythonDataScienceFullThrottle,"Downloads for my Safari Online Learning live training course Python Data Science Full Throttle: Introductory Artificial Intelligence (AI), Big Data and Cloud Case Studies",https://github.com/pdeitel/PythonDataScienceFullThrottle,2019-07-18 15:53:01,2023-09-17 12:20:26,,182027,232,217,3,⋯,True,False,True,False,False,False,False,False,False,master
compose-shimmer,A shimmer library for Android's Jetpack Compose.,https://github.com/valentinilk/compose-shimmer,2021-08-23 15:18:05,2023-09-25 14:59:09,,4354,366,22,6,⋯,True,True,True,True,False,False,False,False,False,master
a11y,Accessibility audit tooling for the web (beta),https://github.com/addyosmani/a11y,2014-10-12 17:37:36,2023-09-08 16:51:19,http://addyosmani.github.io/a11y/,43015,1710,90,31,⋯,True,True,True,False,True,False,False,True,False,master
htmlpagedom,jQuery-inspired DOM manipulation extension for Symfony's Crawler,https://github.com/wasinger/htmlpagedom,2012-11-16 22:12:10,2023-09-18 05:46:37,,607,336,63,12,⋯,True,True,True,True,False,False,False,False,False,master
rest-api-tutorial,This is a sample source code for the article published on Toptal: https://www.toptal.com/nodejs/secure-rest-api-in-nodejs,https://github.com/makinhs/rest-api-tutorial,2018-05-19 23:48:59,2023-09-23 19:43:58,,367,416,314,4,⋯,True,True,True,True,False,False,False,False,False,master
PlutoUI.jl,,https://github.com/JuliaPluto/PlutoUI.jl,2020-04-13 18:35:03,2023-09-20 08:32:09,https://featured.plutojl.org/basic/plutoui.jl,2736,281,51,95,⋯,True,True,True,True,True,False,False,False,False,main


In [193]:
# Include only four main languages
github_sample <- subset(github_sample, Language %in% c("Python", "JavaScript", "Java", "C++"))

# Excluding the outliers
github_sample <- github_sample %>%
                 filter(Forks <= 250 & Stars <= 1000 & Issues <= 60)

# Data we will use 
github_data <- github_sample[, c("Stars", "Forks", "Issues","Language")]
head(github_data)

Stars,Forks,Issues,Language
<dbl>,<dbl>,<dbl>,<chr>
830,130,27,Java
184,92,37,Python
506,89,33,JavaScript
385,66,7,Java
220,29,23,Java
437,74,16,Python


#### Create dummy variable

In [194]:
github_data <- github_data %>%
  mutate_at(vars(Language), as.factor) %>%
  mutate(dummy_language = as.numeric(Language) - 1)

github_data <- github_data[, c("Stars", "Forks", "Issues","dummy_language")]


head(github_data)

Stars,Forks,Issues,dummy_language
<dbl>,<dbl>,<dbl>,<dbl>
830,130,27,1
184,92,37,3
506,89,33,2
385,66,7,1
220,29,23,1
437,74,16,3
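
The integer coding above places the four languages on a single numeric scale (C++ = 0, Java = 1, JavaScript = 2, Python = 3, following R's alphabetical level order), so a regression on `dummy_language` estimates one slope per unit step of that code rather than a separate effect per language. A common alternative is indicator (treatment) coding, which `lm()` applies automatically when it is given a factor. The sketch below contrasts the two encodings on a small made-up data frame; `toy` and `indicators` are illustrative names and are not part of the analysis.

```r
# Hypothetical toy data using the same four language levels as the filter step.
toy <- data.frame(
  Language = factor(c("C++", "Java", "JavaScript", "Python", "Java"))
)

# Integer coding as used above: alphabetically sorted levels mapped to 0..3.
toy$dummy_language <- as.numeric(toy$Language) - 1

# Indicator (treatment) coding: one 0/1 column per non-reference level,
# with the first level ("C++") acting as the reference category.
indicators <- model.matrix(~ Language, data = toy)

print(toy)
print(indicators)
```

With indicator coding, each language coefficient is read as a difference in expected forks relative to the reference language, at the cost of two extra parameters.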


### Model Building

#### Split Training/Test data

In [195]:
set.seed(123)  

training_data <- sample_n(github_data, size = nrow(github_data) * 0.70, replace = FALSE)
head(training_data)

test_data <- anti_join(github_data, training_data)
head(test_data)

Stars,Forks,Issues,dummy_language
<dbl>,<dbl>,<dbl>,<dbl>
318,53,0,1
321,52,1,3
339,76,1,1
389,227,4,2
192,41,5,1
423,129,23,3


[1m[22mJoining with `by = join_by(Stars, Forks, Issues, dummy_language)`


Stars,Forks,Issues,dummy_language
<dbl>,<dbl>,<dbl>,<dbl>
506,89,33,2
328,110,9,3
686,109,4,3
188,30,9,3
307,21,0,2
194,20,7,1


#### Backward Selection

In [196]:
backward <- regsubsets(
    x = Forks ~ ., 
    nvmax = 3,  
    data = training_data,  
    method = "backward"  
)

backward_sum <- summary(backward)

bwd_summary_df <- data.frame(
   n_input_variables = 1:3,
   RSQ = backward_sum$rsq,
   RSS = backward_sum$rss,
   ADJ.R2 = backward_sum$adjr2
)

backward_sum
bwd_summary_df



Subset selection object
Call: regsubsets.formula(x = Forks ~ ., nvmax = 3, data = training_data, 
    method = "backward")
3 Variables  (and intercept)
               Forced in Forced out
Stars              FALSE      FALSE
Issues             FALSE      FALSE
dummy_language     FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: backward
         Stars Issues dummy_language
1  ( 1 ) "*"   " "    " "           
2  ( 1 ) "*"   "*"    " "           
3  ( 1 ) "*"   "*"    "*"           

n_input_variables,RSQ,RSS,ADJ.R2
<int>,<dbl>,<dbl>,<dbl>
1,0.1983168,2382288,0.1975377
2,0.219473,2319420,0.2179545
3,0.2227237,2309760,0.2204532


The summary table reports $R^2$ (RSQ), the Residual Sum of Squares (RSS), and adjusted $R^2$ (ADJ.R2) for the best model of each size. Because these models are nested, $R^2$ cannot decrease and RSS cannot increase as predictors are added, so adjusted $R^2$, which penalizes model size, is the more informative criterion. It rises from 0.1975 with `Stars` alone to 0.2180 after adding `Issues` and to 0.2205 with all three predictors, so the full model is retained. Most of the gain comes from `Stars` and `Issues`, while `dummy_language` adds very little, and even the full model explains only about 22% of the variability in GitHub repository `Forks`.
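
As a check on how the penalty works, adjusted $R^2$ can be recomputed from $R^2$ with the standard formula $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $p$ is the number of predictors. The sketch below uses the RSQ values from the table and assumes $n = 1031$ training rows, which is inferred from the 1027 residual degrees of freedom plus 4 coefficients reported for the full model.

```r
# R-squared values copied from the summary table above.
rsq <- c(0.1983168, 0.2194730, 0.2227237)
n_predictors <- 1:3

# Assumed number of training rows (1027 residual df + 4 estimated coefficients).
n <- 1031

# Adjusted R-squared: shrinks R-squared according to the number of predictors.
adj_rsq <- 1 - (1 - rsq) * (n - 1) / (n - n_predictors - 1)
round(adj_rsq, 7)
```

The recomputed values agree with the ADJ.R2 column up to rounding, which also shows how small the penalty is at this sample size.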

#### Multiple Linear Regression (MLR)

In [200]:
mlr_model_train <- lm(Forks ~ Stars + Issues + dummy_language, data = training_data)
summary(mlr_model_train)

mlr_model_test <- lm(Forks ~ Stars + Issues + dummy_language, data = test_data)
summary(mlr_model_test)


Call:
lm(formula = Forks ~ Stars + Issues + dummy_language, data = training_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-117.64  -32.54  -10.63   25.00  167.75 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    31.020713   4.240457   7.315 5.16e-13 ***
Stars           0.121476   0.008083  15.028  < 2e-16 ***
Issues          0.616458   0.114057   5.405 8.06e-08 ***
dummy_language -3.013388   1.454023  -2.072   0.0385 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47.42 on 1027 degrees of freedom
Multiple R-squared:  0.2227,	Adjusted R-squared:  0.2205 
F-statistic: 98.09 on 3 and 1027 DF,  p-value: < 2.2e-16



Call:
lm(formula = Forks ~ Stars + Issues + dummy_language, data = test_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-97.508 -31.168  -8.827  20.114 190.483 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    33.68620    6.69601   5.031 7.14e-07 ***
Stars           0.10486    0.01216   8.621  < 2e-16 ***
Issues          0.65375    0.18404   3.552 0.000423 ***
dummy_language -3.02783    2.31515  -1.308 0.191614    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 46.9 on 439 degrees of freedom
Multiple R-squared:  0.1949,	Adjusted R-squared:  0.1894 
F-statistic: 35.43 on 3 and 439 DF,  p-value: < 2.2e-16
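
Refitting the model on the test data shows whether the coefficients are stable across the two subsets, but it does not measure how well the training model predicts unseen repositories. A complementary check is to keep the training fit and score it on the held-out rows. The sketch below illustrates this on simulated data, since the original data frames are not reproduced here; `sim`, `training_sim`, `test_sim`, and the simulated coefficients are all hypothetical.

```r
set.seed(123)

# Simulated stand-in data (hypothetical); the real training_data and test_data
# would take the place of training_sim and test_sim.
n <- 200
sim <- data.frame(
  Stars = runif(n, 100, 1000),
  Issues = rpois(n, 10),
  dummy_language = sample(0:3, n, replace = TRUE)
)
sim$Forks <- 30 + 0.12 * sim$Stars + 0.6 * sim$Issues + rnorm(n, sd = 45)

# 70/30 split by row index.
train_idx <- sample(seq_len(n), size = floor(0.70 * n))
training_sim <- sim[train_idx, ]
test_sim <- sim[-train_idx, ]

# Fit on the training rows only.
fit <- lm(Forks ~ Stars + Issues + dummy_language, data = training_sim)

# Predict the held-out rows with the training-fit coefficients.
pred <- predict(fit, newdata = test_sim)

# Root mean squared error in and out of sample.
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test <- sqrt(mean((test_sim$Forks - pred)^2))
c(train = rmse_train, test = rmse_test)
```

A test RMSE close to the training RMSE would suggest the model generalizes about as well as it fits; a much larger test RMSE would point to overfitting.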


#### Visualization (tidy table)

In [201]:
tidy_mlr_train <- tidy(mlr_model_train)

tidy_mlr_test <- tidy(mlr_model_test)

tidy_mlr_train
tidy_mlr_test

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),31.0207133,4.240456947,7.315418,5.161904e-13
Stars,0.1214763,0.008083309,15.028038,2.740832e-46
Issues,0.6164582,0.114056632,5.404843,8.064204e-08
dummy_language,-3.0133875,1.454023292,-2.072448,0.03847274


term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),33.6861976,6.69601294,5.030784,7.135238e-07
Stars,0.104856,0.01216279,8.621047,1.215514e-16
Issues,0.6537481,0.18403664,3.552271,0.0004232729
dummy_language,-3.0278318,2.31514997,-1.307834,0.1916141



The tidy tables (`tidy_mlr_train` and `tidy_mlr_test`) summarize the estimated relationship between the predictors (`Stars`, `Issues`, `dummy_language`) and the response (`Forks`) for the models fitted to the training and testing datasets. Both fits show similar trends. The coefficients for `Stars` (0.121 and 0.105) and `Issues` (0.616 and 0.654) are positive and significant at the 0.001 level: holding the other predictors fixed, one additional star is associated with roughly 0.1 more forks, and one additional open issue with roughly 0.6 more forks. The coefficient for `dummy_language` is about -3.0 in both fits, indicating a negative association between the integer language code and the number of forks, but the evidence is much weaker: its p-value is 0.038 in the training fit and 0.192 in the testing fit, so it is significant at the 5% level only in the training data.
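
To make the coefficients concrete, the training estimates can be applied by hand to a single repository. The sketch below uses the first row shown in the prepared data (830 stars, 27 issues, Java coded as 1); the variable names are illustrative, and the calculation is only a point prediction with no measure of uncertainty.

```r
# Coefficient estimates copied from tidy_mlr_train above.
coefs <- c(
  intercept = 31.0207133,
  Stars = 0.1214763,
  Issues = 0.6164582,
  dummy_language = -3.0133875
)

# One repository from the prepared data: 830 stars, 27 issues, Java (coded 1).
repo <- c(Stars = 830, Issues = 27, dummy_language = 1)

# Fitted value: intercept plus each coefficient times its predictor value.
predicted_forks <- unname(coefs["intercept"] + sum(coefs[names(repo)] * repo))
predicted_forks  # about 145.5 forks; the observed value for this row is 130
```

The gap of roughly 15 forks is well within the residual standard error of about 47 reported for the training fit.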
