In [2]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
library(janitor)
library(scales)
library(latex2exp)
library(tidymodels)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(tibble)

## Best Variables for Prediction of GitHub Stars

Create an electronic report with a maximum of 2000 words (excluding citations) using Jupyter. The report should include the posed question, conducted analysis, and derived conclusion. Only one team member needs to submit this report. It is not required to include all tasks completed by each group member in their individual assignments. Make sure to reach a consensus among all team members on the final content of the report. If needed, consult your TA and Instructor for further guidance.

### Introduction

GitHub is a website that provides a graphical interface for version control with Git, allowing developers to maintain their code and track versions of a project as it progresses. Projects are hosted in repositories, which GitHub users can star if they like them. Survey evidence indicates that practitioners view stars as the most useful measure of a repository's popularity (Borges & Valente, 2018). As GitHub becomes more prevalent among developers for version control, it is worth asking which variables best predict that popularity. Recent work has approached this problem, such as Moid et al. (2021), who used a Random Forest regressor, but to our knowledge no published work applies ridge regression to it. We explore this below.

Our question will be the following:

**Which explanatory variables best predict stars on a GitHub repository?**

### Methods and Results

In this section, you will include:

a) “Exploratory Data Analysis (EDA)”

Demonstrate that the dataset can be read into R.
Clean and wrangle your data into a tidy format.
Plot the relevant raw data, tailoring your plot to address your question.
Make sure to explore the association of the explanatory variables with the response.
Any summary tables that are relevant to your analysis.
Be sure not to print output that takes up a lot of screen space.
Your EDA must be comprehensive with high quality plots.

Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
If included, describe the “Feature Selection” process and how and why you choose the covariates of your final model.
Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., are the coefficients significant? How does the model fit the data)?
A careful model assessment must be conducted.
If prediction is the project's aim, describe the test data used or how it was created.
Ensure your tables and/or figures are labelled with a figure/table number.

In [5]:
github_data <- read_csv("repositories.csv")
head(github_data)

Rows: 215029 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): Name, Description, URL, Homepage, Language, License, Topics, Defau...
dbl  (5): Size, Stars, Forks, Issues, Watchers
lgl  (9): Has Issues, Has Projects, Has Downloads, Has Wiki, Has Pages, Has ...
dttm (2): Created At, Updated At

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Name,Description,URL,Created At,Updated At,Homepage,Size,Stars,Forks,Issues,⋯,Has Issues,Has Projects,Has Downloads,Has Wiki,Has Pages,Has Discussions,Is Fork,Is Archived,Is Template,Default Branch
<chr>,<chr>,<chr>,<dttm>,<dttm>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
freeCodeCamp,freeCodeCamp.org's open-source codebase and curriculum. Learn to code for free.,https://github.com/freeCodeCamp/freeCodeCamp,2014-12-24 17:49:19,2023-09-21 11:32:33,http://contribute.freecodecamp.org/,387451,374074,33599,248,⋯,True,True,True,False,True,False,False,False,False,main
free-programming-books,:books: Freely available programming books,https://github.com/EbookFoundation/free-programming-books,2013-10-11 06:50:37,2023-09-21 11:09:25,https://ebookfoundation.github.io/free-programming-books/,17087,298393,57194,46,⋯,True,False,True,False,True,False,False,False,False,main
awesome,😎 Awesome lists about all kinds of interesting topics,https://github.com/sindresorhus/awesome,2014-07-11 13:42:37,2023-09-21 11:18:22,,1441,269997,26485,61,⋯,True,False,True,False,True,False,False,False,False,main
996.ICU,Repo for counting stars and contributing. Press F to pay respect to glorious developers.,https://github.com/996icu/996.ICU,2019-03-26 07:31:14,2023-09-21 08:09:01,https://996.icu,187799,267901,21497,16712,⋯,False,False,True,False,False,False,False,True,False,master
coding-interview-university,A complete computer science study plan to become a software engineer.,https://github.com/jwasham/coding-interview-university,2016-06-06 02:34:12,2023-09-21 10:54:48,,20998,265161,69434,56,⋯,True,False,True,False,False,False,False,False,False,main
public-apis,A collective list of free APIs,https://github.com/public-apis/public-apis,2016-03-20 23:49:42,2023-09-21 11:22:06,http://public-apis.org,5088,256615,29254,191,⋯,True,False,True,False,False,False,False,False,False,master


In [6]:
# standardize column names, then drop identifier, free-text, date, and categorical columns
github_data_clean <- clean_names(github_data) |>
    select(-name, -description, -url, -created_at, -updated_at, -homepage,
           -language, -license, -topics, -default_branch)

colnames(github_data_clean)

In [7]:
na_counts <- github_data_clean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "na_count")

na_counts

column,na_count
<chr>,<int>
size,0
stars,0
forks,0
issues,0
watchers,0
has_issues,0
has_projects,0
has_downloads,0
has_wiki,0
has_pages,0


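Ridge regression penalizes the size of the coefficients, so the fit depends on the scale of each predictor. We therefore standardize the numeric variables below to mean 0 and standard deviation 1. Note that this also standardizes the response, `stars`, since it is a numeric column.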
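
For reference, the standard ridge objective makes this scale dependence explicit. The notation is ours and is only a sketch: $y_i$ is the response, $x_{ij}$ the $j$-th predictor for observation $i$, and $\lambda \ge 0$ the penalty strength.

```latex
\hat{\beta}^{\text{ridge}}
  = \underset{\beta_0,\, \beta}{\arg\min}
    \; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

Because the penalty treats every $\beta_j$ identically, a predictor measured in large units (which needs only a small coefficient) is penalized less than the same predictor measured in small units. Putting the predictors on a common scale removes that arbitrariness.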

In [10]:
# scaling numeric features

numeric_cols <- sapply(github_data_clean, is.numeric)

github_data_clean[numeric_cols] <- scale(github_data_clean[numeric_cols])

head(github_data_clean)

size,stars,forks,issues,watchers,has_issues,has_projects,has_downloads,has_wiki,has_pages,has_discussions,is_fork,is_archived,is_template
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
0.4743299,93.41788,26.84285,1.06903622,93.41788,True,True,True,False,True,False,False,False,False
-0.05295532,74.46148,45.82565,0.04109144,74.46148,True,False,True,False,True,False,False,False,False
-0.07523044,67.34891,21.11945,0.11742398,67.34891,True,False,True,False,True,False,False,False,False
0.19008643,66.82391,17.10648,84.8516245,66.82391,False,False,True,False,False,False,False,True,False
-0.04738725,66.1376,55.67305,0.0919798,66.1376,True,False,True,False,False,False,False,False,False
-0.07003823,63.99702,23.34719,0.7789726,63.99702,True,False,True,False,False,False,False,False,False
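
In the six rows printed above, the standardized `stars` and `watchers` values are identical, which suggests that `watchers` may simply duplicate the response. Whether this holds for the full dataset is an assumption worth checking before modelling. A minimal sketch of such a check on a toy data frame follows; the `stars` and `forks` values are taken from the first rows of the raw table, while the `watchers` values are assumed equal to `stars` purely for illustration.

```r
# Toy stand-in for the repository data; the watchers values are an assumption.
toy_repos <- data.frame(
  stars    = c(374074, 298393, 269997),
  forks    = c(33599, 57194, 26485),
  watchers = c(374074, 298393, 269997)
)

# Find predictors that are exact copies of the response column.
response   <- "stars"
predictors <- setdiff(names(toy_repos), response)
is_copy <- vapply(
  toy_repos[predictors],
  function(col) isTRUE(all.equal(col, toy_repos[[response]])),
  logical(1)
)
duplicate_cols <- predictors[is_copy]
duplicate_cols
```

If the same check flags `watchers` on the real data, dropping it from the predictors would avoid a model that trivially recovers `stars`.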


In [12]:
set.seed(1234)

repo_split <- initial_split(github_data_clean, prop = 0.7, strata = stars)
training_repos <- training(repo_split)
testing_repos <- testing(repo_split)
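
The notebook standardizes the full dataset before splitting, so the test rows contribute to the means and standard deviations used for scaling. One common alternative is to split the unscaled data first, estimate the scaling parameters on the training rows only, and reuse them for the test rows. The sketch below shows this on a made-up data frame; the column names and the simple index-based split are illustrative and are not the notebook's code.

```r
set.seed(1234)

# Hypothetical unscaled data standing in for the repository table.
toy <- data.frame(
  size     = rexp(200, rate = 1 / 5000),
  forks    = rpois(200, lambda = 30),
  has_wiki = sample(c(TRUE, FALSE), 200, replace = TRUE)
)

# Simple 70/30 split by row index (the notebook uses initial_split instead).
train_idx <- sample(nrow(toy), size = 0.7 * nrow(toy))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Estimate centre and scale from the training rows only.
num_cols    <- names(toy)[sapply(toy, is.numeric)]
train_means <- colMeans(train[num_cols])
train_sds   <- apply(train[num_cols], 2, sd)

# Apply the same training statistics to both sets.
train[num_cols] <- scale(train[num_cols], center = train_means, scale = train_sds)
test[num_cols]  <- scale(test[num_cols],  center = train_means, scale = train_sds)
```

With this approach the test columns will not have exactly mean 0 and standard deviation 1, which is expected: they are scaled as unseen data would be.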

In [13]:
# splitting the data into explanatory variables, X, and response, Y, to tune lambda

training_Y <- training_repos |>
    select(stars) |>
    as.matrix()

training_X <- training_repos |>
    select(-stars) |>
    as.matrix()

testing_Y <- testing_repos |>
    select(stars) |>
    as.matrix()

testing_X <- testing_repos |>
    select(-stars) |>
    as.matrix()
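
The matrices above are in the form that `glmnet` expects. A minimal sketch of how lambda might then be tuned is given below, assuming 10-fold cross-validation with `cv.glmnet()` and `alpha = 0` for the ridge penalty. It runs on simulated matrices rather than the repository data, and all object names prefixed with `sim_` are placeholders.

```r
library(glmnet)

set.seed(1234)

# Simulated stand-ins for training_X / training_Y and testing_X / testing_Y.
n_train <- 200
n_test  <- 80
p       <- 5
sim_X_train <- matrix(rnorm(n_train * p), nrow = n_train)
sim_X_test  <- matrix(rnorm(n_test * p),  nrow = n_test)
true_beta   <- c(2, -1, 0.5, 0, 0)
sim_Y_train <- drop(sim_X_train %*% true_beta) + rnorm(n_train)
sim_Y_test  <- drop(sim_X_test  %*% true_beta) + rnorm(n_test)

# alpha = 0 selects the ridge penalty; lambda is chosen by 10-fold CV.
ridge_cv <- cv.glmnet(x = sim_X_train, y = sim_Y_train, alpha = 0, nfolds = 10)

# Lambda with the lowest cross-validated mean squared error.
best_lambda <- ridge_cv$lambda.min

# Coefficients and out-of-sample RMSE at that lambda.
ridge_coefs <- coef(ridge_cv, s = "lambda.min")
test_pred   <- predict(ridge_cv, newx = sim_X_test, s = "lambda.min")
test_rmse   <- sqrt(mean((sim_Y_test - test_pred)^2))
```

Ridge shrinks coefficients toward zero without setting them exactly to zero, so comparing predictors by coefficient magnitude is only meaningful when the predictors are on a common scale.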

### Discussion

In this section, you’ll interpret the results you obtained in the previous section with respect to the main question/goal of your project.

Summarize what you found and the implications/impact of your findings.
If relevant, discuss whether your results were what you expected to find.
Discuss how your model could be improved;
Discuss future questions/research this study could lead to.

### References

Borges, Hudson, and Marco Tulio Valente. “What’s in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform.” Journal of Systems and Software, vol. 146, Dec. 2018, pp. 112–129, www.sciencedirect.com/science/article/pii/S0164121218301961, https://doi.org/10.1016/j.jss.2018.09.016. Accessed 4 Nov. 2019.

Moid, Mohammed Abdul, et al. “Predicting Stars on Open-Source GitHub Projects.” 2021 Smart Technologies, Communication and Robotics (STCR), 9 Oct. 2021, ieeexplore.ieee.org/document/9588891, https://doi.org/10.1109/stcr51658.2021.9588891. Accessed 5 Dec. 2024.