# Add title

## Introduction

### Background

GitHub is an online, cloud-based hosting service that manages software projects through key functionalities such as tracking, organizing, sharing, and collaborating (Braga et al., 2023). GitHub is currently the most popular social coding platform, with the popularity and quality of the users' repositories serving as strong indicators of their capacity, skills, and experiences (Hu et al., 2016). 

### Question

**Question**: Among repositories written in the three most popular programming languages in the dataset, is the popularity of a GitHub repository, measured by the number of stars it receives, associated with its programming language, its number of open issues, and whether it has GitHub Projects enabled?
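
To make the question concrete, the sketch below fits an additive linear model on simulated data. It is only an illustration under stated assumptions: the data frame `toy`, its values, and the log transformation of `Stars` are all made up here, and the model actually used for this project may differ.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

n <- 200
# Simulated stand-ins for the three explanatory variables (all values made up).
toy <- data.frame(
  Language     = factor(sample(c("Python", "JavaScript", "Java"), n, replace = TRUE)),
  Issues       = rpois(n, lambda = 20),
  Has.Projects = sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Simulated response, generated on the log scale (an assumption of this sketch).
toy$Stars <- round(exp(6 + 0.02 * toy$Issues + 0.3 * toy$Has.Projects + rnorm(n, sd = 0.5)))

# Additive linear model of log(Stars) on the three explanatory variables.
fit <- lm(log(Stars) ~ Language + Issues + Has.Projects, data = toy)
summary(fit)$coefficients
```

Each row of the coefficient table corresponds to one term of the question: two contrasts for `Language` against its baseline level, one slope for `Issues`, and one contrast for `Has.Projects`.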

### Data

The dataset describes GitHub repositories with more than 167 stars from other users; the star count indicates a repository's popularity. It consists of 215,029 rows, one per repository, and 24 attributes, such as the name, description, number of forks, number of watchers, and number of stars. The data was collected using the GitHub Search API by searching for repositories with star counts falling within a specific range.

The features include:

1. `Name`: GitHub repository name (chr)
2. `Description`: Short text description on purpose or focus of repository (chr)
3. `URL`: Unique URL to GitHub repository (chr)
4. `Created At`: Date and time repository was created on GitHub, in ISO 8601 format (chr)
5. `Updated At`: Date and time of repository's most recent update, in ISO 8601 format (chr)
6. `Homepage`: URL to homepage associated with repository (chr)
7. `Size`: Size of repository in bytes (int)
8. `Stars`: Number of stars repository received from other GitHub users (int)
9. `Forks`: Number of forks repository has from other GitHub users (int)
10. `Issues`: Total number of open issues (int)
11. `Watchers`: Total number of repository "watchers" (int)
12. `Language`: Primary programming language of repository (chr)
13. `License`: Information about software license from a license identifier (chr)
14. `Topics`: List of tags associated with repository (chr)
15. `Has Issues`: Boolean value indicating whether repository has an issue tracker enabled (chr)
16. `Has Projects`: Boolean value indicating whether repository uses GitHub Projects tool (chr)
17. `Has Downloads`: Boolean value indicating whether repository has downloadable files for users (chr)
18. `Has Wiki`: Boolean value indicating whether repository has an associated Wiki page (chr)
19. `Has Pages`: Boolean value indicating whether repository has GitHub Pages enabled (chr)
20. `Has Discussions`: Boolean value indicating whether repository has GitHub Discussions enabled (chr)
21. `Is Fork`: Boolean value indicating whether repository is a fork of another repository (chr)
22. `Is Archived`: Boolean value indicating whether repository is archived (chr)
23. `Is Template`: Boolean value indicating whether repository is a template repository (chr)
24. `Default Branch`: Name of default branch (chr)
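
Because the Boolean features are stored as character strings, they may need converting before modelling. The sketch below shows one way to do this on a made-up data frame; `toy_repos` and its values are hypothetical, and the dotted column names assume the default behaviour of `read.csv`, which replaces spaces in column names with dots.

```r
# Toy data frame mimicking three of the dataset's columns (values are made up).
toy_repos <- data.frame(
  Language     = c("Python", "Java", "JavaScript", "Python"),
  Has.Projects = c("True", "False", "True", "True"),
  Is.Archived  = c("False", "False", "True", "False"),
  stringsAsFactors = FALSE
)

# Columns that hold the strings "True" / "False".
flag_cols <- c("Has.Projects", "Is.Archived")

# Compare against the literal string "True" to obtain logical vectors.
toy_repos[flag_cols] <- lapply(toy_repos[flag_cols], function(x) x == "True")

# Treat the primary language as a categorical variable.
toy_repos$Language <- factor(toy_repos$Language)

str(toy_repos)
```

Comparing against `"True"` rather than calling `as.logical()` keeps the conversion explicit about which spelling counts as true.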

## Methods and Results

### EDA

To do:
- Check for NA values
- remove outliers?
- plot visualizations
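
The first two to-do items could be approached as in the sketch below. It runs on a small made-up data frame (`toy`), and the 1.5 × IQR rule is only one possible outlier criterion, assumed here for illustration; whether outliers should be removed at all is still an open question.

```r
# Made-up star and issue counts; NA marks a missing value.
toy <- data.frame(
  Stars  = c(200, 350, 410, 500, 620, 50000),
  Issues = c(3, NA, 12, 7, 20, 15)
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Flag values more than 1.5 * IQR beyond the quartiles (the usual boxplot rule).
q <- unname(quantile(toy$Stars, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- toy$Stars < q[1] - fence | toy$Stars > q[2] + fence

na_counts
toy$Stars[is_outlier]
```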

#### Import libraries and read data

In [1]:
# import and load libraries
library(tidyverse)
library(repr)
library(infer)
library(faux)
library(AER)
library(broom)
library(dplyr)
library(cowplot)
library(gridExtra)
library(grid)
library(GGally)
library(car)
library(mltools)
library(leaps)

# note: delete unused ones, I included all libraries that were imported from each indiv assignment

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

************
Welcome to faux. For support and examples visit:
https://debruine.github.io/faux/
- Get and set global package options with: faux_options()
************

In [2]:
# read data from github repo
github_url <- "https://github.com/alim0118/stat-301-group-project/raw/main/repositories.csv"
repositories <- read.csv(github_url, header=TRUE)
head(repositories)
nrow(repositories) # dataset size before wrangling: 215029

| | Name | Description | URL | Created.At | Updated.At | Homepage | Size | Stars | Forks | Issues | ⋯ | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template | Default.Branch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | freeCodeCamp | freeCodeCamp.org's open-source codebase and curriculum. Learn to code for free. | https://github.com/freeCodeCamp/freeCodeCamp | 2014-12-24T17:49:19Z | 2023-09-21T11:32:33Z | http://contribute.freecodecamp.org/ | 387451 | 374074 | 33599 | 248 | ⋯ | True | True | True | False | True | False | False | False | False | main |
| 2 | free-programming-books | :books: Freely available programming books | https://github.com/EbookFoundation/free-programming-books | 2013-10-11T06:50:37Z | 2023-09-21T11:09:25Z | https://ebookfoundation.github.io/free-programming-books/ | 17087 | 298393 | 57194 | 46 | ⋯ | True | False | True | False | True | False | False | False | False | main |
| 3 | awesome | 😎 Awesome lists about all kinds of interesting topics | https://github.com/sindresorhus/awesome | 2014-07-11T13:42:37Z | 2023-09-21T11:18:22Z | | 1441 | 269997 | 26485 | 61 | ⋯ | True | False | True | False | True | False | False | False | False | main |
| 4 | 996.ICU | Repo for counting stars and contributing. Press F to pay respect to glorious developers. | https://github.com/996icu/996.ICU | 2019-03-26T07:31:14Z | 2023-09-21T08:09:01Z | https://996.icu | 187799 | 267901 | 21497 | 16712 | ⋯ | False | False | True | False | False | False | False | True | False | master |
| 5 | coding-interview-university | A complete computer science study plan to become a software engineer. | https://github.com/jwasham/coding-interview-university | 2016-06-06T02:34:12Z | 2023-09-21T10:54:48Z | | 20998 | 265161 | 69434 | 56 | ⋯ | True | False | True | False | False | False | False | False | False | main |
| 6 | public-apis | A collective list of free APIs | https://github.com/public-apis/public-apis | 2016-03-20T23:49:42Z | 2023-09-21T11:22:06Z | http://public-apis.org | 5088 | 256615 | 29254 | 191 | ⋯ | True | False | True | False | False | False | False | False | False | master |


#### Data Cleaning

We select only the variable types relevant to regression analysis: continuous and categorical variables. Several character features have hundreds to thousands of unique values or are not relevant to our analysis, so we do not treat them as categorical. Madley‐Dowd et al. (2019) note guidance that analyses with more than 10% missing data are likely to be biased, so we use 10% missingness as the threshold for this analysis. `Homepage`, `Topics`, and `License` have roughly 25% to 64% of their values null or empty, so we omit these variables. This leaves `Size`, `Stars`, `Forks`, `Issues`, `Watchers`, `Language`, `Has.Issues`, `Has.Projects`, `Has.Downloads`, `Has.Wiki`, `Has.Pages`, `Has.Discussions`, `Is.Fork`, `Is.Archived`, and `Is.Template` as the continuous or categorical variables for this analysis.

In [3]:
# find number of unique values in each feature
sapply(repositories, n_distinct)

In [4]:
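# count values that are empty strings or empty lists ("[]") in each feature
missing_counts <- repositories |> 
    summarize(across(everything(), ~ sum(str_trim(.x) %in% c("", "[]"))))

# convert counts to proportions of all rows
missing_prop <- missing_counts / nrow(repositories)

# show only features with missing proportions > 10%
missing_prop |> select(where(~ .x > 0.1))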

| Homepage | License | Topics |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 0.6354492 | 0.2466598 | 0.4700389 |


In [5]:
repositories <- repositories |>
    select(-Name, -Description, -URL, -Created.At, -Updated.At, -Homepage, -License, -Topics, -Default.Branch)
head(repositories)

| | Size | Stars | Forks | Issues | Watchers | Language | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 387451 | 374074 | 33599 | 248 | 374074 | TypeScript | True | True | True | False | True | False | False | False | False |
| 2 | 17087 | 298393 | 57194 | 46 | 298393 | | True | False | True | False | True | False | False | False | False |
| 3 | 1441 | 269997 | 26485 | 61 | 269997 | | True | False | True | False | True | False | False | False | False |
| 4 | 187799 | 267901 | 21497 | 16712 | 267901 | | False | False | True | False | False | False | False | True | False |
| 5 | 20998 | 265161 | 69434 | 56 | 265161 | | True | False | True | False | False | False | False | False | False |
| 6 | 5088 | 256615 | 29254 | 191 | 256615 | Python | True | False | True | False | False | False | False | False | False |


We will focus on the three most common programming languages in the dataset: Python, JavaScript, and Java.

Repositories with no listed language (an empty `Language` value) will be excluded from the analysis. Although this blank category is the third most common value, it is not informative, since such a repository could be written in any language.

In [6]:
languages <- repositories %>%
    count(Language, sort=TRUE)
top_n(languages, 10)

Selecting by n


| Language | n |
|---|---|
| `<chr>` | `<int>` |
| Python | 34331 |
| JavaScript | 31831 |
| | 16076 |
| Java | 15298 |
| TypeScript | 11670 |
| C++ | 11391 |
| Go | 10712 |
| C | 8907 |
| C# | 7295 |
| PHP | 6741 |


In [7]:
repositories <- repositories %>%
    filter(Language %in% c("Python", "JavaScript", "Java"))
head(repositories)

| | Size | Stars | Forks | Issues | Watchers | Language | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 5088 | 256615 | 29254 | 191 | 256615 | Python | True | False | True | False | False | False | False | False | False |
| 2 | 11187 | 229569 | 40474 | 393 | 229569 | Python | True | True | True | True | False | False | False | False | False |
| 3 | 345964 | 213299 | 44842 | 1497 | 213299 | JavaScript | True | True | True | True | True | False | False | False | False |
| 4 | 6696 | 181326 | 23837 | 383 | 181326 | Python | False | False | True | False | True | False | False | False | False |
| 5 | 13363 | 175401 | 28811 | 338 | 175401 | JavaScript | True | False | True | False | False | False | False | False | False |
| 6 | 13858 | 169000 | 41926 | 118 | 169000 | Python | True | True | True | True | False | True | False | False | False |


Even after filtering, the dataset is large: 81,460 of the original 215,029 repositories use one of the three languages. We therefore draw a random sample of 3,000 repositories without replacement and continue the analysis on that sample.

In [8]:
set.seed(2024)

repo_sample <- sample_n(repositories, 3000, replace = FALSE)
head(repo_sample)
dim(repo_sample)

| | Size | Stars | Forks | Issues | Watchers | Language | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 5284 | 326 | 79 | 37 | 326 | JavaScript | True | True | True | True | False | False | False | False | False |
| 2 | 4461 | 325 | 65 | 9 | 325 | Java | True | True | True | True | False | False | False | False | False |
| 3 | 642 | 806 | 178 | 41 | 806 | JavaScript | True | True | True | True | False | False | False | False | False |
| 4 | 3578 | 2069 | 306 | 32 | 2069 | Java | True | True | True | True | False | False | False | False | False |
| 5 | 321 | 459 | 64 | 21 | 459 | Python | True | True | True | True | False | False | False | False | False |
| 6 | 287 | 191 | 54 | 3 | 191 | Java | True | True | True | True | False | False | False | False | False |
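
A quick sanity check on a simple random sample is whether it roughly preserves the language mix of the filtered data. The sketch below illustrates the idea using only the language counts reported earlier (34,331 Python, 31,831 JavaScript, 15,298 Java). It rebuilds a stand-in `language` vector rather than using the real data frame, so the sample it draws is not the same as `repo_sample`.

```r
set.seed(2024)  # seed chosen for reproducibility of this sketch only

# Rebuild just the Language column from the reported counts.
language <- rep(c("Python", "JavaScript", "Java"), times = c(34331, 31831, 15298))

# Draw 3000 values without replacement, mirroring the sample size used above.
language_sample <- sample(language, size = 3000, replace = FALSE)

# Language proportions in the full filtered data and in the sample.
full_prop   <- prop.table(table(language))
sample_prop <- prop.table(table(language_sample))

# Side-by-side comparison, rounded for readability.
round(rbind(full = full_prop, sample = sample_prop), 3)
```

If the two rows differ by more than a few percentage points, a stratified sample by `Language` might be worth considering instead.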


#### Visualization

### Methods: Plan

## Discussion

## References

Braga, P. H. P., Hébert, K., Hudgins, E. J., Scott, E. R., Edwards, B. P., Sánchez‐Reyes, L. L., Grainger, M., Foroughirad, V., Hillemann, F., Binley, A. D., Brookson, C. B., Gaynor, K. M., Sabet, S. S., Güncan, A., Weierbach, H., Gomes, D. G., & Crystal‐Ornelas, R. (2023). Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution. Methods in Ecology and Evolution, 14(6), 1364–1380. https://doi.org/10.1111/2041-210x.14108

Hu, Y., Zhang, J., Bai, X., Yu, S., & Yang, Z. (2016). Influence analysis of Github repositories. SpringerPlus, 5(1). https://doi.org/10.1186/s40064-016-2897-7

Madley‐Dowd, P., Hughes, R. A., Tilling, K., & Heron, J. (2019). The proportion of missing data should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiology, 110, 63–73. https://doi.org/10.1016/j.jclinepi.2019.02.016

Most popular GitHub repositories (Projects). (2023, October 1). Kaggle. https://www.kaggle.com/datasets/donbarbos/github-repos/data