# Case Study

## Data

The dataset covers GitHub repositories with more than 167 stars; a repository's star count indicates its popularity among other users. It contains 215,029 rows, one per repository, and 24 attributes, including the name, description, and the numbers of forks, watchers, and stars. The data was collected with the GitHub Search API by querying for repositories whose star counts fall within a specific range.
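The collection script is not part of this case study, so the following is only a sketch of what a star-range query against the Search API could look like. The helper `build_star_query()` and the example bounds are hypothetical; the `stars:MIN..MAX` qualifier and the `/search/repositories` endpoint are the ones GitHub documents. The Search API returns at most 1,000 results per query, which is one plausible reason to collect the data range by range.

```r
# Hypothetical helper: build a Search API URL for one star range.
# It only constructs the URL; no request is sent.
build_star_query <- function(min_stars, max_stars, page = 1, per_page = 100) {
  stopifnot(min_stars <= max_stars, per_page <= 100)
  qualifier <- sprintf("stars:%d..%d", min_stars, max_stars)
  paste0(
    "https://api.github.com/search/repositories",
    "?q=", utils::URLencode(qualifier, reserved = TRUE),
    "&sort=stars&order=desc",
    "&per_page=", per_page,
    "&page=", page
  )
}

# Example: first page of repositories with 167 to 500 stars
query_url <- build_star_query(167, 500)
print(query_url)
```

Narrower ranges keep each query under the result cap, at the cost of more requests.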

The features include:

1. `Name`: GitHub repository name (chr)
2. `Description`: Short text description of the purpose or focus of repository (chr)
3. `URL`: Unique URL to GitHub repository (chr)
4. `Created At`: Date and time repository was created on GitHub, in ISO 8601 format (chr)
5. `Updated At`: Date and time of repository's most recent update, in ISO 8601 format (chr)
6. `Homepage`: URL to homepage associated with repository (chr)
7. `Size`: Size of repository in kilobytes, the unit the GitHub API reports (int)
8. `Stars`: Number of stars repository received from other GitHub users (int)
9. `Forks`: Number of forks repository has from other GitHub users (int)
10. `Issues`: Total number of open issues (int)
11. `Watchers`: Total number of repository "watchers" (int)
12. `Language`: Primary programming language of repository (chr)
13. `License`: Information about software license from a license identifier (chr)
14. `Topics`: List of tags associated with repository (chr)
15. `Has Issues`: Boolean value indicating whether repository has an issue tracker enabled (chr)
16. `Has Projects`: Boolean value indicating whether repository uses GitHub Projects tool (chr)
17. `Has Downloads`: Boolean value indicating whether repository has downloadable files for users (chr)
18. `Has Wiki`: Boolean value indicating whether repository has an associated Wiki page (chr)
19. `Has Pages`: Boolean value indicating whether repository has GitHub Pages enabled (chr)
20. `Has Discussions`: Boolean value indicating whether repository has GitHub Discussions enabled (chr)
21. `Is Fork`: Boolean value indicating whether repository is a fork of another repository (chr)
22. `Is Archived`: Boolean value indicating whether repository is archived (chr)
23. `Is Template`: Boolean value indicating whether repository is a template repository (chr)
24. `Default Branch`: Name of default branch (chr)
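Because the timestamps and boolean flags arrive as character columns, they usually need converting before analysis. A minimal sketch in base R, using a hypothetical two-row data frame `toy` whose values are copied from the first two rows of the dataset:

```r
# Toy data frame (hypothetical name) with two of the character columns
toy <- data.frame(
  Created.At   = c("2014-12-24T17:49:19Z", "2013-10-11T06:50:37Z"),
  Has.Projects = c("True", "False"),
  stringsAsFactors = FALSE
)

# Parse ISO 8601 UTC timestamps (trailing "Z") into POSIXct
toy$Created.At <- as.POSIXct(toy$Created.At,
                             format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Turn "True"/"False" strings into logical values
toy$Has.Projects <- toy$Has.Projects == "True"

str(toy)
```

The same pattern would apply to `Updated At` and to the other `Has ...` and `Is ...` columns.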

## Question

**Question**: Can we predict the number of stars a repository receives, indicating its popularity among other GitHub users, from the number of forks, number of watchers, and whether the repository uses the GitHub Projects tool? Furthermore, does the usage of GitHub Projects significantly affect the repository's popularity?

This question targets the response variable `Stars`, the number of stars (popularity) of a GitHub repository. The explanatory variables are `Forks`, the number of times other users have forked the repository; `Watchers`, the number of users monitoring the repository for activity updates or changes; and `Has Projects`, a binary indicator of whether the repository uses the GitHub Projects tool for task management. The analysis also lets us explore whether popularity differs between repositories that use GitHub Projects and those that do not.
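One way to phrase this question as a model is an additive linear regression, sketched below on simulated data. This is an assumption for illustration: the case study has not committed to a model at this point, and a count model such as Poisson regression would be an equally reasonable candidate. The data frame `sim` and its coefficients are invented and say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated explanatory variables (illustrative values only)
sim <- data.frame(
  Forks        = rpois(n, 50),
  Watchers     = rpois(n, 30),
  Has.Projects = rbinom(n, 1, 0.5)
)

# Simulated response with known coefficients plus noise
sim$Stars <- 100 + 3 * sim$Forks + 2 * sim$Watchers +
  20 * sim$Has.Projects + rnorm(n, sd = 10)

# Additive linear model matching the question's variables
fit <- lm(Stars ~ Forks + Watchers + Has.Projects, data = sim)
print(summary(fit)$coefficients)
```

In such a model, the `Has.Projects` row of the coefficient table tests whether using GitHub Projects is associated with a difference in stars once forks and watchers are held fixed.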

## EDA and Visualization

In [1]:
# load libraries
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(AER)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘cowplot’


The following object is masked from ‘package:lubridate’:

    stamp


Loading required package: car

Loading required package: carData




In [2]:
# read data
repo_data <- read.csv("repositories.csv", header=TRUE)
head(repo_data)
nrow(repo_data) # dataset size before wrangling: 215029

| | Name | Description | URL | Created.At | Updated.At | Homepage | Size | Stars | Forks | Issues | ⋯ | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template | Default.Branch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | freeCodeCamp | freeCodeCamp.org's open-source codebase and curriculum. Learn to code for free. | https://github.com/freeCodeCamp/freeCodeCamp | 2014-12-24T17:49:19Z | 2023-09-21T11:32:33Z | http://contribute.freecodecamp.org/ | 387451 | 374074 | 33599 | 248 | ⋯ | True | True | True | False | True | False | False | False | False | main |
| 2 | free-programming-books | :books: Freely available programming books | https://github.com/EbookFoundation/free-programming-books | 2013-10-11T06:50:37Z | 2023-09-21T11:09:25Z | https://ebookfoundation.github.io/free-programming-books/ | 17087 | 298393 | 57194 | 46 | ⋯ | True | False | True | False | True | False | False | False | False | main |
| 3 | awesome | 😎 Awesome lists about all kinds of interesting topics | https://github.com/sindresorhus/awesome | 2014-07-11T13:42:37Z | 2023-09-21T11:18:22Z | | 1441 | 269997 | 26485 | 61 | ⋯ | True | False | True | False | True | False | False | False | False | main |
| 4 | 996.ICU | Repo for counting stars and contributing. Press F to pay respect to glorious developers. | https://github.com/996icu/996.ICU | 2019-03-26T07:31:14Z | 2023-09-21T08:09:01Z | https://996.icu | 187799 | 267901 | 21497 | 16712 | ⋯ | False | False | True | False | False | False | False | True | False | master |
| 5 | coding-interview-university | A complete computer science study plan to become a software engineer. | https://github.com/jwasham/coding-interview-university | 2016-06-06T02:34:12Z | 2023-09-21T10:54:48Z | | 20998 | 265161 | 69434 | 56 | ⋯ | True | False | True | False | False | False | False | False | False | main |
| 6 | public-apis | A collective list of free APIs | https://github.com/public-apis/public-apis | 2016-03-20T23:49:42Z | 2023-09-21T11:22:06Z | http://public-apis.org | 5088 | 256615 | 29254 | 191 | ⋯ | True | False | True | False | False | False | False | False | False | master |


In [19]:
# clean data

# remove rows that contain missing (NA) values
repo_data <- na.omit(repo_data) # blank strings (e.g. in Homepage) are not NA, so those rows are kept
head(repo_data)
nrow(repo_data)


“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”
“number of rows of result is not a multiple of vector length (arg 2)”


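| Name | Description | URL | Created.At | Updated.At | Homepage | Size | Stars | Forks | Issues | ⋯ | Has.Issues | Has.Projects | Has.Downloads | Has.Wiki | Has.Pages | Has.Discussions | Is.Fork | Is.Archived | Is.Template | Default.Branch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |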
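`na.omit()` drops only rows that contain `NA`; an empty string such as a blank `Homepage` is an ordinary value and survives. The sketch below, on a hypothetical three-row data frame `toy`, shows the difference and how blanks could be treated as missing if that were wanted:

```r
# Hypothetical toy data: one blank Homepage, one missing Stars value
toy <- data.frame(
  Name     = c("a", "b", "c"),
  Homepage = c("https://996.icu", "", "http://public-apis.org"),
  Stars    = c(10L, 20L, NA),
  stringsAsFactors = FALSE
)

# Default behaviour: only row "c" (NA in Stars) is dropped
kept <- na.omit(toy)

# Stricter variant: recode blank strings as NA first, then drop
strict <- toy
strict$Homepage[strict$Homepage == ""] <- NA
strict <- na.omit(strict)

print(nrow(kept))
print(nrow(strict))

# When reading from disk, read.csv(..., na.strings = c("NA", ""))
# would apply the same recoding to every column at load time.
```

Whether blanks should count as missing depends on the column; for `Homepage`, a blank most likely just means the repository has no homepage.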


In [3]:
# convert the "Has Projects" variable to numeric (0: false, 1: true)
repo_data <- repo_data |>
    mutate(Has.Projects = ifelse(Has.Projects == "False", 0, 1)) 
head(repo_data$Has.Projects)
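Note that `ifelse(Has.Projects == "False", 0, 1)` codes every value other than `"False"` as 1, including blanks or unexpected spellings. A slightly stricter sketch, using a hypothetical helper `flag_to_int()`, validates the values before converting:

```r
# Hypothetical helper: convert "True"/"False" strings to 1/0,
# failing loudly on anything else (NA is passed through as NA).
flag_to_int <- function(x) {
  bad <- setdiff(unique(x[!is.na(x)]), c("True", "False"))
  if (length(bad) > 0) {
    stop("unexpected flag values: ", paste(bad, collapse = ", "))
  }
  as.integer(x == "True")
}

flags <- c("True", "False", "False", NA)
print(flag_to_int(flags))

# For comparison: the ifelse() form silently codes a blank string as 1
print(ifelse(c("True", "False", "") == "False", 0, 1))
```

If the column is known to contain only `"True"` and `"False"`, both approaches give the same coding.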