#### Brief Description of this dataset

> `Stars`, `Issues`, and `Language` are the explanatory variables, and `Forks` is the response variable of this dataset.
>* `Stars`: The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest
>* `Forks`: The number of times the repository has been forked by other GitHub users
>* `Issues`: The total number of open issues (indicating bugs, feature requests, or discussions)
>* `Language`: The primary programming language
>
> I cleaned and wrangled the data to keep only the variables of interest and excluded outliers.
>
> I then used pairwise scatterplots, a heatmap, scatterplots, boxplots, and bar charts to visualize the data.

**Question**: This project explores the relationship between these variables. Specifically, I will investigate whether, within each programming language, the number of stars and the number of open issues are related to the number of times a repository is forked.

## Methods 

### Multiple Linear Regression Analysis with Backward Selection 

* #### Why is this method appropriate?

Multiple linear regression combined with backward selection is an appropriate method for examining the relationship between the explanatory variables (`Stars`, `Issues`, and `Language`) and the response variable, `Forks`. It assesses the effect of several predictors on the response simultaneously, while systematically eliminating less relevant variables to obtain a more parsimonious model.

Backward selection starts with a full model containing all potential predictors and iteratively removes variables that do not contribute significantly to improving model fit. By systematically reducing the model based on statistical criteria, backward selection helps identify the most important predictors while reducing the risk of overfitting and improving model interpretability. 

Additionally, multiple linear regression quantifies the linear association between each predictor and the response variable while holding the other predictors fixed, which helps describe how the number of forks varies with stars and issues across programming languages. The approach offers a structured and interpretable framework for exploring these relationships.


* #### Assumptions Required for Multiple Linear Regression

To apply multiple linear regression with backward selection, the following assumptions must be met:
- **Linearity**: The relationship between the explanatory variables and the response variable should be approximately linear.
- **Independence of errors**: The errors (residuals) from the regression model should be independent of each other.
- **Homoscedasticity**: The variance of the errors should be constant across all levels of the predictor variables.
- **Normality of errors**: The errors should be normally distributed.
- **No Multicollinearity**: There should be no exact linear relationships among the predictor variables.
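
Written out, the model these assumptions refer to can be sketched as follows. The symbols $\beta_j$, $\gamma_k$, $\varepsilon_i$, and $\sigma^2$, and the choice of language $k = 1$ as the baseline level, are notation introduced here for illustration only:

$$
\text{Forks}_i = \beta_0 + \beta_1\,\text{Stars}_i + \beta_2\,\text{Issues}_i + \sum_{k=2}^{K} \gamma_k\,\mathbb{1}[\text{Language}_i = k] + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)
$$

Linearity concerns the right-hand side being linear in the coefficients, while independence, homoscedasticity, and normality are all statements about the error term $\varepsilon_i$.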

* ####  Potential limitations or weaknesses

**Underfitting** may occur if the backward selection process removes important predictor variables, leaving a model that is too simple. Moreover, if the **assumptions of linear regression are violated**, the coefficient estimates may be biased and the resulting inferences incorrect or unreliable.

## Plan

**Data Preparation** 
- Use the data from Assignment 2, which is already cleaned and wrangled.
- Use dummy variables to represent the categorical variable `Language`.
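
As a rough illustration of the dummy-variable step, the sketch below uses pandas on a few made-up rows. The values and the name `repos` are hypothetical stand-ins, not the actual Assignment 2 data:

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned GitHub dataset.
repos = pd.DataFrame({
    "Stars": [120, 45, 300, 80],
    "Issues": [10, 3, 25, 7],
    "Language": ["Python", "Java", "Python", "Go"],
    "Forks": [30, 9, 70, 15],
})

# drop_first=True omits one level (the alphabetically first, here "Go")
# as the baseline, which avoids perfect collinearity with the intercept.
encoded = pd.get_dummies(repos, columns=["Language"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Each remaining dummy coefficient is then read as a difference in expected `Forks` relative to the baseline language, holding `Stars` and `Issues` fixed.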

**Model Building** 
- Split the data into a training set and a testing set.
- Perform multiple linear regression with backward selection:
  - Start with a full model including all predictor variables (`Stars`, `Issues`, `Language`).
  - Use backward selection to remove non-significant variables based on p-values.



**Model Selection**
- Assess the overall fit of the model using metrics such as $R^2$
  (coefficient of determination), adjusted $R^2$, and the F-test.
- Check for violations of the MLR assumptions.
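
For reference, the two fit metrics named above can be computed directly from observed and fitted values. The following is a minimal sketch using the standard definitions. The function names and the four toy data points are made up for illustration:

```python
import numpy as np


def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y, y_hat, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors (intercept excluded)."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - n_predictors - 1)


# Hypothetical observed and fitted values from a one-predictor model.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_obs, y_fit), adjusted_r_squared(y_obs, y_fit, n_predictors=1))
```

Because adjusted $R^2$ can decrease when an uninformative predictor is added, it is the more suitable of the two for comparing models of different sizes during selection.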

**Interpretation**
- Interpret the estimated regression coefficients of the model to understand the relationships between the predictors (`Stars`, `Issues`, `Language`) and the response variable (`Forks`).
- Add visualizations.

**Assessment & Discussion**
* Assess the overall model fit using metrics such as adjusted $R^2$ to understand how much of the variability in `Forks` of GitHub repositories the model explains.
* Discuss the strengths and limitations of the model by referring to the potential limitations of the chosen method.