# 🚀 Hackathon: GitHub Data for the Forks Case Study 

This notebook guides you through collecting and visualizing GitHub Pull Request (PR) data for any repository.

You’ll fetch PR data using a Python script, then use a Jupyter notebook to explore and visualize the results.

Use this notebook or create your own to document any code or findings during the hackathon. Don't worry about making it perfect; it's your own notebook!

## 📝 Step 1: Prepare Your Environment

1. **Clone or download this repo** to your local machine.
2. **Install Anaconda** (if you haven’t already).
3. **Get a GitHub personal access token.**
    Go to GitHub Settings > Developer settings > Personal access tokens.
    Click "Generate new token" (classic), select the repo scope, and copy the token.

Save your token in a file named gh_keys.txt in the same folder as the scripts. 
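If you want to use the token in your own notebook cells, a minimal sketch for reading it back is shown here. It assumes gh_keys.txt contains only the token on a single line; the exact format the fetch script expects is not stated in this guide, and `load_github_token` is a hypothetical helper name.

```python
from pathlib import Path


def load_github_token(path="gh_keys.txt"):
    """Return the first non-empty line of the token file, without whitespace.

    Assumption: the file holds the bare token on its own line.
    """
    text = Path(path).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    raise ValueError(f"No token found in {path}")
```

Calling `load_github_token()` with no arguments looks for gh_keys.txt in the current working directory. Keep this file out of version control so the token is not shared by accident.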

## 🏃‍♂️ Step 2: Fetch and Plot PR Data
Run the following command to fetch PR data for your chosen repository:

In [None]:
%run fetch_pr.py

You can edit main_fetch.py to set the repository, date range, and other options.

Running the script creates two files:

1. pull_requests.csv — all PRs and their metadata
2. monthly_counts.csv — monthly counts of open, approved (merged), and closed PRs
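Before analyzing either file, it can help to check what it actually contains. The sketch below is one way to do that with pandas; `preview_csv` is a hypothetical helper, and it makes no assumption about column names because they depend on how the fetch script is configured.

```python
import pandas as pd


def preview_csv(path):
    """Load a CSV file, print its size and column names, and return the DataFrame."""
    df = pd.read_csv(path)
    print(f"{path}: {df.shape[0]} rows x {df.shape[1]} columns")
    print("Columns:", list(df.columns))
    return df
```

For example, `prs = preview_csv("pull_requests.csv")` followed by `prs.head()` shows the first few PRs, and the same call works for monthly_counts.csv.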

## 👥 Step 3: What else can we do with this data? 

1. Contributor Analysis:
    How many people contribute to this repo? (Count)
    Who are the top contributors? 
    Are contributors associated with an organization? If so, which?

2. Merging PRs:
    How long does it take for a PR to be merged?
    Who approves PRs? Are they from a particular organization?

3. More Data:
    Are other fields required?
    What other data can we get from the GitHub GraphQL API?
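As one possible starting point for the "More Data" questions, the sketch below builds a GitHub GraphQL query that asks for who merged each PR and who reviewed it. This is not the query the fetch script uses; the selected fields and the helper names `build_payload` and `run_query` are assumptions made for illustration.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Fields chosen for illustration: merge times, who merged, and who reviewed.
PR_QUERY = """
query($owner: String!, $name: String!, $count: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: $count, states: [MERGED]) {
      nodes {
        number
        createdAt
        mergedAt
        author { login }
        mergedBy { login }
        reviews(first: 5) {
          nodes { state author { login } }
        }
      }
    }
  }
}
"""


def build_payload(owner, name, count=10):
    """Return the JSON-serializable request body for the query above."""
    return {
        "query": PR_QUERY,
        "variables": {"owner": owner, "name": name, "count": count},
    }


def run_query(payload, token):
    """POST the payload to the GitHub GraphQL endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `run_query(build_payload("owner-name", "repo-name"), token)` needs network access and a valid token. Adding or removing fields inside `nodes { ... }` is the simplest way to explore what else the API returns.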

To address these questions, start simple and build on your code. For example, the code below reads the PR data and counts the unique contributors with fewer than 10 PRs and with 10 or more.

In [None]:
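import ast
import pandas as pd

df = pd.read_csv("pull_requests.csv")
# Each 'author' cell is a dict stored as text; ast.literal_eval parses it
# safely, whereas eval would execute whatever code the cell contains.
authors = df['author'].apply(lambda x: ast.literal_eval(x)['login'] if pd.notnull(x) else None)
# Number of PRs per contributor (PRs with a missing author are dropped).
author_counts = authors.value_counts()
less_than_10 = (author_counts < 10).sum()
at_least_10 = (author_counts >= 10).sum()
print(f"Unique contributors with <10 PRs: {less_than_10}")
print(f"Unique contributors with >=10 PRs: {at_least_10}")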
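The same build-up approach works for the merge-time question in Step 3. The sketch below computes how long each merged PR stayed open. It assumes pull_requests.csv has creation and merge timestamp columns named `createdAt` and `mergedAt`; those names are a guess based on GitHub's GraphQL field names, so check your CSV header and pass the real names if they differ.

```python
import pandas as pd


def merge_durations(df, created_col="createdAt", merged_col="mergedAt"):
    """Return the time from creation to merge for every merged PR.

    The column names are assumptions; PRs without a merge timestamp are skipped.
    """
    created = pd.to_datetime(df[created_col], utc=True, errors="coerce")
    merged = pd.to_datetime(df[merged_col], utc=True, errors="coerce")
    # Unmerged PRs produce NaT here and are removed by dropna().
    return (merged - created).dropna()
```

With the data loaded as `df`, `merge_durations(df).median()` gives a typical time to merge, and `merge_durations(df).describe()` shows the spread.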