# __GitHub Commit Analysis__

#### __Contributors__: Dustin Yang, Patrick Wilky, Joseph Wagner, Patrick Steveson
-----------------

### Part 1: Understanding the Problem

#### Mission Statement
To analyze open source projects and determine common features that influence the continued growth in contributions and popularity.

#### Customer Research

#### Overview
- Extract data from GitHub using GraphQL
- Data Cleaning and Feature Engineering
- Get Dynamic Time Warping distance matrix for Hierarchical Clustering
- Natural Language Processing (NLP) to measure the average sentiment over a repo's lifespan

-----------
### Part 2: Data Scraping & Cleaning

#### Kaggle vs BigQuery vs REST APIs vs GraphQL
- Kaggle and BigQuery did not provide the information we sought
- REST APIs: Returned information we did not need, which meant more cleaning afterwards
- GraphQL: We define exactly what we want to query, which meant less cleaning afterwards

#### Data Scraping - [GraphQL](https://developer.github.com/v4/explorer/) vs [REST API](https://developer.github.com/v3/issues/)

GraphQL
1. Specify a query for each of the 1,000 repos: Issues, Pull Requests, Commits, & Stargazers


2. Create functions to query time-series data
    - Automate the query to handle errors and save the result as a dataframe


3. Explore a repo's data before querying

In [None]:
# Example: Querying for Stargazers
st_query = '''
{{
  repositoryOwner(login: "{owner}") {{
    id
    login
    repository(name: "{name}") {{
      id
      name
      createdAt
      updatedAt
      description
      licenseInfo {{
        spdxId
      }}
      stargazers(first:100) {{
        totalCount
        pageInfo {{
          endCursor
          hasNextPage
        }}
        edges {{
          starredAt
          node {{
            createdAt
            updatedAt
            id
            login
            company
          }}
        }}
      }}
    }}
  }}
  rateLimit {{
    limit
    cost
    remaining
    resetAt
  }}
}}
'''

# Pagination is automated with the cursor: keep querying from end_cursor
# while hasNextPage is true, and stop once it is false.
st_variables = {
    "end_cursor": "",
    "owner":"",
    "name":""
}
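
The pagination described in the comment above can be sketched as a small loop. This is a minimal illustration, not the project's actual code: `fetch_all_edges` and `fetch_page` are hypothetical names, and `fetch_page(cursor)` is assumed to send the query (passing the cursor to the connection's `after` argument) and return the parsed `stargazers` object. A fake in-memory fetcher stands in for the API so the sketch runs offline.

```python
def fetch_all_edges(fetch_page, max_pages=1000):
    """Collect edges from every page of a cursor-paginated connection.

    fetch_page(cursor) is a hypothetical callable: it receives None for the
    first page, or the previous endCursor, and returns a dict shaped like
    the `stargazers` object in the query (with `edges` and `pageInfo`).
    """
    edges, cursor = [], None
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        page = fetch_page(cursor)
        edges.extend(page["edges"])
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        cursor = info["endCursor"]
    return edges


# Offline stand-in for the API: three pages keyed by the cursor that requests them.
_FAKE_PAGES = {
    None: {"edges": [{"starredAt": "2014-01-02T00:00:00Z"}],
           "pageInfo": {"endCursor": "c1", "hasNextPage": True}},
    "c1": {"edges": [{"starredAt": "2014-02-03T00:00:00Z"}],
           "pageInfo": {"endCursor": "c2", "hasNextPage": True}},
    "c2": {"edges": [{"starredAt": "2014-03-04T00:00:00Z"}],
           "pageInfo": {"endCursor": None, "hasNextPage": False}},
}


def fake_fetch(cursor):
    return _FAKE_PAGES[cursor]


all_edges = fetch_all_edges(fake_fetch)
```

Injecting the fetcher keeps error handling (retries, rate-limit waits) separate from the paging logic, which is one way to structure the "automate query to handle errors" step.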

#### Messy data
1. Raw data collected from the query

<img src="./pics/messy_data.png">

#### Cleaned data
1. Create functions to unpack nested dictionaries


2. Replace None with NaN.


3. Filter out open issues and keep closed ones, so that duration can be engineered later


4. Remove any stargazer data from before 2013, because it is unreliable (according to GitHub)

<img src="./pics/clean_data.png">
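
A minimal sketch of cleaning steps 1, 2, and 4, using only the standard library. The function names and the underscore-joined column naming are assumptions made for illustration; the record shape follows the stargazer edges in the query above.

```python
import math


def flatten(record, parent_key="", sep="_"):
    """Unpack nested dictionaries into one flat dict, replacing None with NaN."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = math.nan if value is None else value
    return flat


def clean_stargazers(edges, cutoff_year=2013):
    """Flatten each edge and drop stars recorded before the cutoff year."""
    rows = [flatten(edge) for edge in edges]
    # starredAt is an ISO 8601 string, so the first four characters are the year.
    return [row for row in rows if int(row["starredAt"][:4]) >= cutoff_year]


sample_edges = [
    {"starredAt": "2012-05-01T10:00:00Z", "node": {"login": "a", "company": None}},
    {"starredAt": "2015-03-09T08:30:00Z", "node": {"login": "b", "company": None}},
]
clean_rows = clean_stargazers(sample_edges)
```

The resulting list of flat dicts can be passed directly to a dataframe constructor if a tabular view is needed.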

------------
### Part 3: Exploratory Data Analysis and Feature Engineering

#### Exploratory Data Analysis
1. Explore the time-series data by plotting average issues, pull requests, commits, & stars per month
    - Initial assumption
        - Popular repo: Starts off slowly, then climbs steadily to a peak or plateau.
        <img src="./pics/popular_repo_plot.png">
    
        - Less popular: Shows little activity or a volatile plot.
        <img src="./pics/unpopular_repo_plot.png">
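
One way to turn raw event timestamps into the per-month series plotted above is sketched here. It assumes the monthly value is the number of events in each calendar month and that months with no events count as zero; `monthly_counts` is a hypothetical helper name.

```python
from collections import Counter


def monthly_counts(timestamps):
    """Count events per calendar month, filling empty months with 0.

    timestamps are ISO 8601 strings such as "2019-11-03T12:00:00Z".
    """
    counts = Counter(ts[:7] for ts in timestamps)  # key is "YYYY-MM"
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    year, month = int(first[:4]), int(first[5:7])
    series = {}
    while True:
        key = f"{year:04d}-{month:02d}"
        series[key] = counts.get(key, 0)
        if key == last:
            break
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series


stars = ["2019-11-03T12:00:00Z", "2019-11-20T09:00:00Z", "2020-01-05T18:30:00Z"]
star_series = monthly_counts(stars)
```

Filling the gaps matters for the later time-series comparison, since a missing month and a month with zero activity would otherwise look the same.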

#### Feature Engineering
1. Ordinal-encode categorical features. For example: JavaScript = 1, Python = 2, C++ = 3, etc.


2. For these metrics: Issues, Pull Requests, Commits, & Stargazers
    - Calculate the average per month and convert it to a time series
    - Issues and Pull Requests
        - Duration from open to closed or merged
        - Association (None, Members, Owner, & Contributors)

<img src="./pics/feat_eng_data.png">
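
The two feature-engineering steps can be sketched as follows. This is an illustration under assumptions: codes are assigned in order of first appearance (the source does not say how the integers were chosen), timestamps use GitHub's `YYYY-MM-DDTHH:MM:SSZ` format, and both function names are hypothetical.

```python
from datetime import datetime


def ordinal_encode(values):
    """Map each distinct category to an integer, starting at 1,
    in order of first appearance. Returns (codes, mapping)."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping) + 1)
    return [mapping[value] for value in values], mapping


def duration_days(opened_at, closed_at):
    """Days between an issue or pull request being opened and closed/merged."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(closed_at, fmt) - datetime.strptime(opened_at, fmt)
    return delta.total_seconds() / 86400


codes, mapping = ordinal_encode(["JavaScript", "Python", "C++", "Python"])
days_open = duration_days("2020-01-01T00:00:00Z", "2020-01-03T12:00:00Z")
```

Note that ordinal codes impose an ordering on categories that have none, which is worth keeping in mind when they feed a distance-based method.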

---------------
### Part 4: [Dynamic Time Warping (DTW)](https://en.wikipedia.org/wiki/Dynamic_time_warping) & [Unsupervised Clustering Methods](https://en.wikipedia.org/wiki/Cluster_analysis)

1. Compare two time series and get a DTW value
    - Compared every pair of repos' time series (1000 x 1000 comparisons)
    - The lower the DTW value, the more similar the time series

<img src="./pics/eulc_dtw_plot.png">
Image from:

Zheng Zhang, Romain Tavenard, Adeline Bailly, Xiaotong Tang, Ping Tang, et al. Dynamic Time
Warping Under Limited Warping Path Length. Information Sciences, Elsevier, 2017, 393, pp. 91-107.
http://www.sciencedirect.com/science/article/pii/S0020025517304176. doi:10.1016/j.ins.2017.02.018.
hal-01470554.
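
For reference, a minimal sketch of the classic DTW recurrence with absolute difference as the local cost and no warping-window constraint. This is the textbook dynamic program, not necessarily the implementation or library used in this project.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses |x - y| as the local cost and keeps only two rows of the
    cumulative-cost table, so memory is O(len(b)).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [0.0] + [inf] * m  # row 0: only the empty-vs-empty cell is reachable
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # second series is a stretched copy
shifted = dtw_distance([0, 0, 0], [1, 1, 1])
```

The first pair illustrates why DTW suits repos with different lifespans: a series that is a time-stretched copy of another still gets a distance of 0, whereas a point-by-point Euclidean comparison could not even be computed on unequal lengths.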

#### Dynamic Time Warping distance matrix of 1000 x 1000
<img src="./pics/dtw_data.png">
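
A sketch of how such a matrix can be assembled. Because DTW with a symmetric local cost is itself symmetric, and a series is at distance 0 from itself, only the upper triangle needs computing: 499,500 comparisons for 1,000 repos rather than 1,000,000. `pairwise_matrix` is a hypothetical helper that accepts any distance function; the toy distance below is only a stand-in so the example stays short.

```python
def pairwise_matrix(series_list, distance):
    """Build a symmetric distance matrix, computing each pair only once."""
    n = len(series_list)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(series_list[i], series_list[j])
            matrix[i][j] = matrix[j][i] = d  # mirror across the diagonal
    return matrix


def toy_distance(a, b):
    """Stand-in for a DTW function: difference of the series totals."""
    return abs(sum(a) - sum(b))


dist = pairwise_matrix([[1, 2], [3, 4], [0, 0]], toy_distance)
```

In practice the `distance` argument would be a DTW function applied to each pair of per-repo monthly series.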

#### [Hierarchical Clustering Model](https://en.wikipedia.org/wiki/Hierarchical_clustering#/)
1. Supervised learning requires labeled data which we do not have


2. PCA - Reduce high-dimensional data
    - Rule of thumb: about 5 times as many observations as features
    - Reduce the high-dimensional data to 4 dimensions
    
    
3. Used unsupervised learning to cluster repos with similar DTW distances

<img src="./pics/h_cluster_data.png">
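
A rough sketch of steps 2 and 3, under several assumptions not stated in the source: PCA is applied to the rows of the DTW distance matrix, it is computed with a plain NumPy SVD, and the clusters come from SciPy's Ward linkage cut into two groups. `pca_reduce` and `cluster_rows` are hypothetical names.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def pca_reduce(X, n_components=4):
    """Project the rows of X onto their first principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def cluster_rows(dist_matrix, n_components=4, n_clusters=2):
    """Reduce each repo's row of distances with PCA, then cut a Ward
    hierarchical clustering into n_clusters flat clusters."""
    reduced = pca_reduce(dist_matrix, n_components)
    Z = linkage(reduced, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy distance matrix: six points on a line forming two well-separated groups.
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
toy_dist = np.abs(points[:, None] - points[None, :])
labels = cluster_rows(toy_dist)
```

An alternative design is to feed the condensed distance matrix straight into `linkage` with a method such as average linkage, which skips PCA entirely; which variant produced the dendrogram above is not recorded here.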

--------------
### Part 5: NLP Sentiment Analysis

1. Used the NLTK library to analyze comments from Issues and Pull Requests


2. Initial hypothesis
    - The Green and Red clusters have different sentiment, with one more negative than the other
    - Run DTW on the monthly average of the sentiment results
    
    
3. Results returned from sentiment analysis
    - compound, negative, positive and neutral
    
    
4. Join cluster features with sentiment of each repo

<img src="./pics/r_g_cluster.jpg">
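
The monthly averaging behind the sentiment time series can be sketched as below. The scorer is passed in as a function returning a dict with a `compound` key; NLTK's `SentimentIntensityAnalyzer().polarity_scores` (from `nltk.sentiment.vader`) returns `neg`, `neu`, `pos`, and `compound`, which matches the four values listed above, but whether that exact analyzer was used is an assumption. A toy scorer keeps the example runnable without downloading a lexicon.

```python
from collections import defaultdict


def monthly_mean_compound(comments, score):
    """Average the compound sentiment of comments per calendar month.

    comments: iterable of (created_at ISO 8601 string, comment text)
    score:    callable mapping text to a dict with a "compound" value
    """
    buckets = defaultdict(list)
    for created_at, text in comments:
        buckets[created_at[:7]].append(score(text)["compound"])  # "YYYY-MM"
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


def toy_score(text):
    """Stand-in scorer: +1.0 if the text contains "great", otherwise -1.0."""
    return {"compound": 1.0 if "great" in text else -1.0}


comments = [
    ("2020-01-05T10:00:00Z", "great work"),
    ("2020-01-20T11:00:00Z", "this is broken"),
    ("2020-02-01T09:00:00Z", "great"),
]
sentiment_series = monthly_mean_compound(comments, toy_score)
```

The resulting per-repo series has the same shape as the activity series, so it can be compared across repos with DTW in the same way.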

-------------
### Part 6: Results and Conclusions

1. Hierarchical clustering: Two main clusters (Green and Red), both examined manually:
    - Green - More active: recent commits, issues, and pull requests; more contributors than Red (>10); early stages of the repo showed higher activity than later stages
    - Red - Less active than Green: fewer contributors or a personal project (<10), or an early-stage repo (1-2 years old)


2. NLP Sentiment Analysis time-series data: 
    - Average sentiment per month of each repo
        - Sentiment leaning positive
    - Average sentiment per cluster
        - Sentiment was identical for the Red and Green clusters

### Ideas for future development

1. Run DTW on issue and pull request association and see whether it generates different results when clustering
2. Run DTW on sentiment to check whether the average sentiment per cluster is meaningful
3. Run DTW on different time periods and check whether better clusters are formed
4. Gather more data and run different models (neural networks or other unsupervised learning models)
5. Investigate sub-clusters within each cluster
6. Run other NLP methods such as topic modeling on the context of the corpus (all comments in issues or pull requests)
7. Run NLP on the repo's tweets or Reddit posts, if it has a Twitter or Reddit presence