# Introduction

* Hello, our names are Vincent and J. Vincent Shorter

## Project Objective 
* This project runs through the entire data science pipeline and culminates in classification models built on Natural Language Processing (NLP) outcomes.
* The project builds a machine learning model that predicts a GitHub repository's coding language from the project README file alone.

## Executive Summary


* Analysis of the data showed that Python was by far the most popular language
* 


# Import Section

In [1]:
import pandas as pd
import model
import wrangle
import nltk

from importlib import reload
reload(model)

import warnings
warnings.filterwarnings("ignore")

# Wrangle (Acquire and Prep)
* By far the most difficult and laborious portion of the project.
* Had to contend with empty, non-English, and often sparse READMEs

### Nulls/Missing Values
* Simple drop of null values as they most often indicated an empty README
---
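A minimal sketch of the null handling described above. The column name `readme_contents` and the toy rows are assumptions for illustration; the columns produced by `wrangle` may differ.

```python
import pandas as pd

# Toy frame standing in for the scraped data (hypothetical column names).
raw = pd.DataFrame({
    "repo": ["a/one", "b/two", "c/three", "d/four"],
    "language": ["python", None, "r", "html"],
    "readme_contents": ["Install with pip", "Some text", None, "   "],
})

# Treat whitespace-only READMEs as missing so they are dropped with the nulls.
text = raw["readme_contents"].str.strip()
raw["readme_contents"] = text.mask(text == "")

# Drop rows that lack either a language label or README text.
cleaned = raw.dropna(subset=["language", "readme_contents"]).reset_index(drop=True)
```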
### Feature Engineering 
* Engineered `word_count` in order to facilitate analysis around column length and unique word density
* Engineered `language_bigrams` in order to capture most used word duos
---   
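The two engineered features can be sketched roughly as follows. This assumes a `lemmatized` column of space-separated tokens; the helper `top_bigrams` is a hypothetical name and is not the project's `model.code_language.bigrams` method.

```python
from collections import Counter

import pandas as pd

docs = pd.DataFrame({
    "language": ["python", "python", "r"],
    "lemmatized": [
        "import panda data frame",
        "import panda plot data",
        "library ggplot plot data",
    ],
})

# word_count: number of whitespace-separated tokens in each README.
docs["word_count"] = docs["lemmatized"].str.split().str.len()


def top_bigrams(texts, n=3):
    """Return the n most frequent adjacent word pairs across the given texts."""
    pairs = Counter()
    for text in texts:
        tokens = text.split()
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(n)


python_bigrams = top_bigrams(docs.loc[docs["language"] == "python", "lemmatized"])
```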
### Flattening
* Had to reduce the number of categories in the `language` column because several languages had too few observations
    - Went from around 17 languages down to 7 by creating an `other` category for the less popular languages
* Decisions here were driven mostly by the desire to have enough observations per class for effective analysis
---
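One way to collapse rare languages into an `other` bucket is sketched below. The `min_count` threshold and the helper name `flatten_rare` are assumptions; the cutoff actually used in this project is not stated.

```python
import pandas as pd

languages = pd.Series(
    ["python"] * 5 + ["javascript"] * 4 + ["r"] * 3 + ["go", "rust", "perl"],
    name="language",
)


def flatten_rare(series, min_count):
    """Replace labels seen fewer than min_count times with 'other'."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), "other")


flattened = flatten_rare(languages, min_count=3)
```

Calling `flattened.value_counts()` afterwards is a quick check that every remaining class has enough observations.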


## Exploration Questions 
* Includes visualizations and statistical tests
* Primarily focused on analyzing frequency differences between language groups

In [2]:
# call the acquire and prepare functions from the wrangle module to load and clean the data
df = wrangle.get_search_csv()
df = wrangle.prep_text(df)

### Spotlight - Common Words 
* **Question:** What are the most common words in READMEs?
* **Answer:** 

#### Statistical Hypothesis
>* ${H_0}$: There is no relationship between industry of typical employment and employment status   
>* ${H_a}$: There is a relationship between industry of typical employment and employment status  
>* ${\alpha}$: .05  
>* Result: There is enough evidence to reject our null hypothesis. **Test code below**

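For the common-words question, a frequency count along these lines may be enough. The sample texts are invented, and the helper `most_common_words` is a hypothetical name.

```python
from collections import Counter

# Invented, already-cleaned README snippets.
readmes = [
    "install package run test",
    "install run script",
    "install library plot",
]


def most_common_words(texts, n=2):
    """Count every whitespace-separated word and return the n most frequent."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)


top_words = most_common_words(readmes)
```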
### Spotlight - README Length
* **Question:** Does the length of the README vary by programming language?

* **Answer:** Individuals identifying as White show the largest population proportion change with a drop of nearly 10% when comparing employed vs unemployed. Those identifying as mixed race other than with white, and Indigenous have the highest unemployed rates at 12% and 7% respectively. 

#### Statistical Hypothesis
>* ${H_0}$: There is no relationship between `race` and `employment` status   
>* ${H_a}$: There is a relationship between `race` and `employment` status   
>* ${\alpha}$: .05  
>* Result: There is enough evidence to reject our null hypothesis. **Test code below**

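The README-length question above asks whether `word_count` differs across languages. One way to test that is a Kruskal-Wallis H-test, sketched here on synthetic numbers. The choice of test, the toy values, and the column names are assumptions for illustration; they are not the project's data or its actual test code.

```python
import pandas as pd
from scipy import stats

# Synthetic word counts, five READMEs per language.
sample = pd.DataFrame({
    "language": ["python"] * 5 + ["r"] * 5 + ["html"] * 5,
    "word_count": [120, 340, 95, 410, 220, 60, 80, 45, 150, 70, 30, 25, 55, 40, 20],
})

alpha = 0.05

# One array of word counts per language group.
groups = [group["word_count"].to_numpy() for _, group in sample.groupby("language")]

# H0: the word_count distribution is the same for every language.
statistic, p_value = stats.kruskal(*groups)
reject_null = p_value < alpha
```

Kruskal-Wallis is used here because README lengths tend to be skewed; a one-way ANOVA (`stats.f_oneway`) would be the parametric alternative.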
### Spotlight - Unique Words
* **Question:** Do different programming languages use a different number of unique words?
 
* **Answer:** 

#### Statistical Hypothesis
>* ${H_0}$: There is no relationship between having a `professional_certification` and `employment`  
>* ${H_a}$: There is a relationship between having a `professional_certification` and `employment`    
>* ${\alpha}$: .05
>* Result: There is enough evidence to reject our null hypothesis. **Test code below**

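For the unique-words question, the per-README unique word count and its density could be derived as sketched below. The toy rows and the column names `unique_words` and `unique_density` are assumptions.

```python
import pandas as pd

docs = pd.DataFrame({
    "language": ["python", "python", "r"],
    "lemmatized": ["run test run test run", "install package run", "plot data plot model"],
})

# Total tokens and distinct tokens per README.
docs["word_count"] = docs["lemmatized"].str.split().str.len()
docs["unique_words"] = docs["lemmatized"].apply(lambda text: len(set(text.split())))

# Share of tokens that are distinct; 1.0 means no word repeats.
docs["unique_density"] = docs["unique_words"] / docs["word_count"]

# Average number of unique words per README for each language.
unique_by_language = docs.groupby("language")["unique_words"].mean()
```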
### Spotlight - Language ID by Word
* **Question:** Are there any words that uniquely identify a programming language?
 
* **Answer:** 

#### Statistical Hypothesis
>* ${H_0}$: There is no relationship between having a `professional_certification` and `employment`  
>* ${H_a}$: There is a relationship between having a `professional_certification` and `employment`    
>* ${\alpha}$: .05
>* Result: There is enough evidence to reject our null hypothesis. **Test code below**

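Words that appear in only one language's READMEs can be found with set differences. This is a sketch on invented text, and `exclusive_words` is a hypothetical helper rather than part of the project's modules.

```python
def exclusive_words(texts_by_language):
    """Map each language to the words that occur in no other language's texts."""
    vocab = {
        language: set(" ".join(texts).split())
        for language, texts in texts_by_language.items()
    }
    exclusive = {}
    for language, words in vocab.items():
        # Union of every other language's vocabulary.
        others = set().union(*(v for other, v in vocab.items() if other != language))
        exclusive[language] = words - others
    return exclusive


sample_texts = {
    "python": ["pip install package", "import module"],
    "r": ["install library", "cran package"],
    "html": ["browser page", "install page"],
}
unique_to = exclusive_words(sample_texts)
```

On real data a word seen once in a single README would also count as exclusive, so a minimum frequency filter is probably worth adding.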
## Exploration Summary
* 

# Modeling
- Things did not go as planned. Initially saw massive performance drops when moving to the validate set
- Use of a custom class proved to be more of a hindrance than a help
- Had to abandon the grid search idea and focus on feature creation
- Logistic Regression never provided much performance gain above baseline
- Decision Tree Classifier (DTC) models consistently performed well; for both DTC and Random Forest (RF) models, we lowered tree depth to control overfitting
- We ran 5 rounds of mass cohort testing before settling on specific hyperparameters
- The final model (DTC_0) scored roughly 37 percentage points above baseline accuracy on the test set (0.778 vs. 0.402)
--- 
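The effect of lowering tree depth can be illustrated with scikit-learn on synthetic data. This is a sketch of the general technique only; the dataset, depths, and split sizes are assumptions and do not reproduce the project's cohorts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the vectorized READMEs.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Record (train accuracy, validate accuracy) for each depth limit.
scores = {}
for depth in (3, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_validate, y_validate))
```

A large gap between the two numbers for a given depth is the overfitting signal; shallower trees usually narrow it at the cost of some train accuracy.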


In [3]:
def bigram_placement(language):
    """ 
    Purpose:
        to create bigrams for each language
    ---
    Parameters:
        language: input language
    ---
    Returns:
        a string of bigrams joined with ', ' for further use
    """
    # languages with their own bigram set; anything else falls back to 'other'
    known_languages = ('html', 'javascript', 'r', 'python')
    label = language if language in known_languages else 'other'

    # initialize the class object for the selected language only
    words = ' '.join(df[df.language == label].lemmatized)
    language_bigrams = model.code_language(words=words, label=label).bigrams()

    # join each bigram pair into one string, then join all bigrams with commas
    return ', '.join(str(e) for e in language_bigrams.str.join(sep=' ').to_list())

In [4]:
# applies bigram function to new column
df.language_bigrams = df.language_bigrams.apply(bigram_placement)

#grabs data subsets after vectorization and splitting
train, X_train, y_train, X_validate, y_validate, X_test, y_test = model.vectorize_split(df)

#places subsets in a list for import into final function
subsets = [train, X_train, y_train, X_validate, y_validate, X_test, y_test]

#final function to grab model scores
model.get_test_score(df, subsets)

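|    | Model    |   Accuracy(Score) | Type                     | Features Used                                    | Parameters   |
|---:|:---------|------------------:|:-------------------------|:-------------------------------------------------|:-------------|
|  0 | Baseline |          0.402299 | Basic Baseline           | Baseline Prediction                              |              |
|  1 | DTC_0    |          0.7778   | Decision Tree Classifier | ['word_count', 'lemmatized', 'language_bigrams'] | Depth: 5     |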

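The internals of `model.vectorize_split` are not shown here. A vectorize-then-split step of this general kind might look like the sketch below, which assumes TF-IDF features and a stratified 60/20/20 split; both choices are assumptions, not the module's confirmed behavior.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus: 20 READMEs, two languages.
docs = pd.DataFrame({
    "lemmatized": ["import panda data frame", "library ggplot plot data"] * 10,
    "language": ["python", "r"] * 10,
})

# Hold out 20% for test, then 25% of the remainder for validate (60/20/20 overall).
train_validate, test = train_test_split(
    docs, test_size=0.2, stratify=docs["language"], random_state=42
)
train, validate = train_test_split(
    train_validate, test_size=0.25, stratify=train_validate["language"], random_state=42
)

# Fit the vocabulary on train only so validate and test stay unseen.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["lemmatized"])
X_validate = vectorizer.transform(validate["lemmatized"])
X_test = vectorizer.transform(test["lemmatized"])
y_train, y_validate, y_test = train["language"], validate["language"], test["language"]
```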

## Features for Modeling
* Grouped by simple features: `word_count`, `lemmatized`, `language_bigrams`

* `word_count` - engineered
    * Captures README length; engineered to support analysis of README length and unique word density


* `lemmatized`
    * The cleaned, lemmatized README text
    


* `language_bigrams` - engineered
    * Highlights word bigrams popular within each language  
---

## Top Models


* Exposed the top 5 models from the train section to the validate data
* Models had to score below 93% accuracy on the train set to control for possible overfitting
* DTC models maintained the highest performance profile throughout all testing cohorts
---
* Model performance on the train subset:

| Model | Accuracy(Score) | Type                     | Features Used                            | Parameters          |
|:------|----------------:|:-------------------------|:-----------------------------------------|:--------------------|
| RF_6  |         0.91950 | Random Forest            | word_count, lemmatized, language_bigrams | Depth: 7, Leaves: 3 |
| RF_0  |         0.91950 | Random Forest            | word_count, lemmatized, language_bigrams | Depth: 5, Leaves: 3 |
| DTC_1 |         0.91950 | Decision Tree Classifier | word_count, lemmatized, language_bigrams | Depth: 6            |
| DTC_0 |         0.89066 | Decision Tree Classifier | word_count, lemmatized, language_bigrams | Depth: 5            |
| KNN_4 |         0.56322 | K-Nearest Neighbors      | word_count, lemmatized, language_bigrams | K-Neighbors: 4      |

---
## Test
* Top Model Performance on Test set:  

|    | Model    |   Accuracy(Score) | Type                     | Features Used                                    | Parameters   |
|---:|:---------|------------------:|:-------------------------|:-------------------------------------------------|:-------------|
|  0 | Baseline |          0.402299 | Basic Baseline           | Baseline Prediction                              | n/a          |
|  1 | DTC_0    |          0.7778   | Decision Tree Classifier | ['word_count', 'lemmatized', 'language_bigrams'] | Depth: 5     |

---
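The gain over baseline can be read two ways from the test table above. The short calculation below uses the two accuracy values from that table to show the absolute gain in percentage points and the gain relative to baseline.

```python
# Accuracy values taken from the test table.
baseline_accuracy = 0.402299
model_accuracy = 0.7778

# Absolute gain: difference in accuracy, expressed in percentage points.
absolute_gain = model_accuracy - baseline_accuracy

# Relative gain: improvement as a fraction of the baseline accuracy.
relative_gain = absolute_gain / baseline_accuracy

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain:.1%}")
```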

# Conclusion
## Summary of Key Findings
* 
* 
* 
* DTC and RF models consistently performed well
* The final model scored roughly 37 percentage points above baseline accuracy
---
## Suggestions and Next Steps
* Trigrams may be something worth adding in the model in order to boost performance
* We also want to create a form of sentiment analysis 
    - It will track whether a repo leans more towards Basketball or Coding as a focal point
* Model performance above baseline is enough to justify continued use.
* An affirmative next step would be to further expand the scope of testing to capture languages with smaller usage.

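As a starting point for the trigram suggestion, n-grams of any size can be built by zipping shifted copies of the token list. This is a sketch; `ngrams` is a hypothetical helper, and `nltk.ngrams` would be a drop-in alternative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "pip install package pip install package now".split()

# Count trigrams the same way bigrams would be counted with n=2.
trigram_counts = Counter(ngrams(tokens, 3))
top_trigram = trigram_counts.most_common(1)
```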