# Natural Language Processing Project
#### By: Lupe Luna, Forrest McCrosky, and Anna Vu
---

We will use web scraping to extract some of the most-starred repositories on GitHub and build a multiclass classification model that predicts each repository's predominant programming language from the contents of its README.md.

[View this notebook in Jupyter nbviewer](https://nbviewer.jupyter.org/github/Vu-Luna-McCrosky-NLP-Project/NLP_Project_Predicting_Readme_s/blob/master/final_nlp.ipynb)

<br>

# Agenda:
---
- [Executive Summary](#executive_summary)
 - [Project Planning](#project_planning)
 - [Imports](#imports)
 - [Data Acquisition](#data_acquisition)
 - [Data Preparation](#data_preparation)
 - [Data Exploration](#data_exploration)
 - [Statistical Testing](#stats)
 - [Modeling](#modeling)
 - [Test](#test)
 - [Conclusion and Next Steps](#conclusion)

<br>

<a id='executive_summary'></a>
# Executive Summary:
---
To predict the most used programming language for each repository, we used a KNN model (fit on TF-IDF features of the README's contents). On out-of-sample data it scored:
 - 82% accuracy
 - 82% precision
 - 76% recall

Our model does better predicting the result of a repository if its programming language is JavaScript or Python. For next time, we could probably use a bigger sample of repositories to help it account for Java, Go, and other languages.

<a id='project_planning'></a>
# Project Planning:
---
We need to use web scraping to get ~200 repositories from [GitHub](https://www.github.com). Once we bring these in, we will filter for desirable README contents (language, size, etc.). We need the content of at least 100 READMEs, so to best ensure they have valuable information, we are going to scrape our data from the [most starred repositories](https://github.com/search?q=stars%3A%3E0&s=stars&type=Repositories) on GitHub.

After we follow the steps of the data science pipeline, we'll need to set up a couple of slides to present our findings.

<br>

<a id='imports'></a>
# Imports:

In [1]:
#import our modules
import acquire as a
import prepare as p
import explore as e
import model as m

#import our most-used libraries 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

#import NLP necessities
import nltk
import re
from pprint import pprint
import unicodedata
from nltk.corpus import stopwords
from wordcloud import WordCloud
from PIL import Image
import bs4
import time

#import sklearn for our models
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text, export_graphviz
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

<a id='data_acquisition'></a>
# Data Acquisition: 
---

In [2]:
## use acquire function that built a repo list from github.com's most starred repositories

repos = a.get_repo_list() 

In [3]:
print(f'The length of the repo list is: {len(repos)}\n') ## <-- quality assurance check

repos[:6] ## looking at some readme titles

The length of the repo list is: 200



['freeCodeCamp/freeCodeCamp',
 '996icu/996.ICU',
 'EbookFoundation/free-programming-books',
 'jwasham/coding-interview-university',
 'vuejs/vue',
 'facebook/react']

### Initial Repo List Function
 
 - get_repo_list() is designed to create a repo list from the most starred repositories on Github.
    
 - The function loops through 20 pages with 10 results per page of the most starred repos on github using a range from 1 to 21.
    
 - It then uses another loop to pull the title of each repo out of each page's HTML with the Beautiful Soup library. It also removes null elements and whitespace.
    
 - After `get_repo_list()` ran, we manually removed 6 entries from the list that were poorly formatted and not repository titles.
 - This leaves us 198 repos.
    
    
 - Since it takes a while to run and grab 200 repositories (and you'll need your own GitHub token for it to work properly), we decided to create a .csv as an end product for our own use.
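
The paging and cleanup steps described above can be sketched as follows. This is a minimal illustration, not the code in acquire.py: the helper names `build_search_urls` and `clean_titles` are hypothetical, and the `&p=` page parameter is our assumption about how the search results are paged.

```python
# Hypothetical sketch of the paging and cleanup steps; not the actual acquire.py code
BASE_URL = 'https://github.com/search?q=stars%3A%3E0&s=stars&type=Repositories'


def build_search_urls(first_page=1, last_page=20):
    """Return one search URL per results page (20 pages by default)."""
    # range(1, 21) yields pages 1 through 20; the '&p=' parameter is an assumption
    return [f'{BASE_URL}&p={page}' for page in range(first_page, last_page + 1)]


def clean_titles(raw_titles):
    """Strip whitespace and drop null or empty entries from scraped titles."""
    stripped = (title.strip() for title in raw_titles if title is not None)
    return [title for title in stripped if title]


urls = build_search_urls()
titles = clean_titles(['  vuejs/vue\n', None, '', 'facebook/react '])
print(len(urls), titles)
```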

<br>

Ran acquire.py from the terminal, and brought in our .json file as a dataframe

In [4]:
## reading our json file built from the acquire.py into a df

df = pd.read_json('data2.json') 

ValueError: Expected object or value
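
This `ValueError` is commonly seen when the path does not point to a readable, non-empty JSON file, for example when acquire.py has not been run in the current folder yet. A minimal sketch of a check that could be run before loading, using only the standard library; `check_json_file` is a hypothetical helper and is not part of acquire.py.

```python
import json
import os
import tempfile


def check_json_file(path):
    """Return the parsed records, or raise an error that names the likely cause."""
    if not os.path.exists(path):
        raise FileNotFoundError(f'{path} not found; run acquire.py first')
    if os.path.getsize(path) == 0:
        raise ValueError(f'{path} is empty')
    with open(path, encoding='utf-8') as f:
        return json.load(f)  # raises json.JSONDecodeError on malformed content


# Demonstration on a temporary file holding one made-up record
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.json')
    with open(demo_path, 'w', encoding='utf-8') as f:
        json.dump([{'repo': 'example/repo', 'language': 'Python'}], f)
    records = check_json_file(demo_path)

print(records)
```

The returned list of records can then be handed to `pd.DataFrame(records)`.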

<br>
Now we are going to filter for the top four languages. We found them to be JavaScript, Python, Java, and Go.

In [None]:
## making a list of the top four used programming languages

top_four = df.language.value_counts().keys()[0:4] 

In [None]:
## filtering down dataframe to contain top four languages

df = df[df.language.isin(top_four)] 

In [None]:
df.head() ## <-- quality assurance check

<br>
Create a .csv, from the steps above, to be able to bring up this data faster as we work through the project.

In [None]:
df.to_csv('NLP_df.csv')

<br>
Let's bring in the data with NLP_df.csv

In [None]:
#bring in NLP_df.csv as a pandas dataframe
df = a.get_github_data()

#look at our dataframe
df

### Acquisition Takeaways:

 - New features could be made, like character count and word count. 
 - Notice non-English characters in the contents
 - Need to clean the readme_contents
 - Duplicates were dropped while bringing in the data
 - We went from 200 to 109 rows

<a id='data_preparation'></a>
# Data Preparation:
---

In [None]:
#this prep function will clean our readme_contents, and create readme_length and word_count features
df = p.prep_github_data(df, 'readme_contents')

In [None]:
#check the results of the prep function
df.head()

In [None]:
#checking target variables distribution
df.language.value_counts()

In [None]:
#split our data into train, validate, and test sets
train, validate, test = p.split(df)

In [None]:
#assure the shapes are reasonable
train.shape, validate.shape, test.shape

In [None]:
#checking the balance of our train target variables
train.language.value_counts()

### Preparation Takeaways: 
 - We have a clean set of readme_contents that we can explore on
 - Any READMEs with fewer than 10 words were dropped
 - Proceed to explore on our train set
 - We need to categorize content based on its dominant programming language, so we can find which words can help our model decipher what language is being used the most
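
For readers without access to prepare.py, the kind of cleaning described above can be sketched as follows. This is an assumption about what such a function might do, not the actual `prep_github_data` code: the helper names, the tiny stopword set (standing in for nltk's English stopwords), and the choice to count words after cleaning are all ours.

```python
import re
import unicodedata

# Tiny illustrative stopword set; the notebook imports nltk's English stopwords instead
STOPWORDS = {'a', 'an', 'and', 'the', 'is', 'to', 'of', 'for', 'in'}


def basic_clean(text):
    """Lowercase, reduce unicode to ASCII, and keep only letters, digits, and spaces."""
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('utf-8')
    return re.sub(r'[^a-z0-9\s]', ' ', text)


def clean_readme(text, min_words=10):
    """Return the cleaned README, or None when fewer than min_words words remain."""
    words = [word for word in basic_clean(text).split() if word not in STOPWORDS]
    return ' '.join(words) if len(words) >= min_words else None


print(clean_readme('Café: the BEST tool!', min_words=2))  # cafe best tool
print(clean_readme('too short'))                          # None
```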

<a id='data_exploration'></a>
# Data Exploration
---

We are going to separate the clean README contents based on its repository's dominant programming language, and also have an inclusive one. We're only going to be exploring on our train dataset. 

In [None]:
#content and its words put under the repository's primary language
javascript_words = ' '.join(train[train.language == 'JavaScript'].clean)
python_words = ' '.join(train[train.language == 'Python'].clean)
java_words = ' '.join(train[train.language == 'Java'].clean)
go_words = ' '.join(train[train.language == 'Go'].clean)
all_words = ' '.join(train.clean)

Now we're going to split the content into individual words on whitespace and take value counts to see how often each word comes up for each programming language. (We also did a frequency count for all words.)

In [None]:
#split up content into individual words, and count how many times the word comes up over all readmes
javascript_freq = pd.Series(javascript_words.split()).value_counts()
python_freq = pd.Series(python_words.split()).value_counts()
java_freq = pd.Series(java_words.split()).value_counts()
go_freq = pd.Series(go_words.split()).value_counts()
all_freq = pd.Series(all_words.split()).value_counts()

Now we will combine all of the frequencies, so we can view the words and how often they are used across the four languages.

In [None]:
#create a df of frequencies of each word by language 
word_counts = pd.concat([javascript_freq, python_freq, java_freq, go_freq, all_freq], axis=1).fillna(0).astype(int)

#name the columns
word_counts.columns = ['javascript', 'python','java','go','all']

#check our most frequently occurring words
word_counts.sort_values('all', ascending=False).head(10)

In [None]:
# Sorting By Java and JavaScript both in descending order to look for overlap
word_counts.sort_values(['java', 'javascript'], ascending=[False, False]).head(8)

In [None]:
# Sorting By Python and Go
word_counts.sort_values(['python', 'go'], ascending=[False, False]).head(8)

In [None]:
# Sorting By Python and Java
word_counts.sort_values(['python', 'java'], ascending=[False, False]).head(8)

In [None]:
#there are 25,631 different 'words' in our train dataset
word_counts

Let's compare programming languages and how much they use any of the overall top 20 words across all READMEs

#### Most Frequent Words 

In [None]:
e.javascript_barh(word_counts)

In [None]:
e.python_barh(word_counts)

In [None]:
e.java_barh(word_counts)

In [None]:
e.go_barh(word_counts)

#### Word Overlap Per Language

In [None]:
plt.rc('font', size=18)
(word_counts.sort_values(by='all', ascending=False)
 .head(20)
 .apply(lambda row: row / row['all'], axis=1)
 .drop(columns='all')
 .sort_values(by='javascript')
 .plot.barh(stacked=True, width=1, ec='black', figsize=(17,9)))
plt.legend(bbox_to_anchor= (1.03, 1))
plt.title('Word Overlap Per Language: Sorted by JavaScript\n')
plt.xlabel('\nProportion of Overlap')
plt.show()

In [None]:
## proportion stacked bar charts sorted by Python

plt.rc('font', size=18)
(word_counts.sort_values(by='all', ascending=False)
 .head(20)
 .apply(lambda row: row / row['all'], axis=1)
 .drop(columns='all')
 .sort_values(by='python')
 .plot.barh(stacked=True, width=1, ec='black', figsize=(17,9)))
plt.legend(bbox_to_anchor= (1.03, 1))
plt.title('Word Overlap Per Language: Sorted by Python\n')
plt.xlabel('\nProportion of Overlap')
plt.show()

In [None]:
plt.rc('font', size=18)
(word_counts.sort_values(by='all', ascending=False)
 .head(20)
 .apply(lambda row: row / row['all'], axis=1)
 .drop(columns='all')
 .sort_values(by='java')
 .plot.barh(stacked=True, width=1, ec='black', figsize=(17,9)))
plt.legend(bbox_to_anchor= (1.03, 1))
plt.title('Word Overlap Per Language: Sorted by Java\n')
plt.xlabel('\nProportion of Overlap')
plt.show()

In [None]:
plt.rc('font', size=18)
(word_counts.sort_values(by='all', ascending=False)
 .head(20)
 .apply(lambda row: row / row['all'], axis=1)
 .drop(columns='all')
 .sort_values(by='go')
 .plot.barh(stacked=True, width=1, ec='black', figsize=(17,9)))
plt.legend(bbox_to_anchor= (1.03, 1))
plt.title('Word Overlap Per Language: Sorted by Go\n')
plt.xlabel('\nProportion of Overlap')
plt.show()

#### Single Word Wordclouds

In [None]:
language_words = [javascript_words,python_words,java_words,go_words]

In [None]:
e.simple_wordclouds(language_words)

### Bigrams and Trigrams per Category

#### Bigrams Per Language

In [None]:
plt.rc('font', size=12)
pd.Series(nltk.bigrams(javascript_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common JavaScript Bigrams\n')
plt.xlabel('\nFrequency')
None
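
For readers unfamiliar with n-grams: a bigram is a pair of adjacent words and a trigram is a run of three. The sketch below reproduces in plain Python the tuples that `nltk.bigrams` and `nltk.trigrams` yield for a list of tokens. The helper name `ngrams_of` and the sample sentence are made up for illustration.

```python
from collections import Counter


def ngrams_of(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = 'npm install package npm install other'.split()

# Count how often each adjacent pair of words occurs
bigram_counts = Counter(ngrams_of(tokens, 2))
print(bigram_counts.most_common(1))  # [(('npm', 'install'), 2)]

# The same helper gives trigrams with n=3
print(ngrams_of(tokens, 3)[0])  # ('npm', 'install', 'package')
```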

In [None]:
pd.Series(nltk.bigrams(python_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Python Bigrams\n')
plt.xlabel('\nFrequency')
None

In [None]:
pd.Series(nltk.bigrams(java_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Java Bigrams\n')
plt.xlabel('\nFrequency')
None

In [None]:
pd.Series(nltk.bigrams(go_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Go Bigrams\n')
plt.xlabel('\nFrequency')
None

#### All Languages Bigram

In [None]:
pd.Series(nltk.bigrams(all_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common All Language Bigrams\n')
plt.xlabel('\nFrequency')
None

#### Trigrams Per Category

In [None]:
pd.Series(nltk.trigrams(javascript_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common JavaScript Trigrams\n')
plt.xlabel('\nFrequency')
None

In [None]:
pd.Series(nltk.trigrams(python_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Python Trigrams\n')
plt.xlabel('\nFrequency')
None

In [None]:
pd.Series(nltk.trigrams(java_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Java Trigrams\n')
plt.xlabel('\nFrequency')
None

In [None]:
pd.Series(nltk.trigrams(go_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common Go Trigrams\n')
plt.xlabel('\nFrequency')
None


#### All Languages Trigram

In [None]:
pd.Series(nltk.trigrams(all_words.split())).value_counts().head(10).plot.barh()
plt.title('Top 10 Most Common All Language Trigrams\n')
plt.xlabel('\nFrequency')
None

#### Bigram Wordclouds

In [None]:
## creating a series for the frequencies of the top 20 bigrams of all programming categories

top_20_javascript_bigrams = pd.Series(nltk.bigrams(javascript_words.split()))\
.value_counts().head(20)

top_20_python_bigrams = pd.Series(nltk.bigrams(python_words.split()))\
.value_counts().head(20)

top_20_java_bigrams = pd.Series(nltk.bigrams(java_words.split()))\
.value_counts().head(20)

top_20_go_bigrams = pd.Series(nltk.bigrams(go_words.split()))\
.value_counts().head(20)


In [None]:
## using a dict comprehension to create a dictionary of javascript bigrams
## then making a wordcloud

data = {k[0] + ' ' + k[1]: v for k, v in top_20_javascript_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.title('Top Bigrams for JavaScript\n')
plt.axis('off')
plt.show()

In [None]:
## using a dict comprehension to create a dictionary of python bigrams
## then making a wordcloud

data = {k[0] + ' ' + k[1]: v for k, v in top_20_python_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.title('Top Bigrams for Python\n')
plt.axis('off')
plt.show()

In [None]:
## using a dict comprehension to create a dictionary of java bigrams
## then making a wordcloud

data = {k[0] + ' ' + k[1]: v for k, v in top_20_java_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.title('Top Bigrams for Java\n')
plt.axis('off')
plt.show()

In [None]:
## using a dict comprehension to create a dictionary of go bigrams
## then making a wordcloud

data = {k[0] + ' ' + k[1]: v for k, v in top_20_go_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(10, 6))
plt.imshow(img)
plt.title('Top Bigrams for Go\n')
plt.axis('off')
plt.show()
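
The four bigram cells above differ only in the series and the title, so the tuple-to-string step could be factored out. A minimal sketch, with `ngram_frequencies` as a hypothetical helper name and made-up counts; it handles bigrams and trigrams alike because it joins however many words each tuple holds.

```python
def ngram_frequencies(top_ngrams):
    """Map n-gram tuples to space-joined strings, keeping their counts.

    Accepts any mapping of tuple -> count, such as the result of
    top_20_python_bigrams.to_dict().
    """
    return {' '.join(words): count for words, count in top_ngrams.items()}


# Made-up counts for illustration only
sample = {('pull', 'request'): 7, ('open', 'source', 'project'): 3}
print(ngram_frequencies(sample))
```

The resulting dictionary has the same shape as the `data` variable in the cells above, so it can be passed to `generate_from_frequencies` in the same way.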

#### Trigram Wordclouds

In [None]:
top_20_javascript_trigrams = pd.Series(nltk.ngrams(javascript_words.split(),3))\
.value_counts().head(20)

top_20_python_trigrams = pd.Series(nltk.ngrams(python_words.split(),3))\
.value_counts().head(20)

top_20_java_trigrams = pd.Series(nltk.ngrams(java_words.split(),3))\
.value_counts().head(20)

top_20_go_trigrams = pd.Series(nltk.ngrams(go_words.split(),3))\
.value_counts().head(20)

In [None]:
data = {k[0] + ' ' + k[1] + ' ' +k[2]: v for k, v in top_20_javascript_trigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(12, 6))
plt.imshow(img)
plt.title('Top Trigrams for JavaScript\n')
plt.axis('off')
plt.show()

In [None]:
data = {k[0] + ' ' + k[1] + ' ' +k[2]: v for k, v in top_20_python_trigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(12, 6))
plt.imshow(img)
plt.title('Top Trigrams for Python\n')
plt.axis('off')
plt.show()

In [None]:
data = {k[0] + ' ' + k[1] + ' ' +k[2]: v for k, v in top_20_java_trigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(12, 6))
plt.imshow(img)
plt.title('Top Trigrams for Java\n')
plt.axis('off')
plt.show()

In [None]:
data = {k[0] + ' ' + k[1] + ' ' +k[2]: v for k, v in top_20_go_trigrams.to_dict().items()}
img = WordCloud(background_color='white', width=800, height=400).generate_from_frequencies(data)
plt.figure(figsize=(12, 6))
plt.imshow(img)
plt.title('Top Trigrams for Go\n')
plt.axis('off')
plt.show()

### Exploration Takeaways: 
 - JavaScript has the most use of the 20 most common words across all READMEs, followed up by Python.
 - Java and Go have very small proportions for the most frequent words.
 - Some of the tokens are not single words, but we will proceed with them as they are, since they were not separated by spaces in the README.
 - Given that our data is mostly JavaScript and Python, our exploration seems reasonable. 



<a id='stats'></a>
# Statistical Testing:
---

## Two-Tailed T-Test

We will be testing for a difference of means between groups of programming languages

In [None]:
## creating categorized dataframes for each target variable value for statistical 
## testing purposes

python_df = train[train.language == 'Python'] ## creating a python df 
go_df = train[train.language == 'Go'] ## creating a go df
java_df = train[train.language == 'Java'] ## creating a Java df
javascript_df = train[train.language == 'JavaScript'] ## creating a JavaScript df

### Comparing README Lengths

In [None]:
alpha = 0.05 ## <-- Determining alpha for Readme Length Comparisons

<br>
Python Vs. JavaScript

$H_0$: There is no difference in means of Python repository character lengths and JavaScript repository character lengths

$H_a$: There is a difference in means of Python repository character lengths and JavaScript repository character lengths

In [None]:
t, p = stats.ttest_ind(python_df.readme_length,javascript_df.readme_length)
t, p

<br>
Python Vs. Go

$H_0$: There is no difference in means of Python repository character lengths and Go repository character lengths

$H_a$: There is a difference in means of Python repository character lengths and Go repository character lengths

In [None]:
t, p = stats.ttest_ind(python_df.readme_length,go_df.readme_length)
t, p

<br>
Python vs. Java

$H_0$: There is no difference in means of Python repository character lengths and Java repository character lengths

$H_a$: There is a difference in means of Python repository character lengths and Java repository character lengths

In [None]:
t, p = stats.ttest_ind(python_df.readme_length,java_df.readme_length)
t, p

<br>

#### ReadMe Length Comparison Takeaways

 - All of the two-tailed t-tests comparing README character lengths returned insignificant results.
 - Each comparison of Python against another language returned a p-value greater than our alpha of 0.05.
 - We fail to reject the null hypothesis in each case.
 - We therefore found no evidence that mean README character length differs between Python repositories and repositories written in the other three languages.
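
`stats.ttest_ind` assumes equal variances by default. With groups as small and uneven as ours, Welch's t-test (`equal_var=False`) is a reasonable cross-check. The sketch below uses randomly generated lengths purely for illustration; none of the numbers come from our data.

```python
import numpy as np
from scipy import stats

# Made-up README lengths for illustration only; not taken from our dataset
rng = np.random.default_rng(12)
group_a = rng.normal(loc=5000, scale=1500, size=30)
group_b = rng.normal(loc=5200, scale=4000, size=12)

alpha = 0.05

# Student's t-test (the default) assumes both groups share one variance
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops that assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

for name, p in [('Student', p_student), ('Welch', p_welch)]:
    decision = 'reject' if p < alpha else 'fail to reject'
    print(f'{name}: p = {p:.3f} -> {decision} the null hypothesis')
```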

<br>

### Comparing Word Count Length

In [None]:
alpha = 0.05 ## <-- Determining alpha for Word Count Comparisons

<br>

Python vs. JavaScript

$H_0$: There is no difference in means of Python repository word counts and JavaScript repository word counts

$H_a$: There is a difference in means of Python repository word counts and JavaScript repository word counts

In [None]:
t, p = stats.ttest_ind(python_df.word_count,javascript_df.word_count)
t, p

<br>

Python vs. Go

$H_0$: There is no difference in means of Python repository word counts and Go repository word counts

$H_a$: There is a difference in means of Python repository word counts and Go repository word counts

In [None]:
t, p = stats.ttest_ind(python_df.word_count,go_df.word_count)
t, p

<br>

Python vs. Java


$H_0$: There is no difference in means of Python repository word counts and Java repository word counts

$H_a$: There is a difference in means of Python repository word counts and Java repository word counts

In [None]:
t, p = stats.ttest_ind(python_df.word_count,java_df.word_count)
t, p

<br>

#### Word Count Comparison Takeaways

 - All of the two-tailed t-tests comparing word counts returned insignificant results.
 - Each comparison of Python against another language returned a p-value greater than our alpha of 0.05.
 - We fail to reject the null hypothesis in each case.
 - We therefore found no evidence that mean README word count differs between Python repositories and repositories written in the other three languages.

<a id='modeling'></a>
# Modeling
---

We are going to use classification models in order to predict the programming language. 
We will use decision tree, random forest, logistic regression, KNN, and Naive Bayes models, with an emphasis on accuracy.


The first step is to initialize the TfidfVectorizer and split our data into X and y sets.

In [None]:
#initialize TfidfVectorizer, use single words, bigrams and trigrams
tfidf = TfidfVectorizer(ngram_range=(1,3))
X = tfidf.fit_transform(df.clean)
y = df.language

#split the data into X_train, X_validate, X_test, y_train, y_validate, y_test
X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size=.2, random_state=12, stratify = y)
X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size=.2, random_state=12, stratify= y_train_validate)
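
One thing to keep in mind: the cell above fits the vectorizer on every README before splitting, so the vocabulary and IDF weights also reflect the validate and test documents. An alternative, sketched here on a made-up toy corpus with hypothetical `toy_` variable names, is to split the raw text first and fit the vectorizer on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus, made up for illustration only
toy_docs = ['npm install react component', 'node module export function',
            'webpack bundle react app', 'npm run build script',
            'pip install numpy array', 'import pandas dataframe',
            'python virtualenv pip package', 'django model view template']
toy_labels = ['JavaScript'] * 4 + ['Python'] * 4

# Split the raw text first, stratifying on the label
docs_train, docs_test, toy_y_train, toy_y_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=12, stratify=toy_labels)

# Learn the vocabulary and IDF weights from the training text only
toy_tfidf = TfidfVectorizer(ngram_range=(1, 3))
toy_X_train = toy_tfidf.fit_transform(docs_train)

# Reuse the fitted vectorizer; n-grams unseen in training are ignored
toy_X_test = toy_tfidf.transform(docs_test)

print(toy_X_train.shape, toy_X_test.shape)
```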

In [None]:
#view some of the feature names being used 
pprint(df.clean)
pd.DataFrame(X.todense(), columns=tfidf.get_feature_names_out())

We should establish a baseline. Let's see what the most common programming language is in our y_train.

In [None]:
#use most common programming language as baseline
y_train.value_counts()

Our baseline model will assume that every repository's most used language is JavaScript

In [None]:
#establish the baseline
train['baseline_prediction'] = 'JavaScript'
baseline_score = round(accuracy_score(train.language, train.baseline_prediction),2)
print(f'Our baseline score is {baseline_score}')
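
The baseline accuracy is simply the share of the majority class in the labels. A minimal plain-Python sketch that finds the majority label instead of hardcoding it; the helper name and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Made-up labels for illustration only
toy_languages = ['JavaScript'] * 5 + ['Python'] * 3 + ['Java', 'Go']
print(majority_baseline(toy_languages))  # ('JavaScript', 0.5)
```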

Now we will create train and test dataframes so we can add our predictions to them and evaluate how the models perform.

In [None]:
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

<br>
<br>

### Decision Tree

In [None]:
#decision tree fit to X and y train
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)

#prediction columns 
train['predicted'] = tree.predict(X_train)
test['predicted'] = tree.predict(X_test)

#train and validate scores to check for overfitting
print(f'train score: {tree.score(X_train, y_train):.2%}')
print(f'validate score: {tree.score(X_validate, y_validate):.2%}')

In [None]:
#accuracy on train, confusion matrix, and classification report for decision tree
m.model_info(train)

In [None]:
#visualize our decision tree
plt.figure(figsize=(16,9))
plot_tree(tree)
plt.show()

In [None]:
#decision tree scores
tree_accuracy = round(sklearn.metrics.accuracy_score(y_train, train.predicted),2)
tree_precision = round(sklearn.metrics.precision_score(y_train, train.predicted, average='macro'),2)
tree_recall = round(sklearn.metrics.recall_score(y_train, train.predicted, average='macro'),2)
print('Scores for Decision Tree!')
print('---------------------------')
print(f'Our baseline score is {baseline_score}')
print(f'Accuracy score is {tree_accuracy}')
print(f'Precision score is {tree_precision}')
print(f'Recall score is {tree_recall}')

<br>
Our decision tree model seems a bit overfit on the training set, which may mean it will perform poorly on out-of-sample data. 
It does perform better than our baseline, though.
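
The precision and recall scores in this section use `average='macro'`, which computes each metric per language and then takes the unweighted mean, so a language the model never predicts drags the average down no matter how rare it is. A minimal plain-Python sketch of that calculation on made-up labels; the helper name is hypothetical, and an undefined per-class score is counted as 0, which mirrors scikit-learn's default behavior.

```python
def macro_precision_recall(actual, predicted):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(actual) | set(predicted))
    precisions, recalls = [], []
    for cls in classes:
        true_pos = sum(a == cls and p == cls for a, p in zip(actual, predicted))
        predicted_pos = sum(p == cls for p in predicted)
        actual_pos = sum(a == cls for a in actual)
        # A class with no predictions (or no actual rows) scores 0
        precisions.append(true_pos / predicted_pos if predicted_pos else 0.0)
        recalls.append(true_pos / actual_pos if actual_pos else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Made-up labels: a model that only ever predicts JavaScript
actual = ['JavaScript', 'JavaScript', 'JavaScript', 'Python', 'Python', 'Go']
predicted = ['JavaScript'] * 6

precision, recall = macro_precision_recall(actual, predicted)
print(round(precision, 2), round(recall, 2))  # 0.17 0.33
```

Accuracy on these toy labels is 50%, yet macro precision and recall are far lower because Python and Go each contribute a score of 0.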

<br>

### Random Forest

In [None]:
#random forest fit to our X and y train
rf = RandomForestClassifier(random_state=906, max_depth = 2).fit(X_train, y_train)

#prediction columns
train['predicted'] = rf.predict(X_train)
test['predicted'] = rf.predict(X_test)

#check for overfitting
print(f'train score: {rf.score(X_train, y_train):.2%}')
print(f'validate score: {rf.score(X_validate, y_validate):.2%}')

In [None]:
#accuracy on train, confusion matrix, and classification report for random forest
m.model_info(train)

In [None]:
#random forest scores
rf_accuracy = round(sklearn.metrics.accuracy_score(y_train, train.predicted),2)
rf_precision = round(sklearn.metrics.precision_score(y_train, train.predicted, average='macro'),2)
rf_recall = round(sklearn.metrics.recall_score(y_train, train.predicted, average='macro'),2)
print('Scores for Random Forest!')
print('---------------------------')
print(f'Our baseline score is {baseline_score}')
print(f'Accuracy score is {rf_accuracy}')
print(f'Precision score is {rf_precision}')
print(f'Recall score is {rf_recall}')

<br>
The random forest model does not predict for Python, Java, or Go. It actually predicts that every repository is JavaScript. It does not beat the baseline.

<br>

### Logistic Regression

In [None]:
#fit logistic regression to our X and y train
lm = LogisticRegression(C=.6).fit(X_train, y_train)

#prediction columns
train['predicted'] = lm.predict(X_train)
test['predicted'] = lm.predict(X_test)

#check for overfitting
print(f'train score: {lm.score(X_train, y_train):.2%}')
print(f'validate score: {lm.score(X_validate, y_validate):.2%}')

In [None]:
#accuracy on train, confusion matrix, and classification report for logistic regression
m.model_info(train)

In [None]:
#logistic regression scores
logit_accuracy = round(sklearn.metrics.accuracy_score(y_train, train.predicted),2)
logit_precision = round(sklearn.metrics.precision_score(y_train, train.predicted, average='macro'),2)
logit_recall = round(sklearn.metrics.recall_score(y_train, train.predicted, average='macro'),2)
print('Scores for Logistic Regression!')
print('---------------------------')
print(f'Our baseline score is {baseline_score}')
print(f'Accuracy score is {logit_accuracy}')
print(f'Precision score is {logit_precision}')
print(f'Recall score is {logit_recall}')

<br>
Logistic Regression only makes JavaScript and Python predictions: it predicted JavaScript for every repository except one, which it predicted as Python.

It is tied with the baseline.

<br>

### KNN

In [None]:
#use 9 for n_neighbors for single words
#use 10 for n_neighbors bigrams and trigrams

#fit KNN to our X and y train
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)

#prediction columns
train['predicted'] = knn.predict(X_train)
test['predicted'] = knn.predict(X_test)

#check for overfitting
print(f'train score: {knn.score(X_train, y_train):.2%}')
print(f'validate score: {knn.score(X_validate, y_validate):.2%}')

In [None]:
#accuracy on train, confusion matrix, and classification report for KNN
m.model_info(train)

In [None]:
#knn scores
knn_accuracy = round(sklearn.metrics.accuracy_score(y_train, train.predicted),2)
knn_precision = round(sklearn.metrics.precision_score(y_train, train.predicted, average='macro'),2)
knn_recall = round(sklearn.metrics.recall_score(y_train, train.predicted, average='macro'),2)
print('Scores for KNN!')
print('---------------------------')
print(f'Our baseline score is {baseline_score}')
print(f'Accuracy score is {knn_accuracy}')
print(f'Precision score is {knn_precision}')
print(f'Recall score is {knn_recall}')

Our KNN model beats the baseline, and does well at accurately predicting the repository's most used programming language.
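
Values for `n_neighbors` like the ones noted above can be chosen by comparing validate accuracy across candidates. A minimal sketch on synthetic data from `make_classification`, standing in for our TF-IDF matrices; the candidate range of 1 to 15 and the `_demo` variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not our README features
X_demo, y_demo = make_classification(n_samples=200, n_features=20, n_informative=5,
                                     n_classes=4, random_state=12)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12, stratify=y_demo)

# Score each candidate k on the validate split only, never on test
scores = {}
for k in range(1, 16):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn_k.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print(f'best k = {best_k}, validate accuracy = {scores[best_k]:.2%}')
```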

<br>

### Naive Bayes

In [None]:
#we're using a multinomial naive bayes fit to our X and y train
nb = MultinomialNB(alpha=1.4).fit(X_train, y_train)

#prediction columns
train['predicted'] = nb.predict(X_train)
test['predicted'] = nb.predict(X_test)

#check for overfitting
print(f'train score: {nb.score(X_train, y_train):.2%}')
print(f'validate score: {nb.score(X_validate, y_validate):.2%}')

In [None]:
#accuracy on train, confusion matrix, and classification report for naive bayes
m.model_info(train)

In [None]:
#naive bayes scores
nb_accuracy = round(sklearn.metrics.accuracy_score(y_train, train.predicted),2)
nb_precision = round(sklearn.metrics.precision_score(y_train, train.predicted, average='macro'),2)
nb_recall = round(sklearn.metrics.recall_score(y_train, train.predicted, average='macro'),2)
print('Scores for Naive Bayes!')
print('---------------------------')
print(f'Our baseline score is {baseline_score}')
print(f'Accuracy score is {nb_accuracy}')
print(f'Precision score is {nb_precision}')
print(f'Recall score is {nb_recall}')

<br>
Multinomial Naive Bayes actually performs the exact same as our logistic regression. It predicted that every repository was JavaScript except for 1 which was predicted as Python. It is tied with our baseline.

<br>

### Modeling Takeaways: 
 - Random Forest, Logistic Regression, and Naive Bayes do not predict Java or Go. This can probably be fixed with more data and hyperparameter tuning.
 - KNN performs the best at ~87% accuracy on train.

<a id='test'></a>
# Test
---

Our best performing model: KNN

Let's get a recap on how it did!

In [None]:
#bringing back our KNN model from above, just refreshing on its train and validate scores
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)

train['predicted'] = knn.predict(X_train)
test['predicted'] = knn.predict(X_test)

print(f'train score: {knn.score(X_train, y_train):.2%}')
print(f'validate score: {knn.score(X_validate, y_validate):.2%}')

<br>
Let's put it to the test!

In [None]:
#get accuracy score for test set predictions vs. test set actual results
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('------------------------------------------------------')

#print the confusion matrix for test set predictions vs test set actual results
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('------------------------------------------------------')

#print classification report 
print(classification_report(test.actual, test.predicted))
print('------------------------------------------------------')

#train, validate, and test scores for KNN
print(f'Our baseline score is {baseline_score:.0%}')
print(f'Training score: {knn.score(X_train, y_train):.2%}')
print(f'Validate score: {knn.score(X_validate, y_validate):.2%}')
print(f'Test score: {knn.score(X_test, y_test):.2%}')

In [None]:
knn_accuracy = round(sklearn.metrics.accuracy_score(y_test, test.predicted),2)
knn_precision = round(sklearn.metrics.precision_score(y_test, test.predicted, average='macro'),2)
knn_recall = round(sklearn.metrics.recall_score(y_test, test.predicted, average='macro'),2)

print(f'KNN accuracy score is {knn_accuracy}')
print(f'KNN precision score is {knn_precision}')
print(f'KNN recall score is {knn_recall}')

<a id='conclusion'></a>
# Conclusion and Next Steps
---

 - Using a combination of single words, bigrams, and trigrams, our best performing model was a KNN. 
 - It beat the baseline by 28.82 percentage points. 
 - Accuracy of 82%, precision of 82%, and recall of 76%

With more time, we would like to use more repositories to potentially find more words that can help predict the programming language used. We'd also take more time to prepare our data to ensure we are using meaningful READMEs.