# NLP Predicts Programming Language of a GitHub Repo

## Goal

The primary goal of this project was to build an NLP model that predicts the primary language of a repo from the text of its README file.

As a secondary goal, we pulled repos from a specific sector, or topic, to see whether an industry we were interested in uses a programming language we're familiar with.

Together, these two goals make this repo a way to research an industry you might be interested in entering and to learn which programming languages you'll need to be familiar with.

## Data Dictionary

**language**: Primary programming language of the repository

**category**: The category within the energy sector

**repo**: The specific repo referenced with that observation

**readme_contents**: Description of each repository containing keywords used to make predictions

**clean_tokes**: README content normalized by lowercasing and removing special characters, non-alphabetic characters, and alphabetic strings of two or fewer characters

**clean_stemmed**: README content with each word reduced to its stem and stopwords removed

**clean_lemmatized**: README content with each word reduced to its lemma (dictionary root) and stopwords removed

**word_count**: The total word count for that observation

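As a rough illustration of the normalization described for `clean_tokes` and `word_count`, here is a minimal sketch. The helper name `basic_clean_tokens`, the `min_len` parameter, and the sample string are hypothetical; the project's actual `prepare.py` is not shown here and may behave differently.

```python
import re

def basic_clean_tokens(text, min_len=3):
    """Hypothetical helper: lowercase, strip non-alphabetic characters, drop short tokens."""
    lowered = text.lower()
    # Delete every character that is not a lowercase letter or whitespace
    alpha_only = re.sub(r"[^a-z\s]", "", lowered)
    # Keep only tokens with at least `min_len` characters
    return [tok for tok in alpha_only.split() if len(tok) >= min_len]

sample = "# Gasoline\n\nThe Gasoline project aims at implementing a Unix-ish toolkit in 2 steps."
tokens = basic_clean_tokens(sample)
word_count = len(tokens)  # one possible definition of word_count
```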
In [3]:
# Base 
import pandas as pd
import numpy as np
import re
from pprint import pprint

# Viz
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# scraping modules
from requests import get
from bs4 import BeautifulSoup

import unicodedata

# NLP
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Functions, etc.
import os
import acquire
import prepare

# matplotlib default plotting styles
plt.rc("patch", edgecolor="black", force_edgecolor=True)
plt.rc("axes", grid=True)
plt.rc("grid", linestyle=":", linewidth=0.8, alpha=0.7)
plt.rc("axes.spines", right=False, top=False)
plt.rc("figure", figsize=(16, 9))
plt.rc("font", size=12.0)
plt.rc("hist", bins=25)

import warnings
warnings.filterwarnings("ignore")

## Acquire our data

In [4]:
# The code below sets the table by creating a json file per parameter in acquire_repo_list

# acquire.acquire_repo_list('gasoline')
# acquire.acquire_repo_list('solar energy')
# acquire.acquire_repo_list('wind power')

# Then we take those json files and use them to scrape our data

# df = acquire.scrape_github_data()

# Then we make a DataFrame proper from that work, and save it as a csv.

# pd.DataFrame(df).to_csv('new_repos_dict.csv', index=False)

# Then we read the csv in

df = pd.read_csv('new_repos_dict.csv')

# Then we create a category for each observation

# .loc slices are label-based and inclusive on both ends, so use non-overlapping bounds
df.loc[:199, 'category'] = 'gasoline'
df.loc[200:399, 'category'] = 'wind_energy'
df.loc[400:, 'category'] = 'solar_power'

df.head()

| | repo | language | readme_contents | category |
|---|---|---|---|---|
| 0 | N-BodyShop/gasoline | C | ```\n \t ▄████ ▄▄▄ ██████ ▒█████ █... | gasoline |
| 1 | michipili/gasoline | OCaml | # Gasoline\n\nThe Gasoline project aims at imp... | gasoline |
| 2 | rvikmanis/gasoline | TypeScript | # Gasoline\n\nConvenient state container for R... | gasoline |
| 3 | iggisv9t/benzin_gif | Python | # benzin_gif\nCreate animated gifs that looks ... | gasoline |
| 4 | daneharrigan/gasoline | Go | # gasoline\n\n![Gasoline Dashboard](http://cl.... | gasoline |

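Because `.loc` slices by label and includes both endpoints, adjacent slices that share a boundary label assign that row twice, with the later assignment winning. One way to sidestep this is to label rows by position with explicit conditions. The sketch below uses a hypothetical 6-row stand-in (`demo`) and made-up cut points rather than the real scraped frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped frame: 6 rows instead of several hundred
demo = pd.DataFrame({"repo": [f"user/repo{i}" for i in range(6)]})

# Row positions 0..5; np.select uses the first condition that is True for each row
positions = np.arange(len(demo))
conditions = [positions < 2, positions < 4]
demo["category"] = np.select(
    conditions,
    ["gasoline", "wind_energy"],
    default="solar_power",
)
```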

## Prepare our data

In [5]:
# We run the prepare function from our prepare.py file

df = prepare.prep_repo_data(df)

In [6]:
df.head()

| | language | category | repo | readme_contents | clean_tokes | clean_stemmed | clean_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | C | gasoline | N-BodyShop/gasoline | ```\n \t ▄████ ▄▄▄ ██████ ▒█████ █... | [&#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... |
| 1 | OCaml | gasoline | michipili/gasoline | # Gasoline\n\nThe Gasoline project aims at imp... | [gasoline, the, gasoline, project, aims, at, i... | gasolin gasolin project aim implement unixish ... | gasoline gasoline project aim implementing uni... |
| 2 | TypeScript | gasoline | rvikmanis/gasoline | # Gasoline\n\nConvenient state container for R... | [gasoline, convenient, state, container, for, ... | gasolin conveni state contain react instal npm... | gasoline convenient state container react inst... |
| 3 | Python | gasoline | iggisv9t/benzin_gif | # benzin_gif\nCreate animated gifs that looks ... | [benzin_gif, create, animated, gifs, that, loo... | benzin_gif creat anim gif look like gasolin pu... | benzin_gif create animated gifs look like gaso... |
| 4 | Go | gasoline | daneharrigan/gasoline | # gasoline\n\n![Gasoline Dashboard](http://cl.... | [gasoline, gasoline, dashboardhttpcllyimage3l1... | gasolin gasolin dashboardhttpcllyimage3l190u3q... | gasoline gasoline dashboardhttpcllyimage3l190u... |


In [7]:
# We notice some nulls

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 445 entries, 0 to 599
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   language          411 non-null    object
 1   category          445 non-null    object
 2   repo              445 non-null    object
 3   readme_contents   443 non-null    object
 4   clean_tokes       445 non-null    object
 5   clean_stemmed     445 non-null    object
 6   clean_lemmatized  445 non-null    object
dtypes: object(7)
memory usage: 27.8+ KB


In [8]:
# Drop the nulls

df.dropna(inplace=True)

In [9]:
# Reset the index

df.reset_index(drop=True, inplace=True)

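In the `df.info()` output above, 34 rows lack a language and 2 lack README contents, yet `dropna()` removes only 35 rows (445 down to 410), so one row is presumably missing both. A quick way to see this kind of overlap before dropping anything is to count nulls per column and per row. The sketch below uses a small hypothetical frame (`demo`), not the project data.

```python
import pandas as pd

# Hypothetical toy frame with overlapping nulls in two columns
demo = pd.DataFrame({
    "language": ["C", None, "Python", None],
    "readme_contents": ["text", "text", None, None],
    "repo": ["a/a", "b/b", "c/c", "d/d"],
})

# Nulls per column
null_counts = demo.isna().sum()

# Rows that dropna() would remove (at least one null anywhere in the row)
rows_with_any_null = int(demo.isna().any(axis=1).sum())

# Drop those rows and renumber the index in one chained step
cleaned = demo.dropna().reset_index(drop=True)
```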
In [10]:
df.head()

| | language | category | repo | readme_contents | clean_tokes | clean_stemmed | clean_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | C | gasoline | N-BodyShop/gasoline | ```\n \t ▄████ ▄▄▄ ██████ ▒█████ █... | [&#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... |
| 1 | OCaml | gasoline | michipili/gasoline | # Gasoline\n\nThe Gasoline project aims at imp... | [gasoline, the, gasoline, project, aims, at, i... | gasolin gasolin project aim implement unixish ... | gasoline gasoline project aim implementing uni... |
| 2 | TypeScript | gasoline | rvikmanis/gasoline | # Gasoline\n\nConvenient state container for R... | [gasoline, convenient, state, container, for, ... | gasolin conveni state contain react instal npm... | gasoline convenient state container react inst... |
| 3 | Python | gasoline | iggisv9t/benzin_gif | # benzin_gif\nCreate animated gifs that looks ... | [benzin_gif, create, animated, gifs, that, loo... | benzin_gif creat anim gif look like gasolin pu... | benzin_gif create animated gifs look like gaso... |
| 4 | Go | gasoline | daneharrigan/gasoline | # gasoline\n\n![Gasoline Dashboard](http://cl.... | [gasoline, gasoline, dashboardhttpcllyimage3l1... | gasolin gasolin dashboardhttpcllyimage3l190u3q... | gasoline gasoline dashboardhttpcllyimage3l190u... |


In [11]:
df.shape

(410, 7)

In [12]:
# During earlier exploration we noticed two outliers that need to be removed

df.clean_lemmatized.apply(len).nlargest(15)

135    84881
269    84881
114    81494
388    81123
17     64273
234    31806
270    16389
130    11005
297    10445
14      9942
330     7553
320     7272
192     6802
264     6776
292     6537
Name: clean_lemmatized, dtype: int64

In [13]:
# Drop the two longest READMEs by index label
df = df.drop(index=[135, 269])

In [15]:
df.shape

(408, 7)

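The two removed rows share the exact length 84881, which may mean the same README was scraped twice; that is only a guess from the output above. A minimal sketch of how one might check for duplicated text alongside the length ranking, using a hypothetical toy frame (`demo`):

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 hold identical, unusually long text
demo = pd.DataFrame({
    "clean_lemmatized": ["short text", "x" * 50, "medium length text", "x" * 50],
})

# Character length of each document
lengths = demo["clean_lemmatized"].str.len()

# The two longest documents, indexed by their row labels
longest = lengths.nlargest(2)

# Flag every row whose text appears more than once
duplicated_mask = demo["clean_lemmatized"].duplicated(keep=False)

# Drop the longest rows by label, as done for rows 135 and 269
trimmed = demo.drop(index=longest.index)
```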
In [16]:
# We also noticed two language labels that need to be binned together

df.language.value_counts()

Jupyter Notebook     67
JavaScript           57
Python               52
PowerShell           36
HTML                 30
Java                 22
C++                  19
C                    18
PHP                  14
R                    14
CSS                  10
C#                    9
MATLAB                8
TypeScript            6
Ruby                  5
Matlab                5
Arduino               3
Swift                 3
Go                    2
Eagle                 2
Batchfile             2
Objective-C           2
Dart                  2
TeX                   2
Processing            2
Vue                   1
Lua                   1
Cuda                  1
Modelica              1
ActionScript          1
Scala                 1
M                     1
TSQL                  1
Makefile              1
Visual Basic          1
CMake                 1
Perl                  1
PostScript            1
OCaml                 1
Visual Basic .NET     1
Fortran               1
Name: language, dtype: int64

In [17]:
df = df.replace('Matlab','MATLAB')

In [18]:
# From the value_counts above, we cull every language with fewer than 10 repos

counts = df.language.value_counts()
low_lang = counts[counts < 10].index
df = df[~df.language.isin(low_lang)]

In [19]:
df.language.value_counts()

Jupyter Notebook    67
JavaScript          57
Python              52
PowerShell          36
HTML                30
Java                22
C++                 19
C                   18
PHP                 14
R                   14
MATLAB              13
CSS                 10
Name: language, dtype: int64

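The order of these two steps matters: `MATLAB` (8) and `Matlab` (5) each fall below the cutoff of 10 on their own, and only the merged label (13) survives the cull. The sketch below replays that merge-then-filter sequence on a hypothetical toy frame (`demo`) with a made-up threshold `min_repos`.

```python
import pandas as pd

# Hypothetical toy data: "MATLAB" and "Matlab" are the same language spelled two ways
demo = pd.DataFrame({"language": ["Python"] * 3 + ["Go", "MATLAB", "Matlab"]})

# Step 1: merge the spelling variants, restricted to the language column
demo["language"] = demo["language"].replace({"Matlab": "MATLAB"})

# Step 2: keep only languages with at least `min_repos` observations
min_repos = 2  # made-up threshold for the toy data; the notebook uses 10
counts = demo["language"].value_counts()
keep = counts[counts >= min_repos].index
filtered = demo[demo["language"].isin(keep)]
```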
In [23]:
# So we're left with this DataFrame

df.head()

| | language | category | repo | readme_contents | clean_tokes | clean_stemmed | clean_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | C | gasoline | N-BodyShop/gasoline | ```\n \t ▄████ ▄▄▄ ██████ ▒█████ █... | [&#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9;, &#9... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... | &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; &#9; g... |
| 3 | Python | gasoline | iggisv9t/benzin_gif | # benzin_gif\nCreate animated gifs that looks ... | [benzin_gif, create, animated, gifs, that, loo... | benzin_gif creat anim gif look like gasolin pu... | benzin_gif create animated gifs look like gaso... |
| 5 | C | gasoline | vooon/miniecu | miniECU\n=======\n\nminiECU - monitoring unit ... | [miniecu, miniecu, monitoring, unit, for, mode... | miniecu miniecu monitor unit model gasolin eng... | miniecu miniecu monitoring unit model gasoline... |
| 6 | CSS | gasoline | kbsali/gasolineras-www | Spain's petrol stations prices map\n==========... | [spains, petrol, stations, prices, map, demo, ... | spain petrol station price map demo see action... | spain petrol station price map demo see action... |
| 7 | Jupyter Notebook | gasoline | madsenmj/ml-gas-price | # Gas Price Prediction Model\n\nThis project u... | [gas, price, prediction, model, this, project,... | ga price predict model thi project use public ... | gas price prediction model project us public d... |


In [25]:
df.shape

(352, 7)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 352 entries, 0 to 409
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   language          352 non-null    object
 1   category          352 non-null    object
 2   repo              352 non-null    object
 3   readme_contents   352 non-null    object
 4   clean_tokes       352 non-null    object
 5   clean_stemmed     352 non-null    object
 6   clean_lemmatized  352 non-null    object
dtypes: object(7)
memory usage: 22.0+ KB


In [29]:
# We're also going to create a DataFrame of the top 4 languages for use later

top_4_lang = list(df.language.value_counts().head(4).index)
# Filter and reset the index in one step so we work on a fresh copy
top_4_lang_df = df[df.language.isin(top_4_lang)].reset_index(drop=True)
top_4_lang_df.head()

| | language | category | repo | readme_contents | clean_tokes | clean_stemmed | clean_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | Python | gasoline | iggisv9t/benzin_gif | # benzin_gif\nCreate animated gifs that looks ... | [benzin_gif, create, animated, gifs, that, loo... | benzin_gif creat anim gif look like gasolin pu... | benzin_gif create animated gifs look like gaso... |
| 1 | Jupyter Notebook | gasoline | madsenmj/ml-gas-price | # Gas Price Prediction Model\n\nThis project u... | [gas, price, prediction, model, this, project,... | ga price predict model thi project use public ... | gas price prediction model project us public d... |
| 2 | Python | gasoline | tseale/charliehustle | # charlie-hustle\n\nPython scripts for pulling... | [charliehustle, python, scripts, for, pulling,... | charliehustl python script pull mlb gameday da... | charliehustle python script pulling mlb gameda... |
| 3 | Jupyter Notebook | gasoline | abhinav5544/gasolineprice | # gasolineprice\nIt is a project based on mach... | [gasolineprice, it, is, a, project, based, on,... | gasolinepric project base machin learn train m... | gasolineprice project based machine learning t... |
| 4 | JavaScript | gasoline | abuenosvinos/gasolineras-www | \n# Proyecto Gasolineras\n\n[![Author][Author]... | [proyecto, gasolineras, authorauthorhttpwwwant... | proyecto gasolinera authorauthorhttpwwwantonio... | proyecto gasolineras authorauthorhttpwwwantoni... |


## Explore our data