# Project Description and Goals

Natural Language Processing (NLP) uses programming and machine learning techniques to understand and make use of large amounts of text data.

For this project, I will scrape README files from GitHub repositories in the social media domain and build a model that predicts a repository's primary programming language from the text of its README.

### Objectives for this project include:
- Building a dataset from a list of GitHub repositories, and writing the Python code needed to extract the README text and the primary language of each repository.
- Documenting process and analysis throughout the data science pipeline.
- Constructing a classification model that predicts a repository's primary programming language from the text of its README.
- Deliverables:
    - A well-documented Jupyter notebook that contains my analysis.
    - One or two slides, suitable for a general audience, that summarize the findings and include a well-labeled visualization.
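
The acquisition objective can be sketched roughly as follows. This is a minimal illustration, not the project's `acquire.py`: all function names are hypothetical, and it assumes the GitHub REST API endpoints `GET /repos/{owner}/{repo}` (whose JSON includes a `language` field) and `GET /repos/{owner}/{repo}/readme` (whose JSON carries the README as base64 text in `content`).

```python
import base64
import json
import urllib.request
from typing import Dict, Optional

API_ROOT = "https://api.github.com"


def github_get(path: str, token: str, username: str) -> dict:
    """Send an authenticated GET request to the GitHub REST API and parse the JSON body."""
    request = urllib.request.Request(
        f"{API_ROOT}/{path}",
        headers={"Authorization": f"token {token}", "User-Agent": username},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_readme(readme_payload: dict) -> str:
    """Decode the base64 `content` field of a README payload into text."""
    raw = base64.b64decode(readme_payload.get("content", ""))
    return raw.decode("utf-8", errors="replace")


def build_record(repo: str, repo_payload: dict, readme_payload: dict) -> Dict[str, Optional[str]]:
    """Combine the two API payloads into one row of the dataset."""
    return {
        "repo": repo,
        "language": repo_payload.get("language"),  # None when no language is reported
        "readme_contents": decode_readme(readme_payload),
    }


def scrape_repo(repo: str, token: str, username: str) -> Dict[str, Optional[str]]:
    """Fetch the primary language and README text for one `owner/name` repository."""
    repo_payload = github_get(f"repos/{repo}", token, username)
    readme_payload = github_get(f"repos/{repo}/readme", token, username)
    return build_record(repo, repo_payload, readme_payload)
```

Calling `scrape_repo` for each name in the repository list and writing the resulting records to a JSON file would produce a dataset with the `repo`, `language`, and `readme_contents` columns used later. Keeping the parsing in `build_record` separate from the network call makes it possible to test the parsing without hitting the API.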
    
### Pipeline Process:
1. Acquire
2. Prepare
3. Explore
4. Model/Evaluate
5. Deliver

### Initial hypotheses
- What are the most frequently occurring words in READMEs?
- Are there any words that uniquely identify a programming language?
- What are the top word combinations (bigrams and trigrams)?

*** 
### Project Imports

In [1]:
import pandas as pd

#acquire and prep
from env import github_token, github_username
import acquire
import prepare

import os
import json
from typing import Dict, List, Optional, Union, cast
import requests

import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

#visualize
from wordcloud import WordCloud
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warnings (turn off pink warning boxes)
import warnings
warnings.filterwarnings("ignore")

#tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

#train, validate, test
from sklearn.model_selection import train_test_split

#creating / evaluating models
# Decision Tree  
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# K-Nearest Neighbor(KNN)  
from sklearn.neighbors import KNeighborsClassifier

# Logistic Regression
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report, accuracy_score

***
# Data Acquisition

In [2]:
#load the data produced by acquire.py (saved as data.json)
df = pd.read_json('data.json')
df.head()

,repo,language,readme_contents
0,sherlock-project/sherlock,Python,"<p align=center>\n\n <img src=""https://user-i..."
1,Greenwolf/social_mapper,Python,# Social Mapper\n![alt text](https://img.shiel...
2,bonzanini/Book-SocialMediaMiningPython,Python,Mastering Social Media Mining with Python\n===...
3,qeeqbox/social-analyzer,JavaScript,"<p align=""center""> <img src=""https://raw.githu..."
4,anfederico/stocktalk,Python,"<p align=""center""><img src=""https://raw.github..."


In [3]:
#rows and columns
df.shape

(150, 3)

In [4]:
#list of column names and data types with more information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   repo             150 non-null    object
 1   language         147 non-null    object
 2   readme_contents  150 non-null    object
dtypes: object(3)
memory usage: 3.6+ KB


In [5]:
#check value counts of languages
df.language.value_counts()

Python              33
JavaScript          28
PHP                 15
HTML                14
TypeScript           9
Jupyter Notebook     8
Dart                 7
Java                 7
CSS                  5
Ruby                 4
C#                   4
Shell                3
Objective-C          2
R                    2
C++                  1
Go                   1
TSQL                 1
PostScript           1
Elixir               1
Scala                1
Name: language, dtype: int64

In [6]:
#check nulls
df.isnull().sum()

repo               0
language           3
readme_contents    0
dtype: int64

### Findings
- Three repos have no language value and need to be dropped.
- The text in readme_contents needs to be cleaned: normalized, tokenized, lemmatized, and stripped of stopwords.
- Repos labeled Jupyter Notebook can be relabeled as Python.
- Several languages appear in fewer than 5 repos. These will be removed as well, since so few examples add little value, and removing them simplifies exploration, modeling, and evaluation.
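
The label-related findings above could be applied with a few pandas operations. The sketch below is an assumption about how this might be done, not the contents of `prepare.py`; the function name `filter_languages` and the `min_repos` parameter are hypothetical.

```python
import pandas as pd


def filter_languages(df: pd.DataFrame, min_repos: int = 5) -> pd.DataFrame:
    """Drop unlabeled repos, merge notebook repos into Python, drop rare languages."""
    # remove rows where no primary language was reported
    df = df.dropna(subset=["language"]).copy()
    # treat Jupyter Notebook repos as Python
    df["language"] = df["language"].replace({"Jupyter Notebook": "Python"})
    # keep only languages that appear in at least `min_repos` repositories
    counts = df["language"].value_counts()
    keep = counts[counts >= min_repos].index
    return df[df["language"].isin(keep)].reset_index(drop=True)


# toy data: 3 Python + 2 Jupyter Notebook repos, plus rare and missing labels
toy = pd.DataFrame({
    "repo": [f"user/repo{i}" for i in range(9)],
    "language": ["Python"] * 3 + ["Jupyter Notebook"] * 2 + ["Go", None, "PHP", "PHP"],
})
filtered = filter_languages(toy, min_repos=5)
```

In the toy data only the five repos that end up labeled Python survive. Relabeling happens before counting, so notebook repos contribute to the Python total when the threshold is applied.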

***
# Data Preparation

In [7]:
#grab cleaned df from prepare.py
df = prepare.clean_content(df, 'readme_contents', extra_words = ['p', 'aligncenter', 'img'], exclude_words = ['no'])

#drop original readme_contents
df = df.drop(columns = ['readme_contents'])

df.head()

,repo,clean_content,language
0,sherlock-project/sherlock,srchttpsuserimagesgithubusercontentcom27065646...,Python
1,Greenwolf/social_mapper,social mapper alt texthttpsimgshieldsiobadgepy...,Python
2,bonzanini/Book-SocialMediaMiningPython,mastering social medium mining python code rep...,Python
3,qeeqbox/social-analyzer,srchttpsrawgithubusercontentcomqeeqboxsocialan...,JavaScript
4,anfederico/stocktalk,aligncenterimg srchttpsrawgithubusercontentcom...,Python
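
The body of `prepare.clean_content` is not shown here. As a rough idea of what the `extra_words` and `exclude_words` arguments might do, the sketch below uses only the standard library. Everything in it is an assumption: the function names, the tiny stopword list (the notebook imports NLTK's English stopwords instead), and the omission of lemmatization.

```python
import re
import unicodedata
from typing import Iterable

# A tiny illustrative stopword list, standing in for a full English list.
BASIC_STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "no", "for"}


def basic_clean(text: str) -> str:
    """Lowercase, strip accents and non-ASCII characters, and drop punctuation."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("utf-8")
    # keep only letters, digits, apostrophes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)


def remove_stopwords(text: str,
                     extra_words: Iterable[str] = (),
                     exclude_words: Iterable[str] = ()) -> str:
    """Remove stopwords; `extra_words` join the list, `exclude_words` leave it."""
    stopword_set = (BASIC_STOPWORDS | set(extra_words)) - set(exclude_words)
    return " ".join(word for word in text.split() if word not in stopword_set)


def clean_text(text: str,
               extra_words: Iterable[str] = (),
               exclude_words: Iterable[str] = ()) -> str:
    """Run the basic cleaning step, then stopword removal."""
    return remove_stopwords(basic_clean(text), extra_words, exclude_words)


example = clean_text("<p align=center> The Café is NO good",
                     extra_words=["p", "aligncenter"],
                     exclude_words=["no"])
# example == 'cafe no good'
```

Because punctuation is removed before tokenizing, HTML such as `align=center` collapses into the single token `aligncenter`, which is why tokens like that are passed as extra stopwords.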


In [8]:
#verify languages w/ fewer than 5 occurrences were dropped
len(df.language.value_counts())

8

In [9]:
#check nulls
df.isnull().sum()

repo             0
clean_content    0
language         0
dtype: int64

In [10]:
#check how many repos there are for each language, as counts and proportions
languages = pd.concat([df.language.value_counts(),
                    round(df.language.value_counts(normalize=True), 2)], axis=1)

languages.columns = ['n', 'percent']

languages

language,n,percent
Python,41,0.33
JavaScript,28,0.22
PHP,15,0.12
HTML,14,0.11
TypeScript,9,0.07
Java,7,0.06
Dart,7,0.06
CSS,5,0.04


In [20]:
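# preview the text with every word of 1 to 20 characters removed,
# leaving only tokens longer than 20 characters (such as mashed-together URLs);
# the pattern lives in its own variable so the clean_content column is not overwritten
word_pattern = r'\W*\b\w{1,20}\b'
print(df.clean_content.str.replace(word_pattern, '', regex=True))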

***
# Data Exploration

### Answer questions from planning stage
- What are the most frequently occurring words in READMEs?
- Are there any words that uniquely identify a programming language?
- What does the distribution of IDFs look like for the most common words?
- Does the length of the README vary by programming language?
- Do different programming languages use a different number of unique words?
- What are the top word combinations (bigrams and trigrams)?
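
For the bigram and trigram question, one simple approach is to count consecutive word tuples. The helper below is a minimal sketch using only the standard library; `top_ngrams` is a hypothetical name, and `nltk.ngrams` would be an alternative way to build the tuples.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(text: str, n: int = 2, k: int = 3) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the `k` most common n-grams in a whitespace-tokenized string."""
    words = text.split()
    # zip n staggered copies of the word list to form consecutive n-word tuples
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)


sample = "social media data social media api social media data"
top_bigrams = top_ngrams(sample, n=2, k=2)
# [(('social', 'media'), 3), (('media', 'data'), 2)]
top_trigrams = top_ngrams(sample, n=3, k=1)
# [(('social', 'media', 'data'), 2)]
```

Applied to a per-language string of cleaned README text, this would give the top word combinations for that language.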

In [17]:
#breaking up data into each language

#words that appear in Python
python_words = ' '.join(df[df.language == 'Python'].clean_content)

#words that appear in JavaScript
javascript_words = ' '.join(df[df.language == 'JavaScript'].clean_content)

#words that appear in PHP
php_words = ' '.join(df[df.language == 'PHP'].clean_content)

#words that appear in HTML
html_words = ' '.join(df[df.language == 'HTML'].clean_content)

#words that appear in TypeScript
typescript_words = ' '.join(df[df.language == 'TypeScript'].clean_content)

#words that appear in Dart
dart_words = ' '.join(df[df.language == 'Dart'].clean_content)

#words that appear in Java
java_words = ' '.join(df[df.language == 'Java'].clean_content)

#words that appear in CSS
css_words = ' '.join(df[df.language == 'CSS'].clean_content)

#all of the words 
all_lang_words = ' '.join(df.clean_content)
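
The questions about README length and unique words per language can be approached with a groupby on the cleaned text. This is a sketch under assumptions: `readme_stats` is a hypothetical helper, and it reports the mean per README, which is only one of several reasonable ways to compare languages.

```python
import pandas as pd


def readme_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize average README length and vocabulary size per language."""
    words = df["clean_content"].str.split()
    per_repo = pd.DataFrame({
        "language": df["language"],
        "word_count": words.str.len(),                                  # words per README
        "unique_words": words.apply(lambda tokens: len(set(tokens))),   # distinct words per README
    })
    return per_repo.groupby("language")[["word_count", "unique_words"]].mean()


# toy data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "language": ["Python", "Python", "PHP"],
    "clean_content": ["social data social", "file data", "run use run run"],
})
stats = readme_stats(toy)
```

For the toy data, the Python rows average 2.5 words and 2 unique words per README, and the single PHP row has 4 words and 2 unique words.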

In [19]:
#check how often each of the words occurs
python_freq = pd.Series(python_words.split()).value_counts()
javascript_freq = pd.Series(javascript_words.split()).value_counts()
php_freq = pd.Series(php_words.split()).value_counts()
html_freq = pd.Series(html_words.split()).value_counts()
typescript_freq = pd.Series(typescript_words.split()).value_counts()
dart_freq = pd.Series(dart_words.split()).value_counts()
java_freq = pd.Series(java_words.split()).value_counts()
css_freq = pd.Series(css_words.split()).value_counts()
all_lang_freq = pd.Series(all_lang_words.split()).value_counts()

#print each frequency table under its language name
freq_by_language = [
    ('Python', python_freq),
    ('JavaScript', javascript_freq),
    ('PHP', php_freq),
    ('HTML', html_freq),
    ('TypeScript', typescript_freq),
    ('Dart', dart_freq),
    ('Java', java_freq),
    ('CSS', css_freq),
    ('All Languages', all_lang_freq),
]

for name, freq in freq_by_language:
    print(f'{name}:')
    print(freq)
    print('--------------------------')

Python:
'                                               511
&#9;                                            269
file                                            222
social                                          221
data                                            167
                                               ... 
tocsvlisttweets                                   1
httpsgithubcomloreysocialmediaprofilesregexs      1
datastoreemulatorport8089                         1
reliance                                          1
near                                              1
Length: 5118, dtype: int64
--------------------------
JavaScript:
'                                                  511
social                                              86
&#9;                                                78
run                                                 78
use                                                 76
                                                  ... 
discussion              
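
For the IDF question in the list above, inverse document frequency can be computed directly from the cleaned README strings. The sketch below uses the plain definition idf = log(N / df). That choice is an assumption: scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its values will differ. The function name is hypothetical.

```python
import math
from typing import Dict, List


def inverse_document_frequency(documents: List[str]) -> Dict[str, float]:
    """Compute idf(word) = log(N / df), where df counts documents containing the word."""
    n_documents = len(documents)
    document_frequency: Dict[str, int] = {}
    for document in documents:
        for word in set(document.split()):  # count each word once per document
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return {word: math.log(n_documents / count)
            for word, count in document_frequency.items()}


docs = ["social data file", "social run", "social data"]
idf = inverse_document_frequency(docs)
# 'social' appears in every document, so its idf is 0.0;
# 'file' appears in one of three, so its idf is log(3)
```

Words that appear in nearly every README, such as `social` in this dataset's domain, get an IDF near zero and carry little information about the language.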

***
# Data Modeling and Evaluation
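
A minimal sketch of how the tools imported earlier (`TfidfVectorizer`, `train_test_split`, `LogisticRegression`, `accuracy_score`) could fit together. The function name, the toy data, the two-way split, and every parameter value are assumptions for illustration; none of this reflects results from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_language_model(texts, labels, random_state=123):
    """Vectorize README text with TF-IDF and fit a logistic regression classifier."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=random_state
    )
    vectorizer = TfidfVectorizer()
    # fit the vocabulary on the training split only, so no test text leaks in
    train_features = vectorizer.fit_transform(X_train)
    model = LogisticRegression(max_iter=1000).fit(train_features, y_train)
    predictions = model.predict(vectorizer.transform(X_test))
    return model, vectorizer, accuracy_score(y_test, predictions)


# toy cleaned READMEs, four per language
texts = [
    "pip install pandas", "python script pandas",
    "pip python notebook", "pandas notebook script",
    "npm install node", "node express server",
    "npm react node", "react express npm",
]
labels = ["Python"] * 4 + ["JavaScript"] * 4
model, vectorizer, accuracy = fit_language_model(texts, labels)
```

With the real data, a baseline to beat would be always predicting the most common language (Python, at 33% of repos). A separate validate split, as the import comments suggest, would allow comparing the decision tree, random forest, and KNN models before touching the test set.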

***
# Conclusion
A report summarizing the findings can be found [here](link to google slides).