# Project Description and Goals

Natural Language Processing (NLP) combines programming and machine learning techniques to analyze and make use of large amounts of text data.


For this project, I will scrape README files from GitHub repositories that focus on social media, then build a model that predicts a repository's primary programming language from the text of its README.

### Objectives for this project include:
- Building a dataset from a list of GitHub repositories to scrape, and writing the Python code needed to extract each repository's README text and primary language.
- Documenting process and analysis throughout the data science pipeline.
- Constructing a classification model that predicts a repository's primary programming language from the text of its README.
- Deliverables:
    - A well-documented Jupyter notebook that contains my analysis.
    - One or two slides suitable for a general audience that summarize the findings and include a well-labeled visualization.
    
### Pipeline Process:
1. Acquire
2. Prepare
3. Explore
4. Model/Evaluate
5. Deliver

### Initial questions
- What are the most frequently occurring words in READMEs?
- Are there any words that uniquely identify a programming language?
- What are the top word combinations (bigrams and trigrams)?

*** 
### Project Imports

In [1]:
import pandas as pd

#acquire and prep
from env import github_token, github_username
import acquire
import prepare

import os
import json
from typing import Dict, List, Optional, Union, cast
import requests

import re
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

#visualize
from wordcloud import WordCloud
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warnings (turn off pink warning boxes)
import warnings
warnings.filterwarnings("ignore")

#tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

#train, validate, test
from sklearn.model_selection import train_test_split

#creating / evaluating models
# Decision Tree  
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# K-Nearest Neighbor(KNN)  
from sklearn.neighbors import KNeighborsClassifier

# Logistic Regression
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report, accuracy_score

***
# Data Acquisition

In [2]:
#load the acquired data from data.json
df = pd.read_json('data.json')
df.head()

Unnamed: 0,repo,language,readme_contents
0,sherlock-project/sherlock,Python,"<p align=center>\n\n <img src=""https://user-i..."
1,Greenwolf/social_mapper,Python,# Social Mapper\n![alt text](https://img.shiel...
2,bonzanini/Book-SocialMediaMiningPython,Python,Mastering Social Media Mining with Python\n===...
3,qeeqbox/social-analyzer,JavaScript,"<p align=""center""> <img src=""https://raw.githu..."
4,anfederico/stocktalk,Python,"<p align=""center""><img src=""https://raw.github..."
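
The `acquire` module itself is not shown in this notebook. As a rough sketch of how a README and primary language could be pulled from the GitHub REST API, the code below uses only the standard library. The function names (`build_headers`, `decode_readme`, `get_json`, `process_repo`) are hypothetical and are not taken from `acquire.py`; the token and username are assumed to come from `env.py` as in the imports above.

```python
import base64
import json
import urllib.request
from typing import Dict, Optional

API_ROOT = "https://api.github.com"


def build_headers(token: str, username: str) -> Dict[str, str]:
    """Headers for an authenticated GitHub REST API request (hypothetical helper)."""
    return {"Authorization": f"token {token}", "User-Agent": username}


def decode_readme(payload: Dict[str, str]) -> str:
    """Decode the base64-encoded `content` field returned by the README endpoint."""
    raw = base64.b64decode(payload["content"])
    return raw.decode("utf-8", errors="replace")


def get_json(url: str, headers: Dict[str, str]) -> dict:
    """GET a URL and parse the JSON body. Performs a network call."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def process_repo(repo: str, headers: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Collect the three fields seen in the data frame above for one 'owner/name' repo."""
    info = get_json(f"{API_ROOT}/repos/{repo}", headers)
    readme = get_json(f"{API_ROOT}/repos/{repo}/readme", headers)
    return {
        "repo": repo,
        "language": info.get("language"),  # None when GitHub reports no language
        "readme_contents": decode_readme(readme),
    }
```

Calling `process_repo` on each entry in the repository list and writing the results to `data.json` would produce a file with the same three columns loaded above; a repository with no detected language would yield a null `language`.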


In [3]:
#rows and columns
df.shape

(150, 3)

In [4]:
#list of column names and data types with more information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   repo             150 non-null    object
 1   language         147 non-null    object
 2   readme_contents  150 non-null    object
dtypes: object(3)
memory usage: 3.6+ KB


In [5]:
#check value counts of languages
df.language.value_counts()

Python              33
JavaScript          28
PHP                 15
HTML                14
TypeScript           9
Jupyter Notebook     8
Dart                 7
Java                 7
CSS                  5
Ruby                 4
C#                   4
Shell                3
R                    2
Objective-C          2
Go                   1
C++                  1
PostScript           1
TSQL                 1
Scala                1
Elixir               1
Name: language, dtype: int64

In [6]:
#check nulls
df.isnull().sum()

repo               0
language           3
readme_contents    0
dtype: int64

#### Three repositories have no listed language and need to be dropped. The text in readme_contents needs to be cleaned: normalized, tokenized, lemmatized, and stripped of stopwords. Repositories labeled Jupyter Notebook can also be relabeled as Python.
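
The two label fixes described above can be sketched in pandas as follows. This is an illustration on a toy frame, not the code in `prepare.py`; the helper name `tidy_languages` and the toy values are assumptions.

```python
import pandas as pd

# Toy frame standing in for the scraped data (values are illustrative only).
toy = pd.DataFrame({
    "repo": ["a/one", "b/two", "c/three", "d/four"],
    "language": ["Python", "Jupyter Notebook", None, "JavaScript"],
    "readme_contents": ["readme one", "readme two", "readme three", "readme four"],
})


def tidy_languages(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with no language and relabel Jupyter Notebook as Python."""
    df = df.dropna(subset=["language"])
    relabeled = df["language"].replace({"Jupyter Notebook": "Python"})
    return df.assign(language=relabeled)


tidy = tidy_languages(toy)
```

Because `dropna` keeps the original index labels, the result has gaps in its index, which is consistent with the non-contiguous index reported by `df.info()` after preparation.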

***
# Data Preparation

In [7]:
#grab cleaned df from prepare.py
df = prepare.clean_content(df, 'readme_contents', extra_words = ['p', 'aligncenter'], exclude_words = ['no'])

#drop original readme_contents
df = df.drop(columns = ['readme_contents'])

df.head()

Unnamed: 0,repo,clean_content,language
0,sherlock-project/sherlock,img srchttpsuserimagesgithubusercontentcom2706...,Python
1,Greenwolf/social_mapper,social mapper alt texthttpsimgshieldsiobadgepy...,Python
2,bonzanini/Book-SocialMediaMiningPython,mastering social medium mining python code rep...,Python
3,qeeqbox/social-analyzer,img srchttpsrawgithubusercontentcomqeeqboxsoci...,JavaScript
4,anfederico/stocktalk,aligncenterimg srchttpsrawgithubusercontentcom...,Python


In [13]:
len(df.language.value_counts())

19

In [9]:
df.isnull().sum()

repo             0
clean_content    0
language         0
dtype: int64

In [10]:
df.shape

(147, 3)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 149
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   repo           147 non-null    object
 1   clean_content  147 non-null    object
 2   language       147 non-null    object
dtypes: object(3)
memory usage: 4.6+ KB


***
# Data Exploration

### Answer questions from planning stage
- What are the most frequently occurring words in READMEs?
- Are there any words that uniquely identify a programming language?
- What does the distribution of IDFs look like for the most common words?
- Does the length of the README vary by programming language?
- Do different programming languages use a different number of unique words?
- What are the top word combinations (bigrams and trigrams)?
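
Several of these questions reduce to counting. The sketch below shows one way to compute word frequencies per language, words unique to a language, bigram counts, and a plain inverse document frequency, using only the standard library. The toy documents and all function names are assumptions for illustration and are not results from the scraped data.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

# Toy (language, cleaned text) pairs; illustrative only.
docs: List[Tuple[str, str]] = [
    ("Python", "pip install social mapper pip install requirements"),
    ("Python", "pip install sherlock run python script"),
    ("JavaScript", "npm install social analyzer run npm start"),
]


def word_counts_by_language(docs: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count word occurrences separately for each language."""
    counts: Dict[str, Counter] = {}
    for language, text in docs:
        counts.setdefault(language, Counter()).update(text.split())
    return counts


def unique_words(counts: Dict[str, Counter], language: str) -> Set[str]:
    """Words that appear for one language and for no other."""
    others: Set[str] = set()
    for other, counter in counts.items():
        if other != language:
            others.update(counter)
    return set(counts[language]) - others


def bigrams(text: str) -> List[Tuple[str, str]]:
    """Adjacent word pairs in a document."""
    words = text.split()
    return list(zip(words, words[1:]))


def idf(docs: List[Tuple[str, str]], word: str) -> float:
    """Plain IDF, log(N / document frequency); the word must occur in at least one document."""
    n_containing = sum(word in text.split() for _, text in docs)
    return math.log(len(docs) / n_containing)


counts = word_counts_by_language(docs)
all_words = sum(counts.values(), Counter())  # frequencies across every README
top_bigrams = Counter(pair for _, text in docs for pair in bigrams(text))
```

Trigrams follow the same pattern with `zip(words, words[1:], words[2:])`. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF by default, so its values will differ from this plain definition.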

In [12]:
#how many READMEs are there for each language (count and proportion)?
languages = pd.concat([df.language.value_counts(),
                    df.language.value_counts(normalize=True)], axis=1)

languages.columns = ['n', 'percent']

languages

Unnamed: 0,n,percent
Python,41,0.278912
JavaScript,28,0.190476
PHP,15,0.102041
HTML,14,0.095238
TypeScript,9,0.061224
Dart,7,0.047619
Java,7,0.047619
CSS,5,0.034014
Ruby,4,0.027211
C#,4,0.027211


***
# Data Modeling and Evaluation
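
No modeling code appears in this section yet. Based on the imports at the top of the notebook, one possible shape for this step is a TF-IDF representation fed to a classifier. The sketch below runs on a toy corpus invented for illustration, uses a single train/test split rather than train/validate/test, and picks `LogisticRegression` only as an example; none of it reflects actual project results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy cleaned READMEs and labels (invented for illustration).
texts = [
    "pip install requirements run python script",
    "python package pip install virtualenv",
    "import pandas python notebook pip",
    "python flask app pip install",
    "npm install run node server",
    "npm start react app node",
    "yarn add package node npm",
    "node express server npm install",
]
labels = ["Python"] * 4 + ["JavaScript"] * 4

# Stratify so both languages appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels
)

# Fit the vectorizer on the training text only, then reuse it on the test text.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

model = LogisticRegression().fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)

# Baseline: always predict the most common language in the training split.
baseline_accuracy = max(y_train.count(label) for label in set(y_train)) / len(y_train)
test_accuracy = accuracy_score(y_test, predictions)
```

Comparing `test_accuracy` against `baseline_accuracy` shows whether the model does better than always guessing the most common language, which matters here because Python alone accounts for roughly 28% of the prepared data.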

***
# Conclusion
A report summarizing the findings can be found [here](link to google slides).