# PLAN

- [ ] Acquisition
    - [x] Select which list of repos to scrape.
    - [x] Request the pages from the site.
    - [x] Save responses to csv.
- [ ] Preparation
    - [ ] Prepare the data for analysis.
- [ ] Exploration
    - [ ] Answer the following prompts:
        - [ ] What are the most common words in READMEs?
        - [ ] What does the distribution of IDFs look like for the most common words?
        - [ ] Does the length of the README vary by language?
        - [ ] Do different languages use a different number of unique words?
- [ ] Modeling
    - [ ] Transform the data for machine learning; use the README text to predict the language.
    - [ ] Fit several models using different text representations.
    - [ ] Build a function that takes in the text of a README file and predicts its language.
- [ ] Delivery
    - [ ] Github repo
        - [x] This notebook.
        - [ ] Documentation within the notebook.
        - [ ] README file in the repo.
        - [ ] Python scripts if applicable.
    - [ ] Google Slides
        - [ ] 1-2 slides only summarizing analysis.
        - [ ] Visualizations are labeled.
        - [ ] Geared for the general audience.
        - [ ] Share the link in the README file and/or the classroom.

# ENVIRONMENT

In [26]:
import os
import sys

import pandas as pd
import re
import json
import unicodedata
import nltk
import spacy

from requests import get
from bs4 import BeautifulSoup
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import prepare

# ACQUIRE

In [2]:
# The original plan was to search Github for "san antonio data";
# the cells below scrape the repos of the open-austin organization instead.
# https://github.com/open-austin

In [3]:
def get_github_repo(url):
    """
    This function takes a url and returns a dictionary that
    contains the content and language of the readme file.
    """
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    readme = soup.find('div', id='readme')
    language = soup.find('span', class_='lang')
    
    d = dict()
    if readme is None:
        d['readme'] = 'No readme file.'
    else:
        d['readme'] = readme.text
    if language is None:
        d['language'] = 'No language specified.'
    else:
        d['language'] = language.text
    return d

In [4]:
# This line to test out the function.
get_github_repo('https://github.com/open-austin/atx-citysdk-js')

{'readme': '\n\n\n\n        README.md\n      \n\n\nCitySDK Austin Parks\nThis is a demonstration app forked from Austin Park Equity to give a mapping example using the CitySDK from Census.gov. We\'d love for you to try to plug in data from your city.\nSee what the live Austin demo lookes like.\nWhat is the CitySDK?\nCitySDK is a toolbox for civic innovators to connect local and national public data developed by the US Census Department. You should explore their wonderful guides and documentation.\n\nCitySDK\nCitySDK Guides\nCitySDK Code Examples\n\nRequests\nHere is an example of the request we are making for demographic data in Austin, Texas (Travis County):\nvar sdk = new CitySDK();\nvar censusModule = sdk.modules.census;\ncensusModule.enable(config.citySDK_token);\n\nvar request = {\n  "lat": config.city_lat,\n  "lng": config.city_lng,\n  "level": "county",\n  "sublevel": "true",\n  "api" : "acs5",\n  "variables": [\n    "population",  // Total Population\n    "income",  // Median I

In [5]:
def get_github_links(url):
    """
    This function takes in a url and returns a list of links
    that comes from each individual repo listing page.
    """
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for link in soup.find_all('a', itemprop='name codeRepository', attrs={'href': re.compile("^/")}):
        links.append(link.get('href'))
    return links

In [6]:
# This line to test out the function.
get_github_links('https://github.com/open-austin?page=3')

['/open-austin/onebusaway-docker',
 '/open-austin/leaflet-talk',
 '/open-austin/open-austin-logo',
 '/open-austin/open-data-docs',
 '/open-austin/austin-park-equity',
 '/open-austin/ship-it-2015',
 '/open-austin/austin-parks-photos',
 '/open-austin/demo-website',
 '/open-austin/tecfiler',
 '/open-austin/liberate-the-data',
 '/open-austin/atx-citysdk-js',
 '/open-austin/mybuildingdoesntrecycle',
 '/open-austin/voteatx-svc',
 '/open-austin/hackathon-ideas',
 '/open-austin/austin-recycles',
 '/open-austin/aac-pets-feed',
 '/open-austin/austingreenmap-cordova',
 '/open-austin/council-connect',
 '/open-austin/stolen-bikes',
 '/open-austin/vanilla-rails',
 '/open-austin/OARK-back-end',
 '/open-austin/hack-projects',
 '/open-austin/hack-team-projects-list']

In [7]:
def get_all_github_links(path, num_pages):
    """
    This function takes in a url path and a number of pages
    and returns a list of lists of links, one list per page.
    """
    all_links = []
    for page in range(1, num_pages + 1):    # Listing pages are numbered from 1
        all_links.append(get_github_links(path + '?page=' + str(page)))
    return all_links

In [8]:
# This line to test out the function.
get_all_github_links('https://github.com/open-austin', 3)

[['/open-austin/influence-texas',
  '/open-austin/open-austin.github.io',
  '/open-austin/harris-county-bookings',
  '/open-austin/ballotapi',
  '/open-austin/iced-coffee',
  '/open-austin/data-portal-analysis',
  '/open-austin/budgetparty',
  '/open-austin/complaint-map',
  '/open-austin/lobbying-in-austin',
  '/open-austin/pet-finder',
  '/open-austin/open-carry',
  '/open-austin/project-ideas',
  '/open-austin/hack-task-aggregator',
  '/open-austin/data-open-austin-org',
  '/open-austin/Civic_Project',
  '/open-austin/construction-permits',
  '/open-austin/awesome-austin',
  '/open-austin/GreenBelts',
  '/open-austin/atx-restaurant-scores',
  '/open-austin/fake-the-news',
  '/open-austin/consumer-protection',
  '/open-austin/transitime-docker',
  '/open-austin/Restaurant-Health-Inspection-Score-Prediction',
  '/open-austin/government.github.com',
  '/open-austin/sporkability',
  '/open-austin/open-data-progress-report',
  '/open-austin/water-quality',
  '/open-austin/budgetparty-lan

In [9]:
def traverse(o, tree_types=(list, tuple)):
    """
    This generator flattens arbitrarily nested lists and tuples,
    yielding each non-container value in order.
    """
    if isinstance(o, tree_types):
        for value in o:
            yield from traverse(value, tree_types)
    else:
        yield o
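
A quick check of how `traverse` flattens the list of lists returned by `get_all_github_links`. This is a minimal sketch: the two-page grouping in `nested` is made up for illustration, and the comparison with `itertools.chain.from_iterable` only holds because the nesting here is exactly one level deep.

```python
from itertools import chain

def traverse(o, tree_types=(list, tuple)):
    """Yield every non-container value from arbitrarily nested lists/tuples."""
    if isinstance(o, tree_types):
        for value in o:
            yield from traverse(value, tree_types)
    else:
        yield o

# Hypothetical grouping: one inner list per listing page.
nested = [
    ['/open-austin/influence-texas', '/open-austin/ballotapi'],
    ['/open-austin/iced-coffee'],
]

flat = list(traverse(nested))

# For one level of nesting, the standard library gives the same result.
flat_with_chain = list(chain.from_iterable(nested))
```

`traverse` is the more general of the two, since it also handles deeper nesting such as `[1, [2, (3, 4)]]`.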

In [10]:
def get_github_readme(url, num_pages, cache=True):
    """
    This function returns a list of readme/language dictionaries,
    read from github_readme.json when cached, otherwise scraped.
    """
    if cache and os.path.exists('github_readme.json'):
        with open('github_readme.json') as f:
            readme_text = json.load(f)
    else:
        data = get_all_github_links(url, num_pages)
        readme_text = []
        for value in traverse(data):
            print('https://github.com' + value)
            readme_text.append(get_github_repo('https://github.com' + value))
        with open('github_readme.json', 'w') as f:
            json.dump(readme_text, f)
    return readme_text
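
The caching logic in `get_github_readme` can be tried without touching the network. The sketch below pulls the same load-or-compute pattern into a helper; `cached_json` and `fake_scrape` are hypothetical names introduced only for this illustration, and a temporary directory stands in for the notebook's working directory.

```python
import json
import os
import tempfile

def cached_json(path, compute, use_cache=True):
    """Return the JSON stored at path if present, else compute and store it."""
    if use_cache and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, 'w') as f:
        json.dump(result, f)
    return result

calls = []

def fake_scrape():
    """Stand-in for the real scraper; records each time it is invoked."""
    calls.append(1)
    return [{'readme': 'hello', 'language': 'Python'}]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'github_readme.json')
    first = cached_json(cache_path, fake_scrape)    # computes and writes the file
    second = cached_json(cache_path, fake_scrape)   # reads the file back
```

The second call returns the same data without invoking the scraper again, which is why a stale `github_readme.json` has to be deleted (or `cache=False` passed) after changing the list of repos.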

In [11]:
# This line to test out the function.
corpus = get_github_readme('https://github.com/open-austin', 3, cache=True)
corpus

[{'readme': '\n\n\n\n        README.md\n      \n\n\nInfluence Texas has launched!  Checkout the live webapp at https://app.influencetexas.com/\nFind more information at https://www.influencetexas.com/\nWelcome & Project Summary\nWelcome! We’re so glad you’ve found your way to INFLUENCE TX. This project was started at ATX Hack for Change June 2-4, 2017 by Amy M. Mosley, a former investigative reporter who was frustrated by a fruitless search for fact-based, unbiased sources for information during the toxic 2016 election cycle. In 2017 all ATX Hack for Change projects were aligned with United Nations Sustainable Development Goals. This project falls under Goal 16: Promote just, peaceful and inclusive societies.\nThe premise is this, Politicians lie.\nAnd usually, they do whatever they get paid to do. You need to know:\nWho is paying them?\nHow are they voting as a result?\nBy linking lawmakers’ campaign finance records to their voting records, taxpayers can track the influence of money i

# PREPARE

In [22]:
def basic_clean(original):
    word = original.lower()
    word = unicodedata.normalize('NFKD', word)\
                                .encode('ascii', 'ignore')\
                                .decode('utf-8', 'ignore')
    word = re.sub(r"[^a-z0-9'\s]", '', word)
    word = word.replace('\n',' ')
    word = word.replace('\t',' ')
    return word

def tokenize(original):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(basic_clean(original))

def stem(original):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in original.split()]
    original_stemmed = ' '.join(stems)
    return original_stemmed

def lemmatize(original):
    nlp = spacy.load('en', parse=True, tag=True, entity=True)
    doc = nlp(original) # process the text with spacy
    lemmas = [word.lemma_ for word in doc]
    original_lemmatized = ' '.join(lemmas)
    return original_lemmatized

def remove_stopwords(original, extra_words=None, exclude_words=None):
    # None defaults avoid sharing one mutable list between calls.
    stopword_set = set(stopwords.words('english'))
    stopword_set |= set(extra_words or [])
    stopword_set -= set(exclude_words or [])

    words = original.split()
    filtered_words = [w for w in words if w not in stopword_set]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))
    print('---')

    original_nostop = ' '.join(filtered_words)

    return original_nostop

def prep_article(article):
    
    article_stemmed = stem(basic_clean(article['readme']))
    article_lemmatized = lemmatize(article_stemmed)
    article_without_stopwords = remove_stopwords(article_lemmatized)
    
    article['stemmed'] = article_stemmed
    article['lemmatized'] = article_lemmatized
    article['clean'] = article_without_stopwords
    
    return article

def prepare_article_data(corpus):
    transformed  = []
    for article in corpus:
        transformed.append(prep_article(article))
    return transformed
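
To see what the cleaning step keeps and drops, the snippet below repeats `basic_clean` from the cell above and runs it on a made-up string. The sample text is hypothetical and chosen to exercise accents, punctuation, tabs and newlines.

```python
import re
import unicodedata

def basic_clean(original):
    """Lowercase, strip accents, drop punctuation, and flatten whitespace."""
    word = original.lower()
    word = unicodedata.normalize('NFKD', word)\
                                .encode('ascii', 'ignore')\
                                .decode('utf-8', 'ignore')
    word = re.sub(r"[^a-z0-9'\s]", '', word)
    word = word.replace('\n', ' ')
    word = word.replace('\t', ' ')
    return word

# Hypothetical sample input.
sample = "Caf\u00e9-Society\tREADME's\n2017!"
cleaned = basic_clean(sample)
print(cleaned)   # cafesociety readme's 2017
```

Note that the hyphen is deleted rather than replaced with a space, so hyphenated terms are fused into a single token; replacing punctuation with a space instead would keep them separate.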

In [27]:
prepare_article_data(corpus)

Removed 273 stopwords
---
Removed 50 stopwords
---
Removed 84 stopwords
---
Removed 63 stopwords
---
Removed 46 stopwords
---
Removed 216 stopwords
---
Removed 167 stopwords
---
Removed 45 stopwords
---
Removed 43 stopwords
---
Removed 161 stopwords
---
Removed 33 stopwords
---
Removed 183 stopwords
---
Removed 141 stopwords
---
Removed 24 stopwords
---
Removed 71 stopwords
---
Removed 50 stopwords
---
Removed 29 stopwords
---
Removed 3 stopwords
---
Removed 2 stopwords
---
Removed 126 stopwords
---
Removed 1 stopwords
---
Removed 22 stopwords
---
Removed 43 stopwords
---
Removed 133 stopwords
---
Removed 259 stopwords
---
Removed 95 stopwords
---
Removed 100 stopwords
---
Removed 5 stopwords
---
Removed 77 stopwords
---
Removed 14 stopwords
---
Removed 6 stopwords
---
Removed 143 stopwords
---
Removed 386 stopwords
---
Removed 69 stopwords
---
Removed 42 stopwords
---
Removed 65 stopwords
---
Removed 74 stopwords
---
Removed 34 stopwords
---
Removed 206 stopwords
---
Removed 2 stopwor

[{'readme': '\n\n\n\n        README.md\n      \n\n\nInfluence Texas has launched!  Checkout the live webapp at https://app.influencetexas.com/\nFind more information at https://www.influencetexas.com/\nWelcome & Project Summary\nWelcome! We’re so glad you’ve found your way to INFLUENCE TX. This project was started at ATX Hack for Change June 2-4, 2017 by Amy M. Mosley, a former investigative reporter who was frustrated by a fruitless search for fact-based, unbiased sources for information during the toxic 2016 election cycle. In 2017 all ATX Hack for Change projects were aligned with United Nations Sustainable Development Goals. This project falls under Goal 16: Promote just, peaceful and inclusive societies.\nThe premise is this, Politicians lie.\nAnd usually, they do whatever they get paid to do. You need to know:\nWho is paying them?\nHow are they voting as a result?\nBy linking lawmakers’ campaign finance records to their voting records, taxpayers can track the influence of money i

# EXPLORE
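
The exploration prompts in the plan (most common words, IDF of those words, README length by language) could be approached along the following lines. This is only a sketch using the standard library: `toy_corpus` is a hypothetical stand-in for the output of `prepare_article_data`, and the helper names are illustrative rather than part of the notebook.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for the prepared READMEs.
toy_corpus = [
    {'language': 'Python', 'clean': 'data austin map data'},
    {'language': 'JavaScript', 'clean': 'map app austin'},
    {'language': 'Python', 'clean': 'data script'},
]

def word_counts(docs):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc['clean'].split())
    return counts

def idf(docs):
    """Compute log(N / df) per word; df is the number of documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc['clean'].split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def mean_length_by_language(docs):
    """Average README length in words for each language."""
    lengths = defaultdict(list)
    for doc in docs:
        lengths[doc['language']].append(len(doc['clean'].split()))
    return {lang: sum(v) / len(v) for lang, v in lengths.items()}

top_words = word_counts(toy_corpus).most_common(3)
idf_scores = idf(toy_corpus)
avg_lengths = mean_length_by_language(toy_corpus)
```

With the real corpus, plotting `idf_scores` for the words in `top_words` would address the IDF prompt, and counting `len(set(...))` per language instead of `len(...)` would address the unique-words prompt.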

# MODEL
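
One possible shape for the prediction function named in the plan is sketched below: a small multinomial Naive Bayes classifier over bag-of-words counts, written with the standard library only. This is an assumption about the approach, not the notebook's chosen model; a library such as scikit-learn would likely be used in practice, and the training texts and labels here are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Collect per-language word counts for a multinomial Naive Bayes model."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {'class_docs': class_docs, 'word_counts': word_counts,
            'vocab': vocab, 'n_docs': len(labels)}

def predict_language(model, text):
    """Return the language with the highest log-probability for the text."""
    # Words never seen in training are ignored.
    tokens = [t for t in text.split() if t in model['vocab']]
    best_label, best_score = None, -math.inf
    for label, n in model['class_docs'].items():
        counts = model['word_counts'][label]
        # Add-one (Laplace) smoothing over the vocabulary.
        total = sum(counts.values()) + len(model['vocab'])
        score = math.log(n / model['n_docs'])
        for t in tokens:
            score += math.log((counts[t] + 1) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical cleaned README texts and their languages.
texts = ['pip install pandas notebook', 'npm install node react',
         'pandas dataframe notebook', 'react component npm']
labels = ['Python', 'JavaScript', 'Python', 'JavaScript']

model = train_naive_bayes(texts, labels)
prediction = predict_language(model, 'npm react app')
```

In the notebook, `texts` would come from the `clean` field produced by `prepare_article_data`, and repos labeled 'No language specified.' would probably need to be dropped before training.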