# PLAN

- [x] Acquisition
    - [x] Select what list of repos to scrape.
    - [x] Get requests from the site.
    - [x] Save responses to csv.
- [x] Preparation
    - [x] Prepare the data for analysis.
- [ ] Exploration
    - [ ] Answer the following prompts:
        - [x] What are the most common words in READMEs?
        - [ ] What does the distribution of IDFs look like for the most common words? - Jason
        - [x] Does the length of the README vary by language? - Chad
        - [x] Do different languages use a different number of unique words? - DD
- [ ] Modeling
    - [x] Transform the data for machine learning; use language to predict.
    - [x] Fit several models using different text representations.
    - [ ] Build a function that takes the text of a README file and predicts its language.
- [ ] Delivery
    - [ ] Github repo
        - [x] This notebook.
        - [ ] Documentation within the notebook.
        - [x] README file in the repo.
        - [ ] Python scripts if applicable.
    - [ ] Google Slides
        - [ ] 1-2 slides only summarizing analysis.
        - [ ] Visualizations are labeled.
        - [ ] Geared for the general audience.
        - [ ] Share link @ readme file and/or classroom.

# ENVIRONMENT

In [1]:
import os
import sys

import pandas as pd
import re
import json
import unicodedata
import nltk
import spacy

from requests import get
from bs4 import BeautifulSoup
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from wordcloud import WordCloud

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

ADDITIONAL_STOPWORDS = ['readme', '\n\n\n', '-PRON-', 'python', 'javascript']

# ACQUIRE

In [2]:
# We scrape the repositories listed on a Github organization page.
# The run below uses https://github.com/texastribune; the helper functions were tested against
# https://github.com/open-austin

In [3]:
def get_github_repo(url):
    """
    This function takes a url and returns a dictionary that
    contains the content and language of the readme file.
    """
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    readme = soup.find('div', id='readme')
    language = soup.find('span', class_='lang')
    
    d = dict()
    if readme is None:
        d['readme'] = 'No readme file.'
    else:
        d['readme'] = readme.text
    if language is None:
        d['language'] = 'No language specified.'
    else:
        d['language'] = language.text
    return d

In [4]:
# # This line to test out the function.
# get_github_repo('https://github.com/open-austin/atx-citysdk-js')

In [5]:
def get_github_links(url):
    """
    This function takes in a url and returns a list of links
    that comes from each individual repo listing page.
    """
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = []
    for link in soup.findAll('a', itemprop='name codeRepository', attrs={'href': re.compile("^/")}):
        links.append(link.get('href'))
    return links

In [6]:
# # This line to test out the function.
# get_github_links('https://github.com/open-austin?page=3')

In [7]:
def get_all_github_links(path, num_pages):
    """
    This function takes in a url path and a number of pages
    and returns a list of lists of links, one inner list per page.
    """
    all_links = []
    for page in range(1, num_pages + 1):      # Pages are numbered from 1
        # get_github_links fetches and parses the listing page itself
        all_links.append(get_github_links(path + '?page=' + str(page)))
    return all_links
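
The pagination step above amounts to building one listing URL per page. A minimal sketch of that step alone, with no network access; `build_page_urls` is a hypothetical helper name that does not appear in the notebook, and the `?page=N` query format is the one already used in `get_all_github_links`:

```python
def build_page_urls(path, num_pages):
    """Return the listing URL for each page, numbered from 1 to num_pages."""
    # Hypothetical helper for illustration; pages are 1-indexed.
    return [f"{path}?page={page}" for page in range(1, num_pages + 1)]


urls = build_page_urls("https://github.com/open-austin", 3)
for url in urls:
    print(url)
```

Keeping URL construction separate from fetching makes the page numbering easy to check before any request is sent.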

In [8]:
# # This line to test out the function.
# get_all_github_links('https://github.com/open-austin', 3)

In [9]:
def traverse(o, tree_types=(list, tuple)):
    if isinstance(o, tree_types):
        for value in o:
            for subvalue in traverse(value, tree_types):
                yield subvalue
    else:
        yield o
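
To show what `traverse` does with the list of lists returned by `get_all_github_links`, here is a small usage sketch. The function body is restated with `yield from`, which should behave the same as the nested loop above; the sample paths are hypothetical.

```python
def traverse(o, tree_types=(list, tuple)):
    """Yield the leaves of an arbitrarily nested list/tuple, depth first."""
    if isinstance(o, tree_types):
        for value in o:
            yield from traverse(value, tree_types)
    else:
        yield o


# Hypothetical sample: one inner list of repo paths per listing page.
links_by_page = [["/org/repo-a", "/org/repo-b"], [], ["/org/repo-c"]]
flat_links = list(traverse(links_by_page))
print(flat_links)
```

Strings are not in `tree_types`, so each path is yielded whole rather than split into characters.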

In [10]:
def get_github_readme(url, num_pages, cache=True):
    """
    This function returns a list of readme/language dictionaries, read from
    github_readme.json when cached, otherwise scraped and saved there.
    """
    if cache and os.path.exists('github_readme.json'):
        with open('github_readme.json') as f:
            readme_text = json.load(f)
    else:
        data = get_all_github_links(url, num_pages)
        readme_text = []
        for value in traverse(data):
            print('https://github.com' + value)
            readme_text.append(get_github_repo('https://github.com' + value))
        with open('github_readme.json', 'w') as f:
            json.dump(readme_text, f)
    return readme_text
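
The caching pattern in `get_github_readme` can be tried without hitting the site. The sketch below is an illustration under assumptions: `load_or_build` and `fake_scrape` are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import json
import os
import tempfile


def load_or_build(cache_path, build_fn, cache=True):
    """Return cached JSON data if present; otherwise build, save, and return it."""
    if cache and os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = build_fn()
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data


# Stand-in for the scraper: records how often it is actually invoked.
calls = []


def fake_scrape():
    calls.append(1)
    return [{"language": "Python", "readme": "hello"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "github_readme.json")
    first = load_or_build(path, fake_scrape)   # cache miss: builds and writes the file
    second = load_or_build(path, fake_scrape)  # cache hit: reads the file instead

print(len(calls), first == second)
```

One consequence of this design is that a stale cache file is returned as-is, so the file has to be deleted (or `cache=False` passed) to pick up a different organization or page count.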

In [11]:
# Bringing it all together chaining...
from pprint import pprint
corpus = get_github_readme('https://github.com/texastribune', 8, cache=True)
pprint(corpus)

[{'language': 'Dockerfile',
  'readme': '\n'
            '\n'
            '\n'
            '\n'
            '        README.md\n'
            '      \n'
            '\n'
            '\n'
            'Base images\n'
            'How to make updates:\n'
            '\n'
            'Create a new branch\n'
            "If you're adding a python dependency:\n"
            '\n'
            'Run make run-base\n'
            "Run poetry add --dev <package> (drop the --dev if it's a "
            'production\n'
            'dependency)\n'
            'For other operations see the\n'
            'poetry docs\n'
            'Maybe edit pyproject.toml by hand if necessary\n'
            'Run poetry lock\n'
            '\n'
            '\n'
            "If it's a node dependency:\n"
            '\n'
            'Run make run-dev\n'
            'Do whatever node/yarn things you people do ;-)\n'
            '\n'
            '\n'
            'Bump the version in VERSION file\n'
            'Bump vers

            'https://github.com/stdbrouw/django-locking\n'
            'https://github.com/runekaagaard/django-locking\n'
            'https://github.com/theatlantic/django-locking\n'
            'https://github.com/ortsed/django-locking\n'
            '\n'
            '\n'},
 {'language': 'No language specified.', 'readme': 'No readme file.'},
 {'language': 'Python',
  'readme': '\n'
            '\n'
            '\n'
            '\n'
            '        README.md\n'
            '      \n'
            '\n'
            '\n'
            'yourls\n'
            'Simple Docker container for YOURLS.\n'
            'Usage\n'
            'Available on Docker Hub as texastribune/yourls.\n'
            '$ docker run \\\n'
            '    -e YOURLS_DB_USER=root \\\n'
            '    -e YOURLS_DB_PASS=supersecureyo \\\n'
            '    -e YOURLS_DB_NAME=yourls \\\n'
            '    -e YOURLS_DB_HOST=localhost \\\n'
            '    -e YOURLS_DB_PREFIX=yourls_ \\\n'
            '    -e YOURLS

            'Create a topic branch to house your changes\n'
            'Get all of your commits in the new topic branch\n'
            'Submit a pull request\n'
            '\n'
            '\n'
            'State of Project\n'
            'Armstrong is an open-source news platform that is freely '
            'available to any\n'
            'organization.  It is the result of a collaboration between the '
            'Texas Tribune\n'
            'and Bay Citizen, and a grant from the John S. and James L. '
            'Knight\n'
            'Foundation.\n'
            'To follow development, be sure to join the Google Group.\n'
            'armstrong.core.arm_section is part of the Armstrong project.  '
            "You're\n"
            'probably looking for that.\n'
            '\n'
            'License\n'
            'Copyright 2011 Bay Citizen and Texas Tribune\n'
            'Licensed under the Apache License, Version 2.0 (the "License");\n'
            'you may not use this f

            'that spreadsheet and/or have not been authenticated against '
            "access to Google's Drive API with the Texas Tribune's graphics "
            'app, this will not work for you. (This should only apply to those '
            'of you playing a long at home/not members of News Apps.)\n'
            "You'll then need to pull down the raw assets from S3. We don't "
            'commit images and the like.\n'
            'npm run assets/pull\n'
            'To test the site locally, run gulp serve.\n'
            'Deployment\n'
            'Always be sure to get the latest data first!\n'
            'npm run spreadsheet/fetch\n'
            'Then build the site:\n'
            'gulp\n'
            'And finally deploy:\n'
            'npm run deploy\n'
            '(Similar caveats as above –\xa0if you do not have clearance to '
            "access the Texas Tribune's S3 buckets this step will break for "
            'you.)\n'
            '\n'
            '\n'},
 {'lan

            '\n'
            '\n'},
 {'language': 'Python',
  'readme': '\n'
            '\n'
            '\n'
            '\n'
            '        README.md\n'
            '      \n'
            '\n'
            '\n'
            'tx_lobbying\n'
            '\n'
            'Very early alpha\n'
            'About the data\n'
            'The two main sources of data are:\n'
            '\n'
            '\n'
            'The lists of registered lobbyists, by year:\n'
            'http://www.ethics.state.tx.us/dfs/loblists.htm\n'
            '\n'
            '\n'
            'And the coversheets for the lobbyist activies reports (LA):\n'
            'http://www.ethics.state.tx.us/dfs/search_LOBBY.html\n'
            '\n'
            '\n'
            'Names come from both sources of data, but only the coversheets '
            'have detailed\n'
            'information about names.\n'
            'The information for lobbying interests come from the registration '
            'forms. Thi

            'etc). Streaming requires custom framing.\n'
            '\n'
            'License\n'
            'See LICENSE file.\n'
            '\n'
            '\n'},
 {'language': 'JavaScript', 'readme': 'No readme file.'},
 {'language': 'No language specified.',
  'readme': '\n'
            '\n'
            '\n'
            '\n'
            '        README.md\n'
            '      \n'
            '\n'
            '\n'
            'tutum-docker-mysql\n'
            'Base docker image to run a MySQL database server\n'
            'MySQL version\n'
            'master branch maintains MySQL from Ubuntu trusty official source. '
            'If you want to get different version of MySQL, please checkout '
            '5.5 branch and 5.6 branch.\n'
            'If you want to use MariaDB, please check our tutum/mariadb image: '
            'https://github.com/tutumcloud/tutum-docker-mariadb\n'
            'Usage\n'
            'To create the image tutum/mysql, execute the following comma

            'Contributing\n'
            '\n'
            'Create something awesome -- make the code better, add some '
            'functionality,\n'
            'whatever (this is the hardest part).\n'
            'Fork it\n'
            'Create a topic branch to house your changes\n'
            'Get all of your commits in the new topic branch\n'
            'Submit a pull request\n'
            '\n'
            '\n'
            'State of Project\n'
            'Armstrong is an open-source news platform that is freely '
            'available to any\n'
            'organization.  It is the result of a collaboration between the '
            'Texas Tribune\n'
            'and Bay Citizen, and a grant from the John S. and James L. '
            'Knight\n'
            'Foundation.\n'
            'To follow development, be sure to join the Google Group.\n'
            'armstrong.apps.related_content is part of the Armstrong project.  '
            "You're\n"
            'probably lookin

# PREPARE

In [12]:
def basic_clean(original):
    word = original.lower()
    word = unicodedata.normalize('NFKD', word)\
                                .encode('ascii', 'ignore')\
                                .decode('utf-8', 'ignore')
    word = re.sub(r"[^a-z'\s]", ' ', word)
    word = word.replace('\n',' ')
    word = word.replace('\t',' ')
    return word

def tokenize(original):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(basic_clean(original))

def stem(original):
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in original.split()]
    original_stemmed = ' '.join(stems)
    return original_stemmed

def lemmatize(original):
    nlp = spacy.load('en', parse=True, tag=True, entity=True)
    doc = nlp(original) # process the text with spacy
    lemmas = [word.lemma_ for word in doc]
    original_lemmatized = ' '.join(lemmas)
    return original_lemmatized

def remove_stopwords(original, extra_words=('readmemd',), exclude_words=()):
    # Build the stopword set once per call; tuples avoid mutable default arguments
    stopword_set = set(stopwords.words('english') + ADDITIONAL_STOPWORDS)
    stopword_set.update(extra_words)
    stopword_set.difference_update(exclude_words)

    words = original.split()
    filtered_words = [w for w in words if w not in stopword_set]

    print('Removed {} stopwords'.format(len(words) - len(filtered_words)))

    original_nostop = ' '.join(filtered_words)

    return original_nostop

def prep_article(article):

#    article_stemmed = stem(basic_clean(article['readme']))
#    Note the stem line immediately above has been commented out,
#    the first item below retains the same name as the stem line above, to make everything else work.
    article_stemmed = basic_clean(article['readme'])
    article_lemmatized = lemmatize(article_stemmed)
    article_without_stopwords = remove_stopwords(article_lemmatized)
    
    article['stemmed'] = article_stemmed
    article['lemmatized'] = article_lemmatized
    article['clean'] = article_without_stopwords
    
    return article

def prepare_article_data(corpus):
    transformed  = []
    for article in corpus:
        transformed.append(prep_article(article))
    return transformed

# This is to fix the string as list of words per readme file glitch
def clean(text):
    'A simple function to cleanup text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', ' ', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]
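
To see what the first cleaning step does to a piece of text, the sketch below restates `basic_clean` using only the standard library and runs it on a hypothetical README fragment (not taken from the scraped corpus). It also suggests, as an assumption rather than a verified finding, why so many cleaned texts later begin with `md`: the period in `README.md` is replaced by a space, so the file name becomes the two tokens `readme` and `md`, and the combined token `readmemd` would never appear.

```python
import re
import unicodedata


def basic_clean(original):
    """Lowercase, drop accents and non-ASCII, keep only letters, apostrophes, whitespace."""
    word = original.lower()
    word = (unicodedata.normalize('NFKD', word)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))
    word = re.sub(r"[^a-z'\s]", ' ', word)
    return word.replace('\n', ' ').replace('\t', ' ')


# Hypothetical README fragment, for illustration only.
sample = "Café README.md\n\tRun `make` 2 times!"
tokens = basic_clean(sample).split()
print(tokens)
```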

In [13]:
df = pd.DataFrame(prepare_article_data(corpus))
df.shape

Removed 39 stopwords
Removed 184 stopwords
Removed 1020 stopwords
Removed 267 stopwords
Removed 73 stopwords
Removed 18 stopwords
Removed 20 stopwords
Removed 2 stopwords
Removed 2 stopwords
Removed 1 stopwords
Removed 25 stopwords
Removed 102 stopwords
Removed 257 stopwords
Removed 1238 stopwords
Removed 8 stopwords
Removed 109 stopwords
Removed 547 stopwords
Removed 2 stopwords
Removed 2 stopwords
Removed 40 stopwords
Removed 507 stopwords
Removed 326 stopwords
Removed 349 stopwords
Removed 2 stopwords
Removed 35 stopwords
Removed 17 stopwords
Removed 22 stopwords
Removed 9 stopwords
Removed 30 stopwords
Removed 47 stopwords
Removed 180 stopwords
Removed 122 stopwords
Removed 50 stopwords
Removed 120 stopwords
Removed 141 stopwords
Removed 5 stopwords
Removed 173 stopwords
Removed 263 stopwords
Removed 132 stopwords
Removed 39 stopwords
Removed 57 stopwords
Removed 305 stopwords
Removed 125 stopwords
Removed 505 stopwords
Removed 5 stopwords
Removed 12 stopwords
Removed 184 stopwords

(211, 5)

In [14]:
df = df[['clean', 'language']]
# remove_stopwords(df.iloc[11].clean) - ZACH'S DIAGNOSTIC TEST

In [15]:
languages = pd.concat([df.language.value_counts(),
                    df.language.value_counts(normalize=True)], axis=1)
languages.columns = ['n', 'ratio']
languages

Unnamed: 0,n,ratio
Python,68,0.322275
JavaScript,60,0.28436
No language specified.,20,0.094787
CSS,20,0.094787
HTML,14,0.066351
Shell,13,0.061611
Makefile,5,0.023697
Dockerfile,5,0.023697
Ruby,3,0.014218
Jupyter Notebook,2,0.009479


In [16]:
# Remove all rows that have 'No language specified.'
df = df[df.language != 'No language specified.']
df = df.rename(index=str, columns={"clean": "text"})

In [17]:
df.shape

(191, 2)

In [18]:
languages = pd.concat([df.language.value_counts(),
                    df.language.value_counts(normalize=True)], axis=1)
languages.columns = ['n', 'ratio']
languages

Unnamed: 0,n,ratio
Python,68,0.356021
JavaScript,60,0.314136
CSS,20,0.104712
HTML,14,0.073298
Shell,13,0.068063
Makefile,5,0.026178
Dockerfile,5,0.026178
Ruby,3,0.015707
Jupyter Notebook,2,0.010471
CoffeeScript,1,0.005236


## DECISION POINT

### _Based on the language distribution above, we decided to focus our analysis on Python and JavaScript, which together make up 67% of the data._
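
The 67% figure can be checked by hand from the counts printed above. A short sketch using those counts (copied from the second `value_counts` output, after rows with no language were dropped):

```python
# Repo counts per language, copied from the value_counts output above.
counts = {
    'Python': 68, 'JavaScript': 60, 'CSS': 20, 'HTML': 14, 'Shell': 13,
    'Makefile': 5, 'Dockerfile': 5, 'Ruby': 3, 'Jupyter Notebook': 2,
    'CoffeeScript': 1,
}

total = sum(counts.values())
kept = counts['Python'] + counts['JavaScript']
share = kept / total
print(f"{kept} of {total} repos = {share:.1%}")
```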

In [19]:
df = df.loc[df['language'].isin(['Python', 'JavaScript'])]
df.shape

(128, 2)

In [20]:
df.head()

Unnamed: 0,text,language
1,md software collect donation nonprofit integra...,Python
2,md datum visual create tool generate scaffoldi...,JavaScript
5,md texas tribune file app app power file syste...,JavaScript
9,md thermometer,Python
10,md wall query salesforce opportunity informati...,Python


# EXPLORE

*Explore the data that you have scraped. Here are some ideas for exploration:*

- What are the most common words in READMEs?
- What does the distribution of IDFs look like for the most common words?
- Does the length of the README vary by language?
- Do different languages use a different number of unique words?
- Which programming language community has more positive sentiment?
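
For the IDF question, a minimal sketch of the textbook definition, idf(t) = log(N / df(t)), may help before plotting anything. The documents are hypothetical, and `inverse_document_frequency` is an illustrative helper. Note that `TfidfVectorizer` applies a smoothed variant by default, so its values will differ from these.

```python
import math


def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.split()):  # set(): count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical cleaned READMEs, for illustration only.
docs = ["run app django", "run app npm", "run file", "run template django"]
idf = inverse_document_frequency(docs)
for word, score in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {score:.3f}")
```

A word that appears in every document gets an IDF of zero, which is why very common words carry little weight in a TF-IDF representation.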

In [21]:
df1 = df.copy()
df1.head()

Unnamed: 0,text,language
1,md software collect donation nonprofit integra...,Python
2,md datum visual create tool generate scaffoldi...,JavaScript
5,md texas tribune file app app power file syste...,JavaScript
9,md thermometer,Python
10,md wall query salesforce opportunity informati...,Python


In [22]:
df2 = df.copy()
df2.head()

Unnamed: 0,text,language
1,md software collect donation nonprofit integra...,Python
2,md datum visual create tool generate scaffoldi...,JavaScript
5,md texas tribune file app app power file syste...,JavaScript
9,md thermometer,Python
10,md wall query salesforce opportunity informati...,Python


In [23]:
df2['readme_len'] = df2['text'].apply(len)
df2

Unnamed: 0,text,language,readme_len
1,md software collect donation nonprofit integra...,Python,2512
2,md datum visual create tool generate scaffoldi...,JavaScript,9396
5,md texas tribune file app app power file syste...,JavaScript,214
9,md thermometer,Python,14
10,md wall query salesforce opportunity informati...,Python,228
11,md scuole italian school public school setup p...,Python,822
14,md geoip super simple node js base deployment ...,JavaScript,86
15,md talk online comment break open source comme...,JavaScript,1204
16,rst tx salary django application generate use ...,Python,7107
18,file,Python,4


In [24]:
# readme_len was added to df2, so filter df2 rather than df
python_df = df2[df2['language'] == 'Python']
js_df = df2[df2['language'] == 'JavaScript']

In [25]:
python_df.readme_len.mean()

In [None]:
js_df.readme_len.mean()

In [None]:
js_df.readme_len.mean() - python_df.readme_len.mean()

## **ANSWER:**
### _Yes, the length of a README file does vary by language. On average, JavaScript README files are 533 characters longer than Python README files._
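
The comparison behind this answer is a per-language mean of `readme_len`. The sketch below reproduces that calculation with the standard library on only the ten rows displayed earlier, so its result is a preview and will not match the 533-character difference reported for the full data.

```python
from statistics import mean

# (language, readme_len) pairs copied from the ten rows displayed above;
# this is only a preview of the data, not the full set of 128 READMEs.
rows = [
    ('Python', 2512), ('JavaScript', 9396), ('JavaScript', 214), ('Python', 14),
    ('Python', 228), ('Python', 822), ('JavaScript', 86), ('JavaScript', 1204),
    ('Python', 7107), ('Python', 4),
]

lengths = {}
for language, readme_len in rows:
    lengths.setdefault(language, []).append(readme_len)

mean_len = {language: mean(values) for language, values in lengths.items()}
difference = mean_len['JavaScript'] - mean_len['Python']
print(mean_len, round(difference, 1))
```

Even in this small preview a few very long READMEs dominate each mean, so a median comparison might be worth reporting alongside it.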

In [None]:
# Creating series of words by language:
python_words = clean(' '.join(df[df.language == 'Python'].text))
js_words = clean(' '.join(df[df.language == 'JavaScript'].text))

all_words = clean(' '.join(df.text))

In [None]:
all_freq = pd.Series(all_words).value_counts()
python_freq = pd.Series(python_words).value_counts()

js_freq = pd.Series(js_words).value_counts()
python_freq.head()

In [None]:
print(all_freq.shape)
print(python_freq.shape)
print(js_freq.shape)

In [None]:
word_counts = (pd.concat([python_freq, js_freq, all_freq], axis=1, sort=True)
                .set_axis(['python', 'js', 'all'], axis=1)
                .fillna(0)
                .apply(lambda s: s.astype(int)))

word_counts.head(10)

## QUESTION:
### _What are the most frequently occurring words? / What are the most common words in READMEs?_

In [None]:
word_counts.sort_values(by='all', ascending=False).head(10)

## ANSWER:
### _The most frequently occurring words are: use, app, run, datum, file, project, j, http, django, and template._

## QUESTION:
### _Are there any words that uniquely identify a particular language?_

In [None]:
pd.concat([word_counts[word_counts.js == 0].sort_values(by='python').tail(5),
           word_counts[word_counts.python == 0].sort_values(by='js').tail(5)])

## ANSWER:
### _Yes, see above dataframe._
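
The filter used for this question keeps words whose count is zero in the other language. A standard-library sketch of the same idea, using hypothetical word lists in place of `python_words` and `js_words`:

```python
from collections import Counter

# Hypothetical word lists standing in for python_words and js_words.
python_words = ['django', 'run', 'pip', 'django', 'app']
js_words = ['npm', 'run', 'gulp', 'npm', 'app', 'npm']

python_freq = Counter(python_words)
js_freq = Counter(js_words)

# Words seen in one language and never in the other, most frequent first.
only_python = [(w, n) for w, n in python_freq.most_common() if w not in js_freq]
only_js = [(w, n) for w, n in js_freq.most_common() if w not in python_freq]
print(only_python)
print(only_js)
```

With a corpus this small, a word that is exclusive to one language may simply be rare, so the counts are worth reading alongside the words themselves.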

In [None]:
# Figure out each language's share of the 20 most common words
(word_counts
 .assign(p_python=word_counts.python / word_counts['all'],
         p_js=word_counts.js / word_counts['all']
        )
 .sort_values(by='all')
 [['p_python',
   'p_js'
  ]]
 .tail(20)
 .sort_values('p_python')
 .plot.barh(stacked=True, figsize=(12,5), width=.9))

plt.title('Proportions of Python vs. JavaScript for the 20 most common words')

In [None]:
word_counts[(word_counts.python > 10) & (word_counts.js > 10)]\
    .assign(ratio=lambda df: df.python / df.js)\
    .sort_values(by='ratio')

In [None]:
all_cloud = WordCloud(background_color='white', height=1000, width=400, random_state=123).generate(' '.join(all_words))
python_cloud = WordCloud(background_color='white', height=600, width=800, random_state=123).generate(' '.join(python_words))
js_cloud = WordCloud(background_color='white', height=600, width=800, random_state=123).generate(' '.join(js_words))

plt.figure(figsize=(10, 8))
axs = [plt.axes([0, 0, .5, 1]), plt.axes([.5, .5, .5, .5]), plt.axes([.5, 0, .5, .5])]

axs[0].imshow(all_cloud)
axs[1].imshow(python_cloud)
axs[2].imshow(js_cloud)

axs[0].set_title('All Words')
axs[1].set_title('Python')
axs[2].set_title('JS')

for ax in axs: ax.axis('off')

In [None]:
top_20_python_bigrams = (pd.Series(nltk.ngrams(python_words, 2))
                      .value_counts()
                      .head(20))

top_20_python_bigrams.head()
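
`nltk.ngrams` slides a window of n words across the list. The plain-Python sketch below illustrates that idea; it is not NLTK's implementation, and the word list is hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return list(zip(*(words[i:] for i in range(n))))


# Hypothetical cleaned word list, for illustration only.
words = ['npm', 'run', 'build', 'npm', 'run', 'deploy']
bigram_counts = Counter(ngrams(words, 2))
print(bigram_counts.most_common(2))
```

Because the words of all READMEs are joined into one list first, some n-grams will span the boundary between two different READMEs.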

In [None]:
top_20_js_bigrams = (pd.Series(nltk.ngrams(js_words, 2))
                      .value_counts()
                      .head(20))

top_20_js_bigrams.head()

In [None]:
# Sort once and reuse, so the tick labels follow the same order as the bars
sorted_python_bigrams = top_20_python_bigrams.sort_values()
sorted_python_bigrams.plot.barh(color='blue', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring Python bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = [' '.join(bigram) for bigram in sorted_python_bigrams.index]
_ = plt.yticks(ticks, labels)

In [None]:
# Sort once and reuse, so the tick labels follow the same order as the bars
sorted_js_bigrams = top_20_js_bigrams.sort_values()
sorted_js_bigrams.plot.barh(color='orange', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring JavaScript bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = [' '.join(bigram) for bigram in sorted_js_bigrams.index]
_ = plt.yticks(ticks, labels)

In [None]:
data = {k[0] + ' ' + k[1]: v for k, v in top_20_python_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=1200, height=800, random_state=123).generate_from_frequencies(data)

plt.figure(figsize=(12, 10))
plt.imshow(img)
plt.axis('off')
plt.title('Top 20 Python Bigrams')

In [None]:
data = {k[0] + ' ' + k[1]: v for k, v in top_20_js_bigrams.to_dict().items()}
img = WordCloud(background_color='white', width=1200, height=800, random_state=123).generate_from_frequencies(data)
plt.figure(figsize=(12, 10))
plt.imshow(img)
plt.axis('off')
plt.title('Top 20 JavaScript Bigrams')

In [None]:
top_20_python_trigrams = (pd.Series(nltk.ngrams(python_words, 3))
                      .value_counts()
                      .head(20))

top_20_python_trigrams.head()

In [None]:
# Sort once and reuse, so the tick labels follow the same order as the bars
sorted_python_trigrams = top_20_python_trigrams.sort_values()
sorted_python_trigrams.plot.barh(color='blue', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring Python Trigrams')
plt.ylabel('Trigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = [' '.join(trigram) for trigram in sorted_python_trigrams.index]
_ = plt.yticks(ticks, labels)

In [None]:
data = {k[0] + ' ' + k[1] + ' ' + k[2]: v for k, v in top_20_python_trigrams.to_dict().items()}
img_python_tri = WordCloud(background_color='white', width=1200, height=800).generate_from_frequencies(data)
plt.figure(figsize=(12, 8))
plt.imshow(img_python_tri)
plt.axis('off')

In [None]:
top_20_js_trigrams = (pd.Series(nltk.ngrams(js_words, 3))
                      .value_counts()
                      .head(20))

top_20_js_trigrams.head()

In [None]:
# Sort once and reuse, so the tick labels follow the same order as the bars
sorted_js_trigrams = top_20_js_trigrams.sort_values()
sorted_js_trigrams.plot.barh(color='blue', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring JavaScript Trigrams')
plt.ylabel('Trigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = [' '.join(trigram) for trigram in sorted_js_trigrams.index]
_ = plt.yticks(ticks, labels)

In [None]:
data = {k[0] + ' ' + k[1] + ' ' + k[2]: v for k, v in top_20_js_trigrams.to_dict().items()}
img_js_tri = WordCloud(background_color='white', width=1200, height=800).generate_from_frequencies(data)
plt.figure(figsize=(12, 8))
plt.imshow(img_js_tri)
plt.axis('off')

## QUESTION:
### _Do different languages use a different number of unique words?_

In [None]:
df1.head()

In [None]:
df1_python = df1[df1.language == 'Python']
df1_python.shape

In [None]:
df1_js = df1[df1.language == 'JavaScript']
df1_js.shape

In [None]:
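def flatten(lofl):
    """Recursively flatten a list of lists into a single flat list."""
    output = []     # local accumulator, so no global variable is needed
    for i in lofl:
        if isinstance(i, list):
            output.extend(flatten(i))
        else:
            output.append(i)
    return output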

In [None]:
string_python = []
for text in df1_python.text.tolist():
    string_python.append(text.split())

In [None]:
output = []
words_in_python = pd.Series(flatten(string_python))
print(len(set(words_in_python)))

In [None]:
string_js = []
for text in df1_js.text.tolist():
    string_js.append(text.split())

In [None]:
output = []
words_in_js = pd.Series(flatten(string_js))
print(len(set(words_in_js)))

## ANSWER:
### _Yes, but not by much (relatively speaking). The README files of repositories written primarily in Python contain 47 more unique words than those of repositories written in JavaScript._
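
The unique-word counts above come down to building one set of words per language. A compact sketch of that step, using hypothetical cleaned texts rather than the scraped data:

```python
# Hypothetical cleaned README texts per language, for illustration only.
texts = {
    'Python': ['django app run', 'run test django'],
    'JavaScript': ['npm run build', 'gulp serve'],
}

unique_words = {
    language: {word for text in readmes for word in text.split()}
    for language, readmes in texts.items()
}
for language, vocab in unique_words.items():
    print(language, len(vocab))
```

Since the two languages have different numbers of READMEs (68 Python, 60 JavaScript), part of any gap in vocabulary size may simply reflect the amount of text available.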

# MODEL

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
monkey = df.copy()     # copy, so adding readme_len does not modify df
monkey['readme_len'] = monkey['text'].apply(len)
monkey

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

### 1.  K-Nearest Neighbors model

In [None]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(monkey.text)
y = monkey.language

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state = 123)

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_train)
y_pred_proba = knn.predict_proba(X_train)

In [None]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

In [None]:
print('Accuracy: {:.2%}'.format(knn.score(X_train, y_train)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(y_train, y_pred))
print('---')
print(classification_report(y_train, y_pred))

- **Precision:** of all the instances predicted as a class, the share that actually belong to it. A low score means many predicted positives were wrong (false positives).
    - tp / (tp + fp)


- **Recall:** of all the instances that actually belong to a class, the share the model found. A low score means many actual positives were missed (false negatives).
    - tp / (tp + fn)


- **f1-score:** the harmonic mean of precision and recall, giving both metrics equal weight. The higher the f1-score, the better.


- **Support:** the number of actual occurrences of each class in the true labels.
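
The formulas above can be worked through with a small example. The counts below are hypothetical and are not results from this notebook; `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores than a simple average would.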

In [None]:
print('Accuracy of KNN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

In [None]:
import matplotlib.pyplot as plt
k_range = range(1, 20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20])

### 2.  Decision Tree model

#### Split the data

In [None]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(monkey.text)
y = monkey.language

#### Train Model
- *Create the Decision Tree Object*
- *Fit the model to the training data*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state = 123)

clf = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=123)

clf.fit(X_train, y_train)

- *Estimate language*

In [None]:
y_pred = clf.predict(X_train)
y_pred[0:5]

- *Estimate the probability of each language*

In [None]:
y_pred_proba = clf.predict_proba(X_train)
y_pred_proba[0:5]

#### Evaluate Model
- *Compute the Accuracy*
- *Accuracy:  number of correct predictions over the number of total instances that have been evaluated.*

In [None]:
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))

#### Create a confusion matrix
- **True Positive:** number of occurrences where y is true and y is predicted true.
- **True Negative:** number of occurrences where y is false and y is predicted false.
- **False Positive:** number of occurrences where y is false and y is predicted true.
- **False Negative:** number of occurrences where y is true and y is predicted false.
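
For a two-class problem such as Python vs. JavaScript, these four counts depend on which class is treated as "positive". A minimal sketch with hypothetical labels (not predictions from this notebook) that also computes accuracy from the counts:

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only.
y_true = ['Python', 'Python', 'JavaScript', 'JavaScript', 'Python']
y_pred = ['Python', 'JavaScript', 'JavaScript', 'Python', 'Python']

tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive='Python')
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)
```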

In [None]:
confusion_matrix(y_train, y_pred)

In [None]:
y_train.value_counts()

In [None]:
import pandas as pd
labels = sorted(y_train.unique())

In [None]:
pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

In [None]:
print(classification_report(y_train, y_pred))

- **Precision:** of all the instances predicted as a class, the share that actually belong to it. A low score means many predicted positives were wrong (false positives).
    - tp / (tp + fp)


- **Recall:** of all the instances that actually belong to a class, the share the model found. A low score means many actual positives were missed (false negatives).
    - tp / (tp + fn)


- **f1-score:** the harmonic mean of precision and recall, giving both metrics equal weight. The higher the f1-score, the better.


- **Support:** the number of actual occurrences of each class in the true labels.

In [None]:
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

### 3.  Logistic Regression model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.language, random_state=123)

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit(X_train)
train_tfidf_values = tfidf.transform(X_train)

In [None]:
model = LogisticRegression()
model.fit(train_tfidf_values, y_train)

In [None]:
predictions = model.predict(train_tfidf_values)

In [None]:
train_results = pd.DataFrame(dict(actual=y_train, predicted=predictions))
pd.crosstab(train_results.predicted, train_results.actual)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(train_results.actual, train_results.predicted))

#### Now use the model on the test data

In [None]:
test_tfidf_values = tfidf.transform(X_test)
test_predictions = model.predict(test_tfidf_values)
# Score the held-out test predictions, not the training predictions
print('Accuracy: {:.2%}'.format(accuracy_score(y_test, test_predictions)))
print(classification_report(y_test, test_predictions))

- **Precision:** of all the instances predicted as a class, the share that actually belong to it. A low score means many predicted positives were wrong (false positives).
    - tp / (tp + fp)


- **Recall:** of all the instances that actually belong to a class, the share the model found. A low score means many actual positives were missed (false negatives).
    - tp / (tp + fn)


- **f1-score:** the harmonic mean of precision and recall, giving both metrics equal weight. The higher the f1-score, the better.


- **Support:** the number of actual occurrences of each class in the true labels.

In [None]:
def predict(unknown_text):
    return model.predict(tfidf.transform([unknown_text]))[0]

In [None]:
predict('run')
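
The model was trained on cleaned text, so a raw README would probably need the same preparation before being vectorized. The sketch below shows one way to wire that together. It is an illustration under assumptions: `make_predictor` is a hypothetical helper, and the stub classes only stand in for the fitted `tfidf` and `model` objects so that the example runs without scikit-learn.

```python
def make_predictor(clean_fn, vectorizer, model):
    """Return a function that cleans raw README text before predicting."""
    def predict_language(raw_text):
        cleaned = clean_fn(raw_text)
        features = vectorizer.transform([cleaned])
        return model.predict(features)[0]
    return predict_language


# Minimal stand-ins so the sketch runs on its own; in the notebook the
# fitted `tfidf` and `model` objects would be passed instead.
class StubVectorizer:
    def transform(self, texts):
        return [text.split() for text in texts]


class StubModel:
    def predict(self, rows):
        return ['JavaScript' if 'npm' in row else 'Python' for row in rows]


predict_language = make_predictor(str.lower, StubVectorizer(), StubModel())
print(predict_language('NPM run build'))
```

In the notebook, the cleaning function passed in would need to reproduce the steps used to build the `text` column (basic cleaning, lemmatizing, and stopword removal) for the features to line up with training.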