# Chemical engineers and Python
How many self-identified chemical engineers are on GitHub? What languages are used in their projects? Let's find out!

Jacob Albrecht 2019

In [28]:
#https://python.gotrained.com/search-github-api/
import os, time
import holoviews as hv
import hvplot.pandas  # registers the .hvplot accessor on DataFrames
import pandas as pd
from github import Github, GithubException

You need your own GitHub token; see: https://python.gotrained.com/search-github-api/

One way to use a token from your GitHub account is to save it as an OS environment variable, e.g.:
 
    os.environ['GITHUB_TOKEN'] =  '52cf6b57f601a2d081fc362dc4de6f41aa2c9ab0'
 
Python can then read the token from the environment later, whenever you need it to open a GitHub API connection:

In [29]:
g = Github(os.environ['GITHUB_TOKEN'])
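If the environment variable is missing, `os.environ['GITHUB_TOKEN']` raises a bare `KeyError`. A minimal sketch of a friendlier lookup follows; the helper name `read_github_token` is hypothetical and not part of the original notebook or of PyGithub.

```python
import os


def read_github_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in env_var, or raise a clear error if it is unset."""
    # hypothetical helper: strips stray whitespace and rejects empty values
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a GitHub personal access token."
        )
    return token
```

With this in place, the connection could be opened as `g = Github(read_github_token())`.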

First, search through GitHub users for keywords in their descriptions. Create a DataFrame with their public project info:

In [30]:
def search_user_repo(keywords):
    info = []
    users = []
    for keyword in keywords:
        users.extend([user.login for user in g.search_users(keyword, 'followers','desc')])
        time.sleep(61)  # to avoid API rate limits
    print(len(set(users)))  # number of unique users found
    for user in set(users):
        repos=g.get_user(user).get_repos()
        rate_limit = g.get_rate_limit().core
        for r in repos:
            while rate_limit.remaining < 500:
                time.sleep(1)
                rate_limit = g.get_rate_limit().core
            try:
                info.append({'Created':r.created_at,'User':user,
                             'Name':r.full_name,**r.get_languages()})
            except GithubException:
                print(r.full_name)  # print names of projects that can't be read
    return pd.DataFrame(info)

In [31]:
df = search_user_repo(['"chemical engineer"','"chemical engineering"'])

1034
njustcodingjs/openbilibili
njustcodingjs/openbilibili-go-common


In [32]:
df.shape

(8246, 204)

OK, 1034 users with 8246 readable projects. The 204 columns are 3 metadata columns (Created, User, Name) plus 201 languages. Next, plot the overall popularity of each language:
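The column count mixes metadata and language columns because each repository contributes a dictionary with its own set of language keys. A small sketch with made-up records (the users, repository names, and byte counts are invented for illustration) shows how pandas turns such dictionaries into a wide table with missing values:

```python
import pandas as pd

# Made-up records in the same shape that search_user_repo collects
records = [
    {"Created": pd.Timestamp("2016-03-01"), "User": "alice",
     "Name": "alice/reactor-sim", "Python": 12000, "Fortran": 3000},
    {"Created": pd.Timestamp("2018-07-15"), "User": "bob",
     "Name": "bob/thesis", "TeX": 45000},
]
toy = pd.DataFrame(records)

n_projects, n_columns = toy.shape
n_languages = n_columns - 3  # subtract the Created, User and Name columns
print(n_projects, n_columns, n_languages)
```

Languages a repository does not use show up as `NaN` in its row, which is what the later steps rely on to count projects per language.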

In [33]:
melted = df.melt(id_vars=['User','Created','Name'],var_name='Language',value_name='Bytes')
melted['ChE GitHub Projects'] = ~melted.Bytes.isna()
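# sum only the project-count column; summing the string and datetime columns fails in recent pandas
top_languages = melted.groupby('Language')[['ChE GitHub Projects']].sum().sort_values('ChE GitHub Projects', ascending=False).head(15)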
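To make the reshaping step concrete, here is a sketch on a three-project toy table. The data and the column name `Uses language` are invented for illustration; the pattern (melt to long form, flag non-missing byte counts, sum the flags per language) mirrors the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "User": ["alice", "bob", "carol"],
    "Created": pd.to_datetime(["2016-03-01", "2017-07-15", "2018-01-20"]),
    "Name": ["alice/a", "bob/b", "carol/c"],
    "Python": [1200.0, 400.0, 800.0],
    "MATLAB": [None, 5000.0, 300.0],
})

# one row per (project, language) pair
long = toy.melt(id_vars=["User", "Created", "Name"],
                var_name="Language", value_name="Bytes")
# a project "uses" a language when GitHub reported any bytes for it
long["Uses language"] = long["Bytes"].notna()
counts = long.groupby("Language")["Uses language"].sum().sort_values(ascending=False)
print(counts)
```

Counting non-missing entries ranks languages by the number of projects that contain them, not by the amount of code written in them.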

In [34]:
topplot = top_languages.hvplot('Language','ChE GitHub Projects',kind='bar').opts(xrotation=45)
topplot

In [35]:
# save the plot

hv.save(topplot,'Top_Languages.html')

Narrow down the list and plot a few languages over time:

In [36]:
languages = ['Python','Jupyter Notebook','Fortran','C','C++','Java','Julia','R','MATLAB','Perl','Visual Basic','HTML']

In [37]:
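# group by language and fractional creation year (year + month/12), summing only the project counts
top_trends = melted.loc[melted.Language.isin(languages), :].groupby(['Language', melted.Created.dt.year + melted.Created.dt.month/12])[['ChE GitHub Projects']].sum()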

In [38]:
# cumulative number of projects per language over time
piv = pd.pivot_table(top_trends.reset_index(), values='ChE GitHub Projects', index=['Created'], columns=['Language'], aggfunc='sum', fill_value=0).cumsum()

piv.hvplot(width=600,height=300,xlim=(2014,2020),ylim=(10,3000),logy=True)
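The time axis above is a fractional year, `year + month/12`, and the curves are running totals. A sketch on four made-up projects (the dates, languages, and the `Period` and `Projects` column names are invented for illustration) shows both steps:

```python
import pandas as pd

toy = pd.DataFrame({
    "Created": pd.to_datetime(["2015-01-10", "2015-01-25", "2015-12-03", "2016-06-30"]),
    "Language": ["Python", "Python", "R", "Python"],
    "Projects": [1, 1, 1, 1],
})

# fractional year: January 2015 -> 2015 + 1/12, December 2015 -> 2016.0
toy["Period"] = toy["Created"].dt.year + toy["Created"].dt.month / 12

# projects created per period and language, then accumulated over time
pivot = pd.pivot_table(toy, values="Projects", index="Period",
                       columns="Language", aggfunc="sum", fill_value=0).cumsum()
print(pivot)
```

With this encoding each point sits at the end of its month, so December lands exactly on the next whole year.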