# The Impact of Open Source on Environmental Sustainability 

This Jupyter Notebook guides you through the processing of the dataset on the open source ecosystem in sustainable technology created by [OpenSustain.tech](https://opensustain.tech/). For more information on how this dataset was created, please refer to the final report KK: ADD LINK. To make the study transparent and reproducible, all plots and scores within the study are created by this document. The notebook is intended to help interested readers draw their own conclusions from the dataset. Not all plots generated here are necessarily part of the report. To rerun this document in your browser, you can use the Binder integration (click on the Binder logo below).


[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/protontypes/AwesomeCure.git/HEAD)

## Data Acquisition

The creation of the raw data sets already requires several preprocessing steps. All projects are selected based on the [contribution guide](https://opensustain.tech/contributing/) of OpenSustain.tech. The contribution guide restricts the listed projects to those that hold an open source license, such as MIT or GNU GPLv3. Since git repositories are by far the most widely used tool for collaboration in open source projects, the vast majority of entries are hosted in a git repository.

The project dataset is entirely machine-generated based on data from OpenSustain.tech and data from the GitHub API. Other platforms are not yet supported. The data mining script `awesomecure.ipynb` can be found in the same repo.

To assemble the database, different methods were used over a period of two years to identify as many projects as possible within the topic area, such as:

* Proactive investigation on GitHub, GitLab, and Bitbucket.
* Crowdsourcing by 30+ contributors and many anonymous tips.
* Searching scientific publications based on domain-specific keywords and the term git.
* Tracking social media activities of organizations with projects in this area.

In [1]:
import sys
!{sys.executable} -m pip install handcalcs ipython

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [2]:
from IPython.display import display, HTML

import dateparser
import datetime
import handcalcs.render
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
import pycountry

In [3]:
### KK: add simple docstrings

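# Helper functions to clean up the dataset
def name_to_iso3(x):
    """Return the ISO 3166-1 alpha-3 code for a country name, or "" if none is found."""
    # fuzzy search does not like UK
    if x == "UK":
        x = "United Kingdom"
    try:
        iso3 = pycountry.countries.search_fuzzy(x)[0].alpha_3
    except Exception:
        # No match (e.g. "Global") or a non-string value such as NaN
        iso3 = ""
    return iso3


def upper_string(lower_string):
    """Return the string in title case, e.g. "non-profit" -> "Non-Profit"."""
    return lower_string.title()


def calc_age(x):
    """Return the time elapsed since the date string x, in years."""
    return (datetime.datetime.now() - dateparser.parse(x, settings={'TIMEZONE': 'CEST'})).days / 365


def count_strings(comma_separated_string):
    """Return the number of commas in the string, or 0 for non-string values."""
    if isinstance(comma_separated_string, str):
        return comma_separated_string.count(",")
    return 0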

In [4]:
# default plotting options
# Palette https://coolors.co/palette/0e7c7b-17bebb-ffc857-e9724c-c5283d
height = 800  # default figure height in pixels
color_continuous_scale = ["#0E7C7B", "#C5283D"]
marker_color = "#0E7C7B"
color_discrete_sequence = ["#0E7C7B", "#17BEBB", "#FFC857", "#E9724C", "#C5283D"]

# Register your theme as a named template
pio.templates["OpenSustain"] = go.layout.Template(
    layout=dict(
        font=dict(
            family="Google Font",
            color="#040404",
        ),
        title_font_family="Google Font",
        title_font_color="#040404",
        legend_title_font_color="#040404",
    ),
)

# Combine your theme with plotly's default
pio.templates.default = "plotly+OpenSustain"

In [5]:
# Configure the plotting backend and the notebook width

pd.options.plotting.backend = "plotly"
print(pio.renderers)
# pio.renderers.default = "browser" # Fill a complete new browser tab with every new plot
# pio.renderers.default = "notebook"
# pio.renderers.default = "plotly_mimetype+notebook"

display(HTML("<style>.container { width:80% !important; }</style>"))

Renderers configuration
-----------------------
    Default renderer: 'plotly_mimetype+notebook'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']



In [6]:
df_raw = pd.read_csv("./csv/projects.csv")
df_raw.head(5)

Unnamed: 0,project_name,oneliner,git_namespace,git_url,platform,topics,rubric,last_commit_date,stargazers_count,number_of_dependents,...,organization_name,organization_github_url,organization_website,organization_location,organization_country,organization_form,organization_avatar,organization_public_repos,organization_created,organization_last_update
0,pvlib-python,A set of documented functions for simulating t...,pvlib,https://github.com/pvlib/pvlib-python.git,github,"solar-energy,python,renewable-energy,renewable...",Photovoltaics and Solar Energy,"2022/08/19, 21:13:55",724.0,258.0,...,,https://github.com/pvlib,,,,,https://avatars.githubusercontent.com/u/110372...,,,
1,pvfactors,Open source view-factor model for diffuse shad...,SunPower,https://github.com/SunPower/pvfactors.git,github,"solar-energy,renewable-energy,python,bifacial",Photovoltaics and Solar Energy,"2022/02/22, 21:53:32",61.0,7.0,...,,https://github.com/SunPower,,,,,https://avatars.githubusercontent.com/u/134197...,,,
2,gsee,Global Solar Energy Estimator.,renewables-ninja,https://github.com/renewables-ninja/gsee.git,github,"solar,pandas,energy,irradiance,photovoltaic,pv...",Photovoltaics and Solar Energy,"2020/07/21, 06:28:35",88.0,0.0,...,,https://github.com/renewables-ninja,https://www.renewables.ninja/,,,,https://avatars.githubusercontent.com/u/118382...,,,
3,PVMismatch,An explicit Python PV system IV & PV curve tra...,SunPower,https://github.com/SunPower/PVMismatch.git,github,"numpy,scipy,python,solar,photovoltaic",Photovoltaics and Solar Energy,"2022/04/14, 19:15:36",51.0,0.0,...,,https://github.com/SunPower,,,,,https://avatars.githubusercontent.com/u/134197...,,,
4,rdtools,An open source library to support reproducible...,NREL,https://github.com/NREL/rdtools.git,github,,Photovoltaics and Solar Energy,"2022/07/01, 21:13:13",109.0,5.0,...,,https://github.com/NREL,http://www.nrel.gov,"Golden, CO",,,https://avatars.githubusercontent.com/u/190680...,,,


## Calculate Age in Years

In [7]:
## KK: I would suggest using a clearer object-naming convention. Below it becomes unclear what's the difference between df and df_raw
# Age plots are better in years
df_raw["project_age_in_years"] = df_raw["project_age_in_days"] / 365
max_age_in_years = 8.0

## Basic Statistics
First, let us get a rough overview of the project dataset.


In [71]:
fig = go.Figure(
    data=[
        go.Table(
            header=dict(values=["Dimension", "Value"],line_color=color_discrete_sequence[4],
                        fill_color=color_discrete_sequence[3]),
            cells=dict(
                        fill_color=color_discrete_sequence[2],
                        line_color='#000000',
                values=[
                    [
                        "Total number of projects",
                        "GitHub projects",
                        "GitLab projects",
                        "Other platforms",
                        "Number of projects in personal namespace",
                        "Total stars of all projects",
                        "Total contributors of all projects",
                        "Active GitHub projects",
                        "Inactive GitHub projects",
                        "Projects with contribution guide [%]",
                        "Projects with code of conduct [%]",
                        "Projects accepting donations [%]",
                        "Median number of commits",
                        "Median stargazers",
                        "Median stars last year",
                        "Median Development Distribution Score",
                        "Median number of contributors",
                        "Median closed issues last year",
                        "Median commits last year",
                        "Median age in years",
                    ],
                    [
                        df_raw["project_name"].count(),
                        df_raw["platform"].value_counts()["github"],
                        df_raw["platform"].value_counts()["gitlab"],
                        df_raw["platform"].value_counts()["custom"],
                        df_raw["project_name"].count() - df_raw["organization"].count(),
                        df_raw["stargazers_count"].sum(),
                        df_raw["contributors"].sum(),
                        df_raw["project_active"].value_counts()[True],
                        df_raw["project_active"].value_counts()[False],
                        round(df_raw["contribution_guide"].value_counts(normalize=True)[True]*100,2),
                        round(df_raw["code_of_conduct"].value_counts(normalize=True)[True]*100,2),
                        round(df_raw["accepts_donations"].value_counts(normalize=True)[True]*100,2),
                        df_raw["total_number_of_commits"].median(),
                        df_raw["stargazers_count"].median(),
                        df_raw["stars_last_year"].median(),
                        round(df_raw["development_distribution_score"].median(),4),
                        df_raw["contributors"].median(),
                        df_raw["issues_closed_last_year"].median(),
                        df_raw["total_commits_last_year"].median(),
                        round(df_raw["project_age_in_years"].median()),
                        
                    ],
                ]
            ),
        )
    ]
)



fig.update_layout(
    autosize=False,
)
fig.show()

## Development Distribution Score

The Development Distribution Score (DDS) measures how development is distributed among a project's contributors by relating the number of commits of the most active contributor to the total number of commits. A wide distribution of knowledge, work, and governance makes a project more sustainable: when people leave a project or no longer find time for it, others can continue the work and step into leading positions.

The DDS is calculated in the preprocessing script and is similar to the bus factor. It is based only on quantitative values derived from git statistics.
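
As a minimal sketch of the idea (not the preprocessing script itself), the score can be computed from per-contributor commit counts as one minus the commit share of the most active contributor. The helper name `development_distribution_score` and the sample commit counts are illustrative assumptions and do not come from the dataset.

```python
def development_distribution_score(commits_per_contributor):
    """Return 1 - (commits of the top contributor / total commits).

    Hypothetical helper for illustration; returns None when there are no commits.
    """
    total_commits = sum(commits_per_contributor)
    if total_commits == 0:
        return None
    return 1 - max(commits_per_contributor) / total_commits


# Illustrative commit counts, not taken from the dataset
single_maintainer = [2000]
shared_project = [500, 300, 200]

dds_single = development_distribution_score(single_maintainer)
dds_shared = development_distribution_score(shared_project)
print(dds_single, dds_shared)
```

A project carried by a single person scores 0, while a project whose top contributor wrote half of all commits scores 0.5.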



## Filter Data 

In [9]:
df_active = df_raw.copy()
# Filter out the inactive projects for further analysis
df_active = df_active[df_active["project_active"] == True]
# Curated lists are not classical open source projects and are excluded from the analysis
df_active = df_active[df_active["rubric"] != "Curated Lists"]
# Filter out the projects not hosted on GitHub
df_active = df_active[df_active["platform"] == "github"]

## Score Projects 

In [10]:
# Calculate the scores on activity, community and size
df_active["activity"] = (
    df_active["total_commits_last_year"].rank(pct=True)
    + df_active["issues_closed_last_year"].rank(pct=True)
    + df_active["days_until_last_issue_closed"].rank(pct=True)
    + df_active["last_released_date"].rank(pct=True, na_option="top")
)

df_active["community"] = (
    df_active["contributors"].rank(pct=True)
    + df_active["development_distribution_score"].rank(pct=True)
    + df_active["reviews_per_pr"].rank(pct=True)
)

df_active["size"] = (
    df_active["total_number_of_commits"].rank(pct=True)
    + df_active["contributors"].rank(pct=True)
    + df_active["closed_issues"].rank(pct=True)
    + df_active["closed_pullrequests"].rank(pct=True)
)

# All scores are weighted equally and normalized to one
df_active["total_score"] = (
    df_active["activity"] / df_active["activity"].max()
    + df_active["community"] / df_active["community"].max()
    + df_active["size"] / df_active["size"].max()
) / 3
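
To illustrate how percentile ranks combine into a normalized score, here is a minimal sketch on a toy table. The project names and numbers are made up, and only two metrics per score (and two scores in total) are used instead of the full set above.

```python
import pandas as pd

# Toy data for illustration only; not taken from the dataset
toy = pd.DataFrame(
    {
        "project_name": ["a", "b", "c", "d"],
        "total_commits_last_year": [10, 40, 20, 30],
        "issues_closed_last_year": [1, 4, 2, 3],
        "contributors": [2, 8, 4, 6],
        "total_number_of_commits": [100, 400, 200, 300],
    }
).set_index("project_name")

# rank(pct=True) maps each value to its percentile rank within the column
toy["activity"] = (
    toy["total_commits_last_year"].rank(pct=True)
    + toy["issues_closed_last_year"].rank(pct=True)
)
toy["size"] = (
    toy["total_number_of_commits"].rank(pct=True)
    + toy["contributors"].rank(pct=True)
)

# Normalize each score by its maximum and average the normalized scores
toy["total_score"] = (
    toy["activity"] / toy["activity"].max() + toy["size"] / toy["size"].max()
) / 2

print(toy[["activity", "size", "total_score"]])
```

Because every metric is converted to a rank first, a single extreme value (for example a project with a very large number of commits) cannot dominate the score; only the ordering of the projects matters.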

In [11]:
# Save the dataset with the scores
df_active_path = "./csv/project_analysis.csv"
df_active.to_csv(df_active_path)

In [12]:
# KK: re cell below: it would be helpful to add a comment explaining the logic behind these thresholds, even if their choice is arbitrary

In [13]:
%%render
## The calculation in this cell is meant to give the reader an understanding of how the DDS is calculated.
## Values calculated here are not used in any other cell.
n_MaxCommitsSingleContributor = 2000
n_total_commits = 2000


DDS = 1 - n_MaxCommitsSingleContributor / n_total_commits

<IPython.core.display.Latex object>

In [77]:
### KK: this is where a clear object naming convention + comments would really help: is syntax df[df_raw[..]] appropriate here? 
### KK: it might be helpful to plot boxplots for the below scores per category to better show their distribution, including median

df_personal_projects = df_active[df_active["organization"].isna()]
df_organization_projects = df_active[df_active["organization"].notna()]
df_inactive = df_raw[(df_raw["project_active"] == False)]
df_top_stargazers = df_active[(df_active["stargazers_count"] > 100)]

fig = go.Figure(
    data=[
        go.Table(
            header=dict(values=["Median DDS", "Value"],line_color=color_discrete_sequence[4],
                        fill_color=color_discrete_sequence[3]),
            cells=dict(
                        fill_color=color_discrete_sequence[2],
                        line_color='#000000',
                values=[
                    [
                        "All projects",
                        "Active projects in personal namespace",
                        "Active organization projects",
                        "Active projects",
                        "Inactive projects",
                        "Active projects with more than 100 stars",

                    ],
                    [
                        round(df_raw["development_distribution_score"].median(),2),
                        round(df_personal_projects["development_distribution_score"].median(),2),
                        round(df_organization_projects["development_distribution_score"].median(),2),
                        round(df_active["development_distribution_score"].median(),2),
                        round(df_inactive["development_distribution_score"].median(),2),
                        round(df_top_stargazers["development_distribution_score"].median(),2),
                    ],
                ]
            ),
        )
    ]
)

fig.update_layout(
    width=600,
)

fig.show()

In [15]:
df_active.iloc[300]

project_name                                                                   EVCC
oneliner                          An extensible EV Charge Controller with PV int...
git_namespace                                                                 andig
git_url                                         https://github.com/evcc-io/evcc.git
platform                                                                     github
topics                            mqtt,golang,pv,wallbox,emobility,charger,wallb...
rubric                                                  Mobility and Transportation
last_commit_date                                               2022/08/27, 15:28:48
stargazers_count                                                              744.0
number_of_dependents                                                           23.0
stars_last_year                                                               482.0
project_active                                                              

## Process Active GitHub Projects

In [16]:
# Read the scored dataset back from disk
df_active = pd.read_csv(df_active_path)

In [17]:
new_cols = ['total_score', 'activity', 'community', 'size']
[col in df_active for col in new_cols]

[True, True, True, True]

### Set Default Plotting Options 

## Start Plotting

In [18]:
fig = px.histogram(df_active, x="license", title="License Distribution")
fig.update_layout(
    yaxis_title="Projects",
    xaxis_title="License",
)
fig.update_traces(marker_color=marker_color)
fig.show()

In [19]:
fig = px.histogram(
    df_active,
    x="project_age_in_years",
    nbins=50,
    title="Project Age in Years",
)
fig.update_layout(
    yaxis_title="Projects",
    xaxis_title="Project Age",
)
fig.update_traces(marker_color=marker_color)
fig.show()

In [98]:
rubric_his = (
    df_active["rubric"]
    .value_counts()
    .to_frame()
    .rename_axis("rubric_names")
    .reset_index()
)
fig = px.pie(rubric_his, values="rubric", names="rubric_names", color_discrete_sequence=color_discrete_sequence, hole=0.2)

fig.update_layout(title="Distribution of Projects by Rubric", height=1000, showlegend=False)
fig.update_traces(textinfo='value+label', marker=dict(line=dict(color='#000000', width=2)))
fig.show()



In [21]:
fig = px.histogram(
    df_active, x="dominating_language", title="Distribution of Programming Languages"
).update_xaxes(categoryorder="total descending")
fig.update_layout(
    yaxis_title="Projects",
    xaxis_title="Dominating Language",
)
fig.update_traces(marker_color=marker_color)

fig.show()

In [22]:
# df_sorted = df.groupby(['rubric'], as_index=False)['dominating_language'].agg('sum')
df_language_distribution = (
    df_active.value_counts(["rubric", "dominating_language"]).to_frame().reset_index()
)

df_language_distribution.rename(columns={0: "counts"}, inplace=True)
fig = px.scatter(
    df_language_distribution, x="dominating_language", y="rubric", size="counts", 
)


fig.update_layout(
    height=1000,  # Added parameter
    xaxis_title="Dominating Language",
    yaxis_title="Rubric",
)
fig.update_traces(marker_color=marker_color)


fig.show()

In [23]:
# df_sorted = df.groupby(['rubric'], as_index=False)['dominating_language'].agg('sum')
df_license_distribution = (
    df_active.value_counts(["rubric", "license"]).to_frame().reset_index()
)

df_license_distribution.rename(columns={0: "counts"}, inplace=True)
fig = px.scatter(df_license_distribution, x="license", y="rubric", size="counts")


fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="License",
    yaxis_title="Rubric",
    title="License Distribution over Rubric",
)
fig.update_traces(marker_color=marker_color)


fig.show()

In [56]:
fig = px.histogram(
    df_active,
    x="contributors",
    nbins=100,
    title="Contributors",
)
fig.update_layout(
    yaxis_title="Projects",
    xaxis_title="Contributors",
)
fig.update_traces(marker_color=marker_color)
fig.show()

In [25]:
fig = px.bar(
    df_active["git_namespace"].value_counts(ascending=False),
    range_y=(-0.5, 40),
    orientation="h",
)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Projects",
    yaxis_title="Organization Namespace",
    title="Organizations with most listed projects",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [26]:
contributors = df_active.nlargest(40, "contributors")

fig = px.bar(
    x=contributors["contributors"],
    y=contributors["project_name"],
    orientation="h",
    title="Projects with most contributors",
    hover_name=contributors["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Contributors",
    yaxis_title="Project",
    title="Projects with the most contributors",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [27]:
top_stargazers = df_active.nlargest(40, "stargazers_count")

fig = px.bar(
    x=top_stargazers["stargazers_count"],
    y=top_stargazers["project_name"],
    orientation="h",
    hover_name=top_stargazers["git_url"]

)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Stars",
    yaxis_title="Project",
    title="Projects with the most Stars",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [28]:
df_top_100_stargazers = df_active[(df_active["stargazers_count"]) > 100].copy()
df_top_100_stargazers["star_growth"] = (
    df_top_100_stargazers["stars_last_year"] / df_top_100_stargazers["stargazers_count"]
)

df_top_40_star_growth = df_top_100_stargazers.nlargest(40, "star_growth")
fig = px.bar(
    x=df_top_40_star_growth["star_growth"] * 100,
    y=df_top_40_star_growth["project_name"],
    orientation="h",
)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Star Growth last Year [%]",
    yaxis_title="Project",
    title="Projects with the highest Star Growth",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [29]:
df_total_score = df_active.nlargest(40, "total_score")

fig = px.bar(
    x=df_total_score["total_score"],
    y=df_total_score["project_name"],
    orientation="h",
    range_x=(0.8, 1),
    hover_name=df_total_score["git_url"]

)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Total Score",
    yaxis_title="Project",
    title="Projects with the highest Total Normalized Score",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [30]:
df_activity_score = df_active.nlargest(40, "activity")

fig = px.bar(
    x=df_activity_score["activity"],
    y=df_activity_score["project_name"],
    orientation="h",
    range_x=(2.5, 3.3),
    hover_name=df_activity_score["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Activity Score",
    yaxis_title="Project",
    title="Projects with the highest Activity Score",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [31]:
df_size_score = df_active.nlargest(40, "size")

fig = px.bar(
    x=df_size_score["size"],
    y=df_size_score["project_name"],
    orientation="h",
    range_x=(3.6, 4),
    hover_name=df_size_score["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    xaxis_title="Size Score",
    yaxis_title="Project",
    title="Projects with the highest Size Score",
)
fig.update_traces(marker_color=marker_color)

fig.update(layout_showlegend=False)

In [32]:
df_active

Unnamed: 0.1,Unnamed: 0,project_name,oneliner,git_namespace,git_url,platform,topics,rubric,last_commit_date,stargazers_count,...,organization_form,organization_avatar,organization_public_repos,organization_created,organization_last_update,project_age_in_years,activity,community,size,total_score
0,0,pvlib-python,A set of documented functions for simulating t...,pvlib,https://github.com/pvlib/pvlib-python.git,github,"solar-energy,python,renewable-energy,renewable...",Photovoltaics and Solar Energy,"2022/08/19, 21:13:55",724.0,...,,https://avatars.githubusercontent.com/u/110372...,,,,7.531507,2.469792,2.407813,3.561458,0.829293
1,1,pvfactors,Open source view-factor model for diffuse shad...,SunPower,https://github.com/SunPower/pvfactors.git,github,"solar-energy,renewable-energy,python,bifacial",Photovoltaics and Solar Energy,"2022/02/22, 21:53:32",61.0,...,,https://avatars.githubusercontent.com/u/134197...,,,,4.293151,2.124479,1.627604,1.691667,0.548871
2,2,gsee,Global Solar Energy Estimator.,renewables-ninja,https://github.com/renewables-ninja/gsee.git,github,"solar,pandas,energy,irradiance,photovoltaic,pv...",Photovoltaics and Solar Energy,"2020/07/21, 06:28:35",88.0,...,,https://avatars.githubusercontent.com/u/118382...,,,,5.989041,1.070312,1.079167,0.656250,0.289276
3,3,PVMismatch,An explicit Python PV system IV & PV curve tra...,SunPower,https://github.com/SunPower/PVMismatch.git,github,"numpy,scipy,python,solar,photovoltaic",Photovoltaics and Solar Energy,"2022/04/14, 19:15:36",51.0,...,,https://avatars.githubusercontent.com/u/134197...,,,,9.600000,1.258333,1.832292,1.936458,0.500766
4,4,rdtools,An open source library to support reproducible...,NREL,https://github.com/NREL/rdtools.git,github,,Photovoltaics and Solar Energy,"2022/07/01, 21:13:13",109.0,...,,https://avatars.githubusercontent.com/u/190680...,,,,5.775342,1.910937,2.095833,2.664583,0.660213
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
955,1265,EarthDataLab.jl,Julia interface for Reading from the Earth Sys...,JuliaDataCubes,https://github.com/JuliaDataCubes/EarthDataLab...,github,,Data Catalogs and Interfaces,"2022/07/20, 09:02:40",26.0,...,,https://avatars.githubusercontent.com/u/102149...,,,,6.736986,2.451042,1.211458,2.622917,0.614426
956,1268,USGS,A python module for interfacing with the US Ge...,kapadia,https://github.com/kapadia/usgs.git,github,,Data Catalogs and Interfaces,"2022/06/01, 15:32:20",101.0,...,,,,,,7.693151,1.410417,1.432292,1.778125,0.458655
957,1269,getSpatialData,"Making it easy to query, preview, download and...",16EAGLE,https://github.com/16EAGLE/getSpatialData.git,github,"spatial-data,remote-sensing,sentinel,landsat,m...",Data Catalogs and Interfaces,"2022/06/02, 08:22:35",266.0,...,,,,,,4.789041,1.624479,1.364063,1.919271,0.485408
958,1270,sentinelsat,"Makes searching, downloading and retrieving th...",sentinelsat,https://github.com/sentinelsat/sentinelsat.git,github,"sentinel,copernicus,esa,remote-sensing,satelli...",Data Catalogs and Interfaces,"2022/08/01, 09:43:24",795.0,...,,https://avatars.githubusercontent.com/u/290575...,,,,7.271233,1.947917,2.540104,3.168229,0.756138


In [33]:
fig = px.scatter(
    df_active.query("project_age_in_years<@max_age_in_years"),
    x="project_age_in_days",
    y="rubric",
    size="size",
    color="development_distribution_score",
    hover_name="git_url",
    size_max=20,
    # list.reverse() returns None, so pass a reversed copy instead
    color_continuous_scale=color_continuous_scale[::-1],
)

fig.update_layout(
    coloraxis_colorbar=dict(
        title="DDS",
    ),
    height=800,  # Added parameter
    xaxis_title="Project Age in Days",
    yaxis_title="Rubric",
)


fig.show()

In [34]:
fig = px.scatter(
    df_active.query("project_age_in_years<@max_age_in_years"),
    x="project_age_in_years",
    y="rubric",
    size="size",
    color="total_score",
    hover_name="git_url",
    size_max=20,
    # list.reverse() returns None, so pass a reversed copy instead
    color_continuous_scale=color_continuous_scale[::-1],
)

fig.update_layout(
    coloraxis_colorbar=dict(title="Total Score"),
    height=800,  # Added parameter
    xaxis_title="Project Age in Years",
    yaxis_title="Rubric",
    title="Total Score of Projects",
)


fig.show()

In [35]:
rubric_his = (
    df_active["rubric"].value_counts().to_frame().rename_axis("rubric_name").reset_index()
)


fig = px.treemap(rubric_his, path=["rubric_name"], values="rubric", color="rubric")

fig.update_layout(coloraxis_showscale=False)
fig.update_layout(
    autosize=False,
    paper_bgcolor="lightgray",
    height=700,  # Added parameter
    width=2100,
    uniformtext=dict(minsize=15, mode="show"),
    margin=dict(t=0, l=0, r=0, b=0),
)
fig.show()

In [36]:
fig = px.scatter(
    df_organization_projects.query("project_age_in_years<@max_age_in_years"),
    x="project_age_in_years",
    y="rubric",
    size="size",
    color="development_distribution_score",
    hover_name="git_url",
    size_max=20,
    # list.reverse() returns None, so pass a reversed copy instead
    color_continuous_scale=color_continuous_scale[::-1],
)

fig.update_layout(
    coloraxis_colorbar=dict(
        title="DDS",
    ),
    yaxis_title="Rubric",
    xaxis_title="Project Age in Years",
    height=800,  # Added parameter
    title="Development Distribution Score",
)
fig.show()

In [37]:
personal_stargazers = df_personal_projects.nlargest(40, "stargazers_count")

fig = px.bar(
    x=personal_stargazers["stargazers_count"],
    y=personal_stargazers["git_namespace"],
    orientation="h",
    hover_name=personal_stargazers["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    yaxis_title="Namespace",
    xaxis_title="Stars",
    title="Projects with most Stars in User Namespace",
)
fig.update_traces(marker_color=marker_color)


fig.update(layout_showlegend=False)

In [38]:
oldest_projects = df_active.nlargest(40, "project_age_in_years")

fig = px.bar(
    x=oldest_projects["project_age_in_years"],
    y=oldest_projects["project_name"],
    orientation="h",
    hover_name=oldest_projects["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    yaxis_title="Project",
    xaxis_title="Project Age in Years",
    title="Oldest active Projects",
)
fig.update_traces(marker_color=marker_color)


fig.update(layout_showlegend=False)

In [39]:
df_active["dependents_count"] = df_active["dependents_repos"].apply(count_strings)
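
Note that `count_strings` counts commas, so a field listing n comma-separated repositories without a trailing separator yields n - 1. A minimal sketch of a variant that counts the listed entries themselves is shown below; the helper name `count_items` and the sample strings are hypothetical and not part of the notebook or the dataset.

```python
def count_items(comma_separated_string):
    """Count the non-empty entries of a comma-separated string (hypothetical helper)."""
    if not isinstance(comma_separated_string, str):
        return 0  # NaN and other non-string values count as zero entries
    return sum(1 for item in comma_separated_string.split(",") if item.strip())


# Made-up sample values for illustration
three_entries = count_items("org/repo-a,org/repo-b,org/repo-c")
one_entry = count_items("org/repo-a")
missing_value = count_items(float("nan"))
print(three_entries, one_entry, missing_value)
```

For ranking projects the difference is usually immaterial, since both variants preserve the ordering of projects with at least one listed dependent; it matters only when the absolute number is reported.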

In [40]:
most_dependent_projects = df_active.nlargest(40, "dependents_count")


fig = px.bar(
    x=most_dependent_projects["dependents_count"],
    y=most_dependent_projects["project_name"],
    orientation="h",
    hover_name=most_dependent_projects["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    yaxis_title="Project",
    xaxis_title="Dependents",
    title="Most used Python Projects",
)
fig.update_traces(marker_color=marker_color)


fig.update(layout_showlegend=False)

## Process the Organizations

In [41]:
df_organizations = pd.read_csv("./csv/github_organizations.csv")
df_organizations.head(50)

Unnamed: 0,organization_name,organization_user_name,organization_github_url,organization_website,location_city,location_country,form_of_organization,organization_avatar,organization_public_repos,organization_created,organization_last_update,rubric
0,,AgroCares,https://github.com/AgroCares,https://grasplan.nl/,,Netherlands,community,https://avatars.githubusercontent.com/u/316846...,6,2017-09-06 06:22,2021-11-16 11:18,
1,DSMR-reader,dsmrreader,https://github.com/dsmrreader,https://dsmr-reader.readthedocs.io,,Netherlands,community,https://avatars.githubusercontent.com/u/577273...,1,2019-11-13 19:08,2021-11-14 20:43,
2,STS Rosario,STS-Rosario,https://github.com/STS-Rosario,http://www.stsrosario.org.ar/index.html,,Argentina,community,https://avatars.githubusercontent.com/u/244938...,2,2016-12-10 14:07,2021-11-03 21:52,
3,Open Solar Project,opensolarproject,https://github.com/opensolarproject,,,Australia,for-profit,https://avatars.githubusercontent.com/u/539539...,2,2019-08-09 20:31,2021-11-14 16:10,
4,Open Food Foundation,openfoodfoundation,https://github.com/openfoodfoundation,https://www.openfoodnetwork.org/open-food-foun...,Melbourne,Australia,non-profit,https://avatars.githubusercontent.com/u/257898...,53,2012-10-17 07:53,2021-11-19 05:14,
5,IUCN Red List of Ecosystems Science Team,red-list-ecosystem,https://github.com/red-list-ecosystem,http://iucnrle.org/,Sydney,Australia,non-profit,https://avatars.githubusercontent.com/u/295593...,15,2017-06-20 01:31,2021-11-13 07:42,
6,openCEM,openCEMorg,https://github.com/openCEMorg,,,Australia,academia,https://avatars.githubusercontent.com/u/449910...,3,2018-11-13 03:40,2021-06-11 00:04,
7,Quantum Photovoltaics Research Group,qpv-research-group,https://github.com/qpv-research-group,https://www.qpvgroup.org,Sydney,Australia,academia,https://avatars.githubusercontent.com/u/485529...,3,2019-03-14 10:48,2021-11-03 05:32,
8,Badlands,badlands-model,https://github.com/badlands-model,www.earthcolab.org,Sydney,Australia,academia,https://avatars.githubusercontent.com/u/117274...,6,2015-03-30 21:38,2021-11-14 13:38,
9,Collaboration on Energy and Environmental Mark...,UNSW-CEEM,https://github.com/UNSW-CEEM,http://ceem.unsw.edu.au/,Sydney,Australia,academia,https://avatars.githubusercontent.com/u/335367...,18,2017-11-10 04:14,2021-11-12 03:50,


In [42]:
df_organizations["location_country"].value_counts().to_frame().rename_axis(
    "country"
).reset_index()

Unnamed: 0,country,location_country
0,USA,178
1,Global,175
2,Germany,69
3,France,31
4,UK,28
5,Netherlands,21
6,Australia,17
7,Canada,10
8,Switzerland,9
9,Belgium,7


In [43]:
df_organizations["ISO_3"] = df_organizations["location_country"].apply(name_to_iso3)

In [44]:
organization_his = (
    df_organizations["form_of_organization"]
    .value_counts()
    .to_frame()
    .rename_axis("organization")
    .reset_index()
)

organization_his["organization"] = organization_his["organization"].apply(upper_string)
print(organization_his)
fig = px.pie(organization_his, values="form_of_organization", names="organization", color_discrete_sequence=color_discrete_sequence, hole=0.2)

fig.update_layout(title="Distribution of Organizational Forms")
fig.update_traces(textposition='inside', textinfo='percent+label', marker=dict(line=dict(color='#000000', width=2)))
fig.show()

        organization  form_of_organization
0          Community                   159
1           Academia                   142
2  Government Agency                    99
3         For-Profit                    85
4         Non-Profit                    65
5      Collaboration                    58


In [45]:
df_countries = (
    df_organizations["ISO_3"]
    .value_counts()
    .rename_axis("country")
    .reset_index(name="counts")  # explicit column name, independent of the pandas version
)

fig = px.choropleth(
    df_countries,
    locations="country",
    locationmode="ISO-3",
    color="counts",
    # "counts" is numeric, so a continuous color scale applies here
    color_continuous_scale=color_continuous_scale,
)

fig.update_layout(
    title="Distribution of Organizational Locations Worldwide",
    coloraxis_colorbar=dict(
        title="Organizations",
    ),
)

fig.show()

In [46]:
df_public_repos = df_organizations.nlargest(40, "organization_public_repos")

df_public_repos

Unnamed: 0,organization_name,organization_user_name,organization_github_url,organization_website,location_city,location_country,form_of_organization,organization_avatar,organization_public_repos,organization_created,organization_last_update,rubric,ISO_3
298,Microsoft,microsoft,https://github.com/microsoft,https://opensource.microsoft.com,"Redmond, WA",USA,for-profit,https://avatars.githubusercontent.com/u/615472...,4485,2013-12-10 19:06,2021-11-20 00:29,,USA
307,International Business Machines,IBM,https://github.com/IBM,https://www.ibm.com/opensource/,"Armonk, NY",USA,for-profit,https://avatars.githubusercontent.com/u/145911...,2278,2012-02-21 22:13,2021-11-19 23:15,,USA
321,The Apache Software Foundation,apache,https://github.com/apache,https://www.apache.org/,,USA,non-profit,https://avatars.githubusercontent.com/u/47359?v=4,2275,2009-01-17 20:14,2021-11-20 00:34,,USA
296,Google,google,https://github.com/google,https://opensource.google/,"Mountain View, CA",USA,for-profit,https://avatars.githubusercontent.com/u/134200...,2139,2012-01-18 01:30,2021-11-19 22:27,,USA
292,Microsoft Azure,Azure,https://github.com/Azure,https://docs.microsoft.com/en-us/azure/,"Redmond, WA",USA,for-profit,https://avatars.githubusercontent.com/u/684449...,1618,2014-03-03 22:17,2021-11-19 23:35,,USA
39,Province of British Columbia,bcgov,https://github.com/bcgov,https://github.com/bcgov/BC-Policy-Framework-F...,British Columbia,Canada,government agency,https://avatars.githubusercontent.com/u/916280...,1119,2011-07-14 22:16,2021-11-19 23:42,,CAN
305,Google Cloud Platform,GoogleCloudPlatform,https://github.com/GoogleCloudPlatform,https://cloud.google.com,,USA,for-profit,https://avatars.githubusercontent.com/u/281094...,946,2012-11-16 04:52,2021-11-19 23:24,,USA
443,U.S. General Services Administration,GSA,https://github.com/GSA,https://open.gsa.gov,"1800 F Street NW, Washington DC 20405",USA,government agency,https://avatars.githubusercontent.com/u/643070...,744,2011-02-28 17:52,2021-12-23 16:02,,USA
389,National Center for Atmospheric Research,NCAR,https://github.com/NCAR,http://ncar.ucar.edu,"Boulder, CO",USA,government agency,https://avatars.githubusercontent.com/u/200754...,727,2012-07-19 20:37,2021-11-20 00:05,,USA
51,European Environment Agency,eea,https://github.com/eea,http://www.eea.europa.eu,"Kongens Nytorv 6, 1050, Copenhagen K, Denmark",Denmark,government agency,https://avatars.githubusercontent.com/u/117662...,637,2011-11-06 22:48,2022-02-12 10:29,,DNK


In [47]:
### KK: move all the library installs and imports to the top of the script to highlight all the dependencies 

df_organizations["organizations_age_in_years"] = df_organizations[
    "organization_created"
].apply(calc_age)

In [48]:
fig = px.scatter(
    df_organizations.query("organizations_age_in_years<@max_age_in_years"),
    x="organizations_age_in_years",
    y="location_country",
    size="organization_public_repos",
    color="form_of_organization",
    hover_name="organization_website",
    size_max=20,
    # form_of_organization is categorical, so a discrete palette applies
    color_discrete_sequence=color_discrete_sequence,
)

fig.update_layout(
    legend_title_text="Form of Organization",
    yaxis_title="Country",
    xaxis_title="Organization Age in Years",
    height=800,  # Added parameter
    title="Organizational Forms within Different Countries",
)
fig.show()

In [55]:
personal_stargazers = df_personal_projects.nlargest(40, "stargazers_count")

fig = px.bar(
    x=personal_stargazers["stargazers_count"],
    y=personal_stargazers["git_namespace"],
    orientation="h",
    hover_name=personal_stargazers["git_url"]
)

fig.update_layout(
    height=800,  # Added parameter
    yaxis_title="Git Namespace",
    xaxis_title="Stars",
    title="Projects with the Most Stars in User Namespaces",
)
fig.update_traces(marker_color=marker_color)


fig.update(layout_showlegend=False)

## Projects Not Included
The first version of this study could not integrate the GitLab API. Projects on self-hosted repositories and other collaborative websites could not be included either. Inactive projects form another group that was left out of the study. The cells below give some insight into these projects.
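
Before listing the individual projects, it can help to see how many entries fall into each excluded group. The following is a minimal sketch on a small toy frame, not the real `df_raw`. It assumes that GitHub-hosted projects carry the value `"github"` in the `platform` column and that non-GitHub rows have no `project_active` value; `df_example` and its two `example-*` rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_raw (hypothetical rows; the real notebook loads the full dataset).
df_example = pd.DataFrame(
    {
        "project_name": ["emobpy", "mosaik", "QBlade", "PyWake", "example-active", "example-stale"],
        "platform": ["gitlab", "gitlab", "custom", "custom", "github", "github"],
        # Assumption: rows from unsupported platforms carry no activity flag.
        "project_active": [None, None, None, None, True, False],
    }
)

# Projects hosted outside GitHub, counted per platform.
not_on_github = df_example[df_example["platform"] != "github"]
platform_counts = not_on_github["platform"].value_counts()

# Projects explicitly flagged as inactive (missing flags do not count).
inactive_count = int((df_example["project_active"] == False).sum())

print(platform_counts)
print("inactive projects:", inactive_count)
```

Running the same three expressions against `df_raw` would give the size of each group in the actual dataset.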

In [50]:
df_raw[(df_raw["platform"] == "gitlab")]

Unnamed: 0,project_name,oneliner,git_namespace,git_url,platform,topics,rubric,last_commit_date,stargazers_count,number_of_dependents,...,organization_github_url,organization_website,organization_location,organization_country,organization_form,organization_avatar,organization_public_repos,organization_created,organization_last_update,project_age_in_years
136,emobpy,An open tool for creating battery-electric veh...,diw-evu/emobpy,https://gitlab.com/diw-evu/emobpy/emobpy,gitlab,,Battery,,,,...,,,,,,,,,,
190,dieter_py,An open source power sector optimization model...,diw-evu/dieter_public,https://gitlab.com/diw-evu/dieter_public/dieterpy,gitlab,,Energy Modeling and Optimization,,,,...,,,,,,,,,,
279,pyehub,"A Python-based, modular and nestable implement...",energyincities,https://gitlab.com/energyincities/python-ehub,gitlab,,Energy Distribution and Grids,,,,...,,,,,,,,,,
286,mosaik,A flexible Smart Grid co-simulation framework.,mosaik,https://gitlab.com/mosaik/mosaik,gitlab,,Energy Distribution and Grids,,,,...,,,,,,,,,,
287,SmartGridToolbox,Designed to provide an extensible and flexible...,SmartGridToolbox,https://gitlab.com/SmartGridToolbox/SmartGridT...,gitlab,,Energy Distribution and Grids,,,,...,,,,,,,,,,
327,KoaVTracker,Energy targets in the coalition agreement of t...,diw-evu,https://gitlab.com/diw-evu/koavtracker,gitlab,,Datasets on Energy Systems,,,,...,,,,,,,,,,
360,Energy Signature Analyser,A toolbox to analyze energy signatures of buil...,energyincities,https://gitlab.com/energyincities/energy-signa...,gitlab,,Buildings and Heating,,,,...,,,,,,,,,,
368,BESOS,A collection of modules for the simulation and...,energyincities,https://gitlab.com/energyincities/besos,gitlab,,Buildings and Heating,,,,...,,,,,,,,,,
385,Macquette,"A whole house energy assessment tool, which mo...",retrofitcoop,https://gitlab.com/retrofitcoop/macquette,gitlab,,Buildings and Heating,,,,...,,,,,,,,,,
430,sustainable-mobility-api,Consists of a Python library and HTTP API for ...,mshepherd,https://gitlab.com/mshepherd/sustainable-mobil...,gitlab,,Mobility and Transportation,,,,...,,,,,,,,,,


In [51]:
df_raw[(df_raw["platform"] == "custom")]

Unnamed: 0,project_name,oneliner,git_namespace,git_url,platform,topics,rubric,last_commit_date,stargazers_count,number_of_dependents,...,organization_github_url,organization_website,organization_location,organization_country,organization_form,organization_avatar,organization_public_repos,organization_created,organization_last_update,project_age_in_years
59,QBlade,Provides a hands-on design and simulation capa...,,,custom,,Wind Energy,,,,...,,,,,,,,,,
64,PyWake,An AEP calculator for wind farms implemented i...,TOPFARM,,custom,,Wind Energy,,,,...,,,,,,,,,,
69,TopFarm2,A Python package developed by DTU Wind Energy ...,TOPFARM,,custom,,Wind Energy,,,,...,,,,,,,,,,
70,BasicDTUController,The scope of this project is to provide an ope...,OpenLAC,,custom,,Wind Energy,,,,...,,,,,,,,,,
71,WindEnergyToolbox,A collection of Python scripts that facilitate...,toolbox,,custom,,Wind Energy,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1257,Global Energy Monitor,Studies the evolving international energy land...,,,custom,,Data Catalogs and Interfaces,,,,...,,,,,,,,,,
1260,Open Data Science Europe Metadata Catalog,"Building the Open Data Science Europe Portal, ...",,,custom,,Data Catalogs and Interfaces,,,,...,,,,,,,,,,
1266,The CEDA Archive,We host over 18 Petabytes of data from climate...,,,custom,,Data Catalogs and Interfaces,,,,...,,,,,,,,,,
1267,Climate Data Dashboard of the ESA Climate Chan...,Access global climate data produced through th...,en,,custom,,Data Catalogs and Interfaces,,,,...,,,,,,,,,,


In [52]:
df_inactive = df_raw[(df_raw["project_active"] == False)].copy()

# Age plots are easier to read in years
df_inactive["project_age_in_years"] = df_inactive["project_age_in_days"] / 365

fig = px.scatter(
    df_inactive,
    x="project_age_in_years",
    y="rubric",
    size="contributors",
    color="development_distribution_score",
    hover_name="git_url",
    size_max=20,
    # list.reverse() returns None; slicing yields a reversed copy instead
    color_continuous_scale=color_continuous_scale[::-1],
)

fig.update_layout(
    coloraxis_colorbar=dict(
        title="DDS",
    ),
    paper_bgcolor="lightgray",
    height=800,  # Added parameter
    yaxis_title="Rubric",
    xaxis_title="Project Age in Years",
    title="Development Distribution Score within Inactive Projects",
)

fig.show()

In [53]:
### KK questions and general comments

# Not sure how this data will be displayed in the final report, but as a general rule I would advise against showing all the data: it is unreadable and makes it difficult to draw any conclusions. Instead, I would pick a handful of representative use cases and highlight the important patterns.
# Consider using more creative alternatives to bar charts where relevant (e.g. https://towardsdatascience.com/anything-but-bars-the-10-best-alternatives-to-bar-graphs-fecb2aaee53a).
# Before jumping into any analysis, I would think about who the audience is and what questions we want to answer. This could be conveyed in the intro to this notebook.


# Is there any external (open) data we could use to enrich the analysis?
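
One possible way to follow the advice against showing all the data is to keep only the largest categories and fold the long tail into a single bucket. The sketch below is an illustration, not part of the study's pipeline: `top_n_with_other` is a hypothetical helper, and the input uses only the ten country counts displayed earlier in this notebook.

```python
import pandas as pd


def top_n_with_other(counts: pd.Series, n: int = 5, other_label: str = "Other") -> pd.Series:
    """Keep the n largest entries and sum the remaining ones into one bucket."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]
    rest = ordered.iloc[n:].sum()
    if rest > 0:
        # Append the aggregated tail as a single extra row.
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top


# The ten country counts shown in the output above (not the full table).
country_counts = pd.Series(
    {
        "USA": 178, "Global": 175, "Germany": 69, "France": 31, "UK": 28,
        "Netherlands": 21, "Australia": 17, "Canada": 10, "Switzerland": 9, "Belgium": 7,
    }
)

condensed = top_n_with_other(country_counts, n=5)
print(condensed)
```

The condensed series can be passed to the same plotting calls used above, which keeps pie and bar charts readable when a column has many rare values.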