
Test traffic Merge Into Dev: Request for Feedback #2220

Merged: 33 commits merged into dev on Mar 19, 2023. (The diff below shows changes from 4 commits.)

Commits (33):
34817a6
feat: Added task to collect repository traffic (#2098)
meetagrawal09 Jan 21, 2023
a7c1dce
Merge branch 'dev' into test-traffic
sgoggins Mar 7, 2023
a78aab4
Making updates needed for database and readme alignment.
sgoggins Mar 7, 2023
f48a539
Merge pull request #2242 from chaoss/dev
sgoggins Mar 18, 2023
3d38b66
Updating Schema versions
sgoggins Mar 18, 2023
e00efb2
Removed update to releases table, since that's already done.
sgoggins Mar 18, 2023
6479ff2
sequence update
sgoggins Mar 18, 2023
70009c2
alembic tweaking for traffic.
sgoggins Mar 18, 2023
db54fea
alembic schema syntax wars.
sgoggins Mar 18, 2023
1f4a4ed
alembic
sgoggins Mar 18, 2023
3be2781
alembic III
sgoggins Mar 18, 2023
7341b20
Revert sequence stuff temporarily.
sgoggins Mar 19, 2023
1a35096
Trying just declaring the data type to be Postgresql serial datatype
sgoggins Mar 19, 2023
4076c70
Merge remote-tracking branch 'origin/dev' into test-traffic
sgoggins Mar 19, 2023
3deecc4
serial .. no.
sgoggins Mar 19, 2023
f6740a8
more sequence syntax
sgoggins Mar 19, 2023
3a34013
meta/
sgoggins Mar 19, 2023
e581410
schema
sgoggins Mar 19, 2023
a90bf5e
Possibly getting sequence logic worked out. Possibly grinding gears.
sgoggins Mar 19, 2023
64df9de
Grinding gears.
sgoggins Mar 19, 2023
9c40282
more grinding of the gears on sequence creation with Alembic.
sgoggins Mar 19, 2023
0dc4fab
Sequence creation circles.
sgoggins Mar 19, 2023
2325b80
sequence wrestling
sgoggins Mar 19, 2023
62622bd
close to sequence building?
sgoggins Mar 19, 2023
5ac8132
silliness of sequences
sgoggins Mar 19, 2023
fac0134
Sequence works? (Fingers crossed). Removing foreign key dropping and …
sgoggins Mar 19, 2023
a549bcf
I think we got it!
sgoggins Mar 19, 2023
47c8e3a
Merge pull request #2246 from chaoss/dev
sgoggins Mar 19, 2023
4fd183f
Updated outdated job flow logic.
sgoggins Mar 19, 2023
3b73118
Merge remote-tracking branch 'origin/test-traffic' into test-traffic
sgoggins Mar 19, 2023
440e97e
consistency update
sgoggins Mar 19, 2023
46c0908
I think typo fixing.
sgoggins Mar 19, 2023
8771742
Not sure why the comma at the end...
sgoggins Mar 19, 2023
1 change: 0 additions & 1 deletion README.md
@@ -20,7 +20,6 @@ Augur is now releasing a dramatically improved new version to the main branch. I
- The next release of the new version will include a hosted version of Augur where anyone can create an account and add repos “they care about”. If the hosted instance already has a requested organization or repository, it will be added to a user’s view. If it's a new repository or organization, the user will be notified that collection will take (time required for the scale of repositories added).

## What is Augur?

Augur is a software suite for collecting and measuring structured data
about [free](https://www.fsf.org/about/) and [open-source](https://opensource.org/docs/osd) software (FOSS) communities.

14 changes: 14 additions & 0 deletions augur/application/db/data_parse.py
@@ -466,8 +466,22 @@ def extract_needed_contributor_data(contributor, tool_source, tool_version, data

return contributor

def extract_needed_clone_history_data(clone_history_data:List[dict], repo_id:int):

Review comment (Member, PR author): @IsaacMilarky / @ABrain7710 : What needs fixing here?

Review reply (Contributor): This function LGTM unless it's throwing errors.

    if len(clone_history_data) == 0:
        return []

    clone_data_dicts = []
    for clone in clone_history_data:
        clone_data_dict = {
            'repo_id': repo_id,
            'clone_data_timestamp': clone['timestamp'],
            'count_clones': clone['count'],
            'unique_clones': clone['uniques']
        }
        clone_data_dicts.append(clone_data_dict)

    return clone_data_dicts

def extract_needed_pr_review_data(review, pull_request_id, repo_id, platform_id, tool_version, data_source):

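For context, GitHub's `/repos/{owner}/{repo}/traffic/clones` endpoint returns a JSON payload with a `clones` array of per-day entries, which is what `extract_needed_clone_history_data` flattens. A runnable sketch of that flattening (the sample payload values are made up, and the helper is re-declared here so the sketch is self-contained):

```python
# Illustrative sample of the GitHub /traffic/clones payload (values invented);
# the parser maps each entry in "clones" to a row dict keyed for repo_clones_data.
sample_payload = {
    "count": 22,
    "uniques": 16,
    "clones": [
        {"timestamp": "2023-03-12T00:00:00Z", "count": 10, "uniques": 7},
        {"timestamp": "2023-03-13T00:00:00Z", "count": 12, "uniques": 9},
    ],
}

def extract_needed_clone_history_data(clone_history_data, repo_id):
    # Same flattening as the diff above, reproduced so this runs standalone.
    return [
        {
            "repo_id": repo_id,
            "clone_data_timestamp": clone["timestamp"],
            "count_clones": clone["count"],
            "unique_clones": clone["uniques"],
        }
        for clone in clone_history_data
    ]

rows = extract_needed_clone_history_data(sample_payload["clones"], repo_id=42)
print(rows[0]["count_clones"])  # 10
```

Note the endpoint only covers the trailing two weeks of traffic, which is why the task stores a historical record on each run.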
1 change: 1 addition & 0 deletions augur/application/db/models/__init__.py
@@ -63,6 +63,7 @@
PullRequestTeam,
PullRequestRepo,
PullRequestReviewMessageRef,
RepoClone,
sgoggins marked this conversation as resolved.
)

from augur.application.db.models.spdx import (
28 changes: 28 additions & 0 deletions augur/application/db/models/augur_data.py
@@ -3348,3 +3348,31 @@ class PullRequestReviewMessageRef(Base):
msg = relationship("Message")
pr_review = relationship("PullRequestReview")
repo = relationship("Repo")


class RepoClone(Base):

Review comment (Member, PR author): @ABrain7710 / @IsaacMilarky : Is this the right way?

Review reply (Contributor): There should be a unique constraint on the repo_id if you plan on using Postgres 'on conflict' inserts.

Review reply (Member, PR author): I did this, and then I realized this is like "releases", or repo_info... we want to hold the historical record for the repos. There should not be any conflicts, since the primary key is an autoincrement. @IsaacMilarky

    __tablename__ = "repo_clones_data"
    __table_args__ = {"schema": "augur_data"}

    repo_clone_data_id = Column(
        BigInteger,
        primary_key=True,
        server_default=text(
            "nextval('augur_data.repo_clones_data_id_seq'::regclass)"
        ),
    )
    repo_id = Column(
        ForeignKey(
            "augur_data.repo.repo_id",
            ondelete="RESTRICT",
            onupdate="CASCADE",
            deferrable=True,
            initially="DEFERRED",
        ),
        nullable=False,
    )
    unique_clones = Column(BigInteger)
    count_clones = Column(BigInteger)
    clone_data_timestamp = Column(TIMESTAMP(precision=6))

    repo = relationship("Repo")
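On the reviewer's 'on conflict' point: a Postgres-style upsert needs a unique constraint covering the conflict target. The stdlib sqlite3 module accepts the same `ON CONFLICT ... DO NOTHING` syntax, so this toy table can sketch the behavior (the columns mirror `repo_clones_data`, but the composite unique key is illustrative only, not what the merged PR ships):

```python
import sqlite3

# Toy stand-in for repo_clones_data; the UNIQUE constraint is the conflict
# target the reviewer describes (the merged PR keeps history rows instead).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE repo_clones_data (
        repo_clone_data_id INTEGER PRIMARY KEY AUTOINCREMENT,
        repo_id INTEGER NOT NULL,
        clone_data_timestamp TEXT,
        count_clones INTEGER,
        UNIQUE (repo_id, clone_data_timestamp)
    )
""")
for _ in range(2):  # the second insert hits the constraint and is skipped
    conn.execute(
        "INSERT INTO repo_clones_data (repo_id, clone_data_timestamp, count_clones) "
        "VALUES (?, ?, ?) ON CONFLICT (repo_id, clone_data_timestamp) DO NOTHING",
        (1, "2023-03-19", 10),
    )
row_count = conn.execute("SELECT count(*) FROM repo_clones_data").fetchone()[0]
print(row_count)  # 1
```

Without the `UNIQUE (...)` clause, the `ON CONFLICT` target matches no constraint and the statement is rejected, which is the reviewer's point; the author instead relies on the autoincrement primary key and keeps every historical row.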
63 changes: 63 additions & 0 deletions augur/application/schema/alembic/versions/12_traffic_additions.py
sgoggins marked this conversation as resolved.
@@ -0,0 +1,63 @@
"""traffic additions

Revision ID: 3
Revises: 2
Create Date: 2022-12-30 19:23:17.997570

"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

# revision identifiers, used by Alembic.
revision = '3'
down_revision = '2'
branch_labels = None
depends_on = None


def upgrade():

    add_repo_clone_data_table_1()

def downgrade():

    add_repo_clone_data_table_1(upgrade=False)


def add_repo_clone_data_table_1(upgrade=True):

    if upgrade:

        op.create_table('repo_clones_data',
            sa.Column('repo_clone_data_id', sa.BigInteger(), server_default=sa.text("nextval('augur_data.repo_clones_data_id_seq'::regclass)"), nullable=False),
            sa.Column('repo_id', sa.BigInteger(), nullable=False),
            sa.Column('unique_clones', sa.BigInteger(), nullable=True),
            sa.Column('count_clones', sa.BigInteger(), nullable=True),
            sa.Column('clone_data_timestamp', postgresql.TIMESTAMP(precision=6), nullable=True),
            sa.ForeignKeyConstraint(['repo_id'], ['augur_data.repo.repo_id'], onupdate='CASCADE', ondelete='RESTRICT', initially='DEFERRED', deferrable=True),
            sa.PrimaryKeyConstraint('repo_clone_data_id'),
            schema='augur_data'
        )
        op.alter_column('releases', 'release_id',
sgoggins marked this conversation as resolved.
            existing_type=sa.CHAR(length=256),
            type_=sa.CHAR(length=128),
            existing_nullable=False,
            existing_server_default=sa.text('nextval(\'"augur_data".releases_release_id_seq\'::regclass)'),
            schema='augur_data')
        op.drop_constraint('user_repos_repo_id_fkey', 'user_repos', schema='augur_operations', type_='foreignkey')
        op.create_foreign_key(None, 'user_repos', 'repo', ['repo_id'], ['repo_id'], source_schema='augur_operations', referent_schema='augur_data')

    else:

        op.drop_constraint(None, 'user_repos', schema='augur_operations', type_='foreignkey')
        op.create_foreign_key('user_repos_repo_id_fkey', 'user_repos', 'repo', ['repo_id'], ['repo_id'], source_schema='augur_operations')
        op.alter_column('releases', 'release_id',
            existing_type=sa.CHAR(length=128),
            type_=sa.CHAR(length=256),
            existing_nullable=False,
            existing_server_default=sa.text('nextval(\'"augur_data".releases_release_id_seq\'::regclass)'),
            schema='augur_data')
        op.drop_table('repo_clones_data', schema='augur_data')
29 changes: 29 additions & 0 deletions augur/application/schema/augur_full.sql
@@ -2777,6 +2777,35 @@ CREATE TABLE augur_data.working_commits (

ALTER TABLE augur_data.working_commits OWNER TO augur;

Review comment (Member, PR author): @ABrain7710 / @IsaacMilarky : Is this the right way to do this?

Review reply (Contributor): The proper way to do this is with Alembic, which you did already. I would not do it this way.

--
-- Name: repo_clones_data_id_seq; Type: SEQUENCE; Schema: augur_data; Owner: augur
--

CREATE SEQUENCE augur_data.repo_clones_data_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;


ALTER TABLE augur_data.repo_clones_data_id_seq OWNER TO augur;

--
-- Name: repo_clones_data; Type: TABLE; Schema: augur_data; Owner: augur
--

CREATE TABLE augur_data.repo_clones_data (
    repo_clone_data_id bigint DEFAULT nextval('augur_data.repo_clones_data_id_seq'::regclass) NOT NULL,
    repo_id bigint NOT NULL,
    unique_clones bigint,
    count_clones bigint,
    clone_data_timestamp timestamp(6) without time zone
);


ALTER TABLE augur_data.repo_clones_data OWNER TO augur;

--
-- Name: affiliations_corp_id_seq; Type: SEQUENCE; Schema: augur_operations; Owner: augur
--
Empty file.
Empty file.
76 changes: 76 additions & 0 deletions augur/tasks/github/traffic/tasks.py
@@ -0,0 +1,76 @@
import time
import logging

from augur.tasks.init.celery_app import celery_app as celery, engine
from augur.application.db.data_parse import *
from augur.tasks.github.util.github_paginator import GithubPaginator
from augur.tasks.github.util.github_task_session import GithubTaskSession
from augur.tasks.util.worker_util import remove_duplicate_dicts
from augur.tasks.github.util.util import get_owner_repo
from augur.application.db.models import RepoClone, Repo
from augur.application.db.util import execute_session_query

@celery.task
def collect_github_repo_clones_data(repo_git: str) -> None:

    logger = logging.getLogger(collect_github_repo_clones_data.__name__)

    # using GithubTaskSession to get our repo_obj for which we will store data of clones
    with GithubTaskSession(logger) as session:

        query = session.query(Repo).filter(Repo.repo_git == repo_git)
        repo_obj = execute_session_query(query, 'one')
        repo_id = repo_obj.repo_id

    owner, repo = get_owner_repo(repo_git)

    logger.info(f"Collecting Github repository clone data for {owner}/{repo}")

    clones_data = retrieve_all_clones_data(repo_git, logger)

    if clones_data:
        process_clones_data(clones_data, f"{owner}/{repo}: Traffic task", repo_id, logger)
    else:
        logger.info(f"{owner}/{repo} has no clones")


def retrieve_all_clones_data(repo_git: str, logger):
    owner, repo = get_owner_repo(repo_git)

    url = f"https://api.github.com/repos/{owner}/{repo}/traffic/clones"

    # define GithubTaskSession to handle insertions, and store oauth keys
    with GithubTaskSession(logger, engine) as session:

        clones = GithubPaginator(url, session.oauths, logger)

        num_pages = clones.get_num_pages()
        all_data = []
        for page_data, page in clones.iter_pages():

            if page_data is None:
                return all_data

            elif len(page_data) == 0:
                logger.debug(f"{repo.capitalize()} Traffic Page {page} contains no data...returning")
                logger.info(f"Traffic Page {page} of {num_pages}")
                return all_data

            logger.info(f"{repo} Traffic Page {page} of {num_pages}")

            all_data += page_data

        return all_data


def process_clones_data(clones_data, task_name, repo_id, logger) -> None:
    clone_history_data = clones_data[0]['clones']

    clone_history_data_dicts = extract_needed_clone_history_data(clone_history_data, repo_id)

    with GithubTaskSession(logger, engine) as session:

        # assign the deduplicated list back so the deduplicated rows are what get inserted
        clone_history_data_dicts = remove_duplicate_dicts(clone_history_data_dicts, 'clone_data_timestamp')
        logger.info(f"{task_name}: Inserting {len(clone_history_data_dicts)} clone history records")

        session.insert_data(clone_history_data_dicts, RepoClone, ['repo_id'])
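`remove_duplicate_dicts` comes from `augur.tasks.util.worker_util`; its exact behavior isn't shown in this diff, but a minimal stand-in that keeps the first dict per key value illustrates why the deduplicated list is what should reach `insert_data`:

```python
# Hypothetical stand-in for augur.tasks.util.worker_util.remove_duplicate_dicts;
# the real helper may differ (e.g. which duplicate it keeps).
def remove_duplicate_dicts(dicts, key):
    seen, unique = set(), []
    for d in dicts:
        if d[key] not in seen:       # keep only the first dict seen for each key value
            seen.add(d[key])
            unique.append(d)
    return unique

rows = [
    {"clone_data_timestamp": "2023-03-18", "count_clones": 5},
    {"clone_data_timestamp": "2023-03-18", "count_clones": 5},  # same day twice
    {"clone_data_timestamp": "2023-03-19", "count_clones": 7},
]
deduped = remove_duplicate_dicts(rows, "clone_data_timestamp")
print(len(deduped))  # 2
```

Duplicates per timestamp can occur because the GitHub traffic endpoint re-reports the same trailing window on each collection run.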
3 changes: 2 additions & 1 deletion augur/tasks/init/celery_app.py
@@ -46,7 +46,8 @@ class CollectionState(Enum):
'augur.tasks.github.repo_info.tasks',
'augur.tasks.github.detect_move.tasks',
'augur.tasks.github.pull_requests.files_model.tasks',
'augur.tasks.github.pull_requests.commits_model.tasks']
sgoggins marked this conversation as resolved.
'augur.tasks.github.pull_requests.commits_model.tasks',
'augur.tasks.github.traffic.tasks']

git_tasks = ['augur.tasks.git.facade_tasks',
'augur.tasks.git.dependency_tasks.tasks',
30 changes: 29 additions & 1 deletion augur/tasks/start_tasks.py
@@ -22,7 +22,7 @@
from augur.tasks.github.pull_requests.files_model.tasks import process_pull_request_files
from augur.tasks.github.pull_requests.commits_model.tasks import process_pull_request_commits
from augur.tasks.git.dependency_tasks.tasks import process_ossf_scorecard_metrics

from augur.tasks.github.traffic.tasks import collect_github_repo_clones_data
from augur.tasks.git.facade_tasks import *
from augur.tasks.db.refresh_materialized_views import *
# from augur.tasks.data_analysis import *
@@ -74,6 +74,34 @@ def primary_repo_collect_phase(repo_git):
#A chain is needed for each repo.
repo_info_task = collect_repo_info.si(repo_git)#collection_task_wrapper(self)

### Section from traffic metric merge that may need to be changed
sgoggins marked this conversation as resolved.

    with DatabaseSession(logger) as session:
        query = session.query(Repo)
        repos = execute_session_query(query, 'all')
        # Just use list comprehension for simple group
        repo_info_tasks = [collect_repo_info.si(repo.repo_git) for repo in repos]

        for repo in repos:
            first_tasks_repo = group(collect_issues.si(repo.repo_git), collect_pull_requests.si(repo.repo_git), collect_github_repo_clones_data.si(repo.repo_git))
sgoggins marked this conversation as resolved.
            second_tasks_repo = group(collect_events.si(repo.repo_git),
                collect_github_messages.si(repo.repo_git), process_pull_request_files.si(repo.repo_git), process_pull_request_commits.si(repo.repo_git))

            repo_chain = chain(first_tasks_repo, second_tasks_repo)
            issue_dependent_tasks.append(repo_chain)

        repo_task_group = group(
            *repo_info_tasks,
            chain(group(*issue_dependent_tasks), process_contributors.si()),
            generate_facade_chain(logger),
            collect_releases.si()
        )

        chain(repo_task_group, refresh_materialized_views.si()).apply_async()

    #### End of section from traffic metric merge that may need to be changed


primary_repo_jobs = group(
collect_issues.si(repo_git),
collect_pull_requests.si(repo_git)
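The canvas wiring in `primary_repo_collect_phase` puts the clone-collection task in the first group of each repo's chain, ahead of the event and message tasks. A toy, Celery-free sketch of that `group`/`chain` ordering (the task names are stand-ins, and real Celery runs group members concurrently across workers rather than sequentially):

```python
# Toy model of Celery's group/chain composition used above (illustrative only):
# a group runs all of its tasks, a chain runs its stages one after another.
def group(*tasks):
    return lambda: [t() for t in tasks]          # run every task in the group

def chain(*stages):
    def run():
        results = []
        for stage in stages:                     # each stage waits for the previous
            results.append(stage())
        return results
    return run

log = []
first = group(lambda: log.append("issues"),
              lambda: log.append("pull_requests"),
              lambda: log.append("clones"))
second = group(lambda: log.append("events"),
               lambda: log.append("messages"))
chain(first, second)()
print(log)  # clone collection happens before any second-stage task runs
```

This mirrors why `collect_github_repo_clones_data` is grouped with `collect_issues` and `collect_pull_requests`: none of the second-stage tasks depend on it, so it can run in the first concurrent batch.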
7 changes: 7 additions & 0 deletions frontend/frontend.config.json
@@ -0,0 +1,7 @@
{
Review comment (Member, PR author): I think this is fine because, to run our frontend, we do still need this file.
"Frontend": {
"host": "ebay.chaoss.io",
"port": 5000,
"ssl": false
}
}